# Pandas -- Data Science

<tr>
<td> <img src = "https://miro.medium.com/max/1400/0*DdYAfo_NsnAeHrur" width="80%" height="80%"/ </td>
<td> <img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2880px-Pandas_logo.svg.png" width="50%" height="50%"/> </td>
</tr>

**Pandas** is a popular software library for the Python in data manipulation and analysis. It is one of the most important and useful tools in the arsenal of a Data Scientist and a Data Analyst.

**Four steps for business intelligence:**
1. data collection
2. data analysis
3. data visualization
4. decision-making 

# Introduction

Install Pandas

In [4]:
#import libraries
import pandas as pd

In [22]:
#create an example dataset
mydata = { #dataset name
  'cars': ["BMW", "Volvo", "Ford"], #'column name':[data]
  'passings': [3, 7, 2]
} #end dictionary

mydata

{'cars': ['BMW', 'Volvo', 'Ford'], 'passings': [3, 7, 2]}

In [23]:
#convert the data into pandas dataframe
mydf = pd.DataFrame(mydata)
mydf

Unnamed: 0,cars,passings
0,BMW,3
1,Volvo,7
2,Ford,2


In [24]:
#display the first row
print(mydf[:1])

  cars  passings
0  BMW         3


In [25]:
#refer to the row index 0:
print(mydf.loc[0])

cars        BMW
passings      3
Name: 0, dtype: object


In [26]:
#use a list of indexes:
print(mydf.loc[[0, 1]])

    cars  passings
0    BMW         3
1  Volvo         7


In [27]:
#instead of using the default index [0....n], name your own index
mydf2 = pd.DataFrame(mydata, index = ["Car1", "Car2", "Car3"])

print(mydf2) 

       cars  passings
Car1    BMW         3
Car2  Volvo         7
Car3   Ford         2


In [28]:
#refer to the named index:
print(mydf2.loc["Car2"])

cars        Volvo
passings        7
Name: Car2, dtype: object


In [29]:
# rename columns
mydf2 = mydf2.rename(columns={'cars': 'Car_brands'})
mydf2.head()

Unnamed: 0,Car_brands,passings
Car1,BMW,3
Car2,Volvo,7
Car3,Ford,2


### JSON data format -- Twitter data

Example:https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview

{
  "created_at": "Thu Apr 06 15:24:15 +0000 2017",
  "id_str": "850006245121695744",
  "text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
  "user": {
    "id": 2244994945,
    "name": "Twitter Dev",
    "screen_name": "TwitterDev",
    "location": "Internet",
    "url": "https:\/\/dev.twitter.com\/",
    "description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\/\/twittercommunity.com\/ \u2328\ufe0f #TapIntoTwitter"
  },
  "place": {   
  },
  "entities": {
    "hashtags": [      
    ],
    "urls": [
      {
        "url": "https:\/\/t.co\/XweGngmxlP",
        "unwound": {
          "url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
          "title": "Building the Future of the Twitter API Platform"
        }
      }
    ],
    "user_mentions": [     
    ]
  }
}

In [30]:
# JSON to Pandas dataframe
json = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

dfjson = pd.DataFrame(json)

dfjson 

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409
1,60,117,145,479
2,60,103,135,340
3,45,109,175,282
4,45,117,148,406
5,60,102,127,300


In [31]:
#query sepecific columns
print(dfjson[["Duration","Maxpulse"]])

   Duration  Maxpulse
0        60       130
1        60       145
2        60       135
3        45       175
4        45       148
5        60       127


### Parsing JSON Dataset

**“Normalize” semi-structured JSON data into a flat table**

In [32]:
from pandas.io.json import json_normalize

In [33]:
jsondata2 = [{'state': 'Florida',
      'shortname': 'FL',
       'info': {
            'governor': 'Rick Scott'
         },
#parent node         
              'counties': [{'name': 'Dade', 'population': 12345},
                    {'name': 'Broward', 'population': 40000},
                     {'name': 'Palm Beach', 'population': 60000}]}, 
        {'state': 'Ohio',
        'shortname': 'OH',
          'info': {
              'governor': 'John Kasich'
        },
      'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

In [34]:
norm = json_normalize(jsondata2, 'counties', ['state', 'shortname',
                                           ['info', 'governor']])
norm

  norm = json_normalize(jsondata2, 'counties', ['state', 'shortname',


Unnamed: 0,name,population,state,shortname,info.governor
0,Dade,12345,Florida,FL,Rick Scott
1,Broward,40000,Florida,FL,Rick Scott
2,Palm Beach,60000,Florida,FL,Rick Scott
3,Summit,1234,Ohio,OH,John Kasich
4,Cuyahoga,1337,Ohio,OH,John Kasich


# Data cleaning

In [35]:
#load nba dataset
nba = pd.read_csv("https://cdncontribute.geeksforgeeks.org/wp-content/uploads/nba.csv")

nba

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [36]:
import numpy as np
# add one coulmn with random numbers
nba['Add'] = np.random.randint(0,5, size=len(nba))
nba

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Add
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,4
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,4
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,3
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,2
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,4
...,...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0,0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0,1
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0,0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0,3


In [37]:
#save the data to csv file
nba.to_csv("nba1.csv")

In [38]:
#remove the index
nba.to_csv("nba2.csv",index=False)

In [39]:
#save the data to excel file
from openpyxl.workbook import Workbook
nba.to_excel("nba.xlsx",
             sheet_name='nba_example', index=False)

In [40]:
#read excel file
readxlsx = pd.read_excel('nba.xlsx',sheet_name='nba_example')
readxlsx

#if you have the following error message, you need to install openpyxl
#ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl.
#!pip install openpyxl
#https://anaconda.org/anaconda/openpyxl

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Add
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,4
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,4
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,3
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,2
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,4
...,...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0,0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0,1
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0,0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0,3


In [41]:
# in real world, when you facing a very large dataset, you don't have to show them all
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Add
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,4
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,4
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,3
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,2
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,4


In [42]:
# display the entire dataset
pd.set_option('display.max_rows',458)

In [43]:
#nba

In [23]:
# describe your dataset
nba.describe()

Unnamed: 0,Number,Age,Weight,Salary,Add
count,457.0,457.0,457.0,446.0,458.0
mean,17.678337,26.938731,221.522976,4842684.0,1.936681
std,15.96609,4.404016,26.368343,5229238.0,1.388578
min,0.0,19.0,161.0,30888.0,0.0
25%,5.0,24.0,200.0,1044792.0,1.0
50%,13.0,26.0,220.0,2839073.0,2.0
75%,25.0,30.0,240.0,6500000.0,3.0
max,99.0,40.0,307.0,25000000.0,4.0


In [24]:
nba.info()
#if you see the following error:
#TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type
#you need to update your Pandas and Numpy, you need to have pandas>=1.0.5
#anaconda prompt: conda update pandas/numpy
#check the version of your pandas: conda list pandas/numpy

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       457 non-null    float64
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   373 non-null    object 
 8   Salary    446 non-null    float64
 9   Add       458 non-null    int64  
dtypes: float64(4), int64(1), object(5)
memory usage: 35.9+ KB


## Empty cells / Null values

In [34]:
#remove rows that contains null values (NaN)
new_nba = nba.dropna()
#new_nba

**Note**: By default, the dropna() method returns a new DataFrame, and will not change the original.

In [26]:
new_nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 364 entries, 0 to 456
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      364 non-null    object 
 1   Team      364 non-null    object 
 2   Number    364 non-null    float64
 3   Position  364 non-null    object 
 4   Age       364 non-null    float64
 5   Height    364 non-null    object 
 6   Weight    364 non-null    float64
 7   College   364 non-null    object 
 8   Salary    364 non-null    float64
 9   Add       364 non-null    int64  
dtypes: float64(4), int64(1), object(5)
memory usage: 31.3+ KB


In [27]:
# check how many rows and columns for a dataframe
nba.shape

(458, 10)

If you want to change the original DataFrame, use the inplace = True argument:

In [None]:
# change in the originial dataframe
#nba.dropna(inplace = True)#drop all the rows contain null values
#nba

In [None]:
#Replace all empty/null Values with a specific value
#nba.fillna(130000, inplace = True) #hint: get back the original dataframe if you run the previous module
#nba

In [None]:
#use replace() to replace the null values with 0 for a specific column
#import numpy as np
nba["Salary"].replace(np.nan, 0)

In [12]:
# check unique values for a column 
nba.Team.unique()

NameError: name 'nba' is not defined

In [30]:
# Action 1: Count how many teams use len()
print(len(nba.Team.unique()))

31


In [35]:
# Action 2: how many colleges these football palyers from?
# hints: check the unique values for "College", then count the numbers use len()
print(len(nba.College.unique()))

119


In [38]:
# replace strings with numbers e.g., replace the team "Boston Celtics" to 1 and "Brooklyn Nets" to 2
# change categorical data to numeric
teams = {'Boston Celtics': 1, 'Brooklyn Nets': 2}

nba_replace = nba.replace({'Team': teams})
nba_replace

In [1]:
# Action 3: replace all the team names with numbers 
# Change categorical data to numeric
new_dict = {}
count = 0
for team in nba.Team.unique():
    new_dict[team] = count
    count += 1
nba_replace = nba.replace({'Team': new_dict})
#nba_replace

NameError: name 'nba' is not defined

In [1]:
# select a specific value from a certain column
# find the Team 2 : "Brooklyn Nets" 
# we didn't use inplace = True in the previous module, if you use the "nba" dataset, there is no team names 2
nba_replace.loc[nba_replace['Team'] == 2]

NameError: name 'nba_replace' is not defined

In [None]:
#Calculate the mean/median value
mean_Sal=nba["Salary"].mean() #.median()
mean_Sal

In [None]:
# rounded off
round((nba["Salary"].mean()),0)

In [None]:
# Round to specific decimal 
round((nba["Salary"].mean()),2)

In [None]:
#Calculate the total salary for all the people in the nba dataset
sum_Sal = nba["Salary"].sum()
sum_Sal

## Texts Cleaning

In [13]:
# Action 4: open the sampletweets2022.csv and name it as dftweets
dftweets = pd.read_csv("sampletweets2022.csv")
df = pd.DataFrame(dftweets)
df

Unnamed: 0,Netsmart,Attention Kansas State University! We'll be at the @KState All-University Career Fair next week on Sept. 21 & 23. Stop by and learn more why #NTST is the place to start your career! okt.to/ZpLR2I #EMAW
0,KSU Foundation,I love all regent institutions ... I love this...
1,DavidRosowsky,What got us to now won?t get us to next. -Greg...
2,K-State,RT I applied to #KState because... It felt lik...
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...
5,r_bsal,.@r_bsal Do you go to #KState? If not how do y...
6,ksu_FAN,@irisisnotabot Yeah I got to KState. Really yo...
7,FormerPlantProf,@setonbachle @kstatebio @KStateArtSci @KState ...
8,K-State,Set yourself up for success this semester and ...
9,Mosquito,Congratulations! Hope you enjoy K-State and Ma...


In [14]:
# Action 5: display the first 5 rows
df.head(5)

Unnamed: 0,Netsmart,Attention Kansas State University! We'll be at the @KState All-University Career Fair next week on Sept. 21 & 23. Stop by and learn more why #NTST is the place to start your career! okt.to/ZpLR2I #EMAW
0,KSU Foundation,I love all regent institutions ... I love this...
1,DavidRosowsky,What got us to now won?t get us to next. -Greg...
2,K-State,RT I applied to #KState because... It felt lik...
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...


In [15]:
# Action 6: display the dataframe basic information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 2 columns):
 #   Column                                                                                                                                                                                                      Non-Null Count  Dtype 
---  ------                                                                                                                                                                                                      --------------  ----- 
 0   Netsmart                                                                                                                                                                                                    15 non-null     object
 1   Attention Kansas State University! We'll be at the @KState All-University Career Fair next week on Sept. 21 & 23. Stop by and learn more why #NTST is the place to start your career! okt.to/ZpLR2I #

In [16]:
# display all the columns names
print(dftweets.columns)

Index(['Netsmart', 'Attention Kansas State University! We'll be at the @KState All-University Career Fair next week on Sept. 21 & 23. Stop by and learn more why #NTST is the place to start your career! okt.to/ZpLR2I #EMAW '], dtype='object')


In [17]:
# Action 7: display all the information (hints: set_options())
pd.set_option("display.max_rows", 15)

In [18]:
# Action 8: show the generate descriptive statistics
df.describe(include='all')
df

Unnamed: 0,Netsmart,Attention Kansas State University! We'll be at the @KState All-University Career Fair next week on Sept. 21 & 23. Stop by and learn more why #NTST is the place to start your career! okt.to/ZpLR2I #EMAW
0,KSU Foundation,I love all regent institutions ... I love this...
1,DavidRosowsky,What got us to now won?t get us to next. -Greg...
2,K-State,RT I applied to #KState because... It felt lik...
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...
5,r_bsal,.@r_bsal Do you go to #KState? If not how do y...
6,ksu_FAN,@irisisnotabot Yeah I got to KState. Really yo...
7,FormerPlantProf,@setonbachle @kstatebio @KStateArtSci @KState ...
8,K-State,Set yourself up for success this semester and ...
9,Mosquito,Congratulations! Hope you enjoy K-State and Ma...


In [21]:
# Action 9: rename columns (hints: 0 (first column) to user; 1 to tweet) (columns={0: 'user', 1: 'tweet'})
columns={0: 'user', 1: 'tweet'}
df.rename(columns={"Attention Kansas State University! We'll be at the @KState All-University Career Fair next week on Sept. 21 & 23. Stop by and learn more why #NTST is the place to start your career! okt.to/ZpLR2I #EMAW ": 'tweet'}, inplace = True)
df

Unnamed: 0,Netsmart,tweet
0,KSU Foundation,I love all regent institutions ... I love this...
1,DavidRosowsky,What got us to now won?t get us to next. -Greg...
2,K-State,RT I applied to #KState because... It felt lik...
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...
5,r_bsal,.@r_bsal Do you go to #KState? If not how do y...
6,ksu_FAN,@irisisnotabot Yeah I got to KState. Really yo...
7,FormerPlantProf,@setonbachle @kstatebio @KStateArtSci @KState ...
8,K-State,Set yourself up for success this semester and ...
9,Mosquito,Congratulations! Hope you enjoy K-State and Ma...


In [22]:
# lower case usernames
dftweets['user'].str.lower()
dftweets['user']

KeyError: 'user'

In [23]:
# above fuction dosen't work
# Lower case tweets with function and Apply
def textLower(x):
    return x.lower()

In [24]:
dftweets['user'].apply(textLower) 

KeyError: 'user'

In [25]:
# upper case usernames
dftweets['user'].str.upper()
dftweets['user']

KeyError: 'user'

In [26]:
# Action 10: write a textUpper fuction and apply it to the usernames
def textUpper(string):
    return string.str.upper()
print(textUpper(dftweets['user']).str)


KeyError: 'user'

In [27]:
# save and change it in the original dataset
dftweets['user'] = dftweets['user'].apply(textLower) 

KeyError: 'user'

In [28]:
dftweets.head()

Unnamed: 0,Netsmart,tweet
0,KSU Foundation,I love all regent institutions ... I love this...
1,DavidRosowsky,What got us to now won?t get us to next. -Greg...
2,K-State,RT I applied to #KState because... It felt lik...
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...


In [29]:
# Get the number of characters in each tweet
dftweets['tweet'].str.len()

0     153
1      72
2     230
3     137
4     143
5      97
6     147
7      75
8     213
9      92
10    259
11    300
12    201
13    136
14     49
Name: tweet, dtype: int64

In [30]:
# Action 11: add a new column and name is "count_text" to store the number of character in each tweet
df['count_text'] = df.tweet.str.split().str.len()
df

Unnamed: 0,Netsmart,tweet,count_text
0,KSU Foundation,I love all regent institutions ... I love this...,26
1,DavidRosowsky,What got us to now won?t get us to next. -Greg...,15
2,K-State,RT I applied to #KState because... It felt lik...,42
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...,19
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...,27
5,r_bsal,.@r_bsal Do you go to #KState? If not how do y...,18
6,ksu_FAN,@irisisnotabot Yeah I got to KState. Really yo...,27
7,FormerPlantProf,@setonbachle @kstatebio @KStateArtSci @KState ...,9
8,K-State,Set yourself up for success this semester and ...,30
9,Mosquito,Congratulations! Hope you enjoy K-State and Ma...,16


In [31]:
# find the words starts with... hashtags
dftweets.tweet.str.findall("(?<=#)\w+")

0                                    []
1                                    []
2                  [KState, aggieville]
3                        [KStateFamily]
4                              [KState]
5                              [KState]
6                                    []
7                                    []
8                                    []
9                                    []
10                              [ATGTG]
11    [kstatefb, HelpKStateFightCancer]
12                           [KStateFB]
13                  [ThrowbackThursday]
14                 [Kstate, aggieville]
Name: tweet, dtype: object

In [32]:
# Action 12: Create a column named "hashtags"
df['hashtags'] = dftweets.tweet.str.findall("(?<=#)\w+")
df

Unnamed: 0,Netsmart,tweet,count_text,hashtags
0,KSU Foundation,I love all regent institutions ... I love this...,26,[]
1,DavidRosowsky,What got us to now won?t get us to next. -Greg...,15,[]
2,K-State,RT I applied to #KState because... It felt lik...,42,"[KState, aggieville]"
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...,19,[KStateFamily]
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...,27,[KState]
5,r_bsal,.@r_bsal Do you go to #KState? If not how do y...,18,[KState]
6,ksu_FAN,@irisisnotabot Yeah I got to KState. Really yo...,27,[]
7,FormerPlantProf,@setonbachle @kstatebio @KStateArtSci @KState ...,9,[]
8,K-State,Set yourself up for success this semester and ...,30,[]
9,Mosquito,Congratulations! Hope you enjoy K-State and Ma...,16,[]


In [33]:
#find all the words starting with @ (mentions in Twitter)
dftweets.tweet.str.findall("(?<=#)\w+")

0                                    []
1                                    []
2                  [KState, aggieville]
3                        [KStateFamily]
4                              [KState]
5                              [KState]
6                                    []
7                                    []
8                                    []
9                                    []
10                              [ATGTG]
11    [kstatefb, HelpKStateFightCancer]
12                           [KStateFB]
13                  [ThrowbackThursday]
14                 [Kstate, aggieville]
Name: tweet, dtype: object

In [34]:
# Action 13: Create a column named "mentions"
df['mentions'] = dftweets.tweet.str.findall("(?<=#)\w+")
df

Unnamed: 0,Netsmart,tweet,count_text,hashtags,mentions
0,KSU Foundation,I love all regent institutions ... I love this...,26,[],[]
1,DavidRosowsky,What got us to now won?t get us to next. -Greg...,15,[],[]
2,K-State,RT I applied to #KState because... It felt lik...,42,"[KState, aggieville]","[KState, aggieville]"
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...,19,[KStateFamily],[KStateFamily]
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...,27,[KState],[KState]
5,r_bsal,.@r_bsal Do you go to #KState? If not how do y...,18,[KState],[KState]
6,ksu_FAN,@irisisnotabot Yeah I got to KState. Really yo...,27,[],[]
7,FormerPlantProf,@setonbachle @kstatebio @KStateArtSci @KState ...,9,[],[]
8,K-State,Set yourself up for success this semester and ...,30,[],[]
9,Mosquito,Congratulations! Hope you enjoy K-State and Ma...,16,[],[]


In [35]:
# Action 14: what is the data type of the values in columns "hashtags" and "mentions"?
type(df['hashtags'])
type(df['mentions'])

pandas.core.series.Series

In [36]:
# remove the brackets and keep the string values
dftweets['hash_str'] = dftweets['hashtags'].apply(', '.join)

In [37]:
# Action 15: remove the brackets and keep the string values for column "mentions"
df['mentions'] = dftweets['hash_str'] = dftweets['hashtags'].apply(', '.join)
df

Unnamed: 0,Netsmart,tweet,count_text,hashtags,mentions,hash_str
0,KSU Foundation,I love all regent institutions ... I love this...,26,[],,
1,DavidRosowsky,What got us to now won?t get us to next. -Greg...,15,[],,
2,K-State,RT I applied to #KState because... It felt lik...,42,"[KState, aggieville]","KState, aggieville","KState, aggieville"
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...,19,[KStateFamily],KStateFamily,KStateFamily
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...,27,[KState],KState,KState
5,r_bsal,.@r_bsal Do you go to #KState? If not how do y...,18,[KState],KState,KState
6,ksu_FAN,@irisisnotabot Yeah I got to KState. Really yo...,27,[],,
7,FormerPlantProf,@setonbachle @kstatebio @KStateArtSci @KState ...,9,[],,
8,K-State,Set yourself up for success this semester and ...,30,[],,
9,Mosquito,Congratulations! Hope you enjoy K-State and Ma...,16,[],,


In [38]:
# tokenization
dftweets['tweet'].str.split()

0     [I, love, all, regent, institutions, ..., I, l...
1     [What, got, us, to, now, won?t, get, us, to, n...
2     [RT, I, applied, to, #KState, because..., It, ...
3     [Welcome, KSU, Foundation, Trustees!, We're, s...
4     [.@irisisnotabot, Ohh, lol, okay, then, I, rec...
5     [.@r_bsal, Do, you, go, to, #KState?, If, not,...
6     [@irisisnotabot, Yeah, I, got, to, KState., Re...
7     [@setonbachle, @kstatebio, @KStateArtSci, @KSt...
8     [Set, yourself, up, for, success, this, semest...
9     [Congratulations!, Hope, you, enjoy, K-State, ...
10    [#ATGTG, After, a, great, conversation, with, ...
11    [Former, K-State, Wildcat, &, NFL, Wide, Recei...
12    [Noticed, something, new, building, around, ca...
13    [#ThrowbackThursday, Fort, Riley, Soldiers, pe...
14    [What?s, happening, at, #Kstate, today, ???, #...
Name: tweet, dtype: object

In [39]:
# Determine the number of words in each tweet and create a column "wordcount"
dftweets['wordcount'] = dftweets['tweet'].str.split().str.len()
dftweets.head()

Unnamed: 0,Netsmart,tweet,count_text,hashtags,mentions,hash_str,wordcount
0,KSU Foundation,I love all regent institutions ... I love this...,26,[],,,26
1,DavidRosowsky,What got us to now won?t get us to next. -Greg...,15,[],,,15
2,K-State,RT I applied to #KState because... It felt lik...,42,"[KState, aggieville]","KState, aggieville","KState, aggieville",42
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...,19,[KStateFamily],KStateFamily,KStateFamily,19
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...,27,[KState],KState,KState,27


In [40]:
# How many hashtags per each tweet?
dftweets.tweet.str.count('#')

0     0
1     0
2     2
3     1
4     1
5     1
6     0
7     0
8     0
9     0
10    1
11    2
12    1
13    1
14    2
Name: tweet, dtype: int64

In [41]:
# Action 16: Find the number of mentions in each tweet 
dftweets['tweet'].str.split()
dftweets['mentionesTweets'] = dftweets['tweet'].str.split().str.len()
dftweets.head()

Unnamed: 0,Netsmart,tweet,count_text,hashtags,mentions,hash_str,wordcount,mentionesTweets
0,KSU Foundation,I love all regent institutions ... I love this...,26,[],,,26,26
1,DavidRosowsky,What got us to now won?t get us to next. -Greg...,15,[],,,15,15
2,K-State,RT I applied to #KState because... It felt lik...,42,"[KState, aggieville]","KState, aggieville","KState, aggieville",42,42
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...,19,[KStateFamily],KStateFamily,KStateFamily,19,19
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...,27,[KState],KState,KState,27,27


In [42]:
# check a string (#KState) in the data using boolean indexing
dftweets.tweet.str.contains("#KState")

0     False
1     False
2      True
3      True
4      True
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13    False
14    False
Name: tweet, dtype: bool

In [43]:
# check a string contains either A or B and view the results
ka = dftweets.tweet.str.contains("#KState|#aggieville")
dftweets[ka]

Unnamed: 0,Netsmart,tweet,count_text,hashtags,mentions,hash_str,wordcount,mentionesTweets
2,K-State,RT I applied to #KState because... It felt lik...,42,"[KState, aggieville]","KState, aggieville","KState, aggieville",42,42
3,KSUFoundation,Welcome KSU Foundation Trustees! We're so glad...,19,[KStateFamily],KStateFamily,KStateFamily,19,19
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...,27,[KState],KState,KState,27,27
5,r_bsal,.@r_bsal Do you go to #KState? If not how do y...,18,[KState],KState,KState,18,18
12,KSU_Foundation,Noticed something new building around campus? ...,27,[KStateFB],KStateFB,KStateFB,27,27
14,Bae_ADORE,What?s happening at #Kstate today ??? #aggieville,7,"[Kstate, aggieville]","Kstate, aggieville","Kstate, aggieville",7,7


In [44]:
# contains both A and B
dftweets[(dftweets['tweet'].str.contains('#KState')) & (dftweets['tweet'].str.contains('#aggieville'))]

Unnamed: 0,Netsmart,tweet,count_text,hashtags,mentions,hash_str,wordcount,mentionesTweets
2,K-State,RT I applied to #KState because... It felt lik...,42,"[KState, aggieville]","KState, aggieville","KState, aggieville",42,42


In [45]:
# View retweets only (tweets contains RT)
retweets = dftweets.tweet.str.contains("RT")
dftweets[retweets]

Unnamed: 0,Netsmart,tweet,count_text,hashtags,mentions,hash_str,wordcount,mentionesTweets
2,K-State,RT I applied to #KState because... It felt lik...,42,"[KState, aggieville]","KState, aggieville","KState, aggieville",42,42


In [46]:
# View retweets only (tweets start with RT)
dftweets.tweet.str.startswith("RT")

0     False
1     False
2      True
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
Name: tweet, dtype: bool

In [49]:
# view tweets do not contain RT (original tweets)
new_dftweets = dftweets[~dftweets["tweet"].str.contains("RT")]

In [118]:
# Action 17: remove all the symbols from the tweets hints:re.sub(r"[^a-zA-Z0-9]+",' ',text)
#df['tweet'] = df['tweet'].str.re.sub(r"[^a-zA-Z0-9]+",' ')
df['tweet'] = df['tweet'].map(lambda x: x.lstrip('+-').rstrip('^a-zA-Z0-9'))
df

Unnamed: 0,user,tweet,count_text,hashtags,mentions,hash_str,wordcount,mentionesTweets
0,ksu foundation,I love all regent institutions ... I love this...,26,[],0,,0,26
1,davidrosowsky,What got us to now won?t get us to next. -Greg...,15,[],0,,0,15
2,k-state,RT I applied to #KState because... It felt lik...,42,"[KState, aggieville]",2,"KState, aggieville",2,42
3,ksufoundation,Welcome KSU Foundation Trustees! We're so glad...,19,[KStateFamily],1,KStateFamily,1,19
4,r_bsal,.@irisisnotabot Ohh lol okay then I recommend ...,27,[KState],1,KState,1,27
5,r_bsal,.@r_bsal Do you go to #KState? If not how do y...,18,[KState],1,KState,1,18
6,ksu_fan,@irisisnotabot Yeah I got to KState. Really yo...,27,[],0,,0,27
7,formerplantprof,@setonbachle @kstatebio @KStateArtSci @KState ...,9,[],0,,0,9
8,k-state,Set yourself up for success this semester and ...,30,[],0,,0,30
9,mosquito,Congratulations! Hope you enjoy K-State and Ma...,16,[],0,,0,16


In [None]:
# same results without library re
dftweets['tweet'].str.replace(r'[^a-zA-Z0-9]+', ' ')

In [119]:
# Action 18: remove http links hints:replace('http\S+|www.\S+', '')
dftweets['tweet'].str.replace('http\S+|www.\S+', '')

  dftweets['tweet'].str.replace('http\S+|www.\S+', '')


0     I love all regent institutions ... I love this...
1     What got us to now won?t get us to next. -Greg...
2     RT I applied to #KState because... It felt lik...
3     Welcome KSU Foundation Trustees! We're so glad...
4     .@irisisnotabot Ohh lol okay then I recommend ...
5     .@r_bsal Do you go to #KState? If not how do y...
6     @irisisnotabot Yeah I got to KState. Really yo...
7     @setonbachle @kstatebio @KStateArtSci @KState ...
8     Set yourself up for success this semester and ...
9     Congratulations! Hope you enjoy K-State and Ma...
10    #ATGTG After a great conversation with @CoachA...
11    Former K-State Wildcat & NFL Wide Receiver Jor...
12    Noticed something new building around campus? ...
13    #ThrowbackThursday Fort Riley Soldiers perform...
14    What?s happening at #Kstate today ??? #aggieville
Name: tweet, dtype: object

In [120]:
# Action 19: remove numbers hints:replace('\d+', '')
dftweets['tweet'].str.replace('\d+', '')

  dftweets['tweet'].str.replace('\d+', '')


0     I love all regent institutions ... I love this...
1     What got us to now won?t get us to next. -Greg...
2     RT I applied to #KState because... It felt lik...
3     Welcome KSU Foundation Trustees! We're so glad...
4     .@irisisnotabot Ohh lol okay then I recommend ...
5     .@r_bsal Do you go to #KState? If not how do y...
6     @irisisnotabot Yeah I got to KState. Really yo...
7     @setonbachle @kstatebio @KStateArtSci @KState ...
8     Set yourself up for success this semester and ...
9     Congratulations! Hope you enjoy K-State and Ma...
10    #ATGTG After a great conversation with @CoachA...
11    Former K-State Wildcat & NFL Wide Receiver Jor...
12    Noticed something new building around campus? ...
13    #ThrowbackThursday Fort Riley Soldiers perform...
14    What?s happening at #Kstate today ??? #aggieville
Name: tweet, dtype: object

In [130]:
# Action 20: Remove both mentions and hashtags from tweets
dftweets['tweet'].str.replace('@#', '')

0     I love all regent institutions ... I love this...
1     What got us to now won?t get us to next. -Greg...
2     RT I applied to #KState because... It felt lik...
3     Welcome KSU Foundation Trustees! We're so glad...
4     .@irisisnotabot Ohh lol okay then I recommend ...
5     .@r_bsal Do you go to #KState? If not how do y...
6     @irisisnotabot Yeah I got to KState. Really yo...
7     @setonbachle @kstatebio @KStateArtSci @KState ...
8     Set yourself up for success this semester and ...
9     Congratulations! Hope you enjoy K-State and Ma...
10    #ATGTG After a great conversation with @CoachA...
11    Former K-State Wildcat & NFL Wide Receiver Jor...
12    Noticed something new building around campus? ...
13    #ThrowbackThursday Fort Riley Soldiers perform...
14    What?s happening at #Kstate today ??? #aggieville
Name: tweet, dtype: object

In [None]:
# Action 21: safe the cleaned tweets to a new column and name it "cleaned_tweet"