In [1]:
import pandas as pd
import numpy as np

In this document, I use `del`, `df.drop("...", inplace = True, axis = 1)`, `iloc`, `loc`, `.copy()`, and `df.T` (the transpose of `df`).

**Example**

In [33]:
record1 = pd.Series({'name':'Alice',
                     'Class': 'Physics',
                     'Score':95})
record2 = pd.Series({'name':'Jack',
                     'Class': 'Chem',
                     'Score':93})
record3 = pd.Series({'name':'Helen',
                     'Class': 'Bio',
                     'Score':90})

In [34]:
df = pd.DataFrame([record1, record2, record3],
                  index = ['school1','school2','school1']) # the dataframe is 2-dimensional

In [4]:
df.head()


Unnamed: 0,name,Class,Score
school1,Alice,Physics,95
school2,Jack,Chem,93
school1,Helen,Bio,90


In [5]:
df.loc['school2']

name     Jack
Class    Chem
Score      93
Name: school2, dtype: object

In [6]:
# column name is included in the output
df.iloc[1]

name     Jack
Class    Chem
Score      93
Name: school2, dtype: object

If we are only interested in the names of the students at `school1`:

In [7]:
df.loc['school1','name'] # index operator

school1    Alice
school1    Helen
Name: name, dtype: object

In [8]:
df.loc['school1']['name']

school1    Alice
school1    Helen
Name: name, dtype: object

**What if we just want to select a single column, though?**
- First, we can transpose the DataFrame with the `T` attribute, which turns all of the rows into columns and all of the columns into rows, and then use `.loc`. A simpler way is to index with the column name directly, as in `df['name']`.

In [9]:
df.T.loc['name']

school1    Alice
school2     Jack
school1    Helen
Name: name, dtype: object

In [10]:
df['name']

school1    Alice
school2     Jack
school1    Helen
Name: name, dtype: object

In [14]:
# Case when we are going to ask for all the names and scores for all schools, using the .loc operator
df.loc[:,['name','Score']]

Unnamed: 0,name,Score
school1,Alice,95
school2,Jack,93
school1,Helen,90


**By default, `drop()` does not change the original DataFrame; it returns a modified copy. Pass `inplace=True` to change the DataFrame itself.**

In [18]:
df.drop('school1'), df

(         name Class  Score
 school2  Jack  Chem     93,           name    Class  Score
 school1  Alice  Physics     95
 school2   Jack     Chem     93
 school1  Helen      Bio     90)

In [22]:
copy_df = df.copy()
copy_df.drop('name', inplace=True, axis = 1) # axis = 1: column
copy_df

Unnamed: 0,Class,Score
school1,Physics,95
school2,Chem,93
school1,Bio,90


In [23]:
del copy_df['Class']


In [24]:
copy_df

Unnamed: 0,Score
school1,95
school2,93
school1,90


## DataFrame Indexing and Loading

By default, `pd.read_csv` adds a new integer index (0, 1, 2, ...) to the DataFrame. If the file already contains a column that should serve as the index, pass its position, for example `pd.read_csv(..., index_col = 0)` for the first column, so that no extra index is created.
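A minimal sketch of the difference, assuming a small made-up CSV held in memory with `io.StringIO`. The data and the `load_scores` helper are illustrative and are not part of the course datasets.

```python
import io
import pandas as pd

# Hypothetical CSV text; the first column holds the row labels.
CSV_TEXT = "Serial No.,name,score\n1,Alice,95\n2,Jack,93\n3,Helen,90\n"

def load_scores(index_col=None):
    """Read the in-memory CSV, optionally using one column as the index."""
    return pd.read_csv(io.StringIO(CSV_TEXT), index_col=index_col)

default_df = load_scores()             # pandas adds its own 0..n-1 index
indexed_df = load_scores(index_col=0)  # 'Serial No.' becomes the index

print(list(default_df.columns))  # 'Serial No.' is an ordinary column here
print(indexed_df.index.name)     # the index carries the column's name
```

With `index_col=0` the label column no longer appears in `indexed_df.columns`, which is why it does not show up later when listing the columns.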



In [26]:
df.columns

Index(['name', 'Class', 'Score'], dtype='object')

In [36]:
# way to change the column names
new_df = df.rename(columns = {'Class': 'Class Name'})

In [28]:
new_df

Unnamed: 0,name,Class Name,Score
school1,Alice,Physics,95
school2,Jack,Chem,93
school1,Helen,Bio,90


In [38]:
# Another Approach
cols = list(df.columns)
# then a little list comprehension
cols = [x.upper().strip() for x in cols]
# then we overwrite what is already in the .columns attribute
df.columns = cols
df.head()

Unnamed: 0,NAME,CLASS,SCORE
school1,Alice,Physics,95
school2,Jack,Chem,93
school1,Helen,Bio,90
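
The same clean-up can probably also be done with `rename()`, which accepts a function for its `columns` argument and returns a new DataFrame rather than overwriting `.columns`. The frame below is a made-up stand-in with untidy labels, not the `df` used above.

```python
import pandas as pd

# Illustrative frame whose labels have stray spaces and mixed case.
raw = pd.DataFrame({' name': ['Alice', 'Jack'],
                    'Class ': ['Physics', 'Chem'],
                    'Score': [95, 93]})

# rename() applies the function to every column label and returns a copy.
tidy = raw.rename(columns=lambda label: label.strip().upper())

print(list(raw.columns))   # the original labels are untouched
print(list(tidy.columns))  # stripped and upper-cased labels
```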


## Querying a DataFrame

In this section, we query a DataFrame using boolean masking, which is extremely important 
and often used in the world of data science. With boolean masking, we can select data based on the criteria 
we desire. We first combine `.where()` and `.dropna()`, and then use the mask directly as an index.

In [6]:
df = pd.read_csv("/Users/ybzhang/Downloads/Python UMich/University of Michigan - Intro to Data Science in Python/resources/week-2/datasets/Admission_Predict.csv",index_col = 0)

In [8]:
df.columns = [x.lower().strip() for x in df.columns]
# And we'll take a look at the results
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [9]:
# Boolean masks are created by applying operators directly to the pandas Series or DataFrame objects. 
# For instance, in our graduate admission dataset, we might be interested in seeing only those students 
# that have a chance higher than 0.7
admit_mask=df['chance of admit'] > 0.7
admit_mask

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

In [10]:
# So, what do you do with the boolean mask once you have formed it? Well, you can just lay it on top of the
# data to "hide" the data you don't want, which is represented by all of the False values. We do this by using
# the .where() function on the original DataFrame.
df.where(admit_mask).head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
5,,,,,,,,


 We see that the resulting data frame keeps the original indexed values, and only data which met 
 the condition was retained. All of the rows which did not meet the condition have NaN data instead,
 but these rows were not dropped from our dataset. 

In [25]:
# The next step is, if we don't want the NaN data, we use the dropna() function
df.where(admit_mask).dropna().head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9


In [26]:
df[df['chance of admit'] > 0.7].head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
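
The two outputs above contain the same rows, but the integer columns print as floats after `.where()`. A small sketch on made-up numbers (not the admissions data) suggests why: `.where()` has to store NaN in the masked rows, which forces integer columns to a float dtype, while indexing with the mask filters the rows directly.

```python
import pandas as pd

# Made-up stand-in for the admissions table.
toy = pd.DataFrame({'gre score': [337, 324, 314],
                    'chance of admit': [0.92, 0.76, 0.65]},
                   index=pd.Index([1, 2, 5], name='Serial No.'))
mask = toy['chance of admit'] > 0.7

via_where = toy.where(mask).dropna()  # NaN rows are inserted, then dropped
via_index = toy[mask]                 # rows are filtered directly

print(via_where.index.equals(via_index.index))  # the same rows survive
print(via_where['gre score'].dtype)  # float dtype: NaN forced an upcast
print(via_index['gre score'].dtype)  # integer dtype is preserved
```

Note also that `dropna()` removes any row containing a missing value, including rows that were incomplete before the mask was applied, so the two approaches can differ on data that already has gaps.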


In [29]:
# when we want to filter on multiple conditions, we combine several boolean masks
# Method1: comparison operators joined with & (the parentheses are required)
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

In [31]:
# Method2
df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

In [32]:
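# Method3 (caution: this does NOT give the same result as Method1 and Method2)
# .lt(0.9) is applied to the True/False output of .gt(0.7), not to the scores, so
# True (1) fails the test and False (0) passes: the mask is the inverse of gt(0.7)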
df['chance of admit'].gt(0.7).lt(0.9)

Serial No.
1      False
2      False
3      False
4      False
5       True
       ...  
396    False
397    False
398    False
399     True
400    False
Name: chance of admit, Length: 400, dtype: bool
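
The chained form returns a different mask from the first two methods. A minimal sketch on a made-up Series (the values are illustrative) shows what happens: the second comparison sees booleans, not the original numbers.

```python
import pandas as pd

# Made-up chances of admission.
chance = pd.Series([0.92, 0.76, 0.65, 0.80], name='chance of admit')

# Two comparisons on the original values, combined with &.
in_range = chance.gt(0.7) & chance.lt(0.9)

# Pitfall: .lt(0.9) is applied to the boolean output of .gt(0.7).
# True counts as 1 and False as 0, so only the False entries pass.
chained = chance.gt(0.7).lt(0.9)

print(in_range.tolist())  # [False, True, False, True]
print(chained.tolist())   # [False, False, True, False]
```

For a range check, the combined form with `&` is the one that tests both bounds against the data.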

## Indexing Dataframe

Another option for setting an index is the `set_index()` function. It takes a column 
label, or a list of column labels, and promotes those columns to the index.\
`set_index()` is a destructive process: it does not keep the current index. 
If you want to keep the current index, you need to manually create a new column and copy 
the values of the `index` attribute into it first.

In [37]:
df = pd.read_csv("/Users/ybzhang/Downloads/Python UMich/University of Michigan - Intro to Data Science in Python/resources/week-2/datasets/Admission_Predict.csv",index_col = 0)

In [46]:
df.head()
# in this case, the "Serial No." is the index of the dataframe
df.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA',
       'Research', 'Chance of Admit ', 'Serial Number'],
      dtype='object')

In [47]:
# So we copy the indexed data into its own column
df['Serial Number'] = df.index
# Then we set the index to another column
df = df.set_index('Chance of Admit ')
df.head()

Chance of Admit,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Serial Number
0.92,337,118,4,4.5,4.5,9.65,1,1
0.76,324,107,4,4.0,4.5,8.87,1,2
0.72,316,104,3,3.0,3.5,8.0,1,3
0.8,322,110,3,3.5,2.5,8.67,1,4
0.65,314,103,2,2.0,3.0,8.21,0,5


In [48]:
# You'll see that when we create a new index from an existing column the index has a name, 
# which is the original name of the column.

# We can get rid of the index completely by calling the function reset_index(). This promotes the 
# index into a column and creates a default numbered index.
df = df.reset_index()
df.head()

Unnamed: 0,Chance of Admit,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Serial Number
0,0.92,337,118,4,4.5,4.5,9.65,1,1
1,0.76,324,107,4,4.0,4.5,8.87,1,2
2,0.72,316,104,3,3.0,3.5,8.0,1,3
3,0.8,322,110,3,3.5,2.5,8.67,1,4
4,0.65,314,103,2,2.0,3.0,8.21,0,5


One nice feature of pandas is multi-level indexing. This is similar to composite keys in 
relational database systems. To create a multi-level index, we simply call `set_index()` and 
give it a list of the columns that we're interested in promoting to an index.

In [83]:
df = pd.read_csv('/Users/ybzhang/Downloads/Python UMich/University of Michigan - Intro to Data Science in Python/resources/week-2/datasets/census.csv')
df.shape

(3193, 100)

In [84]:
# Here we can run unique on the sum level of our current DataFrame 
df['SUMLEV'].unique() # the result [40, 50] implies that "SUMLEV" only has 2 different values '40' and '50'

array([40, 50])

In [85]:
# Let's exclude all of the rows that are summaries 
# at the state level and just keep the county data. 
df=df[df['SUMLEV'] == 50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [86]:
# we try to reduce the dataframe's dimension and extract only the columns we like
columns_to_keep = ['STNAME','CTYNAME','BIRTHS2010','BIRTHS2011','BIRTHS2012','BIRTHS2013',
                   'BIRTHS2014','BIRTHS2015','POPESTIMATE2010','POPESTIMATE2011',
                   'POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']
df = df[columns_to_keep]
df.head(), df.shape

(    STNAME         CTYNAME  BIRTHS2010  BIRTHS2011  BIRTHS2012  BIRTHS2013  \
 1  Alabama  Autauga County         151         636         615         574   
 2  Alabama  Baldwin County         517        2187        2092        2160   
 3  Alabama  Barbour County          70         335         300         283   
 4  Alabama     Bibb County          44         266         245         259   
 5  Alabama   Blount County         183         744         710         646   
 
    BIRTHS2014  BIRTHS2015  POPESTIMATE2010  POPESTIMATE2011  POPESTIMATE2012  \
 1         623         600            54660            55253            55175   
 2        2186        2240           183193           186659           190396   
 3         260         269            27341            27226            27159   
 4         247         253            22861            22733            22642   
 5         618         603            57373            57711            57776   
 
    POPESTIMATE2013  POPESTIMATE2014

In [87]:
df = df.set_index(['STNAME', 'CTYNAME'])
df.head()

STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673


In [88]:
# If we want to see the population results for Washtenaw County in the state of Michigan
df.loc['Michigan', 'Washtenaw County']

BIRTHS2010            977
BIRTHS2011           3826
BIRTHS2012           3780
BIRTHS2013           3662
BIRTHS2014           3683
BIRTHS2015           3709
POPESTIMATE2010    345563
POPESTIMATE2011    349048
POPESTIMATE2012    351213
POPESTIMATE2013    354289
POPESTIMATE2014    357029
POPESTIMATE2015    358880
Name: (Michigan, Washtenaw County), dtype: int64

In [91]:
# To compare two counties, we pass .loc a list of two tuples. In each tuple, the first element is 
# Michigan, and the second element is either Washtenaw County or Wayne County

df.loc[ [('Michigan', 'Washtenaw County'),
         ('Michigan', 'Wayne County')] ]

STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Michigan,Washtenaw County,977,3826,3780,3662,3683,3709,345563,349048,351213,354289,357029,358880
Michigan,Wayne County,5918,23819,23270,23377,23607,23586,1815199,1801273,1792514,1775713,1766008,1759335
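
The selection patterns used above can be tried on a tiny frame without the census file. This is only a sketch: the numbers and the single `BIRTHS` column are made up, and only the two-level index layout mirrors the census example.

```python
import pandas as pd

# Made-up numbers; only the index layout mirrors the census example.
toy = pd.DataFrame({'STNAME': ['Michigan', 'Michigan', 'Ohio'],
                    'CTYNAME': ['Washtenaw County', 'Wayne County',
                                'Franklin County'],
                    'BIRTHS': [10, 20, 30]}).set_index(['STNAME', 'CTYNAME'])

# Both levels given: one row, returned as a Series.
one_county = toy.loc['Michigan', 'Washtenaw County']

# A list of (state, county) tuples: several rows, returned as a DataFrame.
two_counties = toy.loc[[('Michigan', 'Washtenaw County'),
                        ('Michigan', 'Wayne County')]]

# Outer level only: every county in that state.
whole_state = toy.loc['Michigan']

print(one_county['BIRTHS'])
print(list(whole_state.index))
```

Selecting on the outer level alone drops that level from the result, so `whole_state` is indexed by county name only.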


## Missing Values

`isnull()`, `fillna()`

In [63]:
df = pd.read_csv('/Users/ybzhang/Downloads/Python UMich/University of Michigan - Intro to Data Science in Python/resources/week-2/datasets/class_grades.csv')
df.shape

(99, 6)

In [64]:
df.head()

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89


In [66]:
# We can actually use the function .isnull() to create a 'boolean' mask of the whole dataframe. This effectively
# broadcasts the isnull() function to every cell of data. 
mask=df.isnull()
mask.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,True,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


In [72]:
# This can be useful for processing rows based on certain columns of data. Another useful operation is to be
# able to drop all of those rows which have any missing data, which can be done with the dropna() function.
a = df.dropna()
a.shape

(81, 6)

In [73]:
# So, if we wanted to fill all missing values with 0, we would use fillna
df.fillna(0, inplace=True)
df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,0.0,63.15,48.89
3,7,0.0,0.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,0.0,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


In [79]:
df = pd.read_csv('/Users/ybzhang/Downloads/Python UMich/University of Michigan - Intro to Data Science in Python/resources/week-2/datasets/log.csv')
df.shape

(33, 6)

In [80]:
df = df.set_index('time')
df = df.sort_index()
df.head(20)

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


In [81]:
# If we look closely at the output though we'll notice that the index 
# isn't really unique. Two users seem to be able to use the system at the same 
# time. Again, a very common case. Let's reset the index and use 
# multi-level indexing on time AND user together instead, promoting the 
# user name to a second level of the index to deal with that issue.

df = df.reset_index()
df = df.set_index(['time', 'user'])
df

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,
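
The `paused` and `volume` cells are still empty at this point. One plausible follow-up, which is not part of the cells above, is a forward fill that carries the last observed value down into the gaps. The sketch below uses made-up rows shaped like the log and assumes that a missing value means "unchanged since that user's previous row".

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like the log; NaN means no value was recorded.
log = pd.DataFrame({'time': [100, 100, 130, 130, 160],
                    'user': ['cheryl', 'sue', 'cheryl', 'sue', 'cheryl'],
                    'volume': [10.0, 5.0, np.nan, np.nan, np.nan]})
log = log.set_index(['time', 'user']).sort_index()

# Plain forward fill copies the value from the row directly above,
# regardless of which user that row belongs to.
filled_plain = log.ffill()

# Filling within each user keeps one user's settings out of another's rows.
filled_per_user = log.groupby(level='user').ffill()

print(filled_plain['volume'].tolist())
print(filled_per_user.loc[(130, 'cheryl'), 'volume'])
```

Because the rows of different users are interleaved, the plain fill gives cheryl's later rows sue's volume, while the per-user fill does not. Which behaviour is appropriate depends on what the missing values actually mean in the data.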


## Replace

In [93]:
# We can also do customized fill-in to replace values with the replace() function. It allows replacement from
# several approaches: value-to-value, list, dictionary, regex. Let's generate a simple example
df = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                   'B': [3, 6, 3, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df

Unnamed: 0,A,B,C
0,1,3,a
1,1,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [94]:
# We can replace 1's with 100, let's try the value-to-value approach
df.replace(1, 100)

Unnamed: 0,A,B,C
0,100,3,a
1,100,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [95]:
# How about changing two values? Let's try the list approach. For example, we want to change 1's to 100 and 3's
# to 300
df.replace([1, 3], [100, 300])

Unnamed: 0,A,B,C
0,100,300,a
1,100,6,b
2,2,300,c
3,300,8,d
4,4,9,e
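
The dictionary and regex approaches mentioned at the start of this section are not shown above. A minimal sketch of both, on a copy of the same small frame (the name `df_demo` and the pattern are illustrative choices):

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                        'B': [3, 6, 3, 8, 9],
                        'C': ['a', 'b', 'c', 'd', 'e']})

# Dictionary approach: keys are the old values, values are the new ones.
by_dict = df_demo.replace({1: 100, 3: 300})

# Regex approach: the pattern is matched against string cells only,
# so the numeric columns A and B are left as they are.
by_regex = df_demo.replace(to_replace=r'^[ab]$', value='x', regex=True)

print(by_dict['A'].tolist())   # [100, 100, 2, 300, 4]
print(by_regex['C'].tolist())  # ['x', 'x', 'c', 'd', 'e']
```

The dictionary form gives the same result as the list form above, but keeps each old value next to its replacement, which is easier to read when there are many pairs.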
