### Chapter 2 - Basic Data Processing with Pandas

# Table of Contents

1.1 Introduction to Pandas and Series Data

- The Series Data Structure
- Querying a Series

1.2 DataFrame

- DataFrame Data Structure
- DataFrame Indexing and Loading
- Querying a DataFrame
- Indexing DataFrames
- Missing Values
- Example: Manipulating DataFrame


# 1.1 Introduction to Pandas + Series Data

## 1. The Series Data Structure

In [1]:
# Let's import pandas to get started
import pandas as pd


#### a Series "object" from a list of strings

In [2]:
# As you might expect, you can create a series by passing in a list of values.
# When you do this, Pandas automatically assigns an index starting with zero and
# sets the name of the series to None. Let's work on an example of this.

# One of the easiest ways to create a series is to use an array-like object, like a list.

# Here, I'll make a list of the three students, all as strings
students = ['Alice', 'Jack', 'Molly']

# Now, we just call the Series function in Pandas and pass in the students
pd.Series(students)


0    Alice
1     Jack
2    Molly
dtype: object

In [None]:
# The result is a Series object which is nicely rendered to the screen. We see here that
# pandas has automatically identified the type of data in this Series as "object" and
# set the dtype parameter as appropriate. We see that the values are indexed with integers,
# starting at zero


#### a Series "int64" from a list of integers

In [3]:
# We don't have to use strings. If we passed in a list of whole numbers, for instance,
# we could see that pandas sets the type to int64. Underneath, pandas stores series values in a
# typed array using the Numpy library. This offers significant speedup when processing data
# versus traditional Python lists.

# Let's create a little list of numbers
numbers = [1, 2, 3]
# And turn that into a series
pd.Series(numbers)


0    1
1    2
2    3
dtype: int64

In [None]:
# And we see on my architecture that the result is a dtype of int64 objects


#### How Pandas handles missing data "None"

- "None" becomes an Object within a list of strings

In [5]:
# There are some other typing details that exist for performance that are important to know.
# The most important is how Numpy, and thus pandas, handles missing data.

# In Python, we have the None type to indicate a lack of data. But what do we do if we want
# to have a typed list like we do in the series object?

# Underneath, Pandas does some type conversion. If we create a list of strings and we have 
# one element, a None type, Pandas inserts it as a None and uses the type object for the
# underlying array.

# Let's recreate our list of students, but leave the last one as a None
students = ['Alice', 'Jack', None]
# And let's convert this to a series
pd.Series(students)


0    Alice
1     Jack
2     None
dtype: object

- "None" becomes NaN(float64), a special Floating point value

In [7]:
# However, if we create a list of numbers, integers or floats, and put in the None type,
# Pandas automatically converts this to a special floating point value designated as NaN,
# which stands for "Not a Number"

# So let's create a list with a None value in it
number = [1, 2, None]
# And turn that into a series
pd.Series(number)


0    1.0
1    2.0
2    NaN
dtype: float64

In [None]:
# You'll notice a couple of things. First, NaN is a different value. Second, Pandas
# set the dtype of this series to floating point numbers instead of objects or ints. That's
# maybe a bit of a surprise - why not just leave this as an integer? Underneath, Pandas
# represents NaN as a floating point number, and because integers can be typecast to
# floats, Pandas went and converted our integers to floats. So when you're wondering why the
# list of integers you put into a Series has turned into floats, it's probably because there
# is some missing data.


- NaN is a numeric value

In [8]:
# For those who might not have done scientific computing in Python before, it is important
# to stress that None and NaN might be being used by the data scientist in the same way, to
# denote missing data, but that underneath these are not represented by Pandas in the same
# way.

# NaN is *NOT* equivalent to None and when we try the equality test, the result is False.

# Let's bring in Numpy which allows us to generate an NaN value
import numpy as np
# And let's compare it to None
np.nan == None


False

In [9]:
# It turns out that you actually can't do an equality test of NaN to itself. When you do,
# the answer is ALWAYS False.

np.nan == np.nan


False

In [10]:
# Instead, you need to use special functions to test for the presence of Not a Number,
# such as the Numpy library isnan().

np.isnan(np.nan)


True

In [11]:
# So keep in mind when you see NaN, its meaning is similar to None, but it's a
# numeric value and treated differently for efficiency reasons.
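
Because None and NaN are represented differently, it can help to test for missing data with a check that covers both. The following is a minimal sketch rather than part of the original lecture; the variable names are illustrative, and it assumes the `isna()` method and `pd.isna()` function that pandas provides.

```python
import numpy as np
import pandas as pd

# One Series of strings and one of numbers, each with a missing last entry
names = pd.Series(['Alice', 'Jack', None])
scores = pd.Series([1, 2, None])

# isna() flags the missing entry in both cases, whatever the underlying dtype
names_missing = names.isna().tolist()
scores_missing = scores.isna().tolist()

# pd.isna() also works on scalars and accepts None as well as NaN
none_is_missing = pd.isna(None)
nan_is_missing = pd.isna(np.nan)

print(names_missing, scores_missing, none_is_missing, nan_is_missing)
```

Unlike `np.isnan()`, which expects numeric input, `pd.isna()` can be applied to None directly.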


#### A dictionary of strings turns into Object (both for Keys and Values) in a Series
- Its Keys become its Index

In [20]:
# Let's talk more about how pandas' Series can be created. While a list might be a common
# way to create some play data, often you have labeled data that you want to manipulate.
# A series can be created directly from dictionary data. If you do this, the index is
# automatically assigned to the keys of the dictionary that you provided and not just
# incrementing integers.

# Here's an example using some data of students and their classes.

students_scores = {'Alice': 'Physics',
                   'Jack' : 'Chemistry',
                   'Molly': 'English'}

s = pd.Series(students_scores)
s


Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [None]:
# We see that, since it was String data, Pandas set the data type of the series to "object".
# We see that the index, the first column, is also a list of strings


In [21]:
# Once the series has been created, we can get the index object using the index attribute.

s.index


Index(['Alice', 'Jack', 'Molly'], dtype='object')

#### "Arbitrary objects" also become the type Object in a Series

In [None]:
# As you play more with Pandas, you'll notice that a lot of things are implemented as Numpy
# arrays, and have the dtype value set. This is true of indices, and here Pandas inferred
# that we were using objects for the index.


In [22]:
# Now, this is kind of interesting. The dtype of object is not just for strings, but for
# arbitrary objects. Let's create a more complex type of data, say, a list of tuples.
students = [("Alice", "Brown"), ("Jack", "White"), ("Molly", "Green")]
pd.Series(students)


0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

In [None]:
# We see that each of the tuples is stored in the Series object, and the type is object.


#### Creating a Series with an explicit index

In [23]:
# You can also separate your index creation from the data by passing in the index as a 
# list explicitly to the series.

s = pd.Series(['Physics', 'Chemistry', 'English'], index = ['Alice', 'Jack', 'Molly'])
s


Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [24]:
# So what happens if your list of values in the index object are not aligned with the keys
# in your dictionary for creating the series? Well, Pandas overrides the automatic creation
# to favor only and all of the index values that you provided. So it will ignore from your
# dictionary all keys which are not in your index, and Pandas will add None or NaN type values
# for any index value provided which is not in your dictionary key list.

# Here's an example. I'll pass in a dictionary of three items, in this case students and
# their courses
students_scores = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English'}

# When I create the series object, though, I'll only ask for an index with three students, and
# I'll exclude Jack
s = pd.Series(students_scores, index = ['Alice', 'Sam', 'Molly'])
s


Alice    Physics
Sam          NaN
Molly    English
dtype: object

In [25]:
# The result is that the Series object doesn't have Jack in it, even though he was in our
# original dataset, but it explicitly does have Sam in it as a missing value.


In this lecture, we've explored the Pandas Series data structure. You've seen how to create a Series from lists and dictionaries, how indices on data work, and the way that Pandas typecasts data, including missing values.

## 2. Querying a Series

In [26]:
# A Pandas Series can be queried either by the index position or the index label. If you don't give an
# index to the Series when creating it, the position and the label are effectively the same values. To
# query by numeric location, starting at zero, use the "iloc" attribute. To query by the index label,
# you can use the loc attribute.

import pandas as pd
students_classes = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'}

s = pd.Series(students_classes)
s


Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

#### the iloc[ ] attribute to query by numeric location

In [27]:
s.iloc[3]


'History'

In [29]:
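# Plain [] falls back to a positional lookup here because the labels are strings.
# Recent pandas versions deprecate this fallback, so s.iloc[3] is the safer spelling.
s[3]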


'History'

#### the loc[ ] attribute to query by label (key)

In [28]:
s.loc['Molly']


'English'

In [30]:
s['Molly']


'English'

- What if the index labels are themselves integers?

In [31]:
class_code = {99: 'Physics',
             100: 'Chemistry',
             101: 'English',
             102: 'History'}

s = pd.Series(class_code)


In [35]:
# s[0]
# This raises a KeyError: the index holds integers, so s[0] is looked up as the label 0
# rather than executed as s.iloc[0]

print(s.iloc[0])
print(s.loc[99])


Physics
Physics
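
The failure mentioned in the comment above can be reproduced without crashing the notebook. The sketch below, with illustrative variable names, catches the `KeyError` and then shows the two explicit alternatives.

```python
import pandas as pd

class_code = pd.Series({99: 'Physics',
                        100: 'Chemistry',
                        101: 'English',
                        102: 'History'})

# With an integer index, [] means "look up this label", and 0 is not a label here
try:
    class_code[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

by_position = class_code.iloc[0]   # first element, regardless of labels
by_label = class_code.loc[99]      # element whose label is 99

print(lookup_failed, by_position, by_label)
```

Being explicit with `iloc` and `loc` avoids having to remember how `[]` is interpreted for a given index type.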


#### the NumPy Sum method

In [36]:
grades = pd.Series([90, 80, 70, 60])

total = 0

for grade in grades :
    total += grade
    
print(total / len(grades))


75.0


In [37]:
import numpy as np

total = np.sum(grades)

print(total / len(grades))


75.0
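
A Series also carries its own aggregation methods, so the average can be computed without an explicit loop or a separate NumPy call. A small sketch using the same grades:

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

total = grades.sum()     # same total as np.sum(grades)
average = grades.mean()  # total divided by the number of elements

print(total, average)
```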


#### the NumPy Random.randint method

In [56]:
numbers = pd.Series(np.random.randint(0, 1000, 10000))    # integers from 0 to 999 (1000 is exclusive), with a size (n) of 10,000

numbers.head()


0    248
1    175
2    836
3    762
4    162
dtype: int64

In [40]:
len(numbers)


10000

#### The cell magic function %%timeit

In [46]:
%%timeit -n 100
# A cell magic function has to be the first line in the cell. -n sets the number of loops;
# if it is omitted, timeit chooses a loop count automatically.
total = 0

for number in numbers :
    total += number
    
total / len(numbers)


1.18 ms ± 47.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [47]:
%%timeit -n 100

total = np.sum(numbers)
total / len(numbers)


76.3 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


#### += operator on the Series object

In [57]:
numbers.head()


0    248
1    175
2    836
3    762
4    162
dtype: int64

In [58]:
# And now, let's just increase everything in the Series by 2
numbers += 2
numbers.head()


0    250
1    177
2    838
3    764
4    164
dtype: int64

In [59]:
# Which is comparable to the traditional, procedural method, as shown below.
# (Series.iteritems() and Series.set_value() have been removed from recent pandas
# releases; items() and the .at accessor are the current equivalents.)
for label, value in numbers.items() :
    numbers.at[label] = value + 2

numbers.head()


0    252
1    179
2    840
3    766
4    166
dtype: int64

- Another example:

In [61]:
%%timeit -n 10

s = pd.Series(np.random.randint(0, 1000, 1000))

for label, value in s.items() :    # iteritems() was removed in pandas 2.0
    s.loc[label] = value + 2


131 ms ± 7.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [63]:
%%timeit -n 10

s = pd.Series(np.random.randint(0, 1000, 1000))

s += 2


349 µs ± 51.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
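
The timings above compare speed; it is also worth confirming that the two approaches produce the same values. The sketch below is an illustration with assumed names, using a seeded generator so the run is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded so the data is the same on every run
original = pd.Series(rng.integers(0, 1000, 1000))

# Procedural approach: visit every element and write the new value into a copy
looped = original.copy()
for label, value in original.items():
    looped.at[label] = value + 2

# Vectorized approach: the scalar 2 is broadcast across the whole Series
vectorized = original + 2

same_result = looped.equals(vectorized)
print(same_result)
```

Both give identical Series, so the vectorized form is preferable whenever it is available.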


#### Adding new data with indexing operators, with mixed data values or index labels

In [67]:
s = pd.Series([1, 2, 3])

s.loc['History'] = 102

s

0            1
1            2
2            3
History    102
dtype: int64

#### When an index is not unique, with append( )

In [68]:
students_classes = pd.Series({'Alice': 'Physics',
                             'Jack': 'Chemistry',
                             'Molly': 'English',
                             'Sam': 'History'})

students_classes


Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [69]:
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index = ['Kelly', 'Kelly', 'Kelly'])
kelly_classes


Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [70]:
# Series.append() was removed in pandas 2.0; pd.concat() gives the same result
all_students_classes = pd.concat([students_classes, kelly_classes])
all_students_classes


Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [71]:
# The original Series hasn't changed
students_classes


Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [73]:
# returns multiple values, having the same label
all_students_classes.loc['Kelly']


Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
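
Whether `loc` returns a single value or a Series depends on how many entries share the label. A short sketch of how to check this, using `pd.concat()` to combine the two Series (variable names are illustrative):

```python
import pandas as pd

students_classes = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry',
                              'Molly': 'English', 'Sam': 'History'})
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'],
                          index=['Kelly', 'Kelly', 'Kelly'])

combined = pd.concat([students_classes, kelly_classes])

labels_unique = combined.index.is_unique   # False: 'Kelly' appears three times
kelly_rows = combined.loc['Kelly']         # a Series with three entries
single_value = combined.loc['Alice']       # a plain string, since the label is unique

print(labels_unique, len(kelly_rows), single_value)
```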

# 1.2 DataFrame

- A two-dimensional Series object
- index + multiple columns of content, with each column having a label
- The distinction between a column and a row is only conceptual (indexed both by row and column)
- can be thought of as a two-axes labeled array

## 1. DataFrame Data Structure

#### pd.DataFrame( )

In [74]:
import pandas as pd


- Turning multiple Series into a DataFrame

In [75]:
record1 = pd.Series({'Name': 'Alice',
                    'Class': 'Physics',
                    'Score': 85})

record2 = pd.Series({'Name': 'Jack',
                    'Class': 'Chemistry',
                    'Score': 82})

record3 = pd.Series({'Name': 'Helen',
                    'Class': 'Biology',
                    'Score': 90})


In [79]:
df = pd.DataFrame([record1, record2, record3], index = ['school1', 'school2', 'school1'])
df.head()


,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90


- Turning a list of dictionaries into a DataFrame

In [81]:
students = [{'Name': 'Alice',
            'Class': 'Physics',
            'Score': 85}, 
            
            {'Name': 'Jack',
            'Class': 'Chemistry',
            'Score': 82}, 
            
            {'Name': 'Helen',
            'Class': 'Biology',
            'Score': 90}]

df = pd.DataFrame(students, index = ['school1', 'school2', 'school1'])
df.head()


,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90


#### loc[ ] returns a series, if there's only one row to return

In [82]:
df.loc['school2']


Name          Jack
Class    Chemistry
Score           82
Name: school2, dtype: object

The type is Series

In [84]:
type(df.loc['school2'])


pandas.core.series.Series

#### loc[ ] returns multiple rows of the DataFrame, if there's more than one row to return

In [83]:
df.loc['school1']


,Name,Class,Score
school1,Alice,Physics,85
school1,Helen,Biology,90


The type is DataFrame

In [85]:
type(df.loc['school1'])


pandas.core.frame.DataFrame

#### Supplying two parameters to .loc[ ]

In [86]:
df.loc['school1', 'Name']


school1    Alice
school1    Helen
Name: Name, dtype: object
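
The three behaviours of `loc` shown so far can be checked side by side. This is a small sketch on the same three-student data, with illustrative variable names:

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

one_row = df.loc['school2']         # unique label: a Series
two_rows = df.loc['school1']        # repeated label: a DataFrame
names = df.loc['school1', 'Name']   # row label and column label together

print(type(one_row).__name__, type(two_rows).__name__, names.tolist())
```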

#### Selecting a single column, instead of a row

- T attribute

In [87]:
df.T


,school1,school2,school1
Name,Alice,Jack,Helen
Class,Physics,Chemistry,Biology
Score,85,82,90


In [88]:
df.T.loc['Name']


school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [89]:
df['Name']


school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

The type is Series

In [90]:
type(df['Name'])


pandas.core.series.Series

#### Chaining, avoid it if possible

- First select the rows with a particular label, then return the values in a single column

In [91]:
df.loc['school1']['Name']


school1    Alice
school1    Helen
Name: Name, dtype: object

In [92]:
print(type(df.loc['school1']))          # DataFrame, since it has two rows to return
print(type(df.loc['school1']['Name']))  # Series


<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


#### An alternative way, recommended

In [96]:
print(df.loc[:])  # : return all rows

df.loc[:, ['Name', 'Score']]


          Name      Class  Score
school1  Alice    Physics     85
school2   Jack  Chemistry     82
school1  Helen    Biology     90


,Name,Score
school1,Alice,85
school2,Jack,82
school1,Helen,90


#### Dropping data in Series and DataFrame, with the drop( ) function

- the drop( ) function doesn't change the DataFrame by default
- it returns a copy of the DataFrame with the given rows removed

In [97]:
df.drop('school1')


,Name,Class,Score
school2,Jack,Chemistry,82


The DataFrame is still intact!

In [98]:
df


,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90


#### drop("colName/rowName", inplace = True/False, axis = 0/1)

- if inplace = True, the DataFrame will be updated, instead of merely returning its copy
- axis = 0 (default) indicates the row axis to drop
- axis = 1 indicates the column axis to drop

In [102]:
copy_df = df.copy()

copy_df.drop("Name", inplace = True, axis = 1)
copy_df


,Class,Score
school1,Physics,85
school2,Chemistry,82
school1,Biology,90
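
As an alternative to remembering which axis number means what, `drop()` also accepts `index=` and `columns=` keywords. The sketch below uses them and leaves `inplace` at its default, so the original DataFrame is untouched; variable names are illustrative.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])

without_name = df.drop(columns=['Name'])     # same effect as drop('Name', axis=1)
without_school1 = df.drop(index='school1')   # removes every row labelled 'school1'

print(list(without_name.columns), without_school1.index.tolist(), list(df.columns))
```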


#### Alternative way to drop a column, with the "del" keyword

In [103]:
del copy_df['Class']
copy_df


,Score
school1,85
school2,82
school1,90


#### Adding a new column to the DataFrame

In [104]:
df['ClassRanking'] = None
df


,Name,Class,Score,ClassRanking
school1,Alice,Physics,85,
school2,Jack,Chemistry,82,
school1,Helen,Biology,90,


## 2. DataFrame Indexing and Loading

#### Shell command: "cat" for concatenate

- outputs the contents of a file
- "!" executes the remainder of the line as a shell command

In [118]:
!cat Admission_Predict.csv


Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR ,CGPA,Research,Chance of Admit 
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4,4.5,8.87,1,0.76
3,316,104,3,3,3.5,8,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2,3,8.21,0,0.65
6,330,115,5,4.5,3,9.34,1,0.9
7,321,109,3,3,4,8.2,1,0.75
8,308,101,2,3,4,7.9,0,0.68
9,302,102,1,2,1.5,8,0,0.5
10,323,108,3,3.5,3,8.6,0,0.45
11,325,106,3,3.5,4,8.4,1,0.52
12,327,111,4,4,4.5,9,1,0.84
13,328,112,4,4,4.5,9.1,1,0.78
14,307,109,3,4,3,8,1,0.62
15,311,104,3,3.5,2,8.2,1,0.61
16,314,105,3,3.5,2.5,8.3,0,0.54
17,317,107,3,4,3,8.7,0,0.66
18,319,106,3,4,3,8,1,0.65
19,318,110,3,4,3,8.8,0,0.63
20,303,102,3,3.5,3,8.5,0,0.62
21,312,107,3,3,2,7.9,1,0.64
22,325,114,4,3,2,8.4,0,0.7
23,328,116,5,5,5,9.5,1,0.94
24,334,119,5,5,4.5,9.7,1,0.95
25,336,119,5,4,3.5,9.8,1,0.97
26,340,120,5,4.5,4.5,9.6,1,0.94
27,322,109,5,4.5,3.5,8.8,0,0.76
28,298,98,2,1.5,2.5,7.5,1,0.44
29,295,93,1,2,2,7.2,0,0.46
30,310,99

In [119]:
import pandas as pd

df = pd.read_csv("Admission_Predict.csv")
df.head()


,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


Notice that the index at the beginning and the Serial No. column do not match.
#### We can use the Serial No. (column 0) as our index, with "index_col"


In [120]:
df = pd.read_csv('Admission_Predict.csv', index_col = 0)
df.head()


Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
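
The effect of `index_col` can be tried without the data file. The sketch below reads a three-row excerpt from an in-memory string via `io.StringIO`, which is an assumption made here so the example is self-contained; the values are taken from the first rows shown above.

```python
import io
import pandas as pd

csv_text = ("Serial No.,GRE Score,TOEFL Score\n"
            "1,337,118\n"
            "2,324,107\n"
            "3,316,104\n")

# Default: pandas builds a fresh 0-based index and keeps Serial No. as a column
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the index instead
serial_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.index.tolist(), serial_index.index.tolist())
```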


#### Renaming column names, with "rename()"

In [121]:
new_df = df.rename(columns = {'GRE Score': 'GRE Score',
                             'TOEFL Score': 'TOEFL Score',
                             'University Rating': 'University Rating',
                             'SOP': 'Statement of Purpose',
                             'LOR': 'Letter of Recommendation',
                             'CGPA': 'CGPA',
                             'Research': 'Research',
                             'Chance of Admit': 'Chance of Admit'})
new_df.head()


Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


Why did only SOP change, but not LOR?

Let's investigate this. First, we need to make sure all the column names are correct.

We can use the "columns" attribute of DataFrame to get a list.

In [122]:
new_df.columns


Index(['GRE Score', 'TOEFL Score', 'University Rating', 'Statement of Purpose',
       'LOR ', 'CGPA', 'Research', 'Chance of Admit '],
      dtype='object')

We can see that there is a space right after 'LOR' and 'Chance of Admit'

#### One way to fix:

In [123]:
new_df = new_df.rename(columns = {'LOR ': 'Letter of Recommendation'})
new_df.head()


Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,Letter of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


#### Alternative way to fix, with strip() :

In [124]:
new_df = new_df.rename(mapper = str.strip, axis = 'columns')    # remove white-spaces in columns
new_df.head()                                                   # Here, Chance of Admit is trimmed up


Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,Letter of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


#### The .rename( ) function does not modify the original dataframe. 

In [125]:
df.columns     # still has white-spaces


Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA',
       'Research', 'Chance of Admit '],
      dtype='object')

#### Changing all of the column names to lower case

In [126]:
cols = list(df.columns)
print(cols)

cols = [x.lower().strip() for x in cols]

df.columns = cols
df.head()


['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', 'Research', 'Chance of Admit ']


Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
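
The list comprehension above can also be written with the vectorized string methods available on the column Index. This is a sketch of that alternative on a one-row stand-in DataFrame (the row values are taken from the first record shown above), not a replacement for the approach in the lecture.

```python
import pandas as pd

# Stand-in DataFrame with the same kind of untidy column names
df = pd.DataFrame([[337, 4.5, 0.92]],
                  columns=['GRE Score', 'LOR ', 'Chance of Admit '])

# .str applies string methods to every column name at once
df.columns = df.columns.str.strip().str.lower()

print(list(df.columns))
```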


## 3. Querying a DataFrame

In [127]:
import pandas as pd

df = pd.read_csv('Admission_Predict.csv', index_col = 0)
df.columns = [x.lower().strip() for x in df.columns]

df.head()


Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


#### Boolean masks to filter the result

In [129]:
admit_mask = df['chance of admit'] > 0.7
admit_mask


Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

#### Using the .where( ) function to apply those boolean masks

In [135]:
df.where(admit_mask).head(8)


Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
5,,,,,,,,
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9
7,321.0,109.0,3.0,3.0,4.0,8.2,1.0,0.75
8,,,,,,,,


In [134]:
df.where(admit_mask).dropna().head(8)


Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9
7,321.0,109.0,3.0,3.0,4.0,8.2,1.0,0.75
12,327.0,111.0,4.0,4.0,4.5,9.0,1.0,0.84
13,328.0,112.0,4.0,4.0,4.5,9.1,1.0,0.78


#### A syntax that combines where( ) and dropna( )

In [136]:
df[df['chance of admit'] > 0.7].head(8)


Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
7,321,109,3,3.0,4.0,8.2,1,0.75
12,327,111,4,4.0,4.5,9.0,1,0.84
13,328,112,4,4.0,4.5,9.1,1,0.78
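
The outputs above hint at a difference between the two routes: `where()` followed by `dropna()` turns the integer columns into floats, because the intermediate NaN values force a float dtype, while indexing with the mask keeps the original dtypes. A small sketch on three rows taken from the data above:

```python
import pandas as pd

df = pd.DataFrame({'gre score': [337, 314, 330],
                   'chance of admit': [0.92, 0.65, 0.90]},
                  index=[1, 5, 6])

mask = df['chance of admit'] > 0.7

via_where = df.where(mask).dropna()   # rows failing the mask become NaN, then are dropped
via_index = df[mask]                  # rows failing the mask are never materialised

print(via_where['gre score'].dtype, via_index['gre score'].dtype)
```

Both keep the same rows; only the dtype of the integer column differs.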


#### two things that indexing operators can do on DataFrame

In [137]:
df["gre score"].head()


Serial No.
1    337
2    324
3    316
4    322
5    314
Name: gre score, dtype: int64

In [138]:
df[["gre score", "toefl score"]].head()


Serial No.,gre score,toefl score
1,337,118
2,324,107
3,316,104
4,322,110
5,314,103


In [139]:
df[df["gre score"] > 320].head()


Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
7,321,109,3,3.0,4.0,8.2,1,0.75


#### Combining multiple boolean masks
What doesn't work:

In [140]:
(df['chance of admit'] > 0.7) and (df['chance of admit'] < 0.9)


ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What works:

In [144]:
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)    # Parentheses are required: & binds more tightly than > and <


Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

#### Another way:

In [145]:
df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)        # gt = greater than
                                                                     # lt = less than

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

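Chaining the two calls does not give the same result: gt(0.7) returns a boolean Series, so lt(0.9) then compares True/False (treated as 1/0) against 0.9 and simply inverts the first mask, as the output below shows.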

In [146]:
df['chance of admit'].gt(0.7).lt(0.9)

Serial No.
1      False
2      False
3      False
4      False
5       True
       ...  
396    False
397    False
398    False
399     True
400    False
Name: chance of admit, Length: 400, dtype: bool
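
The difference between combining two masks and chaining the comparison methods is easy to verify on a handful of values. The sketch below uses four values from the data above; the variable names are illustrative.

```python
import pandas as pd

chance = pd.Series([0.92, 0.76, 0.65, 0.80], index=[1, 2, 5, 4])

# Two independent masks on the original values, combined element-wise
combined = chance.gt(0.7) & chance.lt(0.9)

# Chained: lt(0.9) is applied to the booleans returned by gt(0.7)
chained = chance.gt(0.7).lt(0.9)

# The chained version is just the negation of the first mask
is_negation = chained.equals(~chance.gt(0.7))

print(combined.tolist(), chained.tolist(), is_negation)
```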

## 4. Indexing DataFrames

The set_index( ) function is a destructive process: it does not keep the current index, so copy the index into a column first if you want to preserve it.

In [147]:
import pandas as pd
df = pd.read_csv("Admission_Predict.csv", index_col = 0)

df.head()


Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [148]:
# Copy the index data into its own column
df['Serial Number'] = df.index
# Then we set the index to another column
df = df.set_index('Chance of Admit ')

df.head()


Chance of Admit,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Serial Number
0.92,337,118,4,4.5,4.5,9.65,1,1
0.76,324,107,4,4.0,4.5,8.87,1,2
0.72,316,104,3,3.0,3.5,8.0,1,3
0.8,322,110,3,3.5,2.5,8.67,1,4
0.65,314,103,2,2.0,3.0,8.21,0,5


Resetting the index

In [150]:
df = df.reset_index()

df.head()


,Chance of Admit,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Serial Number
0,0.92,337,118,4,4.5,4.5,9.65,1,1
1,0.76,324,107,4,4.0,4.5,8.87,1,2
2,0.72,316,104,3,3.0,3.5,8.0,1,3
3,0.8,322,110,3,3.5,2.5,8.67,1,4
4,0.65,314,103,2,2.0,3.0,8.21,0,5
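
Another way to keep the old index, besides copying it into a column by hand, is to call `reset_index()` before `set_index()`. The sketch below contrasts the two on a two-row stand-in DataFrame; the lower-case column names are assumptions made for brevity.

```python
import pandas as pd

df = pd.DataFrame({'gre score': [337, 324],
                   'chance of admit': [0.92, 0.76]},
                  index=pd.Index([1, 2], name='serial no.'))

# set_index alone discards the old index
lost = df.set_index('chance of admit')

# reset_index first moves the old index into a column, so nothing is lost
kept = df.reset_index().set_index('chance of admit')

# A further reset_index turns the current index back into the first column
restored = kept.reset_index()

print(list(lost.columns), list(kept.columns), list(restored.columns))
```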


#### Multi-level indexing

In [166]:
df = pd.read_csv('census.txt')

df.head()


,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2019,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015,RNETMIG2016,RNETMIG2017,RNETMIG2018,RNETMIG2019
0,40,3,6,1,0,Alabama,Alabama,4779736,4780125,4785437,...,1.917501,0.578434,1.186314,1.522549,0.563489,0.626357,0.745172,1.090366,1.773786,2.483744
1,50,3,6,1,1,Alabama,Autauga County,54571,54597,54773,...,4.84731,6.018182,-6.226119,-3.902226,1.970443,-1.712875,4.777171,0.849656,0.540916,4.560062
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183112,...,24.017829,16.64187,17.488579,22.751474,20.184334,17.725964,21.279291,22.398256,24.727215,24.380567
3,50,3,6,1,5,Alabama,Barbour County,27457,27455,27327,...,-5.690302,0.292676,-6.897817,-8.132185,-5.140431,-15.724575,-18.238016,-24.998528,-8.754922,-5.165664
4,50,3,6,1,7,Alabama,Bibb County,22915,22915,22870,...,1.385134,-4.998356,-3.787545,-5.797999,1.331144,1.329817,-0.708717,-3.234669,-6.857092,1.831952


Return the unique values of 'SUMLEV', just like SELECT DISTINCT in SQL

In [167]:
df['SUMLEV'].unique()


array([40, 50])

In [168]:
df = df[df['SUMLEV'] == 50]
df.head()


,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2019,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015,RNETMIG2016,RNETMIG2017,RNETMIG2018,RNETMIG2019
1,50,3,6,1,1,Alabama,Autauga County,54571,54597,54773,...,4.84731,6.018182,-6.226119,-3.902226,1.970443,-1.712875,4.777171,0.849656,0.540916,4.560062
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183112,...,24.017829,16.64187,17.488579,22.751474,20.184334,17.725964,21.279291,22.398256,24.727215,24.380567
3,50,3,6,1,5,Alabama,Barbour County,27457,27455,27327,...,-5.690302,0.292676,-6.897817,-8.132185,-5.140431,-15.724575,-18.238016,-24.998528,-8.754922,-5.165664
4,50,3,6,1,7,Alabama,Bibb County,22915,22915,22870,...,1.385134,-4.998356,-3.787545,-5.797999,1.331144,1.329817,-0.708717,-3.234669,-6.857092,1.831952
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57376,...,1.020788,0.208812,-1.650165,-0.347225,-2.04959,-1.338525,-1.391062,6.193562,-0.069229,1.124597


In [169]:
columns_to_keep = ['STNAME', 'CTYNAME', 
                   'BIRTHS2010', 'BIRTHS2011', 'BIRTHS2012', 'BIRTHS2013', 'BIRTHS2014', 'BIRTHS2015',
                   'POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 
                   'POPESTIMATE2014', 'POPESTIMATE2015']

df = df[columns_to_keep]
df.head()


Unnamed: 0,STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
1,Alabama,Autauga County,150,638,615,571,640,651,54773,55227,54954,54727,54893,54864
2,Alabama,Baldwin County,516,2189,2093,2160,2212,2257,183112,186558,190145,194885,199183,202939
3,Alabama,Barbour County,71,331,300,282,264,271,27327,27341,27169,26937,26755,26283
4,Alabama,Bibb County,44,264,246,258,253,251,22870,22745,22667,22521,22553,22566
5,Alabama,Blount County,184,744,712,647,619,716,57376,57560,57580,57619,57526,57526


In [170]:
df = df.set_index(['STNAME', 'CTYNAME'])    # Note: the order matters (state first, then county)
df.head()


STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Alabama,Autauga County,150,638,615,571,640,651,54773,55227,54954,54727,54893,54864
Alabama,Baldwin County,516,2189,2093,2160,2212,2257,183112,186558,190145,194885,199183,202939
Alabama,Barbour County,71,331,300,282,264,271,27327,27341,27169,26937,26755,26283
Alabama,Bibb County,44,264,246,258,253,251,22870,22745,22667,22521,22553,22566
Alabama,Blount County,184,744,712,647,619,716,57376,57560,57580,57619,57526,57526


In [171]:
df.loc['Michigan', 'Washtenaw County']     # Note: the order matters (state first, then county)


BIRTHS2010            975
BIRTHS2011           3824
BIRTHS2012           3778
BIRTHS2013           3666
BIRTHS2014           3750
BIRTHS2015           3649
POPESTIMATE2010    345717
POPESTIMATE2011    349753
POPESTIMATE2012    352303
POPESTIMATE2013    356040
POPESTIMATE2014    360021
POPESTIMATE2015    362975
Name: (Michigan, Washtenaw County), dtype: int64

In [172]:
df.loc[ [('Michigan', 'Washtenaw County'),
         ('Michigan', 'Wayne County'    ) ] ]


STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Michigan,Washtenaw County,975,3824,3778,3666,3750,3649,345717,349753,352303,356040,360021,362975
Michigan,Wayne County,5916,23818,23271,23376,23721,23285,1815081,1803189,1795929,1780225,1771679,1764872
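
The three common ways of querying a two-level index can be sketched on a tiny made-up DataFrame. The birth counts here are invented for illustration, and the Ohio row is an assumption added only to show that selecting a state filters the rows.

```python
import pandas as pd

toy = pd.DataFrame({
    'STNAME': ['Michigan', 'Michigan', 'Ohio'],
    'CTYNAME': ['Washtenaw County', 'Wayne County', 'Franklin County'],
    'BIRTHS2010': [10, 20, 30],   # made-up numbers
    'BIRTHS2011': [11, 21, 31],   # made-up numbers
}).set_index(['STNAME', 'CTYNAME'])

# Outer level only: returns every county in the state as a DataFrame
michigan = toy.loc['Michigan']

# Both levels as a tuple: returns a single row as a Series
washtenaw = toy.loc[('Michigan', 'Washtenaw County')]

# A list of tuples: returns several rows as a DataFrame
both = toy.loc[[('Michigan', 'Washtenaw County'),
                ('Michigan', 'Wayne County')]]

print(michigan)
print(washtenaw)
print(both)
```

Selecting by the outer level alone drops that level from the result, so `michigan` is indexed by county name only.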


## 5. Missing Values

In [175]:
import pandas as pd

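# error_bad_lines=False skips malformed rows instead of raising an error; missing values are read as NaN.
# (In pandas 1.3 and later the equivalent option is on_bad_lines='warn'.)
df = pd.read_csv('class_grades.txt', error_bad_lines = False)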
df.head(10)


b'Skipping line 22: expected 6 fields, saw 7\nSkipping line 40: expected 6 fields, saw 7\nSkipping line 62: expected 6 fields, saw 7\n'


Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,30.0,63.15,48.89
3,7,81.22,96.06,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
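
The "Skipping line" messages above come from rows that have more fields than the header. A minimal sketch of the same situation, assuming pandas 1.3 or later (where `on_bad_lines` replaces `error_bad_lines`) and using a hypothetical in-memory CSV rather than class_grades.txt:

```python
import io
import pandas as pd

# Hypothetical CSV text: the third data row has one field too many
raw = "a,b,c\n1,2,3\n4,5,6\n7,8,9,10\n11,12,13\n"

# on_bad_lines='skip' drops malformed rows silently;
# on_bad_lines='warn' drops them too but reports each one
df_ok = pd.read_csv(io.StringIO(raw), on_bad_lines='skip')
print(df_ok)
```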


In [180]:
mask = df.isnull()
mask.head(10)


Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,True,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


In [177]:
df.dropna().head(10)


Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,30.0,63.15,48.89
3,7,81.22,96.06,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
10,7,80.44,90.2,75.0,91.48,39.72
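
A short sketch of how the boolean mask and `dropna()` can be combined, on a made-up grades table (the column names mirror the file above, but the rows are hypothetical). It also shows the `how` and `subset` parameters, which control which rows count as incomplete.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Midterm': [64.38, 60.0, np.nan],
                    'TakeHome': [51.48, np.nan, np.nan],
                    'Final': [52.5, 56.11, 50.83]})

# Summing the boolean mask counts the missing cells in each column
missing_per_column = toy.isnull().sum()

# The default dropna() removes any row with at least one missing value
complete_rows = toy.dropna()

# how='all' only drops rows in which every value is missing
some_data = toy.dropna(how='all')

# subset= restricts the check to the listed columns
has_midterm = toy.dropna(subset=['Midterm'])

print(missing_per_column)
print(len(complete_rows), len(some_data), len(has_midterm))
```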


Fill in all missing values with 0

In [182]:
df.fillna(0, inplace = True)
df.head(10)


Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,30.0,63.15,48.89
3,7,81.22,96.06,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,0.0,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


#### Another example:

In [183]:
df = pd.read_csv('log.txt')
df.head(20)


Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


ffill

- forward filling, which replaces an NA value in a cell with the value from the previous row

bfill

- backward filling, the opposite of ffill, which replaces an NA value with the value from the next row
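
The difference between the two directions is easiest to see on a short Series. This is a minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 5.0, np.nan])

# Forward fill copies the last seen value downwards
forward = s.ffill()

# Backward fill copies the next seen value upwards; the final NaN
# stays missing because there is no later row to copy from
backward = s.bfill()

print(forward.tolist())
print(backward.tolist())
```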

In [184]:
# Forward filling only gives sensible results if the rows are in order,
# so we set time as the index and sort by it
df = df.set_index('time')
df = df.sort_index()

df.head(20)


time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


The time index isn't unique for every row, which is a very common case.

So we use multi-level indexing on time AND user together (two users can apparently use the system at the same time)

In [186]:
df = df.reset_index()
df = df.set_index(['time', 'user'])

df.head(20)

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


- Using fillna( ) with the ffill method

In [187]:
# Forward fill; df.ffill() is equivalent, and fillna(method=...) is deprecated as of pandas 2.1
df = df.fillna(method = 'ffill')
df.head()


time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,False,10.0
1469974454,sue,advanced.html,24,False,10.0
1469974484,cheryl,intro.html,7,False,10.0
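
One thing worth checking with interleaved users: a plain forward fill copies from the previous row regardless of which user it belongs to. In the log above both users start with the same values, so the result looks the same either way. The sketch below uses hypothetical data where the users differ, and shows a per-user fill with `groupby(...).ffill()` as one possible alternative (an assumption about what is wanted, not something the original notebook does).

```python
import numpy as np
import pandas as pd

# Hypothetical log: two users with different starting volumes
toy = pd.DataFrame({'time': [1, 1, 2, 2],
                    'user': ['cheryl', 'sue', 'cheryl', 'sue'],
                    'volume': [10.0, 5.0, np.nan, np.nan]}
                   ).set_index(['time', 'user'])

# Plain forward fill: cheryl's missing value is taken from sue's row above it
plain = toy.ffill()

# Per-user forward fill: each user only inherits their own earlier values
per_user = toy.groupby(level='user').ffill()

print(plain)
print(per_user)
```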


#### Customizing fill-in: replacing values with the replace( ) function
- It supports several approaches: value-to-value, list, dictionary, regex

In [188]:
df = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                   'B': [3, 6, 3, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

df


Unnamed: 0,A,B,C
0,1,3,a
1,1,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [189]:
# We can replace 1's with 100, a value-to-value approach
df.replace(1, 100)


Unnamed: 0,A,B,C
0,100,3,a
1,100,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [190]:
# Changing two values, a list approach
df.replace([1, 3], [100, 300])


Unnamed: 0,A,B,C
0,100,300,a
1,100,6,b
2,2,300,c
3,300,8,d
4,4,9,e
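
The dictionary approach mentioned earlier is not shown above, so here is a minimal sketch on the same small DataFrame. A flat dictionary maps old values to new ones everywhere, while a nested dictionary restricts the replacement to one column.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                   'B': [3, 6, 3, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

# A flat dictionary maps old values to new values in every column
everywhere = df.replace({1: 100, 3: 300})

# A nested dictionary limits the replacement to the named column
only_b = df.replace({'B': {3: 300}})

print(everywhere)
print(only_b)
```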


#### Using regex with replace( )

In [192]:
df = pd.read_csv("log.txt")
df.head(20)


Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [193]:
# Matching any number of characters then ending in .html
df.replace(to_replace = ".*html$", value = "webpage", regex = True)


Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,webpage,5,False,10.0
1,1469974454,cheryl,webpage,6,,
2,1469974544,cheryl,webpage,9,,
3,1469974574,cheryl,webpage,10,,
4,1469977514,bob,webpage,1,,
5,1469977544,bob,webpage,1,,
6,1469977574,bob,webpage,1,,
7,1469977604,bob,webpage,1,,
8,1469974604,cheryl,webpage,11,,
9,1469974694,cheryl,webpage,14,,
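
Note that the pattern `.*html$` matches anything ending in the letters "html", with or without a dot in front. A slightly stricter variant escapes the dot. The sketch below compares the two on a made-up Series; the file names are hypothetical.

```python
import pandas as pd

videos = pd.Series(['intro.html', 'notes.pdf', 'about-xhtml'])

# Loose pattern from the cell above: any string ending in "html"
loose = videos.replace(to_replace=r".*html$", value="webpage", regex=True)

# Stricter pattern: the string must end in a literal ".html"
strict = videos.replace(to_replace=r".*\.html$", value="webpage", regex=True)

print(loose.tolist())
print(strict.tolist())
```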


When you use statistical functions on DataFrames, these functions typically ignore missing values.
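
A quick sketch of what "ignore" means in practice, using made-up scores. It also shows why filling with 0 beforehand is not neutral: the zeros are counted in the average.

```python
import numpy as np
import pandas as pd

scores = pd.Series([50.0, np.nan, 100.0])

# Default behaviour: the NaN is skipped, so this is (50 + 100) / 2
mean_default = scores.mean()

# skipna=False makes the NaN propagate into the result
mean_strict = scores.mean(skipna=False)

# After fillna(0) the zero is counted: (50 + 0 + 100) / 3
mean_filled = scores.fillna(0).mean()

print(mean_default, mean_strict, mean_filled)
```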

## 6. Example: Manipulating DataFrame

#### Basic data cleaning process

In [298]:
import pandas as pd

df = pd.read_csv('presidents.txt', error_bad_lines = False)
df.head()


Unnamed: 0,#,President,Born,Age at start of presidency,Age at end of presidency,Post-presidency timespan,Died,Age
0,1,George Washington,Feb 22 1732[a],57 years 67 days Apr 30 1789,65 years 10 days Mar 4 1797,2 years 285 days,Dec 14 1799,67 years 295 days
1,2,John Adams,Oct 30 1735[a],61 years 125 days Mar 4 1797,65 years 125 days Mar 4 1801,25 years 122 days,Jul 4 1826,90 years 247 days
2,3,Thomas Jefferson,Apr 13 1743[a],57 years 325 days Mar 4 1801,65 years 325 days Mar 4 1809,17 years 122 days,Jul 4 1826,83 years 82 days


In [299]:
df.columns


Index(['#', 'President', 'Born', 'Age at start of presidency',
       'Age at end of presidency', 'Post-presidency timespan', 'Died', 'Age'],
      dtype='object')

apply( ) function

In [300]:
def splitname(row):
    # Extract the first name of a president and store it as a new entry in the row (a Series)
    row['First'] = row['President'].split(" ")[0]
    # Do the same with the last word in the string
    row['Last'] = row['President'].split(" ")[-1]
    # Return the row; pandas .apply() will take care of merging the rows back into a DataFrame
    return row

# Apply the function to the DataFrame; axis = 'columns' passes each row to the function in turn
df = df.apply(splitname, axis = 'columns')
df


Unnamed: 0,#,President,Born,Age at start of presidency,Age at end of presidency,Post-presidency timespan,Died,Age,First,Last
0,1,George Washington,Feb 22 1732[a],57 years 67 days Apr 30 1789,65 years 10 days Mar 4 1797,2 years 285 days,Dec 14 1799,67 years 295 days,George,Washington
1,2,John Adams,Oct 30 1735[a],61 years 125 days Mar 4 1797,65 years 125 days Mar 4 1801,25 years 122 days,Jul 4 1826,90 years 247 days,John,Adams
2,3,Thomas Jefferson,Apr 13 1743[a],57 years 325 days Mar 4 1801,65 years 325 days Mar 4 1809,17 years 122 days,Jul 4 1826,83 years 82 days,Thomas,Jefferson
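
For simple splits like this one, the same result can usually be obtained without a row-wise `apply()`, using the vectorized string methods on the column. This is a sketch of that alternative on a small stand-in DataFrame, not the approach the notebook itself takes.

```python
import pandas as pd

toy = pd.DataFrame({'President': ['George Washington',
                                  'John Adams',
                                  'Thomas Jefferson']})

# Split each name on spaces once, then pick the first and last pieces
pieces = toy['President'].str.split(' ')
toy['First'] = pieces.str[0]
toy['Last'] = pieces.str[-1]

print(toy)
```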


.extract( ) function

In [301]:
del df['First']
del df['Last']

In [302]:
# If you were going to write a regular expression that returned groups and just had the
# first name and last name in it, what would that look like?
# (A raw string keeps Python from treating \w as a string escape.)
pattern = r"(^[\w]*)(?:.* )([\w]*$)"

# The extract function is built into the str attribute of the Series object, so we can call it
# using Series.str.extract(pattern)
df["President"].str.extract(pattern).head()


Unnamed: 0,0,1
0,George,Washington
1,John,Adams
2,Thomas,Jefferson


In [303]:
# That looks pretty nice, other than the column names, so let's use named groups
pattern = r"(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)"

names = df["President"].str.extract(pattern)
names

Unnamed: 0,First,Last
0,George,Washington
1,John,Adams
2,Thomas,Jefferson


In [304]:
# And we can just copy these into our main dataframe if we want to
df["First"] = names["First"]
df["Last"] = names["Last"]

df.head()


Unnamed: 0,#,President,Born,Age at start of presidency,Age at end of presidency,Post-presidency timespan,Died,Age,First,Last
0,1,George Washington,Feb 22 1732[a],57 years 67 days Apr 30 1789,65 years 10 days Mar 4 1797,2 years 285 days,Dec 14 1799,67 years 295 days,George,Washington
1,2,John Adams,Oct 30 1735[a],61 years 125 days Mar 4 1797,65 years 125 days Mar 4 1801,25 years 122 days,Jul 4 1826,90 years 247 days,John,Adams
2,3,Thomas Jefferson,Apr 13 1743[a],57 years 325 days Mar 4 1801,65 years 325 days Mar 4 1809,17 years 122 days,Jul 4 1826,83 years 82 days,Thomas,Jefferson
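
To see how the named-group pattern behaves on names that are not exactly two words, here is a small sketch on a hypothetical Series. A name with a middle part still yields the first and last words, while a single-word name does not match at all (the pattern requires a space) and produces missing values.

```python
import pandas as pd

names = pd.Series(['George Washington', 'John Quincy Adams', 'Madonna'])
pattern = r"(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)"

# One column per named group; rows that do not match become NaN
parts = names.str.extract(pattern)
print(parts)
```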


More on the pandas .str accessor and its string methods:
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

#### Let's clean up the Born column

In [305]:
# First, let's get rid of anything that isn't in the
# pattern of Month Day and Year

# Match exactly 3 word characters (the month), followed by a space,
# then 1 or 2 word characters (a single- or double-digit day),
# then a space, and finally exactly 4 word characters (the year)
print(df["Born"])
print()
print("'Has now changed to:'")

df["Born"] = df["Born"].str.extract(r"([\w]{3} [\w]{1,2} [\w]{4})")
df["Born"]


0    Feb 22 1732[a]
1    Oct 30 1735[a]
2    Apr 13 1743[a]
Name: Born, dtype: object

'Has now changed to:'


0    Feb 22 1732
1    Oct 30 1735
2    Apr 13 1743
Name: Born, dtype: object

#### Converting the Born column's dtype from object to datetime

In [306]:
df["Born"] = pd.to_datetime(df["Born"])
df["Born"]   # datetime values make later date-based queries easy (e.g. who was born before 1740?)


0   1732-02-22
1   1735-10-30
2   1743-04-13
Name: Born, dtype: datetime64[ns]
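
The "who was born before 1740?" question can be answered with an ordinary boolean mask once the column is a datetime. A minimal sketch, rebuilding the three rows shown above in a stand-in DataFrame and passing an explicit `format` (an assumption made here so the parsing does not depend on format inference):

```python
import pandas as pd

toy = pd.DataFrame({
    'President': ['George Washington', 'John Adams', 'Thomas Jefferson'],
    'Born': pd.to_datetime(['Feb 22 1732', 'Oct 30 1735', 'Apr 13 1743'],
                           format='%b %d %Y'),
})

# Compare the datetime column against a Timestamp to build a boolean mask
before_1740 = toy[toy['Born'] < pd.Timestamp('1740-01-01')]

# The .dt accessor exposes date parts such as the year
toy['BirthYear'] = toy['Born'].dt.year

print(before_1740['President'].tolist())
print(toy['BirthYear'].tolist())
```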