## Basic Data Processing in Python

### Example: Manipulating DataFrames

In [1]:
import pandas as pd

In [2]:
# Our goal is to clean the list of presidents of the US from Wikipedia
df = pd.read_csv('../resources/week-2/datasets/presidents.csv')
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days"
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days"
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days"
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days"
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days"


In [3]:
# Let's start with cleaning up the name into firstname and lastname:
# --> Create two new columns by applying regex to the projection of the "President" column
df['First'] = df['President'] # Copy of the same column

In [4]:
# For the first name
df['First'] = df['First'].replace('[ ].*', '', regex=True) # Use replace() to set the last name pattern as empty string
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James


That works, but it's clunky and slow: we copy the whole column first and then rewrite every value with a regex.

There are a few other ways we can deal with this:

* A general way, using the `apply()` function.
* Using the `extract()` function. 


In [5]:
del(df['First']) # Drop the column we created in the previous code

**`apply()` function**:

The `apply()` function takes an arbitrary function you have written and applies it either to each value of a Series (a single column) or to each row or column of a DataFrame, depending on the `axis` argument.
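As a minimal sketch of what the function receives in each case, the toy example below uses a made-up two-row DataFrame (the `toy` frame and its column names are hypothetical, not part of the presidents dataset):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate apply()
toy = pd.DataFrame({'name': ['Ada Lovelace', 'Alan Turing'],
                    'born': [1815, 1912]})

# Series.apply: the function receives one value at a time
name_lengths = toy['name'].apply(len)

# DataFrame.apply with axis='columns': the function receives one row (a Series)
labels = toy.apply(lambda row: f"{row['name']} ({row['born']})", axis='columns')

# DataFrame.apply with the default axis='index': the function receives one column
column_types = toy.apply(lambda col: type(col).__name__)

print(name_lengths.tolist())   # [12, 11]
print(labels.tolist())         # ['Ada Lovelace (1815)', 'Alan Turing (1912)']
print(column_types.tolist())   # ['Series', 'Series']
```

Note that `axis='columns'` means "move across the columns", so the function is called once per row.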

In [6]:
# Split the full name into first and last name, given a single row of data
def split_name(row):
    # The row is a Series object: a single row indexed by column names
    parts = row['President'].split(' ')
    row['First'] = parts[0]  # The first token is the first name
    row['Last'] = parts[-1]  # The last token is the last name

    return row

In [7]:
# Now we apply this function to every row (axis='columns' passes each row as a Series)
df = df.apply(split_name, axis='columns')
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,Last
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe


**`extract()` function**:

The `extract()` function takes a regular expression as input and requires capture groups; each capture group becomes one output column.

The `extract()` function is available through the `str` accessor of a `Series` object:

```
Series.str.extract(pattern)
```

In [8]:
# First, we drop the columns for first and last name from the previous code
del(df['First'])
del(df['Last'])

We need to think of a **regex pattern** to return groups with the first and last names:

```
pattern = r'(^[\w]*)(?:.* )([\w]*$)'
```
`()`: Creates a capture group

`^`: Anchors the match at the beginning of the string

`[\w]*`: Zero or more word characters (letters, digits, or underscore)

`?:`: Marks the group as non-capturing, so it is not returned

`.* `: Any number of characters followed by a space

`$`: Anchors the match at the end of the string
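To see how these pieces interact, here is a small sketch that runs the same pattern through Python's standard `re` module. The helper `first_last` and the test names are ours, chosen for illustration; names with a middle part or without any space are assumed edge cases worth checking.

```python
import re

pattern = r'(^[\w]*)(?:.* )([\w]*$)'

def first_last(name):
    """Return (first, last) for a full name, or None if the pattern does not match."""
    match = re.search(pattern, name)
    return match.groups() if match else None

print(first_last('George Washington'))  # ('George', 'Washington')
print(first_last('Martin Van Buren'))   # ('Martin', 'Buren')
print(first_last('Madonna'))            # None, the pattern needs at least one space
```

The non-capturing group swallows everything between the first word and the last space, which is why middle names are dropped rather than returned as a third column.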


In [11]:
# The regex pattern
pattern = r'(^[\w]*)(?:.* )([\w]*$)'  # Raw string, so \w is not treated as a string escape

In [14]:
# Using the extract() function
df['President'].str.extract(pattern).head() # Columns have no names because we did not name the groups

Unnamed: 0,0,1
0,George,Washington
1,John,Adams
2,Thomas,Jefferson
3,James,Madison
4,James,Monroe


A **regex pattern** with group names:

```
pattern = r'(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)'
```

`?P<group_name>`: Used to create groups with names. The name is within the `< >` signs.
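Named groups are a feature of Python's regex syntax in general, not only of pandas. As a quick sketch with the standard `re` module (the variable names here are ours), the group names become dictionary keys, much like they become column labels in `extract()`:

```python
import re

named_pattern = r'(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)'

match = re.search(named_pattern, 'James Monroe')
name_parts = match.groupdict()  # Maps each group name to the text it captured

print(name_parts)           # {'First': 'James', 'Last': 'Monroe'}
print(match.group('Last'))  # Monroe
```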

In [15]:
pattern = r'(?P<First>^[\w]*)(?:.* )(?P<Last>[\w]*$)'

In [16]:
# Extract again but with group names
names = df['President'].str.extract(pattern) # Keep every row; only the preview below is limited
names.head()

Unnamed: 0,First,Last
0,George,Washington
1,John,Adams
2,Thomas,Jefferson
3,James,Madison
4,James,Monroe


In [17]:
# Add these extracted names to the main df
df['First'] = names['First']
df['Last'] = names['Last']
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,Last
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe


**Functions for Pandas str module**:

https://pandas.pydata.org/docs/user_guide/text.html

**Handling the 'Born' column**:

Remove anything that does not match the pattern of month, day, and year (e.g. `Feb 22, 1732`).

The pattern is:

```
r'([\w]{3} [\w]{1,2}, [\w]{4})'
```

`{3}`: Exactly 3 word characters (month, e.g.: Apr, Feb)

` `: Followed by a space

`{1,2}`: 1 or 2 word characters (day, e.g.: 22, 13, 5)

`, `: Followed by a comma and a space

`{4}`: Exactly 4 word characters (year, e.g. 1794)
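The sketch below checks the pattern against two values copied from the `Born` column, using the standard `re` module. It also shows a stricter variant, which is our own suggestion rather than part of the original notebook: because `\w` accepts letters and digits alike, the original pattern also matches strings that are not dates.

```python
import re

born_pattern = r'([\w]{3} [\w]{1,2}, [\w]{4})'
# Assumed stricter alternative: letters for the month, digits for day and year
strict_pattern = r'([A-Za-z]{3} \d{1,2}, \d{4})'

raw_values = ['Feb 22, 1732[a]', 'Apr 28, 1758']
cleaned = [re.search(born_pattern, value).group(1) for value in raw_values]
print(cleaned)  # ['Feb 22, 1732', 'Apr 28, 1758']

# A string that is clearly not a date still matches the loose pattern
not_a_date = 'abc de, fghi'
loose_hit = re.search(born_pattern, not_a_date) is not None
strict_hit = re.search(strict_pattern, not_a_date) is not None
print(loose_hit, strict_hit)  # True False
```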

In [18]:
df['Born'] = df['Born'].str.extract(r'([\w]{3} [\w]{1,2}, [\w]{4})')
df['Born'].head()

0    Feb 22, 1732
1    Oct 30, 1735
2    Apr 13, 1743
3    Mar 16, 1751
4    Apr 28, 1758
Name: Born, dtype: object

Now we want to set the type of this column to *date/time*, instead of *object* (string). We can achieve this with the pandas function `pd.to_datetime()`.

In [19]:
df['Born'] = pd.to_datetime(df['Born'])
df['Born'].head()

0   1732-02-22
1   1735-10-30
2   1743-04-13
3   1751-03-16
4   1758-04-28
Name: Born, dtype: datetime64[ns]
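The call above lets pandas infer the date format. As a hedged alternative, the sketch below passes an explicit `format` string and uses `errors='coerce'` to turn unparseable values into `NaT`. The sample values are ours, and `%b` assumes English month abbreviations in the current locale.

```python
import pandas as pd

born = pd.Series(['Feb 22, 1732', 'Oct 30, 1735'])

# Explicit format: abbreviated month, day, comma, four-digit year
parsed = pd.to_datetime(born, format='%b %d, %Y')
print(parsed.dt.year.tolist())  # [1732, 1735]

# errors='coerce' replaces values that do not fit the format with NaT
messy = pd.Series(['Feb 22, 1732', 'not a date'])
coerced = pd.to_datetime(messy, format='%b %d, %Y', errors='coerce')
print(coerced.isna().tolist())  # [False, True]
```

Once the column is a datetime type, components such as year or month are available through the `.dt` accessor.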

#### Quiz questions

1. Question 1. For the following code, which of the following statements will not return True?

In [20]:
import pandas as pd
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj1 = pd.Series(sdata)
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj2 = pd.Series(sdata, index=states)
obj3 = pd.isnull(obj2)

In [22]:
obj1

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [23]:
obj2

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [24]:
obj3

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [25]:
x = obj2['California']
obj2['California'] != x

True

In [26]:
x

nan

In [27]:
obj2['California']

nan

In [28]:
obj2['California'] == None

False
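The results above follow from how `NaN` behaves: it is a floating-point value that never compares equal to anything, including itself, and it is not the same object as `None`. A short sketch of the checks that do work for missing values:

```python
import math
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj2 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])
x = obj2['California']  # NaN, because 'California' is not a key in sdata

equal_to_itself = bool(x == x)     # False: NaN is never equal to NaN
differs_from_itself = bool(x != x) # True, for the same reason
is_none = x is None                # False: NaN is a float, not None

# Reliable ways to test for a missing value
missing_pandas = bool(pd.isnull(x))
missing_math = math.isnan(x)

print(equal_to_itself, differs_from_itself, is_none)  # False True False
print(missing_pandas, missing_math)                   # True True
```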

2. Question 2. The keys of the dictionary d represent student ranks and the value for each key is a student name. Which of the following can be used to extract rows with student ranks that are lower than or equal to 3?

In [29]:
import pandas as pd
d = {'1': 'Alice','2': 'Bob','3': 'Rita','4': 'Molly','5': 'Ryan'}
S = pd.Series(d)

In [30]:
S.iloc[0:3]

1    Alice
2      Bob
3     Rita
dtype: object
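Since the index labels here are strings, not integers, it may help to compare a few selection styles side by side. This is a sketch of alternatives, not necessarily the set of options offered in the quiz:

```python
import pandas as pd

d = {'1': 'Alice', '2': 'Bob', '3': 'Rita', '4': 'Molly', '5': 'Ryan'}
S = pd.Series(d)

by_position = S.iloc[0:3]   # Positions 0, 1, 2 (the end position is exclusive)
by_label = S.loc['1':'3']   # Labels '1' through '3' (the end label is inclusive)
by_mask = S[S.index.astype(int) <= 3]  # Convert labels to int and compare numerically

print(by_position.tolist())  # ['Alice', 'Bob', 'Rita']
print(by_position.equals(by_label), by_position.equals(by_mask))  # True True
```

The positional and label slices agree only because the Series is already ordered by rank; the boolean mask does not depend on that ordering.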

7. Question 7. For the Series s1 and s2 defined below, which of the following statements will give an error?



In [32]:
import pandas as pd
s1 = pd.Series({1: 'Alice', 2: 'Jack', 3: 'Molly'})
s2 = pd.Series({'Alice': 1, 'Jack': 2, 'Molly': 3})

In [33]:
s1.loc[1]

'Alice'

In [34]:
s2.loc[1]

KeyError: 1

In [35]:
s2.iloc[1]

2

In [36]:
s2[1]

2
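`s2[1]` returns a value in the run above because plain `[]` indexing fell back to positional lookup when the index held no integer labels. As far as we know, recent pandas releases deprecate this fallback, so the result may depend on the installed version. A sketch of the explicit forms, which avoid the ambiguity:

```python
import pandas as pd

s2 = pd.Series({'Alice': 1, 'Jack': 2, 'Molly': 3})

second_by_position = s2.iloc[1]  # Explicit positional access
jack_by_label = s2.loc['Jack']   # Explicit label access

# .loc only accepts labels, and 1 is not a label in this index
try:
    s2.loc[1]
    loc_raised = False
except KeyError:
    loc_raised = True

print(second_by_position, jack_by_label, loc_raised)  # 2 2 True
```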