# UBC
## Programming in Python for DS

### Week 8

Instructor: Socorro Dominguez-Vidana

In [1]:
# Import libraries needed for this lab
import pandas as pd
import numpy as np

Overview

- [] Use **NumPy** to create **ndarrays** with `np.array()` and from functions such as `np.arange()`, `np.linspace()` and `np.ones()`.
- [] Describe the shape, dimension and size of an array.
- [] Identify null values in a dataframe and manage them by removing them using `.dropna()` or replacing them using `.fillna()`.
- [] Manipulate non-standard date/time formats into standard Pandas datetime using `pd.to_datetime()`.
- [] Find and replace text in a dataframe using verbs such as `.str.replace()` and `.str.contains()`.

### NumPy

**NumPy** = Numerical Python (extension)

A NumPy array is a container that holds a grid of values, usually numbers, all of the same type.
It makes math concise and fast: an operation is applied to every element at once, without writing a loop.

In [2]:
data = {'col1': [1,2,3],
        'col1a': [2,3,5],
        'col2': ['a','b','c']}
df = pd.DataFrame(data)
df

Unnamed: 0,col1,col1a,col2
0,1,2,a
1,2,3,b
2,3,5,c


```python
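# Row-by-row loop: iterrows() yields (index label, row) pairs.
# The result is computed for each row but not stored anywhere.
for idx, row in df.iterrows():
    df.loc[idx, 'col1'] + 3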
```

In [4]:
df['col1'] + 3

0    4
1    5
2    6
Name: col1, dtype: int64
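
The loop and the vectorized expression above compute the same values. A minimal sketch comparing them, assuming the small `df` defined earlier (rebuilt here so the snippet runs on its own; the names `looped` and `vectorized` are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col1a': [2, 3, 5],
                   'col2': ['a', 'b', 'c']})

# Row-by-row: iterrows() yields (index label, row as Series) pairs
looped = [row['col1'] + 3 for idx, row in df.iterrows()]

# Vectorized: the addition is applied to the whole column at once
vectorized = df['col1'] + 3

# Both approaches give the same numbers
print(looped == vectorized.tolist())  # True
```

The vectorized form is shorter and is usually the preferred choice for column arithmetic.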

Creating:
- With `np.arange()`, you can create sequence arrays:

In [6]:
np.arange(0, 10, 2)

array([0, 2, 4, 6, 8])

- With `np.linspace()`, you can create an array that generates evenly spaced values within a specified range:

In [8]:
np.linspace(0, 10, 12)

array([ 0.        ,  0.90909091,  1.81818182,  2.72727273,  3.63636364,
        4.54545455,  5.45454545,  6.36363636,  7.27272727,  8.18181818,
        9.09090909, 10.        ])

In [9]:
np.linspace((3, 1), 12, 3)

array([[ 3. ,  1. ],
       [ 7.5,  6.5],
       [12. , 12. ]])

In [10]:
pd.DataFrame(np.linspace((3, 1), 12, 3), columns = ['a', 'b'])

Unnamed: 0,a,b
0,3.0,1.0
1,7.5,6.5
2,12.0,12.0


- With `np.ones()`, you can create an array filled with ones

In [12]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [13]:
np.ones((2,3))

array([[1., 1., 1.],
       [1., 1., 1.]])

Knowing the **shape**, **dimension**, and **size** of an array helps us understand the structure and characteristics of the data stored in the array.

We can then decide which operations we can/should do.
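
A short sketch of how these three properties can be inspected, using the `.shape`, `.ndim` and `.size` attributes of an ndarray (the variable names `arr` and `seq` are illustrative):

```python
import numpy as np

arr = np.ones((2, 3))

print(arr.shape)  # (2, 3): 2 rows, 3 columns
print(arr.ndim)   # 2: number of dimensions (axes)
print(arr.size)   # 6: total number of elements (2 * 3)

# A 1-D array has a one-element shape tuple
seq = np.arange(0, 10, 2)
print(seq.shape)  # (5,)
print(seq.ndim)   # 1
```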

In [14]:
my_list = [1,1,1]
my_list

[1, 1, 1]

In [15]:
my_list+3

TypeError: can only concatenate list (not "int") to list

In [16]:
[i + 3 for i in my_list]

[4, 4, 4]

In [17]:
array_ones = np.ones((2, 3))
array_ones

array([[1., 1., 1.],
       [1., 1., 1.]])

In [18]:
array_ones+3

array([[4., 4., 4.],
       [4., 4., 4.]])

In [19]:
array_ones[1] = array_ones[1]+3
array_ones

array([[1., 1., 1.],
       [4., 4., 4.]])

More about NumPy arrays:  
https://realpython.com/numpy-scipy-pandas-correlation-python/  
https://realpython.com/numpy-array-programming/  

### Filling NAs

In [21]:
data = {'A': [1, 2, np.nan, 4],
        'B': [5, None, 7, 8],
        'C': [9, 10, 11, 12]}

df = pd.DataFrame(data)
df

Unnamed: 0,A,B,C
0,1.0,5.0,9
1,2.0,,10
2,,7.0,11
3,4.0,8.0,12


In [22]:
df['A'] + df['C']

0    10.0
1    12.0
2     NaN
3    16.0
dtype: float64

In [26]:
df_filled = df.fillna(0)
df_filled

Unnamed: 0,A,B,C
0,1.0,5.0,9
1,2.0,0.0,10
2,0.0,7.0,11
3,4.0,8.0,12


Review the use of `.sum(axis = 1)`

In [None]:
df_filled['A'] + df_filled['B'] + df_filled['C']

It might be tempting to call `fillna(0)` before adding the columns with `+`, but `.sum(axis=1)` skips missing values by default, so you can sum the rows of the original data frame directly:

In [None]:
df.sum(axis = 1)
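
To make the difference concrete, here is a small sketch on the same data as above, comparing `+` with `.sum(axis=1)` (the names `plus_total` and `row_total` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, None, 7, 8],
                   'C': [9, 10, 11, 12]})

# `+` propagates missing values: any NaN in a row makes the result NaN
plus_total = df['A'] + df['B'] + df['C']

# .sum(axis=1) adds across each row and skips NaN by default (skipna=True)
row_total = df.sum(axis=1)

print(plus_total.tolist())  # [15.0, nan, nan, 24.0]
print(row_total.tolist())   # [15.0, 12.0, 18.0, 24.0]
```

Passing `skipna=False` to `.sum()` would reproduce the NaN-propagating behaviour of `+`.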

### Dropping Rows

In [27]:
df

Unnamed: 0,A,B,C
0,1.0,5.0,9
1,2.0,,10
2,,7.0,11
3,4.0,8.0,12


In [28]:
df.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,9
3,4.0,8.0,12


In [29]:
df.dropna(subset=['B'])

Unnamed: 0,A,B,C
0,1.0,5.0,9
2,,7.0,11
3,4.0,8.0,12


In [32]:
df[~df['A'].isnull()]

Unnamed: 0,A,B,C
0,1.0,5.0,9
1,2.0,,10
3,4.0,8.0,12


### DateTime Wrangling 

In [33]:
df = pd.read_csv('data/chopped.csv')
df.loc[1, 'air_date']

'January 20, 2009'

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   season            569 non-null    int64 
 1   season_episode    569 non-null    int64 
 2   series_episode    569 non-null    int64 
 3   episode_name      569 non-null    object
 4   episode_notes     456 non-null    object
 5   air_date          569 non-null    object
 6   judge1            568 non-null    object
 7   judge2            568 non-null    object
 8   judge3            568 non-null    object
 9   appetizer         569 non-null    object
 10  entree            569 non-null    object
 11  dessert           568 non-null    object
 12  contestant1       568 non-null    object
 13  contestant1_info  556 non-null    object
 14  contestant2       569 non-null    object
 15  contestant2_info  555 non-null    object
 16  contestant3       569 non-null    object
 17  contestant3_info

In [35]:
chopped = pd.read_csv('data/chopped.csv', parse_dates=["air_date"])
chopped.loc[1,'air_date']

Timestamp('2009-01-20 00:00:00')
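
If the data frame has already been loaded with the dates as strings, `pd.to_datetime()` can convert the column afterwards. A minimal sketch, assuming the strings follow the `'January 20, 2009'` pattern seen above (the Series `dates` is a hypothetical stand-in for the `air_date` column):

```python
import pandas as pd

# Hypothetical stand-in for the `air_date` column read as plain strings
dates = pd.Series(['January 13, 2009', 'January 20, 2009'])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(dates, format='%B %d, %Y')

print(parsed[1])  # 2009-01-20 00:00:00
```

Giving an explicit `format` is optional here, but it documents the expected layout and avoids ambiguous guesses.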

In [37]:
chopped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   season            569 non-null    int64         
 1   season_episode    569 non-null    int64         
 2   series_episode    569 non-null    int64         
 3   episode_name      569 non-null    object        
 4   episode_notes     456 non-null    object        
 5   air_date          569 non-null    datetime64[ns]
 6   judge1            568 non-null    object        
 7   judge2            568 non-null    object        
 8   judge3            568 non-null    object        
 9   appetizer         569 non-null    object        
 10  entree            569 non-null    object        
 11  dessert           568 non-null    object        
 12  contestant1       568 non-null    object        
 13  contestant1_info  556 non-null    object        
 14  contestant2       569 non-

How can we look at the difference of dates?

In [39]:
chopped["air_date"].max()

Timestamp('2020-08-04 00:00:00')

In [40]:
my_diff = chopped["air_date"].max()-chopped["air_date"].min()
my_diff

Timedelta('4221 days 00:00:00')

In [43]:
type(my_diff.days)

int

In [49]:
(chopped["air_date"].max() - chopped["air_date"].min()).days

4221

Review the [documentation](https://pandas.pydata.org/pandas-docs/version/2.1/user_guide/timedeltas.html#attributes) to see the other `Timedelta` attributes.
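
A few of those attributes and methods, sketched on two dates that are 4221 days apart like the difference shown above (the start date `2009-01-13` is an assumption chosen to reproduce that gap):

```python
import pandas as pd

td = pd.Timestamp('2020-08-04') - pd.Timestamp('2009-01-13')

print(td.days)                    # 4221
print(td.total_seconds())         # 364694400.0
print(td.components.days)         # 4221

# Dividing by another Timedelta converts to other units, here weeks
print(td / pd.Timedelta(weeks=1))  # 603.0
```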

How can I check the differences between dates in the whole data frame?

In [50]:
chopped["air_date"]

0     2009-01-13
1     2009-01-20
2     2009-01-27
3     2009-02-03
4     2009-02-10
         ...    
564   2020-06-23
565   2020-06-30
566   2020-07-07
567   2020-07-21
568   2020-07-28
Name: air_date, Length: 569, dtype: datetime64[ns]

In [55]:
chopped['air_date'].dt.month_name()

0       January
1       January
2       January
3      February
4      February
         ...   
564        June
565        June
566        July
567        July
568        July
Name: air_date, Length: 569, dtype: object

In [59]:
chopped["air_date"].diff()
# Notice the first NaT

0         NaT
1      7 days
2      7 days
3      7 days
4      7 days
        ...  
564    7 days
565    7 days
566    7 days
567   14 days
568    7 days
Name: air_date, Length: 569, dtype: timedelta64[ns]
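
The result of `.diff()` is a timedelta column, so the `.dt` accessor works on it too. A small sketch on three made-up dates (the Series `air_dates` is a hypothetical stand-in for `chopped["air_date"]`):

```python
import pandas as pd

air_dates = pd.Series(pd.to_datetime(['2020-07-07', '2020-07-21', '2020-07-28']))

gaps = air_dates.diff()      # first entry is NaT: nothing to subtract from
gap_days = gaps.dt.days      # NaT becomes NaN, so the result is float
print(gap_days.tolist())     # [nan, 14.0, 7.0]

# Drop the leading NaT before summarizing
print(gaps.dropna().max())   # 14 days 00:00:00
```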

### Replacing Strings in Data Frame

In [60]:
data = {'A': ["Africa", "  Asia", "America", "Asia"],
        'B': [5, np.nan, 7, 8],
        'C': ['$9', '$10', '$11', '$12']}

df = pd.DataFrame(data)
df

Unnamed: 0,A,B,C
0,Africa,5.0,$9
1,Asia,,$10
2,America,7.0,$11
3,Asia,8.0,$12


In [None]:
df['B'] + df['C']
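
The cell above raises a `TypeError`, because column `B` is numeric while column `C` holds strings such as `'$9'`. One possible fix, sketched under the assumption that every value in `C` is a dollar sign followed by a number (the name `c_numeric` is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [5, np.nan, 7, 8],
                   'C': ['$9', '$10', '$11', '$12']})

# Remove the literal '$' (regex=False so it is not treated as a regex anchor)
c_numeric = df['C'].str.replace('$', '', regex=False).astype(float)

total = df['B'] + c_numeric
print(total.tolist())  # [14.0, nan, 18.0, 20.0]
```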

In [61]:
df['A'].value_counts()

A
Africa     1
  Asia     1
America    1
Asia       1
Name: count, dtype: int64

In [62]:
df['A'].str.strip().value_counts()

A
Asia       2
Africa     1
America    1
Name: count, dtype: int64

In [63]:
"hello".replace("e", "a")

'hallo'

In [64]:
df['A'].str.replace('  Asia', 'Asia').value_counts()

A
Asia       2
Africa     1
America    1
Name: count, dtype: int64

In [67]:
df['A'].value_counts()

A
Africa     1
  Asia     1
America    1
Asia       1
Name: count, dtype: int64

In [68]:
df['A'].str.replace('a', 'b')

0     Africb
1       Asib
2    Americb
3       Asib
Name: A, dtype: object

In [None]:
sentence = "hello"
sentence

In [None]:
sentence.replace('e', 'a')

In [69]:
df[df['A'].str.contains('Am')]

Unnamed: 0,A,B,C
2,America,7.0,$11


In [70]:
df[df['A'].str.contains('Am')] = 'B'

In [71]:
df

Unnamed: 0,A,B,C
0,Africa,5.0,$9
1,Asia,,$10
2,B,B,B
3,Asia,8.0,$12
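
Note that assigning to `df[mask]` overwrote every column of the matching row. If the intent is to change only column `A`, one option is `.loc` with a row mask and a column label. A sketch, where the replacement value `'Americas'` is just an illustrative choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Africa', '  Asia', 'America', 'Asia'],
                   'B': [5, np.nan, 7, 8],
                   'C': ['$9', '$10', '$11', '$12']})

mask = df['A'].str.contains('Am')   # True only for 'America'

# Row mask + column label: only column A of the matching rows changes
df.loc[mask, 'A'] = 'Americas'

print(df['A'].tolist())  # ['Africa', '  Asia', 'Americas', 'Asia']
print(df.loc[2, 'B'])    # 7.0
```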


#### Assignment Question 3: Cleaning the dirty data frame

In [None]:
#clean.copy()

In [72]:
dirty = pd.read_csv('data/dirty_gapminder.csv')
dirty.head(2)

Unnamed: 0,year,pop,lifeExp,gdpPercap,continent,country
0,1952,8425333.0,28.801,779.445314,Asia,Afghanistan
1,1957,9240934.0,30.332,820.85303,Asia,Afghanistan


In [73]:
clean = pd.read_csv('data/clean_gapminder.csv')
clean.head(2)

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.85303


Motivation: make `dirty` look the same as `clean`.

In [75]:
dirty[(~clean.eq(dirty)).any(axis=1)]#.shape

Unnamed: 0,year,pop,lifeExp,gdpPercap,continent,country
240,1952,14785580.0,68.75,11367.16112,,Canada
241,1957,17010150.0,69.96,12489.95006,,Canada
242,1962,18985850.0,71.3,13462.48555,,Canada
243,1967,20819770.0,72.13,16076.58803,,Canada
244,1972,22284500.0,72.88,18970.57086,,Canada
245,1977,23796400.0,74.21,22090.88306,,Canada
246,1982,25201900.0,75.76,22898.79214,,Canada
247,1987,26549700.0,76.86,26626.51503,,Canada
248,1992,28523500.0,77.95,26342.88426,,Canada
249,1997,30305840.0,78.61,28954.92589,,Canada


Breaking the code into parts

In [80]:
clean[(~clean.eq(dirty)).any(axis=1)]

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
240,Canada,1952,14785580.0,Americas,68.75,11367.16112
241,Canada,1957,17010150.0,Americas,69.96,12489.95006
242,Canada,1962,18985850.0,Americas,71.3,13462.48555
243,Canada,1967,20819770.0,Americas,72.13,16076.58803
244,Canada,1972,22284500.0,Americas,72.88,18970.57086
245,Canada,1977,23796400.0,Americas,74.21,22090.88306
246,Canada,1982,25201900.0,Americas,75.76,22898.79214
247,Canada,1987,26549700.0,Americas,76.86,26626.51503
248,Canada,1992,28523500.0,Americas,77.95,26342.88426
249,Canada,1997,30305840.0,Americas,78.61,28954.92589
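
The expression `(~clean.eq(dirty)).any(axis=1)` can be taken apart step by step. A sketch on two tiny, made-up data frames (`clean_toy` and `dirty_toy` are hypothetical stand-ins for `clean` and `dirty`):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `clean` and `dirty` (hypothetical data)
clean_toy = pd.DataFrame({'country': ['Canada', 'China', 'Chile'],
                          'continent': ['Americas', 'Asia', 'Americas']})
dirty_toy = pd.DataFrame({'country': ['Canada', 'china', 'Chile'],
                          'continent': [np.nan, 'Asia', 'Americas']})

same = clean_toy.eq(dirty_toy)     # cell-by-cell comparison (True where equal)
different = ~same                  # flip: True where the cells differ
bad_rows = different.any(axis=1)   # True if any cell in the row differs

print(bad_rows.tolist())           # [True, True, False]
print(dirty_toy[bad_rows])         # the rows that still need cleaning
```

`.eq()` matches columns by label, so the column order does not affect the comparison, and a missing value never compares equal, even to another missing value.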


In [None]:
dirty[(~clean.eq(dirty)).any(axis=1)]

In [81]:
dirty['continent'].value_counts()

continent
Africa      624
Asia        393
Europe      360
Americas    288
Oceania      24
    Asia      3
Name: count, dtype: int64

In [None]:
# This is the goal
clean[(~clean.eq(dirty)).any(axis=1)]

```python
dirty['country'].str.replace('china', 'China')
### str.strip() <- country and continent
```

```python
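def data_cleaner(dirty_df):
    # Work on a copy so the original dirty data frame is left untouched
    clean_df = dirty_df.copy()
    clean_df['country'] = clean_df['country'].str.replace('china', 'China')
    # Remove stray whitespace (str.strip, not str.split, which returns lists)
    clean_df['continent'] = clean_df['continent'].str.strip()
    return clean_df

cleaned_data = data_cleaner(dirty)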
```
```python
cleaned_data[(~clean.eq(cleaned_data)).any(axis=1)]
```
```python
t.test_
```

```python
def data_cleaner(dirty):
    cleaned_data = dirty.copy()
    # step 1
    # step 2
    # step 3
    return cleaned_data
```

```python
new_data = data_cleaner(dirty)
```
```python
clean[(~clean.eq(new_data)).any(axis=1)]
```

In [None]:
##On Wednesday

In [None]:
# Show the rows of `dirty` that contain at least one missing value
dirty[dirty.isnull().any(axis=1)]

In [None]:
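def cleaned_gapminder(dirty_df):

    # Let's fill the na's with `.fillna()`
    dirty_df = dirty_df.fillna(value="Americas")
    # Select the columns in the same order as `clean`.
    # Assigning to `.columns` would only relabel them, not reorder them.
    dirty_df = dirty_df[['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap']]

    return dirty_df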

In [None]:
cleaned_data = cleaned_gapminder(dirty)

In [None]:
clean[(~cleaned_data.eq(clean)).any(axis=1)].head(2)

In [None]:
clean.columns

In [None]:
cleaned_data[(~cleaned_data.eq(clean)).any(axis=1)].head(2)