# UBC
## Programming in Python for DS
### Week 8

Instructor: Socorro Dominguez-Vidana

In [1]:
# Import libraries needed for this lab
import pandas as pd
import numpy as np

Overview

- [ ] Use NumPy to create ndarrays with np.array() and from functions such as np.arange(), np.linspace() and np.ones().
- [ ] Describe the shape, dimension and size of an array.
- [ ] Identify null values in a dataframe and manage them by removing them using .dropna() or replacing them using .fillna().
- [ ] Convert non-standard date/time formats into standard Pandas datetime using pd.to_datetime().
- [ ] Find and replace text in a dataframe using methods such as .str.replace() and .str.contains().

### NumPy

NumPy = Numerical Python (an extension library for Python)

A NumPy array is a container that holds a grid of values (usually numbers), all of the same type.
It makes element-wise math concise: an operation is applied to every element without writing a loop.

In [3]:
data = {'col1': [1,2,3],
        'col1a': [2,3,5],
        'col2': ['a','b','c']}
df = pd.DataFrame(data)
df

Unnamed: 0,col1,col1a,col2
0,1,2,a
1,2,3,b
2,3,5,c


Creating:
- With `np.arange()`, you can create sequence arrays:

In [5]:
np.arange(0, 10, 1)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

- With `np.linspace()`, you can create an array of evenly spaced values within a specified range (both endpoints are included by default):

In [7]:
np.linspace(0, 1, 15)

array([0.        , 0.07142857, 0.14285714, 0.21428571, 0.28571429,
       0.35714286, 0.42857143, 0.5       , 0.57142857, 0.64285714,
       0.71428571, 0.78571429, 0.85714286, 0.92857143, 1.        ])

- With `np.ones()`, you can create an array filled with ones:

In [8]:
np.ones((2, 3))

array([[1., 1., 1.],
       [1., 1., 1.]])
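
The overview also lists `np.array()`, which builds an array directly from a Python list. The following is a minimal sketch; the names `a` and `b` are illustrative only.

```python
import numpy as np

# Build a 1-D array from a plain Python list
a = np.array([1, 2, 3])

# Build a 2-D array from a list of equal-length lists
b = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a)
print(b)
```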

Knowing the **shape**, **dimension**, and **size** of an array tells us how the data stored in the array is laid out.

That layout determines which operations are valid, and which ones make sense, for the array.
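
As a small sketch of how these three properties can be inspected, the attributes `.shape`, `.ndim` and `.size` are available on every NumPy array. The name `arr` is illustrative.

```python
import numpy as np

arr = np.ones((2, 3))

print(arr.shape)  # (2, 3): 2 rows and 3 columns
print(arr.ndim)   # 2: the number of dimensions (axes)
print(arr.size)   # 6: the total number of elements (2 * 3)
```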

In [9]:
my_list = [1,1,1]
my_list

[1, 1, 1]

In [12]:
my_list+3

TypeError: can only concatenate list (not "int") to list

In [10]:
[i + 3 for i in my_list]

[4, 4, 4]

In [11]:
array_ones = np.ones((2, 3))
array_ones

array([[1., 1., 1.],
       [1., 1., 1.]])

In [13]:
array_ones+3

array([[4., 4., 4.],
       [4., 4., 4.]])

In [16]:
array_ones[1] = array_ones[1]+3
array_ones

array([[1., 1., 1.],
       [7., 7., 7.]])
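
The cells above add a single number to an array. As a sketch of why shape matters for arithmetic, NumPy can also combine two arrays of compatible shapes (broadcasting). The names `grid`, `row` and `result` are illustrative.

```python
import numpy as np

grid = np.ones((2, 3))         # shape (2, 3)
row = np.array([10, 20, 30])   # shape (3,)

# The 1-D array is added to each row of the 2-D array
result = grid + row
print(result)
```

If the shapes are not compatible, for example adding an array of length 2 to `grid`, NumPy raises a `ValueError` instead.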

More about np.arrays:  
https://realpython.com/numpy-scipy-pandas-correlation-python/  
https://realpython.com/numpy-array-programming/  

### Filling NAs

In [17]:
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8],
        'C': [9, 10, 11, 12]}

df = pd.DataFrame(data)
df

Unnamed: 0,A,B,C
0,1.0,5.0,9
1,2.0,,10
2,,7.0,11
3,4.0,8.0,12


In [20]:
df_filled = df.fillna(0)
df_filled

Unnamed: 0,A,B,C
0,1.0,5.0,9
1,2.0,0.0,10
2,0.0,7.0,11
3,4.0,8.0,12


.sum(axis = 1)
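
Before filling, it can help to see where the nulls are. A minimal sketch, assuming the same toy data frame as above: `.isnull()` marks missing cells and `.sum()` counts them per column or, with `axis=1`, per row. Filling column `A` with its mean is only one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [9, 10, 11, 12]})

# Count missing values per column and per row
nulls_per_column = df.isnull().sum()
nulls_per_row = df.isnull().sum(axis=1)

# Fill each column with its own replacement value
filled = df.fillna({'A': df['A'].mean(), 'B': 0})

print(nulls_per_column)
print(nulls_per_row)
print(filled)
```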

### Dropping Rows

In [21]:
df

Unnamed: 0,A,B,C
0,1.0,5.0,9
1,2.0,,10
2,,7.0,11
3,4.0,8.0,12


In [22]:
df_dropped = df.dropna()
df_dropped

Unnamed: 0,A,B,C
0,1.0,5.0,9
3,4.0,8.0,12


In [None]:
# Drop only the rows where column 'B' is missing
df.dropna(subset=['B'])
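
`.dropna()` has a few parameters that control what gets removed. The sketch below rebuilds the toy data frame from above and tries `subset`, `how`, `thresh` and `axis`; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [9, 10, 11, 12]})

# Drop rows only when column 'B' is missing
only_b = df.dropna(subset=['B'])

# Drop rows only when every value in the row is missing
all_missing = df.dropna(how='all')

# Keep rows that have at least 3 non-null values
at_least_3 = df.dropna(thresh=3)

# Drop columns (instead of rows) that contain any missing value
complete_columns = df.dropna(axis=1)

print(only_b.index.tolist())
print(all_missing.index.tolist())
print(at_least_3.index.tolist())
print(complete_columns.columns.tolist())
```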

### DateTime Wrangling 

In [25]:
df = pd.read_csv('data/chopped.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   season            569 non-null    int64 
 1   season_episode    569 non-null    int64 
 2   series_episode    569 non-null    int64 
 3   episode_name      569 non-null    object
 4   episode_notes     456 non-null    object
 5   air_date          569 non-null    object
 6   judge1            568 non-null    object
 7   judge2            568 non-null    object
 8   judge3            568 non-null    object
 9   appetizer         569 non-null    object
 10  entree            569 non-null    object
 11  dessert           568 non-null    object
 12  contestant1       568 non-null    object
 13  contestant1_info  556 non-null    object
 14  contestant2       569 non-null    object
 15  contestant2_info  555 non-null    object
 16  contestant3       569 non-null    object
 17  contestant3_info

In [26]:
chopped = pd.read_csv('data/chopped.csv', parse_dates=["air_date"])
chopped.head()

Unnamed: 0,season,season_episode,series_episode,episode_name,episode_notes,air_date,judge1,judge2,judge3,appetizer,entree,dessert,contestant1,contestant1_info,contestant2,contestant2_info,contestant3,contestant3_info,contestant4,contestant4_info
0,1,1,1,"""Octopus, Duck, Animal Crackers""",This is the first episode with only three offi...,2009-01-13,Marc Murphy,Alex Guarnaschelli,Aarón Sánchez,"baby octopus, bok choy, oyster sauce, smoked ...","duck breast, green onions, ginger, honey","prunes, animal crackers, cream cheese",Summer Kriegshauser,Private Chef and Nutrition Coach New York NY,Perry Pollaci,Private Chef and Sous chef Bar Blanc New Yo...,Katie Rosenhouse,Pastry Chef Olana Restaurant New York NY,Sandy Davis,Catering Chef Showstoppers Catering at Union...
1,1,2,2,"""Tofu, Blueberries, Oysters""",This is the first of a few episodes with five ...,2009-01-20,Aarón Sánchez,Alex Guarnaschelli,Marc Murphy,"firm tofu, tomato paste, prosciutto","daikon, pork loin, Napa cabbage, Thai chiles,...","phyllo dough, gorgonzola cheese, pineapple ri...",Raymond Jackson,Private Caterer and Culinary Instructor West...,Klaus Kronsteiner,Chef de cuisine Liberty National Golf Course...,Christopher Jackson,Executive Chef and Owner Ted and Honey Broo...,Pippa Calland,Owner and Chef Chef for Hire LLC Newville PA
2,1,3,3,"""Avocado, Tahini, Bran Flakes""",,2009-01-27,Aarón Sánchez,Alex Guarnaschelli,Marc Murphy,"lump crab meat, dried shiitake mushrooms, pin...","ground beef, cannellini beans, tahini paste, ...","brioche, cantaloupe, pecans, avocados",Margaritte Malfy,Executive Chef and Co-owner La Palapa New Y...,Rachelle Rodwell,Chef de cuisine SoHo Grand Hotel New York NY,Chris Burke,Private Chef New York NY,Andre Marrero,Chef tournant L’Atelier de Joël Robuchon Ne...
3,1,4,4,"""Banana, Collard Greens, Grits""","In the appetizer round, Chef Chuboda refused t...",2009-02-03,Scott Conant,Amanda Freitag,Geoffrey Zakarian,"ground beef, wonton wrappers, cream of mushro...","scallops, collard greens, anchovies, sour cream","maple syrup, black plums, almond butter, waln...",Sean Chudoba,Executive Chef Ayza Wine Bar New York NY,Kyle Shadix,Chef Registered Dietician and Culinary Consu...,Luis Gonzales,Executive Chef Knickerbocker Bar & Grill Ne...,Einat Admony,Chef and Owner Taïm New York NY
4,1,5,5,"""Yucca, Watermelon, Tortillas""",,2009-02-10,Geoffrey Zakarian,Alex Guarnaschelli,Marc Murphy,"watermelon, canned sardines, pepper jack chee...","beef shoulder, yucca, raisins, ancho chiles, ...","flour tortillas, prosecco, Canadian bacon, ro...",John Keller,Personal Chef New York NY,Andrea Bergquist,Executive Chef New York NY,Ed Witt,Executive Chef / Wine Director Bloomingdale ...,Josh Emett,Chef de cuisine Gordon Ramsay at The London ...


In [27]:
chopped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   season            569 non-null    int64         
 1   season_episode    569 non-null    int64         
 2   series_episode    569 non-null    int64         
 3   episode_name      569 non-null    object        
 4   episode_notes     456 non-null    object        
 5   air_date          569 non-null    datetime64[ns]
 6   judge1            568 non-null    object        
 7   judge2            568 non-null    object        
 8   judge3            568 non-null    object        
 9   appetizer         569 non-null    object        
 10  entree            569 non-null    object        
 11  dessert           568 non-null    object        
 12  contestant1       568 non-null    object        
 13  contestant1_info  556 non-null    object        
 14  contestant2       569 non-

How can we compute the difference between two dates?

In [29]:
chopped["air_date"].max()

Timestamp('2020-08-04 00:00:00')

In [30]:
my_diff = chopped["air_date"].max()-chopped["air_date"].min()
my_diff

Timedelta('4221 days 00:00:00')

In [31]:
my_diff.days

4221

In [32]:
(chopped["air_date"].max() - chopped["air_date"].min()).days

4221

How can we compute the differences between consecutive dates across the whole column?

In [33]:
chopped["air_date"]

0     2009-01-13
1     2009-01-20
2     2009-01-27
3     2009-02-03
4     2009-02-10
         ...    
564   2020-06-23
565   2020-06-30
566   2020-07-07
567   2020-07-21
568   2020-07-28
Name: air_date, Length: 569, dtype: datetime64[ns]

In [34]:
chopped["air_date"].diff()
# Notice the first NaT: the first row has no previous date to subtract

0         NaT
1      7 days
2      7 days
3      7 days
4      7 days
        ...  
564    7 days
565    7 days
566    7 days
567   14 days
568    7 days
Name: air_date, Length: 569, dtype: timedelta64[ns]
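
To work with the gaps as plain numbers, the `.dt.days` accessor can be applied to the result of `.diff()`. A minimal sketch on a small hand-made series (the name `dates` and its three values are for illustration only):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2009-01-13', '2009-01-20', '2009-02-03']))

# Difference between each date and the previous one (first value is NaT)
gaps = dates.diff()

# Convert the Timedelta values to a number of days (NaT becomes NaN)
gap_days = gaps.dt.days

print(gaps)
print(gap_days)
```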

In [35]:
chopped['air_date'].dt.month_name()

0       January
1       January
2       January
3      February
4      February
         ...   
564        June
565        June
566        July
567        July
568        July
Name: air_date, Length: 569, dtype: object
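
When dates arrive as text in a non-standard format, `pd.to_datetime()` can convert them after loading. The sketch below uses made-up strings in day/month/year order; the `format` string and the `errors='coerce'` choice are assumptions for this toy example, not something taken from the Chopped data.

```python
import pandas as pd

raw = pd.Series(['13/01/2009', '20/01/2009', 'not a date'])

# Tell pandas the exact layout; unparseable entries become NaT
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')

print(parsed)
print(parsed.dt.month_name())
```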

### Replacing Strings in Data Frame

In [54]:
data = {'A': ["Africa", "  Asia", "America", "Asia"],
        'B': [5, np.nan, 7, 8],
        'C': [9, 10, 11, 12]}

df = pd.DataFrame(data)
df

Unnamed: 0,A,B,C
0,Africa,5.0,9
1,Asia,,10
2,America,7.0,11
3,Asia,8.0,12


In [55]:
df['A'].value_counts()

A
Africa     1
  Asia     1
America    1
Asia       1
Name: count, dtype: int64

In [50]:
df['A']

0       Africa
1      A s i a
2      America
3      A s i a
Name: A, dtype: object

In [51]:
df['A'] = df['A'].str.replace('A s i a', 'Asia')

In [52]:
assert df.shape == (4,3), "Shape wrong!"

In [46]:
df.shape

(4,)

In [56]:
df['A'] = df['A'].str.strip()

In [57]:
df['A'].value_counts()

A
Asia       2
Africa     1
America    1
Name: count, dtype: int64

In [58]:
df[df['A'].str.contains('Am')]

Unnamed: 0,A,B,C
2,America,7.0,11
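
The string methods can be chained. The following sketch, on a made-up series, strips whitespace, normalises capitalisation, searches without regard to case, and replaces a literal substring. The values and names are illustrative only.

```python
import pandas as pd

s = pd.Series(['Africa', '  Asia', 'america', 'Asia '])

# Remove surrounding whitespace, then capitalise the first letter
cleaned = s.str.strip().str.title()

# Case-insensitive search for the substring 'am'
mask = cleaned.str.contains('am', case=False)

# Replace a literal string (regex=False disables pattern matching)
replaced = cleaned.str.replace('America', 'Americas', regex=False)

print(cleaned.tolist())
print(mask.tolist())
print(replaced.tolist())
```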


#### Assignment Question 3: Cleaning the dirty data frame

In [None]:
#clean.copy()

In [59]:
dirty = pd.read_csv('data/dirty_gapminder.csv')
dirty.head(2)

Unnamed: 0,year,pop,lifeExp,gdpPercap,continent,country
0,1952,8425333.0,28.801,779.445314,Asia,Afghanistan
1,1957,9240934.0,30.332,820.85303,Asia,Afghanistan


In [60]:
clean = pd.read_csv('data/clean_gapminder.csv')
clean.head(2)

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.85303


Goal: transform `dirty` so that it matches `clean`.

In [62]:
dirty[(~clean.eq(dirty)).any(axis=1)]#.shape

Unnamed: 0,year,pop,lifeExp,gdpPercap,continent,country
240,1952,14785580.0,68.75,11367.16112,,Canada
241,1957,17010150.0,69.96,12489.95006,,Canada
242,1962,18985850.0,71.3,13462.48555,,Canada
243,1967,20819770.0,72.13,16076.58803,,Canada
244,1972,22284500.0,72.88,18970.57086,,Canada
245,1977,23796400.0,74.21,22090.88306,,Canada
246,1982,25201900.0,75.76,22898.79214,,Canada
247,1987,26549700.0,76.86,26626.51503,,Canada
248,1992,28523500.0,77.95,26342.88426,,Canada
249,1997,30305840.0,78.61,28954.92589,,Canada


In [63]:
# This is the goal
clean[(~clean.eq(dirty)).any(axis=1)]

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
240,Canada,1952,14785580.0,Americas,68.75,11367.16112
241,Canada,1957,17010150.0,Americas,69.96,12489.95006
242,Canada,1962,18985850.0,Americas,71.3,13462.48555
243,Canada,1967,20819770.0,Americas,72.13,16076.58803
244,Canada,1972,22284500.0,Americas,72.88,18970.57086
245,Canada,1977,23796400.0,Americas,74.21,22090.88306
246,Canada,1982,25201900.0,Americas,75.76,22898.79214
247,Canada,1987,26549700.0,Americas,76.86,26626.51503
248,Canada,1992,28523500.0,Americas,77.95,26342.88426
249,Canada,1997,30305840.0,Americas,78.61,28954.92589


```python
def data_cleaner(dirty):
    # Work on a copy so the original data frame is not modified
    cleaned_data = dirty.copy()
    # step 1
    # step 2
    # step 3
    return cleaned_data
```

```python
new_data = data_cleaner(dirty)
```
```python
clean[(~clean.eq(new_data)).any(axis=1)]
```
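
The template above can be tried out on a tiny example before running it on the full files. This is only a sketch: `toy_dirty` and `toy_clean` are hypothetical two-row frames built from values shown in the outputs above, and the two steps (fill the missing continent, reorder the columns) cover only the differences visible in this notebook. The real assignment may need more steps.

```python
import numpy as np
import pandas as pd

def data_cleaner(dirty):
    # Work on a copy so the caller's data frame is not modified
    cleaned = dirty.copy()
    # Step 1: fill the missing continent (assumes every gap is 'Americas')
    cleaned['continent'] = cleaned['continent'].fillna('Americas')
    # Step 2: reorder the columns to match the clean data frame
    column_order = ['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap']
    return cleaned[column_order]

# Hypothetical toy data: one complete row and one row with a missing continent
toy_dirty = pd.DataFrame({
    'year': [1952, 1952],
    'pop': [8425333.0, 14785580.0],
    'lifeExp': [28.801, 68.75],
    'gdpPercap': [779.445314, 11367.16112],
    'continent': ['Asia', np.nan],
    'country': ['Afghanistan', 'Canada'],
})
toy_clean = pd.DataFrame({
    'country': ['Afghanistan', 'Canada'],
    'year': [1952, 1952],
    'pop': [8425333.0, 14785580.0],
    'continent': ['Asia', 'Americas'],
    'lifeExp': [28.801, 68.75],
    'gdpPercap': [779.445314, 11367.16112],
})

toy_new = data_cleaner(toy_dirty)
print(toy_new.equals(toy_clean))
```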

In [None]:
##On Wednesday

In [None]:
dirty[dirty.isnull().any(axis=1)]

In [None]:
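def cleaned_gapminder(dirty_df):
    # Work on a copy so the original data frame is left untouched
    cleaned = dirty_df.copy()

    # Let's fill the na's with `.fillna()`, but only in the continent column;
    # the rows shown above with a missing continent are all Canada
    cleaned['continent'] = cleaned['continent'].fillna(value="Americas")

    # Reorder the columns to match `clean` (assigning to `.columns` would only rename them)
    cleaned = cleaned[['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap']]

    return cleaned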

In [None]:
cleaned_data = cleaned_gapminder(dirty)

In [None]:
clean[(~cleaned_data.eq(clean)).any(axis=1)].head(2)

In [None]:
clean.columns

In [None]:
cleaned_data[(~cleaned_data.eq(clean)).any(axis=1)].head(2)

In [65]:
grouped_chopped = chopped.groupby('season')

In [66]:
grouped_chopped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10a374ed0>
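
A `DataFrameGroupBy` object does not display any data by itself; it yields results once it is iterated or combined with a method. As a sketch, the per-season gaps can also be computed without an explicit loop. The frame `toy` below is hypothetical (the season 2 dates are made up), and the names `summary` and `strictly_weekly` are illustrative.

```python
import pandas as pd

# Hypothetical toy data: two seasons with a handful of air dates
toy = pd.DataFrame({
    'season': [1, 1, 1, 2, 2],
    'air_date': pd.to_datetime(['2009-01-13', '2009-01-20', '2009-01-27',
                                '2009-06-02', '2009-06-16']),
})

# Gap to the previous episode within the same season (NaT at each season start)
toy['gap'] = toy.groupby('season')['air_date'].diff()

# Smallest and largest gap per season
summary = toy.groupby('season')['gap'].agg(['min', 'max'])

# A season is strictly weekly when both extremes equal exactly 7 days
week = pd.Timedelta(days=7)
strictly_weekly = (summary['min'] == week) & (summary['max'] == week)

print(summary)
print(strictly_weekly)
```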

In [71]:
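counter = 0  # number of seasons whose episodes all aired exactly 7 days apart
one_week = pd.Timedelta(days=7)
for group, rows in grouped_chopped:
    gaps = rows['air_date'].diff()  # the first value in each season is NaT
    print(group)
    print(gaps.min())
    print(gaps.max())
    # Compare against a Timedelta: a Timedelta never equals the plain integer 7
    if gaps.min() == gaps.max() == one_week:
        counter += 1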

1
7 days 00:00:00
7 days 00:00:00
2
2 days 00:00:00
14 days 00:00:00
3
7 days 00:00:00
96 days 00:00:00
4
7 days 00:00:00
21 days 00:00:00
5
7 days 00:00:00
28 days 00:00:00
6
7 days 00:00:00
33 days 00:00:00
7
7 days 00:00:00
7 days 00:00:00
8
7 days 00:00:00
61 days 00:00:00
9
7 days 00:00:00
14 days 00:00:00
10
5 days 00:00:00
63 days 00:00:00
11
7 days 00:00:00
161 days 00:00:00
12
7 days 00:00:00
42 days 00:00:00
13
7 days 00:00:00
37 days 00:00:00
14
7 days 00:00:00
28 days 00:00:00
15
2 days 00:00:00
35 days 00:00:00
16
2 days 00:00:00
56 days 00:00:00
17
5 days 00:00:00
16 days 00:00:00
18
2 days 00:00:00
43 days 00:00:00
19
5 days 00:00:00
21 days 00:00:00
20
7 days 00:00:00
92 days 00:00:00
21
5 days 00:00:00
105 days 00:00:00
22
7 days 00:00:00
42 days 00:00:00
23
7 days 00:00:00
56 days 00:00:00
24
7 days 00:00:00
42 days 00:00:00
25
2 days 00:00:00
27 days 00:00:00
26
2 days 00:00:00
17 days 00:00:00
27
2 days 00:00:00
7 days 00:00:00
28
7 days 00:00:00
7 days 00:00:00
29
