# UBC
## Programming in Python for DS
### Week 8

Instructor: Socorro Dominguez-Vidana

In [None]:
# Import libraries needed for this lab
import pandas as pd
import numpy as np

Overview

- [ ] Use **NumPy** to create **ndarrays** with `np.array()` and with functions such as `np.arange()`, `np.linspace()` and `np.ones()`.
- [ ] Describe the shape, dimension and size of an array.
- [ ] Identify null values in a dataframe and manage them by removing them with `.dropna()` or replacing them with `.fillna()`.
- [ ] Convert non-standard date/time formats into standard Pandas datetime using `pd.to_datetime()`.
- [ ] Find and replace text in a dataframe using methods such as `.replace()` and `.str.contains()`.

### NumPy

NumPy = Numerical Python (extension)

A NumPy array is a container that holds a grid of values (usually numbers), all of the same type.
It makes element-wise math fast and concise.

In [None]:
data = {'col1': [1,2,3],
        'col1a': [2,3,5],
        'col2': ['a','b','c']}
df = pd.DataFrame(data)
df

Creating:
- With `np.arange()`, you can create sequence arrays:

In [None]:
np.arange(0, 10, 1)

- With `np.linspace()`, you can create an array of evenly spaced values over a specified range:

In [None]:
np.linspace(0, 1, 15)

In [None]:
np.linspace((3, 1, 4), 7, 3)

In [None]:
pd.DataFrame(np.linspace((3, 1, 4), 7, 3), columns = ['a', 'b', 'c'])
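
The difference between `np.arange()` and `np.linspace()` is easiest to see side by side. The following is a minimal sketch with made-up ranges: `np.arange()` takes a step and leaves out the stop value, while `np.linspace()` takes a number of points and includes the stop value unless `endpoint=False` is passed.

```python
import numpy as np

# np.arange takes a step size and excludes the stop value.
by_step = np.arange(0, 1, 0.25)

# np.linspace takes a number of points and includes the stop value by default.
by_count = np.linspace(0, 1, 5)

# endpoint=False makes linspace exclude the stop value as well.
no_endpoint = np.linspace(0, 1, 4, endpoint=False)

print(by_step)      # 4 values, stops before 1
print(by_count)     # 5 values, ends exactly at 1
print(no_endpoint)  # 4 values, same as by_step
```

As a rule of thumb, reach for `np.arange()` when the step matters and for `np.linspace()` when the number of points matters.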

- With `np.ones()`, you can create an array filled with ones:

In [None]:
np.ones(2)

In [None]:
np.ones((2,3))

Knowing the **shape**, **dimension**, and **size** of an array helps us understand the structure and characteristics of the data stored in the array.

We can then decide which operations we can/should do.
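
As a quick illustration of these three properties, the sketch below inspects the `.shape`, `.ndim` and `.size` attributes of two small arrays. The variable names are only for this example.

```python
import numpy as np

arr = np.ones((2, 3))

print(arr.shape)  # (2, 3): 2 rows and 3 columns
print(arr.ndim)   # 2: the number of axes (dimensions)
print(arr.size)   # 6: the total number of elements

flat = np.arange(6)
print(flat.shape, flat.ndim, flat.size)  # (6,) 1 6

# reshape changes the shape but never the size
reshaped = flat.reshape(2, 3)
print(reshaped.shape)  # (2, 3)
```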

In [None]:
my_list = [1,1,1]
my_list

In [None]:
my_list+3  # raises TypeError: lists do not support element-wise addition

In [None]:
[i + 3 for i in my_list]

In [None]:
array_ones = np.ones((2, 3))
array_ones

In [None]:
array_ones+3

In [None]:
array_ones[1] = array_ones[1]+3
array_ones

More about NumPy arrays:  
https://realpython.com/numpy-scipy-pandas-correlation-python/  
https://realpython.com/numpy-array-programming/  

### Filling NAs

In [None]:
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8],
        'C': [9, 10, 11, 12]}

df = pd.DataFrame(data)
df

In [None]:
df_filled = df.fillna(0)
df_filled

Review the use of `.sum(axis = 1)`

In [None]:
df['A']+df['B'] + df['C']

It might be tempting to call `fillna(0)` before adding the columns with `+`, but `.sum()` skips missing values by default, so you can do this instead:

In [None]:
df.sum(axis = 1)
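
To make the difference explicit, here is a small sketch using the same values as the data frame above. The `+` operator propagates `NaN`, whereas `.sum(axis=1)` uses `skipna=True` by default. Passing `skipna=False` reproduces the behaviour of `+`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [9, 10, 11, 12]})

# Adding columns with + gives NaN for any row that has a missing value.
added = df['A'] + df['B'] + df['C']

# .sum(axis=1) skips missing values by default (skipna=True).
summed = df.sum(axis=1)

# skipna=False keeps the NaN, matching the + result.
strict = df.sum(axis=1, skipna=False)

print(added)
print(summed)
print(strict)
```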

### Dropping Rows

In [None]:
df

In [None]:
df.dropna()

In [None]:
df.dropna(subset=['B'])
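
`.dropna()` has a few parameters that control which rows are removed. The sketch below uses a made-up data frame (not the one above) to compare the default behaviour with `how='all'`, `thresh` and `subset`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'A': [1, np.nan, np.nan, 4],
                    'B': [5, np.nan, np.nan, 8],
                    'C': [9, np.nan, 11, 12]})

# Default (how='any'): drop rows with at least one missing value.
any_missing = toy.dropna()

# how='all': drop only rows where every value is missing.
all_missing = toy.dropna(how='all')

# thresh=2: keep rows with at least 2 non-missing values.
at_least_two = toy.dropna(thresh=2)

# subset: only look at the listed columns when deciding.
only_c = toy.dropna(subset=['C'])

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
print(only_c.index.tolist())
```

None of these calls modify `toy`; each returns a new data frame.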

### DateTime Wrangling 

In [None]:
df = pd.read_csv('data/chopped.csv')
df.info()

In [None]:
chopped = pd.read_csv('data/chopped.csv', parse_dates=["air_date"])
chopped.head()

In [None]:
chopped.info()
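
When dates are stored in a non-standard format, `parse_dates` may not be enough and `pd.to_datetime()` with an explicit `format` can help. The strings below are invented for illustration and are not taken from `chopped.csv`.

```python
import pandas as pd

# Hypothetical day/month/year strings, plus one value that is not a date.
raw = pd.Series(['25/12/2021', '01/02/2022', 'not a date'])

# format tells pandas how to read each string;
# errors='coerce' turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')

print(parsed)
print(parsed.dt.month)
```

Without `format`, a string such as `01/02/2022` is ambiguous (1 February or 2 January), so stating the format avoids silent misreads.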

How can we look at the difference of dates?

In [None]:
chopped["air_date"].max()

In [None]:
my_diff = chopped["air_date"].max()-chopped["air_date"].min()
my_diff

In [None]:
my_diff.days

In [None]:
(chopped["air_date"].max() - chopped["air_date"].min()).days

Review the [documentation](https://pandas.pydata.org/pandas-docs/version/2.1/user_guide/timedeltas.html#attributes) to see the other attributes of a timedelta.

How can I check the differences between dates in the whole data frame?

In [None]:
chopped["air_date"]

In [None]:
chopped['air_date'].dt.month_name()

In [None]:
chopped["air_date"].diff()
# Notice the first NaT
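
The result of `.diff()` is a Series of timedeltas, so the `.dt` accessor can convert it to plain numbers. A small sketch with made-up dates (not the `chopped` data):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2021-01-01', '2021-01-08', '2021-01-22']))

# Each value is the gap to the previous row; the first row has no previous row, so it is NaT.
gaps = dates.diff()

# Convert the timedeltas to a number of days, or to hours via total_seconds().
gap_days = gaps.dt.days
gap_hours = gaps.dt.total_seconds() / 3600

print(gaps)
print(gap_days)
print(gap_hours)
```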

### Replacing Strings in Data Frame

In [None]:
data = {'A': ["Africa", "  Asia", "America", "Asia"],
        'B': [5, np.nan, 7, 8],
        'C': [9, 10, 11, 12]}

df = pd.DataFrame(data)
df

In [None]:
df['A']

In [None]:
df['A'].value_counts()

In [None]:
df['A'].str.replace('  Asia', 'Asia').value_counts()

In [None]:
df['A'].str.strip().value_counts()

In [None]:
df[df['A'].str.contains('Am')]

In [None]:
# Careful: this assigns 'B' to every column of the matching rows, not only to column 'A'
df[df['A'].str.contains('Am')] = 'B'

In [None]:
df
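
If the goal is to change only one column in the matching rows, `.loc` with a boolean mask and a column label is a common pattern. The sketch below rebuilds a similar data frame; the replacement label `'Americas'` is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'A': ["Africa", "  Asia", "America", "Asia"],
                    'B': [5, np.nan, 7, 8]})

# Remove leading/trailing whitespace so "  Asia" and "Asia" count as one value.
toy['A'] = toy['A'].str.strip()

# Boolean mask: True for rows whose 'A' value contains "Am".
mask = toy['A'].str.contains('Am')

# Assign to column 'A' only; the other columns stay untouched.
toy.loc[mask, 'A'] = 'Americas'

print(toy)
print(toy['A'].value_counts())
```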

#### Assignment Question 3: Cleaning the dirty data frame

In [None]:
#clean.copy()

In [None]:
dirty = pd.read_csv('data/dirty_gapminder.csv')
dirty.head(2)

In [None]:
clean = pd.read_csv('data/clean_gapminder.csv')
clean.head(2)

Motivation: make `dirty` look the same as `clean`.

In [None]:
dirty[(~clean.eq(dirty)).any(axis=1)].shape

Breaking the code into parts

In [None]:
clean.eq(dirty)

In [None]:
dirty[(~clean.eq(dirty)).any(axis=1)]

In [None]:
# This is the goal
clean[(~clean.eq(dirty)).any(axis=1)]
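
One detail worth keeping in mind with this comparison: `.eq()` treats `NaN` as unequal to `NaN`, so a row with a missing value in the same place in both frames is still flagged. The toy frames below are invented to show this. They are not the gapminder data.

```python
import numpy as np
import pandas as pd

toy_clean = pd.DataFrame({'country': ['Canada', 'Chile'], 'pop': [38.0, np.nan]})
toy_dirty = pd.DataFrame({'country': ['canada', 'Chile'], 'pop': [38.0, np.nan]})

# Element-wise comparison: True where the two frames agree.
matches = toy_clean.eq(toy_dirty)

# A row is flagged if any of its cells differ.
mismatched_rows = (~matches).any(axis=1)

print(matches)
print(mismatched_rows)

# .equals() compares whole frames and treats NaN in the same position as equal.
print(toy_clean.equals(toy_clean.copy()))
```

Here the second row is flagged only because of the `NaN` in `pop`, even though both frames hold the same content in that row.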

```python
def data_cleaner(dirty):
    cleaned_data = dirty.copy()  # work on a copy so the input stays unchanged
    # step 1
    # step 2
    # step 3
    return cleaned_data
```

```python
new_data = data_cleaner(dirty)
```
```python
clean[(~clean.eq(new_data)).any(axis=1)]
```
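
To show the shape such a function might take, here is a sketch on a tiny made-up data frame. The column names, the cleaning steps and the `'Unknown'` fill value are all hypothetical and are not the steps needed for the assignment data.

```python
import pandas as pd

def data_cleaner(dirty):
    """Return a cleaned copy of `dirty` (illustrative steps only)."""
    cleaned = dirty.copy()
    # Step 1: normalise column names (strip whitespace, lower-case).
    cleaned.columns = cleaned.columns.str.strip().str.lower()
    # Step 2: strip whitespace from a text column.
    cleaned['continent'] = cleaned['continent'].str.strip()
    # Step 3: fill missing values in that column with a placeholder.
    cleaned['continent'] = cleaned['continent'].fillna('Unknown')
    return cleaned

# Hypothetical messy input
toy_dirty = pd.DataFrame({' Country': ['Canada', 'Chile'],
                          'Continent ': [' Americas', None]})
toy_cleaned = data_cleaner(toy_dirty)
print(toy_cleaned)
```

Working on a copy keeps the original data frame available, so the before/after comparison can be rerun after every new step.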

In [None]:
##On Wednesday

In [None]:
dirty[dirty.isnull().any(axis=1)]

In [None]:
def cleaned_gapminder(dirty_df):
    
    # Let's fill the na's with `.fillna()`
    dirty_df = dirty_df.fillna(value="Americas")
    dirty_df.columns = ['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap']

    return dirty_df

In [None]:
cleaned_data = cleaned_gapminder(dirty)

In [None]:
clean[(~cleaned_data.eq(clean)).any(axis=1)].head(2)

In [None]:
clean.columns

In [None]:
cleaned_data[(~cleaned_data.eq(clean)).any(axis=1)].head(2)