# UBC
## Programming in Python for DS
### Week 8

Instructor: Socorro Dominguez-Vidana

In [None]:
# Import libraries needed for this lab
import pandas as pd
import numpy as np
import inspect

Overview

- [] Use NumPy to create ndarrays with np.array() and from functions such as np.arange(), np.linspace() and np.ones().
- [] Describe the shape, dimension and size of an array.
- [] Identify null values in a dataframe and manage them by removing them using .dropna() or replacing them using .fillna().
- [] Manipulate non-standard date/time formats into standard Pandas datetime using pd.to_datetime().
- [] Find and replace text in a dataframe using methods such as .str.replace() and .str.contains().

### NumPy

NumPy stands for Numerical Python, an extension package for Python.

A NumPy array is a container that holds a grid of values (usually numbers), all of the same type.
It makes mathematical operations on whole collections of values fast and concise.
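As a minimal sketch of what this looks like in practice, the example below builds an array from a Python list with `np.array()` and applies arithmetic to every element at once. The values and variable names are made up for illustration.

```python
import numpy as np

# Build a 1-D array from a plain Python list (illustrative values)
heights_cm = np.array([150, 165, 180])

# Arithmetic applies to every element at once, no loop needed
heights_m = heights_cm / 100

# Aggregations are available as methods on the array
mean_height = heights_m.mean()
```

The same division on a plain Python list would raise a `TypeError`, which is one reason arrays are convenient for numerical work.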

Creating arrays:
- With `np.arange()`, you can create an array of values from start to stop in fixed steps (the stop value is excluded):

In [None]:
np.arange(0, 10, 2)

- With `np.linspace()`, you can create an array of evenly spaced values over a specified range (the stop value is included by default):

In [None]:
np.linspace(0, 1, 5)

- With `np.ones()`, you can create an array of a given shape filled with ones:

In [None]:
np.ones((2, 3))

Knowing the **shape**, **dimension**, and **size** of an array helps us understand the structure and characteristics of the data stored in the array.

With that information, we can decide which operations are valid and appropriate for the array.
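The three properties can be read directly from array attributes. A short sketch, reusing the `np.ones((2, 3))` array from above (the name `arr` is just for illustration):

```python
import numpy as np

arr = np.ones((2, 3))  # 2 rows, 3 columns

print(arr.shape)  # (2, 3): the length of each axis
print(arr.ndim)   # 2: the number of axes (dimensions)
print(arr.size)   # 6: the total number of elements
```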

### Filling NAs

In [None]:
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8],
        'C': [9, 10, 11, 12]}

df = pd.DataFrame(data)
df
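Before filling or dropping anything, it helps to see where the missing values are. The sketch below rebuilds the same small data frame and uses `.isna()` to count NaNs per column and to select the affected rows. The names `na_counts` and `rows_with_na` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [9, 10, 11, 12]})

# Count the missing values in each column
na_counts = df.isna().sum()

# Keep only the rows that contain at least one missing value
rows_with_na = df[df.isna().any(axis=1)]
```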

In [None]:
df_filled = df.fillna(0)
df_filled
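Filling every gap with 0 is not always a sensible choice. As one possible alternative, `.fillna()` also accepts a dictionary that maps column names to fill values. The sketch below fills column `A` with its mean and column `B` with 0; which strategy is appropriate depends on the data, so treat these choices as placeholders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [9, 10, 11, 12]})

# One fill value per column (illustrative choices)
fill_values = {'A': df['A'].mean(), 'B': 0}
df_filled_by_col = df.fillna(value=fill_values)
```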

### Dropping Rows

In [None]:
df

In [None]:
df_dropped = df.dropna()
df_dropped

In [None]:
df.dropna(subset=['B'])

### DateTime Wrangling 

In [None]:
df = pd.read_csv('data/chopped.csv')
df.info()

In [None]:
# parse_dates asks pandas to convert air_date to datetime while reading the file
chopped = pd.read_csv('data/chopped.csv', parse_dates=["air_date"])
chopped.info()

In [None]:
chopped.head()

How can we compute the difference between two dates?

In [None]:
chopped["air_date"].max()-chopped["air_date"].min()

In [None]:
(chopped["air_date"].max() - chopped["air_date"].min()).days

How can we compute the difference between consecutive dates across the whole column?

In [None]:
chopped["air_date"]

In [None]:
chopped["air_date"].diff()
# The first value is NaT because the first row has no previous date to subtract

In [None]:
chopped["air_date"]

In [None]:
chopped['air_date'].dt.month_name()
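When dates are already loaded as text in a non-standard layout, `pd.to_datetime()` can convert them after the fact. The sketch below uses a made-up Series of day/month/year strings (not taken from the chopped data) and passes an explicit `format` so that day and month are not swapped.

```python
import pandas as pd

# Hypothetical date strings in a day/month/year layout
raw_dates = pd.Series(["25/12/2019", "01/01/2020", "14/02/2020"])

# State the layout explicitly so pandas does not have to guess it
dates = pd.to_datetime(raw_dates, format="%d/%m/%Y")

# The result has a datetime64 dtype, so the .dt accessor is available
months = dates.dt.month_name()
```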

### Replacing Strings in Data Frame

In [None]:
data = {'A': ["Africa", "  A s i a", "America", "  A s i a"],
        'B': [5, np.nan, 7, 8],
        'C': [9, 10, 11, 12]}

df = pd.DataFrame(data)
df

In [None]:
df['A']

In [None]:
# Fixes the spaced-out spelling, but the leading whitespace is still there
df['A'].str.replace('A s i a', 'Asia')

In [None]:
# Removes leading and trailing whitespace only, not the spaces between letters
df['A'].str.strip()

In [None]:
df[df['A'].str.contains('Am')]
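Neither `.str.replace()` nor `.str.strip()` alone fully cleans `"  A s i a"`. One way to get there, sketched below on a standalone Series with the same values, is to chain the two string methods. Nothing is modified in place, so the result has to be assigned back to keep it.

```python
import pandas as pd

continents = pd.Series(["Africa", "  A s i a", "America", "  A s i a"])

# Strip the surrounding whitespace first, then fix the spaced-out spelling
cleaned = (continents
           .str.strip()
           .str.replace("A s i a", "Asia", regex=False))
```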

#### Assignment Question 3: Cleaning the dirty data frame

In [None]:
#clean.copy()

In [None]:
dirty = pd.read_csv('data/dirty_gapminder.csv')
dirty.head(2)

In [None]:
clean = pd.read_csv('data/clean_gapminder.csv')
clean.head(2)

Motivation: make `dirty` match `clean` exactly.

In [None]:
dirty[(~clean.eq(dirty)).any(axis=1)].shape

In [None]:
# This shows what you have
dirty[(~clean.eq(dirty)).any(axis=1)].head(1)

In [None]:
# This is the goal
clean[(~clean.eq(dirty)).any(axis=1)].head(1)
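The filter used in the cells above can be unpacked step by step. The sketch below applies it to two tiny made-up frames (`target` and `messy` are illustrative stand-ins, not the gapminder files): `.eq()` compares cell by cell, `~` flips the result, and `.any(axis=1)` flags every row with at least one mismatch.

```python
import pandas as pd

# Two small made-up frames with the same index and columns
target = pd.DataFrame({'country': ['Canada', 'Chile'], 'year': [2007, 2007]})
messy = pd.DataFrame({'country': ['Canada', 'chile'], 'year': [2007, 2007]})

# True wherever the two frames hold the same value
matches = target.eq(messy)

# True for each row that has at least one differing cell
row_differs = (~matches).any(axis=1)

# Rows of the messy frame that still need cleaning
rows_to_fix = messy[row_differs]
```

Note that `.eq()` treats NaN as different from NaN, so rows with missing values are flagged even when both frames are missing the same cell.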

In [1]:
##On Wednesday

In [None]:
dirty[dirty.isnull().any(axis=1)]

In [None]:
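def cleaned_gapminder(dirty_df):
    """Return a cleaned version of the dirty gapminder data frame."""
    # Fill the NAs with `.fillna()`; called on the whole data frame,
    # it fills missing values in every column, not only the continent one
    dirty_df = dirty_df.fillna(value="Americas")
    # Assign the expected column names
    dirty_df.columns = ['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap']

    return dirty_df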

In [None]:
cleaned_data = cleaned_gapminder(dirty)

In [None]:
clean[(~cleaned_data.eq(clean)).any(axis=1)].head(2)

In [None]:
clean.columns

In [None]:
cleaned_data[(~cleaned_data.eq(clean)).any(axis=1)].head(2)
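Once no mismatched rows remain, a single yes/no check can confirm that the cleaning is complete. A minimal sketch on made-up frames (`expected` and `attempt` are hypothetical stand-ins for `clean` and `cleaned_data`), using `DataFrame.equals()`:

```python
import numpy as np
import pandas as pd

expected = pd.DataFrame({'continent': ['Asia', 'Americas']})
attempt = pd.DataFrame({'continent': ['Asia', np.nan]}).fillna('Americas')

# True only if shape, column names, dtypes, and values all match
is_clean = attempt.equals(expected)
```

For a more detailed report of where two frames differ, `pd.testing.assert_frame_equal()` raises an error that describes the first difference it finds.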