# Pandas for Data Analysis: Data Wrangling (Part 2)

## Outline:

* [Scaling/Normalization](#Scaling/Normalization)
* [Parsing Dates](#Parsing-Dates)
* [Handling Missing Data](#Handling-Missing-Data)

## Scaling/Normalization

In [None]:
import pandas as pd

In [None]:
data = {
    'data': [2, 3, 4, 10, 12, 20, 30, 11, 25]
}
df = pd.DataFrame(data)

In [None]:
df

### Rescaling (Min-Max Normalization)

$x' = \frac{x - \text{min}(x)}{\text{max}(x) - \text{min}(x)}$

where $x$ is an original value and $x'$ is the normalized value. The result always falls in the range $[0, 1]$.

In [None]:
df['min_max_normalization'] = (df['data'] - df['data'].min()) / (df['data'].max() - df['data'].min())

In [None]:
df
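
As a quick sanity check, min-max normalization should map the smallest value to 0 and the largest to 1. The sketch below wraps the formula in a small helper (the name `min_max_normalize` is hypothetical, not a pandas function) and applies it to the same sample values used above.

```python
import pandas as pd


def min_max_normalize(series: pd.Series) -> pd.Series:
    """Rescale a numeric Series to the range [0, 1] (hypothetical helper)."""
    value_range = series.max() - series.min()
    return (series - series.min()) / value_range


sample = pd.Series([2, 3, 4, 10, 12, 20, 30, 11, 25])
scaled = min_max_normalize(sample)

# The smallest original value becomes 0 and the largest becomes 1.
print(scaled.min(), scaled.max())
```

Note that if every value in the column is identical, the denominator is zero and the result is `NaN`, so constant columns may need separate handling.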

### Mean Normalization

$x' = \frac{x - \text{average}(x)}{\text{max}(x)-\text{min}(x)}$

where $x$ is an original value and $x'$ is the normalized value. The result is centered on 0.

In [None]:
df['mean_normalization'] = (df['data'] - df['data'].mean()) / (df['data'].max() - df['data'].min())

In [None]:
df
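
Two properties follow from the formula: the normalized values have a mean of 0, and the distance between the largest and smallest normalized value is 1. A minimal check on the same sample values, assuming nothing beyond pandas itself:

```python
import pandas as pd

sample = pd.Series([2, 3, 4, 10, 12, 20, 30, 11, 25], dtype=float)

# Subtract the mean, then divide by the range (max - min).
mean_normalized = (sample - sample.mean()) / (sample.max() - sample.min())

# The mean should be (numerically) zero and the spread exactly one range unit.
print(round(mean_normalized.mean(), 10))
print(mean_normalized.max() - mean_normalized.min())
```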

### Standardization (z-Scaling)

$x' = \frac{x - \bar{x}}{\sigma}$

where $x$ is the original feature vector, $\bar{x} = \text{average}(x)$ is the mean of that feature vector, and $\sigma$ is its standard deviation.

In [None]:
df['standardization'] = (df['data'] - df['data'].mean()) / df['data'].std()

In [None]:
df
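
One detail worth knowing: `Series.std()` in pandas computes the sample standard deviation (`ddof=1`) by default, whereas `numpy.std` defaults to the population standard deviation (`ddof=0`). Which one is appropriate depends on the analysis; the sketch below only shows how the two variants differ on the sample values.

```python
import numpy as np
import pandas as pd

sample = pd.Series([2, 3, 4, 10, 12, 20, 30, 11, 25], dtype=float)

# pandas default: sample standard deviation (ddof=1).
z_sample = (sample - sample.mean()) / sample.std()

# Population standard deviation (ddof=0), matching numpy's default.
z_population = (sample - sample.mean()) / sample.std(ddof=0)

print(sample.std(), sample.std(ddof=0), np.std(sample.to_numpy()))
```

Either way the standardized values have a mean of 0; they have a standard deviation of 1 when measured with the same `ddof` that was used for scaling.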

### Scaling to Unit Length

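$x' = \frac{x}{||x||}$

where $x$ is the original feature vector and $||x||$ is its Euclidean length, i.e. the square root of the sum of its squared components.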

In [None]:
import numpy as np

In [None]:
df['squared'] = np.square(df['data'])
df['unit_length'] = df['data'] / np.sqrt(df['squared'].sum())

In [None]:
df
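
The same result can be obtained without a helper column by using `np.linalg.norm`, which computes the Euclidean length by default. A short sketch on the sample values, checking that the scaled vector really has length 1:

```python
import numpy as np
import pandas as pd

sample = pd.Series([2, 3, 4, 10, 12, 20, 30, 11, 25], dtype=float)

# Divide every component by the Euclidean (L2) norm of the whole vector.
unit_length = sample / np.linalg.norm(sample.to_numpy())

# Equivalent manual computation: square, sum, take the square root.
manual = sample / np.sqrt((sample ** 2).sum())

print(np.linalg.norm(unit_length.to_numpy()))
```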

### Challenges

Try loading the [Summary of Weather](https://www.kaggle.com/smid80/weatherww2/#Summary%20of%20Weather.csv) dataset and apply all four scaling/normalization methods above to any column of your choice.

---

## Parsing Dates

We will convert the `date` values in the [Landslides After Rainfall, 2007-2016](https://www.kaggle.com/nasa/landslide-events) dataset to datetime.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('data/landslides.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df['date_parsed'] = pd.to_datetime(df['date'], format='%m/%d/%y')

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.date_parsed.dt.day[0:3]

In [None]:
df.date_parsed.dt.month[0:3]

In [None]:
df.date_parsed.dt.year[0:3]
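
Real-world date columns often contain missing or malformed entries, and `pd.to_datetime` raises an error on those by default. One option is `errors='coerce'`, which turns unparseable values into `NaT`. The sketch below uses a small made-up Series (the values are illustrative, not taken from the landslides file) with the same `%m/%d/%y` format.

```python
import pandas as pd

# Illustrative values only: two valid dates, one malformed entry, one missing.
raw_dates = pd.Series(['3/2/07', '3/22/07', 'not a date', None])

# Unparseable or missing entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw_dates, format='%m/%d/%y', errors='coerce')

print(parsed)
print(parsed.isna().sum())
```

Counting the `NaT` values afterwards is a simple way to see how many rows failed to parse before deciding what to do with them.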

### Challenges

Try converting the `Date` values in the [Significant Earthquakes, 1965-2016](https://www.kaggle.com/usgs/earthquake-database) dataset to datetime.

---

## Handling Missing Data

In [None]:
import pandas as pd

In [None]:
titanic_data_url = 'http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv'
titanic = pd.read_csv(titanic_data_url)
titanic.head()

One simple way to check for missing data in our dataset is to use `info`, which reports the number of non-null values in each column.

In [None]:
titanic.info()

Alternatively, use `isnull` and sum the result to count the missing values per column.

In [None]:
titanic.isnull().sum()
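
Because `isnull` returns booleans, taking the mean instead of the sum gives the share of missing values per column, which is often easier to compare across columns. A minimal sketch on a made-up DataFrame (the column names mimic the Titanic data, but the values are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data only, not the real Titanic records.
toy = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, np.nan],
    'cabin': ['B5', None, None, None],
    'sex': ['female', 'male', 'female', 'male'],
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of missing values per column

print(missing_count)
print(missing_share)
```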

We can also use the open source package [Missingno](https://anaconda.org/conda-forge/missingno) to visualize the data and get a better overall picture of where values are missing.

In [None]:
%matplotlib inline
import missingno as msno

msno.matrix(titanic)

### Drop Missing Data

In [None]:
titanic.drop('body', axis='columns').dropna().shape

In [None]:
titanic.dropna(subset=['age', 'body'], how='any').shape

In [None]:
titanic.dropna(subset=['age', 'body'], how='all').shape
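
The difference between `how='any'` and `how='all'` is easier to see on a tiny example: `'any'` drops a row if at least one of the `subset` columns is missing, while `'all'` drops it only if every `subset` column is missing. The DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data only: four rows covering every missing/present combination.
toy = pd.DataFrame({
    'age': [29.0, np.nan, 30.0, np.nan],
    'body': [np.nan, np.nan, 135.0, 22.0],
})

# Keep only rows where both 'age' and 'body' are present.
any_dropped = toy.dropna(subset=['age', 'body'], how='any')

# Drop only rows where both 'age' and 'body' are missing.
all_dropped = toy.dropna(subset=['age', 'body'], how='all')

print(any_dropped.shape, all_dropped.shape)
```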

### Use Imputation

In [None]:
body_mean = titanic.body.mean()

In [None]:
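# Fill every missing value in 'body' with the column mean.
# (Do not chain .head() here: assigning only the first 5 rows would set the rest to NaN.)
titanic['body'] = titanic['body'].fillna(body_mean)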

In [None]:
titanic.head()

In [None]:
titanic.cabin.value_counts(dropna=False).head()

In [None]:
titanic.cabin.fillna('C23 C25 C27').value_counts().head()
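
Rather than hard-coding the fill value, the most frequent value can be looked up with `Series.mode()`, which ignores missing values by default and returns a Series (there may be ties), so the first entry is taken here. A minimal sketch on made-up data:

```python
import pandas as pd

# Illustrative cabin labels only, not the real Titanic values.
cabin = pd.Series(['C23', None, 'B5', 'C23', None])

# mode() skips missing values; take the first entry in case of ties.
most_common = cabin.mode().iloc[0]

# fillna returns a new Series; the original is left unchanged.
filled = cabin.fillna(most_common)

print(most_common)
print(filled.value_counts())
```

Imputing with the most frequent value inflates that category, so it is worth comparing the value counts before and after.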

### Challenges

Try filling the missing data in `age` using the median.

Try filling the missing data in `boat` using the mode.