# Pandas for Data Analysis: Data Wrangling (Part 1)

## Outline:

* [Dealing with Columns](#Dealing-with-Columns)
* [Dealing with String Data (Using `str` Function)](#Dealing-with-String-Data-(Using-str-Function))
* [Dealing with Categorical Data](#Dealing-with-Categorical-Data)
* [Mapping or Applying Function Along Axis](#Mapping-or-Applying-Function-Along-Axis)

In [None]:
import pandas as pd

In [None]:
adult_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'Work Class', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'Money Per Year']
adult = pd.read_csv(adult_data_url, names=columns)

In [None]:
adult.head()

## Dealing with Columns

### Renaming Columns

In [None]:
# rename() returns a new DataFrame; `adult` itself is not modified
adult.rename(columns={'Work Class': 'workclass'})

In [None]:
adult.rename(columns={
    'Work Class': 'workclass',
    'education-num': 'education_num'
})

In [None]:
adult.head()

In [None]:
adult_new = adult.rename(columns={
    'Work Class': 'workclass',
    'education-num': 'education_num'
})

In [None]:
adult_new.head()
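The cells above show that `rename` does not modify `adult` unless the result is assigned. Here is a minimal sketch of the same behaviour on a small, made-up frame (`toy` and its values are hypothetical, used only for illustration):

```python
import pandas as pd

# Toy frame (hypothetical data) with the same style of column names as above.
toy = pd.DataFrame({'Work Class': ['Private', 'State-gov'], 'education-num': [13, 9]})

# rename() returns a new DataFrame and leaves the original untouched.
renamed = toy.rename(columns={'Work Class': 'workclass', 'education-num': 'education_num'})

print(list(toy.columns))      # the original names are still here
print(list(renamed.columns))  # the new names exist only in the returned frame
```

Assigning the result to a variable, as `adult_new` does above, is what keeps the renamed columns.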

#### Challenges

Try renaming the columns as follows:
* `marital-status` to `marital_status`
* `capital-gain` to `capital_gain`
* `hours-per-week` to `hours_per_week`
* `native-country` to `native_country`
* `Money Per Year` to `money_per_year`

### Adding New Columns

In [None]:
adult['my_new_column'] = 1

In [None]:
adult.head()

In [None]:
adult['normalized-age'] = (adult.age - adult.age.mean()) / adult.age.std()

In [None]:
adult.head()

In [None]:
adult[adult['normalized-age'] > 1]
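The `normalized-age` column above is a z-score: each age minus the mean, divided by the standard deviation. Here is a small worked sketch with made-up ages (the `toy` frame is hypothetical), chosen so the result is easy to check by hand:

```python
import pandas as pd

# Hypothetical ages: mean is 30 and the sample standard deviation is 10.
toy = pd.DataFrame({'age': [20, 30, 40]})

# Same formula as above; Series.std() uses the sample standard deviation (ddof=1).
toy['normalized-age'] = (toy['age'] - toy['age'].mean()) / toy['age'].std()
print(toy)

# The column name contains a hyphen, so it must be accessed with brackets,
# not as an attribute.
older = toy[toy['normalized-age'] > 0.5]
print(older)
```

A value of 1 means the row is one standard deviation above the mean, so the filter `> 1` above selects people who are well above the average age.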

#### Challenges

Try adding a new column named `over_time`: set it to `True` when hours per week is greater than or equal to 40, and `False` otherwise.

### Removing Existing Columns

In [None]:
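# Raises KeyError: without `axis`, drop() looks for a row labelled 'normalized-age'
adult.drop('normalized-age')

By default, `drop` removes rows by label, so to drop a column we need to specify the `axis` parameter.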

In [None]:
adult.drop('normalized-age', axis=1)

In [None]:
adult.drop('normalized-age', axis='columns')

In [None]:
adult.head()

In [None]:
adult_without_normalized_age = adult.drop('normalized-age', axis=1)

In [None]:
adult_without_normalized_age.head()

To remove rows instead, pass the row labels with `axis=0`:

In [None]:
adult.drop([0, 1], axis=0)
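Here is a minimal sketch of how `drop` treats rows and columns, using a small, made-up frame (`toy` is hypothetical). It also uses the `columns=` and `index=` keywords, which are an alternative to `axis` and were not used above:

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({'age': [39, 50, 38], 'extra': [1, 1, 1]})

# By default drop() searches the row labels (axis=0), so a column name is not found.
try:
    toy.drop('extra')
except KeyError as error:
    print('KeyError:', error)

# Dropping a column: equivalent to toy.drop('extra', axis=1).
without_extra = toy.drop(columns='extra')

# Dropping rows by label: equivalent to toy.drop([0, 1], axis=0).
without_rows = toy.drop(index=[0, 1])

print(without_extra)
print(without_rows)
print(toy)  # drop() returns a new frame; the original is unchanged
```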

---

## Dealing with String Data (Using `str` Function)

In [None]:
# Returns no rows: the raw values carry a leading space (' Masters')
adult[adult.education == 'Masters']

In [None]:
adult.education[0]

In [None]:
adult.education = adult.education.str.replace(' ', '')

In [None]:
adult[adult.education == 'Masters'].head()

In [None]:
adult[adult.education.str.contains('Mas')]

You can use `isin` to check whether each value appears in a list.

In [None]:
adult[adult.education.isin(['Bachelors', 'Masters'])].head()
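The string operations above can be tried on a small, made-up Series (hypothetical values that mimic the leading space in the raw data). This sketch uses `str.strip()`, which was not used above and removes only leading and trailing whitespace, as an alternative to replacing every space:

```python
import pandas as pd

# Hypothetical values with a leading space, as in the raw education column.
education = pd.Series([' Bachelors', ' Masters', ' HS-grad'])

# An exact comparison fails while the leading space is still there.
print((education == 'Masters').any())

# strip() removes surrounding whitespace only.
stripped = education.str.strip()

is_masters = stripped == 'Masters'                     # exact match
has_mas = stripped.str.contains('Mas')                 # substring match
in_degrees = stripped.isin(['Bachelors', 'Masters'])   # membership in a list

print(stripped[is_masters])
print(stripped[in_degrees])
```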

The same `str` methods can also be used to rename columns.

In [None]:
adult_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'Work Class', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'Money Per Year']
adult = pd.read_csv(adult_data_url, names=columns)

In [None]:
adult.head()

In [None]:
adult.columns.str.replace('-', '_')

The `str` methods can also be chained:

In [None]:
adult.columns.str.lower().str.replace('-', '_').str.replace(' ', '_')
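The chained expression above only returns a new `Index`; the frame keeps its old names until the result is assigned back to `columns`. Here is a minimal sketch on an empty, made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with column names in the same style as the dataset.
toy = pd.DataFrame(columns=['Work Class', 'education-num', 'Money Per Year'])

# Build the cleaned names and assign them back to the frame.
toy.columns = toy.columns.str.lower().str.replace('-', '_').str.replace(' ', '_')

print(list(toy.columns))
```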

### Challenges

Try changing `Male` and `Female` in `sex` to `male` and `female`, respectively.

---

## Dealing with Categorical Data

In [None]:
adult.education.unique()

In [None]:
adult.info()

In [None]:
from pandas.api.types import CategoricalDtype

In [None]:
adult.education = adult.education.astype(CategoricalDtype())

In [None]:
adult.head()

In [None]:
adult.info()

In [None]:
adult.education.head()

In [None]:
adult.education.cat.codes.head()

In [None]:
for each in adult.education.unique():
    print(each)

Reload the data after the experiments above.

In [None]:
adult_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'Work Class', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'Money Per Year']
adult = pd.read_csv(adult_data_url, names=columns)

In [None]:
from pandas.api.types import CategoricalDtype

In [None]:
adult.education = adult.education.str.replace(' ', '')
# Ordered from lowest to highest level, following the dataset's education-num coding
categories = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']
adult.education = adult.education.astype(CategoricalDtype(categories=categories, ordered=True))

In [None]:
adult.sort_values('education').head(2)

In [None]:
adult.sort_values('education').tail(2)

In [None]:
adult.loc[adult.education >= 'Masters', :]
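With `ordered=True`, the position of each value in `categories` defines its rank, which is what makes sorting and comparisons like `>=` meaningful. Here is a minimal sketch with a shortened, made-up list of levels (the `levels` dtype and the values are hypothetical):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# The order of this list defines the ranking: HS-grad < Bachelors < Masters < Doctorate.
levels = CategoricalDtype(
    categories=['HS-grad', 'Bachelors', 'Masters', 'Doctorate'],
    ordered=True,
)
education = pd.Series(['Masters', 'HS-grad', 'Doctorate', 'Bachelors']).astype(levels)

# Each value is stored as an integer code: its position in the category list.
print(education.cat.codes.tolist())

# Sorting follows the category order, not alphabetical order.
sorted_education = education.sort_values()
print(sorted_education.tolist())

# Comparisons also use the category order.
at_least_masters = education[education >= 'Masters']
print(at_least_masters.tolist())
```

Because the ranking comes entirely from the list order, a misplaced entry in `categories` silently changes the result of every sort and comparison.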

### Challenges

Try converting `occupation` to categorical data.

Try converting `sex` to categorical data.

---

## Mapping or Applying Function Along Axis

In [None]:
data = {
    'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
    'year': [2012, 2012, 2013, 2014, 2014], 
    'reports': [4, 24, 31, 2, 3],
    'coverage': [25, 94, 57, 62, 70]
}
df = pd.DataFrame(data)

In [None]:
df.head()

### Map

In [None]:
def capitalizer(x):
    return x.upper()

In [None]:
df['name'].map(capitalizer)
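`Series.map` applies its argument to every element of the Series. Besides a user-defined function like `capitalizer`, it also accepts built-in functions and dictionaries. Here is a minimal sketch with made-up values (`names`, `years`, and the lookup dictionary are hypothetical):

```python
import pandas as pd

names = pd.Series(['Jason', 'Molly', 'Tina'])

# Any callable that takes one value works, including built-ins.
upper_names = names.map(str.upper)
name_lengths = names.map(len)

# A dictionary is used as a lookup table; keys that are missing become NaN.
years = pd.Series([2012, 2013, 2015])
labels = years.map({2012: 'first', 2013: 'second'})

print(upper_names.tolist())
print(name_lengths.tolist())
print(labels)
```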

### Apply

In [None]:
df.apply(print)

In [None]:
df.apply(print, axis='columns')
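The two `print` calls above show what the function receives: by default `apply` passes one column at a time, and with `axis='columns'` it passes one row at a time. Here is a minimal sketch that returns values instead of printing, using a small, made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical numeric data for illustration.
toy = pd.DataFrame({'reports': [4, 24, 31], 'coverage': [25, 94, 57]})

# Default (axis=0): the function receives each column as a Series.
column_max = toy.apply(lambda column: column.max())

# axis='columns': the function receives each row as a Series.
row_total = toy.apply(lambda row: row['reports'] + row['coverage'], axis='columns')

print(column_max)
print(row_total)
```

The result has one entry per column in the first case and one entry per row in the second.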

### Challenges

Try using `map` or `apply` on the [Age of Female Oscar Winner](https://people.sc.fsu.edu/~jburkardt/data/csv/oscar_age_female.csv) data to split `name` into two columns: `first_name` and `last_name`.