# 5 Data Cleanse

Data cleansing and preprocessing are essential for quality data analysis. Pandas and Python make these labor-intensive tasks easier.

Common data cleansing and preprocessing tasks include:

- eliminate leading and trailing whitespace
- handle missing data
- handle duplicate data
- normalize the letter case of text data (English)
- optimize data types to save memory and speed up queries

In [None]:
import pandas as pd

In [None]:
inspections = pd.read_csv("chicago-food-inspections.csv")
inspections

## 5.1 Eliminate leading and trailing whitespace

In [None]:
for column in inspections.columns:
    inspections[column] = inspections[column].str.strip()
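
The loop above assumes that every column holds text, because the `.str` accessor raises an `AttributeError` on numeric columns. A minimal sketch of a guarded variant follows. The `sample` DataFrame and its `Score` column are hypothetical and only stand in for a file that mixes text and numbers.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

# Hypothetical sample data; "Score" is an invented numeric column.
sample = pd.DataFrame(
    {
        "Name": ["  MARRIOT MARQUIS CHICAGO ", " JETS PIZZA"],
        "Risk": [" Risk 1 (High)", "Risk 2 (Medium) "],
        "Score": [1, 2],
    }
)

for column in sample.columns:
    # Strip only columns that hold text; numeric columns are left untouched.
    if is_object_dtype(sample[column]) or is_string_dtype(sample[column]):
        sample[column] = sample[column].str.strip()

print(sample["Name"].tolist())
```

Note that `str.strip()` only trims the ends of each value; whitespace inside the text is kept.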

## 5.2 Capitalize the first letter of each word

In [None]:
inspections["Name"] = inspections["Name"].str.title()
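
To make the effect of `str.title()` concrete, the sketch below compares it with `str.capitalize()` on a small Series. The `names` values are made up for illustration.

```python
import pandas as pd

# Invented values that mimic restaurant names in mixed letter case.
names = pd.Series(["MARRIOT MARQUIS CHICAGO", "jets pizza"])

# title() capitalizes the first letter of every word and lowercases the rest.
title_cased = names.str.title()

# capitalize() only capitalizes the first letter of the whole string.
capitalized = names.str.capitalize()

print(title_cased.tolist())
print(capitalized.tolist())
```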

## 5.3 Optimize memory usage

By choosing appropriate data types for columns, you can save memory and make queries run faster.
General rules are:

- use `datetime` instead of `object` for date columns
- use `int` instead of `float` if possible
- use `bool` instead of `object` for boolean columns
- use `category` for columns with a limited number of distinct values
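
The rules above can be sketched on a small DataFrame. All column names and values here are hypothetical, and the rows are repeated 1000 times only so that the memory difference is visible.

```python
import pandas as pd

# Hypothetical data in which every column starts out as text or float.
base = pd.DataFrame(
    {
        "Start Date": ["2021-01-04", "2021-02-15", "2021-03-01", "2021-03-22"],
        "Headcount": [3.0, 5.0, 2.0, 8.0],
        "Senior": ["true", "false", "true", "true"],
        "Team": ["Sales", "Finance", "Sales", "Sales"],
    }
)
raw = pd.concat([base] * 1000, ignore_index=True)

optimized = raw.assign(
    **{
        # Date strings become datetime values.
        "Start Date": pd.to_datetime(raw["Start Date"]),
        # Whole-number floats become integers.
        "Headcount": raw["Headcount"].astype("int64"),
        # The comparison yields booleans; anything other than "true" is False.
        "Senior": raw["Senior"] == "true",
        # A column with few distinct values becomes a category.
        "Team": raw["Team"].astype("category"),
    }
)

before = raw.memory_usage(deep=True).sum()
after = optimized.memory_usage(deep=True).sum()
print(optimized.dtypes)
print(f"before: {before} bytes, after: {after} bytes")
```

The integer conversion only works when the column has no missing values and no fractional parts, so check the data before applying it.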

### 5.3.1 Identify optimization opportunities

In [None]:
inspections.nunique()
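
One rough way to read the output of `nunique()` is to compare each count with the number of rows. The sketch below flags columns whose share of distinct values is low. The `sample` data and the 0.5 threshold are assumptions for illustration, not a rule from pandas.

```python
import pandas as pd

# Invented data: every name is unique, but only two risk levels occur.
sample = pd.DataFrame(
    {
        "Name": ["Cafe A", "Cafe B", "Cafe C", "Cafe D", "Cafe E", "Cafe F"],
        "Risk": ["High", "Low", "High", "High", "Low", "High"],
    }
)

# Share of distinct values per column (1.0 means every value is unique).
unique_ratio = sample.nunique() / len(sample)

# Assumed threshold: columns below 0.5 are treated as category candidates.
candidates = unique_ratio[unique_ratio < 0.5].index.tolist()
print(candidates)
```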

### 5.3.2 Check the DataFrame before optimization

In [None]:
inspections.info()

### 5.3.3 Change `Risk` to category

In [None]:
inspections["Risk"] = inspections["Risk"].astype("category")
inspections.info()

In [None]:
employees = pd.read_csv("employees.csv", parse_dates=["Start Date"])
employees

## 5.4 Deal with missing data

Pandas represents missing data as `NaN` in numeric and string columns and as `NaT` in datetime columns.
To cleanse missing data, either drop the affected rows with `dropna()` or replace the missing values with a constant using `fillna()`.

You can drop rows with missing data in specific columns by setting the `subset` parameter of the `dropna()` method.
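
As a minimal sketch of these options, the example below applies `dropna()` and `fillna()` to a small hypothetical `staff` DataFrame. The values are invented and are not taken from `employees.csv`.

```python
import pandas as pd

# Invented data; None marks a missing value in every column.
staff = pd.DataFrame(
    {
        "First Name": ["Douglas", None, "Maria", None],
        "Team": ["Marketing", "Finance", None, None],
        "Salary": [97308.0, None, 130590.0, None],
    }
)

# Drop rows that have a missing value in any column.
complete_rows = staff.dropna()

# Drop only rows in which every column is missing.
not_empty_rows = staff.dropna(how="all")

# Drop rows that are missing a value in the listed columns only.
named_rows = staff.dropna(subset=["First Name"])

# Replace missing values in one column with a constant.
filled = staff.fillna({"Team": "Unknown"})

print(len(complete_rows), len(not_empty_rows), len(named_rows))
print(filled["Team"].tolist())
```

Each call returns a new DataFrame and leaves `staff` unchanged, so assign the result if the change should persist.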

### 5.4.1 Remove employees without `First Name`

In [None]:
employees.dropna(subset=["First Name"])

## 5.5 Deal with duplicate data

You can drop duplicated rows based on a combination of columns by setting the `subset` parameter of the `drop_duplicates()` method.

### 5.5.1 Remove employees with a duplicated `First Name` and `Gender` combination

In [None]:
employees.drop_duplicates(subset=["First Name", "Gender"], keep='first')
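
The `keep` parameter decides which row of each duplicate group survives. The sketch below shows its three settings on a small hypothetical `staff` DataFrame with invented values.

```python
import pandas as pd

# Invented data: rows 0 and 1 share the same First Name and Gender.
staff = pd.DataFrame(
    {
        "First Name": ["Douglas", "Douglas", "Douglas", "Maria"],
        "Gender": ["Male", "Male", "Female", "Female"],
        "Team": ["Marketing", "Sales", "Finance", "Finance"],
    }
)
key = ["First Name", "Gender"]

# Keep the first row of each duplicate group.
first_kept = staff.drop_duplicates(subset=key, keep="first")

# Keep the last row of each duplicate group.
last_kept = staff.drop_duplicates(subset=key, keep="last")

# Drop every row that belongs to a duplicate group.
none_kept = staff.drop_duplicates(subset=key, keep=False)

print(first_kept.index.tolist())
print(last_kept.index.tolist())
print(none_kept.index.tolist())
```

Only the columns in `subset` are compared, so rows 0 and 1 count as duplicates even though their `Team` values differ.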

In [None]:
# dropna() returns a new DataFrame, so assign the result to keep the change.
inspections = inspections.dropna(subset=["Name"])
inspections.info()