# Data Preprocessing

## Review
### Tabular Data
- Numerical Values
    - line
    - hist
    - box
    - scatter
- Categorical Values
    - bar
    - pie
- Temporal data
- Spatial data
- Graph data

## Outline
- Missing Values
- Categorical Features
- Feature Scaling

## Missing Values
#### Find the Missing Values
- DataFrame.isnull()
    - Returns a same-sized boolean object indicating which values are NA


In [2]:
import pandas as pd

df = pd.read_csv('housing.csv')

print(df.shape)

df.isnull().sum()

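The cell above needs `housing.csv` in the working directory. As a self-contained sketch of the same check, the snippet below builds a small toy frame (the values are hypothetical, chosen only for illustration) and shows what `isnull()` returns and how `sum()` turns the mask into per-column counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for housing.csv
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Boolean mask with the same shape as the frame: True where a value is NA
mask = toy.isnull()
print(mask)

# Summing the mask column-wise counts the missing values per feature
missing_counts = mask.sum()
print(missing_counts)
```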

## Methods

1. Check the fraction of missing values in each feature

In [None]:
print(df.isnull().sum()/df.shape[0])

2. Remove features with many missing values

In [None]:
df = pd.read_csv('housing.csv')

print(df.columns)

df = df.drop('total_bedrooms', axis=1)
print(df.columns)
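The two steps above can be combined: compute the missing fraction per column, then drop every column above a cutoff. This is a minimal sketch on hypothetical toy data; the 0.5 threshold is an assumption for illustration, not a value from these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the housing example
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, np.nan, np.nan],
    "households": [126.0, 1138.0, np.nan, 219.0],
})

# Fraction of missing values per column (same as isnull().sum() / shape[0])
missing_fraction = toy.isnull().mean()
print(missing_fraction)

# Assumed cutoff: drop columns where more than half of the values are missing
threshold = 0.5
cols_to_drop = missing_fraction[missing_fraction > threshold].index
reduced = toy.drop(columns=cols_to_drop)
print(list(reduced.columns))
```

Here only `total_bedrooms` (75% missing) exceeds the cutoff, so `households` (25% missing) is kept and would still need filling.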

3. Fill in the missing values

In [None]:
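# Reload the data, because the previous cell dropped 'total_bedrooms'
df = pd.read_csv('housing.csv')

mean_val = df['total_bedrooms'].mean()
median_val = df['total_bedrooms'].median()

print(mean_val)
print(median_val)

# Fill the missing entries with the mean (median_val is an alternative)
df['total_bedrooms'] = df['total_bedrooms'].fillna(mean_val)
print(df.isnull().sum())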
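The cell above computes both the mean and the median but fills with the mean. The sketch below, on a hypothetical series with one large outlier, illustrates why the choice can matter: the mean is pulled toward the outlier, while the median is not.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry and one large outlier
bedrooms = pd.Series([100.0, 120.0, np.nan, 140.0, 2000.0], name="total_bedrooms")

# Both statistics skip NaN by default
mean_val = bedrooms.mean()      # (100 + 120 + 140 + 2000) / 4
median_val = bedrooms.median()  # midpoint of 120 and 140

# Fill the gap with each statistic and compare the imputed value
filled_mean = bedrooms.fillna(mean_val)
filled_median = bedrooms.fillna(median_val)
print(filled_mean[2], filled_median[2])
```

With skewed features, filling with the median usually gives a value closer to a typical sample.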

Check the distribution

In [None]:
import matplotlib.pyplot as plt

# Histogram with 100 bins of the filled feature
plt.hist(df['total_bedrooms'].values, bins=100)

plt.show()

## Categorical Features

#### Convert Categorical Values to Numerical Values
- Label Encoding
    - Each category of a feature is converted into an integer value

In [None]:
from sklearn.preprocessing import LabelEncoder

print(df["ocean_proximity"].value_counts())

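# LabelEncoder assigns integers to the categories in sorted order
labelencoder = LabelEncoder()
# Note: this overwrites the original string column with integer codes
df['ocean_proximity'] = labelencoder.fit_transform(df['ocean_proximity'])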

print(df['ocean_proximity'].value_counts())
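To see which integer each category receives, the sketch below does the same kind of encoding with pandas category codes instead of `LabelEncoder` (a swapped-in technique, used here so the example needs only pandas). The category values are hypothetical.

```python
import pandas as pd

# Hypothetical category values for illustration
proximity = pd.Series(["NEAR BAY", "INLAND", "ISLAND", "INLAND"])

# Converting to the category dtype sorts the distinct values,
# and .cat.codes gives each value its position in that sorted list
cat = proximity.astype("category")
codes = cat.cat.codes

# Keep the code -> category mapping so the encoding can be inverted
mapping = dict(enumerate(cat.cat.categories))
print(codes.tolist())
print(mapping)
```

The codes are arbitrary labels: a model may read `2 > 0` as an ordering even though the categories have none.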

- One-Hot Encoding
    - Each category is mapped to a binary vector with a 1 at that category's position and 0 elsewhere

In [None]:
from sklearn.preprocessing import OneHotEncoder

print(df["ocean_proximity"][0])

# scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False
onehotencoder = OneHotEncoder(sparse_output=False)
result = onehotencoder.fit_transform(df[['ocean_proximity']])
print(result[0,:])
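As a self-contained illustration of the resulting vectors, the sketch below uses `pd.get_dummies` rather than `OneHotEncoder` (again a swapped-in pandas alternative) on hypothetical category values.

```python
import pandas as pd

# Hypothetical category values for illustration
toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND", "INLAND"]})

# One column per distinct category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(toy["ocean_proximity"], dtype=int)
print(one_hot)

# Each row contains exactly one 1, in the column of its category
first_row = one_hot.iloc[0].tolist()
print(first_row)
```

Unlike integer labels, these vectors do not suggest any ordering between categories, at the cost of one extra column per category.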

- Ordinal Encoding
    - Retains the natural order of the categories

In [3]:
data = {'rating': ['Poor', 'Good', 'Very Good', 'Excellent']}
df = pd.DataFrame(data)
print(df)

coding_map = {'Poor': 1, 'Good': 2, 'Very Good': 3, 'Excellent': 4}
df['rating'] = df.rating.map(coding_map)
print(df)

      rating
0       Poor
1       Good
2  Very Good
3  Excellent
   rating
0       1
1       2
2       3
3       4


## Feature Scaling

- Different features have different scales, so features with large values can dominate models that rely on distances or gradients
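
These notes do not name a specific scaling method. As a rough sketch under that caveat, two common choices are min-max scaling and standardization, shown below with plain pandas arithmetic on hypothetical columns.

```python
import pandas as pd

# Hypothetical features on very different scales
toy = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "population": [500.0, 1500.0, 2500.0, 3500.0],
})

# Min-max scaling: map each column linearly onto [0, 1]
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Standardization: zero mean and unit variance per column
# (ddof=0 uses the population standard deviation)
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(min_max)
print(standardized)
```

After either transform, both columns occupy comparable ranges, so neither dominates purely because of its units.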