# Data Preprocessing
Common data types, with typical plots for exploring each:

Numerical values
- Line plot
- Histogram
- Box plot
- Scatter plot

Categorical values
- Bar plot
- Pie plot

Temporal data  
Spatial data  
Graph data  

## Missing Values

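Find the missing values with `DataFrame.isnull()`  
- it returns a boolean object of the same size, indicating which values are NA  

The cell below sums these flags per column and divides by the number of rows, which gives the fraction of missing values in each feature. That fraction helps decide whether a feature should be removed or filled in.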

In [9]:
import pandas as pd

df = pd.read_csv('housing.csv')

print(df.isnull().sum()/df.shape[0])

longitude             0.000000
latitude              0.000000
housing_median_age    0.000000
total_rooms           0.000000
total_bedrooms        0.010029
population            0.000000
households            0.000000
median_income         0.000000
median_house_value    0.000000
ocean_proximity       0.000000
dtype: float64


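Method: Remove a feature that has a lot of missing values. In the housing data only about 1% of `total_bedrooms` is missing, so the drop below only illustrates the call.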

In [7]:
df = pd.read_csv('housing.csv')

print(df.columns)

df = df.drop('total_bedrooms', axis = 1)
    
print(df.columns)

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'population', 'households', 'median_income', 'median_house_value',
       'ocean_proximity'],
      dtype='object')
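
As a sketch of how the missing-value fractions can drive this decision, the example below drops every column whose fraction of missing values exceeds a cutoff. The toy DataFrame, the 0.5 threshold, and the variable names are assumptions made for illustration; they do not come from the housing data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'b' is mostly missing, 'c' has one gap
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

threshold = 0.5  # assumed cutoff; choose it per dataset

# Fraction of missing values per column (same as isnull().sum() / n_rows)
missing_frac = toy.isnull().mean()

# Drop only the columns whose missing fraction exceeds the threshold
cols_to_drop = missing_frac[missing_frac > threshold].index
toy_reduced = toy.drop(columns=cols_to_drop)

print(missing_frac)
print(list(toy_reduced.columns))
```

Here only `b` (75% missing) is removed, while `c` (25% missing) is kept and could be filled in instead.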


Method: Fill in the missing values
- Numerical values
    - Fill in the missing values with the mean or the median of the feature


In [10]:
df = pd.read_csv('housing.csv')

mean_val = df['total_bedrooms'].mean()
median_val = df['total_bedrooms'].median()

print(mean_val)
print(median_val)

df['total_bedrooms'] = df['total_bedrooms'].fillna(mean_val)
print(df.isnull().sum())  # call sum() to get the per-column counts

537.8705525375618
435.0
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64


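Use the median if the distribution has a long tail, because a few extreme values pull the mean away from the typical value.  
Use the mean if the distribution has no long tail.  
Here the mean (537.87) is well above the median (435.0), which suggests that `total_bedrooms` has a long right tail.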
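
A small sketch of why the choice matters: in the toy sample below (hypothetical values, not taken from the housing data), a single extreme value shifts the mean a lot but leaves the median unchanged.

```python
import numpy as np

# Toy sample (hypothetical): mostly small values plus one extreme value
values = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 100.0])

mean_val = values.mean()        # pulled upward by the extreme value
median_val = np.median(values)  # stays at the middle of the sample

print(mean_val)
print(median_val)
```

Filling gaps with the mean here would insert a value larger than six of the seven observations, while the median stays representative of the bulk of the data.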

## Categorical Features
A categorical feature has no mean or median
- so fill its missing values with a new category value instead

In [11]:
df = pd.read_csv('housing.csv')

print(df['ocean_proximity'].unique())

filling_value = 'PA'
df['ocean_proximity'] = df['ocean_proximity'].fillna(filling_value)

print(df['ocean_proximity'].unique())

['NEAR BAY' '<1H OCEAN' 'INLAND' 'NEAR OCEAN' 'ISLAND']
['NEAR BAY' '<1H OCEAN' 'INLAND' 'NEAR OCEAN' 'ISLAND']
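
In the housing data `ocean_proximity` has no missing values (its missing fraction in the first cell is 0), so the two printed lists are identical. A minimal sketch on a toy column with gaps shows what the fill does; the toy values and the placeholder name `'MISSING'` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy categorical column (hypothetical) with two missing entries
proximity = pd.Series(['INLAND', np.nan, 'NEAR BAY', np.nan, 'INLAND'])

# Replace missing entries with a new category that marks them as missing
filled = proximity.fillna('MISSING')

print(filled.tolist())
print(filled.value_counts())
```

The new category keeps the rows in the dataset and lets a model treat "value was missing" as its own signal.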


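Label Encoding  
- Each category of a categorical feature is converted into an integer value  
- `LabelEncoder` assigns the integers in sorted order of the category names, which gives this mapping for `ocean_proximity`  

| Proximity | Label |
|-----------|-------|
| <1H OCEAN | 0 |
| INLAND | 1 |
| ISLAND | 2 |
| NEAR BAY | 3 |
| NEAR OCEAN | 4 |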



In [13]:
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('housing.csv')
print(df["ocean_proximity"].value_counts())

labelencoder = LabelEncoder()
df['ocean_proximity'] = labelencoder.fit_transform(df['ocean_proximity'])

print(df["ocean_proximity"].value_counts())

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64
ocean_proximity
0    9136
1    6551
4    2658
3    2290
2       5
Name: count, dtype: int64
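
The integer codes in the second output do not follow the order in which the categories appear in the data. A short sketch, using only the category names printed above, shows how the fitted encoder's `classes_` attribute exposes the mapping; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Category names taken from the output above
categories = ['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# classes_ lists the categories in the order of their integer codes
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original names from the codes
restored = [str(name) for name in encoder.inverse_transform(codes)]
print(restored)
```

Because the integers carry an arbitrary order, a model may read `NEAR OCEAN` (4) as "larger" than `INLAND` (1), which is one reason one-hot encoding is often used for unordered categories.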


One-Hot Encoding
- Each category is mapped to a vector of 0s with a single 1 at the position of that category


In [17]:
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv('housing.csv')
print(df["ocean_proximity"][0])

onehotencoder = OneHotEncoder()
result = onehotencoder.fit_transform(df[['ocean_proximity']]).toarray()
print(result[0,:])

NEAR BAY
[0. 0. 0. 1. 0.]
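
To read a one-hot vector, it helps to know which position belongs to which category. The sketch below fits the encoder on the five category names from the housing data and prints `categories_`, which lists the categories in vector order; the array `col` is a stand-in for the real column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column holding the five category names (stand-in for the real data)
col = np.array([['NEAR BAY'], ['<1H OCEAN'], ['INLAND'],
                ['NEAR OCEAN'], ['ISLAND']])

encoder = OneHotEncoder()
onehot = encoder.fit_transform(col).toarray()  # sparse result -> dense array

# categories_[0] gives the category for each position of the vector
print(encoder.categories_[0])
print(onehot[0])  # row for 'NEAR BAY'
```

`NEAR BAY` is the fourth category in sorted order, which matches the 1 in the fourth position of the vector printed above.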


Ordinal Encoding
- Use it when the categorical feature is ordinal, that is, its values have a natural order
- The integer codes must retain that order

In [18]:
data = {'rating': ['Poor','Good','Very Good', 'Excellent']}
df = pd.DataFrame(data)
print(df)

coding_map = {'Poor': 1, 'Good':2, 'Very Good': 3, 'Excellent':4}
df['rating'] = df.rating.map(coding_map)
print(df)


      rating
0       Poor
1       Good
2  Very Good
3  Excellent
   rating
0       1
1       2
2       3
3       4
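
An alternative sketch of the same idea uses an ordered pandas `Categorical`, where the order is stated once as a list instead of a dictionary. This is one possible approach, not the method used above; the sample ratings and variable names are assumptions for illustration.

```python
import pandas as pd

# Sample ratings (hypothetical), deliberately not in rank order
ratings = pd.Series(['Good', 'Poor', 'Excellent', 'Very Good', 'Good'])

# Order listed from lowest to highest, as in coding_map above
order = ['Poor', 'Good', 'Very Good', 'Excellent']
ordered = pd.Categorical(ratings, categories=order, ordered=True)

# codes start at 0, so add 1 to match the 1..4 coding used above
codes = ordered.codes + 1
print(codes.tolist())
```

With the dictionary approach, any value that is missing from `coding_map` becomes NaN after `map`, so it is worth checking for new NaN values after encoding.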


## Feature Scaling
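Different features have different scales.  
In the Euclidean distance $\|x - y\|_2$, the feature with the larger scale dominates the distance.

Examples:
- x = (1, 1000), y = (2, 2000): the distance is about 1000 and comes almost entirely from the second feature
- x = (1, 4), y = (3, 5): the features are on similar scales and both contribute to the distance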
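
The two examples can be checked numerically. The sketch below computes the Euclidean distance for each pair; the helper name `euclidean` is illustrative.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean (L2) distance between two points."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.linalg.norm(diff))

d_large = euclidean((1, 1000), (2, 2000))  # second feature has a much larger scale
d_small = euclidean((1, 4), (3, 5))        # both features on a similar scale

print(d_large)
print(d_small)
```

In the first pair the difference of 1 in the first feature is negligible next to the difference of 1000 in the second, so the first feature has almost no influence on the distance.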

### Feature Scaling Methods
Min-Max Normalization
- Rescales each feature to the range [0, 1]
- Sensitive to outliers
- $X' = \dfrac{X - X_{\min}}{X_{\max} - X_{\min}}$

In [19]:
import numpy as np
import pandas as pd
np.set_printoptions(precision=4)  # print arrays with 4 decimal places

df = pd.read_csv('housing.csv')
X = df.values[0:5, 5:9].astype(dtype=np.float32)
print('Original data')
print(X)

x_min = X.min(axis=0)
x_max = X.max(axis=0)
print('min and max')
print(x_min)
print(x_max)

X = (X - x_min)/(x_max - x_min)
print('Scaling data')
print(X)

Original data
[[3.2200e+02 1.2600e+02 8.3252e+00 4.5260e+05]
 [2.4010e+03 1.1380e+03 8.3014e+00 3.5850e+05]
 [4.9600e+02 1.7700e+02 7.2574e+00 3.5210e+05]
 [5.5800e+02 2.1900e+02 5.6431e+00 3.4130e+05]
 [5.6500e+02 2.5900e+02 3.8462e+00 3.4220e+05]]
min and max
[3.2200e+02 1.2600e+02 3.8462e+00 3.4130e+05]
[2.4010e+03 1.1380e+03 8.3252e+00 4.5260e+05]
Scaling data
[[0.     0.     1.     1.    ]
 [1.     1.     0.9947 0.1545]
 [0.0837 0.0504 0.7616 0.097 ]
 [0.1135 0.0919 0.4012 0.    ]
 [0.1169 0.1314 0.     0.0081]]
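
The same scaling is available as `MinMaxScaler` in scikit-learn. The sketch below applies it to the five rows printed above (values copied from the output at its printed precision) and checks that it agrees with the manual formula.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Rows copied from the 'Original data' output above
X = np.array([
    [322.0, 126.0, 8.3252, 452600.0],
    [2401.0, 1138.0, 8.3014, 358500.0],
    [496.0, 177.0, 7.2574, 352100.0],
    [558.0, 219.0, 5.6431, 341300.0],
    [565.0, 259.0, 3.8462, 342200.0],
])

scaler = MinMaxScaler()             # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)  # learns per-column min/max, then scales

# Manual formula for comparison
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.allclose(X_scaled, X_manual))

# The fitted scaler can undo the scaling
X_back = scaler.inverse_transform(X_scaled)
print(np.allclose(X_back, X))
```

A fitted scaler object also makes it easy to reuse the same minimum and maximum on new data through `scaler.transform`, instead of recomputing them.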


Z-Score Normalization
- Good for normal distribution

$$x' = \frac{x - \bar{x}}{\sigma}$$

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
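
A minimal sketch of z-score normalization, assuming the same five rows used in the min-max example (values copied from the printed output). It applies the formulas above per column with NumPy and compares the result with scikit-learn's `StandardScaler`, which also uses the 1/n variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rows copied from the 'Original data' output of the min-max example
X = np.array([
    [322.0, 126.0, 8.3252, 452600.0],
    [2401.0, 1138.0, 8.3014, 358500.0],
    [496.0, 177.0, 7.2574, 352100.0],
    [558.0, 219.0, 5.6431, 341300.0],
    [565.0, 259.0, 3.8462, 342200.0],
])

mean = X.mean(axis=0)   # per-column mean
sigma = X.std(axis=0)   # ddof=0 by default, i.e. the 1/n variance
X_z = (X - mean) / sigma

# Library equivalent
X_sk = StandardScaler().fit_transform(X)
print(np.allclose(X_z, X_sk))

# After scaling, each column has mean 0 and standard deviation 1
print(X_z.mean(axis=0))
print(X_z.std(axis=0))
```

Unlike min-max scaling, the result is not bounded to [0, 1]; values express how many standard deviations a sample lies from the column mean.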