# Chapter 1 Data Preprocessing 

* A dataset has to be preprocessed before it is fed into ML algorithms
* A dataset is a set of attributes, comprising dependent variables ($D$) and independent variables ($I$)
* We assume there exists a map $f$ such that $f: I \rightarrow D$
* ML algorithms try to approximate $f$

## Import Libraries 
* NumPy: numeric and mathematical computation
* matplotlib: plotting
* pandas: data import and management

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 

## Importing the dataset
For testing purposes I'm using an open-source dataset from https://www.superdatascience.com/pages/machine-learning, stored as `ds/Data.csv`.

In [2]:
dataset=pd.read_csv('ds/Data.csv')

In [3]:
dataset.head(5)

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


### Separate Dependent and Independent Variables

In [4]:
# Independent variables
X = dataset.iloc[:, :-1].values   # [all rows, all columns except the last one]

# Dependent variable
Y = dataset.iloc[:, -1].values    # [all rows, last column only]
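
The slicing above can be checked on a tiny table. This is a minimal sketch with made-up data: the `toy` frame and the names `X_demo` and `Y_demo` are hypothetical and not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical two-row table; the last column plays the role of the dependent variable
toy = pd.DataFrame({
    'Country': ['France', 'Spain'],
    'Age': [44.0, 27.0],
    'Purchased': ['No', 'Yes'],
})

X_demo = toy.iloc[:, :-1].values   # all rows, every column except the last
Y_demo = toy.iloc[:, -1].values    # all rows, last column only

print(X_demo.shape, Y_demo.shape)
```

`X_demo` stays two-dimensional (rows by features), while `Y_demo` is a one-dimensional array with one entry per row.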

In [5]:
pd.DataFrame(X)

Unnamed: 0,0,1,2
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,
5,France,35.0,58000.0
6,Spain,,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


In [6]:
pd.DataFrame(Y)

Unnamed: 0,0
0,No
1,Yes
2,No
3,No
4,Yes
5,Yes
6,No
7,Yes
8,No
9,Yes


## Dealing with missing data 
Sometimes the dataset contains missing values. There are two common strategies:
1. Delete the rows with missing data (dangerous, since it discards information)
2. Fill each missing value with the mean of its attribute (preferred)
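
The two strategies can be compared on a toy table. The sketch below is illustrative only: the `toy` frame and its values are made up, and it uses pandas' `dropna` and `fillna` rather than a scikit-learn imputer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Age and one missing Salary
toy = pd.DataFrame({
    'Age': [44.0, 27.0, np.nan, 38.0],
    'Salary': [72000.0, np.nan, 54000.0, 61000.0],
})

# Strategy 1: drop every row that contains a missing value
dropped = toy.dropna()

# Strategy 2: replace each missing value with the mean of its column
filled = toy.fillna(toy.mean())

print(len(toy), len(dropped))   # number of rows before and after dropping
print(filled)
```

Dropping removes half of the rows in this toy table, which is why deletion is risky on small datasets.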

In [7]:
# import the class
# (Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer)
from sklearn.preprocessing import Imputer

# create an object
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

# fit the imputer on columns 1 and 2 (Age, Salary); the slice 1:3 excludes index 3
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])

pd.DataFrame(X)

Unnamed: 0,0,1,2
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.8
5,France,35.0,58000.0
6,Spain,38.7778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


The imputer object takes the following arguments:
* `missing_values`: the placeholder that marks a missing entry and will be replaced
* `strategy`: `mean` by default; the other strategies are `median` and `most_frequent`
* `axis`: 0 = impute along columns (vertical), 1 = impute along rows (horizontal)
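
In scikit-learn 0.22 and later the `Imputer` class is no longer available; `sklearn.impute.SimpleImputer` is its replacement. It has no `axis` argument and always imputes column-wise. A minimal sketch of the same mean imputation, assuming a recent scikit-learn and using made-up rows (`X_demo` is hypothetical) laid out like `X`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical rows laid out like X: [Country, Age, Salary]
X_demo = np.array([
    ['France', 44.0, 72000.0],
    ['Spain', 27.0, np.nan],
    ['Germany', np.nan, 54000.0],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Cast the numeric columns explicitly, since X_demo has object dtype
numeric = X_demo[:, 1:3].astype(float)
X_demo[:, 1:3] = imputer.fit_transform(numeric)

print(X_demo)
```

The missing Age becomes the mean of the observed ages and the missing Salary becomes the mean of the observed salaries.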

## Encode Categorical Data

Since ML models are based on numeric computation, it is necessary to encode any string value as a number.

In [8]:
from sklearn.preprocessing import LabelEncoder

#create an object 
le_X = LabelEncoder()
X[:,0] = le_X.fit_transform(X[:,0])   # transform 0th col of X and replace original
pd.DataFrame(X)

Unnamed: 0,0,1,2
0,0,44.0,72000.0
1,2,27.0,48000.0
2,1,30.0,54000.0
3,2,38.0,61000.0
4,1,40.0,63777.8
5,0,35.0,58000.0
6,2,38.7778,52000.0
7,0,48.0,79000.0
8,1,50.0,83000.0
9,0,37.0,67000.0


In [9]:
# use a separate encoder for Y so that le_X keeps the country classes
le_Y = LabelEncoder()
Y = le_Y.fit_transform(Y)
pd.DataFrame(Y)

Unnamed: 0,0
0,0
1,1
2,0
3,0
4,1
5,1
6,0
7,1
8,0
9,1


* This may lead to another problem: label encoding assigns an integer to each distinct category, so the codes form an ordered list. The model may then infer an order or correlation between categories that makes no sense when the categories are not ordinal.

* In such a case we use __Dummy Encoding__, where each category is treated as a separate column and encoded accordingly
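
The ordering problem is easy to see on a short list. In this sketch the `countries` list is made up for illustration; `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of country labels
countries = ['France', 'Spain', 'Germany', 'Spain']

le = LabelEncoder()
codes = le.fit_transform(countries)

print(list(le.classes_))                   # categories, sorted alphabetically
print(list(codes))                         # integer code assigned to each entry
print(list(le.inverse_transform(codes)))   # codes mapped back to labels
```

The codes follow the alphabetical order of the labels, so Spain (2) looks "larger" than Germany (1) and France (0), although no such ranking exists between countries.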

### One Hot Encoding 
One-hot encoding is used to perform "Dummy Encoding". The encoder object has the following argument:
* `categorical_features`: specifies which columns to encode

In [10]:
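from sklearn.preprocessing import OneHotEncoder
# categorical_features lists the columns to encode (the argument was removed in scikit-learn 0.22)
OHE_X = OneHotEncoder(categorical_features=[0])
X = OHE_X.fit_transform(X).toarray()   # convert the sparse result to a dense array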

pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4
0,1.0,0.0,0.0,44.0,72000.0
1,0.0,0.0,1.0,27.0,48000.0
2,0.0,1.0,0.0,30.0,54000.0
3,0.0,0.0,1.0,38.0,61000.0
4,0.0,1.0,0.0,40.0,63777.777778
5,1.0,0.0,0.0,35.0,58000.0
6,0.0,0.0,1.0,38.777778,52000.0
7,1.0,0.0,0.0,48.0,79000.0
8,0.0,1.0,0.0,50.0,83000.0
9,1.0,0.0,0.0,37.0,67000.0


* Use `LabelEncoder` if the variable is binary (yes/no) or ordinal categorical
* Use one-hot encoding (OHE) if the variable is categorical with no natural order
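
On scikit-learn versions where `OneHotEncoder` no longer accepts `categorical_features`, the same dummy encoding can be expressed with `ColumnTransformer`. This is a sketch under that assumption, not the code this notebook was run with; `X_demo` and the transformer name `'country'` are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows laid out like X: [Country, Age, Salary]
X_demo = np.array([
    ['France', 44.0, 72000.0],
    ['Spain', 27.0, 48000.0],
    ['Germany', 30.0, 54000.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(), [0])],  # one-hot encode column 0
    remainder='passthrough',                           # keep Age and Salary unchanged
    sparse_threshold=0,                                # always return a dense array
)

# The passthrough columns arrive with object dtype, so cast the result to float
X_encoded = ct.fit_transform(X_demo).astype(float)
print(X_encoded)
```

The one-hot columns come first, in alphabetical order of the categories (France, Germany, Spain), followed by the untouched numeric columns.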

## Train Test Split 

* ML algorithms learn a model from a dataset
* A model that performs well on the data it was trained on but poorly on different data is of little use
* This occurs when the model memorises the training data instead of learning the underlying concept (overfitting)
* Use a train/test split to detect this

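`train_test_split` options
1. `test_size`: fraction of the data placed in the test set (typically 0.25 to 0.3)
2. `train_size`: fraction of the data placed in the training set; `train_size` + `test_size` = 1
3. `random_state`: seed for the random sampling, which makes the split reproducible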

In [11]:
# import library (sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it)
from sklearn.model_selection import train_test_split

# perform splitting
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 5)
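
The effect of `test_size` and `random_state` can be checked on synthetic data. A minimal sketch, assuming `train_test_split` from `sklearn.model_selection`; the arrays `X_demo` and `Y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and 10 labels
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.arange(10)

# test_size=0.2 places 2 of the 10 samples in the test set
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=5)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces exactly the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=5)
print(np.array_equal(X_te, X_te_again))
```

Rows of `X_demo` and entries of `Y_demo` are shuffled together, so each sample keeps its own label after the split.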



### Split Datasets

In [12]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4
0,0.0,1.0,0.0,30.0,54000.0
1,0.0,1.0,0.0,40.0,63777.777778
2,1.0,0.0,0.0,48.0,79000.0
3,0.0,0.0,1.0,27.0,48000.0
4,1.0,0.0,0.0,44.0,72000.0
5,0.0,1.0,0.0,50.0,83000.0
6,0.0,0.0,1.0,38.777778,52000.0
7,0.0,0.0,1.0,38.0,61000.0


In [13]:
pd.DataFrame(X_test)

Unnamed: 0,0,1,2,3,4
0,1.0,0.0,0.0,37.0,67000.0
1,1.0,0.0,0.0,35.0,58000.0


In [14]:
pd.DataFrame(Y_train)

Unnamed: 0,0
0,0
1,1
2,1
3,1
4,0
5,0
6,0
7,0


In [15]:
pd.DataFrame(Y_test)

Unnamed: 0,0
0,1
1,1


## Feature Scaling

* Numeric attributes are often not on the same scale (e.g. Age and Salary)
* ML models can perform badly when the scales are mismatched, because many of them rely on Euclidean distance
* A variable with a large scale may dominate those with smaller scales, thus introducing bias
* There are two common scaling methods
> 1. Standardisation $X_{stand} = \frac{x - \text{mean}(X)}{\text{SD}(X)}$
> 2. Normalisation $X_{norm} = \frac{x - \min(X)}{\max(X) - \min(X)}$
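
Both formulas can be applied by hand with NumPy. The sketch below uses made-up Age values (`age` is hypothetical) to show what each transformation produces.

```python
import numpy as np

# Hypothetical Age values
age = np.array([27.0, 30.0, 38.0, 44.0, 50.0])

# Standardisation: subtract the mean, divide by the standard deviation
age_stand = (age - age.mean()) / age.std()

# Normalisation: rescale the values to the range [0, 1]
age_norm = (age - age.min()) / (age.max() - age.min())

print(age_stand.mean(), age_stand.std())   # approximately 0 and 1
print(age_norm.min(), age_norm.max())
```

`np.std` computes the population standard deviation by default (`ddof=0`), which is also the convention used by scikit-learn's `StandardScaler`.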

In [16]:
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
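
Note that the scaler is fitted on the training set only; the test set is transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing. A small sketch with made-up numbers (`train` and `test` are hypothetical) that checks this against the standardisation formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test sets
train = np.array([[30.0], [40.0], [50.0]])
test = np.array([[35.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)   # learns mean and SD from train, then scales it
test_scaled = sc.transform(test)         # reuses the statistics learned from train

# Manual standardisation of the test set using the training statistics
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```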

### Facts
1. Feature scaling for dummy variables: depends on the scenario; for non-ordinal data it is not needed
2. Feature scaling for dependent variables: not needed for classification, but may be needed for regression

In [17]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4
0,-0.57735,1.290994,-0.774597,-1.259796,-0.8389
1,-0.57735,1.290994,-0.774597,0.070194,-0.02654
2,1.732051,-0.774597,-0.774597,1.134186,1.238156
3,-0.57735,-0.774597,1.290994,-1.658793,-1.337393
4,1.732051,-0.774597,-0.774597,0.60219,0.65658
5,-0.57735,1.290994,-0.774597,1.400184,1.570485
6,-0.57735,-0.774597,1.290994,-0.09236,-1.005064
7,-0.57735,-0.774597,1.290994,-0.195804,-0.257324


In [18]:
pd.DataFrame(X_test)

Unnamed: 0,0,1,2,3,4
0,1.732051,-0.774597,-0.774597,-0.328803,0.241169
1,1.732051,-0.774597,-0.774597,-0.594801,-0.506571
