# Data Preprocessing

- Before applying any machine learning algorithm, we need to carry out these data preprocessing steps:
    - Import the dataset from the CSV file.  
    - Split the feature variables (X) from the dependent variable (Y).  
    - Replace missing data.  
    - Replace string or categorical data.  
    - Split the dataset into training and test sets.  
    - Apply feature scaling.  

## Import dataset from the CSV file. 

- The dataset is stored as a CSV file. With the help of the pandas module, we can read the CSV into a variable of type DataFrame.
- A DataFrame is a tabular type provided by pandas that lets us do complex operations on a dataset, such as selecting, filtering, and aggregating columns.
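
As a minimal sketch of what reading a CSV gives us, the example below loads a few rows from an in-memory string instead of a file. The `csv_text` sample and the names `df` and `missing_per_column` are assumptions made for illustration; they are not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical inline sample that mimics the layout of Data.csv.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,40,,Yes
Spain,,52000,No
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # empty cells turn numeric columns into floats holding NaN

# Count the missing cells in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Checking `isna().sum()` right after loading is a quick way to see which columns will need the missing-data step.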

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
dataset = pd.read_csv(r"../dataset/Data.csv")
print(dataset)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


## Split Feature variable (X) & Dependent variable (Y).  

- The feature variables (also called independent or input variables) are represented as X in the code; they are used to predict Y.
- The dependent variable (also called the output variable) is represented as Y in the code; it is the value we expect to predict from X.  
- Most commonly the last column is the dependent variable Y and the remaining columns are the feature variables X.
- So we split the given dataset into X and Y.
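
The slicing behind this split can be sketched on a tiny hand-built DataFrame. The names `df`, `X_demo`, `Y_demo`, and `X_by_name` are assumptions for illustration; selecting by column name is shown only as an alternative for datasets where the target is not the last column.

```python
import pandas as pd

# Small stand-in for the dataset used in this notebook.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# iloc[rows, columns]: ":" keeps every row, ":-1" keeps all columns but the last.
X_demo = df.iloc[:, :-1].values
# "-1" selects only the last column, giving a 1-D array.
Y_demo = df.iloc[:, -1].values

# Alternative: select the features by dropping the target column by name.
X_by_name = df.drop(columns=["Purchased"]).values

print(X_demo.shape, Y_demo.shape)
```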

In [2]:
X = dataset.iloc[:, :-1].values # [row, column]
Y = dataset.iloc[:, -1].values
print("X", X, "Y", Y, sep="\n")

X
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
Y
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Replace missing data
- As you can see, some cells are empty and are shown as ```nan```.
- We need to either eliminate those rows or replace the empty cells using a ```strategy```.
- ```SimpleImputer``` supports several values for ```strategy```: ["mean", "median", "most_frequent", "constant"].
- Here we use the mean strategy, which replaces each missing value with the mean of its column.
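
To see how the strategies differ, here is a small sketch that fills the same missing value with each one. The `ages` array and the `results` dictionary are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a single missing value at row index 3.
ages = np.array([[27.0], [30.0], [30.0], [np.nan], [50.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(ages)
    results[strategy] = float(filled[3, 0])  # the value that replaced NaN

# The "constant" strategy needs an explicit fill_value.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
results["constant"] = float(constant_imputer.fit_transform(ages)[3, 0])

print(results)
```

On this sample the mean is 34.25 while the median and most frequent value are both 30.0, so the choice of strategy can matter when a column is skewed.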

In [3]:
from sklearn.impute import SimpleImputer
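# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Columns 1 and 2 (Age, Salary) are numeric; column 0 (Country) is a string and is skipped.
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])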
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Replace string data
- Most algorithms cannot work with string data directly, so we need to convert it into a numeric format that the algorithm understands.
- Our dataset has two string columns: ["Country", "Purchased"].
- ```Country``` is categorical string data in which only 3 countries repeat again and again.
- ```Purchased``` is label string data that behaves much like boolean data.
- So we replace the Country column with a vector, e.g. ```(0, 0, 1)```, with the help of ```ColumnTransformer``` & ```OneHotEncoder```, and similarly replace the Purchased column with ```0 & 1``` with the help of ```LabelEncoder```.

### Transform the feature's Country column
- Since Country is categorical, we convert it into a vector, ```Example : (1.0 0.0 0.0) for France```.
- We use sklearn's ```ColumnTransformer``` to transform a selected column of the features, with ```OneHotEncoder``` as the preprocessor that converts the string data into a vector.
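
A minimal sketch of the encoder on its own, without `ColumnTransformer`, shows where the vector positions come from. The `countries` array and the names `encoder` and `encoded` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column; the encoder expects 2-D input.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(countries).toarray()

# Categories are sorted, and each one gets its own output column in that order.
print(encoder.categories_[0])
print(encoded)
```

Because the categories are sorted alphabetically, France maps to the first position, Germany to the second, and Spain to the third, which matches the vectors printed by the notebook cell.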


In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (Country); pass the remaining columns through unchanged.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Transform the dependent variable's Purchased column
- Since the Purchased column is effectively boolean, we can label its values as 0 & 1.
- For this label encoding we use sklearn's ```LabelEncoder```, which learns the distinct labels and replaces each one with an integer.
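
The mapping that `LabelEncoder` learns can be inspected and reversed, as in this small sketch. The `labels` array and the names `le_demo`, `codes`, and `restored` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "Yes"])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ holds the sorted labels; a label's index is its integer code.
print(le_demo.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original strings.
restored = le_demo.inverse_transform(codes)
print(restored)
```

Keeping the fitted encoder around is useful later, since model predictions come back as 0 and 1 and can be converted to "No" and "Yes" with `inverse_transform`.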

In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Y = le.fit_transform(Y)
print(Y)

[0 1 0 0 1 1 0 1 0 1]


## Split dataset into training and test set

In [6]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # pass random_state=1 to get the same shuffled split on every run

In [7]:
print("X Train", x_train, "Y Train", y_train, sep="\n")

X Train
[[0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 37.0 67000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [0.0 1.0 0.0 30.0 54000.0]]
Y Train
[0 1 0 1 1 1 1 0]


In [8]:
print("X Test", x_test, "Y Test", y_test, sep="\n")

X Test
[[0.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 0.0 44.0 72000.0]]
Y Test
[0 0]
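
The split above is shuffled randomly, so the rows shown will differ from run to run. The sketch below, on made-up arrays `X_demo` and `Y_demo`, illustrates two options of `train_test_split`: `random_state` fixes the shuffle so the split is repeatable, and `shuffle=False` turns shuffling off entirely.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten samples with two features each, plus ten labels.
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# The same random_state gives the same shuffled split every time.
split_a = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=1)
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)

# shuffle=False keeps the original order: the last 20% of rows become the test set.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, shuffle=False
)
print(x_te)
```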


## Feature scaling

In [9]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
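# Scale only the large-valued columns 3 and 4 (Age, Salary), which fall outside the range (-3, 3).
# Fit on the training set only, then reuse the same mean and standard deviation for the test set.
x_train[:, 3:5] = sc.fit_transform(x_train[:, 3:5])
x_test[:, 3:5] = sc.transform(x_test[:, 3:5])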

In [10]:
print(x_train)

[[0.0 1.0 0.0 1.582353946227601 1.6787007930711877]
 [1.0 0.0 0.0 -0.4329081551000039 -0.42992546479911053]
 [0.0 0.0 1.0 0.0746393370862078 -0.9359957666879821]
 [1.0 0.0 0.0 -0.16420654158965658 0.32917998803419685]
 [1.0 0.0 0.0 1.3136523327172536 1.34132059181194]
 [0.0 0.0 1.0 -1.507714609141393 -1.2733759679472298]
 [0.0 1.0 0.0 0.2388458786758644 0.05740149257535867]
 [0.0 1.0 0.0 -1.1046621888758723 -0.7673056660583583]]


In [11]:
print(x_test)

[[0.0 0.0 1.0 -0.02985573483448293 -0.17689031385467474]
 [1.0 0.0 0.0 0.776249105696559 0.7509052396082565]]
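
What `StandardScaler` computes can be checked by hand: each value has the column mean subtracted and is divided by the column standard deviation, both learned from the training set. The arrays `train` and `test` below are made-up numbers for illustration, not the split produced above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age and Salary columns.
train = np.array([
    [27.0, 48000.0],
    [30.0, 54000.0],
    [50.0, 83000.0],
    [35.0, 58000.0],
])
test = np.array([[38.0, 61000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual standardization: z = (x - mean) / std, using the population std (ddof=0).
manual_train = (train - train.mean(axis=0)) / train.std(axis=0)
manual_test = (test - scaler.mean_) / scaler.scale_

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```

After scaling, each training column has a mean of about 0 and a standard deviation of about 1. The test rows are scaled with the training statistics, so their columns will generally not have exactly those values.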
