
# Data Preprocessing

### Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing the dataset


In [None]:
dataset = pd.read_csv('Data.csv')

In [None]:
dataset

,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


### Selecting Independent variables

In [None]:
X = dataset.iloc[:, :-1].values

In [None]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

### Selecting Dependent variable

In [None]:
y = dataset.iloc[:, -1].values

In [None]:
# Taking care of missing data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
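
As a rough check of what the mean strategy does, the Age column can be imputed by hand with NumPy. This is only a sketch: the variable names `ages`, `age_mean`, and `ages_imputed` are illustrative and not part of the notebook.

```python
import numpy as np

# Age column from the dataset above, with the missing entry as NaN
ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0])

# Mean of the observed values only (NaN entries are ignored)
age_mean = np.nanmean(ages)

# Replace each NaN with that mean, leaving observed values untouched
ages_imputed = np.where(np.isnan(ages), age_mean, ages)

print(age_mean)         # approximately 38.78
print(ages_imputed[6])  # the previously missing entry now holds the mean
```

The value agrees with the imputed Age shown in the printed `X` above.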


In [None]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

## Encoding categorical data

### Encoding the Independent Variable

If we map France, Spain, and Germany to 0, 1, and 2, an ML model may interpret the codes as an ordered numeric feature and assume that the order matters.
The countries have no such order, so the model could pick up a spurious correlation.
One-hot encoding avoids this by converting the Country column into three binary columns, one per country.
Hence, no numerical order is implied.
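
The difference between integer codes and one-hot vectors can be sketched with plain NumPy. This is an illustration under assumed names (`countries`, `codes`, `one_hot`), not the encoder used in this notebook.

```python
import numpy as np

countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

# Integer codes: categories are sorted alphabetically, so
# France -> 0, Germany -> 1, Spain -> 2. The numbers imply an order that does not exist.
categories, codes = np.unique(countries, return_inverse=True)

# One-hot encoding: row i of the identity matrix is the binary vector for category i
one_hot = np.eye(len(categories))[codes]

print(codes)
print(one_hot)
```

Each row of `one_hot` contains a single 1, so no country is numerically "larger" than another.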

In [None]:
from sklearn.compose import ColumnTransformer  # applies transformers to selected columns
from sklearn.preprocessing import OneHotEncoder  # one-hot encodes categorical features

# Create the ColumnTransformer object ct.
# transformers is a list of (name, transformer, columns) tuples:
# 'encoder' is just a label, OneHotEncoder() is the transformation, and [0] selects the Country column.
# remainder='passthrough' keeps the columns we do not transform (Age, Salary) instead of dropping them.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# fit_transform learns the categories and applies the one-hot encoding.
# The result is not guaranteed to be a NumPy array (it can be a sparse matrix), so we wrap it in np.array.
X = np.array(ct.fit_transform(X))
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [None]:
from sklearn.preprocessing import LabelEncoder # importing labelencoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]


#### When should we do feature scaling? Is it before or after splitting the dataset?

Feature scaling should be done after splitting the dataset. If we scale before splitting, the mean and standard deviation are computed from all the values, including the ones from the test set. The test set stands in for new, unseen observations, so training is not supposed to have any information from it. Using all values for feature scaling would therefore leak information from the test set into training.
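
A small numeric sketch of the leakage, using the Salary values that end up in the training and test sets later in this notebook. The array names are illustrative.

```python
import numpy as np

# Salary values (after imputation) as they appear in the split shown later
salary_train = np.array([52000.0, 63777.77777777778, 72000.0, 61000.0,
                         48000.0, 79000.0, 83000.0, 58000.0])
salary_test = np.array([54000.0, 67000.0])

# Statistic computed the right way: from the training rows only
mean_train = salary_train.mean()

# Statistic computed before splitting: the test rows influence it
mean_all = np.concatenate([salary_train, salary_test]).mean()

print(mean_train)  # approximately 64597.22
print(mean_all)    # approximately 63777.78
```

The two means differ, which shows that a scaler fitted before the split carries information about the test rows.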

## Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
# test_size = 0.2 means 20 percent of the data goes into the test set
# random_state = 1 fixes the seed of the random shuffle,
# so we get the same split every time we run this cell.
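
To see the effect of fixing the seed, the split can be repeated on a stand-in array. This is a minimal sketch: `data`, `test_a`, and `test_b` are hypothetical names and the array simply holds ten row indices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)  # stand-in for ten row indices

# Same seed, same split
_, test_a = train_test_split(data, test_size=0.2, random_state=1)
_, test_b = train_test_split(data, test_size=0.2, random_state=1)

print(test_a, test_b)
print(len(test_a))  # 2, i.e. 20 percent of 10 rows
```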


In [None]:
X_train

array([[0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)

In [None]:
X_test

array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [None]:
y_train

array([0, 1, 0, 0, 1, 1, 0, 1])

In [None]:
y_test

array([0, 1])

## Feature Scaling

#### Standardization
$X = \frac{X_i - X_{mean}}{SD}$
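
The formula can be applied by hand to the Age column of the training set. This sketch assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses. The variable names are illustrative.

```python
import numpy as np

# Age values from the training and test sets shown above
ages_train = np.array([38.77777777777778, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])
ages_test = np.array([30.0, 37.0])

mean = ages_train.mean()
sd = ages_train.std()  # population SD (ddof=0)

z_train = (ages_train - mean) / sd
z_test = (ages_test - mean) / sd  # reuse the training mean and SD

print(z_train[0])  # approximately -0.1916
print(z_test[0])   # approximately -1.4662
```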

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

# fit computes the mean and standard deviation of each column
# transform applies the standardization using those values
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

# For the test set we call only transform, because the test set stands in for new data.
# It must be scaled with the same scaler (training mean and SD) that the ML model is trained with.
# Calling fit again would recompute the mean and SD from the test data and give a different scaler.
X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [None]:
X_train

array([[0.0, 0.0, 1.0, -0.1915918438457856, -1.0781259408412427],
       [0.0, 1.0, 0.0, -0.014117293757057902, -0.07013167641635401],
       [1.0, 0.0, 0.0, 0.5667085065333239, 0.6335624327104546],
       [0.0, 0.0, 1.0, -0.3045301939022488, -0.30786617274297895],
       [0.0, 0.0, 1.0, -1.901801144700799, -1.4204636155515822],
       [1.0, 0.0, 0.0, 1.1475343068237056, 1.2326533634535488],
       [0.0, 1.0, 0.0, 1.4379472069688966, 1.5749910381638883],
       [1.0, 0.0, 0.0, -0.7401495441200352, -0.5646194287757336]],
      dtype=object)

In [None]:
X_test

array([[0.0, 1.0, 0.0, -1.4661817944830127, -0.9069571034860731],
       [1.0, 0.0, 0.0, -0.44973664397484425, 0.20564033932253029]],
      dtype=object)

#### Min-max scaler
$X = \frac{X_i - \min(X)}{\max(X) - \min(X)}$
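
As a worked instance of the formula, the Salary column of the training set can be rescaled by hand. This is a sketch with illustrative variable names. The minimum and maximum come from the training rows only.

```python
import numpy as np

# Salary values from the training and test sets shown above
salary_train = np.array([52000.0, 63777.77777777778, 72000.0, 61000.0,
                         48000.0, 79000.0, 83000.0, 58000.0])
salary_test = np.array([54000.0, 67000.0])

lo = salary_train.min()  # 48000.0
hi = salary_train.max()  # 83000.0

s_train = (salary_train - lo) / (hi - lo)
s_test = (salary_test - lo) / (hi - lo)  # reuse the training min and max

print(s_train[0])  # approximately 0.1143
print(s_test[0])   # approximately 0.1714
```

Training values land in the range 0 to 1. Test values can fall outside that range if they lie beyond the training minimum or maximum.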

In [None]:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()

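# fit computes the minimum and maximum of each column
# transform rescales the values using that minimum and maximum
# Note: X_train was already standardized above. Min-max scaling is unaffected by that
# linear rescaling, so the result equals min-max scaling of the original values.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

# For the test set we call only transform, because the test set stands in for new data.
# It must be scaled with the same scaler (training min and max) that the ML model is trained with.
# Calling fit again would recompute the min and max from the test data and give a different scaler.
X_test[:, 3:] = sc.transform(X_test[:, 3:])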

In [None]:
X_train

array([[0.0, 0.0, 1.0, 0.5120772946859904, 0.11428571428571427],
       [0.0, 1.0, 0.0, 0.5652173913043479, 0.4507936507936508],
       [1.0, 0.0, 0.0, 0.7391304347826088, 0.6857142857142856],
       [0.0, 0.0, 1.0, 0.47826086956521746, 0.37142857142857133],
       [0.0, 0.0, 1.0, 0.0, 0.0],
       [1.0, 0.0, 0.0, 0.9130434782608696, 0.8857142857142857],
       [0.0, 1.0, 0.0, 1.0, 0.9999999999999998],
       [1.0, 0.0, 0.0, 0.34782608695652184, 0.28571428571428564]],
      dtype=object)

In [None]:
X_test

array([[0.0, 1.0, 0.0, 0.13043478260869568, 0.17142857142857137],
       [1.0, 0.0, 0.0, 0.4347826086956522, 0.5428571428571427]],
      dtype=object)