# Data Preprocessing Tools

## Importing the libraries

In [651]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

### independent vs dependent variables


Generally, the first columns of a dataset are the independent variables (features): the columns we use to predict the dependent variable.

The dependent variable is usually the last column: the output we want to predict from the features.

In the given dataset, Data.csv, the columns Country, Age and Salary are the features, and the last column, Purchased, is the outcome.

So, given the country, age and salary of a customer, we predict whether a purchase was made or not.
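
As a small illustration of this layout, the sketch below builds a three-row array by hand and slices it the same way `iloc[:, :-1]` and `iloc[:, -1]` do. The names `rows`, `X_demo` and `y_demo` are made up for the example, and the values are copied from the first three rows of Data.csv as printed in these notes.

```python
import numpy as np

# Hypothetical in-memory copy of the first three rows of Data.csv:
# Country, Age, Salary, Purchased
rows = np.array([
    ['France', 44.0, 72000.0, 'No'],
    ['Spain', 27.0, 48000.0, 'Yes'],
    ['Germany', 30.0, 54000.0, 'No'],
], dtype=object)

X_demo = rows[:, :-1]   # every column except the last -> features
y_demo = rows[:, -1]    # only the last column -> dependent variable

print(X_demo.shape)
print(y_demo)
```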

### using pandas to import dataset

In [652]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [653]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [654]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [655]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [656]:
print(X)                                

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


### Practice

#### verifying that sklearn fills in the mean

In [657]:
prac_X = dataset.iloc[:, :-1].values
print("prac_X[:,1]:", prac_X[:, 1])
ages = [v for v in prac_X[:, 1] if str(v) != 'nan']
print("ages:", ages)
print("mean(ages):", np.mean(ages))

prac_X[:,1]: [44.0 27.0 30.0 38.0 40.0 35.0 nan 48.0 50.0 37.0]
ages: [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 48.0, 50.0, 37.0]
mean(ages): 38.77777777777778


In [658]:
print("prac_X[:,2]:", prac_X[:, 2])
salaries = [v for v in prac_X[:, 2] if str(v) != 'nan']
print("salaries:", salaries)
print("mean(salaries):", np.mean(salaries))

prac_X[:,2]: [72000.0 48000.0 54000.0 61000.0 nan 58000.0 52000.0 79000.0 83000.0
 67000.0]
salaries: [72000.0, 48000.0, 54000.0, 61000.0, 58000.0, 52000.0, 79000.0, 83000.0, 67000.0]
mean(salaries): 63777.77777777778


#### How are multiple nan values handled?


Every nan is replaced by the same mean. Having more nan values does not change how the mean is computed; it is always

sum of valid values / number of valid values
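
The formula can be checked by hand with NumPy alone. This is only a sketch of the arithmetic, not scikit-learn's implementation; `marks_demo`, `valid` and `filled` are names invented for the example.

```python
import numpy as np

marks_demo = np.array([1, 2, 3, np.nan, 4, np.nan, 5, np.nan])

valid = marks_demo[~np.isnan(marks_demo)]   # drop the missing entries
manual_mean = valid.sum() / len(valid)      # sum of valid values / number of valid values

# np.nanmean computes the same quantity in one call
nan_mean = np.nanmean(marks_demo)

# every nan receives the same mean, however many nan values there are
filled = np.where(np.isnan(marks_demo), manual_mean, marks_demo)

print(manual_mean, nan_mean)
print(filled)
```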

In [659]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
marks = np.array([1, 2, 3, np.nan, 4, np.nan, 5, np.nan])
marks = marks.reshape(len(marks), 1)
print("marks:", marks)
imputer.fit(marks)
marks = imputer.transform(marks)
print('transformation...')
print("marks:", marks)


marks: [[ 1.]
 [ 2.]
 [ 3.]
 [nan]
 [ 4.]
 [nan]
 [ 5.]
 [nan]]
transformation...
marks: [[1.]
 [2.]
 [3.]
 [3.]
 [4.]
 [3.]
 [5.]
 [3.]]


#### using other methods for filling the missing values

In [660]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
marks = np.array([1, 2, 3, np.nan, 4, 5, 6])
marks = marks.reshape(len(marks), 1)
print("marks:", marks)
imputer.fit(marks)
marks = imputer.transform(marks)
print('transformation...')
print("marks:", marks)

# median is taken (3+4)/2


marks: [[ 1.]
 [ 2.]
 [ 3.]
 [nan]
 [ 4.]
 [ 5.]
 [ 6.]]
transformation...
marks: [[1. ]
 [2. ]
 [3. ]
 [3.5]
 [4. ]
 [5. ]
 [6. ]]


In [661]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# 5 is specified twice
marks = np.array([1, 2, 3, np.nan, 4, 5, 6, 7, 5, np.nan])
marks = marks.reshape(len(marks), 1)
print("marks:", marks)
imputer.fit(marks)
marks = imputer.transform(marks)
print('transformation...')
print("marks:", marks)



marks: [[ 1.]
 [ 2.]
 [ 3.]
 [nan]
 [ 4.]
 [ 5.]
 [ 6.]
 [ 7.]
 [ 5.]
 [nan]]
transformation...
marks: [[1.]
 [2.]
 [3.]
 [5.]
 [4.]
 [5.]
 [6.]
 [7.]
 [5.]
 [5.]]


In [662]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=10) # for numeric data, fill_value defaults to 0 when omitted
marks = np.array([1, 2, 3, np.nan, 4, 5, 6, np.nan, 7, 5])
marks = marks.reshape(len(marks), 1)
print("marks:", marks)
imputer.fit(marks)
marks = imputer.transform(marks)
print('transformation...')
print("marks:", marks)


marks: [[ 1.]
 [ 2.]
 [ 3.]
 [nan]
 [ 4.]
 [ 5.]
 [ 6.]
 [nan]
 [ 7.]
 [ 5.]]
transformation...
marks: [[ 1.]
 [ 2.]
 [ 3.]
 [10.]
 [ 4.]
 [ 5.]
 [ 6.]
 [10.]
 [ 7.]
 [ 5.]]
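
The three strategies above can also be reproduced by hand, which makes it easier to see what each one fills in. The sketch below uses NumPy only and is not how scikit-learn implements them; the array names `a`, `b` and `c` are placeholders.

```python
import numpy as np

# median: middle of the valid values, here (3 + 4) / 2
a = np.array([1, 2, 3, np.nan, 4, 5, 6])
median_fill = np.nanmedian(a)

# most frequent: the valid value with the highest count (5 appears twice)
b = np.array([1, 2, 3, np.nan, 4, 5, 6, 7, 5, np.nan])
values, counts = np.unique(b[~np.isnan(b)], return_counts=True)
mode_fill = values[np.argmax(counts)]

# constant: every missing entry gets the same chosen value
c = np.array([1, 2, 3, np.nan, 4])
constant_filled = np.where(np.isnan(c), 10, c)

print(median_fill, mode_fill)
print(constant_filled)
```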


## Encoding categorical data

### Encoding the Independent Variable

#### Explanation

##### Why is simply assigning numerically ordered values to countries bad?

Currently the Country column contains the strings France, Spain and Germany, and we want to turn these strings into numerical values.

One idea would be to encode France as 0, Spain as 1 and Germany as 2.

But with that approach the model could interpret the codes as a numerical order between the three countries and treat that order as meaningful, which of course it is not.

There is no ordering relationship between these 3 countries.

So we want to prevent the model from making such an interpretation, because it could lead to spurious correlations between the features and the outcome we want to predict.
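
A rough way to see the problem is to compare distances. With integer codes, France ends up "closer" to Spain than to Germany; with one binary vector per country, every pair is equally far apart. The sketch below is only an illustration with made-up names (`codes`, `vectors`), not something taken from the dataset.

```python
import itertools
import numpy as np

# Integer codes from the idea above: France=0, Spain=1, Germany=2
codes = {'France': 0, 'Spain': 1, 'Germany': 2}

# One binary vector per country (rows of a 3x3 identity matrix)
vectors = dict(zip(codes, np.eye(3)))

int_gaps, vec_gaps = [], []
for a, b in itertools.combinations(codes, 2):
    int_gaps.append(abs(codes[a] - codes[b]))                  # gap between integer codes
    vec_gaps.append(np.linalg.norm(vectors[a] - vectors[b]))   # gap between binary vectors
    print(a, b, int_gaps[-1], round(vec_gaps[-1], 3))
```

The integer gaps differ from pair to pair, while the vector gaps are all identical, so no artificial order is introduced.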

##### What is One Hot Encoding?

One hot encoding creates a binary vector for each of the countries.

Since the data contains 3 different countries, it creates 3 columns (one per country, in alphabetical order), and the assignment is

France -> 100

Germany -> 010

Spain -> 001

This way each country is uniquely identified without implying any order.
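
The idea can be sketched in a few lines of NumPy. This is not scikit-learn's implementation; it simply sorts the categories alphabetically (France, Germany, Spain), which is also the column order visible in the encoded X printed further down. The array `countries` is a made-up sample.

```python
import numpy as np

countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

# np.unique returns the sorted distinct categories: France, Germany, Spain
categories = np.unique(countries)

# Compare every row against every category -> one 0/1 column per category
one_hot = (countries[:, None] == categories[None, :]).astype(float)

print(categories)
print(one_hot)
```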

#### using OneHotEncoder

In [663]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough') 
X = np.array(ct.fit_transform(X))

In [664]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


#### Explanation

('encoder', OneHotEncoder(), [0])

'encoder' -> a name (label) we choose for this transformation step

OneHotEncoder() -> an instance of the class that performs the encoding

[0] -> indices of the columns we want to apply one hot encoding to

___


remainder='passthrough' -> keep the rest of the columns as they are (don't modify or drop them)

___

`ct.fit_transform(X)` is not guaranteed to return a NumPy array, and we need X to be a NumPy array for the machine learning models used later, so the result is wrapped in `np.array(...)`

#### Practice

Performing one hot encoding on a different column

In [665]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#                                                           column 1 contains name
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
names = np.array([[22, 'rahul'], [19, 'mehak']])
print(names)
names = np.array(ct.fit_transform(names))
print(names)

[['22' 'rahul']
 ['19' 'mehak']]
[['0.0' '1.0' '22']
 ['1.0' '0.0' '19']]


If we don't pass remainder='passthrough', we are left with only the columns produced by one hot encoding; the remaining columns are dropped.

In [666]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])])
names = np.array([[22, 'rahul'], [19, 'mehak']])
print(names)
names = np.array(ct.fit_transform(names))
print(names)

[['22' 'rahul']
 ['19' 'mehak']]
[[0. 1.]
 [1. 0.]]


### Encoding the Dependent Variable

In [667]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [668]:
print(y)

[0 1 0 0 1 1 0 1 0 1]


#### Practice

In [669]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

names = ['rahul', 'mehak', 'rahul', 'mehak']
names = le.fit_transform(names)
print(names)

[1 0 1 0]


In [670]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# LabelEncoder assigns integer codes in sorted (alphabetical) order: aman=0, mehak=1, rahul=2, raja=3
# that implied order is harmless for our dataset, because y contains only Yes and No
names = ['rahul', 'mehak', 'rahul', 'aman', 'raja']
names = le.fit_transform(names)
print(names)

[2 1 2 0 3]


### Conclusion

Use one hot encoding when one of the features in your matrix of features has several categories.

A simple label encoding is enough when there are only two classes, which can be encoded directly into zeros and ones, i.e. a binary outcome.
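
For a binary outcome, label encoding amounts to replacing each label by its position in the sorted list of classes. The sketch below does this with `np.unique` instead of `LabelEncoder`; `y_demo` repeats the Purchased values printed earlier in these notes, and the other names are made up.

```python
import numpy as np

y_demo = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

# classes are the sorted distinct labels; codes are each label's index in that list
classes, codes = np.unique(y_demo, return_inverse=True)

print(classes)
print(codes)
```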

## Splitting the dataset into the Training set and Test set

In [671]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [672]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [673]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [674]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [675]:
print(y_test)

[0 1]


### Explanation

`test_size = 0.2`

is the fraction of entries that goes into the test set (0.2 means 20%)

Generally, 80:20 is a good split.
___

passing

`random_state = 1`

ensures that the `train_test_split` function returns the same entries in the training and test sets every time it is run

It also ensures that our split is the same as the one used by the instructor.
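
The effect of a fixed seed can be illustrated with a small hand-written splitter. This is a hypothetical helper (`simple_split`), not scikit-learn's algorithm, so it will not pick the same rows as `train_test_split`. Rounding the test count up is an assumption that agrees with the sizes printed in these notes (2 of 10 rows, 1 of 4 rows).

```python
import math
import numpy as np

def simple_split(items, test_size, seed):
    """Shuffle indices with a seeded generator, then cut off the test part."""
    rng = np.random.default_rng(seed)            # same seed -> same shuffle
    order = rng.permutation(len(items))
    n_test = math.ceil(test_size * len(items))   # assumption: round the test count up
    test_idx, train_idx = order[:n_test], order[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

data = list(range(10))
first = simple_split(data, test_size=0.2, seed=1)
second = simple_split(data, test_size=0.2, seed=1)

print(first == second)               # same seed, so the two splits are compared for equality
print(len(first[0]), len(first[1]))  # sizes of the training and test parts
```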

### Practice

In [676]:
from sklearn.model_selection import train_test_split
names = [
    ['rahul'],
    ['mehak'],
    ['aman'],
    ['raja']
]
ages = [
    23, 
    19, 
    23,
    20
]
# the value of random_state doesn't matter, as long as it is the same every time
names_train, names_test, ages_train, ages_test = train_test_split(names, ages, test_size = 0.2, random_state = 2)
print("names_train:",names_train)
print("names_test:",names_test)
print("ages_train:",ages_train)
print("ages_test:",ages_test)

names_train: [['raja'], ['mehak'], ['rahul']]
names_test: [['aman']]
ages_train: [20, 19, 23]
ages_test: [23]


In [677]:
names_train, names_test, ages_train, ages_test = train_test_split(names, ages, test_size = 0.5)
print("names_train:",names_train)
print("names_test:",names_test)
print("ages_train:",ages_train)
print("ages_test:",ages_test)

names_train: [['aman'], ['raja']]
names_test: [['rahul'], ['mehak']]
ages_train: [23, 20]
ages_test: [23, 19]


## Feature Scaling

### What is it?

Feature scaling consists of scaling all features so that they take values on the same scale.

We do this to prevent one feature from dominating the others, which would otherwise be neglected by the machine learning model.

___

Feature scaling is required only for some ML models, not for all of them.

For example, regression-based models such as Multiple Linear Regression do not need it:

y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn

The coefficients compensate for features having values at different scales.
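
The compensation can be seen numerically: rescaling a feature changes its coefficient by the inverse factor and leaves the predictions unchanged. The sketch below uses made-up data and plain least squares from NumPy; all names and numbers in it are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=50)
salary = rng.uniform(40_000, 90_000, size=50)
target = 3.0 + 0.5 * age + 0.0002 * salary    # made-up linear relationship

def fit(features):
    # ordinary least squares with an intercept column
    A = np.column_stack([np.ones(len(target)), *features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef, A @ coef

coef_raw, pred_raw = fit([age, salary])
coef_k, pred_k = fit([age, salary / 1000])    # salary expressed in thousands

print(coef_raw[2], coef_k[2])                 # salary coefficient before and after rescaling
print(np.allclose(pred_raw, pred_k))          # predictions compared
```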


### feature scaling methods


![](assets/feature-scaling-methods.jpeg)

Standardisation centres each feature on mean 0 with standard deviation 1. The values are not strictly bounded, but most of them fall roughly between -3 and 3; in the outputs in these notes they stay within about -2 and 2.

Normalisation makes all the values range between 0 and 1,

because x - min(x) >= 0 and max(x) - min(x) > 0, so the ratio is non-negative; and since the denominator >= the numerator, the value is less than or equal to 1.
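
Both methods fit in a line of NumPy each. The sketch assumes the image above shows the usual definitions, (x - mean(x)) / standard deviation(x) for standardisation and (x - min(x)) / (max(x) - min(x)) for normalisation; the function names are made up.

```python
import numpy as np

def standardise(x):
    # (x - mean(x)) / standard deviation(x)
    return (x - x.mean()) / x.std()

def normalise(x):
    # (x - min(x)) / (max(x) - min(x))
    return (x - x.min()) / (x.max() - x.min())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(standardise(x))   # centred on 0, can be negative
print(normalise(x))     # squeezed into the interval from 0 to 1
```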

### Which method to use?


Normalisation is recommended when most of the features follow a normal distribution.

Standardisation does the job in practically all cases.

### implementation

we will use Standardisation below

Note: don't apply feature scaling to the dummy variables (the one hot encoded columns). They already take only the values 0 and 1, so they don't need to be scaled. Moreover, if we standardised them they would take other values (roughly between -2 and 2), which would completely defeat the purpose of one hot encoding: we would lose the binary vectors that uniquely identify each country.
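
To see what would be lost, the sketch below standardises a dummy column by hand. The `france` array repeats the first (France) column of the encoded X printed earlier in these notes; the rest is plain NumPy.

```python
import numpy as np

# France dummy column as it appears in the encoded X of these notes
france = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

scaled = (france - france.mean()) / france.std()   # standardisation by hand

print(np.unique(france))   # distinct values before scaling
print(np.unique(scaled))   # distinct values after scaling
```

After scaling, the column still has two distinct values, but they are no longer 0 and 1, so it stops reading as a simple "is France / is not France" indicator.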

In [678]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train_before_scaling = X_train.copy()
X_test_before_scaling = X_test.copy()

# columns 3 onwards are the ones after the one hot encoded Country columns
# fit computes the mean and standard deviation; transform applies the scaling to each value
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])


# sc.fit_transform is deliberately not called on the test set.
# Reason:
# the features of the test set must be scaled by the same scaler that was fitted on the training set.
# The model is trained on data scaled with the training set's mean and standard deviation, so to make
# predictions consistent with the way the model was trained, we apply that same scaler to the test set.
# Only then do we get the same transformation, and therefore relevant predictions
# when the predict method is applied to X_test.
X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [679]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [680]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]


### Practice

In [681]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
values = np.array([1, 2, 3, 4, 5])
values = values.reshape(len(values), 1)
sc.fit_transform(values)

array([[-1.41421356],
       [-0.70710678],
       [ 0.        ],
       [ 0.70710678],
       [ 1.41421356]])

In [682]:
values = np.array([1, 2, 3, 4, 5])
mean = values.mean()
std = values.std()
print("mean:",mean)
print("std:",std)
scaled_values = [(v - mean)/std for v in values]
print("scaled_values:",scaled_values)

mean: 3.0
std: 1.4142135623730951
scaled_values: [-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]


In [684]:
def custom_scaler(values: np.ndarray):
    mean = values.mean()
    std = values.std()
    scaled_values = [(v - mean)/std for v in values]
    return [scaled_values, mean, std]


print(custom_scaler(np.array([1, 2, 3, 4, 5])))
print("X_train_before_scaling:", X_train_before_scaling)
scaled_values, mean, std = custom_scaler(X_train_before_scaling[:, 3])
print(scaled_values)
print(X_train)

print([(v - mean)/std for v in X_test_before_scaling[:, 3]])
print(X_test)


[[-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095], 3.0, 1.4142135623730951]
X_train_before_scaling: [[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
[-0.19159184384578545, -0.014117293757057777, 0.566708506533324, -0.30453019390224867, -1.9018011447007988, 1.1475343068237058, 1.4379472069688968, -0.7401495441200351]
[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]
[-1.46618

## Why Splitting before Feature Scaling

The test set is supposed to be a brand new set on which you evaluate the machine learning model.

If feature scaling is applied before splitting, information from the test set influences the data the model is trained on.

But the test set is not supposed to be used for training; it should remain completely untouched by (and completely new to) the model, so that it can be used to measure the model's accuracy.

Feature scaling computes the mean and standard deviation of each feature in order to perform the scaling.

So, if we apply feature scaling before the split, it computes the mean and the standard deviation over all the values, including the ones in the test set.

Applying feature scaling before splitting therefore causes information leakage from the test set.

We want our model to be tested on completely new observations.
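
The leakage can be made visible with a few numbers. In the sketch below the ages and the split are hypothetical (the last two values play the role of the test set, one of them deliberately extreme); it compares training values scaled with training-only statistics against the same values scaled with statistics computed before the split.

```python
import numpy as np

# Hypothetical ages; the last two values are treated as the test set
ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 48.0, 50.0, 37.0, 90.0])
train, test = ages[:8], ages[8:]

# Correct order: statistics come from the training set only
mean_train, std_train = train.mean(), train.std()

# Leaky order: statistics computed before the split also see the test values
mean_all, std_all = ages.mean(), ages.std()

train_scaled_ok = (train - mean_train) / std_train
train_scaled_leaky = (train - mean_all) / std_all

print(mean_train, mean_all)
print(np.allclose(train_scaled_ok, train_scaled_leaky))
```

The two means differ, so the scaled training values differ too: the test rows have influenced what the model would be trained on.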

### Conclusion

Do not apply feature scaling before the split. This prevents information leakage from the test set, which you are not supposed to have access to until training is done.