<a href="https://colab.research.google.com/github/techonair/Machine-Learing-A-Z/blob/main/Data%20Preprocessing/Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Preprocessing Tools**

## **Importing Libraries**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## **Importing the Dataset**

In [None]:
from google.colab import files
files.upload()

Saving Data.csv to Data.csv


{'Data.csv': b'Country,Age,Salary,Purchased\r\nFrance,44,72000,No\r\nSpain,27,48000,Yes\r\nGermany,30,54000,No\r\nSpain,38,61000,No\r\nGermany,40,,Yes\r\nFrance,35,58000,Yes\r\nSpain,,52000,No\r\nFrance,48,79000,Yes\r\nGermany,50,83000,No\r\nFrance,37,67000,Yes'}
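`files.upload()` only exists inside Google Colab. As a minimal sketch for running the notebook elsewhere, the same rows shown in the upload output can be read from an in-memory string. The name `CSV_TEXT` is an assumption made for illustration and is not part of the original notebook.

```python
import io

import pandas as pd

# Same rows as the uploaded Data.csv, inlined so no file upload is needed.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
dataset = pd.read_csv(io.StringIO(CSV_TEXT))
print(dataset.shape)
```

Empty fields in the CSV are parsed as `NaN`, which is why `Age` and `Salary` end up as float columns.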

In [None]:
dataset = pd.read_csv('Data.csv')
print(dataset)
dataset.describe()

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


             Age        Salary
count   9.000000      9.000000
mean   38.777778  63777.777778
std     7.693793  12265.579662
min    27.000000  48000.000000
25%    35.000000  54000.000000
50%    38.000000  61000.000000
75%    44.000000  72000.000000
max    50.000000  83000.000000


## **Separating Features and the Dependent Variable**

In [None]:
# Build the matrix of features with columns [Country, Age, Salary]
# Selection: all rows, every column except the last one (:-1)
X = dataset.iloc[:, :-1].values
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
# Build the dependent-variable vector (the result to predict)
# Selection: all rows, last column only (-1)
Y = dataset.iloc[:, -1].values
print(Y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
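The two `iloc` slices above differ in the shape they return. The following is a small sketch on a made-up two-row frame (the names `toy`, `X_toy` and `y_toy` are hypothetical) to show that `[:, :-1]` yields a 2-D matrix while `[:, -1]` yields a 1-D vector.

```python
import pandas as pd

# Hypothetical two-row frame with the target in the last column.
toy = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44, 27],
    "Purchased": ["No", "Yes"],
})

X_toy = toy.iloc[:, :-1].values  # every column except the last -> 2-D array
y_toy = toy.iloc[:, -1].values   # last column only -> 1-D array

print(X_toy.shape, y_toy.shape)
```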


## **Taking care of missing data**

In [None]:
dataset.isna().any()

Country      False
Age           True
Salary        True
Purchased    False
dtype: bool
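`isna().any()` only says whether a column has gaps. As a hedged sketch on a small made-up frame (`demo` is a hypothetical name), `isna().sum()` counts the gaps per column and a boolean mask shows the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age and one missing Salary.
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, np.nan, 30.0],
    "Salary": [72000.0, 48000.0, np.nan],
})

missing_counts = demo.isna().sum()                # missing values per column
rows_with_gaps = demo[demo.isna().any(axis=1)]    # rows with at least one NaN

print(missing_counts)
print(rows_with_gaps)
```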

### Replacing NaN values with mean, median, or mode 

In [None]:
from sklearn.impute import SimpleImputer

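# Create the imputer: every np.nan is replaced by the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit() learns the column means from the numeric columns (Age, Salary);
# transform() then fills the missing entries with those means
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])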
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
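The heading mentions mean, median, or mode, but only the mean is used above. As a sketch, the same `Age` and `Salary` values can be imputed with `strategy="median"` for comparison. The array name `numeric` is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary from Data.csv, with the two gaps written as np.nan.
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

mean_filled = SimpleImputer(strategy="mean").fit_transform(numeric)
median_filled = SimpleImputer(strategy="median").fit_transform(numeric)

# Row 4 had the missing Salary, row 6 the missing Age.
print(mean_filled[4, 1], mean_filled[6, 0])
print(median_filled[4, 1], median_filled[6, 0])
```

The median is less sensitive to extreme values than the mean, which can matter for skewed columns such as salaries.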


## **Encoding Independent Variable**
Country is a categorical column. If it were encoded as plain integers (0, 1, 2), a machine learning model could read an order or pattern into the countries that does not exist. One-hot encoding avoids this by splitting the Country column into three binary columns, one per country, so each row is represented by a vector such as [1, 0, 0].

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Create the transformer: one-hot encode column 0 (Country), keep the remaining columns unchanged
columnTransform = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# fit_transform() may not return a plain NumPy array, which the model training expects, so wrap the result in np.array
X = np.array(columnTransform.fit_transform(X))

In [None]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
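In the output above, the first three columns stand for France, Germany and Spain, in that order. The sketch below, on a few country values only, shows where that order comes from: `OneHotEncoder` sorts the categories and exposes them in `categories_`. The variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# The default result is a sparse matrix; toarray() makes it a dense NumPy array.
one_hot = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # column order of the dummy variables
print(one_hot)
```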


## **Encoding Dependent Variable**
### Converting Y values into binary

In [None]:
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
# Y is the value the model predicts, so it can stay a 1-D vector and needs no np.array wrapping
Y = label.fit_transform(Y)

In [None]:
print(Y)

[0 1 0 0 1 1 0 1 0 1]
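To see which label became which number, and to map predictions back to text later, the fitted encoder can be inspected. This is a short sketch on a few sample answers; `answers`, `encoded` and `decoded` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

answers = np.array(["No", "Yes", "No", "No", "Yes"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(answers)

# classes_ lists the labels in sorted order; the position is the encoded value.
print(encoder.classes_)
print(encoded)

# inverse_transform turns the integers back into the original labels.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```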


## **Splitting the dataset into training set and test set**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size= 0.2, random_state = 1)

In [None]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [None]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [None]:
print(Y_train)

[0 1 0 0 1 1 0 1]


In [None]:
print(Y_test)

[0 1]
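With 10 rows and `test_size=0.2`, the split gives 8 training rows and 2 test rows, and a fixed `random_state` makes it repeatable. The sketch below checks both points on made-up data (`X_demo` and `y_demo` are hypothetical).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)

# The same random_state yields the same split on every call.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```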


## **Feature Scaling**

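Feature scaling is done after splitting the dataset, not before. Scaling first would compute the mean and standard deviation from all rows, so information from the test set would leak into training (data leakage).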

###  **Standardization**
$$y = \frac{x - \text{mean}}{\text{standard deviation}}$$

Most values typically fall between about -3 and 3, although the range is not strictly bounded.

Works in most cases.

### **Normalization**
$$y = \frac{x - \text{min}}{\text{max} - \text{min}}$$

Returns values between 0 and 1.

Works only in specific cases.
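Both formulas can be checked by hand. The following sketch applies them with NumPy to a few ages from the dataset and compares the results with scikit-learn's `StandardScaler` and `MinMaxScaler`. The array name `ages` is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A few ages from the dataset, as a single-column 2-D array.
ages = np.array([[27.0], [35.0], [38.0], [44.0], [50.0]])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# Normalization: subtract the minimum, divide by the range (max - min).
normalized = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

# Both manual results should agree with the library scalers.
print(np.allclose(standardized, StandardScaler().fit_transform(ages)))
print(np.allclose(normalized, MinMaxScaler().fit_transform(ages)))
```

`np.std` uses the population standard deviation by default, which is the same convention `StandardScaler` follows.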

In [None]:
from sklearn.preprocessing import StandardScaler
scaled = StandardScaler()

# Scale only the non-dummy columns (Age, Salary), not the one-hot encoded Country columns
X_train[:, 3:] = scaled.fit_transform(X_train[:, 3:])

# Only transform X_test, using the mean and standard deviation learned from X_train.
# **Do not fit the scaler on the test set**
X_test[:, 3:] = scaled.transform(X_test[:, 3:])

In [None]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [None]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
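As a sanity check on the results above, the sketch below refits a scaler on the unscaled `Age` and `Salary` values of the training rows printed earlier and applies it to the two test rows. The names `train`, `test` and `scaler` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the training and test rows, before scaling.
train = np.array([
    [38.77777777777778, 52000.0], [40.0, 63777.77777777778],
    [44.0, 72000.0], [38.0, 61000.0], [27.0, 48000.0],
    [48.0, 79000.0], [50.0, 83000.0], [35.0, 58000.0],
])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

# Fit on the training rows only, then reuse the fitted scaler for both sets.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Training columns end up with mean 0 and standard deviation 1.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
# Test rows are scaled with the training statistics, so they need not be centered.
print(test_scaled)
```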
