# **Data Preprocessing**

**Steps for data preprocessing:**
    
    1- Acquire the dataset

    2- Import all the crucial libraries

    3- Import the dataset

    4- Identifying and handling the missing values

    5- Encoding the categorical data

    6- Splitting the dataset

    7- Feature scaling

Data preprocessing in Machine Learning is the technique of preparing (cleaning and organizing) raw data to make it suitable for building and training Machine Learning models. It is a crucial step because it improves the quality of the data, which in turn makes it easier to extract meaningful insights from it. In simple words, data preprocessing transforms raw data into an understandable and readable format.
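
The steps listed above can also be chained into a single object. The following is a minimal sketch, assuming scikit-learn's `Pipeline` and `ColumnTransformer` and a small hand-built DataFrame (`raw`) that stands in for `Data.csv`; the names `numeric_steps` and `preprocess` are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for Data.csv (same columns and values as the table below)
raw = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

features = raw[["Country", "Age", "Salary"]]
target = raw["Purchased"]

# Split first, so imputation and scaling statistics come from the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=1
)

# Numeric columns: fill missing values with the column mean, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical column: one-hot encode; categories unseen during fit are ignored
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["Age", "Salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
])

X_train_prepared = preprocess.fit_transform(X_train)  # learn statistics, then transform
X_test_prepared = preprocess.transform(X_test)        # reuse the learned statistics

print(X_train_prepared.shape, X_test_prepared.shape)
```

Splitting before fitting is a design choice in this sketch: it keeps the test rows out of the mean and standard deviation that the transformers learn.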

**2- Import all the crucial libraries**

In [84]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

 **3- Import the dataset**

In [85]:
df=pd.read_csv("Data.csv")
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [86]:
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [87]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [88]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


**4- Identifying and handling the missing values**

In data preprocessing, it is pivotal to identify and correctly handle missing values. If you fail to do so, you might draw inaccurate conclusions and inferences from the data, which will hamper your ML project.

Some typical reasons why data is missing:

    A. User forgot to fill in a field.

    B. Data was lost while transferring manually from a legacy database.

    C. There was a programming error.

    D. Users chose not to fill out a field tied to their beliefs about how the results would be used or interpreted.

In [89]:
df.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64
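
Beyond counting, it often helps to see which rows are affected. A minimal sketch, assuming a hand-built DataFrame (`df_demo`) with the same values as `Data.csv`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv, rebuilt from the table shown above
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Boolean mask: True for every row that contains at least one missing value
rows_with_nan = df_demo.isnull().any(axis=1)

# Show only the incomplete rows (index 4 and 6 in this data)
print(df_demo[rows_with_nan])

# Share of missing values per column, as a percentage
missing_pct = df_demo.isnull().mean() * 100
print(missing_pct)
```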

**Solution 1 : Dropna**

In [90]:
df1=df.copy()
df1

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [132]:
# work on a fresh copy so the original DataFrame stays untouched
df1 = df.copy()

# summarize the shape of the raw data
print("Before removing null values:", df1.shape)

# drop rows with missing values
df1.dropna(inplace=True)

# summarize the shape of the data with missing rows removed
print("After removing null values:", df1.shape)

Before removing null values: (10, 4)
After removing null values: (8, 4)
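
`dropna` can also be restricted to certain columns or applied to columns instead of rows. A minimal sketch, assuming a hand-built DataFrame (`df_demo`) with the same values as `Data.csv`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv, rebuilt from the table shown above
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Drop a row only when Age is missing (a missing Salary is tolerated)
no_missing_age = df_demo.dropna(subset=["Age"])

# Keep rows with at least 4 non-missing values, i.e. fully complete rows here
complete_rows = df_demo.dropna(thresh=4)

# Drop columns (instead of rows) that contain any missing value
complete_cols = df_demo.dropna(axis=1)

print(no_missing_age.shape)  # (9, 4)
print(complete_rows.shape)   # (8, 4)
print(complete_cols.shape)   # (10, 2)
```

Dropping rows is simple but discards data; on a ten-row dataset, losing two rows removes 20% of the samples.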


**Solution 2 : Fillna**

In [92]:
df2 = df.copy()
df2

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


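In [93]:
# fill missing values in the numeric columns with each column's mean
# (numeric_only=True skips the text columns Country and Purchased)
df2.fillna(df2.mean(numeric_only=True), inplace=True)
print(df2.isnull().sum())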

Country      0
Age          0
Salary       0
Purchased    0
dtype: int64
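
The fill value does not have to be the mean for every column. A minimal sketch, assuming a hand-built DataFrame (`df_demo`) with the same values as `Data.csv`, that fills Age with its median and Salary with its mean by passing a dictionary to `fillna`; the choice of statistic per column is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv, rebuilt from the table shown above
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# One fill value per column: median for Age, mean for Salary
fill_values = {
    "Age": df_demo["Age"].median(),
    "Salary": df_demo["Salary"].mean(),
}

# fillna returns a new DataFrame; df_demo itself is left unchanged
df_filled = df_demo.fillna(value=fill_values)

print(df_filled.loc[[4, 6], ["Age", "Salary"]])
```

The median is less sensitive to outliers than the mean, which is one reason to prefer it for skewed columns.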



**Solution 3 : Scikit-Learn**

Fill null values by using Scikit-Learn.

In [95]:
x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [96]:
from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column (Age and Salary are columns 1 and 2)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

In [97]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
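
The fill values 38.78 and 63777.78 in the output above are the column means of the non-missing entries. A minimal sketch that checks this, assuming only the Age and Salary columns copied by hand into a NumPy array (`numeric`); it reads the learned values from the imputer's `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the table above (NaN marks a missing entry)
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer_demo.fit_transform(numeric)

# statistics_ holds the per-column fill value learned during fit
print(imputer_demo.statistics_)

# The same values computed directly with NumPy, ignoring NaNs
print(np.nanmean(numeric, axis=0))
```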


## 5- Encoding the categorical data

Categorical data takes its values from a fixed set of categories. In the dataset above, there are two categorical variables: Country and Purchased.

Machine Learning models are primarily based on mathematical equations, which operate on numbers. Text categories such as 'France' or 'Yes' therefore have to be converted to numeric values before they can be fed to a model.

**Solution 1 : ColumnTransformer using Scikit-Learn**

In [98]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (Country); pass Age and Salary through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))

In [99]:
df.head(5)

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [101]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
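
The first three columns of the output above are the one-hot columns for Country. A minimal sketch, assuming a four-row DataFrame (`sample`) taken from the table, that reads the column order from the fitted encoder's `categories_` attribute; `ct_demo` is an illustrative name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Four rows from the table above, used only for illustration
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
)
encoded = ct_demo.fit_transform(sample)

# Categories are sorted, so the one-hot columns are France, Germany, Spain in that order
categories = ct_demo.named_transformers_["encoder"].categories_[0]
print(list(categories))

# First row (France, 44, 72000): one-hot columns first, then Age and Salary
print(encoded[0])
```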


**Solution 2 : pd.get_dummies() using Pandas**

In [103]:
df2

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.0,Yes
6,Spain,38.777778,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [106]:
pd.get_dummies(df2)

Unnamed: 0,Age,Salary,Country_France,Country_Germany,Country_Spain,Purchased_No,Purchased_Yes
0,44.0,72000.0,1,0,0,1,0
1,27.0,48000.0,0,0,1,0,1
2,30.0,54000.0,0,1,0,1,0
3,38.0,61000.0,0,0,1,1,0
4,40.0,63777.777778,0,1,0,0,1
5,35.0,58000.0,1,0,0,0,1
6,38.777778,52000.0,0,0,1,1,0
7,48.0,79000.0,1,0,0,0,1
8,50.0,83000.0,0,1,0,1,0
9,37.0,67000.0,1,0,0,0,1
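
Calling `pd.get_dummies` on the whole DataFrame also splits the target Purchased into two columns. A minimal sketch, assuming a four-row DataFrame (`sample`) taken from the table, that encodes only Country through the `columns` parameter and drops the first dummy column with `drop_first=True`; whether to drop a column depends on the model, so treat that choice as an assumption.

```python
import pandas as pd

# Four rows from the table above, used only for illustration
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

# Encode only Country; Purchased stays as a single label column.
# drop_first=True removes Country_France, which is implied when the other two are 0.
features_encoded = pd.get_dummies(
    sample, columns=["Country"], drop_first=True, dtype=int
)

print(features_encoded.columns.tolist())
print(features_encoded)
```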


**Solution 3 : LabelEncoder**

In [108]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
le.fit_transform(y)

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
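
To see which integer stands for which label, and to map predictions back to the original text, the fitted encoder can be inspected. A minimal sketch, assuming the label values are copied by hand into an array (`labels`); `le_demo` is an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The Purchased column from the table above
labels = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

le_demo = LabelEncoder()
encoded_labels = le_demo.fit_transform(labels)

# classes_ lists the labels in sorted order; the position is the integer code
print(le_demo.classes_)

# inverse_transform maps integer codes back to the original labels
decoded = le_demo.inverse_transform(encoded_labels[:3])
print(decoded)
```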

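## 6- Splitting the dataset

Splitting the dataset is the next step in data preprocessing. The data is split into two separate sets: a training set used to fit the model and a test set used to evaluate it on data it has not seen. Here `test_size=0.2` holds out 20% of the rows (2 of 10) for testing, and `random_state=1` makes the split reproducible.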

In [110]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)

In [112]:
print(x_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [114]:
print(x_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [117]:
print(y_train)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes']


In [119]:
print(y_test)

['No' 'Yes']
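
With so few rows, a random split can easily put only one class in the test set. A minimal sketch, assuming placeholder features (`X_demo`) and the same labels as above, that uses the `stratify` parameter of `train_test_split` to keep the class proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features (10 rows, 2 columns) and the Purchased labels from the table
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

# stratify=y_demo keeps the 50/50 class balance in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(sorted(y_te.tolist()))
```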


## 7- Feature scaling

Feature scaling marks the end of data preprocessing in Machine Learning. It brings the independent variables of a dataset onto a comparable scale, so that a feature with large values (such as Salary) does not dominate one with small values (such as Age).

Another reason to apply feature scaling is that some algorithms, such as those trained with gradient descent, converge much faster with it than without it.

**MinMax Scaler**

MinMax Scaler rescales each feature to a given range, usually 0 to 1. Because the transformation is linear, it changes the range of the values without changing the shape of the original distribution. The scaler is fitted on the training set only and then reused to transform the test set.
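
Min-max scaling computes `(value - min) / (max - min)` for each column, using the minimum and maximum of the training data. A minimal sketch, assuming the Age values of the training and test rows are copied by hand into arrays, that reproduces the calculation with NumPy and compares it with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Age column of the training and test rows (after imputation), copied by hand
train_age = np.array([[349 / 9], [40.0], [44.0], [38.0], [27.0], [48.0], [50.0], [35.0]])
test_age = np.array([[30.0], [37.0]])

# Minimum and maximum come from the training data only
col_min = train_age.min(axis=0)
col_max = train_age.max(axis=0)

manual_train = (train_age - col_min) / (col_max - col_min)
manual_test = (test_age - col_min) / (col_max - col_min)

# The fitted scaler applies the same formula
scaler_demo = MinMaxScaler().fit(train_age)
print(np.allclose(manual_train, scaler_demo.transform(train_age)))
print(np.allclose(manual_test, scaler_demo.transform(test_age)))
print(manual_test.ravel())
```

Test values can fall outside 0 to 1 when they lie outside the range seen in the training data.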


In [123]:
from sklearn.preprocessing import MinMaxScaler
mm=MinMaxScaler()
x_train[:, 3:] = mm.fit_transform(x_train[:, 3:])
x_test[:, 3:] = mm.transform(x_test[:, 3:])

In [125]:
print(x_train)

[[0.0 0.0 1.0 0.5120772946859904 0.11428571428571432]
 [0.0 1.0 0.0 0.5652173913043479 0.45079365079365075]
 [1.0 0.0 0.0 0.7391304347826089 0.6857142857142855]
 [0.0 0.0 1.0 0.4782608695652175 0.37142857142857144]
 [0.0 0.0 1.0 0.0 0.0]
 [1.0 0.0 0.0 0.9130434782608696 0.8857142857142857]
 [0.0 1.0 0.0 1.0 1.0]
 [1.0 0.0 0.0 0.34782608695652173 0.2857142857142856]]


In [127]:
print(x_test)

[[0.0 1.0 0.0 0.1304347826086958 0.17142857142857149]
 [1.0 0.0 0.0 0.43478260869565233 0.5428571428571427]]


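**Standard Scaler**

Standard Scaler rescales each feature to zero mean and unit variance. As with MinMax Scaler, the scaler is fitted on the training set only, and the same fitted scaler then transforms the test set. In practice you would pick one scaler; this cell runs on the columns already min-max scaled above, which gives the same z-scores as standardizing the raw columns because both transformations are linear.

In [129]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x_train[:,3:]=sc.fit_transform(x_train[:,3:])
# reuse the mean and standard deviation learned from the training set; do not refit on the test set
x_test[:,3:]=sc.transform(x_test[:,3:])

In [130]:
print(x_train)

[[0.0 0.0 1.0 -0.19159184384578537 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057581 -0.07013167641635436]
 [1.0 0.0 0.0 0.5667085065333245 0.6335624327104541]
 [0.0 0.0 1.0 -0.3045301939022482 -0.3078661727429788]
 [0.0 0.0 1.0 -1.9018011447007983 -1.4204636155515822]
 [1.0 0.0 0.0 1.1475343068237058 1.2326533634535486]
 [0.0 1.0 0.0 1.4379472069688963 1.5749910381638883]
 [1.0 0.0 0.0 -0.740149544120035 -0.5646194287757338]]


In [131]:
# rounded to 4 decimals for readability
print(np.round(x_test.astype(float), 4))

[[ 0.      1.      0.     -1.4662 -0.907 ]
 [ 1.      0.      0.     -0.4497  0.2056]]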
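
Standardization computes `(value - mean) / std` for each column, with the mean and standard deviation taken from the training data. A minimal sketch, assuming the Age values of the training and test rows are copied by hand into arrays, that reproduces the calculation with NumPy and compares it with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age column of the training and test rows (after imputation), copied by hand
train_age = np.array([[349 / 9], [40.0], [44.0], [38.0], [27.0], [48.0], [50.0], [35.0]])
test_age = np.array([[30.0], [37.0]])

# Mean and standard deviation come from the training data only.
# np.std defaults to the population standard deviation (ddof=0), as StandardScaler does.
mean = train_age.mean(axis=0)
std = train_age.std(axis=0)

manual_train = (train_age - mean) / std
manual_test = (test_age - mean) / std

# The fitted scaler applies the same formula
scaler_demo = StandardScaler().fit(train_age)
print(np.allclose(manual_train, scaler_demo.transform(train_age)))
print(np.allclose(manual_test, scaler_demo.transform(test_age)))
print(manual_test.ravel())
```

Refitting the scaler on a two-row test set would map every column to exactly -1 and 1 regardless of the underlying values, which is why the test set is transformed with the statistics learned from the training set.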
