
# Data Preprocessing
### Data preprocessing is the first, and a crucial, step in building a machine learning model.
It involves the following steps:

*   Getting the dataset
*   Importing libraries
*   Importing datasets
*   Handling Missing Data

    *   By deleting the particular row
    *   By calculating the mean / median / mode

*   Encoding Categorical Data

  *   One-hot/dummy encoding
  *   Label / Ordinal encoding
  *   Target encoding
  *   Frequency / count encoding
  *   Binary encoding
  *   Feature Hashing

*   Splitting dataset into training and test set
*   Feature scaling
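
Two of the encodings listed above, frequency / count encoding and target encoding, are not demonstrated later in this notebook. The following is a minimal sketch of both using plain pandas. The `toy` frame and its values are hypothetical stand-ins, not rows from `Data.csv`.

```python
import pandas as pd

# Hypothetical toy data, loosely modelled on the Country and Purchased columns.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "France", "France"],
    "Purchased": [0, 1, 0, 1, 1, 0],
})

# Frequency / count encoding: replace each category with how often it occurs.
counts = toy["Country"].value_counts()
toy["Country_count"] = toy["Country"].map(counts)

# Target encoding: replace each category with the mean target value of that category.
# In a real workflow the means should come from the training split only, to avoid leakage.
target_means = toy.groupby("Country")["Purchased"].mean()
toy["Country_target"] = toy["Country"].map(target_means)

print(toy)
```

Both encodings produce a single numeric column, which can be useful when a categorical feature has too many levels for one-hot encoding to be practical.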





In [150]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [151]:
#importing datasets
dataset = pd.read_csv('/content/Data.csv')

In [152]:
dataset.head()

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes


In [153]:
dataset.shape

(10, 4)

Extracting the independent variables

In [154]:
x = dataset.iloc[:, :-1].values

In [155]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


Extracting the dependent variable

In [156]:
y = dataset.iloc[:, 3].values

In [157]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
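
The positional slices above depend on `Purchased` being the last (fourth) column. As a possible alternative, the same arrays can be selected by column name. The sketch below uses a small hypothetical frame (`toy`) in place of `Data.csv`, and the names `x_named` and `y_named` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first rows of the dataset.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# Selecting by name keeps working if columns are reordered or new ones are added.
x_named = toy.drop(columns="Purchased").values
y_named = toy["Purchased"].values

# Same result as the positional slicing used in this notebook.
same_x = np.array_equal(x_named, toy.iloc[:, :-1].values)
same_y = np.array_equal(y_named, toy.iloc[:, 3].values)
print(same_x, same_y)
```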


Checking whether the dataset contains any null values.

In [158]:
dataset.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

Handling missing data

(Replacing missing data with the mean value)

In [159]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

In [160]:
#Fitting imputer object to the independent variables x.
fitting= imputer.fit(x[:, 1:3])

In [161]:
#Replacing missing data with the calculated mean value
x[:, 1:3]= fitting.transform(x[:, 1:3])

In [162]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
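
Mean imputation is only one of the options listed at the top of this notebook. The sketch below shows two others, deleting incomplete rows and filling with the median, on a small hypothetical frame (`toy`) rather than the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns with one gap each, mirroring Age and Salary.
toy = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# Option 1: delete every row that contains a missing value.
dropped = toy.dropna()

# Option 2: fill each gap with the column median,
# which is less sensitive to outliers than the mean.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled = median_imputer.fit_transform(toy)

print(dropped.shape)
print(filled)
```

Deleting rows is simple but discards data, which matters on a dataset as small as this one. `strategy="most_frequent"` would give the mode instead of the median.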


Encoding Categorical data

In [163]:
#Categorical data for Country variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])

In [164]:
print(x)

[[0 44.0 72000.0]
 [2 27.0 48000.0]
 [1 30.0 54000.0]
 [2 38.0 61000.0]
 [1 40.0 63777.77777777778]
 [0 35.0 58000.0]
 [2 38.77777777777778 52000.0]
 [0 48.0 79000.0]
 [1 50.0 83000.0]
 [0 37.0 67000.0]]


In [165]:
#Encoding for dummy variables
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
onehot_encoder = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x= onehot_encoder.fit_transform(x)

In [166]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
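
In recent scikit-learn versions `OneHotEncoder` accepts string categories directly, so the label-encoding step is not strictly required before it. The sketch below assumes a few hypothetical rows in the same `Country, Age, Salary` layout as `x`, and shows how `categories_` reveals which output column belongs to which country.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Hypothetical rows in the same layout as x: Country, Age, Salary.
rows = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

# One-hot encode column 0 and pass the remaining columns through unchanged.
ct = ColumnTransformer([("encoder", OneHotEncoder(), [0])], remainder="passthrough")
encoded = ct.fit_transform(rows)

# categories_ lists the countries in the order of the generated columns.
categories = ct.named_transformers_["encoder"].categories_[0]
print(list(categories))
print(encoded)
```

The categories are sorted alphabetically, which matches the column order in the output above: France, Germany, Spain.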


In [167]:
# Encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

In [168]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
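
To see which integer stands for which label, and to map predictions back to the original strings, the fitted encoder can be inspected. This is a small sketch on a hypothetical list of labels; the names `labels`, `encoder`, and `decoded` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Purchased column.
labels = ["No", "Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so code 0 means 'No' and code 1 means 'Yes'.
print(encoder.classes_)

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```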


Splitting the dataset into training and test set

In [169]:
# shape of the dataset before splitting
x.shape

(10, 5)

In [170]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

In [171]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((8, 5), (2, 5), (8,), (2,))
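
With only ten rows, a random split can easily put both test samples in the same class. One possible refinement is the `stratify` parameter of `train_test_split`, sketched below. The `features` array is hypothetical; `labels` reuses the encoded values of `y` printed earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features; the labels match the encoded y printed earlier.
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

# With stratify, both splits keep the 50/50 class balance of the full data.
train_counts = np.bincount(l_train)
test_counts = np.bincount(l_test)
print(train_counts, test_counts)
```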

Feature Scaling of datasets

In [172]:
# import library
from sklearn.preprocessing import StandardScaler

st_x = StandardScaler()
x_train= st_x.fit_transform(x_train)
# Scale the test set with the mean and standard deviation learned from the
# training set. Refitting on x_test would leak test information and put the
# two sets on different scales.
x_test= st_x.transform(x_test)

In [173]:
print(x_train)

[[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]]


In [174]:
print(x_test)

[[-1.          2.64575131 -0.77459667 -1.45882927 -0.90166297]
 [-1.          2.64575131 -0.77459667  1.98496442  2.13981082]]
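
Standardization rescales each column as $z = (x - \mu) / \sigma$. A scaler is normally fitted on the training split only, so that $\mu$ and $\sigma$ come from training data and are then reused for the test split. The sketch below checks this against a manual NumPy computation, using hypothetical `train` and `test` arrays rather than the splits from this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age/Salary values standing in for a train and a test split.
train = np.array([[44.0, 72000.0], [27.0, 48000.0], [38.0, 61000.0], [35.0, 58000.0]])
test = np.array([[30.0, 54000.0], [50.0, 83000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual version of the same computation.
# np.std defaults to the population standard deviation (ddof=0), as StandardScaler does.
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(np.allclose(test_scaled, manual_test))
```

After scaling, the training columns have zero mean and unit variance; the test columns generally do not, because they are expressed in the training set's units.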
