
# **Data Preprocessing Steps**

1. Importing the libraries
2. Importing the dataset
3. Taking care of missing data
4. Encoding categorical data
5. Encoding the independent variable
6. Encoding the dependent variable
7. Splitting the dataset into training and test sets
8. Feature scaling

Normalization:
Normalization is applied per column.

Each column is rescaled independently so that its values lie between 0 and 1.
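
As a minimal sketch of column-wise normalization, the snippet below rescales a small made-up array. The name `sample` and its values are illustrative assumptions, not rows read from `Data.csv`.

```python
import numpy as np

# Hypothetical two-column sample: ages and salaries (illustrative values only)
sample = np.array([[27.0, 48000.0],
                   [38.0, 61000.0],
                   [50.0, 83000.0]])

# Min and max are computed per column (axis=0), not over the whole array
col_min = sample.min(axis=0)
col_max = sample.max(axis=0)

# X' = (X - Xmin) / (Xmax - Xmin), applied column by column via broadcasting
normalized = (sample - col_min) / (col_max - col_min)
print(normalized)
```

After this step the smallest value in each column is 0 and the largest is 1, regardless of the column's original units.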

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# pandas, numpy, and matplotlib are libraries
# pyplot is a module inside the matplotlib library

# Import the data

In [None]:
dataset = pd.read_csv('Data.csv')

dataset

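| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |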


In [None]:
# Split the dataset into the feature matrix (independent variables) and the target (dependent variable)
# .values converts the selection into a NumPy array
# iloc[rows, columns]: a slice includes its lower bound and excludes its upper bound
datasetx = dataset.iloc[:, :-1].values  # all rows (:), every column except the last (:-1)
datasety = dataset.iloc[:, -1].values   # all rows (:), only the last column (-1)

In [None]:
print(datasetx)
print(datasety)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
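
To make the `iloc` slicing concrete, here is a small sketch on a two-row frame. The name `demo` and its contents are assumptions chosen to mirror the column layout above; it is not the real dataset.

```python
import pandas as pd

# Hypothetical miniature frame with the same column layout as the dataset above
demo = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Salary": [72000.0, 48000.0],
    "Purchased": ["No", "Yes"],
})

features = demo.iloc[:, :-1].values  # all rows, every column except the last
target = demo.iloc[:, -1].values     # all rows, only the last column

print(features.shape)  # (2, 3)
print(target.shape)    # (2,)
```

The feature matrix is two-dimensional (rows by columns), while the target is a one-dimensional array with one entry per row.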


# TAKING CARE OF MISSING DATA

# Handling missing values with the column mean

---

In [None]:
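from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Only columns 1 and 2 (Age, Salary) are numeric; column 0 (Country) is left out
imputer.fit(datasetx[:, 1:3])
datasetx[:, 1:3] = imputer.transform(datasetx[:, 1:3])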


In [None]:
datasetx

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
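
The filled-in values above can be checked by hand. The following sketch reproduces mean imputation with plain NumPy on the Age and Salary columns copied from the dataset. It is an illustration of the idea, not how `SimpleImputer` is implemented internally.

```python
import numpy as np

# Age and Salary columns copied from the dataset above, with NaN for the gaps
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

# Column means computed while ignoring the NaN entries
col_means = np.nanmean(numeric, axis=0)

# Replace each NaN with the mean of its own column
filled = np.where(np.isnan(numeric), col_means, numeric)

print(col_means)
```

The two means are 349/9 (about 38.78) for Age and 574000/9 (about 63777.78) for Salary, which are the values that appear in rows 6 and 4 of the imputed array.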

# Encoding

---

# Independent variable encoding

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

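# One-hot encode column 0 (Country); remainder='passthrough' keeps Age and Salary unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(datasetx))  # fit_transform returns a new matrix; datasetx is not modified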

In [None]:
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
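
To see where the three leading columns come from, here is a hand-rolled sketch of one-hot encoding with NumPy. It assumes the categories are ordered alphabetically, which is consistent with the output above (France, Germany, Spain); the array `countries` is a shortened illustrative sample.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain"])

# np.unique returns the sorted categories and, for each row, the index of its category
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)  # ['France' 'Germany' 'Spain']
print(one_hot)
```

Each row contains exactly one 1.0, in the column that corresponds to its country, so no artificial ordering between countries is introduced.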

In [None]:
# Encode the dependent variable: 'No' becomes 0 and 'Yes' becomes 1
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(datasety)
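
As a rough sketch of what label encoding produces, the snippet below maps each label to the position of that label in the sorted list of classes. The array `purchased` is a shortened illustrative sample, and this is only an equivalent NumPy formulation, not the encoder's own code.

```python
import numpy as np

purchased = np.array(["No", "Yes", "No", "No", "Yes"])

# Sorted unique labels become the classes; codes are their positions in that order
classes, encoded = np.unique(purchased, return_inverse=True)

print(classes)  # ['No' 'Yes']
print(encoded)  # [0 1 0 0 1]
```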

**Splitting** the dataset into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state fixes the shuffle so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
print(X_train ,X_test,y_train,y_test)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]] [[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]] [0 1 0 0 1 1 0 1] [0 1]
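
The idea behind the split can be sketched with a shuffled index array. This is an assumption-laden illustration: it uses NumPy's own random generator, so it will not select the same rows as `train_test_split` with `random_state = 1`; it only shows the mechanism of shuffling and then cutting at 20%.

```python
import numpy as np

rng = np.random.default_rng(1)   # seeded generator so the shuffle is repeatable
n_samples = 10
test_size = 0.2

indices = rng.permutation(n_samples)           # shuffled row indices 0..9
n_test = int(np.ceil(n_samples * test_size))   # 2 rows go to the test set

test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(len(train_idx), len(test_idx))  # 8 2
```

Every row ends up in exactly one of the two sets, which is why the 10-row dataset above yields 8 training rows and 2 test rows.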


## **Feature Scaling**

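There are two common methods of feature scaling:

1. Normalization: rescales each column to the range 0 to 1.
  
  -> X' = (X - Xmin) / (Xmax - Xmin)

2. Standardization:
rescales each column to zero mean and unit standard deviation. The result has no fixed range, but for roughly normally distributed data most values fall between -3 and +3.

  -> X' = (X - mean) / std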
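
As a worked instance of the standardization formula, the sketch below applies it to the Age column (after imputation) using plain NumPy. It assumes the population standard deviation (`ddof=0`, NumPy's default).

```python
import numpy as np

# Age column after imputation, copied from the array shown earlier
age = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 349 / 9, 48.0, 50.0, 37.0])

# X' = (X - mean) / std, using the population standard deviation (ddof=0)
standardized = (age - age.mean()) / age.std()

print(standardized.mean())  # approximately 0
print(standardized.std())   # approximately 1
print(standardized.min(), standardized.max())
```

The mean and standard deviation of the result are fixed at 0 and 1, while the minimum and maximum depend on the data.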



**Standardization**

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Feature scaling is not applied to the dummy variables (the one-hot encoded columns 0 to 2)
# It is applied only to the numerical columns (Age and Salary, from index 3 onward)

X_train[:,3:] = sc.fit_transform(X_train[:,3:])
X_test[:,3:] = sc.transform(X_test[:,3:])

# The scaler is fitted on the training set only; 'fit' is never called on the test set
# The test set is transformed with the mean and std learned from the training set


In [None]:
X_train

array([[0.0, 0.0, 1.0, -0.19159184384578545, -1.0781259408412425],
       [0.0, 1.0, 0.0, -0.014117293757057777, -0.07013167641635372],
       [1.0, 0.0, 0.0, 0.566708506533324, 0.633562432710455],
       [0.0, 0.0, 1.0, -0.30453019390224867, -0.30786617274297867],
       [0.0, 0.0, 1.0, -1.9018011447007988, -1.420463615551582],
       [1.0, 0.0, 0.0, 1.1475343068237058, 1.232653363453549],
       [0.0, 1.0, 0.0, 1.4379472069688968, 1.5749910381638885],
       [1.0, 0.0, 0.0, -0.7401495441200351, -0.5646194287757332]],
      dtype=object)

In [None]:
X_test

array([[0.0, 1.0, 0.0, -1.4661817944830124, -0.9069571034860727],
       [1.0, 0.0, 0.0, -0.44973664397484414, 0.2056403393225306]],
      dtype=object)
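
The scaled Age values above can be reproduced by hand, which makes the fit-on-train, transform-on-test rule concrete. This sketch uses plain NumPy and assumes the population standard deviation (`ddof=0`); the variable names are illustrative.

```python
import numpy as np

# Age values of the training and test rows, copied from the split shown above
age_train = np.array([349 / 9, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0])
age_test = np.array([30.0, 37.0])

# "fit": learn the mean and (population) standard deviation from the training rows only
train_mean = age_train.mean()
train_std = age_train.std()

# "transform": apply those same training statistics to both sets
age_train_scaled = (age_train - train_mean) / train_std
age_test_scaled = (age_test - train_mean) / train_std

print(age_test_scaled)
```

The two printed values agree with the Age column of `X_test` above (about -1.466 and -0.450). The test rows do not have zero mean after scaling, which is expected because they are measured against the training set's statistics.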