# Data Preprocessing

In this section, we learn how to preprocess a dataset before we use it for our machine learning model. We will cover importing libraries, reading CSV files, taking care of missing data, encoding categorical data, splitting the dataset into training and test sets, and feature scaling.

In [26]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [27]:
%cd /content/drive/MyDrive/Colab Notebooks/Machine-Learning-A-Z-Codes-Datasets/Part 1 - Data Preprocessing/Section 2 -------------------- Part 1 - Data Preprocessing --------------------/Python

/content/drive/MyDrive/Colab Notebooks/Machine-Learning-A-Z-Codes-Datasets/Part 1 - Data Preprocessing/Section 2 -------------------- Part 1 - Data Preprocessing --------------------/Python


## Importing the libraries

In [28]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


## Importing the dataset

In [29]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values  # every column except the last: the features
y = dataset.iloc[:, -1].values   # the last column: the dependent variable

In [30]:
dataset

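| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |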


In [31]:
dataset["Age"]

| | Age |
|---|---|
| 0 | 44.0 |
| 1 | 27.0 |
| 2 | 30.0 |
| 3 | 38.0 |
| 4 | 40.0 |
| 5 | 35.0 |
| 6 | NaN |
| 7 | 48.0 |
| 8 | 50.0 |
| 9 | 37.0 |


## Taking care of missing data

If the dataset contains missing data, we can replace the missing values in a column with the mean, the median, or the most frequent value of that column. To do that we can use `SimpleImputer` from `sklearn.impute`.

1. First we import the `SimpleImputer` class.
2. Then we create an object of the `SimpleImputer` class, passing `missing_values=np.nan` and `strategy='mean'`.
3. We call the `fit` method on the columns we would like to process; it computes the mean of each of these columns.
4. Then we call the `transform` method, which replaces the missing values in those columns with the computed means.


In [32]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])  # learn the mean of the Age and Salary columns
X[:, 1:3] = imputer.transform(X[:, 1:3])  # replace the missing values with those means

In [33]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
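
The imputed values in the output above can be checked by hand. The following is a minimal sketch, assuming the `Age` and `Salary` values shown in the dataset earlier; the array `numeric` is a hypothetical stand-in for `X[:, 1:3]`, rebuilt here so that the snippet runs without `Data.csv`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the dataset above; np.nan marks the gaps.
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

# Column means computed while ignoring the missing entries.
column_means = np.nanmean(numeric, axis=0)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(numeric)

print(column_means)                # mean Age and mean Salary of the observed rows
print(imputer.statistics_)         # the values the imputer learned during fit
print(filled[6, 0], filled[4, 1])  # the two cells that were missing
```

Both printed means should agree, and the two filled cells should match the values 38.777... and 63777.777... seen in the printed `X`. `SimpleImputer` also accepts `strategy='median'`, `'most_frequent'` and `'constant'`.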


## Encoding categorical data

Features that hold strings instead of numerical values, such as `Country`, need to be encoded during the data preprocessing phase. The `Country` feature contains three countries: France, Germany and Spain. If we encoded them as the numbers 1, 2 and 3, a model trained on the data might assume that this order is meaningful, when in reality it is not. To avoid such a misinterpretation, which could lead to wrong outcomes, we use one-hot encoding: each country gets its own column, and every row is represented by a binary vector with a 1 in the column of its country.

Both the independent and the dependent variables need encoding. Here the categorical independent variable is `Country`, which we one-hot encode, and the dependent variable is `Purchased`, which we encode as 0/1 labels.

For the one-hot encoding we import two classes from `sklearn`: `ColumnTransformer` (from `sklearn.compose`) and `OneHotEncoder` (from `sklearn.preprocessing`). First we create a `ColumnTransformer` object, which takes the list of transformations to apply (a name, the transformer, and the column indices) and what to do with the remaining columns. By default (`remainder='drop'`) the remaining columns are dropped from the output, so it is important to specify `remainder='passthrough'` to keep `Age` and `Salary`. Then we call the `fit_transform` method to encode the categorical feature, and we convert the result to a NumPy array, which is the format we will use to train our future model.

### Encoding the Independent Variable

In [34]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (Country) and pass the remaining columns through unchanged.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [35]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
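
To see which of the three new columns belongs to which country, the encoder's learned categories can be inspected. This is a small sketch that assumes the `Country` values from the dataset above; the names `countries`, `encoder` and `one_hot` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Country column copied from the dataset above, as a 2-D array of shape (10, 1).
countries = np.array([
    'France', 'Spain', 'Germany', 'Spain', 'Germany',
    'France', 'Spain', 'France', 'Germany', 'France',
]).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
one_hot = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # category order: one output column per entry
print(one_hot[:3])             # rows for France, Spain and Germany
```

The categories are sorted, so the first column stands for France, the second for Germany and the third for Spain, which agrees with the first three columns of the printed `X`.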


### Encoding the Dependent Variable
Now we want to encode the dependent variable `Purchased`, which is stored in `y`.

In [36]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)  # 'No' -> 0, 'Yes' -> 1

In [37]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
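
The mapping between the original labels and the integers can be read from the encoder itself. A brief sketch, assuming the `Purchased` values from the dataset above; the array `purchased` is a hypothetical stand-in for `y`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Purchased column copied from the dataset above.
purchased = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

le = LabelEncoder()
encoded = le.fit_transform(purchased)

print(le.classes_)                    # sorted labels; the position is the integer code
print(encoded)
print(le.inverse_transform(encoded))  # back to the original strings
```

`inverse_transform` is handy later on for turning a model's 0/1 predictions back into `No`/`Yes`.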


## Splitting the dataset into the Training set and Test set
We need to split our dataset (`X`, `y`) into a training set and a test set. A common choice is to use 80% of the data for training and 20% for testing. The split produces four arrays, `X_train`, `X_test`, `y_train` and `y_test`, which are required as inputs for building and evaluating our machine learning model. We call the `train_test_split` function from `sklearn.model_selection` and specify `test_size=0.2`, which means 20% of the data will be used as the test set; `random_state=1` fixes the seed so that the split is reproducible.

In [38]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [39]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [40]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [41]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [42]:
print(y_test)

[0 1]
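
The effect of `test_size` and `random_state` can be seen on a toy example. This is only a sketch: `features` and `labels` are hypothetical stand-in arrays with the same number of rows as the dataset, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples with 2 features each.
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

first = train_test_split(features, labels, test_size=0.2, random_state=1)
second = train_test_split(features, labels, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows

# The same seed gives the same split on every call.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```

With 10 rows and `test_size=0.2`, two rows go to the test set, which matches the sizes of `X_test` and `y_test` printed above.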


## Feature Scaling
Feature scaling should always be done after splitting the dataset into training and test sets. Feature scaling normalizes or standardizes the data using statistics computed from it, such as the mean and the standard deviation. If we apply feature scaling before splitting the data into training and test sets, those statistics also depend on the test rows, so information leaks from the test set into the training set. Our aim is to keep the test set completely unseen so that it can be used to evaluate the trained model. Hence, we do not want any information to leak from the test set into the training set.

$$x_{\text{standardization}} = \frac{x - \text{mean}(x)}{\text{standard deviation}(x)}$$

$$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

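Standardization is a safe default that works well in most situations. Normalization maps every value into the range [0, 1], but because it depends on the minimum and maximum it is more sensitive to outliers.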
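
Both formulas can be tried out with plain NumPy. A minimal sketch on hypothetical values (the array `x` is made up for illustration); note that the subtraction happens before the division in both cases.

```python
import numpy as np

# Hypothetical feature values used only for illustration.
x = np.array([27.0, 35.0, 40.0, 48.0, 50.0])

# Standardization: subtract the mean, then divide by the standard deviation.
x_stand = (x - x.mean()) / x.std()

# Normalization: rescale the values to the range [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_stand.mean(), x_stand.std())  # approximately 0 and 1
print(x_norm.min(), x_norm.max())     # 0.0 and 1.0
```

After standardization the values have mean 0 and standard deviation 1 (up to floating-point error); after normalization the smallest value is 0 and the largest is 1.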


Feature scaling should be applied to both `X_train` and `X_test`.

Standardization only needs to be applied to the numerical features (`Age` and `Salary`). It does not need to be applied to the dummy variables, which already take only the values 0 and 1.

Dummy variables are the columns we obtained by one-hot encoding the categorical features.

We call `fit_transform` only on `X_train`. `fit_transform` first computes the mean and standard deviation of each feature and then uses them to transform the dataset.

For `X_test` we call only `transform`, because `X_test` needs to be scaled with the same statistics as `X_train`. If we called `fit_transform` on `X_test`, it would compute a new mean and standard deviation from the test set, which we do not want.

In [43]:
from sklearn.preprocessing import StandardScaler
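sc = StandardScaler()
# Scale only Age and Salary (columns 3 onward); the dummy columns stay as they are.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])  # fit on the training set only
X_test[:, 3:] = sc.transform(X_test[:, 3:])        # reuse the training statistics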

In [44]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [45]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
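
The scaled test rows can be reproduced by hand from the training statistics, which confirms that nothing is learned from the test set. This is a sketch that assumes the `Age` and `Salary` values printed for `X_train` and `X_test` before scaling; `train`, `test` and `manual` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the training and test rows, copied from the split above.
train = np.array([
    [38.77777777777778, 52000.0],
    [40.0, 63777.77777777778],
    [44.0, 72000.0],
    [38.0, 61000.0],
    [27.0, 48000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [35.0, 58000.0],
])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = sc.transform(test)        # reuses the training statistics

# The same result by hand, using the training mean and standard deviation.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(sc.mean_, sc.scale_)
print(np.allclose(test_scaled, manual))
```

`StandardScaler` divides by the population standard deviation, which is also what `np.std` computes by default, so the manual calculation agrees with `transform`.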
