# Data Preprocessing

This is a data preprocessing template intended for reuse in future projects.

## Libraries

In [42]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

If any of these libraries is not available, install it first:
pip install <library name>

## Importing Dataset

In [5]:
dataset = pd.read_csv("resources/Data.csv")
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


Terminology:
1. Independent variable (feature): the information used to predict the outcome (here Country, Age, and Salary)
2. Dependent variable: the outcome to be predicted from the features (here Purchased)
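
As a minimal sketch of the distinction above, the features and the dependent variable can also be picked out by column name. The three-row sample and the names feature_columns and target_column are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# A three-row sample with the same columns as Data.csv (values copied from the table above).
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

feature_columns = ["Country", "Age", "Salary"]  # independent variables (hypothetical name)
target_column = "Purchased"                     # dependent variable (hypothetical name)

features = sample[feature_columns]
target = sample[target_column]

print(features.shape)   # (3, 3)
print(target.tolist())  # ['No', 'Yes', 'No']
```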

## Subset Your Dataset

### Using iloc

iloc is, by definition, "integer-location based indexing for selection by position".

Example syntax:
<dataset_name>.iloc[<lower bound of row>:<upper bound of row>,<lower bound of column>:<upper bound of column>]

Notes:
1. Indexing starts from 0.
2. If the lower bound is omitted, the slice starts at the first index.
3. If the upper bound is omitted, the slice runs through the last index.
4. The upper bound is exclusive: 0:5 selects positions 0 through 4.
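
The slicing rules in the notes can be tried on a small throwaway frame. This is only a sketch: the DataFrame demo and its columns a, b, c are made up for illustration.

```python
import pandas as pd

# A made-up 4x3 frame, used only to show how the bounds behave.
demo = pd.DataFrame({
    "a": [10, 11, 12, 13],
    "b": [20, 21, 22, 23],
    "c": [30, 31, 32, 33],
})

# Rows 1 and 2 (the upper bound 3 is excluded), columns 0 and 1.
middle = demo.iloc[1:3, 0:2]

# Omitted lower bounds start at 0; omitted upper bounds run to the end.
first_two_rows = demo.iloc[:2, :]
last_column = demo.iloc[:, -1:]

print(middle.values.tolist())     # [[11, 21], [12, 22]]
print(first_two_rows.shape)       # (2, 3)
print(list(last_column.columns))  # ['c']
```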

### Selecting all rows and columns

In [28]:
dataset.iloc[:,:]

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


### Selecting the top 5 rows

In [7]:
dataset.iloc[0:5,:]

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


### Selecting the top 5 rows of the last column

In [9]:
dataset.iloc[0:5,-1:]

Unnamed: 0,Purchased
0,No
1,Yes
2,No
3,No
4,Yes


### Assigning your subset to a variable

We want to split the original dataset into two new datasets:
1. x contains the features only
2. y contains the dependent variable only

.values converts the DataFrame to a NumPy array (dtype object here, because the data includes strings) for further ML processing
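
For comparison, here is a small sketch of the same conversion using to_numpy(), which the pandas documentation recommends over .values. The two-row sample is an assumption made for illustration. It also shows that the object dtype comes from mixing strings with numbers, since a numeric-only selection keeps a float dtype.

```python
import numpy as np
import pandas as pd

# A two-row sample with values copied from the dataset above.
sample = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Purchased": ["No", "Yes"],
})

mixed = sample.iloc[:, :-1].to_numpy()  # string and float columns together
numeric = sample[["Age"]].to_numpy()    # float column only

print(type(mixed).__name__, mixed.dtype)                  # ndarray object
print(numeric.dtype)                                      # float64
print(np.array_equal(mixed, sample.iloc[:, :-1].values))  # True
```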

In [14]:
x = dataset.iloc[:,:-1].values
x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [38]:
y = dataset.iloc[:,-1:].values
y

array([['No'],
       ['Yes'],
       ['No'],
       ['No'],
       ['Yes'],
       ['Yes'],
       ['No'],
       ['Yes'],
       ['No'],
       ['Yes']], dtype=object)

## Handling Missing Data

There are two common ways to handle missing data:
1. Deletion: drop the affected rows or columns (this shrinks the sample and may make it less representative)
2. Imputation: replace each missing value with an estimate

![image.png](attachment:image.png)

reference: https://www.analyticsvidhya.com/blog/2021/10/guide-to-deal-with-missing-values/
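
The two options can be sketched directly in pandas before reaching for sklearn. The four-row sample below is made up for illustration, and the column mean is just one possible estimate.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing Age and one missing Salary.
sample = pd.DataFrame({
    "Age": [40.0, 35.0, np.nan, 48.0],
    "Salary": [np.nan, 58000.0, 52000.0, 79000.0],
})

# Option 1: delete every row that has at least one missing value.
dropped = sample.dropna()

# Option 2: impute each missing value with its column mean.
imputed = sample.fillna(sample.mean())

print(len(sample), len(dropped))        # 4 2
print(int(imputed.isna().sum().sum()))  # 0
print(float(imputed.loc[2, "Age"]))     # 41.0
```

Deletion here discards half of the rows, which is why imputation is often preferred on small datasets.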

### Data Imputation using sklearn

In [25]:
imputer=SimpleImputer(missing_values=np.nan, strategy='mean')
# strategy can be 'mean', 'median', 'most_frequent', or 'constant'; reference: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

imputer.fit(x[:,1:3])
# learn the column means from all rows of the Age and Salary columns (continuous data)

x[:,1:3]=imputer.transform(x[:,1:3])
# overwrite the Age and Salary columns with their imputed values

x

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
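
As a sanity check, the imputed values can be recomputed by hand: with strategy='mean', each missing entry should equal the mean of the non-missing values in its column. The sketch below copies the Age and Salary columns from the dataset into a standalone array (the name age_salary is an assumption) and compares the imputer's learned statistics_ with np.nanmean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the dataset above.
age_salary = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(age_salary)

# statistics_ holds the per-column fill value learned by fit().
column_means = np.nanmean(age_salary, axis=0)
print(np.allclose(imputer.statistics_, column_means))  # True
print(round(float(filled[6, 0]), 2))                   # 38.78
print(round(float(filled[4, 1]), 2))                   # 63777.78
```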

## Encoding Data

Encoding converts categorical data into numeric values that a model can work with.

### Encoding Features

In [32]:
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')
x = np.array(ct.fit_transform(x))
x
# reference: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=columntrans#sklearn.compose.ColumnTransformer

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

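Notice that each Country value is converted to a one-hot vector. OneHotEncoder sorts the categories alphabetically (France, Germany, Spain), which determines the column order:
France  = [1,0,0]
Spain   = [0,0,1]
Germany = [0,1,0]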
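
To see where this mapping comes from, the fitted encoder can be inspected. The sketch below fits a standalone OneHotEncoder on a made-up three-row array (rather than going through ColumnTransformer) and reads its categories_ attribute, which lists the categories in column order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One row per country, in the order they first appear in the dataset.
countries = np.array([["France"], ["Spain"], ["Germany"]])

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense for printing.
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0].tolist())  # ['France', 'Germany', 'Spain']
print(encoded.tolist())
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```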

### Encoding Dependent Variable

In [40]:
le = LabelEncoder()
y = le.fit_transform(y.ravel())  # flatten the (10, 1) column to 1-D, the shape LabelEncoder expects
y
# reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html?highlight=label%20encoder#sklearn.preprocessing.LabelEncoder

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)
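
The integer codes follow the sorted order of the labels, so 'No' becomes 0 and 'Yes' becomes 1. A minimal sketch on a made-up five-label array shows how to read the mapping from classes_ and how to turn codes back into labels with inverse_transform.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The first five Purchased values from the dataset above.
labels = np.array(["No", "Yes", "No", "No", "Yes"])

le = LabelEncoder()
codes = le.fit_transform(labels)

print(le.classes_.tolist())                  # ['No', 'Yes']
print(codes.tolist())                        # [0, 1, 0, 0, 1]
print(le.inverse_transform(codes).tolist())  # ['No', 'Yes', 'No', 'No', 'Yes']
```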

## Splitting Data into Training and Test Datasets

In [43]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=1)
# reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train%20test#sklearn.model_selection.train_test_split

The dataset is split into:
x_train = 80% of the feature data, used for training
x_test = 20% of the feature data, used for evaluation
y_train = 80% of the dependent variable data, used for training
y_test = 20% of the dependent variable data, used for evaluation

Evaluation means measuring the performance of a machine learning model with a chosen metric (accuracy, precision, sensitivity/recall, specificity).
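
As a rough sketch of how these metrics are computed, the example below uses sklearn.metrics on made-up true and predicted labels. No model is trained in this notebook, so y_true and y_pred are purely hypothetical. sklearn has no dedicated specificity function, so it is obtained here as the recall of the negative class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = purchased, 0 = not purchased.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)                # correct / total
precision = precision_score(y_true, y_pred)              # TP / (TP + FP)
recall = recall_score(y_true, y_pred)                    # TP / (TP + FN)
specificity = recall_score(y_true, y_pred, pos_label=0)  # TN / (TN + FP)

print(round(float(accuracy), 3))     # 0.625
print(round(float(precision), 3))    # 0.6
print(round(float(recall), 3))       # 0.75
print(round(float(specificity), 3))  # 0.5
```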

random_state seeds the random number generator so that the same split is reproduced on every run.
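
The effect of random_state can be checked by splitting the same data twice with the same seed. This is a minimal sketch on placeholder arrays (features and target are made up), also confirming the 80/20 sizes for 10 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)  # 10 placeholder rows, 2 columns
target = np.arange(10)

first = train_test_split(features, target, test_size=0.2, random_state=1)
second = train_test_split(features, target, test_size=0.2, random_state=1)

x_train, x_test, y_train, y_test = first
print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)

# Same seed, same split: every returned array matches.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)  # True
```

Leaving random_state unset generally gives a different split on each run, which makes results harder to compare between runs.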

In [49]:
print(x_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [50]:
print(x_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [51]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [52]:
print(y_test)

[0 1]
