<a href="https://colab.research.google.com/github/tuan9310/Machine-Learning-Data-Science-Python-/blob/main/data_preprocessing_tools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing Tools

## Importing the libraries

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

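**Functions:**
*   `read_csv('file_name')`: reads a CSV file into a DataFrame.
*   `iloc[rows, columns].values`: integer-position-based selection (positions run from 0 to length - 1 along each axis); `.values` returns the selection as a NumPy array.

    Parameters: `:` selects the whole range, and `:-1` selects from the first column up to, but not including, the last column.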
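
The slicing rules above can be tried on a small frame. This is a minimal sketch: the column names are assumptions (the notebook never prints the header of `Data.csv`), and the three rows are copied from the output shown further down.

```python
import pandas as pd

# Hypothetical three-row frame; the column names are assumed, the values
# are the first three rows printed later in this notebook.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = df.iloc[:, :-1].values  # all rows, every column except the last
target = df.iloc[:, -1].values     # all rows, last column only

print(features.shape)  # (3, 3)
print(target.shape)    # (3,)
print(target)          # ['No' 'Yes' 'No']
```

Because the selection is positional, it keeps working if the columns are renamed, as long as the dependent variable stays in the last column.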





In [7]:
# Load the dataset into a pandas DataFrame (a 2D labeled table)
dataset = pd.read_csv('Data.csv')

# Check the data type of object dataset
print(type(dataset))

# Get the independent variables: iloc[:, :-1].values returns all rows, from the first column up to (but not including) the last column
x = dataset.iloc[:, :-1].values

# Get the dependent variable, which is located in the last column.
y = dataset.iloc[:, -1].values

<class 'pandas.core.frame.DataFrame'>


In [8]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [9]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [10]:
# Import the SimpleImputer class from the sklearn.impute module to handle the missing data.
from sklearn.impute import SimpleImputer

# Create an imputer object that replaces each missing value (np.nan) with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit() computes the mean of each of the two numeric columns (columns 1 and 2).
imputer.fit(x[:, 1:3])

# transform() fills the missing entries with those means; the result is written back into x.
x[:, 1:3] = imputer.transform(x[:, 1:3])
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
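
The two filled-in values can be checked by hand. The sketch below copies the two numeric columns from the output printed before imputation and compares the means stored by the imputer with `np.nanmean`, which also ignores missing entries. The variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The two numeric columns as printed earlier in this notebook, missing values included.
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

# fit() stores one mean per column, computed while ignoring NaN.
column_means = np.nanmean(numeric, axis=0)
print(imputer.statistics_)
print(np.allclose(imputer.statistics_, column_means))  # True

# The two entries that were missing now hold the column means.
print(filled[6, 0], filled[4, 1])
```

The means work out to 349 / 9 for the first column and 574000 / 9 for the second, which matches the 38.78 and 63777.78 in the output above.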


## Encoding categorical data

### Encoding the Independent Variable

Convert categorical data into numerical data so that the machine learning model can work with it.


**1.   Label Encoder:**
*   Encodes target labels with values between 0 and n_classes-1. This transformer should be used to encode target values, i.e. y, and not the input X.
*   The problem with using it on input features is that the different integers in one column can lead the model to read an order into the categories, 0 < 1 < 2, which does not exist. To avoid this problem, we use One Hot Encoder for the features.

**2.   One Hot Encoder:**
*   Encodes categorical features as a one-hot numeric array.
*   The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features.
*   The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme.
*   This creates a binary column for each category and returns a sparse matrix or dense array (depending on the `sparse_output` parameter, named `sparse` in older scikit-learn releases).
*   By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
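
To make the difference between the two encoders concrete, here is a small sketch on a hypothetical list of four country names taken from the dataset. It is only an illustration of the two schemes, not the cell used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array(["France", "Spain", "Germany", "Spain"])

# Label encoding: one integer per category, assigned in sorted order.
labels = LabelEncoder().fit_transform(countries)
print(labels)  # [0 2 1 2]

# One-hot encoding: one binary column per category.
# OneHotEncoder expects 2D input, hence the reshape.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(countries.reshape(-1, 1)).toarray()
print(encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(one_hot)
```

With label encoding, Spain (2) looks "larger" than Germany (1); with one-hot encoding, each country gets its own column and no order is implied.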

 


In [11]:
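from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (the country) and pass the remaining columns through unchanged.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# The encoded columns come first in the output, followed by the passthrough columns.
x = np.array(ct.fit_transform(x))
print(x)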

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [12]:
from sklearn.preprocessing import LabelEncoder

# Encode the target labels as integers: 'No' becomes 0 and 'Yes' becomes 1.
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]
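
To see which integer stands for which label, the fitted encoder can be inspected. A minimal sketch, using the target values printed earlier in this notebook; `purchased`, `encoded` and `decoded` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Target values as printed earlier in this notebook.
purchased = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(purchased)

print(le.classes_)  # ['No' 'Yes'], so 'No' -> 0 and 'Yes' -> 1
print(encoded)      # [0 1 0 0 1 1 0 1 0 1]

# inverse_transform maps the integers back to the original labels.
decoded = le.inverse_transform(encoded)
print(decoded[:3])  # ['No' 'Yes' 'No']
```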


## Splitting the dataset into the Training set and Test set

We need to split the dataset into a training set, which is used to train the model, and a test set, which is used to evaluate the model.

In [13]:
# Import the train_test_split function from the sklearn.model_selection module
from sklearn.model_selection import train_test_split

# Split the dataset into a training set and a test set. test_size=0.2 holds out 20% of the rows for testing; random_state=1 fixes the shuffle so that everyone gets the same result.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
print(x_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [14]:
print(x_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [15]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [16]:
print(y_test)

[0 1]
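
The effect of `test_size` and `random_state` can be checked on placeholder data. The arrays below are made up for illustration (10 samples, like the dataset in this notebook); only the shapes and the reproducibility are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, plus alternating labels.
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1] * 5)

split_a = train_test_split(features, labels, test_size=0.2, random_state=1)
split_b = train_test_split(features, labels, test_size=0.2, random_state=1)

x_train, x_test, y_train, y_test = split_a
print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)

# Same random_state, same shuffle: both calls return identical arrays.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True
```

Leaving `random_state` unset would give a different shuffle on each run, which makes results harder to compare.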


## Feature Scaling

We apply feature scaling AFTER splitting the dataset, so that the scaler is fitted on the training set only and no information from the test set leaks into training.

Feature scaling puts all of the variables on the same scale. This prevents a feature with large values from dominating the others, which the model would otherwise neglect.

*   Standardization: works well most of the time. Most values fall between -3 and 3.

    $x_{\text{stand}} = \dfrac{x - \operatorname{mean}(x)}{\operatorname{sd}(x)}$
*   Normalization: suited to features that follow a normal distribution. Values fall between 0 and 1.

    $x_{\text{norm}} = \dfrac{x - \min(x)}{\max(x) - \min(x)}$
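
The standardization and normalization formulas used in this notebook can be written out directly in NumPy and compared with scikit-learn's scalers. This is a sketch under two assumptions: the five sample values are taken from the second column of the dataset, and `MinMaxScaler` is used as the library counterpart of the normalization formula (it is not used elsewhere in this notebook).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Five values from the second column of the dataset, as a single feature column.
values = np.array([[27.0], [30.0], [38.0], [44.0], [50.0]])

# Standardization by hand. np.std defaults to the population standard
# deviation (ddof=0), which is what StandardScaler uses.
standardized = (values - values.mean(axis=0)) / values.std(axis=0)

# Min-max normalization by hand.
normalized = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

print(np.allclose(standardized, StandardScaler().fit_transform(values)))  # True
print(np.allclose(normalized, MinMaxScaler().fit_transform(values)))      # True
```

After standardization the column has mean 0 and standard deviation 1; after normalization its smallest value is 0 and its largest is 1.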

In [17]:
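from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit the scaler on the training set only and scale its numeric columns (index 3 onward); the one-hot columns are left as they are.
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
# Scale the test set with the mean and standard deviation learned from the training set.
x_test[:, 3:] = sc.transform(x_test[:, 3:])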

In [18]:
print(x_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [19]:
print(x_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
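
The test rows are scaled with statistics learned from the training rows only. The sketch below copies the two numeric columns of the training and test sets as printed before scaling, then repeats the `transform` step by hand. The names `train`, `test` and `manual` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Numeric columns of the eight training rows and two test rows, before scaling.
train = np.array([
    [38.77777777777778, 52000.0],
    [40.0, 63777.77777777778],
    [44.0, 72000.0],
    [38.0, 61000.0],
    [27.0, 48000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [35.0, 58000.0],
])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean_ and scale_ from the training rows only
test_scaled = sc.transform(test)        # reuses those training statistics

# transform() is the same arithmetic as applying the training mean
# and standard deviation by hand.
manual = (test - sc.mean_) / sc.scale_
print(np.allclose(test_scaled, manual))  # True
print(test_scaled)
```

The printed values should agree with the last two columns of the scaled `x_test` above. Note that the scaled test columns do not have mean 0 themselves, because the mean and standard deviation come from the training set.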
