## **DATA PREPROCESSING**

* Imagine you are working on a machine learning project where you need to predict customer churn based on various features. The first step in this process is to import the necessary libraries to handle data preprocessing tasks.

* Data preprocessing is the first step in any machine learning or data mining project. It refers to manipulating or dropping data before it is used, in order to ensure or enhance performance.
* The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. If we use unclean data for machine learning, the results will not be good enough for our end application.

In this hands-on practice, we will learn the most commonly used data preprocessing steps using sklearn.

For simplicity, we only deal with .csv files in this tutorial.

**MOUNTING DRIVE TO COLAB**

In this tutorial we store our data in Google Drive, so we need to make the drive accessible from the Colab notebook. Executing this code starts an authentication step; follow the on-screen instructions.

If you are uploading data directly from your computer to the Colab runtime environment, you can comment out this code using #.

In [1]:
from google.colab import drive
drive.mount("/content/drive/")

Mounted at /content/drive/


We can also upload the dataset from a local drive using the following code segment.

In [None]:
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

**STEP 1: IMPORTING NECESSARY LIBRARIES**

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

**STEP 2: IMPORTING DATASET & SPLITTING INTO DEPENDENT AND INDEPENDENT VARIABLES**

In [3]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Importing the dataset
Dataset = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Dataset.csv')
# Matrix of features (independent variables): every column except the last
x = Dataset.iloc[:, :-1].values
# Vector of the dependent variable: the last column
y = Dataset.iloc[:, -1].values

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Let's have a look at our dataset!

In [4]:
Dataset.head(15)

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
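
If you do not have `Dataset.csv` at hand, the ten rows shown above can be rebuilt in memory. The following is only a sketch: it assumes the file's columns are `Country`, `Age`, `Salary` and `Purchased` (the unnamed first column in the printout being the DataFrame index), and the name `csv_text` is introduced here purely for illustration.

```python
import io

import pandas as pd

# The ten rows printed in the tutorial, typed in by hand.
# Empty fields stand for the missing values (shown as NaN after loading).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# read_csv accepts any file-like object, so no file on disk is needed.
dataset = pd.read_csv(io.StringIO(csv_text))

# Same split as in the tutorial: all columns but the last are features.
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

print(x.shape, y.shape)
```

This gives the same kind of `x` and `y` arrays as the Drive-based cell, so the later steps can be followed without mounting anything.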


Let's have a look at our independent variables!

In [5]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


Let's have a look at our dependent variable!

In [6]:
print(y)

['No ' 'Yes' 'No ' 'No ' 'Yes' 'Yes' 'No ' 'Yes' 'No ' 'Yes']


**STEP 3: TAKING CARE OF THE MISSING DATA**

Missing data is a very common problem in data collected manually or automatically. It occurs when a dataset has no value for a feature in an observation. It can be due to human error, sensor error, accidental deletion etc.

Missing values badly affect the accuracy of most machine learning algorithms, and some algorithms do not work at all with missing values.

Note: A missing value does not mean a "zero" value; it means there is no value at all. Missing values are normally represented by "nan" when printed.
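
Before choosing an imputation technique, it can help to count how many values are actually missing. A minimal sketch, assuming the Age and Salary values of this tutorial's dataset are typed into a small DataFrame (the name `df` is illustrative only):

```python
import numpy as np
import pandas as pd

# Age and Salary values from the tutorial's dataset; np.nan marks the missing entries.
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})

# isna() flags each missing cell; summing the flags counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# A missing value is not zero: NaN is not equal to 0, and not even equal to itself.
nan_equals_zero = (np.nan == 0)
nan_equals_nan = (np.nan == np.nan)
print(nan_equals_zero, nan_equals_nan)
```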

There are several techniques that we use to handle missing data. They include:

*   Mean Imputation
*   Hot Deck Imputation
*   Cold Deck Imputation
*   Regression Imputation etc.

For the purpose of this tutorial we use Mean Imputation, which is one of the commonly used techniques for handling missing values.

In [7]:
# Import the SimpleImputer class from the impute module in sklearn
from sklearn.impute import SimpleImputer
# Create a SimpleImputer object that replaces each missing value with its column mean
imputa = SimpleImputer(missing_values = np.nan, strategy = 'mean')
''' Using the fit method, we apply the `imputa` object to the numeric columns (Age and Salary) of our feature matrix x.
The `fit()` method computes the mean of each of these columns, ignoring the missing values.
'''
imputa.fit(x[:, 1:3])
# Replace the missing values using the transform method
x[:, 1:3] = imputa.transform(x[:, 1:3])



Let's have a look at the new x. The missing values in the Age and Salary columns are replaced with their respective column means, i.e. 38.77777777777778 and 63777.77777777778.
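
The two means can be checked by hand. The sketch below is only an illustration: it retypes the Age and Salary columns into an array called `numeric` (a name introduced here) and compares NumPy's NaN-aware mean with the values that `SimpleImputer` stores in its `statistics_` attribute after fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns of the tutorial's dataset, with the two gaps as np.nan.
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

# Column means computed directly, skipping the NaN entries.
manual_means = np.nanmean(numeric, axis=0)

# The imputer learns the same per-column means during fit().
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(numeric)

print(manual_means)
print(imputer.statistics_)
```

Both printed arrays should agree with the two values quoted above.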

In [8]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


**STEP 4: ENCODING CATEGORICAL VALUES**

Machine learning models are based on mathematical equations, which take only numerical values as inputs. If the dependent or independent variables contain string values, the model cannot use them directly to detect correlations. We therefore need to convert string entries (categorical values) in the dataset to numbers.

Examples of categorical values: France/Germany, Big/Medium/Small.

The simplest way is to code each string with a number, like 0 for France and 1 for Spain. But this method has a problem: since these are running numbers, the machine learning model may interpret them as having an order, which is not the case. To avoid this misinterpretation, we use an encoding mechanism called "one-hot encoding", in which the column with the categorical values is converted into multiple columns, one for each unique categorical value.

Let's see how one-hot encoding works by executing the code below:

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder= 'passthrough')
x = np.array(ct.fit_transform(x))
#remainder = 'passthrough' means all remaining columns are left unaffected
#sklearn column transformer documentation
#https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
#sklearn one hot encoding documentation
#https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Let's have a look at the new x.

In [10]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


From the output, the Country column has been transformed into 3 columns, with each row containing a single 1.0 in the column of its country: France was encoded as the vector [1.0 0.0 0.0], Germany as [0.0 1.0 0.0] and Spain as [0.0 0.0 1.0].
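
To confirm which encoded column belongs to which country, the fitted encoder can be inspected. A minimal sketch, assuming the Country values are retyped into a standalone array (`countries` is a name introduced here); `OneHotEncoder` returns a sparse matrix by default, so `toarray()` is used to get a dense one for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Country column of the tutorial's dataset as a single-column 2D array.
countries = np.array([
    "France", "Spain", "Germany", "Spain", "Germany",
    "France", "Spain", "France", "Germany", "France",
]).reshape(-1, 1)

encoder = OneHotEncoder()
# Convert the sparse result to a dense array so it prints like the output above.
encoded = encoder.fit_transform(countries).toarray()

# categories_ lists the learned categories in column order.
print(encoder.categories_[0])
print(encoded[:3])
```

The categories are stored in sorted order, which is why France ends up in the first column, Germany in the second and Spain in the third.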

**THE DUMMY VARIABLE TRAP**

The dummy variable trap occurs when the generated dummy variables exhibit multicollinearity. To avoid it, we drop one column from the encoded variables.

**Explain:**

**Avoiding the Dummy Variable Trap**

The code in the next cell addresses the dummy variable trap, a situation that can arise when using one-hot encoding.

**One-Hot Encoding Reminder**

Before this step, the 'Country' column, which originally had categorical values like 'France', 'Spain', and 'Germany', was transformed using one-hot encoding. This created three new columns, one for each country. Each row has a '1' in the column representing its country and '0' in the others.

**The Problem: Multicollinearity**

The dummy variable trap occurs because these new columns are perfectly collinear: the dummy values in each row always sum to 1, so if you know the values of two of the country columns, you can automatically determine the value of the third. This can cause issues with some machine learning algorithms.

**The Solution: Deleting a Column**

`x = np.delete(x, 0, 1)` uses the `np.delete()` function from the NumPy library (`np`) to remove a column from the `x` array (which holds our features).

*   `x`: The array we're modifying.
*   `0`: The index of the column we want to delete (column 0, the first one-hot encoded column, which corresponds to France).
*   `1`: The axis along which we're deleting. 1 represents columns (0 would represent rows).

`print(x)` simply prints the modified `x` array to show the result of the column deletion.

In essence, this code removes one of the dummy variable columns created during one-hot encoding to avoid the dummy variable trap, so that our data is better suited for machine learning models.



In [11]:
x = np.delete(x, 0, 1)
print(x)
#Documentation for numpy delete()
#https://www.geeksforgeeks.org/numpy-delete-python/#:~:text=The%20numpy.,along%20with%20the%20mentioned%20axis.&text=Return%20%3A,object%20along%20a%20given%20axis.

[[0.0 0.0 44.0 72000.0]
 [0.0 1.0 27.0 48000.0]
 [1.0 0.0 30.0 54000.0]
 [0.0 1.0 38.0 61000.0]
 [1.0 0.0 40.0 63777.77777777778]
 [0.0 0.0 35.0 58000.0]
 [0.0 1.0 38.77777777777778 52000.0]
 [0.0 0.0 48.0 79000.0]
 [1.0 0.0 50.0 83000.0]
 [0.0 0.0 37.0 67000.0]]
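
The redundancy behind the dummy variable trap can be seen directly, and scikit-learn can also drop the column at encoding time. The sketch below is an alternative shown for comparison, not the method used in this tutorial: it assumes the Country values are retyped into a standalone array and uses the encoder's `drop='first'` option.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Country column of the tutorial's dataset as a single-column 2D array.
countries = np.array([
    "France", "Spain", "Germany", "Spain", "Germany",
    "France", "Spain", "France", "Germany", "France",
]).reshape(-1, 1)

# Full one-hot encoding: three columns (France, Germany, Spain).
full = OneHotEncoder().fit_transform(countries).toarray()

# Every row sums to 1, so any one column is fully determined by the other two.
row_sums = full.sum(axis=1)
print(row_sums)

# Dropping column 0 by hand, as in the tutorial...
manual = np.delete(full, 0, 1)
# ...matches letting the encoder drop the first category itself.
dropped = OneHotEncoder(drop="first").fit_transform(countries).toarray()
print(np.array_equal(manual, dropped))
```

Either route leaves two dummy columns (Germany and Spain), with France represented by zeros in both.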


**ENCODING DEPENDENT VARIABLE**

We will encode the dependent variable using the code below:

In [12]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

Let's see what our new y looks like!

In [13]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
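
To see which label each integer stands for, the fitted encoder can be queried. A minimal sketch, assuming the Purchased values are retyped as a plain list (`labels` is an illustrative name, and the values are written without surrounding whitespace):

```python
from sklearn.preprocessing import LabelEncoder

# Purchased column of the tutorial's dataset.
labels = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ holds the original labels; the position of each label is its code.
print(le.classes_)
# inverse_transform maps integer codes back to the original labels.
decoded = le.inverse_transform([0, 1])
print(decoded)
```

Here "No" maps to 0 and "Yes" maps to 1, because the classes are stored in sorted order.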


**STEP 5: SPLITTING THE DATASET INTO TRAINING AND TEST SET**

In machine learning, the original dataset is not used entirely for training; otherwise we lose the opportunity to test the model on data that was not used for training.

The training set is the fraction of the original dataset that we use to develop (fit) the model.

The test set is the fraction of the original dataset that we hold back to evaluate the model on data it has not seen during training.

In [14]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2,random_state= 1)
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Let's see what the training set looks like!

In [15]:
print(x_train)
print("--------------------------------")
print(y_train)


[[0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 40.0 63777.77777777778]
 [0.0 0.0 44.0 72000.0]
 [0.0 1.0 38.0 61000.0]
 [0.0 1.0 27.0 48000.0]
 [0.0 0.0 48.0 79000.0]
 [1.0 0.0 50.0 83000.0]
 [0.0 0.0 35.0 58000.0]]
--------------------------------
[0 1 0 0 1 1 0 1]


Let's see what the test set looks like!

In [16]:
print(x_test)
print("--------------------------------")
print(y_test)

[[1.0 0.0 30.0 54000.0]
 [0.0 0.0 37.0 67000.0]]
--------------------------------
[0 1]


Our dataset is successfully split. The split is random, but because we passed random_state = 1, the same rows are selected every time the code is executed. If you omit random_state (or change its value), you may get different combinations of x_train, y_train, x_test and y_test.
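
The effect of `random_state` on the split can be checked with a small experiment. This is only a sketch on made-up placeholder arrays (`x_demo` and `y_demo` are not the tutorial's data): it splits twice with the same seed and compares the results.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, plus 10 labels.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Two splits with the same seed.
first = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)
second = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)

# Each call returns (x_train, x_test, y_train, y_test); compare them pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)

# With test_size=0.2 and 10 samples, 8 rows go to training and 2 to testing.
print(first[0].shape, first[1].shape)
```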

**STEP 6: FEATURE SCALING**

In most cases we work with datasets where the features are not on the same scale: some features have very large values and others have small values.

If we implement machine learning models on such datasets, many algorithms (for example, those based on distances or gradient descent) tend to be dominated by the features with larger values, and there is a chance that the model will treat the features with small values as if they don't exist.

To ensure this issue doesn't happen, we need to bring all the feature values onto a common scale, for example between 0 and 1 (normalization) or to zero mean and unit variance (standardization, which is what StandardScaler does). This process is called feature scaling.
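
Two common ways of putting a feature on a common scale are standardization and min-max normalization. As a minimal sketch, both can be written out by hand with NumPy on the nine known Age values from the dataset (the array name `ages` is introduced here for illustration):

```python
import numpy as np

# The nine non-missing Age values from the tutorial's dataset.
ages = np.array([44, 27, 30, 38, 40, 35, 48, 50, 37], dtype=float)

# Standardization: subtract the mean and divide by the (population) standard deviation.
# The result has mean 0 and standard deviation 1, but is not bounded to a fixed range.
standardized = (ages - ages.mean()) / ages.std()

# Min-max normalization: rescale linearly so the smallest value becomes 0 and the largest 1.
min_max = (ages - ages.min()) / (ages.max() - ages.min())

print(standardized)
print(min_max)
```

`StandardScaler` applies the first formula column by column; scikit-learn's `MinMaxScaler` applies the second.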

In this example we will scale the Age and Salary features in the x_train and x_test into one scale using the code below.

In [17]:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
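# We only apply feature scaling to the features other than the dummy variables.
x_train[:, 2:] = sc.fit_transform(x_train[:, 2:])
# Caution: fit_transform re-fits the scaler on the two test rows, which is why the
# scaled test values printed for this notebook are exactly -1 and 1.
# Calling sc.transform(x_test[:, 2:]) instead would reuse the mean and
# standard deviation learned from the training set.
x_test[:, 2:] = sc.fit_transform(x_test[:, 2:])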
#https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html


Now let's see what the new x_train and x_test look like.

In [18]:
print(x_train)
print("--------------------------------")
print(x_test)

[[0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [0.0 0.0 1.1475343068237058 1.232653363453549]
 [1.0 0.0 1.4379472069688968 1.5749910381638885]
 [0.0 0.0 -0.7401495441200351 -0.5646194287757332]]
--------------------------------
[[1.0 0.0 -1.0 -1.0]
 [0.0 0.0 1.0 1.0]]
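
The test rows above come out as exactly -1 and 1 because the scaler was fitted again on the two test rows alone. A commonly recommended alternative is to fit the scaler on the training set only and reuse its statistics for the test set. The sketch below illustrates that variant; it is not the code that produced the output above, and it assumes the unscaled Age and Salary values of the training and test rows are retyped into the arrays `train_numeric` and `test_numeric` (names introduced here).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Unscaled Age and Salary of the 8 training rows and 2 test rows from the split above.
# 349/9 and 574000/9 are the imputed column means.
train_numeric = np.array([
    [349 / 9, 52000], [40, 574000 / 9], [44, 72000], [38, 61000],
    [27, 48000], [48, 79000], [50, 83000], [35, 58000],
], dtype=float)
test_numeric = np.array([[30, 54000], [37, 67000]], dtype=float)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only.
train_scaled = scaler.fit_transform(train_numeric)
# Reuse those training statistics for the test rows; no refitting.
test_scaled = scaler.transform(test_numeric)

# The same result by hand: (value - training mean) / training standard deviation.
manual = (test_numeric - train_numeric.mean(axis=0)) / train_numeric.std(axis=0)

print(np.allclose(test_scaled, manual))
print(test_scaled)
```

With this variant the scaled test values are no longer forced to -1 and 1; they express how far each test row lies from the training mean, which is the scale the model was trained on.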


**CONCLUSION**
Now we have learned some of the most common data preprocessing steps that are done prior to any machine learning project. Refer to the corresponding lecture note for more information.