# 🧹 Data Pre-Processing

### 📓 Definition

Data preprocessing refers to the cleaning, transforming, and integrating of data in order to make it ready for analysis.

### 🔹 What will we do in this section?

- Importing the basic libraries.
- Importing the dataset directly from **<a href="https://github.com/rikisupriyo/end-to-end-ml/tree/main/DATASETS">here</a>**.
- Checking whether the dataset has any missing values.
- Replacing the missing values (if any).
- Separating the dependent and independent variables from the dataset.
- OneHotEncoding the categorical column of the dataset.
- Splitting the dataset into training and testing sets.
- Feature scaling the dataset.

In [29]:
# Importing basic libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [30]:
# Importing the dataset

dataset = pd.read_csv('https://raw.githubusercontent.com/rikisupriyo/end-to-end-ml/main/DATASETS/OTHERS/preprocessing.csv')
dataset.head()

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes


In [31]:
# Checking the datatype of each column

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        9 non-null      float64
 2   Salary     9 non-null      float64
 3   Purchased  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 452.0+ bytes


In [32]:
# Checking how many rows belong to each country in our dataset

dataset['Country'].value_counts()

Country
France     4
Spain      3
Germany    3
Name: count, dtype: int64

In [33]:
# Checking if there are any missing values in the dataset we need to replace

dataset.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

In [34]:
# Separating the independent and dependent variables into X and y respectively

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Checking the shape of X and y

print(f'Shape of X: {X.shape}\nShape of y: {y.shape}')

Shape of X: (10, 3)
Shape of y: (10,)


### 📝 NOTE:

As we can see, the **Age** and **Salary** columns each have a missing value. We can replace each missing value with the mean of its own column. SimpleImputer from sklearn makes this operation easy.

In [35]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

X[:, 1:] = imputer.fit_transform(X[:, 1:])
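
To make the effect of mean imputation concrete, here is a minimal sketch on a tiny array. The values in `toy` are made up for illustration and are not taken from the dataset above; the names `toy`, `toy_imputer` and `toy_filled` are hypothetical as well.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns with one missing value each
toy = np.array([[20.0, 1000.0],
                [30.0, np.nan],
                [np.nan, 3000.0],
                [40.0, 2000.0]])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
toy_filled = toy_imputer.fit_transform(toy)

# The per-column means learned during fit are stored in statistics_
# (here 30.0 for the first column and 2000.0 for the second)
print(toy_imputer.statistics_)

# Each NaN has been replaced by the mean of its own column
print(toy_filled)
```

Each column is handled independently, so a missing value in one column never borrows information from another column.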

### 📝 NOTE:

Now we need to OneHotEncode the **Country** column and LabelEncode the **Purchased** column to turn the text data into numerical data that a machine learning model can work with.

In [36]:
# One-hot encoding the Country column and label encoding the target

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
encoder = LabelEncoder()

X = np.array(transformer.fit_transform(X))
y = encoder.fit_transform(y)
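
After encoding, it can be hard to tell which column or integer stands for which label. The sketch below shows one way to look this up on the fitted objects. It uses three made-up rows that mirror the column layout above; all names starting with `toy_` are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Hypothetical rows with the same layout as X: Country, Age, Salary
toy_X = np.array([['France', 44.0, 72000.0],
                  ['Spain', 27.0, 48000.0],
                  ['Germany', 30.0, 54000.0]], dtype=object)
toy_y = np.array(['No', 'Yes', 'No'])

toy_transformer = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough')
toy_encoder = LabelEncoder()

toy_X_enc = np.array(toy_transformer.fit_transform(toy_X))
toy_y_enc = toy_encoder.fit_transform(toy_y)

# Categories are sorted, and their order is the order of the one-hot columns
country_order = toy_transformer.named_transformers_['encoder'].categories_[0]
print(country_order)

# classes_[i] is the label that was mapped to the integer i
print(toy_encoder.classes_)
```

With these toy rows the one-hot columns come out in the order France, Germany, Spain, and `No`/`Yes` map to 0/1.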

### 🔑 IMPORTANT NOTE:

When we one-hot encode a column, it is split into as many columns as it has distinct values and, as you can see, the new columns appear at the front of the dataset. In our case the **Country** column is split into 3 columns because it has 3 distinct values. The most important thing to do next is to omit one of the columns produced by OneHotEncoding, to avoid the **Dummy Variable** trap and not end up with perfectly correlated features: the ***k*** columns always sum to 1, so any one of them can be derived from the others. This matters most for linear models with an intercept, and it is good practice in general. If we got ***k*** columns from the encoding, we should use only ***k-1*** of them.
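
The redundancy behind the dummy variable trap can be checked directly. The sketch below uses a made-up `countries` array; it also shows `OneHotEncoder(drop='first')`, which is an alternative to slicing a column off by hand and not the approach this notebook uses.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input with 3 distinct values
countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain']])

# Full encoding: k = 3 columns, one per distinct value
full = OneHotEncoder().fit_transform(countries).toarray()

# Every row sums to 1, so any one column is determined by the other two
print(full.sum(axis=1))

# drop='first' keeps only k - 1 = 2 columns (the first category is dropped)
reduced = OneHotEncoder(drop='first').fit_transform(countries).toarray()
print(reduced.shape)  # (4, 2)
```

In the reduced encoding, a row of all zeros stands for the dropped category, so no information is lost.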

In [37]:
# Printing the first 5 rows from X and y

print(f'X : {X[:5]}')
print(f'y : {y[:5]}')

X : [[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]]
y : [0 1 0 0 1]


In [38]:
# Dropping the first one-hot column to avoid the dummy variable trap

X = X[:, 1:]
print(X[:5])

[[0.0 0.0 44.0 72000.0]
 [0.0 1.0 27.0 48000.0]
 [1.0 0.0 30.0 54000.0]
 [0.0 1.0 38.0 61000.0]
 [1.0 0.0 40.0 63777.77777777778]]


In [39]:
# Splitting the dataset into train and test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

print(f'Shape of X_train: {X_train.shape}\nShape of y_train: {y_train.shape}\nShape of X_test: {X_test.shape}\nShape of y_test: {y_test.shape}')

Shape of X_train: (8, 4)
Shape of y_train: (8,)
Shape of X_test: (2, 4)
Shape of y_test: (2,)
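
As a small illustration of the two arguments used above, the sketch below splits a made-up 10-row array twice. With `test_size=0.2`, 2 of the 10 rows go to the test set, and fixing `random_state` makes the split reproducible. The names `toy_X`, `toy_y`, `a_*` and `b_*` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# Two splits with the same random_state
a_train, a_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)

print(a_train.shape, a_test.shape)     # (8, 2) (2, 2)
print(np.array_equal(a_test, b_test))  # True: same seed, same split
```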


In [40]:
# Standardizing the numeric columns (Age and Salary) so that they are on a comparable scale

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train[:, 2:] = scaler.fit_transform(X_train[:, 2:])
X_test[:, 2:] = scaler.transform(X_test[:, 2:])

# Printing the first row of the standardized test set

print(X_test[:1])

[[1.0 0.0 -1.4588292694047795 -0.9016629672292141]]


### 🔑 IMPORTANT NOTE:

We should always apply feature scaling only after splitting the data into training and testing sets. If we scale before the split, the scaler computes the mean and standard deviation from all the values, including the ones in the test set. Since the test set stands in for future data that we would only see in production, fitting the scaler on the full dataset leaks information from the test set into training. That is why we fit the scaler on the training set and only transform the test set with it.
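
The difference between the two orderings shows up in the statistics the scaler learns. This is a minimal sketch on randomly generated numbers, not on the dataset above; `toy`, `leaky` and `clean` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data drawn at random
rng = np.random.default_rng(0)
toy = rng.normal(loc=50.0, scale=10.0, size=(20, 1))

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

# Fitting on everything: the learned mean includes the test rows
leaky = StandardScaler().fit(toy)

# Fitting on the training rows only: the test rows stay unseen
clean = StandardScaler().fit(toy_train)

print(leaky.mean_, clean.mean_)

# The test set is transformed with the training statistics only
toy_test_scaled = clean.transform(toy_test)
```

The same reasoning applies to any step that learns statistics from the data, such as an imputer that fills in column means.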

# ✅ DONE!