Sumit Mohod | sumitmohod1991@gmail.com  
**Data Preprocessing** - Importing, Cleaning, Preparing, Splitting Data and Feature Scaling.  
The following notebook can be used as a template, with a worked example, for understanding data preprocessing.

# Data Preprocessing

## Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset

In [2]:
dataset = pd.read_csv('Data.csv')
dataset

,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [3]:
X = dataset.drop(labels= 'Purchased', axis= 1)          # Select every column except the target variable
X.head(3)

,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0


In [4]:
y = dataset['Purchased']                # Selecting the Target variable column
y.head(3)

0     No
1    Yes
2     No
Name: Purchased, dtype: object

## Taking care of missing data

In [5]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values= np.nan, strategy= 'mean')
imputer.fit(X[['Age', 'Salary']])               # Taking only the variable columns containing missing values
X[['Age', 'Salary']] = imputer.transform(X[['Age', 'Salary']])

In [6]:
X

,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.777778
5,France,35.0,58000.0
6,Spain,38.777778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0
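
The imputed values can be checked by hand. The sketch below is a pandas-only equivalent of the mean strategy, assuming the `Age` and `Salary` values shown in the dataset above; the names `numeric`, `column_means` and `filled` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Age and Salary columns copied from the dataset displayed earlier.
numeric = pd.DataFrame({
    "Age":    [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# DataFrame.mean() skips NaN by default, so each mean uses the 9 known values.
column_means = numeric.mean()

# Replace every NaN with the mean of its own column.
filled = numeric.fillna(column_means)

print(column_means.round(6))
print(filled.loc[[4, 6]].round(6))   # the two rows that had missing values
```

Rows 4 and 6 should carry the same filled-in values as the table above, since `SimpleImputer(strategy='mean')` also replaces each missing entry with its column mean.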


## Encoding categorical data

### Encoding the Independent Variable

In [7]:
Country = pd.get_dummies(X['Country'], drop_first= True)           # Creates one dummy column per category, then drops the first one to avoid the dummy variable trap.

X.drop(labels= 'Country', axis= 1, inplace= True)                  # Drops the original categorical column from X now that its dummy columns exist.

X = pd.concat(objs= [X, Country], axis= 1)                         # Concatenates X with the dummy columns (Country) into a single dataframe X.

In [8]:
X.head()

,Age,Salary,Germany,Spain
0,44.0,72000.0,0,0
1,27.0,48000.0,0,1
2,30.0,54000.0,1,0
3,38.0,61000.0,0,1
4,40.0,63777.777778,1,0
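
To see why one dummy column is dropped, the sketch below encodes the first five `Country` values with and without `drop_first`. With all three columns, every row sums to 1, so any one column is fully determined by the other two; that redundancy is the dummy variable trap. The variable names are illustrative, and `dtype=int` is passed only so the result prints as 0/1 on any pandas version.

```python
import pandas as pd

# First five Country values from the dataset above.
country = pd.Series(
    ["France", "Spain", "Germany", "Spain", "Germany"], name="Country"
)

# One column per category: France, Germany, Spain.
full = pd.get_dummies(country, dtype=int)

# Same encoding with the first category (France) dropped.
reduced = pd.get_dummies(country, drop_first=True, dtype=int)

# Every row of the full encoding sums to exactly 1.
row_sums = full.sum(axis=1)

# The dropped column can be rebuilt from the remaining two.
france = 1 - reduced["Germany"] - reduced["Spain"]

print(full)
print(reduced)
print(france.tolist())
```

A row with `Germany = 0` and `Spain = 0` therefore still identifies France without a dedicated column.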


### Encoding the Dependent Variable

In [9]:
y = pd.get_dummies(y, drop_first= True)               # Encodes the target as a single column named 'Yes' (1 = Yes, 0 = No); the 'No' column is dropped because it is redundant.

In [10]:
y.head()

,Yes
0,0
1,1
2,0
3,0
4,1
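
`pd.get_dummies` returns a one-column dataframe here. As an alternative sketch (not the approach used in this notebook), a binary target can be mapped to a one-dimensional 0/1 Series, the shape many estimators expect for a target. The mapping dictionary and the names `purchased` and `y_encoded` are assumptions made for illustration.

```python
import pandas as pd

# First five Purchased values from the dataset above.
purchased = pd.Series(["No", "Yes", "No", "No", "Yes"], name="Purchased")

# Explicit mapping: the meaning of 0 and 1 is visible in the code.
y_encoded = purchased.map({"No": 0, "Yes": 1})

print(y_encoded.tolist())
```

An explicit mapping also leaves any unexpected label as `NaN`, which makes data-entry errors easy to spot.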


## Splitting the dataset into the Training set and Test set

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state= 1)

In [12]:
X_train.head()

,Age,Salary,Germany,Spain
4,40.0,63777.777778,1,0
0,44.0,72000.0,0,0
3,38.0,61000.0,0,1
1,27.0,48000.0,0,1
7,48.0,79000.0,0,0


In [13]:
X_test.head()

,Age,Salary,Germany,Spain
2,30.0,54000.0,1,0
9,37.0,67000.0,0,0
6,38.777778,52000.0,0,1


In [14]:
y_train.head()

,Yes
4,1
0,0
3,0
1,1
7,1


In [15]:
y_test.head()

,Yes
2,0
9,1
6,0
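
The test set holds three rows even though 25% of ten rows is 2.5. The arithmetic below is a small sketch of how the split sizes work out, assuming the test share is rounded up and the remainder goes to training, which matches the three test rows shown above.

```python
import math

n_samples = 10      # rows in the dataset
test_size = 0.25    # fraction passed to train_test_split

# 0.25 * 10 = 2.5, rounded up to whole rows for the test set.
n_test = math.ceil(test_size * n_samples)

# Everything that is not in the test set is used for training.
n_train = n_samples - n_test

print(n_train, n_test)
```

With so few rows, the effective test share is 30% rather than the requested 25%.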


## Feature Scaling

* Feature scaling is used when the features span very different ranges; here Age is in the tens while Salary is in the tens of thousands.
* It prevents a feature with large values from dominating the others in models that rely on distances or gradient-based optimisation.
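
Standardisation, the scaling used in this notebook, replaces each value x with z = (x - mean) / std. The NumPy sketch below applies that formula to the `Age` column, assuming the seven training rows and three test rows produced by the split above; the array and variable names are illustrative.

```python
import numpy as np

# Ages of the seven training rows (indices 0, 1, 3, 4, 5, 7, 8).
train_age = np.array([44.0, 27.0, 38.0, 40.0, 35.0, 48.0, 50.0])

# Ages of the three test rows (indices 2, 9, 6); row 6 holds the imputed mean.
test_age = np.array([30.0, 37.0, 349 / 9])

# Statistics are computed from the training data only.
mu = train_age.mean()
sigma = train_age.std()   # population standard deviation (ddof=0), as StandardScaler uses

# The same mean and standard deviation are applied to both splits.
train_scaled = (train_age - mu) / sigma
test_scaled = (test_age - mu) / sigma

print(train_scaled.round(6))
print(test_scaled.round(6))
```

The training column ends up with mean 0 and standard deviation 1. The test column generally does not, because it is scaled with the training statistics so that no information from the test set leaks into preprocessing.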

In [16]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train[['Age','Salary']] = scaler.fit_transform(X_train[['Age','Salary']])             # Fit on the training set only; scale just the numeric columns.
X_test[['Age','Salary']] = scaler.transform(X_test[['Age','Salary']])                   # Reuse the training mean and standard deviation on the test set.

In [17]:
X_train.head()

,Age,Salary,Germany,Spain
4,-0.03891,-0.2296,1,0
0,0.505833,0.491205,0,0
3,-0.311282,-0.473116,0,1
1,-1.809325,-1.612768,0,1
7,1.050576,1.104864,0,0


In [18]:
X_test.head()

,Age,Salary,Germany,Spain
2,-1.400768,-1.086774,1,0
9,-0.447467,0.052878,0,0
6,-0.205359,-1.262106,0,1


The data above is now ready to be used with an ML model.

# Thank you!