# Basic Data Preprocessing Steps

These are the basic data preprocessing steps, which are crucial to perform before you start working on your algorithm.

# 1. Importing the libraries





<img width="400" height="400" src="https://pbs.twimg.com/media/DqTK8dXX4AkkKy_.jpg">

### NumPy
NumPy stands for ‘Numerical Python’. It is an open-source Python module that provides fast mathematical computation on arrays and matrices.

### Pandas
Like NumPy, pandas is one of the most widely used Python libraries in data science. It provides high-performance, easy-to-use data structures and data analysis tools. Whereas NumPy provides objects for multi-dimensional arrays, pandas provides an in-memory 2D table object called a DataFrame. It is like a spreadsheet with column names and row labels.
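
As a rough illustration of the difference, the sketch below holds the same values first as a NumPy array and then as a pandas DataFrame. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# A NumPy array holds values of a single dtype and is indexed by position only.
ages = np.array([44.0, 27.0, 30.0])
mean_age = ages.mean()

# A pandas DataFrame adds column names and row labels on top of the values.
# The column names here are hypothetical, chosen only for illustration.
df = pd.DataFrame({"Country": ["France", "Spain", "Germany"], "Age": ages})
spain_age = df.loc[1, "Age"]  # row label 1, column "Age"

print(mean_age)
print(df)
```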

### Matplotlib
Matplotlib is a 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments. It can be used in Python scripts, the Python and IPython shells, Jupyter Notebook, web application servers, and GUI toolkits.

More info: https://cloudxlab.com/blog/numpy-pandas-introduction/

Make sure to read the official documentation too!



In [1]:
#importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 2. Importing the dataset

### Steps: 
1. Import the dataset as a dataframe using pandas.
2. Split the data into the independent variables (X) and the dependent variable (y) using iloc.
3. The first argument to iloc selects the rows and the second selects the columns. E.g. iloc[:, :-1] selects all rows and all columns except the last one.
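
The indexing described in the steps above can be sketched on a small in-memory table. The column names and rows are hypothetical stand-ins, used so the example runs without reading Data.csv from disk.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv, built in memory so the sketch is self-contained.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X = df.iloc[:, :-1].values    # all rows, every column except the last
y = df.iloc[:, -1].values     # all rows, only the last column
first_two = df.iloc[:2, 1:3]  # rows 0-1, columns 1-2 (end positions are exclusive)

print(X.shape, y.shape)
print(first_two)
```

Like ordinary Python slices, iloc slices exclude the end position, which is why `1:3` returns two columns.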

In [2]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values
print("X:\n",X)
print("\ny:\n",y)

X:
 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

y:
 ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


# 3. Dealing with Missing data

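Missing values, which pandas loads as NaN, can be filled in with the mean, the median, or the most frequent value of that column.
SimpleImputer from the sklearn.impute module can be used for this. The older Imputer class from sklearn.preprocessing was removed in scikit-learn 0.22.

In [3]:
from sklearn.impute import SimpleImputer
# np.nan marks the missing entries; each column is imputed with its own mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:,1:3] = imputer.transform(X[:,1:3])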
print("X:\n",X)
print("\ny:\n",y)

X:
 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

y:
 ['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
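
To see how the choice of strategy changes the filled-in value, here is a minimal sketch on a made-up column. It assumes a scikit-learn release that ships sklearn.impute.SimpleImputer (0.20 or later).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one missing entry (row 3).
col = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(col)
    # Record the value that replaced the NaN under this strategy.
    filled_values[strategy] = float(filled[3, 0])

print(filled_values)
```

The mean is pulled upwards by the single large value, while the median and the most frequent value are not, which is one reason to try more than one strategy on skewed columns.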




# 4. Encoding categorical data

Machine learning models work with numbers rather than text, so categorical text values are transformed into numerical values.

The values in the first column (France, Germany, Spain) can be transformed into the numerical categories 0, 1, 2. However, because the categories are numbered 0, 1, 2, the model might misinterpret them as a ranking: it could conclude that Spain (category 2) represents a greater value than Germany (category 1) or France (category 0).

To solve this problem, one-hot encoding is performed.
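
As a minimal sketch of the idea, OneHotEncoder can be applied directly to a column of country names. The sample values are made up, and `.toarray()` is used because the encoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input; the encoder expects a 2D array.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()  # categories are inferred from the data and sorted
one_hot = encoder.fit_transform(countries).toarray()  # convert sparse output to dense

print(encoder.categories_[0])
print(one_hot)
```

Each row contains a single 1 in the column of its category, so no category appears larger or smaller than another.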

Must read:  https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding

OneHotEncoder official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [4]:
#labeling data numerically and Encoding the Independent Variables
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
print("X:\n",X)

X:
 [[0 44.0 72000.0]
 [2 27.0 48000.0]
 [1 30.0 54000.0]
 [2 38.0 61000.0]
 [1 40.0 63777.77777777778]
 [0 35.0 58000.0]
 [2 38.77777777777778 52000.0]
 [0 48.0 79000.0]
 [1 50.0 83000.0]
 [0 37.0 67000.0]]


In [5]:
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.compose import make_column_transformer

# OneHotEncoding the Independent Variables
# each tuple is (transformer, columns); the old (columns, transformer) order is no longer accepted
preprocess = make_column_transformer((OneHotEncoder(categories='auto'),[0]),remainder="passthrough")
X = preprocess.fit_transform(X)

# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
print("X:\n",X)
print("\ny:\n",y)

X:
 [[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]

y:
 [0 1 0 0 1 1 0 1 0 1]


# 5. Splitting data into training and testing data

The training set is a subset of your data on which your model will learn how to predict the dependent
variable from the independent variables. The test set is the complementary subset, on
which you will evaluate your model to see whether it correctly predicts the dependent variable from the
independent variables.

In [6]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0,test_size=0.2)
print("X train:\n",X_train)
print("\ny train:\n",y_train)
print("\nX test:\n",X_test)
print("\ny test:\n",y_test)

X train:
 [[0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 37.0 67000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 35.0 58000.0]]

y train:
 [1 1 1 0 1 0 0 1]

X test:
 [[0.0 1.0 0.0 30.0 54000.0]
 [0.0 1.0 0.0 50.0 83000.0]]

y test:
 [0 0]
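
The properties of the split can be checked on toy data. The sketch below uses made-up arrays and verifies that an 80/20 split of 10 rows gives 8 training rows and 2 test rows, and that fixing random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Repeating the call with the same random_state yields the same split.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```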


# 6. Feature Scaling

Most of the time, your dataset will contain features that vary widely in magnitude, units, and range. This is a problem because many machine learning algorithms use the Euclidean distance between two data points in their computations.

If left alone, these algorithms take in only the magnitude of the features and ignore the units. The results would vary greatly between different units, e.g. 5 kg versus 5000 g. Features with high magnitudes will weigh far more in the distance calculations than features with low magnitudes.
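
A minimal sketch with two made-up data points shows how a high-magnitude feature dominates the Euclidean distance. The ages and salaries are hypothetical.

```python
import numpy as np

# Two hypothetical people described by (age in years, salary).
a = np.array([44.0, 72000.0])
b = np.array([27.0, 72500.0])

dist = np.linalg.norm(a - b)  # Euclidean distance on the raw features
# Fraction of the squared distance that comes from the age difference.
age_share = (a[0] - b[0]) ** 2 / dist ** 2

print(dist, age_share)
```

A 17-year age gap contributes well under 1% of the squared distance here, because the salary difference is numerically far larger.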

To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.

There are two common methods to perform Feature Scaling.

### 1. Standardisation:

Standardisation replaces the values with their Z-scores.

<img width="100" height="100" src="https://cdn-images-1.medium.com/max/600/1*LysCPCvg0AzQenGoarL_hQ.png">


This rescales each feature so that its mean is μ = 0 and its standard deviation is σ = 1. sklearn.preprocessing.scale or the StandardScaler class can be used to implement standardisation in Python.
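
Assuming the formula pictured above is the usual z-score, z = (x - μ) / σ, a minimal sketch comparing a manual computation with StandardScaler might look like this. The sample values are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature as a column vector.
x = np.array([[27.0], [30.0], [38.0], [44.0]])

# Manual z-score; np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(), z_sklearn.std())
```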

### 2. Mean Normalisation:

<img width="100" height="100" src="https://cdn-images-1.medium.com/max/900/1*fyK4gMQrfJKV5pmbXSrNbg.png">


The resulting values lie between -1 and 1, with mean μ = 0.
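
Assuming the formula pictured above is the usual mean normalisation, x' = (x - mean(x)) / (max(x) - min(x)), it can be computed by hand with NumPy. This is only a sketch on made-up values.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([27.0, 30.0, 38.0, 44.0, 50.0])

# Centre on the mean, then divide by the range (max - min).
x_norm = (x - x.mean()) / (x.max() - x.min())

print(x_norm)
print(x_norm.mean(), x_norm.min(), x_norm.max())
```

Because the divisor is the full range, the normalised values always span exactly 1 from smallest to largest.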

##### Note:

##### Do we really have to apply Feature Scaling on the dummy variables?
Yes, if you want to optimize the accuracy of your model's predictions.
No, if you want to keep as much interpretability as possible in your model.
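
If you choose not to scale the dummy variables, one possible approach is to scale only the numerical columns with ColumnTransformer. The sketch below is an assumption about how this could be done, not part of the original notebook, and uses made-up rows with three one-hot columns followed by two numerical ones.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: three one-hot columns, then age and salary.
X = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
    [0.0, 0.0, 1.0, 38.0, 61000.0],
])

# Scale only columns 3 and 4; the dummy columns pass through unchanged.
scaler = ColumnTransformer(
    [("num", StandardScaler(), [3, 4])], remainder="passthrough"
)
X_scaled = scaler.fit_transform(X)

# Output order: the transformed columns come first, then the passthrough columns.
print(X_scaled)
```

Note that the column order changes: the two scaled columns move to the front and the untouched dummy columns follow.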

##### When should we use Standardization and Normalization?
Generally you should normalize (normalization) when the data is normally distributed, and scale (standardization) when the data is not normally distributed. When in doubt, go for standardization. However,
what is commonly done is to test both scaling methods.



In [7]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print("X train:\n",X_train)
print("\ny train:\n",y_train)
print("\nX test:\n",X_test)
print("\ny test:\n",y_test)

X train:
 [[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]]

y train:
 [1 1 1 0 1 0 0 1]

X test:
 [[-1.          2.64575131 -0.77459667 -1.45882927 -0.90166297]
 [-1.          2.64575131 -0.77459667  1.98496442  2.13981082]]

y test:
 [0 0]


