# Data Preprocessing Tools

## Importing the libraries

In [28]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [29]:
dataset = pd.read_csv('Data.csv') # read the csv file
X = dataset.iloc[:, :-1].values # features
y = dataset.iloc[:, -1].values # dependent variable vector

**Q. What is iloc?**
  - `iloc` selects rows and columns by integer position (0-based), in the order in which they appear in the DataFrame.
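
As a small illustration of positional selection, the sketch below builds a hypothetical three-row DataFrame (the column names and values are made up for this example, not taken from `Data.csv`) and slices it the same way as the cell above.

```python
import pandas as pd

# Hypothetical frame, used only to illustrate positional indexing
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Purchased": ["No", "Yes", "No"],
})

features = df.iloc[:, :-1].values  # all rows, every column except the last
target = df.iloc[:, -1].values     # all rows, last column only
first_row = df.iloc[0]             # first row, returned as a Series

print(features.shape)  # (3, 2)
print(target)          # ['No' 'Yes' 'No']
```

The first slice in `iloc[rows, columns]` picks rows and the second picks columns, so `:-1` means "every column up to, but not including, the last one".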

**Q. What are features and the dependent variable?**
  - Features are the independent variables that we use to predict the dependent variable.
  - The dependent variable is the variable that we want to predict.


In [30]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [31]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

Two common ways to handle missing data are:
1. Remove the rows that contain missing values
2. Replace each missing value with the mean of its column

We will use the second method here.
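
To make the difference between the two options concrete, here is a minimal pandas sketch on a hypothetical four-row frame (the values are invented for illustration). It uses `dropna` and `fillna` rather than `SimpleImputer`, purely to show the effect of each choice.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each numeric column
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

dropped = df.dropna()          # option 1: drop rows with any missing value
filled = df.fillna(df.mean())  # option 2: fill each gap with its column mean

print(len(dropped))               # 2
print(filled.isna().sum().sum())  # 0
```

Dropping rows halves this tiny dataset, which is why mean imputation is often preferred when data is scarce.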

In [32]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3]) # fit computes the mean of each selected column
# Note: only the numeric columns (1 and 2) are passed; the categorical column 0 is excluded

X[:, 1:3] = imputer.transform(X[:, 1:3]) # transform replaces the missing values with the column means


In [33]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding categorical data

**Encoding categorical data is a process of converting categorical data into numerical data.**

There are two ways to encode categorical data:
1. Label Encoding
2. One Hot Encoding

_Label Encoding_

  In label encoding, each category is assigned a unique integer.
  This is done by using the `LabelEncoder` class from `sklearn.preprocessing`, which numbers the categories in sorted order.
  For example, if we have a column with the values France, Spain, Germany, then the label encoding will be:
  ```
  France -> 0
  Germany -> 1
  Spain -> 2
  ```

_One Hot Encoding_

  In one hot encoding, we create a new binary column for each category.
  This is done by using the `OneHotEncoder` class from `sklearn.preprocessing`, which orders the new columns by sorted category.
  For example, if we have a column with the values France, Spain, Germany, then the one hot encoding will be:
  ```
  France -> 1 0 0
  Germany -> 0 1 0
  Spain -> 0 0 1
  ```
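
The two encodings can be compared side by side on a small array. This is a minimal sketch on a hypothetical `countries` array, not the notebook's `X`. scikit-learn sorts the categories, so Germany is encoded before Spain.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical column of country names
countries = np.array(["France", "Spain", "Germany", "Spain"])

# LabelEncoder assigns integers to the sorted class labels
le = LabelEncoder()
labels = le.fit_transform(countries)
print(le.classes_)  # ['France' 'Germany' 'Spain']
print(labels)       # [0 2 1 2]

# OneHotEncoder expects 2D input and returns a sparse matrix by default,
# so we reshape to one column and convert the result to a dense array
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(countries.reshape(-1, 1)).toarray()
print(one_hot)
```

Label encoding implies an order (0 < 1 < 2) that does not exist between countries, which is one reason one hot encoding is usually chosen for nominal features.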

### Encoding the Independent Variable

In [34]:
# We will use the one hot encoding method here

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# The ColumnTransformer class takes transformers as a list of tuples
# The first element of each tuple is the name of the transformer
# The second element is the transformer object
# The third element is the list of columns to apply the transformer to
# The remainder parameter specifies what to do with the columns that are not listed
# Its default value is 'drop', which means that the remaining columns are dropped
# With 'passthrough', the remaining columns are kept unchanged and appended after the encoded ones

X = np.array(ct.fit_transform(X))
# fit_transform fits the transformer to the data and then transforms the data
# Note: We wrap the result in np.array so that X is a NumPy array, as in the earlier steps

In [35]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [36]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [37]:
print(y)

[0 1 0 0 1 1 0 1 0 1]


## Splitting the dataset into the Training set and Test set

In [38]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [39]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [40]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [41]:
print(y_train) 

[0 1 0 0 1 1 0 1]


In [42]:
print(y_test)

[0 1]


## Feature Scaling

**Q. What is feature scaling?**
  - Feature scaling rescales the independent variables (features) so that they share a comparable range.
  - It is performed during the data preprocessing step.
  - For distance-based algorithms, it lets each feature contribute approximately proportionately to the final distance.
  - It prevents features with large magnitudes (such as salary) from dominating features with small magnitudes (such as age).

There are two ways to perform feature scaling:
1. Standardization
2. Normalization

_Standardization_

  Standardization rescales each feature so that it has a mean of zero and a standard deviation of one. It shifts and scales the values but does not change the shape of their distribution.
  This is done by using the `StandardScaler` class from `sklearn.preprocessing`.

  ```python
  X_std = (X - X.mean(axis=0)) / X.std(axis=0)
  ```
  where `X` is the feature matrix and `axis=0` means that the standardization is performed on each column.

  > For roughly normally distributed features, most of the resulting values fall between -3 and 3, but the range is not bounded.
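
As a quick check of the formula, the following sketch compares it with `StandardScaler` on a hypothetical age/salary matrix (values chosen for illustration). Both use the population standard deviation, so the results should agree.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one age column, one salary column
X = np.array([[44.0, 72000.0],
              [27.0, 48000.0],
              [30.0, 54000.0],
              [38.0, 61000.0]])

X_manual = (X - X.mean(axis=0)) / X.std(axis=0)  # np.std defaults to ddof=0
X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))  # True
print(X_sklearn.mean(axis=0))            # approximately 0 for each column
print(X_sklearn.std(axis=0))             # approximately 1 for each column
```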

_Normalization_

  Normalization (min-max scaling) rescales real-valued numeric features into the range 0 to 1.
  This is done by using the `MinMaxScaler` class from the `sklearn` library.

  ```python
  X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
  ```
  where `X` is the feature matrix and `axis=0` means that the normalization is performed on each column.

  > This will put all the values in the range of 0 to 1.
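
The same kind of check works for min-max scaling. This sketch again uses a hypothetical age/salary matrix and compares the formula with `MinMaxScaler` under its default `feature_range=(0, 1)`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: one age column, one salary column
X = np.array([[44.0, 72000.0],
              [27.0, 48000.0],
              [30.0, 54000.0],
              [38.0, 61000.0]])

X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_sklearn = MinMaxScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))  # True
print(X_sklearn.min(axis=0))             # [0. 0.]
print(X_sklearn.max(axis=0))             # [1. 1.]
```

Values outside the range seen during fitting (for example, in a test set) can land below 0 or above 1 after the transform.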

**Q. When to use standardization and when to use normalization?**
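  - Standardization is a reasonable default: it suits features that are roughly normally distributed and generally works well in other cases too.
  - Normalization is often used when the features are not normally distributed or when bounded values are needed. Because it depends on the minimum and maximum, it is sensitive to outliers.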

In [45]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
# Note: We only transform the test set (no fit) so that it is scaled with the mean and standard deviation learned from the training set
# The one-hot columns (0 to 2) are left unscaled

In [46]:
print(X_train)

[[0.0 0.0 1.0 -0.1915918438457856 -1.0781259408412427]
 [0.0 1.0 0.0 -0.014117293757057902 -0.07013167641635401]
 [1.0 0.0 0.0 0.5667085065333239 0.6335624327104546]
 [0.0 0.0 1.0 -0.3045301939022488 -0.30786617274297895]
 [0.0 0.0 1.0 -1.901801144700799 -1.4204636155515822]
 [1.0 0.0 0.0 1.1475343068237056 1.2326533634535488]
 [0.0 1.0 0.0 1.4379472069688966 1.5749910381638883]
 [1.0 0.0 0.0 -0.7401495441200352 -0.5646194287757336]]


In [47]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830127 -0.9069571034860731]
 [1.0 0.0 0.0 -0.44973664397484425 0.20564033932253029]]
