## Feature Scaling

__Columns with different ranges__

annual income ($70,000, $60,000, $52,000)

age (45 years, 44, 40)

With unscaled features, the values in one column can be so much larger than the values in another column that they overpower them.
<br>In the example above, consecutive ages differ by 1 and 4 years, while consecutive incomes differ by 10,000 and 8,000 dollars. We might then draw the erroneous conclusion that the differences of 1 and 4 are negligible, and focus only on the large-magnitude differences of 10,000 and 8,000.

And that's why __we need to normalize variables: otherwise we can't compare them__.
<br>Right now we're comparing salaries to years, which is like comparing apples and oranges. These are non-comparable things.

Even if two columns share the same unit of measurement, like dollars and dollars, they still might not be comparable, because they can relate to different things.

So it's important to scale your features.
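To make the "overpowering" effect concrete, here is a minimal sketch using the income and age values from the example above. The variable names are illustrative only, and the share of the squared Euclidean distance is just one possible way to show the effect.

```python
import numpy as np

# Rows are people; columns are annual income (dollars) and age (years).
# Values are taken from the example above; the names are hypothetical.
people = np.array([
    [70000.0, 45.0],
    [60000.0, 44.0],
    [52000.0, 40.0],
])

# Difference between the second and third person on the raw values.
raw_diff = people[1] - people[2]          # 8000 dollars, 4 years
# Share of the squared Euclidean distance that comes from income.
raw_income_share = raw_diff[0] ** 2 / np.sum(raw_diff ** 2)

# Standardize each column: subtract its mean, divide by its standard deviation.
scaled = (people - people.mean(axis=0)) / people.std(axis=0)
scaled_diff = scaled[1] - scaled[2]
scaled_income_share = scaled_diff[0] ** 2 / np.sum(scaled_diff ** 2)

print(raw_income_share, scaled_income_share)
```

On the raw values almost the entire distance comes from income. After scaling, age contributes as well.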

---

## Importing the libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#pyplot module

---

## Importing the Dataset

In [3]:
df = pd.read_csv('Data.csv')  

In [4]:
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [5]:
# the features and the dependent variable vector
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

An important principle in Python that you must know:
a __range__ in Python __includes the lower bound__, but __excludes the upper bound__. This is why `:-1` selects every column except the last one.
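The bound rule can be checked on a small example. This is a sketch on a two-row stand-in for the data set (`df_demo` is a hypothetical name, not the notebook's `df`):

```python
import pandas as pd

# A tiny stand-in for the data set (hypothetical name df_demo).
df_demo = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Salary": [72000.0, 48000.0],
    "Purchased": ["No", "Yes"],
})

# range(0, 3) includes 0 but excludes 3.
first_three = list(range(0, 3))

# :-1 runs from the first column up to, but not including, the last one.
features = df_demo.iloc[:, :-1]
# -1 alone picks the last column.
target = df_demo.iloc[:, -1]
# 1:3 picks the columns at positions 1 and 2 (Age and Salary), not 3.
numeric = df_demo.iloc[:, 1:3]

print(first_three, list(features.columns), target.name, list(numeric.columns))
```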

In [6]:
df.iloc[:, :-1]
# at this point it is still a DataFrame

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,
5,France,35.0,58000.0
6,Spain,,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


In [7]:
# we need NumPy arrays, which .values gives us
df.iloc[:, :-1].values

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [8]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [9]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

---

## Object-oriented programming

A __class__ is the model, or a blueprint, of something we want to build. For example, if we make a house construction plan that gathers the instructions on how to build a house, then this construction plan is the class.

An __object__ is an instance of the class. In the same example of the house construction plan, an object is simply a house: a house (the object) that was built by following the instructions of the construction plan (the class).
There can therefore be many objects of the same class, because we can build many houses from the same construction plan.

A __method__ is a tool we can use on the object to complete a specific action. In this same example, a tool could be opening the main door of the house when a guest is coming. A method can also be seen as a function that is applied to the object, takes some inputs (that were defined in the class) and returns some output.
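The house analogy can be written out directly in Python. This is only an illustrative sketch; the `House` class and its method are made up for this example.

```python
class House:
    """The class: a construction plan shared by every house."""

    def __init__(self, owner):
        # Each object built from the plan stores its own data.
        self.owner = owner
        self.door_open = False

    def open_main_door(self, guest):
        """A method: an action applied to one particular house."""
        self.door_open = True
        return f"{self.owner} opens the door for {guest}"


# Two objects (houses) built from the same class (construction plan).
house_a = House("Alice")
house_b = House("Bob")

# Calling the method on one object does not affect the other.
message = house_a.open_main_door("Carol")
print(message, house_a.door_open, house_b.door_open)
```

The same pattern appears later in the notebook: import a class, create an object of it, then call the object's methods.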

---

## Taking care of Missing data

In [10]:
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


There are several ways __to handle missing values__.

- __The first way__ is to ignore the observation __by deleting it__. This works if you have a large data set and only about 1% of the data is missing, because removing 1% of the observations won't change the learning quality of your model much.
- __The second way__ is to replace the missing value with __the average__ of all the values in the column in which the data is missing.
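Both options can be sketched with plain pandas before turning to scikit-learn. The small frame below is made up for illustration (`demo` is a hypothetical name), and `dropna` / `fillna` are used here as stand-ins for the two strategies rather than as the method this notebook follows.

```python
import numpy as np
import pandas as pd

# A small made-up frame with one missing value in each column.
demo = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 52000.0, 61000.0],
})

# First way: delete every observation (row) that has a missing value.
dropped = demo.dropna()

# Second way: replace each missing value with the mean of its column.
filled = demo.fillna(demo.mean())

print(len(demo), len(dropped))
print(filled)
```

Note how deleting costs half of the rows in this tiny example, which is why deletion is only reasonable when very little data is missing.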

In [15]:
# the SimpleImputer class we want to import belongs to a scikit-learn module called impute,
# and we access a module with a dot: sklearn.impute

In [14]:
# the class that we're going to use from sklearn is called SimpleImputer.

# first import the class,
# then create an instance (an object) of the class.

In [17]:
# impute - to attribute, to ascribe (here: to fill in a missing value)

In [18]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# replace all the missing values

Other replacements are possible: instead of the average salary you could use the median salary, or, for categories, the most frequent value.
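As a sketch of those alternatives, `SimpleImputer` also accepts `strategy='median'` and `strategy='most_frequent'`. The arrays below are made up for illustration and are not the notebook's `X`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up salary column with one missing value.
salaries = np.array([[48000.0], [52000.0], [np.nan], [83000.0]])
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
salaries_filled = median_imputer.fit_transform(salaries)

# Made-up categorical column with one missing value.
countries = np.array([["France"], ["Spain"], [np.nan], ["France"]], dtype=object)
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
countries_filled = mode_imputer.fit_transform(countries)

print(salaries_filled.ravel())
print(countries_filled.ravel())
```

The median is less sensitive to a single very large salary than the mean, which is one reason to prefer it on skewed columns.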



In [21]:
# fit the imputer on the numerical columns of the matrix of features (Age and Salary)
imputer.fit(X[:, 1:3])
# then transform: replace the missing values with the column means
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [22]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

---

## Encoding Categorical Data

This data set contains one column with categories: France, Spain, or Germany.
<br>It would be difficult for a machine learning model to compute correlations between a column of strings and the outcome. Therefore, we have to __turn__ these strings, these __categories, into numbers__.

One idea would be to encode France as zero, Spain as one, and Germany as two. However, if we do this, our future ML model could infer that there is a numerical order between these three countries and that this order matters, which is of course not the case.

In [26]:
set(X[:, 0])

{'France', 'Germany', 'Spain'}

Another way is __One-Hot Encoding__, which creates one binary vector per category.
<br>France - the vector 1 0 0
<br>Germany - the vector 0 1 0
<br>Spain - the vector 0 0 1
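A minimal sketch of one-hot encoding on its own, outside a `ColumnTransformer`. The `countries` array is made up for illustration. scikit-learn's `OneHotEncoder` sorts the categories, so the columns come out in the order France, Germany, Spain.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up single column of categories.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display.
one_hot = encoder.fit_transform(countries).toarray()

# The learned categories, in the column order of the output.
categories = list(encoder.categories_[0])

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column of its country, so no numerical order between countries is implied.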

The dependent variable can simply be replaced by zeros and ones. That's totally fine as long as it is a binary outcome: replacing No and Yes by zero and one will not compromise the future accuracy of the model.

In [27]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

__encoding the independent variable__


In [28]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# remainder='passthrough' means we keep the columns that are not transformed
# (not one-hot encoded) instead of dropping them.

In [38]:
X = ct.fit_transform(X)
# the result should be an array (if it's not, convert it with np.array(), because the
# ML models we're going to build expect one).

In [40]:
X # the country column is now three binary columns

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

__encoding the dependent variable__

In [43]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

In [44]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [45]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

---

## Splitting the dataset into Training set and Test set

### !!!

We have to apply __feature scaling after splitting__ the data set into the training set and the test set.

Why?
<br>Because the test set is something you're not supposed to work with during training. Feature scaling is a technique that computes the mean and the standard deviation of each feature. If we apply feature scaling before the split, it computes the mean and the standard deviation over all the values, including the ones in the test set. Since the test set is something you're not supposed to have, scaling the original data set before the split would cause what we call __information leakage__ from the test set: we would grab some information from the test set that we're not supposed to get.
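The leakage can be seen in the statistics a scaler learns. Below is a minimal sketch on a made-up age column (`ages` and the scaler names are hypothetical); it uses `StandardScaler`, whose fitted mean is stored in the `mean_` attribute.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up age column with one unusually large value.
ages = np.array([[44.0], [27.0], [30.0], [38.0], [40.0],
                 [35.0], [48.0], [50.0], [37.0], [90.0]])

ages_train, ages_test = train_test_split(ages, test_size=0.2, random_state=1)

# Leaky: fitted before the split, so test values influence the mean.
leaky_scaler = StandardScaler().fit(ages)

# Clean: fitted on the training set only.
clean_scaler = StandardScaler().fit(ages_train)

print(leaky_scaler.mean_[0], clean_scaler.mean_[0])
```

Whenever the two means differ, the difference is information that came from rows the model is not supposed to have seen.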

__to prevent information leakage from the test set__

In [46]:
from sklearn.model_selection import train_test_split

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# random_state fixes the random seed, so the same split (and the same results) is produced on every run

In [48]:
X_train

array([[0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)

In [49]:
X_test

array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [52]:
y_train

array([0, 1, 0, 0, 1, 1, 0, 1])

In [53]:
y_test

array([0, 1])

---

## Feature Scaling

It puts all our features on the same scale.

Be aware that we won't have to apply feature scaling for all ML models, just for some of them.

A question often asked in the data science community:
<br>__Should we go for standardization or normalization?__

- __Normalization__ (rescaling each feature to the range from 0 to 1) is recommended when you have __a normal distribution__ in most of your __features__.
- __Standardization__ (subtracting the mean and dividing by the standard deviation) works well all the time.
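For reference, the two transformations are usually defined as follows: normalization (min-max scaling) computes `(x - min) / (max - min)`, and standardization computes `(x - mean) / standard deviation`. A minimal sketch on a made-up salary column, using scikit-learn's `MinMaxScaler` and `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up salary column.
salary = np.array([[48000.0], [54000.0], [61000.0], [72000.0], [83000.0]])

# Normalization: (x - min) / (max - min), so values land in [0, 1].
normalized = MinMaxScaler().fit_transform(salary)

# Standardization: (x - mean) / std, so the column has mean 0 and std 1.
standardized = StandardScaler().fit_transform(salary)

print(normalized.ravel())
print(standardized.ravel())
```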

We won't fit the scaler on the whole matrix of features X.
We fit it on X_train only, then use it to transform X_train and X_test separately.

__Performing standardization__

In [54]:
from sklearn.preprocessing import StandardScaler

In [55]:
sc = StandardScaler()

__!!! one of the most frequently asked questions in the data science community__
<br>Do we have to apply feature scaling - standardization to the dummy variables in the matrix of features?

The answer is no. The goal of standardization, or feature scaling in general, is to bring all the values of the features into the same range. Standardization transforms your features so that they take values between more or less minus three and plus three, and our dummy variables are already inside that range because they're equal to either one or zero. So there is nothing extra to gain from standardizing them.

In fact, standardization would only make things worse: the values would still end up between minus three and plus three, but they would no longer be zeros and ones, so you would lose the interpretation of these variables. In other words, you would lose the information of which country corresponds to the observation.
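A quick sketch of what would happen to a dummy column if it were standardized anyway. The `france_dummy` column is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up dummy column: 1 means "France", 0 means "not France".
france_dummy = np.array([[1.0], [0.0], [0.0], [0.0], [1.0]])

scaled_dummy = StandardScaler().fit_transform(france_dummy)

# Still only two distinct values, but they are no longer 0 and 1.
distinct_values = np.unique(np.round(scaled_dummy, 6))

print(distinct_values)
```

The column still separates the two groups, but the values no longer read as a plain yes/no flag.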

In [56]:
# we won't apply feature scaling on the dummy variables
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

- The __.fit() method__ only computes the mean and the standard deviation of each feature.
- The __.transform() method__ then applies the standardization formula, (x - mean) / standard deviation, to each of the values.
- The StandardScaler class also has a __.fit_transform() method__, which performs both steps at once.
- On __the test set__ we __only__ apply the __.transform() method__, because its features need to be scaled by the same scaler that was fitted on the training set.
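The division of labor between the methods can be checked by hand. This is a sketch on made-up `train` and `test` columns, comparing the scaler's output with the formula computed manually from the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test columns.
train = np.array([[27.0], [38.0], [44.0], [50.0]])
test = np.array([[30.0], [37.0]])

scaler = StandardScaler()
scaler.fit(train)                        # learns mean and std from train only
train_scaled = scaler.transform(train)   # applies (x - mean) / std
test_scaled = scaler.transform(test)     # reuses the training mean and std

# The same result for the test set, computed by hand from training statistics.
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

# fit_transform is fit followed by transform on the same data.
train_scaled_one_step = StandardScaler().fit_transform(train)

print(test_scaled.ravel(), manual_test.ravel())
```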

In [57]:
X_train

array([[0.0, 0.0, 1.0, -0.19159184384578545, -1.0781259408412425],
       [0.0, 1.0, 0.0, -0.014117293757057777, -0.07013167641635372],
       [1.0, 0.0, 0.0, 0.566708506533324, 0.633562432710455],
       [0.0, 0.0, 1.0, -0.30453019390224867, -0.30786617274297867],
       [0.0, 0.0, 1.0, -1.9018011447007988, -1.420463615551582],
       [1.0, 0.0, 0.0, 1.1475343068237058, 1.232653363453549],
       [0.0, 1.0, 0.0, 1.4379472069688968, 1.5749910381638885],
       [1.0, 0.0, 0.0, -0.7401495441200351, -0.5646194287757332]],
      dtype=object)

In [58]:
X_test

array([[0.0, 1.0, 0.0, -1.4661817944830124, -0.9069571034860727],
       [1.0, 0.0, 0.0, -0.44973664397484414, 0.2056403393225306]],
      dtype=object)