# Feature Scaling

The Age and Salary variables (below) are not on the same scale. Feature scaling matters for the following reason:  

Many ML models (for example k-nearest neighbours and k-means) rely on the Euclidean distance.  
The Euclidean distance between $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ is $\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

Taking the data below as an example, the Euclidean Distance will be dominated by the salary.  
Salary: $(79000 - 48000)^2 = 961000000$  
Age: $(48 - 27) ^2 = 441$
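
The arithmetic above can be checked with a short sketch. It assumes the two points are the (Age, Salary) pairs (27, 48000) and (48, 79000) from the sample data; the variable names are illustrative only.

```python
import math

# Two (Age, Salary) rows taken from the sample data
p1 = (27, 48000)
p2 = (48, 79000)

# Squared difference contributed by each feature
age_term = (p2[0] - p1[0]) ** 2      # 21 ** 2
salary_term = (p2[1] - p1[1]) ** 2   # 31000 ** 2

# Euclidean distance and the share of the squared distance due to Salary
distance = math.sqrt(age_term + salary_term)
salary_share = salary_term / (age_term + salary_term)

print("Age term:   ", age_term)
print("Salary term:", salary_term)
print("Distance:   ", distance)
print("Salary share of squared distance:", salary_share)
```

Salary accounts for almost the entire squared distance, so the unscaled distance is essentially the salary difference alone.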

In the model, therefore, it would be as if Age didn't exist. To ensure that all variables contribute comparably, they need to be on the same scale. The transformation can be done via either Standardization or Normalization. 


$x_{stand} = \dfrac{x - \text{mean}(x)}{\text{std}(x)}$  
  
$x_{norm} = \dfrac{x - \min(x)}{\max(x) - \min(x)}$  
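
As a rough illustration, both formulas can be applied by hand with NumPy. The sketch below uses the Age values from the sample data. It assumes the population standard deviation (NumPy's default, `ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Age column from the sample data
age = np.array([44, 27, 30, 38, 40, 35, 38.77, 48, 50, 37])

# Standardization: subtract the mean, divide by the standard deviation
age_stand = (age - age.mean()) / age.std()

# Normalization (min-max): rescale values into the range [0, 1]
age_norm = (age - age.min()) / (age.max() - age.min())

print(age_stand.round(2))
print(age_norm.round(2))
```

After standardization the column has mean 0 and standard deviation 1. After normalization the smallest value maps to 0 and the largest to 1.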




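Scale after the train/test split: fit the scaler on the training set only, then apply the same transformation to the test set. Fitting on the full dataset would let information from the test set leak into the scaling parameters. The results may differ only slightly, but the test set should stay unseen so that it gives an honest measure of generalization. 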
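
A minimal sketch of the idea, using a hypothetical hand-made split of a single Salary column (the arrays and names below are assumptions for illustration, not the notebook's actual split):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical split of one numeric feature (Salary), as 2-D column arrays
salary_train = np.array([[48000.0], [54000.0], [61000.0], [72000.0]])
salary_test = np.array([[83000.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training rows only
salary_train_scaled = scaler.fit_transform(salary_train)

# transform reuses the training mean and standard deviation on the test rows
salary_test_scaled = scaler.transform(salary_test)

print("Training mean:", scaler.mean_[0])
print("Training std: ", scaler.scale_[0])
print("Scaled test value:", salary_test_scaled[0, 0])
```

The test value is scaled with statistics that never saw it, which is the same situation the model faces with new data.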

In [1]:
## ------- Import libraries and data ------- ##

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

lst = [['France', 44, 72000, 'No'], 
       ['Spain', 27, 48000, 'Yes'],
       ['Germany', 30, 54000, 'No'],
       ['Spain', 38, 61000, 'No'],
       ['Germany', 40, 63777.78, 'Yes'],
       ['France', 35, 58000, 'Yes'],
       ['Spain', 38.77, 52000, 'No'],
       ['France', 48, 79000, 'Yes'],
       ['Germany', 50, 83000, 'No'],
       ['France', 37, 67000, 'Yes']]  

df = pd.DataFrame(lst, columns =['Country', 'Age', 'Salary', 'Purchased'])

# set X as all independent variables
X = df.iloc[:,:-1].values

# set y as dependent variable
y = df.iloc[:,-1].values

## ------- Splitting into Training and Test set ------- ##

# random_state fixes the shuffle so the split is the same on every run; the code works without it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

## ------- Feature Scaling ------- ##

# StandardScaler only accepts numeric input, so scale Age and Salary
# (columns 1 and 2) and leave the Country strings in column 0 untouched
sc_X = StandardScaler()
X_train[:, 1:] = sc_X.fit_transform(X_train[:, 1:])
X_test[:, 1:] = sc_X.transform(X_test[:, 1:])

# y ('Yes'/'No') is a categorical label, so it is not scaled
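
For the Normalization alternative mentioned earlier, a comparable sketch with scikit-learn's `MinMaxScaler` might look like the following. Both scalers accept only numeric input, so the sketch selects the Age and Salary columns by name and leaves Country alone. The five-row frame and the `numeric_cols` name are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# A few rows of the sample data (Age stored as float so scaled values fit the column)
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, 30.0, 38.0, 40.0],
    'Salary': [72000.0, 48000.0, 54000.0, 61000.0, 63777.78],
})

numeric_cols = ['Age', 'Salary']  # only these columns are scaled

train, test = train_test_split(df, test_size=0.2, random_state=0)
train, test = train.copy(), test.copy()

mm = MinMaxScaler()
# Fit on the training rows only, then reuse the same min/max for the test rows
train[numeric_cols] = mm.fit_transform(train[numeric_cols])
test[numeric_cols] = mm.transform(test[numeric_cols])

print(train)
print(test)
```

Because the minimum and maximum come from the training rows, scaled test values can fall slightly outside [0, 1].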