# Feature Scaling

The Age and Salary variables (below) are not on the same scale: Age ranges over tens, Salary over tens of thousands. Feature scaling matters for the following reason:  

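Many ML models (for example k-nearest neighbours, k-means and SVMs) rely on the Euclidean Distance between observations. Some models, such as decision trees, are not based on the Euclidean Distance and are largely unaffected by scaling; models trained with gradient descent are not distance-based either, but scaling usually lets them converge much faster.  
Euclidean Distance between $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ $= \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$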

Taking the data below as an example, the Euclidean Distance will be dominated by the salary.  
Salary: $(79000 - 48000)^2 = 961000000$  
Age: $(48 - 27) ^2 = 441$
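
As a quick check of the arithmetic above, here is a minimal sketch that computes the per-feature squared differences and the resulting distance for the two rows with ages 27 and 48. The variable names (`p1`, `p2`, `salary_share`) are illustrative and not part of the notebook code.

```python
import numpy as np

# Two observations from the example data, as (age, salary)
p1 = np.array([27.0, 48000.0])
p2 = np.array([48.0, 79000.0])

# Squared difference per feature: the terms under the square root
squared_diffs = (p2 - p1) ** 2

# Euclidean distance between the two observations
distance = np.sqrt(squared_diffs.sum())

# Fraction of the squared distance that comes from salary alone
salary_share = squared_diffs[1] / squared_diffs.sum()

print(squared_diffs)
print(distance)
print(salary_share)
```

The salary term accounts for almost all of the squared distance, which is why the unscaled age has practically no influence.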

In the model, therefore, it would be as if Age didn't exist. To ensure that all variables contribute comparably, they need to be on the same scale. The transformation can be done via either Standardization or Normalization. 


$x_{stand} = \dfrac{x - \text{mean}(x)}{\text{standard deviation}(x)}$  
  
$x_{norm} = \dfrac{x - \min(x)}{\max(x) - \min(x)}$  
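
The two formulas can be applied directly with NumPy. The sketch below uses the Age column from the example data; the names `x_stand` and `x_norm` are chosen for illustration. It assumes the population standard deviation (`ddof=0`, NumPy's default), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Age column from the example data
ages = np.array([44, 27, 30, 38, 40, 35, 38.77, 48, 50, 37])

# Standardization: subtract the mean, divide by the standard deviation
x_stand = (ages - ages.mean()) / ages.std()

# Normalization (min-max): rescale to the range [0, 1]
x_norm = (ages - ages.min()) / (ages.max() - ages.min())

print(x_stand.round(3))
print(x_norm.round(3))
```

After standardization the values have mean 0 and standard deviation 1; after normalization the smallest value maps to 0 and the largest to 1.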


Scaling needs to be done after the train/test split so that it is based on the training data alone; otherwise information from the test set leaks into the model. The same fit is then used to transform the test data set and any future data used for predictions.   
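
To make "the same fit" concrete, the sketch below standardizes by hand: the mean and standard deviation are computed from the training values only and then reused for the test values. The arrays and names here (`train_salary`, `test_salary`, `mu`, `sigma`) are made up for illustration and are not the notebook's actual split.

```python
import numpy as np

# Hypothetical split of a salary column
train_salary = np.array([48000.0, 61000.0, 72000.0, 58000.0])
test_salary = np.array([83000.0, 54000.0])

# "Fit": statistics come from the training data only
mu = train_salary.mean()
sigma = train_salary.std()

# "Transform": the same statistics are applied to both sets
train_scaled = (train_salary - mu) / sigma
test_scaled = (test_salary - mu) / sigma

print(train_scaled.round(3))
print(test_scaled.round(3))
```

The scaled test values will generally not have mean 0 and standard deviation 1, and that is expected: they are expressed relative to the training distribution.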

The example here also scales the categorical data that has been one-hot encoded. Whether this needs to be done depends on the context. 

In [3]:
## ------- Import libraries and data ------- ##

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

lst = [['France', 44, 72000, 'No'], 
       ['Spain', 27, 48000, 'Yes'],
       ['Germany', 30, 54000, 'No'],
       ['Spain', 38, 61000, 'No'],
       ['Germany', 40, 63777.78 , 'Yes'],
       ['France', 35, 58000, 'Yes'],
       ['Spain', 38.77 , 52000, 'No'],
       ['France', 48,79000, 'Yes'],
       ['Germany', 50, 83000, 'No'],
       ['France', 37, 67000, 'Yes']]  

df = pd.DataFrame(lst, columns =['Country', 'Age', 'Salary', 'Purchased'])

# set X as all independent variables
X = df.iloc[:,:-1].values

# set y as dependent variable
y = df.iloc[:,-1].values


## ------- Categorical encoding ------- ##


# Create ColumnTransformer object to one-hot encode the multi-category column (Country)
columntransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# take a copy for mapping purposes 
X_names = X.copy()

# fit to data
X = columntransformer.fit_transform(X) 

# use remainder='drop' so that the feature names cover only the encoded column
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
columntransformer_names = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='drop')
X_names = columntransformer_names.fit_transform(X_names)
X_mapping_vals = columntransformer_names.get_feature_names_out()


# create LabelEncoder object for the binary categorical target
labelencoder_y = LabelEncoder()

# fit object to the data and replace categorical data
y  = labelencoder_y.fit_transform(y)

# get mapping of features 
y_mapping_vals = dict(zip(labelencoder_y.classes_, labelencoder_y.transform(labelencoder_y.classes_)))



## ------- Splitting into Training and Test set ------- ##

# random_state is set so that the split is the same on every run; the code works without it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)


## ------- Feature Scaling ------- ##

# Make scaling object using Standardization
sc_X = StandardScaler()

# Fit and transform to training data
X_train = sc_X.fit_transform(X_train) 

# use the same fit as the training set and transform the test set
X_test = sc_X.transform(X_test) 


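# y scaling only needs to be done for regression, not classification.
# StandardScaler expects a 2D array, so a 1D y must be reshaped first;
# passing y_train as-is raises the ValueError shown in the output below.
# sc_y = StandardScaler()
# y_train = sc_y.fit_transform(y_train.reshape(-1, 1))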

[[0.0 1.0 0.0 40.0 63777.78]
 [1.0 0.0 0.0 37.0 67000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 0.0 1.0 38.77 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
[[-1.          2.64575131 -0.77459667  0.26323727  0.12381499]
 [ 1.         -0.37796447 -0.77459667 -0.25333628  0.46175629]
 [-1.         -0.37796447  1.29099445 -1.97524812 -1.53093343]
 [-1.         -0.37796447  1.29099445  0.05144212 -1.11141981]
 [ 1.         -0.37796447 -0.77459667  1.64076674  1.72029716]
 [-1.         -0.37796447  1.29099445 -0.0811451  -0.16751415]
 [ 1.         -0.37796447 -0.77459667  0.95200201  0.98614832]
 [ 1.         -0.37796447 -0.77459667 -0.59771865 -0.48214937]]


ValueError: Expected 2D array, got 1D array instead:
array=[1. 1. 1. 0. 1. 0. 0. 1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
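
The traceback above is what `StandardScaler` produces when it is given a 1D target array. A minimal sketch of the fix suggested in the error message follows, assuming the target is a single column. The names `y_train_demo` and `sc_y_demo` are illustrative, and the values are copied from the error message purely to show the shapes; scaling `y` is only meaningful for a regression target.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 1D target array, same values as in the error message (illustration only)
y_train_demo = np.array([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])

sc_y_demo = StandardScaler()

# reshape(-1, 1) turns the 1D array into a single-column 2D array
y_scaled = sc_y_demo.fit_transform(y_train_demo.reshape(-1, 1))

# ravel() flattens back to 1D if an estimator expects a 1D target
y_scaled_flat = y_scaled.ravel()

print(y_scaled.shape)
print(y_scaled_flat.shape)
```

Keeping the fitted scaler around allows predictions to be mapped back to the original units with `inverse_transform`.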