## Variable magnitude

### Does the magnitude of the variable matter?

In Linear Regression models, the scale of variables used to estimate the output matters. Linear models are of the type **y = w x + b**, where the regression coefficient w represents the expected change in y for a one unit change in x (the predictor). Thus, the magnitude of w is partly determined by the magnitude of the units being used for x. If x is a distance variable, just changing the scale from kilometers to miles will cause a change in the magnitude of the coefficient.
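The kilometers-to-miles effect can be checked with a small sketch. The data below is synthetic (the slope, intercept and noise level are assumptions chosen for illustration, not values from the Titanic dataset); the point is only that the fitted coefficient is rescaled by the unit conversion factor while the fit itself is unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# hypothetical data: distance in kilometers and a noisy linear target
km = rng.uniform(1, 100, size=200)
y = 3.0 * km + 5.0 + rng.normal(0, 1, size=200)

# the same distances expressed in miles
KM_PER_MILE = 1.609344
miles = km / KM_PER_MILE

# fit one model per unit of measurement
w_km = LinearRegression().fit(km.reshape(-1, 1), y).coef_[0]
w_miles = LinearRegression().fit(miles.reshape(-1, 1), y).coef_[0]

print('coefficient per km:  ', w_km)
print('coefficient per mile:', w_miles)
print('ratio (should match KM_PER_MILE):', w_miles / w_km)
```

The coefficient per mile is larger than the coefficient per kilometer simply because one mile is a bigger unit, not because distance became a more important predictor.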

In addition, in situations where we estimate the outcome y by contemplating multiple predictors x1, x2, ...xn, predictors with greater numeric ranges dominate over those with smaller numeric ranges.

Gradient descent converges faster when all the predictors (x1 to xn) are on a similar scale; therefore, having features on a similar scale is useful for Neural Networks as well.
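One way to get an intuition for this: for a linear model with a squared-error loss, gradient descent slows down as the condition number of the loss Hessian grows, and features on very different scales inflate that number. The sketch below uses synthetic data (the two feature ranges are assumptions for illustration) and compares the condition number before and after min-max scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# two hypothetical predictors on very different scales
X_raw = np.column_stack([
    rng.uniform(0, 1, size=n),      # small-range feature
    rng.uniform(0, 1000, size=n),   # large-range feature
])

# min-max scale each column to [0, 1]
col_min = X_raw.min(axis=0)
col_max = X_raw.max(axis=0)
X_scaled = (X_raw - col_min) / (col_max - col_min)


def hessian_condition_number(X):
    # X.T @ X / n is proportional to the Hessian of the mean squared
    # error of a linear model without intercept
    hessian = X.T @ X / len(X)
    return np.linalg.cond(hessian)


cond_raw = hessian_condition_number(X_raw)
cond_scaled = hessian_condition_number(X_scaled)

print('condition number, unscaled features:', cond_raw)
print('condition number, scaled features:  ', cond_scaled)
```

A smaller condition number means the loss surface is less elongated, so a single learning rate works reasonably well in every direction.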

In Support Vector Machines, feature scaling can decrease the time to find the support vectors.

Finally, methods using Euclidean distances or distances in general are also affected by the magnitude of the features, as Euclidean distance is sensitive to variations in the magnitude or scales of the predictors. Therefore feature scaling is required for methods that utilise distance calculations like k-nearest neighbours (KNN) and k-means clustering.
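A minimal sketch of this sensitivity, using three made-up passengers described by (Age, Fare). The passenger values and the feature ranges used for scaling are hypothetical, chosen only to show that the nearest neighbour can change once the features are put on the same scale.

```python
import numpy as np

# hypothetical passengers: (Age, Fare)
a = np.array([25.0, 10.0])
b = np.array([60.0, 12.0])   # much older than a, similar fare
c = np.array([26.0, 60.0])   # similar age to a, higher fare

# assumed feature ranges used for min-max scaling
feature_min = np.array([0.0, 0.0])
feature_max = np.array([80.0, 512.0])


def minmax(x):
    # rescale each feature to [0, 1] using the assumed ranges
    return (x - feature_min) / (feature_max - feature_min)


def euclidean(u, v):
    return np.sqrt(np.sum((u - v) ** 2))


d_ab_raw, d_ac_raw = euclidean(a, b), euclidean(a, c)
d_ab_scaled = euclidean(minmax(a), minmax(b))
d_ac_scaled = euclidean(minmax(a), minmax(c))

print('unscaled: d(a, b) =', d_ab_raw, ' d(a, c) =', d_ac_raw)
print('scaled:   d(a, b) =', d_ab_scaled, ' d(a, c) =', d_ac_scaled)
```

On the raw values the fare difference dominates, so b looks closer to a than c does; after scaling, the 35-year age gap counts for more and c becomes the nearest neighbour.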

In summary:

#### Magnitude matters because:

- The regression coefficient is directly influenced by the scale of the variable
- Variables with bigger magnitude / value range dominate over the ones with smaller magnitude / value range
- Gradient descent converges faster when features are on similar scales
- Feature scaling helps decrease the time to find support vectors for SVMs
- Euclidean distances are sensitive to feature magnitude

#### The machine learning models affected by the magnitude of the feature are:

- Linear and Logistic Regression
- Neural Networks
- Support Vector Machines
- KNN
- K-means clustering
- Linear Discriminant Analysis (LDA)
- Principal Component Analysis (PCA)

#### Machine learning models insensitive to feature magnitude are the ones based on Trees:

- Classification and Regression Trees
- Random Forests
- Gradient Boosted Trees
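Tree-based models split on thresholds, and a threshold only depends on the order of the values, which min-max scaling preserves. The sketch below illustrates this on synthetic data (the features and the target rule are assumptions for illustration): a decision tree trained on raw features and one trained on scaled features should produce the same predictions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# hypothetical data: two features on very different scales
X = np.column_stack([
    rng.uniform(0, 1, size=300),
    rng.uniform(0, 1000, size=300),
])
y = (X[:, 0] + X[:, 1] / 1000 > 1).astype(int)

X_train, X_test = X[:200], X[200:]
y_train = y[:200]

# fit the scaler on the training portion only
scaler = MinMaxScaler().fit(X_train)

# same tree settings, trained on raw and on scaled features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_raw.fit(X_train, y_train)

tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_scaled.fit(scaler.transform(X_train), y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

agreement = np.mean(pred_raw == pred_scaled)
print('fraction of identical predictions:', agreement)
```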

===================================================================================================

## In this Demo

We will study the effect of feature magnitude on the performance of different machine learning algorithms.

We will use the Titanic dataset.

- To download the dataset please refer to the **Datasets** lecture in **Section 1** of the course.

In [1]:
import pandas as pd
import numpy as np

# import several machine learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

# to scale the features
from sklearn.preprocessing import MinMaxScaler

# to evaluate performance and separate into
# train and test set
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

### Load data with numerical variables only

In [2]:
# load numerical variables of the Titanic Dataset

data = pd.read_csv('../titanic.csv',
                   usecols=['Pclass', 'Age', 'Fare', 'Survived'])
data.head()

|   | Survived | Pclass | Age | Fare |
|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 7.25 |
| 1 | 1 | 1 | 38.0 | 71.2833 |
| 2 | 1 | 3 | 26.0 | 7.925 |
| 3 | 1 | 1 | 35.0 | 53.1 |
| 4 | 0 | 3 | 35.0 | 8.05 |


In [3]:
# let's have a look at the values of those variables
# to get an idea of the feature magnitudes

data.describe()

|   | Survived | Pclass | Age | Fare |
|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 512.3292 |


We can see that Fare varies between 0 and 512, Age between 0.42 and 80, and Pclass between 1 and 3. So the variables have different magnitudes.

In [4]:
# let's now calculate the range

for col in ['Pclass', 'Age', 'Fare']:
    print(col, 'range: ', data[col].max() - data[col].min())

Pclass range:  2
Age range:  79.58
Fare range:  512.3292


The ranges of values that the variables can take are quite different.

In [5]:
# let's separate into training and testing set
# the titanic dataset contains missing information
# so for this demo, I will fill those in with 0s

X_train, X_test, y_train, y_test = train_test_split(
    data[['Pclass', 'Age', 'Fare']].fillna(0),
    data.Survived,
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((623, 3), (268, 3))

### Feature Scaling

For this demonstration, I will scale the features between 0 and 1, using the MinMaxScaler from scikit-learn. To learn more about this scaling, visit the Scikit-Learn [website](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).

The transformation is given by:

X_rescaled = (X - X.min()) / (X.max() - X.min())

And to transform the re-scaled features back to their original magnitude:

X = X_rescaled * (X.max() - X.min()) + X.min()
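As a quick check of the min-max transformation and its inverse, the sketch below applies both by hand to the five rows shown in `data.head()` and compares the result with `MinMaxScaler`. Using only these five rows is an assumption made to keep the example self-contained; the minimum and maximum are therefore those of this small sample, not of the full dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# the first five rows of (Pclass, Age, Fare) shown in data.head()
X = np.array([
    [3, 22.0, 7.25],
    [1, 38.0, 71.2833],
    [3, 26.0, 7.925],
    [1, 35.0, 53.1],
    [3, 35.0, 8.05],
])

# per-column minimum and maximum of this small sample
x_min = X.min(axis=0)
x_max = X.max(axis=0)

# forward transformation: every column ends up in [0, 1]
X_rescaled = (X - x_min) / (x_max - x_min)

# inverse transformation: back to the original magnitudes
X_restored = X_rescaled * (x_max - x_min) + x_min

# the same operations with scikit-learn
scaler = MinMaxScaler().fit(X)
X_sklearn = scaler.transform(X)
X_sklearn_restored = scaler.inverse_transform(X_sklearn)

print('manual matches MinMaxScaler:', np.allclose(X_rescaled, X_sklearn))
print('inverse recovers the data:  ', np.allclose(X_restored, X))
```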

**There is a dedicated section to feature scaling later in the course, where I will explain this and other scaling techniques in more detail**. For now, let's carry on with the demonstration.

In [6]:
# scale the features between 0 and 1.

# call the scaler
scaler = MinMaxScaler()

# fit the scaler to the train set
scaler.fit(X_train)

# rescale the datasets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [7]:
# let's have a look at the scaled training dataset

print('Mean: ', X_train_scaled.mean(axis=0))
print('Standard Deviation: ', X_train_scaled.std(axis=0))
print('Minimum value: ', X_train_scaled.min(axis=0))
print('Maximum value: ', X_train_scaled.max(axis=0))

Mean:  [0.64365971 0.30131421 0.06335433]
Standard Deviation:  [0.41999093 0.21983527 0.09411705]
Minimum value:  [0. 0. 0.]
Maximum value:  [1. 1. 1.]


Now the maximum value of every feature is 1 and the minimum value is 0, as expected, so the features are on a more similar scale.

### Logistic Regression

Let's evaluate the effect of feature scaling in a Logistic Regression.

In [8]:
# model built on unscaled variables

# call the model
logit = LogisticRegression(
    random_state=44,
    C=1000,  # large C to make the regularization negligible
    solver='lbfgs')

# train the model
logit.fit(X_train, y_train)

# evaluate performance
print('Train set')
pred = logit.predict_proba(X_train)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = logit.predict_proba(X_test)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_test, pred[:, 1])))

Train set
Logistic Regression roc-auc: 0.7134823539619531
Test set
Logistic Regression roc-auc: 0.7080952380952381


In [9]:
# let's look at the coefficients
logit.coef_

array([[-0.92585764, -0.01822689,  0.00233577]])

In [10]:
# model built on scaled variables

# call the model
logit = LogisticRegression(
    random_state=44,
    C=1000,  # large C to make the regularization negligible
    solver='lbfgs')

# train the model using the re-scaled data
logit.fit(X_train_scaled, y_train)

# evaluate performance
print('Train set')
pred = logit.predict_proba(X_train_scaled)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = logit.predict_proba(X_test_scaled)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_test, pred[:, 1])))

Train set
Logistic Regression roc-auc: 0.7134931997136721
Test set
Logistic Regression roc-auc: 0.7080952380952381


In [11]:
logit.coef_

array([[-1.85170244, -1.45782986,  1.19540159]])

We observe that the performance of logistic regression did not change when using the datasets with the features scaled (compare roc-auc values for train and test set for models with and without feature scaling). 

However, when looking at the coefficients we do see a big difference in the values. This is because the magnitude of the variables was affecting the coefficients. After scaling, the 3 coefficients have comparable magnitudes, whereas before scaling we would be inclined to think that Pclass alone was driving the Survival outcome.
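The two sets of coefficients are linked by simple arithmetic: since x = x_scaled * (max - min) + min, a coefficient w on the raw feature corresponds to w * (max - min) on the scaled feature. The sketch below checks this with the coefficients printed above. The feature ranges are an assumption: they are taken from the full-dataset ranges shown earlier, with the Age minimum set to 0 because missing ages were filled with 0, and the training split is assumed to contain the same extremes.

```python
import numpy as np

# coefficients reported above for (Pclass, Age, Fare)
coef_unscaled = np.array([-0.92585764, -0.01822689, 0.00233577])
coef_scaled = np.array([-1.85170244, -1.45782986, 1.19540159])

# assumed training-set ranges (max - min):
# Pclass 1..3, Age 0..80 (NaN filled with 0), Fare 0..512.3292
feature_range = np.array([2.0, 80.0, 512.3292])

# coefficient on the scaled feature = coefficient on the raw feature * range
predicted = coef_unscaled * feature_range

print('predicted scaled coefficients:', predicted)
print('reported scaled coefficients: ', coef_scaled)
print('match within 1%:', np.allclose(predicted, coef_scaled, rtol=1e-2))
```

The small remaining differences are plausibly due to the solver stopping at slightly different points and to the (weak) regularization, which is not invariant to rescaling.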

### Support Vector Machines

In [12]:
# model built on unscaled variables

# call the model
SVM_model = SVC(random_state=44, probability=True, gamma='auto')

# train the model
SVM_model.fit(X_train, y_train)

# evaluate performance
print('Train set')
pred = SVM_model.predict_proba(X_train)
print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = SVM_model.predict_proba(X_test)
print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
SVM roc-auc: 0.9016995292943755
Test set
SVM roc-auc: 0.6768154761904762


In [13]:
# model built on scaled variables

# call the model
SVM_model = SVC(random_state=44, probability=True, gamma='auto')

# train the model
SVM_model.fit(X_train_scaled, y_train)

# evaluate performance
print('Train set')
pred = SVM_model.predict_proba(X_train_scaled)
print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = SVM_model.predict_proba(X_test_scaled)
print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
SVM roc-auc: 0.7047081408212403
Test set
SVM roc-auc: 0.6988690476190476


Feature scaling improved the performance of the support vector machine. After feature scaling, the model no longer over-fits the training set (compare the train roc-auc of 0.90 for the model on unscaled features vs 0.70 for the model on scaled features). In addition, the roc-auc for the testing set increased from 0.68 to 0.70.

### Neural Networks

In [14]:
# model built on unscaled features

# call the model
NN_model = MLPClassifier(random_state=44, solver='sgd')

# train the model
NN_model.fit(X_train, y_train)

# evaluate performance
print('Train set')
pred = NN_model.predict_proba(X_train)
print('Neural Network roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = NN_model.predict_proba(X_test)
print('Neural Network roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
Neural Network roc-auc: 0.6012288236697686
Test set
Neural Network roc-auc: 0.565


In [15]:
# model built on scaled features

# call the model
NN_model = MLPClassifier(random_state=44, solver='sgd')

# train the model
NN_model.fit(X_train_scaled, y_train)

# evaluate performance
print('Train set')
pred = NN_model.predict_proba(X_train_scaled)
print('Neural Network roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = NN_model.predict_proba(X_test_scaled)
print('Neural Network roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
Neural Network roc-auc: 0.7165300101950066
Test set
Neural Network roc-auc: 0.7124404761904761




We observe that scaling the features improved the performance of the neural network on both the training and the testing sets: the roc-auc rose from 0.60 to 0.72 on the training set and from 0.565 to 0.71 on the testing set.

### K-Nearest Neighbours

In [16]:
# model built on unscaled features

# call the model
KNN = KNeighborsClassifier(n_neighbors=3)

# train the model
KNN.fit(X_train, y_train)

# evaluate performance
print('Train set')
pred = KNN.predict_proba(X_train)
print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = KNN.predict_proba(X_test)
print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
KNN roc-auc: 0.8694225721784778
Test set
KNN roc-auc: 0.6253571428571428


In [17]:
# model built on scaled features

# call the model
KNN = KNeighborsClassifier(n_neighbors=3)

# train the model
KNN.fit(X_train_scaled, y_train)

# evaluate performance
print('Train set')
pred = KNN.predict_proba(X_train_scaled)
print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = KNN.predict_proba(X_test_scaled)
print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
KNN roc-auc: 0.8880555736318084
Test set
KNN roc-auc: 0.7017559523809525


We observe for KNN as well that feature scaling improved the performance of the model. The model built on scaled features generalises better, with a higher roc-auc for the testing set (0.70 vs 0.63 for the model built on unscaled features).

Both KNN models are over-fitting to the train set. Thus, we would need to change the parameters of the model or use fewer features to try and decrease over-fitting, which is beyond the scope of this demonstration.

### Random Forests

In [18]:
# model built on unscaled features

# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train, y_train)

# evaluate performance
print('Train set')
pred = rf.predict_proba(X_train)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = rf.predict_proba(X_test)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Train set
Random Forests roc-auc: 0.9916108110453136
Test set
Random Forests roc-auc: 0.7614285714285715


In [19]:
# model built on scaled features

# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train_scaled, y_train)

# evaluate performance
print('Train set')
pred = rf.predict_proba(X_train_scaled)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = rf.predict_proba(X_test_scaled)
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
Random Forests roc-auc: 0.9916541940521898
Test set
Random Forests roc-auc: 0.7610714285714285


As expected, Random Forests show virtually no change in performance regardless of whether they are trained on a dataset with scaled or unscaled features (test roc-auc of 0.761 in both cases). This model in particular is over-fitting to the training set, so we would need to do some work to reduce the over-fitting, which is beyond the scope of this demonstration.

### AdaBoost

In [20]:
# train adaboost on non-scaled features

# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train, y_train)

# evaluate model performance
print('Train set')
pred = ada.predict_proba(X_train)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = ada.predict_proba(X_test)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
AdaBoost roc-auc: 0.8477364916162339
Test set
AdaBoost roc-auc: 0.7733630952380953


In [21]:
# train adaboost on scaled features

# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train_scaled, y_train)

# evaluate model performance
print('Train set')
pred = ada.predict_proba(X_train_scaled)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = ada.predict_proba(X_test_scaled)
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
AdaBoost roc-auc: 0.8477364916162339
Test set
AdaBoost roc-auc: 0.7733630952380953


As expected, AdaBoost shows no change in performance regardless of whether it is trained on a dataset with scaled or unscaled features.

**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**