# VARIABLE MAGNITUDE

### Datasets:

1. Titanic dataset.

### Content:
1. Loading Data
2. Feature Scaling (MinMaxScaler)
3. Logistic Regression
    - on unscaled variables
    - on scaled variables

## 1. Loading Data

In [1]:
import pandas as pd
import numpy as np

# import several machine learning algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

# to scale the features
from sklearn.preprocessing import MinMaxScaler

# to evaluate performance and separate into
# train and test set
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

### Load data with numerical variables only

In [2]:
# load numerical variables of the Titanic Dataset

data = pd.read_csv('../titanic.csv',
                   usecols=['pclass', 'age', 'fare', 'survived'])
data.head()

|   | pclass | survived | age | fare |
|---|---|---|---|---|
| 0 | 1 | 1 | 29.0 | 211.3375 |
| 1 | 1 | 1 | 0.9167 | 151.55 |
| 2 | 1 | 0 | 2.0 | 151.55 |
| 3 | 1 | 0 | 30.0 | 151.55 |
| 4 | 1 | 0 | 25.0 | 151.55 |


In [3]:
data.describe()

|   | pclass | survived | age | fare |
|---|---|---|---|---|
| count | 1309.0 | 1309.0 | 1046.0 | 1308.0 |
| mean | 2.294882 | 0.381971 | 29.881135 | 33.295479 |
| std | 0.837836 | 0.486055 | 14.4135 | 51.758668 |
| min | 1.0 | 0.0 | 0.1667 | 0.0 |
| 25% | 2.0 | 0.0 | 21.0 | 7.8958 |
| 50% | 3.0 | 0.0 | 28.0 | 14.4542 |
| 75% | 3.0 | 1.0 | 39.0 | 31.275 |
| max | 3.0 | 1.0 | 80.0 | 512.3292 |


Fare varies between 0 and 512, age between 0.17 and 80, and pclass between 1 and 3, so the variables have very different magnitudes. Age and fare also contain missing values (their counts are below 1309).

In [4]:
# calculate the range

for col in ['pclass', 'age', 'fare']:
    print(col, 'range: ', data[col].max() - data[col].min())

pclass range:  2
age range:  79.8333
fare range:  512.3292


The range of values that each variable can take is quite different.

In [5]:
# train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[['pclass', 'age', 'fare']].fillna(0),
    data.survived,
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((916, 3), (393, 3))

## 2. Feature Scaling

__MinMaxScaler__:<br>
X_rescaled = (X - X.min) / (X.max - X.min)
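
The formula can be checked by hand. The following is a minimal sketch with NumPy on a small made-up array (the values and the `toy_*` names are illustrative, not taken from the Titanic data). It learns the minimum and range from the training rows only and reuses them for the test rows, which is what fitting on the training set and then transforming both sets amounts to.

```python
import numpy as np

# Toy data (hypothetical values): columns play the role of pclass, age, fare
toy_train = np.array([[1.0, 20.0, 10.0],
                      [2.0, 40.0, 50.0],
                      [3.0, 60.0, 250.0]])
toy_test = np.array([[2.0, 70.0, 5.0]])

# "fit": learn the per-column minimum and range from the training set only
col_min = toy_train.min(axis=0)
col_range = toy_train.max(axis=0) - col_min

# "transform": apply the same parameters to both sets
toy_train_scaled = (toy_train - col_min) / col_range
toy_test_scaled = (toy_test - col_min) / col_range

print(toy_train_scaled)
print(toy_test_scaled)
```

In this toy example the test row has an age above the training maximum and a fare below the training minimum, so its scaled values fall outside [0, 1]. That is expected whenever the scaling parameters come from the training set alone.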

In [6]:
# call the scaler
scaler = MinMaxScaler()

# fit the scaler
scaler.fit(X_train)

# scale the datasets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [7]:
# the scaled training dataset

print('Mean: ', X_train_scaled.mean(axis=0))
print('Standard Deviation: ', X_train_scaled.std(axis=0))
print('Minimum value: ', X_train_scaled.min(axis=0))
print('Maximum value: ', X_train_scaled.max(axis=0))

Mean:  [0.64628821 0.33048359 0.06349833]
Standard Deviation:  [0.42105785 0.23332045 0.09250036]
Minimum value:  [0. 0. 0.]
Maximum value:  [1. 1. 1.]


## 3. Logistic Regression

In [8]:
# model built on unscaled variables

# call the model
logit = LogisticRegression(
    random_state=44,
    C=1000,
    solver='lbfgs')

# train the model
logit.fit(X_train, y_train)

# evaluate performance
print('Train set')
pred = logit.predict_proba(X_train)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = logit.predict_proba(X_test)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_test, pred[:, 1])))

Train set
Logistic Regression roc-auc: 0.6793181006244372
Test set
Logistic Regression roc-auc: 0.7175488081411426


In [9]:
# coefficients
logit.coef_

array([[-0.71428242, -0.00923013,  0.00425235]])

In [10]:
# model built on scaled variables

# train the model using the re-scaled data
logit.fit(X_train_scaled, y_train)

# evaluate performance
print('Train set')
pred = logit.predict_proba(X_train_scaled)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_train, pred[:, 1])))
print('Test set')
pred = logit.predict_proba(X_test_scaled)
print('Logistic Regression roc-auc: {}'.format(
    roc_auc_score(y_test, pred[:, 1])))

Train set
Logistic Regression roc-auc: 0.6793281640744896
Test set
Logistic Regression roc-auc: 0.7175488081411426


In [11]:
logit.coef_

array([[-1.42875872, -0.68293349,  2.17646757]])

__unchanged__: the performance of logistic regression (the test roc-auc is identical, and the train roc-auc differs only in the fifth decimal place)

__changed__: the regression coefficients
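
One way to see why performance barely moves: min-max scaling is a linear rescaling of each column, and a logistic regression with weak regularisation (here `C=1000`) can absorb that rescaling into its coefficients and intercept. The block below is a minimal sketch on synthetic data. The data-generating process, the sample size, and `max_iter=5000` are assumptions chosen for illustration and are not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical data)
n = 500
X = np.column_stack([
    rng.integers(1, 4, size=n),    # small range, like a class label (1 to 3)
    rng.uniform(0, 80, size=n),    # medium range
    rng.uniform(0, 500, size=n),   # large range
]).astype(float)

# Binary outcome that depends on all three columns through a logistic link
linear_term = -0.8 * X[:, 0] - 0.02 * X[:, 1] + 0.005 * X[:, 2] + 1.0
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-linear_term))).astype(int)

scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)


def fit_and_score(features):
    """Fit a weakly regularised logistic regression; return coefficients and roc-auc."""
    model = LogisticRegression(C=1000, solver='lbfgs', max_iter=5000)
    model.fit(features, y)
    auc = roc_auc_score(y, model.predict_proba(features)[:, 1])
    return model.coef_.ravel(), auc


coef_raw, auc_raw = fit_and_score(X)
coef_scaled, auc_scaled = fit_and_score(X_scaled)

print('roc-auc raw / scaled:', auc_raw, auc_scaled)
print('scaled coefficients:          ', coef_scaled)
print('raw coefficients x data range:', coef_raw * scaler.data_range_)
```

Under these assumptions the two roc-auc values should be nearly equal, and the scaled coefficients should be close to the raw coefficients multiplied by each column's range (`scaler.data_range_`). Small differences can remain because of the regularisation term and solver tolerance.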

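After scaling, the three coefficients are of comparable magnitude, whereas before scaling only pclass appeared to drive the survival outcome. The unscaled coefficients are expressed per unit of each variable (per class, per year of age, per unit of fare), so their sizes are not directly comparable. After scaling, each coefficient describes the effect of moving a variable from its training minimum to its training maximum.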
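
The link between the two sets of coefficients can be read off the printed outputs. Under min-max scaling, one unit of a scaled feature corresponds to (max - min) original units, so each scaled coefficient should be roughly the unscaled coefficient times that feature's training range. The sketch below reuses only the coefficient values printed in cells In [9] and In [11]. The training-set ranges themselves are not printed in this notebook, so the comparison with the ranges from In [4] is approximate.

```python
# Coefficients copied from the outputs of cells In [9] and In [11]
features = ['pclass', 'age', 'fare']
coef_unscaled = [-0.71428242, -0.00923013, 0.00425235]
coef_scaled = [-1.42875872, -0.68293349, 2.17646757]

# Ratio scaled / unscaled: the (max - min) implied for each feature
implied_range = [s / r for r, s in zip(coef_unscaled, coef_scaled)]

for name, value in zip(features, implied_range):
    print(name, 'implied range:', round(value, 2))
```

The implied ranges come out close to 2 for pclass and close to 512 for fare, in line with the ranges computed in In [4]. For age the ratio is about 74 instead of 79.8. That would be consistent with missing ages having been filled with 0 and a training-set maximum age near 74, but this is an inference, not something the notebook prints.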

__other scalers__:<br>
I also tried other scalers, such as RobustScaler and StandardScaler. The result above did not hold for them, and they performed worse.

__other classification models__:<br>
In addition, I tried other classification models, such as Support Vector Machines, K-Nearest Neighbours, and Random Forests. Although __results improved__, these models were over-fitted, so the results were not informative.