In [1]:
# Import the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [6]:
# Import the dataset, drop observations with missing values
cols = ['loan_amnt', 'int_rate', 'installment']
data = pd.read_csv('./data/LendingClub_Issued_Loans/lc_loan.csv', usecols=cols).dropna()

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 887379 entries, 0 to 887378
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   loan_amnt    887379 non-null  float64
 1   int_rate     887379 non-null  float64
 2   installment  887379 non-null  float64
dtypes: float64(3)
memory usage: 27.1 MB


### Basic Analysis

In [9]:
data.describe()

|       | loan_amnt    | int_rate | installment |
|-------|--------------|----------|-------------|
| count | 887379.0     | 887379.0 | 887379.0    |
| mean  | 14755.264605 | 13.24674 | 436.717127  |
| std   | 8435.455601  | 4.381867 | 244.186593  |
| min   | 500.0        | 5.32     | 15.67       |
| 25%   | 8000.0       | 9.99     | 260.705     |
| 50%   | 13000.0      | 12.99    | 382.55      |
| 75%   | 20000.0      | 16.2     | 572.6       |
| max   | 35000.0      | 28.99    | 1445.46     |


The variables span very different value ranges and therefore have different magnitudes: loan_amnt runs from 500 to 35,000, int_rate from 5.32 to 28.99, and installment from 15.67 to 1,445.46. Both the minimum and maximum values and the widths of the ranges differ.

### Standardization (StandardScaler):

Standardization (or Z-score normalization) centers a variable at zero and scales its variance to 1. The procedure subtracts the variable's mean from each observation and then divides by the variable's standard deviation: `X_std = (X - mean) / std`.

StandardScaler from scikit-learn removes the mean and scales the data to unit variance. We can import the StandardScaler class from scikit-learn and apply it to our dataset.
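To make the formula concrete, here is a minimal NumPy sketch of the same computation on a small made-up array (the values in `X` are hypothetical and are not taken from the loan dataset). It uses the population standard deviation (`ddof=0`), which is also what StandardScaler uses.

```python
import numpy as np

# Hypothetical toy data: rows are observations, columns are features.
# The second column is the first one in different units (x 0.005).
X = np.array([[1000.0,  5.0],
              [2000.0, 10.0],
              [3000.0, 15.0],
              [6000.0, 30.0]])

mean = X.mean(axis=0)   # per-column mean
std = X.std(axis=0)     # per-column population standard deviation (ddof=0)

# Z-score: subtract the mean from each observation, divide by the std.
X_std = (X - mean) / std

print(X_std.mean(axis=0))  # approximately zero for each column
print(X_std.std(axis=0))   # 1 for each column
```

Because the two columns differ only by a change of units, their standardized values come out identical, which is the point of putting variables on a common scale.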

In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

In [11]:
print(data_scaled.mean(axis=0))
print(data_scaled.std(axis=0))

[ 7.27694974e-17 -2.88579741e-16 -4.23036761e-16]
[1. 1. 1.]


As expected, the mean of each variable is now approximately zero (up to floating-point error) and the standard deviation is 1. The variables are therefore on a comparable scale.

In [12]:
print('Min values (Loan Amount, Int rate and Installment): ', data_scaled.min(axis=0))
print('Max values (Loan Amount, Int rate and Installment): ', data_scaled.max(axis=0))

Min values (Loan Amount, Int rate and Installment):  [-1.68992326 -1.80898767 -1.72428535]
Max values (Loan Amount, Int rate and Installment):  [2.39995891 3.5928219  4.13103531]


However, the minimum and maximum values still differ between variables: they depend on how spread out each variable was to begin with, and they are strongly influenced by outliers.
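The effect of an outlier on the standardized range can be illustrated with a small sketch. The two arrays below are made up for illustration, and `zscore` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def zscore(x):
    """Standardize a 1-D array using the population std (ddof=0)."""
    return (x - x.mean()) / x.std()

# Hypothetical toy variables: identical except for one extreme value.
compact = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 50.0])

z_compact = zscore(compact)
z_outlier = zscore(with_outlier)

# Both have mean 0 and std 1, but their min/max values differ.
print(z_compact.min(), z_compact.max())
print(z_outlier.min(), z_outlier.max())
```

In the second array the outlier inflates both the mean and the standard deviation, so the maximum moves further from zero while the remaining values are squeezed closer together.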

### Normalization (MinMaxScaler):

In this approach, the data is scaled to a fixed range, usually 0 to 1.
In contrast to standardization, the cost of this bounded range is that we end up with smaller standard deviations. Because the minimum and maximum define the scale, a single extreme value compresses all other observations into a narrow part of the range. MinMaxScaler is therefore sensitive to outliers.

A Min-Max scaling is typically done via the following equation:

    X_norm = (X - X_min) / (X_max - X_min)
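As a rough illustration of this equation, the sketch below applies it by hand to a small made-up array containing one extreme value (the data is hypothetical, not from the loan dataset).

```python
import numpy as np

# Hypothetical toy variable with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 50.0])

# Min-max scaling: map the minimum to 0 and the maximum to 1.
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm)
```

The outlier is mapped to 1 and the minimum to 0, while the other values all land below 0.1, which shows how one extreme observation can compress the rest of the range.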

In [13]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)

In [14]:
print('means (Loan Amount, Int rate and Installment): ', data_scaled.mean(axis=0))
print('std (Loan Amount, Int rate and Installment): ', data_scaled.std(axis=0))

means (Loan Amount, Int rate and Installment):  [0.41319608 0.3348855  0.2944818 ]
std (Loan Amount, Int rate and Installment):  [0.24450582 0.18512315 0.17078484]


After min-max scaling, the distributions are not centered at zero and the standard deviations are not 1.

In [15]:
print('Min (Loan Amount, Int rate and Installment): ', data_scaled.min(axis=0))
print('Max (Loan Amount, Int rate and Installment): ', data_scaled.max(axis=0))

Min (Loan Amount, Int rate and Installment):  [0. 0. 0.]
Max (Loan Amount, Int rate and Installment):  [1. 1. 1.]


However, the minimum and maximum values are now identical across variables (0 and 1), unlike with standardization.

### Robust Scaling (RobustScaler, scaling to median and quantiles):

Scaling with the median and quantiles consists of subtracting the median from all observations and then dividing by the interquartile range (IQR). It scales features using statistics that are robust to outliers.

The interquartile range is the difference between the 75th and 25th percentiles:
    `IQR = 75th percentile - 25th percentile`

The equation to calculate scaled values: `X_scaled = (X - median(X)) / IQR`
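The equation can be sketched by hand with NumPy on a small made-up array (hypothetical values, not from the loan dataset). The 25th and 75th percentiles used here match RobustScaler's default `quantile_range=(25.0, 75.0)`.

```python
import numpy as np

# Hypothetical toy variable with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 50.0])

median = np.median(x)                   # 3.0
q25, q75 = np.percentile(x, [25, 75])   # 2.0 and 4.0
iqr = q75 - q25                         # 2.0

# Robust scaling: center on the median, divide by the IQR.
x_robust = (x - median) / iqr

print(x_robust)  # [-1.  -0.5  0.   0.5 23.5]
```

The median and the IQR are unaffected by how extreme the outlier is, so the four typical values keep a sensible spread. The outlier itself is not clipped; it simply remains far from the rest.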

In [16]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler() 
data_scaled = scaler.fit_transform(data)

In [17]:
print('means (Loan Amount, Int rate and Installment): ', data_scaled.mean(axis=0))
print('std (Loan Amount, Int rate and Installment): ', data_scaled.std(axis=0))

means (Loan Amount, Int rate and Installment):  [0.14627205 0.04134294 0.17367103]
std (Loan Amount, Int rate and Installment):  [0.70295424 0.70561432 0.78291238]


The means are not zero, because RobustScaler centers each variable on its median rather than its mean, and the standard deviations are not 1.

In [18]:
print('Min (Loan Amount, Int rate and Installment): ', data_scaled.min(axis=0))
print('Max (Loan Amount, Int rate and Installment): ', data_scaled.max(axis=0))

Min (Loan Amount, Int rate and Installment):  [-1.04166667 -1.23510467 -1.1762933 ]
Max (Loan Amount, Int rate and Installment):  [1.83333333 2.57648953 3.40790971]


Nor are the minimum and maximum values fixed to set lower and upper bounds, as they are with MinMaxScaler.