# Deep Learning Exam Part II

In [1]:
# importing libraries, modules and packages
import time
import copy
from io import BytesIO

from urllib.request import urlopen
from zipfile import ZipFile

import numpy as np
import pandas as pd

import plotly.express as px

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter
torch.set_default_dtype(torch.float)

from ray import tune

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline

# Customer and Task

This is a continuation of a job we performed for a bank which provided us with anonymised data of 1000 customers.

Our last objective was to investigate whether it's possible to create a statistical model that uses machine learning approaches to predict which customers will pay back their credit in full and which will default on their credit obligation. There we were able to predict a credit default with a balanced accuracy of 67.4%.

Based on this we were hired and funded again to continue this investigation using a neural network approach in order to more accurately predict defaulting customers.

In [2]:
# saving current time
start_time = time.time()

# Ingesting Data

In [3]:
# downloading the zip file
resp = urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/00522/SouthGermanCredit.zip')

# opening file and storing contents
zipfile = ZipFile(BytesIO(resp.read()))

# turning data into a table and storing as 'credit_data'
credit_data = pd.read_csv(zipfile.open('SouthGermanCredit.asc'), delimiter=' ')

## Inspecting data

In [4]:
# inspecting the first three rows
credit_data.head(3)

Unnamed: 0,laufkont,laufzeit,moral,verw,hoehe,sparkont,beszeit,rate,famges,buerge,...,verm,alter,weitkred,wohn,bishkred,beruf,pers,telef,gastarb,kredit
0,1,18,4,2,1049,1,2,4,2,1,...,2,21,3,1,1,3,2,1,2,1
1,1,9,4,0,2799,1,3,2,3,1,...,1,36,3,1,2,3,1,1,2,1
2,2,12,2,9,841,2,4,2,2,1,...,1,23,3,1,1,2,2,1,2,1


In [5]:
# inspecting the last three rows
credit_data.tail(3)

Unnamed: 0,laufkont,laufzeit,moral,verw,hoehe,sparkont,beszeit,rate,famges,buerge,...,verm,alter,weitkred,wohn,bishkred,beruf,pers,telef,gastarb,kredit
997,4,21,4,0,12680,5,5,4,3,1,...,4,30,3,3,1,4,2,2,2,0
998,2,12,2,3,6468,5,1,2,3,1,...,4,52,3,2,1,4,2,2,2,0
999,1,30,2,2,6350,5,5,4,3,1,...,2,31,3,2,1,3,2,1,2,0


The following elements up to 'Data Preparation for Machine Learning' are quoted from our previous work for this customer. These include:
- Describe Dataframe
- Variable Description
- Explorative Data Analysis

## Describe Dataframe

The dataframe has 21 columns, one of which is the target class. Of the 20 features, 17 are categorical and 3 are numerical.
The categorical attributes are already ordinally encoded, which spares us from having to do this during data preparation. So there is no need to incorporate the encoding into the preprocessing pipeline, though the data does require some manipulation for decent visualisation.

In [6]:
credit_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
laufkont,1000.0,2.577,1.257638,1.0,1.0,2.0,4.0,4.0
laufzeit,1000.0,20.903,12.058814,4.0,12.0,18.0,24.0,72.0
moral,1000.0,2.545,1.08312,0.0,2.0,2.0,4.0,4.0
verw,1000.0,2.828,2.744439,0.0,1.0,2.0,3.0,10.0
hoehe,1000.0,3271.248,2822.75176,250.0,1365.5,2319.5,3972.25,18424.0
sparkont,1000.0,2.105,1.580023,1.0,1.0,1.0,3.0,5.0
beszeit,1000.0,3.384,1.208306,1.0,3.0,3.0,5.0,5.0
rate,1000.0,2.973,1.118715,1.0,2.0,3.0,4.0,4.0
famges,1000.0,2.682,0.70808,1.0,2.0,3.0,3.0,4.0
buerge,1000.0,1.145,0.477706,1.0,1.0,1.0,1.0,3.0


## Variable Description

A description of each feature. Categorical features have a list of possible values with their respective definition:

**laufkont** : status of the debtor's checking account with the bank (categorical)

1 : no checking account                             
2 : ... < 0 DM                                      
3 : 0<= ... < 200 DM                                
4 : ... >= 200 DM / salary for at least 1 year      

**laufzeit** : credit duration in months (quantitative)


**moral** : history of compliance with previous or concurrent credit contracts (categorical)

0 : delay in paying off in the past                 
1 : critical account/other credits elsewhere        
2 : no credits taken/all credits paid back duly     
3 : existing credits paid back duly till now        
4 : all credits at this bank paid back duly         

**verw** : purpose for which the credit is needed (categorical)

0 : others                                          
1 : car (new)                                       
2 : car (used)                                      
3 : furniture/equipment                             
4 : radio/television                                
5 : domestic appliances                             
6 : repairs                                         
7 : education                                       
8 : vacation                                        
9 : retraining                                         
10 : business                                          

**hoehe** : credit amount in DM (quantitative; result of monotonic transformation; actual data and type of transformation unknown)


**sparkont** : debtor's savings (categorical)
                                                            
1 : unknown/no savings account                                
2 : ... <  100 DM             
3 : 100 <= ... <  500 DM      
4 : 500 <= ... < 1000 DM      
5 : ... >= 1000 DM            

**beszeit** : duration of debtor's employment with current employer (ordinal; discretized quantitative)                               
                    
1 : unemployed      
2 : < 1 yr          
3 : 1 <= ... < 4 yrs                                
4 : 4 <= ... < 7 yrs                                
5 : >= 7 yrs        

**rate** : credit installments as a percentage of debtor's disposable income (ordinal; discretized quantitative)                               
                
1 : >= 35         
2 : 25 <= ... < 35                                
3 : 20 <= ... < 25                                
4 : < 20          

**famges** : combined information on sex and marital status (categorical); sex cannot be recovered from the variable, because male singles and female non-singles are coded with the same code (2); female widows cannot be easily classified, because the code table does not list them in any of the female categories
                                        
1 : male : divorced/separated           
2 : female : non-single or male : single                                
3 : male : married/widowed              
4 : female : single                     

**buerge** : is there another debtor or a guarantor for the credit? (categorical)                                
                
1 : none        
2 : co-applicant                                
3 : guarantor   

**wohnzeit** : length of time (in years) the debtor lives in the present residence (ordinal; discretized quantitative)
                    
1 : < 1 yr          
2 : 1 <= ... < 4 yrs                                
3 : 4 <= ... < 7 yrs                                
4 : >= 7 yrs        

**verm** : the debtor's most valuable property, i.e. the highest possible code is used. Code 2 is used, if codes 3 or 4 are not applicable and there is a car or any other relevant property that does not fall under variable sparkont. (ordinal)
                                            
1 : unknown / no property                    
2 : car or other                             
3 : building soc. savings agr./life insurance                                
4 : real estate                              

**alter** : age in years (quantitative)                                
    

**weitkred** : installment plans from providers other than the credit-giving bank (categorical)                                
        
1 : bank  
2 : stores                                
3 : none  

**wohn** : type of housing the debtor lives in (categorical)
            
1 : for free                                
2 : rent    
3 : own     

**bishkred** : number of credits including the current one the debtor has (or had) at this bank (ordinal, discretized quantitative)
        
1 : 1   
2 : 2-3 
3 : 4-5 
4 : >= 6                                

**beruf** : quality of debtor's job (ordinal)                                
                                            
1 : unemployed/unskilled - non-resident       
2 : unskilled - resident                      
3 : skilled employee/official                 
4 : manager/self-empl./highly qualif. employee                                

**pers** : number of persons who financially depend on the debtor (i.e., are entitled to maintenance) (binary, discretized quantitative)                                
            
1 : 3 or more                                
2 : 0 to 2   
                                
**telef** : is there a telephone landline registered in the debtor's name? (binary)
                            
1 : no                       
2 : yes (under customer name)                                

**gastarb** : is the debtor a foreign worker? (binary)                                
    
1 : yes                                
2 : no                                 

**kredit** : has the credit contract been complied with 'good' or not 'bad'? (binary)                                
        
0 : bad                                 
1 : good                                

## Explorative Data Analysis
### Data Preparation for Visualisation

Before we dive into the data exploration, we prepare a dataframe that has meaningful column names.

In [7]:
# creating a copy of the dataframe to be used for visualisation
df_viz = credit_data.copy()

# renaming columns for better visualisation
df_viz = df_viz.rename(columns={
    'laufkont': 'Account Status',
    'laufzeit': 'Contract Duration',
    'moral': 'Credit History',
    'verw': 'Purpose',
    'hoehe': 'Credit Amount',
    'sparkont': 'Savings',
    'beszeit': 'Employment Duration',
    # 'weitkred' (not 'bishkred') holds the installment plans from other providers
    'weitkred': 'Other Installment Plans',
    'kredit': 'Credit Compliance',
    'alter': 'Age'})

# turning the binary target codes into the strings 'Good' and 'Bad'
df_viz[['Credit Compliance']] = df_viz[['Credit Compliance']].replace({0: 'Bad', 1: 'Good'})

# turning the account status codes into their textual descriptions
df_viz[['Account Status']] = df_viz[['Account Status']].replace({1: 'no checking account', 2: '... < 0 DM', 3: '0<= ... < 200 DM', 4: '... >= 200 DM / salary for at least 1 year'})

# turning the values of the categorical columns into strings
categorical_columns = ['Account Status', 'Credit History', 'Purpose',
                       'Savings', 'Employment Duration', 'Installment Rate',
                       'Other Installment Plans', 'Credit Compliance']
df_viz[categorical_columns] = df_viz[categorical_columns].astype(str)

In [8]:
# calculating correlations
corr = credit_data.corr()
corr = corr.mask(np.tril(np.ones(corr.shape, dtype=np.bool_)))

# plotting correlation matrix
px.imshow(corr, width=900, height=900, title='Correlation Matrix', text_auto=True)

The matrix shows that the pairwise correlations range from -0.2 to 0.62.

The most strongly correlated features are credit amount 'hoehe' and credit duration 'laufzeit': 0.62
- this correlation isn't surprising, since higher credit amounts need to be paid back over a longer time to keep monthly payments down

The features credit compliance 'kredit' and credit duration 'laufzeit' have a correlation of: -0.2
- this means that the chance of a credit default increases slightly with increasing credit duration

The feature most strongly correlated with credit compliance is the account status 'laufkont': 0.35
- how much is in a customer's account and whether they have a regular income has a slight impact on the chance of a credit default
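
The masking step used for the correlation plot can be illustrated on a tiny matrix. The sketch below uses NumPy only and made-up numbers (the array `toy` is hypothetical and not taken from the credit data). It hides the diagonal and the lower triangle so that each pair of variables appears only once.

```python
import numpy as np

# Hypothetical toy data: 5 observations of 3 variables (not the credit data)
toy = np.array([
    [1.0, 2.0, 9.0],
    [2.0, 4.1, 7.0],
    [3.0, 6.2, 8.0],
    [4.0, 7.9, 3.0],
    [5.0, 10.0, 1.0],
])

# Pearson correlation between the columns
corr = np.corrcoef(toy, rowvar=False)

# True on the diagonal and below it, False above it
hide = np.tril(np.ones(corr.shape, dtype=bool))

# Keep only the upper triangle; hidden cells become NaN
upper_only = np.where(hide, np.nan, corr)
print(np.round(upper_only, 2))
```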



### Explorative Data Analysis of Numerical Variables

In [9]:
px.box(df_viz, x='Credit Compliance', y='Contract Duration', title='Distribution of Credit Compliance vs. Contract Duration')

The contract duration of bad credits is higher on average than that of good credits.

Comparing the distribution of both classes of credit completion against the duration of the credit contract, we find:
- the average duration of credit contracts that weren't completed is 6 months longer than that of contracts that were completed
- the interquartile range of the completed contracts is 12 months, while that of the contracts that weren't completed is 24 months
- the 3rd quartile is 24 months for good credits and 36 months for bad credits
- it appears that customers with longer contracts have a higher chance of defaulting on their credit
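
The quantities read off a box plot (median, quartiles, interquartile range) can also be computed directly. The following is a minimal sketch with hypothetical durations chosen for illustration only; the helper `box_stats` is not part of the notebook, and plotting libraries may use a slightly different quartile rule than `np.percentile`.

```python
import numpy as np

def box_stats(values):
    """Return median, first quartile, third quartile and IQR of a sample."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "median": float(median),
        "q1": float(q1),
        "q3": float(q3),
        "iqr": float(q3 - q1),
    }

# Hypothetical contract durations in months (not the credit data)
good = [6, 12, 12, 18, 24]
bad = [12, 24, 24, 36, 48]

stats_good = box_stats(good)
stats_bad = box_stats(bad)
print(stats_good)
print(stats_bad)
```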

In [10]:
px.bar(df_viz, x='Account Status', title='Distribution of Credit Compliance vs. Account Status', color='Credit Compliance', barmode='group', labels={'count': 'Instances'})

Of the customers that complied with their credit obligations, most have the account status '... >= 200 DM / salary for at least 1 year', while most of the customers who did not comply either have no checking account or a negative balance. Interestingly, those with '0<= ... < 200 DM' have the lowest default rate compared to the other groups.

In [11]:
px.histogram(df_viz, x='Credit Amount', title='Distribution of Credit Amount')

The histogram of the credit amount is skewed right and shows that the majority of contracts are in the range of 500 to 4000.

In [12]:
px.box(df_viz, x='Credit Compliance', y='Credit Amount', title='Distribution of Credit Compliance vs. Credit Amount')

Comparing the distribution of both classes of credit completion against the total amount of the credit contract, we find:
- the mean for bad credit contracts is just slightly above that of the good credit contracts
- the maximum value of bad credit contracts is also much higher than that of good credit contracts
- again it appears that customers with larger contracts have a higher chance of defaulting on their credit

In [13]:
px.box(df_viz, x='Credit Compliance', y='Age', title='Distribution of Credit Compliance vs. Age of debtor')

Comparing good and bad credit completion against the age of the customer yields no significant difference.
- it appears that age doesn't have an influence on whether a customer pays their credit back in full

### Calculating Class Imbalance Ratio

First we are going to split the two classes into separate tables and calculate the imbalance ratio.

In [14]:
# selecting those that complied with their credit obligation
credit_good = credit_data[credit_data['kredit'] == 1]

# selecting those that defaulted on their credit obligation
credit_bad = credit_data[credit_data['kredit'] == 0]

In [15]:
# calculating the imbalance ratio
imbalance_ratio = credit_bad.shape[0] / credit_good.shape[0]

# printing the imbalance ratio in percentage
print(f'{round((imbalance_ratio*100),2)}%')

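# printing the class ratio as good : bad, normalised to the bad class
print(f'{(credit_good.shape[0]/credit_bad.shape[0])} : {round((credit_bad.shape[0]/credit_bad.shape[0]),3)}')

# printing the class ratio as good : bad, normalised to the good class
print(f'{(credit_good.shape[0]/credit_good.shape[0])} : {round((credit_bad.shape[0]/credit_good.shape[0]),3)}')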

42.86%
2.3333333333333335 : 1.0
1.0 : 0.429


In [16]:
px.pie(df_viz, names='Credit Compliance', title='Credit Compliance', width=600)

The classes in the dataset have a split of 70:30, which corresponds to an imbalance ratio of 42.86%.

This means that for every customer that complies with their credit obligation there are 0.43 customers that do not comply.

In other words, for roughly every 2.3 customers that pay back their loan there is one customer that defaults on their loan.
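
The different ways of expressing the imbalance can be derived from the two class counts alone. The sketch below uses the counts behind the 70:30 split (700 good and 300 bad credits); the helper name `imbalance_summary` is hypothetical.

```python
def imbalance_summary(n_majority, n_minority):
    """Summarise a two-class split by class shares and class ratios."""
    total = n_majority + n_minority
    return {
        "majority_share": n_majority / total,
        "minority_share": n_minority / total,
        "minority_per_majority": n_minority / n_majority,
        "majority_per_minority": n_majority / n_minority,
    }

# 700 good and 300 bad credits, i.e. the 70:30 split of the 1000 customers
summary = imbalance_summary(700, 300)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```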

# Experiment Preparation
## Data Preparation for Machine Learning

In [17]:
# separating and storing class labels from original data
target = credit_data[['kredit']]

# preparing target data for machine learning
target = target.to_numpy()
target = target.reshape(-1)

# separating and storing features from original data
data = credit_data.drop(['kredit'], axis=1)

# preparing feature data for machine learning
data = data.to_numpy()

## Split into Test and Training Data

In [18]:
# splitting all data into a train_full and a test set
X_train_full, X_test, y_train_full, y_test = train_test_split(data, target, test_size=0.4, shuffle=True, random_state=1)

In [19]:
# splitting train_full data set into a training and validation data
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.3, shuffle=True, random_state=1)

In [20]:
# printing dimensions for each data set
for i in (X_train, X_val, X_test, y_train, y_val, y_test):
    print(i.shape)

(420, 20)
(180, 20)
(400, 20)
(420,)
(180,)
(400,)
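
The row counts printed above follow from applying the two fractional splits one after the other. The sketch below reproduces the arithmetic in plain Python. The helper `split_sizes` is hypothetical, and it assumes that the test count is rounded up and the remainder goes to training, which is how `train_test_split` is understood to behave when only `test_size` is given.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the remaining
    samples are used for training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train_full, n_test = split_sizes(1000, 0.4)      # first split: train_full / test
n_train, n_val = split_sizes(n_train_full, 0.3)    # second split: train / validation
print(n_train, n_val, n_test)
```

In total, 42% of the customers are used for training, 18% for validation and 40% for testing.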


## Scale the data

In [21]:
# assigning scalers
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()

In [22]:
# fitting scaler on the features of the full training set (training + validation rows)
x_scaler.fit(X_train_full)

# scaling training, validation and test features
X_train_sc = x_scaler.transform(X_train)
X_val_sc = x_scaler.transform(X_val)
X_test_sc = x_scaler.transform(X_test)
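
What fitting and transforming means for a min-max scaler can be shown in a few lines of NumPy. This is a simplified sketch with hypothetical numbers, not the scikit-learn implementation; the helper names `fit_min_max` and `apply_min_max` are made up for illustration. It also shows that data not seen during fitting can end up outside the range [0, 1].

```python
import numpy as np

def fit_min_max(reference):
    """Learn the per-column minimum and range from the reference data."""
    col_min = reference.min(axis=0)
    col_range = reference.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid division by zero for constant columns
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data using statistics that were learned on other data."""
    return (data - col_min) / col_range

# Hypothetical values: contract duration in months and credit amount
reference = np.array([[6.0, 500.0], [12.0, 1500.0], [24.0, 2500.0]])
unseen = np.array([[30.0, 250.0]])

col_min, col_range = fit_min_max(reference)
reference_sc = apply_min_max(reference, col_min, col_range)
unseen_sc = apply_min_max(unseen, col_min, col_range)
print(reference_sc)
print(unseen_sc)   # values outside [0, 1] are possible here
```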

In [23]:
# reshaping all targets into column vectors for y_scaler
y_train_full = y_train_full.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
y_val = y_val.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

# fitting scaler on training targets
y_scaler.fit(y_train_full)

# scaling training, validation and test targets
y_train_sc = y_scaler.transform(y_train)
y_val_sc = y_scaler.transform(y_val)
y_test_sc = y_scaler.transform(y_test)

## Create Tensors for PyTorch

In [24]:
# creating tensors of features
X_train_tensor = torch.Tensor(X_train_sc)
X_val_tensor = torch.Tensor(X_val_sc)
X_test_tensor = torch.Tensor(X_test_sc)

# creating tensors of targets
y_train_tensor = torch.Tensor(y_train_sc)
y_val_tensor = torch.Tensor(y_val_sc)
y_test_tensor = torch.Tensor(y_test_sc)

# A Binary Classifier
## Define an Artificial Neural Network Architecture

In [25]:
class FlexibleMultiLayerPerceptron(nn.Module):
    def __init__(self, architecture, input_size):
        super(FlexibleMultiLayerPerceptron, self).__init__()
        
        self.architecture = architecture

        in_sizes = [out_size for out_size, activation_fn in architecture]
        in_sizes = [input_size] + in_sizes[:-1]

        self.layers = nn.ModuleList()
        self.activation_fns = []
        for i, (in_size, (out_size, activation_fn)) in enumerate(zip(in_sizes, architecture)):
            layer = nn.Linear(in_size, out_size)
            self.layers.append(layer)
            self.activation_fns.append(activation_fn)

    def forward(self, x):
        out = x
        for layer, activation_fn in zip(self.layers, self.activation_fns):
            out = activation_fn(layer(out))
        return out

    def short_name(self):
        res = ''
        for outsize, activation_func in self.architecture:
            if isinstance(activation_func, nn.ReLU):
                res += 'R'
            elif isinstance(activation_func, nn.Sigmoid):
                res += 'S'
            elif isinstance(activation_func, nn.Tanh):
                res += 'T'
            else:
                raise NotImplementedError()
            res += str(outsize)
            res += ', '
        return res[:-2]

In [26]:
# runs the model on the given features X in evaluation mode, without gradient tracking
# returns the predictions
def predict(model, X):
    model.eval()
    with torch.no_grad():
        return model(X)         # calls FlexibleMultiLayerPerceptron.forward(X)

# counts the number of trainable parameters of the model
def count_params(model):
    num_params=sum([p.numel() for p in model.parameters() if p.requires_grad])
    return num_params
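
For a stack of fully connected layers the number of trainable parameters can be cross-checked by hand: an `nn.Linear` layer with bias holds one weight per input-output pair plus one bias per output unit. The sketch below is a plain-Python illustration with the hypothetical helper `count_dense_params`; it does not use PyTorch.

```python
def count_dense_params(input_size, widths):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    fan_in = input_size
    for width in widths:
        total += (fan_in + 1) * width   # weights plus one bias per unit
        fan_in = width                  # this layer feeds the next one
    return total

# 20 input features, as in the credit data
single_unit = count_dense_params(20, [1])          # one sigmoid unit
one_hidden = count_dense_params(20, [8, 1])        # one hidden layer of 8 units
two_hidden = count_dense_params(20, [16, 16, 1])   # two hidden layers of 16 units
print(single_unit, one_hidden, two_hidden)
```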

## Naming scheme for NN architecture

In the following experiments we describe each layer of a neural network architecture by a letter and a number. The number gives the layer's output size (its width, i.e. the number of units), while the letter gives the kind of activation function used:
- R: ReLU
- T: Tanh
- S: Sigmoid
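
As a minimal sketch of this naming scheme, the snippet below builds such a label from a list of layers. It is a simplified stand-in: layers are given as hypothetical `(width, activation_name)` pairs with string names rather than PyTorch activation modules.

```python
# Hypothetical mapping from activation names to the letters of the scheme
LETTERS = {"relu": "R", "tanh": "T", "sigmoid": "S"}

def short_name(layers):
    """Build a label such as 'R16, R16, S1' from (width, activation) pairs."""
    parts = []
    for width, activation in layers:
        if activation not in LETTERS:
            raise NotImplementedError(f"no letter defined for {activation!r}")
        parts.append(f"{LETTERS[activation]}{width}")
    return ", ".join(parts)

label = short_name([(16, "relu"), (16, "relu"), (1, "sigmoid")])
print(label)
```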

# Train the model

In [27]:
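# setting class weights: total number of samples divided by the class count,
# so the rarer class (0 = bad) receives the larger weight
weights = torch.FloatTensor([
    len(target)/sum(target==0),
    len(target)/sum(target==1)
    ])

# assigning training criterion (one weight per training sample)
criterion_train = nn.BCELoss(weight=weights[y_train])

# assigning validation criterion (one weight per validation sample)
criterion_val = nn.BCELoss(weight=weights[y_val])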
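
The effect of the per-sample weights on the loss can be sketched in NumPy. This is an illustration under stated assumptions, not the PyTorch implementation: the helpers `class_weights` and `weighted_bce` are hypothetical, probabilities are clipped to avoid `log(0)`, and the weighted per-sample losses are simply averaged.

```python
import numpy as np

def class_weights(labels):
    """Weight each class by the total count divided by the class count."""
    labels = np.asarray(labels)
    return np.array([len(labels) / np.sum(labels == c) for c in (0, 1)])

def weighted_bce(probs, labels, weights):
    """Mean binary cross-entropy with one weight per sample."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-7, 1 - 1e-7)
    labels = np.asarray(labels)
    per_sample = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return float(np.mean(weights[labels] * per_sample))

# Hypothetical mini-batch: three good credits (1) and one bad credit (0)
labels = np.array([1, 1, 1, 0])
weights = class_weights(labels)           # the rarer class gets the larger weight
probs = np.array([0.9, 0.8, 0.7, 0.6])    # predicted probability of class 1
loss = weighted_bce(probs, labels, weights)
print(weights, loss)
```

Because the weights are larger than 1, a weighted loss of this kind is not directly comparable to an unweighted binary cross-entropy.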

In [28]:
def evaluate_architecture(
        architecture,
        learning_rate=0.01,
        weight_decay=0,         # L2 regularization
        epochs=300,
        tune_hyperparams=False
        ):
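    """
    Evaluates the performance of a given neural network architecture.

    Args:
        architecture: the architecture of the model to be trained, given as a
            sequence of (output size, activation function) tuples.
        learning_rate: the learning rate to use for training the model. Has a default value of 0.01.
        weight_decay: a coefficient for L2 regularization of the model weights. Has a default value of 0.
        epochs: defines the number of learning epochs. Has a default value of 300.
        tune_hyperparams: if True, TensorBoard logging and printing are skipped
            and the best balanced accuracy is reported to Ray Tune instead.

    Returns:
        Short name of architecture
        Number of learned parameters
        Number of best epoch
        Validation loss at the best epoch
        Best balanced validation accuracy
        The model with the weights of the best epoch loaded
    """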
    torch.manual_seed(1)
    model = FlexibleMultiLayerPerceptron(architecture, 20)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    
    writer = SummaryWriter()

    best_bacc_val=float(0)
    best_loss_val=float('inf')
    best_model_state=copy.deepcopy(model.state_dict())
    best_epoch=0

    for epoch in range(epochs):
        model.train()

        optimizer.zero_grad()

        y_train_pred = model(X_train_tensor)

        loss_train = criterion_train(y_train_pred, y_train_tensor)
        loss_train.backward()
        optimizer.step()
        
        model.eval()                # switches the model to evaluation mode; gradient tracking is disabled by torch.no_grad() below
        with torch.no_grad():
            y_val_pred = model(X_val_tensor)
            loss_val = criterion_val(y_val_pred, y_val_tensor)
            
            if not tune_hyperparams:
                writer.add_scalars(f'arch_{model.short_name()}/loss', {'train': loss_train.item(), 'val': loss_val.item()}, epoch)

            bacc_train = balanced_accuracy_score(y_train_tensor, np.round(y_scaler.inverse_transform(y_train_pred.detach().numpy())))
            bacc_val = balanced_accuracy_score(y_val_tensor, np.round(y_scaler.inverse_transform(y_val_pred.detach().numpy())))
            
            if not tune_hyperparams:
                writer.add_scalars(f'arch_{model.short_name()}/accuracy', {'train': bacc_train, 'val': bacc_val}, epoch)
                writer.flush()

            if bacc_val > best_bacc_val:
                best_bacc_val = bacc_val
                best_loss_val = loss_val.item()
                best_model_state = copy.deepcopy(model.state_dict())
                best_epoch = epoch

    writer.close()

    with torch.no_grad():
        model.load_state_dict(best_model_state)

    if tune_hyperparams:
        tune.report(balanced_accuracy=best_bacc_val)

    architecture = f'{model.short_name()}'
    n_params = f'{count_params(model)}'

    if not tune_hyperparams:
        print(f'Architecture: {architecture}')
        print(f'Learned Parameters: {n_params}')
        print(f'Best epoch: {best_epoch}')
        print(f'Best val. loss: {np.round(best_loss_val, 4)}')
        print(f'Best bal. val. acc.: {np.round(best_bacc_val, 4)}')

    return architecture, n_params, best_epoch, best_loss_val, best_bacc_val, model
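
The model selection inside the training loop boils down to remembering the epoch with the highest validation score. The sketch below isolates that logic in plain Python with hypothetical scores; the helper `best_epoch` is not part of the notebook and starts from negative infinity rather than 0.

```python
def best_epoch(scores):
    """Return (epoch, score) of the first epoch with the highest score."""
    best_idx, best_score = 0, float("-inf")
    for epoch, score in enumerate(scores):
        if score > best_score:   # strict comparison keeps the earliest best epoch
            best_idx, best_score = epoch, score
    return best_idx, best_score

# Hypothetical balanced validation accuracies, one per epoch
history = [0.61, 0.70, 0.74, 0.74, 0.72]
print(best_epoch(history))
```

Because the comparison is strict, a later epoch that merely ties the best score does not replace the earlier one.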

In [29]:
def add_result(results, architecture, n_params, best_epoch, best_loss_val, best_bacc_val, model):
    """
    Takes a list of results and adds the attributes returned by the evaluate_architecture function.

    Args:
        results: list collecting one result row (dict) per evaluated architecture
        architecture: short name of the used architecture
        n_params: number of learned parameters
        best_epoch: number of the best epoch
        best_loss_val: validation loss at the best epoch
        best_bacc_val: best balanced validation accuracy
        model: the trained model (accepted so that the return value of
            evaluate_architecture can be unpacked directly; it is not stored)
    """
    row = {
        'architecture': architecture,
        'n_params': n_params,
        'best epoch': best_epoch,
        'bal val acc': np.round(best_bacc_val, 4),
        'best val loss': np.round(best_loss_val, 4)}
    
    results.append(row)

# instantiating an empty list to store results of classification runs
results = []

# Comparison of different architectures
## Trivial Baseline
To create a trivial baseline for comparison, we apply a neural network consisting of a single layer with a single unit, using Sigmoid as the activation function. This performs a basic logistic regression.
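
Why a single sigmoid unit amounts to logistic regression can be seen from a minimal sketch: the unit computes a weighted sum of the features plus a bias and squashes it into a probability. The weights and features below are hypothetical, and `logistic_unit` is an illustrative helper rather than part of the notebook.

```python
import math

def logistic_unit(features, weights, bias):
    """One linear unit followed by a sigmoid, i.e. the logistic regression model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for two scaled features
p = logistic_unit([0.5, 0.2], weights=[1.0, -2.0], bias=0.1)
print(round(p, 4))
```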

In [30]:
res = evaluate_architecture((
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

Architecture: S1
Learned Parameters: 21
Best epoch: 126
Best val. loss: 1.0838
Best bal. val. acc.: 0.7426


A single-layer network with a single output unit using the sigmoid activation function reaches a balanced accuracy of 74%.
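
Balanced accuracy is the average of the recall obtained on each class, which makes it robust against class imbalance. The sketch below is a plain-Python illustration for binary labels with hypothetical data; the notebook itself relies on `balanced_accuracy_score` from scikit-learn.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the recall obtained on each class (binary labels 0 and 1)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical labels: 7 good credits (1) and 3 bad credits (0)
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
always_good = [1] * 10                       # plain accuracy would be 0.7
mixed = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]       # one error among the good, one among the bad

print(balanced_accuracy(y_true, always_good))
print(balanced_accuracy(y_true, mixed))
```

A model that always predicts the majority class therefore scores no better than chance on this metric, even though its plain accuracy looks acceptable.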

# Architecture Optimization
## Testing increasing width

We are going to add a hidden layer with eight units using the ReLU activation function, before collapsing the structure to a single output unit using the sigmoid activation function.

In [31]:
res =  evaluate_architecture((
    (8, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7426, 4)}')

Architecture: R8, S1
Learned Parameters: 177
Best epoch: 104
Best val. loss: 1.0221
Best bal. val. acc.: 0.7661

Change: 0.0235


With an additional layer we are able to increase balanced accuracy by 0.0235.

We shall test if doubling the width of the first layer increases balanced accuracy further.

In [32]:
res = evaluate_architecture((
    (16, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7661, 4)}')

Architecture: R16, S1
Learned Parameters: 353
Best epoch: 123
Best val. loss: 1.061
Best bal. val. acc.: 0.772

Change: 0.0059


Doubling the width of the first layer improves balanced accuracy by 0.0059 while roughly doubling the number of learnable parameters from 177 to 353.

By again doubling the width of the first layer to 32, we investigate whether the previous trend of diminishing returns continues.

In [33]:
res = evaluate_architecture((
    (32, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.772, 4)}')

Architecture: R32, S1
Learned Parameters: 705
Best epoch: 46
Best val. loss: 1.0563
Best bal. val. acc.: 0.7738

Change: 0.0018


A further doubling of the first layer to a width of 32 increases balanced accuracy by only 0.0018.

We are going to continue doubling the width of the first layer, making it 64.

In [34]:
res = evaluate_architecture((
    (64, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7738, 4)}')

Architecture: R64, S1
Learned Parameters: 1409
Best epoch: 4
Best val. loss: 1.2112
Best bal. val. acc.: 0.7658

Change: -0.008


Compared to the best result so far (width = 32), using a width of 64 decreases balanced accuracy by 0.008.

Finally we are going to test a first layer with a width of 128.

Note, however, that with every doubling of the width of the first layer we also roughly double the number of learnable parameters, thereby increasing runtimes and cost.

In [35]:
res = evaluate_architecture((
    (128, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7738, 4)}')

Architecture: R128, S1
Learned Parameters: 2817
Best epoch: 25
Best val. loss: 1.0442
Best bal. val. acc.: 0.7661

Change: -0.0077


Even a first layer with a width of 128 yields a balanced accuracy 0.0077 worse than a first layer with a width of 32.

The best configuration so far is: R32, S1

Given these findings we are going to explore the impact of increasing network depth. For this we keep a similar number of learnable parameters by implementing two layers with a width of 16 instead of a single layer with a width of 32.

The best balanced accuracy on the validation set so far is: 77.38%

## Testing increasing depth

In [36]:
res = evaluate_architecture((
    (16, nn.ReLU()),
    (16, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7738, 4)}')

Architecture: R16, R16, S1
Learned Parameters: 625
Best epoch: 42
Best val. loss: 1.0656
Best bal. val. acc.: 0.7756

Change: 0.0018


Increasing network depth by one layer yielded an improvement in balanced accuracy of 0.0018.

This makes the best result so far: 77.56%

In the next step we are doubling the width of the first two layers.

In [37]:
res = evaluate_architecture((
    (32, nn.ReLU()),
    (32, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7756, 4)}')

Architecture: R32, R32, S1
Learned Parameters: 1761
Best epoch: 46
Best val. loss: 1.0333
Best bal. val. acc.: 0.7738

Change: -0.0018


Doubling the width of the first two layers from 16 to 32 actually decreased balanced accuracy by 0.0018.

Nonetheless we are going to see what happens if we double the width again.

In [38]:
res = evaluate_architecture((
    (64, nn.ReLU()),
    (64, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7756, 4)}')

Architecture: R64, R64, S1
Learned Parameters: 5569
Best epoch: 21
Best val. loss: 1.0633
Best bal. val. acc.: 0.7681

Change: -0.0075


Compared to the best result so far, which came from two layers of 16, using two layers of 64 decreases balanced accuracy by 0.0075.

In the next step we are going to test two layers with a width of 128.

In [39]:
res = evaluate_architecture((
    (128, nn.ReLU()),
    (128, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7756, 4)}')

Architecture: R128, R128, S1
Learned Parameters: 19329
Best epoch: 17
Best val. loss: 1.0437
Best bal. val. acc.: 0.7544

Change: -0.0212


Compared to the best result so far, which came from two layers of 16, using two layers of 128 decreases balanced accuracy by 0.0212.

Since the decrease in accuracy is getting larger, we won't explore widening the three-layer structure any further. Instead we will investigate whether adding a fourth layer to the network has a positive effect on the balanced accuracy.

## Even more depth

So far, the best results were achieved using two layers of 16 and a single-unit output layer.

We are going to test the effect of a third layer of 16.

In [40]:
res = evaluate_architecture((
    (16, nn.ReLU()),
    (16, nn.ReLU()),
    (16, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7756, 4)}')

Architecture: R16, R16, R16, S1
Learned Parameters: 897
Best epoch: 18
Best val. loss: 1.0951
Best bal. val. acc.: 0.7777

Change: 0.0021


The new third layer with a width of 16 increased balanced accuracy by 0.0021.

Best architecture so far is: R16, R16, R16, S1

This yields a balanced accuracy on validation set of: 77.77%

Next, we are going to double the width of the layers from 16 to 32.

In [41]:
res = evaluate_architecture((
    (32, nn.ReLU()),
    (32, nn.ReLU()),
    (32, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7777, 4)}')

Architecture: R32, R32, R32, S1
Learned Parameters: 2817
Best epoch: 27
Best val. loss: 1.081
Best bal. val. acc.: 0.7658

Change: -0.0119


Going from three layers of 16 to three layers of 32 decreased the balanced accuracy by 0.0119.

The network architecture of R16, R16, R16, S1 yielded a balanced accuracy of 77.77% with 897 learnable parameters, which beats the performances of the architectures with three layers in total.
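
The parameter counts reported for each run follow from the layer widths: every fully connected layer contributes `inputs * outputs` weights plus `outputs` biases. A minimal sketch, assuming the network receives the 20 features of the data set as input; `count_parameters` is a hypothetical helper and not a function from the notebook:

```python
def count_parameters(n_inputs, widths):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for width in widths:
        total += n_inputs * width + width  # weight matrix plus bias vector
        n_inputs = width                   # this layer feeds the next one
    return total

# assumption: 20 input features
print(count_parameters(20, [16, 16, 1]))      # R16, R16, S1      -> 625
print(count_parameters(20, [16, 16, 16, 1]))  # R16, R16, R16, S1 -> 897
print(count_parameters(20, [32, 32, 32, 1]))  # R32, R32, R32, S1 -> 2817
```

Under this assumption the sketch reproduces the reported 625, 897 and 2817 learned parameters.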

In the next step we are going to increase the width from 32 to 64.

In [42]:
res = evaluate_architecture((
    (64, nn.ReLU()),
    (64, nn.ReLU()),
    (64, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7777, 4)}')

Architecture: R64, R64, R64, S1
Learned Parameters: 9729
Best epoch: 25
Best val. loss: 1.0834
Best bal. val. acc.: 0.7604

Change: -0.0173


Increasing the width of the first three layers to 64 leaves the balanced accuracy 0.0173 below the best result so far.

The architecture R32, R32, R32, S1 produced a better result than R64, R64, R64, S1, therefore we are going to test an intermediate step with three layers of width 42.

In [43]:
res = evaluate_architecture((
    (42, nn.ReLU()),
    (42, nn.ReLU()),
    (42, nn.ReLU()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7777, 4)}')

Architecture: R42, R42, R42, S1
Learned Parameters: 4537
Best epoch: 10
Best val. loss: 1.0615
Best bal. val. acc.: 0.7836

Change: 0.0059


Three layers of width 42 increased the balanced accuracy by 0.0059 over the previous best result, to 78.36%.

Since a width of 42 yielded the best results so far, in the next trial we are going to keep it and change the activation functions of all but the first layer from ReLU to Sigmoid.

## Optimizing activation functions

In [44]:
res = evaluate_architecture((
    (42, nn.ReLU()),
    (42, nn.Sigmoid()),
    (42, nn.Sigmoid()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7836, 4)}')

Architecture: R42, S42, S42, S1
Learned Parameters: 4537
Best epoch: 37
Best val. loss: 1.0745
Best bal. val. acc.: 0.756

Change: -0.0276


Changing the activation functions as described did not improve the accuracy.

Next, we are going to try an architecture of two layers of width 42 using ReLU as activation function, one layer of width 42 using Tanh as activation function and a single-width output layer using Sigmoid as activation function.

In [45]:
res = evaluate_architecture((
    (42, nn.ReLU()),
    (42, nn.ReLU()),
    (42, nn.Tanh()),
    (1, nn.Sigmoid()),),
    epochs=600)

# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.7836, 4)}')

Architecture: R42, R42, T42, S1
Learned Parameters: 4537
Best epoch: 3
Best val. loss: 1.2431
Best bal. val. acc.: 0.8012

Change: 0.0176


The architecture R42, R42, T42, S1 yields a balanced accuracy of 80.12%, which is 1.76 percentage points better than the previous best architecture.

Changing the activation function of the third layer worked in our favour, so we are going to test whether doing the same for the second layer helps as well.

In [46]:
res = evaluate_architecture((
        (42, nn.ReLU()),
        (42, nn.Tanh()),
        (42, nn.Tanh()),
        (1, nn.Sigmoid()),),
        epochs=600,
        learning_rate=0.01,
        weight_decay=0,
        tune_hyperparams=False)
        
# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.8012, 4)}')

Architecture: R42, T42, T42, S1
Learned Parameters: 4537
Best epoch: 13
Best val. loss: 1.0614
Best bal. val. acc.: 0.7818

Change: -0.0194


Changing the activation of the second layer did not improve accuracy any further. Finally, we will test a different combination of ReLU and Tanh activation functions in the first three layers.

In [47]:
res = evaluate_architecture((
        (42, nn.Tanh()),
        (42, nn.ReLU()),
        (42, nn.Tanh()),
        (1, nn.Sigmoid()),),
        epochs=600,
        learning_rate=0.01,
        weight_decay=0,
        tune_hyperparams=False)
        
# appending results to list
add_result(results, *res)

# last result - best result so far
print(f'\nChange: {np.round(res[4]-0.8012, 4)}')

Architecture: T42, R42, T42, S1
Learned Parameters: 4537
Best epoch: 12
Best val. loss: 1.0438
Best bal. val. acc.: 0.7816

Change: -0.0196


The best performance was achieved by the architecture R42, R42, T42, S1.

To make the comparison of all test results easier, we will present them in tabular format.

In [48]:
# converting list of results into a dataframe
results = pd.DataFrame(results)

# displaying data frame sorted descending by balanced accuracy
results.sort_values(by='bal val acc', ascending=False).iloc[:,:-1]

Unnamed: 0,architecture,n_params,best epoch,bal val acc
15,"R42, R42, T42, S1",4537,3,0.8012
13,"R42, R42, R42, S1",4537,10,0.7836
16,"R42, T42, T42, S1",4537,13,0.7818
17,"T42, R42, T42, S1",4537,12,0.7816
10,"R16, R16, R16, S1",897,18,0.7777
6,"R16, R16, S1",625,42,0.7756
3,"R32, S1",705,46,0.7738
7,"R32, R32, S1",1761,46,0.7738
2,"R16, S1",353,123,0.772
8,"R64, R64, S1",5569,21,0.7681


Based on the tests we performed the architecture R42, R42, T42, S1 yielded the highest balanced accuracy on the validation set of: 80.12%

However, this architecture also results in the highest validation loss of 1.2431, and its best epoch of 3 is rather low.

## Hyperparameter Tuning

The next objective is to find the best combination of hyperparameters (learning rate and weight decay) for the architecture of R42, R42, T42, S1. 

The module 'tune' in the library 'ray' provides a convenient implementation to tune hyperparameters of PyTorch models.

In [49]:
# inspired by: https://stackoverflow.com/questions/44260217/hyperparameter-optimization-for-pytorch-model
def tune_architecture(config):
    evaluate_architecture((
        (42, nn.ReLU()),
        (42, nn.ReLU()),
        (42, nn.Tanh()),
        (1, nn.Sigmoid()),),
        epochs=600,
        learning_rate=config['lr'],
        weight_decay=config['wd'],
        tune_hyperparams=True)

In [50]:
analysis = tune.run(
    tune_architecture, config={
        # configuring learning rates for hyperparameter tuning
        "lr": tune.grid_search([0.01, 0.001, 0.0003]),
        # configuring weight decay for hyperparameter tuning
        'wd': tune.grid_search([0, 1e-6, 1e-5, 1e-4])
        })

# display output looks best on light theme

2023-01-06 17:22:10,953	INFO worker.py:1538 -- Started a local Ray instance.


0,1
Current time:,2023-01-06 17:22:19
Running for:,00:00:07.87
Memory:,10.8/16.0 GiB

Trial name,status,loc,lr,wd,iter,total time (s),balanced_accuracy
tune_architecture_46175_00000,TERMINATED,127.0.0.1:19564,0.01,0.0,1,1.87254,0.801186
tune_architecture_46175_00001,TERMINATED,127.0.0.1:19570,0.001,0.0,1,1.59909,0.78363
tune_architecture_46175_00002,TERMINATED,127.0.0.1:19571,0.0003,0.0,1,1.67637,0.78363
tune_architecture_46175_00003,TERMINATED,127.0.0.1:19572,0.01,1e-06,1,1.7441,0.801186
tune_architecture_46175_00004,TERMINATED,127.0.0.1:19573,0.001,1e-06,1,1.86451,0.789558
tune_architecture_46175_00005,TERMINATED,127.0.0.1:19574,0.0003,1e-06,1,1.47281,0.78363
tune_architecture_46175_00006,TERMINATED,127.0.0.1:19575,0.01,1e-05,1,1.63311,0.801186
tune_architecture_46175_00007,TERMINATED,127.0.0.1:19576,0.001,1e-05,1,1.42924,0.779754
tune_architecture_46175_00008,TERMINATED,127.0.0.1:19564,0.0003,1e-05,1,2.24902,0.78363
tune_architecture_46175_00009,TERMINATED,127.0.0.1:19564,0.01,0.0001,1,1.19207,0.801186


Trial name,balanced_accuracy,date,done,episodes_total,experiment_id,experiment_tag,hostname,iterations_since_restore,node_ip,pid,time_since_restore,time_this_iter_s,time_total_s,timestamp,timesteps_since_restore,timesteps_total,training_iteration,trial_id,warmup_time
tune_architecture_46175_00000,0.801186,2023-01-06_17-22-15,True,,4d139beb09e440f5b2fa29492d4874a2,"0_lr=0.0100,wd=0",MacBook-Air-von-Regis.local,1,127.0.0.1,19564,1.87254,1.87254,1.87254,1673022135,0,,1,46175_00000,0.00191808
tune_architecture_46175_00001,0.78363,2023-01-06_17-22-18,True,,07e9370bf6ed4b539768f6bf467befdb,"1_lr=0.0010,wd=0",MacBook-Air-von-Regis.local,1,127.0.0.1,19570,1.59909,1.59909,1.59909,1673022138,0,,1,46175_00001,0.00383496
tune_architecture_46175_00002,0.78363,2023-01-06_17-22-18,True,,b99fcdaa999c4d3b96a6c0cb5b3dc258,"2_lr=0.0003,wd=0",MacBook-Air-von-Regis.local,1,127.0.0.1,19571,1.67637,1.67637,1.67637,1673022138,0,,1,46175_00002,0.0135534
tune_architecture_46175_00003,0.801186,2023-01-06_17-22-18,True,,7efd17b0f2a745f6bb5e17c4aa2abdaf,"3_lr=0.0100,wd=0.0000",MacBook-Air-von-Regis.local,1,127.0.0.1,19572,1.7441,1.7441,1.7441,1673022138,0,,1,46175_00003,0.00360489
tune_architecture_46175_00004,0.789558,2023-01-06_17-22-18,True,,8d40b96fc73644c4a249f0ff2cdb5e4e,"4_lr=0.0010,wd=0.0000",MacBook-Air-von-Regis.local,1,127.0.0.1,19573,1.86451,1.86451,1.86451,1673022138,0,,1,46175_00004,0.00481796
tune_architecture_46175_00005,0.78363,2023-01-06_17-22-18,True,,c2440a4e54ca4235aba672b3c2bf8524,"5_lr=0.0003,wd=0.0000",MacBook-Air-von-Regis.local,1,127.0.0.1,19574,1.47281,1.47281,1.47281,1673022138,0,,1,46175_00005,0.040993
tune_architecture_46175_00006,0.801186,2023-01-06_17-22-18,True,,cee1b1b0eb0c4efa91084d2d3921f613,"6_lr=0.0100,wd=0.0000",MacBook-Air-von-Regis.local,1,127.0.0.1,19575,1.63311,1.63311,1.63311,1673022138,0,,1,46175_00006,0.036972
tune_architecture_46175_00007,0.779754,2023-01-06_17-22-18,True,,6ecc6d2d8f8a4232ba2ebee1d30ca75c,"7_lr=0.0010,wd=0.0000",MacBook-Air-von-Regis.local,1,127.0.0.1,19576,1.42924,1.42924,1.42924,1673022138,0,,1,46175_00007,0.00346398
tune_architecture_46175_00008,0.78363,2023-01-06_17-22-17,True,,4d139beb09e440f5b2fa29492d4874a2,"8_lr=0.0003,wd=0.0000",MacBook-Air-von-Regis.local,1,127.0.0.1,19564,2.24902,2.24902,2.24902,1673022137,0,,1,46175_00008,0.00191808
tune_architecture_46175_00009,0.801186,2023-01-06_17-22-19,True,,4d139beb09e440f5b2fa29492d4874a2,"9_lr=0.0100,wd=0.0001",MacBook-Air-von-Regis.local,1,127.0.0.1,19564,1.19207,1.19207,1.19207,1673022139,0,,1,46175_00009,0.00191808


2023-01-06 17:22:19,823	INFO tune.py:762 -- Total run time: 8.18 seconds (7.86 seconds for the tuning loop).


In [51]:
# turning analysis results into dataframe
df = analysis.dataframe()
df.sort_values(by='balanced_accuracy', ascending=False).head(10)[['balanced_accuracy', 'config/lr', 'config/wd']]

Unnamed: 0,balanced_accuracy,config/lr,config/wd
0,0.801186,0.01,0.0
3,0.801186,0.01,1e-06
6,0.801186,0.01,1e-05
9,0.801186,0.01,0.0001
4,0.789558,0.001,1e-06
11,0.785682,0.0003,0.0001
1,0.78363,0.001,0.0
2,0.78363,0.0003,0.0
5,0.78363,0.0003,1e-06
8,0.78363,0.0003,1e-05


It appears that any learning rate smaller than 0.01 leads to worse results, while the weight decay appears to have no impact at a learning rate of 0.01. For this reason we are also going to test slightly larger learning rates of 0.05 and 0.02, while changing the weight decays from 0, 1e-6, 1e-5 and 1e-4 to 0, 1e-4, 1e-3 and 1e-2.
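
A grid search evaluates every combination of the listed values, so the number of trials is the product of the list lengths. The following plain-Python sketch enumerates the two search grids of this section; it only illustrates the combinatorics and is not how Ray Tune schedules trials internally:

```python
from itertools import product

# first tuning run: 3 learning rates x 4 weight decays
first_grid = list(product([0.01, 0.001, 0.0003], [0, 1e-6, 1e-5, 1e-4]))

# second tuning run: 5 learning rates x 4 weight decays
second_grid = list(product([0.05, 0.02, 0.01, 0.001, 0.0003],
                           [0, 1e-4, 1e-3, 1e-2]))

# combinations that appear in both runs
repeated = set(first_grid) & set(second_grid)

print(len(first_grid), len(second_grid), len(repeated))
```

The second run therefore repeats the combinations it shares with the first one, which offers a quick check that results are reproducible between runs.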

In [52]:
analysis = tune.run(
    tune_architecture, config={
        "lr": tune.grid_search([0.05, 0.02, 0.01, 0.001, 0.0003]),
        'wd': tune.grid_search([0, 1e-4, 1e-3, 1e-2])
        })

# display output looks best on light theme

0,1
Current time:,2023-01-06 17:22:29
Running for:,00:00:10.05
Memory:,10.8/16.0 GiB

Trial name,status,loc,lr,wd,iter,total time (s),balanced_accuracy
tune_architecture_4b08d_00000,TERMINATED,127.0.0.1:19600,0.05,0.0,1,1.73798,0.76197
tune_architecture_4b08d_00001,TERMINATED,127.0.0.1:19602,0.02,0.0,1,1.60192,0.782034
tune_architecture_4b08d_00002,TERMINATED,127.0.0.1:19603,0.01,0.0,1,1.60388,0.801186
tune_architecture_4b08d_00003,TERMINATED,127.0.0.1:19604,0.001,0.0,1,1.55023,0.78363
tune_architecture_4b08d_00004,TERMINATED,127.0.0.1:19605,0.0003,0.0,1,1.80358,0.78363
tune_architecture_4b08d_00005,TERMINATED,127.0.0.1:19606,0.05,0.0001,1,1.4064,0.76425
tune_architecture_4b08d_00006,TERMINATED,127.0.0.1:19607,0.02,0.0001,1,1.57228,0.789786
tune_architecture_4b08d_00007,TERMINATED,127.0.0.1:19608,0.01,0.0001,1,1.41285,0.801186
tune_architecture_4b08d_00008,TERMINATED,127.0.0.1:19600,0.001,0.0001,1,2.05467,0.779754
tune_architecture_4b08d_00009,TERMINATED,127.0.0.1:19600,0.0003,0.0001,1,1.41956,0.785682


Trial name,balanced_accuracy,date,done,episodes_total,experiment_id,experiment_tag,hostname,iterations_since_restore,node_ip,pid,time_since_restore,time_this_iter_s,time_total_s,timestamp,timesteps_since_restore,timesteps_total,training_iteration,trial_id,warmup_time
tune_architecture_4b08d_00000,0.76197,2023-01-06_17-22-24,True,,86b047a3996e477a880e2fc7ba4384a9,"0_lr=0.0500,wd=0",MacBook-Air-von-Regis.local,1,127.0.0.1,19600,1.73798,1.73798,1.73798,1673022144,0,,1,4b08d_00000,0.001755
tune_architecture_4b08d_00001,0.782034,2023-01-06_17-22-27,True,,9501e34d4f1c4a27828dc566b0e012fe,"1_lr=0.0200,wd=0",MacBook-Air-von-Regis.local,1,127.0.0.1,19602,1.60192,1.60192,1.60192,1673022147,0,,1,4b08d_00001,0.00997996
tune_architecture_4b08d_00002,0.801186,2023-01-06_17-22-27,True,,fe56d5df355344558cacf8adb6ab2b88,"2_lr=0.0100,wd=0",MacBook-Air-von-Regis.local,1,127.0.0.1,19603,1.60388,1.60388,1.60388,1673022147,0,,1,4b08d_00002,0.00859308
tune_architecture_4b08d_00003,0.78363,2023-01-06_17-22-27,True,,4a5f05fbbe914b0b80562b379fa2a34f,"3_lr=0.0010,wd=0",MacBook-Air-von-Regis.local,1,127.0.0.1,19604,1.55023,1.55023,1.55023,1673022147,0,,1,4b08d_00003,0.00333595
tune_architecture_4b08d_00004,0.78363,2023-01-06_17-22-27,True,,f36f0e281591452d8fe126c38342012c,"4_lr=0.0003,wd=0",MacBook-Air-von-Regis.local,1,127.0.0.1,19605,1.80358,1.80358,1.80358,1673022147,0,,1,4b08d_00004,0.00540805
tune_architecture_4b08d_00005,0.76425,2023-01-06_17-22-27,True,,8ce2f10b4cba400b8cfeb44bf6da31f4,"5_lr=0.0500,wd=0.0001",MacBook-Air-von-Regis.local,1,127.0.0.1,19606,1.4064,1.4064,1.4064,1673022147,0,,1,4b08d_00005,0.00985599
tune_architecture_4b08d_00006,0.789786,2023-01-06_17-22-27,True,,86cd5e8c233644dc9deb91eba4a36dc4,"6_lr=0.0200,wd=0.0001",MacBook-Air-von-Regis.local,1,127.0.0.1,19607,1.57228,1.57228,1.57228,1673022147,0,,1,4b08d_00006,0.00373197
tune_architecture_4b08d_00007,0.801186,2023-01-06_17-22-27,True,,6e871a57bb5546a6b90945d263f7dd08,"7_lr=0.0100,wd=0.0001",MacBook-Air-von-Regis.local,1,127.0.0.1,19608,1.41285,1.41285,1.41285,1673022147,0,,1,4b08d_00007,0.0118558
tune_architecture_4b08d_00008,0.779754,2023-01-06_17-22-26,True,,86b047a3996e477a880e2fc7ba4384a9,"8_lr=0.0010,wd=0.0001",MacBook-Air-von-Regis.local,1,127.0.0.1,19600,2.05467,2.05467,2.05467,1673022146,0,,1,4b08d_00008,0.001755
tune_architecture_4b08d_00009,0.785682,2023-01-06_17-22-27,True,,86b047a3996e477a880e2fc7ba4384a9,"9_lr=0.0003,wd=0.0001",MacBook-Air-von-Regis.local,1,127.0.0.1,19600,1.41956,1.41956,1.41956,1673022147,0,,1,4b08d_00009,0.001755


2023-01-06 17:22:30,221	INFO tune.py:762 -- Total run time: 10.28 seconds (10.03 seconds for the tuning loop).


In [53]:
# turning analysis results into dataframe
df = analysis.dataframe()
df.sort_values(by='balanced_accuracy', ascending=False).head(5)[['balanced_accuracy', 'config/lr', 'config/wd']]

Unnamed: 0,balanced_accuracy,config/lr,config/wd
2,0.801186,0.01,0.0
7,0.801186,0.01,0.0001
12,0.797082,0.01,0.001
16,0.791838,0.02,0.01
6,0.789786,0.02,0.0001


Based on the results of the second tuning run we can conclude that the best learning rate is 0.01, while the weight decay needs to be less than 0.001.

Given these findings we are going to use the architecture R42, R42, T42, S1 with a learning rate of 0.01 and a weight decay of 0 to make predictions on the test set.
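
Several configurations tie at the top of the results table, so some rule is needed to pick one of them. A minimal sketch, assuming ties are resolved in favour of the smaller weight decay (an assumption made here for illustration); the values are copied from the table of the second tuning run:

```python
# (balanced_accuracy, learning rate, weight decay) of the top five trials
top_trials = [
    (0.801186, 0.01, 0.0),
    (0.801186, 0.01, 0.0001),
    (0.797082, 0.01, 0.001),
    (0.791838, 0.02, 0.01),
    (0.789786, 0.02, 0.0001),
]

# highest accuracy first; among ties prefer the smaller weight decay
best_acc, best_lr, best_wd = min(top_trials, key=lambda t: (-t[0], t[2]))
print(best_lr, best_wd)
```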

In [54]:
# running evaluation to get the best model
architecture, n_params, best_epoch, best_loss_val, best_bacc_val, best_model = evaluate_architecture((
        (42, nn.ReLU()),
        (42, nn.ReLU()),
        (42, nn.Tanh()),
        (1, nn.Sigmoid()),),
        epochs=600,
        learning_rate=0.01,
        weight_decay=0,
        tune_hyperparams=False)

# predicting test set with the best model
y_test_pred = best_model(X_test_tensor)

# calculating balanced accuracy score
nn_bacc_test = balanced_accuracy_score(y_test_tensor.detach().numpy(), np.round(y_scaler.inverse_transform(y_test_pred.detach().numpy())))

# printing balanced accuracy of test set on neural network
print(
    f'\nBalanced Test Accuracy: {np.round(nn_bacc_test, 4)*100}%')

Architecture: R42, R42, T42, S1
Learned Parameters: 4537
Best epoch: 3
Best val. loss: 1.2431
Best bal. val. acc.: 0.8012

Balanced Test Accuracy: 69.96%


The neural network designed based on our testing and hyperparameter tuning achieves a balanced accuracy of 69.96% on the previously unseen test set.

To provide some context for this result, we are going to compare it to our previous work, where we used machine learning models to make predictions.
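
Balanced accuracy is the mean of the per-class recalls, so it differs from the share of correctly classified cases whenever the classes are imbalanced. A minimal pure-Python sketch of the definition on made-up labels; for binary labels it should agree with scikit-learn's `balanced_accuracy_score` used above:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        # predictions made for the examples that truly belong to this class
        preds = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls.append(sum(p == label for p in preds) / len(preds))
    return sum(recalls) / len(recalls)

# made-up labels for illustration: 7 examples of class 1, 3 of class 0
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1] * 10  # a model that always predicts class 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, balanced_accuracy(y_true, y_pred))
```

In this toy case the plain accuracy is 0.7 although the model never identifies class 0, while the balanced accuracy of 0.5 reflects that it does no better than guessing.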

## Comparison to ML Model (SVM Polynomial Kernel)

In our previous work order, we compared four different machine learning approaches to make predictions on this data set. Those were:
- K Nearest Neighbors
- Decision Tree
- SVM classifier using a radial basis function kernel (rbf)
- SVM classifier using a polynomial kernel function (poly)

In these tests the Support Vector Machine classifier using a polynomial kernel function was able to achieve a balanced accuracy of 67.4%.

In [55]:
# configuring pipeline for scaler and estimator
def get_pipe(estimator):
    return Pipeline([
        ('scaler', StandardScaler()),
        ('estimator', estimator)])

In [56]:
# configuring splits
NUM_INNER_REPEATS = 3
NUM_INNER_SPLITS = 3

# defining the classifier
poly_svm = SVC(kernel='poly', random_state=1)

# configuring hyperparameters to be tuned through nested cross validation
grid = {'estimator__C': [0.1,1,10,100], 'estimator__degree': [1,2,3]}

inner_cv = RepeatedStratifiedKFold(
    n_splits=NUM_INNER_SPLITS, n_repeats=NUM_INNER_REPEATS, random_state=1)

clf = GridSearchCV(
    estimator = get_pipe(poly_svm),
    param_grid = grid,
    cv = inner_cv,
    scoring = ('balanced_accuracy'),
    refit = ('balanced_accuracy'),
    n_jobs = -1)

clf.fit(X_train_full, y_train_full.ravel())

y_pred = clf.best_estimator_.predict(X_test)

ml_bacc_test = balanced_accuracy_score(y_test, y_pred)

print(
    f'Balanced Test Accuracy: {np.round(ml_bacc_test, 3)*100}%')

Balanced Test Accuracy: 67.4%


## Improvement

In [57]:
# calculating overall improvement
print(f'Performance Improvement: {np.round((nn_bacc_test-ml_bacc_test), 4)*100}%')

Performance Improvement: 2.55%


The best performing neural network architecture using tuned hyperparameters beats the performance of the SVM classifier by 2.55 percentage points.
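
The improvement is a difference between two percentages, so it is measured in percentage points; expressed relative to the SVM score it is a somewhat larger number. A small sketch using the rounded figures printed above, which is why the last digit can differ from the 2.55 computed in the notebook from unrounded values:

```python
nn_bacc = 69.96   # balanced test accuracy of the neural network, in percent
svm_bacc = 67.4   # balanced test accuracy of the SVM, in percent (rounded)

absolute_gain = nn_bacc - svm_bacc               # percentage points
relative_gain = absolute_gain / svm_bacc * 100   # percent of the SVM score

print(round(absolute_gain, 2), round(relative_gain, 1))
```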

# Conclusion

While applying a neural network to the given data set in order to predict whether a customer will default on their loan does provide a higher balanced accuracy than the previously best performing approach based on an SVM classifier, the achieved gains remain marginal.

There is a range of reasons to explain this. Even though the data set has 20 features of varying types (categorical, quantitative and ordinal), a higher number of features would probably benefit the task. This could be achieved either by collecting more features, as far as this is possible from an operational and legal perspective, or by engineering new features from the given ones.

Another aspect is the randomness and chance which can impact personal behaviour and decision making. With regard to the given problem, a customer could look like the perfect borrower based on the information provided and still ultimately default on their credit simply because their life was disrupted by some unforeseen event, like a severe accident which prevents them from working.

The size of the data set also plays a role. In this case we have 1,000 examples, which leaves a distribution of:
- 420 examples for training
- 180 examples for validation
- 400 examples for testing
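
To give a feeling for how little data these splits contain, the sketch below recomputes the training set size and a rough sampling error for scores measured on the validation and test sets. It treats the score as a plain proportion (a binomial approximation, which is only a rough stand-in for balanced accuracy), and 0.78 and 0.70 are round stand-ins for the scores reported above:

```python
import math

n_total = 1000
n_test = 400   # held out for the final evaluation
n_val = 180    # used to compare architectures
n_train = n_total - n_test - n_val

def rough_standard_error(p, n):
    """Binomial standard error of a proportion p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)

print(n_train)
print(round(rough_standard_error(0.78, n_val), 3))
print(round(rough_standard_error(0.70, n_test), 3))
```

Under this approximation the standard error is about 0.03 on the validation set and about 0.02 on the test set, so differences of a few thousandths between architectures are well within sampling noise.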

While this was enough to train a neural network which performed better than the previous approach, it really isn't a lot considering the amount of training data involved in larger real-world applications of neural networks. While augmenting the data to increase the number of examples might seem appealing, this approach would lack robustness. We would essentially be inventing credit customers, and it is impossible to realistically make up all of the information which they would have to provide.

This only leaves us the possibility of requesting a larger data set from our customer to be able to have more training data examples.

Furthermore, we suggest recording more information about future credit customers than has been collected so far. This could include access to a centralized credit history showing a customer's obligations with other banks, lending institutions or retail stores. If a customer is married, we suggest also collecting the same information from their spouse or partner.

In [58]:
print(f'Total runtime: {np.round((time.time()-start_time )/ 60, 2)} minutes')

Total runtime: 1.25 minutes
