# 5. Imbalanced classes

## Synthesizing and Modeling the Data

In credit scoring use cases, and especially in portfolios of secured loans such as mortgages, class imbalance is common. Class imbalance occurs when one class has far more examples than the other classes in the dataset. With imbalanced classes, a naive classifier can achieve high accuracy simply by assigning every case to the majority class. This is very damaging for a business and can carry high costs, because the minority class, which is usually the one of interest, is never detected.
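
To make the accuracy trap concrete, here is a minimal sketch in plain Python. The 93/7 class split is a hypothetical illustration, not a figure taken from this dataset, and `always_majority` is a made-up stand-in for a naive classifier.

```python
# Hypothetical labels: 930 non-defaulting loans (0) and 70 defaults (1).
# The 93/7 split is an assumption for illustration only.
y_true = [0] * 930 + [1] * 70


def always_majority(labels):
    """Naive classifier: predict the most frequent class for every case."""
    majority = max(set(labels), key=labels.count)
    return [majority] * len(labels)


y_pred = always_majority(y_true)

# Accuracy: share of all cases predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of actual defaults that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
```

Accuracy looks strong while minority recall is zero, which is why the rest of this section reports F1, recall, and precision alongside accuracy.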



Since we diagnosed a class imbalance problem earlier, we can use YData's state-of-the-art synthesizers to generate synthetic data for the minority class and balance the dataset.

However, we only do so for the training dataset, so that performance is always evaluated against the real, unaltered test dataset.
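
The order of operations matters here: split first, then balance only the training part. The sketch below illustrates that order under stated assumptions. It uses random oversampling with replacement as a simple stand-in for synthetic data generation (it is not the YData synthesizer API), and the dataset, seed, and variable names are all hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical dataset of (feature, label) rows with about 7% positives.
rows = [(i, 1 if i % 14 == 0 else 0) for i in range(1000)]
random.shuffle(rows)

# 1) Split first, so the test set contains only real, unaltered rows.
split = int(len(rows) * 0.8)
train, test = rows[:split], rows[split:]

# 2) Balance the training set only. Random oversampling with replacement is
#    a simple stand-in here for generating synthetic minority rows.
minority = [r for r in train if r[1] == 1]
majority = [r for r in train if r[1] == 0]
extra = random.choices(minority, k=len(majority) - len(minority))
train_balanced = train + extra
random.shuffle(train_balanced)

# The training set is now balanced; the test set keeps its real class ratio.
train_pos = sum(label for _, label in train_balanced)
print(f"train positives: {train_pos}/{len(train_balanced)}")
print(f"test positives:  {sum(label for _, label in test)}/{len(test)}")
```

Balancing before the split would let copies (or synthetic relatives) of the same minority rows land on both sides, which inflates test metrics.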

### Import the needed packages

In [16]:
%%capture
!pip install xgboost

In [17]:
import os
import sys

import pickle 

import pandas as pd

from balance_model_training import train_model, augment_minority

In [18]:
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [19]:
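# Toggle minority-class augmentation with the AUGMENT environment variable (defaults to 1, enabled)
augment = int(os.getenv('AUGMENT', 1))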
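
Note that `int(os.getenv('AUGMENT', 1))` raises a `ValueError` if the variable is set to something like `true`. If that matters in your pipeline, a more tolerant parser could look like the sketch below. `read_flag` is a hypothetical helper, not part of this notebook, and the set of accepted truthy strings is an assumption.

```python
import os


def read_flag(name, default=True):
    """Return a boolean flag from the environment (hypothetical helper)."""
    raw = os.getenv(name)
    if raw is None:
        # Variable not set: fall back to the default.
        return default
    # Assumed convention: these strings mean "enabled", anything else is off.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


os.environ["AUGMENT"] = "0"
print(read_flag("AUGMENT"))

os.environ.pop("AUGMENT")
print(read_flag("AUGMENT"))
```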

In [20]:
# Load the preprocessing parameters saved by the previous step
with open('prep_parameters.pkl', 'rb') as prep_file:
    prep = pickle.load(prep_file)

### Get the training dataset

In [21]:
data = pd.read_csv('prep_traindata.csv', index_col=[0])

In [22]:
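# Split by label: 1 = serious delinquency (named `fraud` here), 0 = no delinquency
fraud = data[data['SeriousDlqin2yrs'] == 1]
nonfraud = data[data['SeriousDlqin2yrs'] == 0]
# Keep a random 60% of the label-0 rows (no seed is set, so the sample differs between runs)
nonfraud = nonfraud.sample(int(len(nonfraud) * 0.6))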

In [23]:
data = pd.concat([fraud, nonfraud])

#### Augment the less represented class

In [49]:
# Augmentation and synthesizer training are switched on or off together
use_augmentation = augment == 1
result, models = train_model(X=data, label='SeriousDlqin2yrs',
                             augmentation=use_augmentation, train_synth=use_augmentation)

# Record whether balancing was applied
prep['Balancing'] = augment

40000
(6531, 10)
INFO: 2022-12-08 23:00:01,685 [SYNTHESIZER] - Number columns considered for synth: 10
INFO: 2022-12-08 23:00:03,063 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2022-12-08 23:00:03,067 [SYNTHESIZER] - Preprocess segment
INFO: 2022-12-08 23:00:03,071 [SYNTHESIZER] - Synthesizer init.
INFO: 2022-12-08 23:00:03,071 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2022-12-08 23:00:04,907 [SYNTHESIZER] - Start generating model samples.
Size training: 137596
Size test: 24400
Model training: DummyClassifier
Model training: RandomForestClassifier
Model training: AdaBoostClassifier
Model training: XGBClassifier


In [50]:
result

|   | model | f1_score | recall | precision | roc_auc | accuracy |
|---|---|---|---|---|---|---|
| 0 | DummyClassifier | 0.112215 | 0.344424 | 0.067026 | 0.501404 | 0.637500 |
| 1 | RandomForestClassifier | 0.389459 | 0.514479 | 0.313321 | 0.717068 | 0.892705 |
| 2 | AdaBoostClassifier | 0.407816 | 0.591497 | 0.311183 | 0.749101 | 0.885738 |
| 3 | XGBClassifier | 0.400863 | 0.572397 | 0.308433 | 0.740472 | 0.886189 |


In [53]:
# Keep the model with the highest F1 score.
# `models` is indexed by the model names listed in the `model` column of `result`.
optimized_model = models[result.set_index('model')['f1_score'].idxmax()]
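
Selecting by F1 rather than accuracy changes the winner: in the results table, RandomForestClassifier has the highest accuracy, but AdaBoostClassifier has the highest F1. As a quick sanity check, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall. The numbers are copied from the table; `f1_from` is a helper written for this illustration.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1_score) copied from the results table.
reported = {
    "DummyClassifier": (0.067026, 0.344424, 0.112215),
    "RandomForestClassifier": (0.313321, 0.514479, 0.389459),
    "AdaBoostClassifier": (0.311183, 0.591497, 0.407816),
    "XGBClassifier": (0.308433, 0.572397, 0.400863),
}

for model, (precision, recall, f1_reported) in reported.items():
    f1 = f1_from(precision, recall)
    print(f"{model}: computed {f1:.6f}, reported {f1_reported:.6f}")

# Same selection rule as the notebook cell: highest F1 wins.
best = max(reported, key=lambda m: f1_from(*reported[m][:2]))
print("best by F1:", best)
```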

In [54]:
# Persist the selected model and the updated preprocessing parameters
with open('best_model.pkl', 'wb') as model_file:
    pickle.dump(optimized_model, model_file)
with open('prep_parameters.pkl', 'wb') as prep_file:
    pickle.dump(prep, prep_file)

### Creating the pipeline step outputs

In [42]:
import json 

metadata = {
    'outputs' : [
        {
          'type': 'markdown',
          'storage': 'inline',
          'source': f'## **Dataset balancing:** {bool(augment)}',
        },
        {
          'type': 'table',
          'storage': 'inline',
          'format': 'csv',
          'header': list(result.columns),
          'source': result.to_csv(header=False, index=False)
        },
    ]
  }

# Write the metadata file that the pipeline UI reads to render these outputs
with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(metadata, metadata_file)
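
The structure of the outputs list can be seen more easily in isolation. The sketch below rebuilds the same two kinds of entries (markdown and inline CSV table) with the keys used in the cell above, using only the standard library. `markdown_output` and `table_output` are hypothetical helpers, and the two table rows are copied from the results table purely as sample data.

```python
import csv
import io
import json


def markdown_output(text):
    """Build an inline markdown entry (hypothetical helper)."""
    return {"type": "markdown", "storage": "inline", "source": text}


def table_output(header, rows):
    """Build an inline CSV table entry (hypothetical helper)."""
    buffer = io.StringIO()
    # The header travels separately, so only the data rows go into `source`.
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return {
        "type": "table",
        "storage": "inline",
        "format": "csv",
        "header": list(header),
        "source": buffer.getvalue(),
    }


example_metadata = {
    "outputs": [
        markdown_output("## **Dataset balancing:** True"),
        table_output(
            ["model", "f1_score"],
            [["AdaBoostClassifier", 0.407816], ["XGBClassifier", 0.400863]],
        ),
    ]
}

# Serialize and parse back to confirm the structure survives as valid JSON.
payload = json.dumps(example_metadata, indent=2)
restored = json.loads(payload)
print(restored["outputs"][1]["source"])
```

Keeping the header out of `source` mirrors the notebook cell, which passes `header=False` to `to_csv` and supplies the column names through the `header` key instead.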