# 3. Imbalanced classes

## Synthesizing and Modeling the Data

Class imbalance occurs when one class greatly outnumbers the other classes in the dataset. A naive classifier trained on imbalanced classes can achieve high accuracy simply by assigning every case to the majority class, but it then misses every minority case, which can be very costly for a business.
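
To make the accuracy trap concrete, here is a minimal sketch with made-up counts (990 majority cases, 10 minority cases) and a "classifier" that always predicts the majority class. The numbers are illustrative only and are not taken from the notebook's dataset.

```python
# 990 majority-class cases (0) and 10 minority-class cases (1); made-up counts.
y_true = [0] * 990 + [1] * 10
# A naive model that assigns every case to the majority class.
y_pred = [0] * len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = tp / (tp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.99 recall=0.00 f1=0.00
```

Accuracy looks excellent while recall and F1 on the minority class are zero, which is why the F1 score is the more informative metric in this setting.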


Since we diagnosed a class imbalance earlier, we can use YData's synthesizers to generate synthetic data for the minority class and balance the dataset.

However, we do this only for the training dataset, so that performance is always evaluated against the real, unaltered test dataset.
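
The order of operations matters: split first, then balance only the training portion. The sketch below illustrates this with toy data and uses random duplication of minority rows as a plain stand-in for sampling from a trained synthesizer. The `split` helper and all values are assumptions for illustration.

```python
import random

random.seed(0)

# Toy dataset: 1000 rows with 20 positives (Class == 1). Values are made up.
rows = [{"x": random.random(), "Class": int(i < 20)} for i in range(1000)]
positives = [r for r in rows if r["Class"] == 1]
negatives = [r for r in rows if r["Class"] == 0]


def split(items):
    """Hypothetical helper: 80/20 split using integer arithmetic."""
    cut = len(items) * 4 // 5
    return items[:cut], items[cut:]


# 1. Hold out the test set first, per class, so it keeps the real class ratio.
pos_train, pos_test = split(positives)
neg_train, neg_test = split(negatives)
train, test = pos_train + neg_train, pos_test + neg_test

# 2. Balance the training set only. Duplicating minority rows is a stand-in
#    for generating new rows with a synthesizer.
n_extra = len(neg_train) - len(pos_train)
extra = [dict(random.choice(pos_train)) for _ in range(n_extra)]
train_balanced = train + extra

print(len(train_balanced), len(test))  # 1568 200
```

The test set still contains 4 positives out of 200 rows, the same 2% ratio as the original data, so the evaluation reflects what the model will face in practice.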

### Import the needed packages

In [1]:
%%capture
!pip install xgboost

In [2]:
import os
import sys

import pickle 

import pandas as pd

from balance_model_training import train_model, augment_minority

from ydata.dataset import Dataset
from ydata.metadata import Metadata
from ydata.synthesizers.regular import RegularSynthesizer

In [3]:
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [4]:
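# Environment variables arrive as strings, and bool('False') is True,
# so compare against the accepted "true" spellings instead.
augment = int(str(os.getenv('augment', 'True')).strip().lower() in ('1', 'true', 'yes'))
# Assumption: fall back to 0 synthetic samples when 'sample' is not set.
sample_size = int(os.getenv('sample', '0'))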

prep = {}

label = os.getenv('label', 'Class')

if not augment:
    sample_size = 0
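
Because environment variables are always strings, a boolean flag such as `augment` needs explicit parsing. The sketch below shows the pitfall and one possible fix; `env_flag` and the `demo_augment` variable are hypothetical names used only for this illustration.

```python
import os


def env_flag(name, default=False):
    """Hypothetical helper: interpret an environment variable as a boolean."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Use a demo variable so the notebook's real settings are left untouched.
os.environ["demo_augment"] = "False"

print(bool(os.getenv("demo_augment")))  # True: any non-empty string is truthy
print(env_flag("demo_augment"))         # False
```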

## Training a synthesizer for the minority class

In [5]:
data = pd.read_csv('train.csv', index_col=[0])

#load the metadata as well
metadata = 'loading the metadata'

The cell below isolates the minority class and wraps it in the `Dataset` and `Metadata` objects that a synthesizer is trained on.

In [6]:
# Select the minority class rows (label == 1)
c_minority = data[data[label]==1]

# Build the Dataset and Metadata for the minority class
c_minority = Dataset(c_minority)
minority_metadata = Metadata(c_minority)

[########################################] | 100% Completed | 101.68 ms
[########################################] | 100% Completed | 111.13 ms
[########################################] | 100% Completed | 202.90 ms
[########################################] | 100% Completed | 1.06 s


### Integrate the class augmentation with the classifier training

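`train_model`, imported from the local `balance_model_training` module, takes care of both the optional augmentation of the minority class and the training of the candidate classifiers. It returns a results table (`result`), the fitted models keyed by name (`models`), and a sample count (`n_samples`) that is stored with the preprocessing parameters.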

In [8]:
# TODO: add output ratio

# Augmentation and synthesizer training are switched on or off together
use_augmentation = bool(augment)
result, models, n_samples = train_model(X=data, label=label, augmentation=use_augmentation, train_synth=use_augmentation, sample_size=sample_size)

prep['Balancing'] = augment
prep['Num_samples'] = n_samples

0
entrou
145544
(248, 30)
[########################################] | 100% Completed | 101.44 ms
[########################################] | 100% Completed | 203.00 ms
[########################################] | 100% Completed | 947.97 ms
INFO: 2023-02-16 18:54:32,998 [SYNTHESIZER] - Number columns considered for synth: 30
INFO: 2023-02-16 18:54:33,287 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2023-02-16 18:54:33,290 [SYNTHESIZER] - Preprocess segment
INFO: 2023-02-16 18:54:33,295 [SYNTHESIZER] - Synthesizer init.
INFO: 2023-02-16 18:54:33,296 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2023-02-16 18:54:34,461 [SYNTHESIZER] - Start generating model samples.
Size training: 291336
Size test: 36449
Model training: DummyClassifier
Model training: RandomForestClassifier
Model training: AdaBoostClassifier
Model training: XGBClassifier
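
The source of `train_model` is not shown in this notebook. Judging only from the log above and from how `result` and `models` are used afterwards, a comparison loop of this kind might look roughly like the sketch below. Everything in it is an assumption for illustration: the toy data, the choice of two scikit-learn classifiers, and the `compare_models` name.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def compare_models(X_train, y_train, X_test, y_test):
    """Hypothetical helper: fit each candidate and score it on the test set."""
    candidates = [
        DummyClassifier(strategy="most_frequent"),
        RandomForestClassifier(n_estimators=50, random_state=0),
    ]
    rows, fitted = [], {}
    for clf in candidates:
        name = type(clf).__name__
        clf.fit(X_train, y_train)
        # zero_division=0 avoids a warning when a model predicts no positives.
        score = f1_score(y_test, clf.predict(X_test), zero_division=0)
        rows.append({"model": name, "f1_score": score})
        fitted[name] = clf
    return pd.DataFrame(rows), fitted


# Toy imbalanced data (roughly 5% positives), not the notebook's dataset.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

result_demo, models_demo = compare_models(X_train, y_train, X_test, y_test)
print(result_demo)
```

The `DummyClassifier` row acts as a baseline: its F1 score on the minority class is zero, so any useful model has to beat it.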


In [11]:
# Keep the model with the highest F1 score
optimized_model = models[result.set_index('model')['f1_score'].idxmax()]
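
The selection expression can be tried in isolation on a toy results table. The scores below are made up, and `result_demo` and `models_demo` are stand-ins that only mimic the shape of `result` and `models`.

```python
import pandas as pd

# Toy results table with made-up scores, in the same shape as `result`.
result_demo = pd.DataFrame({
    "model": ["DummyClassifier", "RandomForestClassifier", "XGBClassifier"],
    "f1_score": [0.00, 0.81, 0.84],
})
# Strings stand in for the fitted estimators.
models_demo = {name: f"<fitted {name}>" for name in result_demo["model"]}

# Index by model name, then ask for the label of the highest F1 score.
best_name = result_demo.set_index("model")["f1_score"].idxmax()
best_model = models_demo[best_name]

print(best_name)  # XGBClassifier
```

If two models tie on F1, `idxmax` returns the first one in the table.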

In [12]:
# Persist the best model and the preprocessing parameters
with open('best_model.pkl', 'wb') as model_file:
    pickle.dump(optimized_model, model_file)
with open('prep_parameters.pkl', 'wb') as prep_file:
    pickle.dump(prep, prep_file)
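
A later step can restore these artifacts with `pickle.load`. The sketch below round-trips a stand-in dictionary through a temporary directory. The values are made up, and a fitted model would be restored in the same way.

```python
import os
import pickle
import tempfile

# Stand-in for the real `prep` dictionary; the values here are made up.
prep_demo = {"Balancing": 1, "Num_samples": 100}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prep_parameters.pkl")
    # Context managers make sure the file is flushed and closed.
    with open(path, "wb") as fh:
        pickle.dump(prep_demo, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == prep_demo)  # True
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.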

### Creating the pipeline step outputs

In [13]:
import json 

metadata = {
    'outputs' : [
        {
          'type': 'markdown',
          'storage': 'inline',
          'source': f'## **Dataset balancing:** {bool(augment)}',
        },
        {
          'type': 'table',
          'storage': 'inline',
          'format': 'csv',
          'header': list(result.columns),
          'source': result.to_csv(header=False, index=False)
        },
    ]
  }

with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(metadata, metadata_file)
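
Before handing the file to the pipeline, it can be useful to check that the structure survives a JSON round trip and that the inline CSV matches its header. The sketch below rebuilds the same shape from made-up rows using only the standard library. `ui_metadata`, `header`, and `rows` are illustrative stand-ins for the objects above.

```python
import csv
import io
import json

# Toy results with made-up scores, standing in for the `result` DataFrame.
header = ["model", "f1_score"]
rows = [["DummyClassifier", 0.0], ["XGBClassifier", 0.84]]

# Serialize the rows without a header line, like to_csv(header=False, index=False).
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)

ui_metadata = {
    "outputs": [
        {"type": "markdown", "storage": "inline",
         "source": "## **Dataset balancing:** True"},
        {"type": "table", "storage": "inline", "format": "csv",
         "header": header, "source": buffer.getvalue()},
    ]
}

# Round-trip through JSON, then check that every CSV row matches the header width.
loaded = json.loads(json.dumps(ui_metadata))
table = next(o for o in loaded["outputs"] if o["type"] == "table")
parsed = list(csv.reader(io.StringIO(table["source"])))
assert all(len(r) == len(table["header"]) for r in parsed)

print(len(parsed), "rows,", len(table["header"]), "columns")  # 2 rows, 2 columns
```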