# Imbalanced datasets
------------------------------------

This example compares three ways of handling an imbalanced dataset: class weights, sample weights, and oversampling of the minority class.

The data used is a variation on the Australian weather dataset from [https://www.kaggle.com/jsphyg/weather-dataset-rattle-package](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package). The goal is to predict whether it will rain tomorrow by training a binary classifier on the target column `RainTomorrow`.

## Load the data

In [1]:
# Import packages
import pandas as pd
from atom import ATOMClassifier

In [2]:
# Load data
X = pd.read_csv('./datasets/weatherAUS.csv')

# Let's have a look at a subset of the data
X.sample(frac=1).iloc[:5, :8]

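|        | Location         |   MinTemp |   MaxTemp |   Rainfall |   Evaporation |   Sunshine | WindGustDir   |   WindGustSpeed |
|-------:|:-----------------|----------:|----------:|-----------:|--------------:|-----------:|:--------------|----------------:|
|  53132 | MountGinini      |      12.2 |      24.9 |        0.0 |           NaN |        NaN | NW            |            30.0 |
| 141064 | Uluru            |      12.6 |      26.5 |        0.0 |           NaN |        NaN | SSE           |            44.0 |
|  26116 | Penrith          |      11.1 |      28.8 |        0.0 |           NaN |        NaN | ESE           |            33.0 |
|  62930 | MelbourneAirport |       5.6 |      12.5 |        1.4 |           1.2 |        2.0 | N             |            37.0 |
|  64081 | MelbourneAirport |       4.5 |      14.5 |        3.4 |           2.0 |        2.3 | SW            |            35.0 |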


## Run the pipeline

In [3]:
# Initialize atom on a 30% sample of the loaded rows and hold out 30% of that sample as the test set
atom = ATOMClassifier(X, n_rows=0.3, test_size=0.3, verbose=2, random_state=1)
atom.clean()
atom.impute()
atom.encode()

Algorithm task: binary classification.

Shape: (42658, 22)
Missing values: 95216
Categorical columns: 5
Scaled: False
-----------------------------------
Train set size: 29861
Test set size: 12797
-----------------------------------
Train set balance: No:Yes <==> 3.5:1.0
Test set balance: No:Yes <==> 3.4:1.0
-----------------------------------
Distribution of classes:
|     |   dataset |   train |   test |
|:----|----------:|--------:|-------:|
| No  |     33139 |   23247 |   9892 |
| Yes |      9519 |    6614 |   2905 |

Applying data cleaning...
 --> Label-encoding the target column.
Fitting Imputer...
Imputing missing values...
 --> Dropping 352 rows for containing less than 50% non-missing values.
 --> Dropping 92 rows due to missing values in feature MinTemp.
 --> Dropping 56 rows due to missing values in feature MaxTemp.
 --> Dropping 350 rows due to missing values in feature Rainfall.
 --> Dropping 17551 rows due to missing values in feature Evaporation.
 --> Dropping 3229 rows 

In [4]:
# First, we fit a logistic regression model directly on the imbalanced data
atom.run("LR", metric="f1")


Models: LR
Metric: f1


Results for Logistic Regression:         
Fit ---------------------------------------------
Train evaluation --> f1: 0.6174
Test evaluation --> f1: 0.6096
Time elapsed: 0.078s
-------------------------------------------------
Total time: 0.085s


Duration: 0.087s
------------------------------------------
Logistic Regression --> f1: 0.610
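
The runs in this example are scored with F1 rather than accuracy. The sketch below suggests why, using the test-set class counts that `atom.classes` reports further down (3872 negatives, 1151 positives) and a hypothetical classifier that always predicts the majority class. The helper `f1_from_counts` is illustrative and not part of ATOM.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Test-set class counts taken from the atom.classes output in this example
n_negative, n_positive = 3872, 1151

# A classifier that always answers "No" gets every negative right
# and misses every positive (tp=0, fp=0, fn=n_positive).
accuracy = n_negative / (n_negative + n_positive)
f1 = f1_from_counts(tp=0, fp=0, fn=n_positive)

print(f"accuracy: {accuracy:.3f}")
print(f"f1:       {f1:.3f}")
```

Accuracy rewards this useless model with a score of roughly 77%, while F1 drops to zero because no rainy day is ever detected.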


## Use weighted classes

In [5]:
# Add the class weights through the est_params parameter
atom.run("LR", metric="f1", est_params={"class_weight": atom.get_class_weight()})


Models: LR
Metric: f1


Results for Logistic Regression:         
Fit ---------------------------------------------
Train evaluation --> f1: 0.6449
Test evaluation --> f1: 0.6472
Time elapsed: 0.081s
-------------------------------------------------
Total time: 0.087s


Duration: 0.089s
------------------------------------------
Logistic Regression --> f1: 0.647
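
To make the idea of class weights concrete, here is a minimal sketch of one common heuristic: weights inversely proportional to the class frequencies. Whether `atom.get_class_weight()` uses exactly this formula is an assumption; the function name `balanced_class_weights` is hypothetical, and the counts are the training counts printed at initialization (before the imputer dropped rows), used purely for illustration.

```python
from collections import Counter


def balanced_class_weights(y):
    """Return weights inversely proportional to class frequency.

    Uses n_samples / (n_classes * class_count), so every class
    contributes the same total weight to the loss.
    """
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Illustrative target with the class counts shown at initialization
y_train = ["No"] * 23247 + ["Yes"] * 6614
weights = balanced_class_weights(y_train)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these weights, a misclassified "Yes" sample costs about 3.5 times as much as a misclassified "No" sample, which matches the 3.5:1.0 training balance reported above.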


## Use sample weights

In [6]:
# Append "_fit" to the key in est_params so the parameter is passed to the estimator's fit method
atom.run("LR", metric="f1", est_params={"sample_weight_fit": atom.get_sample_weight()})


Models: LR
Metric: f1


Results for Logistic Regression:         
Fit ---------------------------------------------
Train evaluation --> f1: 0.6449
Test evaluation --> f1: 0.6472
Time elapsed: 0.080s
-------------------------------------------------
Total time: 0.087s


Duration: 0.089s
------------------------------------------
Logistic Regression --> f1: 0.647
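
Sample weights express the same idea per row instead of per class. The sketch below assumes the sample weights are derived by looking up each row's class weight, which would explain why this run scores the same as the class-weight run; how `atom.get_sample_weight()` computes them internally is not shown in this example, and `sample_weights_from_classes` is a hypothetical helper.

```python
def sample_weights_from_classes(y, class_weight):
    """Give every row the weight of the class it belongs to."""
    return [class_weight[label] for label in y]


# Toy target with a 3:1 imbalance and matching class weights
y = ["No", "No", "No", "Yes"]
class_weight = {"No": 1.0, "Yes": 3.0}

sample_weight = sample_weights_from_classes(y, class_weight)
print(sample_weight)

# The total weight per class is now equal
total_no = sum(w for w, label in zip(sample_weight, y) if label == "No")
total_yes = sum(w for w, label in zip(sample_weight, y) if label == "Yes")
print(total_no, total_yes)
```

Sample weights become the more flexible option when individual rows, rather than whole classes, need different importance.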


## Use oversampling

In [7]:
# Perform oversampling of the minority class
atom.balance(strategy='smote', sampling_strategy=0.9)

Oversampling with SMOTE...
 --> Adding 5830 rows to class: Yes.
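
SMOTE does not duplicate existing rows; it creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that interpolation step in plain Python, not the implementation ATOM calls; `smote_sample` and the toy data are hypothetical.

```python
import math
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point on the segment between a random
    minority sample and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    base_idx = rng.randrange(len(minority))
    base = minority[base_idx]

    # k nearest minority neighbours of the base point, excluding itself
    others = [p for i, p in enumerate(minority) if i != base_idx]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)

    # Random position along the segment base -> neighbour
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Toy minority class with two numeric features
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [5.0, 5.0]]
rng = random.Random(1)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(3)]

for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies between two real minority samples, the new rows stay inside the region the minority class already occupies.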


In [8]:
atom.classes  # The training set is now nearly balanced; the test set is left untouched

|    |   dataset |   train |   test |
|---:|----------:|--------:|-------:|
|  0 |     13189 |    9317 |   3872 |
|  1 |      9536 |    8385 |   1151 |
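
The numbers above can be checked by hand. With a float `sampling_strategy`, the value is the desired ratio of minority to majority samples after resampling. The sketch below reproduces the "Adding 5830 rows" message from the training counts; truncating to an integer is an assumption about how the row count is rounded, and `rows_to_add` is a hypothetical helper.

```python
def rows_to_add(n_majority: int, n_minority: int, sampling_strategy: float) -> int:
    """Number of synthetic minority rows needed to reach the target ratio."""
    return int(n_majority * sampling_strategy - n_minority)


n_majority = 9317        # class 0 in the training set
n_minority_after = 8385  # class 1 in the training set after SMOTE
n_added = 5830           # rows reported by atom.balance

# Minority count before oversampling
n_minority_before = n_minority_after - n_added

computed = rows_to_add(n_majority, n_minority_before, sampling_strategy=0.9)
ratio_after = n_minority_after / n_majority

print(n_minority_before, computed, round(ratio_after, 3))
```

The computed row count agrees with the log message, and the resulting minority-to-majority ratio is 0.9 rather than 1.0, so the training set ends up nearly, not perfectly, balanced.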


In [9]:
atom.run("LR", metric="f1")


Models: LR
Metric: f1


Results for Logistic Regression:         
Fit ---------------------------------------------
Train evaluation --> f1: 0.7918
Test evaluation --> f1: 0.6505
Time elapsed: 0.101s
-------------------------------------------------
Total time: 0.110s


Duration: 0.112s
------------------------------------------
Logistic Regression --> f1: 0.650
