**Handling imbalanced datasets**    
There are four commonly used methods for handling imbalanced datasets.   
**Sampling and weighting methods for handling imbalanced datasets**
1. Downsampling (undersampling)
2. Upsampling (oversampling)
3. Upweighting
4. Combination of over- and undersampling

1. **Downsampling (undersampling):**  
 Here the classes are balanced by resampling the majority class: its size is reduced until it matches the size of the minority class.
 There are several ways to do this. One of them is the random undersampler, which randomly selects a subset of the majority-class rows to balance the target classes.
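To make the idea concrete, here is a minimal sketch of random undersampling in plain Python, so it runs without any extra packages. The helper name `random_undersample` and the toy data are illustrative assumptions; this is not the imbalanced-learn implementation, only the same idea in a few lines.

```python
import random
from collections import Counter

def random_undersample(features, labels, seed=42):
    """Randomly keep as many rows of each class as the smallest class has."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Every class is cut down to the size of the smallest class
    target = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    kept.sort()  # preserve the original row order
    return [features[i] for i in kept], [labels[i] for i in kept]

# Toy data (hypothetical): 95 negatives and 5 positives
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5
X_res, y_res = random_undersample(X, y)
print(Counter(y_res))  # both classes now have the same number of rows
```

Note that all minority rows are kept, while most majority rows are discarded. That discarded information is the main cost of this method.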

2. **Upsampling (oversampling):**  
  This method is used when there is not enough data. Here the resampling is performed on the minority class: its size is increased until it matches the size of the majority class. A well-known example is SMOTE (Synthetic Minority Over-sampling Technique). It creates synthetic samples of the minority class by interpolating between existing minority samples and their nearest minority-class neighbours.
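The interpolation step can be sketched in plain Python as follows. This is a simplified illustration, not the imbalanced-learn code: it assumes only the single nearest neighbour is used (SMOTE normally picks at random among several nearest neighbours), and the function name `smote_like_samples` and the toy points are made up for the example.

```python
import random

def smote_like_samples(minority, n_new, seed=42):
    """Create n_new synthetic points on the segment between a minority
    point and its nearest minority neighbour (simplified, k=1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest other minority point by squared Euclidean distance
        neighbour = min(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        gap = rng.random()  # random position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class (hypothetical) with two numeric features
minority = [[1.0, 1.0], [1.5, 2.0], [3.0, 3.5], [3.2, 3.0]]
new_points = smote_like_samples(minority, n_new=6)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies between two real minority points, the new rows stay inside the region the minority class already occupies.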
  
3. **Upweighting:**  
  Here the classes are balanced by scaling the weight of the minority class. The weight is the ratio of the number of samples in the majority class to the number of samples in the minority class. A minority class weight of 30 (say) means the model treats each minority sample as 30 times as important as a majority sample of weight 1.
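As a rough worked example, the weight and its effect on a log loss can be computed by hand. The class counts (9000 and 300) and the helper names are assumptions chosen to reproduce the weight of 30 mentioned above; the loss function is a conceptual sketch of class weighting, not LightGBM's internal code.

```python
import math

def minority_weight(n_majority, n_minority):
    """Ratio of majority-class count to minority-class count."""
    return n_majority / n_minority

def weighted_log_loss(y_true, y_prob, pos_weight):
    """Mean binary log loss where each positive sample counts pos_weight times."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)

# Hypothetical counts: 9000 majority rows, 300 minority rows
w = minority_weight(9000, 300)
print(f"minority weight = {w:.1f}")

# The same mistake on a positive sample (predicting 0.1) with and without the weight
loss_plain = weighted_log_loss([1], [0.1], pos_weight=1.0)
loss_weighted = weighted_log_loss([1], [0.1], pos_weight=w)
print(f"plain loss = {loss_plain:.3f}, weighted loss = {loss_weighted:.3f}")
```

With the weight applied, missing a positive sample costs 30 times as much as before, which pushes the model to pay attention to the minority class.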
  
 The first three methods are the most widely used for handling imbalanced data when building ML models. For DL models, a dedicated batch generator can be used to handle highly imbalanced data.
 
 For more details about the sampling methods, see the link below:  
 https://imbalanced-learn.readthedocs.io/en/stable/api.html#module-imblearn.over_sampling
 

In [None]:
#Import required libraries
import numpy as np 
import pandas as pd 
import warnings
warnings.filterwarnings('ignore')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df=pd.read_csv('../input/talkingdata-adtracking-fraud-detection/train_sample.csv')
df.head()

In [None]:
#Convert the click timestamp to datetime and extract hour, day, weekday, month and year
def todatetime(df):
    df['click_time']=pd.to_datetime(df['click_time'])
    df['click_hour']=df['click_time'].dt.hour
    df['click_day']=df['click_time'].dt.day
    df['click_weekday']=df['click_time'].dt.weekday
    df['click_month']=df['click_time'].dt.month
    df['click_year']=df['click_time'].dt.year
    return df

In [None]:
df=todatetime(df)

In [None]:
df=df.drop(['click_time'],axis=1)

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df=df.drop('attributed_time',axis=1)

In [None]:
#Shuffling observations
df=df.sample(frac=1)
df

**Distribution of classes in train dataset before sampling**

In [None]:
import seaborn as sn
import matplotlib.pyplot as plt
plt.figure(figsize=(10,8))
sn.countplot(x='is_attributed',data=df)
plt.ylabel('Number of clicks')
plt.show()

**Classes:**  
0- User will not download an app after clicking a mobile app advertisement    
1- User will download an app after clicking a mobile app advertisement

In [None]:
df['is_attributed'].value_counts()

The target label is highly imbalanced (99.75% negative vs. 0.25% positive). The classes need to be balanced so that the model generalizes well.
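A small sketch shows why such an imbalance is a problem. The counts below are illustrative (10,000 rows in the stated 99.75 : 0.25 ratio), not taken from the dataset: a "model" that always predicts the majority class looks excellent on accuracy but never finds a single positive.

```python
# Illustrative labels in the 99.75 : 0.25 ratio (hypothetical counts)
y_true = [0] * 9975 + [1] * 25

# A trivial baseline that always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.4f}, recall on class 1 = {recall:.2f}")
```

This is why the evaluation later in the notebook relies on confusion matrices and per-class precision and recall rather than on accuracy alone.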

In [None]:
target_label=df['is_attributed']
target_label.shape

In [None]:
ones=df[df['is_attributed']==1]
zeros=df[df['is_attributed']==0]

In [None]:
df=df.drop(['is_attributed','ip'],axis=1)

**Split the dataset into train & test sets**

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(df,target_label,test_size=0.2,random_state=42)
print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)

**Create validation dataset from train dataset**

In [None]:
x_train,x_val,y_train,y_val=train_test_split(x_train,y_train,test_size=0.1,random_state=42)
print(x_train.shape,y_train.shape)
print(x_val.shape,y_val.shape)

**Building LGBM model**

In [None]:
import lightgbm as lgb
#Load the datasets in LightGBM format
train_data=lgb.Dataset(x_train,label=y_train,free_raw_data=False)
validation_data=lgb.Dataset(x_val,label=y_val,free_raw_data=False)

**LGBM basemodel**

In [None]:
#Set parameters for training
params={ 'num_leaves':160,
        'objective':'binary',                     #Binary classification (the LightGBM key is 'objective')
        'metric':['auc','binary_logloss']
       }

In [None]:

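#Original LGB model before sampling
num_round=100
def lgb_basemodel(x_train,y_train):
    #Build the training set from the arguments instead of relying on a global
    train_data=lgb.Dataset(x_train,label=y_train,free_raw_data=False)
    #Early stopping is passed as a callback, which recent LightGBM versions require
    lgb_model=lgb.train(params,train_data,num_round,valid_sets=[validation_data],
                        callbacks=[lgb.early_stopping(stopping_rounds=20)])
    return lgb_model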

**LGBM model with Downsampling**

In [None]:
#LGBM model after resampling the data using an undersampling technique
from imblearn.under_sampling import RandomUnderSampler  #Balances the classes by randomly dropping majority-class rows
def lgb_downsampling(x_train,y_train):
    sampler=RandomUnderSampler(random_state=42)
    x_resample,y_resample=sampler.fit_resample(x_train,y_train)
    train_data=lgb.Dataset(x_resample,label=y_resample,free_raw_data=False)
    #Early stopping is passed as a callback, which recent LightGBM versions require
    lgb_model=lgb.train(params,train_data,num_round,valid_sets=[validation_data],
                        callbacks=[lgb.early_stopping(stopping_rounds=20)])
    return lgb_model,x_resample,y_resample

**LGBM model with Upsampling**

In [None]:
#LGBM model after resampling the data using an upsampling technique
from imblearn.over_sampling import SMOTE  #Balances the classes by performing upsampling on minority class
def lgb_upsampling(x_train,y_train):
    sampler=SMOTE(random_state=42)
    x_resample,y_resample=sampler.fit_resample(x_train,y_train)
    train_data=lgb.Dataset(x_resample,label=y_resample,free_raw_data=False)
    #Early stopping is passed as a callback, which recent LightGBM versions require
    lgb_model=lgb.train(params,train_data,num_round,valid_sets=[validation_data],
                        callbacks=[lgb.early_stopping(stopping_rounds=20)])
    return lgb_model,x_resample,y_resample

**LGBM model with Upweighting**

In [None]:
weight_factor=zeros.shape[0]/ones.shape[0]  # Ratio of number of samples in majority class to number of samples in minority class
print('Weight factor is %0.2f'%(weight_factor))

In [None]:
#Set parameters for training
params1={ 'num_leaves':160,
        'objective':'binary',                     #Binary classification (the LightGBM key is 'objective')
        'metric':['auc','binary_logloss'],
        'scale_pos_weight':weight_factor          #Weight of minority class (about 397.41 here)
       }

In [None]:
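#LGBM model using the upweighting technique for handling the imbalanced data
def lgb_Upweighting(x_train,y_train):
    #Build the training set from the arguments instead of relying on a global
    train_data=lgb.Dataset(x_train,label=y_train,free_raw_data=False)
    #Early stopping is passed as a callback, which recent LightGBM versions require
    lgb_model=lgb.train(params1,train_data,num_round,valid_sets=[validation_data],
                        callbacks=[lgb.early_stopping(stopping_rounds=20)])
    return lgb_model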

**Train the LGBM models**

In [None]:
#Basemodel
lgb_basemodel=lgb_basemodel(x_train,y_train)

In [None]:
#Downsampling model
lgb_downsampling,x_down,y_down=lgb_downsampling(x_train,y_train)

In [None]:
y_down_df=pd.DataFrame(y_down)
y_down.shape

In [None]:
plt.hist(y_down);

In [None]:
#Upsampling model
lgb_upsampling,x_up,y_up=lgb_upsampling(x_train,y_train)

In [None]:
y_up_df=pd.DataFrame(y_up)
y_up.shape

In [None]:
plt.hist(y_up);

In [None]:
#Upweighting model 
lgb_upweighting=lgb_Upweighting(x_train,y_train);

**Testing models on unseen dataset**

In [None]:
#Basemodel
y_base=lgb_basemodel.predict(x_test)

In [None]:
#Upsampling
y_upsampling=lgb_upsampling.predict(x_test)

In [None]:
#Downsampling
y_downsampling=lgb_downsampling.predict(x_test)

In [None]:
#Upweighting
y_upweighting=lgb_upweighting.predict(x_test)

**Plot confusion matrix for all LGBM models**

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
import scikitplot as skplt

In [None]:
#Basemodel
skplt.metrics.plot_confusion_matrix(y_test,y_base>0.5,normalize=False,figsize=(12,8),title='Confusion matrix for base model')  
plt.show()

In [None]:
#Upsampling
skplt.metrics.plot_confusion_matrix(y_test,y_upsampling>0.5,normalize=False,figsize=(12,8),title='Confusion matrix for upsampling model')  #0.5 is threshold value
plt.show()

In [None]:
#downsampling
skplt.metrics.plot_confusion_matrix(y_test,y_downsampling>0.5,normalize=False,figsize=(12,8),title='Confusion matrix for downsampling model')  #0.5 is threshold value
plt.show()

In [None]:
#Upweighting
skplt.metrics.plot_confusion_matrix(y_test,y_upweighting>0.5,normalize=False,figsize=(12,8),title='Confusion matrix for upweighting model')  #0.5 is threshold value
plt.show()
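The confusion matrices above turn predicted probabilities into class labels with `> 0.5`. The sketch below shows that step in plain Python on made-up scores; the function name `confusion_counts` and the numbers are assumptions for illustration only, and the second threshold is included just to show how the counts shift when the cut-off changes.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Binarize probabilities at the threshold and count TN, FP, FN, TP."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p > threshold else 0
        if y == 0 and pred == 0:
            tn += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Made-up labels and predicted probabilities
y_true_demo = [0, 0, 0, 0, 1, 1]
y_prob_demo = [0.1, 0.4, 0.6, 0.2, 0.7, 0.3]

results = {}
for threshold in (0.5, 0.25):
    tn, fp, fn, tp = confusion_counts(y_true_demo, y_prob_demo, threshold)
    results[threshold] = (tn, fp, fn, tp)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    print(f"threshold={threshold}: TN={tn} FP={fp} FN={fn} TP={tp} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold catches more positives (higher recall) at the price of more false alarms, which is worth keeping in mind when comparing models trained on resampled data.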

**Classification report for all LGBM models**

In [None]:
#Base model
cm_base=classification_report(y_test,y_base>0.5)
print(cm_base)

In [None]:
#Upsampling model
cm_up=classification_report(y_test,y_upsampling>0.5)
print(cm_up)

In [None]:
#Downsampling model
cm_down=classification_report(y_test,y_downsampling>0.5)  # 0.5 is threshold value
print(cm_down)

In [None]:
#Upweighting model
cm_upweight=classification_report(y_test,y_upweighting>0.5)  # 0.5 is threshold value
print(cm_upweight)

**Conclusions:**

1. In this run, the LGB base and upweighting models perform the same. Neither model learns the positive label, so both always predict the negative class.
2. The LGB upsampling (SMOTE) model performed best among all models. Its performance can be improved further by hyperparameter tuning.
3. The LGB downsampling model performs reasonably, but not well.

All of the above models are data dependent, so they may perform well on some datasets but not on others. It is better to build all of the models and choose the best one for prediction.
For large datasets, ensemble methods can be employed, but these are computationally expensive. It is also worth trying other sampling methods alongside SMOTE and RandomUnderSampler to better understand how models handle imbalanced datasets.