# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting, and it gives good results in training but bad results when the model is applied to real data. Similarly, a model can be too simplistic to capture the signal accurately. Such a model underfits and never works well.


### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# The link above is broken, so the csv file was downloaded from https://www.kaggle.com/ealaxi/paysim1 instead
# Memory problems: the csv file was opened manually, and Windows automatically truncated it to slightly more than 1M rows
# (1,048,575 data rows plus the header row equals the 1,048,576-row limit of an Excel worksheet)
paysim = pd.read_csv('paysim1.csv')
len(paysim)

1048575
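If the memory problem comes from loading the whole file at once, one option is to read it in chunks and keep only the columns and dtypes that are needed. The following is a minimal sketch, not the method used in this notebook: the in-memory `sample_csv` is a hypothetical stand-in for `paysim1.csv`, and the chunk size and dtypes are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical stand-in for paysim1.csv so the sketch runs without the real file.
sample_csv = io.StringIO(
    "step,type,amount,nameOrig,isFraud\n"
    "1,PAYMENT,9839.64,C1231006815,0\n"
    "1,TRANSFER,181.0,C1305486145,1\n"
    "1,CASH_OUT,181.0,C840083671,1\n"
    "1,PAYMENT,11668.14,C2048537720,0\n"
)

# Read only the columns we need, with compact dtypes, a few rows at a time.
chunks = pd.read_csv(
    sample_csv,
    usecols=["step", "type", "amount", "isFraud"],
    dtype={"step": "int32", "amount": "float32", "isFraud": "int8"},
    chunksize=2,  # illustrative; something like 100_000 suits a large file
)

# Each chunk is a small DataFrame; concatenate them into one frame.
sample = pd.concat(chunks, ignore_index=True)
print(len(sample), int(sample["isFraud"].sum()))
```

For the real file, `sample_csv` would be replaced by the path to the csv, and each chunk could also be filtered or aggregated before concatenation to keep memory use low.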

In [3]:
# Check the first rows with df.head()
paysim.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [4]:
# General overview of the dataset
# There are 3 categorical (object) columns: type, nameOrig and nameDest
# There are no missing values
paysim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 11 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   step            1048575 non-null  int64  
 1   type            1048575 non-null  object 
 2   amount          1048575 non-null  float64
 3   nameOrig        1048575 non-null  object 
 4   oldbalanceOrg   1048575 non-null  float64
 5   newbalanceOrig  1048575 non-null  float64
 6   nameDest        1048575 non-null  object 
 7   oldbalanceDest  1048575 non-null  float64
 8   newbalanceDest  1048575 non-null  float64
 9   isFraud         1048575 non-null  int64  
 10  isFlaggedFraud  1048575 non-null  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 76.0+ MB


In [5]:
# Check first categorical data column unique values: type
# Only 5 unique values, we can apply One Hot Encoder
len(paysim['type'].unique())

5

In [6]:
# Check second categorical data column unique values: nameOrig
# Almost every row has its own unique value, far too many to encode - drop column when cleaning the dataset
len(paysim['nameOrig'].unique())

1048317

In [7]:
# Check third categorical data column unique values: nameDest
# There are about 0.45M unique values, almost half the size of the dataframe, far too many to encode - drop column when cleaning the dataset
len(paysim['nameDest'].unique())

449635

In [8]:
# Let's check the basic info of the numeric columns with df.describe()
# Column isFlaggedFraud is 0 for every observation - drop column when cleaning the dataset
paysim.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0
mean,26.96617,158667.0,874005.5,893804.9,978160.0,1114193.0,0.001089097,0.0
std,15.62325,264940.9,2971725.0,3008246.0,2296779.0,2416554.0,0.03298351,0.0
min,1.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0
25%,15.0,12149.07,0.0,0.0,0.0,0.0,0.0,0.0
50%,20.0,76343.33,16002.0,0.0,126377.2,218260.4,0.0,0.0
75%,39.0,213761.9,136642.0,174600.0,915923.5,1149808.0,0.0,0.0
max,95.0,10000000.0,38939420.0,38946230.0,42054660.0,42169160.0,1.0,0.0
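To get a first idea of which features matter, it can help to break the outcome down by a candidate feature such as the transaction type. The sketch below does this on a small toy frame with the same column names. The values are illustrative assumptions, not results from the PaySim data.

```python
import pandas as pd

# Toy rows shaped like the PaySim columns used here (values are illustrative).
toy = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER"],
    "amount": [9839.64, 1864.28, 181.0, 181.0, 11668.14, 5000.0],
    "isFraud": [0, 0, 1, 1, 0, 0],
})

# Count transactions and frauds per type, and compute the fraud rate per type.
fraud_by_type = (
    toy.groupby("type")["isFraud"]
       .agg(transactions="count", frauds="sum", fraud_rate="mean")
       .sort_values("fraud_rate", ascending=False)
)
print(fraud_by_type)
```

Running the same `groupby` on `paysim` would show whether fraud is concentrated in a few transaction types, which would make `type` a strong candidate feature.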


### What is the distribution of the outcome? 

In [9]:
# Outcome variable is isFraud
paysim['isFraud'].value_counts()

0    1047433
1       1142
Name: isFraud, dtype: int64

In [10]:
# Percentage of Fraud is extremely low: about 0.1%
# The distribution is highly imbalanced, so we can expect a model with high recall on class 0 and low recall on class 1
paysim['isFraud'].value_counts()[1]/paysim['isFraud'].value_counts().sum()

0.0010890971079798775
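With a fraud rate this low, accuracy alone says very little. The short calculation below uses the class counts printed above to show what a trivial classifier that always predicts 0 would score.

```python
# Class counts taken from the value_counts() output above.
n_legit = 1_047_433
n_fraud = 1_142
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total
baseline_accuracy = n_legit / n_total   # accuracy of always predicting 0
baseline_recall_fraud = 0 / n_fraud     # that classifier never flags a fraud

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"fraud recall:      {baseline_recall_fraud:.0%}")
```

The always-0 baseline is right on more than 99.8% of the rows while catching no fraud at all, which is why the fraud-class precision and recall are the numbers to watch.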

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?
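One possible way to handle the time variable, assuming that each `step` is one hour of simulated time, is to derive the hour of the day and encode it cyclically so that late-night and early-morning hours end up close together. This is a sketch under that assumption. The derived column names are hypothetical and are not used in the cells below.

```python
import numpy as np
import pandas as pd

steps = pd.DataFrame({"step": [1, 12, 24, 25, 95]})

# Assumption: step 1 is the first hour of day 1, so (step - 1) % 24 is the hour of day.
steps["hour_of_day"] = (steps["step"] - 1) % 24
steps["day"] = (steps["step"] - 1) // 24 + 1

# Cyclical encoding: hour 23 and hour 0 get nearby (sin, cos) coordinates.
angle = 2 * np.pi * steps["hour_of_day"] / 24
steps["hour_sin"] = np.sin(angle)
steps["hour_cos"] = np.cos(angle)
print(steps)
```

Whether this helps depends on the model: a tree-based model can split on the raw integer `step`, while a linear model such as logistic regression treats it as a single monotonic trend.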

In [11]:
# Drop unnecessary columns
paysim.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis=1, inplace=True)

In [12]:
# One-hot encode the remaining categorical column (type) with pd.get_dummies
# drop_first=True drops the first category (CASH_IN), which becomes the baseline
paysim = pd.get_dummies(paysim, drop_first=True)

In [13]:
paysim

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,1,9839.64,170136.00,160296.36,0.00,0.00,0,0,0,1,0
1,1,1864.28,21249.00,19384.72,0.00,0.00,0,0,0,1,0
2,1,181.00,181.00,0.00,0.00,0.00,1,0,0,0,1
3,1,181.00,181.00,0.00,21182.00,0.00,1,1,0,0,0
4,1,11668.14,41554.00,29885.86,0.00,0.00,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...
1048570,95,132557.35,479803.00,347245.65,484329.37,616886.72,0,1,0,0,0
1048571,95,9917.36,90545.00,80627.64,0.00,0.00,0,0,0,1,0
1048572,95,14140.05,20545.00,6404.95,0.00,0.00,0,0,0,1,0
1048573,95,10020.05,90605.00,80584.95,0.00,0.00,0,0,0,1,0
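The encoding step can be checked on a small frame. The sketch below uses toy values and names the column to encode explicitly. Passing `dtype=int` is an assumption made here so that the dummy columns are 0/1 integers whatever the default of the installed pandas version is.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["CASH_IN", "CASH_OUT", "DEBIT", "PAYMENT", "TRANSFER"],
    "amount": [100.0, 181.0, 50.0, 9839.64, 181.0],
})

# Naming the column explicitly avoids encoding other object columns by accident.
encoded = pd.get_dummies(toy, columns=["type"], drop_first=True, dtype=int)
dummy_cols = [c for c in encoded.columns if c.startswith("type_")]

print(encoded.columns.tolist())
print(encoded.loc[0, dummy_cols].tolist())  # the CASH_IN row is all zeros
```

The dropped category is represented by a row in which every dummy column is 0, so no information is lost.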


In [14]:
# Define the independent variables (X) and the dependent variable (y)
X = paysim.drop('isFraud', axis = 1)
y = paysim['isFraud']

In [15]:
# Train/test split in order to train and test models
from sklearn.model_selection import train_test_split

# The dataframe is big enough that a test size of 0.4 still leaves plenty of training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
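With so few positive rows, a purely random split can leave the train and test sets with slightly different fraud rates. A possible refinement, shown here on toy data rather than on `paysim`, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The arrays `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, 10 of them positive (1%), purely illustrative.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:10] = 1

# stratify=y_toy keeps the 1% positive rate in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```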

In [16]:
# As we said earlier, the target variable is highly imbalanced, so resampling may be needed
# To be more specific, to avoid working with an even bigger dataset, we will undersample the majority class
from imblearn.under_sampling import RandomUnderSampler

# Create the resampler
resampler = RandomUnderSampler()

# Fit and resample the full dataset
# Note: this happens before the train/test split, so the resampled test set will be balanced too
X_res, y_res = resampler.fit_resample(X,y)

In [17]:
# Check new shape
# 2284 rows: the 1142 fraud rows plus 1142 randomly kept non-fraud rows
X_res.shape

(2284, 10)
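Because the resampling above is applied to the full dataset, any test set drawn from `X_res` is balanced 50/50 and no longer reflects the real fraud rate of about 0.1%. An alternative is to undersample only the training rows and keep the test set untouched. The sketch below shows the idea with a hand-written helper, `undersample_indices`, which is hypothetical and only mimics random undersampling conceptually. It is not the imblearn implementation.

```python
import numpy as np

def undersample_indices(y, random_state=0):
    """Return shuffled row indices with every class cut down to the minority count."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Sample n_keep rows without replacement from each class.
    kept = [
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ]
    return rng.permutation(np.concatenate(kept))

# Toy training labels: 95 legitimate, 5 fraud.
y_train_toy = np.array([0] * 95 + [1] * 5)
idx = undersample_indices(y_train_toy, random_state=42)
print(np.bincount(y_train_toy[idx]))
```

In this notebook the equivalent would be to call `resampler.fit_resample(X_train, y_train)` and then evaluate on the original `X_test` and `y_test`, so that precision and recall are measured at the true class ratio.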

In [18]:
# Train/test split in order to train and test models. The resampled dataframe is considerably smaller, so I will set the test size to 0.2
X_res_train, X_res_test, y_res_train, y_res_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)

### Run a logistic regression classifier and evaluate its accuracy.

In [19]:
# Import necessary methods to generate and evaluate a Logistic Regression from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [20]:
# Logistic Regression model for original dataset
# Low scores for Fraud detection (recall 0.36), probably due to the class imbalance
# The 1.00 accuracy mostly reflects the majority class
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    418985
           1       0.56      0.36      0.44       445

    accuracy                           1.00    419430
   macro avg       0.78      0.68      0.72    419430
weighted avg       1.00      1.00      1.00    419430
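The gap between the 1.00 accuracy and the weak fraud row in the report above follows directly from how the metrics are defined. The sketch below computes them from raw counts. The confusion counts are hypothetical, chosen only to illustrate the effect, and the last lines check that the reported F1 is consistent with the reported precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for 1000 transactions, 10 of them fraudulent.
tp, fp, fn, tn = 4, 3, 6, 987
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Consistency check on the reported fraud row: precision 0.56 and recall 0.36.
reported_f1 = 2 * 0.56 * 0.36 / (0.56 + 0.36)
print(f"F1 from reported precision and recall: {reported_f1:.2f}")
```

Accuracy counts the many true negatives, so it stays near 1 even when more than half of the frauds are missed.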



In [21]:
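# Logistic Regression model for resampled dataset
# Predict with lr_res, the model fitted on the resampled data
# The report below came from a run that called lr.predict, so it scores the model
# trained on the original data against the balanced test set
lr_res = LogisticRegression()
lr_res.fit(X_res_train, y_res_train)
y_res_pred = lr_res.predict(X_res_test)
print(classification_report(y_res_test, y_res_pred))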

              precision    recall  f1-score   support

           0       0.64      1.00      0.78       230
           1       1.00      0.43      0.60       227

    accuracy                           0.72       457
   macro avg       0.82      0.72      0.69       457
weighted avg       0.82      0.72      0.69       457



### Now pick a model of your choice and evaluate its accuracy.

In [22]:
# Import necessary methods to generate and evaluate a RandomForest from sklearn
from sklearn.ensemble import RandomForestClassifier

In [23]:
# Random Forest model for original dataset
# Much better fraud-class scores than Logistic Regression (recall 0.80 against 0.36)
# Recall on class 1 is still lower than on class 0, probably due to the class imbalance
# It took much more time to train as well
forest = RandomForestClassifier()
y_pred = forest.fit(X_train,y_train).predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    418985
           1       0.96      0.80      0.87       445

    accuracy                           1.00    419430
   macro avg       0.98      0.90      0.94    419430
weighted avg       1.00      1.00      1.00    419430



In [24]:
# Random Forest model for resampled dataset
# Fraud recall rises from 0.80 to 0.98, measured here on the balanced resampled test set
forest_res = RandomForestClassifier()
y_res_pred = forest_res.fit(X_res_train, y_res_train).predict(X_res_test)
print(classification_report(y_res_test, y_res_pred))

              precision    recall  f1-score   support

           0       0.98      0.97      0.98       230
           1       0.97      0.98      0.98       227

    accuracy                           0.98       457
   macro avg       0.98      0.98      0.98       457
weighted avg       0.98      0.98      0.98       457



### Which model worked better and how do you know?

In [25]:
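# Random Forest worked better. On the original test set its fraud-class scores are precision 0.96, recall 0.80 and F1 0.87,
# against 0.56, 0.36 and 0.44 for Logistic Regression.
# Accuracy is 1.00 for both models there, so it cannot tell them apart; the fraud-class recall and F1 show the difference.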