# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', that is, the factors that actually determine the outcome. This is called overfitting: the model performs well in training and poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), which produces a model that doesn't work well on any data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/ealaxi/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [3]:
# Your code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# nrows reads the first 100,000 rows of the file (not a random sample)
df = pd.read_csv('ps_log.csv', nrows=100000)
df.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0
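
Note that `nrows=100000` takes the first 100,000 rows of the file, which is not the same as a random sample when the file is ordered by time. A minimal sketch of the difference, using a small synthetic frame (the `step` values and sizes are made up for illustration and are not taken from the PaySim file):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a time-ordered log: 1,000 rows, 'step' grows with row order.
toy = pd.DataFrame({"step": np.repeat(np.arange(1, 101), 10)})

# Taking the first n rows (what nrows=n does) only covers the earliest steps.
first_rows = toy.head(100)

# A random sample of the same size draws rows from the whole time range.
random_rows = toy.sample(n=100, random_state=0)

print("first rows, max step:   ", first_rows["step"].max())
print("random rows, max step:  ", random_rows["step"].max())
print("random rows, unique steps:", random_rows["step"].nunique())
```

On the real file, one option would be `pd.read_csv(path).sample(n=100000, random_state=0)`, where `path` is a placeholder for the CSV location; this loads the full file once before sampling.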


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            100000 non-null  int64  
 1   type            100000 non-null  object 
 2   amount          100000 non-null  float64
 3   nameOrig        100000 non-null  object 
 4   oldbalanceOrg   100000 non-null  float64
 5   newbalanceOrig  100000 non-null  float64
 6   nameDest        100000 non-null  object 
 7   oldbalanceDest  100000 non-null  float64
 8   newbalanceDest  100000 non-null  float64
 9   isFraud         100000 non-null  int64  
 10  isFlaggedFraud  100000 non-null  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 8.4+ MB


### What is the distribution of the outcome? 

In [5]:
# Your response here
df.isFraud.value_counts()

isFraud
0    99884
1      116
Name: count, dtype: int64
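
To put these counts in perspective, the short calculation below turns them into a fraud rate and into the accuracy that a trivial "always predict 0" classifier would reach. The counts are copied from the output above; nothing else is assumed.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 99884, 1: 116}
total = sum(counts.values())

fraud_rate = counts[1] / total          # share of fraudulent transactions
majority_accuracy = counts[0] / total   # accuracy of always predicting "not fraud"

print(f"fraud rate: {fraud_rate:.3%}")                 # 0.116%
print(f"always-0 accuracy: {majority_accuracy:.3%}")   # 99.884%
```

Any model evaluated on this data therefore has to be compared against roughly 99.9% accuracy, not against 50%.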

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [6]:
df.type.value_counts()

type
PAYMENT     39512
CASH_OUT    30718
CASH_IN     20185
TRANSFER     8597
DEBIT         988
Name: count, dtype: int64

In [7]:
df = pd.concat([df, pd.get_dummies(df.type)], axis=1)
df

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,CASH_IN,CASH_OUT,DEBIT,PAYMENT,TRANSFER
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.00,0.00,0,0,False,False,False,True,False
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.00,0.00,0,0,False,False,False,True,False
2,1,TRANSFER,181.00,C1305486145,181.0,0.00,C553264065,0.00,0.00,1,0,False,False,False,False,True
3,1,CASH_OUT,181.00,C840083671,181.0,0.00,C38997010,21182.00,0.00,1,0,False,True,False,False,False
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.00,0.00,0,0,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,10,PAYMENT,4020.66,C1410794718,159929.0,155908.34,M1257036576,0.00,0.00,0,0,False,False,False,True,False
99996,10,PAYMENT,18345.49,C744303677,6206.0,0.00,M1785344556,0.00,0.00,0,0,False,False,False,True,False
99997,10,CASH_IN,183774.91,C104331851,39173.0,222947.91,C36392889,54925.05,0.00,0,0,True,False,False,False,False
99998,10,CASH_OUT,82237.17,C707662966,6031.0,0.00,C1553004158,592635.66,799140.46,0,0,False,True,False,False,False


In [8]:
# Your code here
df.step.value_counts()

step
9     37628
10    27274
8     21097
7      6837
1      2708
6      1660
2      1014
5       665
4       565
3       552
Name: count, dtype: int64
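
The rows loaded here only cover steps 1 to 10. According to the dataset description on Kaggle, one step corresponds to one hour of simulated time. Under that assumption, one possible way to integrate the time variable is to convert `step` to an hour of day and encode it cyclically. This is a sketch on toy values, not what the notebook does:

```python
import numpy as np
import pandas as pd

# Toy 'step' values; assumes 1 step = 1 hour, per the dataset description.
steps = pd.DataFrame({"step": [1, 10, 24, 25, 49]})

# Position in the 24-hour cycle (which clock hour step 1 maps to is unknown).
steps["hour"] = steps["step"] % 24

# Cyclical encoding so that hour 23 and hour 0 end up close together.
angle = 2 * np.pi * steps["hour"] / 24
steps["hour_sin"] = np.sin(angle)
steps["hour_cos"] = np.cos(angle)

print(steps)
```

With only ten hours of data in the loaded rows, such a feature would carry little information; it becomes more meaningful on a sample that spans the whole file.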

In [9]:
# I don't think we will need the step column for the analysis, so I will drop it together with the other columns we won't use

cols_to_drop = ['step', 'type', 'nameOrig', 'nameDest', 'isFlaggedFraud']
df.drop(cols_to_drop, axis=1, inplace=True)
df

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,CASH_IN,CASH_OUT,DEBIT,PAYMENT,TRANSFER
0,9839.64,170136.0,160296.36,0.00,0.00,0,False,False,False,True,False
1,1864.28,21249.0,19384.72,0.00,0.00,0,False,False,False,True,False
2,181.00,181.0,0.00,0.00,0.00,1,False,False,False,False,True
3,181.00,181.0,0.00,21182.00,0.00,1,False,True,False,False,False
4,11668.14,41554.0,29885.86,0.00,0.00,0,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...
99995,4020.66,159929.0,155908.34,0.00,0.00,0,False,False,False,True,False
99996,18345.49,6206.0,0.00,0.00,0.00,0,False,False,False,True,False
99997,183774.91,39173.0,222947.91,54925.05,0.00,0,True,False,False,False,False
99998,82237.17,6031.0,0.00,592635.66,799140.46,0,False,True,False,False,False


### Run a logistic regression classifier and evaluate its accuracy.

In [14]:
# Your code here
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix

features = df.drop('isFraud', axis=1)
target = df['isFraud']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)

model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

pred = model.predict(X_test)
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))

0.9994
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     19974
           1       0.71      0.92      0.80        26

    accuracy                           1.00     20000
   macro avg       0.85      0.96      0.90     20000
weighted avg       1.00      1.00      1.00     20000

[[19964    10]
 [    2    24]]
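
An accuracy of 0.9994 looks impressive, but it needs to be read against the class balance of the test set. The arithmetic below uses only the confusion matrix printed above (rows are true classes, columns are predicted classes) to compare the model with a classifier that always predicts 0.

```python
# Confusion matrix copied from the output above.
tn, fp = 19964, 10
fn, tp = 2, 24

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
baseline = (tn + fp) / total   # accuracy of always predicting class 0

model_errors = fp + fn         # mistakes made by the logistic regression
baseline_errors = fn + tp      # always-0 misses every fraud case

print(f"model accuracy:    {accuracy:.4f} ({model_errors} errors)")
print(f"always-0 accuracy: {baseline:.4f} ({baseline_errors} errors)")
```

The two accuracies differ only in the fourth decimal place, which is why precision and recall on the fraud class are more informative here.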


In [17]:
# Oversample

train = pd.concat([X_train, y_train], axis=1)

no_fraud = train[train['isFraud'] == 0]
yes_fraud = train[train['isFraud'] == 1]

print(no_fraud.shape)
print(yes_fraud.shape)

(79910, 11)
(90, 11)
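
This particular split left 90 fraud rows in training and 26 in the test set. Because `train_test_split` was called without `random_state` or `stratify`, those counts can change from run to run. A sketch of a reproducible, stratified split on synthetic labels (the data is generated for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows with 1% positives (not the PaySim data).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = np.zeros(10_000, dtype=int)
y[:100] = 1

# stratify=y keeps the class ratio the same in both splits;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(y_tr.sum(), y_te.sum())  # 80 20
```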


In [18]:
from sklearn.utils import resample

yes_fraud_oversampled = resample(yes_fraud,
                                   replace=True,
                                   n_samples=len(no_fraud),
                                   random_state=0)
train_oversampled = pd.concat([no_fraud, yes_fraud_oversampled])

In [19]:
X_train_over = train_oversampled.drop('isFraud', axis=1)
y_train_over = train_oversampled['isFraud']

model = LogisticRegression(max_iter=1000)
model.fit(X_train_over, y_train_over)
print(model.score(X_test, y_test))

pred = model.predict(X_test)
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))

0.8961
              precision    recall  f1-score   support

           0       1.00      0.90      0.95     19974
           1       0.01      0.96      0.02        26

    accuracy                           0.90     20000
   macro avg       0.51      0.93      0.48     20000
weighted avg       1.00      0.90      0.94     20000

[[17897  2077]
 [    1    25]]
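
Random oversampling caught one more fraud case (25 of 26 instead of 24) but raised the false positives from 10 to 2,077. An alternative that avoids duplicating rows is to reweight the classes in the loss, which scikit-learn exposes as `class_weight='balanced'`. A minimal sketch on synthetic data (generated with `make_classification`; its results say nothing about PaySim):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 2% positives.
X, y = make_classification(
    n_samples=5000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.98, 0.02], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Same model twice; the second weights each class inversely to its frequency.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

for name, clf in [("plain", plain), ("balanced", weighted)]:
    pred = clf.predict(X_te)
    print(
        name,
        "precision:", precision_score(y_te, pred, zero_division=0),
        "recall:", recall_score(y_te, pred),
    )
```

Like oversampling, class weighting usually trades precision for recall, so both metrics should be checked.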


In [15]:
# SMOTE

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=0,
             sampling_strategy=1.0)

X_train_SMOTE, y_train_SMOTE = smote.fit_resample(X_train, y_train)

In [16]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_SMOTE, y_train_SMOTE)
print(model.score(X_test, y_test))

pred = model.predict(X_test)
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))

0.9128
              precision    recall  f1-score   support

           0       1.00      0.91      0.95     19974
           1       0.01      0.96      0.03        26

    accuracy                           0.91     20000
   macro avg       0.51      0.94      0.49     20000
weighted avg       1.00      0.91      0.95     20000

[[18231  1743]
 [    1    25]]
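
Unlike random oversampling, which duplicates existing rows, SMOTE creates new minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. The snippet below is a simplified numpy illustration of that interpolation step only. It is not imblearn's implementation, and the two points are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up minority-class points in a 2-feature space.
x = np.array([1.0, 2.0])
neighbour = np.array([3.0, 6.0])

# SMOTE-style synthetic point: a random position on the segment between them.
lam = rng.uniform(0.0, 1.0)
synthetic = x + lam * (neighbour - x)

print(synthetic)
```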


### Now pick a model of your choice and evaluate its accuracy.

In [20]:
# Your code here
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_SMOTE, y_train_SMOTE)
print(knn.score(X_test, y_test))

pred_knn = knn.predict(X_test)
print(classification_report(y_test, pred_knn))
print(confusion_matrix(y_test, pred_knn))

0.976
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     19974
           1       0.04      0.77      0.08        26

    accuracy                           0.98     20000
   macro avg       0.52      0.87      0.53     20000
weighted avg       1.00      0.98      0.99     20000

[[19500   474]
 [    6    20]]
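
KNN classifies by distance, so features on large scales (such as `amount` and the balance columns, which reach the hundreds of thousands above) can dominate the 0/1 dummy columns. One common remedy is to standardise the features before fitting. A sketch using a scikit-learn pipeline on synthetic data (the feature scales and the labelling rule are invented for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: one large-scale noise feature and one informative 0/1 feature.
n = 400
flag = rng.integers(0, 2, size=n)        # informative, values 0 or 1
big = rng.normal(0, 100_000, size=n)     # uninformative, scale ~1e5
X = np.column_stack([big, flag])
y = flag                                 # label is determined by the flag

X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]

# KNN on raw features vs. KNN after standardising each feature.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_tr, y_tr)

raw_score = raw_knn.score(X_te, y_te)
scaled_score = scaled_knn.score(X_te, y_te)
print("raw   :", raw_score)
print("scaled:", scaled_score)
```

Whether scaling helps on the PaySim features would need to be checked on the real data.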


### Which model worked better and how do you know?

In [2]:
# Your response here

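# Among the models trained on resampled data, KNN has the highest accuracy (0.976 vs. 0.896 and 0.913)
# and the highest fraud precision (0.04 vs. 0.01), but lower fraud recall (0.77 vs. 0.96).
# Accuracy alone is misleading with only 26 fraud cases in 20,000 test rows: the first logistic
# regression, trained on the original data, has the best fraud-class F1 (0.80).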
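
To compare the models on the fraud class rather than on accuracy, the confusion matrices printed in the cells above can be turned into precision, recall and F1. The numbers below are copied from those outputs; the model labels and the helper name `fraud_metrics` are just for this sketch.

```python
# Confusion matrices copied from the outputs above, as (tn, fp, fn, tp).
results = {
    "logreg (original data)": (19964, 10, 2, 24),
    "logreg (oversampled)":   (17897, 2077, 1, 25),
    "logreg (SMOTE)":         (18231, 1743, 1, 25),
    "KNN (SMOTE)":            (19500, 474, 6, 20),
}

def fraud_metrics(tn, fp, fn, tp):
    """Return precision, recall and F1 for the fraud class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for name, cm in results.items():
    p, r, f1 = fraud_metrics(*cm)
    print(f"{name:<24} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Model with the highest fraud-class F1.
best = max(results, key=lambda k: fraud_metrics(*results[k])[2])
print("best F1:", best)
```

Which metric matters most depends on the relative cost of a missed fraud case and a false alarm, so F1 is only one reasonable way to rank the models.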
