# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when data is noisy and we are not careful, we can end up fitting the model to the noise rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: the model scores well on the training data and poorly on new, real data. Conversely, a model can be too simplistic to capture the signal (underfitting), in which case it performs poorly on both the training data and new data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/ealaxi/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset; use a sample of n=100000 rows instead, so your computer doesn't freeze.

In [1]:
# Your code here
import pandas as pd
import numpy as np
data = pd.read_csv('../PS_20174392719_1491204439457_log.csv')
# Work on a 10,000-row random sample (smaller than the suggested 100,000) to keep things fast
data = data.sample(10000)
# data.shape
data.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2658288 | 210 | PAYMENT | 1850.15 | C793389743 | 11757.0 | 9906.85 | M773354633 | 0.0 | 0.0 | 0 | 0 |
| 5676176 | 397 | CASH_OUT | 421997.07 | C2146836200 | 433133.0 | 11135.93 | C2028937340 | 496685.78 | 918682.85 | 0 | 0 |
| 3110670 | 235 | CASH_OUT | 36253.68 | C701424110 | 47540.0 | 11286.32 | C1212919276 | 85106.39 | 121360.07 | 0 | 0 |
| 2445222 | 203 | CASH_OUT | 246615.7 | C1266230404 | 0.0 | 0.0 | C1383253034 | 1350623.17 | 1801460.21 | 0 | 0 |
| 2989213 | 231 | PAYMENT | 17998.91 | C1220347052 | 0.0 | 0.0 | M1683290581 | 0.0 | 0.0 | 0 | 0 |


### What is the distribution of the outcome? 

In [2]:
# Your response here
display(data['isFraud'].value_counts())
display(data['isFlaggedFraud'].value_counts())  # No row in this sample is flagged: isFlaggedFraud == 1 is very rare in the full dataset

0    9976
1      24
Name: isFraud, dtype: int64

0    10000
Name: isFlaggedFraud, dtype: int64
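
To put a number on the imbalance, the counts printed above can be turned into a fraud rate and into the accuracy of a trivial classifier that always answers "not fraud". This is a small arithmetic sketch; the counts are copied by hand from the `isFraud` output and the variable names are only illustrative.

```python
# Class counts copied from the isFraud value_counts() output above.
counts = {0: 9976, 1: 24}

total = sum(counts.values())
fraud_rate = counts[1] / total          # share of fraudulent rows in the sample
baseline_accuracy = counts[0] / total   # accuracy of always predicting 0 (not fraud)
imbalance_ratio = counts[0] / counts[1] # non-fraud rows per fraud row

print(f"fraud rate: {fraud_rate:.2%}")
print(f"always-0 baseline accuracy: {baseline_accuracy:.2%}")
print(f"non-fraud rows per fraud row: {imbalance_ratio:.0f}")
```

Any model evaluated on this sample has to beat that baseline accuracy before its accuracy says anything useful, which is why the later cells also look at precision and recall.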

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [3]:
# Your code here
display(data.dtypes)
display(data.isnull().sum())
display(data['type'].value_counts())

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

CASH_OUT    3488
PAYMENT     3428
CASH_IN     2078
TRANSFER     943
DEBIT         63
Name: type, dtype: int64
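
One possible way to handle the time variable and the categorical `type` column is sketched below. It assumes, following the dataset description, that one `step` is one hour of simulated time; how step 0 lines up with a real clock is not known, so `hour_of_day` is only a cyclic position, not a true time of day. The `sample` frame holds a few rows copied from the `data.head()` output, and the new column names are illustrative.

```python
import pandas as pd

# A few rows copied from the data.head() output above.
sample = pd.DataFrame({
    "step": [210, 397, 235, 203, 231],
    "type": ["PAYMENT", "CASH_OUT", "CASH_OUT", "CASH_OUT", "PAYMENT"],
    "amount": [1850.15, 421997.07, 36253.68, 246615.70, 17998.91],
})

# Assumption: 1 step = 1 hour, so the remainder mod 24 is a position in the daily cycle
# and the integer quotient is the simulated day.
sample["hour_of_day"] = sample["step"] % 24
sample["day"] = sample["step"] // 24

# One-hot encode the transaction type so a linear model can use it.
features = pd.get_dummies(sample.drop(columns="step"), columns=["type"])
print(features)
```

The raw `step` integer treats hour 23 of one day and hour 0 of the next as neighbours but hides any daily pattern; splitting it this way is one option among several, not necessarily the best one for this data.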

### Run a logistic regression classifier and evaluate its accuracy.

In [4]:
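# Your code here
# Features: keep only the numeric amount and balance columns
x = data.drop(['step', 'type', 'nameOrig', 'nameDest', 'isFraud', 'isFlaggedFraud'], axis=1)
#display(x)
# Target: 1 if the transaction is fraudulent, 0 otherwise
y = data['isFraud']
#display(y)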

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=323)

from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression(max_iter=1000)
model1.fit(X_train, y_train)
model1.score(X_test, y_test)

0.974

In [5]:
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

pred = model1.predict(X_test)

print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))
confusion_matrix(y_test,pred)

precision:  0.07246376811594203
recall:  0.8333333333333334
f1:  0.13333333333333333


array([[2430,   64],
       [   1,    5]])
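
The scores printed above can be recomputed by hand from the confusion matrix, which makes it clearer what each one measures. This is a short worked sketch; the four counts are copied from the array above, using scikit-learn's layout (rows are the true class, columns the predicted class).

```python
# Confusion matrix of the logistic regression, copied from the output above.
# Layout: [[TN, FP],
#          [FN, TP]]
tn, fp = 2430, 64
fn, tp = 1, 5

precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many really were
recall = tp / (tp + fn)      # of the real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision:", precision)
print("recall:", recall)
print("f1:", f1)
print("accuracy:", accuracy)
```

The high accuracy comes almost entirely from the 2430 true negatives, while the 64 false positives are what drag precision down.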

### Now pick a model of your choice and evaluate its accuracy.

In [6]:
# Your code here
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# k-nearest neighbours with k=3, trained on the same split as the logistic regression
model2 = KNeighborsClassifier(n_neighbors=3)
model2.fit(X_train, y_train)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, model2.predict(X_test))

0.998

In [7]:
pred = model2.predict(X_test)

print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))
confusion_matrix(y_test,pred)

precision:  0.6666666666666666
recall:  0.3333333333333333
f1:  0.4444444444444444


array([[2493,    1],
       [   4,    2]])

### Which model worked better and how do you know?

In [None]:
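# The first one (logistic regression), if the goal is to catch fraud. It misses fewer fraudulent
# transactions: 1 false negative out of 6 frauds (recall 0.83) versus 4 out of 6 for KNN (recall 0.33),
# so KNN fails to detect about 67% of the fraudulent cases in the test set.
# The trade-off is precision: logistic regression raises 64 false alarms (precision 0.07) against 1 for KNN (precision 0.67).
# Accuracy alone is misleading here: KNN scores 0.998, yet always predicting "not fraud" would score 2494/2500 = 0.9976.
# With only 6 frauds in the test set, all of these estimates are noisy.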
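
The comparison can be laid out side by side using the two confusion matrices from the earlier cells. The sketch below also adds a hypothetical baseline that labels every transaction as non-fraud; that baseline is not a model trained in this notebook and is included only to show how little accuracy says on data this imbalanced.

```python
# (tn, fp, fn, tp) read off the two confusion matrices printed earlier.
predictors = {
    "logistic regression": (2430, 64, 1, 5),
    "KNN (k=3)": (2493, 1, 4, 2),
    # Hypothetical baseline, not a model from the notebook: label everything 0.
    "always predict 0": (2494, 0, 6, 0),
}

summary = {}
for name, (tn, fp, fn, tp) in predictors.items():
    total = tn + fp + fn + tp
    summary[name] = {
        "accuracy": (tn + tp) / total,
        "recall": tp / (tp + fn),
        "missed_frauds": fn,
        "false_alarms": fp,
    }

for name, row in summary.items():
    print(f"{name:20s} accuracy={row['accuracy']:.4f} recall={row['recall']:.2f} "
          f"missed={row['missed_frauds']} false alarms={row['false_alarms']}")
```

Which model is "better" depends on the relative cost of a missed fraud and a false alarm; the figures here only show that the three predictors are nearly indistinguishable by accuracy and very different by recall.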
