# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when the data is noisy and we are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', that is, the factors that actually determine the outcome. This is called overfitting: the model scores well in training and poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), in which case it performs poorly on both the training data and real data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/ealaxi/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset; use a sample of n=100000 rows instead, so your computer doesn't freeze.

In [2]:
import pandas as pd

In [4]:
entire_data=pd.read_csv("C:/Users/milena.xavier/Downloads/archive (2)/PS_20174392719_1491204439457_log.csv")

In [7]:
sample=entire_data.sample(n=100000, random_state=1)

In [8]:
sample.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
6322570,688,CASH_IN,23557.12,C867750533,8059.0,31616.12,C1026934669,169508.66,145951.53,0,0
3621196,274,PAYMENT,6236.13,C601099070,0.0,0.0,M701283411,0.0,0.0,0,0
1226256,133,PAYMENT,33981.87,C279540931,18745.72,0.0,M577905776,0.0,0.0,0,0
2803274,225,CASH_OUT,263006.42,C11675531,20072.0,0.0,C529577791,390253.56,653259.98,0,0
3201247,249,CASH_OUT,152013.74,C530649214,20765.0,0.0,C1304175579,252719.19,404732.93,0,0


In [14]:
sample.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [None]:
# There are 3 text columns. Starting with the first one: type

In [16]:
sample["type"].value_counts()
# there are only 5 types, so this column can be one-hot encoded with dummies

CASH_OUT    35209
PAYMENT     33694
CASH_IN     21987
TRANSFER     8416
DEBIT         694
Name: type, dtype: int64
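
One descriptive statistic that can help judge which features matter is the fraud rate per transaction type. The sketch below is only an illustration: `toy` is a small made-up frame (not the PaySim data) that reuses the `type` and `isFraud` column names, and on the real data the same `groupby` would be run on `sample`.

```python
import pandas as pd

# Toy stand-in for `sample`: invented rows, only the column names match the lab data.
toy = pd.DataFrame({
    "type": ["CASH_OUT", "CASH_OUT", "PAYMENT", "TRANSFER", "TRANSFER", "CASH_IN"],
    "isFraud": [1, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of fraud rows within each type.
fraud_rate_by_type = toy.groupby("type")["isFraud"].agg(["count", "sum", "mean"])
print(fraud_rate_by_type)
```

Types whose fraud rate is zero or close to zero are candidates for features that separate the classes well.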

In [21]:
sample_dummy=pd.get_dummies(sample, columns=['type'])

In [29]:
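# get_dummies(columns=['type']) already removes the original 'type' column,
# so errors="ignore" keeps this cleanup step from raising a KeyError
sample_dummy.drop(columns="type", inplace=True, errors="ignore")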

In [30]:
sample_dummy.head()

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
6322570,688,23557.12,C867750533,8059.0,31616.12,C1026934669,169508.66,145951.53,0,0,1,0,0,0,0
3621196,274,6236.13,C601099070,0.0,0.0,M701283411,0.0,0.0,0,0,0,0,0,1,0
1226256,133,33981.87,C279540931,18745.72,0.0,M577905776,0.0,0.0,0,0,0,0,0,1,0
2803274,225,263006.42,C11675531,20072.0,0.0,C529577791,390253.56,653259.98,0,0,0,1,0,0,0
3201247,249,152013.74,C530649214,20765.0,0.0,C1304175579,252719.19,404732.93,0,0,0,1,0,0,0


In [34]:
# Now, the columns nameOrig and nameDest: count their distinct values
sample_dummy["nameOrig"].nunique()

In [35]:
sample_dummy["nameDest"].nunique()

In [36]:
# Both columns are account identifiers with many unique values, so creating dummies from them is impractical. Thus, I am going to discard them

sample_dummy.drop(columns=["nameDest", "nameOrig"], inplace=True)

In [37]:
sample_dummy.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
6322570,688,23557.12,8059.0,31616.12,169508.66,145951.53,0,0,1,0,0,0,0
3621196,274,6236.13,0.0,0.0,0.0,0.0,0,0,0,0,0,1,0
1226256,133,33981.87,18745.72,0.0,0.0,0.0,0,0,0,0,0,1,0
2803274,225,263006.42,20072.0,0.0,390253.56,653259.98,0,0,0,1,0,0,0
3201247,249,152013.74,20765.0,0.0,252719.19,404732.93,0,0,0,1,0,0,0


In [38]:
sample_dummy["isFlaggedFraud"].value_counts()
# every row has the same value (0) in this column, so it carries no information; I am going to remove it as well

0    100000
Name: isFlaggedFraud, dtype: int64

In [39]:
sample_dummy.drop(columns="isFlaggedFraud", axis=1,inplace=True)

In [41]:
sample_dummy.isna().sum()

step              0
amount            0
oldbalanceOrg     0
newbalanceOrig    0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
type_CASH_IN      0
type_CASH_OUT     0
type_DEBIT        0
type_PAYMENT      0
type_TRANSFER     0
dtype: int64

In [None]:
# Now, the dataset is ready to be used :)

### What is the distribution of the outcome? 

In [42]:
sample_dummy["isFraud"].value_counts()
# The outcome is heavily imbalanced: 99,876 non-fraud rows versus only 124 fraud rows (about 0.12%)

0    99876
1      124
Name: isFraud, dtype: int64
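
With a split this lopsided, it helps to know what accuracy a trivial model would reach before reading any accuracy score. The short calculation below uses only the two counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
n_legit = 99876
n_fraud = 124
n_total = n_legit + n_fraud

# Share of fraud rows in the sample.
fraud_share = n_fraud / n_total

# Accuracy of a trivial classifier that always predicts "not fraud" (0).
baseline_accuracy = n_legit / n_total

print(f"fraud share: {fraud_share:.3%}")
print(f"always-0 baseline accuracy: {baseline_accuracy:.5f}")
```

Any model whose accuracy is at or below this baseline is, by that metric alone, no better than predicting 0 for every row, which is why accuracy is a weak yardstick for this dataset.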

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [1]:
#Dataset already cleaned
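
One possible way to integrate the time variable, sketched here rather than taken from the notebook: the Kaggle description of the dataset states that one `step` is one hour of simulated time, so the raw counter can be split into a day index and an hour within the day. The helper name `step_to_day_and_hour` is hypothetical, and the code assumes step 1 is the first hour of the first day; the data does not say which clock hour that is, so the hour is only a relative offset.

```python
def step_to_day_and_hour(step):
    """Split an hourly step counter into (day index, hour within day).

    Assumption: step 1 is hour 0 of day 0. The real clock time of step 1
    is unknown, so `hour` is a relative offset, not a wall-clock hour.
    """
    day, hour = divmod(step - 1, 24)
    return day, hour


# A few step values, including 688 from the first row of sample.head().
for s in (1, 24, 25, 688):
    print(s, step_to_day_and_hour(s))
```

On the notebook's frame the same arithmetic could be applied column-wise, for example `(sample_dummy["step"] - 1) % 24` for the hour offset. A cyclical hour feature may suit a linear model better than the raw, ever-increasing step.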


### Run a logistic regression classifier and evaluate its accuracy.

In [44]:
from sklearn.linear_model import LogisticRegression
X = sample_dummy.drop('isFraud',axis = 1)
y = sample_dummy['isFraud']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
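
With so few fraud rows, a purely random split can leave the test set with noticeably more or fewer positives than expected. A stratified split is one option; the sketch below demonstrates `train_test_split(..., stratify=...)` on synthetic labels with the same class counts as the sample. `X_demo` and `y_demo` are placeholders, not the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the lab sample (99,876 vs 124).
y_demo = np.array([0] * 99876 + [1] * 124)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature, not real data

# stratify=y_demo keeps the fraud share (nearly) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

print("fraud rows in train:", int(y_tr.sum()))
print("fraud rows in test: ", int(y_te.sum()))
```

In the notebook this would amount to passing `stratify=y` to the existing call; the results printed further down were produced without it.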

In [47]:
LR = LogisticRegression(max_iter=1000)
LR.fit(X_train, y_train)
pred = LR.predict(X_test)

print("Train accuracy score: ", LR.score(X_train, y_train))
print("Test accuracy score: ", LR.score(X_test, y_test))

Train accuracy score:  0.9981066666666667
Test accuracy score:  0.99836


In [49]:
from sklearn.metrics import confusion_matrix

pred = LR.predict(X_test)
confusion_matrix(y_test, pred)

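# rows are true labels and columns are predictions: there are 32 false positives and 9 false negatives,
# so only 20 of the 29 fraud cases in the test set were caught. Maybe balancing the data could improve this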

array([[24939,    32],
       [    9,    20]], dtype=int64)

### Now pick a model of your choice and evaluate its accuracy.

In [53]:
from sklearn.utils import resample

In [51]:
# I am going to keep the same model, but rebalance the data by oversampling the minority class

# separate majority/minority classes
no_fraud = sample_dummy[sample_dummy['isFraud']==0]
yes_fraud = sample_dummy[sample_dummy['isFraud']==1]

display(no_fraud.shape)
display(yes_fraud.shape)


(99876, 12)

(124, 12)

In [54]:
# oversample minority
yes_fraud_oversampled = resample(yes_fraud, #<- oversample from here 
                                    replace=True, #<- we need replacement, since we don't have enough data otherwise
                                    n_samples = len(no_fraud),#<- make the fraud set the same size as the no-fraud set
                                    random_state=0)

In [55]:
# both sets are now the same size
display(no_fraud.shape)
display(yes_fraud_oversampled.shape)

(99876, 12)

(99876, 12)

In [56]:
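# build the oversampled training set; note that no_fraud and yes_fraud come from the full sample,
# so this set also contains the rows that were held out in X_test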
train_oversampled = pd.concat([no_fraud,yes_fraud_oversampled])
train_oversampled.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
6322570,688,23557.12,8059.0,31616.12,169508.66,145951.53,0,1,0,0,0,0
3621196,274,6236.13,0.0,0.0,0.0,0.0,0,0,0,0,1,0
1226256,133,33981.87,18745.72,0.0,0.0,0.0,0,0,0,0,1,0
2803274,225,263006.42,20072.0,0.0,390253.56,653259.98,0,0,1,0,0,0
3201247,249,152013.74,20765.0,0.0,252719.19,404732.93,0,0,1,0,0,0


In [57]:
# Now, separating the features (X) and the target (y):

y_train_over = train_oversampled['isFraud'].copy()
X_train_over = train_oversampled.drop('isFraud',axis = 1).copy()
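
Because the oversampled set above is built from the whole sample, rows from the test split can end up in the training data. A common alternative is to split first and resample only the training portion. The sketch below shows that order of operations on a small invented frame (`toy`), not on the lab data, and its numbers say nothing about how the notebook's results would change.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy stand-in for `sample_dummy` (invented values): 40 legitimate rows, 8 fraud rows.
toy = pd.DataFrame({
    "amount": range(48),
    "isFraud": [0] * 40 + [1] * 8,
})

# 1) Split first, so the test rows are never seen during resampling.
train_df, test_df = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy["isFraud"]
)

# 2) Oversample the minority class inside the training split only.
train_major = train_df[train_df["isFraud"] == 0]
train_minor = train_df[train_df["isFraud"] == 1]
train_minor_over = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=0
)
train_balanced = pd.concat([train_major, train_minor_over])

# 3) Check that no row index from the test split appears in the balanced training set.
overlap = set(train_balanced.index) & set(test_df.index)
print(train_balanced["isFraud"].value_counts().to_dict(), "overlap:", len(overlap))
```

The test split keeps its original, imbalanced class ratio, so scores measured on it reflect the distribution the model would face in practice.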

In [58]:
LR = LogisticRegression(max_iter=1000)
LR.fit(X_train_over, y_train_over)
pred = LR.predict(X_test)

In [59]:
print("Train accuracy score: ", LR.score(X_train, y_train))
print("Test accuracy score: ", LR.score(X_test, y_test))

Train accuracy score:  0.9580666666666666
Test accuracy score:  0.9566


In [60]:
pred = LR.predict(X_test)
confusion_matrix(y_test, pred)

array([[23887,  1084],
       [    1,    28]], dtype=int64)
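
Accuracy hides how each model treats the rare fraud class, so it can be useful to derive precision, recall, and F1 for class 1 from the two confusion matrices printed in this notebook. The helper `summarize` below is a hypothetical name; the counts are copied from the outputs above and use scikit-learn's layout `[[TN, FP], [FN, TP]]`.

```python
def summarize(matrix):
    """Return precision, recall, and F1 for class 1 from a 2x2 confusion matrix.

    Layout follows sklearn's confusion_matrix: rows are true labels,
    columns are predicted labels, i.e. [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Counts copied from the two confusion matrices in this notebook.
original = [[24939, 32], [9, 20]]
oversampled = [[23887, 1084], [1, 28]]

for name, matrix in (("original", original), ("oversampled", oversampled)):
    stats = summarize(matrix)
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

The two models sit at opposite ends of the precision/recall trade-off: the first flags few transactions and misses 9 of 29 frauds, while the oversampled one misses only 1 but raises 1,084 false alarms. Which is preferable depends on the relative cost of a missed fraud versus a false alarm.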

### Which model worked better and how do you know?

In [2]:
# Judged by accuracy and by false positives, the first model (trained without resampling) worked better:
# test accuracy was 0.998 versus 0.957, and there were 32 false positives versus 1,084.
# The oversampled model did miss fewer frauds, though (1 false negative versus 9).
