# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: the model performs well in training and poorly when applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), in which case it performs poorly on both training and real data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [1]:
#Import your libraries
import numpy as np
import pandas as pd
import random

In [2]:
data = pd.read_csv('/Users/Kakurebono/Documents/GitHub/lab-imbalance/your-code/fraud.csv')
data = data.sample(n=100000, random_state=42)

In [3]:
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
3737323,278,CASH_IN,330218.42,C632336343,20866.0,351084.42,C834976624,452419.57,122201.15,0,0
264914,15,PAYMENT,11647.08,C1264712553,30370.0,18722.92,M215391829,0.0,0.0,0,0
85647,10,CASH_IN,152264.21,C1746846248,106589.0,258853.21,C1607284477,201303.01,49038.8,0,0
5899326,403,TRANSFER,1551760.63,C333676753,0.0,0.0,C1564353608,3198359.45,4750120.08,0,0
2544263,206,CASH_IN,78172.3,C813403091,2921331.58,2999503.88,C1091768874,415821.9,337649.6,0,0


In [4]:
data.corr()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
step,1.0,0.025131,-0.011913,-0.012198,0.024287,0.023641,0.03499,-0.000704
amount,0.025131,1.0,-0.006758,-0.011536,0.272921,0.446489,0.077999,0.027019
oldbalanceOrg,-0.011913,-0.006758,1.0,0.998908,0.062895,0.038052,0.006952,0.004488
newbalanceOrig,-0.012198,-0.011536,0.998908,1.0,0.06445,0.037949,-0.010106,0.00441
oldbalanceDest,0.024287,0.272921,0.062895,0.06445,1.0,0.973435,-0.009096,-0.001083
newbalanceDest,0.023641,0.446489,0.038052,0.037949,0.973435,1.0,-0.004081,-0.001119
isFraud,0.03499,0.077999,0.006952,-0.010106,-0.009096,-0.004081,1.0,0.084156
isFlaggedFraud,-0.000704,0.027019,0.004488,0.00441,-0.001083,-0.001119,0.084156,1.0


In [5]:
data['type'].unique()

array(['CASH_IN', 'PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT'],
      dtype=object)

In [6]:
data['type'].value_counts()

CASH_OUT    35334
PAYMENT     33564
CASH_IN     22141
TRANSFER     8349
DEBIT         612
Name: type, dtype: int64

In [7]:
data = pd.get_dummies(data=data, columns=['type'])

In [8]:
len(data)

100000

### What is the distribution of the outcome? 

In [9]:
data.groupby(by=['isFlaggedFraud']).mean()

isFlaggedFraud,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,243.709387,180533.4,836639.2,858182.5,1104204.0,1230067.0,0.0014,0.221412,0.353344,0.00612,0.335643,0.083481
1,212.0,4953893.0,4953893.0,4953893.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


In [10]:
data["isFraud"].value_counts()

0    99859
1      141
Name: isFraud, dtype: int64

In [11]:
data["isFraud"].sum()

141

In [12]:
data["isFraud"].sum() / len(data)

0.00141
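
With a fraud share this small, plain accuracy is a weak yardstick. The minimal sketch below uses only the class counts printed above (99859 non-fraud, 141 fraud) to show what a hypothetical classifier that always predicts "not fraud" would score. The `always_zero` baseline is an illustration, not part of the lab's pipeline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from the value_counts() output above.
n_not_fraud, n_fraud = 99859, 141

y_true = np.array([0] * n_not_fraud + [1] * n_fraud)

# Hypothetical baseline: predict "not fraud" for every transaction.
always_zero = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, always_zero)
baseline_recall = recall_score(y_true, always_zero)

print(baseline_accuracy)
print(baseline_recall)
```

The baseline is right on 99859 of 100000 rows (accuracy 0.99859) while catching none of the fraud cases, which is why recall, precision and F1 for the fraud class are more informative here.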

In [13]:
data['nameOrig'].value_counts()
# nearly every value is unique, so this column can be dropped

C838212175     2
C1682659218    1
C729481720     1
C546614841     1
C111050195     1
              ..
C1739059888    1
C1159629214    1
C1500783418    1
C1293715171    1
C551908941     1
Name: nameOrig, Length: 99999, dtype: int64

In [14]:
data['nameDest'].value_counts()

C1085553281    6
C681078805     5
C1849014975    5
C1269097316    5
C709091500     5
              ..
M1802095329    1
M1431712159    1
C1022846053    1
C1309507313    1
C158799595     1
Name: nameDest, Length: 92914, dtype: int64

In [15]:
data['nameDest_letter'] = data['nameDest'].str[0]

In [16]:
data["nameDest_letter"].value_counts()
# roughly a 2:1 split between C and M destinations; the M count equals the PAYMENT count above

C    66436
M    33564
Name: nameDest_letter, dtype: int64

In [17]:
data.head()

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,nameDest_letter
3737323,278,330218.42,C632336343,20866.0,351084.42,C834976624,452419.57,122201.15,0,0,1,0,0,0,0,C
264914,15,11647.08,C1264712553,30370.0,18722.92,M215391829,0.0,0.0,0,0,0,0,0,1,0,M
85647,10,152264.21,C1746846248,106589.0,258853.21,C1607284477,201303.01,49038.8,0,0,1,0,0,0,0,C
5899326,403,1551760.63,C333676753,0.0,0.0,C1564353608,3198359.45,4750120.08,0,0,0,0,0,0,1,C
2544263,206,78172.3,C813403091,2921331.58,2999503.88,C1091768874,415821.9,337649.6,0,0,1,0,0,0,0,C


In [18]:
data['step'].value_counts()

19     810
18     783
187    754
307    747
163    743
      ... 
613      1
643      1
85       1
443      1
639      1
Name: step, Length: 463, dtype: int64

In [19]:
data.dtypes

step                 int64
amount             float64
nameOrig            object
oldbalanceOrg      float64
newbalanceOrig     float64
nameDest            object
oldbalanceDest     float64
newbalanceDest     float64
isFraud              int64
isFlaggedFraud       int64
type_CASH_IN         uint8
type_CASH_OUT        uint8
type_DEBIT           uint8
type_PAYMENT         uint8
type_TRANSFER        uint8
nameDest_letter     object
dtype: object

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 3737323 to 6142173
Data columns (total 16 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   step             100000 non-null  int64  
 1   amount           100000 non-null  float64
 2   nameOrig         100000 non-null  object 
 3   oldbalanceOrg    100000 non-null  float64
 4   newbalanceOrig   100000 non-null  float64
 5   nameDest         100000 non-null  object 
 6   oldbalanceDest   100000 non-null  float64
 7   newbalanceDest   100000 non-null  float64
 8   isFraud          100000 non-null  int64  
 9   isFlaggedFraud   100000 non-null  int64  
 10  type_CASH_IN     100000 non-null  uint8  
 11  type_CASH_OUT    100000 non-null  uint8  
 12  type_DEBIT       100000 non-null  uint8  
 13  type_PAYMENT     100000 non-null  uint8  
 14  type_TRANSFER    100000 non-null  uint8  
 15  nameDest_letter  100000 non-null  object 
dtypes: float64(5), int64(3), object(3), uint8(5)

In [21]:
data.drop('nameOrig', axis=1, inplace=True)
# dropped because nearly every value is unique

In [22]:
data.drop('nameDest', axis=1, inplace=True)
# dropped because it is a high-cardinality identifier (92914 distinct values)

In [23]:
data.drop('nameDest_letter', axis=1, inplace=True)
# dropped; it was needed only for data exploration

In [24]:
data.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
3737323,278,330218.42,20866.0,351084.42,452419.57,122201.15,0,0,1,0,0,0,0
264914,15,11647.08,30370.0,18722.92,0.0,0.0,0,0,0,0,0,1,0
85647,10,152264.21,106589.0,258853.21,201303.01,49038.8,0,0,1,0,0,0,0
5899326,403,1551760.63,0.0,0.0,3198359.45,4750120.08,0,0,0,0,0,0,1
2544263,206,78172.3,2921331.58,2999503.88,415821.9,337649.6,0,0,1,0,0,0,0


### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?
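
One possible way to integrate the time variable is sketched below, under the assumption that each `step` represents one hour of simulated time (as the PaySim dataset description states). The raw integer grows without bound, so a cyclic hour-of-day and a day index may be easier for a model to use. The column names `hour_of_day` and `day` are hypothetical, and the tiny frame reuses the five `step` values shown in `data.head()`.

```python
import pandas as pd

# Five step values copied from the data.head() output above.
demo = pd.DataFrame({"step": [278, 15, 10, 403, 206]})

# Assumption: one step equals one hour of simulated time.
demo["hour_of_day"] = demo["step"] % 24   # position within a 24-hour cycle
demo["day"] = demo["step"] // 24          # number of full days elapsed

print(demo)
```

The hour labels are relative to the start of the simulation, not to a real clock, so they should be read as a repeating daily pattern rather than as actual times.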

### Run a logistic regression classifier and evaluate its accuracy.

In [25]:
from sklearn.utils import resample 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [26]:
# separate input features and target
y = data['isFraud']
X = data.drop(labels='isFraud', axis=1)

In [27]:
#setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.25, random_state=27)

In [28]:
#concatenate our training data back together
X = pd.concat([X_train, y_train], axis = 1)

In [29]:
#separate minority and majority class
not_fraud = X[X['isFraud']==0]
fraud = X[X['isFraud']==1]

In [30]:
#upsample minority
fraud_upsampled = resample(fraud,
                           replace=True,  # sample with replacement
                           n_samples=len(not_fraud),  # match number in majority class
                           random_state=27)  # reproducible results

In [31]:
fraud_upsampled['isFraud'].value_counts()

1    74892
Name: isFraud, dtype: int64

In [32]:
fraud_upsampled_concated = pd.concat([not_fraud, fraud_upsampled])  # default axis=0 stacks rows
# Concatenating these two datasets with axis=1 breaks the code. Could you explain the reason behind it?
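
On the question in the comment above: a likely explanation is that `axis=1` asks pandas to place the two frames side by side and align their rows by index label, whereas the goal here is to stack rows (`axis=0`, the default). The upsampled frame also contains repeated index labels, which column-wise alignment generally cannot resolve. The toy frames `a` and `b` below are made up for illustration and use unique labels so that both calls run.

```python
import pandas as pd

# Hypothetical stand-ins for not_fraud and fraud_upsampled.
a = pd.DataFrame({"amount": [10.0, 20.0], "isFraud": [0, 0]}, index=[100, 101])
b = pd.DataFrame({"amount": [30.0, 40.0], "isFraud": [1, 1]}, index=[200, 201])

# axis=0 (default): rows of b are appended below rows of a.
stacked = pd.concat([a, b])

# axis=1: columns are placed side by side and rows are aligned on the index,
# so labels missing from one frame are filled with NaN.
side_by_side = pd.concat([a, b], axis=1)

print(stacked.shape)
print(side_by_side.shape)
print(side_by_side)
```

Even when it runs, the `axis=1` result has duplicated column names and half-empty rows, which is not a usable training set.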

In [46]:
len(fraud_upsampled_concated)

149784

In [36]:
y_train =fraud_upsampled_concated['isFraud']
X_train =fraud_upsampled_concated.drop(labels = 'isFraud', axis = 1) 

In [58]:
model = LogisticRegression(solver='liblinear').fit(X_train,y_train) 
# model = LogisticRegression(solver='liblinear')
# model.fit(X_train,y_train) 

y_pred = model.predict(X_test)

In [38]:
from sklearn.metrics import confusion_matrix, accuracy_score
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

[[22532  2435]
 [    7    26]]
0.90232


In [39]:
from sklearn.metrics import mean_absolute_error as mae

In [40]:
df2 = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

In [41]:
mae(df2['Actual'], df2['Predicted'])

0.09768

In [42]:
from sklearn.metrics import f1_score

In [43]:
f1_score(y_test, y_pred)

0.020850040096230957

In [44]:
import matplotlib.pyplot as plt

In [48]:
from sklearn.metrics import classification_report, confusion_matrix

In [55]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      0.90      0.95     24967
           1       0.01      0.79      0.02        33

    accuracy                           0.90     25000
   macro avg       0.51      0.85      0.48     25000
weighted avg       1.00      0.90      0.95     25000
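
The reported scores can be reproduced by hand from the confusion matrix printed earlier, which helps explain why accuracy looks acceptable while the F1 score for the fraud class is so low. The sketch below uses only the four counts from that output; the variable names are illustrative.

```python
# Counts from the confusion matrix above: rows are actual, columns are predicted.
tn, fp = 22532, 2435
fn, tp = 7, 26

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of flagged transactions that are really fraud
recall = tp / (tp + fn)      # share of real fraud that was flagged
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 5))   # 0.90232
print(round(precision, 4))  # 0.0106
print(round(recall, 4))     # 0.7879
print(round(f1, 4))         # 0.0209
```

The model catches 26 of the 33 fraud cases in the test set, but only about 1% of the transactions it flags are actually fraud. The MAE of 0.09768 computed earlier is simply 1 minus the accuracy, since the labels are 0 and 1.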



In [57]:
# lab done 


### Now pick a model of your choice and evaluate its accuracy.

In [None]:
# Your code here
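
One candidate for the second model is sketched below on synthetic data rather than on the PaySim sample: a random forest with `class_weight="balanced"`. The dataset from `make_classification`, the roughly 1% minority share, and all hyperparameters are assumptions for illustration only; nothing printed here says anything about the lab's data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced stand-in for the real data (roughly 1% positives).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.99, 0.01], random_state=0
)

# stratify keeps the class ratio the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency,
# an alternative to resampling the training rows.
forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
forest.fit(X_tr, y_tr)
pred = forest.predict(X_te)

print(classification_report(y_te, pred, zero_division=0))
demo_f1 = f1_score(y_te, pred, zero_division=0)
```

To apply the same idea to the lab, the synthetic arrays would be replaced by the original (not upsampled) training split, and the fraud-class precision, recall and F1 would be compared against the logistic regression results above.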

### Which model worked better and how do you know?

In [None]:
# Your response here
