# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting; it produces good results in training and bad results when the model is applied to real data. Similarly, we could have a model that is too simplistic to accurately capture the signal (underfitting). Such a model never works well, neither in training nor on new data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [None]:
"""
About the dataset:

step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).

type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

amount - amount of the transaction in local currency.

nameOrig - customer who started the transaction

oldbalanceOrg - initial balance before the transaction

newbalanceOrig - new balance after the transaction

nameDest - customer who is the recipient of the transaction

oldbalanceDest - initial balance of the recipient before the transaction.
Note that there is no information for customers that start with M (Merchants).

newbalanceDest - new balance of the recipient after the transaction.
Note that there is no information for customers that start with M (Merchants).

isFraud - This marks the transactions made by the fraudulent agents inside the simulation.
In this specific dataset the fraudulent behavior of the agents aims to profit by taking control
of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. 
An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.


"""

In [2]:
# Import your libraries:

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [3]:
# loading the data:
total_data = pd.read_csv('data_imbalance.csv')

In [4]:
data = total_data.sample(100000)
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
3143254,236,CASH_OUT,147566.62,C2048528829,101033.0,0.0,C1413936086,0.0,147566.62,0,0
2828286,226,CASH_IN,114591.07,C180556562,3552875.16,3667466.22,C2013636495,127439.23,12848.17,0,0
6004065,428,CASH_OUT,216463.71,C334259387,30896.0,0.0,C1060739052,4333677.91,4550141.62,0,0
4824475,346,PAYMENT,17907.6,C268977496,32189.0,14281.4,M1192857340,0.0,0.0,0,0
5990599,419,CASH_OUT,117283.3,C199734727,51362.0,0.0,C1878999311,0.0,117283.3,0,0
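
Because `sample` draws different rows on every run, the numbers shown in this notebook will not reproduce exactly. A minimal sketch of a reproducible draw, assuming a small synthetic stand-in table (`toy` is hypothetical; the seed value 42 is arbitrary):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full PaySim table (the real one comes from pd.read_csv).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.uniform(1, 1000, size=1000),
    "isFraud": (rng.random(1000) < 0.01).astype(int),
})

# Fixing random_state makes the sample identical on every run.
sample_a = toy.sample(n=100, random_state=42)
sample_b = toy.sample(n=100, random_state=42)

same_rows = sample_a.index.equals(sample_b.index)
print("same rows drawn twice:", same_rows)
```

On the real data the same idea would be `total_data.sample(n=100000, random_state=...)`, at the cost of getting different rows than the ones displayed here.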


In [5]:
data.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,242.92684,180706.9,832307.2,853776.0,1106068.0,1229308.0,0.00116,0.0
std,142.080734,624693.9,2875307.0,2912112.0,3547170.0,3829092.0,0.034039,0.0
min,1.0,0.37,0.0,0.0,0.0,0.0,0.0,0.0
25%,155.0,13541.85,0.0,0.0,0.0,0.0,0.0,0.0
50%,237.0,75632.33,14396.0,0.0,132727.3,214790.7,0.0,0.0
75%,334.0,209984.0,106876.5,145164.9,944460.8,1118321.0,0.0,0.0
max,734.0,60154460.0,37538000.0,37919820.0,272404700.0,288544800.0,1.0,0.0


In [11]:
data['type'].value_counts()

CASH_OUT    35337
PAYMENT     33586
CASH_IN     22019
TRANSFER     8387
DEBIT         671
Name: type, dtype: int64

### What is the distribution of the outcome? 

In [9]:
data['isFraud'].value_counts()

# the outcome is highly imbalanced towards 0 (no fraud)

0    99884
1      116
Name: isFraud, dtype: int64
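
To put the imbalance in numbers, the counts above can be turned into a fraud rate and an imbalance ratio. This is a small sketch that re-enters the two counts by hand rather than reading the CSV:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 99884, 1: 116}, name="isFraud")

fraud_rate = counts.loc[1] / counts.sum()           # share of fraudulent rows
imbalance_ratio = counts.loc[0] / counts.loc[1]     # legitimate rows per fraudulent row

print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Legitimate transactions per fraud: {imbalance_ratio:.0f}")
```

On the sampled DataFrame itself, `data['isFraud'].value_counts(normalize=True)` gives the same proportions directly.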

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [15]:
#taking out what is not needed
data.drop(['step','nameOrig', 'nameDest', 'isFlaggedFraud'],axis=1,inplace=True)

In [20]:
# dummy variables for the transaction type
data = pd.get_dummies(data)

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 3143254 to 6015648
Data columns (total 11 columns):
step              100000 non-null int64
type              100000 non-null object
amount            100000 non-null float64
nameOrig          100000 non-null object
oldbalanceOrg     100000 non-null float64
newbalanceOrig    100000 non-null float64
nameDest          100000 non-null object
oldbalanceDest    100000 non-null float64
newbalanceDest    100000 non-null float64
isFraud           100000 non-null int64
isFlaggedFraud    100000 non-null int64
dtypes: float64(5), int64(3), object(3)
memory usage: 9.2+ MB


In [None]:
"""
In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).

"""

In [21]:
data.head() #seems okay

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
3143254,147566.62,101033.0,0.0,0.0,147566.62,0,0,1,0,0,0
2828286,114591.07,3552875.16,3667466.22,127439.23,12848.17,0,1,0,0,0,0
6004065,216463.71,30896.0,0.0,4333677.91,4550141.62,0,0,1,0,0,0
4824475,17907.6,32189.0,14281.4,0.0,0.0,0,0,0,0,1,0
5990599,117283.3,51362.0,0.0,0.0,117283.3,0,0,1,0,0,0


### Run a logisitc regression classifier and evaluate its accuracy.

In [23]:
from sklearn.model_selection import train_test_split

X=data.drop('isFraud',axis=1)
y=data['isFraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
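
With so few fraud cases, a purely random split can leave the test set with noticeably more or fewer positives than expected. A hedged sketch of a stratified split on toy labels (`X_toy` and `y_toy` are placeholders, not the lab data) that keeps the class ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same fraud count as the sample above (116 in 100000).
y_toy = np.zeros(100_000, dtype=int)
y_toy[:116] = 1
X_toy = np.arange(100_000).reshape(-1, 1)    # placeholder feature column

# stratify=y_toy preserves the class proportions in train and test;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=0
)

print("fraud cases in train:", y_tr.sum())
print("fraud cases in test: ", y_te.sum())
```

The split in this notebook was not stratified, so its test set is not guaranteed to hold exactly 20% of the fraud cases.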

In [24]:
data.dtypes

amount            float64
oldbalanceOrg     float64
newbalanceOrig    float64
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
type_CASH_IN        uint8
type_CASH_OUT       uint8
type_DEBIT          uint8
type_PAYMENT        uint8
type_TRANSFER       uint8
dtype: object

In [38]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.metrics import confusion_matrix


lr = LogisticRegression()
model = lr.fit(X_train, y_train)  # fit() returns the fitted estimator itself

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print('Accuracy:', round(accuracy*100,2), "%")

Accuracy: 99.54 %


In [40]:
confusion_matrix(y_test,y_pred)

array([[19886,    88],
       [    4,    22]])
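
A useful sanity check for an imbalanced problem is to compare the accuracy with what a model that always predicts "no fraud" would score. The sketch below works only from the confusion matrix printed above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix copied from the logistic regression output above.
cm = np.array([[19886, 88],
               [4, 22]])

total = cm.sum()
model_accuracy = np.trace(cm) / total        # correct predictions / all predictions
baseline_accuracy = cm[0].sum() / total      # always predicting class 0 (no fraud)

print(f"model accuracy:    {model_accuracy:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Here the trivial baseline scores slightly higher than the classifier, which is why accuracy alone says little about how well fraud is detected.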

### Now pick a model of your choice and evaluate its accuracy.

In [41]:
#let's go with Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier

gboost = GradientBoostingClassifier()
gboost.fit(X_train, y_train)

y_pred = gboost.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)


print('Accuracy:', round(accuracy*100,2), "%") #wow too good

Accuracy: 99.9 %


In [42]:
confusion_matrix(y_test,y_pred)

array([[19972,     2],
       [   17,     9]])

### Which model worked better and how do you know?

In [2]:
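# The dependent variable is heavily imbalanced, which drives an extremely high accuracy for both models,
# so accuracy alone cannot tell them apart.
# Looking at the confusion matrices, Gradient Boosting produces far fewer false positives (2 vs. 88)
# but many more false negatives (17 vs. 4): it catches only 9 of the 26 frauds in the test set,
# while logistic regression catches 22.
# Which model is "better" depends on whether a missed fraud or a false alarm is more costly.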
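
The trade-off between the two models can be summarised with precision, recall and F1 for the fraud class. This is a sketch computed by hand from the two confusion matrices printed earlier (`fraud_metrics` is a helper written for this example, not a library function):

```python
import numpy as np

def fraud_metrics(cm):
    """Precision, recall and F1 for the positive (fraud) class of a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)    # flagged transactions that really are fraud
    recall = tp / (tp + fn)       # fraud cases that were actually flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Matrices copied from the outputs above (rows: true class, columns: predicted class).
matrices = {
    "LogisticRegression": np.array([[19886, 88], [4, 22]]),
    "GradientBoosting": np.array([[19972, 2], [17, 9]]),
}

results = {name: fraud_metrics(cm) for name, cm in matrices.items()}
for name, (p, r, f1) in results.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With fitted models at hand, `precision_score` and `recall_score` from `sklearn.metrics` (already imported above) return the same quantities from `y_test` and `y_pred`.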

