# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: the model gives good results in training and bad results when it is applied to real data. Similarly, a model can be too simplistic to capture the signal (underfitting). Such a model never works well, neither in training nor on new data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/ealaxi/paysim1 . Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [31]:
# We import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### What is the distribution of the outcome? 

In [32]:
# We import dataset
payments = pd.read_csv('C:/Users/magavald/Desktop/data.csv')
payments = payments.sample(100000, random_state=1)

In [33]:
#We make a first observation of the dataset. We aim to predict whether a transaction is fraud or not (isFraud column)
payments.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
6322570,688,CASH_IN,23557.12,C867750533,8059.0,31616.12,C1026934669,169508.66,145951.53,0,0
3621196,274,PAYMENT,6236.13,C601099070,0.0,0.0,M701283411,0.0,0.0,0,0
1226256,133,PAYMENT,33981.87,C279540931,18745.72,0.0,M577905776,0.0,0.0,0,0
2803274,225,CASH_OUT,263006.42,C11675531,20072.0,0.0,C529577791,390253.56,653259.98,0,0
3201247,249,CASH_OUT,152013.74,C530649214,20765.0,0.0,C1304175579,252719.19,404732.93,0,0


In [34]:
#This is an imbalanced dataset as only 0.12% of instances are fraud
payments['isFraud'].value_counts()

0    99876
1      124
Name: isFraud, dtype: int64
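
To put a number on the imbalance, the counts above can be turned into class shares. This is a minimal sketch: the Series `is_fraud` is a stand-in rebuilt from the two printed counts, not the real `payments['isFraud']` column.

```python
import pandas as pd

# Stand-in for payments['isFraud'], rebuilt from the counts printed above
is_fraud = pd.Series([0] * 99876 + [1] * 124, name="isFraud")

# normalize=True returns class shares instead of raw counts
shares = is_fraud.value_counts(normalize=True)
print(shares)

fraud_share = shares.loc[1]
print(f"Fraud share: {fraud_share:.3%}")
```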

In [35]:
#We observe the datatypes
payments.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [36]:
#The first clean-up we can do is identify which transaction types are potentially fraudulent: TRANSFER and CASH_OUT
payments[payments['isFraud']==1]['type'].value_counts()

CASH_OUT    64
TRANSFER    60
Name: type, dtype: int64

In [37]:
#We clean up the dataset accordingly. The dataset is now 56.4% smaller (43,625 of 100,000 rows remain)
payments = payments[payments['type'].isin(['CASH_OUT','TRANSFER'])]
print(payments.shape)
payments.head()

(43625, 11)


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2803274,225,CASH_OUT,263006.42,C11675531,20072.0,0.0,C529577791,390253.56,653259.98,0,0
3201247,249,CASH_OUT,152013.74,C530649214,20765.0,0.0,C1304175579,252719.19,404732.93,0,0
1351584,137,CASH_OUT,336874.19,C1430396546,201316.0,0.0,C1687236810,20820.92,357695.11,0,0
5422829,378,CASH_OUT,520230.74,C1815050914,0.0,0.0,C1640500532,540059.79,1060290.53,0,0
2400263,202,TRANSFER,735977.55,C1864759705,8900.0,0.0,C2022704650,0.0,735977.55,0,0


In [38]:
#We remove unnecessary columns (reassigning avoids modifying a filtered slice in place)
payments = payments.drop(columns=['isFlaggedFraud'])

In [39]:
#We dummify the payment type
payment_type_dummy = pd.get_dummies(payments['type'])
payments_dummy = payments.join(payment_type_dummy).drop(columns=['type'])
payments_dummy.head()

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,CASH_OUT,TRANSFER
2803274,225,263006.42,C11675531,20072.0,0.0,C529577791,390253.56,653259.98,0,1,0
3201247,249,152013.74,C530649214,20765.0,0.0,C1304175579,252719.19,404732.93,0,1,0
1351584,137,336874.19,C1430396546,201316.0,0.0,C1687236810,20820.92,357695.11,0,1,0
5422829,378,520230.74,C1815050914,0.0,0.0,C1640500532,540059.79,1060290.53,0,1,0
2400263,202,735977.55,C1864759705,8900.0,0.0,C2022704650,0.0,735977.55,0,0,1


In [40]:
# 'nameOrig' and 'nameDest' each start with a letter followed by an ID, so we keep only the first letter
payments_dummy['nameOrig'] = payments_dummy['nameOrig'].str[0]
payments_dummy['nameDest'] = payments_dummy['nameDest'].str[0]

In [41]:
#In both cases it's only 'C', so we can remove these columns
print(payments_dummy['nameOrig'].value_counts())
print(payments_dummy['nameDest'].value_counts())

C    43625
Name: nameOrig, dtype: int64
C    43625
Name: nameDest, dtype: int64


In [42]:
# The name columns are constant, and TRANSFER is redundant with CASH_OUT (only two types remain)
payments_dummy.drop(columns=['nameOrig','nameDest','TRANSFER'], inplace=True)

In [43]:
payments_dummy.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,CASH_OUT
2803274,225,263006.42,20072.0,0.0,390253.56,653259.98,0,1
3201247,249,152013.74,20765.0,0.0,252719.19,404732.93,0,1
1351584,137,336874.19,201316.0,0.0,20820.92,357695.11,0,1
5422829,378,520230.74,0.0,0.0,540059.79,1060290.53,0,1
2400263,202,735977.55,8900.0,0.0,0.0,735977.55,0,0
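
The cleaning above leaves `step` as a raw integer. The dataset description on Kaggle states that one step is one hour of simulated time, so one option is to derive an hour-of-day and a day index from it. The sketch below is an assumption about what might be useful, not something this lab evaluated. The column names `hour_of_day` and `day` are hypothetical, and the alignment of step 0 with midnight is also assumed.

```python
import pandas as pd

# A few step values taken from the rows shown above
demo = pd.DataFrame({"step": [225, 249, 137, 378, 202]})

# Assumption: one step equals one hour and steps are aligned to midnight
demo["hour_of_day"] = demo["step"] % 24   # cyclic position within a day
demo["day"] = demo["step"] // 24          # elapsed whole days
print(demo)
```

A cyclic feature such as `hour_of_day` lets a model pick up daily patterns that a monotonically increasing `step` cannot express directly.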


### Run a logistic regression classifier and evaluate its accuracy.

In [44]:
# Import 
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

In [45]:
#Train test split
X_train, X_test, y_train, y_test = train_test_split(payments_dummy.drop(columns=['isFraud'])
                                                                        , payments_dummy['isFraud']
                                                                        , random_state=0
                                                                        , test_size =0.2)

In [46]:
y_test.value_counts()

0    8694
1      31
Name: isFraud, dtype: int64
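
With so few fraud cases, a purely random split can shift the fraud share between the training and test sets. `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both. The sketch below uses synthetic labels that only mimic the class counts of the filtered dataset; `X` is a dummy feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the filtered dataset
y = np.array([0] * 43501 + [1] * 124)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Both shares should be close to 124 / 43625
print("train fraud share:", y_tr.mean())
print("test fraud share: ", y_te.mean())
```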

In [47]:
#Instantiate the model and fit it to the training data
model = LogisticRegression(max_iter=1000)
model = model.fit(X_train, y_train)
print('Accuracy is',model.score(X_test, y_test),'however, it is not a good way to assess performance as a very low % of transactions are fraud')

Accuracy is 0.9981661891117478 however, it is not a good way to assess performance as a very low % of transactions are fraud
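
To see why accuracy is misleading here, compare it with a trivial baseline that labels every transaction as not fraud. The counts come from `y_test.value_counts()` above; nothing else is assumed.

```python
# Class counts in y_test, as printed above
n_legit, n_fraud = 8694, 31

# Accuracy of a "model" that labels every transaction as not fraud
baseline_accuracy = n_legit / (n_legit + n_fraud)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

This baseline catches no fraud at all, yet it lands within a fraction of a percentage point of the logistic regression's accuracy.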


In [48]:
y_pred_logistic = model.predict(X_test)
print('Precision is', precision_score(y_test,y_pred_logistic))
print('Recall is',recall_score(y_test,y_pred_logistic))

Precision is 0.8571428571428571
Recall is 0.5806451612903226
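
The two scores above pin down the confusion matrix. With 31 fraud cases in the test set, a recall of 0.5806 corresponds to 18 of 31 frauds caught, and a precision of 0.8571 to 18 correct out of 21 flagged. These counts are inferred from the printed metrics, not read from a `confusion_matrix` call in the notebook.

```python
# Counts inferred from the printed precision, recall and y_test.value_counts()
n_fraud, n_legit = 31, 8694
tp = 18               # frauds correctly flagged
fn = n_fraud - tp     # frauds missed
fp = 3                # legitimate transactions flagged as fraud
tn = n_legit - fp     # legitimate transactions left alone

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (n_fraud + n_legit)
print(precision, recall, accuracy)
```

The same counts also reproduce the accuracy printed earlier, which is a useful consistency check.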


### Now pick a model of your choice and evaluate its accuracy.

In [49]:
# We try XGBoost as an alternative classifier
import xgboost

model = xgboost.XGBClassifier()
model = model.fit(X_train, y_train)
y_pred_xgboost = model.predict(X_test)
model.score(X_test,y_test)

0.9985100286532951

In [50]:
print('Precision is', precision_score(y_test,y_pred_xgboost))
print('Recall is',recall_score(y_test,y_pred_xgboost))

Precision is 1.0
Recall is 0.5806451612903226


### Which model worked better and how do you know?

In [54]:
# Your response here
'''I've chosen XGBoost, which improved Precision from 85.7% to 100% while Recall stayed at 58.1%. However, here we are more interested in improving Recall'''

"I've chosen XGBoost, which improved Precision from 85.7% to 100% while Recall stayed at 58.1%. However, here we are more interested in improving Recall"

In [91]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state = 1,sampling_strategy=0.1)

In [95]:
X_train_SMOTE,y_train_SMOTE = sm.fit_resample(X_train,y_train)
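
SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. `sampling_strategy=0.1` asks for a minority class about one tenth the size of the majority class after resampling. The following is a simplified NumPy sketch of the interpolation idea only, not imblearn's implementation. The function `smote_like` and its parameters are hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)   # random base sample per new row
    pick = rng.integers(0, k, size=n_new)            # random neighbour of that base
    partner = neighbours[base, pick]
    lam = rng.random((n_new, 1))                     # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[partner] - X_min[base])

# Toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_min, n_new=5)
print(synthetic)
```

Every synthetic row lies on a segment between two real minority rows, so no values outside the observed minority range are produced.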

In [106]:
model = xgboost.XGBClassifier(max_depth=5)
model = model.fit(X_train_SMOTE, y_train_SMOTE)
y_pred_xgboost_smote = model.predict(X_test)
model.score(X_test,y_test)

0.9980515759312321

In [107]:
print('Precision is', precision_score(y_test,y_pred_xgboost_smote))
print('Recall is',recall_score(y_test,y_pred_xgboost_smote))

Precision is 0.7333333333333333
Recall is 0.7096774193548387


In [109]:
'''Now we increase Recall from 58.1% to 71.0%, which is preferable even though it comes at the cost of reducing Precision from 100% to 73.3%'''

'Now we increase Recall from 58.1% to 71.0%, which is preferable even though it comes at the cost of reducing Precision from 100% to 73.3%'
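
One way to make the preference for Recall explicit is an F-beta score with beta > 1, which weights recall more heavily than precision. The counts below are inferred from the printed precision and recall values and the 31 fraud cases in `y_test`. The choice beta = 2 is an illustrative assumption, not part of the lab.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from confusion-matrix counts; beta > 1 favours recall."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# (tp, fp, fn) inferred from the reported metrics on the 31 test frauds
models = {
    "logistic regression": (18, 3, 13),
    "xgboost": (18, 0, 13),
    "xgboost + SMOTE": (22, 8, 9),
}

scores = {name: f_beta(*counts) for name, counts in models.items()}
for name, score in scores.items():
    print(f"{name}: F2 = {score:.3f}")
```

Under this weighting the SMOTE-trained model scores highest, which matches the conclusion drawn from Recall alone.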
