# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting, and it produces good results in training but bad results when the model is applied to real data. Similarly, a model can be too simplistic to capture the signal accurately (underfitting), which produces a model that never works well.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/ealaxi/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset; use a sample of n=100000 rows instead, so your computer doesn't freeze.

In [35]:
import pandas as pd
import numpy as np

In [36]:
# Your response here
data = pd.read_csv("data.csv")

In [37]:
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [38]:
data.shape

(6362620, 11)

In [39]:
n = 100000
data_sample = data.sample(n=n, axis=0)

data_sample.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
5870199,403,PAYMENT,17509.95,C482191795,303892.08,286382.13,M344648530,0.0,0.0,0,0
4193018,305,CASH_OUT,164344.44,C1742479150,48329.6,0.0,C1268033679,204610.35,368954.79,0,0
314193,16,CASH_IN,182877.5,C1625640897,3824909.55,4007787.05,C2141776586,3283310.23,2965018.09,0,0
5790096,401,TRANSFER,136487.99,C103052026,2052.76,0.0,C937885086,2215463.06,2351951.05,0,0
3911997,284,CASH_IN,217344.37,C495572889,4917976.08,5135320.46,C1487826082,11181782.39,10964438.01,0,0


In [40]:
data_sample.shape

(100000, 11)

### What is the distribution of the outcome? 

In [41]:
# Your response here
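
One way to answer this question is to count the values of `isFraud` and then normalize the counts. The sketch below uses a small hypothetical frame (`toy_sample`, with made-up counts) as a stand-in for `data_sample`, so the numbers it prints are illustrative only and say nothing about the real fraud rate.

```python
import pandas as pd

# Hypothetical stand-in for data_sample: 1,000 rows, 2 of them fraudulent.
toy_sample = pd.DataFrame({"isFraud": [0] * 998 + [1] * 2})

# Absolute number of rows per class.
counts = toy_sample["isFraud"].value_counts()

# Relative frequencies; normalize=True divides each count by the total.
shares = toy_sample["isFraud"].value_counts(normalize=True)

print(counts)
print(shares)
```

Running the same two `value_counts` calls on `data_sample["isFraud"]` would show how rare the positive class is in the actual sample.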

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [42]:
# Your code here
data_sample.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [43]:
data_sample.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64
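
For the time variable, one possible approach is to derive calendar-like features from `step` instead of using the raw integer. The sketch below assumes that one step corresponds to one hour, which is how the dataset description on Kaggle presents it; the column names `hour_of_day` and `day` are hypothetical, and the toy `step` values are taken from the sample rows shown earlier.

```python
import pandas as pd

# Toy frame with a few step values from the sample rows above.
toy = pd.DataFrame({"step": [1, 16, 284, 305, 403]})

# Assumption: 1 step = 1 hour, so the remainder modulo 24 is an hour-of-day
# bucket (which clock hour step 1 maps to is unknown, so this is relative).
toy["hour_of_day"] = toy["step"] % 24

# Whole days elapsed since the start of the simulation.
toy["day"] = toy["step"] // 24

print(toy)
```

A feature like `hour_of_day` lets a model pick up daily patterns that the monotonically increasing `step` value hides. Whether it helps here is something to check against the data.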

In [44]:
data_sample["type"].value_counts()

CASH_OUT    35351
PAYMENT     33775
CASH_IN     21904
TRANSFER     8311
DEBIT         659
Name: type, dtype: int64

In [45]:
dummies = pd.get_dummies(data_sample["type"])

In [46]:
new_data = pd.concat([data_sample, dummies], axis=1)

In [47]:
new_data.drop("type", inplace = True, axis = 1)

In [48]:
new_data.drop("nameOrig", inplace = True, axis = 1)  # dropped because it is a string identifier

In [49]:
new_data.drop("nameDest", inplace = True, axis = 1)  # dropped because it is a string identifier

In [50]:
new_data.drop("isFlaggedFraud", inplace = True, axis = 1)  # dropped because all of its values are 0

In [51]:
new_data.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,CASH_IN,CASH_OUT,DEBIT,PAYMENT,TRANSFER
5870199,403,17509.95,303892.08,286382.13,0.0,0.0,0,0,0,0,1,0
4193018,305,164344.44,48329.6,0.0,204610.35,368954.79,0,0,1,0,0,0
314193,16,182877.5,3824909.55,4007787.05,3283310.23,2965018.09,0,1,0,0,0,0
5790096,401,136487.99,2052.76,0.0,2215463.06,2351951.05,0,0,0,0,0,1
3911997,284,217344.37,4917976.08,5135320.46,11181782.39,10964438.01,0,1,0,0,0,0


In [52]:
new_data.dtypes

step                int64
amount            float64
oldbalanceOrg     float64
newbalanceOrig    float64
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
CASH_IN             uint8
CASH_OUT            uint8
DEBIT               uint8
PAYMENT             uint8
TRANSFER            uint8
dtype: object

In [53]:
new_data.corr()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,CASH_IN,CASH_OUT,DEBIT,PAYMENT,TRANSFER
step,1.0,0.022357,-0.013393,-0.013325,0.026208,0.02438,0.035586,0.007531,-0.018334,0.001295,0.008759,0.00508
amount,0.022357,1.0,-0.004457,-0.009061,0.271424,0.433445,0.073067,-0.010418,-0.003892,-0.022992,-0.192782,0.359373
oldbalanceOrg,-0.013393,-0.004457,1.0,0.998834,0.064978,0.041942,0.007765,0.50808,-0.201436,-0.021589,-0.189429,-0.081529
newbalanceOrig,-0.013325,-0.009061,0.998834,1.0,0.066294,0.041518,-0.009859,0.528501,-0.211654,-0.021998,-0.193733,-0.086935
oldbalanceDest,0.026208,0.271424,0.064978,0.066294,1.0,0.97841,-0.0051,0.071319,0.085454,0.011058,-0.224243,0.126101
newbalanceDest,0.02438,0.433445,0.041942,0.041518,0.97841,1.0,0.002068,0.032236,0.093539,0.008347,-0.232754,0.186032
isFraud,0.035586,0.073067,0.007765,-0.009859,-0.0051,0.002068,1.0,-0.018736,0.017057,-0.002881,-0.025265,0.042662
CASH_IN,0.007531,-0.010418,0.50808,0.528501,0.071319,0.032236,-0.018736,1.0,-0.391622,-0.043135,-0.378211,-0.159447
CASH_OUT,-0.018334,-0.003892,-0.201436,-0.211654,0.085454,0.093539,0.017057,-0.391622,1.0,-0.060228,-0.528088,-0.222632
DEBIT,0.001295,-0.022992,-0.021589,-0.021998,0.011058,0.008347,-0.002881,-0.043135,-0.060228,1.0,-0.058165,-0.024521


### Run a logistic regression classifier and evaluate its accuracy.

In [57]:
# Your code here
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

features = new_data.drop("isFraud", axis = 1)
target = new_data["isFraud"]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state = 0)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

print(log_reg.score(X_test, y_test))
print(f"Training data accuracy was {log_reg.score(X_train, y_train)}")

0.9985
Training data accuracy was 0.9987
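
With imbalanced classes, an accuracy this high needs a baseline to compare against. The sketch below uses hypothetical labels (not the lab data) to show that a "model" which never predicts fraud can still reach very high accuracy while catching no fraud at all, which is why recall and the confusion matrix are worth checking alongside accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 1,000 transactions, 2 of them fraudulent.
y_true = np.array([0] * 998 + [1] * 2)

# A trivial baseline that always predicts "not fraud".
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)   # share of correct predictions
rec = recall_score(y_true, y_pred)     # share of fraud cases that were caught
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted

print(f"accuracy = {acc:.3f}")
print(f"recall on fraud = {rec:.3f}")
print(cm)
```

On the lab data, the same three metrics could be computed from `y_test` and `log_reg.predict(X_test)`. Passing `stratify=target` to `train_test_split` and `class_weight="balanced"` to `LogisticRegression` are two standard scikit-learn options that may be worth trying here.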


### Now pick a model of your choice and evaluate its accuracy.

In [59]:
# Your code here
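# Extreme Gradient Boosting
import xgboost

# Caution: XGBRegressor treats isFraud as a continuous target, so .score()
# returns R^2 rather than classification accuracy. The values printed below
# are therefore not directly comparable to the logistic regression accuracy;
# xgboost.XGBClassifier would be needed to obtain an accuracy score.
xgb_reg = xgboost.XGBRegressor(max_depth = 5, n_estimators = 500)
xgb_reg.fit(X_train, y_train)

print(xgb_reg.score(X_test, y_test))
print(f"Training data accuracy was {xgb_reg.score(X_train, y_train)}")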

0.5383219853506562
Training data accuracy was 0.9989275131167832
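
Because `XGBRegressor` is a regressor, its `score` method follows the scikit-learn convention and reports R², not accuracy. The sketch below uses hypothetical labels and outputs (not the lab results) to show how far apart the two metrics can be for the same predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score

# Hypothetical test labels: 10 transactions, 2 of them fraudulent.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Hypothetical continuous outputs, as a regressor would produce them.
y_score = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.6])

# R^2 penalizes the distance between 0.6 and the true value 1.
r2 = r2_score(y_true, y_score)

# Thresholding at 0.5 turns the scores into class labels for accuracy.
y_label = (y_score >= 0.5).astype(int)
acc = accuracy_score(y_true, y_label)

print(f"R^2 = {r2:.2f}, accuracy = {acc:.2f}")
```

Here every transaction is classified correctly after thresholding, yet R² is well below 1. A low R² from a regressor therefore does not by itself show that it classifies worse than the logistic regression.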


### Which model worked better and how do you know?

In [None]:
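# Your response here: Logistic regression scored higher on the test set (0.9985 vs. 0.5383). The gap between the Extreme Gradient Boosting training score (0.9989) and test score (0.5383) suggests it is overfitting. Caveat: XGBRegressor.score reports R^2 while LogisticRegression.score reports accuracy, so the two scores are not directly comparable.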
