# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: the model performs well in training and poorly when it is applied to real data. Similarly, a model can be too simplistic to capture the signal accurately (underfitting), which produces a model that never works well.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/ealaxi/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [100]:
import pandas as pd
import numpy as np

download = pd.read_csv('/Users/tiagoornelas/Downloads/PS_20174392719_1491204439457_log.csv')



In [101]:
data = download.sample(100000)

### What is the distribution of the outcome? 

In [102]:
data['isFraud'].value_counts()

0    99873
1      127
Name: isFraud, dtype: int64
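
With counts this skewed, it helps to work out the fraud rate and the accuracy that a trivial "always predict 0" classifier would reach on this sample. A minimal sketch, using the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
n_legit = 99873
n_fraud = 127
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the sample
fraud_rate = n_fraud / n_total

# Accuracy of a classifier that always predicts "not fraud"
baseline_accuracy = n_legit / n_total

print(f"fraud rate: {fraud_rate:.5f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.5f}")
```

Any model evaluated by accuracy on this sample should be compared against that baseline rather than against 0.5.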

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [103]:

display(data.head(3))
data.dtypes

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6193791 | 571 | CASH_IN | 117210.89 | C1305827522 | 3450017.89 | 3567228.79 | C1603305724 | 156595.94 | 39385.04 | 0 | 0 |
| 2085534 | 182 | PAYMENT | 10490.04 | C1504874999 | 18531.0 | 8040.96 | M942041713 | 0.0 | 0.0 | 0 | 0 |
| 1338882 | 137 | CASH_OUT | 354353.49 | C1580171723 | 828.0 | 0.0 | C1086480347 | 30216.95 | 384570.43 | 0 | 0 |


step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [104]:
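# Each step is one hour of simulated time. Integer division gives the day
# (1-based, so it can be used as a day of the month) and the remainder gives the hour.
# round() would push the second half of each day into the following day.
data['days'] = data['step'] // 24 + 1
data['hours'] = data['step'] % 24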


In [105]:
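def time(row):
    # Build a zero-padded 'dd/01/2022 HH:00:00' string; the month and year are arbitrary placeholders.
    # int() drops any trailing '.0' without also stripping the zero from days such as 10, 20 and 30.
    day = f"{int(row['days']):02d}"
    hour = f"{int(row['hours']):02d}"
    date = '01/2022'
    day_hour_month_year = day + '/' + date + ' ' + hour + ':00:00'

    return pd.to_datetime(day_hour_month_year, format='%d/%m/%Y %H:%M:%S')


data['step'] = data.apply(time, axis=1)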
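
The row-wise string formatting above can also be done with vectorized pandas operations. The sketch below assumes, as the cells above do, that each step is one hour and that the simulation starts on an arbitrary date (1 January 2022 here). The toy `steps` frame and the column names `timestamp`, `day`, and `hour` are illustrative and not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the 'step' column (hours since the start of the simulation)
steps = pd.DataFrame({"step": [1, 23, 24, 137, 571]})

# Assumed, arbitrary start date; only the offsets between rows are meaningful
start = pd.Timestamp("2022-01-01")

# Add 'step' hours to the start date in one vectorized operation
steps["timestamp"] = start + pd.to_timedelta(steps["step"], unit="h")

# Calendar-style features derived from the timestamp
steps["day"] = steps["timestamp"].dt.day
steps["hour"] = steps["timestamp"].dt.hour

print(steps)
```

This avoids `apply` with `axis=1`, which tends to be slow on 100,000 rows, and it cannot produce an invalid day-of-month string.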

In [116]:
import datetime as dt

# toordinal() returns the day number of the date, so the hour of day is discarded here
data['step']=data['step'].map(dt.datetime.toordinal)

In [106]:
data = data.drop(['days', 'hours', 'nameOrig', 'nameDest'], axis=1)

In [110]:
dummies = pd.get_dummies(data['type'], prefix='type')

data = pd.concat([data, dummies], axis=1)

data = data.drop(['type'], axis=1)

### Run a logistic regression classifier and evaluate its accuracy.

In [118]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = data.drop(['isFraud', 'isFlaggedFraud'], axis=1)
y = data['isFraud']
X_train, X_test, y_train, y_test = train_test_split(X,y)
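
Because fraud cases are so rare, a purely random split can leave the test set with very few positives. One option is a stratified split, which keeps the class ratio the same in both parts. A sketch on synthetic labels (the toy arrays, the `test_size`, and the `random_state` value are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 20 of them positive (2%)
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 980)

# stratify=y_toy preserves the 2% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print("positives in train:", y_tr.sum(), "of", len(y_tr))
print("positives in test: ", y_te.sum(), "of", len(y_te))
```

Fixing `random_state` also makes the split, and therefore the reported scores, reproducible between runs.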

In [124]:
model = LogisticRegression()

model.fit(X_train, y_train)

print('train data accuracy was', model.score(X_train, y_train))
print('test data accuracy was', model.score(X_test, y_test))

train data accuracy was 0.9993733333333333
test data accuracy was 0.9994
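
Accuracy alone says little here, because a model that never predicts fraud already scores above 0.99 on data this skewed. A sketch of metrics that look at the positive class, using hand-made labels rather than the lab's data (the arrays below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up ground truth: 1000 transactions, 10 of them fraudulent
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that never predicts fraud
y_never = np.zeros_like(y_true)

# A model that catches 6 of the 10 frauds and raises 4 false alarms
y_some = y_true.copy()
y_some[6:10] = 0      # 4 missed frauds
y_some[10:14] = 1     # 4 false alarms

for name, y_pred in [("never fraud", y_never), ("some fraud", y_some)]:
    print(name)
    print("  accuracy :", accuracy_score(y_true, y_pred))
    # zero_division=0 avoids a warning when no positives are predicted
    print("  precision:", precision_score(y_true, y_pred, zero_division=0))
    print("  recall   :", recall_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))
```

The two sets of predictions differ by only 0.002 in accuracy, while recall goes from 0 to 0.6, which is the kind of difference the accuracy figures above cannot reveal.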


### Now pick a model of your choice and evaluate its accuracy.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Candidate hyperparameter values for the grid search
n_estimators = [10, 15, 20]
max_depth = [3, 6, 10]

grid = {'n_estimators': n_estimators, 'max_depth': max_depth}

model = RandomForestClassifier()

grid_search = GridSearchCV(estimator=model, param_grid=grid, cv = 5)

grid_search.fit(X_train, y_train)
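
Once the search has been fitted, the chosen hyperparameters can be read from `best_params_` instead of being typed in by hand, and the `scoring` argument lets the search optimise something other than accuracy. A small sketch on synthetic data (the dataset, the reduced grid, and the choice of `average_precision` as the score are assumptions for illustration, not the lab's setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data: roughly 5% positives
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, weights=[0.95], random_state=0
)

small_grid = {"n_estimators": [10, 20], "max_depth": [3, 6]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
    scoring="average_precision",  # less dominated by the majority class than accuracy
)
search.fit(X_syn, y_syn)

print("best parameters:", search.best_params_)
print("best CV score:  ", search.best_score_)

# best_estimator_ has already been refit on all the data passed to fit()
best_forest = search.best_estimator_
```

Using `best_estimator_` directly keeps the final model consistent with whatever the search actually selected.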


In [131]:

forest = RandomForestClassifier(
    n_estimators=10,
    max_depth=10)


forest.fit(X_train, y_train)

forest.score(X_test, y_test)

0.99956

### Which model worked better and how do you know?

In [None]:
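# After hyperparameter tuning, the random forest has a marginally higher test accuracy (0.99956) than the logistic regression (0.9994).
# The gap is small, and since 99,873 of the 100,000 sampled rows are non-fraud, accuracy alone cannot show which model is better at detecting fraud.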
