# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when the data is noisy and we are not careful, we can end up fitting our model to the noise in the data rather than to the 'signal', the factors that actually determine the outcome. This is called overfitting: the model performs well in training and poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), in which case it performs poorly on both the training data and real data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [15]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report


In [2]:
# Your code here

# limit to 100000 rows
n = 100000

# load the first n rows (nrows reads from the top of the file; it is not a random sample)
paysim = pd.read_csv("/Users/rickardramhoj/Downloads/ntnu-testimon.csv", nrows=n)

# check shape
paysim.shape

(100000, 11)
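
The lab note asks for a sample, while `nrows=n` keeps the first `n` rows of the file; the `describe()` output for `step` (maximum 10) reflects that. A minimal sketch of the difference, using a small in-memory CSV as a hypothetical stand-in for the PaySim file (the column values are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical stand-in for the PaySim CSV: 1,000 rows ordered by step (10 steps).
csv_text = "step,amount\n" + "\n".join(
    f"{i // 100 + 1},{i * 1.5}" for i in range(1000)
)

# nrows reads from the top of the file, so only the earliest step is present.
head_rows = pd.read_csv(io.StringIO(csv_text), nrows=100)

# Reading everything and then sampling draws rows from the whole file.
random_rows = pd.read_csv(io.StringIO(csv_text)).sample(n=100, random_state=0)

print("steps in first 100 rows:", head_rows["step"].nunique())
print("steps in random sample: ", random_rows["step"].nunique())
```

On the real dataset this approach needs the whole file in memory before sampling, which may be too heavy for some machines; in that case reading the first rows, as done above, is a reasonable compromise as long as the limitation is kept in mind.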

In [3]:
# check data
display(paysim.head())
display(paysim.tail())

# check dtypes
display(paysim.dtypes)

# describe data
display(paysim.describe())


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
99995,10,PAYMENT,4020.66,C1410794718,159929.0,155908.34,M1257036576,0.0,0.0,0,0
99996,10,PAYMENT,18345.49,C744303677,6206.0,0.0,M1785344556,0.0,0.0,0,0
99997,10,CASH_IN,183774.91,C104331851,39173.0,222947.91,C36392889,54925.05,0.0,0,0
99998,10,CASH_OUT,82237.17,C707662966,6031.0,0.0,C1553004158,592635.66,799140.46,0,0
99999,10,PAYMENT,20096.56,C1868032458,110117.0,90020.44,M1419201886,0.0,0.0,0,0


step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,8.49964,173602.2,877757.5,894061.9,880504.8,1184041.0,0.00116,0.0
std,1.825545,344300.3,2673284.0,2711318.0,2402267.0,2802350.0,0.034039,0.0
min,1.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0
25%,8.0,9963.562,0.0,0.0,0.0,0.0,0.0,0.0
50%,9.0,52745.52,20061.5,0.0,20839.43,49909.18,0.0,0.0
75%,10.0,211763.1,190192.0,214813.2,588272.4,1058186.0,0.0,0.0
max,10.0,10000000.0,33797390.0,34008740.0,34008740.0,38946230.0,1.0,0.0


### What is the distribution of the outcome? 

In [4]:
# Your response here

# value counts for isFraud
display(paysim["isFraud"].value_counts())

# look at the fraud cases only
display(paysim[paysim["isFraud"] == 1].head())


0    99884
1      116
Name: isFraud, dtype: int64

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
251,1,TRANSFER,2806.0,C1420196421,2806.0,0.0,C972765878,0.0,0.0,1,0
252,1,CASH_OUT,2806.0,C2101527076,2806.0,0.0,C1007251739,26202.0,0.0,1,0
680,1,TRANSFER,20128.0,C137533655,20128.0,0.0,C1848415041,0.0,0.0,1,0
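
To put the class counts in perspective, the short calculation below derives the fraud rate and the accuracy of a trivial classifier that always predicts "not fraud". It only uses the counts printed by `value_counts()` above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
n_legit = 99884
n_fraud = 116
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the sample.
fraud_rate = n_fraud / n_total

# Accuracy of a model that always predicts class 0 (never flags fraud).
baseline_accuracy = n_legit / n_total

print(f"fraud rate: {fraud_rate:.5f}")                             # fraud rate: 0.00116
print(f"majority-class baseline accuracy: {baseline_accuracy:.5f}")  # 0.99884
```

Any accuracy figure for a classifier on this sample should be compared with that baseline rather than with 0.5.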


### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [5]:
# Your code here

# Each step is one hour of the simulation, so the integers already preserve the order of time. I think it is OK to keep the variable as integers.

# one-hot encode the transaction type and add the dummy columns (joined on the index)
paysim = paysim.join(pd.get_dummies(paysim['type'], prefix='type'))

# divide dataset into features and labels
features = paysim.drop(["isFraud", "type", "isFlaggedFraud", "nameOrig", "nameDest"], axis=1)
labels = paysim["isFraud"]

# divide into test and train
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0, test_size=0.25)

# check shapes
display(X_train.shape)
display(y_train.shape)
display(X_test.shape)
display(y_test.shape)

(75000, 11)

(75000,)

(25000, 11)

(25000,)
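
With only 116 fraud cases, a purely random split can give the test set more or fewer positives than its share (the reports further down show 26 fraud cases in the test set). One common option is to pass `stratify` to `train_test_split` so that both parts keep the same class proportion. The sketch below uses synthetic labels with the same class counts as this sample and a placeholder feature column, not the PaySim features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the sample above.
y = np.array([1] * 116 + [0] * 99884)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the fraud proportion the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print("fraud share overall:", y.mean())
print("fraud share in train:", y_tr.mean())
print("fraud share in test: ", y_te.mean())
```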

### Run a logistic regression classifier and evaluate its accuracy.

In [16]:
# Your code here

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train, y_train)

display("test accuracy:", model.score(X_test, y_test))
display("train accuracy:", model.score(X_train, y_train))

pred = model.predict(X_test)

print(classification_report(y_test, pred))

'test accuracy:'

0.99908

'train accuracy:'

0.9988666666666667

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     24974
           1       0.80      0.15      0.26        26

    accuracy                           1.00     25000
   macro avg       0.90      0.58      0.63     25000
weighted avg       1.00      1.00      1.00     25000
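
The low recall for class 1 is typical when one class dominates the training loss. One option worth trying is `class_weight="balanced"`, which weights each class inversely to its frequency. The sketch below is an illustration on synthetic data from `make_classification` (about 2% positives), not a rerun on PaySim, so the numbers it prints say nothing about the lab dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data: roughly 2% positives.
X, y = make_classification(
    n_samples=5000, n_features=6, n_informative=3,
    weights=[0.98, 0.02], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Same model twice: once unweighted, once with balanced class weights.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall without class weights:        {recall_plain:.2f}")
print(f"recall with class_weight='balanced': {recall_weighted:.2f}")
```

Balanced weights usually raise recall on the minority class at the cost of more false positives, so precision should be checked alongside it.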



### Now pick a model of your choice and evaluate its accuracy.

In [17]:
# Your code here

# import the decision tree classifier
from sklearn.tree import DecisionTreeClassifier

# create model
model = DecisionTreeClassifier(max_depth=3)

# fit model
model.fit(X_train, y_train)

display("test accuracy:", model.score(X_test, y_test))
display("train accuracy:", model.score(X_train, y_train))

pred = model.predict(X_test)

print(classification_report(y_test, pred))

'test accuracy:'

0.999

'train accuracy:'

0.99884

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     24974
           1       1.00      0.04      0.07        26

    accuracy                           1.00     25000
   macro avg       1.00      0.52      0.54     25000
weighted avg       1.00      1.00      1.00     25000
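
Another option that works with any classifier is to rebalance the training data before fitting, for example by randomly undersampling the majority class. The sketch below shows the mechanics on a small hypothetical frame; the 5:1 ratio is an arbitrary choice for illustration. Only the training split should be resampled, so that the test set keeps the real class distribution.

```python
import pandas as pd

# Hypothetical training frame: 10 fraud rows and 990 legitimate rows.
train = pd.DataFrame({"amount": range(1000), "isFraud": [1] * 10 + [0] * 990})

# Keep every fraud row and a random subset of the legitimate rows (5 per fraud row).
fraud = train[train["isFraud"] == 1]
legit = train[train["isFraud"] == 0].sample(n=len(fraud) * 5, random_state=0)

# Combine and shuffle so the classes are not grouped together.
balanced = pd.concat([fraud, legit]).sample(frac=1, random_state=0)

print(balanced["isFraud"].value_counts())
```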



### Which model worked better and how do you know?

In [8]:
# Your response here

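# The logistic regression worked better. Both models have almost the same test accuracy (0.99908 vs. 0.999), but accuracy says little here because about 99.9% of the transactions are not fraud.
# For the fraud class, the logistic regression has the higher recall (0.15 vs. 0.04) and F1-score (0.26 vs. 0.07), although the decision tree has the higher precision (1.00 vs. 0.80).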
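
To make the comparison concrete, the rounded scores can be turned back into counts. The true-positive, false-positive and false-negative counts below are inferred from the two classification reports (26 fraud cases among 25,000 test rows); they were not read from an actual confusion matrix, so treat them as an assumption that happens to reproduce the reported figures.

```python
def summarize(tp, fp, fn, n_total):
    """Return precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (n_total - fp - fn) / n_total
    return precision, recall, f1, accuracy

n_test = 25000

# (tp, fp, fn) inferred from the rounded reports above, not measured directly.
models = {
    "logistic regression": (4, 1, 22),
    "decision tree": (1, 0, 25),
}

for name, (tp, fp, fn) in models.items():
    p, r, f1, acc = summarize(tp, fp, fn, n_test)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.5f}")
# logistic regression: precision=0.80 recall=0.15 f1=0.26 accuracy=0.99908
# decision tree: precision=1.00 recall=0.04 f1=0.07 accuracy=0.99900
```

Under these counts the logistic regression catches 4 of the 26 fraud cases and the decision tree catches 1, which the near-identical accuracies hide.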
