# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: it gives good results in training and bad results when the model is applied to real data. Similarly, we could have a model that is too simplistic to accurately model the signal (underfitting). This produces a model that never works well.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from https://www.kaggle.com/ealaxi/paysim1 and import the dataset. Provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [75]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [47]:
# Your code here

# head() keeps the first 100,000 rows (steps 1 to 10 only), not a random sample
data = pd.read_csv('../Data.csv').head(100000)
data

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.00 | 0.00 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.00 | 0.00 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.00 | C1305486145 | 181.0 | 0.00 | C553264065 | 0.00 | 0.00 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.00 | C840083671 | 181.0 | 0.00 | C38997010 | 21182.00 | 0.00 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.00 | 0.00 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | 10 | PAYMENT | 4020.66 | C1410794718 | 159929.0 | 155908.34 | M1257036576 | 0.00 | 0.00 | 0 | 0 |
| 99996 | 10 | PAYMENT | 18345.49 | C744303677 | 6206.0 | 0.00 | M1785344556 | 0.00 | 0.00 | 0 | 0 |
| 99997 | 10 | CASH_IN | 183774.91 | C104331851 | 39173.0 | 222947.91 | C36392889 | 54925.05 | 0.00 | 0 | 0 |
| 99998 | 10 | CASH_OUT | 82237.17 | C707662966 | 6031.0 | 0.00 | C1553004158 | 592635.66 | 799140.46 | 0 | 0 |
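The note above asks for a sample of the data, while `head(100000)` keeps only the first rows of the file. The difference can be sketched on a small synthetic frame; the frame, its column values, and `random_state=0` are assumptions made purely for illustration.

```python
import pandas as pd

# Synthetic stand-in for the full table (the real file is far larger).
full = pd.DataFrame({"step": range(1, 1001), "amount": [10.0] * 1000})

# head(n) keeps the first n rows, so only the earliest steps survive.
first_rows = full.head(100)

# sample(n) draws n rows at random from the whole table.
random_rows = full.sample(n=100, random_state=0)

print(first_rows["step"].max())  # 100
print(len(random_rows), random_rows["step"].min(), random_rows["step"].max())
```

On the real file, the same pattern would be `pd.read_csv(...).sample(n=100000, random_state=0)`, at the cost of loading the whole CSV first.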


In [50]:
data.shape

(100000, 11)

In [51]:
data.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [52]:
data.isna().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [53]:
data.describe()

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 8.49964 | 173602.2 | 877757.5 | 894061.9 | 880504.8 | 1184041.0 | 0.00116 | 0.0 |
| std | 1.825545 | 344300.3 | 2673284.0 | 2711318.0 | 2402267.0 | 2802350.0 | 0.034039 | 0.0 |
| min | 1.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 8.0 | 9963.562 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 9.0 | 52745.52 | 20061.5 | 0.0 | 20839.43 | 49909.18 | 0.0 | 0.0 |
| 75% | 10.0 | 211763.1 | 190192.0 | 214813.2 | 588272.4 | 1058186.0 | 0.0 | 0.0 |
| max | 10.0 | 10000000.0 | 33797390.0 | 34008740.0 | 34008740.0 | 38946230.0 | 1.0 | 0.0 |


### What is the distribution of the outcome? 

In [54]:
# Your response here

data['isFraud'].value_counts()

0    99884
1      116
Name: isFraud, dtype: int64
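To make the imbalance explicit, the counts above can be turned into proportions. This is a minimal sketch that rebuilds the label column from the two counts shown, rather than reloading the CSV.

```python
import pandas as pd

# Rebuild the label column from the counts shown above.
is_fraud = pd.Series([0] * 99884 + [1] * 116, name="isFraud")

# normalize=True returns shares instead of raw counts.
share = is_fraud.value_counts(normalize=True)
fraud_share = share.loc[1]

print(f"fraud share: {fraud_share:.3%}")  # fraud share: 0.116%
print(round(99884 / 116))                 # 861 legitimate rows per fraud row
```

With `data` loaded, the equivalent call would be `data['isFraud'].value_counts(normalize=True)`.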

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [55]:
# Your code here

#Step column maps a unit of time in the real world. In this case 1 step is 1 hour of time.
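Since one step is one hour, one possible way to integrate the time variable is to split it into an hour of day and a day counter. The sketch below assumes that step 1 is the first hour of the first simulated day, which the dataset description does not guarantee; the column names `hour_of_day` and `day` are made up for this example.

```python
import pandas as pd

steps = pd.DataFrame({"step": [1, 2, 10, 24, 25, 48, 49]})

# Assumption: step 1 is the first hour of the simulation's first day.
steps["hour_of_day"] = (steps["step"] - 1) % 24
steps["day"] = (steps["step"] - 1) // 24

print(steps)
```

In the 100,000-row subset used here, `step` only spans 1 to 10, so these derived columns add little beyond the raw value.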

In [56]:
data['type'].value_counts()

PAYMENT     39512
CASH_OUT    30718
CASH_IN     20185
TRANSFER     8597
DEBIT         988
Name: type, dtype: int64

In [57]:
dummies=pd.get_dummies(data['type'],drop_first=True)
#dummies
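With `drop_first=True`, the alphabetically first category is dropped and becomes the implicit baseline. A small sketch on the five transaction types listed above shows which column disappears:

```python
import pandas as pd

types = pd.Series(
    ["PAYMENT", "CASH_OUT", "CASH_IN", "TRANSFER", "DEBIT"], name="type"
)

# Categories are sorted, then the first one (CASH_IN) is dropped.
dummies = pd.get_dummies(types, drop_first=True)

print(list(dummies.columns))  # ['CASH_OUT', 'DEBIT', 'PAYMENT', 'TRANSFER']
print(dummies.iloc[2].any())  # False: the CASH_IN row has no indicator set
```

A row with no indicator set is therefore a `CASH_IN` transaction.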

In [58]:
data_dummy=pd.merge(left= data,
                      right=dummies,
                      left_index=True,
                      right_index=True)

#data_dummy

In [59]:
data_dummy.drop(columns=['type','nameOrig','nameDest','isFlaggedFraud'], inplace=True)

In [60]:
data_dummy

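| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | CASH_OUT | DEBIT | PAYMENT | TRANSFER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 9839.64 | 170136.0 | 160296.36 | 0.00 | 0.00 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1864.28 | 21249.0 | 19384.72 | 0.00 | 0.00 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 181.00 | 181.0 | 0.00 | 0.00 | 0.00 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 181.00 | 181.0 | 0.00 | 21182.00 | 0.00 | 1 | 1 | 0 | 0 | 0 |
| 4 | 1 | 11668.14 | 41554.0 | 29885.86 | 0.00 | 0.00 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | 10 | 4020.66 | 159929.0 | 155908.34 | 0.00 | 0.00 | 0 | 0 | 0 | 1 | 0 |
| 99996 | 10 | 18345.49 | 6206.0 | 0.00 | 0.00 | 0.00 | 0 | 0 | 0 | 1 | 0 |
| 99997 | 10 | 183774.91 | 39173.0 | 222947.91 | 54925.05 | 0.00 | 0 | 0 | 0 | 0 | 0 |
| 99998 | 10 | 82237.17 | 6031.0 | 0.00 | 592635.66 | 799140.46 | 0 | 1 | 0 | 0 | 0 |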


### Run a logistic regression classifier and evaluate its accuracy.

In [67]:
from sklearn.model_selection import train_test_split

In [68]:
X=data_dummy.drop(['isFraud'],axis=1)
y=data_dummy['isFraud']

In [69]:
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.2,random_state=0)
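With so few fraud rows, a plain random split can leave the test set with more or fewer positives than its share. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. A minimal sketch on synthetic labels (the 990/10 split is an assumption chosen for easy arithmetic):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 990 negatives and 10 positives (1% positive).
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the 1% positive rate in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(y_tr.sum(), y_te.sum())  # 8 2
```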

In [70]:
# Your code here
# import the classifier
from sklearn.linear_model import LogisticRegression

In [71]:
# initialize the model
model = LogisticRegression()

# train the model on the training set, where the algorithm learns
model = model.fit(X_train, y_train)

In [72]:
from sklearn.metrics import accuracy_score

In [73]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [76]:
# and compare with the "real" data -> y_test

np.array(y_test)

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [77]:
accuracy_score(y_test, y_pred)

0.999
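An accuracy of 0.999 needs a reference point. The test split holds 20,000 rows, of which 22 are fraud according to the support column of the classification reports in this lab. The sketch below computes the accuracy of a trivial rule that always predicts "not fraud".

```python
# Test-set class counts (support column of the classification reports).
n_legit, n_fraud = 19978, 22

# Always predicting 0 is right on every legitimate row
# and wrong on every fraud row.
baseline_accuracy = n_legit / (n_legit + n_fraud)

print(round(baseline_accuracy, 4))  # 0.9989
```

The logistic regression's 0.999 is only 0.0001 above this baseline, and the k-nearest-neighbours score of 0.99875 reported later falls just below it, so accuracy alone says little on data this imbalanced.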

In [99]:
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [101]:
print("precision: ",precision_score(y_test,y_pred))
print("recall: ",recall_score(y_test,y_pred))
print(classification_report(y_test,y_pred))
confusion_matrix(y_test,y_pred)

precision:  0.75
recall:  0.13636363636363635
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     19978
           1       0.75      0.14      0.23        22

    accuracy                           1.00     20000
   macro avg       0.87      0.57      0.62     20000
weighted avg       1.00      1.00      1.00     20000



array([[19977,     1],
       [   19,     3]], dtype=int64)
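The precision and recall printed above follow directly from the confusion matrix. As a worked check, using scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix from the logistic regression above.
cm = np.array([[19977, 1],
               [19, 3]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of fraud predictions that were real fraud
recall = tp / (tp + fn)     # share of real fraud that was caught

print(precision, round(recall, 4))  # 0.75 0.1364
```

Only 3 of the 22 fraud cases in the test set were caught, which the overall accuracy hides.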

### Now pick a model of your choice and evaluate its accuracy.

In [86]:
# Your code here

from sklearn.neighbors import KNeighborsClassifier

# define hyperparameters here
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=3)

In [87]:
y_predict = knn.predict(X_test)
y_predict

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [88]:
np.array(y_test)

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [89]:
accuracy_score(y_test, y_predict)

0.99875

In [102]:
print("precision: ",precision_score(y_test,y_predict))
print("recall: ",recall_score(y_test,y_predict))
print(classification_report(y_test,y_predict))
confusion_matrix(y_test,y_predict)

precision:  0.2857142857142857
recall:  0.09090909090909091
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     19978
           1       0.29      0.09      0.14        22

    accuracy                           1.00     20000
   macro avg       0.64      0.55      0.57     20000
weighted avg       1.00      1.00      1.00     20000



array([[19973,     5],
       [   20,     2]], dtype=int64)

### Which model worked better and how do you know?

In [2]:
# Your response here

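# The first model (logistic regression) is better: on the fraud class it has higher precision (0.75 vs 0.29)
# and higher recall (0.14 vs 0.09), with fewer false positives (1 vs 5) and fewer false negatives (19 vs 20).
# Accuracy is nearly identical for both (0.999 vs 0.99875), so it does not separate the two models.
# Both models still miss most of the 22 fraud cases in the test set.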
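The comparison can be condensed into a single number per model with the F1 score of the fraud class, computed here from the counts in the two confusion matrices. The helper `f1_from_counts` is a name made up for this sketch, not a scikit-learn function.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 score for the positive class from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

# Counts for the fraud class, read off the two confusion matrices above.
f1_logreg = f1_from_counts(tp=3, fp=1, fn=19)
f1_knn = f1_from_counts(tp=2, fp=5, fn=20)

print(round(f1_logreg, 2), round(f1_knn, 2))  # 0.23 0.14
```

These match the f1-score column of the classification reports and rank the logistic regression ahead of the k-nearest-neighbours model.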
