# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: it gives good results in training and bad results when the model is applied to real data. Similarly, we could have a model that is too simplistic to accurately model the signal (underfitting). This produces a model that never works well, in training or on real data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import RobustScaler, StandardScaler, PolynomialFeatures, MinMaxScaler
import seaborn as sns

In [26]:
# Note: this draws 1,000,000 rows, ten times the n=100000 suggested in the instructions above.
fraud = pd.read_csv('C:/Users/Zaca/Documents/Datasets/simulated_payment_data/fraud.csv').sample(1000000)

In [27]:
fraud.shape

(1000000, 11)

In [28]:
fraud.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2345681,189,CASH_OUT,24343.06,C1172641997,155.0,0.0,C715885997,0.0,24343.06,0,0
5501018,380,CASH_OUT,56113.43,C1178267475,30564.0,0.0,C1674925903,0.0,56113.43,0,0
874737,42,PAYMENT,21914.32,C2086871501,537012.0,515097.68,M1875681686,0.0,0.0,0,0
6051973,495,CASH_IN,152124.48,C2092445131,7231715.65,7383840.13,C532019803,1786875.29,1634750.81,0,0
6151179,546,TRANSFER,457713.12,C970587691,0.0,0.0,C427165150,5822106.49,6279819.62,0,0


In [29]:
fraud.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [30]:
# there are only 5 transaction types, so we can label-encode them
fraud.type.value_counts()

CASH_OUT    351440
PAYMENT     338098
CASH_IN     220008
TRANSFER     83989
DEBIT         6465
Name: type, dtype: int64

In [31]:
# the other object columns are nameDest and nameOrig. They have too many unique values, so I will drop them.
fraud.nameDest.value_counts()

C1590550415    26
C1286084959    23
C1978557426    20
C985934102     19
C1027017168    19
               ..
M1324883766     1
C346358398      1
M1382740876     1
M1200518987     1
M916026574      1
Name: nameDest, Length: 648603, dtype: int64

In [32]:
fraud.drop(labels=['nameDest', 'nameOrig'], axis=1, inplace=True)

### What is the distribution of the outcome? 

In [33]:
fraud.isFraud.value_counts(normalize=True)

0    0.998703
1    0.001297
Name: isFraud, dtype: float64
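
To put the distribution above in perspective, here is a small sketch of the accuracy a classifier would reach by never predicting fraud at all. The labels are synthetic (an assumption), built to match the roughly 0.13% fraud rate shown above, and `DummyClassifier` is scikit-learn's baseline estimator rather than something this lab uses.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (assumption): 1,297 frauds per 1,000,000 rows, as in the output above.
n_rows, n_fraud = 1_000_000, 1_297
y = np.zeros(n_rows, dtype=int)
y[:n_fraud] = 1
X = np.zeros((n_rows, 1))  # the features are irrelevant to this baseline

# Always predict the most frequent class (0, i.e. "not fraud").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred)
print(f"accuracy: {baseline_accuracy:.4%}, recall: {baseline_recall:.0%}")
```

Any model evaluated on accuracy alone has to be read against this baseline: a score near 99.87% can be reached without detecting a single fraud.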

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [34]:
# Why not? Much like time, it is an arbitrary number. As long as the same number always represents the same hour...
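
One possible alternative to using `step` as-is is to derive an hour-of-day feature, so that the same hour on different days shares a value. This sketch assumes, following the dataset's Kaggle description, that one step is one hour of simulated time; the column names `hour` and `day` are hypothetical, and which clock hour corresponds to `hour == 0` is not known.

```python
import pandas as pd

# Toy frame using the `step` values shown in the head() output above.
toy = pd.DataFrame({"step": [189, 380, 42, 495, 546]})

# Assumption: step counts hours from the start of the simulation.
toy["hour"] = toy["step"] % 24   # position within a 24-hour cycle, 0-23
toy["day"] = toy["step"] // 24   # whole 24-hour periods elapsed
print(toy)
```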

In [35]:
# label the type of transaction:
le = LabelEncoder()
label_cols = ['type']
fraud[label_cols] = fraud[label_cols].apply(le.fit_transform)
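
A caveat on the encoding above: `LabelEncoder` maps the five types to the integers 0-4, and a linear model such as logistic regression will read those integers as an ordered quantity. A minimal sketch of one alternative, one-hot encoding with `pd.get_dummies`, on a toy frame (the frame and the name `encoded` are assumptions for illustration):

```python
import pandas as pd

# Toy frame with the five transaction types seen in value_counts() above.
toy = pd.DataFrame({
    "type": ["CASH_OUT", "PAYMENT", "CASH_IN", "TRANSFER", "DEBIT"],
    "amount": [24343.06, 21914.32, 152124.48, 457713.12, 100.0],
})

# One indicator column per type, so no artificial ordering is implied.
encoded = pd.get_dummies(toy, columns=["type"])
print(encoded.columns.tolist())
```

Tree-based models are usually less sensitive to this choice, since they can split on individual integer codes.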

### Run a logistic regression classifier and evaluate its accuracy.

In [36]:
y = fraud['isFraud']
X = fraud.drop(labels='isFraud', axis=1)

#scaler = StandardScaler()
#scaler.fit_transform(X)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
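
With a target this rare, a plain random split can leave the train and test sets with slightly different fraud rates. A sketch of a stratified split, using synthetic stand-in data (an assumption, not the PaySim sample) and the `stratify` and `random_state` parameters of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 10,000 rows with a 1% positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1

# stratify=y_demo keeps the class ratio the same in both splits;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
print(y_tr.mean(), y_te.mean())
```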

In [48]:
lr = LogisticRegression()  # option left disabled: class_weight='balanced'
lr.fit(X_train, y_train)
acc = lr.score(X_test, y_test)*100

print(f"Logistic Regression Test Accuracy {round(acc, 2)}%")



Logistic Regression Test Accuracy 99.81%


In [49]:
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = lr.predict(X_test)
print(accuracy_score(y_test, y_pred)*100)
cm = confusion_matrix(y_test, y_pred)
print(cm)

print('Precision ', cm[1,1]/(cm[0,1] + cm[1,1]))

99.8112
[[249394    307]
 [   165    134]]
Precision  0.30385487528344673
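
Precision is only half of the picture for a rare class. As a worked example, the sketch below recomputes precision, recall, and F1 directly from the confusion matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[249394, 307],
               [165, 134]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The same quantities are available from `sklearn.metrics` as `precision_score`, `recall_score`, and `f1_score`, given `y_test` and `y_pred`.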


### Now pick a model of your choice and evaluate its accuracy.

In [46]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
acc = dtc.score(X_test, y_test)*100
print(f"Decision Tree Test Accuracy {round(acc, 2)}%")
cm = confusion_matrix(y_test, y_pred)
print(cm)

print('Precision ', cm[1,1]/(cm[0,1] + cm[1,1]))

Decision Tree Test Accuracy 99.96%
[[249645     56]
 [    51    248]]
Precision  0.8157894736842105


### Which model worked better and how do you know?

In [50]:
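# I believe the decision tree is the better model because it has higher precision. The target 1s are VERY
# rare, so accuracy says little and precision might be the best measure. In this case DTC has about 82% precision
# while LR has about 30%. DTC also catches more of the frauds (248 of 299, versus 134 of 299 for LR).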

In [51]:
# So what am I learning about imbalance here? I'm not clear.
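
One way to make the effect of imbalance visible is to fit the same logistic regression with and without `class_weight='balanced'` and compare precision and recall on the rare class. The sketch below uses synthetic data from `make_classification` (an assumption, not the PaySim sample), so the numbers it prints say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): a small minority class of roughly 1-2%.
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=8, n_informative=4,
    weights=[0.99], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

results = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    # class_weight='balanced' reweights errors inversely to class frequency.
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[label] = {
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred),
    }
    print(label, results[label])
```

Typically the balanced model flags many more positives, trading precision for recall, while accuracy barely distinguishes the two. That trade-off is what accuracy hides on imbalanced data.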
