# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', that is, the factors that actually determine the outcome. This is called overfitting: the model performs well in training and poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), in which case it performs poorly on both the training data and new data.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from imblearn.under_sampling import NearMiss, RandomUnderSampler
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import ElasticNet
from sklearn.neural_network import MLPClassifier

### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?

In [2]:
# Your code here
data = pd.read_csv("PS.csv")

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
step              int64
type              object
amount            float64
nameOrig          object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest          object
oldbalanceDest    float64
newbalanceDest    float64
isFraud           int64
isFlaggedFraud    int64
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [4]:
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


### What is the distribution of the outcome? 

In [5]:
data.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [6]:
# Your response here
data['isFraud'].value_counts()

0    6354407
1       8213
Name: isFraud, dtype: int64
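
The counts above show how skewed the outcome is. As a quick sketch in plain Python, using only the two counts printed above, the fraud rate and the accuracy of a trivial classifier that always predicts "not fraud" can be worked out as follows. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
n_legit = 6_354_407
n_fraud = 8_213
n_total = n_legit + n_fraud

# Share of transactions labelled as fraud.
fraud_rate = n_fraud / n_total

# A model that always predicts 0 (not fraud) is right on every legitimate
# transaction and wrong on every fraudulent one.
baseline_accuracy = n_legit / n_total

print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-predict-0 accuracy: {baseline_accuracy:.4%}")
```

Because this baseline is already above 99.8% accurate, accuracy alone says little on this dataset; precision and recall on the fraud class are more informative.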

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [7]:
data[data['isFraud'] == 1]['type'].value_counts()

CASH_OUT    4116
TRANSFER    4097
Name: type, dtype: int64

In [8]:
# Your code here
"""
Fraud only occurs in CASH_OUT and TRANSFER transactions, so we can drop the other types
"""

data_reduced = data[data['type'].str.contains('CASH_OUT|TRANSFER')].copy()

In [9]:
data_reduced['isFraud'].value_counts()
"""
We still have an imbalanced dataset
"""

'\nWe still have an imbalanced dataset\n'

In [10]:
data_reduced.drop(['nameOrig', 'nameDest'], axis=1, inplace=True)
"""
I delete these two categorical columns because I can't do anything with them.
"""

"\nI delete these two categorical columns because I can't do anything with them.\n"

In [11]:
data_reduced = pd.get_dummies(data_reduced, columns = ['type'])
"""
Doing one hot encoding with the type column
"""

'\nDoing one hot encoding with the type column\n'

In [12]:
data_reduced.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_CASH_OUT,type_TRANSFER
2,1,181.0,181.0,0.0,0.0,0.0,1,0,0,1
3,1,181.0,181.0,0.0,21182.0,0.0,1,0,1,0
15,1,229133.94,15325.0,0.0,5083.0,51513.44,0,0,1,0
19,1,215310.3,705.0,0.0,22425.0,0.0,0,0,0,1
24,1,311685.89,10835.0,0.0,6267.0,2719172.89,0,0,0,1
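
The cleaning cells above leave `step` as a raw integer. One possible way to integrate the time variable, sketched here on a toy frame, assumes that each step is one hour of simulated time (as the dataset's Kaggle description states) and derives an hour-of-day and a day index from it. The column names `hour_of_day` and `day` are illustrative and not part of the original dataset.

```python
import pandas as pd

# Toy stand-in for the transactions table; only the `step` column matters here.
toy = pd.DataFrame({"step": [1, 23, 24, 25, 49, 743]})

# Assumption: one step equals one hour, so the remainder modulo 24 is a
# cyclic hour-of-day feature and integer division by 24 is a day index.
toy["hour_of_day"] = toy["step"] % 24
toy["day"] = toy["step"] // 24

print(toy)
```

The raw step grows monotonically over the simulation, so a model can only use it as a trend. An hour-of-day feature lets the model pick up daily patterns instead, although how step 1 lines up with real clock time is not known from the data shown here.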


In [13]:
random_sampler = RandomUnderSampler(random_state=42)

X_rus, y_rus = random_sampler.fit_resample(data_reduced.drop('isFraud', axis = 1), data_reduced['isFraud'])


print(X_rus.shape)
print(y_rus.shape)
"""
Random Under Sample
"""



(16426, 9)
(16426,)


'\nRandom Under Sample\n'

In [14]:
# NearMiss is deterministic; recent imbalanced-learn releases no longer accept random_state here.
nr = NearMiss()
X_nr, y_nr = nr.fit_resample(data_reduced.drop('isFraud', axis=1), data_reduced['isFraud'])

print(X_nr.shape)
print(y_nr.shape)

(16426, 9)
(16426,)


### Run a logistic regression classifier and evaluate its accuracy.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(data_reduced.drop('isFraud', axis=1),
                                                    data_reduced['isFraud'], random_state = 42, test_size=0.2)
X_train_rus, X_test_rus, y_train_rus, y_test_rus = train_test_split(X_rus, y_rus, 
                                                                    random_state=42, test_size=0.2)
X_train_nr, X_test_nr, y_train_nr, y_test_nr = train_test_split(X_nr, y_nr, 
                                                                  random_state=42, test_size=0.2)
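
The cell above splits data that has already been resampled, so each resampled test set is balanced too and no longer reflects the original class ratio. A common alternative is to split first and resample only the training part. The sketch below illustrates that order on synthetic data, with a hand-rolled random undersampler in NumPy standing in for `RandomUnderSampler`. All data and variable names here are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data: 1,000 rows with roughly 5% positives (hypothetical).
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.05).astype(int)

# 1) Split first, stratifying so both parts keep the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# 2) Undersample the majority class in the training part only.
pos_idx = np.flatnonzero(y_tr == 1)
neg_idx = np.flatnonzero(y_tr == 0)
keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, keep_neg])

X_tr_bal, y_tr_bal = X_tr[keep], y_tr[keep]

print("balanced train counts:", np.bincount(y_tr_bal))
print("untouched test counts:", np.bincount(y_te))
```

With this order the test split keeps the original imbalance, so its precision and recall are closer to what the model would see on unseen transactions.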




In [16]:
Lore = LogisticRegression()
#y_pred = Lore.fit(X_train, y_train).predict(X_test)
y_pred_rus = Lore.fit(X_train_rus, y_train_rus).predict(X_test_rus)
y_pred_nr = Lore.fit(X_train_nr, y_train_nr).predict(X_test_nr)




In [20]:
#print(confusion_matrix(y_test, y_pred))
#print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test_rus,y_pred_rus))
print(classification_report(y_test_rus, y_pred_rus))
print(confusion_matrix(y_test_nr,y_pred_nr))
print(classification_report(y_test_nr, y_pred_nr))

"""
The first model, trained without correcting the imbalanced data (commented out above), did very badly on the fraud class

The second model, trained on randomly undersampled data, was the best, with 0.92 accuracy
"""


[[1459  190]
 [  57 1580]]
              precision    recall  f1-score   support

           0       0.96      0.88      0.92      1649
           1       0.89      0.97      0.93      1637

    accuracy                           0.92      3286
   macro avg       0.93      0.92      0.92      3286
weighted avg       0.93      0.92      0.92      3286

[[1470  179]
 [ 214 1423]]
              precision    recall  f1-score   support

           0       0.87      0.89      0.88      1649
           1       0.89      0.87      0.88      1637

    accuracy                           0.88      3286
   macro avg       0.88      0.88      0.88      3286
weighted avg       0.88      0.88      0.88      3286



'\nThe first model, trained without correcting the imbalanced data (commented out above), did very badly on the fraud class\n\nThe second model, trained on randomly undersampled data, was the best, with 0.92 accuracy\n'
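
To make the link between the confusion matrix and the classification report explicit, the following sketch recomputes the headline numbers for the randomly undersampled run by hand. It relies only on the scikit-learn convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix for the randomly undersampled data, copied from the
# output above. Rows are true classes, columns are predicted classes.
tn, fp = 1459, 190
fn, tp = 57, 1580

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_fraud = tp / (tp + fp)   # of predicted frauds, how many were real
recall_fraud = tp / (tp + fn)      # of real frauds, how many were caught
f1_fraud = 2 * precision_fraud * recall_fraud / (precision_fraud + recall_fraud)

print(round(accuracy, 2), round(precision_fraud, 2),
      round(recall_fraud, 2), round(f1_fraud, 2))
```

The rounded values agree with the accuracy row and the class 1 row of the first report above.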

### Now pick a model of your choice and evaluate its accuracy.

In [21]:
neigh_rus = KNeighborsClassifier(n_neighbors = 3).fit(X_train_rus, y_train_rus)

y_neigh_pred_rus = neigh_rus.predict(X_test_rus)

print(confusion_matrix(y_test_rus, y_neigh_pred_rus))

print(classification_report(y_test_rus, y_neigh_pred_rus))

[[1561   88]
 [  75 1562]]
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      1649
           1       0.95      0.95      0.95      1637

    accuracy                           0.95      3286
   macro avg       0.95      0.95      0.95      3286
weighted avg       0.95      0.95      0.95      3286



In [22]:
neigh_nr = KNeighborsClassifier(n_neighbors = 3).fit(X_train_nr, y_train_nr)

y_neigh_pred_nr = neigh_nr.predict(X_test_nr)

print(confusion_matrix(y_test_nr, y_neigh_pred_nr))

print(classification_report(y_test_nr, y_neigh_pred_nr))

[[1562   87]
 [  65 1572]]
              precision    recall  f1-score   support

           0       0.96      0.95      0.95      1649
           1       0.95      0.96      0.95      1637

    accuracy                           0.95      3286
   macro avg       0.95      0.95      0.95      3286
weighted avg       0.95      0.95      0.95      3286



In [23]:
X_train_robust_rus = RobustScaler().fit(X_train_rus).transform(X_train_rus)
X_test_robust_rus = RobustScaler().fit(X_test_rus).transform(X_test_rus)

neigh_rus = KNeighborsClassifier(n_neighbors = 3).fit(X_train_robust_rus, y_train_rus)

y_neigh_pred_rus = neigh_rus.predict(X_test_robust_rus)

print(confusion_matrix(y_test_rus, y_neigh_pred_rus))

print(classification_report(y_test_rus, y_neigh_pred_rus))

[[1582   67]
 [  59 1578]]
              precision    recall  f1-score   support

           0       0.96      0.96      0.96      1649
           1       0.96      0.96      0.96      1637

    accuracy                           0.96      3286
   macro avg       0.96      0.96      0.96      3286
weighted avg       0.96      0.96      0.96      3286



In [24]:
X_train_robust_nr = RobustScaler().fit(X_train_nr).transform(X_train_nr)
X_test_robust_nr = RobustScaler().fit(X_test_nr).transform(X_test_nr)

neigh_nr = KNeighborsClassifier(n_neighbors = 3).fit(X_train_robust_nr, y_train_nr)

y_neigh_pred_nr = neigh_nr.predict(X_test_robust_nr)

print(confusion_matrix(y_test_nr, y_neigh_pred_nr))

print(classification_report(y_test_nr, y_neigh_pred_nr))

[[1593   56]
 [  87 1550]]
              precision    recall  f1-score   support

           0       0.95      0.97      0.96      1649
           1       0.97      0.95      0.96      1637

    accuracy                           0.96      3286
   macro avg       0.96      0.96      0.96      3286
weighted avg       0.96      0.96      0.96      3286
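
The two scaling cells above fit a separate `RobustScaler` on the test split. A more conventional setup learns the scaling statistics from the training rows only and reuses them on the test rows. The sketch below shows one way to do that with a scikit-learn pipeline, using synthetic data and hypothetical variable names. It is not a rerun of the experiment above, so its score is not comparable to the reports.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: two features on very different scales.
X_tr = rng.normal(size=(200, 2)) * [1.0, 1000.0]
y_tr = (X_tr[:, 0] > 0).astype(int)
X_te = rng.normal(size=(50, 2)) * [1.0, 1000.0]
y_te = (X_te[:, 0] > 0).astype(int)

# fit() learns the scaler's median and IQR from the training rows only;
# predict()/score() reuse those same statistics to transform the test rows.
model = make_pipeline(RobustScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_tr, y_tr)

print("test accuracy:", model.score(X_te, y_te))
```

Wrapping the scaler and the classifier in one pipeline also keeps the two steps together if the model is later cross-validated.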



In [31]:

multi_layer_perceptron = MLPClassifier(hidden_layer_sizes=(400, ),random_state =42).fit(X_train_rus, y_train_rus)
y_mlp_pred_rus = multi_layer_perceptron.predict(X_test_rus)

print(confusion_matrix(y_test_rus, y_mlp_pred_rus))

print(classification_report(y_test_rus, y_mlp_pred_rus))

[[1631   18]
 [  18 1619]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1649
           1       0.99      0.99      0.99      1637

    accuracy                           0.99      3286
   macro avg       0.99      0.99      0.99      3286
weighted avg       0.99      0.99      0.99      3286





In [26]:
multi_layer_perceptron = MLPClassifier(random_state =42).fit(X_train_nr, y_train_nr)
y_mlp_pred_nr = multi_layer_perceptron.predict(X_test_nr)

print(confusion_matrix(y_test_nr, y_mlp_pred_nr))

print(classification_report(y_test_nr, y_mlp_pred_nr))

[[1578   71]
 [  62 1575]]
              precision    recall  f1-score   support

           0       0.96      0.96      0.96      1649
           1       0.96      0.96      0.96      1637

    accuracy                           0.96      3286
   macro avg       0.96      0.96      0.96      3286
weighted avg       0.96      0.96      0.96      3286





### Which model worked better and how do you know?

In [27]:
# Your response here
"""
The best model that I found it is the MLPClassifier (NeuralNetwork 400 layers)  with data balanced (random under sampling)
with 0.99 precision, recall and f1-score.
"""



'\nThe best model that I found it is the MLPClassifier (NeuralNetwork) with data balanced (random under sampling)\nwith 0.98 precision, recall and f1-score.\n'
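
One caveat when answering "how do you know": the scores above come from a balanced test split. As a rough sketch, the MLP's true-positive and false-positive rates can be carried over to a lower fraud prevalence to see what precision they would imply. The helper `precision_at_prevalence` is hypothetical, and the full-dataset fraud share is used only as an illustrative prevalence, since the class ratio of the reduced dataset is not shown in the outputs above.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision when positives make up `prevalence` of the data."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Rates from the MLP confusion matrix on the undersampled test split above.
tpr = 1619 / (1619 + 18)   # recall on the fraud class
fpr = 18 / (18 + 1631)     # share of legitimate rows flagged as fraud

balanced = precision_at_prevalence(tpr, fpr, 0.5)
# Assumption: fraud share of the full dataset (8213 of 6362620 rows), used
# only as an illustrative prevalence.
realistic = precision_at_prevalence(tpr, fpr, 8213 / 6362620)

print(f"precision at 50% prevalence: {balanced:.3f}")
print(f"precision at full-data prevalence: {realistic:.3f}")
```

Under these assumptions the implied precision drops well below the balanced-split figure, so the 0.99 scores are best read as a comparison between models on the same balanced split, not as expected performance on the raw transaction stream.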