# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when the data is noisy and we are not careful, we can end up fitting the model to the noise rather than the 'signal', that is, the factors that actually determine the outcome. This is called overfitting: the model scores well in training and poorly when applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), and such a model performs poorly everywhere.


### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?

In [47]:
# Your code here
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score


In [74]:
data = pd.read_csv('/home/inrx/Ironhack/Labs-Ironhack/lab-inbalance/your-code/PS_20174392719_1491204439457_log.csv')

In [2]:
data.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


In [3]:
data.shape

(6362620, 11)

In [4]:
data.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

In [5]:
# 'isFraud' is the target, so it is left out of the candidate features
relevant_features = ['type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest',
       'isFlaggedFraud']

print(f'I believe the most relevant features are {relevant_features}')

I believe the most relevant features are ['type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFlaggedFraud']


### What is the distribution of the outcome? 

In [6]:
# Your response here
import matplotlib.pyplot as plt

plt.hist(data['isFraud'])

(array([6354407.,       0.,       0.,       0.,       0.,       0.,
              0.,       0.,       0.,    8213.]),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 <a list of 10 Patch objects>)

In [7]:
data['isFraud'].value_counts()

0    6354407
1       8213
Name: isFraud, dtype: int64
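
The counts above mean fraud makes up roughly 0.13% of all transactions, so a model that always predicts "not fraud" would already reach about 99.87% accuracy. A minimal sketch of that arithmetic, using only the two counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
n_legit = 6_354_407
n_fraud = 8_213
n_total = n_legit + n_fraud

# Share of transactions that are fraudulent.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial classifier that labels every transaction as legitimate.
majority_baseline_accuracy = n_legit / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-predict-0 accuracy: {majority_baseline_accuracy:.4%}")
```

This is why raw accuracy on the full dataset says very little about how well fraud is detected.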

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [8]:
# Your code here
data.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [9]:
data.step.value_counts()

19     51352
18     49579
187    49083
235    47491
307    46968
       ...  
725        4
245        4
655        4
112        2
662        2
Name: step, Length: 743, dtype: int64
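
One possible way to handle the time variable is to split the raw integer into a day index and an hour of day, so a model can pick up daily patterns. The sketch below assumes each step is one hour of simulated time and that step 1 is the first hour of the first day; both are assumptions to check against the dataset description, and `split_step` is a hypothetical helper, not part of the lab.

```python
from typing import Tuple


def split_step(step: int) -> Tuple[int, int]:
    """Return (day_index, hour_of_day) for a 1-based hourly step.

    Assumes one step equals one hour and that step 1 is hour 0 of day 0.
    """
    zero_based = step - 1
    return zero_based // 24, zero_based % 24


# A few steps, including the largest one implied by the 743 distinct values above.
for step in (1, 19, 24, 25, 743):
    day, hour = split_step(step)
    print(f"step={step:>3} -> day={day:>2}, hour={hour:>2}")
```

On the DataFrame itself, the same idea could be written as `(data['step'] - 1) % 24` for the hour and `(data['step'] - 1) // 24` for the day.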

In [10]:
data.nameOrig.value_counts()

C1462946854    3
C1530544995    3
C1832548028    3
C1677795071    3
C2051359467    3
              ..
C622754779     1
C965112639     1
C6046964       1
C830180816     1
C1090507233    1
Name: nameOrig, Length: 6353307, dtype: int64

In [11]:
data.nameOrig.unique()

array(['C1231006815', 'C1666544295', 'C1305486145', ..., 'C1162922333',
       'C1685995037', 'C1280323807'], dtype=object)

In [75]:
data.isFraud.value_counts()

0    6354407
1       8213
Name: isFraud, dtype: int64

In [16]:
from imblearn.under_sampling import RandomUnderSampler

In [18]:
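X = data.drop(columns=['isFraud'])  # features; defined here because this cell runs before the modeling cells
y = data['isFraud']  # target
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)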
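
Conceptually, random undersampling keeps every row of the minority class and draws an equally sized random sample from the majority class. The following is a dependency-free sketch of that idea on toy labels; it is not how `imblearn` implements it, and `undersample_indices` is a hypothetical name used only for illustration.

```python
import random
from collections import Counter


def undersample_indices(labels, seed=0):
    """Return sorted row indices with every class cut down to the minority count."""
    rng = random.Random(seed)

    # Group row positions by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Every class keeps as many rows as the smallest class has.
    n_keep = min(len(rows) for rows in by_class.values())

    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, n_keep))
    return sorted(kept)


# Toy labels: 95 legitimate rows (0) and 5 fraudulent rows (1).
toy_labels = [0] * 95 + [1] * 5
kept = undersample_indices(toy_labels)
print(Counter(toy_labels[i] for i in kept))
```

The trade-off is that most majority-class rows are thrown away, which is why the resampled dataset in this lab ends up with only 8213 rows per class.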

In [21]:
data_x = pd.DataFrame(X_resampled, columns=list(X.columns))
data_y = pd.DataFrame(y_resampled, columns=['isFraud'])

data = pd.concat([data_x, data_y], axis=1)
data.head()

In [24]:
data.dtypes

step              object
type              object
amount            object
nameOrig          object
oldbalanceOrg     object
newbalanceOrig    object
nameDest          object
oldbalanceDest    object
newbalanceDest    object
isFlaggedFraud    object
isFraud            int64
dtype: object

In [26]:
data.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFlaggedFraud',
       'isFraud'],
      dtype='object')

In [33]:
data.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFlaggedFraud | isFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 139 | CASH_OUT | 265803.0 | C1431494964 | 0.0 | 0.0 | C78023053 | 751669 | 1017470.0 | 0 | 0 |
| 1 | 589 | CASH_OUT | 11278.3 | C1337769645 | 0.0 | 0.0 | C1383513926 | 220971 | 232249.0 | 0 | 0 |
| 2 | 284 | PAYMENT | 16363.1 | C1329599228 | 35966.2 | 19603.1 | M1697421136 | 0 | 0.0 | 0 | 0 |
| 3 | 37 | CASH_IN | 107223.0 | C676400927 | 20752600.0 | 20859900.0 | C1012855711 | 804595 | 571711.0 | 0 | 0 |
| 4 | 287 | PAYMENT | 3678.25 | C78035356 | 0.0 | 0.0 | M2109783444 | 0 | 0.0 | 0 | 0 |


In [36]:
data[['step', 'amount', 'oldbalanceOrg', 
     'oldbalanceDest', 'newbalanceDest', 
     'isFlaggedFraud','isFraud']] = data[['step', 'amount', 'oldbalanceOrg', 
     'oldbalanceDest', 'newbalanceDest', 
     'isFlaggedFraud','isFraud']].astype(float)

In [40]:
# Print the columns that are still stored with the generic object dtype
for col in data.select_dtypes(include='object').columns:
    print(col)

type
nameOrig
newbalanceOrig
nameDest


In [45]:
data['isFraud'].value_counts()

1.0    8213
0.0    8213
Name: isFraud, dtype: int64

In [42]:
data.drop(columns=['nameOrig', 'newbalanceOrig', 'nameDest'], inplace=True)

In [44]:
data_dummy = pd.get_dummies(data)
data_dummy.head()

| | step | amount | oldbalanceOrg | oldbalanceDest | newbalanceDest | isFlaggedFraud | isFraud | type_CASH_IN | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 139.0 | 265803.35 | 0.0 | 751669.39 | 1017472.74 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 589.0 | 11278.28 | 0.0 | 220970.84 | 232249.12 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 284.0 | 16363.06 | 35966.16 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 37.0 | 107223.31 | 20752629.71 | 804594.62 | 571711.22 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 287.0 | 3678.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 |


### Run a logistic regression classifier and evaluate its accuracy.

In [63]:
# Your code here
X = data_dummy.drop(columns=['isFraud'])  # features
y = data_dummy['isFraud']  # target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=29)

from sklearn.linear_model import LogisticRegression

In [71]:
## Logistic Regression
lr = LogisticRegression()
lr.fit(X_train,y_train)
acc_lr_test = lr.score(X_test,y_test)*100
acc_lr_train = lr.score(X_train,y_train)*100

print(f"Logistic Regression Train Accuracy {round(acc_lr_train, 2)}%")
print(f"Logistic Regression Test Accuracy {round(acc_lr_test, 2)}%")

Logistic Regression Train Accuracy 65.24%
Logistic Regression Test Accuracy 64.18%
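
Accuracy alone does not show which kind of mistake a classifier makes. For fraud detection, precision (how many flagged transactions are really fraud) and recall (how many frauds are caught) on the fraud class are often more informative. A minimal sketch on made-up toy predictions, not results from this notebook; `fraud_metrics` is a hypothetical helper:

```python
def fraud_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged as fraud
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that was missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy example: 4 frauds, 6 legitimate transactions (illustrative values only).
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = fraud_metrics(y_true_toy, y_pred_toy)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case the accuracy is 70%, yet half of the frauds are missed. With scikit-learn, the same quantities are available through `precision_score`, `recall_score` and `f1_score` in `sklearn.metrics`.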




### Now pick a model of your choice and evaluate its accuracy.

In [72]:
# Your code here

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

acc_dtc_test = dtc.score(X_test, y_test)*100
acc_dtc_train = dtc.score(X_train, y_train)*100
print(f"Decision Tree Accuracy on train: {round(acc_dtc_train, 2)}%\n\
Decision Tree Accuracy on test: {round(acc_dtc_test, 2)}%")

Decision Tree Accuracy on train: 100.0%
Decision Tree Accuracy on test: 97.46%


### Which model worked better and how do you know?

In [73]:
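# Your response here

"""The decision tree worked better: 97.46% test accuracy versus 64.18% for logistic regression. Its 100% train accuracy against 97.46% on test points to mild overfitting, but the gap is small. Both scores were measured on the undersampled (50/50) data, so accuracy is a reasonable yardstick here."""

'The decision tree worked better: 97.46% test accuracy versus 64.18% for logistic regression. Its 100% train accuracy against 97.46% on test points to mild overfitting, but the gap is small. Both scores were measured on the undersampled (50/50) data, so accuracy is a reasonable yardstick here.'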