# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal': the factors that actually determine the outcome. This is called overfitting; it produces good results in training and bad results when the model is applied to real data. Similarly, a model can be too simplistic to capture the signal (underfitting). Such a model never works well, in training or on real data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset; use a sample of n=100000 rows instead, so your computer doesn't freeze.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [2]:
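# Path to the PaySim CSV downloaded from Kaggle
file = 'PS_20174392719_1491204439457_log.csv'

df = pd.read_csv(file)
# Work on a random sample of 100000 rows; no random_state is set, so the sample differs between runs
df = df.sample(n = 100000)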

- Example row (columns in the order described below): 1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0
- step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744, described by the dataset authors as a 30-day simulation (744 hours is 31 full days).
- type - CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.
- amount - amount of the transaction in local currency.
- nameOrig - customer who started the transaction.
- oldbalanceOrg - initial balance of the originating customer before the transaction.
- newbalanceOrig - new balance of the originating customer after the transaction.
- nameDest - customer who is the recipient of the transaction.
- oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for recipients whose name starts with M (merchants).
- newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for recipients whose name starts with M (merchants).
- isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.
- isFlaggedFraud - the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.

In [3]:
df.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
5747646,399,CASH_OUT,117915.4,C590497979,313606.0,195690.6,C482802677,10708.97,128624.37,0,0
5717076,398,CASH_OUT,23606.25,C1902442212,48421.0,24814.75,C212841898,7393.99,31000.24,0,0
4892295,349,PAYMENT,27734.35,C680538630,243287.14,215552.78,M1807485544,0.0,0.0,0,0
5665851,396,TRANSFER,772590.65,C1817828559,0.0,0.0,C129742359,1700094.52,2472685.17,0,0
2040234,181,PAYMENT,13603.88,C1643265582,209024.56,195420.68,M1788226952,0.0,0.0,0,0


In [4]:
df.shape

(100000, 11)

In [5]:
df.dtypes
# types OK

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [6]:
df.isna().sum()
# zero nulls

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

### What is the distribution of the outcome? 

In [7]:
df['isFraud'].value_counts()

0    99859
1      141
Name: isFraud, dtype: int64
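
The counts above mean that fraud makes up about 0.14% of the sample. The sketch below repeats that calculation on a stand-in Series rebuilt from the printed counts (the name `is_fraud` is hypothetical and is not the real `df`), and also shows the accuracy a trivial "always predict 0" model would reach.

```python
import pandas as pd

# Stand-in for df['isFraud'], rebuilt from the counts printed above
# (99859 legitimate rows, 141 fraudulent rows).
is_fraud = pd.Series([0] * 99859 + [1] * 141, name="isFraud")

# Relative frequency of each class.
class_share = is_fraud.value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = class_share.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4%}")
```

Any model evaluated by accuracy on this data has to be read against that baseline.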

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [8]:
# checking object columns
cols = ['nameOrig','nameDest','type']
for col in cols:
    print(df[col].value_counts())

C1935481284    2
C336548382     2
C756033061     2
C1854571298    2
C41990019      1
              ..
C1502861360    1
C2120224691    1
C251577791     1
C921843268     1
C109080853     1
Name: nameOrig, Length: 99996, dtype: int64
C798678484     6
C1324885855    5
C486371171     5
C1870237274    5
C920011586     5
              ..
M1211678643    1
C2121674863    1
C807042906     1
M1592154794    1
C1290678008    1
Name: nameDest, Length: 92937, dtype: int64
CASH_OUT    35189
PAYMENT     33820
CASH_IN     21880
TRANSFER     8454
DEBIT         657
Name: type, dtype: int64


In [9]:
# Dropping object columns
df.drop(['type','nameOrig','nameDest'], axis = 1, inplace = True)
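
The cell above drops `type` and leaves `step` as a raw hour counter. One possible alternative, sketched on a small hypothetical frame (`demo`) rather than the real `df`, is to derive an hour-of-day and a day index from `step` and to one-hot encode `type`. The mapping of step 1 to hour 0 of day 0 is an assumption; the dataset description only says that one step is one hour.

```python
import pandas as pd

# Tiny hypothetical frame with the same column names as the PaySim sample.
demo = pd.DataFrame({
    "step": [1, 25, 181, 399, 744],
    "type": ["PAYMENT", "TRANSFER", "PAYMENT", "CASH_OUT", "CASH_IN"],
})

# One step is one hour. Assumption: step 1 is the first hour of day 0,
# so shifting by 1 before the modulo maps steps 1..24 to hours 0..23.
demo["hour_of_day"] = (demo["step"] - 1) % 24
demo["day"] = (demo["step"] - 1) // 24

# One-hot encode the transaction type instead of dropping it.
demo = pd.get_dummies(demo, columns=["type"], prefix="type")
print(demo)
```

Whether these features actually help is something to check on the real data; the sketch only shows the mechanics.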

### Run a logistic regression classifier and evaluate its accuracy.

In [12]:
X = df.drop('isFraud', axis=1)
y = df['isFraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 29)

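# Note: this fits an ordinary least-squares linear regression, not a logistic regression classifier
lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

# For a regressor, .score() returns R^2 (at most 1), not a percentage accuracy
r2 = lr.score(X_test, y_test)
print(f"Linear Regression R^2: {round(r2,2)}")

Linear Regression R^2: 0.19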
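
The heading asks for a logistic regression classifier, whereas the cell above fits `linear_model.LinearRegression`. A minimal sketch of the classifier version follows. It runs on synthetic imbalanced data from `make_classification` (not on PaySim), and the names `X_demo`, `y_demo`, and `clf`, the scaling step, and the parameter values are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 1% positives); not PaySim.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.99, 0.01], random_state=29,
)

# Stratify so both splits keep the same share of the rare class.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=29
)

# Scale the features first so the solver converges reliably.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

# For a classifier, .score() is the fraction of correct predictions.
accuracy = clf.score(X_te, y_te)
print(f"Logistic regression accuracy: {accuracy:.2%}")
```

On the lab data, the same pipeline could be fitted on `X_train` and `y_train`; the resulting score would then be an accuracy and directly comparable with the KNN score.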


### Now pick a model of your choice and evaluate its accuracy.

In [13]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# For a classifier, .score() is accuracy: the fraction of correct predictions
acc = knn.score(X_test, y_test)*100
print(f"{knn.n_neighbors} neighbors KNN Score: {round(acc,2)}%")

2 neighbors KNN Score: 99.96%


### Which model worked better and how do you know?

In [None]:
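# KNeighbors scored higher, but the two scores are not directly comparable:
# knn.score() is classification accuracy, while lr.score() is the R^2 of a linear regression.
# Accuracy is also a weak yardstick here: only 141 of the 100000 sampled rows are fraud,
# so a model that never predicts fraud would reach roughly 99.86% accuracy on the sample.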
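
When one class dominates, precision, recall, and the confusion matrix say more than accuracy does. The sketch below uses entirely hypothetical labels (1000 transactions, 10 of them fraudulent) and two made-up sets of predictions; none of these numbers come from the lab data.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)

# Hypothetical ground truth: 990 legitimate rows followed by 10 frauds.
y_true = [0] * 990 + [1] * 10

# Hypothetical model that never flags fraud.
y_always_zero = [0] * 1000

# Hypothetical model with 4 false alarms that catches 6 of the 10 frauds.
y_some_model = [0] * 986 + [1] * 4 + [1] * 6 + [0] * 4

results = {}
for name, y_hat in [("always 0", y_always_zero), ("some model", y_some_model)]:
    results[name] = {
        "accuracy": accuracy_score(y_true, y_hat),
        # zero_division=0 avoids a warning when nothing is predicted as fraud.
        "precision": precision_score(y_true, y_hat, zero_division=0),
        "recall": recall_score(y_true, y_hat, zero_division=0),
        # Rows are true classes, columns are predicted classes.
        "confusion": confusion_matrix(y_true, y_hat).tolist(),
    }
    print(name, results[name])
```

Both hypothetical models have an accuracy of 99% or more, but only recall shows that the first one misses every fraud.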
