# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when data is noisy and we are not careful, we can end up fitting our model to the noise rather than to the 'signal', the factors that actually determine the outcome. This is called overfitting: the model performs well in training but poorly when applied to real data. Conversely, a model can be too simplistic to capture the signal at all (underfitting), and such a model performs poorly on both training and real data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from https://www.kaggle.com/ealaxi/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [26]:
import imblearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

In [11]:
# Your code here
data=pd.read_csv('../archive/PS_20174392719_1491204439457_log.csv', nrows=100000) #only first 100000 rows
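
Note that `nrows=100000` reads the first 100000 rows of the file, not a random sample, so the subset may only cover the earliest part of the simulation. A minimal sketch of the difference, using a small synthetic frame as a stand-in for the full CSV (the frame `full`, its values, and the sizes are illustrative assumptions, not PaySim data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the full dataset (hypothetical values)
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "step": np.repeat(np.arange(1, 101), 100),   # 100 time steps, 100 rows each
    "amount": rng.uniform(1, 1000, size=10_000),
})

# First n rows: only the earliest steps are represented
head_subset = full.head(1_000)

# Random sample of n rows: steps from the whole range are represented
random_subset = full.sample(n=1_000, random_state=0)

print(head_subset["step"].nunique(), random_subset["step"].nunique())
```

On the real file, the same idea would be to load the CSV and call `.sample(n=100000, random_state=...)`, provided the full file fits in memory.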


In [12]:
data.shape

(100000, 11)

In [13]:
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [14]:
data.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [15]:
data.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,8.49964,173602.2,877757.5,894061.9,880504.8,1184041.0,0.00116,0.0
std,1.825545,344300.3,2673284.0,2711318.0,2402267.0,2802350.0,0.034039,0.0
min,1.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0
25%,8.0,9963.562,0.0,0.0,0.0,0.0,0.0,0.0
50%,9.0,52745.52,20061.5,0.0,20839.43,49909.18,0.0,0.0
75%,10.0,211763.1,190192.0,214813.2,588272.4,1058186.0,0.0,0.0
max,10.0,10000000.0,33797390.0,34008740.0,34008740.0,38946230.0,1.0,0.0


### What is the distribution of the outcome? 

In [17]:
# Your response here
display(data['isFraud'].value_counts()) #imbalanced data
display(data['isFlaggedFraud'].value_counts()) 

0    99884
1      116
Name: isFraud, dtype: int64

0    100000
Name: isFlaggedFraud, dtype: int64

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [18]:
# Your code here
display(data['step'].value_counts())

9     37628
10    27274
8     21097
7      6837
1      2708
6      1660
2      1014
5       665
4       565
3       552
Name: step, dtype: int64
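
One possible way to integrate the time variable is to derive an hour-of-day feature from `step`. This sketch assumes that one step corresponds to one hour of simulated time and that step 1 is the first hour of the first day; the column names `hour_of_day` and `day` are hypothetical, and the step values are illustrative:

```python
import pandas as pd

steps = pd.DataFrame({"step": [1, 9, 10, 24, 25, 49]})  # illustrative values

# Assumption: step 1 is the first hour of day 0
steps["hour_of_day"] = (steps["step"] - 1) % 24
steps["day"] = (steps["step"] - 1) // 24
print(steps)
```

With only steps 1 to 10 in this subset, `day` would be constant, so `hour_of_day` (or `step` itself) is the only time signal available here.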

In [21]:
# Your code here
display(data['type'].value_counts())

PAYMENT     39512
CASH_OUT    30718
CASH_IN     20185
TRANSFER     8597
DEBIT         988
Name: type, dtype: int64
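
The `type` column is categorical, so it needs encoding before a model can use it. A minimal sketch of one-hot encoding with `pd.get_dummies`, on a tiny illustrative frame (the frame `tx` and its amounts are made up for the example):

```python
import pandas as pd

tx = pd.DataFrame({
    "type": ["PAYMENT", "CASH_OUT", "CASH_IN", "TRANSFER", "DEBIT", "TRANSFER"],
    "amount": [9839.64, 181.0, 500.0, 181.0, 75.0, 1200.0],  # illustrative
})

# One indicator column per transaction type
encoded = pd.get_dummies(tx, columns=["type"], prefix="type", dtype=int)
print(encoded.columns.tolist())
```

The two fraud rows visible in `data.head()` are of type `TRANSFER` and `CASH_OUT`, which suggests the encoded type may be a useful feature, although two rows are not enough to conclude that.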

In [19]:
data.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [20]:
data.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

### Run a logistic regression classifier and evaluate its accuracy.

In [43]:
# Your code here
X = data.drop(['step','type','nameOrig','nameDest','isFraud','isFlaggedFraud'],axis = 1) #features
y = data['isFraud'] #labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
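
With so few fraud cases, a plain random split can leave the train and test sets with different fraud rates. A sketch of a stratified split, which keeps the class proportions roughly equal in both parts; the data here is synthetic (`X_demo`, `y_demo` are assumptions for illustration) and the split in the cell above is not stratified:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))        # synthetic features
y_demo = np.zeros(10_000, dtype=int)
y_demo[:12] = 1                              # 12 positives, roughly 0.12%

# stratify=y_demo preserves the positive rate in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)
print(y_tr.sum(), y_te.sum())
```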

In [44]:
"""
from sklearn.neighbors import KNeighborsClassifier

# initialize the model -> set hyperparameters
model = KNeighborsClassifier(n_neighbors = 3)
model = model.fit(X_train, y_train) #I train my model
"""
LR = LogisticRegression(max_iter=1000)#
LR.fit(X_train, y_train)
pred = LR.predict(X_test)
LR.score(X_test, y_test)

0.99916

In [36]:
print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))
print(confusion_matrix(y_test,pred))
#7 fraud cases are predicted as NOT fraud (false negatives)

precision:  0.5757575757575758
recall:  0.7307692307692307
f1:  0.6440677966101696
[[24960    14]
 [    7    19]]
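
The printed scores can be checked by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. A short sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix from the logistic regression output above
cm = np.array([[24960, 14],
               [7, 19]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the rows flagged as fraud, how many are fraud
recall = tp / (tp + fn)      # of the real frauds, how many are caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()
print(precision, recall, f1, accuracy)
```

The results agree with `precision_score`, `recall_score`, `f1_score` and `LR.score` above.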


### Now pick a model of your choice and evaluate its accuracy.

In [45]:
# Your code here


In [46]:
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [47]:
y_test

3582     0
60498    0
53227    0
21333    0
3885     0
        ..
26543    0
85764    0
87585    0
32519    0
18831    0
Name: isFraud, Length: 25000, dtype: int64

In [48]:
print("test data accuracy was ",model.score(X_test,y_test)) #test data accuracy
print("train data accuracy was ", model.score(X_train, y_train)) #train data accuracy:

test data accuracy was  0.999
train data accuracy was  0.99912


In [49]:
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)  # number of neighbours
knn.fit(X_train, y_train)                  # training
y_predict = knn.predict(X_test)            # prediction on the test set
accuracy_score(y_test, y_predict)          # compare predictions with the real labels

0.999

In [50]:
knn.score(X_test, y_test) #SAME AS ABOVE accuracy_score

0.999

In [51]:
print(confusion_matrix(y_test,y_predict))
#20 fraud cases are predicted as NOT fraud. Much worse than the 7 cases from logistic regression.

[[24969     5]
 [   20     6]]
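
KNN relies on distances between rows, so features on very different scales can dominate the result. `StandardScaler` is imported at the top of this notebook but is not used. A minimal sketch of scaling inside a pipeline, on synthetic data (`X_demo` and `y_demo` are made up); the features used in this lab are all currency amounts, so the effect on the results above is untested and may be small:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales
X_demo = np.column_stack([
    rng.normal(0, 1, size=500),
    rng.normal(0, 1_000_000, size=500),
])
y_demo = (X_demo[:, 0] > 0).astype(int)   # label depends only on the small-scale feature

# The scaler is fitted on the training rows only, then applied before KNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_demo[:400], y_demo[:400])
pred_demo = scaled_knn.predict(X_demo[400:])
print(pred_demo.shape)
```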


### Which model worked better and how do you know?

In [2]:
# Your response here
#Logistic regression worked much better than KNN.
#We know this from the confusion matrix: with LR, 7 frauds were marked as not fraud, whereas with KNN there were 20, which is much higher.
#However, accuracy was very high for both models (maybe too high; I am concerned I may have made a mistake).
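
A likely explanation for the very high accuracy is the class imbalance itself rather than a mistake. A sketch of the trivial baseline that predicts "not fraud" for every row, using the test-set class counts that follow from the confusion matrices above (24974 non-fraud, 26 fraud):

```python
import numpy as np

# Test-set labels rebuilt from the confusion matrices above
y_true = np.array([0] * 24974 + [1] * 26)

# Trivial baseline: predict "not fraud" for every row
y_baseline = np.zeros_like(y_true)

baseline_accuracy = (y_true == y_baseline).mean()
baseline_recall = y_baseline[y_true == 1].mean()   # share of frauds caught
print(baseline_accuracy, baseline_recall)
```

The baseline reaches about 0.99896 accuracy while catching no fraud at all, so the 0.99916 (LR) and 0.999 (KNN) scores are only marginally above it. This is why recall, precision and the confusion matrix are more informative than accuracy here.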

