# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when data is noisy and we are not careful, we can end up fitting the model to the noise rather than to the 'signal', the factors that actually determine the outcome. This is called overfitting: the model scores well in training but poorly when applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), in which case it performs poorly everywhere.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/ealaxi/paysim1 . Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [1]:
# Your code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("PS_20174392719_1491204439457_log.csv")
data = data.sample(n=100000)
data.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1215983 | 133 | PAYMENT | 19926.07 | C861404775 | 0.0 | 0.0 | M1738444549 | 0.0 | 0.0 | 0 | 0 |
| 1380727 | 138 | CASH_IN | 116486.49 | C1913530774 | 13960685.62 | 14077172.11 | C1749309365 | 417190.14 | 300703.65 | 0 | 0 |
| 3198325 | 249 | CASH_OUT | 266493.79 | C561850127 | 0.0 | 0.0 | C1697323714 | 3227609.76 | 3494103.55 | 0 | 0 |
| 2527263 | 205 | PAYMENT | 1033.78 | C2021076634 | 0.0 | 0.0 | M1313591026 | 0.0 | 0.0 | 0 | 0 |
| 5274237 | 372 | TRANSFER | 626369.64 | C346195988 | 12795.0 | 0.0 | C1820735980 | 0.0 | 626369.64 | 0 | 0 |


In [2]:
paysim = data.copy()
paysim.shape

(100000, 11)

In [3]:
paysim.describe()

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 243.30189 | 179295.4 | 845574.5 | 866199.4 | 1100317.0 | 1223211.0 | 0.00131 | 0.0 |
| std | 142.10701 | 567539.2 | 2939243.0 | 2972166.0 | 3243104.0 | 3498879.0 | 0.03617 | 0.0 |
| min | 1.0 | 0.35 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 155.0 | 13364.53 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 238.0 | 74936.64 | 14497.95 | 0.0 | 133420.1 | 216818.5 | 0.0 | 0.0 |
| 75% | 334.0 | 209015.1 | 110909.3 | 146926.8 | 947323.3 | 1116471.0 | 0.0 | 0.0 |
| max | 741.0 | 39885140.0 | 57316260.0 | 47316260.0 | 158680800.0 | 174654100.0 | 1.0 | 0.0 |


### What is the distribution of the outcome? 

In [4]:
# Your response here
print(paysim.describe())
print(paysim["isFraud"].value_counts())
print(paysim["isFlaggedFraud"].value_counts())

               step        amount  oldbalanceOrg  newbalanceOrig  \
count  100000.00000  1.000000e+05   1.000000e+05    1.000000e+05   
mean      243.30189  1.792954e+05   8.455745e+05    8.661994e+05   
std       142.10701  5.675392e+05   2.939243e+06    2.972166e+06   
min         1.00000  3.500000e-01   0.000000e+00    0.000000e+00   
25%       155.00000  1.336453e+04   0.000000e+00    0.000000e+00   
50%       238.00000  7.493664e+04   1.449795e+04    0.000000e+00   
75%       334.00000  2.090151e+05   1.109093e+05    1.469268e+05   
max       741.00000  3.988514e+07   5.731626e+07    4.731626e+07   

       oldbalanceDest  newbalanceDest       isFraud  isFlaggedFraud  
count    1.000000e+05    1.000000e+05  100000.00000        100000.0  
mean     1.100317e+06    1.223211e+06       0.00131             0.0  
std      3.243104e+06    3.498879e+06       0.03617             0.0  
min      0.000000e+00    0.000000e+00       0.00000             0.0  
25%      0.000000e+00    0.000000e+00
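
The `mean` of `isFraud` in the summary statistics already describes the class balance, because the mean of a 0/1 column is the share of ones. As a minimal sketch in plain Python, using only the sample size and mean reported by `describe()` (the variable names are illustrative, not part of the lab), the counts and the imbalance ratio can be worked out like this:

```python
# Values read off the describe() output for the 100,000-row sample
n_rows = 100_000       # sample size
fraud_rate = 0.00131   # mean of the 0/1 isFraud column

# The mean of a 0/1 column is the share of ones, so rate * rows = fraud count
n_fraud = round(n_rows * fraud_rate)
n_legit = n_rows - n_fraud

print(f"fraud:     {n_fraud} rows ({fraud_rate:.3%})")
print(f"non-fraud: {n_legit} rows ({1 - fraud_rate:.3%})")
print(f"roughly {n_legit / n_fraud:.0f} non-fraud rows per fraud row")
```

With so few positive rows, any metric that averages over all rows will be dominated by the non-fraud class.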

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [5]:
# Your code here
paysim["type"].unique()
# I dummy-encode the "type" column, since it has only a few unique values that may have some impact on the outcome.
# Apart from that, I drop "step", since it is not clear how it is coded; the name columns (they could help track individual customers, which is not our goal); and "isFlaggedFraud", since it adds no new information (it is all 0s in this sample).
paysim = pd.get_dummies(paysim, columns=["type"])
paysim.drop(['step','nameOrig','nameDest','isFlaggedFraud'],axis = 1, inplace = True)
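
An alternative to dropping `step` would be to turn it into cyclic time features. The sketch below assumes, following the dataset description on Kaggle, that one step is one hour of simulated time. The helper `step_to_time_features` is hypothetical, and whether step 1 lines up with midnight is also an assumption.

```python
def step_to_time_features(step):
    """Split an hourly step counter into (day index, hour within day).

    Assumes 1 step = 1 hour and that step 1 is the first hour of day 0;
    the real clock alignment of the simulation is not known.
    """
    zero_based = step - 1
    return zero_based // 24, zero_based % 24


# Steps taken from the sample rows shown by data.head()
for step in (133, 138, 249, 205, 372):
    day, hour = step_to_time_features(step)
    print(f"step {step}: day {day}, hour {hour}")
```

On the DataFrame, the same idea would be `(paysim["step"] - 1) % 24`, computed before `step` is dropped. An hour-of-day feature lets a model pick up daily patterns that the raw, ever-increasing counter hides.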

### Run a logistic regression classifier and evaluate its accuracy.

In [8]:
# Your code here
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

y = paysim["isFraud"]
X = paysim.drop("isFraud", axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

model1 = LogisticRegression()
model1.fit(X_train, y_train)
model1.score(X_test, y_test)

pred = model1.predict(X_test)

print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
confusion_matrix(y_test,pred)

precision:  1.0
recall:  0.41379310344827586
f1:  0.5853658536585366
Mean Absolute Error: 0.00068
Mean Squared Error: 0.00068
Root Mean Squared Error: 0.026076809620810597


array([[24971,     0],
       [   17,    12]], dtype=int64)
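
The printed precision, recall, and F1 can be traced back to the four cells of this confusion matrix. A small sketch in plain Python, assuming scikit-learn's layout where rows are true classes and columns are predicted classes:

```python
# Confusion matrix printed above: [[tn, fp], [fn, tp]]
tn, fp = 24971, 0
fn, tp = 17, 12

precision = tp / (tp + fp)  # share of flagged transactions that really are fraud
recall = tp / (tp + fn)     # share of actual frauds that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

A precision of 1.0 only says that nothing was flagged wrongly; the recall shows that 17 of the 29 frauds in the test set were missed.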

### Now pick a model of your choice and evaluate its accuracy.

In [13]:
# Your code here

from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor

# X_train and X_test were already standardized in the logistic regression cell,
# so they are reused here as-is; fitting a second scaler on them changes nothing.

regressor = RandomForestRegressor()
regressor.fit(X_train, y_train)
pred2 = regressor.predict(X_test)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred2))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred2))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred2)))
print(regressor.score(X_train,y_train)*100)
print(regressor.score(X_test,y_test)*100)

# R^2 is about 95.7% on the training set but only about 59.8% on the test set: heavy overfitting

Mean Absolute Error: 0.0008683999999999999
Mean Squared Error: 0.00046568400000000006
Root Mean Squared Error: 0.021579712695029098
95.7069064908668
59.80820510412768
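
The last two numbers are R² scores, since `score` on a scikit-learn regressor returns the coefficient of determination rather than accuracy. As a rough check, assuming the test set holds the same 29 frauds out of 25,000 rows that appear in the logistic regression confusion matrix, the test value can be reproduced from the reported MSE:

```python
# Assumed test-set composition (taken from the confusion matrix earlier in the lab)
n_test = 25_000
n_fraud = 29

# Test MSE reported above for the random forest regressor
mse = 0.00046568400000000006

# For a 0/1 target, the variance around its mean is p * (1 - p)
p = n_fraud / n_test
target_variance = p * (1 - p)

# R^2 = 1 - SS_res / SS_tot, which equals 1 - MSE / variance of the target
r2 = 1 - mse / target_variance
print(f"R^2 on the test set: {r2 * 100:.2f}")
```

Because frauds are so rare, the target variance is tiny (about 0.00116), so even a very small MSE leaves a large share of it unexplained. This is worth keeping in mind when reading the gap between the training and test scores.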


In [14]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

model2 = KNeighborsClassifier(n_neighbors = 4)
model2.fit(X_train, y_train)

# Predict once and reuse the predictions for every metric
pred3 = model2.predict(X_test)
accuracy_score(y_test, pred3)

print("precision: ",precision_score(y_test,pred3))
print("recall: ",recall_score(y_test,pred3))
print("f1: ",f1_score(y_test,pred3))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred3))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred3))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred3)))
confusion_matrix(y_test,pred3)



precision:  1.0
recall:  0.4482758620689655
f1:  0.6190476190476191
Mean Absolute Error: 0.00064
Mean Squared Error: 0.00064
Root Mean Squared Error: 0.025298221281347035


array([[24971,     0],
       [   16,    13]], dtype=int64)

### Which model worked better and how do you know?

In [None]:
# Your response here
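# After scaling the data, both KNN and LogisticRegression reach a precision of 1.0 (no false positives), but KNN has a slightly higher recall (0.448 vs 0.414) and F1 (0.619 vs 0.585), so KNN worked slightly better.
# With classes this imbalanced, accuracy alone would barely separate the two models.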
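
To see why accuracy is a weak yardstick here, the sketch below compares the two classifiers with a hypothetical baseline that always predicts non-fraud. It uses only the counts from the two confusion matrices above; the baseline is an illustration, not a model trained in this lab.

```python
# Test-set counts taken from the two confusion matrices
n_test = 25_000
n_fraud = 29

# Number of correctly classified rows per approach
correct = {
    "always predict non-fraud": n_test - n_fraud,  # hypothetical baseline
    "LogisticRegression": 24971 + 12,
    "KNeighborsClassifier": 24971 + 13,
}

for name, n_correct in correct.items():
    print(f"{name}: accuracy = {n_correct / n_test:.5f}")
```

All three accuracies sit above 99.8%, so the difference between the models only becomes visible through recall and F1.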

