# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', that is, the factors that actually determine the outcome. This is called overfitting: the model performs well in training but poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal. This is called underfitting, and it produces a model that never works well.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/ealaxi/paysim1 . Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [33]:
# Your code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("PS_20174392719_1491204439457_log.csv")
data = data.sample(n=100000)
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
3315418,252,CASH_IN,281387.4,C2043732768,15581784.21,15863171.61,C213471909,1628241.41,1346854.01,0,0
4222717,306,CASH_IN,20880.68,C1006120729,2009.0,22889.68,C1148440206,533020.87,512140.19,0,0
2774407,213,PAYMENT,36775.27,C1959953509,161425.73,124650.45,M237815722,0.0,0.0,0,0
1431668,139,CASH_OUT,190698.66,C575471378,0.0,0.0,C222854729,2882182.68,3072881.34,0,0
2025751,180,PAYMENT,14926.21,C1745791810,10138.0,0.0,M1439545627,0.0,0.0,0,0
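
The prompt also asks for plots. A minimal sketch of two that could help here: a bar chart of transaction types and a histogram of amounts. The `demo` frame is a hypothetical stand-in built from the five rows printed above; with the real sample, `data` would be passed instead, and a log-scaled axis may be worth trying because the amounts are heavily skewed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Five rows copied from the data.head() output above (stand-in for the full sample)
demo = pd.DataFrame({
    "type": ["CASH_IN", "CASH_IN", "PAYMENT", "CASH_OUT", "PAYMENT"],
    "amount": [281387.4, 20880.68, 36775.27, 190698.66, 14926.21],
})

fig, (ax_type, ax_amount) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: how many transactions of each type
type_counts = demo["type"].value_counts()
type_counts.plot.bar(ax=ax_type, title="Transactions per type")

# Right panel: distribution of transaction amounts
demo["amount"].plot.hist(ax=ax_amount, bins=20, title="Transaction amount")
ax_amount.set_xlabel("amount")

fig.tight_layout()
```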


In [47]:
paysim = data.copy()
paysim.shape

(100000, 11)

In [48]:
paysim.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,243.25502,180175.3,846251.2,868064.7,1104175.0,1228758.0,0.00095,2e-05
std,142.257519,619790.5,2916006.0,2952902.0,3439751.0,3717224.0,0.030808,0.004472
min,1.0,0.23,0.0,0.0,0.0,0.0,0.0,0.0
25%,155.0,13334.48,0.0,0.0,0.0,0.0,0.0,0.0
50%,238.0,74802.05,14059.5,0.0,131069.5,215240.0,0.0,0.0
75%,334.0,209124.3,108525.8,147107.8,954025.0,1117434.0,0.0,0.0
max,741.0,56255000.0,38563400.0,38939420.0,228098500.0,227943300.0,1.0,1.0


### What is the distribution of the outcome? 

In [49]:
# Your response here
print(paysim.describe())
print(paysim["isFraud"].value_counts())
print(paysim["isFlaggedFraud"].value_counts())

                step        amount  oldbalanceOrg  newbalanceOrig  \
count  100000.000000  1.000000e+05   1.000000e+05    1.000000e+05   
mean      243.255020  1.801753e+05   8.462512e+05    8.680647e+05   
std       142.257519  6.197905e+05   2.916006e+06    2.952902e+06   
min         1.000000  2.300000e-01   0.000000e+00    0.000000e+00   
25%       155.000000  1.333448e+04   0.000000e+00    0.000000e+00   
50%       238.000000  7.480205e+04   1.405950e+04    0.000000e+00   
75%       334.000000  2.091243e+05   1.085258e+05    1.471078e+05   
max       741.000000  5.625500e+07   3.856340e+07    3.893942e+07   

       oldbalanceDest  newbalanceDest        isFraud  isFlaggedFraud  
count    1.000000e+05    1.000000e+05  100000.000000   100000.000000  
mean     1.104175e+06    1.228758e+06       0.000950        0.000020  
std      3.439751e+06    3.717224e+06       0.030808        0.004472  
min      0.000000e+00    0.000000e+00       0.000000        0.000000  
25%      0.000000e+00  
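
The `describe()` output already hints at the answer: with 100000 rows and a mean of 0.00095, `isFraud` has 95 positive rows in this sample. The sketch below rebuilds a label column with those counts (a reconstruction for illustration, not the real data) and shows how `value_counts(normalize=True)` makes the imbalance explicit.

```python
import pandas as pd

# Rebuild the label column from the describe() output above:
# count = 100000 and mean(isFraud) = 0.00095, i.e. 95 fraud rows in this sample.
n_rows = 100_000
n_fraud = 95
is_fraud = pd.Series([1] * n_fraud + [0] * (n_rows - n_fraud), name="isFraud")

counts = is_fraud.value_counts()                 # absolute counts per class
shares = is_fraud.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.loc[0] / counts.loc[1]  # legitimate rows per fraud row

print(counts)
print(shares)
print(f"about {imbalance_ratio:.0f} legitimate transactions per fraud")
```

With the real sample, `paysim["isFraud"]` would take the place of `is_fraud`.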

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [50]:
# Your code here
paysim["type"].unique()
# I dummify the "type" column, since it has only a handful of unique values that may affect the outcome.
# I also drop "step", since it is not obvious how to interpret it; the name columns (they could help track individual customers, which is not our goal); and "isFlaggedFraud", which adds almost no information (it is nearly all 0s).
paysim = pd.get_dummies(paysim, columns=["type"])
paysim.drop(['step','nameOrig','nameDest','isFlaggedFraud'],axis = 1, inplace = True)
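
As an alternative to dropping `step`, it could be turned into calendar-like features. The sketch below assumes that one step is one hour (as the PaySim dataset description states) and that the simulation clock starts at hour 0 of day 0; the second assumption is not confirmed anywhere in this lab, so `hour_of_day` is best read as a relative phase. The column names `day` and `hour_of_day` are made up for illustration.

```python
import pandas as pd

# `step` values copied from the data.head() output above
steps = pd.DataFrame({"step": [252, 306, 213, 139, 180]})

# Assumption: 1 step = 1 hour, clock starts at hour 0 of day 0
steps["day"] = steps["step"] // 24          # whole days elapsed since the start
steps["hour_of_day"] = steps["step"] % 24   # position within the 24-hour cycle

print(steps)
```

A cyclic feature such as `hour_of_day` is usually more informative to a model than the raw, ever-increasing counter, though whether it helps here would need to be checked on the data.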

### Run a logistic regression classifier and evaluate its accuracy.

In [76]:
# Your code here
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

y = paysim["isFraud"]
X = paysim.drop("isFraud", axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

model1 = LogisticRegression()
model1.fit(X_train, y_train)
model1.score(X_test, y_test)

pred = model1.predict(X_test)

print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
confusion_matrix(y_test,pred)

precision:  1.0
recall:  0.16129032258064516
f1:  0.27777777777777773
Mean Absolute Error: 0.00104
Mean Squared Error: 0.00104
Root Mean Squared Error: 0.0322490309931942


array([[24969,     0],
       [   26,     5]], dtype=int64)
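
The confusion matrix above shows why accuracy alone is misleading on imbalanced data. A short sketch using only the numbers printed above: it compares the model's accuracy with that of a trivial baseline that always predicts "not fraud".

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([[24969, 0],
               [26, 5]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
# A baseline that always predicts class 0 gets every true negative right and nothing else
baseline_accuracy = (tn + fp) / cm.sum()
recall = tp / (tp + fn)

print(f"model accuracy:    {accuracy:.5f}")
print(f"baseline accuracy: {baseline_accuracy:.5f}")
print(f"model recall:      {recall:.5f}")
```

The two accuracies differ only in the fourth decimal place, while recall shows that most fraud cases in the test split are missed.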

### Now pick a model of your choice and evaluate its accuracy.

In [77]:
# Your code here

from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor

# X_train and X_test were already standardized in the previous cell, so they are not scaled again here.

regressor = RandomForestRegressor()
regressor.fit(X_train, y_train)
pred2 = regressor.predict(X_test)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred2))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred2))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred2)))



Mean Absolute Error: 0.0006664
Mean Squared Error: 0.00041996000000000004
Root Mean Squared Error: 0.020492925608609425
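
Because `RandomForestRegressor` returns continuous values, its errors cannot be set side by side with precision and recall. One way to get comparable numbers is to use `RandomForestClassifier` instead. The sketch below is not the cell that produced the output above: it runs on synthetic data from `make_classification` (roughly 2% positives, an arbitrary choice), and the names `X_demo`, `y_demo`, `clf`, and `pred_rf` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PaySim sample: about 2% positives (assumption)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, n_informative=4,
    weights=[0.98, 0.02], random_state=0,
)

# stratify keeps the class ratio the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
pred_rf = clf.predict(X_te)  # hard 0/1 labels, unlike the regressor's output

cm_rf = confusion_matrix(y_te, pred_rf)
print("precision:", precision_score(y_te, pred_rf, zero_division=0))
print("recall:   ", recall_score(y_te, pred_rf))
print("f1:       ", f1_score(y_te, pred_rf))
print(cm_rf)
```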


In [81]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

model2 = KNeighborsClassifier(n_neighbors=4)
model2.fit(X_train, y_train)

# Predict once and reuse the result for every metric below
pred3 = model2.predict(X_test)
accuracy_score(y_test, pred3)

print("precision: ",precision_score(y_test,pred3))
print("recall: ",recall_score(y_test,pred3))
print("f1: ",f1_score(y_test,pred3))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred3))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred3))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred3)))
confusion_matrix(y_test,pred3)



precision:  1.0
recall:  0.41935483870967744
f1:  0.5909090909090909
Mean Absolute Error: 0.00072
Mean Squared Error: 0.00072
Root Mean Squared Error: 0.02683281572999748


array([[24969,     0],
       [   18,    13]], dtype=int64)

### Which model worked better and how do you know?

In [2]:
# Your response here
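# After scaling the data, both KNN and LogisticRegression reach a precision of 1.0 (no false positives), but KNN has a higher recall (0.42 vs 0.16) and F1 (0.59 vs 0.28), so it is the better of the two classifiers.
# The error stats of the RandomForestRegressor look better than the other two, but it predicts continuous values rather than class labels, so its errors are not directly comparable with those of the classifiers.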
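
The comparison can be reproduced directly from the two confusion matrices printed earlier. A minimal sketch: `summarize` is a hypothetical helper, and ranking by F1 is one reasonable choice among several for a fraud problem, not the only valid one.

```python
def summarize(cm):
    """Return precision, recall and F1 for a 2x2 confusion matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Confusion matrices copied from the outputs above
results = {
    "LogisticRegression": summarize([[24969, 0], [26, 5]]),
    "KNeighborsClassifier": summarize([[24969, 0], [18, 13]]),
}

# Rank the classifiers by F1, which balances precision and recall
best = max(results, key=lambda name: results[name]["f1"])

for name, scores in results.items():
    print(name, {k: round(v, 4) for k, v in scores.items()})
print("best by F1:", best)
```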
