# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data and not the 'signal', the factors that actually determine the outcome. This is called overfitting: the model scores well in training but poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), in which case it performs poorly on training data and real data alike.
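To make the two failure modes concrete, here is a minimal sketch on synthetic data (not the fraud dataset). It assumes scikit-learn is installed; the dataset, the `flip_y` noise level, and the `fit_demo_*` names are illustrative choices, not part of the lab. A depth-1 tree stands in for a model that is too simple, and an unrestricted tree for one that memorises noise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels randomly reassigned, so part of the target is pure noise
fit_demo_X, fit_demo_y = make_classification(
    n_samples=1000, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
fit_demo_X_train, fit_demo_X_test, fit_demo_y_train, fit_demo_y_test = train_test_split(
    fit_demo_X, fit_demo_y, random_state=0
)

fit_demo_results = {}
for label, depth in [("stump (depth 1)", 1), ("unrestricted tree", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(fit_demo_X_train, fit_demo_y_train)
    # Store (training accuracy, test accuracy) for each model
    fit_demo_results[label] = (
        tree.score(fit_demo_X_train, fit_demo_y_train),
        tree.score(fit_demo_X_test, fit_demo_y_test),
    )
    print(label, fit_demo_results[label])
```

The unrestricted tree should reach perfect training accuracy while doing noticeably worse on the held-out split, which is the gap that signals overfitting.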


### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?

In [1]:
# Import library
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Import dataset
data = pd.read_csv("../fraud.csv")

In [None]:
# Take a random sample
data = data.sample(n = 10000) 
data

In [None]:
# Descriptive statistics
data.describe()

In [None]:
# Inspect shape
print(data.shape)

In [None]:
# Dtypes
data.dtypes

In [None]:
# Correlations (numeric columns only; string columns such as "type" cannot be correlated)
data.select_dtypes("number").corr()

In [None]:
# Plot the destination balance before vs. after the transaction
plt.scatter(x="oldbalanceDest",y="newbalanceDest",data=data)

In [None]:
# Plots
plt.scatter(x="amount",y="newbalanceDest",data=data)

In [None]:
"""
What do you think will be the important features in determining the outcome?

I think the amount and newbalanceOrig will play a role.
"""

### What is the distribution of the outcome? 

In [None]:
# Your response here
# The outcome we model below is "isFraud", so we look at its distribution
# data["isFraud"].value_counts().plot.bar()
data["isFraud"].value_counts()
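Raw counts can hide how lopsided the classes are, so it can help to look at proportions as well. The sketch below uses a made-up series (995 zeros and 5 ones, purely hypothetical numbers) rather than the real data; `dist_demo_outcome` and the other `dist_demo_*` names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for an outcome column: 995 legitimate rows, 5 fraudulent ones
dist_demo_outcome = pd.Series([0] * 995 + [1] * 5, name="isFraud")

# Absolute counts per class
dist_demo_counts = dist_demo_outcome.value_counts()
# Share of each class; normalize=True divides the counts by the total number of rows
dist_demo_shares = dist_demo_outcome.value_counts(normalize=True)

print(dist_demo_counts)
print(dist_demo_shares)
```

On the real sample the same two calls can be applied directly to the outcome column.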

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [None]:
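# Earlier attempts at converting the time variable, kept for reference
# data["time"] = pd.to_timedelta(data["step"], unit='h')
# data["time"] = pd.to_datetime(data["step"])
# # data["time"] = pd.to_datetime(data["time"])

# For now we keep "step" as an integer and sort the sample chronologically
data = data.sort_values(by="step", ascending=True)
data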

# Resource: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_timedelta.html
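One possible way to integrate the time variable is sketched below. It assumes that one `step` equals one hour, as the `unit='h'` attempt above suggests; the toy values and the column names `elapsed`, `hour_of_day`, and `day` are hypothetical. Because the real start time is unknown, `hour_of_day` is only relative to the first step.

```python
import pandas as pd

# Toy frame with a few step values (not taken from the real dataset)
time_demo = pd.DataFrame({"step": [1, 23, 24, 25, 743]})

# Assumption: one step is one hour of elapsed time
time_demo["elapsed"] = pd.to_timedelta(time_demo["step"], unit="h")

# Cyclical position within a 24-hour window, relative to the (unknown) start hour
time_demo["hour_of_day"] = time_demo["step"] % 24

# Number of whole days elapsed since the start
time_demo["day"] = time_demo["step"] // 24

print(time_demo)
```

A feature such as `hour_of_day` lets a model pick up daily patterns that a raw, ever-increasing integer would not express.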

In [None]:
# Drop columns
data.drop(["nameOrig","nameDest","oldbalanceDest","oldbalanceOrg"], axis=1, inplace=True)

"""
We drop nameOrig and nameDest because they are essentially unique identifiers.
We drop oldbalanceDest and oldbalanceOrg because they are directly correlated with newbalanceDest and newbalanceOrig.
"""

In [None]:
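# We also create dummy variables for the categorical variable "type"
# (get_dummies returns one column per category, so it cannot be assigned back to a single column)
data = pd.get_dummies(data, columns=["type"])
data.head(5)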
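As a small illustration of what dummy encoding produces, the sketch below runs `pd.get_dummies` on a toy frame. The rows and amounts are invented; `dtype=int` is an optional choice made here so the indicators print as 0/1.

```python
import pandas as pd

# Toy frame with a categorical column (values invented for illustration)
dummy_demo = pd.DataFrame(
    {
        "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "TRANSFER"],
        "amount": [10.0, 250.0, 90.0, 40.0],
    }
)

# Replace "type" with one indicator column per category
dummy_demo_encoded = pd.get_dummies(dummy_demo, columns=["type"], dtype=int)
print(dummy_demo_encoded)
```

The original `type` column disappears and is replaced by `type_CASH_OUT`, `type_PAYMENT`, and `type_TRANSFER`, with exactly one indicator set per row.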

### Run a logistic regression classifier and evaluate its accuracy.

In [None]:
# Import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split dataframe
data_x = data[data.columns.difference(["isFraud"])]
data_y = data["isFraud"]

# Split data in test and train
X_train, X_test, y_train, y_test = train_test_split(data_x, data_y)

# Initiate & fit model
lr = LogisticRegression(random_state=0)
lr = lr.fit(X_train, y_train)

# Predict y
y_pred = lr.predict(X_test)
print("Predicted response:", y_pred, sep="\n")

# Actual response
print("Actual response:",y_test, sep="\n")

# Check the accuracy of the model prediction
lr_score = accuracy_score(y_test, y_pred)
print("Accuracy score lr:",lr_score)
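When classes are imbalanced, a high accuracy on its own says little. The sketch below, on made-up data with 1% positives (a hypothetical rate), shows a baseline that always predicts the majority class. It uses scikit-learn's `DummyClassifier`; the `base_demo_*` names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up data: 990 negatives and 10 positives, with random features
rng = np.random.default_rng(0)
base_demo_X = rng.normal(size=(1000, 3))
base_demo_y = np.array([0] * 990 + [1] * 10)

# A "model" that ignores the features and always predicts the most frequent class
base_demo_model = DummyClassifier(strategy="most_frequent").fit(base_demo_X, base_demo_y)
base_demo_pred = base_demo_model.predict(base_demo_X)

base_demo_accuracy = accuracy_score(base_demo_y, base_demo_pred)
base_demo_recall = recall_score(base_demo_y, base_demo_pred)

print("Baseline accuracy:", base_demo_accuracy)  # 0.99
print("Baseline recall:", base_demo_recall)      # 0.0
```

The baseline reaches 99% accuracy while finding none of the positive cases, so any real classifier's accuracy is worth reading against this kind of reference point.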


### Now pick a model of your choice and evaluate its accuracy.

In [None]:
## Random Forest
# Import libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initiate & fit model
rfc = RandomForestClassifier().fit(X_train, y_train)

# Predict y
y_pred = rfc.predict(X_test)
print("Predicted response:", y_pred, sep="\n")

# Actual response
print("Actual response:",y_test, sep="\n")

# Check the accuracy of the model prediction
rfc_score = accuracy_score(y_test, y_pred)
print("Accuracy score rfc:",rfc_score)

In [None]:
## K-Nearest Neighbors
# Import libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initiate & fit model
knc = KNeighborsClassifier().fit(X_train, y_train)

# Predict y
y_pred = knc.predict(X_test)
print("Predicted response:", y_pred, sep="\n")

# Actual response
print("Actual response:",y_test, sep="\n")

# Check the accuracy of the model prediction
knc_score = accuracy_score(y_test, y_pred)
print("Accuracy score knc:",knc_score)

### Which model worked better and how do you know?

In [None]:
print("Accuracy score lr:",lr_score)
print("Accuracy score rfc:",rfc_score)
print("Accuracy score knc:",knc_score)
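Accuracy scores like the ones printed above can look nearly identical across models on imbalanced data, so a comparison may benefit from recall and F1 on the minority class. The sketch below is one way to set that up on synthetic data with roughly 3% positives (an assumed rate, not the real one). The `cmp_*` names, `max_iter=1000`, and the stratified split are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the fraud data: about 3% of rows in the positive class
cmp_X, cmp_y = make_classification(
    n_samples=2000, n_features=8, n_informative=4, weights=[0.97, 0.03], random_state=0
)

# stratify keeps the class ratio the same in the training and test splits
cmp_X_train, cmp_X_test, cmp_y_train, cmp_y_test = train_test_split(
    cmp_X, cmp_y, stratify=cmp_y, random_state=0
)

cmp_models = {
    "lr": LogisticRegression(max_iter=1000, random_state=0),
    "rfc": RandomForestClassifier(random_state=0),
    "knc": KNeighborsClassifier(),
}

cmp_scores = {}
for name, model in cmp_models.items():
    pred = model.fit(cmp_X_train, cmp_y_train).predict(cmp_X_test)
    cmp_scores[name] = {
        "accuracy": accuracy_score(cmp_y_test, pred),
        # zero_division=0 avoids a warning if a model predicts no positives at all
        "recall": recall_score(cmp_y_test, pred, zero_division=0),
        "f1": f1_score(cmp_y_test, pred, zero_division=0),
    }
    print(name, cmp_scores[name])
```

Models with similar accuracy can differ clearly in recall and F1, which gives a firmer basis for saying which one worked better.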

In [None]:
"""
They all produced approximately the same accuracy score.
"""