# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: the model performs well in training and poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal accurately (underfitting), which produces a model that never works well.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data?resource=download . Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset; use a sample of n=100000 rows instead, so your computer doesn't freeze.

In [None]:
# Imports

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from imblearn.over_sampling import SMOTE

from sklearn.metrics import recall_score

In [None]:
# Fixed random_state so the same 100,000-row sample is drawn on every run
fraud = pd.read_csv('Fraud.csv').sample(100000, random_state = 0)

In [None]:
fraud.shape

In [None]:
fraud.head()

In [None]:
fraud.dtypes

In [None]:
fraud.describe()

In [None]:
# Seaborn heatmap

corr = fraud.corr(numeric_only = True)

mask = np.triu(np.ones_like(corr, dtype = bool))

sns.heatmap(corr, mask = mask, annot = True, cmap = 'coolwarm')

plt.show()

In [None]:
# 'amount', 'oldbalanceOrg' and 'newbalanceOrig' are the most important features.
# They all have a strong positive correlation with isFraud.

### What is the distribution of the outcome? 

In [None]:
print(fraud.isFlaggedFraud.value_counts())
print(fraud.isFraud.value_counts())
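
Raw counts can make it hard to see how skewed the outcome is. As a minimal sketch, the cell below uses a small made-up Series (the values are illustrative and not taken from `Fraud.csv`; the names `is_fraud`, `proportions` and `imbalance_ratio` are hypothetical) to show how `value_counts(normalize=True)` reports class shares instead of counts.

```python
import pandas as pd

# Toy outcome column (made-up values, not from Fraud.csv): 2 fraud cases in 1000 rows
is_fraud = pd.Series([0] * 998 + [1] * 2, name="isFraud")

# normalize=True returns the share of each class instead of the raw count
proportions = is_fraud.value_counts(normalize=True)
print(proportions)

# Number of non-fraud rows per fraud row, a quick summary of the skew
imbalance_ratio = (is_fraud == 0).sum() / (is_fraud == 1).sum()
print(f"Non-fraud rows per fraud row: {imbalance_ratio:.0f}")
```

The same two calls should work on `fraud.isFraud` directly.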

### Clean the dataset. Pre-process it to make it suitable for ML training. Feel free to explore, drop, encode, transform, etc. Whatever you feel will improve the model score.

In [None]:
fraud.sample()

In [None]:
fraud.isFlaggedFraud.value_counts()

In [None]:
fraud.isna().sum()

In [None]:
fraud_clean = fraud[['amount', 'oldbalanceOrg', 'newbalanceOrig', 'isFraud']].reset_index(drop = True)

fraud_clean.head()

In [None]:
X = fraud_clean.drop('isFraud', axis = 1)
y = fraud_clean['isFraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
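
With very few fraud cases, a plain random split can leave the test set with a noticeably different fraud share than the training set. One possible variation, sketched here on made-up data (the arrays `X_toy` and `y_toy` are hypothetical), is to pass `stratify` to `train_test_split` so both splits keep roughly the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1000 rows, 20 of them fraud (2%)
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 980 + [1] * 20)

# stratify=y_toy asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print("Fraud share in train:", y_tr.mean())
print("Fraud share in test: ", y_te.mean())
```

In the lab cell above, this would correspond to adding `stratify = y` to the existing call.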

In [None]:
# Handling imbalance

sm = SMOTE(random_state = 0, sampling_strategy = 1.0)

X_train, y_train = sm.fit_resample(X_train, y_train)
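
To build some intuition for what the resampling step does, here is a simplified sketch of SMOTE-style interpolation. It is not imblearn's implementation: the function `smote_like_samples` and its parameters are hypothetical, and it only illustrates the core idea of placing a synthetic point on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Simplified SMOTE-style interpolation (illustrative, not imblearn's code).

    Assumes X_min holds only minority-class rows with no duplicates.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    # k nearest neighbours of each sample; column 0 is the sample itself, so skip it
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))    # pick a minority sample at random
        nb = rng.choice(neighbours[base])  # pick one of its nearest neighbours
        gap = rng.random()                 # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Four made-up minority points in two dimensions
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
new_points = smote_like_samples(X_minority, n_new=5)
print(new_points)
```

Note that the lab cell resamples only the training data, so the test set keeps the original class distribution.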

### Run a logistic regression classifier and evaluate its accuracy.

In [None]:
model_lr = LogisticRegression(max_iter = 1000)

model_lr.fit(X_train, y_train)

pred = model_lr.predict(X_test)

print('Accuracy:', model_lr.score(X_test, y_test))
print('Recall:', recall_score(y_test, pred))

### Now pick a model of your choice and evaluate its accuracy.

In [None]:
model_knn = KNeighborsClassifier(n_neighbors = 10)

model_knn.fit(X_train, y_train)

pred = model_knn.predict(X_test)

print('Accuracy:', model_knn.score(X_test, y_test))
print('Recall:', recall_score(y_test, pred))
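
Accuracy and recall are both derived from the confusion matrix, so looking at the matrix itself can make the trade-offs easier to see. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical, not model output from this lab) and also reports precision, which shows how many flagged transactions were actually fraud.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for ten transactions (1 = fraud)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Recall: share of real fraud cases that were caught
print("Recall:", recall_score(y_true, y_pred))
# Precision: share of flagged cases that were really fraud
print("Precision:", precision_score(y_true, y_pred))
```

For the lab models, the same calls would take `y_test` and `pred` as arguments.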

### Which model worked better and how do you know?

In [None]:
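# I used a logistic regression model and a KNN model.
# In both cases, the accuracy was extremely high (> 0.99).
# Because the classes are so imbalanced, accuracy alone says little, so I compared recall on the fraud class.
# The KNN model had a slightly better recall (0.97 vs. 0.94), so it worked better by that measure.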
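
A quick way to see why accuracy on its own can mislead with imbalanced classes is to score a baseline that never predicts fraud. This is a minimal sketch on made-up labels (`y_true_toy` and `always_legit` are hypothetical names, and the 1% fraud rate is an assumption for illustration).

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 10 fraud cases among 1000 transactions
y_true_toy = np.array([0] * 990 + [1] * 10)

# A "model" that labels every transaction as not fraud
always_legit = np.zeros_like(y_true_toy)

baseline_accuracy = accuracy_score(y_true_toy, always_legit)
baseline_recall = recall_score(y_true_toy, always_legit)

print("Baseline accuracy:", baseline_accuracy)
print("Baseline recall:", baseline_recall)
```

The baseline scores very high accuracy while catching no fraud at all, which is why recall on the fraud class is the more informative number here.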
