# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', the factors that actually determine the outcome. This is called overfitting: the model scores well in training but poorly when it is applied to real data. Similarly, a model can be too simplistic to capture the signal at all. This is called underfitting, and it produces a model that never works well, in training or on real data.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data?resource=download . Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [2]:
# Your code here
import pandas as pd

data = pd.read_csv('Fraud.csv', nrows=100000)
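
One caveat on the cell above: `nrows=100000` reads the first 100,000 rows of the file, which is not the same as a random sample (the `step` counts further down only cover steps 1 to 10). A minimal sketch of drawing a random sample with `DataFrame.sample` follows. It uses a small synthetic frame as a stand-in for the real CSV, and the column values, sample size, and `random_state` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the full dataset (hypothetical values, not the Kaggle data).
full = pd.DataFrame({
    "step": range(1, 1001),
    "isFraud": [1 if i % 100 == 0 else 0 for i in range(1, 1001)],
})

# Draw a reproducible random sample instead of taking the first rows.
sample = full.sample(n=100, random_state=42)

print(len(sample))
print(sample["step"].min(), sample["step"].max())
```

On the real file this would mean loading everything with `pd.read_csv('Fraud.csv')` first and then sampling, which needs enough memory to hold the full CSV.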

In [3]:
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


### What is the distribution of the outcome? 

In [4]:
# Your response here
data['isFraud'].value_counts()

0    99884
1      116
Name: isFraud, dtype: int64
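
With counts this lopsided, it helps to put a number on the imbalance and on what a do-nothing model would score. A short sketch using the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
n_legit = 99884
n_fraud = 116
total = n_legit + n_fraud

# Share of rows that are fraud.
fraud_rate = n_fraud / total

# Accuracy of a model that always predicts "not fraud" (class 0).
baseline_accuracy = n_legit / total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-0 baseline accuracy: {baseline_accuracy:.4%}")
```

Only about 0.12% of the rows are fraud, so a classifier that never flags anything would already be about 99.88% accurate on this sample.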

### Clean the dataset. Pre-process it to make it suitable for ML training. Feel free to explore, drop, encode, transform, etc. Whatever you feel will improve the model score.

In [11]:
# Your code here
data.drop(columns=['nameOrig','nameDest','isFlaggedFraud'], inplace=True)

In [12]:
data['step'].value_counts()

9     37628
10    27274
8     21097
7      6837
1      2708
6      1660
2      1014
5       665
4       565
3       552
Name: step, dtype: int64

In [13]:
data['type'].value_counts()

PAYMENT     39512
CASH_OUT    30718
CASH_IN     20185
TRANSFER     8597
DEBIT         988
Name: type, dtype: int64

In [15]:
data_dummies = pd.get_dummies(data)
data_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            100000 non-null  int64  
 1   amount          100000 non-null  float64
 2   oldbalanceOrg   100000 non-null  float64
 3   newbalanceOrig  100000 non-null  float64
 4   oldbalanceDest  100000 non-null  float64
 5   newbalanceDest  100000 non-null  float64
 6   isFraud         100000 non-null  int64  
 7   type_CASH_IN    100000 non-null  uint8  
 8   type_CASH_OUT   100000 non-null  uint8  
 9   type_DEBIT      100000 non-null  uint8  
 10  type_PAYMENT    100000 non-null  uint8  
 11  type_TRANSFER   100000 non-null  uint8  
dtypes: float64(5), int64(2), uint8(5)
memory usage: 5.8 MB


### Run a logistic regression classifier and evaluate its accuracy.

In [22]:
# Your code here
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

In [17]:
X = data_dummies.drop(columns=['isFraud'])
Y = data_dummies['isFraud']

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, Y)
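
Because fraud is so rare, a plain random split can leave the test set with more or fewer positives than expected purely by chance, and the result changes on every run. A hedged sketch of a stratified, reproducible split on synthetic labels follows. `stratify` and `random_state` are real `train_test_split` parameters, but the data, the names (`X_demo`, `y_demo`), and the seed are assumptions for illustration, and this is not what the cell above ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows with a 1% positive class (assumed values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1

# stratify keeps the class ratio (almost) identical in both splits;
# random_state makes the split repeatable.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

print(ytr.mean(), yte.mean())
```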

In [21]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [24]:
y_pred = model.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

Accuracy:  0.99892


array([[24969,     1],
       [   26,     4]])

### Now pick a model of your choice and evaluate its accuracy.

In [26]:
# Your code here
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=11)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

Accuracy:  0.99884


array([[24970,     0],
       [   29,     1]])

### Which model worked better and how do you know?

In [2]:
# Your response here
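# Logistic regression worked better than KNN, but accuracy alone barely shows it:
# 0.99892 vs. 0.99884, and a model that always predicts "not fraud" would also score
# 24970/25000 = 0.9988 on this test set. The confusion matrices are more informative:
# logistic regression caught 4 of the 30 fraud cases (recall of about 13%) with 1 false positive,
# while KNN caught only 1 of 30 (about 3%). Both models still miss most of the fraud.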
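
Accuracy hides most of the difference between the two confusion matrices above. A minimal sketch that derives precision, recall, and F1 for the fraud class from the printed matrices, assuming scikit-learn's layout of rows as true labels and columns as predictions, i.e. `[[TN, FP], [FN, TP]]`. The helper name `summarize` is hypothetical.

```python
def summarize(name, cm):
    """Print metrics for the positive class from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a model predicts no positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    print(f"{name}: accuracy={accuracy:.5f} precision={precision:.3f} "
          f"recall={recall:.3f} f1={f1:.3f}")
    return accuracy, precision, recall, f1


# Matrices copied from the outputs above.
logreg = summarize("Logistic regression", [[24969, 1], [26, 4]])
knn = summarize("KNN (k=11)", [[24970, 0], [29, 1]])
```

Recall on the fraud class is the number that separates the two models here; the same values could also be obtained from `y_test` and `y_pred` with `sklearn.metrics.classification_report`.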
