# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


Like we discussed in class, when we have noisy data, if we are not careful, we can end up fitting our model to the noise in the data and not the 'signal', that is, the factors that actually determine the outcome. This is called overfitting; it gives good results in training and bad results when the model is applied to real data. Similarly, we could have a model that is too simplistic to accurately model the signal (underfitting). This produces a model that doesn't work well on any data, training or real.


### Note: before doing the first commit, make sure you don't include the large csv file, either by adding it to .gitignore, or by deleting it.

### First, download the data from: https://www.kaggle.com/datasets/chitwanmanchanda/fraudulent-transactions-data?resource=download . Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?
### Note: don't use the entire dataset, use a sample instead, with n=100000 elements, so your computer doesn't freeze.

In [24]:
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [4]:
# Your code here

fraud_transactions = pd.read_csv("/Users/amandine/Downloads/Fraud.csv")

fraud_transactions_sample = fraud_transactions.sample(n=100000, random_state=42)

In [5]:
fraud_transactions_sample.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
3737323,278,CASH_IN,330218.42,C632336343,20866.0,351084.42,C834976624,452419.57,122201.15,0,0
264914,15,PAYMENT,11647.08,C1264712553,30370.0,18722.92,M215391829,0.0,0.0,0,0
85647,10,CASH_IN,152264.21,C1746846248,106589.0,258853.21,C1607284477,201303.01,49038.8,0,0
5899326,403,TRANSFER,1551760.63,C333676753,0.0,0.0,C1564353608,3198359.45,4750120.08,0,0
2544263,206,CASH_IN,78172.3,C813403091,2921331.58,2999503.88,C1091768874,415821.9,337649.6,0,0
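
The prompt above asks for descriptive statistics. A minimal sketch of what that could look like follows. It runs on a toy DataFrame (`toy`) whose values are made up for illustration; only the column names `type`, `amount`, and `isFraud` come from the dataset. The same calls should work on `fraud_transactions_sample`.

```python
import pandas as pd

# Toy stand-in for fraud_transactions_sample; these values are hypothetical.
toy = pd.DataFrame({
    "type": ["CASH_IN", "PAYMENT", "TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT"],
    "amount": [330218.42, 11647.08, 1551760.63, 250000.00, 78172.30, 9000.00],
    "isFraud": [0, 0, 1, 0, 1, 0],
})

# Summary statistics (count, mean, std, quartiles) for the amount column.
amount_summary = toy["amount"].describe()

# Share of fraudulent rows within each transaction type.
fraud_rate_by_type = (
    toy.groupby("type")["isFraud"].mean().sort_values(ascending=False)
)

# Median amount for legitimate (0) vs. fraudulent (1) rows.
median_amount_by_class = toy.groupby("isFraud")["amount"].median()

print(amount_summary)
print(fraud_rate_by_type)
print(median_amount_by_class)
```

Grouping by `type` is one way to start answering which features might matter: a transaction type with a much higher fraud share than the others is a candidate predictor.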


### What is the distribution of the outcome? 

In [8]:
fraud_transactions_sample.shape

(100000, 11)

In [7]:
# Your response here

outcome_distribution = fraud_transactions_sample['isFraud'].value_counts()
print(outcome_distribution)

0    99859
1      141
Name: isFraud, dtype: int64
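
To make the imbalance concrete, the counts printed above can be turned into a fraud rate and a baseline accuracy. The sketch below uses only the two counts from that output; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
n_legit = 99859
n_fraud = 141
n_total = n_legit + n_fraud

# Fraction of transactions that are fraudulent.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial model that always predicts "not fraud".
baseline_accuracy = n_legit / n_total

# Number of legitimate transactions per fraudulent one.
imbalance_ratio = n_legit / n_fraud

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.5f}")
print(f"Legitimate transactions per fraud: {imbalance_ratio:.0f}")
```

A model that never flags fraud would already score about 99.86% accuracy on this sample, which is why accuracy alone says little here.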


### Clean the dataset. Pre-process it to make it suitable for ML training. Feel free to explore, drop, encode, transform, etc. Whatever you feel will improve the model score.

In [9]:
# Your code here

fraud_transactions_sample.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [13]:
cleaned_data.isna().sum()

step              0
amount            0
oldbalanceOrg     0
newbalanceOrig    0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
type_CASH_IN      0
type_CASH_OUT     0
type_DEBIT        0
type_PAYMENT      0
type_TRANSFER     0
dtype: int64

In [10]:
# Drop unnecessary and redundant columns

cleaned_data = fraud_transactions_sample.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis=1)

In [11]:
cleaned_data["type"].unique()

array(['CASH_IN', 'PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT'],
      dtype=object)

In [12]:
# Categorical data: one-hot encode the 'type' column
cleaned_data = pd.get_dummies(cleaned_data, columns=['type'], prefix='type')

In [15]:
# Numerical features: standardize them (zero mean, unit variance).
# This matters because it puts the coefficients of linear models like logistic regression
# on a comparable scale, so they reflect the relative impact of each feature on the target.

scaler = StandardScaler()
numerical_cols = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
cleaned_data[numerical_cols] = scaler.fit_transform(cleaned_data[numerical_cols])
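
The cell above fits the scaler on the whole sample. A common alternative is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence preprocessing. The sketch below shows that ordering on synthetic data; `X_demo`, `y_demo`, and `scaler_demo` are hypothetical names, not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, skewed "amount-like" features; purely illustrative.
rng = np.random.default_rng(42)
X_demo = rng.lognormal(mean=10, sigma=2, size=(1000, 2))
y_demo = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

scaler_demo = StandardScaler()
# Learn mean and standard deviation from the training rows only...
X_tr_scaled = scaler_demo.fit_transform(X_tr)
# ...and reuse those same statistics to transform the test rows.
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # close to 0 for each column
print(X_tr_scaled.std(axis=0))   # close to 1 for each column
```

The test columns will generally not have exactly zero mean or unit variance after this, which is expected: they are scaled with the training statistics.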


In [16]:
cleaned_data.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
3737323,0.240608,0.267847,-0.28121,-0.172685,-0.202226,-0.318779,0,1,0,0,0,0
264914,-1.604774,-0.302388,-0.277934,-0.285857,-0.342598,-0.353941,0,0,0,0,1,0
85647,-1.639858,-0.050687,-0.251661,-0.204091,-0.28014,-0.339831,0,1,0,0,0,0
5899326,1.117691,2.454376,-0.288402,-0.292232,0.649758,1.012879,0,0,0,0,0,1
2544263,-0.264592,-0.183309,0.718575,0.729124,-0.213581,-0.256785,0,1,0,0,0,0


In [17]:
# Balance the target variable (isFraud) using SMOTE

X = cleaned_data.drop('isFraud', axis=1)
y = cleaned_data['isFraud']

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

In [21]:
# Check if resampling was successful 

# Before resampling
print("Before resampling:")
print(y.value_counts())

# After resampling
print("\nAfter resampling:")
print(y_resampled.value_counts())

Before resampling:
0    99859
1      141
Name: isFraud, dtype: int64

After resampling:
0    99859
1    99859
Name: isFraud, dtype: int64


In [22]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

### Run a logistic regression classifier and evaluate its accuracy.

In [25]:
# Your code here

# Initialize and train the logistic regression classifier
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [26]:
# Make predictions on the testing data
y_pred = log_reg.predict(X_test)

In [27]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of logistic regression classifier:", accuracy)

Accuracy of logistic regression classifier: 0.9568896455037051
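
Accuracy can hide poor performance on the rare class. The sketch below uses hand-written labels (hypothetical, 990 legitimate and 10 fraudulent) to compare a model that never predicts fraud with one that catches some of it, using the confusion matrix, precision, and recall from `sklearn.metrics`.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical ground truth: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10

# Model A never predicts fraud.
y_never = [0] * 1000

# Model B: 4 false alarms among the legitimate rows, 6 of the 10 frauds caught.
y_model = [0] * 986 + [1] * 4 + [1] * 6 + [0] * 4

for name, y_hat in [("never-fraud", y_never), ("model B", y_model)]:
    print(name)
    print("  accuracy :", accuracy_score(y_true, y_hat))
    print("  precision:", precision_score(y_true, y_hat, zero_division=0))
    print("  recall   :", recall_score(y_true, y_hat))
    # Rows are true classes, columns are predicted classes.
    print(confusion_matrix(y_true, y_hat))
```

Both models score at least 99% accuracy, yet the first one has a recall of 0 on fraud. Precision and recall on class 1 separate them clearly.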


### Now pick a model of your choice and evaluate its accuracy.

In [33]:
# Your code here

# Try a Random Forest model, for several reasons:

# Ensemble of trees: Random Forest combines the predictions of multiple individual decision trees.
# Each tree is trained on a bootstrap sample of the rows and considers a random subset of the
# features at each split, which helps to reduce overfitting and improve generalization performance.

# Non-linear relationships: Fraud detection often involves detecting complex patterns and relationships
# in the data. Random Forest can capture non-linear relationships between features and the target variable,
# making it effective for identifying fraudulent patterns that may not be captured by
# linear models like logistic regression.

# Robustness to outliers: Fraudulent transactions often exhibit anomalous behavior that can be considered
# as outliers in the dataset. Random Forest is robust to outliers because it constructs multiple trees
# and aggregates their predictions, reducing the impact of individual outliers on the overall model performance.

# Feature importance: Random Forest provides a measure of feature importance,
# which can help identify the most relevant features for detecting fraud.
# This information can be valuable for understanding the underlying patterns in the data and improving
# the fraud detection system.

# Handling imbalanced data: Random Forest can be adapted to imbalanced datasets, where the number of
# fraudulent transactions is much smaller than non-fraudulent ones, for example through the
# class_weight parameter of RandomForestClassifier. Here the classes were already balanced with SMOTE.
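
Two of the points above, feature importance and handling imbalance, can be sketched briefly. The example below is an illustration on synthetic data, not the lab's solution: it sets `class_weight="balanced"` instead of resampling and reads `feature_importances_` after fitting. The feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data (about 5% positives); purely illustrative.
X_arr, y_demo = make_classification(
    n_samples=3000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

# class_weight="balanced" reweights classes inversely to their frequency,
# an alternative to oversampling the minority class.
forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
forest.fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```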

In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [34]:
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(X_train, y_train)

In [35]:
# Make predictions on the testing data
y_pred = random_forest.predict(X_test)

In [36]:
# classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     19913
           1       1.00      1.00      1.00     20031

    accuracy                           1.00     39944
   macro avg       1.00      1.00      1.00     39944
weighted avg       1.00      1.00      1.00     39944



In [37]:
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

0.9993490887242139


### Which model worked better and how do you know?

In [2]:
# Your response here

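# Random forest works better: its accuracy (0.9993) is higher than logistic regression's (0.9569),
# and its precision, recall, and f1-score all round to 1.00 for both classes.
# Caveat: SMOTE was applied before the train/test split, so the test set is balanced and includes
# synthetic samples; these scores may overstate performance on real, imbalanced data.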
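
One way to compare two models on data that keeps its original class balance is to hold out a stratified test set and report metrics that focus on the rare class, such as recall and average precision (area under the precision-recall curve). The sketch below does this on synthetic data; the dataset, the `models` dictionary, and the `scores` dictionary are assumptions for illustration, and the numbers it prints say nothing about the fraud dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 3% positives); a stand-in for the fraud sample.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, n_informative=4,
    weights=[0.97, 0.03], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Probability of the positive (rare) class for each test row.
    positive_proba = model.predict_proba(X_te)[:, 1]
    scores[name] = {
        "average_precision": average_precision_score(y_te, positive_proba),
        "recall": recall_score(y_te, model.predict(X_te)),
    }
    print(name, scores[name])
```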
