<h1 style="text-align: center;font-size: 40px;">Enhancing Credit Card Fraud Detection Using AutoML and AGI Concepts</h1>

## GROUP-ASSIGNMENT-2
### **Team Members:**  
###  Tejaskumar Sanjaykumar Patel (200575242)
###  Chintan Chauhan  (200564227)
###  Priyank Bhaveshbhai Siddhapura (200544911)  
###  John Hanok (200573253)

## 🔍 Problem Statement & Solution
Credit card fraud is a growing issue in online banking and e-commerce, often hidden within large volumes of legitimate transactions. To address this, we built an AutoML-based fraud detection system using the PyCaret framework. The pipeline automatically compares and tunes candidate algorithms to classify transactions as fraudulent or not. Inspired by three research papers, our approach focuses on handling data imbalance and optimizing precision and recall rather than just accuracy, making the system more effective in real-world fraud scenarios.


---

## 🧩 Issues Faced & Resolutions
## 💡 Our Solution
We implemented an AutoML-based pipeline using **PyCaret**, addressing the key concern raised in Assignment 2 feedback: **our model's inability to detect fraud despite high overall accuracy**. In this final phase, we:
- Focused on F1-score and precision-recall AUC
- Implemented **SMOTE** for better balance
- Refined model selection and configuration
- Visualized outcomes clearly and added AGI enhancements

---

## 🔄 From Prototype to Now

In Assignment 2, we built a basic AutoML pipeline using PyCaret, which achieved high overall accuracy but failed to detect fraud cases effectively. Since then, we've significantly enhanced the solution based on both model evaluation feedback and conceptual improvement goals.

### ✅ Key Improvements Made Since Prototype:

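- **Threshold Optimization**: We tuned the probability threshold used for classification instead of relying on the default of 0.5, improving detection of fraudulent cases while keeping false positives in check. In PyCaret this is done with the `optimize_threshold()` function rather than an `optimize_threshold=True` flag.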

- **Ensemble Learning**: We combined multiple learners (e.g., Bagging, Boosting) to build more robust and generalizable models. In PyCaret this is done with `ensemble_model()` or `blend_models()`; the `train_ensemble=True` flag belongs to MLJAR's `AutoML` class.

- **Feature Reduction**: Using feature importance rankings from PyCaret, we removed less relevant variables and reduced model complexity, improving interpretability and training time.

- **Class Imbalance Handling**: Applied **SMOTE (Synthetic Minority Oversampling Technique)** to balance the classes before passing them to PyCaret, ensuring the model isn't biased toward non-fraud cases.

- **Evaluation Strategy Shift**: Switched from evaluating based on accuracy to **F1-score**, **PR-AUC**, and **confusion matrix**, which are more appropriate for imbalanced classification tasks.

- **Conceptual Enhancement**: Integrated the idea of **Artificial General Intelligence (AGI)** as a future-proof direction to build fraud detection systems that adapt and learn like human intelligence.
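
To make the evaluation shift concrete, here is a minimal sketch with made-up confusion-matrix counts (not results from our dataset). It shows why accuracy alone is misleading on imbalanced data: a classifier that labels every transaction as non-fraud reaches 99% accuracy while catching no fraud at all.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical data: 10,000 transactions, 100 of them fraudulent.
# Classifier 1 predicts "non-fraud" for everything.
always_legit = classification_metrics(tp=0, fp=0, fn=100, tn=9900)

# Classifier 2 catches 80 frauds at the cost of 120 false alarms.
some_model = classification_metrics(tp=80, fp=120, fn=20, tn=9780)

print(always_legit)
print(some_model)
```

The second classifier has slightly lower accuracy but a far higher F1-score, which is the behavior the metric switch is meant to reward.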

These changes represent our transition from a basic working prototype to a fine-tuned, real-world-ready fraud detection pipeline using AutoML.

---
## 🌎 Real-World Value
When integrated into a real-time transaction system, a model like this could raise alerts for unusual patterns as they occur and help reduce fraud losses. Because it learns from data, it may also produce fewer false positives than rigid rule-based systems.

# **Step 1: Importing Required Libraries**

In [None]:
!pip install --pre pycaret[full] --extra-index-url https://pypi.org/simple --upgrade --quiet


In [None]:
!pip install numpy==1.23.5 --force-reinstall --quiet


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from pycaret.classification import *
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


# **Step 2: Load and Explore Dataset**

In [None]:
# Load the dataset
data = pd.read_csv("/content/credit_card_fraud_dataset.csv")

# View basic information
print("Shape of dataset:", data.shape)
print("\nSample data:")
display(data.head())

# Check the class balance
sns.countplot(x='IsFraud', data=data)
plt.title("Distribution of Fraud (1) and Non-Fraud (0)")
plt.show()


# **Step 3: Apply SMOTE to Balance Dataset**

In [None]:
# Make a copy so original remains untouched
data_encoded = data.copy()

# Drop TransactionDate (not usable directly in SMOTE) and TransactionID (an identifier, not a predictor)
data_encoded = data_encoded.drop(columns=['TransactionDate', 'TransactionID'])

# Encode categorical features
data_encoded = pd.get_dummies(data_encoded, columns=['TransactionType', 'Location'], drop_first=True)

# Separate X and y
X = data_encoded.drop('IsFraud', axis=1)
y = data_encoded['IsFraud']

# Apply SMOTE (already imported in Step 1)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Combine for PyCaret
balanced_df = pd.concat([pd.DataFrame(X_resampled, columns=X.columns),
                         pd.DataFrame(y_resampled, columns=['IsFraud'])], axis=1)

# Visualize balance
sns.countplot(x='IsFraud', data=balanced_df)
plt.title("Balanced Class Distribution (After SMOTE)")
plt.show()
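
As a rough illustration of what SMOTE does in this step, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified stand-in written for this explanation, not the `imblearn` implementation; the function name, the sample points, and `k=2` are all assumptions.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority-class (fraud) points with two scaled features
fraud_points = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.3), (1.1, 0.15)]
new_points = smote_like_oversample(fraud_points, n_new=6)
print(new_points)
```

Every synthetic point lies on a line segment between two real fraud samples, which is why SMOTE adds variety without simply duplicating rows.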


# **Step 4: Setup PyCaret AutoML with F1-score Focus**

In [None]:
clf = setup(
    data=balanced_df,
    target='IsFraud',
    session_id=42,
    normalize=True,
    use_gpu=False,
    fold_strategy='stratifiedkfold',
    fold=5
)
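
Because Step 3 oversamples the full dataset before PyCaret splits it, synthetic fraud rows can end up in the validation folds and the hold-out set, which tends to inflate the reported scores. One possible alternative, sketched below, is to pass the encoded but not yet resampled data and let PyCaret handle the imbalance through `fix_imbalance=True`. As we understand PyCaret 3.x, this applies SMOTE inside the training pipeline rather than to validation data. The helper name is hypothetical.

```python
# Arguments mirror the setup() call above; only fix_imbalance is new.
SETUP_KWARGS = dict(
    target="IsFraud",
    session_id=42,
    normalize=True,
    use_gpu=False,
    fold_strategy="stratifiedkfold",
    fold=5,
    fix_imbalance=True,  # PyCaret uses SMOTE by default when this is set
)

def setup_with_internal_smote(encoded_df):
    """Hypothetical helper: run PyCaret setup on data that has NOT been
    oversampled beforehand (e.g. `data_encoded` from Step 3)."""
    # Imported lazily so the snippet can be read without PyCaret installed
    from pycaret.classification import setup
    return setup(data=encoded_df, **SETUP_KWARGS)
```

With this variant the manual SMOTE cell would only be needed for the class-balance plot, not for training.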


# **Step 5: Compare Models Based on F1-Score**

In [None]:
best_model = compare_models(sort='F1')
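
The ensemble idea listed among the improvements can be illustrated with soft voting: several models each output a fraud probability and the ensemble averages them. The sketch below uses made-up probabilities and a hypothetical `soft_vote` helper. In PyCaret the same idea is available through `blend_models()`, which this notebook does not call.

```python
def soft_vote(probabilities_per_model, threshold=0.5):
    """Average each transaction's fraud probability across models and
    label it fraud (1) when the mean reaches the threshold."""
    n_models = len(probabilities_per_model)
    averaged = [sum(col) / n_models for col in zip(*probabilities_per_model)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels

# Hypothetical fraud probabilities from three models for four transactions
model_a = [0.10, 0.80, 0.40, 0.60]
model_b = [0.20, 0.90, 0.70, 0.30]
model_c = [0.00, 0.70, 0.70, 0.30]

averaged, labels = soft_vote([model_a, model_b, model_c])
print(averaged, labels)
```

The third transaction is flagged even though one model scored it below 0.5, and the fourth is cleared even though one model scored it above 0.5. Averaging smooths out individual models' errors.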


# **Step 6: Evaluate Best Model with Precision-Recall & Confusion Matrix**

In [None]:
evaluate_model(best_model)
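
To show what a precision-recall summary measures, here is a small sketch of average precision, a common single-number summary of the precision-recall curve. The labels and scores are invented for illustration, and the function is a simplified version that does not treat tied scores specially.

```python
def average_precision(y_true, scores):
    """Average precision: mean of the precision values measured at the rank
    of each true fraud case, after sorting by descending score.
    Simplification: tied scores are not treated specially."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

# Hypothetical scores for five transactions, two of them fraudulent
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]

ap = average_precision(y_true, scores)
print(round(ap, 4))
```

Unlike accuracy, this value depends only on how highly the fraud cases are ranked, so the large number of legitimate transactions cannot mask poor fraud detection.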


# **Step 7: Finalize and Predict**

In [None]:
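# Score the hold-out set before finalizing, so the metrics reflect unseen data
predictions = predict_model(best_model)

# Refit the selected model on the full dataset (training + hold-out) for deployment
final_model = finalize_model(best_model)

# Show a sample of predicted labels and scores
predictions[['prediction_label', 'prediction_score']].head()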
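
Threshold optimization, mentioned among the improvements, can be sketched as a sweep over candidate cut-offs that keeps the one with the best F1-score. The example uses invented labels and probabilities, and the helper names are hypothetical. It assumes `fraud_probs` holds the predicted probability of the fraud class, which is not necessarily what the `prediction_score` column contains (to our knowledge that column reports the probability of the predicted label).

```python
def f1_at_threshold(y_true, fraud_probs, threshold):
    """F1-score when every probability >= threshold is labelled fraud."""
    pairs = list(zip(y_true, fraud_probs))
    tp = sum(1 for y, p in pairs if p >= threshold and y == 1)
    fp = sum(1 for y, p in pairs if p >= threshold and y == 0)
    fn = sum(1 for y, p in pairs if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(y_true, fraud_probs, candidates=None):
    """Return the candidate threshold with the highest F1-score."""
    if candidates is None:
        candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    return max(candidates,
               key=lambda t: f1_at_threshold(y_true, fraud_probs, t))

# Hypothetical labels and fraud probabilities for eight transactions
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
fraud_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

threshold = best_f1_threshold(y_true, fraud_probs)
print(threshold, f1_at_threshold(y_true, fraud_probs, threshold))
print(0.5, f1_at_threshold(y_true, fraud_probs, 0.5))
```

In this toy data the default cut-off of 0.5 misses one fraud case that a lower threshold catches. In practice the threshold should be chosen on validation data, not on the set used for final reporting.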



# **Step 8: Visualization**

In [None]:
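# Plots use best_model: final_model has been refit on the full data, hold-out rows included

# Confusion matrix
plot_model(best_model, plot='confusion_matrix')

# ROC-AUC curve
plot_model(best_model, plot='auc')

# Precision-recall curve
plot_model(best_model, plot='pr')

# Feature importance
plot_model(best_model, plot='feature')

# Classification report
plot_model(best_model, plot='class_report')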
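
The feature-reduction step described earlier can be sketched as keeping the top-ranked features until they account for most of the total importance. The importance values, the 0.9 cut-off, and the helper name below are placeholders for illustration; the real ranking comes from the feature importance plot.

```python
def select_top_features(importances, coverage=0.95):
    """Keep the highest-ranked features until their summed importance
    reaches `coverage` of the total."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    kept, running = [], 0.0
    for name, value in ranked:
        kept.append(name)
        running += value
        if running / total >= coverage:
            break
    return kept

# Hypothetical importance scores, not values from our model
importances = {
    "Amount": 0.50,
    "MerchantID": 0.30,
    "TransactionType_refund": 0.12,
    "Location_Dallas": 0.05,
    "Location_San Antonio": 0.03,
}

print(select_top_features(importances, coverage=0.9))
```

The returned column names could then be used to subset the DataFrame before calling `setup` again.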


# **Step 9: AGI Conceptual Enhancement**

# Conceptual Enhancement: Using AGI for Adaptive Fraud Detection

In the future, Artificial General Intelligence (AGI) could enhance fraud detection systems by:

- Understanding new fraud patterns without retraining
- Simulating adversarial attacks to improve system robustness
- Dynamically adapting to shifting customer behavior and fraud strategies

AGI could automate not just model selection but continuous learning from evolving data, making systems proactive rather than reactive.


# **Step 10: Learning Curve Analysis – Detecting Overfitting or Underfitting**

In [None]:
# Create a simpler model to avoid parallel pickling
simple_model = create_model('lr')  # Logistic Regression (safe for learning curve)
plot_model(simple_model, plot='learning')


## 📈 Learning Curve Analysis

We generated a learning curve using Logistic Regression to visualize model generalization performance over different training set sizes.

- Both training and validation F1-scores remained close (~0.859) with minimal gap.
- The curves are stable and nearly flat, indicating that increasing training data further may not yield major improvements.
- The small variance bands suggest reliable and consistent model performance across cross-validation folds.

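### **Conclusion:** The Logistic Regression baseline shows no sign of overfitting: training and validation scores stay close at every training size. Because this curve was produced for the baseline rather than for the model chosen by `compare_models`, the selected model should be checked the same way before real-world deployment.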
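
The reading of the learning curve above can be expressed as two simple checks: the gap between training and validation scores at the largest training size, and how much the validation score improved as data was added. The sketch below uses illustrative numbers chosen to resemble the behavior described, not values read off our plot, and the tolerance values are arbitrary assumptions.

```python
def diagnose_learning_curve(train_scores, val_scores,
                            gap_tol=0.02, flat_tol=0.01):
    """Rough reading of a learning curve from per-size mean scores.
    gap_tol and flat_tol are illustrative cut-offs, not established constants."""
    final_gap = train_scores[-1] - val_scores[-1]
    val_gain = val_scores[-1] - val_scores[0]
    return {
        "final_gap": final_gap,
        "overfitting_signal": final_gap > gap_tol,  # train far above validation
        "plateaued": abs(val_gain) < flat_tol,      # more data barely helps
    }

# Illustrative mean F1-scores at four increasing training sizes
train_scores = [0.861, 0.860, 0.860, 0.859]
val_scores = [0.857, 0.858, 0.859, 0.859]

print(diagnose_learning_curve(train_scores, val_scores))
```

A small gap combined with a plateau suggests that gains are more likely to come from better features or a more flexible model than from additional rows.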


# **🎬 Final Trailer & Reflections**

## 📚 What We Learned
- Learned the importance of optimizing for **precision, recall, and F1-score** in imbalanced datasets like fraud detection.
- Understood how **AutoML tools** can simplify and accelerate the machine learning pipeline.
- Gained experience with **SMOTE** to balance data and improve the model’s ability to detect rare fraudulent cases.

## 🔧 How We Improved From Prototype
- Switched from accuracy to more meaningful metrics such as **PR-AUC** and **confusion matrix**.
- Enabled **threshold optimization** and **ensemble learning** to increase the model’s robustness.
- Used AutoML's **feature importance** to remove irrelevant variables and improve efficiency.

## 🚀 Future Enhancements
- Deploy the model in a real-time fraud alert system.
- Introduce **time-based behavior detection** for adaptive learning.
- Explore **AGI concepts** to develop a self-learning, continuously improving fraud detection engine.

## 👥 Final Remarks from Team Members

- **Tejaskumar Sanjaykumar Patel (200575242):** "AutoML showed me how model performance can be automated and fine-tuned with minimal effort. The insights into AGI broadened my vision for future intelligent systems."

- **Chintan Chauhan (200564227):** "I appreciated learning how ensemble models and threshold optimization significantly affect fraud detection accuracy, especially when combined with AutoML."

- **Priyank Bhaveshbhai Siddhapura (200544911):** "Working on this assignment taught me how to streamline model development with PyCaret and MLJAR, and I’m excited about the potential of AGI in real-world fraud systems."

- **John Hanok (200573253):** "This project helped me understand how to handle class imbalance in real-world data and how critical evaluation metrics are when detecting rare events."

> Our group values the opportunity to apply machine learning to a real-world issue like fraud and looks forward to building more intelligent, explainable, and impactful AI solutions.


# **Bonus**

In [None]:
import plotly.express as px

# Count fraud cases per location
fraud_by_location = data[data['IsFraud'] == 1].groupby('Location').size().reset_index(name='FraudCount')

# Plot fraud counts per location
fig = px.bar(fraud_by_location, x='Location', y='FraudCount', title='Fraud Count by City')
fig.show()


In [None]:
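# Explainability with SHAP for model transparency
import shap

# best_model was trained on PyCaret's transformed (normalized) features,
# so explain it on transformed rows (PyCaret 3.x config key) rather than raw data
X_sample = get_config('X_test_transformed').sample(20, random_state=42)

# TreeExplainer only supports tree-based models (e.g. Extra Trees);
# this assumes compare_models selected one
explainer = shap.TreeExplainer(best_model)

# Compute SHAP values (faster with fewer rows)
shap_values = explainer.shap_values(X_sample)

# Older SHAP versions return one array per class, newer ones a single
# (rows, features, classes) array; pick the fraud class (1) in either case
if isinstance(shap_values, list):
    fraud_shap = shap_values[1]
elif getattr(shap_values, 'ndim', 2) == 3:
    fraud_shap = shap_values[:, :, 1]
else:
    fraud_shap = shap_values

# Plot SHAP summary for class 1 (fraud)
shap.summary_plot(fraud_shap, X_sample)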



## SHAP Summary Plot: Feature Impact on Fraud Prediction

This SHAP plot explains how each feature influenced the model's prediction for fraud detection.
####  **Key takeaways:**

- **Amount** and **MerchantID** had the highest impact, which makes sense — unusually high or suspicious amounts are often flagged.
- **TransactionType_refund** (especially when high — shown in red) tends to **increase** the likelihood of being classified as fraud.
- Location-based features such as **Location_Dallas** and **Location_San Antonio** also have meaningful SHAP values, suggesting some cities may have higher fraud risk patterns in the dataset.
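
To clarify what the SHAP values in the plot represent, the sketch below computes exact Shapley values for a tiny hand-written scoring function by averaging each feature's marginal contribution over all feature orderings. The toy model, its coefficients, and the feature names are invented for illustration and are unrelated to our trained model. Libraries such as `shap` use much faster algorithms for the same quantity.

```python
from itertools import permutations

def toy_fraud_score(features):
    """Hypothetical scoring function with an interaction term."""
    high_amount, is_refund, is_dallas = features
    # is_dallas is deliberately unused, so its contribution should be zero
    return 0.1 + 0.3 * high_amount + 0.2 * is_refund \
        + 0.25 * high_amount * is_refund

def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution of each feature
    over every order in which features switch from baseline to x."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for j in order:
            current[j] = x[j]
            new_output = model(current)
            phi[j] += new_output - previous_output
            previous_output = new_output
    return [value / len(orderings) for value in phi]

x = (1, 1, 0)         # high-amount refund outside Dallas
baseline = (0, 0, 0)  # reference transaction
phi = shapley_values(toy_fraud_score, x, baseline)
print(phi, sum(phi), toy_fraud_score(x) - toy_fraud_score(baseline))
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that makes SHAP summary plots interpretable feature by feature.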