# SMOTE-Adv - Quick Start Guide

This notebook introduces the SMOTE-Adv framework for imbalanced classification: setup, dataset preparation, example commands, and expected results.

## Overview

The framework implements SMOTE-Adv (Adversarially-Filtered SMOTE, written as AF-SMOTE in comments and `afsmote` in command-line flags) and evaluates it on two datasets from different domains:

- **Credit Card Fraud Detection**: Extreme imbalance (0.17% fraud rate) - 284,807 transactions
- **NSL-KDD Network Intrusion Detection**: Near-balanced (48.12% attack rate) - 148,517 connections

## Supported Augmentation Methods

1. **None (Baseline)**: No augmentation, class weighting only
2. **SMOTE**: Standard synthetic minority oversampling technique
3. **SMOTE-Adv**: Adversarially-filtered SMOTE with a single discriminator
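
To make the difference between methods 2 and 3 concrete, here is a minimal pure-Python sketch. The first function is the standard SMOTE interpolation step; the second shows one plausible reading of "adversarial filtering", where a discriminator scores each synthetic point and low-scoring points are dropped. The function names, the threshold, and `toy_discriminator` are hypothetical stand-ins, not the framework's actual code in `src/`.

```python
import math
import random


def smote_samples(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + u * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def filter_with_discriminator(synthetic, discriminator, threshold=0.5):
    """Keep only synthetic points that the discriminator scores as
    sufficiently realistic (score >= threshold)."""
    return [x for x in synthetic if discriminator(x) >= threshold]


def toy_discriminator(x):
    # ASSUMPTION: stand-in for a trained model. It simply scores points
    # higher the closer they are to the dense cluster near the origin.
    return math.exp(-math.dist(x, (0.33, 0.33)))


# Three clustered minority points plus one outlier
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
candidates = smote_samples(minority, n_new=20, k=2, rng=random.Random(42))
kept = filter_with_discriminator(candidates, toy_discriminator, threshold=0.4)
print(f"generated {len(candidates)} candidates, kept {len(kept)} after filtering")
```

In this toy setup, candidates interpolated towards the outlier at `(5.0, 5.0)` tend to receive low scores and are discarded, which is the kind of low-quality sample the filtering step is meant to remove.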

## Key Features

- **Multi-dataset Support**: Unified interface for different dataset types
- **Quality-focused Augmentation**: Adversarial filtering removes low-quality synthetic samples
- **Comprehensive Evaluation**: ROC-AUC, PR-AUC, F1-macro, F1-weighted metrics
- **Reproducible Results**: Controlled random seeds and deterministic settings
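
The two F1 variants listed above can diverge sharply on imbalanced data. The sketch below computes both from scratch, using the usual definitions (macro: unweighted mean of per-class F1; weighted: mean weighted by class support). It is an illustration only, and the framework may well compute these with a library instead.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def f1_weighted(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# A classifier that never predicts the minority class (2% positives)
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100
print(f"F1-macro:    {f1_macro(y_true, y_pred):.3f}")     # 0.495
print(f"F1-weighted: {f1_weighted(y_true, y_pred):.3f}")  # 0.970
```

F1-weighted stays high even though the minority class is missed entirely, so F1-macro is the more informative of the two when the rare class is the one that matters.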


## Setup and Installation

First, ensure you have the required dependencies installed:

```bash
pip install -r requirements.txt
```

## Dataset Preparation

### Credit Card Fraud Dataset
Visit the [dataset page](https://www.kaggle.com/mlg-ulb/creditcardfraud), download `creditcard.csv`, and place it at `data/creditcard.csv`.

### NSL-KDD Dataset
Visit the [NSL-KDD dataset page](https://www.kaggle.com/datasets/hassan06/nslkdd), download the dataset files, and place them in the `data/NSL-KDD dataset/` directory. Ensure the directory structure matches: `data/NSL-KDD dataset/KDDTrain+.txt`, `data/NSL-KDD dataset/KDDTest+.txt`, etc.
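
Before running any experiment, it can help to confirm that the files are where the scripts expect them. The following sketch checks only the three paths named above; the helper name `missing_dataset_files` is illustrative, and the NSL-KDD download may contain further files that are not checked here.

```python
from pathlib import Path

# Paths taken from the instructions above, relative to the project root
EXPECTED_FILES = [
    "data/creditcard.csv",
    "data/NSL-KDD dataset/KDDTrain+.txt",
    "data/NSL-KDD dataset/KDDTest+.txt",
]


def missing_dataset_files(project_root="."):
    """Return the expected dataset files that are not present."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


missing = missing_dataset_files()
if missing:
    for rel in missing:
        print(f"missing: {rel}")
else:
    print("all expected dataset files found")
```

Pass the project root explicitly (for example `missing_dataset_files("..")`) if the notebook's working directory is not the project root yet.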


## Quick Start Examples

### 1. Single Experiment Training

Run individual experiments with different augmentation methods:

In [None]:
# Credit Card Fraud Detection - AF-SMOTE
import os
os.chdir('..')  # Change to project root directory (run this cell only once; each run moves up one level)

# Run AF-SMOTE experiment on Credit Card dataset
!python src/train.py --dataset creditcard --augment afsmote --rbf-components 300 --rbf-gamma 0.5 --test-size 0.2 --seed 42

In [None]:
# NSL-KDD Network Intrusion Detection - Baseline
!python src/train.py --dataset nsl_kdd --augment none --rbf-components 300 --rbf-gamma 0.5 --test-size 0.2 --seed 42
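
To repeat the single-experiment commands over several configurations, the argument lists can be generated programmatically. This sketch uses only the flags and values that appear in the two cells above; it assumes that `src/train.py` accepts every dataset/augmentation combination and seeds other than 42, which the cells above do not confirm.

```python
import shlex


def build_train_command(dataset, augment, seed=42, rbf_components=300,
                        rbf_gamma=0.5, test_size=0.2):
    """Build the argument list for one src/train.py run."""
    return [
        "python", "src/train.py",
        "--dataset", dataset,
        "--augment", augment,
        "--rbf-components", str(rbf_components),
        "--rbf-gamma", str(rbf_gamma),
        "--test-size", str(test_size),
        "--seed", str(seed),
    ]


# ASSUMPTION: the seeds 43 and 44 are illustrative choices
commands = [
    build_train_command(dataset, augment, seed=seed)
    for dataset in ("creditcard", "nsl_kdd")
    for augment in ("none", "afsmote")
    for seed in (42, 43, 44)
]

for cmd in commands:
    # Print only; to execute, pass `cmd` to subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Averaging metrics over several seeds gives a steadier comparison between augmentation methods than a single run.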


### 2. Complete Experiment Suites

Run comprehensive experiments for each dataset:


In [None]:
# Run all Credit Card experiments
!python scripts/run_creditcard_experiments.py


In [None]:
# Run all NSL-KDD experiments
!python scripts/run_nsl_kdd_experiments.py


### 3. Generate Augmented Datasets

Create augmented datasets for further analysis:


In [None]:
# Generate AF-SMOTE augmented Credit Card dataset
!python scripts/generate_augmented_datasets.py --method afsmote --dataset creditcard --input data/creditcard.csv --output-dir data/augmented --seed 42


In [None]:
# Generate AF-SMOTE augmented NSL-KDD dataset
!python scripts/generate_augmented_datasets.py --method afsmote --dataset nsl_kdd --output-dir data/augmented --seed 42
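
A quick sanity check on a generated file is to count rows per class. The sketch below does this with the standard library. The label column name `Class` matches the original `creditcard.csv`, but whether the augmented files keep that name is an assumption, so it is a parameter. The inline sample stands in for a real file.

```python
import csv
from collections import Counter


def class_counts(lines, label_column="Class"):
    """Count rows per label value in CSV text (any iterable of lines,
    including an open file object)."""
    reader = csv.DictReader(lines)
    return dict(Counter(row[label_column] for row in reader))


def minority_share(counts):
    """Fraction of rows that belong to the smallest class."""
    total = sum(counts.values())
    return min(counts.values()) / total if total else 0.0


# Tiny inline sample in place of a real augmented file
sample = ["Amount,Class", "10.0,0", "12.5,0", "99.0,1", "7.2,0"]
counts = class_counts(sample)
print(counts, f"minority share: {minority_share(counts):.2%}")

# For a real file (path is a placeholder):
# with open(path_to_augmented_csv, newline="") as f:
#     counts = class_counts(f)
```

Comparing the minority share before and after augmentation shows how many synthetic samples were actually added.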


### 4. Cross-Dataset Analysis

Compare performance across both datasets:


In [None]:
# Run comprehensive accuracy analysis
!python scripts/analyze_accuracy_results.py


## Expected Results

### Credit Card Fraud Detection
- **Dataset**: 284,807 transactions, 0.17% fraud rate
- **Best Method**: Baseline (no augmentation) typically performs best
- **Typical Performance**: Accuracy ~53.6%, PR-AUC ~53.7%

### NSL-KDD Network Intrusion Detection
- **Dataset**: 148,517 connections, 48.12% attack rate
- **Best Method**: Baseline (no augmentation) typically performs best
- **Typical Performance**: Accuracy ~93.1%, PR-AUC ~96.9%
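
As reference points for the figures above, it helps to know what a classifier with no skill would score at each class balance. The sketch below uses two standard facts: always predicting the larger class yields an accuracy equal to that class's share, and a random ranking has a PR-AUC roughly equal to the positive-class prevalence. The rates are the ones quoted in the dataset descriptions.

```python
def chance_baselines(positive_rate):
    """Reference scores for a classifier with no skill."""
    return {
        # Accuracy when always predicting the larger class
        "majority_accuracy": max(positive_rate, 1.0 - positive_rate),
        # PR-AUC of a random ranking is about the positive-class prevalence
        "random_pr_auc": positive_rate,
        # ROC-AUC of a random ranking does not depend on class balance
        "random_roc_auc": 0.5,
    }


# Positive-class rates from the dataset descriptions
datasets = {"creditcard": 0.0017, "nsl_kdd": 0.4812}

for name, rate in datasets.items():
    b = chance_baselines(rate)
    print(f"{name}: majority accuracy {b['majority_accuracy']:.2%}, "
          f"random PR-AUC {b['random_pr_auc']:.2%}")
```

Against these baselines, a PR-AUC of ~53.7% at a 0.17% fraud rate is far above chance, even though it looks low next to the NSL-KDD figure.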

## Key Insights

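1. **Class Balance Impact**: The near-balanced NSL-KDD dataset yields much higher scores (PR-AUC ~96.9%) than the extremely imbalanced Credit Card dataset (PR-AUC ~53.7%)
2. **Augmentation Effectiveness**: Data augmentation shows minimal benefit on either dataset; the baseline typically performs best
3. **Quality vs Quantity**: Training on the original samples alone tends to outperform training with added synthetic samples
4. **Domain-Specific Performance**: In these experiments, intrusion detection on NSL-KDD is a markedly easier task than fraud detection on the Credit Card dataset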

## Output Files

- **Experiment Results**: `outputs/*/` directories with metrics and plots
- **Augmented Datasets**: `data/augmented/*_augmentation_*.csv`
- **Comparison Reports**: `outputs/*_comparison_report/`
- **Analysis Results**: `outputs/accuracy_analysis/`
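
To see which of these outputs exist after a run, the patterns above can be turned into glob lookups. This is a small sketch that assumes the notebook's working directory is the project root; the helper name `collect_outputs` and the dictionary keys are illustrative.

```python
from pathlib import Path


def collect_outputs(project_root="."):
    """Group existing output paths by the categories listed above."""
    root = Path(project_root)
    outputs = root / "outputs"
    return {
        "experiment_dirs": sorted(p for p in outputs.glob("*") if p.is_dir()),
        "augmented_csvs": sorted(
            (root / "data" / "augmented").glob("*_augmentation_*.csv")
        ),
        "comparison_reports": sorted(outputs.glob("*_comparison_report")),
        "accuracy_analysis": sorted((outputs / "accuracy_analysis").glob("*")),
    }


for kind, paths in collect_outputs().items():
    print(f"{kind}: {len(paths)} found")
```

Note that `experiment_dirs` follows the `outputs/*/` pattern literally, so it also includes the comparison report and analysis directories.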
