[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/USER/REPO/blob/main/kdd_project/notebooks/KDD_Credit_Card_Fraud.ipynb)

# KDD: Credit Card Fraud Detection

> Selection → Preprocessing → Transformation → Data Mining → Interpretation/Evaluation

**Dataset**: Kaggle — Credit Card Fraud Detection (highly imbalanced).

In [None]:
!pip -q install pandas numpy scikit-learn matplotlib seaborn xgboost lightgbm imbalanced-learn joblib gradio fastapi uvicorn kaggle

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn.model_selection import train_test_split, cross_val_score
# Classification metrics only; regression metrics (MAE, R^2) do not apply to fraud labels
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix
import joblib

# Working directories for raw data, saved models, and reports
for d in ('data', 'models', 'reports'):
    Path(d).mkdir(exist_ok=True)

In [None]:
# --- Kaggle download (requires a Kaggle API token) ---
# In Colab, upload kaggle.json first:
# from google.colab import files
# files.upload()  # upload kaggle.json
# !mkdir -p ~/.kaggle
# !cp kaggle.json ~/.kaggle/
# !chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d mlg-ulb/creditcardfraud -p data
!ls -lh data

# If a zip was downloaded, extract it into data/
import zipfile
from pathlib import Path

for z in Path('data').glob('*.zip'):
    with zipfile.ZipFile(z) as f:
        f.extractall('data')
    print('unzipped', z)

## 1. Selection
- Define the subset of fields and time windows
- Document inclusion/exclusion and data lineage
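
The two bullets above could be sketched roughly as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual selection logic: `select_subset` and the `lineage` dictionary are hypothetical names, and the 86,400-second cutoff assumes `Time` holds seconds elapsed since the first transaction, as in the Kaggle file.

```python
import pandas as pd

def select_subset(df, columns, t_min=None, t_max=None, time_col="Time"):
    """Return the selected rows/columns plus a small lineage record (hypothetical helper)."""
    mask = pd.Series(True, index=df.index)
    if t_min is not None:
        mask &= df[time_col] >= t_min
    if t_max is not None:
        mask &= df[time_col] < t_max
    subset = df.loc[mask, columns].copy()
    # Record what was kept and dropped so the selection can be audited later.
    lineage = {
        "source_rows": len(df),
        "kept_rows": len(subset),
        "dropped_rows": len(df) - len(subset),
        "columns": list(columns),
        "time_window": (t_min, t_max),
    }
    return subset, lineage

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Time": [0, 3600, 90000, 170000],
    "V1": [0.1, -1.2, 0.5, 2.0],
    "Amount": [10.0, 250.0, 3.5, 99.9],
    "Class": [0, 0, 1, 0],
})

# Keep only the first day and three of the four columns.
first_day, lineage = select_subset(toy, ["Time", "Amount", "Class"], t_min=0, t_max=86400)
print(lineage)
```

Writing the lineage record to `reports/` alongside the model would keep the inclusion and exclusion decisions next to the results they produced.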

In [None]:
import pandas as pd
from pathlib import Path

csv = next((p for p in Path('data').glob('*.csv') if 'creditcard' in p.name.lower()), None)
df = pd.read_csv(csv) if csv else None
print('Rows/cols:', None if df is None else df.shape)
df.head() if df is not None else None

## 2. Preprocessing
- Handle missing values (none are expected in this dataset, but verify rather than assume)
- Sanity checks & label balance

In [None]:
if df is not None:
    print(df.isna().sum().sum(), 'missing total')
    print('Class balance:')
    print(df['Class'].value_counts(normalize=True))
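
Beyond missing values and class balance, a few more sanity checks are cheap to run. The sketch below is one possible shape for them; `sanity_report` is a hypothetical helper, the toy frame is made up, and the choice of checks (duplicate rows, binary labels) is an assumption about what matters here.

```python
import pandas as pd

def sanity_report(df, label_col="Class"):
    """Collect basic data-quality facts in one dictionary (hypothetical helper)."""
    labels = set(df[label_col].unique().tolist())
    return {
        "n_rows": len(df),
        "n_missing": int(df.isna().sum().sum()),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "labels_are_binary": labels <= {0, 1},
        "fraud_rate": float(df[label_col].mean()),
    }

# Toy frame with one missing value and one exact duplicate row.
toy = pd.DataFrame({
    "Amount": [10.0, 10.0, 5.0, None],
    "Class": [0, 0, 1, 0],
})
report = sanity_report(toy)
print(report)
```

On the real data the same call would be `sanity_report(df)`; whether duplicate transactions should be dropped is a modelling decision that the report only surfaces.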

## 3. Transformation
- Scaling (`Time` and `Amount` are raw, unscaled columns), PCA (already applied to `V1`–`V28` in the published dataset), resampling strategies (SMOTE, undersampling)
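
Scaling is not performed in the cell below, so here is a minimal sketch of how it might look, assuming only `Time` and `Amount` need it. The toy frame is synthetic, and the use of `ColumnTransformer` with `StandardScaler` is one reasonable choice among several (a robust scaler or a log transform of `Amount` would also be defensible). The scaler is fitted on the training split only so that test-set statistics do not leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real frame (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Time": rng.uniform(0, 172800, size=200),
    "V1": rng.normal(size=200),
    "Amount": rng.exponential(80.0, size=200),
    "Class": (np.arange(200) % 10 == 0).astype(int),  # 10% positives
})
X, y = toy.drop(columns=["Class"]), toy["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scale only the raw columns; pass the PCA component through untouched.
# Output column order is: Time, Amount, then the passthrough columns.
scaler = ColumnTransformer(
    [("scale", StandardScaler(), ["Time", "Amount"])],
    remainder="passthrough",
)
X_tr_s = scaler.fit_transform(X_tr)  # fit on the training split only
X_te_s = scaler.transform(X_te)      # reuse the training statistics
print(X_tr_s.shape, X_te_s.shape)
```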

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report

if df is not None:
    X = df.drop(columns=['Class'])
    y = df['Class']
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    # Oversample the training split only; the test split keeps its real class balance
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X_tr, y_tr)
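
The cell above covers SMOTE; the undersampling strategy mentioned in the bullet is not shown, so here is a minimal NumPy sketch of it. `random_undersample` is a hypothetical helper written for illustration (imbalanced-learn also ships a `RandomUnderSampler` class that could be used instead), and the toy arrays are made up. As with SMOTE, it should be applied to the training split only.

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=42):
    """Keep every minority (label 1) row and a random sample of majority rows.

    ratio is the number of majority rows kept per minority row.
    """
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n_keep = min(len(majority_idx), int(round(ratio * len(minority_idx))))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    # Shuffle so the classes are not grouped together in the output.
    idx = rng.permutation(np.concatenate([minority_idx, kept_majority]))
    return X[idx], y[idx]

# Toy data: 2 positives among 20 rows.
X_toy = np.arange(40, dtype=float).reshape(20, 2)
y_toy = np.array([1, 1] + [0] * 18)
X_bal, y_bal = random_undersample(X_toy, y_toy, ratio=1.0)
print(X_bal.shape, y_bal.sum())
```

Undersampling discards most legitimate transactions, so it trades information for speed and balance; comparing it against SMOTE on the same validation folds is the safer way to choose.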


## 4. Data Mining
- Train several classifiers; tune with CV
- Compare ROC-AUC and precision/recall on the fraud class

In [None]:
if df is not None:
    # Features are unscaled here, so allow more iterations for the solver to converge
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_res, y_res)
    proba = clf.predict_proba(X_te)[:,1]
    print('ROC-AUC:', roc_auc_score(y_te, proba))
    print(classification_report(y_te, clf.predict(X_te)))
    joblib.dump(clf, 'models/model.joblib')
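
The cell above trains a single model. A sketch of the "several classifiers, tuned with CV" step might look like the following. It runs on synthetic data from `make_classification` rather than the real file, the two candidate models and their settings are illustrative assumptions, and `class_weight="balanced"` is swapped in for SMOTE to keep the example free of extra dependencies. Average precision (PR-AUC) is reported next to ROC-AUC because it is more sensitive to performance on a rare positive class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the fraud set (about 5% positives).
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Illustrative candidates; the real comparison could add XGBoost or LightGBM.
candidates = {
    "logreg": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=42
    ),
}

# Stratified folds keep the fraud rate roughly equal in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
results = {}
for name, model in candidates.items():
    scores = cross_validate(
        model, X_toy, y_toy, cv=cv, scoring=["roc_auc", "average_precision"]
    )
    results[name] = {
        "roc_auc": scores["test_roc_auc"].mean(),
        "pr_auc": scores["test_average_precision"].mean(),
    }
    print(name, results[name])
```

If SMOTE is used instead of class weights, it should sit inside an imbalanced-learn pipeline so that resampling happens within each training fold rather than before the split.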

## 5. Interpretation & Evaluation
- Threshold analysis; cost matrix
- Monitoring & drift plan; actionable insights
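
Both bullets can be made concrete with a few lines of NumPy. The sketch below is illustrative only: the cost values (a missed fraud costing 50 times a false alarm) are made-up assumptions, `total_cost`, `best_threshold`, and `psi` are hypothetical helpers, and the Population Stability Index is just one common way to flag drift in a feature or score distribution.

```python
import numpy as np

def total_cost(y_true, proba, threshold, cost_fp=1.0, cost_fn=50.0):
    """Total cost of false positives and false negatives at one threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(proba) >= threshold
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    return cost_fp * fp + cost_fn * fn

def best_threshold(y_true, proba, thresholds=None, **costs):
    """Scan a grid of thresholds and return the cheapest one with its cost."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    totals = [total_cost(y_true, proba, t, **costs) for t in thresholds]
    best = int(np.argmin(totals))  # first minimum wins on ties
    return float(thresholds[best]), float(totals[best])

def psi(reference, current, n_bins=10):
    """Population Stability Index between a reference and a current sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Bin edges come from reference quantiles; outer bins are open-ended.
    interior = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(interior, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(interior, current), minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty.
    ref_frac = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_frac = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Threshold analysis on made-up labels and scores.
y_toy = np.array([0, 0, 0, 0, 1, 1])
p_toy = np.array([0.12, 0.22, 0.42, 0.72, 0.37, 0.92])
thr, cost = best_threshold(y_toy, p_toy, cost_fp=1.0, cost_fn=50.0)
print("threshold:", thr, "cost:", cost)

# Drift check on synthetic samples: same distribution vs. a shifted one.
rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=5000)
psi_same = psi(ref, ref)
psi_shifted = psi(ref, rng.normal(1.0, 1.0, size=5000))
print("PSI same:", psi_same, "PSI shifted:", psi_shifted)
```

In practice the threshold should be chosen on a validation split, not the test split, and the PSI cutoff that triggers retraining is a judgment call that depends on how the scores are used.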