# Fraud Detection: Hybrid Approach
**Tugas Besar 2 IF3070 â€“ Dasar Inteligensi Artifisial**

**Team:** AbyuDAIya-Ganbatte

## Overview
This notebook demonstrates a hybrid implementation:
1.  **Data Preprocessing:** Uses industry-standard **Scikit-Learn** libraries for robust, efficient, and clean data transformation.
2.  **Model:** Uses a **custom-built Logistic Regression** (implemented from scratch) featuring advanced optimization techniques like **Adam Optimizer** and **Focal Loss**.

### Libraries Used
* `pandas` & `numpy`: Data manipulation.
* `sklearn.preprocessing`: StandardScaler, OneHotEncoder, RobustScaler.
* `sklearn.impute`: KNNImputer, SimpleImputer.
* `sklearn.compose`: ColumnTransformer.
* `sklearn.model_selection`: train_test_split, StratifiedKFold.
* `sklearn.metrics`: roc_auc_score, f1_score (for verification).

In [1]:
import numpy as np
import pandas as pd
import json
import matplotlib.pyplot as plt

# Scikit-Learn Imports for Preprocessing
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler, FunctionTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, f1_score, classification_report

np.random.seed(42)

## 1. The Manual Logistic Regression Model
This section contains the custom implementation of Logistic Regression. **No libraries** are used for the optimization logic here. It includes:
* **Adam Optimizer** (Adaptive Moment Estimation)
* **Focal Loss** (for class imbalance)
* **Elastic Net Regularization**

In [2]:
class ManualLogisticRegression:
    """
    Logistic Regression implemented from scratch with Adam Optimizer and Focal Loss.
    """
    def __init__(self, learning_rate=0.01, n_iterations=1000, optimizer="adam", 
                 batch_size=None, regularization=0.0, l1_ratio=0.0, class_weight=None,
                 lr_schedule="constant", lr_decay=0.1, lr_decay_steps=100,
                 beta1=0.9, beta2=0.999, epsilon=1e-8,
                 use_focal_loss=False, focal_gamma=2.0,
                 early_stopping=True, patience=10, tol=1e-5, verbose=True):
        self.learning_rate = learning_rate
        self.initial_lr = learning_rate
        self.n_iterations = n_iterations
        self.optimizer = optimizer.lower()
        self.regularization = regularization
        self.l1_ratio = l1_ratio
        self.class_weight = class_weight
        self.lr_schedule = lr_schedule
        self.lr_decay = lr_decay
        self.lr_decay_steps = lr_decay_steps
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.use_focal_loss = use_focal_loss
        self.focal_gamma = focal_gamma
        self.early_stopping = early_stopping
        self.patience = patience
        self.tol = tol
        self.verbose = verbose
        
        if batch_size is None:
            self.batch_size = 32 if self.optimizer == "mini-batch" else None
        else:
            self.batch_size = batch_size
        
        # Model parameters
        self.weights = None
        self.bias = None
        
        # Adam parameters
        self.m_w = None; self.v_w = None
        self.m_b = None; self.v_b = None
        self.t = 0
        
        self.loss_history = []
    
    def _get_learning_rate(self, iteration):
        if self.lr_schedule == "constant": return self.initial_lr
        elif self.lr_schedule == "step":
            return self.initial_lr * (self.lr_decay ** (iteration // self.lr_decay_steps))
        return self.initial_lr
    
    def sigmoid(self, z):
        z = np.clip(z, -500, 500)
        positive_mask = z >= 0
        negative_mask = ~positive_mask
        result = np.zeros_like(z, dtype=float)
        result[positive_mask] = 1 / (1 + np.exp(-z[positive_mask]))
        exp_z = np.exp(z[negative_mask])
        result[negative_mask] = exp_z / (1 + exp_z)
        return result
    
    def compute_loss(self, y_true, y_pred, sample_weights=None):
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        
        if self.use_focal_loss:
            p_t = np.where(y_true == 1, y_pred, 1 - y_pred)
            focal_weight = (1 - p_t) ** self.focal_gamma
            loss_per_sample = -focal_weight * np.log(p_t)
        else:
            loss_per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        
        if sample_weights is not None:
            loss = np.sum(sample_weights * loss_per_sample) / np.sum(sample_weights)
        else:
            loss = np.mean(loss_per_sample)
            
        if self.regularization > 0 and self.weights is not None:
            l1_term = self.l1_ratio * np.sum(np.abs(self.weights))
            l2_term = (1 - self.l1_ratio) * 0.5 * np.sum(self.weights ** 2)
            loss += self.regularization * (l1_term + l2_term)
        return loss
    
    def _compute_gradients(self, X, y, y_pred, sample_weights=None):
        n_samples = len(y)
        if self.use_focal_loss:
            epsilon = 1e-15
            y_pred_clipped = np.clip(y_pred, epsilon, 1 - epsilon)
            p_t = np.where(y == 1, y_pred_clipped, 1 - y_pred_clipped)
            focal_weight = (1 - p_t) ** self.focal_gamma
            grad_p = np.where(
                y == 1,
                -focal_weight * (self.focal_gamma * (1 - y_pred_clipped) * np.log(y_pred_clipped + epsilon) + 1) / (y_pred_clipped + epsilon),
                focal_weight * (self.focal_gamma * y_pred_clipped * np.log(1 - y_pred_clipped + epsilon) + 1) / (1 - y_pred_clipped + epsilon)
            )
            error = grad_p * y_pred_clipped * (1 - y_pred_clipped)
        else:
            error = y_pred - y
        
        if sample_weights is not None:
            weighted_error = sample_weights * error
            dw = np.dot(X.T, weighted_error) / np.sum(sample_weights)
            db = np.sum(weighted_error) / np.sum(sample_weights)
        else:
            dw = (1 / n_samples) * np.dot(X.T, error)
            db = (1 / n_samples) * np.sum(error)
            
        if self.regularization > 0:
            l2_grad = (1 - self.l1_ratio) * self.weights
            l1_grad = self.l1_ratio * np.sign(self.weights)
            dw += self.regularization * (l1_grad + l2_grad)
        return dw, db
    
    def fit(self, X, y):
        # Handle inputs
        if hasattr(X, "toarray"): X = X.toarray()  # Handle sparse matrices from sklearn
        if isinstance(X, pd.DataFrame): X = X.values
        if isinstance(y, (pd.Series, pd.DataFrame)): y = y.values.flatten()
        
        n_samples, n_features = X.shape
        np.random.seed(42)
        self.weights = np.random.randn(n_features) * np.sqrt(2.0 / n_features)
        self.bias = 0.0
        
        # Initialize Adam
        self.m_w = np.zeros(n_features); self.v_w = np.zeros(n_features)
        self.m_b = 0.0; self.v_b = 0.0
        self.t = 0
        
        # Class weights
        if self.class_weight == "balanced":
            class_counts = np.bincount(y.astype(int))
            cw = n_samples / (2 * class_counts)
            sample_weights = np.where(y == 1, cw[1], cw[0])
        else:
            sample_weights = None
            
        # Training Loop (Adam Only)
        best_loss = float('inf'); patience_counter = 0; best_weights = self.weights.copy()
        batch_size = min(self.batch_size if self.batch_size else 32, n_samples)
        
        for iteration in range(self.n_iterations):
            lr = self._get_learning_rate(iteration)
            indices = np.random.permutation(n_samples)
            X_s, y_s = X[indices], y[indices]
            sw_s = sample_weights[indices] if sample_weights is not None else None
            
            for start_idx in range(0, n_samples, batch_size):
                end_idx = min(start_idx + batch_size, n_samples)
                X_batch = X_s[start_idx:end_idx]
                y_batch = y_s[start_idx:end_idx]
                sw_batch = sw_s[start_idx:end_idx] if sw_s is not None else None
                
                # Forward & Backward
                z = np.dot(X_batch, self.weights) + self.bias
                y_pred = self.sigmoid(z)
                dw, db = self._compute_gradients(X_batch, y_batch, y_pred, sw_batch)
                
                # Adam Update
                self.t += 1
                self.m_w = self.beta1 * self.m_w + (1 - self.beta1) * dw
                self.m_b = self.beta1 * self.m_b + (1 - self.beta1) * db
                self.v_w = self.beta2 * self.v_w + (1 - self.beta2) * (dw ** 2)
                self.v_b = self.beta2 * self.v_b + (1 - self.beta2) * (db ** 2)
                
                m_w_corr = self.m_w / (1 - self.beta1 ** self.t)
                m_b_corr = self.m_b / (1 - self.beta1 ** self.t)
                v_w_corr = self.v_w / (1 - self.beta2 ** self.t)
                v_b_corr = self.v_b / (1 - self.beta2 ** self.t)
                
                self.weights -= lr * m_w_corr / (np.sqrt(v_w_corr) + self.epsilon)
                self.bias -= lr * m_b_corr / (np.sqrt(v_b_corr) + self.epsilon)
            
            # Evaluation for Early Stopping
            z_full = np.dot(X, self.weights) + self.bias
            loss = self.compute_loss(y, self.sigmoid(z_full), sample_weights)
            self.loss_history.append(loss)
            
            if self.early_stopping:
                if loss < best_loss - self.tol:
                    best_loss = loss; best_weights = self.weights.copy(); patience_counter = 0
                else:
                    patience_counter += 1
                    if patience_counter >= self.patience:
                        if self.verbose: print(f"Early stopping at {iteration}, loss: {loss:.6f}")
                        self.weights = best_weights
                        break
                        
            if self.verbose and iteration % 500 == 0:
                print(f"Iteration {iteration}: loss = {loss:.6f}")
        return self

    def predict_proba(self, X):
        if hasattr(X, "toarray"): X = X.toarray()
        if isinstance(X, pd.DataFrame): X = X.values
        z = np.dot(X, self.weights) + self.bias
        return self.sigmoid(z)
    
    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

## 2. Feature Creation (Pandas)
We still use Pandas for creating *new* features (Feature Engineering) because Scikit-Learn is designed for processing *existing* features. 

In [3]:
def create_features(df):
    """Creates domain-specific ratios and interactions."""
    df = df.copy()
    
    # Ratios (Numerical/Numerical)
    if 'transaction_amount' in df.columns and 'avg_transaction_amount' in df.columns:
        df['amount_vs_avg'] = df['transaction_amount'] / (df['avg_transaction_amount'] + 1e-5)
    
    if 'transactions_last_1h' in df.columns and 'transactions_last_24h' in df.columns:
        df['hourly_conc'] = df['transactions_last_1h'] / (df['transactions_last_24h'] + 1)
        
    # Log Transformations for skewed features
    for col in ['transaction_amount', 'distance_from_home']:
        if col in df.columns and df[col].min() >= 0:
            df[f'{col}_log'] = np.log1p(df[col])
            
    # Interaction Features
    if 'ip_risk_score' in df.columns and 'device_trust_score' in df.columns:
        df['risk_interaction'] = df['ip_risk_score'] * (1 - df['device_trust_score'] / 100)

    return df

## 3. Data Processing Pipeline (Scikit-Learn)
This replaces the manual `one_hot_encode`, `StandardScaler`, and `KNNImputer` functions with a robust `ColumnTransformer`.

**Strategy:**
* **Numerical Cols:** Impute (KNN) -> Scale (RobustScaler to handle outliers).
* **Categorical Cols:** Impute (Mode) -> One-Hot Encode.

In [4]:
def get_preprocessor(numerical_cols, categorical_cols):
    """
    Builds a Scikit-Learn ColumnTransformer.
    """
    # Pipeline for Numerical Features
    # 1. KNN Imputer fills missing values based on neighbors
    # 2. RobustScaler scales data but is robust to outliers (uses IQR instead of mean/std)
    numeric_transformer = Pipeline(steps=[
        ('imputer', KNNImputer(n_neighbors=5)),
        ('scaler', RobustScaler())  # Better than StandardScaler for fraud data (outliers)
    ])

    # Pipeline for Categorical Features
    # 1. SimpleImputer fills missing with 'most_frequent' (Mode)
    # 2. OneHotEncoder converts categories to binary columns
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])

    # Combine both
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ],
        verbose_feature_names_out=False
    )
    
    return preprocessor

## 4. Main Execution
Orchestrates loading, splitting (sklearn), preprocessing (sklearn), and training (manual model).

In [5]:
# 1. Load Data
try:
    train_df = pd.read_csv("train.csv")
    test_df = pd.read_csv("test.csv")
    
    # Store IDs for submission
    test_ids = test_df['ID']
    
    # Separate Target
    X = train_df.drop(columns=['is_fraud', 'ID', 'transaction_id', 'user_id'], errors='ignore')
    y = train_df['is_fraud']
    X_test_raw = test_df.drop(columns=['ID', 'transaction_id', 'user_id'], errors='ignore')

    # 2. Feature Engineering (Creation Only)
    print("Creating Features...")
    X = create_features(X)
    X_test_raw = create_features(X_test_raw)

    # Identify Columns
    num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
    cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
    print(f"Numerical: {len(num_cols)}, Categorical: {len(cat_cols)}")

    # 3. Split Data (Using Sklearn)
    # Stratify ensures the fraud ratio is maintained in train/val
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # 4. Fit & Transform Preprocessing Pipeline
    print("Running Sklearn Pipeline...")
    preprocessor = get_preprocessor(num_cols, cat_cols)
    
    # Fit on TRAIN, Transform on TRAIN, VAL, and TEST
    X_train_processed = preprocessor.fit_transform(X_train)
    X_val_processed = preprocessor.transform(X_val)
    X_test_processed = preprocessor.transform(X_test_raw)
    
    print(f"Processed Train Shape: {X_train_processed.shape}")

    # 5. Train Manual Model
    print("Training Manual Logistic Regression...")
    model = ManualLogisticRegression(
        learning_rate=0.0015, 
        n_iterations=3000, 
        optimizer="adam", 
        batch_size=256, 
        regularization=0.0006, 
        l1_ratio=0.5,
        class_weight="balanced",
        use_focal_loss=True,  # Using Focal Loss for imbalance
        focal_gamma=2.0
    )
    
    model.fit(X_train_processed, y_train)

    # 6. Evaluate
    y_prob_val = model.predict_proba(X_val_processed)
    y_pred_val = model.predict(X_val_processed, threshold=0.5)
    
    auc = roc_auc_score(y_val, y_prob_val)
    print(f"\nValidation ROC AUC: {auc:.4f}")
    print("Classification Report:")
    print(classification_report(y_val, y_pred_val))

    # 7. Generate Submission
    final_probs = model.predict_proba(X_test_processed)
    submission = pd.DataFrame({'ID': test_ids, 'is_fraud': final_probs})
    submission.to_csv("submission.csv", index=False)
    print("Submission saved to submission.csv")

except FileNotFoundError:
    print("Error: train.csv or test.csv not found.")

Creating Features...
Numerical: 26, Categorical: 6
Running Sklearn Pipeline...
Processed Train Shape: (80000, 59)
Training Manual Logistic Regression...
Iteration 0: loss = 3.859026
Early stopping at 10, loss: 17.436073

Validation ROC AUC: 0.4525
Classification Report:
              precision    recall  f1-score   support

           0       0.86      1.00      0.92     17174
           1       0.21      0.00      0.00      2826

    accuracy                           0.86     20000
   macro avg       0.54      0.50      0.46     20000
weighted avg       0.77      0.86      0.79     20000

Submission saved to submission.csv
