# Assignment 8 – Applied Component: Credit Card Default Prediction using Support Vector Machines (SVM)  
**Week 8 Topic:** Advanced Classification with Linear and Non-Linear Kernels  

---

### Objective  
In this assignment, you will apply **Support Vector Machines (SVM)** to predict whether a credit-card customer will default on their payment.  
You will build both **linear and RBF kernel SVM models**, evaluate their performance, and interpret the results in a financial-risk context.  

---

### Theoretical Background  

#### 1 Support Vector Machines  
SVM is a supervised learning algorithm that finds the hyperplane separating the classes with the largest possible margin.  
The margin is the distance from the hyperplane to the nearest training points of each class; those nearest points are called support vectors.  

**Mathematical Formulation:**  
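For a binary classification task with labels \( y_i \in \{-1,+1\} \), the hard-margin SVM (which assumes linearly separable data) solves:  

$$
\min_{w,b}\frac{1}{2}\|w\|^2 \quad
\text{s.t. } y_i(w^Tx_i+b)\ge1 \quad \text{for all } i
$$  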

The decision function is:  
$$
f(x)=\text{sign}(w^Tx+b)
$$  
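
The formulation above can be checked numerically. The following is a minimal sketch, assuming synthetic, well-separated toy data (not the assignment dataset) and a very large `C` to approximate the hard-margin problem; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy, well-separated two-class data (illustrative only)
X_toy, y_toy = make_blobs(n_samples=60, centers=[[-3, -3], [3, 3]],
                          cluster_std=0.8, random_state=0)
y_signed = np.where(y_toy == 1, 1, -1)   # labels in {-1, +1}, as in the formulation

# A very large C approximates the hard-margin problem
hard_svm = SVC(kernel="linear", C=1e6).fit(X_toy, y_signed)
w, b = hard_svm.coef_[0], hard_svm.intercept_[0]

# f(x) = sign(w^T x + b) reproduces the model's predictions
manual_pred = np.sign(X_toy @ w + b)
print("Predictions match:", np.array_equal(manual_pred, hard_svm.predict(X_toy)))

# Every point satisfies y_i (w^T x_i + b) >= 1 up to solver tolerance;
# support vectors lie on the margin, where the value is approximately 1
margins = y_signed * (X_toy @ w + b)
print("Smallest constraint value:", margins.min())
print("Margin width 2/||w||:", 2 / np.linalg.norm(w))
```

Minimizing \( \|w\|^2 \) is equivalent to maximizing the margin width \( 2/\|w\| \), which is why the last line reports that quantity.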

---

#### 2 Kernel Trick  
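Real-world data is often not linearly separable.  
SVM handles this with **kernel functions**, which compute inner products in a higher-dimensional feature space without constructing the mapping explicitly. A linear separator in that space corresponds to a non-linear boundary in the original features.  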

Common kernels:  
- Linear Kernel: \( K(x_i,x_j)=x_i^Tx_j \)  
- Polynomial Kernel: \( K(x_i,x_j)=(x_i^Tx_j+c)^d \)  
- RBF Kernel: \( K(x_i,x_j)=\exp(-\gamma \|x_i-x_j\|^2) \)  

**Linear SVM** works well when the classes are close to linearly separable, which is common with high-dimensional features such as sparse text vectors.  
**RBF SVM** captures complex, non-linear patterns in the data.  
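
As a quick check of the kernel formulas above, the sketch below computes each kernel with NumPy and compares it with scikit-learn's pairwise helpers. The data and the values of `gamma`, `c`, and `d` are arbitrary illustrations, not tuned settings.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))    # four illustrative points with three features

gamma, c, d = 0.5, 1.0, 2      # illustrative values only

# Linear kernel: K(x_i, x_j) = x_i^T x_j
K_lin = A @ A.T

# Polynomial kernel: K(x_i, x_j) = (x_i^T x_j + c)^d
K_poly = (A @ A.T + c) ** d

# RBF kernel: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
K_rbf = np.exp(-gamma * sq_dists)

# scikit-learn's polynomial_kernel scales the inner product by its own gamma,
# so gamma=1.0 is passed to match the formula in the text
print(np.allclose(K_lin, linear_kernel(A)))
print(np.allclose(K_poly, polynomial_kernel(A, degree=d, gamma=1.0, coef0=c)))
print(np.allclose(K_rbf, rbf_kernel(A, gamma=gamma)))
```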

---

#### 3 Regularization and Hyperparameters  
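- **C (regularization parameter):** controls the trade-off between margin width and classification errors in the soft-margin SVM, where margin violations are allowed but penalized in proportion to C.  
  - Low C → wider margin, more tolerance for errors (higher bias, lower variance).  
  - High C → narrower margin, less tolerance for errors (lower bias, higher variance).  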
- **Gamma (for RBF):** defines how far the influence of a single training example reaches.  
  - Low gamma → smooth decision boundary.  
  - High gamma → tight fit to training data (overfitting risk).  
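
The effect of gamma can be seen on a small synthetic problem. This is a rough sketch, assuming a two-moons toy dataset and three arbitrary gamma values; exact scores depend on the data and the random seed, so none are quoted here.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-moons data (illustrative only, not the assignment dataset)
X_m, y_m = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_m, y_m, test_size=0.3, random_state=0, stratify=y_m
)

gamma_scores = {}
for g in [0.01, 1, 100]:
    model = SVC(kernel="rbf", C=1.0, gamma=g).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each gamma
    gamma_scores[g] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"gamma={g:<6} train acc={gamma_scores[g][0]:.3f}  test acc={gamma_scores[g][1]:.3f}")
```

A large gap between training and test accuracy at high gamma is the usual sign of overfitting. The same loop can be repeated over values of `C` to see the margin trade-off.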

---

#### 4 Evaluation Metrics  

| Metric | Formula | Interpretation |  
|:--|:--|:--|  
| Accuracy | \( \frac{TP+TN}{TP+TN+FP+FN} \) | Overall correctness |  
| Precision | \( \frac{TP}{TP+FP} \) | Among predicted defaults, how many were true defaults |  
| Recall | \( \frac{TP}{TP+FN} \) | Proportion of actual defaults correctly identified |  
| F1 Score | \( 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}} \) | Balance between precision and recall |  
| ROC-AUC | Area under ROC curve | Measures class separability |  
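
The formulas in the table can be verified by hand on a small example. The labels below are hypothetical (1 = default, 0 = no default) and chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions for ten customers
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")

# The manual values agree with scikit-learn's implementations
print(np.isclose(accuracy, accuracy_score(y_true, y_hat)),
      np.isclose(precision, precision_score(y_true, y_hat)),
      np.isclose(recall, recall_score(y_true, y_hat)),
      np.isclose(f1, f1_score(y_true, y_hat)))
```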

---

### Business Context  

Predicting credit-card defaults helps banks identify high-risk customers and reduce losses.  
SVMs suit this task because they handle high-dimensional feature sets and, with kernels, non-linear relationships between customer attributes and default risk.  

**Example:**  
If a customer has high outstanding balances and irregular payments, an RBF SVM can detect non-linear risk patterns missed by linear models.  


# Step 1: Import Libraries and Load Data

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, average_precision_score

# Load dataset
df = pd.read_csv('____')   ### FILL IN BLANK

# Display dataset information
print("Dataset shape:", ____)   ### FILL IN BLANK
print("\nColumns:\n", ____)   ### FILL IN BLANK
print("\nData Info:\n")
print(df.info())

# Display first few rows
df.head()



In [None]:
# Visualize class distribution
print("Class distribution:")
print(df['____'].value_counts())   ### FILL IN BLANK

sns.countplot(data=df, x='____')   ### FILL IN BLANK
plt.title('Class Distribution (0 = No Default, 1 = Default)')
plt.show()


# Step 2: Sampling for Computational Efficiency  


In [None]:
# Perform random sampling for efficiency
sample_df = df.sample(n=____, random_state=____)   ### FILL IN BLANK
print("Sampled Data Shape:", sample_df.shape)

# Check class balance in the sampled data
print(sample_df['____'].value_counts(normalize=True))   ### FILL IN BLANK
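
Plain random sampling can shift the class ratio when the positive class is rare. One alternative, sketched below on a hypothetical frame, is to sample within each class so that the ratio is preserved. The column names `feature` and `target` are placeholders, not the assignment's actual columns.

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced frame: 950 negatives, 50 positives
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "target": np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)],
})

# Draw 10% of the rows within each class so the class ratio is preserved
strat_sample = toy_df.groupby("target").sample(frac=0.10, random_state=42)

print("Original class ratio:\n", toy_df["target"].value_counts(normalize=True))
print("Sampled class ratio:\n", strat_sample["target"].value_counts(normalize=True))
```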



# Step 3: Feature Scaling and Train-Test Split

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = sample_df.drop("____", axis=1)   ### FILL IN BLANK
y = sample_df["____"]                ### FILL IN BLANK

scaler = StandardScaler()
X_scaled = scaler.fit_transform(____)   ### FILL IN BLANK

X_train, X_test, y_train, y_test = train_test_split(
    ____, ____, test_size=____, random_state=42, stratify=____   ### FILL IN BLANKS
)

print("Training samples:", X_train.shape[0], " Testing samples:", X_test.shape[0])
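
The cell above scales the full sample before splitting, so the scaler's mean and variance include the test rows. A common alternative is to split first and fit the scaler on the training rows only, for example with a scikit-learn `Pipeline`. The sketch below assumes synthetic data from `make_classification` and is not the assignment's required solution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data (illustrative only)
X_syn, y_syn = make_classification(n_samples=500, n_features=8,
                                   weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics whenever it transforms new data
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear", class_weight="balanced"))
pipe.fit(X_tr, y_tr)

scaler_step = pipe.named_steps["standardscaler"]
print("Scaler fit on training rows only:",
      np.allclose(scaler_step.mean_, X_tr.mean(axis=0)))
print("Test accuracy:", pipe.score(X_te, y_te))
```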


**Interpretation Question:** What might happen if we skip scaling when using distance-based algorithms like SVM?

# Step 4: Baseline Linear SVM
Using a **linear kernel** with class balancing enabled.
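
For reference, scikit-learn's `'balanced'` class-weight mode weights each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces those weights on a hypothetical 90/10 label vector.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 negatives, 10 positives
y_demo = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_demo)

# Weights as computed by scikit-learn's 'balanced' mode
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same weights from the formula n_samples / (n_classes * bincount(y))
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.round(4).tolist())))
print("Matches manual formula:", np.allclose(weights, manual))
```

In `SVC`, these weights multiply `C` per class, so errors on the rare class are penalized more heavily.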


In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

linear_svm = SVC(kernel='____', class_weight='____', probability=True, random_state=42)   ### FILL IN BLANKS
linear_svm.fit(____, ____)   ### FILL IN BLANK
y_pred_linear = linear_svm.predict(____)   ### FILL IN BLANK

print("\n--- Linear Kernel SVM ---")
print(classification_report(____, ____, digits=4))   ### FILL IN BLANK
print("ROC-AUC Score:", roc_auc_score(____, linear_svm.predict_proba(____)[:,1]))   ### FILL IN BLANKS

cm_linear = confusion_matrix(____, ____)   ### FILL IN BLANK
print("\nConfusion Matrix:\n", cm_linear)


**Interpretation Question:** Why might a linear boundary perform poorly on complex, nonlinear data?

# Step 5: Faster RBF SVM (No Grid Search)
Train an **RBF kernel SVM**, which is capable of handling non-linear relationships.
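
Besides numeric values, scikit-learn's `SVC` accepts the strings `'scale'` and `'auto'` for `gamma`. The sketch below shows what they resolve to on synthetic data; it illustrates the options and does not indicate which setting this assignment expects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data (illustrative only)
X_g, y_g = make_classification(n_samples=200, n_features=5, random_state=0)

# 'scale' resolves to 1 / (n_features * X.var()); 'auto' resolves to 1 / n_features
gamma_scale = 1.0 / (X_g.shape[1] * X_g.var())
gamma_auto = 1.0 / X_g.shape[1]
print(f"gamma='scale' -> {gamma_scale:.4f}, gamma='auto' -> {gamma_auto:.4f}")

# Passing the computed number gives the same model as passing the string
svm_named = SVC(kernel="rbf", gamma="scale").fit(X_g, y_g)
svm_manual = SVC(kernel="rbf", gamma=gamma_scale).fit(X_g, y_g)
same = np.allclose(svm_named.decision_function(X_g),
                   svm_manual.decision_function(X_g))
print("Same decision values:", same)
```

After standardization every feature has unit variance, so `'scale'` and `'auto'` give nearly the same value.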


In [None]:
rbf_svm = SVC(kernel='____', C=____, gamma='____', class_weight='____', probability=True, random_state=42)   ### FILL IN BLANKS
rbf_svm.fit(____, ____)   ### FILL IN BLANK
y_pred_rbf = rbf_svm.predict(____)   ### FILL IN BLANK

print("\n--- RBF Kernel SVM ---")
print(classification_report(____, ____, digits=4))   ### FILL IN BLANK
print("ROC-AUC Score:", roc_auc_score(____, rbf_svm.predict_proba(____)[:,1]))   ### FILL IN BLANKS

cm_rbf = confusion_matrix(____, ____)   ### FILL IN BLANK
print("\nConfusion Matrix:\n", cm_rbf)



**Interpretation Question:** How does the RBF kernel differ from the linear kernel in handling nonlinear relationships?  

# Step 6: Compare Model Performances
Evaluate and compare Linear vs RBF SVMs using key metrics.


In [None]:
from sklearn.metrics import accuracy_score, f1_score

results = pd.DataFrame({
    "Model": ["Linear SVM", "RBF SVM"],
    "Accuracy": [
        accuracy_score(____, ____),   ### FILL IN BLANK
        accuracy_score(____, ____)    ### FILL IN BLANK
    ],
    "F1 Score": [
        f1_score(____, ____),         ### FILL IN BLANK
        f1_score(____, ____)          ### FILL IN BLANK
    ],
    "ROC-AUC": [
        roc_auc_score(____, linear_svm.predict_proba(____)[:,1]),   ### FILL IN BLANK
        roc_auc_score(____, rbf_svm.predict_proba(____)[:,1])       ### FILL IN BLANK
    ]
})

print("\n--- Model Performance Summary ---")
display(results)


**Interpretation Question:** Which model shows the highest accuracy or F1 score, and what does that imply?

# Step 7: Visualize Precision-Recall and ROC Curves


In [None]:
y_scores = ____.decision_function(____)   ### FILL IN BLANK
precision, recall, _ = precision_recall_curve(____, ____)   ### FILL IN BLANK
average_precision = average_precision_score(____, ____)     ### FILL IN BLANK

plt.figure(figsize=(8,6))
plt.plot(____, ____, label=f'RBF SVM (AP={average_precision:.4f})')   ### FILL IN BLANK
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()
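
The arrays returned by `precision_recall_curve` can also be used to pick a decision threshold for a target recall. The sketch below assumes synthetic imbalanced data and an arbitrary target of 0.90. In practice the threshold should be chosen on a separate validation set, not on the test set as done here for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with roughly 5% positives (illustrative only)
X_i, y_i = make_classification(n_samples=2000, n_features=10,
                               weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_i, y_i, test_size=0.3, random_state=0, stratify=y_i
)

clf = SVC(kernel="rbf", class_weight="balanced").fit(X_tr, y_tr)
scores = clf.decision_function(X_te)

precision, recall, thresholds = precision_recall_curve(y_te, scores)

# precision and recall have one more entry than thresholds; drop the final point.
# Thresholds increase while recall is non-increasing, so the last index that
# still meets the target is the highest usable threshold.
target_recall = 0.90   # hypothetical business requirement
meets_target = recall[:-1] >= target_recall
chosen = thresholds[meets_target][-1]

y_custom = (scores >= chosen).astype(int)
print(f"Chosen threshold: {chosen:.4f}")
print(f"Recall: {recall_score(y_te, y_custom):.4f}  "
      f"Precision: {precision_score(y_te, y_custom):.4f}")
```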



In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fpr_lin, tpr_lin, _ = roc_curve(____, ____.predict_proba(____)[:,1])   ### FILL IN BLANK
fpr_rbf, tpr_rbf, _ = roc_curve(____, ____.predict_proba(____)[:,1])   ### FILL IN BLANK

plt.figure(figsize=(8,6))
plt.plot(____, ____, label=f"Linear SVM (AUC = {auc(____, ____):.3f})")   ### FILL IN BLANK
plt.plot(____, ____, label=f"RBF SVM (AUC = {auc(____, ____):.3f})")     ### FILL IN BLANK
plt.plot([0,1],[0,1],'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison: Linear vs RBF SVM")
plt.legend()
plt.grid(True)
plt.show()
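
ROC curves depend only on how the scores rank the samples, so `decision_function` output can be used in place of `predict_proba`. This avoids the extra internal cross-validation that `probability=True` triggers in `SVC`. The following is a minimal sketch on synthetic data, not a replacement for the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data (illustrative only)
X_r, y_r = make_classification(n_samples=1000, n_features=10,
                               weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_r, y_r, test_size=0.3, random_state=1, stratify=y_r
)

# probability is left at its default (False), so fitting is faster
clf = SVC(kernel="linear", class_weight="balanced").fit(X_tr, y_tr)
scores = clf.decision_function(X_te)

fpr, tpr, _ = roc_curve(y_te, scores)
auc_curve = auc(fpr, tpr)                  # area under the plotted curve
auc_direct = roc_auc_score(y_te, scores)   # same quantity computed directly

print(f"AUC from curve: {auc_curve:.4f}  AUC direct: {auc_direct:.4f}")
```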


# Step 8: Reflection and Discussion

### 1. What does the Precision–Recall curve reveal about model behavior on imbalanced data? Discuss its shape and why Average Precision is preferred over ROC-AUC under extreme imbalance.

### 2. Compare Linear vs RBF kernel results: Which performs better in Recall? Why? Relate to data non-linearity and model complexity.

### 3. Explain how class_weight='balanced' affects the SVM optimization objective and margin position.

### 4. If false negatives are costlier than false positives, how would you adjust this pipeline? Consider threshold tuning, custom loss functions, or cost-sensitive learning.

### 5. Would you deploy this SVM in real-time fraud detection? Why or why not? Think about latency, interpretability, and update frequency vs Gradient Boosting or Deep Models.