# Feature Engineering: Phishing Brand Classification

This notebook covers data preprocessing and feature engineering for the phishing brand classifier.

## Objectives
1. Data preprocessing and cleaning
2. Train/validation/test split with stratification
3. Data augmentation strategies
4. Class imbalance handling
5. Feature visualization and analysis

## Key Considerations
- **Minimize false positives**: Benign sites ('others') should NOT be classified as brands
- **Class imbalance**: Use weighted sampling and appropriate loss functions
- **Data augmentation**: Simulate real-world variations in screenshots

In [None]:
# Import required libraries
import os
import sys
from pathlib import Path

import albumentations as A
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
from PIL import Image
from sklearn.model_selection import train_test_split

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from src.data.utils import scan_dataset, prepare_dataset_splits
from src.data.transforms import get_train_transforms, get_val_transforms, AlbumentationsTransform
from src.data.dataset import PhishingDataset

# Configuration
DATA_DIR = project_root / 'data' / 'raw'
PROCESSED_DIR = project_root / 'data' / 'processed'
FIGURES_DIR = project_root / 'outputs' / 'figures'

PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
FIGURES_DIR.mkdir(parents=True, exist_ok=True)

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

print(f"Data directory: {DATA_DIR}")
print(f"PyTorch version: {torch.__version__}")

## 1. Load and Prepare Data

In [None]:
# Load dataset
df = scan_dataset(str(DATA_DIR))

print(f"Total images: {len(df)}")
print(f"\nClass distribution:")
print(df['label'].value_counts())

In [None]:
# Define class order (ensure 'others' is last for easier handling)
CLASS_NAMES = [
    'amazon', 'apple', 'facebook', 'google', 'instagram',
    'linkedin', 'microsoft', 'netflix', 'paypal', 'twitter',
    'others'
]

# Filter to only include classes in our list
df = df[df['label'].isin(CLASS_NAMES)]

print(f"Filtered dataset size: {len(df)}")
print(f"\nClasses: {CLASS_NAMES}")
print(f"Others class index: {CLASS_NAMES.index('others')}")

## 2. Train/Validation/Test Split

We use stratified splitting to maintain class proportions across splits.
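
`prepare_dataset_splits` is a project helper whose implementation is not shown in this notebook. As a rough sketch of what a 70/15/15 stratified split involves, the example below applies scikit-learn's `train_test_split` twice to a small synthetic DataFrame. The `stratified_three_way_split` name and the toy data are assumptions for illustration, not the project's code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_three_way_split(df, label_col="label", val_size=0.15,
                               test_size=0.15, random_state=42):
    """Hypothetical helper: split df into train/val/test, stratified by label."""
    # First carve off the combined val + test portion.
    holdout_size = val_size + test_size
    train_part, holdout_part = train_test_split(
        df,
        test_size=holdout_size,
        stratify=df[label_col],
        random_state=random_state,
    )
    # Then divide the holdout into val and test, again stratified.
    val_part, test_part = train_test_split(
        holdout_part,
        test_size=test_size / holdout_size,
        stratify=holdout_part[label_col],
        random_state=random_state,
    )
    return train_part, val_part, test_part


# Toy data: 200 rows, imbalanced across three labels.
toy_df = pd.DataFrame({
    "image_path": [f"img_{i}.png" for i in range(200)],
    "label": ["others"] * 120 + ["google"] * 60 + ["paypal"] * 20,
})
toy_train, toy_val, toy_test = stratified_three_way_split(toy_df)
print(len(toy_train), len(toy_val), len(toy_test))
print(toy_val["label"].value_counts(normalize=True).round(2))
```

Splitting in two stages keeps the label proportions roughly equal in all three subsets, which matters most for the smallest brand classes.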

In [None]:
# Split parameters
TRAIN_SIZE = 0.70
VAL_SIZE = 0.15
TEST_SIZE = 0.15

# Perform stratified split
train_df, val_df, test_df = prepare_dataset_splits(
    df,
    train_size=TRAIN_SIZE,
    val_size=VAL_SIZE,
    test_size=TEST_SIZE,
    stratify=True,
    random_state=RANDOM_SEED
)

print(f"\nSplit sizes:")
print(f"  Train: {len(train_df)} ({len(train_df)/len(df)*100:.1f}%)")
print(f"  Val:   {len(val_df)} ({len(val_df)/len(df)*100:.1f}%)")
print(f"  Test:  {len(test_df)} ({len(test_df)/len(df)*100:.1f}%)")

In [None]:
# Verify stratification
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for ax, (name, split_df) in zip(axes, [('Train', train_df), ('Validation', val_df), ('Test', test_df)]):
    counts = split_df['label'].value_counts().sort_index()
    colors = ['coral' if c == 'others' else 'steelblue' for c in counts.index]
    counts.plot(kind='bar', ax=ax, color=colors)
    ax.set_title(f'{name} Set Distribution')
    ax.set_xlabel('Class')
    ax.set_ylabel('Count')
    ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'split_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

# Print class distribution percentages
print("\nClass distribution per split:")
for name, split_df in [('Train', train_df), ('Val', val_df), ('Test', test_df)]:
    dist = split_df['label'].value_counts(normalize=True) * 100
    print(f"\n{name}:")
    for cls in CLASS_NAMES:
        if cls in dist:
            print(f"  {cls}: {dist[cls]:.1f}%")

In [None]:
# Save splits
train_df.to_csv(PROCESSED_DIR / 'train.csv', index=False)
val_df.to_csv(PROCESSED_DIR / 'val.csv', index=False)
test_df.to_csv(PROCESSED_DIR / 'test.csv', index=False)

print(f"Splits saved to {PROCESSED_DIR}")

## 3. Data Augmentation

Data augmentation is crucial for:
- Increasing effective dataset size
- Improving model generalization
- Simulating real-world variations in screenshots

### Augmentation Strategy
For website screenshots, we apply:
- **Geometric**: Slight rotation, horizontal flip (mirroring also reverses logos and text, so confirm on validation data that it helps)
- **Color**: Brightness/contrast changes, color jitter
- **Quality**: Blur, compression artifacts, noise

We avoid vertical flips because real screenshots are never upside down.

In [None]:
# Image size for model input
IMAGE_SIZE = 224  # Standard for EfficientNet/ResNet

# Get transforms
train_transform = get_train_transforms(image_size=IMAGE_SIZE)
val_transform = get_val_transforms(image_size=IMAGE_SIZE)

print("Training transforms:")
print(train_transform)
print("\nValidation transforms:")
print(val_transform)

In [None]:
def visualize_augmentations(image_path, transform, n_augments=8, figsize=(16, 8)):
    """Visualize multiple augmentations of an image."""
    img = Image.open(image_path).convert('RGB')
    img_array = np.array(img)
    
    n_cols = 4
    n_rows = (n_augments + 1 + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
    axes = axes.flatten()
    
    # Original image
    axes[0].imshow(img)
    axes[0].set_title('Original', fontsize=10)
    axes[0].axis('off')
    
    # Augmented versions
    for i in range(n_augments):
        augmented = transform(image=img_array)['image']
        # Convert from tensor if needed
        if isinstance(augmented, torch.Tensor):
            # Denormalize
            mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
            std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
            augmented = augmented * std + mean
            augmented = augmented.permute(1, 2, 0).numpy()
            augmented = np.clip(augmented, 0, 1)
        
        axes[i + 1].imshow(augmented)
        axes[i + 1].set_title(f'Augmentation {i+1}', fontsize=10)
        axes[i + 1].axis('off')
    
    # Hide unused subplots
    for j in range(n_augments + 1, len(axes)):
        axes[j].axis('off')
    
    plt.tight_layout()
    return fig

In [None]:
# Visualize augmentations for a sample image from each class
for class_name in ['google', 'facebook', 'others']:
    class_df = train_df[train_df['label'] == class_name]
    if len(class_df) > 0:
        # Fix the seed so the same sample image is chosen on every run
        sample = class_df.sample(1, random_state=RANDOM_SEED).iloc[0]
        print(f"\nAugmentations for {class_name.upper()} class:")
        fig = visualize_augmentations(sample['image_path'], train_transform, n_augments=7)
        plt.savefig(FIGURES_DIR / f'augmentations_{class_name}.png', dpi=100, bbox_inches='tight')
        plt.show()

## 4. Class Imbalance Handling

For phishing detection, handling class imbalance is critical:

### Strategies
1. **Class weights**: Higher weight for minority classes in loss function
2. **Weighted sampling**: Oversample minority classes during training
3. **Focal loss**: Focus on hard-to-classify examples
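
The first and third strategies can be combined in a few lines of PyTorch. The sketch below computes inverse-frequency class weights and applies them inside a focal loss. `inverse_frequency_weights` and `focal_loss` are illustrative names, `gamma=2.0` is an assumed default, and the project's own `get_class_weights()` may normalize differently.

```python
import torch
import torch.nn.functional as F


def inverse_frequency_weights(targets, num_classes):
    """Weight each class by total / (num_classes * class_count)."""
    counts = torch.bincount(targets, minlength=num_classes).float()
    # clamp avoids division by zero for classes absent from `targets`
    return counts.sum() / (num_classes * counts.clamp(min=1))


def focal_loss(logits, targets, alpha=None, gamma=2.0):
    """Multi-class focal loss: -(1 - p_t)^gamma * log(p_t), optionally class-weighted."""
    log_probs = F.log_softmax(logits, dim=1)
    # Log-probability assigned to the true class of each sample
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:
        loss = loss * alpha[targets]
    return loss.mean()


# Toy batch: class 0 is three times as frequent as class 1.
toy_targets = torch.tensor([0, 0, 0, 1])
toy_logits = torch.tensor([[2.0, 0.5], [1.5, 0.2], [0.1, 0.3], [0.4, 1.0]])
toy_weights = inverse_frequency_weights(toy_targets, num_classes=2)
print(toy_weights)
print(focal_loss(toy_logits, toy_targets, alpha=toy_weights))
```

With `gamma=0` and no `alpha`, the function reduces to ordinary cross-entropy, which makes it easy to sanity-check against `F.cross_entropy`.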

### Special Handling for 'Others'
The 'others' class needs special attention:
- False positives (benign → brand) create poor user experience
- May need higher confidence threshold for brand classification
- Consider asymmetric loss penalties
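
One simple way to express an asymmetric penalty is to scale the per-sample cross-entropy whenever the true class is 'others', since any loss on such a sample comes from probability mass placed on a brand. The function name and the `fp_penalty=2.0` value below are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F


def asymmetric_cross_entropy(logits, targets, others_idx, fp_penalty=2.0):
    """Cross-entropy that up-weights samples whose true class is 'others'."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.ones_like(per_sample)
    # Errors on benign samples are benign -> brand false positives
    weights[targets == others_idx] = fp_penalty
    return (per_sample * weights).mean()


# Toy batch with three classes; index 2 plays the role of 'others'.
TOY_OTHERS_IDX = 2
asym_logits = torch.tensor([[0.2, 2.0, 0.1],   # benign page scored as brand 1
                            [1.5, 0.3, 0.2],
                            [0.1, 0.2, 2.5]])
asym_targets = torch.tensor([2, 0, 2])

print(asymmetric_cross_entropy(asym_logits, asym_targets, TOY_OTHERS_IDX, fp_penalty=1.0))
print(asymmetric_cross_entropy(asym_logits, asym_targets, TOY_OTHERS_IDX, fp_penalty=3.0))
```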

In [None]:
# Create training dataset
train_dataset = PhishingDataset(
    data_dir=str(DATA_DIR),
    df=train_df,
    transform=AlbumentationsTransform(train_transform),
    class_names=CLASS_NAMES
)

# Calculate class weights
class_weights = train_dataset.get_class_weights()

print("Class weights (inverse frequency):")
for name, weight in zip(CLASS_NAMES, class_weights):
    marker = " <-- BENIGN" if name == 'others' else ""
    print(f"  {name}: {weight:.3f}{marker}")

In [None]:
# Visualize class weights
fig, ax = plt.subplots(figsize=(12, 5))

colors = ['coral' if c == 'others' else 'steelblue' for c in CLASS_NAMES]
bars = ax.bar(CLASS_NAMES, class_weights.numpy(), color=colors, edgecolor='black')

ax.set_xlabel('Class')
ax.set_ylabel('Weight')
ax.set_title('Class Weights for Training')
ax.tick_params(axis='x', rotation=45)

# Add value labels
for bar, weight in zip(bars, class_weights):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
           f'{weight:.2f}', ha='center', fontsize=9)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'class_weights.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Demonstrate weighted sampling effect
from torch.utils.data import WeightedRandomSampler, DataLoader

# Get sample weights
sample_weights = train_dataset.get_sample_weights()

# Create weighted sampler
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True
)

# Create dataloader with sampler
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)

# Check distribution after one epoch of sampling
sampled_labels = []
for _, labels, _ in loader:
    sampled_labels.extend(labels.numpy())

# Compare distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original distribution
orig_counts = train_df['label'].value_counts().sort_index()
axes[0].bar(CLASS_NAMES, [orig_counts.get(c, 0) for c in CLASS_NAMES],
           color=['coral' if c == 'others' else 'steelblue' for c in CLASS_NAMES])
axes[0].set_title('Original Distribution')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

# Sampled distribution
from collections import Counter
sampled_counts = Counter(sampled_labels)
axes[1].bar(CLASS_NAMES, [sampled_counts.get(i, 0) for i in range(len(CLASS_NAMES))],
           color=['coral' if c == 'others' else 'steelblue' for c in CLASS_NAMES])
axes[1].set_title('After Weighted Sampling (1 epoch)')
axes[1].set_xlabel('Class')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'weighted_sampling_effect.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nWeighted sampling helps balance class representation during training!")

## 5. Feature Analysis: Pretrained Model Features

Analyze feature representations from a pretrained model to understand:
- How well brands are separated in feature space
- Potential confusion between classes
- Quality of pretrained features for our task
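
A t-SNE plot is qualitative, so a numeric separability measure can complement it. The sketch below computes scikit-learn's silhouette score on synthetic stand-in features; the toy clusters are assumptions for illustration and say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic stand-in for extracted features: two well-separated 8-D clusters.
toy_features = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 8)),
    rng.normal(loc=10.0, scale=1.0, size=(50, 8)),
])
toy_labels = np.array([0] * 50 + [1] * 50)

# Ranges from -1 to 1; values near 1 indicate compact, well-separated classes.
toy_score = silhouette_score(toy_features, toy_labels)
print(f"Silhouette score: {toy_score:.3f}")
```

Once features have been extracted, the same call can be applied to the real feature matrix and its integer labels.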

In [None]:
import timm
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

# Load pretrained model for feature extraction
feature_extractor = timm.create_model('efficientnet_b0', pretrained=True, num_classes=0)
feature_extractor.eval()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
feature_extractor = feature_extractor.to(device)

print(f"Feature extractor loaded on {device}")
print(f"Feature dimension: {feature_extractor.num_features}")

In [None]:
# Extract features for a subset of images
val_dataset = PhishingDataset(
    data_dir=str(DATA_DIR),
    df=val_df,
    transform=AlbumentationsTransform(val_transform),
    class_names=CLASS_NAMES
)

# Sample subset for visualization
n_samples = min(500, len(val_dataset))
indices = np.random.choice(len(val_dataset), n_samples, replace=False)

features = []
labels = []

print(f"Extracting features from {n_samples} images...")

with torch.no_grad():
    for idx in indices:
        img, label, _ = val_dataset[idx]
        img = img.unsqueeze(0).to(device)
        feat = feature_extractor(img)
        features.append(feat.cpu().numpy())
        labels.append(label)

features = np.vstack(features)
labels = np.array(labels)

print(f"Features shape: {features.shape}")

In [None]:
# Dimensionality reduction with t-SNE
print("Running t-SNE...")
tsne = TSNE(n_components=2, random_state=RANDOM_SEED, perplexity=30)
features_2d = tsne.fit_transform(features)

# Plot
fig, ax = plt.subplots(figsize=(14, 10))

# Define colors for each class
colors = plt.cm.tab10(np.linspace(0, 1, len(CLASS_NAMES)))
color_map = {i: colors[i] for i in range(len(CLASS_NAMES))}
color_map[CLASS_NAMES.index('others')] = 'gray'  # Make 'others' gray

for class_idx in range(len(CLASS_NAMES)):
    mask = labels == class_idx
    ax.scatter(
        features_2d[mask, 0],
        features_2d[mask, 1],
        c=[color_map[class_idx]],
        label=CLASS_NAMES[class_idx],
        alpha=0.6,
        s=50
    )

ax.set_title('t-SNE Visualization of Pretrained Features', fontsize=14)
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=10)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'tsne_features.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nObservations:")
print("- Check if brands form distinct clusters")
print("- Look for overlaps that might cause confusion")
print("- 'Others' (gray) should ideally be separate from brand clusters")

## 6. Confidence Threshold Analysis

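This section illustrates how confidence thresholding reduces false positives. The data used below is synthetic, so the curves show the expected trade-off rather than measured model performance.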

**Key Insight**: By setting a higher confidence threshold, we can reject uncertain predictions and classify them as 'others' (benign), reducing false positives.
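
As a minimal sketch of this rule, the function below falls back to 'others' whenever a brand prediction is less confident than the threshold. The function name, the toy probabilities, and the 0.85 default are assumptions for illustration.

```python
import numpy as np


def predict_with_rejection(probs, others_idx, threshold=0.85):
    """Return class indices, replacing low-confidence brand predictions with 'others'."""
    preds = probs.argmax(axis=1)
    confidences = probs.max(axis=1)
    # Only brand predictions are rejected; 'others' predictions pass through.
    uncertain_brand = (preds != others_idx) & (confidences < threshold)
    return np.where(uncertain_brand, others_idx, preds)


# Toy softmax outputs for three classes; index 2 plays the role of 'others'.
toy_probs = np.array([
    [0.90, 0.05, 0.05],   # confident brand prediction: kept
    [0.60, 0.30, 0.10],   # uncertain brand prediction: rejected
    [0.10, 0.20, 0.70],   # predicted 'others': unchanged
    [0.30, 0.30, 0.40],   # predicted 'others': unchanged
])
thresholded_preds = predict_with_rejection(toy_probs, others_idx=2)
print(thresholded_preds)
```

In practice the threshold would be tuned on validation predictions rather than fixed in advance.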

In [None]:
# Simulate threshold effects
def simulate_threshold_effect(confidences, is_correct, thresholds):
    """Analyze effect of confidence thresholds on accuracy and rejection rate."""
    results = []
    
    for thresh in thresholds:
        accepted = confidences >= thresh
        rejected = ~accepted
        
        # Accuracy on accepted predictions
        if accepted.sum() > 0:
            acc = is_correct[accepted].mean()
        else:
            acc = np.nan  # No accepted predictions: accuracy is undefined, not zero
        
        results.append({
            'threshold': thresh,
            'accepted_rate': accepted.mean(),
            'rejected_rate': rejected.mean(),
            'accuracy_on_accepted': acc
        })
    
    return pd.DataFrame(results)

# Generate synthetic confidence/correctness data for demonstration
np.random.seed(RANDOM_SEED)
n_samples = 1000

# Simulate: correct predictions tend to have higher confidence
is_correct = np.random.binomial(1, 0.85, n_samples).astype(bool)
confidences = np.where(
    is_correct,
    np.random.beta(8, 2, n_samples),  # Higher confidence for correct
    np.random.beta(2, 5, n_samples)   # Lower confidence for incorrect
)

# Analyze thresholds
thresholds = np.arange(0.5, 0.99, 0.05)
threshold_analysis = simulate_threshold_effect(confidences, is_correct, thresholds)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Threshold vs Accuracy
axes[0].plot(threshold_analysis['threshold'], threshold_analysis['accuracy_on_accepted'],
            'b-o', linewidth=2, markersize=8)
axes[0].set_xlabel('Confidence Threshold')
axes[0].set_ylabel('Accuracy on Accepted Predictions')
axes[0].set_title('Accuracy vs Confidence Threshold')
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(0.8, 1.0)

# Threshold vs Rejection Rate
axes[1].plot(threshold_analysis['threshold'], threshold_analysis['rejected_rate'],
            'r-o', linewidth=2, markersize=8)
axes[1].set_xlabel('Confidence Threshold')
axes[1].set_ylabel('Rejection Rate')
axes[1].set_title('Rejection Rate vs Confidence Threshold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / 'threshold_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nThreshold Analysis:")
print(threshold_analysis.to_string(index=False))
print("\n** Higher threshold = Fewer false positives, but more rejections **")

## 7. Summary and Next Steps

### Key Preprocessing Decisions

1. **Image Size**: 224x224 for EfficientNet (can try 384x384 for ViT)
2. **Split Ratio**: 70/15/15 with stratification
3. **Augmentation**: Rotation, brightness, blur, compression artifacts
4. **Class Imbalance**: 
   - Weighted sampling during training
   - Focal loss for hard examples
   - Class weights in loss function
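5. **Confidence Threshold**: ~0.85 as a provisional starting point to minimize false positives; the analysis above uses synthetic data, so tune this value on validation predictions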

### Files Generated
- `data/processed/train.csv`
- `data/processed/val.csv`
- `data/processed/test.csv`
- Various analysis figures in `outputs/figures/`

In [None]:
# Save preprocessing configuration
import yaml

preprocess_config = {
    'image_size': IMAGE_SIZE,
    'class_names': CLASS_NAMES,
    'others_class_idx': CLASS_NAMES.index('others'),
    'split': {
        'train': TRAIN_SIZE,
        'val': VAL_SIZE,
        'test': TEST_SIZE
    },
    'class_weights': class_weights.tolist(),
    'recommended_confidence_threshold': 0.85,  # Provisional: based on the synthetic simulation above
    'random_seed': RANDOM_SEED
}

with open(PROCESSED_DIR / 'preprocess_config.yaml', 'w') as f:
    yaml.dump(preprocess_config, f, default_flow_style=False)

print(f"Preprocessing configuration saved to {PROCESSED_DIR / 'preprocess_config.yaml'}")
print("\nReady for model training!")