# Multi-class Classification with Feature Extraction

This notebook demonstrates **feature extraction** for 7-class skin lesion classification:
- **HOG Features**: Handcrafted gradient-based features
- **VGG16 Features**: Deep features from a CNN pretrained on ImageNet (transfer learning)
- **Expected benefit**: Higher accuracy and faster SVM training than fitting on raw pixels

**Classes (7):**
- nv: Melanocytic nevi (~67%)
- mel: Melanoma (~11%)
- bkl: Benign keratosis-like lesions (~11%)
- bcc: Basal cell carcinoma (~5%)
- akiec: Actinic keratoses (~3%)
- vasc: Vascular lesions (~1%)
- df: Dermatofibroma (~1%)

---

## Setup and Imports

In [None]:
# Import libraries
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import os
import time
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    top_k_accuracy_score
)

# Import preprocessing utilities
from image_loader import load_images_from_metadata
from preprocessing import prepare_images_with_feature_extraction, DX_DICT

import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print('✓ Libraries imported successfully!')
print('✓ Using enhanced preprocessing with feature extraction')

## Configuration: Choose Feature Extraction Method

In [None]:
# ============================================================
# CONFIGURATION: Choose your feature extraction method
# ============================================================

# Option 1: HOG Features (Fast, no TensorFlow)
FEATURE_METHOD = 'hog'
FEATURE_PARAMS = {
    'target_size': (128, 128),
    'pixels_per_cell': (16, 16),
    'cells_per_block': (2, 2)
}

# Option 2: VGG16 Features (often more accurate than HOG, requires TensorFlow)
# FEATURE_METHOD = 'vgg16'
# FEATURE_PARAMS = {'pooling': 'avg'}

# Option 3: ResNet50 Features (Deep features, requires TensorFlow)
# FEATURE_METHOD = 'resnet50'
# FEATURE_PARAMS = {'pooling': 'avg'}

print(f'Feature Extraction Method: {FEATURE_METHOD.upper()}')
print(f'Parameters: {FEATURE_PARAMS}')
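
For the HOG option, the descriptor length follows directly from the parameters above. The sketch below works through that arithmetic. It assumes the conventional HOG layout (blocks slide one cell at a time, single-channel gradients) and 9 orientation bins. The bin count and the helper name `hog_feature_length` are assumptions, since `FEATURE_PARAMS` does not set the orientations and the project's own extractor may differ.

```python
def hog_feature_length(target_size, pixels_per_cell, cells_per_block, orientations=9):
    """Length of a HOG descriptor for one image.

    Assumes blocks slide one cell at a time and the image size is an
    exact multiple of the cell size. `orientations=9` is an assumed default.
    """
    height, width = target_size
    cell_h, cell_w = pixels_per_cell
    block_h, block_w = cells_per_block

    # Number of cells along each axis
    cells_y, cells_x = height // cell_h, width // cell_w
    # Number of overlapping block positions along each axis
    blocks_y, blocks_x = cells_y - block_h + 1, cells_x - block_w + 1

    return blocks_y * blocks_x * block_h * block_w * orientations


hog_dim = hog_feature_length((128, 128), (16, 16), (2, 2))
raw_dim = 224 * 224 * 3  # flattened RGB pixels at the 224x224 load size

print(f'HOG features per image: {hog_dim:,}')
print(f'Raw pixels per image:   {raw_dim:,}')
```

Under these assumptions the 128×128 HOG descriptor has 7 × 7 × 2 × 2 × 9 = 1,764 values, compared with 150,528 raw pixel values per image.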

## Step 1: Load Data

In [None]:
# Load metadata
metadata_path = '../data/HAM10000_metadata.csv'
df = pd.read_csv(metadata_path)

print(f'Metadata loaded: {df.shape[0]} samples')
print(f'\nDiagnosis types:')
print(df['dx'].value_counts())
print(f'\nDiagnosis names:')
for code, name in DX_DICT.items():
    count = (df['dx'] == code).sum()
    print(f'  {code}: {name} ({count} samples, {count/len(df)*100:.1f}%)')

In [None]:
# Visualize class distribution
plt.figure(figsize=(12, 5))

# Count plot
plt.subplot(1, 2, 1)
df['dx'].value_counts().plot(kind='bar', color='steelblue')
plt.title('Class Distribution (Count)', fontsize=14, fontweight='bold')
plt.xlabel('Diagnosis', fontsize=12)
plt.ylabel('Number of Samples', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

# Pie chart
plt.subplot(1, 2, 2)
df['dx'].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title('Class Distribution (Percentage)', fontsize=14, fontweight='bold')
plt.ylabel('')

plt.tight_layout()
plt.show()

print('\n⚠ Note: Imbalanced dataset - nv (67%) dominates')

In [None]:
# Load images
print('Loading images... This may take several minutes.')
print('=' * 60)

IMAGE_SIZE = (224, 224)
images, loaded_image_ids = load_images_from_metadata(
    df,
    base_path='../data',
    target_size=IMAGE_SIZE,
    normalize=True,
    verbose=True
)

print(f'\nLoaded images shape: {images.shape}')
print(f'Memory usage: {images.nbytes / (1024**3):.2f} GB')
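
The memory figure printed above can be estimated before loading anything, which helps decide whether the full dataset fits in RAM. This is a rough sketch: the helper `image_array_gib` is hypothetical, the image count of 10,015 assumes the full HAM10000 release, and the actual dtype depends on what `load_images_from_metadata` returns.

```python
def image_array_gib(n_images, image_size, channels=3, bytes_per_value=4):
    """Approximate size in GiB of a dense (N, H, W, C) array."""
    height, width = image_size
    total_bytes = n_images * height * width * channels * bytes_per_value
    return total_bytes / 1024**3


n_images = 10_015  # assumed: image count of the full HAM10000 release

gib_float32 = image_array_gib(n_images, (224, 224), bytes_per_value=4)
gib_float64 = image_array_gib(n_images, (224, 224), bytes_per_value=8)

print(f'float32: {gib_float32:.2f} GiB')
print(f'float64: {gib_float64:.2f} GiB')
```

If the estimate exceeds available memory, loading at a smaller `target_size` is the simplest lever, since HOG resizes to 128×128 anyway.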

In [None]:
# Visualize sample images from each class
fig, axes = plt.subplots(2, 7, figsize=(18, 6))
axes = axes.ravel()

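# Labels aligned with `images`: rows whose image failed to load are dropped.
# Assumes the loader returns images in metadata order.
dx_loaded = df.loc[df['image_id'].isin(loaded_image_ids), 'dx'].to_numpy()

for i, (code, name) in enumerate(DX_DICT.items()):
    # Positions in `images` that belong to this class
    class_idx = np.flatnonzero(dx_loaded == code)
    axes[i].set_title(f'{code}\n{name.split()[0]}', fontsize=9)
    axes[i].axis('off')
    axes[i + 7].axis('off')

    # Show up to two examples per class, one per row
    for row, idx in enumerate(class_idx[:2]):
        axes[i + 7 * row].imshow(images[idx])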

plt.suptitle('Sample Images from Each Class (7 Types)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Encode labels
df_filtered = df[df['image_id'].isin(loaded_image_ids)].reset_index(drop=True)

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(df_filtered['dx'])

print('Label Encoding:')
for i, class_name in enumerate(label_encoder.classes_):
    count = (y_encoded == i).sum()
    print(f'  {i}: {class_name} - {DX_DICT[class_name]} ({count} samples)')

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    images,
    y_encoded,
    test_size=0.2,
    random_state=42,
    stratify=y_encoded
)

print('Train-Test Split:')
print(f'Training: {X_train.shape}')
print(f'Testing: {X_test.shape}')
print(f'\nTraining class distribution:')
unique, counts = np.unique(y_train, return_counts=True)
for label, count in zip(unique, counts):
    class_name = label_encoder.classes_[label]
    print(f'  {label} ({class_name}): {count} ({count/len(y_train)*100:.1f}%)')
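
One caveat with a purely random split: the HAM10000 metadata also includes a `lesion_id` column, and a lesion can appear in more than one image, so images of the same lesion may land in both sets. The sketch below shows the idea of a group-aware split using only the standard library. The helper `group_train_test_split` and the lesion IDs are hypothetical, and unlike the cell above it does not stratify by class.

```python
import random
from collections import defaultdict


def group_train_test_split(groups, test_size=0.2, seed=42):
    """Split sample indices so that no group appears on both sides."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)

    group_ids = sorted(by_group)
    random.Random(seed).shuffle(group_ids)

    target = test_size * len(groups)
    train_idx, test_idx = [], []
    for group in group_ids:
        # Fill the test side first, one whole group at a time
        side = test_idx if len(test_idx) < target else train_idx
        side.extend(by_group[group])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical lesion IDs: several images share a lesion
lesion_ids = ['L1', 'L1', 'L2', 'L3', 'L3', 'L3', 'L4', 'L5', 'L5', 'L6']
train_idx, test_idx = group_train_test_split(lesion_ids, test_size=0.2)

train_lesions = {lesion_ids[i] for i in train_idx}
test_lesions = {lesion_ids[i] for i in test_idx}
print('Lesions in both sets:', train_lesions & test_lesions)
```

In practice, scikit-learn's `GroupShuffleSplit` or `StratifiedGroupKFold` provide the same guarantee with more control over class balance.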

## Step 2: Feature Extraction

In [None]:
# Extract features
print(f'Extracting {FEATURE_METHOD.upper()} features...')
print('=' * 70)

start_time = time.time()

prep_data = prepare_images_with_feature_extraction(
    X_train, X_test,
    method=FEATURE_METHOD,
    use_pca=False,  # Feature extraction already reduces dimensions
    **FEATURE_PARAMS
)

preprocessing_time = time.time() - start_time

print(f'\n✓ Feature extraction complete! Time: {preprocessing_time:.2f} seconds')
print(f'\nFeature Summary:')
print(f'  Method: {FEATURE_METHOD.upper()}')
print(f'  Training features: {prep_data["X_train_final"].shape}')
print(f'  Test features: {prep_data["X_test_final"].shape}')
print(f'  Feature dimensions: {prep_data["X_train_final"].shape[1]:,}')

## Step 3: Train Multi-class SVM Model

In [None]:
# Initialize multi-class SVM model
svm_model = SVC(
    kernel='rbf',
    C=1.0,
    gamma='scale',
    random_state=42,
    probability=True,
    class_weight='balanced',
    decision_function_shape='ovr'  # OvR-shaped decision scores; SVC still trains one-vs-one internally
)

print('Multi-class SVM Configuration:')
print(f'  Kernel: {svm_model.kernel}')
print(f'  C: {svm_model.C}')
print(f'  Decision Function: {svm_model.decision_function_shape}')
print(f'  Class Weight: {svm_model.class_weight}')
print(f'  Number of classes: {len(np.unique(y_train))}')
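
To make `class_weight='balanced'` concrete: scikit-learn sets each class weight to `n_samples / (n_classes * count)`, and `SVC` multiplies `C` by that weight for the class. The sketch below reproduces the formula with NumPy on hypothetical labels. The helper `balanced_class_weights` is illustrative and assumes integer labels 0..k-1 with every class present.

```python
import numpy as np


def balanced_class_weights(y):
    """Weights as computed by class_weight='balanced': n / (k * count)."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)


# Hypothetical labels with a 6:3:1 imbalance
y_demo = np.array([0] * 6 + [1] * 3 + [2] * 1)
weights = balanced_class_weights(y_demo)

for label, (count, weight) in enumerate(zip(np.bincount(y_demo), weights)):
    print(f'class {label}: n={count}, weight={weight:.3f}, weighted total={count * weight:.3f}')
```

Each class ends up with the same weighted total, so errors on a rare class such as df or vasc are penalized as heavily in aggregate as errors on nv.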

In [None]:
# Train the model
print(f'Training multi-class SVM on {FEATURE_METHOD.upper()} features...')
print(f'Feature dimensions: {prep_data["X_train_final"].shape}')
print(f'Expected time: Varies by method (2-10 minutes)')

start_time = time.time()
svm_model.fit(prep_data['X_train_final'], y_train)
training_time = time.time() - start_time

print(f'\n✓ Training complete! Time: {training_time:.2f} seconds ({training_time/60:.1f} minutes)')
print(f'Number of support vectors per class: {svm_model.n_support_}')
print(f'Total support vectors: {sum(svm_model.n_support_)}')

In [None]:
# Make predictions
y_train_pred = svm_model.predict(prep_data['X_train_final'])
y_test_pred = svm_model.predict(prep_data['X_test_final'])
y_test_proba = svm_model.predict_proba(prep_data['X_test_final'])

print('✓ Predictions complete!')
print(f'\nSample predictions (first 10):')
for i in range(10):
    true_class = label_encoder.classes_[y_test[i]]
    pred_class = label_encoder.classes_[y_test_pred[i]]
    match = '✓' if y_test[i] == y_test_pred[i] else '✗'
    prob = y_test_proba[i][y_test_pred[i]]
    print(f'  {match} True: {true_class:6s} | Predicted: {pred_class:6s} | Prob: {prob:.3f}')

## Step 4: Evaluate Model

In [None]:
# Calculate metrics
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
top2_accuracy = top_k_accuracy_score(y_test, y_test_proba, k=2)

print('=' * 70)
print(f'MULTI-CLASS CLASSIFICATION - {FEATURE_METHOD.upper()} FEATURES')
print('=' * 70)
print(f'\nFeature Extraction:')
print(f'  Method: {FEATURE_METHOD.upper()}')
print(f'  Feature dimensions: {prep_data["X_train_final"].shape[1]:,}')
print(f'  Extraction time: {preprocessing_time:.2f} seconds')
print(f'\nPerformance:')
print(f'  Training Accuracy: {train_accuracy:.4f} ({train_accuracy*100:.2f}%)')
print(f'  Testing Accuracy:  {test_accuracy:.4f} ({test_accuracy*100:.2f}%)')
print(f'  Top-2 Accuracy:    {top2_accuracy:.4f} ({top2_accuracy*100:.2f}%)')
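print(f'  Random Baseline:   {1/7:.4f} ({100/7:.2f}%)')
# With an imbalanced test set, always predicting the most common class is the stronger reference
majority_acc = np.bincount(y_test).max() / len(y_test)
print(f'  Majority Baseline: {majority_acc:.4f} ({majority_acc*100:.2f}%)')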
print(f'\nTiming:')
print(f'  Preprocessing time: {preprocessing_time:.2f} seconds')
print(f'  Training time: {training_time:.2f} seconds ({training_time/60:.1f} minutes)')
print(f'  Total time: {(preprocessing_time + training_time)/60:.1f} minutes')
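
Top-2 accuracy counts a prediction as correct when the true class is among the two highest-probability classes. The following is a minimal NumPy sketch of that computation on hypothetical scores. The helper `top_k_accuracy` is illustrative and ignores the tie handling that `top_k_accuracy_score` applies.

```python
import numpy as np


def top_k_accuracy(y_true, scores, k=2):
    """Fraction of samples whose true label is among the k highest scores."""
    # Column indices of the k largest scores in each row
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == np.asarray(y_true)[:, None]).any(axis=1)
    return hits.mean()


# Hypothetical labels and class probabilities for 3 classes
y_demo = np.array([0, 0, 0, 1, 2])
proba_demo = np.array([
    [0.7, 0.2, 0.1],  # true 0: ranked first
    [0.4, 0.5, 0.1],  # true 0: ranked second
    [0.1, 0.3, 0.6],  # true 0: ranked last
    [0.2, 0.5, 0.3],  # true 1: ranked first
    [0.5, 0.3, 0.2],  # true 2: ranked last
])

top1 = top_k_accuracy(y_demo, proba_demo, k=1)
top2 = top_k_accuracy(y_demo, proba_demo, k=2)
print(f'Top-1: {top1:.2f}, Top-2: {top2:.2f}')
```

A large gap between top-1 and top-2 accuracy suggests the model often ranks the right class second, which is typical when two classes look alike.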

In [None]:
# Confusion matrix
cm_test = confusion_matrix(y_test, y_test_pred)
class_names = label_encoder.classes_

plt.figure(figsize=(10, 8))
sns.heatmap(cm_test, annot=True, fmt='d', cmap='Blues', cbar=True,
            xticklabels=class_names,
            yticklabels=class_names)
plt.title(f'Confusion Matrix - {FEATURE_METHOD.upper()} Features', fontsize=14, fontweight='bold')
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.tight_layout()
plt.show()

print('\nInterpretation:')
print('  - Diagonal: Correct predictions')
print('  - Off-diagonal: Misclassifications')
print('  - Look for: nv ↔ mel confusion (benign vs malignant)')

In [None]:
# Normalized confusion matrix
cm_normalized = cm_test.astype('float') / cm_test.sum(axis=1)[:, np.newaxis]

plt.figure(figsize=(10, 8))
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', cbar=True,
            xticklabels=class_names,
            yticklabels=class_names,
            vmin=0, vmax=1)
plt.title(f'Normalized Confusion Matrix - {FEATURE_METHOD.upper()} Features', fontsize=14, fontweight='bold')
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.tight_layout()
plt.show()

print('Values show proportion of true class predicted as each class (row-wise normalization)')
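
After row-wise normalization, the diagonal of the matrix is the per-class recall. The sketch below shows this on a hypothetical 3-class matrix and guards against division by zero for a class with no true samples, which the cell above does not handle. The helper `normalize_rows` is illustrative.

```python
import numpy as np


def normalize_rows(cm):
    """Divide each row by its sum; all-zero rows stay zero."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)


# Hypothetical counts: rows are true classes, columns are predictions
cm_demo = np.array([
    [8, 2, 0],
    [1, 3, 0],
    [0, 0, 0],  # class with no true samples in this split
])

cm_norm = normalize_rows(cm_demo)
print(np.round(cm_norm, 2))
print('Per-class recall:', np.round(np.diag(cm_norm), 2))
```

With the stratified split used in this notebook every class should appear in the test set, so the guard matters mainly for small subsets or custom splits.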

In [None]:
# Classification report
print('=' * 70)
print('CLASSIFICATION REPORT')
print('=' * 70)

report = classification_report(
    y_test,
    y_test_pred,
    target_names=class_names,
    digits=4
)
print(report)

# Save classification report
report_dict = classification_report(
    y_test,
    y_test_pred,
    target_names=class_names,
    output_dict=True
)
report_df = pd.DataFrame(report_dict).transpose()

print('\nPer-Class Performance Summary:')
for i, class_name in enumerate(class_names):
    support = (y_test == i).sum()
    if class_name in report_df.index:
        f1 = report_df.loc[class_name, 'f1-score']
        recall = report_df.loc[class_name, 'recall']
        full_name = DX_DICT[class_name]
        print(f'  {class_name}: {full_name:30s} F1={f1:.3f}, Recall={recall:.3f} (n={support})')

In [None]:
# Visualize per-class metrics
metrics_df = report_df[report_df.index.isin(class_names)][['precision', 'recall', 'f1-score']]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, metric in enumerate(['precision', 'recall', 'f1-score']):
    axes[i].bar(range(len(class_names)), metrics_df[metric], color='steelblue')
    axes[i].set_xticks(range(len(class_names)))
    axes[i].set_xticklabels(class_names, rotation=45)
    axes[i].set_ylabel(metric.capitalize(), fontsize=12)
    axes[i].set_title(f'{metric.capitalize()} by Class', fontsize=13, fontweight='bold')
    axes[i].set_ylim([0, 1])
    axes[i].grid(axis='y', alpha=0.3)
    axes[i].axhline(y=metrics_df[metric].mean(), color='red', linestyle='--', 
                    label=f'Mean: {metrics_df[metric].mean():.3f}')
    axes[i].legend()

plt.suptitle(f'Per-Class Performance Metrics - {FEATURE_METHOD.upper()}', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
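
The red dashed line in each panel is the unweighted mean across classes, that is, the macro average. On an imbalanced dataset it can differ sharply from the support-weighted average. The sketch below derives per-class F1 from a hypothetical 2-class confusion matrix and compares the two averages. The helper `per_class_f1` is illustrative.

```python
import numpy as np


def per_class_f1(cm):
    """Per-class F1 from a confusion matrix (rows: true, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: samples predicted as each class
    actual = cm.sum(axis=1)     # row sums: true samples of each class
    denom = predicted + actual
    # F1 = 2TP / (2TP + FP + FN) = 2TP / (predicted + actual)
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)


# Hypothetical counts: a dominant class and a rare class
cm_demo = np.array([
    [90, 10],
    [8, 2],
])

f1 = per_class_f1(cm_demo)
support = cm_demo.sum(axis=1)
macro_f1 = f1.mean()
weighted_f1 = np.average(f1, weights=support)

print('Per-class F1:', np.round(f1, 3))
print(f'Macro F1: {macro_f1:.3f}, Weighted F1: {weighted_f1:.3f}')
```

The macro average treats every class equally, so it is the more informative summary when rare classes such as df and vasc matter.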

## Step 5: Save Model and Artifacts

In [None]:
# Save model and preprocessing artifacts
save_dir = f'../models/{FEATURE_METHOD}'
os.makedirs(save_dir, exist_ok=True)
os.makedirs(f'../results/{FEATURE_METHOD}', exist_ok=True)

# Save SVM model
with open(f'{save_dir}/svm_multiclass_model_{FEATURE_METHOD}.pkl', 'wb') as f:
    pickle.dump(svm_model, f)

# Save scaler
with open(f'{save_dir}/scaler_multiclass_{FEATURE_METHOD}.pkl', 'wb') as f:
    pickle.dump(prep_data['scaler'], f)

# Save label encoder
with open(f'{save_dir}/label_encoder_multiclass_{FEATURE_METHOD}.pkl', 'wb') as f:
    pickle.dump(label_encoder, f)

# Save metrics
metrics = {
    'method': FEATURE_METHOD,
    'train_accuracy': train_accuracy,
    'test_accuracy': test_accuracy,
    'top2_accuracy': top2_accuracy,
    'confusion_matrix': cm_test,
    'preprocessing_time': preprocessing_time,
    'training_time': training_time,
    'feature_dim': prep_data['X_train_final'].shape[1],
    'feature_params': FEATURE_PARAMS,
    'class_names': class_names.tolist()
}

with open(f'{save_dir}/svm_multiclass_metrics_{FEATURE_METHOD}.pkl', 'wb') as f:
    pickle.dump(metrics, f)

# Save classification report
report_df.to_csv(f'../results/{FEATURE_METHOD}/multiclass_classification_report_{FEATURE_METHOD}.csv')

print('✓ Model and artifacts saved successfully!')
print(f'\nSaved files to: {save_dir}/')
print(f'  - svm_multiclass_model_{FEATURE_METHOD}.pkl')
print(f'  - scaler_multiclass_{FEATURE_METHOD}.pkl')
print(f'  - label_encoder_multiclass_{FEATURE_METHOD}.pkl')
print(f'  - svm_multiclass_metrics_{FEATURE_METHOD}.pkl')
print(f'\nResults saved to: ../results/{FEATURE_METHOD}/')
print(f'  - multiclass_classification_report_{FEATURE_METHOD}.csv')

## Summary

### Results

This notebook demonstrated multi-class classification with feature extraction:

**Method:** set by `FEATURE_METHOD` (HOG by default)
- ✓ 7 classes (nv, mel, bkl, bcc, akiec, vasc, df)
- ✓ Test accuracy: see `test_accuracy` in the Step 4 output
- ✓ Top-2 accuracy: see `top2_accuracy` in the Step 4 output
- ✓ Training time: see `training_time` in the Step 4 output

### Key Observations

1. **Class Imbalance**: nv dominates (~67%), so rare classes are harder to learn
2. **Best Performance**: nv (dominant class) typically has the highest F1-score
3. **Challenging Classes**: df and vasc (<2% of data each) are hard to classify
4. **Common Confusion**: nv ↔ mel (benign nevi vs malignant melanoma)

### Feature Extraction Benefits

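**Why it works:**
- Extracted features describe lesion structure (edges and texture for HOG, learned patterns for VGG16) instead of raw pixel values
- Lower dimensionality than raw pixels speeds up SVM training
- Transfer learning (VGG16) reuses representations learned on ImageNet
- Features are less sensitive than raw pixels to small shifts and lighting changes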

### Next Steps

1. **Try different methods**: Switch between HOG, VGG16, ResNet50
2. **Address imbalance**: Class weights and a stratified split are already in use; try resampling the training set (e.g. SMOTE)
3. **Tune hyperparameters**: Adjust SVM C, gamma
4. **Ensemble methods**: Combine multiple feature extractors
5. **Deploy**: Use saved artifacts for inference