# Lab 02: Advanced Malware Sample Clustering

Use unsupervised learning to cluster malware samples by behavior and identify families.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab02_malware_clustering.ipynb)

## Learning Objectives
- Comprehensive feature extraction from PE files (imports, sections, resources)
- Dynamic behavioral features (API sequences, registry, network, file operations)
- K-Means, DBSCAN, HDBSCAN, and hierarchical clustering
- Dimensionality reduction (PCA, t-SNE, UMAP)
- Cluster evaluation and malware family identification
- Threat intelligence enrichment

## Malware Families Covered

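This lab draws on the following malware categories. The synthetic dataset in Section 1 defines profiles for 12 of these families; the Loaders category is not simulated: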
- **Banking Trojans**: Emotet, TrickBot, Dridex, QakBot, IcedID
- **Ransomware**: LockBit, BlackCat, Conti, Royal, REvil
- **RATs**: Remcos, AsyncRAT, njRAT, Quasar
- **Loaders**: Bumblebee, GuLoader, SocGholish
- **Info Stealers**: RedLine, Raccoon, Vidar, LummaC2
- **APT Tools**: Cobalt Strike, Sliver, Havoc

In [None]:
# Install dependencies (uncomment for Colab)
# !pip install scikit-learn pandas numpy matplotlib seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
np.random.seed(42)

## 1. Load and Explore Malware Features

In [None]:
# Synthetic malware feature dataset generated from per-family profiles
np.random.seed(42)

# Malware family profiles with realistic characteristics
MALWARE_PROFILES = {
    # Banking Trojans
    'Emotet': {
        'category': 'banking_trojan',
        'entropy_range': (7.0, 7.8),
        'import_range': (200, 400),
        'section_range': (5, 8),
        'file_size_mean': 300000,
        'has_resources': True,
        'packing': 'custom',
        'network_behavior': True,
        'registry_mods': True,
        'process_injection': True,
        'key_apis': ['CreateRemoteThread', 'VirtualAllocEx', 'WriteProcessMemory'],
    },
    'TrickBot': {
        'category': 'banking_trojan',
        'entropy_range': (6.8, 7.5),
        'import_range': (150, 350),
        'section_range': (4, 7),
        'file_size_mean': 500000,
        'has_resources': True,
        'packing': 'custom',
        'network_behavior': True,
        'registry_mods': True,
        'process_injection': True,
        'key_apis': ['HttpSendRequest', 'InternetConnect', 'CryptEncrypt'],
    },
    'Dridex': {
        'category': 'banking_trojan',
        'entropy_range': (7.2, 7.9),
        'import_range': (180, 320),
        'section_range': (5, 7),
        'file_size_mean': 250000,
        'has_resources': False,
        'packing': 'custom',
        'network_behavior': True,
        'registry_mods': True,
        'process_injection': True,
        'key_apis': ['NtMapViewOfSection', 'NtUnmapViewOfSection', 'RegSetValueEx'],
    },
    'QakBot': {
        'category': 'banking_trojan',
        'entropy_range': (7.1, 7.7),
        'import_range': (120, 280),
        'section_range': (4, 6),
        'file_size_mean': 400000,
        'has_resources': True,
        'packing': 'upx',
        'network_behavior': True,
        'registry_mods': True,
        'process_injection': True,
        'key_apis': ['CreateProcessA', 'VirtualAllocEx', 'GetProcAddress'],
    },
    
    # Ransomware
    'LockBit': {
        'category': 'ransomware',
        'entropy_range': (6.5, 7.5),
        'import_range': (100, 250),
        'section_range': (4, 6),
        'file_size_mean': 180000,
        'has_resources': False,
        'packing': 'none',
        'network_behavior': False,
        'registry_mods': True,
        'process_injection': False,
        'file_encryption': True,
        'shadow_copy_del': True,
        'key_apis': ['CryptEncrypt', 'FindFirstFile', 'MoveFileEx', 'DeleteFileW'],
    },
    'BlackCat': {
        'category': 'ransomware',
        'entropy_range': (6.8, 7.6),
        'import_range': (80, 200),
        'section_range': (3, 5),
        'file_size_mean': 2500000,  # Rust binary = larger
        'has_resources': False,
        'packing': 'none',
        'network_behavior': True,
        'registry_mods': True,
        'process_injection': False,
        'file_encryption': True,
        'shadow_copy_del': True,
        'key_apis': ['BCryptEncrypt', 'FindFirstFile', 'CreateThread'],
    },
    'Conti': {
        'category': 'ransomware',
        'entropy_range': (6.6, 7.4),
        'import_range': (90, 220),
        'section_range': (4, 6),
        'file_size_mean': 200000,
        'has_resources': False,
        'packing': 'none',
        'network_behavior': True,  # Exfiltration
        'registry_mods': True,
        'process_injection': False,
        'file_encryption': True,
        'shadow_copy_del': True,
        'key_apis': ['ChaCha20', 'FindFirstFileW', 'GetLogicalDrives'],
    },
    
    # RATs
    'Remcos': {
        'category': 'rat',
        'entropy_range': (6.0, 7.2),
        'import_range': (100, 250),
        'section_range': (5, 8),
        'file_size_mean': 600000,
        'has_resources': True,
        'packing': 'custom',
        'network_behavior': True,
        'registry_mods': True,
        'process_injection': True,
        'keylogging': True,
        'key_apis': ['GetAsyncKeyState', 'SetWindowsHookEx', 'recv', 'send'],
    },
    'AsyncRAT': {
        'category': 'rat',
        'entropy_range': (5.5, 6.8),
        'import_range': (150, 300),
        'section_range': (4, 6),
        'file_size_mean': 45000,  # .NET = smaller
        'has_resources': True,
        'packing': 'none',
        'network_behavior': True,
        'registry_mods': True,
        'process_injection': False,
        'keylogging': True,
        'dotnet': True,
        'key_apis': ['Socket', 'TcpClient', 'WebClient'],
    },
    
    # Info Stealers
    'RedLine': {
        'category': 'stealer',
        'entropy_range': (5.8, 6.9),
        'import_range': (80, 200),
        'section_range': (4, 6),
        'file_size_mean': 150000,
        'has_resources': True,
        'packing': 'none',
        'network_behavior': True,
        'registry_mods': False,
        'browser_theft': True,
        'crypto_theft': True,
        'dotnet': True,
        'key_apis': ['CryptUnprotectData', 'SQLite', 'HttpWebRequest'],
    },
    'Raccoon': {
        'category': 'stealer',
        'entropy_range': (6.2, 7.1),
        'import_range': (100, 220),
        'section_range': (5, 7),
        'file_size_mean': 250000,
        'has_resources': True,
        'packing': 'custom',
        'network_behavior': True,
        'registry_mods': False,
        'browser_theft': True,
        'crypto_theft': True,
        'key_apis': ['InternetReadFile', 'CryptUnprotectData', 'RegEnumKeyEx'],
    },
    
    # APT Tools
    'CobaltStrike': {
        'category': 'apt_tool',
        'entropy_range': (7.0, 7.95),
        'import_range': (30, 80),  # Reflective loading = few imports
        'section_range': (3, 5),
        'file_size_mean': 300000,
        'has_resources': False,
        'packing': 'custom',
        'network_behavior': True,
        'registry_mods': True,
        'process_injection': True,
        'reflective_loading': True,
        'key_apis': ['VirtualAlloc', 'CreateThread', 'RtlMoveMemory'],
    },
}

def generate_malware_samples(num_samples: int = 500) -> pd.DataFrame:
    """Generate synthetic malware sample features from MALWARE_PROFILES."""
    samples = []
    families = list(MALWARE_PROFILES.keys())
    
    for i in range(num_samples):
        family = families[i % len(families)]  # Rotate through families
        profile = MALWARE_PROFILES[family]
        
        # Generate features based on profile
        entropy = np.random.uniform(*profile['entropy_range'])
        num_imports = np.random.randint(*profile['import_range'])
        num_sections = np.random.randint(*profile['section_range'])
        file_size = int(np.random.lognormal(np.log(profile['file_size_mean']), 0.3))
        
        # Behavioral features
        sample = {
            'sha256': f'sample_{i:04d}_{family.lower()[:3]}',
            'family': family,
            'category': profile['category'],
            'file_size': file_size,
            'entropy': entropy,
            'num_imports': num_imports,
            'num_sections': num_sections,
            'has_debug': np.random.choice([0, 1], p=[0.85, 0.15]),
            'has_signature': np.random.choice([0, 1], p=[0.95, 0.05]),
            'has_resources': 1 if profile.get('has_resources') else 0,
            'is_packed': 1 if profile.get('packing') != 'none' else 0,
            'is_dotnet': 1 if profile.get('dotnet') else 0,
            'network_behavior': 1 if profile.get('network_behavior') else 0,
            'registry_mods': 1 if profile.get('registry_mods') else 0,
            'process_injection': 1 if profile.get('process_injection') else 0,
            'file_encryption': 1 if profile.get('file_encryption') else 0,
            'keylogging': 1 if profile.get('keylogging') else 0,
            'browser_theft': 1 if profile.get('browser_theft') else 0,
            'reflective_loading': 1 if profile.get('reflective_loading') else 0,
        }
        
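        # Add noise, keeping entropy within the valid 0-8 bits/byte range
        sample['entropy'] = float(np.clip(sample['entropy'] + np.random.normal(0, 0.1), 0.0, 8.0))
        sample['num_imports'] += np.random.randint(-20, 21)  # upper bound is exclusive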
        
        samples.append(sample)
    
    return pd.DataFrame(samples)

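# Generate the dataset
df = generate_malware_samples(num_samples=500)

# Family names at notebook scope; the plotting cells below iterate over this list
families = list(MALWARE_PROFILES.keys())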

print(f"Generated {len(df)} malware samples")
print("\nFamily distribution:")
print(df['family'].value_counts())
print("\nCategory distribution:")
print(df['category'].value_counts())

# Show sample
print("\nSample features:")
print(df[['sha256', 'family', 'entropy', 'num_imports', 'file_size']].head(10))

In [None]:
# Visualize feature distributions by family
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for ax, feature in zip(axes.flatten(), ['entropy', 'num_imports', 'file_size', 'num_sections']):
    for family in families:
        subset = df[df['family'] == family][feature]
        ax.hist(subset, alpha=0.5, label=family, bins=20)
    ax.set_xlabel(feature)
    ax.set_ylabel('Count')
    ax.legend()
    ax.set_title(f'{feature} Distribution by Family')

plt.tight_layout()
plt.show()

## 2. Feature Engineering

In [None]:
# Prepare features for clustering
feature_cols = ['entropy', 'num_imports', 'num_sections', 'has_debug', 'has_signature']

# Log transform file_size (highly skewed)
df['log_file_size'] = np.log1p(df['file_size'])
feature_cols.append('log_file_size')

# Create feature matrix
X = df[feature_cols].values

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Feature matrix shape: {X_scaled.shape}")
print(f"Features: {feature_cols}")
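
One design choice worth weighing in the cell above is that `StandardScaler` is applied to the binary flags (`has_debug`, `has_signature`) as well as the continuous columns. Because `has_signature` is set in only about 5% of samples, standardizing maps its rare 1s to values above 4, which can dominate Euclidean distances. A minimal sketch of one alternative, assuming a small hypothetical frame `demo` that mirrors the notebook's column names with made-up values, scales only the continuous features and passes the flags through unchanged:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200

# Hypothetical stand-in for the notebook's df (same column names, made-up values)
demo = pd.DataFrame({
    'entropy': rng.uniform(5.5, 8.0, n),
    'num_imports': rng.integers(30, 400, n),
    'log_file_size': np.log1p(rng.lognormal(np.log(300_000), 0.3, n)),
    'has_debug': rng.choice([0, 1], size=n, p=[0.85, 0.15]),
    'has_signature': rng.choice([0, 1], size=n, p=[0.95, 0.05]),
})

continuous_cols = ['entropy', 'num_imports', 'log_file_size']
binary_cols = ['has_debug', 'has_signature']

preprocess = ColumnTransformer([
    ('scale', StandardScaler(), continuous_cols),  # zero mean, unit variance
    ('flags', 'passthrough', binary_cols),         # keep 0/1 values as-is
])
X_mixed = preprocess.fit_transform(demo)

print(f"Shape: {X_mixed.shape}")
print(f"Means of scaled columns: {X_mixed[:, :3].mean(axis=0).round(3)}")
print(f"Values in flag columns: {np.unique(X_mixed[:, 3:])}")
```

Whether this helps depends on the data; it is worth comparing silhouette scores with and without scaled flags before committing to either.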

## 3. Dimensionality Reduction

In [None]:
# PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.2%}")

# t-SNE: non-linear embedding that often separates local groups more clearly than PCA
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled)
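
Two components are convenient for plotting, but they may discard a large share of the variance. The sketch below uses a hypothetical `X_demo` generated with `make_blobs` rather than the lab's data, and shows one way to check how many components are needed to reach a chosen threshold. The 95% threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical 6-feature dataset standing in for X_scaled
X_demo, _ = make_blobs(n_samples=300, n_features=6, centers=4, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit all components and accumulate the explained variance ratio
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.95
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"Cumulative variance: {cumulative.round(3)}")
print(f"Components needed for 95%: {n_keep}")

# scikit-learn can make the same choice when n_components is a float in (0, 1)
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_demo)
print(f"PCA(n_components=0.95) kept: {pca_95.n_components_}")
```

If the 2D projection explains little variance, clusters that overlap in the PCA plot may still be well separated in the full feature space.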

In [None]:
# Visualize with true labels
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# PCA plot
for family in families:
    mask = df['family'] == family
    axes[0].scatter(X_pca[mask, 0], X_pca[mask, 1], label=family, alpha=0.7)
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
axes[0].set_title('PCA Projection')
axes[0].legend()

# t-SNE plot
for family in families:
    mask = df['family'] == family
    axes[1].scatter(X_tsne[mask, 0], X_tsne[mask, 1], label=family, alpha=0.7)
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')
axes[1].set_title('t-SNE Projection')
axes[1].legend()

plt.tight_layout()
plt.show()

## 4. Clustering with K-Means

In [None]:
# Find optimal k using elbow method and silhouette score
k_range = range(2, 11)
inertias = []
silhouettes = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(k_range, inertias, 'bo-')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')

axes[1].plot(k_range, silhouettes, 'go-')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score')

plt.tight_layout()
plt.show()

optimal_k = k_range[np.argmax(silhouettes)]
print(f"Optimal k based on silhouette score: {optimal_k}")

In [None]:
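# Apply K-Means with k=5, the number of malware categories in MALWARE_PROFILES
# (the silhouette-optimal k printed above may differ)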
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df['kmeans_cluster'] = kmeans.fit_predict(X_scaled)

print("K-Means Cluster Distribution:")
print(df['kmeans_cluster'].value_counts().sort_index())

## 5. Clustering with DBSCAN

In [None]:
# DBSCAN clustering
dbscan = DBSCAN(eps=0.8, min_samples=5)
df['dbscan_cluster'] = dbscan.fit_predict(X_scaled)

print("DBSCAN Cluster Distribution:")
print(df['dbscan_cluster'].value_counts().sort_index())
print(f"\nNoise points (label=-1): {(df['dbscan_cluster'] == -1).sum()}")
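
The `eps=0.8` value above is fixed by hand. A common heuristic for choosing it is the k-distance curve: sort every point's distance to its k-th nearest neighbor and look for the knee. The sketch below assumes a hypothetical `X_demo` from `make_blobs` and uses the 90th percentile as a crude stand-in for reading the knee off a plot; both are assumptions, not part of the lab.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for X_scaled
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

min_samples = 5

# Querying the training data returns each point as its own nearest neighbor
# (distance 0), which matches DBSCAN counting the point itself toward min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])  # sorted k-distance curve

# Crude knee substitute: the 90th percentile of the k-distances
eps_candidate = float(np.percentile(k_dist, 90))

labels = DBSCAN(eps=eps_candidate, min_samples=min_samples).fit_predict(X_demo)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print(f"Candidate eps: {eps_candidate:.3f}")
print(f"Clusters found: {n_clusters}, noise points: {(labels == -1).sum()}")
```

Plotting `k_dist` with `plt.plot(k_dist)` and picking the bend by eye is usually more reliable than a fixed percentile.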

## 6. Evaluate Clustering Results

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode true labels
le = LabelEncoder()
true_labels = le.fit_transform(df['family'])

# Calculate metrics
kmeans_silhouette = silhouette_score(X_scaled, df['kmeans_cluster'])
kmeans_ari = adjusted_rand_score(true_labels, df['kmeans_cluster'])

# DBSCAN (excluding noise); silhouette_score needs at least 2 distinct clusters
dbscan_mask = (df['dbscan_cluster'] != -1).values
dbscan_labels = df.loc[dbscan_mask, 'dbscan_cluster']
if dbscan_labels.nunique() > 1:
    dbscan_silhouette = silhouette_score(X_scaled[dbscan_mask], dbscan_labels)
    dbscan_ari = adjusted_rand_score(true_labels[dbscan_mask], dbscan_labels)
else:
    # Metrics are undefined when DBSCAN yields fewer than 2 clusters
    dbscan_silhouette = float('nan')
    dbscan_ari = float('nan')

print("Clustering Evaluation:")
print("=" * 40)
print(f"K-Means Silhouette Score: {kmeans_silhouette:.3f}")
print(f"K-Means Adjusted Rand Index: {kmeans_ari:.3f}")
print(f"\nDBSCAN Silhouette Score: {dbscan_silhouette:.3f}")
print(f"DBSCAN Adjusted Rand Index: {dbscan_ari:.3f}")
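
To help read the ARI values printed above, here is a small worked example with hand-made labels (hypothetical, not taken from the lab's data). It illustrates that ARI ignores how clusters are numbered, and that homogeneity and completeness show which kind of error a clustering makes.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
)

# Six hypothetical samples from two families
true_labels_demo = [0, 0, 0, 1, 1, 1]

# Same grouping, different cluster IDs: ARI ignores label names
relabeled = [1, 1, 1, 0, 0, 0]

# Everything merged into one cluster
merged = [0, 0, 0, 0, 0, 0]

# One sample assigned to the wrong group
one_error = [0, 0, 1, 1, 1, 1]

for name, pred in [('relabeled', relabeled), ('merged', merged), ('one_error', one_error)]:
    ari = adjusted_rand_score(true_labels_demo, pred)
    hom = homogeneity_score(true_labels_demo, pred)
    com = completeness_score(true_labels_demo, pred)
    print(f"{name:>10}: ARI={ari:.3f}  homogeneity={hom:.3f}  completeness={com:.3f}")
```

A merged clustering is fully complete (each family sits in one cluster) but not homogeneous, which is the pattern to expect when k is smaller than the number of families.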

In [None]:
# Visualize clustering results
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# True labels
for family in families:
    mask = df['family'] == family
    axes[0].scatter(X_tsne[mask, 0], X_tsne[mask, 1], label=family, alpha=0.7)
axes[0].set_title('True Malware Families')
axes[0].legend()

# K-Means clusters
scatter = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=df['kmeans_cluster'], cmap='viridis', alpha=0.7)
axes[1].set_title('K-Means Clusters (k=5)')
plt.colorbar(scatter, ax=axes[1])

# DBSCAN clusters
scatter = axes[2].scatter(X_tsne[:, 0], X_tsne[:, 1], c=df['dbscan_cluster'], cmap='viridis', alpha=0.7)
axes[2].set_title('DBSCAN Clusters')
plt.colorbar(scatter, ax=axes[2])

plt.tight_layout()
plt.show()

## 7. Cluster Analysis

In [None]:
# Analyze cluster composition
print("Cluster Composition (K-Means):")
print("=" * 50)

for cluster_id in sorted(df['kmeans_cluster'].unique()):
    cluster_data = df[df['kmeans_cluster'] == cluster_id]
    print(f"\nCluster {cluster_id} ({len(cluster_data)} samples):")
    print(cluster_data['family'].value_counts().to_string())
    print(f"  Avg Entropy: {cluster_data['entropy'].mean():.2f}")
    print(f"  Avg Imports: {cluster_data['num_imports'].mean():.0f}")
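
The per-cluster printout above can be condensed into a contingency table. The sketch below uses a tiny hand-made frame (the `demo` values are hypothetical) to show how `pd.crosstab` gives the dominant family and a purity score for each cluster.

```python
import pandas as pd

# Hypothetical assignments: 10 samples, 3 clusters
demo = pd.DataFrame({
    'family': ['Emotet', 'Emotet', 'Emotet', 'TrickBot',
               'LockBit', 'LockBit', 'LockBit',
               'Conti', 'Conti', 'LockBit'],
    'cluster': [0, 0, 0, 0,
                1, 1, 1,
                2, 2, 2],
})

# Rows are clusters, columns are families, cells are sample counts
ct = pd.crosstab(demo['cluster'], demo['family'])

dominant = ct.idxmax(axis=1)                          # most common family per cluster
purity_per_cluster = ct.max(axis=1) / ct.sum(axis=1)  # share held by that family
overall_purity = ct.max(axis=1).sum() / len(demo)     # weighted average purity

print(ct)
print(pd.DataFrame({'dominant': dominant, 'purity': purity_per_cluster.round(2)}))
print(f"Overall purity: {overall_purity:.2f}")
```

Purity rises trivially as the number of clusters grows, so it is best read alongside ARI rather than on its own.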

## Summary

In this lab, we:
- Generated synthetic static features for malware samples (entropy, imports, sections, file size)
- Applied dimensionality reduction (PCA, t-SNE) for visualization
- Clustered samples using K-Means and DBSCAN
- Evaluated clustering quality with silhouette score and ARI

### Key Insights:
- **High entropy** often indicates packed/encrypted malware
- **Import patterns** can distinguish malware families
- **t-SNE** often separates local groups more clearly than PCA, but distances between clusters in a t-SNE plot are not meaningful
- **DBSCAN** can identify outliers (noise points)
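
The entropy insight can be made concrete. The following is a minimal sketch, assuming the `entropy` feature is byte-level Shannon entropy (the 0 to 8 range used in the profiles suggests this) and using made-up byte strings rather than real samples:

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Illustrative inputs only
padding = b"\x00" * 512                                      # a single repeated byte
text_like = b"This program cannot be run in DOS mode. " * 16  # repetitive ASCII
uniform = bytes(range(256)) * 4                              # every byte value equally often

for name, blob in [('padding', padding), ('text_like', text_like), ('uniform', uniform)]:
    print(f"{name:>10}: {shannon_entropy(blob):.2f} bits/byte")
```

Compressed or encrypted sections approach the uniform case, which is why values near 8 are a common packing indicator. Legitimate compressed resources score high too, so entropy alone is not conclusive.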

### Next Steps:
1. Add more features (strings, API calls, PE headers)
2. Try hierarchical clustering for dendrogram visualization
3. Build a classification model using cluster labels
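
As a starting point for step 2, the sketch below runs Ward-linkage hierarchical clustering in two ways: with `AgglomerativeClustering` (already imported at the top of this notebook) and with SciPy's `linkage`, whose output can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting. The data is a hypothetical `make_blobs` set, and the choice of 3 clusters is an assumption made for the demo.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for X_scaled
X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

# scikit-learn: flat labels for a fixed number of clusters
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
agg_labels = agg.fit_predict(X_demo)

# SciPy: full merge tree; each row of Z records one merge
Z = linkage(X_demo, method='ward')

# Cut the tree into at most 3 flat clusters (SciPy labels start at 1)
scipy_labels = fcluster(Z, t=3, criterion='maxclust')

print(f"Linkage matrix shape: {Z.shape}")
print(f"sklearn cluster sizes: {np.bincount(agg_labels)}")
print(f"SciPy cluster sizes:   {np.bincount(scipy_labels)[1:]}")
```

The linkage matrix is what makes this approach attractive for malware families: cutting the same tree at different heights gives coarse categories or finer family-level groups without refitting.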