
# Topic 8: Spectral Clustering + KDDCup 99

## Objectives
Apply **Spectral Clustering** to the **KDDCup99** dataset: prepare the data,
define relevant metrics, and compare against a simple baseline.

Fixed seed: **42**


In [1]:

# Setup & versions
import numpy as np
import pandas as pd
import random
import warnings
warnings.filterwarnings("ignore")

SEED = 42
np.random.seed(SEED)
random.seed(SEED)

import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

print("scikit-learn version:", sklearn.__version__)


scikit-learn version: 1.7.2



## 1. Loading the dataset (KDDCup99)
We use the `SA` subset of the 10% sample to reduce memory cost.


In [2]:

from sklearn.datasets import fetch_kddcup99

data = fetch_kddcup99(
    subset="SA",
    percent10=True,
    as_frame=True,
    random_state=SEED
)

X = data.data
y = data.target

print("Dimensions:", X.shape)
print("Class distribution:")
print(y.value_counts())
X.head()


Dimensions: (100655, 41)
Class distribution:
labels
b'normal.'             97278
b'smurf.'               2409
b'neptune.'              898
b'back.'                  15
b'satan.'                 15
b'ipsweep.'               10
b'teardrop.'               9
b'portsweep.'              8
b'warezclient.'            8
b'pod.'                    3
b'buffer_overflow.'        1
b'land.'                   1
Name: count, dtype: int64


Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0,b'tcp',b'http',b'SF',181,5450,0,0,0,0,...,9,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0
1,0,b'tcp',b'http',b'SF',239,486,0,0,0,0,...,19,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0
2,0,b'tcp',b'http',b'SF',235,1337,0,0,0,0,...,29,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0
3,0,b'tcp',b'http',b'SF',219,1337,0,0,0,0,...,39,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0
4,0,b'tcp',b'http',b'SF',217,2032,0,0,0,0,...,49,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0
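
The labels and the categorical columns come back as byte strings (for example `b'normal.'`). The following is a minimal sketch of decoding them and deriving a binary normal/attack flag. It uses a small hand-made series in place of `y`, and the names `y_demo`, `y_str` and `is_attack` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for `y` as returned by fetch_kddcup99 (byte-string labels).
y_demo = pd.Series([b"normal.", b"smurf.", b"normal.", b"neptune."], name="labels")

# Decode bytes to str and drop the trailing dot.
y_str = y_demo.str.decode("utf-8").str.rstrip(".")

# Binary view: True for any attack, False for normal traffic.
is_attack = y_str != "normal"

print(y_str.tolist())
print(f"attack rate: {is_attack.mean():.2%}")
```

On the real data the same two lines would apply to `y` directly. In the distribution printed above, `normal.` accounts for 97278 of 100655 rows, so any binary view is heavily imbalanced.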



## 2. Quick exploration


In [3]:

print("Missing values:", X.isna().sum().sum())
print("Column dtypes:")
print(X.dtypes.value_counts())


Missing values: 0
Column dtypes:
object    41
Name: count, dtype: int64
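
Since all 41 columns are reported as `object`, the numeric features need an explicit cast before they behave as numbers. The sketch below shows one possible way to separate the two kinds of columns, on a toy frame. It assumes the truly categorical columns are `protocol_type`, `service` and `flag` (only the first appears in the toy data). The names `X_demo`, `cat_cols`, `num_cols` and `preprocess` are hypothetical. This is an alternative to consider, not what the notebook does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame mimicking the KDDCup99 layout: every column has object dtype.
X_demo = pd.DataFrame(
    {
        "duration": [0, 0, 2, 5],
        "protocol_type": [b"tcp", b"udp", b"tcp", b"icmp"],
        "src_bytes": [181, 239, 235, 0],
    },
    dtype=object,
)

# Assumption: the categorical columns are known by name.
cat_cols = ["protocol_type"]
num_cols = [c for c in X_demo.columns if c not in cat_cols]

# Cast the numeric columns away from object dtype.
X_demo = X_demo.astype({c: "float64" for c in num_cols})

# Scale numeric columns, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
X_prepared = preprocess.fit_transform(X_demo)
print(X_prepared.shape)
```

One-hot encoding avoids imposing an arbitrary order on categories such as `tcp`/`udp`/`icmp`, at the cost of more columns when a feature has many distinct values.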



## 3. Preprocessing
- Encoding of categorical variables
- Standardization
- PCA to reduce dimensionality (lowers the cost of building the neighbor graph for Spectral Clustering)


In [4]:

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA

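X_enc = X.copy()

# Every column has object dtype (see section 2), so this loop label-encodes
# all 41 columns, the numeric ones included, not only the categorical ones.
for col in X_enc.select_dtypes(include=["object"]).columns:
    le = LabelEncoder()
    X_enc[col] = le.fit_transform(X_enc[col])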

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_enc)

pca = PCA(n_components=20, random_state=SEED)
X_pca = pca.fit_transform(X_scaled)

print("Shape after PCA:", X_pca.shape)


Shape after PCA: (100655, 20)
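
The notebook fixes `n_components=20` without reporting how much variance that keeps. A quick check could look like the sketch below. It runs on synthetic data standing in for `X_scaled`, so the printed numbers say nothing about KDDCup99. The names `X_demo`, `pca_demo`, `cum_var` and `n_95` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for X_scaled: 500 samples, 41 features
# generated from 5 latent factors plus a little noise.
latent = rng.normal(size=(500, 5))
X_demo = latent @ rng.normal(size=(5, 41)) + 0.1 * rng.normal(size=(500, 41))

pca_demo = PCA(n_components=20, random_state=42).fit(X_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
# (assumes the threshold is reached within the fitted components).
n_95 = int(np.searchsorted(cum_var, 0.95) + 1)

print(f"variance kept by 20 components: {cum_var[-1]:.3f}")
print(f"components needed for 95%: {n_95}")
```

Applied to the fitted `pca` object, `pca.explained_variance_ratio_` would give the same diagnostic for the real data.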



## 4. Baseline model: KMeans


In [5]:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Number of distinct labels (12 here); the labels are used only to choose k.
k = len(np.unique(y))

kmeans = KMeans(n_clusters=k, random_state=SEED)
labels_km = kmeans.fit_predict(X_pca)

print("KMeans Silhouette:", silhouette_score(X_pca, labels_km))
print("KMeans Davies-Bouldin:", davies_bouldin_score(X_pca, labels_km))


KMeans Silhouette: 0.31242361851091416
KMeans Davies-Bouldin: 0.6757259881516292
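
The silhouette score compares all pairs of points, which gets expensive on roughly 100,000 rows. `silhouette_score` accepts a `sample_size` argument that scores a random subset instead. Below is a hedged sketch on synthetic blobs standing in for `X_pca`. The sample size of 1000 is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_pca.
X_demo, _ = make_blobs(n_samples=3000, centers=4, n_features=20, random_state=42)
labels_demo = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo)

# Score on a random sample of 1000 points rather than on every point.
sil_sampled = silhouette_score(
    X_demo, labels_demo, sample_size=1000, random_state=42
)
print(f"sampled silhouette: {sil_sampled:.3f}")
```

The sampled value is an estimate, so fixing `random_state` keeps it reproducible across runs.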



## 5. Spectral Clustering


In [6]:

from sklearn.cluster import SpectralClustering

spectral = SpectralClustering(
    n_clusters=k,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=SEED
)

labels_sc = spectral.fit_predict(X_pca)

print("Spectral Silhouette:", silhouette_score(X_pca, labels_sc))
print("Spectral Davies-Bouldin:", davies_bouldin_score(X_pca, labels_sc))


KeyboardInterrupt: 
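
The run above was interrupted on the full 100,655-row matrix. One common workaround is to fit Spectral Clustering on a random subsample. The sketch below illustrates the idea on synthetic blobs standing in for `X_pca`. The subsample size `n_sub = 1000` and the names `X_sub`, `spectral_sub` and `labels_sub` are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

SEED = 42

# Synthetic stand-in for X_pca.
X_demo, _ = make_blobs(n_samples=5000, centers=4, n_features=20, random_state=SEED)

# Hypothetical subsample size; tune to the available memory and time.
n_sub = 1000
rng = np.random.default_rng(SEED)
idx = rng.choice(X_demo.shape[0], size=n_sub, replace=False)
X_sub = X_demo[idx]

spectral_sub = SpectralClustering(
    n_clusters=4,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=SEED,
)
labels_sub = spectral_sub.fit_predict(X_sub)

# Score on the same rows that were clustered.
print("Spectral silhouette (subsample):", silhouette_score(X_sub, labels_sub))
```

Any comparison with KMeans should then score both models on the same subsample, since `labels_sub` only covers the rows in `idx`.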


## 6. Model comparison


In [None]:

results = pd.DataFrame({
    "Model": ["KMeans", "Spectral Clustering"],
    "Silhouette": [
        silhouette_score(X_pca, labels_km),
        silhouette_score(X_pca, labels_sc)
    ],
    "Davies-Bouldin": [
        davies_bouldin_score(X_pca, labels_km),
        davies_bouldin_score(X_pca, labels_sc)
    ]
})

results



## 7. Discussion & limitations
- Spectral clustering is costly in memory and compute: on the full 100,655-row matrix, the run in section 5 was interrupted before it finished
- The result is sensitive to the `n_neighbors` parameter
- PCA can discard information
- The true labels are used only to set k


## 8. Colab packaging
Executable notebook, fixed seed, comments included.
