
# KDD Cup 1999 - Anomaly Detection (Full Project)

This notebook presents an end-to-end workflow for **anomaly (attack) detection** on the KDD Cup 1999 dataset:
1. **Data loading** (CSV)
2. **Preprocessing** (one-hot encoding + standardization, train/test split)
3. **Models**: Logistic Regression, Random Forest, Decision Tree, linear (SGDClassifier), *(optional)* XGBoost
4. **Hyperparameter optimization** (GridSearchCV)
5. **Evaluation** (ROC-AUC, precision-recall curve, F1, confusion matrix)
6. **Decision boundary** visualization on **PCA(2)** (with an SVM)
7. **K-Means**: elbow (WCSS) + **silhouette** score
8. Example **feature selection** with **SelectKBest (ANOVA F)**

> Note: This is a **classification** problem (Normal vs Attack). Linear/multiple regression (continuous target) is **out of scope**.


In [None]:

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    roc_auc_score, roc_curve, precision_recall_curve,
    average_precision_score, f1_score, confusion_matrix,
    classification_report
)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# XGBoost is optional
try:
    from xgboost import XGBClassifier
    XGB_AVAILABLE = True
except Exception:
    XGB_AVAILABLE = False
    print("XGBoost not found. To install: pip install xgboost")

from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.feature_selection import SelectKBest, f_classif

import joblib

plt.rcParams['figure.figsize'] = (8, 6)
RANDOM_STATE = 42



## I. Loading the Dataset

In this section the **KDD Cup 1999** dataset is read from CSV with `pandas`.  
The label column (`label`) is converted to **binary**: `normal` → 0, all attack types → 1.
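
As a minimal sketch of that mapping, the helper below treats any label other than `normal` as an attack. The name `to_binary_label` is hypothetical and not part of the notebook, and the code assumes raw labels may carry a trailing dot, such as `normal.`.

```python
def to_binary_label(label: str) -> int:
    """Return 0 for normal traffic and 1 for any attack label."""
    # Assumption: raw labels may end with a dot ("normal."), so strip it first.
    cleaned = label.strip().rstrip(".").lower()
    return 0 if cleaned == "normal" else 1


# Made-up sample of label strings for illustration.
sample_labels = ["normal.", "smurf.", "neptune.", "normal"]
binary = [to_binary_label(lbl) for lbl in sample_labels]
print(binary)
```

An exact comparison like this is stricter than a substring match, which would also accept any label that merely contains the word `normal`.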


In [None]:

# CSV file path (placing the file in the same folder is recommended)
DATA_PATH = "kdd99_10percent.csv"  # example name

# KDD 1999 feature names (41 + label)
KDD_COLUMNS = [
    "duration","protocol_type","service","flag","src_bytes","dst_bytes","land",
    "wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised",
    "root_shell","su_attempted","num_root","num_file_creations","num_shells",
    "num_access_files","num_outbound_cmds","is_host_login","is_guest_login",
    "count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate",
    "same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count",
    "dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate",
    "dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate",
    "label"
]

if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(
        f"Data file not found: {DATA_PATH}\n"
        "Place the CSV file in the folder and update DATA_PATH."
    )

df = pd.read_csv(DATA_PATH, header=None, names=KDD_COLUMNS)

# Binary target: 0 for labels containing 'normal', 1 otherwise
df['label_binary'] = np.where(df['label'].astype(str).str.contains('normal'), 0, 1)

print("Data shape:", df.shape)
print("Attack rate (1):", round(df['label_binary'].mean(), 4))
df.head()



## II. Data Preprocessing

- Categorical: `protocol_type`, `service`, `flag` → **One-Hot Encoding**  
- Numeric: all other columns → **StandardScaler**  
- **Stratified** train/test split (preserves the class ratio)
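
The two transformations can be illustrated by hand. The sketch below is a plain-Python illustration on made-up values, not the scikit-learn implementation; `one_hot` and `scale_by_std` are hypothetical helper names. It divides by the population standard deviation without centering, which is what `StandardScaler(with_mean=False)` does.

```python
import statistics


def one_hot(values):
    """Encode a list of category strings as 0/1 indicator rows."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def scale_by_std(values):
    """Divide by the population standard deviation without centering."""
    std = statistics.pstdev(values)
    # A constant column has zero spread; leave it unchanged in that case.
    return [v / std for v in values] if std > 0 else list(values)


categories, encoded = one_hot(["tcp", "udp", "tcp", "icmp"])
print(categories)
print(encoded)

scaled = scale_by_std([0.0, 2.0, 4.0])
print([round(v, 4) for v in scaled])
```

Skipping the centering step keeps zeros at zero, which is why it combines well with sparse one-hot columns.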


In [None]:

categorical_cols = ['protocol_type', 'service', 'flag']
numeric_cols = [c for c in df.columns if c not in categorical_cols + ['label', 'label_binary']]

X = df[categorical_cols + numeric_cols]
y = df['label_binary']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(with_mean=False), numeric_cols)  # no centering; sparse-friendly
    ]
)

# Suggested weight for class imbalance (used by XGBoost)
pos = (y_train == 1).sum()
neg = (y_train == 0).sum()
scale_pos_weight_val = float(neg) / float(pos) if pos > 0 else 1.0
print("Suggested scale_pos_weight:", round(scale_pos_weight_val, 2))



## III. Models

The following classical algorithms are trained inside a **Pipeline**:
- **Logistic Regression** (baseline)
- **Random Forest**
- **Decision Tree**
- **SGDClassifier** (linear, `loss='log_loss'`)  
- *(Optional)* **XGBoost** (if installed)


In [None]:

logreg_pipe = Pipeline([("prep", preprocess),
                        ("clf", LogisticRegression(max_iter=1000, random_state=RANDOM_STATE))])

rf_pipe = Pipeline([("prep", preprocess),
                    ("clf", RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=-1))])

dt_pipe = Pipeline([("prep", preprocess),
                    ("clf", DecisionTreeClassifier(random_state=RANDOM_STATE))])

sgd_pipe = Pipeline([("prep", preprocess),
                     ("clf", SGDClassifier(loss="log_loss", class_weight="balanced", random_state=RANDOM_STATE))])

if XGB_AVAILABLE:
    xgb_pipe = Pipeline([("prep", preprocess),
                         ("clf", XGBClassifier(
                             objective="binary:logistic",
                             eval_metric="logloss",
                             tree_method="hist",
                             random_state=RANDOM_STATE,
                             n_jobs=-1
                         ))])



## IV. Hyperparameter Optimization (GridSearchCV)

For each model, the grid is searched to maximize the cross-validated **ROC-AUC** score.
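
To make the scoring criterion concrete: ROC-AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function below is a small illustration of that definition on toy data, not how scikit-learn computes it; `pairwise_roc_auc` is a hypothetical name.

```python
def pairwise_roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy labels and scores: three of the four positive/negative pairs are ordered correctly.
auc = pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Because the metric depends only on the ranking of scores, it is unaffected by the 0.5 decision threshold used later for F1.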


In [None]:

search_spaces = {
    "Logistic Regression": (logreg_pipe, {
        "clf__C": [0.1, 1.0, 3.0],
        "clf__penalty": ["l2"],
        "clf__solver": ["lbfgs", "liblinear"]
    }),
    "Random Forest": (rf_pipe, {
        "clf__n_estimators": [100, 200],
        "clf__max_depth": [None, 20, 40],
        "clf__min_samples_split": [2, 5],
        "clf__min_samples_leaf": [1, 2]
    }),
    "Decision Tree": (dt_pipe, {
        "clf__max_depth": [None, 10, 20, 40],
        "clf__min_samples_split": [2, 5, 10],
        "clf__min_samples_leaf": [1, 2, 4],
        "clf__class_weight": [None, "balanced"]
    }),
    "SGD (Linear)": (sgd_pipe, {
        "clf__alpha": [1e-4, 1e-3, 1e-2],
        "clf__max_iter": [1000, 2000],
        "clf__tol": [1e-3, 1e-4]
    })
}

if XGB_AVAILABLE:
    search_spaces["XGBoost"] = (xgb_pipe, {
        "clf__n_estimators": [200, 400],
        "clf__max_depth": [4, 6, 8],
        "clf__learning_rate": [0.03, 0.1],
        "clf__subsample": [0.8, 1.0],
        "clf__colsample_bytree": [0.8, 1.0],
        "clf__scale_pos_weight": [1.0, scale_pos_weight_val]
    })

best_models = {}
cv_results = []

for name, (pipe, grid) in search_spaces.items():
    print(f"\n>> Starting GridSearch for {name}...")
    gs = GridSearchCV(
        estimator=pipe,
        param_grid=grid,
        scoring="roc_auc",
        cv=3,
        n_jobs=-1,
        verbose=1
    )
    gs.fit(X_train, y_train)
    best_models[name] = gs.best_estimator_
    cv_results.append((name, gs.best_score_, gs.best_params_))
    print(f"{name} best ROC-AUC (CV): {gs.best_score_:.4f}")
    print(f"{name} best parameters: {gs.best_params_}")

cv_summary = pd.DataFrame(cv_results, columns=["Model", "ROC-AUC (CV)", "Best Params"]).sort_values("ROC-AUC (CV)", ascending=False)
cv_summary



## V. Model Evaluation

For each model, on the test set:
- **ROC-AUC**, **PR-AUC**, and **F1** scores
- **ROC** and **precision-recall** curves
- **Confusion matrix**
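
For reference, precision, recall, and F1 follow directly from the confusion-matrix counts. The sketch below uses made-up counts, and `prf_from_counts` is a hypothetical helper rather than part of the notebook.

```python
def prf_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Made-up example: 8 attacks caught, 2 false alarms, 4 attacks missed.
precision, recall, f1 = prf_from_counts(tp=8, fp=2, fn=4)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```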


In [None]:

def evaluate_classifier(name, model, X_tr, y_tr, X_te, y_te):
    # Predicted probabilities for the positive class
    y_tr_proba = model.predict_proba(X_tr)[:, 1]
    y_te_proba = model.predict_proba(X_te)[:, 1]

    # Class predictions at a 0.5 threshold
    y_tr_pred = (y_tr_proba >= 0.5).astype(int)
    y_te_pred = (y_te_proba >= 0.5).astype(int)

    # Scores
    roc_auc_tr = roc_auc_score(y_tr, y_tr_proba)
    roc_auc_te = roc_auc_score(y_te, y_te_proba)
    ap_tr = average_precision_score(y_tr, y_tr_proba)
    ap_te = average_precision_score(y_te, y_te_proba)
    f1_tr = f1_score(y_tr, y_tr_pred)
    f1_te = f1_score(y_te, y_te_pred)

    print(f"\n[{name}]")
    print("ROC-AUC  (train/test):", round(roc_auc_tr,4), "/", round(roc_auc_te,4))
    print("PR-AUC   (train/test):", round(ap_tr,4), "/", round(ap_te,4))
    print("F1-score (train/test):", round(f1_tr,4), "/", round(f1_te,4))
    print("\nClassification Report (Test):\n", classification_report(y_te, y_te_pred, digits=4))

    # Confusion Matrix
    cm = confusion_matrix(y_te, y_te_pred)
    plt.figure()
    plt.imshow(cm, interpolation='nearest')
    plt.title(f'Confusion Matrix - {name}')
    plt.colorbar()
    plt.xticks([0,1], ['Normal (0)', 'Attack (1)'])
    plt.yticks([0,1], ['Normal (0)', 'Attack (1)'])
    plt.xlabel('Predicted'); plt.ylabel('True')
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, cm[i, j], ha='center', va='center')
    plt.tight_layout(); plt.show()

    # ROC
    fpr, tpr, _ = roc_curve(y_te, y_te_proba)
    plt.figure()
    plt.plot(fpr, tpr, label=f'{name}')
    plt.plot([0,1],[0,1],'--')
    plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve - {name}'); plt.legend(); plt.tight_layout(); plt.show()

    # PR
    prec, rec, _ = precision_recall_curve(y_te, y_te_proba)
    plt.figure()
    plt.plot(rec, prec, label=f'{name}')
    plt.xlabel('Recall'); plt.ylabel('Precision')
    plt.title(f'Precision-Recall Curve - {name}'); plt.legend(); plt.tight_layout(); plt.show()

for name, model in best_models.items():
    evaluate_classifier(name, model, X_train, y_train, X_test, y_test)



## VI. Feature Importance (Random Forest) and Threshold Tuning (Optional)

- **Feature importance**: shows which variables contribute most to the model.  
- **Threshold tuning**: shifts the **precision/recall** balance to suit the use case.
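
The precision/recall trade-off can be seen on a handful of made-up scores. The sketch below is illustrative only; `sweep_thresholds` is a hypothetical helper, and the labels and scores are invented.

```python
def sweep_thresholds(y_true, scores, thresholds):
    """Return (threshold, precision, recall) for each candidate threshold."""
    rows = []
    for th in thresholds:
        preds = [1 if s >= th else 0 for s in scores]
        tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        rows.append((th, precision, recall))
    return rows


# Invented labels and predicted attack probabilities.
y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.4, 0.7, 0.9]
rows = sweep_thresholds(y_true, scores, [0.3, 0.5, 0.8])
for th, precision, recall in rows:
    print(th, round(precision, 3), round(recall, 3))
```

In this toy data, the highest threshold removes all false alarms but misses two of the three attacks. Picking the threshold on a separate validation split, rather than on the test set, keeps the reported test metrics unbiased.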


In [None]:

# Extract feature importances if a Random Forest model is available
if "Random Forest" in best_models:
    best_rf = best_models["Random Forest"]
    ohe = best_rf.named_steps['prep'].named_transformers_['cat']
    cat_feature_names = list(ohe.get_feature_names_out(['protocol_type','service','flag']))
    num_feature_names = [c for c in df.columns if c not in ['protocol_type','service','flag','label','label_binary']]
    all_feature_names = cat_feature_names + num_feature_names

    rf = best_rf.named_steps['clf']
    importances = rf.feature_importances_
    feat_imp = pd.DataFrame({"feature": all_feature_names, "importance": importances})\
               .sort_values("importance", ascending=False).head(20)
    display(feat_imp)

    plt.figure()
    plt.barh(feat_imp['feature'][::-1], feat_imp['importance'][::-1])
    plt.xlabel("Importance"); plt.title("Top 20 Feature Importances (Random Forest)")
    plt.tight_layout(); plt.show()

    # Threshold tuning (RF)
    y_proba = best_rf.predict_proba(X_test)[:, 1]
    thresholds = np.linspace(0.1, 0.9, 9)
    rows = []
    for th in thresholds:
        y_pred = (y_proba >= th).astype(int)
        tp = ((y_pred == 1) & (y_test == 1)).sum()
        fp = ((y_pred == 1) & (y_test == 0)).sum()
        fn = ((y_pred == 0) & (y_test == 1)).sum()
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        rows.append({"threshold": th, "precision": precision, "recall": recall, "f1": f1_score(y_test, y_pred)})
    th_df = pd.DataFrame(rows).sort_values("f1", ascending=False)
    display(th_df.head())



## VII. Class Decision Boundaries on PCA (2D) (with an SVM)

To **view the classes in the X-Y plane** for presentations, we transform the train/test data with the pipeline's preprocessing step and reduce it to two dimensions with **PCA(2)**; we then draw the decision boundary with an **SVM**.
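
For reference, the two-dimensional projection can be written as follows, where $\mu$ is the mean of the training features and $w_1, w_2$ are the two leading eigenvectors of their covariance matrix (these symbols are introduced here for illustration only):

$$
z = \begin{bmatrix} w_1^\top \\ w_2^\top \end{bmatrix} (x - \mu) \in \mathbb{R}^2
$$

The SVM is fit on the projected points $z$ rather than on the original features, so the plotted boundary describes the 2D projection only and not the full-dimensional classifiers trained earlier.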


In [None]:

# Transform with the fitted preprocessing step
any_model = list(best_models.values())[0]  # reuse the prep step of any best model
prep = any_model.named_steps['prep']
X_train_tr = prep.transform(X_train)
X_test_tr  = prep.transform(X_test)

# Convert to dense (if needed)
X_train_dense = X_train_tr.toarray() if hasattr(X_train_tr, "toarray") else X_train_tr
X_test_dense  = X_test_tr.toarray() if hasattr(X_test_tr, "toarray") else X_test_tr

# PCA(2)
pca_vis = PCA(n_components=2, random_state=RANDOM_STATE)
X_train_2d = pca_vis.fit_transform(X_train_dense)
X_test_2d  = pca_vis.transform(X_test_dense)

# Model used only to draw the boundary (only predict() is needed, so probability estimates are not enabled)
svm_vis = SVC(kernel="rbf", random_state=RANDOM_STATE)
svm_vis.fit(X_train_2d, y_train)

# Meshgrid: a fixed 300x300 grid keeps memory bounded whatever the PCA range is
x1_min, x1_max = X_test_2d[:,0].min()-1, X_test_2d[:,0].max()+1
x2_min, x2_max = X_test_2d[:,1].min()-1, X_test_2d[:,1].max()+1
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 300),
                       np.linspace(x2_min, x2_max, 300))
Z = svm_vis.predict(np.c_[xx1.ravel(), xx2.ravel()]).reshape(xx1.shape)

plt.figure(figsize=(7,6))
plt.contourf(xx1, xx2, Z, alpha=0.5)
plt.scatter(X_test_2d[y_test==0,0], X_test_2d[y_test==0,1], s=20, label="Normal (0)")
plt.scatter(X_test_2d[y_test==1,0], X_test_2d[y_test==1,1], s=20, label="Attack (1)")
plt.title("PCA (2D) Decision Boundaries (SVM-RBF)")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend(); plt.tight_layout(); plt.show()



## VIII. K-Means: Elbow and Silhouette

As an example of unsupervised learning, we plot the **elbow (WCSS)** curve for **K-Means** on the preprocessed features and compute the **silhouette** score.
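
WCSS (within-cluster sum of squares) is the quantity that `KMeans` exposes as `inertia_`. The sketch below computes it by hand for a given cluster assignment on four made-up points; `wcss` is a hypothetical helper and performs no clustering itself.

```python
def wcss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        for p in members:
            total += sum((p[d] - centroid[d]) ** 2 for d in range(dim))
    return total


# Two well-separated pairs of points (made-up data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = wcss(points, [0, 0, 0, 0])
two_clusters = wcss(points, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

The sharp drop from one cluster to two is the kind of bend the elbow plot is meant to reveal; beyond the natural number of clusters, WCSS keeps falling but far more slowly.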


In [None]:

X_all = df[categorical_cols + numeric_cols]
X_all_tr = prep.transform(X_all)
X_all_dense = X_all_tr.toarray() if hasattr(X_all_tr, "toarray") else X_all_tr

# Elbow (WCSS)
wcss = []
K = range(2, 11)
for k in K:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=RANDOM_STATE)
    km.fit(X_all_dense)
    wcss.append(km.inertia_)

plt.figure()
plt.plot(list(K), wcss, marker="o")
plt.title("K-Means Elbow (Preprocessed Features)")
plt.xlabel("Number of clusters (k)"); plt.ylabel("WCSS (inertia)")
plt.tight_layout(); plt.show()

# Silhouette for an example k
k_best = 5  # adjust based on the elbow plot
km_best = KMeans(n_clusters=k_best, init="k-means++", n_init=10, random_state=RANDOM_STATE)
labels = km_best.fit_predict(X_all_dense)
# Silhouette needs pairwise distances, so score a random subsample instead of every row
sil = silhouette_score(X_all_dense, labels, sample_size=10000, random_state=RANDOM_STATE)
print("Silhouette score:", round(sil, 3))



## IX. Example Feature Selection with SelectKBest (ANOVA F)

Because the data mixes types, the best `k` features are selected with **ANOVA F (`f_classif`)** on the **numeric** columns only.
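
The ANOVA F statistic compares how far the class means lie from each other with how much the values vary inside each class. The sketch below computes it for a single feature on made-up values; `anova_f` is a hypothetical helper, and it assumes the within-class variation is non-zero.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one feature split into class groups."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-class variation: spread of class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-class variation: spread of values around their own class mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)  # assumed non-zero in this sketch
    return ms_between / ms_within


# Made-up feature values for class 0 and class 1.
f_stat = anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(f_stat)
```

A larger F means the feature separates the classes more clearly, which is the ranking `SelectKBest` uses to keep the top `k` columns.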


In [None]:

X_num = df[numeric_cols].copy()
y_bin = df["label_binary"].copy()

selector = SelectKBest(score_func=f_classif, k=10)  # top 10 numeric features
X_num_sel = selector.fit_transform(X_num, y_bin)
selected_num_features = np.array(numeric_cols)[selector.get_support()]
print("Selected numeric features:", selected_num_features)



## X. Saving the Models

The best models are saved to disk with **joblib**.


In [None]:

os.makedirs("models", exist_ok=True)
for name, model in best_models.items():
    safe = name.lower().replace(" ", "_")
    joblib.dump(model, f"models/{safe}.joblib")
print("Models saved to the models/ folder.")



## XI. Conclusion

- Several classical algorithms were compared using a **binary (Normal vs Attack)** formulation.
- **RF** usually gives strong results, but other models may be preferable depending on the data.
- Visualizing the decision boundary with **PCA(2)** works well in presentations.
- Short demos of unsupervised learning and feature selection were added with **K-Means** and **SelectKBest**.
- For imbalanced data, approaches such as **class_weight**/**scale_pos_weight** can be critical.
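
As a rough illustration of the last point, the sketch below derives both kinds of weight from class counts. The counts are made up, `imbalance_weights` is a hypothetical helper, and the `balanced` formula follows scikit-learn's documented heuristic of `n_samples / (n_classes * class_count)`.

```python
def imbalance_weights(n_neg, n_pos):
    """Return (scale_pos_weight, balanced class weights) for a binary problem."""
    if n_neg <= 0 or n_pos <= 0:
        raise ValueError("Both classes need at least one sample.")
    # Ratio of negatives to positives, as suggested for XGBoost in the notebook.
    scale_pos_weight = n_neg / n_pos
    # "balanced" heuristic: n_samples / (n_classes * class_count), with 2 classes.
    n_samples = n_neg + n_pos
    balanced = {0: n_samples / (2 * n_neg), 1: n_samples / (2 * n_pos)}
    return scale_pos_weight, balanced


# Made-up counts: 80 negatives and 20 positives.
spw, weights = imbalance_weights(n_neg=80, n_pos=20)
print(spw, weights)
```

Both schemes give errors on the rarer class more influence during training; which one applies depends on the estimator.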
