# Network Intrusion Detection System Using Machine Learning

This notebook demonstrates the complete workflow for building an intrusion detection system using the NSL-KDD dataset.

## Table of Contents
1. [Data Loading](#Data-Loading)
2. [Data Understanding](#Data-Understanding)
3. [Data Preprocessing](#Data-Preprocessing)
4. [Model Training](#Model-Training)
5. [Model Evaluation](#Model-Evaluation)
6. [Results & Conclusion](#Results-&-Conclusion)

## Data Loading

First, let's load the NSL-KDD dataset and examine its structure.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report
import pickle

# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Define column names for NSL-KDD dataset
columns = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land',
    'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised',
    'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
    'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count',
    'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate',
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
    'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
    'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'label'
]

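# Load the dataset.
# The NSL-KDD text files carry one extra trailing column (a difficulty score)
# after the label. Name it explicitly so the 42 names above line up with the
# right columns, then drop it because it is not used in this notebook.
file_columns = columns + ['difficulty']
train_df = pd.read_csv('../data/KDDTrain+.txt', header=None, names=file_columns).drop(columns='difficulty')
test_df = pd.read_csv('../data/KDDTest+.txt', header=None, names=file_columns).drop(columns='difficulty')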

print(f"Training data shape: {train_df.shape}")
print(f"Testing data shape: {test_df.shape}")
print("\nFirst few rows of training data:")
train_df.head()
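As commonly distributed, the NSL-KDD files have no header row and end each record with a difficulty score after the label. The sketch below shows, on a made-up two-row sample, how naming every column keeps the features and the label aligned. The sample values and the shortened `feature_and_label` list are assumptions for illustration only.

```python
import io
import pandas as pd

# Toy stand-in for NSL-KDD rows: two features, a label, and a trailing
# difficulty score (all values are made up for illustration).
raw = "0,tcp,normal,20\n5,udp,neptune,18\n"

feature_and_label = ['duration', 'protocol_type', 'label']  # hypothetical short list

# Name every column present in the file, then drop the one we do not need.
df = pd.read_csv(io.StringIO(raw), header=None,
                 names=feature_and_label + ['difficulty'])
df = df.drop(columns='difficulty')

print(df)
```

A quick `df.shape` check after loading the real files is a cheap way to confirm that the column count matches the list of names.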

## Data Understanding

Let's examine the target variable and class distribution.

In [None]:
# Examine the target variable
print("Unique labels in training data:")
print(train_df['label'].unique())

# Convert to binary classification
train_df['label'] = train_df['label'].apply(lambda x: 0 if x == 'normal' else 1)
test_df['label'] = test_df['label'].apply(lambda x: 0 if x == 'normal' else 1)

# Check class distribution
plt.figure(figsize=(8, 6))
train_df['label'].value_counts().plot(kind='bar')
plt.title('Class Distribution in Training Data')
plt.xlabel('Class (0: Normal, 1: Attack)')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

# Compute the class counts once and reuse them
class_counts = train_df['label'].value_counts()
print("\nClass distribution:")
print(class_counts)
print(f"\nNormal traffic: {class_counts[0]} ({class_counts[0] / len(train_df) * 100:.2f}%)")
print(f"Attack traffic: {class_counts[1]} ({class_counts[1] / len(train_df) * 100:.2f}%)")
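The binary conversion can also be written as a vectorized comparison, which avoids a Python-level lambda call per row. The following is a minimal sketch on a made-up label series; the labels and variable names are illustrative and are not taken from the dataset files.

```python
import pandas as pd

# Made-up labels for illustration; the real label column contains many attack names.
labels = pd.Series(['normal', 'neptune', 'normal', 'smurf', 'normal'])

# Anything that is not 'normal' is treated as an attack (1).
binary = (labels != 'normal').astype(int)

counts = binary.value_counts().sort_index()                # index 0 = normal, 1 = attack
shares = binary.value_counts(normalize=True).sort_index()  # same order, as fractions

for cls, name in [(0, 'Normal'), (1, 'Attack')]:
    print(f"{name}: {counts[cls]} ({shares[cls] * 100:.2f}%)")
```

For this toy series the loop prints `Normal: 3 (60.00%)` and `Attack: 2 (40.00%)`.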

## Data Preprocessing

Now we'll preprocess the data: one-hot encode the categorical features, standardize all features, and produce the training and test matrices used by the models.

In [None]:
# Identify categorical and numerical features
categorical_features = ['protocol_type', 'service', 'flag']
numerical_features = [col for col in columns[:-1] if col not in categorical_features]

print("Categorical features:", categorical_features)
print("Number of numerical features:", len(numerical_features))

# One-hot encode categorical features.
# Fit on the training set only so that no information from the test set leaks
# into preprocessing. Categories that appear only in the test set are encoded
# as all zeros (handle_unknown='ignore').
# Note: sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2).
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_df[categorical_features])
encoded_columns = encoder.get_feature_names_out(categorical_features)

def build_features(df):
    """Concatenate the numerical columns with the one-hot encoded categorical columns."""
    encoded = pd.DataFrame(encoder.transform(df[categorical_features]),
                           columns=encoded_columns, index=df.index)
    return pd.concat([df[numerical_features], encoded], axis=1)

X_train_raw = build_features(train_df)
X_test_raw = build_features(test_df)
y_train = train_df['label']
y_test = test_df['label']

# Feature scaling: fit on the training data, then apply the same transform to the test data
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train_raw),
                       columns=X_train_raw.columns, index=X_train_raw.index)
X_test = pd.DataFrame(scaler.transform(X_test_raw),
                      columns=X_test_raw.columns, index=X_test_raw.index)

print(f"\nPreprocessed data shapes:")
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")

## Model Training

Let's train Logistic Regression and Random Forest models.

In [None]:
# Train Logistic Regression
print("Training Logistic Regression...")
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train, y_train)
print("Logistic Regression training completed.")

# Train Random Forest
print("\nTraining Random Forest...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10, n_jobs=-1)
rf_model.fit(X_train, y_train)
print("Random Forest training completed.")

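# Save models (create the output directory first if it does not exist)
import os
os.makedirs('../models', exist_ok=True)
with open('../models/logistic_regression.pkl', 'wb') as f:
    pickle.dump(lr_model, f)
with open('../models/random_forest.pkl', 'wb') as f:
    pickle.dump(rf_model, f)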
print("\nModels saved successfully.")
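A saved model is only useful if it can be loaded back and produces the same predictions. The sketch below shows a pickle round trip on a small synthetic problem. The data, the temporary directory, and the file name `demo_model.pkl` are assumptions made so that the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic problem so the sketch is self-contained.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'demo_model.pkl'   # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

Two practical notes: the fitted encoder and scaler would need to be saved alongside the models for later inference, and pickle files should only be loaded from trusted sources because unpickling can execute arbitrary code.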

## Model Evaluation

Now let's evaluate both models using various metrics.

In [None]:
def evaluate_model(model, X_test, y_test, model_name):
    """Evaluate a model and return metrics"""
    y_pred = model.predict(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    
    print(f"\n--- {model_name} Results ---")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    
    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Normal', 'Attack'],
                yticklabels=['Normal', 'Attack'])
    plt.title(f'Confusion Matrix - {model_name}')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Normal', 'Attack']))
    
    return accuracy, precision, recall

# Evaluate both models
lr_metrics = evaluate_model(lr_model, X_test, y_test, "Logistic Regression")
rf_metrics = evaluate_model(rf_model, X_test, y_test, "Random Forest")
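To make the relationship between the confusion matrix and the reported metrics concrete, here is a small worked example on made-up predictions (the `y_true` and `y_pred` arrays are illustrative, not model output). With attack as the positive class, precision is TP / (TP + FP) and recall is TP / (TP + FN).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions (0 = normal, 1 = attack).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of raised alarms that were real attacks
recall = tp / (tp + fn)      # share of real attacks that were caught
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, Accuracy: {accuracy:.2f}")

# The hand-computed values agree with scikit-learn's helpers.
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
```

In this toy case precision is 4/5 = 0.80 and recall is 4/6 ≈ 0.67: two attacks were missed, while one normal connection raised a false alarm.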

## Results & Conclusion

Let's compare the models and discuss the results.

In [None]:
# Compare models
models = ['Logistic Regression', 'Random Forest']
accuracies = [lr_metrics[0], rf_metrics[0]]
precisions = [lr_metrics[1], rf_metrics[1]]
recalls = [lr_metrics[2], rf_metrics[2]]

x = np.arange(len(models))
width = 0.25

plt.figure(figsize=(10, 6))
plt.bar(x - width, accuracies, width, label='Accuracy', alpha=0.8)
plt.bar(x, precisions, width, label='Precision', alpha=0.8)
plt.bar(x + width, recalls, width, label='Recall', alpha=0.8)
plt.xlabel('Models')
plt.ylabel('Score')
plt.title('Model Performance Comparison')
plt.xticks(x, models)
plt.legend()
plt.ylim(0, 1.0)  # start at 0 so bars are not truncated and scores below 0.8 stay visible
plt.show()

# Feature importance for Random Forest
if hasattr(rf_model, 'feature_importances_'):
    importances = rf_model.feature_importances_
    indices = np.argsort(importances)[::-1]
    
    plt.figure(figsize=(12, 8))
    plt.title("Top 20 Feature Importances (Random Forest)")
    plt.bar(range(20), importances[indices][:20])
    plt.xticks(range(20), [X_train.columns[i] for i in indices[:20]], rotation=90)
    plt.tight_layout()
    plt.show()

## Conclusion

### Key Findings:
- Both Logistic Regression and Random Forest performed well on the NSL-KDD dataset
- Random Forest generally achieved higher accuracy and recall
- Feature importance analysis shows which network features are most indicative of attacks

### Why These Models?
- **Logistic Regression**: Interpretable, fast, provides probability scores
- **Random Forest**: Handles complex, non-linear patterns, is relatively robust to overfitting, and exposes feature importances that can guide feature selection

### Security Context:
- High recall is crucial to avoid missing attacks
- Precision helps reduce false alarms in production systems
- Both models predict quickly, which makes them candidates for real-time intrusion detection, though that deployment is not demonstrated here
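The trade-off between recall and precision can be adjusted by changing the probability threshold at which a connection is flagged as an attack. The sketch below illustrates the idea on synthetic data from `make_classification`. The data, the variable names, and the three thresholds are assumptions for illustration and say nothing about the NSL-KDD results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for normal (0) vs. attack (1).
X_syn, y_syn = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
attack_proba = clf.predict_proba(X_te)[:, 1]   # probability of class 1

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Flag a sample as an attack when its probability reaches the threshold.
    y_hat = (attack_proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_hat, zero_division=0),
        recall_score(y_te, y_hat),
    )
    print(f"threshold={threshold:.1f}  "
          f"precision={results[threshold][0]:.3f}  recall={results[threshold][1]:.3f}")
```

Lowering the threshold can only flag more samples, so recall never decreases as the threshold drops; the cost is usually more false alarms, which shows up as lower precision.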

### Limitations:
- NSL-KDD is derived from the KDD Cup 1999 data, so it may not represent modern attack patterns
- Binary classification doesn't distinguish between attack types
- No real-time processing demonstrated

### Future Improvements:
- Implement cross-validation for more robust evaluation
- Add feature selection to reduce model complexity
- Explore deep learning approaches, which may improve accuracy
- Integrate with packet capture tools for real-time detection
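As a starting point for the cross-validation item above, here is a minimal sketch of stratified 5-fold cross-validation scored on recall. It uses synthetic data from `make_classification`, and the fold count and the smaller forest size are illustrative assumptions rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem so the sketch runs on its own.
X_cv, y_cv = make_classification(n_samples=500, n_features=10, random_state=42)

# Stratified folds keep the class ratio roughly constant in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
forest = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42)

# One recall score per fold.
scores = cross_val_score(forest, X_cv, y_cv, cv=cv, scoring='recall')

print("Recall per fold:", scores.round(3))
print(f"Mean recall: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean gives a sense of how stable the metric is, which a single train/test split cannot show.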