# Enhancing Security: Dataset Generation for Next-Gen Intrusion Detection Systems
This notebook demonstrates key concepts and code examples for generating and working with datasets for modern intrusion detection systems (IDS). We'll explore dataset quality, the handling of encrypted traffic, and best practices for IDS development.

## Setup and Required Libraries
First, let's import the necessary Python libraries for our analysis:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

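# Set plotting style. sns.set_theme applies the seaborn look on its own;
# the bare 'seaborn' matplotlib style name was deprecated in matplotlib 3.6
# and removed in later releases, so plt.style.use('seaborn') is dropped.
sns.set_theme(style="whitegrid")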

## Loading and Exploring Sample Network Traffic Data
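Rather than loading a real capture, we'll generate a synthetic dataset that mimics network traffic patterns and inject anomalies (abnormally large packets) into 10% of the samples: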

In [None]:
# Generate sample network traffic data
np.random.seed(42)
n_samples = 1000

data = {
    'packet_size': np.random.normal(500, 150, n_samples),
    'duration': np.random.exponential(2, n_samples),
    'protocol': np.random.choice(['TCP', 'UDP', 'ICMP'], n_samples),
    'is_encrypted': np.random.choice([0, 1], n_samples, p=[0.3, 0.7])
}

df = pd.DataFrame(data)

# Add some anomalous traffic
anomaly_idx = np.random.choice(n_samples, size=int(0.1 * n_samples), replace=False)
df.loc[anomaly_idx, 'packet_size'] *= 3  # Abnormally large packets

print('Dataset shape:', df.shape)
df.head()
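
Because the anomalies are injected by index, the ground truth is known and can be recorded as a label column before any modelling. The following is a minimal sketch of that idea plus a few basic quality checks. The helper names `make_traffic` and `quality_report` and the `label` column are hypothetical, not part of the notebook, and the sketch uses its own random generator, so its values differ from those in `df`.

```python
import numpy as np
import pandas as pd


def make_traffic(n_samples=1000, anomaly_frac=0.1, seed=42):
    """Build a synthetic traffic frame with a ground-truth `label` column.

    Hypothetical helper: mirrors the generation cell but uses a local
    Generator, so the exact values differ from the notebook's `df`.
    """
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        'packet_size': rng.normal(500, 150, n_samples),
        'duration': rng.exponential(2, n_samples),
        'protocol': rng.choice(['TCP', 'UDP', 'ICMP'], n_samples),
        'is_encrypted': rng.choice([0, 1], n_samples, p=[0.3, 0.7]),
    })
    # Inject anomalies and remember which rows were modified
    anomaly_idx = rng.choice(n_samples, size=int(anomaly_frac * n_samples),
                             replace=False)
    df.loc[anomaly_idx, 'packet_size'] *= 3
    df['label'] = 0
    df.loc[anomaly_idx, 'label'] = 1
    return df


def quality_report(df):
    """Return a few simple dataset-quality indicators as a dict."""
    return {
        'n_rows': len(df),
        'n_missing': int(df.isna().sum().sum()),
        'n_duplicates': int(df.duplicated().sum()),
        'n_nonpositive_size': int((df['packet_size'] <= 0).sum()),
        'anomaly_rate': float(df['label'].mean()),
    }


traffic = make_traffic()
report = quality_report(traffic)
print(report)
```

Keeping the label alongside the features makes it possible to measure a detector later instead of only eyeballing a scatter plot.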

## Analyzing Traffic Patterns
Let's visualize the distribution of packet sizes and the protocol mix, each split by encryption status:

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Packet size distribution
sns.histplot(data=df, x='packet_size', hue='is_encrypted', bins=30, ax=ax1)
ax1.set_title('Packet Size Distribution by Encryption Status')

# Protocol distribution
sns.countplot(data=df, x='protocol', hue='is_encrypted', ax=ax2)
ax2.set_title('Protocol Distribution by Encryption Status')

plt.tight_layout()
plt.show()
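
The two plots can be backed by a numeric summary, which is easier to compare across dataset versions. This is a rough sketch, assuming the same column names as above. It regenerates a stand-in frame (`df_demo`, a hypothetical name) with its own random generator so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-in data with the same columns and distributions as the notebook's df
rng = np.random.default_rng(42)
n_samples = 1000
df_demo = pd.DataFrame({
    'packet_size': rng.normal(500, 150, n_samples),
    'protocol': rng.choice(['TCP', 'UDP', 'ICMP'], n_samples),
    'is_encrypted': rng.choice([0, 1], n_samples, p=[0.3, 0.7]),
})

# Packet-size statistics per encryption status (numeric view of the histogram)
size_stats = df_demo.groupby('is_encrypted')['packet_size'].agg(
    ['count', 'mean', 'std', 'median']
)

# Protocol counts per encryption status (numeric view of the count plot)
protocol_counts = pd.crosstab(df_demo['protocol'], df_demo['is_encrypted'])

print(size_stats.round(1))
print(protocol_counts)
```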

## Implementing Basic Anomaly Detection
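Now let's implement a simple anomaly detector using scikit-learn's Isolation Forest. The `contamination=0.1` setting matches the 10% of samples that were made anomalous during data generation: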

In [None]:
# Prepare features for anomaly detection
X = df[['packet_size', 'duration']].copy()

# Initialize and fit the Isolation Forest model
model = IsolationForest(contamination=0.1, random_state=42)
predictions = model.fit_predict(X)

# Convert predictions to boolean (True for anomalies)
df['is_anomaly'] = predictions == -1

# Visualize results
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='packet_size', y='duration',
                hue='is_anomaly', style='is_encrypted')
plt.title('Anomaly Detection Results')
plt.show()
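
Since the anomalous rows were injected deliberately, the detector's flags can be checked against them rather than judged by eye. The sketch below is one way to do that, under stated assumptions: it rebuilds a comparable dataset with its own random generator, derives `y_true` (an illustrative name) from the injected indices, and, unlike the cell above, fits on a training split and scores held-out rows.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Rebuild a synthetic dataset with known anomaly labels (assumption: same
# distributions as the generation cell, but a separate random generator)
rng = np.random.default_rng(42)
n_samples = 1000
X = pd.DataFrame({
    'packet_size': rng.normal(500, 150, n_samples),
    'duration': rng.exponential(2, n_samples),
})
y_true = np.zeros(n_samples, dtype=int)
anomaly_idx = rng.choice(n_samples, size=100, replace=False)
X.loc[anomaly_idx, 'packet_size'] *= 3
y_true[anomaly_idx] = 1

# Hold out 30% of the rows, stratified so both splits keep about 10% anomalies
X_train, X_test, y_train, y_test = train_test_split(
    X, y_true, test_size=0.3, stratify=y_true, random_state=42
)

# Fit without labels (unsupervised), then score only the held-out rows
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(X_train)
y_pred = (model.predict(X_test) == -1).astype(int)  # 1 = flagged as anomaly

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print(classification_report(y_test, y_pred,
                            target_names=['normal', 'anomaly'],
                            zero_division=0))
```

Rows of the confusion matrix are true classes and columns are predicted classes, so the bottom-left cell counts injected anomalies that the detector missed.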

## Best Practices and Error Handling
Here's an example of implementing error handling and best practices for dataset processing:

In [None]:
def process_network_data(df, required_columns=('packet_size', 'duration')):
    """Process network traffic data with error handling.

    Args:
        df (pd.DataFrame): Input network traffic data
        required_columns (sequence of str): Required columns for processing

    Returns:
        pd.DataFrame or None: Cleaned copy of the data, or None if validation fails
    """
    try:
        # Validate input
        if not isinstance(df, pd.DataFrame):
            raise TypeError("Input must be a pandas DataFrame")

        # Check required columns
        missing_cols = set(required_columns) - set(df.columns)
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")

        # Remove invalid values: rows missing a required field,
        # then rows with a non-positive packet size
        df_clean = df.dropna(subset=list(required_columns))
        df_clean = df_clean[df_clean['packet_size'] > 0].copy()

        return df_clean

    except (TypeError, ValueError) as e:
        # Handle only the validation errors raised above; anything else propagates
        print(f"Error processing data: {e}")
        return None

# Test the function
processed_df = process_network_data(df)
print("Processed data shape:", processed_df.shape if processed_df is not None else None)
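
Returning `None` on failure means every caller has to remember to check for it. An alternative design is to let validation errors propagate and handle them at the call site. The sketch below illustrates that variant; `validate_network_data` is a hypothetical name, not part of the notebook, and like the function above it assumes a `packet_size` column is present.

```python
import pandas as pd


def validate_network_data(df, required_columns=('packet_size', 'duration')):
    """Hypothetical strict variant: raise on bad input instead of returning None."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame")
    missing_cols = set(required_columns) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing required columns: {sorted(missing_cols)}")
    # Drop rows missing a required field, then rows with non-positive sizes
    df_clean = df.dropna(subset=list(required_columns))
    return df_clean[df_clean['packet_size'] > 0].copy()


sample = pd.DataFrame({
    'packet_size': [500.0, -10.0, None, 1200.0],
    'duration': [1.2, 0.5, 2.0, None],
})

# Only the first row has both fields present and a positive packet size
clean = validate_network_data(sample)
print(len(clean))

# A missing column surfaces as an exception the caller can act on
try:
    validate_network_data(sample.drop(columns=['duration']))
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```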

## Conclusion
In this notebook, we've demonstrated:
1. Generating and processing synthetic network traffic data
2. Visualizing traffic patterns and encryption status
3. Implementing basic anomaly detection
4. Best practices for error handling and data validation

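These concepts form a foundation for building robust intrusion detection systems. Because the detector here relies only on flow metadata (packet size and duration) rather than payload contents, the same approach remains applicable when traffic is encrypted.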