# Exercise 8.1A: DNS Traffic Pattern Analysis

**Course**: SS*/AIML* ZG567 - AI and ML Techniques in Cyber Security  
**Module**: 08 - Domain Name Detection  
**Type**: Analytical Exercise  
**Duration**: 2-3 hours  
**Difficulty**: Beginner-Intermediate

---

## 🎯 Scenario

You are a junior SOC analyst at a mid-sized enterprise. Your SIEM has captured 24 hours of DNS queries from the corporate network. Your task is to **establish baseline patterns** for legitimate DNS traffic to prepare for anomaly detection.

## 📋 Learning Objectives

- Understand DNS query structure and components
- Identify characteristics of legitimate domain traffic
- Establish baselines for anomaly detection
- Recognize temporal patterns in network behavior

---

## Setup: Import Required Libraries

In [3]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from collections import Counter
import warnings

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


## Task 1: Load and Explore DNS Query Logs

### 1.1 Generate Sample DNS Logs

For this exercise, we'll generate synthetic DNS logs that simulate 24 hours of corporate network traffic.

In [None]:
# Generate sample DNS logs (simulated corporate traffic)
np.random.seed(42)

# Common legitimate domains
legitimate_domains = [
    'google.com', 'microsoft.com', 'amazon.com', 'facebook.com',
    'twitter.com', 'linkedin.com', 'github.com', 'stackoverflow.com',
    'office365.com', 'salesforce.com', 'zoom.us', 'slack.com',
    'dropbox.com', 'atlassian.com', 'adobe.com', 'apple.com'
]

# Generate 10,000 DNS queries over 24 hours
n_queries = 10000
start_time = datetime(2026, 1, 30, 0, 0, 0)

dns_logs = []
for i in range(n_queries):
    # Timestamp: More queries during business hours (8am-6pm)
    hour_offset = np.random.choice(
        range(24),
        p=[0.02, 0.01, 0.01, 0.01, 0.01, 0.02, 0.03, 0.04,  # 0-7am (sum: 0.15)
           0.08, 0.10, 0.09, 0.09, 0.10, 0.09, 0.08, 0.07,  # 8am-3pm (sum: 0.70)
           0.05, 0.03, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01]  # 4pm-11pm (sum: 0.15)
    )
    minute_offset = np.random.randint(0, 60)
    second_offset = np.random.randint(0, 60)
    timestamp = start_time + timedelta(hours=int(hour_offset), minutes=int(minute_offset), seconds=int(second_offset))
    
    # Source IP: Simulate corporate network (10.0.0.0/8)
    source_ip = f"10.{np.random.randint(1, 255)}.{np.random.randint(1, 255)}.{np.random.randint(1, 255)}"
    
    # Domain: Mostly legitimate with some random patterns
    if np.random.random() < 0.95:  # 95% legitimate
        domain = np.random.choice(legitimate_domains)
    else:  # 5% suspicious patterns (for later analysis)
        random_string = ''.join(np.random.choice(list('abcdefghijklmnopqrstuvwxyz0123456789'), 
                                                  size=np.random.randint(8, 15)))
        domain = f"{random_string}.com"
    
    # Query type: Mostly A records
    query_type = np.random.choice(['A', 'AAAA', 'CNAME', 'MX'], p=[0.70, 0.15, 0.10, 0.05])
    
    dns_logs.append({
        'timestamp': timestamp,
        'source_ip': source_ip,
        'domain': domain,
        'query_type': query_type
    })

# Create DataFrame
df = pd.DataFrame(dns_logs)
df = df.sort_values('timestamp').reset_index(drop=True)

print(f"✅ Generated {len(df):,} DNS queries")
print(f"📅 Time range: {df['timestamp'].min()} to {df['timestamp'].max()}")


### 1.2 Initial Data Exploration

**TODO**: Explore the dataset structure and calculate basic statistics.

In [None]:
# Display first 10 rows
print("=" * 80)
print("SAMPLE DNS LOGS")
print("=" * 80)
display(df.head(10))

# Dataset info
print("\n" + "=" * 80)
print("DATASET INFORMATION")
print("=" * 80)
df.info()

# TODO: Calculate summary statistics
# Hint: Use df.describe(include='all'); most columns here are non-numeric
# Hint: Count unique domains, IPs, query types

print("\n" + "=" * 80)
print("SUMMARY STATISTICS")
print("=" * 80)

# YOUR CODE HERE
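
As a rough illustration of the calls the hints point to, the sketch below counts distinct values and query type shares on a tiny hand-made table. The `toy` DataFrame and its values are invented for illustration and are not the exercise data.

```python
import pandas as pd

# Hypothetical miniature log, only for illustrating the calls
toy = pd.DataFrame({
    "source_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "domain": ["google.com", "github.com", "google.com", "zoom.us"],
    "query_type": ["A", "A", "AAAA", "MX"],
})

# Number of distinct values in each column
unique_counts = toy.nunique()

# Share of each query type, as fractions that sum to 1
query_type_share = toy["query_type"].value_counts(normalize=True)

print(unique_counts)
print(query_type_share)
```

The same two calls should carry over to `df` unchanged, since they do not depend on the column contents.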

## Task 2: Domain Characteristic Analysis

### 2.1 Extract Domain Components

Parse each domain into its top-level domain (TLD), second-level domain (SLD), subdomain count, and SLD length.

In [None]:
def parse_domain(domain):
    """
    Extract components from a domain name using a naive split on dots.

    Multi-label public suffixes (e.g., 'co.uk') are not handled: the last
    label is always treated as the TLD.

    Args:
        domain (str): Domain name (e.g., 'www.example.com')

    Returns:
        dict: 'sld', 'tld', 'subdomain_count', and 'length' (length of the SLD)
    """
    # Guard against missing or non-string values instead of using a bare except
    if not isinstance(domain, str) or not domain:
        return {'sld': None, 'tld': None, 'subdomain_count': 0, 'length': 0}

    parts = domain.split('.')
    if len(parts) < 2:
        return {'sld': domain, 'tld': None, 'subdomain_count': 0, 'length': len(domain)}

    # After the check above there are always at least two labels
    tld = parts[-1]
    sld = parts[-2]
    subdomain_count = len(parts) - 2

    return {
        'sld': sld,
        'tld': tld,
        'subdomain_count': subdomain_count,
        'length': len(sld)
    }

# Apply domain parsing
domain_components = pd.DataFrame(df['domain'].apply(parse_domain).tolist(), index=df.index)
df = pd.concat([df, domain_components], axis=1)

print("✅ Domain components extracted")
display(df[['domain', 'sld', 'tld', 'subdomain_count', 'length']].head(10))
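
One caveat worth keeping in mind: a plain split on dots treats the last label as the TLD, which misreads multi-label suffixes such as `co.uk`. The sketch below shows the effect and one possible workaround with a small hand-written suffix set. `KNOWN_MULTI_SUFFIXES` and `split_registered_domain` are hypothetical names introduced here; a production parser would consult the full Public Suffix List instead.

```python
# Hypothetical, deliberately tiny suffix set; real code would use the Public Suffix List
KNOWN_MULTI_SUFFIXES = {"co.uk", "com.au", "co.jp"}

def split_registered_domain(domain):
    """Return (sld, suffix), honouring the multi-label suffixes listed above."""
    parts = domain.lower().split(".")
    if len(parts) >= 3 and ".".join(parts[-2:]) in KNOWN_MULTI_SUFFIXES:
        return parts[-3], ".".join(parts[-2:])
    if len(parts) >= 2:
        return parts[-2], parts[-1]
    return domain, None

# The naive rule picks 'co', which is not the organisation label
naive_sld = "www.bbc.co.uk".split(".")[-2]
print(naive_sld, split_registered_domain("www.bbc.co.uk"))
```

The synthetic dataset in this exercise only contains two-label domains, so the naive rule is sufficient here.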

### 2.2 Analyze TLD Distribution

**TODO**: Analyze the distribution of Top-Level Domains (TLDs) in the traffic.

In [None]:
# TODO: Count TLD occurrences and create visualization
# Hint: Use df['tld'].value_counts()
# Hint: Create a bar chart showing top 10 TLDs

# YOUR CODE HERE

# Example structure:
# tld_counts = ...
# plt.figure(figsize=(12, 6))
# plt.bar(...)
# plt.title('TLD Distribution in DNS Traffic')
# plt.xlabel('Top-Level Domain')
# plt.ylabel('Number of Queries')
# plt.show()
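
For the counting step alone, a minimal sketch with `collections.Counter` on a hand-made list might look like this. The `sample_domains` values are invented and stand in for the `tld` column.

```python
from collections import Counter

sample_domains = ["google.com", "zoom.us", "github.com", "example.org", "slack.com"]

# Take the last dot-separated label as the TLD (the same naive rule as parse_domain)
tld_counts = Counter(d.rsplit(".", 1)[-1] for d in sample_domains)

# most_common(n) returns (tld, count) pairs sorted by descending count
for tld, count in tld_counts.most_common(3):
    print(f"{tld:5s} {count}")
```

`df['tld'].value_counts()` gives the equivalent result as a pandas Series, which is easier to pass to a bar chart.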

### 2.3 Domain Length Analysis

**TODO**: Analyze the distribution of domain lengths (SLD length).

In [None]:
# TODO: Calculate length statistics and create histogram
# Hint: df['length'].describe() for statistics
# Hint: Use plt.hist() for distribution visualization

# YOUR CODE HERE

# Calculate statistics
print("Domain Length Statistics:")
print("=" * 50)
# Display mean, median, std, min, max
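
To see why both the mean and the median are worth reporting, consider this small sketch using the standard `statistics` module. The `lengths` list is invented: mostly short brand-like labels plus two long random-looking ones.

```python
import statistics

# Invented SLD lengths, not taken from the exercise data
lengths = [6, 9, 6, 8, 7, 6, 5, 13, 14, 6]

summary = {
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
    "std": statistics.stdev(lengths),   # sample standard deviation (n - 1), like pandas .std()
    "min": min(lengths),
    "max": max(lengths),
}

for name, value in summary.items():
    print(f"{name:8s}: {value:.2f}")
```

A mean noticeably above the median, as in this toy sample, suggests a right-skewed distribution with a few unusually long labels.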

### 2.4 Character Composition Analysis

Analyze the ratio of digits vs. alphabetic characters in domain names.

In [None]:
def analyze_character_composition(domain):
    """
    Analyze character composition of domain SLD.
    
    Returns:
        dict: Character composition metrics
    """
    # Use the second-to-last label so the result matches parse_domain()
    parts = domain.split('.')
    sld = parts[-2] if len(parts) >= 2 else domain
    
    digit_count = sum(c.isdigit() for c in sld)
    alpha_count = sum(c.isalpha() for c in sld)
    total = len(sld)
    
    return {
        'digit_ratio': digit_count / total if total > 0 else 0,
        'alpha_ratio': alpha_count / total if total > 0 else 0,
        'has_digits': digit_count > 0
    }

# Apply composition analysis
char_comp = pd.DataFrame(df['domain'].apply(analyze_character_composition).tolist(), index=df.index)
df = pd.concat([df, char_comp], axis=1)

# TODO: Visualize character composition
# Create a scatter plot or box plot showing digit_ratio distribution

# YOUR CODE HERE

## Task 3: Temporal Pattern Analysis

### 3.1 Queries by Hour of Day

Analyze when DNS queries occur throughout the day.

In [None]:
# Extract hour from timestamp
df['hour'] = df['timestamp'].dt.hour

# TODO: Count queries per hour and visualize
# Hint: df.groupby('hour').size()
# Hint: Create a line plot showing query volume over 24 hours

# YOUR CODE HERE

# Example structure:
# hourly_counts = df.groupby('hour').size()
# plt.figure(figsize=(14, 6))
# plt.plot(hourly_counts.index, hourly_counts.values, marker='o', linewidth=2)
# plt.title('DNS Query Volume by Hour of Day')
# plt.xlabel('Hour (0-23)')
# plt.ylabel('Number of Queries')
# plt.grid(True, alpha=0.3)
# plt.show()
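
One detail that is easy to miss: `groupby('hour').size()` only returns hours that actually occur in the data, so a quiet hour would silently disappear from the plot. The sketch below, on invented timestamps, fills the gaps with zeros using `reindex`.

```python
import pandas as pd

# Invented timestamps; only hours 9 and 14 have any queries
ts = pd.to_datetime([
    "2026-01-30 09:05:00", "2026-01-30 09:40:00",
    "2026-01-30 14:10:00", "2026-01-30 09:59:59",
])
toy = pd.DataFrame({"timestamp": ts})
toy["hour"] = toy["timestamp"].dt.hour

# reindex over all 24 hours so that empty hours appear with a count of 0
hourly_counts = toy.groupby("hour").size().reindex(range(24), fill_value=0)
print(hourly_counts[hourly_counts > 0])
```

With 10,000 synthetic queries every hour is likely to be populated, but the reindex step keeps the x-axis complete on sparser real logs.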

### 3.2 Business Hours vs. Off-Hours Analysis

**TODO**: Compare DNS patterns during business hours (8am-6pm) vs. off-hours.

In [None]:
# Define business hours
def categorize_time(hour):
    if 8 <= hour < 18:
        return 'Business Hours'
    else:
        return 'Off-Hours'

df['time_category'] = df['hour'].apply(categorize_time)

# TODO: Compare statistics between business hours and off-hours
# - Query volume
# - Unique domains
# - Query types distribution

# YOUR CODE HERE
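
One possible shape for this comparison, sketched on an invented five-row table, uses named aggregation for volume and distinct domains and `pd.crosstab` for the query type mix. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "hour": [9, 10, 14, 2, 23],
    "domain": ["google.com", "google.com", "slack.com", "x7k2q9zt.com", "google.com"],
    "query_type": ["A", "A", "AAAA", "A", "MX"],
})
toy["time_category"] = toy["hour"].apply(
    lambda h: "Business Hours" if 8 <= h < 18 else "Off-Hours"
)

# Query volume and number of distinct domains per category
comparison = toy.groupby("time_category").agg(
    queries=("domain", "size"),
    unique_domains=("domain", "nunique"),
)

# Row-normalised query type mix: each row sums to 1
type_mix = pd.crosstab(toy["time_category"], toy["query_type"], normalize="index")

print(comparison)
print(type_mix)
```

Normalising by row makes the two categories comparable even though business hours carry far more traffic.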

## Task 4: Create Baseline Profile

### 4.1 Calculate Baseline Statistics

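Document the normal characteristics of the observed DNS traffic. Note that the synthetic dataset includes roughly 5% random-looking domains, so these statistics describe mixed traffic unless you filter those domains out first.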

In [None]:
# TODO: Calculate comprehensive baseline statistics

baseline_profile = {
    'total_queries': len(df),
    'unique_domains': df['domain'].nunique(),
    'unique_source_ips': df['source_ip'].nunique(),
    
    # Domain length statistics
    'avg_domain_length': df['length'].mean(),
    'median_domain_length': df['length'].median(),
    'std_domain_length': df['length'].std(),
    
    # TODO: Add more statistics:
    # - Most common TLDs (top 5)
    # - Average digit ratio
    # - Business hours query percentage
    # - Query type distribution
}

# YOUR CODE HERE to add more statistics

print("=" * 80)
print("BASELINE PROFILE: LEGITIMATE DNS TRAFFIC")
print("=" * 80)
for key, value in baseline_profile.items():
    print(f"{key:30s}: {value}")

### 4.2 Define Anomaly Detection Thresholds

**TODO**: Based on baseline analysis, define thresholds for flagging suspicious domains.
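
As a hedged illustration of how a length threshold could be derived, the sketch below applies the "mean + 2 × std" rule and a percentile alternative to an invented sample. `baseline_lengths` is made up, and neither rule is prescribed by the exercise.

```python
import statistics

# Invented SLD lengths standing in for a baseline sample
baseline_lengths = [6, 9, 6, 8, 7, 6, 5, 6, 8, 9]

mean_len = statistics.mean(baseline_lengths)
std_len = statistics.stdev(baseline_lengths)

# "mean + 2 * std" rule; for roughly normal data about 97.7% of values fall below it
max_normal_length = mean_len + 2 * std_len

# Percentile-based alternative that makes no distributional assumption:
# quantiles(n=20) returns 19 cut points, the last one being the 95th percentile
p95_length = statistics.quantiles(baseline_lengths, n=20)[-1]

print(f"mean + 2*std    : {max_normal_length:.2f}")
print(f"95th percentile : {p95_length:.2f}")
```

Domain lengths are rarely normally distributed, so comparing both values against your histogram is a reasonable sanity check before settling on one.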

In [None]:
# TODO: Define anomaly detection rules
# Example thresholds (adjust based on your analysis):

anomaly_thresholds = {
    # Rule 1: Domain length
    'max_normal_length': None,  # TODO: Calculate from baseline (e.g., mean + 2*std)
    
    # Rule 2: Digit ratio
    'max_digit_ratio': None,  # TODO: Define threshold (e.g., 0.3)
    
    # Rule 3: Suspicious TLDs
    'suspicious_tlds': ['tk', 'ml', 'ga', 'cf', 'gq'],  # Free TLDs often abused (no leading dot, to match the 'tld' column)
    
    # TODO: Add more rules:
    # - Subdomain count threshold
    # - Off-hours query volume threshold
    # - Entropy threshold (if you calculate it)
}

# YOUR CODE HERE to calculate threshold values

print("=" * 80)
print("ANOMALY DETECTION THRESHOLDS")
print("=" * 80)
for key, value in anomaly_thresholds.items():
    print(f"{key:30s}: {value}")
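
If you decide to add the optional entropy rule, one common choice is the Shannon entropy of the characters in the SLD. The function below is a minimal sketch of that measure; `shannon_entropy` is a name introduced here, not part of the exercise template.

```python
import math
from collections import Counter

def shannon_entropy(label):
    """Shannon entropy of the character distribution, in bits per character."""
    if not label:
        return 0.0
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(shannon_entropy("google"))      # repeated letters lower the entropy
print(shannon_entropy("x7k2q9zt"))    # eight distinct characters give log2(8) = 3 bits
```

Entropy grows with label length as well as randomness, so it tends to work better alongside the length rule than as a replacement for it.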

### 4.3 Apply Anomaly Detection Rules

Test your thresholds by flagging potentially suspicious domains.

In [None]:
# TODO: Implement anomaly detection function

def flag_suspicious_domain(row, thresholds):
    """
    Flag domain as suspicious based on multiple rules.
    
    Returns:
        tuple: (is_suspicious, reasons)
    """
    reasons = []
    
    # TODO: Implement detection rules
    # Check length, digit ratio, TLD, etc.
    # Append reasons for each violated rule
    
    # YOUR CODE HERE
    
    is_suspicious = len(reasons) > 0
    return is_suspicious, reasons

# Apply detection
# df['suspicious'], df['reasons'] = zip(*df.apply(lambda row: flag_suspicious_domain(row, anomaly_thresholds), axis=1))

# Display flagged domains
# suspicious_domains = df[df['suspicious']]
# print(f"\n🚨 Flagged {len(suspicious_domains)} suspicious domains ({len(suspicious_domains)/len(df)*100:.2f}%)")
# display(suspicious_domains[['domain', 'length', 'digit_ratio', 'tld', 'reasons']].head(10))
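
The general pattern of collecting one reason per violated rule can be sketched on plain dictionaries as follows. The threshold values, the `check_rules` helper, and the two records are all invented for illustration; the exercise expects you to derive your own thresholds and adapt the idea to DataFrame rows.

```python
# Illustrative thresholds only, not derived from the exercise data
example_thresholds = {
    "max_normal_length": 10,
    "max_digit_ratio": 0.3,
    "suspicious_tlds": {"tk", "ml", "ga", "cf", "gq"},
}

def check_rules(record, thresholds):
    """Return the list of rule names that a record violates."""
    reasons = []
    if record["length"] > thresholds["max_normal_length"]:
        reasons.append("length")
    if record["digit_ratio"] > thresholds["max_digit_ratio"]:
        reasons.append("digit_ratio")
    if record["tld"] in thresholds["suspicious_tlds"]:
        reasons.append("tld")
    return reasons

records = [
    {"domain": "github.com", "length": 6, "digit_ratio": 0.0, "tld": "com"},
    {"domain": "a1b2c3d4e5f6.tk", "length": 12, "digit_ratio": 0.5, "tld": "tk"},
]
flags = {r["domain"]: check_rules(r, example_thresholds) for r in records}
print(flags)
```

Keeping the reasons as a list, rather than a single boolean, makes it easier to see which rule drives most of the alerts when you tune thresholds.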

## Deliverable: Summary Report

### Create a summary of your findings

In [None]:
# TODO: Generate summary report

print("="*80)
print("DNS TRAFFIC BASELINE ANALYSIS - SUMMARY REPORT")
print("="*80)

print("\n📊 DATASET OVERVIEW")
print("-" * 80)
print(f"Total DNS Queries: {len(df):,}")
print(f"Unique Domains: {df['domain'].nunique():,}")
print(f"Time Period: {df['timestamp'].min()} to {df['timestamp'].max()}")

# TODO: Add more sections:
# - DOMAIN CHARACTERISTICS
# - TEMPORAL PATTERNS
# - ANOMALY DETECTION RESULTS
# - RECOMMENDATIONS

# YOUR CODE HERE

## 🎓 Reflection Questions

Answer these questions in markdown cells below:

1. **What are the top 3 characteristics that distinguish legitimate domains in this dataset?**

2. **How would your baseline change if you analyzed DNS traffic from a different organization (e.g., university vs. financial company)?**

3. **What are the limitations of using static thresholds for anomaly detection?**

4. **If you deployed this system in a real SOC, what additional features would you add?**

---

### YOUR ANSWERS HERE:

**Answer 1:**

*[Your answer here]*

**Answer 2:**

*[Your answer here]*

**Answer 3:**

*[Your answer here]*

**Answer 4:**

*[Your answer here]*

---

## ✅ Submission Checklist

Before submitting, ensure:

- [ ] All code cells execute without errors
- [ ] At least 5 visualizations created (TLD chart, length histogram, temporal plot, etc.)
- [ ] Baseline statistics calculated and documented
- [ ] Anomaly detection thresholds defined with justification
- [ ] Reflection questions answered
- [ ] Summary report generated
- [ ] Code is well-commented

---

**Version**: 1.0  
**Last Updated**: January 31, 2026  
**Instructor Contact**: Via course forum