# Container Metrics Analysis - Day 2 Exploratory Analysis

**Project**: Predictive Autoscaling  
**Date**: October 14, 2025  
**Objective**: Analyze collected metrics for patterns and prepare for ML model training

## Overview
This notebook performs exploratory analysis on container metrics collected from:
- webapp (Flask application)
- db (PostgreSQL)
- cache (Redis)

### Metrics Collected:
1. **CPU Usage** - Container CPU utilization
2. **Memory Usage** - Container memory consumption
3. **Network Traffic** - RX/TX bytes per second
4. **HTTP Requests** - Application request rates

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import glob
import os

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.4f' % x)

print("✅ Libraries imported successfully")
print(f"📅 Analysis date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Libraries imported successfully
📅 Analysis date: 2025-11-06 19:47:27


## 1. Data Loading

Load the most recent metrics export from the `data/raw/` directory.

In [17]:
# Find the most recent metrics file
data_dir = '../data/raw'
metrics_files = glob.glob(os.path.join(data_dir, 'metrics_*.csv'))

if not metrics_files:
    print("❌ No metrics files found!")
else:
    # Get the most recent file
    latest_file = max(metrics_files, key=os.path.getctime)
    print(f"📁 Loading: {os.path.basename(latest_file)}")
    
    # Load data
    df = pd.read_csv(latest_file)

    print(df.columns)
    
    # Convert timestamp to datetime
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    
    # Sort by timestamp
    df = df.sort_values('timestamp').reset_index(drop=True)
    
    print(f"\n✅ Loaded {len(df):,} records")
    print(f"   Time range: {df['timestamp'].min()} to {df['timestamp'].max()}")
    print(f"   Duration: {df['timestamp'].max() - df['timestamp'].min()}")
    print(f"\n📊 Dataset shape: {df.shape}")
    print(f"   Columns: {df.shape[1]}")
    print(f"   Rows: {df.shape[0]:,}")

📁 Loading: metrics_20251106_201831.csv
Index(['label___name__', 'label_cpu', 'label_endpoint', 'label_id',
       'label_instance', 'label_job', 'label_method', 'label_status',
       'metric_name', 'timestamp', 'value'],
      dtype='object')

✅ Loaded 32,367 records
   Time range: 2025-11-06 19:48:31 to 2025-11-06 20:18:31
   Duration: 0 days 00:30:00

📊 Dataset shape: (32367, 11)
   Columns: 11
   Rows: 32,367
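
Note that `os.path.getctime` reports the last metadata change on Unix-like systems, so a copied or re-permissioned export can outrank a newer one. Since the loaded file is named `metrics_20251106_201831.csv`, a minimal sketch of selecting by the timestamp embedded in the name instead. The helper name `latest_by_name` is hypothetical, and the `metrics_YYYYMMDD_HHMMSS.csv` pattern is an assumption generalized from that one file name.

```python
import os
import re
from datetime import datetime

# Assumed naming pattern, generalized from the single file name seen above.
_NAME_RE = re.compile(r"metrics_(\d{8}_\d{6})\.csv$")

def latest_by_name(paths):
    """Return the path whose file name carries the newest timestamp, or None."""
    stamped = []
    for path in paths:
        match = _NAME_RE.search(os.path.basename(path))
        if match:
            stamp = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
            stamped.append((stamp, path))
    return max(stamped)[1] if stamped else None

# Illustrative paths only; the first and third are made up.
example = [
    "../data/raw/metrics_20251106_194500.csv",
    "../data/raw/metrics_20251106_201831.csv",
    "../data/raw/metrics_notes.csv",  # ignored: no timestamp in the name
]
print(latest_by_name(example))
```

Files that do not match the pattern are skipped rather than raising, so stray CSVs in `data/raw/` do not break the selection.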


## 2. Data Exploration

In [21]:
# Display basic information
print("=" * 80)
print("DATASET OVERVIEW")
print("=" * 80)

print("\n📋 Column Names:")
print(df.columns.tolist())

print("\n📊 Data Types:")
print(df.dtypes)

print("\n📈 Basic Statistics:")
print(df.describe())

print("\n🔍 First Few Rows:")
display(df.head(10))

print("\n🔍 Sample Data:")
display(df.sample(100))

DATASET OVERVIEW

📋 Column Names:
['label___name__', 'label_cpu', 'label_endpoint', 'label_id', 'label_instance', 'label_job', 'label_method', 'label_status', 'metric_name', 'timestamp', 'value']

📊 Data Types:
label___name__            object
label_cpu                 object
label_endpoint            object
label_id                  object
label_instance            object
label_job                 object
label_method              object
label_status             float64
metric_name               object
timestamp         datetime64[ns]
value                    float64
dtype: object

📈 Basic Statistics:
       label_status                      timestamp           value
count      484.0000                          32367      32367.0000
mean       275.0000  2025-11-06 20:03:31.208082432  120747944.9187
min        200.0000            2025-11-06 19:48:31          0.0000
25%        200.0000            2025-11-06 19:56:01          0.0007
50%        200.0000            2025-11-06 20:03

🔍 First Few Rows:

| | label___name__ | label_cpu | label_endpoint | label_id | label_instance | label_job | label_method | label_status | metric_name | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | total | | / | cadvisor:8080 | cadvisor | | | container_cpu | 2025-11-06 19:48:31 | 0.7842 |
| 1 | container_memory_usage_bytes | | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 19:48:31 | 241664.0 |
| 2 | | total | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_cpu | 2025-11-06 19:48:31 | 0.0012 |
| 3 | container_memory_usage_bytes | | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 19:48:31 | 266240.0 |
| 4 | container_memory_usage_bytes | | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 19:48:31 | 221184.0 |
| 5 | container_memory_usage_bytes | | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 19:48:31 | 166952960.0 |
| 6 | container_memory_usage_bytes | | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 19:48:31 | 166637568.0 |
| 7 | container_memory_usage_bytes | | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 19:48:31 | 217088.0 |
| 8 | | total | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_cpu | 2025-11-06 19:48:31 | 0.0012 |
| 9 | container_memory_usage_bytes | | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 19:48:31 | 51097600.0 |

🔍 Sample Data:

| | label___name__ | label_cpu | label_endpoint | label_id | label_instance | label_job | label_method | label_status | metric_name | timestamp | value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3325 | | total | | /system.slice/system-getty.slice/getty@tty1.se... | cadvisor:8080 | cadvisor | | | container_cpu | 2025-11-06 19:51:31 | 0.0 |
| 14653 | | total | | /system.slice/unattended-upgrades.service | cadvisor:8080 | cadvisor | | | container_cpu | 2025-11-06 20:02:01 | 0.0 |
| 21227 | | total | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_cpu | 2025-11-06 20:08:16 | 0.239 |
| 13004 | container_memory_usage_bytes | | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 20:00:31 | 347283456.0 |
| 28638 | container_memory_usage_bytes | | | /kubepods.slice/kubepods-besteffort.slice | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 20:15:16 | 134721536.0 |
| 28861 | container_memory_usage_bytes | | | /user.slice/user-1001.slice/user@1001.service/... | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 20:15:16 | 491626496.0 |
| 8823 | | total | | /system.slice/polkit.service | cadvisor:8080 | cadvisor | | | container_cpu | 2025-11-06 19:56:46 | 0.0 |
| 23066 | container_memory_usage_bytes | | | /libpod_parent/libpod-f9f4525102564eb9d0a275ee... | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 20:10:01 | 229376.0 |
| 27764 | container_memory_usage_bytes | | | /user.slice/user-1000.slice/user@1000.service/... | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 20:14:16 | 323584.0 |
| 22286 | container_memory_usage_bytes | | | /kubepods.slice/kubepods-besteffort.slice | cadvisor:8080 | cadvisor | | | container_memory | 2025-11-06 20:09:16 | 134823936.0 |


In [20]:
# Check for missing values
print("=" * 80)
print("MISSING VALUES")
print("=" * 80)

missing = df.isnull().sum()
missing_pct = 100 * df.isnull().sum() / len(df)
missing_df = pd.DataFrame({
    'Column': missing.index,
    'Missing Count': missing.values,
    'Percentage': missing_pct.values
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_df) > 0:
    print(missing_df.to_string(index=False))
else:
    print("✅ No missing values!")

MISSING VALUES
        Column  Missing Count  Percentage
  label_method          31883     98.5046
label_endpoint          31883     98.5046
  label_status          31883     98.5046
     label_cpu          18659     57.6482
label___name__          14192     43.8471
      label_id            484      1.4954


In [None]:
# Analyze metrics distribution
print("=" * 80)
print("METRICS DISTRIBUTION")
print("=" * 80)

print("\n📊 Records per Metric:")
metric_counts = df['metric_name'].value_counts()
for metric, count in metric_counts.items():
    percentage = 100 * count / len(df)
    print(f"   {metric:30s}: {count:6,} ({percentage:5.1f}%)")

print("\n🏷️  Unique Labels:")
label_cols = [col for col in df.columns if col.startswith('label_')]
for col in label_cols:
    unique_vals = df[col].nunique()
    print(f"   {col:30s}: {unique_vals} unique values")
    if unique_vals < 20:
        print(f"      Values: {df[col].unique().tolist()}")

## 3. Time Series Visualization

Visualize the collected metrics over time to identify patterns and anomalies.

In [None]:
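# Create one time series subplot per metric
metrics = df['metric_name'].unique()

# Size the grid from the data; squeeze=False keeps `axes` indexable when only one metric exists
fig, axes = plt.subplots(len(metrics), 1, figsize=(15, 4 * len(metrics)), squeeze=False)
axes = axes.ravel()
fig.suptitle('Container Metrics Over Time', fontsize=16, fontweight='bold')

for idx, metric in enumerate(metrics):
    ax = axes[idx]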
    metric_data = df[df['metric_name'] == metric]
    
    # Group by timestamp and aggregate (in case of multiple series)
    time_series = metric_data.groupby('timestamp')['value'].agg(['mean', 'min', 'max'])
    
    # Plot the mean with a shaded min-max envelope
    ax.plot(time_series.index, time_series['mean'], label='Mean', linewidth=2)
    ax.fill_between(time_series.index, time_series['min'], time_series['max'], 
                     alpha=0.3, label='Min-Max Range')
    
    ax.set_title(f'{metric.upper().replace("_", " ")}', fontweight='bold')
    ax.set_xlabel('Time')
    ax.set_ylabel('Value')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.3)
    
    # Rotate x-axis labels
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()

print("✅ Time series plots generated")

## 4. Container-Specific Analysis

Break down metrics by individual containers to understand their behavior.

In [None]:
# Analyze CPU usage by container
print("=" * 80)
print("CPU USAGE BY CONTAINER")
print("=" * 80)

cpu_data = df[df['metric_name'] == 'container_cpu']

if 'label_name' in cpu_data.columns:
    # Filter for our containers
    containers = ['metrics-webapp', 'metrics-db', 'metrics-cache']
    
    fig, axes = plt.subplots(3, 1, figsize=(15, 12))
    fig.suptitle('CPU Usage by Container', fontsize=16, fontweight='bold')
    
    for idx, container in enumerate(containers):
        container_data = cpu_data[cpu_data['label_name'] == container]
        
        if len(container_data) > 0:
            ax = axes[idx]
            
            # Aggregate by timestamp
            ts = container_data.groupby('timestamp')['value'].mean()
            
            ax.plot(ts.index, ts.values, linewidth=2)
            ax.set_title(f'{container}', fontweight='bold')
            ax.set_xlabel('Time')
            ax.set_ylabel('CPU Usage (cores)')
            ax.grid(True, alpha=0.3)
            
            # Add statistics
            mean_cpu = ts.mean()
            max_cpu = ts.max()
            ax.axhline(mean_cpu, color='red', linestyle='--', label=f'Mean: {mean_cpu:.4f}')
            ax.legend()
            
            print(f"\n{container}:")
            print(f"   Mean CPU: {mean_cpu:.4f} cores")
            print(f"   Max CPU:  {max_cpu:.4f} cores")
            print(f"   Std Dev:  {ts.std():.4f}")
            
            plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()
else:
    print("⚠️  No container name labels found in CPU data")
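
The loaded export has no `label_name` column (its label columns are `label___name__`, `label_cpu`, `label_endpoint`, `label_id`, `label_instance`, `label_job`, `label_method`, and `label_status`), so the check above falls through to the warning. A minimal sketch of a per-container breakdown, assuming the cgroup path in `label_id` is an acceptable per-container key. Mapping those paths to names such as `metrics-webapp` needs information that is not in this export. The frame below is synthetic, with made-up ids and values.

```python
import pandas as pd

# Synthetic CPU rows keyed by a cgroup-style id; ids and values are made up.
cpu_demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-11-06 19:48:31", "2025-11-06 19:48:46"] * 2),
    "label_id": ["/libpod_parent/libpod-aaa"] * 2 + ["/libpod_parent/libpod-bbb"] * 2,
    "value": [0.10, 0.30, 0.02, 0.04],
})

# One column per container id, one row per scrape timestamp.
per_container = cpu_demo.pivot_table(
    index="timestamp", columns="label_id", values="value", aggfunc="mean"
)

# Mean / max / std per container id, busiest first.
stats = per_container.agg(["mean", "max", "std"]).T.sort_values("mean", ascending=False)
print(stats)
```

The same pivot works for the memory cell by filtering on `container_memory` and dividing by `1024 ** 2` before aggregating.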

In [None]:
# Analyze memory usage by container
print("=" * 80)
print("MEMORY USAGE BY CONTAINER")
print("=" * 80)

mem_data = df[df['metric_name'] == 'container_memory']

if 'label_name' in mem_data.columns and len(mem_data) > 0:
    containers = ['metrics-webapp', 'metrics-db', 'metrics-cache']
    
    fig, axes = plt.subplots(3, 1, figsize=(15, 12))
    fig.suptitle('Memory Usage by Container', fontsize=16, fontweight='bold')
    
    for idx, container in enumerate(containers):
        container_data = mem_data[mem_data['label_name'] == container]
        
        if len(container_data) > 0:
            ax = axes[idx]
            
            # Aggregate and convert to MB
            ts = container_data.groupby('timestamp')['value'].mean() / (1024 ** 2)
            
            ax.plot(ts.index, ts.values, linewidth=2, color='green')
            ax.set_title(f'{container}', fontweight='bold')
            ax.set_xlabel('Time')
            ax.set_ylabel('Memory Usage (MB)')
            ax.grid(True, alpha=0.3)
            
            # Add statistics
            mean_mem = ts.mean()
            max_mem = ts.max()
            ax.axhline(mean_mem, color='red', linestyle='--', label=f'Mean: {mean_mem:.1f} MB')
            ax.legend()
            
            print(f"\n{container}:")
            print(f"   Mean Memory: {mean_mem:.1f} MB")
            print(f"   Max Memory:  {max_mem:.1f} MB")
            print(f"   Std Dev:     {ts.std():.1f} MB")
            
            plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()
else:
    print("⚠️  No container memory data found")

## 5. HTTP Request Analysis

Analyze HTTP request patterns to understand application load.

In [None]:
# Analyze HTTP requests
print("=" * 80)
print("HTTP REQUEST PATTERNS")
print("=" * 80)

http_data = df[df['metric_name'] == 'http_requests']

if len(http_data) > 0:
    # Plot request rate over time
    fig, axes = plt.subplots(2, 1, figsize=(15, 10))
    
    # Overall request rate
    ax1 = axes[0]
    ts = http_data.groupby('timestamp')['value'].sum()
    ax1.plot(ts.index, ts.values, linewidth=2, color='blue')
    ax1.set_title('Total HTTP Request Rate', fontweight='bold')
    ax1.set_xlabel('Time')
    ax1.set_ylabel('Requests/second')
    ax1.grid(True, alpha=0.3)
    plt.setp(ax1.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    # By endpoint
    ax2 = axes[1]
    if 'label_endpoint' in http_data.columns:
        endpoints = http_data['label_endpoint'].unique()
        for endpoint in endpoints:
            endpoint_data = http_data[http_data['label_endpoint'] == endpoint]
            ts = endpoint_data.groupby('timestamp')['value'].mean()
            ax2.plot(ts.index, ts.values, label=endpoint, linewidth=2)
        
        ax2.set_title('Request Rate by Endpoint', fontweight='bold')
        ax2.set_xlabel('Time')
        ax2.set_ylabel('Requests/second')
        ax2.legend(loc='upper right')
        ax2.grid(True, alpha=0.3)
        plt.setp(ax2.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()
    
    # Statistics
    print(f"\nTotal HTTP Requests: {http_data['value'].sum():,.0f}")
    print(f"Mean Request Rate: {http_data['value'].mean():.2f} req/s")
    print(f"Peak Request Rate: {http_data['value'].max():.2f} req/s")
    
    if 'label_endpoint' in http_data.columns:
        print("\n📊 Requests by Endpoint:")
        endpoint_totals = http_data.groupby('label_endpoint')['value'].sum().sort_values(ascending=False)
        for endpoint, total in endpoint_totals.items():
            percentage = 100 * total / endpoint_totals.sum()
            print(f"   {endpoint:20s}: {total:8,.0f} ({percentage:5.1f}%)")
else:
    print("⚠️  No HTTP request data found")

## 6. Pattern Detection

Identify patterns such as spikes, periodic behavior, and anomalies.

In [None]:
# Detect patterns in the data
print("=" * 80)
print("PATTERN DETECTION")
print("=" * 80)

# Focus on CPU usage for pattern detection
cpu_data = df[df['metric_name'] == 'container_cpu']

if len(cpu_data) > 0:
    # Aggregate CPU across all containers
    cpu_ts = cpu_data.groupby('timestamp')['value'].sum()
    
    # Calculate rolling statistics
    window = 20  # 20 data points = 5 minutes (15s intervals)
    
    cpu_ts_df = pd.DataFrame({
        'value': cpu_ts.values,
        'rolling_mean': cpu_ts.rolling(window=window, center=True).mean(),
        'rolling_std': cpu_ts.rolling(window=window, center=True).std()
    }, index=cpu_ts.index)
    
    # Detect spikes (values > 2 std dev from rolling mean)
    cpu_ts_df['spike'] = (
        cpu_ts_df['value'] > cpu_ts_df['rolling_mean'] + 2 * cpu_ts_df['rolling_std']
    )
    
    # Plot
    fig, ax = plt.subplots(figsize=(15, 6))
    
    ax.plot(cpu_ts_df.index, cpu_ts_df['value'], label='CPU Usage', linewidth=1.5)
    ax.plot(cpu_ts_df.index, cpu_ts_df['rolling_mean'], 
            label=f'Rolling Mean ({window} samples)', linestyle='--', linewidth=2)
    ax.fill_between(cpu_ts_df.index,
                     cpu_ts_df['rolling_mean'] - 2 * cpu_ts_df['rolling_std'],
                     cpu_ts_df['rolling_mean'] + 2 * cpu_ts_df['rolling_std'],
                     alpha=0.2, label='±2 Std Dev')
    
    # Mark spikes
    spikes = cpu_ts_df[cpu_ts_df['spike']]
    if len(spikes) > 0:
        ax.scatter(spikes.index, spikes['value'], color='red', s=100, 
                   label=f'Spikes ({len(spikes)})', zorder=5)
    
    ax.set_title('CPU Usage Pattern Detection', fontsize=14, fontweight='bold')
    ax.set_xlabel('Time')
    ax.set_ylabel('Total CPU Usage (cores)')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.3)
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    print(f"\n🔍 Pattern Analysis:")
    print(f"   Total data points: {len(cpu_ts_df):,}")
    print(f"   Spikes detected: {cpu_ts_df['spike'].sum()}")
    print(f"   Spike rate: {100 * cpu_ts_df['spike'].sum() / len(cpu_ts_df):.2f}%")
    print(f"   Mean CPU: {cpu_ts_df['value'].mean():.4f} cores")
    print(f"   Std Dev: {cpu_ts_df['value'].std():.4f}")
    print(f"   Coefficient of Variation: {cpu_ts_df['value'].std() / cpu_ts_df['value'].mean():.4f}")
else:
    print("⚠️  No CPU data available for pattern detection")
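
Note that `center=True` makes each rolling statistic depend on samples that arrive after the point being tested, and it leaves roughly ten points at each end of the series without a band. That is fine for retrospective exploration, but a feature meant for forecasting cannot look ahead. A sketch of a trailing variant, assuming the same 20-sample window and 2-sigma threshold. The function name `trailing_spikes` and the synthetic series are illustrative only.

```python
import numpy as np
import pandas as pd

def trailing_spikes(series, window=20, k=2.0):
    """Flag points more than k std devs above the mean of the preceding window."""
    # shift(1) excludes the current sample from its own baseline.
    baseline = series.shift(1).rolling(window=window, min_periods=window)
    # Comparisons against NaN are False, so the warm-up period is never flagged.
    return series > baseline.mean() + k * baseline.std()

# Synthetic series: low noise around 0.5 with one injected spike at position 60.
rng = np.random.default_rng(0)
idx = pd.date_range("2025-11-06 19:48:31", periods=120, freq=pd.Timedelta(seconds=15))
cpu = pd.Series(0.5 + 0.01 * rng.standard_normal(120), index=idx)
cpu.iloc[60] = 1.5

flags = trailing_spikes(cpu)
print(flags.sum(), flags.iloc[60])
```

Because the spike enters the baseline of the following 20 samples, the band widens right after it. A robust statistic such as a rolling median would reduce that effect if it turns out to matter.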

## 7. Correlation Analysis

Analyze correlations between different metrics and containers.

In [None]:
# Create a pivot table for correlation analysis
print("=" * 80)
print("CORRELATION ANALYSIS")
print("=" * 80)

# Prepare data for correlation
# Pivot to get each metric/container as a column
pivot_data = []

for metric in df['metric_name'].unique():
    metric_df = df[df['metric_name'] == metric]
    
    if 'label_name' in metric_df.columns:
        # Group by timestamp and container
        for container in metric_df['label_name'].unique():
            container_df = metric_df[metric_df['label_name'] == container]
            ts = container_df.groupby('timestamp')['value'].mean()
            pivot_data.append({
                'series': ts,
                'name': f'{metric}_{container}'
            })
    else:
        # Just aggregate by timestamp
        ts = metric_df.groupby('timestamp')['value'].sum()
        pivot_data.append({
            'series': ts,
            'name': metric
        })

if pivot_data:
    # Combine into DataFrame
    corr_df = pd.DataFrame({item['name']: item['series'] for item in pivot_data})
    
    # Calculate correlation matrix
    corr_matrix = corr_df.corr()
    
    # Plot heatmap
    plt.figure(figsize=(14, 12))
    sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
                center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
    plt.title('Metric Correlation Heatmap', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Print high correlations
    print("\n🔗 High Correlations (|r| > 0.7):")
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            corr_val = corr_matrix.iloc[i, j]
            if abs(corr_val) > 0.7:
                print(f"   {corr_matrix.columns[i]:40s} <-> {corr_matrix.columns[j]:40s}: {corr_val:6.3f}")
else:
    print("⚠️  Insufficient data for correlation analysis")
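
Because the export has no `label_name` column, the loop above always takes its `else` branch and sums each metric across every series. A sketch of a per-series alternative using a single `pivot_table` call, assuming `label_id` is the series key. Rows without a `label_id` get a placeholder key (`"all"`, a name chosen here for illustration) so that grouping does not drop them. The data is synthetic.

```python
import numpy as np
import pandas as pd

ts = pd.date_range("2025-11-06 19:48:31", periods=6, freq=pd.Timedelta(seconds=15))
load = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 2.0])  # made-up load shape

demo = pd.concat([
    pd.DataFrame({"timestamp": ts, "metric_name": "container_cpu",
                  "label_id": "/a", "value": 0.1 * load}),
    pd.DataFrame({"timestamp": ts, "metric_name": "container_memory",
                  "label_id": "/a", "value": 1e6 * load + 5e7}),
    pd.DataFrame({"timestamp": ts, "metric_name": "http_requests",
                  "label_id": np.nan, "value": 10.0 - load}),
], ignore_index=True)

# NaN keys are dropped by grouping, so give label-less rows a placeholder key.
demo["label_id"] = demo["label_id"].fillna("all")

# One column per (metric, series) pair, one row per timestamp.
wide = demo.pivot_table(index="timestamp", columns=["metric_name", "label_id"],
                        values="value", aggfunc="mean")
wide.columns = [f"{metric}_{key}" for metric, key in wide.columns]

corr = wide.corr()
print(corr.round(2))
```

On the real export this produces one column per cgroup path, so it is worth filtering to the containers of interest before plotting the heatmap.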

## 8. Summary and Next Steps

### Key Findings:
- Document any interesting patterns observed
- Note any anomalies or unexpected behavior
- Identify metrics that are most variable

### Next Steps:
1. ✅ Collect additional data with periodic pattern
2. Create sliding window features (15-60 min windows) -> src/preprocessing/sliding_windows.py
3. Build initial LSTM prototype for CPU prediction -> src/models/lstm_baseline.py
4. Evaluate baseline model performance
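
For step 2, a minimal sketch of turning one evenly sampled series into supervised (window, target) pairs. At the ~15-second sampling interval, a 15-minute window is 60 samples. The function name `make_windows` and the one-step-ahead target are assumptions for illustration; the planned `src/preprocessing/sliding_windows.py` may be organised differently.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(values, window, horizon=1):
    """Return (X, y): X[i] holds `window` consecutive samples,
    y[i] is the sample `horizon` steps after the end of X[i]."""
    values = np.asarray(values, dtype=float)
    if len(values) < window + horizon:
        raise ValueError("series too short for the requested window and horizon")
    # Keep only the windows that still have a target `horizon` steps ahead.
    n_pairs = len(values) - window - horizon + 1
    X = sliding_window_view(values, window)[:n_pairs]
    y = values[window + horizon - 1:]
    return X, y

series = np.arange(120.0)  # stand-in for 30 minutes of 15 s samples
X, y = make_windows(series, window=60)
print(X.shape, y.shape)
```

`sliding_window_view` returns a read-only view, so `X` should be copied before any in-place scaling.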

In [None]:
# Summary statistics
print("=" * 80)
print("SUMMARY REPORT")
print("=" * 80)

print(f"\n📊 Dataset Overview:")
print(f"   Total records: {len(df):,}")
print(f"   Duration: {df['timestamp'].max() - df['timestamp'].min()}")
print(f"   Unique metrics: {df['metric_name'].nunique()}")
print(f"   Sampling interval: ~15 seconds")

print(f"\n📈 Metrics Summary:")
for metric in df['metric_name'].unique():
    metric_df = df[df['metric_name'] == metric]
    print(f"\n   {metric.upper()}:")
    print(f"      Records: {len(metric_df):,}")
    print(f"      Mean: {metric_df['value'].mean():.4f}")
    print(f"      Std: {metric_df['value'].std():.4f}")
    print(f"      Min: {metric_df['value'].min():.4f}")
    print(f"      Max: {metric_df['value'].max():.4f}")

print(f"\n✅ Analysis complete!")
print(f"   Ready for feature engineering and model training")