# SMF 30 Binary Analysis - Complete Guide

This notebook demonstrates how to analyze **raw SMF 30 binary dumps** from a z/OS mainframe using the SMF30 binary parser.

## Overview

SMF Type 30 records contain **Job and Step completion statistics** including:
- CPU time and service units
- Memory usage (real and virtual)
- I/O operations (EXCP counts)
- Job timing information
- Resource consumption metrics

## Prerequisites

1. **Raw SMF 30 dump file** from a z/OS mainframe
2. **Python libraries**: pandas, matplotlib, numpy, and openpyxl (used for the Excel export in section 10)
3. **SMF30 binary parser** (included in this workspace)

Let's get started!

## 1. Import Required Libraries

In [None]:
import sys
import struct
from pathlib import Path
from datetime import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Import SMF30 modules from current workspace
sys.path.insert(0, str(Path.cwd()))
from smf30_binary_parser import SMFBinaryParser
from smf30_structures import *

# Configure plotting style
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✓ All libraries imported successfully")
print(f"✓ Working directory: {Path.cwd()}")

## 2. Load Raw Binary SMF 30 Dump

Specify the path to your **raw SMF 30 binary dump** file from the mainframe.

In [None]:
# Specify your SMF 30 binary dump file path
# Use small sample for faster execution, or "dumpsample.bin" for full 12.9MB dump
dump_file = "dumpsample_small.bin"  # Small 100KB sample for quick testing

# Verify file exists
dump_path = Path(dump_file)
if dump_path.exists():
    file_size_mb = dump_path.stat().st_size / (1024 * 1024)
    print(f"✓ Found dump file: {dump_path}")
    print(f"  Size: {file_size_mb:.2f} MB")
else:
    print(f"⚠ Dump file not found: {dump_file}")
    print("  Please update the path to your SMF 30 binary dump")
    print("\n  To obtain SMF dumps from z/OS, see: BINARY_DUMP_GUIDE.md")

## 3. Parse Binary SMF 30 Records

The parser automatically handles:
- **RDW (Record Descriptor Word)** - Variable-length record headers
- **EBCDIC to ASCII** conversion (code page 500)
- **Big-endian binary** unpacking
- **Subtype detection** at offset 21
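
The first three items in the list above can be illustrated with the standard library alone. The sketch below is not the `SMFBinaryParser` implementation. It assumes each record starts with a 4-byte RDW whose first two bytes hold the big-endian record length (including the RDW itself), and it uses a made-up payload layout (an 8-byte EBCDIC name followed by a 2-byte integer) rather than the real SMF 30 field offsets. The names `split_rdw_records` and `demo_payload` are hypothetical.

```python
import struct


def split_rdw_records(data: bytes):
    """Yield record payloads from a buffer of RDW-prefixed records.

    Assumption: the RDW is 4 bytes, a big-endian 2-byte length that
    includes the RDW itself, followed by 2 segment/reserved bytes.
    """
    offset = 0
    while offset + 4 <= len(data):
        length, _segment = struct.unpack_from(">HH", data, offset)
        if length < 4 or offset + length > len(data):
            break  # truncated or corrupt record: stop instead of guessing
        yield data[offset + 4 : offset + length]
        offset += length


# Synthetic payload for illustration only (NOT the real SMF 30 layout):
# an 8-character name encoded in EBCDIC code page 500, then a 2-byte integer.
demo_payload = "JOB00001".encode("cp500") + struct.pack(">H", 5)
demo_record = struct.pack(">HH", 4 + len(demo_payload), 0) + demo_payload

# Two identical records back to back, as they would sit in a dump file.
records = list(split_rdw_records(demo_record * 2))

job_name = records[0][:8].decode("cp500")          # EBCDIC -> str
(subtype,) = struct.unpack_from(">H", records[0], 8)  # big-endian unsigned short

print(len(records), job_name, subtype)
```

Stopping at the first inconsistent length keeps a damaged dump from producing misaligned records; a production parser would more likely log the offset and report the problem.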

In [None]:
# Initialize the binary parser
parser = SMFBinaryParser(dump_file)

# Parse all records from the dump
print("Parsing SMF 30 binary dump...")
records_by_subtype = parser.parse_dump_file()

# Display summary
print("\n" + "="*60)
print("PARSING SUMMARY")
print("="*60)
total_records = sum(len(records) for records in records_by_subtype.values())
print(f"Total records parsed: {total_records}")
print("\nRecords by Subtype:")
for subtype, records in sorted(records_by_subtype.items()):
    if records:
        print(f"  Subtype {subtype}: {len(records):4d} records")

## 4. Convert to Pandas DataFrames

Create DataFrames for each subtype for easier analysis.

In [None]:
# Convert parsed records to DataFrames
dataframes = {}

for subtype, records in records_by_subtype.items():
    if records:
        # Convert to dictionaries
        data = [record.to_dict() for record in records]
        df = pd.DataFrame(data)
        dataframes[subtype] = df
        print(f"Subtype {subtype}: {len(df)} rows, {len(df.columns)} columns")

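# Store the subtypes used in the analysis below.
# Note: IBM's SMF documentation defines subtype 1 as job start and subtype 2 as
# interval data; change the key below if your dump follows that numbering.
df_interval = dataframes.get(1, pd.DataFrame())  # Subtype 1 (treated as interval data in this notebook)
df_step = dataframes.get(4, pd.DataFrame())      # Subtype 4: Step completion
df_job = dataframes.get(5, pd.DataFrame())       # Subtype 5: Job completion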

print(f"\n✓ Created {len(dataframes)} DataFrames")

## 5. Inspect Sample Records

Let's examine the structure and data from parsed records.

In [None]:
# Display a sample of the Subtype 1 records held in df_interval
if not df_interval.empty:
    print("Subtype 1 Records - Sample")
    print("="*80)
    print(df_interval[['job_name', 'program_name', 'step_name', 'cpu_time_ms',
                       'elapsed_time_ms', 'excp_count']].head(10))
    print(f"\nTotal records: {len(df_interval)}")
else:
    print("No Subtype 1 records found")

# Display Step Completion if available
if not df_step.empty:
    print("\n\nStep Completion Records (Subtype 4) - Sample")
    print("="*80)
    print(df_step[['job_name', 'step_name', 'program_name', 'cpu_time_ms']].head(5))
    print(f"\nTotal Step records: {len(df_step)}")

## 6. Analyze CPU Usage

Analyze CPU consumption across jobs.
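
If a job contributes more than one record to the DataFrame, per-record figures are not per-job figures, and the records need to be grouped by job first. The sketch below shows that grouping with plain dictionaries. The rows are invented sample data shaped like the `job_name` and `cpu_time_ms` fields used in this notebook, and `total_cpu_seconds_by_job` is a hypothetical helper, not part of the parser.

```python
from collections import defaultdict


def total_cpu_seconds_by_job(rows):
    """Sum cpu_time_ms per job_name and return (job, seconds), largest first."""
    totals_ms = defaultdict(int)
    for row in rows:
        totals_ms[row["job_name"]] += row["cpu_time_ms"]
    ranked = sorted(totals_ms.items(), key=lambda item: item[1], reverse=True)
    return [(job, ms / 1000) for job, ms in ranked]


# Invented rows for illustration; real rows would come from record.to_dict().
sample_rows = [
    {"job_name": "PAYROLL", "cpu_time_ms": 1500},
    {"job_name": "BACKUP", "cpu_time_ms": 400},
    {"job_name": "PAYROLL", "cpu_time_ms": 2500},
]

ranking = total_cpu_seconds_by_job(sample_rows)
for job, seconds in ranking:
    print(f"{job:<8} {seconds:8.2f} s")
```

With pandas, the same grouping is `df_interval.groupby('job_name')['cpu_time_ms'].sum()`, following the pattern this notebook already uses for `program_name`.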

In [None]:
if not df_interval.empty:
    # CPU statistics (convert ms to seconds)
    df_interval['cpu_time_sec'] = df_interval['cpu_time_ms'] / 1000
    print("CPU Usage Statistics")
    print("="*60)
    print(f"Total CPU time: {df_interval['cpu_time_sec'].sum():.2f} seconds")
    print(f"Average CPU per job: {df_interval['cpu_time_sec'].mean():.2f} seconds")
    print(f"Max CPU (single job): {df_interval['cpu_time_sec'].max():.2f} seconds")
    print(f"Min CPU (single job): {df_interval['cpu_time_sec'].min():.2f} seconds")
    
    # Top CPU consumers
    print("\n\nTop 10 CPU Consuming Jobs")
    print("="*60)
    top_cpu = df_interval.nlargest(10, 'cpu_time_ms')[['job_name', 'program_name', 'cpu_time_sec']]
    print(top_cpu.to_string(index=False))
else:
    print("No job data available for CPU analysis")

## 7. Visualize CPU Distribution

In [None]:
if not df_interval.empty:
    # Convert ms to seconds for visualization
    df_interval['cpu_time_sec'] = df_interval['cpu_time_ms'] / 1000
    df_interval['elapsed_time_sec'] = df_interval['elapsed_time_ms'] / 1000
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # 1. CPU Time Distribution
    axes[0, 0].hist(df_interval['cpu_time_sec'], bins=50, color='steelblue', edgecolor='black')
    axes[0, 0].set_xlabel('CPU Time (seconds)')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].set_title('CPU Time Distribution Across Jobs')
    axes[0, 0].grid(True, alpha=0.3)
    
    # 2. Top 15 CPU Consumers
    top15 = df_interval.nlargest(15, 'cpu_time_sec')
    axes[0, 1].barh(range(len(top15)), top15['cpu_time_sec'], color='coral')
    axes[0, 1].set_yticks(range(len(top15)))
    axes[0, 1].set_yticklabels(top15['job_name'], fontsize=8)
    axes[0, 1].set_xlabel('CPU Time (seconds)')
    axes[0, 1].set_title('Top 15 CPU Consuming Jobs')
    axes[0, 1].grid(True, alpha=0.3, axis='x')
    
    # 3. CPU vs Elapsed Time
    axes[1, 0].scatter(df_interval['elapsed_time_sec'], df_interval['cpu_time_sec'], 
                       alpha=0.6, s=30, color='green')
    axes[1, 0].set_xlabel('Elapsed Time (seconds)')
    axes[1, 0].set_ylabel('CPU Time (seconds)')
    axes[1, 0].set_title('CPU Time vs Elapsed Time')
    axes[1, 0].grid(True, alpha=0.3)
    
    # 4. CPU by Program Name (top 10 programs)
    if 'program_name' in df_interval.columns:
        program_cpu = df_interval.groupby('program_name')['cpu_time_sec'].sum().nlargest(10)
        axes[1, 1].bar(range(len(program_cpu)), program_cpu.values, color='purple')
        axes[1, 1].set_xticks(range(len(program_cpu)))
        axes[1, 1].set_xticklabels(program_cpu.index, rotation=45, ha='right', fontsize=8)
        axes[1, 1].set_ylabel('Total CPU Time (seconds)')
        axes[1, 1].set_title('CPU Usage by Program (Top 10)')
        axes[1, 1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    Path('reports').mkdir(exist_ok=True)  # savefig fails if the directory is missing
    plt.savefig('reports/cpu_analysis.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("\n✓ CPU visualization saved to: reports/cpu_analysis.png")
else:
    print("No job data available for visualization")

## 8. Analyze Memory Usage

Examine real and virtual storage consumption.

In [None]:
if not df_job.empty and 'real_storage' in df_job.columns:
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Real Storage Usage (dividing by 1024 assumes the parser reports storage in KB)
    axes[0].hist(df_job['real_storage'] / 1024, bins=40, color='teal', edgecolor='black')
    axes[0].set_xlabel('Real Storage (MB)')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Real Storage Usage Distribution')
    axes[0].grid(True, alpha=0.3)
    
    # Virtual Storage Usage
    if 'virtual_storage' in df_job.columns:
        axes[1].hist(df_job['virtual_storage'] / 1024, bins=40, color='orange', edgecolor='black')
        axes[1].set_xlabel('Virtual Storage (MB)')
        axes[1].set_ylabel('Frequency')
        axes[1].set_title('Virtual Storage Usage Distribution')
        axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    Path('reports').mkdir(exist_ok=True)  # savefig fails if the directory is missing
    plt.savefig('reports/memory_analysis.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    # Statistics
    print("Memory Usage Statistics")
    print("="*60)
    print(f"Average Real Storage: {df_job['real_storage'].mean() / 1024:.2f} MB")
    print(f"Max Real Storage: {df_job['real_storage'].max() / 1024:.2f} MB")
    if 'virtual_storage' in df_job.columns:
        print(f"Average Virtual Storage: {df_job['virtual_storage'].mean() / 1024:.2f} MB")
        print(f"Max Virtual Storage: {df_job['virtual_storage'].max() / 1024:.2f} MB")
    print("\n✓ Memory visualization saved to: reports/memory_analysis.png")
else:
    print("Memory fields not available in parsed data")

## 9. Analyze I/O Operations (EXCP)

In [None]:
if not df_interval.empty and 'excp_count' in df_interval.columns:
    # EXCP Statistics
    print("I/O (EXCP) Statistics")
    print("="*60)
    print(f"Total EXCP Count: {df_interval['excp_count'].sum():,}")
    print(f"Average EXCP per job: {df_interval['excp_count'].mean():.2f}")
    print(f"Max EXCP (single job): {df_interval['excp_count'].max():,}")
    
    # Top I/O intensive jobs
    print("\n\nTop 10 I/O Intensive Jobs")
    print("="*60)
    top_io = df_interval.nlargest(10, 'excp_count')[['job_name', 'program_name', 'excp_count']]
    print(top_io.to_string(index=False))
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # EXCP Distribution
    axes[0].hist(df_interval['excp_count'], bins=50, color='darkgreen', edgecolor='black')
    axes[0].set_xlabel('EXCP Count')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('I/O Operations (EXCP) Distribution')
    axes[0].grid(True, alpha=0.3)
    
    # Top I/O Consumers
    top15_io = df_interval.nlargest(15, 'excp_count')
    axes[1].barh(range(len(top15_io)), top15_io['excp_count'], color='darkred')
    axes[1].set_yticks(range(len(top15_io)))
    axes[1].set_yticklabels(top15_io['job_name'], fontsize=8)
    axes[1].set_xlabel('EXCP Count')
    axes[1].set_title('Top 15 I/O Intensive Jobs')
    axes[1].grid(True, alpha=0.3, axis='x')
    
    plt.tight_layout()
    Path('reports').mkdir(exist_ok=True)  # savefig fails if the directory is missing
    plt.savefig('reports/io_analysis.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("\n✓ I/O visualization saved to: reports/io_analysis.png")
else:
    print("EXCP count field not available in parsed data")

## 10. Export Reports to CSV/Excel

In [None]:
# Create reports directory if it doesn't exist
Path('reports').mkdir(exist_ok=True)

# Export each subtype to CSV
for subtype, df in dataframes.items():
    csv_file = f'reports/smf30_subtype{subtype}_report.csv'
    df.to_csv(csv_file, index=False)
    print(f"✓ Exported Subtype {subtype}: {csv_file} ({len(df)} records)")

# Create summary Excel file with multiple sheets
if dataframes:
    excel_file = 'reports/smf30_complete_report.xlsx'
    with pd.ExcelWriter(excel_file, engine='openpyxl') as writer:
        for subtype, df in dataframes.items():
            sheet_name = f'Subtype_{subtype}'
            df.to_excel(writer, sheet_name=sheet_name, index=False)
    print(f"\n✓ Created Excel report: {excel_file}")
    print(f"  Contains {len(dataframes)} sheets")

print("\n✓ All reports exported successfully!")

## 11. Custom Analysis Example

Perform custom queries on the data.