# Phase 3: Automated Technical Evaluation & Runtime Instrumentation

### **Overview**
This notebook contains the complete execution pipeline for **Phase 3** (as detailed in Section 3.4 of the methodology). It ingests the 1195 fully synthesized C programming solutions generated in Phase 2 and subjects them to a rigorous, multi-dimensional technical audit.

### **Methodology**
To identify the "Technical Reliability Gap" between syntactic adherence and logical correctness, this pipeline automatically evaluates each code generation using the following tools:
* **Static Analysis:** `lizard` (v1.20.0) for Cyclomatic Complexity (CCN) and `cppcheck` (v2.7-1) for Static Error Density (SED).
* **Stylistic Adherence:** `clang-format` (LLVM style) to identify formatting deviations.
* **Dynamic Compilation:** `gcc` (v11.4.0) with strict academic flags (`-Wall -Werror -std=c11`) to calculate the Compilation Success Rate (CSR).
* **Memory & Runtime Auditing:** `Valgrind` to detect "Definitely Lost" heap memory leaks, and `UBSan` to catch undefined behavior (UB) during execution.

**Target Output:** A consolidated CSV dataset containing the technical metrics and failure profiles for all 1195 iterations, which directly populates the tables and figures in the study.

---
**Note for Double-Blind Review:** This codebase relies on native Linux utilities (GCC, Valgrind) which are readily available in the Google Colab environment. If running locally, ensure these dependencies are installed via your system's package manager.

### 1. Environment Initialization & Data Validation
This cell initializes the Python environment and establishes a persistent connection to the dataset generated in Phase 2. It mounts the Google Drive filesystem, verifies the target directory paths, and performs a preliminary audit to confirm all JSONL files are present and correctly named before the technical evaluation begins.

In [None]:
# ==========================================
# --- ENVIRONMENT INITIALIZATION ---
# ==========================================
import os
import json
import re
import pandas as pd
from collections import defaultdict
from google.colab import drive

def setup_research_environment(target_folder):
    """
    Mounts the Google Drive filesystem and validates the data directory.
    Objective: Establish a persistent connection to the model-generated datasets.
    """
    if not os.path.exists("/content/drive"):
        print("🔍 Attempting to mount Google Drive...")
        drive.mount('/content/drive', force_remount=True)

    data_path = os.path.join("/content/drive/MyDrive/", target_folder)

    if not os.path.exists(data_path):
        raise FileNotFoundError(f"Critical Failure: Directory not found at {data_path}")

    print(f"Research environment verified. Path set to: {data_path}")
    return data_path

# --- CONFIGURATION ---
FOLDER_NAME = "CANAI_LLM_Results/Eval/"

try:
    path = setup_research_environment(FOLDER_NAME)

    # --- VALIDATION 1: FILENAME AUDIT ---
    files = sorted([f for f in os.listdir(path) if f.endswith('.jsonl')])
    print(f"Validation Check: {len(files)} JSONL files detected.\n")

    print("Detected Filenames:")
    for i, fname in enumerate(files, 1):
        print(f"  {i}. {fname}")

    if len(files) == 0:
        print("Warning: No data files found. Please verify the folder contents.")
except Exception as e:
    print(f"Initialization Error: {e}")

### 2. Data Parsing and Extraction Audit
To prepare the dataset for static and dynamic analysis, the raw JSONL strings must be parsed to isolate the functional C code.

This cell performs three critical preprocessing tasks:
1. **Trace Sanitization:** It removes internal "Chain-of-Thought" (CoT) reasoning traces (e.g., `<think>` tags) generated by latent reasoning models to prevent metric pollution during static analysis.
2. **Code Extraction:** It uses regex to isolate the terminal ` ```c ` markdown block from Step 2 of the prompt chain, representing the model's final synthesized solution.
3. **Integrity Auditing:** It identifies and logs any iterations that failed to follow the formatting constraints (resulting in the 5 excluded iterations mentioned in the methodology, finalizing the dataset at 1195 items).
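
The first two steps can be sketched on a toy input. The snippet below is a minimal illustration, assuming a hypothetical model response (`sample_response`) and hypothetical helper names (`strip_reasoning`, `last_c_block`); it is not taken from the dataset. The fence marker is built programmatically so the example does not nest literal fences.

```python
import re

FENCE = "`" * 3  # three backticks, built in code to avoid nesting literal fences

# Hypothetical Step 2 response: a reasoning trace, a draft block, and a final block.
sample_response = (
    "<think>Should I use a loop?</think>\n"
    f"Draft:\n{FENCE}c\nint draft(void) {{ return 0; }}\n{FENCE}\n"
    f"Final:\n{FENCE}c\nint main(void) {{ return 0; }}\n{FENCE}\n"
)

def strip_reasoning(text):
    """Drop <think>...</think> spans, including multi-line ones."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def last_c_block(text):
    """Return the body of the last c-fenced block, or None if there is none."""
    pattern = FENCE + r"c\s*(.*?)\s*" + FENCE
    blocks = re.findall(pattern, text, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else None

cleaned = strip_reasoning(sample_response)
final_code = last_c_block(cleaned)
print(final_code)
```

Selecting the last block rather than the first means a draft that the model later revises is ignored in favour of its final answer.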

In [None]:
# ==========================================
# --- DATA PARSING & EXTRACTION AUDIT ---
# ==========================================

def clean_model_output(text):
    """
    Removes internal reasoning traces (e.g., <think> tags) to prevent metric pollution.
    """
    if not text:
        return ""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def extract_c_source_with_reason(text):
    """
    Isolates the final C code block and provides a reason for any formatting failure.
    Returns: (extracted_code, failure_reason)
    """
    # Locates all C code blocks within Markdown delimiters
    code_blocks = re.findall(r"```c\s*(.*?)\s*```", text, re.DOTALL)

    if not code_blocks:
        return "", "Missing Delimiters (```c ... ```)"

    # Selects the terminal block to capture the final solution logic
    final_block = code_blocks[-1].strip()
    if not final_block:
        return "", "Empty Code Block (Delimiters found but no content)"

    return final_block, None

def parse_metadata(filename):
    """
    Extracts provenance metadata based on the verified Phase 2 naming convention:
    e.g., SOLVED_YYYYMMDD_ModelName_TopicName_B1.jsonl
    """
    pattern = r"SOLVED_\d{8}_(.+?)_(.+)_B\d+\.jsonl"
    match = re.match(pattern, filename)

    if match:
        return {
            "model": match.group(1).replace("_", "/"),
            "topic": match.group(2).replace("_", " ")
        }
    return None

# --- PROCESSING & EXTRACTION VALIDATION ---
dataset = defaultdict(lambda: defaultdict(list))
extraction_audit = {"success": 0, "failed": 0, "unparsed_files": 0}
failure_log = []

# Sorted processing for consistent indexing (uses 'path' from Cell 1)
files = sorted([f for f in os.listdir(path) if f.endswith('.jsonl')])

for fname in files:
    meta = parse_metadata(fname)
    if not meta:
        extraction_audit["unparsed_files"] += 1
        continue

    with open(os.path.join(path, fname), 'r', encoding='utf-8') as f:
        for line in f:
            try:
                data = json.loads(line)
                it_id = data.get("iteration", "N/A")
                raw_steps = data.get("steps", {})

                # Cleaning traces and isolating code from Step 2
                solution_text = clean_model_output(raw_steps.get("step_2", ""))
                extracted_code, reason = extract_c_source_with_reason(solution_text)

                if extracted_code:
                    extraction_audit["success"] += 1
                    data["extracted_code"] = extracted_code
                    dataset[meta['model']][meta['topic']].append(data)
                else:
                    extraction_audit["failed"] += 1
                    # Detailed logging of the extraction failure location
                    failure_log.append({
                        "File": fname,
                        "Iteration": it_id,
                        "Model": meta['model'],
                        "Topic": meta['topic'],
                        "Reason": reason
                    })

            except Exception as e:
                print(f"JSON Parse Error in {fname}: {e}")

# --- VALIDATION REPORT OUTPUT ---
print("EXTRACTION VALIDATION REPORT")
print("-" * 115)
print(f"Successfully Isolated C Code: {extraction_audit['success']}")
print(f"Failed to Extract Code:     {extraction_audit['failed']}")
print(f"Files Skipped (No Match):   {extraction_audit['unparsed_files']}")
print("-" * 115)

if failure_log:
    print("\nFAILURE PINPOINT LOG (Excluded from technical evaluation)")
    # Formatted table for clear mapping of failures to source files
    header = f"{'FILENAME':<45} | {'ITER':<5} | {'TOPIC':<30} | {'REASON':<25}"
    print(header)
    print("-" * len(header))
    for fail in failure_log:
        print(f"{fail['File'][:45]:<45} | {fail['Iteration']:<5} | {fail['Topic'][:30]:<30} | {fail['Reason']:<25}")
    print("-" * len(header))
    print(f"\nVerification Note: These {extraction_audit['failed']} iterations lack valid C code blocks and are excluded from the dataset.")
else:
    print("\nAll extractions successful. No failures found.")

### 3. Static Analysis: Structural Complexity (Lizard)
This cell executes the first phase of the static "Code-as-Text" audit detailed in Section 3.4. It utilizes the `lizard` library (v1.20.0) to analyze the structural density of the generated C solutions.

**Key Metrics Extracted:**
* **Max/Avg Cyclomatic Complexity (CCN):** Quantifies the number of independent branching paths. This metric is used to establish the "Reliability Ceiling" for each architecture (RQ1).
* **Token Count & NLOC:** Measures model verbosity. When cross-referenced with CCN, this generates the "Logic Density" (CCN per 100 Tokens) metric used to evaluate architectural efficiency between sparse MoE and dense models (RQ2).
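
The Logic Density figure can be read as a simple ratio. The sketch below assumes it is computed from a per-file CCN value and Lizard's token count. The function name `logic_density`, the choice of which CCN (maximum or average) to pass in, and the sample numbers are assumptions for illustration, not confirmed by the methodology.

```python
def logic_density(ccn, tokens):
    """CCN per 100 tokens; returns 0.0 for empty files to avoid division by zero."""
    if tokens <= 0:
        return 0.0
    return round(ccn / tokens * 100, 2)

# Illustrative values only: a file with CCN 6 and 240 tokens.
print(logic_density(6, 240))  # 2.5
```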

*(Note: System-level dependencies like `cppcheck` and `clang-format` are also installed in this cell to prepare the environment for the subsequent style and security audits).*

In [None]:
# ==========================================
# --- EXPANDED COMPLEXITY (CCN) ANALYSIS ---
# ==========================================
print("Installing Static Analysis Tools...")
# Install native Linux tools required for subsequent cells
!apt-get update -qq
!apt-get install -y cppcheck clang-format -qq
# Install lizard pinned to the v1.20.0 release specified in the methodology
!pip install lizard==1.20.0 -q

import lizard
import tempfile
import os
import pandas as pd
import numpy as np

def calculate_complexity_expanded(dataset):
    """
    Analyzes deep structural metrics using Lizard at the file level.
    Captures Max/Avg CCN, Token Count, Parameter Count, and total NLOC.
    """
    print("Analyzing Logical Complexity (Expanded Metrics)...")
    stats = {"processed": 0, "skipped": 0}

    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, "analysis_target.c")

        for model in dataset:
            for topic in dataset[model]:
                for entry in dataset[model][topic]:
                    code = entry.get("extracted_code", "")
                    metrics = entry.get("metrics_static", {})

                    # Handle missing or empty code blocks
                    if not code:
                        entry["metrics_static"] = {
                            "max_ccn": 0, "avg_ccn": 0, "tokens": 0,
                            "avg_params": 0, "loc": 0
                        }
                        stats["skipped"] += 1
                        continue

                    # Write target code to temporary file for Lizard analysis
                    with open(tmp_path, "w", encoding="utf-8") as f:
                        f.write(code)

                    try:
                        analysis = lizard.analyze_file(tmp_path)

                        if analysis.function_list:
                            # Aggregate metrics across all functions in the file
                            ccn_list = [f.cyclomatic_complexity for f in analysis.function_list]
                            param_list = [len(f.parameters) for f in analysis.function_list]

                            metrics["max_ccn"] = max(ccn_list)
                            metrics["avg_ccn"] = round(np.mean(ccn_list), 2)
                            metrics["tokens"] = analysis.token_count
                            metrics["avg_params"] = round(np.mean(param_list), 2)
                            metrics["loc"] = analysis.nloc
                        else:
                            # Fallback for structural snippets lacking formal definitions
                            metrics["max_ccn"] = 1
                            metrics["avg_ccn"] = 1.0
                            metrics["tokens"] = analysis.token_count
                            metrics["avg_params"] = 0
                            metrics["loc"] = len(code.splitlines())

                        stats["processed"] += 1
                    except Exception:
                        stats["skipped"] += 1

                    entry["metrics_static"] = metrics

    print(f"Complexity metrics updated. Processed: {stats['processed']} | Skipped: {stats['skipped']}")

# Execute the analysis on the global dataset
calculate_complexity_expanded(dataset)

# ==========================================
# --- VALIDATION REPORT OUTPUT ---
# ==========================================
def verify_static_metrics_expanded(dataset):
    print("\nVALIDATION: EXPANDED STATIC METRICS (Spot-Check: 2 per Model)")
    print("-" * 115)
    records = []

    for model in sorted(dataset.keys()):
        model_count = 0
        for topic in dataset[model]:
            if model_count >= 2:
                break
            for entry in dataset[model][topic]:
                if model_count < 2:
                    m = entry.get("metrics_static", {})
                    records.append({
                        "Model": model[:15],  # Expanded slightly for better readability
                        "Topic": topic[:25],
                        "Iter": entry.get("iteration"),
                        "Max_CCN": m.get("max_ccn"),
                        "Avg_CCN": m.get("avg_ccn"),
                        "Tokens": m.get("tokens"),
                        "LOC": m.get("loc")
                    })
                    model_count += 1

    df = pd.DataFrame(records)
    print(df.to_string(index=False))
    print("-" * 115)

verify_static_metrics_expanded(dataset)

### 4. Static Analysis: Error Density (cppcheck)
This cell executes the second phase of the static audit. It leverages `cppcheck` (v2.7-1) to identify high-severity logical flaws that bypass standard compilers (e.g., uninitialized variables, null pointer dereferences).

To ensure fair comparison across models with varying verbosity, the raw error count is normalized against the file's Lines of Code (LOC) to generate the **Static Error Density (SED)** score. This metric is central to evaluating architectural efficiency (RQ2).
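
As a worked illustration of the normalization, the sketch below treats SED as the plain ratio of reported findings to lines of code, rounded to four decimal places. The helper name `sed_score` and the sample numbers are hypothetical.

```python
def sed_score(error_count, loc):
    """Static Error Density: findings per line of code, rounded to 4 places."""
    if loc <= 0:
        return 0.0
    return round(error_count / loc, 4)

# Two hypothetical solutions with the same raw error count but different lengths.
terse = sed_score(3, 40)     # 0.075
verbose = sed_score(3, 120)  # 0.025
print(terse, verbose)
```

The same three findings weigh three times as heavily in the shorter file, which is the effect the normalization is meant to expose.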

In [None]:
# ==========================================
# --- ROBUST STATIC ERROR DENSITY (SED) ANALYSIS ---
# ==========================================
import subprocess
import os
import tempfile
import pandas as pd

def calculate_sed_robust(dataset):
    """
    Analyzes SED using cppcheck and normalizes counts against file-level LOC.
    """
    print("Analyzing Static Error Density (SED) via cppcheck...")
    stats = {"evaluated": 0, "failures": 0}

    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, "sed_target.c")

        for model in dataset:
            for topic in dataset[model]:
                for entry in dataset[model][topic]:
                    code = entry.get("extracted_code", "")
                    metrics = entry.get("metrics_static", {})
                    loc = metrics.get("loc", 0)

                    if not code or loc == 0:
                        metrics["static_errors"] = 0
                        metrics["sed_score"] = 0.0
                        # Persist the defaults even if the entry had no metrics dict yet
                        entry["metrics_static"] = metrics
                        continue

                    with open(tmp_path, "w", encoding="utf-8") as f:
                        f.write(code)

                    try:
                        # XML version 2 provides structured data for precise error counting
                        result = subprocess.run(
                            ["cppcheck", "--enable=all", "--xml", "--xml-version=2", tmp_path],
                            capture_output=True, text=True, check=False
                        )

                        error_count = result.stderr.count("<error ")
                        metrics["static_errors"] = error_count
                        metrics["sed_score"] = round(error_count / loc, 4) if loc > 0 else 0.0
                        stats["evaluated"] += 1

                    except Exception as e:
                        metrics["static_errors"] = -1
                        metrics["sed_score"] = -1.0
                        stats["failures"] += 1

                    entry["metrics_static"] = metrics

    print(f"SED analysis complete. Evaluated: {stats['evaluated']} | Failures: {stats['failures']}")

calculate_sed_robust(dataset)

# ==========================================
# --- VALIDATION REPORT OUTPUT ---
# ==========================================
def verify_sed_robust(dataset):
    print("\nVALIDATION: SED RE-CHECK (Spot-Check: 2 per Model)")
    print("-" * 115)
    records = []

    for model in sorted(dataset.keys()):
        model_count = 0
        for topic in dataset[model]:
            if model_count >= 2:
                break
            for entry in dataset[model][topic]:
                if model_count < 2:
                    m = entry.get("metrics_static", {})
                    records.append({
                        "Model": model[:15],
                        "Topic": topic[:25],
                        "Iter": entry.get("iteration"),
                        "Errors": m.get("static_errors"),
                        "SED_Score": m.get("sed_score"),
                        "LOC": m.get("loc")
                    })
                    model_count += 1

    df = pd.DataFrame(records)
    print(df.to_string(index=False))
    print("-" * 115)

verify_sed_robust(dataset)

### 4.1. Static Analysis: SED Logic Verification
To ensure the integrity of the Static Error Density (SED) metric, this validation cell performs a deep inspection of the `cppcheck` outputs.

By isolating a flagged iteration and parsing the raw XML output, this script checks that the automated pipeline is capturing genuine C memory and logic violations (e.g., `nullPointer`, `uninitvar`, `memleak`), rather than superficial environment or missing-path warnings. It prints up to three findings from the first flagged entry, so it serves as a spot check rather than an exhaustive audit.

In [None]:
# ==========================================
# --- SED LOGIC VERIFICATION & DEBUGGING ---
# ==========================================
import os
import tempfile
import subprocess
import xml.etree.ElementTree as ET

def verify_sed_error_content(dataset):
    """
    Pulls the actual error message from the first flagged entry to confirm
    we are catching high-severity C errors and not system/path errors.
    """
    print("VALIDATION: SED ERROR TEXT EXTRACTION")
    print("-" * 115)

    # Grab the first available entry with at least one reported error,
    # remembering the model key it was stored under
    sample_entry = None
    sample_model = None
    for model in dataset:
        for topic in dataset[model]:
            for entry in dataset[model][topic]:
                if entry.get("metrics_static", {}).get("static_errors", 0) > 0:
                    sample_entry = entry
                    sample_model = model
                    break
            if sample_entry:
                break
        if sample_entry:
            break

    if not sample_entry:
        print("No static errors found in the dataset to verify.")
        print("-" * 115)
        return

    # Extract data for the test
    code = sample_entry.get("extracted_code", "")
    # Entries are keyed by model in `dataset`; fall back to that key if the
    # record itself carries no "model" field
    model_source = sample_entry.get("model", sample_model)
    iter_id = sample_entry.get("iteration", "N/A")

    print(f"Target Acquired: {model_source} (Iteration {iter_id})")

    # Safely write to a temporary file
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False, encoding="utf-8") as tmp:
        tmp.write(code)
        tmp_path = tmp.name

    # Run cppcheck and capture the raw XML output
    result = subprocess.run(
        ["cppcheck", "--enable=all", "--xml", "--xml-version=2", tmp_path],
        capture_output=True, text=True, check=False
    )

    print("\n--- Raw XML Error Parsing ---")
    try:
        # Parse XML to find the specific 'id' and 'msg' attributes
        root = ET.fromstring(result.stderr)
        errors = root.findall(".//error")

        if not errors:
            print("No <error> tags found in XML output.")
        else:
            for err in errors[:3]: # Limit to first 3 errors to keep output clean
                err_id = err.get('id', 'UNKNOWN_ID')
                err_msg = err.get('msg', 'No message provided')
                print(f"ID: {err_id:<20} | MSG: {err_msg}")

    except Exception as e:
        print(f"Error parsing XML: {e}")
        print(f"Raw Stderr Snippet: {result.stderr[:200]}")

    # Clean up the temporary file
    os.remove(tmp_path)
    print("-" * 115)

verify_sed_error_content(dataset)

### 4.2. Static Analysis: Filtered Error Density (cppcheck)
Static analysis tools often inflate error counts on standalone code snippets because of configuration findings unrelated to the code's logic (e.g., `missingInclude`, which `cppcheck` reports when it cannot locate a header file on its include paths).

To bring the Static Error Density (SED) metric closer to the model's logical competence, this cell re-evaluates the `cppcheck` output and parses the XML to drop every finding whose ID contains `missingInclude`. All other findings reported under `--enable=all` (including style, performance, and portability diagnostics) still count toward the final SED score.
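
The filtering step can be sketched on a small report. The XML below is hand-written in the shape of `cppcheck`'s XML version 2 output and is not real tool output; the helper name `count_filtered_errors` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of cppcheck's XML v2 report (not real output).
SAMPLE_XML = """
<results version="2">
  <errors>
    <error id="missingIncludeSystem" severity="information" msg="Include file not found"/>
    <error id="uninitvar" severity="error" msg="Uninitialized variable: x"/>
    <error id="memleak" severity="error" msg="Memory leak: buf"/>
  </errors>
</results>
"""

def count_filtered_errors(xml_text, ignored="missingInclude"):
    """Count <error> elements whose id does not contain the ignored substring."""
    root = ET.fromstring(xml_text.strip())
    kept = [
        err for err in root.findall(".//error")
        if ignored not in (err.get("id") or "")
    ]
    return len(kept), [err.get("id") for err in kept]

count, ids = count_filtered_errors(SAMPLE_XML)
print(count, ids)
```

Matching on the substring also removes the `missingIncludeSystem` variant, which is the form typically reported for standard headers such as `<stdio.h>`.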

In [None]:
# ==========================================
# --- FILTERED STATIC ERROR DENSITY (SED) ---
# ==========================================
import os
import subprocess
import tempfile
import pandas as pd
import xml.etree.ElementTree as ET

def calculate_sed_filtered(dataset):
    """
    Re-evaluates SED by filtering out system configuration warnings (missingInclude).
    This ensures the score reflects true code-logic errors rather than environment limits.
    """
    print("Re-analyzing SED (Filtering System Noise)...")
    stats = {"evaluated": 0, "actual_errors": 0}

    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, "sed_filter.c")

        for model in dataset:
            for topic in dataset[model]:
                for entry in dataset[model][topic]:
                    code = entry.get("extracted_code", "")
                    metrics = entry.get("metrics_static", {})
                    loc = metrics.get("loc", 0)

                    if not code or loc == 0:
                        continue

                    # Safely write to temporary target
                    with open(tmp_path, "w", encoding="utf-8") as f:
                        f.write(code)

                    try:
                        # Use subprocess to get raw XML
                        result = subprocess.run(
                            ["cppcheck", "--enable=all", "--xml", "--xml-version=2", tmp_path],
                            capture_output=True, text=True, check=False
                        )

                        # Parse XML to count true errors while excluding 'missingInclude'
                        try:
                            root = ET.fromstring(result.stderr)
                            # Filter errors: keep findings whose ID is not a 'missingInclude' variant
                            # (default to "" so an <error> without an id cannot raise a TypeError)
                            valid_errors = [
                                err for err in root.findall(".//error")
                                if "missingInclude" not in err.get("id", "")
                            ]

                            error_count = len(valid_errors)
                            metrics["static_errors"] = error_count
                            metrics["sed_score"] = round(error_count / loc, 4)

                            stats["evaluated"] += 1
                            stats["actual_errors"] += error_count

                        except Exception:
                            # Fallback if XML is malformed
                            metrics["static_errors"] = 0
                            metrics["sed_score"] = 0.0

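                    except Exception:
                        # Mark both fields so a tool failure is distinguishable from a clean result
                        metrics["static_errors"] = -1
                        metrics["sed_score"] = -1.0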

                    entry["metrics_static"] = metrics

    print(f"Filtered SED complete. Evaluated: {stats['evaluated']} | True Errors Found: {stats['actual_errors']}")

# Execute the filtered audit
calculate_sed_filtered(dataset)

# ==========================================
# --- VALIDATION REPORT OUTPUT ---
# ==========================================
def verify_sed_filtered(dataset):
    print("\nVALIDATION: TRUE SED PERFORMANCE (Spot-Check: 2 per Model)")
    print("-" * 115)
    records = []

    for model in sorted(dataset.keys()):
        m_count = 0
        for topic in dataset[model]:
            if m_count >= 2:
                break
            for entry in dataset[model][topic]:
                if m_count < 2:
                    m = entry.get("metrics_static", {})
                    records.append({
                        "Model": model[:15],
                        "Topic": topic[:25],
                        "Iter": entry.get("iteration"),
                        "True_Errors": m.get("static_errors"),
                        "SED_Score": m.get("sed_score")
                    })
                    m_count += 1

    df = pd.DataFrame(records)
    print(df.to_string(index=False))
    print("-" * 115)

verify_sed_filtered(dataset)

### 5. Static Analysis: Stylistic Adherence (clang-format)
This cell concludes the static analysis phase by quantifying the stylistic cleanliness of the generated C solutions. It leverages `clang-format` configured to the industry-standard LLVM style guide.

By counting the discrete `<replacement>` tags required to bring the code into full compliance, this metric generates the "Style Deviations" score. This data is critical for investigating the "Style-Safety Paradox", testing the hypothesis that LLMs may generate visually pristine code that simultaneously harbors catastrophic runtime flaws.
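
The counting rule can be illustrated on a small replacements document. The XML below is hand-written in the shape of `clang-format --output-replacements-xml` output and is not real tool output; the helper name `count_style_deviations` is hypothetical.

```python
# Hand-written sample in the shape of clang-format's replacements XML (not real output).
SAMPLE_REPLACEMENTS = """<?xml version='1.0'?>
<replacements xml:space='preserve' incomplete_format='false'>
<replacement offset='14' length='0'> </replacement>
<replacement offset='31' length='2'>&#10;  </replacement>
</replacements>
"""

def count_style_deviations(xml_text):
    """Count individual <replacement ...> elements in the XML text."""
    # The trailing space keeps the enclosing <replacements ...> tag out of the count.
    return xml_text.count("<replacement ")

print(count_style_deviations(SAMPLE_REPLACEMENTS))  # 2
```

A file that already matches the target style yields only the enclosing `<replacements>` element, so its count is zero.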

In [None]:
# ==========================================
# --- STYLE COMPLIANCE ANALYSIS (LLVM) ---
# ==========================================
import os
import subprocess
import tempfile
import pandas as pd

def calculate_style_compliance(dataset):
    """
    Utilizes clang-format to measure stylistic deviations.
    Counts the number of discrete formatting fixes needed to match LLVM standards.
    """
    print("Analyzing Style Compliance via clang-format...")
    stats = {"processed": 0, "failures": 0}

    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, "style_target.c")

        for model in dataset:
            for topic in dataset[model]:
                for entry in dataset[model][topic]:
                    code = entry.get("extracted_code", "")
                    metrics = entry.get("metrics_static", {})

                    if not code:
                        metrics["style_deviations"] = 0
                        # Persist the default even if the entry had no metrics dict yet
                        entry["metrics_static"] = metrics
                        continue

                    # Safely write code to temporary file
                    with open(tmp_path, "w", encoding="utf-8") as f:
                        f.write(code)

                    try:
                        # --output-replacements-xml: Lists every required change in XML format
                        # --style=LLVM: Selects clang-format's predefined LLVM style
                        style_proc = subprocess.run(
                            ["clang-format", "--output-replacements-xml", "--style=LLVM", tmp_path],
                            capture_output=True, text=True, check=False
                        )

                        # Each <replacement> tag represents a specific formatting deviation
                        deviations = style_proc.stdout.count("<replacement ")
                        metrics["style_deviations"] = deviations
                        stats["processed"] += 1

                    except Exception:
                        metrics["style_deviations"] = -1
                        stats["failures"] += 1

                    entry["metrics_static"] = metrics

    print(f"Style analysis complete. Processed: {stats['processed']} | Failures: {stats['failures']}")

# Execute Style Analysis
calculate_style_compliance(dataset)

# ==========================================
# --- VALIDATION REPORT OUTPUT ---
# ==========================================
def verify_style_results(dataset):
    print("\nVALIDATION: STYLE COMPLIANCE SAMPLE (Spot-Check: 2 per Model)")
    print("-" * 115)
    records = []

    for model in sorted(dataset.keys()):
        model_count = 0
        for topic in dataset[model]:
            if model_count >= 2:
                break
            for entry in dataset[model][topic]:
                if model_count < 2:
                    m = entry.get("metrics_static", {})
                    records.append({
                        "Model": model[:15],
                        "Topic": topic[:25],
                        "Iter": entry.get("iteration"),
                        "Style_Dev": m.get("style_deviations"),
                        "LOC": m.get("loc")
                    })
                    model_count += 1

    df = pd.DataFrame(records)
    print(df.to_string(index=False))
    print("-" * 115)

verify_style_results(dataset)

### 6. Dynamic Analysis: Strict Compilation (CSR)
This cell initiates the dynamic runtime audit. Each generated solution is subjected to strict compilation using the `gcc` compiler (v11.4.0).

To ensure the code meets rigorous academic standards, the compilation strictly enforces the C11 standard and elevates all warnings to fatal errors using the `-Wall -Werror -std=c11` flags. The resulting **Compilation Success Rate (CSR)** serves as the baseline metric for syntactic fluency, separating structurally viable code from catastrophic generation failures before memory instrumentation begins. Successfully compiled binaries are persisted to the Drive for the subsequent Valgrind/UBSan audits.
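
As a minimal sketch of how the per-iteration flags roll up into a rate, the snippet below assumes CSR is reported as the percentage of iterations that compile. The helper name `compilation_success_rate`, the percentage scale, and the sample flags are assumptions for illustration.

```python
def compilation_success_rate(flags):
    """Percentage of entries whose csr flag is 1; 0.0 for an empty list."""
    if not flags:
        return 0.0
    return round(100 * sum(flags) / len(flags), 2)

# Hypothetical per-iteration flags: 1 = compiled under -Wall -Werror -std=c11, 0 = failed.
sample_flags = [1, 1, 0, 1]
print(compilation_success_rate(sample_flags))  # 75.0
```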

In [None]:
# ==========================================
# --- STRICT COMPILATION & BINARY PERSISTENCE ---
# ==========================================
import os
import subprocess
import tempfile
import pandas as pd

# Setup Binary Storage Directory (Uses the verified path from Cell 1)
BIN_DIR = os.path.join(path, "binaries")
if not os.path.exists(BIN_DIR):
    os.makedirs(BIN_DIR)
    print(f"Created binary storage at: {BIN_DIR}")

def run_compilation_audit(dataset):
    """
    Attempts to compile C code with strict flags: -Wall -Werror -std=c11.
    Successful binaries and error logs are stored on Drive.
    """
    print("Starting Strict Compilation Audit (Calculating CSR)...")
    stats = {"pass": 0, "fail": 0}

    for model in dataset:
        # Create model-specific subfolders for clean binary organization
        model_safe = model.replace("/", "_")
        model_path = os.path.join(BIN_DIR, model_safe)
        if not os.path.exists(model_path):
            os.makedirs(model_path)

        for topic in dataset[model]:
            for entry in dataset[model][topic]:
                code = entry.get("extracted_code", "")
                it_id = entry.get("iteration", "unknown")

                # Initialize dynamic metrics dictionary
                dyn_metrics = entry.get("metrics_dynamic", {})

                if not code:
                    dyn_metrics["compilation_status"] = "FAIL_EMPTY"
                    dyn_metrics["csr"] = 0
                    entry["metrics_dynamic"] = dyn_metrics
                    stats["fail"] += 1
                    continue

                # Define persistent paths for the binary and potential error logs
                bin_file = os.path.join(model_path, f"it_{it_id}.out")
                log_file = os.path.join(model_path, f"it_{it_id}.log")

                # Write code to a temporary .c file for GCC to target
                with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False, encoding="utf-8") as tmp:
                    tmp.write(code)
                    tmp_c = tmp.name

                try:
                    # Execute strict compilation command per Section 3.4 methodology
                    comp_proc = subprocess.run(
                        ["gcc", "-Wall", "-Werror", "-std=c11", tmp_c, "-o", bin_file],
                        capture_output=True, text=True, check=False
                    )

                    if comp_proc.returncode == 0:
                        dyn_metrics["compilation_status"] = "PASS"
                        dyn_metrics["csr"] = 1
                        entry["binary_path"] = bin_file # Store persistent path for Valgrind
                        stats["pass"] += 1
                    else:
                        dyn_metrics["compilation_status"] = "FAIL_GCC"
                        dyn_metrics["csr"] = 0
                        # Save the GCC error log for qualitative 'Oracle Hazard' analysis
                        with open(log_file, "w", encoding="utf-8") as log:
                            log.write(comp_proc.stderr)
                        stats["fail"] += 1

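                except Exception as e:
                    # Toolchain or filesystem failure (e.g., gcc not installed); keep the message for debugging
                    dyn_metrics["compilation_status"] = "SYSTEM_ERROR"
                    dyn_metrics["compilation_error"] = str(e)
                    dyn_metrics["csr"] = 0
                    stats["fail"] += 1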

                finally:
                    # Always clean up the temporary .c file
                    if os.path.exists(tmp_c):
                        os.remove(tmp_c)

                entry["metrics_dynamic"] = dyn_metrics

    print(f"Compilation Audit Complete. PASS: {stats['pass']} | FAIL: {stats['fail']}")

# Execute the GCC compilation audit
run_compilation_audit(dataset)

# ==========================================
# --- VALIDATION REPORT OUTPUT ---
# ==========================================
def verify_compilation_results(dataset):
    print("\nVALIDATION: COMPILATION SUCCESS SAMPLE (Spot-Check: 2 per Model)")
    print("-" * 115)
    records = []

    for model in sorted(dataset.keys()):
        m_count = 0
        for topic in dataset[model]:
            if m_count >= 2:
                break
            for entry in dataset[model][topic]:
                if m_count < 2:
                    d = entry.get("metrics_dynamic", {})
                    records.append({
                        "Model": model[:15],
                        "Topic": topic[:25],
                        "Iter": entry.get("iteration"),
                        "Status": d.get("compilation_status"),
                        "CSR_Binary": d.get("csr", 0)
                    })
                    m_count += 1

    df = pd.DataFrame(records)
    print(df.to_string(index=False))
    print("-" * 115)

verify_compilation_results(dataset)

### 7. Dynamic Analysis: Test Suite Extraction
To fully automate the dynamic analysis, the pipeline must feed the generated C binaries with valid input data.

As outlined in the multi-turn generation protocol, Step 6 tasked the LLM with generating a machine-readable JSON block containing comprehensive test cases. This cell parses that output, extracting the required "Handshake" input string and the specific `exit_command` needed to allow the program to terminate safely. These extracted drivers will be used to instrument the Valgrind and execution environments in the subsequent cells.
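
As a minimal illustration of this parsing step, the sketch below applies the same fenced-JSON pattern to a made-up Step 6 response. The sample text, the helper name `parse_test_driver`, and all field values are hypothetical; only the `test_suite`, `input`, and `exit_command` keys are taken from the pipeline.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so the example nests cleanly inside Markdown

# Hypothetical Step 6 response; real model output will differ.
sample_step6 = (
    "Here are the test cases:\n"
    f"{FENCE}json\n"
    '{"exit_command": "4", "test_suite": [{"input": "1\\n42\\n"}]}\n'
    f"{FENCE}\n"
    "Let me know if you need more."
)

def parse_test_driver(text):
    """Return (exit_cmd, first_input) from a fenced JSON block, or None."""
    pattern = FENCE + r"json\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    if not match:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    suite = payload.get("test_suite") or []
    first_input = suite[0].get("input", "") if suite else ""
    return str(payload.get("exit_command", "0")), first_input

driver = parse_test_driver(sample_step6)
print(driver)
```

Returning `None` for both a missing fence and malformed JSON keeps the two failure cases in a single "missing" bucket, which matches how the extraction statistics are reported.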

In [None]:
# ==========================================
# --- TEST METADATA EXTRACTION (STEP 6) ---
# ==========================================
import re
import json
import pandas as pd

def extract_test_metadata(dataset):
    """
    Parses the JSON blocks in Step 6 to extract automated test drivers.
    Metadata includes the exit command and the primary test input string.
    """
    print("Extracting Test Drivers from Step 6 Generation...")
    stats = {"found": 0, "missing": 0}

    for model in dataset:
        for topic in dataset[model]:
            for entry in dataset[model][topic]:
                # Locate Step 6 text
                step6_text = entry.get('steps', {}).get('step_6', "")

                # Regex to isolate the JSON block within the Step 6 prose
                json_match = re.search(r"```json\s*(\{.*?\})\s*```", step6_text, re.DOTALL)

                if json_match:
                    try:
                        test_json = json.loads(json_match.group(1))

                        # Safely extract the first input string to avoid IndexError on empty arrays
                        test_suite = test_json.get("test_suite", [])
                        first_input = test_suite[0].get("input", "") if len(test_suite) > 0 else ""

                        # Capture the exit command and the test input
                        entry["test_metadata"] = {
                            "exit_cmd": str(test_json.get("exit_command", "0")),
                            "input_str": first_input
                        }
                        stats["found"] += 1

                    except Exception:
                        entry["test_metadata"] = None
                        stats["missing"] += 1
                else:
                    entry["test_metadata"] = None
                    stats["missing"] += 1

    print(f"Metadata Extraction Complete. Drivers Found: {stats['found']} | Missing/Malformed: {stats['missing']}")

# Execute Extraction
extract_test_metadata(dataset)

# ==========================================
# --- VALIDATION REPORT OUTPUT ---
# ==========================================
def verify_test_metadata(dataset):
    print("\nVALIDATION: TEST DRIVER SAMPLE (Spot-Check: 1 per Model)")
    print("-" * 115)
    for model in sorted(dataset.keys()):
        for topic in dataset[model]:
            if len(dataset[model][topic]) == 0:
                continue

            entry = dataset[model][topic][0]
            meta = entry.get("test_metadata")

            if meta:
                # Replace actual newlines with literal '\n' characters for compact terminal display
                input_preview = meta['input_str'].replace('\n', '\\n')
                print(f"{model[:15]:<15} | Exit: {meta['exit_cmd']:<3} | Input: {input_preview[:50]}...")
                break # Only show one per model for the preview

    print("-" * 115)

verify_test_metadata(dataset)

### 7.1. Dynamic Analysis: Robust Metadata & Safety Injection
Because LLM output is probabilistic, the JSON test suites generated in Step 6 occasionally contain malformed syntax or omit the `exit_command` field.

To maximize dataset retention, this cell implements a more permissive fallback parser that also accepts JSON objects that are not wrapped in a Markdown code fence. More importantly, it appends a **"Safety Sequence"** (a cascade of common exit integers and string commands) to each extracted test driver. This mitigates logical hangs (infinite loops) during the dynamic execution phase. Because a stuck program is nudged to exit on its own rather than being terminated by a hard, system-level `SIGKILL`, the Valgrind memory profiler can still generate its leak report.
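
The sketch below shows one possible way to tolerate surrounding prose and to assemble the final stdin stream. It uses `json.JSONDecoder.raw_decode` to try each `{` in turn, which is an alternative technique and not the regex this notebook uses. The helper names and the sample string are hypothetical; the safety cascade is the one used by this pipeline.

```python
import json

SAFETY_EXIT_SEQ = "\n0\n4\n5\n9\nexit\nquit\n"  # cascade of common exit commands

def first_json_object(text):
    """Return the first dict that decodes cleanly starting at any '{', else None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON at this brace; try the next one
        start = text.find("{", start + 1)
    return None

def build_stdin(input_str, exit_cmd):
    """Concatenate the test input, the exit command (default '0'), and the cascade."""
    return f"{input_str}\n{exit_cmd or '0'}{SAFETY_EXIT_SEQ}"

# Hypothetical response: stray braces in the prose, no Markdown fence, blank exit command.
sample = 'Use {curly} braces. {"exit_command": "", "test_suite": [{"input": "3"}]} Done.'
obj = first_json_object(sample)
stdin_text = build_stdin(obj["test_suite"][0]["input"], obj["exit_command"].strip())
print(repr(stdin_text))
```

Trying each brace position separately avoids the case where a greedy first-to-last-brace match swallows unrelated prose and fails to parse.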

In [None]:
# ==========================================
# --- ROBUST TEST METADATA EXTRACTION ---
# ==========================================
import re
import json
import pandas as pd

def extract_test_metadata_robust(dataset):
    """
    Parses Step 6 for JSON test drivers using aggressive regex matching.
    Implements a fallback 'Safety Sequence' to mitigate logical hangs
    during dynamic analysis if the model-provided exit command is incorrect.
    """
    print("Extracting Robust Test Drivers from Step 6...")
    stats = {"found": 0, "missing": 0, "fixed_exit": 0}

    for model in dataset:
        for topic in dataset[model]:
            for entry in dataset[model][topic]:
                step6_text = entry.get('steps', {}).get('step_6', "")

                # Broaden regex to find JSON even if the ```json markdown tag is missing
                json_match = re.search(r"(\{.*\})", step6_text.replace('\n', ' '), re.DOTALL)

                meta_found = False
                if json_match:
                    try:
                        # The greedy match runs from the first '{' to the last '}' in the text;
                        # json.loads raises if stray prose braces were captured along with it
                        clean_json = json_match.group(1)
                        test_json = json.loads(clean_json)

                        # Treat an explicit JSON null like a missing key (str(None) would yield "None")
                        raw_exit = test_json.get("exit_command")
                        exit_cmd = "" if raw_exit is None else str(raw_exit).strip()
                        input_str = ""

                        # Extract the first input only if the test suite is non-empty
                        if "test_suite" in test_json and len(test_json["test_suite"]) > 0:
                            input_str = test_json["test_suite"][0].get("input", "")

                        # Validation: if exit_cmd is missing or blank, default to "0" and count the fix
                        if not exit_cmd:
                            exit_cmd = "0"  # Default fallback
                            stats["fixed_exit"] += 1

                        entry["test_metadata"] = {
                            "exit_cmd": exit_cmd,
                            "input_str": input_str,
                            # A 'Safety Sequence' appended to execution buffers to prevent hard hangs
                            "safety_exit_seq": "\n0\n4\n5\n9\nexit\nquit\n"
                        }
                        meta_found = True
                    except Exception:
                        pass # Fall through to the missing counter if parsing completely fails

                if meta_found:
                    stats["found"] += 1
                else:
                    entry["test_metadata"] = None
                    stats["missing"] += 1

    print("Robust Extraction Complete.")
    print(f"Drivers Found: {stats['found']} | Missing: {stats['missing']} | Defaulted Exits: {stats['fixed_exit']}")

# Execute Robust Extraction
extract_test_metadata_robust(dataset)

# ==========================================
# --- VALIDATION REPORT OUTPUT ---
# ==========================================
def verify_metadata_alignment(dataset):
    """
    Prints the exit command and an input preview for the first entry of every
    model/topic pair so that alignment with the topic can be checked by eye.
    """
    print("\nVALIDATION: METADATA ALIGNMENT CHECK (1 per Topic per Model)")
    print("-" * 115)
    header = f"{'MODEL':<15} | {'TOPIC':<25} | {'ITER':<5} | {'EXIT':<5} | {'INPUT PREVIEW'}"
    print(header)
    print("-" * 115)

    for model in sorted(dataset.keys()):
        for topic in sorted(dataset[model].keys()):
            if len(dataset[model][topic]) == 0:
                continue

            entry = dataset[model][topic][0]
            meta = entry.get("test_metadata")

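            if meta:
                input_prev = meta['input_str'].replace('\n', '\\n')[:45]
                # Cast to str so a missing (None) iteration ID does not break the width format spec
                iter_label = str(entry.get('iteration'))
                print(f"{model[:15]:<15} | {topic[:25]:<25} | {iter_label:<5} | {meta['exit_cmd']:<5} | {input_prev}...")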

    print("-" * 115)

verify_metadata_alignment(dataset)

### 8. Dynamic Analysis: Valgrind Environment Handshake
Before executing the full-scale memory audit on the dataset, this cell performs a single-sample "smoke test".

It verifies that the dynamic analysis environment is correctly configured with `Valgrind` (v.3.18.1+) and confirms that a compiled C binary accepts the automated input drivers extracted from Step 6. This handshake checks that the pipeline can capture "Definitely Lost" heap memory and invalid memory accesses without the process hanging.
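
The sketch below shows how the two summary lines of a Valgrind log could be turned into numbers with regular expressions instead of substring checks. The log excerpt is hypothetical but follows the usual shape of a Memcheck summary; the exact wording may vary between Valgrind versions, and `summarize_valgrind` is an illustrative helper, not part of this notebook.

```python
import re

# Hypothetical excerpt in the shape of a Valgrind Memcheck summary.
sample_log = """\
==101== HEAP SUMMARY:
==101==     in use at exit: 1,024 bytes in 2 blocks
==101== LEAK SUMMARY:
==101==    definitely lost: 1,024 bytes in 2 blocks
==101==    indirectly lost: 0 bytes in 0 blocks
==101== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
"""

def _to_int(text):
    """Convert a Valgrind-style number such as '1,024' to an int."""
    return int(text.replace(",", ""))

def summarize_valgrind(log):
    """Extract definitely-lost bytes and the error count from a log string."""
    lost = re.search(r"definitely lost:\s*([\d,]+) bytes", log)
    errors = re.search(r"ERROR SUMMARY:\s*([\d,]+) errors", log)
    return {
        # No 'definitely lost' line is treated as zero bytes lost.
        "definitely_lost_bytes": _to_int(lost.group(1)) if lost else 0,
        # None signals that the log ended before Valgrind printed its summary.
        "error_count": _to_int(errors.group(1)) if errors else None,
    }

summary = summarize_valgrind(sample_log)
print(summary)
```

Keeping `error_count` as `None` when the summary line is absent makes it possible to tell a clean run apart from a truncated log.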

In [None]:
# ==========================================
# --- SINGLE-SAMPLE VALGRIND HANDSHAKE ---
# ==========================================
import subprocess
import os

def run_valgrind_handshake(dataset):
    """
    Executes a single-sample test to verify the dynamic analysis environment.
    Confirms that a compiled binary accepts its extracted test driver
    before the full-scale audit is run.
    """
    print("Starting Valgrind Handshake (Single-Sample Verification)...")
    print("-" * 115)

    # 1. Identify a suitable sample for testing
    target_entry = None
    target_model = None

    for model in dataset:
        for topic in dataset[model]:
            for entry in dataset[model][topic]:
                # Sample must have a persistent binary and test metadata
                if entry.get("metrics_dynamic", {}).get("csr") == 1 and entry.get("test_metadata"):
                    target_entry = entry
                    target_model = model
                    break
            if target_entry:
                break
        if target_entry:
            break

    if not target_entry:
        print("Critical Failure: No compiled binaries with metadata found.")
        print("-" * 115)
        return

    # 2. Prepare execution parameters
    bin_path = target_entry.get("binary_path")
    meta = target_entry["test_metadata"]

    # Construct input stream: Test Input + Safety Exit Sequence
    # This prevents the process from hanging if the primary exit command fails
    full_input = f"{meta['input_str']}\n{meta['exit_cmd']}{meta['safety_exit_seq']}"

    print(f"Testing Model: {target_model}")
    print(f"Target Binary: {bin_path}")

    try:
        # 3. Execute Valgrind
        # --leak-check=full: reports each individual leak in detail
        # --errors-for-leak-kinds=all: counts every leak kind (including "still reachable") as an error
        vg_proc = subprocess.run(
            ["valgrind", "--leak-check=full", "--errors-for-leak-kinds=all", bin_path],
            input=full_input,
            capture_output=True,
            text=True,
            encoding="utf-8",
            timeout=15, # Extended timeout for Valgrind overhead
            errors='replace'
        )

        # 4. Analyze Output Log
        output_log = vg_proc.stderr + vg_proc.stdout

        # Logic: Search for indicators of memory safety
        # 'definitely lost: 0' and 'ERROR SUMMARY: 0' indicate a clean execution
        has_leaks = "definitely lost:" in output_log and "definitely lost: 0" not in output_log
        has_errors = "ERROR SUMMARY:" in output_log and "ERROR SUMMARY: 0" not in output_log

        print("\n--- Valgrind Analysis Result ---")
        if not has_leaks and not has_errors:
            print("Status: CLEAN (No leaks or invalid operations detected)")
        else:
            print("Status: SAFETY_VIOLATION detected")
            if has_leaks:
                print("   - Heap Leak Identified ('Definitely Lost' > 0)")
            if has_errors:
                print("   - Valgrind Errors Reported (nonzero ERROR SUMMARY, e.g., Invalid Read/Write or leak records)")

        # 5. Validation Check: Termination
        if vg_proc.returncode == 0:
            print("Handshake Success: Program terminated gracefully.")
        else:
            print(f"Handshake Warning: Program exited with non-zero code ({vg_proc.returncode}).")

    except subprocess.TimeoutExpired:
        print("Handshake Failure: Program timed out (Logical Hang).")
    except Exception as e:
        print(f"System Error during handshake: {e}")

    print("-" * 115)


# Install Valgrind if not present (Verification step matching methodology v.3.18.1+)
print("Verifying Valgrind Installation...")
!apt-get install -y valgrind -qq

# Execute Handshake
run_valgrind_handshake(dataset)

### 9. Dynamic Analysis: Full-Scale Memory & Runtime Audit (Valgrind)
This cell executes the core dynamic evaluation of the 1195-iteration dataset. Every successfully compiled binary is subjected to execution under `Valgrind`, fed with the automated test drivers extracted in Step 6.

**Key Metrics Extracted (Section 4):**
* **Memory Safety Rate:** The percentage of executed programs that successfully terminate without triggering "Definitely Lost" heap memory leaks or invalid access violations.
* **Logical Hangs:** The incidence of models trapping the execution thread in infinite loops (caught via a strict 5-second execution timeout).

*Note: Due to the high computational overhead of memory instrumentation, this process may take several minutes. A progress bar is provided to track real-time execution.*
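
The hang detection described above rests on the `timeout` argument of `subprocess.run`. The sketch below demonstrates the mechanism in isolation. As an assumption made for portability, small Python one-liners stand in for the compiled C binaries, and a 1-second limit replaces the 5-second limit; `run_with_timeout` is a hypothetical helper.

```python
import subprocess
import sys

def run_with_timeout(cmd, stdin_text, seconds):
    """Run cmd and return ('OK', returncode), or ('TIMEOUT', None) if it hangs."""
    try:
        proc = subprocess.run(
            cmd, input=stdin_text, capture_output=True, text=True, timeout=seconds
        )
        return "OK", proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising this exception
        return "TIMEOUT", None

# Stand-ins for compiled binaries (assumption): one loops forever, one reads stdin and exits.
looping = [sys.executable, "-c", "while True: pass"]
exiting = [sys.executable, "-c", "import sys; sys.stdin.read(); sys.exit(0)"]

hang_result = run_with_timeout(looping, "0\n", seconds=1)
ok_result = run_with_timeout(exiting, "0\n", seconds=10)
print(hang_result, ok_result)
```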

In [None]:
# ==========================================
# --- FULL-SCALE VALGRIND HYBRID PIPELINE ---
# ==========================================
from tqdm.auto import tqdm
import time
import subprocess

def run_full_valgrind_audit_with_progress(dataset):
    """
    Performs the memory safety audit with real-time progress tracking.
    Evaluates leaks, invalid memory accesses, and logical hangs.
    """
    # 1. Flatten the entries to calculate the total execution batch
    all_binaries = []
    for model in dataset:
        for topic in dataset[model]:
            for entry in dataset[model][topic]:
                if entry.get("metrics_dynamic", {}).get("csr") == 1 and entry.get("test_metadata"):
                    all_binaries.append(entry)

    total_to_run = len(all_binaries)
    print("Starting Full Dynamic Analysis (Valgrind)...")
    print(f"Target: {total_to_run} compiled binaries detected.")
    print("-" * 115)

    stats = {"clean": 0, "safety_violation": 0, "timeout": 0, "error": 0, "processed": 0}

    # 2. Initialize the dynamic Progress Bar
    progress_bar = tqdm(total=total_to_run, desc="Auditing Binaries", unit="bin")

    for entry in all_binaries:
        dyn = entry.get("metrics_dynamic")
        bin_path = entry.get("binary_path")
        meta = entry.get("test_metadata")

        # Prepare the input stream (User Input + Safety Exit Sequence)
        full_input = f"{meta['input_str']}\n{meta['exit_cmd']}{meta['safety_exit_seq']}"

        try:
            # Execute Valgrind with strict 5-second timeout to catch infinite loops
            vg_proc = subprocess.run(
                ["valgrind", "--leak-check=full", "--errors-for-leak-kinds=all", bin_path],
                input=full_input,
                capture_output=True,
                text=True,
                encoding="utf-8",
                timeout=5,
                errors='replace'
            )

            output_log = vg_proc.stderr + vg_proc.stdout

            # Logic: Identify "Definitely Lost" heap leaks or invalid accesses
            has_leaks = "definitely lost:" in output_log and "definitely lost: 0" not in output_log
            has_errors = "ERROR SUMMARY:" in output_log and "ERROR SUMMARY: 0" not in output_log

            if not has_leaks and not has_errors:
                dyn["valgrind_status"] = "CLEAN"
                dyn["mem_safety_score"] = 1
                stats["clean"] += 1
            else:
                dyn["valgrind_status"] = "SAFETY_VIOLATION"
                dyn["mem_safety_score"] = 0
                stats["safety_violation"] += 1

        except subprocess.TimeoutExpired:
            # Captures 'Logical Hangs' as defined in the methodology
            dyn["valgrind_status"] = "TIMEOUT_LOGIC_HANG"
            dyn["mem_safety_score"] = 0
            stats["timeout"] += 1

        except Exception:
            dyn["valgrind_status"] = "EXECUTION_ERROR"
            dyn["mem_safety_score"] = 0
            stats["error"] += 1

        entry["metrics_dynamic"] = dyn

        # 3. Update Progress Bar state
        stats["processed"] += 1
        progress_bar.update(1)

        # Periodic Checkpoint Print (Updates the text next to the progress bar)
        if stats["processed"] % 10 == 0:
            progress_bar.set_postfix(clean=stats["clean"], fails=stats["safety_violation"], hangs=stats["timeout"])

    progress_bar.close()

    # 4. Final Summary Report
    print("\n" + "=" * 55)
    print("FULL DYNAMIC ANALYSIS COMPLETE")
    print("=" * 55)
    print(f"Clean Executions:      {stats['clean']}")
    print(f"Safety Violations:     {stats['safety_violation']} (Leaks/Invalid Accesses)")
    print(f"Logical Hangs:         {stats['timeout']} (Infinite Loops)")
    print(f"System Errors:         {stats['error']}")
    print("-" * 55)
    print(f"Total Binaries Tested: {stats['processed']} / {total_to_run}")
    print("=" * 55)

# Execute the audit
run_full_valgrind_audit_with_progress(dataset)

### 10. Dynamic Analysis: Undefined Behavior Sanitizer (UBSan)
This cell completes the dynamic analysis suite by evaluating the dataset against the UndefinedBehaviorSanitizer (UBSan).

Standard testing and even Valgrind may overlook certain classes of undefined behavior at runtime (e.g., signed integer overflow, division by zero). To catch these "Oracle Hazards", this pipeline recompiles each successfully compiled C solution using `gcc` with the `-fsanitize=undefined` flag. The binaries are then executed, and any `runtime error` messages written to stderr are parsed to identify and penalize logical flaws.
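
To make the parsing step concrete, the sketch below pulls structured findings out of a stderr capture. The sample text is hypothetical and assumes UBSan's usual `file:line:col: runtime error: message` layout; `ubsan_findings` is an illustrative helper rather than a function from this notebook.

```python
import re

# Hypothetical stderr: one line of program output followed by two UBSan reports.
sample_stderr = (
    "debug: starting\n"
    "ub_test.c:7:15: runtime error: signed integer overflow: "
    "2147483647 + 1 cannot be represented in type 'int'\n"
    "ub_test.c:12:9: runtime error: division by zero\n"
)

UB_LINE = re.compile(
    r"^(?P<file>[^:\n]+):(?P<line>\d+):(?P<col>\d+): runtime error: (?P<msg>.*)$",
    re.MULTILINE,
)

def ubsan_findings(stderr_text):
    """Return one dict per UBSan report found in the captured stderr."""
    return [
        {"line": int(m["line"]), "col": int(m["col"]), "message": m["msg"].strip()}
        for m in UB_LINE.finditer(stderr_text)
    ]

findings = ubsan_findings(sample_stderr)
for finding in findings:
    print(finding)
```

Matching on the full line shape rather than taking the first line of stderr keeps ordinary program output from being recorded as a UB detail.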

In [None]:
# ==========================================
# --- UNDEFINED BEHAVIOR (UBSAN) AUDIT ---
# ==========================================
import os
import subprocess
import tempfile
import pandas as pd

def run_ubsan_audit(dataset):
    """
    Re-compiles and executes binaries with UndefinedBehaviorSanitizer.
    Identifies silent logic failures like integer overflows and null-pointer usage.
    """
    print("Starting Undefined Behavior Audit (UBSan)...")
    stats = {"clean": 0, "ub_detected": 0, "timeout": 0, "processed": 0}

    with tempfile.TemporaryDirectory() as tmp_dir:
        for model in dataset:
            for topic in dataset[model]:
                for entry in dataset[model][topic]:
                    # Only test binaries that passed the initial compilation
                    if entry.get("metrics_dynamic", {}).get("csr") != 1 or not entry.get("test_metadata"):
                        continue

                    code = entry.get("extracted_code", "")
                    meta = entry.get("test_metadata")
                    dyn = entry.get("metrics_dynamic")

                    # Compile with UBSan flags
                    c_file = os.path.join(tmp_dir, "ub_test.c")
                    bin_path = os.path.join(tmp_dir, "ub_test.out")
                    with open(c_file, "w", encoding="utf-8") as f:
                        f.write(code)

                    try:
                        # Re-compile strictly for UBSan instrumentation
                        comp = subprocess.run(
                            ["gcc", "-fsanitize=undefined", "-g", c_file, "-o", bin_path],
                            capture_output=True, text=True, check=False, encoding="utf-8"
                        )

                        if comp.returncode == 0:
                            full_input = f"{meta['input_str']}\n{meta['exit_cmd']}{meta['safety_exit_seq']}"

                            # Execute the instrumented binary
                            run = subprocess.run(
                                [bin_path], input=full_input,
                                capture_output=True, text=True, timeout=5, encoding="utf-8", errors='replace'
                            )

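                            # UBSan writes findings to stderr as '<file>:<line>:<col>: runtime error: ...'
                            ub_lines = [ln for ln in run.stderr.splitlines() if "runtime error:" in ln.lower()]
                            if ub_lines:
                                dyn["ub_status"] = "UB_TRIGGERED"
                                # Keep the first UBSan line (stderr may begin with the program's own output)
                                dyn["ub_detail"] = ub_lines[0]
                                stats["ub_detected"] += 1
                            else:
                                dyn["ub_status"] = "CLEAN"
                                stats["clean"] += 1

                            stats["processed"] += 1
                        else:
                            # Sanitizer build failed; record it instead of skipping silently
                            dyn["ub_status"] = "UBSAN_COMPILE_FAIL"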

                    except subprocess.TimeoutExpired:
                        # The binary was executed, so count it toward the processed total
                        dyn["ub_status"] = "TIMEOUT"
                        stats["timeout"] += 1
                        stats["processed"] += 1
                    except Exception:
                        dyn["ub_status"] = "EXEC_ERR"

                    entry["metrics_dynamic"] = dyn

    print("\n" + "=" * 55)
    print("UBSAN ANALYSIS COMPLETE")
    print("=" * 55)
    print(f"Clean (No UB):         {stats['clean']}")
    print(f"UB Triggers Found:     {stats['ub_detected']}")
    print(f"Logical Hangs:         {stats['timeout']}")
    print("-" * 55)
    print(f"Total Binaries Tested: {stats['processed']}")
    print("=" * 55)

# Execute UBSan Audit
run_ubsan_audit(dataset)

# ==========================================
# --- VALIDATION REPORT OUTPUT ---
# ==========================================
def verify_ub_results(dataset):
    print("\nVALIDATION: UBSAN PERFORMANCE SAMPLE (Spot-Check: 2 per Model)")
    print("-" * 115)
    records = []

    for model in sorted(dataset.keys()):
        m_count = 0
        for topic in dataset[model]:
            if m_count >= 2:
                break
            for entry in dataset[model][topic]:
                if m_count < 2 and "ub_status" in entry.get("metrics_dynamic", {}):
                    d = entry.get("metrics_dynamic", {})
                    records.append({
                        "Model": model[:15],
                        "Topic": topic[:20],
                        "Iter": entry.get("iteration"),
                        "UB_Status": d.get("ub_status"),
                        "UB_Detail": (d.get("ub_detail", "")[:40] + "...") if d.get("ub_detail") else "N/A"
                    })
                    m_count += 1

    df = pd.DataFrame(records)
    print(df.to_string(index=False))
    print("-" * 115)

verify_ub_results(dataset)

### 11. Final Artifact: Master Dataset Consolidation
This final cell aggregates the outputs from the multi-dimensional auditing pipeline into a single CSV file that is both human- and machine-readable.

It synthesizes the metadata, the structural static metrics (`LOC`, `Max_CCN`, `SED_Score`), the stylistic adherence scores, and the dynamic runtime profiling results (`CSR_Binary`, `Mem_Safety_Score`, `UB_Status`). This consolidated dataset serves as the direct source of truth for the statistical analyses, tables, and figures presented in the conference paper.
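
The flattening idea can be sketched with the standard library alone. The example below walks a toy dataset in the nested model, topic, entries shape and writes it to an in-memory CSV. The toy values are hypothetical, only a subset of the columns is shown, and `csv.DictWriter` is used here instead of pandas purely to keep the sketch dependency-free.

```python
import csv
import io

# Hypothetical two-entry dataset in the nested model -> topic -> entries shape.
toy = {
    "model_a": {
        "pointers": [
            {"iteration": 1, "metrics_static": {"loc": 40, "max_ccn": 6},
             "metrics_dynamic": {"csr": 1, "valgrind_status": "CLEAN"}},
            {"iteration": 2, "metrics_static": {"loc": 55},
             "metrics_dynamic": {"csr": 0}},
        ]
    }
}

def flatten(dataset):
    """Yield one flat row per entry, filling absent metrics with defaults."""
    for model, topics in dataset.items():
        for topic, entries in topics.items():
            for entry in entries:
                static = entry.get("metrics_static", {})
                dyn = entry.get("metrics_dynamic", {})
                yield {
                    "Model": model, "Topic": topic, "Iteration": entry.get("iteration"),
                    "LOC": static.get("loc", 0), "Max_CCN": static.get("max_ccn", 0),
                    "CSR_Binary": dyn.get("csr", 0),
                    "Valgrind_Status": dyn.get("valgrind_status", "N/A"),
                }

rows = list(flatten(toy))
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The second toy entry shows how defaults surface in the output: a failed compilation has no Valgrind result, so its status column reads `N/A`.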

In [None]:
# ==========================================
# --- MASTER DATA CONSOLIDATION (CSV) ---
# ==========================================
import os
import pandas as pd

def consolidate_master_results(dataset, path_folder, filename="Phase3_Master_Evaluation_Stats.csv"):
    """
    Consolidates static, dynamic, and safety metrics into a single CSV.
    This creates the primary dataset for the conference paper's statistical analysis.
    """
    print("Consolidating all multi-dimensional metrics into a master CSV...")
    final_rows = []

    for model in dataset:
        for topic in dataset[model]:
            for entry in dataset[model][topic]:
                # 1. Base Provenance Metadata
                row = {
                    "Model": model,
                    "Topic": topic,
                    "Iteration": entry.get("iteration"),
                }

                # 2. Static Metrics (from Cells 3, 4, 5)
                static = entry.get("metrics_static", {})
                row.update({
                    "LOC": static.get("loc", 0),
                    "Max_CCN": static.get("max_ccn", 0),
                    "Avg_CCN": static.get("avg_ccn", 0),
                    "Tokens": static.get("tokens", 0),
                    "Avg_Params": static.get("avg_params", 0),
                    "Static_Errors": static.get("static_errors", 0),
                    "SED_Score": static.get("sed_score", 0),
                    "Style_Deviations": static.get("style_deviations", 0)
                })

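                # 3. Dynamic & Safety Metrics (from Cells 6, 9, 10)
                # Note: iterations that never reached Valgrind default to Mem_Safety_Score = 0,
                # so per-model means are taken over all iterations, not only executed binaries.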
                dyn = entry.get("metrics_dynamic", {})
                row.update({
                    "Comp_Status": dyn.get("compilation_status", "FAIL"),
                    "CSR_Binary": dyn.get("csr", 0),
                    "Valgrind_Status": dyn.get("valgrind_status", "N/A"),
                    "Mem_Safety_Score": dyn.get("mem_safety_score", 0),
                    "UB_Status": dyn.get("ub_status", "N/A"),
                    "UB_Detail": dyn.get("ub_detail", "None")
                })

                final_rows.append(row)

    # Convert dictionary list to a Pandas DataFrame
    df_master = pd.DataFrame(final_rows)

    # Save to the Drive folder established in Cell 1
    output_path = os.path.join(path_folder, filename)
    df_master.to_csv(output_path, index=False)

    print(f"Consolidation Complete. {len(df_master)} rows saved to:")
    print(f"{output_path}")

    return df_master

# Execute Consolidation using the 'path' variable from Cell 1
df_final = consolidate_master_results(dataset, path)

# ==========================================
# --- VALIDATION REPORT OUTPUT ---
# ==========================================
def verify_master_dataset(df):
    print("\nVALIDATION: MASTER DATASET PREVIEW")
    print("-" * 115)

    # Group by model to generate a high-level performance snapshot
    summary = df.groupby("Model").agg({
        "CSR_Binary": "mean",
        "Mem_Safety_Score": "mean",
        "Max_CCN": "mean",
        "SED_Score": "mean",
        "Avg_CCN": "mean"
    }).round(3)

    print("Aggregated Performance by Model:")
    print(summary)
    print("-" * 115)

    print("Sample Rows (Final 5 in Dataset):")
    # Formatted display of the final rows
    print(df.tail(5)[["Model", "Topic", "Iteration", "Comp_Status", "Valgrind_Status", "UB_Status"]].to_string(index=False))
    print("-" * 115)

verify_master_dataset(df_final)

### 12. Data Visualization: The Reliability Gap (Figure 2a)
This cell generates the primary visualization for Research Question 1 (RQ1). It charts the aggregated data from the `df_final` master dataset to illustrate the "Technical Reliability Gap."

By plotting the Compilation Success Rate (CSR) side-by-side with the Valgrind Memory Safety Rate, this figure visually demonstrates how surface-level syntactic fluency (compilation) often masks underlying logical and memory management failures in LLM-generated C code.
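
The arithmetic behind the chart can be sketched without pandas. In the example below the rows are hypothetical and are not study results. Expressing the gap as the simple difference between the two rates is an assumption made for illustration, since the figure itself only plots the two rates side by side.

```python
from collections import defaultdict

# Hypothetical rows; the values are illustrative only.
rows = [
    {"Model": "model_a", "CSR_Binary": 1, "Mem_Safety_Score": 1},
    {"Model": "model_a", "CSR_Binary": 1, "Mem_Safety_Score": 0},
    {"Model": "model_b", "CSR_Binary": 1, "Mem_Safety_Score": 0},
    {"Model": "model_b", "CSR_Binary": 0, "Mem_Safety_Score": 0},
]

def reliability_gap(rows):
    """Return per-model CSR, memory-safety rate, and their difference."""
    totals = defaultdict(lambda: {"n": 0, "csr": 0, "safe": 0})
    for row in rows:
        t = totals[row["Model"]]
        t["n"] += 1
        t["csr"] += row["CSR_Binary"]
        t["safe"] += row["Mem_Safety_Score"]
    return {
        model: {
            "csr": t["csr"] / t["n"],
            "mem_safety": t["safe"] / t["n"],
            "gap": (t["csr"] - t["safe"]) / t["n"],  # assumed definition of the gap
        }
        for model, t in totals.items()
    }

gaps = reliability_gap(rows)
print(gaps)
```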

In [None]:
# ==========================================
# --- DATA VISUALIZATION: THE RELIABILITY GAP ---
# ==========================================
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

plt.figure(figsize=(12, 7))
sns.set_theme(style="whitegrid")

# Preparing the data for 4 models
gap_df = df_final.groupby("Model")[["CSR_Binary", "Mem_Safety_Score"]].mean().reset_index()
gap_df = gap_df.melt(id_vars="Model", var_name="Metric", value_name="Percentage")

# Rename metrics for professional display in the paper
gap_df['Metric'] = gap_df['Metric'].replace({
    'CSR_Binary': 'Compilation Success (CSR)',
    'Mem_Safety_Score': 'Memory Safety (Valgrind)'
})

# Plotting
ax = sns.barplot(data=gap_df, x="Model", y="Percentage", hue="Metric", palette="Set2")
plt.title("The Reliability Gap: 4-Model Comparison", fontsize=16, fontweight='bold', pad=20)
plt.ylabel("Success Rate (0.0 - 1.0)", fontsize=12)
plt.xlabel("Model Architecture", fontsize=12)
plt.ylim(0, 1.1)
plt.legend(title="Metric Type", loc='upper right')

# Annotate values
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f', padding=3)

plt.savefig("/content/drive/MyDrive/CANAI_LLM_Results/Eval/Reliability_Gap_4Models.png", dpi=300, bbox_inches='tight')
plt.show()

### 13. Data Visualization: Failure Profile Audit (Figure 2b)

This cell generates the secondary visualization for Research Question 1 (RQ1). While the previous chart established the existence of the Reliability Gap, this visualization categorizes the specific technical violations responsible for that gap.

By aggregating the string outputs from the Valgrind and UBSan pipelines, this script counts the exact incidence of **Memory Leaks**, **Undefined Behavior**, and **Logical Hangs**. This unmasks the specific failure profiles of each architecture, such as dense models' susceptibility to memory leaks versus their resilience to logical hangs.
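
A dependency-free sketch of the counting logic is shown below. The status strings are the ones written by the Valgrind and UBSan cells; the rows and the `failure_profile` helper are hypothetical. Note that a single iteration can contribute to more than one category, for example a Valgrind violation and a UBSan trigger.

```python
from collections import Counter

# Maps a (column, status) pair to the failure label shown in the chart.
FAILURE_LABELS = {
    ("Valgrind_Status", "SAFETY_VIOLATION"): "Memory Leaks",
    ("UB_Status", "UB_TRIGGERED"): "Undefined Behavior",
    ("Valgrind_Status", "TIMEOUT_LOGIC_HANG"): "Logical Hangs",
}

# Hypothetical rows; not study results.
rows = [
    {"Model": "model_a", "Valgrind_Status": "SAFETY_VIOLATION", "UB_Status": "UB_TRIGGERED"},
    {"Model": "model_a", "Valgrind_Status": "CLEAN", "UB_Status": "CLEAN"},
    {"Model": "model_b", "Valgrind_Status": "TIMEOUT_LOGIC_HANG", "UB_Status": "TIMEOUT"},
    {"Model": "model_b", "Valgrind_Status": "N/A", "UB_Status": "N/A"},
]

def failure_profile(rows):
    """Count failures per (model, failure label) pair."""
    counts = Counter()
    for row in rows:
        for (column, status), label in FAILURE_LABELS.items():
            if str(row.get(column, "")).strip() == status:
                counts[(row["Model"], label)] += 1
    return counts

profile = failure_profile(rows)
for key, count in sorted(profile.items()):
    print(key, count)
```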

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

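# 1. Aggregate failure modes per model using pandas named aggregation.
# Each output column names its source column explicitly, which avoids KeyErrors.
# Note: "SAFETY_VIOLATION" covers leaks and any other Valgrind-reported error,
# so the "Memory Leaks" series is an upper bound on pure leak counts.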
failure_modes = df_final.groupby("Model").agg(
    Memory_Leaks=('Valgrind_Status', lambda x: (x.astype(str).str.strip() == "SAFETY_VIOLATION").sum()),
    Undefined_Behavior=('UB_Status', lambda x: (x.astype(str).str.strip() == "UB_TRIGGERED").sum()),
    Logical_Hangs=('Valgrind_Status', lambda x: (x.astype(str).str.strip() == "TIMEOUT_LOGIC_HANG").sum())
).reset_index()

# Rename for clear visualization
failure_modes.columns = ["Model", "Memory Leaks", "Undefined Behavior", "Logical Hangs"]

# 2. Reshape for plotting
failure_melted = failure_modes.melt(id_vars="Model", var_name="Failure Type", value_name="Total Count")

# 3. Plotting the 4-Model Diagnostic
plt.figure(figsize=(14, 7))
sns.set_theme(style="whitegrid")

ax = sns.barplot(data=failure_melted, x="Model", y="Total Count", hue="Failure Type", palette="magma")

plt.title("Failure Profile Audit: Categorizing Technical Violations", fontsize=16, fontweight='bold', pad=20)
plt.ylabel("Cumulative Count of Failures", fontsize=12)
plt.xlabel("Model Architecture", fontsize=12)

# Adding value labels to the bars
for container in ax.containers:
    ax.bar_label(container, padding=3)

plt.savefig("/content/drive/MyDrive/CANAI_LLM_Results/Eval/Failure_Profiles_Corrected.png", dpi=300, bbox_inches='tight')
plt.show()

### 14. Data Visualization: Complexity Stress Test (Figure 3)
This cell generates the categorical distribution chart for the complexity stress test, directly addressing the "Reliability Ceiling" discussed in Section 4.2.

By grouping the iterations into discrete Cyclomatic Complexity (CCN) bins, this visualization tracks the exact structural density where a model's performance transitions from stability to systemic failure. A vertical threshold marker is injected at $CCN>10$ to highlight the cognitive breaking point where dense generalist models begin to output more memory leaks and logical hangs than safe executions.
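
To make the binning rule concrete, the following plain-Python sketch maps a CCN value to its interval label using right-closed intervals, so a CCN of exactly 10 lands in `6-10` and sits left of the threshold marker. The helper names are hypothetical, and the last bin is treated as open-ended here, an assumption based on its `51+` label.

```python
from bisect import bisect_left

# Upper edges of the right-closed intervals; anything above 50 falls into '51+'.
UPPER_EDGES = [5, 10, 15, 20, 25, 30, 40, 50]
LABELS = ['0-5', '6-10', '11-15', '16-20', '21-25', '26-30', '31-40', '41-50', '51+']

# The dashed line is drawn between category index 1 ('6-10') and index 2 ('11-15').
THRESHOLD_POSITION = 1.5


def ccn_interval(max_ccn):
    """Return the interval label for a CCN value (hypothetical helper)."""
    return LABELS[bisect_left(UPPER_EDGES, max_ccn)]


def is_past_threshold(max_ccn):
    """True when the value's bin is plotted to the right of the dashed line."""
    return LABELS.index(ccn_interval(max_ccn)) > THRESHOLD_POSITION


for value in (3, 10, 11, 75):
    print(value, ccn_interval(value), is_past_threshold(value))
```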

In [None]:
# ==========================================
# --- COMPLEXITY STRESS TEST (FIGURE 3) ---
# ==========================================
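import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# 1. Define the complexity intervals (bins) and labels
# The last edge is open-ended so the '51+' label also covers CCN values above 100.
bins = [0, 5, 10, 15, 20, 25, 30, 40, 50, np.inf]
labels = ['0-5', '6-10', '11-15', '16-20', '21-25', '26-30', '31-40', '41-50', '51+']

# 2. Assign each iteration to an interval in the dataframe
# include_lowest=True keeps a CCN of exactly 0 in the '0-5' bin instead of turning it into NaN.
df_final['CCN_Interval'] = pd.cut(df_final['Max_CCN'], bins=bins, labels=labels,
                                  right=True, include_lowest=True)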

# 3. Setting professional academic style and custom palette
sns.set_theme(style="whitegrid", font_scale=1.1)
custom_palette = {1.0: "#4C72B0", 0.0: "#C44E52"} # Blue for Pass, Red for Fail

# 4. Create the plot using catplot to treat intervals as categories
g = sns.catplot(
    data=df_final,
    x="CCN_Interval",
    hue="Mem_Safety_Score",
    col="Model",
    col_wrap=2,
    kind="count",
    height=5,
    aspect=1.5,
    palette=custom_palette,
    legend=False
)

# --- REFINEMENT LOOP ---
for ax in g.axes.flat:
    # Rotate labels for readability
    ax.tick_params(axis='x', rotation=45)
    ax.set_xlabel("Complexity Interval (Max CCN)", fontsize=12)
    ax.set_ylabel("Number of Iterations", fontsize=12)

    # Add a vertical dashed line to mark the 'Reliability Ceiling' (after the 6-10 bin)
    ax.axvline(x=1.5, color='black', linestyle='--', linewidth=2, alpha=0.5)

# Fix the Legend for conference standard
handles = [plt.Rectangle((0,0),1,1, color=custom_palette[1.0]),
           plt.Rectangle((0,0),1,1, color=custom_palette[0.0])]
g.fig.legend(handles=handles, labels=['PASS (Safe)', 'FAIL (Leak/Hang/UB)'],
             title="Outcome", loc='center right', bbox_to_anchor=(1.08, 0.5))

g.fig.suptitle("Reliability Gap by Complexity Interval: Categorical Distribution",
               fontsize=18, fontweight='bold', y=1.05)

plt.tight_layout()
plt.savefig("/content/drive/MyDrive/CANAI_LLM_Results/Eval/Binned_Complexity_Histogram.png", dpi=300, bbox_inches='tight')
plt.show()

### 15. Data Visualization: Logic Density (Figure 4a)

This cell generates the first half of the Architectural Efficiency analysis (Figure 4a), addressing Research Question 2 (RQ2).

To compare the structural characteristics of sparse Mixture-of-Experts (MoE) architectures against dense Transformers, this script calculates a derived metric: **Logic Density**. By dividing the Max Cyclomatic Complexity (CCN) by the total token count and normalizing it per 100 tokens, this violin plot visualizes the distribution of "information density" across the generated C solutions.
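
As a worked example of the formula, a solution with a Max CCN of 8 spread over 400 tokens has a Logic Density of 2 CCN points per 100 tokens. The sketch below uses a hypothetical helper name, and its handling of zero-token inputs is an assumption rather than part of the pipeline.

```python
import math


def logic_density(max_ccn, tokens):
    """CCN per 100 tokens; returns None when the token count is not positive."""
    if tokens <= 0:
        return None  # assumption: an empty generation has no defined density
    return max_ccn / tokens * 100


# A terse solution packs more branching logic into each token than a verbose one.
terse = logic_density(12, 300)
verbose = logic_density(8, 400)
print(f"terse: {terse:.2f}, verbose: {verbose:.2f}")
```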

In [None]:
# ==========================================
# --- LOGIC DENSITY BY MODEL (VIOLIN PLOT) ---
# ==========================================
import matplotlib.pyplot as plt
import seaborn as sns

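# Calculate the derived metric: How much logic is packed into the tokens?
# Zero-token rows are mapped to NaN so they do not produce infinite densities.
df_final['Logic_Density'] = (df_final['Max_CCN'] / df_final['Tokens'].replace(0, float('nan'))) * 100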

plt.figure(figsize=(12, 7))
sns.set_theme(style="whitegrid")

# Violin plot shows the full distribution 'shape' of logic density
sns.violinplot(
    data=df_final,
    x="Model",
    y="Logic_Density",
    hue="Model",
    palette="muted",
    inner="quartile",
    legend=False
)

plt.title("Architectural Efficiency: Logic Density (CCN per 100 Tokens)", fontsize=16, fontweight='bold', pad=20)
plt.ylabel("Logic Density (Complexity / Tokens * 100)", fontsize=12)
plt.xlabel("Model Architecture", fontsize=12)

plt.savefig("/content/drive/MyDrive/CANAI_LLM_Results/Eval/Logic_Density_Violin.png", dpi=300, bbox_inches='tight')
plt.show()

### 16. Data Visualization: Model Verbosity and the Safety "Token Tax" (Figure 4b)
This cell generates the second half of the Architectural Efficiency audit (Figure 4b). It visualizes the total token count distribution across the four evaluated architectures.

As analyzed in **Section 4.3**, this data helps differentiate between "concise" and "elaborative" reasoning models. These results support the finding that Mixture-of-Experts (MoE) architectures consistently expend a higher token budget to implement mandatory safety predicates - such as `malloc` return-value verification - which correlates with their lower incidence of memory leaks compared to more concise, "short-circuiting" dense models.

In [None]:
# ==========================================
# --- MODEL VERBOSITY DISTRIBUTION (FIG 4B) ---
# ==========================================
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.set_theme(style="whitegrid")

# Boxplot shows the median, quartiles, and outliers of token counts
sns.boxplot(data=df_final, x="Model", y="Tokens", hue="Model", palette="Pastel1", legend=False)

plt.title("Model Verbosity: Token Distribution per Solution", fontsize=15, fontweight='bold', pad=20)
plt.ylabel("Total Token Count", fontsize=12)
plt.xlabel("Model Architecture", fontsize=12)

plt.savefig("/content/drive/MyDrive/CANAI_LLM_Results/Eval/Verbosity_Boxplot.png", dpi=300, bbox_inches='tight')
plt.show()

### 17. Data Visualization: The Style-Safety Paradox (Figure 5)
This cell generates the model-specific regression analysis discussed in **Section 5.1**. It investigates whether "clean" code (defined by few stylistic deviations from the LLVM guide) correlates with higher runtime memory safety.

The resulting grid of regressions illustrates the **Style-Safety Paradox**: the finding that for many architectures, there is no significant positive correlation between aesthetic adherence and functional integrity.
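
A single correlation coefficient per model can serve as a numerical companion to each regression panel. The sketch below is an illustrative addition rather than part of the study's method: it computes a Pearson correlation by hand, which for a binary outcome is the point-biserial correlation, on invented data with a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; returns None if either variable has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a flat variable carries no correlation information
    return cov / math.sqrt(var_x * var_y)


# Invented data: style deviation counts and the matching 1=Pass / 0=Fail outcomes.
style_deviations = [1, 2, 3, 4]
mem_safety_score = [1, 0, 1, 0]

r = pearson_r(style_deviations, mem_safety_score)
print(f"r = {r:.3f}")
```

A value near zero for a given model would be consistent with the paradox described above, while a clearly negative value would suggest that more style deviations accompany fewer safe executions.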

In [None]:
# ==========================================
# --- STYLE VS. SAFETY REGRESSIONS (FIG 5) ---
# ==========================================
import matplotlib.pyplot as plt
import seaborn as sns

# Create a 2x2 grid for comparison
g = sns.lmplot(
    data=df_final,
    x="Style_Deviations",
    y="Mem_Safety_Score",
    col="Model",
    hue="Model",
    col_wrap=2,
    scatter_kws={'alpha':0.2, 's':60},
    line_kws={'lw':3},
    height=5,
    aspect=1.2,
    markers="x",
    palette="deep"
)

# Customize titles and labels
g.set_axis_labels("Style Deviations (Violations)", "Memory Safety (1=Pass, 0=Fail)")
g.fig.suptitle("Does Style Predict Safety? Model-Specific Regressions", fontsize=18, fontweight='bold', y=1.05)
g.set_titles("{col_name}")

plt.savefig("/content/drive/MyDrive/CANAI_LLM_Results/Eval/Style_vs_Safety_2x2.png", dpi=300, bbox_inches='tight')
plt.show()

### 18. Tabular Analysis: Granular Reliability Breakdown (Table 2)
This final analytical block produces the numerical data for the study's **Complexity Impact Table**.

By calculating the safety percentage within two Cyclomatic Complexity (CCN) tiers split at the threshold of 10, this script quantifies how each architecture's reliability decays as complexity grows. This granular view is essential for identifying the "Reliability Ceiling" - the specific complexity threshold beyond which an architecture can no longer guarantee memory safety or logical termination.
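
The tiered success rate can be checked by hand on a few rows. The following plain-Python sketch uses invented data and a hypothetical helper name, and it assumes that a CCN of exactly 10 belongs to the lower tier, in line with the $CCN>10$ threshold used for Figure 3.

```python
from collections import defaultdict


def tier_success_rates(rows, threshold=10):
    """Return {(model, tier): success %} from (model, max_ccn, safety_score) rows."""
    tally = defaultdict(lambda: [0, 0])  # [fail count, pass count]
    for model, max_ccn, safety_score in rows:
        # Assumption: the threshold value itself counts as the lower tier.
        tier = f"CCN <= {threshold}" if max_ccn <= threshold else f"CCN > {threshold}"
        tally[(model, tier)][1 if safety_score else 0] += 1
    return {
        key: round(passed / (failed + passed) * 100, 1)
        for key, (failed, passed) in tally.items()
    }


# Invented rows: (Model, Max_CCN, Mem_Safety_Score).
sample = [
    ("model_a", 4, 1), ("model_a", 10, 1),
    ("model_a", 12, 0), ("model_a", 18, 1),
    ("model_b", 7, 0), ("model_b", 9, 1),
]

rates = tier_success_rates(sample)
for key in sorted(rates):
    print(key, rates[key])
```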

In [None]:
# ==========================================
# --- GRANULAR RELIABILITY BREAKDOWN (TABLE 2) ---
# ==========================================
import pandas as pd

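# 1. Categorize complexity relative to the Reliability Ceiling
# A CCN of exactly 10 stays in the lower tier, matching the CCN > 10 threshold and the '6-10' bin.
df_final['Complexity_Tier'] = df_final['Max_CCN'].apply(lambda x: 'CCN <= 10' if x <= 10 else 'CCN > 10')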

# 2. Pivot the data to count Pass/Fail per Model per Tier
pivot_table = df_final.groupby(['Model', 'Complexity_Tier', 'Mem_Safety_Score']).size().unstack(fill_value=0)

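# 3. Renaming columns for academic clarity
# Reindex first so both outcome columns exist, in FAIL then PASS order, even if one never occurs.
pivot_table = pivot_table.reindex(columns=[0.0, 1.0], fill_value=0)
pivot_table.columns = ['FAIL Count', 'PASS Count']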

# 4. Flattening the multi-index for a clean display
summary_table = pivot_table.reset_index()

# 5. Optional: Adding a Success Rate % for extra depth
summary_table['Success %'] = (summary_table['PASS Count'] /
                             (summary_table['PASS Count'] + summary_table['FAIL Count']) * 100).round(1)

# Displaying the final table
print("--- Technical Breaking Point Summary Table ---")
display(summary_table.sort_values(by=['Complexity_Tier', 'Model']))