# ASIDE Results Analysis Notebook

This notebook provides analysis tools for ASIDE (Architecturally Separated Instruction-Data Embeddings) experimental results. It processes various evaluation metrics including:

- **SEP (Separation) Metrics**: Core ASIDE evaluation measuring instruction-data separation
- **AlpacaEval Scores**: General capability assessment using AlpacaEval benchmarks
- **Training Metrics**: Loss curves and training statistics
- **Output Quality Analysis**: Detection of repetitive or problematic model outputs

## Key Features

1. **Model Name Parsing**: Standardized parsing of model names across different file formats
2. **Data Integration**: Merging results from multiple evaluation sources
3. **Performance Comparison**: Tools for comparing ASIDE vs baseline methods
4. **Best Model Selection**: Automated selection of optimal models based on various metrics
5. **Visualization**: Plotting utilities for learning rate analysis and performance trends

## Usage

This notebook is designed to work with the standard ASIDE evaluation pipeline outputs. Adjust file paths and model names as needed for your specific experiments.


In [1]:
from analyze_results import *
import os
import warnings

import numpy as np
import pandas as pd

# Suppress FutureWarning messages
warnings.simplefilter(action='ignore', category=FutureWarning)

## Run Number Extraction and SEP Metric Analysis

These functions extract run numbers from model names and map them to SEP (separation) metrics. This is useful for analyzing how different training runs perform on the core ASIDE evaluation metric.

**SEP Metric**: Measures how well a model separates instructions from data. Higher values indicate better separation (closer to ASIDE's goal).

In [2]:
import re
def get_run_number_sep_metric_dict(df):
    """
    Extract run numbers from model names and map them to SEP metrics.
    
    This function processes a DataFrame containing model evaluation results,
    extracts run numbers from model names using regex patterns, and creates
    arrays of SEP metrics and utility scores indexed by run number.
    
    Args:
        df (pandas.DataFrame): DataFrame with columns:
            - 'model': Model names containing run numbers (e.g., 'dd_pure_run3_lr6e-6')
            - 'sep_metric': SEP separation scores (list format)
            - 'probe_in_instruct_asr': Probe-in-instruction success rates, returned here as the utility score (list format)
    
    Returns:
        tuple: (sep_metrics, utils)
            - sep_metrics (numpy.array): SEP metric values indexed by run number
            - utils (numpy.array): Utility scores indexed by run number
    
    Example:
        >>> df = pd.DataFrame({
        ...     'model': ['dd_pure_run0_lr6e-6', 'dd_run1_lr1e-5'],
        ...     'sep_metric': [[0.65, 0.02], [0.70, 0.03]],
        ...     'probe_in_instruct_asr': [[0.85, 0.01], [0.90, 0.02]]
        ... })
        >>> sep_metrics, utils = get_run_number_sep_metric_dict(df)
        >>> print(f"Run 1 SEP: {sep_metrics[1]}, Utility: {utils[1]}")
    
    Note:
        - Searches for patterns 'dd_pure_run(\d+)' first, then 'dd_run(\d+)'
        - Uses first element of metric arrays ([0] index)
        - Pre-allocates arrays of size 10, then trims to the number of matched
          rows, so run numbers must be contiguous from 0 and below 10
    """
    sep_metrics = np.zeros(10)
    utils = np.zeros(10)
    cnt = 0
    for _, row in df.iterrows():
        model_name = row['model']
        match = re.search(r'dd_pure_run(\d+)', model_name)
        if match is None: 
            match = re.search(r'dd_run(\d+)', model_name)
        if match:
            run_number = match.group(1)  # captured digits as a string; converted to int below
            sep_metrics[int(run_number)] = row['sep_metric'][0]
            utils[int(run_number)] = row['probe_in_instruct_asr'][0]
            cnt += 1

    sep_metrics = sep_metrics[:cnt]
    utils = utils[:cnt] 
    return sep_metrics, utils
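
The array-based return value is only well-defined when run numbers are contiguous and start at 0. As a minimal sketch of an alternative (not part of the notebook's API; the helper name `run_metrics_by_number` and the plain-dict row format are assumptions made for illustration), the same two regex patterns can populate a dictionary keyed by run number:

```python
import re

# Hypothetical helper: same two patterns as above, tried in the same order,
# but results are keyed by run number instead of array position.
_RUN_PATTERNS = (r'dd_pure_run(\d+)', r'dd_run(\d+)')


def run_metrics_by_number(rows):
    """Map run number -> (sep, utility) for rows shaped like DataFrame records."""
    out = {}
    for row in rows:
        for pattern in _RUN_PATTERNS:
            match = re.search(pattern, row['model'])
            if match:
                run = int(match.group(1))
                # Each metric is stored as [value, error]; keep the value only.
                out[run] = (row['sep_metric'][0], row['probe_in_instruct_asr'][0])
                break
    return out


# Toy rows with made-up numbers; 'original' has no run number and is skipped.
rows = [
    {'model': 'dd_pure_run3_lr6e-6', 'sep_metric': [0.65, 0.02], 'probe_in_instruct_asr': [0.85, 0.01]},
    {'model': 'dd_run5_lr1e-5', 'sep_metric': [0.70, 0.03], 'probe_in_instruct_asr': [0.90, 0.02]},
    {'model': 'original', 'sep_metric': [0.40, 0.02], 'probe_in_instruct_asr': [0.39, 0.01]},
]
metrics = run_metrics_by_number(rows)
print(metrics)
```

With a real DataFrame, `df.to_dict('records')` produces rows in this shape.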

## Learning Rate Analysis and Visualization

These functions analyze how different learning rates affect model performance, comparing ASIDE methods against baselines. This is crucial for hyperparameter optimization and understanding training dynamics.

**Key Comparisons**:
- `dd_pure`: ASIDE method performance
- `pretrained_vanilla`: Baseline vanilla model performance
- SEP metrics vs utility scores across learning rates

In [3]:
import re
import matplotlib.pyplot as plt
import seaborn as sns


def create_lr_dict(df):
    """
    Create a learning rate dictionary mapping LR values to performance metrics.
    
    This function processes experimental results to create a comprehensive mapping
    of learning rates to performance metrics for both ASIDE and baseline methods.
    It's essential for hyperparameter analysis and comparison studies.
    
    Args:
        df (pandas.DataFrame): DataFrame containing model results with columns:
            - 'model': Model names with embedded learning rates
            - 'sep_metric': SEP separation scores [value, error]
            - 'probe_in_instruct_asr': Attack success rates [value, error]
    
    Returns:
        dict: Learning rate mapping with structure:
            {
                'lr_string': (
                    dd_pure_sep,           # ASIDE separation score
                    dd_pure_prompt_asr,    # ASIDE attack success rate  
                    pretrained_sep,        # Baseline separation score
                    pretrained_prompt_asr  # Baseline attack success rate
                )
            }
    
    Model Name Patterns:
        - 'dd_pure_from_base_run1e-4_bs8': ASIDE method with LR 1e-4
        - 'from_base_pretrained_vanilla_run1e-4_bs8': Baseline with LR 1e-4
        
    Example Usage:
        >>> lr_dict = create_lr_dict(results_df)
        >>> print(f"LR 1e-4 ASIDE SEP: {lr_dict['1e-4'][0]}")
        >>> print(f"LR 1e-4 Baseline SEP: {lr_dict['1e-4'][2]}")
    
    Note:
        - Extracts learning rates from model names using regex 'run([0-9e.-]+)_bs'
        - Handles missing runs gracefully (sets values to None)
        - Groups results by learning rate for direct comparison
    """
    
    lr_dict = {}
    
    # Helper function to extract the LR from run_name using a regex
    # that matches something like "run1e-4" or "run5e-5" or "run2e-5" etc.
    def extract_lr_from_name(name):
        # A simple approach is to find the substring after 'run' up to '_bs'
        # e.g. "dd_pure_from_base_run1e-4_bs8" -> "1e-4"
        match = re.search(r'run([0-9e.-]+)_bs', name)
        if match:
            return match.group(1)  # e.g. "1e-4"
        else:
            return None
    
    # We will collect data for each LR in sub-dictionaries:
    #   { '1e-4': {'dd_pure': (sep, prompt_asr), 
    #              'pretrained': (sep, prompt_asr)} }
    # Then we will flatten to the final format.
    temp_storage = {}
    
    for idx, row in df.iterrows():
        run_name = row['model']  # adapt to your column name
        lr_str = extract_lr_from_name(run_name)
        if not lr_str:
            # Skip 'original' or 'original_inst' or anything w/o LR
            continue
        
        # Each metric column holds [value, error]; keep only the value.
        sep_val = row["sep_metric"][0]
        prompt_val = row["probe_in_instruct_asr"][0]
        
        if lr_str not in temp_storage:
            temp_storage[lr_str] = {}
        
        if 'dd_pure_from_base' in run_name:
            temp_storage[lr_str]['dd_pure'] = (sep_val, prompt_val)
        elif 'pretrained_vanilla' in run_name:
            temp_storage[lr_str]['pretrained'] = (sep_val, prompt_val)
    
    # Now build the final dictionary with the required structure:
    #   lr -> (dd_pure_sep, dd_pure_prompt_in_data_asr, 
    #          pretrained_vanilla_sep, pretrained_vanilla_prompt_in_data_asr)
    for lr_str, subdict in temp_storage.items():
        # Some runs might be missing from your data; handle gracefully
        dd_pure_sep, dd_pure_prompt = subdict.get('dd_pure', (None, None))
        pretrained_sep, pretrained_prompt = subdict.get('pretrained', (None, None))
        
        lr_dict[lr_str] = (
            dd_pure_sep, 
            dd_pure_prompt, 
            pretrained_sep, 
            pretrained_prompt
        )
    
    return lr_dict


def plot_results(lr_dict, original_value, original_inst_value):
    """
    Create comprehensive visualization of learning rate analysis results.
    
    This function generates publication-quality plots comparing ASIDE and baseline
    performance across different learning rates, with reference lines for
    original model performance.
    
    Args:
        lr_dict (dict): Learning rate dictionary from create_lr_dict()
            Format: lr_str -> (dd_pure_sep, dd_pure_prompt_asr, 
                              pretrained_sep, pretrained_prompt_asr)
        original_value (float): Baseline model performance (horizontal reference line)
        original_inst_value (float): Instruction-tuned baseline performance
    
    Plot Features:
        - Log-scale x-axis for learning rates
        - Solid lines for SEP metrics, dashed for attack success rates
        - Different colors for ASIDE vs baseline methods
        - Reference lines for original model performance
        - Professional styling with seaborn theme
    
    Visual Interpretation:
        - Higher SEP values = better instruction-data separation
        - Dashed lines plot 'probe_in_instruct_asr', the same quantity the
          reference lines label as utility
        - ASIDE should outperform baselines across learning rates
    
    Example:
        >>> lr_dict = create_lr_dict(df)
        >>> plot_results(lr_dict, 0.387, 0.504)
    """
    ## Apply seaborn style
    sns.set_theme(style="whitegrid")
    
    # Convert the keys of lr_dict to floats for sorting
    def str_to_float(lr_str):
        # Safely evaluate scientific notation
        return float(lr_str)
    
    # Sort the learning rates (numeric ascending)
    sorted_lrs = sorted(lr_dict.keys(), key=str_to_float)
    numeric_lrs = [str_to_float(k) for k in sorted_lrs]
    
    # Extract the four series
    dd_pure_sep = [lr_dict[k][0] for k in sorted_lrs]
    dd_pure_prompt = [lr_dict[k][1] for k in sorted_lrs]
    pretrained_sep = [lr_dict[k][2] for k in sorted_lrs]
    pretrained_prompt = [lr_dict[k][3] for k in sorted_lrs]
    
    # Create the figure
    plt.figure(figsize=(10, 6))
    
    # Plot dd_pure (sep = solid, prompt_in_data_asr = dashed)
    plt.plot(numeric_lrs, dd_pure_sep, label='dd_pure_sep', linestyle='-', color=sns.color_palette("muted")[0], linewidth=2)
    plt.plot(numeric_lrs, dd_pure_prompt, label='dd_pure_prompt_in_data_asr', linestyle='--', color=sns.color_palette("muted")[0], linewidth=2)
    
    # Plot pretrained_vanilla (sep = solid, prompt_in_data_asr = dashed)
    plt.plot(numeric_lrs, pretrained_sep, label='pretrained_vanilla_sep', linestyle='-', color=sns.color_palette("muted")[1], linewidth=2)
    plt.plot(numeric_lrs, pretrained_prompt, label='pretrained_vanilla_prompt_in_data_asr', linestyle='--', color=sns.color_palette("muted")[1], linewidth=2)
    
    # Add horizontal lines for original and original_inst
    plt.axhline(y=original_value, color=sns.color_palette("muted")[2], linestyle='-.', label=f'base utility={original_value:.3f}', linewidth=2)
    plt.axhline(y=original_inst_value, color=sns.color_palette("muted")[3], linestyle=':', label=f'base inst utility={original_inst_value:.3f}', linewidth=2)
    
    # Configure log-scale on x-axis
    plt.xscale('log')
    
    # Labels, title, and ticks
    plt.xlabel('Learning Rate', fontsize=12)
    plt.ylabel('Metric (e.g., "sep")', fontsize=12)
    plt.title('Comparison of dd_pure vs pretrained_vanilla', fontsize=14)
    
    # Legend to the side
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, frameon=True)
    
    # Adjust layout for better readability
    plt.tight_layout()
    
    # Show plot
    plt.show()
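
The learning-rate keys are strings, so they need a numeric sort key before plotting on a log axis. The following sketch restates the extraction regex from `create_lr_dict` in a standalone helper (the name `extract_lr` and the sample model names are illustrative assumptions) and shows why `key=float` matters:

```python
import re


def extract_lr(name):
    """Return the LR substring between 'run' and '_bs', or None if absent."""
    match = re.search(r'run([0-9e.-]+)_bs', name)
    return match.group(1) if match else None


# Illustrative names following the patterns described in the docstring.
names = [
    'dd_pure_from_base_run1e-4_bs8',
    'from_base_pretrained_vanilla_run5e-6_bs8',
    'dd_pure_from_base_run2e-5_bs8',
    'original',  # no learning rate in the name, so it is skipped
]

lr_strings = [lr for lr in map(extract_lr, names) if lr is not None]

# A plain string sort would order these '1e-4', '2e-5', '5e-6';
# converting to float gives true ascending order.
sorted_lrs = sorted(lr_strings, key=float)
print(sorted_lrs)
```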

In [4]:
def plot_metrics(df):
    """
    Convenient wrapper function to plot learning rate analysis from DataFrame.
    
    This function combines data processing and visualization into a single call,
    automatically extracting baseline values and creating the learning rate plot.
    
    Args:
        df (pandas.DataFrame): Results DataFrame containing:
            - Model results with learning rate information
            - Rows with model names 'original' and 'original_inst' for baselines
    
    Workflow:
        1. Create learning rate dictionary from DataFrame
        2. Extract baseline performance from 'original' and 'original_inst' rows
        3. Generate comparative visualization
    
    Expected DataFrame Structure:
        - Regular experiment rows with LR-encoded model names
        - Special rows: 'original' (base model), 'original_inst' (instruct-tuned)
    
    Example:
        >>> results_df = load_experiment_results()
        >>> plot_metrics(results_df)  # Displays interactive plot
    """
    # 1) Build the dictionary
    lr_dict = create_lr_dict(df)
    print(lr_dict)
    # 2) Look up the 'original' and 'original_inst' baseline rows by model name
    original_row = df.loc[df['model'] == 'original'].iloc[0]
    original_inst_row = df.loc[df['model'] == 'original_inst'].iloc[0]
    original_value = original_row["probe_in_instruct_asr"][0]     # e.g. 0.387
    original_inst_value = original_inst_row["probe_in_instruct_asr"][0] 
    
    # 3) Plot
    plot_results(lr_dict, original_value, original_inst_value)

## AlpacaEval Score Integration

These functions handle integration of AlpacaEval benchmark results with SEP evaluation data. AlpacaEval measures general instruction-following capability, providing a crucial utility metric to ensure ASIDE doesn't sacrifice performance for safety.

**Integration Process**:
1. Load AlpacaEval CSV results
2. Parse and standardize model names
3. Merge with SEP evaluation data
4. Enable comprehensive performance analysis

In [5]:
def get_alpacaeval_scores(file_path, substring, alpaca_ver="1.0"):
    """
    Extract AlpacaEval scores for models matching a specific substring pattern.
    
    This function processes AlpacaEval leaderboard CSV files to extract performance
    scores for models of interest, enabling integration with ASIDE evaluation results.
    
    Args:
        file_path (str): Path to AlpacaEval CSV leaderboard file
        substring (str): Substring to filter model names (e.g., "SFTv110")
        alpaca_ver (str): AlpacaEval version ("1.0" or "2.0")
            - "1.0": Uses 'win_rate' column
            - "2.0": Uses 'length_controlled_winrate' column
    
    Returns:
        pandas.DataFrame: Filtered results with columns:
            - 'model': Model name/path from CSV index
            - 'win_rate' or 'length_controlled_winrate': Performance score
    
    CSV Structure Expected:
        - First column (index): Model names/paths
        - Score columns: 'win_rate' (v1.0) or 'length_controlled_winrate' (v2.0)
    
    Example:
        >>> scores = get_alpacaeval_scores(
        ...     "./evals/alpaca_eval/leaderboard.csv", 
        ...     "SFTv110", 
        ...     alpaca_ver="1.0"
        ... )
        >>> print(f"Found {len(scores)} matching models")
    
    Note:
        - Filters models containing the substring in their name
        - Resets index to make 'model' a regular column
        - Handles both AlpacaEval v1.0 and v2.0 formats
    """
    # Read the CSV and use the first column as the index
    df = pd.read_csv(file_path, index_col=0)
    
    # Filter rows based on whether the index (model name) contains the substring
    filtered_df = df[df.index.str.contains(substring, na=False)]
    
    # Create a 'model' column from the index
    filtered_df = filtered_df.copy()
    filtered_df.loc[:, 'model'] = filtered_df.index
    
    # Only keep 'model' and the score column for the requested AlpacaEval version
    if alpaca_ver=="1.0":
        filtered_df = filtered_df[['model', 'win_rate']]
    else:
        filtered_df = filtered_df[['model', 'length_controlled_winrate']]
    
    # Re-index the DataFrame so 'model' is just a column, not the index
    filtered_df.reset_index(drop=True, inplace=True)
    
    return filtered_df
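
To make the expected CSV layout concrete, here is a pandas-free sketch of the same filtering logic using only the standard library. The helper name `filter_scores`, the in-memory CSV, and every path and score in it are made up for illustration; only the column names `win_rate` and `length_controlled_winrate` come from the function above.

```python
import csv
import io

# Toy leaderboard: the unnamed first column holds the model name/path.
LEADERBOARD_CSV = """,win_rate,length_controlled_winrate
models/demo/forward_rot/SFTv110/run_a,85.19,30.5
models/demo/vanilla/SFTv70/run_b,80.00,28.1
"""


def filter_scores(csv_text, substring, alpaca_ver="1.0"):
    """Return [{'model': ..., <score column>: ...}] for names containing substring."""
    score_col = 'win_rate' if alpaca_ver == "1.0" else 'length_controlled_winrate'
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    col_idx = header.index(score_col)
    return [
        {'model': row[0], score_col: float(row[col_idx])}
        for row in reader
        if row and substring in row[0]
    ]


scores = filter_scores(LEADERBOARD_CSV, "SFTv110")
scores_v2 = filter_scores(LEADERBOARD_CSV, "SFTv110", alpaca_ver="2.0")
print(scores)
print(scores_v2)
```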

## Model Name Parsing and Standardization

These functions handle the task of parsing and standardizing model names across different file formats and evaluation systems. Consistent naming is crucial for merging results from multiple evaluation sources.

**Challenges Addressed**:
- Different naming conventions between evaluation systems
- Extracting run numbers and model types from complex paths
- Standardizing names for reliable data merging

In [6]:

def parse_model_first_table(name: str) -> str:
    """
    Parse model names from the first table format to standardized format.
    
    This function handles model names from SEP evaluation results, removing
    unnecessary components and standardizing the format for merging with
    other evaluation data.
    
    Args:
        name (str): Original model name from SEP results
            Example: 'forward_rot_from_base_run0_val25feb'
    
    Returns:
        str: Cleaned model name
            Example: 'forward_rot_run0'
    
    Processing Steps:
        1. Remove the 'from_base_' substring if present
        2. Truncate everything from '_val' onward (removes date suffixes)
        3. Preserve core model type and run number
    
    Example:
        >>> parse_model_first_table('forward_rot_from_base_run0_val25feb')
        'forward_rot_run0'
        >>> parse_model_first_table('ise_run5_val01mar')
        'ise_run5'
    """
    # 1. Remove 'from_base_' if present
    name = name.replace("from_base_", "")
    
    # 2. If there's a '_val', cut everything from '_val' onward
    val_idx = name.find("_val")
    if val_idx != -1:
        name = name[:val_idx]
    
    # Result: e.g. 'forward_rot_run0'
    return name

def parse_model_second_table(path: str) -> str:
    """
    Parse model names from filepath format to standardized format.
    
    This function extracts standardized model names from full model paths,
    typically from AlpacaEval results or training checkpoint directories.
    
    Args:
        path (str): Full model filepath
            Example: '../models/llama_3.1_8b/forward_rot/train_checkpoints/SFTv70/from_base_run_5e-6_bs8/last/'
    
    Returns:
        str: Standardized model name
            Example: 'forward_rot_run5e-6_bs8'
    
    Processing Logic:
        1. Split path by '/' to get components
        2. Extract technique from expected position (index 3)
        3. Find 'from_base_run_' component and extract run information
        4. Combine technique + run info in standard format
    
    Example:
        >>> path = '../models/llama_3.1_8b/forward_rot/train_checkpoints/SFTv70/from_base_run_15/last/'
        >>> parse_model_second_table(path)
        'forward_rot_run15'
    
    Note:
        - Returns 'unknown_run-9999' if parsing fails
        - Handles both simple run numbers and complex LR+batch size formats
    """
    # Split by "/"
    parts = path.split("/")
    
    # Attempt to find technique in the 3rd index or whichever you expect
    # Adjust if your path structure differs
    if len(parts) > 3:
        technique = parts[3]
    else:
        technique = "unknown"
    
    # Find the part that starts with 'from_base_run_'
    run_parts = [p for p in parts if p.startswith("from_base_run_")]
    if run_parts:
        run_part = run_parts[0]  # e.g. 'from_base_run_15' or 'from_base_run_5e-6_bs8'
        # Extract everything after 'from_base_run_'
        run_number = run_part[len("from_base_run_"):]
    else:
        run_number = "-9999"
    
    # Combine technique + run number
    return f"{technique}_run{run_number}"

# --------------------- #
# MERGE DATA EXAMPLE    #
# --------------------- #

def merge_sep_alpaca_tables(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """
    Merge SEP evaluation results with AlpacaEval scores based on standardized model names.
    
    This function combines two critical evaluation datasets: SEP metrics (measuring
    instruction-data separation) and AlpacaEval scores (measuring general capability).
    The merge enables comprehensive analysis of the safety-utility tradeoff.
    
    Args:
        df1 (pandas.DataFrame): SEP evaluation results with columns:
            - 'model': Model names in first table format
            - 'sep_metric': Separation scores
            - 'probe_in_instruct_asr': Attack success rates
            - Other evaluation metrics
        df2 (pandas.DataFrame): AlpacaEval results with columns:
            - 'model': Model names in second table format (filepaths)
            - 'win_rate': AlpacaEval performance scores
    
    Returns:
        pandas.DataFrame: Merged dataset with columns from df1 plus:
            - 'alpacaeval 1.0': Renamed win_rate column
            - 'parsed_name': Standardized model names used for merging
    
    Merge Process:
        1. Parse model names in both DataFrames to common format
        2. Perform left join on parsed names
        3. Rename AlpacaEval column for clarity
        4. Preserve all SEP data, add AlpacaEval where available
    
    Example:
        >>> sep_df = load_sep_results()
        >>> alpaca_df = get_alpacaeval_scores("leaderboard.csv", "SFTv110")
        >>> merged = merge_sep_alpaca_tables(sep_df, alpaca_df)
        >>> n_both = merged['alpacaeval 1.0'].notna().sum()
        >>> print(f"{n_both} of {len(merged)} models have both metrics")
    
    Use Cases:
        - Analyze safety-utility tradeoffs
        - Identify models with optimal balance
        - Create comprehensive evaluation reports
    """
    # Create a parsed name column in df1
    df1["parsed_name"] = df1["model"].apply(parse_model_first_table)
    
    # Create a parsed name column in df2
    df2["parsed_name"] = df2["model"].apply(parse_model_second_table)
    
    # Merge (left-join) df2's 'win_rate' onto df1, keyed by 'parsed_name'
    merged = pd.merge(
        df1, 
        df2[["parsed_name", "win_rate"]], 
        on="parsed_name", 
        how="left"
    )
    
    # Rename 'win_rate' column to 'alpacaeval 1.0'
    merged.rename(columns={"win_rate": "alpacaeval 1.0"}, inplace=True)

    # Drop 'parsed_name' if you don't want it in the final output
    # Or you can keep it for debugging
    # merged.drop(columns="parsed_name", inplace=True)
    
    return merged
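
Because the merge is a left join, SEP rows without a matching AlpacaEval entry are kept with a missing score rather than raising an error. A quick way to spot such rows is to compare the parsed keys from both sides before merging. The sketch below restates the two parsers compactly so it runs on its own; the names `parse_first` and `parse_second` and the sample inputs are illustrative assumptions.

```python
def parse_first(name):
    """Compact restatement of parse_model_first_table."""
    name = name.replace("from_base_", "")
    idx = name.find("_val")
    return name if idx == -1 else name[:idx]


def parse_second(path):
    """Compact restatement of parse_model_second_table."""
    parts = path.split("/")
    technique = parts[3] if len(parts) > 3 else "unknown"
    run_parts = [p for p in parts if p.startswith("from_base_run_")]
    run = run_parts[0][len("from_base_run_"):] if run_parts else "-9999"
    return f"{technique}_run{run}"


# Sample inputs in the two naming formats described above.
sep_names = ['forward_rot_from_base_run0_val25feb', 'ise_run5_val01mar']
alpaca_paths = [
    '../models/llama_3.1_8b/forward_rot/train_checkpoints/SFTv70/from_base_run_0/last/',
]

sep_keys = {parse_first(n) for n in sep_names}
alpaca_keys = {parse_second(p) for p in alpaca_paths}

# Keys present in the SEP table but absent from AlpacaEval end up as NaN.
unmatched = sorted(sep_keys - alpaca_keys)
print(unmatched)
```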

## Advanced Model Name Parsing

Additional parsing functions for handling more complex model naming schemes, particularly those with embedded hyperparameters and version information.

In [7]:
def parse_model_name(full_name):
    """
    Parse complex model names to extract core components and run numbers.
    
    This function handles the most complex model naming format, extracting
    both the model type and run number from names with embedded training
    configuration information.
    
    Args:
        full_name (str): Complete model name with training config
            Example: "forward_rot_train_full_SFTv110_run=11"
    
    Returns:
        tuple: (model_type, run_number)
            Example: ("forward_rot", "11")
    
    Parsing Logic:
        1. Extract model type from prefix before "_train_full"
        2. Extract run number from "run=" parameter
        3. Handle fallback cases for malformed names
    
    Supported Formats:
        - "forward_rot_train_full_SFTv110_run=11"
        - "ise_train_full_SFTv70_run=5"
        - "single_emb_train_full_SFTv110_run=20"
    
    Example:
        >>> model_type, run_num = parse_model_name("forward_rot_train_full_SFTv110_run=11")
        >>> print(f"Model: {model_type}, Run: {run_num}")
        Model: forward_rot, Run: 11
    """
    # Identify the model type (prefix before "_train_full")
    model_type_match = re.match(r'^([^_]+(?:_[^_]+)?)_train_full', full_name)
    if model_type_match:
        model_type = model_type_match.group(1)
    else:
        # Fallback if pattern doesn't match
        model_type = full_name.split('_train_full')[0]
    
    # Extract the run number after "run="
    run_match = re.search(r'run=([^/]+)', full_name)
    if run_match:
        run_number = run_match.group(1)
    else:
        # Fallback if run number not found
        run_number = "-9999"
    
    return model_type, run_number

def standardize_model_name(full_name):
    """
    Convert complex model names to standardized format for data merging.
    
    This function creates consistent model identifiers that can be used
    across different evaluation systems and data sources.
    
    Args:
        full_name (str): Complete model name from training logs
            Example: "forward_rot_train_full_SFTv110_run=11"
    
    Returns:
        str: Standardized model identifier
            Example: "forward_rot_run11"
    
    Standardization Benefits:
        - Consistent naming across evaluation systems
        - Simplified model identification
        - Reliable data merging capabilities
        - Easy filtering and grouping operations
    
    Example:
        >>> standardize_model_name("forward_rot_train_full_SFTv110_run=11")
        'forward_rot_run11'
        >>> standardize_model_name("ise_train_full_SFTv70_run=5")
        'ise_run5'
    """
    model_type, run_number = parse_model_name(full_name)
    return f"{model_type}_run{run_number}"

def transform_losses_df(df):
    """
    Add standardized model name column to training losses DataFrame.
    
    This function prepares training loss data for merging with evaluation
    results by adding a standardized model name column.
    
    Args:
        df (pandas.DataFrame): Training losses DataFrame with 'model' column
            containing original complex model names
    
    Returns:
        pandas.DataFrame: DataFrame with added 'parsed_name' column
            containing standardized model identifiers
    
    Use Case:
        Enables correlation analysis between training dynamics (loss curves)
        and final evaluation performance (SEP, AlpacaEval).
    
    Example:
        >>> losses_df = load_training_losses()
        >>> losses_df = transform_losses_df(losses_df)
        >>> print(losses_df[['model', 'parsed_name']].head())
    """
    df['parsed_name'] = df['model'].apply(standardize_model_name)
    return df
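
The two regexes in `parse_model_name` have a few edge cases worth seeing side by side: the model-type pattern captures at most two underscore-separated tokens, and the run pattern stops at the first `/`. The sketch below restates the function compactly so it runs standalone; the inputs with a trailing `/last` and with no `run=` token are hypothetical, chosen only to exercise those branches.

```python
import re


def parse_model_name(full_name):
    """Compact restatement of the parser above, same regexes and fallbacks."""
    type_match = re.match(r'^([^_]+(?:_[^_]+)?)_train_full', full_name)
    model_type = type_match.group(1) if type_match else full_name.split('_train_full')[0]
    run_match = re.search(r'run=([^/]+)', full_name)
    run_number = run_match.group(1) if run_match else "-9999"
    return model_type, run_number


cases = [
    "forward_rot_train_full_SFTv110_run=11",
    "ise_train_full_SFTv70_run=5",
    "single_emb_train_full_SFTv110_run=20/last",  # hypothetical: run value ends at '/'
    "forward_rot_train_full_SFTv110",             # hypothetical: no 'run=' token
]
parsed = [parse_model_name(c) for c in cases]
for name, result in zip(cases, parsed):
    print(name, '->', result)
```

The `-9999` placeholder is what downstream code filters on, so names without a `run=` token drop out silently rather than failing.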

## AlpacaEval Output Parsing

This function parses AlpacaEval output files directly from evaluation directories, extracting performance scores and associating them with model identifiers.

In [8]:
def parse_alpaca_outputs(directory, substring):
    """
    Parse AlpacaEval output files from a directory to extract performance scores.
    
    This function processes raw AlpacaEval output files, extracting model paths
    and their corresponding performance scores. It's useful when working with
    direct evaluation outputs rather than processed leaderboard files.
    
    Args:
        directory (str): Path to directory containing AlpacaEval output files
        substring (str): Substring to filter relevant lines (e.g., "SFTv110")
    
    Returns:
        pandas.DataFrame: Parsed results with columns:
            - 'model': Model identifier extracted from path
            - 'alpacaeval 1.0': Performance score
            - 'parsed_name': Standardized model name for merging
    
    File Format Expected:
        Each line should contain:
        <model_path> <score> [other_data...]
        
        Example line:
        ../models/llama_3.1_8b/forward_rot/SFTv110/from_base_run_15/last/ 85.19
    
    Processing Steps:
        1. Read all files in directory
        2. Filter lines containing the substring
        3. Extract model path and score from each line
        4. Standardize model names for consistency
        5. Remove entries with failed parsing (-9999 indicator)
    
    Example:
        >>> scores = parse_alpaca_outputs("./alpaca_outputs/", "SFTv110")
        >>> print(f"Parsed {len(scores)} model scores")
        >>> print(scores[['parsed_name', 'alpacaeval 1.0']].head())
    
    Note:
        - Handles multiple files in the directory
        - Skips lines with fewer than two tokens or a non-numeric score
        - Drops rows whose name has no 'run=' token, since
          standardize_model_name then yields the '-9999' placeholder
    """
    data = []

    # Iterate over all items in the directory
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        
        # Process only if it's a regular file
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as f:
                for line in f:
                    # Check if the filter substring is in the line
                    if substring in line:
                        # Split line on whitespace
                        parts = line.strip().split()
                        
                        # Ensure we have at least 2 parts (path & numeric value)
                        if len(parts) < 2:
                            continue

                        full_path = parts[0]      # e.g. ../models/.../SFTv110/from_base_run_15/...
                        win_rate_str = parts[1]  # e.g. 85.19

                        # Find where the filter substring starts in the path
                        idx = full_path.find(substring)
                        if idx == -1:
                            continue

                        # Keep everything from the substring onward, removing trailing slashes
                        model_str = full_path[idx:].rstrip('/')

                        # Convert second token to float
                        try:
                            win_rate = float(win_rate_str)
                        except ValueError:
                            continue

                        # Append to data
                        data.append((model_str, win_rate))

    # Create a DataFrame
    df = pd.DataFrame(data, columns=['model', 'win_rate'])
    df["parsed_name"] = df["model"].apply(standardize_model_name)
    df = df[~df["parsed_name"].str.contains("-9999", na=False)]

    df.rename(columns={"win_rate": "alpacaeval 1.0"}, inplace=True)
    return df
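
The per-line handling inside the loop can be isolated into a small function, which makes the skip conditions easier to test. This is a sketch under assumptions: the helper name `parse_score_line` is hypothetical, and the sample lines other than the one from the docstring are invented to exercise each branch.

```python
def parse_score_line(line, substring):
    """Return (model_str, score) for a usable line, or None if it is skipped."""
    if substring not in line:
        return None
    parts = line.strip().split()
    if len(parts) < 2:
        return None  # need a path token and a score token
    idx = parts[0].find(substring)
    if idx == -1:
        return None  # substring appeared elsewhere in the line, not in the path
    try:
        score = float(parts[1])
    except ValueError:
        return None  # second token is not numeric
    return parts[0][idx:].rstrip('/'), score


good = parse_score_line(
    "../models/llama_3.1_8b/forward_rot/SFTv110/from_base_run_15/last/ 85.19",
    "SFTv110",
)
bad_score = parse_score_line("some/path/SFTv110/model/ not-a-number", "SFTv110")
too_short = parse_score_line("some/path/SFTv110/model/", "SFTv110")
wrong_token = parse_score_line("some/path/model/ SFTv110", "SFTv110")
print(good)
```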

## Output Quality Analysis

These functions analyze the quality of model outputs, particularly focusing on detecting repetitive or problematic generations that might indicate training issues or model failures.
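
As a minimal sketch of one such check (an illustration under assumed defaults, not necessarily the exact procedure behind any reported numbers): sample a few fixed-length substrings and flag the text if any of them recurs often. The function name `looks_repetitive` and the `seed` parameter are assumptions added here so the example gives the same verdict on every run.

```python
import random


def looks_repetitive(text, num_positions=10, substring_len=15, min_repeats=5, seed=0):
    """Flag text in which a sampled substring recurs at least min_repeats times."""
    if len(text) < substring_len:
        return False
    rng = random.Random(seed)  # local RNG: the same text always gets the same verdict
    starts = range(len(text) - substring_len + 1)
    sample = rng.sample(starts, k=min(num_positions, len(starts)))
    # Note: str.count counts non-overlapping occurrences.
    return any(
        text.count(text[s:s + substring_len]) >= min_repeats
        for s in sample
    )


looped = looks_repetitive("Hello world! " * 10)
varied = looks_repetitive("This is a normal response with varied content.")
print(looped, varied)
```

Sampling keeps the cost low on long outputs, at the price of occasionally missing a repeated span that no sampled position happens to hit.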

In [9]:
import json
import glob
import os
import random
import pandas as pd

def is_repetitive(
    text: str, 
    num_positions: int = 10, 
    substring_len: int = 15, 
    min_repeats: int = 5
) -> bool:
    """
    Detect repetitive text using a random-sampling method.
    
    This function identifies repetitive model outputs by sampling random
    substrings and checking if they appear frequently throughout the text.
    It's designed to catch common failure modes like repetitive loops.
    
    Args:
        text (str): Text to analyze for repetitiveness
        num_positions (int): Number of random positions to sample (default: 10)
        substring_len (int): Length of substrings to extract (default: 15)
        min_repeats (int): Minimum repetitions to flag as repetitive (default: 5)
    
    Returns:
        bool: True if text is considered repetitive, False otherwise
    
    Algorithm:
        1. Randomly sample up to num_positions starting positions
        2. Extract substring_len characters from each position
        3. Count non-overlapping occurrences of each substring in the full text
        4. Return True if any substring appears >= min_repeats times
    
    Use Cases:
        - Quality control for model outputs
        - Identifying training instabilities
        - Filtering problematic generations
        - Model comparison based on output quality
    
    Example:
        >>> repetitive_text = "Hello world! " * 10
        >>> is_repetitive(repetitive_text)
        True
        >>> normal_text = "This is a normal response with varied content."
        >>> is_repetitive(normal_text)
        False
    
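    Note:
        - Uses random sampling for efficiency on long texts, so borderline
          texts can be classified differently across calls unless the
          `random` module is seeded
        - Larger num_positions or smaller min_repeats flag more texts
          (fewer misses, more false positives)
        - Returns False when the text is shorter than substring_len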
    """
    # If text is too short to extract a substring of substring_len
    if len(text) < substring_len:
        return False
    
    # All valid starting positions
    max_start = len(text) - substring_len
    # If fewer than `num_positions` possible starts, sample them all
    sample_size = min(num_positions, max_start + 1)
    
    # Randomly pick positions from the valid range
    positions = random.sample(range(max_start + 1), k=sample_size)
    
    for start in positions:
        candidate = text[start : start + substring_len]
        # Count occurrences of candidate in the entire text
        count_occurrences = text.count(candidate)
        if count_occurrences >= min_repeats:
            return True
    
    return False

def analyze_outputs_folder(folder_with_json):
    """
    Analyze model output quality across all JSON files in a folder.
    
    This function processes evaluation result files to compute quality metrics
    including repetition rates and output lengths. It's essential for identifying
    models with generation issues.
    
    Args:
        folder_with_json (str): Path to folder containing JSON result files
            Each JSON should contain evaluation results with fields:
            - 'output1_probe_in_data': Model response when probe is in data section
            - 'output2_probe_in_task': Model response when probe is in task section
    
    Returns:
        pandas.DataFrame: Quality analysis results with columns:
            - 'model': Model identifier from filename
            - 'repetition_d': Repetition rate for data section outputs (0-1)
            - 'repetition_t': Repetition rate for task section outputs (0-1) 
            - 'len_d': Average character length of data section outputs
            - 'len_t': Average character length of task section outputs
    
    Quality Indicators (rules of thumb, not calibrated thresholds):
        - Low repetition rates (< 0.1) indicate healthy generation
        - High repetition rates (> 0.3) suggest training issues
        - Similar lengths across the two conditions indicate stability
        - Extreme length differences may indicate problematic generations
    
    Example:
        >>> quality_df = analyze_outputs_folder("./model_outputs/llama_3.1_8b/")
        >>> print(quality_df[['model', 'repetition_d', 'repetition_t']].head())
        >>> problematic = quality_df[quality_df['repetition_d'] > 0.3]
        >>> print(f"Found {len(problematic)} models with high repetition")
    
    Use Cases:
        - Model selection based on output quality
        - Identifying training hyperparameters that cause instability
        - Quality control in large-scale experiments
        - Comparing output stability across methods
    """
    rows = []
    
    for filepath in glob.glob(os.path.join(folder_with_json, "*.json")):
        with open(filepath, "r", encoding="utf-8") as f:
            data = json.load(f)  # list of dicts

        # Track stats
        count_d_reps = 0
        count_t_reps = 0
        lengths_d = []
        lengths_t = []
        
        for entry in data:
            text_d = entry["output1_probe_in_data"]
            text_t = entry["output2_probe_in_task"]
            
            # Check repetition
            if is_repetitive(text_d):
                count_d_reps += 1
            if is_repetitive(text_t):
                count_t_reps += 1
            
            # Record output lengths in characters (not tokens)
            lengths_d.append(len(text_d))
            lengths_t.append(len(text_t))
        
        total = len(data)
        if total > 0:
            repetition_d = count_d_reps / total
            repetition_t = count_t_reps / total
            len_d = sum(lengths_d) / total
            len_t = sum(lengths_t) / total
        else:
            repetition_d = 0.0
            repetition_t = 0.0
            len_d = 0.0
            len_t = 0.0
        
        # Use the filename (without extension) as the model identifier
        model_name = os.path.splitext(os.path.basename(filepath))[0]
        
        rows.append({
            "model": model_name,
            "repetition_d": repetition_d,
            "repetition_t": repetition_t,
            "len_d": len_d,
            "len_t": len_t,
        })
    
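    # Pass the columns explicitly so an empty folder still yields the documented schema
    df = pd.DataFrame(
        rows, columns=["model", "repetition_d", "repetition_t", "len_d", "len_t"]
    )
    return df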

## Training Results Aggregation

This function collects the minimum evaluation loss of each training run across experiment subfolders, so that runs can be compared with each other and related to final performance.
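
The per-folder step can be sketched on its own, assuming the folder layout and `losses_metrics.json` format described in the docstring of `aggregate_experiment_results`. The helper name `read_min_eval_loss`, the run folder names, and the loss values are made up for illustration. Unlike the notebook function, this sketch returns `None` for an empty `eval_loss` list instead of raising.

```python
import json
import os
import tempfile

def read_min_eval_loss(json_path):
    """Return the minimum eval_loss in a metrics file, or None if unavailable.

    Hypothetical helper: accepts either a list of losses or a single value.
    """
    with open(json_path, "r", encoding="utf-8") as f:
        metrics = json.load(f)
    value = metrics.get("eval_loss")
    if isinstance(value, list):
        # An empty list has no minimum; report it as missing
        return min(value) if value else None
    return value

# Build a throwaway directory tree with made-up runs to exercise the helper
with tempfile.TemporaryDirectory() as root:
    samples = {"run_a": [1.2, 0.9, 1.0], "run_b": 0.8, "run_c": []}
    for name, eval_loss in samples.items():
        os.makedirs(os.path.join(root, name))
        path = os.path.join(root, name, "losses_metrics.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"eval_loss": eval_loss}, f)

    found = {
        name: read_min_eval_loss(os.path.join(root, name, "losses_metrics.json"))
        for name in sorted(os.listdir(root))
    }

print(found)
```

Returning `None` keeps runs with no recorded evaluation loss visible in the result, which can be easier to audit than a printed error message.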

In [11]:
from collections import defaultdict

def aggregate_experiment_results(train_evals_path):
    """
    Aggregate training metrics from multiple experiment subfolders.
    
    This function processes training logs from multiple experiments, extracting
    minimum evaluation loss for each model. It's crucial for understanding
    training dynamics and correlating them with final performance.
    
    Args:
        train_evals_path (str): Path to main directory containing experiment subfolders
            Expected structure:
            train_evals_path/
            ├── model1_config/
            │   └── losses_metrics.json
            ├── model2_config/
            │   └── losses_metrics.json
            └── ...
    
    Returns:
        pandas.DataFrame: Aggregated results with columns:
            - 'parsed_name': Standardized model identifier
            - 'min_eval_loss': Minimum evaluation loss achieved during training
    
    JSON File Format Expected:
        {
            "eval_loss": [loss1, loss2, loss3, ...] or single_value
            // other metrics...
        }
    
    Processing Steps:
        1. Iterate through all subfolders
        2. Load losses_metrics.json from each folder
        3. Extract minimum evaluation loss
        4. Sort the results by folder name
        5. Standardize model names with transform_losses_df and drop the
           raw 'model' column
    
    Use Cases:
        - Training stability analysis
        - Hyperparameter optimization
        - Correlation with final performance
        - Model selection based on training metrics
    
    Example:
        >>> losses_df = aggregate_experiment_results("./train_logs/llama_3.1_8b/SFTv110/")
        >>> print(f"Processed {len(losses_df)} training runs")
        >>> best_loss = losses_df.loc[losses_df['min_eval_loss'].idxmin()]
        >>> print(f"Best model: {best_loss['parsed_name']} (loss: {best_loss['min_eval_loss']:.4f})")
    
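    Note:
        - Handles both list and scalar eval_loss formats
        - Skips folders without losses_metrics.json or without an "eval_loss" key
        - Prints an error and skips folders whose file cannot be parsed or
          whose eval_loss list is empty
        - Removes original 'model' column, keeps only parsed names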
    """
    # Dictionary to store results
    results = defaultdict(float)

    # Loop through all subfolders in the main directory
    for model_folder in os.listdir(train_evals_path):
        folder_path = os.path.join(train_evals_path, model_folder)
        
        # Make sure it's a directory, not a file
        if not os.path.isdir(folder_path):
            continue
        
        # Path to the losses_metrics.json file
        json_path = os.path.join(folder_path, "losses_metrics.json")
        
        # Check if the file exists
        if os.path.exists(json_path):
            try:
                # Read the JSON file
                with open(json_path, 'r') as f:
                    metrics_data = json.load(f)
                
                # Extract the minimum eval_loss
                if "eval_loss" in metrics_data:
                    min_eval_loss = metrics_data["eval_loss"]
                    if isinstance(min_eval_loss, list):
                        min_eval_loss = min(min_eval_loss)
                    results[model_folder] = min_eval_loss
            except Exception as e:
                print(f"Error processing {model_folder}: {e}")

    # Create pandas DataFrame
    df = pd.DataFrame(list(results.items()), columns=["model", "min_eval_loss"])

    # Sort by model name
    df = df.sort_values("model").reset_index(drop=True)
    df = transform_losses_df(df)
    del df["model"]
    return df

## Best Model Selection

This function provides automated selection of the best performing models based on various metrics, enabling systematic comparison across different embedding methods.
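
The selection logic amounts to "filter by name substring, then keep the top rows per group". Below is a minimal sketch of that pattern on a toy DataFrame, assuming pandas is available. The model names and loss values are invented for illustration, and `regex=False` is a choice made here to match substrings literally.

```python
import pandas as pd

# Toy results table; names and values are made up for illustration
toy = pd.DataFrame({
    "parsed_name": ["forward_rot_run1", "forward_rot_run2", "ise_run1", "ise_run2"],
    "min_eval_loss": [1.10, 1.02, 0.97, 0.95],
})

picked = []
for family in ["forward_rot", "ise", "dd_pure"]:
    # Literal substring match on the model name
    subset = toy[toy["parsed_name"].str.contains(family, regex=False, na=False)]
    if subset.empty:
        continue  # "dd_pure" has no rows in the toy table
    # Lower loss is better, so keep the smallest value per family
    picked.append(subset.nsmallest(1, "min_eval_loss"))

best = pd.concat(picked, ignore_index=True)
print(best)
```

For metrics where higher is better, `nlargest` would take the place of `nsmallest`.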

In [12]:
import numpy as np
import pandas as pd

def best_models_for_substrings(df, column_name, substring_list, use_max=True, top_k=3):
    """
    Find the best performing models for each embedding type based on a specified metric.
    
    This function enables systematic comparison across different ASIDE variants
    and baselines by identifying top performers in each category. It's essential
    for fair comparison and model selection.
    
    Args:
        df (pandas.DataFrame): Results DataFrame with model performance data
        column_name (str): Name of column to optimize (e.g., 'sep_metric', 'alpacaeval 1.0')
        substring_list (list): List of embedding type identifiers to search for
            Example: ["forward_rot", "ise", "single_emb"]
        use_max (bool): If True, select models with maximum values (default: True)
                       If False, select models with minimum values (for loss metrics)
        top_k (int): Number of top models to return per embedding type (default: 3)
    
    Returns:
        pandas.DataFrame: Up to top_k best rows per embedding type, concatenated
            in the order of substring_list (best first within each type)
    
    Column Value Handling:
        - Numeric values: Used directly for comparison
        - Lists/arrays: Uses first element [0] for comparison
        - Any other value is compared as-is
    
    Use Cases:
        - Systematic embedding method comparison
        - Hyperparameter optimization within methods
        - Creating performance summaries
        - Identifying best models for further analysis
    
    Example:
        >>> # Find best models by SEP metric
        >>> best_sep = best_models_for_substrings(
        ...     results_df, 
        ...     'sep_metric', 
        ...     ["forward_rot", "ise", "single_emb"],
        ...     use_max=True, 
        ...     top_k=1
        ... )
        >>> print("Best models by SEP score:")
        >>> print(best_sep[['parsed_name', 'sep_metric']])
        
        >>> # Find best models by training loss
        >>> best_loss = best_models_for_substrings(
        ...     losses_df, 
        ...     'min_eval_loss', 
        ...     ["forward_rot", "dd_pure", "ise"],
        ...     use_max=False,  # Lower loss is better
        ...     top_k=1
        ... )
    
    Filtering Logic:
        - Searches for substring matches in 'parsed_name' column
        - Handles partial matches (e.g., "forward_rot" matches "forward_rot_run5")
        - Substrings are interpreted as regular expressions (the pandas
          str.contains default), so escape special characters if needed
        - Returns empty DataFrame if no matches found
    
    Note:
        - Preserves all original columns in output
        - Adds temporary comparison column that is automatically removed
        - Raises KeyError if 'parsed_name' or column_name is missing from df
    """
    
    # Helper function to get comparable scalar from column values
    def get_scalar_value(x):
        # If it's a list, tuple, or numpy array, compare by the first element
        if isinstance(x, (list, tuple, np.ndarray)) and len(x) > 0:
            return x[0]
        return x  # Non-list/array values returned as-is
    
    best_rows = []
    
    for substring in substring_list:
        # Filter rows whose 'parsed_name' contains the substring
        subset = df[df['parsed_name'].str.contains(substring, na=False)].copy()
        if subset.empty:
            # No match for this substring, skip
            continue
        
        # Compute a new column with scalar values to compare
        subset['_comp_value'] = subset[column_name].apply(get_scalar_value)
        
        # Choose top_k by max or min
        if use_max:
            chosen = subset.nlargest(top_k, '_comp_value')
        else:
            chosen = subset.nsmallest(top_k, '_comp_value')
        
        best_rows.append(chosen)
    
    # Concatenate results for all substrings
    if best_rows:
        result_df = pd.concat(best_rows, ignore_index=True)
        # Remove the helper column
        result_df.drop(columns=['_comp_value'], inplace=True, errors='ignore')
        return result_df
    else:
        # Return empty DataFrame if no rows matched
        return pd.DataFrame(columns=df.columns)

## Example Usage: Selecting the Best Model

This example uses the functions above to pick, for each embedding method, the Qwen2.5-7B run with the lowest minimum evaluation loss.

In [13]:
model = "qwen2.5_7b"
sft = "SFTv70"

train_evals_path = f"/nfs/scistore19/alistgrp/avolkova/side/model_outputs/train_logs/{model}/{sft}"
losses_df = aggregate_experiment_results(train_evals_path)

search_substrings = ["forward_rot", "dd_pure", "ise"]  # one entry per embedding method
col = "min_eval_loss"
# Select the single best model per method; lower loss is better, hence use_max=False
best_loss = best_models_for_substrings(losses_df, col, search_substrings, use_max=False, top_k=1)
print(f"Best by {col}:")
best_loss

Error processing ise_train_full_SFTv70_run=5e-6_rotation_0_alpha=0.0: min() iterable argument is empty
Error processing forward_rot_train_full_SFTv70_run=rotation_0.5_alpha=1.5708: min() iterable argument is empty
Error processing dd_pure_train_full_SFTv70_run=5e-6_gradual_rotation_alpha=0.0: min() iterable argument is empty
Best by min_eval_loss:


| | min_eval_loss | parsed_name |
|---|---|---|
| 0 | 1.019663 | forward_rot_run14_alpha=1.57079633 |
| 1 | 0.953548 | dd_pure_run18_alpha=1.57079633 |
| 2 | 0.95431 | ise_run34_alpha=1.57079633 |


## Analysis Example: SEP Metric Evaluation

This example loads the SEP evaluation results for one model family and keeps the `fullsep` evaluations, so that the baseline and the embedding methods can be compared side by side.
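
Each `sep_metric` entry in the output table of this section is a two-element list, and the notebook compares models by the first element. The sketch below ranks the four `fullsep` rows that way, using the values copied from that table. Treating the second element as an uncertainty estimate is an assumption, since the notebook does not say what it represents.

```python
# (model, sep_metric) pairs copied from the output table in this section
rows = [
    ("base_fullsep", [0.375, 0.006]),
    ("forward_rot_from_inst_run14_fullsep", [0.641, 0.006]),
    ("from_inst_single_run18_fullsep", [0.418, 0.006]),
    ("ise_from_inst_run34_fullsep", [0.419, 0.006]),
]

# Rank by the first element of each pair, highest SEP score first
ranked = sorted(rows, key=lambda row: row[1][0], reverse=True)

# Difference to the baseline row, for a quick side-by-side comparison
baseline = dict(rows)["base_fullsep"][0]
for name, (score, second) in ranked:
    # `second` is assumed to be an uncertainty estimate (not confirmed by the notebook)
    print(f"{name:38s} sep={score:.3f} (second value {second:.3f}), "
          f"vs. base: {score - baseline:+.3f}")
```

Sorting on the first element mirrors how `get_scalar_value` reduces list-valued columns before comparison.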

In [17]:
sft = "SFTv70"
alpaca_ver = "1.0"
path1 = "./evals/data/tatsu-lab/alpaca_eval/alpaca_eval_gpt4/leaderboard.csv"
path2 = "./evals/data/tatsu-lab/alpaca_farm/alpaca_eval_gpt4/leaderboard.csv"
model = "Qwen2.5-7B"

outputs_path = f"./model_outputs/{model}/{sft}"
scores = get_df_scores_for_model(outputs_path)
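# Each sep_metric entry is a two-element list; keep the first element as the score
sep = [x[0] for x in scores["sep_metric"]]
# Keep only the "fullsep" evaluations and the model, sep_metric, and probe_in_instruct_asr columns
scores = scores[scores["model"].str.contains("fullsep")].iloc[:, [0, 1, 3]].sort_values(by="model")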
scores

| | model | sep_metric | probe_in_instruct_asr |
|---|---|---|---|
| 0 | base_fullsep | [0.375, 0.006] | [0.774, 0.004] |
| 2 | forward_rot_from_inst_run14_fullsep | [0.641, 0.006] | [0.726, 0.005] |
| 3 | from_inst_single_run18_fullsep | [0.418, 0.006] | [0.717, 0.005] |
| 4 | ise_from_inst_run34_fullsep | [0.419, 0.006] | [0.713, 0.005] |
