# Performance Analysis Framework

This notebook compares a model's per-signal recall against a random baseline, using Wilson confidence intervals to decide whether the model performs better than chance.

## Key Features:
- **Statistical Significance Testing**: Compare per-signal recall against a random baseline
- **Confidence Intervals**: Calculate Wilson confidence intervals for the baseline recall
- **Visualizations**: Four-panel plots of the baseline intervals and of actual versus expected recall
- **Theoretical Validation**: Monte Carlo simulation to check the analytical intervals

## Signal Convention:
- **-1**: Sell signal
- **0**: Hold signal  
- **1**: Buy signal

In [1]:
# Import Required Libraries
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportion_confint
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Optional, Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')


# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📊 Performance Analysis Framework Loaded Successfully!")
print("🎯 Ready for statistical significance testing and visualization")

📊 Performance Analysis Framework Loaded Successfully!
🎯 Ready for statistical significance testing and visualization


In [12]:
target_feature = pd.read_parquet("../data/features/DATA_1/ETH_EUR.parquet")

## 1. Core Recall Calculation Functions

These functions form the foundation of our performance analysis framework.

In [6]:
def recall(predictions: pd.Series, targets: pd.Series, signal: int) -> float:
    """
    Calculate recall performance metric for a specific signal.
    
    Recall = True Positives / (True Positives + False Negatives)
    
    Args:
        predictions (pd.Series): Predicted values (-1, 0, 1)
        targets (pd.Series): True target values (-1, 0, 1)
        signal (int): The signal value to calculate recall for
        
    Returns:
        float: Recall in [0.0, 1.0]; 0.0 if `signal` never occurs in `targets`
    """
    true_positives = np.sum((predictions == signal) & (targets == signal))
    all_positives = np.sum(targets == signal)
    
    if all_positives == 0:
        return 0.0
    
    return true_positives / all_positives

# Test the recall function
print("✅ Recall calculation function defined")
print("📝 Formula: Recall = TP / (TP + FN)")

✅ Recall calculation function defined
📝 Formula: Recall = TP / (TP + FN)
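
As a quick sanity check, the formula can be applied to a small hand-made example. The sketch below is illustrative only: the toy `toy_predictions` and `toy_targets` series and the helper name `recall_by_signal` are assumptions, not part of the framework above.

```python
import pandas as pd

def recall_by_signal(predictions: pd.Series, targets: pd.Series) -> dict:
    """Hypothetical helper: recall = TP / (TP + FN) for each signal in {-1, 0, 1}."""
    out = {}
    for signal in (-1, 0, 1):
        positives = int((targets == signal).sum())
        true_positives = int(((predictions == signal) & (targets == signal)).sum())
        # Same convention as recall(): 0.0 when the signal never occurs
        out[signal] = true_positives / positives if positives else 0.0
    return out

# Toy data (made up for illustration)
toy_targets = pd.Series([1, 1, 0, -1, 0, 1, -1, 0])
toy_predictions = pd.Series([1, 0, 0, -1, 1, 1, 0, 0])

toy_recalls = recall_by_signal(toy_predictions, toy_targets)
print(toy_recalls)
```

In this toy data the Buy signal occurs three times and is predicted correctly twice, so its recall is 2/3; the Sell signal occurs twice and is caught once, so its recall is 1/2.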


## 2. Random Baseline Confidence Intervals

Calculate confidence intervals for recall under random baseline prediction using the Wilson method.

In [13]:
def recall_interval_random_baseline(targets: pd.Series, signal: int, confidence: float = 0.95) -> tuple:
    """
    Calculate the confidence interval for recall under random baseline prediction.
    
    For a random predictor that follows the target distribution, the expected recall 
    for each signal equals the proportion of that signal in the targets.
    
    Args:
        targets (pd.Series): True target values
        signal (int): The signal value (-1, 0, or 1) to calculate recall for
        confidence (float): Confidence level for the interval (default: 0.95)
        
    Returns:
        tuple: (lower_bound, upper_bound) of the confidence interval
    """
    total_samples = len(targets)
    actual_positives = np.sum(targets == signal)
    
    if actual_positives == 0:
        return 0.0, 0.0
    
    # Under random prediction matching target proportions:
    # P(predict signal) = proportion of signal in targets
    # Expected recall = P(predict signal | true signal) = P(predict signal) = proportion
    signal_proportion = actual_positives / total_samples
    
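    # Expected true positives under random prediction. This is usually not an
    # integer, which is fine here: the Wilson formula only depends on
    # count / nobs and nobs.
    expected_tp = actual_positives * signal_proportion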
    
    # Use Wilson confidence interval (more robust than normal approximation)
    lower, upper = proportion_confint(
        count=expected_tp,
        nobs=actual_positives,
        alpha=1 - confidence,
        method='wilson'
    )
    
    return lower, upper

target = target_feature["avg-10ms-of-mid-price-itincreases-after-200ms-with-threshold-5"]
interval = {f"signal{i}": recall_interval_random_baseline(target, signal=i) for i in [-1, 0, 1]}
print(interval)

{'signal-1': (0.0, 0.0), 'signal0': (0.9237211439928714, 0.924341508382823), 'signal1': (0.07489335434610084, 0.0770569690326472)}
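
To make the interval less of a black box, the Wilson score interval can also be written out by hand. The sketch below uses only the standard library; the helper name `wilson_interval` and the counts are assumptions chosen for illustration, and the result should agree with `proportion_confint(..., method='wilson')` up to floating-point error.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(p_hat: float, n: int, confidence: float = 0.95) -> tuple:
    """Wilson score interval for a proportion p_hat observed over n trials."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical counts: 2,000 true Buy samples out of 20,000 rows
n_total, n_buy = 20_000, 2_000
p_buy = n_buy / n_total  # expected recall of the random baseline
lower, upper = wilson_interval(p_buy, n_buy)
print(f"expected recall={p_buy:.3f}, CI=[{lower:.4f}, {upper:.4f}]")
```

Note that `n` is the number of true samples of the signal, not the total number of rows. This is why a frequent signal such as Hold gets a very narrow interval in the output above.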


## 3. Comprehensive Baseline Analysis

Compute confidence intervals for all signals with detailed statistics.

In [None]:
from typing import Dict
import pandas as pd
from statsmodels.stats.proportion import proportion_confint

def compute_all_recall_intervals_random_baseline_from_recalls(recall_values: Dict[int, float], 
                                                                  signal_counts: Dict[int, int],
                                                                  confidence: float = 0.95, 
                                                                  verbose: bool = False) -> dict:
    """
    Compute confidence intervals for recall of all signals under random baseline,
    working directly with recall values instead of predictions/targets.
    
    Args:
        recall_values (Dict[int, float]): Dictionary mapping signal -> actual recall value
        signal_counts (Dict[int, int]): Dictionary mapping signal -> count of occurrences
        confidence (float): Confidence level for the interval
        verbose (bool): Whether to print detailed results
        
    Returns:
        dict: Dictionary containing recall intervals for each signal and summary statistics
    """
    results = {}
    total_samples = sum(signal_counts.values())
    
    if verbose:
        print("🎯 RANDOM BASELINE RECALL CONFIDENCE INTERVALS (FROM RECALL VALUES)")
        print(f"{'='*70}")
        print(f"Total samples: {total_samples:,}")
        print(f"Confidence level: {confidence*100:.1f}%")
        print(f"{'='*70}")
    
    # Calculate for each signal
    for signal in sorted(signal_counts.keys()):
        count = signal_counts[signal]
        proportion = count / total_samples
        
        # Expected recall under random prediction = signal proportion
        expected_recall = proportion
        
        # Confidence interval using Wilson method
        # Expected true positives under random prediction
        expected_tp = count * proportion
        
        # Use Wilson confidence interval
        lower, upper = proportion_confint(
            count=expected_tp,
            nobs=count,
            alpha=1 - confidence,
            method='wilson'
        )
        
        results[signal] = {
            'count': count,
            'proportion': proportion,
            'expected_recall': expected_recall,
            'ci_lower': lower,
            'ci_upper': upper,
            'ci_width': upper - lower,
            'actual_recall': recall_values.get(signal, 0.0)  # Store actual recall for comparison
        }
        
        if verbose:
            actual_recall = recall_values.get(signal, 0.0)
            print(f"Signal {signal:2d}: count={count:6,} ({proportion:.3f}) | "
                  f"Actual recall={actual_recall:.3f} | "
                  f"Expected recall={expected_recall:.3f} | "
                  f"CI=[{lower:.3f}, {upper:.3f}] | width={upper-lower:.3f}")
    
    if verbose:
        print(f"{'='*70}")
    
    # Add summary statistics
    results['summary'] = {
        'total_samples': total_samples,
        'unique_signals': len(signal_counts),
        'most_frequent_signal': max(signal_counts.keys(), key=signal_counts.get),
        'most_frequent_proportion': max(signal_counts.values()) / total_samples,
        'least_frequent_signal': min(signal_counts.keys(), key=signal_counts.get),
        'least_frequent_proportion': min(signal_counts.values()) / total_samples,
    }
    
    return results

# Keep original function for backward compatibility
def compute_all_recall_intervals_random_baseline(targets: pd.Series, confidence: float = 0.95, verbose: bool = False) -> dict:
    """
    Original function: Compute confidence intervals for recall of all signals under random baseline.
    
    Args:
        targets (pd.Series): True target values containing -1, 0, 1
        confidence (float): Confidence level for the interval
        verbose (bool): Whether to print detailed results
        
    Returns:
        dict: Dictionary containing recall intervals for each signal and summary statistics
    """
    results = {}
    signal_counts = targets.value_counts()
    total_samples = len(targets)
    
    if verbose:
        print("🎯 RANDOM BASELINE RECALL CONFIDENCE INTERVALS")
        print(f"{'='*60}")
        print(f"Total samples: {total_samples:,}")
        print(f"Confidence level: {confidence*100:.1f}%")
        print(f"{'='*60}")
    
    # Calculate for each signal present in targets
    for signal in sorted(signal_counts.index):
        count = signal_counts[signal]
        proportion = count / total_samples
        
        # Expected recall under random prediction = signal proportion
        expected_recall = proportion
        
        # Confidence interval
        lower, upper = recall_interval_random_baseline(targets, signal, confidence)
        
        results[signal] = {
            'count': count,
            'proportion': proportion,
            'expected_recall': expected_recall,
            'ci_lower': lower,
            'ci_upper': upper,
            'ci_width': upper - lower
        }
        
        if verbose:
            print(f"Signal {signal:2d}: count={count:6,} ({proportion:.3f}) | "
                  f"Expected recall={expected_recall:.3f} | "
                  f"CI=[{lower:.3f}, {upper:.3f}] | width={upper-lower:.3f}")
    
    if verbose:
        print(f"{'='*60}")
    
    # Add summary statistics
    results['summary'] = {
        'total_samples': total_samples,
        'unique_signals': len(signal_counts),
        'most_frequent_signal': signal_counts.idxmax(),
        'most_frequent_proportion': signal_counts.max() / total_samples,
        'least_frequent_signal': signal_counts.idxmin(),
        'least_frequent_proportion': signal_counts.min() / total_samples,
    }
    
    return results

print("✅ Enhanced baseline analysis functions defined")
print("📊 Now supports both direct recall input and traditional prediction/target input")
print("🎯 Use compute_all_recall_intervals_random_baseline_from_recalls() for direct recall analysis")
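
The `_from_recalls` variant expects two dictionaries keyed by signal. One way to build them from raw predictions and targets is sketched below; the toy series are made up for illustration and the variable names simply mirror the function's parameters.

```python
import pandas as pd

# Toy data (made up for illustration)
toy_targets = pd.Series([0, 0, 1, 0, -1, 0, 1, 0, 0, -1])
toy_predictions = pd.Series([0, 1, 1, 0, 0, 0, 0, 0, -1, -1])

# signal -> number of occurrences in the targets, as plain Python ints
signal_counts = {int(s): int(c) for s, c in toy_targets.value_counts().items()}

# signal -> recall, i.e. the share of that signal's rows predicted correctly
recall_values = {
    s: float((toy_predictions[toy_targets == s] == s).mean())
    for s in signal_counts
}
print(signal_counts)
print(recall_values)
```

These dictionaries can then be passed as `compute_all_recall_intervals_random_baseline_from_recalls(recall_values, signal_counts, verbose=True)`.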

## 4. Statistical Significance Testing

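The functions below test whether each signal's recall is significantly better than the random baseline. A signal counts as significant when its actual recall exceeds the upper bound of the baseline confidence interval.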

In [None]:
def prediction_recall_significance_from_recalls(recall_values: Dict[int, float], 
                                                signal_counts: Dict[int, int],
                                                confidence: float = 0.95) -> dict:
    """
    🎯 MAIN SIGNIFICANCE TESTING FUNCTION (FROM RECALL VALUES)
    
    Check if recall values are significantly better than random baseline for all signals.
    
    Args:
        recall_values (Dict[int, float]): Dictionary mapping signal -> actual recall value
        signal_counts (Dict[int, int]): Dictionary mapping signal -> count of occurrences  
        confidence (float): Confidence level for the significance test
        
    Returns:
        dict: Significance test results for each signal including:
              - significant: bool (True if significantly better than random)
              - recall: float (actual recall)
              - ci_lower/ci_upper: float (random baseline confidence interval)
              - expected_random: float (expected recall under random prediction)
    """
    # Get random baseline confidence intervals using the new function
    ci_results = compute_all_recall_intervals_random_baseline_from_recalls(
        recall_values, signal_counts, confidence
    )
    
    # Perform significance test: actual recall > upper bound of random CI
    return {signal: {
        "significant": recall_values[signal] > ci_results[signal]["ci_upper"],
        'recall': recall_values[signal],
        'ci_lower': ci_results[signal]["ci_lower"],
        'ci_upper': ci_results[signal]["ci_upper"],
        'expected_random': ci_results[signal]["expected_recall"]
    } for signal in recall_values.keys() if signal in ci_results}

# Keep original function for backward compatibility
def prediction_recall_significance(predictions: pd.Series, targets: pd.Series, confidence: float = 0.95) -> dict:
    """
    🎯 MAIN SIGNIFICANCE TESTING FUNCTION (ORIGINAL)
    
    Check if predictions recall is significantly better than random baseline for all signals.
    
    Args:
        predictions (pd.Series): Predicted values (-1, 0, 1)
        targets (pd.Series): True target values (-1, 0, 1)
        confidence (float): Confidence level for the significance test
        
    Returns:
        dict: Significance test results for each signal including:
              - significant: bool (True if significantly better than random)
              - recall: float (actual recall)
              - ci_lower/ci_upper: float (random baseline confidence interval)
              - expected_random: float (expected recall under random prediction)
    """
    # Calculate actual recall for each signal
    recall_values = {signal: recall(predictions, targets, signal) for signal in [-1, 0, 1]}
    
    # Get random baseline confidence intervals
    ci_results = compute_all_recall_intervals_random_baseline(targets, confidence)
    
    # Perform significance test: actual recall > upper bound of random CI
    return {signal: {
        "significant": recall_values[signal] > ci_results[signal]["ci_upper"],
        'recall': recall_values[signal],
        'ci_lower': ci_results[signal]["ci_lower"],
        'ci_upper': ci_results[signal]["ci_upper"],
        'expected_random': ci_results[signal]["expected_recall"]
    } for signal in [-1, 0, 1] if signal in ci_results}

print("🎯 ENHANCED SIGNIFICANCE TESTING FUNCTIONS DEFINED")
print("✅ Use prediction_recall_significance_from_recalls() for direct recall analysis!")
print("📊 Returns complete significance analysis for all signals")
print("🔄 Original function preserved for backward compatibility")
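
The rule above flags a signal when its recall exceeds the upper end of a two-sided interval, which behaves roughly like a one-sided test at half the nominal error rate (about 2.5% for `confidence=0.95`). One way to cross-check a borderline result is an exact binomial tail probability. This is a different technique from the interval comparison used in this notebook, and both the helper name `binomial_upper_tail` and the counts below are hypothetical.

```python
from math import comb

def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_pos = 200          # hypothetical number of true Buy samples
p_random = 0.10      # hypothetical share of Buy in the targets
true_positives = 32  # hypothetical count, i.e. a recall of 0.16

# Probability that a random baseline reaches at least this many true positives
p_value = binomial_upper_tail(true_positives, n_pos, p_random)
print(f"recall={true_positives / n_pos:.3f}, one-sided p-value={p_value:.5f}")
```

The exact sum is only practical for moderate `n_pos`; for signals with many thousands of samples the interval comparison above is the cheaper check.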

## 5. Visualization Functions

### 5.1 Random Baseline Confidence Intervals Visualization

In [None]:
def plot_recall_confidence_intervals(targets: pd.Series, confidence: float = 0.95, 
                                   figsize: Tuple[int, int] = (12, 8), 
                                   title_prefix: str = "") -> plt.Figure:
    """
    Visualize recall confidence intervals for random baseline prediction.
    
    Args:
        targets (pd.Series): True target values containing -1, 0, 1
        confidence (float): Confidence level for the interval
        figsize (tuple): Figure size (width, height)
        title_prefix (str): Prefix for the plot title
        
    Returns:
        plt.Figure: The matplotlib figure object
    """
    results = compute_all_recall_intervals_random_baseline(targets, confidence)
    
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=figsize)
    fig.suptitle(f'{title_prefix}Random Baseline Recall Analysis (CI: {confidence*100:.1f}%)', 
                 fontsize=16, fontweight='bold')
    
    # Prepare data for plotting
    signals = [-1, 0, 1]
    signal_names = ['Sell (-1)', 'Hold (0)', 'Buy (1)']
    colors = ['red', 'gray', 'green']
    
    expected_recalls = [results[signal]['expected_recall'] for signal in signals if signal in results]
    ci_lowers = [results[signal]['ci_lower'] for signal in signals if signal in results]
    ci_uppers = [results[signal]['ci_upper'] for signal in signals if signal in results]
    counts = [results[signal]['count'] for signal in signals if signal in results]
    present_signals = [signal for signal in signals if signal in results]
    present_names = [signal_names[i] for i, signal in enumerate(signals) if signal in results]
    present_colors = [colors[i] for i, signal in enumerate(signals) if signal in results]
    
    # Plot 1: Expected Recalls with Confidence Intervals
    ax1.bar(present_names, expected_recalls, color=present_colors, alpha=0.7, edgecolor='black')
    ax1.errorbar(present_names, expected_recalls, 
                yerr=[np.array(expected_recalls) - np.array(ci_lowers),
                      np.array(ci_uppers) - np.array(expected_recalls)],
                fmt='none', color='black', capsize=5, capthick=2)
    ax1.set_title('Expected Recall with Confidence Intervals', fontweight='bold')
    ax1.set_ylabel('Recall')
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(0, max(ci_uppers) * 1.1 if ci_uppers else 1)
    
    # Add value labels on bars (loop variable named to avoid shadowing recall())
    for i, (expected, ci_lower, ci_upper) in enumerate(zip(expected_recalls, ci_lowers, ci_uppers)):
        ax1.text(i, expected + 0.01, f'{expected:.3f}\n[{ci_lower:.3f}, {ci_upper:.3f}]', 
                ha='center', va='bottom', fontsize=10, fontweight='bold')
    
    # Plot 2: Signal Distribution
    ax2.pie(counts, labels=[f'{name}\n({count:,})' for name, count in zip(present_names, counts)], 
            colors=present_colors, autopct='%1.2f%%', startangle=90)
    ax2.set_title('Signal Distribution', fontweight='bold')
    
    # Plot 3: Confidence Interval Widths
    ci_widths = [upper - lower for lower, upper in zip(ci_lowers, ci_uppers)]
    bars = ax3.bar(present_names, ci_widths, color=present_colors, alpha=0.7, edgecolor='black')
    ax3.set_title('Confidence Interval Widths', fontweight='bold')
    ax3.set_ylabel('CI Width')
    ax3.grid(True, alpha=0.3)
    
    # Add value labels
    for bar, width in zip(bars, ci_widths):
        height = bar.get_height()
        ax3.text(bar.get_x() + bar.get_width()/2., height + 0.001,
                f'{width:.4f}', ha='center', va='bottom', fontweight='bold')
    
    # Plot 4: Summary Statistics
    ax4.axis('off')
    summary_text = f"""
    📊 SUMMARY STATISTICS
    {'='*30}
    Total Samples: {results['summary']['total_samples']:,}
    Unique Signals: {results['summary']['unique_signals']}
    
    Most Frequent Signal: {results['summary']['most_frequent_signal']} 
    ({results['summary']['most_frequent_proportion']:.3f})
    
    Least Frequent Signal: {results['summary']['least_frequent_signal']} 
    ({results['summary']['least_frequent_proportion']:.3f})
    
    Confidence Level: {confidence*100:.1f}%
    
    🎯 Under random baseline:
    Expected Recall = Signal Proportion
    """
    ax4.text(0.05, 0.95, summary_text, transform=ax4.transAxes, fontsize=11,
             verticalalignment='top', fontfamily='monospace',
             bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))
    
    plt.tight_layout()
    return fig

print("📊 Random baseline visualization function defined")
print("✅ Creates 4-panel plot showing confidence intervals and statistics")

### 5.2 Prediction Performance Visualization

In [None]:
def plot_prediction_performance_from_recalls(recall_values: Dict[int, float], 
                                           signal_counts: Dict[int, int],
                                           confidence: float = 0.95, 
                                           figsize: Tuple[int, int] = (15, 10),
                                           title_prefix: str = "") -> plt.Figure:
    """
    🎯 COMPREHENSIVE PERFORMANCE VISUALIZATION (FROM RECALL VALUES)
    
    Visualize recall performance against random baseline with significance testing,
    working directly with recall values instead of predictions/targets.
    
    Args:
        recall_values (Dict[int, float]): Dictionary mapping signal -> actual recall value
        signal_counts (Dict[int, int]): Dictionary mapping signal -> count of occurrences
        confidence (float): Confidence level for significance testing
        figsize (tuple): Figure size (width, height)
        title_prefix (str): Prefix for the plot title
        
    Returns:
        plt.Figure: The matplotlib figure object with 4 subplots
    """
    # Get significance results (these already include the baseline intervals)
    significance_results = prediction_recall_significance_from_recalls(
        recall_values, signal_counts, confidence
    )
    
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=figsize)
    fig.suptitle(f'{title_prefix}Recall Performance vs Random Baseline', 
                 fontsize=16, fontweight='bold')
    
    # Prepare data
    signals = [-1, 0, 1]
    signal_names = ['Sell (-1)', 'Hold (0)', 'Buy (1)']
    colors = ['red', 'gray', 'green']
    
    # Filter for present signals
    present_signals = [s for s in signals if s in significance_results]
    present_names = [signal_names[signals.index(s)] for s in present_signals]
    present_colors = [colors[signals.index(s)] for s in present_signals]
    
    actual_recalls = [significance_results[signal]['recall'] for signal in present_signals]
    expected_recalls = [significance_results[signal]['expected_random'] for signal in present_signals]
    ci_lowers = [significance_results[signal]['ci_lower'] for signal in present_signals]
    ci_uppers = [significance_results[signal]['ci_upper'] for signal in present_signals]
    is_significant = [significance_results[signal]['significant'] for signal in present_signals]
    
    # Plot 1: Actual vs Expected Recall Comparison
    x_pos = np.arange(len(present_signals))
    width = 0.35
    
    bars1 = ax1.bar(x_pos - width/2, actual_recalls, width, label='Actual Recall', 
                    color=present_colors, alpha=0.8, edgecolor='black')
    bars2 = ax1.bar(x_pos + width/2, expected_recalls, width, label='Expected (Random)', 
                    color=present_colors, alpha=0.4, edgecolor='black', hatch='///')
    
    # Add confidence intervals for expected recalls
    ax1.errorbar(x_pos + width/2, expected_recalls, 
                yerr=[np.array(expected_recalls) - np.array(ci_lowers),
                      np.array(ci_uppers) - np.array(expected_recalls)],
                fmt='none', color='black', capsize=3, capthick=1)
    
    ax1.set_title('Actual vs Expected Recall', fontweight='bold')
    ax1.set_ylabel('Recall')
    ax1.set_xticks(x_pos)
    ax1.set_xticklabels(present_names)
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Add significance markers
    for i, (actual, significant) in enumerate(zip(actual_recalls, is_significant)):
        marker = '★' if significant else '○'
        color = 'gold' if significant else 'lightgray'
        ax1.text(i - width/2, actual + 0.02, marker, ha='center', va='bottom', 
                fontsize=16, color=color, fontweight='bold')
    
    # Add value labels
    for i, (actual, expected) in enumerate(zip(actual_recalls, expected_recalls)):
        ax1.text(i - width/2, actual + 0.01, f'{actual:.3f}', ha='center', va='bottom', 
                fontsize=9, fontweight='bold')
        ax1.text(i + width/2, expected + 0.01, f'{expected:.3f}', ha='center', va='bottom', 
                fontsize=9, fontweight='bold')
    
    # Plot 2: Performance Improvement over Random
    improvements = [actual - expected for actual, expected in zip(actual_recalls, expected_recalls)]
    bar_colors = ['darkgreen' if imp > 0 else 'darkred' for imp in improvements]
    
    bars = ax2.bar(present_names, improvements, color=bar_colors, alpha=0.7, edgecolor='black')
    ax2.axhline(y=0, color='black', linestyle='-', alpha=0.5)
    ax2.set_title('Recall Improvement over Random Baseline', fontweight='bold')
    ax2.set_ylabel('Recall Difference')
    ax2.grid(True, alpha=0.3)
    
    # Add value labels
    for bar, imp, significant in zip(bars, improvements, is_significant):
        height = bar.get_height()
        va = 'bottom' if height >= 0 else 'top'
        y_pos = height + (0.005 if height >= 0 else -0.005)
        significance_marker = ' ★' if significant else ''
        ax2.text(bar.get_x() + bar.get_width()/2., y_pos,
                f'{imp:+.3f}{significance_marker}', ha='center', va=va, 
                fontweight='bold', fontsize=10)
    
    # Plot 3: Statistical Significance Summary
    ax3.axis('off')
    
    # Create significance summary
    total_signals = len(present_signals)
    significant_count = sum(is_significant)
    
    significance_text = f"""
    🎯 STATISTICAL SIGNIFICANCE ANALYSIS
    {'='*40}
    Confidence Level: {confidence*100:.1f}%
    
    Significant Signals: {significant_count}/{total_signals}
    
    """
    
    for signal, name, significant, actual, expected, ci_lower, ci_upper in zip(
        present_signals, present_names, is_significant, actual_recalls, expected_recalls, ci_lowers, ci_uppers):
        
        status = "✅ SIGNIFICANT" if significant else "❌ Not Significant"
        significance_text += f"""
    {name}:
    Actual Recall: {actual:.4f}
    Expected: {expected:.4f} [{ci_lower:.4f}, {ci_upper:.4f}]
    Status: {status}
    """
    
    ax3.text(0.05, 0.95, significance_text, transform=ax3.transAxes, fontsize=10,
             verticalalignment='top', fontfamily='monospace',
             bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
    
    # Plot 4: Signal Distribution and Recall Summary
    present_counts = [signal_counts[signal] for signal in present_signals]
    
    # Create pie chart for signal distribution
    wedges, texts, autotexts = ax4.pie(present_counts, 
                                      labels=[f'{name}\n({count:,})' for name, count in zip(present_names, present_counts)], 
                                      colors=present_colors, 
                                      autopct='%1.2f%%', 
                                      startangle=90)
    ax4.set_title('Signal Distribution with Recall Values', fontweight='bold')
    
    # Add recall values as text around the pie
    for i, (signal, recall_val) in enumerate(zip(present_signals, actual_recalls)):
        ax4.text(0.1, 0.9 - i*0.15, f'Signal {signal}: {recall_val:.3f}', 
                transform=ax4.transAxes, fontsize=10, fontweight='bold',
                bbox=dict(boxstyle='round', facecolor=present_colors[i], alpha=0.3))
    
    plt.tight_layout()
    return fig

# Keep original function for backward compatibility
def plot_prediction_performance(predictions: pd.Series, targets: pd.Series, 
                              confidence: float = 0.95, 
                              figsize: Tuple[int, int] = (15, 10),
                              title_prefix: str = "") -> plt.Figure:
    """
    🎯 COMPREHENSIVE PERFORMANCE VISUALIZATION (ORIGINAL)
    
    Visualize prediction performance against random baseline with significance testing.
    
    Args:
        predictions (pd.Series): Predicted values
        targets (pd.Series): True target values
        confidence (float): Confidence level for significance testing
        figsize (tuple): Figure size (width, height)
        title_prefix (str): Prefix for the plot title
        
    Returns:
        plt.Figure: The matplotlib figure object with 4 subplots
    """
    # Get significance results (these already include the baseline intervals)
    significance_results = prediction_recall_significance(predictions, targets, confidence)
    
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=figsize)
    fig.suptitle(f'{title_prefix}Prediction Performance vs Random Baseline', 
                 fontsize=16, fontweight='bold')
    
    # Prepare data
    signals = [-1, 0, 1]
    signal_names = ['Sell (-1)', 'Hold (0)', 'Buy (1)']
    colors = ['red', 'gray', 'green']
    
    # Filter for present signals
    present_signals = [s for s in signals if s in significance_results]
    present_names = [signal_names[signals.index(s)] for s in present_signals]
    present_colors = [colors[signals.index(s)] for s in present_signals]
    
    actual_recalls = [significance_results[signal]['recall'] for signal in present_signals]
    expected_recalls = [significance_results[signal]['expected_random'] for signal in present_signals]
    ci_lowers = [significance_results[signal]['ci_lower'] for signal in present_signals]
    ci_uppers = [significance_results[signal]['ci_upper'] for signal in present_signals]
    is_significant = [significance_results[signal]['significant'] for signal in present_signals]
    
    # Plot 1: Actual vs Expected Recall Comparison
    x_pos = np.arange(len(present_signals))
    width = 0.35
    
    bars1 = ax1.bar(x_pos - width/2, actual_recalls, width, label='Actual Recall', 
                    color=present_colors, alpha=0.8, edgecolor='black')
    bars2 = ax1.bar(x_pos + width/2, expected_recalls, width, label='Expected (Random)', 
                    color=present_colors, alpha=0.4, edgecolor='black', hatch='///')
    
    # Add confidence intervals for expected recalls
    ax1.errorbar(x_pos + width/2, expected_recalls, 
                yerr=[np.array(expected_recalls) - np.array(ci_lowers),
                      np.array(ci_uppers) - np.array(expected_recalls)],
                fmt='none', color='black', capsize=3, capthick=1)
    
    ax1.set_title('Actual vs Expected Recall', fontweight='bold')
    ax1.set_ylabel('Recall')
    ax1.set_xticks(x_pos)
    ax1.set_xticklabels(present_names)
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Add significance markers
    for i, (actual, significant) in enumerate(zip(actual_recalls, is_significant)):
        marker = '★' if significant else '○'
        color = 'gold' if significant else 'lightgray'
        ax1.text(i - width/2, actual + 0.02, marker, ha='center', va='bottom', 
                fontsize=16, color=color, fontweight='bold')
    
    # Add value labels
    for i, (actual, expected) in enumerate(zip(actual_recalls, expected_recalls)):
        ax1.text(i - width/2, actual + 0.01, f'{actual:.3f}', ha='center', va='bottom', 
                fontsize=9, fontweight='bold')
        ax1.text(i + width/2, expected + 0.01, f'{expected:.3f}', ha='center', va='bottom', 
                fontsize=9, fontweight='bold')
    
    # Plot 2: Performance Improvement over Random
    improvements = [actual - expected for actual, expected in zip(actual_recalls, expected_recalls)]
    bar_colors = ['darkgreen' if imp > 0 else 'darkred' for imp in improvements]
    
    bars = ax2.bar(present_names, improvements, color=bar_colors, alpha=0.7, edgecolor='black')
    ax2.axhline(y=0, color='black', linestyle='-', alpha=0.5)
    ax2.set_title('Recall Improvement over Random Baseline', fontweight='bold')
    ax2.set_ylabel('Recall Difference')
    ax2.grid(True, alpha=0.3)
    
    # Add value labels
    for bar, imp, significant in zip(bars, improvements, is_significant):
        height = bar.get_height()
        va = 'bottom' if height >= 0 else 'top'
        y_pos = height + (0.005 if height >= 0 else -0.005)
        significance_marker = ' ★' if significant else ''
        ax2.text(bar.get_x() + bar.get_width()/2., y_pos,
                f'{imp:+.3f}{significance_marker}', ha='center', va=va, 
                fontweight='bold', fontsize=10)
    
    # Plot 3: Statistical Significance Summary
    ax3.axis('off')
    
    # Create significance summary
    total_signals = len(present_signals)
    significant_count = sum(is_significant)
    
    significance_text = f"""
    🎯 STATISTICAL SIGNIFICANCE ANALYSIS
    {'='*40}
    Confidence Level: {confidence*100:.1f}%
    
    Significant Signals: {significant_count}/{total_signals}
    
    """
    
    for signal, name, significant, actual, expected, ci_lower, ci_upper in zip(
        present_signals, present_names, is_significant, actual_recalls, expected_recalls, ci_lowers, ci_uppers):
        
        status = "✅ SIGNIFICANT" if significant else "❌ Not Significant"
        significance_text += f"""
    {name}:
    Actual Recall: {actual:.4f}
    Expected: {expected:.4f} [{ci_lower:.4f}, {ci_upper:.4f}]
    Status: {status}
    """
    
    ax3.text(0.05, 0.95, significance_text, transform=ax3.transAxes, fontsize=10,
             verticalalignment='top', fontfamily='monospace',
             bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
    
    # Plot 4: Prediction vs Target Distribution
    pred_counts = predictions.value_counts().reindex(present_signals, fill_value=0)
    target_counts = targets.value_counts().reindex(present_signals, fill_value=0)
    
    # Create comparison DataFrame
    comparison_data = pd.DataFrame({
        'Predictions': pred_counts.values,
        'Targets': target_counts.values
    }, index=present_names)
    
    comparison_data.plot(kind='bar', ax=ax4, color=['lightblue', 'orange'], alpha=0.7)
    ax4.set_title('Prediction vs Target Distribution', fontweight='bold')
    ax4.set_ylabel('Count')
    ax4.tick_params(axis='x', rotation=45)
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    return fig

print("🎯 ENHANCED PERFORMANCE VISUALIZATION FUNCTIONS DEFINED")
print("📊 Use plot_prediction_performance_from_recalls() for direct recall analysis")
print("✅ Creates 4-panel analysis: actual vs expected, improvement, significance, distribution")
print("⭐ Shows significance with star markers")
print("🔄 Original function preserved for backward compatibility")
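
Plot 4 relies on `value_counts().reindex(..., fill_value=0)` so that a signal the model never emits still gets a bar of height zero instead of being dropped. A small sketch of that behaviour, using made-up labels (the two series below are illustrative and not taken from the notebook's data):

```python
import pandas as pd

# Illustrative labels only: this "model" never predicts Sell (-1).
predictions = pd.Series([0, 1, 1, 0, 1, 0])
targets = pd.Series([-1, 0, 1, -1, 1, 0])

present_signals = [-1, 0, 1]

# value_counts() omits labels that never occur; reindex() restores them
# with an explicit count of 0 so both columns line up row by row.
pred_counts = predictions.value_counts().reindex(present_signals, fill_value=0)
target_counts = targets.value_counts().reindex(present_signals, fill_value=0)

comparison = pd.DataFrame(
    {"Predictions": pred_counts.values, "Targets": target_counts.values},
    index=["Sell (-1)", "Hold (0)", "Buy (1)"],
)
print(comparison)
```

Without the `reindex` step the two count vectors could have different lengths, and building the comparison frame from `.values` would fail or misalign.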

## 6. Theoretical Validation Functions

Validate our analytical confidence intervals using Monte Carlo simulation.

In [None]:
def theoretical_recall_distribution(targets: pd.Series, signal: int, n_simulations: int = 10000) -> np.ndarray:
    """
    Simulate the theoretical distribution of recall under random baseline prediction.
    
    This function validates our analytical confidence intervals by simulation.
    
    Args:
        targets (pd.Series): True target values
        signal (int): The signal value to calculate recall distribution for
        n_simulations (int): Number of simulations to run
        
    Returns:
        np.ndarray: Array of recall values from simulations
    """
    total_samples = len(targets)
    actual_positives = np.sum(targets == signal)
    
    if actual_positives == 0:
        # No positives for this signal: recall is reported as 0.0 for every run
        return np.zeros(n_simulations)
    
    # Get unique values and their proportions
    unique_values = targets.unique()
    proportions = [np.sum(targets == val) / total_samples for val in unique_values]
    
    # The positive mask does not change between runs, so compute it once
    is_positive = (targets == signal).to_numpy()
    recall_values = np.empty(n_simulations)
    
    for i in range(n_simulations):
        # Generate random predictions following target distribution
        random_predictions = np.random.choice(
            unique_values, 
            size=total_samples, 
            p=proportions
        )
        
        # Calculate recall for this simulation
        true_positives = np.sum((random_predictions == signal) & is_positive)
        recall_values[i] = true_positives / actual_positives
    
    return recall_values

print("🔬 Theoretical validation function defined")
print("✅ Simulates random baseline to validate analytical confidence intervals")
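
Because the random predictions are drawn independently of the targets, only the rows where `targets == signal` affect recall, and each of those rows is matched with probability equal to the signal's share of the targets. The true-positive count is therefore binomial, which allows a vectorized shortcut. The sketch below is an alternative formulation, not the notebook's implementation; `simulate_recall_binomial` and its `seed` parameter are hypothetical names.

```python
import numpy as np
import pandas as pd


def simulate_recall_binomial(targets: pd.Series, signal: int,
                             n_simulations: int = 10000,
                             seed: int = 0) -> np.ndarray:
    """Draw recall values under the random baseline via a binomial shortcut."""
    n_pos = int((targets == signal).sum())
    if n_pos == 0:
        return np.zeros(n_simulations)
    p = n_pos / len(targets)  # P(random prediction == signal)
    rng = np.random.default_rng(seed)
    # One binomial draw per simulation replaces a full vector of predictions.
    true_positives = rng.binomial(n_pos, p, size=n_simulations)
    return true_positives / n_pos


# Illustrative targets: 30% Sell, 40% Hold, 30% Buy.
targets = pd.Series([-1] * 300 + [0] * 400 + [1] * 300)
recalls = simulate_recall_binomial(targets, signal=1, n_simulations=5000)
print(f"mean simulated recall: {recalls.mean():.3f} (class share: 0.300)")
```

This avoids generating `len(targets)` random labels per run, which matters when the target series is long.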

In [None]:
def plot_theoretical_validation(targets: pd.Series, confidence: float = 0.95, 
                              n_simulations: int = 10000,
                              figsize: Tuple[int, int] = (15, 5)) -> plt.Figure:
    """
    Validate analytical confidence intervals against theoretical simulation.
    
    Args:
        targets (pd.Series): True target values
        confidence (float): Confidence level
        n_simulations (int): Number of simulations for validation
        figsize (tuple): Figure size
        
    Returns:
        plt.Figure: The matplotlib figure object
    """
    fig, axes = plt.subplots(1, 3, figsize=figsize)
    fig.suptitle(f'Theoretical Validation: Analytical CI vs Simulation ({n_simulations:,} runs)', 
                 fontsize=14, fontweight='bold')
    
    signals = [-1, 0, 1]
    signal_names = ['Sell (-1)', 'Hold (0)', 'Buy (1)']
    colors = ['red', 'gray', 'green']
    
    baseline_results = compute_all_recall_intervals_random_baseline(targets, confidence)
    
    plot_idx = 0
    for signal, name, color in zip(signals, signal_names, colors):
        if signal not in targets.values or plot_idx >= len(axes):
            if plot_idx < len(axes):
                axes[plot_idx].text(0.5, 0.5, f'Signal {signal}\nnot present', 
                            ha='center', va='center', transform=axes[plot_idx].transAxes)
                axes[plot_idx].set_title(name)
                plot_idx += 1
            continue
            
        # Get analytical results
        expected_recall = baseline_results[signal]['expected_recall']
        ci_lower = baseline_results[signal]['ci_lower']
        ci_upper = baseline_results[signal]['ci_upper']
        
        # Run simulation
        simulated_recalls = theoretical_recall_distribution(targets, signal, n_simulations)
        
        # Plot histogram of simulated recalls
        axes[plot_idx].hist(simulated_recalls, bins=50, alpha=0.7, color=color, 
                    density=True, edgecolor='black', linewidth=0.5)
        
        # Add analytical expected value and CI
        axes[plot_idx].axvline(expected_recall, color='black', linestyle='--', linewidth=2, 
                       label=f'Analytical Expected: {expected_recall:.3f}')
        axes[plot_idx].axvline(ci_lower, color='orange', linestyle=':', linewidth=2, 
                       label=f'Analytical CI: [{ci_lower:.3f}, {ci_upper:.3f}]')
        axes[plot_idx].axvline(ci_upper, color='orange', linestyle=':', linewidth=2)
        
        # Add simulation statistics
        sim_mean = np.mean(simulated_recalls)
        
        axes[plot_idx].axvline(sim_mean, color='blue', linestyle='-', linewidth=2, 
                       label=f'Simulation Mean: {sim_mean:.3f}')
        
        axes[plot_idx].set_title(name, fontweight='bold')
        axes[plot_idx].set_xlabel('Recall')
        axes[plot_idx].set_ylabel('Density')
        axes[plot_idx].legend(fontsize=8)
        axes[plot_idx].grid(True, alpha=0.3)
        
        # Add validation text
        validation_text = f"""
        Analytical: {expected_recall:.4f}
        Simulation: {sim_mean:.4f}
        Difference: {abs(expected_recall - sim_mean):.4f}
        """
        axes[plot_idx].text(0.02, 0.98, validation_text, transform=axes[plot_idx].transAxes, 
                    verticalalignment='top', fontsize=8, fontfamily='monospace',
                    bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
        
        plot_idx += 1
    
    plt.tight_layout()
    return fig

print("🔬 Theoretical validation visualization defined")
print("✅ Compares analytical methods with Monte Carlo simulation")
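
The text box in each panel compares only the means. One way to check the interval bounds as well is to read an empirical interval off the simulated recalls and set it next to the analytical one. A sketch under stated assumptions: `empirical_interval` is a hypothetical helper, and the simulated values here are a synthetic stand-in rather than output from the notebook.

```python
import numpy as np


def empirical_interval(simulated_recalls: np.ndarray, confidence: float = 0.95):
    """Central percentile interval of a simulated recall distribution."""
    alpha = 1.0 - confidence
    lower, upper = np.percentile(
        simulated_recalls, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return float(lower), float(upper)


# Synthetic stand-in for the output of theoretical_recall_distribution():
# 3000 positives, each matched with probability 0.3.
rng = np.random.default_rng(0)
simulated = rng.binomial(3000, 0.3, size=5000) / 3000

low, high = empirical_interval(simulated, confidence=0.95)
print(f"empirical 95% interval: [{low:.4f}, {high:.4f}]")
```

If the analytical bounds sit far from these percentiles, that would suggest the analytical interval is built on a different sample size or a different null model than the simulation.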

## 7. Example Usage and Testing

Let's create some sample data to demonstrate the framework.

In [None]:
# Generate sample data for demonstration
np.random.seed(42)  # For reproducibility

# Create sample targets with realistic class imbalance
n_samples = 10000
targets_sample = np.random.choice([-1, 0, 1], size=n_samples, p=[0.3, 0.4, 0.3])
targets_sample = pd.Series(targets_sample, name='targets')

# Create sample predictions that are clearly better than random for demonstration:
# start from a perfect copy of the targets
predictions_sample = targets_sample.copy()
# Then overwrite 70% of the rows with random draws, leaving 30% guaranteed correct
noise_indices = np.random.choice(n_samples, size=int(0.7 * n_samples), replace=False)
predictions_sample.iloc[noise_indices] = np.random.choice([-1, 0, 1], size=len(noise_indices), p=[0.3, 0.4, 0.3])
predictions_sample.name = 'predictions'

print("📊 Sample Data Generated:")
print(f"Total samples: {len(targets_sample):,}")
print("Target distribution:")
print(targets_sample.value_counts().sort_index())
print("\nPrediction distribution:")
print(predictions_sample.value_counts().sort_index())
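
It helps to know what recall to expect from this construction. 30% of the predictions are exact copies of the target and the remaining 70% are redrawn from the class distribution, so a row whose true signal has class probability p is predicted correctly with probability 0.3 + 0.7 * p. A short worked calculation (variable names are illustrative):

```python
# Class probabilities used to draw both the targets and the redrawn predictions.
class_probs = {-1: 0.3, 0: 0.4, 1: 0.3}
noise_fraction = 0.7              # share of predictions replaced by random draws
kept_fraction = 1 - noise_fraction

expected = {}
for signal, p in class_probs.items():
    expected[signal] = kept_fraction + noise_fraction * p
    lift = expected[signal] - p   # p is also the recall of a pure random baseline
    print(f"signal {signal:+d}: expected recall {expected[signal]:.2f}, "
          f"lift over random {lift:+.2f}")
```

Recalls measured on `predictions_sample` should land near these values, up to sampling noise.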

### 7.1 Calculate Random Baseline Confidence Intervals

In [None]:
# Calculate baseline confidence intervals with verbose output
baseline_results = compute_all_recall_intervals_random_baseline(targets_sample, confidence=0.95, verbose=True)
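
`compute_all_recall_intervals_random_baseline` is defined earlier in the notebook. For intuition, the Wilson score interval it is described as using can be written out by hand. The sketch below assumes that the baseline recall for a signal equals its class share and that the number of trials is the count of rows carrying that signal; both are assumptions made for illustration, and `wilson_interval` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def wilson_interval(p_hat: float, n: int, confidence: float = 0.95):
    """Wilson score interval for a proportion p_hat observed over n trials."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Illustrative numbers: a signal that makes up 30% of 10,000 rows.
n_pos, total = 3000, 10000
low, high = wilson_interval(n_pos / total, n_pos)
print(f"random-baseline recall ~ {n_pos / total:.3f}, 95% CI [{low:.4f}, {high:.4f}]")
```

`proportion_confint(count, nobs, alpha, method='wilson')` from statsmodels, imported at the top of the notebook, computes the same kind of interval.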

### 7.2 Perform Significance Testing

In [None]:
# 🎯 MAIN SIGNIFICANCE TESTING
significance_results = prediction_recall_significance(predictions_sample, targets_sample, confidence=0.95)

print("🎯 RECALL SIGNIFICANCE ANALYSIS RESULTS")
print("=" * 60)

signal_names = {-1: "Sell", 0: "Hold", 1: "Buy"}

for signal in [-1, 0, 1]:
    if signal in significance_results:
        result = significance_results[signal]
        signal_name = signal_names[signal]
        
        print(f"\n{signal_name} Signal ({signal}):")
        print(f"  Actual Recall: {result['recall']:.4f}")
        print(f"  Random Expected: {result['expected_random']:.4f}")
        print(f"  Random CI: [{result['ci_lower']:.4f}, {result['ci_upper']:.4f}]")
        print(f"  Significant: {'✅ YES' if result['significant'] else '❌ NO'}")
        print(f"  Improvement: {result['recall'] - result['expected_random']:+.4f}")

# Summary
total_signals = len(significance_results)
significant_signals = sum(result['significant'] for result in significance_results.values())
print(f"\n📊 SUMMARY: {significant_signals}/{total_signals} signals are statistically significant")
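
The per-signal dictionaries printed above can also be collected into a table, which is easier to scan or export. A minimal sketch, assuming the result keys used in the loop above (`recall`, `expected_random`, `ci_lower`, `ci_upper`, `significant`); the numbers in `example_results` are made up for illustration.

```python
import pandas as pd

# Hypothetical results in the same shape as significance_results.
example_results = {
    -1: {"recall": 0.51, "expected_random": 0.30,
         "ci_lower": 0.28, "ci_upper": 0.32, "significant": True},
    0: {"recall": 0.41, "expected_random": 0.40,
        "ci_lower": 0.38, "ci_upper": 0.42, "significant": False},
    1: {"recall": 0.52, "expected_random": 0.30,
        "ci_lower": 0.28, "ci_upper": 0.32, "significant": True},
}

# One row per signal, one column per result field.
summary = pd.DataFrame.from_dict(example_results, orient="index")
summary.index.name = "signal"
summary["improvement"] = summary["recall"] - summary["expected_random"]
print(summary)
```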

### 7.3 Visualize Random Baseline Analysis

In [None]:
# Create baseline confidence intervals visualization
fig1 = plot_recall_confidence_intervals(targets_sample, confidence=0.95, 
                                       title_prefix="Sample Data: ")
plt.show()

### 7.4 Comprehensive Performance Analysis

In [None]:
# 🎯 COMPREHENSIVE PERFORMANCE VISUALIZATION
fig2 = plot_prediction_performance(predictions_sample, targets_sample, 
                                  confidence=0.95, 
                                  title_prefix="Sample Model: ")
plt.show()

### 7.5 Theoretical Validation

In [None]:
# Validate analytical methods with simulation
fig3 = plot_theoretical_validation(targets_sample, confidence=0.95, n_simulations=5000)
plt.show()

## 8. Direct Recall Analysis (NEW)

This section demonstrates the new functions that take recall values directly instead of computing them from predictions and targets.

In [None]:
# Example: Working with Direct Recall Values
# This is useful when you already have computed recall values from your model

# Sample recall values (these could come from your model evaluation)
sample_recall_values = {
    -1: 0.65,  # Sell signal recall: 65%
     0: 0.42,  # Hold signal recall: 42% 
     1: 0.68   # Buy signal recall: 68%
}

# Sample signal counts (how many times each signal appeared in your data)
sample_signal_counts = {
    -1: 3000,  # 3000 sell signals in dataset
     0: 4000,  # 4000 hold signals in dataset
     1: 3000   # 3000 buy signals in dataset
}

print("🆕 DIRECT RECALL ANALYSIS EXAMPLE")
print("=" * 50)
print(f"Recall Values: {sample_recall_values}")
print(f"Signal Counts: {sample_signal_counts}")
print(f"Total Samples: {sum(sample_signal_counts.values()):,}")
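
Under the random baseline simulated in section 6, predictions follow the target distribution, so the expected random recall for a signal is its share of the dataset. With the numbers above, that gives a quick sanity check before any interval is computed. The sketch is plain arithmetic and does not call the notebook's functions:

```python
recall_values = {-1: 0.65, 0: 0.42, 1: 0.68}
signal_counts = {-1: 3000, 0: 4000, 1: 3000}

total = sum(signal_counts.values())
improvements = {}
for signal, recall_value in recall_values.items():
    expected_random = signal_counts[signal] / total   # class share
    improvements[signal] = round(recall_value - expected_random, 4)
    print(f"signal {signal:+d}: recall {recall_value:.2f}, "
          f"random {expected_random:.2f}, improvement {improvements[signal]:+.2f}")
```

Sell and Buy sit far above their class shares, while Hold's margin of 0.02 is small enough that its significance depends on the interval width.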

### 8.1 Baseline Analysis with Direct Recalls

In [None]:
# Calculate baseline confidence intervals using recall values directly
baseline_results_from_recalls = compute_all_recall_intervals_random_baseline_from_recalls(
    sample_recall_values, sample_signal_counts, confidence=0.95, verbose=True
)

### 8.2 Significance Testing with Direct Recalls

In [None]:
# 🎯 SIGNIFICANCE TESTING WITH DIRECT RECALL VALUES
significance_results_from_recalls = prediction_recall_significance_from_recalls(
    sample_recall_values, sample_signal_counts, confidence=0.95
)

print("🎯 RECALL SIGNIFICANCE ANALYSIS RESULTS (FROM DIRECT RECALLS)")
print("=" * 70)

signal_names = {-1: "Sell", 0: "Hold", 1: "Buy"}

for signal in [-1, 0, 1]:
    if signal in significance_results_from_recalls:
        result = significance_results_from_recalls[signal]
        signal_name = signal_names[signal]
        
        print(f"\n{signal_name} Signal ({signal}):")
        print(f"  Actual Recall: {result['recall']:.4f}")
        print(f"  Random Expected: {result['expected_random']:.4f}")
        print(f"  Random CI: [{result['ci_lower']:.4f}, {result['ci_upper']:.4f}]")
        print(f"  Significant: {'✅ YES' if result['significant'] else '❌ NO'}")
        print(f"  Improvement: {result['recall'] - result['expected_random']:+.4f}")

# Summary
total_signals = len(significance_results_from_recalls)
significant_signals = sum(result['significant'] for result in significance_results_from_recalls.values())
print(f"\n📊 SUMMARY: {significant_signals}/{total_signals} signals are statistically significant")

### 8.3 Comprehensive Visualization with Direct Recalls

In [None]:
# 🎯 COMPREHENSIVE PERFORMANCE VISUALIZATION WITH DIRECT RECALLS
fig_recalls = plot_prediction_performance_from_recalls(
    sample_recall_values, sample_signal_counts, 
    confidence=0.95, 
    title_prefix="Direct Recall Analysis: "
)
plt.show()

### 8.4 Comparison: Direct Recalls vs Traditional Method

In [None]:
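# Compare the direct-recall results with the traditional prediction/target approach.
# Note: sample_recall_values above are illustrative and were not computed from
# predictions_sample, so the recall columns below are expected to differ.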

print("🔄 COMPARISON: Direct Recalls vs Traditional Method")
print("=" * 60)

# Compare significance results
print("\n🎯 Significance Testing Results:")
print("Signal | Direct Recalls | Traditional | Match")
print("-" * 45)

for signal in [-1, 0, 1]:
    if signal in significance_results and signal in significance_results_from_recalls:
        direct_sig = significance_results_from_recalls[signal]['significant']
        traditional_sig = significance_results[signal]['significant']
        match = "✅" if direct_sig == traditional_sig else "❌"
        
        # str() keeps True/False readable; a width spec on a bool would print 1/0
        print(f"{signal:6} | {str(direct_sig):>14} | {str(traditional_sig):>11} | {match}")

# Compare recall values
print("\n📊 Recall Values Comparison:")
print("Signal | Direct Recalls | Traditional | Difference")
print("-" * 50)

for signal in [-1, 0, 1]:
    if signal in significance_results and signal in significance_results_from_recalls:
        direct_recall = significance_results_from_recalls[signal]['recall']
        traditional_recall = significance_results[signal]['recall']
        diff = abs(direct_recall - traditional_recall)
        
        print(f"{signal:6} | {direct_recall:14.4f} | {traditional_recall:11.4f} | {diff:10.6f}")

print("\n✅ Both methods should give identical results when using the same underlying data!")
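
The recall columns agree only when the direct inputs are derived from the same predictions and targets. A sketch of deriving them, assuming `signal_counts` means the number of target rows per signal (an assumption for this example) and using a hypothetical helper name:

```python
import pandas as pd


def recalls_and_counts(predictions: pd.Series, targets: pd.Series,
                       signals=(-1, 0, 1)):
    """Per-signal recall and target counts, skipping signals absent from targets."""
    recall_values, signal_counts = {}, {}
    for signal in signals:
        is_target = targets == signal
        n_pos = int(is_target.sum())
        if n_pos == 0:
            continue  # recall is undefined without positives
        true_positives = int(((predictions == signal) & is_target).sum())
        recall_values[signal] = true_positives / n_pos
        signal_counts[signal] = n_pos
    return recall_values, signal_counts


# Tiny illustrative series.
targets = pd.Series([-1, -1, 0, 0, 1, 1, 1, 1])
predictions = pd.Series([-1, 0, 0, 0, 1, 1, 0, -1])
recall_values, signal_counts = recalls_and_counts(predictions, targets)
print(recall_values, signal_counts)
```

Feeding values built this way from `predictions_sample` and `targets_sample` into the `_from_recalls` functions should then make the comparison above a like-for-like check.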

## 9. Quick Reference Guide

### 🎯 Main Functions for Analysis

In [None]:
print("🎯 QUICK REFERENCE GUIDE")
print("=" * 50)
print()
print("📊 WORKING WITH DIRECT RECALL VALUES:")
print("   # Example recall values and signal counts")
print("   recall_values = {-1: 0.65, 0: 0.42, 1: 0.68}")
print("   signal_counts = {-1: 3000, 0: 4000, 1: 3000}")
print()
print("   # Significance testing with recalls")
print("   results = prediction_recall_significance_from_recalls(recall_values, signal_counts)")
print()
print("   # Visualization with recalls")
print("   fig = plot_prediction_performance_from_recalls(recall_values, signal_counts)")
print()
print("📊 TRADITIONAL PREDICTION/TARGET ANALYSIS:")
print("   results = prediction_recall_significance(predictions, targets)")
print("   fig = plot_prediction_performance(predictions, targets, confidence=0.95)")
print()
print("📊 BASELINE ANALYSIS:")
print("   baseline_results = compute_all_recall_intervals_random_baseline(targets, verbose=True)")
print()
print("🔬 THEORETICAL VALIDATION:")
print("   fig = plot_theoretical_validation(targets, confidence=0.95, n_simulations=10000)")
print()
print("✅ INTERPRETATION:")
print("   - significant=True: Model beats random baseline")
print("   - Stars (★) indicate statistical significance")
print("   - Confidence intervals show random baseline range")
print()
print("🆕 NEW FEATURES:")
print("   - Functions ending with '_from_recalls' work with direct recall values")
print("   - Original functions preserved for backward compatibility")
print("   - Enhanced analysis supports both approaches")

## 10. Load Your Own Data

Use this template to analyze your own model predictions.

In [None]:
# Template for your own analysis
print("📋 TEMPLATE FOR YOUR ANALYSIS")
print("=" * 40)
print()
print("# Option 1: Working with direct recall values")
print("# recall_values = {-1: 0.65, 0: 0.42, 1: 0.68}  # Your actual recall values")
print("# signal_counts = {-1: 3000, 0: 4000, 1: 3000}   # Count of each signal in your data")
print()
print("# results = prediction_recall_significance_from_recalls(recall_values, signal_counts)")
print("# fig = plot_prediction_performance_from_recalls(recall_values, signal_counts,")
print("#                                                 title_prefix='My Model: ')")
print("# plt.show()")
print()
print("# Option 2: Traditional approach with predictions/targets")
print("# your_predictions = pd.Series([...])  # Your model predictions (-1, 0, 1)")
print("# your_targets = pd.Series([...])      # True target values (-1, 0, 1)")
print()
print("# results = prediction_recall_significance(your_predictions, your_targets)")
print("# fig = plot_prediction_performance(your_predictions, your_targets,")
print("#                                   title_prefix='My Model: ')")
print("# plt.show()")
print()
print("# Print summary for both approaches")
print("# for signal, result in results.items():")
print("#     print(f\"Signal {signal}: {'Significant' if result['significant'] else 'Not Significant'}\")")
print()
print("🆕 Choose the approach that best fits your data:")
print("   - Use '_from_recalls' functions if you already have recall values")
print("   - Use original functions if you have prediction/target series")

## 11. Summary

This notebook provides a complete framework for recall-based performance analysis:

### ✅ **Core Capabilities**
- **Statistical Significance Testing**: Determine if your model beats random baseline
- **Confidence Intervals**: Wilson method for robust statistical inference
- **Comprehensive Visualizations**: 4-panel analysis plots
- **Theoretical Validation**: Monte Carlo simulation verification

### 🎯 **Key Functions**
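1. **`prediction_recall_significance()`** - Main significance testing function
2. **`plot_prediction_performance()`** - Comprehensive visualization
3. **`compute_all_recall_intervals_random_baseline()`** - Baseline analysis
4. **`plot_theoretical_validation()`** - Simulation validation
5. **`prediction_recall_significance_from_recalls()`** - Significance testing from precomputed recall values
6. **`plot_prediction_performance_from_recalls()`** - Visualization from precomputed recall values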

### 📊 **Output Interpretation**
- **Stars (★)**: Indicate statistical significance
- **Confidence Intervals**: Show random baseline performance range
- **Improvement Bars**: Show how much better (or worse) than random
- **Significance Summary**: Detailed statistical results

### 🚀 **Next Steps**
1. Load your model predictions and targets
2. Run `prediction_recall_significance()` for main analysis
3. Use `plot_prediction_performance()` for visualization
4. Interpret results using significance markers and confidence intervals