# Reverse Compatibility Demo: mega_df ↔ Original Analysis Tools

This notebook demonstrates that **mega_df**, built by `aggregator_v4.13_cc.ipynb`, can be passed to the original analysis functions from `analysis_v4.13_cc.ipynb` without changing them.

## Key Concept:
The aggregated CSV files have the **same structure** as the original analysis dataframes, so you can:
1. Load mega_df with all subjects
2. Filter to specific subject(s) or groups
3. Use original analysis functions without modification
4. Run cross-subject comparisons with the same tools

## Workflow:
- **Single Subject Analysis**: Filter mega_df → Run original analyses
- **Group Analysis**: Filter by subject_group → Run original analyses
- **Cross-Subject**: Use mega_df directly for population-level insights

## 1. Setup: Import Libraries & Configuration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import ttest_ind, mannwhitneyu, shapiro, levene, pearsonr, spearmanr
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import os
import re
from pathlib import Path
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Plot settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('Set2')

print("✓ Libraries imported")

## 2. Load Configuration Maps (from analysis_v4.13_cc.ipynb)

In [None]:
# These are the same configuration maps used in the original analysis notebook

# CRM Condition Mapping
CRM_CONDITION_MAP = {
    0: 'Practice',
    1: 'BM', 2: 'BM', 3: 'BM',
    4: 'CI', 5: 'CI', 6: 'CI',
    7: 'HA', 8: 'HA', 9: 'HA'
}

# Phoneme Mappings
VOWEL_MAP = {
    1: 'AE', 2: 'AH', 3: 'AW', 4: 'EH', 5: 'IH',
    6: 'IY', 7: 'OO', 8: 'UH', 9: 'UW'
}

CONSONANT_MAP = {
    1: '#', 2: '_', 3: 'b', 4: 'd', 5: 'f', 6: 'g',
    7: 'k', 8: 'm', 9: 'n', 10: '%', 11: 'p', 12: 's',
    13: 't', 14: 'v', 15: 'z', 16: '$'
}

# Phonetic Features (for consonant analysis)
CONSONANT_FEATURES = {
    'place': {
        'labial': ['b', 'p', 'm', 'f', 'v'],
        'alveolar': ['d', 't', 'n', 's', 'z'],
        'velar': ['g', 'k'],
        'other': ['#', '_', '%', '$']
    },
    'manner': {
        'stop': ['b', 'd', 'g', 'p', 't', 'k'],
        'fricative': ['f', 'v', 's', 'z'],
        'nasal': ['m', 'n'],
        'other': ['#', '_', '%', '$']
    },
    'voicing': {
        'voiced': ['b', 'd', 'g', 'm', 'n', 'v', 'z'],
        'voiceless': ['p', 't', 'k', 'f', 's'],
        'other': ['#', '_', '%', '$']
    }
}

print("✓ Configuration maps loaded")
print(f"  CRM: {len(set(CRM_CONDITION_MAP.values()))} conditions across {len(CRM_CONDITION_MAP)} runs")
print(f"  Phonemes: {len(VOWEL_MAP)} vowels, {len(CONSONANT_MAP)} consonants")
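
The feature map above is keyed by category, which suits the per-category loops used later in this notebook. For row-wise work it can help to invert one feature into a phoneme-to-category lookup. The following is a minimal sketch under that assumption; `invert_feature_map` is a hypothetical helper, not a function from the original notebook, and `place_map` repeats the `place` entry of `CONSONANT_FEATURES` so the block runs on its own.

```python
def invert_feature_map(categories):
    """Turn {category: [phonemes]} into {phoneme: category}."""
    lookup = {}
    for category, phonemes in categories.items():
        for phoneme in phonemes:
            lookup[phoneme] = category
    return lookup

# Same content as CONSONANT_FEATURES['place'], repeated to keep the sketch self-contained
place_map = {
    'labial': ['b', 'p', 'm', 'f', 'v'],
    'alveolar': ['d', 't', 'n', 's', 'z'],
    'velar': ['g', 'k'],
    'other': ['#', '_', '%', '$'],
}

phoneme_to_place = invert_feature_map(place_map)
print(phoneme_to_place['b'], phoneme_to_place['k'])
```

With such a lookup, a column of phoneme labels can be tagged in one step, for example `df['stimulus'].map(phoneme_to_place)`, assuming the dataframe stores labels rather than numeric codes.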

## 3. Load mega_df

Load the combined dataset created by aggregator_v4.13_cc.ipynb

In [None]:
# Option 1: Load from saved mega_df CSV
MEGA_DF_PATH = '/home/user/Disco/mega_df_all_subjects.csv'  # UPDATE THIS PATH

try:
    mega_df = pd.read_csv(MEGA_DF_PATH)
    print(f"✓ Loaded mega_df from {MEGA_DF_PATH}")
    print(f"  Shape: {mega_df.shape[0]:,} rows × {mega_df.shape[1]} columns")
    print(f"  Subjects: {mega_df['subject_id'].nunique()}")
    # dropna/astype guard against missing or non-string task labels
    print(f"  Tasks: {', '.join(mega_df['task'].dropna().astype(str).unique())}")
except FileNotFoundError:
    print(f"⚠ File not found: {MEGA_DF_PATH}")
    print("  Run aggregator_v4.13_cc.ipynb first to create mega_df")
    mega_df = pd.DataFrame()  # Empty dataframe as fallback

In [None]:
# Option 2: Or load directly from aggregator functions
# (Uncomment to regenerate mega_df; this requires the aggregator notebook
#  to be exported as an importable .py module first)

# from aggregator_v4_13_cc import load_and_merge_subjects
# mega_df = load_and_merge_subjects('/home/user/Disco/Data')
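
The later cells rely on a few columns being present: `subject_id`, `task`, and a score column named either `score` or `correct`. A quick check right after loading can surface a mismatch early. This is a minimal sketch, not part of the original notebooks; `missing_columns` and the two constant lists are hypothetical names, and the toy frame stands in for the loaded `mega_df`.

```python
import pandas as pd

REQUIRED_COLUMNS = ['subject_id', 'task']   # used by every later cell
SCORE_COLUMNS = ['score', 'correct']        # at least one must be present

def missing_columns(df):
    """Return the names of expected columns that df lacks."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if not any(c in df.columns for c in SCORE_COLUMNS):
        missing.append('score|correct')
    return missing

# Toy stand-in for a loaded mega_df
toy = pd.DataFrame({'subject_id': ['CI148'], 'task': ['Vowels'], 'correct': [1]})

problems_ok = missing_columns(toy)
problems_bad = missing_columns(toy.drop(columns=['task', 'correct']))
print(problems_ok, problems_bad)
```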

## 4. Demonstration: Single Subject Analysis (Reverse Compatibility)

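Show that filtering mega_df for a single subject yields per-task dataframes with the **same structure** as those used in the original analysis. The cell below extracts and splits the data; it does not compare values against the original output.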

In [None]:
# Select a subject for demonstration
DEMO_SUBJECT = 'CI148'  # This subject has all three task types

if not mega_df.empty:
    # Filter mega_df for single subject
    df_subject = mega_df[mega_df['subject_id'] == DEMO_SUBJECT].copy()
    
    print(f"\n{'='*70}")
    print(f"SINGLE SUBJECT EXTRACTION: {DEMO_SUBJECT}")
    print(f"{'='*70}")
    print(f"Total trials: {len(df_subject):,}")
    
    if 'task' in df_subject.columns:
        print(f"\nTask breakdown:")
        for task, count in df_subject['task'].value_counts().items():
            print(f"  {task:12s}: {count:4d} trials")
    
    # Split by task (same as original analysis)
    df_consonant = df_subject[df_subject['task'] == 'Consonants'].copy()
    df_vowel = df_subject[df_subject['task'] == 'Vowels'].copy()
    df_crm = df_subject[df_subject['task'] == 'CRM'].copy()
    
    print(f"\n✓ Data split by task:")
    print(f"  df_consonant: {len(df_consonant)} rows")
    print(f"  df_vowel: {len(df_vowel)} rows")
    print(f"  df_crm: {len(df_crm)} rows")
    print(f"\n{'='*70}")
    print(f"✓ These dataframes share the structure of the original analysis output,")
    print(f"  so the original analysis functions can be applied unchanged.")
    print(f"{'='*70}\n")
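
If the original per-subject dataframe is still available, the match can be checked rather than assumed. The sketch below uses `pandas.testing.assert_frame_equal` on toy data; `df_original` is a hypothetical stand-in for the original notebook's output, and the subject IDs and values are made up. It compares only the shared columns and sets `check_dtype=False`, on the assumption that a CSV round trip may change dtypes and that the aggregated frame carries extra columns such as `subject_id`.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical per-subject output of the original analysis (toy values)
df_original = pd.DataFrame({
    'task': ['Vowels', 'Vowels', 'Consonants'],
    'stimulus': ['AE', 'IY', 'b'],
    'score': [1, 0, 1],
})

# Toy aggregated frame: the same subject plus one more
mega_df = pd.concat([
    df_original.assign(subject_id='CI148'),
    df_original.assign(subject_id='HS001', score=[0, 0, 1]),
], ignore_index=True)

df_subject = mega_df[mega_df['subject_id'] == 'CI148']
shared = [c for c in df_original.columns if c in df_subject.columns]

# Raises AssertionError with a diff if any shared column differs
assert_frame_equal(
    df_subject[shared].reset_index(drop=True),
    df_original[shared].reset_index(drop=True),
    check_dtype=False,
)
matches = True
print("Shared columns match:", shared)
```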

## 5. Copy Key Analysis Functions from Original Notebook

These are the **exact same functions** from analysis_v4.13_cc.ipynb

In [None]:
def calculate_ci_bootstrap(data, confidence=0.95, n_bootstrap=1000):
    """
    Calculate confidence interval using bootstrap method.
    """
    if len(data) == 0:
        return np.nan, np.nan
    
    bootstrapped_means = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=len(data), replace=True)
        bootstrapped_means.append(np.mean(sample))
    
    alpha = 1 - confidence
    lower = np.percentile(bootstrapped_means, (alpha/2) * 100)
    upper = np.percentile(bootstrapped_means, (1 - alpha/2) * 100)
    
    return lower, upper

def analyze_phonetic_features(df, feature_map, title="Consonant"):
    """
    Performs rigorous phonetic feature analysis with confidence intervals.
    EXACT COPY from analysis_v4.13_cc.ipynb
    """
    if df.empty:
        print(f"No {title} data available")
        return None
    
    # Determine score column
    score_col = 'score' if 'score' in df.columns else 'correct'
    
    results = {}
    
    for feature_name, categories in feature_map.items():
        print(f"\n{'='*60}")
        print(f"{title.upper()} FEATURE: {feature_name.upper()}")
        print(f"{'='*60}")
        
        category_stats = []
        
        for category, phonemes in categories.items():
            if 'stimulus' in df.columns:
                cat_data = df[df['stimulus'].isin(phonemes)]
            elif 'presented' in df.columns:
                cat_data = df[df['presented'].isin(phonemes)]
            else:
                continue
            
            if len(cat_data) == 0:
                continue
            
            scores = cat_data[score_col].values
            mean_acc = np.mean(scores)
            ci_low, ci_high = calculate_ci_bootstrap(scores)
            
            category_stats.append({
                'category': category,
                'n': len(cat_data),
                'accuracy': mean_acc,
                'ci_low': ci_low,
                'ci_high': ci_high,
                'phonemes': ', '.join(phonemes)
            })
        
        if category_stats:
            stats_df = pd.DataFrame(category_stats)
            print(stats_df.to_string(index=False))
            results[feature_name] = stats_df
            
            # Visualization
            fig, ax = plt.subplots(figsize=(10, 6))
            x_pos = np.arange(len(stats_df))
            
            ax.bar(x_pos, stats_df['accuracy'], alpha=0.7, color='steelblue')
            ax.errorbar(x_pos, stats_df['accuracy'], 
                       yerr=[stats_df['accuracy'] - stats_df['ci_low'],
                             stats_df['ci_high'] - stats_df['accuracy']],
                       fmt='none', ecolor='black', capsize=5, capthick=2)
            
            ax.set_xlabel(f'{feature_name.capitalize()} Category')
            ax.set_ylabel('Accuracy')
            ax.set_title(f'{title} Accuracy by {feature_name.capitalize()} (with 95% CI)')
            ax.set_xticks(x_pos)
            ax.set_xticklabels(stats_df['category'])
            ax.set_ylim([0, 1])
            ax.grid(True, alpha=0.3)
            
            plt.tight_layout()
            plt.show()
    
    return results

print("✓ Analysis functions loaded")
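
`calculate_ci_bootstrap` draws from NumPy's global random state, so the interval bounds differ slightly from run to run. Where reproducible bounds matter, a seeded variant is one option. The sketch below is an assumption-labelled alternative, not a replacement for the original function: `bootstrap_ci_seeded` and its `seed` parameter are hypothetical, and it draws all resamples at once with `numpy.random.default_rng`.

```python
import numpy as np

def bootstrap_ci_seeded(data, confidence=0.95, n_bootstrap=1000, seed=0):
    """Percentile bootstrap CI of the mean with a fixed seed (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    if data.size == 0:
        return np.nan, np.nan

    rng = np.random.default_rng(seed)
    # One row of resampled indices per bootstrap replicate
    idx = rng.integers(0, data.size, size=(n_bootstrap, data.size))
    means = data[idx].mean(axis=1)

    alpha = 1 - confidence
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Toy trial-level scores (1 = correct, 0 = incorrect)
scores = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
ci_a = bootstrap_ci_seeded(scores, seed=42)
ci_b = bootstrap_ci_seeded(scores, seed=42)
print(ci_a == ci_b)
```

The same seed gives the same bounds; changing the seed shows how much the interval depends on the particular resamples.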

## 6. Demo: Run Original Analysis on Single Subject from mega_df

In [None]:
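# Run phonetic feature analysis on consonant data extracted from mega_df
# (df_consonant is only defined in Section 4 when mega_df loaded successfully)
if not mega_df.empty and not df_consonant.empty: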
    print(f"\n{'='*70}")
    print(f"RUNNING ORIGINAL ANALYSIS ON {DEMO_SUBJECT} (from mega_df)")
    print(f"{'='*70}\n")
    
    feat_results = analyze_phonetic_features(df_consonant, CONSONANT_FEATURES, "Consonant")
    
    print(f"\n{'='*70}")
    if feat_results:
        print(f"✓ SUCCESS: Original analysis function ran on the mega_df subset")
    else:
        print(f"⚠ Analysis ran but returned no feature results")
    print(f"{'='*70}\n")
else:
    print(f"No consonant data for {DEMO_SUBJECT}")

## 7. Demo: Basic Statistics for Single Subject

In [None]:
def subject_summary_stats(df_subject, subject_id):
    """
    Calculate summary statistics for a single subject.
    Works with data extracted from mega_df.
    """
    print(f"\n{'='*70}")
    print(f"SUMMARY STATISTICS: {subject_id}")
    print(f"{'='*70}")
    
    for task in df_subject['task'].unique():
        df_task = df_subject[df_subject['task'] == task]
        
        print(f"\n{task}:")
        print(f"  Trials: {len(df_task)}")
        
        if 'score' in df_task.columns:
            print(f"  Accuracy: {df_task['score'].mean():.3f} (±{df_task['score'].std():.3f})")
        elif 'correct' in df_task.columns:
            print(f"  Accuracy: {df_task['correct'].mean():.3f} (±{df_task['correct'].std():.3f})")
        
        if 'rt' in df_task.columns:
            rt_valid = df_task['rt'].dropna()
            if len(rt_valid) > 0:
                print(f"  RT (ms): {rt_valid.mean():.1f} (±{rt_valid.std():.1f})")
    
    print(f"{'='*70}\n")

# Run summary stats
if not mega_df.empty and not df_subject.empty:
    subject_summary_stats(df_subject, DEMO_SUBJECT)

## 8. Demo: Multi-Subject Comparison Using mega_df

Because every row of mega_df carries a `subject_id`, a cross-subject comparison reduces to a single `groupby`.

In [None]:
def compare_subjects_consonant_accuracy(mega_df, subjects_list=None):
    """
    Compare consonant accuracy across multiple subjects.
    Shows how mega_df enables easy cross-subject analysis.
    """
    # Filter to consonants only
    df_cons_all = mega_df[mega_df['task'] == 'Consonants'].copy()
    
    if df_cons_all.empty:
        print("No consonant data found")
        return
    
    # Filter to specific subjects if provided
    if subjects_list:
        df_cons_all = df_cons_all[df_cons_all['subject_id'].isin(subjects_list)]
    
    # Calculate accuracy per subject
    score_col = 'score' if 'score' in df_cons_all.columns else 'correct'
    
    subject_acc = df_cons_all.groupby('subject_id')[score_col].agg([
        ('mean', 'mean'),
        ('std', 'std'),
        ('n', 'count')
    ]).round(3)
    
    subject_acc = subject_acc.sort_values('mean', ascending=False)
    
    print(f"\n{'='*70}")
    print(f"CROSS-SUBJECT COMPARISON: Consonant Accuracy")
    print(f"{'='*70}")
    print(subject_acc)
    print(f"\nPopulation mean: {df_cons_all[score_col].mean():.3f}")
    print(f"Population std:  {df_cons_all[score_col].std():.3f}")
    print(f"{'='*70}\n")
    
    # Plot
    fig, ax = plt.subplots(figsize=(12, 6))
    subject_acc['mean'].plot(kind='bar', ax=ax, color='steelblue', alpha=0.7)
    ax.axhline(df_cons_all[score_col].mean(), color='red', linestyle='--', 
               label='Population Mean', linewidth=2)
    ax.set_xlabel('Subject ID')
    ax.set_ylabel('Mean Accuracy')
    ax.set_title('Consonant Accuracy by Subject (from mega_df)')
    ax.set_ylim([0, 1])
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    return subject_acc

# Run cross-subject comparison
if not mega_df.empty:
    subject_comparison = compare_subjects_consonant_accuracy(mega_df)

## 9. Demo: Group Comparison (CI vs HS)

Filter by subject groups and compare using original analysis tools.

In [None]:
def add_subject_group(df):
    """
    Add subject_group column based on subject_id prefix.
    """
    def get_group(subj_id):
        # str() keeps this safe for missing or non-string IDs
        prefix = str(subj_id)[:2]
        return prefix if prefix in ('CI', 'HS', 'CA', 'LR') else 'Other'
    
    df['subject_group'] = df['subject_id'].apply(get_group)
    return df

def compare_groups_vowel_accuracy(mega_df):
    """
    Compare vowel accuracy between subject groups.
    """
    # Add grouping
    df = add_subject_group(mega_df.copy())
    
    # Filter to vowels
    df_vowels = df[df['task'] == 'Vowels'].copy()
    
    if df_vowels.empty:
        print("No vowel data found")
        return
    
    score_col = 'score' if 'score' in df_vowels.columns else 'correct'
    
    # Group statistics
    group_stats = df_vowels.groupby('subject_group')[score_col].agg([
        ('mean', 'mean'),
        ('std', 'std'),
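        # x holds one group's scores; look up that group's subjects by index
        # (assumes mega_df has a unique index, as pd.read_csv produces by default)
        ('n_subjects', lambda x: df_vowels.loc[x.dropna().index, 'subject_id'].nunique()),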
        ('n_trials', 'count')
    ]).round(3)
    
    print(f"\n{'='*70}")
    print(f"GROUP COMPARISON: Vowel Accuracy")
    print(f"{'='*70}")
    print(group_stats)
    print(f"{'='*70}\n")
    
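    # Statistical test (CI vs HS if both present)
    # Note: scores are trial-level, so trials from the same subject are not
    # independent observations; treat the p-value as descriptive.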
    groups_present = df_vowels['subject_group'].unique()
    if 'CI' in groups_present and 'HS' in groups_present:
        ci_scores = df_vowels[df_vowels['subject_group'] == 'CI'][score_col].dropna()
        hs_scores = df_vowels[df_vowels['subject_group'] == 'HS'][score_col].dropna()
        
        stat, p = mannwhitneyu(ci_scores, hs_scores, alternative='two-sided')
        
        print(f"Mann-Whitney U Test (CI vs HS):")
        print(f"  U = {stat:.2f}, p = {p:.4f}")
        print(f"  Result: {'Significant' if p < 0.05 else 'Not significant'} (α = 0.05)\n")
    
    # Plot
    fig, ax = plt.subplots(figsize=(10, 6))
    df_vowels.boxplot(column=score_col, by='subject_group', ax=ax)
    ax.set_xlabel('Subject Group')
    ax.set_ylabel('Accuracy')
    ax.set_title('Vowel Accuracy by Subject Group (from mega_df)')
    plt.suptitle('')
    plt.tight_layout()
    plt.show()

# Run group comparison
if not mega_df.empty:
    compare_groups_vowel_accuracy(mega_df)
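
The Mann-Whitney test above pools individual trials, which treats repeated trials from one subject as independent. One common alternative is to average within each subject first and compare the subject means. The sketch below illustrates that approach on toy data; the subject IDs and scores are made up, and it is not the method used in the original notebooks.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Toy trial-level data: three subjects per group, four trials each
toy = pd.DataFrame({
    'subject_id': (['CI001'] * 4 + ['CI002'] * 4 + ['CI003'] * 4 +
                   ['HS001'] * 4 + ['HS002'] * 4 + ['HS003'] * 4),
    'score': [1, 0, 0, 0,  1, 1, 0, 0,  0, 0, 0, 1,
              1, 1, 1, 1,  1, 1, 1, 0,  1, 1, 0, 1],
})
toy['subject_group'] = toy['subject_id'].str[:2]

# One row per subject: mean accuracy across that subject's trials
subject_means = (
    toy.groupby(['subject_group', 'subject_id'])['score']
       .mean()
       .reset_index()
)

ci_means = subject_means.loc[subject_means['subject_group'] == 'CI', 'score']
hs_means = subject_means.loc[subject_means['subject_group'] == 'HS', 'score']

# The unit of analysis is now the subject, not the trial
stat, p = mannwhitneyu(ci_means, hs_means, alternative='two-sided')
print(f"n_CI = {len(ci_means)}, n_HS = {len(hs_means)}, U = {stat:.2f}, p = {p:.4f}")
```

With few subjects per group the test has little power, so the subject-level result is best read alongside the group means rather than on its own.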

## 10. Demo: Iterate Through All Subjects Automatically

In [None]:
def analyze_all_subjects_consonants(mega_df, show_plots=False):
    """
    Run consonant analysis for ALL subjects in mega_df.
    This shows the power of the aggregated approach.
    """
    df_cons = mega_df[mega_df['task'] == 'Consonants'].copy()
    
    if df_cons.empty:
        print("No consonant data found")
        return
    
    subjects = sorted(df_cons['subject_id'].unique())
    
    print(f"\n{'='*70}")
    print(f"BATCH ANALYSIS: All Subjects - Consonant Place of Articulation")
    print(f"{'='*70}\n")
    
    all_results = []
    
    for subject_id in subjects:
        df_subj = df_cons[df_cons['subject_id'] == subject_id].copy()
        
        score_col = 'score' if 'score' in df_subj.columns else 'correct'
        
        # Analyze place of articulation
        for place, phonemes in CONSONANT_FEATURES['place'].items():
            if 'stimulus' in df_subj.columns:
                place_data = df_subj[df_subj['stimulus'].isin(phonemes)]
            elif 'presented' in df_subj.columns:
                place_data = df_subj[df_subj['presented'].isin(phonemes)]
            else:
                continue
            
            if len(place_data) > 0:
                all_results.append({
                    'subject_id': subject_id,
                    'place': place,
                    'accuracy': place_data[score_col].mean(),
                    'n': len(place_data)
                })
    
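    results_df = pd.DataFrame(all_results)

    # Guard: pivot() raises KeyError on an empty frame with no columns
    if results_df.empty:
        print("No place-of-articulation data found ('stimulus'/'presented' missing or unmatched)")
        return results_df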
    
    # Pivot table
    pivot = results_df.pivot(index='subject_id', columns='place', values='accuracy')
    print(pivot.round(3))
    print(f"\n{'='*70}\n")
    
    if show_plots:
        # Heatmap
        fig, ax = plt.subplots(figsize=(10, 8))
        sns.heatmap(pivot, annot=True, fmt='.2f', cmap='RdYlGn', 
                   vmin=0, vmax=1, ax=ax, cbar_kws={'label': 'Accuracy'})
        ax.set_title('Consonant Accuracy by Place of Articulation (All Subjects)')
        ax.set_xlabel('Place of Articulation')
        ax.set_ylabel('Subject ID')
        plt.tight_layout()
        plt.show()
    
    return results_df

# Run batch analysis
if not mega_df.empty:
    batch_results = analyze_all_subjects_consonants(mega_df, show_plots=True)
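
The nested loops above can also be written as a column mapping followed by `pivot_table`. This is a minimal sketch on toy data, assuming the stimulus column holds phoneme labels; `place_of` is a hypothetical, shortened phoneme-to-place lookup and the subject IDs are made up.

```python
import pandas as pd

# Hypothetical, shortened phoneme -> place lookup
place_of = {'b': 'labial', 'p': 'labial',
            'd': 'alveolar', 't': 'alveolar',
            'g': 'velar', 'k': 'velar'}

# Toy consonant trials for two subjects
toy = pd.DataFrame({
    'subject_id': ['CI001'] * 4 + ['HS001'] * 4,
    'stimulus': ['b', 'p', 'd', 'g', 'b', 'd', 't', 'k'],
    'score': [1, 0, 1, 0, 1, 1, 0, 1],
})

# Tag each trial with its place of articulation, then average per subject and place
toy['place'] = toy['stimulus'].map(place_of)
pivot = toy.pivot_table(index='subject_id', columns='place',
                        values='score', aggfunc='mean')
print(pivot.round(3))
```

Unlike `pivot`, `pivot_table` aggregates duplicate subject/place pairs itself, so no intermediate list of per-category means is needed.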

## 11. Summary: Reverse Compatibility Confirmed ✓

### What We Demonstrated:

1. **Single Subject Analysis**
   - Filter `mega_df` by `subject_id` → produces the same structure as the original analysis dataframes
   - The original functions shown here run without modification

2. **Original Analysis Functions**
   - `analyze_phonetic_features()` runs unchanged on mega_df subsets
   - No code changes required

3. **Enhanced Capabilities**
   - Cross-subject comparisons made easy
   - Group analyses (CI vs HS)
   - Batch processing of all subjects

4. **Seamless Integration**
   - Load mega_df → Filter → Analyze with original tools
   - OR: Use mega_df directly for population-level insights

### Key Insight:
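The aggregated CSV keeps the **same column layout** as the original analysis dataframes, so the original functions run unchanged on any mega_df subset while multi-subject analysis becomes possible. Data types are re-inferred when the CSV is read, so check them if an analysis depends on a specific dtype.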

### Next Steps:
- Use any analysis block from `analysis_v4.13_cc.ipynb` on mega_df subsets
- Create new cross-subject analyses using the same tools
- Scale to population-level statistics effortlessly