# Dance Style Classification - Threshold Tuning Notebook

This notebook provides an interactive environment for:
1. Analyzing feature distributions across dance styles
2. Tuning classification thresholds
3. Evaluating impact of threshold changes
4. Visualizing decision boundaries

**Workflow:**
- Run baseline evaluation
- Analyze misclassifications
- Adjust thresholds interactively
- Re-evaluate and compare metrics
- Document successful changes

## Setup

In [None]:
import sys
import json
import yaml
from pathlib import Path
from collections import defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import neckenml
try:
    from neckenml.analyzer import AudioAnalyzer
    from neckenml.classifier import StyleClassifier
except ImportError:
    print("Error: Cannot import neckenml. Make sure it's installed.")
    print("Try: pip install -e /path/to/neckenml")

# Import evaluation utilities
from evaluate_classification import ClassificationEvaluator

# Configure plotting
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

%matplotlib inline
%load_ext autoreload
%autoreload 2

print("✓ Setup complete")

## 1. Load Test Data and Run Baseline Evaluation

In [None]:
# Initialize evaluator
evaluator = ClassificationEvaluator('test_data/test_tracks.yaml')

# Load test tracks
evaluator.load_test_data()

print(f"Loaded {len(evaluator.test_data['tracks'])} test tracks")
print(f"\nStyles represented:")
style_counts = pd.Series([t['true_style'] for t in evaluator.test_data['tracks']]).value_counts()
print(style_counts)

In [None]:
# Run baseline evaluation
print("Running baseline evaluation...\n")
evaluator.evaluate_all(verbose=False)

# Generate metrics
baseline_metrics = evaluator.generate_metrics()

# Print report
evaluator.print_report(baseline_metrics)

# Save baseline
evaluator.save_results('test_data/baseline_results.json')

## 2. Feature Distribution Analysis

Analyze how different features distribute across dance styles to identify good threshold candidates.
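One way to turn a pair of per-style distributions into a concrete threshold candidate is to scan the midpoints between observed values and keep the cut that separates the two groups best. The sketch below is a minimal, standard-library illustration; `best_cut` is a hypothetical helper (not part of neckenml or `evaluate_classification`), and the sample values are made-up toy numbers, not measurements from the test set.

```python
from typing import Sequence, Tuple


def best_cut(low_group: Sequence[float], high_group: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, accuracy) for the rule 'value >= threshold -> high group'.

    Hypothetical helper for exploring threshold candidates by hand.
    """
    values = sorted(set(low_group) | set(high_group))
    # Candidate cuts are the midpoints between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if not candidates:
        raise ValueError("need at least two distinct values")

    total = len(low_group) + len(high_group)
    best_threshold, best_accuracy = candidates[0], -1.0
    for cut in candidates:
        correct = sum(v < cut for v in low_group) + sum(v >= cut for v in high_group)
        accuracy = correct / total
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = cut, accuracy
    return best_threshold, best_accuracy


# Toy values only (assumed for illustration): a "low" style and a "high" style.
toy_low = [0.2, 0.3, 0.4]
toy_high = [0.6, 0.7, 0.8]
threshold, accuracy = best_cut(toy_low, toy_high)
print(f"candidate threshold: {threshold:.2f}, accuracy: {accuracy:.0%}")
```

With real data, the two groups would come from `results_df`, for example the `ternary_confidence` values of Polka and Polska tracks. A cut found this way is only a starting point, since it is fitted to the same tracks it is scored on.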

In [None]:
# Extract features from all analyzed tracks
results_df = pd.DataFrame([
    {
        'track_id': r['track_id'],
        'true_style': r['true_style'],
        'predicted_style': r['predicted_style'],
        'is_correct': r['is_correct'],
        'confidence': r['confidence'],
        'decision_path': r['decision_path'],
        'bpm': r['features']['bpm'],
        'detected_meter': r['features']['detected_meter'],
        'ternary_confidence': r['features']['ternary_confidence'],
        'polska_score': r['features']['polska_score'],
        'hambo_score': r['features']['hambo_score'],
        'swing_ratio': r['features']['swing_ratio'],
        'punchiness': r['features']['punchiness'],
    }
    for r in evaluator.results
    if r.get('status') == 'analyzed'
])

print(f"Extracted features from {len(results_df)} tracks")
results_df.head()

In [None]:
# BPM distribution by style
fig, ax = plt.subplots(figsize=(14, 6))

styles = sorted(results_df['true_style'].unique())
data_by_style = [results_df[results_df['true_style'] == style]['bpm'].values for style in styles]

bp = ax.boxplot(data_by_style, labels=styles, patch_artist=True)
for patch in bp['boxes']:
    patch.set_facecolor('lightblue')

ax.set_ylabel('BPM', fontsize=12, weight='bold')
ax.set_xlabel('Dance Style', fontsize=12, weight='bold')
ax.set_title('BPM Distribution by Dance Style', fontsize=14, weight='bold')
ax.grid(axis='y', alpha=0.3)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Print statistics
print("\nBPM Statistics by Style:")
print(results_df.groupby('true_style')['bpm'].agg(['mean', 'std', 'min', 'max']).round(1))

In [None]:
# Ternary confidence distribution (critical for Polska vs Polka)
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# All styles
for style in styles:
    data = results_df[results_df['true_style'] == style]['ternary_confidence']
    axes[0].hist(data, alpha=0.5, label=style, bins=20)

axes[0].axvline(x=0.5, color='red', linestyle='--', label='Binary/Ternary boundary')
axes[0].set_xlabel('Ternary Confidence', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Ternary Confidence Distribution - All Styles', fontsize=14, weight='bold')
axes[0].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
axes[0].grid(alpha=0.3)

# Focus on Polska vs Polka
polska_data = results_df[results_df['true_style'] == 'Polska']['ternary_confidence']
polka_data = results_df[results_df['true_style'] == 'Polka']['ternary_confidence']

axes[1].hist(polska_data, alpha=0.6, label='Polska (should be high)', bins=20, color='green')
axes[1].hist(polka_data, alpha=0.6, label='Polka (should be low)', bins=20, color='red')
axes[1].axvline(x=0.5, color='black', linestyle='--', linewidth=2, label='Decision boundary')
axes[1].set_xlabel('Ternary Confidence', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_title('Ternary Confidence: Polska vs Polka', fontsize=14, weight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\nTernary Confidence Statistics:")
print(results_df.groupby('true_style')['ternary_confidence'].agg(['mean', 'std', 'min', 'max']).round(3))

In [None]:
# Polska vs Hambo score comparison (for ternary styles)
ternary_styles = results_df[results_df['true_style'].isin(['Polska', 'Hambo', 'Vals', 'Slängpolska'])]

fig, ax = plt.subplots(figsize=(12, 8))

for style in ['Polska', 'Hambo', 'Vals', 'Slängpolska']:
    style_data = ternary_styles[ternary_styles['true_style'] == style]
    if len(style_data) > 0:
        ax.scatter(
            style_data['polska_score'],
            style_data['hambo_score'],
            label=style,
            alpha=0.7,
            s=100
        )

# Add decision boundary lines
ax.axhline(y=0.45, color='blue', linestyle='--', alpha=0.5, label='Hambo threshold (0.45)')
ax.axvline(x=0.45, color='green', linestyle='--', alpha=0.5, label='Polska threshold (0.45)')

# Add diagonal (where polska_score = hambo_score)
lims = [0, max(ax.get_xlim()[1], ax.get_ylim()[1])]
ax.plot(lims, lims, 'r--', alpha=0.5, label='Equal scores')

ax.set_xlabel('Polska Score', fontsize=12, weight='bold')
ax.set_ylabel('Hambo Score', fontsize=12, weight='bold')
ax.set_title('Polska vs Hambo Score Distribution', fontsize=14, weight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\nPolska/Hambo Score Statistics:")
print(ternary_styles.groupby('true_style')[['polska_score', 'hambo_score']].agg(['mean', 'std']).round(3))

In [None]:
# Swing ratio distribution (for binary styles)
binary_styles = results_df[results_df['true_style'].isin(['Polka', 'Schottis', 'Snoa', 'Engelska'])]

fig, ax = plt.subplots(figsize=(14, 6))

for style in ['Polka', 'Schottis', 'Snoa', 'Engelska']:
    style_data = binary_styles[binary_styles['true_style'] == style]
    if len(style_data) > 0:
        ax.hist(style_data['swing_ratio'], alpha=0.5, label=style, bins=15)

ax.axvline(x=1.25, color='red', linestyle='--', linewidth=2, label='Schottis threshold (1.25)')
ax.set_xlabel('Swing Ratio', fontsize=12, weight='bold')
ax.set_ylabel('Count', fontsize=12)
ax.set_title('Swing Ratio Distribution - Binary Styles', fontsize=14, weight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\nSwing Ratio Statistics:")
print(binary_styles.groupby('true_style')['swing_ratio'].agg(['mean', 'std', 'min', 'max']).round(3))

## 3. Error Analysis

Deep dive into misclassifications to identify patterns.

In [None]:
# Get all errors
errors_df = results_df[~results_df['is_correct']].copy()

print(f"Total errors: {len(errors_df)} / {len(results_df)} ({len(errors_df)/len(results_df):.1%})\n")

if len(errors_df) > 0:
    # Group by confusion pair
    errors_df['confusion_pair'] = errors_df.apply(
        lambda row: f"{row['true_style']} → {row['predicted_style']}",
        axis=1
    )
    
    print("Most common confusions:")
    print(errors_df['confusion_pair'].value_counts())
    
    # Show error details
    print("\nError details:")
    print(errors_df[[
        'track_id', 'true_style', 'predicted_style', 'confidence',
        'decision_path', 'bpm', 'ternary_confidence'
    ]].to_string())
else:
    print("🎉 No errors! Perfect classification!")

In [None]:
# Analyze Polska → Polka errors specifically
polska_to_polka = errors_df[
    (errors_df['true_style'] == 'Polska') &
    (errors_df['predicted_style'] == 'Polka')
]

if len(polska_to_polka) > 0:
    print(f"\nPolska → Polka errors: {len(polska_to_polka)}")
    print("\nFeature analysis:")
    print(polska_to_polka[[
        'track_id', 'bpm', 'ternary_confidence', 'polska_score',
        'detected_meter', 'confidence'
    ]].to_string())
    
    # Check if these are rescue candidates
    print("\n--- Rescue Logic Analysis ---")
    for _, row in polska_to_polka.iterrows():
        print(f"\n{row['track_id']}:")
        print(f"  Ternary confidence: {row['ternary_confidence']:.3f} (need ≥0.45 for rescue)")
        print(f"  Polska score: {row['polska_score']:.3f} (need ≥0.25 for weak signal)")
        print(f"  BPM: {row['bpm']:.1f} (Polska range: 95-115)")
        print(f"  Detected meter: {row['detected_meter']}")

        # Estimate rescue signals
        signals = 0
        if row['ternary_confidence'] >= 0.65:
            signals += 2
            print(f"  ✓ High ternary conf (+2 signals)")
        elif row['ternary_confidence'] >= 0.55:
            signals += 1
            print(f"  ✓ Moderate ternary conf (+1 signal)")

        if row['polska_score'] >= 0.50:
            signals += 2
            print(f"  ✓ High polska score (+2 signals)")
        elif row['polska_score'] >= 0.35:
            signals += 1
            print(f"  ✓ Moderate polska score (+1 signal)")

        if 95 <= row['bpm'] <= 115:
            signals += 1
            print(f"  ✓ In Polska BPM range (+1 signal)")

        print(f"  Total signals: {signals} (need ≥3 for rescue)")

        if signals >= 3:
            print(f"  ⚠️  SHOULD HAVE BEEN RESCUED! Investigate why it wasn't.")
        else:
            print(f"  → Not enough signals for rescue")
else:
    print("✓ No Polska → Polka errors!")

## 4. Interactive Threshold Tuning

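Record candidate threshold values and compare them against the baseline. The classifier does not yet accept threshold overrides, so measuring the impact of a candidate still requires editing the neckenml code and re-running Section 5.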

In [None]:
# Define current thresholds (baseline)
THRESHOLDS = {
    # Polska detection
    'polska_score_min': 0.45,
    'polska_score_weak': 0.25,
    
    # Hambo detection
    'hambo_score_min': 0.45,
    'hambo_polska_separation': 0.10,
    
    # Polska rescue (binary → ternary)
    'rescue_ternary_min': 0.45,
    'rescue_ternary_strong': 0.65,
    'rescue_ternary_moderate': 0.55,
    'rescue_signals_needed': 3,
    
    # Binary styles
    'schottis_swing_min': 1.25,
    'snoa_tempo_min': 80,
    'snoa_tempo_max': 115,
    'polka_tempo_min': 115,
}

print("Current thresholds:")
for key, value in THRESHOLDS.items():
    print(f"  {key}: {value}")

In [None]:
# Threshold tuning experiment template
# Copy this cell and modify values to test different configurations

# ============================================================================
# EXPERIMENT: Reduce Polska rescue threshold
# Hypothesis: Lower ternary_min from 0.45 to 0.42 to catch more edge cases
# ============================================================================

EXPERIMENTAL_THRESHOLDS = THRESHOLDS.copy()

# Modify thresholds here
EXPERIMENTAL_THRESHOLDS['rescue_ternary_min'] = 0.42  # Changed from 0.45
# EXPERIMENTAL_THRESHOLDS['rescue_signals_needed'] = 2  # Uncomment to test

print("Experimental thresholds:")
for key, value in EXPERIMENTAL_THRESHOLDS.items():
    if value != THRESHOLDS[key]:
        print(f"  {key}: {THRESHOLDS[key]} → {value} ⚠️  CHANGED")
    else:
        print(f"  {key}: {value}")

# TODO: Apply these thresholds to classifier and re-evaluate
# This requires modifying the neckenml classifier to accept threshold overrides
# For now, this serves as documentation for manual threshold changes
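Until the classifier accepts overrides, a rough offline estimate is possible by replaying the rescue-signal count from the error-analysis cell with a threshold dictionary. The sketch below is an approximation of that notebook estimate, not neckenml's actual rescue logic. `estimate_rescue` and the keys marked as hypothetical are names introduced here for illustration; their default values are the constants hard-coded in the error-analysis cell.

```python
from typing import Mapping, Optional, Tuple

# Defaults taken from the notebook's error-analysis cell.
RESCUE_DEFAULTS = {
    'rescue_ternary_strong': 0.65,
    'rescue_ternary_moderate': 0.55,
    'rescue_signals_needed': 3,
    'polska_score_strong': 0.50,    # hypothetical key name
    'polska_score_moderate': 0.35,  # hypothetical key name
    'polska_bpm_min': 95,           # hypothetical key name
    'polska_bpm_max': 115,          # hypothetical key name
}


def estimate_rescue(
    features: Mapping[str, float],
    overrides: Optional[Mapping[str, float]] = None,
) -> Tuple[int, bool]:
    """Return (signal_count, would_rescue) for one track's features.

    Mirrors the notebook's signal estimate; the real classifier may differ.
    """
    t = {**RESCUE_DEFAULTS, **(overrides or {})}
    signals = 0

    # Ternary confidence: strong evidence counts double.
    if features['ternary_confidence'] >= t['rescue_ternary_strong']:
        signals += 2
    elif features['ternary_confidence'] >= t['rescue_ternary_moderate']:
        signals += 1

    # Polska score: strong evidence counts double.
    if features['polska_score'] >= t['polska_score_strong']:
        signals += 2
    elif features['polska_score'] >= t['polska_score_moderate']:
        signals += 1

    # Tempo inside the Polska range adds one signal.
    if t['polska_bpm_min'] <= features['bpm'] <= t['polska_bpm_max']:
        signals += 1

    return signals, signals >= t['rescue_signals_needed']


# Made-up feature values for illustration only.
track = {'ternary_confidence': 0.58, 'polska_score': 0.40, 'bpm': 120.0}
print(estimate_rescue(track))                                  # baseline thresholds
print(estimate_rescue(track, {'rescue_signals_needed': 2}))    # relaxed requirement
```

Applied to each row of `polska_to_polka`, this shows which misclassified tracks a candidate configuration would rescue, before any change is made to the classifier itself.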

## 5. Compare Metrics After Changes

After making threshold changes in the neckenml code, run this section to compare before/after.

In [None]:
# Re-run evaluation after making changes
print("Re-evaluating with updated thresholds...\n")

# Create new evaluator instance to pick up code changes
evaluator_new = ClassificationEvaluator('test_data/test_tracks.yaml')
evaluator_new.load_test_data()
evaluator_new.evaluate_all(verbose=False)

new_metrics = evaluator_new.generate_metrics()
evaluator_new.print_report(new_metrics)

# Save updated results
evaluator_new.save_results('test_data/updated_results.json')

In [None]:
# Compare baseline vs updated metrics
comparison_data = []

for style in baseline_metrics['per_style_accuracy'].keys():
    baseline_acc = baseline_metrics['per_style_accuracy'].get(style, 0)
    new_acc = new_metrics['per_style_accuracy'].get(style, 0)
    diff = new_acc - baseline_acc
    
    comparison_data.append({
        'Style': style,
        'Baseline': baseline_acc,
        'Updated': new_acc,
        'Change': diff,
        'Change_pct': diff * 100
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('Change', ascending=False)

print("\n" + "="*70)
print("BEFORE vs AFTER COMPARISON")
print("="*70)
print(f"\nOverall Accuracy:")
print(f"  Baseline: {baseline_metrics['overall_accuracy']:.1%}")
print(f"  Updated:  {new_metrics['overall_accuracy']:.1%}")
print(f"  Change:   {(new_metrics['overall_accuracy'] - baseline_metrics['overall_accuracy'])*100:+.1f}%")

print("\nPer-Style Accuracy Changes:")
print(comparison_df.to_string(index=False))

# Check for regressions
regressions = comparison_df[comparison_df['Change'] < -0.05]
if len(regressions) > 0:
    print("\n⚠️  REGRESSIONS DETECTED:")
    for _, row in regressions.iterrows():
        print(f"  {row['Style']}: {row['Change_pct']:+.1f}%")
else:
    print("\n✓ No significant regressions")

# Check for improvements
improvements = comparison_df[comparison_df['Change'] > 0.05]
if len(improvements) > 0:
    print("\n✅ IMPROVEMENTS:")
    for _, row in improvements.iterrows():
        print(f"  {row['Style']}: {row['Change_pct']:+.1f}%")

In [None]:
# Visualize comparison
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(comparison_df))
width = 0.35

bars1 = ax.bar(x - width/2, comparison_df['Baseline'], width, label='Baseline', alpha=0.8)
bars2 = ax.bar(x + width/2, comparison_df['Updated'], width, label='Updated', alpha=0.8)

# Color bars based on improvement/regression
for i, change in enumerate(comparison_df['Change']):
    if change > 0.05:
        bars2[i].set_color('green')
    elif change < -0.05:
        bars2[i].set_color('red')

ax.set_ylabel('Accuracy', fontsize=12, weight='bold')
ax.set_xlabel('Dance Style', fontsize=12, weight='bold')
ax.set_title('Classification Accuracy: Baseline vs Updated', fontsize=14, weight='bold')
ax.set_xticks(x)
ax.set_xticklabels(comparison_df['Style'], rotation=45, ha='right')
ax.legend()
ax.set_ylim([0, 1.1])
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('test_data/threshold_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("Comparison plot saved to: test_data/threshold_comparison.png")

## 6. Document Changes

When you find improvements, document them here for tracking.

In [None]:
# Template for documenting threshold changes
change_log_entry = {
    'date': '2025-12-19',
    'threshold': 'rescue_ternary_min',
    'old_value': 0.45,
    'new_value': 0.42,
    'reason': 'Reduce false negatives for subtle Polska tracks',
    'affected_test_cases': ['polska_003'],
    'metrics_before': {
        'polska_accuracy': baseline_metrics['per_style_accuracy'].get('Polska', 0),
        'overall_accuracy': baseline_metrics['overall_accuracy'],
    },
    'metrics_after': {
        'polska_accuracy': new_metrics['per_style_accuracy'].get('Polska', 0),
        'overall_accuracy': new_metrics['overall_accuracy'],
    },
    'regression_check': {
        'polka_accuracy_before': baseline_metrics['per_style_accuracy'].get('Polka', 0),
        'polka_accuracy_after': new_metrics['per_style_accuracy'].get('Polka', 0),
    },
    'status': 'testing'  # testing | deployed | reverted
}

print("Change log entry:")
print(json.dumps(change_log_entry, indent=2))

# Append to known_issues.yaml
# (You would do this manually or with additional code)

## 7. Feature Importance Analysis

Which features are most discriminative for each style?

In [None]:
# Correlation heatmap of features
feature_cols = ['bpm', 'ternary_confidence', 'polska_score', 'hambo_score', 'swing_ratio', 'punchiness']
correlation_matrix = results_df[feature_cols].corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
ax.set_title('Feature Correlation Matrix', fontsize=14, weight='bold', pad=20)
plt.tight_layout()
plt.savefig('test_data/feature_correlation.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Feature importance for distinguishing styles
# Calculate variance ratio (between-class / within-class) for each feature

from scipy.stats import f_oneway

importance_scores = {}

for feature in feature_cols:
    # Group data by style
    groups = [results_df[results_df['true_style'] == style][feature].values 
              for style in results_df['true_style'].unique()]
    
    # Remove empty groups
    groups = [g for g in groups if len(g) > 0]
    
    if len(groups) > 1:
        # One-way ANOVA F-statistic
        f_stat, p_value = f_oneway(*groups)
        importance_scores[feature] = f_stat

# Sort by importance
importance_df = pd.DataFrame([
    {'Feature': feature, 'F-statistic': score}
    for feature, score in importance_scores.items()
]).sort_values('F-statistic', ascending=False)

print("\nFeature Importance (F-statistic from ANOVA):")
print(importance_df.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(importance_df['Feature'], importance_df['F-statistic'], color='steelblue', alpha=0.8)
ax.set_xlabel('F-statistic (Higher = More Discriminative)', fontsize=12, weight='bold')
ax.set_title('Feature Importance for Style Classification', fontsize=14, weight='bold')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('test_data/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n💡 Insight: Features with high F-statistic are most useful for classification.")
print("   Focus threshold tuning on these features for maximum impact.")
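
For reference, the one-way ANOVA F-statistic returned by `f_oneway` is the ratio of between-style variance to within-style variance. With $k$ styles, $n_j$ tracks in style $j$, $N = \sum_j n_j$ tracks in total, style mean $\bar{x}_j$ and overall mean $\bar{x}$ (symbols introduced here for illustration):

$$
F = \frac{\displaystyle \sum_{j=1}^{k} n_j \,(\bar{x}_j - \bar{x})^2 \,/\, (k - 1)}
         {\displaystyle \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 \,/\, (N - k)}
$$

A large $F$ means the style means are far apart relative to the spread inside each style. It ranks features by overall separation across all styles, so a feature that matters only for one pair (for example Polska vs Polka) can still score low.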

## Summary

Use this notebook to:
1. ✓ Run baseline evaluation
2. ✓ Analyze feature distributions
3. ✓ Identify error patterns
4. ⚙️ Experiment with threshold changes
5. ✓ Compare before/after metrics
6. 📝 Document improvements

**Next steps:**
- Add more test tracks to increase confidence
- Focus on critical confusions (Polska/Polka)
- Consider ML model retraining with user feedback data
- Implement automatic threshold optimization (grid search)
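
The grid-search idea in the last item could be sketched as follows. This is a minimal illustration, assuming a scoring callable that maps a threshold dictionary to an accuracy; `grid_search` and `toy_score` are hypothetical names, and the toy data is made up. In the notebook, the scoring function would have to wrap a re-evaluation with overridden thresholds, which the classifier does not support yet.

```python
from itertools import product
from typing import Callable, Dict, Mapping, Sequence, Tuple

Thresholds = Dict[str, float]


def grid_search(
    score_fn: Callable[[Thresholds], float],
    base: Mapping[str, float],
    grid: Mapping[str, Sequence[float]],
) -> Tuple[Thresholds, float]:
    """Try every combination in `grid` on top of `base`; return the best config and score."""
    keys = list(grid)
    best_cfg, best_score = dict(base), score_fn(dict(base))
    for combo in product(*(grid[k] for k in keys)):
        candidate = {**base, **dict(zip(keys, combo))}
        score = score_fn(candidate)
        if score > best_score:  # strict '>' keeps the baseline on ties
            best_cfg, best_score = candidate, score
    return best_cfg, best_score


# Toy labelled data (assumed values): (swing_ratio, is_schottis)
TOY_TRACKS = [(1.05, False), (1.10, False), (1.18, False),
              (1.22, True), (1.30, True), (1.40, True)]


def toy_score(thresholds: Thresholds) -> float:
    """Accuracy of the rule 'swing_ratio >= schottis_swing_min -> Schottis' on the toy data."""
    cut = thresholds['schottis_swing_min']
    hits = sum((ratio >= cut) == label for ratio, label in TOY_TRACKS)
    return hits / len(TOY_TRACKS)


best, score = grid_search(
    toy_score,
    base={'schottis_swing_min': 1.25},
    grid={'schottis_swing_min': [1.15, 1.20, 1.25, 1.30]},
)
print(best, f"{score:.0%}")
```

Because the search optimizes against the same tracks used for evaluation, a held-out subset of test tracks is advisable before adopting any value it finds.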