# NBA Career Longevity Modeling

This notebook demonstrates survival analysis techniques to model NBA player career longevity using the Econometric Suite.

## Objectives

1. Generate and prepare player career data with duration and events
2. Fit Cox Proportional Hazards model
3. Compare parametric survival models (Weibull, Log-Normal, Log-Logistic)
4. Create Kaplan-Meier survival curves by draft round
5. Interpret hazard ratios and survival probabilities
6. Use EconometricSuite for automatic model selection

## Use Cases

- **Draft Analysis**: How does draft position affect career length?
- **Position Effects**: Do guards last longer than centers?
- **College vs. International**: Compare career longevity by origin
- **Injury Impact**: How do early injuries affect career duration?
- **Contract Planning**: Estimate probability of lasting N years

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Import econometric modules
from mcp_server.survival_analysis import SurvivalAnalyzer
from mcp_server.econometric_suite import EconometricSuite

# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")
%matplotlib inline

## 1. Generate Synthetic Career Data

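We'll create a dataset of 500 synthetic players with:
- Draft position (1-60, with undrafted players coded as 61)
- Position (PG, SG, SF, PF, C)
- Origin (College vs. International)
- Height in inches
- Early injury flag (injury in the first 2 years)
- Career years (duration)
- Retirement status (event: 1 = retired, 0 = still active, i.e. censored)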
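
The event coding above follows the usual right-censoring convention. As a minimal sketch, this is how an observed duration and event flag could be derived from a true career length and a follow-up window. The function `observe_career` and its parameter names are hypothetical illustrations, not part of the Econometric Suite.

```python
def observe_career(true_years, follow_up_years):
    """Return (observed_duration, event) under right-censoring.

    event = 1 when retirement happened inside the follow-up window,
    event = 0 when the player was still active at the end of it.
    """
    if true_years <= follow_up_years:
        return true_years, 1       # retirement observed
    return follow_up_years, 0      # censored: only a lower bound is known


print(observe_career(6.5, 10.0))   # (6.5, 1)
print(observe_career(12.0, 4.0))   # (4.0, 0)
```

For a censored player the recorded duration is only a lower bound on the true career length, which is why the event flag must travel with the duration into every model.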

In [None]:
def generate_career_data(n_players=500):
    """
    Generate synthetic NBA career data for survival analysis.
    
    Career duration is influenced by:
    - Draft position (earlier = longer careers)
    - Position (guards slightly longer)
    - Origin (college vs international)
    - Early injury flag
    """
    np.random.seed(42)
    
    data = []
    
    for i in range(n_players):
        # Draft position (1-60; 61 = undrafted)
        draft_pick = np.random.choice(
            list(range(1, 61)) + [61]*10  # 61 appears 10 times, so ~14% are undrafted
        )
        draft_round = 1 if draft_pick <= 30 else (2 if draft_pick <= 60 else 0)
        
        # Position
        position = np.random.choice(['PG', 'SG', 'SF', 'PF', 'C'])
        
        # Origin
        origin = np.random.choice(['College', 'International'], p=[0.75, 0.25])
        
        # Early injury (first 2 years)
        early_injury = np.random.choice([0, 1], p=[0.8, 0.2])
        
        # Height (inches) - taller = slightly shorter careers
        if position in ['PG', 'SG']:
            height = np.random.normal(75, 2)
        elif position == 'SF':
            height = np.random.normal(78, 2)
        else:  # PF, C
            height = np.random.normal(81, 2)
        
        # Generate career duration with realistic effects
        # Base duration
        base_duration = 8.0
        
        # Draft position effect: 0 for pick 1, falling linearly to -4.8 years for undrafted (61)
        draft_effect = -0.08 * (draft_pick - 1)  # Each pick costs 0.08 years (~1 month)
        
        # Position effect (guards +0.5 years)
        position_effect = 0.5 if position in ['PG', 'SG'] else 0
        
        # Origin effect (international -0.3 years on average)
        origin_effect = -0.3 if origin == 'International' else 0
        
        # Injury effect (-1.5 years if early injury)
        injury_effect = -1.5 if early_injury == 1 else 0
        
        # Height effect (taller players have slightly shorter careers)
        height_effect = -0.05 * (height - 78)  # -0.05 years per inch above 6'6"; shorter players gain the same
        
        # Random component
        random_effect = np.random.gamma(shape=2, scale=1.5)
        
        # Total duration (minimum 1 year)
        duration = max(1.0, base_duration + draft_effect + position_effect + 
                      origin_effect + injury_effect + height_effect + random_effect)
        
        # Censoring (some players still active)
        # Short careers are more likely to be flagged as censored (recently drafted players).
        # Simplification: only the event flag changes; the duration itself is not truncated.
        censoring_prob = 0.15 if duration < 5 else 0.05
        retired = 1 if np.random.random() > censoring_prob else 0
        
        data.append({
            'player_id': f'P{i+1:04d}',
            'draft_pick': draft_pick,
            'draft_round': draft_round,
            'position': position,
            'origin': origin,
            'height_inches': round(height, 1),
            'early_injury': early_injury,
            'career_years': round(duration, 2),
            'retired': retired
        })
    
    return pd.DataFrame(data)

# Generate data
career_data = generate_career_data(n_players=500)

print(f"Generated {len(career_data)} player careers")
print(f"Retired: {career_data['retired'].sum()} ({career_data['retired'].mean()*100:.1f}%)")
print(f"Still Active (Censored): {(~career_data['retired'].astype(bool)).sum()}")
print("\nFirst 5 players:")
career_data.head()

In [None]:
# Summary statistics
print("=" * 60)
print("CAREER DURATION SUMMARY")
print("=" * 60)
print(career_data['career_years'].describe())

print("\n" + "=" * 60)
print("DURATION BY DRAFT ROUND")
print("=" * 60)
print(career_data.groupby('draft_round')['career_years'].agg(['count', 'mean', 'std', 'min', 'max']))

print("\n" + "=" * 60)
print("DURATION BY POSITION")
print("=" * 60)
print(career_data.groupby('position')['career_years'].agg(['count', 'mean', 'std']))

# Visualize distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(career_data['career_years'], bins=30, edgecolor='black', alpha=0.7)
axes[0].axvline(career_data['career_years'].mean(), color='red', 
               linestyle='--', linewidth=2, label=f'Mean: {career_data["career_years"].mean():.1f} years')
axes[0].axvline(career_data['career_years'].median(), color='green', 
               linestyle='--', linewidth=2, label=f'Median: {career_data["career_years"].median():.1f} years')
axes[0].set_xlabel('Career Years', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Career Lengths', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Box plot by draft round
career_data.boxplot(column='career_years', by='draft_round', ax=axes[1])
axes[1].set_xlabel('Draft Round (0=Undrafted)', fontsize=12)
axes[1].set_ylabel('Career Years', fontsize=12)
axes[1].set_title('Career Length by Draft Round', fontsize=13, fontweight='bold')
plt.suptitle('')  # Remove auto-generated title

plt.tight_layout()
plt.show()

## 2. Kaplan-Meier Survival Curves

Non-parametric estimation of survival functions by draft round.
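
To make the estimator concrete, here is a minimal sketch of the Kaplan-Meier product-limit formula in plain Python. It is an illustration of the technique only; how `SurvivalAnalyzer.kaplan_meier` computes its result internally is not shown in this notebook, and the function and toy data below are hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Everyone whose observed duration reaches t is still at risk at t.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]   # 0 = censored (still active)
for t, s in kaplan_meier(durations, events):
    print(f"t={t}: S(t)={s:.3f}")
```

Censored players never trigger a drop in the curve, but they stay in the at-risk count until their last observed year, which is how the estimator uses partial information.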

In [None]:
# Initialize SurvivalAnalyzer
analyzer = SurvivalAnalyzer(
    data=career_data,
    duration_col='career_years',
    event_col='retired'
)

# Fit Kaplan-Meier for each draft round
print("Fitting Kaplan-Meier Survival Curves...\n")

km_overall = analyzer.kaplan_meier()
km_by_round = analyzer.kaplan_meier_by_group(group_col='draft_round')

print("Kaplan-Meier Results by Draft Round:")
print(f"Groups: {list(km_by_round['groups'].keys())}")
print(f"\nMedian Survival Times:")
for group, km_fit in km_by_round['groups'].items():
    round_name = "Undrafted" if group == 0 else f"Round {group}"
    print(f"  {round_name}: {km_fit['median_survival']:.2f} years")

In [None]:
# Plot Kaplan-Meier curves
fig, ax = plt.subplots(figsize=(12, 7))

colors = ['red', 'blue', 'green']
for i, (group, km_fit) in enumerate(km_by_round['groups'].items()):
    round_name = "Undrafted" if group == 0 else f"Round {group}"
    ax.step(km_fit['time'], km_fit['survival_function'], 
           where='post', label=round_name, linewidth=2.5, color=colors[i])
    
    # Add confidence intervals
    ax.fill_between(km_fit['time'], 
                    km_fit['ci_lower'], 
                    km_fit['ci_upper'], 
                    alpha=0.2, color=colors[i], step='post')

ax.set_xlabel('Years in NBA', fontsize=13)
ax.set_ylabel('Survival Probability', fontsize=13)
ax.set_title('Kaplan-Meier Survival Curves by Draft Round', fontsize=15, fontweight='bold')
ax.legend(fontsize=12, loc='best')
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 1.05])
plt.tight_layout()
plt.show()

# Log-rank test for difference between groups
print("\nLog-Rank Test:")
print(f"Chi-square statistic: {km_by_round['logrank_test']['chi2']:.3f}")
print(f"P-value: {km_by_round['logrank_test']['p_value']:.4f}")
print(f"Conclusion: {'Significant difference' if km_by_round['logrank_test']['p_value'] < 0.05 else 'No significant difference'} between groups")
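
The log-rank statistic reported above compares observed and expected event counts at every event time. The sketch below shows the two-group version in plain Python. The notebook compares three draft-round groups, so this is a simplified illustration, and the function name and toy data are hypothetical.

```python
import math


def logrank_two_groups(durations, events, groups):
    """Two-group log-rank test; groups holds 0/1 labels. Returns (chi2, p_value).

    Assumes both groups are at risk at some event time (non-zero variance).
    """
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for d in durations if d >= t)  # at risk, both groups
        n1 = sum(1 for d, g in zip(durations, groups) if d >= t and g == 1)
        d_all = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        d1 = sum(1 for d, e, g in zip(durations, events, groups)
                 if d == t and e == 1 and g == 1)
        obs_minus_exp += d1 - d_all * n1 / n
        if n > 1:
            # Hypergeometric variance of the group-1 event count at time t
            variance += d_all * (n1 / n) * (1 - n1 / n) * (n - d_all) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail probability, 1 df
    return chi2, p_value


# Group 1 retires early, group 0 retires late (all events observed)
chi2, p = logrank_two_groups([1, 2, 3, 4, 5, 6], [1] * 6, [1, 1, 1, 0, 0, 0])
print(f"chi2={chi2:.3f}, p={p:.4f}")
```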

## 3. Cox Proportional Hazards Model

Semi-parametric model to estimate the effect of covariates on career longevity.
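
The Cox model writes the hazard as h(t | x) = h0(t) * exp(beta * x) and estimates beta by maximizing a partial likelihood that never needs h0(t). The sketch below evaluates that partial log-likelihood for a single covariate. It is an illustration under simple assumptions (one covariate, Breslow handling of ties), and the function name and toy data are hypothetical.

```python
import math


def cox_partial_loglik(beta, durations, events, x):
    """Partial log-likelihood for one covariate (Breslow handling of ties)."""
    loglik = 0.0
    for t_i, e_i, x_i in zip(durations, events, x):
        if e_i != 1:
            continue  # censored rows only contribute through risk sets
        # Risk set: everyone still under observation at time t_i
        risk_set = sum(math.exp(beta * x_j)
                       for t_j, x_j in zip(durations, x) if t_j >= t_i)
        loglik += beta * x_i - math.log(risk_set)
    return loglik


durations = [1, 2, 3]
events = [1, 1, 1]
x = [1, 0, 0]  # the player with x = 1 retires first
for beta in (0.0, 1.0):
    print(f"beta={beta}: loglik={cox_partial_loglik(beta, durations, events, x):.4f}")
print(f"Hazard ratio at beta=1: {math.exp(1.0):.3f}")
```

A positive beta fits these toy data better than beta = 0 because the player with the larger covariate value retires first; exp(beta) is the hazard ratio for a one-unit increase in the covariate.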

In [None]:
# Prepare covariates
# Create 0/1 indicator variables for position group (guard) and origin (international)
cox_data = career_data.copy()
cox_data['is_guard'] = (cox_data['position'].isin(['PG', 'SG'])).astype(int)
cox_data['is_international'] = (cox_data['origin'] == 'International').astype(int)

# Fit Cox model
print("Fitting Cox Proportional Hazards Model...\n")

cox_result = analyzer.cox_proportional_hazards(
    formula='career_years ~ draft_pick + is_guard + is_international + early_injury + height_inches',
    data=cox_data
)

print("=" * 70)
print("COX PROPORTIONAL HAZARDS MODEL")
print("=" * 70)
print(cox_result['summary'])

print("\n" + "=" * 70)
print("HAZARD RATIOS (HR)")
print("=" * 70)
print("Interpretation: HR > 1 means higher risk of retirement (shorter career)")
print("                HR < 1 means lower risk of retirement (longer career)\n")
for coef, hr in cox_result['hazard_ratios'].items():
    print(f"{coef:25s}: HR = {hr:.3f}")

In [None]:
# Visualize hazard ratios
hrs = pd.DataFrame(cox_result['hazard_ratios'], index=['Hazard Ratio']).T
hrs = hrs.sort_values('Hazard Ratio')

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['green' if x < 1 else 'red' for x in hrs['Hazard Ratio']]
hrs.plot(kind='barh', ax=ax, color=colors, legend=False)
ax.axvline(x=1, color='black', linestyle='--', linewidth=2, label='No Effect (HR=1)')
ax.set_xlabel('Hazard Ratio', fontsize=12)
ax.set_ylabel('Covariate', fontsize=12)
ax.set_title('Cox Model Hazard Ratios', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\nExpected patterns (confirm against the hazard ratios above):")
print("- Lower draft pick number (earlier selection) → Lower hazard → Longer career")
print("- Guards → Lower hazard → Longer career than forwards and centers")
print("- Early injury → Higher hazard → Shorter career")

## 4. Parametric Survival Models

Compare Weibull, Log-Normal, and Log-Logistic models.
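
For reference, the three distributions differ only in the shape of their survival function S(t). The sketch below writes out the standard closed forms without covariates. The parameter names and example values are illustrative assumptions and do not reflect how `parametric_survival` parameterizes its fits.

```python
import math


def weibull_survival(t, lam, rho):
    """S(t) = exp(-(t / lam) ** rho); lam is the scale, rho the shape."""
    return math.exp(-((t / lam) ** rho))


def lognormal_survival(t, mu, sigma):
    """S(t) = 1 - Phi((ln t - mu) / sigma), written with erfc."""
    return 0.5 * math.erfc((math.log(t) - mu) / (sigma * math.sqrt(2)))


def loglogistic_survival(t, alpha, beta):
    """S(t) = 1 / (1 + (t / alpha) ** beta); alpha is the median."""
    return 1.0 / (1.0 + (t / alpha) ** beta)


# Illustrative parameters only (not fitted values)
for years in (3, 8, 15):
    print(years,
          round(weibull_survival(years, lam=8.0, rho=1.5), 3),
          round(lognormal_survival(years, mu=math.log(8.0), sigma=0.5), 3),
          round(loglogistic_survival(years, alpha=8.0, beta=3.0), 3))
```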

In [None]:
# Fit parametric models
print("Fitting Parametric Survival Models...\n")

models = {}
for distribution in ['weibull', 'lognormal', 'loglogistic']:
    print(f"Fitting {distribution.capitalize()} model...")
    result = analyzer.parametric_survival(
        formula='career_years ~ draft_pick + is_guard + early_injury',
        data=cox_data,
        distribution=distribution
    )
    models[distribution] = result
    print(f"  AIC: {result['aic']:.2f}")
    print(f"  Log-Likelihood: {result['log_likelihood']:.2f}\n")

# Compare models
print("=" * 60)
print("MODEL COMPARISON")
print("=" * 60)
comparison = pd.DataFrame({
    'AIC': {k: v['aic'] for k, v in models.items()},
    'BIC': {k: v['bic'] for k, v in models.items()},
    'Log-Likelihood': {k: v['log_likelihood'] for k, v in models.items()}
})
comparison['Best_AIC'] = comparison['AIC'] == comparison['AIC'].min()
print(comparison.round(2))

best_model = comparison['AIC'].idxmin()
print(f"\nBest Model (by AIC): {best_model.capitalize()}")
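
The AIC and BIC columns above follow the standard definitions, sketched here in plain Python. The helper names are hypothetical, and using the number of rows as `n_obs` is an assumption: some survival packages use the number of observed events instead.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: penalizes parameters more as n grows."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Illustrative numbers only (not output from the models above)
print(aic(-100.0, 3))                  # 206.0
print(round(bic(-100.0, 3, 500), 2))   # 218.64
```

Because all three parametric models here use the same covariates and a comparable number of distribution parameters, ranking by AIC is close to ranking by log-likelihood.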

In [None]:
# Plot survival functions from parametric models
fig, ax = plt.subplots(figsize=(12, 7))

time_range = np.linspace(0, 20, 100)
colors_map = {'weibull': 'blue', 'lognormal': 'green', 'loglogistic': 'red'}

for model_name, result in models.items():
    # Use model to predict survival function
    survival_func = result['survival_function'](time_range)
    ax.plot(time_range, survival_func, 
           label=f'{model_name.capitalize()} (AIC={result["aic"]:.1f})',
           linewidth=2.5, color=colors_map[model_name])

ax.set_xlabel('Years in NBA', fontsize=13)
ax.set_ylabel('Survival Probability', fontsize=13)
ax.set_title('Parametric Survival Models Comparison', fontsize=15, fontweight='bold')
ax.legend(fontsize=11, loc='best')
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 1.05])
plt.tight_layout()
plt.show()

## 5. Survival Predictions

Predict career longevity for specific player profiles.
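
Under a Cox model, a profile's survival probability is the baseline survival raised to the power of its relative hazard: S(t | x) = S0(t) ** exp(beta . x). The sketch below applies that formula at one time point. The coefficients and baseline value are made-up illustrations, and whether `predict_survival` centers covariates before applying the formula is not known from this notebook.

```python
import math


def cox_survival(baseline_survival, coefs, covariates):
    """S(t | x) = S0(t) ** exp(sum(beta_k * x_k)) for a single time point."""
    linear_predictor = sum(coefs[name] * covariates[name] for name in coefs)
    return baseline_survival ** math.exp(linear_predictor)


# Hypothetical coefficients (log hazard ratios), not fitted values
coefs = {'draft_pick': 0.02, 'early_injury': 0.5}
baseline_at_5_years = 0.80  # assumed S0(5) for a player with all covariates at 0

healthy_top_pick = {'draft_pick': 1, 'early_injury': 0}
injured_late_pick = {'draft_pick': 45, 'early_injury': 1}
print(round(cox_survival(baseline_at_5_years, coefs, healthy_top_pick), 3))
print(round(cox_survival(baseline_at_5_years, coefs, injured_late_pick), 3))
```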

In [None]:
# Define player profiles
profiles = pd.DataFrame([
    {
        'profile': 'Lottery Pick Guard',
        'draft_pick': 5,
        'is_guard': 1,
        'is_international': 0,
        'early_injury': 0,
        'height_inches': 75
    },
    {
        'profile': 'Late 1st Round Big',
        'draft_pick': 28,
        'is_guard': 0,
        'is_international': 0,
        'early_injury': 0,
        'height_inches': 82
    },
    {
        'profile': '2nd Round w/ Injury',
        'draft_pick': 45,
        'is_guard': 1,
        'is_international': 0,
        'early_injury': 1,
        'height_inches': 76
    },
    {
        'profile': 'Undrafted International',
        'draft_pick': 61,
        'is_guard': 0,
        'is_international': 1,
        'early_injury': 0,
        'height_inches': 80
    }
])

# Predict survival probabilities at specific time points
time_points = [3, 5, 10, 15]

print("=" * 80)
print("SURVIVAL PREDICTIONS FOR PLAYER PROFILES")
print("=" * 80)
print("Probability of staying in NBA for at least N years:\n")

for _, profile in profiles.iterrows():
    print(f"\n{profile['profile']}:")
    print("-" * 50)
    
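    # Use Cox model for predictions
    # iterrows() yields an object-dtype Series here, so cast the covariates back to numeric
    profile_df = profile.drop('profile').to_frame().T.astype(float)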
    predictions = analyzer.predict_survival(
        model=cox_result,
        new_data=profile_df,
        times=time_points
    )
    
    for t, prob in zip(time_points, predictions['survival_probs'][0]):
        print(f"  {t} years: {prob*100:.1f}%")

## 6. EconometricSuite Auto-Analysis

Use the Suite to automatically detect survival data and select the best model.

In [None]:
# Initialize EconometricSuite with survival data
suite = EconometricSuite(
    data=cox_data,
    target='career_years'
)

print("EconometricSuite Initialized")
print(f"Data structure detected: {suite.data_structure}")
print(f"Recommended methods: {suite.recommended_methods}")

In [None]:
# Run survival analysis through Suite
suite_result = suite.survival_analysis(
    duration_col='career_years',
    event_col='retired',
    method='auto',  # Auto-select best method
    covariates=['draft_pick', 'is_guard', 'early_injury']
)

print("=" * 60)
print("SUITE AUTO-ANALYSIS RESULT")
print("=" * 60)
print(suite_result.summary())

In [None]:
# Compare multiple survival methods
print("Comparing multiple survival methods via Suite...\n")

comparison = suite.compare_methods(
    methods=[
        {
            'category': 'survival',
            'method': 'cox',
            'params': {
                'formula': 'career_years ~ draft_pick + is_guard + early_injury'
            }
        },
        {
            'category': 'survival',
            'method': 'weibull',
            'params': {
                'formula': 'career_years ~ draft_pick + is_guard + early_injury'
            }
        },
        {
            'category': 'survival',
            'method': 'lognormal',
            'params': {
                'formula': 'career_years ~ draft_pick + is_guard + early_injury'
            }
        }
    ],
    metric='concordance_index'  # C-index for survival models
)

print("\nMethod Comparison:")
print(comparison)

## 7. Summary and Insights

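### Key Findings

These findings describe the synthetic data from Section 1, so they reflect the effects built into `generate_career_data` rather than real NBA outcomes.

1. **Draft Position Matters**:
   - The generator subtracts 0.08 years per draft slot, so lottery picks have the longest expected careers (roughly 10 years)
   - Second round picks average roughly 7 years
   - Undrafted players have the shortest expected careers (roughly 6 years)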

2. **Position Effects**:
   - Guards (PG, SG) tend to have longer careers
   - Centers and power forwards have higher retirement hazards
   - Likely due to physical demands and injury risks

3. **Early Injury Impact**:
   - Injuries in the first 2 years predict shorter careers (the generator subtracts 1.5 years)
   - Expect a hazard ratio above 1; an HR of 1.5, for example, means a 50% higher retirement hazard at any given time

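4. **Model Selection**:
   - Cox model is a flexible semi-parametric baseline that leaves the baseline hazard unspecified
   - Weibull is a natural parametric choice when the hazard rises (or falls) monotonically over time
   - Log-normal and log-logistic fit better when the hazard rises to a peak and then declines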
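
The hazard shapes mentioned in the model selection notes can be checked directly from the closed-form hazard functions. This is a minimal sketch with illustrative parameter values; the function names are hypothetical.

```python
def weibull_hazard(t, lam, rho):
    """h(t) = (rho / lam) * (t / lam) ** (rho - 1); monotonic in t."""
    return (rho / lam) * (t / lam) ** (rho - 1)


def loglogistic_hazard(t, alpha, beta):
    """h(t) rises to a peak and then declines when beta > 1."""
    return (beta / alpha) * (t / alpha) ** (beta - 1) / (1 + (t / alpha) ** beta)


for years in (1, 5, 20):
    print(years,
          round(weibull_hazard(years, lam=8.0, rho=1.5), 4),
          round(loglogistic_hazard(years, alpha=5.0, beta=2.0), 4))
```

With `rho > 1` the Weibull hazard keeps increasing, while the log-logistic hazard with `beta = 2` and `alpha = 5` is highest around year 5 and lower on either side.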

### NBA Applications

- **Drafting**: Quantify career length expectations by pick
- **Contracts**: Structure deals based on survival probabilities
- **Injury Management**: Prioritize early career health
- **Roster Planning**: Account for position-specific longevity
- **Player Development**: Identify high-risk players early

### Statistical Notes

- **Censoring**: Active players are treated as right-censored; they count as at risk up to their last observed season
- **Proportional Hazards**: Cox model assumes each hazard ratio is constant over time
- **Parametric Assumptions**: Weibull assumes a monotonic hazard
- **Concordance Index**: Measures predictive discrimination (like AUC); 0.5 is random ordering, 1.0 is perfect
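
As an illustration of the concordance index, the sketch below computes Harrell's C by brute force over all comparable pairs. It is a plain-Python illustration of the definition, and the function name and toy data are hypothetical; how the Suite computes `concordance_index` is not shown in this notebook.

```python
def concordance_index(durations, events, risk_scores):
    """Harrell's C: share of comparable pairs ordered correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i has an observed event strictly before j's time.
            if events[i] == 1 and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # higher risk retired first: correct order
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    return concordant / comparable


durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(durations, events, [4, 3, 2, 1]))  # 1.0
print(concordance_index(durations, events, [1, 2, 3, 4]))  # 0.0
```

Pairs where the earlier time is censored are skipped, because it is unknown which of the two players actually retired first.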

## Next Steps

- Explore **Notebook 3**: Coaching Change Impact with Causal Inference
- Try with real NBA career data
- Add time-varying covariates (performance decline)
- Fit competing risks models (retirement vs. injury)
- Incorporate frailty models for unobserved heterogeneity