# NBA Fan Engagement Analysis
## Predicting Brooklyn Nets Home Game Attendance

**Author**: Siya Vyas  
**Project**: Machine Learning Analysis for NBA Analytics

---

### Project Overview

This project analyzes Brooklyn Nets home game data (2022-2025 seasons) to predict attendance levels and identify key drivers of fan engagement. The analysis provides data-driven recommendations for pricing, marketing, and scheduling strategies.

**Key Objectives:**
1. Identify factors that drive attendance at Brooklyn Nets home games
2. Build classification models to predict attendance tiers (Low/Medium/High)
3. Generate actionable business insights for sports management

**Dataset:**
- 123 home games from the 2022-23, 2023-24, and 2024-25 seasons
- Features include temporal patterns, opponent characteristics, and team performance


## 1. Data Loading and Overview


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

# Load data
df = pd.read_csv('../data/raw/nets_home_games_raw.csv')
df['Date'] = pd.to_datetime(df['Date'])

print("Dataset Overview:")
print(f"Total games: {len(df)}")
print(f"Date range: {df['Date'].min().date()} to {df['Date'].max().date()}")
print(f"Seasons: {', '.join(df['season'].unique())}")
print(f"\nColumns: {len(df.columns)}")
print(f"\nFirst few rows:")
df.head()


In [None]:
# Basic statistics
print("Attendance Statistics:")
print(df['attendance'].describe())
print(f"\nMissing values: {df['attendance'].isna().sum()}")
print(f"\nBarclays Center capacity: 17,732")
print(f"Average attendance: {df['attendance'].mean():,.0f}")
print(f"Capacity utilization: {(df['attendance'].mean() / 17732) * 100:.1f}%")


## 2. Exploratory Data Analysis

### 2.1 Attendance Distribution


In [None]:
# Attendance distribution and time series
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Distribution
axes[0].hist(df['attendance'].dropna(), bins=20, color='#00A693', alpha=0.7, edgecolor='black')
axes[0].axvline(df['attendance'].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {df["attendance"].mean():,.0f}')
axes[0].axvline(17732, color='orange', linestyle='--', linewidth=2, label='Capacity: 17,732')
axes[0].set_xlabel('Attendance', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Attendance Distribution', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Time series
axes[1].plot(df['Date'], df['attendance'], marker='o', markersize=4, alpha=0.6, color='#00A693')
axes[1].axhline(df['attendance'].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {df["attendance"].mean():,.0f}')
axes[1].set_xlabel('Date', fontsize=12)
axes[1].set_ylabel('Attendance', fontsize=12)
axes[1].set_title('Attendance Over Time', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


### 2.2 Temporal Patterns


In [None]:
# Weekend vs Weekday
weekend_avg = df[df['is_weekend'] == 1]['attendance'].mean()
weekday_avg = df[df['is_weekend'] == 0]['attendance'].mean()
weekend_lift = ((weekend_avg - weekday_avg) / weekday_avg) * 100

print(f"Weekend Average Attendance: {weekend_avg:,.0f}")
print(f"Weekday Average Attendance: {weekday_avg:,.0f}")
print(f"Weekend Lift: {weekend_lift:.1f}%")

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
weekend_data = [df[df['is_weekend'] == 0]['attendance'].dropna(), 
                df[df['is_weekend'] == 1]['attendance'].dropna()]
bp = ax.boxplot(weekend_data, labels=['Weekday', 'Weekend'], patch_artist=True, widths=0.6)

for patch, color in zip(bp['boxes'], ['#444444', '#00A693']):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

ax.set_ylabel('Attendance', fontsize=12)
ax.set_title('Weekend vs Weekday Attendance', fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)

# Add mean markers
for i, data in enumerate(weekend_data, 1):
    mean_val = data.mean()
    ax.plot(i, mean_val, marker='D', color='red', markersize=10, zorder=5)
    ax.text(i + 0.2, mean_val, f'{mean_val:,.0f}', fontsize=10, va='center', fontweight='bold')

plt.tight_layout()
plt.show()


In [None]:
# Day of week analysis
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_stats = df.groupby('day_name')['attendance'].agg(['mean', 'count']).reindex(day_order)

fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.bar(day_stats.index, day_stats['mean'], color='#00A693', alpha=0.8, edgecolor='black')
ax.set_ylabel('Average Attendance', fontsize=12)
ax.set_xlabel('Day of Week', fontsize=12)
ax.set_title('Average Attendance by Day of Week', fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)
plt.xticks(rotation=45)

# Add value labels (skip weekdays with no home games, which are NaN after reindex)
for bar, count in zip(bars, day_stats['count']):
    height = bar.get_height()
    if np.isnan(height):
        continue
    ax.text(bar.get_x() + bar.get_width()/2., height,
           f'{height:,.0f}\n(n={int(count)})',
           ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()


### 2.3 Opponent Impact Analysis


In [None]:
# Star opponent impact
star_avg = df[df['is_star_opponent'] == 1]['attendance'].mean()
non_star_avg = df[df['is_star_opponent'] == 0]['attendance'].mean()
star_lift = ((star_avg - non_star_avg) / non_star_avg) * 100

print(f"Star Opponent Average: {star_avg:,.0f}")
print(f"Regular Opponent Average: {non_star_avg:,.0f}")
print(f"Star Opponent Lift: {star_lift:.1f}%")

# Visualize opponent impacts
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Star opponents
star_data = [df[df['is_star_opponent'] == 0]['attendance'].dropna(),
             df[df['is_star_opponent'] == 1]['attendance'].dropna()]
bp1 = axes[0].boxplot(star_data, labels=['Regular', 'Star'], patch_artist=True)
for patch, color in zip(bp1['boxes'], ['#444444', '#FFA500']):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
axes[0].set_title('Star Opponent Impact', fontweight='bold')
axes[0].set_ylabel('Attendance')
axes[0].grid(axis='y', alpha=0.3)

# Rival opponents
rival_data = [df[df['is_rival'] == 0]['attendance'].dropna(),
              df[df['is_rival'] == 1]['attendance'].dropna()]
bp2 = axes[1].boxplot(rival_data, labels=['Regular', 'Rival'], patch_artist=True)
for patch, color in zip(bp2['boxes'], ['#444444', '#DC143C']):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
axes[1].set_title('Rival Opponent Impact', fontweight='bold')
axes[1].set_ylabel('Attendance')
axes[1].grid(axis='y', alpha=0.3)

# Large market
market_data = [df[df['is_large_market'] == 0]['attendance'].dropna(),
               df[df['is_large_market'] == 1]['attendance'].dropna()]
bp3 = axes[2].boxplot(market_data, labels=['Regular', 'Large Market'], patch_artist=True)
for patch, color in zip(bp3['boxes'], ['#444444', '#4169E1']):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
axes[2].set_title('Large Market Impact', fontweight='bold')
axes[2].set_ylabel('Attendance')
axes[2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()


## 3. Feature Engineering

### 3.1 Feature Categories

Based on EDA insights, we created 20 features across 4 categories:


In [None]:
# Load processed data to show features
X_train = pd.read_csv('../data/processed/X_train.csv')

print("Feature Categories:")
print(f"\n1. Core Features (7):")
core = ['day_of_week', 'is_weekend', 'month', 'is_holiday_week', 
        'is_star_opponent', 'is_rival', 'is_large_market']
for f in core:
    if f in X_train.columns:
        print(f"   - {f}")

print(f"\n2. Interaction Features (5):")
interaction = ['weekend_star', 'weekend_rival', 'holiday_star', 'star_rival', 'market_star']
for f in interaction:
    if f in X_train.columns:
        print(f"   - {f}")

print(f"\n3. Temporal Features (5):")
temporal = ['is_fr_sat', 'is_monday', 'is_early_season', 'is_mid_season', 'is_late_season']
for f in temporal:
    if f in X_train.columns:
        print(f"   - {f}")

print(f"\n4. Performance Features (3):")
performance = ['is_above_500', 'is_last_5_above_500', 'is_on_win_streak']
for f in performance:
    if f in X_train.columns:
        print(f"   - {f}")

print(f"\nTotal Features: {len(X_train.columns)}")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(pd.read_csv('../data/processed/X_test.csv'))}")
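
The interaction features are loaded pre-built from `X_train.csv`, so their construction is not shown in this notebook. As a minimal sketch, assuming each one is simply the product of two binary core flags (an assumption; the project's actual definitions may differ), they could be derived as follows. The helper `add_interactions`, the pairings in `INTERACTION_PAIRS`, and the sample rows are hypothetical.

```python
import pandas as pd

# Assumed pairings: each interaction is the logical AND of two 0/1 flags.
INTERACTION_PAIRS = {
    "weekend_star": ("is_weekend", "is_star_opponent"),
    "weekend_rival": ("is_weekend", "is_rival"),
    "holiday_star": ("is_holiday_week", "is_star_opponent"),
    "star_rival": ("is_star_opponent", "is_rival"),
    "market_star": ("is_large_market", "is_star_opponent"),
}

def add_interactions(features: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with one product column per assumed flag pair."""
    out = features.copy()
    for name, (left, right) in INTERACTION_PAIRS.items():
        out[name] = out[left] * out[right]
    return out

# Hypothetical rows, not taken from the real dataset.
sample = pd.DataFrame({
    "is_weekend": [1, 0, 1],
    "is_star_opponent": [1, 1, 0],
    "is_rival": [0, 1, 1],
    "is_holiday_week": [0, 1, 0],
    "is_large_market": [1, 0, 0],
})
with_interactions = add_interactions(sample)
print(with_interactions[list(INTERACTION_PAIRS)])
```

Multiplying 0/1 flags keeps each interaction binary, which makes its coefficient in a linear model readable as the extra effect when both conditions hold.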


### 3.2 Attendance Tier Creation

We created three attendance tiers (Low, Medium, High) from percentiles of the training-set attendance, so the thresholds are not influenced by test games:


In [None]:
# Load classification targets and config
import pickle

y_train_class = pd.read_csv('../data/processed/y_train_classification.csv')
y_test_class = pd.read_csv('../data/processed/y_test_classification.csv')

# config.pkl stores the tier thresholds computed on the training set
with open('../data/processed/config.pkl', 'rb') as f:
    config = pickle.load(f)

low_threshold = config['low_threshold']
high_threshold = config['high_threshold']

print(f"Attendance Tier Thresholds:")
print(f"  Low:    < {low_threshold:,.0f}")
print(f"  Medium: {low_threshold:,.0f} - {high_threshold:,.0f}")
print(f"  High:   > {high_threshold:,.0f}")

# Distribution
train_dist = y_train_class['tier'].value_counts().sort_index()
test_dist = y_test_class['tier'].value_counts().sort_index()

tier_names = ['Low', 'Medium', 'High']
print(f"\nTraining Set Distribution:")
for tier, count in train_dist.items():
    pct = (count / len(y_train_class)) * 100
    print(f"  {tier_names[tier]:>6}: {count:>3} ({pct:>5.1f}%)")

print(f"\nTest Set Distribution:")
for tier, count in test_dist.items():
    pct = (count / len(y_test_class)) * 100
    print(f"  {tier_names[tier]:>6}: {count:>3} ({pct:>5.1f}%)")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].bar([tier_names[i] for i in train_dist.index], train_dist.values, 
           color=['#DC143C', '#FFA500', '#00A693'], alpha=0.8, edgecolor='black')
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Training Set Distribution', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

axes[1].bar([tier_names[i] for i in test_dist.index], test_dist.values,
           color=['#DC143C', '#FFA500', '#00A693'], alpha=0.8, edgecolor='black')
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_title('Test Set Distribution', fontsize=14, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()
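
The thresholds above are read from `config.pkl` rather than computed in this notebook. A minimal sketch of how percentile-based tiers could be fitted on training data and then applied to unseen games is shown below. The 33rd/67th percentile cut points, the function names `fit_tier_thresholds` and `assign_tier`, and the attendance values are assumptions for illustration; only the `low_threshold`/`high_threshold` keys and the 0/1/2 tier coding come from the code above.

```python
import numpy as np
import pandas as pd

def fit_tier_thresholds(train_attendance, low_pct=33, high_pct=67):
    """Compute cut points from training attendance only (percentiles are assumed)."""
    values = pd.Series(train_attendance).dropna().to_numpy(dtype=float)
    low, high = np.percentile(values, [low_pct, high_pct])
    return {"low_threshold": float(low), "high_threshold": float(high)}

def assign_tier(attendance, config):
    """Map attendance to 0 (Low), 1 (Medium), 2 (High) using fitted thresholds."""
    att = np.asarray(attendance, dtype=float)
    tiers = np.ones(att.shape, dtype=int)            # default: Medium
    tiers[att < config["low_threshold"]] = 0         # Low:  below the lower cut
    tiers[att > config["high_threshold"]] = 2        # High: above the upper cut
    return tiers

# Synthetic attendance figures, not taken from the real dataset.
train_attendance = [15000, 15500, 16000, 16500, 17000, 17500, 17732]
example_config = fit_tier_thresholds(train_attendance)

# Thresholds fitted on training games are reused unchanged for new games.
new_game_tiers = assign_tier([14000, 16500, 17732], example_config)
print(example_config)
print(new_game_tiers)
```

Fitting the cut points once and reusing them for the test split mirrors the `Low: <`, `Medium: between`, `High: >` boundaries printed by the cell above.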


## 4. Model Training and Evaluation

### 4.1 Model Comparison


In [None]:
# Load model results
results_df = pd.read_csv('../results/models/model_results.csv', index_col=0)

print("Model Performance Summary:")
print("="*70)
print(f"{'Model':<25} {'Accuracy':>12} {'F1-Score':>12} {'Improvement':>15}")
print("-"*70)

baseline_f1 = results_df.loc['Baseline', 'test_f1']

for model_name in results_df.index:
    acc = results_df.loc[model_name, 'test_accuracy']
    f1 = results_df.loc[model_name, 'test_f1']
    if model_name != 'Baseline':
        improvement = ((f1 - baseline_f1) / baseline_f1) * 100
        print(f"{model_name:<25} {acc:>12.3f} {f1:>12.3f} {improvement:>14.1f}%")
    else:
        print(f"{model_name:<25} {acc:>12.3f} {f1:>12.3f} {'-':>15}")

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
model_names = results_df.index
x = np.arange(len(model_names))
width = 0.35

bars1 = ax.bar(x - width/2, results_df['test_accuracy'], width, 
               label='Accuracy', alpha=0.8, color='#00A693')
bars2 = ax.bar(x + width/2, results_df['test_f1'], width, 
               label='F1-Score', alpha=0.8, color='#FFA500')

ax.set_ylabel('Score', fontsize=12)
ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(model_names, rotation=45, ha='right')
ax.legend()
ax.grid(axis='y', alpha=0.3)
ax.set_ylim([0, 1])

# Add value labels
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
               f'{height:.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

# Best model
best_model = results_df.sort_values('test_f1', ascending=False).index[0]
best_acc = results_df.loc[best_model, 'test_accuracy']
best_f1 = results_df.loc[best_model, 'test_f1']
improvement = ((best_f1 - baseline_f1) / baseline_f1) * 100

print(f"\nBest Model: {best_model}")
print(f"  Accuracy: {best_acc:.3f}")
print(f"  F1-Score: {best_f1:.3f}")
print(f"  Improvement over baseline: {improvement:.1f}%")


In [None]:
# Load best model and evaluate
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load best model
with open('../results/models/best_model.pkl', 'rb') as f:
    best_model = pickle.load(f)

# Load test data
X_test_scaled = pd.read_csv('../data/processed/X_test_scaled.csv')
y_test = pd.read_csv('../data/processed/y_test_classification.csv')['tier']

# Predictions
y_pred = best_model.predict(X_test_scaled)

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Low', 'Medium', 'High'], digits=3))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Low', 'Medium', 'High'], 
            yticklabels=['Low', 'Medium', 'High'],
            cbar_kws={'label': 'Count'}, ax=ax, linewidths=1, linecolor='black')
ax.set_ylabel('Actual Tier', fontsize=12)
ax.set_xlabel('Predicted Tier', fontsize=12)
ax.set_title('Confusion Matrix - Logistic Regression', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")
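
To help read the confusion matrix, the sketch below shows how accuracy, per-tier recall, and per-tier precision follow from the matrix itself. It uses scikit-learn's convention of rows as actual tiers and columns as predicted tiers. The counts in `cm_example` are synthetic and are not the model's real results.

```python
import numpy as np

# Synthetic 3x3 confusion matrix (rows = actual, columns = predicted),
# ordered Low, Medium, High. These counts are made up for illustration.
cm_example = np.array([
    [7, 2, 1],
    [3, 4, 3],
    [1, 2, 7],
])

correct = np.diag(cm_example)                         # correctly classified games per tier
accuracy_example = correct.sum() / cm_example.sum()   # overall share of correct predictions
recall_per_tier = correct / cm_example.sum(axis=1)    # divide by actual counts (row sums)
precision_per_tier = correct / cm_example.sum(axis=0) # divide by predicted counts (column sums)

for name, rec, prec in zip(["Low", "Medium", "High"], recall_per_tier, precision_per_tier):
    print(f"{name:>6}: recall={rec:.3f} precision={prec:.3f}")
print(f"Accuracy: {accuracy_example:.3f}")
```

In this synthetic case the Medium tier has the weakest recall, which is the kind of pattern worth checking in the real matrix, since the middle tier borders both neighbours.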


## 5. Key Findings Summary

### 5.1 Model Performance

- **Best Model**: Logistic Regression
- **Accuracy**: 56.7%
- **F1-Score**: 0.539
- **Improvement**: 173.6% relative gain in F1-score over the baseline model
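
The improvement figure is relative, computed in Section 4.1 as `(f1 - baseline_f1) / baseline_f1 * 100`. As a quick sanity check, the sketch below inverts that formula to recover the baseline F1-score implied by the two reported numbers. The result is approximate because the reported values are rounded.

```python
# Reported values from the summary above.
best_f1 = 0.539
improvement_pct = 173.6

# improvement = (best_f1 - baseline_f1) / baseline_f1 * 100
# => baseline_f1 = best_f1 / (1 + improvement / 100)
implied_baseline_f1 = best_f1 / (1 + improvement_pct / 100)
absolute_gain = best_f1 - implied_baseline_f1

print(f"Implied baseline F1: {implied_baseline_f1:.3f}")
print(f"Absolute F1 gain:    {absolute_gain:.3f}")
```

The implied baseline is roughly 0.197, so the large percentage corresponds to an absolute F1 gain of about 0.34.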

### 5.2 Key Attendance Drivers

1. **Weekend games** show higher average attendance than weekday games
2. **Star opponent games** (Lakers, Warriors, Celtics) boost attendance
3. **Rivalry games** (Knicks, Celtics, 76ers) drive increased engagement
4. **Day of week** and **season timing** are strong predictors
5. **Interaction effects** (weekend + star opponent) create premium opportunities
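
One way to inspect the weekend and star-opponent interaction is to compare mean attendance across all four combinations of the two flags, using the same relative-lift formula as in Section 2. The sketch below does this on synthetic numbers. The `games` frame is invented for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Synthetic games, two per combination of flags; not real attendance data.
games = pd.DataFrame({
    "is_weekend":       [0, 0, 0, 0, 1, 1, 1, 1],
    "is_star_opponent": [0, 0, 1, 1, 0, 0, 1, 1],
    "attendance": [16000, 16200, 16900, 17100, 16800, 17000, 17600, 17700],
})

# Mean attendance for each (weekend, star opponent) cell.
cell_means = games.pivot_table(
    index="is_weekend", columns="is_star_opponent",
    values="attendance", aggfunc="mean",
)

# Lift of every cell relative to weekday games against regular opponents.
reference = cell_means.loc[0, 0]
lift_pct = (cell_means - reference) / reference * 100

print(cell_means)
print(lift_pct.round(1))
```

If the weekend-plus-star cell exceeds the sum of the two individual lifts, the effect is more than additive, which is what an interaction feature such as `weekend_star` is meant to capture.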

### 5.3 Business Insights

**Scheduling Strategy:**
- Prioritize weekend dates for marquee matchups
- Schedule star opponents on Fridays/Saturdays
- Avoid Monday games when possible

**Dynamic Pricing:**
- Premium pricing for weekend + star opponent games
- Promotional pricing for predicted Low attendance games
- Mid-tier pricing for weekday non-rival games

**Marketing & Promotions:**
- Focus marketing budget on predicted Medium→High conversion games
- Run promotions on predicted Low games
- Early bird discounts for weekday games

**Staffing & Operations:**
- Increase staff for predicted High attendance games
- Reduce costs on predicted Low attendance games
- Better inventory planning based on predictions
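
The recommendations above can be tied to model output with a simple lookup from predicted tier to suggested actions. This is only an illustrative sketch: the `PLAYBOOK` mapping and the `recommend` function are hypothetical, the action labels paraphrase the bullets in this section, and no such component exists in the project.

```python
TIER_NAMES = ["Low", "Medium", "High"]  # same order as the tier labels used in this notebook

# Actions paraphrased from the recommendations in this section (hypothetical mapping).
PLAYBOOK = {
    "Low": ["promotional pricing", "run promotions", "reduce staffing costs"],
    "Medium": ["target marketing to convert to High"],
    "High": ["increase staffing"],
}

def recommend(predicted_tier, is_weekend, is_star_opponent):
    """Return suggested actions for one game given its predicted tier and flags."""
    actions = list(PLAYBOOK[TIER_NAMES[predicted_tier]])
    # Weekend games against star opponents are flagged for premium pricing.
    if is_weekend and is_star_opponent:
        actions.append("premium pricing")
    return actions

print(recommend(0, is_weekend=0, is_star_opponent=0))
print(recommend(2, is_weekend=1, is_star_opponent=1))
```

With a test accuracy near 57%, a lookup like this is better treated as a prompt for human review than as an automated decision rule.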


## 6. Conclusion

This analysis identified key drivers of attendance at Brooklyn Nets home games and built a predictive model with **56.7% test accuracy** and a **173.6% relative improvement in F1-score** over the baseline. The model provides actionable insights for:

1. **Strategic Scheduling**: Optimize game dates for maximum attendance
2. **Dynamic Pricing**: Implement data-driven pricing strategies
3. **Targeted Marketing**: Allocate resources based on predicted attendance
4. **Operational Planning**: Optimize staffing and inventory

**Future Improvements:**
- Incorporate ticket pricing data
- Add social media sentiment analysis
- Build real-time prediction API
- Expand to other NBA teams for comparison

---

**Project Repository**: [GitHub](https://github.com/siyavyas/nba-fan-engagement-analysis)  
**Author**: Siya Vyas  
**Technologies**: Python, Pandas, Scikit-learn, XGBoost, Matplotlib, Seaborn
