
# James-Stein Estimators for NBA Player Heights and Weights

This notebook demonstrates the use of the James-Stein estimator in the context of parallel experiments. Specifically, we analyze the heights and weights of NBA players across five positions (Point Guard, Shooting Guard, Small Forward, Power Forward, Center).

The goal is to:
1. Compute Maximum Likelihood Estimators (MLE) and James-Stein estimators for the mean heights and weights of players by position.
2. Visualize the shrinkage effect of the James-Stein estimator.
3. Show that the James-Stein estimator achieves a lower Mean Squared Error (MSE) compared to the MLE.


## Libraries

Import the Python libraries used in the analysis.

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



## Dataset Overview
The dataset contains information on over 13,000 NBA players, including attributes such as height, weight, and position. For this analysis, we focus on a random sample of 50 players from each position.

In [1]:
# Load the dataset
file_path = 'final_dataset_master.csv' 
data = pd.read_csv(file_path)

# Focus on relevant columns for analysis
data_subset = data[['player_height', 'player_weight', 'Pos.x']]
data_subset.head()


|   | player_height | player_weight | Pos.x |
|---|---|---|---|
| 0 | 183.58 | 74.61 | SG |
| 1 | 198.15 | 99.23 | SG |
| 2 | 216.31 | 142.56 | C |
| 3 | 201.6 | 95.2 | SG |
| 4 | 198.64 | 102.34 | SF |



## Height Analysis
We begin by analyzing the heights of players across positions. For each position, we:
1. Randomly sample 50 players.
2. Compute the sample mean and variance.
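
As a side note on step 1, pandas also provides `DataFrameGroupBy.sample`, which draws a fixed number of rows per group without going through `apply`. The sketch below uses a small made-up frame (the values are placeholders, not rows from the dataset) and is only an illustration; it is not guaranteed to select the same rows as the cell that follows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 positions with 10 players each (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "player_height": rng.normal(200, 5, size=30).round(2),
    "Pos.x": np.repeat(["C", "PG", "SG"], 10),
})

# Draw 4 players per position in a single call.
toy_sample = toy.groupby("Pos.x").sample(n=4, random_state=42)

# Per-position sample mean and variance, named as in the analysis.
toy_stats = (
    toy_sample.groupby("Pos.x")["player_height"]
    .agg(sample_mean="mean", sample_variance="var")
    .reset_index()
)
print(toy_stats)
```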


In [2]:
# Randomly sample 50 players from each position for height analysis
sampled_data = data_subset.groupby('Pos.x', group_keys=False).apply(lambda x: x.sample(50, random_state=42))

# Compute sample mean and variance for heights
summary_stats = sampled_data.groupby('Pos.x')['player_height'].agg(['mean', 'var']).reset_index()
summary_stats.rename(columns={'mean': 'sample_mean', 'var': 'sample_variance'}, inplace=True)
summary_stats




|   | Pos.x | sample_mean | sample_variance |
|---|---|---|---|
| 0 | C | 210.883 | 20.318695 |
| 1 | PF | 205.4462 | 18.760857 |
| 2 | PG | 190.0254 | 22.537095 |
| 3 | SF | 202.3034 | 17.399223 |
| 4 | SG | 196.5998 | 13.744659 |



### Applying James-Stein Estimator for Heights
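The James-Stein estimator shrinks each sample mean toward the grand mean. This trades a small bias for lower variance and reduces the total MSE summed over positions, though not necessarily the error for every individual position. The steps are: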
1. Compute the grand mean of sample means.
2. Estimate the pooled variance (common variance assumption).
3. Calculate the shrinkage factor $\delta$.
4. Compute the James-Stein estimators for each position.
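
Written out, the four steps correspond to the expressions below. The notation is an assumption introduced here to mirror the code cell that follows: $p$ is the number of positions, $n$ the sample size per position, $\bar{x}_i$ and $s_i^2$ the sample mean and variance for position $i$, $\bar{x}$ the grand mean, and $\hat{\sigma}^2$ the pooled variance.

$$
\bar{x} = \frac{1}{p}\sum_{i=1}^{p} \bar{x}_i, \qquad
\hat{\sigma}^2 = \frac{1}{p}\sum_{i=1}^{p} s_i^2, \qquad
\delta = \min\left\{1,\ \max\left\{0,\ \frac{(p-2)\,\hat{\sigma}^2 / n}{\sum_{i=1}^{p} (\bar{x}_i - \bar{x})^2}\right\}\right\}
$$

$$
\hat{\mu}_i^{\mathrm{JS}} = \bar{x}_i - \delta\,(\bar{x}_i - \bar{x})
$$

A value of $\delta = 0$ leaves the sample means unchanged, while $\delta = 1$ replaces every estimate with the grand mean.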


In [3]:

# Number of positions
p = len(summary_stats)

# Sample size per group
n = 50

# Calculate the grand mean
grand_mean = summary_stats['sample_mean'].mean()

# Estimate pooled variance
pooled_variance = summary_stats['sample_variance'].mean()

# Variance of sample means
sample_mean_variance = pooled_variance / n

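# Calculate shrinkage factor, clipped to [0, 1] (positive-part form).
# Note: (p - 2) is the constant for shrinking toward a fixed target; when the
# target is the estimated grand mean, as here, (p - 3) is the usual choice.
numerator = (p - 2) * sample_mean_variance
denominator = ((summary_stats['sample_mean'] - grand_mean) ** 2).sum()
delta = max(0, min(1, numerator / denominator))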

# Apply James-Stein estimator
summary_stats['james_stein_estimator'] = (
    summary_stats['sample_mean'] - delta * (summary_stats['sample_mean'] - grand_mean)
)

# Add contextual columns
summary_stats['grand_mean'] = grand_mean
summary_stats['shrinkage_factor'] = delta
summary_stats['pooled_variance'] = pooled_variance
summary_stats


|   | Pos.x | sample_mean | sample_variance | james_stein_estimator | grand_mean | shrinkage_factor | pooled_variance |
|---|---|---|---|---|---|---|---|
| 0 | C | 210.883 | 20.318695 | 210.840735 | 201.05156 | 0.004299 | 18.552106 |
| 1 | PF | 205.4462 | 18.760857 | 205.427308 | 201.05156 | 0.004299 | 18.552106 |
| 2 | PG | 190.0254 | 22.537095 | 190.072801 | 201.05156 | 0.004299 | 18.552106 |
| 3 | SF | 202.3034 | 17.399223 | 202.298018 | 201.05156 | 0.004299 | 18.552106 |
| 4 | SG | 196.5998 | 13.744659 | 196.618938 | 201.05156 | 0.004299 | 18.552106 |
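
As a check on the output above, the shrinkage factor can be reproduced by hand from the reported values. Here $\hat{\sigma}^2$ denotes the pooled variance, $\bar{x}_i$ the sample means, and $\bar{x}$ the grand mean; the inputs are rounded, so the last digits are approximate.

$$
\frac{\hat{\sigma}^2}{n} = \frac{18.552106}{50} \approx 0.37104, \qquad
\sum_{i=1}^{5} (\bar{x}_i - \bar{x})^2 \approx 258.93, \qquad
\delta \approx \frac{(5-2) \times 0.37104}{258.93} \approx 0.0043
$$

The factor is small because the spread of the position means (about 258.93) is far larger than the sampling variance of a single mean (about 0.37). Each estimate therefore moves by less than 0.05; the Center estimate, for example, moves from 210.883 to 210.840735.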



### Mean Squared Error (MSE) Comparison for Heights
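We compare the total error of the Maximum Likelihood Estimator (MLE, the sample mean) and the James-Stein estimator. The code does not compare against the true position means; it uses a proxy, namely the squared distance of each estimate from the grand mean plus the variance of a sample mean. Because the James-Stein estimates are pulled toward the grand mean by construction, this proxy can never be larger for James-Stein than for the MLE, so it illustrates the shrinkage rather than establishing a lower MSE against the true means.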


In [4]:

# Compute MSEs
summary_stats['mle_mse'] = (summary_stats['sample_mean'] - grand_mean) ** 2 + sample_mean_variance
summary_stats['js_mse'] = (summary_stats['james_stein_estimator'] - grand_mean) ** 2 + sample_mean_variance

# Total MSEs
total_mle_mse = summary_stats['mle_mse'].sum()
total_js_mse = summary_stats['js_mse'].sum()

# Display MSE comparison
mse_comparison = pd.DataFrame({
    "Estimator": ["MLE", "James-Stein"],
    "Total MSE": [total_mle_mse, total_js_mse]
})
mse_comparison


|   | Estimator | Total MSE |
|---|---|---|
| 0 | MLE | 260.786759 |
| 1 | James-Stein | 258.565291 |
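
The comparison above measures distance from the grand mean. A complementary check, which is not part of the original analysis, is a small simulation in which the true means are known. In the sketch below the true means, the variance, and the number of replications are made-up values chosen for illustration, and the sample means are drawn directly from a normal distribution with known variance. It applies the same shrinkage formula as the notebook and averages the total squared error over many replications.

```python
import numpy as np

def james_stein(sample_means, var_of_mean):
    """Shrink each row of sample means toward its grand mean.

    Uses the notebook's formula: constant (p - 2), factor clipped to [0, 1].
    """
    p = sample_means.shape[-1]
    grand_mean = sample_means.mean(axis=-1, keepdims=True)
    deviations = sample_means - grand_mean
    sum_sq = (deviations ** 2).sum(axis=-1, keepdims=True)
    delta = np.clip((p - 2) * var_of_mean / sum_sq, 0.0, 1.0)
    return sample_means - delta * deviations

rng = np.random.default_rng(0)

# Assumed values for illustration only (not taken from the dataset).
true_means = np.array([198.0, 199.0, 200.0, 201.0, 202.0])
var_of_mean = 18.5 / 50   # variance of one sample mean, sigma^2 / n
n_reps = 20_000           # number of simulated experiments

# Each row is one simulated experiment: one sample mean per group.
sample_means = rng.normal(true_means, np.sqrt(var_of_mean),
                          size=(n_reps, true_means.size))
js_estimates = james_stein(sample_means, var_of_mean)

# Total squared error against the true means, averaged over experiments.
mle_total_mse = ((sample_means - true_means) ** 2).sum(axis=1).mean()
js_total_mse = ((js_estimates - true_means) ** 2).sum(axis=1).mean()

print(f"MLE total MSE:         {mle_total_mse:.4f}")
print(f"James-Stein total MSE: {js_total_mse:.4f}")
```

With real data, a similar check could use the per-position means of the full dataset as reference values, assuming the full dataset is treated as the population.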



## Weight Analysis
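We repeat the same steps for player weights. The sampling call uses the same `random_state`, so the sampled players are the same as in the height analysis.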
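
Since the steps are identical for both variables, the computation could be wrapped in a helper. The function below is a hypothetical sketch (its name and signature are not part of the notebook) that mirrors the height cell: pooled variance as the average of the group variances, the `(p - 2)` constant, and a shrinkage factor clipped to [0, 1]. It is checked here against the height statistics reported earlier.

```python
import numpy as np

def james_stein_toward_grand_mean(sample_means, sample_variances, n):
    """Shrink group means toward their grand mean (hypothetical helper).

    Returns the shrunken estimates and the shrinkage factor delta.
    """
    sample_means = np.asarray(sample_means, dtype=float)
    sample_variances = np.asarray(sample_variances, dtype=float)
    p = sample_means.size
    grand_mean = sample_means.mean()
    # Pooled variance (equal group sizes), divided by n for a sample mean.
    var_of_mean = sample_variances.mean() / n
    deviations = sample_means - grand_mean
    ratio = (p - 2) * var_of_mean / (deviations ** 2).sum()
    delta = float(np.clip(ratio, 0.0, 1.0))
    return sample_means - delta * deviations, delta

# Height summary statistics from this notebook, in the order C, PF, PG, SF, SG.
height_means = [210.883, 205.4462, 190.0254, 202.3034, 196.5998]
height_vars = [20.318695, 18.760857, 22.537095, 17.399223, 13.744659]

js_heights, delta_heights = james_stein_toward_grand_mean(
    height_means, height_vars, n=50
)
print(round(delta_heights, 6))
print(js_heights.round(6))
```

With the notebook's data frames, the same call could take `weight_stats['sample_mean']` and `weight_stats['sample_variance']` as inputs.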


In [6]:

# Randomly sample 50 players from each position for weight analysis
sampled_weight_data = data_subset.groupby('Pos.x', group_keys=False).apply(lambda x: x.sample(50, random_state=42))

# Compute sample mean and variance for weights
weight_stats = sampled_weight_data.groupby('Pos.x')['player_weight'].agg(['mean', 'var']).reset_index()
weight_stats.rename(columns={'mean': 'sample_mean', 'var': 'sample_variance'}, inplace=True)

# Compute grand mean and pooled variance for weights
grand_mean_weight = weight_stats['sample_mean'].mean()
pooled_variance_weight = weight_stats['sample_variance'].mean()
sample_mean_variance_weight = pooled_variance_weight / n

# Calculate shrinkage factor for weights
numerator_weight = (p - 2) * sample_mean_variance_weight
denominator_weight = ((weight_stats['sample_mean'] - grand_mean_weight) ** 2).sum()
delta_weight = max(0, min(1, numerator_weight / denominator_weight))

# Apply James-Stein estimator for weights
weight_stats['james_stein_estimator'] = (
    weight_stats['sample_mean'] - delta_weight * (weight_stats['sample_mean'] - grand_mean_weight)
)

# Add contextual columns for weights
weight_stats['grand_mean'] = grand_mean_weight
weight_stats['shrinkage_factor'] = delta_weight
weight_stats['pooled_variance'] = pooled_variance_weight
weight_stats




|   | Pos.x | sample_mean | sample_variance | james_stein_estimator | grand_mean | shrinkage_factor | pooled_variance |
|---|---|---|---|---|---|---|---|
| 0 | C | 115.478 | 59.736955 | 115.40232 | 100.44452 | 0.005034 | 43.200933 |
| 1 | PF | 106.5946 | 48.323801 | 106.56364 | 100.44452 | 0.005034 | 43.200933 |
| 2 | PG | 86.348 | 31.817457 | 86.418964 | 100.44452 | 0.005034 | 43.200933 |
| 3 | SF | 100.5918 | 38.387717 | 100.591059 | 100.44452 | 0.005034 | 43.200933 |
| 4 | SG | 93.2102 | 37.738733 | 93.246618 | 100.44452 | 0.005034 | 43.200933 |



### Mean Squared Error (MSE) Comparison for Weights


In [7]:

# Compute MSEs for weights
weight_stats['mle_mse'] = (weight_stats['sample_mean'] - grand_mean_weight) ** 2 + sample_mean_variance_weight
weight_stats['js_mse'] = (weight_stats['james_stein_estimator'] - grand_mean_weight) ** 2 + sample_mean_variance_weight

# Total MSEs for weights
total_mle_mse_weight = weight_stats['mle_mse'].sum()
total_js_mse_weight = weight_stats['js_mse'].sum()

# Display MSE comparison for weights
mse_comparison_weight = pd.DataFrame({
    "Estimator": ["MLE", "James-Stein"],
    "Total MSE": [total_mle_mse_weight, total_js_mse_weight]
})
mse_comparison_weight


|   | Estimator | Total MSE |
|---|---|---|
| 0 | MLE | 519.218052 |
| 1 | James-Stein | 514.046988 |
