# Phase 0: Environment Setup

In [1]:
# Install required packages
!pip install pgmpy pandas numpy matplotlib seaborn networkx fuzzywuzzy python-Levenshtein kaggle nba_api

Collecting pgmpy
  Downloading pgmpy-1.0.0-py3-none-any.whl.metadata (9.4 kB)
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-Levenshtein
  Downloading python_levenshtein-0.27.1-py3-none-any.whl.metadata (3.7 kB)
Collecting nba_api
  Downloading nba_api-1.10.2-py3-none-any.whl.metadata (5.8 kB)
Collecting pyro-ppl (from pgmpy)
  Downloading pyro_ppl-1.9.1-py3-none-any.whl.metadata (7.8 kB)
Collecting Levenshtein==0.27.1 (from python-Levenshtein)
  Downloading levenshtein-0.27.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein==0.27.1->python-Levenshtein)
  Downloading rapidfuzz-3.14.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (12 kB)
Collecting pyro-api>=0.1.1 (from pyro-ppl->pgmpy)
  Downloading pyro_api-0.1.2-py3-none-any.whl.metadata (2.5 kB)
Downloading pgmpy-1.0.0-py3-none-any.whl (2.0 MB)

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
from pgmpy.inference import VariableElimination
import warnings
warnings.filterwarnings('ignore')

print("✅ All packages installed and imported successfully!")

✅ All packages installed and imported successfully!


# Phase 1: Data Acquisition & Problem Formalization

## Phase 1.1: Install NBA API and Get Data


In [3]:
print("🚀 GETTING REAL NBA LINEUP DATA FROM OFFICIAL NBA API...")

# nba_api and pandas were already installed and imported in Phase 0
import time

from nba_api.stats.endpoints import teamdashlineups
from nba_api.stats.static import teams

# Get all NBA teams
nba_teams = teams.get_teams()

# Map each team's full name to its NBA team ID
team_dict = {team['full_name']: team['id'] for team in nba_teams}

print(f"✅ Found {len(team_dict)} NBA teams")

# Function to get lineups for a team
def get_lineups(team_id_i):
    """Return the 5-man lineup totals for one team, or None if the request fails."""
    try:
        lineup = teamdashlineups.TeamDashLineups(
            team_id=team_id_i,
            season='2023-24',  # Using 2023-24 for more complete data
            season_type_all_star='Regular Season',
            group_quantity=5,  # 5-man lineups
            per_mode_detailed='Totals'
        )
        df = lineup.get_data_frames()
        all_lineups = df[1]  # This contains the lineup data
        return all_lineups
    except Exception as e:
        print(f"❌ Error getting lineups for team {team_id_i}: {e}")
        return None

# Get lineups for all teams
print("\n📥 DOWNLOADING LINEUP DATA FOR ALL TEAMS...")
dataframes = []

for i, (team_name, team_id_i) in enumerate(team_dict.items(), start=1):
    print(f"   {i}/{len(team_dict)}: Getting {team_name}...")

    team_lineup = get_lineups(team_id_i)
    if team_lineup is not None and not team_lineup.empty:
        team_lineup['team'] = team_name
        team_lineup['team_id'] = team_id_i
        dataframes.append(team_lineup)

    # Add small delay to avoid overwhelming API
    time.sleep(0.5)

# Combine all team lineups
if dataframes:
    league_lineup = pd.concat(dataframes, ignore_index=True)

    # Split "A - B - C - D - E" into a list of the five player names
    league_lineup['players_list'] = league_lineup['GROUP_NAME'].str.split(' - ')

    print(f"\n✅ SUCCESS: Downloaded {len(league_lineup)} lineup combinations!")
    print(f"📊 Dataset shape: {league_lineup.shape}")

    # Save the data
    league_lineup.to_csv('nba_lineups_2024_api.csv', index=False)
    print("💾 Saved as 'nba_lineups_2024_api.csv'")

    # Show sample
    print("\n🔍 SAMPLE OF REAL NBA LINEUP DATA:")
    display(league_lineup[['GROUP_NAME', 'team', 'MIN', 'PLUS_MINUS', 'FG_PCT', 'FG3_PCT']].head(3))

else:
    print("❌ No lineup data could be downloaded")

🚀 GETTING REAL NBA LINEUP DATA FROM OFFICIAL NBA API...
✅ Found 30 NBA teams

📥 DOWNLOADING LINEUP DATA FOR ALL TEAMS...
   1/30: Getting Atlanta Hawks...
   2/30: Getting Boston Celtics...
   3/30: Getting Cleveland Cavaliers...
   4/30: Getting New Orleans Pelicans...
   5/30: Getting Chicago Bulls...
   6/30: Getting Dallas Mavericks...
   7/30: Getting Denver Nuggets...
   8/30: Getting Golden State Warriors...
   9/30: Getting Houston Rockets...
   10/30: Getting Los Angeles Clippers...
   11/30: Getting Los Angeles Lakers...
   12/30: Getting Miami Heat...
   13/30: Getting Milwaukee Bucks...
   14/30: Getting Minnesota Timberwolves...
   15/30: Getting Brooklyn Nets...
   16/30: Getting New York Knicks...
   17/30: Getting Orlando Magic...
   18/30: Getting Indiana Pacers...
   19/30: Getting Philadelphia 76ers...
   20/30: Getting Phoenix Suns...
   21/30: Getting Portland Trail Blazers...
   22/30: Getting Sacramento Kings...
   23/30: Getting San Antonio Spurs...
   2

|   | GROUP_NAME | team | MIN | PLUS_MINUS | FG_PCT | FG3_PCT |
|---|---|---|---|---|---|---|
| 0 | C. Capela - D. Murray - T. Young - S. Bey - J.... | Atlanta Hawks | 288.68 | -88.0 | 0.446 | 0.312 |
| 1 | C. Capela - D. Murray - T. Young - D. Hunter -... | Atlanta Hawks | 176.911667 | 8.0 | 0.468 | 0.384 |
| 2 | C. Capela - D. Murray - T. Young - D. Hunter -... | Atlanta Hawks | 171.505 | -26.0 | 0.464 | 0.367 |
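
One thing to watch in the cell above: `players_list` holds Python lists, and `to_csv` writes each list as its string representation, so the column comes back from `read_csv` as plain strings. The sketch below shows the round trip on a toy frame and recovers the lists by re-splitting `GROUP_NAME`. The lineup names and the in-memory buffer are illustrative assumptions, not data from the API.

```python
import io

import pandas as pd

# Toy stand-in for the API result; the names are made up for illustration.
toy = pd.DataFrame({"GROUP_NAME": ["A. One - B. Two - C. Three", "D. Four - E. Five"]})
toy["players_list"] = toy["GROUP_NAME"].str.split(" - ")

# Round-trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, the list column holds strings, not lists.
came_back_as_str = isinstance(reloaded.loc[0, "players_list"], str)

# Re-splitting GROUP_NAME restores real lists without parsing that string.
reloaded["players_list"] = reloaded["GROUP_NAME"].str.split(" - ")
first_lineup = reloaded.loc[0, "players_list"]
```

If later phases need per-player lookups, rebuilding the column after each `read_csv` is probably simpler than parsing the stored string.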


## Phase 1.2: Analyze the API Data Structure

In [6]:
print("🔬 ANALYZING NBA API DATA STRUCTURE...")

try:
    lineup_data = pd.read_csv('nba_lineups_2024_api.csv')

    print("📋 COLUMNS AVAILABLE:")
    for col in lineup_data.columns:
        print(f"   - {col}")

    print("\n🎯 VARIABLES FOR OUR BAYESIAN NETWORK:")

    # Check for critical variables
    critical_vars = {
        'Efficiency (Target)': ['PLUS_MINUS', 'PTS'],
        'Shooting': ['FG_PCT', 'FG3_PCT', 'EFG_PCT'],
        'Playmaking': ['AST', 'AST_PCT'],
        'Rebounding': ['OREB', 'DREB', 'REB'],
        'Turnovers': ['TOV', 'TOV_PCT']
    }

    available_cols = lineup_data.columns.tolist()

    for category, possible_vars in critical_vars.items():
        found = [var for var in possible_vars if var in available_cols]
        if found:
            print(f"   ✅ {category}: {found}")
        else:
            print(f"   ❌ {category}: Not found")

    print(f"\n📊 Dataset info: {lineup_data.shape}")
    print(f"👥 Unique lineups: {lineup_data['GROUP_NAME'].nunique()}")

except Exception as e:
    print(f"❌ Error analyzing data: {e}")

🔬 ANALYZING NBA API DATA STRUCTURE...
📋 COLUMNS AVAILABLE:
   - GROUP_SET
   - GROUP_ID
   - GROUP_NAME
   - GP
   - W
   - L
   - W_PCT
   - MIN
   - FGM
   - FGA
   - FG_PCT
   - FG3M
   - FG3A
   - FG3_PCT
   - FTM
   - FTA
   - FT_PCT
   - OREB
   - DREB
   - REB
   - AST
   - TOV
   - STL
   - BLK
   - BLKA
   - PF
   - PFD
   - PTS
   - PLUS_MINUS
   - GP_RANK
   - W_RANK
   - L_RANK
   - W_PCT_RANK
   - MIN_RANK
   - FGM_RANK
   - FGA_RANK
   - FG_PCT_RANK
   - FG3M_RANK
   - FG3A_RANK
   - FG3_PCT_RANK
   - FTM_RANK
   - FTA_RANK
   - FT_PCT_RANK
   - OREB_RANK
   - DREB_RANK
   - REB_RANK
   - AST_RANK
   - TOV_RANK
   - STL_RANK
   - BLK_RANK
   - BLKA_RANK
   - PF_RANK
   - PFD_RANK
   - PTS_RANK
   - PLUS_MINUS_RANK
   - SUM_TIME_PLAYED
   - team
   - team_id
   - players_list

🎯 VARIABLES FOR OUR BAYESIAN NETWORK:
   ✅ Efficiency (Target): ['PLUS_MINUS', 'PTS']
   ✅ Shooting: ['FG_PCT', 'FG3_PCT']
   ✅ Playmaking: ['AST']
   ✅ Rebounding: ['OREB', 'DREB', 

## Phase 1.3: Integration of NBA API Data

In [9]:
# === PHASE 1.3 FIXED: USE ONLY NBA API DATA ===
print("=== PHASE 1.3: PROPER NBA API DATA INTEGRATION ===")

# Load the NBA API data we just downloaded
print("📥 Loading NBA API lineup data...")
lineup_data = pd.read_csv('nba_lineups_2024_api.csv')

print(f"📊 Original NBA API data: {lineup_data.shape}")

# Select only the variables we need for our Bayesian network
print("\n🎯 SELECTING VARIABLES FOR BAYESIAN NETWORK:")
selected_vars = {
    'Efficiency': 'PLUS_MINUS',  # Total plus/minus, used as an efficiency proxy
    'Shooting_FG': 'FG_PCT',     # Field goal percentage
    'Shooting_3PT': 'FG3_PCT',   # 3-point percentage
    'Playmaking': 'AST',         # Assists (total)
    'Turnovers': 'TOV',          # Turnovers (total)
    'Offensive_Rebounding': 'OREB'  # Offensive rebounds (total)
}

# Create our feature dataset
print("🔧 Creating feature dataset from NBA API data...")
feature_data = lineup_data[list(selected_vars.values())].copy()
feature_data.columns = list(selected_vars.keys())

print(f"📊 Feature dataset shape: {feature_data.shape}")

# Remove any missing values
feature_data = feature_data.dropna()
print(f"📊 After removing missing values: {feature_data.shape}")

# Check data quality
print("\n🔍 DATA QUALITY CHECK:")
print("Basic statistics:")
print(feature_data.describe())

# Check for reasonable ranges (basketball logic)
print("\n🏀 BASKETBALL LOGIC VALIDATION:")
print("Ranges should make sense for NBA:")
for col in feature_data.columns:
    min_val = feature_data[col].min()
    max_val = feature_data[col].max()
    print(f"  {col}: {min_val:.2f} to {max_val:.2f}")

# Verify we have enough data for discretization
print(f"\n📈 DATA SUFFICIENCY:")
print(f"  Total samples: {len(feature_data)}")
print(f"  Minimum required: ~1,000 (for 3^5=243 combinations)")
print(f"  Status: {'✅ SUFFICIENT' if len(feature_data) >= 1000 else '❌ INSUFFICIENT'}")

if len(feature_data) >= 1000:
    # Save the integrated data for Phase 2
    feature_data.to_csv('nba_api_integrated_data.csv', index=False)
    print("💾 Saved integrated data as 'nba_api_integrated_data.csv'")

    print("\n✅ PHASE 1.3 COMPLETED SUCCESSFULLY!")
    print("🎯 Using ONLY NBA API data for consistency")
    print("🚀 Ready for Phase 2: Data Preprocessing")
else:
    print("\n❌ INSUFFICIENT DATA - Need to collect more NBA API data")
    print("   Consider multiple seasons or different API endpoints")

=== PHASE 1.3: PROPER NBA API DATA INTEGRATION ===
📥 Loading NBA API lineup data...
📊 Original NBA API data: (7500, 59)

🎯 SELECTING VARIABLES FOR BAYESIAN NETWORK:
🔧 Creating feature dataset from NBA API data...
📊 Feature dataset shape: (7500, 6)
📊 After removing missing values: (7500, 6)

🔍 DATA QUALITY CHECK:
Basic statistics:
        Efficiency  Shooting_FG  Shooting_3PT   Playmaking    Turnovers  \
count  7500.000000  7500.000000   7500.000000  7500.000000  7500.000000   
mean      0.555867     0.473603      0.354854     7.921333     3.910133   
std      11.382578     0.151966      0.248258    21.286298     9.458745   
min     -88.000000     0.000000      0.000000     0.000000     0.000000   
25%      -5.000000     0.387000      0.200000     2.000000     1.000000   
50%       0.000000     0.476000      0.333000     4.000000     2.000000   
75%       5.000000     0.563000      0.500000     7.000000     4.000000   
max     282.000000     1.000000      1.000000  
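
The sufficiency check in the cell above rests on simple counting: five inputs with three states each give 3^5 = 243 joint configurations, so the ~1,000-row threshold works out to roughly four rows per configuration. A minimal sketch of that arithmetic, using the row count reported in the output (the variable names are illustrative):

```python
# Number of joint configurations for 5 inputs with 3 states each.
n_states = 3
n_inputs = 5
n_configs = n_states ** n_inputs            # 243

n_samples = 7500                            # rows reported in the output above
avg_per_config = n_samples / n_configs      # average rows per configuration

min_required = 1000                         # threshold used in the cell above
rows_per_config_at_minimum = min_required / n_configs
```

The average hides imbalance: real lineups do not spread evenly over the 243 cells, so some configurations will probably have far fewer rows than the average suggests.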

# Phase 2: Data Preprocessing & Discretization


## Phase 2.1: Data Cleaning & Filtering

In [10]:
# === PHASE 2.1 UPDATED: CLEAN NBA API DATA ===
print("=== PHASE 2.1: CLEANING NBA API DATA ===")

# Load the integrated NBA API data
print("📥 Loading integrated NBA API data...")
nba_api_data = pd.read_csv('nba_api_integrated_data.csv')

print(f"📊 Dataset shape: {nba_api_data.shape}")
print(f"🎯 Columns: {list(nba_api_data.columns)}")

# The data is already clean (no missing values), but let's verify
print("\n🔍 DATA CLEANLINESS CHECK:")
print(f"Missing values: {nba_api_data.isnull().sum().sum()}")  # Should be 0
print(f"Duplicate rows: {nba_api_data.duplicated().sum()}")    # Should be minimal

# Check for extreme outliers that might skew discretization
print("\n📊 OUTLIER DETECTION:")
for col in nba_api_data.columns:
    Q1 = nba_api_data[col].quantile(0.25)
    Q3 = nba_api_data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = nba_api_data[(nba_api_data[col] < lower_bound) | (nba_api_data[col] > upper_bound)]
    print(f"  {col}: {len(outliers)} outliers ({len(outliers)/len(nba_api_data):.1%})")

print("\n✅ PHASE 2.1 COMPLETED!")
print("🚀 Ready for Phase 2.2: Feature Selection & Engineering")

=== PHASE 2.1: CLEANING NBA API DATA ===
📥 Loading integrated NBA API data...
📊 Dataset shape: (7500, 6)
🎯 Columns: ['Efficiency', 'Shooting_FG', 'Shooting_3PT', 'Playmaking', 'Turnovers', 'Offensive_Rebounding']

🔍 DATA CLEANLINESS CHECK:
Missing values: 0
Duplicate rows: 162

📊 OUTLIER DETECTION:
  Efficiency: 324 outliers (4.3%)
  Shooting_FG: 234 outliers (3.1%)
  Shooting_3PT: 343 outliers (4.6%)
  Playmaking: 777 outliers (10.4%)
  Turnovers: 628 outliers (8.4%)
  Offensive_Rebounding: 729 outliers (9.7%)

✅ PHASE 2.1 COMPLETED!
🚀 Ready for Phase 2.2: Feature Selection & Engineering
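
The outlier rule used above is the standard 1.5 x IQR fence. The sketch below wraps it in a small helper and applies it to a toy series; the helper name `iqr_outlier_mask` and the toy values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy right-skewed series: nine small values and one very large one.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
mask = iqr_outlier_mask(toy)
n_outliers = int(mask.sum())
flagged = toy[mask].tolist()
```

At this stage Playmaking, Turnovers, and Offensive_Rebounding are still season totals, and the Phase 1.3 statistics show them to be right-skewed (for Playmaking the mean is about 7.9 against a median of 4). The higher outlier shares for those three columns may therefore reflect high-minute lineups rather than bad data.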


## Phase 2.2: Data Preprocessing & Engineering

In [13]:
# === PHASE 2.2 FIXED: PROPER RATE STATISTICS ===
print("=== PHASE 2.2 FIXED: PROPER RATE STATISTICS ===")

# We need the original lineup data with MINUTES to convert to rates
print("📥 Loading full NBA lineup data with minutes...")
lineup_data = pd.read_csv('nba_lineups_2024_api.csv')

print("🔧 Converting totals to per-minute rates...")

# Calculate rates per 48 minutes (standard NBA rate)
def calculate_rates(data):
    """Return a copy of `data` with counting stats rescaled to per-48-minute rates."""
    rates_data = data.copy()

    # Efficiency is the raw PLUS_MINUS total for the lineup. The data was pulled
    # with per_mode_detailed='Totals', so this value is NOT normalized by minutes.
    rates_data['Efficiency'] = data['PLUS_MINUS']

    # Minutes played by the lineup, used as the denominator below
    minutes = data['MIN']

    # Shooting percentages are already rates, so they carry over unchanged
    rates_data['Shooting_FG'] = data['FG_PCT']
    rates_data['Shooting_3PT'] = data['FG3_PCT']

    # Playmaking: Assists per 48 minutes
    rates_data['Playmaking'] = (data['AST'] / minutes) * 48

    # Turnovers per 48 minutes (lower is better; the value itself is not inverted)
    rates_data['Turnovers'] = (data['TOV'] / minutes) * 48

    # Offensive Rebounding: Offensive rebounds per 48 minutes
    rates_data['Offensive_Rebounding'] = (data['OREB'] / minutes) * 48

    return rates_data

# Create rate-based features
rates_data = calculate_rates(lineup_data)

# Select only our 6 key variables
rates_data = rates_data[['Efficiency', 'Shooting_FG', 'Shooting_3PT',
                        'Playmaking', 'Turnovers', 'Offensive_Rebounding']]

# Remove any infinite/NaN values from division
rates_data = rates_data.replace([np.inf, -np.inf], np.nan).dropna()

print(f"📊 Rate-based dataset shape: {rates_data.shape}")

# Check new correlations
print("\n📊 FIXED CORRELATIONS WITH EFFICIENCY:")
corr_matrix = rates_data.corr()
efficiency_correlations = corr_matrix['Efficiency'].sort_values(ascending=False)

for feature, corr in efficiency_correlations.items():
    if feature != 'Efficiency':
        print(f"   {feature}: {corr:.3f}")

# Verify basketball logic is now correct
positive_expected = ['Shooting_FG', 'Shooting_3PT', 'Playmaking', 'Offensive_Rebounding']
negative_expected = ['Turnovers']

actual_positive = [f for f in efficiency_correlations.index
                  if f != 'Efficiency' and efficiency_correlations[f] > 0]
actual_negative = [f for f in efficiency_correlations.index
                  if f != 'Efficiency' and efficiency_correlations[f] < 0]

print(f"\n✅ Expected Positive: {positive_expected}")
print(f"✅ Expected Negative: {negative_expected}")
print(f"📊 Actual Positive: {actual_positive}")
print(f"📊 Actual Negative: {actual_negative}")

# Basketball logic validation
if 'Turnovers' in actual_negative:
    print("🎯 BASKETBALL LOGIC: Turnovers now negatively correlate with efficiency ✓")
else:
    print("❌ BASKETBALL LOGIC STILL BROKEN - Need further investigation")

# Save the corrected data
rates_data.to_csv('nba_api_corrected_rates.csv', index=False)
print("\n💾 Saved corrected rate-based data as 'nba_api_corrected_rates.csv'")

print("\n✅ PHASE 2.2 FIXED COMPLETED!")
print("🚀 Ready for Phase 2.3 with proper basketball logic")

=== PHASE 2.2 FIXED: PROPER RATE STATISTICS ===
📥 Loading full NBA lineup data with minutes...
🔧 Converting totals to per-minute rates...
📊 Rate-based dataset shape: (7500, 6)

📊 FIXED CORRELATIONS WITH EFFICIENCY:
   Shooting_FG: 0.329
   Playmaking: 0.274
   Shooting_3PT: 0.223
   Offensive_Rebounding: 0.022
   Turnovers: -0.141

✅ Expected Positive: ['Shooting_FG', 'Shooting_3PT', 'Playmaking', 'Offensive_Rebounding']
✅ Expected Negative: ['Turnovers']
📊 Actual Positive: ['Shooting_FG', 'Playmaking', 'Shooting_3PT', 'Offensive_Rebounding']
📊 Actual Negative: ['Turnovers']
🎯 BASKETBALL LOGIC: Turnovers now negatively correlate with efficiency ✓

💾 Saved corrected rate-based data as 'nba_api_corrected_rates.csv'

✅ PHASE 2.2 FIXED COMPLETED!
🚀 Ready for Phase 2.3 with proper basketball logic
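
The cell above rescales assists, turnovers, and offensive rebounds to per-48-minute rates but leaves `PLUS_MINUS` as a season total, since the data was requested with `per_mode_detailed='Totals'`. If the target were to be put on the same footing, one option is to apply the same scaling to it. This is an assumption about a possible refinement, not what the notebook does. The helper name `per_48` and the toy numbers are illustrative.

```python
import pandas as pd

def per_48(total: pd.Series, minutes: pd.Series) -> pd.Series:
    """Scale a counting stat to a per-48-minute rate; non-positive minutes give NaN."""
    safe_minutes = minutes.where(minutes > 0)  # 0 or negative minutes become NaN
    return total / safe_minutes * 48

# Toy lineups: a heavy-minute unit, a short-minute unit, and one with no minutes.
toy = pd.DataFrame({
    "MIN": [288.0, 24.0, 0.0],
    "PLUS_MINUS": [-88.0, 5.0, 0.0],
    "AST": [150.0, 12.0, 0.0],
})
toy["plus_minus_per48"] = per_48(toy["PLUS_MINUS"], toy["MIN"])
toy["ast_per48"] = per_48(toy["AST"], toy["MIN"])
```

Short-minute lineups can produce extreme per-48 values, so a minimum-minutes filter would likely be needed alongside any such change; choosing that cutoff is a separate decision.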


## Phase 2.3: Discretization

In [14]:
# === PHASE 2.3: SMART DISCRETIZATION ===
print("=== PHASE 2.3: SMART DISCRETIZATION ===")

# Load the corrected rate-based data
print("📥 Loading corrected rate-based data...")
rates_data = pd.read_csv('nba_api_corrected_rates.csv')

print(f"📊 Dataset shape: {rates_data.shape}")
print("🎯 Variables to discretize: Efficiency, Shooting_FG, Shooting_3PT, Playmaking, Turnovers, Offensive_Rebounding")

# Define basketball-informed discretization thresholds
print("\n🏀 SETTING BASKETBALL-INFORMED THRESHOLDS:")

discretization_rules = {
    'Efficiency': {
        'description': 'Lineup plus/minus (season total)',
        'Low': ('Below -5', 'Negative impact'),
        'Medium': ('-5 to +5', 'Neutral impact'),
        'High': ('Above +5', 'Positive impact')
    },
    'Shooting_FG': {
        'description': 'Field Goal Percentage',
        'Low': ('Below 45%', 'Poor shooting'),
        'Medium': ('45% to 50%', 'Average shooting'),
        'High': ('Above 50%', 'Elite shooting')
    },
    'Shooting_3PT': {
        'description': '3-Point Percentage',
        'Low': ('Below 35%', 'Poor 3PT'),
        'Medium': ('35% to 40%', 'Average 3PT'),
        'High': ('Above 40%', 'Elite 3PT')
    },
    'Playmaking': {
        'description': 'Assists per 48 minutes',
        'Low': ('Below 15', 'Low playmaking'),
        'Medium': ('15 to 25', 'Average playmaking'),
        'High': ('Above 25', 'High playmaking')
    },
    'Turnovers': {
        'description': 'Turnovers per 48 minutes',
        'Low': ('Below 10', 'Good ball control'),  # Lower turnovers = better
        'Medium': ('10 to 15', 'Average ball control'),
        'High': ('Above 15', 'Poor ball control')  # Higher turnovers = worse
    },
    'Offensive_Rebounding': {
        'description': 'Offensive Rebounds per 48 minutes',
        'Low': ('Below 8', 'Poor offensive rebounding'),
        'Medium': ('8 to 12', 'Average offensive rebounding'),
        'High': ('Above 12', 'Elite offensive rebounding')
    }
}

# Apply discretization
print("\n🔧 APPLYING DISCRETIZATION...")
final_discretized_data = rates_data.copy()

# (low cut, high cut) for each variable; all variables share the same labels.
# pd.cut uses right-closed bins by default, so a value equal to a cut point
# lands in the lower category (e.g. an FG% of exactly 0.45 is 'Low').
thresholds = {
    'Efficiency': (-5, 5),
    'Shooting_FG': (0.45, 0.50),
    'Shooting_3PT': (0.35, 0.40),
    'Playmaking': (15, 25),
    'Turnovers': (10, 15),  # Lower turnovers = "Low" category (good)
    'Offensive_Rebounding': (8, 12),
}
labels = ['Low', 'Medium', 'High']

for column, (low_cut, high_cut) in thresholds.items():
    bins = [-float('inf'), low_cut, high_cut, float('inf')]
    final_discretized_data[column] = pd.cut(final_discretized_data[column], bins=bins, labels=labels)

print("✅ DISCRETIZATION COMPLETED!")

# Check the distribution of discretized variables
print("\n📊 DISCRETIZED DISTRIBUTIONS:")
for column in final_discretized_data.columns:
    counts = final_discretized_data[column].value_counts()
    print(f"{column}:")
    for state in labels:
        count = counts.get(state, 0)
        percentage = count / counts.sum() * 100
        print(f"  {state}: {count} samples ({percentage:.1f}%)")

# Verify we have enough samples in each category
print("\n🔍 SAMPLE SUFFICIENCY CHECK:")
min_samples = 100  # Minimum samples per category for reliable learning
for column in final_discretized_data.columns:
    for state in labels:
        count = (final_discretized_data[column] == state).sum()
        if count < min_samples:
            print(f"⚠️  {column}-{state}: Only {count} samples")
        else:
            print(f"✅ {column}-{state}: {count} samples")

# Save the final discretized data
final_discretized_data.to_csv('final_discretized_nba_data.csv', index=False)
print("\n💾 Saved final discretized data as 'final_discretized_nba_data.csv'")

print("\n✅ PHASE 2.3 COMPLETED!")
print("🚀 Ready for Phase 2.4: Save Processed Data")

=== PHASE 2.3: SMART DISCRETIZATION ===
📥 Loading corrected rate-based data...
📊 Dataset shape: (7500, 6)
🎯 Variables to discretize: Efficiency, Shooting_FG, Shooting_3PT, Playmaking, Turnovers, Offensive_Rebounding

🏀 SETTING BASKETBALL-INFORMED THRESHOLDS:

🔧 APPLYING DISCRETIZATION...
✅ DISCRETIZATION COMPLETED!

📊 DISCRETIZED DISTRIBUTIONS:
Efficiency:
  Low: 1906 samples (25.4%)
  Medium: 3912 samples (52.2%)
  High: 1682 samples (22.4%)
Shooting_FG:
  Low: 3229 samples (43.1%)
  Medium: 1545 samples (20.6%)
  High: 2726 samples (36.3%)
Shooting_3PT:
  Low: 3889 samples (51.9%)
  Medium: 750 samples (10.0%)
  High: 2861 samples (38.1%)
Playmaking:
  Low: 1385 samples (18.5%)
  Medium: 2185 samples (29.1%)
  High: 3930 samples (52.4%)
Turnovers:
  Low: 2707 samples (36.1%)
  Medium: 1935 samples (25.8%)
  High: 2858 samples (38.1%)
Offensive_Rebounding:
  Low: 3118 samples (41.6%)
  Medium: 1565 samples (20.9%)
  High: 2817 samples (37.6%)

🔍 SAMPLE SUFFICIEN
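
The rule descriptions say "Below 45%" and "45% to 50%", but `pd.cut` decides which side of a threshold an exact match falls on. With the default `right=True`, bins are closed on the right, so a value equal to a cut point goes into the lower category. A small sketch with the Shooting_FG thresholds (the sample values are made up to sit on and around the cut points):

```python
import pandas as pd

fg_pct = pd.Series([0.449, 0.45, 0.451, 0.50, 0.501])
bins = [-float("inf"), 0.45, 0.50, float("inf")]
labels = ["Low", "Medium", "High"]

# Default behaviour: bins are (-inf, 0.45], (0.45, 0.50], (0.50, inf)
right_closed = pd.cut(fg_pct, bins=bins, labels=labels).astype(str).tolist()

# Alternative: bins are [-inf, 0.45), [0.45, 0.50), [0.50, inf)
left_closed = pd.cut(fg_pct, bins=bins, labels=labels, right=False).astype(str).tolist()
```

Percentages from small samples often land exactly on round values, so the choice of closed side can shift a noticeable number of rows between categories.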

## Phase 2.4: Save Processed Data & Phase Completion

In [15]:
# === PHASE 2.4: SAVE PROCESSED DATA ===
print("=== PHASE 2.4: SAVE PROCESSED DATA ===")

# Verify the final dataset
print("🔍 FINAL DATASET VERIFICATION:")
print(f"📊 Shape: {final_discretized_data.shape}")
print(f"🎯 Columns: {list(final_discretized_data.columns)}")
print(f"📈 Total samples: {len(final_discretized_data)}")

# Check data types and ensure proper categorical encoding
print("\n🔧 DATA TYPE OPTIMIZATION:")
for col in final_discretized_data.columns:
    unique_vals = final_discretized_data[col].unique()
    print(f"  {col}: {list(unique_vals)} - {final_discretized_data[col].dtype}")

# Convert to categorical with logical order for Bayesian network
print("\n🎯 OPTIMIZING FOR BAYESIAN NETWORK:")
final_processed_data = final_discretized_data.copy()

# Ensure consistent categorical ordering
for col in final_processed_data.columns:
    final_processed_data[col] = pd.Categorical(
        final_processed_data[col],
        categories=['Low', 'Medium', 'High'],
        ordered=True
    )

print("✅ All variables encoded as ordered categoricals")

# Final save. Note: CSV stores only the labels, so the ordered categorical
# dtype is not preserved and must be re-applied after reloading if needed.
final_processed_data.to_csv('nba_lineup_efficiency_final_data.csv', index=False)
print("💾 Saved as 'nba_lineup_efficiency_final_data.csv'")

# Summary statistics
print("\n📋 FINAL DATASET SUMMARY:")
print(f"✅ Samples: {len(final_processed_data):,}")
print(f"✅ Features: {len(final_processed_data.columns)}")
print(f"✅ Data types: All categorical (Low/Medium/High)")
print(f"✅ Basketball logic: Preserved through discretization")
print(f"✅ Ready for Bayesian network training")

print("\n🎉 PHASE 2 COMPLETED SUCCESSFULLY!")
print("🚀 READY FOR PHASE 3: BAYESIAN NETWORK STRUCTURE & LEARNING")

=== PHASE 2.4: SAVE PROCESSED DATA ===
🔍 FINAL DATASET VERIFICATION:
📊 Shape: (7500, 6)
🎯 Columns: ['Efficiency', 'Shooting_FG', 'Shooting_3PT', 'Playmaking', 'Turnovers', 'Offensive_Rebounding']
📈 Total samples: 7500

🔧 DATA TYPE OPTIMIZATION:
  Efficiency: ['Low', 'High', 'Medium'] - category
  Shooting_FG: ['Low', 'Medium', 'High'] - category
  Shooting_3PT: ['Low', 'Medium', 'High'] - category
  Playmaking: ['High', 'Medium', 'Low'] - category
  Turnovers: ['High', 'Medium', 'Low'] - category
  Offensive_Rebounding: ['High', 'Medium', 'Low'] - category

🎯 OPTIMIZING FOR BAYESIAN NETWORK:
✅ All variables encoded as ordered categoricals
💾 Saved as 'nba_lineup_efficiency_final_data.csv'

📋 FINAL DATASET SUMMARY:
✅ Samples: 7,500
✅ Features: 6
✅ Data types: All categorical (Low/Medium/High)
✅ Basketball logic: Preserved through discretization
✅ Ready for Bayesian network training

🎉 PHASE 2 COMPLETED SUCCESSFULLY!
🚀 READY FOR PHASE 3: BAYESIAN 
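
The ordered categorical encoding applied above does not survive the CSV file: only the labels are written, so the next phase reads the columns back as plain strings. If the Low < Medium < High ordering matters downstream, it can be re-applied at load time. A minimal sketch on a toy column, using an in-memory buffer in place of the real file (both are assumptions for illustration):

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

level_dtype = CategoricalDtype(categories=["Low", "Medium", "High"], ordered=True)

toy = pd.DataFrame({"Efficiency": ["Low", "High", "Medium"]}).astype(level_dtype)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain reload: the column is no longer categorical.
buffer.seek(0)
plain = pd.read_csv(buffer)
lost_dtype = not isinstance(plain["Efficiency"].dtype, CategoricalDtype)

# Reload with an explicit dtype to restore categories and their order.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"Efficiency": level_dtype})
is_ordered = restored["Efficiency"].cat.ordered
above_low = (restored["Efficiency"] > "Low").tolist()
```

For fitting discrete CPTs the string labels alone are typically enough; the ordering mainly helps with sorting, comparisons, and consistent display.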

# Phase 3: Bayesian Network Structure & Learning

## Phase 3.1: Design the DAG Structure

In [16]:
# === PHASE 3.1: HIERARCHICAL NETWORK STRUCTURE (RESTART) ===
print("=== PHASE 3.1: HIERARCHICAL NETWORK STRUCTURE ===")

from pgmpy.models import DiscreteBayesianNetwork

# Load the clean, processed data
print("📥 Loading final processed data...")
final_data = pd.read_csv('nba_lineup_efficiency_final_data.csv')

print(f"📊 Dataset shape: {final_data.shape}")
print(f"🎯 Columns: {list(final_data.columns)}")

# Verify data types are correct for Bayesian network
print("\n🔍 DATA TYPE VERIFICATION:")
for col in final_data.columns:
    print(f"  {col}: {final_data[col].dtype} - {list(final_data[col].unique())}")

# Define the same hierarchical structure
print("\n🔗 DESIGNING HIERARCHICAL STRUCTURE...")
print("🏀 BASKETBALL LOGIC:")
print("  Level 0: Shooting_FG, Shooting_3PT, Playmaking, Turnovers, Offensive_Rebounding")
print("  Level 1: Shooting_Quality ← [FG + 3PT], Ball_Control ← [Playmaking - Turnovers]")
print("  Level 2: Efficiency ← [Shooting_Quality + Ball_Control + Second_Chances]")

# Create the hierarchical Bayesian Network structure
hierarchical_model = DiscreteBayesianNetwork([
    # Level 1: Intermediate basketball concepts
    ('Shooting_FG', 'Shooting_Quality'),
    ('Shooting_3PT', 'Shooting_Quality'),
    ('Playmaking', 'Ball_Control'),
    ('Turnovers', 'Ball_Control'),
    ('Offensive_Rebounding', 'Second_Chances'),

    # Level 2: Final efficiency
    ('Shooting_Quality', 'Efficiency'),
    ('Ball_Control', 'Efficiency'),
    ('Second_Chances', 'Efficiency')
])

print("✅ HIERARCHICAL NETWORK STRUCTURE CREATED!")
print(f"📈 Nodes: {hierarchical_model.nodes()}")
print(f"📈 Edges: {hierarchical_model.edges()}")

print("\n🎯 MATHEMATICAL ADVANTAGE:")
print("  • 5 raw skills → 3 intermediate concepts → 1 target")
print("  • Shrinks the Efficiency CPT from 729 entries (3^5 parent configurations x 3 states) to 81 (3^3 x 3)")
print("  • 9x fewer Efficiency entries to estimate (the intermediate CPTs add 63 entries back)")

print("\n✅ PHASE 3.1 COMPLETED SUCCESSFULLY!")
print("🚀 Ready for Phase 3.2: Learn CPTs with clean data")

=== PHASE 3.1: HIERARCHICAL NETWORK STRUCTURE ===
📥 Loading final processed data...
📊 Dataset shape: (7500, 6)
🎯 Columns: ['Efficiency', 'Shooting_FG', 'Shooting_3PT', 'Playmaking', 'Turnovers', 'Offensive_Rebounding']

🔍 DATA TYPE VERIFICATION:
  Efficiency: object - ['Low', 'High', 'Medium']
  Shooting_FG: object - ['Low', 'Medium', 'High']
  Shooting_3PT: object - ['Low', 'Medium', 'High']
  Playmaking: object - ['High', 'Medium', 'Low']
  Turnovers: object - ['High', 'Medium', 'Low']
  Offensive_Rebounding: object - ['High', 'Medium', 'Low']

🔗 DESIGNING HIERARCHICAL STRUCTURE...
🏀 BASKETBALL LOGIC:
  Level 0: Shooting_FG, Shooting_3PT, Playmaking, Turnovers, Offensive_Rebounding
  Level 1: Shooting_Quality ← [FG + 3PT], Ball_Control ← [Playmaking - Turnovers]
  Level 2: Efficiency ← [Shooting_Quality + Ball_Control + Second_Chances]
✅ HIERARCHICAL NETWORK STRUCTURE CREATED!
📈 Nodes: ['Shooting_FG', 'Shooting_Quality', 'Shooting_3PT', 'Playmaking', 'Bal
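
The parameter-count argument for the hierarchy can be checked by counting CPT entries directly: a node with k three-state parents has 3 x 3^k entries. The sketch below compares a flat network, where all five inputs point straight at Efficiency, with the hierarchical structure defined above. The helper `cpt_entries` and the parent maps are written out here for illustration; they mirror the edges in the cell but are not taken from pgmpy.

```python
N_STATES = 3  # every variable is Low / Medium / High

def cpt_entries(parents: dict) -> dict:
    """Number of CPT entries per node, given a {node: [parent names]} map."""
    return {node: N_STATES * N_STATES ** len(pars) for node, pars in parents.items()}

inputs = ["Shooting_FG", "Shooting_3PT", "Playmaking", "Turnovers", "Offensive_Rebounding"]

# Flat alternative: all five inputs are direct parents of Efficiency.
direct = {name: [] for name in inputs}
direct["Efficiency"] = inputs

# Hierarchical structure from the cell above.
hierarchical = {name: [] for name in inputs}
hierarchical["Shooting_Quality"] = ["Shooting_FG", "Shooting_3PT"]
hierarchical["Ball_Control"] = ["Playmaking", "Turnovers"]
hierarchical["Second_Chances"] = ["Offensive_Rebounding"]
hierarchical["Efficiency"] = ["Shooting_Quality", "Ball_Control", "Second_Chances"]

direct_sizes = cpt_entries(direct)
hier_sizes = cpt_entries(hierarchical)

efficiency_ratio = direct_sizes["Efficiency"] / hier_sizes["Efficiency"]
direct_total = sum(direct_sizes.values())
hier_total = sum(hier_sizes.values())
```

Counting every node, the hierarchy still comes out well ahead, although the network-wide saving is smaller than the ratio for the Efficiency table alone.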

## Phase 3.2: Learn Conditional Probability Tables (CPTs)

In [27]:
# === PHASE 3.2 FIXED: BAYESIAN ESTIMATION WITH SMOOTHING ===
print("=== PHASE 3.2 FIXED: BAYESIAN ESTIMATION WITH SMOOTHING ===")

print("🎯 Using Bayesian Estimation for better probability calibration...")

from pgmpy.estimators import BayesianEstimator

# Create optimized intermediate variables
print("📊 Creating optimized intermediate variables...")
hierarchical_data = final_data.copy()

def create_optimized_intermediates(data):
    """Create intermediates with better basketball logic"""
    results = data.copy()

    # Use weighted scoring for more precision
    score_map = {'Low': 0, 'Medium': 1, 'High': 2}

    # Shooting: Weight FG% more than 3PT%
    def shooting_quality(row):
        fg_score = score_map[row['Shooting_FG']] * 1.5  # Weight FG% more
        threept_score = score_map[row['Shooting_3PT']] * 1.0
        total = fg_score + threept_score

        if total >= 4.5:  # High threshold
            return 'High'
        elif total <= 1.5:  # Low threshold
            return 'Low'
        else:
            return 'Medium'

    # Ball Control: Strong emphasis on turnover avoidance
    def ball_control(row):
        pm_score = score_map[row['Playmaking']] * 1.0
        to_score = (2 - score_map[row['Turnovers']]) * 1.5  # Weight turnovers heavier
        total = pm_score + to_score

        if total >= 3.5:
            return 'High'
        elif total <= 1.5:
            return 'Low'
        else:
            return 'Medium'

    # Second Chances: Direct but with efficiency guidance
    def second_chances(row):
        return row['Offensive_Rebounding']  # Keep it simple

    results['Shooting_Quality'] = results.apply(shooting_quality, axis=1)
    results['Ball_Control'] = results.apply(ball_control, axis=1)
    results['Second_Chances'] = results.apply(second_chances, axis=1)

    return results

hierarchical_data = create_optimized_intermediates(hierarchical_data)

print("✅ Optimized intermediates created!")
print(f"📊 Enhanced data shape: {hierarchical_data.shape}")

# Learn CPTs with BAYESIAN ESTIMATION (not MLE)
print("\n🎯 LEARNING CPTs WITH BAYESIAN ESTIMATION...")
print("   Using BDeu prior for smoother probability estimates...")

hierarchical_model.fit(
    hierarchical_data,
    estimator=BayesianEstimator,
    prior_type='BDeu',
    equivalent_sample_size=10  # Smoothing parameter
)

print("✅ CPTs learned with Bayesian smoothing!")

# Create inference engine
from pgmpy.inference import VariableElimination
from sklearn.metrics import accuracy_score  # required by the accuracy check below

inference = VariableElimination(hierarchical_model)

# Test accuracy with Bayesian estimation.
# Note: the model is scored on the same rows it was fit on (in-sample accuracy).
print("📊 TESTING BAYESIAN ESTIMATION ACCURACY...")
bayesian_predictions = []
bayesian_true = []

for idx, row in hierarchical_data.iterrows():
    evidence = {
        'Shooting_FG': row['Shooting_FG'],
        'Shooting_3PT': row['Shooting_3PT'],
        'Playmaking': row['Playmaking'],
        'Turnovers': row['Turnovers'],
        'Offensive_Rebounding': row['Offensive_Rebounding']
    }
    try:
        result = inference.query(variables=['Efficiency'], evidence=evidence)
        # Predict the most probable Efficiency state
        predicted = result.state_names['Efficiency'][result.values.argmax()]
        bayesian_predictions.append(predicted)
        bayesian_true.append(row['Efficiency'])
    except Exception:
        # Skip rows for which inference fails
        continue

bayesian_accuracy = accuracy_score(bayesian_true, bayesian_predictions)

# Check prediction distribution
bayesian_pred_dist = pd.Series(bayesian_predictions).value_counts(normalize=True)

print(f"🎯 BAYESIAN ESTIMATION ACCURACY: {bayesian_accuracy:.1%}")
print(f"📊 PREDICTION DISTRIBUTION: {dict(bayesian_pred_dist)}")

# Compare with previous approaches
print(f"\nüìà ACCURACY COMPARISON:")
print(f"  MLE Hierarchical: 53.9%")
print(f"  MLE Direct: 58.2%")
print(f"  BAYESIAN Hierarchical: {bayesian_accuracy:.1%}")

if bayesian_accuracy > 0.582:
    improvement = (bayesian_accuracy - 0.582) * 100
    print(f"  ✅ IMPROVEMENT: +{improvement:.1f}%")

# Basketball logic validation
print("\n🏀 BAYESIAN MODEL BASKETBALL LOGIC:")
test_cases = [
    ("Elite Shooting", {'Shooting_FG': 'High', 'Shooting_3PT': 'High'}),
    ("Great Ball Control", {'Playmaking': 'High', 'Turnovers': 'Low'}),
    ("Championship Team", {'Shooting_FG': 'High', 'Shooting_3PT': 'High', 'Playmaking': 'High', 'Turnovers': 'Low', 'Offensive_Rebounding': 'High'})
]

for name, evidence in test_cases:
    result = inference.query(variables=['Efficiency'], evidence=evidence)
    high_prob = result.values[result.state_names['Efficiency'].index('High')]
    low_prob = result.values[result.state_names['Efficiency'].index('Low')]
    print(f"  {name}: P(High)={high_prob:.3f}, P(Low)={low_prob:.3f}")

if bayesian_accuracy > 0.65:
    print(f"\n🎉 SUCCESS! Bayesian estimation achieves {bayesian_accuracy:.1%} accuracy!")
    print("🚀 Ready for Phase 3.3 Validation")
else:
    print(f"\n🔧 Bayesian: {bayesian_accuracy:.1%} - Better but needs more work")

print("\n✅ PHASE 3.2 FIXED COMPLETED!")

=== PHASE 3.2 FIXED: BAYESIAN ESTIMATION WITH SMOOTHING ===
🎯 Using Bayesian Estimation for better probability calibration...
📊 Creating optimized intermediate variables...

✅ Optimized intermediates created!
📊 Enhanced data shape: (7500, 9)

🎯 LEARNING CPTs WITH BAYESIAN ESTIMATION...
   Using BDeu prior for smoother probability estimates...
✅ CPTs learned with Bayesian smoothing!
📊 TESTING BAYESIAN ESTIMATION ACCURACY...
🎯 BAYESIAN ESTIMATION ACCURACY: 55.1%
📊 PREDICTION DISTRIBUTION: {'Medium': np.float64(0.7221333333333333), 'High': np.float64(0.1552), 'Low': np.float64(0.12266666666666666)}

📈 ACCURACY COMPARISON:
  MLE Hierarchical: 53.9%
  MLE Direct: 58.2%
  BAYESIAN Hierarchical: 55.1%

🏀 BAYESIAN MODEL BASKETBALL LOGIC:
  Elite Shooting: P(High)=0.429, P(Low)=0.064
  Great Ball Control: P(High)=0.288, P(Low)=0.170
  Championship Team: P(High)=0.555, P(Low)=0.018

🔧 Bayesian: 55.1% - Better but needs more work

✅ PHASE 3.2 FIXED COMPLETED!
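
To see what the BDeu prior does to a sparse CPT column, the sketch below applies the posterior-mean formula by hand: each cell gets a pseudo-count of `equivalent_sample_size / (parent configurations × child states)`. The counts are made up, the helper name `bdeu_cpt_column` is hypothetical, and the 27 parent configurations assume that `Efficiency` has three parents with three states each; none of these figures comes from the fitted model.

```python
import numpy as np

def bdeu_cpt_column(counts, equivalent_sample_size, n_parent_configs):
    """Posterior-mean probabilities for one parent configuration under a BDeu prior.

    counts: observed counts of each child state for this parent configuration.
    """
    counts = np.asarray(counts, dtype=float)
    n_states = counts.size
    # BDeu spreads the equivalent sample size uniformly over every
    # (parent configuration, child state) cell of the CPT.
    pseudo = equivalent_sample_size / (n_parent_configs * n_states)
    return (counts + pseudo) / (counts.sum() + pseudo * n_states)

# Hypothetical counts of Efficiency = (High, Low, Medium) in a rarely seen configuration
rare_counts = [2, 0, 1]
mle = np.array(rare_counts) / sum(rare_counts)
smoothed = bdeu_cpt_column(rare_counts, equivalent_sample_size=10, n_parent_configs=27)

print("MLE:     ", mle.round(3))
print("BDeu(10):", smoothed.round(3))
```

The maximum-likelihood column assigns probability zero to the unseen state, while the smoothed column keeps a small positive mass there. With thousands of rows the pseudo-counts are tiny, which may be one reason the smoothed model scores close to the MLE variants.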


## Phase 3.3: Initial Model Validation

In [24]:
# === PHASE 3.3: INITIAL MODEL VALIDATION ===
print("=== PHASE 3.3: INITIAL MODEL VALIDATION ===")

# TEST 1: MARGINAL PROBABILITIES
print("\n📊 MARGINAL PROBABILITIES:")
efficiency_marginal = inference.query(variables=['Efficiency'])
print("Overall Efficiency Distribution:")
for state, prob in zip(efficiency_marginal.state_names['Efficiency'], efficiency_marginal.values):
    print(f"  P({state}): {prob:.3f}")

# TEST 2: REAL-WORLD BASKETBALL SCENARIOS
print("\n🏀 REAL-WORLD SCENARIOS:")

# Championship team (elite everything)
print("⭐ CHAMPIONSHIP TEAM (Elite across the board):")
evidence_champ = {
    'Shooting_FG': 'High', 'Shooting_3PT': 'High',
    'Playmaking': 'High', 'Turnovers': 'Low',
    'Offensive_Rebounding': 'High'
}
result_champ = inference.query(variables=['Efficiency'], evidence=evidence_champ)
champ_high = result_champ.values[result_champ.state_names['Efficiency'].index('High')]
print(f"  P(High Efficiency): {champ_high:.3f}")

# Rebuilding team (poor everything)
print("\n🔨 REBUILDING TEAM (Poor across the board):")
evidence_rebuild = {
    'Shooting_FG': 'Low', 'Shooting_3PT': 'Low',
    'Playmaking': 'Low', 'Turnovers': 'High',
    'Offensive_Rebounding': 'Low'
}
result_rebuild = inference.query(variables=['Efficiency'], evidence=evidence_rebuild)
rebuild_low = result_rebuild.values[result_rebuild.state_names['Efficiency'].index('Low')]
print(f"  P(Low Efficiency): {rebuild_low:.3f}")

# TEST 3: ACCURACY CHECK
print("\n🎯 TRAINING ACCURACY CHECK:")
from sklearn.metrics import accuracy_score, classification_report

predictions = []
true_labels = []

for idx, row in hierarchical_data.iterrows():
    evidence = {
        'Shooting_FG': row['Shooting_FG'],
        'Shooting_3PT': row['Shooting_3PT'],
        'Playmaking': row['Playmaking'],
        'Turnovers': row['Turnovers'],
        'Offensive_Rebounding': row['Offensive_Rebounding']
    }
    try:
        result = inference.query(variables=['Efficiency'], evidence=evidence)
        predicted = result.state_names['Efficiency'][result.values.argmax()]
        predictions.append(predicted)
        true_labels.append(row['Efficiency'])
    except Exception:  # skip rows the inference engine cannot score
        continue

accuracy = accuracy_score(true_labels, predictions)
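print(f"🎯 TRAINING ACCURACY: {accuracy:.1%}")

print("\n📊 DETAILED PERFORMANCE:")
# Pass labels= so each row is computed for the named class; target_names alone
# only renames scikit-learn's alphabetically sorted classes (High, Low, Medium).
print(classification_report(true_labels, predictions, labels=['High', 'Medium', 'Low']))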

# Compare with previous attempts
print(f"\n📈 ACCURACY IMPROVEMENT:")
print(f"  Previous Best: 54.9%")
print(f"  Current: {accuracy:.1%}")
if accuracy > 0.549:
    improvement = (accuracy - 0.549) * 100
    print(f"  ✅ IMPROVEMENT: +{improvement:.1f}%")
else:
    print(f"  ⚠️  Still below previous best")

print("\n✅ PHASE 3.3 COMPLETED!")
if accuracy > 0.60:
    print("🚀 EXCELLENT MODEL - Ready for Phase 4!")
else:
    print("🔧 Model needs tuning before Phase 4")

=== PHASE 3.3: INITIAL MODEL VALIDATION ===

📊 MARGINAL PROBABILITIES:
Overall Efficiency Distribution:
  P(High): 0.228
  P(Low): 0.226
  P(Medium): 0.546

🏀 REAL-WORLD SCENARIOS:
⭐ CHAMPIONSHIP TEAM (Elite across the board):
  P(High Efficiency): 0.561

🔨 REBUILDING TEAM (Poor across the board):
  P(Low Efficiency): 0.725

🎯 TRAINING ACCURACY CHECK:
🎯 TRAINING ACCURACY: 53.9%

📊 DETAILED PERFORMANCE:
              precision    recall  f1-score   support

        High       0.49      0.15      0.23      1682
      Medium       0.54      0.82      0.65      3912
         Low       0.55      0.30      0.39      1906

    accuracy                           0.54      7500
   macro avg       0.53      0.42      0.42      7500
weighted avg       0.53      0.54      0.49      7500


📈 ACCURACY IMPROVEMENT:
  Previous Best: 54.9%
  Current: 53.9%
  ⚠️  Still below previous best

✅ PHASE 3.3 COMPLETED!
🔧 Model needs tuning before Phase 4
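
A training accuracy is easier to judge next to a majority-class baseline. In the report above the largest class covers 3912 of 7500 rows, so always predicting that class would already score about 52.2%. The sketch below computes such a baseline and a confusion matrix with an explicit label order. It uses toy labels, and the helper name `baseline_and_confusion` is hypothetical rather than part of the notebook.

```python
import pandas as pd

def baseline_and_confusion(y_true, y_pred, labels):
    """Return the majority label, its baseline accuracy, and a labelled confusion matrix."""
    y_true = pd.Series(list(y_true), name="actual")
    y_pred = pd.Series(list(y_pred), name="predicted")
    majority_label = y_true.value_counts().idxmax()
    # Accuracy of a model that always predicts the most frequent class
    baseline_accuracy = (y_true == majority_label).mean()
    # Rows are true classes, columns are predicted classes, both in the given order
    confusion = pd.crosstab(y_true, y_pred).reindex(
        index=labels, columns=labels, fill_value=0
    )
    return majority_label, baseline_accuracy, confusion

# Toy labels standing in for true_labels and predictions
toy_true = ["Medium", "Medium", "High", "Low", "Medium", "High"]
toy_pred = ["Medium", "Medium", "Medium", "Medium", "Low", "High"]

majority, baseline, confusion = baseline_and_confusion(
    toy_true, toy_pred, labels=["High", "Medium", "Low"]
)
print(f"Majority class: {majority}, baseline accuracy: {baseline:.1%}")
print(confusion)
```

In the notebook the same call would take `true_labels` and `predictions`. An off-diagonal column that dominates the matrix shows at a glance which class the model over-predicts.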


# Phase 4: Model Inference & Validation

# Task
Adapt the existing notebook code to use the provided Kaggle datasets ("games.csv", "games_details.csv", and "players.csv") instead of the NBA API data. This involves loading the Kaggle data, merging it to create lineup-level statistics, cleaning and filtering this data, and then adapting the existing data preprocessing, Bayesian network learning, and model inference/validation steps to work with the Kaggle data.

## Load Kaggle data

### Subtask:
Load the relevant Kaggle datasets (`games.csv`, `games_details.csv`, and `players.csv`) into pandas DataFrames.


**Reasoning**:
The subtask is to load the three specified Kaggle datasets. I will use pandas to read the CSV files into DataFrames and display their head and info to verify the loading was successful.



In [None]:
import pandas as pd
import os

# Define the path to the Kaggle data directory
kaggle_data_path = 'kaggle_data'

# Load the datasets
games_df = pd.read_csv(os.path.join(kaggle_data_path, 'games.csv'))
games_details_df = pd.read_csv(os.path.join(kaggle_data_path, 'games_details.csv'))
players_df = pd.read_csv(os.path.join(kaggle_data_path, 'players.csv'))

print("--- games.csv ---")
display(games_df.head())
print("\n--- games_details.csv ---")
display(games_details_df.head())
print("\n--- players.csv ---")
display(players_df.head())

print("\n--- Info for games.csv ---")
games_df.info()
print("\n--- Info for games_details.csv ---")
games_details_df.info()
print("\n--- Info for players.csv ---")
players_df.info()
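
Later steps assume that certain stat columns exist under specific names. A small check right after loading can report a missing or differently named column before any aggregation runs. The sketch below uses a toy frame; the column lists and the helper name `missing_columns` are hypothetical and should be adapted to whatever the real files contain.

```python
import pandas as pd

def missing_columns(frame, required):
    """Return the required column names that are absent from the frame, in order."""
    return [name for name in required if name not in frame.columns]

# Toy stand-in for a loaded box-score table (column names are illustrative only)
toy_details = pd.DataFrame(columns=["GAME_ID", "PLAYER_ID", "MIN", "PTS", "AST"])

# Hypothetical list of columns a downstream step expects
expected = ["PTS", "AST", "TOV"]
missing = missing_columns(toy_details, expected)
if missing:
    print(f"Missing columns: {missing}")
    print(f"Available columns: {toy_details.columns.tolist()}")
```

Printing the available columns alongside the missing ones makes a naming mismatch visible immediately.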

## Merge Kaggle data

### Subtask:
Join the `games` and `games_details` DataFrames to link game information with individual player performance in each game.


**Reasoning**:
Merge the games and games_details dataframes using the GAME_ID column.



In [None]:
# Merge games and games_details on GAME_ID
merged_games_df = pd.merge(games_df, games_details_df, on='GAME_ID', how='inner')

print("--- Merged games_df and games_details_df ---")
display(merged_games_df.head())
print(f"\nShape of merged DataFrame: {merged_games_df.shape}")
print("\n--- Info for merged_games_df ---")
merged_games_df.info()
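
An inner merge silently drops rows without a match and silently multiplies rows if a key is duplicated on the "one" side. The sketch below shows how the `validate` and `indicator` options of `pd.merge` can surface both problems. The toy frames and their values are invented for illustration and do not reflect the Kaggle files.

```python
import pandas as pd

# Toy stand-ins for the game-level and player-level tables
toy_games = pd.DataFrame({"GAME_ID": [1, 2], "SEASON": [2020, 2020]})
toy_details = pd.DataFrame({
    "GAME_ID": [1, 1, 2, 3],
    "PLAYER_ID": [10, 11, 12, 13],
    "PTS": [20, 8, 15, 4],
})

checked = pd.merge(
    toy_games.drop_duplicates(subset="GAME_ID"),  # enforce one row per game
    toy_details,
    on="GAME_ID",
    how="outer",
    validate="one_to_many",  # raises MergeError if GAME_ID repeats on the left
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

unmatched = checked[checked["_merge"] != "both"]
toy_merged = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(f"Matched rows: {len(toy_merged)}, unmatched rows: {len(unmatched)}")
```

Keeping only the `both` rows reproduces the inner join, while `unmatched` records what was dropped and why.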

## Create lineup data from Kaggle

### Subtask:
Aggregate the player-level data from the merged DataFrame into lineup-level statistics by identifying lineups within each game and summing their players' stats. This step is an approximation: the Kaggle data holds per-player box scores and does not explicitly provide lineup information, so lineups have to be simulated or inferred from likely on-court combinations.


**Reasoning**:
Filter out rows with missing minutes, convert 'MIN' to numeric, and then group by game and team to prepare for lineup aggregation.



In [None]:
print("--- Aggregating player data to lineup level ---")

# 1. Filter to include only rows where player minutes ('MIN') are not null
lineup_agg_df = merged_games_df[merged_games_df['MIN'].notnull()].copy()
print(f"Rows after filtering for non-null MIN: {len(lineup_agg_df)}")

# 2. Convert the 'MIN' column from object type to numeric
# Handle potential errors by coercing non-numeric values to NaN and then dropping these rows.
lineup_agg_df['MIN'] = pd.to_numeric(lineup_agg_df['MIN'], errors='coerce')
lineup_agg_df.dropna(subset=['MIN'], inplace=True)
print(f"Rows after converting MIN to numeric and dropping NaNs: {len(lineup_agg_df)}")

# 3. Group the filtered DataFrame by 'GAME_ID' and 'TEAM_ID'
# We will use this grouping to identify players who played in the same game for the same team.
# The aggregation by MIN to simulate lineups will happen within this grouping.
grouped_games_teams = lineup_agg_df.groupby(['GAME_ID', 'TEAM_ID'])

print("\nPrepared data for lineup aggregation by GAME_ID and TEAM_ID.")
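
If the `MIN` column stores playing time as `MM:SS` strings (an assumption about this dataset, not something shown above), `pd.to_numeric(..., errors='coerce')` turns those values into NaN and the following `dropna` discards the rows. The sketch below compares the two behaviours on toy values, using a hypothetical helper `parse_minutes`.

```python
import pandas as pd

def parse_minutes(value):
    """Convert '34:12', '34', or 34.0 to minutes as a float; return NaN otherwise."""
    if pd.isna(value):
        return float("nan")
    text = str(value).strip()
    try:
        if ":" in text:
            minutes, _, seconds = text.partition(":")
            return float(minutes) + float(seconds) / 60.0
        return float(text)
    except ValueError:
        # Non-numeric markers (for example a "did not play" note) become NaN
        return float("nan")

toy_min = pd.Series(["34:12", "5", None, "DNP", "12:30"])

coerced = pd.to_numeric(toy_min, errors="coerce")  # the approach used in the cell above
parsed = toy_min.map(parse_minutes)

print(f"Rows kept by to_numeric:    {coerced.notna().sum()}")
print(f"Rows kept by parse_minutes: {parsed.notna().sum()}")
```

Comparing the two row counts on the real column is a quick way to tell whether the conversion step is discarding usable data.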

**Reasoning**:
Iterate through the game and team groups, identify potential lineups by grouping players with the same minutes, aggregate stats for each lineup, calculate percentages, create a unique lineup identifier, and store the results in a new DataFrame as per steps 4-8 of the instructions.



In [None]:
print("--- Identifying lineups and aggregating stats ---")

lineup_stats_list = []

# Iterate through each game and team
for (game_id, team_id), group in grouped_games_teams:
    # Within each group, identify potential lineups by grouping players with the same minutes
    # Assume players with the same minutes in the same game for the same team played together in a lineup
    lineups_in_game_team = group.groupby('MIN')

    for min_played, lineup_group in lineups_in_game_team:
        # Only consider groups with 5 players to represent 5-man lineups
        if len(lineup_group) == 5:
            # 5. Aggregate their statistics
            aggregated_stats = {
                'GAME_ID': game_id,
                'TEAM_ID': team_id,
                'MIN': min_played,
                'PTS': lineup_group['PTS'].sum(),
                'AST': lineup_group['AST'].sum(),
                'REB': lineup_group['REB'].sum(),
                'TOV': lineup_group['TOV'].sum(),
                'STL': lineup_group['STL'].sum(),
                'BLK': lineup_group['BLK'].sum(),
                'PF': lineup_group['PF'].sum(),
                'FGM': lineup_group['FGM'].sum(),
                'FGA': lineup_group['FGA'].sum(),
                'FG3M': lineup_group['FG3M'].sum(),
                'FG3A': lineup_group['FG3A'].sum(),
                'FTM': lineup_group['FTM'].sum(),
                'FTA': lineup_group['FTA'].sum(),
                'OREB': lineup_group['OREB'].sum(),
                'DREB': lineup_group['DREB'].sum(),
                'PLUS_MINUS': lineup_group['PLUS_MINUS'].sum() # Sum +/- for the lineup's duration
            }

            # 6. Calculate percentages (makes / attempts), guarding against zero attempts
            aggregated_stats['FG_PCT'] = aggregated_stats['FGM'] / aggregated_stats['FGA'] if aggregated_stats['FGA'] > 0 else 0
            aggregated_stats['FG3_PCT'] = aggregated_stats['FG3M'] / aggregated_stats['FG3A'] if aggregated_stats['FG3A'] > 0 else 0
            aggregated_stats['FT_PCT'] = aggregated_stats['FTM'] / aggregated_stats['FTA'] if aggregated_stats['FTA'] > 0 else 0

            # 7. Create a unique identifier for each lineup
            # Sort player IDs to ensure consistent lineup naming regardless of order
            player_ids = sorted(lineup_group['PLAYER_ID'].tolist())
            aggregated_stats['LINEUP_ID'] = '-'.join(map(str, player_ids))
            aggregated_stats['PLAYER_IDS'] = player_ids # Keep list of player IDs

            lineup_stats_list.append(aggregated_stats)

# 8. Store the aggregated lineup statistics in a new DataFrame
lineup_agg_df = pd.DataFrame(lineup_stats_list)

# 9. Display the head and shape of the newly created lineup DataFrame
print("\n✅ Lineup aggregation complete!")
print("\n--- Aggregated Lineup Data ---")
display(lineup_agg_df.head())
print(f"\nShape of aggregated lineup DataFrame: {lineup_agg_df.shape}")


**Reasoning**:
The previous command failed because the column 'TOV' was not found in the `lineup_group` DataFrame. This indicates that the column name in the `games_details.csv` file might be different or missing. I need to inspect the columns of `lineup_group` to identify the correct column name for turnovers and fix the aggregation step.



In [None]:
print("--- Debugging: Inspecting columns in lineup_group ---")
# Read only the header row of games_details.csv to check its column names
games_details_df_check = pd.read_csv(os.path.join(kaggle_data_path, 'games_details.csv'), nrows=0)
print("Columns in games_details.csv:")
print(games_details_df_check.columns.tolist())

print("\nColumns in the filtered data before grouping (lineup_agg_df):")
print(lineup_agg_df.columns.tolist())

# Take a sample group to inspect its columns
for (game_id, team_id), group in grouped_games_teams:
    for min_played, lineup_group_sample in group.groupby('MIN'):
        if len(lineup_group_sample) == 5:
            print(f"\nColumns in a sample lineup_group (GAME_ID: {game_id}, TEAM_ID: {team_id}, MIN: {min_played}):")
            print(lineup_group_sample.columns.tolist())
            break # Just need one sample
    break # Just need one game/team group


**Reasoning**:
The debugging step showed that the turnover column in `games_details.csv`, and therefore in the merged DataFrame, is named 'TO', not 'TOV'. The aggregation step must read 'TO' instead. I will regenerate the code for steps 4-8 with the corrected column name, make sure the percentage calculations guard against division by zero, and then display the head and shape as in step 9.



In [None]:
print("--- Identifying lineups and aggregating stats (Corrected) ---")

lineup_stats_list = []

# Iterate through each game and team
for (game_id, team_id), group in grouped_games_teams:
    # Within each group, identify potential lineups by grouping players with the same minutes
    # Assume players with the same minutes in the same game for the same team played together in a lineup
    lineups_in_game_team = group.groupby('MIN')

    for min_played, lineup_group in lineups_in_game_team:
        # Only consider groups with 5 players to represent 5-man lineups
        if len(lineup_group) == 5:
            # 5. Aggregate their statistics - CORRECTED COLUMN NAME FOR TURNOVERS
            aggregated_stats = {
                'GAME_ID': game_id,
                'TEAM_ID': team_id,
                'MIN': min_played,
                'PTS': lineup_group['PTS'].sum(),
                'AST': lineup_group['AST'].sum(),
                'REB': lineup_group['REB'].sum(),
                'TOV': lineup_group['TO'].sum(), # Corrected: Use 'TO' instead of 'TOV'
                'STL': lineup_group['STL'].sum(),
                'BLK': lineup_group['BLK'].sum(),
                'PF': lineup_group['PF'].sum(),
                'FGM': lineup_group['FGM'].sum(),
                'FGA': lineup_group['FGA'].sum(),
                'FG3M': lineup_group['FG3M'].sum(),
                'FG3A': lineup_group['FG3A'].sum(),
                'FTM': lineup_group['FTM'].sum(),
                'FTA': lineup_group['FTA'].sum(),
                'OREB': lineup_group['OREB'].sum(),
                'DREB': lineup_group['DREB'].sum(),
                'PLUS_MINUS': lineup_group['PLUS_MINUS'].sum() # Sum +/- for the lineup's duration
            }

            # 6. Calculate percentages - Ensure handling of division by zero
            aggregated_stats['FG_PCT'] = aggregated_stats['FGM'] / aggregated_stats['FGA'] if aggregated_stats['FGA'] > 0 else 0
            aggregated_stats['FG3_PCT'] = aggregated_stats['FG3M'] / aggregated_stats['FG3A'] if aggregated_stats['FG3A'] > 0 else 0
            aggregated_stats['FT_PCT'] = aggregated_stats['FTM'] / aggregated_stats['FTA'] if aggregated_stats['FTA'] > 0 else 0

            # 7. Create a unique identifier for each lineup
            # Sort player IDs to ensure consistent lineup naming regardless of order
            player_ids = sorted(lineup_group['PLAYER_ID'].tolist())
            aggregated_stats['LINEUP_ID'] = '-'.join(map(str, player_ids))
            aggregated_stats['PLAYER_IDS'] = player_ids # Keep list of player IDs for potential later use

            lineup_stats_list.append(aggregated_stats)

# 8. Store the aggregated lineup statistics in a new DataFrame
lineup_agg_df = pd.DataFrame(lineup_stats_list)

# 9. Display the head and shape of the newly created lineup DataFrame
print("\n✅ Lineup aggregation complete!")
print("\n--- Aggregated Lineup Data ---")
display(lineup_agg_df.head())
print(f"\nShape of aggregated lineup DataFrame: {lineup_agg_df.shape}")
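
The nested loops above can also be written as a single grouped aggregation. The sketch below is one possible vectorized form on toy data, with a hypothetical helper `aggregate_equal_minute_groups`. It keeps the notebook's heuristic that five teammates with identical minutes form a lineup, which is an assumption: equal minutes do not show that the players were on the floor together.

```python
import pandas as pd

def aggregate_equal_minute_groups(players, stat_columns, group_size=5):
    """Sum stats for groups of teammates who logged identical minutes in a game."""
    grouped = players.groupby(["GAME_ID", "TEAM_ID", "MIN"])
    totals = grouped[stat_columns].sum()
    sizes = grouped.size()
    # Sorted player IDs give the same identifier regardless of row order
    lineup_ids = grouped["PLAYER_ID"].agg(lambda ids: "-".join(map(str, sorted(ids))))
    result = totals.assign(LINEUP_ID=lineup_ids).loc[sizes == group_size].reset_index()
    # Makes divided by attempts, with 0 where there were no attempts
    result["FG_PCT"] = (result["FGM"] / result["FGA"]).where(result["FGA"] > 0, 0.0)
    return result

# Toy box scores: five players with 30 minutes each and one with 12
toy_players = pd.DataFrame({
    "GAME_ID": [1] * 6,
    "TEAM_ID": [100] * 6,
    "PLAYER_ID": [5, 3, 1, 4, 2, 9],
    "MIN": [30.0] * 5 + [12.0],
    "FGM": [4, 3, 5, 2, 6, 1],
    "FGA": [10, 8, 9, 5, 8, 4],
    "TO": [1, 2, 0, 3, 1, 1],
})

toy_lineups = aggregate_equal_minute_groups(toy_players, ["FGM", "FGA", "TO"])
print(toy_lineups)
```

Checking `len(toy_lineups)` against the number of game and team pairs gives a sense of how rarely the equal-minutes condition is met.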

## Clean and filter Kaggle lineup data

### Subtask:
Apply cleaning and filtering steps similar to those in Phase 2.1, ensuring the Kaggle-based lineup data is ready for processing. This includes handling missing values and filtering by minutes played.


**Reasoning**:
Apply cleaning and filtering steps to the Kaggle-based lineup data, including handling missing values and filtering by minutes played, and report the changes in data shape.



In [None]:
print("=== Applying Cleaning and Filtering to Kaggle Lineup Data ===")

# Load the aggregated lineup data (created in the previous step)
# We assume lineup_agg_df is already available from the previous steps.
# If not, you would reload it here, e.g.:
# lineup_agg_df = pd.read_csv('path_to_your_aggregated_lineup_data.csv')

print(f"\nInitial data shape: {lineup_agg_df.shape}")

# Check for missing values before filtering
print("\nChecking missing values before filtering:")
missing_before_filter = lineup_agg_df.isnull().sum()
print(missing_before_filter[missing_before_filter > 0])

# Filter for meaningful lineups (played significant minutes)
print("\nApplying filter: MIN > 20")
filtered_kaggle_lineups = lineup_agg_df[lineup_agg_df['MIN'] > 20].copy()
print(f"Shape after filtering by MIN > 20: {filtered_kaggle_lineups.shape}")

# Check for missing values in critical columns after filtering
critical_columns = ['PLUS_MINUS', 'FG_PCT', 'FG3_PCT', 'AST', 'OREB', 'TOV'] # Note: TOV was 'TO' in raw, but aggregated should be labeled TOV now
print(f"\nChecking missing values in critical columns {critical_columns} after filtering:")
missing_after_filter = filtered_kaggle_lineups[critical_columns].isnull().sum()
print(missing_after_filter[missing_after_filter > 0])


# Remove any remaining rows with missing critical data
print(f"\nRemoving rows with missing values in critical columns: {critical_columns}")
cleaned_kaggle_lineups = filtered_kaggle_lineups.dropna(subset=critical_columns).copy()

print(f"\nFinal data shape after cleaning and filtering: {cleaned_kaggle_lineups.shape}")

# Display a sample of the cleaned data
print("\nSample of the cleaned and filtered Kaggle lineup data:")
display(cleaned_kaggle_lineups.head())

## Adapt Phase 2 (data preprocessing & discretization) for Kaggle data

### Subtask:
Modify the code in Phase 2 cells to use the Kaggle-based lineup data. Select equivalent columns (or the closest available) from the Kaggle data and apply the smart discretization logic.


**Reasoning**:
Apply the smart discretization functions to the relevant columns in the `cleaned_kaggle_lineups` DataFrame and create a new DataFrame with only the discretized columns, then display the head and value counts.



In [None]:
print("=== Applying Smart Discretization to Kaggle Lineup Data ===")

# Ensure numpy is imported for np.select in discretization functions
import numpy as np

# Map the columns in cleaned_kaggle_lineups to the original BN variables
# PLUS_MINUS -> PLUS_MINUS (matches)
# FG_PCT -> FG_PCT (matches)
# FG3_PCT -> FG3_PCT (matches)
# AST -> AST (matches)
# OREB -> OREB (matches)
# TOV -> TOV (read from the raw 'TO' column and stored as 'TOV' during aggregation)

# Defensive rename in case the frame still carries the raw 'TO' column name
if 'TO' in cleaned_kaggle_lineups.columns and 'TOV' not in cleaned_kaggle_lineups.columns:
    cleaned_kaggle_lineups.rename(columns={'TO': 'TOV'}, inplace=True)
    print("✅ Renamed 'TO' column to 'TOV'.")

# Apply the smart discretization logic
print("\n🎯 APPLYING SMART DISCRETIZATION...")

kaggle_discretized_data = pd.DataFrame()

# Discretize Efficiency
kaggle_discretized_data['Efficiency'] = smart_discretize_efficiency(cleaned_kaggle_lineups)

# Discretize Shooting
kaggle_discretized_data['Shooting_FG'] = smart_discretize_shooting(cleaned_kaggle_lineups, 'FG_PCT')
kaggle_discretized_data['Shooting_3PT'] = smart_discretize_shooting(cleaned_kaggle_lineups, 'FG3_PCT')

# Discretize Count Stats (per minute)
# Ensure MIN column is present for per-minute calculation
if 'MIN' not in cleaned_kaggle_lineups.columns:
    print("❌ Error: 'MIN' column not found in cleaned_kaggle_lineups for per-minute calculations.")
else:
    kaggle_discretized_data['Playmaking'] = smart_discretize_count_stats_refined(cleaned_kaggle_lineups, 'AST')
    kaggle_discretized_data['Offensive_Rebounding'] = smart_discretize_count_stats_refined(cleaned_kaggle_lineups, 'OREB')
    kaggle_discretized_data['Turnovers'] = smart_discretize_count_stats_refined(cleaned_kaggle_lineups, 'TOV')

print("\n✅ SMART DISCRETIZATION COMPLETE FOR KAGGLE DATA!")

# Display the head of the discretized data
print("\n🔍 KAGGLE DISCRETIZED DATA PREVIEW:")
display(kaggle_discretized_data.head())

# Print the value counts for each discretized column
print("\n📊 CATEGORY DISTRIBUTIONS IN KAGGLE DATA:")
for col in kaggle_discretized_data.columns:
    distribution = kaggle_discretized_data[col].value_counts().sort_index()
    print(f"   {col}: {dict(distribution)}")
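
Discretization into three states only works when a column has enough distinct values. The sketch below shows a rank-based tercile split that refuses to bin a column with fewer than three distinct values. It is a hypothetical alternative for illustration, not the notebook's `smart_discretize_*` functions, whose definitions appear earlier.

```python
import pandas as pd

def tercile_labels(values, labels=("Low", "Medium", "High")):
    """Split values into equally sized Low/Medium/High groups by rank.

    Assumes no missing values. Raises ValueError when the column has too few
    distinct values to support one state per label.
    """
    values = pd.Series(values, dtype=float)
    if values.nunique() < len(labels):
        raise ValueError("not enough distinct values to form separate bins")
    # Ranking first avoids duplicate bin edges; ties are broken by row order
    ranks = values.rank(method="first")
    return pd.qcut(ranks, q=len(labels), labels=list(labels)).astype(str)

toy_fg_pct = [0.40, 0.45, 0.50, 0.42, 0.55, 0.48]
toy_states = tercile_labels(toy_fg_pct)
print(toy_states.tolist())

# A constant column cannot be split into three states
try:
    tercile_labels([0.0, 0.0, 0.0])
    constant_column_rejected = False
except ValueError:
    constant_column_rejected = True
print(f"Constant column rejected: {constant_column_rejected}")
```

Failing loudly on a constant column keeps a single-state variable from entering the network unnoticed.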


## Adapt Phase 3 (Bayesian network structure & learning) for Kaggle data

### Subtask:
Ensure the Bayesian Network structure and learning process use the discretized data derived from the Kaggle source.


**Reasoning**:
Load the kaggle_discretized_data, define the Bayesian Network structure, learn the CPTs from the data, and create the inference engine.



In [None]:
print("=== PHASE 3.2 (KAGGLE DATA): Learn Conditional Probability Tables (CPTs) ===")

# 1. Load the kaggle_discretized_data DataFrame
# We assume kaggle_discretized_data is available from the previous step.
# If not, you would load it here, e.g.:
# kaggle_discretized_data = pd.read_csv('path_to_kaggle_discretized_data.csv')

print("🔧 Using kaggle_discretized_data for BN learning...")
print(f"📊 Data shape: {kaggle_discretized_data.shape}")
print(f"🎯 Columns: {list(kaggle_discretized_data.columns)}")


# 2. Redefine the hierarchical Bayesian Network structure
print("\n🔗 Redefining hierarchical Bayesian network structure...")
from pgmpy.models import DiscreteBayesianNetwork

# The Phase 3.1 hierarchy routes the raw stats through three intermediate nodes
# ('Shooting_Quality', 'Ball_Control', 'Second_Chances'). kaggle_discretized_data
# has no columns for those nodes, so their CPTs cannot be learned from it.
# Instead we fit a "flattened" version of the hierarchy in which every
# observable stat is a direct parent of Efficiency:
#   Shooting_FG, Shooting_3PT -> Efficiency (conceptually via Shooting_Quality)
#   Playmaking, Turnovers     -> Efficiency (conceptually via Ball_Control)
#   Offensive_Rebounding      -> Efficiency (conceptually via Second_Chances)
# Fitting the exact hierarchical model would require data for the intermediate nodes.
kaggle_bn_edges = [
    ('Shooting_FG', 'Efficiency'),
    ('Shooting_3PT', 'Efficiency'),
    ('Playmaking', 'Efficiency'),
    ('Turnovers', 'Efficiency'),
    ('Offensive_Rebounding', 'Efficiency')
]

kaggle_hierarchical_model = DiscreteBayesianNetwork(kaggle_bn_edges)

print("✅ Flattened network structure defined for Kaggle data!")
print(f"Nodes: {kaggle_hierarchical_model.nodes()}")
print(f"Edges: {kaggle_hierarchical_model.edges()}")


# 3. Adapt the CPT learning step to fit the hierarchical_model using the kaggle_discretized_data DataFrame
print("\n🎯 Learning CPTs from kaggle_discretized_data...")
from pgmpy.estimators import MaximumLikelihoodEstimator # Using MLE for simplicity

# Ensure the data only contains the nodes present in the model
data_for_learning = kaggle_discretized_data[list(kaggle_hierarchical_model.nodes())].copy()

kaggle_hierarchical_model.fit(data_for_learning, estimator=MaximumLikelihoodEstimator)

print("✅ CPTs successfully learned from Kaggle data!")

# 4. Create an inference engine
print("\n🔧 Creating inference engine...")
from pgmpy.inference import VariableElimination
kaggle_inference = VariableElimination(kaggle_hierarchical_model)

print("✅ Inference engine created!")

# Print confirmation messages
print("\n🎉 PHASE 3.2 (KAGGLE DATA) COMPLETED SUCCESSFULLY!")
print("Kaggle data loaded, BN structure defined, CPTs learned, and inference engine created.")
print("\n🚀 Ready for Phase 3.3 (Kaggle Data) Validation")
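
For a node with discrete parents, maximum-likelihood CPT learning amounts to conditional relative frequencies. The sketch below reproduces that by hand for a single parent using `pd.crosstab` on toy rows; the data is invented and the single-parent setup is a simplification of the five-parent model above.

```python
import pandas as pd

# Toy discretized rows with one parent (Shooting_FG) and the child (Efficiency)
toy_rows = pd.DataFrame({
    "Shooting_FG": ["High", "High", "High", "Low", "Low", "Low"],
    "Efficiency": ["High", "High", "Medium", "Low", "Low", "Medium"],
})

# Each column is P(Efficiency | Shooting_FG = column label) estimated by counting
toy_cpt = pd.crosstab(
    toy_rows["Efficiency"], toy_rows["Shooting_FG"], normalize="columns"
)
print(toy_cpt.round(3))

# With five parents of three states each, Efficiency has 3**5 parent configurations
n_parent_configs = 3 ** 5
print(f"Parent configurations to estimate: {n_parent_configs}")
```

Any parent configuration that never occurs in the data has no counts to estimate from, and any child state unseen within a configuration gets probability zero under MLE. With 243 configurations, a small filtered dataset leaves most of the table unsupported, which is worth keeping in mind when reading the validation results.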

**Reasoning**:
Perform initial model validation on the Bayesian Network learned from the Kaggle data by checking marginal probabilities and running basketball logic scenarios.



In [None]:
print("=== PHASE 3.3 (KAGGLE DATA): Initial Model Validation ===")

# TEST 1: MARGINAL PROBABILITIES
print("\n📊 MARGINAL PROBABILITIES (Kaggle Data):")
efficiency_marginal_kaggle = kaggle_inference.query(variables=['Efficiency'])
print("Overall Efficiency Distribution:")
for state, prob in zip(efficiency_marginal_kaggle.state_names['Efficiency'], efficiency_marginal_kaggle.values):
    print(f"  P({state}): {prob:.3f}")

# TEST 2: BASKETBALL LOGIC SCENARIOS (Kaggle Data)
print("\n🏀 BASKETBALL LOGIC TESTS (Kaggle Data):")

# Elite shooting team
# Need to check if 'High' state exists for Shooting_FG and Shooting_3PT in Kaggle data
if 'High' in kaggle_discretized_data['Shooting_FG'].unique() and 'High' in kaggle_discretized_data['Shooting_3PT'].unique():
    evidence_elite_kaggle = {'Shooting_FG': 'High', 'Shooting_3PT': 'High'}
    try:
        result_elite_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence_elite_kaggle)
        elite_high_kaggle = result_elite_kaggle.values[result_elite_kaggle.state_names['Efficiency'].index('High')]
        print(f"Elite Shooting → P(High Efficiency): {elite_high_kaggle:.3f}")
    except Exception as e:
        print(f"  Could not query for Elite Shooting: {e}")
else:
    print("  Cannot test Elite Shooting scenario: 'High' state not present in Kaggle data for Shooting_FG or Shooting_3PT.")


# Poor ball control team
# Need to check if 'Low' state exists for Playmaking and 'High' for Turnovers
if 'Low' in kaggle_discretized_data['Playmaking'].unique() and 'High' in kaggle_discretized_data['Turnovers'].unique():
    evidence_poor_ball_kaggle = {'Playmaking': 'Low', 'Turnovers': 'High'}
    try:
        result_poor_ball_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence_poor_ball_kaggle)
        poor_ball_low_kaggle = result_poor_ball_kaggle.values[result_poor_ball_kaggle.state_names['Efficiency'].index('Low')]
        print(f"Poor Ball Control → P(Low Efficiency): {poor_ball_low_kaggle:.3f}")
    except Exception as e:
        print(f"  Could not query for Poor Ball Control: {e}")
else:
    print("  Cannot test Poor Ball Control scenario: required states not present in Kaggle data for Playmaking or Turnovers.")


# TEST 3: NETWORK PATH VERIFICATION (Kaggle Data)
print("\n🔗 NETWORK PATH VERIFICATION (Kaggle Data):")
paths_kaggle = [
    ("Shooting_FG → Efficiency", {'Shooting_FG': 'High'}),
    ("Shooting_3PT → Efficiency", {'Shooting_3PT': 'High'}),
    ("Playmaking → Efficiency", {'Playmaking': 'High'}),
    ("Turnovers → Efficiency", {'Turnovers': 'Low'}), # Low turnovers is good
    ("Offensive_Rebounding → Efficiency", {'Offensive_Rebounding': 'High'})
]

for path_name, evidence in paths_kaggle:
    # Check if the evidence state exists in the data
    valid_evidence = True
    for var, state in evidence.items():
        if state not in kaggle_discretized_data[var].unique():
            print(f"  Skipping '{path_name}': Evidence state '{state}' not present for '{var}' in Kaggle data.")
            valid_evidence = False
            break

    if valid_evidence:
        try:
            result_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence)
            # Check if 'High' state exists in the query result
            if 'High' in result_kaggle.state_names['Efficiency']:
                high_prob_kaggle = result_kaggle.values[result_kaggle.state_names['Efficiency'].index('High')]
                baseline_kaggle = efficiency_marginal_kaggle.values[efficiency_marginal_kaggle.state_names['Efficiency'].index('High')]
                impact_kaggle = high_prob_kaggle - baseline_kaggle
                # The + format flag prints the sign itself, so a negative impact is not shown as "+-"
                print(f"  {path_name}: {impact_kaggle:+.3f} {'✓' if impact_kaggle > 0 else '❌'}")
            else:
                print(f"  Skipping '{path_name}': 'High' state not in query result for Efficiency.")
        except Exception as e:
            print(f"  Could not query for '{path_name}': {e}")


print("\n✅ PHASE 3.3 (Kaggle Data) VALIDATION COMPLETED!")
print("Results provide initial insights into how variables influence Efficiency in the Kaggle data.")

## Adapt Phase 2 (data preprocessing & discretization) for Kaggle data

### Subtask:
Revisit the Phase 2 preprocessing and discretization for the Kaggle data, whose raw values show limited variation.


**Reasoning**:
Reload the cleaned Kaggle lineup data and examine the distribution of the raw numerical values to understand the limited variation and inform the revised discretization strategy.



In [None]:
print("=== Revising Discretization for Kaggle Data ===")

# cleaned_kaggle_lineups is assumed to still be in memory from the earlier cleaning step.
# If it is not, recreate it from the merged Kaggle data before running this cell.

print("🔧 Examining raw data distribution before revised discretization...")
print(f"Shape of cleaned_kaggle_lineups: {cleaned_kaggle_lineups.shape}")

# Re-examine the distribution of raw numerical values
raw_columns_to_discretize = ['PLUS_MINUS', 'FG_PCT', 'FG3_PCT', 'AST', 'OREB', 'TOV'] # Using TOV after renaming

print("\n📊 Statistical summary of raw numerical data:")
display(cleaned_kaggle_lineups[raw_columns_to_discretize + ['MIN']].describe())

print("\n📈 Value counts for raw numerical data (showing unique values if few):")
for col in raw_columns_to_discretize:
    unique_values = cleaned_kaggle_lineups[col].nunique()
    print(f"   {col}: {unique_values} unique values")
    if unique_values <= 15: # Show values if there are not too many unique ones
        print(f"      Unique values: {sorted(cleaned_kaggle_lineups[col].unique())}")

print("\n✅ Raw data examination complete. Planning revised discretization.")

**Reasoning**:
The raw data analysis shows that the 'PLUS_MINUS' column has only one unique value (0.0), so 'Efficiency' cannot be split into multiple categories. The other columns also have few unique values. The discretization strategy therefore needs to be revised for this small, low-variation dataset: use simple min/max-based binning for variables with more than one unique value, and accept that 'Efficiency' cannot be discretized meaningfully with this specific Kaggle subset.
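
To make the degenerate case concrete, here is a minimal pure-Python sketch of equal-width binning. The function name `equal_width_labels` is hypothetical, and it only approximates what `pd.cut` does (pandas pads the range slightly and uses right-closed intervals), but it shows why a constant column such as 'PLUS_MINUS' can only ever produce one category.

```python
def equal_width_labels(values, labels=("Low", "Medium", "High")):
    """Assign each value to one of len(labels) equal-width bins.

    Illustrative only: a constant column collapses to the middle label,
    mirroring the single-category fallback described above.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # No spread at all, so there is nothing to separate into bins.
        return [labels[len(labels) // 2]] * len(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        # Clamp so the maximum value lands in the last bin, not past it.
        idx = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[idx])
    return result


print(equal_width_labels([0.0, 0.0, 0.0]))  # constant column
print(equal_width_labels([0, 5, 10]))       # column with real spread
```

A node that ends up with a single state carries no information for the network, whichever binning rule is used.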



In [None]:
print("\n=== Applying Revised Discretization for Kaggle Data ===")
print("Addressing the limited data points and single value in PLUS_MINUS.")

# Create fresh copy for revised discretization
kaggle_discretized_data = pd.DataFrame()

# Revised Discretization Strategy:
# - For PLUS_MINUS: Given only one unique value (0.0), we cannot create multiple categories.
#   We will assign a single category, but acknowledge this limitation.
# - For other variables: Use simple binning based on min/max values if more than one unique value exists.
#   Aim for 2 or 3 bins if feasible, prioritizing creating at least two categories.

def revised_discretize_simple(data, column, num_bins=3, labels=['Low', 'Medium', 'High']):
    """Simple binning based on min/max for small datasets."""
    unique_values = data[column].nunique()
    if unique_values <= 1:
        print(f"   {column}: Only 1 unique value ({data[column].iloc[0]}). Cannot discretize into multiple categories.")
        # Assign a single category, e.g., 'Single' or the label for the median bin
        return pd.Series([labels[len(labels)//2]] * len(data), index=data.index)
    elif unique_values < num_bins:
        # Fewer unique values than bins: give each unique value its own ordered label.
        # (Sorted unique values used as pd.cut edges would give only unique_values - 1
        # intervals, one too few for the labels.)
        print(f"   {column}: {unique_values} unique values. Using {unique_values} categories based on unique values.")
        sorted_values = sorted(data[column].dropna().unique())
        current_labels = ['Low', 'High'] if unique_values == 2 else labels[:unique_values]
        discretized = pd.Series(
            pd.Categorical(data[column].map(dict(zip(sorted_values, current_labels))),
                           categories=current_labels, ordered=True),
            index=data.index)

    else:
        # Use quantile-based or simple cut if enough unique values
        try:
            # Try quantile first for better distribution if possible
            discretized = pd.qcut(data[column], q=num_bins, labels=labels, duplicates='drop')
            if len(discretized.cat.categories) < num_bins:
                 print(f"   {column}: Quantile binning resulted in fewer than {num_bins} categories. Using simple cut.")
                 discretized = pd.cut(data[column], bins=num_bins, labels=labels, include_lowest=True)
        except Exception as e:
            print(f"Warning: Quantile binning failed for {column}. Using simple cut. Error: {e}")
            discretized = pd.cut(data[column], bins=num_bins, labels=labels, include_lowest=True)


    print(f"   {column}:")
    print(f"      Range: {data[column].min():.2f} to {data[column].max():.2f}")
    print(f"      Distribution: {discretized.value_counts().sort_index().to_dict()}")
    return discretized

def revised_discretize_count_stats_simple(data, column, num_bins=3, labels=['Low', 'Medium', 'High']):
    """Simple binning for count stats per minute for small datasets."""
    # Convert to per-minute rates
    # Handle cases where MIN might be 0, although filtered data should have MIN > 20
    rate_data = data[column] / data['MIN'] if (data['MIN'] > 0).all() else data[column]

    unique_values = rate_data.nunique()
    if unique_values <= 1:
        print(f"   {column} (rate): Only 1 unique value ({rate_data.iloc[0]:.4f}). Cannot discretize into multiple categories.")
        # Assign a single category
        return pd.Series([labels[len(labels)//2]] * len(data), index=data.index)
    elif unique_values < num_bins:
        # Fewer unique rates than bins: give each unique rate its own ordered label.
        # (Sorted unique values used as pd.cut edges would give only unique_values - 1
        # intervals, one too few for the labels.)
        print(f"   {column} (rate): {unique_values} unique values. Using {unique_values} categories based on unique values.")
        sorted_values = sorted(rate_data.dropna().unique())
        current_labels = ['Low', 'High'] if unique_values == 2 else labels[:unique_values]
        discretized = pd.Series(
            pd.Categorical(rate_data.map(dict(zip(sorted_values, current_labels))),
                           categories=current_labels, ordered=True),
            index=rate_data.index)

    else:
        # Use quantile-based or simple cut if enough unique values
        try:
            discretized = pd.qcut(rate_data, q=num_bins, labels=labels, duplicates='drop')
            if len(discretized.cat.categories) < num_bins:
                 print(f"   {column} (rate): Quantile binning resulted in fewer than {num_bins} categories. Using simple cut.")
                 discretized = pd.cut(rate_data, bins=num_bins, labels=labels, include_lowest=True)
        except Exception as e:
            print(f"Warning: Quantile binning failed for {column} rate. Using simple cut. Error: {e}")
            discretized = pd.cut(rate_data, bins=num_bins, labels=labels, include_lowest=True)


    # For TOV, 'Low' rate is good, 'High' rate is bad. Re-map if necessary.
    if column == 'TOV' and isinstance(discretized.dtype, pd.CategoricalDtype):
        # Check if labels are in expected order from qcut/cut (Low rate = 'Low' label)
        if list(discretized.cat.categories) == ['Low', 'Medium', 'High']:
             discretized = discretized.cat.rename_categories({'Low': 'High', 'Medium': 'Medium', 'High': 'Low'})
             print("      Note: TOV labels reversed (Low rate = High category, High rate = Low category)")
        elif list(discretized.cat.categories) == ['Low', 'High']:
             discretized = discretized.cat.rename_categories({'Low': 'High', 'High': 'Low'})
             print("      Note: TOV labels reversed (Low rate = High category, High rate = Low category)")


    print(f"   {column} (rate):")
    print(f"      Range: {rate_data.min():.4f} to {rate_data.max():.4f}")
    if hasattr(discretized, 'cat'):
         print(f"      Categories: {list(discretized.cat.categories)}")
    print(f"      Distribution: {discretized.value_counts().sort_index().to_dict()}")


    return discretized


# Apply the revised simple discretization
print("\n🎯 APPLYING REVISED SIMPLE DISCRETIZATION...")

# Efficiency (PLUS_MINUS) - Will only have one category
kaggle_discretized_data['Efficiency'] = revised_discretize_simple(cleaned_kaggle_lineups, 'PLUS_MINUS', num_bins=3)

# Shooting Percentages
kaggle_discretized_data['Shooting_FG'] = revised_discretize_simple(cleaned_kaggle_lineups, 'FG_PCT', num_bins=3)
kaggle_discretized_data['Shooting_3PT'] = revised_discretize_simple(cleaned_kaggle_lineups, 'FG3_PCT', num_bins=3)

# Count Stats (per minute)
kaggle_discretized_data['Playmaking'] = revised_discretize_count_stats_simple(cleaned_kaggle_lineups, 'AST', num_bins=3)
kaggle_discretized_data['Offensive_Rebounding'] = revised_discretize_count_stats_simple(cleaned_kaggle_lineups, 'OREB', num_bins=3)
kaggle_discretized_data['Turnovers'] = revised_discretize_count_stats_simple(cleaned_kaggle_lineups, 'TOV', num_bins=3)


print("\n✅ REVISED SIMPLE DISCRETIZATION COMPLETE FOR KAGGLE DATA!")

# Verify that each resulting categorical variable has at least two distinct states (except Efficiency)
print("\n📊 REVISED CATEGORY DISTRIBUTIONS IN KAGGLE DATA:")
for col in kaggle_discretized_data.columns:
    distribution = kaggle_discretized_data[col].value_counts().sort_index()
    print(f"   {col}: {dict(distribution)}")
    if col != 'Efficiency' and len(distribution) < 2:
        print(f"      ⚠️ Warning: {col} has fewer than 2 categories after discretization.")


print("\n🔍 REVISED KAGGLE DATA PREVIEW:")
display(kaggle_discretized_data.head())
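
The TOV relabeling in the cell above changes what the `Turnovers` labels mean, so it helps to spell out the mapping. A minimal sketch of the same reversal on plain Python lists (the helper `reverse_labels` and the sample values are hypothetical, not part of the notebook):

```python
def reverse_labels(labels, order=("Low", "Medium", "High")):
    """Swap each label with its mirror image in `order` (Low <-> High, Medium stays)."""
    mirror = dict(zip(order, reversed(order)))
    return [mirror[label] for label in labels]


# Bins of the raw turnover rate (made-up values for illustration)
turnover_rate_bins = ["Low", "High", "Medium", "Low"]

# Labels as they would be stored in the Turnovers node after the reversal
turnovers_node = reverse_labels(turnover_rate_bins)
print(turnovers_node)
```

Whenever the reversal is applied, `'High'` in the `Turnovers` node stands for a low turnover rate, so evidence such as `{'Turnovers': 'Low'}` selects lineups that turn the ball over often.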


## Adapt Phase 3 (Bayesian Network Structure & Learning) for Kaggle Data

### Subtask:
Adapt Phase 3 (Bayesian network structure and learning) for the Kaggle data.


**Reasoning**:
Adapt Phase 3 (Bayesian Network Structure & Learning) to the `kaggle_discretized_data` DataFrame: define the network structure, learn the CPTs, and create an inference engine.
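
For intuition, maximum likelihood estimation of a CPT amounts to counting: each conditional probability is the relative frequency of a child state within the rows that share a parent configuration. The sketch below is a pure-Python illustration of that idea, not pgmpy's implementation; the function `mle_cpt` and the toy rows are made up for the example.

```python
from collections import Counter, defaultdict
from math import prod


def mle_cpt(rows, child, parents):
    """Estimate P(child | parents) by relative frequency (maximum likelihood)."""
    counts = defaultdict(Counter)
    for row in rows:
        config = tuple(row[p] for p in parents)
        counts[config][row[child]] += 1
    return {
        config: {state: n / sum(c.values()) for state, n in c.items()}
        for config, c in counts.items()
    }


# Toy rows (invented for illustration, not Kaggle data)
rows = [
    {"Shooting_FG": "High", "Efficiency": "High"},
    {"Shooting_FG": "High", "Efficiency": "High"},
    {"Shooting_FG": "High", "Efficiency": "Low"},
    {"Shooting_FG": "Low", "Efficiency": "Low"},
]
cpt = mle_cpt(rows, child="Efficiency", parents=["Shooting_FG"])
print(cpt[("High",)])

# Five parents with three states each give this many parent configurations
n_parent_configs = prod([3] * 5)
print(n_parent_configs)
```

With five parents of up to three states each, `Efficiency` can have up to 243 parent configurations, so on a small dataset most configurations will have no supporting rows at all.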



In [None]:
print("=== PHASE 3 (KAGGLE DATA): Bayesian Network Structure & Learning ===")
print("=== Sub-phase 3.1 & 3.2: Define Structure & Learn CPTs ===")

# 1. Load the kaggle_discretized_data DataFrame
# We assume kaggle_discretized_data is available from the previous step.
# If not, you would load it here, e.g.:
# kaggle_discretized_data = pd.read_csv('path_to_kaggle_discretized_data.csv')

print("🔧 Using kaggle_discretized_data for BN learning...")
print(f"📊 Data shape: {kaggle_discretized_data.shape}")
print(f"🎯 Columns: {list(kaggle_discretized_data.columns)}")


# 2. Define the Bayesian Network structure
print("\n🔗 Defining Bayesian network structure for Kaggle data...")

# Based on the available nodes in kaggle_discretized_data and the conceptual
# hierarchical structure, we define the direct dependencies available in the data.
# The intermediate nodes are implicit in this structure.
# Edges: Statistical categories -> Efficiency
kaggle_bn_edges = [
    ('Shooting_FG', 'Efficiency'),
    ('Shooting_3PT', 'Efficiency'),
    ('Playmaking', 'Efficiency'),
    ('Turnovers', 'Efficiency'),
    ('Offensive_Rebounding', 'Efficiency')
]

kaggle_model = BayesianNetwork(kaggle_bn_edges)

print("✅ Bayesian network structure defined for Kaggle data!")
print(f"Nodes: {kaggle_model.nodes()}")
print(f"Edges: {kaggle_model.edges()}")


# 3. Adapt the CPT learning step
print("\n🎯 Learning CPTs from kaggle_discretized_data...")

# Ensure the data only contains the nodes present in the model
data_for_learning = kaggle_discretized_data[list(kaggle_model.nodes())].copy()

# Use MaximumLikelihoodEstimator to learn CPTs from the data
kaggle_model.fit(data_for_learning, estimator=MaximumLikelihoodEstimator)

print("✅ CPTs successfully learned from Kaggle data!")

# 4. Create an inference engine
print("\n🔧 Creating inference engine...")
kaggle_inference = VariableElimination(kaggle_model)

print("✅ Inference engine created!")

# Print confirmation messages for the subtask completion
print("\n🎉 SUBTASK: Adapt phase 3 (Bayesian Network Structure & Learning) for Kaggle data COMPLETED SUCCESSFULLY!")

**Reasoning**:
The previous code failed because `BayesianNetwork` was deprecated. The error message explicitly suggests using `DiscreteBayesianNetwork`. I need to correct the import and the class name used for creating the model.
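
If the notebook has to run against more than one pgmpy release, one option is to resolve the class name at import time instead of hard-coding it. The sketch below is an assumption-laden illustration: the helper `first_available` is hypothetical, and it only checks which names a module exposes, without guaranteeing that the resolved class behaves identically across versions.

```python
import importlib


def first_available(module_name, candidate_names):
    """Return the first attribute in candidate_names that module_name exposes, else None."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module is not installed in this environment.
        return None
    for name in candidate_names:
        if hasattr(module, name):
            return getattr(module, name)
    return None


# Prefer the newer class name; fall back to the older one if it is all that exists.
BNClass = first_available("pgmpy.models", ["DiscreteBayesianNetwork", "BayesianNetwork"])
print(BNClass)
```

Listing `DiscreteBayesianNetwork` first matters here, since the error described above was raised by the older name.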



In [None]:
print("=== PHASE 3 (KAGGLE DATA): Bayesian Network Structure & Learning ===")
print("=== Sub-phase 3.1 & 3.2: Define Structure & Learn CPTs ===")

# 1. Load the kaggle_discretized_data DataFrame
# We assume kaggle_discretized_data is available in the environment from the previous step.
# No need to load again if it's already a variable.

print("🔧 Using kaggle_discretized_data for BN learning...")
print(f"📊 Data shape: {kaggle_discretized_data.shape}")
print(f"🎯 Columns: {list(kaggle_discretized_data.columns)}")


# 2. Define the Bayesian Network structure (Corrected)
print("\n🔗 Defining Bayesian network structure for Kaggle data...")

# Based on the available nodes in kaggle_discretized_data and the conceptual
# hierarchical structure, we define the direct dependencies available in the data.
# The intermediate nodes are implicit in this structure.
# Edges: Statistical categories -> Efficiency
kaggle_bn_edges = [
    ('Shooting_FG', 'Efficiency'),
    ('Shooting_3PT', 'Efficiency'),
    ('Playmaking', 'Efficiency'),
    ('Turnovers', 'Efficiency'),
    ('Offensive_Rebounding', 'Efficiency')
]

# Use the correct class name: DiscreteBayesianNetwork
from pgmpy.models import DiscreteBayesianNetwork

kaggle_model = DiscreteBayesianNetwork(kaggle_bn_edges)

print("✅ Bayesian network structure defined for Kaggle data!")
print(f"Nodes: {kaggle_model.nodes()}")
print(f"Edges: {kaggle_model.edges()}")


# 3. Adapt the CPT learning step
print("\n🎯 Learning CPTs from kaggle_discretized_data...")

# Ensure the data only contains the nodes present in the model
data_for_learning = kaggle_discretized_data[list(kaggle_model.nodes())].copy()

# Use MaximumLikelihoodEstimator to learn CPTs from the data
from pgmpy.estimators import MaximumLikelihoodEstimator # Keep import here for clarity within this block if it wasn't in the very first import cell

kaggle_model.fit(data_for_learning, estimator=MaximumLikelihoodEstimator)

print("✅ CPTs successfully learned from Kaggle data!")

# 4. Create an inference engine
print("\n🔧 Creating inference engine...")
from pgmpy.inference import VariableElimination # Keep import here for clarity

kaggle_inference = VariableElimination(kaggle_model)

print("✅ Inference engine created!")

# Print confirmation messages for the subtask completion
print("\n🎉 SUBTASK: Adapt phase 3 (Bayesian Network Structure & Learning) for Kaggle data COMPLETED SUCCESSFULLY!")

## Adapt Phase 4 (Model Inference & Validation) for Kaggle Data

### Subtask:
Adapt the inference and validation code to work with the Kaggle-based model.


**Reasoning**:
Implement the validation steps for the Kaggle-based Bayesian Network model using the kaggle_inference engine.



In [None]:
print("=== PHASE 4 (KAGGLE DATA): Model Inference & Validation ===")
print("=== Sub-phase 4.1: Sanity Checks ===")

# TEST 1: MARGINAL PROBABILITIES
print("\n📊 MARGINAL PROBABILITIES (Kaggle Data):")
# Check if the 'Efficiency' node exists in the model
if 'Efficiency' in kaggle_model.nodes():
    try:
        efficiency_marginal_kaggle = kaggle_inference.query(variables=['Efficiency'])
        print("Overall Efficiency Distribution:")
        for state, prob in zip(efficiency_marginal_kaggle.state_names['Efficiency'], efficiency_marginal_kaggle.values):
            print(f"  P({state}): {prob:.3f}")
    except Exception as e:
        print(f"  Error querying marginal probability for Efficiency: {e}")
else:
    print("  'Efficiency' node not found in the Kaggle model.")


# TEST 2: BASKETBALL LOGIC SCENARIOS (Kaggle Data)
print("\n🏀 BASKETBALL LOGIC TESTS (Kaggle Data):")

# Elite shooting team
evidence_elite_kaggle = {'Shooting_FG': 'High', 'Shooting_3PT': 'High'}
valid_evidence_elite = True
for var, state in evidence_elite_kaggle.items():
    if var not in kaggle_model.nodes() or state not in kaggle_model.get_cardinality([var])[var]:
        print(f"  Cannot test Elite Shooting scenario: State '{state}' not present for '{var}' in Kaggle model.")
        valid_evidence_elite = False
        break

if valid_evidence_elite:
    try:
        result_elite_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence_elite_kaggle)
        # Check if 'High' state exists in the query result for Efficiency
        if 'High' in result_elite_kaggle.state_names['Efficiency']:
            elite_high_kaggle = result_elite_kaggle.values[result_elite_kaggle.state_names['Efficiency'].index('High')]
            print(f"Elite Shooting → P(High Efficiency): {elite_high_kaggle:.3f}")
        else:
            print(f"  'High' efficiency state not in result for Elite Shooting scenario.")
    except Exception as e:
        print(f"  Could not query for Elite Shooting: {e}")

# Poor ball control team
evidence_poor_ball_kaggle = {'Playmaking': 'Low', 'Turnovers': 'High'}
valid_evidence_poor_ball = True
for var, state in evidence_poor_ball_kaggle.items():
    if var not in kaggle_model.nodes() or state not in kaggle_model.get_cardinality([var])[var]:
        print(f"  Cannot test Poor Ball Control scenario: State '{state}' not present for '{var}' in Kaggle model.")
        valid_evidence_poor_ball = False
        break

if valid_evidence_poor_ball:
    try:
        result_poor_ball_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence_poor_ball_kaggle)
        # Check if 'Low' state exists in the query result for Efficiency
        if 'Low' in result_poor_ball_kaggle.state_names['Efficiency']:
            poor_ball_low_kaggle = result_poor_ball_kaggle.values[result_poor_ball_kaggle.state_names['Efficiency'].index('Low')]
            print(f"Poor Ball Control → P(Low Efficiency): {poor_ball_low_kaggle:.3f}")
        else:
            print(f"  'Low' efficiency state not in result for Poor Ball Control scenario.")
    except Exception as e:
        print(f"  Could not query for Poor Ball Control: {e}")


# TEST 3: NETWORK PATH VERIFICATION (Kaggle Data)
print("\n🔗 NETWORK PATH VERIFICATION (Kaggle Data):")
paths_kaggle = [
    ("Shooting_FG → Efficiency", {'Shooting_FG': 'High'}),
    ("Shooting_3PT → Efficiency", {'Shooting_3PT': 'High'}),
    ("Playmaking → Efficiency", {'Playmaking': 'High'}),
    ("Turnovers → Efficiency", {'Turnovers': 'Low'}), # Low turnovers is good
    ("Offensive_Rebounding → Efficiency", {'Offensive_Rebounding': 'High'})
]

# Get baseline marginal probability for 'High' Efficiency if it exists
baseline_kaggle_high = 0
if 'Efficiency' in kaggle_model.nodes() and 'High' in kaggle_model.get_cardinality(['Efficiency'])['Efficiency']:
     try:
        baseline_marginal = kaggle_inference.query(variables=['Efficiency'])
        baseline_kaggle_high = baseline_marginal.values[baseline_marginal.state_names['Efficiency'].index('High')]
     except Exception as e:
         print(f"  Could not get baseline marginal for High Efficiency: {e}")


for path_name, evidence in paths_kaggle:
    # Check if the evidence state exists in the model
    valid_evidence = True
    for var, state in evidence.items():
        if var not in kaggle_model.nodes() or state not in kaggle_model.get_cardinality([var])[var]:
            print(f"  Skipping '{path_name}': Evidence state '{state}' not present for '{var}' in Kaggle model.")
            valid_evidence = False
            break

    if valid_evidence:
        try:
            result_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence)
            # Check if 'High' state exists in the query result
            if 'High' in result_kaggle.state_names['Efficiency']:
                high_prob_kaggle = result_kaggle.values[result_kaggle.state_names['Efficiency'].index('High')]
                impact_kaggle = high_prob_kaggle - baseline_kaggle_high
                print(f"  {path_name}: {impact_kaggle:+.3f} {'✓' if impact_kaggle > 0 else '❌'}")
            else:
                 print(f"  Skipping '{path_name}': 'High' state not in query result for Efficiency.")
        except Exception as e:
            print(f"  Could not query for '{path_name}': {e}")


print("\n✅ PHASE 4 (Kaggle Data) VALIDATION COMPLETED!")

**Reasoning**:
The error "TypeError: unhashable type: 'list'" occurs because `get_cardinality` is called with a list `[var]` instead of the node name `var`. Testing `state in ...` against the result is also incorrect, because the cardinality is the number of states of a node, not the state names themselves. I need to fix how the possible states are retrieved and how the existence of a state is checked.



In [None]:
print("=== PHASE 4 (KAGGLE DATA): Model Inference & Validation ===")
print("=== Sub-phase 4.1: Sanity Checks ===")

# TEST 1: MARGINAL PROBABILITIES
print("\n📊 MARGINAL PROBABILITIES (Kaggle Data):")
# Check if the 'Efficiency' node exists in the model
if 'Efficiency' in kaggle_model.nodes():
    try:
        efficiency_marginal_kaggle = kaggle_inference.query(variables=['Efficiency'])
        print("Overall Efficiency Distribution:")
        for state, prob in zip(efficiency_marginal_kaggle.state_names['Efficiency'], efficiency_marginal_kaggle.values):
            print(f"  P({state}): {prob:.3f}")
    except Exception as e:
        print(f"  Error querying marginal probability for Efficiency: {e}")
else:
    print("  'Efficiency' node not found in the Kaggle model.")


# TEST 2: BASKETBALL LOGIC SCENARIOS (Kaggle Data)
print("\n🏀 BASKETBALL LOGIC TESTS (Kaggle Data):")

# Elite shooting team
evidence_elite_kaggle = {'Shooting_FG': 'High', 'Shooting_3PT': 'High'}
valid_evidence_elite = True
for var, state in evidence_elite_kaggle.items():
    if var not in kaggle_model.nodes():
        print(f"  Cannot test Elite Shooting scenario: Variable '{var}' not in Kaggle model nodes.")
        valid_evidence_elite = False
        break
    # Correctly check if the state exists in the possible states for the variable
    if state not in kaggle_model.get_state_names(var):
        print(f"  Cannot test Elite Shooting scenario: State '{state}' not present for '{var}' in Kaggle model.")
        valid_evidence_elite = False
        break


if valid_evidence_elite:
    try:
        result_elite_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence_elite_kaggle)
        # Check if 'High' state exists in the query result for Efficiency
        if 'High' in result_elite_kaggle.state_names['Efficiency']:
            elite_high_kaggle = result_elite_kaggle.values[result_elite_kaggle.state_names['Efficiency'].index('High')]
            print(f"Elite Shooting → P(High Efficiency): {elite_high_kaggle:.3f}")
        else:
            print(f"  'High' efficiency state not in result for Elite Shooting scenario.")
    except Exception as e:
        print(f"  Could not query for Elite Shooting: {e}")

# Poor ball control team
evidence_poor_ball_kaggle = {'Playmaking': 'Low', 'Turnovers': 'High'}
valid_evidence_poor_ball = True
for var, state in evidence_poor_ball_kaggle.items():
    if var not in kaggle_model.nodes():
        print(f"  Cannot test Poor Ball Control scenario: Variable '{var}' not in Kaggle model nodes.")
        valid_evidence_poor_ball = False
        break
    # Correctly check if the state exists in the possible states for the variable
    if state not in kaggle_model.get_state_names(var):
        print(f"  Cannot test Poor Ball Control scenario: State '{state}' not present for '{var}' in Kaggle model.")
        valid_evidence_poor_ball = False
        break

if valid_evidence_poor_ball:
    try:
        result_poor_ball_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence_poor_ball_kaggle)
        # Check if 'Low' state exists in the query result for Efficiency
        if 'Low' in result_poor_ball_kaggle.state_names['Efficiency']:
            poor_ball_low_kaggle = result_poor_ball_kaggle.values[result_poor_ball_kaggle.state_names['Efficiency'].index('Low')]
            print(f"Poor Ball Control → P(Low Efficiency): {poor_ball_low_kaggle:.3f}")
        else:
            print(f"  'Low' efficiency state not in result for Poor Ball Control scenario.")
    except Exception as e:
        print(f"  Could not query for Poor Ball Control: {e}")


# TEST 3: NETWORK PATH VERIFICATION (Kaggle Data)
print("\n🔗 NETWORK PATH VERIFICATION (Kaggle Data):")
paths_kaggle = [
    ("Shooting_FG → Efficiency", {'Shooting_FG': 'High'}),
    ("Shooting_3PT → Efficiency", {'Shooting_3PT': 'High'}),
    ("Playmaking → Efficiency", {'Playmaking': 'High'}),
    ("Turnovers → Efficiency", {'Turnovers': 'Low'}), # Low turnovers is good
    ("Offensive_Rebounding → Efficiency", {'Offensive_Rebounding': 'High'})
]

# Get baseline marginal probability for 'High' Efficiency if it exists
baseline_kaggle_high = 0
if 'Efficiency' in kaggle_model.nodes() and 'High' in kaggle_model.get_state_names('Efficiency'):
     try:
        baseline_marginal = kaggle_inference.query(variables=['Efficiency'])
        if 'High' in baseline_marginal.state_names['Efficiency']:
            baseline_kaggle_high = baseline_marginal.values[baseline_marginal.state_names['Efficiency'].index('High')]
     except Exception as e:
         print(f"  Could not get baseline marginal for High Efficiency: {e}")


for path_name, evidence in paths_kaggle:
    # Check if the evidence state exists in the model
    valid_evidence = True
    for var, state in evidence.items():
        if var not in kaggle_model.nodes():
             print(f"  Skipping '{path_name}': Variable '{var}' not in Kaggle model nodes.")
             valid_evidence = False
             break
        if state not in kaggle_model.get_state_names(var):
            print(f"  Skipping '{path_name}': Evidence state '{state}' not present for '{var}' in Kaggle model.")
            valid_evidence = False
            break

    if valid_evidence:
        try:
            result_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence)
            # Check if 'High' state exists in the query result
            if 'High' in result_kaggle.state_names['Efficiency']:
                high_prob_kaggle = result_kaggle.values[result_kaggle.state_names['Efficiency'].index('High')]
                impact_kaggle = high_prob_kaggle - baseline_kaggle_high
                print(f"  {path_name}: {impact_kaggle:+.3f} {'✓' if impact_kaggle > 0 else '❌'}")
            else:
                 print(f"  Skipping '{path_name}': 'High' state not in query result for Efficiency.")
        except Exception as e:
            print(f"  Could not query for '{path_name}': {e}")


print("\n✅ PHASE 4 (Kaggle Data) VALIDATION COMPLETED!")

**Reasoning**:
The error "AttributeError: 'DiscreteBayesianNetwork' object has no attribute 'get_state_names'" indicates that the `get_state_names` method does not exist for the `DiscreteBayesianNetwork` object. I need to find the correct way to access the states of a variable in the pgmpy model. The states are stored within the CPDs of each node. I will modify the code to access the states through the CPDs.
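
Whichever accessor ends up exposing the states, the evidence check itself does not depend on pgmpy. A minimal sketch, assuming the states have already been gathered into a plain dict `state_names` that maps each node to its list of states (both the dict and the helper `find_invalid_evidence` are hypothetical names for illustration):

```python
def find_invalid_evidence(evidence, state_names):
    """Return a list of human-readable problems; an empty list means the evidence is usable."""
    problems = []
    for var, state in evidence.items():
        if var not in state_names:
            problems.append(f"variable '{var}' is not in the model")
        elif state not in state_names[var]:
            problems.append(f"state '{state}' is not defined for '{var}'")
    return problems


# Hypothetical state names, e.g. collected from each node's CPD
state_names = {
    "Shooting_FG": ["High", "Low", "Medium"],
    "Efficiency": ["Medium"],
}

print(find_invalid_evidence({"Shooting_FG": "High"}, state_names))
print(find_invalid_evidence({"Efficiency": "High", "Pace": "Low"}, state_names))
```

Returning every problem at once, rather than stopping at the first, makes it easier to see when a node such as `Efficiency` has collapsed to a single state and no `'High'` query can be answered.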



In [None]:
print("=== PHASE 4 (KAGGLE DATA): Model Inference & Validation ===")
print("=== Sub-phase 4.1: Sanity Checks ===")

# TEST 1: MARGINAL PROBABILITIES
print("\n📊 MARGINAL PROBABILITIES (Kaggle Data):")
# Check if the 'Efficiency' node exists in the model
if 'Efficiency' in kaggle_model.nodes():
    try:
        efficiency_marginal_kaggle = kaggle_inference.query(variables=['Efficiency'])
        print("Overall Efficiency Distribution:")
        for state, prob in zip(efficiency_marginal_kaggle.state_names['Efficiency'], efficiency_marginal_kaggle.values):
            print(f"  P({state}): {prob:.3f}")
    except Exception as e:
        print(f"  Error querying marginal probability for Efficiency: {e}")
else:
    print("  'Efficiency' node not found in the Kaggle model.")


# TEST 2: BASKETBALL LOGIC SCENARIOS (Kaggle Data)
print("\n🏀 BASKETBALL LOGIC TESTS (Kaggle Data):")

# Helper function to get states for a node
def get_node_states(model, node):
    """Returns the list of states for a given node in the model."""
    try:
        cpd = model.get_cpds(node)
        return cpd.state_names[node]
    except Exception:
        return []


# Elite shooting team
evidence_elite_kaggle = {'Shooting_FG': 'High', 'Shooting_3PT': 'High'}
valid_evidence_elite = True
for var, state in evidence_elite_kaggle.items():
    if var not in kaggle_model.nodes():
        print(f"  Cannot test Elite Shooting scenario: Variable '{var}' not in Kaggle model nodes.")
        valid_evidence_elite = False
        break
    # Correctly check if the state exists in the possible states for the variable
    if state not in get_node_states(kaggle_model, var):
        print(f"  Cannot test Elite Shooting scenario: State '{state}' not present for '{var}' in Kaggle model.")
        valid_evidence_elite = False
        break


if valid_evidence_elite:
    try:
        result_elite_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence_elite_kaggle)
        # Check if 'High' state exists in the query result for Efficiency
        if 'High' in result_elite_kaggle.state_names['Efficiency']:
            elite_high_kaggle = result_elite_kaggle.values[result_elite_kaggle.state_names['Efficiency'].index('High')]
            print(f"Elite Shooting → P(High Efficiency): {elite_high_kaggle:.3f}")
        else:
            print(f"  'High' efficiency state not in result for Elite Shooting scenario.")
    except Exception as e:
        print(f"  Could not query for Elite Shooting: {e}")

# Poor ball control team
evidence_poor_ball_kaggle = {'Playmaking': 'Low', 'Turnovers': 'High'}
valid_evidence_poor_ball = True
for var, state in evidence_poor_ball_kaggle.items():
    if var not in kaggle_model.nodes():
        print(f"  Cannot test Poor Ball Control scenario: Variable '{var}' not in Kaggle model nodes.")
        valid_evidence_poor_ball = False
        break
    # Correctly check if the state exists in the possible states for the variable
    if state not in get_node_states(kaggle_model, var):
        print(f"  Cannot test Poor Ball Control scenario: State '{state}' not present for '{var}' in Kaggle model.")
        valid_evidence_poor_ball = False
        break

if valid_evidence_poor_ball:
    try:
        result_poor_ball_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence_poor_ball_kaggle)
        # Check if 'Low' state exists in the query result for Efficiency
        if 'Low' in result_poor_ball_kaggle.state_names['Efficiency']:
            poor_ball_low_kaggle = result_poor_ball_kaggle.values[result_poor_ball_kaggle.state_names['Efficiency'].index('Low')]
            print(f"Poor Ball Control → P(Low Efficiency): {poor_ball_low_kaggle:.3f}")
        else:
            print(f"  'Low' efficiency state not in result for Poor Ball Control scenario.")
    except Exception as e:
        print(f"  Could not query for Poor Ball Control: {e}")


# TEST 3: NETWORK PATH VERIFICATION (Kaggle Data)
print("\n🔗 NETWORK PATH VERIFICATION (Kaggle Data):")
paths_kaggle = [
    ("Shooting_FG → Efficiency", {'Shooting_FG': 'High'}),
    ("Shooting_3PT → Efficiency", {'Shooting_3PT': 'High'}),
    ("Playmaking → Efficiency", {'Playmaking': 'High'}),
    ("Turnovers → Efficiency", {'Turnovers': 'Low'}),  # Low turnovers is good
    ("Offensive_Rebounding → Efficiency", {'Offensive_Rebounding': 'High'})
]

# Get baseline marginal probability for 'High' Efficiency if it exists
baseline_kaggle_high = 0
if 'Efficiency' in kaggle_model.nodes() and 'High' in get_node_states(kaggle_model, 'Efficiency'):
    try:
        baseline_marginal = kaggle_inference.query(variables=['Efficiency'])
        if 'High' in baseline_marginal.state_names['Efficiency']:
            baseline_kaggle_high = baseline_marginal.values[baseline_marginal.state_names['Efficiency'].index('High')]
    except Exception as e:
        print(f"  Could not get baseline marginal for High Efficiency: {e}")


for path_name, evidence in paths_kaggle:
    # Check if the evidence state exists in the model
    valid_evidence = True
    for var, state in evidence.items():
        if var not in kaggle_model.nodes():
            print(f"  Skipping '{path_name}': Variable '{var}' not in Kaggle model nodes.")
            valid_evidence = False
            break
        if state not in get_node_states(kaggle_model, var):
            print(f"  Skipping '{path_name}': Evidence state '{state}' not present for '{var}' in Kaggle model.")
            valid_evidence = False
            break

    if valid_evidence:
        try:
            result_kaggle = kaggle_inference.query(variables=['Efficiency'], evidence=evidence)
            # Check if 'High' state exists in the query result
            if 'High' in result_kaggle.state_names['Efficiency']:
                high_prob_kaggle = result_kaggle.values[result_kaggle.state_names['Efficiency'].index('High')]
                impact_kaggle = high_prob_kaggle - baseline_kaggle_high
                # Use the '+' format flag so negative impacts print as "-0.123" rather than "+-0.123"
                print(f"  {path_name}: {impact_kaggle:+.3f} {'✓' if impact_kaggle > 0 else '❌'}")
            else:
                print(f"  Skipping '{path_name}': 'High' state not in query result for Efficiency.")
        except Exception as e:
            print(f"  Could not query for '{path_name}': {e}")


print("\n✅ PHASE 4 (Kaggle Data) VALIDATION COMPLETED!")

## Summary:

### Data Analysis Key Findings

*   The Kaggle datasets (`games.csv`, `games_details.csv`, `players.csv`) were successfully loaded and merged to create a base DataFrame containing game and player details.
*   A method to simulate lineup data by grouping players with the same minutes played within a game and team was implemented, focusing on 5-player groups. The turnover column name required correction from 'TOV' to 'TO' during aggregation.
*   Cleaning the Kaggle lineup data and keeping only lineups with more than 20 minutes played left a DataFrame with 11 rows. No missing values were found in the critical columns of this filtered subset.
*   The initial "smart" discretization strategy applied to the Kaggle data resulted in the 'Efficiency' variable having only a single category ('Medium'), rendering subsequent Bayesian network learning and inference for 'Efficiency' states other than 'Medium' impossible.
*   A revised, simpler discretization strategy using basic binning was implemented to address the small dataset size and limited unique values. This successfully created multiple categories (2 or 3) for all variables except 'Efficiency', which remained a single category ('Medium') due to the input data.
*   The Bayesian Network structure was adapted to the available discretized variables from the Kaggle data, and CPTs were learned successfully using the Maximum Likelihood Estimator.
*   Inference and validation tests were successfully adapted and run on the Kaggle-based Bayesian Network. However, the results confirmed that meaningful inference regarding 'High' or 'Low' efficiency was not possible due to the single state present in the 'Efficiency' node of the learned model.
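
The single-state 'Efficiency' problem described above can be caught before any CPTs are learned. The following is a minimal sketch, not the notebook's own code: `find_constant_columns` is a hypothetical helper, and the toy DataFrame stands in for the discretized Kaggle lineup data, whose actual contents are not reproduced here.

```python
import pandas as pd

def find_constant_columns(df):
    """Return names of columns with at most one distinct non-null value.

    Hypothetical helper: a node built from such a column has a single
    state, so queries about any other state cannot be answered.
    """
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]

# Toy stand-in for the discretized lineup data (values are made up).
toy_discretized = pd.DataFrame({
    "Shooting_FG": ["High", "Low", "High", "Low"],
    "Turnovers": ["Low", "Low", "High", "High"],
    "Efficiency": ["Medium", "Medium", "Medium", "Medium"],
})

constant_columns = find_constant_columns(toy_discretized)
if constant_columns:
    print(f"Single-state columns, revisit discretization before fitting: {constant_columns}")
```

Running a check like this right after discretization would flag 'Efficiency' before the network is fitted, rather than at inference time.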

### Insights or Next Steps

*   The current Kaggle dataset and the simple discretization strategy yield a "flat" Bayesian Network in which 'Efficiency' is a constant. Building a functional predictive or analytical model for different levels of 'Efficiency' requires either a dataset with more variability in the relevant statistics (especially plus-minus) or a more sophisticated discretization method that can create multiple categories from limited data points, if that is statistically justifiable.
*   Explore alternative methods for simulating lineups from the Kaggle data, as grouping solely by exact minutes played might not capture all on-court combinations and could limit the dataset size. Consider approaches based on substitution patterns or sequential analysis if feasible with the available data timestamps.
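
One candidate for the alternative discretization mentioned in the first point is rank-based quantile binning. The sketch below is an assumption rather than the method used in this notebook: `quantile_discretize` is a hypothetical helper and the 11 plus-minus values are invented for illustration. Ranking with `method="first"` before calling `pd.qcut` breaks ties so that every bin is populated, at the cost of placing identical values in different bins. Whether that is statistically defensible on so few rows is a separate question.

```python
import pandas as pd

def quantile_discretize(values, labels=("Low", "Medium", "High")):
    """Bin a numeric Series into equally sized, rank-based categories.

    Hypothetical helper. Ranking with method="first" makes all ranks
    unique, so pd.qcut never sees duplicate bin edges and every label
    is used as long as there are at least len(labels) rows.
    """
    ranks = values.rank(method="first")
    return pd.qcut(ranks, q=len(labels), labels=list(labels)).astype(str)

# Invented plus-minus values for 11 lineups (illustrative only).
plus_minus = pd.Series([-12.0, -5.0, -3.0, -3.0, 0.0, 1.0, 2.0, 2.0, 4.0, 7.0, 15.0])

efficiency_binned = quantile_discretize(plus_minus)
print(efficiency_binned.value_counts())
```

Because the bins are defined by rank rather than by value thresholds, the category boundaries depend entirely on the sample, so they would need to be recomputed (and re-examined) if more lineups are added.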
