In this project, I built a machine learning model that predicts the winner of an NBA basketball game using real match statistics. The main goal is to take stats like total points, rebounds, assists, steals, turnovers, fouls, and home advantage from two teams and determine which team is more likely to win.

✅ How the Model Works
Data Collection & Merging
I used Kaggle's NBA dataset, which contains:

games.csv (match results)

games_details.csv (player stats)

teams.csv (team information)

I merged these datasets using GAME_ID, then converted player-level stats into team-level totals.
So instead of many rows per player, I created one row per team per game.

Feature Engineering
For each team in each game, I created meaningful features such as:

Total points (TOTAL_PTS)

Rebounds (TOTAL_REB)

Assists (TOTAL_AST)

Steals (TOTAL_STL)

Blocks (TOTAL_BLK)

Turnovers (TOTAL_TO)

Personal fouls (TOTAL_PF)

Home advantage (IS_HOME)

The target variable (TEAM_WON) is 1 if the team won and 0 if the team lost.
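The steps above (one row per team per game, plus the IS_HOME and TEAM_WON columns) can be sketched on a tiny made-up game. The numbers and the team IDs 10 and 20 are invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy player-level box scores for one game (values are made up).
player_rows = pd.DataFrame({
    "GAME_ID": [1, 1, 1, 1],
    "TEAM_ID": [10, 10, 20, 20],
    "PTS":     [30, 25, 28, 20],
    "REB":     [8, 5, 7, 6],
})

# Toy game-level result: team 10 is at home and wins.
game_rows = pd.DataFrame({
    "GAME_ID": [1],
    "HOME_TEAM_ID": [10],
    "HOME_TEAM_WINS": [1],
})

# Collapse player rows into one row per team per game.
team_rows = (
    player_rows.groupby(["GAME_ID", "TEAM_ID"], as_index=False)[["PTS", "REB"]]
    .sum()
    .rename(columns={"PTS": "TOTAL_PTS", "REB": "TOTAL_REB"})
)

# Attach the game result, then derive IS_HOME and TEAM_WON.
team_rows = team_rows.merge(game_rows, on="GAME_ID", how="left")
team_rows["IS_HOME"] = (team_rows["TEAM_ID"] == team_rows["HOME_TEAM_ID"]).astype(int)
# A team won when its home/away flag agrees with the home-win flag.
team_rows["TEAM_WON"] = (team_rows["IS_HOME"] == team_rows["HOME_TEAM_WINS"]).astype(int)

print(team_rows[["TEAM_ID", "TOTAL_PTS", "TOTAL_REB", "IS_HOME", "TEAM_WON"]])
```

Each game ends up with exactly one winning row and one losing row, which is why the target classes are balanced.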

Model Training
I split the dataset into training and testing sets (80% – 20%) and trained multiple models:

Logistic Regression ✅ (baseline)

Random Forest

XGBoost (best performing model)

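XGBoost reached about 74% accuracy: 72.4% on a plain random split, 74.1% on a stratified split, and 74.5% ± 0.8% in 5-fold cross-validation. That is a solid result given that each prediction uses only one team's box-score totals and its home/away flag.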
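A comparison like the one described above could look roughly like this. This is only a sketch: it runs on synthetic data from `make_classification` rather than the NBA features, and it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that it needs nothing beyond scikit-learn. The hyperparameters are placeholders, not the ones used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight team-level features (not NBA data).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

candidates = {
    # Scaling helps the linear baseline converge.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    # Stand-in for XGBoost (assumption made for this sketch only).
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Fit every candidate on the same split and record its test accuracy.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Using one shared split keeps the comparison fair, since every model is scored on the same held-out rows.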

Prediction Interface
I built a simple user interface inside Colab using ipywidgets, where:

The user selects two NBA teams from dropdown menus

Inputs their match stats

Clicks "Predict Winner"

The model returns:
✅ the predicted winner
✅ win probabilities for both teams
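
Because the interface scores each team with a separate model call, the two win probabilities do not have to add up to 1. One possible way to show them as a head-to-head split is to rescale them, as sketched here. The helper `head_to_head` is hypothetical and not part of the notebook, and the rescaled values are a display heuristic rather than calibrated probabilities.

```python
from typing import Tuple


def head_to_head(proba_a: float, proba_b: float) -> Tuple[float, float]:
    """Rescale two independently predicted win probabilities so they sum to 1."""
    total = proba_a + proba_b
    if total == 0:
        # Neither side has any predicted chance; fall back to an even split.
        return 0.5, 0.5
    return proba_a / total, proba_b / total


# Made-up model outputs for Team A and Team B.
share_a, share_b = head_to_head(0.6, 0.3)
print(f"Team A: {share_a:.3f}, Team B: {share_b:.3f}")
```

The ranking of the two teams is unchanged by the rescaling, so the predicted winner stays the same.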

✅ What This Model Achieves
✔ Predicts which team is more likely to win
✔ Compares two teams using their game statistics
✔ Allows interactive testing with different match scenarios
✔ Shows how machine learning can be applied to real sports analytics

In [2]:
import pandas as pd

# Load datasets
games = pd.read_csv("/content/games.csv", low_memory=False)
games_details = pd.read_csv("/content/games_details.csv", low_memory=False)

# Merge using uppercase GAME_ID
df = pd.merge(games, games_details, on="GAME_ID", how="left")

print("Merged shape:", df.shape)
df.head()


Merged shape: (134796, 49)


Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
0,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,1.0,1.0,2.0,0.0,1.0,0.0,2.0,5.0,2.0,-2.0
1,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,6.0,3.0,9.0,6.0,1.0,0.0,2.0,1.0,23.0,-14.0
2,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,1.0,3.0,4.0,1.0,1.0,0.0,2.0,4.0,13.0,-4.0
3,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,0.0,9.0,9.0,5.0,3.0,0.0,2.0,1.0,10.0,-18.0
4,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,0.0,2.0,2.0,3.0,0.0,0.0,2.0,2.0,19.0,0.0


In [3]:
# Aggregate stats by GAME_ID and TEAM_ID
team_stats = df.groupby(['GAME_ID', 'TEAM_ID']).agg({
    'PTS': 'sum',
    'REB': 'sum',
    'AST': 'sum',
    'STL': 'sum',
    'BLK': 'sum',
    'TO': 'sum',
    'PF': 'sum'
}).reset_index()

# Rename columns for clarity
team_stats.rename(columns={
    'PTS': 'TOTAL_PTS',
    'REB': 'TOTAL_REB',
    'AST': 'TOTAL_AST',
    'STL': 'TOTAL_STL',
    'BLK': 'TOTAL_BLK',
    'TO': 'TOTAL_TO',
    'PF': 'TOTAL_PF'
}, inplace=True)

# Merge aggregated stats back with games (to get winner info)
final_df = pd.merge(team_stats, games[['GAME_ID','HOME_TEAM_ID','VISITOR_TEAM_ID','HOME_TEAM_WINS']], on='GAME_ID', how='left')

# Mark whether a row represents the home or away team
final_df['IS_HOME'] = (final_df['TEAM_ID'] == final_df['HOME_TEAM_ID']).astype(int)

# Define label: team won = 1, lost = 0
# A team won when its home/away flag matches the home-win flag
final_df['TEAM_WON'] = (final_df['IS_HOME'] == final_df['HOME_TEAM_WINS']).astype(int)

print("Final dataset shape:", final_df.shape)
final_df.head()


Final dataset shape: (8497, 14)


Unnamed: 0,GAME_ID,TEAM_ID,TOTAL_PTS,TOTAL_REB,TOTAL_AST,TOTAL_STL,TOTAL_BLK,TOTAL_TO,TOTAL_PF,HOME_TEAM_ID,VISITOR_TEAM_ID,HOME_TEAM_WINS,IS_HOME,TEAM_WON
0,11900101,1610613000.0,99.0,43.0,21.0,4.0,4.0,11.0,22.0,1610612746,1610612753,1,1,1
1,11900101,1610613000.0,90.0,42.0,21.0,3.0,5.0,9.0,20.0,1610612746,1610612753,1,0,0
2,11900102,1610613000.0,89.0,47.0,18.0,4.0,7.0,25.0,20.0,1610612743,1610612764,1,1,1
3,11900102,1610613000.0,82.0,34.0,22.0,10.0,1.0,13.0,26.0,1610612743,1610612764,1,0,0
4,11900103,1610613000.0,99.0,39.0,24.0,8.0,8.0,12.0,14.0,1610612751,1610612740,0,0,1


In [4]:
final_df.to_csv("/content/nba_team_cleaned.csv", index=False)
print("Saved as nba_team_cleaned.csv")


Saved as nba_team_cleaned.csv


In [5]:
import pandas as pd

df = pd.read_csv("/content/nba_team_cleaned.csv")

# Select best features
features = ['TOTAL_PTS', 'TOTAL_REB', 'TOTAL_AST', 'TOTAL_STL',
            'TOTAL_BLK', 'TOTAL_TO', 'TOTAL_PF', 'IS_HOME']

# If your dataset has these, add them:
optional_feats = ['FG_PCT_home', 'FG3_PCT_home', 'FT_PCT_home']

for col in optional_feats:
    if col in df.columns:
        features.append(col)

X = df[features]
y = df['TEAM_WON']



In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [7]:

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Train powerful gradient boosting model
xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss'
)

xgb.fit(X_train, y_train)

# Predictions
y_pred = xgb.predict(X_test)

# Accuracy
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(" XGBoost Accuracy:", acc)
print("\nConfusion Matrix:\n", cm)


 XGBoost Accuracy: 0.7235294117647059

Confusion Matrix:
 [[641 223]
 [247 589]]


In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load your final dataset
df = pd.read_csv("/content/nba_team_cleaned.csv")

# ---- FEATURES you used earlier (adjust if you added more)
features = ['TOTAL_PTS','TOTAL_REB','TOTAL_AST','TOTAL_STL',
            'TOTAL_BLK','TOTAL_TO','TOTAL_PF','IS_HOME']

X = df[features]
y = df['TEAM_WON']

# Stratified split to preserve class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
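
To see what `stratify=y` changes, here is a small sketch on toy labels (50 wins and 50 losses, invented for illustration). It compares how many wins land in the test set with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy balanced labels: 50 wins (1) and 50 losses (0).
y_toy = np.array([1] * 50 + [0] * 50)
X_toy = np.arange(100).reshape(-1, 1)

# Plain random split: the class mix in the test set depends on the shuffle.
_, _, _, y_test_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Stratified split: the test set keeps the 50/50 class mix of the full data.
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("wins in test set without stratify:", int(y_test_plain.sum()))
print("wins in test set with stratify:", int(y_test_strat.sum()))
```

With stratification the 20 test rows always contain 10 wins and 10 losses, so accuracy is measured on the same class balance as the training data.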


In [9]:

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    random_state=42
)
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
y_proba = xgb.predict_proba(X_test)[:,1]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, digits=3))


Accuracy: 0.7405882352941177
ROC-AUC: 0.8209564013840831
Confusion Matrix:
 [[639 211]
 [230 620]]

Classification Report:
               precision    recall  f1-score   support

           0      0.735     0.752     0.743       850
           1      0.746     0.729     0.738       850

    accuracy                          0.741      1700
   macro avg      0.741     0.741     0.741      1700
weighted avg      0.741     0.741     0.741      1700
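
The numbers in the report can be checked by hand from the confusion matrix printed above. The sketch below recomputes accuracy and the class-1 (win) metrics with plain arithmetic, using scikit-learn's layout where rows are true labels and columns are predicted labels.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]].
tn, fp = 639, 211
fn, tp = 230, 620

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Metrics for class 1 (the team won).
precision_win = tp / (tp + fp)
recall_win = tp / (tp + fn)
f1_win = 2 * precision_win * recall_win / (precision_win + recall_win)

print(
    f"accuracy={accuracy:.3f} precision={precision_win:.3f} "
    f"recall={recall_win:.3f} f1={f1_win:.3f}"
)
# accuracy=0.741 precision=0.746 recall=0.729 f1=0.738
```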



In [10]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(xgb, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
print("CV Accuracy (mean ± std): {:.3f} ± {:.3f}".format(cv_scores.mean(), cv_scores.std()))


CV Accuracy (mean ± std): 0.745 ± 0.008


In [11]:
import joblib, json

final_model = best_xgb if 'best_xgb' in globals() else xgb
joblib.dump(final_model, "nba_win_predictor_xgb.pkl")

with open("feature_list.json","w") as f:
    json.dump(features, f)

print("Saved: nba_win_predictor_xgb.pkl and feature_list.json")


Saved: nba_win_predictor_xgb.pkl and feature_list.json


In [12]:
import numpy as np

def predict_win_from_stats(model, feature_order, stats_dict):
    # stats_dict keys must match feature_order
    x = np.array([[stats_dict[k] for k in feature_order]], dtype=float)
    proba = model.predict_proba(x)[0,1]
    label = int(proba >= 0.5)
    return label, proba

# Example
example_stats = {
    'TOTAL_PTS':110, 'TOTAL_REB':48, 'TOTAL_AST':26,
    'TOTAL_STL':7, 'TOTAL_BLK':4, 'TOTAL_TO':12, 'TOTAL_PF':19, 'IS_HOME':1
}
model_loaded = joblib.load("nba_win_predictor_xgb.pkl")
label, proba = predict_win_from_stats(model_loaded, features, example_stats)
print("WIN" if label==1 else "LOSS", "with prob:", round(proba,3))



WIN with prob: 0.59
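
The saved feature_list.json can also be read back at prediction time, so the input row always follows the training column order. This is a sketch of that idea: `load_feature_order` and `build_row` are hypothetical helpers, not part of the notebook, and the demo writes a temporary file with three features instead of touching the real one.

```python
import json
import os
import tempfile


def load_feature_order(path):
    """Read the saved feature list so inference uses the training column order."""
    with open(path) as f:
        return json.load(f)


def build_row(stats, feature_order):
    """Return stats as a list ordered like feature_order, rejecting missing keys."""
    missing = [name for name in feature_order if name not in stats]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(stats[name]) for name in feature_order]


# Demo with a temporary file standing in for feature_list.json.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature_list.json")
    with open(path, "w") as f:
        json.dump(["TOTAL_PTS", "TOTAL_REB", "IS_HOME"], f)
    order = load_feature_order(path)

# The dict keys are in a different order on purpose.
row = build_row({"IS_HOME": 1, "TOTAL_REB": 48, "TOTAL_PTS": 110}, order)
print(row)  # [110.0, 48.0, 1.0]
```

The resulting list can be wrapped in a 2D array and passed to `predict_proba` in the same way as in the cell above.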


In [13]:
import pandas as pd

df = pd.read_csv("/content/nba_team_cleaned.csv")

# Unique teams from TEAM_ID or team name if available
team_list = sorted(df['TEAM_ID'].unique())  # or TEAM_NAME if exists
print(len(team_list), "teams found")


30 teams found


In [14]:
teams = pd.read_csv("/content/teams.csv")

# Preview columns
print(teams.columns)


Index(['LEAGUE_ID', 'TEAM_ID', 'MIN_YEAR', 'MAX_YEAR', 'ABBREVIATION',
       'NICKNAME', 'YEARFOUNDED', 'CITY', 'ARENA', 'ARENACAPACITY', 'OWNER',
       'GENERALMANAGER', 'HEADCOACH', 'DLEAGUEAFFILIATION'],
      dtype='object')


In [15]:
teams['TEAM_NAME'] = teams['CITY'] + " " + teams['NICKNAME']


In [16]:
team_map = dict(zip(teams['TEAM_ID'], teams['TEAM_NAME']))
name_to_id = {v: k for k, v in team_map.items()}


In [17]:
df['TEAM_NAME'] = df['TEAM_ID'].map(team_map)
team_names = sorted(df['TEAM_NAME'].dropna().unique())


In [18]:
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, clear_output
import joblib
import numpy as np

# Load data
df = pd.read_csv("/content/nba_team_cleaned.csv")
teams = pd.read_csv("/content/teams.csv")

# Create TEAM_NAME as "CITY + NICKNAME"
teams['TEAM_NAME'] = teams['CITY'] + " " + teams['NICKNAME']

# Mapping: ID → name and name → ID
team_map = dict(zip(teams['TEAM_ID'], teams['TEAM_NAME']))
name_to_id = {v:k for k,v in team_map.items()}

# Add team names into df
df['TEAM_NAME'] = df['TEAM_ID'].map(team_map)

#  Dropdown list
team_names = sorted(df['TEAM_NAME'].dropna().unique())

# Load model
model = joblib.load("nba_win_predictor_xgb.pkl")

# Features used
features = ['TOTAL_PTS','TOTAL_REB','TOTAL_AST','TOTAL_STL',
             'TOTAL_BLK','TOTAL_TO','TOTAL_PF','IS_HOME']

def predict(stats):
    x = np.array([[stats[f] for f in features]])
    proba = model.predict_proba(x)[0,1]
    label = int(proba >= 0.5)
    return label, proba

#  Dropdown widgets
teamA_dropdown = widgets.Dropdown(options=team_names, description="Team A")
teamB_dropdown = widgets.Dropdown(options=team_names, description="Team B")

#  Venue
A_home = widgets.ToggleButtons(options=[("Home",1),("Away",0)], description="Venue A")
B_home = widgets.ToggleButtons(options=[("Home",1),("Away",0)], description="Venue B")

#  Input boxes
def num(desc,val):
    return widgets.BoundedFloatText(description=desc, value=val, min=0, max=300)

A_pts,A_reb,A_ast,A_stl,A_blk,A_to,A_pf = num("A PTS",110),num("A REB",48),num("A AST",25),num("A STL",7),num("A BLK",4),num("A TO",12),num("A PF",18)
B_pts,B_reb,B_ast,B_stl,B_blk,B_to,B_pf = num("B PTS",107),num("B REB",46),num("B AST",23),num("B STL",6),num("B BLK",3),num("B TO",13),num("B PF",20)

run_btn = widgets.Button(description="Predict Winner", button_style="primary")
out = widgets.Output()

#  Predict on button click
def on_click(_):
    with out:
        clear_output()

        statsA = { 'TOTAL_PTS':A_pts.value,'TOTAL_REB':A_reb.value,'TOTAL_AST':A_ast.value,
                   'TOTAL_STL':A_stl.value,'TOTAL_BLK':A_blk.value,'TOTAL_TO':A_to.value,
                   'TOTAL_PF':A_pf.value,'IS_HOME':int(A_home.value) }

        statsB = { 'TOTAL_PTS':B_pts.value,'TOTAL_REB':B_reb.value,'TOTAL_AST':B_ast.value,
                   'TOTAL_STL':B_stl.value,'TOTAL_BLK':B_blk.value,'TOTAL_TO':B_to.value,
                   'TOTAL_PF':B_pf.value,'IS_HOME':int(B_home.value) }

        # Each team is scored independently, so the two probabilities need not sum to 1
        _, probaA = predict(statsA)
        _, probaB = predict(statsB)

        # Pick the team with the higher predicted win probability (ties go to Team A)
        winner = teamA_dropdown.value if probaA >= probaB else teamB_dropdown.value

        print(f"🏀 {teamA_dropdown.value} vs {teamB_dropdown.value}")
        print(f"✅ Winner Prediction: {winner}")
        print(f"{teamA_dropdown.value} win probability: {probaA:.3f}")
        print(f"{teamB_dropdown.value} win probability: {probaB:.3f}")

run_btn.on_click(on_click)

ui_left = widgets.VBox([teamA_dropdown,A_home,A_pts,A_reb,A_ast,A_stl,A_blk,A_to,A_pf])
ui_right = widgets.VBox([teamB_dropdown,B_home,B_pts,B_reb,B_ast,B_stl,B_blk,B_to,B_pf])

display(widgets.HBox([ui_left, ui_right]))
display(run_btn, out)


HBox(children=(VBox(children=(Dropdown(description='Team A', options=('Atlanta Hawks', 'Boston Celtics', 'Broo…

Button(button_style='primary', description='Predict Winner', style=ButtonStyle())

Output()