Many thanks to the UC Davis ASA and WiCS clubs for teaching the bones of this code!

# Part 1: Cleaning WNBA Data

In [None]:
# Import necessary libraries
import pandas as pd

Let's import the first WNBA dataset we will be working with, the regular season data. We will read this CSV file and remove unnecessary columns.

In [None]:
# Import regular season dataset, source: https://www.basketball-reference.com/wnba/years/2023_games.html
schedule = pd.read_csv("/reg_season.csv")

# Remove the first column and the last two columns (date, unnamed, and notes)
schedule = schedule.iloc[:, 1:-2]
# Display dataframe, schedule
schedule.head()

Unnamed: 0,Visitor/Neutral,PTS,Home/Neutral,PTS.1
0,Connecticut Sun,70,Indiana Fever,61
1,Phoenix Mercury,71,Los Angeles Sparks,94
2,Chicago Sky,77,Minnesota Lynx,66
3,New York Liberty,64,Washington Mystics,80
4,Atlanta Dream,78,Dallas Wings,85
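As a minimal sketch of what the `iloc[:, 1:-2]` slice does, here is the same operation on a one-row toy frame. The `raw` frame, its column names, and its placeholder values are assumptions for illustration only, not the actual contents of the CSV.

```python
import pandas as pd

# Toy frame standing in for the raw CSV (hypothetical column names, placeholder values)
raw = pd.DataFrame({
    "Date": ["some date"],
    "Visitor/Neutral": ["Connecticut Sun"],
    "PTS": [70],
    "Home/Neutral": ["Indiana Fever"],
    "PTS.1": [61],
    "Unnamed: 5": ["placeholder"],
    "Notes": [None],
})

# 1:-2 keeps everything from the second column up to, but not including, the last two
trimmed = raw.iloc[:, 1:-2]
print(list(trimmed.columns))
```

Positional slicing like this depends on the column order of the file, so it is worth checking `schedule.columns` after the slice.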


Now let's read in the second WNBA dataset we will work with, which contains advanced statistics about each team's performance in 2023.

In [None]:
# Import second WNBA dataset, source: https://www.basketball-reference.com/wnba/years/2023.html
advanced_stats = pd.read_csv("/advanced_stats.csv")
# Remove nan columns
advanced_stats = advanced_stats.dropna(axis = 1, how = 'all')
# Remove first and last columns (rank and arena)
advanced_stats = advanced_stats.iloc[:, 1:-1]
# Display dataframe, advanced_stats
advanced_stats.head()

Unnamed: 0,Team,Age,W,L,PW,PL,MOV,SOS,SRS,ORtg,...,3PAr,TS%,eFG%,TOV%,ORB%,FT/FGA,eFG%.1,TOV%.1,DRB%,FT/FGA.1
0,Las Vegas Aces,29.0,34.0,6.0,35,5,12.55,-1.09,11.46,114.8,...,0.36,0.597,0.553,12.5,18.9,0.236,0.48,14.4,78.2,0.171
1,New York Liberty,28.8,32.0,8.0,32,8,8.68,-0.53,8.14,111.8,...,0.424,0.577,0.539,14.9,25.8,0.198,0.476,13.1,78.5,0.184
2,Connecticut Sun,29.1,27.0,13.0,26,14,3.7,-0.27,3.43,105.4,...,0.294,0.54,0.498,14.0,23.9,0.224,0.486,16.7,75.8,0.236
3,Dallas Wings,26.0,22.0,18.0,25,15,2.95,-0.31,2.64,109.5,...,0.292,0.535,0.489,13.7,32.8,0.224,0.504,15.0,77.2,0.237
4,Washington Mystics,28.8,19.0,21.0,19,21,-0.35,0.03,-0.32,101.5,...,0.342,0.533,0.486,13.9,18.7,0.222,0.496,16.3,78.0,0.238


We will combine our two dataframes, schedule and advanced_stats, by merging advanced_stats onto schedule twice according to team name: once for the visiting team and once for the home team.

In [None]:
df = pd.merge(schedule, advanced_stats, left_on = "Visitor/Neutral", right_on = "Team")
df = pd.merge(df, advanced_stats, left_on="Home/Neutral", right_on="Team")
# Remove duplicate columns
df = df.drop(['Team_x', 'Team_y'], axis=1)
# Display dataframe, df
df.head()

Unnamed: 0,Visitor/Neutral,PTS,Home/Neutral,PTS.1,Age_x,W_x,L_x,PW_x,PL_x,MOV_x,...,3PAr_y,TS%_y,eFG%_y,TOV%_y,ORB%_y,FT/FGA_y,eFG%.1_y,TOV%.1_y,DRB%_y,FT/FGA.1_y
0,Connecticut Sun,70,Indiana Fever,61,29.1,27.0,13.0,26,14,3.7,...,0.29,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248
1,Connecticut Sun,88,Indiana Fever,72,29.1,27.0,13.0,26,14,3.7,...,0.29,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248
2,Phoenix Mercury,85,Indiana Fever,82,27.8,9.0,31.0,8,32,-8.3,...,0.29,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248
3,Phoenix Mercury,71,Indiana Fever,72,27.8,9.0,31.0,8,32,-8.3,...,0.29,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248
4,Chicago Sky,89,Indiana Fever,87,27.2,18.0,22.0,17,23,-1.63,...,0.29,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248
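To see where the `_x` and `_y` suffixes come from, here is a small sketch of the same double merge on made-up data. The frames `schedule_toy` and `stats_toy`, the team names, and the win totals are hypothetical. The first merge attaches the visitor's stats; the second merge attaches the home team's stats, and because the stat columns now exist on both sides, pandas appends `_x` (visitor) and `_y` (home).

```python
import pandas as pd

# Hypothetical two-game schedule and one-stat team table
schedule_toy = pd.DataFrame({
    "Visitor/Neutral": ["Team A", "Team B"],
    "PTS": [70, 71],
    "Home/Neutral": ["Team B", "Team A"],
    "PTS.1": [61, 94],
})
stats_toy = pd.DataFrame({"Team": ["Team A", "Team B"], "W": [27, 13]})

# First merge: stats of the visiting team (columns keep their plain names)
merged = pd.merge(schedule_toy, stats_toy, left_on="Visitor/Neutral", right_on="Team")
# Second merge: stats of the home team (overlapping names get _x / _y suffixes)
merged = pd.merge(merged, stats_toy, left_on="Home/Neutral", right_on="Team")
# The two Team columns duplicate the visitor and home names, so drop them
merged = merged.drop(["Team_x", "Team_y"], axis=1)
print(merged)
```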


We will add a column to this merged dataframe that records whether the home team won each game. To do this, we will loop through each row (game) and compare the two scores, placing a 0 if the home team lost and a 1 if the home team won.

In [None]:
# Add a column to show if the home team won
for index, row in df.iterrows():
  # Determine which team had more points
  if df.loc[index, "PTS"] > df.loc[index, "PTS.1"]:
    # 0 for home loss
    df.loc[index, "Home_winner"] = 0
  if df.loc[index, "PTS"] < df.loc[index, "PTS.1"]:
    # 1 for home win
    df.loc[index, "Home_winner"] = 1

# Display dataframe, df
df.head()

Unnamed: 0,Visitor/Neutral,PTS,Home/Neutral,PTS.1,Age_x,W_x,L_x,PW_x,PL_x,MOV_x,...,TS%_y,eFG%_y,TOV%_y,ORB%_y,FT/FGA_y,eFG%.1_y,TOV%.1_y,DRB%_y,FT/FGA.1_y,Home_winner
0,Connecticut Sun,70,Indiana Fever,61,29.1,27.0,13.0,26,14,3.7,...,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248,0.0
1,Connecticut Sun,88,Indiana Fever,72,29.1,27.0,13.0,26,14,3.7,...,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248,0.0
2,Phoenix Mercury,85,Indiana Fever,82,27.8,9.0,31.0,8,32,-8.3,...,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248,0.0
3,Phoenix Mercury,71,Indiana Fever,72,27.8,9.0,31.0,8,32,-8.3,...,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248,1.0
4,Chicago Sky,89,Indiana Fever,87,27.2,18.0,22.0,17,23,-1.63,...,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248,0.0
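The same column can also be built without a loop. The sketch below is one possible vectorized alternative, not the notebook's own method; the `games` frame is a toy stand-in that reuses three score pairs from the tables above. It assumes no game ends in a tie, in which case the loop would leave a missing value while this comparison would give 0.

```python
import pandas as pd

# Toy frame with visitor points (PTS) and home points (PTS.1)
games = pd.DataFrame({"PTS": [70, 71, 85], "PTS.1": [61, 94, 82]})

# True where the home team scored more; cast to float to match the 0.0 / 1.0 values above
games["Home_winner"] = (games["PTS.1"] > games["PTS"]).astype(float)
print(games)
```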


Now, we must remove certain columns that we don't want to keep for our prediction. As we want our ML model to determine the winner of a game without knowing the number of points each team got or the winner, let's remove the columns with this information.

In [None]:
# Create a list of column names that we want to remove
remove_cols = ["Home/Neutral", "Visitor/Neutral", "PTS", "PTS.1", "Home_winner"]
# For every column, if column isn't in list of columns to be removed, add it to list of selected columns (kept columns)
selected_cols = [x for x in df.columns if x not in remove_cols]
# Display selected columns of dataframe, df
df[selected_cols].head()

Unnamed: 0,Age_x,W_x,L_x,PW_x,PL_x,MOV_x,SOS_x,SRS_x,ORtg_x,DRtg_x,...,3PAr_y,TS%_y,eFG%_y,TOV%_y,ORB%_y,FT/FGA_y,eFG%.1_y,TOV%.1_y,DRB%_y,FT/FGA.1_y
0,29.1,27.0,13.0,26,14,3.7,-0.27,3.43,105.4,100.7,...,0.29,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248
1,29.1,27.0,13.0,26,14,3.7,-0.27,3.43,105.4,100.7,...,0.29,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248
2,27.8,9.0,31.0,8,32,-8.3,0.55,-7.75,99.1,109.9,...,0.29,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248
3,27.8,9.0,31.0,8,32,-8.3,0.55,-7.75,99.1,109.9,...,0.29,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248
4,27.2,18.0,22.0,17,23,-1.63,0.11,-1.52,103.3,105.3,...,0.29,0.532,0.491,15.6,26.4,0.204,0.509,13.9,76.8,0.248


Let's scale our data to the range 0 to 1 using MinMaxScaler from sklearn, in preparation for the logistic regression we will implement later on.

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Initialize scaler
scaler = MinMaxScaler()
# Scale our data
df[selected_cols] = scaler.fit_transform(df[selected_cols])
# Display dataframe, df
df.head()

Unnamed: 0,Visitor/Neutral,PTS,Home/Neutral,PTS.1,Age_x,W_x,L_x,PW_x,PL_x,MOV_x,...,TS%_y,eFG%_y,TOV%_y,ORB%_y,FT/FGA_y,eFG%.1_y,TOV%.1_y,DRB%_y,FT/FGA.1_y,Home_winner
0,Connecticut Sun,70,Indiana Fever,61,1.0,0.72,0.28,0.666667,0.333333,0.57554,...,0.177215,0.225,0.659574,0.546099,0.392405,0.891892,0.216216,0.690909,0.9625,0.0
1,Connecticut Sun,88,Indiana Fever,72,1.0,0.72,0.28,0.666667,0.333333,0.57554,...,0.177215,0.225,0.659574,0.546099,0.392405,0.891892,0.216216,0.690909,0.9625,0.0
2,Phoenix Mercury,85,Indiana Fever,82,0.59375,0.0,1.0,0.0,1.0,0.0,...,0.177215,0.225,0.659574,0.546099,0.392405,0.891892,0.216216,0.690909,0.9625,0.0
3,Phoenix Mercury,71,Indiana Fever,72,0.59375,0.0,1.0,0.0,1.0,0.0,...,0.177215,0.225,0.659574,0.546099,0.392405,0.891892,0.216216,0.690909,0.9625,1.0
4,Chicago Sky,89,Indiana Fever,87,0.40625,0.36,0.64,0.333333,0.666667,0.319904,...,0.177215,0.225,0.659574,0.546099,0.392405,0.891892,0.216216,0.690909,0.9625,0.0
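With its default settings, MinMaxScaler rescales each column as (x - min) / (max - min). The sketch below applies that formula by hand with pandas to three win totals that appear in the tables above (34, 27, and 9); the `stats` frame itself is a toy stand-in, not the full dataset.

```python
import pandas as pd

# Three win totals taken from the tables above
stats = pd.DataFrame({"W": [34.0, 27.0, 9.0]})

# Column-wise min-max scaling: subtract the column minimum, divide by the column range
scaled = (stats - stats.min()) / (stats.max() - stats.min())
print(scaled)
```

The middle value works out to (27 - 9) / (34 - 9) = 0.72, which matches the scaled `W_x` shown for the Connecticut Sun.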


# Part 2: Determine our Predictors

We initialize a sequential feature selector built around a ridge classifier. The features it selects will be the inputs to the model we create later.

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import RidgeClassifier

# Initialize our ridge classifier
rr = RidgeClassifier(alpha = 1.0)

# Initialize our feature selector, which starts from all features and removes them one at a time until 10 remain
sfs = SequentialFeatureSelector(rr, n_features_to_select = 10, direction = "backward")

Now, we will use our selector to pick the 10 best features.

In [None]:
# Determine which columns are the most impactful when predicting the winner
sfs.fit(df[selected_cols], df["Home_winner"])

We keep only a subset of the features because training on every available column makes the model more likely to overfit. Let's see which features were selected.

In [None]:
# Create a list of the most impactful columns
predictors = list(df[selected_cols].columns[sfs.get_support()])
# Display these most impactful columns
df[predictors].head()

Unnamed: 0,ORB%_x,eFG%.1_x,TOV%.1_x,L_y,TS%_y,eFG%_y,ORB%_y,FT/FGA_y,eFG%.1_y,TOV%.1_y
0,0.368794,0.27027,0.972973,0.84,0.177215,0.225,0.546099,0.392405,0.891892,0.216216
1,0.368794,0.27027,0.972973,0.84,0.177215,0.225,0.546099,0.392405,0.891892,0.216216
2,0.156028,0.810811,0.081081,0.84,0.177215,0.225,0.546099,0.392405,0.891892,0.216216
3,0.156028,0.810811,0.081081,0.84,0.177215,0.225,0.546099,0.392405,0.891892,0.216216
4,0.404255,0.621622,0.405405,0.84,0.177215,0.225,0.546099,0.392405,0.891892,0.216216
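To make the selection step easier to follow, here is a small sketch of backward selection on synthetic data. The feature names `f0` to `f5`, the random data, and the label rule are all made up for illustration. `get_support()` returns a boolean mask over the input columns, which is why it can be used to index `columns` directly.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import RidgeClassifier

# Synthetic data: six hypothetical features, 200 rows
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
# Only f0 and f1 drive the label; the other columns are noise
y = (X["f0"] + X["f1"] > 0).astype(int)

# Backward selection: start with all six features, drop one at a time until two remain
selector = SequentialFeatureSelector(
    RidgeClassifier(alpha=1.0), n_features_to_select=2, direction="backward"
)
selector.fit(X, y)

# Boolean mask -> names of the columns that were kept
kept = list(X.columns[selector.get_support()])
print(kept)
```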


# Part 3: Creating and Testing our Model

In [None]:
# Import the necessary packages
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Let's write a function to train and test our model, in order to get the average accuracy based on many simulations.

In [None]:
def monte_carlo(n):
    accuracy = []
    for i in range(n):
        # Split the data into training and test sets, using the predictors as features and Home_winner as the target
        X_train, X_test, y_train, y_test = train_test_split(df[predictors], df['Home_winner'], test_size=0.2)

        # Train a logistic regression model on the training data
        model = LogisticRegression()
        # Fit the model to our training data
        model.fit(X_train, y_train)

        # Predict the winners for the test data
        y_pred = model.predict(X_test)

        # Evaluate the accuracy of the model on the test data
        accuracy.append(accuracy_score(y_test, y_pred))

    # Get the average accuracy
    score = sum(accuracy) / len(accuracy)
    return score

We will run 1000 simulations with our function, each on a different random train/test split, in order to determine the average accuracy of our model.

In [None]:
score = monte_carlo(1000)
print("Accuracy:", score)

Accuracy: 0.7038750000000005


Based on 1000 simulations, our model is about 70% accurate in predicting outcomes of WNBA games.
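Because each simulation uses a random split, the average will vary slightly from run to run. The sketch below is one possible variation, not the notebook's own function: it seeds each split so the result is repeatable and also reports the spread of the accuracies. The function name `monte_carlo_spread`, the `seed` parameter, and the synthetic `X_demo` / `y_demo` data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df[predictors] and df['Home_winner']
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

def monte_carlo_spread(n, seed=0):
    """Return the mean and standard deviation of test accuracy over n random splits."""
    accuracies = []
    for i in range(n):
        # A different but reproducible split on every iteration
        X_train, X_test, y_train, y_test = train_test_split(
            X_demo, y_demo, test_size=0.2, random_state=seed + i
        )
        model = LogisticRegression().fit(X_train, y_train)
        accuracies.append(accuracy_score(y_test, model.predict(X_test)))
    return np.mean(accuracies), np.std(accuracies)

mean_acc, std_acc = monte_carlo_spread(50)
print(f"Accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```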

# Part 4: Predicting the Finals

First, we must train our model only on games where the Las Vegas Aces and New York Liberty do NOT face each other. Training on their head-to-head games would be cheating, and the resulting accuracy would not reflect how well the model predicts future games.

In [None]:
# Remove the rows where the Aces and Liberty play against each other
non_aces_liberty_game = df
non_aces_liberty_game = non_aces_liberty_game.drop(non_aces_liberty_game[(non_aces_liberty_game['Home/Neutral'] == 'Las Vegas Aces') & (non_aces_liberty_game['Visitor/Neutral'] == 'New York Liberty')].index)
non_aces_liberty_game = non_aces_liberty_game.drop(non_aces_liberty_game[(non_aces_liberty_game['Home/Neutral'] == 'New York Liberty') & (non_aces_liberty_game['Visitor/Neutral'] == 'Las Vegas Aces')].index)
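# Check that no Aces (home) vs Liberty (visitor) rows remain; this should display an empty dataframe
non_aces_liberty_game[(non_aces_liberty_game['Home/Neutral'] == 'Las Vegas Aces') & (non_aces_liberty_game['Visitor/Neutral'] == 'New York Liberty')]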

Unnamed: 0,Visitor/Neutral,PTS,Home/Neutral,PTS.1,Age_x,W_x,L_x,PW_x,PL_x,MOV_x,...,TS%_y,eFG%_y,TOV%_y,ORB%_y,FT/FGA_y,eFG%.1_y,TOV%.1_y,DRB%_y,FT/FGA.1_y,Home_winner
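The two `drop` calls can also be expressed as a single boolean mask. The sketch below is an alternative formulation on a toy frame, with made-up rows; the names `games`, `finalists`, `head_to_head`, and `others` are hypothetical. It relies on the fact that a team never plays itself, so a row where both sides are finalists must be an Aces v Liberty game.

```python
import pandas as pd

# Toy schedule: two head-to-head games and two games against another team
games = pd.DataFrame({
    "Home/Neutral": ["Las Vegas Aces", "New York Liberty", "Las Vegas Aces", "Indiana Fever"],
    "Visitor/Neutral": ["New York Liberty", "Las Vegas Aces", "Indiana Fever", "New York Liberty"],
})
finalists = ["Las Vegas Aces", "New York Liberty"]

# True when both the home team and the visitor are finalists
head_to_head = games["Home/Neutral"].isin(finalists) & games["Visitor/Neutral"].isin(finalists)

# Keep every game that is NOT a head-to-head matchup
others = games[~head_to_head]
print(others)
```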


Then, we grab the data for the matchup between the two teams.

In [None]:
# Get a game where the Aces are home and Liberty is away
final_matchup = df[(df['Home/Neutral'] == 'Las Vegas Aces') & (df['Visitor/Neutral'] == 'New York Liberty')][:1]
# Show the predictors we will use
final_matchup[predictors]

Unnamed: 0,ORB%_x,eFG%.1_x,TOV%.1_x,L_y,TS%_y,eFG%_y,ORB%_y,FT/FGA_y,eFG%.1_y,TOV%.1_y
144,0.503546,0.0,0.0,0.0,1.0,1.0,0.014184,0.797468,0.108108,0.351351


We can now train our model on the non Aces v Liberty games and use it to predict the outcome of the game we selected.

In [None]:
# Predict the winner of the final matchup
model = LogisticRegression()
model.fit(non_aces_liberty_game[predictors], non_aces_liberty_game['Home_winner'])

# Predict the outcome of the final_matchup
y_pred = model.predict(final_matchup[predictors])
print(f"Prediction: {y_pred[0]}")

Prediction: 1.0


This predicts that the home team will win, meaning the Aces take the win for the first game of the finals.

For a more complex prediction for each finals game, see predict.ipynb.

Predicting every potential matchup, both home and away, gives the following:

In [None]:
# Grab rows from the df where Aces are playing the Liberty (home and away) (the matchup for the finals)
final_matchup_1 = df[(df['Home/Neutral'] == 'Las Vegas Aces') & (df['Visitor/Neutral'] == 'New York Liberty')][1:]
final_matchup_2 = df[(df['Home/Neutral'] == 'New York Liberty') & (df['Visitor/Neutral'] == 'Las Vegas Aces')][1:]
# Combine the two data frames to match the games played in the finals
finals = pd.concat([final_matchup_1, final_matchup_1, final_matchup_2, final_matchup_2, final_matchup_1, final_matchup_2])

In [None]:
# Predict the winner of the final matchup
model = LogisticRegression()
model.fit(non_aces_liberty_game[predictors], non_aces_liberty_game['Home_winner'])

# Predict the outcome of the final matchup
y_pred = model.predict(final_matchup[predictors])
print(f"Prediction: {y_pred[0]}")

Prediction: 1.0
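The cell above scores the single `final_matchup` row. To score every row of a frame such as `finals`, one could pass the whole frame to `predict` and store the result next to the games. The following is a minimal sketch on made-up data; the column `stat_a`, the frames `train` and `games_to_score`, and the new column `Predicted_home_winner` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training data with a single scaled feature
train = pd.DataFrame({
    "stat_a": [0.1, 0.2, 0.8, 0.9, 0.15, 0.85],
    "Home_winner": [0.0, 0.0, 1.0, 1.0, 0.0, 1.0],
})
# Made-up games to score, indexed like rows pulled out of a larger frame
games_to_score = pd.DataFrame({"stat_a": [0.05, 0.95]}, index=[145, 49])

model = LogisticRegression().fit(train[["stat_a"]], train["Home_winner"])

# predict returns one label per row, in the same order as the input rows
scored = games_to_score.assign(
    Predicted_home_winner=model.predict(games_to_score[["stat_a"]])
)
print(scored)
```

Keeping the predictions in their own column avoids confusing them with the `Home_winner` column, which holds the recorded result of each regular season game.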


In [None]:
finals[['Home/Neutral', 'Visitor/Neutral', 'Home_winner']]

Unnamed: 0,Home/Neutral,Visitor/Neutral,Home_winner
145,Las Vegas Aces,New York Liberty,1.0
145,Las Vegas Aces,New York Liberty,1.0
49,New York Liberty,Las Vegas Aces,1.0
49,New York Liberty,Las Vegas Aces,1.0
145,Las Vegas Aces,New York Liberty,1.0
49,New York Liberty,Las Vegas Aces,1.0


Each game here has a value of 1.0, meaning the model predicted that each team would win all of its home games. The Game 1, Game 2, and Game 3 predictions were all correct, while the Game 4 prediction was incorrect: the Aces won that game on the road and took the 2023 WNBA Finals 3-1!