---
Title: "Intro to Categorical Model-Making Module by Austin Hayes"

Author:
  - Name: Austin Hayes

  -  Email: ahayes65@charlotte.edu

  -  Affiliation: University of North Carolina at Charlotte

Date: July 29, 2025

Description: Using WNBA Team 'Per Game' stats, we will review the basics of categorical model-making. We will make QDA, LDA and Logistic models in order to predict which team will become the next WNBA Champion.

Categories:
  - Logistic Regression
  - Quadratic Discriminant Analysis
  - Linear Discriminant Analysis
  - Importing and Reading data
  - Data Cleaning
  - Data Science
  - Pandas


### Data

This dataset comes from Basketball Reference: https://www.basketball-reference.com

Visit the original data page here: https://www.basketball-reference.com/wnba/years/2024.html

The 2024 data set contains 12 rows and 26 columns. Each row represents a WNBA team during the 2024 season.

Download data: 

Available on the [Intro to Categorical Model-Making Module by Austin Hayes](https://github.com/schuckers/Charlotte_SCORE_Summer25/tree/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes): [2024_WNBA_Per_Game.csv](https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2024_WNBA_Per_Game.csv)

---

### Variables and their Descriptions:


<details>
<summary><b>Variable Descriptions</b></summary>

| Variable | Description |
|----------|-------------|
| Rk       | Rank of the team in the league |
| Team     | Name of the team |
| G        | Games played |
| MP       | Minutes played per game |
| FG       | Field Goals made per game |
| FGA      | Field Goal attempts per game |
| FG%      | Field Goal percentage (FG ÷ FGA) |
| 3P       | Three-Point Field Goals made per game |
| 3PA      | Three-Point Field Goal attempts per game |
| 3P%      | Three-Point Field Goal percentage (3P ÷ 3PA) |
| 2P       | Two-Point Field Goals made per game |
| 2PA      | Two-Point Field Goal attempts per game |
| 2P%      | Two-Point Field Goal percentage (2P ÷ 2PA) |
| FT       | Free Throws made per game |
| FTA      | Free Throw attempts per game |
| FT%      | Free Throw percentage (FT ÷ FTA) |
| ORB      | Offensive Rebounds per game |
| DRB      | Defensive Rebounds per game |
| TRB      | Total Rebounds per game |
| AST      | Assists per game |
| STL      | Steals per game |
| BLK      | Blocks per game |
| TOV      | Turnovers per game |
| PF       | Personal Fouls per game |
| PTS      | Points scored per game |
| CHAMPION | Did that team win the championship that year? 1 = Yes, 0 = No |

</details>

---

# Learning Goals

- Learn about the basics of categorical modeling
- Compare and contrast models using statistical metrics
- Learn about Quadratic Discriminant Analysis
- Learn about Linear Discriminant Analysis
- Learn about Logistic Regression

---

# Getting Started


In this lesson, we will learn about categorical modeling in order to predict the outcome of a certain event based on past and current metrics. 

Today, we will be predicting which team will be the WNBA champion this year! Before we get started, we need to import certain libraries in order to achieve this goal.

If you don't have any of these libraries installed, you can learn how to install them with the following links:

Pandas: https://pandas.pydata.org/docs/getting_started/install.html 

Numpy: https://numpy.org/install/

Sci-Kit Learn: https://scikit-learn.org/stable/install.html

After installing those libraries, let's import them.

In [None]:
# Import the necessary libraries for the module
# Basic Data Science Library for importing data and data manipulation
import pandas as pd

# Library for mathematical operations
import numpy as np

# Import only the necessary functions from the sklearn library
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

---

# Categorical Modeling

Categorical modeling is a way of predicting a non-numeric outcome using previously available data. Today, we will be predicting a binary outcome, like 'yes' or 'no', or 'Champion' or 'Not Champion'. Various models can be used to achieve this, but today we will go over Logistic, LDA and QDA models. Each of these models uses a different methodology to achieve the same goal. Therefore, each model's predictions may differ from the others', which warrants comparison. We can compare them using certain metrics that we will go over later in this lesson.

---

# Modeling Process For This Module

In order to build a model, we need to train it and then test it to see how accurate it is. Normally, we would build the model by splitting a season's worth of data into a training dataset and a test dataset, with more of the data going to training than to testing. This gives the model more examples to learn from before it makes predictions on the test data. On the flip side, there needs to be a balance between the train and test sets, because the more test data you have, the more reliably you can measure your model's accuracy. Normally, I use 60-70% of the data for training the model and the remaining 30-40% for testing it.
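
As a minimal sketch of the usual split described above, the example below uses a small made-up table (not the WNBA data) and scikit-learn's `train_test_split` with `test_size=0.3`, which holds out 30% of the rows for testing. The column values are hypothetical and only illustrate the mechanics.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature and one binary label
toy = pd.DataFrame({
    "PTS": [78.1, 80.4, 82.3, 84.0, 79.5, 85.2, 81.7, 83.3, 77.9, 86.0],
    "Champion": [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
})

# Hold out 30% of the rows for testing; random_state makes the split repeatable
train, test = train_test_split(toy, test_size=0.3, random_state=42)

# Number of rows that ended up in each set
print(len(train), len(test))
```

Changing `test_size` to 0.4 would give the 60/40 split mentioned above.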

However, we will do something a little different in this module. In each of the seasons we will be training the model with, there are only 12 WNBA teams. That is not enough data! Instead, I have compiled a dataset of 'Per Game' data for every team over the 5 seasons before the 2024 season (2019-2023). 'Per Game' means that each stat is averaged over all games played. For example, the 'PTS' column is each team's average points scored per game. We will train our models on that compiled dataset and then test their accuracy with the 2024 season data. After reviewing each model's accuracy score, we will pick the model with the highest accuracy and retrain it with another compiled dataset. This second compiled dataset covers 2020-2024 instead of 2019-2023 so that the model sees the newest data. After retraining, we will have the model predict the next WNBA Champion.

NOTE: The compiled dataset was built from several 'Per Game' datasets from Basketball-Reference.com. Please visit their website using the link at the top of this module. Each team's name has been edited to show which year its stats are from, since team names recur across seasons. A Champion column was also added to identify which teams won a championship: 1 = Champion, 0 = Not Champion.
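
For readers curious how such a compiled dataset could be assembled, here is a rough sketch using pandas. The two tiny season tables, the team names, and the champions below are all hypothetical placeholders; this is not the script used to build the module's files, only one way the steps in the note might look.

```python
import pandas as pd

# Hypothetical two-team seasons standing in for the real 'Per Game' tables
season_2022 = pd.DataFrame({"Team": ["Team A", "Team B"], "PTS": [90.4, 82.1]})
season_2023 = pd.DataFrame({"Team": ["Team A", "Team B"], "PTS": [85.0, 88.7]})

# Hypothetical champions for each season (not real results)
champions = {2022: "Team A", 2023: "Team B"}

frames = []
for year, season in [(2022, season_2022), (2023, season_2023)]:
    season = season.copy()
    # Flag the champion before renaming so the lookup uses the plain team name
    season["Champion"] = (season["Team"] == champions[year]).astype(int)
    # Append the year so recurring team names stay distinguishable
    season["Team"] = season["Team"] + " " + str(year)
    frames.append(season)

# Stack the seasons into one table with a fresh row index
compiled = pd.concat(frames, ignore_index=True)
print(compiled)
```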

To compare the models' accuracy against one another, we want to train every model on the same five seasons of data. Since this is a special case (each season has only one champion, NOT multiple), we will train the model(s) using 2019-2023 season data and test them with the 2024 season data. The model with the highest accuracy, or the model of our choosing, will then be used to predict who the WNBA champion will be this season using the 'Per Game' stats thus far.

To summarize, a model is something we can train with data so it can make a prediction on new data.

# Importing data and making train/test sets

Let's get started by reading in the data.

In [None]:
# Read in the WNBA 2019-2023 data from Github
# This dataset will be used to train the initial models
TRAIN_WNBA_19_23_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/WNBA_PER_GAME_19_23_DATA.csv')

In [None]:
# Read in the WNBA 2024 data from Github
# This will be used to test the initial models
TEST_WNBA_2024_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2024_WNBA_Per_Game.csv')

In [None]:
# Read in the WNBA 2020-2024 data from Github 
# This dataset will be used to train the final model
TRAIN_WNBA_20_24_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/WNBA_PER_GAME_20_24_DATA.csv')

In [122]:
# Read in the current WNBA 2025 stats for our model prediction later on in this module
CURRENT_WNBA_2025_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2025_WNBA_Per_Game.csv')

---

# Logistic Regression

Logistic Regression is a categorical model that uses probabilities to predict a binary outcome. It assumes a linear relationship between the independent variables and the log-odds of the outcome. Let's go ahead and establish a logistic model and train it with the compiled 2019-2023 data.
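
To make the link between log-odds and probabilities concrete, here is a small sketch on made-up one-feature data (not the WNBA data). It fits scikit-learn's `LogisticRegression`, takes the linear log-odds from `decision_function`, and checks that passing them through the sigmoid function reproduces the class-1 column of `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: larger values are more often labeled 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# decision_function returns the log-odds, a linear function of the features
log_odds = model.decision_function(X)

# The sigmoid maps log-odds to a probability between 0 and 1
manual_prob = 1.0 / (1.0 + np.exp(-log_odds))

# Column 1 of predict_proba is the probability of class 1
sklearn_prob = model.predict_proba(X)[:, 1]

# Check that the two ways of computing the probability agree
print(np.allclose(manual_prob, sklearn_prob))
```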

We will train it using ONLY the necessary stats. The model also needs to be trained using numeric values, so let's keep it simple by predicting the champion based on the following stats: FG, FGA, 3P, 3PA, 2P, 2PA, FT, FTA, ORB, DRB, TRB, AST, STL, BLK, TOV, PF, PTS. The definitions of these variables can be found at the top of this module.
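
Because the same 17 column names are typed out in every cell of this module, one optional convenience is to store them in a single list. The sketch below is only a suggestion: `FEATURES` and `split_features_and_label` are hypothetical names introduced here, and the one-row demo frame is made up to show the shapes that come back.

```python
import pandas as pd

# The 17 per-game stats named above, stored once so every cell can reuse them
FEATURES = ['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB',
            'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']

def split_features_and_label(data, label='Champion'):
    """Hypothetical helper: return the feature columns and the label column."""
    return data[FEATURES], data[label]

# Tiny made-up frame with the right columns, only to show the output shapes
demo = pd.DataFrame([[1.0] * len(FEATURES) + [0]], columns=FEATURES + ['Champion'])
X_demo, y_demo = split_features_and_label(demo)
print(X_demo.shape, y_demo.shape)
```

With the module's data, the helper would be called as `split_features_and_label(TRAIN_WNBA_19_23_Data)`, assuming the label column is named `Champion` as in the code cells.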

In [None]:
# Create the logistic regression model
logistic_model = LogisticRegression()

# Train the model using the compiled 2019-2023 WNBA data
logistic_model.fit(TRAIN_WNBA_19_23_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']], TRAIN_WNBA_19_23_Data['Champion'])

In [132]:
# Have the model make predictions using the 2024 WNBA data
logistic_model.predict(TEST_WNBA_2024_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

Hmm... very interesting! The model did not predict any team as the champion. As a tie-breaker, let's look at each team's predicted probability of winning the championship based on its per game stats.

In [133]:
# Have the model show the probabilities of each team winning the championship
logistic_model.predict_proba(TEST_WNBA_2024_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])

array([[8.63740715e-01, 1.36259285e-01],
       [9.87723563e-01, 1.22764373e-02],
       [9.92599507e-01, 7.40049344e-03],
       [9.95804644e-01, 4.19535558e-03],
       [9.90795369e-01, 9.20463065e-03],
       [9.97955560e-01, 2.04443956e-03],
       [9.95532831e-01, 4.46716886e-03],
       [9.96026021e-01, 3.97397866e-03],
       [9.99763317e-01, 2.36682617e-04],
       [9.99562400e-01, 4.37599855e-04],
       [9.99940828e-01, 5.91722938e-05],
       [9.99863659e-01, 1.36340892e-04]])

In each row, the value on the left is the probability of class 0 (not champion) and the value on the right is the probability of class 1 (champion). These values are printed in scientific notation: the number after the `e` is the power of ten, so `1.36259285e-01` is about 0.136 and `1.22764373e-02` is about 0.012. The model gave the first team in the list the highest probability of winning the championship (about 13.6%), but because that is below the default 0.5 cutoff, it still labeled that team a non-champion. However, looking back at the data, the second team in the list, the New York Liberty, won the championship in 2024. Ideally, the second team would have the highest probability for us to feel better about this model.
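
The tie-breaker can also be done in code rather than by eye. The sketch below copies the class-1 probabilities from the output above into a NumPy array, prints them without scientific notation, and uses `np.argmax` and `np.argsort` to rank the teams. Positions are 0-based, so position 0 is the first team in the list.

```python
import numpy as np

# Class-1 (champion) probabilities copied from the predict_proba output above
champion_prob = np.array([
    1.36259285e-01, 1.22764373e-02, 7.40049344e-03, 4.19535558e-03,
    9.20463065e-03, 2.04443956e-03, 4.46716886e-03, 3.97397866e-03,
    2.36682617e-04, 4.37599855e-04, 5.91722938e-05, 1.36340892e-04,
])

# Show plain decimals instead of scientific notation
np.set_printoptions(suppress=True, precision=4)
print(champion_prob)

# Position (0-based) of the team with the highest champion probability
best = int(np.argmax(champion_prob))
print(best)

# Positions of all teams, ordered from most to least likely champion
ranking = np.argsort(champion_prob)[::-1]
print(ranking[:3])
```

Under this ranking, the second team in the list (position 1) comes in second place.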

Logistic regression assumes that the log-odds of the outcome are a linear function of the predictors, so this result may indicate that it is a poor type of model to use for this situation. For now, let's go ahead and calculate an accuracy score and move on to Discriminant Analysis Models.

In [134]:
# Make accuracy score
accuracy_score_log = accuracy_score(TEST_WNBA_2024_Data['Champion'], logistic_model.predict(TEST_WNBA_2024_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']]))

In [135]:
# Print the accuracy score
print(f"Accuracy Score: {accuracy_score_log}")

Accuracy Score: 0.9166666666666666


---

# Discriminant Analysis


There are two types of Discriminant Analysis models that we will be going over in this lesson: Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). These models draw a decision boundary that separates the data points into categories. Roughly speaking, Discriminant Analysis looks for the combination of features that best separates the groups, judging by how much the groups differ from each other compared to how much the data varies within each group. We will begin with LDA.

--- 

# Linear Discriminant Analysis


LDA is a linear form of discriminant analysis. It assumes that the features in each group follow a normal distribution and that every classification group has the same variance (more precisely, the same covariance matrix). Here, that means the spread of the stats is assumed to be the same for champions and non-champions. Let's get started; we will follow the EXACT same process as we did with logistic regression.

In [None]:
# Create the Linear Discriminant Analysis model
lda_model = LinearDiscriminantAnalysis()

# Train the model using the compiled 2019-2023 WNBA data
lda_model.fit(TRAIN_WNBA_19_23_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']], TRAIN_WNBA_19_23_Data['Champion'])

In [137]:
# Now let's test the model using the 2024 WNBA data and have it make predictions
lda_model.predict(TEST_WNBA_2024_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

Interesting; this LDA model has created a completely different result compared to that of the logistic model. This model appears to have classified the 5th team on the list as the WNBA champion, which would be the Seattle Storm. Unfortunately, this result is incorrect because the New York Liberty won the championship during this season. Let's take a look at the probability breakdown.

In [138]:
lda_model.predict_proba(TEST_WNBA_2024_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])

array([[9.99999999e-01, 1.35027822e-09],
       [1.00000000e+00, 2.19759302e-33],
       [1.00000000e+00, 2.20412020e-19],
       [1.00000000e+00, 8.07393739e-11],
       [7.52968352e-02, 9.24703165e-01],
       [1.00000000e+00, 5.43801150e-22],
       [1.00000000e+00, 5.15106391e-26],
       [1.00000000e+00, 4.66956756e-46],
       [1.00000000e+00, 1.31260261e-22],
       [1.00000000e+00, 1.31463182e-25],
       [1.00000000e+00, 3.10133759e-53],
       [1.00000000e+00, 4.03924124e-60]])

WOW! According to this model, no other team comes REMOTELY close to the Seattle Storm's probability of winning the championship. This is quite different from what we saw from the logistic model. The model may have predicted them to be champions because their per game stats were better than those of some teams ranked above them in the data. They had some of the best per game stats in the league despite not winning the championship. Very interesting! This is something to take note of before moving on. Let's record the accuracy score and move on to QDA.

In [None]:
# Assign LDA Model Prediction from 2024 WNBA Data to a variable
lda_predictions = lda_model.predict(TEST_WNBA_2024_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])
lda_predictions

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

In [140]:
# Get Accuracy Score
accuracy_score_lda = accuracy_score(TEST_WNBA_2024_Data['Champion'], lda_predictions)

# Print the accuracy score
print(f"Accuracy Score: {accuracy_score_lda}")

Accuracy Score: 0.8333333333333334
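
Accuracy on its own can be hard to interpret here, because only one of the twelve teams is a champion. As a rough illustration, the sketch below re-enters the logistic and LDA predictions shown above by hand and compares them against the true labels, assuming the champion sits in the second row as noted in the logistic regression section. It reports accuracy, recall, and the confusion matrix for each model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# True labels for 2024, assuming the champion is the second row of the data
y_true = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Predictions copied from the logistic and LDA outputs above
logistic_pred = np.zeros(12, dtype=int)
lda_pred = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

for name, pred in [("Logistic", logistic_pred), ("LDA", lda_pred)]:
    acc = accuracy_score(y_true, pred)
    # Recall: share of true champions that the model actually flagged
    rec = recall_score(y_true, pred)
    print(name, round(acc, 4), rec)
    # Rows are true classes (0, 1); columns are predicted classes (0, 1)
    print(confusion_matrix(y_true, pred))
```

A model that never predicts a champion gets 11 of 12 teams right, while a model that picks the wrong champion gets 10 of 12 right, yet neither one identifies the actual champion. This is one reason to look at more than one metric when comparing models.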


---

# Quadratic Discriminant Analysis

QDA is very similar to LDA. Like LDA, QDA assumes a normal distribution within each group. Unlike LDA, QDA allows each classification group to have its own variances. As a result, QDA does NOT assume a linear classification boundary (its boundary can be curved), so it is seen as the more flexible option. Now let's get started.
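
To see what the extra flexibility can buy, here is a small sketch on synthetic data (not the WNBA data). Both made-up groups are centered at the same point, but one is much more spread out than the other, so no straight line separates them well. The data, sample sizes, and random seed are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)

# Hypothetical data: both classes are centered at the origin,
# but class 1 is far more spread out than class 0
tight = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
wide = rng.normal(loc=0.0, scale=3.0, size=(200, 2))
X = np.vstack([tight, wide])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

# Training accuracy only, to compare the two kinds of boundary
lda_acc = lda.score(X, y)
qda_acc = qda.score(X, y)
print(round(lda_acc, 3), round(qda_acc, 3))
```

In this setup QDA should score noticeably higher, because it can draw a curved boundary around the tight group. When the groups really do share the same spread, the two models tend to behave similarly.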

NOTE: Before we get started, QDA is a bit different and needs more than one sample in each category, including the champion category. In other words, the training data must contain more than one row that has 1 under the Champion column. This gets a bit hairy with a single season, since one season has only one champion. The compiled 2019-2023 dataset covers several seasons, and therefore several champions, so we can train on it just as before.
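
A quick way to check this requirement before fitting is to count the rows in each class. The sketch below uses a made-up label column shaped like the compiled training data described earlier (12 teams per season, one champion in each of 5 seasons); it is an assumption for illustration, not the actual file.

```python
import pandas as pd

# Hypothetical label column: 12 teams per season, one champion in each of 5 seasons
labels = pd.Series(([1] + [0] * 11) * 5, name="Champion")

# Count how many rows fall in each class before fitting QDA
counts = labels.value_counts()
print(counts)

# QDA estimates a separate covariance matrix per class,
# so each class needs at least two rows
enough_samples = bool((counts >= 2).all())
print(enough_samples)
```

With the real data, the same check would be `TRAIN_WNBA_19_23_Data['Champion'].value_counts()`, assuming the column is named `Champion` as in the code cells.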

In [None]:
# Make the Quadratic Discriminant Analysis model
qda_model = QuadraticDiscriminantAnalysis()

# Train the model using the compiled 2019-2023 WNBA data
qda_model.fit(TRAIN_WNBA_19_23_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']], TRAIN_WNBA_19_23_Data['Champion'])



In [162]:
# Have the model make predictions using the 2024 WNBA data
qda_model.predict(TEST_WNBA_2024_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)