---
Title: "Intro to Categorical Model-Making Module by Austin Hayes"

Author:
  - Name: Austin Hayes

  - Email: ahayes65@charlotte.edu

  - Affiliation: University of North Carolina at Charlotte

Date: July 29, 2025

Description: Using 2024 WNBA Team 'Per Game' stats, we will review the basics of categorical model-making. We will make QDA, LDA and Logistic models in order to predict which team will become the next WNBA Champion.

Categories:
  - Logistic Regression
  - Quadratic Discriminant Analysis
  - Linear Discriminant Analysis
  - Importing and Reading data
  - Data Cleaning
  - Data Science
  - Pandas


### Data

This dataset is from Basketball Reference: https://www.basketball-reference.com

Visit the original data page here: https://www.basketball-reference.com/wnba/years/2024.html

The data set contains 13 rows and 25 columns. Each row represents a WNBA team during the 2024 season.

Download data: 

Available on the [Intro to Categorical Model-Making Module by Austin Hayes](https://github.com/schuckers/Charlotte_SCORE_Summer25/tree/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes): [2024_WNBA_Per_Game.csv](https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2024_WNBA_Per_Game.csv)

---

### Variables and their Descriptions:


<details>
<summary><b>Variable Descriptions</b></summary>

| Variable | Description |
|----------|-------------|
| Rk       | Rank of the team in the league |
| Team     | Name of the team |
| G        | Games played |
| MP       | Minutes played per game |
| FG       | Field Goals made per game |
| FGA      | Field Goal attempts per game |
| FG%      | Field Goal percentage (FG ÷ FGA) |
| 3P       | Three-Point Field Goals made per game |
| 3PA      | Three-Point Field Goal attempts per game |
| 3P%      | Three-Point Field Goal percentage (3P ÷ 3PA) |
| 2P       | Two-Point Field Goals made per game |
| 2PA      | Two-Point Field Goal attempts per game |
| 2P%      | Two-Point Field Goal percentage (2P ÷ 2PA) |
| FT       | Free Throws made per game |
| FTA      | Free Throw attempts per game |
| FT%      | Free Throw percentage (FT ÷ FTA) |
| ORB      | Offensive Rebounds per game |
| DRB      | Defensive Rebounds per game |
| TRB      | Total Rebounds per game |
| AST      | Assists per game |
| STL      | Steals per game |
| BLK      | Blocks per game |
| TOV      | Turnovers per game |
| PF       | Personal Fouls per game |
| PTS      | Points scored per game |

</details>

---

# Learning Goals

- Learn about the basics of categorical modeling
- Compare and contrast models using statistical metrics
- Learn about Quadratic Discriminant Analysis
- Learn about Linear Discriminant Analysis
- Learn about Logistic Regression

---

# Getting Started


In this lesson, we will learn about categorical modeling in order to predict the outcome of a certain event based on past and current metrics. 

Today, we will be predicting which team will be the WNBA champion this year! Before we get started, we need to import a few libraries.

If you don't have any of these libraries installed, you can learn how to install them at the following links:

Pandas: https://pandas.pydata.org/docs/getting_started/install.html 

NumPy: https://numpy.org/install/

scikit-learn: https://scikit-learn.org/stable/install.html

After installing those libraries, let's import them.

In [None]:
# Import the necessary libraries for the module

# Basic Data Science Library for importing data and data manipulation
import pandas as pd

# Library for mathematical operations
import numpy as np

# Import only the necessary functions from the sklearn library
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
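
If an import fails or behaves unexpectedly, it can help to confirm which versions are installed. This is a minimal sketch, assuming all three libraries are installed in the active environment; the `versions` dictionary is just an illustrative name.

```python
import numpy as np
import pandas as pd
import sklearn

# Each package exposes its installed version as a string
versions = {
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```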

---

# Categorical Modeling

Categorical modeling is a way of predicting a non-numeric outcome from previously available data. Today, we will predict a binary outcome, such as 'yes' or 'no', or 'Champion' or 'Not Champion'. Various models can do this, but we will cover Logistic Regression, LDA, and QDA. Each model uses a different methodology to reach the same goal, so their predictions may differ, which makes comparison worthwhile. We will compare them using metrics covered later in this lesson.

---

# Modeling Process For This Module

To build a model, we need to train it and then test it to see how accurate it is. Normally, we would split a season's worth of data into a training set and a test set, with more data going to training than to testing so the model has enough examples to learn from. There is a trade-off, though: the more test data you hold out, the more data you have to check your model's accuracy. I normally use 60-70% of the data for training and the remaining 30-40% for testing.
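
As a rough illustration of an ordinary train/test split, the sketch below holds out 30% of a small made-up dataset with `train_test_split`. The arrays `X` and `y` are hypothetical toy data, not WNBA stats, and `random_state=42` is an arbitrary choice that simply makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features each, with binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 30% of the rows for testing; the remaining 70% train the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("Training rows:", X_train.shape[0])
print("Test rows:", X_test.shape[0])
```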

However, we will do something a little different in this module. Each season has only one champion, so a random split of a single season would put that champion in either the training set or the test set, but never both. Instead, we will use the previous two seasons: we will train the model(s) on the 2023 season data and test them on the 2024 season data. The model with the highest accuracy, or the one of our choosing, will then be used to predict this season's WNBA champion from the 'Per Game' stats so far.

To summarize, a model is something we train with data so that it can make predictions on new data.

# Importing data and making train/test sets

Let's get started by reading in the data.

In [None]:
# Read in the WNBA 2023 data from Github 
# This dataset will be used to train the model
TRAIN_WNBA_2023_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2023_WNBA_Per_Game.csv')

In [None]:
# Read in the WNBA 2024 data from Github 
# This dataset will be used to test the model
TEST_WNBA_2024_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2024_WNBA_Per_Game.csv')

In [None]:
# Read in the current WNBA 2025 stats for our model prediction later on in this module
CURRENT_WNBA_2025_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2025_WNBA_Per_Game.csv')


Before we go any further, we need to clean up the data. If we pull up the data, the last row is a league-average row rather than a team, so let's drop it from all three DataFrames.

In [None]:
# Display the dataframe so you can see the data before changing it
TRAIN_WNBA_2023_Data[:]

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1.0,Las Vegas Aces*,40,200.0,33.6,69.2,0.486,9.3,24.9,0.372,...,0.84,6.2,28.7,34.8,21.7,8.0,4.8,11.1,16.0,92.8
1,2.0,New York Liberty*,40,201.9,32.2,69.9,0.46,11.1,29.7,0.374,...,0.829,8.7,29.2,37.9,24.1,6.7,4.5,13.5,16.0,89.2
2,3.0,Dallas Wings*,40,200.6,32.4,73.2,0.443,6.8,21.3,0.317,...,0.806,11.8,26.9,38.7,20.3,7.6,4.3,13.1,18.9,87.9
3,4.0,Connecticut Sun*,40,201.9,30.2,67.8,0.445,7.2,20.0,0.36,...,0.766,8.1,25.6,33.6,20.7,8.1,3.8,12.4,18.4,82.7
4,5.0,Atlanta Dream*,40,201.3,29.4,68.7,0.428,6.4,19.2,0.336,...,0.788,8.0,28.1,36.1,18.6,6.3,4.6,13.6,19.3,82.5
5,6.0,Chicago Sky*,40,201.3,30.7,69.6,0.442,8.3,22.2,0.372,...,0.752,8.6,24.8,33.3,20.5,6.7,4.5,13.5,17.3,81.7
6,7.0,Indiana Fever,40,201.9,30.2,68.3,0.442,6.7,19.8,0.34,...,0.783,8.9,25.1,34.0,18.0,6.5,3.0,14.1,19.7,81.0
7,8.0,Washington Mystics*,40,200.6,28.9,67.5,0.428,7.8,23.1,0.336,...,0.823,6.6,25.7,32.3,19.2,7.7,3.2,12.2,18.6,80.5
8,9.0,Minnesota Lynx*,40,201.3,29.4,67.5,0.435,6.8,20.8,0.325,...,0.801,8.0,26.3,34.3,19.4,6.4,2.6,13.4,16.9,80.2
9,10.0,Los Angeles Sparks,40,200.6,28.4,66.8,0.425,6.5,19.5,0.333,...,0.819,6.7,24.8,31.5,19.0,8.5,2.9,12.5,16.9,78.9


In [None]:
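# Drop the last row (the league average) of the dataframe for 2023 data
# .copy() gives an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning
TRAIN_WNBA_2023_Data = TRAIN_WNBA_2023_Data.iloc[:-1].copy()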

In [None]:
# Check to make sure it worked as expected
TRAIN_WNBA_2023_Data[:]

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1.0,Las Vegas Aces*,40,200.0,33.6,69.2,0.486,9.3,24.9,0.372,...,0.84,6.2,28.7,34.8,21.7,8.0,4.8,11.1,16.0,92.8
1,2.0,New York Liberty*,40,201.9,32.2,69.9,0.46,11.1,29.7,0.374,...,0.829,8.7,29.2,37.9,24.1,6.7,4.5,13.5,16.0,89.2
2,3.0,Dallas Wings*,40,200.6,32.4,73.2,0.443,6.8,21.3,0.317,...,0.806,11.8,26.9,38.7,20.3,7.6,4.3,13.1,18.9,87.9
3,4.0,Connecticut Sun*,40,201.9,30.2,67.8,0.445,7.2,20.0,0.36,...,0.766,8.1,25.6,33.6,20.7,8.1,3.8,12.4,18.4,82.7
4,5.0,Atlanta Dream*,40,201.3,29.4,68.7,0.428,6.4,19.2,0.336,...,0.788,8.0,28.1,36.1,18.6,6.3,4.6,13.6,19.3,82.5
5,6.0,Chicago Sky*,40,201.3,30.7,69.6,0.442,8.3,22.2,0.372,...,0.752,8.6,24.8,33.3,20.5,6.7,4.5,13.5,17.3,81.7
6,7.0,Indiana Fever,40,201.9,30.2,68.3,0.442,6.7,19.8,0.34,...,0.783,8.9,25.1,34.0,18.0,6.5,3.0,14.1,19.7,81.0
7,8.0,Washington Mystics*,40,200.6,28.9,67.5,0.428,7.8,23.1,0.336,...,0.823,6.6,25.7,32.3,19.2,7.7,3.2,12.2,18.6,80.5
8,9.0,Minnesota Lynx*,40,201.3,29.4,67.5,0.435,6.8,20.8,0.325,...,0.801,8.0,26.3,34.3,19.4,6.4,2.6,13.4,16.9,80.2
9,10.0,Los Angeles Sparks,40,200.6,28.4,66.8,0.425,6.5,19.5,0.333,...,0.819,6.7,24.8,31.5,19.0,8.5,2.9,12.5,16.9,78.9


In [None]:
# Now do it for the 2024 data (copy so new columns can be added safely)
TEST_WNBA_2024_Data = TEST_WNBA_2024_Data.iloc[:-1].copy()

# Now do it for the 2025 data
CURRENT_WNBA_2025_Data = CURRENT_WNBA_2025_Data.iloc[:-1].copy()
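
Dropping by position works as long as the summary row really is last. A slightly more defensive option is to filter on the row's label instead. The sketch below uses a tiny made-up table; the label `'League Average'` and the numbers in its row are assumptions for illustration, so check the actual value in the `Team` column of the CSV before relying on it.

```python
import pandas as pd

# Hypothetical miniature of a per-game table; the summary row's label
# and its PTS value are assumed, not taken from the real CSV
toy = pd.DataFrame({
    "Team": ["Las Vegas Aces*", "New York Liberty*", "League Average"],
    "PTS": [92.8, 89.2, 85.0],
})

# Keep only the rows that are real teams, and copy the result so that
# later column assignments act on an independent DataFrame
teams_only = toy[toy["Team"] != "League Average"].copy()

print(teams_only)
```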

Now we want to make a new column that marks each team as the champion or not the champion of that season. We will do this with 1s and 0s: 1 means the team won the championship that year, and 0 means it did not. We will do this for each dataset that has a known champion.

In [None]:
# Make a new column for the 2023 data that marks the champion team
# The list has one value per team, in row order; row 0 (Las Vegas Aces) is the 2023 champion
TRAIN_WNBA_2023_Data['Champion'] = [1,0,0,0,0,0,0,0,0,0,0,0]

In [None]:
# Make sure it worked as expected
TRAIN_WNBA_2023_Data[:]

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Champion
0,1.0,Las Vegas Aces*,40,200.0,33.6,69.2,0.486,9.3,24.9,0.372,...,6.2,28.7,34.8,21.7,8.0,4.8,11.1,16.0,92.8,1
1,2.0,New York Liberty*,40,201.9,32.2,69.9,0.46,11.1,29.7,0.374,...,8.7,29.2,37.9,24.1,6.7,4.5,13.5,16.0,89.2,0
2,3.0,Dallas Wings*,40,200.6,32.4,73.2,0.443,6.8,21.3,0.317,...,11.8,26.9,38.7,20.3,7.6,4.3,13.1,18.9,87.9,0
3,4.0,Connecticut Sun*,40,201.9,30.2,67.8,0.445,7.2,20.0,0.36,...,8.1,25.6,33.6,20.7,8.1,3.8,12.4,18.4,82.7,0
4,5.0,Atlanta Dream*,40,201.3,29.4,68.7,0.428,6.4,19.2,0.336,...,8.0,28.1,36.1,18.6,6.3,4.6,13.6,19.3,82.5,0
5,6.0,Chicago Sky*,40,201.3,30.7,69.6,0.442,8.3,22.2,0.372,...,8.6,24.8,33.3,20.5,6.7,4.5,13.5,17.3,81.7,0
6,7.0,Indiana Fever,40,201.9,30.2,68.3,0.442,6.7,19.8,0.34,...,8.9,25.1,34.0,18.0,6.5,3.0,14.1,19.7,81.0,0
7,8.0,Washington Mystics*,40,200.6,28.9,67.5,0.428,7.8,23.1,0.336,...,6.6,25.7,32.3,19.2,7.7,3.2,12.2,18.6,80.5,0
8,9.0,Minnesota Lynx*,40,201.3,29.4,67.5,0.435,6.8,20.8,0.325,...,8.0,26.3,34.3,19.4,6.4,2.6,13.4,16.9,80.2,0
9,10.0,Los Angeles Sparks,40,200.6,28.4,66.8,0.425,6.5,19.5,0.333,...,6.7,24.8,31.5,19.0,8.5,2.9,12.5,16.9,78.9,0


In [None]:
# Now let's do the same for the 2024 dataset; row 1 (New York Liberty) is the 2024 champion
# Remember, we cannot do this for the 2025 dataset because there is no champion yet
TEST_WNBA_2024_Data['Champion'] = [0,1,0,0,0,0,0,0,0,0,0,0]

In [None]:
# Make sure it worked as expected
TEST_WNBA_2024_Data[:]

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Champion
0,1.0,Las Vegas Aces*,40,200.6,30.9,68.1,0.454,9.4,26.5,0.355,...,5.6,28.5,34.1,20.5,7.1,5.0,10.8,16.5,86.4,0
1,2.0,New York Liberty*,40,200.0,30.8,68.7,0.448,10.1,29.0,0.349,...,8.5,28.1,36.6,22.8,7.9,4.5,12.7,15.4,85.6,1
2,3.0,Indiana Fever*,40,200.6,31.3,68.5,0.456,9.2,25.9,0.356,...,8.3,26.8,35.1,20.4,5.9,4.3,14.2,18.2,85.0,0
3,4.0,Dallas Wings,40,201.9,31.7,71.0,0.446,6.3,19.2,0.326,...,10.5,24.3,34.8,20.4,7.1,4.0,14.8,18.5,84.2,0
4,5.0,Seattle Storm*,40,201.2,31.1,71.3,0.435,6.1,21.0,0.288,...,8.7,26.0,34.7,20.7,9.3,5.2,12.4,16.5,83.2,0
5,6.0,Minnesota Lynx*,40,201.9,30.1,67.3,0.448,9.5,25.0,0.38,...,7.4,26.8,34.3,23.0,8.6,4.2,13.4,16.4,82.0,0
6,7.0,Phoenix Mercury*,40,201.2,29.1,66.3,0.439,8.5,26.2,0.326,...,6.7,25.6,32.3,19.9,6.6,4.7,13.3,16.9,81.5,0
7,8.0,Connecticut Sun*,40,201.2,29.3,65.9,0.444,5.9,18.0,0.327,...,8.4,25.1,33.5,19.9,8.2,3.7,12.1,16.1,80.1,0
8,9.0,Washington Mystics,40,201.2,29.0,67.0,0.433,9.7,26.6,0.366,...,7.1,24.8,31.9,21.6,7.3,3.4,15.1,18.4,79.3,0
9,10.0,Los Angeles Sparks,40,200.6,28.1,66.4,0.423,7.2,22.6,0.32,...,7.4,25.3,32.7,19.7,7.3,3.2,15.0,17.9,78.4,0
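
A hand-typed list of 1s and 0s depends on the row order staying the same. As an alternative sketch, the label can be derived from the team name, so it survives re-sorting. The small table and the `clean_names` variable below are illustrative only; the trailing `*` that appears on some team names in the output above is stripped before comparing.

```python
import pandas as pd

# Hypothetical three-team table using names as they appear in the data
toy = pd.DataFrame({
    "Team": ["Las Vegas Aces*", "New York Liberty*", "Indiana Fever"],
})

# Remove the trailing '*' marker so names can be compared exactly
clean_names = toy["Team"].str.rstrip("*")

# True/False becomes 1/0: 1 for the champion, 0 for everyone else
toy["Champion"] = (clean_names == "Las Vegas Aces").astype(int)

print(toy)
```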


---

# Logistic Regression

Logistic Regression is a categorical model that uses probabilities to predict a binary outcome. It assumes a linear relationship between the independent variables and the log-odds of the outcome. Let's go ahead and establish a logistic model and train it with the 2023 data.
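
To make "uses probabilities" concrete, the sketch below shows the logistic (sigmoid) function, which turns a linear score into a value between 0 and 1. The `intercept` and `coef_pts` values are made up for illustration and are not fitted coefficients from any WNBA model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-feature model: score = intercept + coef_pts * PTS
intercept = -43.0
coef_pts = 0.5

# Example points-per-game values
pts = np.array([78.9, 82.7, 92.8])

# A higher linear score gives a higher estimated probability
probs = sigmoid(intercept + coef_pts * pts)
print(probs)
```

A score of zero corresponds to a probability of exactly 0.5, which is why 0.5 is the usual cutoff between the two classes.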

We will train it using ONLY the necessary stats. The model also needs numeric inputs, so let's keep it simple by predicting the champion based on the following stats: FG, FGA, 3P, 3PA, 2P, 2PA, FT, FTA, ORB, DRB, TRB, AST, STL, BLK, TOV, PF, PTS. The definitions of these variables can be found at the top of this module.

In [None]:
# Create the logistic regression model
logistic_model = LogisticRegression()

# Per-game stats used as predictors
feature_columns = ['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']

# Train the model using the 2023 WNBA data
logistic_model.fit(TRAIN_WNBA_2023_Data[feature_columns], TRAIN_WNBA_2023_Data['Champion'])
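
Because each season has exactly one champion, a plain 0/1 prediction can end up labeling every team as 'not champion'. One possible workaround, sketched here on made-up numbers, is to rank teams by their predicted probability and pick the top one. The tables, team names, and the scaling step are assumptions of this sketch and are not necessarily how the rest of this module proceeds.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training season: one stat and a 0/1 champion label
train = pd.DataFrame({
    "PTS": [92.8, 89.2, 87.9, 82.7, 81.0, 78.9],
    "Champion": [1, 0, 0, 0, 0, 0],
})

# Hypothetical new season with placeholder team names
new_season = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C"],
    "PTS": [80.0, 86.0, 84.0],
})

feature_columns = ["PTS"]

# Scaling is an added step in this sketch; it standardizes the features before fitting
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(train[feature_columns], train["Champion"])

# Column 1 of predict_proba holds the estimated probability of class 1 (champion)
new_season["P_champion"] = model.predict_proba(new_season[feature_columns])[:, 1]

# Rank teams by probability instead of relying on a 0/1 cutoff
ranked = new_season.sort_values("P_champion", ascending=False)
most_likely = ranked.iloc[0]["Team"]

print(ranked[["Team", "P_champion"]])
print("Most likely champion:", most_likely)
```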