---
Title: "Intro to Categorical Model-Making Module by Austin Hayes"

Author:
  - Name: Austin Hayes

  -  Email: ahayes65@charlotte.edu

  -  Affiliation: University of North Carolina at Charlotte

Date: July 29, 2025

Description: Using WNBA Team 'Per Game' stats, we will review the basics of categorical model-making. We will make QDA, LDA and Logistic models in order to predict which team will become the next WNBA Champion.

Categories:
  - Logistic Regression
  - Quadratic Discriminant Analysis
  - Linear Discriminant Analysis
  - Importing and Reading data
  - Data Cleaning
  - Data Science
  - Pandas


### Data

This Dataset is from Basketball Reference @ https://www.basketball-reference.com

Visit the original data page here: https://www.basketball-reference.com/wnba/years/2024.html

The data set contains 60 rows and 26 columns. Each row represents a WNBA team during the 2024 season.

Download data: 

Available on the [Intro to Categorical Model-Making Module by Austin Hayes](https://github.com/schuckers/Charlotte_SCORE_Summer25/tree/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes): [2024_WNBA_Per_Game.csv](https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2024_WNBA_Per_Game.csv)

---

### Variables and their Descriptions:


<details>
<summary><b>Variable Descriptions</b></summary>

| Variable | Description |
|----------|-------------|
| Rk       | Rank of the team in the league |
| Team     | Name of the team |
| G        | Games played |
| MP       | Minutes played per game |
| FG       | Field Goals made per game |
| FGA      | Field Goal attempts per game |
| FG%      | Field Goal percentage (FG ÷ FGA) |
| 3P       | Three-Point Field Goals made per game |
| 3PA      | Three-Point Field Goal attempts per game |
| 3P%      | Three-Point Field Goal percentage (3P ÷ 3PA) |
| 2P       | Two-Point Field Goals made per game |
| 2PA      | Two-Point Field Goal attempts per game |
| 2P%      | Two-Point Field Goal percentage (2P ÷ 2PA) |
| FT       | Free Throws made per game |
| FTA      | Free Throw attempts per game |
| FT%      | Free Throw percentage (FT ÷ FTA) |
| ORB      | Offensive Rebounds per game |
| DRB      | Defensive Rebounds per game |
| TRB      | Total Rebounds per game |
| AST      | Assists per game |
| STL      | Steals per game |
| BLK      | Blocks per game |
| TOV      | Turnovers per game |
| PF       | Personal Fouls per game |
| PTS      | Points scored per game |
| CHAMPION | Did that team win the championship that year? 1 = Yes, 0 = No |

</details>

---

# Review Questions

Below, you will be asked a series of questions to review the material you have learned throughout this module.

SETUP:

In [1]:
# Import the necessary librarys for the module
# Basic Data Science Library for importing data and data manipulation
import pandas as pd

# Library for mathematical operations
import numpy as np

# Import only the necessary functions from the sklearn library
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

In [2]:
# Read in the WNBA 2020-2024 data from Github 
# This dataset will be used to train the final model
TRAIN_WNBA_20_24_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/WNBA_PER_GAME_20_24_DATA.csv')

In [3]:
# Read in the current WNBA 2025 stats for our model prediction later on in this module
CURRENT_WNBA_2025_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2025_WNBA_Per_Game.csv')

### 1.) How does KNN work and what needs to be done to the data before use of a KNN model?

ANSWER: _KNN classifys points based on their "nearest neighbors". This means the model compares the new points to similar points that the model was trained on in order to make a prediction. K is the number of neighbors that are used to classify a new point. K should also be an odd number in order to prevent a tie among K points. Before using KNN, it is important to scale your data so that the model does not make biased predictions when using predictors with larger numerical ranges._

### 2.) Using the 2020-2024 dataset, create a list of predictor variables that include FG%, 3P%, 2P%, FT%, ORB, AST. Scale the list of predictor variables. Make sure you also scale them for the 2025 data since you will be asked to make a prediction on them later. Create a new column in the ORIGINAL 2020-2024 dataset that assesses whether a teams points ('PTS') are over 82 per game. 1 = True, 0 = False. Make this column for the ORIGINAL 2025 dataset as well. 

Hint: _Use .astype(int)_ 

In [4]:
# ANSWER:

# Create the scaler object
scaler = StandardScaler()

# Fit the scaler to the 2020-2024 WNBA data and transform (scale) the data
training_data_20_24 = scaler.fit_transform(TRAIN_WNBA_20_24_Data[['FG%','3P%','FT%','ORB','AST']])

# Transform (scale) the 2025 WNBA data using the same scaler
test_data_2025 = scaler.fit_transform(CURRENT_WNBA_2025_Data[['FG%','3P%','FT%','ORB','AST']])

# Create a new column in the 2020-2024 dataset that assesses whether a team's points ('PTS') are over 82 per game
TRAIN_WNBA_20_24_Data['OVR82?'] = (TRAIN_WNBA_20_24_Data['PTS'] > 82).astype(int)

# Make a new column for the 2025 data as well so we can record an accuracy score.
CURRENT_WNBA_2025_Data['OVR82?'] = (CURRENT_WNBA_2025_Data['PTS'] > 82).astype(int)

In [5]:
# Check to see that new column works and is established as 1's and 0's
TRAIN_WNBA_20_24_Data

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,TRB,AST,STL,BLK,TOV,PF,PTS,CHAMPION,PLAYOFFS,OVR82?
0,1,2024 Las Vegas Aces,40,200.6,30.9,68.1,0.454,9.4,26.5,0.355,...,34.1,20.5,7.1,5.0,10.8,16.5,86.4,0,1,1
1,2,2024 New York Liberty,40,200.0,30.8,68.7,0.448,10.1,29.0,0.349,...,36.6,22.8,7.9,4.5,12.7,15.4,85.6,1,1,1
2,3,2024 Indiana Fever,40,200.6,31.3,68.5,0.456,9.2,25.9,0.356,...,35.1,20.4,5.9,4.3,14.2,18.2,85.0,0,1,1
3,4,2024 Dallas Wings,40,201.9,31.7,71.0,0.446,6.3,19.2,0.326,...,34.8,20.4,7.1,4.0,14.8,18.5,84.2,0,0,1
4,5,2024 Seattle Storm,40,201.2,31.1,71.3,0.435,6.1,21.0,0.288,...,34.7,20.7,9.3,5.2,12.4,16.5,83.2,0,1,1
5,6,2024 Minnesota Lynx,40,201.9,30.1,67.3,0.448,9.5,25.0,0.38,...,34.3,23.0,8.6,4.2,13.4,16.4,82.0,0,1,0
6,7,2024 Phoenix Mercury,40,201.2,29.1,66.3,0.439,8.5,26.2,0.326,...,32.3,19.9,6.6,4.7,13.3,16.9,81.5,0,1,0
7,8,2024 Connecticut Sun,40,201.2,29.3,65.9,0.444,5.9,18.0,0.327,...,33.5,19.9,8.2,3.7,12.1,16.1,80.1,0,1,0
8,9,2024 Washington Mystics,40,201.2,29.0,67.0,0.433,9.7,26.6,0.366,...,31.9,21.6,7.3,3.4,15.1,18.4,79.3,0,0,0
9,10,2024 Los Angeles Sparks,40,200.6,28.1,66.4,0.423,7.2,22.6,0.32,...,32.7,19.7,7.3,3.2,15.0,17.9,78.4,0,0,0


### 3.) Using the scaled data you created in question 2, predict which team(s) will have above 82 points per game during the 2025 WNBA season with a QDA model. Record an accuracy score.

QUICK NOTE: You can use an accuracy score in this case because there are already pts per game recorded for each team thus far. Although they are not final for the WHOLE season, this is an interesting experiment to see if these are strong predictors of whether a team will have high points per game. 

In [6]:
# Create the model object
qda_model = QuadraticDiscriminantAnalysis()

# Fit the model using the 2020-2024 WNBA data that was scaled earlier
qda_model.fit(training_data_20_24, TRAIN_WNBA_20_24_Data['OVR82?'])

# Have the model make predictions using the 2025 WNBA data
qda_predictions = qda_model.predict(test_data_2025)

# takes a look at the array of predictions
qda_predictions

array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0])

In [7]:
# Record the accuracy score of the QDA model
qda_accuracy = accuracy_score(CURRENT_WNBA_2025_Data['OVR82?'], qda_predictions)

In [8]:
# Print the accuracy score
print(f"QDA Model Accuracy: {qda_accuracy:.2f}")

QDA Model Accuracy: 0.85


### 4.) Do the same thing you did in question 3 but with 3 KNN models instead. Use K=5, K=25 and K=51 respectively. Record an accuracy score.

ANSWER FOR K=5:

In [9]:
# Create the model object
knn_model_5 = KNeighborsClassifier(n_neighbors=5)

# Fit the model using the 2020-2024 WNBA data that was scaled earlier
knn_model_5.fit(training_data_20_24, TRAIN_WNBA_20_24_Data['OVR82?'])

# Have the model make predictions using the 2025 WNBA data
knn_predictions_5 = knn_model_5.predict(test_data_2025)

# takes a look at the array of predictions
knn_predictions_5

array([1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0])

In [10]:
# Record the accuracy score of the QDA model
knn_accuracy_5 = accuracy_score(CURRENT_WNBA_2025_Data['OVR82?'], knn_predictions_5)

# Print the accuracy score
print(f"KNN Model Accuracy for 5 neighbors: {knn_accuracy_5:.2f}")

KNN Model Accuracy for 5 neighbors: 0.85


ANSWER FOR K=25:

In [11]:
# Create the model object
knn_model_25 = KNeighborsClassifier(n_neighbors=25)

# Fit the model using the 2020-2024 WNBA data that was scaled earlier
knn_model_25.fit(training_data_20_24, TRAIN_WNBA_20_24_Data['OVR82?'])

# Have the model make predictions using the 2025 WNBA data
knn_predictions_25 = knn_model_25.predict(test_data_2025)

# takes a look at the array of predictions
knn_predictions_25

array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [12]:
# Record the accuracy score of the QDA model
knn_accuracy_25 = accuracy_score(CURRENT_WNBA_2025_Data['OVR82?'], knn_predictions_25)

# Print the accuracy score
print(f"KNN Model Accuracy for 25 neighbors: {knn_accuracy_25:.2f}")

KNN Model Accuracy for 25 neighbors: 0.77


ANSWER FOR K=51

In [13]:
# Create the model object
knn_model_51 = KNeighborsClassifier(n_neighbors=51)

# Fit the model using the 2020-2024 WNBA data that was scaled earlier
knn_model_51.fit(training_data_20_24, TRAIN_WNBA_20_24_Data['OVR82?'])

# Have the model make predictions using the 2025 WNBA data
knn_predictions_51 = knn_model_51.predict(test_data_2025)

# takes a look at the array of predictions
knn_predictions_51

array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])

In [14]:
# Record the accuracy score of the QDA model
knn_accuracy_51 = accuracy_score(CURRENT_WNBA_2025_Data['OVR82?'], knn_predictions_51)

# Print the accuracy score
print(f"KNN Model Accuracy for 51 neighbors: {knn_accuracy_51:.2f}")

KNN Model Accuracy for 51 neighbors: 0.54


### 5.) As you did in questions 3 and 4, make a prediction but use a Logistic model instead. Record an accuracy score.

In [15]:
# Create the model object
log_model = LogisticRegression()

# Fit the model using the 2020-2024 WNBA data that was scaled earlier
log_model.fit(training_data_20_24, TRAIN_WNBA_20_24_Data['OVR82?'])

# Have the model make predictions using the 2025 WNBA data
log_predictions = log_model.predict(test_data_2025)

# takes a look at the array of predictions
log_predictions

array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0])

In [16]:
# Record the accuracy score of the QDA model
log_accuracy = accuracy_score(CURRENT_WNBA_2025_Data['OVR82?'], log_predictions)

# Print the accuracy score
print(f"Logistic Model Accuracy: {log_accuracy:.2f}")

Logistic Model Accuracy: 0.85


### 6.) Which model(s) had the highest accuracy score? Is there a tie between the most accurate models? Or is there a model that is clearly the best in terms of an accuracy score?

ANSWER: _The Logistic model, the QDA model, and the KNN model with K=5 are tied for the most accurate models with 85% accuracy!_

### 7.) What differentiates LDA and QDA? List a few differences.

ANSWER:

- LDA assumes the same variance between categorical groups.
- QDA is flexible and allows different variances.
- QDA allows for a non-linear, quadratic decision boundary between categorical groups
- LDA only allows for a linear decision boundary between categorical groups.