---
Title: "Intro to Categorical Model-Making Module by Austin Hayes"

Author:
  - Name: Austin Hayes

  -  Email: ahayes65@charlotte.edu

  -  Affiliation: University of North Carolina at Charlotte

Date: July 29, 2025

Description: Using WNBA Team 'Per Game' stats, we will review the basics of categorical model-making. We will make QDA, LDA and Logistic models in order to predict which team will become the next WNBA Champion.

Categories:
  - Logistic Regression
  - Quadratic Discriminant Analysis
  - Linear Discriminant Analysis
  - Importing and Reading data
  - Data Cleaning
  - Data Science
  - Pandas


### Data

This dataset is from Basketball Reference: https://www.basketball-reference.com

Visit the original data page here: https://www.basketball-reference.com/wnba/years/2024.html

The data set contains 60 rows and 26 columns. Each row represents a WNBA team during the 2024 season.

Download data: 

Available on the [Intro to Categorical Model-Making Module by Austin Hayes](https://github.com/schuckers/Charlotte_SCORE_Summer25/tree/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes): [2024_WNBA_Per_Game.csv](https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2024_WNBA_Per_Game.csv)

---

### Variables and their Descriptions:


<details>
<summary><b>Variable Descriptions</b></summary>

| Variable | Description |
|----------|-------------|
| Rk       | Rank of the team in the league |
| Team     | Name of the team |
| G        | Games played |
| MP       | Minutes played per game |
| FG       | Field Goals made per game |
| FGA      | Field Goal attempts per game |
| FG%      | Field Goal percentage (FG ÷ FGA) |
| 3P       | Three-Point Field Goals made per game |
| 3PA      | Three-Point Field Goal attempts per game |
| 3P%      | Three-Point Field Goal percentage (3P ÷ 3PA) |
| 2P       | Two-Point Field Goals made per game |
| 2PA      | Two-Point Field Goal attempts per game |
| 2P%      | Two-Point Field Goal percentage (2P ÷ 2PA) |
| FT       | Free Throws made per game |
| FTA      | Free Throw attempts per game |
| FT%      | Free Throw percentage (FT ÷ FTA) |
| ORB      | Offensive Rebounds per game |
| DRB      | Defensive Rebounds per game |
| TRB      | Total Rebounds per game |
| AST      | Assists per game |
| STL      | Steals per game |
| BLK      | Blocks per game |
| TOV      | Turnovers per game |
| PF       | Personal Fouls per game |
| PTS      | Points scored per game |
| CHAMPION | Did that team win the championship that year? 1 = Yes, 0 = No |

</details>

---

# Review Questions

Below, you will be asked a series of questions to review the material you have learned throughout this module.

SETUP:

In [1]:
# Import the necessary libraries for the module
# Basic Data Science Library for importing data and data manipulation
import pandas as pd

# Library for mathematical operations
import numpy as np

# Import only the necessary functions from the sklearn library
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

In [2]:
# Read in the WNBA 2019-2024 data from GitHub
# This dataset will be used to train the final model
TRAIN_WNBA_19_24_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/WNBA_PER_GAME_19_24_DATA.csv')

In [3]:
# Read in the current WNBA 2025 stats for our model prediction later on in this module
CURRENT_WNBA_2025_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2025_WNBA_Per_Game.csv')

### 1.) How does KNN work and what needs to be done to the data before use of a KNN model?

ANSWER: _KNN classifies points based on their "nearest neighbors": the model compares each new point to the most similar points it was trained on and predicts the majority class among them. K is the number of neighbors used to classify a new point. For a two-class problem, K should be an odd number to prevent a tie in the vote. Before using KNN, it is important to scale your data so that predictors with larger numerical ranges do not dominate the distance calculation and bias the predictions._
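
The idea in the answer above can be sketched from scratch. This is a minimal illustration, not how `KNeighborsClassifier` is implemented internally; the helper names `knn_predict` and `standardize` and the toy data are assumptions made up for this example. Column 0 has a large numerical range and column 1 a small one, so the two predictions differ depending on whether the data is scaled.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def knn_predict(train_points, train_labels, new_point, k):
    """Return the majority label among the k training points closest to new_point."""
    by_distance = sorted(
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def standardize(points, reference):
    """Scale each column of points using the mean and std of the reference columns."""
    columns = list(zip(*reference))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [
        tuple((value - mu) / sigma for value, mu, sigma in zip(point, mus, sigmas))
        for point in points
    ]


# Hypothetical toy data: column 0 has a large range, column 1 a small one.
train = [(100, 0.10), (110, 0.20), (120, 0.15), (105, 0.90), (115, 0.80), (125, 0.85)]
labels = [0, 0, 0, 1, 1, 1]
new = (100, 0.85)

# Without scaling, column 0 dominates the distance calculation.
raw_prediction = knn_predict(train, labels, new, k=3)

# With scaling, both columns contribute on a comparable footing.
scaled_prediction = knn_predict(
    standardize(train, train), labels, standardize([new], train)[0], k=3
)
print(raw_prediction, scaled_prediction)
```

In this toy setup the unscaled vote goes to class 0 and the scaled vote goes to class 1, because the new point only resembles the class 1 rows once the small-range column is allowed to matter.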

### 2.) Using the 2019-2024 dataset, create the same list of predictor variables that we used for the 2019-2023 dataset. Scale the list of predictor variables. Make sure you also scale them for the 2025 data since you will be asked to make a prediction on them later.

In [4]:
# ANSWER:

# Create the scaler object
scaler = StandardScaler()

# Fit the scaler to the 2019-2024 WNBA data and transform (scale) the data
training_data_19_24 = scaler.fit_transform(TRAIN_WNBA_19_24_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])

# Transform (scale) the 2025 WNBA data using the same scaler
test_data_2025 = scaler.transform(CURRENT_WNBA_2025_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])
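
To see why the 2025 data is passed to `transform` rather than `fit_transform`, here is a small sketch of the arithmetic for a single column. It assumes standardization of the form z = (x - mean) / std with the population standard deviation; the numbers and the `transform` helper are hypothetical and are not taken from the WNBA files.

```python
from statistics import mean, pstdev

# Hypothetical points-per-game values for the training seasons and the new season
train_pts = [78.0, 82.0, 80.0, 84.0, 76.0]
new_pts = [90.0, 80.0]

# "Fit": learn the mean and standard deviation from the training data only
mu = mean(train_pts)
sigma = pstdev(train_pts)


def transform(values):
    """Scale values with the statistics learned from the training data."""
    return [(v - mu) / sigma for v in values]


train_scaled = transform(train_pts)  # mean 0 and standard deviation 1 by construction
new_scaled = transform(new_pts)      # expressed on the training scale
print(train_scaled)
print(new_scaled)
```

Reusing the training mean and standard deviation keeps the new season on the same scale the models were trained on, so a value equal to the training mean maps to 0 in both datasets.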

### 3.) Using the scaled data you created in question 2, predict which WNBA teams will make the 2025 playoffs with a QDA model.

In [5]:
# Create the model object
qda_model = QuadraticDiscriminantAnalysis()

# Fit the model using the 2019-2024 WNBA data that was scaled earlier
qda_model.fit(training_data_19_24, TRAIN_WNBA_19_24_Data['PLAYOFFS'])

# Have the model make predictions using the 2025 WNBA data
qda_predictions = qda_model.predict(test_data_2025)

# Take a look at the array of predictions
qda_predictions

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1], dtype=int64)

### 4.) Do the same thing you did in question 3 but with 3 KNN models instead. Use K=5, K=25 and K=51 respectively.

ANSWER FOR K=5:

In [6]:
# Create the model object
knn_model_5 = KNeighborsClassifier(n_neighbors=5)

# Fit the model using the 2019-2024 WNBA data that was scaled earlier
knn_model_5.fit(training_data_19_24, TRAIN_WNBA_19_24_Data['PLAYOFFS'])

# Have the model make predictions using the 2025 WNBA data
knn_predictions_5 = knn_model_5.predict(test_data_2025)

# Take a look at the array of predictions
knn_predictions_5

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0], dtype=int64)

ANSWER FOR K=25:

In [7]:
# Create the model object
knn_model_25 = KNeighborsClassifier(n_neighbors=25)

# Fit the model using the 2019-2024 WNBA data that was scaled earlier
knn_model_25.fit(training_data_19_24, TRAIN_WNBA_19_24_Data['PLAYOFFS'])

# Have the model make predictions using the 2025 WNBA data
knn_predictions_25 = knn_model_25.predict(test_data_2025)

# Take a look at the array of predictions
knn_predictions_25

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], dtype=int64)

ANSWER FOR K=51:

In [8]:
# Create the model object
knn_model_51 = KNeighborsClassifier(n_neighbors=51)

# Fit the model using the 2019-2024 WNBA data that was scaled earlier
knn_model_51.fit(training_data_19_24, TRAIN_WNBA_19_24_Data['PLAYOFFS'])

# Have the model make predictions using the 2025 WNBA data
knn_predictions_51 = knn_model_51.predict(test_data_2025)

# Take a look at the array of predictions
knn_predictions_51

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

### 5.) As you did in questions 3 and 4, make a prediction but use a Logistic model instead. Record an accuracy score.

In [9]:
# Create the model object
log_model = LogisticRegression()

# Fit the model using the 2019-2024 WNBA data that was scaled earlier
log_model.fit(training_data_19_24, TRAIN_WNBA_19_24_Data['PLAYOFFS'])

# Have the model make predictions using the 2025 WNBA data
log_predictions = log_model.predict(test_data_2025)

# Take a look at the array of predictions
log_predictions

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], dtype=int64)
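
The question also asks for an accuracy score. Because the 2025 playoff outcomes are not known yet, accuracy can only be measured on data with known labels. One possible approach, offered here as an assumption rather than the module's official answer, is to score the model on the training seasons with something like `accuracy_score(TRAIN_WNBA_19_24_Data['PLAYOFFS'], log_model.predict(training_data_19_24))`. The sketch below shows what that score measures, using a hand-written `accuracy` helper and hypothetical labels.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 1 = made the playoffs, 0 = did not
actual = [1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]

score = accuracy(actual, predicted)
print(score)  # 6 of the 8 predictions match, so 0.75
```

Keep in mind that accuracy measured on the same data a model was trained on tends to be optimistic. Holding out part of the 2019-2024 data with `train_test_split` would give a more honest estimate.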

### 6.) Which model(s) do you like the best and why?

ANSWER: _There is no wrong answer for this problem. However, the most realistic results seem to come from the Logistic model and the KNN models with K = 5 and K = 25. These models pick fewer teams, and the teams they pick tend to be ranked higher._
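
Comparing models is easier when each prediction is attached to a team name. The sketch below uses placeholder team names and made-up predictions, not the module's real output. With the real data, a comparable step might be `CURRENT_WNBA_2025_Data.loc[log_predictions == 1, 'Team']`, assuming the 2025 file has a `Team` column like the one in the variable table.

```python
# Placeholder team names and hypothetical 0/1 predictions from two models
teams = ["Team A", "Team B", "Team C", "Team D", "Team E"]
predictions = {
    "logistic": [1, 1, 1, 0, 0],
    "knn_5": [1, 1, 0, 1, 0],
}

# Teams each model predicts to make the playoffs
picked = {
    model: [team for team, pred in zip(teams, preds) if pred == 1]
    for model, preds in predictions.items()
}

# Teams that every model agrees on
consensus = [team for team in teams if all(team in names for names in picked.values())]

for model, names in picked.items():
    print(model, names)
print("consensus", consensus)
```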

### 7.) What differentiates LDA and QDA? List a few differences.

ANSWER:

- LDA assumes that all categorical groups share the same covariance matrix (the same variances and covariances).
- QDA is more flexible and allows each group to have its own covariance matrix.
- QDA allows for a non-linear, quadratic decision boundary between categorical groups.
- LDA only allows for a linear decision boundary between categorical groups.
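
The differences listed above can be illustrated with a one-predictor sketch. It assumes Gaussian classes and uses population variances for simplicity, so it is not a reproduction of the scikit-learn estimators. The data and the function names `discriminant`, `lda_predict` and `qda_predict` are hypothetical. Class 0 is tightly clustered, and class 1 is widely spread.

```python
import math
from statistics import mean, pvariance

# Hypothetical one-predictor data: class 0 is tight, class 1 is spread out
class_0 = [-1.0, 0.0, 1.0]
class_1 = [0.0, 4.0, 8.0]


def discriminant(x, mu, var, prior):
    """Gaussian log-density plus log-prior, dropping constants shared by all classes."""
    return -0.5 * math.log(var) - (x - mu) ** 2 / (2 * var) + math.log(prior)


n0, n1 = len(class_0), len(class_1)
mu0, mu1 = mean(class_0), mean(class_1)
var0, var1 = pvariance(class_0), pvariance(class_1)
prior0, prior1 = n0 / (n0 + n1), n1 / (n0 + n1)

# LDA: one pooled variance shared by both classes
pooled_var = (n0 * var0 + n1 * var1) / (n0 + n1)


def lda_predict(x):
    """Pick the class with the larger discriminant under a shared variance."""
    return int(discriminant(x, mu1, pooled_var, prior1) > discriminant(x, mu0, pooled_var, prior0))


def qda_predict(x):
    """Pick the class with the larger discriminant using each class's own variance."""
    return int(discriminant(x, mu1, var1, prior1) > discriminant(x, mu0, var0, prior0))


for x in (-6.0, 0.5, 6.0):
    print(x, lda_predict(x), qda_predict(x))
```

With a shared variance, LDA splits the number line at a single cutoff, so `x = -6.0` goes to class 0 because it lies on that side. QDA assigns `x = -6.0` to class 1 instead: a point that far from class 0's tight cluster is more plausible under the widely spread class. The class 0 region becomes an interval, which is what a quadratic boundary looks like in one dimension.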