Comments M. Schuckers 11 Aug 2025
- Let's break this first markdown into a couple of cells _- Done._
- Make sure you mention kNN methods earlier _- Done._
- I think there are only 12 rows in these data and that is likely to lead to some overfitting. Your description says 60 rows which would be 5 years of data, I think. _- Done; I compiled 5 years of data for training the models._
- For transforming and rescaling data, you should use the scaler for the training data and apply that same scaler to the test data. This avoids some possible information leakage from the training to the test data. _- Done._ 
- When I run the logistic I don't get any 1's. Just 0's and the probability for the first team is 0.38 of being a one. _- Done. For some reason, when creating the module it gave me a 1 but I just re-ran it and got what you did._

- Also, having a categorical response that is 1 in only 8% of rows can lead to some issues. Can we make the target 'made the playoffs' instead? The model can be 92% accurate by just saying all of the rows will be zero. _- Agreed; I have made a separate notebook file to do this suggestion. The file is located in this module's file on the Github. The title starts with "PLAYOFF VERSION". I'm doing this just so we can keep this version as a backup for now._

---
Title: "Intro to Categorical Model-Making Module by Austin Hayes"

Author:
  - Name: Austin Hayes

  -  Email: ahayes65@charlotte.edu

  -  Affiliation: University of North Carolina at Charlotte

Date: July 29, 2025

Description: Using WNBA Team 'Per Game' stats, we will review the basics of categorical model-making. We will make QDA, LDA, kNN, and Logistic models in order to predict which teams will make the WNBA playoffs this season.

Categories:
  - Logistic Regression
  - Quadratic Discriminant Analysis
  - Linear Discriminant Analysis
  - K Nearest Neighbors
  - Importing and Reading data
  - Data Cleaning
  - Data Science
  - Pandas

### Data

This dataset is from Basketball Reference: https://www.basketball-reference.com

Visit the original data page here: https://www.basketball-reference.com/wnba/years/2024.html

The data set contains 60 rows and 26 columns. Each row represents a WNBA team during the 2019-2023 seasons. We also used other datasets including the following seasons: 2020-2024, 2024 ONLY and 2025 ONLY. No matter the dataset, all rows represent a team and the variables stay the same.

Download data: 

Available on the [Intro to Categorical Model-Making Module by Austin Hayes](https://github.com/schuckers/Charlotte_SCORE_Summer25/tree/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes): [2024_WNBA_Per_Game.csv](https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2024_WNBA_Per_Game.csv)

---

### Variables and their Descriptions:


<details>
<summary><b>Variable Descriptions</b></summary>

| Variable | Description |
|----------|-------------|
| Rk       | Rank of the team in the league |
| Team     | Name of the team |
| G        | Games played |
| MP       | Minutes played per game |
| FG       | Field Goals made per game |
| FGA      | Field Goal attempts per game |
| FG%      | Field Goal percentage (FG ÷ FGA) |
| 3P       | Three-Point Field Goals made per game |
| 3PA      | Three-Point Field Goal attempts per game |
| 3P%      | Three-Point Field Goal percentage (3P ÷ 3PA) |
| 2P       | Two-Point Field Goals made per game |
| 2PA      | Two-Point Field Goal attempts per game |
| 2P%      | Two-Point Field Goal percentage (2P ÷ 2PA) |
| FT       | Free Throws made per game |
| FTA      | Free Throw attempts per game |
| FT%      | Free Throw percentage (FT ÷ FTA) |
| ORB      | Offensive Rebounds per game |
| DRB      | Defensive Rebounds per game |
| TRB      | Total Rebounds per game |
| AST      | Assists per game |
| STL      | Steals per game |
| BLK      | Blocks per game |
| TOV      | Turnovers per game |
| PF       | Personal Fouls per game |
| PTS      | Points scored per game |
| CHAMPION | Did that team win the championship that year? 1 = Yes, 0 = No |
| PLAYOFFS | Did that team make the playoffs that year? 1 = Yes, 0 = No |

</details>

---

# Learning Goals

- Learn about the basics of categorical modeling
- Compare and contrast models using statistical metrics
- Learn about Quadratic Discriminant Analysis
- Learn about Linear Discriminant Analysis
- Learn about Logistic Regression
- Learn about K Nearest Neighbors and how it works.

---

# Getting Started


In this lesson, we will learn about categorical modeling in order to predict the outcome of a certain event based on past and current metrics. 

Today, we will be predicting which teams are going to make the WNBA playoffs this year (2025)! Before we get started, we need to import certain libraries in order to achieve this goal.

If you don't have any of these libraries installed, you can learn how to install them with the following links:

Pandas: https://pandas.pydata.org/docs/getting_started/install.html 

Numpy: https://numpy.org/install/

Sci-Kit Learn: https://scikit-learn.org/stable/install.html

After installing those libraries, let's import them.

In [1]:
# Import the necessary libraries for the module
# Basic Data Science Library for importing data and data manipulation
import pandas as pd

# Library for mathematical operations
import numpy as np

# Import only the necessary functions from the sklearn library
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

---

# Classification

Classification is a way of modeling where we try to predict a non-numeric outcome using previously available data. Today, we will be predicting a binary outcome, like 'yes' or 'no', or 'Made Playoffs' or 'Did not make Playoffs'. Various models can be used to achieve this, but today we will go over Logistic, KNN, LDA and QDA models. Each of these models uses a different methodology to achieve the same goal. Therefore, each model's predictions may differ from the others, which warrants comparison. We will compare them using metrics that we will go over later in this lesson.

---

# Modeling Process For This Module

In order to build a model, we need to train the model and then test it to see how accurate it is. Normally, we would build the model by splitting a season's worth of data into a training dataset and a test dataset, with more rows in the training set than in the test set. This ensures that the model has enough examples to learn from before it makes predictions on the test data. On the flip side, there needs to be a balance between your training and test sets, because the more test data you have, the more reliable your estimate of the model's accuracy. Normally, we use 60-70% of the data for training the model and the remaining 30-40% for testing it.
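As a minimal sketch of the split described above, the snippet below shuffles a list of rows and cuts it at 70%. The helper name `split_rows`, the seed, and the 60 placeholder row ids are assumptions made for illustration; scikit-learn's `train_test_split`, imported earlier, performs this kind of shuffled split for real datasets.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows, then cut it into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cutoff = round(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


# 60 hypothetical row ids standing in for team-seasons
rows = list(range(60))
train_rows, test_rows = split_rows(rows, train_fraction=0.7)

print(f"training rows: {len(train_rows)}, test rows: {len(test_rows)}")
```

Every row ends up in exactly one of the two lists, which is what keeps the test set unseen during training.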

However, we will do something a little different in this module. Each season we train on has only 12 WNBA teams. That is not enough data! Instead, we have compiled a dataset of 'Per Game' data for every team over the 5 seasons before the 2024 season (2019-2023). 'Per Game' means that each stat is averaged over all games played. For example, the 'PTS' column is each team's average points scored per game. We will train our models on that compiled dataset and then test the accuracy of our models with the 2024 season data. After reviewing each model's accuracy score, we will pick the model with the highest accuracy and retrain it with another compiled dataset. This other compiled dataset covers 2020-2024 instead of 2019-2023 in order to provide the newest data to the model. After retraining the model, we will have it predict which teams will make the WNBA playoffs in the current 2025 season.

NOTE: The compiled dataset was built from several 'Per Game' datasets from Basketball-Reference.com. Please visit their website using the link at the top of this module. Each team's name has been edited to identify which year its stats are from, since team names recur across seasons. A 'CHAMPION' column was also added in order to identify which teams won a championship. 1 = Champions, 0 = Not Champions. We will not be using this column in this lesson. Instead, we will use the 'PLAYOFFS' column since we are predicting who will make the playoffs. Like the 'CHAMPION' column, the 'PLAYOFFS' column was added after the fact and was not in the original Basketball-Reference.com dataset.

To compare the models with one another, we train each of them on the 2019-2023 season data and test it with the 2024 season data. The model with the highest accuracy, or another of our choosing, will be used to predict who will make the WNBA playoffs this season using the 'Per Game' stats thus far.

To summarize, a model is something we can train with data so it can make a prediction on new data.

# Importing data and making train/test sets

Let's get started by reading in the data.

In [2]:
# Read in the WNBA 2019-2023 data from Github
# This dataset will be used to train the initial models
TRAIN_WNBA_19_23_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/WNBA_PER_GAME_19_23_DATA.csv')

In [3]:
# Read in the WNBA 2024 data from Github
# This will be used to test the initial models
TEST_WNBA_2024_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2024_WNBA_Per_Game.csv')

In [4]:
# Read in the WNBA 2020-2024 data from Github 
# This dataset will be used to train the final model
TRAIN_WNBA_20_24_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/WNBA_PER_GAME_20_24_DATA.csv')

In [5]:
# Read in the current WNBA 2025 stats for our model prediction later on in this module
CURRENT_WNBA_2025_Data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20For%20Intro%20to%20Categorical%20Model-Making%20Module%20by%20Austin%20Hayes/2025_WNBA_Per_Game.csv')

We need to scale the data in order to even the playing field among variables for model-making. This prevents a model from becoming too biased towards features or variables with larger numerical ranges. More will be explained about this during the KNN section of this module.

In [6]:
# Make the scaler object
scaler = StandardScaler()

# Fit the scaler object to the training data for initial model-making
training_data_19_23 = scaler.fit_transform(TRAIN_WNBA_19_23_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])


# Scale the test data for initial model-making
test_data_2024 = scaler.fit_transform(TEST_WNBA_2024_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])


# Do the same for the other training data for final model-making
# final training data
training_data_20_24 = scaler.fit_transform(TRAIN_WNBA_20_24_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])

# Scale the current 2025 WNBA data for final model-making
test_data_2025 = scaler.fit_transform(CURRENT_WNBA_2025_Data[['FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']])

Let's get to work now! Remember, you only need to scale your predictors, not the target variable that you are making a prediction on.
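The cell above calls `fit_transform` on every dataset, which re-estimates the mean and standard deviation each time. A common alternative, sketched here with the standard library only, is to learn the mean and standard deviation from the training data and reuse them on the test data, so no test-set statistics feed into the scaling. The helper names and the points-per-game numbers are hypothetical. With scikit-learn's `StandardScaler`, the same idea corresponds to calling `fit_transform` on the training data and `transform` on the test data.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the centering and scaling constants from the training values only."""
    return mean(values), pstdev(values)  # population std, as in standard scaling


def apply_scaler(values, center, spread):
    """Standardize values with constants learned elsewhere."""
    return [(v - center) / spread for v in values]


# Hypothetical points-per-game values, not taken from the WNBA files
train_pts = [78.0, 80.0, 82.0, 84.0, 86.0]
test_pts = [79.0, 85.0]

center, spread = fit_scaler(train_pts)                    # statistics come from training data
train_scaled = apply_scaler(train_pts, center, spread)
test_scaled = apply_scaler(test_pts, center, spread)      # reuse them, do not re-fit

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

The scaled test values do not have mean 0 or standard deviation 1 on their own, and that is expected: they are expressed on the training data's scale.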

---

# Accuracy Score 

Throughout this module we will be using an accuracy score to assess model prediction accuracy. An accuracy score is a metric that divides the number of correct predictions by the total number of predictions.

$$\text{Accuracy} = \frac{\text{Total number of correct predictions by the model}}{\text{Total number of predictions made by the model}}$$
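To make the ratio concrete, here is a small hand-rolled version. The function name `accuracy` and the two label lists are hypothetical; the `accuracy_score` function imported from scikit-learn computes the same fraction.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Hypothetical labels: 1 = made the playoffs, 0 = did not
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

score = accuracy(y_true, y_pred)  # 4 of the 6 predictions match
print(f"Accuracy Score: {score}")

# A model that always predicts 1 is a useful baseline to compare against
baseline = accuracy(y_true, [1] * len(y_true))
print(f"Always-1 baseline: {baseline}")
```

Comparing a model against the always-1 baseline shows whether it has learned anything beyond the most common answer.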

---

# Logistic Regression

Logistic Regression is a classification model that uses probabilities to predict a binary outcome. It assumes a linear relationship between the predictors and the log-odds of the outcome. Let's go ahead and establish a logistic model and train it with the compiled 2019-2023 data.
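As a rough sketch of the idea (not scikit-learn's internals), a logistic model adds up a weighted sum of the predictors and passes it through the sigmoid function to get a probability between 0 and 1. The weights, intercept, and feature values below are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, intercept):
    """Probability of class 1 from a linear combination of the features."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical fitted values for two scaled predictors
weights = [0.8, -0.5]
intercept = 0.2
team = [1.2, 0.4]

p = predict_probability(team, weights, intercept)
label = int(p > 0.5)  # predict 1 when the probability is above 50%
print(f"probability = {p:.3f}, predicted label = {label}")
```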

We will train it ONLY using the necessary stats. The target also needs to be coded as integer values (1 or 0), so let's keep it simple by predicting the 'PLAYOFFS' column based on the following stats: FG, FGA, 3P, 3PA, 2P, 2PA, FT, FTA, ORB, DRB, TRB, AST, STL, BLK, TOV, PF, PTS. The definitions of these variables can be found at the top of this module.

In [7]:
# Create the logistic regression model
logistic_model = LogisticRegression()

# Train the model using the 2019-2023 WNBA data
logistic_model.fit(training_data_19_23, TRAIN_WNBA_19_23_Data['PLAYOFFS'])

In [None]:
# Have the model make predictions using the 2024 WNBA data
log_predictions = logistic_model.predict(test_data_2024)

# takes a look at the array of predictions
log_predictions

array([1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0], dtype=int64)

Hmm... very interesting! The model predicted 7 teams to make the 2024 WNBA Playoffs! Unfortunately, 8 teams make the playoffs every year, so the model is one team short. The New York Liberty, the second team on the list, won the championship that season. Let's take a look at each team's probability of making the playoffs, according to the model.

In [None]:
# Have the model show the probabilities of each team making the playoffs
logistic_model.predict_proba(test_data_2024)

array([[1.17314414e-03, 9.98826856e-01],
       [8.51131758e-05, 9.99914887e-01],
       [9.33085550e-02, 9.06691445e-01],
       [6.68190992e-01, 3.31809008e-01],
       [8.03934319e-03, 9.91960657e-01],
       [1.01969499e-03, 9.98980305e-01],
       [2.34193945e-01, 7.65806055e-01],
       [2.58720304e-01, 7.41279696e-01],
       [5.83648454e-01, 4.16351546e-01],
       [9.58740216e-01, 4.12597835e-02],
       [9.72189029e-01, 2.78109711e-02],
       [9.11061566e-01, 8.89384339e-02]])

The probabilities on the left are for 0 (missing the playoffs) and those on the right are for 1 (making the playoffs). Remember, these probabilities are printed in scientific notation: the number after the 'e' tells you how many places to move the decimal point, so 9.98826856e-01 is about 0.9988 and 4.12597835e-02 is about 0.0413. We can see that the model gave the second team in the list (New York Liberty) the highest probability of making the playoffs for the 2024 season. For the model to predict a team as a playoff team, the probability needs to be greater than 50%.
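To make the 50% rule concrete, the sketch below takes the right-hand column of the `predict_proba` output shown above, prints each value as a percentage, and applies the threshold. The variable names are illustrative.

```python
# Right-hand column (probability of class 1) copied from the output above
playoff_probs = [
    9.98826856e-01, 9.99914887e-01, 9.06691445e-01, 3.31809008e-01,
    9.91960657e-01, 9.98980305e-01, 7.65806055e-01, 7.41279696e-01,
    4.16351546e-01, 4.12597835e-02, 2.78109711e-02, 8.89384339e-02,
]

# A team is labeled 1 when its playoff probability is above 50%
labels = [int(p > 0.5) for p in playoff_probs]

for row, (p, label) in enumerate(zip(playoff_probs, labels), start=1):
    print(f"team {row:>2}: {p:6.1%} -> {label}")

print("teams predicted to make the playoffs:", sum(labels))
```

The resulting labels match the array returned by `predict` earlier, with 7 teams labeled 1.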

Logistic regression assumes a linear relationship between the predictors and the log-odds of the target variable, so a high accuracy score is consistent with (though it does not prove) that kind of relationship in the data. For now, let's go ahead and calculate an accuracy score and move on to Discriminant Analysis Models.

In [11]:
# Make accuracy score
accuracy_score_log = accuracy_score(TEST_WNBA_2024_Data['PLAYOFFS'], log_predictions)

In [12]:
# Print the accuracy score
print(f"Accuracy Score: {accuracy_score_log}")

Accuracy Score: 0.9166666666666666


---

# Discriminant Analysis


There are 2 types of Discriminant Analysis Models that we will be going over in this lesson: Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). These models essentially draw a boundary between the categories and classify each data point by the side of the boundary it falls on. Discriminant Analysis finds that boundary from the means and variances of the predictors within each category. We will begin with LDA.

--- 

# Linear Discriminant Analysis


LDA is a linear form of discriminant analysis and it assumes that the predictors in each group follow a normal distribution. It also assumes that the variance for each classification group is the same. Here, this means that playoff teams and non-playoff teams are assumed to have the same variance(s). Let's get started; we will follow the EXACT same process as we did with logistic regression.

In [14]:
# Create the Linear Discriminant Analysis model
lda_model = LinearDiscriminantAnalysis()

# Train the model using the 2019-2023 WNBA data
lda_model.fit(training_data_19_23, TRAIN_WNBA_19_23_Data['PLAYOFFS'])

In [15]:
# Now let's test the model using the 2024 WNBA data and have it make predictions
lda_model.predict(test_data_2024)

array([0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1], dtype=int64)

Interesting; this LDA model produced worse results than the logistic model, predicting only 6 teams to make the playoffs. Even after scaling the predictor variables, this model has predicted that the top 3 teams, in terms of per game stats, will NOT make the playoffs. Although this is an incorrect result, let's analyze the probabilities to see if any of the teams predicted as non-playoff teams were close to a 50% chance.

In [None]:
# Predict the probabilities of each team making the playoffs
lda_model.predict_proba(test_data_2024)

array([[1.00000000e+00, 3.71802539e-10],
       [9.99999983e-01, 1.70431488e-08],
       [9.99999977e-01, 2.25361888e-08],
       [5.37153704e-06, 9.99994628e-01],
       [7.47973294e-09, 9.99999993e-01],
       [9.99959434e-01, 4.05658430e-05],
       [9.99985284e-01, 1.47158581e-05],
       [5.66938464e-08, 9.99999943e-01],
       [9.99999486e-01, 5.14290773e-07],
       [4.40676993e-03, 9.95593230e-01],
       [6.66133815e-16, 1.00000000e+00],
       [3.63880037e-11, 1.00000000e+00]])

WOW! The LDA model predicted the 2 teams with the WORST per game stats to make the playoffs. It is certainly possible but seems a bit fishy since none of the top 3 teams were considered as playoff teams by the model. Let's record the accuracy score and note that this may not be a good model to use. 

In [17]:
# Assign LDA Model Prediction from 2024 WNBA Data to a variable
lda_predictions = lda_model.predict(test_data_2024)

# takes a look at the array of predictions
lda_predictions

array([0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1], dtype=int64)

In [18]:
# Get Accuracy Score
accuracy_score_lda = accuracy_score(TEST_WNBA_2024_Data['PLAYOFFS'], lda_predictions)

# Print the accuracy score
print(f"Accuracy Score: {accuracy_score_lda}")

Accuracy Score: 0.3333333333333333


Look at that accuracy score! That is extremely bad and this model should NOT be used for this dataset.

Let's move onto QDA!

---

# Quadratic Discriminant Analysis

QDA is very similar to LDA. Like LDA, QDA assumes a normal distribution within each classification group. Unlike LDA, QDA allows each classification group to have its own variance, so it does NOT assume a linear classification boundary and is seen as the more flexible option. Now let's get started.
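Before fitting the model, the difference between the two assumptions can be sketched with a single predictor. This is a simplified one-dimensional illustration, not scikit-learn's implementation: each class is scored with a normal log-density plus the log of its prior, using one pooled variance for the LDA-style rule and a separate variance per class for the QDA-style rule. The data values and helper names are hypothetical.

```python
import math
from statistics import mean, pvariance


def gaussian_log_density(x, mu, var):
    """Log of the normal density with mean mu and variance var, evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def classify(x, params):
    """params maps each class label to (mean, variance, prior); return the best-scoring label."""
    scores = {
        label: math.log(prior) + gaussian_log_density(x, mu, var)
        for label, (mu, var, prior) in params.items()
    }
    return max(scores, key=scores.get)


# Hypothetical scaled values of one predictor for each group
non_playoff = [-1.2, -0.8, -1.0, -0.9, -1.1]   # tightly clustered
playoff = [0.2, 1.8, 1.0, -0.4, 2.4]           # widely spread

n0, n1 = len(non_playoff), len(playoff)
prior0, prior1 = n0 / (n0 + n1), n1 / (n0 + n1)
var0, var1 = pvariance(non_playoff), pvariance(playoff)
pooled_var = (n0 * var0 + n1 * var1) / (n0 + n1)

# LDA-style: both classes share the pooled variance
lda_params = {0: (mean(non_playoff), pooled_var, prior0),
              1: (mean(playoff), pooled_var, prior1)}
# QDA-style: each class keeps its own variance
qda_params = {0: (mean(non_playoff), var0, prior0),
              1: (mean(playoff), var1, prior1)}

for x in (-3.0, -1.0, -0.3, 0.3):
    print(f"x = {x:>4}: LDA -> {classify(x, lda_params)}, QDA -> {classify(x, qda_params)}")
```

With a shared variance there is a single cut point between the two class means. With separate variances the tightly clustered class only wins in a narrow band around its mean, so the two rules can disagree on the same point.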

In [19]:
# Make the Quadratic Discriminant Analysis model
qda_model = QuadraticDiscriminantAnalysis()

# Train the model using the 2019-2023 WNBA data
qda_model.fit(training_data_19_23, TRAIN_WNBA_19_23_Data['PLAYOFFS'])

In [20]:
# Have the model make predictions using the 2024 WNBA data
qda_predictions = qda_model.predict(test_data_2024)

# takes a look at the array of predictions
qda_predictions

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

Very interesting! This QDA model predicted every team to make the playoffs in 2024! Let's take a look at the probability breakdown below.

In [21]:
# Predict the probabilities of each team making the playoffs
qda_model.predict_proba(test_data_2024)

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.]])

In [22]:
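accuracy_score_qda = accuracy_score(TEST_WNBA_2024_Data['PLAYOFFS'], qda_predictions)

# Print the accuracy score
print(f"Accuracy Score: {accuracy_score_qda}")

Accuracy Score: 0.6666666666666666


The model labeled every team a playoff team, so its accuracy is simply the share of teams that actually made the playoffs (8 of 12). Probabilities of exactly 0 and 1 for every team are also a warning sign. There might be a massive collinearity issue with using this model, since some predictors are built from others (for example, TRB is ORB plus DRB). It may be best to skip the use of this model!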
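One way to probe the collinearity concern is to check whether one predictor is just a combination of others. The sketch below uses the ORB, DRB, and TRB values from the ten rows of the 2025 per-game table shown later in this module and checks that TRB equals ORB + DRB up to rounding. The tolerance is an assumption that allows for one-decimal rounding in the published stats.

```python
# (ORB, DRB, TRB) per team, copied from the 2025 per-game table in this module
rebounds = [
    (7.0, 27.6, 34.6),   # New York Liberty
    (8.3, 25.7, 34.1),   # Minnesota Lynx
    (8.1, 25.0, 33.1),   # Los Angeles Sparks
    (8.3, 25.4, 33.7),   # Indiana Fever
    (8.8, 27.6, 36.4),   # Atlanta Dream
    (8.6, 25.6, 34.2),   # Phoenix Mercury
    (11.1, 25.7, 36.8),  # Dallas Wings
    (8.1, 25.4, 33.5),   # Las Vegas Aces
    (6.9, 24.7, 31.6),   # Seattle Storm
    (8.7, 26.5, 35.2),   # Washington Mystics
]

# Per-game values are rounded to one decimal, so allow a small gap (assumed tolerance)
TOLERANCE = 0.15

gaps = [abs(trb - (orb + drb)) for orb, drb, trb in rebounds]
largest_gap = max(gaps)

print(f"largest |TRB - (ORB + DRB)| = {largest_gap:.2f}")
print("TRB is ORB + DRB within rounding:", largest_gap <= TOLERANCE)
```

When a predictor is an exact sum of others, it carries no new information, and the covariance estimates that discriminant analysis relies on become unstable. Dropping such columns is a common remedy.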

Let's move onto KNN!

---

# K Nearest Neighbors (KNN)



K Nearest Neighbors is a type of model that classifies a data point based on its distance from the previously classified data points that the model was trained on. It is essentially deciding where to classify a data point based on the closest points or "neighbors". You control how many neighbors vote on each data point by defining K. If you make K too high, the model may become too generalized and have low accuracy. Likewise, if you make K too small, the model becomes too sensitive to individual points and can become inaccurate as well.

Because KNN is distance based, you will need to scale the predictor variables in order to get the best results. Scaling essentially levels the playing field among different measures in order to prevent the model from becoming too biased towards predictor variables with larger ranges. For example, 'FGA', or 'Field Goals Attempted Per Game', will always have a larger range of numbers than 'Offensive Rebounds Per Game'.

Let's test it out.

IMPORTANT RULE: For good practice with a two-category target, you should ONLY use odd numbers for K. This is done to prevent a tie during the prediction process. For example, if you used K=4 and there happens to be a 2-2 vote among the neighbors, the model has to break the tie arbitrarily, which can result in an inaccurate outcome.
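The voting idea can be sketched in a few lines of plain Python. This is a toy version, not scikit-learn's `KNeighborsClassifier`: the points, labels, and the function name `knn_predict` are hypothetical, and ties are broken by whichever label was counted first, which is an arbitrary choice made for this sketch.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Return (predicted label, vote counts) from the k closest training points."""
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    winner = votes.most_common(1)[0][0]  # on a tie, the first label counted wins
    return winner, dict(votes)


# Hypothetical scaled data with two predictors
train_points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
train_labels = [0, 0, 1, 1, 1]
query = (0.4, 0.4)

for k in (3, 4, 5):
    label, votes = knn_predict(train_points, train_labels, query, k)
    print(f"K={k}: votes={votes} -> predicted {label}")
```

With K=4 the vote is split 2-2, so the prediction depends entirely on the tie-break rule. With an odd K and two classes, one side always has more votes.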

The data was already scaled with StandardScaler earlier, so let's go ahead and make the KNN model. Let's start off with K = 5 neighbors.

In [23]:
# Make the model object for KNN with 5 neighbors
KNN_model_5 = KNeighborsClassifier(n_neighbors=5)

# Fit the model using the 2019-2023 WNBA data that was scaled earlier
KNN_model_5.fit(training_data_19_23, TRAIN_WNBA_19_23_Data['PLAYOFFS'])

In [24]:
# Have the model make predictions using the 2024 WNBA data
knn_predictions_5 = KNN_model_5.predict(test_data_2024)

# takes a look at the array of predictions
knn_predictions_5

array([1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1], dtype=int64)

In [25]:
# calculate accuracy score for KNN model with 5 neighbors
accuracy_score_knn_5 = accuracy_score(TEST_WNBA_2024_Data['PLAYOFFS'], knn_predictions_5)

# Print the accuracy score
print(f"Accuracy Score: {accuracy_score_knn_5}")

Accuracy Score: 0.9166666666666666


Interesting! It predicted 9 teams, one too many, to make the playoffs. However, this model has the same accuracy score as the logistic model. Now, let's do the same thing with a higher number of neighbors! Let's try a higher K of 51!

In [26]:
# Make the model object for KNN model with 51 neighbors
KNN_model_51 = KNeighborsClassifier(n_neighbors=51)

# Fit the model using the 2019-2023 WNBA data that was scaled earlier
KNN_model_51.fit(training_data_19_23, TRAIN_WNBA_19_23_Data['PLAYOFFS'])

In [27]:
# Have the model make predictions using the 2024 WNBA data
knn_predictions_51 = KNN_model_51.predict(test_data_2024)

# takes a look at the array of predictions
knn_predictions_51

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

In [28]:
# calculate accuracy score for KNN model with 51 neighbors
accuracy_score_knn_51 = accuracy_score(TEST_WNBA_2024_Data['PLAYOFFS'], knn_predictions_51)

# Print the accuracy score
print(f"Accuracy Score: {accuracy_score_knn_51}")

Accuracy Score: 0.6666666666666666


Not great! With K = 51, each vote draws on nearly all 60 training rows, so the model predicted every team to make the playoffs. Now let's try K = 15.
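Before moving on, the all-playoffs result for K = 51 can be checked with simple counting. The sketch assumes the 60 training rows contain 40 playoff teams (8 playoff teams in each of 5 seasons), which follows from the numbers quoted in this module but is not read from the data file.

```python
n_train = 60
n_playoff = 40                       # assumed: 8 playoff teams x 5 seasons
n_non_playoff = n_train - n_playoff  # 20 rows labeled 0

results = {}
for k in (5, 15, 41, 51):
    # Worst case for class 1: as many neighbors as possible are non-playoff rows
    max_non_playoff_votes = min(k, n_non_playoff)
    min_playoff_votes = k - max_non_playoff_votes
    always_predicts_playoff = min_playoff_votes > k / 2
    results[k] = always_predicts_playoff
    print(f"K={k:>2}: at least {min_playoff_votes} playoff votes "
          f"-> every prediction forced to 1? {always_predicts_playoff}")
```

Under this assumption, any K above 40 includes so many playoff rows that class 1 wins every vote, no matter which team is being classified.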

In [29]:
# Make the model object for KNN model with 15 neighbors
KNN_model_15 = KNeighborsClassifier(n_neighbors=15)

# Fit the model using the 2019-2023 WNBA data that was scaled earlier
KNN_model_15.fit(training_data_19_23, TRAIN_WNBA_19_23_Data['PLAYOFFS'])

In [None]:
# Have the model make predictions using the 2024 WNBA data
knn_predictions_15 = KNN_model_15.predict(test_data_2024)

# takes a look at the array of predictions
knn_predictions_15

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], dtype=int64)

In [31]:
# calculate accuracy score for KNN model with 15 neighbors
accuracy_score_knn_15 = accuracy_score(TEST_WNBA_2024_Data['PLAYOFFS'], knn_predictions_15)

# Print the accuracy score
print(f"Accuracy Score: {accuracy_score_knn_15}")

Accuracy Score: 0.75


This model predicted 9 teams to make the playoffs; one too many. Not a great accuracy score either. 

Now, let's try K=13 and K=21!

K=13:

In [None]:
# Make the model object for KNN model with 13 neighbors
KNN_model_13 = KNeighborsClassifier(n_neighbors=13)

# Fit the model using the 2019-2023 WNBA data that was scaled earlier
KNN_model_13.fit(training_data_19_23, TRAIN_WNBA_19_23_Data['PLAYOFFS'])

# Have the model make predictions using the 2024 WNBA data
knn_predictions_13 = KNN_model_13.predict(test_data_2024)

# takes a look at the array of predictions
knn_predictions_13

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], dtype=int64)

In [33]:
# calculate accuracy score for KNN model with 13 neighbors
accuracy_score_knn_13 = accuracy_score(TEST_WNBA_2024_Data['PLAYOFFS'], knn_predictions_13)

# Print the accuracy score
print(f"Accuracy Score: {accuracy_score_knn_13}")

Accuracy Score: 0.75


K=21:

In [None]:
# Make the model object for KNN model with 21 neighbors
KNN_model_21 = KNeighborsClassifier(n_neighbors=21)

# Fit the model using the 2019-2023 WNBA data that was scaled earlier
KNN_model_21.fit(training_data_19_23, TRAIN_WNBA_19_23_Data['PLAYOFFS'])

# Have the model make predictions using the 2024 WNBA data
knn_predictions_21 = KNN_model_21.predict(test_data_2024)

# takes a look at the array of predictions
knn_predictions_21

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0], dtype=int64)

In [35]:
# calculate accuracy score for KNN model with 21 neighbors
accuracy_score_knn_21 = accuracy_score(TEST_WNBA_2024_Data['PLAYOFFS'], knn_predictions_21)

# Print the accuracy score
print(f"Accuracy Score: {accuracy_score_knn_21}")

Accuracy Score: 0.6666666666666666


Although our models have not produced the most ideal results, this is a part of the modeling experience. You learn what works and what doesn't. We will talk about final conclusions after we make our final prediction on the 2025 data.

Since the Logistic model and the KNN model with K=5 are tied for the highest accuracy, let's have both of them make a prediction. Remember, an accuracy score cannot be calculated this time since the 2025 playoff teams have not been decided yet. Let's try to predict which teams will get into the playoffs!

---

# Final Predictions:

As discussed earlier, we will retrain each model to keep the data fresh, using only the 5 seasons before the season we are predicting. We will use a different dataset that drops the 2019 data and adds the 2024 season data that we have been making predictions on throughout this module.

Let's go! We will start by making the final logistic model.

In [None]:
# Make a final Logistic model using the 2020-2024 WNBA data
logistic_model_final = LogisticRegression()

# Fit the model using the 2020-2024 WNBA data that was scaled earlier
logistic_model_final.fit(training_data_20_24, TRAIN_WNBA_20_24_Data['PLAYOFFS'])

# Have the model make predictions using the 2025 WNBA data
log_predictions_final = logistic_model_final.predict(test_data_2025)

# takes a look at the array of predictions
log_predictions_final

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0], dtype=int64)

In [37]:
# Key to know what team the list of probabilities is listing
CURRENT_WNBA_2025_Data

| Unnamed: 0 | Rk | Team | G | MP | FG | FGA | FG% | 3P | 3PA | 3P% | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | New York Liberty | 25 | 200.0 | 30.4 | 66.9 | 0.454 | 10.0 | 28.2 | 0.354 | ... | 0.836 | 7.0 | 27.6 | 34.6 | 21.8 | 8.6 | 5.0 | 12.9 | 16.8 | 87.7 |
| 1 | 2 | Minnesota Lynx | 27 | 200.9 | 31.7 | 68.2 | 0.464 | 9.3 | 26.3 | 0.352 | ... | 0.79 | 8.3 | 25.7 | 34.1 | 23.6 | 8.2 | 5.1 | 12.3 | 16.9 | 85.9 |
| 2 | 3 | Los Angeles Sparks | 26 | 201.0 | 29.7 | 65.3 | 0.455 | 8.7 | 24.7 | 0.35 | ... | 0.75 | 8.1 | 25.0 | 33.1 | 20.4 | 7.4 | 2.9 | 14.9 | 18.4 | 84.6 |
| 3 | 4 | Indiana Fever | 26 | 200.0 | 30.9 | 68.1 | 0.454 | 8.2 | 24.3 | 0.336 | ... | 0.771 | 8.3 | 25.4 | 33.7 | 20.2 | 7.8 | 2.9 | 12.9 | 19.3 | 84.4 |
| 4 | 5 | Atlanta Dream | 26 | 201.0 | 29.2 | 67.7 | 0.432 | 9.3 | 28.3 | 0.33 | ... | 0.761 | 8.8 | 27.6 | 36.4 | 20.8 | 6.5 | 4.2 | 11.9 | 16.7 | 83.5 |
| 5 | 6 | Phoenix Mercury | 25 | 200.0 | 29.8 | 68.8 | 0.433 | 9.7 | 28.6 | 0.338 | ... | 0.773 | 8.6 | 25.6 | 34.2 | 21.0 | 8.4 | 3.8 | 12.5 | 18.7 | 83.1 |
| 6 | 7 | Dallas Wings | 27 | 200.9 | 29.9 | 71.6 | 0.417 | 6.8 | 21.1 | 0.32 | ... | 0.798 | 11.1 | 25.7 | 36.8 | 20.1 | 7.4 | 4.3 | 13.4 | 19.9 | 82.1 |
| 7 | 8 | Las Vegas Aces | 27 | 200.0 | 28.3 | 67.3 | 0.42 | 8.1 | 25.1 | 0.323 | ... | 0.831 | 8.1 | 25.4 | 33.5 | 17.7 | 7.2 | 4.7 | 12.6 | 17.7 | 81.6 |
| 8 | 9 | Seattle Storm | 27 | 200.0 | 30.5 | 68.3 | 0.446 | 7.4 | 21.9 | 0.338 | ... | 0.779 | 6.9 | 24.7 | 31.6 | 21.1 | 8.4 | 4.9 | 12.0 | 17.4 | 80.7 |
| 9 | 10 | Washington Mystics | 26 | 201.0 | 28.0 | 64.4 | 0.435 | 5.5 | 16.5 | 0.333 | ... | 0.749 | 8.7 | 26.5 | 35.2 | 18.7 | 6.5 | 3.0 | 14.8 | 19.4 | 79.1 |


Looks like the logistic model has predicted 9 teams to make the playoffs. Unfortunately, that is one too many once again.

For our final conclusion, we need to look at the probabilities of each team making the playoffs in order to exclude one team.

In [None]:
# Calculate the probabilities of each team making the playoffs using the final logistic model
log_final_probs = logistic_model_final.predict_proba(test_data_2025)

# Print the probabilities of each team making the playoffs
print(log_final_probs)

[[7.88630716e-05 9.99921137e-01]
 [4.01551860e-04 9.99598448e-01]
 [2.21476996e-01 7.78523004e-01]
 [8.57735811e-02 9.14226419e-01]
 [2.34292630e-03 9.97657074e-01]
 [1.70541526e-02 9.82945847e-01]
 [5.43696066e-01 4.56303934e-01]
 [2.01952675e-01 7.98047325e-01]
 [5.07964029e-02 9.49203597e-01]
 [8.38273768e-01 1.61726232e-01]
 [3.44834686e-01 6.55165314e-01]
 [9.27621141e-01 7.23788588e-02]
 [9.97534982e-01 2.46501754e-03]]


Among the teams predicted to make the playoffs, the Valkyries have the lowest probability of making it, at 65.5%. Let's drop that team out of the playoffs for our conclusion from this model.

Therefore the teams predicted to make the playoffs are: The Liberty, Lynx, Sparks, Fever, Dream, Mercury, Aces, and the Storm.
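The selection above can be reproduced by ranking the rows by their playoff probability and keeping the top 8. The probabilities are the right-hand column of the output printed above; the row numbers follow the order of the 2025 table, and the variable names are illustrative.

```python
# Probability of class 1 for each of the 13 rows, copied from the output above
playoff_probs = [
    9.99921137e-01, 9.99598448e-01, 7.78523004e-01, 9.14226419e-01,
    9.97657074e-01, 9.82945847e-01, 4.56303934e-01, 7.98047325e-01,
    9.49203597e-01, 1.61726232e-01, 6.55165314e-01, 7.23788588e-02,
    2.46501754e-03,
]

n_spots = 8  # number of playoff spots

# Row indices ordered from most to least likely to make the playoffs
ranked = sorted(range(len(playoff_probs)),
                key=lambda i: playoff_probs[i], reverse=True)

top_rows = sorted(ranked[:n_spots])
first_row_out = ranked[n_spots]

print("rows kept as playoff teams:", top_rows)
print(f"first row left out: {first_row_out} "
      f"(probability {playoff_probs[first_row_out]:.1%})")
```

Ranking by probability guarantees exactly 8 picks, whereas the 50% threshold alone can return more or fewer teams than there are playoff spots.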

Now, let's try a KNN model with K=5 to see if it produces better results.

K=5:

In [None]:
# Make the model object for KNN model with 5 neighbors
Final_KNN_model_5 = KNeighborsClassifier(n_neighbors=5)

# Fit the model using the 2020-2024 WNBA data that was scaled earlier
Final_KNN_model_5.fit(training_data_20_24, TRAIN_WNBA_20_24_Data['PLAYOFFS'])

# Have the model make predictions using the 2025 WNBA data
Final_knn_predictions_5 = Final_KNN_model_5.predict(test_data_2025)

# takes a look at the array of predictions
Final_knn_predictions_5

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], dtype=int64)

This model did slightly worse and predicted 10 teams to go to the playoffs. Not only that but the model only chose the top 10 teams in terms of per game stats. Most likely, that is where the boundary for the model is located. 

We won't look at the probabilities for this model since it performed worse than the logistic model. We will make our conclusion(s) using the logistic model's results.

After exploring the results across different models, we can see that the logistic model has produced the best results. Let's move onto the final conclusions for this module.

---

# Final Conclusions


Throughout this module, we have covered a lot of ground. We have gone over the basics of modeling with KNN, LDA, QDA and Logistic models. We have discussed scaling, accuracy scores, tie-breakers, etc. There is a LOT that comes with the modeling process. There are also many other types of models that could have been used in place of the ones we learned about today, such as SVMs and Decision Trees.

We learned that Discriminant Analysis was NOT a good idea for this dataset and that logistic models, along with KNN models, fit this project much better. This may be due to the scaling along with the variances/relationships among the predictor variables. Exploratory analysis and data validation are extremely important in the model-making process. Preparing the data for usage and discovering what works and what doesn't is a vital piece of the puzzle, as we figured out today.

Despite the failures using Discriminant Analysis models, our Logistic models and KNN models produced consistent results among their respective predictions. After modeling the results, we will conclude that the teams making the playoffs in 2025 are the Liberty, Lynx, Sparks, Fever, Dream, Mercury, Aces, and the Storm!

---

# Review Questions

Below, you will be asked a series of questions to review the material you have learned throughout this module.

### 1.) How does KNN work and what needs to be done to the data before use of a KNN model?

### 2.) Using the 2020-2024 dataset, create a list of predictor variables that includes FG%, 3P%, 2P%, FT%, ORB, AST. Scale the list of predictor variables. Make sure you also scale them for the 2025 data since you will be asked to make a prediction on them later. Create a new column in the ORIGINAL 2020-2024 dataset that assesses whether a team's points ('PTS') are over 82 per game. 1 = True, 0 = False. Make this column for the ORIGINAL 2025 dataset as well.

Hint: _Use .astype(int)_ 

### 3.) Using the scaled data you created in question 2, predict which team(s) will have above 82 points per game during the 2025 WNBA season with a QDA model. Record an accuracy score. 

QUICK NOTE: You can use an accuracy score in this case because points per game are already recorded for each team thus far. Although they are not final for the WHOLE season, this is an interesting experiment to see if these are strong predictors of whether a team will have high points per game.

### 4.) Do the same thing you did in question 3 but with 3 KNN models instead. Use K=5, K=25 and K=51 respectively. Record an accuracy score.

### 5.) As you did in questions 3 and 4, make a prediction but use a Logistic model instead. Record an accuracy score.

### 6.) Which model(s) had the highest accuracy score? Is there a tie between the most accurate models? Or is there a model that is clearly the best in terms of an accuracy score?

### 7.) What differentiates LDA and QDA? List a few differences.