<div class="alert alert-danger">
    <h4 style="font-weight: bold; font-size: 28px;">Feature Selection using Filter Methods</h4>
    <p style="font-size: 20px;">NBA API Data (2022-2024)</p>
</div>

<a name="Feature-Selection"></a>

# Table of Contents

[Setup](#Setup)

[Data](#Data)

**[1. Filter Methods for Total Points](#1.-Filter-Methods-for-Total-Points)**

- [1.1. Correlation Based](#1.1.-Correlation-Based)

- [1.2. `vtreat` Library](#1.2.-vtreat-Library)

**[2. Filter Methods for Plus Minus](#2.-Filter-Methods-for-Plus-Minus)**

- [2.1. Correlation Based](#2.1.-Correlation-Based)

- [2.2. `vtreat` Library](#2.2.-vtreat-Library)

**[3. Filter Methods for Game Winner](#3.-Filter-Methods-for-Game-Winner)**

- [3.1. Correlation Based](#3.1.-Correlation-Based)

- [3.2. `vtreat` Library](#3.2.-vtreat-Library)

# Setup

[Return to top](#Feature-Selection)

In [1]:
import sys
from pathlib import Path
# get current working directory
cwd = %pwd
# add shared_code directory to Python sys.path
sys.path.append(str(Path(cwd).parent / "shared_code"))
# import all libraries in shared_code directory 'imports.py' file
from imports import *
%matplotlib inline



# Data

[Return to top](#Feature-Selection)

In [2]:
# load, filter (by time) and scale data
pts_scaled_df, pm_scaled_df, res_scaled_df, test_set_obs = utl.load_and_scale_data(
    file_path='../../data/processed/nba_team_matchups_rolling_box_scores_2022_2024_r05.csv',
    seasons_to_keep=['2021-22', '2022-23', '2023-24'],
    training_season='2021-22',
    feature_prefix='ROLL_',
    scaler_type='minmax', 
    scale_target=False
)

Season 2021-22: 1186 games
Season 2022-23: 1181 games
Season 2023-24: 692 games
Total number of games across sampled seasons: 3059 games
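
The `scaler_type='minmax'` and `training_season='2021-22'` arguments suggest that the scaling statistics are estimated on the training season and then reused for the later seasons, so that no information from the held-out games leaks into the features. The internals of `utl.load_and_scale_data` are not shown in this notebook, so the following is only a minimal sketch of that idea under that assumption. The helpers `fit_minmax` and `apply_minmax` and the toy values are hypothetical.

```python
import numpy as np

def fit_minmax(train):
    """Return per-column minimum and range computed on the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Guard against constant columns so the division below never hits zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return col_min, col_range

def apply_minmax(values, col_min, col_range):
    """Scale values with statistics that were fitted elsewhere."""
    return (values - col_min) / col_range

# Toy rolling averages (hypothetical): rows are games, columns are two features.
train = np.array([[100.0, 0.40], [110.0, 0.45], [120.0, 0.50]])
later = np.array([[125.0, 0.42]])

col_min, col_range = fit_minmax(train)
train_scaled = apply_minmax(train, col_min, col_range)
later_scaled = apply_minmax(later, col_min, col_range)
print(train_scaled)
print(later_scaled)
```

Under this scheme the training rows land in [0, 1], while later games can fall slightly outside that interval when a team exceeds the training-season extremes.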


<a name="1.-Filter-Methods-for-Total-Points"></a>
# 1. Filter Methods for Total Points

[Return to top](#Feature-Selection)

<a name="1.1.-Correlation-Based"></a>
## 1.1. Correlation Based

[Return to top](#Feature-Selection)

In [3]:
pts_selection = utl.filter_feature_selection(
    df=pts_scaled_df, 
    outcome_name='TOTAL_PTS'
)

{
    "outcome_correlation": [
        "ROLL_HOME_PTS",
        "ROLL_HOME_FGM",
        "ROLL_HOME_FG_PCT",
        "ROLL_HOME_FTM",
        "ROLL_HOME_FTA",
        "ROLL_HOME_AST",
        "ROLL_AWAY_PTS",
        "ROLL_AWAY_FGM",
        "ROLL_AWAY_FGA",
        "ROLL_AWAY_FG_PCT",
        "ROLL_AWAY_FTM",
        "ROLL_AWAY_FTA",
        "ROLL_AWAY_AST"
    ],
    "feature_correlation": [
        "ROLL_HOME_PTS",
        "ROLL_HOME_FGA",
        "ROLL_HOME_FG3M",
        "ROLL_HOME_FG3_PCT",
        "ROLL_HOME_FTM",
        "ROLL_HOME_FT_PCT",
        "ROLL_HOME_OREB",
        "ROLL_HOME_DREB",
        "ROLL_HOME_AST",
        "ROLL_HOME_STL",
        "ROLL_HOME_BLK",
        "ROLL_HOME_TOV",
        "ROLL_HOME_PF",
        "ROLL_AWAY_PTS",
        "ROLL_AWAY_FGA",
        "ROLL_AWAY_FG3M",
        "ROLL_AWAY_FTM",
        "ROLL_AWAY_FT_PCT",
        "ROLL_AWAY_OREB",
        "ROLL_AWAY_DREB",
        "ROLL_AWAY_STL",
        "ROLL_AWAY_BLK",
        "ROLL_AWAY_TOV",
        "ROLL

<a name="1.2.-vtreat-Library"></a>
## 1.2. `vtreat` Library

[Return to top](#Feature-Selection)

In [4]:
# automated feature selection and preprocessing
pts_scaled_df_selected, pts_selection = utl.vtreat_feature_selection(
    df=pts_scaled_df,
    outcome_name='TOTAL_PTS'
)

There were 18 features selected out of 36 original features
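
`vtreat` scores each candidate variable against the outcome and reports a significance estimate per variable, which can then be used to keep only the variables that look informative. How `utl.vtreat_feature_selection` uses those scores is not shown here. As a stand-in, the sketch below illustrates significance-based filtering with a simple permutation test; it is not `vtreat`'s algorithm, and the function name, the 0.05 threshold and the toy data are assumptions made for illustration.

```python
import numpy as np

def permutation_p_value(x, y, n_perm=999, seed=0):
    """Share of shuffles of y whose |Pearson r| with x reaches the observed |r|."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_perm):
        permuted = abs(np.corrcoef(x, rng.permutation(y))[0, 1])
        if permuted >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): one feature tracks the outcome, one does not.
t = np.arange(30, dtype=float)
y = t
features = {
    "signal": 2.0 * t + 1.0,
    "noise": np.where(t % 2 == 0, 1.0, -1.0),
}

alpha = 0.05  # assumed significance threshold
p_values = {name: permutation_p_value(x, y) for name, x in features.items()}
selected = [name for name, p in p_values.items() if p < alpha]
print(p_values)
print(selected)
```

A filter of this kind looks at one feature at a time, so it can keep several features that carry the same information.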



In [5]:
pts_scaled_df_selected.head()

| | ROLL_AWAY_FTA | ROLL_HOME_FG3A | ROLL_AWAY_DREB | ROLL_HOME_FG3_PCT | ROLL_HOME_FGM | ROLL_HOME_FTA | ROLL_HOME_FTM | ROLL_HOME_AST | ROLL_AWAY_PTS | ROLL_AWAY_FGM | ROLL_AWAY_FTM | ROLL_AWAY_FG_PCT | ROLL_HOME_PTS | ROLL_AWAY_FGA | ROLL_AWAY_TOV | ROLL_HOME_FG_PCT | ROLL_AWAY_AST | ROLL_HOME_PF | TOTAL_PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.285 | 0.58 | 0.369 | 0.731 | 0.522 | 0.878 | 0.805 | 0.612 | 0.577 | 0.586 | 0.336 | 0.704 | 0.745 | 0.202 | 0.391 | 0.753 | 0.5 | 0.661 | 185 |
| 1 | 0.163 | 0.412 | 0.685 | 0.0 | 0.0 | 0.534 | 0.466 | 0.0 | 0.096 | 0.017 | 0.294 | 0.0 | 0.0 | 0.362 | 0.348 | 0.0 | 0.083 | 0.576 | 198 |
| 2 | 0.772 | 0.454 | 0.685 | 0.466 | 0.652 | 0.534 | 0.593 | 0.561 | 0.635 | 0.586 | 0.672 | 0.728 | 0.691 | 0.176 | 0.174 | 0.758 | 0.708 | 0.661 | 239 |
| 3 | 0.813 | 0.244 | 0.369 | 0.772 | 0.826 | 0.382 | 0.297 | 0.918 | 0.25 | 0.069 | 0.588 | 0.225 | 0.727 | 0.122 | 0.348 | 0.827 | 0.208 | 0.661 | 232 |
| 4 | 0.569 | 0.58 | 0.73 | 0.82 | 0.783 | 0.229 | 0.254 | 0.765 | 1.0 | 0.897 | 0.504 | 0.362 | 0.745 | 1.0 | 0.478 | 0.848 | 0.833 | 0.322 | 204 |


In [6]:
pts_selection

['ROLL_AWAY_FTA',
 'ROLL_HOME_FG3A',
 'ROLL_AWAY_DREB',
 'ROLL_HOME_FG3_PCT',
 'ROLL_HOME_FGM',
 'ROLL_HOME_FTA',
 'ROLL_HOME_FTM',
 'ROLL_HOME_AST',
 'ROLL_AWAY_PTS',
 'ROLL_AWAY_FGM',
 'ROLL_AWAY_FTM',
 'ROLL_AWAY_FG_PCT',
 'ROLL_HOME_PTS',
 'ROLL_AWAY_FGA',
 'ROLL_AWAY_TOV',
 'ROLL_HOME_FG_PCT',
 'ROLL_AWAY_AST',
 'ROLL_HOME_PF']

<a name="2.-Filter-Methods-for-Plus-Minus"></a>
# 2. Filter Methods for Plus Minus

[Return to top](#Feature-Selection)

<a name="2.1.-Correlation-Based"></a>
## 2.1. Correlation Based

[Return to top](#Feature-Selection)

In [7]:
pm_selection = utl.filter_feature_selection(
    df=pm_scaled_df, 
    outcome_name='PLUS_MINUS'
)

{
    "outcome_correlation": [
        "ROLL_HOME_PTS",
        "ROLL_HOME_FG_PCT",
        "ROLL_AWAY_FT_PCT"
    ],
    "feature_correlation": [
        "ROLL_HOME_PTS",
        "ROLL_HOME_FGA",
        "ROLL_HOME_FG3M",
        "ROLL_HOME_FG3_PCT",
        "ROLL_HOME_FTM",
        "ROLL_HOME_FT_PCT",
        "ROLL_HOME_OREB",
        "ROLL_HOME_DREB",
        "ROLL_HOME_AST",
        "ROLL_HOME_STL",
        "ROLL_HOME_BLK",
        "ROLL_HOME_TOV",
        "ROLL_HOME_PF",
        "ROLL_AWAY_PTS",
        "ROLL_AWAY_FGA",
        "ROLL_AWAY_FG3M",
        "ROLL_AWAY_FTM",
        "ROLL_AWAY_FT_PCT",
        "ROLL_AWAY_OREB",
        "ROLL_AWAY_DREB",
        "ROLL_AWAY_STL",
        "ROLL_AWAY_BLK",
        "ROLL_AWAY_TOV",
        "ROLL_AWAY_PF"
    ],
    "VIF": [
        "ROLL_HOME_STL",
        "ROLL_AWAY_STL"
    ],
    "feature_intersection": []
}
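
The `outcome_correlation` and `feature_correlation` lists above look like the result of two separate correlation filters: one that keeps features sufficiently correlated with the outcome, and one that drops features that are nearly redundant with a feature already kept. The implementation of `utl.filter_feature_selection` is not shown, so the following is a minimal sketch of those two ideas only. The function names, both thresholds and the toy data are assumptions.

```python
import numpy as np

def outcome_correlation_filter(X, y, names, min_abs_corr=0.1):
    """Keep features whose |Pearson r| with the outcome meets the threshold."""
    kept = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) >= min_abs_corr:
            kept.append(name)
    return kept

def feature_correlation_filter(X, names, max_abs_corr=0.9):
    """Greedily keep a feature unless it is highly correlated with one already kept."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept_idx = []
    for j in range(len(names)):
        if all(corr[j, k] <= max_abs_corr for k in kept_idx):
            kept_idx.append(j)
    return [names[j] for j in kept_idx]

# Toy data (hypothetical): "b" duplicates "a", "c" is nearly unrelated to both.
t = np.arange(20, dtype=float)
X = np.column_stack([t, 2.0 * t + 1.0, np.where(t % 2 == 0, 1.0, -1.0)])
names = ["a", "b", "c"]
y = 3.0 * t

outcome_kept = outcome_correlation_filter(X, y, names)
feature_kept = feature_correlation_filter(X, names)
print(outcome_kept)   # features related to the outcome
print(feature_kept)   # features that are not redundant with each other
```

The two filters answer different questions, which is why a feature can appear in one list and not the other.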


<a name="2.2.-vtreat-Library"></a>
## 2.2. `vtreat` Library

[Return to top](#Feature-Selection)

In [8]:
# automated feature selection and preprocessing
pm_scaled_df_selected, pm_selection = utl.vtreat_feature_selection(
    df=pm_scaled_df,
    outcome_name='PLUS_MINUS'
)

There were 14 features selected out of 36 original features



In [9]:
pm_scaled_df_selected.head()

| | ROLL_HOME_FG3A | ROLL_AWAY_DREB | ROLL_HOME_FG3_PCT | ROLL_HOME_FGM | ROLL_HOME_FG3M | ROLL_HOME_AST | ROLL_AWAY_PTS | ROLL_AWAY_FGM | ROLL_AWAY_FT_PCT | ROLL_AWAY_FG_PCT | ROLL_HOME_DREB | ROLL_HOME_PTS | ROLL_HOME_FG_PCT | ROLL_AWAY_AST | PLUS_MINUS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.58 | 0.369 | 0.731 | 0.522 | 0.758 | 0.612 | 0.577 | 0.586 | 0.603 | 0.704 | 0.292 | 0.745 | 0.753 | 0.5 | 7.0 |
| 1 | 0.412 | 0.685 | 0.0 | 0.0 | 0.076 | 0.0 | 0.096 | 0.017 | 0.837 | 0.0 | 0.381 | 0.0 | 0.0 | 0.083 | -8.0 |
| 2 | 0.454 | 0.685 | 0.466 | 0.652 | 0.455 | 0.561 | 0.635 | 0.586 | 0.469 | 0.728 | 0.602 | 0.691 | 0.758 | 0.708 | 29.0 |
| 3 | 0.244 | 0.369 | 0.772 | 0.826 | 0.53 | 0.918 | 0.25 | 0.069 | 0.268 | 0.225 | 0.159 | 0.727 | 0.827 | 0.208 | -10.0 |
| 4 | 0.58 | 0.73 | 0.82 | 0.783 | 0.833 | 0.765 | 1.0 | 0.897 | 0.446 | 0.362 | 0.779 | 0.745 | 0.848 | 0.833 | -10.0 |


In [10]:
pm_selection

['ROLL_HOME_FG3A',
 'ROLL_AWAY_DREB',
 'ROLL_HOME_FG3_PCT',
 'ROLL_HOME_FGM',
 'ROLL_HOME_FG3M',
 'ROLL_HOME_AST',
 'ROLL_AWAY_PTS',
 'ROLL_AWAY_FGM',
 'ROLL_AWAY_FT_PCT',
 'ROLL_AWAY_FG_PCT',
 'ROLL_HOME_DREB',
 'ROLL_HOME_PTS',
 'ROLL_HOME_FG_PCT',
 'ROLL_AWAY_AST']

<a name="3.-Filter-Methods-for-Game-Winner"></a>
# 3. Filter Methods for Game Winner

[Return to top](#Feature-Selection)

<a name="3.1.-Correlation-Based"></a>
## 3.1. Correlation Based

[Return to top](#Feature-Selection)

In [11]:
res_selection = utl.filter_feature_selection(
    df=res_scaled_df, 
    outcome_name='GAME_RESULT'
)

{
    "outcome_correlation": [
        "ROLL_HOME_FG_PCT",
        "ROLL_AWAY_STL"
    ],
    "feature_correlation": [
        "ROLL_HOME_PTS",
        "ROLL_HOME_FGA",
        "ROLL_HOME_FG3M",
        "ROLL_HOME_FG3_PCT",
        "ROLL_HOME_FTM",
        "ROLL_HOME_FT_PCT",
        "ROLL_HOME_OREB",
        "ROLL_HOME_DREB",
        "ROLL_HOME_AST",
        "ROLL_HOME_STL",
        "ROLL_HOME_BLK",
        "ROLL_HOME_TOV",
        "ROLL_HOME_PF",
        "ROLL_AWAY_PTS",
        "ROLL_AWAY_FGA",
        "ROLL_AWAY_FG3M",
        "ROLL_AWAY_FTM",
        "ROLL_AWAY_FT_PCT",
        "ROLL_AWAY_OREB",
        "ROLL_AWAY_DREB",
        "ROLL_AWAY_STL",
        "ROLL_AWAY_BLK",
        "ROLL_AWAY_TOV",
        "ROLL_AWAY_PF"
    ],
    "VIF": [
        "ROLL_HOME_STL",
        "ROLL_AWAY_STL"
    ],
    "feature_intersection": [
        "ROLL_AWAY_STL"
    ]
}
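
The `VIF` entry refers to the variance inflation factor, which measures how well one feature can be predicted from the others: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. These values equal the diagonal of the inverse correlation matrix, which the sketch below uses. How `utl.filter_feature_selection` computes VIF and which cutoff it applies are not shown, so the function name, the cutoff of 5 and the toy data are assumptions.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column, read from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Toy data (hypothetical): x1 and x2 are nearly collinear, x3 is mostly independent.
t = np.arange(20, dtype=float)
alternating = np.where(t % 2 == 0, 1.0, -1.0)
blocks = np.tile([1.0, 1.0, -1.0, -1.0], 5)
X = np.column_stack([t, t + alternating, blocks])
names = ["x1", "x2", "x3"]

vif = variance_inflation_factors(X)
max_vif = 5.0  # assumed cutoff; 5 and 10 are both common rules of thumb
low_vif = [name for name, value in zip(names, vif) if value < max_vif]
print(dict(zip(names, np.round(vif, 2))))
print(low_vif)
```

Only features with a low VIF pass such a check, which would be consistent with the short `VIF` lists printed in this notebook.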


<a name="3.2.-vtreat-Library"></a>
## 3.2. `vtreat` Library

[Return to top](#Feature-Selection)

In [12]:
# automated feature selection and preprocessing
res_scaled_df_selected, res_selection = utl.vtreat_feature_selection(
    df=res_scaled_df,
    outcome_name='GAME_RESULT'
)

There were 12 features selected out of 36 original features



In [13]:
res_scaled_df_selected.head()

| | ROLL_AWAY_STL | ROLL_HOME_REB | ROLL_HOME_FGM | ROLL_HOME_FG3M | ROLL_AWAY_PTS | ROLL_AWAY_FGM | ROLL_AWAY_FT_PCT | ROLL_HOME_DREB | ROLL_HOME_PTS | ROLL_AWAY_TOV | ROLL_HOME_FG_PCT | ROLL_AWAY_AST | GAME_RESULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.28 | 0.478 | 0.522 | 0.758 | 0.577 | 0.586 | 0.603 | 0.292 | 0.745 | 0.391 | 0.753 | 0.5 | 1 |
| 1 | 0.28 | 0.826 | 0.0 | 0.076 | 0.096 | 0.017 | 0.837 | 0.381 | 0.0 | 0.348 | 0.0 | 0.083 | 0 |
| 2 | 0.36 | 0.609 | 0.652 | 0.455 | 0.635 | 0.586 | 0.469 | 0.602 | 0.691 | 0.174 | 0.758 | 0.708 | 1 |
| 3 | 0.2 | 0.348 | 0.826 | 0.53 | 0.25 | 0.069 | 0.268 | 0.159 | 0.727 | 0.348 | 0.827 | 0.208 | 0 |
| 4 | 0.76 | 0.826 | 0.783 | 0.833 | 1.0 | 0.897 | 0.446 | 0.779 | 0.745 | 0.478 | 0.848 | 0.833 | 0 |


In [14]:
res_selection

['ROLL_AWAY_STL',
 'ROLL_HOME_REB',
 'ROLL_HOME_FGM',
 'ROLL_HOME_FG3M',
 'ROLL_AWAY_PTS',
 'ROLL_AWAY_FGM',
 'ROLL_AWAY_FT_PCT',
 'ROLL_HOME_DREB',
 'ROLL_HOME_PTS',
 'ROLL_AWAY_TOV',
 'ROLL_HOME_FG_PCT',
 'ROLL_AWAY_AST']
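
With the three `vtreat` selections printed, a natural follow-up is to check which features were kept for every outcome. The sketch below copies the lists from the outputs of cells 6, 10 and 14 so that it runs on its own; inside the notebook the variables `pts_selection`, `pm_selection` and `res_selection` could be used directly. The name `common` is introduced here for illustration.

```python
# Selections copied from the notebook outputs above.
pts_selection = [
    'ROLL_AWAY_FTA', 'ROLL_HOME_FG3A', 'ROLL_AWAY_DREB', 'ROLL_HOME_FG3_PCT',
    'ROLL_HOME_FGM', 'ROLL_HOME_FTA', 'ROLL_HOME_FTM', 'ROLL_HOME_AST',
    'ROLL_AWAY_PTS', 'ROLL_AWAY_FGM', 'ROLL_AWAY_FTM', 'ROLL_AWAY_FG_PCT',
    'ROLL_HOME_PTS', 'ROLL_AWAY_FGA', 'ROLL_AWAY_TOV', 'ROLL_HOME_FG_PCT',
    'ROLL_AWAY_AST', 'ROLL_HOME_PF',
]
pm_selection = [
    'ROLL_HOME_FG3A', 'ROLL_AWAY_DREB', 'ROLL_HOME_FG3_PCT', 'ROLL_HOME_FGM',
    'ROLL_HOME_FG3M', 'ROLL_HOME_AST', 'ROLL_AWAY_PTS', 'ROLL_AWAY_FGM',
    'ROLL_AWAY_FT_PCT', 'ROLL_AWAY_FG_PCT', 'ROLL_HOME_DREB', 'ROLL_HOME_PTS',
    'ROLL_HOME_FG_PCT', 'ROLL_AWAY_AST',
]
res_selection = [
    'ROLL_AWAY_STL', 'ROLL_HOME_REB', 'ROLL_HOME_FGM', 'ROLL_HOME_FG3M',
    'ROLL_AWAY_PTS', 'ROLL_AWAY_FGM', 'ROLL_AWAY_FT_PCT', 'ROLL_HOME_DREB',
    'ROLL_HOME_PTS', 'ROLL_AWAY_TOV', 'ROLL_HOME_FG_PCT', 'ROLL_AWAY_AST',
]

# Keep the order of the total-points list while intersecting with the other two.
pm_set, res_set = set(pm_selection), set(res_selection)
common = [f for f in pts_selection if f in pm_set and f in res_set]
print(common)
```

Six features appear in all three selections: `ROLL_HOME_FGM`, `ROLL_AWAY_PTS`, `ROLL_AWAY_FGM`, `ROLL_HOME_PTS`, `ROLL_HOME_FG_PCT` and `ROLL_AWAY_AST`.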