# PART 3: Exploratory Data Analysis

This notebook will employ exploratory mining techniques to study the NBA players, discern statistical patterns and better inform the modeling stage of the project pipeline.

(NOTE: Due to the use of widgets, this notebook won't save all the visuals executed. Please instead refer to either 3_explorer.html / saved profiling copies, or rerun this script.)

---

In [None]:
# Data Source / API
from nba_api.stats.endpoints import shotchartdetail
from nba_api.stats.static import players as p, teams as t

# Data Management & Manipulation
import numpy as np
import pandas as pd
from prettytable import PrettyTable
from scipy.stats import norm, gaussian_kde, percentileofscore

# Data Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from autoviz.AutoViz_Class import AutoViz_Class
from matplotlib import cm, patches
from matplotlib.patches import Circle, Rectangle, Arc, ConnectionPatch, Polygon, PathPatch
from matplotlib.collections import PatchCollection
from matplotlib.colors import LinearSegmentedColormap, ListedColormap, BoundaryNorm
from matplotlib.path import Path
from pandas_profiling import ProfileReport, profile_report
from plotly.subplots import make_subplots
from plotnine import ggplot, aes, geom_jitter, scale_color_manual, theme, labs, theme_bw

# Utils
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from IPython.display import display, HTML
from time import sleep
import random
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
sns.set_style('white')
sns.set_color_codes()
plt.style.use('default')  # sub 'dark_background'

## 3A: Setup

**Objective:** Load the processed player statistics into local dataframes.

---

In [None]:
player_df = pd.read_csv('./cln_comprehensive_stats.csv', sep=',', header=0, index_col=None)
player_df

## 3B: Auto-Profile Generation

**Objective:** This section implements automatic profiling modules to generate exploratory reports to study the player data. Due to the large dimensionality of the dataset, attributes were broken down into usage, offensive, and defensive stats for this exploration phase.

----

#### Pandas Profiling -- Player Info & Usage:

In [None]:
# Define attributes to profile
usg_cols = [
    'PLAYER', 'H', 'POS', 'TEAM', 'GP', 'MIN', 'POSS',  # Utils
    '%FGA', '%3PA', '%FTA', '%REB', '%AST', '%BLKA', '%PFD', '%PTS'  # Usage stats
]

usage_profile = player_df[usg_cols].profile_report(
    title = 'NBA Player Bio & Usage Profile',
    explorative = True,
    plot = {'histogram': {'bins': 12}, 'correlation': {'cmap': 'RdBu_r'}},
    missing_diagrams = {'bar': False, 'matrix': False, 'heatmap': False},  # already taken care of
    progress_bar = False,
    pool_size = 0,
    samples=None
)

# usage_profile.to_file('pd_profile_plyr_usage.html')  # report saved as html file in project directory for easier viewing
usage_profile.to_notebook_iframe()

To circumvent loading issues when pandas profiling reports are generated from high-dimensional datasets, the attributes were broken into three primary categories: basic player info/usage, offense, and defense. In this section, an exploratory report was generated for the player bio info and usage-related data. The report was also saved as an HTML copy to the local directory for easier viewing. Some notable observations from the generated report are described below.

From the distribution plots, the following observations can be made:

- Categorical attributes:
    - `PLAYER` name essentially serves as indexing variable for the study as there are no duplicates and each record consists of an individual player
    - `POS` and `TEAM` are the only other categorical variables
        - Team info is uniformly distributed as an equal number of players were retrieved in the initial pre-processing stage, based on minutes. Therefore, apart from initial explorations to find any team-related patterns, this attribute won't be useful for the modeling stage.
        - Position info is captured into 5 bins: 3 single positions (G, F, C), and 2 dual positions (G-F, F-C). 
        - There were both surprising and expected elements to the position data. Considering that 'G' and 'F' each capture two positions ('PG'/'SG' and 'SF'/'PF' respectively) unlike 'C', it would be expected that the 'C' position would have the fewest occurrences in the dataset. However, it was surprising to see how low that proportion actually is. This can be attributed to the fact that the modern NBA relies more on guards and forwards than the traditional big man. On the other side of the spectrum, most of the players within the dataset (and therefore, the players with the most playing time) are guards. This has implications for the modeling stage, as clustering will naturally be more focused on parsing the guards; this will be an important aspect to watch out for in this stage of the project pipeline.
<br><br>
- Numeric attributes:
    - Height information (`H`) shows an interesting distribution. Apart from a few exceptions, most height bins have a fairly similar number of players. The biggest outlier is at 77 inches and the data shows that 7-footers are rare among the players that get meaningful minutes.
    - Games played information (`GP`) shows a heavy left-skew, with a majority of the players in the dataset playing a significant portion of the season. Interestingly, `MIN` doesn't follow the same distribution as might have been expected, indicating that the filtered players with a low number of games still played a fairly significant number of cumulative minutes. The small game samples for some players therefore likely resulted from sudden stoppages of play such as injuries, rather than other reasons. This reaffirms that the filtered list is a suitable set of players for such a study, as we want to learn patterns based on players who have or would have gotten a significant amount of playing time. `POSS` follows a nearly identical distribution to `MIN`, which further confirms this. Although these features were kept for the pre-processing steps, they will be removed for the modeling stages.
    - The remaining attributes all convey usage-related information by computing the proportion of a player's contribution in a particular statistic, in relation to the team's total. For example, `%FGA` is the percentage of a team's Field-Goal Attempts that belong to the `PLAYER` of that record. Apart from `%FGA` and `%3PA`, all of these stats show a strong right-skew distribution, indicating that only a small number of players make up the heavy-usage tier.

Overall, the distributions are mostly non-normal and this needs to be taken into account during the modeling stages. This is also an important aspect to consider when choosing a metric to evaluate correlations.
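
The non-normality noted above can also be checked numerically rather than only by eye. The sketch below is a minimal illustration on synthetic stand-in data (the values are made up and not taken from `cln_comprehensive_stats.csv`). It uses sample skewness and the Shapiro-Wilk test from SciPy; the helper name `normality_summary` and the 0.05 cutoff are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Synthetic stand-in for two usage columns (illustrative values only)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    '%FGA': rng.normal(18, 4, 200),      # roughly symmetric
    '%AST': rng.exponential(10, 200),    # right-skewed
})

def normality_summary(df, alpha=0.05):
    """Return skewness and the Shapiro-Wilk p-value for each numeric column."""
    rows = []
    for col in df.select_dtypes('number').columns:
        values = df[col].dropna()
        _, p_value = shapiro(values)
        rows.append({
            'attribute': col,
            'skew': values.skew(),
            'shapiro_p': p_value,
            'looks_normal': p_value > alpha,  # assumed cutoff
        })
    return pd.DataFrame(rows).set_index('attribute')

summary = normality_summary(demo_df)
print(summary)
```

On the real data, the same helper could be called on `player_df[usg_cols]`, since non-numeric columns are skipped.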

Next, correlations between the attributes were explored. The `Pearson's r` method finds correlations between continuous numeric variables if certain conditions are met (assumption of normality, linearity, limited outliers). The `Spearman's Rank Correlation Coefficient` and `Kendall's Tau Coefficient` also find correlations for continuous numeric variables, but assume only independent observations and a monotonic relationship (same direction but not necessarily linear). The latter non-parametric methods are more applicable to this data, although they can be less sensitive than the former when its assumptions do hold. From the plotted heatmaps, however, the overall trends are consistent across all three methods:

- `H` correlates fairly strongly with a player's rebounding contribution to the team, indicating that height might be redundant for modeling.
- As expected based on the distribution analysis above, `POSS` and `MIN` are extremely correlated and are slightly less collinear with `GP`. This information won't be included as part of modeling regardless, but this affirms the pre-processing strategy of standardizing attributes to either per-possession or per-minute interchangeably.
- The following pairs are all correlated fairly strongly and are redundant to include in the modeling stages: `%FGA` & `%PTS`, `%FTA` & `%PFD` (personal-fouls drawn), `%FTA` & `%PTS`. This needs to be addressed before modeling.
- `H` and `%3PA` show a slight negative correlation, which makes intuitive sense as most of a team's 3-pt contribution occurs from the smaller guards and wings.
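
To make the difference between the three correlation methods concrete, the following sketch compares them on a synthetic pair of variables with a monotonic but strongly non-linear relationship. The column names `x` and `y_monotonic` are hypothetical and the data is generated, so this only illustrates the behavior of the coefficients and says nothing about the player data itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 300)
demo_df = pd.DataFrame({
    'x': x,
    'y_monotonic': np.exp(x),  # always increasing, but far from linear
})

# Series.corr supports all three methods discussed above
corr_by_method = {
    method: demo_df['x'].corr(demo_df['y_monotonic'], method=method)
    for method in ('pearson', 'spearman', 'kendall')
}
for method, r in corr_by_method.items():
    print(f'{method:>8}: {r:.3f}')
```

The rank-based coefficients treat the relationship as perfect because the ordering is preserved, while Pearson's r is pulled down by the curvature.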

Next, a similar report will be generated and analyzed for players' offensive statistics.

#### Pandas Profiling -- Offensive Stats:

In [None]:
# Define attributes to profile
off_cols = [
    'PassesMade/POSS', 'PassesReceived/POSS', 'ScreenAssists/POSS', 'ASTAdj/POSS', 'TOV/POSS',  # Passing stats
    'PTS/POSS', '2P%', '3P%', 'FT%', '%PTSFT',  # Base scoring info
    '%RA_FGA', '%PT_nonRA_FGA', '%MR_FGA',  # 2-Pt scoring zones
    '%cns_2FGA',  '%pullup_2FGA',  # 2-Pt scoring styles
    '%Corner3_FGA', '%ATB3_FGA',  # 3-Pt scoring zones
    '%cns_3PA', '%pullup_3PA',  # 3-Pt scoring styles
    '2FGM%UAST', '3FGM%UAST',  # Scoring dependency
    '%dr_1_2_fga', '%dr_3plus_fga', '%tch_0_2_fga', '%tch_2plus_fga',  # Scoring style (dribble & touch-time)
    '%trsn_FGA', '%iso_FGA', '%pnrbh_FGA', '%pnrrm_FGA', '%postup_FGA', '%spotup_FGA',  # Offensive-style
    '%handoff_FGA', '%cuts_FGA', '%offscrn_FGA', '%putbk_FGA', 'DRIVES/POSS', 'drives_PASS%', 
    'Avg Sec PerTouch', 'Avg Drib PerTouch', 'ElbowTouches/POSS', 'PostUps/POSS', 'PaintTouches/POSS',  # Offensive activity (touches)
    'PTS PerElbow Touch', 'PTS PerPost Touch', 'PTS PerPaint Touch',  # Touches-efficiency
    'Dist. Miles Off/POSS'  # Movement on offense
]

off_profile = player_df[off_cols].profile_report(
    title = 'NBA Player-Offense Profile',
    explorative = True,
    plot = {'histogram': {'bins': 12}, 'correlation': {'cmap': 'RdBu_r'}},
    missing_diagrams = {'bar': False, 'matrix': False, 'heatmap': False},  # already taken care of
    progress_bar = False,
    pool_size = 0,
    samples=None,
    interactions=None
)

# off_profile.to_file('pd_profile_plyr_off.html')  # report saved as html file in project directory for easier viewing
off_profile.to_notebook_iframe()

To circumvent loading issues when pandas profiling reports are generated from high-dimensional datasets, the attributes were broken into three primary categories: basic player info/usage, offense, and defense. In this section, an exploratory report was generated for the players' offensive statistics. The report was also saved as an HTML copy to the local directory for easier viewing. Some notable observations from the generated report are described below.

This segment consists of only numeric attributes. The distribution patterns can be organized as below:
- LEFT-SKEW: `3P%`, `FT%`, `PTS PerPaint Touch`
- RIGHT-SKEW: `PassesReceived/POSS`, `ScreenAssists/POSS` (heavy), `ASTAdj/POSS`, `TOV/POSS`, `PTS/POSS` (apart from outlier players, mostly normal), `%RA_FGA`, `%MR_FGA`, `%cns_2FGA` (heavy), `%pullup_2FGA`, `%Corner3_FGA`, `%cns_3PA`, `%pullup_3PA`, `3FGM%UAST`, `%dr_3plus_fga`, `%iso_FGA` (heavy), `%pnrbh_FGA` (heavy), `%pnrrm_FGA` (heavy), `%postup_FGA` (heavy), `%handoff_FGA`, `%cuts_FGA` (heavy), `%offscrn_FGA` (heavy), `%putbk_FGA` (heavy), `DRIVES/POSS`, `Avg Sec PerTouch`, `Avg Drib PerTouch`, `ElbowTouches/POSS` (heavy), `PostUps/POSS` (heavy), `PaintTouches/POSS` (heavy), `PTS PerPost Touch`
- Weak / No Skew (mostly normal): `PassesMade/POSS`, `2P%`, `%PTSFT`, `%PT_nonRA_FGA`, `%ATB3_FGA`, `2FGM%UAST`, `%dr_1_2_fga`, `%tch_0_2_fga`, `%tch_2plus_fga`, `%trsn_FGA`, `%spotup_FGA`, `drives_PASS%`, `PTS PerElbow Touch`, `Dist. Miles Off/POSS`

Noteworthy patterns based on above observations:
- Although passes made per possession is normally distributed amongst the high-min players of the dataset, passes-received has a more interesting distribution. Apart from two bins with high densities around 0.4-0.5, the rest of the data looks mostly normal. Due to these exceptions, the overall distribution is skewed to the right. It would be interesting to dig deeper and examine the players and positions that make up the outliers in this distribution.
- Screen assists, which are more common for the Bigs, are heavily skewed as expected, since Bigs make up a smaller proportion of the dataset as revealed in the previous profile report
- `ASTAdj/POSS` is an attribute that was selected in lieu of the traditional AST metric, as it accounts for secondary assists as well. This attribute has a very similar distribution to `TOV/POSS`, which intuitively makes sense, as greater ball-handling responsibilities naturally lead to more turnovers. This may also indicate that one of the two stats is redundant, which needs to be examined further during correlation assessments.
- `2P%` was mostly normally distributed; however, it is hard to gauge the 3-point efficiency as there are large gaps between bins, since fewer players attempt threes.
- FGA in the Paint were broken into two categories within the dataset: within the "Restricted Area" around the basket, and outside of it. Within the restricted area, the FGA were right-skewed, indicating there is a limited number of players who take most of their shots close to the basket, while the non-RA FGA was more normally distributed. The remaining shooting zones for 2-PA are captured in the Mid-Range FGA, which was also right-skewed. Therefore, digging into the distribution of the 2-PT shot attempts based on zone reveals that there are players specializing in RA and MR, but the non-RA Paint area is more common amongst all.
- Similarly, the 3-PT shot broken into "Corner" or "Above the Break" zones shows that the corner is more specialized to some players based on the right-skew distribution. 
- Breaking down 2-PT and 3-PT based on style, however, shows a different pattern. For 2-PT shots, there is a right-skew for both catch-and-shoot style and pullup shotmaking. However, for the 3-PT shot, catch-and-shoot threes are normally distributed, and pull-up threes seem to be mostly prevalent amongst a smaller group of players.
- The proportion of unassisted 2-PT FGM that make up a player's shot profile is normally distributed, while its 3-PT counterpart is heavily skewed to the right. This means that players are less likely to create their own 3-PT shot, which can also be linked to the normally distributed catch-and-shoot 3-PT profile pattern.
- The proportion of the shot profile that a player attempts within 1-2 dribbles is normally distributed, while the proportion after 3 or more dribbles is right-skewed. This again exposes the pattern of a group of players being responsible for the primary ball handling while others make quicker decisions with limited responsibility. This is further confirmed by the `Avg Sec PerTouch` & `Avg Drib PerTouch` attributes, which show a similar right-skew.
- Breaking down offensive plays based on transition, isolation, pick-and-roll, cuts, putbacks, and spot-up shooting, it is evident that spot-ups and transition plays are more normally distributed amongst the players, while the others are specialized, as indicated by a heavy right-skew.
- Interestingly, although paint & elbow touches were skewed, the points generated from those touches are fairly normally distributed amongst all players.
- Lastly, it was surprising to see that the distance per offensive possession was normally distributed -- as this metric was initially scraped as a way to distinguish players' off-ball skills, this feature might not be beneficial to the modeling objectives

Overall, the distributions for offensive metrics are mostly right-skewed or normally distributed, with a few displaying a left-skew. This needs to be taken into account during the modeling stages, as well as any feature selection processing involving attribute correlations.
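
The LEFT-SKEW / RIGHT-SKEW / weak-skew grouping above was read off the report histograms. A rough numeric counterpart is sketched below, assuming a skewness cutoff of 0.5 in absolute value; both the cutoff and the helper name `bucket_by_skew` are illustrative choices, and the demo columns are synthetic.

```python
import numpy as np
import pandas as pd

def bucket_by_skew(df, threshold=0.5):
    """Label each numeric column as LEFT-SKEW, RIGHT-SKEW, or WEAK/NONE."""
    skew = df.select_dtypes('number').skew()
    labels = pd.Series('WEAK/NONE', index=skew.index)
    labels[skew > threshold] = 'RIGHT-SKEW'
    labels[skew < -threshold] = 'LEFT-SKEW'
    return pd.DataFrame({'skew': skew, 'label': labels})

# Synthetic columns with a known shape (illustrative only)
rng = np.random.default_rng(2)
demo_df = pd.DataFrame({
    'right_like': rng.exponential(1.0, 500),
    'left_like': 1.0 - rng.exponential(1.0, 500),
    'normal_like': rng.normal(0.0, 1.0, 500),
})
skew_table = bucket_by_skew(demo_df)
print(skew_table)
```

Applied to `player_df[off_cols]`, a table like this could be compared against the visual grouping to flag borderline attributes.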

Next, correlations between the attributes were explored. The `Pearson's r` method finds correlations between continuous numeric variables if certain conditions are met (assumption of normality, linearity, limited outliers). The `Spearman's Rank Correlation Coefficient` and `Kendall's Tau Coefficient` also find correlations for continuous numeric variables, but assume only independent observations and a monotonic relationship (same direction but not necessarily linear). The latter non-parametric methods are more applicable to this data, although they can be less sensitive than the former when its assumptions do hold. From the plotted heatmaps, however, the overall trends are consistent across all three methods:

- `PassesMade/POSS` & `PassesReceived/POSS` are highly correlated with each other, as well as with a number of other features, most importantly `ASTAdj/POSS`, making them redundant. Interestingly, `PassesReceived/POSS` shares a particularly high correlation with ball-handling features such as touch time or number of dribbles per possession. Intuitively, this makes sense as players who have a larger ball-handling responsibility will get passes from their teammates more often to dictate the offense.
- `ScreenAssists/POSS` & `%pnrrm_FGA` are correlated with each other. Inherently, pick-and-roll actions involve screens set by a roll-man, which sometimes leads to assists when the ball handler makes a shot as a result, explaining this relationship. However, I am hesitant to drop one of these attributes for modeling before further exploration, as players whose shot profile isn't primarily made up of pick-and-rolls can still be involved in screen actions that might be lost with any hasty feature selection here. Furthermore, as expected, screen assists do not share a clearly defined relationship with other passing / assist metrics. This hypothesis was part of the reason this attribute was included in the study.
- `ASTAdj/POSS` is strongly correlated with ball-handling features such as touch time and number of dribbles per possession (and the consequent FGA in relation to both). This intuitively makes sense since ball-handlers are the primary playmakers for any NBA lineup. This is another opportunity to consider dimensionality reduction, as some features are redundant when trying to distinguish between players in any unsupervised learning model. Similarly, turnovers also correlate with the same attributes, albeit less strongly. The slight difference can be hypothesized as belonging to the non-ball-handling bigs who also lose the ball a bit more.
- `2P%` is positively correlated with paint touches and taking a higher proportion of shots from cutting plays. This makes sense as these are characteristics associated with traditional bigs who have a higher efficiency rate from 2-PT shots. Interestingly, this attribute seems slightly negatively correlated with pullup shots from 3, which are normally associated with modern guards, which further backs up the reasoning.
- `%RA_FGA` shows a strong positive correlation with paint touches, cuts, and putback attempts, which normally occur in the restricted area. Therefore, if the latter 3 all show a similar relationship with each other, then just `%RA_FGA` would be enough to capture this characteristic and prevent redundant features for modeling. This feature also showed a strong negative correlation with above-the-break 3-PT attempts.
- `%MR_FGA` and `%pullup_2FGA` are very strongly correlated and might be redundant for modeling.
- `%pullup_2FGA` shows a fairly high correlation with ball-handling attributes such as touch time and dribble stats as well. This can be removed in favor of the remaining attributes.
- `%Corner3_FGA`, `%cns_3PA` & `%spotup_FGA` are all positively correlated with each other, which makes intuitive sense as corner 3s usually result from spot-shooters camped out at the corners, until the team's primary playmakers find them. Therefore, these are redundant features that could be further trimmed as well.
- Unassisted-FGA features for both 2-pt and 3-pt shots are strongly correlated with dribble and touch time statistics as anticipated.
- Dribble and touch-time stats are very strongly correlated, and multiple breakdowns for each are redundant, as high dribble counts are inversely correlated with low touch-time.
- Touches info is correlated with a lot of offensive play-style metrics and needs to be closely looked at to avoid redundancy.
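
The redundant pairs listed above were picked out of the heatmaps by eye. One way to shortlist candidates programmatically is sketched here: it keeps the strict upper triangle of the absolute correlation matrix and reports pairs above a cutoff. The helper `correlated_pairs`, the 0.8 threshold, and the demo column names are all assumptions for illustration; the data is synthetic.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, method='spearman', threshold=0.8):
    """List attribute pairs whose absolute correlation exceeds `threshold`."""
    corr = df.corr(method=method).abs()
    # Keep the strict upper triangle so each pair is reported only once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic example: two near-duplicate columns and one independent column
rng = np.random.default_rng(3)
base = rng.normal(size=300)
demo_df = pd.DataFrame({
    'dribbles': base,
    'touch_time': base + rng.normal(scale=0.1, size=300),
    'screen_assists': rng.normal(size=300),
})
redundant = correlated_pairs(demo_df)
print(redundant)
```

A shortlist like this is only a starting point: a high coefficient alone does not settle whether one attribute of a pair should be dropped.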

Next, a similar report will be generated and analyzed for players' defensive statistics.

#### Pandas Profiling -- Defensive Stats:

In [None]:
# Define attributes to profile
def_cols = [
    'Dist. Miles Def/POSS',  # Player movement
    'REB/MIN', 'ContestedREB%', 'AVG REBDistance', 'Box Outs/MIN',  # Rebound-related stats
    'STL/MIN', 'BLK/MIN', 'Deflections/MIN', 'Loose BallsRecovered/MIN', 'ChargesDrawn/MIN',  # Defensive activity
    'Contested2PT Shots/MIN', 'Contested3PT Shots/MIN', 'OppTOV/MIN', 'OppPF/MIN', 'OppFTA/MIN', 
    'Opp2P%', '%opp_RA_FGA', 'opp_RA_FG%', '%opp_PT_nonRA_FGA', 'opp_PT_nonRA_FG%', '%opp_MR_FGA', 'opp_MR_FG%',  # Opponent 2-Pt efficiency
    'Opp3P%', '%opp_Corner3_FGA', 'opp_Corner3_FG%', '%opp_ATB3_FGA', 'opp_ATB3_FG%',  # Opponent 3-Pt efficiency
    '%opp_iso_FGA', 'opp_iso_FG%', '%opp_pnrbh_FGA', 'opp_pnrbh_FG%', '%opp_pnrrm_FGA', 'opp_pnrrm_FG%', '%opp_postup_FGA', 'opp_postup_FG%',  # Defended styles
    '%opp_spotup_FGA', 'opp_spotup_FG%', '%opp_handoff_FGA', 'opp_handoff_FG%', '%opp_offscrn_FGA', 'opp_offscrn_FG%'
]

def_profile = player_df[def_cols].profile_report(
    title = 'NBA Player-Defense Profile',
    explorative = True,
    plot = {'histogram': {'bins': 12}, 'correlation': {'cmap': 'RdBu_r'}},
    missing_diagrams = {'bar': False, 'matrix': False, 'heatmap': False},  # already taken care of
    progress_bar = False,
    pool_size = 0,
    samples=None,
    interactions=None
)

# def_profile.to_file('pd_profile_plyr_def.html')  # report saved as html file in project directory for easier viewing
def_profile.to_notebook_iframe()

To circumvent loading issues when pandas profiling reports are generated from high-dimensional datasets, the attributes were broken into three primary categories: basic player info/usage, offense, and defense. In this section, an exploratory report was generated for the players' defensive statistics. The report was also saved as an HTML copy to the local directory for easier viewing. Some notable observations from the generated report are described below.

This segment consists of only numeric attributes. The distribution patterns can be organized as below:
- LEFT-SKEW: `%opp_RA_FGA`
- RIGHT-SKEW: `REB/MIN`, `ContestedREB%`, `Box Outs/MIN` (heavy), `STL/MIN`, `BLK/MIN`, `Deflections/MIN`, `ChargesDrawn/MIN` (heavy), `Contested2PT Shots/MIN`, `%opp_PT_nonRA_FGA`, `%opp_iso_FGA`, `%opp_pnrbh_FGA`, `%opp_pnrrm_FGA`, `%opp_handoff_FGA`
- Weak / No Skew (mostly normal): `Dist. Miles Def/POSS`, `AVG REBDistance`, `Loose BallsRecovered/MIN`, `Contested3PT Shots/MIN`, `OppTOV/MIN`, `OppPF/MIN`, `OppFTA/MIN`, `Opp2P%`, `opp_RA_FG%`, `opp_PT_nonRA_FG%`, `%opp_MR_FGA`, `opp_MR_FG%`, `Opp3P%`, `%opp_Corner3_FGA`, `opp_Corner3_FG%`, `%opp_ATB3_FGA`, `opp_ATB3_FG%`, `opp_iso_FG%`, `opp_pnrbh_FG%`, `opp_pnrrm_FG%`, `%opp_postup_FGA`, `opp_postup_FG%`, `%opp_spotup_FGA`, `opp_spotup_FG%`, `opp_handoff_FG%`, `%opp_offscrn_FGA`, `opp_offscrn_FG%`  

Noteworthy patterns based on above observations:

- Distance run on defense seems to have a very similar distribution to the offensive counterpart, and might be redundant for modeling.
- A large number of defense-related metrics seem to show a mostly normal (or at least no strong skew) distribution. This indicates that a lot of these defensive attributes are prevalent among all the players, and not specialized to a few. There are some attributes that are heavily skewed, however, which can distinguish players a bit more based on defensive role (such as box-outs or charges-drawn). Some skewed attributes do not convey this specialized role information, and are more dependent on playing time (i.e. percentage of FGA contested against different play-types or in various zones on the court).
- There are missing records for players in particular defensive metrics, which still need to be handled. Although this wouldn't matter as much for supervised techniques that use tree-based methods, any unsupervised approach needs these addressed, as such algorithms tend to be distance-based and distances cannot be computed with missing values.
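
As a minimal sketch of how the missing defensive records could be surfaced and handled, the example below counts missing values per attribute and applies median imputation. The column names come from `def_cols`, but the values are made up, and median imputation is only one assumed option; dropping sparse columns or rows would be alternatives.

```python
import numpy as np
import pandas as pd

# Made-up values for three real column names (illustrative only)
demo_df = pd.DataFrame({
    'opp_postup_FG%': [0.45, np.nan, 0.52, 0.40, np.nan],
    'opp_iso_FG%': [0.38, 0.41, np.nan, 0.44, 0.39],
    'STL/MIN': [0.03, 0.05, 0.04, 0.02, 0.06],
})

# Count missing records per attribute, most affected first
missing_counts = demo_df.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Median imputation: fill each column's gaps with that column's median
imputed_df = demo_df.fillna(demo_df.median())
print(imputed_df)
```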

Overall, the distributions for defensive metrics don't show much skew (mostly normal) with some right-skewed and one with a left-skew. This needs to be taken into account during the modeling stages, as well as any feature selection processing involving attribute correlations.

Next, correlations between the attributes were explored. The `Pearson's r` method finds correlations between continuous numeric variables if certain conditions are met (assumption of normality, linearity, limited outliers). The `Spearman's Rank Correlation Coefficient` and `Kendall's Tau Coefficient` also find correlations for continuous numeric variables, but assume only independent observations and a monotonic relationship (same direction but not necessarily linear). The former parametric method is more applicable to the normally distributed attributes of the defensive data. There aren't as many interesting correlations among the defensive metrics, but from the plotted heatmaps, the overall trends are consistent across all three methods:

- `REB/MIN` has a strong positive correlation with `ContestedREB%` and a strong negative correlation with `AVG REBDistance`. These three attributes essentially convey similar information and therefore can be trimmed to a single feature for modeling purposes. Rebounding stats are also correlated fairly strongly with defensive metrics such as contested 2-pt shots or pick-and-roll roll-man defense, which intuitively makes sense since defending paint shots usually yields rebounds.
- `STL/MIN` and `Deflections/MIN` have a strong positive correlation as expected, and one of them can be dropped as well.
- `Opp3P%` is strongly correlated with the specific breakdown for above-the-break 3's FG% (`opp_ATB3_FG%`) as expected, but interestingly not as strongly with corner threes (`opp_Corner3_FG%`). Therefore, `Opp3P%` can be removed, keeping only the two zone-specific attributes, which convey more information.
- `%opp_pnrbh_FGA` is strongly correlated with the handoff and offscreen counterparts and consists of redundant info that could also be trimmed before modeling.

As there are a few missing values for some defensive attributes, trimming columns that contain redundant info can serve as a strategy to address both issues. Overall, the defensive metrics had fewer correlations amongst each other, but a deeper look into pair-wise plots can help parse the attribute interactions further to better inform the modeling stages.

#### AutoViz -- Offensive Stats:

In [None]:
av_off_report = AutoViz_Class().AutoViz('', dfte=player_df[off_cols], chart_format='bokeh', lowess=False) 

Executing AutoViz on the offensive data generated pair-wise attribute plots, KDE plots, scaled violin plots, and heatmap of correlations. Although some of this information was already gathered through previous methods, there are still some insights that can be drawn from this.

The most interesting observations from the pair-wise plots include:
- There are attribute pairs that seemed redundant when examined only through the lens of correlation, but the pair-wise plots illuminate how digging deeper can show that these attributes might still be valuable as stand-alone features to distinguish players. For example, `PassesMade/POSS` and `ASTAdj/POSS` have an incredibly strong positive correlation, but the pair-wise plot shows how there are some outlier players who pass a lot without generating assists, and some players who pass less but pass better (in terms of leading to a scoring play and therefore an assist). The pair-wise plots also show the importance of `ScreenAssists/POSS` info as these assists seem completely independent of other passing metrics (as expected). 
- Assessing `ASTAdj/POSS` vs. offensive plays shows interesting relationships: assists grow linearly if a greater proportion of a player's FGA come from pick-and-roll plays, while the opposite holds true for transition plays. On the flip side, weighing assists against ball-handling stats such as touch-time or dribbles shows that there is a positive linear and negative linear relationship with increased and decreased touch-time/dribbles respectively. Corner3-specialists also seem to have lower assist rates. All this points to the pattern that when the ball is less in the hands of a player, they are less likely to generate assists, which makes intuitive sense.
- A player's 2-pt FG% trends better if more of their shot profile involves assisted plays rather than unassisted attempts. However, interestingly there is no such trend for the 3-pt counterpart attributes. 2-FG% improves depending on the zone on the court as expected as well (restricted area yielding the most efficient 2s) while no such clear pattern is discernible for the 3-pt shot. Interestingly, the time the ball is in the hands of a player affects 2-PT efficiency as well, unlike 3-PT efficiency. This last trend intuitively makes sense since a lot of restricted-area 2FGA are assisted from the primary ball handlers and yield high-percentage shots. However, for the 3-pt shot, there isn't a strong trend between efficiency vs. zones (corner vs. atb) nor styles (catch-and-shoot vs. pullup).
- Some play styles affect free throw rate for a player: isolation plays and post ups.
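
The outlier players mentioned above (for example, high-volume passers who generate few assists) can also be located numerically. The sketch below fits a straight line with `np.polyfit` and flags rows whose residual is more than three standard deviations from the mean. The data is synthetic with one planted outlier, and the 3-sigma cutoff is an assumed convention rather than something used in the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic passes/assists data with a roughly linear relationship
rng = np.random.default_rng(4)
passes = rng.uniform(0.2, 0.8, 100)
assists = 0.25 * passes + rng.normal(0, 0.01, 100)
demo_df = pd.DataFrame({'PassesMade/POSS': passes, 'ASTAdj/POSS': assists})
# Plant one high-volume, low-assist passer so there is something to find
demo_df.loc[0] = [0.75, 0.02]

# Fit a straight line and measure each row's vertical distance from it
slope, intercept = np.polyfit(demo_df['PassesMade/POSS'], demo_df['ASTAdj/POSS'], deg=1)
residuals = demo_df['ASTAdj/POSS'] - (slope * demo_df['PassesMade/POSS'] + intercept)
z_scores = (residuals - residuals.mean()) / residuals.std()

outliers = demo_df[z_scores.abs() > 3]
print(outliers)
```

On the real data, joining the flagged rows back to `PLAYER` and `POS` would show which players deviate from the overall trend.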

#### AutoViz -- Defensive Stats:

In [None]:
av_def_report = AutoViz_Class().AutoViz('', dfte=player_df[def_cols], chart_format='bokeh', lowess=False) 

Executing AutoViz on the defensive data generated pair-wise attribute plots, KDE plots, scaled violin plots, and heatmap of correlations. Although some of this information was already gathered through previous methods, there are still some insights that can be drawn from this.

The most interesting observations from the pair-wise plots include:
- Similar to the offensive attributes, some correlated pairs show meaningful exceptions to the rule such as `REB/MIN` and `BLK/MIN`. Although it intuitively makes sense why these features would be correlated, since they mark the role of a big man, there are enough outliers that prevent us from regarding these as redundant variables. A modeling algorithm that can parse these in relation to the rest of the attributes could be invaluable in discerning between different types of players. On the other hand, some correlated features can be further confirmed as redundant based on pair-wise plots such as `STL/MIN` and `Deflections/MIN`.
- Interestingly, opponent's 3-PT % doesn't hold a strong linear visual trend based on corner vs above-the-break threes, indicating the variation in defending shots for those two separate zones.
- Most surprisingly, charges-drawn seemed fairly independent of FGA from restricted area vs. nonRA in the paint.

## 3C: Generate Advanced Visuals: Distributions / Interactions

**Objective:** This section digs deeper into the insights from 3B and explores more advanced visualization techniques to convey findings.

----

#### Generate violin-swarm plots as a visual approach to examining descriptive statistics:

In [None]:
# Create list of columns and colors to generate violin plots for (only numeric attributes)
violin_cols = player_df.drop(columns=['PLAYER', 'POS', 'TEAM']).columns.tolist()
random.shuffle(violin_cols)
cols_set1 = violin_cols.copy()
random.shuffle(violin_cols)
cols_set2 = violin_cols.copy()
violin_colors = list(sns.color_palette())

@interact
def generate_report(attr1=cols_set1, attr2=cols_set2):
    """Generates comparative violin-swarm plots for two attributes."""
    
    fig, axs = plt.subplots(1, 2, figsize=(10, 4))
    
    # Pick one random color per panel from the default seaborn palette
    color1 = random.choice(violin_colors)
    color2 = random.choice(violin_colors)

    sns.violinplot(y=attr1, data=player_df, ax=axs[0], color=color1)
    sns.swarmplot(y=attr1, data=player_df, ax=axs[0], color='black')
    
    sns.violinplot(y=attr2, data=player_df, ax=axs[1], color=color2)
    sns.swarmplot(y=attr2, data=player_df, ax=axs[1], color='black')

    plt.tight_layout()
    plt.show()
    
    return 

Swarm-violin plots allow us to take a deeper look at how the data points of an attribute are distributed and help visually spot outliers, central tendencies, and the share of zero values. These plots highlight the diversity of the attributes in the dataset, which was anticipated based on the initial cleaning and scraping measures. The above tool allows toggling between features for easier comparison among all the attributes available in the cleaned dataset.
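
The quantities read off a violin-swarm plot can also be computed directly, which helps when comparing many attributes. This is a minimal sketch under stated assumptions: the helper name `distribution_summary` is hypothetical, Tukey's 1.5 × IQR rule is one common outlier convention rather than the one used in this project, and the input values are made up.

```python
import pandas as pd

def distribution_summary(values):
    """Summarize the features a violin-swarm plot shows visually."""
    s = pd.Series(values, dtype=float).dropna()
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Tukey's rule: points beyond 1.5 * IQR from the quartiles count as outliers
    outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
    return {
        'median': median,
        'iqr': iqr,
        'zero_share': (s == 0).mean(),
        'n_outliers': len(outliers),
    }

# Illustrative values only
summary = distribution_summary([0, 0, 1, 2, 2, 3, 3, 4, 20, 1])
print(summary)
```

Applied to a column of `player_df`, the same function would give a numeric counterpart to each panel of the interactive tool.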

#### Generate pair-wise scatter plots with specific player info conveyed:

#### Explore lineup data:

In [None]:
passing_cols = ['PassesMade/POSS', 'PassesReceived/POSS', 'ScreenAssists/POSS', 'ASTAdj/POSS', 'TOV/POSS']
scoring_cols = ['2P%', '3P%', 'FT%', '%PTSFT', '%RA_FGA', '%PT_nonRA_FGA', '%MR_FGA', '%cns_2FGA', '%pullup_2FGA',
       '%Corner3_FGA', '%ATB3_FGA', '%cns_3PA', '%pullup_3PA', '2FGM%UAST', '3FGM%UAST', '%trsn_FGA', '%iso_FGA', '%pnrbh_FGA', '%pnrrm_FGA',
       '%postup_FGA', '%spotup_FGA', '%handoff_FGA', '%cuts_FGA','%offscrn_FGA', '%putbk_FGA']
def_cols = ['REB/MIN', 'ContestedREB%', 'AVG REBDistance',
       'Box Outs/MIN', 'STL/MIN', 'BLK/MIN', 'Deflections/MIN',
       'Loose BallsRecovered/MIN', 'ChargesDrawn/MIN',
       'Contested2PT Shots/MIN', 'Contested3PT Shots/MIN', 'OppTOV/MIN',
       'OppPF/MIN', 'OppFTA/MIN', 'Opp2P%', '%opp_RA_FGA', 'opp_RA_FG%',
       '%opp_PT_nonRA_FGA', 'opp_PT_nonRA_FG%', '%opp_MR_FGA', 'opp_MR_FG%',
       'Opp3P%', '%opp_Corner3_FGA', 'opp_Corner3_FG%', '%opp_ATB3_FGA',
       'opp_ATB3_FG%', '%opp_iso_FGA', 'opp_iso_FG%', '%opp_pnrbh_FGA',
       'opp_pnrbh_FG%', '%opp_pnrrm_FGA', 'opp_pnrrm_FG%', '%opp_postup_FGA',
       'opp_postup_FG%', '%opp_spotup_FGA', 'opp_spotup_FG%',
       '%opp_handoff_FGA', 'opp_handoff_FG%', '%opp_offscrn_FGA',
       'opp_offscrn_FG%']

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
ldf = pd.read_csv('lineup_agg_stats.csv')
ldf[ldf.columns[7:]] = scaler.fit_transform(ldf[ldf.columns[7:]])

def plot_subset(df, cols):
    """Prepares input dataframe and cols to stratify by net rating for plotting."""
    
    # Create subset dataframes
    plus_df = df[df.NetRtg > 0]
    plus_df = plus_df[cols]
    minus_df = df[df.NetRtg < 0]
    minus_df = minus_df[cols]

    # Gather data for viz
    plus_dat = pd.DataFrame(plus_df.mean())
    minus_dat = pd.DataFrame(minus_df.mean())

    plus_dat['net'] = '+'
    minus_dat['net'] = '-'
    
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    comp_df = pd.concat([plus_dat, minus_dat])
    
    return comp_df

passing_df = plot_subset(ldf, passing_cols)
scoring_df = plot_subset(ldf, scoring_cols)
defense_df = plot_subset(ldf, def_cols)

In [None]:
fig, ax = plt.subplots(3, 1, figsize=(14, 10))

sns.barplot(x=passing_df.index, y=passing_df[0], hue=passing_df.net, ax=ax[0])
sns.barplot(x=scoring_df.index, y=scoring_df[0], hue=scoring_df.net, ax=ax[1])
sns.barplot(x=defense_df.index, y=defense_df[0], hue=defense_df.net, ax=ax[2])

# Rotate the crowded x tick labels for readability
ax[1].tick_params(axis='x', labelrotation=45)
ax[2].tick_params(axis='x', labelrotation=45)
plt.tight_layout()
fig.subplots_adjust(hspace=0.6)
plt.show()

Based on the passing stats, little separates winning and losing lineups. Scoring and defense show more separation. As expected, winning lineups shoot more efficiently. The more insightful observation is that positive lineups take more catch-and-shoot attempts and fewer pull-ups, and have a larger share of unassisted field goals. On the defensive side, lineups whose opponents shoot more from above the break tend, interestingly, to have negative net ratings. The same pattern holds for spot-up shots.
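
The separation read from the bar charts can be quantified as the difference between the positive and negative group means for each attribute. The following is a minimal sketch on made-up values: the helper name `rank_separation` is hypothetical, and the attributes are assumed to be min-max scaled already so that the gaps are comparable.

```python
import pandas as pd

def rank_separation(df, cols, rating_col='NetRtg'):
    """Rank attributes by the gap between positive and negative lineup means."""
    plus_mean = df.loc[df[rating_col] > 0, cols].mean()
    minus_mean = df.loc[df[rating_col] < 0, cols].mean()
    gap = (plus_mean - minus_mean).rename('gap')
    # Sort by absolute gap so the most separating attributes come first
    return gap.reindex(gap.abs().sort_values(ascending=False).index)

# Toy lineups (illustrative values, assumed to be scaled to [0, 1])
toy = pd.DataFrame({
    'NetRtg': [6.0, 2.0, -3.0, -5.0],
    '3P%': [0.9, 0.7, 0.3, 0.1],
    'TOV/POSS': [0.5, 0.5, 0.4, 0.6],
})
gaps = rank_separation(toy, ['3P%', 'TOV/POSS'])
print(gaps)
```

A gap near zero corresponds to bars of nearly equal height, as seen for most passing attributes. A mean difference does not account for spread within each group, so it is a screening aid rather than a significance test.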

## 3D: Generate Advanced Visuals: SHOT CHARTS

**Objective:** This section generates shot-charts commonly used in the industry to assess players.

Note: The *nbashots* and *nba-shotcharts* modules could not be installed because of version conflicts with other packages required by this project environment. Instead of importing their pre-built functions, the visuals below use modified versions that follow the methodology documented for the originals. Because this project relies on aggregated stats (lineups) rather than individual players, the functions were customized where appropriate to explore these visuals in the context that makes most sense for the project.

----

#### Gather API's ID info for each player in cleaned dataset:

In [None]:
# Get cleaned dataset
player_df = pd.read_csv('./cln_comprehensive_stats.csv', sep=',', header=0, index_col=None)

# Get player info from API
plyr_id_df = pd.DataFrame(p.get_players())

# Clean edge-case names to standardize with cleaned df
plyr_id_df['full_name'] = plyr_id_df.full_name.replace("Nah'Shon Hyland", "Bones Hyland")
plyr_id_df['full_name'] = plyr_id_df.full_name.replace("Kenyon Martin", "Kenyon Martin Jr.")

# Filter for players that exist in clean df
plyr_id_df = plyr_id_df[plyr_id_df.full_name.isin(player_df.PLAYER)].reset_index(drop=True)

# Rename columns
plyr_id_df = plyr_id_df.rename(columns={'id': 'player_id', 'full_name': 'name'})[['name', 'player_id']]

In [None]:
# Gather team ID info
team_id_df = pd.DataFrame(t.get_teams())[['id', 'abbreviation']]

# Replace team info with team ID, matching on player name rather than row order
team_lookup = team_id_df.set_index('abbreviation')['id']
player_team = player_df.drop_duplicates('PLAYER').set_index('PLAYER')['TEAM']
plyr_id_df['team_id'] = plyr_id_df['name'].map(player_team).map(team_lookup)

plyr_id_df.to_csv('id.csv', index=False)
plyr_id_df

#### Get player shot details:

In [None]:
id_df = pd.read_csv('id.csv')

In [None]:
# # Instantiate df from first record
# shot_profile_df, league_profile_df = shotchartdetail.ShotChartDetail(
#     team_id=int('1610612761'), player_id=int('1630173'), season_type_all_star='Regular Season', season_nullable='2021-22', context_measure_simple="FGA"
# ).get_data_frames()

# league_profile_df.to_csv('league_avg.csv', index=False)

In [None]:
# for i in range(len(id_df)):
    
#     # Grab IDs
#     tid = id_df.iloc[i]['team_id']
#     pid = id_df.iloc[i]['player_id']
    
#     # Get player's shot profile
#     shot_prof = shotchartdetail.ShotChartDetail(
#             team_id=int(tid), player_id=int(pid), season_type_all_star='Regular Season', season_nullable='2021-22', context_measure_simple="FGA"
#     ).get_data_frames()[0]
    
#     # Add data to master
#     shot_profile_df = pd.concat([shot_profile_df, shot_prof], ignore_index=True, axis=0)
    
#     # Prevent timeouts
#     sleep(2)
    

In [None]:
# shot_profile_df.to_csv('shot_profiles.csv', index=False)
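
The commented-out collection loop above can be restructured so that each player's table is gathered in a list and combined once at the end, which avoids seeding the master table from a first record. This is a minimal sketch, not the code that was originally run: `collect_shot_profiles` and `fetch_shots` are hypothetical names, and a stand-in fetcher replaces the API call so that the example runs offline.

```python
from time import sleep
import pandas as pd

def collect_shot_profiles(id_df, fetch_shots, pause=2.0):
    """Fetch one shot table per player and combine them with a single concat."""
    frames = []
    for row in id_df.itertuples(index=False):
        # fetch_shots is a hypothetical callable: (team_id, player_id) -> DataFrame
        frames.append(fetch_shots(int(row.team_id), int(row.player_id)))
        sleep(pause)  # space out requests to avoid timeouts
    return pd.concat(frames, ignore_index=True)

# Stand-in fetcher so the sketch runs without network access
def fake_fetch(team_id, player_id):
    return pd.DataFrame({'PLAYER_ID': [player_id], 'TEAM_ID': [team_id],
                         'LOC_X': [0], 'LOC_Y': [0]})

ids = pd.DataFrame({'player_id': [1, 2], 'team_id': [10, 20]})
shots = collect_shot_profiles(ids, fake_fetch, pause=0)
print(shots)
```

In the notebook, the fetcher would wrap the same `shotchartdetail.ShotChartDetail(...).get_data_frames()[0]` call shown in the commented cell.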

#### Create data subsets for visuals:

In [None]:
shots_df = pd.read_csv('shot_profiles.csv')
league_df = pd.read_csv('league_avg.csv')
shots_df

In [None]:
lineup_df = pd.read_csv('cln_lineup_stats.csv')

# Lineups with a NetRtg of at least +5 and more than 40 minutes played
pos_rat_df = lineup_df[(lineup_df.NetRtg >= 5) & (lineup_df.MIN > 40)]
# Split each lineup string into player names; explode handles lineups of any size
pos_rat_plyrs = pos_rat_df.Lineups.str.split(', ').explode().unique()
pos_rat_df = shots_df[shots_df['PLAYER_NAME'].isin(pos_rat_plyrs)]

#### Create helper functions to generate plots for the court and shots:

In [None]:
def plot_halfcourt(ax, ver):
    """Creates half-court visual on given input axes object. Methodology followed from: nbashots module"""
    
    if ver == 1:
        clr = 'White'
        ax.set_facecolor('Black')
    else:
        clr = 'Black'
    
    # Plot basket-area elements
    hoop = Circle((0, 60), radius=15, linewidth=2, color=clr, fill=False)
    backboard = Rectangle((-30, 40), 60, 0, linewidth=2, color=clr)
    ra_arc = Arc((0, 60), 80, 80, theta1=0, theta2=180, linewidth=1, color=clr)

    # Plot paint-area elements
    paint_o = Rectangle((-80, -10), 160, 190, linewidth=2, color=clr, fill=False)
    paint_i = Rectangle((-60, -10), 120, 190, linewidth=2, color=clr, fill=False)
    ft_arc = Arc((0, 180), 120, 120, theta1=0, theta2=180, linewidth=2, color=clr, fill=False)
    ft_arc2 = Arc((0, 180), 120, 120, theta1=180, theta2=0, linewidth=2, color=clr, linestyle='dashed')
    
    # Plot 3-pt area elements
    lc = Rectangle((-220, -10), 0, 160, linewidth=2, color=clr)
    rc = Rectangle((220, -10), 0, 160, linewidth=2, color=clr)
    arc = Arc((0, 140), 440, 315, theta1=0, theta2=180, linewidth=2, color=clr)
    hc_arc = Arc((0, 482.5), 120, 120, theta1=180, theta2=0, linewidth=2, color=clr)
    
    # Add each element as a patch to axis
    objects = [hoop, backboard, ra_arc, paint_o, paint_i, ft_arc, ft_arc2, lc, rc, arc, hc_arc]
    for i in objects:
        # ax.add_patch(i)
        ax.add_artist(i)
    
    # Remove tick labels and set viewpoint
    ax.set_xlim(-250, 250)
    ax.set_ylim(0, 470)
    ax.set_xticks([])
    ax.set_yticks([])
    
    return ax

# fig, ax = plt.subplots(1, 1)
# ax = plot_halfcourt(ax, 2)
# plt.tight_layout()
# plt.show()

#### Visualize shot-chart via hex-bins for players in positive lineups:

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 8))

ax.hexbin(pos_rat_df['LOC_X'], pos_rat_df['LOC_Y']+60, gridsize=(100, 100), bins='log', cmap='seismic')  
ax = plot_halfcourt(ax, 1)

plt.tight_layout()
plt.show()
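
The `+60` added to `LOC_Y` in the cell above lines the shots up with the hoop, which `plot_halfcourt` draws at `(0, 60)`. The following is a minimal sketch of that shift, assuming `LOC_X` and `LOC_Y` are measured relative to the hoop. The helper name `to_court_coords` and the added column names are hypothetical, and the toy shots are made up.

```python
import numpy as np
import pandas as pd

HOOP_Y = 60  # hoop center used by plot_halfcourt above

def to_court_coords(shots, hoop_y=HOOP_Y):
    """Shift hoop-relative shot locations onto the plotted court and add shot distance."""
    out = shots.copy()
    out['COURT_X'] = out['LOC_X']
    out['COURT_Y'] = out['LOC_Y'] + hoop_y
    # Distance from the hoop in the same units as LOC_X / LOC_Y
    out['DIST'] = np.hypot(out['LOC_X'], out['LOC_Y'])
    return out

# Toy shots (illustrative values only)
toy_shots = pd.DataFrame({'LOC_X': [0, 30, -220], 'LOC_Y': [0, 40, 0]})
court = to_court_coords(toy_shots)
print(court)
```

Keeping the offset in one named constant means the court drawing and every shot layer stay aligned if the hoop position is ever changed.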

In [None]:
# Pass x and y by keyword; jointplot creates its own figure, so no ax argument is needed
plot = sns.jointplot(x=pos_rat_df.LOC_X, y=pos_rat_df.LOC_Y+60, kind='kde', space=0, color=cm.gist_heat_r(0.1), fill=True, cmap=cm.gist_heat_r, levels=50)
plot.fig.set_size_inches(10, 8)
ax = plot.ax_joint
plot_halfcourt(ax, 2)

plt.tight_layout()
plt.show()

In [None]:
pos_rat_df