EDA, Dimensionality Reduction, and K-Means Clustering of International Players in the NBA

rahprabhu/NBA-International

The Rise of International Players in the NBA

Quick Links

1: Web Scraping for EDA
2: Data Cleaning and EDA
3: Web Scraping for Clustering
4: Feature Extraction and K-Means Clustering

TL;DR
Tableau Dashboard - The Rise of International Players in the NBA
Tableau Dashboard - K-Means Clustering: International NBA Players

About

The NBA has grown into a global game over the last few decades, as evidenced by the chart below. Basketball has also evolved into a more positionless game in which players are categorized by their skills rather than their nominal positions, and some of today's top international players are known for defying their positions with broad skill sets (e.g. big men who can put the ball on the floor and shoot, like Jokic and Embiid). With these two trends in mind, I would like to do an EDA profiling the rise and history of international players in the NBA, and then better classify these players using clustering techniques.

newplot (14)

Data Sources

Data Acquisition

  • To acquire the list of international players in the NBA, I web scrape a table from the Wikipedia link above that contains all of the potentially relevant players. For player statistics, I use Basketball Reference for advanced, per-100-possession, and shooting statistics. For the initial EDA, I scrape the statistics on a season-by-season basis, as this lets me track the number of players active in each season, and thus the growth in the number of players over time. For the clustering analysis, I only need to scrape the career average statistics for each player.
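As a rough illustration of why season-by-season rows are useful here, the sketch below counts distinct active players per season with pandas. The dataframe, its column names (`player`, `season`, `games`), and the values are hypothetical stand-ins, not the project's scraped data.

```python
import pandas as pd

# Hypothetical season-by-season rows; names and values are made up.
season_stats = pd.DataFrame(
    {
        "player": ["A", "A", "B", "B", "B", "C"],
        "season": [1999, 2000, 2000, 2001, 2002, 2002],
        "games": [10, 62, 45, 70, 71, 12],
    }
)


def active_players_per_season(df: pd.DataFrame) -> pd.Series:
    """Count distinct players that have at least one row in each season."""
    return df.groupby("season")["player"].nunique().sort_index()


counts = active_players_per_season(season_stats)
print(counts)
```

Career averages alone would not support this count, since they carry no information about which seasons a player was active in.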

EDA

Before beginning the analysis, it's important to define who an international player is. Below is the definition that I am using:

  • A player born outside of the United States to at least one non-American parent (excludes players born to American parents living abroad) OR
  • A player born in the United States to at least one non-American parent OR
  • A player who grew up outside of the United States for most of their childhood

The rationale behind this definition is to filter out American players who became naturalized citizens of another country in order to represent it in international competition, without having familial ties to that country or having grown up there.
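The definition above can be written as a small predicate. This is only a sketch: the `PlayerBackground` fields and the example players are hypothetical, and in practice these attributes would have to be derived from the scraped Wikipedia table.

```python
from dataclasses import dataclass


@dataclass
class PlayerBackground:
    # All fields are hypothetical; they are not columns from the source data.
    name: str
    born_in_us: bool
    has_non_american_parent: bool
    raised_mostly_abroad: bool


def is_international(p: PlayerBackground) -> bool:
    """Return True if the player meets any of the three criteria."""
    born_abroad_with_ties = (not p.born_in_us) and p.has_non_american_parent
    born_in_us_with_ties = p.born_in_us and p.has_non_american_parent
    return born_abroad_with_ties or born_in_us_with_ties or p.raised_mostly_abroad


examples = [
    PlayerBackground("Born abroad, local parent", False, True, True),
    PlayerBackground("Born in US, foreign parent", True, True, False),
    PlayerBackground("Naturalized American, no ties", True, False, False),
    PlayerBackground("American parents, born abroad", False, False, False),
]
flags = [is_international(p) for p in examples]
print(flags)
```

The last two examples are the cases the definition is meant to exclude: a naturalized American with no familial ties, and a player born abroad to American parents who did not grow up there.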

With this in mind, I filter the dataframe created from the Wikipedia list of players and perform some data cleaning, which primarily revolves around cleaning string values to remove unwanted characters and standardizing country naming conventions. After cleaning the players dataframe, I can join in the Basketball Reference dataframe containing the seasonal player statistics and begin performing the EDA.
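A minimal sketch of this cleaning and joining step is shown below. The column names, the footnote patterns being stripped, and the `COUNTRY_ALIASES` mapping are all assumptions made for illustration; the actual notebooks may clean different characters and use different keys.

```python
import pandas as pd

# Hypothetical mapping used to standardize country names.
COUNTRY_ALIASES = {"Republic of Serbia": "Serbia", "USA": "United States"}


def clean_players(df: pd.DataFrame) -> pd.DataFrame:
    """Strip footnote markers and whitespace, then standardize country names."""
    out = df.copy()
    # Remove bracketed footnotes such as "[a]" and trailing asterisks.
    out["player"] = (
        out["player"].str.replace(r"\[.*?\]|\*", "", regex=True).str.strip()
    )
    out["country"] = out["country"].str.strip().replace(COUNTRY_ALIASES)
    return out


# Made-up rows standing in for the Wikipedia table and the seasonal stats.
players = pd.DataFrame(
    {
        "player": ["Player One[a]", " Player Two*"],
        "country": ["Republic of Serbia ", "Canada"],
    }
)
stats = pd.DataFrame(
    {
        "player": ["Player One", "Player One", "Player Two"],
        "season": [2019, 2020, 2020],
        "ws": [3.1, 4.0, 1.2],
    }
)

cleaned = clean_players(players)
# Join each seasonal stat line to the player's cleaned country.
merged = cleaned.merge(stats, on="player", how="inner")
print(merged)
```

Joining on a name string only works if both sources spell names identically, which is one reason the string cleaning has to happen before the merge.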

Question: Which are the top 10 countries represented in the NBA all time?

newplot (13)

Canada has produced the most international players to play in the NBA, while 5 of the top 10 countries are in Europe. Despite producing the most NBA players outside of the United States, Canada has not fared as well in international tournaments: when this analysis was written, its most recent Olympic appearance had come at the 2000 Sydney Olympics.

Question: What is the breakdown by position?

newplot (11)

There is clearly a positional imbalance here: there are more than twice as many centers as players at the next most common position, power forward, which is also the position most similar to center. This is important to know before clustering, as we might end up with smaller clusters for the players that fall outside of the 'big man' clusters.

Question: Which teams have fielded the most international players?

newplot (15)

Question: Who are the top international players by total win shares? What positions do they play?

newplot (16)

newplot (17)

Over half of the top 20 players by win shares are centers, while two-thirds are frontcourt players (PF and C). These positions are not only the most represented in the international player base, but are also some of the most impactful on winning.

Clustering Process

Criteria for player inclusion in clustering analysis:

  • Players playing at least 50% of their career after the year 2000 AND
  • Players who have played at least 50 regular season games
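One possible reading of these two criteria is sketched below. It assumes "50% of their career after the year 2000" means at least half of a player's seasons fall after 2000, and that a career can be summarized by a first and last season year; both are interpretations, not something the original analysis confirms.

```python
def share_after_2000(first_season: int, last_season: int) -> float:
    """Fraction of seasons (inclusive range) that fall after the year 2000."""
    seasons = range(first_season, last_season + 1)
    return sum(1 for s in seasons if s > 2000) / len(seasons)


def eligible(first_season: int, last_season: int, games: int) -> bool:
    """Apply both inclusion criteria: career timing AND games played."""
    return share_after_2000(first_season, last_season) >= 0.5 and games >= 50


# Hypothetical careers: (first season, last season, regular season games).
careers = {
    "mostly 1990s": (1995, 2004, 600),
    "mostly 2000s": (1998, 2005, 400),
    "too few games": (2010, 2011, 30),
}
result = {name: eligible(*career) for name, career in careers.items()}
print(result)
```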

Shooting statistics are only available from the 1996-97 season onward, and Basketball Reference notes that they are less reliable for the 1990s. Additionally, the vast majority of international players have come from the last 2 decades, as evidenced by the chart below, so we will essentially be focusing on the recent generation of international players for whom we have reliable shooting statistics. Applying these criteria to the dataset leaves 280 players.

Before fitting a K-Means clustering algorithm on the dataset, we will need to standardize the features and address the high dimensionality of the dataset that we are currently using. When combining shooting, per-100-possession, and advanced statistics, we have nearly 60 dimensions. High dimensionality may result in a sparser dataset that yields overfitted and less meaningful clusters. To combat this, we will explore two dimensionality reduction techniques: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
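The sketch below shows what standardizing and then fitting both reducers might look like with scikit-learn. The data is synthetic (280 rows by 60 features, matching only the sizes mentioned above). LDA is supervised and needs class labels, so the sketch assumes position is used as the label; that is an assumption, as the text does not say which labels were used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_players, n_features = 280, 60

# Assumption: five position labels (0-4) act as the LDA classes.
positions = rng.integers(0, 5, size=n_players)
# Synthetic features whose means shift with position, plus noise.
shift = rng.normal(size=(1, n_features))
X = rng.normal(size=(n_players, n_features)) + positions[:, None] * shift

# Standardize so every statistic has zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# PCA is unsupervised; LDA yields at most (n_classes - 1) components.
pca = PCA(n_components=4).fit(X_scaled)
lda = LinearDiscriminantAnalysis(n_components=4).fit(X_scaled, positions)

X_pca = pca.transform(X_scaled)
X_lda = lda.transform(X_scaled)

print("PCA cumulative explained variance:", pca.explained_variance_ratio_.cumsum())
print("LDA cumulative explained variance:", lda.explained_variance_ratio_.cumsum())
```

One caveat when comparing the two printed curves: PCA's ratio refers to total variance in the features, while LDA's `explained_variance_ratio_` refers to between-class variance, so the numbers are not measured on the same scale.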

PCA

image

LDA

image

Because LDA does a better job of explaining the variance in the dataset, we will use it to reduce the dimensionality of the dataset and then fit a K-Means clustering model on the LDA-transformed dataframe.

Using the elbow method, I choose to use k=8, or 8 clusters in the model, although this is a bit of a judgment call as there is not a clear-cut 'elbow' on the graph.
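For reference, an elbow curve can be produced by fitting K-Means over a range of k and recording the inertia (within-cluster sum of squares) for each. The sketch below uses synthetic blobs as a stand-in for the LDA-transformed data; the range of k, `n_init`, and `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the LDA-transformed data: 8 blobs in 4 dimensions.
centers = rng.normal(scale=6.0, size=(8, 4))
X = np.vstack([c + rng.normal(size=(35, 4)) for c in centers])

# Fit K-Means for each candidate k and record the inertia.
inertias = {}
for k in range(2, 13):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))

# Final model with the k chosen by inspecting the curve (8 in the text above).
final_model = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
labels = final_model.labels_
```

Inertia generally keeps falling as k grows, so the "elbow" is the point where additional clusters stop buying much reduction. On real data the bend is often gradual, which is why the choice can be a judgment call.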

image

Clustering Results

image

Clusters with lower LDA 1 values appear to have less within-cluster scatter, whereas clusters from the center to the right have more. Despite this discrepancy, all of the clusters appear to be mostly well-separated from each other. The 3 clusters on the left all correspond to players who play center or power forward, which explains why they sit so close together, with a little less separation than the other clusters.

Cluster PER

With these players now clustered, we can try to better understand how they stack up against one another and the league overall. One way to evaluate their performance is to look at each cluster's average PER and compare it to the league average. PER, or Player Efficiency Rating, is a per-minute rating that sums up all of a player's positive and negative on-court contributions. Positive stats include field goals, 3 pointers, free throws, assists, rebounds, blocks, and steals. Negative stats include missed shots, turnovers, and personal fouls. The league average for this stat is always 15.00, so let's see how these clusters stack up against it.
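The comparison itself is a simple group-by. In the sketch below, the PER values and the second cluster name are made up for illustration; only the league average of 15.00 comes from the text.

```python
import pandas as pd

LEAGUE_AVERAGE_PER = 15.0

# Made-up career PER values; "Cluster B" is a hypothetical cluster name.
players = pd.DataFrame(
    {
        "cluster": ["Defensive Big Men", "Defensive Big Men", "Cluster B", "Cluster B"],
        "per": [16.0, 15.0, 12.0, 13.0],
    }
)

# Average PER and player count per cluster, then the gap to the league average.
summary = players.groupby("cluster")["per"].agg(["mean", "count"])
summary["vs_league"] = summary["mean"] - LEAGUE_AVERAGE_PER
print(summary)
```

Note that this is an unweighted mean of player PERs, so a bench player with few minutes counts as much as a starter. Weighting by minutes played would be a reasonable alternative.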

image

Somewhat surprisingly, only the Defensive Big Men outperform the league average PER. While there are certainly some high PER players in this dataset, it looks like there is a strong prevalence of rotation/bench players that reduces the averages to be below the league average.

Cluster Field Goal Attempts

Where does each cluster attempt their field goals?

image

Each cluster takes a fairly high percentage of its shots at the rim (0-3 ft), with the big men clusters having the highest percentage of shots taken here. Additionally, fewer field goals are being attempted in the midrange between the paint and the 3 point line, as the game has transitioned from high volume in the midrange to higher volume behind the 3 point line to maximize points scored. This has been a league-wide trend over the last few decades, and the trend is clearly still present in this international subset of players.
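A shot distribution like the one discussed above can be computed by summing attempts per distance bucket within each cluster and normalizing each row. The bucket names below follow the usual distance splits, but the cluster names and attempt counts are invented for illustration.

```python
import pandas as pd

ZONES = ["0-3 ft", "3-10 ft", "10-16 ft", "16 ft-3P", "3P"]

# Made-up field goal attempts by distance; clusters are hypothetical.
attempts = pd.DataFrame(
    [
        ["Big A", "Bigs", 300, 100, 40, 30, 30],
        ["Big B", "Bigs", 200, 100, 60, 70, 70],
        ["Guard A", "Guards", 100, 50, 50, 50, 250],
    ],
    columns=["player", "cluster", *ZONES],
)

# Sum attempts per cluster, then divide each row by its total attempts.
by_cluster = attempts.groupby("cluster")[ZONES].sum()
shares = by_cluster.div(by_cluster.sum(axis=1), axis=0)
print(shares.round(3))
```

Summing attempts before normalizing weights high-volume shooters more heavily. Averaging each player's own percentages instead would treat every player in a cluster equally.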

Conclusion and Potential Next Steps

Given the prevalence of international centers and power forwards in the NBA, it was no surprise that our clustering analysis yielded more than 1 cluster made up primarily of big men who were statistically quite similar. Additionally, while there have been some incredible international players in the history of the league, most international players of the last few decades have been bench/role players with less staying power, though there are a handful of exceptions.

Some extensions of this project might be to try to create a roster or a draft style big board using these clusters and the player statistics, with the goal of ranking the players from highest to lowest and assembling an ideal roster that provides the necessary balance to build a team. This kind of clustering analysis can also be extended to the league as a whole, or perhaps to college players, where a clustering analysis might be helpful for scouts looking to evaluate a future pool of NBA prospects.
