First of all, I would like to thank everyone who spent their time taking a look at my kernel. I would greatly appreciate any suggestions or criticisms. I would also like to credit this kernel, which helped me a lot while making mine:
    
    https://www.kaggle.com/slavapasedko/nba-star-players-visualization-in-memory-of-coby
    
I made this kernel to find which characteristics of NBA players may signal that they are going to be an All-Star. For this kernel, I used a dataset listing the players who became All-Stars between 2000 and 2016.

First, let's import some libraries that might be useful and read the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px

In [None]:
df_as = pd.read_csv('../input/nba-all-star-game-20002016/NBA All Stars 2000-2016 - Sheet1.csv')

In [None]:
df_as.head()

Now let's do some data cleaning. The 'Selection Type' feature of the dataframe contains both the conference and the selection type of the player. We can split this information into separate features, which might be useful for visualisations later. We do this by creating a column for each conference (Western and Eastern) and each selection type (Fan Vote, Coaches, and Replacement). Then we iterate through the 'Selection Type' column to see which conference and type each string contains.

In [None]:
df_as['Western'] = 0
df_as['Eastern'] = 0
df_as['Fan Vote'] = 0
df_as['Coaches'] = 0
df_as['Replacement'] = 0

In [None]:
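# Assumes df_as still has the default 0..n-1 index, so idx matches the row label.
for idx, sel in enumerate(df_as['Selection Type']):
    # Conference: anything that is not 'Western' is treated as 'Eastern'.
    if 'Western' in sel:
        df_as.loc[idx, 'Western'] = 1
    else:
        df_as.loc[idx, 'Eastern'] = 1
    # Selection type: anything that is neither 'Fan Vote' nor 'Coaches' is a replacement.
    if 'Fan Vote' in sel:
        df_as.loc[idx, 'Fan Vote'] = 1
    elif 'Coaches' in sel:
        df_as.loc[idx, 'Coaches'] = 1
    else:
        df_as.loc[idx, 'Replacement'] = 1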
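
The same flags can also be built without a loop. The sketch below uses a small toy dataframe; the exact 'Selection Type' strings in it are assumed for illustration and may not match the dataset word for word.

```python
import pandas as pd

# Toy rows; the exact 'Selection Type' strings are assumed for illustration.
demo = pd.DataFrame({
    'Selection Type': [
        'Western All-Star Fan Vote Selection',
        'Eastern All-Star Coaches Selection',
        'Eastern All-Star Replacement Selection',
    ]
})

sel = demo['Selection Type']

# str.contains returns booleans; astype(int) turns them into 0/1 flags.
demo['Western'] = sel.str.contains('Western', regex=False).astype(int)
# Same rule as the loop: everything that is not Western counts as Eastern.
demo['Eastern'] = 1 - demo['Western']
demo['Fan Vote'] = sel.str.contains('Fan Vote', regex=False).astype(int)
# Mirror the elif: only count 'Coaches' when 'Fan Vote' is absent.
demo['Coaches'] = (
    sel.str.contains('Coaches', regex=False) & (demo['Fan Vote'] == 0)
).astype(int)
# Whatever is left over is a replacement.
demo['Replacement'] = 1 - demo['Fan Vote'] - demo['Coaches']

print(demo[['Western', 'Eastern', 'Fan Vote', 'Coaches', 'Replacement']])
```

This version does not depend on the dataframe index, so it keeps working even if rows were filtered or reordered beforehand.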

We can also separate the information contained in the 'NBA Draft Status' feature. Each string in 'NBA Draft Status' can be split into 'Draft Year' and 'Overall Draft Order'. I assume each round in the draft contains 30 picks, so the overall draft order of a player drafted in the second round is simply the pick number + 30. This is an approximation, since drafts held when the league had fewer than 30 teams had fewer picks per round. Undrafted players simply get None as their 'Overall Draft Order'.

In [None]:
def get_draft_year(s):
    return int(s[0:4])

df_as['Draft Year'] = df_as['NBA Draft Status'].apply(get_draft_year)

In [None]:
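def get_draft_order(s):
    l = s.split()
    if 'Undrafted' in s:
        return None
    elif l[2] == '1':
        # First round: convert to int so the column is not a mix of str and int
        return int(l[4])
    else:
        # Second round: offset by the 30 first-round picks
        return int(l[4]) + 30

df_as['Overall Draft Order'] = df_as['NBA Draft Status'].apply(get_draft_order)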
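
To make the arithmetic concrete, here is a minimal standalone sketch of the same parsing. The string layout ('1996 Rnd 1 Pick 13') is hypothetical, inferred only from the positions the functions above index into, and `parse_draft_status`, `DRAFT_RE`, and `PICKS_PER_ROUND` are names made up for this example.

```python
import re

# Hypothetical layout, inferred from the indexing used above: year first,
# round as the third token, pick as the fifth, e.g. '1996 Rnd 1 Pick 13'.
DRAFT_RE = re.compile(r'^(\d{4})\s+\S+\s+(\d+)\s+\S+\s+(\d+)')
PICKS_PER_ROUND = 30  # assumption carried over from the text


def parse_draft_status(s):
    """Return (draft_year, overall_pick); overall_pick is None if unknown."""
    year = int(s[:4])
    if 'Undrafted' in s:
        return year, None
    m = DRAFT_RE.match(s)
    if m is None:
        return year, None  # unrecognised layout: treat the pick as unknown
    rnd, pick = int(m.group(2)), int(m.group(3))
    # Round 1 keeps the pick number; round 2 is shifted by one full round.
    return year, (rnd - 1) * PICKS_PER_ROUND + pick


print(parse_draft_status('1996 Rnd 1 Pick 13'))   # (1996, 13)
print(parse_draft_status('1999 Rnd 2 Pick 28'))   # (1999, 58)
```

Returning integers for every drafted player keeps the column numeric, so it sorts and plots in draft order.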

I also noticed that some players have two nationalities. We can separate the nationalities into two features, 'First Nationality' and 'Second Nationality'.

In [None]:
for idx, nat in enumerate(df_as['Nationality']):
    if '\n' in nat:
        l = nat.split('\n')
        df_as.loc[idx, 'First Nationality'] = l[0]
        df_as.loc[idx, 'Second Nationality'] = l[1]
    else :
        df_as.loc[idx, 'First Nationality'] = nat
        df_as.loc[idx, 'Second Nationality'] = None
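
A loop-free variant of the same split is sketched below on toy data. The country names are placeholders, and the only assumption about the real column is the one already made above: two nationalities are separated by a newline.

```python
import pandas as pd

# Toy data; the second value mimics a dual-nationality entry.
demo = pd.DataFrame({'Nationality': ['Country A', 'Country B\nCountry C']})

# partition splits at the first '\n' into three columns: before, separator, after.
parts = demo['Nationality'].str.partition('\n')
demo['First Nationality'] = parts[0]
# An empty "after" piece means there was no second nationality; mask it to NaN.
demo['Second Nationality'] = parts[2].mask(parts[2] == '')

print(demo[['First Nationality', 'Second Nationality']])
```

`str.partition` always returns three columns, so this works even when no row in the column has a second nationality.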

Now let's see what the dataframe looks like after cleaning.

In [None]:
df_as.head()

Now let's do some EDA. First, I want to see the nationalities of the All-Star Players.

In [None]:
nat_counts = df_as[['First Nationality', 'Second Nationality']].apply(pd.Series.value_counts)
labels = nat_counts.index
nat_counts = nat_counts.fillna(0)
values = nat_counts['First Nationality'] + nat_counts['Second Nationality']

fig = px.pie(df_as, values=values, names=labels, title='All-Stars Nationality Distribution')
fig.update_layout(
    margin=dict(
        l=50,
        r=50,
        b=100,
        t=200,
        pad=4
    )
)
fig.show()

There's no surprise here, as the large majority of All-Stars (81.2%) are American. The three most represented countries other than the United States are Germany (3.43%), Spain (2.14%), and Canada (1.93%).

In [None]:
labels = ['Western', 'Eastern']
values = list([sum(df_as['Western']), sum(df_as['Eastern'])])
fig = px.pie(df_as, values=values, names=labels, title='All-Stars West and East Distribution')
fig.show()

The West/East distribution of All-Stars is almost equal. We would expect the number of Western and Eastern All-Stars to be the same, as each conference sends the same number of players. I think there is one more Eastern player because one more Eastern All-Star was injured, which added one more replacement player.

In [None]:
labels = ['Fan Vote', 'Coaches', 'Replacement']
values = list([sum(df_as['Fan Vote']), sum(df_as['Coaches']), sum(df_as['Replacement'])])
fig = px.pie(df_as, values=values, names=labels, title='All-Stars Selection Distribution')
fig.show()

54.2% of All-Stars were selected by coaches, while 38.7% were selected by fans. The remaining 7.06% of players in the dataset were selected as replacements because of injuries.

In [None]:
labels = df_as['Team'].value_counts().index
values = df_as['Team'].value_counts().values
fig = px.pie(df_as, values=values, names=labels, title='All-Stars Team Distribution')
fig.show()

So which team had the most representation in the All-Star Game from 2000 to 2016? The answer is the Miami Heat with 28 selections, followed by the Boston Celtics with 26 and the LA Lakers with 25. I also noticed that some teams that are no longer in the NBA are represented as well, such as the Seattle SuperSonics and the Charlotte Bobcats.

In [None]:
labels = df_as['Player'].value_counts().index
values = df_as['Player'].value_counts().values
fig = px.bar(df_as, y=values, x=labels, title='All-Stars Players Distribution')
fig.update_layout(
    xaxis_title = 'Name of Players',
    yaxis_title = 'Number of All-Star Appearances'
)
fig.show()

Now let's see which players have the most All-Star appearances. Kobe Bryant (RIP Kobe) has the most with 16 appearances, followed by Dirk Nowitzki, Tim Duncan, and LeBron James with 13.

Now let's take a look at the distribution of positions. There are some inconsistencies in the dataset with the writing of the positions, so I fix them first before plotting.

In [None]:
df_as.loc[df_as['Pos'] == 'F-C', 'Pos'] = 'FC'
df_as.loc[df_as['Pos'] == 'G-F', 'Pos'] = 'GF'

In [None]:
labels = df_as['Pos'].value_counts().index
values = df_as['Pos'].value_counts().values
fig = px.pie(df_as, values=values, names=labels, title='All-Stars Position Distribution')
fig.show()

We can see that the positions have similar representation in the All-Star games. Some players can play multiple positions, such as GF (guard and forward) and FC (forward and center). G (point guard and shooting guard) has the most representation with 16.4%, while GF has the least with only 5.24%.

Now let's see the height of players per position. First, I will convert the heights into total inches so they can be sorted easily.

In [None]:
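import re

# Heights are stored as 'feet-inches' strings, e.g. '6-6'.
r = re.compile(r"([0-9]+)-([0-9]+)")

def get_inches(el):
    m = r.match(el)
    if m is None:
        # Anything that does not look like 'feet-inches' becomes NaN
        return float('NaN')
    return int(m.group(1)) * 12 + float(m.group(2))

df_as['HT'] = df_as['HT'].apply(get_inches)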
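
As a quick sanity check of the conversion, here is a tiny standalone version. `feet_inches_to_inches` is a name made up for this example, and it assumes the input is always a well-formed 'feet-inches' string.

```python
def feet_inches_to_inches(height):
    """Convert a 'feet-inches' string such as '6-6' to total inches."""
    feet, inches = height.split('-')
    return int(feet) * 12 + float(inches)


print(feet_inches_to_inches('6-6'))  # 78.0
print(feet_inches_to_inches('7-1'))  # 85.0
```

So a 7-1 player is 7 * 12 + 1 = 85 inches tall, and a 5-9 player is 5 * 12 + 9 = 69 inches.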

In [None]:
fig = px.box(df_as, x = 'Pos', y = 'HT', title = 'All-Stars Height per Position')
fig.show()

As expected, centers have the largest median height with 85 inches (7-1), and a maximum height of 90 inches (7-6) (Yao Ming). Also as expected, point guards have the smallest median height with 73.5 inches (6-2) and a minimum height of 69 inches (5-9) (Isaiah Thomas). 

In [None]:
fig = px.box(df_as, x = 'Pos', y = 'WT', title = 'All-Stars Weight per Position')
fig.show()

As expected, centers are the heaviest position with a median weight of 265 pounds and a maximum weight of 325 pounds (Shaq). Point guards and G (point guard and shooting guard hybrid) have the lightest median weight at 190 pounds, with the lightest player weighing 165 pounds (Allen Iverson).

Now I want to see which 'Draft Year' contributed the most All-Stars. To do that, we must first keep only one row per 'Player', so that each All-Star is counted once for their draft class regardless of how many times they were selected.

In [None]:
players = df_as[['Player', 'Draft Year']]
players = players.drop_duplicates()
labels = players['Draft Year'].value_counts().index
values = players['Draft Year'].value_counts().values
fig = px.bar(players, x = labels, y = values, title='All-Stars Draft Year Distribution')
fig.update_layout(
    xaxis_title = 'Year',
    yaxis_title = 'Number of All-Stars'
)
fig.show()
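
To see why the duplicates have to go before counting, here is a small sketch on toy data. The player names are placeholders, not rows from the dataset.

```python
import pandas as pd

# Toy data: 'Player A' appears in two All-Star games but belongs to one draft class.
demo = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player B', 'Player C'],
    'Draft Year': [1996, 1996, 1996, 2003],
})

# Counting rows directly counts selections, not players.
per_selection = demo['Draft Year'].value_counts()

# Dropping duplicate (player, draft year) pairs first counts each player once.
unique_players = demo.drop_duplicates(subset=['Player', 'Draft Year'])
per_player = unique_players['Draft Year'].value_counts()

print(per_selection.loc[1996], per_player.loc[1996])  # 3 2
```

Without the deduplication, a draft class with one player who made many All-Star games would look as productive as a class with many different All-Stars.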

The 1996 draft class produced the most All-Stars with 11, followed by the 1999 and 2003 draft classes with 9.

In [None]:
players = df_as[['Player', 'Draft Year', 'Overall Draft Order']]
players = players.drop_duplicates()
labels = players['Overall Draft Order'].value_counts().index
values = players['Overall Draft Order'].value_counts().values
fig = px.bar(players, x = labels, y = values, title='All-Stars Overall Draft Order Distribution')
fig.update_layout(
    xaxis_title = 'Draft Order',
    yaxis_title = 'Number of All-Stars Selected'
)
fig.show()

From the bar plot above, we can see that players picked early in the draft account for more of the All-Stars. This makes sense, as more skilled players are usually selected first. As the draft position gets later, the number of All-Stars drops. One interesting detail is that no All-Star in the 2000-2016 period was selected 8th or 12th in the draft. The number of All-Stars drafted 9th also stands out.

Now I want to explore more about their season stats, so I used another dataset containing the season stats of NBA players.

In [None]:
df_stats = pd.read_csv('../input/nba-players-stats/Seasons_Stats.csv')
df_stats.head()

In [None]:
df_stats.columns

There are a lot of columns, but I am only going to use some of them. The 'PTS', 'TRB', 'AST', 'STL', and 'BLK' columns contain season totals, but I think per-game averages are more useful, so I divide them by the number of games the player played.

In [None]:
df_stats['TRB'] = df_stats['TRB'] / df_stats['G']
df_stats['AST'] = df_stats['AST'] / df_stats['G']
df_stats['STL'] = df_stats['STL'] / df_stats['G']
df_stats['BLK'] = df_stats['BLK'] / df_stats['G']
df_stats['PTS'] = df_stats['PTS'] / df_stats['G']

chosen_features = ['Player', 'Year', 'Age', 'PER', 'TS%', 'VORP', 'FG%', '3P%', 'TRB', 'AST', 'STL', 'BLK', 'PTS']
df_stats = df_stats[chosen_features]
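
The five divisions above can also be written as a single step. This is only a sketch on toy numbers, with 'G' standing for games played as in the stats dataset.

```python
import pandas as pd

# Toy season totals; the numbers are made up for illustration.
demo = pd.DataFrame({
    'G': [82, 41],
    'PTS': [1640, 820],
    'AST': [410, 82],
})

totals = ['PTS', 'AST']
# div(..., axis=0) divides each listed column row by row by games played.
per_game = demo[totals].div(demo['G'], axis=0)

print(per_game)
```

Both toy players average 20 points per game here, even though their season totals differ, which is the reason for preferring per-game numbers.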

In [None]:
df_stats.head()

Then I am going to merge it with df_as on the 'Player' and 'Year' columns.

In [None]:
df_combined = pd.merge(df_as, df_stats, on = ['Player', 'Year'])
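
`pd.merge` defaults to an inner join, so All-Star rows without a matching stats row are silently dropped. One way to check how many rows are lost is a left join with `indicator=True`. The sketch below uses toy frames with placeholder names rather than the real datasets.

```python
import pandas as pd

# Toy frames; names and numbers are placeholders, not rows from either dataset.
all_stars = pd.DataFrame({'Player': ['Player A', 'Player B'], 'Year': [2001, 2001]})
stats = pd.DataFrame({'Player': ['Player A'], 'Year': [2001], 'PTS': [20.0]})

# how='left' keeps every All-Star row; indicator=True adds a '_merge' column
# that says whether a matching stats row was found.
checked = pd.merge(all_stars, stats, on=['Player', 'Year'], how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

print(unmatched[['Player', 'Year']])
```

Comparing the number of rows before and after the merge is also worthwhile: if the stats file has more than one row for a player in the same season, the merge will duplicate that All-Star row.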

In [None]:
df_combined.head()

In [None]:
labels = df_combined['Age'].value_counts().index
values = df_combined['Age'].value_counts().values
fig = px.bar(df_combined, x = labels, y = values, title='All-Stars Age Distribution')
fig.update_layout(
    xaxis_title = 'Age',
    yaxis_title = 'Number of All-Stars Selected'
)
fig.show()

It seems like most All-Stars are in their mid-20s, with the count peaking at age 25.

In [None]:
fig = px.scatter(df_combined, x = 'Player', y = 'PER', color = 'Pos', title='All-Stars PER Distribution')
fig.show()

PER (Player Efficiency Rating) takes into account positive accomplishments, such as field goals, free throws, 3-pointers, assists, rebounds, blocks and steals, and negative results, such as missed shots, turnovers and personal fouls. PER values seem to be distributed around the 20 mark, with LeBron James recording the highest PER at 31.7, followed by Stephen Curry at 31.5.

In [None]:
fig = px.scatter(df_combined, x = 'Player', y = 'VORP', color = 'Pos', title='All-Stars VORP Distribution')
fig.show()

Value over Replacement Player (VORP) converts the BPM rate into an estimate of each player's overall contribution to the team, measured against what a theoretical "replacement player" would provide, where the "replacement player" is defined as a player on minimum salary or not a normal member of a team's rotation. It seems like most players' VORP is in the 2-4 range. LeBron records the highest VORP with 11.6, showing why he is considered one of the best players of all time.

In [None]:
fig = px.scatter(df_combined, x = 'Player', y = 'TS%', color = 'Pos', title='All-Stars True Shooting Percentage (TS%) Distribution')
fig.show()

All positions seem to have a similar distribution of true shooting percentage (TS%), with most values around the 55% mark. Some outliers are Tyson Chandler with a TS% of 70.8% and Amar'e Stoudemire with 42%.
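
For reference, true shooting percentage is commonly computed as PTS / (2 * (FGA + 0.44 * FTA)), which folds free throws and 3-pointers into one efficiency number. The sketch below applies that formula to made-up totals; `true_shooting_pct` is a name chosen for this example, and the stats dataset already ships a precomputed 'TS%' column.

```python
def true_shooting_pct(pts, fga, fta):
    """TS% = PTS / (2 * (FGA + 0.44 * FTA))."""
    attempts = 2 * (fga + 0.44 * fta)
    # No shooting attempts at all: the ratio is undefined.
    return pts / attempts if attempts else float('nan')


# Made-up game line: 25 points on 18 field goal attempts and 5 free throw attempts.
print(round(true_shooting_pct(pts=25, fga=18, fta=5), 3))  # 0.619
```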

In [None]:
fig = px.scatter(df_combined, x = 'Player', y = 'FG%', color = 'Pos', title='All-Stars Field Goal Percentage (FG%) Distribution')
fig.show()

Centers seem to have a higher FG% overall. DeAndre Jordan has the highest FG% with 70.3%, while some All-Stars have a really low FG% of around 35%, which may signify that they were chosen because of their reputation and previous achievements rather than their performance that season.

In [None]:
fig = px.scatter(df_combined, x = 'Player', y = '3P%', color = 'Pos', title='All-Stars 3-Point Percentage (3P%) Distribution')
fig.show()

Some Centers and Power Forwards have a 3P% of 0%, either because they didn't make any threes or because they didn't attempt any. However, some of them have a really high 3P%, such as Al Horford with 100% and Marc Gasol with 66.7%. This is probably because they took very few 3-point attempts, so the sample is too small for their 3P% to have evened out.
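
One way to keep such small samples from dominating the plot would be to require a minimum number of attempts. The sketch below uses toy data; the column names '3P' (makes) and '3PA' (attempts) and the cutoff of 50 are assumptions for illustration, and the attempts column is not among the features kept above.

```python
import pandas as pd

# Toy data; '3P' (makes) and '3PA' (attempts) are assumed column names.
demo = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player C'],
    '3P': [1, 150, 0],
    '3PA': [1, 400, 0],
})

MIN_ATTEMPTS = 50  # arbitrary cutoff chosen for illustration

# One make on one attempt gives 100%; zero attempts gives NaN (0 / 0).
demo['3P%'] = demo['3P'] / demo['3PA']

# Keep only players with enough attempts for the percentage to be meaningful.
qualified = demo[demo['3PA'] >= MIN_ATTEMPTS]

print(qualified[['Player', '3P%']])
```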

In [None]:
fig = px.scatter(df_combined, x = 'Player', y = 'STL', color = 'Pos', title='All-Stars Steals Distribution')
fig.show()

Chris Paul stands out as the leader in steals, at one point averaging around 2.8 steals per game. It seems like Point Guards have more steals overall, while Centers have fewer.

In [None]:
fig = px.scatter(df_combined, x = 'Player', y = 'BLK', color = 'Pos', title='All-Stars Blocks Distribution')
fig.show()

As expected, Centers and Forward-Center hybrids have the most blocks, while Guards have the fewest, which makes total sense considering their sizes. Ben Wallace, a player famous for his defensive skills, averaged 3.475 blocks per game in one season.

In [None]:
fig = px.scatter(df_combined, x = 'Player', y = 'TRB', color = 'Pos', title='All-Stars Rebounds Distribution')
fig.show()

Centers and Forwards also have more rebounds, which totally makes sense as well.

In [None]:
fig = px.scatter(df_combined, x = 'Player', y = 'AST', color = 'Pos', title='All-Stars Assists Distribution')
fig.show()

In contrast, guards (especially point guards) have the most assists compared to other positions. Rajon Rondo, Deron Williams, and Steve Nash are the leaders in the assist category.

In [None]:
fig = px.scatter(df_combined, x = 'Player', y = 'PTS', color = 'Pos', title='All-Stars Points Distribution')
fig.show()

Most All-Stars seem to average around 20 points per game. Guards seem to average more points than Forwards and Centers, with Kobe Bryant leading the way with 35.4 points per game. Ben Wallace is a really interesting case, because he was chosen as an All-Star even though he never averaged 10 or more points per game in this dataset. This really shows how good he was on defense.

I will end this kernel here, even though there are still a lot of things that can be done with the dataset. I might continue improving this kernel if I have time later on. Please don't forget to upvote if you like this kernel, and I greatly appreciate any suggestions and corrections! Thank you :)