Spring 2025<br>
Lecture 05

# Comparing Categories (Part 2)

Using indicator variables and facets with horizontal and vertical bar plots.

- Dataset: 2023-2024 NBA Player Statistics
- Metadata: https://www.kaggle.com/datasets/vivovinco/2023-2024-nba-player-stats
- One row per player-team in the 2023-2024 season of the NBA (National Basketball Association)

In [1]:
# Imports
import pandas as pd
import plotly.express as px

# Load data
df = pd.read_csv(
    'data/2023-2024 NBA Player Stats - Regular.csv',
    sep=';',
    encoding = "ISO-8859-1",
)

print(df.shape[0])
df.info()


735
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 735 entries, 0 to 734
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Rk      735 non-null    int64  
 1   Player  735 non-null    object 
 2   Pos     735 non-null    object 
 3   Age     735 non-null    int64  
 4   Tm      735 non-null    object 
 5   G       735 non-null    int64  
 6   GS      735 non-null    int64  
 7   MP      735 non-null    float64
 8   FG      735 non-null    float64
 9   FGA     735 non-null    float64
 10  FG%     735 non-null    float64
 11  3P      735 non-null    float64
 12  3PA     735 non-null    float64
 13  3P%     735 non-null    float64
 14  2P      735 non-null    float64
 15  2PA     735 non-null    float64
 16  2P%     735 non-null    float64
 17  eFG%    735 non-null    float64
 18  FT      735 non-null    float64
 19  FTA     735 non-null    float64
 20  FT%     735 non-null    float64
 21  ORB     735 non-null    float64
 22

## Examples

1. Which players played on the most NBA teams during the 2023-2024 season?
    - Exclude rows where `Tm == 'TOT'`
    - Create a column called `total_points` equal to `PTS` * `G`
    - Find the number of unique teams and the total number of points per player
    - Only keep players who played for at least 3 teams
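The steps above can be sketched on a tiny made-up table before running them on the full dataset. This is only an illustration: the players `A` and `B` and their numbers are hypothetical, and it assumes `PTS` is a per-game average (as the `PTS` * `G` step implies). It shows why the `TOT` rows have to go: they summarize a traded player's whole season, so keeping them would count an extra "team" and double the points.

```python
import pandas as pd

# Hypothetical toy data mimicking the dataset's layout: a traded player has
# one row per team plus a 'TOT' row that summarizes the whole season.
toy = pd.DataFrame({
    'Player': ['A', 'A', 'A', 'B'],
    'Tm':     ['TOT', 'TOR', 'NYK', 'MIA'],
    'G':      [30, 10, 20, 40],         # games played
    'PTS':    [8.0, 6.0, 9.0, 20.0],    # points per game (assumed)
})

summary = (
    toy
    # Drop the season-summary rows so each remaining row is one real team
    [toy['Tm'] != 'TOT']
    # Per-game average times games played approximates total points
    .assign(total_points = lambda x: x['PTS'] * x['G'])
    .groupby('Player')
    .agg({'Tm': 'nunique', 'total_points': 'sum'})
    .reset_index()
)

print(summary)
```

Without the filter, player `A` would appear to have played for 3 teams and scored twice as many points.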

In [2]:
df.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Precious Achiuwa,PF-C,24,TOT,74,18,21.9,3.2,6.3,...,0.616,2.6,4.0,6.6,1.3,0.6,0.9,1.1,1.9,7.6
1,1,Precious Achiuwa,C,24,TOR,25,0,17.5,3.1,6.8,...,0.571,2.0,3.4,5.4,1.8,0.6,0.5,1.2,1.6,7.7
2,1,Precious Achiuwa,PF,24,NYK,49,18,24.2,3.2,6.1,...,0.643,2.9,4.3,7.2,1.1,0.6,1.1,1.1,2.1,7.6
3,2,Bam Adebayo,C,26,MIA,71,71,34.0,7.5,14.3,...,0.755,2.2,8.1,10.4,3.9,1.1,0.9,2.3,2.2,19.3
4,3,Ochai Agbaji,SG,23,TOT,78,28,21.0,2.3,5.6,...,0.661,0.9,1.8,2.8,1.1,0.6,0.6,0.8,1.5,5.8


### Example 1

In [3]:
# Example 1

## ALTERNATE
# (
#     df
#     # Create total_points
#     .assign(total_points = df['G']*df['PTS'])
# )

df_ex1 = (
    df
    # Filter out the "TOT" team
    [df['Tm']!="TOT"]
    # Create a "total_points" column
    .assign(
        total_points = lambda x: (x['PTS']*x['G']).round(0)
    )
    # Group by Player
    .groupby('Player')
    # Aggregation
    .agg(
        {
            'Tm': 'nunique',
            'total_points': 'sum'
        }
    )
    # Round entire dataframe
    .round(0)
    .reset_index()
    # Sort
    # .sort_values(by = 'Tm', ascending=False)
    # Filter to Tm >= 3
    [lambda x: x['Tm']>=3]
    # Final Sort
    .sort_values(by = 'total_points', ascending=False)
    # Rename
    .rename(
        columns = {
            'Tm': 'Number of Teams',
            'total_points': 'Total Points'
        }
    )
)

display(df_ex1)

Unnamed: 0,Player,Number of Teams,Total Points
378,Malachi Flynn,3,381.0
113,Danilo Gallinari,3,282.0
502,Shake Milton,3,217.0
409,Mike Muscala,3,160.0
332,Kenneth Lofton Jr.,3,96.0
348,Kira Lewis Jr.,3,90.0
163,Dylan Windler,3,43.0


### Example 2

2. Create a vertical and a horizontal bar plot of the data from Example 1, and customize both plots.

In [4]:
# Vertical

fig1 = px.bar(
    df_ex1,
    x = 'Player',
    y = 'Total Points',
    template = 'simple_white',
    title = '<b>Flynn scored 381 points despite 2 trades in regular season</b>',
    subtitle = 'Among all players who played for three teams during the 2023-2024 season, Malachi Flynn scored the most total points during the regular season.<br>The next highest player was Danilo Gallinari with roughly 100 fewer points.',
    height = 500,
    width = 1000,
    color_discrete_sequence=['lightgray']
)
fig1.update_layout(
    # Remove the x-axis tick marks (tick labels stay visible)
    xaxis_ticks = "",
    # Change the title font
    title_font_family = 'Baskerville',
    # Font for the rest of plot
    font_family = 'Helvetica',
    # Change title font size
    title_font_size = 22,
    # Modify (enlarge) the top margin
    margin = {'t':130}
)

# Render the figure
fig1.show()

In [5]:
# Horizontal Bar Chart

fig2 = px.bar(
    df_ex1.sort_values(by = 'Total Points', ascending=True), # this should be the OPPOSITE of the original sort (horizontal bar chart ONLY)
    x = 'Total Points',
    y = 'Player',
    template = 'simple_white',
    title = '<b>Flynn scored 381 points despite 2 trades in regular season</b>',
    subtitle = 'Among all players who played for three teams during the 2023-2024 season, Malachi Flynn scored the most total points during the regular season.<br>The next highest player was Danilo Gallinari with roughly 100 fewer points.',
    height = 500,
    width = 1000,
    color_discrete_sequence=['lightgray']
)
fig2.update_layout(
    # Remove the y-axis tick marks (tick labels stay visible)
    yaxis_ticks = "",
    # Change the title font
    title_font_family = 'Baskerville',
    # Font for the rest of plot
    font_family = 'Helvetica',
    # Change title font size
    title_font_size = 22,
    # Modify (enlarge) the top margin
    margin = {'t':130}
)

# Render the figure
fig2.show()

### Example 3

3. Create an indicator variable on `df_ex1` that is True when `Player == "Malachi Flynn"`
- Use this variable in a bar chart to add a highlight for Flynn to the plot.
- Add a label to each bar of the player's points.
- Hide the axis that becomes redundant.

In [6]:
# Create an indicator variable (True/False) ---> flag

## True
df_ex1.loc[
    df_ex1['Player'] == 'Malachi Flynn',
    'flag'
] = True

## False
# df_ex1.loc[
#     df_ex1['flag'] != True,
#     'flag'
# ] = False

df_ex1.loc[
    df_ex1['Player'] != 'Malachi Flynn',
    'flag'
] = False

df_ex1

Unnamed: 0,Player,Number of Teams,Total Points,flag
378,Malachi Flynn,3,381.0,True
113,Danilo Gallinari,3,282.0,False
502,Shake Milton,3,217.0,False
409,Mike Muscala,3,160.0,False
332,Kenneth Lofton Jr.,3,96.0,False
348,Kira Lewis Jr.,3,90.0,False
163,Dylan Windler,3,43.0,False
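The two `.loc` assignments above can also be collapsed into one step, since a comparison such as `Player == "Malachi Flynn"` already evaluates to True/False for every row. The sketch below uses a small stand-in table (three rows copied from the output above) rather than `df_ex1` itself. It avoids the intermediate state in which the unflagged rows are missing.

```python
import pandas as pd

# Stand-in for df_ex1 with a few rows from the output above
toy = pd.DataFrame({
    'Player': ['Malachi Flynn', 'Danilo Gallinari', 'Shake Milton'],
    'Total Points': [381.0, 282.0, 217.0],
})

# The comparison itself is the indicator: True for Flynn, False otherwise
toy = toy.assign(flag = lambda x: x['Player'] == 'Malachi Flynn')

print(toy)
```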


In [7]:
fig3 = px.bar(
    df_ex1.sort_values(by = 'Total Points', ascending=True), # this should be the OPPOSITE of the original sort (horizontal bar chart ONLY)
    x = 'Total Points',
    y = 'Player',
    template = 'simple_white',
    title = '<b>Flynn scored 381 points despite 2 trades in regular season</b>',
    subtitle = 'Among all players who played for three teams during the 2023-2024 season, Malachi Flynn scored the most total points during the regular season.<br>The next highest player was Danilo Gallinari with roughly 100 fewer points.',
    height = 500,
    width = 1000,
    color_discrete_sequence=['lightgray', 'lightcoral'],
    # Add a color variable
    color = 'flag',
    # Add TEXT labels
    text = 'Total Points'
)
# Customize Figure
fig3.update_layout(
    # Remove the y-axis tick marks (tick labels stay visible)
    yaxis_ticks = "",
    # Change the title font
    title_font_family = 'Baskerville',
    # Font for the rest of plot
    font_family = 'Helvetica',
    # Change title font size
    title_font_size = 22,
    # Modify (enlarge) the top margin
    margin = {'t':130},
    # Hide the legend
    showlegend = False
)
# Move bar labels
fig3.update_traces(textposition = 'outside')

# Hide the x-axis
fig3.update_xaxes(visible = False)

# Render the figure
fig3.show()

### Example 4

4. Find the total points for each Tm and Position combination.

- Create a grouped bar plot for the total points by team and position.
- Create a stacked bar plot 
- Create a faceted bar plot
- Use an indicator variable to highlight a SINGLE bar. (Apply some formatting to make it aesthetic).

In [8]:
df_ex4 = (
    df
    # create a "total_points" variable
    .assign(total_points = (df['G']*df['PTS']).round())
    # filter out TOT
    [df['Tm']!="TOT"]
    # group by
    .groupby(['Tm', 'Pos'])
    # aggregation
    .agg(
        {
            'total_points': 'sum'
        }
    )
    .reset_index()
    .rename(
        columns = {
            'total_points': 'Total Points'
        }
    )
)

display(df_ex4)

Unnamed: 0,Tm,Pos,Total Points
0,ATL,C,1685.0
1,ATL,PF,24.0
2,ATL,PG,1573.0
3,ATL,SF,2738.0
4,ATL,SG,3691.0
...,...,...,...
145,WAS,C,990.0
146,WAS,PF,2047.0
147,WAS,PG,968.0
148,WAS,SF,3164.0


In [9]:
# Stacked Bar Chart


px.bar(
    df_ex4,
    x = 'Tm',
    y = 'Total Points',
    color = 'Pos'
)

In [10]:
# Grouped Bar Chart (also called a Paired Bar Chart)
px.bar(
    df_ex4,
    x = 'Tm',
    y = 'Total Points',
    color = 'Pos',
    width = 1200,
    # Specify GROUPED
    barmode = 'group'
)

In [11]:
df_ex4.loc[
    (df_ex4['Tm']=='LAL') & (df_ex4['Pos']=='PF'),
    'flag'
] = True

df_ex4.loc[
    df_ex4['flag'] != True,
    'flag'
] = False

In [17]:
# Bar Plot with a Facet!

fig4 = px.bar(
    df_ex4,
    x = 'Tm',
    y = 'Total Points',
    color = 'flag',
    # Add a facet variable
    facet_col = 'Pos',
    # Change the number of columns
    facet_col_wrap = 1,
    height = 900,
    color_discrete_sequence=['lightgray', '#FDB927'],
    # Different TEMPLATE (with gridlines)
    template = 'none',
    title = "<b>LA Lakers' power forwards (PF) scored over 200 more points than any other team-position group</b>",
    subtitle = "At 3,940 points during the 2023-2024 season, the Los Angeles Lakers' power forwards outscored every other team-position combination by more than 200 points.<br>The Lakers lacked scoring at the small forward (SF) position, with only 285 points. The next-highest-scoring group was the Toronto Raptors' shooting<br>guards (SG).",
    width = 1200
)
fig4.update_layout(
    showlegend = False,
    margin = {'t':200},
    font_family = 'Helvetica',
    title_font_family = 'Baskerville',
    title_font_size = 22,
    # Left align title
    title_x = 0.068 # Start title 6.8% into the figure
)
fig4.for_each_xaxis(lambda xaxis: xaxis.update(showticklabels=True))

In [13]:
df_ex4.sort_values(by = ['Total Points'], ascending=False)

Unnamed: 0,Tm,Pos,Total Points,flag
66,LAL,PF,3940.0,True
139,TOR,SG,3705.0,False
81,MIL,PF,3692.0,False
4,ATL,SG,3691.0,False
93,NOP,SF,3631.0,False
...,...,...,...,...
59,IND,SG,673.0,False
94,NOP,SG,537.0,False
16,CHI,PF,512.0,False
68,LAL,SF,285.0,False
