# Which stats are most transferable from college to the NBA? #
### Motivation: To find out whether one particular type of player is more likely to be drafted. Many top college athletes are not drafted, or choose not to enter the draft; the question is why. ###

For this analysis, data was taken from the 2021 NBA draft class (only athletes who attended American colleges).

### Collecting the data
Step 1: Take a CSV of the 2021 draft class names  

Step 2: Scrape data from https://www.sports-reference.com  

Step 3: Format the data  


In [21]:
# Imports

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import os
import time

pd.options.mode.chained_assignment = None

In [3]:
# Downloaded 2021 draft class names

df = pd.read_csv('coll21.csv')
urls = []
player = df['player'].values.tolist()
x = len(df['player'])

# The 'player' column holds names already formatted as URL identifiers.
# Some names were trickier to format (e.g. symbols, or multiple athletes with the same name) and required manual work.
for i in range(x):
    url = 'https://www.sports-reference.com/cbb/players/' + player[i] + '.html' 
    urls.append(url)
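
The loop above assumes the `player` column already holds a URL-ready identifier. As a minimal sketch of how display names might be converted automatically, the helper below lowercases a name, strips accents and punctuation, and appends a numeric suffix. The slug pattern (for example `cade-cunningham-1`) and the helper names `player_slug` and `player_url` are assumptions for illustration, not something the notebook confirms; athletes who share a name would still need a manually chosen `index`.

```python
import re
import unicodedata

BASE_URL = "https://www.sports-reference.com/cbb/players/"


def player_slug(name: str, index: int = 1) -> str:
    """Build a slug such as 'cade-cunningham-1' (assumed pattern, hypothetical helper)."""
    # Strip accents: decompose characters, then drop the non-ASCII combining marks.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Remove punctuation such as periods and apostrophes.
    cleaned = re.sub(r"[^a-z0-9\s-]", "", ascii_name.lower())
    # Collapse runs of whitespace into single hyphens.
    hyphenated = re.sub(r"\s+", "-", cleaned.strip())
    return f"{hyphenated}-{index}"


def player_url(name: str, index: int = 1) -> str:
    """Return the assumed player page URL for a display name."""
    return BASE_URL + player_slug(name, index) + ".html"
```

Any URL produced this way should be spot-checked against the site before scraping, since the suffix rule is a guess.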

In [4]:
# Scraping data from the website
data = []
for i in range(x):
    if i % 10 == 0:
        # Pause every 10 requests to avoid being rate-limited or blocked by the site
        time.sleep(30) 
    page = requests.get(urls[i])
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # First three <span> tags in the info box (player name, height, weight),
    # followed by the stat summaries in the p1, p2 and p3 sections
    details = soup.find("div", {"id": "info"})
    a= []

    span_tags = details.find_all('span', limit=3)
    
    for span in span_tags:
        a.append(span.text)
              

    details = soup.find("div", {"class": "p1"})

    span_tags = details.find_all('p')
    for span in span_tags:
        a.append(span.text)
        
    details = soup.find("div", {"class": "p2"})

    span_tags = details.find_all('p')
    for span in span_tags:
        a.append(span.text)

    details = soup.find("div", {"class": "p3"})

    span_tags = details.find_all('p')
    for span in span_tags:
        a.append(span.text)
    data.append(a)
    

# Create a dataframe (integer column labels), drop the columns that are empty
# for every player, then name the remaining stat columns
df = pd.DataFrame(data)
df = df.loc[:, (df != '').any(axis=0)]

df = df.rename(columns={0: 'Player', 1: 'Height', 2: 'Weight', 4:'G', 6: 'PTS', 8: 'TRB', 10:'AST', 12: 'FG%', 14: 'FG3%', 16: 'FT%', 18: 'eFG%', 20: 'PER', 22: 'WS'})
df['Weight'] = df['Weight'].str.replace('lb', '')
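
The scraping loop above stops at the first page that fails to download. As a minimal sketch, assuming failed requests should be skipped and retried later, the pausing logic can be separated from the download so that one bad page does not end the run. The function `fetch_politely` and its parameters are hypothetical names introduced here. Unlike the original loop, it does not pause before the very first request.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    pause_every: int = 10,
    pause_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, Optional[bytes]]]:
    """Fetch URLs in order, pausing after every `pause_every` requests.

    Returns (url, content) pairs; content is None when the fetch raised.
    """
    results: List[Tuple[str, Optional[bytes]]] = []
    for i, url in enumerate(urls):
        # Pause before request 10, 20, ... but not before the first one.
        if i > 0 and i % pause_every == 0:
            sleep(pause_seconds)
        try:
            results.append((url, fetch(url)))
        except Exception:
            # Broad catch is deliberate in this sketch: record the failure and move on.
            results.append((url, None))
    return results
```

With the imports already in this notebook, it could be called as `fetch_politely(urls, lambda u: requests.get(u, timeout=30).content)`, and any pair whose content is `None` can be retried or handled manually.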

In [11]:
collegeData = df
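
The scraped `Height` and `Weight` columns are still strings at this point. A minimal sketch of converting them to numbers follows, assuming heights are written as feet and inches separated by a hyphen (for example `6-8`) and weights as a number optionally followed by `lb`. Both formats are assumptions about the scraped text, and the helper names are hypothetical.

```python
from typing import Optional


def height_to_inches(height: str) -> Optional[int]:
    """Convert a 'feet-inches' string such as '6-8' to total inches."""
    parts = height.strip().split("-")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None  # unexpected format; leave for manual review
    feet, inches = (int(p) for p in parts)
    return feet * 12 + inches


def weight_to_pounds(weight: str) -> Optional[int]:
    """Convert a string such as '220lb' or '220' to an integer number of pounds."""
    digits = weight.replace("lb", "").strip()
    return int(digits) if digits.isdigit() else None
```

These could be applied with `collegeData['Height'].map(height_to_inches)` and `collegeData['Weight'].map(weight_to_pounds)`; values that do not match the assumed format come back as `None` rather than raising.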

### Comparing NBA Rookie Data (how athletes performed in their college career vs. rookie season)
Step 1: Download rookie season data  

Step 2: Merge the dataframe with college data  

Step 3: Create visualizations

In [22]:
# Downloaded rookie year data from Basketball Reference
nbadf = pd.read_csv('21rookies.csv')
players = collegeData['Player'].values.tolist()
mask = nbadf['Player'].isin(players)  

# Filter the dataframe using the mask
filtered_df = nbadf[mask]

# Label the NBA rookie-season stats with an R_ prefix; college stats keep their original names
rookie = filtered_df[['Player', 'Age', 'G', 'FG%', '3P%', 'FT%', 'PG_MP', 'PG_PTS', 'PG_TRB', 'PG_AST']]
rookie = rookie.rename(columns={'G': 'R_G', 'FG%': 'R_FG%', '3P%': 'R_3P%', 'FT%': 'R_FT%', 'PG_MP': 'R_MP', 'PG_PTS': 'R_PTS', 'PG_TRB': 'R_TRB', 'PG_AST': 'R_AST'})

# Outer merge keeps college players who have no matching rookie row (their R_ columns are NaN)
df_true = pd.merge(df, rookie, on='Player', how='outer', suffixes=('_C', '_R'))
df_true = df_true.drop(columns=['eFG%', 'PER', 21, 'WS'])
df_true.to_csv("21R+C.csv")
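
Because the merge above is an outer join on `Player`, any college player whose name is spelled differently in the rookie file, or who did not play in the NBA that season, ends up with missing rookie stats and silently disappears from the scatter plots. A minimal sketch of how such rows might be surfaced uses the `indicator` option of `pd.merge`. The two small frames below are made-up placeholders, not real player data.

```python
import pandas as pd

# Hypothetical toy frames standing in for the college and rookie tables.
college_demo = pd.DataFrame(
    {"Player": ["Player A", "Player B", "Player C"], "PTS": [15.2, 11.0, 18.4]}
)
rookie_demo = pd.DataFrame(
    {"Player": ["Player A", "Player C", "Player D"], "R_PTS": [9.1, 12.3, 4.0]}
)

merged = pd.merge(college_demo, rookie_demo, on="Player", how="outer", indicator=True)

# The added '_merge' column is 'both', 'left_only', or 'right_only' for each row.
unmatched = merged.loc[merged["_merge"] != "both", ["Player", "_merge"]]
print(unmatched)
```

Rows marked `left_only` are the ones worth checking by hand for name mismatches before drawing conclusions from the plots.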

### Creating Visualizations ###

In [7]:
from bokeh.io import output_file
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.io import output_notebook
from bokeh.transform import factor_cmap

In [8]:
import bokeh.io
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [20]:
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import HoverTool, ColumnDataSource

# Comparing points
df_pts = pd.DataFrame({
    'Player': df_true['Player'],
    'PTS_College': df_true['PTS'].astype(float),
    'PTS_Rookie': df_true['R_PTS'].astype(float)
})

# Comparing assists
df_ast = pd.DataFrame({
    'Player': df_true['Player'],
    'AST_College': df_true['AST'].astype(float),
    'AST_Rookie': df_true['R_AST'].astype(float)
})

# Comparing total rebounds
df_trb = pd.DataFrame({
    'Player': df_true['Player'],
    'TRB_College': df_true['TRB'].astype(float),
    'TRB_Rookie': df_true['R_TRB'].astype(float)
})

# Creating plots
p_pts = figure(width=400, height=400, title='PTS Comparison: College vs Rookie Season')
p_pts.xaxis.axis_label = 'PTS - College'
p_pts.yaxis.axis_label = 'PTS - Rookie Season'
source_pts = ColumnDataSource(df_pts)
p_pts.circle(x='PTS_College', y='PTS_Rookie', size=10, color='blue', alpha=0.5, source=source_pts)
hover_pts = HoverTool(tooltips=[
    ('Player', '@Player'),
    ('PTS - College', '@PTS_College{0.2f}'),
    ('PTS - Rookie Season', '@PTS_Rookie{0.2f}')
])
p_pts.add_tools(hover_pts)


p_ast = figure(width=400, height=400, title='AST Comparison: College vs Rookie Season')
p_ast.xaxis.axis_label = 'AST - College'
p_ast.yaxis.axis_label = 'AST - Rookie Season'
source_ast = ColumnDataSource(df_ast)
p_ast.circle(x='AST_College', y='AST_Rookie', size=10, color='blue', alpha=0.5, source=source_ast)
hover_ast = HoverTool(tooltips=[
    ('Player', '@Player'),
    ('AST - College', '@AST_College{0.2f}'),
    ('AST - Rookie Season', '@AST_Rookie{0.2f}')
])
p_ast.add_tools(hover_ast)


p_trb = figure(width=400, height=400, title='TRB Comparison: College vs Rookie Season')
p_trb.xaxis.axis_label = 'TRB - College'
p_trb.yaxis.axis_label = 'TRB - Rookie Season'
source_trb = ColumnDataSource(df_trb)
p_trb.circle(x='TRB_College', y='TRB_Rookie', size=10, color='blue', alpha=0.5, source=source_trb)
hover_trb = HoverTool(tooltips=[
    ('Player', '@Player'),
    ('TRB - College', '@TRB_College{0.2f}'),
    ('TRB - Rookie Season', '@TRB_Rookie{0.2f}')
])
p_trb.add_tools(hover_trb)

show(p_pts)
show(p_ast)
show(p_trb)


### Conclusion ###

From the plots, there is a weak positive correlation between college and rookie-season assists, which suggests that athletes who facilitated in college tend to carry that skill into the NBA. Total rebounds also show a weak positive correlation, with some outliers such as Scottie Barnes, who averaged more rebounds in his rookie year than in his college career. Points, by contrast, show no clear correlation. This makes sense: many rookies play fewer minutes than they did in college, which leaves less opportunity to score. A next step would be to sample more draft classes and check whether the correlations follow a similar pattern.
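
The correlations described above are read off the scatter plots by eye. A minimal sketch of how they might be quantified is a Pearson correlation coefficient computed over the players who have both a college and a rookie value. The function `pearson_r` is a hypothetical helper written for illustration; it skips pairs containing NaN, which the outer merge produces for unmatched players.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation over pairs where neither value is NaN."""
    pairs = [(x, y) for x, y in zip(xs, ys) if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)
```

For example, `pearson_r(df_ast['AST_College'], df_ast['AST_Rookie'])` would give a single number for the assists plot. The built-in `df_ast['AST_College'].corr(df_ast['AST_Rookie'])` in pandas computes the same statistic and also ignores missing values.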