## NBA web scraping player information
This notebook scrapes player information from Basketball Reference (http://www.basketball-reference.com). The site hosts professional and college basketball statistics and history, and is one of the most comprehensive sources of sports data.

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style="whitegrid", color_codes=True)
import sys
import string
import requests
from datetime import datetime, timedelta
import time

### Basic player information

Adding the path /players/'letter' to the domain name shows basic info for all active and retired NBA and ABA players whose last names start with that letter. For example, https://www.basketball-reference.com/players/a/ lists every player whose last name starts with 'A', along with career years, position, height, weight, birth date, and colleges.

We will go through all letters and scrape the information for every player on each page.
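As a minimal sketch of the letter-by-letter traversal described above, the index page urls can be built up front. The names `BASE_URL` and `letter_index_urls` are hypothetical and only used for illustration; the scraping function below builds the same urls inline.

```python
import string

# Assumed base url for the alphabetical player index pages.
BASE_URL = 'https://www.basketball-reference.com/players/'

def letter_index_urls(base_url=BASE_URL):
    """Return one index-page url per lowercase letter, a through z."""
    return [base_url + letter + '/' for letter in string.ascii_lowercase]

urls = letter_index_urls()
print(len(urls), urls[0])
```

A letter with no players simply yields a page without a table, which the scraping function already tolerates by checking `if table:`.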

In [2]:
def player_info():
    
    players = []
    player_base_url = 'http://www.basketball-reference.com/players/'

    for letter in string.ascii_lowercase: # get player tables from alphabetical list pages
        page_request = requests.get(player_base_url + letter)
        soup = BeautifulSoup(page_request.text,"lxml")
        table = soup.find('table') # find table in soup

        if table:
            table_body = table.find('tbody')
            for row in table_body.findAll('tr'):  # loop over list of players in the table
                player_url = row.find('a')
                if player_url is None:  # skip rows without a player link (e.g. repeated header rows)
                    continue
                player_pages = player_url['href']  # player page url
                player_names = player_url.text  # player name

                # get additional player info from table
                cells = row.findAll('td')
                active_from = int(cells[0].text) # 'From' column
                active_to = int(cells[1].text) # 'To' column
                position = cells[2].text # 'Pos' column
                height = cells[3].text # 'Ht' column (feet-inches)
                weight = cells[4].text # 'Wt' column (lbs)
                birth_date = cells[5].text # 'Birth Date' column
                college = cells[6].text # 'Colleges' column (blank is either no college or intl)

                # create entry
                player_entry = {'url': player_pages,
                                'name': player_names,
                                'active_from': active_from,
                                'active_to': active_to,
                                'pos': position,
                                'college': college,
                                'height': height,
                                'weight': weight,
                                'birth_date': birth_date}

                # append player dictionary
                players.append(player_entry)
                
    return pd.DataFrame(players)

In [3]:
players_general_df = player_info() # call function that scrapes general info
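
The function above requests all 26 index pages back to back and does not check whether a request succeeded. The sketch below shows one possible way to pace requests and record failures. It is an illustration, not part of the original notebook: `fetch_pages`, `delay_seconds`, and the injected `get` and `sleep` callables are hypothetical names, and the 3-second default is an arbitrary assumption rather than a documented limit.

```python
import time

def fetch_pages(urls, get, delay_seconds=3.0, sleep=time.sleep):
    """Fetch each url in turn, pausing between consecutive requests.

    `get` is any callable that takes a url and returns an object with
    `status_code` and `text` attributes (for example, requests.get).
    Returns a dict mapping url -> page text, or None if the request failed.
    """
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause before every request except the first
        response = get(url)
        if response.status_code == 200:
            pages[url] = response.text
        else:
            pages[url] = None  # record the failure instead of parsing an error page
    return pages

# Possible usage (performs real network requests, so it is left commented out):
# import requests
# pages = fetch_pages(['http://www.basketball-reference.com/players/a/'], requests.get)
```

Passing `get` and `sleep` as arguments keeps the helper easy to try out with stand-in functions before pointing it at the real site.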

In [4]:
# convert height to inches
height_inches = players_general_df['height'].str.split('-',expand=True)
players_general_df['height_inches'] = 12.0*pd.to_numeric(height_inches[0], errors='coerce')+pd.to_numeric(height_inches[1], errors='coerce')

# convert birth_date to datetime
players_general_df['birth_date'] = pd.to_datetime(players_general_df['birth_date'])

# calculate career length
players_general_df['career_length'] = players_general_df['active_to'] - players_general_df['active_from']
#if just played one season, career_length is 1 instead of 0
players_general_df.career_length = players_general_df.career_length.replace({0: 1})

players_general_df.head(10) # preview

Unnamed: 0,active_from,active_to,birth_date,college,height,name,pos,url,weight,height_inches,career_length
0,1991,1995,1968-06-24,Duke University,6-10,Alaa Abdelnaby,F-C,/players/a/abdelal01.html,240,82.0,4
1,1969,1978,1946-04-07,Iowa State University,6-9,Zaid Abdul-Aziz,C-F,/players/a/abdulza01.html,235,81.0,9
2,1970,1989,1947-04-16,"University of California, Los Angeles",7-2,Kareem Abdul-Jabbar,C,/players/a/abdulka01.html,225,86.0,19
3,1991,2001,1969-03-09,Louisiana State University,6-1,Mahmoud Abdul-Rauf,G,/players/a/abdulma02.html,162,73.0,10
4,1998,2003,1974-11-03,"University of Michigan, San Jose State University",6-6,Tariq Abdul-Wahad,F,/players/a/abdulta01.html,223,78.0,5
5,1997,2008,1976-12-11,University of California,6-9,Shareef Abdur-Rahim,F,/players/a/abdursh01.html,225,81.0,11
6,1977,1981,1954-05-06,Indiana University,6-7,Tom Abernethy,F,/players/a/abernto01.html,220,79.0,4
7,1957,1957,1932-07-27,Western Kentucky University,6-3,Forest Able,G,/players/a/ablefo01.html,180,75.0,1
8,1947,1948,1919-02-09,Salem International University,6-3,John Abramovic,F,/players/a/abramjo01.html,195,75.0,1
9,2017,2019,1993-08-01,,6-6,Alex Abrines,G-F,/players/a/abrinal01.html,200,78.0,2


We get the above table output with the following features: 

- From: int variable, career start year
- To: int variable, career end year
- Pos: string variable, basketball position abbreviation
- Ht: string variable, height in feet-inches
- Wt: string variable, weight in pounds
- Birth Date: string variable
- Colleges: string variable, blank if international or did not play in college
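
Because Ht arrives as a feet-inches string, it has to be converted before it can be used numerically. The notebook does this column-wise with pandas; the sketch below shows the same arithmetic for a single value. The function name `height_to_inches` is hypothetical.

```python
def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-10' to total inches.

    Returns None when the string is empty or not in the expected format.
    """
    feet, sep, inches = height.partition('-')
    if not sep or not feet.isdigit() or not inches.isdigit():
        return None
    return 12 * int(feet) + int(inches)

print(height_to_inches('6-10'))  # 6 feet 10 inches -> 82
```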

In [5]:
# save in csv format
players_general_df.to_csv('Tables/players_general_df.csv', index=False)

### Detailed player information

The url of each player follows the format below:

*/players/(first letter of the last name)/(first 5 letters of last name)(first 2 letters of first name)(01 unless there's another player that fits the prior name setup, else it's 02, 03, etc).html*

For example, the url for John Wall is https://www.basketball-reference.com/players/w/walljo01.html
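
The naming rule above can be sketched as a small function. This is only an illustration of the pattern: `guess_player_path` is a hypothetical helper, the handling of hyphens and other punctuation is an assumption, and the sequence number cannot be known without checking the site, so the scraper reads the real urls from the index pages instead.

```python
def guess_player_path(first_name, last_name, sequence=1):
    """Build a player path from the naming pattern.

    Non-letter characters such as hyphens are dropped before truncating;
    that detail is an assumption, not something the site documents here.
    """
    first = ''.join(ch for ch in first_name.lower() if ch.isalpha())
    last = ''.join(ch for ch in last_name.lower() if ch.isalpha())
    return '/players/{}/{}{}{:02d}.html'.format(last[0], last[:5], first[:2], sequence)

print(guess_player_path('John', 'Wall'))  # /players/w/walljo01.html
```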

Each player's page includes much more information (seasons, awards, salary, etc.), but we will focus on a few statistics, such as career averages.

In [6]:
players_general_df = pd.read_csv('Tables/players_general_df.csv')

In [7]:
def player_detail_info(url):
    '''
    Scrape a player's personal page. Input is the player's url path (without the www.basketball-reference.com domain).
    '''
    # we do not need to parse the whole page since the information we are interested in is only a small part
    personal = SoupStrainer('p')
    page_request = requests.get('http://www.basketball-reference.com' + url)
    soup = BeautifulSoup(page_request.text, "lxml", parse_only=personal) # parse only part we are interested in
    p = soup.findAll('p')
                      
    
    # initialize some values in case they are unavailable
    position = None
    shoots = None  # shooting hand; this is the name assigned in the loop and read when building the dictionary
    high_school = None
    draft = None
    ppg = None
    trb = None
    ast = None
    hof = 0
    fgp = None
    per = None
    ws = None

    # loop over personal info to get certain information
    for prow in p:
        if 'Shoots:' in prow.text:
            s = prow.text.replace('\n','').split(u'\u25aa') # clean text
            if len(s)>1:
                shoots = s[1].split(':')[1].lstrip().rstrip()
                position = s[0].split(':')[1].lstrip().rstrip() # if multiple positions, read first listed position
                if 'and' in position:
                    position = position.split('and')[0].lstrip().rstrip()
        elif 'High School:' in prow.text:
            s = prow.text.replace('\n','').split(':') 
            if len(s)>1:
                high_school = s[1].lstrip()
        elif 'Draft:' in prow.text:
            s = prow.text.replace('\n','').split(':')
            if len(s)>1:
                draft = s[1].lstrip()
        elif 'Hall of Fame:' in prow.text:
            s = prow.text.replace('\n','').split(':')
            if len(s)>1 and 'Inducted as Player' in s[1]:
                hof = 1
            else:
                hof = 0
        elif 'Career' in prow.text:
            #print(prow)
            s = prow.find_next_siblings('p')[3]
            ppg = (str(s).strip('"<p>""</p>"'))
            s = prow.find_next_siblings('p')[5]
            trb = (str(s).strip('"<p>""</p>"'))            
            s = prow.find_next_siblings('p')[7]
            ast = (str(s).strip('"<p>""</p>"'))
            s = prow.find_next_siblings('p')[9]
            fgp = (str(s).strip('"<p>""</p>"'))
        elif 'Game Logs' in prow.text:
            t = prow.find_previous_siblings('p')[0]
            ws = str(t).strip('"<p>""</p>"').split('class')[0]
            t = prow.find_previous_siblings('p')[2]
            per = str(t).strip('"<p>""</p>"').split('class')[0]

            break
        
            #print(prow.find_previous_siblings('p')[:25])
            #t = prow.find_next_siblings('p')[17]
            #per = float(str(t).strip('"<p>""</p>"'))
            
            #s = prow.find_next_siblings('p')[19]
            #ws = float(str(s).strip('"<p>""</p>"'))
            #print(prow.find_next_siblings('p')[8:])
    
    # create dictionary with all of the info            
    player_entry = {'position': position,
                    'shooting_hand': shoots,
                    'high_school': high_school,
                    'draft': draft,
                    'ppg' : ppg,
                    'trb' : trb,
                    'ast' : ast,
                    'url': url,
                    'hof': hof,
                    'fgp': fgp,
                    'per': per,
                    'ws': ws}

    return player_entry
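
The 'Shoots:' branch above relies on the position and shooting hand sharing one paragraph, separated by a '▪' bullet (`\u25aa`). The sketch below isolates that parsing step so it can be tried on a plain string. The function name and the sample text are hypothetical; the sample only mimics the layout the scraper expects.

```python
def parse_position_and_hand(text):
    """Extract (position, shooting hand) from a 'Position: ... ▪ Shoots: ...' string."""
    parts = [part.strip() for part in text.replace('\n', ' ').split('\u25aa')]
    fields = {}
    for part in parts:
        label, sep, value = part.partition(':')
        if sep:  # keep only 'Label: value' pieces
            fields[label.strip()] = value.strip()
    position = fields.get('Position')
    if position and ' and ' in position:
        position = position.split(' and ')[0].strip()  # keep the first listed position
    return position, fields.get('Shoots')

sample = 'Position: Power Forward and Center \u25aa Shoots: Right'  # made-up sample text
print(parse_position_and_hand(sample))
```

Looking fields up by label rather than by index means a missing 'Shoots:' entry yields None instead of raising an error.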

In [8]:
players_details_info_list = []
for i,url in enumerate(players_general_df.url):
    try:
        players_details_info_list.append(player_detail_info(url))
    except Exception:
        print(players_general_df['name'].iloc[i])  # report only the player whose page failed
players_detail_df = pd.DataFrame(players_details_info_list) # convert to dataframe
players_detail_df = players_detail_df[['ppg', 'trb', 'ast', 'fgp', 'per', 'ws', 
                                       'position', 'shooting_hand', 'draft', 
                                       'high_school', 'hof', 'url']]
players_detail_df.head() # preview


Unnamed: 0,ppg,trb,ast,fgp,per,ws,position,shooting_hand,draft,high_school,hof,url
0,5.7,3.3,0.3,50.2,13.0,4.8,Power Forward,Right,"Portland Trail Blazers, 1st round (25th pick, ...","Bloomfield in Bloomfield, New Jersey",0,/players/a/abdelal01.html
1,9.0,8.0,1.2,42.8,15.1,17.5,Center,Right,"Cincinnati Royals, 1st round (5th pick, 5th ov...","John Jay in Brooklyn, New York",0,/players/a/abdulza01.html
2,24.6,11.2,3.6,55.9,24.6,273.4,Center,Right,"Milwaukee Bucks, 1st round (1st pick, 1st over...","Power Memorial in New York, New York",1,/players/a/abdulka01.html
3,14.6,1.9,3.5,44.2,15.4,25.2,Point Guard,Right,"Denver Nuggets, 1st round (3rd pick, 3rd overa...","Gulfport in Gulfport, Mississippi",0,/players/a/abdulma02.html
4,7.8,3.3,1.1,41.7,11.4,3.5,Shooting Guard,Right,"Sacramento Kings, 1st round (11th pick, 11th o...","Lycee Aristide Briand in Evreux, France",0,/players/a/abdulta01.html


We get the above table with the following features: 
    
- shooting_hand: Right or Left
- high_school: school name, city, state (or country)
- ppg: career points per game
- trb: career total rebounds per game
- ast: career assists per game
- url: player url page

The data needs to be cleaned up a bit here. We will divide the high_school string to separate school name, city, and state for better organization.
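
As a standalone illustration of that split, the sketch below handles a single 'Name in City, State' string with `str.partition`. The name `split_high_school` is hypothetical and this is not the notebook's own function; it also assumes the school name itself does not contain ' in '.

```python
def split_high_school(text):
    """Split 'Name in City, State' into (name, city, state).

    Returns None for any part that is missing.
    """
    if not isinstance(text, str) or ' in ' not in text:
        return (None, None, None)
    name, _, place = text.partition(' in ')
    city, sep, state = place.rpartition(',')
    if not sep:  # no comma: treat the whole place as the city
        return (name.strip(), place.strip(), None)
    return (name.strip(), city.strip(), state.strip())

print(split_high_school('Power Memorial in New York, New York'))
```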

In [9]:
# split high school into name, city, and state
def split_hs(hsString):
    '''
    splits high_school value into city, state (or country), and high school name
    '''
    city = state = name = None  # defaults when the value is missing or has an unexpected format
    if isinstance(hsString, str) and ' in ' in hsString:
        s = hsString.split(' in ')[1].split(',')
        if len(s)==2:
            city = s[0].strip()
            state = s[1].strip()
            name = hsString.split(' in ')[0]
    return pd.Series([city, state, name], index=['city','state','name'])

# now apply the function
players_detail_df[['hs_city','hs_state','hs_name']] = players_detail_df['high_school'].apply(split_hs)
players_detail_df.head() # preview

Unnamed: 0,ppg,trb,ast,fgp,per,ws,position,shooting_hand,draft,high_school,hof,url,hs_city,hs_state,hs_name
0,5.7,3.3,0.3,50.2,13.0,4.8,Power Forward,Right,"Portland Trail Blazers, 1st round (25th pick, ...","Bloomfield in Bloomfield, New Jersey",0,/players/a/abdelal01.html,Bloomfield,New Jersey,Bloomfield
1,9.0,8.0,1.2,42.8,15.1,17.5,Center,Right,"Cincinnati Royals, 1st round (5th pick, 5th ov...","John Jay in Brooklyn, New York",0,/players/a/abdulza01.html,Brooklyn,New York,John Jay
2,24.6,11.2,3.6,55.9,24.6,273.4,Center,Right,"Milwaukee Bucks, 1st round (1st pick, 1st over...","Power Memorial in New York, New York",1,/players/a/abdulka01.html,New York,New York,Power Memorial
3,14.6,1.9,3.5,44.2,15.4,25.2,Point Guard,Right,"Denver Nuggets, 1st round (3rd pick, 3rd overa...","Gulfport in Gulfport, Mississippi",0,/players/a/abdulma02.html,Gulfport,Mississippi,Gulfport
4,7.8,3.3,1.1,41.7,11.4,3.5,Shooting Guard,Right,"Sacramento Kings, 1st round (11th pick, 11th o...","Lycee Aristide Briand in Evreux, France",0,/players/a/abdulta01.html,Evreux,France,Lycee Aristide Briand


We will standardize the positions that we have pulled. Since players sometimes play more than one position in their careers, I have chosen to pull the first position that Basketball-Reference lists on the player page. Here I will map every position to both the modern five positions (PG, SG, SF, PF, C) and the traditional three positions (G, F, C).

In [10]:
# convert swingman positions into the five modern basketball positions
players_detail_df['position'] =  players_detail_df['position'].map({'Guard/Forward': 'Point Guard',
                                                                    'Point Guard': 'Point Guard',
                                                                    'Guard': 'Point Guard',
                                                                    'Shooting Guard': 'Shooting Guard',
                                                                    'Small Forward': 'Small Forward',
                                                                    'Forward/Guard': 'Small Forward',
                                                                    'Forward': 'Small Forward',
                                                                    'Power Forward': 'Power Forward',
                                                                    'Forward/Center': 'Power Forward',
                                                                    'Center': 'Center',
                                                                    'Center/Forward': 'Center'
                                                                    })

players_detail_df['trad_position'] = players_detail_df['position'].map({'Point Guard': 'Guard',
                                                                        'Shooting Guard': 'Guard',
                                                                        'Small Forward': 'Forward',
                                                                        'Power Forward': 'Forward',
                                                                        'Center': 'Center'})
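
One thing to keep in mind with `Series.map` and a dictionary is that any label missing from the dictionary becomes NaN. The sketch below shows one way to list such labels before they are silently lost. The data is made up, the map is abbreviated, and 'Guard/Center' is a hypothetical label used only to trigger the case.

```python
import pandas as pd

# Toy data: the last two entries are a label the map does not cover and a missing value.
positions = pd.Series(['Point Guard', 'Forward/Center', 'Guard/Center', None])

# Abbreviated version of the five-position map, for illustration only.
five_position_map = {'Point Guard': 'Point Guard',
                     'Forward/Center': 'Power Forward'}

mapped = positions.map(five_position_map)

# Labels that were present in the input but came back as NaN after mapping.
unmapped = positions[mapped.isna() & positions.notna()].unique().tolist()
print(unmapped)
```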

In [11]:
# save in csv format
players_detail_df.to_csv('Tables/players_detail_df.csv', index=False)

In [12]:
players_detail_df = pd.read_csv('Tables/players_detail_df.csv')
players_detail_df.head() # preview

Unnamed: 0,ppg,trb,ast,fgp,per,ws,position,shooting_hand,draft,high_school,hof,url,hs_city,hs_state,hs_name,trad_position
0,5.7,3.3,0.3,50.2,13.0,4.8,Power Forward,Right,"Portland Trail Blazers, 1st round (25th pick, ...","Bloomfield in Bloomfield, New Jersey",0,/players/a/abdelal01.html,Bloomfield,New Jersey,Bloomfield,Forward
1,9.0,8.0,1.2,42.8,15.1,17.5,Center,Right,"Cincinnati Royals, 1st round (5th pick, 5th ov...","John Jay in Brooklyn, New York",0,/players/a/abdulza01.html,Brooklyn,New York,John Jay,Center
2,24.6,11.2,3.6,55.9,24.6,273.4,Center,Right,"Milwaukee Bucks, 1st round (1st pick, 1st over...","Power Memorial in New York, New York",1,/players/a/abdulka01.html,New York,New York,Power Memorial,Center
3,14.6,1.9,3.5,44.2,15.4,25.2,Point Guard,Right,"Denver Nuggets, 1st round (3rd pick, 3rd overa...","Gulfport in Gulfport, Mississippi",0,/players/a/abdulma02.html,Gulfport,Mississippi,Gulfport,Guard
4,7.8,3.3,1.1,41.7,11.4,3.5,Shooting Guard,Right,"Sacramento Kings, 1st round (11th pick, 11th o...","Lycee Aristide Briand in Evreux, France",0,/players/a/abdulta01.html,Evreux,France,Lycee Aristide Briand,Guard


### Merge two dataframes 
Now we combine the two dataframes that we pulled from the website. We will use the player url to merge them, and reorder the columns so that the layout makes sense.
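
Before merging the real tables, it can help to see how an outer merge on url behaves on a toy example. The sketch below uses made-up rows; `validate` and `indicator` are optional `merge` parameters that the notebook itself does not use, shown here as one possible way to check that urls are unique and to see which rows matched.

```python
import pandas as pd

# Made-up rows standing in for the general and detailed tables.
general = pd.DataFrame({'url': ['/players/a/a01.html', '/players/b/b01.html'],
                        'name': ['Player A', 'Player B']})
detail = pd.DataFrame({'url': ['/players/a/a01.html'], 'ppg': [5.7]})

# validate raises if a url is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from.
merged = general.merge(detail, how='outer', on='url',
                       validate='one_to_one', indicator=True)
counts = merged['_merge'].value_counts()
print(merged)
```

Rows marked `left_only` are players whose detail page was not scraped successfully, which is where missing values in the merged table come from.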

In [13]:
# merge two dataframes: players_general_df and players_detail_df
players = players_general_df.merge(players_detail_df,how='outer',on='url')

# reorganize columns
players = players[['name', 'active_from', 'active_to', 'career_length', 'birth_date',
                   'position', 'trad_position', 'ppg', 'trb', 'ast', 'fgp', 'per', 'ws', 
                   'height_inches', 'weight', 'shooting_hand', 'hof', 'college', 'hs_name', 'hs_city', 'hs_state', 'url']]

players['hof'] = players['hof'].fillna(0)  # assign back instead of relying on an in-place change to a column selection

players = players.replace('-', np.nan)

#preview
players.head(10)

Unnamed: 0,name,active_from,active_to,career_length,birth_date,position,trad_position,ppg,trb,ast,...,ws,height_inches,weight,shooting_hand,hof,college,hs_name,hs_city,hs_state,url
0,Alaa Abdelnaby,1991,1995,4,1968-06-24,Power Forward,Forward,5.7,3.3,0.3,...,4.8,82.0,240.0,Right,0.0,Duke University,Bloomfield,Bloomfield,New Jersey,/players/a/abdelal01.html
1,Zaid Abdul-Aziz,1969,1978,9,1946-04-07,Center,Center,9.0,8.0,1.2,...,17.5,81.0,235.0,Right,0.0,Iowa State University,John Jay,Brooklyn,New York,/players/a/abdulza01.html
2,Kareem Abdul-Jabbar,1970,1989,19,1947-04-16,Center,Center,24.6,11.2,3.6,...,273.4,86.0,225.0,Right,1.0,"University of California, Los Angeles",Power Memorial,New York,New York,/players/a/abdulka01.html
3,Mahmoud Abdul-Rauf,1991,2001,10,1969-03-09,Point Guard,Guard,14.6,1.9,3.5,...,25.2,73.0,162.0,Right,0.0,Louisiana State University,Gulfport,Gulfport,Mississippi,/players/a/abdulma02.html
4,Tariq Abdul-Wahad,1998,2003,5,1974-11-03,Shooting Guard,Guard,7.8,3.3,1.1,...,3.5,78.0,223.0,Right,0.0,"University of Michigan, San Jose State University",Lycee Aristide Briand,Evreux,France,/players/a/abdulta01.html
5,Shareef Abdur-Rahim,1997,2008,11,1976-12-11,Small Forward,Forward,18.1,7.5,2.5,...,71.2,81.0,225.0,Right,0.0,University of California,Wheeler,Marietta,Georgia,/players/a/abdursh01.html
6,Tom Abernethy,1977,1981,4,1954-05-06,Power Forward,Forward,5.6,3.2,1.2,...,13.4,79.0,220.0,Right,0.0,Indiana University,Saint Joseph,South Bend,Indiana,/players/a/abernto01.html
7,Forest Able,1957,1957,1,1932-07-27,Point Guard,Guard,0.0,1.0,1.0,...,0.0,75.0,180.0,Right,0.0,Western Kentucky University,Fairdale,Louisville,Kentucky,/players/a/ablefo01.html
8,John Abramovic,1947,1948,1,1919-02-09,Small Forward,Forward,9.5,,0.7,...,-1.9,75.0,195.0,Right,0.0,Salem International University,Etna,Etna,Pennsylvania,/players/a/abramjo01.html
9,Alex Abrines,2017,2019,2,1993-08-01,Shooting Guard,Guard,5.3,1.4,0.5,...,5.0,78.0,200.0,Right,0.0,,,,,/players/a/abrinal01.html


Finally, we convert the remaining statistics columns to a numeric type. Some player statistics are missing from the website, such as rebounds per game and field goal percentage. A statistic might be missing because it was not recorded in the league's early years (the league dates back to the 1946-47 season), because a player did not play enough minutes to register it, or because of a recordkeeping error. Missing values appear as '-' on the website; they were replaced with NaN above, and errors='coerce' turns any remaining non-numeric strings into NaN as well.

In [14]:
players['trb'] = pd.to_numeric(players['trb'], errors='coerce')
players['fgp'] = pd.to_numeric(players['fgp'], errors='coerce')
players['per'] = pd.to_numeric(players['per'], errors='coerce')
players['ws'] = pd.to_numeric(players['ws'], errors='coerce')
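
The effect of `errors='coerce'` can be seen on a small made-up series: values that cannot be parsed as numbers, including the '-' placeholder, become NaN rather than raising an error.

```python
import pandas as pd

# Made-up values: two numeric strings, the '-' placeholder, and a missing entry.
raw = pd.Series(['3.3', '-', '8.0', None])

numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)
print(numeric.isna().sum())  # number of entries that could not be converted
```

Keeping these entries as NaN rather than 0.0 means that aggregates such as `mean()` skip them instead of being pulled toward zero.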

Here we see that we have 4,685 players, with almost all stats fields populated.

In [15]:
players.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4685 entries, 0 to 4684
Data columns (total 22 columns):
name             4685 non-null object
active_from      4685 non-null int64
active_to        4685 non-null int64
career_length    4685 non-null int64
birth_date       4657 non-null object
position         4684 non-null object
trad_position    4684 non-null object
ppg              4684 non-null float64
trb              4392 non-null float64
ast              4684 non-null float64
fgp              4656 non-null float64
per              4334 non-null float64
ws               4680 non-null float64
height_inches    4685 non-null float64
weight           4680 non-null float64
shooting_hand    4684 non-null object
hof              4685 non-null float64
college          4372 non-null object
hs_name          4026 non-null object
hs_city          4024 non-null object
hs_state         4026 non-null object
url              4685 non-null object
dtypes: float64(9), int64(3), object(10)
memory usa

In [16]:
players.to_csv('Tables/players.csv', index=False)
players = pd.read_csv('Tables/players.csv')