<a id="top"></a>

# Web Scraping NBA Player Game and Shooting Stats Using Python and BeautifulSoup

This is something that I hacked together so that I can teach my oldest son SQL.  Instead of teaching him SQL on data he has no interest in, I decided to get data on something he'd enjoy looking at.  Since he loves basketball and LeBron James, I figured I would try to get NBA player statistics from ESPN's website using a technique called "web scraping".  Web scraping means using a computer program to extract ("scrape") data from a website.

Here I am using Python 3 and the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to scrape the data from ESPN's website.  Visit ESPN's [website](http://espn.go.com/nba/teams) to get a feel for what's available.  Take a look at a specific team's stats.  For example, here is the [page](http://espn.go.com/nba/team/stats/_/name/cle/cleveland-cavaliers) for the Cleveland Cavaliers.

I know there are better/faster libraries for scraping web data, like lxml.  But with lxml, I didn't want to have to get up to speed with XPath and XSLT.  With BeautifulSoup, I could get up and running quickly without having to learn additional APIs, and it is "good enough" for my purpose of dealing with relatively small data sets.

I re-wrote these scripts using the [requests](http://docs.python-requests.org) library, so you'll need to install requests, plus lxml, since I am using lxml as the parsing engine with BeautifulSoup.

Before you venture into the world of web scraping, I would recommend reading this [article](http://robertorocha.info/on-the-ethics-of-web-scraping/) on ethical web scraping.

#### Quick Links

- [source code for scraping player game stats](#game_stats)
- [source code for scraping player shooting stats](#shooting_stats)
- [querying the NBA sqlite database using db.py](#db_py)

### Below are the SQL statements needed to create tables that will be used to store the player stats

**In this example, I am using a sqlite3 database.**

In [None]:
CREATE TABLE "player_game_stats" (
    "id" INTEGER PRIMARY KEY NOT NULL,
    "name_pos" TEXT NOT NULL,
    "team_name" TEXT NOT NULL,
    "GP" INTEGER NOT NULL,
    "GS" INTEGER NOT NULL,
    "MIN" REAL NOT NULL,
    "PPG" REAL NOT NULL,
    "OFFR" REAL NOT NULL,
    "DEFR" REAL NOT NULL,
    "RPG" REAL NOT NULL,
    "APG" REAL NOT NULL,
    "SPG" REAL NOT NULL,
    "BPG" REAL NOT NULL,
    "TPG" REAL NOT NULL,
    "FPG" REAL NOT NULL,
    "A2TO" REAL NOT NULL,
    "PER" REAL NOT NULL
);

CREATE TABLE "player_shooting_stats" (
    "id" INTEGER PRIMARY KEY NOT NULL,
    "name_pos" TEXT NOT NULL,
    "team_name" TEXT NOT NULL,
    "FGM" REAL NOT NULL,
    "FGA" REAL NOT NULL,
    "FG_Perc" REAL NOT NULL,
    "3PM" REAL NOT NULL,
    "3PA" REAL NOT NULL,
    "3P_Perc" REAL NOT NULL,
    "FTM" REAL NOT NULL,
    "FTA" REAL NOT NULL,
    "FT_Perc" REAL NOT NULL,
    "2PM" REAL NOT NULL,
    "2PA" REAL NOT NULL,
    "2P_Perc" REAL NOT NULL,
    "PPS" REAL NOT NULL,
    "AFG_Perc" REAL NOT NULL
);
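
If you would rather create the tables from Python than from the sqlite3 command line, something like the sketch below should work.  It is only an illustration under a few assumptions: it uses an in-memory database and a shortened, hypothetical `demo_game_stats` table so that it runs anywhere.  Swap in your own database file and the full statements above.

```python
import sqlite3

# Assumption: an in-memory database for illustration; pass a file path
# instead to create the database on disk.
conn = sqlite3.connect(':memory:')

# Shortened, hypothetical table; paste the full CREATE TABLE statements here.
ddl = '''
CREATE TABLE "demo_game_stats" (
    "id" INTEGER PRIMARY KEY NOT NULL,
    "name_pos" TEXT NOT NULL,
    "PPG" REAL NOT NULL
);
'''
# executescript() runs one or more semicolon-separated SQL statements.
conn.executescript(ddl)

# List the tables that now exist in the database.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)
conn.close()
```

Because `executescript()` accepts several statements at once, both table definitions can go into the same string.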

### Let's begin by examining what's available at ESPN's website

In [1]:
import requests
from bs4 import BeautifulSoup
import re

base_url = 'http://espn.go.com'

teams_url = 'http://espn.go.com/nba/teams'
html_teams = requests.get(teams_url)

soup_teams = BeautifulSoup(html_teams.text,'lxml')

### Getting the URLs for all teams in the NBA

In [2]:
urls = soup_teams.find_all(href=re.compile('/nba/teams/stats'))
urls

[<a href="/nba/teams/stats?team=bos">Stats</a>,
 <a href="/nba/teams/stats?team=bkn">Stats</a>,
 <a href="/nba/teams/stats?team=nyk">Stats</a>,
 <a href="/nba/teams/stats?team=phi">Stats</a>,
 <a href="/nba/teams/stats?team=tor">Stats</a>,
 <a href="/nba/teams/stats?team=gsw">Stats</a>,
 <a href="/nba/teams/stats?team=lac">Stats</a>,
 <a href="/nba/teams/stats?team=lal">Stats</a>,
 <a href="/nba/teams/stats?team=pho">Stats</a>,
 <a href="/nba/teams/stats?team=sac">Stats</a>,
 <a href="/nba/teams/stats?team=chi">Stats</a>,
 <a href="/nba/teams/stats?team=cle">Stats</a>,
 <a href="/nba/teams/stats?team=det">Stats</a>,
 <a href="/nba/teams/stats?team=ind">Stats</a>,
 <a href="/nba/teams/stats?team=mil">Stats</a>,
 <a href="/nba/teams/stats?team=dal">Stats</a>,
 <a href="/nba/teams/stats?team=hou">Stats</a>,
 <a href="/nba/teams/stats?team=mem">Stats</a>,
 <a href="/nba/teams/stats?team=nor">Stats</a>,
 <a href="/nba/teams/stats?team=sas">Stats</a>,
 <a href="/nba/teams/stats?team=atl">Sta

### But I want the full URLs and without the markup junk

In [4]:
team_urls = [base_url+url['href'] for url in urls]
team_urls

['http://espn.go.com/nba/teams/stats?team=bos',
 'http://espn.go.com/nba/teams/stats?team=bkn',
 'http://espn.go.com/nba/teams/stats?team=nyk',
 'http://espn.go.com/nba/teams/stats?team=phi',
 'http://espn.go.com/nba/teams/stats?team=tor',
 'http://espn.go.com/nba/teams/stats?team=gsw',
 'http://espn.go.com/nba/teams/stats?team=lac',
 'http://espn.go.com/nba/teams/stats?team=lal',
 'http://espn.go.com/nba/teams/stats?team=pho',
 'http://espn.go.com/nba/teams/stats?team=sac',
 'http://espn.go.com/nba/teams/stats?team=chi',
 'http://espn.go.com/nba/teams/stats?team=cle',
 'http://espn.go.com/nba/teams/stats?team=det',
 'http://espn.go.com/nba/teams/stats?team=ind',
 'http://espn.go.com/nba/teams/stats?team=mil',
 'http://espn.go.com/nba/teams/stats?team=dal',
 'http://espn.go.com/nba/teams/stats?team=hou',
 'http://espn.go.com/nba/teams/stats?team=mem',
 'http://espn.go.com/nba/teams/stats?team=nor',
 'http://espn.go.com/nba/teams/stats?team=sas',
 'http://espn.go.com/nba/teams/stats?tea

### Ahhh much better now
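
Plain string concatenation works here because every href starts with a slash.  A slightly more defensive variation, sketched below, uses `urljoin` from the standard library, which also copes with hrefs that are already absolute.  The `hrefs` list is a hypothetical stand-in for the values scraped above.

```python
from urllib.parse import urljoin

base_url = 'http://espn.go.com'

# Hypothetical sample of hrefs like the ones scraped above.
hrefs = ['/nba/teams/stats?team=bos', '/nba/teams/stats?team=cle']

# urljoin resolves each href against the base URL.
team_urls = [urljoin(base_url, href) for href in hrefs]
print(team_urls)
```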

### The web site doesn't have team names readily available that I could easily scrape using BeautifulSoup, but the team codes are available in the URLs.  I'd rather have the full team names, though.  I'm sure I could've scraped the team names somehow, but using a dictionary was the most straightforward and easiest approach for me at the time.

### Python dictionary to be used to create a team name column based on the team code at the end of the URLs above

In [6]:
team_name_dict = {'bos':'Boston Celtics',
                  'bkn':'Brooklyn Nets',
                  'nyk':'New York Knicks',
                  'phi':'Philadelphia 76ers',
                  'tor':'Toronto Raptors',
                  'gsw':'Golden State Warriors',
                  'lac':'Los Angeles Clippers',
                  'lal':'Los Angeles Lakers',
                  'pho':'Phoenix Suns',
                  'sac':'Sacramento Kings',
                  'chi':'Chicago Bulls',
                  'cle':'Cleveland Cavaliers',
                  'det':'Detroit Pistons',
                  'ind':'Indiana Pacers',
                  'mil':'Milwaukee Bucks',
                  'dal':'Dallas Mavericks',
                  'hou':'Houston Rockets',
                  'mem':'Memphis Grizzlies',
                  'nor':'New Orleans Pelicans',
                  'sas':'San Antonio Spurs',
                  'atl':'Atlanta Hawks',
                  'cha':'Charlotte Hornets',
                  'mia':'Miami Heat',
                  'orl':'Orlando Magic',
                  'was':'Washington Wizards',
                  'den':'Denver Nuggets',
                  'min':'Minnesota Timberwolves',
                  'okc':'Oklahoma City Thunder',
                  'por':'Portland Trail Blazers',
                  'uth':'Utah Jazz'
                  }

### Getting game stats and shooting stats for a specific team (Cleveland Cavaliers)

In [7]:
url_team = 'http://espn.go.com/nba/teams/stats?team=cle'
team_code = url_team[-3:]
html_team = requests.get(url_team)

soup_team = BeautifulSoup(html_team.text, 'lxml')

# Grab all HTML tr elements with class containing the word 'player'
roster = soup_team.find_all('tr', class_=re.compile('player'))
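
Slicing the last three characters with `url_team[-3:]` works only as long as the team code is exactly three characters long and sits at the very end of the URL.  A possible alternative, sketched below, reads the `team` query parameter with the standard library instead.  The helper name `team_code_from_url` is my own invention for this illustration.

```python
from urllib.parse import urlparse, parse_qs

def team_code_from_url(url):
    """Return the value of the 'team' query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each parameter name to a list of values.
    return query.get('team', [None])[0]

team_code = team_code_from_url('http://espn.go.com/nba/teams/stats?team=cle')
print(team_code)
```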

### The first (top) half of the web page contains the game statistics; the second (bottom) half contains the shooting statistics

In [9]:
# Grab the top half of the data, which contains the game stats
roster_game_stats = roster[:len(roster)//2]

# Grab the bottom half of the data, which contains the shooting stats
roster_shooting_stats = roster[-(len(roster)//2):]

### Let's examine the game statistics data:

In [10]:
for player in roster_game_stats:
    print(player.get_text())

LeBron James, SF131340.932.50.87.28.07.02.151.383.92.31.8
Kyrie Irving, PG131334.824.50.42.02.45.61.380.542.51.72.3
Kevin Love, PF131332.117.21.49.010.41.90.770.852.02.51.0
Tristan Thompson, C131333.09.24.25.19.31.00.460.771.12.00.9
Channing Frye, PF11013.07.80.01.71.71.10.270.360.21.06.0
JR Smith, SG131326.26.60.32.32.60.80.770.310.62.11.4
Kyle Korver, SG13017.76.40.21.71.80.80.460.310.40.92.2
Deron Williams, PG13015.55.60.21.01.22.50.620.080.91.62.7
Iman Shumpert, SG12017.44.70.42.73.11.00.580.250.31.83.0
Richard Jefferson, SF9010.62.90.60.91.40.60.110.220.41.41.3
Derrick Williams, PF505.62.80.20.20.40.60.000.200.60.01.0
Dahntay Jones, SG703.31.00.00.40.40.10.000.140.11.01.0
James Jones, SG504.20.40.20.40.60.00.000.000.40.60.0


### Let's examine the shooting statistics data:

In [11]:
for player in roster_shooting_stats:
    print(player.get_text())

LeBron James, SF11.620.5.5662.55.8.4216.89.60.719.114.7.6231.5840.63
Kyrie Irving, PG8.819.0.4662.46.7.3564.54.90.916.412.3.5251.2910.53
Kevin Love, PF5.311.6.4572.96.2.4753.64.20.852.45.4.4371.4770.58
Tristan Thompson, C3.25.4.6000.00.0.0002.84.20.673.25.4.6001.7140.60
Channing Frye, PF2.75.0.5451.83.5.5260.50.60.860.91.5.5881.5640.73
JR Smith, SG2.34.8.4841.73.8.4490.30.60.500.61.0.6151.3870.66
Kyle Korver, SG2.14.7.4431.74.1.4150.50.51.000.40.6.6251.3610.62
Deron Williams, PG2.03.7.5420.81.7.5000.80.80.911.22.0.5771.5210.66
Iman Shumpert, SG1.83.6.4880.71.4.4710.50.60.861.12.2.5001.3020.58
Richard Jefferson, SF0.92.2.4000.41.1.4000.71.10.600.51.1.4001.3000.50
Derrick Williams, PF1.21.8.6670.40.6.6670.00.00.000.81.2.6671.5560.78
Dahntay Jones, SG0.41.0.4290.00.0.0000.10.11.000.41.0.4291.0000.43
James Jones, SG0.20.8.2500.00.4.0000.00.00.000.20.4.500.5000.25


### Now we're ready to create a list of the game stats.  I noticed that player ID numbers are also available.

In [12]:
players = []
for row in roster_game_stats:
    for data in row:
        players.append(data.get_text())
        
player_ids = [player.a['href'].split('/')[7] for player in roster_game_stats]

### Let's see what the list of player game stats looks like:

In [13]:
for player in players:
    print(player)

LeBron James, SF
13
13
40.9
32.5
0.8
7.2
8.0
7.0
2.15
1.38
3.9
2.3
1.8
Kyrie Irving, PG
13
13
34.8
24.5
0.4
2.0
2.4
5.6
1.38
0.54
2.5
1.7
2.3
Kevin Love, PF
13
13
32.1
17.2
1.4
9.0
10.4
1.9
0.77
0.85
2.0
2.5
1.0
Tristan Thompson, C
13
13
33.0
9.2
4.2
5.1
9.3
1.0
0.46
0.77
1.1
2.0
0.9
Channing Frye, PF
11
0
13.0
7.8
0.0
1.7
1.7
1.1
0.27
0.36
0.2
1.0
6.0
JR Smith, SG
13
13
26.2
6.6
0.3
2.3
2.6
0.8
0.77
0.31
0.6
2.1
1.4
Kyle Korver, SG
13
0
17.7
6.4
0.2
1.7
1.8
0.8
0.46
0.31
0.4
0.9
2.2
Deron Williams, PG
13
0
15.5
5.6
0.2
1.0
1.2
2.5
0.62
0.08
0.9
1.6
2.7
Iman Shumpert, SG
12
0
17.4
4.7
0.4
2.7
3.1
1.0
0.58
0.25
0.3
1.8
3.0
Richard Jefferson, SF
9
0
10.6
2.9
0.6
0.9
1.4
0.6
0.11
0.22
0.4
1.4
1.3
Derrick Williams, PF
5
0
5.6
2.8
0.2
0.2
0.4
0.6
0.00
0.20
0.6
0.0
1.0
Dahntay Jones, SG
7
0
3.3
1.0
0.0
0.4
0.4
0.1
0.00
0.14
0.1
1.0
1.0
James Jones, SG
5
0
4.2
0.4
0.2
0.4
0.6
0.0
0.00
0.00
0.4
0.6
0.0


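### But I would like to add the player ID to the list, so I will insert an ID before each player's name.  The offsets below assume 15 values per player (the name plus 14 stats, the last one being PER), which is what the remaining outputs in this notebook show: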

In [11]:
index = 0  # insert the player ID before the player's name
increment = 0
for player_id in player_ids:
    players.insert(index + increment, player_id)
    index = index + 15
    increment = increment + 1
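
The offset arithmetic is easier to follow on a toy list.  In the sketch below (made-up data, 3 values per player instead of 15), `index` tracks where each player starts in the original list, and `increment` accounts for the IDs that have already been inserted ahead of that spot.

```python
# Toy data: two "players" with 3 values each instead of 15.
fields_per_player = 3
players = ['A, SF', '1', '2', 'B, PG', '3', '4']
player_ids = ['id_a', 'id_b']

index = 0      # start of the current player in the ORIGINAL list
increment = 0  # number of IDs inserted so far
for player_id in player_ids:
    players.insert(index + increment, player_id)
    index = index + fields_per_player
    increment = increment + 1

print(players)
# ['id_a', 'A, SF', '1', '2', 'id_b', 'B, PG', '3', '4']
```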

### Let's see if the player IDs got added before each player's name

In [12]:
for player in players:
    print(player)

1966
LeBron James, SF
29
29
37.5
25.2
0.7
4.6
5.3
7.6
1.34
0.79
3.8
1.7
2.0
25.0
6442
Kyrie Irving, PG
30
30
38.2
20.8
0.5
2.6
3.1
5.3
1.43
0.30
2.1
2.2
2.5
19.8
3449
Kevin Love, PF
31
31
35.8
16.7
1.9
8.2
10.1
2.4
0.77
0.45
1.8
2.3
1.3
17.9
6628
Dion Waiters, SG
31
3
23.5
10.3
0.3
1.4
1.6
2.2
1.29
0.32
1.5
1.8
1.4
12.3
2419
Anderson Varejao, C
26
26
24.5
9.8
2.2
4.3
6.5
1.3
0.73
0.62
1.3
2.2
1.0
17.6
6474
Tristan Thompson, PF
32
5
27.6
9.3
3.7
4.1
7.8
0.6
0.41
0.84
1.0
2.5
0.6
16.4
510
Shawn Marion, SF
30
22
23.1
5.6
1.0
2.5
3.5
1.1
0.63
0.73
0.7
1.1
1.5
10.6
2489716
Matthew Dellavedova, SG
17
5
20.9
4.2
0.6
1.2
1.9
2.8
0.35
0.12
1.1
2.5
2.5
6.1
2009
James Jones, SF
15
0
10.5
4.1
0.1
1.3
1.4
0.7
0.27
0.20
0.1
1.1
11.0
14.7
558
Mike Miller, SF
21
8
17.0
3.0
0.2
2.1
2.3
1.0
0.29
0.10
0.4
2.0
2.4
6.1
2528794
Joe Harris, SG
27
0
10.7
2.8
0.2
0.8
1.0
0.6
0.11
0.04
0.4
1.3
1.3
6.4
1000
Brendan Haywood, C
8
1
6.4
2.4
0.5
1.4
1.9
0.0
0.13
0.75
0.6
1.1
0.0
12.7
2489897
Will Cherry, PG
8
0
8.6


### As you can see from the output above, the player IDs were added, but I would also like to add the team name since I went to the trouble of making that dictionary earlier:

In [13]:
index = 2  # insert the team name after the player's name
increment = 0
for _ in player_ids:
    players.insert(index + increment, team_name_dict[team_code])
    index = index + 16  # since we added the player ID, there are now 16 columns instead of 15
    increment = increment + 1
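
If the offset bookkeeping feels fragile, another way to get the same result is to build one list per player from the start.  The sketch below is only an illustration of that idea: `build_rows` is a hypothetical helper, and `stat_rows` is a shortened, hand-typed stand-in for the cells scraped from each table row.

```python
def build_rows(player_ids, team_name, stat_rows):
    """Return one list per player: [id, name_pos, team_name, stat, ...]."""
    rows = []
    for player_id, fields in zip(player_ids, stat_rows):
        name_pos, stats = fields[0], fields[1:]
        rows.append([player_id, name_pos, team_name] + stats)
    return rows

# Hypothetical, shortened input: each inner list is one player's cells.
stat_rows = [['LeBron James, SF', '29', '29', '37.5'],
             ['Kyrie Irving, PG', '30', '30', '38.2']]

rows = build_rows(['1966', '6442'], 'Cleveland Cavaliers', stat_rows)
for row in rows:
    print(row)
```

With this shape there is nothing to count: the ID and team name always land in the first and third positions, no matter how many stat columns the page has.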

### Let's double-check to see team name was added

In [14]:
for player in players:
    print(player)

1966
LeBron James, SF
Cleveland Cavaliers
29
29
37.5
25.2
0.7
4.6
5.3
7.6
1.34
0.79
3.8
1.7
2.0
25.0
6442
Kyrie Irving, PG
Cleveland Cavaliers
30
30
38.2
20.8
0.5
2.6
3.1
5.3
1.43
0.30
2.1
2.2
2.5
19.8
3449
Kevin Love, PF
Cleveland Cavaliers
31
31
35.8
16.7
1.9
8.2
10.1
2.4
0.77
0.45
1.8
2.3
1.3
17.9
6628
Dion Waiters, SG
Cleveland Cavaliers
31
3
23.5
10.3
0.3
1.4
1.6
2.2
1.29
0.32
1.5
1.8
1.4
12.3
2419
Anderson Varejao, C
Cleveland Cavaliers
26
26
24.5
9.8
2.2
4.3
6.5
1.3
0.73
0.62
1.3
2.2
1.0
17.6
6474
Tristan Thompson, PF
Cleveland Cavaliers
32
5
27.6
9.3
3.7
4.1
7.8
0.6
0.41
0.84
1.0
2.5
0.6
16.4
510
Shawn Marion, SF
Cleveland Cavaliers
30
22
23.1
5.6
1.0
2.5
3.5
1.1
0.63
0.73
0.7
1.1
1.5
10.6
2489716
Matthew Dellavedova, SG
Cleveland Cavaliers
17
5
20.9
4.2
0.6
1.2
1.9
2.8
0.35
0.12
1.1
2.5
2.5
6.1
2009
James Jones, SF
Cleveland Cavaliers
15
0
10.5
4.1
0.1
1.3
1.4
0.7
0.27
0.20
0.1
1.1
11.0
14.7
558
Mike Miller, SF
Cleveland Cavaliers
21
8
17.0
3.0
0.2
2.1
2.3
1.0
0.29
0.10
0.4
2.

### Great, looks like the team name was added to the list.  But there is a problem.  I want to insert this data into a database.  Right now I have one huge flat list, but I need it structured so that the data for each player is grouped in its own list.

### I searched Stack Overflow for a [solution](http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python) and found one using a generator:

In [15]:
# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python
def chunks(l, n):
    """ Yield successive n-sized chunks from l.
    """
    for i in range(0, len(l), n):
        yield l[i:i+n]
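
One thing worth knowing about this generator: if the list length is not a multiple of `n`, the last chunk is simply shorter.  The toy example below shows that behavior, along with one possible way to keep only the complete chunks.

```python
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i+n]

flat = list(range(7))
groups = list(chunks(flat, 3))
print(groups)
# [[0, 1, 2], [3, 4, 5], [6]]

# Keep only complete chunks, e.g. before inserting rows into a table.
complete = [group for group in groups if len(group) == 3]
print(complete)
# [[0, 1, 2], [3, 4, 5]]
```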

### Let's double-check if that worked:

In [16]:
from pprint import pprint

for row in chunks(players,17):
    pprint(row)

['1966',
 'LeBron James, SF',
 'Cleveland Cavaliers',
 '29',
 '29',
 '37.5',
 '25.2',
 '0.7',
 '4.6',
 '5.3',
 '7.6',
 '1.34',
 '0.79',
 '3.8',
 '1.7',
 '2.0',
 '25.0']
['6442',
 'Kyrie Irving, PG',
 'Cleveland Cavaliers',
 '30',
 '30',
 '38.2',
 '20.8',
 '0.5',
 '2.6',
 '3.1',
 '5.3',
 '1.43',
 '0.30',
 '2.1',
 '2.2',
 '2.5',
 '19.8']
['3449',
 'Kevin Love, PF',
 'Cleveland Cavaliers',
 '31',
 '31',
 '35.8',
 '16.7',
 '1.9',
 '8.2',
 '10.1',
 '2.4',
 '0.77',
 '0.45',
 '1.8',
 '2.3',
 '1.3',
 '17.9']
['6628',
 'Dion Waiters, SG',
 'Cleveland Cavaliers',
 '31',
 '3',
 '23.5',
 '10.3',
 '0.3',
 '1.4',
 '1.6',
 '2.2',
 '1.29',
 '0.32',
 '1.5',
 '1.8',
 '1.4',
 '12.3']
['2419',
 'Anderson Varejao, C',
 'Cleveland Cavaliers',
 '26',
 '26',
 '24.5',
 '9.8',
 '2.2',
 '4.3',
 '6.5',
 '1.3',
 '0.73',
 '0.62',
 '1.3',
 '2.2',
 '1.0',
 '17.6']
['6474',
 'Tristan Thompson, PF',
 'Cleveland Cavaliers',
 '32',
 '5',
 '27.6',
 '9.3',
 '3.7',
 '4.1',
 '7.8',
 '0.6',
 '0.41',
 '0.84',
 '1.0',
 '2.5',
 

### Sweet!  As you can see from above, each player's data is in its own list.  Now we can just loop through the main list and insert the data from the nested lists into the sqlite database:

In [21]:
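import sqlite3

conn = sqlite3.connect('/home/pybokeh/databases/nba')
c = conn.cursor()

for row in chunks(players, 17):
    try:
        c.execute('INSERT INTO player_game_stats VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)', row)
    except sqlite3.Error:
        # Skip rows sqlite rejects (e.g. a duplicate id or an incomplete row)
        pass

# Commit once after all rows have been processed
conn.commit()
conn.close()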
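
A variation on the insert step, sketched below under a few assumptions, is to let sqlite skip duplicate IDs itself with `INSERT OR IGNORE` and to send all rows at once with `executemany`.  The `demo_stats` table and the in-memory database are hypothetical stand-ins so that the example runs on its own.

```python
import sqlite3

# Assumption: in-memory database and a shortened, hypothetical table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo_stats (id INTEGER PRIMARY KEY NOT NULL, '
             'name_pos TEXT NOT NULL, PPG REAL NOT NULL)')

rows = [['1966', 'LeBron James, SF', '25.2'],
        ['6442', 'Kyrie Irving, PG', '20.8'],
        ['1966', 'LeBron James, SF', '25.2']]  # duplicate id on purpose

# "with conn" commits once on success and rolls back if an error is raised.
with conn:
    # OR IGNORE silently skips rows that violate the primary key constraint.
    conn.executemany('INSERT OR IGNORE INTO demo_stats VALUES (?,?,?)', rows)

count = conn.execute('SELECT COUNT(*) FROM demo_stats').fetchone()[0]
print(count)
# 2
conn.close()
```

This keeps re-runs from failing on players that are already in the table, while other kinds of errors still surface instead of being silently swallowed.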

<a id="game_stats"></a>

# Below is all the code in one cell for getting player game stats:

[[back to top](#top)]

In [None]:
import requests
from bs4 import BeautifulSoup
import sqlite3
import re

base_url = 'http://espn.go.com'

teams_url = 'http://espn.go.com/nba/teams'
html_teams = requests.get(teams_url)

soup_teams = BeautifulSoup(html_teams.text, 'lxml')

urls = soup_teams.find_all(href=re.compile('/nba/teams/stats'))

team_urls = [base_url+url['href'] for url in urls]

team_name_dict = {'bos':'Boston Celtics',
                  'bkn':'Brooklyn Nets',
                  'nyk':'New York Knicks',
                  'phi':'Philadelphia 76ers',
                  'tor':'Toronto Raptors',
                  'gsw':'Golden State Warriors',
                  'lac':'Los Angeles Clippers',
                  'lal':'Los Angeles Lakers',
                  'pho':'Phoenix Suns',
                  'sac':'Sacramento Kings',
                  'chi':'Chicago Bulls',
                  'cle':'Cleveland Cavaliers',
                  'det':'Detroit Pistons',
                  'ind':'Indiana Pacers',
                  'mil':'Milwaukee Bucks',
                  'dal':'Dallas Mavericks',
                  'hou':'Houston Rockets',
                  'mem':'Memphis Grizzlies',
                  'nor':'New Orleans Pelicans',
                  'sas':'San Antonio Spurs',
                  'atl':'Atlanta Hawks',
                  'cha':'Charlotte Hornets',
                  'mia':'Miami Heat',
                  'orl':'Orlando Magic',
                  'was':'Washington Wizards',
                  'den':'Denver Nuggets',
                  'min':'Minnesota Timberwolves',
                  'okc':'Oklahoma City Thunder',
                  'por':'Portland Trail Blazers',
                  'uth':'Utah Jazz'
                  }

# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python
def chunks(l, n):
    """ Yield successive n-sized chunks from l.
    """
    for i in range(0, len(l), n):
        yield l[i:i+n]

for team in team_urls:
    team_code = team[-3:]
    html_team = requests.get(team)

    soup_team = BeautifulSoup(html_team.text, 'lxml')

    roster = soup_team.find_all('tr', class_=re.compile('player'))
    roster_game_stats = roster[:int(len(roster)/2)]
    #roster_shooting_stats = roster[-int(len(roster)/2):]
    
    players = []
    for row in roster_game_stats:
        for data in row:
            players.append(data.get_text())
        
    player_ids = [player.a['href'].split('/')[7] for player in roster_game_stats]
    
    # Insert each player's ID before the player's name
    index = 0
    increment = 0
    for player_id in player_ids:
        players.insert(index + increment, player_id)
        index = index + 15
        increment = increment + 1

    # Insert the team name after each player's name
    index = 2
    increment = 0
    for _ in player_ids:
        players.insert(index + increment, team_name_dict[team_code])
        index = index + 16
        increment = increment + 1

    conn = sqlite3.connect('/home/pybokeh/databases/nba')
    c = conn.cursor()

    for row in chunks(players, 17):
        try:
            c.execute('INSERT INTO player_game_stats VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)', row)
        except sqlite3.Error:
            # Skip rows sqlite rejects (e.g. a duplicate id or an incomplete row)
            pass
    conn.commit()
    conn.close()
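
Since this loop requests one page per team in quick succession, it may be worth pausing between requests, in the spirit of the ethical scraping article linked at the top.  Below is a rough sketch of that idea.  `fetch_all` and `fake_fetch` are hypothetical names of my own; the fake fetcher only exists so that the example runs without touching the network.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    pages = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # be polite: pause between requests
        pages[url] = fetch(url)
    return pages

def fake_fetch(url):
    # Stand-in so the sketch runs offline.  With requests, this could be
    # something like: return requests.get(url, timeout=10).text
    return '<html>' + url[-3:] + '</html>'

pages = fetch_all(['http://espn.go.com/nba/teams/stats?team=bos',
                   'http://espn.go.com/nba/teams/stats?team=cle'],
                  fake_fetch, delay_seconds=0)
print(pages)
```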

<a id="shooting_stats"></a>

# Below is all the code in one cell for getting player shooting stats:

[[back to top](#top)]

In [None]:
import requests
from bs4 import BeautifulSoup
import sqlite3
import re

base_url = 'http://espn.go.com'

teams_url = 'http://espn.go.com/nba/teams'
html_teams = requests.get(teams_url)

soup_teams = BeautifulSoup(html_teams.text, 'lxml')

urls = soup_teams.find_all(href=re.compile('/nba/teams/stats'))

team_urls = [base_url+url['href'] for url in urls]

team_name_dict = {'bos':'Boston Celtics',
                  'bkn':'Brooklyn Nets',
                  'nyk':'New York Knicks',
                  'phi':'Philadelphia 76ers',
                  'tor':'Toronto Raptors',
                  'gsw':'Golden State Warriors',
                  'lac':'Los Angeles Clippers',
                  'lal':'Los Angeles Lakers',
                  'pho':'Phoenix Suns',
                  'sac':'Sacramento Kings',
                  'chi':'Chicago Bulls',
                  'cle':'Cleveland Cavaliers',
                  'det':'Detroit Pistons',
                  'ind':'Indiana Pacers',
                  'mil':'Milwaukee Bucks',
                  'dal':'Dallas Mavericks',
                  'hou':'Houston Rockets',
                  'mem':'Memphis Grizzlies',
                  'nor':'New Orleans Pelicans',
                  'sas':'San Antonio Spurs',
                  'atl':'Atlanta Hawks',
                  'cha':'Charlotte Hornets',
                  'mia':'Miami Heat',
                  'orl':'Orlando Magic',
                  'was':'Washington Wizards',
                  'den':'Denver Nuggets',
                  'min':'Minnesota Timberwolves',
                  'okc':'Oklahoma City Thunder',
                  'por':'Portland Trail Blazers',
                  'uth':'Utah Jazz'
                  }

# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python
def chunks(l, n):
    """ Yield successive n-sized chunks from l.
    """
    for i in range(0, len(l), n):
        yield l[i:i+n]

for team in team_urls:
    team_code = team[-3:]
    html_team = requests.get(team)

    soup_team = BeautifulSoup(html_team.text, 'lxml')

    roster = soup_team.find_all('tr', class_=re.compile('player'))
    #roster_game_stats = roster[:int(len(roster)/2)]
    roster_shooting_stats = roster[-int(len(roster)/2):]
    
    players = []
    for row in roster_shooting_stats:
        for data in row:
            players.append(data.get_text())
        
    player_ids = [player.a['href'].split('/')[7] for player in roster_shooting_stats]
    
    # Insert each player's ID before the player's name
    index = 0
    increment = 0
    for player_id in player_ids:
        players.insert(index + increment, player_id)
        index = index + 15
        increment = increment + 1

    # Insert the team name after each player's name
    index = 2
    increment = 0
    for _ in player_ids:
        players.insert(index + increment, team_name_dict[team_code])
        index = index + 16
        increment = increment + 1

    conn = sqlite3.connect('/home/pybokeh/databases/nba')
    c = conn.cursor()

    for row in chunks(players, 17):
        try:
            c.execute('INSERT INTO player_shooting_stats VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)', row)
        except sqlite3.Error:
            # Skip rows sqlite rejects (e.g. a duplicate id or an incomplete row)
            pass
    conn.commit()
    conn.close()

## Below are the database table definitions used to create the player game stats table and player shooting stats table in sqlite:

CREATE TABLE "player_game_stats" (
    "id" INTEGER PRIMARY KEY NOT NULL,
    "name_pos" TEXT NOT NULL,
    "team_name" TEXT NOT NULL,
    "GP" INTEGER NOT NULL,
    "GS" INTEGER NOT NULL,
    "MIN" REAL NOT NULL,
    "PPG" REAL NOT NULL,
    "OFFR" REAL NOT NULL,
    "DEFR" REAL NOT NULL,
    "RPG" REAL NOT NULL,
    "APG" REAL NOT NULL,
    "SPG" REAL NOT NULL,
    "BPG" REAL NOT NULL,
    "TPG" REAL NOT NULL,
    "FPG" REAL NOT NULL,
    "A2TO" REAL NOT NULL,
    "PER" REAL NOT NULL
);

CREATE TABLE "player_shooting_stats" (
    "id" INTEGER PRIMARY KEY NOT NULL,
    "name_pos" TEXT NOT NULL,
    "team_name" TEXT NOT NULL,
    "FGM" REAL NOT NULL,
    "FGA" REAL NOT NULL,
    "FG_Perc" REAL NOT NULL,
    "3PM" REAL NOT NULL,
    "3PA" REAL NOT NULL,
    "3P_Perc" REAL NOT NULL,
    "FTM" REAL NOT NULL,
    "FTA" REAL NOT NULL,
    "FT_Perc" REAL NOT NULL,
    "2PM" REAL NOT NULL,
    "2PA" REAL NOT NULL,
    "2P_Perc" REAL NOT NULL,
    "PPS" REAL NOT NULL,
    "AFG_Perc" REAL NOT NULL
);

## OK, now that I have the data in a sqlite database, let's dig in!

<a id="db_py"></a>

<center><h1>Using Yhat's <a href="http://blog.yhathq.com/posts/introducing-db-py.html">db.py</a> to query the NBA sqlite database</h1></center>

[[back to top](#top)]

### If you are a fan of the [pandas](http://pandas.pydata.org/) library, you should definitely check out db.py since it is tightly integrated with pandas.  On top of that, db.py's API is nice and easy to use.

In [21]:
from db import DB
import pandas as pd

db = DB(filename="/home/pybokeh/databases/nba", dbtype="sqlite")

Indexing schema. This will take a second...finished!
Refreshing schema. Please wait...done!


### What if you have no idea what's in this sqlite database?  Like what tables does it contain?

First, let's see what attributes and methods db has, using Python's built-in dir() function:

In [3]:
dir(db)

['__class__',
 '__delattr__',
 '__delete__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_assign_limit',
 '_create_sqlite_metatable',
 '_query_templates',
 '_try_command',
 'con',
 'cur',
 'dbname',
 'dbtype',
 'filename',
 'find_column',
 'find_table',
 'hostname',
 'keys_per_column',
 'limit',
 'load_credentials',
 'password',
 'port',
 'query',
 'query_from_file',
 'refresh_schema',
 'save_credentials',
 'schemas',
 'tables',
 'to_redshift',
 'username']

### Hmmmm, "tables" looks like something I may be interested in.  Let's check it out.

In [4]:
help(db.tables)

Help on TableSet in module db.db object:

class TableSet(builtins.object)
 |  Set of Tables. Used for displaying search results in terminal/ipython notebook.
 |  
 |  Methods defined here:
 |  
 |  __getitem__(self, i)
 |  
 |  __init__(self, tables)
 |  
 |  __repr__(self)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)



In [2]:
db.tables

Table,Columns
player_game_stats,"id, name_pos, team_name, GP, GS, MIN, PPG, OFFR, DEFR, RPG, APG, SPG, BPG, TPG, FPG, A2TO, PER"
player_shooting_stats,"id, name_pos, team_name, FGM, FGA, FG_Perc, 3PM, 3PA, 3P_Perc, FTM, FTA, FT_Perc , 2PM, 2PA, 2P_Perc, PPS, AFG_Perc"


### Nice!  db.tables tells me what tables are inside the sqlite database.
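
For comparison, sqlite can report the same information without db.py, through its `sqlite_master` table and the `PRAGMA table_info` statement.  The sketch below assumes an in-memory database with a shortened, hypothetical `demo_stats` table; point `sqlite3.connect` at the real database file to inspect the NBA tables.

```python
import sqlite3

# Assumption: in-memory database with a shortened, hypothetical table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo_stats '
             '(id INTEGER PRIMARY KEY, name_pos TEXT, PPG REAL)')

schema = {}
table_names = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
for (table_name,) in table_names:
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    columns = conn.execute(
        'PRAGMA table_info("{}")'.format(table_name)).fetchall()
    schema[table_name] = [column[1] for column in columns]

print(schema)
# {'demo_stats': ['id', 'name_pos', 'PPG']}
conn.close()
```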

### Let's now query the database.  But first, let's configure the pandas display options so that we can view all of our results.

In [None]:
pd.set_option("display.max_columns",50)
pd.set_option("display.max_rows",999)

### Now, let's return rows from our game stats table for just Cleveland Cavaliers players and order the results by PPG (points per game) in descending order.  Check the ESPN [site](http://espn.go.com/nba/team/stats/_/name/cle/cleveland-cavaliers) and see if the results are similar.  If they have played games since I last ran the script, the numbers will be different.

In [5]:
db.query("select * from player_game_stats where team_name like '%Cav%' order by PPG desc;")

Unnamed: 0,id,name_pos,team_name,GP,GS,MIN,PPG,OFFR,DEFR,RPG,APG,SPG,BPG,TPG,FPG,A2TO,PER
0,1966,"LeBron James, SF",Cleveland Cavaliers,29,29,37.5,25.2,0.7,4.6,5.3,7.6,1.34,0.79,3.8,1.7,2.0,25.0
1,6442,"Kyrie Irving, PG",Cleveland Cavaliers,30,30,38.2,20.8,0.5,2.6,3.1,5.3,1.43,0.3,2.1,2.2,2.5,19.8
2,3449,"Kevin Love, PF",Cleveland Cavaliers,31,31,35.8,16.7,1.9,8.2,10.1,2.4,0.77,0.45,1.8,2.3,1.3,17.9
3,6628,"Dion Waiters, SG",Cleveland Cavaliers,31,3,23.5,10.3,0.3,1.4,1.6,2.2,1.29,0.32,1.5,1.8,1.4,12.3
4,2419,"Anderson Varejao, C",Cleveland Cavaliers,26,26,24.5,9.8,2.2,4.3,6.5,1.3,0.73,0.62,1.3,2.2,1.0,17.6
5,6474,"Tristan Thompson, PF",Cleveland Cavaliers,32,5,27.6,9.3,3.7,4.1,7.8,0.6,0.41,0.84,1.0,2.5,0.6,16.4
6,510,"Shawn Marion, SF",Cleveland Cavaliers,30,22,23.1,5.6,1.0,2.5,3.5,1.1,0.63,0.73,0.7,1.1,1.5,10.6
7,2489716,"Matthew Dellavedova, SG",Cleveland Cavaliers,17,5,20.9,4.2,0.6,1.2,1.9,2.8,0.35,0.12,1.1,2.5,2.5,6.1
8,2009,"James Jones, SF",Cleveland Cavaliers,15,0,10.5,4.1,0.1,1.3,1.4,0.7,0.27,0.2,0.1,1.1,11.0,14.7
9,558,"Mike Miller, SF",Cleveland Cavaliers,21,8,17.0,3.0,0.2,2.1,2.3,1.0,0.29,0.1,0.4,2.0,2.4,6.1
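
The same kind of query can also be run with the plain sqlite3 module, which is handy when the search pattern comes from a variable.  The sketch below uses a `?` placeholder for the `LIKE` pattern.  The `demo_stats` table is a shortened stand-in, and the third row is made-up data included only to show the filter at work.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo_stats (name_pos TEXT, team_name TEXT, PPG REAL)')
conn.executemany('INSERT INTO demo_stats VALUES (?,?,?)', [
    ('Kevin Love, PF', 'Cleveland Cavaliers', 16.7),
    ('LeBron James, SF', 'Cleveland Cavaliers', 25.2),
    ('Some Player, PG', 'Some Other Team', 10.0),  # made-up row
])

# The pattern is passed separately, so it never has to be pasted into the SQL.
pattern = '%Cav%'
result = conn.execute(
    'SELECT name_pos, PPG FROM demo_stats '
    'WHERE team_name LIKE ? ORDER BY PPG DESC', (pattern,)).fetchall()
print(result)
# [('LeBron James, SF', 25.2), ('Kevin Love, PF', 16.7)]
conn.close()
```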


### Now, let's do the same for the player shooting stats table and order the results by FG_Perc (field goal percentage) in descending order.

In [6]:
db.query("select * from player_shooting_stats where team_name like '%Cav%' order by FG_Perc desc;")

Unnamed: 0,id,name_pos,team_name,FGM,FGA,FG_Perc,3PM,3PA,3P_Perc,FTM,FTA,FT_Perc,2PM,2PA,2P_Perc,PPS,AFG_Perc
0,2419,"Anderson Varejao, C",Cleveland Cavaliers,4.3,7.7,0.555,0.0,0.1,0.0,1.3,1.7,0.73,4.3,7.6,0.561,1.275,0.56
1,6474,"Tristan Thompson, PF",Cleveland Cavaliers,3.7,6.8,0.542,0.0,0.0,0.0,2.0,3.2,0.64,3.7,6.8,0.542,1.384,0.54
2,1000,"Brendan Haywood, C",Cleveland Cavaliers,1.1,2.1,0.529,0.0,0.0,0.0,0.1,0.4,0.33,1.1,2.1,0.529,1.118,0.53
3,1966,"LeBron James, SF",Cleveland Cavaliers,8.8,18.1,0.488,1.7,4.5,0.369,5.9,7.9,0.74,7.1,13.6,0.527,1.392,0.53
4,6442,"Kyrie Irving, PG",Cleveland Cavaliers,7.5,16.0,0.467,1.7,4.7,0.357,4.2,5.0,0.84,5.8,11.3,0.512,1.302,0.52
5,510,"Shawn Marion, SF",Cleveland Cavaliers,2.4,5.3,0.45,0.4,1.2,0.314,0.5,0.6,0.78,2.0,4.1,0.488,1.056,0.48
6,3449,"Kevin Love, PF",Cleveland Cavaliers,5.4,12.7,0.427,1.6,4.7,0.342,4.3,5.1,0.84,3.8,8.0,0.478,1.321,0.49
7,6628,"Dion Waiters, SG",Cleveland Cavaliers,4.1,9.9,0.41,0.7,2.5,0.282,1.5,1.9,0.79,3.4,7.4,0.454,1.042,0.45
8,2528794,"Joe Harris, SG",Cleveland Cavaliers,1.0,2.4,0.406,0.7,1.7,0.383,0.2,0.3,0.63,0.3,0.7,0.471,1.172,0.55
9,2009,"James Jones, SF",Cleveland Cavaliers,1.3,3.3,0.38,1.3,3.1,0.413,0.3,0.4,0.83,0.0,0.2,0.0,1.24,0.57
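
One gotcha with this table: column names that start with a digit, such as 3PM and 2PA, have to be wrapped in double quotes when you name them in a query.  The small sketch below shows the difference, using a hypothetical `demo_shooting` table in an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo_shooting '
             '("name_pos" TEXT, "3PM" REAL, "3PA" REAL)')
conn.execute('INSERT INTO demo_shooting VALUES (?,?,?)',
             ('LeBron James, SF', 1.7, 4.5))

# Without quotes, sqlite cannot parse 3PM as a column name.
try:
    conn.execute('SELECT 3PM FROM demo_shooting')
    unquoted_worked = True
except sqlite3.Error as err:
    unquoted_worked = False
    print('Unquoted query failed:', err)

# With double quotes, the same columns can be selected normally.
threes = conn.execute('SELECT "3PM", "3PA" FROM demo_shooting').fetchone()
print(threes)
# (1.7, 4.5)
conn.close()
```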


In [7]:
type(db.query("select * from player_game_stats limit 5;"))

pandas.core.frame.DataFrame

## The result set of the query is a pandas DataFrame!  Very nice!

In [10]:
df = db.query("select * from player_game_stats where team_name like '%Cav%' order by PPG desc;")

In [11]:
df

Unnamed: 0,id,name_pos,team_name,GP,GS,MIN,PPG,OFFR,DEFR,RPG,APG,SPG,BPG,TPG,FPG,A2TO,PER
0,1966,"LeBron James, SF",Cleveland Cavaliers,29,29,37.5,25.2,0.7,4.6,5.3,7.6,1.34,0.79,3.8,1.7,2.0,25.0
1,6442,"Kyrie Irving, PG",Cleveland Cavaliers,30,30,38.2,20.8,0.5,2.6,3.1,5.3,1.43,0.3,2.1,2.2,2.5,19.8
2,3449,"Kevin Love, PF",Cleveland Cavaliers,31,31,35.8,16.7,1.9,8.2,10.1,2.4,0.77,0.45,1.8,2.3,1.3,17.9
3,6628,"Dion Waiters, SG",Cleveland Cavaliers,31,3,23.5,10.3,0.3,1.4,1.6,2.2,1.29,0.32,1.5,1.8,1.4,12.3
4,2419,"Anderson Varejao, C",Cleveland Cavaliers,26,26,24.5,9.8,2.2,4.3,6.5,1.3,0.73,0.62,1.3,2.2,1.0,17.6
5,6474,"Tristan Thompson, PF",Cleveland Cavaliers,32,5,27.6,9.3,3.7,4.1,7.8,0.6,0.41,0.84,1.0,2.5,0.6,16.4
6,510,"Shawn Marion, SF",Cleveland Cavaliers,30,22,23.1,5.6,1.0,2.5,3.5,1.1,0.63,0.73,0.7,1.1,1.5,10.6
7,2489716,"Matthew Dellavedova, SG",Cleveland Cavaliers,17,5,20.9,4.2,0.6,1.2,1.9,2.8,0.35,0.12,1.1,2.5,2.5,6.1
8,2009,"James Jones, SF",Cleveland Cavaliers,15,0,10.5,4.1,0.1,1.3,1.4,0.7,0.27,0.2,0.1,1.1,11.0,14.7
9,558,"Mike Miller, SF",Cleveland Cavaliers,21,8,17.0,3.0,0.2,2.1,2.3,1.0,0.29,0.1,0.4,2.0,2.4,6.1


In [18]:
df.head() # grab just the top 5 rows

Unnamed: 0,id,name_pos,team_name,GP,GS,MIN,PPG,OFFR,DEFR,RPG,APG,SPG,BPG,TPG,FPG,A2TO,PER
0,1966,"LeBron James, SF",Cleveland Cavaliers,29,29,37.5,25.2,0.7,4.6,5.3,7.6,1.34,0.79,3.8,1.7,2.0,25.0
1,6442,"Kyrie Irving, PG",Cleveland Cavaliers,30,30,38.2,20.8,0.5,2.6,3.1,5.3,1.43,0.3,2.1,2.2,2.5,19.8
2,3449,"Kevin Love, PF",Cleveland Cavaliers,31,31,35.8,16.7,1.9,8.2,10.1,2.4,0.77,0.45,1.8,2.3,1.3,17.9
3,6628,"Dion Waiters, SG",Cleveland Cavaliers,31,3,23.5,10.3,0.3,1.4,1.6,2.2,1.29,0.32,1.5,1.8,1.4,12.3
4,2419,"Anderson Varejao, C",Cleveland Cavaliers,26,26,24.5,9.8,2.2,4.3,6.5,1.3,0.73,0.62,1.3,2.2,1.0,17.6


### On average, how many minutes do the Cav players play per game?

In [20]:
df.MIN.mean()

18.731249999999999

I know that is not very interesting, but I just wanted to show that we can now leverage the pandas library.  There are so many things you can do with pandas to analyze data that I recommend just checking out the [documentation](http://pandas.pydata.org/pandas-docs/stable/cookbook.html).
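
Since the whole point of this exercise is learning SQL, here is roughly how the same average could be computed in SQL with the `AVG` function.  The sketch uses a hypothetical `demo_stats` table holding just three of the rows shown above, so the number it prints will not match the full-roster figure.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo_stats ("name_pos" TEXT, "MIN" REAL)')
conn.executemany('INSERT INTO demo_stats VALUES (?,?)', [
    ('LeBron James, SF', 37.5),
    ('Kyrie Irving, PG', 38.2),
    ('Kevin Love, PF', 35.8),
])

# The column name is quoted so it cannot be confused with the MIN() function.
average_minutes = conn.execute(
    'SELECT AVG("MIN") FROM demo_stats').fetchone()[0]
print(round(average_minutes, 2))
# 37.17
conn.close()
```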

# UPDATE: I've added a ["Part 2"](http://nbviewer.ipython.org/github/pybokeh/ipython_notebooks/blob/master/web_scraping/NBA_Regular_Season_Stats.ipynb) where I show how to scrape the individualized regular season stats for every NBA player.

## If you don't want the step-by-step breakdown, here's the "all-in-one" notebook [version](http://nbviewer.ipython.org/github/pybokeh/ipython_notebooks/blob/master/web_scraping/NBA_Web_Scraping_All_In_One.ipynb).

[[back to top](#top)]