# Web Scraping with BeautifulSoup & Requests

In this tutorial, we will go through a basic example of scraping a webpage using Requests and BeautifulSoup. 

First, a quick background on HTML. HTML stands for HyperText Markup Language, and it is the language used to structure and display the content of a web page. 

### HTML

Let's look at two examples of HTML code:

![](extras/html.png)

or...

![](extras/html2.png)

This is the basic syntax of an HTML web page. Each `<tag>` marks a block within the page: 
1. `<!DOCTYPE html>` : HTML documents must start with a type declaration. 
2. The HTML document is contained between `<html>` and `</html>`. 
3. Metadata and script declarations for the HTML document go between `<head>` and `</head>`. 
4. The visible part of the HTML document is between `<body>` and `</body>`. 
5. Title headings are defined with the `<h1>` to `<h6>` tags. 
6. Paragraphs are defined with the `<p>` tag.

There are other useful tags, such as `<a>` for hyperlinks and `<table>` for tables, with `<tr>` for table rows and `<td>` for the individual data cells within a row.

HTML tags often carry `id` and `class` attributes. The `id` attribute gives a tag an identifier that must be unique within the HTML document. The `class` attribute groups tags that share the same styling, so many tags can have the same class. We can use these ids and classes to locate the data we want.
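
To make the structure concrete, here is a minimal sketch. The HTML snippet, its `id` and `class` values, and the variable names are all invented for illustration. It uses BeautifulSoup (covered later in this tutorial) to look up one tag by `id` and several tags by `class`.

```python
from bs4 import BeautifulSoup

# A made-up page, only for illustration
html = """<!DOCTYPE html>
<html>
  <head><title>Example page</title></head>
  <body>
    <h1 id="main-title">Box Scores</h1>
    <p class="note">First note.</p>
    <p class="note">Second note.</p>
    <a href="https://example.com">A link</a>
  </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# An id is unique, so find() returns the single matching tag
title = soup.find(id="main-title").get_text()

# A class can be shared, so find_all() returns every matching tag
notes = [p.get_text() for p in soup.find_all("p", class_="note")]

print(title)
print(notes)
```

Note that the keyword is `class_` with a trailing underscore, because `class` is a reserved word in Python.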

### Requests

The Python Requests library allows us to download web pages. Using its `get` function, we can download the HTML contents of a page.  

As an example, let's take a look at a website containing the box score of a Warriors game:

In [2]:
import requests

page = requests.get('http://www.basketball-reference.com/boxscores/201611230GSW.html')

After running our request, we get a `Response` object. This object has a `status_code` attribute, which indicates whether the page was downloaded successfully:

In [5]:
print(page)
print(page.status_code)

<Response [200]>
200


A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.  
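
As a rough sketch of how this check might look in code, the helper names `status_family` and `fetch_page` below are hypothetical, and the 10-second timeout is an arbitrary choice. `raise_for_status()` is the Requests method that raises an exception for 4xx and 5xx responses.

```python
import requests


def status_family(status_code):
    """Return a coarse category for an HTTP status code, based on its first digit."""
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return families.get(status_code // 100, "unknown")


def fetch_page(url, timeout=10):
    """Download a page, raising requests.HTTPError for a 4xx or 5xx response."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response


print(status_family(200))
print(status_family(404))
```

Calling `fetch_page` with the box score URL would return the same kind of `Response` object as `page` above, but it would stop with an error instead of silently handing back an error page.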

We can also get the contents of the page using the `content` attribute, which holds the raw bytes of the response. We will not print it here for the sake of saving space.

### BeautifulSoup

Using BeautifulSoup, we can extract content from the page using different tags found within the page.

In [10]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

When we inspect the page's source code, we can see that the CSV version of the box score appears to be located under `id="csv_box_lal_basic"`, so we should be able to use that id to obtain the actual box score.

![](extras/box_score.png)

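As it turns out, this doesn't work (see below). The question is: why? A likely explanation is that the element is generated by JavaScript in the browser, so it is absent from the raw HTML that Requests downloads.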

In [291]:
dd = soup.find_all(id="csv_box_lal_basic")
dd

[]
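
One way to investigate an empty result is to check whether the id appears in the downloaded HTML at all, since BeautifulSoup can only find what is present in the text it was given. The sketch below uses a short made-up string in place of the real page (the inner `box_lal_basic` table id is an assumption), so it only illustrates the check and says nothing definitive about the live site.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; the real page is far larger
raw_html = '<div id="div_box_lal_basic"><table id="box_lal_basic"></table></div>'
soup = BeautifulSoup(raw_html, "html.parser")

results = {}
for element_id in ["csv_box_lal_basic", "div_box_lal_basic"]:
    in_raw_html = element_id in raw_html          # plain substring check
    matches = soup.find_all(id=element_id)        # empty list when nothing matches
    results[element_id] = (in_raw_html, len(matches))
    print(element_id, in_raw_html, len(matches))
```

On the real page, `page.text` would play the role of `raw_html`. If an id is missing there, no BeautifulSoup query will find it.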

But there is an alternative route through the HTML. We can find an individual team's box score using a different `id`:

In [182]:
box = soup.find_all(id="div_box_lal_basic")

# Note: find_all returns a BeautifulSoup ResultSet; convert it to a string so that we can search it with regular expressions
box = str(box)

If we open up the box object as a text file, we can start to glean information about how the data is encoded. For example, we can get all the headers by doing a regex search for text starting with `col">`. 

In [281]:
import re

headers = re.findall('col">(.+?)<',box)

headers

['Starters',
 'MP',
 'FG',
 'FGA',
 'FG%',
 '3P',
 '3PA',
 '3P%',
 'FT',
 'FTA',
 'FT%',
 'ORB',
 'DRB',
 'TRB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS',
 '+/-']
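
As an aside, the same kind of header list can be collected without regular expressions by asking BeautifulSoup for `<th>` tags with `scope="col"`. This is a sketch on a made-up three-column table modeled on the page's markup, not a drop-in replacement for the cell above.

```python
from bs4 import BeautifulSoup

# A tiny made-up table in the style of the box score markup
html = """
<table>
  <tr><th colspan="3">Basic Box Score Stats</th></tr>
  <tr>
    <th data-stat="player" scope="col">Starters</th>
    <th data-stat="mp" scope="col">MP</th>
    <th data-stat="fg" scope="col">FG</th>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the header cells that are marked as column headers
header_names = [th.get_text() for th in soup.find_all("th", attrs={"scope": "col"})]
print(header_names)
```

The over-header cell ("Basic Box Score Stats") is skipped here because it has no `scope="col"` attribute in this toy markup.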

Looking closer at the box object, we can see that the column headers _and_ the stats we seek are located inside tags that carry a `data-stat` attribute:

![](extras/data_stat.png)

Therefore, we can collect all of our data using regular expressions.

In [274]:
stats = re.findall('data-stat=.+?>(.+?)<',box)

stats[:23]

['Basic Box Score Stats',
 'Starters',
 'MP',
 'FG',
 'FGA',
 'FG%',
 '3P',
 '3PA',
 '3P%',
 'FT',
 'FTA',
 'FT%',
 'ORB',
 'DRB',
 'TRB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS',
 '+/-',
 '<a href="/players/i/ingrabr01.html">Brandon Ingram']

We need to clean up the data. First, let's delete the first item in the list, since that's just the table name:

In [275]:
del stats[0]

Currently, our player names contain leftover HTML markup, so we will want to clean that up with regular expressions.

In [276]:
for i, m in enumerate(stats):
    # Keep only the text that follows the player's link, if there is one
    match = re.search('html">(.+)', str(m))
    if match:
        stats[i] = match.group(1)
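
To see why the raw matches contain markup, it helps to run the `data-stat` pattern on a single row. The row below is made up in the style of the page. Because `(.+?)` must capture at least one character, a linked name is captured together with its opening `<a ...>` tag, and an empty cell is captured as `</td>`. The `re.sub` step is one possible way to strip such fragments, offered as an alternative sketch rather than the method used in this tutorial.

```python
import re

# A made-up table row in the style of the page's markup
row = (
    '<tr><th data-stat="player" scope="row">'
    '<a href="/players/i/ingrabr01.html">Brandon Ingram</a></th>'
    '<td data-stat="fg3">0</td>'
    '<td data-stat="fg3_pct"></td>'
    '<td data-stat="ft">2</td></tr>'
)

# Same pattern as above: capture whatever follows a tag carrying data-stat
raw_cells = re.findall('data-stat=.+?>(.+?)<', row)

# Remove anything that looks like an HTML tag from each captured value
clean_cells = [re.sub(r'<[^>]*>', '', cell) for cell in raw_cells]

print(raw_cells)
print(clean_cells)
```

After this cleanup, the empty percentage cell becomes an empty string instead of `</td>`, while the number of cells per row stays the same.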

Next, let's delete the entries for any player who did not dress for the game, since they appear in the table without data, and their presence would throw off the alignment of our table.

In [277]:
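# Positions of every 'Did Not Dress' marker in the list
a = [i for i, s in enumerate(stats) if s == 'Did Not Dress']

# Each marker is preceded by the player's name, so drop both entries
b = set()
for i in a:
    b.add(i)
    b.add(i-1)

# Delete from the end so that earlier indices stay valid
for index in sorted(b, reverse=True):
    del stats[index]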

Now we want to iterate through the list and group (i.e. subset into separate lists) the stats by their proper rows. 

In [278]:
stats_clean = [stats[i:i+21] for i in range(0, len(stats), 21)]
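
The slicing step can be sketched on a small list. The helper name `chunk_rows` is hypothetical, and the row length comes from the number of headers instead of the literal 21. The length check is an extra safeguard not present in the cell above: if the flat list does not divide evenly, some row is misaligned.

```python
def chunk_rows(flat_values, row_length):
    """Split a flat list into consecutive rows of row_length items each."""
    if len(flat_values) % row_length != 0:
        raise ValueError("flat list does not divide evenly into rows")
    return [flat_values[i:i + row_length]
            for i in range(0, len(flat_values), row_length)]


# A shortened example with three columns instead of 21
headers = ['Starters', 'MP', 'FG']
flat = headers + ['Brandon Ingram', '34:24', '6', 'Luol Deng', '26:02', '3']

rows = chunk_rows(flat, len(headers))
print(rows)
```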

Now our stats are clean and ready to go. Our last step for this exercise is to put the stats into a Pandas DataFrame so that they resemble a true, clean box score. Although there is still cleanup to be done (removing the repeated header row for the reserves, as well as cleaning up percentages for non-shooters), we have done the majority of the work in relatively few steps.

In [279]:
import pandas as pd

pd.DataFrame(stats_clean[1:], columns=stats_clean[0])

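| | Starters | MP | FG | FGA | FG% | 3P | 3PA | 3P% | FT | FTA | ... | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | +/- |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brandon Ingram | 34:24 | 6 | 12 | .500 | 2 | 4 | .500 | 2 | 2 | ... | 0 | 3 | 3 | 1 | 0 | 0 | 3 | 0 | 16 | -30 |
| 1 | Luol Deng | 26:02 | 3 | 9 | .333 | 0 | 1 | .000 | 2 | 2 | ... | 2 | 5 | 7 | 2 | 0 | 0 | 0 | 1 | 8 | -11 |
| 2 | Jose Calderon | 19:59 | 3 | 7 | .429 | 1 | 3 | .333 | 0 | 0 | ... | 2 | 2 | 4 | 6 | 0 | 0 | 2 | 1 | 7 | -23 |
| 3 | Nick Young | 18:55 | 3 | 8 | .375 | 2 | 3 | .667 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 8 | -27 |
| 4 | Timofey Mozgov | 17:24 | 5 | 7 | .714 | 0 | 0 | `</td>` | 2 | 2 | ... | 1 | 3 | 4 | 1 | 0 | 0 | 1 | 5 | 12 | -4 |
| 5 | Reserves | MP | FG | FGA | FG% | 3P | 3PA | 3P% | FT | FTA | ... | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | +/- |
| 6 | Jordan Clarkson | 29:24 | 7 | 16 | .438 | 0 | 2 | .000 | 2 | 3 | ... | 1 | 2 | 3 | 2 | 1 | 0 | 1 | 2 | 16 | -20 |
| 7 | Lou Williams | 21:43 | 6 | 11 | .545 | 1 | 2 | .500 | 3 | 5 | ... | 0 | 0 | 0 | 2 | 1 | 0 | 2 | 1 | 16 | -10 |
| 8 | Larry Nance Jr. | 18:26 | 1 | 2 | .500 | 0 | 0 | `</td>` | 2 | 2 | ... | 0 | 7 | 7 | 1 | 0 | 1 | 1 | 4 | 4 | -14 |
| 9 | Tarik Black | 15:28 | 2 | 9 | .222 | 0 | 1 | .000 | 5 | 6 | ... | 1 | 4 | 5 | 0 | 1 | 0 | 0 | 4 | 9 | -27 |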
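
The remaining cleanup could look roughly like the sketch below, shown on a shortened three-column version of the table. The choice of `pd.to_numeric` with `errors="coerce"` is an assumption about how one might handle the stray `</td>` values, not something taken from the original notebook.

```python
import pandas as pd

# A shortened stand-in for stats_clean, with only three columns
rows = [
    ["Starters", "MP", "3P%"],
    ["Brandon Ingram", "34:24", ".500"],
    ["Timofey Mozgov", "17:24", "</td>"],
    ["Reserves", "MP", "3P%"],
    ["Jordan Clarkson", "29:24", ".000"],
]
df = pd.DataFrame(rows[1:], columns=rows[0])

# Drop the repeated header row that introduces the reserves
df = df[df["Starters"] != "Reserves"].reset_index(drop=True)

# Turn percentage strings into numbers; anything unparseable becomes NaN
df["3P%"] = pd.to_numeric(df["3P%"], errors="coerce")

print(df)
```

A player with no three-point attempts then shows up as a missing value (NaN) in `3P%` instead of a leftover HTML fragment.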


#### Slightly different approach

Instead of finding the individual team box score using the `id`, we can use the `tr` tag, which stands for "table row", to look up the table elements directly. Here, we are looking at only the first five header cells of the second table row.

In [287]:
soup.findAll('tr', limit=2)[1].findAll('th')[:5]

[<th aria-label="Starters" class=" poptip sort_default_asc left" data-stat="player" scope="col">Starters</th>,
 <th aria-label="Minutes Played" class=" poptip right" data-over-header="Basic Box Score Stats" data-stat="mp" data-tip="Minutes Played" scope="col">MP</th>,
 <th aria-label="Field Goals" class=" poptip right" data-over-header="Basic Box Score Stats" data-stat="fg" data-tip="Field Goals" scope="col">FG</th>,
 <th aria-label="Field Goal Attempts" class=" poptip right" data-over-header="Basic Box Score Stats" data-stat="fga" data-tip="Field Goal Attempts" scope="col">FGA</th>,
 <th aria-label="Field Goal Percentage" class=" poptip right" data-over-header="Basic Box Score Stats" data-stat="fg_pct" data-tip="Field Goal Percentage" scope="col">FG%</th>]

We can now use this knowledge to get the table headers of the box score, just as we did before.

In [46]:
column_headers = [th.getText() for th in 
                  soup.findAll('tr', limit=2)[1].findAll('th')]
column_headers

[u'Starters',
 u'MP',
 u'FG',
 u'FGA',
 u'FG%',
 u'3P',
 u'3PA',
 u'3P%',
 u'FT',
 u'FTA',
 u'FT%',
 u'ORB',
 u'DRB',
 u'TRB',
 u'AST',
 u'STL',
 u'BLK',
 u'TOV',
 u'PF',
 u'PTS',
 u'+/-']

Using regular expressions, we can again get the player names.

In [96]:
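# Assumption: data_rows is not defined earlier in this tutorial; here we take it
# to be every table row after the two header rows
data_rows = soup.findAll('tr')[2:]

players = []
for i in range(len(data_rows)):
    player = str(data_rows[i].findAll('th')[0])
    players.append(re.findall('html">(.+?)<',player))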

Now we do the same with the data. Note that in this method, player names and player data are separated automatically, because the names sit in `<th>` cells while the stats sit in `<td>` cells, and we only collect the `<td>` cells here.

In [51]:
player_data = [[td.getText() for td in data_rows[i].findAll('td')]
            for i in range(len(data_rows))]

It's very easy to combine player names with their data.

In [288]:
data = []
for i in range(len(players)):
    data.append(players[i]+player_data[i])
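
The row-by-row approach can be sketched on a tiny made-up table. This variant reads the player name with `get_text()` instead of a regular expression, which is a swapped-in simplification and not the method used in the cells above. The second player's link target is a placeholder.

```python
from bs4 import BeautifulSoup

# A small made-up table in the style of the box score markup
html = """
<table>
  <tr><th scope="col">Starters</th><th scope="col">MP</th><th scope="col">FG</th></tr>
  <tr><th scope="row"><a href="/players/i/ingrabr01.html">Brandon Ingram</a></th><td>34:24</td><td>6</td></tr>
  <tr><th scope="row"><a href="#">Luol Deng</a></th><td>26:02</td><td>3</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table_rows = soup.find_all("tr")

# The first row holds the column headers
column_headers = [th.get_text() for th in table_rows[0].find_all("th")]

data = []
for tr in table_rows[1:]:
    name = tr.find("th").get_text()                        # get_text() drops the <a> markup
    values = [td.get_text() for td in tr.find_all("td")]   # stats live in <td> cells
    data.append([name] + values)

print(column_headers)
print(data)
```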

Finally, just as we did before, we load the data into a Pandas DataFrame. Unfortunately, with this method we also picked up rows from a couple of other box scores on the web page (the advanced box scores), so we need to use the `dropna()` method to get rid of them. 

In [125]:
import pandas as pd

df = pd.DataFrame(data, columns=column_headers)

df.dropna()

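| | Starters | MP | FG | FGA | FG% | 3P | 3PA | 3P% | FT | FTA | ... | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | +/- |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brandon Ingram | 34:24 | 6 | 12 | 0.5 | 2 | 4 | 0.5 | 2 | 2 | ... | 0 | 3 | 3 | 1 | 0 | 0 | 3 | 0 | 16 | -30 |
| 1 | Luol Deng | 26:02 | 3 | 9 | 0.333 | 0 | 1 | 0.0 | 2 | 2 | ... | 2 | 5 | 7 | 2 | 0 | 0 | 0 | 1 | 8 | -11 |
| 2 | Jose Calderon | 19:59 | 3 | 7 | 0.429 | 1 | 3 | 0.333 | 0 | 0 | ... | 2 | 2 | 4 | 6 | 0 | 0 | 2 | 1 | 7 | -23 |
| 3 | Nick Young | 18:55 | 3 | 8 | 0.375 | 2 | 3 | 0.667 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 8 | -27 |
| 4 | Timofey Mozgov | 17:24 | 5 | 7 | 0.714 | 0 | 0 | | 2 | 2 | ... | 1 | 3 | 4 | 1 | 0 | 0 | 1 | 5 | 12 | -4 |
| 6 | Jordan Clarkson | 29:24 | 7 | 16 | 0.438 | 0 | 2 | 0.0 | 2 | 3 | ... | 1 | 2 | 3 | 2 | 1 | 0 | 1 | 2 | 16 | -20 |
| 7 | Lou Williams | 21:43 | 6 | 11 | 0.545 | 1 | 2 | 0.5 | 3 | 5 | ... | 0 | 0 | 0 | 2 | 1 | 0 | 2 | 1 | 16 | -10 |
| 8 | Larry Nance Jr. | 18:26 | 1 | 2 | 0.5 | 0 | 0 | | 2 | 2 | ... | 0 | 7 | 7 | 1 | 0 | 1 | 1 | 4 | 4 | -14 |
| 9 | Tarik Black | 15:28 | 2 | 9 | 0.222 | 0 | 1 | 0.0 | 5 | 6 | ... | 1 | 4 | 5 | 0 | 1 | 0 | 0 | 4 | 9 | -27 |
| 10 | Metta World Peace | 14:17 | 2 | 4 | 0.5 | 2 | 4 | 0.5 | 0 | 0 | ... | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 3 | 6 | -15 |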
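
The reason `dropna()` helps here can be seen on a small made-up example: when a row has fewer values than there are column headers, Pandas pads it with missing values, and `dropna()` then removes any row that contains one. The three-column data below is illustrative only.

```python
import pandas as pd

column_headers = ["Starters", "MP", "FG"]
data = [
    ["Brandon Ingram", "34:24", "6"],
    ["MP", "FG"],                      # one value short, so it gets padded with a missing value
    ["Jordan Clarkson", "29:24", "7"],
]

df = pd.DataFrame(data, columns=column_headers)

# dropna() returns a new DataFrame and leaves df itself unchanged
clean = df.dropna()

print(df)
print(clean)
```

Note that the original index labels are kept, which is why the table above jumps from row 4 to row 6.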


## References

- https://www.dataquest.io/blog/web-scraping-tutorial-python/
- http://savvastjortjoglou.com/nba-draft-part01-scraping.html
- http://altitudelabs.com/blog/web-scraping-with-python-and-beautiful-soup/