
In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

<img width=50% src="https://images.unsplash.com/photo-1542831371-29b0f74f9713?ixid=MXwxMjA3fDB8MHxzZWFyY2h8MXx8aHRtbHxlbnwwfHwwfA%3D%3D&ixlib=rb-1.2.1&auto=format&fit=crop&w=800&q=60" />

# Web Scraping with BeautifulSoup

## Learning Goals:  

- Parse HTML and CSS elements in webpages
- Use `requests` and `BeautifulSoup` to get and process webpage contents
- Use ethics when scraping websites

## Context

We have already developed many ways of interacting with data.  We are able to:

* import .csv files (and other .csv-like objects)
* gather data from APIs and interact with JSON objects
* query SQL databases

There is publicly available data all over the internet ripe for scraping, whether that be artist information from Wikipedia, song lyrics from songlyrics.com, or texts of famous books from Project Gutenberg. Below, we will learn how to navigate HTML and CSS to bring data onto our local computers.

## The components of a web page

When we visit a web page, our web browser makes a GET request to a web server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

- HTML: contains the main content of the page.
- CSS: adds styling to make the page look nicer.
- JS: JavaScript files add interactivity to web pages.
- Images: image formats such as JPG and PNG allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping.

### HTML

HyperText Markup Language (HTML) is the language that web pages are written in. HTML isn’t a programming language like Python; instead, it’s a markup language that tells a browser how to lay out content.

Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the `<html>` tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:

~~~html
<html>
</html>
~~~

Right inside an html tag, we put two other tags, the head tag, and the body tag. The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:

~~~html
<html>
    <head>
    </head>
    <body>
    </body>
</html>
~~~

We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:
~~~html
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>
~~~


Tags have commonly used names that depend on their position in relation to other tags:

- **child**: a child is a tag inside another tag. So the two `p` tags above are both children of the `body` tag.
- **parent**: a parent is the tag that another tag is inside. Above, the `html` tag is the parent of the `body` tag.
- **sibling**: a sibling is a tag that is nested inside the same parent as another tag. For example, `head` and `body` are siblings, since they’re both inside `html`. Both `p` tags are siblings, since they’re both inside `body`.
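
These relationships can also be explored in code. The sketch below is only an illustration: it uses the BeautifulSoup library (introduced later in this lesson) with Python's built-in `html.parser`, and inlines a trimmed copy of the two-paragraph page so that it runs offline.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two-paragraph page, inlined so the sketch runs offline.
html_doc = """
<html>
    <head></head>
    <body>
        <p>Here's a paragraph of text!</p>
        <p>Here's a second paragraph of text!</p>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
first_p = soup.find("p")

# parent: the tag that directly contains this one
parent_name = first_p.parent.name

# sibling: the next <p> tag that shares the same parent
sibling_text = first_p.find_next_sibling("p").get_text()

# children: the tags nested directly inside <body>
child_names = [child.name for child in soup.body.find_all(True, recursive=False)]

print(parent_name)    # body
print(sibling_text)   # Here's a second paragraph of text!
print(child_names)    # ['p', 'p']
```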

We can also add properties (called attributes in HTML) to tags to change their behavior:

~~~html
<html>
  <head></head>
  <body>
    <p>
      Here's a paragraph of text!
      <a href="https://www.dataquest.io">Learn Data Science Online</a>
    </p>
    <p>
      Here's a second paragraph of text!
      <a href="https://www.python.org">Python</a>        
    </p>
  </body>
</html>
~~~


In the above example, we added two `a` tags. An `a` tag is a link: it tells the browser to render a link to another web page. The `href` property of the tag determines where the link goes.
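
When scraping, the `href` value is often the piece we want. As a minimal sketch (assuming BeautifulSoup with the built-in `html.parser`, and an inlined copy of the two paragraphs), tag properties can be read much like dictionary keys:

```python
from bs4 import BeautifulSoup

# Inlined copy of the two paragraphs with links, so the sketch runs offline.
html_doc = """
<p>
  Here's a paragraph of text!
  <a href="https://www.dataquest.io">Learn Data Science Online</a>
</p>
<p>
  Here's a second paragraph of text!
  <a href="https://www.python.org">Python</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# .get("href") returns the attribute value, or None if the tag has no href.
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(links)
# [('Learn Data Science Online', 'https://www.dataquest.io'), ('Python', 'https://www.python.org')]
```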

`a` and `p` are extremely common HTML tags. Here are a few others:

- *div*: indicates a division, or area, of the page.
- *b*: bolds any text inside.
- *i*: italicizes any text inside.
- *u*: underlines any text inside.
- *table*: creates a table.
- *form*: creates an input form.


For a full list of tags, look [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

<html>
    <b>bold</b> <br/>
    <i>italics</i> <br/>
    <u>underlining</u>
</html>

There are two special properties that give HTML elements names, and make them easier to interact with when we’re scraping: **class** and **id**. 

- One element can have multiple classes, and a class can be shared between elements. 
- Each element can only have one id, and an id can only be used once on a page. 
- Classes and ids are optional, and not all elements will have them.

We can add classes and ids to our example:

~~~html
<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org" class="extra-large">Python</a>
        </p>
    </body>
</html>
~~~
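
To see how these names look from the scraping side, here is a small sketch (assuming BeautifulSoup with the built-in `html.parser`, and a trimmed inline copy of the example). Because an element can carry several classes, the `class` value comes back as a list, while `id` is a single string.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the example with classes and ids.
html_doc = """
<p class="bold-paragraph">
    <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
</p>
<p class="bold-paragraph extra-large">
    <a href="https://www.python.org" class="extra-large">Python</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
second_p = soup.find_all("p")[1]

classes = second_p["class"]       # a list, since one element can have many classes
link_id = soup.find("a")["id"]    # a single string
missing_id = second_p.get("id")   # None: classes and ids are optional

print(classes)     # ['bold-paragraph', 'extra-large']
print(link_id)     # learn-link
print(missing_id)  # None
```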


### Let's take a look at a sample web page from our local computer

- Open up example.com in your browser.
- Open up the inspector
    - Mac: cmd+option+c
    - Windows: ctrl+shift+c
- Click on the elements tab, and click on an element

## Make a sample webpage

Use [this site](https://htmledit.squarefree.com/) to see your page as you edit the underlying html.

## Webscraping with Python

### The requests library

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library (similar to interacting with APIs!).

In [2]:
req = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

After running our request, we get a `Response` object. This object has a `status_code` property, which indicates whether the page was downloaded successfully. A status code of 200 means the request succeeded:

In [3]:
req.status_code

200
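
Rather than inspecting the status code by eye, a script can fail loudly when a download goes wrong. The helper below is a hypothetical sketch, not part of the original lesson: the name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text.

    Hypothetical helper: raises requests.HTTPError for 4xx/5xx responses
    instead of silently returning an error page.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # does nothing when the request succeeded
    return response.text


# Example call (requires network access):
# html = fetch_html("http://dataquestio.github.io/web-scraping-pages/simple.html")
```

Raising an exception early keeps a failed download from being parsed as if it were the real page.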

We can print out the HTML content of the page, as raw bytes, using the `content` property:

In [4]:
req.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
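
One of our learning goals is to scrape ethically. A common first check before downloading many pages from a site is its `robots.txt` file, which states what automated clients may fetch. The sketch below uses the standard library's `urllib.robotparser`; the rules and URLs are made up for illustration and inlined so that the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the sketch runs offline.
# For a real site, call parser.set_url(".../robots.txt") and parser.read() instead.
sample_rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_rules.splitlines())

# can_fetch(useragent, url) reports whether the rules allow fetching the URL.
allowed_public = parser.can_fetch("*", "https://example.com/index.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/data.html")

print(allowed_public)   # True
print(allowed_private)  # False
```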

### Parsing a page with BeautifulSoup

We can use the BeautifulSoup library to parse this document, and extract the text from the `<p>` tag. 

In [6]:
soup = BeautifulSoup(req.content)
list(soup.children)

['html',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [10]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the `children` property of `soup`.

In [11]:
list(soup.children)

['html',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [12]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.Tag]

The `Tag` object allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various `BeautifulSoup` objects [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects).

We can now select the html tag and its children by taking the second item in the list:



In [13]:
html = list(soup.children)[1]
html

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

The `html` item we just selected is a `Tag` object, so it has its own `children` property.

Now we can find the `children` inside the `html` tag:

In [14]:
list(html.children)

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']

As you can see above, there are two tags here, `head` and `body`, separated by newline strings. We want to extract the text inside the `p` tag, so we’ll dive into the body, which is the fourth item in the list:

In [15]:
body = list(html.children)[3]

In [16]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']
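
The `'\n'` entries in these lists are whitespace strings between tags, which is why the tags we want sit at odd positions. One way to avoid counting positions is to keep only the `Tag` children. This is a minimal sketch on an inlined copy of the page, assuming the built-in `html.parser`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inlined copy of the simple example page, so the sketch runs offline.
html_doc = """<html>
<head><title>A simple example page</title></head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag children; the whitespace strings between tags are skipped.
tag_children = [child for child in soup.html.children if isinstance(child, Tag)]
tag_names = [child.name for child in tag_children]
print(tag_names)  # ['head', 'body']
```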

We can now isolate the `p` tag:



In [17]:
p = list(body.children)[1]

In [19]:
type(p)

bs4.element.Tag

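Once we’ve isolated the tag, we can use the `get_text` method to extract all of the text inside the tag. Note that writing `p.get_text` without parentheses only returns the method itself; we have to call it to get the text: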

In [20]:
p.get_text

<bound method Tag.get_text of <p>Here is some simple content for this page.</p>>

In [21]:
p.get_text()

'Here is some simple content for this page.'
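
The same walk can be written more compactly. BeautifulSoup lets us use a tag name as an attribute, which returns the first tag of that name beneath the current one. The sketch below assumes the built-in `html.parser` and an inlined copy of the page:

```python
from bs4 import BeautifulSoup

# Inlined copy of the simple example page, so the sketch runs offline.
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# soup.body.p means: the first <p> inside the first <body>.
text = soup.body.p.get_text()
title = soup.title.get_text()

print(text)   # Here is some simple content for this page.
print(title)  # A simple example page
```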

### Finding all instances of a tag at once

Navigating the tree one level at a time is tedious when we only want a particular tag. Instead, we can use the `find_all` method, which finds all the instances of a tag on a page.

In [22]:
soup = BeautifulSoup(req.content)
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

Note that `find_all` returns a list, so we’ll have to loop through it or use list indexing to extract text:

In [23]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'

If you instead only want to find the first instance of a tag, you can use the `find` method, which returns a single `Tag` object:

In [24]:
soup.find('p')

<p>Here is some simple content for this page.</p>
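
The two methods also behave differently when nothing matches, which matters on real pages where an element may be absent. A small sketch on a made-up inline page, assuming the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Made-up page with two paragraphs and no table.
html_doc = "<html><body><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Looping over find_all collects the text of every match.
paragraphs = [p.get_text() for p in soup.find_all("p")]

# With no match, find_all returns an empty list and find returns None.
tables = soup.find_all("table")
first_table = soup.find("table")

# Guard against None before calling methods on the result of find.
table_text = first_table.get_text() if first_table is not None else ""

print(paragraphs)   # ['First', 'Second']
print(len(tables))  # 0
print(first_table)  # None
```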

### Searching for tags by class and id

We introduced classes and ids earlier, but it probably wasn’t clear why they were useful. CSS uses classes and ids to determine which HTML elements to apply certain styles to. We can also use them when scraping to pick out the specific elements we want. Let’s look at the following page:

~~~html
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
        <p class="outer-text first-item" id="second">
            <b>
                First outer paragraph.
            </b>
        </p>
        <p class="outer-text">
            <b>
                Second outer paragraph.
            </b>
        </p>
    </body>
</html>
~~~

In [25]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content)
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Now, we can use the `find_all` method to search for items by class or by id. In the below example, we’ll search for any `p` tag that has the class `outer-text`:

In [26]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In the below example, we’ll look for any tag that has the class `outer-text` and take the first match:



In [29]:
soup.find_all(class_="outer-text")[0]

<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>

We can also search for elements by `id`:


In [30]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
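
Since classes and ids come from CSS, the same searches can also be written as CSS selectors with BeautifulSoup's `select` and `select_one` methods. The sketch below runs on a condensed inline copy of the page and assumes the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Condensed inline copy of the ids-and-classes page.
html_doc = """
<div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# "p.outer-text": every <p> tag with the class outer-text
outer = [p.get_text() for p in soup.select("p.outer-text")]

# "#first": the element whose id is "first"
first = soup.select_one("#first").get_text()

# "div p": every <p> nested inside a <div>
inner = [p.get_text() for p in soup.select("div p")]

print(outer)  # ['First outer paragraph.', 'Second outer paragraph.']
print(first)  # First paragraph.
print(inner)  # ['First paragraph.', 'Second paragraph.']
```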

### More sophisticated webpages

In [31]:
url = 'https://forecast.weather.gov/MapClick.php?lat=41.8843&lon=-87.6324#.XdPlJUVKg6g'
request = requests.get(url)
soup = BeautifulSoup(request.content)

In [33]:
times = soup.find_all(class_='period-name')
times

[<p class="period-name">This<br/>Afternoon</p>,
 <p class="period-name">Tonight<br/><br/></p>,
 <p class="period-name">Wednesday<br/><br/></p>,
 <p class="period-name">Wednesday<br/>Night</p>,
 <p class="period-name">Thursday<br/><br/></p>,
 <p class="period-name">Thursday<br/>Night</p>,
 <p class="period-name">Friday<br/><br/></p>,
 <p class="period-name">Friday<br/>Night</p>,
 <p class="period-name">Saturday<br/><br/></p>]

In [34]:
descs = soup.find_all(class_='short-desc')
descs

[<p class="short-desc">Isolated<br/>T-storms and<br/>Breezy</p>,
 <p class="short-desc">T-storms<br/>Likely</p>,
 <p class="short-desc">Slight Chance<br/>T-storms then<br/>Partly Sunny</p>,
 <p class="short-desc">Chance<br/>Showers</p>,
 <p class="short-desc">Chance<br/>Showers</p>,
 <p class="short-desc">Mostly Clear</p>,
 <p class="short-desc">Sunny</p>,
 <p class="short-desc">Mostly Clear</p>,
 <p class="short-desc">Mostly Sunny</p>]

In [35]:
len(times) == len(descs)

True

In [38]:
descs[0].text

'IsolatedT-storms andBreezy'

In [37]:
type(zip(times, descs))

zip

In [39]:
together = [(entry[0].text, entry[1].text) for entry in zip(times, descs)]
together

[('ThisAfternoon', 'IsolatedT-storms andBreezy'),
 ('Tonight', 'T-stormsLikely'),
 ('Wednesday', 'Slight ChanceT-storms thenPartly Sunny'),
 ('WednesdayNight', 'ChanceShowers'),
 ('Thursday', 'ChanceShowers'),
 ('ThursdayNight', 'Mostly Clear'),
 ('Friday', 'Sunny'),
 ('FridayNight', 'Mostly Clear'),
 ('Saturday', 'Mostly Sunny')]
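
The pairs above run words together ('ThisAfternoon') because the `<br/>` tags contribute no text of their own. One possible fix, sketched here on two inlined tags from the forecast output with the built-in `html.parser`, is to pass a separator to `get_text`:

```python
from bs4 import BeautifulSoup

# Two tags copied from the forecast output above, inlined so this runs offline.
html_doc = """
<p class="period-name">This<br/>Afternoon</p>
<p class="short-desc">Isolated<br/>T-storms and<br/>Breezy</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# separator=" " joins the text pieces with a space; strip=True trims each piece.
period = soup.find(class_="period-name").get_text(separator=" ", strip=True)
desc = soup.find(class_="short-desc").get_text(separator=" ", strip=True)

print(period)  # This Afternoon
print(desc)    # Isolated T-storms and Breezy
```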

### Pulling in a Table

*In general you'll need to examine the html code so that you can tell the BeautifulSoup parser what to look for!*

In [40]:
url = 'https://www.pro-football-reference.com/'
res = requests.get(url)
soup = BeautifulSoup(res.content)

In [41]:
table = soup.find('table', {'id': 'AFC'})

In [43]:
table.find('tbody')

<tbody><tr class="thead onecell"><td class="right left" colspan="5" data-stat="onecell"> AFC East</td></tr>
<tr><th class="left" csk="1" data-stat="team" scope="row"><a href="/teams/buf/2021.htm">BUF</a>*</th><td class="right" data-stat="wins">11</td><td class="right" data-stat="losses">6</td><td class="right iz" data-stat="ties">0</td><td class="right" data-stat="win_loss_perc">.647</td></tr>
<tr><th class="left" csk="2" data-stat="team" scope="row"><a href="/teams/nwe/2021.htm">NWE</a>+</th><td class="right" data-stat="wins">10</td><td class="right" data-stat="losses">7</td><td class="right iz" data-stat="ties">0</td><td class="right" data-stat="win_loss_perc">.588</td></tr>
<tr><th class="left" csk="3" data-stat="team" scope="row"><a href="/teams/mia/2021.htm">MIA</a></th><td class="right" data-stat="wins">9</td><td class="right" data-stat="losses">8</td><td class="right iz" data-stat="ties">0</td><td class="right" data-stat="win_loss_perc">.529</td></tr>
<tr><th class="left" csk="4

In [45]:
table.find('tbody').find_all('tr')[0]

<tr class="thead onecell"><td class="right left" colspan="5" data-stat="onecell"> AFC East</td></tr>

In [46]:
teams = []
table = soup.find('table', {'id': 'AFC'})

for row in table.find('tbody').find_all('tr'):
    try:
        team = {'name': row.find('th', {'data-stat': 'team'}).text,
               'wins': row.find('td', {'data-stat': 'wins'}).text,
               'losses': row.find('td', {'data-stat': 'losses'}).text,
               'ties': row.find('td', {'data-stat': 'ties'}).text}
        teams.append(team)
    except AttributeError:
        # Division header rows have no team cell, so find() returns None; skip them.
        pass

In [47]:
teams

[{'name': 'BUF*', 'wins': '11', 'losses': '6', 'ties': '0'},
 {'name': 'NWE+', 'wins': '10', 'losses': '7', 'ties': '0'},
 {'name': 'MIA', 'wins': '9', 'losses': '8', 'ties': '0'},
 {'name': 'NYJ', 'wins': '4', 'losses': '13', 'ties': '0'},
 {'name': 'CIN*', 'wins': '10', 'losses': '7', 'ties': '0'},
 {'name': 'PIT+', 'wins': '9', 'losses': '7', 'ties': '1'},
 {'name': 'CLE', 'wins': '8', 'losses': '9', 'ties': '0'},
 {'name': 'BAL', 'wins': '8', 'losses': '9', 'ties': '0'},
 {'name': 'TEN*', 'wins': '12', 'losses': '5', 'ties': '0'},
 {'name': 'IND', 'wins': '9', 'losses': '8', 'ties': '0'},
 {'name': 'HOU', 'wins': '4', 'losses': '13', 'ties': '0'},
 {'name': 'JAX', 'wins': '3', 'losses': '14', 'ties': '0'},
 {'name': 'KAN*', 'wins': '12', 'losses': '5', 'ties': '0'},
 {'name': 'LVR+', 'wins': '10', 'losses': '7', 'ties': '0'},
 {'name': 'LAC', 'wins': '9', 'losses': '8', 'ties': '0'},
 {'name': 'DEN', 'wins': '7', 'losses': '10', 'ties': '0'}]

## Combining our data into a Pandas DataFrame

We can now combine the data into a Pandas DataFrame and analyze it.

To do this, we pass our data to the `DataFrame` constructor. For the football data, we pass in the list of dictionaries: each dictionary becomes a row, and each dictionary key becomes a column in the DataFrame:

In [48]:
# Football data from the table in dictionary form (very easy!)

football = pd.DataFrame(teams)
football

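| | name | wins | losses | ties |
|---|---|---|---|---|
| 0 | BUF* | 11 | 6 | 0 |
| 1 | NWE+ | 10 | 7 | 0 |
| 2 | MIA | 9 | 8 | 0 |
| 3 | NYJ | 4 | 13 | 0 |
| 4 | CIN* | 10 | 7 | 0 |
| 5 | PIT+ | 9 | 7 | 1 |
| 6 | CLE | 8 | 9 | 0 |
| 7 | BAL | 8 | 9 | 0 |
| 8 | TEN* | 12 | 5 | 0 |
| 9 | IND | 9 | 8 | 0 |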
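
Everything we scraped is still text, so the counts cannot be summed or compared numerically yet. The sketch below shows one possible cleanup on the first three scraped rows (inlined so that it runs offline): it strips the trailing `*` and `+` markers from the names and converts the count columns to integers.

```python
import pandas as pd

# First three scraped rows, inlined so the sketch runs offline.
teams = [
    {'name': 'BUF*', 'wins': '11', 'losses': '6', 'ties': '0'},
    {'name': 'NWE+', 'wins': '10', 'losses': '7', 'ties': '0'},
    {'name': 'MIA', 'wins': '9', 'losses': '8', 'ties': '0'},
]
football = pd.DataFrame(teams)

# Strip the trailing '*' and '+' markers that appear after some team names.
football['name'] = football['name'].str.rstrip('*+')

# Scraped values are strings; convert the count columns to integers.
numeric_cols = ['wins', 'losses', 'ties']
football[numeric_cols] = football[numeric_cols].astype(int)

total_wins = football['wins'].sum()
print(football['name'].tolist())  # ['BUF', 'NWE', 'MIA']
print(total_wins)                 # 30
```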


In [50]:
# Weather data from the list of (period, forecast) tuples

weather = pd.DataFrame(together,
                      columns=['day_and_part_of_day', 'weather_forecast'])
weather

| | day_and_part_of_day | weather_forecast |
|---|---|---|
| 0 | ThisAfternoon | IsolatedT-storms andBreezy |
| 1 | Tonight | T-stormsLikely |
| 2 | Wednesday | Slight ChanceT-storms thenPartly Sunny |
| 3 | WednesdayNight | ChanceShowers |
| 4 | Thursday | ChanceShowers |
| 5 | ThursdayNight | Mostly Clear |
| 6 | Friday | Sunny |
| 7 | FridayNight | Mostly Clear |
| 8 | Saturday | Mostly Sunny |
