# Dataquest Tutorial: Web Scraping with Python Using Beautiful Soup (BS4)

https://www.dataquest.io/blog/web-scraping-tutorial-python/

### How Does Web Scraping Work? (see Tutorial)

### Why Python for Web Scraping? (see Tutorial)

### Is Web Scraping Legal? (see Tutorial)

### Web Scraping Best Practices (see Tutorial)

### The Components of a Web Page (see Tutorial)

## 0. HTML (see Tutorial)

For a more comprehensive tutorial on HTML: https://www.w3schools.com/html/default.asp

# Prelude: A Quick Start

The following code (Roman numeral items) is from a "Quick Start" tutorial on the Beautiful Soup 4 web site:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

The Dataquest Tutorial resumes below...

## I. A fragment of HTML to search...

In [7]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""


In [8]:
print (html_doc) # just print the string

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>



In [9]:
# write the string out as an external .html file, then open it in a browser...

import webbrowser
import os

# the with-block closes the file automatically, even if the write fails
with open('html_doc.html', 'w') as f:
    f.write(html_doc)

# Change path to reflect file location
filename = 'file:///' + os.getcwd() + '/' + 'html_doc.html'
webbrowser.open_new_tab(filename) # returns True if successful

# in the browser page that opens, right-click and 'view page source'




True

## II. Creating a BeautifulSoup 'data structure'...

In [10]:
import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print (soup.prettify())


<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [11]:
type(soup)

bs4.BeautifulSoup

## III. Examples of BS4 Navigation in this data structure...

In [14]:
soup.title.text
# <title>The Dormouse's story</title>

"The Dormouse's story"

In [13]:
print (type(soup.find('title')))
soup.find('title') # equivalent to above: search for first <title> tag

<class 'bs4.element.Tag'>


<title>The Dormouse's story</title>

In [15]:
soup.find('p',class_='story') # search for the first <p class='story'> tag
# note that class is a Python reserved word, hence the use of class_

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
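
The class search above can be written in more than one way. The following is a minimal sketch on a shortened stand-in for `html_doc` (the fragment is an assumption made for brevity, not the full Quick Start document); it compares the `class_` keyword with the `attrs` dictionary form and with `find_all`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc (assumption: same tag and class layout)
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find("p", class_="story")           # keyword form used above
by_attrs = soup.find("p", attrs={"class": "story"})   # dictionary form, no trailing underscore needed
all_story = soup.find_all("p", class_="story")        # every match, not just the first

print(by_keyword == by_attrs)  # True: both forms locate the same tag
print(len(all_story))          # 2
```

The dictionary form is handy when the attribute name is not a valid Python keyword argument.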

In [16]:
print (type(soup.p))
print(soup.p['class']) # dictionary-style attribute lookup on the first <p> (its class is 'title', not 'story')
type(soup.p['class'])


<class 'bs4.element.Tag'>
['title']


list

In [17]:
soup.title.name # tag's name
# 'title'

'title'

In [18]:
soup.title.string # or .text or .get_text(): text within tag
# "The Dormouse's story"

"The Dormouse's story"

In [19]:
soup.title.parent.name # name of enclosing tag of <title>
# 'head'

'head'
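
The cells above use `.text`, `.string` and `.get_text()` interchangeably on `<title>`, which works because that tag holds a single piece of text. As a hedged illustration on a made-up fragment, `.string` returns `None` once a tag has several children, while `.get_text()` still concatenates everything.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <title> has one text child, <p> has three children
soup = BeautifulSoup(
    "<title>The Dormouse's story</title>"
    '<p class="story">Once upon a time <a href="#">Elsie</a> lived here.</p>',
    "html.parser",
)

title_string = soup.title.string   # the single text child
p_string = soup.p.string           # None: <p> contains text, a tag, and more text
p_text = soup.p.get_text()         # all nested text joined together

print(title_string)
print(p_string)
print(p_text)
```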

In [20]:
soup.p # find first <p> in document
# <p class="title"><b>The Dormouse's story</b></p>

<p class="title"><b>The Dormouse's story</b></p>

In [21]:
soup.p['class'] # returns a list, not the string shown in the Tutorial: class can hold several values
# u'title'

['title']
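
The list result comes from `class` being a multi-valued attribute. A small sketch on a hypothetical two-paragraph fragment shows how this differs from a single-valued attribute such as `id`, and that a class search matches a tag carrying that class among others.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a two-class paragraph
soup = BeautifulSoup(
    '<p class="inner-text first-item" id="first">First paragraph.</p>'
    '<p class="inner-text">Second paragraph.</p>',
    "html.parser",
)
p = soup.p

classes = p["class"]              # multi-valued attribute: a list of class names
tag_id = p["id"]                  # single-valued attribute: a plain string
class_string = " ".join(classes)  # rebuild the original attribute text if needed

# Searching for one class matches tags that carry it alongside other classes
matches = soup.find_all("p", class_="first-item")

print(classes)       # ['inner-text', 'first-item']
print(tag_id)        # first
print(len(matches))  # 1
```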

In [22]:
soup.a # first <a> within document
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [23]:
soup.find_all('a') # get ALL <a> tags as list
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [24]:
soup.find('a') # same as soup.a: find only first <a>

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [25]:
soup.find(id="link3") # find tag with given id (which should be unique within page)
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page’s <a> tags:

In [26]:
for link in soup.find_all('a'):
    print(link.get('href'))
    # print(link['href']) # equivalent
    
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
    

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
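
The two lookups in the loop above are equivalent only when every `<a>` tag has an `href`. As a sketch on a made-up fragment with one link-less anchor, `.get('href')` returns `None` for a missing attribute, whereas `['href']` raises `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second <a> has no href attribute
soup = BeautifulSoup(
    '<a href="http://example.com/elsie">Elsie</a><a name="anchor-only">no link</a>',
    "html.parser",
)
anchors = soup.find_all("a")

hrefs = [a.get("href") for a in anchors]                     # missing attribute becomes None
with_href = [a["href"] for a in soup.find_all("a", href=True)]  # only tags that have an href

try:
    anchors[1]["href"]
    raised = False
except KeyError:
    raised = True   # dictionary-style access fails on a missing attribute

print(hrefs)      # ['http://example.com/elsie', None]
print(with_href)  # ['http://example.com/elsie']
print(raised)     # True
```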


Another common task is extracting all the text from a page:

In [27]:
print(soup.get_text()) # all text as single string

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...





In [28]:
print(soup.text) # a shortcut for previous

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...





In [29]:
soup.a # only finds *first* matching tag

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [30]:
soup.find('a') # equivalent to previous

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

We can extract all the URLs from the links of class 'sister': find all `'a'` tags as a list using `find_all`, then index each with the attribute `'href'`.

Note how tags act like a dictionary for enclosed attributes...

In [31]:

for each_tag in soup.find_all('a',class_='sister'): # class_ is unnecessary here: every <a> in this page has class 'sister'
    print (each_tag['href'])


http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


# (Back to Dataquest Tutorial) 
# 1. The requests library

In [32]:
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

In [33]:
page.status_code

200
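
Checking `status_code` by eye works in a notebook; in a script it may be preferable to fail loudly. The helper below is a sketch only: `fetch_html` is a hypothetical name, and the 10-second timeout is an arbitrary assumption. It relies on `requests.get(..., timeout=...)` and `Response.raise_for_status()`.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising an exception on HTTP errors.

    fetch_html is a hypothetical helper; the default timeout is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses
    return response.content
```

It could then be used as `content = fetch_html("http://dataquestio.github.io/web-scraping-pages/simple.html")`.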

In [34]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

# 2. Parsing a page with BeautifulSoup

In [35]:
# create soup: a BeautifulSoup (BS4) object built from the page content
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser') # other parsers: 'lxml', 'html5lib' (installed separately)
soup

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [36]:
print(soup.prettify()) # pretty-print the HTML

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [37]:
list(soup.children) # show the children in the document in list format

['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [38]:
[type(item) for item in list(soup.children)] # show type of each element

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

In [39]:
html = list(soup.children)[2] # select html tag (third item in list)
html

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [40]:
print(len(list(html.children)))
list(html.children) # list the children inside the html tag

5


['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']

In [41]:
body = list(html.children)[3] # get the 4th element of children, index 3 (try changing to 0, 1, 2)
body

<body>
<p>Here is some simple content for this page.</p>
</body>

In [42]:
list(body.children) # look inside this element ==  body tag

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [43]:
p = list(body.children)[1] # extract p tag (2nd element)
p

<p>Here is some simple content for this page.</p>

In [44]:
p.get_text() # extract all text inside the tag


'Here is some simple content for this page.'

In [45]:
html.get_text() # extra example: text from inside html tag

'\n\nA simple example page\n\n\nHere is some simple content for this page.\n\n'

In [46]:
html.body.get_text() # extra example: text inside body tag

'\nHere is some simple content for this page.\n'
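
The positional indexing above depends on where the whitespace strings fall among the children. One possible alternative, sketched here on a local copy of the simple page (so no download is needed), keeps only the `Tag` children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Local copy of the simple example page shown above
html_text = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""
soup = BeautifulSoup(html_text, "html.parser")

# Keep only real tags, skipping the '\n' NavigableString items
tag_children = [child for child in soup.html.children if isinstance(child, Tag)]
names = [tag.name for tag in tag_children]

# find_all(True, recursive=False) also returns just the direct child tags
direct_tags = [tag.name for tag in soup.html.find_all(True, recursive=False)]

print(names)        # ['head', 'body']
print(direct_tags)  # ['head', 'body']
```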

# 3. Finding all instances of a tag at once

In [47]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p') # find all the <p> tags

[<p>Here is some simple content for this page.</p>]

In [48]:
soup.find_all('p')[0].get_text() # find_all returns a list

'Here is some simple content for this page.'
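
Indexing with `[0]` assumes at least one match. As a short sketch on an inline fragment, `find_all` returns an empty list when nothing matches and `find` returns `None`, so either result is worth checking before calling `get_text()`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Here is some simple content for this page.</p>", "html.parser")

missing_all = soup.find_all("table")   # empty list: indexing [0] would raise IndexError
missing_one = soup.find("table")       # None: calling .get_text() would raise AttributeError

first_p = soup.find("p")
text = first_p.get_text() if first_p is not None else ""

print(len(missing_all))  # 0
print(missing_one)       # None
print(text)
```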

In [49]:
soup.find('p') # find returns the first instance of a tag

<p>Here is some simple content for this page.</p>

# 4. Searching for tags by class and id

Get document at: http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html

In [50]:
# download the page and create a BS4 object

page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
print (soup) # show this page

print ('\nPrettier Version:\n') 
print(soup.prettify()) # extra example: pretty-print the HTML

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Prettier Version:

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


In [51]:
soup.find_all('p', class_='outer-text') # find ALL (list of) <p> elements with given class

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [52]:
soup.find_all(class_="outer-text") # same result here: every 'outer-text' element happens to be a <p> tag

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [53]:
soup.find_all(id="first") # extra example: find all tags with given id

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

In [54]:
soup.find(id='first') # same as before since only one 'first' anywhere in doc

<p class="inner-text first-item" id="first">
                First paragraph.
            </p>

# 5. Using CSS Selectors

CSS documents allow *element styling* => coloring selected elements, etc.

CSS Selectors are a matching mechanism.

BeautifulSoup allows searching using CSS Selectors!

A few CSS selectors (note that this page specifies no CSS of its own):

`p a` — finds all `a` tags inside of a `p` tag. 

`body p a` — finds all `a` tags inside of a `p` tag inside of a `body` tag. 

`html body` — finds all `body` tags inside of an `html` tag. 

`p.outer-text` — finds all `p` tags with a class of `outer-text`. 

`p#first` — finds all `p` tags with an `id` of `first`. 

`body p.outer-text` — finds any `p` tags with a class of `outer-text` inside of a `body` tag. 
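
The selectors listed above can be tried against a local copy of the page. This is a sketch with the markup retyped and its whitespace simplified (an assumption for brevity); `select` returns a list and `select_one` returns the first match or `None`.

```python
from bs4 import BeautifulSoup

# Retyped copy of the ids_and_classes.html markup (whitespace simplified)
html = """
<html><head><title>A simple example page</title></head>
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

selectors = ["p a", "html body", "p.outer-text", "p#first", "body p.outer-text", "div p"]
counts = {sel: len(soup.select(sel)) for sel in selectors}   # number of matches per selector
print(counts)

first = soup.select_one("p#first")              # single element instead of a list
both = soup.select("p.outer-text.first-item")   # chained classes: the tag must carry both
print(first.get_text())
print([p["id"] for p in both])
```

The page has no `<a>` tags, so `p a` matches nothing here.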


In [55]:
soup.select("div p") # CSS selector select returns a list of all tags p inside div

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]

# 6. Downloading weather data

We'll use this National Weather Service forecast page for Minneapolis (Mpls) weather:

https://forecast.weather.gov/MapClick.php?lat=44.9444&lon=-93.0933

# 7. Exploring page structure with Chrome DevTools

Firefox:  Tools -> Web Developer -> Toggle Tools

Chrome: View -> Developer -> Developer Tools

=> Right-click near "Extended Forecast" and select "Inspect"

=> Scroll up in Elements (Inspector for Firefox) to find the enclosing element containing the text of the extended forecasts (the div tag with id seven-day-forecast).

=> Note: tombstone-container is inside forecast-tombstone...

In [58]:
# download the page and start parsing...

# get the page for Mpls weather
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=44.9444&lon=-93.0933")

# create a BeautifulSoup object to parse it
soup = BeautifulSoup(page.content, 'html.parser')

# find the element with the given id & assign to seven_day (find returns only the first match)
seven_day = soup.find(id="seven-day-forecast")

# inside seven_day, find each individual forecast item
forecast_items = seven_day.find_all(class_="tombstone-container")

# extract and print the first item
current_time = forecast_items[0] # try other indices here

print(current_time.prettify())


<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Scattered showers, mainly before 10pm.  Cloudy, with a low around 36. Northwest wind 10 to 15 mph.  Chance of precipitation is 30%." class="forecast-icon" src="DualImage.php?i=nshra&amp;j=novc&amp;ip=30" title="Tonight: Scattered showers, mainly before 10pm.  Cloudy, with a low around 36. Northwest wind 10 to 15 mph.  Chance of precipitation is 30%."/>
 </p>
 <p class="short-desc">
  Scattered
  <br/>
  Showers then
  <br/>
  Cloudy
 </p>
 <p class="temp temp-low">
  Low: 36 °F
 </p>
</div>


In [59]:
period = current_time.find(class_="period-name").get_text()
short_desc = current_time.find(class_="short-desc").get_text()
temp = current_time.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Tonight
ScatteredShowers thenCloudy
Low: 36 °F
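
The run-together description (`ScatteredShowers thenCloudy`) appears because the `<br/>` tags contribute no text of their own. A minimal sketch, using only the `short-desc` paragraph from the output above, shows how the `separator` and `strip` arguments of `get_text` keep the words apart.

```python
from bs4 import BeautifulSoup

# The short-desc paragraph from the forecast item above
html = '<p class="short-desc">Scattered<br/>Showers then<br/>Cloudy</p>'
tag = BeautifulSoup(html, "html.parser").p

joined = tag.get_text()                            # pieces are concatenated directly
spaced = tag.get_text(separator=" ", strip=True)   # pieces are stripped, then joined with a space

print(joined)  # ScatteredShowers thenCloudy
print(spaced)  # Scattered Showers then Cloudy
```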


In [60]:
img = current_time.find("img")
desc = img['title']
print(desc)


Tonight: Scattered showers, mainly before 10pm.  Cloudy, with a low around 36. Northwest wind 10 to 15 mph.  Chance of precipitation is 30%.


# 8. Extracting all the information from the page

Use CSS selectors to find every item with class `period-name` inside an item with class `tombstone-container`.

In [61]:
# Use CSS selectors to find all items with class==period-name inside 
#   item having class==tombstone-container
period_tags = seven_day.select(".tombstone-container .period-name")

# list comprehension to get all text within each item
periods = [pt.get_text() for pt in period_tags]

periods

['Tonight',
 'Thursday',
 'ThursdayNight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight']

In [62]:
# now get the three other fields: 

# short_descs => short descriptions
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]

# temps => temperature strings (e.g. 'Low: 36 °F')
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]

# descs => full descriptions
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print("Short descriptions:\n",short_descs)
print("\nTemperatures:\n",temps)
print("\nLong descriptions:\n",descs)

Short descriptions:
 ['ScatteredShowers thenCloudy', 'Mostly Cloudy', 'Mostly Cloudy', 'Mostly Sunny', 'Mostly Cloudy', 'Mostly Cloudy', 'Partly Cloudy', 'Mostly Sunny', 'Mostly Cloudythen ChanceShowers']

Temperatures:
 ['Low: 36 °F', 'High: 47 °F', 'Low: 37 °F', 'High: 54 °F', 'Low: 37 °F', 'High: 54 °F', 'Low: 35 °F', 'High: 57 °F', 'Low: 37 °F']

Long descriptions:
 ['Tonight: Scattered showers, mainly before 10pm.  Cloudy, with a low around 36. Northwest wind 10 to 15 mph.  Chance of precipitation is 30%.', 'Thursday: Mostly cloudy, with a high near 47. North northwest wind around 10 mph. ', 'Thursday Night: Mostly cloudy, with a low around 37. North northwest wind 5 to 10 mph. ', 'Friday: Mostly sunny, with a high near 54. North northwest wind 5 to 10 mph. ', 'Friday Night: Mostly cloudy, with a low around 37. North wind 5 to 10 mph. ', 'Saturday: Mostly cloudy, with a high near 54. North wind 5 to 10 mph. ', 'Saturday Night: Partly cloudy, with a low around 35. Northwest wind 5 

In [63]:
len(descs)



9
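
The four lists line up only when every forecast item contains every field. One way to guard against a missing field, sketched here on hypothetical trimmed-down markup whose first item deliberately has no temperature, is to extract per container and record `None` for anything absent. `text_or_none` is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down forecast markup; the first item has no temp on purpose
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Today</p>
    <p class="short-desc">Flood Warning</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Cloudy</p>
    <p class="temp temp-low">Low: 36 °F</p>
  </div>
</div>
"""
seven_day = BeautifulSoup(html, "html.parser").find(id="seven-day-forecast")


def text_or_none(parent, css_class):
    """Return the text of the first tag with css_class inside parent, or None."""
    tag = parent.find(class_=css_class)
    return tag.get_text(" ", strip=True) if tag is not None else None


# One dictionary per forecast item keeps the fields of each item together
rows = [
    {
        "period": text_or_none(item, "period-name"),
        "short_desc": text_or_none(item, "short-desc"),
        "temp": text_or_none(item, "temp"),
    }
    for item in seven_day.find_all(class_="tombstone-container")
]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`.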

# 9. Combining our data into a Pandas Dataframe

In [64]:
# weather is a Pandas DataFrame with given columns: note dictionary notation for initializing!

import pandas as pd
weather = pd.DataFrame({
    "period": periods, 
    "short_desc": short_descs,
    "temp": temps, # all lists must have the same length, or pandas raises an error
    "desc": descs
})
weather


Unnamed: 0,period,short_desc,temp,desc
0,Tonight,ScatteredShowers thenCloudy,Low: 36 °F,"Tonight: Scattered showers, mainly before 10pm..."
1,Thursday,Mostly Cloudy,High: 47 °F,"Thursday: Mostly cloudy, with a high near 47. ..."
2,ThursdayNight,Mostly Cloudy,Low: 37 °F,"Thursday Night: Mostly cloudy, with a low arou..."
3,Friday,Mostly Sunny,High: 54 °F,"Friday: Mostly sunny, with a high near 54. Nor..."
4,FridayNight,Mostly Cloudy,Low: 37 °F,"Friday Night: Mostly cloudy, with a low around..."
5,Saturday,Mostly Cloudy,High: 54 °F,"Saturday: Mostly cloudy, with a high near 54. ..."
6,SaturdayNight,Partly Cloudy,Low: 35 °F,"Saturday Night: Partly cloudy, with a low arou..."
7,Sunday,Mostly Sunny,High: 57 °F,"Sunday: Mostly sunny, with a high near 57. Nor..."
8,SundayNight,Mostly Cloudythen ChanceShowers,Low: 37 °F,Sunday Night: A 30 percent chance of showers a...


In [65]:
# the tutorial's code DOESN'T WORK here, apparently due to a blank temp at the beginning...
# note the alternate RE; also, as written below, "d+" matches the letter d, not digits (that needs \d+)

# temp_nums = weather["temp"].str.extract("(?P<temp_num>d+)", expand=False)
temp_nums = weather["temp"].str.extract("([0-9]+)", expand=False)
weather["temp_num"] = temp_nums.astype('int') # convert the extracted digit strings to ints and add as new column
temp_nums


0    36
1    47
2    37
3    54
4    37
5    54
6    35
7    57
8    37
Name: temp, dtype: object
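
The output above has dtype `object` because `str.extract` returns strings. As a hedged sketch on a made-up three-item series with one missing value, `pd.to_numeric(..., errors="coerce")` converts the digits to numbers and turns a missing entry into `NaN`, a case that `astype('int')` cannot represent.

```python
import pandas as pd

# Made-up series: the last entry mimics a forecast item without a temperature
temps = pd.Series(["Low: 36 °F", "High: 47 °F", None], name="temp")

digits = temps.str.extract(r"(\d+)", expand=False)   # raw string so \d reaches the regex engine
temp_num = pd.to_numeric(digits, errors="coerce")    # strings become numbers, missing becomes NaN

print(temp_num)
```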

In [66]:
weather # look at the new column!

Unnamed: 0,period,short_desc,temp,desc,temp_num
0,Tonight,ScatteredShowers thenCloudy,Low: 36 °F,"Tonight: Scattered showers, mainly before 10pm...",36
1,Thursday,Mostly Cloudy,High: 47 °F,"Thursday: Mostly cloudy, with a high near 47. ...",47
2,ThursdayNight,Mostly Cloudy,Low: 37 °F,"Thursday Night: Mostly cloudy, with a low arou...",37
3,Friday,Mostly Sunny,High: 54 °F,"Friday: Mostly sunny, with a high near 54. Nor...",54
4,FridayNight,Mostly Cloudy,Low: 37 °F,"Friday Night: Mostly cloudy, with a low around...",37
5,Saturday,Mostly Cloudy,High: 54 °F,"Saturday: Mostly cloudy, with a high near 54. ...",54
6,SaturdayNight,Partly Cloudy,Low: 35 °F,"Saturday Night: Partly cloudy, with a low arou...",35
7,Sunday,Mostly Sunny,High: 57 °F,"Sunday: Mostly sunny, with a high near 57. Nor...",57
8,SundayNight,Mostly Cloudythen ChanceShowers,Low: 37 °F,Sunday Night: A 30 percent chance of showers a...,37


In [67]:
# extra example: shows temps matching 'Low:' (NaN => no match)
weather["temp"].str.extract("(Low: [0-9]+)",expand=False)

0    Low: 36
1        NaN
2    Low: 37
3        NaN
4    Low: 37
5        NaN
6    Low: 35
7        NaN
8    Low: 37
Name: temp, dtype: object

In [68]:
weather["temp_num"].mean() # what's the mean of ALL temps (high and low)?

43.77777777777778

In [69]:
is_night = weather["temp"].str.contains("Low") # these are rows for night
is_night

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
Name: temp, dtype: bool

In [70]:
weather["is_night"] = is_night # add a new boolean column: True if night temps
weather

Unnamed: 0,period,short_desc,temp,desc,temp_num,is_night
0,Tonight,ScatteredShowers thenCloudy,Low: 36 °F,"Tonight: Scattered showers, mainly before 10pm...",36,True
1,Thursday,Mostly Cloudy,High: 47 °F,"Thursday: Mostly cloudy, with a high near 47. ...",47,False
2,ThursdayNight,Mostly Cloudy,Low: 37 °F,"Thursday Night: Mostly cloudy, with a low arou...",37,True
3,Friday,Mostly Sunny,High: 54 °F,"Friday: Mostly sunny, with a high near 54. Nor...",54,False
4,FridayNight,Mostly Cloudy,Low: 37 °F,"Friday Night: Mostly cloudy, with a low around...",37,True
5,Saturday,Mostly Cloudy,High: 54 °F,"Saturday: Mostly cloudy, with a high near 54. ...",54,False
6,SaturdayNight,Partly Cloudy,Low: 35 °F,"Saturday Night: Partly cloudy, with a low arou...",35,True
7,Sunday,Mostly Sunny,High: 57 °F,"Sunday: Mostly sunny, with a high near 57. Nor...",57,False
8,SundayNight,Mostly Cloudythen ChanceShowers,Low: 37 °F,Sunday Night: A 30 percent chance of showers a...,37,True


In [71]:
weather[is_night]["temp_num"].mean() 
# use Pandas boolean indexing:
#   select only temp_num elements of rows with is_night==True, 
#   then compute the mean

36.4
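
The same means can be computed in one step. The sketch below rebuilds only the `temp` and `temp_num` columns from the values shown above, then uses `.loc` with the boolean mask and a `groupby` on `is_night`.

```python
import pandas as pd

# temp and temp_num values copied from the weather table above
weather = pd.DataFrame({
    "temp": ["Low: 36 °F", "High: 47 °F", "Low: 37 °F", "High: 54 °F", "Low: 37 °F",
             "High: 54 °F", "Low: 35 °F", "High: 57 °F", "Low: 37 °F"],
    "temp_num": [36, 47, 37, 54, 37, 54, 35, 57, 37],
})
weather["is_night"] = weather["temp"].str.contains("Low")

# .loc selects rows and column in a single indexing step
night_mean = weather.loc[weather["is_night"], "temp_num"].mean()

# groupby gives the day and night means side by side
by_period = weather.groupby("is_night")["temp_num"].mean()

print(night_mean)
print(by_period)
```

Here `by_period[False]` is the mean of the daytime highs and `by_period[True]` the mean of the nighttime lows.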