# Web Scraping
In those rare, terrifying moments when I’m without Wi-Fi, I realize just how much of what I do on the computer is really what I do on the internet. Out of sheer habit I’ll find myself trying to check email, read friends’ Twitter feeds, or answer the question, “Did Kurtwood Smith have any major roles before he was in the original 1987 RoboCop?”

Since so much work on a computer involves going on the internet, it’d be great if your programs could get online. Web scraping is the term for using a program to download and process content from the web. For example, Google runs many web scraping programs to index web pages for its search engine. In this chapter, you will learn about several modules that make it easy to scrape web pages in Python.

webbrowser: Comes with Python and opens a browser to a specific page.

requests: Downloads files and web pages from the internet.

bs4: Parses HTML, the format that web pages are written in.

selenium: Launches and controls a web browser. The selenium module is able to fill in forms and simulate mouse clicks in this browser.

## Project: mapIt.py with the webbrowser Module

The webbrowser module’s open() function can launch a new browser to a specified URL. Enter the following into the interactive shell:

In [1]:
import webbrowser

webbrowser.open('https://inventwithpython.com/')

True

A web browser tab will open to the URL https://inventwithpython.com/. This is about the only thing the webbrowser module can do. Even so, the open() function does make some interesting things possible. For example, it’s tedious to copy a street address to the clipboard and bring up a map of it on Google Maps. You could take a few steps out of this task by writing a simple script to automatically launch the map in your browser using the contents of your clipboard. This way, you only have to copy the address to the clipboard and run the script, and the map will be loaded for you.

This is what your program does:

    Gets a street address from the command line arguments or clipboard
    Opens the web browser to the Google Maps page for the address

This means your code will need to do the following:

    Read the command line arguments from sys.argv.
    Read the clipboard contents.
    Call the webbrowser.open() function to open the web browser.

Open a new file editor tab and save it as mapIt.py.

### Step 1: Figure Out the URL

Based on the instructions in Appendix B, set up mapIt.py so that when you run it from the command line, like so . . .

C:\> mapit 870 Valencia St, San Francisco, CA 94110

. . . the script will use the command line arguments instead of the clipboard. If there are no command line arguments, then the program will know to use the contents of the clipboard.

First you need to figure out what URL to use for a given street address. When you load https://maps.google.com/ in the browser and search for an address, the URL in the address bar looks something like this: https://www.google.com/maps/place/870+Valencia+St/@37.7590311,-122.4215096,17z/data=!3m1!4b1!4m2!3m1!1s0x808f7e3dadc07a37:0xc86b0b2bb93b73d8.

The address is in the URL, but there’s a lot of additional text there as well. Websites often add extra data to URLs to help track visitors or customize sites. But if you try just going to https://www.google.com/maps/place/870+Valencia+St+San+Francisco+CA/, you’ll find that it still brings up the correct page. So your program can be set to open a web browser to 'https://www.google.com/maps/place/your_address_string' (where your_address_string is the address you want to map).
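A minimal sketch of turning an address into such a URL. The helper name build_maps_url is hypothetical, and percent-encoding the address with urllib.parse.quote_plus() is an assumption that goes beyond what mapIt.py itself does (the program in this chapter simply concatenates the raw address). Encoding turns spaces into + signs, as in the 870+Valencia+St example, and escapes characters such as commas.

```python
from urllib.parse import quote_plus

MAPS_PREFIX = 'https://www.google.com/maps/place/'

def build_maps_url(address):
    """Return a Google Maps URL for the given street address."""
    # quote_plus() replaces spaces with '+' and percent-encodes characters
    # such as commas, so the result is always a well-formed URL.
    return MAPS_PREFIX + quote_plus(address.strip())

print(build_maps_url('870 Valencia St'))
```

Whether to encode at all is a design choice: browsers usually tolerate raw spaces in a URL, but an encoded address behaves the same way no matter which program ends up receiving it.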

### Step 2: Handle the Command Line Arguments

Make your code look like this:

In [None]:
#! python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.

import webbrowser, sys
if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])

# TODO: Get address from clipboard.

After the program’s #! shebang line, you need to import the webbrowser module for launching the browser and import the sys module for reading the potential command line arguments. The sys.argv variable stores a list of the program’s filename and command line arguments. If this list has more than just the filename in it, then len(sys.argv) evaluates to an integer greater than 1, meaning that command line arguments have indeed been provided.

Command line arguments are usually separated by spaces, but in this case, you want to interpret all of the arguments as a single string. Since sys.argv is a list of strings, you can pass it to the join() method, which returns a single string value. You don’t want the program name in this string, so instead of sys.argv, you should pass sys.argv[1:] to chop off the first element of the list. The final string that this expression evaluates to is stored in the address variable.

If you run the program by entering this into the command line . . .

mapit 870 Valencia St, San Francisco, CA 94110

. . . the sys.argv variable will contain this list value:

['mapIt.py', '870', 'Valencia', 'St,', 'San', 'Francisco,', 'CA', '94110']

The address variable will contain the string '870 Valencia St, San Francisco, CA 94110'.
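The joining step can be tried on its own, without running the script from a terminal. This is only a sketch: address_from_argv is a hypothetical helper, and the list passed to it stands in for what sys.argv would hold.

```python
def address_from_argv(argv):
    """Join every argument after the script name into one address string.

    Returns None when no arguments were given, so the caller can fall
    back to the clipboard instead.
    """
    if len(argv) > 1:
        return ' '.join(argv[1:])
    return None

# The shell splits the command on spaces, so punctuation stays attached
# to the word it follows.
example_argv = ['mapIt.py', '870', 'Valencia', 'St,', 'San', 'Francisco,', 'CA', '94110']
print(address_from_argv(example_argv))
print(address_from_argv(['mapIt.py']))
```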

### Step 3: Handle the Clipboard Content and Launch the Browser

Make your code look like the following:

In [6]:
#! python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.
import webbrowser, sys, pyperclip
if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.google.com/maps/place/' + address)


https://www.google.com/maps/place/870 Valencia St, San Francisco, CA 94110


### Ideas for Similar Programs

As long as you have a URL, the webbrowser module lets users cut out the step of opening the browser and directing themselves to a website. Other programs could use this functionality to do the following:

    Open all links on a page in separate browser tabs (by looping over the URL of each link).
    Open the browser to the URL for your local weather.
    Open several social network sites that you regularly check.
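As a rough sketch of the last idea, the loop below opens a list of sites one after another. The function name open_all and its opener parameter are hypothetical; the parameter exists only so the loop can be tried with a stand-in function instead of launching real browser tabs.

```python
import webbrowser

def open_all(urls, opener=webbrowser.open):
    """Open each URL with `opener` and return the URLs that were opened.

    `opener` defaults to webbrowser.open(); passing a different callable
    makes the function easy to try without launching a browser.
    """
    opened = []
    for url in urls:
        if opener(url):  # webbrowser.open() returns True on success
            opened.append(url)
    return opened

# Dry run: record the URLs instead of opening browser tabs.
requested = []

def record(url):
    requested.append(url)
    return True

sites = ['https://inventwithpython.com/', 'https://automatetheboringstuff.com/']
print(open_all(sites, opener=record))
```

Calling open_all(sites) with no second argument would open each site in the default browser.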


## Downloading Files from the Web with the requests Module

The requests module lets you easily download files from the web without having to worry about complicated issues such as network errors, connection problems, and data compression. The requests module doesn’t come with Python, so you’ll have to install it first. From the command line, run pip install --user requests. (Appendix A has additional details on how to install third-party modules.)

The requests module was written because Python’s urllib2 module is too complicated to use. In fact, take a permanent marker and black out this entire paragraph. Forget I ever mentioned urllib2. If you need to download things from the web, just use the requests module.

Next, do a simple test to make sure the requests module installed itself correctly. Enter the following into the interactive shell:

In [1]:
import requests

### Downloading a Web Page with the requests.get() Function

The requests.get() function takes a string of a URL to download. By calling type() on requests.get()’s return value, you can see that it returns a Response object, which contains the response that the web server gave for your request. I’ll explain the Response object in more detail later, but for now, enter the following into the interactive shell while your computer is connected to the internet:

In [3]:
import requests

res = requests.get('https://automatetheboringstuff.com/files/rj.txt') # see the response that the web server gave
type(res) # this is a response object

requests.models.Response

In [4]:
res.status_code == requests.codes.ok # check whether the request for the web page succeeded (should return True)

True

In [11]:
len(res.text) # the length of the string stored in the Response object's text attribute

178978

In [5]:
print(res.text[:250]) # display the first 250 characters of the text

The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Projec


The URL goes to a text web page for the entire play of Romeo and Juliet, provided on this book’s site. You can tell that the request for this web page succeeded by checking the status_code attribute of the Response object. If it is equal to the value of requests.codes.ok, then everything went fine. (Incidentally, the status code for “OK” in the HTTP protocol is 200. You may already be familiar with the 404 status code for “Not Found.”) You can find a complete list of HTTP status codes and their meanings at https://en.wikipedia.org/wiki/List_of_HTTP_status_codes.

If the request succeeded, the downloaded web page is stored as a string in the Response object’s text attribute. This attribute holds a large string of the entire play; the call to len(res.text) shows you that it is more than 178,000 characters long. Finally, calling print(res.text[:250]) displays only the first 250 characters.

If the request failed and displayed an error message, like “Failed to establish a new connection” or “Max retries exceeded,” then check your internet connection. Connecting to servers can be quite complicated, and I can’t give a full list of possible problems here. You can find common causes of your error by doing a web search of the error message in quotes.
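The status codes mentioned above can also be looked up offline. The sketch below uses the standard library’s http.HTTPStatus enumeration, which is separate from the requests module; describe_status is a hypothetical helper name.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '404 Not Found' for a status code."""
    # HTTPStatus() raises ValueError for codes the standard library doesn't know.
    status = HTTPStatus(code)
    return f'{status.value} {status.phrase}'

for code in (200, 404):
    print(describe_status(code))
```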

### Checking for Errors

As you’ve seen, the Response object has a status_code attribute that can be checked against requests.codes.ok (a variable that has the integer value 200) to see whether the download succeeded. A simpler way to check for success is to call the raise_for_status() method on the Response object. This will raise an exception if there was an error downloading the file and will do nothing if the download succeeded. Enter the following into the interactive shell:

In [6]:
res = requests.get('https://inventwithpython.com/page_that_does_not_exist')
res.raise_for_status() # should raise an exception (specifically a 404 error) since this page does not exist 

HTTPError: 404 Client Error: Not Found for url: https://inventwithpython.com/page_that_does_not_exist

The raise_for_status() method is a good way to ensure that a program halts if a bad download occurs. This is a good thing: You want your program to stop as soon as some unexpected error happens. If a failed download isn’t a deal breaker for your program, you can wrap the raise_for_status() line with try and except statements to handle this error case without crashing.

In [7]:
# if a failed download is not a deal breaker for a program, you can use try and except statements
import requests
res = requests.get('https://inventwithpython.com/page_that_does_not_exist')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

There was a problem: 404 Client Error: Not Found for url: https://inventwithpython.com/page_that_does_not_exist


Always call raise_for_status() after calling requests.get(). You want to be sure that the download has actually worked before your program continues.

## Saving Downloaded Files to the Hard Drive

From here, you can save the web page to a file on your hard drive with the standard open() function and write() method. There are some slight differences, though. First, you must open the file in write binary mode by passing the string 'wb' as the second argument to open(). Even if the page is in plaintext (such as the Romeo and Juliet text you downloaded earlier), you need to write binary data instead of text data in order to maintain the Unicode encoding of the text.

To write the web page to a file, you can use a for loop with the Response object’s iter_content() method.

In [8]:
# first, let's set our working directory to a downloads folder
import os
path = r"C:\Users\Zac\OneDrive\Python\Practice\ATBS\Files\Downloads"
os.chdir(path)
print(os.getcwd())

C:\Users\Zac\OneDrive\Python\Practice\ATBS\Files\Downloads


In [12]:
import requests
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
res.raise_for_status() # check the success of the download
playFile = open('RomeoAndJuliet.txt', 'wb') # open a new file in write binary mode
for chunk in res.iter_content(100000): # iter content method breaks up the content into chunks (here of 100000 bytes)
    playFile.write(chunk)

playFile.close()

The iter_content() method returns “chunks” of the content on each iteration through the loop. Each chunk is of the bytes data type, and you get to specify how many bytes each chunk will contain. One hundred thousand bytes is generally a good size, so pass 100000 as the argument to iter_content().

The file RomeoAndJuliet.txt will now exist in the current working directory. Note that while the filename on the website was rj.txt, the file on your hard drive has a different filename. The requests module simply handles downloading the contents of web pages. Once the page is downloaded, it is simply data in your program. Even if you were to lose your internet connection after downloading the web page, all the page data would still be on your computer.

The write() method returns the number of bytes written to the file. In the previous example, there were 100,000 bytes in the first chunk, and the remaining part of the file needed only 78,981 bytes.

To review, here’s the complete process for downloading and saving a file:

    Call requests.get() to download the file.
    Call open() with 'wb' to create a new file in write binary mode.
    Loop over the Response object’s iter_content() method.
    Call write() on each iteration to write the content to the file.
    Call close() to close the file.

That’s all there is to the requests module! The for loop and iter_content() stuff may seem complicated compared to the open()/write()/close() workflow you’ve been using to write text files, but it’s to ensure that the requests module doesn’t eat up too much memory even if you download massive files. You can learn about the requests module’s other features from https://requests.readthedocs.org/.
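The steps above can be wrapped in a small function. This is a sketch under a few assumptions: save_response is a hypothetical name, a with statement replaces the explicit close() call, and FakeResponse is a made-up stand-in that mimics only the iter_content() method so the demonstration runs without a network connection.

```python
import os
import tempfile

def save_response(res, filename, chunk_size=100000):
    """Write a downloaded response to `filename` and return the chunk sizes written.

    `res` is expected to behave like a requests Response, that is, to
    offer iter_content(chunk_size) yielding bytes objects.
    """
    sizes = []
    # The with statement closes the file even if a write fails partway.
    with open(filename, 'wb') as out_file:
        for chunk in res.iter_content(chunk_size):
            sizes.append(out_file.write(chunk))  # write() returns the byte count
    return sizes

class FakeResponse:
    """Hypothetical stand-in for a Response, so the demo needs no network."""
    def __init__(self, data):
        self.data = data

    def iter_content(self, chunk_size):
        for start in range(0, len(self.data), chunk_size):
            yield self.data[start:start + chunk_size]

demo_path = os.path.join(tempfile.gettempdir(), 'save_response_demo.bin')
print(save_response(FakeResponse(b'x' * 250000), demo_path))
os.remove(demo_path)
```

With a real download, the same helper would be called as save_response(res, 'RomeoAndJuliet.txt') after res = requests.get(...) and res.raise_for_status().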

## HTML
Before you pick apart web pages, you’ll learn some HTML basics. You’ll also see how to access your web browser’s powerful developer tools, which will make scraping information from the web much easier.

### Resources for Learning HTML

Hypertext Markup Language (HTML) is the format that web pages are written in. This chapter assumes you have some basic experience with HTML, but if you need a beginner tutorial, I suggest one of the following sites:

    https://developer.mozilla.org/en-US/learn/html/
    https://htmldog.com/guides/html/beginner/
    https://www.codecademy.com/learn/learn-html

### A Quick Refresher

In case it’s been a while since you’ve looked at any HTML, here’s a quick overview of the basics. An HTML file is a plaintext file with the .html file extension. The text in these files is surrounded by tags, which are words enclosed in angle brackets. The tags tell the browser how to format the web page. A starting tag and closing tag can enclose some text to form an element. The text (or inner HTML) is the content between the starting and closing tags. For example, the following HTML will display Hello, world! in the browser, with Hello in bold:

<strong>Hello</strong>, world!


The opening <strong> tag says that the enclosed text will appear in bold. The closing </strong> tag tells the browser where the end of the bold text is.

There are many different tags in HTML. Some of these tags have extra properties in the form of attributes within the angle brackets. For example, the <a> tag encloses text that should be a link. The URL that the text links to is determined by the href attribute. Here’s an example:

Al's free <a href="https://inventwithpython.com">Python books</a>.

Some elements have an id attribute that is used to uniquely identify the element in the page. You will often instruct your programs to seek out an element by its id attribute, so figuring out an element’s id attribute using the browser’s developer tools is a common task in writing web scraping programs.

### Viewing the Source HTML of a Web Page

You’ll need to look at the HTML source of the web pages that your programs will work with. To do this, right-click (or CTRL-click on macOS) any web page in your web browser, and select View Source or View page source to see the HTML text of the page (see Figure 12-3). This is the text your browser actually receives. The browser knows how to display, or render, the web page from this HTML.

I highly recommend viewing the source HTML of some of your favorite sites. It’s fine if you don’t fully understand what you are seeing when you look at the source. You won’t need HTML mastery to write simple web scraping programs—after all, you won’t be writing your own websites. You just need enough knowledge to pick out data from an existing site.

### Opening Your Browser’s Developer Tools

In addition to viewing a web page’s source, you can look through a page’s HTML using your browser’s developer tools. In Chrome and Internet Explorer for Windows, the developer tools are already installed, and you can press F12 to make them appear (see Figure 12-4). Pressing F12 again will make the developer tools disappear. In Chrome, you can also bring up the developer tools by selecting View ▸ Developer ▸ Developer Tools. In macOS, pressing ⌘-OPTION-I will open Chrome’s Developer Tools.

In Firefox, you can bring up the Web Developer Tools Inspector by pressing CTRL-SHIFT-C on Windows and Linux or by pressing ⌘-OPTION-C on macOS. The layout is almost identical to Chrome’s developer tools.

In Safari, open the Preferences window, and on the Advanced pane check the Show Develop menu in the menu bar option. After it has been enabled, you can bring up the developer tools by pressing ⌘-OPTION-I.

After enabling or installing the developer tools in your browser, you can right-click any part of the web page and select Inspect Element from the context menu to bring up the HTML responsible for that part of the page. This will be helpful when you begin to parse HTML for your web scraping programs.

### DON’T USE REGULAR EXPRESSIONS TO PARSE HTML

Locating a specific piece of HTML in a string seems like a perfect case for regular expressions. However, I advise you against it. There are many different ways that HTML can be formatted and still be considered valid HTML, but trying to capture all these possible variations in a regular expression can be tedious and error prone. A module developed specifically for parsing HTML, such as bs4, will be less likely to result in bugs.
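To make the problem concrete, the sketch below writes the same link four valid ways and compares a naive regular expression with a real HTML parser. The standard library’s html.parser stands in for bs4 here so that the example runs without extra installs; the pattern, the class name ext, and the LinkCollector class are all made up for illustration.

```python
import re
from html.parser import HTMLParser

# Four equally valid ways to write the same link.
VARIANTS = [
    '<a href="https://inventwithpython.com">Python books</a>',
    "<a href='https://inventwithpython.com'>Python books</a>",
    '<A HREF="https://inventwithpython.com">Python books</A>',
    '<a class="ext" href="https://inventwithpython.com">Python books</a>',
]

# A naive pattern that only anticipates the first spelling.
NAIVE_LINK = re.compile(r'<a href="(.*?)">')

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips quotes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

def links_via_parser(html):
    """Return the list of href values found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links

# Count how many links each approach finds in each variant.
regex_hits = [len(NAIVE_LINK.findall(v)) for v in VARIANTS]
parser_hits = [len(links_via_parser(v)) for v in VARIANTS]
print(regex_hits)
print(parser_hits)
```

The regular expression finds the link only in the first variant, while the parser finds it in all four. The pattern could be patched for each new variation, but every patch makes it harder to read and leaves other valid spellings unhandled.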

### Using the Developer Tools to Find HTML Elements

Once your program has downloaded a web page using the requests module, you will have the page’s HTML content as a single string value. Now you need to figure out which part of the HTML corresponds to the information on the web page you’re interested in.

This is where the browser’s developer tools can help. Say you want to write a program to pull weather forecast data from https://weather.gov/. Before writing any code, do a little research. If you visit the site and search for the 94105 ZIP code, the site will take you to a page showing the forecast for that area.

What if you’re interested in scraping the weather information for that ZIP code? Right-click where it is on the page (or CONTROL-click on macOS) and select Inspect Element from the context menu that appears. This will bring up the Developer Tools window, which shows you the HTML that produces this particular part of the web page. Figure 12-5 shows the developer tools open to the HTML of the nearest forecast. Note that if the https://weather.gov/ site changes the design of its web pages, you’ll need to repeat this process to inspect the new elements.

From the developer tools, you can see that the HTML responsible for the forecast part of the web page is <div class="col-sm-10 forecast-text">Sunny, with a high near 64. West wind 11 to 16 mph, with gusts as high as 21 mph.</div>. This is exactly what you were looking for! It seems that the forecast information is contained inside a <div> element with the forecast-text CSS class. Right-click on this element in the browser’s developer console, and from the context menu that appears, select Copy ▸ CSS Selector. This will copy a string such as 'div.row-odd:nth-child(1) > div:nth-child(2)' to the clipboard. You can use this string for Beautiful Soup’s select() or Selenium’s find_element_by_css_selector() methods, as explained later in this chapter. Now that you know what you’re looking for, the Beautiful Soup module will help you find it in the string.

## Parsing HTML with the bs4 Module

Beautiful Soup is a module for extracting information from an HTML page (and is much better for this purpose than regular expressions). The Beautiful Soup module’s name is bs4 (for Beautiful Soup, version 4). To install it, you will need to run pip install --user beautifulsoup4 from the command line. (Check out Appendix A for instructions on installing third-party modules.) While beautifulsoup4 is the name used for installation, to import Beautiful Soup you run import bs4.

For this chapter, the Beautiful Soup examples will parse (that is, analyze and identify the parts of) an HTML file on the hard drive. Open a new file editor tab in Mu, enter the following, and save it as example.html. Alternatively, download it from https://nostarch.com/automatestuff2/.
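One way to create example.html without typing it by hand is the short script below. Treat it as a reconstruction rather than the official file: the three <p> elements are taken from the outputs shown later in this chapter, while the HTML comment, the title text, and the surrounding html, head, and body tags are assumptions. The link’s URL is written on one line here, whereas the output shown later has a line break inside it. The function name write_example_html is hypothetical.

```python
from pathlib import Path

# Reconstructed content; the comment, title, and wrapper tags are assumptions.
EXAMPLE_HTML = """<!-- This is the example.html example file. -->

<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="https://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>
"""

def write_example_html(folder='.'):
    """Write example.html into `folder` and return the path to the new file."""
    path = Path(folder) / 'example.html'
    path.write_text(EXAMPLE_HTML, encoding='utf-8')
    return path

# Create the file in the current working directory.
print(write_example_html())
```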

As you can see, even a simple HTML file involves many different tags and attributes, and matters quickly get confusing with complex websites. Thankfully, Beautiful Soup makes working with HTML much easier.

### Creating a BeautifulSoup Object from HTML

The bs4.BeautifulSoup() function needs to be called with a string containing the HTML it will parse. The bs4.BeautifulSoup() function returns a BeautifulSoup object. Enter the following into the interactive shell while your computer is connected to the internet:

In [13]:
# download a web page to load into Beautiful Soup
import requests, bs4
res = requests.get('https://nostarch.com') # download the main page from no starch press website
res.raise_for_status() # check that download was successful
#  pass the text value of the response object to beautiful soup and then store the beautiful soup object in a variable named noStarchSoup
noStarchSoup = bs4.BeautifulSoup(res.text, 'html.parser') 
type(noStarchSoup) # get the type of the object (should return a beautifulsoup object)

bs4.BeautifulSoup

This code uses requests.get() to download the main page from the No Starch Press website and then passes the text attribute of the response to bs4.BeautifulSoup(). The BeautifulSoup object that it returns is stored in a variable named noStarchSoup.

You can also load an HTML file from your hard drive by passing a File object to bs4.BeautifulSoup() along with a second argument that tells Beautiful Soup which parser to use to analyze the HTML.

Enter the following into the interactive shell (after making sure the example.html file is in the working directory):

In [14]:
# load an html file from hard drive into beautiful soup
exampleFile = open(r"C:\Users\Zac\OneDrive\Python\Practice\ATBS\Files\example.html")
exampleSoup = bs4.BeautifulSoup(exampleFile, 'html.parser')
type(exampleSoup)

bs4.BeautifulSoup

The 'html.parser' parser used here comes with Python. However, you can use the faster 'lxml' parser if you install the third-party lxml module. Follow the instructions in Appendix A to install this module by running pip install --user lxml. Forgetting to include this second argument will result in a UserWarning: No parser was explicitly specified warning.

Once you have a BeautifulSoup object, you can use its methods to locate specific parts of an HTML document.

### Finding an Element with the select() Method

You can retrieve a web page element from a BeautifulSoup object by calling the select() method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: they specify a pattern to look for, in this case in HTML pages instead of general text strings.

A full discussion of CSS selector syntax is beyond the scope of this book (there’s a good selector tutorial in the resources at https://nostarch.com/automatestuff2/), but here’s a short introduction to selectors. Table 12-2 shows examples of the most common CSS selector patterns.

Table 12-2: Examples of CSS Selectors

| Selector passed to the select() method | Will match . . . |
| --- | --- |
| soup.select('div') | All elements named <div> |
| soup.select('#author') | The element with an id attribute of author |
| soup.select('.notice') | All elements that use a CSS class attribute named notice |
| soup.select('div span') | All elements named <span> that are within an element named <div> |
| soup.select('div > span') | All elements named <span> that are directly within an element named <div>, with no other element in between |
| soup.select('input[name]') | All elements named <input> that have a name attribute with any value |
| soup.select('input[type="button"]') | All elements named <input> that have an attribute named type with value button |

The various selector patterns can be combined to make sophisticated matches. For example, soup.select('p #author') will match any element that has an id attribute of author, as long as it is also inside a <p> element. Instead of writing the selector yourself, you can also right-click on the element in your browser and select Inspect Element. When the browser’s developer console opens, right-click on the element’s HTML and select Copy ▸ CSS Selector to copy the selector string to the clipboard and paste it into your source code.
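The selector patterns discussed above can be tried against a small HTML string. The markup in this sketch is made up purely to exercise each pattern; it is not example.html, and the element contents are placeholders.

```python
import bs4

# A small, made-up page used only to exercise the selectors.
SAMPLE_HTML = """
<div id="main">
  <p class="notice">Site maintenance tonight.</p>
  <p>By <span id="author">Al Sweigart</span></p>
  <span>A direct child of the div</span>
  <form>
    <input name="q" type="text">
    <input type="button" value="Search">
  </form>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')

selectors = ['div', '#author', '.notice', 'div span', 'div > span',
             'input[name]', 'input[type="button"]', 'p #author']

# Map each selector to the number of elements it matches.
match_counts = {sel: len(soup.select(sel)) for sel in selectors}
for sel, count in match_counts.items():
    print(f'{sel!r}: {count}')
```

Note the difference between 'div span' and 'div > span': the first matches both <span> elements because each sits somewhere inside the <div>, while the second matches only the <span> that is an immediate child of it.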

The select() method will return a list of Tag objects, which is how Beautiful Soup represents an HTML element. The list will contain one Tag object for every match in the BeautifulSoup object’s HTML. Tag values can be passed to the str() function to show the HTML tags they represent. Tag values also have an attrs attribute that shows all the HTML attributes of the tag as a dictionary. Using the example.html file from earlier, enter the following into the interactive shell:

In [15]:
# load an html file from hard drive into beautiful soup
import bs4
exampleFile = open(r"C:\Users\Zac\OneDrive\Python\Practice\ATBS\Files\example.html")
exampleSoup = bs4.BeautifulSoup(exampleFile, 'html.parser') # beautiful soup object with html parser
elems = exampleSoup.select('#author') # return a list of all elements containing id="author"
type(elems) # elems is a list of tag objects

bs4.element.ResultSet

In [16]:
len(elems)

1

In [17]:
type(elems[0])

bs4.element.Tag

In [18]:
str(elems[0]) #returns a string with the starting and closing tags and the element’s text.

'<span id="author">Al Sweigart</span>'

In [19]:
elems[0].getText() # returns the element’s text, or inner HTML

'Al Sweigart'

In [20]:
elems[0].attrs # gives us a dictionary with the element’s attribute, 'id', and the value of the id attribute, 'author'.

{'id': 'author'}

This code will pull the element with id="author" out of our example HTML. We use select('#author') to return a list of all the elements with id="author". We store this list of Tag objects in the variable elems, and len(elems) tells us there is one Tag object in the list; there was one match. Calling getText() on the element returns the element’s text, or inner HTML. The text of an element is the content between the opening and closing tags: in this case, 'Al Sweigart'.

Passing the element to str() returns a string with the starting and closing tags and the element’s text. Finally, attrs gives us a dictionary with the element’s attribute, 'id', and the value of the id attribute, 'author'.

You can also pull all the <p> elements from the BeautifulSoup object. Enter this into the interactive shell:

In [21]:
pElems = exampleSoup.select('p') # return a list of all p elements in the beautiful soup object
str(pElems[0])

'<p>Download my <strong>Python</strong> book from <a href="https://\ninventwithpython.com">my website</a>.</p>'

In [22]:
pElems[0].getText()

'Download my Python book from my website.'

In [23]:
str(pElems[1])

'<p class="slogan">Learn Python the easy way!</p>'

In [24]:
pElems[1].getText()

'Learn Python the easy way!'

In [25]:
str(pElems[2])

'<p>By <span id="author">Al Sweigart</span></p>'

In [26]:
pElems[2].getText()

'By Al Sweigart'

In [31]:
# iterate through all of the elements in the object
for elem in pElems:
    print(elem.getText())

Download my Python book from my website.
Learn Python the easy way!
By Al Sweigart


This time, select() gives us a list of three matches, which we store in pElems. Using str() on pElems[0], pElems[1], and pElems[2] shows you each element as a string, and using getText() on each element shows you its text.

### Getting Data from an Element’s Attributes

The get() method for Tag objects makes it simple to access attribute values from an element. The method is passed a string of an attribute name and returns that attribute’s value. Using example.html, enter the following into the interactive shell:

In [34]:
import bs4
exampleFile = open(r"C:\Users\Zac\OneDrive\Python\Practice\ATBS\Files\example.html")
soup = bs4.BeautifulSoup(exampleFile, 'html.parser')
spanElem = soup.select('span')[0] # find any span elements and store the first matched element into spanElem
str(spanElem)

'<span id="author">Al Sweigart</span>'

In [35]:
spanElem.get('id') # return the id attribute's value

'author'

In [36]:
spanElem.get('some_nonexistent_addr') is None

True

In [37]:
spanElem.attrs

{'id': 'author'}

Here we use select() to find any <span> elements and then store the first matched element in spanElem. Passing the attribute name 'id' to get() returns the attribute’s value, 'author'.
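A common use of get() is collecting the URLs of every link on a page. The sketch below parses an inline string modeled on two paragraphs of example.html instead of opening the file; the attribute name title and the fallback text are arbitrary choices for illustration.

```python
import bs4

html = ('<p>Download my <strong>Python</strong> book from '
        '<a href="https://inventwithpython.com">my website</a>.</p>'
        '<p>By <span id="author">Al Sweigart</span></p>')
soup = bs4.BeautifulSoup(html, 'html.parser')

# Collect the href attribute of every link on the page.
hrefs = [link.get('href') for link in soup.select('a')]
print(hrefs)

span = soup.select('span')[0]
# get() returns None for a missing attribute; a second argument supplies
# a different fallback, much like dict.get().
print(span.get('title'))
print(span.get('title', 'no title'))
```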