### Downloading a Web Page with the requests.get() Function

In [2]:
import requests

In [3]:
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
type(res)

SSLError: HTTPSConnectionPool(host='automatetheboringstuff.com', port=443): Max retries exceeded with url: /files/rj.txt (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1006)')))
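
The SSLError above means certificate verification failed, so no Response object was created; that is also why the later cells report that res is not defined. A "self-signed certificate in certificate chain" message often appears when a proxy or security tool re-signs HTTPS traffic. If that is the cause here (an assumption, since the notes do not say), one possible remedy is to point requests at a CA bundle that contains the proxy's root certificate. The sketch below only configures a session and makes no request; the helper name make_session and the bundle path are hypothetical.

```python
import requests

def make_session(ca_bundle_path):
    """Return a Session that verifies HTTPS certificates against a custom CA bundle."""
    session = requests.Session()
    # verify accepts a path to a CA bundle file. Assumption: the file holds
    # the root certificate of whatever is re-signing the traffic.
    session.verify = ca_bundle_path
    return session

# Hypothetical path; replace it with the bundle supplied for your network.
session = make_session('/path/to/ca-bundle.pem')

# With a valid bundle in place, the download would look like this:
# res = session.get('https://automatetheboringstuff.com/files/rj.txt')
```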

You can tell that the request for this web page succeeded by checking the status_code attribute of the Response object. If it is equal to the value of requests.codes.ok, then everything went fine. (Incidentally, the status code for “OK” in the HTTP protocol is 200. You may already be familiar with the 404 status code for “Not Found.”) You can find a complete list of HTTP status codes and their meanings at https://en.wikipedia.org/wiki/List_of_HTTP_status_codes.

In [4]:
res.status_code == requests.codes.ok

NameError: name 'res' is not defined
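
Another way to check for success is the Response object's raise_for_status() method, which raises an exception when the status code signals an error and does nothing when the request succeeded. The sketch below builds Response objects by hand so that it runs without a network connection; setting status_code manually is for illustration only, and the helper name fake_response is hypothetical.

```python
import requests

def fake_response(status_code):
    """Build a bare Response with the given status code (illustration only)."""
    res = requests.Response()
    res.status_code = status_code
    return res

ok_res = fake_response(200)
print(ok_res.status_code == requests.codes.ok)  # the comparison from the text
ok_res.raise_for_status()                       # no exception for a 200 status

missing_res = fake_response(404)
try:
    missing_res.raise_for_status()
    failed = False
except requests.exceptions.HTTPError as exc:
    # raise_for_status() raises HTTPError for 4xx and 5xx status codes.
    failed = True
    print('There was a problem: %s' % exc)
```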

If the request succeeded, the downloaded web page is stored as a string in the Response object’s text attribute. This attribute holds a large string of the entire play; the call to len(res.text) shows you that it is more than 178,000 characters long. Finally, calling print(res.text[:250]) displays only the first 250 characters.

In [5]:
print(len(res.text))
print(res.text[:250])

NameError: name 'res' is not defined

#### Saving Downloaded Files to the Hard Drive

To write the web page to a file, you can use a for loop with the Response object’s iter_content() method.

To review, here’s the complete process for downloading and saving a file:

1. Call requests.get() to download the file.
2. Call open() with 'wb' to create a new file in write binary mode.
3. Loop over the Response object’s iter_content() method.
4. Call write() on each iteration to write the content to the file.
5. Call close() to close the file.

In [1]:
import requests
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
res.raise_for_status()
playFile = open('RomeoAndJuliet.txt', 'wb')
for chunk in res.iter_content(100000):
    playFile.write(chunk)

playFile.close()

SSLError: HTTPSConnectionPool(host='automatetheboringstuff.com', port=443): Max retries exceeded with url: /files/rj.txt (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1006)')))

The iter_content() method returns “chunks” of the content on each iteration through the loop. Each chunk is of the bytes data type, and you get to specify how many bytes each chunk will contain. One hundred thousand bytes is generally a good size, so pass 100000 as the argument to iter_content().
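
As a variation on the steps above, a with statement closes the file automatically, even if a write fails partway through. The following is a minimal sketch rather than the book's code: save_chunks is a hypothetical helper that accepts any iterable of bytes chunks, so it can be tried on stand-in data without a network connection.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path and return the bytes written."""
    total = 0
    with open(path, 'wb') as out_file:  # the file closes when the block ends
        for chunk in chunks:
            total += out_file.write(chunk)
    return total

# Stand-in for res.iter_content(100000): three small bytes chunks.
demo_chunks = [b'Romeo ', b'and ', b'Juliet']
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
written = save_chunks(demo_chunks, demo_path)
print(written)
```

With a real download, the call would be save_chunks(res.iter_content(100000), 'RomeoAndJuliet.txt').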

### HTML

#### Using the Developer Tools to Find HTML Elements

#### Parsing HTML with the bs4 Module

The bs4.BeautifulSoup() function needs to be called with a string containing the HTML it will parse. The bs4.BeautifulSoup() function returns a BeautifulSoup object.

In [2]:
import requests, bs4
res = requests.get('https://nostarch.com')
res.raise_for_status()
noStarchSoup = bs4.BeautifulSoup(res.text, 'html.parser')
type(noStarchSoup)

SSLError: HTTPSConnectionPool(host='nostarch.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1006)')))
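
Because bs4.BeautifulSoup() only needs a string of HTML, it can also be tried without downloading anything. The sketch below parses a short, made-up HTML string; the variable names and the markup are illustrative assumptions, not content from nostarch.com.

```python
import bs4

# Made-up HTML; any string of markup works here.
example_html = (
    '<html><head><title>Example Page</title></head>'
    '<body><p id="author">Example Author</p></body></html>'
)
example_soup = bs4.BeautifulSoup(example_html, 'html.parser')

print(type(example_soup))
print(example_soup.title.string)
```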

#### Finding an Element with the select() Method

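To find an element in a BeautifulSoup object, call its select() method and pass a string containing a CSS selector. There’s a good selector tutorial in the resources at https://nostarch.com/automatestuff2/, but here’s a short introduction to selectors: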


| Selector passed to the select() method | Matches |
| --- | --- |
| `soup.select('div')` | All elements named `<div>` |
| `soup.select('#author')` | The element with an id attribute of author |
| `soup.select('.notice')` | All elements that use a CSS class attribute named notice |
| `soup.select('div span')` | All elements named `<span>` that are within an element named `<div>` |
| `soup.select('div > span')` | All elements named `<span>` that are directly within an element named `<div>`, with no other element in between |
| `soup.select('input[name]')` | All elements named `<input>` that have a name attribute with any value |
| `soup.select('input[type="button"]')` | All elements named `<input>` that have an attribute named type with value button |
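
The selectors listed above can be tried on a small, made-up HTML string. This is a sketch for experimentation; the markup and variable names are assumptions chosen so that each selector matches something.

```python
import bs4

# Made-up markup: one <div> holding a direct <span>, a <p> with a nested
# <span>, and two <input> elements (only the first has a name attribute).
html = '''
<div>
  <span id="author">Example Author</span>
  <p class="notice">First notice <span>nested</span></p>
  <input name="email" type="text">
  <input type="button" value="Send">
</div>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')

divs = soup.select('div')                      # every <div>
author = soup.select('#author')                # element with id="author"
notices = soup.select('.notice')               # elements with class "notice"
spans_in_div = soup.select('div span')         # <span> anywhere inside a <div>
direct_spans = soup.select('div > span')       # <span> directly inside a <div>
named_inputs = soup.select('input[name]')      # <input> with a name attribute
buttons = soup.select('input[type="button"]')  # <input> with type="button"

# select() always returns a list, so index into it to reach one element.
print(author[0].getText())
print(len(spans_in_div), len(direct_spans))
```

The last line shows the difference between the two span selectors: the descendant form matches both `<span>` elements, while the child form matches only the one directly under the `<div>`.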