# **Python Libraries - `requests`, `bs4`**

**Author: Eni Mustafaraj**

In this notebook I'll be showing you:

1. how to use the `requests` package to download an HTML file
2. how to install new packages directly from Jupyter
3. how to use `bs4` (BeautifulSoup) to parse the content of a simple HTML file

## Part 1: Using `requests`

We humans use a web browser to read HTML pages on the Web. Every time we visit a webpage, its HTML file is transferred to our computer (and stored on our local drive). When we don't use a browser to visit websites, we can use libraries from a programming language to perform the same action as the browser.

In Python, we will use `requests` to download files from your web folder.

In [1]:
import requests

Use the `get` method to send a request to the server for the desired file. Check that the response was received.

In [2]:
response = requests.get("http://cs.wellesley.edu/~cs315/readings/index.html")
print(response.status_code)

200
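
As a side note on what this number means: `200` is a standard HTTP status code, and Python's standard library can translate such codes into their reason phrases. The sketch below uses `http.HTTPStatus`; the helper name `describe_status` is made up for this illustration and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase

print(describe_status(200))  # OK
print(describe_status(404))  # Not Found
```

A `requests` response exposes the same kind of phrase through its `reason` attribute.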


The variable `response` refers to an object, an instance of a class defined in the `requests` library. We can verify this with the Python function `type`:

In [3]:
type(response)

requests.models.Response

We use `dir` to look up the list of all attributes and methods of an object, class, or library:

In [4]:
print(dir(response))

['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']


Look up the URL of the response (notice that our `http` request ended up at an `https` address):

In [5]:
response.url

'https://cs.wellesley.edu/~cs315/readings/index.html'

Check if the desired phrase is in the response's content:

In [6]:
response.text.find("for protection") != -1 

# Why are we checking that the value of the expression on the left is different from -1?
# Answer: Because the string method find returns -1 when it doesn't find the substring in the text

True
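
The check above can also be written with Python's `in` operator, which returns a boolean directly. Here is a small sketch comparing the two; the variable `text` is a stand-in string for `response.text`, so no download is needed.

```python
# Stand-in for response.text (an assumption for this example).
text = "<p>This page is here for protection.</p>"

# find returns the index of the first match, or -1 if there is none.
position = text.find("for protection")
found_with_find = position != -1

# The in operator answers the yes/no question directly.
found_with_in = "for protection" in text

print(found_with_find, found_with_in)
```

Use `find` when you need the position of the match, and `in` when you only need to know whether the phrase is there.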

### Use case: checking your CS server accounts

First, we need to get your accounts:

In [7]:
with open('sec02.txt') as inputF:
    lines = inputF.readlines()
    
print(lines)

['cf104@wellesley.edu\n', 'sg110@wellesley.edu\n', 'mg107@wellesley.edu\n', 'sh110@wellesley.edu\n', 'cj104@wellesley.edu\n', 'jk103@wellesley.edu\n', 'sk112@wellesley.edu\n', 'qk100@wellesley.edu\n', 'hl105@wellesley.edu\n', 'al118@wellesley.edu\n', 'll111@wellesley.edu\n', 'yl106@wellesley.edu\n', 'ml112@wellesley.edu\n', 'mm121@wellesley.edu\n', 'jr109@wellesley.edu\n', 'tr100@wellesley.edu\n', 'ay106@wellesley.edu\n', 'fy100@wellesley.edu']


Clean up the list of accounts, using a list comprehension:

In [8]:
accounts = [line.split('@')[0] for line in lines]
print(accounts)

['cf104', 'sg110', 'mg107', 'sh110', 'cj104', 'jk103', 'sk112', 'qk100', 'hl105', 'al118', 'll111', 'yl106', 'ml112', 'mm121', 'jr109', 'tr100', 'ay106', 'fy100']
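
The comprehension above works because every line has the form `name@domain`. If the file might contain blank lines or stray spaces, a slightly more defensive version could look like the sketch below. The sample list `lines` here is made up for illustration (the blank line and the extra spaces are not in the real file).

```python
# Made-up sample: includes a blank line and surrounding spaces on purpose.
lines = ['cf104@wellesley.edu\n', '\n', '  sg110@wellesley.edu \n', 'fy100@wellesley.edu']

# strip() removes surrounding whitespace (including '\n');
# the condition at the end skips lines that are empty after stripping.
accounts = [line.strip().split('@')[0] for line in lines if line.strip()]

print(accounts)  # ['cf104', 'sg110', 'fy100']
```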


Now that we have all your accounts, let's generate the URLs for them:

In [9]:
# generate URLs for all accounts via list comprehension and the format string syntax

urls = [f'http://cs.wellesley.edu/~{acc}/index.html' for acc in accounts]

for el in urls:
    print(el) 

http://cs.wellesley.edu/~cf104/index.html
http://cs.wellesley.edu/~sg110/index.html
http://cs.wellesley.edu/~mg107/index.html
http://cs.wellesley.edu/~sh110/index.html
http://cs.wellesley.edu/~cj104/index.html
http://cs.wellesley.edu/~jk103/index.html
http://cs.wellesley.edu/~sk112/index.html
http://cs.wellesley.edu/~qk100/index.html
http://cs.wellesley.edu/~hl105/index.html
http://cs.wellesley.edu/~al118/index.html
http://cs.wellesley.edu/~ll111/index.html
http://cs.wellesley.edu/~yl106/index.html
http://cs.wellesley.edu/~ml112/index.html
http://cs.wellesley.edu/~mm121/index.html
http://cs.wellesley.edu/~jr109/index.html
http://cs.wellesley.edu/~tr100/index.html
http://cs.wellesley.edu/~ay106/index.html
http://cs.wellesley.edu/~fy100/index.html


We will now look up the text of each `index.html` file to check whether it contains our desired phrase.

In [10]:
for acc in accounts:
    url = f"http://cs.wellesley.edu/~{acc}/index.html"
    response = requests.get(url)
    if response.status_code == 200:
        if response.text.find("for protection") != -1:
            print(acc, "SUCCESS")
        else:
            print(acc, "Didn't find phrase.")
    else:
        print(acc, "ERROR", response.reason)

cf104 Didn't find phrase.
sg110 ERROR Not Found
mg107 SUCCESS
sh110 Didn't find phrase.
cj104 SUCCESS
jk103 Didn't find phrase.
sk112 ERROR Not Found
qk100 SUCCESS
hl105 Didn't find phrase.
al118 SUCCESS
ll111 ERROR Not Found
yl106 SUCCESS
ml112 SUCCESS
mm121 SUCCESS
jr109 ERROR Not Found
tr100 Didn't find phrase.
ay106 Didn't find phrase.
fy100 SUCCESS
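
The decision made inside the loop above (error, phrase found, phrase missing) can be pulled out into its own function, which makes it easy to try out without contacting the server. This is only a sketch: the function name `classify`, the label `"MISSING"`, and the sample data are assumptions made for this example.

```python
from collections import Counter

def classify(status_code, text, phrase="for protection"):
    """Map one response to 'SUCCESS', 'MISSING', or 'ERROR'."""
    if status_code != 200:
        return "ERROR"
    return "SUCCESS" if phrase in text else "MISSING"

# Made-up (status_code, text) pairs standing in for real responses.
samples = [
    (200, "<p>This page is here for protection.</p>"),
    (200, "<p>Hello!</p>"),
    (404, "Not Found"),
]

# Count how many responses fall into each category.
tally = Counter(classify(code, text) for code, text in samples)
print(tally)
```

Inside the real loop, the call would be `classify(response.status_code, response.text)`.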


## Part 2: Install a new package

When our Python installation doesn't contain a package/module, we will get an error (a `ModuleNotFoundError`) when importing it:

In [11]:
import textblob

It's easy to install packages from the notebook itself: just use the command `pip install` followed by the package name.

In [12]:
pip install textblob

Note: you may need to restart the kernel to use updated packages.
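
If you would rather check whether a package is available before importing it, the standard library offers `importlib.util.find_spec`. The sketch below wraps it in a helper; the name `is_installed` is made up for this example, and the helper is only meant for top-level module names such as `textblob`.

```python
import importlib.util

def is_installed(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None

print(is_installed("json"))                           # True (part of the standard library)
print(is_installed("surely_not_a_real_package_xyz"))  # False
```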


## Part 3: BeautifulSoup

This is a library that helps parse HTML documents. 

In [13]:
import bs4

The class we will use is called `BeautifulSoup`, but since the name is long, we will rename it as `BS`.

In [14]:
from bs4 import BeautifulSoup as BS

First, I'm creating a simple function to get the content of HTML pages based on their URLs:

In [15]:
def getHTMLPage(url):
    """Given a url, get the HTML page content"""
    
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        print("Failure reason:", response.reason)
        return

Now I will get the content of the HTML file using the function I created:

In [8]:
url = "http://cs.wellesley.edu/~cs315/readings/index.html"
htmlPage = getHTMLPage(url)
print(htmlPage)

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Purpose</title>
</head>
<body>
    <p>This page is here for protection.</p>
</body>
</html>



Let's check what value type is stored in `htmlPage`:

In [9]:
type(htmlPage)

str

The BeautifulSoup constructor will create the DOM (document object model) object:

In [10]:
domTree = BS(htmlPage, 'html.parser')
type(domTree)

bs4.BeautifulSoup

**Note:** `htmlPage` is simply a string, while `domTree` is an object (an instance of the class `BeautifulSoup`).

In [11]:
print(dir(domTree))



Let us use the method `find` to find elements with a given tag:

In [12]:
domTree.find('p') # get a paragraph element

<p>This page is here for protection.</p>

In [13]:
domTree.find('body') # get the body element

<body>
<p>This page is here for protection.</p>
</body>

In [14]:
domTree.find('title') # get the title element

<title>Purpose</title>

In [15]:
domTree.find('p').text # get the text of the p element

'This page is here for protection.'
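
To get a feeling for the work that `find('p').text` saves us, here is a rough sketch that extracts the text of a tag using only `html.parser` from Python's standard library (the same parser we named when calling `BS`). It is an illustration, not how BeautifulSoup is implemented; the class `TagTextCollector` and the function `text_of` are names made up for this example, and the sketch ignores nested tags with the same name.

```python
from html.parser import HTMLParser

# The same page we downloaded above, stored as a string.
HTML = """<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Purpose</title>
</head>
<body>
    <p>This page is here for protection.</p>
</body>
</html>
"""

class TagTextCollector(HTMLParser):
    """Collect the text that appears inside one kind of tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False  # True while we are between <tag> and </tag>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

def text_of(tag, html):
    """Return the text found inside `tag` in the given HTML string."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)

print(text_of("p", HTML))      # This page is here for protection.
print(text_of("title", HTML))  # Purpose
```

With BeautifulSoup, all of this bookkeeping is replaced by a single call to `find`.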

This was just to give you a taste of BeautifulSoup. We will continue doing more work with it in future tutorials.