# Recap Last Week

We covered:
* First class Functions
* Functions within Functions
* Decorators
* Context Managers

# This Week

Interacting with the web:

* A Primer on HTTP
* The requests module in Python
* Getting data from the web
* Basics of web scraping with Beautiful Soup



## Basics of HTTP

HTTP is foundational to any exchange of data on the web. HTTP is a **client-server protocol**, which means that clients (often a web browser) must initiate the interaction. Once the server has received a message from the client, it can choose whether or not to respond and what information to return. The messages a client sends are typically called **requests**, and there are several different types of requests that a client can make. Whatever the server sends back is known as the **response**.

## HTTP Request Methods

Again, when dealing with web interactions, the client is **ALWAYS** the side that makes the **request**. At a minimum, each request contains a **method**, informing the server what action the client wants to perform, and a **URL**, which specifies the path to the resource. HTTP requests typically also contain **headers**, which provide the server with additional information about the request, and some HTTP methods include a **body** because they send data to the server.

**A list of common HTTP methods**

* **GET** - Asks the server for a resource: an HTML document, CSS stylesheet, JavaScript file, JSON data, etc.

* **POST** - Sends data to the server, typically form data from an HTML page. Often used to create new resources on the server.

* **PUT** - Also sends data to the server, but is often used to update existing resources instead of creating new ones.

* **DELETE** - As the name suggests, this method is used to delete resources from the server.
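To make the parts of a request (method, URL, headers, body) concrete, here is a minimal sketch that builds a GET and a POST with the `requests` library without ever sending them. The `example.com` URL, the `page` parameter, and the `name` form field are placeholders invented for illustration, not a real API.

```python
import requests

# Build requests without sending them; example.com and the fields are placeholders.
get_req = requests.Request(
    "GET", "https://example.com/recipes", params={"page": 2}
).prepare()

post_req = requests.Request(
    "POST", "https://example.com/recipes", data={"name": "ribs"}
).prepare()

for prepared in (get_req, post_req):
    print(prepared.method, prepared.url)
    # A GET built this way has no Content-Type header and no body
    print("  Content-Type:", prepared.headers.get("Content-Type"))
    print("  body:", prepared.body)
```

The GET carries its data in the URL's query string and has no body, while the POST carries form-encoded data in the body along with a matching `Content-Type` header.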



In [1]:
import requests

To make an HTTP GET request, we can use the `get` function from the `requests` module.

In [12]:
url = "https://www.allrecipes.com/recipes/88/bbq-grilling"

resp = requests.get(url)
# The response object has a reference to the request that we just sent
req = resp.request

In [19]:
# Take a look at the request object, and what we can call on it
print(dir(req))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_body_position', '_cookies', '_encode_files', '_encode_params', '_get_idna_encoded_host', 'body', 'copy', 'deregister_hook', 'headers', 'hooks', 'method', 'path_url', 'prepare', 'prepare_auth', 'prepare_body', 'prepare_content_length', 'prepare_cookies', 'prepare_headers', 'prepare_hooks', 'prepare_method', 'prepare_url', 'register_hook', 'url']


## Inspecting the request object

In [20]:
print(f'The request method was {req.method}', end='\n\n')
print(f'The request url was {req.url}', end='\n\n')
print(f'The request headers were {req.headers}', end='\n\n')
print(f'The request body was {req.body}', end='\n\n')

The request method was GET

The request url was https://www.allrecipes.com/recipes/88/bbq-grilling/

The request headers were {'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'FirstImpression=False; ARSiteUser=1-b7535850-7b50-42b9-99da-094f5638c2ff; ARCompressedSession=LwcAAB+LCAAAAAAABAB9VdmSokoQ/RdfezpYlK3fBEFRAWVTfGMppNgKilUn5t8vbe9zb9yICiLqZFbmycxTxe+ZBa+l33YYzF5mt/0gX+KqgNDny4aEg9+VB5FPbYsySETtV6Ivk5wgKe69Ns5Fwu3ZJy/lb5SgRmNzFhZZfaFrqcuV2o6DiHPDi2yr+kVjIambwEsMxD+BRe9unVGkg2pd02V3XO/EbUHMaQJyhZdhcWE1Tl7vMFemfQtMlTk1JM62TB0RSgJoFfietutXlSNxdy0+3UNq4eW27k1MQ79QXC41stMZBH4+H+9QPaojxWb93Xc8eKvDLBzscLyt5gbVWYWTLFvzHqsj6p4Gh9WNdj2cLzfLoipOdNaD5kgHwKtRHijNgU0F8f5kle7ck6mEK3ahu4szWbnnPVWhSO0U9bi3dgW3XVfzGrlunFFgYXflyJ97KmVDVLhbvA9dc53RCklfxHRhrPZpS8WpoTZ9KA116ZWDMNrMij0QwXBlYOYwB5SKDLskyKkPN04q5k+uoSUtQxtdIWt9UvfnxQK0Cb5E4QbV2YlnJOCLG7Xk0f6qCO2g9KfDoJdLkkmiseKK+yDRF2Nz5mJDrPPFxei8Xsnkc4r2iW5KnZtn1Xp3Ph+NM1f

## HTTP Responses

An HTTP response is what the server sends back after it receives a request. A response is made up of a **status code**, a **status message**, and **headers**. Optionally, some responses also contain a **body**. Status codes in the **200+** range usually indicate that everything went fine. **300+** codes usually mean there's a redirect, **400+** codes mean there's an error with the client's request, and **500+** codes mean that there is an issue with the server.

**Common Status Codes**

* **200** - The request was successful
* **301** - Moved Permanently. The resource you requested has moved to a new URL
* **400** - Bad Request. The server couldn't figure out what you wanted
* **401** - The client needs to authenticate before accessing that resource
* **404** - Whatever the client tried to access doesn't exist
* **500** - Some internal server error occurred
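The ranges described above can be sketched with nothing but the standard library. `status_class` is a hypothetical helper written for this example; `http.HTTPStatus` is the standard library's table of status codes and their messages.

```python
from http import HTTPStatus

def status_class(code: int) -> str:
    """Return a rough category for an HTTP status code based on its first digit."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return categories.get(code // 100, "unknown")

for code in (200, 301, 400, 401, 404, 500):
    # HTTPStatus(code).phrase is the standard status message for that code
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```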

**Common Headers**

* **Content-Type** - Indicates the media type of the response, e.g. text/html
* **Content-Length** - Indicates how long the body is

## Inspecting the response object

Earlier we made a GET request to https://www.allrecipes.com/recipes/88/bbq-grilling. Let's inspect the response more thoroughly.

In [25]:
print(dir(resp))

['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']


In [33]:
print(f'The Status Code of the response is {resp.status_code}', end='\n\n')
print(f'The response headers are {list(resp.headers.keys())}', end='\n\n')
print(f"The Content-Type is {resp.headers['Content-Type']}", end='\n\n')

The Status Code of the response is 200

The response headers are ['Cache-Control', 'Transfer-Encoding', 'Content-Type', 'Content-Encoding', 'Vary', 'Set-Cookie', 'CorrelationId', 'Arr-Disable-Session-Affinity', 'Access-Control-Allow-Origin', 'Date', 'X-F5-Node']

The Content-Type is text/html; charset=utf-8
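A Content-Type value like the one above has two parts: the media type and an optional `charset` parameter. As a small sketch, the standard library's `email.message.Message` can split them apart; `parse_content_type` is a hypothetical helper name, and using the email module here is just one convenient option, not something `requests` requires.

```python
from email.message import Message

def parse_content_type(value: str):
    """Split a Content-Type header value into (media_type, charset)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns None when no charset parameter is present
    return msg.get_content_type(), msg.get_content_charset()

print(parse_content_type("text/html; charset=utf-8"))
print(parse_content_type("application/json"))
```

The charset matters because it tells the client how to turn the raw bytes of the body into text.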



In [37]:
# The text attribute contains the HTML from the response. There's a lot, so we're only going to print the first 500 characters
print(resp.text[:500], end='\n\n\n')




<!DOCTYPE html>
<html lang="en-us">
<head>
    <title>BBQ & Grilling Recipes - Allrecipes.com</title>

<script src='https://secureimages.allrecipes.com/assets/deployables/v-1.167.0.5111/karma.bundled.js' async=true></script>


    <!--Make our website baseUrl available to the client-side code-->
    <script type="text/javascript">
        var AR = AR || {};

        AR.segmentWriteKey = "RnmsxUrjIjM7W62olfjKgJrcsVlxe68V";
        AR.baseWebsiteUrl = 'https://www.allrecipes.com




As you can see, we managed to get the actual HTML from the web page.

## Beautiful Soup Basics

Beautiful Soup is a 3rd party package in Python, which means that you'll need to install it before you can use it. From a terminal or command prompt type:

    pip install beautifulsoup4

Beautiful Soup is a library built to easily parse data from HTML, and it's what we'll use to extract data from the response to the request we made earlier.

In [57]:
from bs4 import BeautifulSoup

# Pass the HTML document to Beautiful Soup so you can easily parse it.
# Naming the parser explicitly avoids a warning about Beautiful Soup guessing one.
soup = BeautifulSoup(resp.text, 'html.parser')
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [58]:
# There is a lot we can do with the soup object
print(dir(soup))



If you know a bit about HTML, these tag names will look familiar. Accessing a tag name as an attribute of the soup returns the first occurrence of that element.

In [59]:
print(soup.title)
print(soup.a)
print(soup.input)
print(soup.option)

<title>BBQ &amp; Grilling Recipes - Allrecipes.com</title>
<a id="top"></a>
<input id="searchText" name="searchText" ng-keypress="isEnterKey($event) &amp;&amp; performSearch()" ng-model="search.keywords" placeholder="Find a recipe" type="text"/>
<option value="">Select location</option>


In [62]:
# As we can see we can get Tag elements from Beautiful Soup
print(type(soup.div))

<class 'bs4.element.Tag'>


In [63]:
# What can we do with these tag elements?
print(dir(soup.div))

['HTML_FORMATTERS', 'XML_FORMATTERS', '__bool__', '__call__', '__class__', '__contains__', '__copy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_strings', '_find_all', '_find_one', '_formatter_for_name', '_is_xml', '_lastRecursiveChild', '_last_descendant', '_should_pretty_print', 'append', 'attrs', 'can_be_empty_element', 'childGenerator', 'children', 'clear', 'contents', 'decode', 'decode_contents', 'decompose', 'descendants', 'encode', 'encode_contents', 'extend', 'extract', 'fetchNextSiblings', 'fetchParents', 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext', 'findAllPre

We'll come back to this list of methods later, but as you can see there are a lot of methods we can use to get information out of a Tag object.

## Searching for elements using find() and find_all()

As you might expect, ``find_all()`` will find every occurrence of an element on the page, whereas ``find()`` will return only the first occurrence.
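A small hand-written document makes the difference easy to see before trying it on a full page. This is only a sketch: the HTML and the names `soup_demo`, `demo_links`, and `hrefs` are made up for the example, and it uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document standing in for a real page.
html = """
<html><body>
  <a id="top"></a>
  <a class="nav" href="/recipes/">Recipes</a>
  <a class="nav" href="/about/">About</a>
  <p>First paragraph</p>
  <p>Second paragraph</p>
</body></html>
"""

soup_demo = BeautifulSoup(html, "html.parser")

first_p = soup_demo.find("p")          # first match, or None if nothing matches
demo_links = soup_demo.find_all("a")   # list-like collection of every match

print(first_p.get_text())
# Tag.get() returns None when the attribute is missing instead of raising KeyError
hrefs = [link.get("href") for link in demo_links]
print(hrefs)
# find() on a tag that is not in the document gives None
print(soup_demo.find("table"))
```

Because `find()` can return `None`, it is worth checking the result before calling methods on it.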

In [68]:
all_links = soup.find_all('a')
print(f'There are {len(all_links)} links on the page', end='\n\n')

# The first 10 links on the page
print(all_links[:10])

There are 328 links on the page

[<a id="top"></a>, <a class="skip-to-content" href="#main-content">Skip to main content</a>, <a class="newThisMonth" href="/new-this-month/" rel="nofollow">New&lt;&gt; this month</a>, <a aria-label="Pinterest" class="pinterest" data-header-link-tracking='{"label": "Social &gt; Pinterest"}' href="http://pinterest.com/allrecipes/" target="_blank" title="Pinterest"><span class="svg-icon--social--pinterest svg-icon--social--pinterest-dims"></span></a>, <a aria-label="Facebook" class="facebook" data-header-link-tracking='{"label": "Social &gt; Facebook"}' href="https://www.facebook.com/allrecipes" target="_blank" title="Facebook"><span class="svg-icon--social--facebook svg-icon--social--facebook-dims"></span></a>, <a aria-label="Instagram" class="instagram" data-header-link-tracking='{"label": "Social &gt; Instagram"}' href="http://instagram.com/allrecipes" target="_blank" title="Instagram"><span class="svg-icon--social--instagram svg-icon--social--instagram-dims"

In [69]:
soup.find('p')

<p data-ellipsis="" data-ng-cloak="">Spice up your grilling rotation with some fiery Mexican-style BBQ. These recipes get their smoky, char-flavored kicks from the flames.</p>

## Filtering find_all with a function

Based on the Beautiful Soup [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function), it's possible to filter searches using a function that takes a single argument (an HTML element). The function can be as complicated as you want, but it should return True if the element should be included in the results and False otherwise.

In [79]:
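def find_all_hidden(element):
    # Note: Tag.hidden is an internal Beautiful Soup flag, not the HTML `hidden`
    # attribute; element.has_attr('hidden') would check the attribute itself
    return element.hidden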

p = soup.find('p')

p.hidden
# [item for item in dir(p) if 'p' in item]
p.name

'p'

In [73]:
# The result is empty: no element on the page has its .hidden flag set
soup.find_all(find_all_hidden)

[]
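If the goal is to find elements that carry the HTML `hidden` attribute, one option is to test for the attribute with `has_attr()`. The sketch below runs on made-up HTML rather than the recipe page, and `has_hidden_attribute` and `soup_demo` are names invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up markup: one element with the `hidden` attribute, two without it.
html = """
<div>
  <p hidden>Secret note</p>
  <p>Visible note</p>
  <input type="hidden" name="token" value="abc">
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

def has_hidden_attribute(element):
    # True only for tags that carry the HTML `hidden` attribute
    return element.has_attr("hidden")

hidden_elements = soup_demo.find_all(has_hidden_attribute)
print(hidden_elements)
```

Note that the `<input type="hidden">` element is not matched, because `hidden` is the value of its `type` attribute rather than an attribute of its own.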

In [87]:
# Another example: keep only elements whose parent is a div
from bs4.element import Tag

def is_parent_a_div(element):
    return isinstance(element.parent, Tag) and element.parent.name == 'div'

elements_with_div_parents = soup.find_all(is_parent_a_div)
len(elements_with_div_parents)

372

## Finding elements by CSS Class and Regular Expressions

In [88]:
import re

title_class = re.compile('title')

elements_with_title_css = soup.find_all(class_=title_class)
len(elements_with_title_css)

print(elements_with_title_css[:5])

[<span class="toggle-similar__title" itemprop="name">
                        Home
                    </span>, <span class="toggle-similar__title" itemprop="name">
                            Recipes
                        </span>, <span class="toggle-similar__title" itemprop="name">
                        BBQ &amp; Grilling
                    </span>, <div class="title-section">
<h2 class="special-font"></h2>
<h1>
<span class="title-section__text title">BBQ &amp; Grilling Recipes</span>
</h1>
<span class="title-section__text subtitle">The best BBQ chicken, pork and BBQ sauces. Hundreds of barbecue and grilling recipes, with tips and tricks from home grillers.</span>
<div class="title-section__follow" data-ng-cloak="" data-ng-controller="ar_controllers_hub_stream" data-ng-init="init('bq', 88, 'BBQ')">
<span class="hub-follow-blurb">Follow to get the latest bbq &amp; grilling recipes, articles and more!</span>
<a class="hub-follow" ng-class="{'highlighted': isFollowing}" ng-click="f
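The results include classes such as `subtitle` and `title-section` because, according to the Beautiful Soup documentation, a compiled pattern is applied with its `search()` method, so it matches anywhere inside a class name. The sketch below reproduces that behaviour with plain `re`; the list of class names is taken from the output above plus `special-font` as a non-match, and `exact_title` is a hypothetical stricter pattern.

```python
import re

title_class = re.compile("title")

# Class names taken from the output above, plus one that should not match
class_names = ["toggle-similar__title", "title-section", "subtitle", "special-font"]

# search() succeeds if the pattern occurs anywhere in the string
matches = [name for name in class_names if title_class.search(name)]
print(matches)

# Anchoring the pattern restricts it to the exact class name "title"
exact_title = re.compile(r"^title$")
exact = [name for name in ["title", "subtitle"] if exact_title.search(name)]
print(exact)
```

If you only want elements whose class is exactly `title`, passing the plain string `class_='title'` to `find_all` is usually simpler than a regular expression.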

## Parsing Recipe Data

We've seen how easy the requests library makes it to send HTTP requests in Python. In conjunction with Beautiful Soup, we can quickly parse values out of the HTML returned from a request. In the next section we'll build a model for capturing data about each recipe.

Go to your browser, type in the URL that we fetched with the requests library earlier, and take a look at the page. There seems to be some structure to the way the recipes are laid out.

``Right click`` the page, and click ``inspect``.

Taking a look at the HTML for the page will help us figure out what we need to search for when trying to parse the HTML.
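As a rough idea of where this is heading, here is a sketch of a small recipe model and a parser for it. Everything about the markup is invented: the `recipe-card` class, the tag layout, and the two sample recipes are assumptions for illustration, so the real page will need different selectors found through the inspector.

```python
from dataclasses import dataclass
from bs4 import BeautifulSoup

@dataclass
class Recipe:
    title: str
    url: str

# Invented markup: the real page uses different tags and class names.
html = """
<article class="recipe-card">
  <h3 class="recipe-card__title"><a href="/recipe/1/ribs/">Smoky Ribs</a></h3>
</article>
<article class="recipe-card">
  <h3 class="recipe-card__title"><a href="/recipe/2/corn/">Grilled Corn</a></h3>
</article>
"""

def parse_recipes(markup: str) -> list:
    """Turn each (hypothetical) recipe card into a Recipe object."""
    soup_demo = BeautifulSoup(markup, "html.parser")
    recipes = []
    for card in soup_demo.find_all("article", class_="recipe-card"):
        link = card.find("a")
        if link is None or not link.has_attr("href"):
            continue  # skip cards that do not match the expected shape
        recipes.append(Recipe(title=link.get_text(strip=True), url=link["href"]))
    return recipes

recipes = parse_recipes(html)
print(recipes)
```

Keeping the parsing in one function means that when the site's HTML changes, only that function needs to be updated.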

## Closing Notes on Web Scraping

Web scraping is pretty simple in theory: a website is written in HTML, and once you have that HTML you utilise its built-in structure to extract only the information that you need. In practice it's much more complicated. You might be able to build a script that can scrape one website, but what happens when the structure of the HTML changes? Suddenly your script is obsolete, and you might have to start from scratch. What about two different sites? Chances are the first script won't work for the second site.

On top of that, a lot of websites and web applications are built with a minimal amount of HTML these days. Modern web frameworks utilise JavaScript to render pages; data is loaded asynchronously, which means a lot of information is not in the HTML initially sent to the browser, or more data is loaded only when a user scrolls to a certain point on the page. The methods we showed today only see the HTML the server returns, so they might not always work on such pages. Finally, you also need to consider that some websites actively try to prevent people from scraping their data, and it's often not hard for a site to figure out who's a bot and who's not.

If you thought what we covered in today's class was interesting, then I highly encourage you to keep at it. I would also add that it's important to be mindful and not overload any one site with too many requests to their server at once. If you are scraping multiple pages of the same site, just take it slow.
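One simple way to take it slow is to pause between requests. The sketch below is a minimal version of that idea: `fetch_politely` is a hypothetical helper, the one-second default delay is an arbitrary assumption rather than a recommended value, and the stand-in `fetch` function is there only so the example runs without network access.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Stand-in fetch function so the sketch runs without network access
pages = fetch_politely(
    ["/page/1", "/page/2"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.1,
)
print(pages)
```

With the requests library, the `fetch` argument could be something like `lambda url: requests.get(url, timeout=10).text`.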

There are other cool tools for browser automation that we didn't have time to cover today (and won't have time to cover in this course), but if you're interested I'd highly recommend checking out a library called [Selenium](https://selenium-python.readthedocs.io/).

## Additional Resources

* [An overview of HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview) - Overall a ton of great info if you want to dig a bit deeper into how the web works
* [list of HTTP Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
* [requests library](https://realpython.com/python-requests/)
* [Beautiful Soup tutorial](https://www.youtube.com/watch?v=87Gx3U0BDlo)
* [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Web Scraping Introduction, Best Practices, and Caveats](https://medium.com/velotio-perspectives/web-scraping-introduction-best-practices-caveats-9cbf4acc8d0f)