# Recap Last Week

We covered:
* First class Functions
* Functions within Functions
* Decorators
* Context Managers

# This Week

Interacting with the web:

* A Primer on HTTP
* The requests module in Python
* Getting data from the web
* Basics of web scraping with Beautiful Soup



## Basics of HTTP

HTTP is foundational to any exchange of data on the web. HTTP is a **client-server protocol**, which means that clients (often a web browser) must initiate the interaction. Once the server has received a message from the client, it can choose whether or not to respond, and what information to return. The messages a client sends are typically called **requests**, and there are several different types of requests that a client can make. Whatever the server sends back is known as the **response**.

## HTTP Request Methods

Again, when dealing with web interactions the client is **always** the side that makes the **request**. At a minimum, each request contains a **method**, informing the server what action the client wants to perform, and a **URL**, which specifies the path to the resource. HTTP requests typically also contain **headers**, providing the server with additional information about the request, and some HTTP methods carry a **body** because they send data to the server.
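
To make those four parts concrete, here is a minimal sketch that builds a request without sending it. It uses the standard library's `urllib.request.Request` instead of `requests`, only so that it runs with nothing installed. The URL and the form data are made-up placeholders.

```python
from urllib.request import Request

# Build (but do not send) a POST request to a placeholder URL
req = Request(
    url="https://example.com/recipes",
    data=b"title=Grilled+Chicken",  # the body: form data encoded as bytes
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)

# Collect the four parts of the request described above
request_parts = {
    "method": req.get_method(),
    "url": req.full_url,
    "headers": dict(req.header_items()),
    "body": req.data,
}

for part, value in request_parts.items():
    print(f"{part}: {value}")
```

A GET request built the same way would normally leave `data` out, since GET requests typically have no body.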

**A list of common HTTP methods**

* **GET** - Ask the server for a resource: an HTML document, CSS stylesheet, JavaScript file, JSON data, etc.

* **POST** - Used to send data to the server, typically form data from an HTML page. Often creates new resources on the server

* **PUT** - Also used to send data to the server, but is often used to update existing resources instead of creating new ones

* **DELETE** - As the name suggests, this method is used to delete resources from the server



### Requests in Python

There are several libraries that you can use to make HTTP requests in Python, but one of the easiest to use is ``requests``.

It's a third party package, so you'll need to install it in order to use it.

    pip install requests

In [None]:
import requests

To make an HTTP GET request, we can use the ``get`` function from the requests module.

In [None]:
url = "https://www.allrecipes.com/recipes/88/bbq-grilling"

resp = requests.get(url)
# The response object has a reference to the request that we just sent
req = resp.request

In [None]:
# Take a look at the request object, and what we can call on it
print(dir(req))

## Inspecting the request object

In [None]:
print(f'The request method was {req.method}', end='\n\n')
print(f'The request url was {req.url}', end='\n\n')
print(f'The request headers were {req.headers}', end='\n\n')
print(f'The request body was {req.body}', end='\n\n')

## HTTP Responses

An HTTP response is what the server sends back after it receives a request. A response is made up of a **status code**, a **status message**, and **headers**. Optionally, some responses will contain a **body**. Status codes in the **200+** range usually indicate that everything went fine. **300+** codes usually mean there's a redirect, **400+** codes mean there's an error with the client's request, and **500+** codes mean that there is an issue with the server.

**Common Status Codes**

* **200** - The request was successful
* **301** - What you requested has permanently moved to a different URL
* **400** - Bad Request. The server couldn't figure out what you wanted
* **401** - The client needs to authenticate before accessing that resource
* **404** - Whatever the client tried to access doesn't exist
* **500** - Some internal server error occurred
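
The grouping of status codes by their first digit can be sketched with the standard library's `http.HTTPStatus`, which knows the standard status message for each code. The helper name `status_category` and its category labels are made up for this illustration.

```python
from http import HTTPStatus

def status_category(code):
    """Map a status code to the broad category its first digit indicates."""
    categories = {
        2: "success",
        3: "redirect",
        4: "client error",
        5: "server error",
    }
    return categories.get(code // 100, "other")

for code in (200, 301, 400, 401, 404, 500):
    # HTTPStatus(code).phrase is the standard status message for that code
    print(code, HTTPStatus(code).phrase, "->", status_category(code))
```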

**Common Headers**

* **Content-Type** - Indicates the media type of the response. Ex) text/html
* **Content-Length** - Indicates how long the body is
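
A `Content-Type` value often carries a character set alongside the media type, for example `text/html; charset=UTF-8`. As a small sketch, the standard library's `email.message.Message` can split the two apart. The helper name `parse_content_type` is hypothetical, and the header value is a typical example rather than one taken from a real response.

```python
from email.message import Message

def parse_content_type(header_value):
    """Split a Content-Type header into its media type and charset (if any)."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type(), message.get_content_charset()

media_type, charset = parse_content_type("text/html; charset=UTF-8")
print(media_type, charset)
```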

## Inspecting the response object

Earlier we made a GET request to https://www.allrecipes.com/recipes/88/bbq-grilling. Let's inspect the response more thoroughly.

In [None]:
print(dir(resp))

In [None]:
print(f'The Status Code of the response is {resp.status_code}', end='\n\n')
print(f'The response headers are {list(resp.headers.keys())}', end='\n\n')
print(f"The Content-Type is {resp.headers['Content-Type']}", end='\n\n')

In [None]:
# The text attribute contains the HTML from the response. There's a lot, so we're only going to print the first 500 characters
print(resp.text[:500], end='\n\n\n')

As you can see, we managed to get the actual HTML from the web page.

## Beautiful Soup Basics

Beautiful Soup is a 3rd party package in Python, which means that you'll need to install it before you can use it. From a terminal or command prompt type:

    pip install beautifulsoup4

Beautiful Soup is a library built to easily parse data from HTML, and it's what we'll use to extract data from the response we received earlier.

In [None]:
from bs4 import BeautifulSoup

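# pass the HTML document to Beautiful Soup so you can easily parse it;
# naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(resp.text, 'html.parser')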
print(type(soup))

In [None]:
# There is a lot we can do with the soup object
print(dir(soup))

If you know a bit about HTML, these elements will look familiar. We can quickly get back the first occurrence of each one.

In [None]:
print(soup.title)
print(soup.a)
print(soup.input)
print(soup.option)

In [None]:
# As we can see we can get Tag elements from Beautiful Soup
print(type(soup.div))

In [None]:
# What can we do with these tag elements?
print(dir(soup.div))

We'll come back to this list of methods later, but as you can see there are a lot of methods we can use to get information out of a Tag Element Object
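
As a quick preview, here is a small sketch of the Tag features used most often later on. It parses a made-up one-line snippet rather than the page we downloaded, so the values are easy to check by eye.

```python
from bs4 import BeautifulSoup

# A made-up snippet, not taken from the real page
html = '<a href="/recipes/88" class="nav link">BBQ and Grilling</a>'
link = BeautifulSoup(html, "html.parser").a

tag_name = link.name         # the element type, here "a"
href = link["href"]          # attributes can be looked up like dictionary keys
classes = link["class"]      # class is multi-valued, so it comes back as a list
missing = link.get("title")  # .get() returns None instead of raising KeyError
label = link.text            # the text inside the tag

print(tag_name, href, classes, missing, label)
```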

## Searching for elements using find() and find_all()

As you might expect, ``find_all()`` will find every occurrence of an element on the page, whereas ``find()`` will return only the first occurrence.

In [None]:
all_links = soup.find_all('a')
print(f'There are {len(all_links)} links on the page', end='\n\n')

# The First 10 links on the page
print(all_links[:10])

In [None]:
soup.find('p')
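
The difference between the two methods is easiest to see on a tiny document. The sketch below uses a made-up list instead of the downloaded page. It also shows what each method returns when nothing matches: `find()` returns `None`, while `find_all()` returns an empty list.

```python
from bs4 import BeautifulSoup

# A made-up snippet, not taken from the real page
html = """
<ul>
  <li class="item">Burgers</li>
  <li class="item">Ribs</li>
  <li class="item special">Grilled Corn</li>
</ul>
"""
snippet = BeautifulSoup(html, "html.parser")

first_item = snippet.find("li")          # only the first <li>
all_items = snippet.find_all("li")       # every <li> on the page
missing = snippet.find("table")          # no match: None
missing_all = snippet.find_all("table")  # no match: empty list

print(first_item.text, len(all_items), missing, len(missing_all))
```

Because `find()` can return `None`, chaining straight onto it (for example `soup.find('table').text`) raises an `AttributeError` when the element is absent.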

## Filtering find_all with a function

Based on the Beautiful Soup [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function), it's possible to filter searches using a function that takes a single argument (an HTML element). The function can be as complicated as you want, but it should return True or False to indicate whether that element belongs in the results.

In [None]:
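def find_all_hidden(element):
    # Tag.hidden is an internal Beautiful Soup flag, not the HTML attribute,
    # so check for the HTML ``hidden`` attribute explicitly
    return element.has_attr('hidden')

p = soup.find('p')

# [item for item in dir(p) if 'p' in item]
print(find_all_hidden(p))
print(p.name)

In [None]:
# Returns every element that carries the HTML hidden attribute (possibly an empty list)
soup.find_all(find_all_hidden)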

In [None]:
# Another example: match elements whose parent is a div
from bs4.element import Tag

def is_parent_a_div(element):
    return isinstance(element.parent, Tag) and element.parent.name == 'div'

elements_with_div_parents = soup.find_all(is_parent_a_div)
len(elements_with_div_parents)
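
To see a filter function at work on something small enough to verify by eye, here is a sketch on a made-up snippet. The filter, modelled on the example in the Beautiful Soup documentation, keeps elements that have a ``class`` attribute but no ``id``.

```python
from bs4 import BeautifulSoup

# A made-up snippet, not taken from the real page
html = """
<div id="menu" class="panel">Menu</div>
<p class="intro">Welcome</p>
<p>No attributes here</p>
<span class="badge" id="new">New</span>
"""
snippet = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(element):
    """Keep elements that define a class attribute but not an id."""
    return element.has_attr("class") and not element.has_attr("id")

matches = snippet.find_all(has_class_but_no_id)
match_names = [element.name for element in matches]

print(match_names, [element.text for element in matches])
```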

## Find elements by CSS Class and Regular Expressions

In [None]:
import re

title_class = re.compile('title')

elements_with_title_css = soup.find_all(class_=title_class)
len(elements_with_title_css)

print(elements_with_title_css[:5])
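
The same idea on a made-up snippet: a compiled regular expression passed as ``class_`` matches any element with a class that contains the pattern, while a plain string matches one class name exactly, even when the element has several classes.

```python
import re
from bs4 import BeautifulSoup

# A made-up snippet, not taken from the real page
html = """
<h1 class="page-title">Grilling</h1>
<span class="card__title-link featured">Ribs</span>
<p class="description">Slow cooked</p>
"""
snippet = BeautifulSoup(html, "html.parser")

# Regular expression: any class containing "title"
title_like = snippet.find_all(class_=re.compile("title"))
texts = [element.text for element in title_like]

# Plain string: matches one of the classes on a multi-class element
exact = snippet.find_all(class_="featured")

print(texts, len(exact))
```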

## Parsing Recipe Data

We've seen how easy the requests library makes it to send HTTP requests in Python. In conjunction with Beautiful Soup, we can quickly parse values out of the HTML returned by a request. In the next section we'll build a model for capturing data about each recipe.

Go to your browser, type in the URL that we fetched with the requests library earlier, and take a look at the page. There seems to be some structure to the way the recipes are laid out.

``Right click`` the page, and click ``inspect``

Taking a look at the HTML for the page will help us figure out what we need to search for when trying to parse the HTML.

**Here's what the HTML for one recipe looks like**

    <article class="fixed-recipe-card">
        <a ng-class="{ highlighted : saved }" title="Save this recipe" data-ng-show="showHeart" class="favorite ng-isolate-scope" data-id="244552" data-type="'Recipe'" data-name="&quot;Red, White, and Blueberry Grilled Chicken&quot;" data-segmentpageproperties="segmentContentInfo" data-imageurl="'https://images.media-allrecipes.com/userphotos/300x300/2410310.jpg'"><span class="ng-binding"></span></a>
        <div class="grid-card-image-container">
            <a href="https://www.allrecipes.com/recipe/244552/red-white-and-blueberry-grilled-chicken/?internalSource=rotd&amp;referringId=88&amp;referringContentType=Recipe Hub" data-content-provider-id="" data-internal-referrer-link="rotd" class="ng-isolate-scope" target="_self">

                <img class="fixed-recipe-card__img ng-isolate-scope" data-lazy-load="" data-original-src="https://images.media-allrecipes.com/userphotos/300x300/2410310.jpg" alt="Red, White, and Blueberry Grilled Chicken Recipe and Video - Chef John's spicy chili-rubbed chicken breasts are grilled and topped with a tangy blueberry sauce. It's a perfect dish for a summer barbeque." title="Red, White, and Blueberry Grilled Chicken Recipe and Video" src="https://images.media-allrecipes.com/userphotos/300x300/2410310.jpg" style="display: inline;">
            </a>
                <a href="https://www.allrecipes.com/video/5182/red-white-and-blueberry-grilled-chicken/?internalSource=rotd&amp;referringId=88&amp;referringContentType=Recipe Hub" data-content-provider-id="" data-internal-referrer-link="rotd" class="ng-isolate-scope" target="_self">
                    <span class="watchButton">
                        <span class="watchButton__text">WATCH</span>
                    </span>
                </a>
        </div>
        <div class="fixed-recipe-card__info">
                <h4 class="fixed-recipe-card__rotd">
                    Recipe of the Day
                </h4>
            <h3 class="fixed-recipe-card__h3">
                <a href="https://www.allrecipes.com/recipe/244552/red-white-and-blueberry-grilled-chicken/?internalSource=rotd&amp;referringId=88&amp;referringContentType=Recipe Hub" data-content-provider-id="" data-internal-referrer-link="rotd" class="fixed-recipe-card__title-link ng-isolate-scope" target="_self">
                    <span class="fixed-recipe-card__title-link">Red, White, and Blueberry Grilled Chicken</span>
                </a>
            </h3>
            <a href="https://www.allrecipes.com/recipe/244552/red-white-and-blueberry-grilled-chicken/?internalSource=rotd&amp;referringId=88&amp;referringContentType=Recipe Hub" data-content-provider-id="" data-internal-referrer-link="rotd" class="ng-isolate-scope" target="_self">
                <div class="fixed-recipe-card__ratings">

    <span class="stars stars-4-5" onclick="AnchorScroll('reviews')" data-ratingstars="4.69000005722046" aria-label="Rated 4.69 out of 5 stars"></span>

                    <span class="fixed-recipe-card__reviews">43</span>
                        <div data-merch-type="Ads_LogoScroller_122x34">
                            <span id="ad-rotd"></span>
                        </div>
                </div>
                <div data-ellipsis="" class="fixed-recipe-card__description ng-isolate-scope">Chef John's spicy chili-rubbed chicken breasts are grilled and topped with a tangy blueberry…</div>
            </a>
            <div class="fixed-recipe-card__profile">
                        <a href="https://www.allrecipes.com/cook/foodwisheswithchefjohn/?internalSource=rotd&amp;referringId=88&amp;referringContentType=Recipe Hub" data-content-provider-id="" data-internal-referrer-link="rotd" class="ng-isolate-scope" target="_self">
                            <ul class="cook-submitter-info">
                                <li>
                                    <img class="cook-img" alt="profile image" src="https://images.media-allrecipes.com/userphotos/50x50/2267470.jpg">
                                </li>
                                <li>
                                    <h4><span>By</span> Chef John</h4>
                                </li>
                            </ul>
                        </a>
            </div>
        </div>
    </article>


From taking a look at the HTML we can easily identify where some useful information is:

* Each recipe is contained in an ``<article>`` tag with a class of ``fixed-recipe-card``
* Titles can be found in a ``<span>`` tag with a class of ``fixed-recipe-card__title-link``

In [None]:
recepies = soup.find_all('article', class_='fixed-recipe-card')

# check out how many recipes we were able to get
len(recepies)

recepie_1 = recepies[0]

In [None]:
def recepie_title(recepie_html):
    """returns the title of a recipe"""
    return recepie_html.find('span', class_='fixed-recipe-card__title-link').text

recepie_title(recepie_1)

In [None]:
def recepie_description(recepie_html):
    """
    description of the recipe can be found as an ``alt`` attribute of the image or
    in a div tag
    """
    # grab the text from the description div
    description_1 = recepie_html.find('div', class_='fixed-recipe-card__description').text
    # grab the alt text from the recipe image
    description_2 = recepie_html.find('img', class_='fixed-recipe-card__img').attrs['alt']
    # we want the longest description we can get
    return max(description_1, description_2, key=len)

recepie_description(recepie_1)

In [None]:
stars = re.compile('stars')
def recepie_rating(recepie_html):
    """
    return the rating for a recipe
    """
    element = recepie_html.find('span', class_=stars)
    rating = float(element.attrs['data-ratingstars'])
    return round(rating, 2)

recepie_rating(recepie_1)

In [None]:
def recepie_reviews(recepie_html):
    """
    return the number of reviews for a recipe
    """
    review_element = recepie_html.find('span', class_="fixed-recipe-card__reviews")
    reviews = next(review_element.children).attrs['number']
    return int(reviews)

recepie_reviews(recepie_1)

In [None]:
# link to original recipe
def recepie_link(recepie_html):
    """
    The link to the recipe can be found by the title
    """
    title_element = recepie_html.find('h3', class_="fixed-recipe-card__h3")
    link = title_element.find('a').attrs['href']
    return link

recepie_link(recepie_1)

In [None]:
# chef's name

def recepie_chef(recepie_html):
    ul_element = recepie_html.find('ul', class_="cook-submitter-info")
    return ul_element.find('h4').text

recepie_chef(recepie_1)

In [None]:
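from functools import wraps

def error_handler(f):
    """
    decorator that will catch any exception raised when calling our functions,
    and return an empty string by default
    """
    @wraps(f)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            return ''
    return wrapper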
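
One possible variation on the decorator above, sketched with hypothetical names (`with_default`, `parse_review_count`): let the caller choose both the fallback value and which exceptions count as "data missing", so that unexpected errors still surface instead of being silently replaced.

```python
from functools import wraps

def with_default(default, exceptions=(Exception,)):
    """Build a decorator that returns `default` when one of `exceptions` is raised."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exceptions:
                return default
        return wrapper
    return decorator

@with_default(default=0, exceptions=(KeyError, ValueError))
def parse_review_count(attributes):
    """Read a review count from a dictionary of attributes."""
    return int(attributes["number"])

print(parse_review_count({"number": "43"}))   # value present and numeric
print(parse_review_count({}))                 # key missing: falls back to 0
print(parse_review_count({"number": "n/a"}))  # not a number: falls back to 0
```

A numeric fallback such as `0` keeps the column type consistent, whereas an empty string mixes strings and numbers in the same field.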

### Putting it all together

In [None]:
recepie_list = []

for recepie in recepies:
    recepie_data = {}
    # remember that weird syntax of decorating on the fly     
    recepie_data['title'] = error_handler(recepie_title)(recepie)
    recepie_data['description'] = error_handler(recepie_description)(recepie)
    recepie_data['rating'] = error_handler(recepie_rating)(recepie)
    recepie_data['reviews'] = error_handler(recepie_reviews)(recepie)
    recepie_data['link'] = error_handler(recepie_link)(recepie)
    recepie_data['chef'] = error_handler(recepie_chef)(recepie)
    
    recepie_list.append(recepie_data)

for recepie in recepie_list:
    # We have all the data; we're just choosing to print only the rating and title
    print(f"{recepie['rating']} -- {recepie['title']}")

Now that we have the data, we can do pretty much anything with it. We could store it in a CSV file, for example, and process it later. Assuming that all of allrecipes.com's pages are set up in a similar fashion, we could reproduce our scraping for other links.
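
As a minimal sketch of the CSV idea, the standard library's `csv.DictWriter` can write a list of dictionaries directly. The rows below are made-up placeholders shaped like the dictionaries built in the loop, and the helper name `recipes_to_csv` is hypothetical. It writes to an in-memory buffer so the result can be printed.

```python
import csv
import io

# Placeholder rows shaped like the scraped dictionaries (values are made up)
sample_recipes = [
    {"title": "Example Burger", "rating": 4.5, "reviews": 10, "chef": "Chef A"},
    {"title": "Example Ribs", "rating": 4.0, "reviews": 3, "chef": "Chef B"},
]

def recipes_to_csv(rows, fieldnames):
    """Serialise a list of dictionaries to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = recipes_to_csv(sample_recipes, ["title", "rating", "reviews", "chef"])
print(csv_text)
```

To write to disk instead, the same writer can be given a file opened with `open('recipes.csv', 'w', newline='')`.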

## Closing Notes on Web Scraping

Web scraping in theory is pretty simple: you have some website that's written in HTML, and once you have that HTML you utilise its built-in structure to extract only the information that you need. In practice it's much more complicated. You might be able to build a script that can scrape one website, but what happens when the structure of the HTML changes? Suddenly, your script is obsolete and you might have to start from scratch. What about two different sites? Chances are the first script won't work for the second site. On top of that, a lot of websites and web applications are built with a minimal amount of HTML these days. Modern web frameworks utilise JavaScript to render pages; data is loaded asynchronously, which means a lot of information is not in the HTML initially sent to the browser; or more data is loaded when a user scrolls to a certain point on the page. If you're using the methods we showed today, they might not always work. On top of all this, you also need to consider that some websites will actively try to prevent people from scraping their data, and it's often not hard for them to figure out who's a bot and who's not.

If you thought what we covered in today's class was interesting, then I highly encourage you to keep at it. I would also add that it's important to be mindful, and not overload any one site with too many requests to their server at once. If you are scraping multiple pages of the same site, just take it slow.
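
"Taking it slow" can be as simple as pausing between requests. The sketch below uses hypothetical names (`fetch_politely`, `fake_fetch`) and a stand-in fetch function so that it runs without touching the network. In real use, a function such as `requests.get` could be passed in instead, and the one-second default delay is an arbitrary assumption.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request after the first
        results.append(fetch(url))
    return results

def fake_fetch(url):
    """Stand-in for a real download function such as requests.get."""
    return f"fetched {url}"

pages = fetch_politely(
    ["https://example.com/page/1", "https://example.com/page/2"],
    fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```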

There are other cool tools for browser automation that we didn't have time to cover today (and won't have time to cover in this course), but if you're interested I'd highly recommend checking out a library called [selenium](https://selenium-python.readthedocs.io/).

## Additional Resources

* [An overview of HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview) Overall a ton of great info if you want to dig a bit deeper on how the web works
* [list of HTTP Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
* [requests library](https://realpython.com/python-requests/)
* [Beautiful Soup tutorial](https://www.youtube.com/watch?v=87Gx3U0BDlo)
* [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Web Scraping Introduction, Best Practices, and Caveats](https://medium.com/velotio-perspectives/web-scraping-introduction-best-practices-caveats-9cbf4acc8d0f)