# BeautifulSoup Webscraping Tutorial

#### Table of Contents

1. <a href='##Introduction'>The Basics of Webscraping With BeautifulSoup</a> <br>
> 1.a <a href='##urllib'>The urllib package</a> <br>
> 1.b <a href='##beautiful_soup'>The Beautiful Soup package</a> <br>
2. <a href='##example'>Webscraping Example</a> <br>
> 2.a <a href='##packages'>Necessary Packages</a> <br>
> 2.b <a href='##urllib_example'>urllib to grab the website</a> <br>
> 2.c <a href='##bs4_object'>Create a bs4 object</a> <br>
3. <a href='##bs4_parse'>Parsing a Beautiful Soup Object</a> <br>
> 3.a <a href='##bs4_html'>Creating the right HTML objects to parse</a> <br>
> 3.b <a href='##bs4_find_html'>Creating nested HTML objects using the **find( )** and **find_all( )** methodology</a> <br>
> 3.c <a href='##pulling_tags'>Parsing the individual movies and pulling the data</a> <br>
> 3.d <a href='##storing_data'>Storing the data and exporting to CSV</a> <br>

<a id='#Introduction'></a>

### The Basics of Webscraping with BeautifulSoup

Beautiful Soup is a lightweight, flexible Python library designed for parsing HTML documents. It provides features such as HTML tree traversal and automatic encoding conversion (incoming documents to Unicode and outgoing documents to UTF-8) (1), which let the user focus on parsing documents and extracting information. Beautiful Soup itself does not collect webpages; this is left to the urllib package in Python, which we'll discuss next. Instead, Beautiful Soup parses the HTML code, meaning that Beautiful Soup and urllib work hand in hand.

We'll be working in Python 3 for this tutorial, although older releases of Beautiful Soup also run on Python 2 with some simple adjustments to your code. Let's take a look at an example and see if we can get you up and running with parsing your own web pages!

(1) https://www.crummy.com/software/BeautifulSoup/

<a id='#urllib'></a>

#### The urllib package

The urllib package enables you to access websites from your Python console. With just a few lines of code you can write a program for grabbing objects and saving data from webpages. What's also great about urllib is that it is part of the standard Python library, so it does not need to be installed separately (it ships with Python itself). Below are some links to urllib documentation and simple example code:

**Tutorial**:    https://pythonprogramming.net/urllib-tutorial-python-3/ <br>
**Documentation**: https://docs.python.org/3/library/urllib.html
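As a minimal sketch of the open-and-read pattern, the snippet below points `urllib.request.urlopen` at a `data:` URL, which it can read without touching the network. The sample HTML string and the variable names are illustrative assumptions, not part of the scrape later in this tutorial; an `https://` address is opened in the same way.

```python
import urllib.request
from urllib.parse import quote

# Illustrative HTML, embedded in a data: URL so the example needs no network access
sample_html = '<html><body><h1>Hello, urllib</h1></body></html>'
sample_url = 'data:text/html;charset=utf-8,' + quote(sample_html)

# urlopen returns a response object; read() yields the raw bytes of the body
with urllib.request.urlopen(sample_url) as response:
    raw_bytes = response.read()

# The body arrives as bytes, so decode it before treating it as text
page_text = raw_bytes.decode('utf-8')
print(page_text)
```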

<a id='#beautiful_soup'></a>

#### The Beautiful Soup Package

Once we've used urllib to grab an HTML document, we can deploy Beautiful Soup. Beautiful Soup has built-in methods to parse the HTML tree, grab tags and encode incoming/outgoing documents. There have been several iterations of the Beautiful Soup package, the latest of which is Beautiful Soup 4, imported as **bs4**. Below are some relevant links for installing and reading about Beautiful Soup:

**Documentation**: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ <br>
**Installation**: http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python
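To give a feel for the package before the full example, here is a small sketch that parses a hand-written HTML string. The markup and the variable name `soup_demo` are made up for illustration, and the built-in `'html.parser'` is chosen only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the tag contents are made up for this sketch
sample_html = (
    '<html><head><title>Movie Night</title></head>'
    '<body><p class="pick">Tenet</p></body></html>'
)

# 'html.parser' ships with Python, so no extra parser needs to be installed
soup_demo = BeautifulSoup(sample_html, 'html.parser')

print(soup_demo.title.get_text())        # text inside the <title> tag
print(soup_demo.find('p').get('class'))  # class is multi-valued, so it comes back as a list
```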

<a id='#example'></a>

#### Let's start a webscraping example

The first thing we need to do is import the relevant packages for our project. These packages are the following: <br>
* **re** - the regular expressions package for matching patterns in strings <br>
* **json** - used to pretty-print the dictionary of scraped data <br>
* **pandas** - a fundamental package for formatting and evaluating data <br>
* **urllib.request** - HTTP requests are handled by the 'request' module of the urllib package <br>
* **bs4** - we will create an object for parsing using the BeautifulSoup class of the bs4 package <br>

<a id='#packages'></a>

In [119]:
import re
import json
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

<a id='#urllib_example'></a>

#### Use urllib to grab the website object

In [3]:
#Declare the website as a string. For this example, I've decided to grab the data from rottentomatoes website, parse
#the html and save it in a pandas dataframe which we can export to a csv file.
webpage = 'https://www.rottentomatoes.com/'

#Use a 'with ... as' statement to open the webpage and read the html source code as bytes
with urllib.request.urlopen(webpage) as response:
   html = response.read()
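Requests can fail, and some servers respond differently depending on the `User-Agent` header. The sketch below wraps the fetch in a function with an explicit header and basic error handling. The names `build_request` and `fetch_html`, the header value, and the timeout are all assumptions made for illustration; nothing in this tutorial establishes that Rotten Tomatoes requires them.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent='tutorial-scraper/0.1'):
    """Create a Request carrying an explicit User-Agent header (hypothetical value)."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as err:
        # HTTPError is a subclass of URLError, so HTTP status errors land here too
        print('Request failed:', err)
        return None


# Building the request does not contact the server; only fetch_html() does
request = build_request('https://www.rottentomatoes.com/')
print(request.get_header('User-agent'))
```

Calling `fetch_html(webpage)` would then stand in for the `with` block above, returning `None` instead of raising when the site cannot be reached.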

<a id='#bs4_object'></a>

#### Create a Beautiful Soup object

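We do this by passing the **html** object to the **BeautifulSoup** constructor. Naming the parser explicitly avoids the warning that Beautiful Soup raises when it has to pick one itself.

In [4]:
#'lxml' is a separate install; Python's built-in 'html.parser' can be used instead
soup = BeautifulSoup(html, 'lxml')
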

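**soup.prettify()** returns the parsed HTML as an indented string, one tag per line, which helps us determine how to search for tags as we parse our tree. The cell below references **soup.prettify** without parentheses, so the notebook displays the bound method together with the document's markup rather than the indented string.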

In [5]:
soup.prettify

<bound method Tag.prettify of <!DOCTYPE html>
<html dir="ltr" lang="en" prefix="fb: http://www.facebook.com/2008/fbml og: http://opengraphprotocol.org/schema/" xmlns="http://www.w3.org/1999/xhtml">
<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Rotten Tomatoes: Movies | TV Shows | Movie Trailers | Reviews - Rotten Tomatoes</title>
<meta content="Rotten Tomatoes, home of the Tomatometer, is the most trusted measurement of quality for Movies &amp; TV. The definitive site for Reviews, Trailers, Showtimes, and Tickets" name="description"/>
<link href="https://www.rottentomatoes.com/" rel="canonical"/>
<link href="https://www.rottentomatoes.com/assets/pizza-pie/images/favicon.ico" rel="shortcut icon" sizes="76x76" type="image/x-icon"/>
<meta con
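To make the difference between referencing `prettify` and calling it concrete, here is a small sketch on made-up markup (the `mini_soup` name and the HTML are illustrative only):

```python
from bs4 import BeautifulSoup

mini_soup = BeautifulSoup('<div><p>Hi</p></div>', 'html.parser')

# Without parentheses we only reference the method; nothing is formatted
method_reference = mini_soup.prettify
print(callable(method_reference))

# With parentheses the method runs and returns an indented string
pretty = mini_soup.prettify()
print(pretty)
```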

<a id='#bs4_parse'></a>

#### Beautiful Soup Parsing Methods

The beauty of Beautiful Soup (pun intended) is in its robust, flexible parsing methods. Some of the ones that I have found the most useful are the following (which we will use to parse this webpage):

* **find_all( )** : the 'find_all()' method is one of the most popular and useful methods of Beautiful Soup. We can pass a tag name to this method and it will search **DOWN** the tree to return all matching HTML tags that it finds below the starting point. <br>
* **find( )** : the 'find()' method is like the 'find_all()' method, except that it returns only the first occurrence of the tag that is passed in as a parameter by searching **DOWN** the tree

(Both **find( )** and **find_all( )** search the descendants of a tag, and most other search methods of Beautiful Soup are variations on these two. We'll touch on a few others.)

* **find_parents( )** : the 'find_parents()' method searches **UP** the tree to find all matching HTML tags, as opposed to **find_all( )** which searches down the tree. <br>
* **find_parent( )** : the 'find_parent( )' method is like the 'find_parents( )' method, except that it returns only the first matching tag that is found looking **UP** the HTML tree

* **find_next_siblings( )** : the 'find_next_siblings( )' method searches **ACROSS** the tree to find all matching HTML tags that come later in the document at the same level (siblings) <br>
* **find_next_sibling( )** : the 'find_next_sibling( )' method searches **ACROSS** the tree to find only the first matching sibling that comes later in the document <br>

To see a full list of the methods available for parsing HTML trees in Beautiful Soup, visit the documentation page at

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
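The search methods listed above can be tried side by side on a tiny tree. The markup below is made up for illustration and has nothing to do with the Rotten Tomatoes page; it only shows which direction each method looks in.

```python
from bs4 import BeautifulSoup

# Made-up markup: one list with three sibling items
html_doc = '''
<div id="shelf">
  <ul>
    <li class="movie">Tenet</li>
    <li class="movie">Alone</li>
    <li class="show">Some Show</li>
  </ul>
</div>
'''
tree = BeautifulSoup(html_doc, 'html.parser')

first_item = tree.find('li')                       # first match, searching down
all_items = tree.find_all('li')                    # every match, searching down
shelf = first_item.find_parent('div')              # nearest matching ancestor, searching up
ancestors = first_item.find_parents()              # all ancestors, nearest first
next_item = first_item.find_next_sibling('li')     # next matching sibling at the same level
later_items = first_item.find_next_siblings('li')  # all later matching siblings

print(first_item.get_text(), len(all_items), shelf.get('id'))
print(next_item.get_text(), [li.get_text() for li in later_items])
print([tag.name for tag in ancestors])
```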

In [8]:
#As we examine our HTML doc, we see that all the information we need is nested in 'div' tags whose 'class' 
#attribute is 'ordered-layout__list ordered-layout__list--carousel'. 

#To capture these tags, we pass the 'div' tag name and a dict of attributes as parameters to our 'find_all()' method. 
#The dict maps an attribute name ('class') to the value we want to match. 
#In this instance we use a compiled Regular Expression as the value, so any 'div' whose class matches the pattern is captured.

categories = soup.find_all('div', {'class': re.compile('ordered-layout__list ordered-layout__list--carousel')})
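Because `class` can hold several space-separated names, it helps to see how different kinds of match values behave. The sketch below uses made-up markup and the `class_` keyword, which is an alternative to the `{'class': ...}` dict used above; the three lookups are illustrative, not the only ways to filter by class.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup with multi-valued class attributes
markup = '''
<div class="list list--carousel">A</div>
<div class="list">B</div>
<div class="banner">C</div>
'''
demo = BeautifulSoup(markup, 'html.parser')

# A single class name matches any tag that carries that class
by_one_class = demo.find_all('div', class_='list')

# The full attribute string matches only that exact value, in that order
by_exact_string = demo.find_all('div', class_='list list--carousel')

# A compiled regex is tested against the individual class names
by_regex = demo.find_all('div', class_=re.compile('carousel'))

print([d.get_text() for d in by_one_class])
print([d.get_text() for d in by_exact_string])
print([d.get_text() for d in by_regex])
```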

In [9]:
#Let's now examine the 'categories' object to see what has been stored there
print("Let's look at the length of the 'categories' object")
print('Length: ', len(categories), '\n')
print("Now let's examine the second object (index 1) to see what we need to pull")
print(categories[1].prettify)

Let's look at the length of the 'categories' object
Length:  8 

Now let's examine the second object (index 1) to see what we need to pull
<bound method Tag.prettify of <div class="ordered-layout__list ordered-layout__list--carousel">
<section class="dynamic-poster-list" id="dynamic-poster-list">
<div class="dynamic-poster-list__header-container">
<div>
<h2>Popular in Theaters</h2>
<a class="a--short" href="/browse/in-theaters/">View all</a>
</div>
<h3 class="p">Availability may vary, check your <a href="https://www.rottentomatoes.com/showtimes">local showtimes</a> for details.</h3>
</div>
<tiles-carousel hidden="">
<div class="posters-container" slot="posters-container">
<tile-poster-video videoid="F68956D2-51DB-456F-8CCD-B0EC9C3D4FFB">
<button class="js-show-modal-trailer" data-content-type="movie" data-media-url="/m/tenet" data-mpx-fwsite="rotten_tomatoes_video_vod" data-no-ads="false" data-title="Tenet" data-video-id="F68956D2-51DB-456F-8CCD-B0EC9C3D4FFB" data-video-list="rt-hp-list-posters-po

<a id='#bs4_html'></a>

#### Creating the right HTML objects to parse: Let's grab the headers

We see that the objects we need to pull have an **'h2'** tag which stores the category name on the Rotten Tomatoes website. In this run all 8 category objects contain this **'h2'** tag, as the output below shows; any object without one could simply be discarded.

We can use the **soup_html_object.get_text()** method to extract the text associated with our tags

In [88]:
#For each object in the 'categories' object, print out the movie category title.

for category in categories:
    print(category.find('h2').get_text(),'\n')
    # print(category)
    # print("-----------------Next Category Starts----------------- \n\n")

New & Upcoming Movies 

Popular in Theaters 

New TV This Week 

Hidden Gem Movies on Hulu 

Best Series on Netflix 

Essential Comedies 

Hidden Gem Movies on Prime 

Certified Fresh Picks 
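The loop above assumes every category contains an `h2`. Since `find()` returns `None` when nothing matches, a category without a heading would raise an `AttributeError` on `.get_text()`. A minimal sketch of a guarded version, on made-up markup with hypothetical names (`blocks`, `titles`):

```python
from bs4 import BeautifulSoup

# Made-up markup: the second block has no <h2>
blocks_html = '''
<div class="block"><h2>Popular in Theaters</h2></div>
<div class="block"><p>No heading here</p></div>
'''
blocks = BeautifulSoup(blocks_html, 'html.parser').find_all('div', class_='block')

titles = []
for block in blocks:
    heading = block.find('h2')
    # find() returns None when nothing matches, so check before calling get_text()
    if heading is None:
        continue
    titles.append(heading.get_text(strip=True))

print(titles)
```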



We see from this output and the HTML above that we have __three__ levels that need to be parsed in this HTML object in order to get the necessary data. The structure looks like this:

- Categories (NEW & UPCOMING MOVIES, POPULAR IN THEATERS, etc)   

    ```<div class="ordered-layout__list ordered-layout__list--carousel">```
    - Category (NEW & UPCOMING MOVIES)
    
        ```<h2>New & Upcoming Movies</h2>```
    
        - Movies (The Devil All the Time)
        
          ```<score-icon-critic alignment="left" percentage="64" size="tiny" slot="critic-score" state="fresh"></score-icon-critic>```
          
          ```<span slot="title" class="p--small">The Devil All the Time</span>```
        
We can take a look at each of these elements and see how we have to parse the nested HTML objects

First we'll look at a category:

In [75]:
# Let's look at the category again
categories[1].prettify

<bound method Tag.prettify of <div class="ordered-layout__list ordered-layout__list--carousel">
<section class="dynamic-poster-list" id="dynamic-poster-list">
<div class="dynamic-poster-list__header-container">
<div>
<h2>Popular in Theaters</h2>
<a class="a--short" href="/browse/in-theaters/">View all</a>
</div>
<h3 class="p">Availability may vary, check your <a href="https://www.rottentomatoes.com/showtimes">local showtimes</a> for details.</h3>
</div>
<tiles-carousel hidden="">
<div class="posters-container" slot="posters-container">
<tile-poster-video videoid="F68956D2-51DB-456F-8CCD-B0EC9C3D4FFB">
<button class="js-show-modal-trailer" data-content-type="movie" data-media-url="/m/tenet" data-mpx-fwsite="rotten_tomatoes_video_vod" data-no-ads="false" data-title="Tenet" data-video-id="F68956D2-51DB-456F-8CCD-B0EC9C3D4FFB" data-video-list="rt-hp-list-posters-popular-in-theaters" slot="play">
<tile-poster-image isvideovisible="true" slot="image">
<img class="js-lazyLoad" data-src="https

Now let's see what one movie looks like when we grab the relevant tags

In [86]:
print('The name of the movie is: ', categories[0].find_all('span', {'class': 'p--small'})[0].get_text())
print('The critics score is: ', categories[0].find_all('score-icon-critic')[0].get('percentage'))
print('The average audience score is: ', categories[0].find_all('score-icon-audience')[0].get('percentage'))

The name of the movie is:  The Devil All the Time
The critics score is:  64
The average audience score is:  83
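Note the two different accessors used above: `get_text()` reads the text between a tag's opening and closing markers, while `get('percentage')` reads an attribute on the tag itself. The sketch below repeats this on a trimmed copy of the tags shown earlier; the lookup of `'not-an-attribute'` is a made-up name used only to show what happens when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the movie tags shown above
tile_html = '''
<tile-poster-meta>
  <score-icon-critic percentage="64" state="fresh"></score-icon-critic>
  <span class="p--small" slot="title">The Devil All the Time</span>
</tile-poster-meta>
'''
tile = BeautifulSoup(tile_html, 'html.parser')

title = tile.find('span', {'class': 'p--small'}).get_text()  # text between the tags
score_tag = tile.find('score-icon-critic')
critic_score = score_tag.get('percentage')                   # attribute value, as a string
missing = score_tag.get('not-an-attribute')                  # an absent attribute gives None

print(title, critic_score, missing)
```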


#### Examine Categories

As we examine the objects in **categories**, we see that a category does not directly give us individual movies with their associated data. To grab these in a clear hierarchical fashion, we'll need to create nested objects using the **find( )** and **find_all( )** methods.

In [128]:
for category in categories:
    print('THIS IS CATEGORY: ', category.find('h2').get_text(), '\n\n')
    for movie in category.find_all('tile-poster-video'):
        print(movie)

THIS IS CATEGORY:  New & Upcoming Movies 


<tile-poster-video videoid="D414D8FC-57F4-4BEE-BE06-07E0BAEEC597">
<button class="js-show-modal-trailer" data-content-type="movie" data-media-url="/m/the_devil_all_the_time" data-mpx-fwsite="rotten_tomatoes_video_vod" data-no-ads="false" data-title="The Devil All the Time" data-video-id="D414D8FC-57F4-4BEE-BE06-07E0BAEEC597" data-video-list="rt-hp-list-posters-coming-soon" slot="play">
<tile-poster-image isvideovisible="true" slot="image">
<img class="js-lazyLoad" data-src="https://resizing.flixster.com/cjybVUh-ZFidwdU5FxS6xe4FwpY=/180x257/v2/https://resizing.flixster.com/aAnDGC9xqq_cNk-LDPJf1VwZdEk=/ems.ZW1zLXByZC1hc3NldHMvbW92aWVzLzEzN2ZhN2M3LWEwM2UtNDdmZi1hZjNkLWE2ODIxNjlkODk1OC5qcGc=" onerror="this.onerror=null; this.src='/images/poster_default.gif';" slot="poster" src="/assets/pizza-pie/images/poster_default.c8c896e70c3.gif"/>
<rt-icon-cta-video slot="icon-play"></rt-icon-cta-video>
</tile-poster-image>
</button>
<a class="unset" href="/

<a id='#bs4_find_html'></a>

#### Now we need to extract the data in these categories

We see that by looping over **category.find_all('tile-poster-video')**, each nested object is a movie with all the necessary data for us to evaluate (title, rating, additional data).

In [129]:
for category in categories:
    #Each 'tile-poster-video' tag inside a category holds one movie
    for movie in category.find_all('tile-poster-video'):
        print('Category: ', category.find('h2').get_text())
        print('Movie: ', movie.find('span', {'class': 'p--small'}).get_text())
        print('Critics Rating: ', movie.find('score-icon-critic').get('percentage'))
        print('****Next Movie****\n')

Category:  New & Upcoming Movies
Movie:  The Devil All the Time
Critics Rating:  64
****Next Movie****

Category:  New & Upcoming Movies
Movie:  Antebellum
Critics Rating:  27
****Next Movie****

Category:  New & Upcoming Movies
Movie:  Enola Holmes
Critics Rating:  88
****Next Movie****

Category:  New & Upcoming Movies
Movie:  Misbehaviour
Critics Rating:  89
****Next Movie****

Category:  New & Upcoming Movies
Movie:  Kajillionaire
Critics Rating:  93
****Next Movie****

Category:  New & Upcoming Movies
Movie:  The Trial of the Chicago 7
Critics Rating:  
****Next Movie****

Category:  New & Upcoming Movies
Movie:  The Swerve
Critics Rating:  100
****Next Movie****

Category:  New & Upcoming Movies
Movie:  The Glorias
Critics Rating:  73
****Next Movie****

Category:  New & Upcoming Movies
Movie:  The Boys in the Band
Critics Rating:  
****Next Movie****

Category:  New & Upcoming Movies
Movie:  Charm City Kings
Critics Rating:  88
****Next Movie****

Category:  Popular in Theaters
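Some movies in the output above have a blank critics rating, because the `percentage` attribute is an empty string for them. If the scores are needed as numbers, one option is a small conversion helper. The function name `to_score` and the choice of `None` for blanks are assumptions made for this sketch.

```python
def to_score(raw):
    """Convert a percentage string such as '64' to an int; return None when it is blank or missing."""
    if raw is None:
        return None
    raw = raw.strip()
    # isdigit() is False for the empty string, so blanks fall through to None
    return int(raw) if raw.isdigit() else None


print(to_score('64'), to_score(''), to_score(None))
```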


<a id='#pulling_tags'></a>

#### Pulling the Relevant Tags

Now that we see how to create the right HTML objects, we can go in and use the **find( )** method to extract the relevant data.

We can find the data by parsing the appropriate tags:

**Movie Title:** 

`<span slot="title" class="p--small">The New Mutants</span>` <br>

**Audience Score:** 

`<score-icon-audience alignment="left" percentage="55" size="tiny" slot="audience-score" state="spilled" style="display: none;"></score-icon-audience>` <br>

**Critics Score:** 

`<score-icon-critic alignment="left" percentage="34" size="tiny" slot="critic-score" state="rotten"></score-icon-critic>`

#### All of these are pieces of this **movie** HTML object:

**Entire Movie Object**

`<tile-poster-video videoid="E4036FAA-9495-4BA9-85A1-FAFA8F78987E">`
`<button class="js-show-modal-trailer" data-content-type="movie" data-media-url="/m/the_new_mutants" data-mpx-fwsite="rotten_tomatoes_video_vod" data-no-ads="false" data-title="The New Mutants" data-video-id="E4036FAA-9495-4BA9-85A1-FAFA8F78987E" data-video-list="rt-hp-list-posters-popular-in-theaters" slot="play">`
`<tile-poster-image isvideovisible="true" slot="image">`
                                    
`<img slot="poster" data-src="https://resizing.flixster.com/FyE89pK7hUk-8vR0QjkjNNfo9uw=/180x257/v2/https://resizing.flixster.com/vW3ug5-igOxENDGIVc5CFrAzLHA=/ems.ZW1zLXByZC1hc3NldHMvbW92aWVzL2RjZTcxN2UzLThmOTYtNDBiYS1hOTNjLTlmYmZlZDM2ODU5Yi5qcGc=" src="https://resizing.flixster.com/FyE89pK7hUk-8vR0QjkjNNfo9uw=/180x257/v2/https://resizing.flixster.com/vW3ug5-igOxENDGIVc5CFrAzLHA=/ems.ZW1zLXByZC1hc3NldHMvbW92aWVzL2RjZTcxN2UzLThmOTYtNDBiYS1hOTNjLTlmYmZlZDM2ODU5Yi5qcGc=" class="js-lazyLoad" onerror="this.onerror=null; this.src='/images/poster_default.gif';" data-revealed="true" style=" -webkit-animation: overlay-fade 1s 1; -o-animation: overlay-fade 1s 1; animation: overlay-fade 1s 1;">`
                                    
                                    
`<rt-icon-cta-video slot="icon-play"></rt-icon-cta-video>`
`</tile-poster-image>`
`</button>`
`<a href="/m/the_new_mutants" class="unset" slot="link">`
`<tile-poster-meta>`
`<score-icon-critic alignment="left" percentage="34" size="tiny" slot="critic-score" state="rotten"></score-icon-critic>`
`<score-icon-audience alignment="left" percentage="55" size="tiny" slot="audience-score" state="spilled" style="display: none;"></score-icon-audience>`
`<span slot="title" class="p--small">The New Mutants</span>`
`</tile-poster-meta>`
`</a>`
`</tile-poster-video>`

#### Pull the relevant data into a dictionary object

In [131]:
movie_data = {}
for category in categories:
    category_name = category.find('h2').get_text()
    for movie in category.find_all('tile-poster-video'):
        title = movie.find('span', {'class': 'p--small'}).get_text()
        #The dict is keyed by title, so a movie listed in several categories keeps only its last category
        movie_data[title] = {
            'category': category_name,
            'critics_rating': movie.find('score-icon-critic').get('percentage'),
            'audience_rating': movie.find('score-icon-audience').get('percentage'),
        }
print(json.dumps(movie_data, sort_keys=True, indent=4))

{
    "21 Jump Street": {
        "audience_rating": "82",
        "category": "Essential Comedies",
        "critics_rating": "84"
    },
    "Airplane!": {
        "audience_rating": "89",
        "category": "Essential Comedies",
        "critics_rating": "97"
    },
    "Alone": {
        "audience_rating": "",
        "category": "Popular in Theaters",
        "critics_rating": "94"
    },
    "Anna and the Apocalypse": {
        "audience_rating": "62",
        "category": "Hidden Gem Movies on Prime",
        "critics_rating": "77"
    },
    "Antebellum": {
        "audience_rating": "64",
        "category": "New & Upcoming Movies",
        "critics_rating": "27"
    },
    "BPM (Beats Per Minute)": {
        "audience_rating": "83",
        "category": "Hidden Gem Movies on Hulu",
        "critics_rating": "98"
    },
    "Bill & Ted Face the Music": {
        "audience_rating": "74",
        "category": "Popular in Theaters",
        "critics_rating": "81"
    },
    "Bone T

<a id='#storing_data'></a>

#### Storing Data and Exporting to CSV

Now that we know we have the relevant data in a **dict** object, we can load it into a **Pandas** dataframe and export it to a **CSV** file.

In [132]:
#Store dict object in pandas dataframe
df = pd.DataFrame.from_dict(movie_data, orient = 'index')
#Export pandas dataframe to CSV
df.to_csv('rottentomatoes.csv')
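The ratings arrive as strings, and some are blank, so it may be worth converting them before export. The sketch below does this on a two-movie sample copied from the output above. The names `sample_data`, `sample_df`, and the `'title'` index label are illustrative choices, and calling `to_csv` without a path returns the CSV text instead of writing a file.

```python
import pandas as pd

# Small sample in the same shape as movie_data; values copied from the output above
sample_data = {
    'Airplane!': {'category': 'Essential Comedies', 'critics_rating': '97', 'audience_rating': '89'},
    'Alone': {'category': 'Popular in Theaters', 'critics_rating': '94', 'audience_rating': ''},
}

sample_df = pd.DataFrame.from_dict(sample_data, orient='index')

# The scraped ratings are strings; errors='coerce' turns unparseable values into NaN
for column in ['critics_rating', 'audience_rating']:
    sample_df[column] = pd.to_numeric(sample_df[column], errors='coerce')

# index_label names the first CSV column; with no path, to_csv returns the CSV text
csv_text = sample_df.to_csv(index_label='title')
print(csv_text)
```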

#### Homework Question: 

There are two sections on the webpage, **POPULAR STREAMING MOVIES** and **MOST POPULAR TV ON RT**. How would we go about parsing this data and exporting to CSV?

#### Thank You!

Thank you for working through this tutorial. Please feel free to respond with questions/comments, and let me know how this tutorial can be improved. 