# BeautifulSoup Webscraping Tutorial

#### Table of Contents

1. <a href='##Introduction'>The Basics of Webscraping With BeautifulSoup</a> <br>
> 1.a <a href='##urllib'>The urllib package</a> <br>
> 1.b <a href='##beautiful_soup'>The Beautiful Soup package</a> <br>
2. <a href='##example'>Webscraping Example</a> <br>
> 2.a <a href='##packages'>Necessary Packages</a> <br>
> 2.b <a href='##urllib_example'>urllib to grab the website</a> <br>
> 2.c <a href='##bs4_object'>Create a bs4 object</a> <br>
3. <a href='##bs4_parse'>Parsing a Beautiful Soup Object</a> <br>
> 3.a <a href='##bs4_html'>Creating the right HTML objects to parse</a> <br>
> 3.b <a href='##bs4_find_html'>Creating nested HTML objects using the **find( )** and **find_all( )** methods</a> <br>
> 3.c <a href='##pulling_tags'>Parsing the individual movies and pulling the data</a> <br>
> 3.d <a href='##storing_data'>Storing the data and exporting to CSV</a> <br>

<a id='#Introduction'></a>

### The Basics of Webscraping with BeautifulSoup

Beautiful Soup is a lightweight, flexible Python library for parsing HTML documents. It offers HTML tree traversal and automatic encoding handling (incoming documents are converted to Unicode and outgoing documents to UTF-8) (1), which lets you focus on parsing documents and extracting information. Beautiful Soup does not download webpages itself; that job is left to Python's urllib package, which we'll discuss next. Beautiful Soup then parses the HTML that urllib retrieves, so the two work hand in hand.

We'll be working in Python 3 for this tutorial, although Beautiful Soup runs for Python 2 with some simple adjustments to your code. Let's take a look at an example and see if we can get you up and running with parsing your own web pages!

(1) https://www.crummy.com/software/BeautifulSoup/

<a id='#urllib'></a>

#### The urllib package

The urllib package lets you access websites from Python. With just a few lines of code you can write a program that fetches webpages and saves data from them. urllib is also part of the Python standard library, so it ships with every Python 3 installation and does not need to be installed separately. Below are links to urllib documentation and simple example code:

**Tutorial**:    https://pythonprogramming.net/urllib-tutorial-python-3/ <br>
**Documentation**: https://docs.python.org/3/library/urllib.html
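
As a rough sketch of how a request can be prepared with urllib, the snippet below builds a `Request` object that carries an explicit User-Agent header and defines a helper that downloads and decodes a page. The helper names (`build_request`, `fetch_html`), the example URL, and the User-Agent string are assumptions made for illustration; they are not part of the original tutorial. Only the request object is built here, so nothing is downloaded until `fetch_html` is called.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (tutorial example)"):
    """Return a Request that carries an explicit User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and decode it to text. Calling this performs a real network request."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Build (but do not send) a request so we can inspect it.
request = build_request("https://www.example.com/")
print(request.full_url)
print(request.get_header("User-agent"))
```

Some sites reject requests that use urllib's default User-Agent, which is why the sketch sets one explicitly; whether a given site needs this is something to check case by case.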

<a id='#beautiful_soup'></a>

#### The Beautiful Soup Package

Once we've used urllib to grab an HTML document, we can deploy Beautiful Soup. It has built-in methods to navigate the HTML tree, grab tags, and handle the encoding of incoming and outgoing documents. There have been several major versions of Beautiful Soup; the current one is Beautiful Soup 4, which is imported in Python as **bs4**. Below are some relevant links for installing and reading about Beautiful Soup:

**Documentation**: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ <br>
**Installation**: http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python
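
To give a feel for what Beautiful Soup does once it has some HTML, here is a minimal sketch that parses a small hand-written string instead of a downloaded page. The HTML snippet is made up for illustration, and the built-in `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny, hand-written document (illustrative only).
html = "<html><body><h2>Top Box Office</h2><p class='note'>Updated daily</p></body></html>"

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h2")          # first <h2> tag in the document
paragraph = soup.find("p")         # first <p> tag in the document

print(heading.name)                # the tag's name
print(heading.text)                # the text inside the tag
print(paragraph["class"])          # attribute lookup; 'class' comes back as a list
```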

<a id='#example'></a>

#### Let's start a webscraping example

The first thing we need to do is import the relevant packages for our project. These packages are the following: <br>
* **re** - the regular expressions package for evaluating strings <br>
* **pandas** - this is a fundamental package for formatting and evaluating data <br>
* **urllib.request** - HTTP requests are handled by the 'request' module of the urllib package <br>
* **bs4** - we will create an object for parsing using the BeautifulSoup class from the bs4 package <br>

<a id='#packages'></a>

In [7]:
import re
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

<a id='#urllib_example'></a>

#### Use urllib to grab the website object

In [13]:
#Declare the website as a string. For this example, I've decided to grab the data from rottentomatoes website, parse
#the html and save it in a pandas dataframe which we can export to a csv file.
webpage = 'https://www.rottentomatoes.com/'

#Use a 'with' statement to open the webpage and read the HTML source code (as bytes) into 'html'
with urllib.request.urlopen(webpage) as response:
   html = response.read()

<a id='#bs4_object'></a>

#### Create a Beautiful Soup object

We do this by passing the **html** object to the **BeautifulSoup** constructor.

In [14]:
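#Name the parser explicitly so Beautiful Soup does not have to guess and warn. 'lxml' needs the lxml package
#installed; the built-in 'html.parser' is an alternative that needs no extra install.
soup = BeautifulSoup(html, 'lxml')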


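**soup.prettify()** returns the HTML as an indented string, which helps us decide how to search for tags as we parse the tree. The cell below references **soup.prettify** without calling it, so the notebook shows the bound method's repr, which also contains the whole document; run `print(soup.prettify())` to see the indented version.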

In [15]:
soup.prettify

<bound method Tag.prettify of <!DOCTYPE html>
<html lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
<script src="//cdn.optimizely.com/js/594670329.js"></script>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="VPPXtECgUUeuATBacnqnCm4ydGO99reF-xgNklSbNbc" name="google-site-verification"/>
<meta content="034F16304017CA7DCF45D43850915323" name="msvalidate.01"/>
<link href="https://staticv2-4.rottentomatoes.com/static/images/iphone/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="https://staticv2-4.rottentomatoes.com/static/images/icons/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="https://staticv2-4.rottentomatoes.com/static/styles/css/rt_main.css" rel="stylesheet"/>
<script id="jsonLdSchema" type="applicat

<a id='#bs4_parse'></a>

#### Beautiful Soup Parsing Methods

The beauty of Beautiful Soup (pun intended) is in the robust, flexible parsing methods. Some of the ones that I have found the most useful are the following (which we will use to parse this webpage):

* **find_all( )** : the 'find_all()' method is one of the most popular and useful methods of Beautiful Soup. We pass it a tag name and it searches **DOWN** the tree, returning every matching tag among the descendants of the object it is called on. <br>
* **find( )** : the 'find()' method works like 'find_all()', except that it returns only the first occurrence of the tag it is given (or None if nothing matches), again searching **DOWN** the tree

(Both **find( )** and **find_all( )** search the descendants of the tree, and most other methods of Beautiful Soup are derivations of these two. We'll touch on a couple others)

* **find_parents( )** : the 'find_parents()' method searches **UP** the tree to find all matching HTML tags among an element's ancestors, as opposed to **find_all( )** which searches down the tree. <br>
* **find_parent( )** : the 'find_parent( )' method is like 'find_parents( )', except that it returns only the first matching tag found looking **UP** the HTML tree

* **find_next_siblings( )** : the 'find_next_siblings( )' method searches **FORWARD** through the elements that share the same parent and come later in the document, returning all matching sibling tags <br>
* **find_next_sibling( )** : the 'find_next_sibling( )' method searches the same way, but returns only the first matching sibling tag <br>

To see a full list of the methods available for parsing HTML trees in Beautiful Soup, visit the documentation page at

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
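
The search methods listed above can be tried out on a small table before pointing them at a real page. The sketch below uses a hand-written snippet modeled loosely on the movie table that appears later in this tutorial; the snippet itself is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# A small, hand-written table (illustrative only).
html = """
<table id="Opening">
  <tr><td class="left_col">97%</td><td class="middle_col">A Quiet Place</td></tr>
  <tr><td class="left_col">83%</td><td class="middle_col">Blockers</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Search DOWN the tree: every <tr>, then only the first <td>.
all_rows = soup.find_all("tr")
first_cell = soup.find("td")

# Search UP the tree: the closest enclosing <table>.
parent_table = first_cell.find_parent("table")

# Search FORWARD among siblings: the next <td> in the same row,
# and every later <tr> that shares the first row's parent.
next_cell = first_cell.find_next_sibling("td")
later_rows = all_rows[0].find_next_siblings("tr")

print(len(all_rows), first_cell.text, parent_table["id"])
print(next_cell.text, len(later_rows))
```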

In [20]:
#As we examine our HTML doc, we see that all the information we need is nested in 'div' tags whose 'id'
#attribute contains the word 'homepage'.

#To capture these tags, we pass the tag name 'div' and a filter on the 'id' attribute to 'find_all()'. Attribute
#filters are passed as a 'dict' that maps the attribute name to the value we want to match.
#Here the value is a regular expression, so we capture every 'id' that has the word 'homepage' in the string.

categories = soup.find_all('div', {'id': re.compile('homepage')})
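
One detail worth seeing in isolation is how a compiled regular expression behaves as an attribute filter: Beautiful Soup searches for the pattern anywhere in the attribute value, so `re.compile('homepage')` also matches ids that merely contain the word. The sketch below is a small illustration; apart from `homepage-opening-this-week`, the `id` values are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hand-written snippet; only the first id comes from the real page.
html = """
<div id="homepage-opening-this-week"></div>
<div id="homepage-top-box-office"></div>
<div id="footer"></div>
<div id="old-homepage-banner"></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: matches any id that contains 'homepage'.
loose = soup.find_all("div", {"id": re.compile("homepage")})

# Anchored pattern: matches only ids that start with 'homepage-'.
anchored = soup.find_all("div", id=re.compile(r"^homepage-"))

print([d["id"] for d in loose])
print([d["id"] for d in anchored])
```

If a page has unrelated elements whose ids happen to contain the same word, anchoring the pattern is one way to keep them out of the results.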

In [26]:
#Let's now examine the 'categories' object to see what has been stored there
print("Let's look at the length of the 'categories' object")
print('Length: ', len(categories), '\n')
print("Now let's examine the first object to see what we need to pull")
print(categories[0])

Let's look at the length of the 'categories' object
Length:  8 

Now let's examine the first object to see what we need to pull
<div class="listings" id="homepage-opening-this-week">
<a class="pull-right showtimesLink" href="/showtimes/">Get Tickets</a>
<h2>Movies Opening This Week</h2>
<table class="movie_list" id="Opening">
<tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/a_quiet_place_2018">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore">97%</span>
</a>
</td>
<td class="middle_col">
<a href="/m/a_quiet_place_2018">A Quiet Place</a>
</td>
<td class="right_col right">
<a href="/m/a_quiet_place_2018">
                    Apr 6</a>
</td>
</tr><tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/blockers">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore">83%</span>
</a>
</td>
<td class="middle_col">
<a href="/m/blockers">Blockers</a>
</td>
<td class="right_col right">
<a href="/m/blockers">
       

<a id='#bs4_html'></a>

#### Creating the right HTML objects to parse

We see that the objects we need to pull have an **'h2'** tag which stores the category name on the rottentomatoes website. The first 6 category objects have this **'h2'** tag, while the last 2 do not (we can discard these last two objects). The loop below runs over all 8 objects, so it stops with an **AttributeError** once it reaches the first object without an **'h2'** tag.

We can use the **.text** attribute of a tag (**soup_html_object.text**) to extract the text inside it

In [34]:
#For each object in 'categories', print the movie category title and then display the object.

for category in categories:
    print(category.find('h2').text,'\n')
    print(category)
    print("-----------------Next Category Starts----------------- \n\n")

Movies Opening This Week 

<div class="listings" id="homepage-opening-this-week">
<a class="pull-right showtimesLink" href="/showtimes/">Get Tickets</a>
<h2>Movies Opening This Week</h2>
<table class="movie_list" id="Opening">
<tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/a_quiet_place_2018">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore">97%</span>
</a>
</td>
<td class="middle_col">
<a href="/m/a_quiet_place_2018">A Quiet Place</a>
</td>
<td class="right_col right">
<a href="/m/a_quiet_place_2018">
                    Apr 6</a>
</td>
</tr><tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/blockers">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore">83%</span>
</a>
</td>
<td class="middle_col">
<a href="/m/blockers">Blockers</a>
</td>
<td class="right_col right">
<a href="/m/blockers">
                    Apr 6</a>
</td>
</tr><tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href

AttributeError: 'NoneType' object has no attribute 'text'
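
The AttributeError above appears because `find()` returns `None` when nothing matches, and `None` has no `.text` attribute. An alternative to slicing off the last objects is to check the result before using it. The sketch below shows that idea on a hand-written snippet; the second `div` and its id are hypothetical.

```python
from bs4 import BeautifulSoup

# Two hand-written blocks: one with an <h2>, one without (illustrative only).
html = """
<div id="homepage-opening-this-week"><h2>Movies Opening This Week</h2></div>
<div id="homepage-no-heading"><p>No heading here</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

titles = []
for category in soup.find_all("div"):
    heading = category.find("h2")
    if heading is None:
        # No <h2> in this block, so skip it instead of raising an error.
        continue
    titles.append(heading.text)

print(titles)
```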

#### Examine Categories

As we iterate over each object in **categories**, we see that the nested objects do not give us individual movies with the associated data: looping over a tag directly yields only its immediate children, including the whitespace between them. In order to grab the movies in a clear hierarchical fashion, we'll need to create the objects using the **find( )** and **find_all( )** methods.

In [38]:
for category in categories:
    print(category.find('h2').text,'\n')
    for movie in category:
        print(movie, '\n')
        print('----------NEXT MOVIE----------')

Movies Opening This Week 


 

----------NEXT MOVIE----------
<a class="pull-right showtimesLink" href="/showtimes/">Get Tickets</a> 

----------NEXT MOVIE----------

 

----------NEXT MOVIE----------
<h2>Movies Opening This Week</h2> 

----------NEXT MOVIE----------

 

----------NEXT MOVIE----------
<table class="movie_list" id="Opening">
<tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/a_quiet_place_2018">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore">97%</span>
</a>
</td>
<td class="middle_col">
<a href="/m/a_quiet_place_2018">A Quiet Place</a>
</td>
<td class="right_col right">
<a href="/m/a_quiet_place_2018">
                    Apr 6</a>
</td>
</tr><tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/blockers">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore">83%</span>
</a>
</td>
<td class="middle_col">
<a href="/m/blockers">Blockers</a>
</td>
<td class="right_col right">
<a href="/m/bl

AttributeError: 'NoneType' object has no attribute 'text'
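
The output above mixes blank entries with whole tags because iterating over a tag walks its direct children, and the line breaks between tags count as children too. The following sketch contrasts that with `find_all('tr')` on a hand-written snippet shaped like one category block; the snippet is a simplified assumption, not the real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified stand-in for one category block (illustrative only).
html = """<div id="homepage-opening-this-week">
<h2>Movies Opening This Week</h2>
<table><tr><td>A Quiet Place</td></tr><tr><td>Blockers</td></tr></table>
</div>"""
soup = BeautifulSoup(html, "html.parser")
category = soup.find("div")

# Iterating a tag yields its direct children: tags AND whitespace strings.
direct_children = list(category)
tag_children = [child for child in direct_children if isinstance(child, Tag)]

# find_all() searches all descendants, so it reaches the rows inside the table.
rows = category.find_all("tr")

print(len(direct_children), [tag.name for tag in tag_children], len(rows))
```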

<a id='#bs4_find_html'></a>

#### Creating nested HTML objects using the **find( )** and **find_all( )** methods

We see that by setting **movies** to **category.find_all('tr')**, each nested object is now a movie with all the necessary data for us to evaluate (title, rating, additional data).

In [39]:
#Each 'category' is one category section on the rottentomatoes website
for category in categories[:-2]:
    print(category.find('h2').text)
    movies = category.find_all('tr')
    for movie in movies:
        print(movie, '\n')
        print("----------NEXT MOVIE----------")

Movies Opening This Week
<tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/a_quiet_place_2018">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore">97%</span>
</a>
</td>
<td class="middle_col">
<a href="/m/a_quiet_place_2018">A Quiet Place</a>
</td>
<td class="right_col right">
<a href="/m/a_quiet_place_2018">
                    Apr 6</a>
</td>
</tr> 

----------NEXT MOVIE----------
<tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/blockers">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore">83%</span>
</a>
</td>
<td class="middle_col">
<a href="/m/blockers">Blockers</a>
</td>
<td class="right_col right">
<a href="/m/blockers">
                    Apr 6</a>
</td>
</tr> 

----------NEXT MOVIE----------
<tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/chappaquiddick">
<span class="icon tiny fresh"></span>
<span class="tMeterScore">82%</span>
</a>
</td>
<td class="middle_col">
<a

<a id='#pulling_tags'></a>

#### Pulling the Relevant Tags

Now that we have created the right HTML objects, we can use the **find( )** method to extract the relevant data. We'll wrap the lookups in Python's **try/except** so that a missing element does not raise an exception that would break our script.

We can find the data by parsing the appropriate **td** tags:

**Movie Title:** 

`<td class="middle_col">` <br>
`<a href="https://www.rottentomatoes.com/tv/the_americans/s06">The Americans</a>` <br>
`</td>` <br>

**Movie Score:** 

`<td class="left_col">` <br>
`<a href="https://www.rottentomatoes.com/tv/the_americans/s06">` <br>
`<span class="icon tiny certified_fresh"></span>` <br>
`<span class="tMeterScore">98%</span>` <br>
`</a>` <br>
`</td>` <br>

**Additional Data:** 

`<td class="right_col right">` <br>
`<a href="/m/tyler_perrys_acrimony">` <br>
`                    $17.2M</a>` <br>
`</td>` <br>

#### All of these are pieces of this **movie** HTML object:

**Entire Movie Object**

Top Box Office <br>
`<tr class="sidebarInTheaterOpening">` <br>
`<td class="left_col">` <br>
`<a href="/m/ready_player_one">` <br>
`<span class="icon tiny certified_fresh"></span>` <br>
`<span class="tMeterScore">74%</span>` <br>
`</a>` <br>
`</td>` <br>
`<td class="middle_col">` <br>
`<a href="/m/ready_player_one">Ready Player One</a>` <br>
`</td>` <br>
`<td class="right_col right">` <br>
`<a href="/m/ready_player_one">` <br>
`                    $41.9M</a>` <br>
`</td>` <br>
`</tr>` <br>
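
As a sketch of how the three pieces can be pulled out of one movie row, the snippet below parses the Ready Player One row shown above and collects the fields into a dict. The helper names (`text_or_default`, `extract_movie`) and the dict keys are assumptions for illustration. The score is looked up through the `tMeterScore` class rather than by position, which is a different route from the cell that follows.

```python
from bs4 import BeautifulSoup

# The movie row from the tutorial, copied as a string.
row_html = """
<tr class="sidebarInTheaterOpening">
<td class="left_col">
<a href="/m/ready_player_one">
<span class="icon tiny certified_fresh"></span>
<span class="tMeterScore">74%</span>
</a>
</td>
<td class="middle_col">
<a href="/m/ready_player_one">Ready Player One</a>
</td>
<td class="right_col right">
<a href="/m/ready_player_one">
                    $41.9M</a>
</td>
</tr>
"""


def text_or_default(tag, default):
    """Return the tag's stripped text, or a default when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default


def extract_movie(row):
    """Collect title, score, and extra data from one <tr> (hypothetical helper)."""
    return {
        "title": text_or_default(row.find("td", class_="middle_col"), ""),
        "score": text_or_default(row.find("span", class_="tMeterScore"), "No Score Yet"),
        "extra": text_or_default(row.find("td", class_="right_col"), ""),
    }


movie = BeautifulSoup(row_html, "html.parser").find("tr")
record = extract_movie(movie)
print(record)
```

Checking for `None` explicitly keeps the fallback values visible in one place, at the cost of a small helper function.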

In [40]:
#Each 'category' is one category section on the rottentomatoes website
for category in categories[:-2]:
    print(category.find('h2').text)
    movies = category.find_all('tr')
    for movie in movies:
        #We can specify the 'td' tag with 'class' value = 'middle_col' and pull the '.text' attribute
        print(movie.find('td', {'class': 'middle_col'}).text)
        try:
            #To pull the movie score, we call 'find_all()' on the object returned by 'find()' and take the
            #text of the second 'span' element.
            print(movie.find('td', {'class': 'left_col'}).find_all('span')[1].text)
        except (AttributeError, IndexError):
            #'find()' returned None or there was no second 'span'
            print("No Score Yet")
        try:
            #Here we specify the 'td' tag with 'class' value = 'right_col right' and pull the '.text' attribute
            print(movie.find('td', {'class': 'right_col right'}).text)
        except AttributeError:
            pass

Movies Opening This Week

A Quiet Place

97%


                    Apr 6


Blockers

83%


                    Apr 6


Chappaquiddick

82%


                    Apr 6


The Miracle Season

32%


                    Apr 6


You Were Never Really Here

87%


                    Apr 6

Top Box Office

Ready Player One

74%


                    $41.9M


Tyler Perry's Acrimony

25%


                    $17.2M


Black Panther

97%


                    $11.6M


I Can Only Imagine

70%


                    $10.6M


Pacific Rim Uprising

44%


                    $9.4M


Sherlock Gnomes

18%


                    $7M


Tomb Raider

49%


                    $5M


A Wrinkle in Time

39%


                    $4.8M


Love, Simon

92%


                    $4.8M


Paul, Apostle of Christ

43%


                    $3.5M

Coming Soon to Theaters

Isle of Dogs

92%


                    Apr 13


Rampage

No Score Yet


                    Apr 13


Beirut

90%


                    Apr 11


Sgt. 

<a id='#storing_data'></a>

#### Storing Data and Exporting to CSV

Now that we know we have the relevant data, we can store our data in a **dict** object and export that data to a **CSV** file via our **Pandas** dataframe.

In [42]:
movie_data = {}
#Each 'category' is one category section on the rottentomatoes website
for category in categories[:-2]:
    cat = category.find('h2').text
    movies = category.find_all('tr')
    for movie in movies:
        title = movie.find('td', {'class': 'middle_col'}).text
        try:
            score = movie.find('td', {'class': 'left_col'}).find_all('span')[1].text
        except (AttributeError, IndexError):
            score = "No Score Yet"
        try:
            addtl_data = movie.find('td', {'class': 'right_col right'}).text
        except AttributeError:
            #Use an empty string so a missing cell never reuses the previous movie's value
            addtl_data = ''
        movie_data[title.strip()] = {'Category': cat, 'Score': score, 'Addititional_Data': addtl_data.strip()}
movie_data

{'A Quiet Place': {'Addititional_Data': 'Apr 6',
  'Category': 'Movies Opening This Week',
  'Score': '97%'},
 'A Wrinkle in Time': {'Addititional_Data': '$4.8M',
  'Category': 'Top Box Office',
  'Score': '39%'},
 'Beirut': {'Addititional_Data': 'Apr 11',
  'Category': 'Coming Soon to Theaters',
  'Score': '90%'},
 'Black Panther': {'Addititional_Data': '$11.6M',
  'Category': 'Top Box Office',
  'Score': '97%'},
 'Blockers': {'Addititional_Data': 'Apr 6',
  'Category': 'Movies Opening This Week',
  'Score': '83%'},
 'Blue Bloods': {'Addititional_Data': 'Apr 13',
  'Category': 'New TV Tonight',
  'Score': 'No Score Yet'},
 'Borg Vs. McEnroe': {'Addititional_Data': 'Apr 13',
  'Category': 'Coming Soon to Theaters',
  'Score': '79%'},
 'Chappaquiddick': {'Addititional_Data': 'Apr 6',
  'Category': 'Movies Opening This Week',
  'Score': '82%'},
 'Counterpart': {'Addititional_Data': 'Apr 13',
  'Category': 'Most Popular TV on RT',
  'Score': '100%'},
 'Dynasty': {'Addititional_Data': 'Apr

In [45]:
#Store dict object in pandas dataframe
df = pd.DataFrame.from_dict(movie_data, orient = 'index')
#Export pandas dataframe to CSV
df.to_csv('rottentomatoes.csv')
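
One thing to keep in mind with a dict keyed by title is that a movie listed in more than one category keeps only its last entry. A possible alternative, sketched below with the standard library's `csv` module, is to collect one row per listing in a list. The column names are assumptions, and the third row is a made-up duplicate added only to show that both listings survive; the first two rows reuse values from the output above.

```python
import csv
import io

# One dict per listing, so the same title can appear more than once.
rows = [
    {"Title": "A Quiet Place", "Category": "Movies Opening This Week",
     "Score": "97%", "Additional_Data": "Apr 6"},
    {"Title": "Black Panther", "Category": "Top Box Office",
     "Score": "97%", "Additional_Data": "$11.6M"},
    # Hypothetical duplicate listing, for illustration only.
    {"Title": "Black Panther", "Category": "Example Second Category",
     "Score": "97%", "Additional_Data": "n/a"},
]

# Write to an in-memory buffer; swap in open('some_file.csv', 'w', newline='') to write a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Title", "Category", "Score", "Additional_Data"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```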

#### Thank You!

Thank you for working through this tutorial. Please feel free to respond with questions/comments, and let me know how this tutorial can be improved. 