<h1><center>BeautifulSoup Spotlight</center></h1>
<center>by Aaron Lee</center>

Hello all! Today we will be taking a look at a very interesting and useful Python library called Beautiful Soup. Beautiful Soup is a tool for extracting data from HTML and XML sources using familiar Pythonic idioms. Combined with the Requests library, it can also be used to crawl the web and scrape data from online HTML and XML sources. It leverages the nested structure of HTML and XML files to provide an easily searchable, modifiable, and well-parsed document tree. The official documentation is available at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

This spotlight assumes that the user has a working understanding of both XML and HTML document structures as well as their applications. If unfamiliar with XML, please consider visiting https://www.howtogeek.com/357092/what-is-an-xml-file-and-how-do-i-open-one/ to bolster or refresh your understanding. Also consider visiting https://study.com/academy/lesson/what-is-an-html-document-structure-types-examples.html for information about HTML.

This spotlight will demonstrate many of the most important functions and features of the Beautiful Soup library alongside some interesting applications of it. 

First, we must confirm that we have the correct packages installed in our respective environment. This can be achieved by running the following lines of code. 

Requests is described as an elegant and simple HTTP library for Python. 
It simplifies the process of sending HTTP/1.1 requests over the web.
Please see https://requests.readthedocs.io/en/master/ for further documentation. 

pip install requests

pip install beautifulsoup4

lxml is a third-party Python parser that is necessary for the xml functions
of the Beautiful Soup library to work.

pip install lxml

Though the html5lib package is not strictly necessary, it gives Beautiful Soup an
alternative HTML parser that parses pages the way a web browser does. It is slower than
the standard parser but handles badly formed markup more gracefully.

pip install html5lib

Though many environments come pre-installed with these packages, I have included code to install all packages that are necessary for the proper functioning of the Beautiful Soup library above for the sake of completeness. Note that we will be working with beautifulsoup4 version 4.8.2, requests version 2.22.0, lxml version 4.5.0, html5lib version 1.0.1, and Python version 3.7.6 during this demonstration. 
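
To confirm which versions are actually present in your environment, a minimal sketch such as the following may help. It assumes each package exposes a `__version__` attribute (falling back to "unknown" if not), and the variable names are illustrative only.

```python
import importlib

# Import names for the packages used in this spotlight
# (beautifulsoup4 is imported as bs4).
package_names = ["bs4", "requests", "lxml", "html5lib"]

installed_versions = {}
for name in package_names:
    try:
        module = importlib.import_module(name)
    except ImportError:
        # Package is not installed in this environment
        installed_versions[name] = None
        continue
    # Assumption: the package exposes __version__; otherwise report "unknown"
    installed_versions[name] = getattr(module, "__version__", "unknown")

for name, version in installed_versions.items():
    print(name, version if version is not None else "not installed")
```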

Now that we have the correct packages installed in our local environments we can import the required libraries using the following lines of code. 

In [1]:
import re
import requests
from bs4 import BeautifulSoup

We can now begin to explore the Beautiful Soup library and discover its many uses and features. We will begin by exploring a locally stored XML document. Since I am unable to include any other files with this submission, I have provided the sample document as a string (called xml_document) below. 

In [2]:
xml_document = """
<breakfast_menu>
<food gf="true" low_calorie="true">
<dish>Belgian Waffles</dish>
<price>$5.95</price>
<description>
Two of our famous Belgian Waffles with plenty of real maple syrup
</description>
<calories>650</calories>
</food>
<food gf="true" low_calorie="false">
<dish>Strawberry Belgian Waffles</dish>
<price>$7.95</price>
<description>
Light Belgian waffles covered with strawberries and whipped cream
</description>
<calories>900</calories>
</food>
<food gf="true" low_calorie="false">
<dish>Berry-Berry Belgian Waffles</dish>
<price>$8.95</price>
<description>
Light Belgian waffles covered with an assortment of fresh berries and whipped cream
</description>
<calories>900</calories>
</food>
<food gf="false" low_calorie="true">
<dish>French Toast</dish>
<price>$4.50</price>
<description>
Thick slices made from our homemade sourdough bread
</description>
<calories>600</calories>
</food>
<food gf="false" low_calorie="false">
<dish>Homestyle Breakfast</dish>
<price>$6.95</price>
<description>
Two eggs, bacon or sausage, toast, and our ever-popular hash browns
</description>
<calories>950</calories>
</food>
</breakfast_menu>
"""

In [3]:
# The variable beautifulsoup corresponds to a Beautiful Soup object that
# represents the above document as a nested data structure. We also
# specify the "xml" parser, which is currently the only supported
# XML parser.
beautifulsoup = BeautifulSoup(xml_document, "xml")

# The prettify method takes the above Beautiful Soup parse tree and 
# translates it into a formatted Unicode string for ease of 
# reading
print(beautifulsoup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<breakfast_menu>
 <food gf="true" low_calorie="true">
  <dish>
   Belgian Waffles
  </dish>
  <price>
   $5.95
  </price>
  <description>
   Two of our famous Belgian Waffles with plenty of real maple syrup
  </description>
  <calories>
   650
  </calories>
 </food>
 <food gf="true" low_calorie="false">
  <dish>
   Strawberry Belgian Waffles
  </dish>
  <price>
   $7.95
  </price>
  <description>
   Light Belgian waffles covered with strawberries and whipped cream
  </description>
  <calories>
   900
  </calories>
 </food>
 <food gf="true" low_calorie="false">
  <dish>
   Berry-Berry Belgian Waffles
  </dish>
  <price>
   $8.95
  </price>
  <description>
   Light Belgian waffles covered with an assortment of fresh berries and whipped cream
  </description>
  <calories>
   900
  </calories>
 </food>
 <food gf="false" low_calorie="true">
  <dish>
   French Toast
  </dish>
  <price>
   $4.50
  </price>
  <description>
   Thick slices made from our ho

As you can see, the above XML document represents a breakfast menu containing multiple food items, each with nested fields for the dish name, price, description, and calories. To make this point clearer, we can use the following code.

In [4]:
# Extracts only the text of the document and returns it as a
# single Unicode string. 
print(beautifulsoup.get_text())



Belgian Waffles
$5.95

Two of our famous Belgian Waffles with plenty of real maple syrup

650


Strawberry Belgian Waffles
$7.95

Light Belgian waffles covered with strawberries and whipped cream

900


Berry-Berry Belgian Waffles
$8.95

Light Belgian waffles covered with an assortment of fresh berries and whipped cream

900


French Toast
$4.50

Thick slices made from our homemade sourdough bread

600


Homestyle Breakfast
$6.95

Two eggs, bacon or sausage, toast, and our ever-popular hash browns

950




Now that we have seen what is contained within the XML document, let's take a look at how Beautiful Soup categorizes and separates these elements. We will start by looking at the Tag object. Tags correspond to the HTML or XML tags within the original document. For example, if you wanted to take a look at a single food item on the menu, you could do so with the following code. 

In [5]:
# Creates a Tag object called food_tag that holds the information
# for a single food item (the first one) from the document
food_tag = beautifulsoup.food

print("Name of printed tag: " + str(food_tag.name))
print(food_tag)

Name of printed tag: food
<food gf="true" low_calorie="true">
<dish>Belgian Waffles</dish>
<price>$5.95</price>
<description>
Two of our famous Belgian Waffles with plenty of real maple syrup
</description>
<calories>650</calories>
</food>


Note that this only brings up the first food item on the menu. We will go over how to navigate, search, and modify information within a document later. For now, let's take a look at how to extract specific information from a given Tag. Say you need to identify all the details of a food item separately. 

In [6]:
# Notice that we are now starting to combine some of the methods
# we've used up to this point. 

# The "strip=True" modifier within the get_text() method simply
# strips white space from the beginning and end of each string
# for ease of output.
print("Dish: " + str(food_tag.dish.get_text(strip=True)))
print("Price: " + str(food_tag.price.get_text(strip=True)))
print("Description: " + str(food_tag.description.get_text(strip=True)))
print("Calories: " + str(food_tag.calories.get_text(strip=True)))

Dish: Belgian Waffles
Price: $5.95
Description: Two of our famous Belgian Waffles with plenty of real maple syrup
Calories: 650
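
The same per-tag lookups extend naturally to every food item. The following is a minimal sketch, not part of the original notebook: it uses a shortened copy of the menu and Python's built-in "html.parser" so that it runs without lxml (an assumption made only for portability; the calls are the same with the "xml" parser used above). The names `sample_menu` and `menu_rows` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the breakfast menu
sample_menu = """
<breakfast_menu>
<food gf="true" low_calorie="true">
<dish>Belgian Waffles</dish>
<price>$5.95</price>
<calories>650</calories>
</food>
<food gf="false" low_calorie="true">
<dish>French Toast</dish>
<price>$4.50</price>
<calories>600</calories>
</food>
</breakfast_menu>
"""

menu_soup = BeautifulSoup(sample_menu, "html.parser")

# Build one dictionary per <food> tag, converting the numeric fields
menu_rows = []
for food in menu_soup.find_all("food"):
    menu_rows.append({
        "dish": food.dish.get_text(strip=True),
        "price": float(food.price.get_text(strip=True).lstrip("$")),
        "calories": int(food.calories.get_text(strip=True)),
    })

for row in menu_rows:
    print(row)
```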


Easy as pie (or perhaps waffles in this case)! Now that we've covered the basics of using Tags, let's take a look at Attributes within a Tag object. A Tag may have any number of attributes associated with it. These can be accessed, removed, and modified by treating the Tag like a dictionary. We will demonstrate this fact below. Note that "gf" indicates whether a dish is "gluten-free" or not.

In [7]:
# Prints out the full list of Attributes
print("Original Attributes: " + str(food_tag.attrs))

# Prints out specifically the Attribute for 'gf'
print("gf Attribute: " + str(food_tag['gf']))

# Prints out specifically the Attribute for 'low_calorie'
print("low_calorie Attribute: " + str(food_tag['low_calorie']))

# Changes the Attribute for 'low_calorie'
food_tag['low_calorie'] = 'false'
print("low_calorie Attribute after change: " + str(food_tag['low_calorie']))

# Deletes the 'low_calorie' Attribute completely
del food_tag['low_calorie']
print("Attributes after deletion: " + str(food_tag.attrs))

# Adds the 'low_calorie' Attribute
food_tag['low_calorie'] = 'true'
print("Attributes after addition: " + str(food_tag.attrs))


Original Attributes: {'gf': 'true', 'low_calorie': 'true'}
gf Attribute: true
low_calorie Attribute: true
low_calorie Attribute after change: false
Attributes after deletion: {'gf': 'true'}
Attributes after addition: {'gf': 'true', 'low_calorie': 'true'}
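
Attributes are also useful as search filters. The sketch below is an illustrative addition that runs on a shortened copy of the menu with the built-in "html.parser" (chosen only so it works without lxml). It shows `find_all` with an `attrs` dictionary, plus `get` and `has_attr` for safely checking an attribute that may be missing. The "vegan" attribute is hypothetical and does not exist in the menu.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the breakfast menu
sample_menu = """
<breakfast_menu>
<food gf="true" low_calorie="true"><dish>Belgian Waffles</dish></food>
<food gf="true" low_calorie="false"><dish>Strawberry Belgian Waffles</dish></food>
<food gf="false" low_calorie="true"><dish>French Toast</dish></food>
</breakfast_menu>
"""

menu_soup = BeautifulSoup(sample_menu, "html.parser")

# All gluten-free dishes
gluten_free = menu_soup.find_all("food", attrs={"gf": "true"})
gluten_free_names = [food.dish.get_text(strip=True) for food in gluten_free]

# Dishes that are both gluten-free and low calorie
light_and_gf = menu_soup.find_all("food", attrs={"gf": "true", "low_calorie": "true"})
light_and_gf_names = [food.dish.get_text(strip=True) for food in light_and_gf]

# get() returns None for a missing attribute instead of raising KeyError
first_food = menu_soup.food
vegan_flag = first_food.get("vegan")
has_gf = first_food.has_attr("gf")

print(gluten_free_names)
print(light_and_gf_names)
print(vegan_flag, has_gf)
```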


Tags and their Attributes are two of the most commonly used objects in the Beautiful Soup library. There are other objects, such as BeautifulSoup and NavigableString objects, but they can be handled in familiar ways: a BeautifulSoup object can be treated much like a Tag, and a NavigableString (the text inside a tag) behaves much like an ordinary Python string. It is also worth noting that although Beautiful Soup can use both XML and HTML parsers, the methods used to work with the resulting trees are exactly the same (since both have similarly nested structures). 
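
As a quick illustration of how these object types relate, here is a small sketch on a one-line snippet. The snippet and variable names are illustrative, and the built-in "html.parser" is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

snippet_soup = BeautifulSoup("<food><dish>French Toast</dish></food>", "html.parser")

dish_tag = snippet_soup.dish    # a Tag object
dish_text = dish_tag.string     # a NavigableString object

# The BeautifulSoup object itself can be used like a Tag
print(isinstance(snippet_soup, Tag))
print(snippet_soup.name)

# A NavigableString supports ordinary string methods
print(isinstance(dish_text, NavigableString), isinstance(dish_text, str))
print(dish_text.upper())
```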

With this knowledge in hand, let's bring in the Requests package and use our newfound understanding to scrape real data straight from the web! We will be using an example that everyone should be familiar with by this point in the semester: Dr. Caverlee's course webpage. 

In [8]:
# The response_object defined below holds all the information collected from 
# the course web page url that we provided. It is then converted into another
# form that holds all the data in string form. We can think of this newly created
# object much like the document we worked with above.
response_object = requests.get("http://courses.cse.tamu.edu/caverlee/csce670/")
response_object_text = response_object.text

soup = BeautifulSoup(response_object_text, "html5lib")
print(soup.prettify())

<html>
 <head>
  <title>
   CSCE 670 :: Information Storage and Retrieval :: Spring 2020
  </title>
 </head>
 <body alink="blue" bgcolor="white" link="blue" text="black" vlink="blue">
  <h1>
   CSCE 670 :: Information Storage and Retrieval :: Spring 2020
  </h1>
  <dt>
   MWF 11:30am-12:20pm
   <strike>
    in ZACH 310
   </strike>
   online
  </dt>
  <br/>
  <dt>
   Instructor:
   <a href="http://faculty.cse.tamu.edu/caverlee/">
    James Caverlee
   </a>
   ,
HRBB 403
  </dt>
  <dt>
   Office Hours: 3-4pm on Monday and Tuesday, or by appointment
  </dt>
  <dt>
   Department of
   <a href="http://www.cse.tamu.edu">
    Computer Science and
Engineering
   </a>
  </dt>
  <dt>
   <a href="http://www.tamu.edu">
    Texas A&amp;M
University
   </a>
  </dt>
  <dt>
   <br/>
  </dt>
  <dt>
   TA:
   <a href="http://people.tamu.edu/~yunhe/">
    Yun He
   </a>
   , HRBB 408D
  </dt>
  <dt>
   Office Hours: 4-5pm on Thursday and Friday, or by appointment
  </dt>
  <dt>
   <p>
    <a href="sched

Now that we have retrieved the course webpage, we can begin exploring it using Beautiful Soup's many features. Though there are several ways to search a document, we will focus on the most popular one: the find_all method. It searches everything beneath the object it is called on (here, the whole document passed to the Beautiful Soup constructor) and returns every Tag that matches the given filter. It accepts strings, regular expressions, lists, and functions as filters, which makes it an especially powerful tool. Its use is demonstrated below. 
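
Before running find_all on the course page, here is a minimal sketch of the four filter types on a small inline document. The HTML snippet, the example.com address, and the helper `has_absolute_href` are all made up for illustration, and the built-in "html.parser" is used so the sketch needs no extra packages.

```python
import re
from bs4 import BeautifulSoup

# Small, made-up page for illustration
sample_html = """
<html><body>
<h1>Course Page</h1>
<p>Meets <strike>in ZACH 310</strike> online</p>
<a href="http://example.com/a">Absolute</a>
<a href="schedule.html">Relative</a>
<b>Bold</b>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# 1. String: every tag with exactly this name
by_name = sample_soup.find_all("a")

# 2. Regular expression: every tag whose name matches the pattern
by_regex = sample_soup.find_all(re.compile("^b"))

# 3. List: every tag whose name is any item in the list
by_list = sample_soup.find_all(["h1", "strike"])

# 4. Function: every tag for which the function returns True
def has_absolute_href(tag):
    return tag.has_attr("href") and tag["href"].startswith("http")

by_function = sample_soup.find_all(has_absolute_href)

print([tag.name for tag in by_name])
print([tag.name for tag in by_regex])
print([tag.name for tag in by_list])
print([tag["href"] for tag in by_function])
```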

In [9]:
# The find_all method will print out every Tag object
# in the document labelled strike
for item in soup.find_all("strike"):
    print(item)
    print()

<strike>in ZACH 310</strike>

<strike>
</strike>

<strike>
<b>Participation (5%)</b>. Attendance in class and participation in the discussion are both important to your success in the course. We expect you to participate in online discussions on Piazza. Over the course of the semester, you should <b>post at least three</b> posts or replies to the discussion forum on Piazza. These posts can start a new thread or respond to an existing one. Since we encounter search and recommendation every day, there are ample opportunities to connect what we talk about in class to new research results, new features on existing platforms, challenges facing industry, ethical considerations, etc. Towards your participation grade, the final day to post to the discussion group is April 22. (Of course you are welcome to continue to post afterwards, but these posts will not count toward your participation grade.) Also note that your project-related posts do not count towards this participation score (e.g., po

Searching for the "strike" tag brings up the sections of the document that have been struck out. Judging by the amount of struck-out text, the page has undergone some significant changes. Let's continue exploring by navigating around the document using Beautiful Soup commands. 

In [10]:
# We will start by looking at the first strike section in the document.
print("Statement: " + str(soup.strike))
print()

# We can print out the contents of this section (the actual text that
# was struck out) using the following command
print("Contents: " + str(soup.strike.contents))
print()

# We can look at the parent of the current statement (that is the
# statement within which it is contained) using the following command
print("Parent: " + str(soup.strike.parent))
print()

# We can continue to go up the ranks of the document tree by adding
# more modifier statements as below.
print("Grandparent: " + str(soup.strike.parent.parent))
print()

# We can also navigate downwards in the document tree and look at the
# children of the examined section (that is to say its contents)
for child in soup.strike.children:
    print("Child: " + str(child))


Statement: <strike>in ZACH 310</strike>

Contents: ['in ZACH 310']

Parent: <dt>MWF 11:30am-12:20pm <strike>in ZACH 310</strike> online</dt>

Grandparent: <body alink="blue" bgcolor="white" link="blue" text="black" vlink="blue">


<h1> CSCE 670 :: Information Storage and Retrieval :: Spring 2020 </h1> 

<dt>MWF 11:30am-12:20pm <strike>in ZACH 310</strike> online</dt>

<br/>
<dt> Instructor: <a href="http://faculty.cse.tamu.edu/caverlee/">James Caverlee</a>,
HRBB 403 </dt><dt> Office Hours: 3-4pm on Monday and Tuesday, or by appointment 
</dt><dt>
Department of <a href="http://www.cse.tamu.edu">Computer Science and
Engineering</a> </dt><dt> <a href="http://www.tamu.edu">Texas A&amp;M
University</a>
</dt><dt><br/>
</dt><dt> TA: <a href="http://people.tamu.edu/~yunhe/">Yun He</a>, HRBB 408D
</dt><dt> Office Hours: 4-5pm on Thursday and Friday, or by appointment

</dt><dt>

<p><a href="schedule.html">Course Schedule</a> :: <a href="spotlight.html">Spotlight</a> :: <a href="project.html">Pr

As we can see by using some basic navigation, the statement we've been examining is relatively close to the beginning of the page and contains a short phrase. This type of navigation can be leveraged to explore complicated and lengthy documents as well as extract necessary information out of them with relative ease!
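
Beyond parent and children, Beautiful Soup can also walk sideways and all the way up the tree. The following sketch is an illustrative addition that uses a one-line stand-in for the course page and the built-in "html.parser". It shows `parents`, `previous_sibling`, `next_sibling`, and `find_parent`.

```python
from bs4 import BeautifulSoup

# One-line stand-in for the relevant part of the course page
page = ("<html><body><dt>MWF 11:30am-12:20pm "
        "<strike>in ZACH 310</strike> online</dt></body></html>")
nav_soup = BeautifulSoup(page, "html.parser")
strike = nav_soup.strike

# Every ancestor of the tag, from the closest one outward
ancestor_names = [parent.name for parent in strike.parents]

# The nodes immediately before and after the tag at the same level
text_before = strike.previous_sibling
text_after = strike.next_sibling

# The nearest enclosing tag with a given name
enclosing_body = strike.find_parent("body")

print(ancestor_names)
print(repr(text_before), repr(text_after))
print(enclosing_body.name)
```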

Next, we will go over how to modify parts of the document parse tree. There are many ways to do so, but I will present some of the more popular methods here. We will continue using the strike example. 

In [11]:
print("Statement: " + str(soup.strike))

# We can modify the string contained within the statement as follows
soup.strike.string = "in ZACH 100"
print("Statement after modification: " + str(soup.strike))

# We can add to the end of the statement using the following code
soup.strike.append(" or anywhere else for that matter!")
print("Statement after addition: " + str(soup.strike))

# We can also delete the text contained within the statement this way
soup.strike.clear()
print("Statement after clearing: " + str(soup.strike))

# Puts some text back so the tag is not left empty
soup.strike.string = "in ZACH 300"

Statement: <strike>in ZACH 310</strike>
Statement after modification: <strike>in ZACH 100</strike>
Statement after addition: <strike>in ZACH 100 or anywhere else for that matter!</strike>
Statement after clearing: <strike></strike>
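
Tags themselves can be swapped out or removed as well, not just their text. This is a small sketch, assuming a one-line stand-in for the course page and the built-in "html.parser". It uses `new_tag`, `replace_with`, and `unwrap`, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

edit_soup = BeautifulSoup(
    "<dt>MWF 11:30am-12:20pm <strike>in ZACH 310</strike></dt>", "html.parser"
)

# Build a brand new <b> tag and swap it in for the <strike> tag
replacement = edit_soup.new_tag("b")
replacement.string = "online"
edit_soup.strike.replace_with(replacement)
after_replace = str(edit_soup)

# unwrap() removes the tag itself but keeps the text it contained
edit_soup.b.unwrap()
after_unwrap = str(edit_soup)

print(after_replace)
print(after_unwrap)
```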


Note that these changes only affect the local parse tree; nothing is sent back to the server. With the addition of other aspects of the Requests library (which are outside the scope of this spotlight), it is possible to write a script capable of automatically updating a given website or document. The possibilities with Beautiful Soup are truly endless! 

Now that we have a grasp on the basic workings of Beautiful Soup, let's put it all together to demonstrate how you could create a real web crawler from scratch using Dr. Caverlee's course webpage as a starting point. 

In [12]:
link_list = []

# find_all('a') returns every <a> Tag object on the webpage
# (these are the anchor tags that hold hyperlinks)
for link in soup.find_all('a'):
    
    # Uses a regular expression to check that the tag's href value
    # starts with "http" in order to separate absolute web addresses
    # (those that spell out the full URL) from relative web addresses
    # (those given relative to the current page, such as "schedule.html")
    if re.findall("^http", str(link.get('href'))):
        
        # Adds only absolute addresses to a list for future processing
        link_list.append(str(link.get('href')))
        
print(link_list)

['http://faculty.cse.tamu.edu/caverlee/', 'http://www.cse.tamu.edu', 'http://www.tamu.edu', 'http://people.tamu.edu/~yunhe/', 'http://nlp.stanford.edu/IR-book/information-retrieval-book.html', 'https://en.wikipedia.org/wiki/Gerard_Salton', 'https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/markov-2018-what.pdf', 'https://piazza.com/tamu/spring2020/csce670/home', 'http://nlp.stanford.edu/IR-book/information-retrieval-book.html', 'http://www.mmds.org', 'http://ciir.cs.umass.edu/irbook/', 'http://lintool.github.io/MapReduceAlgorithms/ed1n.html', 'http://www.cs.cornell.edu/home/kleinber/networks%2Dbook/', 'http://theory.stanford.edu/~aiken/moss/', 'http://theory.stanford.edu/~aiken/moss/', 'http://disability.tamu.edu', 'http://aggiehonor.tamu.edu']
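
The filter above discards relative links such as schedule.html and keeps duplicates. If those relative pages should be crawled too, one option is to resolve every href against the page's address with `urljoin` from the standard library and then remove repeats. The sketch below is illustrative: it works on a small made-up excerpt rather than the live page, and the variable names are assumptions.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Address of the page the links were scraped from
base_url = "http://courses.cse.tamu.edu/caverlee/csce670/"

# Small made-up excerpt standing in for the downloaded page
sample_page = """
<p><a href="schedule.html">Course Schedule</a>
<a href="http://www.tamu.edu">Texas A&amp;M University</a>
<a href="http://www.tamu.edu">Duplicate link</a>
<a>No href here</a></p>
"""
link_soup = BeautifulSoup(sample_page, "html.parser")

# href=True skips <a> tags that have no href attribute at all
resolved = []
for anchor in link_soup.find_all("a", href=True):
    # urljoin leaves absolute addresses alone and completes relative ones
    resolved.append(urljoin(base_url, anchor["href"]))

# dict.fromkeys removes duplicates while preserving the original order
unique_links = list(dict.fromkeys(resolved))
print(unique_links)
```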


We have now created a list of absolute links from the information scraped from Dr. Caverlee's webpage. Now we can take each of these links and collect information from them.

In [13]:
link_dictionary = {}

# Iterates over all scraped links, makes each absolute address
# a key in a dictionary, and stores the prettified markup scraped
# from that address as the value for that key
for link in link_list:
    r_o = requests.get(link)
    r_o_t = r_o.text
    b_s = BeautifulSoup(r_o_t, "html5lib")
    link_dictionary[link] = b_s.prettify()
    
# Prints only the first link as a confirmation of functionality
# I tried printing the entire dictionary in several different 
# formats but Jupyter Notebooks was unable to handle the output
# without crashing.
print(link_dictionary[link_list[0]])

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
  <meta content="" name="description"/>
  <meta content="" name="author"/>
  <title>
   James Caverlee :: Texas A&amp;M :: Computer Science and Engineering
  </title>
  <!-- Bootstrap core CSS -->
  <link href="css/bootstrap.min.css" rel="stylesheet"/>
  <style>
   body {
        padding-top: 50px; /* 60px to make the container go all the way to the bottom of the topbar */
      }
      h3 {
padding-top:60px;
margin-top:-60px;
}
  </style>
  <link href="css/bootstrap-responsive.css" rel="stylesheet"/>
 </head>
 <body>
  <nav class="navbar navbar-inverse navbar-fixed-top">
   <div class="container">
    <div class="navbar-header">
     <button aria-controls="navbar" aria-expanded=

We have now taken each scraped link and populated a dictionary where the key is the absolute link's address and the value is all of the scraped data from that page. With this new information, we can continue to explore each respective web page and gather whatever data a given project requires! 
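
The crawl loop above assumes every request succeeds. On the open web some links will time out or return errors, so a slightly more defensive variant can be useful. The following is a sketch under stated assumptions, not the original notebook's method: the function name `scrape_links`, the 10-second timeout, and the `http_get` parameter (which exists only so the function can be tried without network access) are all illustrative choices, and "html.parser" is used instead of "html5lib" to avoid the extra dependency.

```python
import requests
from bs4 import BeautifulSoup


def scrape_links(links, timeout=10, http_get=requests.get):
    """Fetch each link and return (pages, failures) dictionaries."""
    pages = {}
    failures = {}
    for link in links:
        try:
            # A timeout keeps one slow server from stalling the whole crawl
            response = http_get(link, timeout=timeout)
            # Turns HTTP error codes (404, 500, ...) into exceptions
            response.raise_for_status()
        except requests.RequestException as error:
            # Record the problem and move on to the next link
            failures[link] = str(error)
            continue
        pages[link] = BeautifulSoup(response.text, "html.parser").prettify()
    return pages, failures


# Example usage (requires network access):
# pages, failures = scrape_links(link_list)
```

Keeping the failures in their own dictionary makes it easy to retry or report the links that could not be fetched.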

We've covered several different applications of the Beautiful Soup library (web crawling and web scraping, as well as scripting updates to online pages), but there are many others. For example, many modern large datasets (at least in the NLP realm) are stored in some variant of XML, and Beautiful Soup is an easy way to extract the important information from them. The simple-to-use structure and powerful methods provided by the Beautiful Soup library make it an ideal tool for many different applications, and I hope that this spotlight has provided some insight and inspiration regarding its use. 