## SCRAPING HTML
HTML ("Hypertext Markup Language") is the language of the World Wide Web, the part of the Internet you see in a browser. It is a very simple language that uses containers (like `<html></html>`) to tell a browser what to display and how to display it.

A note on browsers: from now on you should be using Chrome. If you do not have Chrome installed on your computer, install it now before you go any further. Chrome's developer tools are by far the best and most reliable.

This is how the Internet works (in a simplified way): you go to a page by typing a URL into the browser. The browser sends the URL as an HTTP request for a file on a server. The file that arrives at your browser is an HTML file--the browser reads the HTML and displays what is supposed to be displayed, and also runs some scripts in the background. Most often, the page you see in your browser is an HTML page. (There are many exceptions, like direct PDF files, as well as data accessed via APIs.)

HTML is the raw text source code of what you see in a browser. In Chrome you can view the raw HTML by either going to the menu bar and choosing--View: Developer: View Source -- or right-clicking (control-clicking) the mouse on a page and choosing View Source. Like so:

<img src="http://floatingmedia.com/columbia/viewsource.png">

Don't panic! While HTML can be very disorienting at first look, there are more targeted and helpful ways to investigate it. The best one is Chrome's "inspect" function. Right-click (or control-click) on the part of the page that interests you, and select "Inspect" or "Inspect Element"--and you get the much friendlier developer tools way of navigating through the DOM tree:

<img src="http://floatingmedia.com/columbia/inspect.png">

Did I say DOM tree? Yes, the DOM ([Document Object Model](https://www.w3schools.com/js/js_htmldom.asp)) is a term for the hierarchical structure of HTML elements on a page. It is a tree because every element is nested inside another element, all the way up to the single `<html>` element at the root.

<img src="http://floatingmedia.com/columbia/treeStructure.png">
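To make the nesting concrete, here is a minimal sketch that prints one indented line per opening tag, using only Python's built-in `html.parser` module. The `SAMPLE_HTML` string and the `TreePrinter` class are made up for this illustration, and the sketch assumes every tag in the input is properly closed (an unclosed tag such as `<img>` would throw off the indentation).

```python
from html.parser import HTMLParser

# A made-up, well-formed snippet; not the real page used later on.
SAMPLE_HTML = (
    "<html><body><div><h1>Top Five Movies</h1>"
    "<p><b>The Avengers</b> <span>2010</span></p></div></body></html>"
)


class TreePrinter(HTMLParser):
    """Collect one indented line per opening tag to show the nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many open tags we are currently inside
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # everything until the closing tag is one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
tree_lines = printer.lines
print("\n".join(tree_lines))
```

Each level of indentation is one step further down the tree: `b` and `span` are children of `p`, which is a child of `div`, and so on up to `html`.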

Here are the most common tags, which are often also the most helpful ones when navigating through an HTML page.

`<h1>`, `<h2>`, `<h3>` headers
`<p>` paragraph
`<b>`, `<i>`, `<strong>` styles, like bold, italics...
`<table><tr><td>` table elements including rows and cells
`<a href="url">` links
`<div>`, `<span>` generic containers; these often have an id="name" and/or class="name" attribute attached to them.
`<ol>`,`<ul>`,`<li>` ordered and unordered lists

For example: `<p>This would be a paragraph</p>`
`<p>This would be a <b>paragraph</b></p>` Same thing but the word paragraph is bold

Sometimes important information is hidden inside these tags:


`<span class="year">`2010`</span>`

or 

`<a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">`more info`</a>`

In this case the "class" attribute is likely adding styling information (see CSS), whereas the "href" attribute holds a hyperlinked URL. Strictly speaking, `span` and `a` are the tags; `class` and `href` are attributes of those tags. [For a more complete list of HTML tags click here](https://www.w3schools.com/tags/ref_byfunc.asp)


**Why does this matter to us?**

Note that each tag begins with `<tagname>` and ends with `</tagname>`, so these HTML tags give the text a structure. The reason for the structure is just to tell the browser how everything should look, but we can also use it to programmatically traverse the data on a webpage and scrape out the information we need. This is what scraping is. HTML is not a reliable data structure, but it is often consistent from page to page on a particular website. If you can learn how to navigate the DOM tree, you can turn information on a messy webpage into reliable and searchable data.


## This brings us to Beautiful Soup
Beautiful Soup is a Python library that parses HTML, allowing us to navigate through the elements of a webpage using the HTML tags embedded in it. [Here is the link to the documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), which has examples and extensions beyond what is demonstrated below.

Now it's time to install Beautiful Soup. Go to your terminal/shell/bash and type:

`pip3 install beautifulsoup4`

We will begin by navigating a very simple HTML page I have posted on my website. [Please follow this link](http://floatingmedia.com/columbia/topfivelists.html) and try inspecting the HTML using Chrome. (P.S. The information on this page comes from [Box Office Mojo](http://www.boxofficemojo.com/genres/chart/?id=comicbookadaptation.htm).)

# Cooking the soup

Import `urlopen` from urllib and `BeautifulSoup` from bs4, then make the HTTP request

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
raw_html = urlopen("http://floatingmedia.com/columbia/topfivelists.html").read()
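
Requests can fail (a typo in the URL, no network, a page that has moved), and a bare `urlopen()` call then stops the notebook with a traceback. Here is a minimal sketch of a more forgiving version. `fetch_html` is a hypothetical helper name, not part of urllib, and the 10-second timeout is an arbitrary choice.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes found at `url`, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        print(f"HTTP error {err.code} for {url}")
    except URLError as err:
        # No usable answer at all: bad hostname, no network, missing file...
        print(f"Could not reach {url}: {err.reason}")
    return None
```

You would call it as `raw_html = fetch_html("http://floatingmedia.com/columbia/topfivelists.html")` and check that the result is not `None` before handing it to Beautiful Soup.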

Critical step: make the "soup" (=structured html/dom-style tree) out of raw text

In [2]:
soup_doc = BeautifulSoup(raw_html, "html.parser")
print(type(soup_doc)) #the function returns a "BeautifulSoup" object

<class 'bs4.BeautifulSoup'>
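
`BeautifulSoup()` does not care where the HTML comes from: it accepts any string (or bytes), so you can practice without an Internet connection. A small sketch, where `sample_html` is a made-up string loosely modelled on the page we just downloaded:

```python
from bs4 import BeautifulSoup

# A made-up mini page for offline experiments.
sample_html = """
<html>
 <head><title>tiny test page</title></head>
 <body>
  <div>
   <h1>Top Five Movies</h1>
   <p><b>The Avengers</b> <span class="year">2010</span></p>
  </div>
 </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
print(type(sample_soup))         # same kind of object as soup_doc
print(sample_soup.title.string)  # text inside <title>
print(sample_soup.h1.string)     # text inside the first <h1>
```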


In [3]:
print(soup_doc.prettify()) #to display the soup in tree-style notation, use prettify()
#print(soup_doc)

<!DOCTYPE html>
<html>
 <head>
  <title>
   best and worst comic book box office
  </title>
  <style>
   .year {color: #DD0000}
#favorite {background-color: #FFFFDD;
  </style>
 </head>
 <body>
  <p>
   comic book box office
  </p>
  <div>
   <h1>
    Top Five Movies
   </h1>
   <p>
    <b>
     The Avengers
    </b>
    <span class="year">
     2010
    </span>
    <a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">
     more info
    </a>
   </p>
   <p>
    <b>
     The Dark Knight
    </b>
    <span class="year">
     2008
    </span>
   </p>
   <p>
    <b>
     Avengers: Age Of Ultron
    </b>
    <span class="year">
     2015
    </span>
   </p>
   <p>
    <b>
     The Dark Knight Rises
    </b>
    <span class="year">
     2012
    </span>
   </p>
   <p>
    <b>
     Iron Man 3
    </b>
    <span class="year">
     2013
    </span>
   </p>
  </div>
  <div>
   <h1>
    Middle Five Movies 47-51
   </h1>
   <p>
    <b>
     The Incredible Hulk
    </b>
    <span class=

# Navigate through the soup

In [4]:
soup_doc.title #use dot notation to jump to the first tag with that name inside the soup

<title> best and worst comic book box office</title>

In [5]:
type(soup_doc.title) #this is an object of the type "tag"

bs4.element.Tag

In [6]:
soup_doc.title.string # .string returns the text inside a tag, provided the tag contains nothing but that one piece of text

' best and worst comic book box office'
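
One thing to watch out for: `.string` only works when a tag holds a single piece of text. If the tag has several children, `.string` is `None`, and `get_text()` is the usual way to collect all the text inside. A short sketch on a made-up one-line snippet:

```python
from bs4 import BeautifulSoup

# Made-up snippet in the style of the movie page.
snippet = '<p><b>The Avengers</b> <span class="year">2010</span></p>'
p_tag = BeautifulSoup(snippet, "html.parser").p

bold_text = p_tag.b.string   # <b> contains only text, so .string works
para_text = p_tag.string     # <p> has several children, so this is None
all_text = p_tag.get_text()  # joins every piece of text inside <p>

print(bold_text)
print(para_text)
print(all_text)
```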

In [7]:
soup_doc.title.name #the .name attribute returns the name of the tag

'title'

In [8]:
soup_doc.div #also, just giving a general tag returns just the first instance

<div>
<h1> Top Five Movies</h1>
<p><b> The Avengers</b> <span class="year">2010</span> <a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">more info</a></p>
<p><b> The Dark Knight</b> <span class="year">2008</span></p>
<p><b> Avengers: Age Of Ultron</b> <span class="year">2015</span></p>
<p><b> The Dark Knight Rises</b> <span class="year">2012</span></p>
<p><b> Iron Man 3</b> <span class="year">2013</span></p>
</div>

In [34]:
my_span = soup_doc.find_all('span')[0]
my_span['class'] #put an attribute name in square brackets to get its value ("class" comes back as a list, because a tag can have several classes)

['year']
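
Square brackets raise a `KeyError` when the attribute is missing, so for attributes that may or may not be there, `.get()` is the safer choice. The sketch below uses a made-up snippet with one `<span>` and one `<a>` tag to show both, plus `.attrs`, which holds all attributes of a tag as a dictionary.

```python
from bs4 import BeautifulSoup

# Made-up snippet combining the two tags discussed earlier.
snippet = (
    '<p><span class="year">2010</span> '
    '<a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">more info</a></p>'
)
soup = BeautifulSoup(snippet, "html.parser")

span_class = soup.span["class"]       # multi-valued attribute: a list
link_href = soup.a["href"]            # ordinary attribute: a string
missing_href = soup.span.get("href")  # .get() returns None instead of raising KeyError
span_attrs = soup.span.attrs          # every attribute of the tag, as a dict

print(span_class, link_href, missing_href, span_attrs)
```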

# Parents, Children, Siblings

In [19]:
soup_doc.div.h1.parent #the parent element (here: the div containing h1)

<div>
<h1> Top Five Movies</h1>
<p><b> The Avengers</b> <span class="year">2010</span> <a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">more info</a></p>
<p><b> The Dark Knight</b> <span class="year">2008</span></p>
<p><b> Avengers: Age Of Ultron</b> <span class="year">2015</span></p>
<p><b> The Dark Knight Rises</b> <span class="year">2012</span></p>
<p><b> Iron Man 3</b> <span class="year">2013</span></p>
</div>

In [20]:
soup_doc.div.children #returns an iterator over all direct children (tags and the text between them)

<list_iterator at 0x61e9550>

In [21]:
for child in soup_doc.div.children: #can loop through the iterator
    print (child)



<h1> Top Five Movies</h1>


<p><b> The Avengers</b> <span class="year">2010</span> <a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">more info</a></p>


<p><b> The Dark Knight</b> <span class="year">2008</span></p>


<p><b> Avengers: Age Of Ultron</b> <span class="year">2015</span></p>


<p><b> The Dark Knight Rises</b> <span class="year">2012</span></p>


<p><b> Iron Man 3</b> <span class="year">2013</span></p>




In [23]:
soup_doc.div.contents #like .children, but returns an actual list

['\n',
 <h1> Top Five Movies</h1>,
 '\n',
 <p><b> The Avengers</b> <span class="year">2010</span> <a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">more info</a></p>,
 '\n',
 <p><b> The Dark Knight</b> <span class="year">2008</span></p>,
 '\n',
 <p><b> Avengers: Age Of Ultron</b> <span class="year">2015</span></p>,
 '\n',
 <p><b> The Dark Knight Rises</b> <span class="year">2012</span></p>,
 '\n',
 <p><b> Iron Man 3</b> <span class="year">2013</span></p>,
 '\n']

In [24]:
soup_doc.div.contents[1] #here, we can actually index a particular child

<h1> Top Five Movies</h1>
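
Notice that `.children` and `.contents` include the `'\n'` line breaks between the tags, which is why the `<h1>` sits at index 1 rather than 0. If you only want the tags, you can filter the text out. A sketch of two ways to do it, on a made-up snippet:

```python
from bs4 import BeautifulSoup, Tag

# Made-up snippet with line breaks between the tags, like the real page.
snippet = """<div>
<h1>Top Five Movies</h1>
<p><b>The Avengers</b></p>
<p><b>The Dark Knight</b></p>
</div>"""
div = BeautifulSoup(snippet, "html.parser").div

# Option 1: keep only children that are tags, dropping the '\n' strings.
tag_children = [child for child in div.children if isinstance(child, Tag)]

# Option 2: find_all() with recursive=False looks at direct children only.
direct_tags = div.find_all(recursive=False)

child_names = [child.name for child in tag_children]
print(len(div.contents), child_names)
```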

In [9]:
soup_doc.div.previous_sibling #what's to the left of the tag (here: it's just a new line)

'\n'

In [10]:
soup_doc.div.next_sibling #what's on the right of the current tag

'\n'
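
The `'\n'` results are the line breaks between tags in the HTML source; Beautiful Soup treats that text as a sibling too. To jump straight to the next tag, `find_next_sibling()` skips over the text in between. A minimal sketch on a made-up two-div snippet:

```python
from bs4 import BeautifulSoup

# Made-up snippet: two sibling divs separated by a line break.
snippet = """<body>
<div><h1>Top Five Movies</h1></div>
<div><h1>Middle Five Movies 47-51</h1></div>
</body>"""
first_div = BeautifulSoup(snippet, "html.parser").div

raw_next = first_div.next_sibling               # the line break after the first div
next_div = first_div.find_next_sibling("div")   # the next <div>, skipping the text
no_previous = first_div.find_previous_sibling("div")  # None: there is no earlier div

print(repr(raw_next))
print(next_div.h1.string)
print(no_previous)
```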

In [13]:
for sibling in soup_doc.div.next_siblings: #can loop through all following siblings (.next_siblings is a generator)
    print (sibling)



<div>
<h1> Middle Five Movies 47-51</h1>
<p><b> The Incredible Hulk</b> <span class="year">2008</span> <a href="http://www.boxofficemojo.com/movies/?id=incrediblehulk.htm">more info</a></p>
<p><b> Wanted</b> <span class="year">2008</span></p>
<p id="favorite"><b> Superman</b> <span class="year">1978</span></p>
<p><b> The Wolverine</b> <span class="year">2013</span></p>
<p><b> Hulk</b> <span class="year">2003</span></p>
</div>


<div>
<h1> Bottom Five Movies</h1>
<ul>
<li> The Rocketeer <span class="year">1991</span></li>
<li> Timecop <span class="year">1994</span></li>
<li> Teenage Mutant Ninja Turtles III <span class="year">1993</span></li>
<li> Ghost In The Shell <span class="year">2017</span></li>
<li> Catwoman <span class="year">2004</span></li>
</ul>
<h2> Most Terrible 137-141</h2>
<p>Snowpiercer</p>
<p>Tank Girl</p>
<p>Barb Wire</p>
<p>Batman: kilpng joke</p>
<p>Blue is the warmest color</p>
<h3>that's all</h3>
</div>




# Finding stuff

In [None]:
soup_doc.find('p') #the find()-method just returns the first instance 

In [None]:
soup_doc.find_all('p') #find_all() returns a list of all the tag instances

In [None]:
soup_doc.find_all('p')[2] #we can treat the find_all()-results just like a list

In [None]:
soup_doc.find('div') #first instance of "div"

In [None]:
soup_doc.find('div').find_all('p') #call find_all() not on whole doc but only on part of it

In [None]:
soup_doc.find('div').find_all('p')[2] #third instance of the result (index 2). "Find"-style results are generally indexable
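
Since the cells above are shown without output, here is a self-contained sketch you can run to see what each call gives back. The `snippet` is a shortened, made-up version of the movie page, so the counts differ from the real one.

```python
from bs4 import BeautifulSoup

# Shortened, made-up version of the movie page.
snippet = """
<div>
<h1>Top Five Movies</h1>
<p><b>The Avengers</b> <span class="year">2010</span></p>
<p><b>The Dark Knight</b> <span class="year">2008</span></p>
<p><b>Avengers: Age Of Ultron</b> <span class="year">2015</span></p>
</div>
<div>
<h1>Middle Five Movies 47-51</h1>
<p><b>The Incredible Hulk</b> <span class="year">2008</span></p>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

first_p = soup.find("p")                      # one tag: the first <p> anywhere
all_p = soup.find_all("p")                    # every <p> in the document
first_div_p = soup.find("div").find_all("p")  # only the <p> tags inside the first <div>
third_title = first_div_p[2].b.string         # index 2 is the third result

print(first_p.b.string, len(all_p), len(first_div_p), third_title)
```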

In [17]:
soup_doc.find('div').find_next_siblings('div')[0] #find next sibling of the type "div", indexable list

<div>
<h1> Middle Five Movies 47-51</h1>
<p><b> The Incredible Hulk</b> <span class="year">2008</span> <a href="http://www.boxofficemojo.com/movies/?id=incrediblehulk.htm">more info</a></p>
<p><b> Wanted</b> <span class="year">2008</span></p>
<p id="favorite"><b> Superman</b> <span class="year">1978</span></p>
<p><b> The Wolverine</b> <span class="year">2013</span></p>
<p><b> Hulk</b> <span class="year">2003</span></p>
</div>

In [26]:
all_years = soup_doc.find_all(class_="year") #find all tags with "class" attribute set to "year". 
print(all_years) # watch out for the _ when searching for "class"!

[<span class="year">2010</span>, <span class="year">2008</span>, <span class="year">2015</span>, <span class="year">2012</span>, <span class="year">2013</span>, <span class="year">2008</span>, <span class="year">2008</span>, <span class="year">1978</span>, <span class="year">2013</span>, <span class="year">2003</span>, <span class="year">1991</span>, <span class="year">1994</span>, <span class="year">1993</span>, <span class="year">2017</span>, <span class="year">2004</span>]
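
The result is still a list of tags. To get usable data out of it, pull the text from each tag and convert it. The sketch below does this on a made-up three-movie snippet, assuming (as on this page) that every year is a plain four-digit number and every `<p>` holds one `<b>` and one `class="year"` span.

```python
from bs4 import BeautifulSoup

# Made-up snippet in the same shape as the movie page.
snippet = """
<p><b>The Avengers</b> <span class="year">2010</span></p>
<p><b>The Dark Knight</b> <span class="year">2008</span></p>
<p id="favorite"><b>Superman</b> <span class="year">1978</span></p>
"""
soup = BeautifulSoup(snippet, "html.parser")

# Turn each <span class="year"> into an integer.
years = [int(span.string) for span in soup.find_all(class_="year")]

# Pair every title with its year, one (title, year) tuple per <p>.
movies = [
    (p.b.get_text(strip=True), int(p.find(class_="year").string))
    for p in soup.find_all("p")
]

print(years)
print(movies)
```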


In [27]:
fav = soup_doc.find_all(id='favorite') #returns a list of tags where id="favorite"
print(fav)

[<p id="favorite"><b> Superman</b> <span class="year">1978</span></p>]


In [38]:
fav = soup_doc.find_all('p', id='favorite') #can use multiple arguments in find_all()
#fav = soup_doc.find_all('p', attrs={'id': 'favorite'}) #a more complex way
print(fav)

[<p id="favorite"><b> Superman</b> <span class="year">1978</span></p>]


# Error Checking

In [47]:
if fav[0].i is None:
    print ("No <i>-Tag in this element.")

No <i>-Tag in this element.


In [None]:
if fav[0].i:
    print ("There is an <i>-Tag in this element")
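
This kind of check matters most inside loops, where some elements have a tag and others do not. On our movie page, for example, only some paragraphs contain a "more info" link. A sketch of how that could be handled, on a made-up two-paragraph snippet:

```python
from bs4 import BeautifulSoup

# Made-up snippet: the first movie has a link, the second does not.
snippet = """
<p><b>The Avengers</b> <span class="year">2010</span> <a href="http://www.boxofficemojo.com/movies/?id=avengers11.htm">more info</a></p>
<p><b>The Dark Knight</b> <span class="year">2008</span></p>
"""
soup = BeautifulSoup(snippet, "html.parser")

links = {}
for p in soup.find_all("p"):
    title = p.b.get_text(strip=True)
    link = p.find("a")  # None when the paragraph has no <a> tag
    # Without this check, link["href"] would raise a TypeError on None.
    links[title] = link["href"] if link is not None else None

print(links)
```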

# Example: Avengers

In [25]:
raw_html2 = urlopen("http://www.boxofficemojo.com/movies/?id=avengers11.htm").read()
soup_doc2 = BeautifulSoup(raw_html2, "html.parser")
print(soup_doc2.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
  <title>
   Marvel's The Avengers (2012) - Box Office Mojo
  </title>
  <style type="text/css">
   table.chart-wide { width: 100%; }
  </style>
  <meta content="marvel's the avengers, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, buena vista, theatrical summary, theatrical, daily box office results, weekend box office results, weekly box office, weekly box office, international box office summary, worldwide box office summary, similar movies, image gallery, images, pictures, photos, box office mojo" name="keywords"/>
  <meta content="Marvel's The Avengers summary 

In [29]:
my_table = soup_doc2.find("table", attrs={"bgcolor": "#dcdcdc"}) #find the particular table where the basic info is stored, using its bgcolor attribute
print(my_table)

<table bgcolor="#dcdcdc" border="0" cellpadding="4" cellspacing="1" width="95%"><tr bgcolor="#ffffff"><td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$623,357,910</b></font></td></tr><tr bgcolor="#ffffff"><td valign="top">Distributor: <b><a href="/studio/chart/?studio=buenavista.htm">Buena Vista</a></b></td><td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&amp;release=theatrical&amp;date=2012-05-04&amp;p=.htm">May 4, 2012</a></nobr></b></td></tr><tr bgcolor="#ffffff"><td valign="top">Genre: <b>Action / Adventure</b></td><td valign="top">Runtime: <b>2 hrs. 22 min.</b></td></tr><tr bgcolor="#ffffff"><td valign="top">MPAA Rating: <b>PG-13</b></td><td valign="top">Production Budget: <b>$220 million</b></td></tr></table>


In [30]:
each_entry = my_table.find_all('td')
each_entry

[<td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$623,357,910</b></font></td>,
 <td valign="top">Distributor: <b><a href="/studio/chart/?studio=buenavista.htm">Buena Vista</a></b></td>,
 <td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&amp;release=theatrical&amp;date=2012-05-04&amp;p=.htm">May 4, 2012</a></nobr></b></td>,
 <td valign="top">Genre: <b>Action / Adventure</b></td>,
 <td valign="top">Runtime: <b>2 hrs. 22 min.</b></td>,
 <td valign="top">MPAA Rating: <b>PG-13</b></td>,
 <td valign="top">Production Budget: <b>$220 million</b></td>]

In [31]:
print(each_entry[0])

<td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$623,357,910</b></font></td>


In [32]:
for entry in each_entry:
    the_data = entry.find('b')
    the_category = the_data.previous_sibling
    print(the_data.string)
    print(the_category)

$623,357,910
Domestic Total Gross: 
Buena Vista
Distributor: 
May 4, 2012
Release Date: 
Action / Adventure
Genre: 
2 hrs. 22 min.
Runtime: 
PG-13
MPAA Rating: 
$220 million
Production Budget: 


In [33]:
avengers_dict = {}
for entry in each_entry:
    the_data = entry.find('b') #the value sits in the <b> tag
    the_category = the_data.previous_sibling #the label is the text right before it
    data_string = the_data.string
    the_category = the_category[:-2].replace(' ','') #drop the trailing ': ' and remove the spaces
    avengers_dict[the_category] = data_string
avengers_dict

{'Distributor': 'Buena Vista',
 'DomesticTotalGross': '$623,357,910',
 'Genre': 'Action / Adventure',
 'MPAARating': 'PG-13',
 'ProductionBudget': '$220 million',
 'ReleaseDate': 'May 4, 2012',
 'Runtime': '2 hrs. 22 min.'}

In [35]:
avengers_dict['Genre']

'Action / Adventure'
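
The same steps can be wrapped in a function so they can be reused on other movie pages with the same layout. The sketch below is one possible way to do that. `table_to_dict` is a hypothetical helper name, `table_html` is a trimmed copy of the table above, and the last row (a cell without a `<b>` tag) is invented to show the error check at work.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the info table; the final row is invented for the demo.
table_html = (
    '<table bgcolor="#dcdcdc">'
    '<tr><td><font size="4">Domestic Total Gross: <b>$623,357,910</b></font></td></tr>'
    '<tr><td>Distributor: <b><a href="/studio/chart/?studio=buenavista.htm">Buena Vista</a></b></td>'
    '<td>Genre: <b>Action / Adventure</b></td></tr>'
    '<tr><td>Cell without a bold value</td></tr>'
    '</table>'
)


def table_to_dict(table):
    """Map each 'Label: <b>value</b>' cell of `table` to a dict entry."""
    info = {}
    for cell in table.find_all("td"):
        value_tag = cell.find("b")
        if value_tag is None:      # cell has no <b> tag: nothing to extract
            continue
        label = value_tag.previous_sibling
        if label is None:          # <b> with no label text in front of it
            continue
        # "Domestic Total Gross: " -> "DomesticTotalGross"
        key = str(label).strip().rstrip(":").replace(" ", "")
        info[key] = value_tag.get_text(strip=True)
    return info


table = BeautifulSoup(table_html, "html.parser").find("table", attrs={"bgcolor": "#dcdcdc"})
movie_info = table_to_dict(table)

# The values are still strings; strip "$" and "," to get a number.
gross = int(movie_info["DomesticTotalGross"].replace("$", "").replace(",", ""))

print(movie_info)
print(gross)
```

Keeping the values as strings in the dictionary and converting them afterwards means one odd value (like "$220 million") does not break the whole scrape.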