In [2]:
# Python 2 & 3 Compatibility
from __future__ import print_function, division

import ipywidgets

# Chapter 1:  HTML refresher and GET requests

Before we can scrape HTML pages, we need to learn a little bit about the DOM (Document Object Model).

In [3]:
url = "https://www.computerhope.com/jargon/d/dom1.jpg"
iframe = '<iframe src=' + url + ' width=700 height=400></iframe>'
ipywidgets.HTML(iframe)

HTML(value='<iframe src=https://www.computerhope.com/jargon/d/dom1.jpg width=700 height=400></iframe>')

## First, an HTML refresher
HTML is the basic language used to create a web page. 

It tells the web browser what text/media to display, where to display it, and how to display it (style).

HTML is very structured/hierarchical. 

Every page is made up of discrete **"elements."**

Elements are labeled with "tags."

For example:

    <p>You are beginning to learn HTML.</p>

A start tag also often contains "attributes" with info about the element.

Attributes usually have a name and value.

Example:

    <p class="my_red_sentences">You are beginning to learn HTML.</p>
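
To see the tag, attribute name, and attribute value as separate pieces, here is a minimal sketch using `html.parser` from the Python 3 standard library. The class name `TagCollector` is made up for this illustration, and this is not the parsing approach the rest of the tutorial uses.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: record every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, {attribute_name: value})

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


snippet = '<p class="my_red_sentences">You are beginning to learn HTML.</p>'
parser = TagCollector()
parser.feed(snippet)
print(parser.tags)  # [('p', {'class': 'my_red_sentences'})]
```

The element is `p`, the attribute name is `class`, and its value is `my_red_sentences`.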

We can make a table in HTML: we use the ```<tr>``` tag for each table row and the ```<td>``` tag for each cell in that row (```<th>``` marks a header cell).

```
<table id="mycats">
    <tbody>
        <tr>
            <th>name</th><th>color</th>
        </tr>
        <tr>
            <td>Button</td><td>white</td>
        </tr>
        <tr>
            <td>Peanut</td><td>Calico</td>
        </tr>
    </tbody>
</table>
```
<table id="mycats" width="50%">
<tbody>
<tr>
<th>name</th><th>color</th>
</tr>
<tr>
<td>Button</td><td>white</td>
</tr>
<tr>
<td>Peanut</td><td>Calico</td>
</tr>
</tbody>
</table>
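
As a rough sketch of how the row/cell structure maps onto data, the snippet below reads the same cat table into a list of rows with the Python 3 standard library. `TableReader` is a hypothetical helper written for this example and it assumes a simple, well-formed table with no nested tables.

```python
from html.parser import HTMLParser

table_html = """
<table id="mycats">
    <tbody>
        <tr><th>name</th><th>color</th></tr>
        <tr><td>Button</td><td>white</td></tr>
        <tr><td>Peanut</td><td>Calico</td></tr>
    </tbody>
</table>
"""


class TableReader(HTMLParser):
    """Hypothetical helper: collect the text of each cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows
        self._row = None       # row currently being read
        self._in_cell = False  # True while inside <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


reader = TableReader()
reader.feed(table_html)

# The first row holds the headers; the remaining rows hold the data.
header, records = reader.rows[0], reader.rows[1:]
cats = [dict(zip(header, record)) for record in records]
print(cats)
```

Each `<tr>` becomes one list, and each `<th>` or `<td>` inside it becomes one entry of that list.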

A full HTML document has a structure more like this:

```
<html> 
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="https://www.google.com"> Some link </a>
  </body>
</html>
```
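
The nesting in a document like this is what the DOM tree represents. As an illustrative sketch (Python 3 standard library only; `OutlineBuilder` is a name invented here), we can print each element indented by how deeply it is nested. It assumes every tag is explicitly closed, as in the example above.

```python
from html.parser import HTMLParser

document = """
<html>
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="https://www.google.com"> Some link </a>
  </body>
</html>
"""


class OutlineBuilder(HTMLParser):
    """Hypothetical helper: list tag names indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # children of this element sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # back up to the parent's level


builder = OutlineBuilder()
builder.feed(document)
print("\n".join(builder.lines))
```

The outline shows `head` and `body` as children of `html`, with `p`, `h1`, and `a` as children of `body`.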

## Exercise:
What element(s) do we need to get if we want to gather the titles and page links of all the top 200 movies?

## Fetch a page with the GET request

When you open your browser, type a URL into the address bar, and hit Enter, the browser sends a "GET" request to the HTTP server. If the server responds "yeah, ok, I see you are requesting this page, let me send it to you", we get the data back (as HTML), and voilà! We see the content of the page.
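
To watch both sides of that exchange without touching the network, here is a small sketch that runs a throwaway HTTP server on your own machine and sends it a GET request. It uses only the Python 3 standard library; `HelloHandler` and `PAGE` are names made up for this example, and a real web server does far more than this.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

PAGE = b"<html><body><p>You are beginning to learn HTML.</p></body></html>"


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: answer every GET with the same page."""

    def do_GET(self):
        self.send_response(200)  # "ok, here is the page"
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the notebook output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: send the GET request, ignoring any system proxy settings.
opener = build_opener(ProxyHandler({}))
with opener.open("http://127.0.0.1:{}/".format(port)) as reply:
    status = reply.status
    body = reply.read().decode("utf-8")

server.shutdown()
server.server_close()

print(status)
print(body)
```

The client receives a status code and the HTML text, which are the same two things a browser gets back before it renders the page.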

Doing this programmatically in Python is super easy. There's a library for that: **Requests: HTTP for Humans**  
You can read the documentation [here](http://docs.python-requests.org/en/master/).


Let's try now!

In [5]:
# if needed: 
# !conda install requests -y
import requests

url = 'http://www.boxofficemojo.com/alltime/adjusted.htm'
response = requests.get(url)

In [6]:
response.status_code

200

For information on HTTP status codes, see:  https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
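
As a quick way to look up what a code means from inside Python, here is a sketch using `http.HTTPStatus` from the Python 3 standard library. The `describe` function and its category labels are written for this example; the grouping by first digit follows the standard status code classes.

```python
from http import HTTPStatus

# Status code classes, keyed by the first digit of the code.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Hypothetical helper: return e.g. '404 Not Found (client error)'."""
    phrase = HTTPStatus(code).phrase
    category = CATEGORIES[code // 100]
    return "{} {} ({})".format(code, phrase, category)


for code in (200, 301, 404, 500):
    print(describe(code))
```

A `200` like the one above falls in the success class, so the response body should contain the page we asked for.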

In [7]:
print(response.text)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<HEAD>
<TITLE>All Time Box Office Adjusted for Ticket Price Inflation</TITLE>
<META NAME="keywords" CONTENT="box, office, all, time, inflation, adjusted, ticket, price, report, movie, film">
<META NAME="description" CONTENT="All time box office adjusted for ticket price inflation.">
<link rel="stylesheet" href="/css/mojo.css?1" type="text/css" media="screen" title="no title" charset="utf-8">
<link rel="stylesheet" href="/css/mojo.css?1" type="text/css" media="print" title="no title" charset="utf-8"></head>
<body>
<iframe id="sis_pixel_sitewide" width="1" height="1" frameborder="0" marginwidth="0" marginheight="0" style="display: none;"></iframe>
<script>
    setTimeout(function(){
        try{
            //sis3.0 pixel
            var cacheBust = Math.random() * 10000000000000000,
                url_sis3 = 'http://s.amazon-adsystem.com/iu3?',
                params_

## Great! We have fetched our first HTML page. 
This page is full of movie links. We will learn to parse the HTML and extract all the links in the next step.
Give yourself a big pat on the back!

Let's explore some live HTML!

Go to [```http://www.boxofficemojo.com/alltime/adjusted.htm```](http://www.boxofficemojo.com/alltime/adjusted.htm) in your browser,
right-click and select Inspect Element. Point your cursor at the different elements on the page. What happens?
Also try right-clicking and selecting View Page Source.