## M2:  Gathering Data - Data formats (Storage/Transportation), Web Scraping

### Data Formats: Storage/Transportation 

As I stated in module 1, one of the main objectives of this course is for you to get a better understanding of working with and analyzing data. We already looked at one representation/model of data with Markov Models, but let's now look at how we actually gather data, since that's the first step of data analysis. 
 
For example, suppose we were analyzing stock information and needed to gather Google's stock values from 2019. Let's examine some different representations of this data. 

#### Plain Text
> The closing price for GOOG on Jan 25, 2019 was $1090.99 and 1,114,785 shares were traded. 

#### Binary encoding	

                0xC8B10436001102A144885fae
                
Encoding (made up):

  - Day:                 5 bits
  - Month:               4 bits
  - Year:                7 bits               (X-1970)
  - Symbol:             16 bits       
       (1078th in alphabetic order)
  - Shares traded: 32 bits             (unsigned int)
  - Closing price: 32 bits             (32 bit IEEE floating point)
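
To make the bit layout concrete, here is a minimal sketch that unpacks the sample value with Python's standard library. It assumes the fields are packed starting from the most significant bit, and that in this particular sample the 32-bit share count sits before the 32-bit price (that ordering is what reproduces the numbers from the plain-text example). The helper name `take` is ours, chosen only for illustration.

```python
import struct

raw = 0xC8B10436001102A144885FAE   # the 96-bit sample value from above
TOTAL_BITS = 96

def take(value, offset, width):
    """Return `width` bits of `value`, starting `offset` bits from the left."""
    shift = TOTAL_BITS - offset - width
    return (value >> shift) & ((1 << width) - 1)

day = take(raw, 0, 5)
month = take(raw, 5, 4)
year = take(raw, 9, 7) + 1970        # stored as an offset from 1970
symbol = take(raw, 16, 16)           # position in a sorted list of symbols
shares = take(raw, 32, 32)           # unsigned int
price_bits = take(raw, 64, 32)
# Reinterpret the 32 raw bits as an IEEE 754 single-precision float.
price = struct.unpack(">f", price_bits.to_bytes(4, "big"))[0]

print(day, month, year, symbol, shares, round(price, 2))
```

Without knowing the layout, the hex string is meaningless; with it, the same record as the plain-text sentence comes back out.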

#### CSV 
Day,Month,Year,Symbol,Closing Price,Shares <-- header

25,Jan,2019,GOOG,1090.99,1114785  <-- comma delimited line for Google data 

#### XML 

```xml
<stock_info>
    <symbol>GOOG</symbol>
    <date>
        <day>25</day>
        <month>Jan</month>
        <year>2019</year>
    </date>
    <closing_price>1090.99</closing_price>
    <shares_traded>1114785</shares_traded>
</stock_info>
```

#### JSON 

```json
{"symbol": "GOOG",
 "date": {"day": "25",
          "month": "Jan",
          "year": "2019"},
 "closing price": "1090.99",
 "shares traded": "1114785"}
```

#### HTML 

```html 
    <tr>
    <td class="yfnc_tabledata1" nowrap align="right">Jan 25, 2019</td>
    <td class="yfnc_tabledata1" align="right">1,090.99</td>
    <td class="yfnc_tabledata1" align="right">GOOG</td>
    <td class="yfnc_tabledata1" align="right">1,114,785</td>
    </tr>
```

Again, some key ideas from seeing all these formats: 

- You need to understand what the data means before you can work with it.

- Data comes in different formats and you may not have any control
  over what you get.

- JSON is a nice mix of readable and easy to work with.
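
As a rough illustration of these points, the sketch below reads the same GOOG record from the CSV, JSON, and XML representations using only Python's standard library. The function names and the keys of the resulting dictionary (`close`, `shares`, and so on) are our own choices for this example, not part of any standard.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

csv_text = """Day,Month,Year,Symbol,Closing Price,Shares
25,Jan,2019,GOOG,1090.99,1114785
"""

json_text = """
{"symbol": "GOOG",
 "date": {"day": "25", "month": "Jan", "year": "2019"},
 "closing price": "1090.99",
 "shares traded": "1114785"}
"""

xml_text = """
<stock_info>
  <symbol>GOOG</symbol>
  <date><day>25</day><month>Jan</month><year>2019</year></date>
  <closing_price>1090.99</closing_price>
  <shares_traded>1114785</shares_traded>
</stock_info>
"""

def make_record(symbol, day, month, year, close, shares):
    """Convert the string fields into one typed dictionary."""
    return {"symbol": symbol, "day": int(day), "month": month,
            "year": int(year), "close": float(close), "shares": int(shares)}

def from_csv(text):
    row = next(csv.DictReader(io.StringIO(text)))  # first data row, keyed by header
    return make_record(row["Symbol"], row["Day"], row["Month"], row["Year"],
                       row["Closing Price"], row["Shares"])

def from_json(text):
    d = json.loads(text)
    date = d["date"]
    return make_record(d["symbol"], date["day"], date["month"], date["year"],
                       d["closing price"], d["shares traded"])

def from_xml(text):
    root = ET.fromstring(text.strip())
    return make_record(root.findtext("symbol"), root.findtext("date/day"),
                       root.findtext("date/month"), root.findtext("date/year"),
                       root.findtext("closing_price"), root.findtext("shares_traded"))

records = [from_csv(csv_text), from_json(json_text), from_xml(xml_text)]
print(records[0])
print(records[0] == records[1] == records[2])
```

All three readers end up with the same dictionary. The format only changes how much work it takes to get there.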

### Gathering data from the web

As stated above, a data scientist won't always be handed the data they need in an easy-to-process format (e.g., CSV, JSON, etc.). Furthermore, there may not even be an easy way to access the data. A lot of the data available to you is on the internet, and you may need to go out and get that data yourself. The data can be available in many different ways: 

1. Downloadable from a website. The data is easily accessible and may come in different formats to make it easy to include in your analysis. 

2. API (Application Programming Interface) - a set of functions and communication protocols that provide access to the data of an application, operating system, etc. For example, Twitter provides an API to allow developers to retrieve information about the users of Twitter. 
  > "The Twitter API enables programmatic access to Twitter in unique and advanced ways. Use it to analyze, learn from, and interact with Tweets, Direct Messages, users, and other key Twitter resources." Cite: Twitter
  
  Companies that provide an API might offer access to their data for free, or may charge for access. They might also limit the number of requests that a single user can make or the detail of the data they can access. 
  
  
3. Scraping the data - the process of extracting data from websites using an automated process is known as *web scraping*. Most developers use scraping when the data is not available for direct download from a website, or when access to the API is limited, too expensive, or non-existent. However, some websites explicitly forbid users from scraping their data, for two main reasons. First, making many repeated requests to a website's server may use up bandwidth, slowing down the website for other users and potentially overloading the server so that the website stops responding entirely. Second, their data may be valuable (e.g., Twitter, Amazon product information), and they want to protect it from competitors or others. 
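
One common way a site states what automated clients may fetch is a `robots.txt` file. The sketch below is a hedged illustration using Python's standard `urllib.robotparser`. The rules, the `example.com` URLs, and the `course-scraper` user-agent name are all made up for this example; a real scraper would load the site's actual file and also pause between requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real scraper would download the site's own file.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# Ask whether our (made-up) user agent may fetch each URL.
allowed = parser.can_fetch("course-scraper", "https://example.com/catalog/index.html")
blocked = parser.can_fetch("course-scraper", "https://example.com/private/data.html")
print(allowed, blocked)
```

Checking first does not replace reading a site's terms of use, but it is a cheap way to avoid requesting pages the site has asked automated clients to leave alone.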

Since scraping is a very useful tool to know when working with data, we will learn how to scrape this week. However, it will require that you know some basic HTML. 

### Basic URL 

Before you can grab the data on a website, you first need to understand the process of getting to that page to access its data. 

``  protocol://site address/path/filename ``

Example: ``https://classes.cs.uchicago.edu/current/30122-1/pa/pa1/index.html``

    - ``protocol``: https
    - ``hostname``: none in this URL (many sites use ``www`` here)
    - ``domain``: classes.cs.uchicago
    - ``top-level domain``: edu
    - ``path``: current/30122-1/pa/pa1
    - ``filename``: index.html

Relative URL:
   -  ``pa/index.html``

Relative URL with a fragment:  
   - ``pa/pa1/index.html#submission``
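
If you would rather not pull URLs apart by hand, Python's standard `urllib.parse` module can do it. The sketch below splits the example URL and resolves the relative URL with a fragment. The `base` value is an assumption on our part: a relative URL only has meaning relative to some page, and here we assume it is the course's top-level directory.

```python
from urllib.parse import urljoin, urlsplit

url = "https://classes.cs.uchicago.edu/current/30122-1/pa/pa1/index.html"
parts = urlsplit(url)
# Everything before the last "/" is the path; what follows is the filename.
directory, _, filename = parts.path.rpartition("/")

print(parts.scheme)   # protocol
print(parts.netloc)   # full host name (hostname, domain, and top-level domain together)
print(directory)      # path
print(filename)       # filename

# Assumed base page for resolving the relative URL.
base = "https://classes.cs.uchicago.edu/current/30122-1/"
absolute = urljoin(base, "pa/pa1/index.html#submission")
print(absolute)
print(urlsplit(absolute).fragment)
```

Note that `urlsplit` reports the whole host name as one piece (`netloc`) rather than separating the domain from the top-level domain.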

### Basic HTML

HTML (HyperText Markup Language) is the standard markup language for documents designed to be displayed in a web browser. It's similar to XML in that you embed the data within **tags** that describe the content of a document and may also include formatting/style information. 

Standard syntax: ``<tagname> contents </tagname>`` 

#### Basic structure of HTML document

```html 
  <!DOCTYPE HTML>
   <html>
     <head>
        <title>...</title>
     </head>
     <body>
       <p> ... </p>
       <p> ... </p>
       <p> ... </p>
     </body>
   </html>
```
You can also think of an HTML document's tags as making up a tree data structure since you embed tags within tags that act as children. 

![alt text](basic_html.png "Basic HTML structure")
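
To see the tree idea in code, here is a small sketch that uses the standard library's `html.parser` to print each tag indented by its depth. The class name `TreePrinter` and the sample document are ours. The sketch also assumes every opened tag is explicitly closed, which real-world HTML does not always do.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Record each start tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1          # children of this tag are one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1          # back up to the parent's level

doc = """<!DOCTYPE HTML>
<html>
  <head><title>Example</title></head>
  <body><p>one</p><p>two</p></body>
</html>"""

printer = TreePrinter()
printer.feed(doc)
print("\n".join(printer.lines))
```

The indentation in the printed outline mirrors the parent/child relationships in the picture above: `head` and `body` are children of `html`, and each `p` is a child of `body`.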


#### Some example tags 

**headers**:
 -   ``<h1>...</h1> through <h6>...</h6>``

**links**: 
 - ``<a href="http://college.uchicago.edu">The College</a>``

 -  Attributes: ``href`` is an attribute and "http://..." is its value. An attribute provides additional information about HTML tags. 

**images**:
 -  ``<img style="height: 120px;" alt="" src="images/freelunch.png">``

**paragraph**:
  -  ``<p> … some text … </p>``
  -  ``<p class="courseblocktitle">course title…</p>``

**comment**:
  - ``<!-- ... -->``

**division or section**:
  - ``<div> … </div>``
  - ``<div class="courseblock main">…stuff...</div>``


**table**:
```html
     <table>
       <tr>
          <th>...</th>
             ...
          <th>...</th>
       </tr>
       <tr>
          <td>...</td>
             ...
          <td>...</td>
       </tr>
     </table>
```


### Grabbing the HTML Code

We now have a good understanding of the structure of an HTML document. Let's now talk about the tools Python provides that allow us to extract the data from within the tags. 

First, you need to be able to access the HTML from a given website. You can use the ``requests`` package to retrieve the HTML code. 

In [1]:
import requests

You now need to open a *connection* to your website of choice by using the ``get`` function of the ``requests`` package. 

For example, let's take a look at accessing The College's 2014-2015 catalog, which is located here: 

https://www.classes.cs.uchicago.edu/archive/2015/winter/12200-1/new.collegecatalog.uchicago.edu/thecollege/computerscience/index.html

In [2]:
# requests.get  makes the connection and reads the data from the connection if possible.
r = requests.get("https://www.classes.cs.uchicago.edu/archive/2015/winter/12200-1/new.collegecatalog.uchicago.edu/thecollege/computerscience/index.html")

In [3]:
type(r) # r is just a Response object with information about the connection 

requests.models.Response

In [4]:
# Checking the "status_code" attribute allows us to know if we were able to make a 
# connection with the website. 
#  r.status_code --
#     200 - OK Standard response for successful HTTP requests
#     403 - Forbidden
#     404 - Not Found
r.status_code

200

How do we actually get the HTML? The ``r.text`` attribute holds the HTML code as a Unicode string. Calling ``r.text.encode(...)`` converts that string into ``bytes`` in the encoding you name. 

In [5]:
# To get the HTML we just need the ".text" attribute, which is a str. Calling
# .encode(...) on it returns bytes in the named encoding (note the b'...' below).
r.text.encode('iso-8859-1')

b'<!doctype html>\n<html xml:lang="en" lang="en" dir="ltr">\n\n<head>\n<title>Computer Science &lt; University of Chicago Catalog</title>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<link rel="search" type="application/opensearchdescription+xml"\n\t\t\thref="../search/opensearch.xml" title="University of Chicago Catalog" />\n<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />\n<link href="../favicon.ico" rel="shortcut icon" />\n<link rel="stylesheet" type="text/css" href="../css/reset.css" />\n<link rel="stylesheet" type="text/css" href="../css/screen.css" media="screen" />\n<link rel="stylesheet" type="text/css" href="../css/print.css" media="print" />\n<script type="text/javascript" src="../../js/jquery.js"></script>\n<script type="text/javascript" src="../../js/lfjs.js"></script>\n<script type="text/javascript" src="../../js/lfjs_any.js"></script>\n<link rel="stylesheet" type="text/css" href="../../js/lfjs.css" />\n
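
Putting these steps together, a small helper might look like the sketch below. The function name `fetch_html` and the 10-second default timeout are our own choices, not something ``requests`` prescribes. The sketch treats anything other than a 200 status as a failure, which is stricter than some applications need.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page's HTML as a str, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException:
        # Covers malformed URLs, connection problems, timeouts, etc.
        return None
    if r.status_code != 200:
        return None
    return r.text

# Example use (needs network access):
# html = fetch_html("https://www.classes.cs.uchicago.edu/archive/2015/winter/12200-1/new.collegecatalog.uchicago.edu/thecollege/computerscience/index.html")
```

Returning ``None`` on failure keeps the calling code simple: it can check for ``None`` before handing the string to a parser.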

### Web Scraping
Once we have the actual HTML document, we can start scraping it to extract the data. Python has many different packages for parsing HTML (e.g., Scrapy, which is more advanced), but we will use one of the most common packages, known as **Beautiful Soup**. 

First, you will need to install Beautiful Soup, along with the ``html5lib`` parser used in the examples below: 

``pip3 install beautifulsoup4 html5lib`` 

Now, let's create a Beautiful Soup object from an HTML document. For example, let's open the ``catalog.html`` page. It has the following tree structure. 

![alt text](catalog.png "CS Catalog Structure")


The idea of Beautiful Soup is to create this tree structure from a given html page and allow you to iterate or grab tags (i.e., nodes) from the structure. 

Since this page is stored locally, we don't need to use ``requests`` to grab the HTML document. Instead, we will open it like a normal file and create a Beautiful Soup object from it. 

In [6]:
import bs4 # import beautiful soup 

In [7]:
with open("catalog.html") as f:  # Open the catalog html file
    s = f.read()                 # and read it into one string

In [8]:
s

'<html>\n\n<head>\n    <title>Computer Science Courses</title>\n</head>\n\n<body>\n    <div class="courseblock main" style="course">\n       Lamont <p class="courseblocktitle">\n            <strong>\n                CMSC&#160;10100. Introduction to Programming for the World Wide Web I. 100 Units.\n            </strong>\n        </p>\n\n        <p class="courseblockdesc">\n            This course teaches the basics of building and maintaining a site on\n            the World Wide Web. We discuss Internet terminology and how the\n            Internet and its associated technologies work. Topics include\n            programming websites, hypertext markup language (HTML), Cascading\n            Style Sheets (CSS), and Common Gateway Interface (CGI) scripts\n            (using PERL). Students also learn how to use JavaScript to add\n            client-side functionality.</p>\n        <p class="courseblockdetail">\n            Instructor(s): W. Sterner&#160;&#160;&#160;&#160;&#160;Terms\n   

In [9]:
# A BeautifulSoup object takes in two arguments:
# 1. The string containing the HTML document
# 2. The name of the HTML parser you want BeautifulSoup to use to create the tree.
# Parser names are predefined strings; see the documentation for the available
# parsers. 
soup = bs4.BeautifulSoup(s,  "html5lib")

In [10]:
soup

<html><head>
    <title>Computer Science Courses</title>
</head>

<body>
    <div class="courseblock main" style="course">
       Lamont <p class="courseblocktitle">
            <strong>
                CMSC 10100. Introduction to Programming for the World Wide Web I. 100 Units.
            </strong>
        </p>

        <p class="courseblockdesc">
            This course teaches the basics of building and maintaining a site on
            the World Wide Web. We discuss Internet terminology and how the
            Internet and its associated technologies work. Topics include
            programming websites, hypertext markup language (HTML), Cascading
            Style Sheets (CSS), and Common Gateway Interface (CGI) scripts
            (using PERL). Students also learn how to use JavaScript to add
            client-side functionality.</p>
        <p class="courseblockdetail">
            Instructor(s): W. Sterner     Terms
            Offered: Winter<br/> Note(s): This course does n

There are four different *kinds* of Beautiful Soup objects:

- ``bs4.BeautifulSoup`` -> the top-level object that gets created when you provide the HTML document and the parser. 

- ``bs4.Tag`` -> corresponds to an XML or HTML tag in the original document (e.g., ``a``, ``p``, ``div``).

- ``bs4.NavigableString`` -> corresponds to a bit of text within a tag.

- ``bs4.Comment`` -> represents a comment that lives within an HTML document. 
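
Here is a tiny sketch that produces one of each kind. The HTML snippet is made up, and we use Python's built-in ``"html.parser"`` here only so the example runs without installing ``html5lib``.

```python
import bs4

html = "<html><body><p>Hello<!-- a note --></p></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

p = soup.p                      # a Tag
text, comment = p.contents      # the text and the comment inside <p>

print(type(soup).__name__)
print(type(p).__name__)
print(type(text).__name__)
print(type(comment).__name__)
```

One detail worth knowing: ``bs4.Comment`` is a subclass of ``bs4.NavigableString``, so code that checks ``isinstance(x, bs4.NavigableString)`` will also match comments.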

In [11]:
# Returns the tag that corresponds to the ``title`` tag in the document
soup.title  

<title>Computer Science Courses</title>

In [12]:
# .text returns the string inside the tag 
soup.title.text

'Computer Science Courses'

In [13]:
# What happens if you specify an HTML tag that's not inside the original document? 
# It returns the None value. 
print(soup.img)

None


In [14]:
# What about a tag that appears more than once? For example "p"?
# It returns the first occurrence of that tag.
soup.p

<p class="courseblocktitle">
            <strong>
                CMSC 10100. Introduction to Programming for the World Wide Web I. 100 Units.
            </strong>
        </p>

Use the ``find_all(tag_str)`` method of the soup or of a tag to return a list of all the tags whose name matches the given ``tag_str`` string. 

In [15]:
divs = soup.find_all("div")

In [16]:
len(divs)

5

In [17]:
divs

[<div class="courseblock main" style="course">
        Lamont <p class="courseblocktitle">
             <strong>
                 CMSC 10100. Introduction to Programming for the World Wide Web I. 100 Units.
             </strong>
         </p>
 
         <p class="courseblockdesc">
             This course teaches the basics of building and maintaining a site on
             the World Wide Web. We discuss Internet terminology and how the
             Internet and its associated technologies work. Topics include
             programming websites, hypertext markup language (HTML), Cascading
             Style Sheets (CSS), and Common Gateway Interface (CGI) scripts
             (using PERL). Students also learn how to use JavaScript to add
             client-side functionality.</p>
         <p class="courseblockdetail">
             Instructor(s): W. Sterner     Terms
             Offered: Winter<br/> Note(s): This course does not meet the general
             education requirement in t

Use the ``class_`` keyword argument of the ``find_all`` method to return only the tags whose ``class`` attribute matches the given string. The trailing ``_`` is required because ``class`` is a reserved word in Python. 

In [18]:
course_divs = soup.find_all("div", class_="courseblock main")

In [19]:
len(course_divs)

2

In [20]:
course_divs

[<div class="courseblock main" style="course">
        Lamont <p class="courseblocktitle">
             <strong>
                 CMSC 10100. Introduction to Programming for the World Wide Web I. 100 Units.
             </strong>
         </p>
 
         <p class="courseblockdesc">
             This course teaches the basics of building and maintaining a site on
             the World Wide Web. We discuss Internet terminology and how the
             Internet and its associated technologies work. Topics include
             programming websites, hypertext markup language (HTML), Cascading
             Style Sheets (CSS), and Common Gateway Interface (CGI) scripts
             (using PERL). Students also learn how to use JavaScript to add
             client-side functionality.</p>
         <p class="courseblockdetail">
             Instructor(s): W. Sterner     Terms
             Offered: Winter<br/> Note(s): This course does not meet the general
             education requirement in t
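
The way ``class_`` matches deserves a closer look. In the sketch below (made-up HTML, and the built-in ``"html.parser"`` so it runs without extra packages), a single class name matches any tag that carries that class, while a string with a space in it is compared against the tag's whole ``class`` value.

```python
import bs4

html = """
<div class="courseblock main">A</div>
<div class="courseblock subsequence">B</div>
<div class="other">C</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

all_divs = soup.find_all("div")
# Matches every div that has "courseblock" among its classes.
blocks = soup.find_all("div", class_="courseblock")
# Matches only divs whose class attribute is exactly "courseblock main".
mains = soup.find_all("div", class_="courseblock main")

print(len(all_divs), len(blocks), len(mains))
print([d.text for d in mains])
```

So ``class_="courseblock"`` would be the looser search, and ``class_="courseblock main"`` the stricter one used in the cells above.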

In [21]:
# Look at the first div tag. The next cell uses the ".attrs" attribute
# to get all of its attributes.
soup.div

<div class="courseblock main" style="course">
       Lamont <p class="courseblocktitle">
            <strong>
                CMSC 10100. Introduction to Programming for the World Wide Web I. 100 Units.
            </strong>
        </p>

        <p class="courseblockdesc">
            This course teaches the basics of building and maintaining a site on
            the World Wide Web. We discuss Internet terminology and how the
            Internet and its associated technologies work. Topics include
            programming websites, hypertext markup language (HTML), Cascading
            Style Sheets (CSS), and Common Gateway Interface (CGI) scripts
            (using PERL). Students also learn how to use JavaScript to add
            client-side functionality.</p>
        <p class="courseblockdetail">
            Instructor(s): W. Sterner     Terms
            Offered: Winter<br/> Note(s): This course does not meet the general
            education requirement in the mathematical sci

In [22]:
soup.div.attrs

{'class': ['courseblock', 'main'], 'style': 'course'}

In [23]:
# You can "index" a tag by its attribute too
soup.div["class"]

['courseblock', 'main']

In [24]:
soup.div["style"]

'course'
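
Indexing a tag works like indexing a dictionary, which means asking for an attribute the tag does not have raises ``KeyError``. A short sketch (the one-line HTML is made up) showing the safer ``.get`` alternative:

```python
import bs4

soup = bs4.BeautifulSoup(
    '<div class="courseblock main" style="course">x</div>', "html.parser")
div = soup.div

print(div.attrs)          # all attributes as a dictionary
print(div["class"])       # class is multi-valued, so it comes back as a list
print(div.get("id"))      # missing attribute: .get returns None

try:
    div["id"]             # missing attribute: indexing raises KeyError
    raised = False
except KeyError:
    raised = True
print(raised)
```

When you are not sure every tag on a page has a given attribute, ``.get`` saves you from wrapping each lookup in ``try``/``except``.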

In [25]:
# You can also use find_all on a specific tag to find tags within a tag 
soup.div

<div class="courseblock main" style="course">
       Lamont <p class="courseblocktitle">
            <strong>
                CMSC 10100. Introduction to Programming for the World Wide Web I. 100 Units.
            </strong>
        </p>

        <p class="courseblockdesc">
            This course teaches the basics of building and maintaining a site on
            the World Wide Web. We discuss Internet terminology and how the
            Internet and its associated technologies work. Topics include
            programming websites, hypertext markup language (HTML), Cascading
            Style Sheets (CSS), and Common Gateway Interface (CGI) scripts
            (using PERL). Students also learn how to use JavaScript to add
            client-side functionality.</p>
        <p class="courseblockdetail">
            Instructor(s): W. Sterner     Terms
            Offered: Winter<br/> Note(s): This course does not meet the general
            education requirement in the mathematical sci

In [26]:
soup.div.find_all("p")

[<p class="courseblocktitle">
             <strong>
                 CMSC 10100. Introduction to Programming for the World Wide Web I. 100 Units.
             </strong>
         </p>, <p class="courseblockdesc">
             This course teaches the basics of building and maintaining a site on
             the World Wide Web. We discuss Internet terminology and how the
             Internet and its associated technologies work. Topics include
             programming websites, hypertext markup language (HTML), Cascading
             Style Sheets (CSS), and Common Gateway Interface (CGI) scripts
             (using PERL). Students also learn how to use JavaScript to add
             client-side functionality.</p>, <p class="courseblockdetail">
             Instructor(s): W. Sterner     Terms
             Offered: Winter<br/> Note(s): This course does not meet the general
             education requirement in the mathematical sciences.<br/>
         </p>]
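
The same "find the tags inside a tag" idea applies to the HTML stock row from the data formats section. The sketch below pulls the four ``td`` cells out of that row and converts the numbers. The variable names are ours, and the order of the cells is read off the example rather than from any documented layout.

```python
import bs4

row_html = """<tr>
<td class="yfnc_tabledata1" nowrap align="right">Jan 25, 2019</td>
<td class="yfnc_tabledata1" align="right">1,090.99</td>
<td class="yfnc_tabledata1" align="right">GOOG</td>
<td class="yfnc_tabledata1" align="right">1,114,785</td>
</tr>"""

soup = bs4.BeautifulSoup(row_html, "html.parser")
cells = [td.get_text(strip=True) for td in soup.tr.find_all("td")]

date, close_str, symbol, shares_str = cells
close = float(close_str.replace(",", ""))    # drop thousands separators
shares = int(shares_str.replace(",", ""))

print(date, symbol, close, shares)
```

Notice how much the code has to know about the page: nothing in the HTML says which cell is the price and which is the share count.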

#### Navigating through the tree

You can navigate through the tree by using the ``contents`` attribute of the soup object or a tag. This gives you back a list of the tag's direct children. However, using ``contents`` is a bit awkward because the list contains newlines and other strings as well as tags.

In [27]:
soup.div.contents

['\n       Lamont ', <p class="courseblocktitle">
             <strong>
                 CMSC 10100. Introduction to Programming for the World Wide Web I. 100 Units.
             </strong>
         </p>, '\n\n        ', <p class="courseblockdesc">
             This course teaches the basics of building and maintaining a site on
             the World Wide Web. We discuss Internet terminology and how the
             Internet and its associated technologies work. Topics include
             programming websites, hypertext markup language (HTML), Cascading
             Style Sheets (CSS), and Common Gateway Interface (CGI) scripts
             (using PERL). Students also learn how to use JavaScript to add
             client-side functionality.</p>, '\n        ', <p class="courseblockdetail">
             Instructor(s): W. Sterner     Terms
             Offered: Winter<br/> Note(s): This course does not meet the general
             education requirement in the mathematical sciences.<br/

In [28]:
# You can also use .parent, .next_sibling, and .previous_sibling to move to the 
# parent and siblings of a tag. 
soup.div.next_sibling

'\n    '

In [29]:
soup.div.next_sibling.next_sibling

<div class="courseblock main">
        <p class="courseblocktitle">
            <strong>
                CMSC 12100-12200-12300. Computer Science with Applications I-II-III.
            </strong>
        </p>

        <p class="courseblockdesc">

            This three-quarter sequence teaches computational thinking and
            skills to students who are majoring in the sciences, mathematics,
            and economics. Lectures cover topics in (1) programming, such as
            recursion, abstract data types, and processing data; (2) computer
            science, such as clustering methods, event-driven simulation, and
            theory of computation; and to a lesser extent (3) numerical
            computation, such as approximating functions and their derivatives
            and integrals, solving systems of linear equations, and simple
            Monte Carlo techniques. Applications from a wide variety of fields
            serve both as examples in lectures and as the basi

In [30]:
soup.div.next_sibling.next_sibling.next_sibling

'\n\n    '
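
As the cells above show, ``.next_sibling`` often lands on a whitespace string rather than a tag. If you only care about the next tag, Beautiful Soup's ``find_next_sibling`` method skips over the strings for you. A small sketch with made-up HTML:

```python
import bs4

html = """<body>
<div id="a">first</div>
<div id="b">second</div>
</body>"""
soup = bs4.BeautifulSoup(html, "html.parser")

first = soup.div
print(repr(first.next_sibling))           # the newline between the two divs

second = first.find_next_sibling("div")   # jumps straight to the next div tag
print(second["id"])

print(second.find_next_sibling("div"))    # no more div siblings
```

When there is no matching sibling, ``find_next_sibling`` returns ``None``, so check for that before using the result.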

In [31]:
# How could we find only the "course subsequences" within a div tag?
def is_subsequence(tag):
    return isinstance(tag, bs4.element.Tag) and 'class' in tag.attrs \
        and tag['class'] == ['courseblock', 'subsequence']

def is_whitespace(tag):
    return isinstance(tag, bs4.element.NavigableString) and (tag.strip() == "")

def find_sequence(tag):
    '''
    If tag is the header for a sequence, then find the tags for the
    courses in the sequence.
    '''

    rv = []
    sib_tag = tag.next_sibling
    while is_subsequence(sib_tag) or is_whitespace(sib_tag):
        if not is_whitespace(sib_tag):
            rv.append(sib_tag)
        sib_tag = sib_tag.next_sibling
    return rv

In [32]:
course_divs = soup.find_all("div", class_="courseblock main")

In [41]:
# Chaining .next_sibling too far fails: the last sibling's .next_sibling is None.
course_divs[1].next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling

AttributeError: 'NoneType' object has no attribute 'next_sibling'
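
One way to avoid running off the end like this is to loop over the ``.next_siblings`` generator, which simply stops after the last sibling. The sketch below is an alternative take on ``find_sequence``, not a replacement for it. The name ``find_sequence_iter`` and the miniature HTML are ours.

```python
import bs4

html = """<body>
<div class="courseblock main">Sequence header</div>
<div class="courseblock subsequence">Course I</div>
<div class="courseblock subsequence">Course II</div>
<div class="courseblock main">Another course</div>
</body>"""

def find_sequence_iter(tag):
    """Collect the 'courseblock subsequence' tags that directly follow tag."""
    rv = []
    for sib in tag.next_siblings:       # ends cleanly after the last sibling
        if isinstance(sib, bs4.NavigableString):
            if sib.strip() == "":
                continue                # skip whitespace between tags
            break                       # real text ends the sequence
        if sib.get("class") == ["courseblock", "subsequence"]:
            rv.append(sib)
        else:
            break                       # a different kind of tag ends it
    return rv

soup = bs4.BeautifulSoup(html, "html.parser")
mains = soup.find_all("div", class_="courseblock main")

titles = [c.text for c in find_sequence_iter(mains[0])]
print(titles)
print(find_sequence_iter(mains[1]))     # last div: nothing follows it
```

Because the generator handles the "no more siblings" case, the function never has to touch ``None``.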

In [34]:
find_sequence(course_divs[1])

[<div class="courseblock subsequence">
         <p class="courseblocktitle">
             <strong>
                 CMSC 12100. Computer Science with Applications I. 100 Units.</strong>
         </p>
 
         <p class="courseblockdesc">
         </p>
 
         <p class="courseblockdetail">
             Instructor(s): A. Rogers     Terms Offered: Autumn<br/>
             Prerequisite(s): Placement into MATH 15200 or higher, or consent of instructor<br/>
             Note(s): This course meets the general education requirement in the mathematical sciences. <br/>
         </p>
     </div>, <div class="courseblock subsequence">
         <p class="courseblocktitle">
             <strong>
                 CMSC 12200. Computer Science with Applications II. 100 Units.
             </strong>
         </p>
 
         <p class="courseblockdesc">
         </p>
 
         <p class="courseblockdetail">
             Instructor(s): A. Rogers     Terms Offered: Winter<br/>
             Prerequisite(