In [1]:
import ipywidgets

# HTML refresher and GET requests

Before we can scrape HTML pages, we need to learn a little bit about the DOM (Document Object Model).

In [2]:
url = "https://www.computerhope.com/jargon/d/dom1.jpg"
iframe = '<iframe src=' + url + ' width=700 height=400></iframe>'
ipywidgets.HTML(iframe)

HTML(value='<iframe src=https://www.computerhope.com/jargon/d/dom1.jpg width=700 height=400></iframe>')

## First, an HTML refresher
HTML is the basic language used to create a web page.

It tells the web browser what text/media to display, where to display it, and how to display it (style).

HTML is very structured/hierarchical.

Every page is made up of discrete "elements."

Elements are labeled with "tags."

For example:

    <p>You are beginning to learn HTML.</p>

A start tag also often contains "attributes" with info about the element.

Attributes usually have a name and value.

Example:

    <p class="my_red_sentences">You are beginning to learn HTML.</p>
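
To see how a parser views that element, here is a minimal sketch using Python's built-in `html.parser` module. That choice is an assumption made for illustration (the scraping cells later in this notebook use `lxml` and `BeautifulSoup`), and the `ElementRecorder` class name is hypothetical.

```python
from html.parser import HTMLParser


class ElementRecorder(HTMLParser):
    """Record start tags, attributes, text, and end tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(("text", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


recorder = ElementRecorder()
recorder.feed('<p class="my_red_sentences">You are beginning to learn HTML.</p>')
recorder.close()

for event in recorder.events:
    print(event)
```

The three events correspond to the start tag (with its `class` attribute), the text inside the element, and the end tag.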

We can make a table in HTML: we use the ```<tr>``` tag for each table row, the ```<th>``` tag for each header cell, and the ```<td>``` tag for each data cell

```
<table id="mycats">
    <tbody>
        <tr>
            <th>name</th><th>color</th>
        </tr>
        <tr>
            <td>Button</td><td>white</td>
        </tr>
        <tr>
            <td>Peanut</td><td>Calico</td>
        </tr>
    </tbody>
</table>
```
<table id="mycats" width="50%">
<tbody>
<tr>
<th>name</th><th>color</th>
</tr>
<tr>
<td>Button</td><td>white</td>
</tr>
<tr>
<td>Peanut</td><td>Calico</td>
</tr>
</tbody>
</table>
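
Because the table is so regular, its rows can be pulled out mechanically. The sketch below is one possible way to do it with the standard-library `html.parser`. The `TableRows` class and the `cats` variable are hypothetical names, and the sketch assumes every cell sits inside a `<tr>`.

```python
from html.parser import HTMLParser

TABLE_HTML = """
<table id="mycats">
    <tbody>
        <tr><th>name</th><th>color</th></tr>
        <tr><td>Button</td><td>white</td></tr>
        <tr><td>Peanut</td><td>Calico</td></tr>
    </tbody>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # a new row starts
        elif tag in ("th", "td"):
            self._in_cell = True
            self.rows[-1].append("")      # a new, empty cell in the current row

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        # ignore whitespace between tags; keep only text inside a cell
        if self._in_cell:
            self.rows[-1][-1] += data


parser = TableRows()
parser.feed(TABLE_HTML)
parser.close()

# first row holds the column names, the rest are records
header, *records = parser.rows
cats = [dict(zip(header, record)) for record in records]
print(cats)
```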

A full HTML document has a structure more like this:

```
<html>
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="https://www.google.com"> Some link </a>
  </body>
</html>
```
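
The nesting in that document is what the DOM models as a tree. As a rough sketch, the standard-library `html.parser` can print each tag indented by how deeply it is nested. The `Outline` class is a hypothetical helper, and it assumes every start tag has a matching end tag, as in this example.

```python
from html.parser import HTMLParser

PAGE = """
<html>
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="https://www.google.com"> Some link </a>
  </body>
</html>
"""


class Outline(HTMLParser):
    """Build an indented outline of the tags in a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # indent by the current nesting depth, then go one level deeper
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # leaving an element brings us back up one level
        self.depth -= 1


outline = Outline()
outline.feed(PAGE)
outline.close()
print("\n".join(outline.lines))
```

`head` and `body` appear one level under `html`, and `p`, `h1`, and `a` appear one level under `body`.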

Let's explore some live HTML!

Go to ```http://www.boxofficemojo.com/alltime/adjusted.htm``` in your browser,
right-click, and select Inspect Element. Point your cursor at the different elements on the page. What happens?
Also try right-clicking and selecting View Page Source.

## Fetch a page with the GET request

When you open your browser, type a URL into the address bar, and hit Enter, the browser sends a "GET" request to the HTTP server. If the server responds "yeah, ok, I see you are requesting this page, let me send it to you", we get the data back (in HTML), and voilà! We see the content of the page.
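
To watch that round trip without touching the internet, the sketch below starts a tiny HTTP server on the local machine and sends it one GET request. It uses only the standard library (`http.server` and `http.client`). The `HelloHandler` class and the page content are made up for illustration.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body><p>Hello from the server.</p></body></html>"


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # answer every GET request with status 200 and a small HTML body
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the notebook output quiet


# port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# the "browser" side: send GET / and read what comes back
connection = HTTPConnection("127.0.0.1", port, timeout=5)
connection.request("GET", "/")
reply = connection.getresponse()
status = reply.status
content_type = reply.getheader("Content-Type")
body = reply.read().decode("utf-8")
connection.close()

server.shutdown()
server.server_close()

print(status, content_type)
print(body)
```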

Doing this programmatically in Python is super easy. There's a library for that: **Requests: HTTP for Humans**
You can read the documentation [here](http://docs.python-requests.org/en/master/)


In [3]:
# if needed:
# !conda install requests -y
import requests

url = 'http://www.google.com/'
response = requests.get(url)
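# a few useful attributes of the response object
print(response)              # e.g. <Response [200]>
print(response.status_code)  # numeric HTTP status code
print(response.url)          # final URL after any redirects
print(response.headers)      # response headers (case-insensitive dict)
print(response.content)      # body as raw bytes
print(response.text)         # body decoded to str
# .json() only works when the body is JSON; this page is HTML,
# so the call below raises a JSONDecodeError
print(response.json())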

<Response [200]>
200
http://www.google.com/
{'Date': 'Thu, 14 May 2020 19:47:23 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'Content-Length': '5263', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2020-05-14-19; expires=Sat, 13-Jun-2020 19:47:23 GMT; path=/; domain=.google.com; Secure, NID=204=GQSjM1Fb8gimHqyiQ1XEfJLBrl0THcQUyUjSaqiiTXlI9jDGySO3jWREyMo56mJXYz8X1zQn6-2_lBr8tsuD9Zep46WG19JTzvxHAXL_2baSDhCG105ZLO41wnRfNIZ3pgh-e2xedCHcpLUDM_5Ys0g1QmUbzJdKJbyKzfxNFuA; expires=Fri, 13-Nov-2020 19:47:23 GMT; path=/; domain=.google.com; HttpOnly'}
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exac

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
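
The `JSONDecodeError` comes from `response.json()`: the `Content-Type` header in the output says the body is `text/html`, not JSON. One way to avoid the error is to look at that header before parsing. This is a minimal sketch, and `parse_body` is a hypothetical helper rather than part of Requests. With a real response you could pass it `response.headers.get("Content-Type", "")` and `response.text`.

```python
import json


def parse_body(content_type, text):
    """Return parsed JSON when the header says the body is JSON, else the raw text."""
    # the header may carry parameters, e.g. "text/html; charset=ISO-8859-1"
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json" or media_type.endswith("+json"):
        return json.loads(text)
    return text


html_result = parse_body("text/html; charset=ISO-8859-1", "<!doctype html><html></html>")
json_result = parse_body("application/json", '{"status": "ok"}')

print(type(html_result).__name__, html_result)
print(type(json_result).__name__, json_result)
```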

In [4]:
response.status_code

200

For information on HTTP status codes, see:  https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
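
Status codes are grouped by their first digit (2xx success, 3xx redirection, 4xx client error, 5xx server error). As a small sketch, the standard-library `http.HTTPStatus` enum can turn a code into its standard phrase. The `describe_status` function is a hypothetical helper.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # not a status code the enum knows about
    family = families.get(code // 100, "unknown family")
    return f"{code} {phrase} ({family})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```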

In [5]:
print(response.text)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="yRjJ7Q4HLfK98MCEJ3CdTA==">(function(){window.google={kEI:'S6C9XtSgBtK9ggf4naOQCg',kEXPI:'0,202123,3,4,1151616,5663,730,224,756,4348,207,3204,10,1051,175,364,926,193,380,576,210,31,383,246,5,830,30,494,196,472,14,13,118,225,656,1217,406,413,3,149,12,1123872,1197793,98,258,78,329040,1294,12383,4855,32692,15247,867,17444,11240,9188,8384,4858,1362,9291,3025,4742,2648,8385,1808,4020,978,4788,1,3142,5297,2054,920,873,1217,2975,6430,1142,6290,3874,3222,235,4284,2777,518,399,2277,8,87,270

# LinkedIn Industries
### Request a web page and extract data from an HTML table
### Save to CSV

In [6]:
import csv
import requests
import lxml.html
from bs4 import BeautifulSoup
from datetime import datetime

industry_url = r'https://developer.linkedin.com/docs/reference/industry-codes'

table_xpath = r'//*[@id="content"]/div[2]/div/section/div/div/div[2]/div[2]/table'

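# fetch the page; stop early on an HTTP error status
req = requests.get(industry_url)
req.raise_for_status()

# parse the HTML and select the table with the XPath expression
lxml_html = lxml.html.fromstring(req.content)
root = lxml_html.getroottree()
table = root.xpath(table_xpath)
if not table:
    raise ValueError("No table matched table_xpath; the page layout may have changed")

# serialize the first matching <table> element back to an HTML string
raw_table_html_bytes = lxml.html.tostring(table[0])
html_table = raw_table_html_bytes.decode('utf-8')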

In [7]:
now_time = datetime.utcnow().strftime('%Y%m%dT%H%M%S')

# newline='' lets the csv module control line endings itself
with open('data/linkedin_industries.csv', 'w', newline='', encoding='utf-8') as csvFile:
    writer = csv.writer(csvFile)
    soup = BeautifulSoup(html_table, 'lxml')
    for row in soup.find_all('tr'):
        print(row)
        # one CSV row per table row: timestamp first, then the text of each cell
        table_row = [now_time]
        for cell in row.find_all('td'):
            table_row.append(cell.get_text(strip=True))
        writer.writerow(table_row)

<tr><td>Code</td> <td>Groups</td> <td>Description<br/> </td> </tr>
<tr><td>47</td> <td>corp, fin</td> <td>Accounting</td> </tr>
<tr><td>94</td> <td>man, tech, tran</td> <td>Airlines/Aviation</td> </tr>
<tr><td>120</td> <td>leg, org</td> <td>Alternative Dispute Resolution</td> </tr>
<tr><td>125</td> <td>hlth</td> <td>Alternative Medicine</td> </tr>
<tr><td>127</td> <td>art, med</td> <td>Animation</td> </tr>
<tr><td>19</td> <td>good</td> <td>Apparel &amp; Fashion</td> </tr>
<tr><td>50</td> <td>cons</td> <td>Architecture &amp; Planning</td> </tr>
<tr><td>111</td> <td>art, med, rec</td> <td>Arts and Crafts</td> </tr>
<tr><td>53</td> <td>man</td> <td>Automotive</td> </tr>
<tr><td>52</td> <td>gov, man</td> <td>Aviation &amp; Aerospace</td> </tr>
<tr><td>41</td> <td>fin</td> <td>Banking</td> </tr>
<tr><td>12</td> <td>gov, hlth, tech</td> <td>Biotechnology</td> </tr>
<tr><td>36</td> <td>med, rec</td> <td>Broadcast Media</td> </tr>
<tr><td>49</td> <td>cons</td> <td>Building Materials</td> </tr>
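
Notice that the "Groups" cells contain commas (for example `corp, fin`). The `csv` module handles this by quoting such fields, so they still read back as a single cell. The sketch below shows this with three of the rows printed above, writing to an in-memory buffer instead of a file.

```python
import csv
import io

# three rows copied from the output above (cell text only)
rows = [
    ["Code", "Groups", "Description"],
    ["47", "corp, fin", "Accounting"],
    ["94", "man, tech, tran", "Airlines/Aviation"],
]

# write the rows to an in-memory text buffer
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# reading the text back gives the original cells, commas included
round_trip = list(csv.reader(io.StringIO(csv_text, newline="")))
print(round_trip == rows)
```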

# Reading data from an API

### Download data for a set of stock tickers


In [8]:
# CSV
import os
import time
from datetime import datetime

import requests

tickers = "GOOG MSFT IBM TSLA".split()

today_date = datetime.today().strftime('%Y-%m-%d')

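# download daily prices for each stock in the portfolio
# and save each response as a CSV file
os.makedirs("data", exist_ok=True)  # make sure the output folder exists

for ticker in tickers:
    print(f"Requesting Data for {ticker}")
    req = requests.get(f"https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol={ticker}&apikey=537201H9R203WT4C&datatype=csv")
    req.raise_for_status()  # stop if the server returned an error status
    print(f"Saving {ticker} Data")

    # create filepath with date string
    output_filepath = os.path.join("data", f"{ticker}_PRICES_DAILY_{today_date}.csv")

    # newline='' keeps the line endings exactly as the server sent them
    with open(output_filepath, 'w', newline='') as f:
        f.write(req.text)

    time.sleep(1)  # pause between requests to be polite to the API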

Requesting Data for GOOG
Saving GOOG Data
Requesting Data for MSFT
Saving MSFT Data
Requesting Data for IBM
Saving IBM Data
Requesting Data for TSLA
Saving TSLA Data
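
The long URL in the loop is a base address followed by `name=value` query parameters joined with `&`. As a sketch, the same URL can be assembled from a dictionary with the standard-library `urllib.parse.urlencode`. The `build_query_url` helper is hypothetical and `YOUR_API_KEY` is a placeholder. With Requests, passing the same dictionary as `requests.get(BASE_URL, params=params)` has the same effect.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://www.alphavantage.co/query"


def build_query_url(ticker, datatype="csv", api_key="YOUR_API_KEY"):
    """Build the request URL; the default api_key is a placeholder, not a real key."""
    params = {
        "function": "TIME_SERIES_DAILY_ADJUSTED",
        "symbol": ticker,
        "apikey": api_key,
        "datatype": datatype,
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url("GOOG")
print(url)

# splitting the URL again recovers the individual parameters
query = parse_qs(urlsplit(url).query)
print(query)
```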


In [None]:
# JSON

# download prices for each stock in the portfolio,
# but this time as JSON
for ticker in tickers:
    print(f"Requesting Data for {ticker}")
    req = requests.get(f"https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol={ticker}&apikey=537201H9R203WT4C&datatype=json")
    req.raise_for_status()  # stop if the server returned an error status
    print(f"Saving {ticker} Data")

    # create filepath with date string
    output_filepath = os.path.join("data", f"{ticker}_PRICES_DAILY_{today_date}.json")
    with open(output_filepath, 'w') as f:
        f.write(req.text)

    time.sleep(1)  # pause between requests to be polite to the API

Requesting Data for GOOG
Saving GOOG Data
Requesting Data for MSFT
