# Web Scraping

Web scraping is useful for getting data from the web when a website doesn't expose a formal API, or when its API doesn't meet your needs.

## Two Approaches

- Basic web client (ex: `requests` library)
    - Used for STATIC web pages rendered on the SERVER SIDE (before content arrives in your browser)
- Headless browser piloted by Python
    - Used when JavaScript is used to assemble a web page DYNAMICALLY on the CLIENT/BROWSER SIDE
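
One rough way to choose between the two approaches is to check whether the text you see in your browser is already present in the raw HTML that a basic web client receives. The sketch below applies that heuristic to two made-up HTML strings; `appears_in_raw_html` is a hypothetical helper written for this example, not part of any library, and a real page may need a more careful check.

```python
def appears_in_raw_html(raw_html: str, expected_text: str) -> bool:
    """Return True if text seen in the browser is already in the raw HTML."""
    return expected_text.lower() in raw_html.lower()


# Made-up sample: the server sent the finished content.
static_page = "<html><body><h1>Quarterly Report</h1></body></html>"

# Made-up sample: the server sent an empty shell that JavaScript fills in later.
dynamic_page = (
    "<html><body><div id='app'></div>"
    "<script src='bundle.js'></script></body></html>"
)

print(appears_in_raw_html(static_page, "Quarterly Report"))   # True
print(appears_in_raw_html(dynamic_page, "Quarterly Report"))  # False
```

If the text is missing from the raw HTML, the page is probably assembled in the browser and a headless browser is the likelier fit.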
    
## Web Technologies

- HyperText Markup Language (HTML): Describes a web page document (headers, paragraphs, links, images, etc)
- Cascading Style Sheets (CSS): Styling code for HTML (colors, font sizes, etc)
    - Uses the concept of "selectors" to specify HTML elements to target
        - For a fun tutorial on CSS selectors: <https://flukeout.github.io/>
- JavaScript: Code that implements behavior for web applications (ex: when I click on this, do that, etc)
- JavaScript Object Notation (JSON): A lightweight text format for describing structured data (it is a data format, not a markup language). One of the primary formats returned by popular web APIs (Instagram, Facebook, ICNDB, etc)
- HyperText Transfer Protocol (HTTP): The application-level network protocol used by browsers and other web clients to retrieve data from web servers.
    - Has actions like:
        - `GET`: Retrieve data from a server.
        - `POST`: Upload / create new data on the server.
        - `PUT`: Modify existing data on a server.
        - `DELETE`: Delete data from a server.
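
To make the CSS selector idea from the list above concrete, here is a minimal sketch that runs a few common selector forms through BeautifulSoup's `select` method. The HTML snippet, its class names, and its URLs are made up for illustration.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document used only for illustration.
html = """
<ul id="menu">
  <li><a class="internal" href="reports.html">Reports</a></li>
  <li><a class="external" href="https://example.com/">Example</a></li>
</ul>
<p><a href="contact.html">Contact</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# Type selector: every <a> element in the document.
all_links = soup.select("a")

# Descendant selector: <a> elements somewhere inside an <li>.
list_links = soup.select("li a")

# Class selector: <a> elements whose class is "external".
external_links = soup.select("a.external")

# ID plus attribute selector: links under #menu that have an href attribute.
menu_links = soup.select("#menu a[href]")

print(len(all_links))                       # 3
print([a.get_text() for a in list_links])   # ['Reports', 'Example']
print([a["href"] for a in external_links])  # ['https://example.com/']
print(len(menu_links))                      # 2
```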
    
## Libraries and Tools

### Static Scraping
- `requests`: The most popular and easy to use Python HTTP client library
- `BeautifulSoup`: Useful for parsing HTML / XML data (often obtained using `requests`) and extracting only the data you're interested in
- `feedparser`: Like BeautifulSoup, but focused on RSS / Atom feed data

### Dynamic Scraping
- `selenium`: A popular, cross-language framework for piloting headless web browsers
- `splinter`: A convenient, higher-level interface to Selenium in Python

## Code Review: Static Scraping

In [15]:
# CODE REVIEW: Static Scraping

# Import common third-party libraries used for web scraping.
## requests: A popular HTTP client library.
## bs4: Beautiful Soup 4, Useful for parsing HTML / XML data
##      (often obtained using requests) 
##      and extracting only the data you're interested in
import requests
from bs4 import BeautifulSoup

# Do an HTTP GET request for the Berkshire Hathaway URL,
# and store the result object from that request in the "response"
# variable.
response = requests.get("http://www.berkshirehathaway.com/")

# NOTE: There are lots of interesting attributes and methods
#       available on the response object. For example, we can check the
#       response.status_code to see whether our request succeeded.
#       Codes from 200 up to (but not including) 400 mean success or
#       redirection; 400 and above are errors.
#       Reference: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
if 200 <= response.status_code < 400:
    print(f"Our HTTP status code indicates success! {response.status_code}")
else:
    print(f"Our HTTP status code indicates an error! {response.status_code}")


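# Parse the HTML content from the response into a format we can
# more easily query and extract data from. Naming the parser explicitly
# ("html.parser" ships with Python) avoids a warning and keeps results
# consistent across machines.
soup = BeautifulSoup(response.content, "html.parser")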

# Select the HTML data we're interested in.
# In the case of the BH website, imagine we wanted:
#   - the TEXT of each link on the main page
#   - the URL for each link on the main page

# We use "CSS selector" notation with the soup.select method
# to extract this data.
links = soup.select('li a')

# To STORE the extracted data for exploration and further use,
# we need to decide on a data structure.
# Because the data we're interested in comes in pairs (text of link, URL),
# a dictionary data structure makes sense here.
# To create our dictionary, we use a DICT COMPREHENSION.
# We populate our dict with pairs (only where there is an href attribute):
# - key: the first string in each link's contents
# - value: the URL corresponding to each link
links_dict = {link.contents[0]: link.attrs['href'] 
              for link in links if link.attrs.get('href')}

Our HTTP status code indicates success! 200
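
HTTP status codes are grouped by their first digit: 2xx means success, 3xx redirection, 4xx a client error, and 5xx a server error. As a small sketch of that grouping, the hypothetical helper `describe_status` below labels a code using only the standard library's `http.HTTPStatus`. It is an illustration, not something `requests` provides.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Label an HTTP status code by its class (the hundreds digit)."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    label = classes.get(code // 100, "unknown")
    try:
        # HTTPStatus raises ValueError for codes it does not know.
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unrecognized"
    return f"{code} {phrase} ({label})"


for code in (200, 301, 400, 404, 500):
    print(describe_status(code))
```

Note that 400 itself is "Bad Request", so it belongs on the error side of any success check.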


In [13]:
# Here we do SOMETHING with our data: print it.
# We could also do other things --- 
# download all URLS, render a report template, 
# send a notification if a value changes, etc!
for text, url in links_dict.items():
    print(f"{text}: {url}")

A Message From Warren E. Buffett: message.html
News Releases: news/2019news.html
Annual & Interim Reports: reports.html
Warren Buffett's Letters to Berkshire 
        Shareholders: letters/letters.html
Special Letters From Warren & Charlie RE:Past, Present and Future: SpecialLetters/WEBCTMLtr.html
Charlie Munger's Letters to Wesco Shareholders: wesco/WescoHome.html
Link 
        to SEC Filings: http://www.sec.gov/cgi-bin/browse-edgar?company=berkshire+hathaway&match=&CIK=&filenum=&State=&Country=&SIC=&owner=exclude&Find=Find+Companies&action=getcompany
Annual Meeting Information: sharehold.html
Links to Berkshire Subsidiary 
        Companies: subs/sublinks.html
Celebrating 50 Years of a Profitable Partnership: https://www.ebay.com/itm/173723452844
Corporate Governance: govern/govern.html
Comparative Rights and Relative Prices of 
        Class A and B Stock: compab.pdf
Sustainability: sustainability/sustainability.html
Berkshire Activewear 
        : http://www.berkshirewear.com
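
The output above shows two things worth tidying before further use: some link text contains embedded newlines and indentation, and most URLs are relative to the site root. The sketch below is one possible cleanup step, assuming the page was fetched from `http://www.berkshirehathaway.com/`. The helper name `clean_links` is hypothetical, and the sample dict holds two entries modeled on the output above.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.berkshirehathaway.com/"

# Two entries modeled on the output above, including the embedded
# newline and indentation in the link text.
scraped = {
    "A Message From Warren E. Buffett": "message.html",
    "Berkshire Activewear \n        ": "http://www.berkshirewear.com",
}


def clean_links(links: dict, base_url: str) -> dict:
    """Collapse whitespace in link text and make every URL absolute."""
    cleaned = {}
    for text, url in links.items():
        # split() with no arguments breaks on any run of whitespace,
        # so joining with single spaces removes newlines and indentation.
        tidy_text = " ".join(text.split())
        # urljoin leaves absolute URLs alone and resolves relative ones.
        cleaned[tidy_text] = urljoin(base_url, url)
    return cleaned


for text, url in clean_links(scraped, BASE_URL).items():
    print(f"{text}: {url}")
```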


## Exercise: Scraping Static Websites

For ONE of the following websites:

- https://drw.com/
- https://gossetx.com/
- https://en.wikipedia.org/wiki/Main_Page

use `requests` and `BeautifulSoup` to retrieve and parse the site, and return a list of the *img* (image) elements. Example code:

```python
import requests

from bs4 import BeautifulSoup

res = requests.get(YOUR_URL)  # replace YOUR_URL with one of the URLs above, as a string
soup = BeautifulSoup(res.text, "html.parser")
images = soup.select('img')
print(images)
```

## Code Review: Retrieving data from a JSON API

In [20]:
# CODE REVIEW: Retrieving data from a JSON API

# import the popular HTTP client library.
import requests

# Define a function for retrieving jokes so that we can REUSE and/or test it easily.
def icndb(num_jokes=3):
    """Retrieve random jokes from the ICNDB API.
    
    Reference: http://www.icndb.com/api/
    """
    # Use an HTTP GET request to retrieve some random jokes.
    res = requests.get(f"http://api.icndb.com/jokes/random/{num_jokes}")
    
    # Get the JSON data from the response object.
    data = res.json()
    
    # Use a LIST COMPREHENSION to create a list of ONLY the joke strings.
    # Return that list from our function.
    return [v['joke'] for v in data['value']]

In [21]:
icndb(5)

['Chuck Norris is not Politically Correct. He is just Correct. Always.',
 "The quickest way to a man's heart is with Chuck Norris' fist.",
 'James Cameron wanted Chuck Norris to play the Terminator. However, upon reflection, he realized that would have turned his movie into a documentary, so he went with Arnold Schwarzenegger.',
 "&quot;Brokeback Mountain&quot; is not just a movie. It's also what Chuck Norris calls the pile of dead ninjas in his front yard.",
 "When Chuck Norris throws exceptions, it's across the room."]

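One of the jokes above still contains the HTML entity `&quot;`, which suggests the joke strings may arrive HTML-escaped. The sketch below parses a hand-written JSON string shaped like the data `icndb()` reads (a top-level `value` list whose items each have a `joke` key) and unescapes each joke with the standard library's `html.unescape`. The `type` and `id` fields and the shortened joke text are assumptions for illustration.

```python
import html
import json

# Hand-written sample shaped like the data icndb() reads:
# a top-level "value" list whose items each carry a "joke" string.
# The "type" and "id" fields are assumptions for illustration.
sample_body = """
{
  "type": "success",
  "value": [
    {"id": 1, "joke": "&quot;Brokeback Mountain&quot; is not just a movie."},
    {"id": 2, "joke": "When Chuck Norris throws exceptions, it's across the room."}
  ]
}
"""

# json.loads turns the JSON text into ordinary Python dicts and lists,
# which is what res.json() does with a response body.
data = json.loads(sample_body)

# Same list comprehension as in icndb(), plus html.unescape to turn
# entities such as &quot; back into plain characters.
jokes = [html.unescape(item["joke"]) for item in data["value"]]

for joke in jokes:
    print(joke)
```
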
## Scraping a Dynamic Website using a Headless Web Browser