Web Scraping for Fun and Profit?
================================

Sometimes you need data and the web is the only place to get it.

This isn't just for a lark: you might have an official data set that you would like to join with some other data which isn't available as a CSV anywhere but is available in some form on the web.

Today we'll cover the basics of web scraping.

What is Scraping
================

Scraping is an ugly name for what is sometimes an ugly process: we use our programming language to download web pages, (maybe) parse them into some kind of Document Object, and then use queries on that object to extract the data we want.

Languages
=========

You can use almost any language to scrape from the web as long as you can do these three steps:

1. issue a request over HTTP (sometimes with some complicated body or header)
2. parse the response (often HTML, sometimes JSON or some other kind of document)
3. interact with the result.
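
As a rough sketch of steps 2 and 3, here is one way to pick a parser based on the response's content type. It uses only the Python standard library, and the byte strings stand in for real responses, so no request is actually made. The names `TitleParser` and `parse_body` are made up for this illustration.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> tag of an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_body(content_type, body):
    """Step 2: turn raw response bytes into something we can query."""
    text = body.decode("utf-8")
    if content_type.startswith("application/json"):
        return json.loads(text)
    parser = TitleParser()
    parser.feed(text)
    return parser.title


# Step 3: interact with the result of each kind of response.
json_result = parse_body("application/json", b'{"score": 42}')
html_result = parse_body("text/html; charset=utf-8",
                         b"<html><head><title>Hello</title></head></html>")
print(json_result["score"], html_result)
```

JSON responses give you ordinary dictionaries and lists straight away; HTML needs a real parser, which is where a library earns its keep.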

Today we will be using a Python module called Beautiful Soup.

HTTP/REST/HTML refresher
========================

The web is a bunch of computers listening for HTTP (or HTTPS) connections. HTTP (hypertext transfer protocol) is a standard for inter-computer communication in which a client (us) and a server (them) exchange messages in the form of documents. These documents adhere to a standard about which we don't need to know too much. What we do need to know is that every message contains a

1. header - contains what you might call meta-data about the request
2. body - contains the document associated with the request

The body can be empty (and often is for simple requests). Most responses contain a body.
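
To make the header/body split concrete, here is a small sketch that builds two requests with the standard library's `urllib.request.Request` and inspects them without sending anything. The URLs are placeholders on `example.com`, not pages used elsewhere in these notes.

```python
from urllib.request import Request

# A request with no body: urllib treats it as a GET.
simple = Request("https://example.com/",
                 headers={"Accept": "text/html"})

# A request with a body: supplying data makes urllib treat it as a POST.
with_body = Request("https://example.com/search",
                    data=b"q=raccoons",
                    headers={"Content-Type": "application/x-www-form-urlencoded"})

for req in (simple, with_body):
    print(req.get_method(), req.full_url)
    print("  header:", req.header_items())
    print("  body:  ", req.data)
```

Note that `urllib` normalizes the capitalization of header names when it stores them, so `Content-Type` is reported as `Content-type`. Header names are case-insensitive in HTTP, so this makes no difference to the server.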

Your browser is a request-generating engine. All it does is make HTTP requests and then render or otherwise utilize the response. When we scrape from the web we use a program other than a browser to make the requests, and instead of rendering the result we parse it and search inside of it.

Many message responses contain HTML in the body.

HTML
----

HTML stands for "Hypertext Markup Language." You are familiar with one markup language via this class: (R)Markdown. Markdown began as a simple way to write (i.e. "mark up") a subset of HTML. People used to write HTML by hand but it is verbose:

```
<!DOCTYPE html>
<html>
<head>
	<title>This is Hello World page</title>
</head>
<body>
 	<h1>Hello World</h1>
</body>
</html>
```

The key idea behind HTML is that the text is "marked up" with "tags" that, in conjunction with a stylesheet, tell the browser how to render a page. We are going to be pulling data from the page so we don't need to worry so much about the way that the document is rendered. But the tags in an HTML document often tell us a great deal about the structure of the data on the page. We can use the tags to "monkey bar" around the document.

Here is a list of HTML tags:

```
div, span, h1, h2, h3, h4, ul, ol, li, a
```

There are others, but these are among the most common. To give web page designers more flexibility, HTML tags often have attributes. These look like:

```
<div id="important-stuff" class="big important">Some very important text</div>
```

Attributes can have any names whatsoever but "id" and "class" are special and very common. The "id" attribute is a unique identifier for an element and is thus (often) useful for finding a specific piece of information on a page. "class" has to do with CSS but also typically picks out a group of similar pieces of content.

If you are lucky, the data you are interested in lives in a specific tag with a known id or in a group of elements with a known class.
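
To see what "find the element with this id" or "find everything with this class" amounts to, here is a minimal sketch using only the standard library's `html.parser`. Beautiful Soup does this (and much more) for us, so this is an illustration of the idea rather than how you would scrape in practice. The `AttributeFinder` class is made up for this example, it only handles simple elements with no tags nested inside them, and the second and third `div` lines are invented sample data.

```python
from html.parser import HTMLParser


class AttributeFinder(HTMLParser):
    """Records the text of elements matching a given id or class."""

    def __init__(self, want_id=None, want_class=None):
        super().__init__()
        self.want_id = want_id
        self.want_class = want_class
        self.capturing = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attrs.get("class") or "").split()
        id_hit = self.want_id is not None and attrs.get("id") == self.want_id
        class_hit = self.want_class is not None and self.want_class in classes
        if id_hit or class_hit:
            self.capturing = True
            self.matches.append("")

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current capture.
        self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.matches[-1] += data


html = """
<div id="important-stuff" class="big important">Some very important text</div>
<div class="important">Also important</div>
<div class="boring">Skip me</div>
"""

by_id = AttributeFinder(want_id="important-stuff")
by_id.feed(html)
by_class = AttributeFinder(want_class="important")
by_class.feed(html)

print(by_id.matches)
print(by_class.matches)
```

The id lookup yields a single element while the class lookup yields a group, which is exactly the distinction described above.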

The first step in any scraping exercise is to look at the source of the page you are interested in scraping from.

An Example
----------



In [1]:
import requests as r
from bs4 import BeautifulSoup
r.get("https://procyonic.org").content

b'<!DOCTYPE html>\n<html>\n<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n\t<title>Raccoons have a brute cleverness.</title>\n\t<link href="css/toast.css" rel="stylesheet" type="text/css" />\n\t<link href="css/index.css" rel="stylesheet" type="text/css" />\n\t<link href="/static/procyonic-favicon.ico" rel="shortcut icon" />\n</head>\n<body>\n<div class="container">\n<div class="grid">\n<div class="unit span-grid">\n<div class="pre-title">(<a href="mailto:vincent.toups@gmail.com">Vincent Toups</a>)</div>\n\n<div class="pre-title">(<a href="cv.html">Curriculum Vitae</a>)</div>\n\n<h1 class="title">-Procyonic-</h1>\n</div>\n\n<div class="unit span-grid"><canvas class="centered" id="skull"></canvas></div>\n\n<div class="unit one-of-three content-column about">\n<h1 class="sub-title">-Being-</h1>\n\n<figure>\n<p class="centered-image-holder"><img class="for-storage" id="hidden-about-image" src="static/vincent-small.png" /></p>\n\n<figcaption>(art by <a href="http

Now that we can make requests let's wrap everything up into a handy dandy wrapper.

In [2]:
def get_and_parse(url):
    response = r.get(url);
    if response.status_code == 200:
        return BeautifulSoup(response.content, "html5lib")
    else:
        return None
    
get_and_parse("https://procyonic.org")

<!DOCTYPE html>
<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
	<title>Raccoons have a brute cleverness.</title>
	<link href="css/toast.css" rel="stylesheet" type="text/css"/>
	<link href="css/index.css" rel="stylesheet" type="text/css"/>
	<link href="/static/procyonic-favicon.ico" rel="shortcut icon"/>
</head>
<body>
<div class="container">
<div class="grid">
<div class="unit span-grid">
<div class="pre-title">(<a href="mailto:vincent.toups@gmail.com">Vincent Toups</a>)</div>

<div class="pre-title">(<a href="cv.html">Curriculum Vitae</a>)</div>

<h1 class="title">-Procyonic-</h1>
</div>

<div class="unit span-grid"><canvas class="centered" id="skull"></canvas></div>

<div class="unit one-of-three content-column about">
<h1 class="sub-title">-Being-</h1>

<figure>
<p class="centered-image-holder"><img class="for-storage" id="hidden-about-image" src="static/vincent-small.png"/></p>

<figcaption>(art by <a href="https://www.facebook.com/girlowar">Katie 

The result above looks like text, but what BeautifulSoup gives us is a Document Object that we can search. But let's pick a more interesting target.

There are varying levels of complexity associated with scraping different web pages. Let's stick with something simple: Hacker News.

In [3]:
repr(get_and_parse("https://news.ycombinator.com"))[0:1000]

'<html lang="en" op="news"><head><meta content="origin" name="referrer"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/><link href="news.css?fVvsFovZkUlGSpAVgifJ" rel="stylesheet" type="text/css"/>\n        <link href="favicon.ico" rel="shortcut icon"/>\n          <link href="rss" rel="alternate" title="RSS" type="application/rss+xml"/>\n        <title>Hacker News</title></head><body><center><table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">\n        <tbody><tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tbody><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a></td>\n                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>\n              <a href="newest">new</a> | <a href=

In [5]:
page = get_and_parse("https://news.ycombinator.com")

A cursory examination of the HTML of the YCombinator Page tells us we are interested in link (anchor) tags with class "storylink". We can use the "find_all" method on the page object to get all of the links like this:

In [35]:
page.find_all("a")[1:10]

[<a href="news">Hacker News</a>,
 <a href="newest">new</a>,
 <a href="front">past</a>,
 <a href="newcomments">comments</a>,
 <a href="ask">ask</a>,
 <a href="show">show</a>,
 <a href="jobs">jobs</a>,
 <a href="submit">submit</a>,
 <a href="login?goto=news">login</a>]

But this isn't really what we want: there are too many links here. We want to restrict down to a single class of links.

In [6]:
page.find_all("a",class_="storylink")[1:10]

[<a class="storylink" href="https://www.reuters.com/article/idUSKBN27W2MB">Twitter names famed hacker 'Mudge' as head of security</a>,
 <a class="storylink" href="https://www.themvpsprint.com/p/how-and-when-to-acquire-saas-users">My side projects always fail. This time is different.</a>,
 <a class="storylink" href="https://stopa.io/post/269">What Gödel Discovered</a>,
 <a class="storylink" href="https://tomcam.github.io/postgres/">PostgreSQL psql command line tutorial and cheat sheet</a>,
 <a class="storylink" href="https://www.tesorio.com/careers#job-openings" rel="nofollow">Tesorio Is Hiring a Senior Product Manager and Senior Engineers</a>,
 <a class="storylink" href="https://mullvad.net/en/blog/2020/11/16/big-no-big-sur-mullvad-disallows-apple-apps-bypass-firewall/">Big no on Big Sur: Mullvad disallows Apple apps to bypass firewall</a>,
 <a class="storylink" href="https://scarybeastsecurity.blogspot.com/2020/11/reverse-engineering-forgotten-1970s.html">Reverse engineering a forgott

So far so good. Perhaps now we want to save the url and the story title to a csv file somewhere.

In [17]:
!mkdir -p derived_data
import pandas as pd

anchors = page.find_all("a",class_="storylink")

df = pd.DataFrame([(el.get("href"),
                    el.getText()) 
                   for el in anchors], 
                  columns=["url","headline"])
df.to_csv("derived_data/hackernews_basic_headlines.csv",index=False)

Congratulations! You now know how to do basic web scraping!

But there is a lot more to this activity. For instance: We might want to scrape more information about these submissions (like the submitter name and how many votes).

We could search for each element of each type separately but we'd risk subtle errors in ordering. The better thing to do is iterate over an element that contains everything we want (if possible).
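
Here is a small illustration of the ordering problem, using made-up rows rather than real page elements. One entry (a job ad) has no score, which is enough to knock two separately collected lists out of step.

```python
# Made-up rows standing in for front-page entries; the job ad has no score.
rows = [
    {"headline": "Story A", "score": 120},
    {"headline": "Job ad B", "score": None},
    {"headline": "Story C", "score": 45},
]

# Approach 1: collect each field separately, as two independent searches would.
headlines = [row["headline"] for row in rows]
scores = [row["score"] for row in rows if row["score"] is not None]
misaligned = list(zip(headlines, scores))

# Approach 2: walk the containing records and read both fields from each one.
aligned = [(row["headline"], row["score"]) for row in rows]

print(misaligned)
print(aligned)
```

In the first approach the job ad silently inherits Story C's score and Story C disappears altogether, with no error raised. Iterating over the containing element keeps each field attached to the right entry.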



In [7]:
athings = page.find_all("tr",class_="athing")
athings[0:3]

[<tr class="athing" id="25111726">
       <td align="right" class="title" valign="top"><span class="rank">1.</span></td>      <td class="votelinks" valign="top"><center><a href="vote?id=25111726&amp;how=up&amp;goto=news" id="up_25111726"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><a class="storylink" href="https://github.blog/2020-11-16-standing-up-for-developers-youtube-dl-is-back/">YouTube-dl's repository has been restored</a><span class="sitebit comhead"> (<a href="from?site=github.blog"><span class="sitestr">github.blog</span></a>)</span></td></tr>,
 <tr class="athing" id="25115754">
       <td align="right" class="title" valign="top"><span class="rank">2.</span></td>      <td class="votelinks" valign="top"><center><a href="vote?id=25115754&amp;how=up&amp;goto=news" id="up_25115754"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><a class="storylink" href="https://www.reuters.com/article/idUSKBN27W2MB">Twitter nam

In [8]:
el = athings[0]
el.find("a",class_="storylink").get("href")
for f in el.next_sibling.find_all("a"):
    print(f.getText())


fusl
7 hours ago
hide
500 comments


These elements don't quite contain everything we want but let's write some code to extract what we can. Then we'll solve getting the voting information, which is actually in the next element.

In [78]:
def extract_data(element):
    a = element.find("a",class_="storylink");
    url = a.get("href");
    headline = a.getText();
    ## this business with the indexing is to remove () from the sitebit
    site = element.find("span",class_="sitebit").getText().strip()[1:-1];
    return (url, headline, site)

extract_data(el)
    

('https://github.blog/2020-11-16-standing-up-for-developers-youtube-dl-is-back/',
 "YouTube-dl's repository has been restored",
 'github.blog')

This is good but we can do better - the key thing is to realize that the voting data we want is always one element after the one we searched on.

Consider:

In [83]:
def extract_data(element):
    a = element.find("a",class_="storylink");
    url = a.get("href");
    headline = a.getText();
    ## this business with the indexing is to remove () from the sitebit
    site = element.find("span",class_="sitebit").getText().strip()[1:-1];
    ns = element.next_sibling
    score = int(ns.find("span",class_="score").getText().split(" ")[0])
    # we punt on parsing age into a number lest very old posts are in a different unit
    age = ns.find("span",class_="age").getText()
    return (url, headline, site, score, age)

extract_data(el)

('https://github.blog/2020-11-16-standing-up-for-developers-youtube-dl-is-back/',
 "YouTube-dl's repository has been restored",
 'github.blog',
 1449,
 '5 hours ago')

We really want to get the number of comments out of this scrape too but that tag is without a class that picks it out. That means we need to search for it explicitly.

In [13]:
def find_comments(element):
    anchors = element.find_all("a");
    result = None;
    for a in anchors:
        txt = a.getText();
        if 'comments' in txt:
            result = txt;
            break;
    if result:
        return result;
    else:
        return 0;

def get_score(element):
    score = element.find("span",class_="score");
    if score:
        return score.getText().split(" ")[0];
    else:
        return 0;
    

def extract_data(element):
    a = element.find("a",class_="storylink");
    # Bail out if something went wrong
    if not a:
        print("Error on element")
        print(element)
        return None
    url = a.get("href");
    headline = a.getText();
    ## this business with the indexing is to remove () from the sitebit
    site = element.find("span",class_="sitebit").getText().strip()[1:-1];
    ns = element.next_sibling
    score = get_score(ns);
    # we punt on parsing age into a number lest very old posts are in a different unit
    age = ns.find("span",class_="age").getText()
    comments = find_comments(ns);
    if comments:
        comments = int(comments.split()[0].strip())
    return (url, headline, site, score, age, comments)

extract_data(el)

('https://www.tesorio.com/careers#job-openings',
 'Tesorio Is Hiring a Senior Product Manager and Senior Engineers',
 'tesorio.com',
 0,
 '9 minutes ago',
 0)
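
Both the score and the comment count are a number at the front of a label, and `get_score` above hands the number back as text rather than as an integer. One possible way to tidy this up is a single helper that pulls out the leading integer and falls back to a default. This is a sketch: `leading_int` is a name invented here, and the assumptions that a zero-comment story is labelled `discuss` and that a label may contain a non-breaking space are mine, not something checked against the page above.

```python
import re


def leading_int(label, default=0):
    """Return the integer at the start of a label such as '500 comments'."""
    match = re.match(r"\s*(\d+)", label)
    return int(match.group(1)) if match else default


print(leading_int("500 comments"))
print(leading_int("1449 points"))
print(leading_int("506\xa0comments"))  # non-breaking space between the parts
print(leading_int("discuss"))          # no number at all, so the default
```

Because the regular expression only looks at the leading digits, it doesn't matter which kind of whitespace separates the number from the word.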

Now we can put it all together and dump the data:

In [20]:
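rows = [extract_data(el) for el in athings]
# extract_data returns None for elements it could not parse, so drop those
df = pd.DataFrame([row for row in rows if row is not None],
                  columns="url headline site score age comments".split())
df.to_csv("derived_data/hackernews_headlines.csv", index=False)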

Recursive Scraping
------------------

There is already a lot we can do with this method but what if you want to scrape more than just the first page?

The answer is to pull URLs out of the thing you are parsing and parse those URLs too. On Hacker News we might want to grab the first N pages. There are two approaches here: reverse engineer the URL or extract the appropriate href from the right anchor tag.

In [32]:
import time

def extract_hn_data(starting_page, n):
    if n == 0:
        return [];
    else:
        page = get_and_parse(starting_page);
        athings = page.find_all("tr",class_="athing");
        scraped = [extract_data(el) for el in athings];
        more = "https://news.ycombinator.com/" + page.find("a",class_="morelink").get("href");
        time.sleep(1)
        scraped.extend(extract_hn_data(more, n - 1));
        return scraped;

In [34]:
extract_hn_data("https://news.ycombinator.com",3)[0:5]

[('https://github.blog/2020-11-16-standing-up-for-developers-youtube-dl-is-back/',
  "YouTube-dl's repository has been restored",
  'github.blog',
  '1742',
  '7 hours ago',
  506),
 ('https://www.reuters.com/article/idUSKBN27W2MB',
  "Twitter names famed hacker 'Mudge' as head of security",
  'reuters.com',
  '87',
  '2 hours ago',
  8),
 ('https://stopa.io/post/269',
  'What Gödel Discovered',
  'stopa.io',
  '101',
  '2 hours ago',
  22),
 ('https://www.themvpsprint.com/p/how-and-when-to-acquire-saas-users',
  'My side projects always fail. This time is different.',
  'themvpsprint.com',
  '148',
  '4 hours ago',
  71),
 ('https://mullvad.net/en/blog/2020/11/16/big-no-big-sur-mullvad-disallows-apple-apps-bypass-firewall/',
  'Big no on Big Sur: Mullvad disallows Apple apps to bypass firewall',
  'mullvad.net',
  '102',
  '2 hours ago',
  23)]
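
The two approaches to finding the next page can be sketched side by side. The `?p=N` pattern is an assumption about how the site numbers its pages, and `page_url` and `resolve` are names made up for this example. Gluing the site root onto the href, as the function above does, only works when the href is relative to the root; `urljoin` from the standard library copes with relative and absolute hrefs alike.

```python
from urllib.parse import urljoin, urlencode

BASE = "https://news.ycombinator.com/news"


def page_url(n):
    """Approach 1: build the URL for page n from an assumed ?p=N pattern."""
    return BASE + "?" + urlencode({"p": n})


def resolve(href, current_url=BASE):
    """Approach 2: resolve an href taken from the page against its URL."""
    return urljoin(current_url, href)


print(page_url(2))
print(resolve("news?p=2"))
print(resolve("?p=2"))
print(resolve("https://example.com/elsewhere"))
```

Extracting the href is usually the more robust choice because it keeps working if the site changes its numbering scheme, so long as the link itself is still there.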

Webscraping Law and Etiquette
=============================

Web scraping is something of a grey area legally, despite this Towards Data Science post called ["Web Scraping is Now Legal"](https://medium.com/@tjwaterman99/web-scraping-is-now-legal-6bf0e5730a78). It is probably ok to scrape any page you don't need to log in for. If you do need to log in, scraping might be considered malicious activity.

Some things to consider: 

1. Find out if the site has an API that you can use instead. There may even be a library in R or Python to interact with the site directly.
2. Limit the rate at which you make requests (like we did above). A nice rule of thumb is to think about how fast a human would interact with the site and stick to that range.
3. Cache your results. I won't implement that here, but in this example we could save every response to disk the first time we make a request. When we ask for the same URL again we can read the result from disk instead of making another HTTP request.
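
For the curious, here is a minimal sketch of points 2 and 3 together: a disk cache keyed on the URL that only pauses after a real request. The function name `cached_fetch`, the file layout, and the stand-in `fake_fetch` are all assumptions made for this example. The fetcher is passed in as an argument so the demonstration runs without touching the network.

```python
import hashlib
import tempfile
import time
from pathlib import Path


def cached_fetch(url, fetch, cache_dir, delay=1.0):
    """Return the body for url, calling fetch(url) only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_bytes()
    body = fetch(url)
    path.write_bytes(body)
    # Pause only after a real request, to stay near human browsing speed.
    time.sleep(delay)
    return body


# Demonstration with a stand-in fetcher so that no network is needed.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"<html><title>stub</title></html>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.com/a", fake_fetch, tmp, delay=0)
    second = cached_fetch("https://example.com/a", fake_fetch, tmp, delay=0)

print(len(calls))
```

In real use you would pass something like `lambda url: r.get(url).content` as the fetcher and keep the default one second delay. A cache like this never expires, which suits a one-off collection but not a page you expect to change.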

Other Notes
===========

By its very nature web scraping is brittle. I would hesitate to build a tool which depended on scraping a web site over and over. Web sites change all the time.  It is better to pick a target dataset and collect as much as you can.

You will also want to think carefully about duplicate entries.

We didn't talk about duplicates in class (we should have) but they are one of the biggest ways you can screw up a data science project. Consider: if your data set contains a significant number of duplicates then your train/test split is necessarily invalid: some of the examples you showed the model during training are in your test set.

Duplicates are very common in scraped datasets because pages are updated in real time. News items on the front page of Hacker News might be on page two one second later, which means we record them twice. I've also encountered near-duplicates when scraping Yahoo Answers because that service appears to link to previous versions of the same question. You may need a fairly sophisticated deduplication method to detect this kind of thing. It is best to be very aggressive with duplicates when you can be.
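
The simplest case, exact duplicates across pages, can be handled by keeping the first row seen for each key. Here is a sketch with made-up rows in the same `(url, headline)` shape as our scrape; `dedupe` is a name invented for this example.

```python
def dedupe(rows, key):
    """Keep the first row seen for each key, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            unique.append(row)
    return unique


# Made-up rows: the same story shows up on two consecutive pages.
page_one = [("https://example.com/a", "Story A"),
            ("https://example.com/b", "Story B")]
page_two = [("https://example.com/b", "Story B"),
            ("https://example.com/c", "Story C")]

unique_rows = dedupe(page_one + page_two, key=lambda row: row[0])
print(len(unique_rows))
```

If the data is already in a pandas data frame, `df.drop_duplicates(subset=["url"])` does the same job. Neither version catches near-duplicates, such as the same story under a slightly different URL; those need a fuzzier comparison.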

Class is Over
=============

This is our last class! 

By now you've all been exposed to as broad a survey of data science tools and methods as I could muster. You've all done a great job with a difficult and still nascent subject and I hope that I've given you a decent set of hand holds for your future work.

Here is my advice:

1. Practice - get out there and explore some data.
2. Learn to use AWS or Azure web services so that you can easily scale a data science project if you need to. I've taught you most of what you need to use those services without learning too much more.
3. Don't forget to use git, even on your own projects.
4. If you want to get programming at a deeper level, learn one or more of these languages: Scheme, J, Forth, Smalltalk.

Thank you so much for being my first set of students. I hope my learning curve hasn't made your learning curve too much steeper. 

Feel free to reach out for advice at any time:

toups@email.unc.edu
