## Crawlers

Code for this lab is taken almost entirely, with modifications, from Brett Slatkin's PyCon 2014 talk, since it provides a beautiful illustration of the entire process.

### Synchronous Blocking Crawler

This code, taken from Brett's talk, is provided as an example of a synchronous, single-threaded crawler that you will make asynchronous.

In [1]:
from urllib.parse import urljoin
from urllib.parse import urlparse
from urllib.parse import urlunparse
import re
import requests
# Raw strings keep regex escapes such as \s from being read as (invalid) string escapes
URL_EXPR = re.compile(
    r'([a-zA-Z]+\s*=\s*["\'])'   # Tag attribute: href="
    r'(?P<url>'
        r'((http(s?):)?'         # Optional scheme
        r'//[^"\'\s\\</]+)?'     # Optional domain
        r'/[^"\'\s\\<]*'         # Required path
    r')')



In [2]:
def canonicalize(url):
    parts = list(urlparse(url))
    if parts[2] == '':
        parts[2] = '/'  # Empty path equals root path
    parts[5] = ''       # Erase fragment
    return urlunparse(parts)
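
As a quick check of what `canonicalize` does, the sketch below repeats the function and applies it to a few URLs. The sample URLs are chosen purely for illustration.

```python
from urllib.parse import urlparse, urlunparse

def canonicalize(url):
    parts = list(urlparse(url))
    if parts[2] == '':
        parts[2] = '/'  # Empty path equals root path
    parts[5] = ''       # Erase fragment
    return urlunparse(parts)

# A bare domain gains a root path
root = canonicalize('http://www.xkcd.com')           # -> 'http://www.xkcd.com/'
# A fragment is dropped, so both spellings count as the same page
no_fragment = canonicalize('http://www.xkcd.com/353#top')  # -> 'http://www.xkcd.com/353'
# A relative path passes through unchanged
relative = canonicalize('/about')                    # -> '/about'
print(root, no_fragment, relative)
```

Canonicalizing before adding URLs to a "seen" set is what keeps the crawler from fetching the same page twice under slightly different spellings.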

Notice the quick-and-dirty use of `assert` statements here to raise exceptions if something goes wrong. The calling code should catch generic exceptions.

In [3]:
def fetch(url):
    print("Doing", url)
    response = requests.get(url)
    assert response.status_code == 200
    data = response.content  # get the body as bytes
    assert data
    return data.decode('utf-8')
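
To make the error-handling contract concrete, here is a minimal sketch that avoids the network entirely. `decode_body` and `safe_decode` are hypothetical stand-ins, not part of the lab code: the first applies the same two checks as `fetch`, and the second plays the role of the calling code that catches generic exceptions.

```python
def decode_body(status_code, body):
    # Hypothetical stand-in for fetch(): same checks, no network call
    assert status_code == 200
    assert body
    return body.decode('utf-8')

def safe_decode(status_code, body):
    # The caller catches Exception, which covers AssertionError
    # as well as UnicodeDecodeError from non-text bodies
    try:
        return decode_body(status_code, body)
    except Exception:
        return None

ok = safe_decode(200, b'hello')          # decoded text
not_found = safe_decode(404, b'hello')   # failed status check
empty = safe_decode(200, b'')            # failed empty-body check
binary = safe_decode(200, b'\xff\xfe')   # not valid UTF-8
print(ok, not_found, empty, binary)
```

Keep in mind that `assert` statements are removed when Python runs with the `-O` flag, so this style suits a lab exercise better than production code.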


In [4]:
fetch("http://www.xkcd.com/353")

Doing http://www.xkcd.com/353




For simplicity, we stay on the same site for now. You can skim this code: it just uses regular expressions to extract the URLs on the page that belong to the same domain.

In [5]:
def same_domain(a, b):
    parsed_a = urlparse(a)
    parsed_b = urlparse(b)
    if parsed_a.netloc == parsed_b.netloc:
        return True
    if (parsed_a.netloc == '') ^ (parsed_b.netloc == ''):  # Relative paths
        return True
    return False

In [6]:
def extract(url):
    data = fetch(url)
    found_urls = set()
    for match in URL_EXPR.finditer(data):
        found = canonicalize(match.group('url'))
        #print(found)
        if same_domain(url, found):
            found_urls.add(urljoin(url, found))
    return url, len(data), sorted(found_urls)

In [7]:
extract("http://www.xkcd.com/353")[2]

Doing http://www.xkcd.com/353


['http://www.xkcd.com/',
 'http://www.xkcd.com/1/',
 'http://www.xkcd.com/150/',
 'http://www.xkcd.com/162/',
 'http://www.xkcd.com/352/',
 'http://www.xkcd.com/354/',
 'http://www.xkcd.com/556/',
 'http://www.xkcd.com/688/',
 'http://www.xkcd.com/730/',
 'http://www.xkcd.com/about',
 'http://www.xkcd.com/archive',
 'http://www.xkcd.com/atom.xml',
 'http://www.xkcd.com/license.html',
 'http://www.xkcd.com/rss.xml',
 'http://www.xkcd.com/s/919f27.ico',
 'http://www.xkcd.com/s/b0dcca.css']
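
The extraction step can also be tried without any network access. The sketch below repeats the helpers and runs them on a small hand-written HTML string; `extract_from_text` and the sample HTML are assumptions made for illustration, not part of the lab code.

```python
import re
from urllib.parse import urljoin, urlparse, urlunparse

URL_EXPR = re.compile(
    r'([a-zA-Z]+\s*=\s*["\'])'   # Tag attribute: href="
    r'(?P<url>'
        r'((http(s?):)?'         # Optional scheme
        r'//[^"\'\s\\</]+)?'     # Optional domain
        r'/[^"\'\s\\<]*'         # Required path
    r')')

def canonicalize(url):
    parts = list(urlparse(url))
    if parts[2] == '':
        parts[2] = '/'
    parts[5] = ''
    return urlunparse(parts)

def same_domain(a, b):
    parsed_a, parsed_b = urlparse(a), urlparse(b)
    if parsed_a.netloc == parsed_b.netloc:
        return True
    # Exactly one side has no domain: treat it as a relative path
    return (parsed_a.netloc == '') ^ (parsed_b.netloc == '')

def extract_from_text(url, data):
    # Hypothetical variant of extract() that takes the page text directly
    found_urls = set()
    for match in URL_EXPR.finditer(data):
        found = canonicalize(match.group('url'))
        if same_domain(url, found):
            found_urls.add(urljoin(url, found))
    return sorted(found_urls)

html = ('<a href="/about">About</a> '
        '<a href="http://www.xkcd.com/1/#top">First</a> '
        '<a href="http://example.com/other">Elsewhere</a>')
links = extract_from_text('http://www.xkcd.com/353', html)
print(links)
```

The relative link is resolved against the page URL, the fragment is stripped, and the link to the other domain is dropped.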

In [8]:
def extract_multi(to_fetch, seen_urls):
    results = []
    for url in to_fetch:
        if url in seen_urls: 
            continue
        seen_urls.add(url)
        try:
            results.append(extract(url))
        except Exception:
            continue
    return results


def crawl(start_url, max_depth=1):
    seen_urls = set()
    to_fetch = [canonicalize(start_url)]
    results = []
    for depth in range(max_depth + 1):
        batch = extract_multi(to_fetch, seen_urls)
        to_fetch = []
        for url, datalen, found_urls in batch:
            results.append((depth, url, datalen))
            to_fetch.extend(found_urls)

    return results

In [9]:
cr = crawl("http://www.xkcd.com/353")
cr

Doing http://www.xkcd.com/353
Doing http://www.xkcd.com/
Doing http://www.xkcd.com/1/
Doing http://www.xkcd.com/150/
Doing http://www.xkcd.com/162/
Doing http://www.xkcd.com/352/
Doing http://www.xkcd.com/354/
Doing http://www.xkcd.com/556/
Doing http://www.xkcd.com/688/
Doing http://www.xkcd.com/730/
Doing http://www.xkcd.com/about
Doing http://www.xkcd.com/archive
Doing http://www.xkcd.com/atom.xml
Doing http://www.xkcd.com/license.html
Doing http://www.xkcd.com/rss.xml
Doing http://www.xkcd.com/s/919f27.ico
Doing http://www.xkcd.com/s/b0dcca.css


[(0, 'http://www.xkcd.com/353', 6924),
 (1, 'http://www.xkcd.com/', 6219),
 (1, 'http://www.xkcd.com/1/', 6399),
 (1, 'http://www.xkcd.com/150/', 6904),
 (1, 'http://www.xkcd.com/162/', 6914),
 (1, 'http://www.xkcd.com/352/', 6540),
 (1, 'http://www.xkcd.com/354/', 6400),
 (1, 'http://www.xkcd.com/556/', 7977),
 (1, 'http://www.xkcd.com/688/', 7319),
 (1, 'http://www.xkcd.com/730/', 11800),
 (1, 'http://www.xkcd.com/about', 7649),
 (1, 'http://www.xkcd.com/archive', 103648),
 (1, 'http://www.xkcd.com/atom.xml', 2092),
 (1, 'http://www.xkcd.com/license.html', 2558),
 (1, 'http://www.xkcd.com/rss.xml', 2022),
 (1, 'http://www.xkcd.com/s/b0dcca.css', 3487)]

### 1. Synchronous crawler, async style

(using `yield from`)

Just like in the lecture, let us slowly bring in the async technology, still keeping a synchronous crawler going. This means that we'll have one `yield from` after another.
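
To see what "one `yield from` after another" means mechanically, here is a small sketch with plain generators rather than asyncio (the `@asyncio.coroutine` decorator used in this section was removed in Python 3.11, so plain generators keep the example runnable everywhere). `fake_fetch`, `fetch_two`, and `run` are hypothetical names introduced only for this illustration.

```python
def fake_fetch(url):
    # Pause once, at the point where real code would wait on I/O
    yield ('waiting', url)
    return 'page for ' + url

def fetch_two(first, second):
    a = yield from fake_fetch(first)    # runs to completion first
    b = yield from fake_fetch(second)   # starts only after the first is done
    return [a, b]

def run(gen):
    # Minimal driver standing in for an event loop:
    # resume the generator until it returns, logging each pause
    log = []
    try:
        while True:
            log.append(next(gen))
    except StopIteration as stop:
        return log, stop.value

log, result = run(fetch_two('/a', '/b'))
print(log)
print(result)
```

The pauses appear strictly in order, which is exactly the serialized behavior we keep for now while switching to async syntax.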

We write the fetcher async now:

In [19]:
import asyncio, aiohttp

@asyncio.coroutine
def fetch_async(url):
    print("Doing", url)
    response = yield from aiohttp.request('GET', url)
    try:
        assert response.status == 200
        data = yield from response.read()
        assert data
        return data.decode('utf-8')
    finally:
        response.close()

Next, write the extractor:

In [20]:
@asyncio.coroutine
def extract_async(url):
    data = yield from fetch_async(url)
    found_urls = set()
    for match in URL_EXPR.finditer(data):
        found = canonicalize(match.group('url'))
        if same_domain(url, found):
            found_urls.add(urljoin(url, found))
    return url, data, sorted(found_urls)



We wrap the top-level coroutine in a task. Since a task is a future, we can also read its result with `future.result()` once it has completed.

In [21]:
future = asyncio.Task(extract_async('http://www.xkcd.com/353'))
# future = extract_async('http://www.xkcd.com/353')
# Passing the bare coroutine to the loop also works, but then you
# cannot read the result back with future.result()

loop = asyncio.get_event_loop()
loop.run_until_complete(future)
# loop.close()  # only do this outside a REPL/notebook: a closed loop cannot be reused
future.result()

Doing http://www.xkcd.com/353
Doing http://www.xkcd.com/
Doing http://www.xkcd.com/
Doing http://www.xkcd.com/
Doing http://www.xkcd.com/1/


('http://www.xkcd.com/353',
 ['http://www.xkcd.com/',
  'http://www.xkcd.com/1/',
  'http://www.xkcd.com/150/',
  'http://www.xkcd.com/162/',
  'http://www.xkcd.com/352/',
  'http://www.xkcd.com/354/',
  'http://www.xkcd.com/556/',
  'http://www.xkcd.com/688/',
  'http://www.xkcd.com/730/',
  'http://www.xkcd.com/about',
  'http://www.xkcd.com/archive',
  'http://www.xkcd.com/atom.xml',
  'http://www.xkcd.com/license.html',
  'http://www.xkcd.com/rss.xml',
  'http://www.xkcd.com/s/919f27.ico',
  'http://www.xkcd.com/s/b0dcca.css'])
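
The task-is-a-future point can be checked in isolation. The following is a minimal sketch with modern `async def` syntax and a private event loop, so that closing the loop is safe; `answer` is a hypothetical coroutine standing in for `extract_async`.

```python
import asyncio

async def answer():
    # Stand-in for a coroutine that does real work
    await asyncio.sleep(0)
    return 42

loop = asyncio.new_event_loop()       # private loop, safe to close afterwards
task = loop.create_task(answer())     # wrap the coroutine in a Task
returned = loop.run_until_complete(task)
value = task.result()                 # same value, read from the finished task
is_future = isinstance(task, asyncio.Future)
loop.close()
print(returned, value, is_future)
```

`run_until_complete` already returns the result, so holding on to the task matters mainly when you want to inspect it later.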

### 2. Write the multi-extractor and crawler

Note that although you are writing the multi-extractor in async syntax, the `yield from`s are still serialized: each extraction finishes before the next one starts.

In [22]:
@asyncio.coroutine
def extract_multi_async(to_fetch, seen_urls):
    results = []
    for url in to_fetch:
        if url in seen_urls: continue
        seen_urls.add(url)
        try:
            results.append((yield from extract_async(url)))
        except Exception:
            continue
    return results
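
The serialization is easy to observe with a sketch that records when each job starts and ends. It uses modern `async def`/`await` syntax, and `fake_extract` is a hypothetical stand-in for `extract_async` that sleeps instead of fetching.

```python
import asyncio

events = []

async def fake_extract(url, delay):
    # Sleep instead of hitting the network
    events.append(('start', url))
    await asyncio.sleep(delay)
    events.append(('end', url))
    return url

async def extract_serial(jobs):
    results = []
    for url, delay in jobs:
        # Awaiting inside the loop: the next job cannot start
        # until this one has finished
        results.append(await fake_extract(url, delay))
    return results

loop = asyncio.new_event_loop()
serial_results = loop.run_until_complete(
    extract_serial([('/slow', 0.02), ('/fast', 0.01)]))
loop.close()
print(events)
```

Even though `/fast` would finish sooner, it does not start until `/slow` has ended.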

In [23]:
@asyncio.coroutine
def crawl_async(start_url, max_depth):
    seen_urls = set()
    to_fetch = [canonicalize(start_url)]
    results = []
    for depth in range(max_depth + 1):
        batch = yield from extract_multi_async(to_fetch, seen_urls)
        to_fetch = []
        for url, data, found_urls in batch:
            results.append((depth, url, data))
            to_fetch.extend(found_urls)

    return results

We run the entire crawler now:

In [27]:
future = asyncio.Task(crawl_async('http://www.xkcd.com/353', max_depth=1))
loop = asyncio.get_event_loop()
loop.run_until_complete(future)
future.result()

Doing http://www.xkcd.com/353
Doing http://www.xkcd.com/
Doing http://www.xkcd.com/s/919f27.ico
Doing http://www.xkcd.com/
Doing http://www.xkcd.com/1/
Doing http://www.xkcd.com/s/b0dcca.css
Doing http://www.xkcd.com/
Doing http://www.xkcd.com/1/
Doing http://www.xkcd.com/150/
Doing http://www.xkcd.com/150/
Doing http://www.xkcd.com/162/
Doing http://www.xkcd.com/1/
Doing http://www.xkcd.com/150/
Doing http://www.xkcd.com/162/
Doing http://www.xkcd.com/352/
Doing http://www.xkcd.com/162/
Doing http://www.xkcd.com/352/
Doing http://www.xkcd.com/354/
Doing http://www.xkcd.com/352/
Doing http://www.xkcd.com/354/
Doing http://www.xkcd.com/354/
Doing http://www.xkcd.com/556/
Doing http://www.xkcd.com/556/
Doing http://www.xkcd.com/688/
Doing http://www.xkcd.com/556/
Doing http://www.xkcd.com/688/
Doing http://www.xkcd.com/688/
Doing http://www.xkcd.com/730/
Doing http://www.xkcd.com/730/
Doing http://www.xkcd.com/about
Doing http://www.xkcd.com/730/
Doing http://www.xkcd.com/about
Doing htt

[(0,
  'http://www.xkcd.com/353',
 (1,
  'http://www.xkcd.com/',
 (1,
  'http://www.xkcd.com/1/',
 (1,
  'http://www.xkcd.com/150/',
 (1,
  'http://www.xkcd.com/162/',
 (1,
  'http://www.xkcd.com/352/',
 (1,
  'http://www.xkcd.com/354/',
 (1,
  'http://www.xkcd.com/556/',
 (1,
  'http://www.xkcd.com/688/',
 (1,
  'http://www.xkcd.com/730/',
 (1,
  'http://www.xkcd.com/about',
  '<HTML><head>\n<title>xkcd - A webcomic</title><body bgcolor="#96A8C8" link="#2020FF" vlink="#000077" alink="#000077">\n<link rel="alternate" type="application/rss+xml" title="xkcd rss" href="http://www.xkcd.com/rss.xml">\n</head>\n<body>\n<center><table width="600">\n<TR>\n<TD><div style="border: 1px solid black; padding: 10px; margin: 5px;\nmargin-top: 0px; background: #FFFFFF">\n<center><h2><a href="http://xkcd.com/">xkcd.com</a></h2>\nRandall Munroe<br />\nContact:<br />\n<table width="500"><tr><td>\n<a href="mailto:orders@xkcd.com">orders@xkcd.com</a> -- All store-related email.<br />\n<a href="mailto:press

### 3. Asynchronous crawler with `async def` and `await`: Many simultaneous fetches

Rewrite all the code here. You will need to make two changes:

1. Replace `yield from` with `await`, and the `@asyncio.coroutine` decorator with `async def`.
2. Note that `extract_multi_async` above was serialized. Use futures from `asyncio.as_completed` to change this.

The first two functions are copied over with only the syntax changed.

In [28]:
async def fetch_async(url):
    print("Doing", url)
    response = await aiohttp.request('GET', url)
    try:
        assert response.status == 200
        data = await response.read()
        assert data
        return data.decode('utf-8')
    finally:
        response.close()
    

In [29]:
async def extract_async(url):
    data = await fetch_async(url)
    found_urls = set()
    for match in URL_EXPR.finditer(data):
        found = canonicalize(match.group('url'))
        if same_domain(url, found):
            found_urls.add(urljoin(url, found))
    return url, data, sorted(found_urls)

Surprisingly, one of these next two is unchanged except for the syntax. Which one? 

In [30]:

async def extract_multi_async(to_fetch, seen_urls):
    futures, results = [], []
    for url in to_fetch:
        if url in seen_urls: continue
        seen_urls.add(url)
        futures.append(extract_async(url))        

    for future in asyncio.as_completed(futures):  
        try:
            results.append((await future))
        except Exception:
            continue

    return results
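
The effect of `asyncio.as_completed` can be seen in a small sketch without network access. `fake_extract` is again a hypothetical stand-in that sleeps instead of fetching, and the delays are arbitrary values chosen so that the ordering is unambiguous.

```python
import asyncio

async def fake_extract(url, delay):
    await asyncio.sleep(delay)
    return url

async def extract_concurrent(jobs):
    # Create all coroutines up front so they can run at the same time
    coros = [fake_extract(url, delay) for url, delay in jobs]
    results = []
    # as_completed hands back awaitables in the order the jobs finish
    for future in asyncio.as_completed(coros):
        results.append(await future)
    return results

loop = asyncio.new_event_loop()
order = loop.run_until_complete(
    extract_concurrent([('/slow', 0.2), ('/fast', 0.01)]))
loop.close()
print(order)
```

The results arrive in completion order rather than submission order, which is why the "Doing" lines in the crawler output are no longer sorted.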


In [31]:

async def crawl_async(start_url, max_depth):
    seen_urls = set()
    to_fetch = [canonicalize(start_url)]
    results = []
    for depth in range(max_depth + 1):
        batch = await extract_multi_async(to_fetch, seen_urls)
        to_fetch = []
        for url, data, found_urls in batch:
            results.append((depth, url, data))
            to_fetch.extend(found_urls)

    return results

In [32]:
future = asyncio.Task(crawl_async('http://www.xkcd.com/353', max_depth=1))
loop = asyncio.get_event_loop()
loop.run_until_complete(future)
1  # dummy final expression so the notebook does not echo the (large) result

Doing http://www.xkcd.com/353
Doing http://www.xkcd.com/s/b0dcca.css
Doing http://www.xkcd.com/556/
Doing http://www.xkcd.com/s/919f27.ico
Doing http://www.xkcd.com/354/
Doing http://www.xkcd.com/688/
Doing http://www.xkcd.com/about
Doing http://www.xkcd.com/
Doing http://www.xkcd.com/150/
Doing http://www.xkcd.com/atom.xml
Doing http://www.xkcd.com/license.html
Doing http://www.xkcd.com/162/
Doing http://www.xkcd.com/archive
Doing http://www.xkcd.com/rss.xml
Doing http://www.xkcd.com/730/
Doing http://www.xkcd.com/1/
Doing http://www.xkcd.com/s/b0dcca.css
Doing http://www.xkcd.com/352/


1

### 4. Concurrent Crawls

We can even run concurrent crawls of multiple websites. Implement this.

In [33]:
urls = ['http://www.xkcd.com/353', 'http://what-if.xkcd.com/148/']

In [36]:
async def crawl_multi_async(urls):
    # Start one crawl per site; each crawl goes to depth 1
    todos = [crawl_async(url, 1) for url in urls]
    results = []

    # Collect each site's results as soon as its crawl finishes
    for future in asyncio.as_completed(todos):
        try:
            results.append(await future)
        except Exception:
            print('Exception')
            continue

    return results
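
As an alternative to the `as_completed` loop, one could sketch the same fan-out with `asyncio.gather`, which is a different technique from the one used in this lab: it returns results in input order and, with `return_exceptions=True`, hands back exceptions as values instead of raising them. `fake_crawl` and the URLs below are hypothetical stand-ins for `crawl_async` and real sites.

```python
import asyncio

async def fake_crawl(start_url, delay):
    # Sleep instead of crawling; fail for URLs marked 'bad'
    await asyncio.sleep(delay)
    if 'bad' in start_url:
        raise ValueError(start_url)
    return [(0, start_url)]

async def crawl_many(jobs):
    outcomes = await asyncio.gather(
        *(fake_crawl(url, delay) for url, delay in jobs),
        return_exceptions=True)
    # Keep only the crawls that succeeded
    return [o for o in outcomes if not isinstance(o, Exception)]

loop = asyncio.new_event_loop()
crawled = loop.run_until_complete(crawl_many(
    [('http://a/', 0.02), ('http://bad/', 0.0), ('http://b/', 0.01)]))
loop.close()
print(crawled)
```

Which form to prefer depends on whether you want results as early as possible (`as_completed`) or in a predictable order (`gather`).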


In [37]:
future = asyncio.Task(crawl_multi_async(urls))
loop = asyncio.get_event_loop()
loop.run_until_complete(future)


Doing http://www.xkcd.com/353
Doing http://what-if.xkcd.com/148/
Doing http://www.xkcd.com/about
Doing http://www.xkcd.com/archive
Doing http://www.xkcd.com/150/
Doing http://www.xkcd.com/rss.xml
Doing http://www.xkcd.com/352/
Doing http://www.xkcd.com/556/
Doing http://www.xkcd.com/162/
Doing http://www.xkcd.com/730/
Doing http://www.xkcd.com/s/b0dcca.css
Doing http://www.xkcd.com/
Doing http://www.xkcd.com/1/
Doing http://www.xkcd.com/354/
Doing http://www.xkcd.com/atom.xml
Doing http://www.xkcd.com/s/919f27.ico
Doing http://www.xkcd.com/license.html
Doing http://www.xkcd.com/688/
Doing http://what-if.xkcd.com/css/style.css
Doing http://what-if.xkcd.com/imgs/apple-touch-icon.png
Doing http://what-if.xkcd.com/imgs/a/148/actualsize.png
Doing http://what-if.xkcd.com/feed.atom
Doing https://what-if.xkcd.com/96/
Doing http://what-if.xkcd.com/147/
Doing http://what-if.xkcd.com/imgs/a/148/franchises.png
Doing http://what-if.xkcd.com/imgs/a/148/snakemeat.png
Doing http://what-if.xkcd.com/arc

[[(0,
   'http://what-if.xkcd.com/148/',
   '<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8" />\n    <link rel="stylesheet" type="text/css" href="/css/style.css" />\n    <link rel="shortcut icon" type="image/ico" href="/imgs/favicon.ico" />\n    <link rel="apple-touch-icon" href="/imgs/apple-touch-icon.png" /> \n    <title>Eat the Sun</title>\n    <script type="text/x-mathjax-config">\n      MathJax.Hub.Config({\n      extensions: ["tex2jax.js"],\n      jax: ["input/TeX", "output/HTML-CSS"],\n      tex2jax: {\n       inlineMath: [ [\'$\',\'$\'], ["\\\\(","\\\\)"] ],\n       displayMath: [ ["\\\\[","\\\\]"] ],\n       processEscapes: true\n      },\n      TeX: {\n       extensions: ["AMSmath.js", "AMSsymbols.js"]\n      },\n      "HTML-CSS": { availableFonts: ["TeX"] }\n      });\n    </script>\n    <script type="text/javascript" src="//cdn.mathjax.org/mathjax/latest/MathJax.js"></script>\n    <link rel="alternate" type="application/atom+xml" href="/feed.atom" />\n<script>\