## Crawlers

The code for this lab is adapted almost entirely from Brett Slatkin's PyCon 2014 talk, which provides a beautiful illustration of the entire process.

### Synchronous Blocking Crawler

This code, taken from Brett's talk, is provided as an example of a synchronous, single-threaded crawler that you will make asynchronous.

In [56]:
from urllib.parse import urljoin
from urllib.parse import urlparse
from urllib.parse import urlunparse
import re
import requests
# Raw strings keep regex escapes such as \s from being read as (invalid) string escapes.
URL_EXPR = re.compile(
    r'([a-zA-Z]+\s*=\s*["\'])'   # Tag attribute: href="
    r'(?P<url>'
        r'((http(s?):)?'         # Optional scheme
        r'//[^"\'\s\\</]+)?'     # Optional domain
        r'/[^"\'\s\\<]*'         # Required path
    r')')



In [57]:
def canonicalize(url):
    parts = list(urlparse(url))
    if parts[2] == '':
        parts[2] = '/'  # Empty path equals root path
    parts[5] = ''       # Erase fragment
    return urlunparse(parts)

Notice the quick-and-dirty use of `assert`s here to raise exceptions if something goes wrong. The calling code should catch generic exceptions.

In [58]:
def fetch(url):
    print("Doing", url)
    response = requests.get(url)
    assert response.status_code == 200
    data = response.content  # get as bytes
    assert data
    return data.decode('utf-8')


In [59]:
fetch("http://www.xkcd.com/353")
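
To see how the `assert`s in `fetch` act as an error signal, here is a minimal sketch that reproduces the two checks without touching the network. The names `check_payload` and `safe_decode` are hypothetical and exist only for this illustration. A failed `assert` raises `AssertionError`, which a generic `except Exception` in the caller catches. Keep in mind that Python strips `assert` statements when run with the `-O` flag, so this pattern suits a lab better than production code.

```python
def check_payload(status_code, data):
    # Hypothetical stand-in for the checks inside fetch(); no network access.
    assert status_code == 200
    assert data
    return data.decode('utf-8')


def safe_decode(status_code, data):
    # The caller catches generic exceptions, as the text recommends.
    try:
        return check_payload(status_code, data)
    except Exception as exc:
        return type(exc).__name__


print(safe_decode(200, b"hello"))    # decoded text
print(safe_decode(404, b"missing"))  # name of the exception raised by the status check
print(safe_decode(200, b""))         # name of the exception raised by the empty-body check
```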

For simplicity, we stay on the same site for now. You can skim this code; it just extracts URLs on the same domain from the page using regular expressions.

In [120]:
def same_domain(a, b):
    parsed_a = urlparse(a)
    parsed_b = urlparse(b)
    if parsed_a.netloc == parsed_b.netloc:
        return True
    if (parsed_a.netloc == '') ^ (parsed_b.netloc == ''):  # Relative paths
        return True
    return False

In [121]:
def extract(url):
    data = fetch(url)
    found_urls = set()
    for match in URL_EXPR.finditer(data):
        found = canonicalize(match.group('url'))
        if same_domain(url, found):
            found_urls.add(urljoin(url, found))
    return url, len(data), sorted(found_urls)

In [122]:
extract("http://www.xkcd.com/353")[2]
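
The extraction logic can also be tried offline. The sketch below repeats the pattern and helpers from this section (with the pattern written as raw strings) and adds a hypothetical `extract_from_text` that takes the page text directly instead of calling `fetch`. The sample HTML and the `example.com` link are made up for illustration.

```python
import re
from urllib.parse import urljoin, urlparse, urlunparse

URL_EXPR = re.compile(
    r'([a-zA-Z]+\s*=\s*["\'])'   # Tag attribute: href="
    r'(?P<url>'
    r'((http(s?):)?'             # Optional scheme
    r'//[^"\'\s\\</]+)?'         # Optional domain
    r'/[^"\'\s\\<]*'             # Required path
    r')')


def canonicalize(url):
    parts = list(urlparse(url))
    if parts[2] == '':
        parts[2] = '/'  # Empty path equals root path
    parts[5] = ''       # Erase fragment
    return urlunparse(parts)


def same_domain(a, b):
    parsed_a = urlparse(a)
    parsed_b = urlparse(b)
    if parsed_a.netloc == parsed_b.netloc:
        return True
    # Exactly one side has no domain: treat it as a relative path.
    return (parsed_a.netloc == '') ^ (parsed_b.netloc == '')


def extract_from_text(base_url, data):
    # Hypothetical offline variant of extract(): no fetch, just parsing.
    found_urls = set()
    for match in URL_EXPR.finditer(data):
        found = canonicalize(match.group('url'))
        if same_domain(base_url, found):
            found_urls.add(urljoin(base_url, found))
    return sorted(found_urls)


SAMPLE_HTML = (
    '<a href="/about">About</a> '
    '<a href="http://www.xkcd.com/archive#top">Archive</a> '
    '<a href="http://example.com/other">Elsewhere</a>'
)

links = extract_from_text("http://www.xkcd.com/353", SAMPLE_HTML)
print(links)
```

The relative link is resolved against the base URL, the fragment is dropped from the second link, and the link to the other domain is filtered out.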

In [65]:
def extract_multi(to_fetch, seen_urls):
    results = []
    for url in to_fetch:
        if url in seen_urls: 
            continue
        seen_urls.add(url)
        try:
            results.append(extract(url))
        except Exception:
            continue
    return results


def crawl(start_url, max_depth=1):
    seen_urls = set()
    to_fetch = [canonicalize(start_url)]
    results = []
    for depth in range(max_depth + 1):
        batch = extract_multi(to_fetch, seen_urls)
        to_fetch = []
        for url, datalen, found_urls in batch:
            results.append((depth, url, datalen))
            to_fetch.extend(found_urls)

    return results

In [66]:
cr = crawl("http://www.xkcd.com/353")
cr

### 1. Synchronous crawler, async style

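(using `yield from`; note that the `@asyncio.coroutine` decorator used below was deprecated in Python 3.8 and removed in 3.11, so these cells need an older Python)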

Just like in the lecture, let us slowly bring in the async technology, still keeping a synchronous crawler going. This means that we'll have one `yield from` after another.

We now write the fetcher as a coroutine:

In [123]:
import asyncio, aiohttp

@asyncio.coroutine
def fetch_async(url):
    print("Doing", url)
    response = yield from aiohttp.request('GET', url)
    try:
        assert response.status == 200
        data = yield from response.read()
        assert data
        return data.decode('utf-8')
    finally:
        response.close()

Now write the extractor:

In [124]:
@asyncio.coroutine
def extract_async(url):
    #your code here


We wrap the top level coroutine in a task. Since a task is a future, we can also get its result in this form.

In [127]:
future = asyncio.Task(extract_async('http://www.xkcd.com/353'))
# Alternatively, run the bare coroutine:
# future = extract_async('http://www.xkcd.com/353')
# but a coroutine object has no .result() method, so the final
# future.result() call below would fail.

loop = asyncio.get_event_loop()
loop.run_until_complete(future)
# loop.close()  # only outside a REPL or notebook: a closed loop cannot be reused
future.result()
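
The task-as-future idea can be checked in isolation. This is a minimal sketch written as a standalone script in `async def` style so that it runs on current Python versions. It creates its own loop with `asyncio.new_event_loop()` rather than reusing a notebook's loop, and `add_later` is a hypothetical stand-in for `extract_async`.

```python
import asyncio


async def add_later(a, b):
    # Hypothetical stand-in for extract_async; yields to the loop once.
    await asyncio.sleep(0)
    return a + b


loop = asyncio.new_event_loop()
try:
    task = loop.create_task(add_later(40, 2))   # wrap the coroutine in a Task
    returned = loop.run_until_complete(task)    # run_until_complete returns the result
    stored = task.result()                      # the finished Task also stores it
finally:
    loop.close()  # safe here because the script owns this loop

print(returned, stored, isinstance(task, asyncio.Future))
```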

### 2. Write the multi-extractor and crawler

Note that although you are writing the multi-extractor in async style, the `yield from`s still run one after another (serialized), so only one fetch is in flight at a time.
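
What "serialized" means can be seen with simulated fetches. The following sketch is an illustration, not a solution to the exercise: it uses `async def`/`await` so that it runs on current Python versions, and `fake_fetch` and `fetch_serially` are hypothetical names that replace the network call with `asyncio.sleep`. Because each wait finishes before the next begins, the total time is roughly the sum of the individual delays.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    # Hypothetical stand-in for a network fetch.
    await asyncio.sleep(delay)
    return url


async def fetch_serially(urls):
    results = []
    for url in urls:
        # Each await completes before the next fetch even starts.
        results.append(await fake_fetch(url))
    return results


loop = asyncio.new_event_loop()
try:
    start = time.perf_counter()
    results = loop.run_until_complete(fetch_serially(["a", "b", "c"]))
    elapsed = time.perf_counter() - start
finally:
    loop.close()

print(results, round(elapsed, 2))  # elapsed is about three delays, not one
```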

In [94]:
@asyncio.coroutine
def extract_multi_async(to_fetch, seen_urls):
    #your code here


In [95]:
@asyncio.coroutine
def crawl_async(start_url, max_depth=1):
    #your code here


We run the entire crawler now:

In [96]:
future = asyncio.Task(crawl_async('http://www.xkcd.com/353', max_depth=1))
loop = asyncio.get_event_loop()
loop.run_until_complete(future)
future.result()

### 3. Asynchronous crawler with `async def` and `await`: Many simultaneous fetches

Rewrite all the code here. You will need to make two changes:

1. Replace `yield from` with `await`, and the `@asyncio.coroutine` decorator with `async def`.
2. Note that `extract_multi_async` above was serialized. Use futures from `asyncio.as_completed` to change this.
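
As a hint for the second change, here is a minimal sketch of how `asyncio.as_completed` behaves, using simulated fetches rather than the crawler itself. `fake_fetch` and `collect_as_completed` are hypothetical names. All tasks are started up front, and the loop then hands back results in the order they finish, not the order they were submitted.

```python
import asyncio


async def fake_fetch(name, delay):
    # Hypothetical stand-in for a network fetch.
    await asyncio.sleep(delay)
    return name


async def collect_as_completed(jobs):
    # Start every fetch immediately so they all run concurrently.
    tasks = [asyncio.ensure_future(fake_fetch(name, delay))
             for name, delay in jobs]
    finished = []
    for next_done in asyncio.as_completed(tasks):
        # Each item is an awaitable for whichever task finishes next.
        finished.append(await next_done)
    return finished


loop = asyncio.new_event_loop()
try:
    order = loop.run_until_complete(
        collect_as_completed([("slow", 0.3), ("fast", 0.05), ("medium", 0.15)]))
finally:
    loop.close()

print(order)  # results arrive in completion order
```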

The first two functions carry over with only the syntax changes:

In [108]:
async def fetch_async(url):
    #your code here


In [109]:
async def extract_async(url):
    #your code here


Surprisingly, one of these next two is unchanged except for the syntax. Which one? 

In [110]:
async def extract_multi_async(to_fetch, seen_urls):
    #your code here


In [111]:
async def crawl_async(start_url, max_depth=1):
    #your code here


In [112]:
future = asyncio.Task(crawl_async('http://www.xkcd.com/353', max_depth=1))
loop = asyncio.get_event_loop()
loop.run_until_complete(future)

### 4. Concurrent Crawls

We can even crawl multiple web sites concurrently. Implement this.
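
One possible building block is `asyncio.gather`, which runs several awaitables concurrently and returns their results in the order they were passed in. Using it is an assumption here, since the lab leaves the choice of tool open. In this sketch, `fake_crawl`, `crawl_all`, and the `*.example` URLs are hypothetical, and each crawl is simulated with `asyncio.sleep`.

```python
import asyncio


async def fake_crawl(start_url, delay):
    # Hypothetical stand-in for crawl_async on one site.
    await asyncio.sleep(delay)
    return (start_url, "done")


async def crawl_all(jobs):
    # gather schedules all crawls at once and preserves input order.
    return await asyncio.gather(
        *(fake_crawl(url, delay) for url, delay in jobs))


loop = asyncio.new_event_loop()
try:
    crawled = loop.run_until_complete(crawl_all([
        ("http://site-a.example/", 0.1),
        ("http://site-b.example/", 0.05),
    ]))
finally:
    loop.close()

print(crawled)
```

Even though the second simulated crawl finishes first, the results come back in submission order, which makes it easy to match each result to its starting URL.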

In [113]:
urls = ['http://www.xkcd.com/353', 'http://what-if.xkcd.com/148/']

In [114]:
async def crawl_multi_async(urls):
    #your code here


In [115]:
future = asyncio.Task(crawl_multi_async(urls))
loop = asyncio.get_event_loop()
loop.run_until_complete(future)