### FUTURES 

A future, or promise, is something that represents a pending opearion and returns straight away. One can then query their state of completion, or register callbacks to be called on successful completion or error.

Adapted from Fluent Python.

#### Serial sleeping

In [26]:
from time import time

In [7]:
serial_main(range(20))

NameError: name 'serial_main' is not defined

#### concurrent sleeping using threads

In [8]:
from concurrent import futures
def get_many_threaded1(it):
    workers = 10
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        res = executor.map(get_thing, it)
    return len(list(res))
def threaded_main1(it):
    t0 = time.time()
    count = get_many_threaded1(it)
    elapsed = time.time() - t0
    msg = '\n{} things got in {:.2f}s' 
    print(msg.format(count, elapsed))

In [9]:
threaded_main1(range(20))

NameError: name 'time' is not defined

One might think that the concurrent IO (or sleeping) case is limited by the GIL, but in both cases, the GIL is yielded. Thus there is no waiting around.

The GIL is harmless if code is being run in the context of python library IO or code running in properly coded C extensions like numpy.  The time.sleep() function also releases the GIL. Python threads are totally usable in I/O-bound applications.

### Threads

threads vs processes

On linux

- processes created by fork()
- have a primary thread
- thread is the unit of execution
- process is a container, can have more threads
- can be scheduled across different cores/cpus

```c
int pid;
int status = 0;
/* fork returns pid of child to parent and 0 to child*/
if (pid = fork()) {
    /* parent code */
    pid = wait(&status);
    /*wait returns child pid and status*/
} else {
    /* child  code*/
    exit(status);
} 
```

- threads in a process share same address space (share it entirely)
- thread abstraction decouples resource allocation from control
- defines a single sequential execution stream with PC, stack, register values
- process handles: address space, global variables, open files, child processes, pending alarms, signals and signal handlers, accounting info
- thread handles program counter, registers, stack, and state
- user vs kernel threads

In [37]:
def fib(n):
    return fib(n - 1) + fib(n - 2) if n > 1 else n

In [38]:
from threading import Thread
from time import sleep
from time import time


def sleepy(): #like io
    i=0
    while i < 10:
        print("{} -- {} Sleepy!".format(i, int(time())), flush=True)
        sleep(3)
        i += 1


def cpuy():
    for i in range(35):
        val = fib(i)
        print("fib({}) is {}".format(i, val))

def cpuy2():
    for i in range(35):
        val = fib(i)
        print("cpuy2 fib({}) is {}".format(i, val))
        
def main():
    # Second thread will print the hello message. Starting as a daemon means
    # the thread will not prevent the process from exiting.
    start = time()
    cpuy()
    cpuy2()
    print("serial elapsed:", time() - start)
    start=time()
    #t = Thread(target=sleepy)
    #t.start()
    t2 = Thread(target=cpuy2)
    t2.start()
    # Main thread will read and process input
    cpuy()
    print("thread elapsed:", time() - start)
if __name__ == '__main__':
    main()

fib(0) is 0
fib(1) is 1
fib(2) is 1
fib(3) is 2
fib(4) is 3
fib(5) is 5
fib(6) is 8
fib(7) is 13
fib(8) is 21
fib(9) is 34
fib(10) is 55
fib(11) is 89
fib(12) is 144
fib(13) is 233
fib(14) is 377
fib(15) is 610
fib(16) is 987
fib(17) is 1597
fib(18) is 2584
fib(19) is 4181
fib(20) is 6765
fib(21) is 10946
fib(22) is 17711
fib(23) is 28657
fib(24) is 46368
fib(25) is 75025
fib(26) is 121393
fib(27) is 196418
fib(28) is 317811
fib(29) is 514229
fib(30) is 832040
fib(31) is 1346269
fib(32) is 2178309
fib(33) is 3524578
fib(34) is 5702887
cpuy2 fib(0) is 0
cpuy2 fib(1) is 1
cpuy2 fib(2) is 1
cpuy2 fib(3) is 2
cpuy2 fib(4) is 3
cpuy2 fib(5) is 5
cpuy2 fib(6) is 8
cpuy2 fib(7) is 13
cpuy2 fib(8) is 21
cpuy2 fib(9) is 34
cpuy2 fib(10) is 55
cpuy2 fib(11) is 89
cpuy2 fib(12) is 144
cpuy2 fib(13) is 233
cpuy2 fib(14) is 377
cpuy2 fib(15) is 610
cpuy2 fib(16) is 987
cpuy2 fib(17) is 1597
cpuy2 fib(18) is 2584
cpuy2 fib(19) is 4181
cpuy2 fib(20) is 6765
cpuy2 fib(21) is 10946
cpuy2 fib(22) is 177

### Processes with concurrent futures.

CPU based processing wont release the gil, and is thus best done in a separate process. For illustration, we show what this looks like.

In [41]:
import time

In [42]:
def get_many_process(it, workers=None):
    if workers:
        with futures.ProcessPoolExecutor(max_workers=workers) as executor:
            res = executor.map(get_thing, it)
    else:
        with futures.ProcessPoolExecutor() as executor:
            res = executor.map(get_thing, it)
    return len(list(res))

def process_main(it, workers=None):
    t0 = time.time()
    count = get_many_process(it, workers)
    elapsed = time.time() - t0
    msg = '\n{} things got in {:.2f}s' 
    print(msg.format(count, elapsed))

In [43]:
process_main(range(20))

NameError: name 'get_thing' is not defined

In [44]:
process_main(range(20), workers=10)

NameError: name 'get_thing' is not defined

In [45]:
print(__name__)

__main__


In [46]:
import multiprocessing
start = time()
p=multiprocessing.Process(target=cpuy2)
p.start()
cpuy()
p.join()
print("mp elapsed:", time() - start)

TypeError: 'module' object is not callable

### sockets

- distinction between "client socket" and "server socket"
- default `socket.socket(family=AF_INET, type=SOCK_STREAM, proto=0, fileno=None)`
- server socket sits and creates client sockets
- non-blocking sockets and the `select` system call

Read: https://docs.python.org/3.5/howto/sockets.html

### Writing a web page fetcher

We'll eventually use the asyncio module to play with web page fetching and crawling, but lets build up to that by writing a simple fetcher. We'll start with blocking, then move to non-blocking, and finally to co-routines, and even more finally to `yield from` based co-routines.

Adapted from http://aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html

#### Blocking fetch

In [47]:
import socket
def fetch(host, url):
    sock = socket.socket()
    sock.connect((host, 80))
    request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(url, host)
    sock.send(request.encode('ascii'))
    response = b''
    chunk = sock.recv(4096)
    while chunk:
        response += chunk
        chunk = sock.recv(4096)
    return response

In [48]:
from IPython.display import HTML, IFrame
HTML(str(fetch("www.example.com","/")))

#### Basic non-blocking

In [49]:
host="www.example.com"
url="/"
request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(url, host)
encoded = request.encode('ascii')
sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('xkcd.com', 80))
except BlockingIOError:
    pass
while True:
    try:
        sock.send(encoded)
        break  # Done.
    except OSError as e:
        pass

print('sent')

sent


This has only been implemented partially. Notice how the `sock.send` spins in a loop.

This eats cycles. the solution is to use select/kqueue/epoll from a small number of connections to a large number of them. The basic idea behind `select` is to wait for an event to occur on a small set of non-blocking sokets.

We'll use python's `DefaultSelector`, an addition from python 3.4 that automatically chooses the "best" select like implementation on your system.


In [114]:
from selectors import DefaultSelector, EVENT_WRITE

selector = DefaultSelector()
host="www.example.com"
sock = socket.socket()
#sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setblocking(False)
try:
    sock.connect((host, 80))
except BlockingIOError:
    print('here')
    pass

def connected():
    selector.unregister(sock.fileno())
    print('connected!', flush=True)

selector.register(sock.fileno(), EVENT_WRITE, connected)

here


SelectorKey(fileobj=1104, fd=1104, events=2, data=<function connected at 0x000002652EE9E268>)

`connected` is the **callback** run when the connection happens.

In [115]:

def loop():
    start = time.time()
    while True:
        if time.time() - start > 10:
            break
        events = selector.select()
        for event_key, event_mask in events:
            print('event_key', event_key)
            callback = event_key.data
            callback()

Such a loop is called an "event loop". An async frameworkhas two parts: (a) such an event loop and (b) non-blocking sockets. It all runs on one thread. This is a system, it should be obvious for I/O bound problems.

What have we demonstrated already? We showed how to begin an operation and execute a callback when the operation is ready. An async framework builds on the two features we have shown—non-blocking sockets and the event loop—to run concurrent operations on a single thread.

Guido:
>We have achieved "concurrency" here, but not what is traditionally called "parallelism". What asynchronous I/O is right for, is applications with many slow or sleepy connections with infrequent events.

In [None]:
loop() #loop will destruct after 10 secs

event_key SelectorKey(fileobj=1104, fd=1104, events=2, data=<function connected at 0x000002652EE9E268>)
connected!
event_key SelectorKey(fileobj=1104, fd=1104, events=2, data=<function connected at 0x000002652EE9E268>)
connected!
event_key SelectorKey(fileobj=1104, fd=1104, events=2, data=<function connected at 0x000002652EE9E268>)
connected!
event_key SelectorKey(fileobj=1104, fd=1104, events=2, data=<function connected at 0x000002652EE9E268>)
connected!
event_key SelectorKey(fileobj=1104, fd=1104, events=2, data=<function connected at 0x000002652EE9E268>)
connected!
event_key SelectorKey(fileobj=1104, fd=1104, events=2, data=<function connected at 0x000002652EE9E268>)
connected!
event_key SelectorKey(fileobj=1104, fd=1104, events=2, data=<function connected at 0x000002652EE9E268>)
connected!
event_key SelectorKey(fileobj=1104, fd=1104, events=2, data=<function connected at 0x000002652EE9E268>)
connected!
event_key SelectorKey(fileobj=1104, fd=1104, events=2, data=<function connected 

#### async with response reading

In [63]:
from selectors import DefaultSelector, EVENT_READ, EVENT_WRITE

selector = DefaultSelector()
class Fetcher:
    def __init__(self, host, url):
        self.response = b''  # Empty array of bytes.
        self.host = host
        self.url = url
        self.sock = None
        
    # Method on Fetcher class.
    def fetch(self):
        self.sock = socket.socket()
        self.sock.setblocking(False)
        try:
            self.sock.connect((self.host, 80))
        except BlockingIOError:
            pass

        # Register next callback.
        selector.register(self.sock.fileno(),
                          EVENT_WRITE,
                          self.connected)

    def connected(self, key, mask):
        print('connected!', flush=True)
        selector.unregister(key.fd)
        request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(self.url, self.host)
        self.sock.send(request.encode('ascii'))

        # Register the next callback.
        selector.register(key.fd,
                          EVENT_READ,
                          self.read_response)
        
    def read_response(self, key, mask):
        global stopped
        
        chunk = self.sock.recv(128)  # USUALLY 4k chunk size, here small
        if chunk:
            print("read chunk", flush=True)
            self.response += chunk
        else:
            print("all read", flush=True)
            selector.unregister(key.fd)  # Done reading.
            stopped=True
            
stopped = False

def loop():
    while not stopped:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            callback(event_key, event_mask)

In [65]:
fetcher = Fetcher('xkcd.com', '/353/')
fetcher.fetch()
loop()

connected!
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
read chunk
all read


You can see how the control-flow is chained together by having the connected callback do the resposing. Beyond a 2-3 ladder, this gets confusing and onerous (see some node.js code). As compared to a blocking program, where the continuation of the program is stored and adressed via the instruction pointer in a sequential fashiom, here the cintinuation is stored by registering the callbacks.'

Since the current frame is popped out of the stack, exceptions have a hard time figuring the origin This is called stack-ripping.

So, even apart from the long debate about the relative efficiencies of multithreading and async, there is this other debate regarding which is more error-prone: threads are susceptible to data races if you make a mistake synchronizing them, but callbacks are stubborn to debug due to stack ripping. And within a bit, we get callback soup.

https://thesynchronousblog.wordpress.com/tag/stack-ripping/

Threads seem to offer a more natural way of programming as the programmer with all state in thread’s single stack.


So why not use them. As we said last time: synchronization and overhead. 

But we can do better with Coroutines!

Guido:
>We entice you with a promise. It is possible to write asynchronous code that combines the efficiency of callbacks with the classic good looks of multithreaded programming. This combination is achieved with a pattern called "coroutines". Using Python 3.4's standard asyncio library, and a package called "aiohttp", fetching a URL in a coroutine is very direct7:

    @asyncio.coroutine
    def fetch(self, url):
        response = yield from self.session.get(url)
        body = yield from response.read()
        
In 3.5 its even more clear:

async def fetch(self, url):
        response = await self.session.get(url)
        body = await response.read()

### Back to the Future with co-routines

In [23]:
from selectors import DefaultSelector, EVENT_READ, EVENT_WRITE
import socket
selector = DefaultSelector()

The future, as you might expect is something with callbacks...

In [42]:
class MyFuture:
    def __init__(self):
        self.result = None
        self._callbacks = []

    def add_done_callback(self, fn):
        self._callbacks.append(fn)

    def set_result(self, result):
        self.result = result
        for fn in self._callbacks:
            fn(self)

We need a "main" to yield to.

In [43]:
class Fetcher:
    
    def __init__(self, url, host):
        self.url = url
        self.host = host
        self.response = b''  # Empty array of bytes.

        
    def fetch(self):
        global stopped
        sock = socket.socket()

        sock.setblocking(False)
        try:
            sock.connect((self.host, 80))
        except BlockingIOError:
            pass

        f = MyFuture()

        #resolves the future by setting a result on it
        def on_connected():
            print('on connected cb ran', flush=True)
            f.set_result(None)
        
        
        
        selector.register(sock.fileno(),
                          EVENT_WRITE,
                          on_connected)
        print("about to yield connection future", flush=True)
        yield f#this makes it look like fetch has returned the "future"
        #bit we have not lost the state (or have to have carried it in obj)
        #a send in will continue us here
        print('we were connected! now back in gen', flush=True)
        selector.unregister(sock.fileno())
        request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(self.url, self.host)
        sock.send(request.encode('ascii'))
        while True:
            print("in loop")
            #now create a new future for the data-recieving call
            f = MyFuture()
            def on_response():
                chunky = sock.recv(4096)  # 4k chunk size.
                f.set_result(chunky)
            selector.register(sock.fileno(),
                              EVENT_READ,
                              on_response)
            #now to restart the gen, we will from the main
            #throw the data right back in
            chunk = yield f
            selector.unregister(sock.fileno())
            if chunk:
                print("len(chunk)",len(chunk))
                self.response += chunk
            else:
                print("all read")
                stopped= True
                break


        
    

In [44]:
#But when the future resolves, what resumes the generator? We need a coroutine driver. Let us call it "task":
#(this is our main)
class Task:
    def __init__(self, coro):
        self.coro = coro
        f = MyFuture()
        print(">>sending none to initial future",f)
        f.set_result(None)
        print("...stepping")
        self.step(f)
        print(">>>after priming")

    def step(self, future):
        try:
            print("sending", type(future.result))
            next_future = self.coro.send(future.result)
            print('got next future', next_future)

        except StopIteration:
            print("si")
            return None
        next_future.add_done_callback(self.step)

In [45]:
stopped=False
def loop():
    while not stopped:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            callback()

In [46]:
fetcher = Fetcher('/353/', 'xkcd.com')
Task(fetcher.fetch())
stopped=False
loop()

>>sending none to initial future <__main__.MyFuture object at 0x104d72978>
...stepping
sending <class 'NoneType'>
about to yield connection future
got next future <__main__.MyFuture object at 0x104d72c88>
>>>after priming
on connected cb ran
sending <class 'NoneType'>
we were connected! now back in gen
in loop
got next future <__main__.MyFuture object at 0x104d72780>
sending <class 'bytes'>
len(chunk) 4096
in loop
got next future <__main__.MyFuture object at 0x104d72d68>
sending <class 'bytes'>
len(chunk) 3991
in loop
got next future <__main__.MyFuture object at 0x104d72048>
sending <class 'bytes'>
all read
si


#### Refactoring using generators

In [6]:
#But when the future resolves, what resumes the generator? We need a coroutine driver. Let us call it "task":
#(this is our main)
class Task:
    def __init__(self, coro):
        self.coro = coro
        f = MyFuture()
        print(">>sending none to initial future",f)
        f.set_result(None)
        print("...stepping")
        self.step(f)
        print(">>>after priming")

    def step(self, future):
        try:
            print("sending", type(future.result))
            next_future = self.coro.send(future.result)
            print('got next future', next_future)

        except StopIteration:
            print("si")
            return None
        next_future.add_done_callback(self.step)

In [7]:
def read(sock):
    f = MyFuture()

    def on_readable():
        f.set_result(sock.recv(4096))

    selector.register(sock.fileno(), EVENT_READ, on_readable)
    chunk = yield f  # Read one chunk.
    selector.unregister(sock.fileno())
    return chunk

In [8]:
def read_all(sock):
    global stopped
    response = []
    # Read whole response.
    chunk = yield from read(sock)
    while chunk:
        response.append(chunk)
        chunk = yield from read(sock)
    stopped=True
    return b''.join(response)

>If you squint and make the yield from statements disappear it looks like  conventional functions doing blocking I/O. But in fact, read and read_all are coroutines. Yielding from read pauses read_all until the I/O completes. While read_all is paused, asyncio's event loop does other work and awaits other I/O events; read_all is resumed with the result of read on the next loop tick once its event is ready.

In [19]:
class Fetcher:
    
    def __init__(self, url, host):
        self.url = url
        self.host = host
        self.response = b''  # Empty array of bytes.

        
    def fetch(self):
        global stopped
        sock = socket.socket()

        sock.setblocking(False)
        try:
            sock.connect((host, 80))
        except BlockingIOError:
            pass

        f = MyFuture()

        def on_connected():
            print('on connected cb ran')
            f.set_result(None)
        
        
        
        selector.register(sock.fileno(),
                          EVENT_WRITE,
                          on_connected)
        print("about to yield connection future")
        yield f
        print('connected!')
        selector.unregister(sock.fileno())
        request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(self.url)
        sock.send(request.encode('ascii'))
        yield from read_all(sock)

In [20]:
fetcher = Fetcher('/353/', 'xkcd.com')
Task(fetcher.fetch())
stopped = False
loop()

NameError: name 'Task' is not defined

![](http://aosabook.org/en/500L/crawler-images/yield-from.png)

There is one yield left amongst the yield froms. For consistency, this can be fixed...it also lets us change implementations under the hood..

In [11]:
def read(sock):
    f = MyFuture()

    def on_readable():
        f.set_result(sock.recv(4096))

    selector.register(sock.fileno(), EVENT_READ, on_readable)
    chunk = yield from f  # Read one chunk.
    selector.unregister(sock.fileno())
    return chunk

In [12]:
class MyFuture:
    def __init__(self):
        self.result = None
        self._callbacks = []

    def add_done_callback(self, fn):
        self._callbacks.append(fn)

    def set_result(self, result):
        self.result = result
        print("cblist", self._callbacks)
        for fn in self._callbacks:
            fn(self)
            
    def __iter__(self):
        yield self
        return self.result

In [13]:
class Fetcher:
    
    def __init__(self, url, host):
        self.url = url
        self.host = host
        self.response = b''  # Empty array of bytes.

        
    def fetch(self):
        global stopped
        sock = socket.socket()

        sock.setblocking(False)
        try:
            sock.connect((self.host, 80))
        except BlockingIOError:
            pass

        f = MyFuture()

        def on_connected():
            print('on connected cb ran')
            f.set_result(None)
        
        
        
        selector.register(sock.fileno(),
                          EVENT_WRITE,
                          on_connected)
        print("about to yield connection future")
        yield from f
        print('connected!')
        selector.unregister(sock.fileno())
        request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(self.url)
        sock.send(request.encode('ascii'))
        yield from read_all(sock)

In [14]:
fetcher = Fetcher('/353/','xkcd.com')
Task(fetcher.fetch())
stopped = False
loop()

>>sending none to initial future <__main__.MyFuture object at 0x1039d6860>
cblist []
...stepping
sending <class 'NoneType'>


NameError: name 'socket' is not defined

## Lab

Implement a URL fetcher using Beautiful Soup in the callback version. We will implement a similar one using coroutines on wednesday. 

The implimentation will extend the read_response method by parsing for URL's using `bs4` . Start by creating globals:
```
urls_todo = set(['/'])
seen_urls = set(['/'])
```

then:

```
links = self.parse_links()#write this
```
(using self.response)

Then use the set `difference` method  to add new links to `urls_todo` and recursively set up a `Fetcher` instance.

Now update the `seen_urls` and `urls_todo` thus:
```
seen_urls.update(links)
urls_todo.remove(self.url)
if not urls_todo:
    stopped = True
```

In [173]:
import socket
from bs4 import BeautifulSoup
from selectors import DefaultSelector, EVENT_READ, EVENT_WRITE

selector = DefaultSelector()
class Fetcher:
    def __init__(self, host, url, level = 0):
        self.response = b''  # Empty array of bytes.
        self.host = host
        self.url = url
        self.sock = None
        self.level = level
        
    # Method on Fetcher class.
    def fetch(self):
        self.sock = socket.socket()
        self.sock.setblocking(False)
        try:
            self.sock.connect((self.host, 80))
        except BlockingIOError:
            pass

        # Register next callback.
        selector.register(self.sock.fileno(),
                          EVENT_WRITE,
                          self.connected)

    def connected(self, key, mask):
        print('connected to:', self.url, flush=True)
        selector.unregister(key.fd)
        request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(self.url, self.host)
        self.sock.send(request.encode('ascii'))

        # Register the next callback.
        selector.register(key.fd,
                          EVENT_READ,
                          self.read_response)
        
    def read_response(self, key, mask):
        global stopped
        
        chunk = self.sock.recv(4096)  # USUALLY 4k chunk size, here small
        if chunk:
            #print("read chunk", flush=True)
            self.response += chunk
        else:
            print("all read", flush=True)
            links = self.parse_links()
            selector.unregister(key.fd)  # Done reading.
            #print (links, seen_urls)         
            
            links.difference_update(seen_urls)
            #print (links)
            if self.level < 1:
                for link in links:
                    urls_todo.add(link)
                    
            if self.level < 1:
                for link in links:
                    urls_todo.add(link)
                    fetcher = Fetcher(self.host, link, self.level+1)
                    fetcher.fetch()

            seen_urls.update(links)
            urls_todo.remove(self.url)
            #print (len(urls_todo), self.url, urls_todo)
            #print (list(selector.get_map()))
            if not urls_todo:
                stopped = True
            
            
    def parse_links(self):
        soup = BeautifulSoup(self.response, "lxml");
        #print (soup)
        result = set([])
        for link in soup.find_all('a'):
            linkurl = link.get('href')
            if linkurl is not None and linkurl.startswith('/'):
                #print (linkurl)
                result.add(linkurl)
        #for x in range(3):
        #    host = 'xkcd.com'
        #    url = '/37'+str(x)+'/'
        #    result.add(url)
        #print (result)
        return result
            
stopped = False

def loop():
    while not stopped:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            callback(event_key, event_mask)

In [174]:
urls_todo = set(['/'])
seen_urls = set(['/'])

fetcher = Fetcher('xkcd.com','/')
fetcher.fetch()
loop()

connected to: /
all read
[5728, 6720, 6644, 6884, 6736, 7044, 6784, 6812]
connected to: /about
connected to: /license.html
connected to: /atom.xml
connected to: //c.xkcd.com/random/comic/
connected to: /1660/
connected to: /1/
connected to: /archive
connected to: /rss.xml
all read
[5728, 6720, 6884, 6736, 7044, 6784, 6812]
all read
[5728, 6720, 6884, 6736, 7044, 6812]
all read
[5728, 6720, 6884, 6736, 7044]
all read
[5728, 6720, 6884, 7044]
all read
[5728, 6720, 6884]
all read
[5728, 6720]
all read
[6720]
all read
[]


In [149]:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://xkcd.com/')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    print (link)
#for link in soup.find_all('a'):
#    print(link.get('href'))

html
<a href="/archive">Archive</a>
<a href="http://what-if.xkcd.com">What If?</a>
<a href="http://blag.xkcd.com">Blag</a>
<a href="http://store.xkcd.com/">Store</a>
<a href="/about" rel="author">About</a>
<a href="/"><img alt="xkcd.com logo" height="83" src="//imgs.xkcd.com/static/terrible_small_logo.png" width="185"/></a>
<a href="http://amzn.to/1GCXMJ5" title="Thing Explainer Amazon purchase link">Amazon</a>
<a href="http://www.barnesandnoble.com/w/thing-explainer-randall-munroe/1121864432?ean=9780544668256" title="Thing Explainer Barnes and Noble purchase link">Barnes &amp; Noble</a>
<a href="http://www.indiebound.org/book/9780544668256" title="Thing Explainer Indie Bound purchase link">Indie Bound</a>
<a href="http://www.hudsonbooksellers.com/thingexplainer" title="Thing Explainer Hudson purchase link">Hudson</a>
<a href="/1/">|&lt;</a>
<a accesskey="p" href="/1660/" rel="prev">&lt; Prev</a>
<a href="//c.xkcd.com/random/comic/">Random</a>
<a accesskey="n" href="#" rel="next">Next 

  'has been renamed to "%s."' % (old_name, new_name))


 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))
