### PyWren RISECamp, 2017

Welcome to the hands-on tutorial for PyWren.

This tutorial consists of a set of exercises that will have you working directly with PyWren:
- simple matrix multiplication
- data analysis on a wikipedia dataset
- some machine learning algorithms (Eric's) 


## 0. Hello World

First, let's write a simple hello program to test out PyWren.



In [None]:
# some libraries that are useful for this tutorial
import sys
if '/pywren-setup/' not in sys.path:
    sys.path.insert(0, '/pywren-setup/')

from training import *

# first we need to load PyWren and create an executor instance
import pywren
pwex = pywren.default_executor()

### 0.1. call_async() -- our single invocation API
We can use the `call_async()` API on a PyWren executor to run a function in the cloud.
The workflow is simple and looks like this:

```python
def my_func(param):
    # do something
    return some_result
    
handler = pwex.call_async(my_func, param)
result = handler.result()
```
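
For readers who want to try the future-based pattern without cloud access, here is a minimal local sketch. It uses Python's standard `concurrent.futures` module as a stand-in, not PyWren itself: `pool.submit` plays the role of `pwex.call_async`, and the function `my_func` and its argument are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def my_func(param):
    # Placeholder work; any picklable-free local function will do here.
    return param + 1

with ThreadPoolExecutor(max_workers=1) as pool:
    handler = pool.submit(my_func, 41)  # returns a future without waiting
    result = handler.result()           # blocks until the value is ready

print(result)
```

The submit call hands back a future right away, and `result()` blocks until the value is available, which is the same shape as the workflow above.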

**Exercise**: modify the following code block to run hello world with PyWren


In [None]:
# first we need a basic hello world function
def hello_world(param):
    if param == 42:
        return "hello world!"

future = pwex.call_async()
# on success, this line should print out "hello world"
check_result_1(future.result())

### 0.2. map() -- parallel execution in the cloud
The above example runs a single function in the cloud.
PyWren also has a `map()` API that runs the same function over a list of parameters, one invocation per parameter:

```python
handlers = pwex.map(my_func, param_list)
pywren.wait(handlers)

results = [h.result() for h in handlers]
```

**Exercise**: modify the following code block to print "hello world"

In [None]:
# do not modify code here
def hello_world(param):
    if param == 1:
        return "hello"
    if param == 2:
        return "world!"
# do not modify code above

param_list = []
futures = pwex.call_async(hello_world, None)

results = [f.result() for f in futures] 
check_result_2(" ".join(results))

### 0.3. wait() API and multiple jobs

`map` returns a list of futures, each representing a separate Lambda invocation that may not have completed or produced a result yet. To track the progress of the whole job set, we can use the `wait` function.


In [None]:
import pywren
import numpy as np

def my_function(b):
    x = np.random.normal(0, b, 1024)
    A = np.random.normal(0, b, (1024, 1024))
    return np.dot(A, x)

pwex = pywren.default_executor()
res = pwex.map(my_function, np.linspace(0.1, 10, 100))
pywren.wait(res, return_when=pywren.ALL_COMPLETED)
for i in res:
    print(i.result())

`wait` polls S3 for the results of any finished jobs and returns two lists: finished and unfinished jobs.

By default it blocks until all jobs have finished. You can also make it block until at least one job has completed with `return_when=ANY_COMPLETED`, or return immediately with `return_when=ALWAYS`.
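
The three modes can be tried locally with the standard library's `concurrent.futures.wait`, which is a separate API with a similar shape (it returns done and not-done sets rather than two lists). This is only an analogy: `FIRST_COMPLETED` stands in for `ANY_COMPLETED`, and `timeout=0` is used as a rough substitute for `ALWAYS`, since the standard library has no such constant. The function `slow_square` is hypothetical.

```python
import time
from concurrent.futures import (
    ThreadPoolExecutor, wait, ALL_COMPLETED, FIRST_COMPLETED,
)

def slow_square(x):
    # Larger inputs take longer, so tasks finish at different times.
    time.sleep(0.05 * x)
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(slow_square, x) for x in range(1, 5)]

    # Return immediately with whatever has (or has not) finished so far.
    done_now, pending_now = wait(futures, timeout=0)

    # Block until at least one task has finished.
    done_first, pending_first = wait(futures, return_when=FIRST_COMPLETED)

    # Block until every task has finished.
    done_all, pending_all = wait(futures, return_when=ALL_COMPLETED)

results = sorted(f.result() for f in done_all)
print(len(done_now) + len(pending_now), len(done_first) >= 1, len(pending_all))
print(results)
```

In each call the finished and unfinished groups together always cover every submitted future; only the moment at which the call returns differs.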

### 0.4. Visualization and Debugging
From the talk, you have already heard what happens behind every PyWren execution. Let's see it for real!


**Exercise**: inspect PyWren's execution by running the plotting code below

In [None]:
plot_pywren_execution(futures)

Another useful tool is printing the latest CloudWatch logs, which tell you about the most recent Lambda execution.

In [None]:
!pywren print_latest_logs

This concludes our getting-started section. You can find more documentation on PyWren APIs and usage at http://pywren.io/

## 1. Matrix Multiplication

One nice thing about PyWren is that it lets users integrate existing Python libraries easily.
For the following exercise, we are going to use some popular Python libraries, e.g., NumPy, to work on some matrix multiplication problems.

In [None]:
import numpy as np

def my_function(b):
    x = np.random.normal(0, b, 1024)
    A = np.random.normal(0, b, (1024, 1024))
    return np.dot(A, x)

pwex = pywren.default_executor()
res = pwex.map(my_function, np.linspace(0.1, 10, 100))


## 2. Data Analytics with Wikipedia Dataset

In this section, we will use PyWren to explore the Wikipedia data.


### 2.1. The data
We have a number of Wikipedia files stored in our RISECamp S3 bucket.
Let's just take a peek at the data.

In [None]:
# we'll first get the list of files
filenames = list_keys_with_prefix(rise_camp_bucket, "wikistats_20090505_restricted-01/")
print(len(filenames))

In [None]:
def take5(filename):
    data = pywren_read_data(rise_camp_bucket, filename)
    result = data.split("\n")[:5]
    return result

future = pwex.call_async(take5, filenames[0])
print(future.result())


Unfortunately this is not very readable because `result()` returns a list, which is printed on a single line. We can make it prettier by printing each record on its own line.

In [None]:
for x in future.result():
    print(x)

### 2.2. Count
Let’s see how many records in total are in this data set (this command will take a while, so read ahead while it is running).

In [None]:
def count(filename):
    data = pywren_read_data(rise_camp_bucket, filename)
    return (len(data.split("\n")) if data else 0)    

futures = pwex.map(count, filenames)
pywren.wait(futures)

result = sum([f.result() for f in futures])
print(result)

This should launch 73 PyWren tasks. After finishing the job, let's plot again to check the execution. Now it should be more interesting than the simple job above.

In [None]:
plot_pywren_execution(futures)
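
A small local check helps explain the `if data else 0` guard in `count` above. Splitting an empty string on `"\n"` still yields one (empty) element, so without the guard an empty file would count as one record. The same effect means a file ending in a newline is counted one higher than its number of lines. Whether the RISECamp files end that way is not stated here, so treat the `splitlines()` variant below as an optional alternative rather than a correction. The sample strings and helper names are made up.

```python
def count_with_split(data):
    # Same expression as in the tutorial's count(), minus the S3 read.
    return len(data.split("\n")) if data else 0

def count_with_splitlines(data):
    # Alternative: splitlines() does not produce a trailing empty element.
    return len(data.splitlines())

empty = ""
two_lines = "first record\nsecond record\n"

print(len(empty.split("\n")))            # the unguarded count for an empty string
print(count_with_split(empty))
print(count_with_split(two_lines))
print(count_with_splitlines(two_lines))
```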

### 2.3. Visits for English Pages
Recall from when we peeked at the data above that the second field is the “project code”, which contains information about the language of the pages. For example, the project code “en” indicates an English page. Let’s calculate the page counts of English pages, grouped by date.
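
The field layout can be checked on a few lines locally before running the full job. The sample records below are made up for illustration; only the positions of the date (field 0), project code (field 1), and page count (field 3) are taken from this tutorial's code, and the remaining fields are placeholders. The helper name `english_pairs` is hypothetical.

```python
# Made-up records: date-time, project code, placeholder, page count, placeholder.
sample = "\n".join([
    "20090505-000000 en Main_Page 12 4096",
    "20090505-000000 fr Accueil 5 2048",
    "20090506-010000 en Python 7 1024",
    "malformed line",
])

def english_pairs(data):
    pairs = []
    for line in data.split("\n"):
        fields = line.split(" ")  # split once and reuse the result
        # Skip short or malformed lines, keep only project code "en".
        if len(fields) >= 4 and fields[1] == "en":
            # First 8 characters of field 0 are the date (YYYYMMDD).
            pairs.append((fields[0][:8], int(fields[3])))
    return pairs

pairs = english_pairs(sample)
print(pairs)
```

Each surviving record becomes a `(date, pagecount)` pair, which is the input shape that the aggregation step expects.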

In [None]:
from itertools import groupby
from operator import itemgetter
from functools import reduce

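def aggregate_count(key_value_list):
    """Sum the values of (key, value) pairs that share the same key."""
    def reduce_f(obj1, obj2):
        return (obj1[0], obj1[1] + obj2[1])
    # groupby only merges adjacent items, so sort by key first
    counts = [reduce(reduce_f, group) for _, group
              in groupby(sorted(key_value_list), key=itemgetter(0))]

    return counts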

def english_page_count(filename):
    data = pywren_read_data(rise_camp_bucket, filename)
    # keep only the English pages (project code "en")
    en_pages = [d for d in data.split("\n") 
                if len(d.split(" ")) >= 4 and d.split(" ")[1] == "en"]
    # projection to create (date, pagecount) pairs
    en_kvpair_list = [(p.split(" ")[0][:8], int(p.split(" ")[3])) for p in en_pages]

    return aggregate_count(en_kvpair_list)
    
futures = pwex.map(english_page_count, filenames)
pywren.wait(futures)

results = [f.result() for f in futures]
en_page_counts_by_date = aggregate_count([x for y in results for x in y])
print(en_page_counts_by_date)
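
To see what `aggregate_count` does at each of the two levels (once per file inside the task, then once more on the combined results), here is a local run on made-up `(date, count)` pairs. The function logic is copied from the cell above; the input values are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter
from functools import reduce

def aggregate_count(key_value_list):
    def reduce_f(obj1, obj2):
        # Keep the key, add up the counts.
        return (obj1[0], obj1[1] + obj2[1])
    return [reduce(reduce_f, group) for _, group
            in groupby(sorted(key_value_list), key=itemgetter(0))]

# Level 1: what each task would return for its own file.
file_a = aggregate_count([("20090505", 3), ("20090506", 1), ("20090505", 4)])
file_b = aggregate_count([("20090506", 10), ("20090507", 2)])

# Level 2: flatten the per-file lists and aggregate again.
merged = aggregate_count([pair for partial in (file_a, file_b) for pair in partial])

print(file_a)
print(file_b)
print(merged)
```

Sorting before `groupby` matters because `groupby` only merges adjacent items with equal keys. Aggregating inside each task first keeps the data returned per task small, since at most one pair per date comes back from each file.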

## 3. Some Machine Learning