# PyWren RISECamp, 2017

## Data Analytics with PyWren

In this section, we will use PyWren to explore the Wikipedia data.


## 1. The data
We have a number of Wikipedia files stored in our RISECamp S3 bucket.
Let's take a peek at the data.

In [None]:
# some libraries that are useful for this tutorial
import sys
from training import *

# We need to load PyWren and create an executor instance
import pywren
pwex = pywren.default_executor()

In [None]:
# we'll first get the list of files
filenames = list_keys_with_prefix(wikipedia_bucket, "wikistats_20090505_restricted-01/")
print(len(filenames))

In [None]:
def take5(filename):
    data = read_from_s3(wikipedia_bucket, filename)
    result = data.split("\n")[:5]
    return result

future = pwex.call_async(take5, filenames[0])
print(future.result())


Unfortunately, this is not very readable because result() returns a list, which is printed as a single line. We can make it prettier by printing each record on its own line.

In [None]:
for x in future.result():
    print(x)

## 2. Count
Let’s see how many records in total are in this data set (this command will take a while, so read ahead while it is running).

In [None]:
def count(filename):
    data = read_from_s3(wikipedia_bucket, filename)
    return len(data.split("\n")) if data else 0

futures = pwex.map(count, filenames)
pywren.wait(futures)

result = sum([f.result() for f in futures])
print(result)

This should launch 73 PyWren tasks. Once the job has finished, let's plot the execution again. It should now look more interesting than it did for the simple job above.

In [None]:
plot_pywren_execution(futures)
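
The count job above follows a map-then-sum pattern: one function call per file, then a sum over the per-file results. As a rough local sketch of that pattern (not PyWren itself), the cell below uses Python's standard `concurrent.futures` as a stand-in for the executor. The `fake_files` dictionary and `count_local` are hypothetical names made up for illustration; they replace the S3 bucket and `read_from_s3`.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the S3 files: name -> contents.
fake_files = {
    "part-0": "a\nb\nc",
    "part-1": "d\ne",
    "part-2": "",
}

def count_local(name):
    data = fake_files[name]
    # Same rule as count(): an empty file contributes 0 records.
    return len(data.split("\n")) if data else 0

# Map step: one call per file, run on a local thread pool
# instead of on remote PyWren workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_file = list(pool.map(count_local, fake_files))

# Reduce step: add up the per-file counts.
total = sum(per_file)
print(per_file, total)
```

The structure is the same as in the PyWren cell: only the map step runs in parallel, while the final sum runs in the notebook.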

## 3. Visits for English Pages
Recall from when we peeked at the data above that the second field is the “project code” and contains information about the language of the pages. For example, the project code “en” indicates an English page. Let’s calculate the page counts of English pages, grouped by date.
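
To make the field layout concrete, here is a minimal sketch of how a single record could be split into the fields used in this section. It assumes space-separated fields with the date and time in the first field, the project code in the second, and the page count in the fourth. The sample line and the `parse_record` helper are made up for illustration and are not taken from the data set.

```python
def parse_record(line):
    """Return (date, project_code, page_count), or None for short lines."""
    fields = line.split(" ")
    if len(fields) < 4:
        return None
    date = fields[0][:8]          # first 8 characters: YYYYMMDD
    project_code = fields[1]      # e.g. "en" for English pages
    page_count = int(fields[3])   # assumed to be an integer count
    return (date, project_code, page_count)

# Made-up sample record, not real data.
sample = "20090505-000000 en Some_Page 42 1000"
print(parse_record(sample))
print(parse_record("too short"))
```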

In [None]:
from itertools import groupby
from operator import itemgetter
from functools import reduce

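def aggregate_count(key_value_list):
    # combine two (key, value) pairs with the same key by adding their values
    def reduce_f(obj1, obj2):
        return (obj1[0], obj1[1] + obj2[1])
    # groupby only groups adjacent items, so sort by key first
    counts = [reduce(reduce_f, group) for _, group
              in groupby(sorted(key_value_list), key=itemgetter(0))]

    return counts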

def english_page_count(filename):
    data = read_from_s3(wikipedia_bucket, filename)
    # keep only the English pages (project code "en")
    en_pages = [d for d in data.split("\n") 
                if len(d.split(" ")) >= 4 and d.split(" ")[1] == "en"]
    # projection to create (date, pagecount) pairs
    en_kvpair_list = [(p.split(" ")[0][:8], int(p.split(" ")[3])) for p in en_pages]

    return aggregate_count(en_kvpair_list)
    
futures = pwex.map(english_page_count, filenames)
pywren.wait(futures)

results = [f.result() for f in futures]
en_page_counts_by_date = aggregate_count([x for y in results for x in y])
print(en_page_counts_by_date)
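
The job above aggregates twice: once inside each task, and once more over the combined per-file results. The sketch below illustrates that second merge step on made-up partial results. Note that it swaps in a dictionary-based helper, `sum_by_key` (a hypothetical name), instead of the `groupby`/`reduce` approach used in `aggregate_count`; for (key, count) pairs it should produce the same kind of output.

```python
from collections import defaultdict

def sum_by_key(pairs):
    """Sum the values for each key and return sorted (key, total) pairs."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Made-up partial results, shaped like the output of
# english_page_count for two different files.
file_a = [("20090505", 10), ("20090506", 4)]
file_b = [("20090505", 7), ("20090507", 1)]

# Merge step: flatten the per-file lists and aggregate again.
merged = sum_by_key(file_a + file_b)
print(merged)
```

Aggregating inside each task first keeps the results returned to the notebook small: at most one pair per date per file, rather than one pair per record.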