## Two ways to count words
This notebook demonstrates two ways of counting words in a long text. A previous version of this notebook was based on the incorrect assumption that one of these methods is always faster. After correcting a bug, simple count turns out to be faster than sorted count on this text.

However, sorted count is still relevant: it is the faster approach when the text is too large to fit in the memory of a single machine. We will visit it again in a few classes when we use Map-Reduce in Spark to perform the same task.

This notebook can be run without Spark, using just Jupyter.

### The task

We are given a text file. Here we use Moby Dick by Herman Melville, which can be downloaded from [here](http://faculty.washington.edu/stepp/courses/2004autumn/tcss143/lectures/files/2004-11-08/mobydick.txt).

Our task is to read the file, separate it into words, make all words lower-case, and count the number of occurrences of each word.

In [None]:
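try:
    from string import lower,strip  # Python 2 only
except ImportError:
    lower,strip = str.lower,str.strip  # Python 3: use the equivalent str methods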
import re
%pylab inline

### Reading in the file
We read in the file, split it into words, and create a list called `all` that contains all of the words in the document.

In [None]:
%%time
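try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve          # Python 2
data_dir='../../Data'
filename='Moby-Dick.txt'

# download the book into data_dir (the directory must already exist)
f = urlretrieve("https://mas-dse-open.s3.amazonaws.com/"+filename, data_dir+'/'+filename)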

# Check that the text file is where we expect it to be
!ls -l $data_dir/$filename

In [None]:
#%%time
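all=[]  # note: this name shadows the built-in all(); it is kept because later cells use it
with open(data_dir+'/'+filename,'r') as infile:
    for line in infile:
        line=line.strip().lower()
        if len(line)==0:
            continue
        # split on runs of non-word characters and drop empty strings
        words=[w for w in re.split(r'\W+',line) if len(w)>0]
        all+=words
print('the book contains %d words' % len(all))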
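
To make the tokenization step concrete, here is a small sketch applied to a single line. The helper name `tokenize` and the sample line are made up for illustration and are not part of the notebook. It shows that `re.split(r'\W+', ...)` splits on runs of non-word characters, so punctuation is dropped and an apostrophe breaks a word in two.

```python
import re

def tokenize(line):
    """Lower-case a line and split it into words, mirroring the loop above."""
    line = line.strip().lower()
    # \W+ matches one or more characters that are not letters, digits or underscore
    return [w for w in re.split(r'\W+', line) if len(w) > 0]

sample = "  Call me Ishmael. Don't stop!  "   # hypothetical input line
print(tokenize(sample))   # ['call', 'me', 'ishmael', 'don', 't', 'stop']
```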

### Simple Count
First, let's try counting words using the most straightforward approach.
We create a dictionary `D` and then go through the list of words `all`. For each word, if it already exists as a key in the dictionary we increment the corresponding entry in `D`; otherwise we add it to the dictionary with a count of 1.

In [None]:
%%time
def simple_count(words):
    D={}
    for w in words:
        if w in D:
            D[w]+=1   # seen before: increment its count
        else:
            D[w]=1    # first occurrence: add it with a count of 1
    return D
D=simple_count(all)
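
As a cross-check, the standard library's `collections.Counter` performs the same tally. The sketch below is not part of the original notebook: `count_with_dict` is a hypothetical stand-in for the dictionary loop, and the word list is a tiny made-up sample so that the block is self-contained.

```python
from collections import Counter

def count_with_dict(words):
    """Plain-dictionary tally, same logic as the simple count above."""
    D = {}
    for w in words:
        if w in D:
            D[w] += 1
        else:
            D[w] = 1
    return D

words = ['the', 'whale', 'the', 'sea', 'the', 'whale']   # hypothetical sample
by_dict = count_with_dict(words)
by_counter = Counter(words)          # Counter is a dict subclass that counts hashable items
print(by_dict == dict(by_counter))   # True
```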

#### List the 10 most common words

In [None]:
S=sorted(D.items(),key=lambda d:d[1],reverse=True)
S[:10]
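
Sorting every entry just to read off the first ten does more work than necessary. One possible alternative, sketched here on a small hypothetical dictionary rather than on the real `D`, is `heapq.nlargest`, which the Python documentation describes as equivalent to `sorted(iterable, key=key, reverse=True)[:n]`.

```python
import heapq

counts = {'the': 5, 'whale': 3, 'sea': 2, 'ship': 1}   # hypothetical counts

# keep only the 2 entries with the largest count (second element of each pair)
top2 = heapq.nlargest(2, counts.items(), key=lambda d: d[1])
print(top2)   # [('the', 5), ('whale', 3)]
```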

### Sorted count
Next we show a different way to count. We first sort the words alphabetically. When we then iterate through the sorted list, all occurrences of a word appear consecutively, so we can count the occurrences of one word at a time. The count is stored in the dictionary when the current element of the list differs from the previous one.

In [None]:
%%time
from time import time
def sort_count(words):
    t0=time()
    S=sorted(words)
    t1=time()
    D={}
    current=''
    count=0
    for w in S:
        if current==w:
            count+=1
        else:
            # a new word starts: store the count of the run that just ended
            if current!='':
                D[current]=count
            count=1
            current=w
    # store the last run, which the loop never closes
    if current!='':
        D[current]=count
    t2=time()
    return D,t1-t0,t2-t1
D,sort_time,count_time=sort_count(all)
print('sort time= %5.1f ms, count time=%5.1f ms'%(1000*sort_time,1000*count_time))
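
The same run-length idea can also be written with `itertools.groupby`, which groups consecutive equal items of an iterable. This is an alternative sketch rather than the notebook's implementation; the function name `sort_count_groupby` and the sample list are hypothetical.

```python
from itertools import groupby

def sort_count_groupby(words):
    """Count words by sorting them and measuring the length of each run."""
    D = {}
    for word, run in groupby(sorted(words)):
        # run is an iterator over the consecutive copies of word
        D[word] = sum(1 for _ in run)
    return D

words = ['b', 'a', 'c', 'a', 'b', 'a']   # hypothetical sample
print(sort_count_groupby(words))
```

Each run is closed either when the next key starts or when the input ends, so the alphabetically last word is counted as well.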

### Conclusions
We have shown and compared two methods for counting words: simple count and sorted count. Counting is slightly faster after sorting; however, for data of this size, the sort time erases the advantage.

With larger text, especially text that is too large to fit in the memory of one machine, the advantage of sorting before counting becomes dominant.
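
To illustrate why sorting helps once the data no longer fits in memory, here is a minimal single-machine sketch of the idea: each chunk is sorted on its own, the sorted chunks are merged lazily with `heapq.merge`, and the counts are read off the merged stream one run at a time. The chunking, the function name `merged_sort_count` and the sample data are assumptions made for illustration. In a real setting each sorted chunk would live on disk or on another machine, and this is not the Map-Reduce implementation used later in the course.

```python
import heapq
from itertools import groupby

def merged_sort_count(chunks):
    """Yield (word, count) pairs from several word lists without concatenating them."""
    sorted_chunks = [sorted(chunk) for chunk in chunks]   # each chunk is sorted independently
    merged = heapq.merge(*sorted_chunks)                  # lazy iterator over all words, in sorted order
    for word, run in groupby(merged):
        yield word, sum(1 for _ in run)                   # length of the run of equal words

chunks = [['whale', 'sea', 'the'], ['the', 'ship', 'whale'], ['the']]   # hypothetical chunks
result = list(merged_sort_count(chunks))
print(result)   # [('sea', 1), ('ship', 1), ('the', 3), ('whale', 2)]
```

Only the counter for the current run is held at any time, whereas the dictionary approach needs one entry per distinct word.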

## Teacher Stuff

In [None]:
import Tester.SimpleCount_MASTER as SimpleCount_MASTER
import Tester.SimpleCount as SimpleCount
pickleFile="Tester/SimpleCount.pkl"

In [None]:
SimpleCount_MASTER.gen_exercise0_1(pickleFile)
SimpleCount_MASTER.gen_exercise0_2(pickleFile)
SimpleCount_MASTER.gen_exercise0_3(pickleFile)
SimpleCount_MASTER.gen_exercise0_4(pickleFile)

In [None]:
SimpleCount_MASTER.exercise0_1("Tester/SimpleCount.pkl", SimpleCount_MASTER.func_ex0_1)
SimpleCount_MASTER.exercise0_2("Tester/SimpleCount.pkl", SimpleCount_MASTER.func_ex0_2)
SimpleCount_MASTER.exercise0_3("Tester/SimpleCount.pkl", SimpleCount_MASTER.func_ex0_3)
SimpleCount_MASTER.exercise0_4("Tester/SimpleCount.pkl", SimpleCount_MASTER.func_ex0_4)

In [None]:
SimpleCount.exercise0_1(pickleFile, SimpleCount_MASTER.func_ex0_1)
SimpleCount.exercise0_2(pickleFile, SimpleCount_MASTER.func_ex0_2)
SimpleCount.exercise0_3(pickleFile, SimpleCount_MASTER.func_ex0_3)
SimpleCount.exercise0_4(pickleFile, SimpleCount_MASTER.func_ex0_4)

### End of Teacher Stuff

## Exercise 1 

A `k`-mer is a sequence of `k` consecutive words. 

For example, the `3`-mers in the line `you are my sunshine my only sunshine` are

* `you are my`
* `are my sunshine`
* `my sunshine my`
* `sunshine my only`
* `my only sunshine`

For the sake of simplicity we consider only the `k`-mers that appear in a single line. In other words, we ignore `k`-mers that span more than one line.
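
A quick sanity check for an implementation, without giving the solution away: a line with `m` words contains `max(m-k+1, 0)` `k`-mers, and the totals add up over lines because no `k`-mer spans two lines. The helper name `expected_num_kmers` is hypothetical, and the sketch assumes that words are separated by whitespace.

```python
def expected_num_kmers(text, k):
    """Number of k-mers in a list of lines, assuming whitespace-separated words."""
    # a line with m words has m-k+1 windows of length k, or none if m < k
    return sum(max(len(line.split()) - k + 1, 0) for line in text)

text = ['you are my sunshine my only sunshine']
print(expected_num_kmers(text, 3))   # 5
```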

Write a function **compute_kmers** that returns the list of `k`-mers in a given text for a given `k`.

######  <span style="color:blue">Code:</span>
```python
text = ['you are my sunshine my only sunshine']
compute_kmers(text,3)
```
######  <span style="color:magenta">Output:</span>
['you are my', 'are my sunshine', 'my sunshine my', 'sunshine my only', 'my only sunshine']

In [None]:
def compute_kmers(text,k):
    kmers = []
    # your implementation goes here
    return kmers

In [None]:
import Tester.SimpleCount as SimpleCount
SimpleCount.exercise0_1(pickleFile, compute_kmers)

In [None]:
# %load ../Tester/SimpleCount.py
import pickle

from basic_tester import *

def exercise0_1(pickleFile, func_student):
    checkExerciseFromPickle(pickleFile, func_student,TestList,'ex0_1', multiInputs=True)
    
def exercise0_2(pickleFile, func_student):
    checkExerciseFromPickle(pickleFile, func_student,TestList,'ex0_2')
    
def exercise0_3(pickleFile, func_student):
    checkExerciseFromPickle(pickleFile, func_student,TestList,'ex0_3')
    
def exercise0_4(pickleFile, func_student):
    checkExerciseFromPickle(pickleFile, func_student,TestList,'ex0_4', multiInputs=True)


## Exercise 2

Given a list of `k`-mers, write a function **count_kmers** that returns a dictionary whose keys are the `k`-mers and whose values are the number of times each `k`-mer occurs (its count) in the input list.

######  <span style="color:blue">Code:</span>
```python
kmers = ['you are my', 'are my sunshine', 'my sunshine my', 'sunshine my only', 'my only sunshine']
count_kmers(kmers)
```
######  <span style="color:magenta">Output:</span>

{'you are my' : 1, 'are my sunshine' : 1, 'my sunshine my' : 1, 'sunshine my only' : 1, 'my only sunshine' : 1}

In [None]:
def count_kmers(kmers):
    kmers_count = dict()
    # your implementation goes here 
    return kmers_count

In [None]:
import Tester.SimpleCount as SimpleCount
SimpleCount.exercise0_2(pickleFile, count_kmers)

## Exercise 3 

Given the dictionary of `k`-mer counts from exercise 2, write a function **sort_counts** that sorts the `k`-mers in descending order of their counts and returns a list of tuples.

* `Each tuple is of the form (kmer, count).`
* `If two k-mers have same count, then sort them lexicographically.`
    
######  <span style="color:blue">Code:</span>
```python
kmers_counts =  {'you are my' : 1, 'are my sunshine' : 1, 'my sunshine my' : 1, 'sunshine my only' : 1, 'my only sunshine' : 1}
sort_counts(kmers_counts)
```
######  <span style="color:magenta">Output:</span>
[('are my sunshine', 1) , ('my only sunshine' , 1) , ('my sunshine my', 1), ('sunshine my only', 1) , ('you are my', 1)]

In [None]:
def sort_counts(kmers_counts):
    sorted_counts = []
    # your implementation goes here
    return sorted_counts

In [None]:
import Tester.SimpleCount as SimpleCount
SimpleCount.exercise0_3(pickleFile, sort_counts)

## Exercise 4 

Given a list of lines, write a function **get_top_n_kmers** that returns a list of tuples containing the top `n` `k`-mers and their counts in the given text, for given `n` and `k`.


######  <span style="color:blue">Code:</span>
```python
n=2
k=3
text = ['you are my sunshine my only sunshine']
get_top_n_kmers(text,n,k)
```
######  <span style="color:magenta">Output:</span>
    [('are my sunshine', 1) , ('my only sunshine' , 1)]

In [None]:
def get_top_n_kmers(text,n,k):
    kmers = compute_kmers(text,k)
    kmers_count = count_kmers(kmers)
    sorted_counts = sort_counts(kmers_count)
    #SOLUTION BEGINS
    top_n_kmers = []
    #SOLUTION ENDS
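    # each entry is a (kmer, count) tuple, so unpack it before formatting as "count: kmer"
    print('most common %d-mers\n%s' % (k, '\n'.join(['%d:\t%s'%(count,kmer) for kmer,count in top_n_kmers])))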
    return top_n_kmers

In [None]:
import Tester.SimpleCount as SimpleCount
SimpleCount.exercise0_4(pickleFile, get_top_n_kmers)