In [82]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

In [83]:
import grader

# Mapreduce

## Introduction

We are going to be running mapreduce jobs on the wikipedia dataset.  The dataset is available (pre-chunked) on S3: `s3://dataincubator-course/mrdata/simple/`.  It may be downloaded with the `aws s3 sync` command or via HTTPS from `https://s3.amazonaws.com/dataincubator-course/mrdata/simple/part-000*`.

For development, you can even use a single chunk (e.g. `part-00026.xml.bz2`). It is small enough that mrjob can process it in a few seconds. Your development cycle should be:

1.  Get your job to work locally on one chunk.  This will greatly speed up your development.  To run locally:
```bash
python job_file.py -r local data/wikipedia/simple/part-00026.xml.bz2 > /tmp/output.txt
```
    
2.  Get your job to work on the full dataset on GCP (Google Cloud Platform).  This will greatly speed up your production.  To run on GCP ([details](https://pythonhosted.org/mrjob/guides/dataproc-quickstart.html)):
```bash
python job_file.py -r dataproc data/wikipedia/simple/part-00026.xml.bz2 \
    --output-dir=gs://my-bucket/output/ \
    --no-output 
```

    Note that you can also pass an entire local directory of data (e.g. `data/simple/`) as the input.

### Note on Memory
There's a large difference between developing locally on one chunk and running your job on the entire dataset.  While you can get away with sloppy memory use locally, you really need to keep memory usage down if you hope to be able to complete the miniproject.  Remember, memory needs to be $O(1)$, not $O(n)$ in input.
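
As a minimal illustration (plain Python rather than mrjob, with function names made up for this sketch), the two reducers below compute the same total. The first materializes every value for a key, while the second keeps only a running sum:

```python
def reducer_sloppy(key, values):
    # O(n) memory: list() holds every value for this key at once.
    all_values = list(values)
    yield key, sum(all_values)


def reducer_streaming(key, values):
    # O(1) memory: consume the iterator one value at a time.
    total = 0
    for v in values:
        total += v
    yield key, total


# A generator stands in for the one-pass iterator of values a reducer receives.
counts = (1 for _ in range(1000))
result = list(reducer_streaming("the", counts))
```

On one small chunk both versions behave identically, which is why sloppy memory use tends to go unnoticed until the full dataset is processed.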

### Multiple Mapreduces
You can combine multiple steps by overriding the [steps method](https://pythonhosted.org/mrjob/guides/writing-mrjobs.html#multi-step-jobs).  Usually your mapreduce might look like this:
```python
from mrjob.job import MRJob

class SingleMRJob(MRJob):
    def mapper(self, key, value):
        pass

    def reducer(self, key, values):
        pass
```

`MRJob` automatically uses the `mapper` and `reducer` methods.  To specify multiple steps, you need to override the `steps` method:

```python
from mrjob.job import MRJob
from mrjob.step import MRStep

class MultipleMRJob(MRJob):
    def mapper1(self, key, value):
        pass

    def reducer1(self, key, values):
        pass
        
    def mapper2(self, key, value):
        pass

    def reducer2(self, key, values):
        pass
        
    def steps(self):
        return [
            MRStep(mapper=self.mapper1, reducer=self.reducer1),
            MRStep(mapper=self.mapper2, reducer=self.reducer2),
        ]
```

As a matter of good style, we recommend that you actually write each individual mapreduce as its own class.  Then write a wrapper module whose sole job is to combine those mapreduces by overriding `steps`.

Some simple boilerplate for this, taking advantage of the default `steps` function that we get for free in a single-step MRJob class:

```python
from mrjob.job import MRJob

class FirstStep(MRJob):
  def mapper(self, key, value):
    pass
  def reducer(self, key, values):
    pass
  
class SecondStep(MRJob):
  def mapper(self, key, value):
    pass
  def reducer(self, key, values):
    pass
  
class SteppedJob(MRJob):
  """
  A two-step job that first runs FirstStep's MR and then SecondStep's MR
  """
  def steps(self):
    return FirstStep().steps() + SecondStep().steps()
```


### Note on Style
Here are some helpful articles on how mrjob works and how to pass parameters to your script:
  - [How mrjob is run](https://pythonhosted.org/mrjob/guides/concepts.html#how-your-program-is-run)
  - [Adding passthrough options](https://pythonhosted.org/mrjob/job.html#mrjob.job.MRJob.add_passthrough_option)
  - [An example of someone solving similar problems](http://arunxjacob.blogspot.com/2013/11/hadoop-streaming-with-mrjob.html)

See the notebook "Hadoop MapReduce with mrjob" in the datacourse for more details.

Finally, if you find yourself processing a lot of special cases, you are probably doing it wrong.  For example, mapreduce jobs for `Top100WordsSimpleWikipediaPlain`, `Top100WordsSimpleWikipediaText`, and `Top100WordsSimpleWikipediaNoMetaData` are less than 150 lines of code (including generous blank lines and boilerplate code).

In [3]:
!aws s3 sync s3://dataincubator-course/mrdata/simple/ . 

download: s3://dataincubator-course/mrdata/simple/part-00009.xml.bz2 to ./part-00009.xml.bz2
download: s3://dataincubator-course/mrdata/simple/part-00008.xml.bz2 to ./part-00008.xml.bz2
download: s3://dataincubator-course/mrdata/simple/part-00006.xml.bz2 to ./part-00006.xml.bz2
download: s3://dataincubator-course/mrdata/simple/part-00002.xml.bz2 to ./part-00002.xml.bz2
download: s3://dataincubator-course/mrdata/simple/part-00003.xml.bz2 to ./part-00003.xml.bz2
download: s3://dataincubator-course/mrdata/simple/part-00004.xml.bz2 to ./part-00004.xml.bz2
download: s3://dataincubator-course/mrdata/simple/part-00000.xml.bz2 to ./part-00000.xml.bz2
download: s3://dataincubator-course/mrdata/simple/part-00001.xml.bz2 to ./part-00001.xml.bz2
download: s3://dataincubator-course/mrdata/simple/part-00007.xml.bz2 to ./part-00007.xml.bz2
download: s3://dataincubator-course/mrdata/simple/part-00005.xml.bz2 to ./part-00005.xml.bz2
download: s3://dataincubator-course/mrdata/simple/part-00015.xml.bz2 t

In [4]:
from mrjob.job import MRJob

In [5]:
import re
WORD_RE = re.compile(r"[\w]+")

for word in WORD_RE.findall('input.txt'):
    print word

input
txt


In [77]:
#test.py
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):
    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1
        
    def reducer(self, key, values):
        yield key, sum(values)

#if __name__ == '__main__':
#    MRWordFrequencyCount.run()

In [25]:
!python top100_words_simple_plain.py /home/vagrant/datacourse/mapreduce/miniprojects/part-* > top100.txt

Using configs in /home/vagrant/.mrjob.conf
Creating temp directory /tmp/top100_words_simple_plain.vagrant.20170703.215042.794552
Running step 1 of 2...
Running step 2 of 2...
Streaming final output from /tmp/top100_words_simple_plain.vagrant.20170703.215042.794552/output...
Removing temp directory /tmp/top100_words_simple_plain.vagrant.20170703.215042.794552...


In [7]:
def question1():
    with open('top100.txt') as f:
        content = f.readlines()
    res1=[]

    content = [x.strip() for x in content] 
    for line in content:
        l = line.split()
        res1.append((l[0].strip('"'),int(l[1])))
    return res1

result1 = question1()
print result1[:20]

[('the', 1596419), ('quot', 1400092), ('gt', 1211888), ('lt', 1205656), ('id', 1142905), ('of', 972204), ('in', 659218), ('and', 634202), ('text', 604352), ('a', 581510), ('title', 539917), ('to', 489127), ('page', 439939), ('is', 407340), ('format', 386311), ('model', 381564), ('category', 380599), ('revision', 380467), ('ns', 378544), ('timestamp', 377863)]


## Question 1: top100_words_simple_plain
Return a list of the top 100 words in the article text (in no particular order). You will need to write this as two mapreduces:

1. The first job is similar to standard wordcount but with a few tweaks. The data provided for Wikipedia is in `*.xml.bz2` format.  mrjob will automatically decompress `bz2`.  We'll deal with the `xml` in the next question. For now, just treat it as text.  A few hints:
   - To split the words, use the regular expression `\w+`.
   - Words are not case sensitive: i.e. "The" and "the" refer to the same word.  You can use the string method `lower()` to get a single case-insensitive canonical version of the data.

2. The second job will take a collection of pairs `(word, count)` and filter for only the highest 100.  A few notes:
    - **Passing parameters:** To make the job more reusable make the job find the largest `n` words where `n` is a parameter obtained via [`get_jobconf_value`](https://pythonhosted.org/mrjob/utils-compat.html).
    - **Keeping track of the top n:** We have to keep track of at most the `n` most popular words.  As long as `n` is small, e.g. 100, we can keep track of the *running largest n* in memory with a priority queue. We suggest taking a look at `heapq` ([details](https://docs.python.org/2/library/heapq.html)), part of the Python standard library, for this.  It allows you to push elements into a list while keeping track of the highest-priority (smallest) element.
```python
from heapq import heappush, heappop

h = []
heappush(h, (5, 'write code'))
heappush(h, (7, 'release product'))
heappush(h, (1, 'write spec'))
heappush(h, (3, 'create tests'))
heappop(h)  # returns (1, 'write spec'), the smallest tuple
```
   
       A naive implementation would cost $O(1)$ to insert but $O(n)$ to retrieve.  `heapq` uses a [binary heap](https://en.wikipedia.org/wiki/Binary_heap) stored in a plain list, which gives $O(\log(n))$ insertion and removal and $O(1)$ access to the smallest element (`h[0]`). You may be asked about this data structure in an interview, so it is good to get practice with it now.
    - **Working across nodes:** To obtain the largest `n`, we need to first obtain the largest n elements per chunk from the mapper, output them to the same key (reducer), and then collect the largest n elements of those in the reducer (**Question:** why does this guarantee that we have found the largest n over the entire set?)
    - **Working within a node:** Given that we are using a priority queue, we will need to first initialize it, then `heappush` or `heappushpop` each record onto it, and finally output the top `n` after seeing every record.  For mappers, notice that these three phases correspond nicely to these three functions:
        - `mapper_init`
        - `mapper`
        - `mapper_final`

    There are similar functions in the reducer.  Also, while the `run` method to launch the mapreduce job is a classmethod:
    ```python
    if __name__ == '__main__':
        MRWordCount.run()
    ```
    actual instances of our mapreduce are instantiated on the map and reduce nodes.  More precisely, a separate instance of the job class is created on each map node and on each reduce node.  This means that the three mapper functions can pass state through `self`, e.g. `self.heap`. Remember that to pass state between the map and reduce phases, you will have to `yield` it from the mapper and read it back in the reducer. (**Question:** Can you pass state between two mappers?)
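
The three-phase pattern and the cross-node merge can be sketched in plain Python. This is only an illustration, not the reference solution: `TopNMapper`, `run_chunk`, and `merge_top_n` are hypothetical names, and the sketch assumes each word reaches this step exactly once with its final count (as it would after the first word-count job).

```python
import heapq


class TopNMapper(object):
    """Plain-Python stand-in for mapper_init / mapper / mapper_final."""

    def __init__(self, n):
        self.n = n

    def mapper_init(self):
        # Min-heap of (count, word); heap[0] is the smallest pair still kept.
        self.heap = []

    def mapper(self, word, count):
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, (count, word))
        else:
            # Push the new pair, then drop the smallest: size stays at n.
            heapq.heappushpop(self.heap, (count, word))

    def mapper_final(self):
        # Emit everything under one key so a single reducer sees all candidates.
        for pair in self.heap:
            yield None, pair


def run_chunk(pairs, n):
    """Run the three phases over one chunk and collect the emitted values."""
    m = TopNMapper(n)
    m.mapper_init()
    for word, count in pairs:
        m.mapper(word, count)
    return [value for _, value in m.mapper_final()]


def merge_top_n(candidates, n):
    """Reducer side: the global top n is among the per-chunk top n."""
    return heapq.nlargest(n, candidates)


chunk_a = [("the", 50), ("of", 30), ("cat", 2), ("in", 20)]
chunk_b = [("and", 25), ("a", 22), ("dog", 1), ("to", 18)]
top3 = merge_top_n(run_chunk(chunk_a, 3) + run_chunk(chunk_b, 3), 3)
```

Each mapper holds at most `n` pairs regardless of chunk size, and the reducer sees at most `n` pairs per mapper.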

**Checkpoint:**
- Total unique words: 1,584,646
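
The tokenization from the first job can be tried on a couple of made-up lines before wiring it into a mapper. The sketch below uses a `Counter` only to inspect the result on a tiny sample; in the actual job the counting would be done by the reducer. The sample lines and the helper name `iter_words` are assumptions for illustration.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def iter_words(line):
    """Yield each \\w+ match in lower case, one word at a time."""
    for match in WORD_RE.finditer(line):
        yield match.group(0).lower()


# Treating the XML as plain text means entity names and tag names count as words.
sample = ["The cat saw the &quot;Cat&quot;", "<title>The</title>"]
counts = Counter(word for line in sample for word in iter_words(line))
```

Here `quot` and `title` are counted alongside real words, which matches the kind of noise visible in the plain-text top-100 output.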

In [8]:
def top100_words_simple_plain():
    return [("the", 1586419)] * 100

grader.score(question_name='mr__top100_words_simple_plain', func=question1)  

Your score:  1.0


## Question 2: top100_words_simple_text
Notice that the words "page" and "text" make it into the top 100 words in the previous problem.  These are not common English words!  If you look at the xml formatting, you'll realize that these are xml tags.  You should parse the files so that tags like `<page></page>` are not included in your total, and neither are words outside of the tag `<text></text>`.

**Hints**:
1. Both `xml.etree.ElementTree` from the Python stdlib and `lxml.etree` parse xml. `lxml` is significantly faster though and avoids some bugs.

2. In order to parse the text, we will have to accumulate a `<page></page>` worth of data and then split the resulting string into words.

3. Don't forget that the Wikipedia format can have multiple revisions but you only want the latest one.

4. What happens if the content of a page is split across two different mappers? How does this problem scale with data size?
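
Hints 2 and 3 can be sketched on a hand-written page. This example uses the stdlib `xml.etree.ElementTree` so it runs without extra packages (`lxml.etree` offers a similar interface). The `PAGE` string and the helper `latest_text` are made up for illustration, and the sketch assumes the last `<revision>` in document order is the latest one.

```python
import xml.etree.ElementTree as ET

# A made-up page with two revisions, shaped like the Wikipedia dump.
PAGE = """<page>
  <title>Wake Island</title>
  <revision>
    <id>1</id>
    <text>old text</text>
  </revision>
  <revision>
    <id>2</id>
    <text>'''Wake Island''' is an [[atoll]].</text>
  </revision>
</page>"""


def latest_text(page_xml):
    """Return the <text> of the last <revision>, or '' if there is none."""
    root = ET.fromstring(page_xml)
    revisions = root.findall("revision")
    if not revisions:
        return ""
    # findtext returns None for a missing element, so normalize to ''.
    return revisions[-1].findtext("text") or ""


latest = latest_text(PAGE)
```

Returning an empty string for pages without text keeps the downstream word splitting free of special cases.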

**Checkpoint:**
- Total unique words: 867,871

In [84]:
from lxml import etree

In [88]:
from lxml import etree

filename = 'part-00001.xml'

def get_single_page(f):
    page_single = []
    flag = 0
    line = ' '
    while flag == 0 and line != '':
        line = f.readline()
        #print line
        page_single.append(line)
        if '</page>' in line:
            flag = 1
    return ''.join(page_single)
    
    
    
with open(filename, 'r') as f:
    get_single_page(f)
    count = 0
    while True: #count < 3 :
        page = get_single_page(f)
        if page == '':
            print 'end of file', count
            break  
        #print page
        count += 1
        root = etree.XML(page)
        #print(root.tag)
        content = root.xpath("//revision")[-1].xpath(".//text")[0].text
        
        #outfile = open(filename+'_'+str(count), 'w')
        #outfile.write(content.encode('ascii', 'ignore'))
        #outfile.close()
        
        
        
    

end of file 4668


In [121]:
with open(filename, 'r') as f:
    for line in f:
        #print line,
        pass

In [94]:
content = root.findall('revision')[-1].find('text').text
print content[:200]

{{Infobox islands
| name                = Wake Island
| image name          = Wake Island map.png
| image caption       = Map of Wake Island
| image size          = 250px
| location            = North


In [92]:
#print root.xpath("//text")[0].text
content = root.xpath("//revision")[-1].xpath(".//text")[0].text
print content[:200]

{{Infobox islands
| name                = Wake Island
| image name          = Wake Island map.png
| image caption       = Map of Wake Island
| image size          = 250px
| location            = North


In [98]:
lines = content.split('\n')
for l in lines:
    print l

{{Infobox islands
| name                = Wake Island
| image name          = Wake Island map.png
| image caption       = Map of Wake Island
| image size          = 250px
| location            = North Pacific
| coordinates         = {{Coord|19|18|N|166|38|E|type:isle_region:UM-79|display=inline,title}}
| total islands       = 3
| area sqmi           = 2.85
| coastline mi        = 12.0
| coastline footnotes = <ref>Coastline for Wake Islet: {{convert|12.0|mi|abbr=on}}; Coastline for Wake Atoll: {{convert|21.0|mi|abbr=on}}</ref>
| highest mount       = Ducks Point
| elevation ft        = 20
| country             = {{USA}} <br /> ''Wake Island is under the administration of the''  [[United States Air Force]]
| population          = 150{{Citation needed|date=September 2011}}
}}

[[File:Wake Island air.JPG|thumb|upright=1.6|Aerial overview of the atoll]]
'''Wake Island''' is an [[atoll]] (a type of [[island]]) in the [[Pacific Ocean]], near [[Hawaii]]. It is controlled by the [[United States

In [117]:
!python top100_words_simple_text.py /home/vagrant/datacourse/mapreduce/miniprojects/part-000*.xml > top100_text_all.txt

Using configs in /home/vagrant/.mrjob.conf
Creating temp directory /tmp/top100_words_simple_text.vagrant.20170705.223645.674896
Running step 1 of 3...
Running step 2 of 3...
Running step 3 of 3...
Streaming final output from /tmp/top100_words_simple_text.vagrant.20170705.223645.674896/output...
Removing temp directory /tmp/top100_words_simple_text.vagrant.20170705.223645.674896...


In [118]:
def question2():
    with open('top100_text_all.txt') as f:
        content = f.readlines()
    res1=[]

    content = [x.strip() for x in content] 
    for line in content:
        l = line.split()
        res1.append((l[0].strip('"'),int(l[1])))
    return res1

result2 = question2()
print result2[:20]

[('the', 1579644), ('of', 947437), ('in', 647037), ('and', 619675), ('a', 573372), ('to', 445456), ('is', 405147), ('ref', 370018), ('category', 325594), ('s', 290175), ('1', 234910), ('http', 216872), ('it', 215571), ('0', 214013), ('was', 212993), ('for', 209574), ('2', 197576), ('on', 186247), ('name', 177114), ('br', 166155)]


In [119]:
def top100_words_simple_text():
    return [("the", 1577579)] * 100

grader.score(question_name='mr__top100_words_simple_text', func=question2)#top100_words_simple_text)

Your score:  1.0


## Question 3: top100_words_simple_no_metadata

Finally, notice that 'www' and 'http' make it into the list of top 100 words in the previous problem.  These are not common English words either!  They clearly come from the URLs in hyperlinks.  Looking at the format of [Wikipedia links](http://en.wikipedia.org/wiki/Help:Wiki_markup#Links_and_URLs) and [citations](http://en.wikipedia.org/wiki/Help:Wiki_markup#References_and_citing_sources), you'll notice that they tend to appear within single and double brackets and curly braces.

**Hint**:
You can either write a simple parser to eliminate the URLs within brackets, angle braces, and curly braces, or you can use a package like the colorfully-named [mwparserfromhell](https://github.com/earwig/mwparserfromhell/), which has been provisioned on `mrjob` and supports the convenient helper function `strip_code()` (which is used by the reference solution).
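
For the "simple parser" route, a regex-based sketch might look like the following. It is deliberately crude and is not equivalent to `strip_code()`: the patterns and the name `strip_markup` are assumptions for illustration, and links with several pipes (such as `[[File:...|thumb|...]]`) are left untouched.

```python
import re

TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")    # innermost {{...}}
TAG_RE = re.compile(r"<[^<>]*>")               # <ref>, </ref>, <br />, ...
EXT_LINK_RE = re.compile(r"\[\w+://[^\s\]]*\s*([^\]]*)\]")       # [http://url label]
WIKILINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")  # [[target|label]]


def strip_markup(text):
    """Drop templates and tags; keep only the visible label of links."""
    # Templates can nest, so peel the innermost ones until nothing changes.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE_RE.sub("", text)
    text = TAG_RE.sub(" ", text)
    text = EXT_LINK_RE.sub(r"\1", text)   # keep the label, drop the URL
    text = WIKILINK_RE.sub(r"\1", text)   # keep the label (or the bare target)
    return text


raw = ("It is near [[Hawaii]] and the [[U.S. state|50 U.S. states]]."
       "{{Citation needed|date={{now}}}}<br />"
       "See [http://www.example.com the site].")
clean = strip_markup(raw)
```

On this input, `clean` is `"It is near Hawaii and the 50 U.S. states. See the site."`, so neither `http` nor `www` survives to be counted.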

**Checkpoint:**
- Total unique words: 618,410

In [122]:
import mwparserfromhell

In [128]:
print content[:500]

{{Infobox islands
| name                = Wake Island
| image name          = Wake Island map.png
| image caption       = Map of Wake Island
| image size          = 250px
| location            = North Pacific
| coordinates         = {{Coord|19|18|N|166|38|E|type:isle_region:UM-79|display=inline,title}}
| total islands       = 3
| area sqmi           = 2.85
| coastline mi        = 12.0
| coastline footnotes = <ref>Coastline for Wake Islet: {{convert|12.0|mi|abbr=on}}; Coastline for Wake Atoll: {{


In [125]:
clean_content = mwparserfromhell.parse(content).strip_code()

In [127]:
print clean_content[:500]

thumb|upright=1.6|Aerial overview of the atoll
Wake Island is an atoll (a type of island) in the Pacific Ocean, near Hawaii. It is controlled by the United States Army and United States Air Force. It is a territory of the United States, part of the United States Minor Outlying Islands.

 Geography 
Wake is located to the west of the International Date Line and sits in the Wake Island Time Zone, one day ahead of the 50 U.S. states.

Referring to the atoll as an island is the result of a pre-World


In [129]:
!python top100_words_no_meta.py /home/vagrant/datacourse/mapreduce/miniprojects/part-000*.xml > top100_no_meta.txt

Using configs in /home/vagrant/.mrjob.conf
Creating temp directory /tmp/top100_words_no_meta.vagrant.20170705.225556.673397
Running step 1 of 3...
Running step 2 of 3...
Running step 3 of 3...
Streaming final output from /tmp/top100_words_no_meta.vagrant.20170705.225556.673397/output...
Removing temp directory /tmp/top100_words_no_meta.vagrant.20170705.225556.673397...


In [130]:
def question3():
    with open('top100_no_meta.txt') as f:
        content = f.readlines()
    res1=[]

    content = [x.strip() for x in content] 
    for line in content:
        l = line.split()
        res1.append((l[0].strip('"'),int(l[1])))
    return res1

result3 = question3()
print result3[:20]

[('the', 1431080), ('of', 747461), ('in', 586417), ('and', 547715), ('a', 517220), ('to', 417351), ('is', 391115), ('was', 209489), ('it', 206676), ('for', 185216), ('on', 163112), ('0', 157316), ('that', 154481), ('s', 152092), ('as', 148823), ('align', 141509), ('by', 132277), ('are', 129047), ('1', 126897), ('from', 126530)]


In [131]:
def top100_words_simple_no_metadata():
    return [("the", 1427342)] * 100

grader.score(question_name='mr__top100_words_simple_no_metadata', func=question3)#top100_words_simple_no_metadata)

Your score:  1.0


## Question 4: link_stats_simple
Let's look at some summary statistics on the number of unique links on a page to other Wikipedia articles.  Return the number of articles (count), average number of links, standard deviation, and the 25%, median, and 75% quantiles.

1. Notice that the library `mwparserfromhell` supports the method `filter_wikilinks()`.
2. You will need to compute these statistics in a way that requires O(1) memory.  You should be able to compute the first few (i.e. non-quantile) statistics exactly by looking at the first few moments of a distribution. The quantile quantities can be accurately estimated by using reservoir sampling with a large reservoir.
3. If a page links to the same article multiple times, count that article only once.  This keeps our results from becoming too skewed.
4. Don't forget that some (a surprisingly large number of) links have unicode! Make sure you treat them correctly.
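
Hint 2 can be sketched as a small accumulator. It is one possible shape, not the reference solution: `RunningStats` and its parameters are hypothetical, the standard deviation is the population form computed from the first two moments, and the quantiles come from a fixed-size reservoir (Algorithm R), so memory stays bounded by `reservoir_size` however many pages are seen.

```python
import math
import random


class RunningStats(object):
    def __init__(self, reservoir_size, seed=None):
        self.count = 0
        self.total = 0          # sum of x
        self.total_sq = 0       # sum of x * x
        self.reservoir_size = reservoir_size
        self.reservoir = []
        self.rng = random.Random(seed)

    def add(self, x):
        self.count += 1
        self.total += x
        self.total_sq += x * x
        if len(self.reservoir) < self.reservoir_size:
            self.reservoir.append(x)
        else:
            # Algorithm R: the item numbered `count` replaces a random slot
            # with probability reservoir_size / count.
            j = self.rng.randrange(self.count)
            if j < self.reservoir_size:
                self.reservoir[j] = x

    def mean(self):
        return float(self.total) / self.count

    def stdev(self):
        # Population standard deviation: sqrt(E[x^2] - E[x]^2).
        m = self.mean()
        return math.sqrt(max(float(self.total_sq) / self.count - m * m, 0.0))

    def quantile(self, q):
        # Estimated from the sample held in the reservoir.
        ordered = sorted(self.reservoir)
        return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


stats = RunningStats(reservoir_size=1000, seed=0)
for links_on_page in [0, 1, 1, 2, 3, 5, 8, 13]:
    stats.add(links_on_page)
```

With fewer items than the reservoir size, the reservoir holds every value and the quantiles are exact; beyond that they are estimates whose accuracy improves with a larger reservoir.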

In [133]:
import mwparserfromhell
content_links = mwparserfromhell.parse(content).filter_wikilinks()
print content_links
print len(content_links)

[u'[[United States Air Force]]', u'[[File:Wake Island air.JPG|thumb|upright=1.6|Aerial overview of the atoll]]', u'[[atoll]]', u'[[island]]', u'[[Pacific Ocean]]', u'[[Hawaii]]', u'[[United States Army]]', u'[[United States Air Force]]', u'[[territory]]', u'[[United States]]', u'[[United States Minor Outlying Islands]]', u'[[International date line|International Date Line]]', u'[[Wake Island Time Zone]]', u'[[U.S. state|50 U.S. states]]', u'[[World War II]]', u'[[United States Navy]]', u'[[Japanese]]', u'[[Category:Island insular areas of the United States]]', u'[[Category:Micronesian islands]]', u'[[Category:Atolls]]']
20
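
The list above contains `[[United States Air Force]]` twice, so a raw `len()` overcounts under hint 3. A minimal sketch of deduplicating by link target follows, working on the link strings directly. The helper names are made up, and whether `File:` and `Category:` links should count as links to articles is left open here.

```python
def link_target(link):
    """'[[target|label]]' -> 'target' (the part before the first pipe)."""
    inner = link.strip()[2:-2]           # drop the surrounding [[ and ]]
    return inner.split("|", 1)[0].strip()


def count_unique_links(links):
    """Count distinct link targets on one page."""
    return len(set(link_target(l) for l in links))


sample_links = [
    u"[[United States Air Force]]",
    u"[[atoll]]",
    u"[[United States Air Force]]",
    u"[[International date line|International Date Line]]",
]
unique = count_unique_links(sample_links)
```

Comparing targets rather than whole link strings also treats `[[A|first label]]` and `[[A|second label]]` as the same article.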


In [143]:
links = mwparserfromhell.parse(content).filter_wikilinks()
print links

links = [l.encode('utf8') for l in links]
print links
print len(links)

[u'[[United States Air Force]]', u'[[File:Wake Island air.JPG|thumb|upright=1.6|Aerial overview of the atoll]]', u'[[atoll]]', u'[[island]]', u'[[Pacific Ocean]]', u'[[Hawaii]]', u'[[United States Army]]', u'[[United States Air Force]]', u'[[territory]]', u'[[United States]]', u'[[United States Minor Outlying Islands]]', u'[[International date line|International Date Line]]', u'[[Wake Island Time Zone]]', u'[[U.S. state|50 U.S. states]]', u'[[World War II]]', u'[[United States Navy]]', u'[[Japanese]]', u'[[Category:Island insular areas of the United States]]', u'[[Category:Micronesian islands]]', u'[[Category:Atolls]]']
['[[United States Air Force]]', '[[File:Wake Island air.JPG|thumb|upright=1.6|Aerial overview of the atoll]]', '[[atoll]]', '[[island]]', '[[Pacific Ocean]]', '[[Hawaii]]', '[[United States Army]]', '[[United States Air Force]]', '[[territory]]', '[[United States]]', '[[United States Minor Outlying Islands]]', '[[International date line|International Date Line]]', '[[Wa

In [None]:
# file: link_stat.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from lxml import etree
import mwparserfromhell
from heapq import heappush, heappop
import random
import math

class MRMostUsedWord(MRJob):
    
    def steps(self):
        return [
            MRStep(mapper_init = self.mapper_init_get_page,
                   mapper = self.mapper_get_page), 
            MRStep(mapper = self.mapper_get_links,
                   reducer_init = self.reducer_init_count_links,
                   reducer = self.reducer_count_links)
        ]

    def mapper_init_get_page(self):
        self.page_single = []
        self.page_status = 0
        
    def mapper_get_page(self, _, line):
        if '<page>' in line:
            self.page_single = []
            self.page_status = 1
            
        if self.page_status == 1:
            self.page_single.append(line)
            
        if '</page>' in line:  
            if self.page_status == 1:
                page = ''.join(self.page_single)
                root = etree.XML(page)
                content = root.xpath("//revision")[-1].xpath(".//text")[0].text
                self.page_status = 0
                if content:
                    # Yield the raw wikitext: strip_code() would remove the
                    # [[...]] markup that mapper_get_links needs to find.
                    yield None, content
            else:
                self.page_status = 0
                self.page_single = []
            
    def mapper_get_links(self, _, line):
        links = mwparserfromhell.parse(line).filter_wikilinks()
        links = [l.encode('utf8') for l in links]
        yield None, len(links)
        
        
    def reducer_init_count_links(self):
        self.page_count = 0
        self.sum_link = 0
        self.sum_linksq = 0
        self.linklist = []
        
    def reducer_count_links(self, _, counts):
        for c in counts:
            self.page_count += 1
            self.sum_link += c
            self.sum_linksq += c*c
            rand = random.random()
            if rand > 0.9:
                heappush(self.linklist, c)
                #self.linklist.append(c)
                
        print '\"count\" , ', self.page_count
        avg = float(self.sum_link) / self.page_count
        print '\"mean\" , ', avg
        avgsq = float(self.sum_linksq) / self.page_count
        std = math.sqrt(avgsq - avg*avg)
        print '\"stdev\" , ', std
        #li = self.linklist.sort()
        length = len(self.linklist)
        li = [ heappop(self.linklist) for i in range(length) ]
        l = float(length)
        a = int(l/4.0)
        b = int(l/2.0)
        c = int(l/4.0*3)
        print '\"25%\" , ', li[a]
        print '\"median\" , ', li[b]
        print '\"75%\" , ', li[c]
        print 'len(linklist):', length, a,b,c
        
        yield 0, 0

    #def reducer_final_count_links(self): 
    #you may use reducer_final to print result involving self.XXX in reducer_init

#if __name__ == '__main__':
#    MRMostUsedWord.run()

In [152]:
!python link_stat.py /home/vagrant/datacourse/mapreduce/miniprojects/part-000*.xml 

Using configs in /home/vagrant/.mrjob.conf
Creating temp directory /tmp/link_stat.vagrant.20170706.192040.163474
Running step 1 of 2...
Running step 2 of 2...
"count" ,  188408
"mean" ,  15.6404080506
"stdev" ,  50.4433023163
"25%" ,  1
"median" ,  6
"75%" ,  17
len(linklist): 18826 4706 9413 14119
Streaming final output from /tmp/link_stat.vagrant.20170706.192040.163474/output...
0	0
Removing temp directory /tmp/link_stat.vagrant.20170706.192040.163474...


In [147]:
!python linkstat.py /home/vagrant/datacourse/mapreduce/miniprojects/part-000*.xml > linkstat_00.txt

Using configs in /home/vagrant/.mrjob.conf
Creating temp directory /tmp/linkstat.vagrant.20170706.044731.905371
Running step 1 of 1...
  texts = root and root.xpath('//text')
Streaming final output from /tmp/linkstat.vagrant.20170706.044731.905371/output...
Removing temp directory /tmp/linkstat.vagrant.20170706.044731.905371...


In [153]:
def link_stats_simple():
    return [
        ("count", 188408),
        ("mean", 15.6404080506),
        ("stdev", 50.4433023163),
        ("25%", 1),
        ("median", 6),
        ("75%", 17),
    ]

grader.score(question_name='mr__link_stats_simple', func=link_stats_simple)

Your score:  1.0


*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*