# DATASCI W261: Machine Learning at Scale

* **Sayantan Satpati**
* **sayantan.satpati@ischool.berkeley.edu**
* **W261**
* **Week-5**
* **Assignment-5**
* **Date of Submission: 07-OCT-2015**

#  === Week 5: mrjob, aws, and n-grams ===

## HW 5.0
---

***What is a data warehouse? What is a Star schema? When is it used?***


## HW 5.1
---

***In the database world, what is 3NF? Does machine learning use data in 3NF? If so, why?***

***In what form does ML consume data?***

***Why would one use log files that are denormalized?***

## HW 5.2
---

Using MRJob, implement a hashside join (memory-backed map-side) for left, 
right and inner joins. Run your code on the data used in HW 4.4. (Recall HW 4.4: Find the most frequent visitor of each page using mrjob and the output of 4.2, i.e., the transformed log file. In this output please include the webpage URL, webpageID and Visitor ID.)

Justify which table you chose as the Left table in this hashside join.

Please report the number of rows resulting from:

```
(1) Left joining Table Left with Table Right
(2) Right joining Table Left with Table Right
(3) Inner joining Table Left with Table Right
```

### Approach

***The file containing the URLs looks as follows:***

```
[cloudera@localhost wk5]$ head -5 ../wk4/url 
A,1287,1,"International AutoRoute","/autoroute"
A,1288,1,"library","/library"
A,1289,1,"Master Chef Product Information","/masterchef"
A,1297,1,"Central America","/centroam"
A,1215,1,"For Developers Only Info","/developer"
```

***The file containing page visits by customers looks as follows:***

```
V,1000,1,C,10001
V,1001,1,C,10001
V,1002,1,C,10001
V,1001,1,C,10002
V,1003,1,C,10002
```

* For **INNER** and **LEFT** joins, the url file is passed to the mrjob with the '--file' parameter and loaded by the reducers into a dict. The first pass counts page visits per customer and orders them by descending count; the second pass adds the URL details from the url file. For a LEFT join, a page visit is emitted with the URL 'NA' when the 'url' file has no entry for the page. For an INNER join, a page visit is emitted only if there is a matching URL. Since no URLs are missing from the 'url' file, the LEFT and INNER joins produce identical results of 98654 rows. The join type is passed to mrjob as a parameter.
* For a **RIGHT** join, we must show every URL present in the 'url' file, even if it has never been visited. Here the page visits file is passed with the '--file' parameter, and the url file is the input to the mrjob. Since some URLs have not been visited, the output has 98663 rows, more than the previous two outputs.
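
The join semantics described above can be sketched without Hadoop. This is a minimal in-memory illustration, not the job's code: `hash_join` is a hypothetical helper, and the toy rows reuse a few page IDs and URLs that appear in this notebook plus one made-up page ID (`9999`) to show an unmatched visit.

```python
def hash_join(left_rows, right_rows, how="inner"):
    """Join (key, value) rows; the right table is the in-memory hash side."""
    # Load the right table into a dict, as a reducer_init/mapper_init would.
    right = {}
    for key, value in right_rows:
        right.setdefault(key, []).append(value)

    matched = set()
    out = []
    # Stream the left table past the dict.
    for key, lvalue in left_rows:
        if key in right:
            matched.add(key)
            for rvalue in right[key]:
                out.append((key, lvalue, rvalue))
        elif how == "left":
            out.append((key, lvalue, None))  # keep unmatched left rows

    if how == "right":
        # Keep right rows whose key never appeared on the left.
        for key, values in right.items():
            if key not in matched:
                for rvalue in values:
                    out.append((key, None, rvalue))
    return out


# Left table: (page ID, visitor ID). "9999" is a made-up page with no URL.
visits = [("1000", "42411"), ("1215", "40224"), ("1215", "28813"), ("9999", "10001")]
# Right table: (page ID, URL).
urls = [("1000", "/regwiz"), ("1215", "/developer"), ("1287", "/autoroute")]

for how in ("left", "right", "inner"):
    print("{0}: {1} rows".format(how, len(hash_join(visits, urls, how))))
```

Only the smaller table needs to fit in memory; the other side is streamed, which is why the choice of hash side matters.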

In [14]:
%%writefile mrjob_join_hw52.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.compat import get_jobconf_value
import sys


class MRFrequentVisitor(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer),
             MRStep(mapper=self.mapper_frequent_visitor,
                   reducer_init=self.reducer_frequent_visitor_init,
                   reducer=self.reducer_frequent_visitor)
        ]

    def mapper(self, _, line):
        tokens = line.strip().split(",")
        key = "{0},{1}".format(tokens[1],tokens[4])
        yield key, 1

    def combiner(self, key, counts):
        yield key, sum(counts)

    def reducer(self, key, counts):
        yield key, sum(counts)
        
    # 2nd Pass
    
    def mapper_frequent_visitor(self, key, value):
        tokens = key.strip().split(",")
        modified_key = "{0},{1}".format(tokens[0],value)
        yield modified_key, tokens[1]
     
    
    def reducer_frequent_visitor_init(self):
        # Reads the 'url' file into a Dict for displaying additional information
        self.last_page = None
        self.pageDict = {}
        with open('url','r') as f:
            for line in f:
                tokens = line.strip().split(",")
                self.pageDict[tokens[1]] = tokens[4]
                
    def reducer_frequent_visitor(self, key, values):
        join_type = get_jobconf_value('join_type')
        sys.stderr.write('### JOIN TYPE: {0}'.format(join_type))
        tokens = key.strip().split(",")
        page = tokens[0]
        visits = int(tokens[1])
        
        if self.last_page != page:
            self.last_page = page
            # values might be a list, if there is a tie for same key => (p1, 1000), [v1,v2,v3..]
            
            # Emit even if page url is not present
            if join_type == 'left' or join_type == 'right':
                page_url = self.pageDict.get(page, 'NA').replace("\"","")
                for value in values:
                    k = '{0},{1}'.format(page, 
                                         page_url)
                    v = '{0},{1}'.format(visits,
                                        value)
                    yield k,v
                    
            # Emit only if page url is present
            if join_type == 'inner':
                page_url = self.pageDict.get(page, 'NA').replace("\"","")
                if page_url != 'NA':
                    for value in values:
                        k = '{0},{1}'.format(page, 
                                             page_url)
                        v = '{0},{1}'.format(visits,
                                            value)
                        yield k,v

if __name__ == '__main__':
    MRFrequentVisitor.run()

Overwriting mrjob_join_hw52.py


In [15]:
!chmod a+x mrjob_join_hw52.py

In [30]:
%%writefile mrjob_join_right_hw52.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.compat import get_jobconf_value
import sys


class MRFrequentVisitor1(MRJob):
    def steps(self):
        return [
            MRStep(mapper_init=self.mapper_init,
                   mapper=self.mapper)
        ]

    def mapper_init(self):
        self.pageVisitsDict = {}
        with open('anonymous-msweb.data.pp','r') as f:
            for line in f:
                tokens = line.strip().split(",")
                page = tokens[1]
                visitor = tokens[4]
                if page not in self.pageVisitsDict:
                    self.pageVisitsDict[page] = {}
                if visitor not in self.pageVisitsDict[page]:
                    self.pageVisitsDict[page][visitor] = 0
                self.pageVisitsDict[page][visitor] += 1
            
    
    def mapper(self, _, line):
        tokens = line.strip().split(",")
        page = tokens[1]
        page_url = tokens[4].replace("\"","")
        if page in self.pageVisitsDict:
            for k,v in self.pageVisitsDict[page].iteritems():
                
                key = '{0},{1}'.format(page, 
                                    page_url)
                value = '{0},{1}'.format(v,
                                     k)
                yield (key,value)
        else:
            k = '{0},{1}'.format(page, 
                                page_url)
            v = '{0},{1}'.format(0,
                                 'NA')
            yield (k,v)

if __name__ == '__main__':
    MRFrequentVisitor1.run()

Overwriting mrjob_join_right_hw52.py


In [31]:
!chmod a+x mrjob_join_right_hw52.py

In [32]:
# Running mrjob using a Hadoop Runner in local cluster
from mrjob_join_hw52 import MRFrequentVisitor
from mrjob_join_right_hw52 import MRFrequentVisitor1
import os

# Passing Hadoop Streaming parameters to:
# partition by leftmost part of composite key
# secondary sort by rightmost part of the same composite key

join_types = ['left', 'right', 'inner']

# Run for each Join Type
for jt in join_types:
    mr_job = None
    if jt == 'left' or jt == 'inner':
        mr_job = MRFrequentVisitor(args=['-r', 'hadoop', 
                                         '--hadoop-home', '/usr/lib/hadoop-0.20-mapreduce',
                                         '--hadoop-bin', '/usr/bin/hadoop',
                                         '--file', '../wk4/url',
                                         '--jobconf', 'join_type={0}'.format(jt),
                                         '--jobconf', 'stream.num.map.output.key.fields=2',
                                         '--jobconf', 'map.output.key.field.separator=,',
                                         '--jobconf', 'mapred.text.key.partitioner.options=-k1,1',
                                         '--jobconf', 'mapred.text.key.comparator.options=-k1,1 -k2,2nr',
                                         '--jobconf', 'mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                                         '--partitioner', 'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner',
                                         '../wk4/anonymous-msweb.data.pp', '-v', '--no-strict-protocol'])
    else:
        mr_job = MRFrequentVisitor1(args=['-r', 'hadoop', 
                                         '--hadoop-home', '/usr/lib/hadoop-0.20-mapreduce',
                                         '--hadoop-bin', '/usr/bin/hadoop',
                                         '--file', '../wk4/anonymous-msweb.data.pp',
                                         '../wk4/url', '-v', '--no-strict-protocol'])


    output_file = "output_hw52.txt"
    try:
        os.remove(output_file)
    except OSError:
        pass

    with mr_job.make_runner() as runner, open(output_file, 'a') as f: 
        runner.run()
        # stream_output: get access of the output 
        for line in runner.stream_output():
            #print mr_job.parse_output_line(line)
            f.write(line)

    print '### JOIN TYPE: {0}'.format(jt)
    !head -n 10 output_hw52.txt
    !wc -l output_hw52.txt

The have been translated as follows
 mapred.text.key.comparator.options: mapreduce.partition.keycomparator.options
map.output.key.field.separator: mapreduce.map.output.key.field.separator
mapred.output.key.comparator.class: mapreduce.job.output.key.comparator.class
mapred.text.key.partitioner.options: mapreduce.partition.keypartitioner.options


### JOIN TYPE: left
"1000,/regwiz"	"1,42411"
"1000,/regwiz"	"1,42381"
"1000,/regwiz"	"1,42320"
"1000,/regwiz"	"1,42291"
"1000,/regwiz"	"1,42285"
"1000,/regwiz"	"1,42260"
"1000,/regwiz"	"1,42213"
"1000,/regwiz"	"1,42198"
"1000,/regwiz"	"1,42176"
"1000,/regwiz"	"1,42160"
98654 output_hw52.txt
### JOIN TYPE: right
"1287,/autoroute"	"0,NA"
"1288,/library"	"0,NA"
"1289,/masterchef"	"0,NA"
"1297,/centroam"	"0,NA"
"1215,/developer"	"1,40224"
"1215,/developer"	"1,28813"
"1215,/developer"	"1,35353"
"1215,/developer"	"1,28058"
"1215,/developer"	"1,27881"
"1215,/developer"	"1,42447"
98663 output_hw52.txt


The have been translated as follows
 mapred.text.key.comparator.options: mapreduce.partition.keycomparator.options
map.output.key.field.separator: mapreduce.map.output.key.field.separator
mapred.output.key.comparator.class: mapreduce.job.output.key.comparator.class
mapred.text.key.partitioner.options: mapreduce.partition.keypartitioner.options


### JOIN TYPE: inner
"1000,/regwiz"	"1,42411"
"1000,/regwiz"	"1,42381"
"1000,/regwiz"	"1,42320"
"1000,/regwiz"	"1,42291"
"1000,/regwiz"	"1,42285"
"1000,/regwiz"	"1,42260"
"1000,/regwiz"	"1,42213"
"1000,/regwiz"	"1,42198"
"1000,/regwiz"	"1,42176"
"1000,/regwiz"	"1,42160"
98654 output_hw52.txt


## HW 5.3
---

For the remainder of this assignment you will work with a large subset 
of the Google n-grams dataset,

https://aws.amazon.com/datasets/google-books-ngrams/

which we have placed in a bucket on s3:

s3://filtered-5grams/

In particular, this bucket contains (~200) files in the format:

(ngram) \t (count) \t (pages_count) \t (books_count)

Do some EDA on this dataset using mrjob, e.g., 

- Longest 5-gram (number of characters)
- Top 10 most frequent words (count), i.e., unigrams
- Most/Least densely appearing words (count/pages_count) sorted in decreasing order of relative frequency (Hint: save to PART-000* and take the head -n 1000)
- Distribution of 5-gram sizes (counts) sorted in decreasing order of relative frequency. (Hint: save to PART-000* and take the head -n 1000)
OPTIONAL Question:
- Plot the log-log plot of the frequency distribution of unigrams. Does it follow a power law distribution?

For more background see:
https://en.wikipedia.org/wiki/Log%E2%80%93log_plot
https://en.wikipedia.org/wiki/Power_law
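
As a starting point for the EDA tasks, here is a minimal sketch of parsing one record in the format given above. `parse_ngram_line` is a hypothetical helper and the sample record's numbers are made up for illustration; only the tab-separated field order comes from the assignment text.

```python
def parse_ngram_line(line):
    """Split one tab-separated record into (words, count, pages, books)."""
    ngram, count, pages_count, books_count = line.rstrip("\n").split("\t")
    return ngram.split(), int(count), int(pages_count), int(books_count)


# Hypothetical record; the three numbers are invented for this example.
sample = "on the basis of the\t120\t100\t40"
words, count, pages_count, books_count = parse_ngram_line(sample)

# Length in characters, including the single spaces between words.
ngram_length = len(" ".join(words))
# Density as used in this assignment: count / pages_count.
density = count / float(pages_count)
print("{0} {1}".format(ngram_length, density))
```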

### Longest 5-gram (number of characters)

### Part (A)
---

In [12]:
%%writefile mrjob_hw53_1.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re


class LongestNgram(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_ngrams_len,
                   reducer=self.reducer_ngrams_len),
            MRStep(reducer=self.reducer_find_max_ngram)
        ]

    def mapper_ngrams_len(self, _, line):
        tokens = line.strip().split('\t')
        yield (tokens[0], len(tokens[0]))

  
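    def reducer_ngrams_len(self, word, counts):
        # every value for an n-gram is the same length, so take max rather than
        # sum in case the n-gram appears on more than one input line
        yield None, (max(counts), word)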

    # discard the key; it is just None
    def reducer_find_max_ngram(self, _, word_count_pairs):
        # each item of word_count_pairs is (length, ngram),
        # so yielding the max results in key=length, value=ngram
        yield max(word_count_pairs)


if __name__ == '__main__':
    LongestNgram.run()

Overwriting mrjob_hw53_1.py


In [13]:
!chmod a+x mrjob_hw53_1.py

In [16]:
%load_ext autoreload
%autoreload 2
from mrjob_hw53_1 import LongestNgram
mr_job = LongestNgram(args=['s3://filtered-5grams',
                            '-r', 'emr', '--no-strict-protocol'])
with mr_job.make_runner() as runner: 
    runner.run()
    # stream_output: get access of the output 
    for line in runner.stream_output():
        print mr_job.parse_output_line(line)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
(159, 'ROPLEZIMPREDASTRODONBRASLPKLSON YHROACLMPARCHEYXMMIOUDAVESAURUS PIOFPILOCOWERSURUASOGETSESNEGCP TYRAVOPSIFENGOQUAPIALLOBOSKENUO OWINFUYAIOKENECKSASXHYILPOYNUAT')


### Part (B)
---

In [26]:
%%writefile mrjob_hw53_2.py
#!/usr/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations

class FrequentUnigrams(MRJob):
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer),
             MRStep(mapper=self.mapper_frequent_unigrams,
                   reducer=self.reducer_frequent_unigrams,
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1rn',
                            }
                   )
        ]
    
    def mapper(self, _, line):
        # strip() returns a new string, so use its result directly
        tokens = line.strip().split("\t")
        unigrams = tokens[0].split()
        count = int(tokens[1])
        for unigram in unigrams:
            yield unigram, count
    
    def combiner(self, unigram, counts):
        yield unigram, sum(counts)
    
    def reducer(self, unigram, counts):
        yield unigram, sum(counts)
        
    def mapper_frequent_unigrams(self, unigram, count):
        yield count, unigram
        
    def reducer_frequent_unigrams(self, count, unigrams):
        for unigram in unigrams:
            yield count, unigram
    
if __name__ == '__main__':
    FrequentUnigrams.run()

Overwriting mrjob_hw53_2.py


In [27]:
!chmod a+x mrjob_hw53_2.py

In [31]:
!python mrjob_hw53_2.py -r emr \
 s3://filtered-5grams \
 --output-dir=s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words \
 --no-output \
 --no-strict-protocol

using configs in /Users/ssatpati/.mrjob.conf
using existing scratch bucket mrjob-d5ed1dc31babbe2c
using s3://mrjob-d5ed1dc31babbe2c/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/mrjob_hw53_2.ssatpati.20151003.062810.595234
writing master bootstrap script to /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/mrjob_hw53_2.ssatpati.20151003.062810.595234/b.py
Copying non-input files into s3://mrjob-d5ed1dc31babbe2c/tmp/mrjob_hw53_2.ssatpati.20151003.062810.595234/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-QGLC3PJPOLBE
Created new job flow j-QGLC3PJPOLBE
Job launched 30.4s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 61.3s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 91.8s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 122.8s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 153.1s ago

In [41]:
!rm -rf ./output;mkdir output;cd output;aws s3 cp --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words .;
!cat output/part-0000* | sort -nrk 1 | head -n 10
!rm -rf ./output

download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/_SUCCESS to ./_SUCCESS
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00003 to ./part-00003
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00000 to ./part-00000
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00002 to ./part-00002
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00004 to ./part-00004
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00006 to ./part-00006
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00001 to ./part-00001
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00005 to ./part-00005
5375699242	"the"
3691308874	"of"
2221164346	"to"
1387638591	"in"
1342195425	"a"
1135779433	"and"
798553959	"that"
756296656	"is"
688053106	"be"
481373389	"as"
sort: write failed: standard output: Broken pipe
sort: write error


### Top 10 Frequent Words are:

```
5375699242	"the"
3691308874	"of"
2221164346	"to"
1387638591	"in"
1342195425	"a"
1135779433	"and"
798553959	"that"
756296656	"is"
688053106	"be"
481373389	"as"
```

### Part (C)
---

In [34]:
%%writefile mrjob_hw53_3.py
#!/usr/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations

class DenseWords(MRJob):
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer,
                   ),
            MRStep(mapper=self.mapper_max_min,
                   reducer=self.reducer_max_min,
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1rn',
                            }
                   )
        ]
    
    def mapper(self, _, line):
        tokens = line.strip().split('\t')
        unigrams = tokens[0].split()
        density = round((int(tokens[1]) * 1.0 / int(tokens[2])), 3)
        for unigram in unigrams:
            yield unigram, density
            
    def combiner(self, unigram, densities):
        densities = [d for d in densities]
        yield unigram, min(densities) 
        yield unigram, max(densities)
        
    def reducer(self, unigram, densities):
        densities = [d for d in densities]
        yield unigram, min(densities)
        yield unigram, max(densities)
        
    def mapper_max_min(self, unigram, density):
        yield density, unigram
        
    def reducer_max_min(self, density, unigrams):
        for unigram in unigrams:
            yield density, unigram
    
if __name__ == '__main__':
    DenseWords.run()

Overwriting mrjob_hw53_3.py


In [35]:
!chmod a+x mrjob_hw53_3.py

In [36]:
!aws s3 rm --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/density/
!python mrjob_hw53_3.py -q -r emr \
 s3://filtered-5grams \
 --output-dir=s3://ucb-mids-mls-sayantan-satpati/hw53/density \
 --no-output \
 --no-strict-protocol

delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/_SUCCESS
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00001
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00000
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00002
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00004
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00003
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00005
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00006


In [96]:
!rm -rf ./output;mkdir output;cd output;aws s3 cp --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/density/ .;
!cat output/part-0000* | sort -nrk 1 > merged
!head -n 10 merged
!tail -n 10 merged
!rm -rf ./output

download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/_SUCCESS to ./_SUCCESS
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00003 to ./part-00003
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00004 to ./part-00004
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00005 to ./part-00005
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00006 to ./part-00006
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00000 to ./part-00000
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00001 to ./part-00001
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00007 to ./part-00007
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00002 to ./part-00002
29.704999999999998	"Death"
14.912000000000001	"write"
14.912000000000001	"and"
12.380000000000001	"NA"
11.557	"xxxx"
11.557	"xxxx"
10.882	"Sc"
10.6	"beep"
10.022	"blah"
9.8040000000000003	"whole"
1.0	"AAAS"
1.0	"AAAI"
1.0	"AAAE"
1.0	"AAAE"

### Top 10 Most Dense Words

```
29.704999999999998	"Death"
14.912000000000001	"write"
14.912000000000001	"and"
12.380000000000001	"NA"
11.557	"xxxx"
11.557	"xxxx"
10.882	"Sc"
10.6	"beep"
10.022	"blah"
9.8040000000000003	"whole"
```

### Top 10 Least Dense Words

```
1.0	"AAAS"
1.0	"AAAI"
1.0	"AAAE"
1.0	"AAAE"
1.0	"AAAA"
1.0	"AAAA"
1.0	"AAA"
1.0	"AA"
1.0	"A's"
1.0	"A"
```


### Part (D)
---

In [40]:
%%writefile mrjob_hw53_4.py
#!/usr/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations

class DistributionNgram(MRJob):
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer,
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1rn',
                            }
                   )
        ]
    
    def mapper(self, _, line):
        # strip() returns a new string, so use its result directly
        tokens = line.strip().split("\t")
        yield int(tokens[1]), tokens[0]
    
    def reducer(self, count, ngrams):
        for ngram in ngrams:
            yield count, ngram
    
if __name__ == '__main__':
    DistributionNgram.run()

Overwriting mrjob_hw53_4.py


In [41]:
!chmod a+x mrjob_hw53_4.py

In [16]:
!aws s3 rm --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/
!python mrjob_hw53_4.py -q -r emr \
 s3://filtered-5grams/ \
 --output-dir=s3://ucb-mids-mls-sayantan-satpati/hw53/distribution \
 --no-output \
 --no-strict-protocol

delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/_SUCCESS
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00001
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00006
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00000
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00003
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00002
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00004
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00005


In [19]:
!rm -rf ./output;mkdir output;cd output;aws s3 cp --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/ .;
!cat output/part-0000* | sort -nrk 1 > output_hw53_4.txt
!rm -rf ./output
!head -n 10 output_hw53_4.txt

download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/_SUCCESS to ./_SUCCESS
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00000 to ./part-00000
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00002 to ./part-00002
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00001 to ./part-00001
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00004 to ./part-00004
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00003 to ./part-00003
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00005 to ./part-00005
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00006 to ./part-00006
sort: write failed: standard output: Broken pipe
sort: write error
cat: stdout: Broken pipe
2223343	"on the basis of the"
1462063	"at the head of the"
1316960	"as well as in the"
871158	"at the same time the"
833764	"as a matter of fact"
590885	"as a part of the"
544783	"as one of th

### Top 10 NGram Distribution

```
2223343	"on the basis of the"
1462063	"at the head of the"
1316960	"as well as in the"
871158	"at the same time the"
833764	"as a matter of fact"
590885	"as a part of the"
544783	"as one of the most"
533464	"and at the end of"
447364	"I do not think that"
441568	"from the beginning of the"
```

## HW 5.4
---

In this part of the assignment we will focus on developing methods
for detecting synonyms, using the Google 5-grams dataset. To accomplish
this you must script two main tasks using MRJob:

(1) Build stripes of word co-occurrence for the top 10,000
most frequently appearing words across the entire set of 5-grams,
and output to a file in your bucket on s3 (bigram analysis, though the words are non-contiguous).

(2) Using two (symmetric) comparison methods of your choice 
(e.g., correlations, distances, similarities), pairwise compare 
all stripes (vectors), and output to a file in your bucket on s3.

==Design notes for (1)==
For this task you will be able to modify the pattern we used in HW 3.2
(feel free to use the solution as reference). To total the word counts 
across the 5-grams, output the support from the mappers using the total 
order inversion pattern:

<*word,count>

to ensure that the support arrives before the cooccurrences.

In addition to ensuring the determination of the total word counts,
the mapper must also output co-occurrence counts for the pairs of
words inside of each 5-gram. Treat these words as a basket,
as we have in HW 3, but count all stripes or pairs in both orders,
i.e., count both orderings: (word1,word2), and (word2,word1), to preserve
symmetry in our output for (2).
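
The mapper output described in these design notes can be sketched as a plain generator. This is an illustration under stated assumptions, not the submitted job: `emit_stripes` and `vocabulary` are hypothetical names, and weighting each co-occurrence by the 5-gram's count is a choice made for this sketch.

```python
from collections import Counter


def emit_stripes(ngram, count, vocabulary):
    """Yield ('*word', count) support records and (word, stripe) records."""
    # Treat the 5-gram as a basket restricted to the frequent words.
    words = [w for w in ngram.split() if w in vocabulary]
    for i, word in enumerate(words):
        # Support record for order inversion: '*' sorts before letters
        # and digits in byte order, so totals can arrive first.
        yield "*" + word, count
        stripe = Counter()
        for j, other in enumerate(words):
            if i != j:  # every other position, so both orderings are counted
                stripe[other] += count
        yield word, dict(stripe)


# Hypothetical vocabulary and count, for illustration only.
vocabulary = {"the", "of", "basis"}
records = list(emit_stripes("on the basis of the", 2, vocabulary))
for key, value in records:
    print("{0}\t{1}".format(key, value))
```

Because each word emits a stripe over all other positions in the basket, the pair (word1, word2) and the pair (word2, word1) receive the same total, which keeps the co-occurrence matrix symmetric.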

==Design notes for (2)==
For this task you will have to determine a method of comparison.
Here are a few that you might consider:

- Spearman correlation
- Euclidean distance
- Taxicab (Manhattan) distance
- Shortest path graph distance (a graph, because our data is symmetric!)
- Pearson correlation
- Cosine similarity
- Kendall correlation
...
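
Two of the measures listed above, cosine similarity and Euclidean distance, can be sketched over sparse stripes stored as dicts. The function names and the toy stripes are assumptions for illustration; missing keys are treated as zero counts.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty stripes
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance; keys absent from a stripe count as 0."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))


# Toy stripes with made-up counts.
stripe_a = {"x": 3, "y": 4}
stripe_b = {"x": 3, "y": 4}
stripe_c = {"z": 5}

print(cosine_similarity(stripe_a, stripe_b))
print(euclidean_distance(stripe_a, stripe_c))
```

Both functions are symmetric in their arguments, as the task requires.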

### HW 5.4: Part 1 > Step 1:

***Find the frequent unigrams for the top 10,000 most frequently appearing words across the entire set of 5-grams***

* Use the frequency count output from the earlier step (HW 5.3, Part B) to get the top 10K most frequent words
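
The same top-k selection can be sketched in Python instead of `sort | head`. This is a minimal sketch: `top_k_unigrams` is a hypothetical helper, and it assumes each input line has the form `count<TAB>"word"` as in the frequent-words output shown earlier.

```python
import heapq


def top_k_unigrams(lines, k):
    """Return the k (count, word) pairs with the largest counts."""
    def parse(line):
        count, word = line.rstrip("\n").split("\t")
        return int(count), word.strip('"')
    # nlargest keeps only k items in memory while scanning the input.
    return heapq.nlargest(k, (parse(line) for line in lines))


# Three lines taken from the frequent-words output above, in shuffled order.
lines = ['481373389\t"as"', '5375699242\t"the"', '3691308874\t"of"']
for count, word in top_k_unigrams(lines, 2):
    print("{0}\t{1}".format(count, word))
```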

In [109]:
!rm -rf ./output;mkdir output;cd output;aws s3 cp --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words .;
!cat output/part-0000* | sort -nrk 1 | head -n 10000 > output/frequent_unigrams.txt
!head -n 10 output/frequent_unigrams.txt

download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/_SUCCESS to ./_SUCCESS
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00006 to ./part-00006
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00002 to ./part-00002
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00000 to ./part-00000
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00005 to ./part-00005
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00003 to ./part-00003
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00004 to ./part-00004
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00001 to ./part-00001
sort: write failed: standard output: Broken pipe
sort: write error
5375699242	"the"
3691308874	"of"
2221164346	"to"
1387638591	"in"
1342195425	"a"
1135779433	"and"
798553959	"that"
756296656	"is"
688053106	"be"
481373389	"as"


### HW 5.4: Part 1 > Step 2:

***Build stripes of word co-occurrence for the top 10,000
most frequently appearing words across the entire set of 5-grams***

* Pass the frequent unigrams generated in the earlier step to the mappers for filtering
* The reducers sum the stripes and emit the word co-occurrence counts; the optional support-count filter of 10,000 is left commented out in the code, since it is not a requirement

In [117]:
%%writefile word_cooccurrences_set_hw54.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
import sys


class WordCoOccurrenceFrequentSet(MRJob):

    def steps(self):
        return [
            MRStep(mapper_init=self.mapper_init,
                   mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer)
        ]
    
    def mapper_init(self):
        # Load the file into memory
        self.unigrams = {}
        with open('frequent_unigrams.txt', 'r') as f:
            for line in f:
                tokens = line.strip().split('\t')
                self.unigrams[tokens[1].replace("\"","")] = int(tokens[0])
        # Log the number of unigrams loaded, not the whole dict
        sys.stderr.write('### of unigrams: {0}\n'.format(len(self.unigrams)))

    def mapper(self, _, line):
        tokens = line.strip().split('\t')
        # Words of the 5-gram
        words = tokens[0].split()
        # Keep only the words that are in the frequent unigram dict
        # (membership test on the dict itself avoids building a key list)
        words = [w for w in words if w in self.unigrams]
        l = len(words)
        for i in xrange(l):
            d = {}
            for j in xrange(l):
                if i != j:
                    d[words[j]] = d.get(words[j], 0) + 1
            # Emit word, stripe
            yield words[i],d

    def combiner(self, word, stripes):
        d = {}
        # Aggregate stripes
        for s in stripes:
            for k, v in s.iteritems():
                d[k] = d.get(k, 0) + v
        yield word,d
        
    def reducer(self, word, stripes):
        d = {}
        # Aggregate stripes
        for s in stripes:
            for k, v in s.iteritems():
                d[k] = d.get(k, 0) + v
        '''
        # Filter based on support count (Not a requirement!)
        d_final = {}
        for k,v in d.iteritems():
            if v >= 10000:
                d_final[k] = v
        '''
        # Combine stripes
        yield word,d

if __name__ == '__main__':
    WordCoOccurrenceFrequentSet.run()

Overwriting word_cooccurrences_set_hw54.py


In [118]:
!chmod a+x word_cooccurrences_set_hw54.py

In [119]:
!aws s3 rm --recursive s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/
!python word_cooccurrences_set_hw54.py -q -r emr \
 s3://filtered-5grams/ \
 --file output/frequent_unigrams.txt \
 --output-dir=s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur \
 --no-output \
 --no-strict-protocol

#### HW 5.4 Part 2 > Step 1

***From the output of Part 1 > Step 2, create a single file with all the stripes - frequent itemsets of size 2***

In [133]:
!rm -rf ./output;mkdir output;cd output;aws s3 cp --recursive s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/ .;
!cat output/part-000* | sort -k 1 > output/frequent_stripes.txt
!wc -l output/frequent_stripes.txt

download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/_SUCCESS to ./_SUCCESS
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00000 to ./part-00000
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00004 to ./part-00004
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00002 to ./part-00002
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00001 to ./part-00001
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00006 to ./part-00006
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00005 to ./part-00005
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00007 to ./part-00007
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00008 to ./part-00008
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00009 to ./part-00009
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00010 to ./part-00010
download: s3:/

In [15]:
%%writefile dist_calc_hw54.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.compat import get_jobconf_value
import re
import sys
import ast
import urllib2
import math


class DistanceCalc(MRJob):

    def steps(self):
        return [
            MRStep(mapper_init=self.mapper_init,
                   mapper=self.mapper,
                   jobconf={
                       'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                       'mapred.text.key.comparator.options': '-k1,1n',
                   })
        ]
    
    def mapper_init(self):
        # Load the file into memory
        self.counter = 0
        self.stripes = {}
        '''
        f = urllib2.urlopen("https://s3-us-west-2.amazonaws.com/ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/frequent_stripes.txt")
        for line in f.readlines():
            tokens = line.strip().split('\t')
            self.stripes[tokens[0].replace("\"","")] = ast.literal_eval(tokens[1])
            self.increment_counter('distance', 'num_stripes_loaded', amount=1)
        '''
        with open('frequent_stripes.txt','r') as f:
            for line in f:
                tokens = line.strip().split('\t')
                self.stripes[tokens[0].replace("\"","")] = ast.literal_eval(tokens[1])
                self.increment_counter('distance', 'num_stripes_loaded', amount=1)
      
        sys.stderr.write('### of stripes: {0}\n'.format(len(self.stripes)))

    def mapper(self, _, line):
        dist_type = get_jobconf_value('dist_type')
        tokens = line.strip().split('\t')
        key = tokens[0].replace("\"","")
        dict_pairs = ast.literal_eval(tokens[1])
        s1 = set(dict_pairs.keys())
        for n_key, n_dict_pairs in self.stripes.iteritems():
            # Compute each unordered pair once: skip n_key when it sorts before key
            if key > n_key:
                continue
            
            self.counter += 1   
            if self.counter % 1000 == 0:
                self.set_status('# of Distances Calculated: {0}'.format(self.counter))
                
            s2 = set(n_dict_pairs.keys())
            distance = None
            
            if dist_type == 'euclid':

                # Calculate Euclidean Distance
                # Get the union of keys from both stripes
                union_keys = s1.union(s2)

                squared_distance = 0
                for k in union_keys:
                    squared_distance += (dict_pairs.get(k, 0) - n_dict_pairs.get(k, 0)) ** 2
                    
                distance = math.sqrt(squared_distance)
                
            if dist_type == 'cosine':
           
                # Calculate cosine similarity (dot product over the product of
                # the norms); larger values mean more similar stripes
                # Get the intersection of keys from both stripes
                intersection_keys = s1.intersection(s2)

                dot_x_y = 0
                for k in intersection_keys:
                    dot_x_y += dict_pairs[k] * n_dict_pairs[k]

                norm_x = 0
                for k in s1:
                    norm_x += dict_pairs[k] * dict_pairs[k]

                norm_y = 0
                for k in s2:
                    norm_y += n_dict_pairs[k] * n_dict_pairs[k]

                distance = float(dot_x_y) / (math.sqrt(norm_x) * math.sqrt(norm_y))
                    
          
            self.increment_counter('distance', 'num_{0}_distances'.format(dist_type), amount=1)
            yield (distance), (key, n_key)


if __name__ == '__main__':
    DistanceCalc.run()

Overwriting dist_calc_hw54.py
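
***A small standalone sketch of the two measures computed in the mapper above, applied to sparse stripes stored as dicts. The helper names and the toy stripes `a` and `b` are hypothetical. Note that the cosine value is a similarity (1.0 for stripes pointing in the same direction), and the zero-norm guard is an addition that the job itself does not have.***

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance over the union of keys; missing keys count as 0."""
    keys = set(x) | set(y)
    return math.sqrt(sum((x.get(k, 0) - y.get(k, 0)) ** 2 for k in keys))


def cosine_similarity(x, y):
    """Dot product over shared keys divided by the product of the norms."""
    dot = sum(x[k] * y[k] for k in set(x) & set(y))
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: treat an empty stripe as dissimilar
    return dot / (norm_x * norm_y)


# Hypothetical stripes sharing only the key "y".
a = {"x": 3, "y": 4}
b = {"y": 4, "z": 3}
print(euclidean_distance(a, b))  # sqrt(3^2 + 0^2 + 3^2)
print(cosine_similarity(a, b))   # 16 / (5 * 5)
```
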


In [16]:
!chmod a+x dist_calc_hw54.py

### Distance Calculation

***The distances were calculated on a local machine using a smaller dataset. I tried several times on EMR, but every time the job failed with errors that were not helpful in debugging the problem; most likely the jobs ran out of memory. Based on my online research, there appear to be better ways to parallelize the distance calculations (cosine, Jaccard, Euclidean, Manhattan, etc.), but I was unable to explore them due to time constraints.***
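
***One alternative that avoids loading every stripe into each mapper is an inverted-index formulation. This is not what was run for this assignment. The sketch below is a single-process illustration under that assumption, and the function name `pairwise_dot_products` and the toy stripes are hypothetical. Each co-occurring word acts as a key whose posting list holds the (word, count) entries that contain it; every posting list then contributes partial products to the dot product of each word pair, which is the numerator of the cosine similarity.***

```python
from collections import defaultdict
from itertools import combinations


def pairwise_dot_products(stripes):
    """Dot products for all word pairs that share at least one stripe key."""
    # "Map" phase: invert word -> {feature: count} into feature -> postings.
    postings = defaultdict(list)
    for word, stripe in stripes.items():
        for feature, count in stripe.items():
            postings[feature].append((word, count))

    # "Reduce" phase: each posting list yields partial products per pair.
    dots = defaultdict(int)
    for plist in postings.values():
        for (w1, c1), (w2, c2) in combinations(sorted(plist), 2):
            dots[(w1, w2)] += c1 * c2
    return dict(dots)


# Hypothetical stripes: "b" and "c" share no key, so no pair is produced.
toy_stripes = {
    "a": {"x": 1, "y": 2},
    "b": {"y": 3},
    "c": {"x": 4, "z": 5},
}
dot_products = pairwise_dot_products(toy_stripes)
print(dot_products)
```

***In a MapReduce setting the two phases would be separate steps keyed by feature and then by word pair; pairs with no shared key never appear, which is where the savings over the all-pairs loop would come from.***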

In [21]:
!rm output/euclid_dist.txt
!time python dist_calc_hw54.py -r local \
--file output/frequent_stripes.txt \
--jobconf 'dist_type=euclid' \
--no-strict-protocol \
output/test_stripes.txt > output/euclid_dist.txt

using configs in /Users/ssatpati/.mrjob.conf
creating tmp directory /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.042953.238385
writing wrapper script to /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.042953.238385/setup-wrapper.sh
writing to /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.042953.238385/step-0-mapper_part-00000
> sh -ex setup-wrapper.sh /Users/ssatpati/anaconda/bin/python dist_calc_hw54.py --step-num=0 --mapper --no-strict-protocols /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.042953.238385/input_part-00000 > /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.042953.238385/step-0-mapper_part-00000
STDERR: + __mrjob_PWD=/private/var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.042953.238385/job_local_dir/0/mapper/0
STDERR: + exec
STDERR: + /Users/ssatpati/anaconda/bin/

In [22]:
!head -10 output/euclid_dist.txt

162651.75911437295	["life", "writings"]
170813.8704028452	["life", "stain"]
166962.7704968985	["life", "lord"]
169735.15824071335	["life", "sinking"]
167200.72430166084	["life", "regional"]
167569.63904001226	["life", "yellow"]
169539.30474966564	["life", "uncertain"]
169395.3866668157	["life", "prize"]
169005.3579387352	["life", "wooden"]
170999.37683512183	["life", "persisted"]


In [23]:
!rm output/cosine_dist.txt
!time python dist_calc_hw54.py -r local \
--file output/frequent_stripes.txt \
--jobconf 'dist_type=cosine' \
--no-strict-protocol \
output/test_stripes.txt > output/cosine_dist.txt

using configs in /Users/ssatpati/.mrjob.conf
creating tmp directory /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.043153.014211
writing wrapper script to /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.043153.014211/setup-wrapper.sh
writing to /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.043153.014211/step-0-mapper_part-00000
> sh -ex setup-wrapper.sh /Users/ssatpati/anaconda/bin/python dist_calc_hw54.py --step-num=0 --mapper --no-strict-protocols /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.043153.014211/input_part-00000 > /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.043153.014211/step-0-mapper_part-00000
STDERR: + __mrjob_PWD=/private/var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/dist_calc_hw54.ssatpati.20151006.043153.014211/job_local_dir/0/mapper/0
STDERR: + exec
STDERR: + /Users/ssatpati/anaconda/bin/

In [24]:
!head -n 10 output/cosine_dist.txt

0.9421374042513717	["life", "writings"]
0.8313871899681543	["life", "stain"]
0.91983668656369	["life", "lord"]
0.8690261634238905	["life", "sinking"]
0.914862119190327	["life", "regional"]
0.8862305158599959	["life", "yellow"]
0.8287440348705982	["life", "uncertain"]
0.8154294013145477	["life", "prize"]
0.7597525475632223	["life", "wooden"]
0.6218562101579173	["life", "persisted"]


## HW 5.5
---

In this part of the assignment you will evaluate the success of your synonym detector.
Take the top 1,000 closest/most similar/correlative pairs of words as determined
by your measure in (2), and use the synonyms function in the accompanying
python code:

nltk_synonyms.py

Note: This will require installing the python nltk package:

http://www.nltk.org/install.html

and downloading its data with nltk.download().

For each (word1,word2) pair, check to see if word1 is in the list, 
synonyms(word2), and vice-versa. If one of the two is a synonym of the other, 
then consider this pair a 'hit', and then report the precision, recall, and F1 measure  of 
your detector across your 1,000 best guesses. Report the macro averages of these measures.
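
***A hedged sketch of how the hit counting and the three measures could be computed. The function `evaluate_pairs`, the toy synonym table, and the toy pairs are hypothetical, and the synonym lookup is passed in as a function so that the `synonyms` function from `nltk_synonyms.py` could be substituted. The recall definition (hits among the guesses divided by the synonym pairs among all candidate pairs) is an assumption, since the prompt does not spell it out.***

```python
def evaluate_pairs(predicted_pairs, candidate_pairs, synonyms):
    """Return (precision, recall, f1) for the predicted pairs."""
    def is_hit(w1, w2):
        # A pair is a hit if either word is listed as a synonym of the other.
        return w1 in synonyms(w2) or w2 in synonyms(w1)

    hits = sum(1 for w1, w2 in predicted_pairs if is_hit(w1, w2))
    relevant = sum(1 for w1, w2 in candidate_pairs if is_hit(w1, w2))

    precision = hits / float(len(predicted_pairs)) if predicted_pairs else 0.0
    recall = hits / float(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical synonym table standing in for WordNet.
toy_table = {"life": {"living", "lifetime"}, "living": {"life"}, "lord": {"master"}}


def toy_synonyms(word):
    return toy_table.get(word, set())


candidates = [("life", "living"), ("life", "lord"),
              ("life", "lifetime"), ("lord", "master")]
predicted = [("life", "living"), ("life", "lord")]
precision, recall, f1 = evaluate_pairs(predicted, candidates, toy_synonyms)
print(precision, recall, f1)
```
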

In [30]:
!cat output/cosine_dist.txt | sort -k 1 > output/cosine_dist_sorted.txt
!head -n 1000 output/cosine_dist_sorted.txt > output/cosine_dist_top_1000.txt
!head -n 10 output/cosine_dist_top_1000.txt

0.0636379482385911	["life", "sur"]
0.07529117883951839	["life", "los"]
0.10126889679347573	["life", "que"]
0.13826129238715115	["life", "und"]
0.2207153823724399	["life", "wilt"]
0.22714412269531603	["life", "singled"]
0.2581920664977755	["life", "subdivided"]
0.26553826954428433	["life", "wasn"]
0.29670000814305053	["life", "shalt"]
0.31510580174581476	["life", "refrain"]


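### Unfortunately, none of the top 1,000 pairs from my list matched the synonyms from NLTK WordNet

***Note: the score written for `dist_type=cosine` is a cosine similarity, so the ascending sort above selects the 1,000 least similar pairs rather than the closest ones, which likely contributes to the lack of matches.***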
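
***A minimal sketch of selecting the most similar pairs by taking the largest scores numerically instead of sorting the text file. The helper `top_k_pairs` is hypothetical; it assumes the tab-separated `score<TAB>["word1", "word2"]` layout of `output/cosine_dist.txt`, and the three sample lines are copied from the output shown earlier.***

```python
import ast
import heapq


def top_k_pairs(lines, k):
    """Return the k (score, pair) tuples with the highest scores."""
    scored = []
    for line in lines:
        score, pair = line.rstrip("\n").split("\t")
        scored.append((float(score), tuple(ast.literal_eval(pair))))
    return heapq.nlargest(k, scored)


sample = [
    '0.9421374042513717\t["life", "writings"]',
    '0.6218562101579173\t["life", "persisted"]',
    '0.91983668656369\t["life", "lord"]',
]
top2 = top_k_pairs(sample, 2)
for score, pair in top2:
    print(score, pair)
```
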

In [39]:
import nltk
from nltk.corpus import wordnet as wn
import sys
import ast
# Return the lemma names of every WordNet synset of a word
def synonyms(string):
    syndict = {}
    for synset in wn.synsets(string):
        for syn in synset.lemma_names():
            syndict.setdefault(syn, 1)
    return syndict.keys()

# Find out synonyms for life
syn = synonyms("life")
for s in syn:
    print s
    
# Check if any of the top 1000 matches the synonym list
with open('output/cosine_dist_top_1000.txt', 'r') as f:
    for line in f:
        t = line.strip().split('\t')
        pair = ast.literal_eval(t[1])
        if pair[1] in syn:
            print 'Synonym Matched: {0}'.format(pair[1])

living
life
sprightliness
lifespan
spirit
liveliness
animation
life_story
life_sentence
lifetime
aliveness
life-time
biography
life_history
