# DATASCI W261: Machine Learning at Scale

* **Sayantan Satpati**
* **sayantan.satpati@ischool.berkeley.edu**
* **W261**
* **Week-5**
* **Assignment-5**
* **Date of Submission: 13-OCT-2015**

#  === Week 5: mrjob, aws, and n-grams ===

## HW 5.3
---

For the remainder of this assignment you will work with a large subset 
of the Google n-grams dataset,

https://aws.amazon.com/datasets/google-books-ngrams/

which we have placed in a bucket on s3:

s3://filtered-5grams/

In particular, this bucket contains (~200) files in the format:

(ngram) \t (count) \t (pages_count) \t (books_count)
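The record layout above can be turned into typed fields with a few lines of Python. This is a minimal sketch: the `NgramRecord` name and the sample line are hypothetical, chosen only to illustrate the four tab-separated columns, and the code is written to run on Python 2.7 or 3.

```python
from collections import namedtuple

# Hypothetical record type; the field names follow the format line above.
NgramRecord = namedtuple(
    "NgramRecord", ["ngram", "count", "pages_count", "books_count"]
)


def parse_ngram_line(line):
    """Split one tab-separated record into typed fields."""
    ngram, count, pages_count, books_count = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(count), int(pages_count), int(books_count))


# Made-up sample line, not taken from the dataset.
record = parse_ngram_line("a b c d e\t10\t5\t2\n")
words = record.ngram.split()
```

Parsing once into named fields keeps later mappers from relying on bare indices such as `tokens[1]`.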

Do some EDA on this dataset using mrjob, e.g., 

- Longest 5-gram (number of characters)
- Top 10 most frequent words (count), i.e., unigrams
- Most/least densely appearing words (count/pages_count), sorted in decreasing order of relative frequency (Hint: save to PART-000* and take the head -n 1000)
- Distribution of 5-gram sizes (counts), sorted in decreasing order of relative frequency (Hint: save to PART-000* and take the head -n 1000)

OPTIONAL Question:
- Plot the log-log plot of the frequency distribution of unigrams. Does it follow a power law distribution?

For more background see:
https://en.wikipedia.org/wiki/Log%E2%80%93log_plot
https://en.wikipedia.org/wiki/Power_law
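
One way to approach the optional question without a plotting library is to fit a straight line to log(frequency) against log(rank): a power law shows up as a roughly constant slope. The sketch below is an assumption about how one might check this, not part of the assignment solution; `loglog_slope` is a hypothetical helper and the data is synthetic rather than taken from the dataset.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic Zipf-like data: frequency = 1000 / rank.
synthetic = [1000.0 / rank for rank in range(1, 101)]
slope = loglog_slope(synthetic)
```

For this synthetic input the slope is -1 up to floating-point error. On real unigram counts, a slope that drifts noticeably between the head and the tail would argue against a clean power law.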

### Longest 5-gram (number of characters)

### Part (A)
---

***Refer to Chris's notebook***

### Part (B)
---

In [26]:
%%writefile mrjob_hw53_2.py
#!/usr/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations

class FrequentUnigrams(MRJob):
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer),
             MRStep(mapper=self.mapper_frequent_unigrams,
                   reducer=self.reducer_frequent_unigrams,
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1rn',
                            }
                   )
        ]
    
    def mapper(self, _, line):
        # Strip surrounding whitespace, then split the record on tabs
        tokens = line.strip().split('\t')
        unigrams = tokens[0].split()
        count = int(tokens[1])
        for unigram in unigrams:
            yield unigram, count
    
    def combiner(self, unigram, counts):
        yield unigram, sum(counts)
    
    def reducer(self, unigram, counts):
        yield unigram, sum(counts)
        
    def mapper_frequent_unigrams(self, unigram, count):
        yield count, unigram
        
    def reducer_frequent_unigrams(self, count, unigrams):
        for unigram in unigrams:
            yield count, unigram
    
if __name__ == '__main__':
    FrequentUnigrams.run()

Overwriting mrjob_hw53_2.py


In [27]:
!chmod a+x mrjob_hw53_2.py

In [31]:
!python mrjob_hw53_2.py -r emr \
 s3://filtered-5grams \
 --output-dir=s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words \
 --no-output \
 --no-strict-protocol

using configs in /Users/ssatpati/.mrjob.conf
using existing scratch bucket mrjob-d5ed1dc31babbe2c
using s3://mrjob-d5ed1dc31babbe2c/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/mrjob_hw53_2.ssatpati.20151003.062810.595234
writing master bootstrap script to /var/folders/h5/1q71m1c54cn07f16c232pqgm38ynd8/T/mrjob_hw53_2.ssatpati.20151003.062810.595234/b.py
Copying non-input files into s3://mrjob-d5ed1dc31babbe2c/tmp/mrjob_hw53_2.ssatpati.20151003.062810.595234/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-QGLC3PJPOLBE
Created new job flow j-QGLC3PJPOLBE
Job launched 30.4s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 61.3s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 91.8s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 122.8s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 153.1s ago

In [41]:
!rm -rf ./output;mkdir output;cd output;aws s3 cp --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words .;
!cat output/part-0000* | sort -nrk 1 | head -n 10
!rm -rf ./output

download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/_SUCCESS to ./_SUCCESS
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00003 to ./part-00003
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00000 to ./part-00000
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00002 to ./part-00002
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00004 to ./part-00004
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00006 to ./part-00006
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00001 to ./part-00001
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00005 to ./part-00005
5375699242	"the"
3691308874	"of"
2221164346	"to"
1387638591	"in"
1342195425	"a"
1135779433	"and"
798553959	"that"
756296656	"is"
688053106	"be"
481373389	"as"
sort: write failed: standard output: Broken pipe
sort: write error


### Top 10 Frequent Words are:

```
5375699242	"the"
3691308874	"of"
2221164346	"to"
1387638591	"in"
1342195425	"a"
1135779433	"and"
798553959	"that"
756296656	"is"
688053106	"be"
481373389	"as"
```
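
The second step's `jobconf` asks Hadoop to compare keys with `-k1,1rn`, which, like the Unix `sort` flags of the same shape, means field 1, numeric, reversed. Assuming the ordering is applied separately inside each reducer's partition, every part file is sorted but their concatenation is not, which would explain why the cell above pipes the merged files through `sort -nrk 1` again. A minimal sketch with made-up counts and two hypothetical partitions:

```python
def sort_partition(pairs):
    """Mimic -k1,1rn: order (count, word) pairs by count, largest first."""
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Made-up counts split across two hypothetical reducer partitions.
part_0 = sort_partition([(5, "b"), (9, "a")])
part_1 = sort_partition([(7, "c"), (2, "d")])

# Concatenating sorted partitions does not give a global order...
concatenated = part_0 + part_1
# ...so one more pass over the merged output is still needed.
merged = sort_partition(concatenated)
```

Forcing a single reducer, as the later `Top1KSynonyms` job does with `mapred.reduce.tasks`, is the other way to get one globally sorted file.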

### Part (C)
---

In [34]:
%%writefile mrjob_hw53_3.py
#!/usr/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations

class DenseWords(MRJob):
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer,
                   ),
            MRStep(mapper=self.mapper_max_min,
                   reducer=self.reducer_max_min,
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1rn',
                            }
                   )
        ]
    
    def mapper(self, _, line):
        tokens = line.strip().split('\t')
        unigrams = tokens[0].split()
        density = round((int(tokens[1]) * 1.0 / int(tokens[2])), 3)
        for unigram in unigrams:
            yield unigram, density
            
    def combiner(self, unigram, densities):
        densities = [d for d in densities]
        yield unigram, min(densities) 
        yield unigram, max(densities)
        
    def reducer(self, unigram, densities):
        densities = [d for d in densities]
        yield unigram, min(densities)
        yield unigram, max(densities)
        
    def mapper_max_min(self, unigram, density):
        yield density, unigram
        
    def reducer_max_min(self, density, unigrams):
        for unigram in unigrams:
            yield density, unigram
    
if __name__ == '__main__':
    DenseWords.run()

Overwriting mrjob_hw53_3.py


In [35]:
!chmod a+x mrjob_hw53_3.py

In [36]:
!aws s3 rm --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/density/
!python mrjob_hw53_3.py -q -r emr \
 s3://filtered-5grams \
 --output-dir=s3://ucb-mids-mls-sayantan-satpati/hw53/density \
 --no-output \
 --no-strict-protocol

delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/_SUCCESS
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00001
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00000
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00002
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00004
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00003
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00005
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00006


In [96]:
!rm -rf ./output;mkdir output;cd output;aws s3 cp --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/density/ .;
!cat output/part-0000* | sort -nrk 1 > merged
!head -n 10 merged
!tail -n 10 merged
!rm -rf ./output

download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/_SUCCESS to ./_SUCCESS
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00003 to ./part-00003
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00004 to ./part-00004
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00005 to ./part-00005
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00006 to ./part-00006
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00000 to ./part-00000
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00001 to ./part-00001
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00007 to ./part-00007
download: s3://ucb-mids-mls-sayantan-satpati/hw53/density/part-00002 to ./part-00002
29.704999999999998	"Death"
14.912000000000001	"write"
14.912000000000001	"and"
12.380000000000001	"NA"
11.557	"xxxx"
11.557	"xxxx"
10.882	"Sc"
10.6	"beep"
10.022	"blah"
9.8040000000000003	"whole"
1.0	"AAAS"
1.0	"AAAI"
1.0	"AAAE"
1.0	"AAAE"

### Top 10 Most Dense Words

```
29.704999999999998	"Death"
14.912000000000001	"write"
14.912000000000001	"and"
12.380000000000001	"NA"
11.557	"xxxx"
11.557	"xxxx"
10.882	"Sc"
10.6	"beep"
10.022	"blah"
9.8040000000000003	"whole"
```

### Top 10 Least Dense Words

```
1.0	"AAAS"
1.0	"AAAI"
1.0	"AAAE"
1.0	"AAAE"
1.0	"AAAA"
1.0	"AAAA"
1.0	"AAA"
1.0	"AA"
1.0	"A's"
1.0	"A"
```
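
The job above computes count/pages_count for each 5-gram and keeps the minimum and maximum seen for every word. Another reading of "density" sums count and pages_count per word first and divides once at the end. The sketch below illustrates that alternative on made-up records; `word_densities` is a hypothetical helper and is not what the EMR job ran.

```python
from collections import defaultdict


def word_densities(records):
    """records: iterable of (ngram, count, pages_count) tuples."""
    totals = defaultdict(lambda: [0, 0])  # word -> [count, pages_count]
    for ngram, count, pages_count in records:
        for word in ngram.split():
            totals[word][0] += count
            totals[word][1] += pages_count
    return {word: float(c) / p for word, (c, p) in totals.items()}


# Made-up records, not taken from the dataset.
sample = [("a b", 10, 5), ("a c", 2, 2)]
densities = word_densities(sample)
```

Summing before dividing yields one value per word, so duplicates such as the repeated `"xxxx"` rows above would not appear.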


### Part (D)
---

In [40]:
%%writefile mrjob_hw53_4.py
#!/usr/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations

class DistributionNgram(MRJob):
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer,
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1rn',
                            }
                   )
        ]
    
    def mapper(self, _, line):
        # Strip surrounding whitespace, then split the record on tabs
        tokens = line.strip().split('\t')
        yield int(tokens[1]), tokens[0]
    
    def reducer(self, count, ngrams):
        for ngram in ngrams:
            yield count, ngram
    
if __name__ == '__main__':
    DistributionNgram.run()

Overwriting mrjob_hw53_4.py


In [41]:
!chmod a+x mrjob_hw53_4.py

In [16]:
!aws s3 rm --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/
!python mrjob_hw53_4.py -q -r emr \
 s3://filtered-5grams/ \
 --output-dir=s3://ucb-mids-mls-sayantan-satpati/hw53/distribution \
 --no-output \
 --no-strict-protocol

delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/_SUCCESS
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00001
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00006
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00000
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00003
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00002
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00004
delete: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00005


In [19]:
!rm -rf ./output;mkdir output;cd output;aws s3 cp --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/ .;
!cat output/part-0000* | sort -nrk 1 > output_hw53_4.txt
!rm -rf ./output
!head -n 10 output_hw53_4.txt

download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/_SUCCESS to ./_SUCCESS
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00000 to ./part-00000
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00002 to ./part-00002
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00001 to ./part-00001
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00004 to ./part-00004
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00003 to ./part-00003
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00005 to ./part-00005
download: s3://ucb-mids-mls-sayantan-satpati/hw53/distribution/part-00006 to ./part-00006
sort: write failed: standard output: Broken pipe
sort: write error
cat: stdout: Broken pipe
2223343	"on the basis of the"
1462063	"at the head of the"
1316960	"as well as in the"
871158	"at the same time the"
833764	"as a matter of fact"
590885	"as a part of the"
544783	"as one of th

### Top 10 NGram Distribution

```
2223343	"on the basis of the"
1462063	"at the head of the"
1316960	"as well as in the"
871158	"at the same time the"
833764	"as a matter of fact"
590885	"as a part of the"
544783	"as one of the most"
533464	"and at the end of"
447364	"I do not think that"
441568	"from the beginning of the"
```

## HW 5.4
---

In this part of the assignment we will focus on developing methods
for detecting synonyms, using the Google 5-grams dataset. To accomplish
this you must script two main tasks using MRJob:

(1) Build stripes of word co-occurrence for the top 10,000
most frequently appearing words across the entire set of 5-grams,
and output to a file in your bucket on s3 (bigram analysis, though the words are non-contiguous).

(2) Using two (symmetric) comparison methods of your choice 
(e.g., correlations, distances, similarities), pairwise compare 
all stripes (vectors), and output to a file in your bucket on s3.

==Design notes for (1)==
For this task you will be able to modify the pattern we used in HW 3.2
(feel free to use the solution as reference). To total the word counts 
across the 5-grams, output the support from the mappers using the total 
order inversion pattern:

<*word,count>

to ensure that the support arrives before the cooccurrences.

In addition to ensuring the determination of the total word counts,
the mapper must also output co-occurrence counts for the pairs of
words inside of each 5-gram. Treat these words as a basket,
as we have in HW 3, but count all stripes or pairs in both orders,
i.e., count both orderings: (word1,word2), and (word2,word1), to preserve
symmetry in our output for (2).
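
The design notes for (1) can be sketched as a plain generator that emits both a support record and a stripe for every word in a 5-gram. This is an illustration under assumptions, not the job used later in this notebook: `emit_for_ngram` is a hypothetical name, and weighting each co-occurrence by the 5-gram's count is a choice made for this sketch.

```python
def emit_for_ngram(words, count):
    """Yield (key, value) pairs for one 5-gram treated as a basket."""
    for i, word in enumerate(words):
        # Support record: "*" sorts before letters and digits in ASCII,
        # so "*word" keys reach a sorted reducer before the stripes.
        yield "*" + word, count
        stripe = {}
        for j, other in enumerate(words):
            if i != j:
                stripe[other] = stripe.get(other, 0) + count
        # Both orderings are covered because every word gets its own stripe.
        yield word, stripe


emitted = list(emit_for_ngram(["x", "y", "z"], 3))
```

Because each word emits a stripe over all the others, the pair (x, y) and the pair (y, x) both end up counted, which keeps the resulting matrix symmetric.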

==Design notes for (2)==
For this task you will have to determine a method of comparison.
Here are a few that you might consider:

- Spearman correlation
- Euclidean distance
- Taxicab (Manhattan) distance
- Shortest path graph distance (a graph, because our data is symmetric!)
- Pearson correlation
- Cosine similarity
- Kendall correlation
...
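
Two of the measures listed above, cosine similarity and Euclidean distance, can be written directly against sparse stripes stored as dictionaries. The following is a minimal sketch with made-up stripes; the function names are hypothetical, and returning 0.0 for an all-zero stripe is a convention chosen here.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(v * y.get(k, 0) for k, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention for an empty or all-zero stripe
    return dot / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Euclidean distance over the union of keys; missing keys count as 0."""
    keys = set(x) | set(y)
    return math.sqrt(sum((x.get(k, 0) - y.get(k, 0)) ** 2 for k in keys))


# Made-up stripes.
u = {"a": 3, "b": 4}
v = {"a": 3, "b": 4, "c": 0}
w = {"c": 5}
```

Both functions are symmetric in their arguments, which is the property the assignment asks for. Note the opposite directions: a larger cosine value means more similar, while a larger Euclidean distance means less similar.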

### HW 5.4: Part 1 > Step 1:

***Find the frequent unigrams for the top 10,000 most frequently appearing words across the entire set of 5-grams***

* Use the frequency count output from the earlier step to get the top 10K most frequent words

In [109]:
!rm -rf ./output;mkdir output;cd output;aws s3 cp --recursive s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words .;
!cat output/part-0000* | sort -nrk 1 | head -n 10000 > output/frequent_unigrams.txt
!head -n 10 output/frequent_unigrams.txt

download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/_SUCCESS to ./_SUCCESS
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00006 to ./part-00006
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00002 to ./part-00002
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00000 to ./part-00000
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00005 to ./part-00005
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00003 to ./part-00003
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00004 to ./part-00004
download: s3://ucb-mids-mls-sayantan-satpati/hw53/frequent_words/part-00001 to ./part-00001
sort: write failed: standard output: Broken pipe
sort: write error
5375699242	"the"
3691308874	"of"
2221164346	"to"
1387638591	"in"
1342195425	"a"
1135779433	"and"
798553959	"that"
756296656	"is"
688053106	"be"
481373389	"as"


### HW 5.4: Part 1 > Step 2:

***Build stripes of word co-occurrence for the top 10,000
most frequently appearing words across the entire set of 5-grams***

* Pass the frequent 1-itemset (the top 10K unigrams) generated in the earlier step to the mapper(s) for filtering
* The reducers sum the stripes and emit the word co-occurrences; the optional support-count filter (>= 10,000) is left commented out in the code below

In [117]:
%%writefile word_cooccurrences_set_hw54.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
import sys


class WordCoOccurrenceFrequentSet(MRJob):

    def steps(self):
        return [
            MRStep(mapper_init=self.mapper_init,
                   mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer)
        ]
    
    def mapper_init(self):
        # Load the file into memory
        self.unigrams = {}
        with open('frequent_unigrams.txt', 'r') as f:
            for line in f:
                tokens = line.strip().split('\t')
                self.unigrams[tokens[1].replace("\"","")] = int(tokens[0])
        sys.stderr.write('### of unigrams: {0}\n'.format(len(self.unigrams)))

    def mapper(self, _, line):
        tokens = line.strip().split('\t')
        # Words of the 5-gram
        words = tokens[0].split()
        # Keep only the words present in the frequent-unigram dictionary
        words = [w for w in words if w in self.unigrams]
        l = len(words)
        for i in xrange(l):
            d = {}
            for j in xrange(l):
                if i != j:
                    d[words[j]] = d.get(words[j], 0) + 1
            # Emit word, stripe
            yield words[i],d

    def combiner(self, word, stripes):
        d = {}
        # Aggregate stripes
        for s in stripes:
            for k, v in s.iteritems():
                d[k] = d.get(k, 0) + v
        yield word,d
        
    def reducer(self, word, stripes):
        d = {}
        # Aggregate stripes
        for s in stripes:
            for k, v in s.iteritems():
                d[k] = d.get(k, 0) + v
        '''
        # Filter based on support count (Not a requirement!)
        d_final = {}
        for k,v in d.iteritems():
            if v >= 10000:
                d_final[k] = v
        '''
        # Combine stripes
        yield word,d

if __name__ == '__main__':
    WordCoOccurrenceFrequentSet.run()

Overwriting word_cooccurrences_set_hw54.py


In [118]:
!chmod a+x word_cooccurrences_set_hw54.py

In [119]:
!aws s3 rm --recursive s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/
!python word_cooccurrences_set_hw54.py -q -r emr \
 s3://filtered-5grams/ \
 --file output/frequent_unigrams.txt \
 --output-dir=s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur \
 --no-output \
 --no-strict-protocol

### HW 5.4 Part 2 > Step 1

***From the output of Part 1 > Step 2, create a single file with all the stripes - frequent itemsets of size 2***

In [133]:
!rm -rf ./output;mkdir output;cd output;aws s3 cp --recursive s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/ .;
!cat output/part-000* | sort -k 1 > output/frequent_stripes.txt
!wc -l output/frequent_stripes.txt

download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/_SUCCESS to ./_SUCCESS
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00000 to ./part-00000
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00004 to ./part-00004
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00002 to ./part-00002
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00001 to ./part-00001
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00006 to ./part-00006
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00005 to ./part-00005
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00007 to ./part-00007
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00008 to ./part-00008
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00009 to ./part-00009
download: s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/part-00010 to ./part-00010
download: s3:/

### 1st Take on Distance Calculation using Mapper Side Hash-Join (Brute Force)
---

***Approach***

* Implement a hash join using a mapper-only job. The stripes for the frequent 10K words are fed to each mapper through the distributed cache.
* **With 11 m1.large instances, the job completed in 15 hrs**
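
A rough count of the comparisons helps explain the long running time. Assuming about 10,000 stripes, and counting each unordered pair once, the number of pairs follows from the usual n(n-1)/2 formula (or n(n+1)/2 when a word is also compared with itself). The helper name below is hypothetical.

```python
from itertools import combinations


def unordered_pair_count(n, include_self=False):
    """Number of unordered pairs among n items."""
    return n * (n + 1) // 2 if include_self else n * (n - 1) // 2


# Cross-check the formula on a tiny example by enumeration.
small = ["a", "b", "c", "d"]
assert unordered_pair_count(len(small)) == len(list(combinations(small, 2)))

print(unordered_pair_count(10000))                     # 49995000
print(unordered_pair_count(10000, include_self=True))  # 50005000
```

If each comparison then walks a key range of roughly 10,000 entries, the total is on the order of 5 x 10^11 inner-loop iterations, which is consistent with a job measured in hours.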

In [60]:
%%writefile dist_calc_hw54.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.compat import get_jobconf_value
import re
import sys
import ast
import urllib2
import math


class DistanceCalc(MRJob):

    def steps(self):
        return [
            MRStep(mapper_init=self.mapper_init,
                   mapper=self.mapper)
        ]
    
    def mapper_init(self):
        # Load the file into memory
        self.counter = 0
        self.stripes = {}
        '''
        f = urllib2.urlopen("https://s3-us-west-2.amazonaws.com/ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/frequent_stripes.txt")
        for line in f.readlines():
            tokens = line.strip().split('\t')
            self.stripes[tokens[0].replace("\"","")] = ast.literal_eval(tokens[1])
            self.increment_counter('distance', 'num_stripes_loaded', amount=1)
        '''
        # Load File
        with open('frequent_stripes.txt','r') as f:
            for line in f:
                tokens = line.strip().split('\t')
                self.stripes[tokens[0].replace("\"","")] = ast.literal_eval(tokens[1])
                self.increment_counter('distance', 'num_stripes_loaded', amount=1)
        
        sys.stderr.write('### of stripes Loaded: {0}\n'.format(len(self.stripes)))
        
        ''' (Not Required By Cosine)
        #Pre-Process File (Add missing keys to make it a 10K by 10K matrix)
        s1 = set(self.stripes.keys()) # Full Set
        for key,value_dict in self.stripes.iteritems():
            s2 = set(value_dict.keys())
            missing_keys = s1.difference(s2)
            for mk in missing_keys:
                self.stripes[key][mk] = 0
        sys.stderr.write('### of stripes Pre-Processed: {0}\n'.format(len(self.stripes)))
        '''

    def mapper(self, _, line):
        dist_type = get_jobconf_value('dist_type')
        tokens = line.strip().split('\t')
        
        key = tokens[0].replace("\"","")
        dict_pairs = ast.literal_eval(tokens[1])
        
        for n_key, n_dict_pairs in self.stripes.iteritems():
            # The measure is symmetric: compute (a,b) only once and skip (b,a)
            if key > n_key:
                continue
            
            self.counter += 1   
            if self.counter % 1000 == 0:
                self.set_status('# of Distances Calculated: {0}'.format(self.counter))
                
            distance = None
            
            if dist_type == 'euclid':

                # Calculate Euclidean Distance over the union of keys from both stripes
                squared_distance = 0
                for k in set(dict_pairs) | set(n_dict_pairs):
                    squared_distance += (dict_pairs.get(k, 0) - n_dict_pairs.get(k, 0)) ** 2
                    
                distance = math.sqrt(squared_distance)
                
            if dist_type == 'cosine':
                
                # Calculate Cosine Similarity (higher means more similar)
                # Keys missing from a stripe count as 0
                norm_x = 0
                norm_y = 0
                dot_x_y = 0
                for k in self.stripes.keys(): # Iterate through entire key range once
                    norm_x += dict_pairs.get(k,0) * dict_pairs.get(k,0)
                    norm_y += n_dict_pairs.get(k,0) * n_dict_pairs.get(k,0)
                    dot_x_y += dict_pairs.get(k,0) * n_dict_pairs.get(k,0)
                    
                distance = float(dot_x_y) / (math.sqrt(norm_x) * math.sqrt(norm_y))
          
            self.increment_counter('distance', 'num_{0}_distances'.format(dist_type), amount=1)
            yield (distance), (key, n_key)


if __name__ == '__main__':
    DistanceCalc.run()

Overwriting dist_calc_hw54.py


In [61]:
!chmod a+x dist_calc_hw54.py

In [None]:
!aws s3 rm --recursive s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output/
!python dist_calc_hw54.py -q -r emr \
 --file s3://ucb-mids-mls-sayantan-satpati/hw54/distance/input/frequent_stripes.txt \
 --jobconf 'dist_type=cosine' \
 s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/ \
 --output-dir=s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output/ \
 --bootstrap-action="s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
        -m mapred.tasktracker.map.tasks.maximum=2 \
        -m mapreduce.map.memory.mb=3000" \
 --no-strict-protocol

#### Get Top 1K Word Pairs with the highest cosine similarity

In [14]:
%%writefile top_1k_syn_hw54.py
#!/usr/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
import ast
import sys
from itertools import combinations

class Top1KSynonyms(MRJob):
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer,
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1rn',
                            'mapred.reduce.tasks': 1
                            }
                   )
        ]
    
    def mapper(self, _, line):
        tokens = line.strip().split('\t')
        [w1,w2] = ast.literal_eval(tokens[1])
        if w1 != w2:
            yield float(tokens[0]),tokens[1]
    
    def reducer(self, key, values):
        for value in values:
            yield key, value
    
if __name__ == '__main__':
    Top1KSynonyms.run()

Overwriting top_1k_syn_hw54.py


In [15]:
!chmod a+x top_1k_syn_hw54.py

In [None]:
!aws s3 rm --recursive s3://ucb-mids-mls-sayantan-satpati/hw54/distance/synonyms/
!python top_1k_syn_hw54.py -q -r emr \
 s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output/ \
 --output-dir=s3://ucb-mids-mls-sayantan-satpati/hw54/distance/synonyms/ \
 --no-strict-protocol

In [None]:
!rm -rf ./output1;mkdir output1;cd output1;aws s3 cp --recursive s3://ucb-mids-mls-sayantan-satpati/hw54/distance/synonyms/ .;
!head -n 1000 output1/part-00000 > output1/top1k.txt

In [3]:
!wc -l output1/top1k.txt
!head -n 10 output1/top1k.txt

    1000 output1/top1k.txt
0.9989871786920097	"[\"center\", \"centre\"]"
0.9983273473595391	"[\"aegis\", \"auspices\"]"
0.99819561795478362	"[\"seventeenth\", \"sixteenth\"]"
0.99808891300127756	"[\"eighteenth\", \"nineteenth\"]"
0.99803557390404385	"[\"Germans\", \"Russians\"]"
0.99793556998358346	"[\"ascribed\", \"attributed\"]"
0.99793255052813279	"[\"authenticity\", \"validity\"]"
0.99782320499493526	"[\"favor\", \"favour\"]"
0.99776231214994249	"[\"concerning\", \"respecting\"]"
0.99775180932774554	"[\"Danube\", \"Rhine\"]"


### 2nd Take on Distance Calculation
---

***Approach taken from the paper: https://terpconnect.umd.edu/~oard/pdf/acl08elsayed2.pdf***

**3 step map/reduce job:**

**Step:1**

* Create an inverted index with postings of all terms (Term as key, and Stripes/Docs are Values)

* Mapper: Emits (key, stripe/doc)
* Reducer: Creates the postings for each key

**Step:2**

* Calculate Cosine Similarity

* Mapper: Emits a docId pair as key and (x*x, y*y, x*y) as value, where x is the count for docId_1 and y is the count for docId_2 (so x*x and y*y are summed only over terms that both docIds share)
* Combiner: Sums up (x*x, y*y, x*y) to reduce IO between mappers and reducers
* Reducer: Calculates the cosine similarity from the summed values
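
Steps 1 and 2 can be sketched in memory to show why the inverted index helps: only docIds that share at least one term ever form a pair. This is an illustration on made-up stripes, not the MapReduce job itself. `pairwise_cosine` is a hypothetical helper, counts are assumed to be positive, and, unlike the mapper described above, this sketch takes each norm over the full stripe.

```python
import math
from collections import defaultdict
from itertools import combinations


def pairwise_cosine(stripes):
    """stripes: dict docId -> {term: count}. Returns {(docA, docB): cosine}."""
    # Step 1: postings, term -> list of (docId, count).
    postings = defaultdict(list)
    for doc_id, stripe in stripes.items():
        for term, count in stripe.items():
            postings[term].append((doc_id, count))

    # Step 2: accumulate dot products only for docIds that share a term.
    dots = defaultdict(int)
    for entries in postings.values():
        for (doc_a, x), (doc_b, y) in combinations(sorted(entries), 2):
            dots[(doc_a, doc_b)] += x * y

    # Norms over each full stripe (an assumption of this sketch).
    norms = {d: math.sqrt(sum(c * c for c in s.values()))
             for d, s in stripes.items()}
    return {pair: dot / (norms[pair[0]] * norms[pair[1]])
            for pair, dot in dots.items()}


# Made-up stripes: "p" and "q" share terms, "r" shares none.
stripes = {"p": {"a": 1, "b": 1}, "q": {"a": 1, "b": 1}, "r": {"c": 2}}
sims = pairwise_cosine(stripes)
```

The pair (p, q) is scored, while no pair involving r is ever generated. The cost is driven by the length of the posting lists, so very common terms dominate the work.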

**Step:3**

* Sort and take the top 1000 most similar word pairs
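
For the top-N step, a bounded selection avoids sorting the whole list of scored pairs. The sketch below uses `heapq.nlargest` from the standard library on made-up scores; `top_n_pairs` is a hypothetical helper and only one possible way to implement this step.

```python
import heapq


def top_n_pairs(scored_pairs, n=1000):
    """Return the n highest-scoring (score, pair) tuples, best first."""
    return heapq.nlargest(n, scored_pairs, key=lambda item: item[0])


# Made-up similarity scores.
scored = [(0.2, ("a", "b")), (0.9, ("c", "d")), (0.5, ("e", "f"))]
best = top_n_pairs(scored, n=2)
```

The same call works inside a mapper or reducer that buffers candidates, since it accepts any iterable and keeps at most `n` items in its heap.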

#### The job has been running on AWS for 1 day and 9 hours. Cluster size: 15 m1.large slaves

**At the time of writing this notebook, the job had progressed to this point:**
```
2015-10-13 00:11:53,485 INFO org.apache.hadoop.streaming.StreamJob (main):  map 100%  reduce 16%
2015-10-13 00:40:20,736 INFO org.apache.hadoop.streaming.StreamJob (main):  map 100%  reduce 17%
```

In [5]:
%%writefile dist_calc_inverted_index_hw54.py
#!/usr/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re, sys
import ast
import math
from mrjob.protocol import RawValueProtocol,JSONProtocol
from itertools import combinations

class DistanceCalcInvertedIndex(MRJob):
        
    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer
                  ),
            MRStep(mapper=self.mapper_dist,
                   combiner=self.combiner_dist,
                   reducer=self.reducer_dist
                  ),
            MRStep(mapper_init=self.mapper_top1k_init,
                   mapper=self.mapper_top1k,
                   mapper_final=self.mapper_top1k_final,
                   reducer_init=self.reducer_top1k_init,
                   reducer=self.reducer_top1k,
                   reducer_final=self.reducer_top1k_final,
                   jobconf={
                            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                            'mapred.text.key.comparator.options': '-k1,1rn',
                            'mapred.reduce.tasks': 1
                            }
                  )
        ]

    # Step:1
    def mapper(self, _, line):
        tokens = line.strip().split('\t')
        
        docId = tokens[0].replace("\"","")
        stripe = ast.literal_eval(tokens[1])
        
        for word,cnt in stripe.iteritems():
            yield word, (docId, int(cnt))
    
    def reducer(self, word, docId_count):
        docId_counts = [i for i in docId_count]
        yield word, docId_counts
        
    # Step: 2
        
    def mapper_dist(self, key, docId_counts):
        #sys.stderr.write('{0} # {1}\n'.format(key, docId_counts))
        docId_counts = sorted(docId_counts)
        l = len(docId_counts)
        for i in xrange(l):
            for j in xrange(l):
                if j > i:
                    x = docId_counts[i][1]
                    y = docId_counts[j][1]
                    yield (docId_counts[i][0], docId_counts[j][0]), (x*x, y*y, x*y)
                    
    
    def combiner_dist(self, docId_pair, values):
        dot_x_y = 0
        x_squared = 0
        y_squared = 0
        for v in values:
            x_squared += v[0]
            y_squared += v[1]
            dot_x_y += v[2]
        
        yield docId_pair, (x_squared, y_squared, dot_x_y) 
        
        
    def reducer_dist(self, docId_pair, values):
        dot_x_y = 0
        norm_x = 0
        norm_y = 0
        for v in values:
            norm_x += v[0]
            norm_y += v[1]
            dot_x_y += v[2]
        
        yield docId_pair, float(dot_x_y) / (math.sqrt(norm_x) * math.sqrt(norm_y)) 
        
        
    # Step: 3
    
    def mapper_top1k_init(self):
        self.TOP_N = 1000
        self.top_1k_pairs = []

    def mapper_top1k(self, docId_pair, distance):
        self.top_1k_pairs.append((distance, docId_pair))
        if len(self.top_1k_pairs) > self.TOP_N:
            self.top_1k_pairs.sort(key=lambda x: -x[0])
            self.top_1k_pairs = self.top_1k_pairs[:self.TOP_N]
            
    def mapper_top1k_final(self):
        sys.stderr.write('##### [Mapper_Final]: {0}\n'.format(len(self.top_1k_pairs)))
        for e in self.top_1k_pairs:
            yield e[0], e[1]
            
    def reducer_top1k_init(self):
        self.TOP_N = 1000
        self.top_1k_pairs = []
            
    def reducer_top1k(self, distance, docId_pairs):
        for docId_pair in docId_pairs:
            self.top_1k_pairs.append((distance, docId_pair))
        if len(self.top_1k_pairs) > self.TOP_N:
            self.top_1k_pairs.sort(key=lambda x: -x[0])
            self.top_1k_pairs = self.top_1k_pairs[:self.TOP_N]
        
    def reducer_top1k_final(self):
        sys.stderr.write('##### [Reducer_Final]: {0}\n'.format(len(self.top_1k_pairs)))
        for e in self.top_1k_pairs:
            yield e[0], e[1]
        
    
if __name__ == '__main__':
    DistanceCalcInvertedIndex.run()

Overwriting dist_calc_inverted_index_hw54.py
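
The distance step above accumulates three partial sums per document pair, `(x*x, y*y, x*y)`, and divides at the end. The following is a minimal sketch of that arithmetic in plain Python, outside mrjob. The helper name `cosine_from_partials` and the toy vectors are assumptions made for illustration, and the sketch assumes one tuple per term of both vectors.

```python
import math

def cosine_from_partials(partials):
    """Combine (x*x, y*y, x*y) tuples, one per term, into a cosine similarity."""
    x_squared = 0
    y_squared = 0
    dot_x_y = 0
    for xx, yy, xy in partials:
        x_squared += xx
        y_squared += yy
        dot_x_y += xy
    return float(dot_x_y) / (math.sqrt(x_squared) * math.sqrt(y_squared))

# Toy vectors (hypothetical data): y is a multiple of x, so the two are parallel
x = [1, 2]
y = [2, 4]
partials = [(a * a, b * b, a * b) for a, b in zip(x, y)]
similarity = cosine_from_partials(partials)
```

Because the sums are associative, they can be combined in any order, which is what lets a combiner pre-aggregate them before the reducer.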


In [6]:
!chmod a+x dist_calc_inverted_index_hw54.py
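
Step 3 of the job above keeps a running list of the best 1,000 pairs by appending, sorting, and truncating. As an alternative sketch (not the method used in the job), a bounded min-heap from the standard library gives the same top-N result without re-sorting the whole list. The function name `top_n_pairs` and the sample data are hypothetical.

```python
import heapq

def top_n_pairs(scored_pairs, n):
    """Return the n highest-scoring (score, pair) tuples, best first."""
    heap = []  # min-heap holding at most n items; heap[0] is the current worst
    for score, pair in scored_pairs:
        item = (score, pair)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Replace the current worst item with the better one
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)

# Hypothetical scored pairs for illustration
scored = [
    (0.1, ('a', 'b')),
    (0.9, ('c', 'd')),
    (0.5, ('e', 'f')),
    (0.7, ('g', 'h')),
]
best_two = top_n_pairs(scored, 2)
```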

In [None]:
!aws s3 rm --recursive s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/
!python dist_calc_inverted_index_hw54.py -q -r emr \
 s3://ucb-mids-mls-sayantan-satpati/hw54/word_cooccur/ \
 --output-dir=s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/ \
 --no-strict-protocol

delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00000
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00010
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00011
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00004
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00001
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00006
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00005
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00002
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00003
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00007
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00009
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00012
delete: s3://ucb-mids-mls-sayantan-satpati/hw54/distance/output1/part-00013
delete: s3:/

In [7]:
''' For Running under Sayantan's personal A/C
!aws s3 rm --recursive s3://w261/distance/output/
!python dist_calc_inverted_index_hw54.py -q -r emr \
 s3://w261/word_cooccur/ \
 --output-dir=s3://w261/distance/output/ \
 --no-strict-protocol
'''

Traceback (most recent call last):
  File "dist_calc_inverted_index_hw54.py", line 122, in <module>
    DistanceCalcInvertedIndex.run()
  File "/Users/ssatpati/anaconda/lib/python2.7/site-packages/mrjob/job.py", line 461, in run
    mr_job.execute()
  File "/Users/ssatpati/anaconda/lib/python2.7/site-packages/mrjob/job.py", line 479, in execute
    super(MRJob, self).execute()
  File "/Users/ssatpati/anaconda/lib/python2.7/site-packages/mrjob/launch.py", line 153, in execute
    self.run_job()
  File "/Users/ssatpati/anaconda/lib/python2.7/site-packages/mrjob/launch.py", line 216, in run_job
    runner.run()
  File "/Users/ssatpati/anaconda/lib/python2.7/site-packages/mrjob/runner.py", line 470, in run
    self._run()
  File "/Users/ssatpati/anaconda/lib/python2.7/site-packages/mrjob/emr.py", line 882, in _run
    self._wait_for_job_to_complete()
  File "/Users/ssatpati/anaconda/lib/python2.7/site-packages/mrjob/emr.py", line 1767, in _wait_for_job_to_complete
    raise

## HW 5.5
---

In this part of the assignment you will evaluate the success of your synonym detector.
Take the top 1,000 closest/most similar/correlative pairs of words as determined
by your measure in (2), and use the synonyms function in the accompanying
python code:

nltk_synonyms.py

Note: This will require installing the python nltk package:

http://www.nltk.org/install.html

and downloading its data with nltk.download().

For each (word1,word2) pair, check to see if word1 is in the list, 
synonyms(word2), and vice-versa. If one of the two is a synonym of the other, 
then consider this pair a 'hit', and then report the precision, recall, and F1 measure of 
your detector across your 1,000 best guesses. Report the macro averages of these measures.
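
The hit rule and the three measures described above can be sketched as follows. This is a minimal illustration under stated assumptions: `is_hit`, `precision_recall_f1`, and the toy lookup `toy_synonyms` are hypothetical names, and the counts are made-up numbers rather than results from this assignment.

```python
def is_hit(word1, word2, synonyms_fn):
    """A pair is a hit if either word appears in the other's synonym list."""
    return word1 in synonyms_fn(word2) or word2 in synonyms_fn(word1)

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy synonym lookup standing in for the WordNet-based synonyms() function
toy_table = {'color': ['colour'], 'colour': ['color'], 'castle': []}

def toy_synonyms(word):
    return toy_table.get(word, [])

hit = is_hit('color', 'colour', toy_synonyms)
miss = is_hit('color', 'castle', toy_synonyms)

# Made-up counts: 2 true positives, 2 false positives, 6 false negatives
p, r, f1 = precision_recall_f1(tp=2, fp=2, fn=6)
```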

### Precision, Recall, and F1 Score
___

***The following exercise was done using the distance output from the brute-force implementation (Cosine Distance).***

#### PRECISION: 0.055

#### RECALL: 0.057

#### F1 Score: 0.056

#### Match Found in 51 items!!!

In [22]:
import nltk
from nltk.corpus import wordnet as wn
import sys
import ast
# Return the lemma names from every WordNet synset of the given word
def synonyms(string):
    syndict = {}
    for synset in wn.synsets(string):
        for syn in synset.lemma_names():
            syndict.setdefault(syn, 1)
    return syndict.keys()

cnt_t = 0
cnt_m = 0
cnt_fn = 0

# Load synonym file (each line is split on commas; the first two fields are the word pair)
dict_syn = {}
with open('synonyms.txt', 'r') as f:
    for line in f:
        t = line.strip().split(',')
        w1 = t[0].lower()
        w2 = t[1].lower()
        # Record the pair in both directions
        dict_syn.setdefault(w1, []).append(w2)
        dict_syn.setdefault(w2, []).append(w1)

print "Length of Synonym Dict: {0}".format(len(dict_syn))
            
# Check if any of the top 1000 matches the synonym list
with open('output1/top1k.txt', 'r') as f:
    for line in f:
        cnt_t += 1
        t = line.strip().split('\t')
        t[1] = t[1].replace("\\","")[1:-1]
        pair = ast.literal_eval(t[1])
        syn0 = synonyms(pair[0].lower())
        syn1 = synonyms(pair[1].lower())

        # Precision
        if pair[1].lower() in syn0 or pair[0].lower() in syn1:
            print "MATCH: {0}".format(pair)
            cnt_m += 1
        
        # Recall: count pairs where either word appears in the reference synonym file
        if pair[0].lower() in dict_syn or pair[1].lower() in dict_syn:
            cnt_fn += 1
        
            
print "\nTotal Count: {0}, TP: {1}, FP: {2}, FN: {3}".format(cnt_t, cnt_m, cnt_t - cnt_m, cnt_fn)

p = round(float(cnt_m) / cnt_t, 3)
r = round(float(cnt_m) / (cnt_m + cnt_fn), 3)
f1 = round(2 * p * r / (p + r), 3)

print "\n### PRECISION: {0}".format(p)
print "\n### RECALL: {0}".format(r)
print "\n### F1 Score: {0}".format(f1)



Length of Synonym Dict: 7315
MATCH: ['center', 'centre']
MATCH: ['aegis', 'auspices']
MATCH: ['favor', 'favour']
MATCH: ['Earl', 'earl']
MATCH: ['color', 'colour']
MATCH: ['neighbourhood', 'vicinity']
MATCH: ['Fig', 'Figure']
MATCH: ['irrespective', 'regardless']
MATCH: ['fulfillment', 'fulfilment']
MATCH: ['honor', 'honour']
MATCH: ['authenticity', 'legitimacy']
MATCH: ['analyse', 'analyze']
MATCH: ['Emperor', 'emperor']
MATCH: ['assess', 'evaluate']
MATCH: ['vigor', 'vigour']
MATCH: ['colors', 'colours']
MATCH: ['calculate', 'compute']
MATCH: ['calculation', 'computation']
MATCH: ['Gospel', 'gospel']
MATCH: ['establishment', 'formation']
MATCH: ['castle', 'palace']
MATCH: ['recognise', 'recognize']
MATCH: ['embodiment', 'incarnation']
MATCH: ['labor', 'labour']
MATCH: ['brink', 'verge']
MATCH: ['chiefly', 'principally']
MATCH: ['chief', 'principal']
MATCH: ['harbor', 'harbour']
MATCH: ['theater', 'theatre']
MATCH: ['neighborhood', 'neighbourhood']
MATCH: ['Duke', 'duke']
MATCH: ['adv

## HW 5.6 & HW 5.7
---

***I did not have time to attempt these optional problems and will take a look at them shortly.***