# HW 5.4 (over 2 GB of data)
In this part of the assignment we will focus on developing methods
for detecting synonyms, using the Google 5-grams dataset. To accomplish
this you must script two main tasks using MRJob:

(1) Build stripes of word co-occurrence for the top 10,000 most frequently appearing words
across the entire set of 5-grams, using the words ranked from 9,001 to 10,000 as a basis,
and output to a file in your bucket on S3 (bigram analysis, though the words are non-contiguous).

(2) Using two (symmetric) comparison methods of your choice 
(e.g., correlations, distances, similarities), pairwise compare 
all stripes (vectors), and output to a file in your bucket on s3.

==Design notes for (1)==
For this task you will be able to modify the pattern we used in HW 3.2
(feel free to use the solution as reference). To total the word counts 
across the 5-grams, output the support from the mappers using the total 
order inversion pattern:

<*word,count>

to ensure that the support arrives before the cooccurrences.
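
A minimal sketch of the order inversion idea in plain Python, outside MRJob. The helper name `emit_records` and the comma-joined pair key are assumptions made for illustration, and `sorted` stands in for the Hadoop shuffle. It relies on `*` sorting ahead of letters and digits, so tokens that begin with earlier punctuation (for example `!` or `"`) would need separate handling.

```python
def emit_records(line):
    """Yield (key, count) records for one 'ngram<TAB>count...' input line."""
    fields = line.strip().split('\t')
    words = fields[0].lower().split(' ')
    count = int(fields[1])
    for word in sorted(set(words)):
        # support record: the '*' prefix sorts ahead of alphanumeric keys
        yield '*' + word, count
    for left in sorted(set(words)):
        for right in sorted(set(words)):
            if left != right:
                # co-occurrence record, emitted in both orders
                yield left + ',' + right, count


# Sorting stands in for the Hadoop shuffle here.
records = sorted(emit_records('a tale of two cities\t3'))
support = [key for key, _ in records if key.startswith('*')]
print(support)                 # all support keys
print(records[len(support)])   # first co-occurrence record after the support
```

Because every support key precedes every pair key in the sorted stream, a reducer can finish the totals before it sees the first co-occurrence.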

In addition to the total word counts, the mapper must also output
co-occurrence counts for the pairs of words inside each 5-gram.
Treat these words as a basket, as we did in HW 3, but count every
pair (or stripe entry) in both orders, i.e., both (word1,word2) and
(word2,word1), to preserve symmetry in our output for (2).
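
The basket treatment can be sketched as follows, assuming a hypothetical helper `basket_stripes` that is not part of the assignment code. Repeated words in a 5-gram are counted once, and `itertools.permutations` produces each pair in both orders.

```python
from collections import Counter, defaultdict
from itertools import permutations


def basket_stripes(ngram, count, vocab=None):
    """Return {word: Counter({co_word: count})} for one 5-gram treated as a basket."""
    basket = set(ngram.lower().split(' '))
    if vocab is not None:
        # optionally restrict the basket to a vocabulary
        basket &= set(vocab)
    stripes = defaultdict(Counter)
    # permutations yields both (w1, w2) and (w2, w1)
    for left, right in permutations(sorted(basket), 2):
        stripes[left][right] += count
    return dict(stripes)


stripes = basket_stripes('the cat and the dog', 4)
print(stripes['cat']['dog'], stripes['dog']['cat'])
```

Summing such stripes across all 5-grams gives a symmetric co-occurrence matrix, which is what the comparison in (2) expects.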

==Design notes for (2)==
For this task you will have to determine a method of comparison.
Here are a few that you might consider:

- Jaccard
- Cosine similarity
- Spearman correlation
- Euclidean distance
- Taxicab (Manhattan) distance
- Shortest path graph distance (a graph, because our data is symmetric!)
- Pearson correlation
- Kendall correlation
...

However, be cautioned that some comparison methods are more difficult to
parallelize than others. Also, do not compute more associations than necessary:
your chosen measure is symmetric, so each pair of stripes needs to be compared only once.
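
As an illustration of two of the measures listed above, here is a small sketch that compares two stripes held as plain dictionaries. The function names and the toy stripes are assumptions for the example. Returning 0.0 for an empty stripe is one possible convention, not something the assignment specifies.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight}."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard index on the sets of features present in each stripe."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / float(len(union)) if union else 0.0


s1 = {'x': 1.0, 'y': 1.0}
s2 = {'y': 1.0, 'z': 1.0}
print(cosine(s1, s2), jaccard(s1, s2))
```

Both functions give the same value when their arguments are swapped, so a job only needs to evaluate each unordered pair once.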

Please use the inverted-index pattern (discussed in live session #5) to compute the pairwise (term-by-term) similarity matrix.
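
One way to picture the inverted-index pattern is the in-memory sketch below. It is an assumption-laden illustration rather than the method from the live session: `invert` and `pairwise_cosine` are hypothetical names, the stripes are toy data, and all weights are assumed to be positive. Each postings list contributes partial dot products only for terms that share a feature, so pairs with nothing in common are never generated.

```python
from collections import defaultdict
from itertools import combinations
import math


def invert(stripes):
    """Turn {term: {feature: weight}} into {feature: [(term, weight), ...]}."""
    index = defaultdict(list)
    for term, stripe in stripes.items():
        for feature, weight in stripe.items():
            index[feature].append((term, weight))
    return index


def pairwise_cosine(stripes):
    """Cosine similarity for every pair of terms that share at least one feature."""
    norms = {t: math.sqrt(sum(w * w for w in s.values()))
             for t, s in stripes.items()}
    dots = defaultdict(float)
    for postings in invert(stripes).values():
        # sorting keeps each pair in one canonical order (t1 < t2)
        for (t1, w1), (t2, w2) in combinations(sorted(postings), 2):
            dots[(t1, t2)] += w1 * w2
    return {pair: dot / (norms[pair[0]] * norms[pair[1]])
            for pair, dot in dots.items()}


toy = {
    'cat': {'pet': 1.0, 'fur': 1.0},
    'dog': {'pet': 1.0, 'bark': 1.0},
    'car': {'road': 1.0},
}
sims = pairwise_cosine(toy)
print(sims)
```

In a MapReduce setting the loop over postings lists would correspond to the mapper and the summation of partial products to the reducer.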

In [24]:
%%writefile top_vocabs.py
from mrjob.job import MRJob
from mrjob.protocol import RawValueProtocol
from mrjob.step import MRStep

# this script finds the top 1000 vocabulary words by total count

class top_vocabs(MRJob):
    # use raw value protocol to get inversion done easier
    #INPUT_PROTOCOL = RawValueProtocol
    #INTERNAL_PROTOCOL = RawValueProtocol
    #OUTPUT_PROTOCOL = RawValueProtocol
    
    # step1 do word counts
    # step2 get the top 1000 vocabs by counts
    def steps(self):
        return [
            MRStep(mapper=self.wordCountsMapper, reducer=self.wordCountsReducer),
            MRStep(reducer_init=self.rankWords_init, reducer=self.getTopVocabs,
                  jobconf={
                    'stream.num.map.output.key.fields': 2,
                    'mapreduce.partition.keypartitioner.options': '-k1,1',
                    'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                    'mapred.text.key.comparator.options': "-k2,2nr",
                    'mapred.reduce.tasks': 1
                })
        ]      

    def wordCountsMapper(self, _, line):
        lineArr = line.strip().split('\t')
        ngram_count = int(lineArr[1])
        for word in lineArr[0].split(' '):
            # each word is counted as many times as the n-gram itself occurs
            yield word.lower() , ngram_count
            
    def wordCountsReducer(self, word, counts):
        # summing up counts for each word
        yield word, sum(counts)
        
    def rankWords_init(self):
        self.vocab = set()
        # grab the top 1000 words as vocab
        self.vocab_size = 1000
        
    def getTopVocabs(self, word, count):
        if len(self.vocab) < self.vocab_size:
            self.vocab.add(word)
            yield word, max(count)

if __name__ == '__main__':
    top_vocabs.run()

Writing top_vocabs.py


In [28]:
!hdfs dfs -rm -r /w261/hw5/dev_output
!python top_vocabs.py -r hadoop hdfs:///w261/hw5/dev/googlebooks-eng-all-5gram-20090715-0-filtered.txt --output-dir hdfs:///w261/hw5/dev_output/ --no-output

16/02/14 02:21:57 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /w261/hw5/dev_output
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/top_vocabs.root.20160214.022158.308846
writing wrapper script to /tmp/top_vocabs.root.20160214.022158.308846/setup-wrapper.sh
Using Hadoop version 2.6.0
Copying local files into hdfs:///user/root/tmp/mrjob/top_vocabs.root.20160214.022158.308846/files/

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

HADOOP: packageJobJar: [] [/usr/lib/hadoop/lib/hadoop-streaming-2.6.0.jar] /tmp/streamjob4086499293356888855.jar tmpDir=null
HADOOP: Connecting to ResourceManager at /0.0.0.0:8032
HADOOP: Connecting to 

In [20]:
%%writefile synonym_matrix.py
from mrjob.job import MRJob
from mrjob.protocol import RawValueProtocol
from mrjob.step import MRStep

# this script creates a co-occurrence matrix and outputs it as an inverted index

class synonym_matrix(MRJob):
    
    SORT_VALUES = True
    
    def steps(self):
        # first step is to count the frequencies of co-occurring terms
        # second step is to produce the inverted index
        return [
            MRStep(mapper_init=self.bigram_mapper_init, mapper=self.bigram_mapper, reducer=self.bigram_reducer),
            MRStep(reducer=self.createInvertedIndex)
        ]
    
    # load up vocabulary
    def bigram_mapper_init(self):
        self.vocab = set()
        for line in open('vocabs.txt', 'r'):
            lineArr = line.strip().split('\t')
            self.vocab.add(lineArr[0].replace('"', '').lower())
    
    # count up co-occurring bigrams
    def bigram_mapper(self, _, line):
        emitted_words = set()
        lineArr = line.strip().split('\t')
        words = [w.lower() for w in lineArr[0].split(' ')]
        # total counts of this particular ngram
        ngram_counts = int(lineArr[1])
        if len(words) > 1:
            for word in words:
                if word not in self.vocab or word in emitted_words:
                    continue
                # emit total ngram count to calculate frequency in the reducer
                yield word, ('*', ngram_counts)
                emitted_words.add(word)
                co_occured_words = set()
                for another_word in words:
                    if word != another_word and another_word in self.vocab and another_word not in co_occured_words:
                        co_occured_words.add(another_word)
                        yield word, (another_word, ngram_counts)
    
    # calculate co-occurrence frequencies; SORT_VALUES = True makes the
    # ('*', count) support records arrive before the co-occurring words
    def bigram_reducer(self, word, cocurs):
        ngram_total = 0
        current_cocurrence = 0
        current_coword = None
        for coword, count in cocurs:
            if coword == '*':
                ngram_total += count
                continue
            if coword == current_coword:
                current_cocurrence += count
            else:
                if current_coword is not None:
                    yield current_coword, (word, float(current_cocurrence)/float(ngram_total))
                current_coword = coword
                current_cocurrence = count
        # flush the last co-occurring word
        if current_coword is not None:
            yield current_coword, (word, float(current_cocurrence)/float(ngram_total))
    
    # make the inverted index
    def createInvertedIndex(self, cowords, word_freq):
        yield cowords, '|'.join(["%s:%s"%(word, freq) for (word, freq) in word_freq])
     
    
if __name__ == '__main__':
    synonym_matrix.run()

Overwriting synonym_matrix.py


In [21]:
 !python synonym_matrix.py googlebooks-eng-all-5gram-20090715-0-filtered.txt --file vocabs.txt --output-dir bigram_results --no-output

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
ignoring partitioner keyword arg (requires real Hadoop): 'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner'

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

creating tmp directory c:\temp\synonym_matrix.kuanlin.20160214.053618.632000
writing to c:\temp\synonym_matrix.kuanlin.20160214.053618.632000\step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to c:\temp\synonym_matrix.kuanlin.20160214.053618.632000\step-0-mapper-sorted
> sort 'c:\temp\synonym_matrix.kuanlin.20160214.053618.632000\step-0-mapper_part-00000'
writing to c:\temp\synonym_matrix.kuanlin.20160214.053618.632000\step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
writing to c:\t

In [2]:
%%writefile synonym_detection_cosine.py
from mrjob.job import MRJob
from mrjob.protocol import RawValueProtocol
from mrjob.step import MRStep

# synonym detection with cosine similarity

class synonym_detection_cosine(MRJob):
    
    # load up vocabulary
    def mapper_init(self):
        self.vocab = []
        for line in open('vocabs.txt', 'r'):
            lineArr = line.strip().split('\t')
            self.vocab.append(lineArr[0].replace('"', '').lower())
            
    def mapper(self, _, line):
        lineArr = line.strip().replace('"', '').split('\t')
        co_word = lineArr[0]
        doc_terms = {}
        # parse the stripes of co-occurring terms
        for term_freq in lineArr[1].split('|'):
            term = term_freq.split(':')[0]
            freq = float(term_freq.split(':')[1])
            doc_terms[term] = freq
        
        # emits components of cosine similarity calculation
        for i, term1 in enumerate(self.vocab):
            if i != len(self.vocab)-1:
                for term2 in self.vocab[i:]:
                    if term1 == term2:
                        continue
                    term1_freq = 0.0
                    term2_freq = 0.0
                    if term1 in doc_terms:
                        term1_freq = doc_terms[term1]
                    if term2 in doc_terms:
                        term2_freq = doc_terms[term2]
                    # collect data into co-occurring terms
                    yield term1 + ":" + term2, (term1_freq**2, term2_freq**2, term1_freq*term2_freq)
    
    # calculates the overall cosine similarity
    def reducer(self, coterms, metric_vals):
        a_b = 0.0
        a_sqr = 0.0
        b_sqr = 0.0
        for data in metric_vals:
            a_b += data[2]
            a_sqr += data[0]
            b_sqr += data[1]
        # a term that appears in no stripe has a zero norm;
        # skip the pair to avoid dividing by zero
        if a_sqr == 0.0 or b_sqr == 0.0:
            return
        cosine_sim = a_b/((a_sqr**0.5)*(b_sqr**0.5))
        # only print out similarity scores above 0.5
        if cosine_sim > 0.5:
            yield coterms, cosine_sim
    
            
if __name__ == '__main__':
    synonym_detection_cosine.run()

Overwriting synonym_detection_cosine.py


In [3]:
 !python synonym_detection_cosine.py bigram_reduced.txt --file vocabs.txt --output-dir synonym_result1 --no-output

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

creating tmp directory c:\temp\synonym_detection_cosine.kuanlin.20160214.070507.920000
writing to c:\temp\synonym_detection_cosine.kuanlin.20160214.070507.920000\step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to c:\temp\synonym_detection_cosine.kuanlin.20160214.070507.920000\step-0-mapper-sorted
> sort 'c:\temp\synonym_detection_cosine.kuanlin.20160214.070507.920000\step-0-mapper_part-00000'
writing to c:\temp\synonym_detection_cosine.kuanlin.20160214.070507.920000\step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving c:\temp\synonym_detection_cosine.kuanlin.20160214.070507.920000\step-0-

In [9]:
%%writefile synonym_detection_jaccard.py
from mrjob.job import MRJob
from mrjob.protocol import RawValueProtocol
from mrjob.step import MRStep

# synonym detection with Jaccard index

class synonym_detection_jaccard(MRJob):
    
    # load up vocabulary
    def mapper_init(self):
        self.vocab = []
        for line in open('vocabs.txt', 'r'):
            lineArr = line.strip().split('\t')
            self.vocab.append(lineArr[0].replace('"', '').lower())
            
    def mapper(self, _, line):
        lineArr = line.strip().replace('"', '').split('\t')
        co_word = lineArr[0]
        doc_terms = {}
        # parse the stripes of co-occurring terms
        for term_freq in lineArr[1].split('|'):
            term = term_freq.split(':')[0]
            freq = float(term_freq.split(':')[1])
            doc_terms[term] = freq
        
        # emits components of jaccard similarity calculation
        for i, term1 in enumerate(self.vocab):
            if i != len(self.vocab)-1:
                for term2 in self.vocab[i:]:
                    if term1 == term2:
                        continue
                    term1_score = 0
                    term2_score = 0
                    if term1 in doc_terms:
                        if doc_terms[term1] > 0:
                            term1_score = 1
                    if term2 in doc_terms:
                        if doc_terms[term2] > 0:
                            term2_score = 1
                    # collect data into co-occurring terms
                    yield term1 + ":" + term2, (term1_score, term2_score)
    
    # calculates the overall Jaccard similarity
    def reducer(self, coterms, metric_vals):
        a_sum = 0
        b_sum = 0
        a_inter_b = 0
        for data in metric_vals:
            a_sum += data[0]
            b_sum += data[1]
            if data[0] > 0 and data[1] > 0:
                a_inter_b += 1
        union = a_sum + b_sum - a_inter_b
        # neither term appears in any stripe; skip to avoid dividing by zero
        if union == 0:
            return
        jaccard_sim = float(a_inter_b)/float(union)
        # only print out similarity scores above 0.2
        if jaccard_sim > 0.2:
            yield coterms, jaccard_sim
    
            
if __name__ == '__main__':
    synonym_detection_jaccard.run()

Overwriting synonym_detection_jaccard.py


In [10]:
 !python synonym_detection_jaccard.py bigram_reduced.txt --file vocabs.txt --output-dir synonym_result2 --no-output

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration

PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols

creating tmp directory c:\temp\synonym_detection_jaccard.kuanlin.20160214.071851.269000
writing to c:\temp\synonym_detection_jaccard.kuanlin.20160214.071851.269000\step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to c:\temp\synonym_detection_jaccard.kuanlin.20160214.071851.269000\step-0-mapper-sorted
> sort 'c:\temp\synonym_detection_jaccard.kuanlin.20160214.071851.269000\step-0-mapper_part-00000'
writing to c:\temp\synonym_detection_jaccard.kuanlin.20160214.071851.269000\step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving c:\temp\synonym_detection_jaccard.kuanlin.20160214.071851.269000\s