#DATASCI W261: Machine Learning at Scale

#HW7 (Group Z)

- Juanjo Carin, Christopher Llop, Sayantan Satpati
- [juanjose.carin@ischool.berkeley.edu](mailto:juanjose.carin@ischool.berkeley.edu)
- [christopher.llop@ischool.berkeley.edu](mailto:christopher.llop@ischool.berkeley.edu)
- [sayantan.satpati@ischool.berkeley.edu](mailto:sayantan.satpati@ischool.berkeley.edu)
- W261-2
- Week 07
- Submission date: 10/27/2015

#General Description

**In this assignment you will explore networks and develop MRJob code for finding shortest path graph distances. To build up to large data you will develop your code on some very simple, toy networks. After this you will take your developed code forward and modify it and apply it to two larger datasets (performing EDA along the way).**

#Undirected toy network dataset

**In an undirected network all links are symmetric, i.e., for a pair of nodes 'A' and 'B,' both of the links A -> B and B -> A will exist.**

**The toy data are available in a sparse (stripes) representation:**

    (node) \t (dictionary of links)

**on AWS via the url: s3://ucb-mids-mls-networks/undirected_toy.txt**

**In the dictionary, target nodes are keys, link weights are values (here, all weights are 1, i.e., the network is unweighted).**

#Directed toy network dataset

**In a directed network all links are not necessarily symmetric, i.e., for a pair of nodes 'A' and 'B,' it is possible for only one of A -> B or B -> A to exist.**

**These toy data are available in a sparse (stripes) representation:**

    (node) \t (dictionary of links)

**on AWS via the url: s3://ucb-mids-mls-networks/directed_toy.txt**

**In the dictionary, target nodes are keys, link weights are values (here, all weights are 1, i.e., the network is unweighted).**
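
A minimal sketch of how one record in this stripes format could be parsed, assuming the dictionary is written as a Python literal. The sample record is made up for illustration, and `parse_stripe` is a hypothetical helper name, not part of the assignment.

```python
import ast

def parse_stripe(line):
    """Split one 'node<TAB>dict' record into (node, {neighbor: weight})."""
    node, links = line.rstrip('\n').split('\t', 1)
    # literal_eval evaluates the dictionary literal without running arbitrary code
    return node, ast.literal_eval(links)

# Hypothetical record, made up for illustration
node, links = parse_stripe("1\t{'2': 1, '6': 1}\n")
print(node)
print(sorted(links.items()))
```

Using `ast.literal_eval` rather than `eval` keeps the parsing safe if a line contains something other than a plain dictionary.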

#HW7.0: Shortest path graph distances (toy networks)

**In this part of your assignment you will develop the base of your code for the week.**

**Write MRJob classes to find shortest path graph distances, as described in the lectures. In addition to finding the distances, your code should also output a distance-minimizing path between the source and target. Work locally for this part of the assignment, and use both of the undirected and directed toy networks.**

**To prove your code's function, run the following jobs:**

- **shortest path in the undirected network from node 1 to node 4**
    - **Solution: 1,5,4**
- **shortest path in the directed network from node 1 to node 5**
    - **Solution: 1,2,4,5**

**and report your output---make sure it is correct!**

In [3]:
from collections import defaultdict
class Graph:
    def __init__(self):
        self.nodes = set()
        self.edges = defaultdict(list)
        self.distances = {}
 
    def add_node(self, value):
        self.nodes.add(value)
 
    def add_edge(self, node1, node2, distance = 1,direct = True):
        self.edges[node1].append(node2)
        self.distances[(node1, node2)] = distance
        if not direct:
            self.edges[node2].append(node1)
            self.distances[(node2, node1)] = distance
 
 
def dijsktra(graph, initial):
    visited = {initial: 0}
    nodes = set(graph.nodes)
    while nodes:
        min_node = None
        for node in nodes:
            if node in visited:
                if min_node is None:
                    min_node = node
                elif visited[node] < visited[min_node]:
                    min_node = node
        if min_node is None:
            break
        nodes.remove(min_node)
        current_weight = visited[min_node]
        for neighbour in graph.edges[min_node]:
            try:
                weight = current_weight + graph.distances[(min_node, neighbour)]
            except KeyError:
                # No distance recorded for this edge: skip it
                continue
            if neighbour not in visited or weight < visited[neighbour]:
                visited[neighbour] = weight
    return visited

In [19]:
import ast
g = Graph()
nodes = []
edges = []
with open('directed_toy.txt', 'r') as myfile:
    for line in myfile:
        word = line.split('\t')
        nodes.append(word[0])
        # Avoid shadowing the built-in dict type
        links = ast.literal_eval(word[1])
        for k in links.keys():
            edges.append((word[0],k,links[k]))  
print nodes
print edges
#nodes = ['A', 'B', 'C', 'D', 'E', 'F']
#edges = [('A', 'B', 1), ('A', 'C', 5), ('B', 'C', 2), ('C', 'D', 4), ('C', 'E', 3), ('D', 'F', 5), ('F', 'C', 3)]
for node in nodes:
    g.add_node(node)
for edge in edges:
    g.add_edge(*edge)
dijsktra(g, '1')

['1', '2', '3', '4', '5']
[('1', '2', 1), ('1', '6', 1), ('2', '1', 1), ('2', '3', 1), ('2', '4', 1), ('3', '2', 1), ('3', '4', 1), ('4', '2', 1), ('4', '5', 1), ('5', '1', 1), ('5', '2', 1), ('5', '4', 1)]


{'1': 0, '2': 1, '3': 2, '4': 2, '5': 3, '6': 1}
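
The dictionary returned by `dijsktra` holds distances only, while the assignment also asks for a distance-minimizing path. One way to recover it is to record each node's predecessor and walk back from the target. The following is a local sketch, not the MRJob solution: it uses a heap-based Dijkstra, the adjacency dictionary `GRAPH` is transcribed from the edge list printed above, and `shortest_path` is a hypothetical helper name.

```python
import heapq

# Directed toy network, transcribed from the edge list printed above
GRAPH = {
    '1': {'2': 1, '6': 1},
    '2': {'1': 1, '3': 1, '4': 1},
    '3': {'2': 1, '4': 1},
    '4': {'2': 1, '5': 1},
    '5': {'1': 1, '2': 1, '4': 1},
}

def shortest_path(graph, source, target):
    """Return (distance, path) from source to target, or (inf, []) if unreachable."""
    dist = {source: 0}
    prev = {}                      # prev[n] = node preceding n on a shortest path
    heap = [(0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue               # stale heap entry
        done.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            new_d = d + weight
            if neighbor not in dist or new_d < dist[neighbor]:
                dist[neighbor] = new_d
                prev[neighbor] = node
                heapq.heappush(heap, (new_d, neighbor))
    if target not in dist:
        return float('inf'), []
    # Walk the predecessor chain back from the target
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

distance, path = shortest_path(GRAPH, '1', '5')
print(distance)
print(','.join(path))
```

For the directed toy network this yields the path 1,2,4,5 at distance 3, which agrees with the solution given in the assignment text.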

In [31]:
source = '1'
SSSP = {}
for node in nodes:
    if node == source:
        SSSP[node] = 0
    else:
        SSSP[node] = float('inf')
Frontiers = [source]
print Frontiers
print SSSP

import ast
emul_reducer = {}
with open('directed_toy.txt', 'r') as myfile:
    for line in myfile:
        line = line.split('\t')
        node = line[0]
        sink = ast.literal_eval(line[1])
        for sink_node in sink.keys():
            if node in Frontiers:
                #yield sink_node, SSSP[node] + sink[sink_node]
                emul_reducer[sink_node] = SSSP[node] + sink[sink_node]

for k in emul_reducer.keys():
    print k, emul_reducer[k]

['1']
{'1': 0, '3': inf, '2': inf, '5': inf, '4': inf}
2 1
6 1
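
The cell above emulates a single map step for the initial frontier. Repeating that step until the frontier is empty gives the full single-source distances. Below is a plain-Python sketch of that iteration, assuming non-negative weights; `GRAPH` is transcribed from the edge list printed earlier, and `expand_frontier` is a hypothetical helper name rather than part of the MRJob code.

```python
# Directed toy network, transcribed from the edge list printed earlier
GRAPH = {
    '1': {'2': 1, '6': 1},
    '2': {'1': 1, '3': 1, '4': 1},
    '3': {'2': 1, '4': 1},
    '4': {'2': 1, '5': 1},
    '5': {'1': 1, '2': 1, '4': 1},
}

def expand_frontier(graph, dist, frontier):
    """One emulated map/reduce pass: relax every edge leaving the frontier."""
    candidates = {}
    for node in frontier:
        # "Map": emit a candidate distance for each neighbor of a frontier node
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist[node] + weight
            # "Reduce": keep the minimum candidate per neighbor
            if candidate < candidates.get(neighbor, float('inf')):
                candidates[neighbor] = candidate
    # Nodes whose distance improved form the next frontier
    next_frontier = []
    for node, candidate in candidates.items():
        if candidate < dist.get(node, float('inf')):
            dist[node] = candidate
            next_frontier.append(node)
    return next_frontier

dist = {'1': 0}
frontier = ['1']
iterations = 0
while frontier:
    frontier = expand_frontier(GRAPH, dist, frontier)
    iterations += 1
print(iterations)
print(sorted(dist.items()))
```

The loop stops once a pass improves no distance, which is the same stopping condition an MRJob driver could check between iterations.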


#HW5.1

1. **In the database world, what is 3NF?**
2. **Does machine learning use data in 3NF? If so why?**
3. **In what form does ML consume data?**
4. **Why would one use log files that are denormalized?**

1. **3NF** (Third Normal Form) is a level of normalization, i.e., of organizing data into columns (attributes) and tables (relations) so as to minimize redundancy by decomposing a flat table into smaller relational tables. A relation R is in 3NF when:
    * R is in 2NF:
        * every non-prime attribute of the table (i.e., one that does not belong to any candidate key of the table) is dependent on the whole of every candidate key;
    * every non-prime attribute of R is non-transitively dependent on every candidate key of R.

  Requiring existence of "the key" ensures that the table is in 1NF; requiring that non-key attributes be dependent on "the whole key" ensures 2NF; further requiring that non-key attributes be dependent on "nothing but the key" ensures 3NF.

2. To solve **Machine Learning** problems we usually do not use data in **3NF**, because the information in each table alone does not give the "full picture": we need to **denormalize** the **data** first, joining or aggregating tables, to be able to answer typical questions from a Machine Learning perspective (which involve all the dimensions at hand).

3. As mentioned in the previous point, ML algorithms use **denormalized data**. This is because most of those algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent), so we need all the information for each observation in one place.

4. For the reason given above: a denormalized log file includes, for each observation (log entry), all the information (variables) we are going to use when applying an ML algorithm.
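
To make the contrast in answers 2 and 3 concrete, here is a small sketch of denormalizing two normalized tables into the flat rows an ML algorithm would consume. The table names, columns, and values are all made up for illustration, and `denormalize` is a hypothetical helper.

```python
# Normalized (3NF-style) tables; all values are made up for illustration
customers = {1: {'city': 'Berkeley'}, 2: {'city': 'Oakland'}}
orders = [
    {'order_id': 10, 'customer_id': 1, 'amount': 25.0},
    {'order_id': 11, 'customer_id': 2, 'amount': 40.0},
    {'order_id': 12, 'customer_id': 1, 'amount': 15.0},
]

def denormalize(orders, customers):
    """Join each order with its customer's attributes into one flat row."""
    rows = []
    for order in orders:
        row = dict(order)                             # copy the order fields
        row.update(customers[order['customer_id']])   # add the customer fields
        rows.append(row)
    return rows

flat = denormalize(orders, customers)
for row in flat:
    print(sorted(row.items()))
```

Each flat row repeats the customer's city, which is exactly the redundancy 3NF removes, but it lets an algorithm read every feature of an observation from a single record.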

#HW5.2

**Using MRJob, implement a hashside join (memory-backed map-side) for left, right and inner joins. Run your code on the  data used in HW 4.4: (Recall HW 4.4: Find the most frequent visitor of each page using mrjob and the output of 4.2 (i.e., transformed log file). In this output please include the webpage URL, webpageID and Visitor ID.)**

**Justify which table you chose as the Left table in this hashside join.**

**Please report the number of rows resulting from:**

1. **Inner joining Table Left with Table Right**

2. **Right joining Table Left with Table Right**

3. **Left joining Table Left with Table Right**

(I've reversed the order mentioned in the Instructions, so each new join adds a bit of complexity over the previous).

### Create Left and Right Tables
Since I included the URLs in the transformed log file, I will generate both tables from scratch.

Recall that the lines in the original file have this form:

    ...
    A,1100,1,"MS in Education","/education"
    A,1210,1,"SNA Support","/snasupport"
    C,"10001",10001
    V,1000,1
    V,1001,1
    V,1002,1
    C,"10002",10002
    V,1001,1
    V,1003,1
    ...

I.e., all the webpage IDs (the primary key) are listed first with their URLs; then each visitor ID appears, followed by the webpages that visitor viewed.

In [1]:
import urllib2
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/' +\
    'anonymous-msweb.data'
import os
os.chdir('/home/hduser/Dropbox/W261/HW5')
# Two counters to keep track of number of distinct webpages and visitors
A = 0
C = 0

with open('TableLeft.txt', 'w') as TL, open('TableRight.txt', 'w') as TR:
    for line in urllib2.urlopen(url):
        record = line.strip().split(',')
        record = [x.strip('"') for x in record]
        # If the record corresponds to an attribute, link webpage ID with URL
        if record[0] == 'A':
            A += 1
            key = record[1] # webpage ID
            value = record[4] # webpage URL
            TL.write(key + ',' + value + '\n')
        # If the record corresponds to a case (visitor), save that info...
        elif record[0] == 'C':
            C += 1
            value = record[1]
        # ... and pass it to the Vroot (i.e., link visitor ID and webpage ID)
        elif record[0] == 'V':
            key = record[1]
            TR.write(key + ',' + value + '\n')
            
print 'Training Instances  {}'.format(C)
print 'Attributes  {}'.format(A)

Training Instances  32711
Attributes  294


According to [https://kdd.ics.uci.edu/databases/msweb/msweb.data.html](https://kdd.ics.uci.edu/databases/msweb/msweb.data.html) there were:

`Training Instances  32711
Attributes  294`

Exploratory analysis of the 2 tables:

In [2]:
# Number of lines in TableLeft.txt
!echo "Number of webpages:         "$(cat TableLeft.txt | wc -l)
# Number of unique visitor IDs in TableRight.txt
!echo "Number of visitors:         "$(cat TableRight.txt | cut -d',' -f2 | \
                                      uniq | wc -l)
# Number of unique webpage IDs in TableRight.txt
    # (sort before finding unique values: they have to be adjacent)
!echo "Number of webpages visited: "$(cat TableRight.txt | cut -d',' -f1 | \
                                      sort | uniq | wc -l)
# Number of lines in TableRight.txt
!echo "Number of visits:           "$(cat TableRight.txt | wc -l)

Number of webpages:         294
Number of visitors:         32711
Number of webpages visited: 285
Number of visits:           98654


As we already saw in HW4, 9 webpages were not visited.
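
Before the MRJob versions, the three joins can be sketched in plain Python. The idea is that the small table (294 webpages) is held in memory as a dictionary while the large one (98654 visits) is streamed, which is presumably why the webpage table plays the role of the left table here. The miniature tables below are made up for illustration, and `hash_join` is a hypothetical helper, not the code used in the rest of this notebook.

```python
def hash_join(left, right, how='inner'):
    """Join streamed right-side records against an in-memory left table.

    left  : dict mapping webpage ID -> URL (small table, held in memory)
    right : list of (webpage ID, visitor ID) pairs (large table, streamed)
    how   : 'inner', 'left' or 'right'
    """
    matched = set()
    rows = []
    for key, visitor in right:
        if key in left:
            matched.add(key)
            rows.append((key, left[key], visitor))
        elif how == 'right':
            # Keep right-side records with no match on the left
            rows.append((key, None, visitor))
    if how == 'left':
        # Add left-side records that never matched
        for key in sorted(set(left) - matched):
            rows.append((key, left[key], None))
    return rows

# Made-up miniature tables for illustration
left = {'1000': '/regwiz', '1287': '/autoroute'}
right = [('1000', '10001'), ('1000', '10010'), ('9999', '10001')]

for how in ('inner', 'right', 'left'):
    print(how)
    print(hash_join(left, right, how))
```

Unmatched fields are filled with `None`, the same convention the EDA script further down relies on when it greps for `None`.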

##HW5.2.1: Inner

### Create MRJob task for Inner Joins

In [3]:
%%writefile HashSideInnerJoin.py
from mrjob.job import MRJob
import csv
    
class HashSideInnerJoin(MRJob):

    def mapper_init(self):
        # Load left-side table in memory as dictionary
        self.TL = {}
        # The file is shipped with the job through --file, so it can be
        # opened by name from the task's working directory
        for key, value in csv.reader(open("TableLeft.txt", "r")):
            # key = webpage ID, value = webpage URL
            self.TL[key] = value   
        
    def mapper(self, _, line):
        # Iterate over the right-side table, a record at a time
        TRrecord = line.split(",")
        key = TRrecord[0]
        value_visitor = TRrecord[1]
        # Look for each record, in the left-side table (in-memory)
        if key in self.TL:
            yield key, (self.TL[key], value_visitor)
    
    # The reducer is optional. If not specified, I found out records are not 
        # sorted by webpage ID
    def reducer(self, key, value):
        for val_url, val_visitor in value:
            yield key, (val_url, val_visitor)
            
if __name__ == '__main__':
    HashSideInnerJoin.run()

Overwriting HashSideInnerJoin.py


### Create Python script to execute any of the 3 types of Join

In [19]:
%%writefile HW52.py
#!/home/hduser/anaconda/bin/python
import sys
JoinType = sys.argv[1]

# Import the class

if JoinType == 'Inner':
    from HashSideInnerJoin import HashSideInnerJoin
    JoinClass = 'HashSideInnerJoin'
    output = 'InnerJoinTable.txt'
elif JoinType == 'Right':
    from HashSideRightJoin import HashSideRightJoin
    JoinClass = 'HashSideRightJoin'
    output = 'RightJoinTable.txt'
elif JoinType == 'Left':
    from HashSideLeftJoin import HashSideLeftJoin
    JoinClass = 'HashSideLeftJoin'
    output = 'LeftJoinTable.txt'
else:
    raise ValueError('USE Inner, Right, OR Left AS ARGUMENTS')
    
# Use the 2 tables, left-side as second argument (to be loaded by mapper_init)
mr_job = eval(JoinClass)(args=['TableRight.txt', '--file=TableLeft.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    # Create the join table
    with open(output,'w') as result:
        for line in runner.stream_output():
            webpageID = str(mr_job.parse_output_line(line)[0])
            # Extract webpage URL and visitor ID from value
            webpageURL = mr_job.parse_output_line(line)[1][0]
            visitorID = str(mr_job.parse_output_line(line)[1][1])
            result.writelines(webpageID + ',' + webpageURL + ',' + visitorID 
                              +'\n')
    result.close()

Overwriting HW52.py


###Call Python Script with Inner Join

In [20]:
!chmod a+x HW52.py
!./HW52.py Inner

###Create bash script for EDA of the joint table

In [39]:
%%writefile EDA_HW52.sh
joinTable=$1

echo "Number of webpage IDs:                            "\
    $(cut -d, -f1 $joinTable | grep -v None | sort | uniq | wc -l)
echo "Number of webpage URLs:                           "\
    $(cut -d, -f1,2 $joinTable | grep -v None | sort  | uniq | wc -l)
echo "Number of webpages with no associated webpage URL:"\
    $(cut -d, -f1,2 $joinTable | grep None | sort | uniq | wc -l)
echo "Number of webpages visited:                       "\
    $(cut -d, -f1,3 $joinTable | grep -v None | cut -d, -f1 | sort | uniq | \
      wc -l)
echo "Number of records:                                "\
    $(wc -l < $joinTable)
echo "Number of visits:                                 "\
    $(cut -d, -f3 $joinTable | grep -v None | wc -l)
echo "Number of webpages with no associated visitor ID: "\
    $(cut -d, -f1,3 $joinTable | grep None | sort | uniq | wc -l)
echo "Number of visitors (IDs):                         "\
    $(cut -d, -f3 $joinTable | grep -v None | sort | uniq | wc -l)
if [ $(grep None $joinTable | wc -l) != 0 ]; then \
    echo -e "Webpages with no visits or URL:\n$(grep None $joinTable | \
    sort | sed 's/^/\t/')"; fi

Overwriting EDA_HW52.sh


###EDA and output of the inner join

Exploratory analysis of the joint table:

In [40]:
!chmod a+x EDA_HW52.sh
!./EDA_HW52.sh InnerJoinTable.txt

Number of webpage IDs:                             285
Number of webpage URLs:                            285
Number of webpages with no associated webpage URL: 0
Number of webpages visited:                        285
Number of records:                                 98654
Number of visits:                                  98654
Number of webpages with no associated visitor ID:  0
Number of visitors (IDs):                          32711


Of course, since this is an inner join, all webpage URLs are matched with visitor IDs and vice versa.

In [8]:
!head -50 InnerJoinTable.txt

1000,/regwiz,10001
1000,/regwiz,10010
1000,/regwiz,10039
1000,/regwiz,10073
1000,/regwiz,10087
1000,/regwiz,10101
1000,/regwiz,10132
1000,/regwiz,10141
1000,/regwiz,10154
1000,/regwiz,10162
1000,/regwiz,10166
1000,/regwiz,10201
1000,/regwiz,10218
1000,/regwiz,10220
1000,/regwiz,10324
1000,/regwiz,10348
1000,/regwiz,10376
1000,/regwiz,10384
1000,/regwiz,10409
1000,/regwiz,10429
1000,/regwiz,10454
1000,/regwiz,10457
1000,/regwiz,10471
1000,/regwiz,10497
1000,/regwiz,10511
1000,/regwiz,10520
1000,/regwiz,10541
1000,/regwiz,10564
1000,/regwiz,10599
1000,/regwiz,10752
1000,/regwiz,10756
1000,/regwiz,10861
1000,/regwiz,10935
1000,/regwiz,10943
1000,/regwiz,10969
1000,/regwiz,11027
1000,/regwiz,11050
1000,/regwiz,11410
1000,/regwiz,11429
1000,/regwiz,11440
1000,/regwiz,11490
1000,/regwiz,11501
1000,/regwiz,11528
1000,/regwiz,11539
1000,/regwiz,11544
1000,/regwiz,11685
1000,/regwiz,11695
1000,/regwiz,11723
1000,/regwiz,11766
1000,/regwiz,11774


##HW5.2.2: Right

### Create MRJob task for Right Joins

In [9]:
%%writefile HashSideRightJoin.py
from mrjob.job import MRJob
import csv
    
class HashSideRightJoin(MRJob):

    def mapper_init(self):
        # Load left-side table in memory as dictionary
        self.TL = {}
        # The file is shipped with the job through --file, so it can be
        # opened by name from the task's working directory
        for key, value in csv.reader(open("TableLeft.txt", "r")):
            # key = webpage ID, value = webpage URL
            self.TL[key] = value   
        
    def mapper(self, _, line):
        # Iterate over the right-side table, a record at a time
        TRrecord = line.split(",")
        key = TRrecord[0]
        value_visitor = TRrecord[1]
        # Look for each record, in the left-side table (in-memory)
        if key in self.TL:
            yield key, (self.TL[key], value_visitor)
        # And if there's no match, include the visitor info anyway
        else:
            yield key, (None, value_visitor)
    
    # The reducer is optional. If not specified, I found out records are not 
        # sorted by webpage ID
    def reducer(self, key, value):
        for val_url, val_visitor in value:
            yield key, (val_url, val_visitor)
            
if __name__ == '__main__':
    HashSideRightJoin.run()

Overwriting HashSideRightJoin.py


###Call Python Script with Right Join

In [10]:
!python HW52.py Right

###EDA and output of the right join

Exploratory analysis of the joint table:

In [37]:
!./EDA_HW52.sh RightJoinTable.txt

Number of webpage IDs:                             285
Number of webpage URLs:                            285
Number of webpages visited:                        285
Number of webpages with no associated webpage URL: 0
Number of records:                                 98654
Number of visits:                                  98654
Number of webpages with no associated visitor ID:  0
Number of visitors (IDs):                          32711


Since this is a right join, we might have found some visits not matched with any URL, but that's not the case, because all primary keys (the webpage IDs) appear in the left-side table.

In [12]:
!head -50 RightJoinTable.txt 

1000,/regwiz,10001
1000,/regwiz,10010
1000,/regwiz,10039
1000,/regwiz,10073
1000,/regwiz,10087
1000,/regwiz,10101
1000,/regwiz,10132
1000,/regwiz,10141
1000,/regwiz,10154
1000,/regwiz,10162
1000,/regwiz,10166
1000,/regwiz,10201
1000,/regwiz,10218
1000,/regwiz,10220
1000,/regwiz,10324
1000,/regwiz,10348
1000,/regwiz,10376
1000,/regwiz,10384
1000,/regwiz,10409
1000,/regwiz,10429
1000,/regwiz,10454
1000,/regwiz,10457
1000,/regwiz,10471
1000,/regwiz,10497
1000,/regwiz,10511
1000,/regwiz,10520
1000,/regwiz,10541
1000,/regwiz,10564
1000,/regwiz,10599
1000,/regwiz,10752
1000,/regwiz,10756
1000,/regwiz,10861
1000,/regwiz,10935
1000,/regwiz,10943
1000,/regwiz,10969
1000,/regwiz,11027
1000,/regwiz,11050
1000,/regwiz,11410
1000,/regwiz,11429
1000,/regwiz,11440
1000,/regwiz,11490
1000,/regwiz,11501
1000,/regwiz,11528
1000,/regwiz,11539
1000,/regwiz,11544
1000,/regwiz,11685
1000,/regwiz,11695
1000,/regwiz,11723
1000,/regwiz,11766
1000,/regwiz,11774


##HW5.2.3: Left

### Create MRJob task for Left Joins

In [23]:
%%writefile HashSideLeftJoin.py
from mrjob.job import MRJob
import csv
    
class HashSideLeftJoin(MRJob):

    def __init__(self, *args, **kwargs):
        super(HashSideLeftJoin, self).__init__(*args, **kwargs)
        self.TLkeys = []

    def mapper_init(self):
        # Load left-side table in memory as dictionary
        self.TL = {}
        # The file is shipped with the job through --file, so it can be
        # opened by name from the task's working directory
        for key, value in csv.reader(open("TableLeft.txt", "r")):
            # key = webpage ID, value = webpage URL
            self.TL[key] = value   
            self.TLkeys.append(key)
        
    def mapper(self, _, line):
        # Iterate over the right-side table, a record at a time
        TRrecord = line.split(",")
        key = TRrecord[0]
        value_visitor = TRrecord[1]
        # Look for each record, in the left-side table (in-memory)
        if key in self.TL:
            try:
                self.TLkeys.remove(key)
            except ValueError:
                pass
            yield key, (self.TL[key], value_visitor)
    
    def mapper_final(self):
        # Emit the left-side records that were never matched by this mapper
        # (only complete when a single mapper reads the whole right table)
        for key in self.TLkeys:
            yield key, (self.TL[key], None)
    
    def reducer(self, key, value):
        for val_url, val_visitor in value:
            yield key, (val_url, val_visitor)
            
if __name__ == '__main__':
    HashSideLeftJoin.run()

Overwriting HashSideLeftJoin.py


###Call Python Script with Left Join

In [24]:
!python HW52.py Left

###EDA and output of the left join

Exploratory analysis of the joint table:

In [38]:
!./EDA_HW52.sh LeftJoinTable.txt

Number of webpage IDs:                             294
Number of webpage URLs:                            294
Number of webpages visited:                        285
Number of webpages with no associated webpage URL: 0
Number of records:                                 98663
Number of visits:                                  98654
Number of webpages with no associated visitor ID:  9
Number of visitors (IDs):                          32711
Webpages with no visits or URL:
	1287,/autoroute,None
	1288,/library,None
	1289,/masterchef,None
	1290,/devmovies,None
	1291,/news,None
	1292,/northafrica,None
	1293,/encarta,None
	1294,/bookshelf,None
	1297,/centroam,None


As expected, this joint table contains 9 more records than the other two, since 9 URLs are not matched with any visitor IDs.

In [16]:
!head -50 LeftJoinTable.txt

1000,/regwiz,10001
1000,/regwiz,10010
1000,/regwiz,10039
1000,/regwiz,10073
1000,/regwiz,10087
1000,/regwiz,10101
1000,/regwiz,10132
1000,/regwiz,10141
1000,/regwiz,10154
1000,/regwiz,10162
1000,/regwiz,10166
1000,/regwiz,10201
1000,/regwiz,10218
1000,/regwiz,10220
1000,/regwiz,10324
1000,/regwiz,10348
1000,/regwiz,10376
1000,/regwiz,10384
1000,/regwiz,10409
1000,/regwiz,10429
1000,/regwiz,10454
1000,/regwiz,10457
1000,/regwiz,10471
1000,/regwiz,10497
1000,/regwiz,10511
1000,/regwiz,10520
1000,/regwiz,10541
1000,/regwiz,10564
1000,/regwiz,10599
1000,/regwiz,10752
1000,/regwiz,10756
1000,/regwiz,10861
1000,/regwiz,10935
1000,/regwiz,10943
1000,/regwiz,10969
1000,/regwiz,11027
1000,/regwiz,11050
1000,/regwiz,11410
1000,/regwiz,11429
1000,/regwiz,11440
1000,/regwiz,11490
1000,/regwiz,11501
1000,/regwiz,11528
1000,/regwiz,11539
1000,/regwiz,11544
1000,/regwiz,11685
1000,/regwiz,11695
1000,/regwiz,11723
1000,/regwiz,11766
1000,/regwiz,11774


#HW5.3

## See the other notebooks

In [12]:
!python Longest_driver.py
# No need to print first values: there is only one

155	ROPLEZIMPREDASTRODONBRASLPKLSON YHROACLMPARCHEYXMMIOUDAVESAURUS PIOFPILOCOWERSURUASOGETSESNEGCP TYRAVOPSIFENGOQUAPIALLOBOSKENUO OWINFUYAIOKENECKSASXHYILPOYNUAT
155	AIOPJUMRXUYVASLYHYPSIBEMAPODIKR UFRYDIUUOLBIGASUAURUSREXLISNAYE RNOONDQSRUNSUBUNOUGRABBERYAIRTC UTAHRAPTOREDILEIPMILBDUMMYUVERI SYEVRAHVELOCYALLOSAURUSLINROTSR


In [106]:
!./FreqUnigrams_driver.py | sort -rn # Sort the 10 items in reverse order

5375699242	the
3691308874	of
2221164346	to
1387638591	in
1342195425	a
1135779433	and
798553959	that
756296656	is
688053106	be
481373389	as


In [107]:
!sort -rn DenseUnigrams.txt | head -50

11.557291666666666	"xxxx"
10.161726044782885	"NA"
8.0741599073001158	"blah"
7.5333333333333332	"nnn"
6.5611436445056839	"nd"
5.4073642846747196	"ND"
4.921875	"oooooooooooooooo"
4.7272727272727275	"PIC"
4.5116279069767442	"llll"
4.3494983277591972	"LUTHER"
4.2072378595731514	"oooooo"
4.0908402725208175	"NN"
3.9492846924177396	"ooooo"
3.9313725490196076	"OOOOOO"
3.7877030162412995	"IIII"
3.7624521072796937	"lillelu"
3.6570701447431206	"OOOOO"
3.6065624999999999	"Sc"
3.5769230769230771	"Pfeffermann"
3.5769230769230771	"Madarassy"
3.5600000000000001	"Meteoritical"
3.5364916773367479	"Undecided"
3.505639097744361	"Lib"
3.5	"xxxxxxxx"
3.4791318864774623	"ri"
3.3750684931506849	"Vir"
3.2390171258376768	"DREAM"
3.2290388548057258	"beep"
3.1886792452830188	"Latha"
3.1883175058233291	"MARTIN"
3.1699346405228757	"Lis"
3.1147458480120784	"Ac"
3.0371428571428569	"OUTPUT"
3.0222222222222221	"HENNESSY"
3.0	"ALLIS"
2.9191176470588234	"IYENGAR"
2.8698912704670052	"ft

In [98]:
!echo -e "Number of unigrams that never appear more than once in a page: "\
    $(grep $'1.0\t' DenseUnigrams.txt | wc -l)"\n"
!sort -rn DenseUnigrams.txt | tail -20

Number of unigrams that never appear more than once in a page: 166114

1.0	"Aana"
1.0	"AAN"
1.0	"Aan"
1.0	"aame"
1.0	"AAMC"
1.0	"Aaltonen"
1.0	"AAL"
1.0	"aahs"
1.0	"AAHPERD"
1.0	"aahed"
1.0	"aah"
1.0	"Aagje"
1.0	"AAFES"
1.0	"AAE"
1.0	"Aadam"
1.0	"AACVPR"
1.0	"AACP"
1.0	"AAAE"
1.0	"AAAA"
1.0	"aA"


#HW5.4

**In this part of the assignment we will focus on developing methods for detecting synonyms, using the Google 5-grams dataset. To accomplish this you must script two main tasks using MRJob:**

1. **Build stripes of word co-occurrence for the top 10,000 most frequently appearing words across the entire set of 5-grams, and output to a file in your bucket on s3 (bigram analysis, though the words are non-contiguous).**

2. **Using two (symmetric) comparison methods of your choice (e.g., correlations, distances, similarities), pairwise compare  all stripes (vectors), and output to a file in your bucket on s3.**

> Design notes for (1)

> For this task you will be able to modify the pattern we used in HW 3.2 (feel free to use the solution as reference). To total the word counts across the 5-grams, output the support from the mappers using the total order inversion pattern:

    > <*word,count>

> to ensure that the support arrives before the cooccurrences.

> In addition to ensuring the determination of the total word counts, the mapper must also output co-occurrence counts for the pairs of words inside of each 5-gram. Treat these words as a basket, as we have in HW 3, but count all stripes or pairs in both orders, i.e., count both orderings: (word1,word2), and (word2,word1), to preserve symmetry in our output for (2).
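
The basket idea in the note above can be sketched as follows for a single 5-gram record. This is an illustration only: `cooccurrence_stripes` is a hypothetical helper, the sample record and vocabulary are made up, and treating the 5-gram as a set (so a repeated word is counted once) is an assumption about what "basket" means.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence_stripes(ngram, count, vocabulary):
    """Build {word: {co-word: count}} stripes for one 5-gram record."""
    # Basket semantics (assumption): each distinct frequent word counts once
    words = sorted(set(w for w in ngram.split() if w in vocabulary))
    stripes = defaultdict(lambda: defaultdict(int))
    # permutations(..., 2) yields both (w1, w2) and (w2, w1), keeping symmetry
    for w1, w2 in permutations(words, 2):
        stripes[w1][w2] += count
    return dict((w, dict(s)) for w, s in stripes.items())

# A set gives constant-time membership tests for the 10,000 frequent words
vocabulary = set(['a', 'b', 'c'])
stripes = cooccurrence_stripes('a b x c a', 3, vocabulary)
for word in sorted(stripes):
    print(word)
    print(sorted(stripes[word].items()))
```

Because both orderings are emitted, the stripe for `a` contains `b` with the same count that the stripe for `b` contains `a`, which is the symmetry task (2) depends on.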

> Design notes for (2)

> For this task you will have to determine a method of comparison. Here are a few that you might consider:

> - Spearman correlation
> - Euclidean distance
> - Taxicab (Manhattan) distance
> - Shortest path graph distance (a graph, because our data is symmetric!)
> - Pearson correlation
> - Cosine similarity
> - Kendall correlation
> - ...

> However, be cautioned that some comparison methods are more difficult to parallelize than others, and do not perform more associations than is necessary, since your choice of association will be symmetric.


I've chosen **Manhattan** (or **Taxicab**) **distance** (Euclidean distance could be implemented in the same way, just calculating the square of all differences and then the square root of the sum of those squared differences), which is defined as:

$$d_1(\mathbf{p},\mathbf{q})=\left \|\mathbf{p}-\mathbf{q}  \right \|_1=
\sum_{i=1}^N\left | p_i-q_i \right |$$
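
As a sketch of how this definition applies to stripes, the two functions below compute the Manhattan distance, and the Euclidean variant mentioned above, between sparse vectors stored as dictionaries. A word missing from a stripe is assumed to have count 0; the function names and the sample stripes are made up for illustration.

```python
import math

def manhattan(p, q):
    """L1 distance between two sparse vectors stored as dicts."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

def euclidean(p, q):
    """L2 distance: square the differences, sum them, take the square root."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0) - q.get(k, 0)) ** 2 for k in keys))

# Made-up stripes for illustration
p = {'a': 3, 'b': 1}
q = {'a': 1, 'c': 2}
print(manhattan(p, q))   # |3-1| + |1-0| + |0-2| = 5
print(euclidean(p, q))   # sqrt(4 + 1 + 4) = 3.0
```

Both functions are symmetric in `p` and `q`, so each pair of stripes only needs to be compared once.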

##HW5.4.1

In [4]:
%%writefile HW541_TopN.py
#!/home/hduser/anaconda/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import RawValueProtocol
import re
from operator import itemgetter
from mrjob.compat import get_jobconf_value

class HW541_TopN(MRJob):
    
    OUTPUT_PROTOCOL = RawValueProtocol
    
    def jobconf(self):
        orig_jobconf = super(HW541_TopN, self).jobconf()        
        custom_jobconf = {'mapred.reduce.tasks': '1'}
        combined_jobconf = orig_jobconf
        combined_jobconf.update(custom_jobconf)
        self.jobconf = combined_jobconf
        return combined_jobconf
    
    def configure_options(self):
        super(HW541_TopN, self).configure_options()
        # The number of most frequent unigrams can be configured by
            # the user as an argument
        self.add_passthrough_option('--number_unigrams',  
                                    dest='number_unigrams', type='int', 
                                    default=10)
    
    def steps(self):
        return [MRStep(mapper = self.mapper, combiner = self.combiner,
                       reducer_init = self.reducer_init, 
                       reducer = self.reducer, 
                       reducer_final = self.reducer_final)]
    
    def mapper(self, _, line):
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        # Output the count for each word in the 5-gram
        unigrams = ngram.split()
        for unigram in unigrams:
            yield unigram, int(count)

    def combiner(self, unigram, count):
        yield unigram, sum(count)

    def reducer_init(self):
        self.top = {}

    def reducer(self, unigram, count):
        total = sum(count)
        # If we have not exceeded max size of the dictionary yet
        if len(self.top.keys()) < self.options.number_unigrams:
            self.top[unigram] = total
        # If exceeded, include new unigram only if more frequent than
                # another one previously stored
        else:
            if total > min(self.top.values()):
                # Remove unigram not so frequent
                self.top.pop(min(self.top, key = self.top.get))
                # Add new unigram
                self.top[unigram] = total
    
    def reducer_final(self):
        for unigram in self.top.keys():
            yield None,unigram+'\t'+str(self.top[unigram])

if __name__ == '__main__':
    HW541_TopN.run()

Overwriting HW541_TopN.py
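
The reducer above scans the whole dictionary for its minimum every time a new unigram qualifies. An alternative I could have used is a bounded min-heap, which replaces the smallest entry in logarithmic time. This is a stand-alone sketch, not the code that produced the results in this notebook: `top_n` is a hypothetical helper, and the sample reuses the four counts from the test run below plus one made-up low-frequency word.

```python
import heapq

def top_n(counts, n):
    """Return the n (word, count) pairs with the largest counts."""
    heap = []                                  # min-heap of (count, word)
    for word, count in counts:
        if len(heap) < n:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:
            # Replace the current minimum in O(log n)
            heapq.heapreplace(heap, (count, word))
    return [(word, count) for count, word in sorted(heap, reverse=True)]

# 'zebra' is a made-up entry added to show that low counts are dropped
sample = [('zebra', 12), ('of', 476762), ('I', 805344),
          ('have', 411226), ('the', 633346)]
result = top_n(sample, 4)
for word, count in result:
    print(word)
    print(count)
```

For a one-off selection over data that fits in memory, `heapq.nlargest` would do the same job in a single call.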


In [5]:
!chmod +x HW541_TopN.py

Test the MRJob task:

In [1]:
!./HW541_TopN.py gbooks_filtered_sample.txt --number_unigrams=4 | sort -k2 -rn > Top100_sample.txt
!head -10 Top100_sample.txt
!cut -f1 Top100_sample.txt | sort -n > /tmp/Top10kWords.txt

using configs in /home/hduser/.mrjob.conf
creating tmp directory /tmp/HW541_TopN.hduser.20151013.223953.985203
writing to /tmp/HW541_TopN.hduser.20151013.223953.985203/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/HW541_TopN.hduser.20151013.223953.985203/step-0-mapper-sorted
> sort /tmp/HW541_TopN.hduser.20151013.223953.985203/step-0-mapper_part-00000
writing to /tmp/HW541_TopN.hduser.20151013.223953.985203/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/HW541_TopN.hduser.20151013.223953.985203/step-0-reducer_part-00000 -> /tmp/HW541_TopN.hduser.20151013.223953.985203/output/part-00000
Streaming final output from /tmp/HW541_TopN.hduser.20151013.223953.985203/output
removing tmp directory /tmp/HW541_TopN.hduser.20151013.223953.985203
I	805344
the	633346
of	476762
have	411226


In [106]:
%%writefile HW541_TopN_driver.py
#!/home/hduser/anaconda/bin/python
from HW541_TopN import HW541_TopN

import os

mr_job = HW541_TopN(args=[
        's3://filtered-5grams/', '-r', 'emr', 
        '--number_unigrams=10000',
        '--output-dir=s3://ucb-mids-mls-juanjocarin/Top10k_output',
        '--no-output'])

with mr_job.make_runner() as runner: 
    runner.run()

os.system("aws s3 cp s3://ucb-mids-mls-juanjocarin/Top10k_output/part-00000 \
    ./Top10k.txt")
os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Top10k_output/part-00000")
os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Top10k_output/_SUCCESS")

Overwriting HW541_TopN_driver.py


In [107]:
!chmod +x HW541_TopN_driver.py

In [228]:
!./HW541_TopN_driver.py

download: s3://ucb-mids-mls-juanjocarin/Top10k_output/part-00000 to ./Top10k.txt
delete: s3://ucb-mids-mls-juanjocarin/Top10k_output/part-00000
delete: s3://ucb-mids-mls-juanjocarin/Top10k_output/_SUCCESS


Keep only the frequent words, not their counts:

In [234]:
!cut -f1 Top10k.txt | sort -n > Top10kWords.txt

In [7]:
%%writefile HW541.py
#!/home/hduser/anaconda/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import RawValueProtocol
import re
from itertools import combinations 
from operator import itemgetter

class HW541(MRJob):

    # Keep two separate lists of the frequent unigrams (one for the mapper,
        # one for the reducer); with a single shared list the entries get
        # duplicated when the job is run locally
    TopFrequentUnigramsM = []
    TopFrequentUnigramsR = []

    def jobconf(self):
        orig_jobconf = super(HW541, self).jobconf()        
        custom_jobconf = {'mapred.reduce.tasks': '1'}
        combined_jobconf = orig_jobconf
        combined_jobconf.update(custom_jobconf)
        self.jobconf = combined_jobconf
        return combined_jobconf
    
    def steps(self):
        return [MRStep(mapper_init = self.mapper_init, 
                       mapper = self.mapper, combiner = self.combiner, 
                       reducer_init = self.reducer_init, 
                       reducer = self.reducer)]
    
    ## pull in the top occurring words dictionary here for the mapper
    def mapper_init(self):
        # Top10kWords.txt is shipped to the tasks with the --file option
        with open("Top10kWords.txt","r") as f:
            for unigram in f:
                unigram = unigram.strip()
                self.TopFrequentUnigramsM.append(unigram)
        
    def mapper(self, _, line):
        cooccur = {}
        # str.strip() returns a new string, so the result must be reassigned
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        # Output the count for each word in the 5-gram
        unigrams = ngram.split()
        # Get all of the 2-sets
        combs = list(combinations(unigrams,2))
        for combination in combs:
            unigram1,unigram2 = combination
            if unigram1 in self.TopFrequentUnigramsM and \
                unigram2 in self.TopFrequentUnigramsM:
                    cooccur.setdefault(unigram1,{})
                    cooccur[unigram1].setdefault(unigram2,0)
                    cooccur[unigram1][unigram2] += int(count)
                    cooccur.setdefault(unigram2,{})
                    cooccur[unigram2].setdefault(unigram1,0)
                    cooccur[unigram2][unigram1] += int(count)
        for unigram1 in cooccur.keys():
            yield unigram1, cooccur[unigram1]
        
    def combiner(self, unigram1, values):
        cooccur = {}
        for stripe in values:
            for unigram2 in stripe.keys():
                cooccur.setdefault(unigram2,0)
                cooccur[unigram2] += stripe[unigram2]
        yield unigram1, cooccur

    def reducer_init(self):
        # Top10kWords.txt is shipped to the tasks with the --file option
        with open("Top10kWords.txt","r") as f:
            for word in f:
                word = word.strip()
                self.TopFrequentUnigramsR.append(word)
    
    def reducer(self, unigram1, values):
        cooccur = {}
        for stripe in values:
            for unigram2 in stripe.keys():
                cooccur.setdefault(unigram2,0)
                cooccur[unigram2] += stripe[unigram2]
        for unigram2 in self.TopFrequentUnigramsR:
            cooccur.setdefault(unigram2,0)
        yield unigram1,','.join([str(cooccur[unigram2]) for unigram2 in \
                                  sorted(self.TopFrequentUnigramsR)])

if __name__ == '__main__':
    HW541.run()

Overwriting HW541.py


In [8]:
!chmod +x HW541.py

Test with the sample file:

In [2]:
!./HW541.py gbooks_filtered_sample.txt --file /tmp/Top10kWords.txt > /tmp/Stripes.txt
!head -20 /tmp/Stripes.txt

using configs in /home/hduser/.mrjob.conf
creating tmp directory /tmp/HW541.hduser.20151013.224022.306420
writing to /tmp/HW541.hduser.20151013.224022.306420/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/HW541.hduser.20151013.224022.306420/step-0-mapper-sorted
> sort /tmp/HW541.hduser.20151013.224022.306420/step-0-mapper_part-00000
writing to /tmp/HW541.hduser.20151013.224022.306420/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/HW541.hduser.20151013.224022.306420/step-0-reducer_part-00000 -> /tmp/HW541.hduser.20151013.224022.306420/output/part-00000
Streaming final output from /tmp/HW541.hduser.20151013.224022.306420/output
removing tmp directory /tmp/HW541.hduser.20151013.224022.306420
"I"	"34302,393307,25203,86000"
"have"	"393307,130,3747,14896"
"of"	"25203,3747,31638,317731"
"the"	"86000,14896,317731,46498"


In [245]:
%%writefile HW541_driver.py
#!/home/hduser/anaconda/bin/python
from HW541 import HW541

import os

mr_job = HW541(args=['s3://filtered-5grams/', '-r', 'emr',
                     '--file=Top10kWords.txt',
                     '--output-dir=s3://ucb-mids-mls-juanjocarin/Stripes_output',
                     '--no-output'])

with mr_job.make_runner() as runner: 
    runner.run()

os.system("aws s3 cp s3://ucb-mids-mls-juanjocarin/Stripes_output/part-00000 \
    s3://ucb-mids-mls-juanjocarin/Stripes.txt")
os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Stripes_output/part-00000")
os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Stripes_output/_SUCCESS")

Overwriting HW541_driver.py


In [246]:
!chmod +x HW541_driver.py

In [247]:
!./HW541_driver.py 

copy: s3://ucb-mids-mls-juanjocarin/Stripes_output/part-00000 to s3://ucb-mids-mls-juanjocarin/Stripes.txt
delete: s3://ucb-mids-mls-juanjocarin/Stripes_output/part-00000
delete: s3://ucb-mids-mls-juanjocarin/Stripes_output/_SUCCESS


The 10,000 stripes that we'll use in the last stage are now stored in an S3 bucket.

##HW5.4.2

Due to time constraints, we were only able to implement one metric instead of two: calculating the **Manhattan** distances in AWS took **28 hours** with 4 m1.medium instances, and after that the job crashed (we think due to lack of resources). Implementing **Euclidean** distance would be quite easy: we would just have to square the differences and take the square root of their sum.
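
To illustrate how small that change would be, the following plain-Python sketch (not MRJob code, and not run on AWS) computes both metrics for two vectors. The function names `manhattan` and `euclidean` are hypothetical; the two vectors are the "I" and "have" stripes printed by the sample test of `HW541.py`.

```python
from math import sqrt

def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Square root of the sum of the squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Stripes for "I" and "have", copied from the sample test output
stripe_i = [34302, 393307, 25203, 86000]
stripe_have = [393307, 130, 3747, 14896]

print(manhattan(stripe_i, stripe_have))
print(euclidean(stripe_i, stripe_have))
```

In a MapReduce version, one way to do this would be to have the mapper emit squared differences instead of absolute ones and to apply the square root once in the reducer, after the sums are complete.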

Let's suppose that our coordinates are the following (for simplicity I'm using integers here, though in the end I used confidences, which are floating-point numbers):

$\begin{pmatrix}
7 & 8 & 5\\
8 & 4 & 1\\ 
5 & 1 & 9
\end{pmatrix}$

The (Manhattan) distance matrix is easy to calculate (and of course the elements on the diagonal are zero):

$\begin{pmatrix}
0 & 9 & 13\\
9 & 0 & 14\\ 
13 & 14 & 0
\end{pmatrix}$

With the first row, $\begin{pmatrix}
7 & 8 & 5
\end{pmatrix}$, which holds the 1st component of every point, we can calculate $\mid p_1 - q_1 \mid$ for all possible combinations of $\mathbf{p}$ and $\mathbf{q}$: $\begin{pmatrix}
0 & 1 & 2
\end{pmatrix}$ if $\mathbf{q}$ is the unigram corresponding to the first column, $\begin{pmatrix}
1 & 0 & 3
\end{pmatrix}$ if $\mathbf{q}$ is the unigram corresponding to the second column, and so on. If we proceed the same way for all rows, for the unigram in the first column we obtain the following matrix (one row per component):

$\begin{pmatrix}
0 & 1 & 2\\
0 & 4 & 7\\ 
0 & 4 & 4
\end{pmatrix}$

Summing this matrix over its rows (that is, adding up each column) gives the first row of our distance matrix: $\begin{pmatrix}
0 & 9 & 13
\end{pmatrix}$, which contains the distance between the first unigram and itself, between that unigram and the second, and between the first unigram and the third.
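
The decomposition above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not the MRJob implementation; the helper names `per_row_contributions` and `manhattan_row` are hypothetical, and `coords` is the 3x3 example matrix from the text.

```python
# Symmetric 3x3 coordinate matrix from the worked example: row k holds the
# k-th component of every point (one point per column).
coords = [
    [7, 8, 5],
    [8, 4, 1],
    [5, 1, 9],
]

def per_row_contributions(coords, i):
    """For point i, return one vector per row k with |x_ki - x_kj| for all j."""
    return [[abs(row[i] - x) for x in row] for row in coords]

def manhattan_row(coords, i):
    """Add the per-row vectors element-wise to get row i of the distance matrix."""
    contributions = per_row_contributions(coords, i)
    return [sum(column) for column in zip(*contributions)]

contributions_first = per_row_contributions(coords, 0)
distance_matrix = [manhattan_row(coords, i) for i in range(len(coords))]

print(contributions_first)  # [[0, 1, 2], [0, 4, 7], [0, 4, 4]]
print(distance_matrix)      # [[0, 9, 13], [9, 0, 14], [13, 14, 0]]
```

Each inner list of `contributions_first` can be computed from a single row of the input, which is why the work can be split across mappers (one input line each) and added up in a reducer.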

In [1]:
%%writefile HW542.py
#!/home/hduser/anaconda/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations 
from operator import itemgetter
from math import sqrt
from mrjob.protocol import RawValueProtocol

class HW542(MRJob):

    def jobconf(self):
        orig_jobconf = super(HW542, self).jobconf()        
        custom_jobconf = {'mapred.reduce.tasks': '1',
                          'mapred.output.key.comparator.class': 
                          'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                          'mapred.text.key.comparator.options': '-k1n'}
        combined_jobconf = orig_jobconf
        combined_jobconf.update(custom_jobconf)
        self.jobconf = combined_jobconf
        return combined_jobconf
    
    
    OUTPUT_PROTOCOL = RawValueProtocol

    def steps(self):
        return [MRStep(
                mapper = self.mapper, 
                reducer = self.reducer)]
    #,
    #            MRStep(
    #            reducer = self.reducer_aggregation)]

    
    def mapper(self, _, line):
        # The i-th line (corresponding to the i-th of the top 10,000 frequent
            # unigrams) contains the i-th coordinates for all unigrams
        line = re.sub('\"', '', line)
        line = line.split()
        unigram = line[0]
        coords = line[1].split(',')
        N = len(coords)
        # We have N (=10,000) coordinates and points
        # For each row (or vector) of N elements we're going to calculate N
            # other vectors, by subtracting the 1st, second, ... N-th element
            # and taking the absolute value
        # We also need the unigram, because (since they were ordered 
            # alphabetically) it will allow us to detect the value of "i"
        for i in range(len(coords)):
            yield i, (unigram,[abs(int(coords[i])-int(x)) for x in coords])
    
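    def reducer(self, row, values):
        # Each value is (unigram, vector of absolute differences) computed
            # from one input line, i.e., from one component
        # Adding the vectors over all components gives the row-th row of
            # the Manhattan distance matrix
        unigram = []
        sum_coord = None
        for value in values:
            N = len(value[1])
            if sum_coord is None:
                sum_coord = [0]*N
            unigram.append(value[0])
            sum_coord = [s+int(v) for s,v in zip(sum_coord,value[1])]
        # The unigrams were ordered alphabetically, so the row-th one (after
            # sorting) labels this row
        yield None,sorted(unigram)[row]+','+','.join([str(x) for x in sum_coord])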
        
if __name__ == '__main__':
    HW542.run()

Overwriting HW542.py


In [2]:
!chmod +x HW542.py

Let's continue our test with the confidence matrix of the sample file:

In [3]:
!./HW542.py /tmp/Stripes.txt | head -20

using configs in /home/hduser/.mrjob.conf
creating tmp directory /tmp/HW542.hduser.20151013.224417.152742
writing to /tmp/HW542.hduser.20151013.224417.152742/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /tmp/HW542.hduser.20151013.224417.152742/step-0-mapper-sorted
> sort /tmp/HW542.hduser.20151013.224417.152742/step-0-mapper_part-00000
writing to /tmp/HW542.hduser.20151013.224417.152742/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /tmp/HW542.hduser.20151013.224417.152742/step-0-reducer_part-00000 -> /tmp/HW542.hduser.20151013.224417.152742/output/part-00000
Streaming final output from /tmp/HW542.hduser.20151013.224417.152742/output
I,0,844742,636825,762139
have,844742,0,702447,667659
of,636825,702447,0,629272
the,762139,667659,629272,0
removing tmp directory /tmp/HW542.hduser.20151013.224417.152742


As expected, $\text{distance}(\text{unigram}_i,\text{unigram}_i)=0 \text{ } \forall i \in \{1,\dots,N\}$, and the matrix is symmetric. Since the number of unigrams we are comparing in this example is so small, it's very easy to check by hand that the code produces the correct distances.
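
The same check can be scripted. The sketch below is plain Python, independent of MRJob: it recomputes the pairwise Manhattan distances directly from the four sample stripes and compares them with the output of `HW542.py`. Both blocks of text are copied from the cell outputs in this notebook; the variable names are hypothetical.

```python
# Stripes as printed by the HW541 sample run (tab-separated, quoted)
stripes_text = '''"I"\t"34302,393307,25203,86000"
"have"\t"393307,130,3747,14896"
"of"\t"25203,3747,31638,317731"
"the"\t"86000,14896,317731,46498"'''

# Distances as printed by the HW542 sample run
reported_text = '''I,0,844742,636825,762139
have,844742,0,702447,667659
of,636825,702447,0,629272
the,762139,667659,629272,0'''

# Parse the stripes into {unigram: list of integer coordinates}
stripes = {}
for line in stripes_text.splitlines():
    word, values = line.replace('"', '').split('\t')
    stripes[word] = [int(v) for v in values.split(',')]

# Same ordering as the job: unigrams sorted alphabetically
words = sorted(stripes)

# Direct pairwise Manhattan distances
recomputed = {
    w: [sum(abs(a - b) for a, b in zip(stripes[w], stripes[v])) for v in words]
    for w in words
}

# Parse the reported rows into {unigram: list of distances}
reported = {}
for line in reported_text.splitlines():
    fields = line.split(',')
    reported[fields[0]] = [int(x) for x in fields[1:]]

print(recomputed == reported)
```

If the values were copied correctly, the comparison should print `True`.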

In [5]:
%%writefile HW542_driver.py
#!/home/hduser/anaconda/bin/python
from HW542 import HW542

import os

mr_job = HW542(args=['s3://ucb-mids-mls-juanjocarin/Stripes.txt', 
                     '-r', 'emr',
                     '--output-dir=s3://ucb-mids-mls-juanjocarin/Manhattan_output2',
                     '--no-output'])
with mr_job.make_runner() as runner:
    runner.run()
#os.system("aws s3 cp s3://ucb-mids-mls-juanjocarin/Manhattan_output/part-00000 \
#    s3://ucb-mids-mls-juanjocarin/Manhattan_distances.txt")
#os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Manhattan_output/part-00000")
#os.system("aws s3 rm s3://ucb-mids-mls-juanjocarin/Manhattan_output/_SUCCESS")    

Overwriting HW542_driver.py


In [6]:
!chmod +x HW542_driver.py

Unfortunately, we could not confirm that the implementation works at scale: the job failed on AWS, at least with the limited resources I used.

<img src="./output.JPG">

**Syslog**:

    2015-10-13 22:21:40,139 INFO org.apache.hadoop.streaming.StreamJob (main):  map 72%  reduce 0%
    2015-10-13 22:22:04,187 INFO org.apache.hadoop.streaming.StreamJob (main):  map 100%  reduce 100%
    2015-10-13 22:22:04,188 INFO org.apache.hadoop.streaming.StreamJob (main): To kill this job, run:
    2015-10-13 22:22:04,188 INFO org.apache.hadoop.streaming.StreamJob (main): /home/hadoop/bin/hadoop job  -Dmapred.job.tracker=172.31.1.224:9001 -kill job_201510111752_0001
    2015-10-13 22:22:04,189 INFO org.apache.hadoop.streaming.StreamJob (main): Tracking URL: http://ip-172-31-1-224.ec2.internal:9100/jobdetails.jsp?jobid=job_201510111752_0001
    2015-10-13 22:22:04,189 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not successful. Error: # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201510111752_0001_r_000000
    2015-10-13 22:22:04,189 INFO org.apache.hadoop.streaming.StreamJob (main): killJob...

**Stderr**:

    Streaming Command Failed!
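
One plausible contributor to the failure (an assumption based on reading the `HW542` mapper, not something confirmed by the logs) is the volume of intermediate data: for every input line the mapper emits N records of N values each, and `mapred.reduce.tasks` is set to 1, so a single reducer receives all of it. The arithmetic is sketched below, with `n_unigrams` set to the 10,000 used in this assignment.

```python
n_unigrams = 10000  # N: number of stripes (input lines) and of coordinates per stripe

records_per_line = n_unigrams    # the mapper yields one record per index i
values_per_record = n_unigrams   # each record carries a vector of N differences

total_records = n_unigrams * records_per_line
total_values = total_records * values_per_record

print(total_records)  # 100000000
print(total_values)   # 1000000000000
```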

In [None]:
!./HW542_driver.py

#HW5.5

##See the other notebooks