# DATASCI W261: Machine Learning at Scale

* **Sayantan Satpati**
* **sayantan.satpati@ischool.berkeley.edu**
* **W261**
* **Week-3**
* **Assignment-4**
* **Date of Submission: 22-SEP-2015**

#  === Week 3 Hadoop & Apriori ===

**HW3.0. What is a merge sort? Where is it used in Hadoop?**
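
As a reference sketch for this question (not part of the original submission): merge sort recursively splits the input in half, sorts each half, and merges the two sorted halves in linear time. Hadoop relies on the merge step during the shuffle: sorted spill files written by a map task are merged into one sorted output, and on the reduce side the sorted outputs fetched from the mappers are merged before being passed to the reducer. The function names below are illustrative only.

```python
import heapq


def merge_sort(values):
    """Return a new sorted list using top-down merge sort."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left = merge_sort(values[:mid])
    right = merge_sort(values[mid:])

    # Merge the two sorted halves in a single linear pass
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sorted_runs(runs):
    """K-way merge of already-sorted runs, similar in spirit to how
    sorted spill files or map outputs are combined during the shuffle."""
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    print(merge_sort([5, 2, 9, 1, 5]))
    print(merge_sorted_runs([["a1,*", "a1,b2"], ["a1,b1", "a2,*"]]))
```

The k-way merge never re-sorts its inputs; it only picks the smallest head element among the runs, which is why it works well on data too large to hold in memory.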



### HW3.1.
---
Product Recommendations: The action or practice of selling additional products or services 
to existing customers is called cross-selling. Product recommendation is 
one example of cross-selling that is frequently used by online retailers. 
One simple method is to recommend products that customers frequently
browse together.

Suppose we want to recommend new products to the customer based on the products they
have already browsed on the online website. Write a program using the A-priori algorithm
to find products which are frequently browsed together. Fix the support to s = 100 
(i.e. product pairs need to occur together at least 100 times to be considered frequent) 
and find itemsets of size 2 and 3.

Use the online browsing behavior dataset at: 

https://www.dropbox.com/s/zlfyiwa70poqg74/ProductPurchaseData.txt?dl=0

Each line in this dataset represents a browsing session of a customer. 
On each line, each string of 8 characters represents the id of an item browsed during that session. 
The items are separated by spaces.

Do some exploratory data analysis of this dataset. 
Report your findings, such as the number of unique products and the largest basket, using Hadoop Map-Reduce.
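
Before running the Hadoop job, it can help to check the expected statistics on a small sample in plain Python. The sketch below is an assumption-laden local check, not the MapReduce solution: `basket_stats` is a hypothetical helper, and the item ids in the sample are made up to match the 8-character format described above.

```python
from collections import Counter


def basket_stats(lines):
    """Compute simple EDA statistics for whitespace-separated baskets."""
    item_counts = Counter()
    largest_basket = 0
    n_baskets = 0
    for line in lines:
        items = line.split()
        if not items:
            continue  # ignore blank lines
        n_baskets += 1
        largest_basket = max(largest_basket, len(items))
        # Count each item at most once per basket
        item_counts.update(set(items))
    return {
        "baskets": n_baskets,
        "unique_items": len(item_counts),
        "largest_basket": largest_basket,
        "item_counts": item_counts,
    }


if __name__ == "__main__":
    # Made-up sample sessions, one basket per line
    sample = [
        "AAA00001 BBB00002 CCC00003",
        "AAA00001 BBB00002",
        "",
        "DDD00004",
    ]
    stats = basket_stats(sample)
    print(stats["baskets"], stats["unique_items"], stats["largest_basket"])
```

Running the same function on the first few hundred lines of the real file gives numbers to compare against the reducer output for that same subset.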

In [54]:
%%writefile mapper_hw31.py

#!/usr/bin/env python

import sys
import re
import itertools

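for line in sys.stdin:
    try:
        # Remove leading & trailing whitespace
        line = line.strip()
        # Skip blank lines so no empty item id is emitted
        if not line:
            continue
        # Split the line on whitespace (items are separated by spaces)
        items = re.split(r'\s+', line)
        # Sort the items so each pair is always emitted in the same order
        items.sort()
        
        # Single items: key 'item,*', count 1, basket length
        for c in itertools.combinations(items, 1):
            print '%s,%s\t%d\t%d' %(c[0], '*', 1, len(items))
            
        # Pairs: key 'item1,item2', count 1
        for c in itertools.combinations(items, 2):
            print '%s,%s\t%d' %(c[0], c[1], 1)
        
        ''' Commenting out itemset-3 for the moment
        for c in itertools.combinations(items, 3):
            print '%s,%s,%s\t%d' %(c[0], c[1], c[2], 1)
        '''
    except Exception as e:
        # Report errors on stderr so they do not pollute the mapper output
        print >> sys.stderr, e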

Overwriting mapper_hw31.py


In [55]:
!chmod a+x mapper_hw31.py

In [60]:
%%writefile reducer_hw31.py
#!/usr/bin/python
import sys
import re
import heapq
from sets import Set

'''
a1,* 1 4
a1,* 1 5
a1,b1 1
a1,b1 1
a1,b2 1
a1,b2 1
a2,* 1 6
'''

itemset_1_cnt = 0
itemset_2_cnt = 0

itemset_1_last = None
itemset_2_last = None

THRESHOLD = 100

# Statistics
# Unique Items
uniq = Set()
# Max Basket Length
max_basket_len = 0
# Total Itemset Counts for Sizes: 1 & 2
total_itemset_1 = 0
total_itemset_2 = 0

for line in sys.stdin:
        # Remove leading & trailing whitespace
        line = line.strip()
        # Split the line on whitespace (<TAB> separates the key from the values)
        tokens = re.split(r'\s+', line)
    
        # Split the key by <COMMA> delimiter
        items = tokens[0].split(",")
        i1 = items[0]
        i2 = items[1]
        
        # Count
        count = int(tokens[1])
        
        if itemset_1_last != i1:
            uniq.add(i1)
            
            basket_len = int(tokens[2])
            if basket_len >= max_basket_len:
                max_basket_len = basket_len
                
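            if itemset_1_cnt >= THRESHOLD:
                total_itemset_1 += 1
                # Also flush the last pair of the previous item,
                # which would otherwise never be counted
                if itemset_2_cnt >= THRESHOLD:
                    total_itemset_2 += 1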
                
            # Reset
            itemset_1_last = i1
            itemset_1_cnt = count
            itemset_2_last = None
            itemset_2_cnt = 0
        else:
            if i2 == '*':
                itemset_1_cnt += count
                
                basket_len = int(tokens[2])
                if basket_len > max_basket_len:
                    max_basket_len = basket_len
            else:
                if itemset_2_last != tokens[0]:
                    if itemset_1_cnt >= THRESHOLD and itemset_2_cnt >= THRESHOLD:
                        total_itemset_2 += 1
                        
                    itemset_2_last = tokens[0]
                    itemset_2_cnt = count
                else:
                    itemset_2_cnt += count
                    
# Last Set of Counts
if itemset_1_cnt >= THRESHOLD:
    total_itemset_1 += 1
if itemset_1_cnt >= THRESHOLD and itemset_2_cnt >= THRESHOLD:
    total_itemset_2 += 1

print '=== Statistics ==='
print 'Total Unique Items: %d' %(len(uniq))
print 'Maximum Basket Length: %d' %(max_basket_len)
print 'Total # itemsets of size 1: %d' %(total_itemset_1)
print 'Total # itemsets of size 2: %d' %(total_itemset_2)
        
        

Overwriting reducer_hw31.py


In [86]:
%%writefile reducer_hw31.py
#!/usr/bin/python
import sys
import re
import heapq
from sets import Set

'''
a1,* 1 4
a1,* 1 5
a1,b1 1
a1,b1 1
a1,b2 1
a1,b2 1
a2,* 1 6
'''

itemset_1_cnt = 0
itemset_2_cnt = 0

itemset_1_last = None

THRESHOLD = 100

# Statistics
# Unique Items
uniq = Set()
# Max Basket Length
max_basket_len = 0
# Total Itemset Counts for Sizes: 1 & 2
total_itemset_1 = 0
total_itemset_2 = 0

d_counts = {}

def update_counts():
    global itemset_1_last
    global d_counts
    global total_itemset_1
    global total_itemset_2
    key = '{0},{1}'.format(itemset_1_last, '*')
    if d_counts.get(key, 0) >= THRESHOLD:
        total_itemset_1 += 1
        for k,v in d_counts.iteritems():
            if k != key:
                if v >= THRESHOLD:
                    total_itemset_2 += 1

for line in sys.stdin:
    # Remove leading & trailing whitespace
    line = line.strip()
    # Split the line on whitespace (<TAB> separates the key from the values)
    tokens = re.split(r'\s+', line)

    # Split the key by <COMMA> delimiter
    items = tokens[0].split(",")
    i1 = items[0]
    i2 = items[1]

    # Count
    count = int(tokens[1])

    if itemset_1_last is not None and itemset_1_last != i1:
        # New first item: emit contents of dict for the previous one
        update_counts()
        d_counts.clear()
    itemset_1_last = i1

    # Accumulate every record, including the first record of a new item
    # (the earlier version skipped it, undercounting each item by one)
    key = tokens[0]
    d_counts[key] = d_counts.get(key, 0) + count

    if i2 == '*':
        uniq.add(i1)
        basket_len = int(tokens[2])
        if basket_len > max_basket_len:
            max_basket_len = basket_len
                    
# Last Record
update_counts()
                    
print '=== Statistics ==='
print 'Total Unique Items: %d' %(len(uniq))
print 'Maximum Basket Length: %d' %(max_basket_len)
print 'Total # itemsets of size 1: %d' %(total_itemset_1)
print 'Total # itemsets of size 2: %d' %(total_itemset_2)
        
        

Overwriting reducer_hw31.py


In [88]:
!chmod a+x reducer_hw31.py

In [89]:
'''
HW3.1. Product Recommendations
'''

# Delete existing Output Dirs if available
!hadoop fs -rm -r -skipTrash /user/cloudera/w261/wk3/hw31/output

# Run the Hadoop Streaming Command
!hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.7.0.jar \
-D map.output.key.field.separator=, \
-D mapred.text.key.partitioner.options=-k1,1 \
-input /user/cloudera/w261/wk3/hw31/input/ProductPurchaseData.txt \
-output /user/cloudera/w261/wk3/hw31/output \
-file ./mapper_hw31.py \
-mapper 'python mapper_hw31.py' \
-file ./reducer_hw31.py \
-reducer 'python reducer_hw31.py'

# Show Output
!hadoop fs -cat /user/cloudera/w261/wk3/hw31/output/part-00000

Deleted /user/cloudera/w261/wk3/hw31/output
packageJobJar: [./mapper_hw31.py, ./reducer_hw31.py, /tmp/hadoop-cloudera/hadoop-unjar4802976138694835527/] [] /tmp/streamjob371666725523987824.jar tmpDir=null
15/09/20 12:15:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/09/20 12:15:58 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/20 12:15:59 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-cloudera/mapred/local]
15/09/20 12:15:59 INFO streaming.StreamJob: Running job: job_201509191045_0012
15/09/20 12:15:59 INFO streaming.StreamJob: To kill this job, run:
15/09/20 12:15:59 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201509191045_0012
15/09/20 12:15:59 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201509191045_0012
15/09/20 12:16:00 INFO streaming.StreamJob:  map 0%  reduce 0%
15/09

### HW3.2. (Computationally prohibitive but then again Hadoop can handle this)
---

Note: for this part the writeup will require a specific rule ordering but the program need not sort the output.

List the top 5 rules with corresponding confidence scores in decreasing order of confidence score 
for frequent (count >= 100) itemsets of size 2. 
A rule is of the form: 

(item1) ⇒ item2.

Fix the ordering of the rule lexicographically (left to right), 
and break ties in confidence (between rules, if any exist) 
by taking the first ones in lexicographically increasing order. 
Use Hadoop MapReduce to complete this part of the assignment; 
use a single mapper and single reducer; use a combiner if you think it will help and justify. 
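
As an in-memory reference for the quantity being asked for: the confidence of a rule (item1) ⇒ item2 is count(item1, item2) / count(item1). The sketch below is a small cross-check under stated assumptions, not the Hadoop solution; `top_rules` is a hypothetical helper, the baskets are toy data, and it scores both directions of every frequent pair before applying the tie-breaking order described above.

```python
from collections import Counter
from itertools import combinations


def top_rules(baskets, min_support, k=5):
    """Return the k highest-confidence rules (lhs, rhs, confidence)
    built from pairs that occur in at least min_support baskets."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_support:
            continue
        # Each frequent pair yields two candidate rules
        rules.append((a, b, float(n_ab) / item_counts[a]))
        rules.append((b, a, float(n_ab) / item_counts[b]))

    # Decreasing confidence, ties broken lexicographically
    rules.sort(key=lambda r: (-r[2], r[0], r[1]))
    return rules[:k]


if __name__ == "__main__":
    toy = [["x", "y"], ["x", "y"], ["y"], ["x", "y", "z"]]
    for lhs, rhs, conf in top_rules(toy, min_support=2):
        print("%s => %s\t%f" % (lhs, rhs, conf))
```

On the toy data, x appears in 3 baskets and always with y, so x ⇒ y has confidence 1.0, while y ⇒ x has confidence 3/4.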

In [93]:
%%writefile mapper_hw32.py

#!/usr/bin/env python

import sys
import re
import itertools

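for line in sys.stdin:
    try:
        # Remove leading & trailing whitespace
        line = line.strip()
        # Skip blank lines so no empty item id is emitted
        if not line:
            continue
        # Split the line on whitespace (items are separated by spaces)
        items = re.split(r'\s+', line)
        # Sort the items so each pair is always emitted in the same order
        items.sort()
        
        l = len(items)
        
        for i in xrange(l):
            # Single item: key 'item,*', count 1
            print '%s,*\t%d' %(items[i], 1)
            for j in xrange(i+1, l):
                # Pair: key 'item1,item2', count 1
                print '%s,%s\t%d' %(items[i], items[j], 1)
    except Exception as e:
        # Report errors on stderr so they do not pollute the mapper output
        print >> sys.stderr, e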

Overwriting mapper_hw32.py


In [94]:
!chmod a+x mapper_hw32.py

In [98]:
%%writefile reducer_hw32.py
#!/usr/bin/python
import sys
import re
import heapq

itemset_1_cnt = 0
itemset_2_cnt = 0

itemset_1_last = None
itemset_2_last = None

'''
a1,* 1
a1,* 1
a1,b1 1
a1,b1 1
a1,b2 1
a1,b2 1
a2,* 1
'''

THRESHOLD = 100
# Store Itemsets 2
dict = {}

for line in sys.stdin:
        # Remove leading & trailing whitespace
        line = line.strip()
        # Split the line on whitespace (<TAB> separates the key from the count)
        tokens = re.split(r'\s+', line)
    
        # Split the key by <COMMA> delimiter
        items = tokens[0].split(",")
        i1 = items[0]
        i2 = items[1]
        
        if not itemset_1_last:
            itemset_1_last = i1
        
        if itemset_1_last != i1:
            # Flush the last pair of the previous item before resetting;
            # without this, the final pair of every item group is dropped
            if itemset_1_cnt >= THRESHOLD and itemset_2_cnt >= THRESHOLD:
                confidence = (itemset_2_cnt * 1.0) / itemset_1_cnt
                dict[itemset_2_last] = confidence
                        
            # Reset
            itemset_1_last = i1
            itemset_1_cnt = int(tokens[1])
            itemset_2_last = None
            itemset_2_cnt = 0
        else:
            if i2 == '*':
                itemset_1_cnt += int(tokens[1])
            else:
                if itemset_2_last != tokens[0]:
                    if itemset_1_cnt >= THRESHOLD and itemset_2_cnt >= THRESHOLD:
                        confidence = (itemset_2_cnt * 1.0) / itemset_1_cnt
                        #print '[%d,%d]%s\t%f' %(itemset_1_cnt, itemset_2_cnt, itemset_2_last, confidence)
                        dict[itemset_2_last] = confidence
                    itemset_2_last = tokens[0]
                    itemset_2_cnt = int(tokens[1]) 
                else:
                    itemset_2_cnt += int(tokens[1])                    

# Last Set of Counts
if itemset_1_cnt >= THRESHOLD and itemset_2_cnt >= THRESHOLD:
    confidence = (itemset_2_cnt * 1.0) / itemset_1_cnt
    #print '[%d,%d]%s\t%f' %(itemset_1_cnt, itemset_2_cnt, itemset_2_last, confidence)
    dict[itemset_2_last] = confidence

print '=== Top 5 Confidence ==='
sorted_dict = sorted(dict.items(), key=lambda x:(-x[1], x[0]))
for j,k in sorted_dict[:5]:
    print '%s\t%f' %(j,k)
        
        
        

Overwriting reducer_hw32.py


In [99]:
!chmod a+x reducer_hw32.py

In [100]:
'''
HW3.2. Confidence Calculations
List the top 5 rules with corresponding confidence scores in decreasing order of confidence score 
for frequent (count >= 100) itemsets of size 2
'''

# Delete existing Output Dirs if available
!hadoop fs -rm -r -skipTrash /user/cloudera/w261/wk3/hw32/output

# Run the Hadoop Streaming Command
!hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.7.0.jar \
-D map.output.key.field.separator=, \
-D mapred.text.key.partitioner.options=-k1,1 \
-input /user/cloudera/w261/wk3/hw32/input/ProductPurchaseData.txt \
-output /user/cloudera/w261/wk3/hw32/output \
-file ./mapper_hw32.py \
-mapper 'python mapper_hw32.py' \
-file ./reducer_hw32.py \
-reducer 'python reducer_hw32.py'

# Show Output
!hadoop fs -cat /user/cloudera/w261/wk3/hw32/output/part-00000

Deleted /user/cloudera/w261/wk3/hw32/output
packageJobJar: [./mapper_hw32.py, ./reducer_hw32.py, /tmp/hadoop-cloudera/hadoop-unjar8486018449480148677/] [] /tmp/streamjob286663036493526101.jar tmpDir=null
15/09/20 13:08:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/09/20 13:08:58 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/20 13:08:58 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-cloudera/mapred/local]
15/09/20 13:08:58 INFO streaming.StreamJob: Running job: job_201509191045_0014
15/09/20 13:08:58 INFO streaming.StreamJob: To kill this job, run:
15/09/20 13:08:58 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201509191045_0014
15/09/20 13:08:59 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201509191045_0014
15/09/20 13:09:00 INFO streaming.StreamJob:  map 0%  reduce 0%
15/09

### HW3.3
---

Benchmark your results using the pyFIM implementation of the Apriori algorithm
(Apriori - Association Rule Induction / Frequent Item Set Mining implemented by Christian Borgelt). 
You can download pyFIM from here:

http://www.borgelt.net/pyfim.html

Comment on the results from both implementations (your Hadoop MapReduce of apriori versus pyFIM) 
in terms of results and execution times.

#### For this part, the following steps were performed:

1. Since I was using a Mac, I spun up an Ubuntu 14.04 (micro) VM on Amazon EC2
2. Installed all required libraries: pip, git, ipython
3. Downloaded the fim.so file from the link & set PYTHONPATH & LD_LIBRARY_PATH
4. Ran the following command in the VM

In [None]:
!python top5pyfim.py | sort -n -r -k 3

```
FRO40251	('DAI93865',)	1.0
FRO40251	('GRO85051',)	0.999176276771
FRO40251	('GRO38636',)	0.990654205607
FRO40251	('ELE12951',)	0.990566037736
FRO40251	('DAI88079',)	0.986725663717
FRO40251	('FRO92469',)	0.983510011779
SNA82528	('DAI43868',)	0.972972972973
DAI62779	('DAI23334',)	0.954545454545
```

### HW3.4 (Conceptual Exercise)

Suppose that you wished to perform the Apriori algorithm once again,
though this time now with the goal of listing the top 5 rules with corresponding confidence scores 
in decreasing order of confidence score for itemsets of size 3 using Hadoop MapReduce.
A rule is now of the form: 

(item1, item2) ⇒ item3 

Recall that the Apriori algorithm is iterative for increasing itemset size,
working off of the frequent itemsets of the previous size to explore 
ONLY the NECESSARY subset of a large combinatorial space. 
Describe how you might design a framework to perform this exercise.

In particular, focus on the following:
  - map-reduce steps required
  - enumeration of item sets and filtering for frequent candidates
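
One possible design, sketched in memory under stated assumptions: each call to the hypothetical `count_candidates` stands in for one MapReduce job (the loop over baskets plays the mapper, the counter plays the reducer), and the frequent itemsets of the previous pass are passed in as a filter, much as a small side file would be shipped to each mapper. All function names and the toy data are illustrative.

```python
from collections import Counter
from itertools import combinations


def count_candidates(baskets, size, is_candidate):
    """One counting pass: emit each size-k subset of a basket that
    passes the candidate filter, then sum the counts."""
    counts = Counter()
    for basket in baskets:
        for itemset in combinations(sorted(set(basket)), size):
            if is_candidate(itemset):
                counts[itemset] += 1
    return counts


def frequent(counts, min_support):
    return {s: n for s, n in counts.items() if n >= min_support}


def apriori_up_to_3(baskets, min_support):
    # Pass 1: count single items and keep the frequent ones
    f1 = frequent(count_candidates(baskets, 1, lambda s: True), min_support)
    # Pass 2: only count pairs whose items are both frequent
    f2 = frequent(count_candidates(
        baskets, 2, lambda s: all((i,) in f1 for i in s)), min_support)
    # Pass 3: only count triples whose every 2-subset is frequent
    f3 = frequent(count_candidates(
        baskets, 3, lambda s: all(p in f2 for p in combinations(s, 2))),
        min_support)
    return f1, f2, f3


def rules_from_triples(f2, f3):
    """Build rules (item1, item2) => item3 with their confidence."""
    rules = []
    for triple, n in f3.items():
        for pair in combinations(triple, 2):
            (rest,) = set(triple) - set(pair)
            rules.append((pair, rest, float(n) / f2[pair]))
    # Decreasing confidence, ties broken lexicographically
    rules.sort(key=lambda r: (-r[2], r[0], r[1]))
    return rules


if __name__ == "__main__":
    toy = [["a", "b", "c"]] * 3 + [["a", "b"], ["c", "d"]]
    f1, f2, f3 = apriori_up_to_3(toy, min_support=3)
    for pair, rhs, conf in rules_from_triples(f2, f3)[:5]:
        print("%s => %s\t%f" % (pair, rhs, conf))
```

The key point is that pass 3 never counts a triple containing an infrequent pair, so the combinatorial space shrinks at every step; a final job (or the last reducer) joins the triple counts with the pair counts to compute confidence.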