# Homework 2 - MapReduce

There is only one task in this homework. You are asked to implement the Social Triangle example discussed in class. In particular, given the email dataset, please list all "reciprocal" relationships in the company. Recall that:

If A emails B and B emails A, then A and B are *reciprocal*.

If A emails B but B doesn’t email A, then A and B are *directed*.
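
The two definitions above can be illustrated without MapReduce. The following is a minimal plain-Python sketch, not the required solution: the function name `classify_pairs` and the toy edges are hypothetical, and skipping self-emails is an assumption rather than something the assignment states.

```python
def classify_pairs(edges):
    """Split directed (sender, recipient) edges into reciprocal and directed pairs."""
    edge_set = set(edges)
    reciprocal, directed = set(), set()
    for a, b in edge_set:
        if a == b:
            continue  # assumption: a self-email is not treated as a relationship
        if (b, a) in edge_set:
            # Store the pair in sorted order so (A, B) and (B, A) collapse into one entry.
            reciprocal.add(tuple(sorted((a, b))))
        else:
            directed.add((a, b))
    return reciprocal, directed


# Toy data: A and B email each other, A emails C without a reply.
edges = [("A", "B"), ("B", "A"), ("A", "C")]
reciprocal, directed = classify_pairs(edges)
print(reciprocal)  # {('A', 'B')}
print(directed)    # {('A', 'C')}
```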

**Dataset:** We will use a subset of the open [Enron Email Dataset](https://www.cs.cmu.edu/~./enron/ "Enron Email Dataset") containing approximately 10,000 simplified email headers from the Enron Corporation. The subset is available as **enron_mails_small.csv** as part of this notebook. The file contains three columns, *Date*, *From*, and *To*, described below:

|Column name|Description|
|--|--|
|Date |The date and time of the email, in the format YYYY-MM-DD hh:mm:ss, <br />e.g. "1998-10-30 07:43:00" |
|From |The sender email address, <br />e.g. "mark.taylor@enron.com" |
|To | A list of recipients' email addresses separated by semicolons ';', <br />e.g. "jennifer.fraser@enron.com;jeffrey.hodge@enron.com" |

Note that we only care about users employed by Enron, i.e. only relationships in which both email addresses end with *'@enron.com'*.
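
As an illustration of the column layout and the Enron-only filter, here is a small sketch that parses one row with Python's `csv` module. The helper name `parse_row` and the address `someone@example.com` are hypothetical, and the sketch assumes every row has exactly three columns.

```python
import csv

ENRON_SUFFIX = "@enron.com"


def parse_row(line):
    """Return (sender, [Enron recipients]) for one CSV line, or None for a non-Enron sender."""
    # csv.reader accepts any iterable of lines; a one-element list parses a single row.
    _date, sender, recipients = next(csv.reader([line]))
    if not sender.endswith(ENRON_SUFFIX):
        return None
    kept = [r for r in recipients.split(";") if r.endswith(ENRON_SUFFIX)]
    return sender, kept


# Hypothetical row: one Enron recipient and one outside recipient.
line = ("1998-10-30 07:43:00,mark.taylor@enron.com,"
        "jennifer.fraser@enron.com;someone@example.com")
print(parse_row(line))
# ('mark.taylor@enron.com', ['jennifer.fraser@enron.com'])
```

Using `str.endswith` rather than a substring test follows the wording above, which asks for addresses that end with '@enron.com'.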

The expected output is also provided below. For each reciprocal relationship, please output a tuple consisting of two strings. The first is always **'reciprocal'**. The second is a string showing the names of the two people in the following format: **'Jane Doe : John Doe'**. The names should be presented in lexical order, i.e. there will not be a 'John Doe : Jane Doe' since 'Jane' is ordered before 'John'.
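
The output format can be sketched as follows. The helper name `format_pair` is hypothetical; it only shows one way to put two names in lexical order before joining them.

```python
def format_pair(name_a, name_b):
    """Build the ('reciprocal', 'First : Second') tuple with names in lexical order."""
    first, second = sorted((name_a, name_b))
    return ("reciprocal", f"{first} : {second}")


print(format_pair("John Doe", "Jane Doe"))
# ('reciprocal', 'Jane Doe : John Doe')
```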

Though the dataset contains only email addresses, not actual names, we assume that each email alias was created from the person's name. For example:

|Email Address|Converted Name|
|--|--|
|mark.taylor@enron.com|Mark Taylor|
|alan.aronowitz@enron.com|Alan Aronowitz|
|marc.r.cutler@enron.com|Marc R Cutler|
|hugh@enron.com|Hugh|
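
The conversion in the table can be sketched in a few lines. The helper name `email_to_name` is hypothetical; it assumes the alias is the part before '@' and that dots separate the parts of the name.

```python
def email_to_name(address):
    """Convert an address such as 'mark.taylor@enron.com' into 'Mark Taylor'."""
    alias = address.split("@")[0]
    # Replace dots with spaces, then capitalize each word.
    return alias.replace(".", " ").title()


for address in ["mark.taylor@enron.com", "marc.r.cutler@enron.com", "hugh@enron.com"]:
    print(email_to_name(address))
# Mark Taylor
# Marc R Cutler
# Hugh
```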

In [95]:
!pip install mrjob
!gdown --id 1sq4-zXn2Z82mdLSBBegEgsUsfqtgza-C -O mapreduce.py
!gdown --id 1It6GP8O2JqkmUtZKbYp1kpwpuwOXlLps -O enron_mails_small.csv

Downloading...
From: https://drive.google.com/uc?id=1sq4-zXn2Z82mdLSBBegEgsUsfqtgza-C
To: /content/mapreduce.py
100% 2.66k/2.66k [00:00<00:00, 4.62MB/s]
Downloading...
From: https://drive.google.com/uc?id=1It6GP8O2JqkmUtZKbYp1kpwpuwOXlLps
To: /content/enron_mails_small.csv
100% 857k/857k [00:00<00:00, 55.1MB/s]


## Task 1 (10 pts)
You must complete the **MRFindReciprocal** class below (which inherits from MRJob), and your code must run with **mr.runJob()** from the provided **mapreduce.py** package.

In [96]:
import csv
import mapreduce as mr
from mrjob.job import MRJob
from mrjob.step import MRStep

################################
### YOUR WORK SHOULD BE HERE ###
################################
class MRFindReciprocal(MRJob):
  '''
  PLEASE COMPLETE THIS CLASS. THIS SHOULD BE THE ONLY PLACE THAT YOU CAN EDIT.
  THE INPUT OF YOUR MAPREDUCE JOB WOULD BE LINE OF TEXT WITHOUT '\n'.
  '''
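  def mapper1(self, _, line):
    # Each line is "Date,From,To"; recipients in To are separated by ';'.
    row = line.split(',')
    # Keep only senders whose address ends with '@enron.com'.
    if row[1].endswith('@enron.com'):
      mailer = row[1].split('@')[0].replace('.', ' ').lower()
      # Keep only Enron recipients, then turn each alias into a lowercase name.
      receiver_list = filter(lambda x: x.endswith('@enron.com'), row[2].split(';'))
      receiver_list = list(map(lambda x: x.split('@')[0].replace('.', ' ').lower(), receiver_list))
      yield (mailer, receiver_list)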

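  def reducer1(self, mailer, receiver_list):
    # Flatten the per-email recipient lists into one list per sender.
    yield (mailer, [item for sublist in receiver_list for item in sublist])

  def mapper2(self, mailer, receiver_list):
    # Deduplicate recipients so each direction (sender -> recipient) counts once.
    receiver_list = set(receiver_list)
    mailer = mailer.title()
    for i in receiver_list:
      i = i.title()
      # Emit the pair in lexical order so both directions share one key.
      if mailer < i:
        from_to = (mailer, i)
      else:
        from_to = (i, mailer)

      yield (from_to, 1)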

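  def reducer2(self, from_to, count):
    # A pair seen in both directions has a total count of 2.
    yield (from_to, sum(count))

  def mapper3(self, from_to, count):
    # Keep only pairs observed in both directions.
    if count > 1:
      yield ('reciprocal', from_to[0]+' : '+from_to[1])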


  def steps(self):
    return [
      MRStep(mapper=self.mapper1, reducer=self.reducer1),
      MRStep(mapper=self.mapper2, reducer=self.reducer2),
      MRStep(mapper=self.mapper3), 
    ]


###################################
### DO NOT EDIT BELOW THIS LINE ###
###################################
job = MRFindReciprocal(args=[])
with open('enron_mails_small.csv', 'r') as fi:
  next(fi)
  output = list(mr.runJob(enumerate(map(lambda x: x.strip(), fi)), job))

print(len(output))
output

35


[('reciprocal', 'Brenda Whitehead : Elizabeth Sager'),
 ('reciprocal', 'Carol Clair : Debra Perlingiere'),
 ('reciprocal', 'Carol Clair : Mark Taylor'),
 ('reciprocal', 'Carol Clair : Richard Sanders'),
 ('reciprocal', 'Carol Clair : Sara Shackleton'),
 ('reciprocal', 'Carol Clair : Tana Jones'),
 ('reciprocal', 'Debra Perlingiere : Kevin Ruscitti'),
 ('reciprocal', 'Drew Fossum : Susan Scott'),
 ('reciprocal', 'Elizabeth Sager : Janette Elbertson'),
 ('reciprocal', 'Elizabeth Sager : Mark Haedicke'),
 ('reciprocal', 'Elizabeth Sager : Mark Taylor'),
 ('reciprocal', 'Elizabeth Sager : Richard Sanders'),
 ('reciprocal', 'Eric Bass : Susan Scott'),
 ('reciprocal', 'Fletcher Sturm : Greg Whalley'),
 ('reciprocal', 'Fletcher Sturm : Sally Beck'),
 ('reciprocal', 'Gerald Nemec : Susan Scott'),
 ('reciprocal', 'Grant Masson : Vince Kaminski'),
 ('reciprocal', 'Greg Whalley : Richard Sanders'),
 ('reciprocal', 'Janette Elbertson : Mark Taylor'),
 ('reciprocal', 'Janette Elbertson : Richard Sa

## Task 2 (5 points)
Please also convert your MR Job class from Task 1 into a stand-alone `BDM_HW2.py` file that can be run directly with `python`, similar to our Lab 3.

In [97]:
!python BDM_HW2.py enron_mails_small.csv

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/BDM_HW2.root.20211013.165233.070301
Running step 1 of 3...
Running step 2 of 3...
Running step 3 of 3...
job output is in /tmp/BDM_HW2.root.20211013.165233.070301/output
Streaming final output from /tmp/BDM_HW2.root.20211013.165233.070301/output...
"reciprocal"	"Brenda Whitehead : Elizabeth Sager"
"reciprocal"	"Carol Clair : Debra Perlingiere"
"reciprocal"	"Carol Clair : Mark Taylor"
"reciprocal"	"Carol Clair : Richard Sanders"
"reciprocal"	"Carol Clair : Sara Shackleton"
"reciprocal"	"Carol Clair : Tana Jones"
"reciprocal"	"Gerald Nemec : Susan Scott"
"reciprocal"	"Grant Masson : Vince Kaminski"
"reciprocal"	"Greg Whalley : Richard Sanders"
"reciprocal"	"Janette Elbertson : Mark Taylor"
"reciprocal"	"Janette Elbertson : Richard Sanders"
"reciprocal"	"Liz Taylor : Mark Haedicke"
"reciprocal"	"Mark Haedicke : Mark Taylor"
"reciprocal"	"Mark Haedicke : Michelle Cash"
"

Code for the `BDM_HW2.py` file:

In [None]:
from mrjob.job import MRJob
from mrjob.step import MRStep


class MRFindReciprocal(MRJob):
  '''
  PLEASE COMPLETE THIS CLASS. THIS SHOULD BE THE ONLY PLACE THAT YOU CAN EDIT.
  THE INPUT OF YOUR MAPREDUCE JOB WOULD BE LINE OF TEXT WITHOUT '\n'.
  '''
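  def mapper1(self, _, line):
    # Each line is "Date,From,To"; recipients in To are separated by ';'.
    row = line.split(',')
    # Keep only senders whose address ends with '@enron.com'.
    if row[1].endswith('@enron.com'):
      mailer = row[1].split('@')[0].replace('.', ' ').lower()
      # Keep only Enron recipients, then turn each alias into a lowercase name.
      receiver_list = filter(lambda x: x.endswith('@enron.com'), row[2].split(';'))
      receiver_list = list(map(lambda x: x.split('@')[0].replace('.', ' ').lower(), receiver_list))
      yield (mailer, receiver_list)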

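  def reducer1(self, mailer, receiver_list):
    # Flatten the per-email recipient lists into one list per sender.
    yield (mailer, [item for sublist in receiver_list for item in sublist])

  def mapper2(self, mailer, receiver_list):
    # Deduplicate recipients so each direction (sender -> recipient) counts once.
    receiver_list = set(receiver_list)
    mailer = mailer.title()
    for i in receiver_list:
      i = i.title()
      # Emit the pair in lexical order so both directions share one key.
      if mailer < i:
        from_to = (mailer, i)
      else:
        from_to = (i, mailer)

      yield (from_to, 1)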

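  def reducer2(self, from_to, count):
    # A pair seen in both directions has a total count of 2.
    yield (from_to, sum(count))

  def mapper3(self, from_to, count):
    # Keep only pairs observed in both directions.
    if count > 1:
      yield ('reciprocal', from_to[0]+' : '+from_to[1])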


  def steps(self):
    return [
      MRStep(mapper=self.mapper1, reducer=self.reducer1),
      MRStep(mapper=self.mapper2, reducer=self.reducer2),
      MRStep(mapper=self.mapper3), 
    ]


if __name__ == '__main__':
    MRFindReciprocal.run()