# Homework 3 - MapReduce

In this homework, we are practicing the MapReduce programming paradigm. 

You are required to turn in this notebook as BDM\_HW3\_MR\_**NetId**.ipynb. You will be asked to complete each task using the accompanying *mapreduce* package (provided as **mapreduce.py**) and/or MRJob, using one or more MapReduce "steps". For each such step (run with **mr.run()**), you are expected to supply a mapper and, as needed, a reducer. If you're using MRJob, please call **mr.runJob()** instead. Below is sample usage of the package:

```python
    # Run on input1 using your mapper1 and reducer1 function
    output = list(mr.run(input1, mapper1, reducer1))

    # Run on input2 using only your mapper2, no reduce phase
    output = list(mr.run(enumerate(input2), mapper2))
    
    # Run on input3 using 2 nested MapReduce jobs
    output = mr.run(mr.run(input3, mapper3, reducer3), mapper4)
```
    
Please note that the input must be an iterable of **key/value pairs**. If your input data does not have a key, you can simply add an index key through **enumerate(input)**. The output of mr.run() is always a **generator**. You have to cast it to a list if you'd like to view, index, or print it.
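
To make the key/value and generator conventions concrete, here is a minimal sketch of how a function like mr.run() could behave. `run_sketch` is a hypothetical stand-in written for illustration; it is not the course's *mapreduce* package, and the real package may group or order keys differently.

```python
from itertools import groupby
from operator import itemgetter


def run_sketch(pairs, mapper, reducer=None):
    """Hypothetical stand-in for mr.run(); not the course package itself."""
    # Map phase: call the mapper on every (key, value) pair.
    mapped = [kv for key, value in pairs for kv in mapper(key, value)]
    if reducer is None:
        # Map-only job: pass the mapper output straight through.
        yield from mapped
        return
    # Shuffle phase: sort by key so equal keys become adjacent, then group.
    mapped.sort(key=itemgetter(0))
    for key, group in groupby(mapped, key=itemgetter(0)):
        # Reduce phase: the reducer sees one key and all of its values.
        yield from reducer(key, [value for _, value in group])


def word_mapper(_, line):
    # The index key added by enumerate() is ignored.
    for word in line.split():
        yield (word, 1)


def word_reducer(word, counts):
    yield (word, sum(counts))


lines = ["a b a", "b c"]
# enumerate() supplies the index key that the raw lines lack.
counts = list(run_sketch(enumerate(lines), word_mapper, word_reducer))
print(counts)  # [('a', 2), ('b', 2), ('c', 1)]
```

Because `run_sketch` is a generator function, nothing is computed until the result is consumed, which is why the call is wrapped in `list()`.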

## Task (10 points)

There is only one task in this homework. You are asked to implement the Social Triangle example discussed in class. In particular, given the email dataset, please list all "reciprocal" relationships in the company. Recall that:

If A emails B and B emails A, then the relationship between A and B is *reciprocal*.

If A emails B but B doesn't email A, then the relationship between A and B is *directed*.
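
As a plain-Python illustration of the two definitions (no MapReduce yet), the following sketch classifies a small set of made-up edges. The names A, B, and C are placeholders, not addresses from the dataset.

```python
# Made-up directed edges (sender, recipient); A, B, C are placeholders.
emails = {("A", "B"), ("B", "A"), ("A", "C")}

# An edge is reciprocal when its reverse is also present; sorting each
# pair collapses (A, B) and (B, A) into a single relationship.
reciprocal = sorted({tuple(sorted(e)) for e in emails if (e[1], e[0]) in emails})

# An edge is directed when its reverse is absent.
directed = sorted(e for e in emails if (e[1], e[0]) not in emails)

print(reciprocal)  # [('A', 'B')]
print(directed)    # [('A', 'C')]
```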

**Dataset:** We will use a subset of the open [Enron Email Dataset](https://www.cs.cmu.edu/~./enron/ "Enron Email Dataset"), which contains approximately 10,000 simplified email headers from the Enron Corporation. You can download this dataset from NYU Classes as **enron_mails_small.csv**. The file contains three columns: *Date*, *From*, and *To*. They are described below:

|Column name|Description|
|--|--|
|Date |The date and time of the email, in the format YYYY-MM-DD hh:mm:ss, <br />e.g. "1998-10-30 07:43:00" |
|From |The sender email address, <br />e.g. "mark.taylor@enron.com" |
|To | A list of recipients' email addresses separated by semicolons ';', <br />e.g. "jennifer.fraser@enron.com;jeffrey.hodge@enron.com" |

Note that we only care about users employed by Enron, i.e. only relationships in which both email addresses end with *'@enron.com'*.
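
The sketch below shows one way to read rows in this layout and keep only Enron-to-Enron edges. The sample rows and the `example.com` addresses are made up for illustration, `internal_edges` is a hypothetical helper, and dropping self-addressed mail is a choice of this sketch rather than a stated requirement.

```python
import csv
import io

# Made-up sample rows in the three-column layout described above.
sample = io.StringIO(
    "Date,From,To\n"
    "1998-10-30 07:43:00,mark.taylor@enron.com,"
    "jennifer.fraser@enron.com;someone@example.com\n"
    "1998-10-30 08:00:00,outsider@example.com,mark.taylor@enron.com\n"
)

DOMAIN = "@enron.com"


def internal_edges(row):
    """Return (sender, recipient) pairs where both sides are Enron addresses."""
    sender = row["From"]
    if not sender.endswith(DOMAIN):
        return []
    return [
        (sender, recipient)
        for recipient in row["To"].split(";")  # recipients are ';'-separated
        if recipient.endswith(DOMAIN) and recipient != sender
    ]


edges = [edge for row in csv.DictReader(sample) for edge in internal_edges(row)]
print(edges)  # [('mark.taylor@enron.com', 'jennifer.fraser@enron.com')]
```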

The expected output is also provided below. For each reciprocal relationship, please output a tuple consisting of two strings. The first is always **'reciprocal'**. The second is a string showing the names of the two people in the following format: **'Jane Doe : John Doe'**. The names should be presented in lexical order, i.e. there will not be a 'John Doe : Jane Doe' since 'Jane' is ordered before 'John'.
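
A minimal sketch of the output record described above, assuming the two names are already converted; `reciprocal_record` is a hypothetical helper name.

```python
def reciprocal_record(name_a, name_b):
    """Build the ('reciprocal', 'First : Second') tuple in lexical order."""
    first, second = sorted([name_a, name_b])
    return ("reciprocal", f"{first} : {second}")


# The argument order does not matter; the result is always the same.
print(reciprocal_record("John Doe", "Jane Doe"))
# ('reciprocal', 'Jane Doe : John Doe')
```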

Though the dataset only contains email addresses, not actual names, we're assuming that the email aliases were created based on their name. For example:

|Email Address|Converted Name|
|--|--|
|mark.taylor@enron.com|Mark Taylor|
|alan.aronowitz@enron.com|Alan Aronowitz|
|marc.r.cutler@enron.com|Marc R Cutler|
|hugh@enron.com|Hugh|

Please fill in the code block with a series of MapReduce jobs using your own mapper and reducer functions. Be sure to include the naming convention logic in one of your mappers and/or reducers.
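
The naming convention in the table above can be sketched as follows. `email_to_name` is a hypothetical helper name, and the rule (split the alias on dots, capitalize each part) is inferred from the four table rows only.

```python
def email_to_name(email):
    """Convert an alias such as 'marc.r.cutler@enron.com' to 'Marc R Cutler'."""
    alias = email.split("@")[0]  # keep the part before the '@'
    return " ".join(part.capitalize() for part in alias.split("."))


for address in ("mark.taylor@enron.com", "marc.r.cutler@enron.com", "hugh@enron.com"):
    print(email_to_name(address))
# Mark Taylor
# Marc R Cutler
# Hugh
```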

In [1]:
import csv
import mapreduce as mr

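def mapper1(_, row):
    # Emit (sender, [recipients]) for mail sent between Enron addresses.
    sender = row['From']
    if sender.endswith('@enron.com'):
        recipients = list(filter(lambda x: x.endswith('@enron.com') and x!=sender,
                                 row['To'].split(';')))
        if recipients:
            yield (sender, recipients)

def reducer1(sender, recipients):
    # Merge the per-email recipient lists into one de-duplicated list.
    yield (sender, list(set().union(*recipients)))

def mapper2(_, row):
    # Emit the reversed edge (recipient, sender) for each Enron recipient.
    sender = row['From']
    if sender.endswith('@enron.com'):
        recipients = filter(lambda x: x.endswith('@enron.com') and x!=sender,
                            row['To'].split(';'))
        for recipient in recipients:
            yield (recipient, sender)

def reducer2(recipient, senders):
    # Collect the distinct senders who emailed this recipient.
    yield (recipient, list(set(senders)))

def mapper3(person1, person2s):
    # Identity mapper: groups each person's recipient and sender lists.
    yield (person1,person2s)

def reducer3(person, edges):
    # Two lists mean the person both sent and received mail;
    # their intersection holds the reciprocal contacts.
    if len(edges)==2:
        for reciprocal in (set(edges[0]) & set(edges[1])):
            yield (person, reciprocal)

def mapper4(person1, person2):
    # Sort the pair so that (A, B) and (B, A) share one key.
    yield (tuple(sorted([person1, person2])), None)

def formatName(email):
    # 'mark.taylor@enron.com' -> 'Mark Taylor'
    return email.split('@')[0].replace('.', ' ').title()

def reducer4(pair, _):
    # Emit one record per unique pair.
    person1,person2 = pair
    yield ('reciprocal', '%s : %s' % (formatName(person1), formatName(person2)))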
    
with open('enron_mails_small.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    data = list(reader)
    output1 = list(mr.run(enumerate(data), mapper1, reducer1))
    output2 = list(mr.run(enumerate(data), mapper2, reducer2))
    output3 = list(mr.run(mr.run(output1+output2, mapper3, reducer3),
                          mapper4, reducer4))

print(len(output3))
output3

35


[('reciprocal', 'Brenda Whitehead : Elizabeth Sager'),
 ('reciprocal', 'Carol Clair : Debra Perlingiere'),
 ('reciprocal', 'Carol Clair : Mark Taylor'),
 ('reciprocal', 'Carol Clair : Richard Sanders'),
 ('reciprocal', 'Carol Clair : Sara Shackleton'),
 ('reciprocal', 'Carol Clair : Tana Jones'),
 ('reciprocal', 'Debra Perlingiere : Kevin Ruscitti'),
 ('reciprocal', 'Drew Fossum : Susan Scott'),
 ('reciprocal', 'Elizabeth Sager : Janette Elbertson'),
 ('reciprocal', 'Elizabeth Sager : Mark Haedicke'),
 ('reciprocal', 'Elizabeth Sager : Mark Taylor'),
 ('reciprocal', 'Elizabeth Sager : Richard Sanders'),
 ('reciprocal', 'Eric Bass : Susan Scott'),
 ('reciprocal', 'Fletcher Sturm : Greg Whalley'),
 ('reciprocal', 'Fletcher Sturm : Sally Beck'),
 ('reciprocal', 'Gerald Nemec : Susan Scott'),
 ('reciprocal', 'Grant Masson : Vince Kaminski'),
 ('reciprocal', 'Greg Whalley : Richard Sanders'),
 ('reciprocal', 'Janette Elbertson : Mark Taylor'),
 ('reciprocal', 'Janette Elbertson : Richard Sa

In [2]:
import csv
import mapreduce as mr

def mapper1(_, row):
    # Tag 0: the list of people this sender emailed.
    # Tag 1: one person who emailed this recipient.
    sender = row['From']
    if sender.endswith('@enron.com'):
        recipients = list(filter(lambda x: x.endswith('@enron.com') and x!=sender,
                                 row['To'].split(';')))
        if recipients:
            yield (sender, (0, recipients))
            for recipient in recipients:
                yield (recipient, (1, sender))

def reducer1(person, others):
    # Materialize the values because they are scanned twice below.
    others = list(others)
    senders = map(lambda x: x[1], filter(lambda x: x[0]==1, others))
    recipients = map(lambda x: x[1], filter(lambda x: x[0]==0, others))
    # Reciprocal contacts both emailed this person and were emailed by them.
    reciprocals = set(senders) & (set().union(*recipients))
    for person2 in reciprocals:
        yield (person, person2)

def mapper4(person1, person2):
    # Sort the pair so that (A, B) and (B, A) share one key.
    yield (tuple(sorted([person1, person2])), None)

def formatName(email):
    # 'mark.taylor@enron.com' -> 'Mark Taylor'
    return email.split('@')[0].replace('.', ' ').title()

def reducer4(pair, _):
    # Emit one record per unique pair.
    person1, person2 = pair
    yield ('reciprocal', '%s : %s' % (formatName(person1), formatName(person2)))
    
with open('enron_mails_small.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    data = enumerate(reader)
    output1 = list(mr.run(data, mapper1, reducer1))
    output3 = list(mr.run(output1, mapper4, reducer4))

print(len(output3))
output3

35


[('reciprocal', 'Brenda Whitehead : Elizabeth Sager'),
 ('reciprocal', 'Carol Clair : Debra Perlingiere'),
 ('reciprocal', 'Carol Clair : Mark Taylor'),
 ('reciprocal', 'Carol Clair : Richard Sanders'),
 ('reciprocal', 'Carol Clair : Sara Shackleton'),
 ('reciprocal', 'Carol Clair : Tana Jones'),
 ('reciprocal', 'Debra Perlingiere : Kevin Ruscitti'),
 ('reciprocal', 'Drew Fossum : Susan Scott'),
 ('reciprocal', 'Elizabeth Sager : Janette Elbertson'),
 ('reciprocal', 'Elizabeth Sager : Mark Haedicke'),
 ('reciprocal', 'Elizabeth Sager : Mark Taylor'),
 ('reciprocal', 'Elizabeth Sager : Richard Sanders'),
 ('reciprocal', 'Eric Bass : Susan Scott'),
 ('reciprocal', 'Fletcher Sturm : Greg Whalley'),
 ('reciprocal', 'Fletcher Sturm : Sally Beck'),
 ('reciprocal', 'Gerald Nemec : Susan Scott'),
 ('reciprocal', 'Grant Masson : Vince Kaminski'),
 ('reciprocal', 'Greg Whalley : Richard Sanders'),
 ('reciprocal', 'Janette Elbertson : Mark Taylor'),
 ('reciprocal', 'Janette Elbertson : Richard Sa

In [7]:
import csv
import mapreduce as mr

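def mapper1(_, row):
    # Key each email by the sorted (person1, person2) pair; the value
    # records the direction: 0 = person1 -> person2, 1 = person2 -> person1.
    sender = row['From']
    if sender.endswith('@enron.com'):
        for recipient in row['To'].split(';'):
            if recipient.endswith('@enron.com'):
                if sender<recipient:
                    #print(sender, recipient)
                    yield ((sender, recipient), 0)
                else:
                    yield ((recipient, sender), 1)

def reducer1(pair, counts):
    # A pair is reciprocal only if both directions (0 and 1) were seen.
    #print(pair,counts)
    yield (pair, len(set(counts))==2)

def formatName(email):
    # 'mark.taylor@enron.com' -> 'Mark Taylor'
    return email.split('@')[0].replace('.', ' ').title()

def mapper2(pair, reciprocal):
    # Keep only the reciprocal pairs and format them for output.
    #print(pair,reciprocal)
    person1, person2 = pair
    if reciprocal:
        yield ('reciprocal', '%s : %s' % (formatName(person1), formatName(person2)))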
    
with open('enron_mails_small.csv', 'r') as fi:
    reader = enumerate(csv.DictReader(fi))
    output1 = list(mr.run(mr.run(reader, mapper1, reducer1), mapper2))

print(len(output1))
output1

[('reciprocal', 'Brenda Whitehead : Elizabeth Sager'),
 ('reciprocal', 'Carol Clair : Debra Perlingiere'),
 ('reciprocal', 'Carol Clair : Mark Taylor'),
 ('reciprocal', 'Carol Clair : Richard Sanders'),
 ('reciprocal', 'Carol Clair : Sara Shackleton'),
 ('reciprocal', 'Carol Clair : Tana Jones'),
 ('reciprocal', 'Debra Perlingiere : Kevin Ruscitti'),
 ('reciprocal', 'Drew Fossum : Susan Scott'),
 ('reciprocal', 'Elizabeth Sager : Janette Elbertson'),
 ('reciprocal', 'Elizabeth Sager : Mark Haedicke'),
 ('reciprocal', 'Elizabeth Sager : Mark Taylor'),
 ('reciprocal', 'Elizabeth Sager : Richard Sanders'),
 ('reciprocal', 'Eric Bass : Susan Scott'),
 ('reciprocal', 'Fletcher Sturm : Greg Whalley'),
 ('reciprocal', 'Fletcher Sturm : Sally Beck'),
 ('reciprocal', 'Gerald Nemec : Susan Scott'),
 ('reciprocal', 'Grant Masson : Vince Kaminski'),
 ('reciprocal', 'Greg Whalley : Richard Sanders'),
 ('reciprocal', 'Janette Elbertson : Mark Taylor'),
 ('reciprocal', 'Janette Elbertson : Richard Sa