## MapReduce

In this assignment, we were given the script 'mapreduce.py' (copyrighted by New York University; see the script header for details) and asked to use its functions to process a batch of Enron emails. The goal was to find every pair of individuals with a "reciprocal" relationship, meaning that each person had sent at least one email to the other (as opposed to pairs where one person sent emails to the other but never received any in return).
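
To make the definition concrete before bringing in MapReduce, here is a minimal plain-Python sketch of the same idea. The function name `reciprocal_pairs` and the toy names are illustrative assumptions, not part of the assignment code.

```python
def reciprocal_pairs(edges):
    """Return sorted pairs (a, b) where both a -> b and b -> a appear in edges."""
    sent = set(edges)  # de-duplicate repeated emails between the same two people
    return sorted(
        {tuple(sorted((a, b))) for a, b in sent if a != b and (b, a) in sent}
    )

# Hypothetical toy data: (sender, recipient)
emails = [("Ann", "Bob"), ("Bob", "Ann"), ("Ann", "Cat"), ("Ann", "Bob")]
print(reciprocal_pairs(emails))  # [('Ann', 'Bob')]
```

Ann and Cat are not reported because Cat never wrote back, and the repeated Ann to Bob email does not change the result.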

**Dataset:** The data is a subset of the open [Enron Email Dataset](https://www.cs.cmu.edu/~./enron/ "Enron Email Dataset"), which contains approximately 10,000 simplified email headers from the Enron Corporation. The file contains three columns, *Date*, *From*, and *To*, described below:

|Column name|Description|
|--|--|
|Date |The date and time of the email, in the format YYYY-MM-DD hh:mm:ss, <br />e.g. "1998-10-30 07:43:00" |
|From |The sender email address, <br />e.g. "mark.taylor@enron.com" |
|To | A list of recipients' email addresses separated by semicolons ';', <br />e.g. "jennifer.fraser@enron.com;jeffrey.hodge@enron.com" |
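
As a rough illustration of this layout, the sketch below reads rows with `csv.DictReader` and expands the semicolon-separated *To* field into one (sender, recipient) pair per recipient. The in-memory sample rows and the helper name `sender_recipient_pairs` are assumptions made for the example; the real file is read from disk later in the notebook.

```python
import csv
import io

# Hypothetical two-row sample in the column layout described above
sample = io.StringIO(
    "Date,From,To\n"
    "1998-10-30 07:43:00,mark.taylor@enron.com,"
    "jennifer.fraser@enron.com;jeffrey.hodge@enron.com\n"
    "1998-10-30 07:56:00,mark.taylor@enron.com,alan.aronowitz@enron.com\n"
)

def sender_recipient_pairs(rows):
    """Yield one (sender, recipient) tuple per recipient in each row."""
    for row in rows:
        for to in row["To"].strip().split(";"):
            if to:  # skip empty entries, e.g. from a trailing ';'
                yield row["From"], to

pairs = list(sender_recipient_pairs(csv.DictReader(sample)))
print(len(pairs))  # 3
```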

For this task, we only looked at pairs of Enron employees, that is, relationships in which both email addresses end with '@enron.com'. Although the dataset contains only email addresses, we assume that the email aliases were created from actual names. For example:

|Email Address|Converted Name|
|--|--|
|mark.taylor@enron.com|Mark Taylor|
|alan.aronowitz@enron.com|Alan Aronowitz|
|marc.r.cutler@enron.com|Marc R Cutler|
|hugh@enron.com|Hugh|
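
The conversion in the table can be sketched as a small helper. The name `email_to_name` is hypothetical; the notebook itself performs the same steps inline in its mapper.

```python
def email_to_name(address):
    """Convert 'first.last@domain' to 'First Last' using only the alias part."""
    alias = address.split("@")[0]            # drop the domain
    return alias.replace(".", " ").title()   # dots become spaces, then capitalize

for address in ["mark.taylor@enron.com", "marc.r.cutler@enron.com", "hugh@enron.com"]:
    print(email_to_name(address))
```

Note that `str.title()` only capitalizes the first letter of each word, so an alias such as `mcdonald` would become `Mcdonald`.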

The result of this notebook is a list of pairs of individuals with a reciprocal relationship.

In [1]:
import csv
import mapreduce as mr

In [2]:
def mapper_split(_, row):
    # Split the recipient email addresses
    for to in row['To'].strip().split(';'):
        # Keep only emails where both sender and recipient are @enron.com
        if '@enron.com' in to and '@enron.com' in row['From']:
            # Yield (sender name, recipient name), e.g. 'mark.taylor' -> 'Mark Taylor'
            yield row['From'].split('@')[0].replace('.',' ').title(), to.split('@')[0].replace('.',' ').title()

def reducer_set(froms, tos):
    # Reduce the tos to a set
    yield (froms, set(tos))
    
def mapper_counts(froms, tos):
    # Sort each (sender, recipient) pair so both directions share one key, then count 1
    for to in tos:
        from_tos = sorted([to, froms])
        yield tuple(from_tos), 1
        
def reducer_sum(from_tos, counts):
    # Sum the counts
    yield from_tos, sum(counts)

def final_mapper(from_tos, counts):
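    # Each direction contributes at most 1 (reducer_set de-duplicates recipients),
    # so a count above 1 means both people emailed each other
    if counts > 1:
        yield 'reciprocal', (str(from_tos[0]) + " : " + str(from_tos[1]))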
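
To see why a count above 1 identifies a reciprocal pair, the three stages can be traced on toy data. The `run` function below is a hypothetical stand-in written for this sketch (map, group by key, reduce); it is not the NYU `mapreduce.py` implementation, whose internals are not shown here. The stage functions are simplified versions of the ones above and take pre-parsed names instead of CSV rows.

```python
from itertools import groupby
from operator import itemgetter

def run(records, mapper, reducer=None):
    """Hypothetical stand-in for mr.run: map, group by key, then reduce."""
    mapped = [kv for key, value in records for kv in mapper(key, value)]
    if reducer is None:
        return mapped
    mapped.sort(key=itemgetter(0))
    out = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        out.extend(reducer(key, [v for _, v in group]))
    return out

def map_edges(_, row):
    sender, recipients = row
    for to in recipients:
        yield sender, to

def reduce_unique(sender, tos):
    yield sender, set(tos)  # each recipient is kept once per sender

def map_pairs(sender, tos):
    for to in tos:
        yield tuple(sorted([sender, to])), 1  # same key for both directions

def reduce_sum(pair, counts):
    yield pair, sum(counts)

def map_reciprocal(pair, count):
    if count > 1:
        yield 'reciprocal', pair[0] + ' : ' + pair[1]

# Hypothetical toy input: (sender, list of recipients)
toy = enumerate([("Ann", ["Bob", "Cat"]), ("Bob", ["Ann"]), ("Ann", ["Bob"])])
stage1 = run(toy, map_edges, reduce_unique)
stage2 = run(stage1, map_pairs, reduce_sum)
result = run(stage2, map_reciprocal)
print(result)  # [('reciprocal', 'Ann : Bob')]
```

Ann emailed Bob twice, but the set in the first reducer collapses that to a single entry, so the pair only reaches a count of 2 because Bob also emailed Ann.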

In [3]:
# Final Product
with open('data/enron_mails_small.csv', 'r') as fi:
    reader = enumerate(csv.DictReader(fi))
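    # Stage 1: unique recipients per sender
    recipients = mr.run(reader, mapper_split, reducer_set)
    # Stage 2: number of directions seen for each unordered pair
    pair_counts = mr.run(recipients, mapper_counts, reducer_sum)
    # Stage 3: keep only pairs seen in both directions
    output = list(mr.run(pair_counts, final_mapper))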

print(len(output))
output

35


[('reciprocal', 'Brenda Whitehead : Elizabeth Sager'),
 ('reciprocal', 'Carol Clair : Debra Perlingiere'),
 ('reciprocal', 'Carol Clair : Mark Taylor'),
 ('reciprocal', 'Carol Clair : Richard Sanders'),
 ('reciprocal', 'Carol Clair : Sara Shackleton'),
 ('reciprocal', 'Carol Clair : Tana Jones'),
 ('reciprocal', 'Debra Perlingiere : Kevin Ruscitti'),
 ('reciprocal', 'Drew Fossum : Susan Scott'),
 ('reciprocal', 'Elizabeth Sager : Janette Elbertson'),
 ('reciprocal', 'Elizabeth Sager : Mark Haedicke'),
 ('reciprocal', 'Elizabeth Sager : Mark Taylor'),
 ('reciprocal', 'Elizabeth Sager : Richard Sanders'),
 ('reciprocal', 'Eric Bass : Susan Scott'),
 ('reciprocal', 'Fletcher Sturm : Greg Whalley'),
 ('reciprocal', 'Fletcher Sturm : Sally Beck'),
 ('reciprocal', 'Gerald Nemec : Susan Scott'),
 ('reciprocal', 'Grant Masson : Vince Kaminski'),
 ('reciprocal', 'Greg Whalley : Richard Sanders'),
 ('reciprocal', 'Janette Elbertson : Mark Taylor'),
 ('reciprocal', 'Janette Elbertson : Richard Sa