# Loading and Processing of Carpediem Email Data

To begin, we need to load and build a dataset suitable for visualization and data exploration. We start by adapting the email scraper from the Olin Snapshot Project, which fetches mail over POP. Unfortunately, the embedded JSON schema is too deeply nested for pandas to use directly (we attempted the conversion), so instead we load the data and save it to a CSV file of our own design for easy conversion to pandas.

To finish, we will need to do the following to the data:
    1. Manually tag every email
    2. Split the data into training and test data appropriately
        a. This will require sorting to ensure that enough emails with each tag make it into both sets
    3. Create a key file for the test data
    4. Remove the tags from the test file
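
The steps above can be sketched as follows. This is a minimal illustration rather than the code used for the project: the function names and the `tag` and `id` column names are assumptions (the scraped dictionaries do not contain an `id` field, so one would have to be added during tagging), and the 20% test fraction is arbitrary.

```python
import random
from collections import defaultdict

def stratified_split(rows, tag_field="tag", test_fraction=0.2, seed=0):
    """Split rows into (train, test), holding out part of every tag group.

    `tag_field` is a hypothetical column name added during manual tagging.
    """
    by_tag = defaultdict(list)
    for row in rows:
        by_tag[row[tag_field]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for group in by_tag.values():
        rng.shuffle(group)
        # Hold out at least one row per tag, but always leave one for training.
        n_test = min(len(group) - 1, max(1, round(len(group) * test_fraction)))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

def split_key_from_test(test_rows, tag_field="tag", id_field="id"):
    """Return (key_rows, untagged_rows) for the test set.

    key_rows keeps only the id and the tag; untagged_rows drops the tag.
    """
    key_rows = [{id_field: r[id_field], tag_field: r[tag_field]} for r in test_rows]
    untagged = [{k: v for k, v in r.items() if k != tag_field} for r in test_rows]
    return key_rows, untagged
```

A tag with a single email stays entirely in the training set under this scheme, since there is nothing left to hold out. Each of the resulting lists of dictionaries could then be written out with a function such as `create_data_csv`.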

The goal is to first create a list of dictionaries holding each email's title, subject line, and message content. We also want to store these dictionaries in a CSV file with the appropriate headers.
*Note that poplib only loads fresh emails; marking them as unread does not work. Emails have to be resent or forwarded to create the initial file.

In [1]:
import os
import sys
import time
import poplib
import email
import csv
from io import StringIO
from datetime import datetime

def list_to_dict(msg):
    """ converts a list of "Key: value" strings into a dictionary """
    d = dict()
    for field in msg:
        # Remove any asterisks from the field
        if "*" in field:
            field = "".join(field.split("*"))
        # Skip lines without a "Key: value" separator (e.g. wrapped header lines)
        if ": " not in field:
            continue

        print(field)
        key, value = field.split(": ", 1)
        if key == "Sent":
            key = "date"
            value = value.split(" (", 1)[0]
            # ex. string at this point: Tuesday, April 3, 2018 3:51:10 PM
            #value = datetime.strptime(value, "%A, %B %d, %Y %I:%M:%S %p")
        if key == "From":
            key = "who"
            # Keep only the sender's name that follows "On Behalf Of"
            value = value.split("Behalf Of", 1)[-1].strip()
        if key == "Subject":
            key = "name"
        if key == "To":
            continue
        d[key] = value
    return d

def get_mail():
    """ fetches new messages from a gmail account using POP, then parses them
    into a list of dictionaries (one per email)
    """

    # Connect to gmail account and retrieve messages
    pop_conn = poplib.POP3_SSL('pop.gmail.com')
    pop_conn.user('olinsnapshot@gmail.com')
    pop_conn.pass_('hackingthelibrary')
    # messages = [pop_conn.retr(i) for i in range(1, len(pop_conn.list()[1]) + 1)]

    messages = []

    # Parse messages
    resp, items, octets = pop_conn.list()

    for item in items:
        # Each listing entry is "<message number> <size in octets>"
        msg_num, size = item.decode().split(' ')
        resp, text, octets = pop_conn.retr(msg_num)

        text = [x.decode() for x in text]
        text = "\n".join(text)
        msg_file = StringIO(text)

        orig_email = email.message_from_file(msg_file).as_string()
        messages.append(orig_email.split("\n"))

    pop_conn.quit()

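    msg_dicts = []
    for msg in messages:
        if not msg: continue
        header_index = -1
        msg_info = []
        body = []
        i = 0
        while i < len(msg):
            line = msg[i]
            if "CarpediemOn Behalf Of" in line:
                # Take this line and the next four as the forwarded header block;
                # the increment at the bottom of the loop then also skips the
                # line right after the block
                header_index = i
                msg_info = msg[i:i+5]
                i = i + 5
            elif header_index != -1:
                if line == "":
                    i = i + 1
                    continue
                elif "--" in line:
                    # Stop collecting the body at the first line containing "--"
                    break
                else:
                    body.append(line)

            i = i + 1
        if header_index == -1:
            # No Carpediem header was found, so there is nothing to parse
            continue
        msg_info.append("body: " + " ".join(body))
        print(msg_info)
        current_dict = list_to_dict(msg_info)
        print(list(current_dict.keys()))
        msg_dicts.append(current_dict)

    print(msg_dicts)
    return msg_dicts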

def create_data_csv(data_list,location):
    '''
    Converts a data list of dictionaries into a csv file at argument location
    '''
    # newline='' stops the csv module from writing blank rows on Windows
    with open(location, 'w', newline='') as test_data_1:
        csvwriter = csv.writer(test_data_1)
        for count, row in enumerate(data_list):
            if count == 0:
                # Use the first email's keys as the header row
                csvwriter.writerow(row.keys())
            csvwriter.writerow(row.values())
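
Because `create_data_csv` takes its header from the first dictionary only, a later email with a missing or extra key would end up with shifted columns. One possible alternative, shown here as a sketch, is `csv.DictWriter` with a fixed column list. The function name `write_dicts_to_csv` is hypothetical, and the default columns assume the keys produced by `list_to_dict` (`who`, `date`, `name`, `body`).

```python
import csv

def write_dicts_to_csv(rows, location, fieldnames=("who", "date", "name", "body")):
    """Write a list of dictionaries to a CSV file with a fixed set of columns."""
    with open(location, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=list(fieldnames),
            restval="",             # fill missing keys with an empty cell
            extrasaction="ignore",  # silently drop keys that are not columns
        )
        writer.writeheader()
        writer.writerows(rows)
```

Missing keys become empty cells and unexpected keys are dropped, so every row has the same columns when the file is later read into pandas.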

Loading Data: *in reality this was done in 4 parts because of the tedious re-forwarding, and the parts were combined manually
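
The parts were combined by hand, but the same step could be scripted. The following is only a sketch of that idea: the function name `combine_csv_parts` and any file paths passed to it are hypothetical, and it assumes every part was written with the same header row.

```python
import csv

def combine_csv_parts(part_paths, combined_path):
    """Concatenate CSV files that share one header; return the data row count."""
    header = None
    n_rows = 0
    with open(combined_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in part_paths:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # skip empty files
                if header is None:
                    # The first non-empty part defines the header
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError("header mismatch in " + path)
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows
```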

In [2]:
emails_data = get_mail()
# don't run this or it will overwrite the saved csv
#create_data_csv(emails_data,'./data/test_data.csv')

['From: CarpediemOn Behalf OfGrace Montagnino', 'Sent: Thursday, May 10, 2018 12:48:36 PM (UTC-05:00) Eastern Time (US & Can=', 'ada)', 'To: Carpediem@lists.olin.edu', 'Subject: [Carpediem] Donate to the Needham Community Council', "body: From: CarpeSERV [mailto:carpeserv-bounces@lists.olin.edu] On Behalf Of Just= in Kunimune (Forwarding) Sent: Wednesday, May 9, 2018 9:41 PM To: carpediem@lists olin. edu <carpediem@lists.olin.edu>; carpeserv@lists.o= lin.edu Subject: [CarpeSERV] More boxes for the Needham Community Council tl;dr =96 put any excess food and cleaning items in the labelled bins in th= e residence halls to donate them to a food bank. Long version: If you saw Melissa's email, it's the same thing, but now with more boxes th= at will be there for longer! There is now a labelled box in the WH anteloun= ge, and will be one in EH tomorrow, for donating food and toiletries to the=  Needham Community Council Food Pantry. These boxes will be out through the=  end of the semester, s
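
The printed output above still contains quoted-printable residue, such as `=96` and soft line breaks that show up as `th= e` once the lines are joined with spaces. A possible cleanup, sketched here with the standard-library `quopri` module, is to decode the raw message text before it is split into lines. The function name is hypothetical, and the `cp1252` charset is an assumption based on the `=96` byte; the real charset should be taken from each message's headers.

```python
import quopri

def decode_quoted_printable(text, charset="cp1252"):
    """Decode quoted-printable escapes and soft line breaks in a string.

    `charset` is an assumed default, not something read from the message.
    """
    raw = quopri.decodestring(text.encode("ascii", errors="replace"))
    return raw.decode(charset, errors="replace")
```

This has to run on text that still contains its original newlines, because a soft line break is only recognizable as `=` followed directly by a line ending.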

We will classify the data according to tags represented by numbers with the following schema:
0 - academic
1 - burstthebubble
2 - food
3 - holidays
4 - StAR
5 - admission 
6 - athletics 
7 - BOW
8 - collaboratory
9 - CORe
10 - diversity
11 - Final Events
12 - FWOP
13 - GrOW
14 - hours 
15 - library 
16 - NINJAs
17 - OFAC
18 - parties
19 - performance
20 - PGP 
21 - robolab
22 - shop
23 - SLAC
24 - talks
25 - transportation
26 - stuff 
27 - housing
28 - other
** Note that we have added 3 categories to represent emails that indicate something is free or for sale (26), housing (27), and other (28). These emails probably would not make it onto the actual page, but classifying them is still useful for recognizing that they should not become events.
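
For use in later code, the schema above could be captured as a simple lookup. This is a sketch: the names `TAGS`, `NON_EVENT_TAGS`, `tag_name`, and `is_event_tag` are hypothetical, and treating tags 26 to 28 as non-events follows the note above rather than any confirmed rule.

```python
# Tag names indexed by their numeric id, in the order listed above
TAGS = [
    "academic", "burstthebubble", "food", "holidays", "StAR", "admission",
    "athletics", "BOW", "collaboratory", "CORe", "diversity", "Final Events",
    "FWOP", "GrOW", "hours", "library", "NINJAs", "OFAC", "parties",
    "performance", "PGP", "robolab", "shop", "SLAC", "talks",
    "transportation", "stuff", "housing", "other",
]

# Assumption: the three added categories mark emails that should not become events
NON_EVENT_TAGS = {26, 27, 28}

def tag_name(tag_id):
    """Return the tag name for a numeric id (int or numeric string from a CSV)."""
    return TAGS[int(tag_id)]

def is_event_tag(tag_id):
    """Return True if the tag belongs to one of the original event categories."""
    return int(tag_id) not in NON_EVENT_TAGS
```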