a4: Unrequited Mail
====================

Overall, I found it hard to identify features that accurately predict my e-mail behavior. To begin, I ran some statistics on my e-mail behavior in general. I receive about 41.623 e-mails a day and respond to 2.16 of them, which is about 5% of the time. Looking back on my exchanges, I tend to reply only in pressing situations or when the conversation is still "continuing," and it is hard for an e-mail responder to gauge when a conversation or exchange is resolved and over. My linear model (a Lasso regression) scored 0.598148103478 on its own training data; the cross-validation scores were lower.

I turned on Mailbot for a week: a cron job ran every 10 minutes on AWS EC2 for 5 days, and I enabled it only for a whitelist of my mom, dad, and Jeff! I did not pull in new data for my calculations and trained my model on the existing set (I thought pulling new data might cause too much of a delay).

Getting Data & Storing Metadata
-------------------------------
For Part 1, I parsed the raw email TSV (create_csv.py), starting from the beginning of the school year (September 1, 2015), as I figured that my e-mail patterns, and the ratio in which I talk to people, change with each school year.

I created two hash tables, mid_all and mid_replied, each mapping a message ID (mid) to that message's metadata: mid_all covers all of my emails, and mid_replied only the ones I replied to (the messages sent from my address). I kept them separate so that the emails I replied to are faster to parse later on.

In addition, I created contact_total_count to track the number of email exchanges I have with each contact, to later obtain a contact ratio that I use as a feature.
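
The per-contact counting described above can be sketched on toy data as follows. This is only an illustration under assumptions: the helper name `count_contacts`, the example addresses, and the `(senders, receivers)` input shape are made up here, and it uses `collections.Counter` rather than the plain dictionary in the notebook cell.

```python
from collections import Counter

def count_contacts(messages):
    """Count how many messages each address appears in, as sender or receiver.

    `messages` is a hypothetical list of (senders, receivers) pairs, each a
    space-separated string of addresses, mirroring the sender and receiver
    columns of the TSV.
    """
    counts = Counter()
    for senders, receivers in messages:
        for address in senders.split() + receivers.split():
            counts[address] += 1
    return counts

# Toy data, not taken from the real mailbox.
example = [
    ("me@example.com", "mom@example.com dad@example.com"),
    ("mom@example.com", "me@example.com"),
]
contact_counts = count_contacts(example)
```

Here `contact_counts` plays the role of contact_total_count: every appearance of an address on either side of a message adds one to that contact's total.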

In [1]:
#PART 1:
    #1) Parse raw email tsv, filter into mid_replied and mid_all
    #2) Create contact_total_count, meaning # of total email exchanges between contact
    #3) Output mid_replied and mid_all:
    #FINAL OUTPUT:
        #mid_all and mid_replied [TIME, RECEIVER LIST, INREPLYTO, Subj, numCC]
    
import csv
from email.utils import parsedate
from time import mktime
from datetime import datetime

contact_total_count = {} #format: K:receiver, V: #count

with open('raw-email-rec.tsv', 'rb') as tsv_file:
    tsv_reader = csv.reader(tsv_file, delimiter='\t')
    next(tsv_reader) #skip the first (header) line
    mid_replied = {}
    mid_all = {}
    for row in tsv_reader:
        if len(row) != 8 or not row[4]: #poorly formatted email or no receiver
            continue
        # index of all e-mails
        m_content = []
        mid = row[0].strip()
        mid = mid.replace('<','')
        mid = mid.replace('>','')
        m_date = row[1].strip()
        m_subj = row[2]
        m_senders = row[3]
        m_receivers = row[4]
        m_numcc = len(row[5].strip().split(",")) #note: an empty cc field still counts as 1

        #adjust total count of contacts (sender-side)
        sender_list = m_senders.strip().split(' ')
        for s in sender_list:
            s = s.strip()
            if s in contact_total_count:
                contact_total_count[s]+=1
            else:
                contact_total_count[s]=1
        
        #adjust total count of contacts (receiver_side)
        receiver_list = m_receivers.strip().split(' ')
        for r in receiver_list:
            r = r.strip()
            if r in contact_total_count:
                contact_total_count[r]+=1
            else:
                contact_total_count[r]=1
            
        in_reply_to = row[6].strip()
        time = datetime.fromtimestamp(mktime(parsedate(m_date)))
        m_content.append(time)
        m_content.append(receiver_list)
        if in_reply_to: #if string is not empty, actually part of a thread  
            m_content.append(in_reply_to)
        else:
            m_content.append('None')
        m_content.append(m_subj)
        m_content.append(m_numcc)
        mid_all[mid] = m_content
        
        # sent from my address (I replied to or initiated this message)
        if row[3].strip() == 'sharon_lo@brown.edu':
            mid_replied[mid] = m_content

Creating Features for Contact Relationship
------------------------------------------

Overall, I thought that the most indicative features for whether and when I will reply to someone are my average reply time and my "relationship" with them, something I thought could be gauged by the ratio (# of times I reply) / (# of exchanges we have).

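These are recorded as contact_avg_response and contact_ratio, which were then used by MailBot. In the code, the averages are stored back into contact_reply_times and written to avg_response.csv, and the ratios are written to contact_ratio.csv.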
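
As a rough sketch of how these two per-contact quantities relate to the raw data, the toy example below computes an average reply gap and a reply ratio for each contact. The function name `contact_stats`, the addresses, and the numbers are all made up for illustration. It is also a simplification: it counts only replies in the numerator, while the notebook's contact_reply_count also includes emails I initiated.

```python
from __future__ import division

def contact_stats(reply_gaps, total_counts):
    """Return (avg_response, ratio) dictionaries keyed by contact.

    reply_gaps:   contact -> list of reply gaps in seconds (toy input)
    total_counts: contact -> total number of exchanges with that contact
    """
    avg_response = {}
    ratio = {}
    for contact, gaps in reply_gaps.items():
        avg_response[contact] = sum(gaps) / len(gaps)        # mean gap, seconds
        ratio[contact] = len(gaps) / total_counts[contact]   # replies / exchanges
    return avg_response, ratio

# Toy data, not taken from the real mailbox.
gaps = {"mom@example.com": [3600.0, 7200.0], "jeff@example.com": [600.0]}
totals = {"mom@example.com": 4, "jeff@example.com": 6}
avg_response, ratio = contact_stats(gaps, totals)
```

With this toy input, the first contact averages 1.5 hours per reply and gets a reply in half of the exchanges, while the second gets quick but rare replies.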

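I later printed out my average response time, which is quite, quite high! The reply gaps are computed with total_seconds(), so all values are in seconds. The aggregate_avg_response figure shown here is printed before the division by the number of contacts, so it is the sum of the per-contact averages rather than the average itself. The actual aggregate average, printed later in the notebook, is 217674.249233 seconds, so I generally respond to individuals in about 60.5 hours (roughly 2.5 days). The aggregate_contact_ratio takes into account only the contacts I actually replied to (i.e. once I participate in an exchange, how likely I am to continue). It does not take into account email contacts I never talked to.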

aggregate_avg_response:
61166464.0344
aggregate_contact_ratio:
0.359973620336

In [2]:
from __future__ import division
import numpy
import pickle

#PART 2:
        #1) Create contact_reply_times = list of response times for each contact
        #2) Create contact_reply_count = total count of emails initiated or replied to this contact
        #3) contact_avg_response = avg response time for each contact
        #4) contact_ratio = #reply/#total
        #5) write both dictionaries to file for MailBot to parse

contact_reply_times = {} #format receiver: time, time, time,...
contact_reply_count = {} #format receiver: #count

for mid_key in mid_replied:
    mid_meta= mid_replied[mid_key]
    m_date_1 = mid_meta[0]
    m_receiver_list = mid_meta[1]
    m_inreplyto = mid_meta[2]
    if m_inreplyto == 'None':
        for r in m_receiver_list:
            if r in contact_reply_count:
                contact_reply_count[r]+=1
            else:
                contact_reply_count[r]=1
    else: #in reply to something, look up the original message in the mid_all hashmap
        if m_inreplyto in mid_all:
            m_date_2 = mid_all[m_inreplyto][0]
            diff = abs((m_date_1 - m_date_2).total_seconds()) #reply gap in seconds
        else:
            diff = 172800.0 #default value of 2 days (in seconds)
        for r in m_receiver_list:
            if r in contact_reply_count:
                contact_reply_count[r]+=1
            else:
                contact_reply_count[r]=1
            #append this reply gap to the contact's list, creating the list if needed
            contact_reply_times.setdefault(r, []).append(diff)

aggregate_avg_response = 0
num_response = 0
#make receiver avg map: contact_reply_times[r] is overwritten with the contact's avg response time (seconds)
for r in contact_reply_times.keys():
    num_response += 1
    time_list = numpy.array(contact_reply_times[r])
    avg_r = numpy.mean(time_list)
    contact_reply_times[r] = avg_r
    aggregate_avg_response += avg_r

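print "aggregate_avg_response"
print aggregate_avg_response #note: at this point still the sum of the per-contact averages (seconds)

aggregate_avg_response = aggregate_avg_response/num_response #the actual aggregate average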

#write avg_response to file:
with open('avg_response.csv', 'wb') as avg_f:
    csv_writer_avg = csv.writer(avg_f)
    for key, value in contact_reply_times.items():
        csv_writer_avg.writerow([key, value])

aggregate_contact_ratio = 0
contact_ratio = {}
for r in contact_reply_times.keys():
    reply_count = 0
    if r in contact_reply_count:
        reply_count = contact_reply_count[r] 
    ratio_r = reply_count/contact_total_count[r]
    contact_ratio[r] = ratio_r
    aggregate_contact_ratio += ratio_r

aggregate_contact_ratio = aggregate_contact_ratio/num_response

print "aggregate_contact_ratio"
print aggregate_contact_ratio

#write ratio to file:
with open('contact_ratio.csv', 'wb') as f:
    writer = csv.writer(f)
    for key, value in contact_ratio.items():
        writer.writerow([key, value])


aggregate_avg_response
61166464.0344
aggregate_contact_ratio
0.359973620336



Creating Entire Feature Set
---------------------------

I tried many different features but distilled them down to these 7 as the most indicative of my reply time. Some others I tried out had negligible correlation coefficients:
- "ASAP" or "Urgent" in the body or subject of the email. It turned out that this matched a lot of spam e-mails or flash sales along the lines of "50% off sale at ___. Be part of it asap!"
- emojis in the body of the email. I thought these would signal a more "real" relationship, but it turns out there was not much of an effect.

Overall the features I used were: 
- contact ratio (# of emails replied/total exchanges)
- average response time in past
- day of week received
- day of week sent reply (the current day for Mailbot)
- time of day (broken down into buckets for a more accurate gauge)
    - morning, afternoon, night, late night
- boolean: key words in email address (either a student/faculty from Brown or a company)
- how many people cc'd

In [3]:
#PART 3:
#1) data_features = feature vector

#features: 
    #0: contact ratio
    #1: avg response time
    #2: day of week received
    #3: day of week sent, 
    #4: time of day
    #5: boolean: certain words in email address
    #6: how many people cc'd
address_set = ['brown', 'google', 'microsoft', 'pinterest', 'qualtrics', 'square', 'uber', 'airbnb', 'fb', 'twitter']

data_features = []
for mid_key in mid_replied:
    mid_meta= mid_replied[mid_key]
    m_date_1 = mid_meta[0]
    m_day_1= m_date_1.weekday() #where days = {0:'Mon',1:'Tues',2:'Weds',3:'Thurs',4:'Fri',5:'Sat',6:'Sun'}
    m_day_2 = m_day_1 #update if in reply
    
    #categories of time-of-day {0 morning: 6-12, 1 afternoon: 12-18, 2 night: 18-24, 3 late night: 0-6}
    m_time = m_date_1.hour
    if m_time >=0 and m_time < 6:
        m_tofday = 3
    elif m_time >=6 and m_time < 12:
        m_tofday = 0
    elif m_time >=12 and m_time < 18:
        m_tofday = 1
    else:
        m_tofday = 2
    
    m_receiver_list = mid_meta[1]
    avg_response_time = 0
    num_receivers = len(m_receiver_list)
    for r in m_receiver_list:
        if r in contact_reply_times:
            avg_response_time += contact_reply_times[r]
        else:
            avg_response_time += 0
    avg_response_time = float(avg_response_time/num_receivers)
    avg_response_time = 0.6*(aggregate_avg_response) + 0.4*(avg_response_time)
    
    m_address = 0
    for r in m_receiver_list:
        for a in address_set:
            if a in r:
                m_address = 1
    
    in_reply_mid = mid_meta[2]
    diff = 0
    if in_reply_mid != 'None': #should be in reply to something, look up in all_messages hashmap
        if in_reply_mid in mid_all:
            m_date_2 = mid_all[in_reply_mid][0]
            m_day_2 = m_date_2.weekday()
            diff = (m_date_1 - m_date_2).total_seconds()
        else:
            diff = 172800.0 #default value of 2 days, bigger than average response
    else:  #don't include initiated emails in training set
        continue
    
    ratio = 0
    for r in m_receiver_list:
        if r in contact_ratio:
            ratio += contact_ratio[r]
        else:
            ratio += 0
    ratio = float(ratio/num_receivers)
    ratio = 0.8*(aggregate_contact_ratio) + 0.2*(ratio)
    
    m_numcc = mid_meta[4]
    
    #features: 
    #0: contact ratio
    #1: avg response time
    #2: day of week received
    #3: day of week sent, 
    #4: time of day
    #5: boolean: certain words in email address
    #6: how many people cc'd
    #7: target: actual response time in seconds, abs(diff)
    
    mid_features = [ratio, avg_response_time, m_day_2, m_day_1, m_tofday, m_address, m_numcc, abs(diff)]
    data_features.append(mid_features)

Training Linear Classifier
---------------------------

As stated above, my linear model scored 0.598148103478 on the training set. The cross-validation scores (0.34343095, 0.35275278, -2.72528618) were much better on the first two folds than on the last, suggesting Senior Spring was extremely unpredictable, which doesn't surprise me much.
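
For a regressor such as Lasso, the numbers that scikit-learn's `score` and (by default) `cross_val_score` report are R^2 values: 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that. The pure-Python sketch below, with a made-up helper `r_squared` and toy numbers, illustrates how a fold can end up below zero.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # error of the mean
    return 1.0 - ss_res / ss_tot

# Toy targets and two sets of toy predictions.
actual = [1.0, 2.0, 3.0, 4.0]
good = r_squared(actual, [1.1, 1.9, 3.2, 3.8])   # close to the targets
bad = r_squared(actual, [4.0, 3.0, 2.0, 1.0])    # trend is reversed
```

A strongly negative score on one fold therefore means that, on that held-out slice, the fitted model did worse than a constant guess of the slice's mean response time.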

For me, it seemed like the most notable features were:
- Average contact ratio
- Time of Day
- Whether this is someone with a notable keyword in address
- Day of week sent and received

I think an interesting direction would be to focus more on features that detect contact relationship and time of day, as these seem like the most impactful predictors for me.

In terms of quality of relationship:
 - detect more features in terms of initiated_conversations with this person (how "important" this contact generally is to me with regard to e-mail)

Time of day:
  - differentiating more between weekends and weekdays
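
Neither of these directions is implemented in the notebook. A minimal sketch of what they might look like is below, assuming the notebook's convention that an in-reply-to value of 'None' marks a message that starts a thread. The helper names `is_weekend` and `count_initiated` and the example addresses are hypothetical.

```python
from collections import Counter
from datetime import datetime

def is_weekend(moment):
    """Return 1 for Saturday/Sunday, else 0 (weekday() is 0 for Monday)."""
    return 1 if moment.weekday() >= 5 else 0

def count_initiated(sent_messages):
    """Count, per contact, the threads I started.

    sent_messages: hypothetical list of (receiver_list, in_reply_to) pairs,
    where in_reply_to == 'None' marks a message that is not a reply.
    """
    initiated = Counter()
    for receivers, in_reply_to in sent_messages:
        if in_reply_to == 'None':
            initiated.update(receivers)
    return initiated

# Toy data, not taken from the real mailbox.
sent = [
    (["mom@example.com"], "None"),
    (["mom@example.com", "dad@example.com"], "None"),
    (["jeff@example.com"], "abc123@mail.example.com"),
]
initiated_counts = count_initiated(sent)
saturday_flag = is_weekend(datetime(2015, 9, 5))   # a Saturday
monday_flag = is_weekend(datetime(2015, 9, 7))     # a Monday
```

Either value could be appended to the feature vector as an extra column; whether it helps would have to be checked against the cross-validation scores.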
    
I also did some spot tests, which seemed very reasonable. For instance, for a hypothetical contact where I respond to 40% of our interactions, with a 4 hr avg response time, an email received on Monday, today being Tuesday, at night (time-of-day bucket 2), with a notable email address and no cc, I'm predicted to respond in 7.37391708 hours.
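
To make the spot test concrete, the sketch below rebuilds the 7-element feature vector for that hypothetical contact, using the same blending weights as the notebook cell (0.8/0.2 for the ratio, 0.6/0.4 for the response time) and the two aggregate values printed by the notebook. The function name `spot_test_features` and its parameter names are made up for illustration, and it does not run the model itself.

```python
# Aggregate values as printed by the notebook.
AGGREGATE_CONTACT_RATIO = 0.359973620336
AGGREGATE_AVG_RESPONSE = 217674.249233   # seconds

def spot_test_features(contact_ratio, avg_response_hours, day_received,
                       day_sent, time_of_day, notable_address, num_cc):
    """Build a feature vector in the order the model was trained on."""
    # Blend the contact's own numbers with the aggregates over all contacts.
    blended_ratio = 0.8 * AGGREGATE_CONTACT_RATIO + 0.2 * contact_ratio
    blended_response = (0.6 * AGGREGATE_AVG_RESPONSE
                        + 0.4 * avg_response_hours * 3600)  # hours -> seconds
    return [blended_ratio, blended_response, day_received, day_sent,
            time_of_day, notable_address, num_cc]

# 40% reply ratio, 4 hr avg response, received Monday (0), sent Tuesday (1),
# night bucket (2), notable address (1), no cc (0).
features = spot_test_features(0.4, 4, 0, 1, 2, 1, 0)
```

Because of the blending, even a contact I answer within 4 hours on average enters the model with a response-time feature of about 136,365 seconds (roughly 38 hours), which pulls individual predictions toward my overall behavior.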


In [4]:
from sklearn import linear_model
from sklearn import cross_validation
from sklearn.externals import joblib
import cPickle

#regression
Xtrain = []
Ytrain = []

for data_point in data_features:
    Xtrain.append(data_point[0:7])
    Ytrain.append(data_point[7])

lin = linear_model.Lasso()
lin.fit(Xtrain, Ytrain)
score = lin.score(Xtrain,Ytrain) #R^2 on the training data itself
print score
print(cross_validation.cross_val_score(lin, Xtrain, Ytrain)) #R^2 on each held-out fold

print "\n"
print "coefficients: 0:ratio, 1:avg response time, 2:day received, 3:day sent, 4:time of day, 5:boolean: address part of special set, 6:numcc"
print lin.coef_
#predict


with open('my_lin_classifier.pkl', 'wb') as fid:
    cPickle.dump(lin, fid)

Xtest = []

print aggregate_contact_ratio
print aggregate_avg_response

print "\n"
print "I respond 40% of our interactions"
print "4 hr avg response time"
print "received on Monday, today is Tuesday"
print "has a notable email address"
print "no cc"

Xtest.append([0.8*aggregate_contact_ratio + 0.2*(.4), 0.6*(aggregate_avg_response) + (0.4)*14400, 0, 1, 2, 1, 0])
#Xtest.append([0.8*aggregate_contact_ratio + 0.2*0.11594202898550725, 0.6*(aggregate_avg_response) + (0.4)*57796.333333333336, 2, 3, 3, 1, 5])
print lin.predict(Xtest)/3600
#print linear.score(Xtest, Ytest)



0.598148103478
[ 0.34343095  0.35275278 -2.72528618]


coefficients: 0:ratio, 1:avg response time, 2:day received, 3:day sent, 4:time of day, 5:boolean: address part of special set, 6:numcc
[ -3.33174879e+05   2.24788742e+00   1.28343702e+03  -8.62235434e+03
   1.75519646e+04  -1.43057431e+04   0.00000000e+00]
0.359973620336
217674.249233


I respond 40% of our interactions
4 hr avg response time
received on Monday, today is Tuesday
has a notable email address
no cc
[ 7.37391708]
