# Getting And Cleaning Data From Google Groups E-mail
#### Squire, M. *Clean Data*, Chapter 5

## Introduction
In this example project we want to find out how long it took developers to get an answer to their API questions via an e-mail tech support group. To do that, we download e-mail messages from a Google Groups mailing list, work out which messages were sent in reply to others, and calculate basic summary statistics on how long each message waited for a reply.

## Collect the Google Groups messages

A list of the URLs for all the postings on the BigQuery Google Group is available at https://raw.githubusercontent.com/megansquire/stackpaper2015/master/BigQueryGGurls.txt. We first retrieve this file:

In [1]:
import os
import urllib2

# Create a directory to store downloaded data (if it doesn't exist already)
if not os.path.exists("source_data/"):
    os.makedirs("source_data/")

# Download the file with the list of URLs
fname = "BigQueryGGurls.txt"
url = "https://raw.githubusercontent.com/megansquire/stackpaper2015/master/" + fname

if not os.path.isfile("source_data/"+fname):
    f = urllib2.urlopen(url)
    with open("source_data/"+fname, "wb") as listfile:
        listfile.write(f.read())

Once we have the list of target URLs, we download the e-mail at each URL:

In [5]:
# The time library lets us sleep() briefly between requests
# to the Google Groups server
import time

# Create a directory to store downloaded emails
datapath = "source_data/emails/"
if not os.path.exists(datapath):
    os.makedirs(datapath)

with open("source_data/"+fname, "r") as lf:
    urls = []
    for url in lf:
        urls.append(url.strip())

# Only download missing files.
# Some URLs in the list no longer work, so urlopen errors are caught and
# the failed message is skipped instead of being written to disk.
for currentFileNum, url in enumerate(urls, 1):
    emailFile = datapath+"msg%d.txt" % currentFileNum
    if os.path.isfile(emailFile):
        continue
    print("Downloading: {0} Number: {1}".format(url, currentFileNum))

    # Pause before each request so we do not overload the server
    time.sleep(2)
    try:
        htmlFile = urllib2.urlopen(url)
    except urllib2.HTTPError as err:
        if err.code == 404:
            print("Error: page not found")
        elif err.code == 403:
            print("Error: access denied")
        elif err.code == 500:
            print("Error: internal server error")
        else:
            print("Error code {0}".format(err.code))
        continue
    except urllib2.URLError as err:
        print("Error: {0}".format(err.reason))
        continue

    # Write the file only when the request succeeded
    with open(emailFile, 'wb') as urlFile:
        urlFile.write(htmlFile.read())

Downloading: https://groups.google.com/forum/message/raw?msg=bigquery-discuss/cYkx6k2A5ro/MKDDkG2-f2EJ Number: 539
Error: internal server error
Downloading: https://groups.google.com/forum/message/raw?msg=bigquery-discuss/G1psmiunLpQ/BnZS2ghM-UcJ Number: 540
Error: internal server error
Downloading: https://groups.google.com/forum/message/raw?msg=bigquery-discuss/5Aw2fO5moa4/1RPBLS9FXK8J Number: 541
Downloading: https://groups.google.com/forum/message/raw?msg=bigquery-discuss/rMxdSm00LkM/_EMwutWPgjUJ Number: 542
Downloading: https://groups.google.com/forum/message/raw?msg=bigquery-discuss/8zspQalCmwU/cGepkWExtDYJ Number: 543
Downloading: https://groups.google.com/forum/message/raw?msg=bigquery-discuss/8zspQalCmwU/n2Onp1Cvrq0J Number: 544
Downloading: https://groups.google.com/forum/message/raw?msg=bigquery-discuss/8zspQalCmwU/42FM_Vtpz94J Number: 545
Downloading: https://groups.google.com/forum/message/raw?msg=bigquery-discuss/VofC_IRR-3A/N_YJOXdnpmEJ Number: 546
Error: internal server error

The above code downloads over 600 e-mail messages and saves each one in a separate text file. Each file contains many headers (e.g. 'Received', 'Date', 'Message-ID', 'Subject') that store information about the e-mail: its **metadata**.
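To see which headers a message carries, the raw text can be parsed with Python's standard `email` module. The following is a minimal sketch; the message in `raw_message` is a made-up, shortened example rather than one of the downloaded files, and it is not necessarily how the rest of this notebook reads the headers.

```python
import email

# Hypothetical raw message, shortened for illustration (not from the data set)
raw_message = (
    "Received: by 10.0.0.1 with SMTP id abc123\n"
    "Date: Mon, 02 Mar 2015 10:15:00 -0800\n"
    "Message-ID: <question-1@example.com>\n"
    "Subject: Query quota question\n"
    "\n"
    "How do I check my remaining quota?\n"
)

msg = email.message_from_string(raw_message)

# keys() returns the header names in the order they appear in the message
header_names = msg.keys()
for name in header_names:
    print("{0}: {1}".format(name, msg[name]))
```

For a real file, the same call can be given the text read from one of the saved `msg*.txt` files.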

The three headers that store the metadata elements we need are:

- Date
- Message-ID
- In-Reply-To

All messages have an ID and a date, but 'In-Reply-To' appears only in a message that replies to another one; its value is the Message-ID of the message being answered.
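The way these three headers combine into a reply time can be sketched as follows. This is an illustration under stated assumptions, not the notebook's own implementation: `parse_headers` is a hypothetical helper, both messages are invented, and the sketch assumes every 'Date' header is well formed (`email.utils.parsedate_tz` returns `None` for a date it cannot parse).

```python
import email
import email.utils

def parse_headers(raw_text):
    """Return (message_id, in_reply_to, timestamp) for one raw e-mail.

    Hypothetical helper for illustration only.
    """
    msg = email.message_from_string(raw_text)
    # parsedate_tz() keeps the UTC offset; mktime_tz() turns the result
    # into seconds since the epoch in UTC, so time zones do not matter
    timestamp = email.utils.mktime_tz(email.utils.parsedate_tz(msg["Date"]))
    # msg["In-Reply-To"] is None when the header is absent
    return msg["Message-ID"], msg["In-Reply-To"], timestamp

# Two made-up messages: a question and its answer
original = (
    "Date: Mon, 02 Mar 2015 10:15:00 -0800\n"
    "Message-ID: <question-1@example.com>\n"
    "\n"
    "Question body\n"
)
reply = (
    "Date: Mon, 02 Mar 2015 19:45:00 +0100\n"
    "Message-ID: <answer-1@example.com>\n"
    "In-Reply-To: <question-1@example.com>\n"
    "\n"
    "Answer body\n"
)

orig_id, orig_parent, orig_time = parse_headers(original)
reply_id, reply_parent, reply_time = parse_headers(reply)

# The reply points at the original through its In-Reply-To header
if reply_parent == orig_id:
    delay_seconds = reply_time - orig_time
    print("Reply arrived after {0} seconds".format(delay_seconds))
```

Converting both dates to UTC seconds before subtracting matters here because the two senders are in different time zones: the clock times differ by nine and a half hours, but the actual delay is 30 minutes.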

In [None]:
import re
import email.utils
import datetime
import numpy

originals = {}
