<a id='home'></a>

# Project: Email DB

In this small project, I build a notebook that reads the mailbox data (mbox.txt) and counts the number of email messages per organization (i.e. the domain name of the sender's email address). A SQLite DB with a single table, Counts (org TEXT, count INTEGER), maintains the counts.

- The DB is manipulated using Python's sqlite3 library.
- The text file is checked with shell commands (head, tail, grep) executed directly from the notebook.

The workflow is as follows:

1. First, the database and tables are initialized.
2. Second, the mailbox data is sanity-checked before the DB is updated.
3. Third, the text file is parsed into values that SQLite can store, and the data is then inserted into the appropriate tables.
4. Finally, some queries are run to get an idea of the data.



## Create Database and Tables

In [1]:
import sqlite3

con = sqlite3.connect('emaildb.db')
cur = con.cursor()

cur.execute('''
DROP TABLE IF EXISTS Counts''')

cur.execute('''
CREATE TABLE Counts (org TEXT, count INTEGER)''')
con.commit()
con.close()
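
To confirm that the table was created with the expected columns, the schema can be read back from SQLite itself. This is a minimal sketch, assuming an in-memory database (`:memory:`) instead of emaildb.db so that it does not touch the real file; the `columns` variable is only an illustrative name.

```python
import sqlite3

# Assumption: an in-memory database stands in for emaildb.db
con = sqlite3.connect(':memory:')
cur = con.cursor()

cur.execute('DROP TABLE IF EXISTS Counts')
cur.execute('CREATE TABLE Counts (org TEXT, count INTEGER)')
con.commit()

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in cur.execute('PRAGMA table_info(Counts)')]
print(columns)

con.close()
```

Running the same PRAGMA against emaildb.db after the cell above should list the same two columns, org and count.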

## Check the Raw Email Text

In [2]:
!head -20 mbox.txt
!echo '----'
!tail -20 mbox.txt

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.90])
	 by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
	 Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
	 by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
	 Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])
	by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;
	Sat, 5 Jan 2008 09:14:15 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
	BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ; 
	 5 Jan 2008 09:14:10 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
	by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;
	Sat,  5 Jan 2008 14:10:05 +0000 (GMT)
Message-ID: <200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>
Mime-Version: 1.0
----
X-DSPAM-Confidence: 0.9836
X

#### Use grep to print the lines that match a pattern

In [3]:
! grep From: mbox.txt | head -20

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
grep: write error: Broken pipe


## Parse the Text and Update the DB

In [4]:
con = sqlite3.connect('emaildb.db')
cur = con.cursor()

fname = 'mbox.txt'

# Open the file in a with-block so the handle is closed automatically
with open(fname) as fh:
    for line in fh:
        # Only the 'From: ' header lines carry the sender address
        if not line.startswith('From: '):
            continue
        pieces = line.split()
        email = pieces[1]
        # The domain is everything after the '@'
        dom = email[email.find('@')+1:]

        cur.execute('SELECT count FROM Counts WHERE org = ?', (dom,))
        row = cur.fetchone()

        if row is None:
            # First message seen from this domain
            cur.execute('INSERT INTO Counts (org, count) VALUES (?, 1)', (dom,))
        else:
            cur.execute('UPDATE Counts SET count = count + 1 WHERE org = ?', (dom,))

con.commit()
con.close()
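
The loop above issues one SELECT plus one INSERT or UPDATE per message. An alternative is to count the domains in memory first and write each domain once. The following is a sketch of that idea, not the approach used in this notebook: `extract_domain` is a hypothetical helper, the sample lines are copied from the mbox.txt excerpts shown earlier, and an in-memory database stands in for emaildb.db.

```python
import sqlite3
from collections import Counter

def extract_domain(line):
    """Return the domain of a 'From: ' header line, or None for any other line."""
    if not line.startswith('From: '):
        return None
    pieces = line.split()
    if len(pieces) < 2 or '@' not in pieces[1]:
        return None
    return pieces[1].split('@', 1)[1]

# A few lines taken from the mbox.txt excerpts; a real run would
# iterate over open('mbox.txt') instead
sample_lines = [
    'From: stephen.marquard@uct.ac.za\n',
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n',
    'From: zqian@umich.edu\n',
    'From: gsilver@umich.edu\n',
]

# Count in memory, one increment per matching header line
counts = Counter()
for line in sample_lines:
    dom = extract_domain(line)
    if dom is not None:
        counts[dom] += 1

# Assumption: in-memory database instead of emaildb.db
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Counts (org TEXT, count INTEGER)')
# One INSERT per distinct domain
cur.executemany('INSERT INTO Counts (org, count) VALUES (?, ?)', counts.items())
con.commit()

rows = cur.execute('SELECT org, count FROM Counts ORDER BY count DESC').fetchall()
con.close()
print(rows)
```

Note that the envelope line starting with `From ` (no colon) is skipped, just as in the original loop, so each message is counted once.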

## Check the Database

In [5]:
import pandas as pd

In [6]:
with sqlite3.connect('emaildb.db') as con:
    cur = con.cursor()


    sqlstr = 'SELECT org, count FROM Counts ORDER BY count DESC'
    cur.execute(sqlstr)
    # Load the ordered result set into a DataFrame for display
    df = pd.DataFrame(cur.fetchall(), columns=['Domain', 'Count'])

df.head()

|   | Domain | Count |
|---|--------|-------|
| 0 | iupui.edu | 536 |
| 1 | umich.edu | 491 |
| 2 | indiana.edu | 178 |
| 3 | caret.cam.ac.uk | 157 |
| 4 | vt.edu | 110 |
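
Similar questions can be answered directly in SQL without pandas. The sketch below assumes an in-memory table loaded only with the five rows displayed above, so the total it computes covers those five domains and not the whole mailbox; `top3` and `total` are illustrative names.

```python
import sqlite3

# Assumption: in-memory database holding only the five rows shown above
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Counts (org TEXT, count INTEGER)')
cur.executemany('INSERT INTO Counts (org, count) VALUES (?, ?)', [
    ('iupui.edu', 536),
    ('umich.edu', 491),
    ('indiana.edu', 178),
    ('caret.cam.ac.uk', 157),
    ('vt.edu', 110),
])

# LIMIT keeps only the first N rows of the ordered result
top3 = cur.execute(
    'SELECT org, count FROM Counts ORDER BY count DESC LIMIT 3').fetchall()

# SUM aggregates the count column over all rows in the table
total = cur.execute('SELECT SUM(count) FROM Counts').fetchone()[0]

con.close()
print(top3)
print(total)
```

Against the full emaildb.db, the same SUM query gives the total number of counted messages.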


## Extra: Cross-check the Total Message Count with bash

In [7]:
print("wc -l command counts the lines of the grep function's results")
! grep From: mbox.txt | wc -l

print("which is equal to: {}".format(df.Count.sum()))

wc -l command counts the lines of the grep function's results
1797
which is equal to: 1797
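
The two totals agree here, but the two filters are not identical: `grep From:` matches any line that contains "From:", while the parser only accepts lines that start with "From: ". The sketch below shows how they could diverge. The first four lines are copied from the mbox.txt excerpts; the last line is a hypothetical header invented for illustration and is not taken from the file.

```python
sample_lines = [
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008',
    'Return-Path: <postmaster@collab.sakaiproject.org>',
    'From: stephen.marquard@uct.ac.za',
    'From: louis@media.berkeley.edu',
    'X-Reply-From: someone@example.org',  # hypothetical line, not in mbox.txt
]

# Equivalent of `grep From: | wc -l`: substring match anywhere in the line
grep_like = sum(1 for line in sample_lines if 'From:' in line)

# Equivalent of the parser's filter: the line must start with 'From: '
parser_like = sum(1 for line in sample_lines if line.startswith('From: '))

print(grep_like, parser_like)
```

If the two counts ever differ on real data, anchoring the pattern with `grep '^From: '` would make the shell check match the parser.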


[Home](#home)