# In which the contents of numerous spam folders gradually erode my faith in humanity.

Week 7 of Andrew Ng's ML course on Coursera introduces the Support Vector Machine algorithm and challenges us to use it for classifying email as spam or ham. Here I use the [SpamAssassin public corpus](https://spamassassin.apache.org/publiccorpus/) to build an SVM spam email classifier in order to learn about the relevant Python tools. Part I focuses on the preprocessing of individual emails, while Part II focuses on the actual classifier.

>## Tools Covered:
- `re` for regular expressions to do Natural Language Processing (NLP)
- `stopwords` text corpus for removing information-poor words in NLP
- `SnowballStemmer` for stemming text in NLP
- `BeautifulSoup` for HTML parsing

In [1]:
# Set up environment
import scipy.io
import matplotlib.pyplot as plt
import matplotlib 
import pandas as pd
import numpy as np
import pickle
import os
import re

from nltk.stem.snowball import SnowballStemmer
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

import snips as snp  # my snippets
snp.prettyplot(matplotlib)  # my aesthetic preferences for plotting
%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sonya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
cd hw-wk7-spam-preprocessing

C:\Users\Sonya\Box Sync\Projects\course-machine-learning\hw-wk7-spam-preprocessing


# Quick Look at the Data

I'm going to pull a set of spam and "ham" (non-spam) emails from the [SpamAssassin public corpus](https://spamassassin.apache.org/publiccorpus/) data sets. This resource has also kindly separated the ham emails into easy and hard ham. Each email is stored as a plain text file with the email header information and the email body, including HTML markup if applicable.

In [3]:
# Setup for accessing all the spam and ham text files
from os import listdir
from os.path import isfile, join

spampath = join(os.getcwd(), "spam")
spamfiles = [join(spampath, fname) for fname in listdir(spampath)]

hampath = join(os.getcwd(), "easy_ham")
hamfiles = [join(hampath, fname) for fname in listdir(hampath)]

## Example Formatted  File
Here is what an email would look like if viewed with proper formatting, like in your browser.

In [4]:
with open(hamfiles[3]) as myfile:
    for line in myfile.readlines():
        print(line)

From irregulars-admin@tb.tf  Thu Aug 22 14:23:39 2002

Return-Path: <irregulars-admin@tb.tf>

Delivered-To: zzzz@localhost.netnoteinc.com

Received: from localhost (localhost [127.0.0.1])

	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9DAE147C66

	for <zzzz@localhost>; Thu, 22 Aug 2002 09:23:38 -0400 (EDT)

Received: from phobos [127.0.0.1]

	by localhost with IMAP (fetchmail-5.9.0)

	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:23:38 +0100 (IST)

Received: from web.tb.tf (route-64-131-126-36.telocity.com

    [64.131.126.36]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id

    g7MDGOZ07922 for <zzzz-irr@example.com>; Thu, 22 Aug 2002 14:16:24 +0100

Received: from web.tb.tf (localhost.localdomain [127.0.0.1]) by web.tb.tf

    (8.11.6/8.11.6) with ESMTP id g7MDP9I16418; Thu, 22 Aug 2002 09:25:09

    -0400

Received: from red.harvee.home (red [192.168.25.1] (may be forged)) by

    web.tb.tf (8.11.6/8.11.6) with ESMTP id g7MDO4I16408 for

    <irregulars@tb.tf>

## Example Raw File
Now let's look at the actual strings that we will be doing all our processing on.

In [5]:
with open(hamfiles[3], "r") as myfile:
    lines = myfile.readlines()
print(lines)

['From irregulars-admin@tb.tf  Thu Aug 22 14:23:39 2002\n', 'Return-Path: <irregulars-admin@tb.tf>\n', 'Delivered-To: zzzz@localhost.netnoteinc.com\n', 'Received: from localhost (localhost [127.0.0.1])\n', '\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9DAE147C66\n', '\tfor <zzzz@localhost>; Thu, 22 Aug 2002 09:23:38 -0400 (EDT)\n', 'Received: from phobos [127.0.0.1]\n', '\tby localhost with IMAP (fetchmail-5.9.0)\n', '\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:23:38 +0100 (IST)\n', 'Received: from web.tb.tf (route-64-131-126-36.telocity.com\n', '    [64.131.126.36]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id\n', '    g7MDGOZ07922 for <zzzz-irr@example.com>; Thu, 22 Aug 2002 14:16:24 +0100\n', 'Received: from web.tb.tf (localhost.localdomain [127.0.0.1]) by web.tb.tf\n', '    (8.11.6/8.11.6) with ESMTP id g7MDP9I16418; Thu, 22 Aug 2002 09:25:09\n', '    -0400\n', 'Received: from red.harvee.home (red [192.168.25.1] (may be forged)) by\n', '    web.tb.tf 

Some preliminary thoughts: The first line of every file is the most basic header info about the originating address and time the email was sent. There follows a section of keyword-value pairs in the form *keyword: value\n*. Finally, **the body of each email is separated from the meta info by two newline characters `\n\n`.** Note that some of the email bodies contain HTML. 
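As an aside, the header/body layout described above is the standard internet message format, so Python's built-in `email` module can split it for us. The snippet below is only a sketch on a made-up miniature message (the addresses and subject are invented for illustration); I stick with a simpler hand-rolled approach in the rest of this post.

```python
from email import message_from_string

# Hypothetical miniature message laid out like the corpus files
raw = (
    "From sender@example.com  Thu Aug 22 14:23:39 2002\n"
    "Return-Path: <sender@example.com>\n"
    "Subject: Hello\n"
    "\n"
    "First body line.\n"
    "Second body line.\n"
)

msg = message_from_string(raw)

# Headers behave like a case-insensitive dictionary
subject = msg["Subject"]
return_path = msg["Return-Path"]

# For a simple (non-multipart) message the payload is the body string
body_text = msg.get_payload()

print(subject)
print(body_text)
```

Real emails can be multipart, in which case `get_payload()` returns a list of sub-messages rather than a string, so this would need more care on the full corpus.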

# Easy Mode with Just the Email Body
The first thing I'll try is just doing some NLP on only the email bodies, ignoring all the header info. First write a function that grabs only the body lines of a single email:

In [25]:
def get_body(fpath):
    '''Get email body lines from fpath using first occurrence of empty line.'''
    with open(fpath, "r") as myfile:
        try:
            lines = myfile.readlines()
            idx = lines.index("\n")  # only grabs first instance
            return "".join(lines[idx:])
        except (UnicodeDecodeError, ValueError):
            # Either the bytes couldn't be decoded or no blank line was found
            print("Couldn't decode file %s" % (fpath,))

In [26]:
# Test it out 
body= get_body(hamfiles[3])

In [27]:
body  # This is the actual string we are going to be processing

"\nKlez: The Virus That Won't Die\n \nAlready the most prolific virus ever, Klez continues to wreak havoc.\n\nAndrew Brandt\n>>From the September 2002 issue of PC World magazine\nPosted Thursday, August 01, 2002\n\n\nThe Klez worm is approaching its seventh month of wriggling across \nthe Web, making it one of the most persistent viruses ever. And \nexperts warn that it may be a harbinger of new viruses that use a \ncombination of pernicious approaches to go from PC to PC.\n\nAntivirus software makers Symantec and McAfee both report more than \n2000 new infections daily, with no sign of letup at press time. The \nBritish security firm MessageLabs estimates that 1 in every 300 \ne-mail messages holds a variation of the Klez virus, and says that \nKlez has already surpassed last summer's SirCam as the most prolific \nvirus ever.\n\nAnd some newer Klez variants aren't merely nuisances--they can carry \nother viruses in them that corrupt your data.\n\n...\n\nhttp://www.pcworld.com/news/art

In [28]:
print(body)  # This is what it would look like properly displayed


Klez: The Virus That Won't Die
 
Already the most prolific virus ever, Klez continues to wreak havoc.

Andrew Brandt
>>From the September 2002 issue of PC World magazine
Posted Thursday, August 01, 2002


The Klez worm is approaching its seventh month of wriggling across 
the Web, making it one of the most persistent viruses ever. And 
experts warn that it may be a harbinger of new viruses that use a 
combination of pernicious approaches to go from PC to PC.

Antivirus software makers Symantec and McAfee both report more than 
2000 new infections daily, with no sign of letup at press time. The 
British security firm MessageLabs estimates that 1 in every 300 
e-mail messages holds a variation of the Klez virus, and says that 
Klez has already surpassed last summer's SirCam as the most prolific 
virus ever.

And some newer Klez variants aren't merely nuisances--they can carry 
other viruses in them that corrupt your data.

...

http://www.pcworld.com/news/article/0,aid,103259,00.asp
___

# Preprocessing Plan of Attack (order matters)
The order of steps in text processing matters a lot if you are trying to extract other features alongside a simple "Bag of Words" or "Word Salad" model. For instance, if you want to count the number of question marks in the email text then you should probably do it *before* removing all punctuation, but *after* replacing all http addresses (which sometimes contain special characters). Here is a rough outline of all the steps we'll take to get from a messy, marked-up raw text to a delicious word salad:
- Strip any HTML tags and leave only text content (also count HTML tags)
- Strip all email and web addresses (also count them)
- Lowercase everything (also count uppercases)
- Strip all dollar signs and numbers (also count them)
- Strip away all other punctuation (also count exclamation and question marks)
- Standardize all white space to single space (also count newlines and blank lines)
- Count the total number of words in our word salad
- Strip away all useless "Stopwords" (like "a", "the", "at")
- Stem all the words down to their root to simplify
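
To make the "order matters" point concrete, here is a small sketch with a made-up sentence (the text and URL are invented for illustration). Counting question marks before stripping URLs also picks up the `?` inside the URL's query string.

```python
import re

# Hypothetical email text containing one "real" question and one URL
text = "Is this real? See http://example.com/offer?id=7&ref=mail for details"

url_regx = re.compile(r"(http|https)://[^\s]*")

# Counting first also counts the "?" hiding in the URL
n_before = text.count("?")

# Stripping URLs first leaves only the question mark the author typed
stripped, nurls = url_regx.subn(" ", text)
n_after = stripped.count("?")

print(n_before, n_after, nurls)
```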


## Parsing HTML
Some of the email bodies contain HTML formatting - the amount of such formatting might be a helpful feature, but the tags themselves we want to strip away. There are also some symbols in HTML documents, like "<", that have a reserved shorthand notation since they are otherwise interpreted as markup by the browser. We could write regexes to do all of this HTML processing, but a lovely little package called `BeautifulSoup` has already done this and provided us with an HTML parser that returns a parsed object. The `get_text()` method lets us pull out everything *except* the markup from this parsed object. You should check out the [official soup docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text).

In [29]:
# Parse the email body into HTML elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(body, 'html.parser')

In [30]:
# Count the number of HTML elements and specific link elements
nhtml = len(soup.find_all())
nlinks = len(soup.find_all("a"))

# Pull out only the non-markup of the body
body = soup.get_text()
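
If you're curious what counting tags and pulling out the text involves, here is a rough standard-library sketch using `html.parser`. The `TagCounter` class and the sample snippet are made up for illustration; this is not how BeautifulSoup works internally, and for real emails I'd stick with the soup.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: count start tags and collect the text between them."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities like &lt;
        super().__init__()
        self.ntags = 0
        self.nlinks = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.ntags += 1
        if tag == "a":
            self.nlinks += 1

    def handle_data(self, data):
        self.chunks.append(data)


sample = '<p>Save <b>50%</b> if 1 &lt; 2! <a href="http://example.com">Click</a></p>'

parser = TagCounter()
parser.feed(sample)
parser.close()
text = "".join(parser.chunks)

print(parser.ntags, parser.nlinks)
print(text)
```

Note how the `&lt;` shorthand comes back out as a plain `<` in the extracted text.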

## Finding email and web addresses
We'll find and count the appearances of email and web addresses, and then replace each one with blank space. A very useful tool for all language processing is the **regular expression**, which is housed in the `re` module of the Python standard library. For more info you can refer to my brief but hopefully edifying [overview of regexes in python](http://sdsawtelle.github.io/blog/output/regular-expressions-in-python.html).

In [31]:
# Replace and count all URLs 
regx = re.compile(r"(http|https)://[^\s]*")
body, nhttps = regx.subn(repl=" ", string=body)

# Replace and count all email addresses
regx = re.compile(r"\b[^\s]+@[^\s]+[.][^\s]+\b")
body, nemails = regx.subn(repl=" ", string=body)
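
The trick in the cell above is that `subn` returns a tuple of the new string and the number of substitutions made, so we get the replacement and the count in one call. A minimal sketch on a made-up sentence (the URL and address are invented):

```python
import re

# Hypothetical text with one URL and one email address
sample = "Visit http://example.com/deal or write to bob@example.com today"

url_regx = re.compile(r"(http|https)://[^\s]*")
email_regx = re.compile(r"\b[^\s]+@[^\s]+[.][^\s]+\b")

# subn returns (new_string, number_of_substitutions)
sample, nurls = url_regx.subn(" ", sample)
sample, nemails = email_regx.subn(" ", sample)

print(nurls, nemails)
print(sample.split())
```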

In [32]:
body

"\nKlez: The Virus That Won't Die\n \nAlready the most prolific virus ever, Klez continues to wreak havoc.\n\nAndrew Brandt\n>>From the September 2002 issue of PC World magazine\nPosted Thursday, August 01, 2002\n\n\nThe Klez worm is approaching its seventh month of wriggling across \nthe Web, making it one of the most persistent viruses ever. And \nexperts warn that it may be a harbinger of new viruses that use a \ncombination of pernicious approaches to go from PC to PC.\n\nAntivirus software makers Symantec and McAfee both report more than \n2000 new infections daily, with no sign of letup at press time. The \nBritish security firm MessageLabs estimates that 1 in every 300 \ne-mail messages holds a variation of the Klez virus, and says that \nKlez has already surpassed last summer's SirCam as the most prolific \nvirus ever.\n\nAnd some newer Klez variants aren't merely nuisances--they can carry \nother viruses in them that corrupt your data.\n\n...\n\n \n____________________________

## Lowercasing and Counting Caps
We don't expect whether a word is capitalized or not to reflect some deep difference in tone or meaning, but we *might* expect that an email with a bunch of capitalization reflects a certain tone, so we'll lowercase everything but still count the number of capitalized letters. 

In [33]:
# Count uppercases
nupper = len([charup for charup, char in zip(body, body.lower()) if charup != char])
# Lowercase everything
body = body.lower()
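
An alternative way to count capitals is to ask each character directly with `str.isupper()`. This is just a sketch on a made-up string; on plain ASCII text it should agree with the `zip` comparison above, and it avoids relying on the lowercased string lining up character-for-character with the original (for a few non-ASCII characters, lowercasing can change the length of the string).

```python
# Hypothetical subject line with some shouting in it
text = "FREE Offer: Act NOW"

# Ask each character whether it is uppercase
nupper_isupper = sum(1 for ch in text if ch.isupper())

# The zip-based comparison used in this post, for reference
nupper_zip = len([up for up, low in zip(text, text.lower()) if up != low])

print(nupper_isupper, nupper_zip)
```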

## Finding numbers, dollar signs, and punctuation
We'd like to know the frequency of punctuation marks that carry a certain tone, like exclamation marks, question marks, and dollar signs. The frequency of numbers appearing in the email might also be a helpful feature. All of these frequencies should be normalized by the number of words in the email so that they measure the tone or intent of the email rather than its length, but we'll hold off on the word count until we're done with processing. After counting the things we care about, we'll remove all punctuation to get us closer to a pure bag of words.

In [34]:
# Count and replace all numbers (integer and float)
regx = re.compile(r"\b[\d.]+\b")
body, nnum = regx.subn(repl=" ", string=body)

# Count and replace all dollar signs
regx = re.compile(r"[$]")
body, ndollar = regx.subn(repl=" ", string=body)

# Count number of special punctuation
nexclaim, nquest = body.count("!"), body.count("?")

# Remove all other punctuation (dashes replace with space)
regx = re.compile(r"[^\w\s_-]+")  
body = regx.sub(repl="", string=body)
regx = re.compile(r"[_-]+")
body = regx.sub(repl=" ", string=body)

In [35]:
body

'\nklez the virus that wont die\n \nalready the most prolific virus ever klez continues to wreak havoc\n\nandrew brandt\nfrom the september   issue of pc world magazine\nposted thursday august    \n\n\nthe klez worm is approaching its seventh month of wriggling across \nthe web making it one of the most persistent viruses ever and \nexperts warn that it may be a harbinger of new viruses that use a \ncombination of pernicious approaches to go from pc to pc\n\nantivirus software makers symantec and mcafee both report more than \n  new infections daily with no sign of letup at press time the \nbritish security firm messagelabs estimates that   in every   \ne mail messages holds a variation of the klez virus and says that \nklez has already surpassed last summers sircam as the most prolific \nvirus ever\n\nand some newer klez variants arent merely nuisances they can carry \nother viruses in them that corrupt your data\n\n\n\n \n \nirregulars mailing list\n \n \n\n'
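
The two-step punctuation removal explains why the output above contains "e mail" but "wont": dashes and underscores become a space, while every other punctuation mark is simply deleted. Here is a small sketch on a made-up string showing both effects.

```python
import re

# Hypothetical text with an apostrophe, dashes, an underscore, and "!!"
text = "e-mail isn't free_stuff -- act now!!"

# Step 1: delete everything that isn't a word character, whitespace, "_" or "-"
text = re.sub(r"[^\w\s_-]+", "", text)

# Step 2: turn runs of dashes/underscores into a single space
text = re.sub(r"[_-]+", " ", text)

print(text.split())
```

Deleting the apostrophe glues "isn't" into "isnt", whereas "e-mail" splits into two tokens. Whether that is what you want depends on the features you care about.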

## Standardizing White Space and Total Word Count
Standardizing white space is an important step, as it makes tokenizing the email into words straightforward, but make sure to do it as the last step, since many of the substitutions we've done have created extra whitespace. The number of newlines (`\n` characters) and the number of blank lines (`\n\n`) might also be predictive, so we'll count those first.

In [36]:
# Count newlines and blank lines
nblanks, nnewlines = body.count("\n\n"), body.count("\n")

# Make all white space a single space
regx = re.compile(r"\s+")
body = regx.sub(repl=" ", string=body)

# Remove any trailing or leading white space
body = body.strip(" ")

In [37]:
body

'klez the virus that wont die already the most prolific virus ever klez continues to wreak havoc andrew brandt from the september issue of pc world magazine posted thursday august the klez worm is approaching its seventh month of wriggling across the web making it one of the most persistent viruses ever and experts warn that it may be a harbinger of new viruses that use a combination of pernicious approaches to go from pc to pc antivirus software makers symantec and mcafee both report more than new infections daily with no sign of letup at press time the british security firm messagelabs estimates that in every e mail messages holds a variation of the klez virus and says that klez has already surpassed last summers sircam as the most prolific virus ever and some newer klez variants arent merely nuisances they can carry other viruses in them that corrupt your data irregulars mailing list'

This is a true bag of words, so now we can get our word count to use in normalizing counts:

In [38]:
nwords = len(body.split(" "))
nwords

155
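
One caveat about `split(" ")` worth knowing: on an empty string it returns a list holding one empty string, so the word count comes out as 1 rather than 0. That happens to keep the later divisions by `nwords` from blowing up, but it is easy to trip over. A quick sketch of the difference from the no-argument `split()`:

```python
empty = ""

# Splitting on an explicit single space keeps empty strings
n_explicit = len(empty.split(" "))

# Splitting with no argument drops them and also collapses runs of whitespace
n_default = len(empty.split())

# Same difference on a string with a doubled space
parts_explicit = "a  b".split(" ")
parts_default = "a  b".split()

print(n_explicit, n_default, parts_explicit, parts_default)
```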

## Remove Stop Words with `nltk`
Each email is going to have lots of words which are the "glue" of the English language but don't carry much semantic weight in determining the real topic or tone of an email. These are called [Stop Words](https://en.wikipedia.org/wiki/Stop_words) and we will go ahead and strip them out from the start.

The Natural Language Toolkit module (`nltk`) includes a crap ton of functionality for processing text and also access to some public "corpora" such as for stop words.

In [39]:
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sonya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [40]:
len(stopwords.words("english"))

153

In [41]:
stopwords.words("english")[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

In [42]:
# Remove all useless stopwords
stopset = set(stopwords.words('english'))  # build once for fast lookups
bodywords = body.split(" ")
keepwords = [word for word in bodywords if word not in stopset]
body = " ".join(keepwords)

In [43]:
body

'klez virus wont die already prolific virus ever klez continues wreak havoc andrew brandt september issue pc world magazine posted thursday august klez worm approaching seventh month wriggling across web making one persistent viruses ever experts warn may harbinger new viruses use combination pernicious approaches go pc pc antivirus software makers symantec mcafee report new infections daily sign letup press time british security firm messagelabs estimates every e mail messages holds variation klez virus says klez already surpassed last summers sircam prolific virus ever newer klez variants arent merely nuisances carry viruses corrupt data irregulars mailing list'
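
Notice that "wont" and "arent" survived in the output above. Because we deleted apostrophes before this step, contractions no longer look like anything a stop word list would contain. The sketch below shows the effect with a tiny made-up stop list (`toy_stops` is hypothetical and is NOT the NLTK list).

```python
# Hypothetical stop list for illustration only, not the NLTK corpus
toy_stops = {"the", "that", "won't", "aren't"}

raw_words = "the virus that won't die".split()
stripped_words = "the virus that wont die".split()  # apostrophe already removed

# With the apostrophe intact the contraction is filtered out
kept_raw = [w for w in raw_words if w not in toy_stops]

# Without it, "wont" slips through
kept_stripped = [w for w in stripped_words if w not in toy_stops]

print(kept_raw)
print(kept_stripped)
```

If this matters for your features, one option is to remove stop words before stripping punctuation.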

## Stemming with `nltk`
This classifier is trying to determine the intent or tone of an email (spam vs. ham) by virtue of the specific words in that email, among other things. We don't expect that a slight variation on the same root word, like "battery" versus "batteries", carries much difference in intent or tone. Thus when we begin to represent our emails in the feature space of word content, we would do better to replace all the variants of each root with the root itself: this reduces the complexity of emails without really reducing the information about tone or intent. This process is called **stemming** and the `nltk` module has several options for out-of-the-box stemmers. 

In [27]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

stemmer.stem("generously")

'generous'

In [28]:
# Stem all words
words = body.split(" ")
stemwords = [stemmer.stem(wd) for wd in words]
body = " ".join(stemwords)

In [29]:
body

'klez virus wont die alreadi prolif virus ever klez continu wreak havoc andrew brandt septemb issu pc world magazin post thursday august klez worm approach seventh month wriggl across web make one persist virus ever expert warn may harbing new virus use combin pernici approach go pc pc antivirus softwar maker symantec mcafe report new infect daili sign letup press time british secur firm messagelab estim everi e mail messag hold variat klez virus say klez alreadi surpass last summer sircam prolif virus ever newer klez variant arent mere nuisanc carri virus corrupt data irregular mail list'
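
To see why stemming shrinks the vocabulary, here is a deliberately crude suffix-stripping stand-in. To be clear, `crude_stem` is a toy I made up for illustration and is nothing like the real Snowball algorithm; it only shows how several surface forms collapse onto one token.

```python
def crude_stem(word):
    """Toy stand-in for a stemmer: strip a few common English suffixes."""
    rules = (("ies", "i"), ("es", ""), ("ing", ""), ("s", ""), ("y", "i"))
    for suffix, repl in rules:
        # Only strip when a reasonably long root would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + repl
    return word


words = ["battery", "batteries", "mailing", "mails", "mail"]
stems = [crude_stem(w) for w in words]

print(stems)
print(len(set(words)), "distinct words ->", len(set(stems)), "distinct stems")
```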

# Encapsulate Preprocessing in a Function
All of the above steps can be combined into a function that spits out the final processed word salad along with the other features of interest that we extracted along the way.

In [30]:
def word_salad(body):
    '''Produce a word salad and some useful features from email body.'''

    # Parse HTML extract content only (but count tags)
    soup = BeautifulSoup(body, 'html.parser')
    nhtml = len(soup.find_all())
    nlinks = len(soup.find_all("a"))
    body = soup.get_text()
    
    # Replace and count all URLs 
    regx = re.compile(r"(http|https)://[^\s]*")
    body, nhttps = regx.subn(repl=" ", string=body)

    # Replace and count all email addresses
    regx = re.compile(r"\b[^\s]+@[^\s]+[.][^\s]+\b")
    body, nemails = regx.subn(repl=" ", string=body)
    
    # Count uppercases then lowercase everything
    nupper = len([charup for charup, char in zip(body, body.lower()) if charup != char])
    body = body.lower()
    
    # Count and replace all numbers (integer and float)
    regx = re.compile(r"\b[\d.]+\b")
    body, nnum = regx.subn(repl=" ", string=body)

    # Count and replace all dollar signs
    regx = re.compile(r"[$]")
    body, ndollar = regx.subn(repl=" ", string=body)

    # Count number of special punctuation
    nexclaim, nquest = body.count("!"), body.count("?")

    # Remove all other punctuation (dashes replace with space)
    regx = re.compile(r"[^\w\s_-]+")  
    body = regx.sub(repl="", string=body)
    regx = re.compile(r"[_-]+")
    body = regx.sub(repl=" ", string=body)
    
    # Count newlines and blank lines
    nblanks, nnewlines = body.count("\n\n"), body.count("\n")

    # Make all white space a single space
    regx = re.compile(r"\s+")
    body = regx.sub(repl=" ", string=body)

    # Remove any trailing or leading white space
    body = body.strip(" ")
    
    # Get total word count
    nwords = len(body.split(" "))
    freqns = {"email": nemails/nwords, "http":nhttps/nwords,
              "exclaim":nexclaim/nwords, "quest":nquest/nwords, 
              "dollar":ndollar/nwords, 
              "blank":nblanks/nwords, "newline":nnewlines/nwords, 
              "html":nhtml/nwords, "link":nlinks/nwords}
 
    # Remove all useless stopwords
    stopset = set(stopwords.words('english'))  # build once for fast lookups
    bodywords = body.split(" ")
    keepwords = [word for word in bodywords if word not in stopset]

    # Stem all words
    stemwords = [stemmer.stem(wd) for wd in keepwords]
    body = " ".join(stemwords)

    return freqns, body

In [31]:
# Try out our functions
body = get_body(spamfiles[179])
freqns, body = word_salad(body)
freqns

{'blank': 0.02877697841726619,
 'dollar': 0.0,
 'email': 0.007194244604316547,
 'exclaim': 0.05755395683453238,
 'html': 0.30935251798561153,
 'http': 0.0,
 'link': 0.014388489208633094,
 'newline': 0.2158273381294964,
 'quest': 0.0}

In [32]:
body

'hello seen nbc cbs cnn even oprah health discoveri actual revers age burn fat without diet exercis proven discoveri even report new england journal medicin forget age diet forev guarante reduc bodi fat build lean muscl without exercis enhac sexual perform remov wrinkl cellulit lower blood pressur improv cholesterol profil improv sleep vision memori restor hair color growth strengthen immun system increas energi cardiac output turn back bodi biolog time clock year month usag free inform get free month suppli hgh click receiv email subscrib opt america mail list remov relat maillist click'

# Building a Corpus of Processed Emails
Whatever algorithm we ultimately use for classification will require numeric feature vectors, so mapping each word salad to such a vector is the next main task. We'll start by building a corpus that is just a list of fully processed emails, and we'll build alongside it a dataframe of the other features of interest.

In [32]:
emails =  ["email"]*len(hamfiles + spamfiles)  # Reserve in memory, faster than append
fnames = [os.path.split(fpath)[1] for fpath in hamfiles + spamfiles]
df = pd.DataFrame(columns = ["email", "http", "blank", "dollar", 
                             "exclaim", "html", "link", "newline", "quest"], index=fnames)
y = [0]*len(hamfiles) + [1]*len(spamfiles)  # Ground truth vector

for idx, fpath in enumerate(hamfiles + spamfiles):
    body = get_body(fpath)  # Extract only the email body text
    freqns, body = word_salad(body)  # All preprocessing
    fname = os.path.split(fpath)[1]
    emails[idx] = body
    df.loc[fname] = freqns

In [33]:
df.head()

| | email | http | blank | dollar | exclaim | html | link | newline | quest |
|---|---|---|---|---|---|---|---|---|---|
| 0001.ea7e79d3153e7469e7a9c3e0af6a357e | 0.00980392 | 0.00490196 | 0.0784314 | 0.0147059 | 0.0 | 0.00490196 | 0 | 0.25 | 0.0 |
| 0002.b3120c4bcbf3101e661161ee7efcb8bf | 0.01 | 0.02 | 0.06 | 0.0 | 0.02 | 0.0 | 0 | 0.27 | 0.01 |
| 0003.acfc5ad94bbd27118a0d8685d18c89dd | 0.00411523 | 0.00823045 | 0.0288066 | 0.0 | 0.00823045 | 0.0 | 0 | 0.160494 | 0.0 |
| 0004.e8d5727378ddde5c3be181df593f1712 | 0.00645161 | 0.0129032 | 0.0451613 | 0.0 | 0.0 | 0.0 | 0 | 0.212903 | 0.0 |
| 0005.8c3b9e9c0f3f183ddaf7592a11b99957 | 0.00483092 | 0.00483092 | 0.0483092 | 0.0 | 0.00483092 | 0.0 | 0 | 0.198068 | 0.00966184 |


In [34]:
emails[0]

'date wed aug chris garrigu messag id cant reproduc error repeat like everi time without fail debug log pick happen pick exec pick inbox list lbrace lbrace subject ftp rbrace rbrace sequenc mercuri exec pick inbox list lbrace lbrace subject ftp rbrace rbrace sequenc mercuri ftoc pickmsg hit mark hit tkerror syntax error express int note run pick command hand delta pick inbox list lbrace lbrace subject ftp rbrace rbrace sequenc mercuri hit that hit come obvious version nmh im use delta pick version pick nmh compil fuchsia cs mu oz au sun mar ict relev part mh profil delta mhparam pick seq sel list sinc pick command work sequenc actual one that explicit command line search popup one come mh profil get creat kre ps still use version code form day ago havent abl reach cvs repositori today local rout issu think exmh worker mail list'

In [35]:
# Pickle these objects for easier access later
with open("easyham_and_spam_corpus_and_df_and_y.pickle", "wb") as myfile:
    pickle.dump([emails, df, y], myfile)
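
Getting the objects back later is the mirror image of the dump. Here is a minimal round-trip sketch using small stand-in objects and a temporary file (the file name and data are made up); with the real pickle you would unpack three objects, `emails, df, y`, in the same order they were dumped.

```python
import os
import pickle
import tempfile

# Stand-in objects for illustration
emails_demo = ["klez virus wont die", "hello seen nbc cbs"]
y_demo = [0, 1]

# Hypothetical file location in a throwaway directory
path = os.path.join(tempfile.mkdtemp(), "corpus_demo.pickle")

with open(path, "wb") as myfile:
    pickle.dump([emails_demo, y_demo], myfile)

# Load in binary mode and unpack in the same order as dumped
with open(path, "rb") as myfile:
    emails_loaded, y_loaded = pickle.load(myfile)

print(emails_loaded == emails_demo, y_loaded == y_demo)
```

Only unpickle files you created or trust, since loading a pickle can execute arbitrary code.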

We're now in a position to start mapping emails into a numeric vector space. It turns out there are a lot of ways to do this, and the proper ML approach would be to search over this space using cross-validation to identify the best approach. This is the subject of Spam Part II, where we'll explore different vectorization schemes and feed these vectors into a Support Vector Machine to classify each email.
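
As a small taste of what "mapping into a vector space" means, here is a sketch of about the simplest scheme I can think of: raw word counts over a shared vocabulary. The two-document corpus is made up, and this is only one of many possible vectorizations, not necessarily the one Part II settles on.

```python
from collections import Counter

# Hypothetical processed word salads
corpus = ["klez virus virus mail", "free mail free offer"]

# Shared, sorted vocabulary across all documents
vocab = sorted({word for doc in corpus for word in doc.split()})

# One count vector per document, aligned with vocab
vectors = []
for doc in corpus:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)
print(vectors)
```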