# Project 2: Spam/Ham Prediction

In this project, we will create a classifier that can distinguish spam emails from ham (non-spam) emails.

## Getting Started

In [1]:
# Run this cell to set up your notebook
import seaborn as sns
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
sns.set_context("talk")

## Loading in the Data

The dataset consists of email messages and their labels (0 for ham, 1 for spam).

The `train` DataFrame contains labeled data that we will use to train our model. It contains four columns:

1. `id`: An identifier for the training example.
1. `subject`: The subject of the email.
1. `email`: The text of the email.
1. `spam`: 1 if the email was spam, 0 if the email was ham (not spam).

In [2]:
train = pd.read_csv('train.csv')
# We lower case the emails to make them easier to work with
train['email'] = train['email'].str.lower()
train['subject'] = train['subject'].str.lower()
train=train[0:8013]
train.head()

Unnamed: 0,id,subject,email,spam
0,0,subject: a&l daily to be auctioned in bankrupt...,url: http://boingboing.net/#85534171\n date: n...,0.0
1,1,"subject: wired: ""stronger ties between isps an...",url: http://scriptingnews.userland.com/backiss...,0.0
2,2,subject: it's just too small ...,<html>\n <head>\n </head>\n <body>\n <font siz...,1.0
3,3,subject: liberal defnitions\n,depends on how much over spending vs. how much...,0.0
4,4,subject: re: [ilug] newbie seeks advice - suse...,hehe sorry but if you hit caps lock twice the ...,0.0


## Our First Features

We would like to take the text of an email and predict whether the text is ham or spam. This is a *classification* problem, so we will use logistic regression to make a classifier.

However, our data are text, not numbers. To address this, we can create numeric features derived from the email text and use those features for logistic regression.

We create a function called `words_in_text` that takes in a list of words and the text of an email. It outputs a pandas Series containing either a 0 or a 1 for each word in the list. The value of the Series should be 0 if the word doesn't appear in the text and 1 if the word does.

In [3]:
def words_in_text(words, text):
    '''
    Args:
        `words` (list of str): words to find
        `text` (str): string to search in
    
    Returns:
        Series containing either 0 or 1 for each word in words
        (0 if the word is not in text, 1 if the word is).
    '''
    return pd.Series([1 if str(word) in str(text) else 0 for word in words])


assert np.allclose(words_in_text(['hello'], 'hello world'),
                   [1])
assert np.allclose(words_in_text(['hello', 'bye', 'world'], 'hello world hello'),
                   [1, 0, 1])
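
Note that `words_in_text` does plain substring matching, so `'memo'` also matches inside `'memory'`. If whole-word matching is preferred, one possible variant uses a regular expression with word boundaries. This is only a sketch: `whole_words_in_text` is a hypothetical name and is not used by the model in this notebook.

```python
import re

import pandas as pd


def whole_words_in_text(words, text):
    """Hypothetical variant of words_in_text that only matches whole words."""
    text = str(text)
    return pd.Series([
        # \b anchors the match at word boundaries; re.escape treats the word literally
        1 if re.search(r'\b' + re.escape(str(word)) + r'\b', text) else 0
        for word in words
    ])


# 'memo' is a substring of 'memory' but not a whole word in this text
result = whole_words_in_text(['memo', 'bank'], 'a bank with good memory')
print(result.tolist())  # [0, 1]
```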


Now, we create a function called `words_in_texts` that takes in a list of words and a pandas Series of email texts. It should output a 2-dimensional NumPy matrix containing one row for each email text. The row should contain the output of `words_in_text` for each example.

In [4]:
def words_in_texts(words, texts):
    '''
    Args:
        `words` (list of str): words to find
        `texts` (Series of str): strings to search in
    
    Returns:
        NumPy array of 0s and 1s with shape (n, p) where n is the
        number of texts and p is the number of words.
    '''
    # np.array (rather than the discouraged np.matrix) matches the docstring
    return np.array([words_in_text(words, text) for text in texts])

assert np.allclose(words_in_texts(['hello', 'bye', 'world'], pd.Series(['hello', 'hello world hello'])),
                   np.array([[1, 0, 0], [1, 0, 1]]))
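
Calling `words_in_text` once per email builds one Series per row, which can be slow on thousands of emails. A possible vectorized alternative is sketched below, assuming the same literal substring semantics; `words_in_texts_vectorized` is a hypothetical name, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def words_in_texts_vectorized(words, texts):
    """Hypothetical vectorized equivalent of words_in_texts."""
    texts = pd.Series(texts).astype(str)
    # regex=False makes str.contains do a literal substring test, like `in`
    columns = [texts.str.contains(str(word), regex=False) for word in words]
    # Stack the p boolean columns into an (n, p) array of 0s and 1s
    return np.column_stack(columns).astype(int)


features = words_in_texts_vectorized(['hello', 'bye', 'world'],
                                     pd.Series(['hello', 'hello world hello']))
print(features)
```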

## Classification

The output of `words_in_texts` is a numeric matrix containing features for each email. This means we can use it directly to train a classifier.

The following 5 words might be useful as features to distinguish spam/ham emails. We use these words as well as the `train` DataFrame to create two NumPy arrays: `X_train` and `y_train`.

`X_train` should be a matrix of 0s and 1s created by using our `words_in_texts` function on all the emails in the training set.

`y_train` should be a vector of the correct labels for each email in the training set.

In [5]:
some_words = ['drug','bank','prescription','memo','private']
X_train = words_in_texts(some_words, train["email"])
y_train = train["spam"]

Now we have matrices we can give to scikit-learn! Using the [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier, we train a logistic regression model using `X_train` and `y_train`. Then, we output the accuracy of the model on the training set in the cell below. It comes out to about 75.6%.

In [6]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
clf = lr.fit(X_train, y_train)
y_pred = clf.predict(X_train)

acc = np.count_nonzero(y_pred==train["spam"])/train.shape[0]*100
acc

75.63958567328092

That doesn't seem too shabby! But the classifier we made above isn't as great as we might think.
To see why, we calculate the proportion of ham emails and compare it to the accuracy we got. A classifier that always predicts ham would be exactly that accurate.

In [7]:
ham_emails = np.count_nonzero(1-train['spam'])/train.shape[0]
ham_emails

0.7450393111194309
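
The proportion of ham is the accuracy of a classifier that always predicts ham. That baseline can also be computed with scikit-learn's `DummyClassifier`. A minimal sketch on made-up toy labels (not the real `train` data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (made up for illustration): 3 ham (0) and 1 spam (1)
y_toy = np.array([0, 0, 0, 1])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit (here: ham)
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)  # 0.75, the proportion of ham in the toy labels
```

Any real model should be judged against this number rather than against 0%.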

## Moving Forward

With this in mind, it was our assignment to make our classifier more accurate. In particular, everybody in the class should get at least **88%** accuracy on the test set.

Here are some ideas for improving our model:

1. Finding better features based on the email text. For example, simple features that typically work for emails are:
    1. Number of characters in the subject / body
    1. Number of words in the subject / body
    1. Use of punctuation (e.g., how many '!' were there?)
    1. Whether or not the email is a reply to an earlier email or a forwarded email. 
    1. Using bag-of-words or [tf-idf](http://www.tfidf.com/).
    
1. Finding better words to use as features. Which words are the best at distinguishing emails? This requires digging into the email text itself. (To help you out, we've given you a set of [English stopwords](https://www.wikiwand.com/en/Stop_words) in `stopwords.csv`)
1. Better data processing. For example, many emails contain HTML as well as text. You can consider extracting out the text from the HTML to help you find better words. Or, you can match HTML tags themselves, or even some combination of the two.
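
Some of the simple features listed above could be computed with pandas string methods. The sketch below assumes lower-cased `subject` and `email` columns like those in `train`; the helper name `simple_features`, the column names, and the toy data are made up for illustration.

```python
import pandas as pd


def simple_features(df):
    """Hypothetical helper: builds a few hand-made numeric features."""
    subject = df["subject"].fillna("").astype(str)
    email = df["email"].fillna("").astype(str)
    return pd.DataFrame({
        "subject_chars": subject.str.len(),
        "body_chars": email.str.len(),
        "body_words": email.str.split().str.len(),
        "exclamations": email.str.count("!"),
        # Assumes lower-cased subjects of the form "subject: re: ..."
        "is_reply": subject.str.startswith("subject: re:").astype(int),
    })


toy = pd.DataFrame({
    "subject": ["subject: re: lunch", "subject: win big!!!"],
    "email": ["see you at noon", "click now! free money! act fast!"],
})
feats = simple_features(toy)
print(feats)
```

The resulting columns are numeric, so they could be appended to the 0/1 word matrix before fitting.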

Recall that we should use cross-validation to do feature and model selection properly! Otherwise, we will likely overfit to our training data.
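
As one way to combine the tf-idf idea with cross-validation, the sketch below puts scikit-learn's `TfidfVectorizer` and `LogisticRegression` into a pipeline and scores it with `cross_val_score`. The corpus and labels are made up, and this is not the model used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only
texts = ["free money now", "claim your free prize", "win money fast",
         "free prize inside", "meeting at noon", "lunch tomorrow?",
         "notes from the meeting", "see you at lunch"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is re-fit on the training
# part of every fold, so the held-out fold never leaks into the features
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(model, texts, labels, cv=4)
print(scores.mean())
```

Comparing the mean score of several candidate feature sets this way gives a fairer ranking than comparing training accuracy.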

For word feature selection, we drew on recommendations from various research publications found on the internet.

# Our Work Area

In [8]:
import re
def count(string, text):
    return text.count(string)

In [9]:
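stopwords = pd.read_csv('stopwords.csv')
# Candidate substrings to look for in each email body. words_in_texts does plain
# substring matching, so the two regex-style entries below are matched literally.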
some_words = ['bank','prescription','private',"free","dear","yours",
              "img=","html","</"," </div>","</table>","url=","http://www.","value=","<!--",r'\<.*\>',r"#[a-f0-9]{6}",
              "$","+","-","  ","\\","@","|",'""',"'","$$"]
#some_words += list(stopwords)

x_train = words_in_texts(some_words, train["email"])
#x_train = np.append(x_train1,np.array([]).T, 1)

#print(x_train.shape[1])

y_train = train["spam"]

In [10]:
x_train.shape

(8013, 27)

In [11]:
re_find = words_in_texts(["re: "], pd.Series(train["subject"]))
virus_find = words_in_texts(["virus"], pd.Series(train["subject"]))

x_train = np.concatenate((x_train, np.array(re_find)), axis=1)
x_train = np.concatenate((x_train, np.array(virus_find)), axis=1)

In [12]:
re_html=r'\<.*\>'
arr=train["email"].str.count(re_html)
x_train=np.append(x_train,np.array([arr]).T,1)
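
The pattern `\<.*\>` is greedy and `.` does not cross newlines, so it counts at most one match per line that contains tags, not one per tag. The sketch below contrasts it with a per-tag pattern on a toy string; `re_tag` is a hypothetical alternative, not what the model above uses.

```python
import pandas as pd

html = pd.Series(["<b>hi</b> <i>there</i>\n<p>bye</p>"])

re_html = r'\<.*\>'   # greedy: spans from the first '<' to the last '>' on a line
re_tag = r'<[^>]+>'   # hypothetical alternative: one match per tag

lines_with_tags = html.str.count(re_html).tolist()
tag_count = html.str.count(re_tag).tolist()
print(lines_with_tags)  # [2]
print(tag_count)        # [6]
```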

In [13]:
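re_c = r'\b([a-zA-Z]+[0-9]+[a-zA-Z0-9]*|[0-9]+[a-zA-Z]+[a-zA-Z0-9]*)\b'
# Note: this counts re_html again (re_c is defined but not used), so this column
# duplicates the HTML count feature added in the previous cell.
arr=train["email"].str.count(re_html)
x_train=np.append(x_train,np.array([arr]).T, 1)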
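
The pattern `re_c` targets tokens that mix letters and digits. The cell above counts `re_html` rather than `re_c`, so the following is only a sketch of how `re_c` itself behaves, on made-up strings:

```python
import pandas as pd

re_c = r'\b([a-zA-Z]+[0-9]+[a-zA-Z0-9]*|[0-9]+[a-zA-Z]+[a-zA-Z0-9]*)\b'

toy = pd.Series(["buy v1agra and c1alis now", "meeting at 10 tomorrow"])
# Counts whole tokens that contain both letters and digits
mixed_counts = toy.str.count(re_c).tolist()
print(mixed_counts)  # [2, 0]
```

Pure words and pure numbers such as `10` are not counted, only mixed tokens such as `v1agra`.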

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Fit on the full training set and compute the training accuracy (in percent)
lr = LogisticRegression()
clf = lr.fit(x_train, y_train)
y_pred = clf.predict(x_train)
train_accuracy = np.count_nonzero(y_pred == y_train) / train.shape[0] * 100

# Estimate held-out accuracy with 5-fold cross-validation
kf = KFold(n_splits=5)
accuracy = 0
for train_idx, val_idx in kf.split(x_train):
    cv_x_train = x_train[train_idx]
    cv_y_train = y_train.iloc[train_idx]
    cv_x_val = x_train[val_idx]
    cv_y_val = y_train.iloc[val_idx]
    lr = LogisticRegression()
    lr.fit(cv_x_train, cv_y_train)
    cv_y_pred = lr.predict(cv_x_val)
    # Fraction of correct predictions on this fold
    cv_accuracy = np.count_nonzero(cv_y_pred == cv_y_val) / len(cv_y_val)
    accuracy += cv_accuracy
# Mean accuracy over the 5 folds, in percent
accuracy / 5 * 100

89.50444041018596

Yay! We got about 89.5% accuracy in 5-fold cross-validation!

### EDA

In the four light blue cells below, show four different visualizations that you used to select features for your model. Each cell should output:

1. A plot showing something meaningful about the data that helped you during feature / model selection.
2. 2-3 sentences describing what you plotted and what its implications are for your features.
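
One kind of summary that could back such a plot is the share of spam and ham emails containing each candidate word. The sketch below uses made-up toy data standing in for `train`, and the word list is hypothetical.

```python
import pandas as pd

# Made-up toy data standing in for the `train` DataFrame
toy = pd.DataFrame({
    "email": ["free money now", "free prize", "meeting at noon", "free lunch at noon"],
    "spam": [1, 1, 0, 0],
})
words = ["free", "noon"]

# One 0/1 column per word, then the mean per class = share of emails containing it
indicators = pd.DataFrame({w: toy["email"].str.contains(w, regex=False).astype(int)
                           for w in words})
proportions = indicators.groupby(toy["spam"]).mean()
print(proportions)
```

Words whose proportions differ most between the two rows are the more promising features. The table can be drawn as a grouped bar chart, for example with `proportions.T.plot.bar()`.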

## Evaluating your work

When everybody is finished, I will give you instructions for evaluating your final model.

In [15]:
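# Note: unlike `train`, the test text is not lower-cased in this cell, so
# case-sensitive features such as "re: " may match differently on the test set.
test = pd.read_csv("test.csv", names=["subject", "email", "spam", "0", "1", "2", "3"], header=None).iloc[1:,0:3]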
test.head()

Unnamed: 0,subject,email,spam
1,Subject: Industry Forum #136,This is an HTML email message. If you see thi...,1.0
2,"Subject: please help, new webcam/ modeling page\n",I just got a webcam and computer for my birthd...,1.0
3,Subject: [ILUG] HELLO\n,OFFICE OF:EGNR. FEMI DANIEL\n FEDERAL MINISTRY...,1.0
4,Subject: Re: [ILUG] Alan Cox doesn't like 2.5 ...,> The impression I get from reading lkml the o...,0.0
5,Subject: Re: Holidays for freshrpms.net :-)\n,"On Tue, 10 Sep 2002 18:39:07 +0200\n Matthias ...",0.0


In [16]:
stopwords = pd.read_csv('stopwords.csv')
# Rebuild the same features for the test set, in the same column order as x_train
some_words = ['bank','prescription','private',"free","dear","yours",
              "img=","html","</"," </div>","</table>","url=","http://www.","value=","<!--",r'\<.*\>',r"#[a-f0-9]{6}",
              "$","+","-","  ","\\","@","|",'""',"'","$$"]
#some_words += list(stopwords)

x_test = words_in_texts(some_words, test["email"])

#print(x_test.shape[1])

y_test = test["spam"]

In [17]:
x_test.shape

(335, 27)

In [18]:
re_find = words_in_texts(["re: "], pd.Series(test["subject"]))
virus_find = words_in_texts(["virus"], pd.Series(test["subject"]))

x_test = np.concatenate((x_test, np.array(re_find)), axis=1)
x_test = np.concatenate((x_test, np.array(virus_find)), axis=1)

#print(x_test.shape[1])

In [19]:
re_html=r'\<.*\>'
arr = test["email"].str.count(re_html)
x_test = np.append(x_test,np.array([arr]).T,1)

#print(x_test.shape[1])

In [20]:
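re_c = r'\b([a-zA-Z]+[0-9]+[a-zA-Z0-9]*|[0-9]+[a-zA-Z]+[a-zA-Z0-9]*)\b'
# Note: as in the training cells, this counts re_html again (re_c is not used),
# which keeps the test columns consistent with x_train.
arr = test["email"].str.count(re_html)
x_test = np.append(x_test,np.array([arr]).T, 1)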

print(x_test.shape[1])

31


In [21]:
print(x_test.shape, x_train.shape)

(335, 31) (8013, 31)
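
The shapes match because the same steps were repeated by hand for both sets. One way to reduce the risk of the two drifting apart is to wrap the steps in a single function and call it on each DataFrame. The sketch below uses a hypothetical `build_features` helper, a shortened feature list, and toy data, so it is not the exact feature set above.

```python
import numpy as np
import pandas as pd


def build_features(df, words, subject_words, count_patterns):
    """Hypothetical helper that applies identical feature steps to any DataFrame."""
    email = df["email"].astype(str)
    subject = df["subject"].astype(str)
    columns = []
    # 0/1 indicators: literal substring tests on the body, then on the subject
    columns += [email.str.contains(w, regex=False).astype(int) for w in words]
    columns += [subject.str.contains(w, regex=False).astype(int) for w in subject_words]
    # Regex match counts on the body
    columns += [email.str.count(p) for p in count_patterns]
    return np.column_stack(columns)


toy_train = pd.DataFrame({"subject": ["subject: re: hi", "subject: virus alert"],
                          "email": ["<b>free</b> money", "see the bank memo"]})
toy_test = pd.DataFrame({"subject": ["subject: hello"], "email": ["plain text"]})

args = (["free", "bank"], ["re: ", "virus"], [r"<[^>]+>"])
x_toy_train = build_features(toy_train, *args)
x_toy_test = build_features(toy_test, *args)
print(x_toy_train.shape, x_toy_test.shape)  # (2, 5) (1, 5)
```

Because both matrices come from the same call, their columns are guaranteed to line up.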


In [22]:
lr = LogisticRegression()
clf = lr.fit(x_train, y_train)
y_pred_test = clf.predict(x_test)

acc = np.count_nonzero(y_test==y_pred_test)/test.shape[0]*100

acc

85.97014925373134

We got about 86% accuracy on the test data. Good job!