In [14]:
from IPython.display import display, Markdown, Image
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style = "whitegrid", 
        color_codes = True,
        font_scale = 1.5)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_curve

### Loading in the Data

In email classification, our goal is to classify emails as spam or not spam (referred to as "ham") using features generated from the text in the email. 

The dataset consists of email messages and their labels (0 for ham, 1 for spam). Your training dataset contains 8348 labeled examples, and the test set contains 1000 unlabeled examples.

Run the following cells to load the data into DataFrames.

The `train` DataFrame contains labeled data that you will use to train your model. It contains four columns:

1. `id`: An identifier for the training example
1. `subject`: The subject of the email
1. `email`: The text of the email
1. `spam`: 1 if the email is spam, 0 if the email is ham (not spam)

The `test` DataFrame contains 1000 unlabeled emails. You will predict labels for these emails.

In [15]:
# Load the data 
original_training_data = pd.read_csv('shared/lec2/train.csv')
test = pd.read_csv('shared/lec2/test.csv')

In [16]:
# Convert the emails to lower case as a first step to processing the text
original_training_data['email'] = original_training_data['email'].str.lower()
test['email'] = test['email'].str.lower()

original_training_data.head()

Unnamed: 0,id,subject,email,spam
0,0,Subject: A&L Daily to be auctioned in bankrupt...,url: http://boingboing.net/#85534171\n date: n...,0
1,1,"Subject: Wired: ""Stronger ties between ISPs an...",url: http://scriptingnews.userland.com/backiss...,0
2,2,Subject: It's just too small ...,<html>\n <head>\n </head>\n <body>\n <font siz...,1
3,3,Subject: liberal defnitions\n,depends on how much over spending vs. how much...,0
4,4,Subject: RE: [ILUG] Newbie seeks advice - Suse...,hehe sorry but if you hit caps lock twice the ...,0


First, let's check if our data contains any missing values. Fill in the cell below to print the number of NaN values in each column. If there are NaN values, replace them with appropriate filler values (i.e., NaN values in the `subject` or `email` columns should be replaced with empty strings). Print the number of NaN values in each column after this modification to verify that there are no NaN values left.

Note that while there are no NaN values in the `spam` column, we should be careful when replacing NaN labels. Doing so without consideration may introduce significant bias into our model when fitting.

In [17]:
print('Before Filling:')
print(original_training_data.isnull().sum())
original_training_data = original_training_data.fillna('')
print('------------')
print('After Filling:')
print(original_training_data.isnull().sum())

Before Filling:
id         0
subject    6
email      0
spam       0
dtype: int64
------------
After Filling:
id         0
subject    0
email      0
spam       0
dtype: int64
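
As a hedged alternative to filling every column at once, the sketch below fills only the text columns and leaves the labels untouched, which is one way to respect the caution about NaN labels. The `toy` DataFrame and its values are hypothetical stand-ins for the real training data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for the training DataFrame (not the real data).
toy = pd.DataFrame({
    'id': [0, 1, 2],
    'subject': ['Subject: hi', np.nan, 'Subject: sale'],
    'email': ['hello there', 'buy now', np.nan],
    'spam': [0, 1, 1],
})

# Fill only the text columns with empty strings; the label column is not modified.
text_columns = ['subject', 'email']
toy[text_columns] = toy[text_columns].fillna('')

# Count the remaining NaN values per column to verify the fill.
nan_counts = toy.isnull().sum()
print(nan_counts)
```

Restricting the fill to named columns means a missing label would still show up in the NaN counts instead of being silently replaced.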


In the cell below, we print the text of the first ham and the fourth spam email in the original training set.

In [18]:
first_ham = original_training_data.loc[original_training_data['spam'] == 0, 'email'].iloc[0] 
fourth_spam = original_training_data.loc[original_training_data['spam'] == 1, 'email'].iloc[3] 

print("Ham \n", first_ham)
print("Spam \n", fourth_spam)

Ham 
 url: http://boingboing.net/#85534171
 date: not supplied
 
 arts and letters daily, a wonderful and dense blog, has folded up its tent due 
 to the bankruptcy of its parent company. a&l daily will be auctioned off by the 
 receivers. link[1] discuss[2] (_thanks, misha!_)
 
 [1] http://www.aldaily.com/
 [2] http://www.quicktopic.com/boing/h/zlfterjnd6jf
 
 

Spam 
 dear ricardo1 ,
 
 <html>
 <body>
 <center>
 <b><font color = "red" size = "+2.5">cost effective direct email advertising</font><br>
 <font color = "blue" size = "+2">promote your business for as low as </font><br>
 <font color = "red" size = "+2">$50</font> <font color = "blue" size = "+2">per 
 <font color = "red" size = "+2">1 million</font>
 <font color = "blue" size = "+2"> email addresses</font></font><p>
 <b><font color = "#44c300" size ="+2">maximize your marketing dollars!<p></font></b>
 <font size = "+2">complete and fax this information form to 309-407-7378.<br>
 a consultant will contact you to discuss your 

## Training Validation Split
The training data is available for both training models and **validating** the models that we train. We therefore need to split the training data into separate training and validation datasets. You will need this **validation data** to assess the performance of your classifier once you are finished training. Note that we set the seed (`random_state`) to 42. This makes the pseudo-random split identical for every student. Do not modify this in the following questions, as our tests depend on this random seed.

In [19]:
train, val = train_test_split(original_training_data, test_size=0.1, random_state=42)
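
To illustrate why the fixed seed matters, here is a minimal sketch on a hypothetical toy DataFrame (the `toy` data is an assumption, not the assignment data). Calling `train_test_split` twice with the same `random_state` should yield the same rows in each part.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for original_training_data.
toy = pd.DataFrame({
    'email': [f'message {i}' for i in range(20)],
    'spam': [i % 2 for i in range(20)],
})

# Two independent calls with the same seed.
train_a, val_a = train_test_split(toy, test_size=0.1, random_state=42)
train_b, val_b = train_test_split(toy, test_size=0.1, random_state=42)

# The row indices match exactly, so the split is reproducible.
same_split = train_a.index.equals(train_b.index) and val_a.index.equals(val_b.index)

# Training and validation rows never overlap.
overlap = set(train_a.index) & set(val_a.index)

print(same_split, len(train_a), len(val_a))
```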

## Feature Selection

We would like to take the text of an email and predict whether the email is ham or spam. Our data are text, not numbers. To address this, we can create numeric features derived from the email text and use those features for logistic regression.

Each row of $X$ is an email. Each column of $X$ contains one feature for all the emails. We'll guide you through creating a simple feature, and you'll create more interesting ones when you are trying to increase your accuracy.

Create a function called `words_in_texts` that takes in a list of `words` and a pandas Series of email `texts`. It should output a 2-dimensional NumPy array containing one row for each email text. The row should contain either a 0 or a 1 for each word in the list: 0 if the word doesn't appear in the text and 1 if the word does. For example:

```
>>> words_in_texts(['hello', 'bye', 'world'], 
                   pd.Series(['hello', 'hello worldhello']))

array([[1, 0, 0],
       [1, 0, 1]])
```

In [11]:
def words_in_texts(words, texts):
    '''
    Args:
        words (list-like): words to find
        texts (Series): strings to search in
    
    Returns:
        NumPy array of 0s and 1s with shape (n, p) where n is the
        number of texts and p is the number of words.
    '''
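    # Literal substring match: regex=False keeps words such as '$' from being parsed as regex patterns
    indicator_array = np.array([texts.str.contains(word, regex=False) for word in words]).T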
    return indicator_array.astype('int32')
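
One detail worth checking when choosing words: `Series.str.contains` interprets its pattern as a regular expression unless `regex=False` is passed. The sketch below, using hypothetical example strings, shows how a word like `'$'` behaves under each setting.

```python
import pandas as pd

# Hypothetical example texts.
texts = pd.Series(['costs $50 today', 'free offer'])

# As a regex, '$' is the end-of-string anchor, so it matches every text.
regex_match = texts.str.contains('$')

# As a literal string, '$' matches only texts that contain a dollar sign.
literal_match = texts.str.contains('$', regex=False)

print(regex_match.tolist())    # [True, True]
print(literal_match.tolist())  # [True, False]
```

Either way the match is a substring match, which is why `'world'` is found in `'hello worldhello'` in the example above.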

In [9]:
assert np.allclose(words_in_texts(['hello', 'bye', 'world'], 
                           pd.Series(['hello', 'hello worldhello'])),
            np.array([[1, 0, 0], 
                      [1, 0, 1]])) == True

In [10]:
assert np.allclose(words_in_texts(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 
                           pd.Series(['a b c d ef g', 'a', 'b', 'c', 'd e f g', 'h', 'a h'])),
            np.array([[1,1,1,1,1,1,1], 
                      [1,0,0,0,0,0,0],
                      [0,1,0,0,0,0,0],
                      [0,0,1,0,0,0,0],
                      [0,0,0,1,1,1,1],
                      [0,0,0,0,0,0,0],
                      [1,0,0,0,0,0,0]])) == True
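
As a rough sketch of how such an indicator matrix could feed logistic regression, the block below builds $X$ from a few hypothetical toy emails and fits `LogisticRegression` with default settings. The texts, labels, and word list are invented for illustration, and `words_in_texts` is re-declared only so the block runs on its own (`regex=False` is this sketch's choice).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def words_in_texts(words, texts):
    # One column per word, one row per text; 1 if the word occurs as a substring.
    indicator_array = np.array(
        [texts.str.contains(word, regex=False) for word in words]
    ).T
    return indicator_array.astype('int32')

# Hypothetical toy emails and labels (1 = spam, 0 = ham).
toy_texts = pd.Series([
    'free money now',
    'meeting at noon',
    'claim your free prize',
    'notes from the meeting',
    'money back guarantee',
    'lunch tomorrow?',
])
toy_labels = np.array([1, 0, 1, 0, 1, 0])

candidate_words = ['free', 'money', 'meeting']

# Each row of X_toy is an email; each column is one word feature.
X_toy = words_in_texts(candidate_words, toy_texts)

model = LogisticRegression()
model.fit(X_toy, toy_labels)
predictions = model.predict(X_toy)

print(X_toy.shape)
print(predictions)
```

Accuracy on the data used for fitting is optimistic, so in practice the features would be computed on `train` for fitting and on `val` for evaluation.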