# Homework 5 Part I: Spam Classification in SciKit-Learn

This assignment uses data from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Data processing was inspired by https://www.kaggle.com/overflow012/d/uciml/sms-spam-collection-dataset/text-preprocessing-classification

Before getting started, upgrade SciKit-Learn from 0.17 to 0.18.1 (the `sklearn.model_selection` module used below requires 0.18 or later).  Then go to Kernel | Restart in Jupyter.

In [1]:
import pandas as pd

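####
# Helper function:
#  Return the k most frequently appearing keywords in the dataframe,
#  together with their total counts.
#  Note: vec is refit on data_df, so its vocabulary is overwritten on each call.
def top_k(data_df, vec, k):
    # Document-term count matrix, one column per term
    X = vec.fit_transform(data_df['sms'].values)
    # scikit-learn 0.18 API; releases from 1.0 on provide get_feature_names_out()
    labels = vec.get_feature_names()

    # Sum each term's column, sort descending, keep the first k
    return pd.DataFrame(columns=labels, data=X.toarray()).sum().sort_values(ascending=False)[:k]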



sms_df = pd.read_csv('spam.csv', encoding='latin-1')
sms_df.columns = ['class', 'sms', 'a', 'b', 'c']
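
As a rough illustration of what `top_k` computes, the following standard-library sketch sums token counts over all messages and keeps the k largest. It is not the notebook's implementation: `top_k_sketch` and the sample messages are hypothetical, the tokenizer only approximates CountVectorizer's default pattern (tokens of two or more word characters), and stop-word removal is omitted.

```python
import re
from collections import Counter

# Approximates CountVectorizer's default token pattern: runs of 2+ word characters.
TOKEN_RE = re.compile(r'\b\w\w+\b')


def top_k_sketch(messages, k):
    """Return the k most frequent tokens across messages as (token, count) pairs."""
    counts = Counter()
    for message in messages:
        # Lowercase, tokenize, and add this message's tokens to the running totals
        counts.update(TOKEN_RE.findall(message.lower()))
    return counts.most_common(k)


# Hypothetical sample messages, for illustration only
sample = [
    "free entry, txt now",
    "txt stop to end",
    "free prize! claim free cash",
]
print(top_k_sketch(sample, 2))
```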


## Step 1.1 Data Wrangling

Clean up sms_df: drop the unused columns 'a', 'b', and 'c', and lowercase the sms text.

In [2]:
## Data wrangling / cleaning
sms_df = pd.DataFrame(sms_df[['class', 'sms']])
sms_df['sms'] = sms_df['sms'].str.lower()

### Step 1.1 Results

In [3]:
sms_df

| | class | sms |
|---|---|---|
| 0 | ham | go until jurong point, crazy.. available only ... |
| 1 | ham | ok lar... joking wif u oni... |
| 2 | spam | free entry in 2 a wkly comp to win fa cup fina... |
| 3 | ham | u dun say so early hor... u c already then say... |
| 4 | ham | nah i don't think he goes to usf, he lives aro... |
| 5 | spam | freemsg hey there darling it's been 3 week's n... |
| 6 | ham | even my brother is not like to speak with me. ... |
| 7 | ham | as per your request 'melle melle (oru minnamin... |
| 8 | spam | winner!! as a valued network customer you have... |
| 9 | spam | had your mobile 11 months or more? u r entitle... |


In [4]:
sms_df.groupby('class').describe()

| class | | sms |
|---|---|---|
| ham | count | 4825 |
| ham | unique | 4515 |
| ham | top | sorry, i'll call later |
| ham | freq | 30 |
| spam | count | 747 |
| spam | unique | 653 |
| spam | top | please call our customer service representativ... |
| spam | freq | 4 |


## Step 1.2. Vectorizing the Text

In [5]:
## Generate feature vectors
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(decode_error = 'ignore', stop_words = 'english')

## Let's see the most frequent terms in spam

In [6]:
top_spam = top_k(sms_df[sms_df['class'] == 'spam'], vec, 30)

top_spam

free          224
txt           163
ur            144
mobile        127
text          125
stop          121
claim         113
reply         104
www            98
prize          93
just           78
cash           76
won            76
uk             74
150p           71
send           70
new            69
nokia          67
win            64
urgent         63
tone           60
week           60
50             57
contact        56
service        56
msg            54
com            54
18             51
16             51
guaranteed     50
dtype: int64

## Vs ham...

In [7]:
top_ham = top_k(sms_df[sms_df['class'] == 'ham'], vec, 30)

top_ham

gt       318
lt       316
just     293
ok       287
ll       265
ur       241
know     236
good     233
got      232
like     232
come     227
day      209
time     201
love     199
going    169
home     165
want     164
lor      162
need     158
sorry    157
don      151
da       150
today    139
later    135
dont     132
did      129
send     129
think    128
pls      123
hi       122
dtype: int64

## Step 1.2.2 Regularize URLs and Numbers

Import the _regularize_ module here, and apply *regularize_urls* and *regularize_numbers*
to the sms column.

In [8]:
# Regularize/tokenize URLs and numbers
from regularize import regularize_urls, regularize_numbers
sms_df['sms'] = regularize_numbers(regularize_urls(sms_df['sms'])) 
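
The `regularize` module itself is not shown in this notebook. The sketch below is only a guess at the kind of substitution it performs, based on the `_url_` and `_num_` tokens that appear in the results: the function names, the regular expressions, and the padding with spaces are all assumptions, and the functions work on a single string rather than on a pandas column. It also shows why URLs are handled first: digits inside a URL would otherwise be replaced before the URL is recognized.

```python
import re

# Assumed patterns: anything starting with http://, https:// or www. is a URL;
# any run of digits is a number. The real module may use different rules.
URL_RE = re.compile(r'(?:https?://|www\.)\S+')
NUM_RE = re.compile(r'\d+')


def sketch_regularize_urls(text):
    """Replace each URL-like substring with the placeholder token _url_."""
    return URL_RE.sub(' _url_ ', text)


def sketch_regularize_numbers(text):
    """Replace each run of digits with the placeholder token _num_."""
    return NUM_RE.sub(' _num_ ', text)


# Hypothetical message, for illustration only
message = "win 100 at http://x.co/9"
# Same nesting order as the notebook cell: URLs first, then numbers
cleaned = sketch_regularize_numbers(sketch_regularize_urls(message))
print(cleaned.split())
```

With a pandas column, the same functions could be applied element-wise, for example through `Series.apply`.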

## Step 1.2.2 Results

Re-run the CountVectorizer, re-create vector X, and re-compute the top-30 spam terms.  Output the top-30 spam terms.

In [9]:
# Top-30 spam terms
top_spam = top_k(sms_df[sms_df['class'] == 'spam'], vec, 30)

top_spam

_num_         3289
free           228
txt            165
ur             144
_url_          141
mobile         129
stop           126
text           125
claim          113
reply          104
prize           92
just            78
won             76
cash            76
nokia           71
send            70
win             70
new             69
urgent          63
week            60
tone            59
box             57
msg             56
service         56
contact         56
guaranteed      50
ppm             49
customer        49
mins            47
phone           46
dtype: int64

## Step 1.3 Creating Features

Take the top-30 spam + top-30 ham words, and create a new CountVectorizer,
called *relevant_vec*, which _only_ includes those words.
See http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html.

In [10]:
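import numpy as np

# Top-30 spam terms after regularization
top_spam = top_k(sms_df[sms_df['class'] == 'spam'], vec, 30)
# Union of the top ham and top spam terms (sorted, duplicates removed).
# Note: top_ham was computed in In [7], before URLs and numbers were regularized.
vocabulary = np.union1d(np.array(top_ham.index), np.array(top_spam.index))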

In [11]:
# Vector of 'important' words
relevant_vec = CountVectorizer(decode_error = 'ignore', stop_words = 'english', vocabulary=vocabulary)
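
Passing `vocabulary` fixes the feature columns in advance: each message is mapped to counts of exactly those terms, and every other word is ignored. A minimal standard-library sketch of that idea follows. `count_with_vocabulary` and the sample terms are hypothetical, and the tokenizer only approximates CountVectorizer's default.

```python
import re

# Approximates CountVectorizer's default token pattern: runs of 2+ word characters.
TOKEN_RE = re.compile(r'\b\w\w+\b')


def count_with_vocabulary(text, vocabulary):
    """Count only the terms listed in vocabulary; one output slot per term, in order."""
    index = {term: i for i, term in enumerate(vocabulary)}
    row = [0] * len(vocabulary)
    for token in TOKEN_RE.findall(text.lower()):
        if token in index:  # words outside the vocabulary are dropped
            row[index[token]] += 1
    return row


# Hypothetical stand-ins for the top spam and top ham terms
top_spam_terms = ['free', 'txt', 'send']
top_ham_terms = ['ok', 'send', 'home']
# Sorted union without duplicates, the same result np.union1d gives
vocabulary = sorted(set(top_spam_terms) | set(top_ham_terms))

row = count_with_vocabulary("Free txt offer, send txt now", vocabulary)
print(vocabulary)
print(row)
```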

In [12]:
import sklearn.model_selection as ms
import numpy as np

# X is the feature array, based off relevant words
X = relevant_vec.fit_transform(sms_df['sms'].values).toarray()

# Compute the length of each sms message, normalized
# by max length (column vector of shape (n_messages, 1))
Xlen = np.array([[len(v)] for v in sms_df['sms'].values], dtype=float)
Xlen = Xlen / Xlen.max()
# Add the length as another feature
X = np.hstack((X, Xlen))

# Encode labels: ham -> 1, spam -> 0 (so ham is the positive class in the metrics below)
to_int = np.vectorize(lambda x: 1 if x == 'ham' else 0)
y = to_int(sms_df['class'].values)

# Now we split...
X_train, X_test, y_train, y_test = ms.train_test_split(X, 
                                                    y, test_size=0.2, random_state=42)

X_train

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.09110867],
       [ 3.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.16684962],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.05049396],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.04939627],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.02854007],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.03841932]])

In [13]:
y

array([1, 1, 0, ..., 1, 1, 1])

In [14]:
import tensorflow as tf
columns = list(vocabulary)
columns.append('length')

In [15]:
def input_fn(x, y):
    feature_cols = {column: tf.constant(x[:,index], shape=[x.shape[0], 1]) 
                    for index, column in enumerate(columns)}
    label = tf.constant(y)
    return feature_cols, label
def test_input_fn():
    return input_fn(X_test, y_test)
def train_input_fn():
    return input_fn(X_train, y_train)

In [16]:
len(test_input_fn()[0])

58

In [24]:
# Create a DNNClassifier with two hidden layers of 5 units each, and run for 1000 steps.  
# Create a Markdown Cell saying “Step 4.3.1 Results.” 
# Run the fit operation over the training data and the evaluate operation over the test data.  
# Sort the results of the evaluate operation by key, and output the keys.  
# For reference, here is an example of using the DNNClassifier. Note the accuracy.

feature_columns = [tf.contrib.layers.real_valued_column(column_name) for column_name in columns] 
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[5, 5], n_classes=2, feature_columns=feature_columns,
                                        model_dir='/tmp/spam_model_dnn')

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_id': 0, '_model_dir': None, '_evaluation_master': '', '_save_checkpoints_steps': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f578aa96c88>, '_num_worker_replicas': 0, '_tf_random_seed': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_task_type': None, '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_is_chief': True, '_master': '', '_keep_checkpoint_every_n_hours': 10000, '_environment': 'local', '_num_ps_replicas': 0, '_keep_checkpoint_max': 5}


### Step 4.3.1 Results

In [25]:
dnn_clf.fit(input_fn=train_input_fn, steps=1000)
score = dnn_clf.evaluate(input_fn=test_input_fn, steps=1)

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/spam_model_dnn/model.ckpt.
INFO:tensorflow:loss = 0.728257, step = 1
INFO:tensorflow:global_step/sec: 463.893
INFO:tensorflow:loss = 0.26465, step = 101 (0.214 sec)
INFO:tensorflow:global_step/sec: 468.31
INFO:tensorflow:loss = 0.137327, step = 201 (0.214 sec)
INFO:tensorflow:global_step/sec: 479.208
INFO:tensorflow:loss = 0.10023, step = 301 (0.209 sec)
INFO:tensorflow:global_step/sec: 470.593
INFO:tensorflow:loss = 0.0833407, step = 401 (0.213 sec)
INFO:tensorflow:global_step/sec: 437.645
INFO:tensorflow:loss = 0.0749837, step = 501 (0.228 sec)
INFO:tensorflow:global_

In [26]:
for key in sorted(score):
    print('%s: %s' % (key, score[key]))

accuracy: 0.975785
accuracy/baseline_label_mean: 0.865471
accuracy/threshold_0.500000_mean: 0.975785
auc: 0.980211
global_step: 1000
labels/actual_label_mean: 0.865471
labels/prediction_mean: 0.867361
loss: 0.075252
precision/positive_threshold_0.500000_mean: 0.98152
recall/positive_threshold_0.500000_mean: 0.990674


In [27]:
lin_clf = tf.contrib.learn.LinearClassifier(feature_columns=feature_columns, model_dir='/tmp/spam_model_linear')

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_id': 0, '_model_dir': None, '_evaluation_master': '', '_save_checkpoints_steps': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f57e1f30710>, '_num_worker_replicas': 0, '_tf_random_seed': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_task_type': None, '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_is_chief': True, '_master': '', '_keep_checkpoint_every_n_hours': 10000, '_environment': 'local', '_num_ps_replicas': 0, '_keep_checkpoint_max': 5}


### Step 4.3.2 Results

In [28]:
lin_clf.fit(input_fn=train_input_fn, steps=1000)
score = lin_clf.evaluate(input_fn=test_input_fn, steps=1)

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /tmp/spam_model_linear/model.ckpt-1000
INFO:tensorflow:Saving checkpoints for 1001 into /tmp/spam_model_linear/model.ckpt.
INFO:tensorflow:loss = 0.0949926, step = 1001
INFO:tensorflow:global_step/sec: 541.108
INFO:tensorflow:loss = 0.0927605, step = 1101 (0.184 sec)
INFO:tensorflow:global_step/sec: 819.246
INFO:tensorflow:loss = 0.0908504, step = 1201 (0.122 sec)
INFO:tensorflow:global_step/sec: 798.697
INFO:tensorflow:loss = 0.0891962, step = 1301 (0.125 sec)
INFO:tensorflow:global_step/sec: 769.566
INFO:tensorflow:loss = 0.0877492, step = 1401 (0.130 sec)
INFO:tensorflow:glob

In [29]:
for key in sorted(score):
    print('%s: %s' % (key, score[key]))

accuracy: 0.972197
accuracy/baseline_label_mean: 0.865471
accuracy/threshold_0.500000_mean: 0.972197
auc: 0.977969
global_step: 2000
labels/actual_label_mean: 0.865471
labels/prediction_mean: 0.864231
loss: 0.0925379
precision/positive_threshold_0.500000_mean: 0.974594
recall/positive_threshold_0.500000_mean: 0.993782
