Suppose the spam generator becomes more intelligent and begins producing prose which looks "more legitimate" than before. 

There are numerous ways the prose could become more like legitimate text. For the purpose of this notebook we will simply force the spam data to 'drift' by adding the first few lines of Pride and Prejudice to the start of the spam documents in our testing set. We will then see how the trained model responds.  

In [1]:
import pandas as pd
df = pd.read_parquet("data/training.parquet")


In [2]:
from sklearn import model_selection


df_train, df_test = model_selection.train_test_split(df, random_state=43)
df_test_spam = df_test[df_test.label == 'spam'].copy()

In [3]:
def add_text(doc, adds):
    """
    takes in a string _doc_ and
    prepends text _adds_ to the start
    """
    
    return adds + doc
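
As a quick illustration of what `add_text` does, the sketch below applies it to a short made-up document. The two strings are joined with no whitespace between them, which is why the sample output further down reads `impatiently.This`. The variant `add_text_sep` and its `sep` argument are hypothetical additions for comparison, not part of the original notebook.

```python
def add_text(doc, adds):
    """Prepend the string adds to the string doc."""
    return adds + doc


def add_text_sep(doc, adds, sep=" "):
    """Hypothetical variant: prepend adds to doc, separated by sep."""
    return adds + sep + doc


# Made-up strings, used only for illustration.
prefix = "It is a truth universally acknowledged."
spam_doc = "This is a must-have for me."

fused = add_text(spam_doc, prefix)       # prefix runs straight into the document
spaced = add_text_sep(spam_doc, prefix)  # a single space separates the two

print(fused)
print(spaced)
```

Whether the separator matters depends on the features: a summary that splits on whitespace would see the words on either side of the join as one token, whereas a tokenizer that splits on word boundaries would not.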

In [4]:
pride_pred = '''It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.'''

In [5]:
df_test_spam["text"] = df_test_spam.text.apply(add_text, adds=pride_pred)

In [6]:
pd.set_option('display.max_colwidth', None)  # show the full text of each document (pandas >= 1.0; older versions used -1)

df_test_spam.sample(3)

Unnamed: 0,index,label,text
31688,11688,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.This is a must-have for me from a party store. This product used to taste good, which it does. It is definitely not the same as the originals may be but it's not a big fan and consumer of the Lipton Tea you find at the supermarket for a box like that in a french press. My dog really enjoys the Booda bones. I wrote to Betty Crocker to let them know your feelings. I add half the seasoning packet but the other flavors I bought. Now I can again start the day without it. The bags are just the right amount of sweetness in my cereal every morning. Let me start by saying something negative about the Keurig coffee machine and the counter."
37965,17965,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.Still being oblivious at the time, plus it helped me unwind. Again, NO. You want reformed inmates? I feel that both men and women. I was on a whim I poured a little bit of spice. However, I do find that occassionally it produces too thick a gloppy layer of chocolate adorns the top, with nuts and other good stuff that you should be eating, but mmmmm, they are good! What are these 5 star reviewers have already. Right now, shes eating her food altogether. Then they sell the cat food market, F**** F****. Usually chicken sometimes turkey sometimes beef, sometimes I would like them with Lox, mascarpone cream cheese and serve with fruits in season."
32415,12415,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.This is deceitful to those who do not or cannot drink caffeine, it is the perfect snack to get me my own Espresso Machine. I would not have anticipated.Both products list high-quality ingredients on the Evo every other morning because it does taste good. I make pitchers at a time this way compared to buying sprouts at the store."


We now need to compute the summary statistics for these modified documents, using the function defined below.

In [7]:
import re
import numpy as np
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # the stop_words module was removed in newer scikit-learn

def strip_punct(doc):
    """
    takes in a document _doc_ and
    returns a tuple of the punctuation-free
    _doc_ and the count of punctuation in _doc_
    """
    
    return re.subn(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""", "", doc, count=0, flags=0)

def caps(word):
    return not word.islower()

def isstopword(word):
    return word in ENGLISH_STOP_WORDS

def standard_summary(row):
    """
    takes in an entry _row_ from the data and
    computes each of the summaries, then returns
    the summaries in a list
    """
    
    doc = row["text"]

    no_punct = strip_punct(doc)
    
    words = no_punct[0].split()
    
    number_words = len(words)
    
    word_length = [len(x) for x in words]
    
    mean_wl = sum(word_length)/number_words
    
    max_wl = max(word_length)
    min_wl = min(word_length)

    pc_90_wl = np.percentile(word_length, 90)
    pc_10_wl = np.percentile(word_length, 10)
    
    upper = sum([caps(x) for x in words])
    stop_words = sum([isstopword(x) for x in words])

    return [ no_punct[1], number_words, mean_wl, max_wl, min_wl, pc_10_wl, pc_90_wl, upper, stop_words]
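
To make the return value of `strip_punct` concrete, here is a small check on made-up strings. `re.subn` returns a `(new_string, number_of_substitutions)` pair, which is why `standard_summary` reads the cleaned text from `no_punct[0]` and the punctuation count from `no_punct[1]`. The helper definitions are copied from the cell above so that the snippet runs on its own.

```python
import re

PUNCT = r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]"""


def strip_punct(doc):
    # Returns (text without punctuation, number of characters removed).
    return re.subn(PUNCT, "", doc)


def caps(word):
    # True for any word that is not entirely lower case.
    return not word.islower()


cleaned, n_punct = strip_punct("Hello, world! It's 5pm.")
words = cleaned.split()
n_caps = sum(caps(w) for w in words)

# A full stop with no following space: removing it merges the two words.
fused_words = strip_punct("impatiently.This is")[0].split()

print(cleaned, n_punct, n_caps)
print(fused_words)
```

Two details follow from the definitions: `caps` also returns `True` for tokens with no letters at all (for example `"5"`), and characters outside the pattern, such as the curly quotation marks in the Austen passage, are neither counted nor removed.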

In [8]:
features = df_test_spam.apply(standard_summary, axis=1).apply(pd.Series)
features.columns = ["num_punct", "num_words", "av_wl", "max_wl", "min_wl", "10_quantile", "90_quantile", "upper_case", "stop_words"]

In [9]:
features.sample(4)

Unnamed: 0,num_punct,num_words,av_wl,max_wl,min_wl,10_quantile,90_quantile,upper_case,stop_words
30741,28.0,167.0,4.275449,13.0,1.0,2.0,8.0,23.0,86.0
30264,29.0,172.0,4.360465,14.0,1.0,2.0,8.0,19.0,90.0
28605,30.0,208.0,4.139423,13.0,1.0,2.0,7.0,23.0,106.0
37208,41.0,255.0,4.301961,14.0,1.0,2.0,8.0,42.0,126.0


Now we can load in the models we generated earlier and see how well they classify these feature vectors. 

In [10]:
filename = 'models/lr_model_simplesummaries.sav'

In [11]:
import pickle
# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
#result = loaded_model.score(X_test, Y_test)
#print(result)

In [12]:
loaded_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=4000, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [13]:
predictions = loaded_model.predict(features.iloc[:,0:features.shape[1]])

In [14]:
predictions

array(['legitimate', 'legitimate', 'legitimate', ..., 'legitimate',
       'legitimate', 'spam'], dtype=object)

In [15]:
type(predictions)

numpy.ndarray

In [16]:
np.array(np.unique(predictions, return_counts = True))

array([['legitimate', 'spam'],
       [3804, 1192]], dtype=object)
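
Because every document in `df_test_spam` is labelled spam, the share of `'spam'` predictions is the model's recall on the drifted set. A minimal sketch, rebuilding a prediction list with the same counts as the output above (the function name `spam_recall` is ours, not from the notebook):

```python
from collections import Counter


def spam_recall(predictions):
    """Fraction of predictions equal to 'spam', assuming every true label is spam."""
    counts = Counter(predictions)
    return counts["spam"] / len(predictions)


# Same counts as reported above: 3804 'legitimate' and 1192 'spam'.
predictions = ["legitimate"] * 3804 + ["spam"] * 1192

recall = spam_recall(predictions)
print(f"caught {recall:.1%} of the drifted spam, missed {1 - recall:.1%}")
```

On these counts the model catches roughly 24% of the drifted spam and labels the remaining 76% or so as legitimate.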

We will now load the other model and see how well it fares on the drifted data.

In [17]:
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(norm=None, token_pattern='(?u)\\b[A-Za-z]\\w+\\b', n_features=8192, alternate_sign = False)
hv


HashingVectorizer(alternate_sign=False, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=8192, ngram_range=(1, 1), non_negative=False,
         norm=None, preprocessor=None, stop_words=None, strip_accents=None,
         token_pattern='(?u)\\b[A-Za-z]\\w+\\b', tokenizer=None)

In [18]:
hvcounts = hv.fit_transform(df_test_spam["text"])
hvcounts

<4996x8192 sparse matrix of type '<class 'numpy.float64'>'
	with 674604 stored elements in Compressed Sparse Row format>

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
hvdf_tfidf = tfidf_transformer.fit_transform(hvcounts)
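
One consequence of fitting `TfidfTransformer` on the drifted test set is worth spelling out. With scikit-learn's default `smooth_idf=True`, the inverse document frequency of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. Every token from the prepended passage occurs in all documents, so it receives the minimum weight of 1. The pure-Python sketch below evaluates that formula on made-up document frequencies; it is an illustration rather than the library code, and this notebook does not show whether the saved model was trained with IDF weights fitted the same way.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 4996  # number of drifted spam documents in the matrix above

idf_prefix = smooth_idf(n_docs, n_docs)  # token present in every document
idf_rare = smooth_idf(n_docs, 10)        # hypothetical token seen in only 10 documents

print(idf_prefix, idf_rare)
```

The prefix tokens are therefore weighted down relative to rarer tokens before the default L2 normalisation is applied to each row.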

In [20]:
dense_tf_idf = hvdf_tfidf.toarray()
labled_vecs = pd.concat([df_test_spam.reset_index()[["index", "label"]],pd.DataFrame(dense_tf_idf)], axis=1)
labled_vecs.columns = labled_vecs.columns.astype(str)

In [21]:
features = labled_vecs

In [22]:
features

Unnamed: 0,index,label,0,1,2,3,4,5,6,7,...,8182,8183,8184,8185,8186,8187,8188,8189,8190,8191
0,19423,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1,14677,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
2,7014,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
3,12415,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
4,1393,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
5,15963,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.086029,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
6,13558,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
7,15843,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
8,8428,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
9,7729,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
filename = 'models/lr_model_tfidfsummaries.sav'
import pickle
# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
#result = loaded_model.score(X_test, Y_test)
#print(result)

In [24]:
predictions = loaded_model.predict(features.iloc[:,2:features.shape[1]])

In [25]:
predictions

array(['spam', 'legitimate', 'legitimate', ..., 'legitimate',
       'legitimate', 'spam'], dtype=object)

In [26]:
import numpy as np
np.array(np.unique(predictions, return_counts = True))

array([['legitimate', 'spam'],
       [3807, 1189]], dtype=object)
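
Putting the two results side by side, using only the counts printed in this notebook, shows that both models flag slightly under a quarter of the drifted spam. The dictionary keys are simply the model file names used above.

```python
# (legitimate, spam) prediction counts reported above for each saved model
results = {
    "lr_model_simplesummaries": (3804, 1192),
    "lr_model_tfidfsummaries": (3807, 1189),
}

recalls = {}
for name, (n_legit, n_spam) in results.items():
    total = n_legit + n_spam
    recalls[name] = n_spam / total
    print(f"{name}: {n_spam}/{total} flagged as spam ({recalls[name]:.1%})")
```

The two models differ by only three documents out of 4996 on this drifted set.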