In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. 

The feature vectors generated in this notebook are composed of simple summaries of the text data. We begin by loading in the data produced by [the generator notebook](00-generator.ipynb).

In [1]:
import pandas as pd
import os.path

df = pd.read_parquet(os.path.join("data", "training.parquet"))

To illustrate the computation of feature vectors, we compute them for a sample of three documents from the data loaded in above.

In [2]:
import numpy as np

np.random.seed(0xc0fee)
df_samp = df.sample(3)

In [3]:
pd.set_option('display.max_colwidth', None)  # None removes the width limit so all the text is visible (-1 is deprecated)

df_samp



Unnamed: 0,index,label,text
9026,9026,legitimate,"As Harriet now lived, the Martins could not get through; and as Miss de Bourgh looked that way. Elinor kept her concern and surprise, began to inquire into Miss Thorpe's connections and fortune. I have three now, the best that ever were backed. No go. Project Gutenberg's The Complete Works of Jane Austen, by Jane Austen I LADY SUSAN VERNON TO MRS. He talked to her repeatedly in the most common diseases in cats and I try to avoid food coloring and thick motor oil. I feed them Sam's Yams and C.E.T."
27417,7417,spam,"Ginger Honey Crystals Pack of 30 today and on the go, to work, on vacation, or even camping. Would definitely buy again. . . . He loved the treat and wanted another."
20031,31,spam,"Every time I get close to running out. No worries here. Weight, energy, general health, etc. all tip-top. Ready to drink. Anyway,there are only two of us for meals, I have made incredible pizza with both. Introduced to Stash double Spice Chai in a gift basket would be nice."


The summaries we will compute for each document are:
* number of pieces of punctuation 
* number of words
* average word length
* maximum word length
* minimum word length
* 10th percentile word length
* 90th percentile word length
* number of words containing upper case letters
* number of 'stop words'
    
To begin, we count the number of pieces of punctuation in each piece of text. We will remove the punctuation from the text as it is counted. This will make computing the later summaries a little simpler.

In [4]:
import re

def strip_punct(doc):
    """
    takes in a document _doc_ and
    returns a tuple of the punctuation-free
    _doc_ and the count of punctuation in _doc_
    """
    
    return re.subn(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""", "", doc, count=0, flags=0)
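
As a small illustration of what `re.subn` hands back, the sketch below applies the same character class to a single made-up sentence. The names `PUNCT`, `cleaned` and `n_removed` exist only for this example. Characters that are not in the class, such as `"` or `*`, would be left in place and would not be counted.

```python
import re

# Same character class as in strip_punct above, compiled once for reuse.
PUNCT = re.compile(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""")

sentence = "Would definitely buy again. . . . He loved the treat, and wanted another!"

# subn returns a (new_string, number_of_substitutions) tuple.
cleaned, n_removed = PUNCT.subn("", sentence)

print(n_removed)        # how many punctuation characters were removed
print(cleaned.split())  # the remaining words
```

Because the result is a tuple, the `text_str` column created in the next cell holds a `(text, count)` pair for each document rather than a plain string.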

In [5]:
df_samp["text_str"]= df_samp["text"].apply(strip_punct)

In [6]:
df_samp

Unnamed: 0,index,label,text,text_str
9026,9026,legitimate,"As Harriet now lived, the Martins could not get through; and as Miss de Bourgh looked that way. Elinor kept her concern and surprise, began to inquire into Miss Thorpe's connections and fortune. I have three now, the best that ever were backed. No go. Project Gutenberg's The Complete Works of Jane Austen, by Jane Austen I LADY SUSAN VERNON TO MRS. He talked to her repeatedly in the most common diseases in cats and I try to avoid food coloring and thick motor oil. I feed them Sam's Yams and C.E.T.","(As Harriet now lived the Martins could not get through and as Miss de Bourgh looked that way Elinor kept her concern and surprise began to inquire into Miss Thorpes connections and fortune I have three now the best that ever were backed No go Project Gutenbergs The Complete Works of Jane Austen by Jane Austen I LADY SUSAN VERNON TO MRS He talked to her repeatedly in the most common diseases in cats and I try to avoid food coloring and thick motor oil I feed them Sams Yams and CET, 17)"
27417,7417,spam,"Ginger Honey Crystals Pack of 30 today and on the go, to work, on vacation, or even camping. Would definitely buy again. . . . He loved the treat and wanted another.","(Ginger Honey Crystals Pack of 30 today and on the go to work on vacation or even camping Would definitely buy again He loved the treat and wanted another, 9)"
20031,31,spam,"Every time I get close to running out. No worries here. Weight, energy, general health, etc. all tip-top. Ready to drink. Anyway,there are only two of us for meals, I have made incredible pizza with both. Introduced to Stash double Spice Chai in a gift basket would be nice.","(Every time I get close to running out No worries here Weight energy general health etc all tiptop Ready to drink Anywaythere are only two of us for meals I have made incredible pizza with both Introduced to Stash double Spice Chai in a gift basket would be nice, 13)"


We will store the count of punctuation in a new data frame of summaries:

In [7]:
df_summaries = pd.DataFrame({'num_punct' :df_samp["text_str"].apply(lambda x: x[1])})
df_summaries

Unnamed: 0,num_punct
9026,17
27417,9
20031,13


In [8]:
df_samp.reset_index(inplace=True) 

# Note: level_0 and index coincide for the legitimate documents, but not for the spam.
# For spam, index = level_0 mod 20,000.

In [9]:
df_samp

Unnamed: 0,level_0,index,label,text,text_str
0,9026,9026,legitimate,"As Harriet now lived, the Martins could not get through; and as Miss de Bourgh looked that way. Elinor kept her concern and surprise, began to inquire into Miss Thorpe's connections and fortune. I have three now, the best that ever were backed. No go. Project Gutenberg's The Complete Works of Jane Austen, by Jane Austen I LADY SUSAN VERNON TO MRS. He talked to her repeatedly in the most common diseases in cats and I try to avoid food coloring and thick motor oil. I feed them Sam's Yams and C.E.T.","(As Harriet now lived the Martins could not get through and as Miss de Bourgh looked that way Elinor kept her concern and surprise began to inquire into Miss Thorpes connections and fortune I have three now the best that ever were backed No go Project Gutenbergs The Complete Works of Jane Austen by Jane Austen I LADY SUSAN VERNON TO MRS He talked to her repeatedly in the most common diseases in cats and I try to avoid food coloring and thick motor oil I feed them Sams Yams and CET, 17)"
1,27417,7417,spam,"Ginger Honey Crystals Pack of 30 today and on the go, to work, on vacation, or even camping. Would definitely buy again. . . . He loved the treat and wanted another.","(Ginger Honey Crystals Pack of 30 today and on the go to work on vacation or even camping Would definitely buy again He loved the treat and wanted another, 9)"
2,20031,31,spam,"Every time I get close to running out. No worries here. Weight, energy, general health, etc. all tip-top. Ready to drink. Anyway,there are only two of us for meals, I have made incredible pizza with both. Introduced to Stash double Spice Chai in a gift basket would be nice.","(Every time I get close to running out No worries here Weight energy general health etc all tiptop Ready to drink Anywaythere are only two of us for meals I have made incredible pizza with both Introduced to Stash double Spice Chai in a gift basket would be nice, 13)"


Many of the summaries we will compute require us to consider each word in the text, one by one. To avoid splitting the text more than once, we split it a single time and then apply each function to the resulting words.

To do this, we "explode" the text into words, so that each word occupies a row of the data frame, and retains the associated "level_0", "index" and "label". 

In [10]:
rows = []
_ = df_samp.apply(lambda row: [rows.append([ row['level_0'], row['index'], row['label'], word]) 
                         for word in row.text_str[0].split()], axis=1)
df_samp_explode = pd.DataFrame(rows, columns=df_samp.columns[0:4])
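
The same reshaping can probably be written more directly with pandas' built-in `DataFrame.explode`. The sketch below is an alternative, not the notebook's method, and it assumes pandas 0.25 or later. The toy frame `docs` holds plain strings rather than the `(text, count)` tuples stored in `text_str`.

```python
import pandas as pd

# Toy stand-in for df_samp: one row per document, punctuation already removed.
docs = pd.DataFrame({
    "level_0": [9026, 27417],
    "label": ["legitimate", "spam"],
    "text": ["As Harriet now lived", "Would definitely buy again"],
})

exploded = (
    docs.assign(text=docs["text"].str.split())  # each cell becomes a list of words
        .explode("text")                        # one row per word, other columns repeated
        .reset_index(drop=True)
)

print(exploded)
```

To use this on `df_samp`, the string would first have to be pulled out of each tuple, for example with `df_samp["text_str"].str[0]`.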

In [11]:
df_samp_explode

Unnamed: 0,level_0,index,label,text
0,9026,9026,legitimate,As
1,9026,9026,legitimate,Harriet
2,9026,9026,legitimate,now
3,9026,9026,legitimate,lived
4,9026,9026,legitimate,the
...,...,...,...,...
165,20031,31,spam,gift
166,20031,31,spam,basket
167,20031,31,spam,would
168,20031,31,spam,be


Column `level_0` contains the index we want to aggregate any calculations over. 

Computing the number of words in each document is now simply calculating the number of rows for each value of `level_0`.

In [12]:
df_summaries["num_words"] = df_samp_explode['level_0'].value_counts()
df_summaries
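
The assignment above works even though `value_counts` returns its results ordered by count, not by document: pandas aligns a Series to the data frame's index when it is assigned as a column. A minimal sketch with made-up counts (the frames `summaries` and `words` are hypothetical):

```python
import pandas as pd

# Summaries indexed by document id, in the original sample order.
summaries = pd.DataFrame({"num_punct": [17, 9, 13]}, index=[9026, 27417, 20031])

# One row per word; only the document id matters here.
words = pd.DataFrame({"level_0": [9026] * 3 + [27417] * 1 + [20031] * 2})

counts = words["level_0"].value_counts()  # ordered by count: 9026, 20031, 27417
summaries["num_words"] = counts           # matched to rows by index label, not by position

print(summaries)
```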

Unnamed: 0,num_punct,num_words
9026,17,92
27417,9,29
20031,13,49


Many of the remaining summaries require word length to be computed. To save us from recomputing this every time, we will add a column containing this information to our 'exploded' data frame:

In [13]:
df_samp_explode["word_len"] = df_samp_explode["text"].apply(len) 

In [14]:
df_samp_explode.sample(10) 

Unnamed: 0,level_0,index,label,text,word_len
90,9026,9026,legitimate,and,3
86,9026,9026,legitimate,feed,4
152,20031,31,spam,made,4
122,20031,31,spam,time,4
43,9026,9026,legitimate,No,2
137,20031,31,spam,all,3
109,27417,7417,spam,camping,7
161,20031,31,spam,Spice,5
141,20031,31,spam,drink,5
151,20031,31,spam,have,4


In the next cell we compute the average word length as well as the minimum and maximum, for each document. 

In [15]:
df_summaries["av_wl"] = df_samp_explode.groupby('level_0')['word_len'].mean() #average word length
df_summaries["max_wl"] = df_samp_explode.groupby('level_0')['word_len'].max() #max word length
df_summaries["min_wl"] = df_samp_explode.groupby('level_0')['word_len'].min() #min word length
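
The three statistics above each run their own `groupby`. As a hedged alternative, named aggregation (available from pandas 0.25) can compute them in a single pass. The sketch uses a toy `words` frame in place of `df_samp_explode`.

```python
import pandas as pd

# Toy stand-in for df_samp_explode: document id and word length only.
words = pd.DataFrame({
    "level_0": [1, 1, 1, 2, 2],
    "word_len": [2, 5, 8, 3, 3],
})

# One groupby, three output columns named by keyword.
stats = words.groupby("level_0")["word_len"].agg(
    av_wl="mean",
    max_wl="max",
    min_wl="min",
)

print(stats)
```

The result is indexed by `level_0`, so it could be joined onto an existing summaries frame with `DataFrame.join`.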

We can also compute quantiles of the word length: 

In [16]:
df_summaries["10_quantile"] = df_samp_explode.groupby('level_0')['word_len'].quantile(0.1) #10th percentile of word length
df_summaries["90_quantile"]= df_samp_explode.groupby('level_0')['word_len'].quantile(0.9) #90th percentile of word length
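
A quantile of integer word lengths need not be an integer, because the default method interpolates linearly between neighbouring sorted values. The sketch below works this through by hand for the punctuation-free text of sample document 27417 and compares the result with `np.quantile`, whose default is also linear interpolation. The variable names are only for this example.

```python
import numpy as np

text = ("Ginger Honey Crystals Pack of 30 today and on the go to work on vacation "
        "or even camping Would definitely buy again He loved the treat and wanted another")

lengths = np.sort([len(word) for word in text.split()])

# Linear interpolation: the 0.9 quantile sits at fractional position 0.9 * (n - 1).
pos = 0.9 * (len(lengths) - 1)
lower = int(np.floor(pos))
manual = lengths[lower] + (pos - lower) * (lengths[lower + 1] - lengths[lower])

print(manual, np.quantile(lengths, 0.9))
```

With 29 words the position is 25.2, which falls between a word of length 7 and a word of length 8, giving roughly 7.2.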

In [17]:
df_summaries

Unnamed: 0,num_punct,num_words,av_wl,max_wl,min_wl,10_quantile,90_quantile
9026,17,92,4.271739,11,1,2.0,7.0
27417,9,29,4.310345,10,2,2.0,7.2
20031,13,49,4.346939,11,1,2.0,7.0


As well as the simple summaries relating to word length, we can compute some more involved summaries related to language. For each document we will compute: 

* the number of words which contain at least one capital letter
* the number of stop words



In [18]:
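# str.islower returns True only if the word has at least one cased character and all cased characters are lower case.
# Note: "not islower()" is therefore also True for words with no letters at all, such as "30".
# nb: isupper only returns True if all cased characters are upper case.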
def caps(word):
    return not word.islower()
df_samp_explode["upper_case"]=df_samp_explode['text'].apply(caps)
df_summaries["upper_case"] = df_samp_explode.groupby('level_0')['upper_case'].sum() 
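
The behaviour of `caps` on tokens without letters is easy to overlook, so here is a short check. `has_upper` is a hypothetical stricter variant shown only for comparison; it is not used anywhere in this notebook.

```python
def caps(word):
    # Same test as in the cell above: True unless the word is entirely lower case.
    return not word.islower()

def has_upper(word):
    # Hypothetical alternative: True only if some character is upper case.
    return any(ch.isupper() for ch in word)

for word in ["the", "Miss", "LADY", "30"]:
    print(word, caps(word), has_upper(word))
```

The two functions agree on ordinary words and differ on tokens such as `30`, which `caps` counts and `has_upper` does not.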

In [19]:
df_summaries

Unnamed: 0,num_punct,num_words,av_wl,max_wl,min_wl,10_quantile,90_quantile,upper_case
9026,17,92,4.271739,11,1,2.0,7.0,31
27417,9,29,4.310345,10,2,2.0,7.2,7
20031,13,49,4.346939,11,1,2.0,7.0,11


Stop words are commonly used words which are usually considered to be unrelated to the document topic. Examples include 'in', 'the', 'at' and 'otherwise'.

In [30]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # public import path for the stop word list

In [31]:
def isstopword(word):
    return word in ENGLISH_STOP_WORDS

df_samp_explode["stop_words"]=df_samp_explode['text'].apply(isstopword)

In [32]:
df_samp_explode.sample(10)

Unnamed: 0,level_0,index,label,text,word_len,upper_case,stop_words
22,9026,9026,legitimate,and,3,False,True
34,9026,9026,legitimate,have,4,False,True
123,20031,31,spam,I,1,True,False
85,9026,9026,legitimate,I,1,True,False
32,9026,9026,legitimate,fortune,7,False,False
46,9026,9026,legitimate,Gutenbergs,10,True,False
124,20031,31,spam,get,3,False,True
28,9026,9026,legitimate,Miss,4,True,False
73,9026,9026,legitimate,cats,4,False,False
64,9026,9026,legitimate,to,2,False,True
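
In the sample above, `I` is not flagged as a stop word. The membership test is case-sensitive and the scikit-learn list contains only lower-case entries, so capitalised words never match. The sketch below shows the effect with a tiny stand-in set; `TOY_STOP_WORDS` and `isstopword_ci` are hypothetical names used only to keep the example self-contained.

```python
# A tiny stand-in for ENGLISH_STOP_WORDS (assumption: the real list is much longer).
TOY_STOP_WORDS = frozenset({"i", "the", "and", "to"})

def isstopword(word):
    # Case-sensitive test, as in the notebook.
    return word in TOY_STOP_WORDS

def isstopword_ci(word):
    # Hypothetical case-insensitive variant.
    return word.lower() in TOY_STOP_WORDS

for word in ["I", "The", "and", "cats"]:
    print(word, isstopword(word), isstopword_ci(word))
```

Whether to lower-case first is a design choice: doing so would change the `stop_words` counts reported in this notebook.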


In [33]:
df_summaries["stop_words"] = df_samp_explode.groupby('level_0')['stop_words'].sum() 

In [34]:
df_summaries

Unnamed: 0,num_punct,num_words,av_wl,max_wl,min_wl,10_quantile,90_quantile,upper_case,stop_words
9026,17,92,4.271739,11,1,2.0,7.0,31,37
27417,9,29,4.310345,10,2,2.0,7.2,7,13
20031,13,49,4.346939,11,1,2.0,7.0,11,22


Now that we've illustrated how to compute the summaries on a subsample of our data, we will go ahead and compute the summaries for each of the texts in the full dataset. In order to minimise clutter in this notebook we have introduced a helper class, `SimpleSummaries`, in [the `featuressimple` module](mlworkflows/featuressimple.py).
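
To tie the steps above together, the sketch below computes all nine summaries for a single document in one function. It is an illustration only and is not the contents of `featuressimple.py`. The function name `summarise`, its `stop_words` parameter and the small stop word set are assumptions made for this example, and the function expects a document with at least one word.

```python
import re
import numpy as np

PUNCT = re.compile(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""")

def summarise(doc, stop_words=frozenset()):
    """Return the nine summaries for one non-empty document as a dict."""
    cleaned, num_punct = PUNCT.subn("", doc)   # strip and count punctuation
    words = cleaned.split()                    # split once, reuse everywhere
    lengths = np.array([len(w) for w in words])
    return {
        "num_punct": num_punct,
        "num_words": len(words),
        "av_wl": lengths.mean(),
        "max_wl": lengths.max(),
        "min_wl": lengths.min(),
        "10_quantile": np.quantile(lengths, 0.1),
        "90_quantile": np.quantile(lengths, 0.9),
        "upper_case": sum(not w.islower() for w in words),
        "stop_words": sum(w in stop_words for w in words),
    }

doc = ("Ginger Honey Crystals Pack of 30 today and on the go, to work, on vacation, "
       "or even camping. Would definitely buy again. . . . He loved the treat and wanted another.")

# Hypothetical, deliberately short stop word set; pass ENGLISH_STOP_WORDS for real use.
result = summarise(doc, stop_words=frozenset({"of", "and", "on", "the", "to", "or"}))
print(result)
```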

In [35]:
df.reset_index(inplace=True)

In [37]:
from mlworkflows import featuressimple

In [38]:
simple_summary = featuressimple.SimpleSummaries()

summaries = simple_summary.transform(df["text"])

In [39]:
from sklearn.pipeline import Pipeline

feat_pipeline = Pipeline([
    ('features',simple_summary)
])

from mlworkflows import util
util.serialize_to(feat_pipeline, "feature_pipeline.sav")

In [40]:
features = pd.concat([df[["index", "label"]], pd.DataFrame(summaries)], axis=1)

In [41]:
features

Unnamed: 0,index,label,no_punct,number_words,mean_wl,max_wl,min_wl,pc_10_wl,pc_90_wl,upper,stop_words
0,0,legitimate,34,124,4.604839,14,1,2.0,8.0,22,64
1,1,legitimate,16,87,4.896552,16,1,2.0,8.4,10,46
2,2,legitimate,23,139,4.330935,12,1,2.0,8.0,24,74
3,3,legitimate,17,94,4.500000,13,1,2.0,9.0,13,49
4,4,legitimate,12,80,4.375000,9,2,2.0,7.0,11,46
...,...,...,...,...,...,...,...,...,...,...,...
39995,19995,spam,10,52,4.211538,11,1,2.0,7.0,8,25
39996,19996,spam,8,66,4.545455,13,1,2.0,8.5,6,34
39997,19997,spam,11,52,4.384615,12,1,2.0,7.0,10,20
39998,19998,spam,15,95,3.926316,12,1,2.0,6.0,9,54


In [42]:
features.columns = features.columns.astype(str)

# Visualisation

These vectors have too many dimensions for us to easily picture them as points in space. [Principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis), or PCA, is a statistical technique that is over a century old; it takes observations in a high-dimensional space and maps them to a (potentially much) smaller number of dimensions. We'll see it in action now, using the [implementation from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA).

(To learn a little more about PCA and an alternative technique, visit [the visualisation notebook](01-vectors-and-visualization.ipynb).)

In [43]:
import sklearn.decomposition

DIMENSIONS = 2

pca = sklearn.decomposition.PCA(DIMENSIONS)

pca_summaries = pca.fit_transform(features.iloc[:, 2:])  # skip the "index" and "label" columns
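
For intuition about what the projection does, the sketch below carries out the core of PCA with NumPy on random stand-in data, not on the notebook's features: centre each column, take a singular value decomposition, and project onto the first two right singular vectors. Up to the sign of each component, this should correspond to what scikit-learn's `PCA` computes with a full SVD, though that equivalence is stated here as an expectation and is not checked.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))  # stand-in for 100 documents with 9 summaries each

# 1. Centre each feature so the components describe variation around the mean.
X_centred = X - X.mean(axis=0)

# 2. The rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# 3. Project onto the first two directions to get 2-D coordinates.
projected = X_centred @ Vt[:2].T

# Share of the total variance captured by each component.
explained = S**2 / np.sum(S**2)

print(projected.shape, explained[:2])
```

Because PCA works on variances, features measured on larger scales (such as the number of words) tend to dominate the leading components unless the columns are standardised first.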

In [44]:
from mlworkflows import plot

pca_summaries_plot_data = pd.concat([df, pd.DataFrame(pca_summaries, columns=["x", "y"])], axis=1)

plot.plot_points(pca_summaries_plot_data, x="x", y="y", color="label")

In [47]:
features.to_parquet(os.path.join("data", "features_summaries.parquet"))

Now that we have a feature engineering approach, the next step is to train a model. Again, you have two choices for your next step: [click here](04-model-logistic-regression.ipynb) for a model based on *logistic regression*, or [click here](04-model-random-forest.ipynb) for a model based on *ensembles of decision trees*.