In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. 

Each feature vector is made up of simple summary statistics of a document, such as its word count and the distribution of its word lengths. 

In [1]:
import pandas as pd

df = pd.read_parquet("data/training.parquet")



To illustrate how feature vectors are computed, we compute them for a sample of three documents from the data loaded above.

In [8]:
import numpy as np

np.random.seed(0xc0ffffee)
df_samp = df.sample(3)

In [9]:
pd.set_option('display.max_colwidth', None) #ensures that all the text is visible (recent pandas expects None rather than -1)
df_samp

Unnamed: 0,index,label,text
38247,18247,spam,"I make it since I haven't found it yet but I am not a coffee drinker. When I opened the box up, put the label on, and take it from me, don't waste your money on this coffee and every cup is enjoyable. Shipping will double the price. Can't tell it is chili but not hot enough to hold a week of uncontrollable diareau, and Darby was also vomiting. I had no taste in teas.My advice: buy a single box and see how dogs react."
24052,4052,spam,"YUMMY! My wife says this is hot chocolate. Accordingly, the tea had no real intelligence to give, and that she enjoyed so much! I even sweetened my tea with Sugar Twin, so I'm thrilled to find it is the description that closely. I developed this aversion to them because they are my new favorite."
27554,7554,spam,"Sorry folks I'm sure it is healthy - it contains only fruits and nuts, no sugar added. Gets boring. Ugh. If you have problems with muscle cramping are relieved to know. It taste like hot water with some coffee at the price of gas and other expenses these days I can't stop sucking on them. She does get breastmilk but I have to tell him the agreeable news. You can definitely taste the fig but not so much the better. I didn't like the way this manufacturer packages 4 types of sugar are listed, it is clear you are only getting 3-4 hours total per night."


In [10]:
df_samp.reset_index(inplace=True) 

#note level_0 and index coincide for the legitimate documents, but not for the spam - 
    #for spam, index = level_0 mod 20,000

In [11]:
df_samp #level_0 now stores the original row labels

Unnamed: 0,level_0,index,label,text
0,38247,18247,spam,"I make it since I haven't found it yet but I am not a coffee drinker. When I opened the box up, put the label on, and take it from me, don't waste your money on this coffee and every cup is enjoyable. Shipping will double the price. Can't tell it is chili but not hot enough to hold a week of uncontrollable diareau, and Darby was also vomiting. I had no taste in teas.My advice: buy a single box and see how dogs react."
1,24052,4052,spam,"YUMMY! My wife says this is hot chocolate. Accordingly, the tea had no real intelligence to give, and that she enjoyed so much! I even sweetened my tea with Sugar Twin, so I'm thrilled to find it is the description that closely. I developed this aversion to them because they are my new favorite."
2,27554,7554,spam,"Sorry folks I'm sure it is healthy - it contains only fruits and nuts, no sugar added. Gets boring. Ugh. If you have problems with muscle cramping are relieved to know. It taste like hot water with some coffee at the price of gas and other expenses these days I can't stop sucking on them. She does get breastmilk but I have to tell him the agreeable news. You can definitely taste the fig but not so much the better. I didn't like the way this manufacturer packages 4 types of sugar are listed, it is clear you are only getting 3-4 hours total per night."


In [34]:
#To make the text easier to process we start by removing all punctuation from each document. 

import string

s = "string. With. Punctuation?"
table = str.maketrans({key: None for key in string.punctuation})
new_s = s.translate(table)                          # Output: string without punctuation

In [36]:
def no_punct(phrase):
    return phrase.translate(table)
df_samp["no_punct"]= df_samp["text"].apply(no_punct)
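
The same step can also be written without a helper function. The sketch below is an illustration on made-up text (the `demo` frame is hypothetical, not part of the notebook's data) and assumes the same translation table as above, applied through the pandas `.str` accessor.

```python
import string

import pandas as pd

# Translation table that maps every ASCII punctuation character to None (deletes it)
table = str.maketrans({key: None for key in string.punctuation})

# A tiny stand-in for df_samp (hypothetical data, not taken from the notebook)
demo = pd.DataFrame({"text": ["Can't stop. Won't stop!", "3-4 hours, total."]})

# Series.str.translate applies the table to every element of the column
demo["no_punct"] = demo["text"].str.translate(table)
print(demo["no_punct"].tolist())
```

Note that removing the hyphen turns "3-4" into "34", the same effect visible in the third sample document.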

In [37]:
df_samp

Unnamed: 0,level_0,index,label,text,no_punct
0,38247,18247,spam,"I make it since I haven't found it yet but I am not a coffee drinker. When I opened the box up, put the label on, and take it from me, don't waste your money on this coffee and every cup is enjoyable. Shipping will double the price. Can't tell it is chili but not hot enough to hold a week of uncontrollable diareau, and Darby was also vomiting. I had no taste in teas.My advice: buy a single box and see how dogs react.",I make it since I havent found it yet but I am not a coffee drinker When I opened the box up put the label on and take it from me dont waste your money on this coffee and every cup is enjoyable Shipping will double the price Cant tell it is chili but not hot enough to hold a week of uncontrollable diareau and Darby was also vomiting I had no taste in teasMy advice buy a single box and see how dogs react
1,24052,4052,spam,"YUMMY! My wife says this is hot chocolate. Accordingly, the tea had no real intelligence to give, and that she enjoyed so much! I even sweetened my tea with Sugar Twin, so I'm thrilled to find it is the description that closely. I developed this aversion to them because they are my new favorite.",YUMMY My wife says this is hot chocolate Accordingly the tea had no real intelligence to give and that she enjoyed so much I even sweetened my tea with Sugar Twin so Im thrilled to find it is the description that closely I developed this aversion to them because they are my new favorite
2,27554,7554,spam,"Sorry folks I'm sure it is healthy - it contains only fruits and nuts, no sugar added. Gets boring. Ugh. If you have problems with muscle cramping are relieved to know. It taste like hot water with some coffee at the price of gas and other expenses these days I can't stop sucking on them. She does get breastmilk but I have to tell him the agreeable news. You can definitely taste the fig but not so much the better. I didn't like the way this manufacturer packages 4 types of sugar are listed, it is clear you are only getting 3-4 hours total per night.",Sorry folks Im sure it is healthy it contains only fruits and nuts no sugar added Gets boring Ugh If you have problems with muscle cramping are relieved to know It taste like hot water with some coffee at the price of gas and other expenses these days I cant stop sucking on them She does get breastmilk but I have to tell him the agreeable news You can definitely taste the fig but not so much the better I didnt like the way this manufacturer packages 4 types of sugar are listed it is clear you are only getting 34 hours total per night


We can now "explode" the text into words, so that each word occupies a row of the data frame, and retains the associated "level_0", "index" and "label". 

In [12]:
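rows = []
#the original text (with punctuation) is split, so punctuation stays attached to the words
for _, row in df_samp.iterrows():
    for word in row['text'].split():
        rows.append([row['level_0'], row['index'], row['label'], word])
#name the columns explicitly: df_samp.columns may also contain "no_punct"
df_samp_explode = pd.DataFrame(rows, columns=['level_0', 'index', 'label', 'text'])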
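
As an alternative, recent pandas versions (0.25 and later) provide `DataFrame.explode`, which does the same reshaping in one chain. This is a minimal sketch on a hypothetical two-document frame with the same column names as `df_samp`; it is not the notebook's own method.

```python
import pandas as pd

# Hypothetical stand-in for df_samp with the same column names
demo = pd.DataFrame({
    "level_0": [38247, 24052],
    "index": [18247, 4052],
    "label": ["spam", "spam"],
    "text": ["I make it", "YUMMY! My wife"],
})

# Split each document into a list of words, then give each word its own row.
# The other columns are repeated for every word of the same document.
demo_explode = (
    demo.assign(text=demo["text"].str.split())
        .explode("text")
        .reset_index(drop=True)
)
print(demo_explode)
```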

In [13]:
df_samp_explode

Unnamed: 0,level_0,index,label,text
0,38247,18247,spam,I
1,38247,18247,spam,make
2,38247,18247,spam,it
3,38247,18247,spam,since
4,38247,18247,spam,I
5,38247,18247,spam,haven't
6,38247,18247,spam,found
7,38247,18247,spam,it
8,38247,18247,spam,yet
9,38247,18247,spam,but


Column "level_0" identifies the document each word came from, so it is the key we group by when aggregating. 

The summaries we are going to compute for each document are: 
    number of words in each document
    average word length
    maximum word length
    minimum word length
    10th percentile word length
    90th percentile word length
    number of upper case words
    
    
Many of these require the word length. To avoid recomputing it each time, we begin by adding a column containing the word length to our 'exploded' data frame. 


In [14]:
df_samp_explode["word_len"] = df_samp_explode["text"].apply(len) 

In [15]:
df_samp_explode.sample(10) 

Unnamed: 0,level_0,index,label,text,word_len
210,27554,7554,spam,taste,5
182,27554,7554,spam,gas,3
49,38247,18247,spam,tell,4
241,27554,7554,spam,hours,5
177,27554,7554,spam,coffee,6
126,24052,4052,spam,closely.,8
220,27554,7554,spam,didn't,6
118,24052,4052,spam,thrilled,8
171,27554,7554,spam,taste,5
66,38247,18247,spam,was,3


Note that punctuation is counted as contributing to word length. (We're fine with this, but you could strip it out beforehand if you wanted to.)
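
If you did want to exclude punctuation, one possible approach is to strip it from the ends of each word before measuring. The sketch below uses a hypothetical `words` series and only removes leading and trailing punctuation, so internal apostrophes still count.

```python
import string

import pandas as pd

# Hypothetical words, similar to those in the exploded data frame
words = pd.Series(["closely.", "didn't", "YUMMY!", "taste"])

# Length including punctuation, as computed in the notebook
raw_len = words.str.len()

# Length after stripping leading/trailing punctuation only;
# internal characters such as the apostrophe in "didn't" are kept
stripped_len = words.str.strip(string.punctuation).str.len()

print(pd.DataFrame({"word": words, "raw_len": raw_len, "stripped_len": stripped_len}))
```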

We will record the summaries for each document in a new data frame called df_summaries. We start by computing the number of words in each document. 

In [16]:
word_counts = df_samp_explode['level_0'].value_counts()

df_summaries = pd.DataFrame({'word_counts' :word_counts})

In [17]:
df_summaries

Unnamed: 0,word_counts
27554,106
38247,85
24052,54


In the next cells we compute the average, maximum and minimum word length, and the 10th and 90th percentiles of word length, for each document. 

In [18]:
df_summaries["av_wl"] = df_samp_explode.groupby('level_0')['word_len'].mean() #average word length

In [19]:
df_summaries["max_wl"] = df_samp_explode.groupby('level_0')['word_len'].max() #max word length
df_summaries["min_wl"] = df_samp_explode.groupby('level_0')['word_len'].min() #min word length

In [20]:
df_summaries["10_quantile"] = df_samp_explode.groupby('level_0')['word_len'].quantile(0.1) #10th percentile word length
df_summaries["90 quantile"] = df_samp_explode.groupby('level_0')['word_len'].quantile(0.9) #90th percentile word length
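
The cells above group by "level_0" once per statistic. A possible alternative, sketched here on a hypothetical exploded frame, is a single `groupby(...).agg(...)` call with named aggregation (pandas 0.25 and later). The output column names `q10` and `q90` are assumptions chosen because keyword names must be valid Python identifiers; they differ from the notebook's column names.

```python
import pandas as pd

# Hypothetical exploded frame with the same column names as df_samp_explode
demo_explode = pd.DataFrame({
    "level_0": [1, 1, 1, 2, 2],
    "word_len": [1, 4, 7, 2, 6],
})

def q10(s):
    # 10th percentile of the word lengths in one document
    return s.quantile(0.1)

def q90(s):
    # 90th percentile of the word lengths in one document
    return s.quantile(0.9)

# One pass over the groups; each keyword becomes an output column
demo_summaries = demo_explode.groupby("level_0")["word_len"].agg(
    word_counts="count",
    av_wl="mean",
    max_wl="max",
    min_wl="min",
    q10=q10,
    q90=q90,
)
print(demo_summaries)
```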

In [21]:
df_summaries

Unnamed: 0,word_counts,av_wl,max_wl,min_wl,10_quantile,90 quantile
27554,106,4.245283,12,1,2.0,7.0
38247,85,3.952941,14,1,2.0,6.6
24052,54,4.5,12,1,2.0,9.0


As well as the simple summaries relating to word length, we can compute some more involved summaries related to language. For each document we will compute: 

The number of words which contain at least one capital letter. 
The number of stop words. 



In [24]:
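#Count the number of words containing at least one capital letter.
#word.islower() returns True only if the word has cased characters and all of them are lowercase.
#note: tokens with no letters at all (such as "4") also fail islower(), so they are counted here too.
#note: isupper() only returns True if all cased characters are upper case. 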
def caps(word):
    return not word.islower()
df_samp_explode["upper_case"]=df_samp_explode['text'].apply(caps)
df_summaries["upper_case"] = df_samp_explode.groupby('level_0')['upper_case'].sum() 
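
The test `not word.islower()` is slightly broader than "contains at least one capital letter", because strings with no letters at all are not lowercase either. The sketch below compares it with a stricter check; `has_capital` is a hypothetical helper introduced for illustration, and swapping it in would change the counts shown in this notebook.

```python
def caps(word):
    """The notebook's test: True for anything str.islower() rejects."""
    return not word.islower()

def has_capital(word):
    """Stricter test: True only if some character is an upper-case letter."""
    return any(c.isupper() for c in word)

# Compare the two tests on a few tokens like those in the sample documents
for token in ["Darby", "coffee", "4", "3-4", "I'm"]:
    print(token, caps(token), has_capital(token))
```

The two functions agree on ordinary words and differ only on tokens such as "4" or "3-4".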

In [26]:
df_summaries

Unnamed: 0,word_counts,av_wl,max_wl,min_wl,10_quantile,90 quantile,upper_case
27554,106,4.245283,12,1,2.0,7.0,14.0
38247,85,3.952941,14,1,2.0,6.6,10.0
24052,54,4.5,12,1,2.0,9.0,8.0


Stop words are commonly used words which are usually considered to be unrelated to the document topic. Examples include 'in', 'the', 'at' and 'otherwise'.

In [28]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS #the stop_words module was removed in later scikit-learn releases

In [32]:
def isstopword(word):
    return word in ENGLISH_STOP_WORDS

df_samp_explode["stop_words"]=df_samp_explode['text'].apply(isstopword)
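
The membership test above is case sensitive and sees any attached punctuation, so a word such as "I" or "it," is not recognised when the stop word list holds lowercase entries. A minimal sketch of a normalising variant follows. `STOP_WORDS` is a small hypothetical stand-in so the example runs without scikit-learn, and `isstopword_normalised` is an illustrative name, not part of the notebook.

```python
import string

# Small stand-in for ENGLISH_STOP_WORDS so the sketch runs without scikit-learn;
# in the notebook you would pass the real set instead
STOP_WORDS = frozenset({"i", "it", "the", "but", "yet", "since"})

def isstopword_normalised(word, stop_words=STOP_WORDS):
    # Lower-case and strip surrounding punctuation before the lookup
    return word.lower().strip(string.punctuation) in stop_words

for token in ["I", "it,", "make", "But."]:
    print(token, isstopword_normalised(token))
```

Using this variant would raise the stop word counts relative to the table shown in this notebook, so it is a design choice rather than a correction.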

In [33]:
df_samp_explode

Unnamed: 0,level_0,index,label,text,word_len,upper_case,stop_words
0,38247,18247,spam,I,1,True,False
1,38247,18247,spam,make,4,False,False
2,38247,18247,spam,it,2,False,True
3,38247,18247,spam,since,5,False,True
4,38247,18247,spam,I,1,True,False
5,38247,18247,spam,haven't,7,False,False
6,38247,18247,spam,found,5,False,True
7,38247,18247,spam,it,2,False,True
8,38247,18247,spam,yet,3,False,True
9,38247,18247,spam,but,3,False,True


In [None]:
df_summaries["stop_words"] = df_samp_explode.groupby('level_0')['stop_words'].sum() 

In [None]:
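#Finally, we use regular expressions to count the number of punctuation characters in a document
import re

#the hyphen is escaped so that it is matched literally rather than read as the range ']' to '_'
PUNCT_RE = re.compile(r"""[!.><:;'@#~{}\[\]\-_+=£$%^&()?]""")

def punct_count(doc):
    #count every character in the document that matches the punctuation class
    return len(PUNCT_RE.findall(doc))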

In [None]:
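#use the sampled documents, indexed by level_0 so that they align with df_summaries
df_summaries["punctuation"] = df_samp.set_index('level_0')['text'].apply(punct_count)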
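
Counting can also be done directly on a column with `Series.str.count`, which takes a regular expression. The sketch below builds the character class from `string.punctuation` instead of a hand-written list. That is an assumption that changes which characters count: it covers ASCII punctuation only, so it includes ',' and '"' but not '£'. The `docs` series is hypothetical.

```python
import re
import string

import pandas as pd

# Character class built from string.punctuation (ASCII only); re.escape makes
# characters such as '-' and ']' literal inside the brackets
punct_pattern = "[" + re.escape(string.punctuation) + "]"

# Hypothetical documents
docs = pd.Series(["YUMMY! My wife says this is hot chocolate.", "3-4 hours, total"])

# Number of regex matches (punctuation characters) in each document
punct_counts = docs.str.count(punct_pattern)
print(punct_counts.tolist())
```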

In [None]:
df_summaries.sample(min(10, len(df_summaries))) #the sample contains fewer than 10 documents

In [None]:
import sklearn.decomposition

DIMENSIONS = 2

pca = sklearn.decomposition.PCA(DIMENSIONS)

pca_summaries = pca.fit_transform(df_summaries)
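
To make clearer what the projection does, here is a minimal NumPy sketch of PCA via the singular value decomposition. It is an illustration under stated assumptions, not scikit-learn's implementation: `pca_project` and the matrix `X` are hypothetical, and the signs of the components may differ from those returned by `sklearn.decomposition.PCA`.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto their first n_components principal directions."""
    X = np.asarray(X, dtype=float)
    X_centred = X - X.mean(axis=0)  # PCA operates on mean-centred data
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    return X_centred @ Vt[:n_components].T

# Hypothetical summary matrix: 4 documents x 3 features
X = np.array([[106, 4.2, 12], [85, 4.0, 14], [54, 4.5, 12], [70, 4.1, 9]])
projected = pca_project(X)
print(projected.shape)
```

Because PCA is sensitive to the scale of each feature, a column with large values such as the word count can dominate the first component; standardising the features first is a common option to consider.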

In [None]:
pca_summaries

In [None]:
#index the PCA output by document so that it lines up with the matching rows of df
pca_df = pd.DataFrame(pca_summaries, columns=["x", "y"], index=df_summaries.index)
pca_summaries_plot_data = pd.concat([df.loc[df_summaries.index], pca_df], axis=1)

from mlworkflows import plot

plot.plot_points(pca_summaries_plot_data, x="x", y="y", color="label")

In [None]:
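#keep only the documents that have summaries, so that no rows are filled with NaN
labled_vecs = pd.concat([df.loc[df_summaries.index, ["index", "label"]], df_summaries], axis=1)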

In [None]:
labled_vecs

In [None]:
labled_vecs.columns = labled_vecs.columns.astype(str)

In [None]:
labled_vecs.to_parquet("data/simplesummaries_features.parquet")