In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. 

Each feature vector is composed of simple numeric summaries of a document, such as its word count and average word length.

In [1]:
import pandas as pd

df = pd.read_parquet("data/training.parquet")

To illustrate the computation of feature vectors, we compute them for a sample of three documents from the data loaded above.

In [2]:
import numpy as np

np.random.seed(0xc0ffeeee)
df_samp = df.sample(3)

In [3]:
pd.set_option('display.max_colwidth', None)  # show the full text; None replaces the deprecated -1
df_samp

Unnamed: 0,index,label,text
26675,6675,spam,"Once I received the product and thought it was just like at the dog park or somewhere outside their home. I have tried many K-cup varieties. I had fresh herbs all winter long it was sitting on, apparently she would rather I just feed her the real thing. Please, share them with my co-workers a few weeks to go rancid, then reheated. She would tear the bag open as the treats can dry out, although I think I like a strong coffee, it was not clustered - sounds like your thing, you'll probably love these."
11130,11130,legitimate,"Elinor, while she waited in silence for the appearance of equal solicitude, on topics which had by nature the first claim on her. I do not believe Isabella has any fortune at all: but that will not signify to anyone here what he really is. It is hearty, but not at the Cottage, though that had been brought on by the entrance of a third to cheer a long evening."
33492,13492,spam,"I buy a new tea, thank you. This superior dog biscuit recipe contains only 7 primary ingredients and when it comes to buying this product again. The Babycook is so cute on all of the Happy Baby brand too ... also almost $2 a cup it is worth it! Have one more bag to get rid of my symptoms for all these conditions are relieved. This is one great product. Even brewing this at twice the price. I went home, did more research, saw that the good people can keep it half of it for several years."


In [4]:
df_samp.reset_index(inplace=True) 

# Note: level_0 and index coincide for the legitimate documents, but not for the spam;
# for spam, index = level_0 mod 20,000.

In [5]:
df_samp  # the original row labels are now stored in the "level_0" column

Unnamed: 0,level_0,index,label,text
0,26675,6675,spam,"Once I received the product and thought it was just like at the dog park or somewhere outside their home. I have tried many K-cup varieties. I had fresh herbs all winter long it was sitting on, apparently she would rather I just feed her the real thing. Please, share them with my co-workers a few weeks to go rancid, then reheated. She would tear the bag open as the treats can dry out, although I think I like a strong coffee, it was not clustered - sounds like your thing, you'll probably love these."
1,11130,11130,legitimate,"Elinor, while she waited in silence for the appearance of equal solicitude, on topics which had by nature the first claim on her. I do not believe Isabella has any fortune at all: but that will not signify to anyone here what he really is. It is hearty, but not at the Cottage, though that had been brought on by the entrance of a third to cheer a long evening."
2,33492,13492,spam,"I buy a new tea, thank you. This superior dog biscuit recipe contains only 7 primary ingredients and when it comes to buying this product again. The Babycook is so cute on all of the Happy Baby brand too ... also almost $2 a cup it is worth it! Have one more bag to get rid of my symptoms for all these conditions are relieved. This is one great product. Even brewing this at twice the price. I went home, did more research, saw that the good people can keep it half of it for several years."


We can now "explode" the text into words, so that each word occupies a row of the data frame, and retains the associated "level_0", "index" and "label". 

In [7]:
# Split each document on whitespace and give every word its own row;
# "level_0", "index" and "label" are repeated for each word.
df_samp_explode = (
    df_samp.assign(text=df_samp["text"].str.split())
    .explode("text")
    .reset_index(drop=True)
)

In [8]:
df_samp_explode

Unnamed: 0,level_0,index,label,text
0,26675,6675,spam,Once
1,26675,6675,spam,I
2,26675,6675,spam,received
3,26675,6675,spam,the
4,26675,6675,spam,product
5,26675,6675,spam,and
6,26675,6675,spam,thought
7,26675,6675,spam,it
8,26675,6675,spam,was
9,26675,6675,spam,just


Column "level_0" identifies the document each word came from, so it is the key we group by when aggregating.

The summaries we are going to compute for each document are: 
    number of words in each document
    average word length
    maximum word length
    minimum word length
    10th percentile word length
    90th percentile word length
    number of words containing an upper-case letter

Many of these require the word length to be computed. To avoid recomputing it each time, we begin by adding a column containing this information to our 'exploded' data frame.


In [9]:
df_samp_explode["word_len"] = df_samp_explode["text"].apply(len) 

In [10]:
df_samp_explode.sample(10) 

Unnamed: 0,level_0,index,label,text,word_len
183,33492,13492,spam,when,4
261,33492,13492,spam,years.,6
213,33492,13492,spam,it!,3
122,11130,11130,legitimate,Isabella,8
168,33492,13492,spam,new,3
242,33492,13492,spam,I,1
131,11130,11130,legitimate,not,3
23,26675,6675,spam,many,4
158,11130,11130,legitimate,a,1
173,33492,13492,spam,superior,8


Note that punctuation attached to a word counts towards its length. We are fine with this, but you could strip it out beforehand if you wanted to.
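
If you did want to exclude punctuation from the word length, one option is to strip it from both ends of each token before measuring. The following is a minimal sketch, assuming ASCII punctuation only; `stripped_len` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import string

def stripped_len(word):
    """Length of `word` after removing leading and trailing ASCII punctuation."""
    return len(word.strip(string.punctuation))

# Compare the raw length with the stripped length for a few tokens.
for token in ["years.", "it!", "Isabella", "co-workers"]:
    print(token, len(token), stripped_len(token))
```

Punctuation inside a token, such as the hyphen in "co-workers", is left in place by this approach.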

We will record the summaries for each document in a new data frame called df_summaries. We start by computing the number of words in each document.

In [16]:
word_counts = df_samp_explode['level_0'].value_counts()

df_summaries = pd.DataFrame({'word_counts': word_counts})

In [17]:
df_summaries

Unnamed: 0,word_counts
33492,97
26675,95
11130,70


In the next cells we compute, for each document, the average, maximum and minimum word length, followed by the 10th and 90th percentiles.

In [None]:
df_summaries["av_wl"] = df_samp_explode.groupby('level_0')['word_len'].mean()  # average word length

In [None]:
df_summaries["max_wl"] = df_samp_explode.groupby('level_0')['word_len'].max()  # max word length
df_summaries["min_wl"] = df_samp_explode.groupby('level_0')['word_len'].min()  # min word length

In [None]:
df_summaries["10_quantile"] = df_samp_explode.groupby('level_0')['word_len'].quantile(0.1)  # 10th percentile word length
df_summaries["90_quantile"] = df_samp_explode.groupby('level_0')['word_len'].quantile(0.9)  # 90th percentile word length
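
The word-length summaries computed so far can also be gathered in a single pass with pandas named aggregation. The sketch below runs on a small made-up frame rather than the notebook's data; `toy`, `toy_summaries`, `wl_q10` and `wl_q90` are names invented for this illustration.

```python
import pandas as pd

# Toy stand-in for the exploded frame: one row per word, keyed by document id.
toy = pd.DataFrame({
    "level_0": [0, 0, 0, 1, 1],
    "text": ["a", "bb", "cccc", "dd", "dddd"],
})
toy["word_len"] = toy["text"].str.len()

# One groupby, one output column per summary.
toy_summaries = toy.groupby("level_0").agg(
    word_counts=("word_len", "count"),
    av_wl=("word_len", "mean"),
    max_wl=("word_len", "max"),
    min_wl=("word_len", "min"),
    wl_q10=("word_len", lambda s: s.quantile(0.1)),
    wl_q90=("word_len", lambda s: s.quantile(0.9)),
)
print(toy_summaries)
```

The result is indexed by "level_0" and sorted by that key, whereas value_counts orders documents by their word count.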

In [None]:
# Count the words that contain at least one capital letter.
# str.islower() is False for tokens with no letters at all (e.g. "7"), and
# str.isupper() is True only when every letter is upper case,
# so we test the characters one by one instead.
def caps(word):
    return any(c.isupper() for c in word)

df_samp_explode["upper_case"] = df_samp_explode['text'].apply(caps)
df_summaries["upper_case"] = df_samp_explode.groupby('level_0')['upper_case'].sum()
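
To see how the built-in string case tests behave on the kinds of tokens found in these documents, here is a small sketch. `has_capital` is a hypothetical helper written for this comparison, and the tokens are picked by hand.

```python
def has_capital(word):
    """True if at least one character of `word` is an upper-case letter."""
    return any(c.isupper() for c in word)

# Print the three tests side by side for a handful of tokens.
for token in ["Once", "K-cup", "the", "7", "$2"]:
    print(f"{token!r}: islower={token.islower()}, "
          f"isupper={token.isupper()}, has_capital={has_capital(token)}")
```

Tokens with no letters, such as "7", are neither lower nor upper case according to `islower` and `isupper`, so negating either test would count them.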

In [None]:
df_summaries  # only three documents were sampled, so we display them all

As well as the simple summaries relating to word length, we can compute some more involved summaries related to language, such as the number of stop words (very common words like "the" and "and") in each document.

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [None]:
def isstopword(word):
    # exact match only: "The" and "it," are not counted as stop words
    return word in ENGLISH_STOP_WORDS

df_samp_explode["stop_words"] = df_samp_explode['text'].apply(isstopword)

In [None]:
df_summaries["stop_words"] = df_samp_explode.groupby('level_0')['stop_words'].sum()
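
Because the lookup is an exact string match, capitalised tokens and tokens with attached punctuation are not recognised as stop words. If that matters for your data, one possible variant normalises each token first. This is a sketch only: `TOY_STOP_WORDS` is a made-up stand-in for the real stop-word list and `is_stop_word_normalized` is a hypothetical helper.

```python
import string

# Hypothetical mini stop-word list, standing in for the real one.
TOY_STOP_WORDS = frozenset({"the", "it", "and", "a", "i"})

def is_stop_word_normalized(word, stop_words=TOY_STOP_WORDS):
    """Lower-case the token and strip surrounding punctuation before the lookup."""
    return word.lower().strip(string.punctuation) in stop_words

tokens = ["The", "it,", "product", "I"]
exact = [t in TOY_STOP_WORDS for t in tokens]
normalized = [is_stop_word_normalized(t) for t in tokens]
print(exact)
print(normalized)
```

Whether to normalise is a modelling choice: it changes the feature's values, so it should be applied consistently at training and prediction time.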

In [None]:
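# Finally, we use a regular expression to count the punctuation characters in a document.
import re

# The hyphen is escaped: an unescaped "]-_" inside the class is read as a
# character range, so "-" itself would never be matched.
PUNCT_RE = re.compile(r"""[!.><:;'@#~{}\[\]\-_+=£$%^&()?]""")

def punct_count(doc):
    return len(PUNCT_RE.findall(doc))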

In [None]:
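# Only the documents present in df_summaries are needed; df's row labels match "level_0".
df_summaries["punctuation"] = df.loc[df_summaries.index, 'text'].apply(punct_count)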
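
When listing punctuation inside a regular-expression character class, a hyphen placed between two characters is read as a range rather than as a literal hyphen. The sketch below illustrates the difference on a made-up string; the pattern names and the sample text are invented for this example.

```python
import re

# Unescaped hyphen: a range from "]" to "_", which covers "]", "^" and "_".
unescaped = re.compile(r"[\]-_]")
# Escaped hyphen: three literal characters, "]", "-" and "_".
escaped = re.compile(r"[\]\-_]")

sample = "co-workers ^_^ [ok]"
unescaped_hits = unescaped.findall(sample)
escaped_hits = escaped.findall(sample)
print(unescaped_hits)
print(escaped_hits)
```

Placing the hyphen first or last in the class is another common way to make it literal.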

In [None]:
df_summaries  # only three documents were sampled, so we display them all

In [None]:
import sklearn.decomposition

DIMENSIONS = 2

pca = sklearn.decomposition.PCA(DIMENSIONS)

pca_summaries = pca.fit_transform(df_summaries)

In [None]:
pca_summaries

In [None]:
from mlworkflows import plot

# Row i of pca_summaries corresponds to row i of df_summaries, so look up the
# matching documents by index before attaching the coordinates.
pca_summaries_plot_data = pd.concat(
    [df.loc[df_summaries.index].reset_index(drop=True),
     pd.DataFrame(pca_summaries, columns=["x", "y"])],
    axis=1,
)

plot.plot_points(pca_summaries_plot_data, x="x", y="y", color="label")

In [None]:
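# Select the rows of df that match df_summaries, so labels and features line up.
labeled_vecs = pd.concat([df.loc[df_summaries.index, ["index", "label"]], df_summaries], axis=1)

In [None]:
labeled_vecs

In [None]:
# Parquet requires string column names.
labeled_vecs.columns = labeled_vecs.columns.astype(str)

In [None]:
labeled_vecs.to_parquet("data/simplesummaries_features.parquet")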