In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. 

In this notebook, feature vectors are composed of simple summaries of the documents. 

In [5]:
import pandas as pd

df = pd.read_parquet("data/training.parquet")

In [6]:
df.sample(10)

Unnamed: 0,index,label,text
6807,6807,legitimate,"Your sister wrote to me again, you know, the c..."
8253,8253,legitimate,Sir James invited himself with great composure...
18433,18433,legitimate,"Anne found a nice seat for her, on a dry sunny..."
7976,7976,legitimate,She is a most serious sacrifice. Thorpe told h...
31887,11887,spam,These plants are very delicate when mixed with...
31224,11224,spam,I also like this brand's cold cereals. I decid...
2097,2097,legitimate,I would encourage you if you are to cruise to ...
17355,17355,legitimate,But when the time draws near. Everybody was su...
1959,1959,legitimate,Oh. She had taken care to have two full course...
18132,18132,legitimate,Does our education prepare us for such atrocit...


In [7]:
# Note: level_0 and index coincide for the legitimate documents, but not for the spam;
# for spam, index = level_0 mod 20,000.
df.reset_index(inplace=True) 

In [8]:
df.sample(10)

Unnamed: 0,level_0,index,label,text
12826,12826,12826,legitimate,"So, Lady Russell would suffer in entering the ..."
33672,33672,13672,spam,Research the benefits of feeding freeze dried ...
6098,6098,6098,legitimate,She believed the regard to be mutual; but she ...
5358,5358,5358,legitimate,It is so much deducted from the grand affair o...
7041,7041,7041,legitimate,"As yet, you have seen nothing of Willoughby; a..."
19141,19141,19141,legitimate,Lydia's being settled in the evening found her...
24909,24909,4909,spam,PRODUCT IS NOT AS DESCRIBED DO NOT ORDER these...
36019,36019,16019,spam,As did the Amazon seller. I ordered two of the...
37118,37118,17118,spam,I just ordered two more boxes. I would not cal...
10198,10198,10198,legitimate,You did not enjoy them as I can't do it too fa...


We can now "explode" the text into words, so that each word occupies a row of the data frame, and retains the associated "level_0", "index" and "label". 

In [9]:
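rows = []
# Build one [level_0, index, label, word] entry per whitespace-separated word.
for level_0, index, label, text in zip(df["level_0"], df["index"], df["label"], df["text"]):
    for word in text.split():
        rows.append([level_0, index, label, word])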

In [11]:
rows[1:10]

[[0, 0, 'legitimate', 'must'],
 [0, 0, 'legitimate', 'write'],
 [0, 0, 'legitimate', 'to'],
 [0, 0, 'legitimate', 'me.'],
 [0, 0, 'legitimate', 'Catherine'],
 [0, 0, 'legitimate', 'sighed.'],
 [0, 0, 'legitimate', 'And'],
 [0, 0, 'legitimate', 'there'],
 [0, 0, 'legitimate', 'are']]

In [12]:
df_explode = pd.DataFrame(rows, columns=df.columns)
df_explode

Unnamed: 0,level_0,index,label,text
0,0,0,legitimate,You
1,0,0,legitimate,must
2,0,0,legitimate,write
3,0,0,legitimate,to
4,0,0,legitimate,me.
5,0,0,legitimate,Catherine
6,0,0,legitimate,sighed.
7,0,0,legitimate,And
8,0,0,legitimate,there
9,0,0,legitimate,are
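
The same "explode" step can also be sketched with pandas' built-in `DataFrame.explode` (available from pandas 0.25). The toy frame below is made up for illustration and only mimics the column layout of our data; treat it as one possible alternative rather than the approach used in this notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's data frame (values invented for illustration).
toy = pd.DataFrame({
    "level_0": [0, 1],
    "index": [0, 1],
    "label": ["legitimate", "spam"],
    "text": ["You must write to me.", "Great cereal, would buy again"],
})

# Split each document on whitespace, then give every word its own row.
# The other columns are repeated for each word of the document.
toy_explode = (
    toy.assign(text=toy["text"].str.split())
       .explode("text")
       .reset_index(drop=True)
)
print(toy_explode)
```

One difference to be aware of: a document with no words yields a single row with a missing value under `explode`, whereas the loop-based approach produces no rows for it.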


Column "level_0" identifies the document each word came from, so it is the key we want to aggregate any calculations over. 

The summaries we are going to compute for each document (indexed by "level_0") are: 
    number of words in each document
    average word length
    maximum word length
    minimum word length
    10th percentile word length
    90th percentile word length
    number of upper case words
    
    
Many of these require the word length to be computed. To save us from recomputing this every time, we will begin by adding a column containing this information to our 'exploded' data frame. 


In [13]:
df_explode["word_len"] = df_explode["text"].apply(len) 

In [14]:
df_explode.sample(10) ## looks fine, though punctuation is being counted as a character in this word length calculation. 

Unnamed: 0,level_0,index,label,text,word_len
690728,7710,7710,legitimate,He,2
2368910,27139,7139,spam,loves,5
1383710,15397,15397,legitimate,I,1
295087,3293,3293,legitimate,the,3
307090,3425,3425,legitimate,they,4
2530821,29160,9160,spam,was,3
2747736,31900,11900,spam,$15,3
2308509,26361,6361,spam,old,3
1239895,13770,13770,legitimate,slim,4
3090948,36206,16206,spam,It's,4
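
With the word lengths stored, the word-length summaries listed above could in principle be computed by grouping the exploded frame on "level_0". The following is a minimal sketch on a made-up toy frame; the helper `percentile` and the output column names `p10_wl` and `p90_wl` are our own choices for illustration.

```python
import pandas as pd

# Toy exploded frame: one row per word, keyed by the document id "level_0".
toy_explode = pd.DataFrame({
    "level_0": [0, 0, 0, 1, 1],
    "text": ["You", "must", "write", "GREAT", "cereal,"],
})
toy_explode["word_len"] = toy_explode["text"].str.len()

def percentile(p):
    """Return an aggregation function computing the p-th percentile."""
    def agg(lengths):
        return lengths.quantile(p / 100)
    # Distinct names avoid clashes between the two percentile aggregations.
    agg.__name__ = f"p{p}"
    return agg

# One row of summaries per document.
summaries = toy_explode.groupby("level_0")["word_len"].agg(
    number_words="count",
    average_wl="mean",
    min_wl="min",
    max_wl="max",
    p10_wl=percentile(10),
    p90_wl=percentile(90),
)
print(summaries)
```

`Series.quantile` interpolates linearly by default, so on the same word lengths it should agree with `np.percentile` called with its default settings.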


In [3]:
def numwords(doc):
    return len(doc.split())

def avwordlength(doc):
    words = doc.split()
    return sum(len(word) for word in words) /len(words)

def maxwordlength(doc):
    words = doc.split()
    return max(len(word) for word in words)

def minwordlength(doc):
    words = doc.split()
    return min(len(word) for word in words)

In [4]:
import numpy as np

def docwlpercentile(p):
    def pthpercentile(doc):
        words = doc.split()
        wl = [len(word) for word in words]
        return np.percentile(wl,p)
    return pthpercentile

In [5]:
df["number_words"], df["av_wl"]=[df["text"].apply(numwords), df["text"].apply(avwordlength)]

In [6]:
df["number_words"], df["average_wl"], df["min_wl"], df["max_wl"], df["10p_wl"], df["90p_wl"] = [
    df["text"].apply(numwords),
    df["text"].apply(avwordlength),
    df["text"].apply(minwordlength),
    df["text"].apply(maxwordlength),
    df["text"].apply(docwlpercentile(10)),
    df["text"].apply(docwlpercentile(90)),
]

In [7]:
df

Unnamed: 0,label,text,number_words,av_wl,average_wl,min_wl,max_wl,10p_wl,90p_wl
0,legitimate,You must write to me. Catherine sighed. And th...,124,4.879032,4.879032,1,16,2.0,9.0
1,legitimate,Who would have thought Mr. Crawford sure of he...,88,5.022727,5.022727,1,17,2.0,9.0
2,legitimate,He had only himself to please in his choice: h...,139,4.496403,4.496403,1,13,2.0,8.0
3,legitimate,Oh! One accompaniment to her song took her agr...,94,4.680851,4.680851,1,15,2.0,9.0
4,legitimate,"As soon as breakfast was over, she went to her...",80,4.525000,4.525000,2,9,2.0,8.0
5,legitimate,Mrs Clay's selfishness was not so great as to ...,103,4.669903,4.669903,1,12,2.0,8.0
6,legitimate,"But self, though it would intrude, could not e...",48,4.500000,4.500000,1,12,2.0,7.3
7,legitimate,"Elizabeth, though she did not wish to slight. ...",101,4.386139,4.386139,1,13,2.0,7.0
8,legitimate,Edmund had descended from that moral elevation...,89,4.404494,4.404494,1,12,2.0,8.0
9,legitimate,I read up on the morrow the Crawfords were eng...,108,4.814815,4.814815,1,17,2.0,9.0
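
The fractional value 7.3 in the "90p_wl" column for document 6 is consistent with the default behaviour of `np.percentile`, which interpolates linearly between the two neighbouring sorted values. A small sketch with hypothetical word lengths (not taken from our data) shows the calculation:

```python
import numpy as np

# Hypothetical word lengths for a ten-word document, already sorted.
word_lengths = [1, 2, 2, 3, 3, 4, 5, 6, 7, 9]

# Default behaviour: linear interpolation between neighbouring values.
p90 = np.percentile(word_lengths, 90)

# The same value by hand: the 90th percentile sits at fractional
# position 0.9 * (n - 1) in the sorted list.
position = 0.9 * (len(word_lengths) - 1)
lower = int(np.floor(position))
fraction = position - lower
manual = word_lengths[lower] + fraction * (word_lengths[lower + 1] - word_lengths[lower])

print(p90, manual)
```

Here the position is 8.1, so the result lies one tenth of the way from 7 to 9.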


As well as the simple summaries relating to word length, we can compute some more involved summaries related to language: the number of words that are not entirely lower case, and the number of stop words. 

To count the stop words, we use spaCy. 

In [8]:
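# str.islower() returns True only if the string has at least one cased character
# and all of its cased characters are lowercase.
# So `not x.islower()` captures words with a capital letter anywhere in them (not just
# at the start), and also tokens with no letters at all, such as "$15".
# Note: str.isupper() only returns True if all cased characters are upper case.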
def uppercase(doc):
    words = doc.split()
    return len([x for x in words if not x.islower()])
df["upper"] = df["text"].apply(uppercase)
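
To make the differences between these string tests concrete, here is a short sketch on a hand-picked list of words (chosen for illustration). It compares the `not x.islower()` test with two stricter alternatives.

```python
words = ["Catherine", "sighed.", "I", "PRODUCT", "$15", "1959", "It's"]

# Not entirely lower case: also catches tokens with no letters at all.
not_lower = [w for w in words if not w.islower()]

# Contains at least one upper case character.
has_upper = [w for w in words if any(c.isupper() for c in w)]

# Every cased character is upper case (and there is at least one).
all_upper = [w for w in words if w.isupper()]

print(not_lower)  # ['Catherine', 'I', 'PRODUCT', '$15', '1959', "It's"]
print(has_upper)  # ['Catherine', 'I', 'PRODUCT', "It's"]
print(all_upper)  # ['I', 'PRODUCT']
```

Which test is most useful as a feature depends on whether prices and numbers should count towards the total.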

In [9]:
import spacy
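# "en" is a shortcut link from spaCy 2.x; spaCy 3 removed shortcut links, so there
# the full package name (e.g. "en_core_web_sm") is needed instead.
english = spacy.load("en")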

In [10]:
# computing the number of stop words is not quick. 

def num_stops(doc):
    tokens = english(doc)
    return len([token for token in tokens if token.is_stop])
    
df["num_stops"] = df["text"].apply(num_stops)
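
Since counting stop words is slow, one possible speed-up is sketched below. It rests on two assumptions: that a blank English pipeline (tokenizer only, no trained components) is sufficient because `is_stop` is a lexical attribute, and that batching the texts through `nlp.pipe` is acceptable. The function name `count_stops` is hypothetical, and the counts should, but are not guaranteed to, match those from a fully loaded model.

```python
import spacy

# Assumption: a tokenizer-only pipeline is enough to read token.is_stop.
nlp = spacy.blank("en")

def count_stops(texts):
    """Return the number of stop-word tokens in each text."""
    # nlp.pipe processes the texts as a stream rather than one call per text.
    return [sum(token.is_stop for token in doc) for doc in nlp.pipe(texts)]

counts = count_stops(["the cat is on the mat", "delicious cereal"])
print(counts)
```

On a data frame this could be applied as `df["num_stops"] = count_stops(df["text"])`.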

In [11]:
df

Unnamed: 0,label,text,number_words,av_wl,average_wl,min_wl,max_wl,10p_wl,90p_wl,upper,num_stops
0,legitimate,You must write to me. Catherine sighed. And th...,124,4.879032,4.879032,1,16,2.0,9.0,22,66
1,legitimate,Who would have thought Mr. Crawford sure of he...,88,5.022727,5.022727,1,17,2.0,9.0,11,47
2,legitimate,He had only himself to please in his choice: h...,139,4.496403,4.496403,1,13,2.0,8.0,24,75
3,legitimate,Oh! One accompaniment to her song took her agr...,94,4.680851,4.680851,1,15,2.0,9.0,13,51
4,legitimate,"As soon as breakfast was over, she went to her...",80,4.525000,4.525000,2,9,2.0,8.0,11,44
5,legitimate,Mrs Clay's selfishness was not so great as to ...,103,4.669903,4.669903,1,12,2.0,8.0,13,55
6,legitimate,"But self, though it would intrude, could not e...",48,4.500000,4.500000,1,12,2.0,7.3,5,27
7,legitimate,"Elizabeth, though she did not wish to slight. ...",101,4.386139,4.386139,1,13,2.0,7.0,15,53
8,legitimate,Edmund had descended from that moral elevation...,89,4.404494,4.404494,1,12,2.0,8.0,11,49
9,legitimate,I read up on the morrow the Crawfords were eng...,108,4.814815,4.814815,1,17,2.0,9.0,18,56
