In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. 

In this notebook, feature vectors are composed of simple summaries of the documents. 

In [1]:
import pandas as pd

df = pd.read_parquet("data/training.parquet")

In [2]:
df.sample(10)

Unnamed: 0,index,label,text
39171,19171,spam,They are a perfect size to give as a gift. My ...
6144,6144,legitimate,"And could we carry our selfish point with you,..."
35351,15351,spam,My dogs go crazy when they see me reach for th...
21216,1216,spam,A great start to so many pet food scares. It c...
34295,14295,spam,I first found out he had milk protein allergie...
8051,8051,legitimate,"He looked surprised, displeased, alarmed; but ..."
1466,1466,legitimate,It works just fine.This Rainforest blend has q...
23658,3658,spam,YUM!!!! Even sharing them with friends and fam...
10563,10563,legitimate,"You must stay to be acquainted with, and yet d..."
12470,12470,legitimate,Fletcher and I mean real hard. Don't imagine t...


In [3]:
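# Note: the unnamed row index and the "index" column coincide for the legitimate
# documents but not for the spam; for spam, index = row index mod 20,000.
# Because an "index" column already exists, reset_index stores the old row index
# in a new column named "level_0".
df.reset_index(inplace=True)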

In [4]:
df.sample(10)

Unnamed: 0,level_0,index,label,text
11827,11827,11827,legitimate,"Obviously, the artificially sweetened one, so ..."
14779,14779,14779,legitimate,I have been drinking this for about two years....
23834,23834,3834,spam,Angel has been on antibiotics. They're not too...
7951,7951,7951,legitimate,I had not been accomplished by her ingenious d...
14654,14654,14654,legitimate,"However that might be, she was unmanageable. I..."
32426,32426,12426,spam,I bought the maker while in NZ and only four W...
32501,32501,12501,spam,They also are a great source of potassium and ...
23971,23971,3971,spam,I would like to order more in the box for $5. ...
31746,31746,11746,spam,A great occasional treat -- yum! I was terribl...
6825,6825,6825,legitimate,It took a little while in doubt. She wanted to...
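
The column name "level_0" comes from how `reset_index` resolves a name clash: the old row index would normally become a column called "index", but that name is already taken. A minimal sketch on a hypothetical four-row frame (the name `toy` and its contents are illustrative, not the notebook's data):

```python
import pandas as pd

# Hypothetical stand-in for the training data: an unnamed row index
# plus an existing column that is itself called "index".
toy = pd.DataFrame(
    {
        "index": [0, 1, 0, 1],
        "label": ["legitimate", "legitimate", "spam", "spam"],
    },
    index=[0, 1, 2, 3],
)

# "index" is already taken, so the old row index lands in "level_0".
toy_reset = toy.reset_index()
print(list(toy_reset.columns))
```

After the reset, "level_0" is unique per document, while "index" repeats between the legitimate and spam halves.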


We can now "explode" the text into words, so that each word occupies a row of the data frame, and retains the associated "level_0", "index" and "label". 

In [5]:
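# Collect one [level_0, index, label, word] record per whitespace-separated word.
# apply is used here only for its side effect of appending to rows.
rows = []
_ = df.apply(lambda row: [rows.append([row['level_0'], row['index'], row['label'], word]) 
                         for word in row.text.split()], axis=1)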

In [6]:
rows[1:10]

[[0, 0, 'legitimate', 'must'],
 [0, 0, 'legitimate', 'write'],
 [0, 0, 'legitimate', 'to'],
 [0, 0, 'legitimate', 'me.'],
 [0, 0, 'legitimate', 'Catherine'],
 [0, 0, 'legitimate', 'sighed.'],
 [0, 0, 'legitimate', 'And'],
 [0, 0, 'legitimate', 'there'],
 [0, 0, 'legitimate', 'are']]

In [7]:
df_explode = pd.DataFrame(rows, columns=df.columns)
df_explode

Unnamed: 0,level_0,index,label,text
0,0,0,legitimate,You
1,0,0,legitimate,must
2,0,0,legitimate,write
3,0,0,legitimate,to
4,0,0,legitimate,me.
5,0,0,legitimate,Catherine
6,0,0,legitimate,sighed.
7,0,0,legitimate,And
8,0,0,legitimate,there
9,0,0,legitimate,are
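
The same one-word-per-row layout can probably be reached without building a Python list by hand, using `Series.str.split` followed by `DataFrame.explode` (available from pandas 0.25). This is a sketch on a hypothetical two-document frame, not the code the notebook ran. One difference to be aware of: a document with no words yields a single row with a missing value under `explode`, whereas the loop above yields no rows for it.

```python
import pandas as pd

# Hypothetical two-document frame with the same columns as df.
toy = pd.DataFrame(
    {
        "level_0": [0, 1],
        "index": [0, 1],
        "label": ["legitimate", "spam"],
        "text": ["You must write to me.", "YUM!!!! Great treat"],
    }
)

toy_explode = (
    toy.assign(text=toy["text"].str.split())  # each text becomes a list of words
    .explode("text")                          # one row per list element
    .reset_index(drop=True)                   # fresh 0..n-1 row labels
)
print(toy_explode)
```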


Column "level_0" contains the index we want to aggregate any calculations over. 

The summaries we are going to compute for each document (indexed by "level_0") are: 
    number of words in each document
    average word length
    maximum word length
    minimum word length
    10th percentile word length
    90th percentile word length
    number of upper case words
    
    
Many of these require the word length to be computed. To save us from recomputing it every time, we begin by adding a column containing this information to our 'exploded' data frame. 


In [8]:
df_explode["word_len"] = df_explode["text"].apply(len) 

In [9]:
df_explode.sample(10) ## looks fine, though punctuation is being counted as a character in this word length calculation. 

Unnamed: 0,level_0,index,label,text,word_len
636853,7114,7114,legitimate,cannot,6
2726160,31627,11627,spam,coffee,6
611562,6853,6853,legitimate,in,2
14341,154,154,legitimate,been,4
2712269,31453,11453,spam,are,3
3085658,36140,16140,spam,at,2
738124,8233,8233,legitimate,to,2
2087251,23612,3612,spam,has,3
2665888,30866,10866,spam,is,2
778828,8665,8665,legitimate,of,2
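
Punctuation attached to a word counts towards its length here. If that is not wanted, one option is to strip leading and trailing punctuation before measuring. The sketch below assumes that the ASCII characters in `string.punctuation` are a good enough definition of punctuation; the word list is illustrative, and the features in this notebook are computed on the raw lengths.

```python
import string

import pandas as pd

# Illustrative tokens of the kind produced by splitting on whitespace.
words = pd.Series(["me.", "sighed.", "YUM!!!!", "coffee", "--"])

raw_len = words.str.len()
# str.strip removes any of the given characters from both ends only.
clean_len = words.str.strip(string.punctuation).str.len()

print(pd.DataFrame({"word": words, "raw_len": raw_len, "clean_len": clean_len}))
```

A token made up entirely of punctuation, such as "--", ends up with length 0, so the minimum word length would need to be read with that in mind.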


We will record the summaries for each document in a new data frame called df_summaries. We start by computing the number of words in each document. 

In [10]:
counts = df_explode['level_0'].value_counts()

df_summaries = pd.DataFrame({'counts' :counts})

In [11]:
df_summaries.sample(10)

Unnamed: 0,counts
22116,95
3444,119
594,64
21319,23
12374,122
39163,92
11693,82
23205,76
15381,90
3405,70


In [12]:
df_summaries["av_wl"] = df_explode.groupby('level_0')['word_len'].mean() #average word length


In [13]:
df_summaries.sample(10)

Unnamed: 0,counts,av_wl
33945,68,4.544118
15876,36,4.75
18863,79,4.658228
9779,34,4.705882
15352,102,3.911765
24756,46,4.434783
4780,153,4.633987
27125,97,4.525773
15687,75,4.56
22788,64,4.28125
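
Adding columns this way relies on pandas aligning on index labels rather than on row position: `value_counts` orders its result by frequency, while `groupby` orders by key, yet each value still lands in the row for its own document. A small sketch with hypothetical data (the names `toy` and `toy_summaries` are illustrative):

```python
import pandas as pd

# Hypothetical exploded frame: document 2 has three words, document 0 two, document 1 one.
toy = pd.DataFrame(
    {
        "level_0": [2, 0, 2, 1, 2, 0],
        "word_len": [3, 4, 5, 6, 7, 8],
    }
)

counts = toy["level_0"].value_counts()  # ordered by frequency: 2, 0, 1
toy_summaries = pd.DataFrame({"counts": counts})

# groupby orders by key (0, 1, 2); assignment matches rows by label, not position.
toy_summaries["av_wl"] = toy.groupby("level_0")["word_len"].mean()
print(toy_summaries)
```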


In [14]:
df_summaries["max_wl"] = df_explode.groupby('level_0')['word_len'].max() #max word length
df_summaries["min_wl"] = df_explode.groupby('level_0')['word_len'].min() #min word length

In [16]:
df_summaries["10_quantile"] = df_explode.groupby('level_0')['word_len'].quantile(0.1) #10th percentile word length

In [19]:
df_summaries["90 quantile"]= df_explode.groupby('level_0')['word_len'].quantile(0.9) #90th percentile word length

In [20]:
df_summaries.sample(100)

Unnamed: 0,counts,av_wl,max_wl,min_wl,10_quantile,90 quantile
35506,119,4.428571,11,1,1.8,8.0
28298,73,4.452055,13,1,2.0,7.8
21015,76,4.421053,16,1,2.0,8.0
10602,95,4.315789,13,1,2.0,8.0
3144,107,4.626168,12,1,2.0,8.0
14324,44,4.295455,12,1,2.0,6.7
27375,77,4.805195,13,1,2.0,8.0
19667,104,4.259615,14,1,2.0,7.0
11207,93,4.602151,15,1,2.0,8.0
18428,77,3.974026,11,1,2.0,7.0
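
The word-length summaries could also be computed in a single pass using named aggregation on the grouped column. This is a sketch under assumptions rather than the notebook's own code: the data is a hypothetical two-document frame, and the column names `q10` and `q90` and the helpers of the same name are illustrative stand-ins for the two quantile columns.

```python
import pandas as pd

# Hypothetical exploded frame with two documents.
toy_explode = pd.DataFrame(
    {
        "level_0": [0, 0, 0, 1, 1],
        "text": ["You", "must", "write", "YUM!!!!", "treat"],
    }
)
toy_explode["word_len"] = toy_explode["text"].str.len()


def q10(lengths):
    """10th percentile of a group's word lengths."""
    return lengths.quantile(0.1)


def q90(lengths):
    """90th percentile of a group's word lengths."""
    return lengths.quantile(0.9)


# One groupby, one output column per keyword argument.
toy_summaries = toy_explode.groupby("level_0")["word_len"].agg(
    counts="count",
    av_wl="mean",
    max_wl="max",
    min_wl="min",
    q10=q10,
    q90=q90,
)
print(toy_summaries)
```

Grouping once avoids repeating the same `groupby` for every column, at the cost of the quantile helpers being a little more verbose.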


As well as the simple summaries relating to word length, we can compute some more involved summaries related to language. 

The number of upper case words needs only Python string methods. To count stop words, we use spaCy. 

In [None]:
# str.islower() returns False for tokens with no letters (e.g. "--" or "$5."), and
# str.isupper() is True only if every cased character is upper case, so neither
# test isolates words that contain a capital letter.
# Instead, count the words with a capital letter anywhere in them, not just at the start.
def uppercase(doc):
    words = doc.split()
    return len([x for x in words if any(ch.isupper() for ch in x)])
df["upper"] = df["text"].apply(uppercase)
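
The choice of string test matters for tokens that contain no letters at all. The sketch below compares three candidate tests on a handful of illustrative tokens (the list is hypothetical, chosen to resemble words seen in the data):

```python
# Illustrative tokens: capitalised, all caps with punctuation, lower case,
# and two tokens with no letters at all.
words = ["Catherine", "YUM!!!!", "coffee", "$5.", "--", "NZ"]

# Not entirely lower case: also catches tokens with no cased characters.
not_lower = [w for w in words if not w.islower()]
# Every cased character is upper case, and there is at least one.
all_upper = [w for w in words if w.isupper()]
# At least one capital letter anywhere in the token.
any_upper = [w for w in words if any(ch.isupper() for ch in w)]

print(not_lower)  # ['Catherine', 'YUM!!!!', '$5.', '--', 'NZ']
print(all_upper)  # ['YUM!!!!', 'NZ']
print(any_upper)  # ['Catherine', 'YUM!!!!', 'NZ']
```

Which of these best serves as a spam feature is an empirical question; the point here is only that the three counts can differ for the same document.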

In [None]:
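import spacy
# The "en" shortcut was removed in spaCy v3; load the small English model by its
# full name (install it with: python -m spacy download en_core_web_sm).
english = spacy.load("en_core_web_sm")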

In [None]:
# computing the number of stop words is not quick. 

def num_stops(doc):
    tokens = english(doc)
    return len([token for token in tokens if token.is_stop])
    
df["num_stops"] = df["text"].apply(num_stops)

In [None]:
df