In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. 

In this notebook, feature vectors are composed of simple summaries of the documents. 

In [1]:
import pandas as pd

df = pd.read_parquet("data/training.parquet")

In [2]:
df.sample(10)

Unnamed: 0,index,label,text
29606,9606,spam,If you're in the market for a super green supp...
12184,12184,legitimate,1 Siami and 1 Main Coon. The hope of meeting a...
29482,9482,spam,"Good price for 6, not so great and then take t..."
30417,10417,spam,The reason for the better half of a scoop. It ...
22255,2255,spam,"This tea is great, for the onset of colds/flu...."
20773,773,spam,I purchased these and slim jims at the same ti...
23012,3012,spam,"Eggs, Chicken, Steak, asian food. Also good in..."
38935,18935,spam,The shipping is fast and convenient. And as al...
4161,4161,legitimate,Fanny was startled at the proposal. She had ne...
32126,12126,spam,i do like there coconut manna. It was selling ...


In [3]:
# Note: level_0 and index coincide for the legitimate documents, but not for the spam:
# for spam, index = level_0 mod 20,000.
df.reset_index(inplace=True) 

In [4]:
df.sample(10)

Unnamed: 0,level_0,index,label,text
16476,16476,16476,legitimate,I now go on to say that hasn't been written be...
29548,29548,9548,spam,"Needless to say, it got eaten very quickly. I ..."
22957,22957,2957,spam,These little chocolates were the best. Back ab...
7017,7017,7017,legitimate,It is a good water to drink for the rest of th...
1760,1760,1760,legitimate,"But on Wednesday, I think, Henry, you may expe..."
27330,27330,7330,spam,I love to take this early in the morning with ...
24713,24713,4713,spam,"Buyer beware, you may want to put this down on..."
34488,34488,14488,spam,This multigrain chip was a hit. I have been ve...
4738,4738,4738,legitimate,AND GOOD RICE. The laurels at Maple Grove are ...
35114,35114,15114,spam,Great stuff! Well they discontinued carring th...
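
The relationship between "level_0" and "index" noted in the comment above can be checked on a small stand-in frame. This is only a sketch: the toy data, the name `toy` and the class size `n_per_class` are made up for illustration, and it assumes (as the sampled output suggests) that each class is numbered from 0 and that the legitimate rows come first.

```python
import pandas as pd

# Hypothetical toy data: 3 "legitimate" rows followed by 3 "spam" rows,
# each class numbered from 0 in its own "index" column.
n_per_class = 3
toy = pd.DataFrame({
    "index": list(range(n_per_class)) * 2,
    "label": ["legitimate"] * n_per_class + ["spam"] * n_per_class,
    "text": ["a"] * (2 * n_per_class),
})

# A column called "index" already exists, so reset_index names the new column "level_0".
toy = toy.reset_index()

# The within-class index equals the row position modulo the class size.
matches = (toy["index"] == toy["level_0"] % n_per_class).all()
print(toy.columns.tolist(), matches)
```

In the real data the class size would be 20,000 rather than 3.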


We can now "explode" the text into words, so that each word occupies a row of the data frame, and retains the associated "level_0", "index" and "label". 

In [5]:
rows = []
_ = df.apply(lambda row: [rows.append([row['level_0'], row['index'], row['label'], word]) 
                         for word in row.text.split()], axis=1)

In [6]:
rows[1:10]

[[0, 0, 'legitimate', 'must'],
 [0, 0, 'legitimate', 'write'],
 [0, 0, 'legitimate', 'to'],
 [0, 0, 'legitimate', 'me.'],
 [0, 0, 'legitimate', 'Catherine'],
 [0, 0, 'legitimate', 'sighed.'],
 [0, 0, 'legitimate', 'And'],
 [0, 0, 'legitimate', 'there'],
 [0, 0, 'legitimate', 'are']]

In [7]:
df_explode = pd.DataFrame(rows, columns=df.columns)
df_explode

Unnamed: 0,level_0,index,label,text
0,0,0,legitimate,You
1,0,0,legitimate,must
2,0,0,legitimate,write
3,0,0,legitimate,to
4,0,0,legitimate,me.
5,0,0,legitimate,Catherine
6,0,0,legitimate,sighed.
7,0,0,legitimate,And
8,0,0,legitimate,there
9,0,0,legitimate,are
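
The same "one word per row" layout can probably be reached without building the `rows` list by hand. The sketch below uses `Series.str.split` and `DataFrame.explode` on a made-up two-document frame (`toy` and `toy_explode` are illustrative names, not objects from this notebook). One difference to keep in mind: a document with no words would yield a single row with a missing value here, whereas the loop above yields no rows for it.

```python
import pandas as pd

# Hypothetical two-document frame with the same columns as df.
toy = pd.DataFrame({
    "level_0": [0, 1],
    "index": [0, 1],
    "label": ["legitimate", "spam"],
    "text": ["You must write to me.", "Great stuff!"],
})

# Split each document on whitespace into a list of words,
# then give every word its own row, repeating the other columns.
toy_explode = (
    toy.assign(text=toy["text"].str.split())
       .explode("text")
       .reset_index(drop=True)
)
print(toy_explode)
```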


Column "level_0" contains the index we want to aggregate any calculations over. 

The summaries we are going to compute for each document (indexed by "level_0") are: 
    number of words in each document
    average word length
    maximum word length
    minimum word length
    10th percentile word length
    90th percentile word length
    number of upper case words

Many of these require the word length to be computed. To avoid recomputing it each time, we will begin by adding a column containing this information to our 'exploded' data frame. 


In [8]:
df_explode["word_len"] = df_explode["text"].apply(len) 

In [9]:
df_explode.sample(10) ## looks fine, though punctuation is being counted as a character in this word length calculation. 

Unnamed: 0,level_0,index,label,text,word_len
1857083,20723,723,spam,a,1
1337865,14874,14874,legitimate,with,4
468645,5259,5259,legitimate,of,2
1992126,22427,2427,spam,to,2
3362717,39633,19633,spam,This,4
720072,8035,8035,legitimate,Lords,5
136301,1541,1541,legitimate,and,3
1726354,19169,19169,legitimate,do,2
1417034,15762,15762,legitimate,I,1
2563277,29561,9561,spam,does,4
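
As the comment above notes, punctuation is counted in the word length. If that turned out to matter, one possible variant is to strip leading and trailing punctuation before measuring. The sketch below is an assumption for illustration rather than what this notebook does; it uses a few tokens taken from the samples above and only handles ASCII punctuation.

```python
import string
import pandas as pd

words = pd.Series(["me.", "sighed.", "colds/flu....", "I", "6,"])

# Length as computed in the notebook: every character counts, punctuation included.
raw_len = words.str.len()

# Variant: strip leading and trailing ASCII punctuation first.
# Punctuation inside a token (the "/" in "colds/flu") is still counted.
stripped_len = words.str.strip(string.punctuation).str.len()

print(pd.DataFrame({"word": words, "raw_len": raw_len, "stripped_len": stripped_len}))
```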


We will record the summaries for each document in a new data frame called df_summaries. We start by computing the number of words in each document. 

In [10]:
counts = df_explode['level_0'].value_counts()

df_summaries = pd.DataFrame({'counts' :counts})

In [11]:
df_summaries.sample(10)

Unnamed: 0,counts
26316,78
12775,39
21652,50
1295,47
20707,75
5924,75
31114,57
5179,88
12426,134
15739,190


In [12]:
df_summaries["av_wl"] = df_explode.groupby('level_0')['word_len'].mean() #average word length


In [13]:
df_summaries.sample(10)

Unnamed: 0,counts,av_wl
31356,41,4.512195
21293,118,4.40678
32752,127,4.551181
1762,89,4.88764
27319,104,4.221154
30250,57,4.403509
18452,167,4.538922
38171,92,4.358696
26662,132,4.227273
39226,107,4.682243


In [14]:
df_summaries["max_wl"] = df_explode.groupby('level_0')['word_len'].max() #max word length
df_summaries["min_wl"] = df_explode.groupby('level_0')['word_len'].min() #min word length

In [15]:
df_summaries["10_quantile"] = df_explode.groupby('level_0')['word_len'].quantile(0.1) #10th percentile word length

In [16]:
df_summaries["90 quantile"]= df_explode.groupby('level_0')['word_len'].quantile(0.9)

In [17]:
df_summaries.sample(100)

Unnamed: 0,counts,av_wl,max_wl,min_wl,10_quantile,90 quantile
17625,148,4.283784,11,1,2.0,8.0
28773,77,4.168831,15,1,2.0,7.4
7132,145,4.544828,13,1,2.0,8.0
27198,133,4.774436,15,1,2.0,8.0
28133,53,4.226415,10,1,2.0,8.0
3514,68,4.794118,13,1,2.0,8.0
4904,109,4.807339,15,1,2.0,9.0
7831,68,4.382353,10,1,2.0,8.0
19761,149,4.255034,18,1,2.0,7.2
32030,139,4.539568,15,1,2.0,7.0
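
The word-length summaries above are built one column at a time, each with its own `groupby`. A possible alternative is a single `groupby(...).agg(...)` call with named aggregations, sketched here on a tiny made-up frame. The helper functions `q10` and `q90` and the column names `q10`/`q90` are illustrative stand-ins for "10_quantile" and "90 quantile".

```python
import pandas as pd

# Hypothetical exploded frame: document 0 has three words, document 1 has two.
toy_explode = pd.DataFrame({
    "level_0": [0, 0, 0, 1, 1],
    "word_len": [3, 4, 5, 1, 9],
})

def q10(s):
    # 10th percentile of the word lengths in one document
    return s.quantile(0.1)

def q90(s):
    # 90th percentile of the word lengths in one document
    return s.quantile(0.9)

# One pass over the groups; each keyword becomes a column of the result.
toy_summaries = toy_explode.groupby("level_0")["word_len"].agg(
    counts="count",
    av_wl="mean",
    max_wl="max",
    min_wl="min",
    q10=q10,
    q90=q90,
)
print(toy_summaries)
```

The result is indexed by "level_0", as df_summaries is.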


In [18]:
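# str.islower() returns True only if the string contains at least one cased character
# and all of its cased characters are lower case.
# So caps() flags words with a capital letter anywhere in them, not just at the start;
# it also flags tokens with no letters at all, such as "6," or "....".
# Note: str.isupper() only returns True if all cased characters are upper case.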
def caps(word):
    return not word.islower()
df_explode["upper_case"]=df_explode['text'].apply(caps)

df_summaries["upper_case"] = df_explode.groupby('level_0')['upper_case'].sum() 

In [19]:
df_summaries.sample(10)

Unnamed: 0,counts,av_wl,max_wl,min_wl,10_quantile,90 quantile,upper_case
24483,105,4.514286,13,1,2.0,7.6,13.0
19633,87,4.931034,12,1,2.0,10.0,12.0
26766,94,3.989362,13,1,2.0,6.0,11.0
21511,59,4.355932,10,1,2.0,7.0,9.0
39001,144,4.416667,13,1,2.0,7.0,20.0
33108,111,4.603604,12,1,2.0,8.0,19.0
3862,99,4.535354,15,1,2.0,9.0,16.0
28528,157,4.286624,13,1,2.0,8.0,27.0
21997,134,3.925373,10,1,2.0,6.0,22.0
36354,208,4.298077,12,1,2.0,7.0,27.0
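
To make the behaviour of `caps` concrete, the sketch below compares it with a stricter check on a handful of tokens. `has_upper` is a hypothetical alternative, not something used in this notebook; it only counts tokens that contain at least one upper case character.

```python
tokens = ["Catherine", "iPhone", "AND", "me.", "6,", "...."]

def caps(word):
    # Definition used in the notebook: anything that is not entirely lower case.
    return not word.islower()

def has_upper(word):
    # Hypothetical alternative: at least one upper case character anywhere in the token.
    return any(ch.isupper() for ch in word)

for tok in tokens:
    print(tok, caps(tok), has_upper(tok))
```

The two definitions agree on ordinary words and differ only on tokens without letters, such as "6," and "....", which `caps` counts and `has_upper` does not.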


As well as the simple summaries relating to word length, we can compute some more involved summaries related to language. 

In [20]:
# ENGLISH_STOP_WORDS is a frozenset of lower-case English stop words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [21]:
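def isstopword(word):
    # Exact match against the lower-case stop word list, so capitalised or
    # punctuated tokens such as "The" or "me." are not counted.
    return word in ENGLISH_STOP_WORDS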

df_explode["stop_words"]=df_explode['text'].apply(isstopword)

In [22]:
df_summaries["stop_words"] = df_explode.groupby('level_0')['stop_words'].sum() 
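
Because the lookup above is an exact match, tokens like "The" or "me." are not counted as stop words. If that is not the intended behaviour, one option is to normalise each token before the lookup. The sketch below shows the idea with a small hand-written set, `toy_stop_words`, standing in for `ENGLISH_STOP_WORDS` so that it runs without scikit-learn. Both the set and the helper `isstopword_normalised` are illustrative assumptions.

```python
import string

# Hypothetical stand-in for ENGLISH_STOP_WORDS (the real set is much larger).
toy_stop_words = frozenset({"the", "me", "to", "and", "of"})

def isstopword_exact(word):
    # Same logic as isstopword above, but against the toy set.
    return word in toy_stop_words

def isstopword_normalised(word):
    # Lower-case the token and strip surrounding ASCII punctuation before the lookup.
    return word.lower().strip(string.punctuation) in toy_stop_words

for tok in ["the", "The", "me.", "Catherine"]:
    print(tok, isstopword_exact(tok), isstopword_normalised(tok))
```

Switching to the normalised version would raise the stop word counts, so any model trained on the original feature would need to be retrained.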