In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. 

In this notebook, feature vectors are composed of simple summaries of the documents. 

In [1]:
import pandas as pd

df = pd.read_parquet("data/training.parquet")

In [2]:
df.sample(10)

Unnamed: 0,index,label,text
15182,15182,legitimate,How strange that she should be the last. Did n...
2623,2623,legitimate,"It was as much, however, as was desired, and m..."
32804,12804,spam,He loved the candy and threw the entire salad ...
25832,5832,spam,This smells delicious when you can be enjoying...
17770,17770,legitimate,Mrs. Ferrars came to inspect the happiness whi...
35537,15537,spam,My mom says almonds help you sleep and for the...
18535,18535,legitimate,They are tubes rather than flat envelopes. In ...
21662,1662,spam,Once the noodles get soft enough to easily bre...
20083,83,spam,This chocolate flavor goat milk formula tastes...
39549,19549,spam,It is everything that Commercial Cacao isn't. ...


In [3]:
#note: after reset_index, level_0 and index coincide for the legitimate documents, but not for the spam - 
    #for spam, index = level_0 mod 20,000
df.reset_index(inplace=True) 

In [4]:
df.sample(10)

Unnamed: 0,level_0,index,label,text
29955,29955,9955,spam,Nutiva Coconut Extra-Virgin Oil works just fin...
32483,32483,12483,spam,It's a win-win all around! They are very crunc...
31726,31726,11726,spam,The resealable tab is well designed. Starbucks...
20193,20193,193,spam,Neither my children nor myself could drink it....
31474,31474,11474,spam,I was happy with it's superiority above the ot...
23990,23990,3990,spam,I have been buying Greenies for years - looks ...
23667,23667,3667,spam,I particularly like the blend of flavors that ...
7227,7227,7227,legitimate,"Kellogg's, please go back to supermarket honey..."
33976,33976,13976,spam,I also use the CLEAR scalp oil which may be th...
8373,8373,8373,legitimate,"What I advise is, that your father is dead. No..."
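
The relationship noted in the comment in cell In [3] can be spot-checked against a few of the sampled rows. This is a minimal sketch using values copied from the output above; the name `sampled` is introduced here only for illustration, and the figure of 20,000 is taken from the comment rather than from inspecting the full data.

```python
# (level_0, index, label) triples copied from the sampled output above
sampled = [
    (29955, 9955, "spam"),
    (20193, 193, "spam"),
    (33976, 13976, "spam"),
    (7227, 7227, "legitimate"),
    (8373, 8373, "legitimate"),
]

# For every sampled row, index equals level_0 modulo 20,000.
# Legitimate rows have level_0 below 20,000, so the two values coincide.
checks = [level_0 % 20_000 == index for level_0, index, _ in sampled]
print(all(checks))
```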


We can now "explode" the text into words, so that each word occupies a row of the data frame, and retains the associated "level_0", "index" and "label". 

In [5]:
rows = []
_ = df.apply(lambda row: [rows.append([row['level_0'], row['index'], row['label'], word]) 
                         for word in row.text.split()], axis=1)

In [6]:
rows[1:10]

[[0, 0, 'legitimate', 'must'],
 [0, 0, 'legitimate', 'write'],
 [0, 0, 'legitimate', 'to'],
 [0, 0, 'legitimate', 'me.'],
 [0, 0, 'legitimate', 'Catherine'],
 [0, 0, 'legitimate', 'sighed.'],
 [0, 0, 'legitimate', 'And'],
 [0, 0, 'legitimate', 'there'],
 [0, 0, 'legitimate', 'are']]

In [7]:
df_explode = pd.DataFrame(rows, columns=df.columns)
df_explode

Unnamed: 0,level_0,index,label,text
0,0,0,legitimate,You
1,0,0,legitimate,must
2,0,0,legitimate,write
3,0,0,legitimate,to
4,0,0,legitimate,me.
5,0,0,legitimate,Catherine
6,0,0,legitimate,sighed.
7,0,0,legitimate,And
8,0,0,legitimate,there
9,0,0,legitimate,are
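
The same reshaping can also be sketched with pandas' built-in `str.split` and `DataFrame.explode`, which avoids building the list of rows by hand. This is an alternative to the cells above, not the approach the notebook uses; the frame `toy` and its two documents are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df; the two rows of text are invented for illustration
toy = pd.DataFrame({
    "level_0": [0, 1],
    "index": [0, 1],
    "label": ["legitimate", "spam"],
    "text": ["You must write to me.", "Tastes great!"],
})

# Split each document on whitespace, then give every word its own row.
# The other columns are repeated for each word of the same document.
toy_explode = (
    toy.assign(text=toy["text"].str.split())
       .explode("text")
       .reset_index(drop=True)
)
print(toy_explode)
```

As in the cells above, punctuation stays attached to the word it follows, because the split is on whitespace only.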


Column "level_0" contains the index we want to aggregate any calculations over. 

The summaries we are going to compute for each document (indexed by "level_0") are: 
    number of words in each document
    average word length
    maximum word length
    minimum word length
    10th percentile word length
    90th percentile word length
    number of upper case words
    
    
Many of these require the word length to be computed. To avoid recomputing it every time, we will begin by adding a column containing this information to our 'exploded' data frame. 


In [8]:
df_explode["word_len"] = df_explode["text"].apply(len) 

In [9]:
df_explode.sample(10) ## looks fine, though punctuation is being counted as a character in this word length calculation. 

Unnamed: 0,level_0,index,label,text,word_len
2040043,23026,3026,spam,a,1
3164064,37143,17143,spam,and,3
1940370,21780,1780,spam,had,3
2273976,25954,5954,spam,is,2
2325718,26581,6581,spam,swirl,5
2030690,22905,2905,spam,had,3
2216539,25210,5210,spam,recommend!,10
1460344,16248,16248,legitimate,was,3
1143270,12698,12698,legitimate,What,4
1531675,17022,17022,legitimate,that,4


We will record the summaries for each document in a new data frame called df_summaries. We start by computing the number of words in each document. 

In [10]:
counts = df_explode['level_0'].value_counts()

df_summaries = pd.DataFrame({'counts' :counts})

In [11]:
df_summaries.sample(10)

Unnamed: 0,counts
10610,122
31563,74
26945,71
14981,93
28131,119
37429,59
15075,118
8251,68
24498,102
25649,63


In [12]:
df_summaries["av_wl"] = df_explode.groupby('level_0')['word_len'].mean() #average word length


In [13]:
df_summaries.sample(10)

Unnamed: 0,counts,av_wl
15209,31,4.290323
33588,59,4.101695
13368,73,4.643836
38882,44,4.454545
24150,61,3.95082
14828,69,4.028986
22016,55,4.581818
14897,100,4.52
34431,51,4.313725
21277,31,4.774194


In [14]:
df_summaries["max_wl"] = df_explode.groupby('level_0')['word_len'].max() #max word length
df_summaries["min_wl"] = df_explode.groupby('level_0')['word_len'].min() #min word length

In [15]:
df_summaries["10_quantile"] = df_explode.groupby('level_0')['word_len'].quantile(0.1) #10th percentile word length

In [16]:
df_summaries["90 quantile"]= df_explode.groupby('level_0')['word_len'].quantile(0.9) #90th percentile word length

In [18]:
df_summaries.sample(10)

Unnamed: 0,counts,av_wl,max_wl,min_wl,10_quantile,90 quantile
35716,59,4.627119,11,1,2.0,8.0
6658,60,4.5,11,1,2.0,7.0
7846,131,4.427481,12,1,2.0,8.0
1991,60,4.416667,12,1,2.0,9.0
39773,123,4.252033,10,1,2.0,7.0
6964,53,4.226415,10,1,3.0,7.0
39022,87,4.172414,15,1,1.0,8.0
4656,161,4.714286,13,1,2.0,9.0
2137,82,4.512195,13,1,2.0,8.0
13557,63,4.412698,10,1,2.0,8.0
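
The word-length summaries computed so far could also be obtained from a single `groupby` using named aggregation. The following is a sketch on an invented toy frame, not the notebook's own code; the helper functions `q10` and `q90` and the output column names are assumptions made for illustration.

```python
import pandas as pd

# Toy exploded frame; the words are invented for illustration
toy_explode = pd.DataFrame({
    "level_0": [0, 0, 0, 1, 1],
    "text": ["You", "must", "write", "Tastes", "great!"],
})
toy_explode["word_len"] = toy_explode["text"].str.len()

def q10(s):
    # 10th percentile of a group's word lengths
    return s.quantile(0.1)

def q90(s):
    # 90th percentile of a group's word lengths
    return s.quantile(0.9)

# One groupby, several named aggregations over word_len
toy_summaries = toy_explode.groupby("level_0")["word_len"].agg(
    counts="count",
    av_wl="mean",
    max_wl="max",
    min_wl="min",
    q10=q10,
    q90=q90,
)
print(toy_summaries)
```

Grouping once keeps the rows of every summary aligned on the same "level_0" index by construction.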


In [19]:
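#str.islower() returns True only if the word has at least one cased character and all cased characters are lowercase.
#so "not islower()" flags words with capital letters anywhere in them, not just at the start,
#and also flags tokens with no letters at all (e.g. "--" or "10").
#note: isupper only returns True if all cased characters are upper case. 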
def caps(word):
    return not word.islower()
df_explode["upper_case"]=df_explode['text'].apply(caps)

df_summaries["upper_case"] = df_explode.groupby('level_0')['upper_case'].sum() 

In [20]:
df_summaries.sample(10)

Unnamed: 0,counts,av_wl,max_wl,min_wl,10_quantile,90 quantile,upper_case
7328,41,3.829268,9,1,2.0,7.0,8.0
28244,119,4.042017,12,1,1.8,7.0,15.0
30871,52,4.461538,11,1,2.0,7.9,11.0
39397,90,4.688889,23,1,2.0,8.0,11.0
10076,115,4.234783,11,1,2.0,7.0,12.0
32386,89,4.078652,10,1,2.0,7.0,13.0
35601,109,4.697248,14,1,2.0,8.0,17.0
26739,71,4.056338,13,1,2.0,8.0,9.0
7064,156,4.358974,16,1,2.0,8.0,16.0
26467,94,4.234043,12,1,2.0,7.0,15.0
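
To see what the `caps` rule actually flags, it helps to try the string methods on a handful of tokens. This is a small sketch with tokens chosen for illustration; the stricter `with_capital` rule is a hypothetical alternative, not something the notebook uses.

```python
# A few tokens chosen to show how the string methods behave
tokens = ["that", "What", "iPhone", "CLEAR", "--", "10"]

for tok in tokens:
    # token, islower(), isupper(), and the value caps() would return
    print(tok, tok.islower(), tok.isupper(), not tok.islower())

# Tokens flagged by the rule used in caps(): anything that is not all lower case,
# including tokens with no letters at all
flagged = [tok for tok in tokens if not tok.islower()]

# Hypothetical stricter rule: require at least one upper-case character
with_capital = [tok for tok in tokens if any(c.isupper() for c in tok)]

print(flagged)
print(with_capital)
```

The two lists differ only on tokens such as "--" and "10", which contain no cased characters.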


As well as the simple summaries relating to word length, we can compute some more involved summaries related to language. 

In [21]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [22]:
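def isstopword(word):
    #exact match: ENGLISH_STOP_WORDS entries are lower case, so tokens like "And" or "me." are not counted
    return word in ENGLISH_STOP_WORDS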

df_explode["stop_words"]=df_explode['text'].apply(isstopword)

In [23]:
df_summaries["stop_words"] = df_explode.groupby('level_0')['stop_words'].sum() 
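
Because the membership test is an exact string match, capitalised tokens and tokens with attached punctuation are not counted as stop words. The sketch below illustrates the difference on the first few words of document 0; the small `stop_words` set is a hand-written stand-in for `ENGLISH_STOP_WORDS`, and `normalise` is a hypothetical helper, not part of the notebook.

```python
import string

# Small stand-in for ENGLISH_STOP_WORDS, for illustration only
stop_words = frozenset({"and", "are", "me", "must", "there", "to", "you"})

# The first ten words of document 0, as shown in the exploded frame
tokens = ["You", "must", "write", "to", "me.",
          "Catherine", "sighed.", "And", "there", "are"]

def normalise(word):
    # Lower-case the word and strip leading/trailing ASCII punctuation
    return word.lower().strip(string.punctuation)

# Exact matching, as in isstopword
exact = [t for t in tokens if t in stop_words]

# Matching after normalisation
normalised = [t for t in tokens if normalise(t) in stop_words]

print(len(exact), exact)
print(len(normalised), normalised)
```

Whether to normalise is a modelling choice: the exact-match count is what the notebook records, and the normalised count would produce a different feature.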

In [47]:
df_summaries.sample(11)

Unnamed: 0,counts,av_wl,max_wl,min_wl,10_quantile,90 quantile,upper_case,stop_words,punctuation
3634,97,4.525773,13,1,2.0,8.0,9.0,51.0,9
23498,42,4.714286,12,1,2.0,7.9,5.0,15.0,4
30056,53,4.056604,10,1,2.0,6.0,7.0,23.0,6
19585,96,4.270833,13,1,1.5,8.0,11.0,45.0,8
8279,69,4.463768,12,1,2.0,9.0,14.0,34.0,6
6881,148,4.594595,12,1,2.0,9.0,18.0,76.0,17
28874,59,4.644068,11,1,2.0,8.0,6.0,29.0,5
6202,81,4.62963,15,1,2.0,8.0,9.0,39.0,13
15024,154,4.467532,16,1,2.0,8.0,16.0,81.0,14
29786,68,4.867647,12,1,2.0,8.0,17.0,23.0,5


In [40]:
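import string
import re

def punct_count(doc):
    #count the characters of doc that appear in string.punctuation
    return sum(c in string.punctuation for c in doc)

def punct_count2(doc):
    #count the characters of doc matched by the character class below.
    #note: "\]-_" is read as the range "]" to "_", so "-" itself is not matched; "," is not in the class either
    return sum(bool(re.match(r"""[!.><:;'@#~{}\[\]-_+=£$%^&()?]""", c)) for c in doc)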



In [54]:
df_explode["punct"]=df_explode['text'].apply(punct_count2)

df_summaries["punctuation"]=df['text'].apply(punct_count2)
df_summaries["punct"]=df_explode.groupby('level_0')['punct'].sum() 


In [58]:
df_summaries.sample(10)

Unnamed: 0,counts,av_wl,max_wl,min_wl,10_quantile,90 quantile,upper_case,stop_words,punctuation,punct
30175,156,4.416667,12,1,2.0,8.0,20.0,73.0,11,11
11242,120,4.366667,11,1,2.0,8.0,12.0,63.0,10,10
29897,49,4.489796,12,1,2.0,8.2,14.0,17.0,9,9
14429,140,4.457143,12,1,2.0,8.0,13.0,73.0,12,12
28239,49,4.489796,12,1,1.0,8.0,13.0,16.0,7,7
25511,59,4.355932,11,1,2.0,7.0,7.0,28.0,7,7
20160,89,4.483146,16,1,2.0,7.2,12.0,36.0,12,12
22377,56,4.428571,12,1,2.0,8.0,6.0,23.0,6,6
27074,57,3.982456,10,1,2.0,7.0,22.0,17.0,7,7
27611,108,4.12037,10,1,2.0,7.0,14.0,53.0,11,11
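
The character class in `punct_count2` does not cover every ASCII punctuation mark, which is worth knowing when reading the counts above. The sketch below lists the characters of `string.punctuation` that the class misses; `punct_count_all` is a hypothetical alternative that counts all of them and is not used in the notebook.

```python
import re
import string

# The character class used in punct_count2 (unchanged)
pattern = re.compile(r"""[!.><:;'@#~{}\[\]-_+=£$%^&()?]""")

# ASCII punctuation characters that the class does not match
unmatched = [c for c in string.punctuation if not pattern.match(c)]
print(unmatched)

def punct_count_all(doc):
    # Hypothetical alternative: count every character in string.punctuation
    return sum(c in string.punctuation for c in doc)

# A fragment of document 22377, containing two commas and one full stop
sample = "problems, cholesterol, yadda yadda."
count_class = sum(bool(pattern.match(c)) for c in sample)
count_all = punct_count_all(sample)
print(count_class, count_all)
```

The hyphen is missed because `\]-_` inside a character class denotes the range from `]` to `_`; placing `-` first or last in the class, or escaping it, would match it literally.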


In [60]:
df["text"][22377]

'it is so worth purchasing and using! I gave it away to someone. The slight flavour that did come through was bitter. I love salted and smoked almonds. Not for someone with no heart problems, cholesterol, yadda yadda. Although I have a very nice cup of tea and this stuff is worthy of a thousand recipes.'

In [52]:
pd.set_option('display.max_colwidth', None)
df["text"][3634]

'But one morning--I forget exactly the day--but perhaps it was lucky for her husband, who might not have many years of division and estrangement. This nine Hundred they always kept in a manner so little agreeable to her. Great business if people keep buying it. To be otherwise comforted was out of the house, and he now eats only organic human grade freeze dried artisan dog food. Do not run away the first from Miss Crawford; there was no being attached to me, they would not envy her that distinction _now_; but when she got into it.'