In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. 

In this notebook, feature vectors are composed of simple summaries of the documents. 

In [1]:
import pandas as pd

df = pd.read_parquet("data/training.parquet")

In [2]:
df.sample(10)

Unnamed: 0,index,label,text
4850,4850,legitimate,And yet it was Captain Wentworth not in the wa...
29211,9211,spam,I kept switching foods - changed to ceramic bo...
32323,12323,spam,This is NOT organic dog food. A normal half ga...
15921,15921,legitimate,It is a fantastic rice. From such commendation...
17643,17643,legitimate,Kids sports practices can make meals a real ad...
36051,16051,spam,I purchased this as a yearly tradition if not ...
32969,12969,spam,100% worth the price. I wanted to receive them...
11800,11800,legitimate,"It does, however acidify the pH. I have no dou..."
10361,10361,legitimate,But why? I remember her promising to give you ...
6107,6107,legitimate,I said what I could but see her again!--But I ...


In [3]:
# Note: level_0 and index coincide for the legitimate documents, but not for the spam:
# for spam, index = level_0 mod 20,000.
df.reset_index(inplace=True)

In [4]:
df.sample(10)

Unnamed: 0,level_0,index,label,text
29765,29765,9765,spam,Too bad. It was the first time take it when he...
11621,11621,11621,legitimate,The man must be very like a merit to those he ...
38274,38274,18274,spam,This is a great company and well priced and ca...
34186,34186,14186,spam,"One of my cats, who have different medical iss..."
7689,7689,7689,legitimate,It was gratifying to know that it must be a wo...
26620,26620,6620,spam,the flavor is intense. They only K-cup i'm buy...
1513,1513,1513,legitimate,I was obliged to call herself to think of them...
38899,38899,18899,spam,I can't stand mango and just about gave up whe...
22046,22046,2046,spam,Both ways were tasty. They rip when going into...
34177,34177,14177,spam,I have called them SOAP candy because to them ...


We can now "explode" the text into words, so that each word occupies a row of the data frame, and retains the associated "level_0", "index" and "label". 

In [5]:
rows = []
_ = df.apply(lambda row: [rows.append([row['level_0'], row['index'], row['label'], word]) 
                         for word in row.text.split()], axis=1)

In [6]:
rows[1:10]

[[0, 0, 'legitimate', 'must'],
 [0, 0, 'legitimate', 'write'],
 [0, 0, 'legitimate', 'to'],
 [0, 0, 'legitimate', 'me.'],
 [0, 0, 'legitimate', 'Catherine'],
 [0, 0, 'legitimate', 'sighed.'],
 [0, 0, 'legitimate', 'And'],
 [0, 0, 'legitimate', 'there'],
 [0, 0, 'legitimate', 'are']]

In [7]:
df_explode = pd.DataFrame(rows, columns=df.columns)
df_explode

Unnamed: 0,level_0,index,label,text
0,0,0,legitimate,You
1,0,0,legitimate,must
2,0,0,legitimate,write
3,0,0,legitimate,to
4,0,0,legitimate,me.
5,0,0,legitimate,Catherine
6,0,0,legitimate,sighed.
7,0,0,legitimate,And
8,0,0,legitimate,there
9,0,0,legitimate,are
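
The same "explode" step can also be written with pandas' built-in `DataFrame.explode`. The sketch below is an alternative, not the approach used in this notebook, and it runs on a small made-up frame (`toy` and its contents are illustrative). One difference to keep in mind: a document with no words would survive as a single row with a missing value here, whereas the loop above drops it.

```python
import pandas as pd

# Toy stand-in for the notebook's data frame (values are made up).
toy = pd.DataFrame({
    "level_0": [0, 1],
    "index": [0, 1],
    "label": ["legitimate", "spam"],
    "text": ["You must write", "Too bad."],
})

# Split each document on whitespace, then give every word its own row.
toy_explode = (
    toy.assign(text=toy["text"].str.split())
       .explode("text")
       .reset_index(drop=True)
)
print(toy_explode)
```

Each word keeps the "level_0", "index" and "label" of the document it came from, as in the loop-based version.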


Column "level_0" contains the index we want to aggregate any calculations over. 

The summaries we are going to compute for each document (indexed by "level_0") are: 
    number of words in each document
    average word length
    maximum word length
    minimum word length
    10th percentile word length
    90th percentile word length
    number of upper case words
    
    
Many of these require the word length to be computed. To save us from recomputing this every time, we will begin by adding a column containing this information to our 'exploded' data frame. 


In [8]:
df_explode["word_len"] = df_explode["text"].apply(len) 

In [9]:
df_explode.sample(10) ## looks fine, though punctuation is being counted as a character in this word length calculation. 

Unnamed: 0,level_0,index,label,text,word_len
3160085,37090,17090,spam,review,6
278862,3114,3114,legitimate,the,3
1501976,16700,16700,legitimate,is,2
1143234,12697,12697,legitimate,those,5
3245323,38171,18171,spam,they're,7
686407,7666,7666,legitimate,He,2
2616632,30246,10246,spam,could,5
546927,6132,6132,legitimate,to,2
862557,9597,9597,legitimate,looked,6
844801,9394,9394,legitimate,only,4
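
If we did not want punctuation to count towards word length, one option is to strip leading and trailing punctuation before measuring. This is only a sketch of that alternative on a few made-up tokens; the notebook itself keeps the raw lengths, and `word_len_no_punct` is a hypothetical name.

```python
import string
import pandas as pd

# A few tokens of the kind produced by splitting on whitespace.
words = pd.Series(["me.", "sighed.", "they're", "again!--But", "--"])

# Strip punctuation from both ends only; internal punctuation is kept.
stripped = words.str.strip(string.punctuation)
word_len_no_punct = stripped.str.len()
print(pd.DataFrame({"text": words, "word_len_no_punct": word_len_no_punct}))
```

Note that tokens made up entirely of punctuation end up with length 0, and that tokens such as "again!--But" are unchanged because the punctuation is in the middle.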


We will record the summaries for each document in a new data frame called df_summaries. We start by computing the number of words in each document. 

In [10]:
counts = df_explode['level_0'].value_counts()

df_summaries = pd.DataFrame({'counts' :counts})

In [11]:
df_summaries.sample(10)

Unnamed: 0,counts
10714,87
37783,62
15213,129
37165,150
39009,149
17294,75
4484,154
4808,145
3102,99
4098,67


In [12]:
df_summaries["av_wl"] = df_explode.groupby('level_0')['word_len'].mean() #average word length


In [13]:
df_summaries.sample(10)

Unnamed: 0,counts,av_wl
32592,199,4.150754
16415,91,4.164835
19950,75,4.76
21241,28,5.071429
39975,73,4.657534
1223,125,4.344
37640,81,4.493827
35270,83,4.325301
5531,115,4.104348
31600,37,3.72973


In [None]:
df_summaries["max_wl"] = df_explode.groupby('level_0')['word_len'].max() #max word length
df_summaries["min_wl"] = df_explode.groupby('level_0')['word_len'].min() #min word length
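
The 10th and 90th percentile word lengths could be computed on the exploded frame in the same grouped style. The sketch below assumes a frame shaped like `df_explode` but uses made-up values; `toy_explode` and `toy_summaries` are illustrative names. With default settings, pandas' `quantile` uses linear interpolation, as does `np.percentile`.

```python
import pandas as pd

# Toy stand-in for df_explode: two documents with word lengths already computed.
toy_explode = pd.DataFrame({
    "level_0": [0, 0, 0, 1, 1],
    "word_len": [3, 4, 5, 1, 9],
})

grouped = toy_explode.groupby("level_0")["word_len"]

# quantile takes a fraction in [0, 1], so the 10th percentile is 0.10.
toy_summaries = pd.DataFrame({
    "10p_wl": grouped.quantile(0.10),
    "90p_wl": grouped.quantile(0.90),
})
print(toy_summaries)
```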


In [None]:

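def numwords(doc):  # used by later cells; assumed to count whitespace-separated words
    return len(doc.split())

def avwordlength(doc):  # used by later cells; assumed to be the mean word length
    words = doc.split()
    return sum(len(word) for word in words) / len(words)

def maxwordlength(doc):
    words = doc.split()
    return max(len(word) for word in words)

def minwordlength(doc):
    words = doc.split()
    return min(len(word) for word in words)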

In [None]:
import numpy as np

def docwlpercentile(p):
    def pthpercentile(doc):
        words = doc.split()
        wl = [len(word) for word in words]
        return np.percentile(wl,p)
    return pthpercentile
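
As a quick check of how the function factory behaves, here is a small usage sketch on a single made-up sentence. `docwlpercentile(p)` returns a function with `p` fixed, which is what lets us pass it straight to `apply` later on.

```python
import numpy as np

def docwlpercentile(p):
    # Returns a function that computes the p-th percentile word length of a document.
    def pthpercentile(doc):
        wl = [len(word) for word in doc.split()]
        return np.percentile(wl, p)
    return pthpercentile

p10 = docwlpercentile(10)
p90 = docwlpercentile(90)

doc = "You must write to me."  # word lengths: 3, 4, 5, 2, 3
print(p10(doc), p90(doc))
```

Because `np.percentile` interpolates linearly by default, the results need not be whole numbers even though every word length is.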

In [None]:
df["number_words"], df["av_wl"]=[df["text"].apply(numwords), df["text"].apply(avwordlength)]

In [None]:
df["number_words"] = df["text"].apply(numwords)
df["average_wl"] = df["text"].apply(avwordlength)
df["min_wl"] = df["text"].apply(minwordlength)
df["max_wl"] = df["text"].apply(maxwordlength)
df["10p_wl"] = df["text"].apply(docwlpercentile(10))
df["90p_wl"] = df["text"].apply(docwlpercentile(90))

In [None]:
df

As well as the simple summaries relating to word length, we can compute some more involved summaries related to language. 

To do this, we use spaCy. 

In [None]:
# str.islower() returns True only if the string has at least one cased character
# and all of its cased characters are lowercase.
# So `not x.islower()` captures words with capital letters anywhere in them, not just at the start,
# and it also counts tokens with no letters at all (such as "100%" or "-").
# Note: str.isupper() only returns True if all cased characters are upper case.
def uppercase(doc):
    words = doc.split()
    return len([x for x in words if not x.islower()])

df["upper"] = df["text"].apply(uppercase)
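
To see what this test does and does not count, the sketch below compares `not x.islower()` with a stricter check that requires at least one upper case character. The token list is made up, and `has_upper` is a hypothetical alternative rather than the feature computed in this notebook.

```python
# Made-up tokens covering the interesting cases.
words = ["food", "Captain", "NOT", "100%", "--", "K-cup"]

# The test used by uppercase(): anything that is not entirely lowercase.
not_lower = [w for w in words if not w.islower()]

# A stricter alternative: the token must contain an upper case character.
has_upper = [w for w in words if any(ch.isupper() for ch in w)]

print(not_lower)
print(has_upper)
```

The two lists differ on "100%" and "--", which contain no cased characters and so are not lowercase in the `islower()` sense.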

In [None]:
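import spacy
# The "en" shortcut was removed in spaCy v3, so we load the model by its package name.
# This assumes the small English model "en_core_web_sm" is installed.
english = spacy.load("en_core_web_sm")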

In [None]:
# computing the number of stop words is not quick. 

def num_stops(doc):
    tokens = english(doc)
    return len([token for token in tokens if token.is_stop])
    
df["num_stops"] = df["text"].apply(num_stops)

In [None]:
df