In this notebook we process the synthetic Austen/food-review data and convert each document into a feature vector. In later notebooks these feature vectors are the inputs to the models we train and eventually use to identify spam.

Here, each feature vector consists of simple summary statistics of one document: the number of words, and the mean, minimum, maximum, 10th-percentile and 90th-percentile word length.

In [1]:
import pandas as pd

df = pd.read_parquet("data/training.parquet")

In [2]:
df.sample(10)

Unnamed: 0,label,text
4,spam,"According to what I originally thought, which ..."
132,spam,I've never once had an addiction to chocolate ...
11,legitimate,All the better. Catherine's silent appeal to h...
232,spam,The first time I boiled them way too long to f...
113,spam,Truth in advertising? I guess I'll be looking ...
963,spam,Got it with a little honey. The cookies have a...
614,legitimate,Her own feelings entirely engrossed her; her w...
989,legitimate,Mrs. Bennet was in fact given in the death of ...
128,spam,I actually smell the apple right away and came...
285,legitimate,Is there nothing you could take to give you pr...


In [3]:
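# Words are whitespace-delimited tokens, so attached punctuation counts toward word length.
# avwordlength, maxwordlength and minwordlength raise on a document with no words.

def numwords(doc):
    """Number of words in the document."""
    return len(doc.split())

def avwordlength(doc):
    """Mean word length, in characters."""
    words = doc.split()
    return sum(len(word) for word in words) / len(words)

def maxwordlength(doc):
    """Length of the longest word."""
    words = doc.split()
    return max(len(word) for word in words)

def minwordlength(doc):
    """Length of the shortest word."""
    words = doc.split()
    return min(len(word) for word in words)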
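
As a quick check of what these summaries measure, the sketch below computes the same four quantities inline for one short passage (the opening words of the first legitimate document shown later in this notebook). The variable names are illustrative only. Because `str.split()` splits on whitespace, tokens such as `me.` and `sighed.` keep their trailing punctuation and are counted as 3 and 7 characters.

```python
# Toy document: a short fragment, used only for illustration.
doc = "You must write to me. Catherine sighed."

words = doc.split()                     # whitespace-delimited tokens
lengths = [len(word) for word in words]

n_words = len(words)                    # what numwords computes
mean_wl = sum(lengths) / len(lengths)   # what avwordlength computes
min_wl = min(lengths)                   # what minwordlength computes
max_wl = max(lengths)                   # what maxwordlength computes

print(words)
print(n_words, round(mean_wl, 3), min_wl, max_wl)
```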

In [4]:
import numpy as np

def docwlpercentile(p):
    """Return a function that computes the p-th percentile of a document's word lengths."""
    def pthpercentile(doc):
        words = doc.split()
        wl = [len(word) for word in words]
        return np.percentile(wl, p)
    return pthpercentile
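
By default `np.percentile` interpolates linearly between neighbouring values, so a percentile of integer word lengths can be fractional (the final table in this notebook contains a 90th percentile of 7.3). A minimal sketch on a made-up document whose word lengths are 1 to 5:

```python
import numpy as np

# Toy document (an assumption for illustration): word lengths are 1, 2, 3, 4, 5.
doc = "a bb ccc dddd eeeee"
lengths = [len(word) for word in doc.split()]

# With five values, the 10th percentile sits 0.4 of the way from the
# first value to the second, and the 90th sits 0.6 of the way from the
# fourth value to the fifth.
p10 = np.percentile(lengths, 10)   # approximately 1.4
p90 = np.percentile(lengths, 90)   # approximately 4.6

print(p10, p90)
```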

In [5]:
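# Add the word count and mean word length as columns.
# av_wl holds the same values as average_wl, which the next cell adds.
df["number_words"] = df["text"].apply(numwords)
df["av_wl"] = df["text"].apply(avwordlength)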

In [6]:
# One column per summary statistic.
df["number_words"] = df["text"].apply(numwords)
df["average_wl"] = df["text"].apply(avwordlength)
df["min_wl"] = df["text"].apply(minwordlength)
df["max_wl"] = df["text"].apply(maxwordlength)
df["10p_wl"] = df["text"].apply(docwlpercentile(10))
df["90p_wl"] = df["text"].apply(docwlpercentile(90))
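
One possible alternative, not used in this notebook, is to keep the column names and summary functions together in a mapping and loop over it, which makes adding a feature a one-line change. The sketch below is self-contained: `toy`, `word_lengths` and `FEATURES` are hypothetical names, and the two rows are short fragments standing in for the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (two short fragments, illustration only).
toy = pd.DataFrame({
    "label": ["legitimate", "spam"],
    "text": ["You must write to me.", "Got it with a little honey."],
})

def word_lengths(doc):
    # Whitespace-delimited tokens, matching the functions defined earlier.
    return [len(word) for word in doc.split()]

# Hypothetical mapping from column name to summary function.
FEATURES = {
    "number_words": lambda doc: len(doc.split()),
    "average_wl": lambda doc: float(np.mean(word_lengths(doc))),
    "min_wl": lambda doc: min(word_lengths(doc)),
    "max_wl": lambda doc: max(word_lengths(doc)),
    "10p_wl": lambda doc: np.percentile(word_lengths(doc), 10),
    "90p_wl": lambda doc: np.percentile(word_lengths(doc), 90),
}

# Add one column per entry, in the order the mapping lists them.
for name, fn in FEATURES.items():
    toy[name] = toy["text"].apply(fn)

print(toy.drop(columns="text"))
```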

In [7]:
df

Unnamed: 0,label,text,number_words,av_wl,average_wl,min_wl,max_wl,10p_wl,90p_wl
0,legitimate,You must write to me. Catherine sighed. And th...,124,4.879032,4.879032,1,16,2.0,9.0
1,legitimate,Who would have thought Mr. Crawford sure of he...,88,5.022727,5.022727,1,17,2.0,9.0
2,legitimate,He had only himself to please in his choice: h...,139,4.496403,4.496403,1,13,2.0,8.0
3,legitimate,Oh! One accompaniment to her song took her agr...,94,4.680851,4.680851,1,15,2.0,9.0
4,legitimate,"As soon as breakfast was over, she went to her...",80,4.525000,4.525000,2,9,2.0,8.0
5,legitimate,Mrs Clay's selfishness was not so great as to ...,103,4.669903,4.669903,1,12,2.0,8.0
6,legitimate,"But self, though it would intrude, could not e...",48,4.500000,4.500000,1,12,2.0,7.3
7,legitimate,"Elizabeth, though she did not wish to slight. ...",101,4.386139,4.386139,1,13,2.0,7.0
8,legitimate,Edmund had descended from that moral elevation...,89,4.404494,4.404494,1,12,2.0,8.0
9,legitimate,I read up on the morrow the Crawfords were eng...,108,4.814815,4.814815,1,17,2.0,9.0
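
Since these columns are meant to feed models in later notebooks, one way to pull them out as a numeric matrix is sketched below. This is an assumption about how the features might be consumed, not a step this notebook performs. The names `features`, `feature_columns`, `X` and `y` are illustrative, the two rows are copied from the output above, and the duplicate `av_wl` column is left out.

```python
import pandas as pd

# Two rows copied from the table above (text column omitted for brevity).
features = pd.DataFrame({
    "label": ["legitimate", "legitimate"],
    "number_words": [124, 88],
    "average_wl": [4.879032, 5.022727],
    "min_wl": [1, 1],
    "max_wl": [16, 17],
    "10p_wl": [2.0, 2.0],
    "90p_wl": [9.0, 9.0],
})

# Numeric summary columns only; av_wl is skipped because it repeats average_wl.
feature_columns = ["number_words", "average_wl", "min_wl", "max_wl", "10p_wl", "90p_wl"]

X = features[feature_columns].to_numpy(dtype=float)  # one row per document
y = features["label"].to_numpy()                     # matching labels

print(X.shape)
```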
