In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. 

In [1]:
import pandas as pd

df = pd.read_parquet("data/training.parquet")



## Term Frequency - Inverse Document Frequency (TF-IDF)

Term Frequency - Inverse Document Frequency, known as TF-IDF, is a vector of numbers which aims to capture how important different words are within a set of documents. If we consider a standard set of documents (e.g. webpages, news articles, or tweets), one would expect most to contain stop words such as "a", "the" or "in".

Considering only raw word count, these words appear frequently but are not of interest: they don't tell us what the document is about.

TF-IDF combines word count, or term frequency, with the inverse document frequency in order to identify words, or terms, which are 'interesting' or important within the document.


It is a combination of two different metrics: 

#### Term Frequency
The first is Term Frequency.

The term frequency of term $t$ in document $d$, which we denote $tf(t, d)$, is simply a count of the number of times the term $t$ appears in the document $d$.
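As a small illustration of the definition, the sketch below counts terms in a single toy sentence. The helper name `term_frequency` and the naive whitespace tokenisation are assumptions made for this example; they are not how scikit-learn tokenises text.

```python
from collections import Counter


def term_frequency(document):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(document.lower().split())


doc = "the cat sat on the mat"
tf = term_frequency(doc)

print(tf["the"])  # 2
print(tf["cat"])  # 1
```

A `Counter` returns 0 for terms that never occur, which matches the idea that the term frequency of an absent term is zero.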

#### Inverse Document Frequency

Inverse Document Frequency, or idf, indicates how common or rare a term is across documents.

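The inverse document frequency of a term $t$ across a set of $N$ documents is based on the logarithm of the ratio of $N$ to the number of documents in which $t$ appears, which we denote $df(t)$. With the smoothing that scikit-learn applies by default (`smooth_idf=True`), it is:

$idf(t) = \log\left(\frac{1 + N}{1 + df(t)}\right) + 1$

The $+1$ in the numerator and denominator acts as if one extra document containing every term had been seen, which prevents division by zero for a term that appears in none of the documents. The final $+1$ ensures that a term which appears in every document still receives a non-zero weight (its idf is 1 rather than 0).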
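To make the idf concrete, here is a minimal sketch on a toy corpus. It assumes the smoothed form scikit-learn uses by default, $\ln\left(\frac{1+N}{1+df(t)}\right) + 1$; the function name `smoothed_idf` and the whitespace tokenisation are illustrative choices only.

```python
import math


def smoothed_idf(term, documents):
    """Return ln((1 + N) / (1 + df)) + 1, where df counts documents containing term."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc.lower().split())
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


corpus = ["the cat sat", "the dog barked", "the cat purred", "a bird sang"]

idf_the = smoothed_idf("the", corpus)      # in 3 of 4 documents
idf_bird = smoothed_idf("bird", corpus)    # in 1 of 4 documents
idf_unseen = smoothed_idf("fish", corpus)  # in none; smoothing avoids dividing by zero

print(round(idf_the, 3), round(idf_bird, 3), round(idf_unseen, 3))
```

The rarer a term is across the corpus, the larger its idf, so "bird" scores higher than "the".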

Multiplying the term frequency by the inverse document frequency gives the TF-IDF.

### TF-IDF

For a term $t$ and a document $d$, the TF-IDF is given by:

$\text{tf-idf}(t,d) = tf(t,d) \times idf(t)$



The resulting vector for each document is then normalised (scikit-learn uses the L2 norm by default, so each document vector has unit length).

Note: there are variations upon the equations for term frequency and idf, which we do not consider here. The equations given above are those used by the scikit-learn library.
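Putting the pieces together, the following sketch computes L2-normalised TF-IDF vectors for a toy corpus in plain Python. It assumes the smoothed idf $\ln\left(\frac{1+N}{1+df(t)}\right) + 1$ and naive whitespace tokenisation; `tfidf_vectors` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one L2-normalised vector per document."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    vocabulary = sorted({term for doc in tokenised for term in doc})
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for doc in tokenised for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm if norm else 0.0 for w in raw])
    return vocabulary, vectors


corpus = ["the cat sat", "the dog barked", "the cat purred"]
vocab, vectors = tfidf_vectors(corpus)

for term, weight in zip(vocab, vectors[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, "sat" (which appears in one document) receives a larger weight than "cat" (two documents), which in turn outweighs "the" (all three).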

### Implementation: 

The scikit-learn library contains a set of feature extraction tools which take in text documents and return feature vectors. TF-IDF is computed using the `TfidfVectorizer` class. By default, the vectorizer lower-cases the text, splits each document into words, and discards punctuation as well as any 1-character tokens. From there, it builds a count matrix recording how often each word occurs in each document. This is then used to compute the TF-IDF.
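The default tokenisation can be approximated with the standard `re` module. The sketch below applies the default `token_pattern`, `(?u)\b\w\w+\b`, to a lower-cased string; the `tokenize` helper and the sample sentence are made up for illustration.

```python
import re

# The default token_pattern of TfidfVectorizer: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lower-case the text and return every token of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("I bought a tea at 6:00 PM, it's great!")
print(tokens)  # ['bought', 'tea', 'at', '00', 'pm', 'it', 'great']
```

Note that digits count as word characters, so "00" survives as a token while the single characters "I", "a", "6" and "s" are dropped.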

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialise the vectorizer with default parameters.
vectorizer = TfidfVectorizer()

In [3]:
# We can take a look at the vectorizer and its default parameters.
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

### Parameters:
    analyzer: {'word', 'char', 'char_wb'} determines whether features (terms) should be words, character n-grams (strings of n characters), or character n-grams taken only from text inside word boundaries.
    binary: if False, term frequencies are computed as raw counts of terms. If True, term frequencies are taken to be 1 if the term is present at least once, and 0 otherwise.
    stop_words: can take in a list of 'stop words' such as 'is', 'was', 'as', and 'the', which will not be included in the terms considered when computing TF-IDF. Note that this can also be controlled by lowering the following parameter:
    max_df: takes a float from 0.0 to 1.0 (or an integer document count). If a term appears in more than a max_df fraction of documents, it will not be included in the list of considered terms.
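The effect of document-frequency thresholds can be sketched in plain Python. The helper below is a hypothetical illustration, not scikit-learn's code: it treats `max_df` as a fraction of documents, as described above, and `min_df` (which appears in the vectorizer's printed defaults) as an absolute document count.

```python
from collections import Counter


def filter_vocabulary(tokenised_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df * N]."""
    n_docs = len(tokenised_docs)
    # Count each term once per document, however often it occurs there.
    doc_freq = Counter(term for doc in tokenised_docs for term in set(doc))
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
    ["the", "bird", "sang"],
]

kept_default = filter_vocabulary(docs)
kept_without_common = filter_vocabulary(docs, max_df=0.75)  # drops "the"
kept_without_rare = filter_vocabulary(docs, min_df=2)       # keeps terms in 2+ documents

print(kept_without_common)
print(kept_without_rare)  # ['cat', 'the']
```

Lowering `max_df` removes very common terms such as "the" without needing an explicit stop word list, while raising `min_df` removes terms that occur in only a handful of documents.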


In [8]:
# fit_transform learns the vocabulary and idf weights, then computes the tf-idf vectors for the synthetic data.

tf_idf = vectorizer.fit_transform(df["text"])
tf_idf

<2000x11121 sparse matrix of type '<class 'numpy.float64'>'
	with 129952 stored elements in Compressed Sparse Row format>

  (0, 11081)	0.037445397287015666
  (0, 6593)	0.06647960706903334
  (0, 11021)	0.11419958815587786
  (0, 10090)	0.07382689335491247
  (0, 6240)	0.10096923990687869
  (0, 1717)	0.09446965819170754
  (0, 8984)	0.1638948273956066
  (0, 658)	0.07150959755522736
  (0, 9972)	0.0554825888993784
  (0, 783)	0.04264051197112491
  (0, 7013)	0.06089781009426667
  (0, 1961)	0.11617484630876207
  (0, 10865)	0.057215452190075455
  (0, 606)	0.11887840310016683
  (0, 6803)	0.0620063376540027
  (0, 8626)	0.0998344424562198
  (0, 9958)	0.06583711161334001
  (0, 6700)	0.06484976565309741
  (0, 1445)	0.13272353859255703
  (0, 5471)	0.10337724428812725
  (0, 9978)	0.039288791636846336
  (0, 10904)	0.05367931948843717
  (0, 8012)	0.09483647351028074
  (0, 10103)	0.09375494859210098
  (0, 4861)	0.15900198273087807
  :	:
  (1998, 8998)	0.20770317666438837
  (1999, 10090)	0.08052063964708878
  (1999, 658)	0.07799323897285004
  (1999, 9958)	0.10770965901554154
  (1999, 5471)	0.08456269920857518
  (1999, 1118)	0.

In [23]:
tf_idf.shape

(2000, 11121)

In [62]:
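# Note: scikit-learn 1.0 and later provide get_feature_names_out(); get_feature_names() was removed in 1.2.
vectorizer.get_feature_names()[0:10]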

['00', '000', '01', '03', '05', '06', '07', '08', '09', '0g']

If we look at the first 10 feature names, which we expect to be words because we used the vectorizer's default settings (`analyzer='word'`), we see that they are numeric tokens rather than recognisable words.

On the one hand, this doesn't really matter: many feature engineering techniques generate features which are not explainable. On the other hand, we thought we were getting words.
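If alphabetic words are what we want, one possible approach (an assumption on our part, not something this notebook goes on to use) is a stricter token pattern that only accepts letters. The sketch below compares the default pattern with a hypothetical letters-only pattern using the `re` module; `TfidfVectorizer` accepts a `token_pattern` argument, so a pattern like this could be passed in.

```python
import re

DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")
# Hypothetical alternative: two or more letters, excluding digits and underscores.
LETTERS_ONLY_PATTERN = re.compile(r"(?u)\b[^\W\d_]{2,}\b")

text = "6:00 PM and 0g of sugar, 100 percent".lower()

default_tokens = DEFAULT_PATTERN.findall(text)
letter_tokens = LETTERS_ONLY_PATTERN.findall(text)

print(default_tokens)  # ['00', 'pm', 'and', '0g', 'of', 'sugar', '100', 'percent']
print(letter_tokens)   # ['pm', 'and', 'of', 'sugar', 'percent']
```

Whether dropping numeric tokens helps depends on the task: for spam detection, numbers such as prices or quantities may well carry signal.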

In [41]:
# Column 0 corresponds to the token '00'. It has 14 non-zero entries, so '00' appears in 14 documents. Let's double check.
print(tf_idf[:, 0])


  (342, 0)	0.15497173855249344
  (935, 0)	0.09357464860043498
  (947, 0)	0.11922929676832385
  (979, 0)	0.13528214068910932
  (1128, 0)	0.1122534746577168
  (1244, 0)	0.1296948418074572
  (1570, 0)	0.20797516195989615
  (1693, 0)	0.09633887341614138
  (1719, 0)	0.14496356963282298
  (1753, 0)	0.15047467442804296
  (1852, 0)	0.12384227128620147
  (1878, 0)	0.1309043357630492
  (1908, 0)	0.24842177813585903
  (1960, 0)	0.12234175060680996


In [63]:
print(df[342:343]["text"])

342    English, Scottish and Irish and even decaf for after 6:00 PM. I was very unwilling to go as well from real humanity and good-nature, as from a dislike of Edward; and it ended, as every feeling declared him now to be? Disappointed that the recommended amt of treats is only 10.. They were sick with horror, while he examined; but he was sure there could not but triumph. Of every accomplishment accustomary to my sex, I was Mistress.
Name: text, dtype: object


In [None]:
# We see the "00" there, in "6:00 PM".

In [64]:
# Remove scikit-learn's built-in English stop words before computing tf-idf.
vectorizer2 = TfidfVectorizer(stop_words='english')


In [65]:
vectorizer2

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [66]:
tf_idf2 = vectorizer2.fit_transform(df["text"])
tf_idf2

<2000x10828 sparse matrix of type '<class 'numpy.float64'>'
	with 69803 stored elements in Compressed Sparse Row format>

In [67]:
vectorizer2.get_feature_names()[0:10]

['00', '000', '01', '03', '05', '06', '07', '08', '09', '0g']

In [83]:
# Keep only terms that appear in at least 20 documents.
vectorizer3 = TfidfVectorizer(min_df=20)

In [84]:
tf_idf3 = vectorizer3.fit_transform(df["text"])
tf_idf3

<2000x913 sparse matrix of type '<class 'numpy.float64'>'
	with 98179 stored elements in Compressed Sparse Row format>

In [85]:
vectorizer3.get_feature_names()[0:10]

['10',
 '100',
 '50',
 'able',
 'about',
 'absolutely',
 'account',
 'actually',
 'add',
 'added']