# Stop Words and tf-idf

Let's investigate one of the most useful feature weightings, tf-idf, and see how stop words fall naturally out of the document frequencies it is built on. To start, let's load a set of short documents.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# load data
try:
    df = pd.read_csv('data/rt_critics.csv')
except IOError:
    print('cannot find file')

In [None]:
# It seems silly to call such short blurbs 'documents', but we'll stick with the NLP nomenclature.

documents = list(df['quote'])
documents[:5]

## Document Frequency

Let's start by calculating the document frequency for words in these documents. For this task, let's also remove all the punctuation marks and make everything lower-case.

In [None]:
from nltk.tokenize import wordpunct_tokenize  # for tokenizing our text
import string  # helps with removing punctuation
from collections import Counter  # great dict-like datastructure for counting things

In [None]:
print(string.punctuation)

In [7]:
# This is a bit of text cleanup
word_bag_list = []
for doc in documents:
    cleaned = doc.lower().replace('-', ' ')  # make lowercase and split hyphenated words in two
    for c in string.punctuation:  # strip punctuation marks.
        cleaned = cleaned.replace(c, '')
    word_bag_list.append(wordpunct_tokenize(cleaned))

# How do things look?
print('a few tokens: {}'.format(word_bag_list[:3]))

# this flattens the nested lists into one big list for some stats
token_list = []
for tokens in word_bag_list:
    token_list.extend(tokens)
print('number of tokens: {}'.format(len(token_list)))
print('number of unique tokens: {}'.format(len(set(token_list))))
print('number of documents: {}'.format(len(word_bag_list)))

a few tokens: [['so', 'ingenious', 'in', 'concept', 'design', 'and', 'execution', 'that', 'you', 'could', 'watch', 'it', 'on', 'a', 'postage', 'stamp', 'sized', 'screen', 'and', 'still', 'be', 'engulfed', 'by', 'its', 'charm'], ['the', 'years', 'most', 'inventive', 'comedy'], ['a', 'winning', 'animated', 'feature', 'that', 'has', 'something', 'for', 'everyone', 'on', 'the', 'age', 'spectrum']]
number of tokens: 280092
number of unique tokens: 22424
number of documents: 14072


In [None]:
# calculate the document frequency of all the unique tokens in the bags of words.

df = Counter()  # dict-like counter; note that this rebinds `df`, which held the DataFrame above

for doc in word_bag_list:
    # count each word once per INDIVIDUAL document it appears in
    # (not the total number of occurrences across all documents)
    for token in set(doc):
        df[token] += 1  # add one to the right key in df

n_docs = float(len(word_bag_list))  # a float, so Python 2 does not truncate the division to zero
for token in df:
    # normalize the counts by the number of documents
    df[token] /= n_docs

# this last line shows the 20 highest-scoring words and their scores
df.most_common(20)
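
The document-frequency calculation is easy to check on a toy corpus that is small enough to verify by hand. The following is a minimal sketch; the helper name `document_frequency` and the three toy documents are made up for illustration and are not part of the original data.

```python
from collections import Counter


def document_frequency(bags):
    """Return the fraction of documents each token appears in (hypothetical helper)."""
    counts = Counter()
    for bag in bags:
        # set() ensures a token is counted once per document
        counts.update(set(bag))
    n_docs = float(len(bags))
    return Counter({token: count / n_docs for token, count in counts.items()})


# toy corpus (illustrative only)
toy_bags = [
    ['the', 'film', 'is', 'the', 'best'],
    ['the', 'plot', 'is', 'thin'],
    ['a', 'charming', 'film'],
]
toy_df = document_frequency(toy_bags)
print(toy_df.most_common(3))
```

Here 'the' occurs three times in total but in only two of the three documents, so its document frequency is 2/3, not 1.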

## Stop Words

Which words are likely to be stop words? The ones that show up in the most documents. The terms with the largest document frequency are good stop-word candidates; the threshold above which you call something a stop word is up to you.
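
One way to turn that threshold idea into code is sketched below. The helper `find_stop_words` and the frequencies in `example_df` are invented for illustration; in practice you would pass in the document frequencies computed from your own corpus.

```python
def find_stop_words(doc_freq, threshold):
    """Return tokens whose document frequency is at or above `threshold` (hypothetical helper)."""
    return sorted(token for token, freq in doc_freq.items() if freq >= threshold)


# made-up document frequencies, for illustration only
example_df = {'the': 0.62, 'a': 0.55, 'and': 0.48, 'film': 0.21, 'charming': 0.01}

print(find_stop_words(example_df, threshold=0.4))  # ['a', 'and', 'the']
print(find_stop_words(example_df, threshold=0.2))  # ['a', 'and', 'film', 'the']
```

Lowering the threshold starts to pull in domain-specific words such as 'film', which may or may not be what you want for a corpus of movie reviews.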

## tf-idf

More interesting than stop words is the tf-idf score, which tells us which words are most discriminative between documents. A word that occurs often in one document but appears in few documents overall tells you something special about that document. With $tf$ the term frequency of a word within a document and $df$ its document frequency from above:

$$
\text{tf-idf} = tf \cdot idf = tf \cdot \log{\frac{1}{df}} = -tf \cdot \log{df}
$$

Recall that:

$$
\log{x} = -\log{1 \over x}
$$
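
A few numbers make the formula concrete. This is a small sketch, assuming `df` is the normalized document frequency (a fraction between 0 and 1) and `tf` is a raw count; the function name `tf_idf` and the example values are made up for illustration.

```python
import math


def tf_idf(tf, df):
    """tf-idf for a term count `tf` and a normalized document frequency `df` in (0, 1]."""
    return tf * -math.log(df)


# both forms of the idf term agree
assert abs(math.log(1 / 0.01) - (-math.log(0.01))) < 1e-12

# a rare word (in 1% of documents) used once outweighs
# a common word (in 50% of documents) used three times
rare = tf_idf(tf=1, df=0.01)
common = tf_idf(tf=3, df=0.5)
print(rare, common)
```

A word that appears in every document has $df = 1$, so its score is zero no matter how often it occurs.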

What are the most discriminative words in the first few documents?

In [None]:
# calculate the term frequency of each unique token in the first 100 bags of words.

for doc in word_bag_list[:100]:
    tf = Counter()  # initialize this dict-like thing.
    tfidf = Counter()

    # calculate term frequencies
    # this is similar to the document frequencies, but counts every occurrence within the document.
    for token in doc:
        tf[token] += 1

    # calculate tf-idf scores, using the normalized document frequencies in df
    for token in tf:
        tfidf[token] = tf[token] * -np.log(df[token])

    # this prints the most significant words in the document
    print(tfidf.most_common(5))

# Scikit-Learn

Scikit-Learn comes with utilities to do these calculations for us. 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf_vec = TfidfVectorizer(stop_words='english')
output = tfidf_vec.fit_transform(documents)
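# slice the sparse matrix first; converting the full matrix to a dense array would use a lot of memory
print(output[20:30, :10].toarray())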

In [None]:
print(tfidf_vec.get_stop_words())
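
The values that `TfidfVectorizer` produces will not match the hand-computed scores from earlier. As far as I understand its documented defaults (`smooth_idf=True`, `norm='l2'`), it uses a smoothed idf of $\ln\frac{1 + n}{1 + n_t} + 1$, where $n$ is the number of documents and $n_t$ the number of documents containing the term, and then scales each document vector to unit length. The pure-Python sketch below imitates that weighting on already-tokenized text; the function name `sklearn_style_tfidf` and the toy documents are assumptions for illustration, and scikit-learn's own tokenization and stop-word removal are ignored.

```python
import math
from collections import Counter


def sklearn_style_tfidf(bags):
    """Approximate TfidfVectorizer's default weighting on tokenized documents (illustrative sketch)."""
    n_docs = len(bags)
    doc_counts = Counter()
    for bag in bags:
        doc_counts.update(set(bag))  # number of documents containing each token

    # smoothed idf: ln((1 + n) / (1 + document count)) + 1
    idf = {t: math.log((1.0 + n_docs) / (1.0 + c)) + 1.0 for t, c in doc_counts.items()}

    rows = []
    for bag in bags:
        weights = {t: count * idf[t] for t, count in Counter(bag).items()}
        # L2-normalize so that each document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# toy corpus (illustrative only)
toy_bags = [['great', 'film', 'great', 'cast'], ['dull', 'film']]
for row in sklearn_style_tfidf(toy_bags):
    print(sorted(row.items(), key=lambda item: -item[1]))
```

Because of the added 1, a word that appears in every document ('film' here) still gets a nonzero weight under this scheme, unlike in the plain $-tf \cdot \log{df}$ formula.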