# Stopword lists in NLTK
Learning goals:
 - Think about efficient data structures
 - Know how to measure the runtime of functions and interpret the output


In [None]:
import nltk
stopwords_en = nltk.corpus.stopwords.words('english')

What data type is `stopwords_en` actually?

In [None]:
type(stopwords_en)

What's in it? Let's take a look at every 24th stop word...

In [None]:
stopwords_en[::24]

How many stop words for English do we have in total?

In [None]:
len(stopwords_en)

## What does the following function calculate?

In [None]:
def foo(text):
    """Documentation is missing here... """
    bar = [w for w in text if w.lower() not in stopwords_en]
    return len(bar)/len(text)*100

Question: How can this function be (a) better documented and (b) written more efficiently?

In [None]:
stopwords_en_set = set(stopwords_en)

def non_stopwords_percentage(text):
    """Return the percentage of non-stopwords in a sequence of English tokens."""
    non_stopwords = [w for w in text 
                     if w.lower() not in stopwords_en_set]
    return len(non_stopwords)/len(text)*100.

help(non_stopwords_percentage)

What is the percentage of non-stop words in the Brown corpus? Rounded to one decimal place...

In [None]:
round(non_stopwords_percentage(nltk.corpus.brown.words()),1)

## How do you measure efficiency gains?

In [None]:
import cProfile
brown_words = list(nltk.corpus.brown.words())

Stop words represented as a list

In [None]:
cProfile.run("foo(brown_words)")
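The report printed by `cProfile.run` has one row per function. The most useful columns are `ncalls` (number of calls), `tottime` (time spent in the function itself, excluding calls it makes) and `cumtime` (time including everything the function calls). The following is a minimal sketch of how such a report can be captured and sorted with the standard `pstats` module; `slow_lookup` and its toy data are hypothetical stand-ins and not part of this notebook.

```python
import cProfile
import io
import pstats

def slow_lookup(tokens, vocabulary):
    """Hypothetical stand-in: count tokens that are not in a list-based vocabulary."""
    return sum(1 for t in tokens if t not in vocabulary)

# Toy data (assumption): 20,000 integer "tokens" checked against a 200-item list.
tokens = list(range(20_000))
vocabulary = list(range(200))

profiler = cProfile.Profile()
profiler.enable()
missing = slow_lookup(tokens, vocabulary)
profiler.disable()

# Write the report into a string, sorted by cumulative time, at most 5 rows.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that dominate the overall runtime at the top, which is usually the first thing to look at when comparing two implementations.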

Stop words as a set (membership tests are hash lookups, so performance is similar to dictionary key lookups)

In [None]:
cProfile.run("non_stopwords_percentage(brown_words)")
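`cProfile` adds overhead to every function call, so for a small, isolated comparison the standard `timeit` module can be a useful complement. The sketch below times membership tests against a list and against a set built from the same items. The 200 synthetic words are an assumption chosen to be roughly the size of a stopword list, and the absolute timings depend on the machine.

```python
import timeit

# Synthetic stand-in for a stopword list (assumption): 200 distinct strings.
words_list = [f"word{i}" for i in range(200)]
words_set = set(words_list)

# A token that is not in the collection forces the list to be scanned completely.
token = "notastopword"

list_time = timeit.timeit(lambda: token in words_list, number=20_000)
set_time = timeit.timeit(lambda: token in words_set, number=20_000)

print(f"list: {list_time:.4f} s, set: {set_time:.4f} s")
```

The list lookup is a linear scan over all items, while the set lookup hashes the token once, so its average cost does not grow with the number of stored words.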

## How do I calculate the proportion of content words among all words?
Punctuation tokens are neither stopwords nor content words, so they have to be filtered out as well.

In [None]:
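import re
def delete_punctuation(s):
    """Return s with ASCII and common Latin-1 punctuation characters removed."""
    # The hyphen is escaped so that it is a literal and not a range operator.
    return re.sub(r'''[!'"#%&`()*,\-./:;?@\[\]_{}\xa1\xab\xb7\xbb\xbf]''', '', s)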


In [None]:
def content_word_percentage(text):
    """ Return the percentage of content words in a list of English tokens. """
    content_words = [w for w in text 
                     if delete_punctuation(w) != ''
                     and w.lower() not in stopwords_en_set]
    return len(content_words)/len(text)*100.

In [None]:
round(content_word_percentage(brown_words),1)
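To make the calculation concrete, here is a small worked example on a hand-made token list. The stopword set `toy_stopwords` and the helper `is_punctuation` are hypothetical simplifications (the notebook itself uses the NLTK list and `delete_punctuation`). The sketch also shows how the result changes if punctuation tokens are excluded from the denominator as well.

```python
import string

# Hypothetical miniature stopword set; the notebook uses the NLTK list instead.
toy_stopwords = {"the", "on"}

def is_punctuation(token):
    """Return True if the token consists only of ASCII punctuation characters."""
    return all(ch in string.punctuation for ch in token)

tokens = ["The", "cat", "sat", "on", "the", "mat", "."]

content_words = [t for t in tokens
                 if not is_punctuation(t) and t.lower() not in toy_stopwords]
words_only = [t for t in tokens if not is_punctuation(t)]

# Share of content words among all tokens (punctuation counted in the denominator).
share_of_tokens = len(content_words) / len(tokens) * 100
# Share of content words among word tokens only.
share_of_words = len(content_words) / len(words_only) * 100

print(round(share_of_tokens, 1), round(share_of_words, 1))  # 42.9 50.0
```

Which denominator is appropriate depends on whether "all words" is meant to include punctuation tokens; `content_word_percentage` above divides by the full token count.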

For a more general solution, consider using Unicode character properties. The third-party `regex` module supports Unicode property classes such as `\p{P}` (any punctuation character).

In [None]:
! pip install regex

In [None]:
import regex
def delete_punctuation(s):
    """Return s with all Unicode punctuation characters (category P) removed."""
    return regex.sub(r'\p{P}+','',s)
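If installing a third-party package is not an option, the standard library's `unicodedata` module exposes the same Unicode general categories. The following is a sketch of that alternative, not the `regex`-based approach itself; the name `strip_unicode_punctuation` is made up for illustration.

```python
import unicodedata

def strip_unicode_punctuation(s):
    """Return s without characters whose Unicode general category starts with 'P'."""
    return "".join(ch for ch in s if not unicodedata.category(ch).startswith("P"))

print(strip_unicode_punctuation("«Hello», world!"))  # Hello world
print(strip_unicode_punctuation("¿Qué?"))            # Qué
# The backtick is classified as a symbol (Sk), not as punctuation, so it is kept.
print(strip_unicode_punctuation("``"))
```

Because category P does not cover symbols, a token made of two backticks (used as an opening quotation mark in some corpora) survives this filter, unlike with the hand-written character class. Whether such symbols should count as punctuation is a design decision.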