# Stopword lists in NLTK
NLTK ships stopword lists for a number of languages; here we load the English one...

In [6]:
import nltk
stopwords_en = nltk.corpus.stopwords.words('english')

What data type is `stopwords_en` actually?

In [7]:
type(stopwords_en)

list

What's in it? Let's take a look at every 24th stop word...

In [8]:
stopwords_en[::24]

['i', 'hers', 'were', 'at', 'again', 'own', 'aren', "shan't"]

How many stop words for English do we have in total?

In [9]:
len(stopwords_en)

179

## What does the following function calculate?

In [12]:
def foo(text):
    """Documentation is missing here... """
    bar = [w for w in text if w.lower() not in stopwords_en]
    return len(bar)/len(text)*100

Question: How can this function be (a) better documented and (b) written more efficiently?

In [10]:
stopwords_en_set = set(stopwords_en)

def non_stopwords_percentage(text):
    """Return the percentage of non-stopwords of an English token sequence."""
    non_stopwords = [w for w in text 
                     if w.lower() not in stopwords_en_set]
    return len(non_stopwords)/len(text)*100.

help(non_stopwords_percentage)

Help on function non_stopwords_percentage in module __main__:

non_stopwords_percentage(text)
    Return the percentage of non-stopwords of an English token sequence.
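
As a sketch of how the documentation and efficiency points might be taken one step further, the variant below takes the stopword collection as a parameter, counts with a generator expression instead of building an intermediate list, and returns 0.0 for empty input. The name `non_stopword_share` and the tiny `toy_stopwords` set are illustrative assumptions, not part of NLTK.

```python
def non_stopword_share(tokens, stopwords):
    """Return the percentage of tokens that are not in `stopwords`.

    tokens    -- a sequence of string tokens (must support len())
    stopwords -- a collection of lowercase stopwords; a set or frozenset
                 gives constant-time membership tests on average
    """
    if len(tokens) == 0:
        return 0.0  # avoid ZeroDivisionError on empty input
    # Count the surviving tokens without materialising a list of them.
    kept = sum(1 for w in tokens if w.lower() not in stopwords)
    return kept / len(tokens) * 100.0


# Hypothetical toy data, not the NLTK stopword list.
toy_stopwords = frozenset({"the", "on"})
toy_tokens = ["The", "cat", "sat", "on", "the", "mat"]
print(non_stopword_share(toy_tokens, toy_stopwords))  # 3 of 6 tokens remain
```

Passing the stopwords explicitly also makes the function reusable for other languages, at the cost of a slightly longer call.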



What is the percentage of non-stop words in the Brown corpus? Rounded to one decimal place...

In [13]:
round(foo(nltk.corpus.brown.words()),1)

59.1

## How do you measure efficiency gains?

In [14]:
import cProfile
brown_words = list(nltk.corpus.brown.words())

Stop words represented as a list

In [16]:
cProfile.run("foo(brown_words)")

         1161199 function calls in 1.953 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.947    1.947 <ipython-input-12-3c46754d62ab>:1(foo)
        1    1.853    1.853    1.947    1.947 <ipython-input-12-3c46754d62ab>:3(<listcomp>)
        1    0.006    0.006    1.953    1.953 <string>:1(<module>)
        1    0.000    0.000    1.953    1.953 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1161192    0.095    0.000    0.095    0.000 {method 'lower' of 'str' objects}




Stop words represented as a set (membership tests use hashing, as with dictionary keys, so they take constant time on average instead of scanning the whole list)

In [18]:
cProfile.run("non_stopwords_percentage(brown_words)")

         1161199 function calls in 0.301 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.295    0.295 <ipython-input-10-f482d791b7c1>:3(non_stopwords_percentage)
        1    0.204    0.204    0.295    0.295 <ipython-input-10-f482d791b7c1>:5(<listcomp>)
        1    0.005    0.005    0.301    0.301 <string>:1(<module>)
        1    0.000    0.000    0.301    0.301 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1161192    0.092    0.000    0.092    0.000 {method 'lower' of 'str' objects}
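
The difference between the two profiles comes from the membership test itself: `in` on a list compares against one element after another, whereas `in` on a set is a hash lookup. A minimal way to isolate that effect is `timeit`, sketched here with a hypothetical stand-in list of 179 dummy entries rather than the real NLTK data; absolute timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for the NLTK list: 179 distinct dummy entries.
stop_list = ["stop%d" % i for i in range(179)]
stop_set = set(stop_list)

# A word that is NOT in the collection is the worst case for a list,
# because every element is compared before the search gives up.
word = "corpus"

list_time = timeit.timeit(lambda: word in stop_list, number=20_000)
set_time = timeit.timeit(lambda: word in stop_set, number=20_000)

print("list: %.4f s, set: %.4f s" % (list_time, set_time))
```

Since most tokens in running text are not stopwords, this exhaustive-scan case is the common one for the list version.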




## How do I calculate the proportion of content words in all words?
Punctuation tokens have to be filtered out as well...

In [19]:
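import re
def delete_punctuation(s):
    """Return s with all punctuation characters removed."""
    # The hyphen is escaped so that it is a literal, not part of a ',-.' range.
    return re.sub(r'''[!'"#%&`()*,\-./:;?@[\]_{}\xa1\xab\xb7\xbb\xbf]''','',s)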
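
To see what the punctuation filter does with individual tokens, the following sketch applies it to a few hand-picked examples. The pattern matches the same characters as the one in the cell above; the precompiled `PUNCT` name and the sample tokens are assumptions made for illustration.

```python
import re

# Matches the same characters as the pattern above, compiled once for reuse.
PUNCT = re.compile(r'''[!'"#%&`()*,\-./:;?@[\]_{}\xa1\xab\xb7\xbb\xbf]''')

def delete_punctuation(s):
    """Return s with every punctuation character removed."""
    return PUNCT.sub('', s)

# Show each sample token next to what remains after stripping.
for token in ["``", "--", "U.S.", "can't", "$"]:
    print(repr(token), '->', repr(delete_punctuation(token)))
```

Tokens that consist only of punctuation (such as `` `` `` and `--`) come back as the empty string, which is what the content-word filter tests for. Mixed tokens such as `U.S.` keep their letters, and `$` is left untouched because it is not in the character class.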


In [20]:
def content_word_percentage(text):
    """ Return the percentage of content words in a list of English tokens. """
    content_words = [w for w in text 
                     if delete_punctuation(w) != ''
                     and w.lower() not in stopwords_en_set]
    return len(content_words)/len(text)*100.

In [21]:
round(content_word_percentage(brown_words),1)

46.4
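
Note that `content_word_percentage` divides by `len(text)`, which still counts punctuation tokens. If the share among words only is wanted, the denominator would have to exclude punctuation as well. The sketch below contrasts both readings on a toy sample; the helper `is_punctuation` (based on `string.punctuation`, a simpler ASCII-only stand-in for the regular expression above) and the toy data are assumptions for illustration.

```python
import string

def is_punctuation(token):
    """True if the token consists solely of ASCII punctuation characters.

    An empty token also counts as punctuation here, since all() over an
    empty string is True.
    """
    return all(ch in string.punctuation for ch in token)

# Hypothetical toy data, not the NLTK stopword list or the Brown corpus.
toy_stopwords = frozenset({"the", "on", "and"})
tokens = ["The", "cat", "sat", "on", "the", "mat", ",", "and", "purred", "."]

words = [w for w in tokens if not is_punctuation(w)]
content = [w for w in words if w.lower() not in toy_stopwords]

# Denominator includes punctuation, as in content_word_percentage.
share_of_tokens = len(content) / len(tokens) * 100.0
# Denominator restricted to word tokens.
share_of_words = len(content) / len(words) * 100.0

print(round(share_of_tokens, 1), round(share_of_words, 1))
```

On this sample, 4 content words out of 10 tokens give 40.0, while 4 out of 8 word tokens give 50.0; which denominator is appropriate depends on how the question is read.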