Summary
=======

In this notebook I want to derive an algorithm for detecting [stop words](https://en.wikipedia.org/wiki/Stop_words). My previous algorithm was to perform a simple frequency count of words in a corpus of documents and find the elbow in the plot of sorted counts to detect the stop words. This was effective in detecting a majority (~80%) of words that are considered to be stop words. However, a second filter might be more effective at detecting stop words.

The idea is this: stop words are *syntactic sugar*, that is, they are necessary for writing grammatically correct sentences, but in and of themselves they don't convey much information. An alternative filter is to consider the [Shannon entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) of the conditional probability $$Pr(\text{next word }| \text{stop word})$$ The assertion is that for stop words the entropy should be *high*, because almost any word can follow them, while for words that are not stop words the entropy should be lower.
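As a minimal sketch of this idea, the snippet below computes the entropy of the plain empirical next-word distribution on a toy sentence. The function name `next_word_entropy` and the toy text are assumptions made for illustration, and the estimate here is unsmoothed, unlike the mixture model used later in the notebook.

```python
from collections import Counter, defaultdict
from math import log2

def next_word_entropy(tokens):
    """Map each word to the Shannon entropy (in bits) of its empirical next-word distribution."""
    followers = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current][nxt] += 1

    entropies = {}
    for word, counts in followers.items():
        total = sum(counts.values())
        entropies[word] = -sum((c / total) * log2(c / total) for c in counts.values())
    return entropies

# Toy text (hypothetical): "the" is followed by several different words, "saw" only by "the".
toy = "the whale saw the ship and the crew saw the whale".split()
H = next_word_entropy(toy)
print(H["the"], H["saw"])
```

In this toy text "the" gets a higher entropy than "saw", which is the behaviour the assertion above expects from a stop word.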

**Results**: I compared stop words determined by entropy vs. frequency counts and found that frequency counting is the more accurate algorithm, while using entropy as a secondary filter can result in some false negatives for stop words. Lastly, using a normal model is more stable at finding the elbow in a curve than the geometrical method.

## Estimating the probability distribution for a categorical variable $X$ with sparse data.

In this case $X$ is the variable representing the word that follows a candidate stop word. Since the vocabulary is much larger than the set of words that actually follow a candidate stop word, I will take advantage of my work in this [notebook](https://github.com/rpgomez/data_science_observations/blob/master/Variable%20Order%20Markov%20Model.ipynb) to estimate the conditional probability $$Pr(\text{next word }| \text{stop word})$$ as follows:

For a given candidate stop word,

* take the observed frequency counts of words that follow, $\{C_{j}| j \in \text{vocabulary}\}$
* Compute the mixing weights, $\omega, 1- \omega$ for the model $X = \begin{cases} Y & Pr(X=Y) := \omega \\
Z & Pr(X=Z) = 1 - \omega\end{cases}$ where $$\omega = \frac{1  + \sum_j  C_j}{2 + \sum_j C_j}$$
* Compute the probability distribution for $Pr(Y=j)$:
$$ Pr(Y=j) = \frac{C_j}{\sum_k C_k}$$
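The steps above can be sketched as follows. The function name `mixture_from_counts` and the example counts are assumptions made for illustration; only the two formulas come from the list above.

```python
from collections import Counter

def mixture_from_counts(counts):
    """Return (omega, Pr(Y=j)) for the observed follower counts C_j."""
    total = sum(counts.values())
    omega = (1 + total) / (2 + total)                 # weight of the observed component Y
    pr_y = {j: c / total for j, c in counts.items()}  # Pr(Y=j) = C_j / sum_k C_k
    return omega, pr_y

# Hypothetical follower counts for one candidate stop word.
counts = Counter({"whale": 2, "ship": 1, "crew": 1})
omega, pr_y = mixture_from_counts(counts)
print(omega, pr_y)
```

With few observations $\omega$ stays noticeably below 1, so more mass is reserved for unseen followers; as the counts grow $\omega$ approaches 1.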

Then compute the Shannon Entropy for $X$ as follows:

\begin{align}
Entropy &= -\left(\sum_{X=Y=j} \omega Pr(j)\log(\omega Pr(j))\right) - (1-\omega)\log(1-\omega) \\
&= -\omega\sum_{X=Y=j} Pr(j)\log(\omega Pr(j)) - (1-\omega)\log(1-\omega) \\
&= -\omega\left(\sum_{X=Y=j} Pr(j)\left(\log(\omega) + \log(Pr(j))\right)\right) - (1-\omega)\log(1-\omega) \\
&= -\omega\log(\omega) -\omega\left(\sum_{X=Y=j}Pr(j)\log(Pr(j))\right) - (1-\omega)\log(1-\omega) \\
&= -\omega\left(\sum_{X=Y=j}Pr(j)\log(Pr(j))\right) - \omega\log(\omega) -  (1-\omega)\log(1-\omega) \\
\end{align}
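As a quick numerical check of the derivation, the sketch below compares the entropy computed directly from the mixture probabilities with the final closed form. The function names and the example counts are assumptions made for illustration. Like the derivation, it treats $Z$ as a single outcome with probability $1-\omega$, and it uses base-2 logarithms so the result is in bits.

```python
import numpy as np

def entropy_direct(counts):
    """Entropy of X computed from the full list of outcome probabilities."""
    c = np.asarray(counts, dtype=float)   # assumes every count is positive
    n = c.sum()
    omega = (1 + n) / (2 + n)
    p = np.append(omega * c / n, 1 - omega)
    return -(p * np.log2(p)).sum()

def entropy_closed_form(counts):
    """Last line of the derivation: omega * H(Y) plus the binary entropy of omega."""
    c = np.asarray(counts, dtype=float)
    n = c.sum()
    omega = (1 + n) / (2 + n)
    q = c / n
    h_y = -(q * np.log2(q)).sum()
    h_omega = -(omega * np.log2(omega) + (1 - omega) * np.log2(1 - omega))
    return omega * h_y + h_omega

example_counts = [2, 1, 1]   # hypothetical follower counts
print(entropy_direct(example_counts), entropy_closed_form(example_counts))
```

The two numbers should agree up to floating point error, which confirms that the $\log(\omega)$ term factors out of the sum as claimed.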

## Example: Moby Dick
I'm using the entire corpus of Moby Dick to see if I can detect the English stop words.

In [None]:
%pylab inline

In [None]:
from collections import Counter
from collections import defaultdict

from tqdm.notebook import tqdm

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import string

nltk.download('punkt')

try:
    from bs4 import BeautifulSoup
except ModuleNotFoundError:
    # We don't have BeautifulSoup installed. We also need the lxml module
    !pip install bs4 lxml
    from bs4 import BeautifulSoup


In [None]:
import warnings
warnings.filterwarnings('ignore')

## The code to compute conditional probabilities and log likelihoods:

In [None]:
def construct_counts_dictionary(observed_sequence,d=4,debug=False):
    """Constructs the dictionary of observed counts of words following 
    contexts at most d words long. 
    
    observed_sequence should be a tuple of hashable symbols, one per word
    occurrence (e.g. non-negative integer ids or the word strings themselves).
    It must be a tuple, not a list, because slices of it are used as dictionary keys.
    """
    
    if not debug:
        def use_me(x):
            return x
    else:
        use_me = tqdm
            
    counts = defaultdict(Counter)
    N = len(observed_sequence)
    
    for t in use_me(range(N)):
        obs = observed_sequence[t]
        for l in range(0,d+1):
            if t - l < 0:
                break
            context = observed_sequence[t-l:t]
            counts[context][obs] +=1
    
    return counts
            
def construct_prob_dictionary(counts_dictionary,vocabulary_size,debug=False):
    """Computes the memory efficient dictionary word| context probabilities"""

    V = vocabulary_size
    
    if not debug:
        def use_me(x):
            return x
    else:
        use_me = tqdm
            
    probs_dictionary = dict()
    for context in use_me(counts_dictionary):
        local_list = counts_dictionary[context]
        total_count = sum(list(local_list.values()))
        
        R = len(local_list)
        if R == V:
            omega = 1.0
        else:
            omega = (1+ total_count)/(2 + total_count)
            
        probs_dictionary[context] = defaultdict(float)
        
        local_probs = defaultdict(float)
        for word in local_list:
            c_i = local_list[word]
            pr = c_i/total_count
            local_probs[word] = pr
        
        if R < V:
            local_probs[None] = (1.0 - omega)/(V-R)
        else:
            local_probs[None] = 0.0
        
        probs_dictionary[context] = local_probs
    
    return probs_dictionary

def generate_log_pdf_dict(probs_dictionary,debug=False):
    """Computes the log likelihood dictionary
    log (Pr(symbol|context)) from the probs_dictionary"""
    
    if not debug:
        def use_me(x):
            return x
    else:
        use_me = tqdm
    
    log_pdf_dict = {}
    for context in use_me(probs_dictionary):
        local_pdf_list = probs_dictionary[context]
        local_log_pdf_list = {}
        for asymbol in local_pdf_list:
            local_log_pdf_list[asymbol] = np.log2(local_pdf_list[asymbol])
            
        log_pdf_dict[context] = local_log_pdf_list
    
    return log_pdf_dict

## Finding-the-elbow algorithms:

Here is code for 2 different algorithms for finding the elbow in a concave up/down graph.

1. The first method guesses the split point and assumes the left and right sides can be modeled as
normally distributed populations with different means and variances. It then finds the split that maximizes
the likelihood. That's where the elbow is located.

2. The second method is a geometrical technique: draw a line segment connecting the ends of the graph and find
the point that is furthest in the Euclidean distance sense from the line. That's the elbow.
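To illustrate the first method on data where the answer is known, here is a minimal sketch on synthetic values. The function names, the random seed, and the synthetic populations are assumptions made for illustration; this is not the notebook's own implementation.

```python
import numpy as np

def normal_loglik(ys):
    """Log-likelihood of ys under a normal whose mean and std are fitted to ys."""
    n = ys.shape[0]
    mu, sigma = ys.mean(), ys.std()
    return -0.5 * ((ys - mu) ** 2).sum() / sigma**2 - n * (np.log(sigma) + 0.5 * np.log(2 * np.pi))

def best_split(ys, min_size=2):
    """Index that splits ys into two segments with the highest total log-likelihood."""
    scores = np.full(ys.shape[0], -np.inf)
    for guess in range(min_size, ys.shape[0] - min_size + 1):
        scores[guess] = normal_loglik(ys[:guess]) + normal_loglik(ys[guess:])
    return int(scores.argmax())

# Synthetic data: 60 draws around 0 followed by 40 draws around 5.
rng = np.random.default_rng(0)
ys = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(5.0, 1.0, 40)])
split = best_split(ys)
print(split)
```

With populations this well separated the recovered split should land at or very near index 60, where the second population starts.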

***Weaknesses:*** the location of an elbow should be insensitive to the addition or removal of intervals at the end of the graph. Neither of the 2 techniques is completely insensitive to this, but they can be relatively insensitive.
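To make this weakness concrete for the geometrical method, the sketch below finds the elbow of a smooth synthetic curve twice: once on the whole curve and once after dropping its first half. The function name `chord_elbow` and the exponential test curve are assumptions made for illustration. The sketch measures vertical deviation from the chord, which is proportional to the perpendicular distance and therefore peaks at the same index.

```python
import numpy as np

def chord_elbow(y):
    """Index of the point with the largest vertical deviation from the chord joining the endpoints."""
    y = np.asarray(y, dtype=float)
    x = np.arange(y.shape[0])
    slope = (y[-1] - y[0]) / (y.shape[0] - 1)
    return int(np.abs(y - y[0] - slope * x).argmax())

x = np.arange(100)
curve = np.exp(0.1 * x)   # smooth, concave-up synthetic curve

elbow_full = chord_elbow(curve)            # elbow using the whole curve
elbow_tail = 50 + chord_elbow(curve[50:])  # elbow using only the last 50 points, in original coordinates
print(elbow_full, elbow_tail)
```

The two reported positions differ, so on a smooth curve the elbow moves when the interval changes. This is why the stability of the elbow over a range of interval widths is checked later on.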

In [None]:
def find_elbow_norm(S_n, width=3000,debug=False):
    """Assumes 2 species of sequences, each fitting 
    a model y = ax + b + e_i, e_i ~ Norm(0,sigma_i)
    where x is index n and y_n = log(S_{n})
    
    Then y_{n+1} - y_n = log(S_{n+1}/S_n) = a(n+1 - n) + b - b + e_{n+1} - e_{n}
                                          = a + f_{n+1} ~ Norm(a,sqrt(2) sigma_i)

    This function finds the split by finding the MLE split."""
    def loglikely(ys,mu,sigma):
        """Computes log(Pr(ys)|Norm(mu,sigma))"""
        constant = (np.log(sigma) + 0.5*np.log(np.pi*2))*ys.shape[0]
        logl = -0.5*((ys-mu)**2).sum()/sigma**2 - constant
        return logl
    
    # Takes the last (width - 1) entries of the sequence, in reverse order, to find the split location.
    S = S_n[-1:-width:-1]
    logS = np.log(S)
    logSS = logS[1:] - logS[:-1]
    
    N = logSS.shape[0]
    scores = np.zeros(N-1)
    scores[:2] = -np.inf
    for guess in range(2,N-1):
        popA = logSS[:guess]
        popB = logSS[guess:]
        popA_mu = popA.mean()
        popA_std = popA.std()
        popB_mu = popB.mean()
        popB_std = popB.std()
        
        scores[guess] = loglikely(popA,popA_mu, popA_std) + loglikely(popB,popB_mu,popB_std)
    
    scores[np.isnan(scores)] = -np.inf
    split = scores.argmax() + 1
    if debug:
        return scores
    return -split
        

def find_elbow(S_n):
    """Finds index i where the elbow occurs using the algorithm
    the elbow is the point of the graph furthest from the line
    connecting the end points. 
    
    This technique is inspired by Rolle's theorem and the mean 
    value theorem. We take the graph and rotate it until its
    starting and end point are connected by a horizontal line
    segment, then we find the extremal location."""
    
    S0 = S_n[0]
    SN = S_n[-1]
    N = len(S_n) - 1
    m = (SN - S0)/N
    x = np.arange(S_n.shape[0])
    
    y = np.abs(S_n  - S0 - m*x)
    i = y.argmax()
    return i

def stability_counts(sorted_list,min_interval=1000,max_interval=3000):
    """For each tail end sequence  of length t= min_interval,...,max_interval
    the code will recover the elbow and return the found elbows for each
    possible interval to determine if the elbow is stable."""
    
    elbows = np.array([find_elbow(sorted_list[:-t:-1]) for t in \
                         range(min_interval, max_interval)])
    return elbows

## Downloading and parsing the Moby Dick content:

In [None]:
# download Moby Dick as HTML and then extract the content
!wget -c "https://www.gutenberg.org/files/2701/2701-h/2701-h.htm"
with open('2701-h.htm','r') as f:
    htmltxt = f.read()
soup = BeautifulSoup(htmltxt,'lxml')
content = soup.text.lower()

Now I'm going to remove punctuation and non-word content as I'm only interested in modeling word content.

In [None]:
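# Named allowed_letters to avoid shadowing the built-in filter().
allowed_letters = set(string.ascii_letters)

def filter_nonletters(aletter):
    """Keep ASCII letters; replace every other character with a space."""
    if aletter in allowed_letters:
        return aletter
    return " "

new_content = "".join(filter_nonletters(x) for x in tqdm(content))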

I'll extract the words from the filtered content:

In [None]:
words = word_tokenize(new_content)
print("Length of word sequence: ",len(words))

vocabulary = set(words)
print("Size of vocabulary: ", len(vocabulary))

Now to construct the conditional probabilities and log probabilities:

In [None]:
words_children = construct_counts_dictionary(tuple(words),d=1,debug=False)

probs_words = construct_prob_dictionary(words_children,len(vocabulary),debug=False)

logprob_words = generate_log_pdf_dict(probs_words,debug=False)

And now to construct the entropies for each word:

In [None]:
entropies = {}
for aword in probs_words:
    w = sum(list(words_children[aword].values()))
    w = (1+w)/(2+w)
    constant_term = -(w *np.log2(w) + (1-w)*np.log2(1-w))
    my_probs = probs_words[aword]
    my_logprobs = logprob_words[aword]
    # Skip the None key, which holds the per-word probability of unseen followers.
    my_entropy = -sum([my_probs[next_word]*my_logprobs[next_word] for next_word in my_logprobs \
                      if next_word is not None])
    entropies[aword] = w*my_entropy + constant_term

In [None]:
flattened_entropies = np.array(list(entropies.values()))

## Entropy Graphs
Here we're going to plot the recovered entropies of the conditional probabilities $Pr(\text{next word }| \text{word})$.

In [None]:
sorted_entropies = np.sort(flattened_entropies)

In [None]:
figure(figsize=(16,6))
subplot(1,2,1)
title("Distribution of Word Entropies:")
xlabel("Shannon Bit Entropy")
hist(flattened_entropies,bins=100);

subplot(1,2,2)
title("Sorted Word Entropies")
ylabel("Shannon Bit Entropy")
plot(sorted_entropies);

That sharp rise on the right side of the right graph is what I claim to be the candidate stop words. We're going to take the last 1000 entropies and see how closely the 2 find-the-elbow algorithms agree on the location of the elbow:

In [None]:
elbow_norm = find_elbow_norm(sorted_entropies,width=1000)
elbow_euclid = - find_elbow(sorted_entropies[-1:-1000:-1])

print("Location of the elbow according to normal distribution: ", elbow_norm)
print("Location of the elbow according to geometrical method:  ", elbow_euclid)

So the first method believes the elbow occurs 96 entries from the right, while the geometrical method says it's 157 entries from the right. 

Let's see how stable those values are. I'm going to vary the width of the interval I consider, using the last N entries where $N = 500,\ldots, 2000$.

In [None]:
indices_norm = np.array([find_elbow_norm(sorted_entropies,width=N) for N in tqdm(range(500,2000))])
indices_euclidean = -stability_counts(sorted_entropies,min_interval=500,max_interval=2000)

In [None]:
figure(figsize=(16,5))
subplot(1,2,1)
title("Histogram of elbows according to the normal model:")
xlabel('elbow location')
hist(indices_norm,bins=np.arange(indices_norm.min(),indices_norm.max()+1));
subplot(1,2,2)
title("Histogram of elbows according to the Euclidean method:")
xlabel('elbow location')
hist(indices_euclidean,bins=np.arange(indices_euclidean.min(),indices_euclidean.max()+1));

So it looks like the normal model is more stable, but it is not completely insensitive to the interval size. Let's see if the 2 techniques have any elbow locations in common or at least most in common:

In [None]:
their_norm_locs= np.argwhere(bincount(-indices_norm)>0).flatten()
their_norm_counts = bincount(-indices_norm)[their_norm_locs]

for x,y in zip(their_norm_locs,their_norm_counts):
    print("loc: {0:4d} count: {1:4d}".format(x,y))

In [None]:
their_euclidean_locs= np.argwhere(bincount(-indices_euclidean)>0).flatten()
their_euclidean_counts = bincount(-indices_euclidean)[their_euclidean_locs]

for x,y in zip(their_euclidean_locs,their_euclidean_counts):
    print("loc: {0:4d} count: {1:4d}".format(x,y))

So the 2 methods agree roughly on an elbow location of 95-96 from the right end. However, they agree more strongly on an elbow location of 155-157 from the right end. Let's see what kind of vocabulary is labelled as stop words:

In [None]:
vocab96 = [aword[0] for aword in entropies if entropies[aword]>= sorted_entropies[-96] and aword != ()]
vocab155 = [aword[0] for aword in entropies if entropies[aword]>= sorted_entropies[-155] and aword != ()]

vocab96.sort()
vocab155.sort()

In [None]:
print(" ".join(vocab96))

Words such as *ahab, whale, whales, men, sea, ship* should probably not be considered stop words.

In [None]:
print(" ".join(vocab155))

In addition, words like *Starbuck, dead, fish, flask, queequeg* should probably not be considered stop words either.

## Frequency Count Graphs
Now let's see what happens when we consider frequency counts of words instead.

In [None]:
word_counts = Counter(words)

their_counts = np.array([avalue for avalue in word_counts.values()])
sorted_counts = their_counts.copy()
sorted_counts.sort()

In [None]:
figure(figsize=(16,6))
subplot(1,2,1)
title("Distribution of Frequency Counts:")
xlabel("Frequency Counts")
hist(sorted_counts[-2000:],bins=100);

subplot(1,2,2)
title("Sorted Frequency Counts")
ylabel("Frequency Counts")
plot(sorted_counts[-2000:]);

In [None]:
elbow_norm = find_elbow_norm(sorted_counts,width=1000)
elbow_euclid = - find_elbow(sorted_counts[-1:-1000:-1])

print("Location of the elbow according to normal distribution: ", elbow_norm)
print("Location of the elbow according to geometrical method:  ", elbow_euclid)

In [None]:
indices_norm = np.array([find_elbow_norm(sorted_counts,width=N) for N in tqdm(range(500,2000))])
indices_euclidean = -stability_counts(sorted_counts,min_interval=500,max_interval=2000)

In [None]:
figure(figsize=(16,5))
subplot(1,2,1)
title("Histogram of elbows according to the normal model:")
xlabel('elbow location')
hist(indices_norm,bins=np.arange(indices_norm.min(),indices_norm.max()+1));
subplot(1,2,2)
title("Histogram of elbows according to the Euclidean method:")
xlabel('elbow location')
hist(indices_euclidean,bins=np.arange(indices_euclidean.min(),indices_euclidean.max()+1));

In [None]:
their_norm_locs= np.argwhere(bincount(-indices_norm)>0).flatten()
their_norm_counts = bincount(-indices_norm)[their_norm_locs]

for x,y in zip(their_norm_locs,their_norm_counts):
    print("loc: {0:4d} count: {1:4d}".format(x,y))

In [None]:
their_euclidean_locs= np.argwhere(bincount(-indices_euclidean)>0).flatten()
their_euclidean_counts = bincount(-indices_euclidean)[their_euclidean_locs]

for x,y in zip(their_euclidean_locs,their_euclidean_counts):
    print("loc: {0:4d} count: {1:4d}".format(x,y))

So it looks like the normal model technique is much less sensitive to the interval width. Having said that, the Euclidean method seems to be producing more conservative estimates of the location of the stop words.

In [None]:
vocab40 = [aword for aword in word_counts if word_counts[aword]>= sorted_counts[-40] and aword != ()]
vocab78 = [aword for aword in word_counts if word_counts[aword]>= sorted_counts[-78] and aword != ()]

vocab40.sort()
vocab78.sort()

In [None]:
print(" ".join(vocab40))

In [None]:
print(" ".join(vocab78))

## Intersections
Let's see what words are in common between entropy and frequency counts:

In [None]:
def intersect(list1,list2):
    """Takes the intersection of 2 list of words and returns the words in common."""
    set1, set2 = set(list1), set(list2)
    common = list(set1.intersection(set2))
    common.sort()
    return common

In [None]:
print("entropy 96 and Counts 40:\n", " ".join(intersect(vocab96,vocab40)))
print()
print("entropy 96 and Counts 78:\n", " ".join(intersect(vocab96,vocab78)))

In [None]:
print("entropy 155 and Counts 40:\n", " ".join(intersect(vocab155,vocab40)))
print()
print("entropy 155 and Counts 78:\n", " ".join(intersect(vocab155,vocab78)))

Out of curiosity, which of the stop words found by the normal model on the frequency counts were excluded by the entropy filter?

In [None]:
print(set(vocab78).difference(vocab155))

I would consider those words stop words as well, but I suppose it's better to have false negatives than false positives for stop words.