<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Word-Count" data-toc-modified-id="Word-Count-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Word Count</a></span><ul class="toc-item"><li><span><a href="#Using-Counter" data-toc-modified-id="Using-Counter-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Using <code>Counter</code></a></span></li><li><span><a href="#Adding-Word-Counts-From-Two-Distinct-Datasets-Together" data-toc-modified-id="Adding-Word-Counts-From-Two-Distinct-Datasets-Together-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Adding Word Counts From Two Distinct Datasets Together</a></span></li></ul></li><li><span><a href="#Removing-Stopwords-Using-gensim" data-toc-modified-id="Removing-Stopwords-Using-gensim-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Removing Stopwords Using <code>gensim</code></a></span></li><li><span><a href="#Finding-Similar-Word-Matches-Using-difflib" data-toc-modified-id="Finding-Similar-Word-Matches-Using-difflib-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Finding Similar Word Matches Using <code>difflib</code></a></span></li></ul></div>

# Word Count
## Using `Counter`

A plain dictionary raises a `KeyError` if you try to increment a key that has not been initialized first:

In [4]:
from typing import Dict
ordinary_dict: Dict = dict()
ordinary_dict["yu"] += 1

KeyError: 'yu'

The `Counter` class in `collections` returns 0 for any missing key, so you can increment a key without initializing it first.

In [6]:
from collections import Counter

counter = Counter()
counter["yu"] += 1
counter

Counter({'yu': 1})
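
One detail worth knowing, sketched below with a hypothetical key `"missing"`: reading an absent key from a `Counter` returns 0 but does not store that key, so lookups alone do not grow the counter.

```python
from collections import Counter

counter = Counter()
counter["yu"] += 1

# Reading an absent key returns 0 instead of raising KeyError...
missing_count = counter["missing"]

# ...and the lookup itself does not add the key to the counter.
has_missing = "missing" in counter

print(missing_count, has_missing, len(counter))
```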

You can also pass a list of strings directly to the `Counter` constructor, then call the
`most_common` method to get the most frequent words:

In [20]:
from typing import List
# use a context manager so the file is closed after reading
with open("tale-of-two-cities.txt") as f:
    words: List[str] = f.read().split()
dickens_counter = Counter(words)
dickens_counter.most_common(5)

[('the', 7363), ('and', 4727), ('of', 3944), ('to', 3398), ('a', 2792)]
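
Note that `str.split()` splits on whitespace only, so `The`, `the` and `the,` are counted as different words. If that matters for your corpus, one possible normalization step is sketched below. The helper name `normalized_counts`, the regular expression and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re
from collections import Counter

def normalized_counts(text: str) -> Counter:
    """Count lowercase word tokens, ignoring surrounding punctuation."""
    # [a-z']+ keeps only letters and apostrophes: a simplification for English text
    return Counter(re.findall(r"[a-z']+", text.lower()))

sample = "It was the best of times, it was the worst of times."

raw = Counter(sample.split())      # "times," and "times." are separate keys
clean = normalized_counts(sample)  # both collapse into "times"

print(raw["times"], clean["times"])
```

Whether to lowercase or strip punctuation depends on the task; for example, case can carry meaning for named entities.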

You can also use this counter to quickly find what fraction of all the words in a corpus a given word accounts for:

In [21]:
dickens_counter["the"] / sum(dickens_counter.values())

0.054036400998091885
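
The same calculation can be wrapped in a small helper. This is a minimal sketch: `relative_frequency` and the toy sentence are hypothetical names and data chosen for illustration, and the guard for an empty counter is an added assumption about how you would want that case handled.

```python
from collections import Counter

def relative_frequency(counter: Counter, word: str) -> float:
    """Return the share of all counted tokens that are `word`."""
    total = sum(counter.values())
    # avoid dividing by zero when the counter is empty
    return counter[word] / total if total else 0.0

toy = Counter("the cat sat on the mat".split())
share = relative_frequency(toy, "the")

# format as a percentage for display
print(f"{share:.1%}")
```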

## Adding Word Counts From Two Distinct Datasets Together

We can add two `Counter` objects together to get their combined counts. In this example, we'll load in the `fraudulent_emails.txt` dataset and start a new counter called `email_counter`.

In [23]:
# use a context manager so the file is closed after reading
with open("fraudulent_emails.txt") as f:
    email_counter = Counter(f.read().split())
email_counter.most_common(5)

[('the', 141), ('to', 116), ('I', 115), ('of', 80), ('in', 80)]

In [25]:
combined_counter: Counter = dickens_counter + email_counter
combined_counter.most_common(5)

[('the', 7504), ('and', 4789), ('of', 4024), ('to', 3514), ('a', 2834)]

You can also subtract one counter from another. Only words with a positive remaining count are kept in the result:

In [29]:
# get back the original email_counter
(combined_counter - dickens_counter).most_common(5)

[('the', 141), ('to', 116), ('I', 115), ('of', 80), ('in', 80)]
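
The `-` operator drops any word whose count would fall to zero or below. If you need the signed differences instead, the in-place `Counter.subtract` method keeps them. A small sketch with made-up counts:

```python
from collections import Counter

a = Counter({"the": 3, "cat": 1})
b = Counter({"the": 1, "cat": 2, "dog": 1})

added = a + b         # counts are summed key by key
diff = a - b          # only positive results are kept

signed = Counter(a)   # copy, because subtract() modifies in place
signed.subtract(b)    # zero and negative counts are preserved

print(added)
print(diff)
print(signed)
```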

# Removing Stopwords Using `gensim`

Removing stopwords in `nltk` often means you first have to tokenize the document into distinct tokens, then run each token through to check if it is a stopword. Another commonly used NLP library in Python, `gensim`, has a helper function to do this all in one go:

In [41]:
from gensim.parsing.preprocessing import remove_stopwords

text = '''
Rendered in a manner desperate, by her state and by the beckoning of their conductor,
he drew over his neck the arm that shook upon his shoulder, lifted her a little, and hurried 
her into the room. He sat her down just within the door, and held her, clinging to him.
'''
processed_text = remove_stopwords(text)
processed_text

'Rendered manner desperate, state beckoning conductor, drew neck arm shook shoulder, lifted little, hurried room. He sat door, held her, clinging him.'

Note, however, that this only works well if you are happy with Gensim's predefined list of stopwords. To inspect which stopwords Gensim uses, run:
```python
from gensim.parsing.preprocessing import STOPWORDS
print(STOPWORDS)
```
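
If the predefined list does not fit your data, one option is a small filter of your own. The sketch below uses plain Python rather than Gensim; `MY_STOPWORDS` and `remove_custom_stopwords` are hypothetical names, and the stopword set is a toy example. Unlike the Gensim output above, where `He` and `her,` survive, this version compares a lowercased, punctuation-stripped form of each token.

```python
import string

# toy stopword set for illustration only
MY_STOPWORDS = frozenset({"a", "and", "by", "he", "her", "him", "in", "the", "to"})

def remove_custom_stopwords(text: str, stopwords=MY_STOPWORDS) -> str:
    """Drop whitespace-separated tokens whose bare form is in `stopwords`."""
    kept = []
    for token in text.split():
        # compare a lowercased, punctuation-stripped form, but keep the original token
        bare = token.strip(string.punctuation).lower()
        if bare not in stopwords:
            kept.append(token)
    return " ".join(kept)

result = remove_custom_stopwords("He sat her down, and held her, clinging to him.")
print(result)
```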

# Finding Similar Word Matches Using `difflib`

Within Python's standard library, the `difflib` module has a variety of tools for identifying differences between pieces of text. Its matching is based on an approach closely related to the **Ratcliff-Obershelp algorithm** (also known as gestalt pattern matching), which the documentation describes in brief:

> The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; these “junk” elements are ones that are uninteresting in some sense, such as blank lines or whitespace. (Handling junk is an extension to the Ratcliff and Obershelp algorithm.) The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people. [Link](https://docs.python.org/3/library/difflib.html)
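
To see the matching in action, the sketch below compares two strings with `difflib.SequenceMatcher`. Its `ratio()` method returns `2 * M / T`, where `M` is the number of matched characters and `T` is the combined length of both strings. The example words are chosen for illustration.

```python
from difflib import SequenceMatcher

a, b = "knaght", "knight"
matcher = SequenceMatcher(None, a, b)  # None means no characters are treated as junk

# get_matching_blocks() ends with a zero-length sentinel block, so skip empty ones
blocks = [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]

score = matcher.ratio()

print(blocks)           # the shared runs of characters
print(round(score, 3))  # 2 * 5 matched characters / 12 total characters
```

Here the longest shared run `ght` is found first, then the search recurses to the left and finds `kn`, giving five matched characters out of twelve.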

In [14]:
# load the 20,000 most common English words (one word per line) into a set
with open("20k.txt") as f:
    words = {line.rstrip("\n") for line in f}

In [15]:
import difflib

w = "knaght"
difflib.get_close_matches(w, words)

['knight', 'naughty', 'knights']
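
`get_close_matches` takes two optional parameters: `n` (the maximum number of matches to return, 3 by default) and `cutoff` (the minimum similarity score between 0 and 1, 0.6 by default). The sketch below uses a small made-up vocabulary instead of the 20k word list so that it runs on its own.

```python
import difflib

# toy vocabulary for illustration only
vocabulary = ["knight", "knights", "naughty", "night", "kite"]

# defaults: up to 3 matches scoring at least 0.6, best match first
default_matches = difflib.get_close_matches("knaght", vocabulary)

# stricter: only the single best match, and only if it scores at least 0.8
strict_matches = difflib.get_close_matches("knaght", vocabulary, n=1, cutoff=0.8)

# an empty list is returned when nothing clears the cutoff
no_matches = difflib.get_close_matches("xyzzy", vocabulary)

print(default_matches, strict_matches, no_matches)
```

Raising `cutoff` trades recall for precision: fewer typos get a suggestion, but the suggestions that remain are closer to the input.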

You can combine this with a tokenizer to create your own (very basic) spellcheck function:

In [32]:
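from nltk.tokenize import word_tokenize

def spellcheck_document(text):
    """Replace each unknown token with its closest match in `words`, if any."""
    new_tokens = []
    for token in word_tokenize(text):
        lowered = token.lower()
        # leave known words untouched without searching for matches
        if lowered in words:
            new_tokens.append(token)
            continue
        # n=1 returns only the best candidate; cutoff=0.7 rejects weak matches
        matches = difflib.get_close_matches(lowered, words, n=1, cutoff=0.7)
        new_tokens.append(matches[0] if matches else token)
    return " ".join(new_tokens)

spellcheck_document("He is a craezy perzon")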

'He is a crazy person'
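
If you would rather not depend on `nltk`, a rough variant using only the standard library is sketched below. The function name `spellcheck_text`, the regular expression used as a tokenizer, and the toy vocabulary are all assumptions made for illustration. The regex splits text into runs of word characters and single punctuation marks, which is much cruder than `word_tokenize`.

```python
import difflib
import re

def spellcheck_text(text, vocabulary, cutoff=0.7):
    """Replace unknown alphabetic tokens with their closest vocabulary match."""
    corrected = []
    # crude tokenizer: runs of word characters, or single punctuation marks
    for token in re.findall(r"\w+|[^\w\s]", text):
        lowered = token.lower()
        # skip punctuation, numbers, and words that are already known
        if not token.isalpha() or lowered in vocabulary:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)

# toy vocabulary for illustration only
toy_vocabulary = {"he", "is", "a", "crazy", "person"}
fixed = spellcheck_text("He is a craezy perzon", toy_vocabulary)
print(fixed)
```

Because the tokens are rejoined with single spaces, punctuation ends up separated from the preceding word, just as in the `nltk` version above.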