# Lecture 30 Notes

## Example: A Spell Checker

Let's write a program that prints the misspelled words in a text file. The
general structure of the program is as follows:

- Read a list of correctly spelled words.
- Read the text file, splitting it into words.
- For each word in the text file, if it's *not* in the list of correctly spelled
  words, append it to a list of misspelled words.
- Print the list of misspelled words.

First, let's get a list of correctly spelled words.

### A List of Correctly Spelled Words

We'll use [enable1.txt](enable1.txt) as our list of correctly spelled words. Any
string *not* in this list will be considered misspelled.

[enable1.txt](enable1.txt) has 172,820 unique English words, one per line:

In [6]:
words = open('enable1.txt').readlines()

print('First 10 words:')
for line in words[:10]:
    print(line.strip())

print()
print('Last 10 words:')
for line in words[-10:]:  # print the last 10 words
    print(line.strip())

First 10 words:
aa
aah
aahed
aahing
aahs
aal
aalii
aaliis
aals
aardvark

Last 10 words:
zymology
zymosan
zymosans
zymoses
zymosis
zymotic
zymurgies
zymurgy
zyzzyva
zyzzyvas


Notice that the word "a" is *not* in [enable1.txt](enable1.txt)! Thus a spell
checker based on this list will treat a sentence like "This is a test." as
having a misspelled word, "a".

You could easily fix this by adding "a" to the file, but we will keep it as-is.
With a bit of searching you can find much better spelling dictionaries online.

### Efficiently Checking for Misspelled Words

To efficiently check if a word is misspelled, we'll read the words of
[enable1.txt](enable1.txt) as keys into a dictionary:

In [7]:
all_words = {}
word_file = open('enable1.txt')
for w in word_file:
    w = w.strip()
    all_words[w] = 1  # only the key matters; the value 1 is a placeholder

print(len(all_words), 'words in spelling dictionary')

172820 words in spelling dictionary


Checking if a given word is a key in a dictionary is very fast:

In [10]:
# press enter without typing a word to quit
word = input('--> ').strip()
while word != '':
    if word in all_words:
        print(word, 'is in the dictionary')
    else:
        print(word, 'is NOT in the dictionary')
    word = input('--> ').strip()

pizza is in the dictionary
shoebox is NOT in the dictionary
candy is in the dictionary
tesla is in the dictionary
monorepo is NOT in the dictionary


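The expression `word in all_words` returns `True` if `word` is a key in the
dictionary `all_words`, and `False` otherwise. Python dictionaries are
implemented as hash tables, so this check takes roughly the same small amount of
time no matter how many keys the dictionary holds.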
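
To get a feel for the difference, here is a rough timing sketch that compares
membership tests on a list, a dictionary, and a set. The 100,000 made-up strings
are an assumption standing in for the real word list (so the example runs
without enable1.txt), and the exact timings will vary from machine to machine.

```python
import timeit

# Hypothetical stand-in for the real word list: 100,000 made-up "words".
word_list = [f'word{i}' for i in range(100_000)]
word_dict = {w: 1 for w in word_list}  # same layout as all_words
word_set = set(word_list)              # a set stores only the keys

target = 'word99999'  # the last item: the worst case for searching a list

# Time 100 membership tests on each container
list_time = timeit.timeit(lambda: target in word_list, number=100)
dict_time = timeit.timeit(lambda: target in word_dict, number=100)
set_time = timeit.timeit(lambda: target in word_set, number=100)

print(f'list: {list_time:.6f} seconds')
print(f'dict: {dict_time:.6f} seconds')
print(f'set:  {set_time:.6f} seconds')
```

A set behaves like a dictionary that has keys but no values, so it would also
work for the spelling dictionary.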

### Reading the Words of a Text File

We want to extract all the words from a given text file. Perhaps the simplest
way to do that is as follows:

- Read the entire file as a single string.
- Split the string into words using the `split` method.
- Clean up the words by:
  - Converting them to lowercase (our spelling dictionary is all lowercase).
  - Removing any non-letter characters.
  - Deleting any empty strings.

In [16]:
text = open('joke.txt').read() # read the entire file into a string
words = text.split()           # split the string into a list of words

# remove any non-alphabetic characters from each word
for i in range(len(words)):
    words[i] = words[i].strip().lower()
    for c in words[i]:
        if c not in 'abcdefghijklmnopqrstuvwxyz':
            words[i] = words[i].replace(c, '')

# remove any empty strings
# Note that this can be done more simply with a list comprehension:
# words = [w for w in words if w != '']
result = []
for w in words:
    if w != '':
        result.append(w)
words = result

# print the words
for w in words:
    print(f'"{w}"')

"whos"
"there"
"a"
"broken"
"pencil"
"a"
"broken"
"pencil"
"who"
"never"
"mind"
"its"
"pointless"


### Putting It All Together

Now we have all the pieces we need to write a basic spell checker:

In [19]:
def spell_check(filename):
    """Returns a list of the misspelled words in filename.
    """
    # Open the spelling dictionary and read it into a dictionary
    all_words = {}
    word_file = open('enable1.txt')
    for w in word_file:
        w = w.strip()
        all_words[w] = 1
    
    # Open the given file and split it into words
    text = open(filename).read()
    words = text.split()
    
    # Remove non-alphabetic characters from each word
    for i in range(len(words)):
        words[i] = words[i].strip().lower()
        for c in words[i]:
            if c not in 'abcdefghijklmnopqrstuvwxyz':
                words[i] = words[i].replace(c, '')

    # Remove any empty strings
    result = []
    for w in words:
        if w != '':
            result.append(w)
    words = result

    # Find any misspelled words
    misspelled = []
    for w in words:
        if w not in all_words:
            misspelled.append(w)
    
    return misspelled

joke_misspelled = spell_check('joke.txt')
print(joke_misspelled)

dracula_misspelled = spell_check('dracula.txt')
print(f'{len(dracula_misspelled)} misspelled words in dracula.txt')
print('Here are the first 5:', dracula_misspelled[:5])

['whos', 'a', 'a']
11976 misspelled words in dracula.txt
Here are the first 5: ['dracula', 'bram', 'jonathan', 'harkers', 'jonathan']
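
Notice that the returned list can contain the same word more than once: "a" and
"jonathan" both appear twice in the output. One possible way to summarize the
results is to count each distinct word with `collections.Counter`. The short
list here is copied from the output above so that the sketch runs on its own.

```python
from collections import Counter

# The first five misspelled words reported for dracula.txt
misspelled = ['dracula', 'bram', 'jonathan', 'harkers', 'jonathan']

counts = Counter(misspelled)  # maps each word to how many times it occurs
print(len(counts), 'distinct misspelled words')
for word, count in counts.most_common():
    print(word, count)
```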


### Possible Improvements

This is only a basic spell checker. There are many ways to extend it:

- Include the line number of each misspelled word. This will help the user find
  it in the file.
- Check for common words that are not in a dictionary: names (of people,
  countries, companies, etc.), abbreviations, contractions, numbers, HTML tags,
  etc.
- Suggest corrections for misspelled words. Sometimes a misspelled word is
  simply a typo, and suggesting similar-looking words can be helpful.
- Use a better spelling dictionary. Online you can find word lists specifically
  designed for spell-checking.
- Allow spell-checking in different languages. For instance, to spell-check
  German text, use a German word list.
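
Two of these ideas, reporting line numbers and suggesting corrections, can be
sketched together. This is a rough illustration under stated assumptions, not a
finished design: `known_words` is a tiny made-up dictionary standing in for
`all_words`, `clean` and `spell_check_lines` are hypothetical helper names, and
the suggestions come from the standard library's `difflib.get_close_matches`,
which is just one of several ways to find similar-looking words.

```python
import difflib
import string

# Hypothetical miniature dictionary; the real program would use all_words.
known_words = {'the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog'}

def clean(raw):
    """Lowercase raw and drop every character that is not a-z."""
    return ''.join(c for c in raw.lower() if c in string.ascii_lowercase)

def spell_check_lines(lines, known):
    """Return a (line_number, word, suggestions) tuple per misspelled word."""
    candidates = sorted(known)  # get_close_matches is given a list of strings
    report = []
    for line_number, line in enumerate(lines, start=1):
        for raw in line.split():
            word = clean(raw)
            if word != '' and word not in known:
                # up to 3 known words that look similar to the misspelled one
                suggestions = difflib.get_close_matches(word, candidates, n=3)
                report.append((line_number, word, suggestions))
    return report

sample_lines = ['The quick brwon fox', 'jumps ovr the lazy dog.']
for line_number, word, suggestions in spell_check_lines(sample_lines, known_words):
    print(f'line {line_number}: "{word}" (did you mean one of {suggestions}?)')
```

To check a real file, the lines could come from `open(filename).readlines()`.
With a large word list, `get_close_matches` may be noticeably slow, because it
compares the misspelled word against every candidate.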