# CSC 820 Homework 3  
Andrew Dahlstrom  
2/21/2024

In [1]:
import nltk
from nltk.corpus import gutenberg
import pandas as pd
import Levenshtein 

# Load Lewis Carroll's Alice in Wonderland from NLTK's Project Gutenberg corpus into a list-like object
corpus = gutenberg.words('carroll-alice.txt') 
corpus_name = "Alice in Wonderland"
# Use the built-in set() constructor to collect the unique words (types) from the corpus
types = set(corpus)

# Total number of tokens in the corpus (gutenberg.words() also returns punctuation marks as tokens)
N = len(corpus)

# Create a dataframe to hold the types from the corpus
corpus_table = pd.DataFrame(types, columns=['word'])
corpus_table.head()

|   | word |
|---|------|
| 0 | squeaked |
| 1 | already |
| 2 | cupboards |
| 3 | laughing |
| 4 | against |


In [2]:
# Create a dataframe to store all the tokens from the corpus in a column
token_table = pd.DataFrame(corpus, columns=['word'])

# Create a series to hold the frequency counts of each word from the token dataframe
word_freq = token_table['word'].value_counts()

# Map the word frequency to the corresponding word in the type table under
# the new frequency column
corpus_table['frequency'] = corpus_table['word'].map(word_freq)

# Create a new column that contains the probability of the word occurring in the corpus
# by dividing each word frequency by the total number of tokens in the corpus.
corpus_table['probability'] = corpus_table['frequency'] / N

corpus_table.head()

|   | word | frequency | probability |
|---|------|-----------|-------------|
| 0 | squeaked | 1 | 2.9e-05 |
| 1 | already | 2 | 5.9e-05 |
| 2 | cupboards | 2 | 5.9e-05 |
| 3 | laughing | 1 | 2.9e-05 |
| 4 | against | 9 | 0.000264 |
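
The relative-frequency estimate built in the two cells above can also be sketched without pandas. The snippet below is only an illustration under stated assumptions: the `tokens` list is a hypothetical toy corpus standing in for `gutenberg.words(...)`, and `collections.Counter` replaces the `value_counts()` and `map()` steps.

```python
from collections import Counter

# Hypothetical toy token list (an assumption for illustration only);
# punctuation is kept as tokens, as gutenberg.words() does.
tokens = ["the", "cat", "saw", "the", "hat", ".", "the", "end", "."]

# Total number of tokens in the toy corpus
N = len(tokens)

# Frequency of each type (unique token)
counts = Counter(tokens)

# Probability of each type = frequency / total number of tokens
probabilities = {word: count / N for word, count in counts.items()}

print(f"tokens: {N}, types: {len(counts)}")
print(f"P('the') = {probabilities['the']}")
```

Because every token is counted exactly once, the probabilities sum to 1, and a type that occurs once has probability 1/N. That is why every frequency-1 type in the table above shows the same value, 2.9e-05.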


In [3]:
# Function declaration for word_search. Input is a word; the function checks
# corpus_table for it. If the word is found, its probability is printed;
# otherwise the five closest types by Levenshtein distance are printed.
def word_search(word):
    # result is a filtered dataframe returning only the row corresponding to word.
    result = corpus_table[corpus_table['word'] == word]
    
    if result.empty:
        closest_types = lev_distance(word, corpus_table)
        #print(f"The word '{word}' is not present in the corpus of '{corpus_name}'.")
        for index, row in closest_types.iterrows():
            print(f"Type: {row['word']}, Probability: {row['probability']}")
    else:
        # Get only the corresponding probability value with iloc[0]
        prob = result['probability'].iloc[0]
        print(f"'{word}' is a complete and correct word as per corpus '{corpus_name}', and its probability is '{prob}'.")

In [4]:
# Function declaration for lev_distance, which calculates Levenshtein distances using the
# third-party Levenshtein package. Input is a word and a dataframe containing the corpus types, frequencies and probabilities.
# Returns a dataframe containing the five types most similar to the input string as per the Levenshtein distance.
def lev_distance(word, corpus_table):
    # Copy the corpus table into the result table and create a new column for Levenshtein distances filled with placeholders.
    result_table = corpus_table.copy()
    result_table['Lev_Distance'] = 0
    
    # Iterate through the word column of the result table calculating
    # the Levenshtein distances between the input word and corpus words and
    # update result table inplace.
    for index, row in result_table.iterrows():
        distance = Levenshtein.distance(word, row['word'])
        result_table.at[index, 'Lev_Distance'] = distance
    
    # Sort dataframe by shortest distance 
    result_table.sort_values(by='Lev_Distance', ascending=True, inplace=True)
    
    # Return only the top five closest types from the corpus
    result_table = result_table.head(5)
    
    return result_table
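
For readers who want to see what `Levenshtein.distance` computes, here is a minimal pure-Python sketch of the standard dynamic-programming recurrence, together with a top-k selection. It is not the library's implementation; the names `edit_distance` and `closest_types` are hypothetical helpers introduced only for this illustration.

```python
import heapq

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, keeping only two DP rows."""
    # previous[j] holds the distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Distance between a[:i] and the empty string is i deletions
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match when cost is 0)
            ))
        previous = current
    return previous[-1]

def closest_types(word, types, k=5):
    """Return the k types with the smallest edit distance to word."""
    return heapq.nsmallest(k, types, key=lambda t: edit_distance(word, t))

print(edit_distance("kitten", "sitting"))
print(closest_types("hat", ["cat", "dog", "house"], k=1))
```

Using `heapq.nsmallest` avoids sorting the whole vocabulary when only the five best candidates are needed. Types with equal distance can come back in a different order than `sort_values` would give.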

In [5]:
# Initialize program in main
if __name__ == "__main__":
    while True: 
        user_input = input(f"Begin by entering a word to search for it in '{corpus_name}' or 'exit' to exit the program.")
        if user_input == 'exit':
            break
        word_search(user_input)

Begin by entering a word to search for it in 'Alice in Wonderland' or 'exit' to exit the program. Alice


'Alice' is a complete and correct word as per corpus 'Alice in Wonderland', and its probability is '0.011609498680738786'.


Begin by entering a word to search for it in 'Alice in Wonderland' or 'exit' to exit the program. rabbit


'rabbit' is a complete and correct word as per corpus 'Alice in Wonderland', and its probability is '0.0001465845793022574'.


Begin by entering a word to search for it in 'Alice in Wonderland' or 'exit' to exit the program. cat


'cat' is a complete and correct word as per corpus 'Alice in Wonderland', and its probability is '0.0003224860744649663'.


Begin by entering a word to search for it in 'Alice in Wonderland' or 'exit' to exit the program. time


'time' is a complete and correct word as per corpus 'Alice in Wonderland', and its probability is '0.001993550278510701'.


Begin by entering a word to search for it in 'Alice in Wonderland' or 'exit' to exit the program. mad


'mad' is a complete and correct word as per corpus 'Alice in Wonderland', and its probability is '0.00041043682204632074'.


Begin by entering a word to search for it in 'Alice in Wonderland' or 'exit' to exit the program. homework


Type: somewhere, Probability: 5.8633831720902964e-05
Type: someone, Probability: 2.9316915860451482e-05
Type: home, Probability: 0.0001465845793022574
Type: memory, Probability: 2.9316915860451482e-05
Type: work, Probability: 0.00023453532688361186


Begin by entering a word to search for it in 'Alice in Wonderland' or 'exit' to exit the program. pizza


Type: pine, Probability: 2.9316915860451482e-05
Type: sizes, Probability: 2.9316915860451482e-05
Type: puzzle, Probability: 2.9316915860451482e-05
Type: pigs, Probability: 0.00017590149516270889
Type: size, Probability: 0.00038111990618586926


Begin by entering a word to search for it in 'Alice in Wonderland' or 'exit' to exit the program. Italy


Type: scaly, Probability: 2.9316915860451482e-05
Type: tale, Probability: 8.795074758135444e-05
Type: stalk, Probability: 2.9316915860451482e-05
Type: talk, Probability: 0.00041043682204632074
Type: stay, Probability: 0.0001465845793022574


Begin by entering a word to search for it in 'Alice in Wonderland' or 'exit' to exit the program. imagination


Type: invitation, Probability: 5.8633831720902964e-05
Type: explanation, Probability: 5.8633831720902964e-05
Type: Ambition, Probability: 2.9316915860451482e-05
Type: imagine, Probability: 2.9316915860451482e-05
Type: variations, Probability: 2.9316915860451482e-05


Begin by entering a word to search for it in 'Alice in Wonderland' or 'exit' to exit the program. exit


# Summary

This program extracts the tokens and types from a corpus, in this case Lewis Carroll's "Alice in Wonderland", and builds a dataframe containing each unique word (type) and its frequency in the corpus. Dividing each frequency by the total number of tokens gives the probability of that word occurring in the corpus.

The program prompts the user for a word and checks whether it appears in the corpus dataframe exactly as entered. If it does, the program displays a message confirming that the word is in the corpus, along with its probability. If it does not, the program computes the Levenshtein distance between the input word and every type in the corpus, then prints the five closest types (those with the shortest distance) with their corresponding probabilities.

Improvements to the program could include normalizing words to remove minor discrepancies such as letter case and punctuation, or applying more advanced techniques such as stemming and lemmatization. Another improvement would be to optimize the Levenshtein distance calculation, since iterating over the dataframe row by row for each comparison is computationally expensive.
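
As a rough sketch of the normalization idea, the snippet below lowercases tokens and drops tokens that contain no letters before counting. It is a proposed change, not part of the submitted program; `normalize` and `raw_tokens` are hypothetical names, and the rule "keep a token only if it contains at least one alphabetic character" is an assumption chosen for simplicity.

```python
from collections import Counter

def normalize(tokens):
    """Lowercase tokens and drop those with no alphabetic characters."""
    return [t.lower() for t in tokens if any(ch.isalpha() for ch in t)]

# Hypothetical toy tokens mixing case variants and punctuation
raw_tokens = ["Alice", "was", "ALICE", ",", "alice", "!", "Was"]

# Case variants collapse into a single type; punctuation tokens are removed
counts = Counter(normalize(raw_tokens))
print(counts)
```

The user's input would need the same normalization before the lookup, so that a query such as "Italy" is compared against lowercased types rather than the original mixed-case ones.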

The most challenging part of the program was the lev_distance function, which is where I encountered the most problems. Iterating through the rows of a dataframe and updating column values in place required some research, as did understanding the data types returned by the NLTK methods and the Levenshtein distance function. The pairwise distance calculation was also confusing at first and took some trial and error.

