# Motivation

I want to include some word definitions in the Boggle Vision app. In order to do that, I need to have _some_ local store of all of the definitions. I'll create that in this notebook!

I found [this repo that has a JSON version of Webster's dictionary](https://github.com/ssvivian/WebstersDictionary), and I'm going to use it to fetch the definitions for the words.


# Setup

The cells below will set up the rest of the notebook.

I'll start by configuring my kernel:


In [1]:
# Change the cwd to the parent of the current dir
%cd .. 

# Enable the autoreload extension so that code can change
%load_ext autoreload
%autoreload 2

d:\data\programming\boggle-vision\boggle-vision-prototyping


Next, I'm going to run through some import statements:


In [2]:
# Import statements
import json
import pandas as pd

Finally, I'm going to load in some important files - namely, the list of allowed words, and the Webster's dictionary.


In [3]:
# Load in the list of allowed words
with open("data/scrabble-dictionary.json", "r") as f:
    allowed_words = json.load(f)

# Load in the Webster's dictionary
with open("data/websters-dictionary.json", "r") as f:
    websters_dictionary = json.load(f)

# Transforming Data

The main thing that I need to do in this notebook: determine which words from the Scrabble dictionary have definitions.


In [35]:
# Make a DataFrame version of the allowed_words list
allowed_words_df = pd.DataFrame(allowed_words, columns=["word"])

# Make a DataFrame version of the websters_dictionary list
websters_dictionary_df = pd.DataFrame.from_records(websters_dictionary)

# Drop the "synonyms" column from the websters_dictionary_df
websters_dictionary_df.drop(columns=["synonyms"], inplace=True)

# Make the word column a lower case string
websters_dictionary_df["word"] = websters_dictionary_df["word"].str.lower()

# Merge the allowed_words_df with the websters_dictionary_df
merged_websters_dictionary_df = websters_dictionary_df.merge(
    allowed_words_df, on="word", how="outer", indicator="exists_in_scrabble_dict"
)


# Convert the merge indicator into a boolean flag: True when the word
# appears in both the Webster's dictionary and the Scrabble dictionary
merged_websters_dictionary_df["exists_in_scrabble_dict"] = (
    merged_websters_dictionary_df["exists_in_scrabble_dict"] == "both"
)

# Add a "length" column to the merged_websters_dictionary_df
merged_websters_dictionary_df["length"] = merged_websters_dictionary_df[
    "word"
].str.len()
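
As a quick illustration of what the `indicator` argument produces, here is a minimal sketch on toy data. The frames and the `source` column name below are made up for the example and are not part of the real notebook data.

```python
import pandas as pd

# Toy stand-ins for the two sources (hypothetical data, not the real files)
toy_websters_df = pd.DataFrame(
    {"word": ["aback", "aback", "zebra"], "pos": ["adv.", "n.", "n."]}
)
toy_scrabble_df = pd.DataFrame({"word": ["aback", "qi"]})

# An outer merge keeps every row from both frames; the indicator column
# records where each row came from
toy_merged_df = toy_websters_df.merge(
    toy_scrabble_df, on="word", how="outer", indicator="source"
)

# "both" = found in both frames, "left_only" = Webster's only,
# "right_only" = Scrabble only
source_counts = toy_merged_df["source"].value_counts().to_dict()
print(source_counts)
```

Only the `both` rows are words that are allowed in Scrabble _and_ have a definition, which is why the flag above checks for that value.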

How many of the Scrabble words don't have dictionary definitions?


In [36]:
merged_websters_dictionary_df["exists_in_scrabble_dict"].value_counts()

exists_in_scrabble_dict
False    178693
True      60353
Name: count, dtype: int64

Quite a few rows lack a match: only 60,353 of the merged rows appear in both sources. I'm going to take an aggressive approach and drop _every_ word that doesn't have a definition, along with every word under 4 letters long.
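
One caveat: because the merge is an outer merge, the `False` count also includes Webster's entries that aren't Scrabble words, and Webster's can have several rows per word. A rough sketch of counting the undefined Scrabble words directly with sets is below. It uses toy data, and it assumes the Scrabble list is a flat list of lower-case strings and that each Webster's record has a `"word"` key.

```python
# Toy stand-ins (hypothetical data); the real inputs would be
# allowed_words and websters_dictionary
toy_allowed_words = ["aback", "abaca", "qi", "zax"]
toy_websters = [
    {"word": "Aback", "pos": "adv."},
    {"word": "Aback", "pos": "n."},
    {"word": "Abaca", "pos": "n."},
]

# Unique, lower-cased headwords that have at least one Webster's entry
defined_words = {entry["word"].lower() for entry in toy_websters}

# Scrabble words with no matching Webster's headword
undefined_scrabble_words = sorted(set(toy_allowed_words) - defined_words)
print(len(undefined_scrabble_words), undefined_scrabble_words)
# prints: 2 ['qi', 'zax']
```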


In [46]:
# Create a filtered version of the merged_websters_dictionary_df
filtered_merged_websters_dictionary_df = merged_websters_dictionary_df.query(
    "length >= 4 and exists_in_scrabble_dict"
).copy()

# Show the first 3 rows of the filtered_merged_websters_dictionary_df
filtered_merged_websters_dictionary_df.head(3)

|    | pos  | word  | definitions | exists_in_scrabble_dict | length |
|---:|------|-------|-------------|-------------------------|-------:|
| 14 | n.   | abaca | [The Manila-hemp plant (Musa textilis); also, ... | True | 5 |
| 19 | adv. | aback | [Toward the back or rear; backward. "Therewith... | True | 5 |
| 20 | n.   | aback | [An abacus. [Obs.] B. Jonson.] | True | 5 |


Next, I'll add readable part-of-speech labels and pull out a single definition string for each entry.


In [54]:
# Create a dictionary mapping pos tags to the actual part of speech
pos_tag_to_label_dict = {
    "n.": "noun",
    "adv.": "adverb",
    "prep.": "preposition",
    "v.": "verb",
    "a.": "adjective",
    "p.": "participle",
    "interj.": "interjection",
    "conj.": "conjunction",
    "pron.": "pronoun",
}

filtered_merged_websters_dictionary_df[
    "pos_label"
] = filtered_merged_websters_dictionary_df["pos"].apply(
    lambda x: pos_tag_to_label_dict.get(x, "unknown")
)

# Add a column that contains a string of the first definition
filtered_merged_websters_dictionary_df["definition_str"] = (
    filtered_merged_websters_dictionary_df["definitions"].str[0]
)
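
To show how the two new columns behave on edge cases, here is a small sketch on toy data. The rows, the `"xyz."` tag, and the shortened definitions are invented for illustration. It uses `Series.map` plus `fillna` for the label lookup and `.str[0]` for the first list element, which I believe is equivalent to the row-wise approach for non-empty lists.

```python
import pandas as pd

# Toy rows (hypothetical data): one unmapped tag and one empty definition list
toy_df = pd.DataFrame(
    {
        "word": ["aback", "abaca", "abask"],
        "pos": ["adv.", "n.", "xyz."],
        "definitions": [["Toward the back.", "Behind."], ["A plant."], []],
    }
)
toy_pos_labels = {"adv.": "adverb", "n.": "noun"}

# Tags missing from the mapping become NaN, then fall back to "unknown"
toy_df["pos_label"] = toy_df["pos"].map(toy_pos_labels).fillna("unknown")

# .str[0] takes the first element of each list; an empty list yields NaN
toy_df["definition_str"] = toy_df["definitions"].str[0]

print(toy_df[["word", "pos_label", "definition_str"]])
```

Any entry whose tag isn't in the mapping (compound tags, for instance) ends up labelled `unknown`, so it may be worth checking how many rows land in that bucket.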

# Saving Data

Now that I've got a trimmed-down dictionary, I'm going to save it locally. I'll use this trimmed-down file in the production app.


In [59]:
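# Create a dictionary that maps each word to its part of speech and definition.
# A word with several Webster's entries (e.g. "aback") keeps only its last
# entry, because later keys overwrite earlier ones.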
word_to_definition_dict = {
    row.word: {
        "pos": row.pos_label,
        "definition": row.definition_str,
    }
    for row in filtered_merged_websters_dictionary_df[
        ["word", "pos_label", "definition_str"]
    ].itertuples()
}

# Now, save this dictionary to a JSON file
with open("data/word_to_definition.json", "w") as f:
    json.dump(word_to_definition_dict, f)
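
If the app ever needs every part of speech for a word, one possible alternative is to store a list of entries per word instead of a single entry. This is only a sketch of that idea, not what the notebook saves: the `toy_rows` tuples and the `word_to_entries` name are hypothetical, and the "abaca" definition is shortened for illustration.

```python
import json
from collections import defaultdict

# Toy (word, pos_label, definition_str) rows (hypothetical data)
toy_rows = [
    ("aback", "adverb", "Toward the back or rear; backward."),
    ("aback", "noun", "An abacus."),
    ("abaca", "noun", "The Manila-hemp plant."),
]

# Collect every entry for a word instead of overwriting earlier ones
word_to_entries = defaultdict(list)
for word, pos_label, definition_str in toy_rows:
    word_to_entries[word].append({"pos": pos_label, "definition": definition_str})

# Round-trip through JSON to confirm the structure survives serialization
serialized = json.dumps(word_to_entries)
restored = json.loads(serialized)
print(len(restored["aback"]))
```

The trade-off is a slightly larger file and a lookup that returns a list, so the app would need to decide which entry to display.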