# Motivation
In Notebook 11, I used [a GitHub repo that contained different words and their definitions](https://github.com/ssvivian/WebstersDictionary). The problem: many of these words were obscure, and their definitions weren't very good.

So, instead, I'm going to try to put together my own dictionary, mostly based on [Princeton's WordNet](https://wordnet.princeton.edu/). I found [a repo that contains all of the words in a more parsable JSON format](https://github.com/fluhus/wordnet-to-json). In addition, I found [a library called word-forms](https://pypi.org/project/word-forms/) that tries to generate some of the different inflections of a given word.

In this notebook, I'm going to try and create a more robust dictionary. 

# Setup
The cells below will help to set up the rest of my notebook. 

I'll start by configuring my kernel:

In [1]:
# Changing the cwd to the root of the project
%cd ..

# Enabling the autoreload extension
%load_ext autoreload
%autoreload 2

d:\data\programming\boggle-vision\boggle-vision-prototyping


Now, I'm going to load in the necessary libraries:

In [2]:
# Import general libraries
import json
import pandas as pd
from pathlib import Path
from tqdm import tqdm
from word_forms.word_forms import get_word_forms
from Levenshtein import ratio

# Loading in the WordNet Files
I'm going to kick things off by parsing all of the WordNet JSON files.

In [3]:
# Collect a list of smaller DataFrames that we'll merge together eventually
wordnet_df_list = []

# Iterate through each of the JSON files in the data/wordnet-json-files directory
for child_file in tqdm(list(Path("data/wordnet-json-files").iterdir())):
    if child_file.suffix == ".json":
        with open(child_file, "r", encoding="utf-8") as f:
            cur_file_data = json.load(f)

        # Create a DataFrame from the data in the JSON file and store it
        wordnet_df_list.append(
            pd.DataFrame.from_records(list(cur_file_data.values()))
        )

# Concatenate all of the DataFrames together into a single DataFrame
wordnet_df = pd.concat(wordnet_df_list)

100%|██████████| 26/26 [00:03<00:00,  7.90it/s]


We'll also want a dictionary mapping each word to the first definition associated with each of the possible parts of speech: 

In [19]:
def generate_first_word_for_pos(meanings_list):
    
    # Iterate through each of the meanings, and extract the first one corresponding to each pos tag
    first_meaning_for_pos = {}
    for meaning in meanings_list:
        if meaning["part_of_speech"] not in first_meaning_for_pos:
            first_meaning_for_pos[meaning["part_of_speech"]] = meaning["def"]
            
    # Return the resulting dictionary
    return first_meaning_for_pos

# Create a mapping of word --> {part of speech: first definition for that part of speech}
word_to_pos_and_def_mapping = {row.word: generate_first_word_for_pos(row.meanings) for row in wordnet_df.itertuples()}
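
To make the shape of this mapping concrete, here's a minimal sketch that runs the same "first definition per part of speech" logic on a made-up `meanings` list. The sample entries are hypothetical; only the `part_of_speech` and `def` keys come from the code above, and `first_meaning_per_pos` is just an illustrative name.

```python
def first_meaning_per_pos(meanings_list):
    """Keep only the first definition seen for each part of speech."""
    first_meaning_for_pos = {}
    for meaning in meanings_list:
        # setdefault only stores the value if the key isn't present yet
        first_meaning_for_pos.setdefault(meaning["part_of_speech"], meaning["def"])
    return first_meaning_for_pos


# Hypothetical sample data (not taken from WordNet)
sample_meanings = [
    {"part_of_speech": "noun", "def": "a short walk"},
    {"part_of_speech": "verb", "def": "to move on foot"},
    {"part_of_speech": "noun", "def": "a path for walking"},
]

sample_result = first_meaning_per_pos(sample_meanings)
print(sample_result)
# {'noun': 'a short walk', 'verb': 'to move on foot'}
```

The second noun definition is dropped because a noun definition was already stored.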

Next, we're going to clean up `wordnet_df` a bit by doing the following:

- Extract the first part of speech and definition from the `meanings` column
- Remove any word that has non-alphabetic characters (e.g., compound words, words with apostrophes, etc.)

In [4]:
# Create a copy that we'll modify
cleaned_wordnet_df = wordnet_df.copy()

# Add the "pos" and "definition" columns (taken from the first meaning of each word)
cleaned_wordnet_df["pos"] = cleaned_wordnet_df["meanings"].apply(lambda x: x[0]['part_of_speech'])
cleaned_wordnet_df["definition"] = cleaned_wordnet_df["meanings"].apply(lambda x: x[0]['def'])

# Drop the "meanings" and "pos_order" columns
cleaned_wordnet_df = cleaned_wordnet_df.drop(columns=["meanings", "pos_order"])

# Filter out words that aren't entirely made with alphabetic characters
# (note: str.isalpha is also False for an empty string)
cleaned_wordnet_df = cleaned_wordnet_df[cleaned_wordnet_df["word"].apply(str.isalpha)]
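
As a quick sanity check on the alphabetic filter, here's a small sketch on a handful of made-up words (the word list is hypothetical). It also shows one thing to keep in mind: `str.isalpha` accepts accented letters, so if only plain A-Z letters are wanted, combining it with `str.isascii` is one possible option.

```python
# Hypothetical sample words, not taken from WordNet
sample_words = ["apple", "ice-cream", "o'clock", "hot dog", "A1", "café", "zebra"]

# Same idea as the filter above: keep words made only of alphabetic characters
kept_words = [word for word in sample_words if word.isalpha()]
print(kept_words)
# ['apple', 'café', 'zebra']

# Stricter variant (an assumption about what might be wanted): ASCII letters only
ascii_only_words = [word for word in sample_words if word.isascii() and word.isalpha()]
print(ascii_only_words)
# ['apple', 'zebra']
```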

# Morphological Generation
Now, I'm going to use the `word-forms` library to try and find different forms of all of the words in the `cleaned_wordnet_df`. 

I'll start by trying to "expand" all of the words that are already in the `cleaned_wordnet_df`:

In [21]:
# We're going to store new word additions in a dictionary
new_word_forms = {}

# We'll create a set containing all of the words in the dictionary; this can
# be used for easily checking if a word is already in the dictionary
initial_word_set = set(cleaned_wordnet_df["word"])

# We're going to map the pos tags to their full names
pos_mapping = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "r": "adverb",
}

# Iterate through each of the rows in the DataFrame and determine the extra forms of the word
for row in tqdm(list(cleaned_wordnet_df.itertuples())):
    cur_word_forms = get_word_forms(row.word)

    # Iterate through all of the forms, and check if they ought to be added to the dictionary
    for pos, form_set in cur_word_forms.items():
        for word_form in form_set:
            if word_form not in initial_word_set:
                # We're only going to add it if the original word has a meaning with this pos
                full_pos = pos_mapping[pos]
                if full_pos in word_to_pos_and_def_mapping[row.word]:
                    
                    # Extract the definition of the original word's pos
                    cur_word_def = word_to_pos_and_def_mapping[row.word][full_pos]
                    
                    # If this word isn't already in the dictionary, we'll add it
                    if word_form not in new_word_forms:
                        new_word_forms[word_form] = {
                            "word": word_form,
                            # Store the full pos name so it matches the pos values of the original words
                            "pos": full_pos,
                            "definition": cur_word_def,
                            "linked_word": row.word,
                        }

                    # If it's already in there, we'll check which linked word is closer, and
                    # then update the entry if the new one is closer
                    else:
                        cur_linked_word = new_word_forms[word_form]["linked_word"]
                        if ratio(row.word, word_form) > ratio(
                            cur_linked_word, word_form
                        ):
                            new_word_forms[word_form]["linked_word"] = row.word
                            new_word_forms[word_form]["pos"] = full_pos
                            new_word_forms[word_form]["definition"] = cur_word_def

# Create an "expanded" wordnet_df that contains all of the new words as well as the old ones
expanded_wordnet_df = pd.concat(
    [
        pd.DataFrame.from_records(list(new_word_forms.values())),
        pd.DataFrame.from_records(
            [
                {
                    "word": row.word,
                    "pos": row.pos,
                    "definition": row.definition,
                    "linked_word": None,
                }
                for row in cleaned_wordnet_df.itertuples()
            ]
        ),
    ]
)

100%|██████████| 78925/78925 [00:12<00:00, 6129.62it/s]
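
To illustrate the "closer linked word wins" rule from the loop above, here's a minimal sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `Levenshtein.ratio`, so the exact scores may differ from the ones the notebook computes; `similarity` and `pick_linked_word` are hypothetical helper names, and the words are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for Levenshtein.ratio: a score between 0 and 1 (higher = more similar)
    return SequenceMatcher(None, a, b).ratio()


def pick_linked_word(word_form, current_linked_word, candidate_word):
    """Return whichever base word is more similar to word_form.

    On a tie, the current linked word is kept (strict > comparison, as in the loop above).
    """
    if similarity(candidate_word, word_form) > similarity(current_linked_word, word_form):
        return candidate_word
    return current_linked_word


# "walked" is first linked to "talk", then "walk" shows up as a candidate
chosen = pick_linked_word("walked", "talk", "walk")
print(chosen)
# walk
```

Because the comparison is strict, whichever base word was seen first stays linked when two candidates score the same.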


# Saving a Dictionary
Now that we've created the `expanded_wordnet_df`, we can save it: 

In [23]:
with open("data/word_to_definition.json", "w") as json_file:
    json.dump({
        row.word: {
            "pos": row.pos,
            "definition": row.definition,
        }
        for row in expanded_wordnet_df.itertuples()
    }, json_file, indent=2)
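
For reference, here's a minimal sketch of how a file in this shape could be read back and queried. It writes a tiny made-up dictionary to a temporary directory rather than touching `data/word_to_definition.json`; the sample entries are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical entries in the same {word: {"pos": ..., "definition": ...}} shape
sample_dictionary = {
    "walk": {"pos": "verb", "definition": "to move on foot"},
    "walked": {"pos": "verb", "definition": "to move on foot"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "word_to_definition.json"

    # Save the dictionary the same way as above
    with open(sample_path, "w", encoding="utf-8") as json_file:
        json.dump(sample_dictionary, json_file, indent=2)

    # Load it back
    with open(sample_path, "r", encoding="utf-8") as json_file:
        loaded_dictionary = json.load(json_file)

# Look up a word that exists, and one that doesn't
print(loaded_dictionary["walked"]["definition"])
missing_entry = loaded_dictionary.get("qzx")
print(missing_entry)
```

One thing to note about the save step: since the output is keyed by word, any word that appears in more than one row of the DataFrame ends up with only the last row's `pos` and `definition`.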