# Motivation
In this notebook, I'm going to analyze the results of some of the Boggle simulation runs I produced in **Notebook 10: Simulating Boggle Games**. My laptop runs out of RAM on simulation runs of 1 million+ games, so in order to analyze things at a large scale, I'll need to combine the results of several smaller simulation runs.

# Setup
The cells below will set up the rest of the notebook. 

I'll start by configuring my kernel:

In [1]:
# Change the cwd to the root of the project
%cd ..

# Enable the autoreload of modules
%load_ext autoreload
%autoreload 2

d:\data\programming\boggle-vision\boggle-vision-prototyping


Next, I'm going to import some necessary libraries:

In [2]:
# Import general modules
import pandas as pd
import random
from tqdm import tqdm
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import datetime
import plotly.express as px

# Importing custom-built modules
from utils.board_solving import (
    parse_board_from_letter_sequence,
    solve_boggle,
    allowed_words_trie,
    score_boggle_word
)

# Loading Data
First, I'm going to load all of the simulation results I've produced up until now. 

I've saved `.json` files containing aggregated simulation word counts. I'll load all of them in below, and parse them as I load them. 

Once all of the results are loaded, I'll add some extra columns.

In [3]:
# Iterate through each .json file in the data/simulations folder
total_games = 0
word_ct_dict = {}
for path in tqdm(list(Path("data/simulations").glob("*.json"))):
    # Load the data
    with open(path, "r") as f:
        data = json.load(f)

    # Add the number of games to the total
    total_games += data.get("n_games", 0)

    # Iterate through each (word, count) pair stored under word_ct
    for word, ct in data.get("word_ct", []):
        # Add the word to the word_ct_dict
        word_ct_dict[word] = word_ct_dict.get(word, 0) + ct

# Make a DataFrame of the word_ct_dict
simulation_df = pd.DataFrame.from_records(
    [(word, ct) for word, ct in word_ct_dict.items()], columns=["word", "ct"]
)

# Add some additional columns
simulation_df["length"] = simulation_df["word"].apply(len)
simulation_df["points"] = simulation_df["word"].apply(score_boggle_word)

# Determine the percentage of games that each word was found in
simulation_df["pct_games"] = simulation_df["ct"] / total_games

# Calculate the z-score of each word
simulation_df["z_score"] = (
    simulation_df["pct_games"] - simulation_df["pct_games"].mean()
) / simulation_df["pct_games"].std()



100%|██████████| 14/14 [00:03<00:00,  4.18it/s]
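
To make the merge step concrete, here's a minimal sketch of the same idea on two tiny, made-up runs, using only the standard library. The schema (an `n_games` integer plus `word_ct` as a list of `[word, count]` pairs) is an assumption inferred from the loading loop above, and all of the numbers are invented for illustration.

```python
from collections import Counter
from statistics import mean, stdev

# Two hypothetical run files, already parsed from JSON. The schema is an
# assumption inferred from the loading loop, and the counts are made up.
runs = [
    {"n_games": 100, "word_ct": [["tea", 40], ["eat", 35], ["teas", 5]]},
    {"n_games": 300, "word_ct": [["tea", 110], ["eat", 85], ["seat", 20]]},
]

# Games add up across runs
total_games = sum(run["n_games"] for run in runs)

# Word counts add up across runs too; words missing from a run contribute 0
merged = Counter()
for run in runs:
    for word, ct in run["word_ct"]:
        merged[word] += ct

# Share of games in which each word appeared
pct_games = {word: ct / total_games for word, ct in merged.items()}

# z-score of each word's share, using the sample standard deviation
# (the same default that pandas' Series.std() uses)
mu = mean(pct_games.values())
sigma = stdev(pct_games.values())
z_scores = {word: (pct - mu) / sigma for word, pct in pct_games.items()}

for word in sorted(z_scores, key=z_scores.get, reverse=True):
    print(f"{word}: {pct_games[word]:.4f} of games, z = {z_scores[word]:+.2f}")
```

Because the counts and game totals are summed independently, it doesn't matter how the games were split across runs: dividing the merged count by the merged total gives the same share as one big run would.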


Now that I've added all of those stats, I'm going to assign a "rarity" to each word. I chose the thresholds after a little manual inspection of the z-score distribution!

In [4]:
# This function will categorize a word as "Common", "Uncommon", "Rare", or "Very Rare"
def categorize_word(z_score):
    """
    Categorize a word based on its z-score.

    :param z_score: The z-score of the word
    :return: The category of the word ('Common', 'Uncommon', 'Rare', 'Very Rare')
    """
    if z_score > 0:
        return "Common"
    elif z_score > -0.15:
        return "Uncommon"
    elif z_score > -0.23:
        return "Rare"
    else:
        return "Very Rare"
    
    
# Apply the categorize_word function to the DataFrame
simulation_df["rarity"] = simulation_df["z_score"].apply(categorize_word)

# Print out the summary statistics of the rarity
simulation_df["rarity"].value_counts()

rarity
Very Rare    86285
Rare         41268
Common       23559
Uncommon     14073
Name: count, dtype: int64
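
Since the thresholds were picked by eye, they may need retuning later. One option is to keep the cutoffs in a single list and look the label up with the standard library's `bisect` module. The sketch below is just that, a sketch: `categorize_word_bisect`, `THRESHOLDS`, and `LABELS` are hypothetical names, and the `categorize_word` copy is only there so the two versions can be compared.

```python
from bisect import bisect_left

# Inclusive upper bounds for every category except the last, in ascending
# order. These are the same cutoffs used in categorize_word.
THRESHOLDS = [-0.23, -0.15, 0.0]
LABELS = ["Very Rare", "Rare", "Uncommon", "Common"]


def categorize_word_bisect(z_score):
    """Hypothetical table-driven equivalent of categorize_word."""
    # bisect_left counts the thresholds strictly below z_score, so a score
    # sitting exactly on a cutoff lands in the rarer category, matching the
    # strict ">" comparisons in the if/elif chain.
    return LABELS[bisect_left(THRESHOLDS, z_score)]


def categorize_word(z_score):
    """Copy of the if/elif version, included only for comparison."""
    if z_score > 0:
        return "Common"
    elif z_score > -0.15:
        return "Uncommon"
    elif z_score > -0.23:
        return "Rare"
    else:
        return "Very Rare"


# Compare both versions on scores at and around each cutoff
samples = [0.5, 0.0, -0.1, -0.15, -0.2, -0.23, -0.239972]
results = [categorize_word_bisect(z) for z in samples]
assert results == [categorize_word(z) for z in samples]
```

With this layout, changing a cutoff or adding a fifth category means editing the two lists rather than the branching logic.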

# Saving Data
Now that I've calculated all of this information about the words, I'd like to save it. 

In [5]:
# Make a dictionary, keyed by the word, of all of the rarity stats from the simulation_df
rarity_dict = simulation_df.set_index("word").to_dict(orient="index")

# Now, save a dictionary with both the rarity_dict and the total_games
with open("data/word_rarity.json", "w") as f:
    json.dump({"rarity_dict": rarity_dict, "total_games": total_games}, f, indent=2)


# Save a second copy into the API's data folder
with open("../boggle-vision-app/boggle-vision-api/data/word_rarity.json", "w") as f:
    json.dump({"rarity_dict": rarity_dict, "total_games": total_games}, f, indent=2)
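
For reference, here's a rough sketch of how the saved file could be read back and queried. It writes a tiny made-up payload to a temporary folder instead of touching the real `word_rarity.json`; the `lookup_rarity` helper is hypothetical, and the payload only includes a few of the columns that `to_dict(orient="index")` would actually emit.

```python
import json
import tempfile
from pathlib import Path

# Made-up values in the same nested shape that the cell above writes out:
# {"rarity_dict": {word: {column: value, ...}}, "total_games": int}
payload = {
    "rarity_dict": {
        "tea": {"ct": 150, "length": 3, "pct_games": 0.375, "rarity": "Common"},
    },
    "total_games": 400,
}


def lookup_rarity(saved, word, default="Unknown"):
    """Hypothetical helper: return the rarity label for a word, if present."""
    return saved["rarity_dict"].get(word, {}).get("rarity", default)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "word_rarity.json"

    # Write the payload the same way the notebook does
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)

    # Read it back once, then answer lookups from memory
    with open(path, "r") as f:
        saved = json.load(f)

known = lookup_rarity(saved, "tea")
missing = lookup_rarity(saved, "zzz")
print(known, missing)
```

Keying the dictionary by word keeps each lookup to a single dictionary access, which is presumably why the DataFrame is re-indexed by `word` before saving.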

# MISC

In [6]:
simulation_df.sort_values("length", ascending=False).head(10)

| | word | ct | length | points | pct_games | z_score | rarity |
|---|---|---|---|---|---|---|---|
| 156759 | subordinatenesses | 1 | 17 | 34 | 7.692308e-07 | -0.239972 | Very Rare |
| 152721 | tabernaemontanas | 1 | 16 | 32 | 7.692308e-07 | -0.239972 | Very Rare |
| 154824 | indecisivenesses | 1 | 16 | 32 | 7.692308e-07 | -0.239972 | Very Rare |
| 158727 | slatternlinesses | 1 | 16 | 32 | 7.692308e-07 | -0.239972 | Very Rare |
| 126168 | indelicatenesses | 1 | 16 | 32 | 7.692308e-07 | -0.239972 | Very Rare |
| 165050 | lightheartedness | 1 | 16 | 32 | 7.692308e-07 | -0.239972 | Very Rare |
| 142212 | destalinisations | 1 | 16 | 32 | 7.692308e-07 | -0.239972 | Very Rare |
| 163342 | unsusceptibility | 1 | 16 | 32 | 7.692308e-07 | -0.239972 | Very Rare |
| 150207 | internationalist | 1 | 16 | 32 | 7.692308e-07 | -0.239972 | Very Rare |
| 150208 | internationalise | 1 | 16 | 32 | 7.692308e-07 | -0.239972 | Very Rare |
