# Motivation
In this notebook, I'm going to analyze the results of some of the Boggle simulation runs I produced in **Notebook 10: Simulating Boggle Games**. My laptop runs out of RAM on simulation runs of 1 million+ games, so to analyze things at a large scale, I'll need to combine the results of several smaller simulation runs.

# Setup
The cells below will set up the rest of the notebook. 

I'll start by configuring my kernel:

In [1]:
# Change the cwd to the root of the project
%cd ..

# Enable the autoreload of modules
%load_ext autoreload
%autoreload 2

d:\data\programming\boggle-vision\boggle-vision-prototyping


Next, I'm going to import some necessary libraries:

In [2]:
# Import general modules
import pandas as pd
import random
from tqdm import tqdm
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import datetime

# Importing custom-built modules
from utils.board_solving import (
    parse_board_from_letter_sequence,
    solve_boggle,
    allowed_words_trie,
    score_boggle_word
)

# Loading Data
First, I'm going to load all of the simulation results I've produced up until now. 

Each simulation is saved as a JSON export of a DataFrame (`orient="records"`). As I load each one, I'll also add back the two columns that I stripped before saving: `points` and `length`.

In [3]:
# Declaring the path to the data folder
simulation_data_path = "data/simulations/"

# Iterate through the simulation files and load them into a DataFrame
simulation_df_list = []
for simulation_file in tqdm(list(Path(simulation_data_path).glob("*.json"))):
    with open(simulation_file, "r") as f:
        # Load the simulation data into a DataFrame
        cur_simulation_df = pd.DataFrame(json.load(f))

        # Parse the timestamp from the file name, which looks like
        # solved_boards_2023-12-05_01-01-26.json
        simulation_date_time_str = datetime.datetime.strptime(
            simulation_file.name, "solved_boards_%Y-%m-%d_%H-%M-%S.json"
        ).strftime("%Y-%m-%d %H:%M:%S")

        # Suffix each board_id with the timestamp so that IDs stay unique
        # across simulation runs (vectorized instead of a row-wise apply)
        cur_simulation_df["board_id"] = (
            cur_simulation_df["board_id"].astype(str) + "_" + simulation_date_time_str
        )

        # Add the simulation DataFrame to the list
        simulation_df_list.append(cur_simulation_df)

# Concatenate the simulation DataFrames into a single DataFrame
simulation_df = pd.concat(simulation_df_list, ignore_index=True)

# Add the length and points columns back
simulation_df["length"] = simulation_df["word"].apply(len)
simulation_df["points"] = simulation_df["word"].apply(score_boggle_word)

 50%|█████     | 1/2 [00:13<00:13, 13.60s/it]
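To make the "strip and restore" step concrete, here is a minimal sketch on toy data. It round-trips a small DataFrame through `orient="records"` JSON and then recomputes `length` and `points` from the words alone. The function `score_word_standard` is a hypothetical stand-in that assumes standard Boggle scoring (1 point for 3-4 letters, then 2, 3, 5, and 11 points for 8+ letters); the real `score_boggle_word` in `utils.board_solving` may behave differently.

```python
import json

import pandas as pd


def score_word_standard(word: str) -> int:
    """Hypothetical stand-in for score_boggle_word (standard Boggle scoring assumed)."""
    n = len(word)
    if n < 3:
        return 0
    if n <= 4:
        return 1
    if n == 5:
        return 2
    if n == 6:
        return 3
    if n == 7:
        return 5
    return 11


# Toy data standing in for one simulation file (derived columns already stripped)
toy_df = pd.DataFrame({"board_id": [0, 0, 1], "word": ["cat", "stream", "notebook"]})

# Serialize as a list of row records, then load it back the same way the notebook does
records_json = toy_df.to_json(orient="records")
restored_df = pd.DataFrame(json.loads(records_json))

# Recompute the derived columns from the word itself
restored_df["length"] = restored_df["word"].str.len()
restored_df["points"] = restored_df["word"].apply(score_word_standard)

print(restored_df)
```

Because both columns are pure functions of `word`, dropping them before saving loses no information and keeps the files smaller.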

# Collecting Statistics
Now that I've simulated all of the Boggle boards and solved them, I want to collect some statistics about them. Here are things I'm interested in: 

- Frequency statistics for each word
- Avg. total points available
- Avg. total words available
- Avg. # of 8+ length words

I'll start by calculating the frequency stats for each word: 

In [None]:
# We're going to store the number of times each word was found in a board
word_appearances_dict = {}

# Iterate through each of the solved boards (one group per board_id)
for board_idx, solved_board_df in tqdm(simulation_df.groupby("board_id")):
    # Extract the words from this board
    words = solved_board_df["word"].tolist()

    # Iterate through each of the words and update the dictionary
    for word in words:
        word_appearances_dict[word] = word_appearances_dict.get(word, 0) + 1

# Create a DataFrame from the dictionary
word_appearances_df = pd.DataFrame.from_records(
    [
        {"word": word, "appearances": appearances, "length": len(word)}
        for word, appearances in word_appearances_dict.items()
    ]
)

# Sort the DataFrame by the number of appearances
word_appearances_df = word_appearances_df.sort_values(
    by=["appearances", "length", "word"], ascending=[False, True, True]
)

# Add a column indicating the likelihood of each word appearing,
# i.e. the share of all loaded boards that contain the word
word_appearances_df["prob_of_appearance"] = (
    word_appearances_df["appearances"] / simulation_df["board_id"].nunique()
)
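As a possible alternative to the explicit loop, the same frequency table can be sketched with a single `groupby`. This is only an illustration on toy data: the `toy_` names are made up for the example, and it assumes the long format used above (one row per board and word). Note that `nunique` counts distinct boards per word, which matches the loop only if each word is listed at most once per board.

```python
import pandas as pd

# Toy long-format data: one row per (board, word) pair
toy_simulation_df = pd.DataFrame(
    {
        "board_id": ["a", "a", "b", "b", "c"],
        "word": ["tea", "eat", "tea", "ate", "tea"],
    }
)

# Total number of boards, used as the denominator for the probability
n_boards = toy_simulation_df["board_id"].nunique()

# Count the number of distinct boards in which each word appears
toy_word_appearances_df = (
    toy_simulation_df.groupby("word")["board_id"]
    .nunique()
    .rename("appearances")
    .reset_index()
)
toy_word_appearances_df["length"] = toy_word_appearances_df["word"].str.len()
toy_word_appearances_df["prob_of_appearance"] = (
    toy_word_appearances_df["appearances"] / n_boards
)

# Same ordering as above: most frequent first, then shorter words, then alphabetical
toy_word_appearances_df = toy_word_appearances_df.sort_values(
    by=["appearances", "length", "word"], ascending=[False, True, True]
).reset_index(drop=True)

print(toy_word_appearances_df)
```

On millions of rows, a grouped aggregation like this usually runs much faster than a Python-level loop, though I haven't benchmarked it on the real simulation data.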

What are some of the most common words? 

In [None]:
# Show the first 10 words, sorted by the number of appearances
word_appearances_df.head(10)

How about the least common words?

In [None]:
# Show the last 10 words, sorted by the number of appearances
word_appearances_df.tail(10)

Now: I want to determine some stats about the point distributions associated with each of the boards:

In [None]:
def extract_board_stats(board_idx, solved_board_df):
    """
    Helper function that extracts summary stats from a single solved board.
    The board's ID is passed in explicitly so that worker threads don't read
    the loop variable from the enclosing scope.
    """

    # Calculate some stats about the current board
    total_points = solved_board_df["points"].sum()
    num_words = len(solved_board_df)
    eleven_pointers = len(solved_board_df.query("length >= 8"))

    # Return a dictionary containing the stats
    return {
        "board_idx": board_idx,
        "total_points": total_points,
        "num_words": num_words,
        "eleven_pointers": eleven_pointers,
    }


# We're going to store each of the boards' stats in a list
board_stats_df_records = []

# Parallelize the board stats extraction (one task per board_id group)
futures = {}
with ThreadPoolExecutor(max_workers=32) as executor:
    for board_idx, solved_board_df in tqdm(simulation_df.groupby("board_id")):
        futures[board_idx] = executor.submit(
            extract_board_stats, board_idx, solved_board_df
        )

    # Collect the results in submission order
    for board_idx, future in tqdm(list(futures.items())):
        board_stats_df_records.append(future.result())

# Finally, make a DataFrame from the records
board_stats_df = pd.DataFrame.from_records(board_stats_df_records)
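The averages I listed at the start of this section (total points, total words, and 8+ letter words per board) could then be read off the per-board stats. Below is a small sketch on toy data that builds the same three per-board columns with a named aggregation and takes their means. The `toy_` names and the `is_eleven_pointer` helper column are illustrative assumptions, not part of the pipeline above.

```python
import pandas as pd

# Toy long-format data with the restored length and points columns
toy_simulation_df = pd.DataFrame(
    {
        "board_id": ["a", "a", "a", "b", "b"],
        "word": ["tea", "stream", "notebook", "eat", "teaching"],
        "length": [3, 6, 8, 3, 8],
        "points": [1, 3, 11, 1, 11],
    }
)

# Per-board stats: total points, number of words, and number of 8+ letter words
toy_board_stats_df = (
    toy_simulation_df.assign(is_eleven_pointer=toy_simulation_df["length"] >= 8)
    .groupby("board_id")
    .agg(
        total_points=("points", "sum"),
        num_words=("word", "count"),
        eleven_pointers=("is_eleven_pointer", "sum"),
    )
    .reset_index()
)

# Averages across all boards
average_stats = toy_board_stats_df[
    ["total_points", "num_words", "eleven_pointers"]
].mean()

print(toy_board_stats_df)
print(average_stats)
```

A grouped aggregation like this also avoids submitting one thread task per board, which may matter when the number of boards gets large.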