# Motivation
After analyzing the statistics of the Boggle games I'd hand-labeled, I realized that if I record the letters on each side of every tile in the actual game, I can simulate as many games as I want.

# Setup
The cells below will set up the rest of the notebook. 

I'll start by configuring the kernel: 

In [1]:
# Change directory to the path above
%cd ..

# Enable the autoreload of modules
%load_ext autoreload
%autoreload 2

d:\data\programming\boggle-vision\boggle-vision-prototyping


Next, I'm going to import some necessary modules:

In [2]:
# Import general modules
import pandas as pd
import random
from tqdm import tqdm
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import datetime
import time

# Importing custom-built modules
from utils.board_solving import (
    parse_board_from_letter_sequence,
    solve_boggle,
    allowed_words_trie,
)

Finally, I'm going to load in the necessary data: 

In [3]:
# Read in the data on Boggle letter frequencies
boggle_tile_letters_df = pd.read_excel("data/super-big-boggle-tile-letters.xlsx")

# Setting Up Simulations
Now I'm going to create a method that generates a random Boggle board from these letter distributions.

In [4]:
# Create a dictionary mapping each tile_idx to the possible letters
tile_idx_to_possible_letters_dict = {
    row["tile_idx"]: [row[f"side_{side_idx+1}"] for side_idx in range(6)]
    for idx, row in boggle_tile_letters_df.iterrows()
}


def simulate_boggle_board(tile_idx_to_possible_letters_dict):
    """
    This method will simulate a Boggle board by randomly selecting letters
    from each possible letter set for each tile.
    """

    # Select one of the possible letters for each tile
    letter_choices = [
        random.choice(possible_letters).lower()
        for tile_idx, possible_letters in tile_idx_to_possible_letters_dict.items()
    ]

    # Now, randomize the order of the letters
    random.shuffle(letter_choices)

    # Return the letter choices
    return letter_choices
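
Because the board generator draws from the global `random` state, repeated runs produce different boards. Here is a minimal sketch of a reproducible variant, assuming a caller-supplied `random.Random` instance. The `toy_tiles` dictionary and the `simulate_board_seeded` name are hypothetical and only illustrate the idea; the real tile letters come from the spreadsheet loaded above.

```python
import random

# Hypothetical toy tile set (3 tiles, 6 sides each), for illustration only.
toy_tiles = {0: list("AAEEGN"), 1: list("ABBJOO"), 2: list("ACHOPS")}


def simulate_board_seeded(tiles, rng):
    """Same idea as simulate_boggle_board, but draws from a caller-supplied RNG."""
    # Pick one side per tile, then shuffle the tile positions
    letters = [rng.choice(sides).lower() for sides in tiles.values()]
    rng.shuffle(letters)
    return letters


# Two generators seeded identically yield the same board
board_a = simulate_board_seeded(toy_tiles, random.Random(42))
board_b = simulate_board_seeded(toy_tiles, random.Random(42))
```

Seeding like this makes it possible to regenerate a specific board later, for example when debugging a surprising solve result.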

# Running Simulations
Now that I've got a method to generate realistic Boggle boards, I'm going to solve a large number of them.

I'll start by generating a bunch of boards. 

In [5]:
# # Parameterize the simulation
# n_boards_to_simulate = 100000

# # Generate each of the boards
# boards = [
#     parse_board_from_letter_sequence(
#         simulate_boggle_board(tile_idx_to_possible_letters_dict)
#     )
#     for _ in range(n_boards_to_simulate)
# ]

Now that I've got all of these boards, I'm going to try and solve them. 

We're going to do this in parallel, so that we can solve a ton of boards at once. 

In [6]:
# # We're going to store all of the solved boards in a dictionary (board_idx -> solved board)
# solved_boards_dict = {}

# # We're going to store the futures in a dictionary
# futures_dict = {}
# with ThreadPoolExecutor(max_workers=16) as executor:
    
#     # For each of the boards in the `boards` list, we'll submit a job to the executor
#     print(f"Submitting {len(boards)} jobs to the executor...")
#     for board_idx, board in tqdm(list(enumerate(boards))):
#         futures_dict[board_idx] = executor.submit(solve_boggle, board, allowed_words_trie)
    
#     # Now, we're going to wait for each of the futures to complete
#     print(f"\nSolving {len(boards)} boards...")
#     for board_idx, future in tqdm(list(futures_dict.items())):
#         solved_boards_dict[board_idx] = future.result()
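
The submit-then-collect pattern above can be sketched on toy data as follows. This is only an illustration: `toy_solve` is a hypothetical stand-in for `solve_boggle`, and the sketch swaps in `as_completed` so that results are collected in the order they finish rather than in submission order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_solve(board):
    """Hypothetical stand-in for solve_boggle: returns the board's sorted letters."""
    return sorted(board)


toy_boards = [list("tac"), list("god"), list("tab")]

solved = {}
with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to the index of the board it is solving
    future_to_idx = {
        executor.submit(toy_solve, board): idx for idx, board in enumerate(toy_boards)
    }

    # Collect results as each future finishes
    for future in as_completed(future_to_idx):
        solved[future_to_idx[future]] = future.result()
```

A progress bar wrapped around `as_completed` tracks actual completions. If the solver is CPU-bound pure Python, threads in CPython largely take turns because of the global interpreter lock, so a `ProcessPoolExecutor` may be worth trying.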

# Saving Results of Simulated Games
To better understand the results of the simulated Boggle games, I'll save them. 

In [7]:
# # Create a master DataFrame that contains all of the solved boards
# solved_boards_df_list = []
# for board_id, solved_board_df in solved_boards_dict.items():
#     solved_board_df["board_id"] = board_id
#     solved_board_df.drop(columns=["length", "points", "path", "word_id"], inplace=True, errors="ignore")
#     solved_boards_df_list.append(solved_board_df)
# solved_boards_df = pd.concat(solved_boards_df_list)

# # Create a directory in the data/ folder that contains results of the simulations
# Path("data/simulations").mkdir(parents=True, exist_ok=True)

# # Get a timestamp for the simulation run
# timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# # Save a .json file version of the solved_boards_df
# solved_boards_df.to_json(
#     f"data/simulations/solved_boards_{timestamp}.json", orient="records"
# )

# Collecting Statistics
Now that I've simulated all of the Boggle boards and solved them, I want to collect some statistics about them. Here are things I'm interested in: 

- Frequency statistics for each word
- Avg. total points available
- Avg. total words available
- Avg. # of 8+ length words

I'll start by calculating the frequency stats for each word: 

In [8]:
# # We're going to store the number of times each word was found in a board
# word_appearances_dict = {}

# # Iterate through each of the solved boards
# for board_idx, solved_board_df in tqdm(list(solved_boards_dict.items())):
#     # Extract the words from this board
#     words = solved_board_df["word"].tolist()

#     # Iterate through each of the words and update the dictionary
#     for word in words:
#         word_appearances_dict[word] = word_appearances_dict.get(word, 0) + 1

# # Create a DataFrame from the dictionary
# word_appearances_df = pd.DataFrame.from_records(
#     [
#         {"word": word, "appearances": appearances, "length": len(word)}
#         for word, appearances in word_appearances_dict.items()
#     ]
# )

# # Sort the DataFrame by the number of appearances
# word_appearances_df = word_appearances_df.sort_values(
#     by=["appearances", "length", "word"], ascending=[False, True, True]
# )

# # Add a column indicating the likelihood of each word appearing
# word_appearances_df["prob_of_appearance"] = (
#     word_appearances_df["appearances"] / n_boards_to_simulate
# )
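
The same counting can be sketched with `collections.Counter` on toy data. The `toy_solved_boards` dictionary below is hypothetical. The sketch also deduplicates words within a board, on the assumption that a solver could report the same word more than once (for example, via different paths); otherwise `appearances / n_boards` would not be a per-board probability.

```python
from collections import Counter

# Hypothetical solved boards: board_idx -> words found on that board
toy_solved_boards = {
    0: ["cat", "act", "cat"],  # "cat" found twice on the same board
    1: ["cat", "tac"],
    2: ["dog"],
}

# Count the number of boards on which each word appears at least once
word_counts = Counter()
for words in toy_solved_boards.values():
    word_counts.update(set(words))

# Convert board counts into the fraction of boards containing each word
n_boards = len(toy_solved_boards)
word_probs = {word: count / n_boards for word, count in word_counts.items()}
```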

Next, I'll extract the per-board stats: total points, number of words, and number of 8+ letter words.

In [9]:
# def extract_board_stats(board_idx, solved_board_df):
#     """
#     This is a helper method to extract the stats from a solved board.
#
#     NOTE: this needs the "points" and "length" columns, which the saving
#     cell above drops in place, so it has to run before that cell.
#     """

#     # Calculate some stats about the current board
#     total_points = solved_board_df["points"].sum()
#     num_words = len(solved_board_df)
#     eleven_pointers = len(solved_board_df.query("length >= 8"))

#     # Return a dictionary containing the stats
#     return {
#         "board_idx": board_idx,
#         "total_points": total_points,
#         "num_words": num_words,
#         "eleven_pointers": eleven_pointers,
#     }


# # We're going to store each of the boards' stats in a list
# board_stats_df_records = []

# # Parallelize the board stats extraction
# futures = {}
# with ThreadPoolExecutor(max_workers=32) as executor:
#     for board_idx, solved_board_df in tqdm(list(solved_boards_dict.items())):
#         # Pass board_idx explicitly so the worker doesn't read a global that changes
#         futures[board_idx] = executor.submit(
#             extract_board_stats, board_idx, solved_board_df
#         )

#     for board_idx, future in tqdm(list(futures.items())):
#         board_stats_df_records.append(future.result())

# # Finally, make a DataFrame from the records
# board_stats_df = pd.DataFrame.from_records(board_stats_df_records)
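
To make the averages listed at the top of this section concrete, here is a minimal pure-Python sketch on toy data. The boards, words, and point values in `toy_solved` are hypothetical and do not come from the real solver.

```python
from statistics import mean

# Hypothetical solved boards: board_idx -> list of (word, points) pairs
toy_solved = {
    0: [("cat", 1), ("cats", 1), ("notebook", 11)],
    1: [("dog", 1), ("dogs", 1)],
}


def board_stats(board_idx, words):
    """Summarize one board: total points, word count, and 8+ letter words."""
    return {
        "board_idx": board_idx,
        "total_points": sum(points for _, points in words),
        "num_words": len(words),
        "long_words": sum(1 for word, _ in words if len(word) >= 8),
    }


stats = [board_stats(board_idx, words) for board_idx, words in toy_solved.items()]

# Average each stat across all boards
avg_points = mean(s["total_points"] for s in stats)
avg_words = mean(s["num_words"] for s in stats)
avg_long_words = mean(s["long_words"] for s in stats)
```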

# Simulation Iteration Loop
This loop is a quick and easy way to run a bunch of simulations in batches without totally crashing my computer.

In [10]:
# Parameterize the simulation iteration loop
n_boards_to_simulate_per_run = 50000
n_runs = 5
n_concurrent_workers = 24
min_to_sleep_between_runs = 3

# Start the simulation loop
for run_idx in range(n_runs):
    # Print out the current run index
    print(f"\n\nSTARTING RUN {run_idx+1} OF {n_runs}...")

    # Parameterize the simulation
    n_boards_to_simulate = n_boards_to_simulate_per_run

    # Generate each of the boards
    boards = [
        parse_board_from_letter_sequence(
            simulate_boggle_board(tile_idx_to_possible_letters_dict)
        )
        for _ in range(n_boards_to_simulate)
    ]

    # We're going to store all of the solved boards in a dictionary (board_idx -> solved board)
    solved_boards_dict = {}

    # We're going to store the futures in a dictionary
    futures_dict = {}
    with ThreadPoolExecutor(max_workers=n_concurrent_workers) as executor:
        # For each of the boards in the `boards` list, we'll submit a job to the executor
        print(f"Submitting {len(boards)} jobs to the executor...")
        for board_idx, board in tqdm(list(enumerate(boards))):
            futures_dict[board_idx] = executor.submit(
                solve_boggle, board, allowed_words_trie
            )

        # Now, we're going to wait for each of the futures to complete
        print(f"Solving {len(boards)} boards...")
        for board_idx, future in tqdm(list(futures_dict.items())):
            solved_boards_dict[board_idx] = future.result()

    # Create a master DataFrame that contains all of the solved boards
    solved_boards_df_list = []
    for board_id, solved_board_df in solved_boards_dict.items():
        solved_board_df["board_id"] = board_id
        solved_board_df.drop(
            columns=["length", "points", "path", "word_id"],
            inplace=True,
            errors="ignore",
        )
        solved_boards_df_list.append(solved_board_df)
    solved_boards_df = pd.concat(solved_boards_df_list)

    # Create a DataFrame that aggregates the number of times each word was found
    aggregated_by_word_df = (
        solved_boards_df.groupby("word")
        .agg({"board_id": "count"})
        .reset_index()
        .rename(columns={"board_id": "ct"})
        .sort_values("ct", ascending=False)
    )

    # Make a dictionary that stores the number of boards and the per-word counts
    dict_to_save = {
        "n_games": n_boards_to_simulate_per_run,
        "word_ct": [(row.word, row.ct) for row in aggregated_by_word_df.itertuples()],
    }

    # Create a directory in the data/ folder that contains results of the simulations
    Path("data/simulations").mkdir(parents=True, exist_ok=True)

    # Get a timestamp for the simulation run
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

    # Save a JSON of the dictionary
    with open(f"data/simulations/word_ct_{timestamp}.json", "w") as f:
        json.dump(dict_to_save, f)

    # Sleep for a few minutes between runs to let the OS reclaim memory and the CPU cool down
    # (skipped after the final run, where there is nothing left to wait for)
    if run_idx < n_runs - 1:
        time.sleep(min_to_sleep_between_runs * 60)



STARTING RUN 1 OF 5...
Submitting 50000 jobs to the executor...


100%|██████████| 50000/50000 [05:51<00:00, 142.31it/s]  


Solving 50000 boards...


100%|██████████| 50000/50000 [01:47<00:00, 464.09it/s]   




STARTING RUN 2 OF 5...
Submitting 50000 jobs to the executor...


100%|██████████| 50000/50000 [01:25<00:00, 584.23it/s] 


Solving 50000 boards...


100%|██████████| 50000/50000 [07:37<00:00, 109.34it/s]  




STARTING RUN 3 OF 5...
Submitting 50000 jobs to the executor...


 89%|████████▊ | 44262/50000 [01:45<00:01, 3567.40it/s]
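
Since each run writes its own `word_ct_*.json` file with an `n_games` field and a `word_ct` list, the runs can be combined afterwards. Below is a minimal sketch of that merge; the `merge_word_counts` helper and the toy payloads are hypothetical, and the final probabilities assume each word is counted at most once per board.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def merge_word_counts(sim_dir):
    """Sum n_games and per-word counts across every word_ct_*.json file in sim_dir."""
    total_games = 0
    totals = Counter()
    for path in sorted(Path(sim_dir).glob("word_ct_*.json")):
        with open(path) as f:
            run = json.load(f)
        total_games += run["n_games"]
        # JSON stores each (word, ct) tuple as a two-element list
        for word, ct in run["word_ct"]:
            totals[word] += ct
    return total_games, totals


# Demonstrate on two toy run files written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    toy_runs = {
        "word_ct_run1.json": {"n_games": 2, "word_ct": [["cat", 2], ["dog", 1]]},
        "word_ct_run2.json": {"n_games": 3, "word_ct": [["cat", 2], ["tac", 1]]},
    }
    for name, payload in toy_runs.items():
        (Path(tmp) / name).write_text(json.dumps(payload))

    total_games, totals = merge_word_counts(tmp)

# Fraction of simulated boards on which each word appeared
word_probs = {word: ct / total_games for word, ct in totals.items()}
```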