# Motivation
After analyzing the statistics of the Boggle games I'd hand-labeled, I realized that if I record the letters on every side of each tile in the actual game, I should be able to simulate as many games as I want.

# Setup
The cells below will set up the rest of the notebook. 

I'll start by configuring the kernel: 

In [1]:
# Change directory to the path above
%cd ..

# Enable the autoreload of modules
%load_ext autoreload
%autoreload 2

d:\data\programming\boggle-vision\boggle-vision-prototyping


Next, I'm going to import some necessary modules:

In [2]:
# Import general modules
import pandas as pd
import random
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

# Importing custom-built modules
from utils.board_solving import (
    parse_board_from_letter_sequence,
    solve_boggle,
    allowed_words_trie,
)

In [7]:
import json

# Load the Scrabble dictionary and record each word's length
with open("data/scrabble-dictionary.json", "r") as json_file:
    allowed_words = json.load(json_file)
allowed_words_df = pd.DataFrame([{"word": word, "length": len(word)} for word in allowed_words])

# Show a sample of the 17-letter words
allowed_words_df.query("length==17").sort_values(by="length", ascending=False).head(15)

index,word,length
845,accommodationists,17
107227,paleogeographical,17
107914,paraformaldehydes,17
107900,paradoxicalnesses,17
107505,pancreatectomized,17
107253,paleopathologists,17
107250,paleopathological,17
107239,paleomagnetically,17
107202,paleoanthropology,17
108077,paraprofessionals,17


Finally, I'm going to load in the necessary data: 

In [None]:
# Read in the letters on each side of each Boggle tile
boggle_tile_letters_df = pd.read_excel("data/super-big-boggle-tile-letters.xlsx")

# Setting Up Simulations
Next, I'm going to write a function that generates a random Boggle board from these tile letters.

In [None]:
# Create a dictionary mapping each tile_idx to the possible letters
tile_idx_to_possible_letters_dict = {
    row["tile_idx"]: [row[f"side_{side_idx+1}"] for side_idx in range(6)]
    for idx, row in boggle_tile_letters_df.iterrows()
}


def simulate_boggle_board(tile_idx_to_possible_letters_dict):
    """
    Simulate a Boggle board by randomly selecting one letter
    from the set of possible letters on each tile.
    """

    # Select one of the possible letters for each tile
    letter_choices = [
        random.choice(possible_letters).lower()
        for possible_letters in tile_idx_to_possible_letters_dict.values()
    ]

    # Now, randomize the order of the letters
    random.shuffle(letter_choices)

    # Return the letter choices
    return letter_choices
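
As a quick sanity check on this approach, here's a minimal sketch that runs the same pick-then-shuffle logic on a made-up three-tile set. The tile letters and the `simulate_board` name are placeholders for illustration, not the real tile data or the function above.

```python
import random

# Hypothetical three-tile set, for illustration only (not the real tile data)
example_tiles = {
    0: ["A", "B", "C", "D", "E", "F"],
    1: ["G", "H", "I", "J", "K", "L"],
    2: ["M", "N", "O", "P", "Q", "R"],
}


def simulate_board(tile_letters):
    """Pick one side per tile, then shuffle the tile positions."""
    letters = [random.choice(sides).lower() for sides in tile_letters.values()]
    random.shuffle(letters)
    return letters


random.seed(0)  # Fix the seed so the run is repeatable
board = simulate_board(example_tiles)

# Every tile contributes exactly one letter, whatever the final order is
for sides in example_tiles.values():
    lowercase_sides = {side.lower() for side in sides}
    assert sum(letter in lowercase_sides for letter in board) == 1

print(board)
```

Because the example tiles don't share any letters, the check can tell which tile each letter came from. The real tiles likely repeat letters across tiles, so the same check wouldn't apply to them directly.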

# Running Simulations
Now that I've got a function to generate realistic Boggle boards, I'm going to solve a large batch of them.

I'll start by generating a bunch of boards. 

In [None]:
# Parameterize the simulation
n_boards_to_simulate = 1_500_000

# Generate each of the boards
boards = [
    parse_board_from_letter_sequence(
        simulate_boggle_board(tile_idx_to_possible_letters_dict)
    )
    for _ in range(n_boards_to_simulate)
]
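
The body of `parse_board_from_letter_sequence` lives in `utils.board_solving` and isn't shown in this notebook. Purely as an illustration of what a helper like that might do, here's a sketch that reshapes a flat letter sequence into a square grid. The `letters_to_grid` name and the list-of-lists return type are assumptions, not the real implementation.

```python
import math


def letters_to_grid(letters):
    """Reshape a flat letter sequence into a square grid (hypothetical helper)."""
    side = math.isqrt(len(letters))
    if side * side != len(letters):
        raise ValueError("expected a square number of letters")
    # Slice the sequence into consecutive rows of `side` letters each
    return [letters[row * side:(row + 1) * side] for row in range(side)]


grid = letters_to_grid(list("abcdefghi"))
for row in grid:
    print(" ".join(row))
```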

Now that I've got all of these boards, I'm going to try and solve them. 

We're going to submit the boards to a thread pool, so that many of them can be in flight at once.

In [None]:
# We're going to store all of the solved boards in a dictionary, keyed by board index
solved_boards_dict = {}

# We're going to store the futures in a dictionary
futures_dict = {}
with ThreadPoolExecutor(max_workers=64) as executor:
    
    # For each of the boards in the `boards` list, we'll submit a job to the executor
    print(f"Submitting {len(boards)} jobs to the executor...")
    for board_idx, board in tqdm(list(enumerate(boards))):
        futures_dict[board_idx] = executor.submit(solve_boggle, board, allowed_words_trie)
    
    # Now, we're going to wait for each of the futures to complete
    print(f"\nSolving {len(boards)} boards...")
    for board_idx, future in tqdm(list(futures_dict.items())):
        solved_boards_dict[board_idx] = future.result()
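
The submit-then-collect pattern above can also be written with `executor.map`, which returns results in input order. Here's a minimal sketch with a toy stand-in for the solver (`toy_solver` and `toy_boards` are made up for this example). One caveat: in standard CPython builds, threads share the global interpreter lock, so if `solve_boggle` is pure Python, a thread pool may not speed things up much, and `ProcessPoolExecutor` from the same module could be worth trying.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_solver(board):
    """Stand-in for the real solver: counts the vowels on a board."""
    return sum(letter in "aeiou" for letter in board)


# Hypothetical boards, just flat lists of letters
toy_boards = [list("boggle"), list("tiles"), list("rhythm")]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in the same order as the inputs,
    # so enumerate() recovers each board's index
    toy_results = dict(enumerate(executor.map(toy_solver, toy_boards)))

print(toy_results)
```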

# Collecting Statistics
Now that I've simulated all of the Boggle boards and solved them, I want to collect some statistics about them. Here are things I'm interested in: 

- Frequency statistics for each word
- Avg. total points available
- Avg. total words available
- Avg. # of 8+ length words

I'll start by calculating the frequency stats for each word: 

In [None]:
# We're going to store the number of times each word was found in a board
word_appearances_dict = {}

# Iterate through each of the solved boards
for board_idx, solved_board_df in tqdm(list(solved_boards_dict.items())):
    # Extract the words from this board
    words = solved_board_df["word"].tolist()

    # Iterate through each of the words and update the dictionary
    for word in words:
        word_appearances_dict[word] = word_appearances_dict.get(word, 0) + 1

# Create a DataFrame from the dictionary
word_appearances_df = pd.DataFrame.from_records(
    [
        {"word": word, "appearances": appearances, "length": len(word)}
        for word, appearances in word_appearances_dict.items()
    ]
)

# Sort the DataFrame by the number of appearances
word_appearances_df = word_appearances_df.sort_values(
    by=["appearances", "length", "word"], ascending=[False, True, True]
)

# Add a column indicating the likelihood of each word appearing
word_appearances_df["prob_of_appearance"] = (
    word_appearances_df["appearances"] / n_boards_to_simulate
)
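
The counting loop above is the same idea as `collections.Counter`. Here's a small sketch on made-up word lists (not real simulation output), assuming each word shows up at most once per solved board:

```python
from collections import Counter

# Hypothetical word lists, one per solved board (not real simulation output)
toy_solved_boards = [
    ["tea", "eat", "ate"],
    ["tea", "ten"],
    ["eat", "tea"],
]

# Tally how many boards each word appears on
toy_counts = Counter()
for words in toy_solved_boards:
    toy_counts.update(words)

# Divide by the number of boards to estimate the probability of appearance
n_toy_boards = len(toy_solved_boards)
for word, appearances in toy_counts.most_common():
    print(word, appearances, appearances / n_toy_boards)
```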

What are some of the most common words? 

In [None]:
# Show the first 10 words, sorted by the number of appearances
word_appearances_df.head(10)

How about the least common words?

In [None]:
# Show the last 10 words, sorted by the number of appearances
word_appearances_df.tail(10)

Now: I want to determine some stats about the point distributions associated with each of the boards:

In [None]:
def extract_board_stats(board_idx, solved_board_df):
    """
    This is a helper function to extract the stats from a solved board.
    """

    # Calculate some stats about the current board
    total_points = solved_board_df["points"].sum()
    num_words = len(solved_board_df)
    # Count the words with 8 or more letters
    eleven_pointers = len(solved_board_df.query("length >= 8"))

    # Return a dictionary containing the stats
    return {
        "board_idx": board_idx,
        "total_points": total_points,
        "num_words": num_words,
        "eleven_pointers": eleven_pointers,
    }


# We're going to store each of the boards' stats in a list
board_stats_df_records = []

# Parallelize the board stats extraction
futures = {}
with ThreadPoolExecutor(max_workers=32) as executor:
    for board_idx, solved_board_df in tqdm(list(solved_boards_dict.items())):
        # Pass board_idx explicitly so each record is tagged with its own board,
        # rather than with whatever the loop variable holds when the job runs
        futures[board_idx] = executor.submit(
            extract_board_stats, board_idx, solved_board_df
        )

    for board_idx, future in tqdm(list(futures.items())):
        board_stats_df_records.append(future.result())

# Finally, make a DataFrame from the records
board_stats_df = pd.DataFrame.from_records(board_stats_df_records)
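
With one record per board, the averages I listed earlier are just column means. Here's a minimal sketch on made-up records (the numbers are placeholders, not simulation results):

```python
from statistics import mean

# Hypothetical per-board records (made-up numbers, not simulation results)
toy_board_stats = [
    {"board_idx": 0, "total_points": 120, "num_words": 80, "eleven_pointers": 1},
    {"board_idx": 1, "total_points": 90, "num_words": 60, "eleven_pointers": 0},
    {"board_idx": 2, "total_points": 150, "num_words": 100, "eleven_pointers": 2},
]

# Average each stat across all of the boards
toy_averages = {
    column: mean(record[column] for record in toy_board_stats)
    for column in ("total_points", "num_words", "eleven_pointers")
}
print(toy_averages)
```

On the real DataFrame, `board_stats_df[["total_points", "num_words", "eleven_pointers"]].mean()` should give the same kind of summary.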

# Saving Data
Now that I've run all of the simulations, I want to actually save some of the data. 