# Trexquant Interview Project (The Hangman Game)

* Copyright Trexquant Investment LP. All Rights Reserved. 
* Redistribution of this question without written consent from Trexquant is prohibited

## Instruction:
For this coding test, your mission is to write an algorithm that plays the game of Hangman through our API server. 

When a user plays Hangman, the server first selects a secret word at random from a list. The server then returns a row of space-separated underscores, one for each letter in the secret word, and asks the user to guess a letter. If the user guesses a letter that is in the word, the word is redisplayed with all instances of that letter shown in the correct positions, along with any letters correctly guessed on previous turns. If the letter does not appear in the word, the user is charged with an incorrect guess. The user keeps guessing letters until either (1) the user has correctly guessed all the letters in the word or (2) the user has made six incorrect guesses.
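
The rules above can be reproduced offline for quick experiments. The following is a minimal sketch, not the server's implementation: `play_local_game`, its parameters, and the `guess_fn(masked, guessed)` signature are assumptions made for illustration. Only the space-separated underscore display and the limit of six incorrect guesses come from the description above.

```python
def play_local_game(secret, guess_fn, max_wrong=6):
    """Play one offline game of Hangman; return True if the word is fully revealed.

    guess_fn is a hypothetical callback that takes the masked word
    (e.g. "_ p p _ e ") and the list of letters guessed so far,
    and returns the next letter to try.
    """
    guessed = []
    wrong = 0
    while wrong < max_wrong:
        # Mirror the display format: one symbol per letter, each followed by a space.
        masked = "".join((ch if ch in guessed else "_") + " " for ch in secret)
        if "_" not in masked:
            return True
        letter = guess_fn(masked, guessed)
        if letter in guessed:
            # Assumption: a repeated guess costs a try, so a buggy guesser cannot loop forever.
            wrong += 1
            continue
        guessed.append(letter)
        if letter not in secret:
            wrong += 1
    return False


def alphabet_guesser(masked, guessed):
    # Trivial demonstration strategy: first unguessed letter in a fixed order.
    for letter in "etaoinshrdlucmfwypvbgkjqxz":
        if letter not in guessed:
            return letter
    return "e"
```

Calling `play_local_game("tea", alphabet_guesser)` plays a full game without touching the API, which makes it cheap to compare strategies before spending practice games.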

You are required to write a "guess" function that takes the current word (with underscores) as input and returns a letter to guess. You will use the API code below to play 1,000 Hangman games. You may practice before you start recording your game results.

Your algorithm is permitted to use a training set of approximately 250,000 dictionary words. Your algorithm will be tested on an entirely disjoint set of 250,000 dictionary words. Please note that this means the words that you will ultimately be tested on do NOT appear in the dictionary that you are given. You are not permitted to use any dictionary other than the training dictionary we provided. This requirement will be strictly enforced by code review.
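
Because the test words do not appear in the training dictionary, accuracy measured on the training words themselves would be optimistic. One way to approximate the real setting is to hold out part of the training dictionary and evaluate only on the held-out words. The sketch below assumes this approach; `split_dictionary`, the hold-out fraction, and the fixed seed are illustrative choices, not part of the assignment.

```python
import random


def split_dictionary(words, holdout_fraction=0.1, seed=0):
    """Shuffle a de-duplicated copy of the word list and split it into (train, holdout)."""
    # dict.fromkeys removes duplicates while preserving the original order
    unique = list(dict.fromkeys(words))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique)
    n_holdout = int(len(unique) * holdout_fraction)
    return unique[n_holdout:], unique[:n_holdout]


# Tiny stand-in for the real training dictionary
words = ["apple", "angle", "ample", "maple", "table",
         "cable", "fable", "sable", "gable", "ankle"]
train_words, holdout_words = split_dictionary(words, holdout_fraction=0.2)
```

Any statistics used by the guesser (letter counts, bigram counts) would then be built from `train_words` only, with success rates measured on `holdout_words`.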

You are provided with a basic, working algorithm. This algorithm matches the provided masked string (e.g. a _ _ l e) against all words in the dictionary, tabulates the frequency of letters appearing in the matching words, and then guesses the most frequent letter that has not already been guessed. If no remaining words match, it falls back to the letter frequency distribution of the entire dictionary.
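
The benchmark just described can be sketched as a standalone function. This is an illustrative reconstruction from the description, not the provided code verbatim; `baseline_guess` and its arguments are hypothetical names, and the masked string is assumed to use the server's space-separated format.

```python
import collections
import re


def baseline_guess(masked, dictionary, guessed):
    """Return the most frequent unguessed letter among dictionary words matching the mask."""
    # "a _ _ l e " -> "a..le": drop the spaces, turn blanks into regex wildcards
    pattern = re.compile(masked.replace(" ", "").replace("_", "."))
    matches = [w for w in dictionary if pattern.fullmatch(w)]

    # Try the matching words first, then fall back to the whole dictionary
    for pool in (matches, dictionary):
        counts = collections.Counter("".join(pool))
        for letter, _ in counts.most_common():
            if letter not in guessed:
                return letter
    return None  # every letter that occurs in the dictionary has already been guessed


example_dictionary = ["apple", "angle", "ample", "ankle"]
first_guess = baseline_guess("a _ _ l e ", example_dictionary, ["a", "l", "e"])
```

Note that this counts every occurrence of a letter, so a word such as "apple" contributes two counts for "p"; counting each letter once per word is a common variation.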

This benchmark strategy is successful approximately 18% of the time. Your task is to design an algorithm that significantly outperforms this benchmark.

In [1]:
import json
import requests
import random
import string
import secrets
import time
import re
import collections

try:
    from urllib.parse import parse_qs, urlencode, urlparse
except ImportError:
    from urlparse import parse_qs, urlparse
    from urllib import urlencode

from requests.packages.urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

In [2]:
class HangmanAPI(object):
    def __init__(self, access_token=None, session=None, timeout=None):
        self.hangman_url = self.determine_hangman_url()
        self.access_token = access_token
        self.session = session or requests.Session()
        self.timeout = timeout
        self.guessed_letters = []
        # letters known to be absent from the word; reset at the start of every game
        self.wrong_guess = []
        
        full_dictionary_location = "words_250000_train.txt"
        self.full_dictionary = self.build_dictionary(full_dictionary_location)        
        self.full_dictionary_common_letter_sorted = collections.Counter("".join(self.full_dictionary)).most_common()
        
        self.current_dictionary = []
        
    @staticmethod
    def determine_hangman_url():
        links = ['https://trexsim.com', 'https://sg.trexsim.com']

        data = {link: 0 for link in links}

        for link in links:

            requests.get(link)

            for i in range(10):
                s = time.time()
                requests.get(link)
                data[link] = time.time() - s

        link = sorted(data.items(), key=lambda x: x[1])[0][0]
        link += '/trexsim/hangman'
        return link

# class GuessingGame:
#     def __init__(self):
#         # Initialize an empty list for the current dictionary of words
#         self.current_dictionary = []
#         # Initialize an empty set to keep track of guessed letters
#         self.guessed_letters = set()

#     def load_dictionary(self, words):
#         # Load the dictionary of words into the current_dictionary list
#         self.current_dictionary = words

    def guess(self, word): # word input example: "_ p p _ e "
        ###############################################
        # Replace with your own "guess" function here #
        ###############################################

        # clean the word so that we strip away the space characters
        # replace "_" with "." as "." indicates any character in regular expressions
        clean_word = word[::2].replace("_",".")
        
        def contains_non_dot_characters(input_string):
            for char in input_string:
                if char != '.':
                    return True
            return False
        
        # revealed letters that are immediately followed by a blank; each one anchors a bigram
        bigram_candidates = extract_characters_with_underscore(word)

        # with no such letter there is no bigram to score, so fall back to dictionary matching
        terminal_char_only = len(bigram_candidates) == 0
            
    
        # terminal_char_only = False
        if contains_non_dot_characters(clean_word) and not terminal_char_only:
            # Bigram branch. Relies on module-level names defined in later cells:
            # extract_characters_with_underscore, create_guess_data_frame, bigram_data_frame.

            # zero out the rows of already-guessed letters so they cannot be proposed again
            guess_data_frame = create_guess_data_frame(list(set(self.guessed_letters)))
            masked_bigram_frame = bigram_data_frame*guess_data_frame.to_numpy()
            all_bigram_count = masked_bigram_frame.sum().sum()
            prob_bigram_distr = masked_bigram_frame/all_bigram_count

            # for each candidate first letter, record the probability of its most likely successor
            probability_dict = dict()
            for cand_ in bigram_candidates:
                idx_max_prob_a = prob_bigram_distr[cand_].argmax()
                p_a = prob_bigram_distr[cand_].iloc[idx_max_prob_a]
                probability_dict[cand_] = p_a

            # pick the first letter whose best bigram is the most probable overall
            max_key = max(probability_dict, key=lambda k: probability_dict[k])
            idx_max_prob = prob_bigram_distr[max_key].argmax()

            # rows and columns share the a-z order, so a row position maps to a letter via .columns
            guess_letter_a = prob_bigram_distr.columns[idx_max_prob]
            return guess_letter_a
        else:
            # Dictionary branch: used before any letter is revealed,
            # or when no revealed letter is directly followed by a blank.
            # find length of passed word
            len_word = len(clean_word)
    
            
            # grab current dictionary of possible words from self object, initialize new possible words dictionary to empty
            current_dictionary = self.current_dictionary
            new_dictionary = []
            
            # iterate through all of the words in the old plausible dictionary
            for dict_word in current_dictionary:
                # continue if the word is not of the appropriate length
                if len(dict_word) != len_word:
                    continue
                    
                # if dictionary word is a possible match then add it to the current dictionary
                if re.match(clean_word,dict_word):
                    new_dictionary.append(dict_word)

            # drop candidates containing any letter already known to be absent from the word
            for wrong_guessed_letter in self.wrong_guess:
                new_dictionary = [candidate for candidate in new_dictionary if wrong_guessed_letter not in candidate]

            # self.current_dictionary is deliberately left untouched, so candidates are
            # re-filtered from the full dictionary on every call
            
            # count occurrence of all characters in possible word matches
            full_dict_string = "".join(new_dictionary)
            
            c = collections.Counter(full_dict_string)
            sorted_letter_count = c.most_common()                   
            
            guess_letter = '!'
            
            # return most frequently occurring letter in all possible words that hasn't been guessed yet
            for letter,instance_count in sorted_letter_count:
                if letter not in self.guessed_letters:
                    guess_letter = letter
                    break
                
            # if no word matches in training dictionary, default back to ordering of full dictionary
            if guess_letter == '!':
                sorted_letter_count = self.full_dictionary_common_letter_sorted
                for letter,instance_count in sorted_letter_count:
                    if letter not in self.guessed_letters:
                        guess_letter = letter
                        break            
            
            return guess_letter

    ##########################################################
    # You'll likely not need to modify any of the code below #
    ##########################################################
    
    def build_dictionary(self, dictionary_file_location):
        text_file = open(dictionary_file_location,"r")
        full_dictionary = text_file.read().splitlines()
        text_file.close()
        return full_dictionary
                
    def start_game(self, practice=True, verbose=True):
        # reset guessed letters to empty set and current plausible dictionary to the full dictionary
        self.guessed_letters = []
        self.current_dictionary = self.full_dictionary
        self.wrong_guess = []                 
        response = self.request("/new_game", {"practice":practice})
        mistake_counter = 0
        max_allowed_mistakes = 6
        if response.get('status')=="approved":
            game_id = response.get('game_id')
            word = response.get('word')
            tries_remains = response.get('tries_remains')

            if verbose:
                print("Successfully start a new game! Game ID: {0}. # of tries remaining: {1}. Word: {2}.".format(game_id, tries_remains, word))
            while tries_remains>0:
                # get guessed letter from user code
                guess_letter = self.guess(word)
                    
                # append guessed letter to guessed letters field in hangman object
                self.guessed_letters.append(guess_letter)
                if verbose:
                    print("Guessing letter: {0}".format(guess_letter))
                    
                try:    
                    res = self.request("/guess_letter", {"request":"guess_letter", "game_id":game_id, "letter":guess_letter})
                except HangmanAPIError:
                    print('HangmanAPIError exception caught on request.')
                    continue
                except Exception as e:
                    print('Other exception caught on request.')
                    raise e
               
                if verbose:
                    print("Sever response: {0}".format(res))
                status = res.get('status')
                tries_remains = res.get('tries_remains')
                mistake_counter = max_allowed_mistakes - tries_remains
                # print(mistake_counter)
                if mistake_counter>0:
                    self.wrong_guess.append(guess_letter)
                    max_allowed_mistakes = max_allowed_mistakes - 1

                if status=="success":
                    if verbose:
                        print("Successfully finished game: {0}".format(game_id))
                    return True
                elif status=="failed":
                    reason = res.get('reason', '# of tries exceeded!')
                    if verbose:
                        print("Failed game: {0}. Because of: {1}".format(game_id, reason))
                    return False
                elif status=="ongoing":
                    word = res.get('word')
        else:
            if verbose:
                print("Failed to start a new game")
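        # reached only if the game never started or ended without a "success" status;
        # returning False directly avoids reading `status` before it is assigned
        return False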
        
    def my_status(self):
        return self.request("/my_status", {})
    
    def request(
            self, path, args=None, post_args=None, method=None):
        if args is None:
            args = dict()
        if post_args is not None:
            method = "POST"

        # Add `access_token` to post_args or args if it has not already been
        # included.
        if self.access_token:
            # If post_args exists, we assume that args either does not exists
            # or it does not need `access_token`.
            if post_args and "access_token" not in post_args:
                post_args["access_token"] = self.access_token
            elif "access_token" not in args:
                args["access_token"] = self.access_token

        time.sleep(0.2)

        num_retry, time_sleep = 50, 2
        for it in range(num_retry):
            try:
                response = self.session.request(
                    method or "GET",
                    self.hangman_url + path,
                    timeout=self.timeout,
                    params=args,
                    data=post_args,
                    verify=False
                )
                break
            except requests.HTTPError as e:
                response = json.loads(e.read())
                raise HangmanAPIError(response)
            except requests.exceptions.SSLError as e:
                if it + 1 == num_retry:
                    raise
                time.sleep(time_sleep)

        headers = response.headers
        if 'json' in headers['content-type']:
            result = response.json()
        elif "access_token" in parse_qs(response.text):
            query_str = parse_qs(response.text)
            if "access_token" in query_str:
                result = {"access_token": query_str["access_token"][0]}
                if "expires" in query_str:
                    result["expires"] = query_str["expires"][0]
            else:
                raise HangmanAPIError(response.json())
        else:
            raise HangmanAPIError('Maintype was not text, or querystring')

        if result and isinstance(result, dict) and result.get("error"):
            raise HangmanAPIError(result)
        return result
    
class HangmanAPIError(Exception):
    def __init__(self, result):
        self.result = result
        self.code = None
        try:
            self.type = result["error_code"]
        except (KeyError, TypeError):
            self.type = ""

        try:
            self.message = result["error_description"]
        except (KeyError, TypeError):
            try:
                self.message = result["error"]["message"]
                self.code = result["error"].get("code")
                if not self.type:
                    self.type = result["error"].get("type", "")
            except (KeyError, TypeError):
                try:
                    self.message = result["error_msg"]
                except (KeyError, TypeError):
                    self.message = result

        Exception.__init__(self, self.message)

# API Usage Examples

## To start a new game:
1. Make sure you have implemented your own "guess" method.
2. Use the access_token that we sent you to create your HangmanAPI object. 
3. Start a game by calling the "start_game" method.
4. If you wish to test your function without being recorded, set the "practice" parameter to 1.
5. Note: You have a rate limit of 20 new games per minute. DO NOT start more than 20 new games within one minute.
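
One way to respect the limit of 20 new games per minute is to space out calls to `start_game`. The helper below is a sketch under that assumption; `MinIntervalThrottle` is a hypothetical name, not part of the provided API, and the injectable `clock` and `sleep` arguments exist only to make it testable.

```python
import time


class MinIntervalThrottle:
    """Block so that successive wait() calls are at least min_interval seconds apart."""

    def __init__(self, max_per_minute=20, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous call: sleep for the rest of the interval
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each `api.start_game(...)` keeps game starts at least three seconds apart with the default setting. Passing a smaller `max_per_minute` leaves some margin if the server counts requests at window boundaries.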

In [238]:
api = HangmanAPI(access_token="8d066fa9d3c4785c59944618903abd", timeout=2000)

## Playing practice games:
You can use the command below to play up to 100,000 practice games.

In [239]:
api.start_game(practice=1,verbose=True)
[total_practice_runs,total_recorded_runs,total_recorded_successes,total_practice_successes] = api.my_status() # Get my game stats: (# of tries, # of wins)
practice_success_rate = total_practice_successes / total_practice_runs
print('run %d practice games out of an allotted 100,000. practice success rate so far = %.3f' % (total_practice_runs, practice_success_rate))


Successfully start a new game! Game ID: 4b0444c0b395. # of tries remaining: 6. Word: _ _ _ _ _ _ _ _ _ .
Guessing letter: e
Sever response: {'game_id': '4b0444c0b395', 'status': 'ongoing', 'tries_remains': 6, 'word': '_ _ _ _ e _ _ _ e '}
Guessing letter: r
Sever response: {'game_id': '4b0444c0b395', 'status': 'ongoing', 'tries_remains': 5, 'word': '_ _ _ _ e _ _ _ e '}
Guessing letter: s
Sever response: {'game_id': '4b0444c0b395', 'status': 'ongoing', 'tries_remains': 4, 'word': '_ _ _ _ e _ _ _ e '}
Guessing letter: n
Sever response: {'game_id': '4b0444c0b395', 'status': 'ongoing', 'tries_remains': 4, 'word': '_ _ _ _ e _ _ n e '}
Guessing letter: d
Sever response: {'game_id': '4b0444c0b395', 'status': 'ongoing', 'tries_remains': 3, 'word': '_ _ _ _ e _ _ n e '}
Guessing letter: l
Sever response: {'game_id': '4b0444c0b395', 'status': 'ongoing', 'tries_remains': 3, 'word': '_ _ _ l e _ _ n e '}
Guessing letter: t
Sever response: {'game_id': '4b0444c0b395', 'status': 'ongoing', 'tries_

In [242]:
for i in range(5000):
    print('Playing ', i, ' th game')
    # Practice run (practice=1): these games are not recorded
    api.start_game(practice=1,verbose=True)
    [total_practice_runs,total_recorded_runs,total_recorded_successes,total_practice_successes] = api.my_status() # Get my game stats: (# of tries, # of wins)
    practice_success_rate = total_practice_successes / total_practice_runs
    print('run %d practice games out of an allotted 100,000. practice success rate so far = %.3f' % (total_practice_runs, practice_success_rate))

    # DO NOT REMOVE as otherwise the server may lock you out for too high frequency of requests
    time.sleep(0.5)

Playing  0  th game
Successfully start a new game! Game ID: 97ac050821a8. # of tries remaining: 6. Word: _ _ _ _ _ _ _ _ _ _ .
Guessing letter: e
Sever response: {'game_id': '97ac050821a8', 'status': 'ongoing', 'tries_remains': 6, 'word': '_ _ _ _ _ _ _ e _ _ '}
Guessing letter: r
Sever response: {'game_id': '97ac050821a8', 'status': 'ongoing', 'tries_remains': 6, 'word': '_ _ _ _ _ _ _ e r _ '}
Guessing letter: i
Sever response: {'game_id': '97ac050821a8', 'status': 'ongoing', 'tries_remains': 6, 'word': '_ _ _ i _ _ _ e r _ '}
Guessing letter: n
Sever response: {'game_id': '97ac050821a8', 'status': 'ongoing', 'tries_remains': 6, 'word': '_ _ _ i n _ _ e r _ '}
Guessing letter: a
Sever response: {'game_id': '97ac050821a8', 'status': 'ongoing', 'tries_remains': 5, 'word': '_ _ _ i n _ _ e r _ '}
Guessing letter: g
Sever response: {'game_id': '97ac050821a8', 'status': 'ongoing', 'tries_remains': 4, 'word': '_ _ _ i n _ _ e r _ '}
Guessing letter: o
Sever response: {'game_id': '97ac05082

KeyboardInterrupt: 

In [218]:
# extract_characters_with_underscore(word)

In [207]:
word = 'c _ e r n o _ i t _ '
guessed_letters_offline = ['e', 'r', 'i']
bigram_model(word, guessed_letters_offline)

bigram_candidates:  ['t', 'c', 'o']
Initial char of the bigram:  o
bigram model predicted guess word  n


In [143]:
api.guessed_letters

['e', 'a', 'i', 'n', 's', 'c', 'g', 't', 'o', 'r', 'h']

In [147]:
prob_bigram_distr['i']

a    0.007212
b    0.001653
c    0.012421
d    0.005067
e    0.005098
f    0.002155
g    0.002793
h    0.000168
i    0.000259
j    0.000065
k    0.000941
l    0.006529
m    0.003022
n    0.000000
o    0.007397
p    0.002531
q    0.000174
r    0.002979
s    0.012582
t    0.000000
u    0.000810
v    0.002895
w    0.000061
x    0.000315
y    0.000043
z    0.002751
Name: i, dtype: float64

In [145]:
prob_bigram_distr['i'].argmax()

18

In [110]:
ttt = extract_characters_with_underscore('_ _ _ _ _ _ _ _ _ _ _ _ _ e _ _ _ ')

In [111]:
ttt.remove('_')

In [112]:
ttt

['e']

In [None]:
api.wrong_guess

## Playing recorded games:
Please finalize your code before running the cell below. Once this code has executed successfully, your submission will be finalized and our system will not allow you to run any additional games.

Note that after this block of code has run successfully, subsequent runs are expected to fail with the error message "Your account has been deactivated".

Once you've run this section of the code your submission is complete. Please send us your source code via email.

In [None]:
for i in range(1000):
    print('Playing ', i, ' th game')
    # Uncomment the following line to execute your final runs. Do not do this until you are satisfied with your submission
    #api.start_game(practice=0,verbose=False)
    
    # DO NOT REMOVE as otherwise the server may lock you out for too high frequency of requests
    time.sleep(0.5)

## To check your game statistics
1. Call the "my_status" method.
2. It returns four counts: total practice runs, total recorded runs, total recorded successes, and total practice successes.

In [None]:
[total_practice_runs,total_recorded_runs,total_recorded_successes,total_practice_successes] = api.my_status() # Get my game stats: (# of tries, # of wins)
success_rate = total_recorded_successes/total_recorded_runs
print('overall success rate = %.3f' % success_rate)

In [11]:
def build_dictionary(dictionary_file_location):
    text_file = open(dictionary_file_location,"r")
    full_dictionary = text_file.read().splitlines()
    text_file.close()
    return full_dictionary

In [13]:
word_list = build_dictionary('./words_250000_train.txt')

In [14]:
len(word_list)

227300

In [15]:
import string
import pandas as pd

def create_bigram_data_frame(word_list):
    # Step 1: Count bigrams for each character
    bigram_counts = {}
    
    for word in word_list:
        word = word.lower()  # Convert all words to lowercase for case-insensitive comparison
        word_length = len(word)
        
        for i in range(word_length - 1):
            bigram = word[i:i+2]
            
            # Count bigrams
            first_char = bigram[0]
            second_char = bigram[1]
            if first_char not in bigram_counts:
                bigram_counts[first_char] = {}
            if second_char in bigram_counts[first_char]:
                bigram_counts[first_char][second_char] += 1
            else:
                bigram_counts[first_char][second_char] = 1

    # Step 2: Create the data frame with all bigrams having count 0
    all_chars = string.ascii_lowercase
    df_data = {first_char: {second_char: bigram_counts.get(first_char, {}).get(second_char, 0) for second_char in all_chars}
               for first_char in all_chars}
    
    df = pd.DataFrame(df_data)

    return df

# Example usage:
# word_list = ['appetizingly', 'appraisingly', 'backbitingly']
bigram_data_frame = create_bigram_data_frame(word_list)

print(bigram_data_frame.T)


       a     b      c      d      e     f     g      h      i    j  ...     q  \
a    270  5416  13089   5429  10223  2623  4660   9559  11742  992  ...    20   
b   7258  1076     28    369   2171    90   173    269   2691    2  ...     1   
c  10476   221   1423    256   7706   197    66    147  20223   11  ...     4   
d   6110   250     68   1321  23098   159   117    126   8250    8  ...     3   
e   3350  5974   9023  13954   5426  3328  7326  11707   8301  735  ...     5   
f   1383    89     34    321   2340  2055   103    202   3508    3  ...     3   
g   4914    48     32    734   2617    46  1315     67   4548    3  ...     1   
h   1089   126  12023    383   1014    61  2377    109    274    5  ...     1   
i   4512  5052   6429  10976   2838  4404  4788   9035    421  267  ...    11   
j    211   148      1    175    265    20    21     15    106    6  ...     0   
k   2070    24   4537     43    573    20    31     60   1532    9  ...     1   
l  23550  7122   3169   1925

In [21]:
bigram_data_frame['a'].argmax()

13

In [34]:
print(bigram_data_frame['a'])

a      270
b     7258
c    10476
d     6110
e     3350
f     1383
g     4914
h     1089
i     4512
j      211
k     2070
l    23550
m     6969
n    25117
o      256
p     6038
q      204
r    19848
s    10554
t    23611
u     3471
v     2413
w     1345
x      751
y     1884
z      964
Name: a, dtype: int64


In [61]:
masked_bigram_frame = bigram_data_frame*guess_data_frame.to_numpy()

In [62]:
bigram_data_frame['a']

a      270
b     7258
c    10476
d     6110
e     3350
f     1383
g     4914
h     1089
i     4512
j      211
k     2070
l    23550
m     6969
n    25117
o      256
p     6038
q      204
r    19848
s    10554
t    23611
u     3471
v     2413
w     1345
x      751
y     1884
z      964
Name: a, dtype: int64

In [63]:
masked_bigram_frame['a']

a      270
b     7258
c    10476
d     6110
e     3350
f     1383
g     4914
h     1089
i     4512
j      211
k     2070
l    23550
m     6969
n        0
o      256
p     6038
q      204
r    19848
s    10554
t        0
u     3471
v     2413
w     1345
x      751
y     1884
z      964
Name: a, dtype: int64

In [64]:
all_bigram_count = masked_bigram_frame.sum().sum()

In [65]:
# prob_bigram_distr = bigram_data_frame/bigram_data_frame.sum(axis=0)
prob_bigram_distr = masked_bigram_frame/all_bigram_count

In [66]:
idx_max_prob = prob_bigram_distr['a'].argmax()

In [67]:
prob_bigram_distr['a'].iloc[idx_max_prob]

0.01446446471039132

In [68]:
prob_bigram_distr.columns[idx_max_prob]

'l'

In [137]:
import string
import pandas as pd

def create_guess_data_frame(guessed_letters):
    all_chars = string.ascii_lowercase
    # one column per letter: all zeros if the letter was guessed, all ones otherwise
    # (the transpose below turns these columns into rows)
    data = {char: [0 if char in guessed_letters else 1] * len(all_chars) for char in all_chars}
    df = pd.DataFrame(data, index=list(all_chars))
    return df.T

# Example usage:
guessed_letters = ['r', 'e']
guess_data_frame = create_guess_data_frame(guessed_letters)

print(guess_data_frame.to_numpy())


[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 

In [203]:
import re

def extract_characters_with_underscore(input_string):
    pattern = r'\b(\w)(?=_*)'
    characters = re.findall(pattern, input_string)
    return list(set(characters))

# Example usage:
test_string1 = '_ _ _ _ _ _ _ _ _ _ e'
result1 = extract_characters_with_underscore(test_string1)
print(result1)  # prints 'e' and '_' (set order may vary): \w also matches the underscore

test_string2 = 'c _ e r n o _ i t _ '

result2 = extract_characters_with_underscore(test_string2)
print(result2)  # prints every distinct character of the mask, including '_' (set order may vary)


['e', '_']
['n', '_', 'e', 'c', 'i', 't', 'r', 'o']


In [206]:
def extract_characters_with_underscore(input_string):
    bigrams = input_string.split()
    char_underscore_bigrams = []

    for i in range(len(bigrams) - 1):
        if bigrams[i + 1] == '_':
            char_underscore_bigrams.append(bigrams[i])
    char_underscore_bigrams = list(set(char_underscore_bigrams))
    if '_' in char_underscore_bigrams:
        char_underscore_bigrams.remove('_')
    return char_underscore_bigrams

# Example usage (test_string2 = 'c _ e r n o _ i t _ ' from the previous cell):
result = extract_characters_with_underscore(test_string2)
print(result)  # letters directly followed by a blank: 'c', 'o', 't' (set order may vary)


['t', 'c', 'o']


In [94]:
result1.remove('_')

In [95]:
result1

['i', 'a']

In [148]:
guess_data_frame = create_guess_data_frame(list(set(['i'])))


In [149]:
guess_data_frame

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,...,q,r,s,t,u,v,w,x,y,z
a,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
b,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
c,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
d,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
e,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
f,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
g,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
h,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
i,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
j,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [150]:
masked_bigram_frame = bigram_data_frame*guess_data_frame.to_numpy()


In [151]:
masked_bigram_frame

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,...,q,r,s,t,u,v,w,x,y,z
a,270,5416,13089,5429,10223,2623,4660,9559,11742,992,...,20,20481,5684,12414,2908,3060,3292,585,1364,1421
b,7258,1076,28,369,2171,90,173,269,2691,2,...,1,1913,523,437,3074,3,191,35,402,23
c,10476,221,1423,256,7706,197,66,147,20223,11,...,4,3388,5533,1246,2667,11,104,350,1320,11
d,6110,250,68,1321,23098,159,117,126,8250,8,...,3,3773,235,171,2235,12,181,15,863,13
e,3350,5974,9023,13954,5426,3328,7326,11707,8301,735,...,5,24800,13813,26219,2259,10593,3228,638,1380,3531
f,1383,89,34,321,2340,2055,103,202,3508,3,...,3,969,333,528,619,4,114,37,267,5
g,4914,48,32,734,2617,46,1315,67,4548,3,...,1,2254,187,213,1661,9,42,9,526,9
h,1089,126,12023,383,1014,61,2377,109,274,5,...,1,1206,7431,10336,74,4,1065,118,185,23
i,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
j,211,148,1,175,265,20,21,15,106,6,...,0,111,47,50,65,1,5,3,15,1


In [152]:
all_bigram_count = masked_bigram_frame.sum().sum()
prob_bigram_distr = masked_bigram_frame/all_bigram_count


In [153]:
prob_bigram_distr

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,...,q,r,s,t,u,v,w,x,y,z
a,0.000157,0.003149,0.007609207,0.003156,0.005943,0.001525,0.002709061,0.005557064,0.006826,0.0005766929,...,1.162687e-05,0.011906,0.003304,0.007217,0.001691,0.001778912,0.001913783,0.000340086,0.000793,0.0008260893
b,0.004219,0.000626,1.627762e-05,0.000215,0.001262,5.2e-05,0.0001005724,0.0001563814,0.001564,1.162687e-06,...,5.813436e-07,0.001112,0.000304,0.000254,0.001787,1.744031e-06,0.0001110366,2.034703e-05,0.000234,1.33709e-05
c,0.00609,0.000128,0.000827252,0.000149,0.00448,0.000115,3.836868e-05,8.545751e-05,0.011757,6.39478e-06,...,2.325375e-06,0.00197,0.003217,0.000724,0.00155,6.39478e-06,6.045974e-05,0.0002034703,0.000767,6.39478e-06
d,0.003552,0.000145,3.953137e-05,0.000768,0.013428,9.2e-05,6.801721e-05,7.32493e-05,0.004796,4.650749e-06,...,1.744031e-06,0.002193,0.000137,9.9e-05,0.001299,6.976124e-06,0.0001052232,8.720155e-06,0.000502,7.557467e-06
e,0.001948,0.003473,0.005245464,0.008112,0.003154,0.001935,0.004258923,0.00680579,0.004826,0.0004272876,...,2.906718e-06,0.014417,0.00803,0.015242,0.001313,0.006158173,0.001876577,0.0003708972,0.000802,0.002052724
f,0.000804,5.2e-05,1.976568e-05,0.000187,0.00136,0.001195,5.987839e-05,0.0001174314,0.002039,1.744031e-06,...,1.744031e-06,0.000563,0.000194,0.000307,0.00036,2.325375e-06,6.627317e-05,2.150971e-05,0.000155,2.906718e-06
g,0.002857,2.8e-05,1.8603e-05,0.000427,0.001521,2.7e-05,0.0007644669,3.895002e-05,0.002644,1.744031e-06,...,5.813436e-07,0.00131,0.000109,0.000124,0.000966,5.232093e-06,2.441643e-05,5.232093e-06,0.000306,5.232093e-06
h,0.000633,7.3e-05,0.006989495,0.000223,0.000589,3.5e-05,0.001381854,6.336646e-05,0.000159,2.906718e-06,...,5.813436e-07,0.000701,0.00432,0.006009,4.3e-05,2.325375e-06,0.000619131,6.859855e-05,0.000108,1.33709e-05
i,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
j,0.000123,8.6e-05,5.813436e-07,0.000102,0.000154,1.2e-05,1.220822e-05,8.720155e-06,6.2e-05,3.488062e-06,...,0.0,6.5e-05,2.7e-05,2.9e-05,3.8e-05,5.813436e-07,2.906718e-06,1.744031e-06,9e-06,5.813436e-07


In [154]:

probability_dict = dict()
for cand_ in ['i']:
    idx_max_prob_a = prob_bigram_distr[cand_].argmax()
    p_a = prob_bigram_distr[cand_].iloc[idx_max_prob_a]
    probability_dict[cand_] = p_a


In [157]:
idx_max_prob_a

13

In [156]:
prob_bigram_distr['i']

a    0.006826
b    0.001564
c    0.011757
d    0.004796
e    0.004826
f    0.002039
g    0.002644
h    0.000159
i    0.000000
j    0.000062
k    0.000891
l    0.006180
m    0.002861
n    0.021953
o    0.007001
p    0.002396
q    0.000165
r    0.002820
s    0.011909
t    0.008283
u    0.000766
v    0.002740
w    0.000058
x    0.000298
y    0.000041
z    0.002604
Name: i, dtype: float64

In [155]:
probability_dict

{'i': 0.0219526983936894}

In [187]:
word = 'n i _ _ _ _ i _ '
guessed_letters_offline = ['i', 'n']
bigram_model(word, guessed_letters_offline)

['i']
i
bigram model predicted guess word  s


In [176]:
bigram_candidates

['i', 'n']

In [191]:
def bigram_model(word, guessed_letters_offline):
    bigram_candidates = extract_characters_with_underscore(word)
    print('bigram_candidates: ', bigram_candidates)
    if len(bigram_candidates)>0:
        terminal_char_only = False
        # bigram_candidates.remove('_')
        

        guess_data_frame = create_guess_data_frame(list(set(guessed_letters_offline)))
        
        masked_bigram_frame = bigram_data_frame*guess_data_frame.to_numpy()
        all_bigram_count = masked_bigram_frame.sum().sum()
        prob_bigram_distr = masked_bigram_frame/all_bigram_count

        probability_dict = dict()
        for cand_ in bigram_candidates:
            idx_max_prob_a = prob_bigram_distr[cand_].argmax()
            p_a = prob_bigram_distr[cand_].iloc[idx_max_prob_a]
            probability_dict[cand_] = p_a

        # Get the key corresponding to the max value
        max_key = max(probability_dict, key=lambda k: probability_dict[k])
        
        print('Initial char of the bigram: ', max_key)  # the revealed letter whose best bigram is most probable
        idx_max_prob = prob_bigram_distr[max_key].argmax()
        
        guess_letter_a = prob_bigram_distr.columns[idx_max_prob]
        print('bigram model predicted guess word ' , guess_letter_a)    
                
    

['i']
i
bigram model predicted guess word  s


In [None]:


# Get the key corresponding to the max value
max_key = max(probability_dict, key=lambda k: probability_dict[k])

print(max_key)  # first letter of the most probable bigram
idx_max_prob = prob_bigram_distr[max_key].argmax()

guess_letter_a = prob_bigram_distr.columns[idx_max_prob]
print('bigram model predicted guess word ' , guess_letter_a)