# Language Modelling in Hangman

Student Name: Shireen Hassann

## The Hangman Game

The <a href="https://en.wikipedia.org/wiki/Hangman_(game)">Hangman game</a> is a simple game whereby one person thinks of a word, which they keep secret from their opponent, who tries to guess the word one character at a time. The game ends when the opponent makes more than a fixed number of incorrect guesses, or they figure out the secret word before then (in which case they *win*). 

Here's a simple version of the game, and a method allowing interactive play. 

In [1]:
# allowing better python 2 & python 3 compatibility 
from __future__ import print_function 

def hangman(secret_word, guesser, max_mistakes=8, verbose=True, **guesser_args):
    """
        secret_word: a string of lower-case alphabetic characters, i.e., the answer to the game
        guesser: a function which guesses the next character at each stage in the game
            The function takes a:
                mask: what is known of the word, as a string with _ denoting an unknown character
                guessed: the set of characters which have already been guessed in the game
                guesser_args: additional (optional) keyword arguments, i.e., name=value
        max_mistakes: limit on length of game, in terms of allowed mistakes
        verbose: be chatty vs silent
        guesser_args: keyword arguments to pass directly to the guesser function
    """
    secret_word = secret_word.lower()
    mask = ['_'] * len(secret_word)
    guessed = set()
    if verbose:
        print("Starting hangman game. Target is", ' '.join(mask), 'length', len(secret_word))
    
    mistakes = 0
    while mistakes < max_mistakes:
        if verbose:
            print("You have", (max_mistakes-mistakes), "attempts remaining.")
        guess = guesser(mask, guessed, **guesser_args)

        if verbose:
            print('Guess is', guess)
        if guess in guessed:
            if verbose:
                print('Already guessed this before.')
            mistakes += 1
        else:
            guessed.add(guess)
            if guess in secret_word:
                for i, c in enumerate(secret_word):
                    if c == guess:
                        mask[i] = c
                if verbose:
                    print('Good guess:', ' '.join(mask))
            else:
                if verbose:
                    print('Sorry, try again.')
                mistakes += 1
                
        if '_' not in mask:
            if verbose:
                print('Congratulations, you won.')
            return mistakes
        
    if verbose:
        print('Out of guesses. The word was', secret_word)    
    return mistakes

def human(mask, guessed, **kwargs):
    """
    simple function for manual play
    """
    print('Enter your guess:')
    try:
        return raw_input().lower().strip() # python 2
    except NameError:
        return input().lower().strip() # python 3

You can play the game interactively using the following command:

In [2]:
# hangman('whatever', human, 8, True)

<b>Instructions</b>: We will be using the words occurring in the *Brown* corpus for *training* an artificial intelligence guessing algorithm, and for *evaluating* the quality of the method. Note that we are intentionally making the hangman game hard, as the AI will need to cope with test words that it has not seen before, hence it will need to learn generalisable patterns of characters to make reasonable predictions.

Your first task is to compute the unique word types occurring in the *Brown* corpus, using `nltk.corpus.brown`, selecting only words that consist entirely of alphabetic characters, and lowercasing the words. Finally, randomly shuffle (`numpy.random.shuffle`) this collection of word types, and split them into disjoint training and testing sets. The test set should contain 1000 word types, and the rest should be in the training set. Your code should print the sizes of the training and test sets.

Feel free to test your own Hangman performance using `hangman(numpy.random.choice(test_set), human, 8, True)`. It is surprisingly difficult (and addictive)!

(0.5 mark)

In [3]:
import nltk
import numpy as np

brown_corpus = nltk.corpus.brown.words()

word_type = set()
for i in brown_corpus:
    if i.isalpha():
        word_type.add(i.lower())

word_type = list(word_type)
np.random.shuffle(word_type)
test_set = word_type[:1000]
training_set = word_type[1000:]
print('The length of test set is', len(test_set))
print('The length of training set is', len(training_set))

The length of test set is 1000
The length of training set is 39234
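
Because the shuffle above is unseeded, the split (and therefore every score reported later) can differ from run to run. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for the assignment; `split_word_types`, its `seed` argument and the toy token list are hypothetical and not part of the original solution.

```python
import numpy as np

def split_word_types(tokens, test_size, seed=0):
    """Hypothetical helper: lowercase, deduplicate, shuffle and split word types."""
    # Sort first so the starting order does not depend on set iteration order.
    word_types = sorted({t.lower() for t in tokens if t.isalpha()})
    rng = np.random.RandomState(seed)  # assumed seed, chosen only for illustration
    rng.shuffle(word_types)
    return word_types[test_size:], word_types[:test_size]

# Toy tokens standing in for nltk.corpus.brown.words()
toy_tokens = ["The", "the", "cat", "sat", "on", "a", "mat", "42", "mat."]
train_words, test_words = split_word_types(toy_tokens, test_size=2)
print(len(train_words), len(test_words))  # 4 2
```

Calling the helper twice with the same seed returns the same split, which makes the later comparisons between models easier to repeat.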


In [4]:
# hangman(np.random.choice(test_set), human, 8, True)

<b>Instructions</b>: To set a baseline, your first *AI* attempt will be a trivial random method. For this you should implement a guessing method, similar to the `human` method above, i.e., using the same input arguments and returning a character. Your method should randomly choose a character from the range `'a'...'z'` after excluding the characters that have already been guessed in the current game (all subsequent AI approaches should also exclude previous guesses). You might want to use `numpy.random.choice` for this purpose.

To measure the performance of this (and later) techniques, implement a method that measures the average number of mistakes made by this technique over all the words in the `test_set`. You will want to turn off the printouts for this, using the `verbose=False` option, and increase the cap on the game length to `max_mistakes=26`. Print the average number of mistakes for the random AI, which will become a baseline for the following steps.

(1 mark)

In [5]:
import string

def ai(mask, guessed, **kwargs):
    """
    function for AI play
    """
    characters = list(string.ascii_lowercase)
    guessed = list(guessed)

    for i in guessed:
        characters.remove(i)
    # print('Enter your guess:')
    # return raw_input().lower().strip() # python 3
    return np.random.choice(characters)

mistakes_guess = [hangman(i, ai, 26, False) for i in test_set]
#mistakes_guess = hangman(test_set[1], ai, 26, False)
baseline = np.average(mistakes_guess)
print('The average baseline guessing number is', baseline)

The average baseline guessing number is 16.722
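
The same evaluation loop is repeated for every model below, so it may be worth wrapping it in a function. This is a minimal sketch under stated assumptions: `play_silent` is a hypothetical, condensed stand-in for `hangman(..., verbose=False)`, and `average_mistakes` and `alphabetical` are illustrative names that do not appear in the original solution.

```python
import string
import numpy as np

def play_silent(secret_word, guesser, max_mistakes=26, **guesser_args):
    """Hypothetical silent game loop with the same mistake rules as hangman."""
    mask = ['_'] * len(secret_word)
    guessed = set()
    mistakes = 0
    while mistakes < max_mistakes and '_' in mask:
        guess = guesser(mask, guessed, **guesser_args)
        # A repeated guess or a letter not in the word counts as a mistake.
        if guess in guessed or guess not in secret_word:
            mistakes += 1
        guessed.add(guess)
        for i, c in enumerate(secret_word):
            if c == guess:
                mask[i] = c
    return mistakes

def average_mistakes(guesser, words, max_mistakes=26, **guesser_args):
    """Mean number of mistakes made by `guesser` over `words`."""
    scores = [play_silent(w, guesser, max_mistakes, **guesser_args) for w in words]
    return float(np.mean(scores))

def alphabetical(mask, guessed, **kwargs):
    """Deterministic toy guesser: the first letter not guessed yet."""
    return next(ch for ch in string.ascii_lowercase if ch not in guessed)

print(average_mistakes(alphabetical, ["cab", "dad"]))  # 1.0
```

With a helper like this, each later model only needs to supply its guesser and keyword arguments.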


**Instructions:** As your first real AI, you should train a *unigram* model over the training set.  This requires you to find the frequencies of characters over all training words. Using this model, you should write a guess function that returns the character with the highest probability, after aggregating (summing) the probability of each blank character in the secret word. Print the average number of mistakes the unigram method makes over the test set. Remember to exclude already guessed characters, and use the same evaluation method as above, so the results are comparable. (Hint: it should be much lower than for random).

(1 mark)

In [6]:
def charProbability(training_set):
    character_probability = {}
    character_frequency = {}
   
    for i in training_set:
        for j in i:
            character_frequency[j] = character_frequency.get(j, 0) + 1
    total_frequency = np.sum(list(character_frequency.values()))

    for k, v in character_frequency.items():
        character_probability[k] = v / total_frequency
        
    return character_probability
        
def guess(character_list, character_probability):
    character = {}
    for i in character_list:
        character[i] = character_probability.get(i, 0)
    sorted_character = sorted(character.items(), key=lambda x: x[1], reverse=True)
    return sorted_character[0][0]

def unigramAi(mask, guessed, **kwargs):
    """
    function for unigram AI play
    """
    character_probability = kwargs.get('character', 0)
    characters = list(string.ascii_lowercase)
    guessed = list(guessed)
    for i in guessed:
        characters.remove(i)
    char = guess(characters, character_probability)
    return char

character_probability = charProbability(training_set)
kwargs = {'character': character_probability}
mistakes_guess = [hangman(i, unigramAi, 26, False, **kwargs) for i in test_set]
unigram = np.average(mistakes_guess)
print('The average unigram model guessing number is', unigram)

The average unigram model guessing number is 10.478
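
For a pure unigram model every blank shares the same distribution, so summing over the blanks only scales each probability by the number of blanks and does not change which character wins. That is why the guesser above can ignore the mask. The toy sketch below illustrates this; `unigram_probs`, `best_unguessed` and the three-word training list are hypothetical and used only for illustration.

```python
from collections import Counter
import string

def unigram_probs(words):
    """Relative frequency of each character over all words."""
    counts = Counter(ch for w in words for ch in w)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def best_unguessed(probs, guessed, num_blanks=1):
    """Highest-scoring character not guessed yet, summed over num_blanks blanks."""
    candidates = [ch for ch in string.ascii_lowercase if ch not in guessed]
    return max(candidates, key=lambda ch: num_blanks * probs.get(ch, 0.0))

probs = unigram_probs(["tree", "eel", "tea"])
print(best_unguessed(probs, set()))                  # e
print(best_unguessed(probs, {"e"}))                  # t
print(best_unguessed(probs, {"e"}, num_blanks=3))    # t
```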


**Instructions:** The length of the secret word is an important clue that we might exploit. Different length words tend to have different distributions over characters, e.g., short words are less likely to have suffixes or prefixes. Your job now is to incorporate this idea by conditioning the unigram model on the length of the secret word, i.e., having *different* unigram models for each length of word. You will need to be a little careful at test time, to be robust to the (unlikely) situation that you encounter a word length that you didn't see in training. Create another AI guessing function using this new model, and print its test performance.   

(0.5 marks)

In [8]:
# store the character probabilities for each word length
# sample pair {word_length: {probability_of_characters}}
corpus_probability = {}
train_length_set = set()

for i in training_set:
    train_length_set.add(len(i))

def dynamicCharProbability(training_set, **kwargs):
    """
    **kwargs can be used for passing word length later
    """
    word_length = kwargs.get('length', 0)
    character_probability = {}
    character_frequency = {}
    if not word_length:
        for i in training_set:
            for j in i:
                character_frequency[j] = character_frequency.get(j, 0) + 1
        total_frequency = np.sum(list(character_frequency.values()))

        for k, v in character_frequency.items():
            character_probability[k] = v / total_frequency
    else:
        if corpus_probability.get(word_length, 0) == 0:
            # get the word that has length = word_length
            corpus = [i for i in training_set if len(i) == word_length]
            for i in corpus:
                for j in i:
                    character_frequency[j] = character_frequency.get(j, 0) + 1
            total_frequency = np.sum(list(character_frequency.values()))

            for k, v in character_frequency.items():
                character_probability[k] = v / total_frequency
            corpus_probability[word_length] = character_probability
        else:
            character_probability = corpus_probability[word_length]
    return character_probability


def conditional_unigramAi(mask, guessed, **kwargs):
    """
    function for length-conditioned unigram AI play
    """
    character_probability = kwargs.get('character', 0)
    characters = list(string.ascii_lowercase)
    guessed = list(guessed)
    for i in guessed:
        characters.remove(i)
    char = guess(characters, character_probability)
    return char

mistakes_guess = []
for i in test_set:
    mistakes = 0
    word_length = len(i)
    if word_length not in train_length_set:
        character_probability = dynamicCharProbability(training_set)
        kwargs = {'character': character_probability}
        mistakes = hangman(i, conditional_unigramAi, 26, False, **kwargs) 
        mistakes_guess.append(mistakes)
    else:
        kwargs = {'length': word_length}
        character_probability = dynamicCharProbability(training_set, **kwargs)
        kwargs = {'character': character_probability}
        mistakes = hangman(i, conditional_unigramAi, 26, False, **kwargs) 
        mistakes_guess.append(mistakes)

condition_unigram = np.average(mistakes_guess)
print('The average wrong guessing number of unigram model incorporating with word length is', condition_unigram)

The average wrong guessing number of unigram model incorporating with word length is 10.378
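
The length-conditioned counts and the fallback for unseen lengths can also be built in a single pass over the training words. This is a compact sketch of that idea, assuming raw counts are sufficient for picking the most frequent character; `train_length_unigrams`, `counts_for_length` and the toy word list are hypothetical names and data.

```python
from collections import Counter, defaultdict

def train_length_unigrams(words):
    """One character Counter per word length, plus an overall Counter."""
    by_length = defaultdict(Counter)
    overall = Counter()
    for w in words:
        by_length[len(w)].update(w)
        overall.update(w)
    return by_length, overall

def counts_for_length(by_length, overall, length):
    """Counts for this length, or the overall counts if the length is unseen."""
    # .get avoids creating an empty entry in the defaultdict.
    return by_length.get(length) or overall

by_length, overall = train_length_unigrams(["a", "i", "at", "it", "tea"])
print(counts_for_length(by_length, overall, 2).most_common(1))  # [('t', 2)]
print(counts_for_length(by_length, overall, 7) is overall)      # True
```

Backing off to the overall counts keeps the guesser usable for a test word whose length never occurred in training.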


**Instructions:** Now for the main challenge, using an *ngram* language model over characters. The order of characters is obviously important, yet this wasn't incorporated in any of the above models. Knowing that the word has the sequence `n _ s s` is a pretty strong clue that the missing character might be `e`. Similarly, the distributions over characters that start or end a word are highly biased (e.g., toward common prefixes and suffixes, like *un-*, *-ed* and *-ly*).

Your job is to develop an *ngram* language model over characters and train it over the training words, being careful to handle the start of each word properly (e.g., by padding with sentinel symbols). You should use linear interpolation to smooth between the higher order and lower order models, and you will have to decide how to weight each component.

Your guessing AI algorithm should apply your language model to each blank position in the secret word by using as much of the left context as is known. E.g., in `_ e c _ e _ _` we know the full left context for the first blank (context=start of word), we have a context of two characters for the second blank (context=ec), one character for the second-to-last blank (context=e), and no known context for the last one. If we were using an *n=3* order model, we would be able to apply it to the first and second blanks, but would only be able to use the bigram or unigram distributions for the subsequent blanks. As with the unigram model, you should sum over the probability distributions for each blank to find the expected count for each character type, then select the character with the highest expected count.

Implement the ngram method for *n=3,4* and *5* and evaluate each of these three models over the test set. Do you see any improvement over the unigram methods above?

(2 marks)
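
The left-context rule described above can be sketched as a small standalone function. It is an illustration only, assuming `$` as the start-of-word sentinel; `left_contexts` is a hypothetical helper and is separate from the `ngramAi` implementation that follows.

```python
def left_contexts(mask, n, pad='$'):
    """Return (blank_index, context) pairs; each context has at most n-1 characters."""
    padded = [pad] * (n - 1) + list(mask)
    contexts = []
    for i, ch in enumerate(padded):
        if ch != '_':
            continue
        context = []
        j = i - 1
        # Walk left until the context is full or another blank is reached.
        while j >= 0 and len(context) < n - 1 and padded[j] != '_':
            context.insert(0, padded[j])
            j -= 1
        contexts.append((i - (n - 1), ''.join(context)))
    return contexts

print(left_contexts(list('_ec_e__'), 3))
# [(0, '$$'), (3, 'ec'), (5, 'e'), (6, '')]
```

The length of each context plus one gives the highest model order that can be applied at that blank: trigram for the first two blanks, bigram for the third and unigram for the last.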

In [9]:
# training the ngram models
def getNgramSet(training_set, n):
    ngram_dict = {}
    for i in training_set:
        for j in range(len(i)):
            if j < n-1:
                ngram_dict['$'*(n-1-j)+i[0:j+1]] = ngram_dict.get('$'*(n-1-j)+i[0:j+1], 0) + 1
            else:
                ngram_dict[i[j-n+1:j+1]] = ngram_dict.get(i[j-n+1:j+1], 0) + 1
                
    total_frequency = 0
    for k,v in ngram_dict.items():
        total_frequency = total_frequency + v
    for k,v in ngram_dict.items():
        ngram_dict[k] = v / total_frequency
    return ngram_dict

unigram_set = getNgramSet(training_set, 1)
bigram_set = getNgramSet(training_set, 2)
trigram_set = getNgramSet(training_set, 3)
fourgram_set = getNgramSet(training_set, 4)
fivegram_set = getNgramSet(training_set, 5)

ngram_set_dict = {'1': unigram_set, '2': bigram_set, '3': trigram_set, '4': fourgram_set, '5': fivegram_set}

In [10]:
# set up the lambda for each ngram model
def getLambda(ngram_list):
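    # weight for a context of length L is L divided by the summed lengths of all contexts
    # (names chosen to avoid shadowing the built-ins dict and sum)
    lambdas = {}
    total = 0
    for i in ngram_list:
        total = len(i) + total
        
    for i in ngram_list:
        lambdas[len(i)] = len(i)/total
    return lambdas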
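
For comparison with the length-proportional weights above, the sketch below shows a textbook-style linear interpolation of conditional estimates with fixed weights. It is an alternative, not the method used in this notebook; `train_counts`, `interpolated_prob`, the weights `[0.3, 0.7]` and the toy words are all hypothetical, and the context passed in is assumed to hold at least `len(lambdas) - 1` characters.

```python
from collections import Counter
import string

def train_counts(words, max_n, pad='$'):
    """counts[k] maps each k-character string ending at a real letter to its frequency."""
    counts = {k: Counter() for k in range(1, max_n + 1)}
    for w in words:
        padded = pad * (max_n - 1) + w
        for end in range(max_n - 1, len(padded)):
            for k in range(1, max_n + 1):
                counts[k][padded[end - k + 1:end + 1]] += 1
    return counts

def interpolated_prob(ch, context, counts, lambdas):
    """Weighted sum over orders k = 1..len(lambdas) of P_k(ch | last k-1 characters)."""
    prob = 0.0
    for k, lam in enumerate(lambdas, start=1):
        ctx = context[-(k - 1):] if k > 1 else ''
        denom = sum(counts[k][ctx + c] for c in string.ascii_lowercase)
        if denom:  # skip orders whose context was never seen
            prob += lam * counts[k][ctx + ch] / denom
    return prob

counts = train_counts(["ness", "less", "mess"], max_n=2)
p = interpolated_prob('e', 'n', counts, lambdas=[0.3, 0.7])
print(round(p, 3))  # 0.775
```

Here the unigram estimate for `e` is 3/12 and the bigram estimate after `n` is 1/1, so the result is 0.3 * 0.25 + 0.7 * 1.0. When every order has a seen context and the weights sum to one, the interpolated values form a proper distribution over the alphabet.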

In [11]:
# main guessing function
def ngramAi(mask, guessed, **kwargs):
    """
    function for ngram AI play
    """
    # get n for gram model
    n = kwargs['ngram']
    # add start symbols
    for i in range(n-1):
        mask.insert(0, '$')
        
    # get possible characters
    characters = list(string.ascii_lowercase)
    guessed = list(guessed)
    for i in guessed:
        characters.remove(i)
    
    ngram_model_list = []  
    # find the ngram pattern in the mask
    for i in range((n-1),len(mask)):
        previous_words = mask[i-(n-1):i+1]
        if previous_words[-1].isalpha():
            continue
        model = ['_']
        flag = 0
        for i, v in enumerate(reversed(previous_words)):
            if v == '_':
                flag += 1
                if flag == 2:
                    ngram_model_list.append(model)
                    break
                continue
            elif v != '_' and i != (len(previous_words)-1):
                model.insert(0, v)
            elif v != '_' and i == (len(previous_words)-1):
                model.insert(0, v)
                ngram_model_list.append(model)
            else:
                ngram_model_list.append(model)
                break
#     print(mask)
#     print(ngram_model_list)
    parameter_set = getLambda(ngram_model_list)
    character_possibility = {}
    
    for i in characters:
        for j in ngram_model_list:
            gram = ''.join(j[:-1]) + i
            ngram_set = ngram_set_dict[str(len(j))]
            parameter_lambda = parameter_set[len(j)]
            character_possibility[i] = character_possibility.get(i, 0) + (parameter_lambda*ngram_set.get(gram, 0))
        
    character_possibility = sorted(character_possibility.items(), key=lambda x:x[1], reverse=True)
    # remove start symbols
    for i in range(n-1):
        mask.remove('$')
    return character_possibility[0][0]


# kwargs = {'ngram': 4}
# fault = hangman('hello', ngramAi, 26, True, **kwargs)
# print(fault)

In [12]:
# trigram model
kwargs = {'ngram': 3}
mistakes_guess = [hangman(i, ngramAi, 26, False, **kwargs) for i in test_set]
ngram = np.average(mistakes_guess)
print('The average guessing number of trigram model is', ngram)

# fourgram model
kwargs = {'ngram': 4}
mistakes_guess = [hangman(i, ngramAi, 26, False, **kwargs) for i in test_set]
ngram = np.average(mistakes_guess)
print('The average guessing number of fourgram model is', ngram)

# fivegram model
kwargs = {'ngram': 5}
mistakes_guess = [hangman(i, ngramAi, 26, False, **kwargs) for i in test_set]
ngram = np.average(mistakes_guess)
print('The average guessing number of fivegram model is', ngram)

print('The improvement of the guessing number is obvious over the unigram.')

The average guessing number of trigram model is 8.323
The average guessing number of fourgram model is 8.102
The average guessing number of fivegram model is 8.255
The improvement of the guessing number is obvious over the unigram.
