# Likelihood for inferring key

Here I define some likelihood-based methods to infer a key for a cipher.


## Likelihoods based on single letter frequencies

The frequency tables generated in the counting notebook are put to use here. First I look at the frequency of each letter. In this example, I encode a text using a Caesar cipher (shift cipher) with a key of 14, and then return the most probable answer based on the likelihood of each letter.

In [1]:
from utils import *

In [None]:
bible_letters = make_letters('../data/bible.txt')
bible_letter_count = count_letters(bible_letters)
bible_letter_percent = normalize_counts_no_spaces(bible_letter_count)

The simple likelihood computation multiplies together the letter frequencies.
So, for example, the likelihood of "the" is the frequency of t times the frequency of h
times the frequency of e.


In [4]:
print(find_likelihood("the",bible_letter_percent))
print(bible_letter_percent[alpha_list.index("t")] * \
      bible_letter_percent[alpha_list.index("h")] * \
      bible_letter_percent[alpha_list.index("e")])

0.001089168502310533
0.001089168502310533
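As a rough illustration of the product rule above, here is a minimal sketch that does not use `utils`. The names `ALPHABET`, `letter_frequencies`, and `product_likelihood` are hypothetical stand-ins, not the notebook's `alpha_list` or `find_likelihood`, and the frequencies come from a short sample string rather than the Bible text.

```python
import string

ALPHABET = string.ascii_lowercase  # hypothetical stand-in for alpha_list


def letter_frequencies(text):
    """Return 26 relative letter frequencies, ignoring anything that is not a-z."""
    counts = [0] * 26
    for ch in text.lower():
        if ch in ALPHABET:
            counts[ALPHABET.index(ch)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def product_likelihood(text, freqs):
    """Multiply together the frequency of every letter in text (non-letters are skipped)."""
    likelihood = 1.0
    for ch in text.lower():
        if ch in ALPHABET:
            likelihood *= freqs[ALPHABET.index(ch)]
    return likelihood


# A short sample string, used here only so the sketch runs on its own.
sample = "the quick brown fox jumps over the lazy dog and then the other dog"
freqs = letter_frequencies(sample)

by_function = product_likelihood("the", freqs)
by_hand = (freqs[ALPHABET.index("t")]
           * freqs[ALPHABET.index("h")]
           * freqs[ALPHABET.index("e")])
print(by_function)
print(by_hand)
```

Both prints should show the same number, mirroring the check in the cell above.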


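We can also compute the log-likelihood as the sum of the log frequencies.
This avoids floating-point underflow: for a long text, the product of many small frequencies becomes too tiny to represent, while the sum of their logarithms stays well within range.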

In [5]:
# The log of the product likelihood should match the summed log-likelihood
print(log(find_likelihood("the",bible_letter_percent)))
print(find_log_likelihood("the",bible_letter_percent))

-6.82234071576998
-6.822340715769981
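To see why the log form matters for longer texts, the following sketch compares the two approaches on a hypothetical 1000-letter message. The per-letter frequency of 0.05 is an illustrative value chosen for this example, not one taken from the Bible table.

```python
import math

freq = 0.05        # illustrative per-letter frequency (an assumption for this sketch)
n_letters = 1000   # length of a hypothetical message

product = 1.0
log_sum = 0.0
for _ in range(n_letters):
    product *= freq            # shrinks by a factor of 20 on every step
    log_sum += math.log(freq)  # decreases by about 3 on every step

print(product)  # the product underflows to 0.0
print(log_sum)  # the log-likelihood is still an ordinary float
```

Once the product has underflowed to zero, every candidate key would score the same, so comparisons between keys are only meaningful on the log scale.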


When computing the likelihood, I ignore spaces. This is because spaces are often left unchanged when a text is enciphered. See how scoring 'the' and 't h e' returns the same result.

In [7]:
print(find_log_likelihood('the', bible_letter_percent))
print(find_log_likelihood('t h e', bible_letter_percent))

-6.822340715769981
-6.822340715769981
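The opening of this section describes encoding a text with a Caesar cipher using a key of 14 and then picking the most probable answer. A minimal sketch of that idea follows, assuming a brute-force search over all 26 keys scored by single-letter log-likelihood. The function names are hypothetical, and the frequency table holds approximate published English letter percentages rather than the Bible-derived table used in this notebook.

```python
import math
import string

ALPHABET = string.ascii_lowercase

# Approximate English letter frequencies in percent, a through z.
# These are rough reference values, not the notebook's Bible counts.
ENGLISH_PERCENT = [8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0, 6.1, 7.0, 0.15, 0.77,
                   4.0, 2.4, 6.7, 7.5, 1.9, 0.095, 6.0, 6.3, 9.1, 2.8, 0.98,
                   2.4, 0.15, 2.0, 0.074]
TOTAL = sum(ENGLISH_PERCENT)
LOG_FREQ = [math.log(p / TOTAL) for p in ENGLISH_PERCENT]


def caesar_shift(text, key):
    """Shift every letter forward by key positions; spaces and other characters pass through."""
    shifted = []
    for ch in text.lower():
        if ch in ALPHABET:
            shifted.append(ALPHABET[(ALPHABET.index(ch) + key) % 26])
        else:
            shifted.append(ch)
    return "".join(shifted)


def log_likelihood(text):
    """Sum the log frequencies of the letters in text, ignoring non-letters."""
    return sum(LOG_FREQ[ALPHABET.index(ch)] for ch in text.lower() if ch in ALPHABET)


def infer_caesar_key(ciphertext):
    """Try all 26 keys and return the one whose decryption scores highest."""
    return max(range(26), key=lambda k: log_likelihood(caesar_shift(ciphertext, -k)))


plaintext = "in the beginning god created the heaven and the earth"
ciphertext = caesar_shift(plaintext, 14)
best_key = infer_caesar_key(ciphertext)
decoded = caesar_shift(ciphertext, -best_key)
print(ciphertext)
print(best_key, decoded)
```

Because every candidate decryption contains the same number of letters, the scores are directly comparable across keys. Very short ciphertexts may not contain enough letters for single-letter frequencies to single out the right key.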


## Likelihood based on pair frequencies

Of course, letters in English are not independent of their neighbours. Here I compute a likelihood based on the transitions between consecutive letters (a first-order "Markov chain"). First I compute the transition matrix.

In [18]:
bible_pair_counts = count_letter_pairs(bible_letters)
bible_matrix = compute_transition_matrix(bible_pair_counts, 0.5)
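One plausible way to build such a matrix is sketched below. It assumes that the second argument (0.5 above) is an additive-smoothing pseudocount added to every pair count before normalizing each row; that reading is a guess, since the `utils` source is not shown here. The names `count_pairs` and `transition_matrix` are hypothetical, and this sketch drops spaces, so pairs are also counted across word boundaries.

```python
import string

ALPHABET = string.ascii_lowercase


def count_pairs(text):
    """counts[i][j] is how often letter j immediately follows letter i (non-letters dropped)."""
    letters = [ch for ch in text.lower() if ch in ALPHABET]
    counts = [[0] * 26 for _ in range(26)]
    for first, second in zip(letters, letters[1:]):
        counts[ALPHABET.index(first)][ALPHABET.index(second)] += 1
    return counts


def transition_matrix(counts, pseudocount):
    """Normalize each row to probabilities after adding pseudocount to every cell.

    Assumption: this additive smoothing is only a guess at what the 0.5 means.
    """
    matrix = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        matrix.append([(c + pseudocount) / total for c in row])
    return matrix


sample = "the theory that the other thing thrives"
matrix = transition_matrix(count_pairs(sample), 0.5)

t, h, z = ALPHABET.index("t"), ALPHABET.index("h"), ALPHABET.index("z")
print(matrix[t][h])  # "th" is common in the sample
print(matrix[t][z])  # "tz" never occurs, but smoothing keeps it above zero
```

With a positive pseudocount, no transition has probability zero, so taking the logarithm of an unseen pair never fails.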

The log-likelihood is the logarithm of the frequency of the first letter plus the sum of the logarithms of the transition probabilities for each subsequent pair of adjacent letters.

In [19]:
print(find_pair_log_likelihood("the",bible_letter_percent,bible_matrix))
print(log(bible_letter_percent[alpha_list.index("t")]) + \
      log(bible_matrix[alpha_list.index("t")][alpha_list.index("h")]) + \
      log(bible_matrix[alpha_list.index("h")][alpha_list.index("e")]))

-3.778488818565315
-3.778488818565315
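The same computation can be sketched on a toy three-letter alphabet, which makes the arithmetic easy to follow by hand. All probabilities below are made-up illustrative values, and `pair_log_likelihood` is a hypothetical stand-in for `find_pair_log_likelihood`, not its actual implementation.

```python
import math

ALPHABET = "eht"  # toy alphabet, for illustration only

first_freq = [0.5, 0.2, 0.3]  # made-up P(first letter) for e, h, t
transition = [                # transition[i][j] = P(next is j | current is i), made up
    [0.2, 0.3, 0.5],  # after e
    [0.7, 0.1, 0.2],  # after h
    [0.2, 0.7, 0.1],  # after t
]


def pair_log_likelihood(text, first_freq, transition):
    """Log frequency of the first letter plus log transition probabilities of each later pair."""
    idx = [ALPHABET.index(ch) for ch in text if ch in ALPHABET]
    if not idx:
        return 0.0  # empty text: the empty product has log-likelihood 0
    total = math.log(first_freq[idx[0]])
    for cur, nxt in zip(idx, idx[1:]):
        total += math.log(transition[cur][nxt])
    return total


score_the = pair_log_likelihood("the", first_freq, transition)
score_teh = pair_log_likelihood("teh", first_freq, transition)
print(score_the)
print(score_teh)
```

Under the single-letter model, "the" and "teh" score identically because they contain the same letters. The pair model separates them, since it rewards the likely transitions t to h and h to e.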
