# Decoding the Caesar cipher with likelihoods


## Single letter frequencies
When decoding a text, you can return the most likely key for a Caesar cipher (shift cipher): there are only 26 possible keys, so it is quick to score all of them. Aristocrat ciphers (random substitution ciphers) are harder because there are far more possible keys (26! orderings of the alphabet), but decoding based on single-letter frequencies is still a good starting point.
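
As a rough illustration of the brute-force idea, the sketch below tries all 26 shifts and scores each candidate by summing the log probabilities of its letters. It is not the `utils` code used in this notebook: `shift_text`, `letter_log_probs`, `score_letters`, and `rank_keys` are hypothetical names, the add-constant smoothing of 0.5 is an assumption, and the short sample string stands in for a real reference corpus.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase


def shift_text(text, key):
    """Shift each lowercase letter back by `key` places; keep other characters."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) - key) % 26])
        else:
            out.append(ch)
    return "".join(out)


def letter_log_probs(sample, smoothing=0.5):
    """Log probability of each letter, estimated from `sample` (assumed smoothing)."""
    counts = Counter(ch for ch in sample.lower() if ch in ALPHABET)
    total = sum(counts.values()) + smoothing * len(ALPHABET)
    return {ch: math.log((counts[ch] + smoothing) / total) for ch in ALPHABET}


def score_letters(text, log_probs):
    """Sum of per-letter log probabilities; spaces and punctuation are ignored."""
    return sum(log_probs[ch] for ch in text if ch in ALPHABET)


def rank_keys(ciphertext, log_probs):
    """Return (score, key, candidate) for all 26 keys, best score first."""
    candidates = []
    for key in range(26):
        candidate = shift_text(ciphertext, key)
        candidates.append((score_letters(candidate, log_probs), key, candidate))
    return sorted(candidates, reverse=True)


# Hypothetical stand-in for a real corpus; a sample this small may not
# rank the true key first.
reference = letter_log_probs("the quick brown fox jumps over the lazy dog and then rests")
for score, key, candidate in rank_keys("vszzc vck ofs mci hcrom", reference)[:3]:
    print(f"{key:02d}: {candidate}: {score}")
```

Working in log space keeps the score a simple sum and avoids underflow when many small probabilities are multiplied.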

In [1]:
from utils import *
# Single-letter statistics from the reference text
bible_letters = make_letters('../data/bible.txt')
bible_letter_count = count_letters(bible_letters)
bible_letter_percent = normalize_counts_no_spaces(bible_letter_count, 0.5)
# Letter-pair statistics from the same text
bible_pair_counts = count_letter_pairs(bible_letters)
bible_matrix = compute_transition_matrix(bible_pair_counts, 0.5)

In [2]:
ciphertext = encode_caesar_cipher('hello how are you today', 14)
solutions = ordered_solutions(ciphertext, bible_letter_percent,
                              bible_matrix, find_solution_brute)
for solution in solutions:
    print(solution)

14: hello how are you today: -54.590803576239814
25: wtaad wdl pgt ndj idspn: -62.25557692259439
21: axeeh ahp tkx rhn mhwtr: -64.2887091643381
10: lipps lsa evi csy xshec: -65.58838283212106
07: olssv ovd hyl fvb avkhf: -66.82471295637038
04: rovvy ryg kbo iye dynki: -68.06179296722199
06: pmttw pwe izm gwc bwlig: -70.59607092888055
23: yvccf yfn riv pfl kfurp: -72.00447375957069
20: byffi biq uly sio nixus: -72.26343463063272
01: uryyb ubj ner lbh gbqnl: -72.36354247410443
11: khoor krz duh brx wrgdb: -72.5389979684537
17: ebiil elt xob vlr qlaxv: -74.16870922129499
00: vszzc vck ofs mci hcrom: -74.7165539625702
08: nkrru nuc gxk eua zujge: -75.20262045904478
18: dahhk dks wna ukq pkzwu: -75.81245156852515
16: fcjjm fmu ypc wms rmbyw: -75.84690922930247
09: mjqqt mtb fwj dtz ytifd: -79.99942397726669
24: xubbe xem qhu oek jetqo: -80.09142702803763
13: ifmmp ipx bsf zpv upebz: -80.90718063614096
02: tqxxa tai mdq kag fapmk: -81.36177048612468
03: spwwz szh lcp jzf ezolj: -83.132547510

## Pair letter frequencies

Single-letter frequencies are a good starting point, especially for Caesar ciphers with only 26 possible keys, but letters in English are not independent of one another. Scoring pairs of adjacent letters partially addresses this, and it is more accurate than using single-letter frequencies alone. As the output below shows, the gap between the most likely and second most likely solutions is about 33.4 with pair frequencies, compared with about 7.7 with single-letter frequencies.
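
One way to picture pair scoring is as a first-order Markov chain: estimate the probability of each symbol given the one before it, then sum the log probabilities over every adjacent pair in a candidate. The sketch below is an assumption about the general approach, not the notebook's `compute_transition_matrix` or `find_solution_brute_pairs`; the names `clean`, `transition_log_probs`, and `score_pairs`, the smoothing constant, and the toy sample are all hypothetical.

```python
import math
from collections import Counter

SYMBOLS = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus the space


def clean(text):
    """Lowercase the text and drop anything outside SYMBOLS."""
    return "".join(ch for ch in text.lower() if ch in SYMBOLS)


def transition_log_probs(sample, smoothing=0.5):
    """Log P(next | prev) for every symbol pair, with assumed add-constant smoothing."""
    sample = clean(sample)
    pair_counts = Counter(zip(sample, sample[1:]))
    table = {}
    for prev in SYMBOLS:
        row_total = sum(pair_counts[(prev, nxt)] for nxt in SYMBOLS)
        row_total += smoothing * len(SYMBOLS)
        for nxt in SYMBOLS:
            table[(prev, nxt)] = math.log((pair_counts[(prev, nxt)] + smoothing) / row_total)
    return table


def score_pairs(text, table):
    """Sum of log transition probabilities over adjacent symbols in `text`."""
    text = clean(text)
    return sum(table[pair] for pair in zip(text, text[1:]))


# Toy sample standing in for a real corpus (hypothetical).
table = transition_log_probs("the cat sat on the mat the end")
print(score_pairs("the", table) > score_pairs("hte", table))  # True
```

Because each row of the table is normalized separately, a pair such as `th` is rewarded for being common after `t`, which a single-letter score cannot capture.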

In [8]:
ciphertext = encode_caesar_cipher('hello how are you today', 14)
solutions = ordered_solutions(ciphertext, bible_letter_percent,
                              bible_matrix, find_solution_brute_pairs)
for solution in solutions:
    print(solution)

14: hello how are you today: -55.43912137746925
20: byffi biq uly sio nixus: -88.83674757288385
10: lipps lsa evi csy xshec: -93.65149873288952
04: rovvy ryg kbo iye dynki: -100.16344143105046
11: khoor krz duh brx wrgdb: -102.26263427475001
25: wtaad wdl pgt ndj idspn: -103.35759019040786
06: pmttw pwe izm gwc bwlig: -106.83924927521556
00: vszzc vck ofs mci hcrom: -108.74677848432158
21: axeeh ahp tkx rhn mhwtr: -108.91687193549512
24: xubbe xem qhu oek jetqo: -112.45098890769391
13: ifmmp ipx bsf zpv upebz: -113.28627415547322
01: uryyb ubj ner lbh gbqnl: -116.60360370055872
02: tqxxa tai mdq kag fapmk: -117.52689084107709
23: yvccf yfn riv pfl kfurp: -120.15082733016793
16: fcjjm fmu ypc wms rmbyw: -120.19259936419115
08: nkrru nuc gxk eua zujge: -121.67366535594951
07: olssv ovd hyl fvb avkhf: -123.17503984208868
17: ebiil elt xob vlr qlaxv: -125.77922338419904
18: dahhk dks wna ukq pkzwu: -127.33259040411828
03: spwwz szh lcp jzf ezolj: -130.31222100120044
09: mjqqt mtb fwj dtz y

Unlike the single-letter method, the pair-frequency method takes spaces into account, so it captures which letters most often start and end words. To demonstrate, call the function once with spaces in the text and once without; the returned likelihoods differ.

In [9]:
ciphertext = encode_caesar_cipher('hello how are you today', 14)
solutions = ordered_solutions(ciphertext, bible_letter_percent,
                              bible_matrix, find_solution_brute_pairs)
print(solutions[0])

ciphertext = encode_caesar_cipher('hellohowareyoutoday', 14)
solutions = ordered_solutions(ciphertext, bible_letter_percent,
                              bible_matrix, find_solution_brute_pairs)
print(solutions[0])

14: hello how are you today: -55.43912137746925
14: hellohowareyoutoday: -50.292602006560514
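
A plausible reason the two scores differ is that the two texts are scored over different sets of adjacent pairs. The short sketch below only lists those pairs; it assumes the score is built from adjacent symbols and makes no claim about how `find_solution_brute_pairs` handles the start or end of the text. `adjacent_pairs` is a hypothetical helper.

```python
def adjacent_pairs(text):
    """Return every pair of neighbouring characters in `text`."""
    return list(zip(text, text[1:]))


spaced = adjacent_pairs("hello how are you today")
unspaced = adjacent_pairs("hellohowareyoutoday")

# Pairs that involve a space describe how words end and begin.
boundary = [pair for pair in spaced if " " in pair]

print(len(spaced), len(unspaced))  # 22 18
print(len(boundary))               # 8
print(boundary[:2])                # [('o', ' '), (' ', 'h')]
```

With spaces, the end of `hello` and the start of `how` contribute the pairs `('o', ' ')` and `(' ', 'h')`; without spaces, the same position contributes the single pair `('o', 'h')`.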
