# Substitution cipher

In this exercise, we'll explore one of the most basic problems in cryptography - cracking the code of a simple substitution cipher in which each letter of the alphabet is replaced by another. Although modern cryptography uses much more complex techniques, this illustrates some of the key statistical issues such as the use of word and letter frequencies.

In [1]:
from collections import OrderedDict
from collections import Counter
import string
import re
import numpy as np

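Define a table of English letter-usage frequencies, ordered from least to most common. We'll use this later to match encrypted letters to plain-text letters by rank.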

In [2]:
english_letter_freq = OrderedDict({'z': 0.00074, 'q': 0.00095, 'x': 0.00150, 'j': 0.00153, 'k': 0.00772,
 'v': 0.00978, 'b': 0.01492, 'p': 0.01929, 'y': 0.01974, 'g': 0.02015,
 'f': 0.02228, 'w': 0.02360, 'm': 0.02406, 'u': 0.02758, 'c': 0.02782,
 'l': 0.04025, 'd': 0.04253, 'r': 0.05987, 'h': 0.06094, 's': 0.06327,
 'n': 0.06749, 'i': 0.06966, 'o': 0.07507, 'a': 0.08167, 't': 0.09056,
 'e': 0.12702})
english_letter_order = ''.join(english_letter_freq.keys())
print('English letters ordered from least to most common:', english_letter_order)

English letters ordered from least to most common: zqxjkvbpygfwmucldrhsnioate


Read in the full text of Frankenstein. For convenience, we're going to convert all upper case letters to lower case.

In [3]:
with open('Frankenstein.txt', 'rt', encoding='utf8') as file:  
#with open('TheTimeMachine.txt', 'rt') as file:  

    orig_text = file.read()
    
orig_text = orig_text.lower()
print(orig_text)

you will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings.  i arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.

i am already far north of london, and as i walk in the streets of
petersburgh, i feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight.  do you understand this
feeling?  this breeze, which has travelled from the regions towards
which i am advancing, gives me a foretaste of those icy climes.
inspirited by this wind of promise, my daydreams become more fervent
and vivid.  i try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagination as the
region of beauty and delight.  there, margaret, the sun is for ever
visible, its broad disk just skirting the horizon and diffusing a
perpetual splendour.  there

We now create our code, where one set of letters substitutes for another. We'll do this by taking the 26 lower case letters and scrambling them into a random order. Note that we're setting the seed for the random number generator so that we get the same results every time. To generate different codes, either change the seed or comment out the np.random.seed line.

As a simplification, we only apply the substitution cipher to the letters a-z and leave numbers, punctuation and whitespace unchanged.

In [4]:
intab = string.ascii_lowercase
np.random.seed(12345678)
scrambled = np.random.permutation(list(intab))
outtab = ''.join(scrambled)
trantab_forward = str.maketrans(intab, outtab)
trantab_backward = str.maketrans(outtab, intab)

print('Original letters  :', intab)
print('Encrypted letters :', outtab)

Original letters  : abcdefghijklmnopqrstuvwxyz
Encrypted letters : kafjnxwvdhimsoguyzbplqrect


Next, we use the code to translate the original text into the secret text.

In [5]:
encrypted_text = orig_text.translate(trantab_forward)
print(encrypted_text)

cgl rdmm znhgdfn pg vnkz pvkp og jdbkbpnz vkb kffgsukodnj pvn
fgssnofnsnop gx ko nopnzuzdbn rvdfv cgl vkqn znwkzjnj rdpv blfv nqdm
xgznagjdowb.  d kzzdqnj vnzn cnbpnzjkc, koj sc xdzbp pkbi db pg kbblzn
sc jnkz bdbpnz gx sc rnmxkzn koj dofznkbdow fgoxdjnofn do pvn blffnbb
gx sc lojnzpkidow.

d ks kmznkjc xkz ogzpv gx mgojgo, koj kb d rkmi do pvn bpznnpb gx
unpnzbalzwv, d xnnm k fgmj ogzpvnzo aznntn umkc lugo sc fvnnib, rvdfv
azkfnb sc onzqnb koj xdmmb sn rdpv jnmdwvp.  jg cgl lojnzbpkoj pvdb
xnnmdow?  pvdb aznntn, rvdfv vkb pzkqnmmnj xzgs pvn znwdgob pgrkzjb
rvdfv d ks kjqkofdow, wdqnb sn k xgznpkbpn gx pvgbn dfc fmdsnb.
dobudzdpnj ac pvdb rdoj gx uzgsdbn, sc jkcjznksb anfgsn sgzn xnzqnop
koj qdqdj.  d pzc do qkdo pg an unzblkjnj pvkp pvn ugmn db pvn bnkp gx
xzgbp koj jnbgmkpdgo; dp nqnz uznbnopb dpbnmx pg sc dskwdokpdgo kb pvn
znwdgo gx anklpc koj jnmdwvp.  pvnzn, skzwkznp, pvn blo db xgz nqnz
qdbdamn, dpb azgkj jdbi hlbp bidzpdow pvn vgzdtgo koj jdxxlbdow k
unzunplkm bumnojglz.  pvnzn

If we did everything correctly, we can reverse the process and recover the original text.

In [6]:
print(encrypted_text.translate(trantab_backward))

you will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings.  i arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.

i am already far north of london, and as i walk in the streets of
petersburgh, i feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight.  do you understand this
feeling?  this breeze, which has travelled from the regions towards
which i am advancing, gives me a foretaste of those icy climes.
inspirited by this wind of promise, my daydreams become more fervent
and vivid.  i try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagination as the
region of beauty and delight.  there, margaret, the sun is for ever
visible, its broad disk just skirting the horizon and diffusing a
perpetual splendour.  there
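Rather than comparing the two printouts by eye, the round trip can also be checked with an assertion. The following is a minimal sketch, not the notebook's own code: `make_tables` is a hypothetical helper, and it uses the standard-library `random` module instead of NumPy so that it runs on its own.

```python
import random
import string

def make_tables(seed):
    """Hypothetical helper: build forward and backward translation tables."""
    letters = string.ascii_lowercase
    rng = random.Random(seed)  # seeded so the key is reproducible
    shuffled = ''.join(rng.sample(letters, len(letters)))
    forward = str.maketrans(letters, shuffled)
    backward = str.maketrans(shuffled, letters)
    return forward, backward

forward, backward = make_tables(42)
sample = "you will rejoice to hear that no disaster has accompanied the"
encrypted_sample = sample.translate(forward)

# Decrypting the encrypted text should give back exactly what we started with
assert encrypted_sample.translate(backward) == sample
# Characters outside a-z (spaces, digits, punctuation) pass through unchanged
assert "1818, ".translate(forward) == "1818, "
```

Because the key is a permutation of the 26 letters, every letter has exactly one image and one pre-image, which is why the backward table undoes the forward one.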

As a first pass, we'll go through the text, get rid of everything that is not one of the letters a-z and count the letter frequency. The first line uses an intermediate Python feature called regular expressions to strip out the characters we don't want.

In [7]:
encrypted_atoz = re.sub(r'[^a-z]', '', encrypted_text)

# Counter accepts a string directly and counts each character
encrypted_letter_freq = Counter(encrypted_atoz)
print('Letter usage in encrypted text')
print(encrypted_letter_freq, '\n')
encrypted_letter_order = ''.join(sorted(encrypted_letter_freq, key=encrypted_letter_freq.get))

print('Letters ordered by usage from least to most common')
print('English   :', english_letter_order)
print('Encrypted :', encrypted_letter_order)

Letter usage in encrypted text
Counter({'n': 46067, 'p': 30414, 'k': 26757, 'g': 25233, 'd': 24615, 'o': 24381, 'b': 21188, 'z': 20860, 'v': 19757, 'j': 16871, 'm': 12747, 's': 10609, 'l': 10415, 'f': 9276, 'x': 8736, 'c': 7918, 'r': 7640, 'u': 6146, 'w': 5976, 'a': 5027, 'q': 3834, 'i': 1755, 'e': 677, 'h': 504, 'y': 324, 't': 215}) 

Letters ordered by usage from least to most common
English   : zqxjkvbpygfwmucldrhsnioate
Encrypted : tyheiqawurcxflsmjvzbodgkpn
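The raw counts above are hard to compare with the English table, which is given as fractions. As a rough sketch, the counts can be converted to percentages; `letter_percentages` is a hypothetical helper and the sample sentence is just a short excerpt for illustration.

```python
from collections import Counter
import re

def letter_percentages(text):
    """Return {letter: percent of all a-z letters}, most common first."""
    letters = re.sub(r'[^a-z]', '', text.lower())
    counts = Counter(letters)
    total = sum(counts.values())
    return {ch: 100 * n / total for ch, n in counts.most_common()}

sample = "there, margaret, the sun is for ever visible"
for ch, pct in list(letter_percentages(sample).items())[:3]:
    print(f"{ch}: {pct:.1f}%")
```

Applied to the encrypted text, the same idea would show which encrypted letters sit so close together in frequency that their rank order is unreliable.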


Assuming that the letter usage in our text matches that in our frequency table, we should be all done. Let's apply our results and see if we recover our original text.

In [8]:
trantab = str.maketrans(encrypted_letter_order, english_letter_order)
print(encrypted_text.translate(trantab))

fou gill hexoime to reah trat no disasteh ras ammocyanied tre
moccenmecent ow an entehyhise grimr fou rave hepahded gitr sumr evil
wohebodinps.  i ahhived rehe festehdaf, and cf wihst task is to assuhe
cf deah sisteh ow cf gelwahe and inmheasinp monwidenme in tre summess
ow cf undehtakinp.

i ac alheadf wah nohtr ow london, and as i galk in tre stheets ow
yetehsbuhpr, i weel a mold nohtrehn bheeze ylaf uyon cf mreeks, grimr
bhames cf nehves and wills ce gitr deliprt.  do fou undehstand tris
weelinp?  tris bheeze, grimr ras thavelled whoc tre hepions togahds
grimr i ac advanminp, pives ce a wohetaste ow trose imf mlices.
insyihited bf tris gind ow yhocise, cf dafdheacs bemoce cohe wehvent
and vivid.  i thf in vain to be yehsuaded trat tre yole is tre seat ow
whost and desolation; it eveh yhesents itselw to cf icapination as tre
hepion ow beautf and deliprt.  trehe, cahpahet, tre sun is woh eveh
visible, its bhoad disk xust skihtinp tre rohizon and diwwusinp a
yehyetual sylendouh.  trehe

Something doesn't look quite right. We've definitely made progress with cracking the substitution code and some words are recognizable. It appears that we've at least correctly identified the most common letters, but there's still a lot of garbled text.

As a next step, we'll take advantage of another property of the English language - the frequent use of the word 'the'. In fact, 'the' occurs about twice as often as 'and', the second most common 3-letter word.

A key advantage of identifying the word 'the' is that it helps us with the assignment of the letter h, which can be challenging since h and r appear with nearly identical frequency (h = 6.094%, r = 5.987%).

Before we update our solution to the cipher, we should be confident that we've correctly identified 'the'. Fortunately, the last letter (e) and first letter (t) are the first and second most common letters in English text, and our letter frequency analysis will almost certainly identify them correctly.

In [9]:
# The letters h and r occur with nearly the same frequency and are frequently reversed. 
# We can identify h, along with t and e, by finding the most common word in the English
# language 'the'

# The letter corresponding to h in the encrypted text should be in position 18 in the
# encrypted_letter_order string (remember, first position is 0). If it occurs in the
# wrong position, swap it with the letter in position 18

encrypted_letter_order_v2 = encrypted_letter_order

most_common_word = Counter(encrypted_text.split()).most_common(1)
encrypted_t = most_common_word[0][0][0]
encrypted_h = most_common_word[0][0][1]
encrypted_e = most_common_word[0][0][2]

# Find location of t in the encrypted letter order
pos_encrypted_t = encrypted_letter_order.find(encrypted_t)
print('Location of t in encrypted letter order :', pos_encrypted_t)

# Find location of h in the encrypted letter order
pos_encrypted_h = encrypted_letter_order.find(encrypted_h)
print('Location of h in encrypted letter order :', pos_encrypted_h)

# Find location of e in the encrypted letter order
pos_encrypted_e = encrypted_letter_order.find(encrypted_e)
print('Location of e in encrypted letter order :', pos_encrypted_e)

if pos_encrypted_t != 24 or pos_encrypted_e != 25:
    print("\nMisidentified 'the'\n")
else:
    print("\nIdentified 'the' with high confidence\n")
    if pos_encrypted_h != 18:
        print("h was in wrong location\n")
        templist = list(encrypted_letter_order)
        templist[pos_encrypted_h], templist[18] = templist[18], templist[pos_encrypted_h]
        encrypted_letter_order_v2 = ''.join(templist)
    else:
        print("h was in correct location\n")

print('English     :', english_letter_order)
print('Encrypted   :', encrypted_letter_order)
print('Encrypted v2:', encrypted_letter_order_v2)

Location of t in encrypted letter order : 24
Location of h in encrypted letter order : 17
Location of e in encrypted letter order : 25

Identified 'the' with high confidence

h was in wrong location

English     : zqxjkvbpygfwmucldrhsnioate
Encrypted   : tyheiqawurcxflsmjvzbodgkpn
Encrypted v2: tyheiqawurcxflsmjzvbodgkpn
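One caveat with the cell above: `split()` keeps punctuation attached, so 'the' and 'the,' are counted as different words. A possible alternative, sketched here with a hypothetical helper `top_words`, extracts runs of letters with a regular expression and counts only words of a given length.

```python
from collections import Counter
import re

def top_words(text, length, n=3):
    """Return the n most common words of exactly `length` letters."""
    words = re.findall(r'[a-z]+', text.lower())
    return Counter(w for w in words if len(w) == length).most_common(n)

sample = "the sun and the moon and the stars, for the night is long"
print(top_words(sample, 3, n=2))
# [('the', 4), ('and', 2)]
```

Run on the encrypted text, the top three-letter word would be the candidate for 'the', and a runner-up with roughly half the count would be consistent with 'and'.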


In [110]:
trantab = str.maketrans(encrypted_letter_order_v2, english_letter_order)
translate_1 = encrypted_text.translate(trantab)
print(translate_1)

fou gill rexoime to hear that no disaster has ammocyanied the
moccenmecent ow an enteryrise ghimh fou have reparded gith sumh evil
worebodinps.  i arrived here festerdaf, and cf wirst task is to assure
cf dear sister ow cf gelware and inmreasinp monwidenme in the summess
ow cf undertakinp.

i ac alreadf war north ow london, and as i galk in the streets ow
yetersburph, i weel a mold northern breeze ylaf uyon cf mheeks, ghimh
brames cf nerves and wills ce gith delipht.  do fou understand this
weelinp?  this breeze, ghimh has travelled wroc the repions togards
ghimh i ac advanminp, pives ce a woretaste ow those imf mlices.
insyirited bf this gind ow yrocise, cf dafdreacs bemoce core wervent
and vivid.  i trf in vain to be yersuaded that the yole is the seat ow
wrost and desolation; it ever yresents itselw to cf icapination as the
repion ow beautf and delipht.  there, carparet, the sun is wor ever
visible, its broad disk xust skirtinp the horizon and diwwusinp a
yeryetual sylendour.  there

Our experience with 'the' drives home the limitations of relying purely on letter frequency. Too many of the letters have frequencies that are nearly indistinguishable. The situation is complicated by the choice of text used to derive the letter frequencies, since letter usage may shift subtly over time.

In addition to letter frequencies and word frequencies, we can make use of the relative frequencies with which letters appear as the first letter in a word. Although this doesn't give us much progress on cracking our code, the letters z, q, x and j present an interesting case. All four of these appear with low frequencies (0.074%, 0.095%, 0.150% and 0.153% respectively) and there's a significant gap between the frequency of j and that of k (0.772%), the next letter in order of frequency. As a result, these four letters normally occupy the first four spots, but with the order often scrambled, especially for z/q and x/j. They can be distinguished, though, by their frequency as the first letter in a word.
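First-letter counts can be gathered with a short regular expression. This is a minimal sketch under our own assumptions: `first_letter_counts` is a hypothetical helper that skips leading punctuation such as quotes or underscores and ignores tokens that start with a digit.

```python
from collections import Counter
import re

def first_letter_counts(text):
    """Count the first letter of each whitespace-separated word.

    Leading punctuation (quotes, brackets, underscores) is skipped;
    tokens that begin with a digit are ignored.
    """
    pattern = r'(?:^|\s)[^a-z0-9\s]*([a-z])'
    return Counter(re.findall(pattern, text.lower()))

sample = '“you will, (yes) 2nd _the_ jump'
print(first_letter_counts(sample))
```

With counts like these for the encrypted text, the four rarest letters could be compared by how often each one starts a word rather than by overall frequency alone.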

## Progress break

At this point, we need to take advantage of other properties of English, such as the frequency of double letters, the frequency of first letters, or common two-letter words. Here are some links with useful information:

https://blogs.sas.com/content/iml/2014/10/03/double-letter-bigrams.html  
https://www3.nd.edu/~busiforc/handouts/cryptography/cryptography%20hints.html
https://en.wikipedia.org/wiki/Letter_frequency

In [105]:
english_double_freq = OrderedDict({'o': 0.00137, 'm': 0.00146, 'l': 0.00171, 'f': 0.00210, 't': 0.00378, 
                                  'e': 0.00405, 's': 0.00577})
# double_letter_freq = OrderedDict({ 't': 0.00378, 'e': 0.00405, 's': 0.00577})
english_double_order = ''.join(english_double_freq.keys())
print('English double letters ordered from least to most common:', english_double_order)

English double letters ordered from least to most common: omlftes


In [111]:
encrypted_double = re.findall(r'([a-z])\1', translate_1)

# findall returns one letter per doubled pair, so we can count the list directly
encrypted_double_freq = Counter(encrypted_double)
print(encrypted_double_freq, '\n')
encrypted_double_order = ''.join(sorted(encrypted_double_freq, key=encrypted_double_freq.get))
encrypted_double_order = encrypted_double_order[-1*len(english_double_order):]

print('Letters ordered by usage from least to most common')
print('English   :', english_double_order)
print('Encrypted :', encrypted_double_order)

Counter({'l': 1492, 's': 1285, 'e': 1180, 'o': 669, 't': 565, 'y': 516, 'r': 435, 'w': 314, 'm': 263, 'n': 209, 'c': 169, 'd': 109, 'p': 33, 'b': 11, 'g': 6, 'z': 5, 'i': 2, 'a': 1, 'h': 1}) 

Letters ordered by usage from least to most common
English   : omlftes
Encrypted : rytoesl
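To make the doubled-letter pattern easier to follow, here is a small sketch on a sample we can check by hand. `doubled_letter_counts` is a hypothetical helper; the regular expression is the same one used in the cell above, where `\1` refers back to the letter captured by the group.

```python
from collections import Counter
import re

def doubled_letter_counts(text):
    """Count each non-overlapping doubled letter (e.g. 'ss', 'ee') in text."""
    return Counter(re.findall(r'([a-z])\1', text.lower()))

sample = "committee success assess"
counts = doubled_letter_counts(sample)
print(counts.most_common(1))
# [('s', 3)]
```

Note that matches do not overlap, so a run of three identical letters counts as one pair, and that rank matching against a seven-letter table is only as reliable as the gaps between the counts.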


In [104]:
# encrypted_double_freq = {k: v for k, v in encrypted_double_freq.items() if v > 400}
# encrypted_double_order = ''.join(sorted(encrypted_double_freq, key=encrypted_double_freq.get))
# print(encrypted_double_freq)
# encrypted_double_order

In [109]:
encrypted_letter_order_v3 = encrypted_letter_order_v2
trantab = str.maketrans(encrypted_double_order, english_double_order)
translate_2 = translate_1.translate(trantab)
print(translate_2)

ffu giss otxfimt lf htao lhal nf dieaelto hae ammfcmanitd lht
mfcctnmtctnl fw an tnltomoiet ghimh ffu havt otpaodtd gilh eumh tvis
wfotbfdinpe.  i aooivtd htot fteltodaf, and cf wioel laek ie lf aeeuot
cf dtao eielto fw cf gtswaot and inmotaeinp mfnwidtnmt in lht eummtee
fw cf undtolakinp.

i ac asotadf wao nfolh fw sfndfn, and ae i gask in lht elottle fw
mtltoebuoph, i wtts a mfsd nfolhton bottzt msaf umfn cf mhttke, ghimh
boamte cf ntovte and wisse ct gilh dtsiphl.  df ffu undtoeland lhie
wttsinp?  lhie bottzt, ghimh hae loavtsstd wofc lht otpifne lfgaode
ghimh i ac advanminp, pivte ct a wfotlaelt fw lhfet imf msicte.
inemioiltd bf lhie gind fw mofciet, cf dafdotace btmfct cfot wtovtnl
and vivid.  i lof in vain lf bt mtoeuadtd lhal lht mfst ie lht etal fw
wofel and dtefsalifn; il tvto motetnle iletsw lf cf icapinalifn ae lht
otpifn fw btaulf and dtsiphl.  lhtot, caopaotl, lht eun ie wfo tvto
vieibst, ile bofad diek xuel ekiolinp lht hfoizfn and diwwueinp a
mtomtluas emstndfuo.  lhtot

In [35]:
# Find the first two letters of every word (word-initial digraphs only)
y = re.findall(r'(?<=\s)[a-z]{2}|^[a-z]{2}', encrypted_text)

['cg',
 'rd',
 'zn',
 'pg',
 'vn',
 'pv',
 'og',
 'jd',
 'vk',
 'kf',
 'pv',
 'fg',
 'gx',
 'ko',
 'no',
 'rv',
 'cg',
 'vk',
 'zn',
 'rd',
 'bl',
 'nq',
 'xg',
 'kz',
 'vn',
 'cn',
 'ko',
 'sc',
 'xd',
 'pk',
 'db',
 'pg',
 'kb',
 'sc',
 'jn',
 'bd',
 'gx',
 'sc',
 'rn',
 'ko',
 'do',
 'fg',
 'do',
 'pv',
 'bl',
 'gx',
 'sc',
 'lo',
 'ks',
 'km',
 'xk',
 'og',
 'gx',
 'mg',
 'ko',
 'kb',
 'rk',
 'do',
 'pv',
 'bp',
 'gx',
 'un',
 'xn',
 'fg',
 'og',
 'az',
 'um',
 'lu',
 'sc',
 'fv',
 'rv',
 'az',
 'sc',
 'on',
 'ko',
 'xd',
 'sn',
 'rd',
 'jn',
 'jg',
 'cg',
 'lo',
 'pv',
 'xn',
 'pv',
 'az',
 'rv',
 'vk',
 'pz',
 'xz',
 'pv',
 'zn',
 'pg',
 'rv',
 'ks',
 'kj',
 'wd',
 'sn',
 'xg',
 'gx',
 'pv',
 'df',
 'fm',
 'do',
 'ac',
 'pv',
 'rd',
 'gx',
 'uz',
 'sc',
 'jk',
 'an',
 'sg',
 'xn',
 'ko',
 'qd',
 'pz',
 'do',
 'qk',
 'pg',
 'an',
 'un',
 'pv',
 'pv',
 'ug',
 'db',
 'pv',
 'bn',
 'gx',
 'xz',
 'ko',
 'jn',
 'dp',
 'nq',
 'uz',
 'dp',
 'pg',
 'sc',
 'ds',
 'kb',
 'pv',
 'zn',
 'gx',

In [86]:
ycount = Counter(list(y))
eorder = ' '.join(sorted(ycount, key=ycount.get))
diorder = 've rt de as ar ou is le io st it to ti ea nt or of es en at ha nd ed in he re an on er th'
eorder = eorder[-1*len(diorder):]

In [87]:
trantab2 = str.maketrans(eorder, diorder)
a_tran = translate_1.translate(trantab2)
print(a_tran)

era oiee ienrime tr hehi thht ar rihhhtei hhh hmmreyhaier the
mreeeameeeat rw ha eateiyiihe ohimh era hhhe iethirer oith hamh ehie
wrieirriath.  i hiiiher heie eehteirhe, har ee wiiht thhe ih tr hhhaie
ee rehi hihtei rw ee oeewhie har iamiehhiat mrawireame ia the hammehh
rw ee aareitheiat.

i he heiehre whi arith rw erarra, har hh i ohee ia the htieeth rw
yeteihiaith, i weee h mrer aritheia iieeee yehe ayra ee mheeeh, ohimh
iihmeh ee aeiheh har wieeh ee oith reeitht.  rr era aareihthar thih
weeeiat?  thih iieeee, ohimh hhh tihheeeer wire the ietirah trohirh
ohimh i he hrhhamiat, tiheh ee h wriethhte rw thrhe ime meieeh.
iahyiiiter ie thih oiar rw yireihe, ee rherieheh iemree erie weiheat
har hihir.  i tie ia hhia tr ie yeihahrer thht the yree ih the heht rw
wirht har rehrehtira; it ehei yieheath itheew tr ee iehtiahtira hh the
ietira rw iehate har reeitht.  theie, ehithiet, the haa ih wri ehei
hihiiee, ith iirhr rihe naht heiitiat the hriiera har riwwahiat h
yeiyetahe hyeearrai.  theie
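One likely reason the output above is so garbled is that `str.maketrans` works on single characters, not on two-letter groups. When the same character appears more than once in the first argument, the last pairing wins. The sketch below illustrates this with made-up strings rather than the notebook's data.

```python
# str.maketrans pairs characters position by position:
#   't'->'a', 'h'->'n', ' '->' ', 'h'->'i', 'e'->'n'
# The second 'h' overrides the first, so 'h' ends up mapped to 'i'.
table = str.maketrans('th he', 'an in')

print('the'.translate(table))
# ain

# The table is a plain dict keyed by character code, one entry per character
print(len(table))
# 4
```

A digraph-based approach would instead need to turn each matched pair into individual letter guesses and resolve conflicts between them before building a translation table.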

In [89]:
# Find the first three letters of every word (word-initial trigraphs only)
z = re.findall(r'(?<=\s)[a-z]{3}|^[a-z]{3}', encrypted_text)

In [97]:
zcount = Counter(list(z))
eorder = ' '.join(sorted(zcount, key=zcount.get))
diorder = 'men sth oft tis edt nce has nde for tio ion ent tha and the'
eorder = eorder[-1*len(diorder):]

In [101]:
trantab3 = str.maketrans(eorder, diorder)
a_tran1 = encrypted_text.translate(trantab3)
print(a_tran1)

ion eimm rehoiee to hear that no ditatter hat aeeosuanied the
eosseneesent of an enterurite ehieh ion haqe rewarded eith tneh eqim
forehodinwt.  i arriqed here ietterdai, and si firtt tati it to attnre
si dear titter of si eemfare and inereatinw eonfidenee in the tneeett
of si nndertaiinw.

i as amreadi far north of mondon, and at i eami in the ttreett of
ueterthnrwh, i feem a eomd northern hreete umai nuon si eheeit, ehieh
hraeet si nerqet and fimmt se eith demiwht.  do ion nnderttand thit
feeminw?  thit hreete, ehieh hat traqemmed fros the rewiont toeardt
ehieh i as adqaneinw, wiqet se a foretatte of thote iei emiset.
intuirited hi thit eind of urosite, si daidreast heeose sore ferqent
and qiqid.  i tri in qain to he uertnaded that the uome it the teat of
frott and detomation; it eqer uretentt ittemf to si isawination at the
rewion of heanti and demiwht.  there, sarwaret, the tnn it for eqer
qitihme, itt hroad diti hntt tiirtinw the horiton and diffntinw a
ueruetnam tumendonr.  there

In [102]:
a_tran2 = a_tran1.translate(str.maketrans(encrypted_letter_order_v2, english_letter_order))
print(a_tran2)

kne jkll gjxnkjj zn xjbg zxbz en ikzbzzjg xbz bjjncybekji zxj
jnccjejjcjez nm be jezjgygkzj jxkjx kne xbvj gjpbgiji jkzx zejx jvkl
mngjxnikepz.  k bggkvji xjgj kjzzjgibk, bei ck mkgzz zbzk kz zn bzzegj
ck ijbg zkzzjg nm ck jjlmbgj bei kejgjbzkep jnemkijejj ke zxj zejjjzz
nm ck eeijgzbkkep.

k bc blgjbik mbg engzx nm lneine, bei bz k jblk ke zxj zzgjjzz nm
yjzjgzxegpx, k mjjl b jnli engzxjge xgjjzj ylbk eyne ck jxjjkz, jxkjx
xgbjjz ck ejgvjz bei mkllz cj jkzx ijlkpxz.  in kne eeijgzzbei zxkz
mjjlkep?  zxkz xgjjzj, jxkjx xbz zgbvjllji mgnc zxj gjpknez znjbgiz
jxkjx k bc bivbejkep, pkvjz cj b mngjzbzzj nm zxnzj kjk jlkcjz.
kezykgkzji xk zxkz jkei nm ygnckzj, ck ibkigjbcz xjjncj cngj mjgvjez
bei vkvki.  k zgk ke vbke zn xj yjgzebiji zxbz zxj ynlj kz zxj zjbz nm
mgnzz bei ijznlbzkne; kz jvjg ygjzjezz kzzjlm zn ck kcbpkebzkne bz zxj
gjpkne nm xjbezk bei ijlkpxz.  zxjgj, cbgpbgjz, zxj zee kz mng jvjg
vkzkxlj, kzz xgnbi ikzk xezz zkkgzkep zxj xngkzne bei ikmmezkep b
yjgyjzebl zyljeineg.  zxjgj

In [17]:
# Frequency of beginning characters
re.findall(r'(?<= )[a-z]', encrypted_text)

['r',
 'z',
 'p',
 'v',
 'p',
 'o',
 'j',
 'v',
 'k',
 'p',
 'g',
 'k',
 'n',
 'r',
 'c',
 'v',
 'z',
 'r',
 'b',
 'n',
 'd',
 'k',
 'v',
 'c',
 'k',
 's',
 'x',
 'p',
 'd',
 'p',
 'k',
 'j',
 'b',
 'g',
 's',
 'r',
 'k',
 'd',
 'f',
 'd',
 'p',
 'b',
 's',
 'l',
 'k',
 'k',
 'x',
 'o',
 'g',
 'm',
 'k',
 'k',
 'd',
 'r',
 'd',
 'p',
 'b',
 'g',
 'd',
 'x',
 'k',
 'f',
 'o',
 'a',
 'u',
 'l',
 's',
 'f',
 'r',
 's',
 'o',
 'k',
 'x',
 's',
 'r',
 'j',
 'j',
 'c',
 'l',
 'p',
 'p',
 'a',
 'r',
 'v',
 'p',
 'x',
 'p',
 'z',
 'p',
 'd',
 'k',
 'k',
 'w',
 's',
 'k',
 'x',
 'g',
 'p',
 'd',
 'f',
 'a',
 'p',
 'r',
 'g',
 'u',
 's',
 'j',
 'a',
 's',
 'x',
 'q',
 'd',
 'p',
 'd',
 'q',
 'p',
 'a',
 'u',
 'p',
 'p',
 'u',
 'd',
 'p',
 'b',
 'g',
 'k',
 'j',
 'd',
 'n',
 'u',
 'd',
 'p',
 's',
 'd',
 'k',
 'p',
 'g',
 'a',
 'k',
 'j',
 'p',
 's',
 'p',
 'b',
 'd',
 'x',
 'n',
 'd',
 'a',
 'j',
 'h',
 'b',
 'p',
 'v',
 'k',
 'j',
 'k',
 'b',
 'p',
 'r',
 'c',
 'm',
 's',
 'b',
 'd',
 'r',
 'u'

In [18]:
# Find all final characters
re.findall(r'[a-z](?=\s|\.)', encrypted_text)

['l',
 'm',
 'n',
 'g',
 'z',
 'p',
 'g',
 'z',
 'b',
 'j',
 'n',
 'p',
 'x',
 'o',
 'n',
 'v',
 'l',
 'n',
 'j',
 'v',
 'v',
 'm',
 'b',
 'd',
 'j',
 'n',
 'j',
 'c',
 'p',
 'i',
 'b',
 'g',
 'n',
 'c',
 'z',
 'z',
 'x',
 'c',
 'n',
 'j',
 'w',
 'n',
 'o',
 'n',
 'b',
 'x',
 'c',
 'w',
 'd',
 's',
 'c',
 'z',
 'v',
 'x',
 'j',
 'b',
 'd',
 'i',
 'o',
 'n',
 'b',
 'x',
 'd',
 'm',
 'k',
 'j',
 'o',
 'n',
 'c',
 'o',
 'c',
 'v',
 'b',
 'c',
 'b',
 'j',
 'b',
 'n',
 'v',
 'p',
 'g',
 'l',
 'j',
 'b',
 'b',
 'v',
 'b',
 'j',
 's',
 'n',
 'b',
 'b',
 'v',
 'd',
 's',
 'b',
 'n',
 'k',
 'n',
 'x',
 'n',
 'c',
 'b',
 'j',
 'c',
 'b',
 'j',
 'x',
 'c',
 'b',
 'n',
 'n',
 'p',
 'j',
 'j',
 'd',
 'c',
 'o',
 'o',
 'g',
 'n',
 'j',
 'p',
 'n',
 'n',
 'b',
 'n',
 'p',
 'x',
 'p',
 'j',
 'p',
 'z',
 'b',
 'x',
 'g',
 'c',
 'o',
 'b',
 'n',
 'o',
 'x',
 'c',
 'j',
 'p',
 'n',
 'o',
 'b',
 'z',
 'z',
 'b',
 'j',
 'i',
 'p',
 'w',
 'n',
 'o',
 'j',
 'w',
 'k',
 'm',
 'z',
 'z',
 'v',
 'z',
 'c',
 'd'

In [19]:
# Match two letter words
re.findall(r'(?<=\s)[a-z][a-z](?=\s|\.)', encrypted_text)

['pg',
 'og',
 'gx',
 'ko',
 'sc',
 'db',
 'pg',
 'sc',
 'gx',
 'sc',
 'do',
 'gx',
 'sc',
 'ks',
 'gx',
 'kb',
 'do',
 'gx',
 'sc',
 'sc',
 'sn',
 'jg',
 'ks',
 'sn',
 'gx',
 'ac',
 'gx',
 'sc',
 'do',
 'pg',
 'an',
 'db',
 'gx',
 'dp',
 'pg',
 'sc',
 'kb',
 'gx',
 'db',
 'sc',
 'do',
 'rn',
 'an',
 'pg',
 'do',
 'do',
 'go',
 'an',
 'kb',
 'gx',
 'do',
 'an',
 'do',
 'gx',
 'pg',
 'sc',
 'gx',
 'gx',
 'ac',
 'gx',
 'sc',
 'pg',
 'gx',
 'gz',
 'pg',
 'sn',
 'pg',
 'vn',
 'do',
 'go',
 'ko',
 'gx',
 'lu',
 'pg',
 'an',
 'go',
 'pg',
 'ac',
 'pg',
 'pg',
 'kp',
 'bg',
 'gz',
 'ac',
 'gx',
 'dx',
 'kp',
 'an',
 'ac',
 'ko',
 'kb',
 'sc',
 'sc',
 'ko',
 'sn',
 'pg',
 'bg',
 'pg',
 'kb',
 'go',
 'gx',
 'sc',
 'gx',
 'do',
 'gx',
 'kp',
 'gx',
 'gx',
 'gx',
 'sc',
 'gx',
 'sc',
 'sc',
 'kb',
 'go',
 'sc',
 'sc',
 'pg',
 'sn',
 'pg',
 'do',
 'sc',
 'dp',
 'pg',
 'do',
 'gx',
 'sc',
 'do',
 'gx',
 'sc',
 'kp',
 'gx',
 'sc',
 'sc',
 'gx',
 'go',
 'sc',
 'pg',
 'ac',
 'sc',
 'pg',
 'go',
 'pg',

In [20]:
# Match three letter words
re.findall(r'(?<=\s)[a-z]{3}(?=\s|\.)', encrypted_text)

['vkb',
 'pvn',
 'cgl',
 'koj',
 'koj',
 'pvn',
 'xkz',
 'koj',
 'pvn',
 'koj',
 'cgl',
 'vkb',
 'pvn',
 'dfc',
 'koj',
 'pzc',
 'pvn',
 'pvn',
 'koj',
 'pvn',
 'koj',
 'pvn',
 'blo',
 'xgz',
 'dpb',
 'pvn',
 'koj',
 'ulp',
 'koj',
 'kzn',
 'skc',
 'koj',
 'pvn',
 'dpb',
 'koj',
 'skc',
 'pvn',
 'pvn',
 'kzn',
 'skc',
 'ogp',
 'skc',
 'pvn',
 'pvn',
 'koj',
 'skc',
 'xgz',
 'pvn',
 'pvn',
 'koj',
 'skc',
 'pvn',
 'sko',
 'kzn',
 'koj',
 'kzn',
 'kmm',
 'koj',
 'pvn',
 'hgc',
 'vdb',
 'vdb',
 'alp',
 'kmm',
 'cgl',
 'pvn',
 'kmm',
 'pvn',
 'pvn',
 'kzn',
 'pvn',
 'pvn',
 'kmm',
 'fko',
 'pvn',
 'koj',
 'xgz',
 'pvn',
 'pvn',
 'skc',
 'xde',
 'dpb',
 'ncn',
 'vkb',
 'pvn',
 'pvn',
 'pvn',
 'pvn',
 'pvn',
 'pvn',
 'pvn',
 'cgl',
 'skc',
 'kmm',
 'pvn',
 'xgz',
 'pvn',
 'glz',
 'rkb',
 'cnp',
 'rkb',
 'jkc',
 'koj',
 'koj',
 'vkj',
 'vkj',
 'xgz',
 'pvn',
 'koj',
 'koj',
 'xgz',
 'gon',
 'gro',
 'pvn',
 'pvn',
 'koj',
 'kzn',
 'cgl',
 'kzn',
 'koj',
 'vgr',
 'pvn',
 'alp',
 'pvn',
 'koj',


In [21]:
# Match four letter words
re.findall(r'(?<=\s)[a-z]{4}(?=\s|\.)', encrypted_text)

['rdmm',
 'vnkz',
 'pvkp',
 'vkqn',
 'rdpv',
 'blfv',
 'nqdm',
 'vnzn',
 'pkbi',
 'jnkz',
 'rkmi',
 'xnnm',
 'fgmj',
 'umkc',
 'lugo',
 'rdpv',
 'pvdb',
 'pvdb',
 'xzgs',
 'pvdb',
 'rdoj',
 'sgzn',
 'qkdo',
 'pvkp',
 'ugmn',
 'bnkp',
 'nqnz',
 'nqnz',
 'jdbi',
 'hlbp',
 'rdpv',
 'cglz',
 'rdmm',
 'bgsn',
 'bogr',
 'gqnz',
 'fkms',
 'mkoj',
 'rvkp',
 'pvkp',
 'gomc',
 'pvdb',
 'nqnz',
 'rdpv',
 'ukzp',
 'mkoj',
 'xggp',
 'pvnc',
 'xnkz',
 'pvdb',
 'rdpv',
 'rvno',
 'rdpv',
 'mkbp',
 'onkz',
 'ugmn',
 'skoc',
 'gomc',
 'blfv',
 'sdon',
 'vkqn',
 'rdpv',
 'xnnm',
 'wmgr',
 'rdpv',
 'slfv',
 'sdoj',
 'bglm',
 'pvdb',
 'anno',
 'vkqn',
 'znkj',
 'rdpv',
 'vkqn',
 'anno',
 'skjn',
 'bnkb',
 'ugmn',
 'pvkp',
 'skjn',
 'wggj',
 'xgoj',
 'rnzn',
 'rdpv',
 'pvns',
 'pvkp',
 'pvkp',
 'mdxn',
 'rvno',
 'bglm',
 'kmbg',
 'ugnp',
 'cnkz',
 'pvkp',
 'kmbg',
 'rnmm',
 'rdpv',
 'agzn',
 'hlbp',
 'pvkp',
 'pdsn',
 'rnzn',
 'dopg',
 'anop',
 'vkqn',
 'nqno',
 'vglz',
 'xzgs',
 'pvdb',
 'agjc',
 'rkop',
 

In [22]:
# Calculate the first letter frequency

# Take the first character of every whitespace-separated word
first_letters = [word[0] for word in encrypted_text.split()]
Counter(first_letters)


Counter({'c': 1205,
         'r': 5544,
         'z': 1749,
         'p': 10831,
         'v': 4486,
         'o': 1633,
         'j': 2666,
         'k': 8912,
         'f': 3046,
         'g': 4953,
         'n': 2095,
         'b': 4759,
         'x': 3289,
         'd': 6475,
         's': 5267,
         'l': 861,
         'm': 1625,
         'u': 2316,
         'a': 3375,
         'w': 1011,
         'q': 594,
         'h': 244,
         'y': 126,
         'i': 278,
         '2': 23,
         '_': 38,
         '1': 83,
         '(': 35,
         '“': 476,
         '3': 9,
         '7': 5,
         '4': 8,
         '5': 9,
         '—': 3,
         't': 4,
         '[': 5,
         '‘': 22,
         '6': 4,
         '8': 6,
         '9': 6,
         '*': 8,
         '"': 10,
         '-': 7,
         "'": 1,
         '$': 1,
         '#': 1})

In [23]:
print(encrypted_letter_order_v2.translate(trantab_backward))
print(english_letter_order)

zqjxkvbgpwyfcumldrhsnioate
zqxjkvbpygfwmucldrhsnioate
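The two lines above are easier to compare with a simple score: the number of positions where the recovered order agrees with the English order. This is a minimal sketch; `positions_matching` is a hypothetical helper, and the two strings are copied from the output above.

```python
def positions_matching(candidate, reference):
    """Count positions at which two equal-length strings agree."""
    if len(candidate) != len(reference):
        raise ValueError('strings must have the same length')
    return sum(a == b for a, b in zip(candidate, reference))

recovered = 'zqjxkvbgpwyfcumldrhsnioate'  # decrypted letter order (first line above)
english = 'zqxjkvbpygfwmucldrhsnioate'    # reference English order (second line)

score = positions_matching(recovered, english)
print(f'{score} of {len(english)} positions match')
```

For these two strings, 17 of the 26 positions agree. The mismatches are all in the less common half of the alphabet, where neighbouring frequencies are close.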
