# Exercise: Attack on the monoalphabetic cipher

In this exercise you will implement an attack on the monoalphabetic cipher, using the text of the book Nineteen Eighty-Four by George Orwell as a corpus. Needless to say, it is a masterpiece worth reading for anyone interested in privacy.

Alice and Bob want to communicate secretly, so they meet in person and choose a key that nobody else knows. They agree on using the monoalphabetic cipher to encrypt and decrypt their messages. An attacker (Charlie) is eavesdropping on the communication between Alice and Bob, so he is able to see all the ciphertext they send to each other. Charlie only knows that Alice and Bob communicate in English and that they use the monoalphabetic cipher. Our question here is: can Charlie crack the secret key with just this information? We will see how in this exercise.

Author: [Sebastià Agramunt Puig](https://github.com/sebastiaagramunt) for [OpenMined](https://www.openmined.org/) Privacy ML Series course.

## Alice and Bob's communication

As mentioned, Alice and Bob first meet and agree on a secret key. For simplicity, we copy here the code of the monoalphabetic cipher we wrote in the Ciphers notebook.

In [1]:
from random import randrange, seed
from copy import deepcopy
import string


seed(3)

def mono_key_generator()-> str:
    chars = list(deepcopy(string.ascii_lowercase))
    chars_permutation = []
    
    while len(chars)>0:
        letter = chars.pop(randrange(len(chars)))
        chars_permutation.append(letter)
        
    return ''.join(chars_permutation) 
    
def mono_encrypt_decrypt(text: str, secret_key: str, encrypt: bool=True) -> str:
    assert len(secret_key)==len(string.ascii_lowercase), "secret key must be all ascii lowercase, 26 letters"
    
    if encrypt:
        convert_dict = {p:c for p, c in zip(string.ascii_lowercase, secret_key)}
    else:
        convert_dict = {c:p for p, c in zip(string.ascii_lowercase, secret_key)}
    convert_dict[" "] = " "
    
    return ''.join([convert_dict[c] for c in text])

In [2]:
seed(5)
secret_key = mono_key_generator()
print(f"Secret key shared between Alice and Bob: {secret_key}")

Secret key shared between Alice and Bob: tizmxsarjchdlpgqwenbykvofu


In [3]:
message = "this is a top secret message"
encrypted_message = mono_encrypt_decrypt(message, secret_key)
decrypted_ciphertext = mono_encrypt_decrypt(encrypted_message, secret_key, encrypt=False)

print(f"message:\n{message}\n\nciphertext:\n{encrypted_message}\n\ndecrypted_ciphertext:\n{decrypted_ciphertext}")

message:
this is a top secret message

ciphertext:
brjn jn t bgq nxzexb lxnntax

decrypted_ciphertext:
this is a top secret message
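
Before looking at smarter attacks, it helps to see why Charlie cannot simply try every key. A key is a permutation of the 26 lowercase letters, so there are 26! possible keys (about 4 × 10^26). The sketch below estimates how long an exhaustive search would take; the rate of one billion guesses per second is an arbitrary assumption chosen for illustration, not a measured figure.

```python
import math
import string

# Number of possible keys: every permutation of the 26 lowercase letters.
alphabet_size = len(string.ascii_lowercase)
key_space = math.factorial(alphabet_size)

# Assumed attacker speed (hypothetical): one billion key guesses per second.
guesses_per_second = 10**9
seconds_per_year = 60 * 60 * 24 * 365

# Time needed to try every key at that assumed speed.
years_to_try_all = key_space / (guesses_per_second * seconds_per_year)

print(f"possible keys: {key_space:.3e}")
print(f"years to try them all: {years_to_try_all:.3e}")
```

Under that assumption the search would take more than ten billion years, which is why the attack in this exercise uses the statistics of the language instead of brute force.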


To get real English words we can download a corpus in this language. For instance, we can download a book and use it as the messages Alice and Bob send to each other. The following chunk of code downloads Nineteen Eighty-Four by George Orwell from [Project Gutenberg](http://gutenberg.net.au).

In [4]:
from utils import download_data, process_load_textfile
import string
import os

url = 'http://gutenberg.net.au/ebooks01/0100021.txt'
filename = 'Nineteen-eighty-four_Orwell.txt'
download_path = '/'.join(os.getcwd().split('/')[:-1]) + '/data/'

#download data to specified path
download_data(url, filename, download_path)

#load data and process
data = process_load_textfile(filename, download_path)

Let's see how it looks after some processing

In [5]:
data[10000:11000]

'ook its smooth creamy paper a little yellowed by age was of a kind that had not been manufactured for at least forty years past he could guess however that the book was much older than that he had seen it lying in the window of a frowsy little junkshop in a slummy quarter of the town just what quarter he did not now remember and had been stricken immediately by an overwhelming desire to possess it party members were supposed not to go into ordinary shops dealing on the free market it was called but the rule was not strictly kept because there were various things such as shoelaces and razor blades which it was impossible to get hold of in any other way he had given a quick glance up and down the street and then had slipped inside and bought the book for two dollars fifty at the time he was not conscious of wanting it for any particular purpose he had carried it guiltily home in his briefcase even with nothing written in it it was a compromising possession the thing that he was about to

Suppose Alice wants to send Bob a very long message taken from the book Nineteen Eighty-Four; this is the same as sending many messages of one word each. Let's code this part.

In [6]:
data_len = len(data)

init_letter = data_len//2
final_letter = init_letter + data_len//4

message = data[init_letter:final_letter]
encrypted_message = mono_encrypt_decrypt(message, secret_key)

## Charlie's side

As we mentioned, Charlie only knows that Alice and Bob communicate in English and that they use the monoalphabetic cipher. He's a smart guy and knows which letters are the most frequent in English. His attack consists of comparing the most frequent letters of the ciphertext (encrypted data) that Alice sends to Bob with the most frequent letters in English.

First things first, we need to obtain the frequencies of the letters in English. Luckily, you can find them on [Wikipedia](https://en.wikipedia.org/wiki/Letter_frequency).

In [7]:
english_letter_counts = [("a", 0.082),
                         ("b", 0.015),
                         ("c", 0.028),
                         ("d", 0.043),
                         ("e", 0.13),
                         ("f", 0.022),
                         ("g", 0.02),
                         ("h", 0.061),
                         ("i", 0.07),
                         ("j", 0.0015),
                         ("k", 0.0077),
                         ("l", 0.04),
                         ("m", 0.024),
                         ("n", 0.067),
                         ("o", 0.075),
                         ("p", 0.019),
                         ("q", 0.00095),
                         ("r", 0.06),
                         ("s", 0.063),
                         ("t", 0.091),
                         ("u", 0.028),
                         ("v", 0.0098),
                         ("w", 0.024),
                         ("x", 0.0015),
                         ("y", 0.002),
                         ("z", 0.00074)
                        ]

In [8]:
# and sort them according to their frequency
english_letter_counts.sort(key=lambda x: x[1], reverse=True)
english_letter_counts

[('e', 0.13),
 ('t', 0.091),
 ('a', 0.082),
 ('o', 0.075),
 ('i', 0.07),
 ('n', 0.067),
 ('s', 0.063),
 ('h', 0.061),
 ('r', 0.06),
 ('d', 0.043),
 ('l', 0.04),
 ('c', 0.028),
 ('u', 0.028),
 ('m', 0.024),
 ('w', 0.024),
 ('f', 0.022),
 ('g', 0.02),
 ('p', 0.019),
 ('b', 0.015),
 ('v', 0.0098),
 ('k', 0.0077),
 ('y', 0.002),
 ('j', 0.0015),
 ('x', 0.0015),
 ('q', 0.00095),
 ('z', 0.00074)]

In [9]:
from collections import Counter
from typing import List, Tuple

# Write a function that inputs a text and outputs a list of tuples with frequencies of letters,
# hint: use Counter from package collections
def letter_count(text: str) -> List[Tuple[str, int]]:
    # step 1: remove white spaces
    text2 = text.replace(" ", "")
    
    # step 2: create a list of characters in the text
    letters = [c for c in text2]
    
    # step 3: count characters and sort 
    return Counter(letters).most_common()

In [10]:
lc = letter_count(data)

assert lc[0][0]=="e", "letter_count not well implemented"
assert lc[1][0]=="t", "letter_count not well implemented"
assert lc[2][0]=="a", "letter_count not well implemented"
assert lc[3][0]=="o", "letter_count not well implemented"
assert lc[4][0]=="n", "letter_count not well implemented"
assert lc[5][0]=="i", "letter_count not well implemented"

lc

[('e', 59619),
 ('t', 43877),
 ('a', 36523),
 ('o', 35051),
 ('n', 31986),
 ('i', 31950),
 ('h', 29164),
 ('s', 28972),
 ('r', 26126),
 ('d', 19022),
 ('l', 18657),
 ('u', 13037),
 ('w', 12243),
 ('c', 11636),
 ('m', 10828),
 ('f', 10188),
 ('y', 9423),
 ('g', 9283),
 ('p', 8614),
 ('b', 7653),
 ('v', 4313),
 ('k', 3609),
 ('x', 792),
 ('j', 463),
 ('q', 409),
 ('z', 306)]
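
The counts above are absolute, while the English table lists relative frequencies. To compare the two on the same scale, the counts can be divided by the total number of letters. The following is a minimal sketch; `relative_frequencies` is a hypothetical helper that is not part of the original notebook, and the sample sentence is only there to keep the example self-contained.

```python
from collections import Counter
from typing import Dict


def relative_frequencies(text: str) -> Dict[str, float]:
    # Ignore white spaces, as letter_count does.
    letters = [c for c in text if c != " "]
    total = len(letters)
    counts = Counter(letters)
    # Divide every count by the total number of letters.
    return {letter: count / total for letter, count in counts.items()}


sample = "the quick brown fox jumps over the lazy dog"
freqs = relative_frequencies(sample)

# Show the three most frequent letters of the sample text.
top_three = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:3]
for letter, freq in top_three:
    print(f"{letter}: {freq:.3f}")
```

Calling `relative_frequencies(data)` in the notebook would give values that can be read side by side with `english_letter_counts`.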

### Exercise 3: Charlie's attack

Now Charlie has the ciphertext that Alice sent to Bob and the frequencies of the English letters. A simple attack is to calculate the letter frequencies of the ciphertext, sort both lists from most to least frequent, and pair the letters up position by position. Let's code this.

In [11]:
import string

def plaintext_attack(encrypted_message: str, english_letter_counts: List[Tuple[str, float]]) -> str:
    # encrypted_message is the ciphertext intercepted between Alice and Bob
    # english_letter_counts is the list of (letter, frequency) tuples, sorted from most to least frequent
    characters = string.ascii_lowercase
    
    # first count the letters in the ciphertext, sorted from most to least frequent
    ciphertext_letter_frequencies = letter_count(encrypted_message)
    
    # a dictionary that maps each plaintext letter to its guessed ciphertext letter
    key_dict = {}
    for (english_letter, _), (ctx_letter, _) in zip(english_letter_counts, ciphertext_letter_frequencies):
        if key_dict.get(english_letter) is None:
            key_dict[english_letter] = ctx_letter
            
    inferred_secret_key = [key_dict[letter] if key_dict.get(letter) is not None else "_" for letter in characters]
    
    return ''.join(inferred_secret_key)


In [12]:
inferred_secret_key = plaintext_attack(encrypted_message, english_letter_counts)
print(f"secret_key:\n\t{secret_key}\ninferred_secret_key:\n\t{inferred_secret_key}")

correctly_guessed = 0
for sk, isk in zip(secret_key, inferred_secret_key):
    if sk==isk:
        correctly_guessed+=1
print(f"\nCorrectly guessed {correctly_guessed} out of {len(secret_key)}")

secret_key:
	tizmxsarjchdlpgqwenbykvofu
inferred_secret_key:
	tazdxsfrjokmvpgqcenbyilwhu

Correctly guessed 14 out of 26


Not bad! We've guessed 14 out of 26 characters. Let's see how the text looks when decrypted with our inferred key and compare it to the original.

In [13]:
mono_encrypt_decrypt(encrypted_message, inferred_secret_key, encrypt=False)[0:500]

'inuel formarl in the sawe wokewent daginb a frienldg hanl for a wowent on minstons arw so that the tmo of thew mere madyinb sile vg sile he veban speayinb mith the pecudiar brake courtesg that lifferentiatel hiw frow the waqoritg of inner partg wewvers i hal veen hopinb for an opportunitg of tadyinb to gou he sail i mas realinb one of gour nemspeay articdes in the tiwes the other lag gou taye a schodardg interest in nemspeay i vedieke minston hal recokerel part of his sedfpossession harldg schod'

In [14]:
message[0:500]

'inued forward in the same movement laying a friendly hand for a moment on winstons arm so that the two of them were walking side by side he began speaking with the peculiar grave courtesy that differentiated him from the majority of inner party members i had been hoping for an opportunity of talking to you he said i was reading one of your newspeak articles in the times the other day you take a scholarly interest in newspeak i believe winston had recovered part of his selfpossession hardly schol'
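
The score above counts correct positions in the key, but rare letters such as "z" or "q" barely affect how readable the result is. A complementary measure is the fraction of letters in the decrypted text that match the original. The following is a minimal sketch; `character_accuracy` is a hypothetical helper, not part of the original notebook, and it assumes both strings have the same length with spaces in the same positions (which holds here because the cipher leaves spaces untouched). The two short strings are the first words of the outputs shown above.

```python
def character_accuracy(decrypted: str, original: str) -> float:
    # Pair characters position by position, skipping white spaces.
    pairs = [(d, o) for d, o in zip(decrypted, original) if o != " "]
    if not pairs:
        return 0.0
    # Fraction of letters that were recovered correctly.
    matches = sum(d == o for d, o in pairs)
    return matches / len(pairs)


decrypted_sample = "inuel formarl in the sawe"
original_sample = "inued forward in the same"

accuracy = character_accuracy(decrypted_sample, original_sample)
print(f"letters recovered correctly: {accuracy:.2%}")
```

In the notebook, the same helper could be applied to the full texts, comparing `mono_encrypt_decrypt(encrypted_message, inferred_secret_key, encrypt=False)` with `message`.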

# Conclusions

Charlie has been able to correctly guess 14 out of 26 characters of the key with this very simple attack! The main takeaway from this exercise is that one can extract information by simply looking at the ciphertext. Can we construct a perfectly secure cipher, so that the ciphertext carries no information about the original message? This is what we are going to see in the next section.