Let us see if we can break simple ciphers using words and phrases that appear repeatedly! (SPOILER: cipher 1 is quite easy)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [15, 7.5]
import numpy as np
import pandas as pd
import string
from collections import Counter
import nltk
import os
import re

### Loading Data

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

In [None]:
level_one = train[train['difficulty']==1].copy()

In [None]:
level_one.shape

### Cipher Alphabet

In [None]:
alp = pd.Series(Counter(''.join(level_one['ciphertext'])))
alp.head(10)

In [None]:
alp.shape

We now know the number of distinct characters in the ciphertext. Combined with interneuron's awesome analysis in the other notebook (cipher characters follow a distribution similar to the typical letter distribution), this suggests that cipher 1 uses a substitution algorithm.
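
The frequency argument can be illustrated with a small sketch. It assumes a simple one-to-one substitution; the key below is made up for the toy example and is not the competition's key. Under that assumption the symbols change, but the sorted frequency profile stays the same.

```python
from collections import Counter

def sorted_freq_profile(text):
    """Return relative character frequencies, sorted from most to least common."""
    counts = Counter(text)
    total = sum(counts.values())
    return sorted((n / total for n in counts.values()), reverse=True)

# Toy plaintext and a hypothetical one-to-one key (each symbol maps to the next one).
plain = "from: alice\nsubject: re: hello there"
symbols = sorted(set(plain))
toy_key = dict(zip(symbols, symbols[1:] + symbols[:1]))
cipher = "".join(toy_key[c] for c in plain)

# The symbols differ, but the shape of the distribution is identical.
print(sorted_freq_profile(plain) == sorted_freq_profile(cipher))
```

On real data the plaintext corpus and the ciphertext are different samples, so the two profiles would only be similar, not equal.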

### Loading Plaintext

In [None]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='train')

### Count the Most Common Opening Words

Our strategy is to find common words and phrases at the beginning or end of a text, find common patterns in the corresponding parts of the ciphertext, and try to connect them with each other.

In [None]:
heads = [x[:6] for x in news['data']]

In [None]:
Counter(heads).most_common(10)

We see that `From: ` is the most common beginning. A primitive encryption algorithm such as a substitution cipher should map it to the same block of ciphertext at the same position in every message.
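
As a rough sketch of this idea, the most frequent prefix on each side can be paired character by character. The helper name `top_prefix` and the toy strings are assumptions for illustration; the cipher strings here are invented, not taken from the data.

```python
from collections import Counter

def top_prefix(texts, n):
    """Return the most frequent n-character prefix and its count."""
    return Counter(t[:n] for t in texts).most_common(1)[0]

# Toy corpora; the cipher strings use a made-up key.
plain_texts = ["From: a@x", "From: b@y", "Subject: hi"]
cipher_texts = ["*#^-G1qq", "*#^-G1zz", ">cX_tkkG1"]

plain_head, _ = top_prefix(plain_texts, 6)
cipher_head, _ = top_prefix(cipher_texts, 6)

# Pair the two prefixes position by position: cipher symbol -> plain symbol.
candidate = dict(zip(cipher_head, plain_head))
print(candidate)
```

This only yields a candidate mapping; it still needs to be checked against other messages before it is trusted.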

### Count the Most Common Opening Ciphertext Characters

In [None]:
level_one['ciphertext'].apply(lambda x: x[:6]).value_counts().reset_index().head(10)

We can confidently infer that
> 'From: ' -> '*#^-G1'

In [None]:
subs = {
    'F': '*',
    'r': '#',
    'o': '^',
    'm': '-',
    ':': 'G',
    ' ': '1'
}
subs = {v:k for k, v in subs.items()}

In [None]:
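def decipher(ciphertext):
    # Replace each known cipher symbol with its plaintext letter; mark the rest '?'.
    return ''.join(subs.get(c, '?') for c in ciphertext)

def undeciphered(ciphertext):
    # Mask the known cipher symbols with '?' so only unsolved symbols stay visible.
    return ''.join('?' if c in subs else c for c in ciphertext)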

In [None]:
level_one['ciphertext'].head(10).apply(decipher).reset_index()

We can now iterate this process and look for the next matching target: `Subject: `
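
When the mapping is extended round after round, a small guard can catch misaligned pairs early. The helper below is a hypothetical sketch, not part of the original notebook: `extend_mapping` and the second toy pair are assumptions made for illustration.

```python
def extend_mapping(mapping, plain, cipher):
    """Return a copy of `mapping` (cipher -> plain) extended with an aligned pair.

    Raises ValueError if the lengths differ or a pair contradicts the mapping.
    """
    if len(plain) != len(cipher):
        raise ValueError(f"length mismatch: {len(plain)} vs {len(cipher)}")
    updated = dict(mapping)
    for p, c in zip(plain, cipher):
        if updated.get(c, p) != p:
            raise ValueError(f"{c!r} already maps to {updated[c]!r}, not {p!r}")
        updated[c] = p
    return updated

known = extend_mapping({}, "From: ", "*#^-G1")
# Toy pair that reuses known symbols; real pairs come from the data.
known = extend_mapping(known, "room", "#^^-")
print(known)
```

The length check matters because `zip` silently stops at the shorter string, so a plaintext and ciphertext of different lengths would otherwise produce a partial mapping without any warning.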

In [None]:
heads = [x[:9] for x in news['data'] if x[:6] != 'From: ']
Counter(heads).most_common()

In [None]:
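# Name the index column explicitly so that heads['index'] exists across pandas versions.
heads = (level_one['ciphertext'].apply(lambda x: x[:9]).value_counts().head(20)
         .rename_axis('index').reset_index(name='count'))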
heads['deciphered'] = heads['index'].apply(decipher)
heads

Haha! Row 2 is probably 'Subject: '! We move on to the next target:

In [None]:
subs = {v: k for k, v in zip('From: ', '*#^-G1')}
subs.update({v: k for k, v in zip('Subject', '>cX_t')})

In [None]:
heads = [x[:14] for x in news['data'] if x[:6] != 'From: ' and x[:9] != 'Subject: ']
Counter(heads).most_common()

In [None]:
heads = (level_one['ciphertext'].apply(lambda x: x[:14]).value_counts().head(20)
         .rename_axis('index').reset_index(name='count'))
heads['deciphered'] = heads['index'].apply(decipher)
heads

Row 2 now should be 'Organization'!

In [None]:
subs = {v: k for k, v in zip('From: ', '*#^-G1')}
subs.update({v: k for k, v in zip('Subject', '>cX_t')})
subs.update({v: k for k, v in zip('Organization', '%#dOahOta^')})

In [None]:
heads = (level_one['ciphertext'].apply(lambda x: x[:14]).value_counts().head(10)
         .rename_axis('index').reset_index(name='count'))
heads['deciphered'] = heads['index'].apply(decipher)
heads

Looking at `Subject: Re: `

In [None]:
heads = [x[:13] for x in news['data'] if x[:6] != 'From: ']
Counter(heads).most_common(10)

In [None]:
heads = (level_one['ciphertext'].apply(lambda x: x[:13]).value_counts().head(10)
         .rename_axis('index').reset_index(name='count'))
heads['deciphered'] = heads['index'].apply(decipher)
heads

In [None]:
heads['index'][2]

In [None]:
subs = {v: k for k, v in zip('From: ', '*#^-G1')}
subs.update({v: k for k, v in zip('Subject', '>cX_t')})
subs.update({v: k for k, v in zip('Organization', '%#dOahOta^')})
subs.update({v: k for k, v in zip('R', '\x1e')})

In [None]:
heads = (level_one['ciphertext'].apply(lambda x: x[:13]).value_counts().head(10)
         .rename_axis('index').reset_index(name='count'))
heads['deciphered'] = heads['index'].apply(decipher)
heads

On to the next: `?i?tribution:`

In [None]:
heads = (level_one['ciphertext'].apply(lambda x: x[:13]).value_counts().head(50)
         .rename_axis('index').reset_index(name='count'))
heads['deciphered'] = heads['index'].apply(decipher)
heads[heads['deciphered'].apply(lambda x: x[:4] != 'From')]

In [None]:
heads = [x[:13] for x in news['data'] if x[:6] != 'From: ' and x[:9] != 'Subject: ']
Counter(heads).most_common(10)

We found `Distribution: `!

In [None]:
subs = {v: k for k, v in zip('From: ', '*#^-G1')}
subs.update({v: k for k, v in zip('Subject', '>cX_t')})
subs.update({v: k for k, v in zip('Organization', '%#dOahOta^')})
subs.update({v: k for k, v in zip('R', '\x1e')})
subs.update({v: k for k, v in zip('Ds', 'xv')})

In [None]:
heads = (level_one['ciphertext'].apply(lambda x: x[:13]).value_counts().head(50)
         .rename_axis('index').reset_index(name='count'))
heads['deciphered'] = heads['index'].apply(decipher)
heads[heads['deciphered'].apply(lambda x: x[:4] != 'From')]

We have now figured out enough letters that we might be able to match one deciphered text directly to its source post.
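
One way to do such a lookup, sketched here under the assumption that unknown characters are marked with `?` (as `decipher` does), is to turn a partially deciphered snippet into a regular expression. The helper names and the toy corpus are hypothetical.

```python
import re

def partial_to_regex(partial, unknown="?"):
    """Compile a pattern in which each unknown marker matches any one character."""
    parts = ["." if ch == unknown else re.escape(ch) for ch in partial]
    return re.compile("".join(parts), re.DOTALL)

def find_candidates(partial, corpus):
    """Return indices of corpus documents that contain the partial text."""
    pattern = partial_to_regex(partial)
    return [i for i, doc in enumerate(corpus) if pattern.search(doc)]

# Toy corpus standing in for news['data'].
corpus = ["From: bob\nSubject: cats", "From: amy\nSubject: dogs?"]
print(find_candidates("Su??ect: do?s", corpus))
```

A short, distinctive snippet such as a name tends to work better than a long one, since any wrongly mapped letter in the snippet makes the whole pattern fail.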

In [None]:
level_one['ciphertext'].apply(decipher).iloc[3]

In [None]:
np.where([re.search(r'Samue\w Ross', s) is not None for s in news['data']])

In [None]:
news['data'][1646]

Found it! It is not an exact match, but it is the same post!

In [None]:
level_one['ciphertext'].iloc[3]

In [None]:
level_one['ciphertext'].apply(decipher).iloc[3]

In [None]:
level_one['ciphertext'].apply(undeciphered).iloc[3]

In [None]:
news['data'][1646][:300]

In [None]:
subs = {v: k for k, v in zip('From: ', '*#^-G1')}
subs.update({v: k for k, v in zip('Subject', '>cX_t')})
subs.update({v: k for k, v in zip('Organization', '%#dOahOta^')})
subs.update({v: k for k, v in zip('R', '\x1e')})
subs.update({v: k for k, v in zip('Ds', 'xv')})
subs.update({v: k for k, v in zip('s', 'v')})
subs.update({v: k for k, v in zip('6@vl.d()\nBkfhp!uywL28', '5bz8\x08A|ysJf]0\'P@oWFH,')})
subs.update({v: k for k, v in zip('N-pHMEAyTKIGCJW01', '\x7fq9geE/\x10{w:"2}l\\L')})

This step simply matches as many letters as possible by comparing the ciphertext with the source post side by side. Here are the results:
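
The side-by-side comparison was done by hand here. A minimal sketch of how it could be automated follows, assuming the two strings are already aligned to the same length; `harvest_pairs` is a hypothetical helper and the strings are toy data.

```python
from collections import Counter, defaultdict

def harvest_pairs(cipher, plain):
    """Collect cipher -> plain pairs from two aligned strings of equal length."""
    if len(cipher) != len(plain):
        raise ValueError("texts must be aligned to the same length")
    votes = defaultdict(Counter)
    for c, p in zip(cipher, plain):
        votes[c][p] += 1
    # Keep a pair only when every observation of the cipher symbol agrees.
    return {c: next(iter(seen)) for c, seen in votes.items() if len(seen) == 1}

# Toy aligned pair: the last 'C' pairs with 'L' instead of 'l',
# so 'C' is left unresolved rather than guessed.
mapping = harvest_pairs("ABCCDC", "helloL")
print(mapping)
```

Dropping symbols with conflicting observations is a conservative choice; a majority vote would recover more letters at the risk of baking in mistakes from a misaligned stretch.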

In [None]:
level_one['ciphertext'].apply(decipher).iloc[3]

In [None]:
level_one['ciphertext'].apply(undeciphered).iloc[3]

Some characters appear with swapped case (this may already be present in the source text). Otherwise, this message is cracked!

Current cipher alphabet coverage:

In [None]:
len(subs) / len(alp)

We have not cracked the full alphabet, but this should be sufficient to continue working out all difficulty 1 texts.
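
The share of distinct symbols can understate progress, because the known symbols are often the most frequent ones. The sketch below compares both views on toy data; `coverage` is a hypothetical helper, and the numbers it prints say nothing about the real dataset.

```python
from collections import Counter

def coverage(texts, mapping):
    """Return (share of distinct symbols known, share of all characters known)."""
    counts = Counter("".join(texts))
    known = [s for s in counts if s in mapping]
    by_symbol = len(known) / len(counts)
    by_volume = sum(counts[s] for s in known) / sum(counts.values())
    return by_symbol, by_volume

# Toy data: two of four symbols are known, but they account for most characters.
texts = ["aaaab", "aabcd"]
by_symbol, by_volume = coverage(texts, {"a": "x", "b": "y"})
print(by_symbol, by_volume)
```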

In [None]:
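# Decipher only the first ten texts instead of the whole column on every iteration.
for text in level_one['ciphertext'].head(10).apply(decipher):
    print(text)
    print('-' * 30)

In [None]:
# Last ten texts, starting from the final one.
for text in level_one['ciphertext'].tail(10).iloc[::-1].apply(decipher):
    print(text)
    print('-' * 30)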