In [None]:
# The essentials
import numpy as np
import pandas as pd

# Plotting
%matplotlib inline
import matplotlib.pyplot as plt

# Std lib
from collections import defaultdict, Counter

# Notebook structure
This kernel explains how to crack the cipher algorithm of difficulty level 2. It also describes the plan of attack for further difficulty levels.
## Plan of attack
Based on the competition description, we constructed a "cipher in the middle" attack (shown in the figure below).
To find, for example, the cipher 2 encryption algorithm, we need cipher 1 texts (generated from known plaintexts) so that we can compare them with the corresponding cipher 2 texts.
Therefore, we need both the encryption and the decryption function for every difficulty level.

<img src="https://i.ibb.co/7kJGb6D/attack-overview.png" border="0"/>

Two subproblems have to be solved before cracking this code:
* We will have to find both the encryption and decryption for the previous difficulty level (reused from [Cracking the code: difficulty 1](https://www.kaggle.com/group16/cracking-the-code-difficulty-1))
* We must search (in each difficulty level) for one text matching the Ciphertext in order to execute a known-plaintext attack
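
The second subproblem can be sketched on toy data: round each plaintext length up to the next multiple of 100 and keep the lengths that occur exactly once on both sides. The ids and lengths below are hypothetical and only illustrate the idea; the real matching is done on the competition data further down.

```python
import math
from collections import Counter

# Hypothetical ids and lengths, for illustration only
plain_lengths = {"p1": 7045, "p2": 312, "p3": 388, "p4": 95}
cipher_lengths = {"c1": 7100, "c2": 400, "c3": 400, "c4": 100}

def padded_length(n, block=100):
    """Round a plaintext length up to the next multiple of `block`."""
    return math.ceil(n / block) * block

plain_counts = Counter(padded_length(n) for n in plain_lengths.values())
cipher_counts = Counter(cipher_lengths.values())

# A length is usable when exactly one ciphertext and one plaintext have it
unique_lengths = {
    length for length, count in cipher_counts.items()
    if count == 1 and plain_counts.get(length) == 1
}

# Pair up the single plaintext and single ciphertext for each usable length
pairs = {}
for length in unique_lengths:
    plain_id = next(p for p, n in plain_lengths.items() if padded_length(n) == length)
    cipher_id = next(c for c, n in cipher_lengths.items() if n == length)
    pairs[length] = (plain_id, cipher_id)

print(pairs)
```

Lengths shared by several texts (400 in this toy set) are ambiguous and are skipped.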

# Reading in our data.
All text files are loaded into pandas dataframes.
We also store the length of each text in a column, which will turn out to be useful.

In [None]:
# load text and ciphertexts in pandas dataframe
train = pd.read_csv('../input/training.csv', index_col='index')
train['length'] = train['text'].apply(lambda x: len(x))
# round the plaintext length up to the next multiple of 100 (used to match plaintexts with padded ciphertexts)
train['length_100'] = (np.ceil(train['length'] / 100) * 100).astype(int)
test = pd.read_csv('../input/test.csv')
test['length'] = test['ciphertext'].apply(lambda x: len(x))

# Difficulty 1: encode/decode

Before we can crack the cipher algorithm of difficulty level 2, we first have to encrypt the plaintexts to difficulty 1, so that we can compare (cipher 1, cipher 2) pairs (see the code explanation in [Cracking the code: difficulty 1](https://www.kaggle.com/group16/cracking-the-code-difficulty-1)).

In [None]:
# alphabets and key
alphabet = """7lx4v!2oQ[O=,yCzV:}dFX#(Wak/bqne*JApK{cmf6 GZDj9gT\'"YSHiE]5)81hMNwI@P?Us%;30uBrLR-.$t"""
key =      """ etaoinsrhldcumfygwpb.v,kI\'T"A-SBMxDHj)CW(ELORN!FGPJz0qK?1VY:U92/3*5;478QZ6X%$}#@={[]"""

decrypt_mapping = {}
encrypt_mapping = {}
for i, j in zip(alphabet, key):
    decrypt_mapping[ord(i)] = ord(j)
    encrypt_mapping[ord(j)] = ord(i)

def encrypt_step1(x):
    return x.translate(encrypt_mapping)

def decrypt_step1(x):
    return x.translate(decrypt_mapping)

Next, we transform all plaintexts in the training dataset with the encryption function described above.

In [None]:
# encrypt to difficulty 1
train['cipher1'] = train['text'].apply(encrypt_step1)
train.head()
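
As a quick sanity check of how such a substitution table behaves, here is a minimal sketch with a toy five-symbol alphabet (the symbols are hypothetical and not the competition's mapping). It uses `str.maketrans`, which builds the same kind of code point dictionary as the loop above.

```python
# Toy tables: toy_alphabet[i] is the cipher symbol for plain symbol toy_key[i]
toy_alphabet = "xyzab"  # hypothetical cipher symbols
toy_key = "abcde"       # hypothetical plain symbols

# A substitution is only invertible if both strings have the same length
# and no repeated characters
assert len(toy_alphabet) == len(toy_key)
assert len(set(toy_alphabet)) == len(toy_alphabet)
assert len(set(toy_key)) == len(toy_key)

toy_decrypt_mapping = str.maketrans(toy_alphabet, toy_key)
toy_encrypt_mapping = str.maketrans(toy_key, toy_alphabet)

message = "bead"
encrypted = message.translate(toy_encrypt_mapping)
decrypted = encrypted.translate(toy_decrypt_mapping)

print(message, "->", encrypted, "->", decrypted)
```

Because `translate` replaces all characters in a single pass, it does not matter that some symbols appear in both strings.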

# Let's crack Difficulty 2

To find a matching plaintext and ciphertext example, we can again exploit the lengths of our documents. All steps are described in detail in [Cracking the code: difficulty 1](https://www.kaggle.com/group16/cracking-the-code-difficulty-1); here we repeat them for the difficulty 2 ciphertexts.

### 1. Length analysis plain vs difficulty 2

In [None]:
# select difficulty 2 ciphertexts
diff2 = test[test['difficulty'] == 2]
# group the ciphertexts by length & sort the values 
lengths = diff2.groupby('length')['ciphertext'].count().sort_values()
# keep the ciphertext lengths that occur only once in our ciphertext set
rare_lengths =  lengths[lengths == 1].index
# match them with the train (plaintext) set and count how many times we found a plaintext matching the length of the ciphertexts
train[train['length_100'].isin(rare_lengths)].groupby('length_100')['text'].count()

We have two lengths with only one occurrence, 7100 and 7900. We will use these texts to investigate the cipher 2 algorithm further.

In [None]:
matches = [7100, 7900]
train[train['length_100'].isin(matches)].sort_values('length_100')

In [None]:
diff2[diff2['length'].isin(matches)].sort_values('length')

**We will explain the next steps using plaintext ID_44394ca71 and ciphertext ID_f8d497eb8.**

### 2. Anything forced is not beautiful
We will now look at the ciphertext of difficulty 2 and the corresponding ciphertext of difficulty 1.

In [None]:
print("Cipher1 text: ", train[train.plaintext_id=="ID_44394ca71"].cipher1.values[0][0:35], "(generated from the plaintext)")
print("Cipher2 text: ",test[test.ciphertext_id=="ID_f8d497eb8"].ciphertext.values[0][0:35])

It is hard to see anything interesting yet, because random padding characters were added to the ciphertext. <br> Exactly 55 padding characters were used (ciphertext length - plaintext length = 7100 - 7045 = 55). <br>
Let's see if we get more exciting insights when we skip the first 27 (55//2) characters of the ciphertext, assuming the padding is split evenly around the text.
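
The padding arithmetic can be sketched as a small helper. It assumes that `pad // 2` of the padding characters sit in front of the text and the rest behind it, which is the hypothesis being tried here, not a confirmed property of the cipher. The function name and the toy strings are made up for illustration.

```python
def strip_centered_padding(ciphertext, plain_length):
    """Return the slice of `ciphertext` that would hold the message,
    assuming pad // 2 padding characters come before it."""
    pad = len(ciphertext) - plain_length
    left = pad // 2
    return ciphertext[left:left + plain_length]

# Toy example: 5 padding characters, 2 in front and 3 behind
toy_cipher = "##" + "HELLOWORLD" + "###"
print(strip_centered_padding(toy_cipher, 10))

# The numbers from the text: 7100 - 7045 = 55 padding characters, 55 // 2 = 27 in front
print(7100 - 7045, (7100 - 7045) // 2)
```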

In [None]:
print("Cipher1 text: ",train[train.plaintext_id=="ID_44394ca71"].cipher1.values[0][0:35])
print("Cipher2 text: ",test[test.ciphertext_id=="ID_f8d497eb8"].ciphertext.values[0][(55//2):(55//2)+35])

**WOW, we see matching characters between the ciphertext of difficulty 1 and the ciphertext of difficulty 2!**
<br>But some characters are changed or shifted to another character... let's remove all matching characters from the two ciphertexts and look at the rest.

In [None]:
cipher1 = train[train.plaintext_id=="ID_44394ca71"].cipher1.values[0][0:35]
cipher2 = test[test.ciphertext_id=="ID_f8d497eb8"].ciphertext.values[0][(55//2):(55//2)+35]

diff_char1 = ""
diff_char2 = ""
for i in range(len(cipher1)):
    if cipher1[i] != cipher2[i]:
        diff_char1 += cipher1[i]
        diff_char2 += cipher2[i]

print(diff_char1)
print(diff_char2)

We now see that these characters are not one-to-one mappings.
<br>The last 'Q' character of the cipher1 text is, for instance, mapped to both a 'U' and an 'X' in the corresponding cipher2 text...

**This gives us an indication that a key was used to transform one character into another.
<br>Which leads us to the field of polyalphabetic substitution ciphers.**

Let's try to find that key by subtracting the cipher1 text indices inside an alphabet from the cipher2 text indices in that same alphabet and see if we can find something! (ref: see [Vigenère cipher](https://en.wikipedia.org/wiki/Vigenère_cipher), a common polyalphabetic substitution cipher)
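
As a sketch of the relation this subtraction relies on, assume (as a hypothesis at this point) that the cipher is a plain Vigenère shift over an alphabet $A$ with $n$ characters. Write $\mathrm{idx}(\cdot)$ for the position of a character in $A$, $c^{(1)}_i$ and $c^{(2)}_i$ for the $i$-th letters of the cipher1 and cipher2 texts, and $k_i$ for the key character used at that position. Then:

$$\mathrm{idx}\big(c^{(2)}_i\big) = \Big(\mathrm{idx}\big(c^{(1)}_i\big) + \mathrm{idx}(k_i)\Big) \bmod n \quad\Longrightarrow\quad \mathrm{idx}(k_i) = \Big(\mathrm{idx}\big(c^{(2)}_i\big) - \mathrm{idx}\big(c^{(1)}_i\big)\Big) \bmod n$$

If the hypothesis holds, the recovered characters $k_i$ repeat with the period of the key.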

In [None]:
def find_key(cipher2, cipher1, alphabet):
    key = ''
    for c2, c1 in zip(cipher2, cipher1):
        # only characters that are in the alphabet can be compared
        if c2 in alphabet and c1 in alphabet:
            # get the index of the cipher2 character in the alphabet
            cipher2_index = alphabet.index(c2)
            # do the same for the cipher1 character
            cipher1_index = alphabet.index(c1)
            # subtract, but make sure we are still inside the alphabet
            key_index = (cipher2_index - cipher1_index) % len(alphabet)
            # the key character is the alphabet character at the subtracted index
            key += alphabet[key_index]

    return key

find_key(diff_char2, diff_char1, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

**XeNOPhON XeNOPhON XeNOPhON XeNOPhON** <br>
We found our repeated key! By playing a little with our alphabet, we get an even cleaner result:

In [None]:
find_key(diff_char2, diff_char1, 'aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ')

So "xenophon" will be our key for difficulty 2. <br>
(TIP: did you see the banner picture of this Kaggle competition? And did you know Xenophon was a student of Plato? Coincidence?...)

![](https://i.ibb.co/FwqvCQY/header.jpg)

### Decode and encode function:

In [None]:
def encrypt_vigenere(plaintext, key, alphabet):
    key_length = len(key)
    cntr = 0  # position in the key; only advances on alphabet characters
    ciphertext = ''
    for c in plaintext:
        if c in alphabet:
            charIndex = alphabet.index(c)
            keyIndex = alphabet.index(key[cntr])
            # shift forward by the key character, wrapping around the alphabet
            newIndex = (charIndex + keyIndex) % len(alphabet)
            ciphertext += alphabet[newIndex]
            cntr = (cntr + 1) % key_length
        else:
            # characters outside the alphabet are copied unchanged
            ciphertext += c

    return ciphertext

def decrypt_vigenere(ciphertext, key, alphabet):
    key_length = len(key)
    cntr = 0  # position in the key; only advances on alphabet characters
    plaintext = ''
    for c in ciphertext:
        if c in alphabet:
            charIndex = alphabet.index(c)
            keyIndex = alphabet.index(key[cntr])
            # shift backward by the key character, wrapping around the alphabet
            newIndex = (charIndex - keyIndex) % len(alphabet)
            plaintext += alphabet[newIndex]
            cntr = (cntr + 1) % key_length
        else:
            # characters outside the alphabet are copied unchanged
            plaintext += c

    return plaintext
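
A compact, self-contained sketch of the same scheme can be used to check that encryption and decryption are inverses. The helper `vigenere` and its `sign` parameter are made up for this illustration; the alphabet and key are the ones found above.

```python
ALPHABET = 'aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ'

def vigenere(text, key, alphabet, sign):
    """Shift alphabet characters by the key (sign=+1 encrypts, sign=-1 decrypts).
    Characters outside the alphabet are copied and do not consume a key character."""
    out = []
    k = 0
    for ch in text:
        if ch in alphabet:
            shift = sign * alphabet.index(key[k % len(key)])
            out.append(alphabet[(alphabet.index(ch) + shift) % len(alphabet)])
            k += 1
        else:
            out.append(ch)
    return ''.join(out)

sample = "Hello, World! 123"
enc = vigenere(sample, 'xenophon', ALPHABET, +1)
dec = vigenere(enc, 'xenophon', ALPHABET, -1)

print(enc)
print(dec)
```

With the interleaved alphabet, a lowercase key only produces even shifts, so upper and lower case are preserved by the encryption.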

Let's decode the cipher2 text with our decryption algorithms for difficulty 2 and difficulty 1 and see if it matches the plaintext.

In [None]:
cipher = test[test.ciphertext_id=="ID_f8d497eb8"].ciphertext.values[0]

step1 = decrypt_vigenere(cipher, 'xenophon', 'aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ')
step2 = decrypt_step1(step1)

# decrypted text
print(step2[0:76])
# plaintext
print("                          ",train[train.plaintext_id=="ID_44394ca71"].text.values[0][0:76-27])

**I think we have a match!**