# A First Crack: Tools, Tips & 3 Cipher Solutions

To get started on this Kaggle competition, we will need some tools for cracking simple ciphers. This kernel explores the data, creates some helpful functionality for all ciphers, and applies them to cracking the first three ciphers.
0. Loading the Source Data (Important!)
1. Utility Functions
2. Characteristics of Plaintext Data
    * Length of the newsgroup documents (after being split into 300 char strings)
    * Distribution of ASCII characters in plaintext
    * Dictionary of most common words in the plaintext source
3. Cracking Cipher #1
    * Matching the observed character frequencies
    * Optimizing the decryption with word frequency similarity (or ngrams)
    * Finding actual plaintext in the source and fine-tuning
4. Cracking Cipher #2
5. Cracking Cipher #3
6. Make Predictions

In working on this, many thanks are due to:
 - [Practical Cryptography](http://practicalcryptography.com/ciphers/classical-era/) for a summary of classic ciphers
 - [The scikit-learn tutorial for text data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) using the same "20 Newsgroups" dataset

**Hope you find this public kernel helpful!**

### Preliminary Setup
If you are not running this on a Kaggle kernel, you need to [create an API Token](https://github.com/Kaggle/kaggle-api#api-credentials), upload it, and make sure it is saved in the expected location (usually `~/.kaggle/kaggle.json`).

In [None]:
# if using from the AISE.ai machine image on Google Cloud Platform,
# upload the kaggle.json file to /jet/prs/workspace, and then from SSH:
# cd /home/jet
# mkdir .kaggle
# cd /jet/prs/workspace
# cp kaggle.json /home/jet/.kaggle
#  chmod 600 /home/jet/.kaggle/kaggle.json
newdownloads = False
if newdownloads:
    # install missing packages
    !pip install matplotlib
    !pip install scikit-learn
    !pip install fuzzywuzzy
    !pip install python-Levenshtein
    !pip install tqdm
    !pip install kaggle
    !pip install "dask[complete]"
    !pip install graphviz
    
    # download the kaggle data
    import os
    import zipfile
    DATA_DIR = '/jet/prs/workspace/data'
    os.makedirs(DATA_DIR, exist_ok=True)  # do not fail if the directory already exists
    os.chdir(DATA_DIR)
    !kaggle competitions download -c 20-newsgroups-ciphertext-challenge
    #unzip
    trainzip = zipfile.ZipFile('train.csv.zip')
    trainzip.extractall(path=DATA_DIR)
    testzip = zipfile.ZipFile('test.csv.zip')
    testzip.extractall(path=DATA_DIR)

### Load Python packages
We will import the usual `numpy`/`pandas`/`matplotlib`. We also use `CountVectorizer` from `sklearn` for text analysis, `fuzzywuzzy` and `Levenshtein` for string matching, `itertools` to help with lists, and `dask` for parallel processing.

In [None]:
import os

import numpy as np
np.random.seed(seed=42) # fix seed for reproducibility
import pandas as pd
import matplotlib.pyplot as plt

from scipy import stats

from sklearn.feature_extraction.text import CountVectorizer
token_pattern = r"[a-zA-Z0-9!@#$&()<>'=\-]+" # CountVectorizer word token

from fuzzywuzzy import fuzz, process
import Levenshtein as leven

from itertools import compress
from tqdm import tqdm

# distributed computing
from dask import delayed, compute
from dask.diagnostics import ProgressBar
ProgressBar().register()

### Load Kaggle data

In [None]:
# Kaggle test and training data
onkaggle = True
if onkaggle:
    train_datafile = '../input/train.csv'
    test_datafile = '../input/test.csv'
else:
    train_datafile = '/jet/prs/workspace/data/train.csv'
    test_datafile = '/jet/prs/workspace/data/test.csv'

In [None]:
train_df = pd.read_csv(train_datafile)
test_df =  pd.read_csv(test_datafile)
test_df['length'] = test_df['ciphertext'].str.len().astype(int)  # number of characters in each ciphertext

In [None]:
print("Train data")
train_df.tail(5)

In [None]:
print("Test data")
test_df.tail(5)

# 0. Loading the Source Data (Important!)
It will be helpful to reference the plaintext data from the **20 Newsgroups** data set, which is conveniently one of the `sklearn.datasets`. We will download this using `sklearn`, which brings some conveniences relative to the primary source on Jason Rennie's homepage ( [http://qwone.com/~jason/20Newsgroups/](http://qwone.com/~jason/20Newsgroups/) ).

Note that the `fetch_20newsgroups` function used here returns a scikit-learn "bunch" object (see details [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) ). 

It will be extremely helpful to break this up into 300-character chunks **in exactly the same way as the Kaggle organizer**. To do this, the key piece of code modifies the line breaks and trailing white space:

`textstring.replace('\r\n','\n').replace('\r','\n').replace('\n','\n ').rstrip(' ')`
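
As a minimal sketch of what this normalization and the 300-character split do, the toy example below applies the same `replace` chain to a short string and then slices a longer one. The helper name `chunk_text` and the sample strings are illustrative assumptions and are not part of the competition code.

```python
def sourcetransform(textstring):
    # normalize line endings to '\n', add a space after each newline,
    # then drop any trailing spaces
    return (textstring.replace('\r\n', '\n')
                      .replace('\r', '\n')
                      .replace('\n', '\n ')
                      .rstrip(' '))

def chunk_text(text, size=300):
    # hypothetical helper: consecutive slices of at most `size` characters;
    # an empty document still yields one (empty) chunk
    return [text[i:i + size] for i in range(0, len(text), size)] or [text]

sample = "line one\r\nline two\rline three\n"
transformed = sourcetransform(sample)
print(repr(transformed))

chunk_lengths = [len(chunk) for chunk in chunk_text("x" * 650)]
print(chunk_lengths)
```

Only the last chunk of a document can be shorter than 300 characters, which is why chunk length becomes a useful matching feature later on.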

In [None]:
%%time
from sklearn.datasets import fetch_20newsgroups
twenty_databunch = fetch_20newsgroups(subset='all', download_if_missing=True)

# CORRECTION TO MATCH KAGGLE DATA
def sourcetransform(textstring):
    return textstring.replace('\r\n','\n').replace('\r','\n').replace('\n','\n ').rstrip(' ')

sourcetext = twenty_databunch.data
for i,textstring in enumerate(sourcetext):
    sourcetext[i] = sourcetransform(textstring)
    
twenty_databunch.data = sourcetext

twenty_datalengths = [len(datastring) for datastring in twenty_databunch.data]

category_names = twenty_databunch.target_names


chunks_plaintext = []
chunks_target = []
chunks_length = []
for i in range(len(twenty_databunch.target)):
    strlength = len(twenty_databunch.data[i])
    if strlength > 300:
        for j in range(strlength // 300):
            chunks_plaintext.append(twenty_databunch.data[i][300*j:300*(j+1)])
            chunks_target.append(twenty_databunch.target[i])
            chunks_length.append(300)
        if strlength%300 > 0:
            chunks_plaintext.append(twenty_databunch.data[i][300*(strlength // 300):(300*(strlength // 300)+strlength%300)])
            chunks_target.append(twenty_databunch.target[i])
            chunks_length.append(strlength%300)
    else:
        chunks_plaintext.append(twenty_databunch.data[i])
        chunks_target.append(twenty_databunch.target[i])
        chunks_length.append(strlength)
        
chunk_df = pd.DataFrame({'plaintext':chunks_plaintext,
                         'length':np.array(chunks_length,dtype=int), 
                         'target':np.array(chunks_target,dtype=int)})
chunk_df['testref'] = np.nan
chunk_df['trainref'] = np.nan


# it is very helpful to have a copy of the data in dictionary form for fuzzywuzzy lookup
sourcetext_dict = {idx: el for idx, el in enumerate(chunks_plaintext)}
chunk_df.tail()

# 1. Utility Functions

The functions in the code section below may be helpful:
* translating between strings of text and 8-bit ascii representations (`string2ascii` and `ascii2string`)
* calculating vectors where each element is the frequency of character per million characters of text ( `char_per_million`)
* decipher function for a substitution cipher given a mapping between ascii characters (`decipher_subst`)
* trimming a longer text to match a shorter one with the same fuzzy similarity (`trim_matchedsource`)
* matrix of ascii character replacement operations needed for Levenshtein distance (`leven_replace_matrix`)


And for later use:
* calculating vectors where each element is the frequency of a word/ngram per million characters of text (`word_per_million` and `ngram_per_million`). We can use these to calculate similarity to the properties of source-language text (`similarity_score`)
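
To make the similarity measure concrete, here is a pure-Python restatement of the formula `1 - ||x - y|| / ||y||` that `similarity_score` computes with numpy in the cell below. The two-element vectors are made-up values chosen only so the arithmetic is easy to follow.

```python
import math

def similarity_score(x, y):
    # 1 - ||x - y|| / ||y||, where y is the reference (plaintext) frequency vector
    diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    ref_norm = math.sqrt(sum(b ** 2 for b in y))
    return 1.0 - diff_norm / ref_norm

reference = [3.0, 4.0]                              # ||reference|| = 5
identical = similarity_score([3.0, 4.0], reference)  # no difference at all
empty = similarity_score([0.0, 0.0], reference)      # nothing recognized
partial = similarity_score([3.0, 0.0], reference)    # difference norm = 4
print(identical, empty, partial)
```

A perfect match scores 1, a candidate with no recognized words or ngrams scores 0, and a candidate that is further from the reference than the zero vector scores below 0.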

In [None]:
# translating a string of text to an array of 8-bit integers representing ASCII values
def string2ascii(textstring):
    return np.array([ord(char) for char in textstring], dtype=np.int8)

# translating an array of 8-bit integers representing ASCII values to a string
def ascii2string(nparray):
    return ''.join(chr(npint) for npint in nparray)

# calculate ascii character frequency per million characters
def char_per_million(stringsarray):
    asciicount = np.zeros((128,), dtype=int)
    for k,textdata in enumerate(stringsarray):
        asciicount += np.histogram(string2ascii(textdata),np.arange(129))[0]
    totalchars = np.sum(asciicount)
    return np.multiply(asciicount,np.divide(1000000.0,totalchars))

# substitution decipher using the input asciimap dataframe (index= ciphered ascii integer)
def decipher_subst(textstring,asciimap):
    inarray = string2ascii(textstring)
    outarray = np.zeros(len(inarray),dtype=np.int8)
    for asciival in asciimap.index:
        outarray[inarray==asciival] = asciimap['decipher'].loc[asciival]
    return ascii2string(outarray)

# For fine-tuning a cipher, this identifies the character replacements needed.
# The output is a 128x128 matrix with rows and columns corresponding to ascii characters.
# Element [a,b] is the number of times ascii character a in textA had to
# be replaced by ascii character b in textB
def leven_replace_matrix(textA,textB):
    # initialize output matrix
    asciiswitchmat = np.zeros([128,128],dtype=int)
    # calculate edits for Levenshtein distance
    lops = leven.editops(textA,textB)
    # count the character replacements needed; each edit op is (operation, indexA, indexB)
    for opname, idxA, idxB in lops:
        if opname == 'replace':
            ordA, ordB = ord(textA[idxA]), ord(textB[idxB])
            # skip non-ascii characters, which fall outside the 128x128 matrix
            if ordA < 128 and ordB < 128:
                asciiswitchmat[ordA,ordB] += 1
    return asciiswitchmat

In [None]:
def char_count(textstring):
    return np.histogram(string2ascii(textstring),np.arange(129))[0]
    
# calculate ascii character frequency per 300 characters
def char_per_300(stringsarray):
    asciicount = np.zeros((128,), dtype=int)
    for k,textdata in enumerate(stringsarray):
        asciicount += char_count(textdata)
    totalchars = np.sum(asciicount)
    return np.multiply(asciicount,np.divide(300.0,totalchars))

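# number of recognized words
# NOTE: this counts against the global `worddictionary2` (default tokenizer);
# the `worddictionary` argument is not used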
def num_recognized_words(stringsarray,worddictionary):
    vectorizer = CountVectorizer(analyzer='word',vocabulary=worddictionary2)
    X = vectorizer.fit_transform(stringsarray)
    return np.sum(X)

# calculate words per million characters using a given dictionary list of ngrams
# the worddictionary is generated on a previous run of CountVectorizer using English text
def word_per_million(stringsarray,worddictionary):
    totalchars = sum([len(cipherstring) for cipherstring in stringsarray])
    vectorizer = CountVectorizer(analyzer='word',token_pattern=token_pattern,vocabulary=worddictionary)
    #vectorizer = CountVectorizer(vocabulary=worddictionary)
    X = vectorizer.fit_transform(stringsarray)
    return np.squeeze(np.asarray(np.sum(X,axis=0)*(1000000.0/totalchars)))

# calculate ngrams per million characters using a given dictionary list of ngrams
# the ngramdictionary is generated on a previous run of CountVectorizer using English text
def ngram_per_million(stringsarray,ngramdictionary):
    ngramlength = len(next(iter(ngramdictionary)))
    totalchars = sum([len(cipherstring) for cipherstring in stringsarray])
    vectorizer = CountVectorizer(analyzer='char', 
                             max_features=1000,
                             lowercase = False,
                             ngram_range = (ngramlength,ngramlength),
                             vocabulary=ngramdictionary)
    X = vectorizer.fit_transform(stringsarray)
    return np.squeeze(np.asarray(np.sum(X,axis=0)*(1000000.0/totalchars)))

def similarity_score(x, y):
    return 1.0 - np.divide(np.linalg.norm(x - y),np.linalg.norm(y))

# goodness of fit
def goodness_of_fit(cipher_array,asciimap,verbose=True):
    decipher_array = [decipher_subst(ciphertext,asciimap) 
     for ciphertext in cipher_array]
    w_score = num_recognized_words(decipher_array, worddictionary)
    wpm_candidate = word_per_million( decipher_array, worddictionary)
    npm_candidate = ngram_per_million( decipher_array, ngramdictionary)
    wpm_score = similarity_score(wpm_candidate, wpm_plaintext)
    npm_score = similarity_score(npm_candidate, npm_plaintext)
    if verbose:
        print('   Goodness-of-Fit vs Source Text   ')
        print('====================================')
        print("Number of Recognized Words = " + str(w_score))
        print("Word-per-Million Similarity = " + str(wpm_score))
        print("Ngram-per-Million Similarity = " + str(npm_score))
    return w_score,wpm_score, npm_score

# 2. Characteristics of Plaintext Data
First, let's look at the actual, plaintext data to get a feel for what we should expect to see in the postings from the 20 newsgroups. Knowing the characteristics of the text will make it much easier to decipher.

Each of the newsgroup documents looks something like this:

In [None]:
# Display the text of an example data item
dataitem = 0
print(twenty_databunch.data[dataitem])

There are 18,846 data items distributed across the 20 categories.

In [None]:
fig, ax = plt.subplots(figsize=(9, 12))
ax.barh(np.arange(len(category_names)),
        [sum(twenty_databunch.target==k) for k in range(len(category_names))],
        tick_label = category_names)
ax.tick_params(axis='both',labelsize=14)
plt.title('# of Newsgroup Postings',fontsize=16)
plt.show()

## Length of the newsgroup documents (after being split into 300 char strings)
Beyond a general understanding of the dataset, the length of the documents can be helpful in decrypting the ciphers as you try to match plaintext to ciphertext--especially in trying to puzzle through the more complicated 3rd and 4th ciphers.

In [None]:
print(str((100*sum(chunk_df['length']==300))//len(chunk_df))+'% of the data items have a size = 300')

In [None]:
print(str(sum(chunk_df['length']<5))+' of the data items have a size < 5 characters!')

## Distribution of ASCII characters in plaintext
In hacking the cipher, it will be helpful to see what the true distribution of ASCII characters is within the plaintext. The traditional letters will probably be fairly similar to the general distribution of alphabet letters in written English documents, but the newsgroup format of these postings could change the frequency with which we see capitalization and special characters.

In [None]:
%%time
cpm_plaintext = char_per_million(chunks_plaintext)
asciitop = np.argsort(-1*cpm_plaintext)
asciitop_df = pd.DataFrame({'char':[chr(asciinum) for asciinum in asciitop],'char_per_million':cpm_plaintext[asciitop]},index=asciitop)

In [None]:
# create a line chart to highlight the top 30 most common characters
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(np.arange(30),asciitop_df['char_per_million'].iloc[0:30],color='red', marker='*', markersize=10)
# set the tick positions first, then label them
ax.set_xticks(np.arange(30))
ax.set_xticklabels(["'"+chr(asciinum)+"'" for asciinum in asciitop_df.index[0:30]])
ax.tick_params(axis='both',labelsize=16)
plt.ylabel('Occurrence per Million Characters',fontsize=16)
plt.title('Top 30 ASCII Characters: Plaintext Source',fontsize=18)
plt.show()
print("Of the 128 possible ASCII values, only "+ str(sum(cpm_plaintext>1)) +" are used more than once per million")

## Dictionaries of most common words and ngrams

Note that we are not only using the default definition of a word (at least two alphanumeric characters) but are also considering an alternative dictionary allowing for one-letter words and special characters. This is done when loading the Python packages, by defining the alternative `CountVectorizer` word token pattern:

`token_pattern = r"[a-zA-Z0-9!@#$&()<>'=\-]+" `
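
The difference between the two tokenizers can be seen with plain `re.findall`. The sketch below compares scikit-learn's default `token_pattern` (`(?u)\b\w\w+\b`) with the custom one on a made-up, already lowercased sample string (`CountVectorizer` lowercases its input by default).

```python
import re

default_pattern = r"(?u)\b\w\w+\b"               # scikit-learn default: 2+ word characters
custom_pattern = r"[a-zA-Z0-9!@#$&()<>'=\-]+"    # pattern used in this kernel

sample = "i'm at home <now> a=b"                 # illustrative text only

default_tokens = re.findall(default_pattern, sample)
custom_tokens = re.findall(custom_pattern, sample)

print(default_tokens)
print(custom_tokens)
```

The custom pattern keeps contractions, one-letter words and symbol-bearing tokens whole, so a wrongly mapped punctuation character shows up as an unrecognized word.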

In [None]:
# the 5000 most common traditional words
totalchars = sum([len(cipherstring) for cipherstring in twenty_databunch.data])
vectorizer = CountVectorizer(analyzer='word', max_features=5000)
source_words = vectorizer.fit_transform(twenty_databunch.data)
worddictionary2 = vectorizer.vocabulary_

# the 5000 most common words, including one-letter words and symbols
totalchars = sum([len(cipherstring) for cipherstring in twenty_databunch.data])
vectorizer = CountVectorizer(analyzer='word', max_features=5000, token_pattern=token_pattern)
source_words = vectorizer.fit_transform(twenty_databunch.data)
worddictionary = vectorizer.vocabulary_
# the word-per-million score from plaintext English that we try to match in deciphering
wpm_plaintext = word_per_million(twenty_databunch.data,worddictionary)

# the 1000 most common English ngrams
n_ngram = 3
totalchars = sum([len(cipherstring) for cipherstring in twenty_databunch.data])
vectorizer = CountVectorizer(analyzer='char', 
                             max_features=1000,
                             lowercase = False,
                             ngram_range = (n_ngram,n_ngram) )
source_words = vectorizer.fit_transform(twenty_databunch.data)
ngramdictionary = vectorizer.vocabulary_
# the ngram-per-million score from plaintext English that we try to match in deciphering
npm_plaintext = ngram_per_million(twenty_databunch.data,ngramdictionary)

In [None]:
# this dictionary dataframe will show the common words
worddict_df = pd.DataFrame.from_dict(worddictionary, orient='index')
worddict_df.reset_index(inplace=True)
worddict_df.rename(index=str, columns={"index": "word", 0: "dict_index"},inplace=True)
worddict_df.set_index('dict_index',inplace=True)
worddict_df.sort_index(inplace=True)
worddict_df['words_per_million'] = wpm_plaintext
worddict_df.sort_values('words_per_million',ascending=False,inplace=True)
print("Top 10 Most Common Words/Symbols")
print(worddict_df.head(10))

# 3. Cracking Cipher #1

Let's get started on cracking the cipher for `difficulty=1`.

## Observed character frequencies

Many classic ciphers can be grouped into one of two categories: (1) substitution, where each character is mapped to another, and (2) transposition, where the ordering of the characters is switched.

We can use the observed frequency of characters to tell the two apart. In a transposition ciphertext, the characters show the same frequency of occurrence as normal English text. In contrast, a substitution cipher will produce ciphertext where each character has a different frequency than it would in English. But importantly, the probability distribution should look about the same, just with different characters.

If the shape of the probability distribution changes, the cipher is more complicated. It could fall into the category of a "polyalphabetic substitution cipher", where the substitution mapping changes from one letter to the next.
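
The frequency argument can be checked on a toy example. The sketch below uses a Caesar shift as a stand-in substitution cipher and a simple string reversal as a stand-in transposition cipher; both choices, and the sample sentence, are assumptions for illustration and are not the ciphers used in the competition.

```python
from collections import Counter

plaintext = "attack at dawn and attack again at dusk"

def caesar(text, shift=3):
    # toy substitution cipher: shift each lowercase letter, leave the rest alone
    shifted = []
    for ch in text:
        if 'a' <= ch <= 'z':
            shifted.append(chr((ord(ch) - ord('a') + shift) % 26 + ord('a')))
        else:
            shifted.append(ch)
    return ''.join(shifted)

substituted = caesar(plaintext)
transposed = plaintext[::-1]          # toy transposition: reverse the order

plain_counts = Counter(plaintext)
subst_counts = Counter(substituted)
trans_counts = Counter(transposed)

# transposition: every character keeps its own count
same_chars_after_transposition = trans_counts == plain_counts
# substitution: the counts move to other characters...
same_chars_after_substitution = subst_counts == plain_counts
# ...but the sorted counts (the "shape" of the distribution) are unchanged
same_shape_after_substitution = (
    sorted(subst_counts.values()) == sorted(plain_counts.values()))

print(same_chars_after_transposition,
      same_chars_after_substitution,
      same_shape_after_substitution)
```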

So let's look at the distribution of observed characters per million.

In [None]:
# we combine test and training data for greatest precision
cpm_train = char_per_million(train_df[train_df['difficulty']==1]['ciphertext'])
cpm_test =  char_per_million(test_df[test_df['difficulty']==1]['ciphertext'])
cpm_cipher = 0.67*cpm_test + 0.33*cpm_train

c1asciitop = np.argsort(-1*cpm_cipher)
c1asciitop_df = pd.DataFrame({'char':[chr(asciinum) for asciinum in c1asciitop],
                            'char_per_million':cpm_cipher[c1asciitop]},index=c1asciitop)

In [None]:
# create a bar chart of the top 30 most common ciphertext characters,
# with the plaintext distribution overlaid as a red line
fig, ax = plt.subplots(figsize=(15, 5))
ax.bar(np.arange(30),
       c1asciitop_df['char_per_million'].iloc[0:30],
       tick_label = ["'"+chr(asciinum)+"'" for asciinum in c1asciitop_df.index[0:30]],
       label='Distribution for Difficulty=1')
ax.plot(np.arange(30),asciitop_df['char_per_million'].iloc[0:30],color='red', marker='*', markersize=10,
        label='Distribution for Top Characters in Source Data')
ax.tick_params(axis='both',labelsize=16)
plt.ylabel('Occurrence per Million Characters',fontsize=16)
plt.title('Top 30 ASCII Characters: Ciphertext Difficulty=1',fontsize=18)
# labels are attached to the artists above so the legend cannot get out of order
plt.legend(fontsize=16)
plt.show()
print("Of the 128 possible ASCII values, only "+ str(sum(cpm_cipher>1)) +" are used more than once per million")

This is what we expect to see from a substitution cipher. Note the red line showing the shape of the distribution for the top characters in the source data: different letters but almost exactly the same distribution!
 
Let's start deciphering by trying out a mapping between the characters of the plaintext and the ciphertext with the most similar frequency of occurrence. To do this, we will create an `asciimap` dataframe that connects each ciphertext character in the index to a plaintext character in the `decipher` column. Then, if we just line up the frequency of letters occurring in the plaintext English with the ciphertext, we can generate a first guess at a potential substitution mapping. Let's see how legible this is for the sample training data at `iloc[0]`:
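
Before running this on the real data, the rank-matching idea can be sketched on a toy example. The helper names `rank_by_frequency` and `frequency_map` and the two short strings are hypothetical; the notebook cell that follows does the equivalent with `np.argsort` over the full character counts.

```python
from collections import Counter

def rank_by_frequency(text):
    # characters ordered from most to least common (ties broken alphabetically)
    counts = Counter(text)
    return sorted(counts, key=lambda ch: (-counts[ch], ch))

def frequency_map(ciphertext, reference_text):
    # pair the k-th most common cipher character with the k-th most common
    # reference character
    return dict(zip(rank_by_frequency(ciphertext),
                    rank_by_frequency(reference_text)))

reference = "aaaabbbc"     # stands in for the plaintext corpus
ciphertext = "zzzzyyyx"    # the same text with a->z, b->y, c->x

mapping = frequency_map(ciphertext, reference)
decoded = ''.join(mapping.get(ch, ch) for ch in ciphertext)
print(mapping, decoded)
```

With real text, characters of similar frequency get swapped with each other, which is why the first guess is readable but imperfect.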

In [None]:
asciimap = pd.DataFrame(asciitop, index=c1asciitop,columns=['decipher'])

In [None]:
textstring = train_df[train_df['difficulty']==1]['ciphertext'].iloc[0]
print('SAMPLE DECIPHER ATTEMPT:')
print(decipher_subst(textstring,asciimap))

Not bad! At this point, the text is already legible enough to understand. We could finish deciphering manually by making a few corrections to the asciimap (i.e. we could turn the >'s sideways to make v's!).

... but manual deciphering is no fun!  And automating and optimizing may create tools that we will need for deciphering the later ciphers.

## Finding actual plaintext in the source and fine-tuning
The current `asciimap` substitution deciphering, while not perfect, is close enough to easily find the plaintext source and perfect the final substitutions in the cipher. Note that it takes about 25 seconds for `fuzzywuzzy` to find a match on a typical CPU.

In [None]:
%%time
# Search all of the plaintext chunks for the entry that best matches the deciphered text
textstring = decipher_subst(train_df[train_df['difficulty']==1]['ciphertext'].iloc[0],asciimap)
textcategory = train_df[train_df['difficulty']==1]['target'].iloc[0]
sourcematch = process.extractOne(textstring, sourcetext_dict, scorer = fuzz.ratio)

In [None]:
print('Found a match in source data with "fuzz.ratio" score of ' + str(fuzz.ratio(textstring,sourcematch[0])))
print('Source data found in category = ' + category_names[chunk_df['target'].iloc[sourcematch[2]]])
print('Matching text below:')
print('----------------')
print(sourcematch[0])
print(' ')

 While this level of accuracy is probably fine for identifying most of the `difficulty=1` ciphers, we may want to perfect the decryption. Getting the first cipher completely correct could help with the others.
 
 To fine-tune the mapping, let's take the identified matches and figure out which characters are not mapped correctly. We need a fairly large subset to find the rare characters, like "K", but going through so many example ciphers to find the rare characters eats significant time! For this step, it is helpful to use parallel processing (done here via `dask`).
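
The core of this fine-tuning step is counting which guessed character most often has to be replaced by which true character. The sketch below illustrates that tally using the standard library's `difflib.SequenceMatcher` instead of `Levenshtein.editops`, purely so the example has no third-party dependencies; the two libraries can align strings differently. The function name `count_replacements` and the sample strings are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

def count_replacements(guess, source):
    # tally (guessed_char, true_char) pairs over equal-length 'replace' spans
    swaps = Counter()
    matcher = SequenceMatcher(None, guess, source, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'replace' and (i2 - i1) == (j2 - j1):
            swaps.update(zip(guess[i1:i2], source[j1:j2]))
    return swaps

guess = "ha>e a >ery nice e>ening"     # imperfect decipher: 'v' came out as '>'
source = "have a very nice evening"    # matching plaintext

swaps = count_replacements(guess, source)
print(swaps.most_common(3))
```

Summing such tallies over many matched pairs is what makes a consistent error like `>` for `v` stand out from random mismatches.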

In [None]:
# subsample of ciphertext
subset_size = 2000
tune_df = train_df[np.logical_and(train_df['difficulty']==1,
                                    np.array([len(train_df['ciphertext'].iloc[idx]) for idx in range(len(train_df))])>100)]
tuneindices = tune_df.index[:subset_size]
        
# function to match against source chunks from the same category with length >250 chars
def finetunematch(idx):
    textstring = decipher_subst(tune_df['ciphertext'].loc[idx],asciimap)
    selectindices = np.logical_and(chunk_df['target']==tune_df['target'].loc[idx],chunk_df['length']>250)
    sourcematch = process.extractOne(textstring, 
                                     list(compress(chunks_plaintext, selectindices)), 
                                     scorer = fuzz.ratio)
    return sourcematch[0], textstring, sourcematch[1]

# parallel evaluation using Dask (big benefits for more CPUs)
par_compute = [delayed(finetunematch)(idx) for idx in tuneindices]
output_arrays = compute(*par_compute, scheduler='processes')

To help ensure the matches are high quality, let's only use matches where the text similarity, as measured by the `fuzz.ratio`, is higher than 80.
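
The filtering pattern is a score threshold combined with `itertools.compress`. In the sketch below, `ratio_score` is a hypothetical standard-library stand-in for `fuzz.ratio` built on `difflib`; it is on the same 0 to 100 scale, but the exact values are not guaranteed to agree with `fuzzywuzzy`. The candidate strings are made up.

```python
from difflib import SequenceMatcher
from itertools import compress

def ratio_score(a, b):
    # 0-100 similarity score; a stdlib stand-in for fuzz.ratio (values may differ)
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

source = "the quick brown fox"
candidates = ["the quick brown fox",     # exact match
              "t>e quick brown fox",     # one wrong character
              "zzzz zzzz zzzz zzzz"]     # unrelated text

min_score = 80
scores = [ratio_score(candidate, source) for candidate in candidates]
keep = [score >= min_score for score in scores]
kept = list(compress(candidates, keep))
print(scores, kept)
```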

In [None]:
# select those with relatively high fuzzy.ratio scores
minfuzzyscore = 80
sourcematch_array = [arrays[0] for arrays in output_arrays]
decipher_array = [arrays[1] for arrays in output_arrays]
fuzzyscores_array = np.array([arrays[2] for arrays in output_arrays])
sourcematch_array = list(compress(sourcematch_array, (fuzzyscores_array>=minfuzzyscore) ))
decipher_array = list(compress(decipher_array, (fuzzyscores_array>=minfuzzyscore) ))
cipher_array =  tune_df[:subset_size]['ciphertext']
cipher_array = list(compress(cipher_array, (fuzzyscores_array>=minfuzzyscore) ))

In [None]:
print('In subsample, ' + str(np.sum(fuzzyscores_array>=minfuzzyscore)) + ' of the ' + str(subset_size) + ' have fuzzy.ratios > ' 
       + str(minfuzzyscore) + ' and will be used for fine-tuning')
print(' ')
print('Before fine-tuning the sub-sample:')
gof = goodness_of_fit(cipher_array,asciimap=asciimap)
print('Average fuzzy.ratio = ' + str(np.mean(fuzzyscores_array[fuzzyscores_array>=minfuzzyscore])))

Using those matches, some of the most common character replacements needed to match the proposed decipher with the source text (using Levenshtein distance edits) are:

In [None]:
# initialize asciiswitchmat, a 128x128 matrix with rows numbers corresponding to ascii characters
# that should be replaced by the ascii character of the column number
asciiswitchmat = np.zeros([128,128],dtype=int)
asciicount = np.zeros((128,), dtype=int)

for i, textstring in enumerate(decipher_array):
    asciiswitchmat += leven_replace_matrix(textstring,sourcematch_array[i])
    asciicount += np.histogram(string2ascii(textstring),np.arange(129))[0]

deciphererror_dict = {'ascii_replacement': [np.argmax(asciiswitchmat[i,...]) for i in range(128)],
                      'char_decipher': [chr(i) for i in range(128)],
                      'char_replacement': [chr(np.argmax(asciiswitchmat[i,...])) for i in range(128)],
                      'numError': np.sum(asciiswitchmat,axis=1),
                      'numReplace': [asciiswitchmat[i,np.argmax(asciiswitchmat[i,...])] for i in range(128)],
                      'totalObs': asciicount}

deciphererror_df = pd.DataFrame.from_dict(deciphererror_dict)
deciphererror_df['pctError'] = deciphererror_df['numError'] / deciphererror_df['totalObs']
deciphererror_df['pctBestReplace'] = deciphererror_df['numReplace'] / deciphererror_df['numError']  
deciphererror_df.sort_values(by=['pctError'],ascending=False,inplace=True)
deciphererror_df.head(10)

In [None]:
# correct the identified errors in character substitution
errorprone = np.logical_and(deciphererror_df['totalObs']>=1,
                            np.logical_and(deciphererror_df['pctError']>=0.01*(100-minfuzzyscore),
                                           deciphererror_df['pctBestReplace']>=0.5))
decipherchars = deciphererror_df.index[errorprone]
improvementchars = deciphererror_df['ascii_replacement'].loc[errorprone].values
tmpasciimap = asciimap.copy()
for i,decipherchar in enumerate(decipherchars):
    # a single .loc call ensures the assignment modifies asciimap itself
    asciimap.loc[tmpasciimap['decipher'].values==decipherchar, 'decipher'] = improvementchars[i]

Let's run through some more to try to correctly map characters that are less frequently observed.

In [None]:
# Repeating the previous code
# subsample of ciphertext
subset2_size = 2000
tuneindices = tune_df.index[subset_size:(subset2_size+subset_size)]

# parallel evaluation using Dask (big benefits for more CPUs)
par_compute2 = [delayed(finetunematch)(idx) for idx in tuneindices]
output_arrays = compute(*par_compute2, scheduler='processes')
sourcematch_array = [arrays[0] for arrays in output_arrays]
decipher_array = [arrays[1] for arrays in output_arrays]
fuzzyscores_array = np.array([arrays[2] for arrays in output_arrays])
sourcematch_array = list(compress(sourcematch_array, (fuzzyscores_array>=minfuzzyscore) ))
decipher_array = list(compress(decipher_array, (fuzzyscores_array>=minfuzzyscore) ))
cipher_array =  tune_df[subset_size:(subset_size+subset2_size)]['ciphertext']
cipher_array = list(compress(cipher_array, (fuzzyscores_array>=minfuzzyscore) ))

print('In 2nd subsample, ' + str(np.sum(fuzzyscores_array>=minfuzzyscore)) + ' of the ' + str(subset2_size) + ' have fuzzy.ratios > ' 
       + str(minfuzzyscore) + ' and will be used for fine-tuning')

# initialize asciiswitchmat, a 128x128 matrix with rows numbers corresponding to ascii characters
# that should be replaced by the ascii character of the column number
asciiswitchmat = np.zeros([128,128],dtype=int)
asciicount = np.zeros((128,), dtype=int)

for i, textstring in enumerate(decipher_array):
    asciiswitchmat += leven_replace_matrix(textstring,sourcematch_array[i])
    asciicount += np.histogram(string2ascii(textstring),np.arange(129))[0]

deciphererror_dict = {'ascii_replacement': [np.argmax(asciiswitchmat[i,...]) for i in range(128)],
                      'char_decipher': [chr(i) for i in range(128)],
                      'char_replacement': [chr(np.argmax(asciiswitchmat[i,...])) for i in range(128)],
                      'numError': np.sum(asciiswitchmat,axis=1),
                      'numReplace': [asciiswitchmat[i,np.argmax(asciiswitchmat[i,...])] for i in range(128)],
                      'totalObs': asciicount}

deciphererror_df = pd.DataFrame.from_dict(deciphererror_dict)
deciphererror_df['pctError'] = deciphererror_df['numError'] / deciphererror_df['totalObs']
deciphererror_df['pctBestReplace'] = deciphererror_df['numReplace'] / deciphererror_df['numError']

# correct the identified errors in character substitution
errorprone = np.logical_and(deciphererror_df['totalObs']>=1,
                            np.logical_and(deciphererror_df['pctError']>=0.1,
                                           deciphererror_df['pctBestReplace']>=0.8))
decipherchars = deciphererror_df.index[errorprone]
improvementchars = deciphererror_df['ascii_replacement'].loc[errorprone].values
tmpasciimap = asciimap.copy()
for i,decipherchar in enumerate(decipherchars):
    # a single .loc call ensures the assignment modifies asciimap itself
    asciimap.loc[tmpasciimap['decipher'].values==decipherchar, 'decipher'] = improvementchars[i]

Now we should have a fairly accurate solution. Let's check the goodness-of-fit statistics as well as our decryption of the sample cipher at `iloc[0]`:

In [None]:
print('After fine-tuning the sub-sample:')
gof = goodness_of_fit(cipher_array,asciimap=asciimap)

In [None]:
textstring = train_df[train_df['difficulty']==1]['ciphertext'].iloc[0]
print('FINAL DECIPHER:')
print(decipher_subst(textstring,asciimap))

This looks great and exactly matches the source. Let's now decipher all the test data for difficulty=1.

In [None]:
%%time
test_df['plaintext'] = ['']*len(test_df)
# fill in plaintext for test data
c1indices = test_df[test_df['difficulty']==1].index
test_df.loc[c1indices,'plaintext'] = [decipher_subst(test_df['ciphertext'].loc[idx],asciimap) for idx in c1indices]

# 4. Cracking Cipher #2
We can follow the exact same process in examining cipher #2. First let's look at the distribution of the ascii characters in the ciphertext.

In [None]:
%%time
cpm_plaintext = char_per_million(chunks_plaintext)
asciitop = np.argsort(-1*cpm_plaintext)
asciitop_df = pd.DataFrame({'char':[chr(asciinum) for asciinum in asciitop],'char_per_million':cpm_plaintext[asciitop]},index=asciitop)

# we combine test and training data for greatest precision
cpm_train = char_per_million(train_df[train_df['difficulty']==2]['ciphertext'])
cpm_test =  char_per_million(test_df[test_df['difficulty']==2]['ciphertext'])
cpm_cipher = 0.67*cpm_test + 0.33*cpm_train

c2asciitop = np.argsort(-1*cpm_cipher)
c2asciitop_df = pd.DataFrame({'char':[chr(asciinum) for asciinum in c2asciitop],
                            'char_per_million':cpm_cipher[c2asciitop]},index=c2asciitop)

# create a bar chart to highlight the top 30 most common characters
fig, ax = plt.subplots(figsize=(15, 5))
ax.bar(np.arange(30),
       c2asciitop_df['char_per_million'].iloc[0:30],
       tick_label = ["'"+chr(asciinum)+"'" for asciinum in c2asciitop_df.index[0:30]] )
ax.plot(np.arange(30),asciitop_df['char_per_million'].iloc[0:30],color='red', marker='*', markersize=10)
ax.tick_params(axis='both',labelsize=16)
plt.ylabel('Occurrence per Million Characters',fontsize=16)
plt.title('Top 30 ASCII Characters: Ciphertext Difficulty=2',fontsize=18)
plt.show()
print("Of the 128 possible ASCII values, only "+ str(sum(cpm_cipher>1)) +" are used more than once per million")

Looks like another substitution cipher. Even though this Kaggle competition creates the difficulty=2 ciphertext by encoding with cipher #1 and then with cipher #2, we can simply solve for the composite mapping between the plaintext in the source and the ciphertext.

(However, you might find some interesting insight into the exact deciphering by looking at the patterns in going from cipher 1 to cipher 2. This would be helpful in correctly deciphering rare characters).
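
To see why solving for a single mapping is enough, here is a minimal sketch with a toy alphabet and two random keys (all hypothetical, not the competition's actual mappings). Applying one substitution after another is itself just a substitution, so inverting the combined table deciphers both layers at once.

```python
import random
import string

# Hypothetical toy alphabet and keys; the competition's real mappings are not known here.
alphabet = string.ascii_lowercase + " "

def random_substitution(seed):
    """Return a dict mapping each character of `alphabet` to a shuffled character."""
    shuffled = list(alphabet)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(alphabet, shuffled))

def apply_map(text, mapping):
    """Substitute every character of `text` using `mapping`."""
    return "".join(mapping[ch] for ch in text)

cipher1 = random_substitution(1)
cipher2 = random_substitution(2)

# combined map: plaintext char -> cipher #1 char -> cipher #2 char
composite = {ch: cipher2[cipher1[ch]] for ch in alphabet}
# inverting the combined map gives one decipher table for both layers
inverse = {v: k for k, v in composite.items()}

plaintext = "attack at dawn"
ciphertext = apply_map(apply_map(plaintext, cipher1), cipher2)
recovered = apply_map(ciphertext, inverse)
print(recovered)
```

This is why the frequency-matching and fine-tuning steps used for cipher #1 can be reused unchanged: they never need to know how many substitution layers were applied.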

In [None]:
%%time
asciimap2 = pd.DataFrame(asciitop, index=c2asciitop,columns=['decipher'])
# Go through the plain text from the category and find the matching entry
textstring = decipher_subst(train_df[train_df['difficulty']==2]['ciphertext'].iloc[0],asciimap2)
textcategory = train_df[train_df['difficulty']==2]['target'].iloc[0]
sourcematch = process.extractOne(textstring, sourcetext_dict, scorer = fuzz.ratio)

print('Found a match in source data with "fuzz.ratio" score of ' + str(fuzz.ratio(textstring,sourcematch[0])))
print('Source data found in category = ' + category_names[chunk_df['target'].iloc[sourcematch[2]]])
print('Original deciphering using character frequency:')
print('----------------')
print(textstring)
print(' ')
print('Matching text below:')
print('----------------')
print(sourcematch[0])
print(' ')

Again, just using the character frequency is not a bad place to start. We can then fine-tune by matching to the source text.
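
As a rough illustration of what "fine-tuning by matching to the source" means, the sketch below tallies which deciphered character is most often replaced by which source character. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for the kernel's own `leven_replace_matrix` helper, and the two strings are made-up examples.

```python
from collections import Counter
from difflib import SequenceMatcher

def replacement_counts(deciphered, source):
    """Count (wrong_char, right_char) pairs from aligned 'replace' edits."""
    counts = Counter()
    matcher = SequenceMatcher(None, deciphered, source, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # only same-length replacements give a clean char-to-char pairing
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            counts.update(zip(deciphered[i1:i2], source[j1:j2]))
    return counts

# toy strings: the imperfect decipher consistently writes 'a' where the source has 'e'
deciphered = "tha quick brown fox jumps ovar tha lazy dog"
source     = "the quick brown fox jumps over the lazy dog"
counts = replacement_counts(deciphered, source)
print(counts.most_common(1))
```

A pair that dominates the tally for a given deciphered character is a good candidate for correcting the mapping, which is the idea behind the `pctError` and `pctBestReplace` thresholds used in the cells that follow.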

In [None]:
# subsample of ciphertext
subset_size = 2000
tune_df = train_df[np.logical_and(train_df['difficulty']==2,
                                    np.array([len(train_df['ciphertext'].iloc[idx]) for idx in range(len(train_df))])>100)]
tuneindices = tune_df.index[:subset_size]
        
# function to match to source data of the same category and the same length
def finetunematch2(idx):
    textstring = decipher_subst(tune_df['ciphertext'].loc[idx],asciimap2)
    selectindices = np.logical_and(chunk_df['target']==tune_df['target'].loc[idx],chunk_df['length']==len(textstring))
    sourcematch = process.extractOne(textstring, 
                                     list(compress(chunks_plaintext, selectindices)), 
                                     scorer = fuzz.ratio)
    return sourcematch[0], textstring, sourcematch[1]

# parallel evaluation using Dask (big benefits for more CPUs)
par_compute = [delayed(finetunematch2)(idx) for idx in tuneindices]
output_arrays = compute(*par_compute, scheduler='processes')

In [None]:
# select those with relatively high fuzz.ratio scores
minfuzzyscore = 80
sourcematch_array = [arrays[0] for arrays in output_arrays]
decipher_array = [arrays[1] for arrays in output_arrays]
fuzzyscores_array = np.array([arrays[2] for arrays in output_arrays])
sourcematch_array = list(compress(sourcematch_array, (fuzzyscores_array>=minfuzzyscore) ))
decipher_array = list(compress(decipher_array, (fuzzyscores_array>=minfuzzyscore) ))
cipher_array =  tune_df[:subset_size]['ciphertext']
cipher_array = list(compress(cipher_array, (fuzzyscores_array>=minfuzzyscore) ))
print('In subsample, ' + str(np.sum(fuzzyscores_array>=minfuzzyscore)) + ' of the ' + str(subset_size) + ' have fuzz.ratio >= ' 
       + str(minfuzzyscore) + ' and will be used for fine-tuning')
print(' ')
print('Before fine-tuning the sub-sample:')
gof = goodness_of_fit(cipher_array,asciimap=asciimap2)
print('Average fuzz.ratio = ' + str(np.mean(fuzzyscores_array[fuzzyscores_array>=minfuzzyscore])))

In [None]:
# initialize asciiswitchmat, a 128x128 matrix with rows numbers corresponding to ascii characters
# that should be replaced by the ascii character of the column number
asciiswitchmat = np.zeros([128,128],dtype=int)
asciicount = np.zeros((128,), dtype=int)

for i, textstring in enumerate(decipher_array):
    asciiswitchmat += leven_replace_matrix(textstring,sourcematch_array[i])
    asciicount += np.histogram(string2ascii(textstring),np.arange(129))[0]

deciphererror_dict2 = {'ascii_replacement': [np.argmax(asciiswitchmat[i,...]) for i in range(128)],
                      'char_decipher': [chr(i) for i in range(128)],
                      'char_replacement': [chr(np.argmax(asciiswitchmat[i,...])) for i in range(128)],
                      'numError': np.sum(asciiswitchmat,axis=1),
                      'numReplace': [asciiswitchmat[i,np.argmax(asciiswitchmat[i,...])] for i in range(128)],
                      'totalObs': asciicount}

deciphererror_df2 = pd.DataFrame.from_dict(deciphererror_dict2)
deciphererror_df2['pctError'] = deciphererror_df2['numError'] / deciphererror_df2['totalObs']
deciphererror_df2['pctBestReplace'] = deciphererror_df2['numReplace'] / deciphererror_df2['numError']  
deciphererror_df2.sort_values(by=['numError'],ascending=False,inplace=True)

In [None]:
# correct the identified errors in character substitution
# (no trailing comma after the mask: a comma would turn it into a one-element tuple)
errorprone = np.logical_and(deciphererror_df2['totalObs']>=1,
                            np.logical_and(deciphererror_df2['pctError']>=0.1,
                                           deciphererror_df2['pctBestReplace']>=0.5))
decipherchars = deciphererror_df2.index[errorprone]
improvementchars = deciphererror_df2['ascii_replacement'].loc[errorprone].values
tmpasciimap = asciimap2.copy()
for i,decipherchar in enumerate(decipherchars):
    # assign through a single .loc call so the update reaches asciimap2 itself
    asciimap2.loc[tmpasciimap['decipher'].values==decipherchar, 'decipher'] = improvementchars[i]

In [None]:
subset2_size = 2000
tuneindices = tune_df.index[subset_size:(subset2_size+subset_size)]

# parallel evaluation using Dask (big benefits for more CPUs)
par_compute2 = [delayed(finetunematch2)(idx) for idx in tuneindices]
output_arrays = compute(*par_compute2, scheduler='processes')


# select those with relatively high fuzz.ratio scores
minfuzzyscore = 80
sourcematch_array = [arrays[0] for arrays in output_arrays]
decipher_array = [arrays[1] for arrays in output_arrays]
fuzzyscores_array = np.array([arrays[2] for arrays in output_arrays])
sourcematch_array = list(compress(sourcematch_array, (fuzzyscores_array>=minfuzzyscore) ))
decipher_array = list(compress(decipher_array, (fuzzyscores_array>=minfuzzyscore) ))
cipher_array =  tune_df[subset_size:(subset_size+subset2_size)]['ciphertext']
cipher_array = list(compress(cipher_array, (fuzzyscores_array>=minfuzzyscore) ))


print('In 2nd subsample, ' + str(np.sum(fuzzyscores_array>=minfuzzyscore)) + ' of the ' + str(subset2_size) + ' have fuzz.ratio >= ' 
       + str(minfuzzyscore) + ' and will be used for fine-tuning')

# initialize asciiswitchmat, a 128x128 matrix with rows numbers corresponding to ascii characters
# that should be replaced by the ascii character of the column number
asciiswitchmat = np.zeros([128,128],dtype=int)
asciicount = np.zeros((128,), dtype=int)

for i, textstring in enumerate(decipher_array):
    asciiswitchmat += leven_replace_matrix(textstring,sourcematch_array[i])
    asciicount += np.histogram(string2ascii(textstring),np.arange(129))[0]

deciphererror_dict = {'ascii_replacement': [np.argmax(asciiswitchmat[i,...]) for i in range(128)],
                      'char_decipher': [chr(i) for i in range(128)],
                      'char_replacement': [chr(np.argmax(asciiswitchmat[i,...])) for i in range(128)],
                      'numError': np.sum(asciiswitchmat,axis=1),
                      'numReplace': [asciiswitchmat[i,np.argmax(asciiswitchmat[i,...])] for i in range(128)],
                      'totalObs': asciicount}

deciphererror_df = pd.DataFrame.from_dict(deciphererror_dict)
deciphererror_df['pctError'] = deciphererror_df['numError'] / deciphererror_df['totalObs']
deciphererror_df['pctBestReplace'] = deciphererror_df['numReplace'] / deciphererror_df['numError']

# correct the identified errors in character substitution
# (no trailing comma after the mask: a comma would turn it into a one-element tuple)
errorprone = np.logical_and(deciphererror_df['totalObs']>=2,
                            np.logical_and(deciphererror_df['pctError']>=0.2,
                                           deciphererror_df['pctBestReplace']>=0.66))
decipherchars = deciphererror_df.index[errorprone]
improvementchars = deciphererror_df['ascii_replacement'].loc[errorprone].values
tmpasciimap = asciimap2.copy()
for i,decipherchar in enumerate(decipherchars):
    # assign through a single .loc call so the update reaches asciimap2 itself
    asciimap2.loc[tmpasciimap['decipher'].values==decipherchar, 'decipher'] = improvementchars[i]

In [None]:
print('After fine-tuning the sub-sample:')
gof = goodness_of_fit(cipher_array,asciimap=asciimap2)

In [None]:
textstring = train_df[train_df['difficulty']==2]['ciphertext'].iloc[0]
print('FINAL DECIPHER:')
print(decipher_subst(textstring,asciimap2))

We can now decipher all the test data for difficulty=2.

In [None]:
%%time
c2indices = test_df[test_df['difficulty']==2].index
test_df.loc[c2indices,'plaintext'] = [decipher_subst(test_df['ciphertext'].loc[idx],asciimap2) for idx in c2indices]

# 5. Cracking Cipher #3
As before, let's start by looking at the distribution of characters in the ciphertext for difficulty=3.

In [None]:
# we combine test and training data for greatest precision
cpm_train = char_per_million(train_df[train_df['difficulty']==3]['ciphertext'])
cpm_test =  char_per_million(test_df[test_df['difficulty']==3]['ciphertext'])
cpm_cipher = 0.67*cpm_test + 0.33*cpm_train

c3asciitop = np.argsort(-1*cpm_cipher)
c3asciitop_df = pd.DataFrame({'char':[chr(asciinum) for asciinum in c3asciitop],
                            'char_per_million':cpm_cipher[c3asciitop]},index=c3asciitop)

# create a bar chart to highlight the top 30 most common characters
fig, ax = plt.subplots(figsize=(15, 5))
ax.bar(np.arange(30),
       c3asciitop_df['char_per_million'].iloc[0:30],
       tick_label = ["'"+chr(asciinum)+"'" for asciinum in c3asciitop_df.index[0:30]],
       label='Distribution for Difficulty=3')
ax.plot(np.arange(30),asciitop_df['char_per_million'].iloc[0:30],color='red', marker='*', markersize=10,
        label='Distribution for Top Characters in Source Data')
ax.tick_params(axis='both',labelsize=16)
plt.ylabel('Occurrence per Million Characters',fontsize=16)
plt.title('Top 30 ASCII Characters: Ciphertext Difficulty=3',fontsize=18)
# explicit labels keep each legend entry attached to the right artist
plt.legend(fontsize=16)
plt.show()

Of course, it would be too easy if cipher #3 were another substitution cipher. For most characters, the ciphertext frequencies do not match those we would observe in typical plaintext, though the space character does seem to match.

Interesting. It must be the case that plaintext characters are not always mapped to the same ciphertext characters. 
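
A quick way to convince yourself of this is to compare frequency profiles on a toy sentence. In the sketch below (made-up text and a hypothetical four-shift key), a single fixed shift leaves the sorted letter counts untouched, while a repeating sequence of shifts spreads the most common letter over several ciphertext letters and flattens the profile. Non-letters pass through unchanged, which is consistent with the space character still matching.

```python
from collections import Counter

def sorted_profile(text):
    """Letter counts sorted from most to least common, ignoring which letter is which."""
    return sorted(Counter(ch for ch in text if ch.isalpha()).values(), reverse=True)

def shift_letters(text, shifts):
    """Shift lowercase a-z by a repeating list of shifts; other characters pass through."""
    out, k = [], 0
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - 97 + shifts[k % len(shifts)]) % 26 + 97))
            k += 1  # the shift sequence only advances on letters
        else:
            out.append(ch)
    return "".join(out)

plaintext = "the enemy sees the eastern fleet leave the eerie reef"
mono = shift_letters(plaintext, [3])             # one fixed shift: a simple substitution
poly = shift_letters(plaintext, [3, 11, 7, 20])  # hypothetical repeating key

print(sorted_profile(plaintext)[:3])
print(sorted_profile(mono)[:3])
print(sorted_profile(poly)[:3])
```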

Let's see what it looks like if we try to decipher ciphertext from difficulty #3 using our mapping from cipher #2.

In [None]:
# Go through the plain text from the category and find the matching entry
textstring = decipher_subst(train_df[train_df['difficulty']==3]['ciphertext'].iloc[0],asciimap2)
textcategory = train_df[train_df['difficulty']==3]['target'].iloc[0]
sourcematch = process.extractOne(textstring, sourcetext_dict, scorer = fuzz.ratio)

print('Found a match in source data with "fuzz.ratio" score of ' + str(fuzz.ratio(textstring,sourcematch[0])))
print('Source data found in category = ' + category_names[chunk_df['target'].iloc[sourcematch[2]]])
print('Original deciphering of difficulty #3 ciphertext using mapping from cipher #2:')
print('----------------')
print(textstring)
print(' ')
print('Matching text below:')
print('----------------')
print(sourcematch[0])
print(' ')

Strangely, we (and fuzzywuzzy) are just barely able to identify the matching plaintext when using the cipher #2 mapping. But the unmatched plaintext characters don't consistently correspond to the same ciphertext characters.

Let's reverse engineer which ciphertext characters are the ones that we can't seem to consistently match.

In [None]:
# subsample of ciphertext
subset_size = 400

reversemap2 = asciimap2.copy()
reversemap2.drop_duplicates('decipher',inplace=True)
reversemap2.reset_index(inplace=True)
reversemap2.rename(index=str,columns={"decipher":"index","index":"decipher"},inplace=True)
reversemap2.set_index('index',inplace=True)
reversemap2.head()

transform_ct = np.zeros((128,), dtype=int)
transform_out = np.zeros((128,), dtype=int)
stable_ct = np.zeros((128,), dtype=int)
for idx in tqdm(range(subset_size)):
    textstring = decipher_subst(train_df[train_df['difficulty']==3]['ciphertext'].iloc[idx],asciimap2)
    textcategory = train_df[train_df['difficulty']==3]['target'].iloc[idx]
    sourcematch = process.extractOne(textstring, 
                                     list(compress(chunks_plaintext,chunks_target==textcategory)), 
                                     scorer = fuzz.ratio)
    s1 = string2ascii(decipher_subst(sourcematch[0],reversemap2))
    s2 = string2ascii(train_df[train_df['difficulty']==3]['ciphertext'].iloc[idx])
    if len(s1)==len(s2):
        transform_ct += np.histogram(np.array(s1[(s1-s2)>0],dtype=int),np.arange(129))[0]
        transform_out += np.histogram(np.array(s2[(s1-s2)>0],dtype=int),np.arange(129))[0]
        stable_ct += np.histogram(np.array(s1[(s1-s2)==0],dtype=int),np.arange(129))[0]

transform_pct = np.divide(transform_ct, (transform_ct + stable_ct+1) )

enoughobs = np.arange(128)[np.logical_and((transform_ct+stable_ct)>5,np.arange(128)>40)]
fig, ax = plt.subplots(figsize=(20, 4))
ax.bar(np.arange(len(enoughobs)),
        transform_pct[enoughobs],
              tick_label = ["'"+chr(asciinum)+"'" for asciinum in enoughobs] )
plt.title('Characters Transformed by Cipher Number 3',fontsize=16)
plt.show()

Ah ha! These seem to all be alphabetic characters (upper and lowercase) that are being modified. We'll define this set of alphabetic characters as `c3mod`.

In [None]:
c3mod = set()
modascii = np.union1d(np.arange(65,91),np.arange(97,123))
for i in modascii:
    c3mod.add(i)

It seems that cipher #3 applies a transformation only to these alphabetic characters, shifting them by varying amounts. Let's now try to find some pattern in how they are being shifted.
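
Before running this on the real data, here is a minimal sketch of the measurement itself: for two aligned strings of equal length, compute how far each letter moved, modulo 26. The strings are made up, and the sign convention (`after` minus `before`) is an assumption for illustration; the cell below takes the difference in its own direction and adds 26 to negative values.

```python
def letter_shifts(before, after):
    """Shift (0-25) applied at each letter position of two aligned, equal-length strings."""
    shifts = []
    for b, a in zip(before, after):
        if b.isalpha() and a.isalpha():
            # compare letter positions case-insensitively and wrap around the alphabet
            shifts.append((ord(a.lower()) - ord(b.lower())) % 26)
    return shifts

before = "Hello, World"
after  = "Igopt, Cvzun"   # hypothetical example: the n-th letter is shifted by n
print(letter_shifts(before, after))
```

Stacking these shift sequences for many samples, one row per sample, is what makes a repeating pattern visible.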

In [None]:
%%time
train_lengths = np.array([len(cipherstring) for cipherstring in train_df['ciphertext']])
c3indices = train_df.index[np.logical_and(train_df['difficulty']==3,train_lengths>290)]

numsamples = 50
c3modmat = np.nan*np.zeros((numsamples,300))
s2startmat = np.nan*np.zeros((numsamples,300)) 
for i,idx in enumerate(c3indices[0:numsamples]):
    textstring = decipher_subst(train_df['ciphertext'].loc[idx],asciimap2)
    textcategory = train_df['target'].loc[idx]
    selectindices = np.logical_and(chunk_df['target']==textcategory,chunk_df['length']>290)
    sourcematch = process.extractOne(textstring,
                                     list(compress(chunks_plaintext, selectindices)),
                                     scorer = fuzz.ratio)
    # Look at the differences for the modified characters
    s1 = string2ascii(decipher_subst(sourcematch[0],reversemap2))
    s2 = string2ascii(train_df['ciphertext'].loc[idx])
    s2mods = [(snum in c3mod) for snum in s2]
    ds = (s1[s2mods]-s2[s2mods])
    # append to matrix
    c3modmat[i,0:len(ds)] = ds
    s2startmat[i,0:len(ds)] = s2[s2mods]
    
    
# with only 26 letters, we cycle back to the beginning
c3modmat[c3modmat<0] = c3modmat[c3modmat<0]+26

fig, ax = plt.subplots(figsize=(15, 5))
#ax.plot(np.arange(300),c3modmat.transpose())
ax.matshow(c3modmat[0:40,0:80])
plt.xlabel('Character Changes',fontsize=16)
plt.ylabel('Ciphertext Samples',fontsize=16)
plt.title("EUREKA! A CONSISTENT PATTERN!",fontsize=22)
plt.show()

If we assign the numerical shifts to alphabetic characters, the pattern reveals a secret message that repeats every 19 characters! A fitting tribute to [Helen Fouché Gaines](https://en.wikipedia.org/wiki/Helen_Fouch%C3%A9_Gaines).

In [None]:
modeadjust,_ = stats.mode(c3modmat,axis=0,nan_policy='omit')
modeadjust = np.squeeze(modeadjust.data)
modeadjust = np.trim_zeros(modeadjust)
startadj = 91
print('Modal adjustment to alphabetic characters')
print(ascii2string(startadj-np.array(modeadjust,dtype=int)))
print('Chars 0-19')
print(ascii2string(startadj-np.array(modeadjust[0:19],dtype=int)))
print('Chars 19-38')
print(ascii2string(startadj-np.array(modeadjust[19:38],dtype=int)))
print('Key = ')
modeadjust0 = [ 19.,  22.,  15.,  22.,  13.,   1.,  21.,  12.,   6.,  24.,  19.,
        22.,   1.,  20.,   26.,  18.,  13.,  22.,   8.]
print(ascii2string(startadj-np.array(modeadjust0,dtype=int)))

We have a polyalphabetic substitution cipher with a rolling key (i.e. a Vigenère cipher). Let's now decipher our example text and then all of difficulty = 3.
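
For reference, here is a minimal textbook-style Vigenère sketch in which the key advances only on letters and case is preserved. The key `"LEMON"` and the message are the usual classroom example, not the competition's key, and the shift convention may differ from the one used by `decipher_rolling` below.

```python
def vigenere(text, key, decrypt=False):
    """Shift ASCII letters by a repeating key; the key advances only on letters."""
    out, k = [], 0
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = 65 if ch.isupper() else 97
            shift = ord(key[k % len(key)].upper()) - 65
            if decrypt:
                shift = -shift
            out.append(chr((ord(ch) - base + shift) % 26 + base))
            k += 1
        else:
            out.append(ch)  # spaces, digits and punctuation are left untouched
    return "".join(out)

key = "LEMON"  # hypothetical key for illustration
plaintext = "Attack at dawn!"
ciphertext = vigenere(plaintext, key)
print(ciphertext)                               # Lxfopv ef rnhr!
print(vigenere(ciphertext, key, decrypt=True))  # Attack at dawn!
```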

In [None]:
def decipher_rolling(textstring,c3mod,c3adjust):
    # undo the rolling shift: the k-th alphabetic character is adjusted by c3adjust[k]
    inarray = string2ascii(textstring)
    outarray = np.zeros(len(inarray),dtype=np.int8)
    outarray[:] = inarray
    # check to see if any substitutions are needed
    modchars = [(snum in c3mod) for snum in inarray]
    nmods = sum(modchars)
    if nmods>0:
        # flatnonzero always returns a 1-D index array, even when nmods==1
        modchars = np.flatnonzero(modchars)
        lowermods = outarray[modchars] > 95   # True for lowercase letters
        charadj = outarray[modchars] - 65 - 32*lowermods   # letter position 0-25
        charadj = (charadj + c3adjust[:nmods])%26
        outarray[modchars] = 65+32*lowermods+charadj
    return ascii2string(outarray)

c3adjust = np.tile(modeadjust0,16)
c3adjust = c3adjust[0:300]

In [None]:
textstring = train_df[train_df['difficulty']==3]['ciphertext'].iloc[0]
print('FINAL DECIPHER:')
print(decipher_subst(decipher_rolling(textstring,c3mod,c3adjust),asciimap2))

And we can now decipher difficulty=3 (and we will also try to use this to attempt difficulty=4).

In [None]:
%%time
c3indices = test_df[test_df['difficulty']>2].index
test_df.loc[c3indices,'plaintext'] = [decipher_subst(decipher_rolling(test_df['ciphertext'].loc[idx],c3mod,c3adjust),asciimap2) for idx in c3indices]

# 6. Make Predictions
After cracking some or all of the ciphers, a big challenge is matching the deciphered text to the source efficiently. Obvious methods, like fuzzywuzzy lookups, take 10-30 seconds for each data item, which eats up a lot of CPU time.

**Important tip: find fast/efficient ways to do this. My method uses a "pseudo-hash" that I made up, called `lookupscore`.** I match deciphered text to the source for cases where they have the same lookup score, and then use slower comparisons for the cases where there are no exact matches. The more accurate your deciphering, the fewer "slow comparisons" you need to make.
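
The general idea can be sketched with a plain dictionary. This is not the `lookupscore` used below (which is a weighted character-count score); it swaps in an exact character-count fingerprint as the bucket key, and the source chunks and labels are made up.

```python
from collections import Counter, defaultdict

def signature(text):
    """Order-independent fingerprint: sorted (character, count) pairs."""
    return tuple(sorted(Counter(text).items()))

# hypothetical source chunks with their class labels
source_chunks = [("the cat sat", 0), ("a dog barked", 1), ("tea for two", 2)]

# bucket the source chunks by fingerprint once, up front
index = defaultdict(list)
for text, label in source_chunks:
    index[signature(text)].append((text, label))

def lookup(deciphered):
    """Return the label of the unique exact match, or None when a slower search is needed."""
    candidates = [(t, lab) for t, lab in index.get(signature(deciphered), []) if t == deciphered]
    return candidates[0][1] if len(candidates) == 1 else None

print(lookup("a dog barked"))   # exact match found through its bucket
print(lookup("a dog barkee"))   # no bucket: would fall back to a slow comparison
```

Each lookup touches only the handful of chunks sharing a fingerprint instead of the whole source, which is where the speedup comes from.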

In [None]:
def minleven(textstring,selectdataframe):
    arrayofstrings = selectdataframe['plaintext'].values
    ldistances = [leven.distance(textstring,compstring) for compstring in arrayofstrings]
    bestfitref = np.argmin(ldistances)
    bestfittext = selectdataframe['plaintext'].iloc[bestfitref]
    bestfitindex = selectdataframe.index[bestfitref]
    return bestfittext, bestfitindex

def char_count(textstring):
    return np.histogram(string2ascii(textstring),np.arange(129))[0]
    
# calculate ascii character frequency per 300 characters
def char_per_300(stringsarray):
    asciicount = np.zeros((128,), dtype=int)
    for k,textdata in enumerate(stringsarray):
        asciicount += char_count(textdata)
    totalchars = np.sum(asciicount)
    return np.multiply(asciicount,np.divide(300.0,totalchars))

def similarity_score(x, y):
    #ss = 1.0 - np.divide(np.linalg.norm(x - y),np.linalg.norm(y))
    ss = np.sum(np.multiply(x,y))
    return ss

In [None]:
%%time
# major speedup with a pseudo-hash
cp300 = char_per_300(chunks_plaintext)
test_df['Predicted'] = 1
test_df['Levenshtein'] = np.nan
test_df['lookupscore'] = [similarity_score(char_count(textstring),cp300) for textstring in test_df['plaintext'].values ]
chunk_df['lookupscore'] = [similarity_score(char_count(textstring),cp300) for textstring in chunk_df['plaintext'].values ]

To make this kernel run for less time, we will only look for exact matches.

In [None]:
def exacttestmatch(idx):
    textstring = test_df['plaintext'].loc[idx]
    # positions of the source chunks whose pseudo-hash equals this item's pseudo-hash
    selectindices = np.flatnonzero((chunk_df['lookupscore']==test_df['lookupscore'].loc[idx]).values)
    if len(selectindices)==1:
        # positions, not labels, so use .iloc
        matchtext = chunk_df['plaintext'].iloc[selectindices[0]]
        matchtarget = chunk_df['target'].iloc[selectindices[0]]
        matchdist = leven.distance(textstring,matchtext)
    else:
        # zero or several candidates: leave for a slower comparison
        matchtarget = np.nan
        matchdist = 300
    return matchtarget, matchdist


for idx in tqdm(test_df.index):
    matchtarget, matchdist = exacttestmatch(idx)
    if matchdist == 0:
        test_df.loc[idx,'Predicted'] = matchtarget
        test_df.loc[idx,'Levenshtein'] = matchdist
        
print(str(np.sum(test_df['Levenshtein']==0))+"/"+str(len(test_df))+" exact matches")

And we can submit this for scoring on the Leaderboard...and only cipher #4 remains to be solved!

In [None]:
submission_df = test_df.copy()
submission_df.set_index('Id',inplace=True)
submission_df.drop(['difficulty','plaintext','ciphertext','length','Levenshtein','lookupscore'],
                   axis=1,inplace=True)
submission_df['Predicted'] = pd.to_numeric(submission_df['Predicted'],downcast='integer')
submission_df.to_csv('submission.csv')