## 1 Baseline (Nuhn, 2013)

The baseline is a beam search, i.e. a breadth-first search (BFS) with pruning.

### 1.1 Notation:

We use the same notation as Nuhn et al. in their 2013 paper<sup>[1]</sup>.

The ciphertext is: $f_1^N=f_1,…,f_i,…,f_N,\ f_i ∈ V_f$

The plaintext is: $e_1^N=e_1,…, e_i,…,e_N,\ e_i ∈ V_e$

Decipher function: $\phi :V_f → V_e$

We need to find the $\phi$ that maximizes $Pr(\phi(f_1),…,\phi(f_N))$.

### 1.2 Algorithm: 

The baseline algorithm is already explained in the homework description, so we do not repeat it in this notebook. Instead, we explain in detail how we choose the `score` function and `ext_order`.

![Figure 1](http://anoopsarkar.github.io/nlp-class/assets/img/decipherbeam.png)

### 1.3 Extension order we use in baseline:

We sort the ciphertext symbols by frequency in descending order and store them in a Python list. Beam search then processes the cipher symbols in this order.

We later improved this extension order; see the "Improvements" section of this notebook.

### 1.4 Score function we use:

In the middle of deciphering, the partially deciphered plaintext consists of separate fragments with gaps between them, so we need to score each fragment on its own.

We divide the fragments into three categories:

1. The sequence contains the start symbol (it includes the first letter of the whole text).
2. The sequence contains the end symbol (it includes the last letter of the whole text).
3. The sequence contains neither the start nor the end symbol.

Suppose we have a plaintext sequence $e_ie_{i+1}...e_j$, and the start and end symbols are both $\$$.

In the first case, we compute the score as $\prod_{k=i}^{j} Pr(e_k|\$e_ie_{i+1}...e_{k-1})$.

In the second case, we compute the score as $(\prod_{k=i}^{j} Pr(e_k|e_ie_{i+1}...e_{k-1})) \cdot Pr(\$|e_i...e_j)$.

In the third case, we compute the score as $\prod_{k=i}^{j} Pr(e_k|e_ie_{i+1}...e_{k-1})$, ignoring the start and end symbols.

(We actually compute the log likelihood in our program.)
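The three cases can be sketched in log space as follows. This is a minimal illustration, not the notebook's implementation: `toy_cond_logprob` is a hypothetical stand-in for the 6-gram model (uniform over 26 letters plus the boundary symbol), and `fragment_log_score` is a name introduced only for this example.

```python
import math

BOUNDARY = "$"  # start/end symbol, as in the formulas above

def toy_cond_logprob(symbol, history):
    """Hypothetical stand-in for the language model: uniform over 26 letters plus '$'."""
    return math.log(1.0 / 27)

def fragment_log_score(fragment, has_start=False, has_end=False,
                       cond_logprob=toy_cond_logprob):
    """Log-score of one contiguous plaintext fragment under the three cases."""
    # Case 1: condition on the start symbol when the fragment opens the text.
    history = BOUNDARY if has_start else ""
    total = 0.0
    for symbol in fragment:
        total += cond_logprob(symbol, history)
        history += symbol
    # Case 2: add the probability of the end symbol when the fragment closes the text.
    if has_end:
        total += cond_logprob(BOUNDARY, history)
    # Case 3 is the default: neither boundary term is added.
    return total

middle = fragment_log_score("kill")
with_end = fragment_log_score("kill", has_end=True)
print(middle, with_end)
```

With a real language model the conditional probabilities depend on the history, so the start-symbol case would also change the per-letter terms, not only add a boundary term.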

We tried several improvements to this score function; see the "Improvements" sections.

### 1.5 Running Time analysis:

Let $N_f$ be the number of distinct symbols in the ciphertext and $N_e$ the number of distinct symbols in the plaintext. We do not consider `ext_limits` in this analysis.

- Without pruning:
  - We need to search every possible mapping from cipher symbols to plaintext letters.
  - Each symbol in $V_f$ can map to any letter in $V_e$.
  - So the number of possible mappings is $N_e^{N_f}$.
- With pruning (beam size $B$):
  - In each iteration we keep at most $B$ mappings.
  - So each level of the search tree has at most $B$ nodes.
  - Each of the $N_f$ iterations extends at most $B$ mappings with $N_e$ letters, so the search scores at most $N_f \cdot B \cdot N_e$ hypotheses, which is far fewer than without pruning.
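To make the gap concrete, the two counts can be evaluated for the setting used later in this notebook: 26 plaintext letters, 54 cipher symbols, and a base beam size of 100K. The helper names below are introduced only for this illustration.

```python
def unpruned_mappings(n_e, n_f):
    """Number of complete mappings when every cipher symbol may map to any letter."""
    return n_e ** n_f

def beam_scored_hypotheses(n_e, n_f, beamsize):
    """Upper bound on hypotheses scored by beam search with a fixed beam size."""
    return n_f * beamsize * n_e

n_e, n_f, beam = 26, 54, 100_000
full = unpruned_mappings(n_e, n_f)
pruned = beam_scored_hypotheses(n_e, n_f, beam)

print("without pruning: a {}-digit number of mappings".format(len(str(full))))
print("with beam search: at most {:,} scored hypotheses".format(pruned))
```

The bound ignores the larger beam sizes used for some symbols in Section 4.2, so it is only an order-of-magnitude estimate for that configuration.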

## 2 Improvements

### 2.1 Ext order (Nuhn, 2014)

The `EXT_ORDER` (extension order) determines the order in which the cipher symbols are searched and scored. To generate it, our method runs a separate beam search over candidate extension orders, similar to the deciphering process itself. This search starts from the most frequent cipher symbol and extends every order in the beam by one new symbol per step. During the search, candidate orders are scored by the following function:
``` python
def contiguous_score(cipher, order):
    order = set(order)
    count = 0
    ngrams = defaultdict(int)
    for c in cipher:
        if c in order:
            count += 1
            ngrams[min(lm_order, count)] += 1
        else:
            count = 0

    score = 0
    for k, v in ngrams.items():
        score += contiguous_score_weights[k] * v
    return score
```
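This function counts how many n-grams of each length (capped at `lm_order`) are formed by adjacent occurrences of the chosen symbols in the ciphertext, and weights longer n-grams more heavily. Orders whose symbols occur next to each other therefore score higher, because they produce longer contiguous fragments that the language model can score more informatively. We think performance could be improved further with a better scoring function for the orders.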
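A small worked example may help show what the score rewards. The version below is the same logic with the weights and the language model order passed as parameters (an adaptation made only so the snippet is self-contained); the toy ciphertext `"abcabc"` is an assumption for illustration.

```python
from collections import defaultdict

def contiguous_score(cipher, order, weights, lm_order):
    """Weighted count of n-grams formed by adjacent symbols that are in `order`."""
    chosen = set(order)
    run = 0                      # length of the current run of chosen symbols
    ngrams = defaultdict(int)    # n-gram length -> number of occurrences
    for c in cipher:
        if c in chosen:
            run += 1
            ngrams[min(lm_order, run)] += 1
        else:
            run = 0
    return sum(weights[k] * v for k, v in ngrams.items())

weights = [0, 0, 1, 1, 1, 2, 3]   # same weights as in the notebook
adjacent = contiguous_score("abcabc", ["a", "b"], weights, 6)
split = contiguous_score("abcabc", ["a", "c"], weights, 6)
print(adjacent, split)  # prints "2 1"
```

Here `a` and `b` form the bigram `ab` twice, while `a` and `c` form only one bigram (`ca`), so the order starting with `a, b` is preferred.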


### 2.2 Multiprocessing

To improve efficiency, we implemented multiprocessing. When scoring the partially deciphered string, instead of scoring it as a whole with `score_bitstring`, we split it into its separate contiguous substrings and use multiple processes to score them in parallel. We use different scoring functions for the beginning, end, and middle substrings so that the probabilities of the start and end symbols are included.

### 2.3 Weighted score

When scoring a sequence, we use the 6-gram language model to obtain a score and then multiply it by the length of the sequence. We do this so that longer sequences carry more weight than short ones, because a longer sequence usually gives more reliable evidence about whether a mapping makes sense.
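The weighting can be sketched as below. `toy_log_score` is a hypothetical stand-in for the language model (it simply charges -1 per character), and `weighted_fragment_score` is a name used only in this example; the notebook's own scorer is `score_single_seq`.

```python
def toy_log_score(fragment):
    """Hypothetical stand-in for the 6-gram model: -1 log-probability per character."""
    return -1.0 * len(fragment)

def weighted_fragment_score(fragment, log_score=toy_log_score):
    """Multiply the fragment's log-score by its length."""
    return len(fragment) * log_score(fragment)

def weighted_total(fragments, log_score=toy_log_score):
    """Sum of length-weighted scores over all fragments of a partial decipherment."""
    return sum(weighted_fragment_score(f, log_score) for f in fragments)

short = weighted_fragment_score("ab")     # 2 * -2.0
longer = weighted_fragment_score("abcd")  # 4 * -4.0
print(short, longer, weighted_total(["ab", "abcd"]))
```

Because log-scores are negative, the multiplication increases the magnitude of a long fragment's contribution, so an implausible long fragment is penalized more strongly than an implausible short one.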

### 2.4 Dynamic ext_limits

Based on the true mappings of the Zodiac-408 cipher, we allow each English letter to be the target of at most 4 cipher symbols, except 'e', which may have up to 7. We chose these values so that the limits are not tailored too closely to the Zodiac cipher but are still helpful: 7 is a fairly large limit for 'e', and 4 is usually large enough for the other letters. We hoped this would help, but we are not sure whether it makes a big difference.
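As an illustration of how such limits constrain a mapping, here is a minimal sketch. The `within_limits` helper is introduced only for this example and checks every letter at once, whereas the notebook's `check_limits` checks only the letter that was just added.

```python
import string
from collections import Counter

# At most 4 cipher symbols per plaintext letter, except 7 for 'e'.
ext_limits = {letter: (7 if letter == "e" else 4) for letter in string.ascii_lowercase}

def within_limits(mapping, limits):
    """True if no plaintext letter is the target of more cipher symbols than allowed."""
    counts = Counter(mapping.values())
    return all(n <= limits[letter] for letter, n in counts.items())

ok = within_limits({"A": "e", "B": "e", "C": "t"}, ext_limits)
too_many = within_limits({symbol: "t" for symbol in "ABCDE"}, ext_limits)
print(ok, too_many)  # prints "True False"
```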

## 3 Attempts

### 3.1 Predict unknown letters and adopt frequency matching (Kambhatla, 2018)

This attempt is based on the beam search method with the ext order improvement. Say we are extending mapping $Φ$ to mapping $Φ'$ with $(f, e)$ (ciphertext symbol $f$ is mapped to plaintext symbol $e$), and we are scoring the new mapping $Φ'$.

Before this attempt, $SCORE_{Φ'} = lm.score\_bitstring(S_{Φ'})$. The lm score functions return a negative number; the closer it is to 0 (i.e. the larger it is), the better $S$/$Φ'$ is.

#### 3.1.1 Implementation

##### 3.1.1.1 Predict unknown letters with neural language models

We use neural language models to fill in the unknown letters (positions whose ciphertext symbols are not yet in $Φ'$). Say $S_{Φ'}$ = "ab\_ba\_\_a\_"; we want to replace each '\_' with the most likely plaintext letter $e_i$. With the neural language model given to us, $e_i = next\_chars(S_{Φ'}[:i-1], ...)$. However, the next_chars() function does not always return the most likely successors. In such cases we fall back on the n-gram model and choose the letter $e_i$ that maximizes $lm.score(S_{Φ'}[:i])$.

With a complete sentence predicted, for each $f_i$ (f in the i-th position in the ciphertext), $SCORE_{Φ'} += nlm.score\_next(S_{Φ'}[:i-1], e_i)$. (The initial value of $SCORE_{Φ'}$ is $SCORE_Φ.$)

Note that the score functions of nlm return a positive number. The smaller it is, the better $S$/$Φ'$ is.

##### 3.1.1.2 Update score using frequency matching

Let $freq(f)$ denote the frequency of letter $f$ in the ciphertext and $freq(e)$ the frequency of letter $e$ in plaintext. Suppose $Φ' = Φ ∪ (f,e)$. Then $FREQ\_MATCHING\_SCORE_{Φ'} = |log(freq(f)/freq(e))|$. The smaller this score is, the closer $freq(f)$ and $freq(e)$ are, and the better the pair $(f,e)$ is. Therefore, $SCORE_{Φ'} += FREQ\_MATCHING\_SCORE_{Φ'}$.
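The frequency matching term can be sketched as follows. This assumes relative frequencies; the toy ciphertext and the plaintext frequency table are made-up values for illustration, and the helper names are not from the notebook.

```python
import math
from collections import Counter

def relative_freqs(text):
    """Relative frequency of every symbol in `text`."""
    counts = Counter(text)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}

def freq_matching_score(freq_f, freq_e):
    """|log(freq(f) / freq(e))|: zero when the two frequencies are equal."""
    return abs(math.log(freq_f / freq_e))

cipher_freqs = relative_freqs("XYXZXYXX")               # toy ciphertext
plain_freqs = {"e": 0.5, "t": 0.25, "q": 0.125}         # hypothetical plaintext frequencies

common = freq_matching_score(cipher_freqs["X"], plain_freqs["e"])
rare = freq_matching_score(cipher_freqs["X"], plain_freqs["q"])
print(common, rare)
```

The frequent symbol `X` matches the frequent letter `e` much better than the rare letter `q`, so the pair `(X, e)` receives the smaller penalty.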

#### 3.1.2 Result

We finished the implementation, but the outcome was not satisfying. When we tested it on a simple ciphertext, the result was hardly even English. When running on the Zodiac-408 ciphertext, it was so slow that we never saw the final result.

#### 3.1.3 Reflection

The frequency matching score should help a lot for 1:1 substitution ciphers when the ciphertext is long enough. However, we doubt its effect on homophonic substitution ciphers.

Also, we think that the prediction should happen when a certain percentage of letters have been deciphered. Say $S_{Φ'}$ = "a\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_a", it doesn't make much sense and it is a waste of time to predict all those letters between the a's. In those cases, lm.score\_bitstring() can be applied.
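One way to express this idea is a simple gate on the fraction of known letters. This is only a sketch of the suggestion above, which we did not implement; the function names and the 0.5 threshold are hypothetical.

```python
def known_fraction(partial, unknown="_"):
    """Fraction of positions in the partial plaintext that are already deciphered."""
    if not partial:
        return 0.0
    return sum(1 for ch in partial if ch != unknown) / len(partial)

def should_predict(partial, threshold=0.5, unknown="_"):
    """Use the neural model only when enough context is known (threshold is hypothetical)."""
    return known_fraction(partial, unknown) >= threshold

dense = "ab_ba__a_"             # 5 of 9 letters known
sparse = "a" + "_" * 16 + "a"   # 2 of 18 letters known
print(should_predict(dense), should_predict(sparse))  # prints "True False"
```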

## 4 Improvements with fairly large beamsize 

In this section we describe the improvements that actually helped us obtain a good result, which is shown in the cell output below. This result is only possible with a fairly large beam size combined with the following improvements.

Note that with a small beam size, for example 5000, the following improvements are not applicable, so we provide a general version of our algorithm in the last cell.


### 4.1 Ignoring unigrams and bigrams

At first we scored every substring with the 6-gram language model. In each iteration we checked whether the correct mapping had been pruned by printing the worst score in the beam and the score of the correct mapping (restricted to the cipher symbols already searched). If the correct score is better than the worst score, the correct mapping is still in the beam. We found that in the early stage, when only a few cipher symbols are deciphered and the partially deciphered text is full of unigrams and bigrams, the score of the correct mapping is very bad, so it is pruned very early. The later search then leads to very bad results because it builds on wrong mappings.

So we tried ignoring unigrams and bigrams when scoring. Ignoring unigrams alone made the correct mapping less likely to be pruned, but the result was still not good enough. We then found that ignoring both unigrams and bigrams, together with a fairly large but feasible beam size, works well. The beam size must be sufficiently large at the start because the scores are all 0 for the first few cipher symbols: when no partial decipherment has a score, we need to keep all extensions of the search tree. To address this while making better use of computing resources, we implemented a dynamic beam size.
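The filtering step can be sketched as follows. It mirrors the `len(seq) > 2` filter in the notebook's `score` function, where unknown positions are written as spaces; the helper name `scorable_fragments` and the example string are introduced only for illustration.

```python
def scorable_fragments(partial_plaintext, min_len=3):
    """Split a partial decipherment on unknown positions and drop unigrams and bigrams."""
    fragments = partial_plaintext.split()   # unknown positions are spaces
    return [frag for frag in fragments if len(frag) >= min_len]

partial = "i  ki    kill  wild g"
kept = scorable_fragments(partial)
print(kept)  # prints "['kill', 'wild']"

# With nothing longer than a bigram, no fragment is scored and the total is 0.
early = scorable_fragments("i  ki  m")
print(early)  # prints "[]"
```

The second call shows why all hypotheses tie at score 0 in the first few iterations, which is the reason the beam must initially be large enough to keep every extension.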


### 4.2 Dynamic beamsize 


``` python
def dynamic_beamsize(cipher, beamsize):
    num_symbols = len(set(cipher))
    beamsizes = [beamsize] * (num_symbols)
    for i in range(4):
        beamsizes[i] = 1000000
    beamsizes[10] = 300000
    beamsizes[20] = 300000
    for i in range(num_symbols // 2, num_symbols):
        beamsizes[i] = int(beamsize * (0.85 ** (i - num_symbols//2)))
    return beamsizes
```

For the first 4 cipher symbols, the beam size needs to be sufficiently large, so we set it to 1 million. In our experiments we found that the correct mapping is likely to be pruned at cipher symbols no. 10 and 20, so we set the beam size to 300K there. All other symbols use the base beam size. For the second half of the symbols, the beam size decays exponentially by a factor of 0.85 per symbol: at this later stage more symbols have been deciphered and the correct answer has a better score, so we can shrink the beam to improve efficiency. Note that the hard-coded indices assume a cipher with more than 20 distinct symbols.
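For ciphers with fewer distinct symbols, the hard-coded indices would raise an `IndexError`. The following is a guarded variant, offered as a sketch rather than the code we ran; the function name and keyword parameters are assumptions, and for 54 symbols it is intended to produce the same schedule as `dynamic_beamsize`.

```python
def dynamic_beamsize_guarded(num_symbols, base, first=1_000_000,
                             boosted=300_000, decay=0.85):
    """Per-iteration beam sizes; indices beyond the alphabet size are skipped."""
    sizes = [base] * num_symbols
    # Very large beam for the first few symbols, where all scores are still 0.
    for i in range(min(4, num_symbols)):
        sizes[i] = first
    # Extra room at the iterations where the correct mapping tended to be pruned.
    for i in (10, 20):
        if i < num_symbols:
            sizes[i] = boosted
    # Exponential decay over the second half of the symbols.
    half = num_symbols // 2
    for i in range(half, num_symbols):
        sizes[i] = int(base * decay ** (i - half))
    return sizes

schedule = dynamic_beamsize_guarded(54, 100_000)
print(schedule[:4], schedule[10], schedule[27], schedule[-1])
small = dynamic_beamsize_guarded(5, 1_000)   # no IndexError for a tiny alphabet
print(small)
```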

For the result below, we used a base beam size of 100K. We kept the debugging output because we did not have time to run it again. Scroll down to the bottom to see the result.

## Reference

[1] Malte Nuhn, Julian Schamper, and Hermann Ney. 2013. Beam search for solving substitution ciphers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1568–1576.

[2] Malte Nuhn, Julian Schamper, and Hermann Ney. 2014. Improved decipherment of homophonic ciphers. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1764–1768.

[3] Nishant Kambhatla, Anahita Mansouri Bigvand, and Anoop Sarkar. 2018. Decipherment of substitution ciphers with neural language models.


In [1]:
from collections import defaultdict, Counter
import collections
import pprint
import math
import bz2
import string
import argparse
from ngram import LM
from nlm_scorer import NlmScorer
import nlm
import evaluator
from copy import deepcopy
from datetime import datetime
from multiprocessing import Pool

pp = pprint.PrettyPrinter(width=45, compact=True)



In [2]:
lm_order = 6
contiguous_score_weights = [0,0,1,1,1,2,3]
ext_limits = {letter: 4 if letter != 'e' else 7 for letter in string.ascii_lowercase}
nlm = None
lm = LM("data/6-gram-wiki-char.lm.bz2", n=lm_order, verbose=False)
mem = {}
mem_start = {}

Reading language model from data/6-gram-wiki-char.lm.bz2...
Done.


In [3]:
def read_file(filename):
    # The context managers close the files, so no explicit f.close() is needed.
    if filename.endswith(".bz2"):
        with bz2.open(filename, 'rt') as f:
            content = f.read()
    else:
        with open(filename, 'r') as f:
            content = f.read()
    return content

def check_limits(mappings, ext_limits, letter_to_check=None):
    """Return True if the mapping respects the per-letter extension limits."""
    if letter_to_check is None:
        # Check every plaintext letter used so far against its own limit.
        counts = Counter(mappings.values())
        return all(count <= ext_limits[letter] for letter, count in counts.items())
    # Only check the plaintext letter that was just added.
    plaintext_letters = list(mappings.values())
    return plaintext_letters.count(letter_to_check) <= ext_limits[letter_to_check]

def score_single_seq(t):
    i, seq = t
    return len(seq) * ( lm.score_partial_seq(seq) if i != 0 else lm.score_seq(seq) )

pool = Pool(12)

def score(mappings, cipher_text, lm, nlm):
    deciphered = [mappings[cipher_letter] if cipher_letter in mappings else ' ' for cipher_letter in cipher_text]
    deciphered = ''.join(deciphered)
    # bit_string = [ 'o' if c in mappings else '.' for c in cipher_text]
    # bit_string = ''.join(bit_string)
    seqs = deciphered.split()
    seqs = list(filter(lambda seq: len(seq) > 2, seqs))

    res = sum(pool.map(score_single_seq, zip(range(len(seqs)),seqs)))

    # return lm.score_bitstring(deciphered, bit_string)
    return res

def prune(beams, beamsize):
    sorted_beams = sorted(beams, key=lambda b: b[1], reverse=True)

    return sorted_beams[:beamsize]

def get_true_mappings(cipher):
    with open('data/ref.txt') as f:
        ref = f.read()
    true_mappings = {}
    num_symbols = len(set(cipher))
    for i in range(len(cipher)):
        if cipher[i] not in true_mappings:
            true_mappings[cipher[i]] = ref[i]
            if len(true_mappings) == num_symbols:
                return true_mappings

def decipher(cipher, mappings):
    deciphered = [mappings[cipher_letter] if cipher_letter in mappings else '.' for cipher_letter in cipher]
    deciphered = ''.join(deciphered)
    return deciphered

def beam_search(cipher_text, lm, nlm, ext_order, ext_limits, beamsizes):
    Hs = []
    Ht = []
    cardinality = 0
    Hs.append(({}, 0))
    Ve = string.ascii_lowercase

    true_mappings = get_true_mappings(cipher_text)

    while cardinality < len(ext_order):
        # if args.no_decay:
        #     beamsize = init_beamsize
        # else:
        #     beamsize = max(100, int(init_beamsize*(0.95**cardinality)))
        beamsize = beamsizes[cardinality]

        print("Searching for {}/{} letter".format(cardinality, len(ext_order)))
        print("\tCurrent size of searching tree: {:,}".format(len(Hs)))
        # print("\tGoing to be expended to: {:,}".format(len(Hs) * len(Ve)))
        cipher_letter = ext_order[cardinality]
        for mappings, sc in Hs:
            for plain_letter in Ve:
                ext_mappings = deepcopy(mappings)
                ext_mappings[cipher_letter] = plain_letter
                if check_limits(ext_mappings, ext_limits, plain_letter):  # only check new added one
                    Ht.append((ext_mappings, score(ext_mappings, cipher_text, lm, nlm)))
        Hs = prune(Ht, beamsize)
        max_acc, acc_deciphered = check_gold(Hs, cipher_text)
        print("Check gold: the best accuracy is: {}\nDeciphered text: \n{}".format(max_acc, acc_deciphered))
        # print("\tMost likely plaintext: \n{}".format(decipher(cipher_text, Hs[0][0])))
        cardinality += 1
        Ht = []
        best_mappings = Hs[0][0]
        best_sc = Hs[0][1]
        best_deciphered = decipher(cipher_text, best_mappings)

        worst_mappings = Hs[-1][0]
        worst_sc = Hs[-1][1]
        worst_deciphered = decipher(cipher_text, worst_mappings)

        true_deciphered = [true_mappings[cipher_letter] if cipher_letter in best_mappings else '.' for cipher_letter in cipher_text]
        true_deciphered = ''.join(true_deciphered)
        seqs = true_deciphered.replace('.', ' ') .split()
        seqs = list(filter(lambda seq: len(seq) > 2, seqs))
        true_score = sum(pool.map(score_single_seq, zip(range(len(seqs)), seqs)))

        print('Best deciphered text: \n{} score: {} \nTrue text: \n{} score: {}\nWorst deciphered text: \n{} score: {}\n'
              .format(best_deciphered, best_sc, true_deciphered, true_score, worst_deciphered, worst_sc))
    Hs.sort(key=lambda b: b[1], reverse=True)
    # pp.pprint(Hs)
    return Hs[0]

def contiguous_score(cipher, order):
    order = set(order)
    count = 0
    ngrams = defaultdict(int)
    for c in cipher:
        if c in order:
            count += 1
            ngrams[min(lm_order, count)] += 1
        else:
            count = 0

    score = 0
    for k, v in ngrams.items():
        score += contiguous_score_weights[k] * v
    return score

def prune_orders(orders, beamsize):
    sorted_order = sorted(orders, reverse=True)

    return sorted_order[: beamsize]

# def search_ext_order(cipher, beamsize):
#     symbols = set(cipher)
#     order = []
#     for c in cipher:
#         if c not in order:
#             order.append(c)
#             if len(order) == len(symbols):
#                 return order

def search_ext_order(cipher, beamsize):
    symbols = set(cipher)
    # Start with the most common character
    freq = Counter(cipher)
    start = freq.most_common(1)[0][0]
    # start = cipher[0]
    orders = [([0], [start])]
    orders_tmp = []
    symbols.remove(start)
    for i in range(len(symbols)):
        for scores, order in orders:
            for symbol in symbols:
                if symbol not in order:
                    new_order = deepcopy(order)
                    new_order.append(symbol)
                    new_scores = deepcopy(scores)
                    new_scores.insert(0, contiguous_score(cipher, new_order))
                    orders_tmp.append((new_scores, new_order))
        orders = prune_orders(orders_tmp, beamsize)
        orders_tmp = []
        # pp.pprint(orders)
    orders.sort(reverse=True)
    return orders[0][1]

#
# def decipher(mappings, cipher_text):
#     deciphered = [mappings[c] if c in mappings else '_' for c in cipher_text]
#     deciphered = ''.join(deciphered)
#     return deciphered


def check_gold(Hs, cipher_text):
    """
    Find the hypothesis in the current beam with the highest evaluator value.
    Calling this in every iteration shows at which step the correct solution is pruned.
    :param Hs: list of (mappings, score) hypotheses in the beam
    :param cipher_text: the ciphertext string
    :return: max acc and the corresponding deciphered text
    """
    max_acc = 0
    deciphered_text = None
    for mappings, sc in Hs:
        deciphered = decipher(cipher_text, mappings)
        acc = evaluator.evaluate(deciphered)  # evaluate once per hypothesis
        if max_acc < acc:
            max_acc = acc
            deciphered_text = deciphered
    return max_acc, deciphered_text

def dynamic_beamsize(cipher, beamsize):
    num_symbols = len(set(cipher))
    beamsizes = [beamsize] * (num_symbols)
    for i in range(4):
        beamsizes[i] = 1000000
    beamsizes[10] = 300000
    beamsizes[20] = 300000
    for i in range(num_symbols // 2, num_symbols):
        beamsizes[i] = int(beamsize * (0.85 ** (i - num_symbols//2)))
    return beamsizes

In [4]:
file = 'data/cipher.txt'
cipher = read_file(file)
cipher = [x for x in cipher if not x.isspace()]
cipher = ''.join(cipher)
# freq = Counter(cipher)
# ext_order = [ kv[0] for kv in sorted(freq.items(), key=lambda kv: kv[1], reverse=True)]
ext_order = search_ext_order(cipher, 100)
print(ext_order)
beamsizes = dynamic_beamsize(cipher, 100000)
print('Start deciphering...')
search_start = datetime.now()
mappings, sc = beam_search(cipher, lm, nlm, ext_order, ext_limits, beamsizes)

search_end = datetime.now()
print('Deciphering completed after {}'.format(search_end - search_start))
print(mappings)
deciphered = decipher(cipher, mappings)
print(deciphered, sc)
print(evaluator.evaluate(deciphered, log=True))

['—', '∑', 'B', '∫', 'º', 'P', 'A', 'O', 'R', 'u', '∆', 'Z', '/', 'π', 'X', 'À', 'E', 'Ã', 'V', '–', '•', 'W', '≈', '√', '+', 'G', 'æ', 'y', 'F', 'H', '∞', 'K', 'I', '“', 'Q', 'D', '∏', 'T', 'µ', '£', 'M', 'J', 'Ç', '^', 'L', '‘', 'N', 'S', 'ƒ', '\\', '¢', '§', 'Ω', 'j']
Start deciphering...
Searching for 0/54 letter
	Current size of searching tree: 1
Check gold: the best accuracy is: 100
Deciphered text: 
...............................m..........m.......................m....................m.......m...............m...............m...........m....m...................................................................................................................................................m.m......................m...m..................................m......................m...................m........
Best deciphered text: 
...............................a..........a.......................a....................a.......a...............a...............a...........a....a.............

Check gold: the best accuracy is: 98
Deciphered text: 
ili....ll.......l........i.i...m..........m...........illi....l...m....................m..i....m...............m.l...ll...ill..m...........m....m......illi.................................i...............i......l...............i.........i.....ill........i.............ll.........ill...ill....m.m..l....i.ill..........m...m............ill......l.i..........m...ll...i.....l.......m......li.......i...m........
Best deciphered text: 
its....nt.......n........i.s...a..........a...........ions....t...a....................a..i....a...............a.n...on...son..a...........a....a......ions.................................i...............s......t...............i.........s.....ion........s.............tt.........itt...son....a.a..o....i.int..........a...a............ito......n.i..........a...oo...s.....t.......a......ns.......i...a........ score: -90.47365208 
True text: 
ili....ll.......l........i.i...m..........m...........illi....l...m....

Check gold: the best accuracy is: 99
Deciphered text: 
ili...ill.ng....l........i.i...m......i...m...........illingwildg.m....................m..i....m......g......n.m.l...ll...ill..m...i.gg....m....m......illi.g.........i......n..........g...i.g............wi...gi.l...............i.....w...i.i..will.......ni.....di......ll.........ill.dwill....m.m..l....iwilln..gi.....m...m...........will......l.i..w.......m...ll...ing...l.......m......li.......i...m.....i..
Best deciphered text: 
ett...int.er....n........e.t...a......i...a...........einterritor.a....................a..e....a......r......e.a.n...in...tin..a...i.rr....a....a......eint.r.........i......e..........r...e.r............rt...ri.t...............e.....r...t.i..rein.......et.....oi......tt.........ett.ortin....a.a..i....erente..ri.....a...a...........reti......n.e..r.......a...ii...ter...t.......a......nt.......e...a.....i.. score: -224.68717353100004 
True text: 
ili...ill.ng....l........i.i...m......i...m...........illingwild

Check gold: the best accuracy is: 99
Deciphered text: 
ilikekilling..o.l........i.is..m......i.i.m..........killingwildg.m.i...e............s.m..i....mo.....g......n.m.l...ll..killsom...i.ggi.e.m....m......illi.g.........i.i.e..n..........g...i.g.......kso..wi...gi.l.............i.i.....w.e.i.i.iwill.......ni.....di......ll...i....killedwill...om.m..l...siwilln..gi.....m...m..e........will......loi.ow.......m...ll...ingo..l..es...m......li.....o.i...m.....i.i
Best deciphered text: 
ilikekilling..d.l........i.ie..a......i.i.a..........killingwildg.a.i...e............e.a..i....ad.....g......n.a.l...ll..killeda...i.ggi.e.a....a......illi.g.........i.i.e..n..........g...i.g.......ked..wi...gi.l.............i.i.....w.e.i.i.iwill.......ni.....di......ll...i....killedwill...da.a..l...eiwilln..gi.....a...a..e........will......ldi.dw.......a...ll...ingd..l..ee...a......li.....d.i...a.....i.i score: -532.2234357809999 
True text: 
ilikekilling..o.l........i.is..m......i.i.m..........killingwildg

Check gold: the best accuracy is: 98
Deciphered text: 
ilikekillingpeopl.b......i.is..m......i.i.m..........killingwildgamei...e....e..b....s.m..i....mo.....g....ean.m.l...ll..killsome..i.ggi.e.me...m......illi.g..p.....ei.i.e.enb.......a.g...i.g.......kso..wi...gi.l...be..p.....i.i...a.w.e.i.ieiwillb...b..ni.p..adi......ll...i...ekilledwillbe.om.m..l...siwilln..gi.....m...m.be....e...will......loi.ow......pm...ll...ingo..l..es...m......li...beo.i...me...pi.i
Best deciphered text: 
ilikekillingandal.e......i.ie..m......i.i.m..........killingwildgamei...e....n..e....e.m..i....md.....g....ean.m.l...ll..killedme..i.ggi.e.mn...m......illi.g..a.....ei.i.e.nne.......a.g...i.g.......ked..wi...gi.l...ee..a.....i.i...a.w.e.i.iniwille...e..ni.a..adi......ll...i...ekilledwillen.dm.m..l...eiwilln..gi.....m...m.ee....e...will......ldi.dw......am...ll...ingd..l..ee...m......li...eed.i...me...ai.i score: -901.4897519010001 
True text: 
ilikekillingpeopl.b......i.is..m......i.i.m..........killingwildg

Check gold: the best accuracy is: 99
Deciphered text: 
ilikekillingpeoplebeca...i.is..m.c....i.i.m..........killingwildgamei...e....e..bec..sema.i....mo.....g....ean.mal...ll..killsome..i.ggive.me..em......illi.ge.p....cei.i.evenbe..e...a.g...i.g......ckso..wi...gi.l...be..pa....i.i...a.w.e.i.ieiwillbe.eb..ni.p..adic....all...i..vekilledwillbecomem..l.vesiwilln..give...m...mebeca..e...will......loi.ow......pm.c.ll.c.ingo..laves...m....e.li.e.beo.ie..me...pi.i
Best deciphered text: 
ilikekillingpeoplebeca...i.is..m.c....i.i.m..........killingwildgamei...e....e..bec..sema.i....mo.....g....ean.mal...ll..killsome..i.ggive.me..em......illi.ge.p....cei.i.evenbe..e...a.g...i.g......ckso..wi...gi.l...be..pa....i.i...a.w.e.i.ieiwillbe.eb..ni.p..adic....all...i..vekilledwillbecomem..l.vesiwilln..give...m...mebeca..e...will......loi.ow......pm.c.ll.c.ingo..laves...m....e.li.e.beo.ie..me...pi.i score: -1452.2298433599601 
True text: 
ilikekillingpeoplebeca...i.is..m.c....i.i.m..........killingwild

Check gold: the best accuracy is: 97
Deciphered text: 
ilikekillingpeoplebecauseitis..muc..u.i.i.m....u.....killingwildgamei...e....estbecausema.i...emoa....g...ueanamal...ll..killsomet.i.ggive.me..em.a....illi.ge.pe...ceitisevenbe..e...a.get.i.g..u...ckso..wi..agi.l...be.tpa....i.ia..a.w.e.i.ieiwillbe.eb..ni.p..adice...all...i..vekilledwillbecomem.slavesiwilln..give..um...mebecau.e..uwill...t.sloi.ow......pm.c.llec.ingo.slaves...m.a..e.li.eebeo.ie.emet..pi.i
Best deciphered text: 
ilikekillingpeoplebecauseitis..muc..u.i.i.m....u.....killingwildgamei...e....estbecausema.i...emoa....g...ueanamal...ll..killsomet.i.ggive.me..em.a....illi.ge.pe...ceitisevenbe..e...a.get.i.g..u...ckso..wi..agi.l...be.tpa....i.ia..a.w.e.i.ieiwillbe.eb..ni.p..adice...all...i..vekilledwillbecomem.slavesiwilln..give..um...mebecau.e..uwill...t.sloi.ow......pm.c.llec.ingo.slaves...m.a..e.li.eebeo.ie.emet..pi.i score: -2058.1351450812203 
True text: 
ilikekillingpeoplebecauseitis..muc..u.i.i.m....u.....killingwild

Check gold: the best accuracy is: 97
Deciphered text: 
ilikekillingpeoplebecauseitiss.muc..u.iti.m...fun....killingwildgamei...e..r.estbecausemanist.emoa....g..tueanamal.f.ll..killsomet.i.ggivesmet.em.a...rillinge.pe...ceitisevenbet.e...a.getting..ur..ckso.fwi..agi.l...bestpart..i.ia..a.w.e.i.ieiwillbe.eb..ni.p.radice.n.allt..i..vekilledwillbecomem.slavesiwilln..give..um...mebecause..uwill...t.sloi.own.....pm.c.llec.ingofslaves...m.a.terlifeebeorietemet..piti
Best deciphered text: 
ilikekillingpeoplebecauseitiss.muc..u.iti.m...fun....killingwildgamei...e..r.estbecausemanist.emoa....g..tueanamal.f.ll..killsomet.i.ggivesmet.em.a...rillinge.pe...ceitisevenbet.e...a.getting..ur..ckso.fwi..agi.l...bestpart..i.ia..a.w.e.i.ieiwillbe.eb..ni.p.radice.n.allt..i..vekilledwillbecomem.slavesiwilln..give..um...mebecause..uwill...t.sloi.own.....pm.c.llec.ingofslaves...m.a.terlifeebeorietemet..piti score: -2993.2068009804007 
True text: 
ilikekillingpeoplebecauseitiss.muc..u.iti.m...fun....killingwild

Check gold: the best accuracy is: 97
Deciphered text: 
ilikekillingpeoplebecauseitiss.much.u.iti.mo..funth..killingwildgamei..he..r.estbecausemanisthemoat...g..tueanamalof.ll..killsomethi.ggivesmethem.at.hrillinge.pe...ceitisevenbette..ha.gettingyour..ckso.fwithagi.l.h.bestpart..itia.ha.whe.i.ieiwillbe.ebo.ni.p.radice.n.allth.ih.vekilledwillbecomemyslavesiwillnotgivey.umy..mebecauseyouwill..yt.sloi.own.....pmyc.llectingofslaves.o.mya.terlifeebeorietemethhpiti
Best deciphered text: 
ilikekillingpeoplebecauseitiss.much.u.iti.mo..funth..killingwildgamei..he..r.estbecausemanisthemoat...g..tueanamalof.ll..killsomethi.ggivesmethem.at.hrillinge.pe...ceitisevenbette..ha.gettingyour..ckso.fwithagi.l.h.bestpart..itia.ha.whe.i.ieiwillbe.ebo.ni.p.radice.n.allth.ih.vekilledwillbecomemyslavesiwillnotgivey.umy..mebecauseyouwill..yt.sloi.own.....pmyc.llectingofslaves.o.mya.terlifeebeorietemethhpiti score: -4966.995396972399 
True text: 
ilikekillingpeoplebecauseitiss.much.u.iti.mo..funth..killingwildg

Check gold: the best accuracy is: 98
Deciphered text: 
ilikekillingpeoplebecauseitissomuchfuniti.mor.funth..killingwildgameinthef.r.estbecausemanisthemoat...g.rtueanamalof.llt.killsomethinggivesmethemoatthrillinge.pe...ceitisevenbetterthangettingyour..cksoffwithagirlth.bestpartofitiatha.whe.i.ieiwillbe.eborninp.radice.n.allth.ih.vekilledwillbecomemyslavesiwillnotgivey.umyn.mebecauseyouwilltrytosloi.own...topmyc.llectingofslavesformyafterlifeebeorietemethhpiti
Best deciphered text: 
ilikekillingpeoplebecauseitissomuchfuniti.mor.funth..killingwildgameinthef.r.estbecausemanisthemoat...g.rtueanamalof.llt.killsomethinggivesmethemoatthrillinge.pe...ceitisevenbetterthangettingyour..cksoffwithagirlth.bestpartofitiatha.whe.i.ieiwillbe.eborninp.radice.n.allth.ih.vekilledwillbecomemyslavesiwillnotgivey.umyn.mebecauseyouwilltrytosloi.own...topmyc.llectingofslavesformyafterlifeebeorietemethhpiti score: -9199.8456923336 
True text: 
ilikekillingpeoplebecauseitissomuchfuniti.mor.funth..killingwildgam

Check gold: the best accuracy is: 98
Deciphered text: 
ilikekillingpeoplebecauseitissomuchfuniti.morefunthankillingwildgameintheforrestbecausemanisthemoat.angertueanamalofalltokillsomethinggivesmethemoatthrillinge.perenceitisevenbetterthangettingyourrocksoffwithagirlthebestpartofitiathaewheni.ieiwillbereborninparadice.n.alltheihavekilledwillbecomemyslavesiwillnotgiveyoumynamebecauseyouwilltrytosloi.ownor.topmycollectingofslavesformyafterlifeebeorietemethhpiti
Best deciphered text: 
ilikekillingpeoplebecauseitissomuchfuniti.morefunthankillingwildgameintheforrestbecausemanisthemoat.angertueanamalofalltokillsomethinggivesmethemoatthrillinge.perenceitisevenbetterthangettingyourrocksoffwithagirlthebestpartofitiathaewheni.ieiwillbereborninparadice.n.alltheihavekilledwillbecomemyslavesiwillnotgiveyoumynamebecauseyouwilltrytosloi.ownor.topmycollectingofslavesformyafterlifeebeorietemethhpiti score: -21065.009178219483 
True text: 
ilikekillingpeoplebecauseitissomuchfuniti.morefunthankillingwild

Again, the above result was obtained by applying the improvements described in Section 4, which are not applicable when testing other ciphers with a beam size of 5000. The last cell provides a general version of our algorithm.

In [None]:
from collections import defaultdict, Counter
import collections
import pprint
import math
import bz2
import string
import argparse
from ngram import LM
from nlm_scorer import NlmScorer
import nlm
import evaluator
from copy import deepcopy
from datetime import datetime
from multiprocessing import Pool

lm_order = 6
contiguous_score_weights = [0,0,1,1,1,2,3]
ext_limits = {letter: 4 if letter != 'e' else 7 for letter in string.ascii_lowercase}
nlm = None
lm = LM("data/6-gram-wiki-char.lm.bz2", n=lm_order, verbose=False)

def read_file(filename):
    # The context managers close the files, so no explicit f.close() is needed.
    if filename.endswith(".bz2"):
        with bz2.open(filename, 'rt') as f:
            content = f.read()
    else:
        with open(filename, 'r') as f:
            content = f.read()
    return content

def check_limits(mappings, ext_limits, letter_to_check=None):
    """Return True if the mapping respects the per-letter extension limits."""
    if letter_to_check is None:
        # Check every plaintext letter used so far against its own limit.
        counts = Counter(mappings.values())
        return all(count <= ext_limits[letter] for letter, count in counts.items())
    # Only check the plaintext letter that was just added.
    plaintext_letters = list(mappings.values())
    return plaintext_letters.count(letter_to_check) <= ext_limits[letter_to_check]

def score_single_seq(t):
    i, seq = t
    return len(seq) * ( lm.score_partial_seq(seq) if i != 0 else lm.score_seq(seq) )

pool = Pool(12)

def score(mappings, cipher_text, lm, nlm):
    deciphered = [mappings[cipher_letter] if cipher_letter in mappings else ' ' for cipher_letter in cipher_text]
    deciphered = ''.join(deciphered)
    seqs = deciphered.split()
#     seqs = list(filter(lambda seq: len(seq) > 2, seqs))

    res = sum(pool.map(score_single_seq, zip(range(len(seqs)),seqs)))

    return res

def prune(beams, beamsize):
    sorted_beams = sorted(beams, key=lambda b: b[1], reverse=True)

    return sorted_beams[:beamsize]

def get_true_mappings(cipher):
    with open('data/ref.txt') as f:
        ref = f.read()
    true_mappings = {}
    num_symbols = len(set(cipher))
    for i in range(len(cipher)):
        if cipher[i] not in true_mappings:
            true_mappings[cipher[i]] = ref[i]
            if len(true_mappings) == num_symbols:
                return true_mappings

def decipher(cipher, mappings):
    deciphered = [mappings[cipher_letter] if cipher_letter in mappings else '.' for cipher_letter in cipher]
    deciphered = ''.join(deciphered)
    return deciphered

def beam_search(cipher_text, lm, nlm, ext_order, ext_limits, beamsize):
    Hs = []
    Ht = []
    cardinality = 0
    Hs.append(({}, 0))
    Ve = string.ascii_lowercase

    # Reference mappings, for debugging only; use the argument rather than the global `cipher`.
    true_mappings = get_true_mappings(cipher_text)

    while cardinality < len(ext_order):
        cipher_letter = ext_order[cardinality]
        for mappings, sc in Hs:
            for plain_letter in Ve:
                ext_mappings = deepcopy(mappings)
                ext_mappings[cipher_letter] = plain_letter
                if check_limits(ext_mappings, ext_limits, plain_letter):  # only check new added one
                    Ht.append((ext_mappings, score(ext_mappings, cipher_text, lm, nlm)))
        Hs = prune(Ht, beamsize)
        cardinality += 1
        Ht = []

    Hs.sort(key=lambda b: b[1], reverse=True)
    return Hs[0]

def contiguous_score(cipher, order):
    order = set(order)
    count = 0
    ngrams = defaultdict(int)
    for c in cipher:
        if c in order:
            count += 1
            ngrams[min(lm_order, count)] += 1
        else:
            count = 0

    score = 0
    for k, v in ngrams.items():
        score += contiguous_score_weights[k] * v
    return score

def prune_orders(orders, beamsize):
    sorted_order = sorted(orders, reverse=True)

    return sorted_order[: beamsize]

def search_ext_order(cipher, beamsize):
    symbols = set(cipher)
    # Start with the most common character
    freq = Counter(cipher)
    start = freq.most_common(1)[0][0]
    # start = cipher[0]
    orders = [([0], [start])]
    orders_tmp = []
    symbols.remove(start)
    for i in range(len(symbols)):
        for scores, order in orders:
            for symbol in symbols:
                if symbol not in order:
                    new_order = deepcopy(order)
                    new_order.append(symbol)
                    new_scores = deepcopy(scores)
                    new_scores.insert(0, contiguous_score(cipher, new_order))
                    orders_tmp.append((new_scores, new_order))
        orders = prune_orders(orders_tmp, beamsize)
        orders_tmp = []
        # pp.pprint(orders)
    orders.sort(reverse=True)
    return orders[0][1]

file = 'data/cipher.txt'
cipher = read_file(file)
cipher = [x for x in cipher if not x.isspace()]
cipher = ''.join(cipher)
ext_order = search_ext_order(cipher, 100)
print(ext_order)
beamsize = 1000
print('Start deciphering...')
search_start = datetime.now()
mappings, sc = beam_search(cipher, lm, nlm, ext_order, ext_limits, beamsize)

search_end = datetime.now()
print('Deciphering completed after {}'.format(search_end - search_start))
print(mappings)
deciphered = decipher(cipher, mappings)
print(deciphered, sc)
print(evaluator.evaluate(deciphered, log=True))
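
To illustrate how `contiguous_score` ranks candidate extension orders, here is a minimal standalone sketch on a toy cipher. It copies the weights and the LM order from the cell above; the toy string `abcabcab` and the constant names `WEIGHTS` and `LM_ORDER` are made up for illustration and are not part of our data or code. An order scores higher when the symbols it fixes form longer contiguous runs in the ciphertext, because longer runs let the language model score longer n-grams.

```python
from collections import defaultdict

# Copied from the cell above: WEIGHTS[k] is the weight of a contiguous run of length k
# (runs longer than the LM order are capped at LM_ORDER).
WEIGHTS = [0, 0, 1, 1, 1, 2, 3]
LM_ORDER = 6

def contiguous_score(cipher, order):
    fixed = set(order)          # cipher symbols already placed in the extension order
    run = 0                     # length of the current run of fixed symbols
    ngrams = defaultdict(int)   # capped run length -> number of positions reaching it
    for c in cipher:
        if c in fixed:
            run += 1
            ngrams[min(LM_ORDER, run)] += 1
        else:
            run = 0
    return sum(WEIGHTS[k] * v for k, v in ngrams.items())

toy_cipher = "abcabcab"  # hypothetical toy ciphertext
print(contiguous_score(toy_cipher, ["a"]))            # 0: only isolated symbols
print(contiguous_score(toy_cipher, ["a", "b"]))       # 3: three runs of length 2
print(contiguous_score(toy_cipher, ["a", "b", "c"]))  # 14: one run covering the whole text
```

With these weights a single fixed symbol contributes nothing, so the search over orders is pushed toward symbols that sit next to ones already fixed.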