# Set 1: *Basics*

## Task 1: *convert hex to base64*

All the cryptography we'll do operates on raw bytes, but messages are usually passed around as hex- or base64-encoded strings. Converting between these encodings is therefore important, and we'll use the `base64` library to do so:

In [1]:
input_hex = "49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d"
correct_b64 = "SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t"

In [2]:
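import base64
# Despite its name, encoded_str holds the raw bytes decoded from the hex string
encoded_str = base64.b16decode(input_hex, casefold=True)
# Likewise, decoded_b64 is the base64 *encoding* of those bytes, as a str
decoded_b64 = base64.b64encode(encoded_str).decode()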

*N.B. Cryptopals provides lowercase hex encodings, whereas* `b16decode` *only accepts uppercase by default, hence the need for the* `casefold` *argument.*

In [3]:
assert decoded_b64 == correct_b64
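
As an aside, the same conversion can be sketched without `casefold`: `bytes.fromhex` accepts hex digits in either case, and `bytes.hex` emits lowercase. The helper names `hex_to_b64` and `b64_to_hex` are illustrative and not part of the notebook above.

```python
import base64


def hex_to_b64(hex_str: str) -> str:
    """Decode a hex string to raw bytes, then base64-encode those bytes."""
    raw = bytes.fromhex(hex_str)  # accepts upper- and lowercase hex digits
    return base64.b64encode(raw).decode("ascii")


def b64_to_hex(b64_str: str) -> str:
    """Decode a base64 string to raw bytes, then hex-encode (lowercase)."""
    raw = base64.b64decode(b64_str)
    return raw.hex()


# The first three bytes of the challenge input, b"I'm"
print(hex_to_b64("49276d"))  # SSdt
```

Going through raw bytes in both directions keeps the two encodings independent of each other, which matches how the rest of the challenges treat their inputs.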

Throughout these challenges the authors show a healthy obsession with 90s hip-hop music. Decoding the above bytes to plaintext gives a hint of what is to come:

In [4]:
print(encoded_str.decode())

I'm killing your brain like a poisonous mushroom


## Task 2: *Fixed XOR*

In [5]:
input1 = "1c0111001f010100061a024b53535009181c"
input2 = "686974207468652062756c6c277320657965"
correct_output = "746865206b696420646f6e277420706c6179"

Python has a built-in `bytes` type, which will prove very useful for almost all of the cryptography in these challenges. Annoyingly, however, `bytes` objects don't support the `^` (bitwise XOR) operator, which is also fairly fundamental. As such, we have to XOR the two inputs one byte at a time:

In [6]:
enc1 = base64.b16decode(input1, casefold=True)
enc2 = base64.b16decode(input2, casefold=True)
xor_product = bytes(a ^ b for a, b in zip(enc1, enc2))
xor_hex = base64.b16encode(xor_product).decode().lower()  # Cryptopals' lowercase hex

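This works because iterating over a `bytes` object yields one integer per byte, each in the range 0 to 255 inclusive, and `^` is defined for integers. Note that `zip` stops at the end of the shorter input, so the two inputs should be the same length.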

In [7]:
assert xor_hex == correct_output
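
To make the byte-wise behaviour concrete, here is a minimal sketch that wraps the XOR in a reusable helper. The name `fixed_xor` and the explicit length check are assumptions added for illustration; the cell above relies on `zip` alone, which would silently truncate mismatched inputs.

```python
def fixed_xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings, byte by byte."""
    if len(a) != len(b):
        # zip() would silently drop the extra bytes, so fail loudly instead
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Iterating over bytes yields integers, one per byte
print(list(b"hi"))  # [104, 105]

# 0x0f ^ 0xff == 0xf0 and 0xf0 ^ 0xff == 0x0f
print(fixed_xor(b"\x0f\xf0", b"\xff\xff").hex())  # f00f
```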

## Task 3: *Single-byte XOR cipher*

Frequency analysis relies on having reliable information about the character frequencies of the plaintext language, so that we can compare the frequencies of trial decodings with the known values. Here, character frequencies obtained from scraping *The Lord Of The Rings* are used:

In [24]:
english_chars = "abcdefghijklmnopqrstuvwxyz"
frequency_dict = {c:0 for c in english_chars}
with open("lotr.txt", "r") as lotr_file:
    total = 0  # counts the number of characters added to the whole dictionary
    for line in lotr_file:
        for c in line:
            if c.lower() in english_chars:
                frequency_dict[c.lower()] += 1
                total += 1

frequency_dict = {c:frequency_dict[c]/total for c in english_chars}  # divide for normalisation
english_freqs = list(frequency_dict.values())
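
For readers without a copy of `lotr.txt`, the same counting can be sketched on any in-memory string using `collections.Counter`. The function name `letter_frequencies` and the sample text are illustrative assumptions; a sample this short is far too small to give realistic English frequencies.

```python
from collections import Counter
import string


def letter_frequencies(text: str) -> dict[str, float]:
    """Relative frequency of each letter a-z in text, ignoring case."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        # No letters at all: return zeros rather than dividing by zero
        return {c: 0.0 for c in string.ascii_lowercase}
    return {c: counts[c] / total for c in string.ascii_lowercase}


# "Hello, World!" contains 10 letters, 3 of which are 'l'
freqs = letter_frequencies("Hello, World!")
print(freqs["l"])  # 0.3
```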

The decryption then proceeds as follows:
1. Decrypt the ciphertext against (XOR the whole text with) a single byte.
2. Calculate the character frequencies for this decryption.
3. Compare these character frequencies to the known character frequencies for the English language, using `scipy.stats.chisquare`.
4. Take the best match as the correct decryption.
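
The comparison in step 3 boils down to summing (observed - expected)² / expected over the letters; a lower total means a closer match. Below is a minimal pure-Python sketch of that sum, which should agree with the statistic returned by `scipy.stats.chisquare` for the same inputs. The name `chi_squared` is illustrative, and the sketch assumes every expected frequency is non-zero.

```python
def chi_squared(observed: list[float], expected: list[float]) -> float:
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Assumes every expected value is non-zero (otherwise this divides by zero)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# A perfect match scores zero
print(chi_squared([0.25, 0.75], [0.25, 0.75]))  # 0.0

# A worse match scores higher: 0.0625/0.25 + 0.0625/0.75, roughly 0.333
print(chi_squared([0.5, 0.5], [0.25, 0.75]))
```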

In [25]:
ciphertext = base64.b16decode("1b37373331363f78151b7f2b783431333d78397828372d363c78373e783a393b3736",
                             casefold=True)

In [92]:
from itertools import chain
keys = [bytes(c, 'utf-8') for c in chain(english_chars, english_chars.upper())]  # key is a single character

In [93]:
def single_byte_XOR(ptext: bytes, key: bytes) -> bytes:
    return bytes(char ^ key[0] for char in ptext)

In [94]:
def char_freqs(text: bytes) -> list[float]:
    """Return the relative frequency of each lowercase letter a-z in text."""
    english_chars = "abcdefghijklmnopqrstuvwxyz"
    raw_counts = {char: text.count(bytes(char, 'utf-8')) for char in english_chars}
    total = sum(raw_counts.values())
    # Divide by 1 instead of 0 when the text contains no lowercase letters
    return [raw_counts[char] / (total if total != 0 else 1) for char in english_chars]

In [95]:
from math import inf  # scipy.inf was a deprecated NumPy alias; math.inf is equivalent
from scipy.stats import chisquare
lowest_chi_sq, correct_key, ptext = inf, None, None

In [97]:
for key in keys:
    decrypt = single_byte_XOR(ciphertext, key)
    obs_freqs = char_freqs(decrypt)
    if all([freq == 0 for freq in obs_freqs]):  # if we recover no English characters, just skip the key
        continue
    ch_sq = chisquare(obs_freqs, english_freqs)[0]
    if ch_sq < lowest_chi_sq:
        lowest_chi_sq, correct_key, ptext = ch_sq, key, decrypt

In [100]:
print(correct_key, ptext.decode())

b'X' Cooking MC's like a pound of bacon


Luckily, that worked first time. Frequency analysis doesn't always work so well, especially with shorter messages that contain few vowels or particularly few occurrences of the letter 'e' (see above). In fact, if we extend our key space to all possible single-byte keys, the above process will decide that `b'_'` is the correct key, with a chi-squared statistic of about 1.2 versus the 1.3 that we recover with `b'X'`, the correct key used above. In these cases, some more nuance is required; for natural-language texts it's sometimes possible to also count the space character, ' ', which always occurs with high frequency, although it's very easy to remove and retain meaning for a party that's in-t