In [0]:
!pip install -U -q kaggle
!mkdir -p ~/.kaggle

In [0]:
from google.colab import files
from IPython.display import clear_output
files.upload()
clear_output()

In [0]:
!chmod 600 kaggle.json

In [0]:
!cp kaggle.json ~/.kaggle/

In [5]:
!kaggle competitions download -c 20-newsgroups-ciphertext-challenge

Downloading sample_submission.csv.zip to /content
  0% 0.00/556k [00:00<?, ?B/s]
100% 556k/556k [00:00<00:00, 75.6MB/s]
Downloading test.csv.zip to /content
 56% 9.00M/16.2M [00:00<00:00, 27.2MB/s]
100% 16.2M/16.2M [00:00<00:00, 41.0MB/s]
Downloading train.csv.zip to /content
  0% 0.00/6.98M [00:00<?, ?B/s]
100% 6.98M/6.98M [00:00<00:00, 63.5MB/s]


In [6]:
!unzip train.csv.zip
!unzip test.csv.zip

Archive:  train.csv.zip
  inflating: train.csv               
Archive:  test.csv.zip
  inflating: test.csv                


# Objective
Multiclass classification of ciphered texts.

# Data
Source: https://www.kaggle.com/c/20-newsgroups-ciphertext-challenge/ <br>
and http://qwone.com/~jason/20Newsgroups/ <br>
The competition is based on 20 Newsgroups, a well-known dataset often used as a sample set for multiclass classification and NLP. The original dataset contains $\approx$ 20 000 documents, partitioned (nearly) evenly across 20 different newsgroups. The newsgroups (grouped here by subject matter) are: <br>

*   comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
*   rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
*   sci.crypt, sci.electronics, sci.med, sci.space
*   talk.politics.misc, talk.politics.guns, talk.politics.mideast
*   talk.religion.misc, alt.atheism, soc.religion.christian
*   misc.forsale

For the competition, the dataset was encrypted with up to 4 classic ciphers. Every document in the dataset was broken into sequential 300-character chunks, and all chunks of a document were then encrypted according to the document's difficulty level. A difficulty of 1 means that only cipher #1 was used. A difficulty of 2 means cipher #1 was applied, followed by cipher #2, and so on. The difficulty level therefore denotes exactly which ciphers were applied, and in what order.
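
The chunking and layering described above can be sketched in a few lines. This is only an illustration: the two toy ciphers (a character shift and a reversal) and all function names are invented stand-ins, since the competition's real ciphers are not given here.

```python
# Hypothetical sketch of the chunk-then-encrypt scheme; the toy ciphers
# below are invented stand-ins, not the competition's real ciphers.

CHUNK_SIZE = 300  # chunk length stated in the competition description


def split_into_chunks(document, size=CHUNK_SIZE):
    """Break a document into sequential chunks of at most `size` characters."""
    return [document[i:i + size] for i in range(0, len(document), size)]


def toy_cipher_1(text):
    """Stand-in cipher: shift every character code by one."""
    return ''.join(chr(ord(ch) + 1) for ch in text)


def toy_cipher_2(text):
    """Stand-in cipher: reverse the text."""
    return text[::-1]


TOY_CIPHERS = [toy_cipher_1, toy_cipher_2]


def encrypt(chunk, difficulty):
    """Apply ciphers #1..#difficulty in order."""
    for cipher_fn in TOY_CIPHERS[:difficulty]:
        chunk = cipher_fn(chunk)
    return chunk


chunks = split_into_chunks('x' * 650)
print([len(c) for c in chunks])  # lengths of the sequential chunks
print(encrypt('abc', 1))         # toy cipher #1 only
print(encrypt('abc', 2))         # toy cipher #1, then toy cipher #2
```

Because the ciphers are stacked, undoing difficulty 2 means undoing cipher #2 first and then cipher #1, unless a direct mapping from difficulty-2 text to plaintext is available.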

In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score

In [8]:
df = pd.read_csv("train.csv")
print("Number of rows in the dataset : ", df.shape[0])

Number of rows in the dataset :  39052


In [9]:
df.head()

Unnamed: 0,Id,difficulty,ciphertext,target
0,ID_88b9bbd73,4,"ob|IK?zzhX*L{83B3Z,FuL*Pusm$83L\t@r$$*38,8s...",10
1,ID_f489bd59f,1,c1|FaAO120O'8ovfoy1W#atvGs1[1s1[1/1]O-a8o1-...,13
2,ID_f90fee9c7,2,1*e4N8$f$0ccOuihkHek$k*V*hoeV$Hj8VhH8...,19
3,ID_8303ced65,1,O8v^10O#to1'#^'^tv1^]s111t01Otaq>-ata_1...,17
4,ID_72abc2cb7,2,eV}H}khfe4b8'S.Vc}{A .#VikV.fV?{$f7$Hjb8...,0


In [0]:
df['plaintext'] = '' #adding a column for the decrypted text

Now here is the trick: while looking through kernels for this dataset on Kaggle, I found that a user named Flal had already cracked cipher #1 and cipher #2 (both simple [substitution ciphers](https://en.wikipedia.org/wiki/Substitution_cipher), decipherable by manually mapping symbol frequencies). Cipher #3 (and therefore cipher #4, which is applied on top of it) could not be deciphered this way.<br>
Here is Flal's notebook with full solutions for cipher #1 and cipher #2: <br>
https://www.kaggle.com/leflal/cipher-1-cipher-2-full-solutions <br>
I decided to use this solution to simplify the problem.
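
To make the frequency-mapping idea concrete, here is a minimal sketch on toy data. It is not Flal's actual procedure: it simply pairs cipher symbols and plain symbols that share the same frequency rank, and the function names `rank_symbols` and `guess_mapping` are made up for this example.

```python
from collections import Counter


def rank_symbols(text):
    """Return the symbols of `text` ordered from most to least frequent."""
    return [symbol for symbol, _ in Counter(text).most_common()]


def guess_mapping(ciphertext, reference_plaintext):
    """Pair cipher and plain symbols that share the same frequency rank."""
    return dict(zip(rank_symbols(ciphertext), rank_symbols(reference_plaintext)))


# Toy data: the hidden substitution is 'a' -> '1', 'b' -> '2', 'c' -> '3'.
reference = 'aaaabbc'    # plaintext with a known symbol distribution
ciphertext = '1211213'   # different text, same underlying distribution
mapping = guess_mapping(ciphertext, reference)
decoded = ''.join(mapping[ch] for ch in ciphertext)
print(mapping)
print(decoded)
```

On real text many symbols have nearly equal frequencies, so a rank-based guess like this usually needs manual correction, which is presumably why the mapping was finished by hand.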

In [11]:
!wget https://www.dropbox.com/s/mtmfzdkgt0oavrm/cipher1_map.csv?dl=0 -O cipher1_map.csv

--2019-06-14 02:29:28--  https://www.dropbox.com/s/mtmfzdkgt0oavrm/cipher1_map.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.9.1, 2620:100:601f:1::a27d:901
Connecting to www.dropbox.com (www.dropbox.com)|162.125.9.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/mtmfzdkgt0oavrm/cipher1_map.csv [following]
--2019-06-14 02:29:28--  https://www.dropbox.com/s/raw/mtmfzdkgt0oavrm/cipher1_map.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uccdaafa85f3e0b422e142fbfe00.dl.dropboxusercontent.com/cd/0/inline/Aixd5uqysDLqkQEBQ0Ud99zu6AqXPmhIcXpeXz_QfgRSClKr0rNEx06MnUvK_xZH_QJYfe8fWcOa0KFkDxr55jlady5Djy5wlGsu6ieNgW7kEg/file# [following]
--2019-06-14 02:29:28--  https://uccdaafa85f3e0b422e142fbfe00.dl.dropboxusercontent.com/cd/0/inline/Aixd5uqysDLqkQEBQ0Ud99zu6AqXPmhIcXpeXz_QfgRSClKr0rNEx06MnUvK_xZH_QJYfe8fWcOa0KFkDxr55jlady5Djy5wlGsu6ieNgW7kEg/file
Resol

In [12]:
!wget https://www.dropbox.com/s/o83vocsvtve9q0g/cipher2_map.csv?dl=0 -O cipher2_map.csv

--2019-06-14 02:29:31--  https://www.dropbox.com/s/o83vocsvtve9q0g/cipher2_map.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.9.1, 2620:100:601f:1::a27d:901
Connecting to www.dropbox.com (www.dropbox.com)|162.125.9.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/o83vocsvtve9q0g/cipher2_map.csv [following]
--2019-06-14 02:29:31--  https://www.dropbox.com/s/raw/o83vocsvtve9q0g/cipher2_map.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc568eed8395957b1c32adbda84d.dl.dropboxusercontent.com/cd/0/inline/AizL0OdIkMPMz7mYERo1vU-sn7AMgkwJlmGOfHkmF2C8khwvoLV7Q79pwyiDKgaU13qDfVpgf_PUGW7C9C0JdCZsEgPET34jwD_Qg5MXYnOl9g/file# [following]
--2019-06-14 02:29:31--  https://uc568eed8395957b1c32adbda84d.dl.dropboxusercontent.com/cd/0/inline/AizL0OdIkMPMz7mYERo1vU-sn7AMgkwJlmGOfHkmF2C8khwvoLV7Q79pwyiDKgaU13qDfVpgf_PUGW7C9C0JdCZsEgPET34jwD_Qg5MXYnOl9g/file
Resol

In [0]:
cipher_1_map = pd.read_csv('cipher1_map.csv')
cipher_2_map = pd.read_csv('cipher2_map.csv')
#Note: the second mapping translates difficulty-2 ciphertext straight to plain text, skipping the cipher #1 stage.

#cipher[level] maps every ciphertext symbol of that difficulty level to its plaintext symbol
cipher = {
    1: dict(zip(cipher_1_map['cipher'], cipher_1_map['plain'])),
    2: dict(zip(cipher_2_map['cipher'], cipher_2_map['plain'])),
}

In [14]:
cipher[1]

{'\x02': 'n',
 '\x03': 'b',
 '\x06': '7',
 '\x08': '.',
 '\t': '5',
 '\x0c': '%',
 '\x10': 'Y',
 '\x18': 'V',
 '\x1a': 'q',
 '\x1b': 'e',
 '\x1c': '*',
 '\x1e': 'R',
 ' ': '<',
 '!': 'X',
 '"': 'G',
 '#': 'r',
 '$': '\x02',
 '%': 'O',
 '&': '?',
 "'": 'p',
 '(': '^',
 ')': '_',
 '*': 'F',
 '+': 'x',
 ',': '4',
 '-': 'm',
 '.': '~',
 '/': 'A',
 '0': 'h',
 '1': ' ',
 '2': 'C',
 '3': ';',
 '4': '|',
 '5': '6',
 '6': '"',
 '8': 'l',
 '9': 'P',
 ':': 'I',
 ';': "'",
 '<': '9',
 '=': '\x10',
 '>': 'S',
 '?': '/',
 '@': 'U',
 'A': 'd',
 'B': '$',
 'D': '\x0c',
 'E': 'E',
 'F': 'L',
 'G': ':',
 'H': '2',
 'I': '}',
 'J': 'B',
 'K': 'Z',
 'L': '1',
 'O': 'a',
 'P': '!',
 'Q': '\t',
 'S': '\x08',
 'T': ',',
 'U': '[',
 'V': 'Q',
 'W': 'w',
 'X': 'j',
 'Y': '`',
 'Z': '=',
 '[': '>',
 '\\': '0',
 ']': 'f',
 '^': 'o',
 '_': 'c',
 '`': '#',
 'a': 'i',
 'b': '@',
 'c': 'u',
 'd': 'g',
 'e': 'M',
 'f': 'k',
 'g': 'H',
 'h': 'z',
 'i': ']',
 'k': '{',
 'l': 'W',
 'm': '\\',
 'n': '8',
 'o': 'y',
 'p':

In [15]:
cipher[2]

{'\x02': ')',
 '\x03': 'v',
 '\x06': 'T',
 '\x08': '(',
 '\t': 'J',
 '\n': '+',
 '\x0c': 'N',
 '\x10': 'n',
 '\x18': 'b',
 '\x19': '7',
 '\x1a': '.',
 '\x1b': '5',
 '\x1c': '\x19',
 '\x1e': '%',
 ' ': 'Y',
 '!': 'V',
 '"': '\x1b',
 '#': 'q',
 '$': 'e',
 '%': '*',
 '&': 'R',
 "'": '<',
 '(': 'X',
 ')': 'G',
 '*': 'r',
 '+': '\x02',
 ',': 'O',
 '-': '?',
 '.': 'p',
 '/': '^',
 '0': '_',
 '1': 'F',
 '2': 'x',
 '3': '4',
 '4': 'm',
 '5': '~',
 '6': 'A',
 '7': 'h',
 '8': ' ',
 '9': 'C',
 ':': ';',
 ';': '|',
 '<': '6',
 '=': '"',
 '>': '\x03',
 '?': 'l',
 '@': 'P',
 'A': 'I',
 'B': "'",
 'C': '9',
 'E': 'S',
 'F': '/',
 'G': 'U',
 'H': 'd',
 'I': '$',
 'J': '\x1a',
 'K': '\x0c',
 'L': 'E',
 'M': 'L',
 'N': ':',
 'O': '2',
 'P': '}',
 'Q': 'B',
 'R': 'Z',
 'S': '1',
 'T': '\x7f',
 'V': 'a',
 'W': '!',
 'X': '\t',
 'Y': '\x06',
 'Z': '\x08',
 '[': ',',
 '\\': '[',
 ']': 'Q',
 '^': 'w',
 '_': 'j',
 '`': '`',
 'a': '=',
 'b': '>',
 'c': '0',
 'd': 'f',
 'e': 'o',
 'f': 'c',
 'g': '#',
 'h': 'i'

In [0]:
#decrypting texts with difficulty levels 1 and 2

def decipher(text, cipher):
    #substitute every symbol using the given ciphertext -> plaintext mapping
    return ''.join(cipher[ch] for ch in text)

for level in (1, 2):
    mask = df['difficulty'] == level
    decrypted = df.loc[mask, 'ciphertext'].apply(lambda text: decipher(text, cipher[level]))
    #.loc avoids chained assignment (df[col][i] = ...), which pandas may apply to a copy
    df.loc[mask, 'plaintext'] = decrypted
    df.loc[mask, 'ciphertext'] = decrypted

print(f"Decrypted {(df['difficulty'] <= 2).sum()} texts.")

In [17]:
df.head() #it works!

Unnamed: 0,Id,difficulty,ciphertext,target,plaintext
0,ID_88b9bbd73,4,"ob|IK?zzhX*L{83B3Z,FuL*Pusm$83L\t@r$$*38,8s...",10,
1,ID_f489bd59f,1,u (Lida Chaplynsky) writes:\n > \n > A family ...,13,u (Lida Chaplynsky) writes:\n > \n > A family ...
2,ID_f90fee9c7,2,From: ece_0028@bigdog.engr.arizona.edu (David ...,19,From: ece_0028@bigdog.engr.arizona.edu (David ...
3,ID_8303ced65,1,also hearty proponents of\n the anti-Semitic...,17,also hearty proponents of\n the anti-Semitic...
4,ID_72abc2cb7,2,o.asd.sgi.com> <1pa0stINNpqa@gap.caltech.edu> ...,0,o.asd.sgi.com> <1pa0stINNpqa@gap.caltech.edu> ...


In [19]:
df['ciphertext'][1]

'u (Lida Chaplynsky) writes:\n > \n > A family member of mine is suffering from a severe depression brought on\n > by menopause as well as a mental break down.  She is being treated with\n > Halydol with some success but the treatments being provided through her\n > psychiatrist are not satisfactory.  Som'

In [20]:
df['ciphertext'][2]

'From: ece_0028@bigdog.engr.arizona.edu (David Anderson)\n Subject: Re: Christian Owned Organization list\n Organization: University of Arizona\n Lines: 19\n \n In article <?a$@byu.edu> $stephan@sasb.byu.edu (Stephan Fassmann) writes:\n >In article <1993Apr13.025426.22532@mnemosyne.cs.du.edu> kcochran@nyx.'

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df['ciphertext'], df['target'])

In [0]:
#I tried to use CountVectorizer, but for some reason it was REALLY SLOW.
vectorizer = TfidfVectorizer(analyzer='char_wb', lowercase=False, ngram_range=(1, 6))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
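
To get a feel for the features that `analyzer='char_wb'` with an n-gram range produces, here is a simplified pure-Python approximation. It is a sketch of the idea (character n-grams taken inside whitespace-padded words), not scikit-learn's implementation, and the helper name `char_wb_ngrams` is made up; edge cases such as very short words may be handled differently by the library.

```python
def char_wb_ngrams(text, min_n=1, max_n=6):
    """Approximate character n-grams restricted to word boundaries."""
    ngrams = []
    for word in text.split():
        padded = ' ' + word + ' '  # padding lets n-grams mark word edges
        for n in range(min_n, max_n + 1):
            if n > len(padded):
                break  # word too short for this n and any larger n
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


features = char_wb_ngrams('ab cd', min_n=1, max_n=2)
print(features)
```

No n-gram spans two words, so the features stay local to each whitespace-separated token. Keeping `lowercase=False` matters here because in the still-encrypted texts upper- and lowercase symbols are distinct cipher symbols.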

In [0]:
clf = LinearSVC()
clf.fit(X_train_vec, y_train)

In [0]:
y_pred = clf.predict(X_test_vec)

In [25]:
accuracy_score(y_test, y_pred) #scikit-learn expects (y_true, y_pred)

0.5270920823517361

In [26]:
f1_score(y_test, y_pred, average='macro') #scikit-learn expects (y_true, y_pred)

0.5077063332358638
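
The macro-averaged F1 is a little lower than the accuracy. One plausible reason is that macro averaging gives every class the same weight, so classes that are predicted poorly pull the score down. The toy sketch below computes both metrics by hand to show the effect; the labels are invented and `macro_f1` is a hypothetical helper, not the scikit-learn function.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented toy labels: class 1 is rare and only half of it is recovered.
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1]
accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(accuracy_toy, macro_f1(y_true_toy, y_pred_toy))
```

Here the accuracy is 5/6 while the macro F1 is 7/9, because the weaker class counts as much as the dominant one.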