## Gibberish Text Classification

This was a hackathon competition. The true labels for the test data were not provided; test accuracy was scored on the organizers' platform. As a result, only the training accuracy can be checked here (there is no validation set).

Learning context from individual characters with a CNN turned out to be a suitable approach here (it worked better than using n-grams).

Training takes time; a GPU was used to run it.

### Import necessary libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.layers import Input, Embedding, Activation, Flatten, Dense
from keras.layers import Conv1D, MaxPooling1D, Dropout
from keras.models import Model
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


### Read the data

In [2]:
data_source = r"C:\Users\....\trainData.csv"

df = pd.read_csv(data_source, header=None)
df = df.iloc[1:]
df = df.reset_index(drop=True)

In [3]:
df.head()

Unnamed: 0,0,1
0,sk vienuh tw lep mamqvuh mvlekrp m qvqvmvqjuht...,5
1,g zmva mqvbhtwi wenamdfu hezletwgzqvuhtwamuluh...,7
2,lrvimv u cypmvu luhtwa mguuhraqv twmkiwtwq vez...,11
3,xepmuhul m vezezkrqvuhtwamuluhtwmkpmtvenuhtwyp...,6
4,qgqv lrvimvleulqvuhtwamuluhsatv uhlrvikramuhul...,3


### Unique Labels

In [6]:
df[1].unique()

array(['5', '7', '11', '6', '3', '10', '1', '8', '9', '4', '2', '0'],
      dtype=object)

### Preprocessing

In [7]:
train_texts = df[0].values

# Tokenizer
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(train_texts)


alphabet = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1

# Use char_dict to replace the tk.word_index
tk.word_index = char_dict.copy()
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1

# Convert string to index
train_sequences = tk.texts_to_sequences(train_texts)

# Padding
train_data = pad_sequences(train_sequences, maxlen=500, padding='post')

train_data = np.array(train_data, dtype='float32')

train_classes = df[1].values
# Labels were read from the CSV as strings ('5', '7', ...); cast them to int
train_class_list = [int(x) for x in train_classes]


from keras.utils import to_categorical

train_classes = to_categorical(train_class_list)

In [8]:
#train_data
len(train_sequences)

30000
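To make the preprocessing step concrete, here is a minimal NumPy-only sketch of what the `Tokenizer` and `pad_sequences` calls above are expected to produce. It assumes the Keras defaults (lower-casing of the text, and truncation from the front when a sequence is longer than `maxlen`); the helper names `encode_text` and `encode_batch` are hypothetical and not part of the notebook.

```python
import numpy as np

# Same alphabet as in the notebook; index 0 is reserved for padding.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNK_INDEX = len(ALPHABET) + 1  # slot used for any character outside the alphabet


def encode_text(text, maxlen):
    """Map characters to indices, then truncate/pad to exactly maxlen."""
    ids = [CHAR_TO_INDEX.get(ch, UNK_INDEX) for ch in text.lower()]
    ids = ids[-maxlen:]                      # assumed default: drop from the front
    return ids + [0] * (maxlen - len(ids))   # padding='post': zeros on the right


def encode_batch(texts, maxlen=500):
    """Encode a list of strings into an integer matrix of shape (n, maxlen)."""
    return np.array([encode_text(t, maxlen) for t in texts], dtype="int64")
```

Note that a space is not in the alphabet, so under these assumptions it is mapped to the `UNK` slot rather than dropped.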

### Size of Vocabulary

In [9]:
vocab_size = len(tk.word_index)
vocab_size

69
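The value 69 follows from the alphabet: 68 characters plus one `UNK` slot. The sketch below checks that count and builds the same kind of one-hot matrix that the model cell later passes to the `Embedding` layer (a zero row for the padding index, then one one-hot row per index). It is an equivalent construction using `np.eye`, not the notebook's own loop.

```python
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
vocab_size = len(alphabet) + 1  # 68 characters + 1 UNK slot

# Row 0 is all zeros (padding); row i is the one-hot vector for index i.
embedding_weights = np.vstack([np.zeros((1, vocab_size)), np.eye(vocab_size)])

print(vocab_size, embedding_weights.shape)
```

With one-hot initial weights, the embedding output at the start of training is simply the one-hot encoding of each character, which is why the embedding size equals the vocabulary size.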

### CNN Model

In [10]:
input_size = 500
vocab_size = len(tk.word_index)
embedding_size = 69
conv_layers = [[256, 7, 3],
               [256, 3, -1],
               [256, 3, -1],
               [256, 3, 3]]

fully_connected_layers = [1024, 1024]
num_of_classes = 12
dropout_p = 0.5
optimizer = 'adam'
loss = 'categorical_crossentropy'

# Embedding weights
embedding_weights = []  
embedding_weights.append(np.zeros(vocab_size))  

for char, i in tk.word_index.items(): 
    onehot = np.zeros(vocab_size)
    onehot[i - 1] = 1
    embedding_weights.append(onehot)

embedding_weights = np.array(embedding_weights)
print('Load')


# Embedding layer Initialization
embedding_layer = Embedding(vocab_size + 1,
                            embedding_size,
                            input_length=input_size,
                            weights=[embedding_weights])


# Model Construction
# Input
inputs = Input(shape=(input_size,), name='input', dtype='int64')  

# Embedding
x = embedding_layer(inputs)

# Conv
for filter_num, filter_size, pooling_size in conv_layers:
    x = Conv1D(filter_num, filter_size)(x)
    x = Activation('relu')(x)
    if pooling_size != -1:
        x = MaxPooling1D(pool_size=pooling_size)(x) 
x = Flatten()(x) 

# Fully connected layers
for dense_size in fully_connected_layers:
    x = Dense(dense_size, activation='relu')(x)  
    x = Dropout(dropout_p)(x)

# Output Layer
predictions = Dense(num_of_classes, activation='softmax')(x)

# Build model
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])  
model.summary()

# Shuffle
indices = np.arange(train_data.shape[0])
np.random.shuffle(indices)

x_train = train_data[indices]
y_train = train_classes[indices]


Load







Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 500)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 500, 69)           4830      
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 494, 256)          123904    
_________________________________________________________________
activation_1 (Activation)    (None, 494, 256)          0         
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 164, 256)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 162, 256)          196864   
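The output lengths and parameter counts in the summary above can be checked by hand. The sketch below assumes what the model cell uses implicitly: `Conv1D` with `'valid'` padding and stride 1, and `MaxPooling1D` with stride equal to the pool size. The function name `conv_stack_shapes` is hypothetical.

```python
def conv_stack_shapes(input_len, in_channels, conv_layers):
    """Return (kind, length, channels, params) for each conv/pool layer."""
    rows = []
    length, channels = input_len, in_channels
    for filters, kernel, pool in conv_layers:
        # Weights: kernel * in_channels * filters, plus one bias per filter
        params = kernel * channels * filters + filters
        length = length - kernel + 1      # 'valid' convolution, stride 1
        channels = filters
        rows.append(("conv", length, channels, params))
        if pool != -1:
            length = length // pool       # non-overlapping pooling, floored
            rows.append(("pool", length, channels, 0))
    return rows


conv_layers = [[256, 7, 3], [256, 3, -1], [256, 3, -1], [256, 3, 3]]
for row in conv_stack_shapes(500, 69, conv_layers):
    print(row)
```

The first three rows reproduce the `conv1d_1`, `max_pooling1d_1`, and `conv1d_2` lines of the summary; the embedding layer's 4830 parameters are the 70 x 69 weight matrix.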

### Training

In [11]:
# Training
model.fit(x_train, y_train,
          batch_size=256,
          epochs=50,
          verbose=2)

Epoch 1/50
 - 16s - loss: 2.3395 - acc: 0.1568
Epoch 2/50
 - 12s - loss: 2.0774 - acc: 0.2516
Epoch 3/50
 - 12s - loss: 1.7736 - acc: 0.3760
Epoch 4/50
 - 12s - loss: 1.4074 - acc: 0.5012
Epoch 5/50
 - 12s - loss: 1.1644 - acc: 0.5865
Epoch 6/50
 - 12s - loss: 1.0211 - acc: 0.6386
Epoch 7/50
 - 12s - loss: 0.8794 - acc: 0.6905
Epoch 8/50
 - 12s - loss: 0.7496 - acc: 0.7415
Epoch 9/50
 - 12s - loss: 0.6213 - acc: 0.7838
Epoch 10/50
 - 12s - loss: 0.5133 - acc: 0.8237
Epoch 11/50
 - 12s - loss: 0.3935 - acc: 0.8678
Epoch 12/50
 - 12s - loss: 0.3146 - acc: 0.8942
Epoch 13/50
 - 12s - loss: 0.2488 - acc: 0.9167
Epoch 14/50
 - 12s - loss: 0.1786 - acc: 0.9409
Epoch 15/50
 - 12s - loss: 0.1530 - acc: 0.9484
Epoch 16/50
 - 12s - loss: 0.1262 - acc: 0.9596
Epoch 17/50
 - 12s - loss: 0.1166 - acc: 0.9641
Epoch 18/50
 - 12s - loss: 0.1114 - acc: 0.9639
Epoch 19/50
 - 12s - loss: 0.1214 - acc: 0.9614
Epo

<keras.callbacks.History at 0x21a7aa272b0>
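The log above reports training accuracy only, so it cannot show whether the model overfits. One option, not used in this notebook, is to hold out part of the training data. The sketch below is a NumPy-only illustration; `holdout_split`, the 10% fraction, and the seed are all assumptions chosen for the example.

```python
import numpy as np


def holdout_split(x, y, val_fraction=0.1, seed=0):
    """Shuffle once, then split into training and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_val = int(len(x) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]
```

The held-out arrays could then be passed to `model.fit` through its `validation_data` argument, so that a validation loss and accuracy are reported after each epoch.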

### Prediction

### Read Test data

In [12]:
source=r"C:\Users\....\testdata\testData.csv"

test_df = pd.read_csv(source, header=None)
test_df = test_df.iloc[1:]
test_df = test_df.reset_index(drop=True)

In [13]:
test_texts = test_df[0].values

In [14]:
test_texts = tk.texts_to_sequences(test_texts)


In [15]:
test_data = pad_sequences(test_texts, maxlen=500, padding='post')

In [16]:
x_test = np.array(test_data, dtype='float32')

### Predict

In [18]:
out=model.predict(x_test)

### Save the output

In [19]:
# Write one predicted class index per line
with open(r"C:\Users\....\rohank_output.txt", "w") as f:
    for row in out:
        f.write("%d\n" % np.argmax(row))
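The per-row loop above can also be written as a single vectorized step. This is a sketch of an alternative, not the notebook's code: `write_predictions` is a hypothetical helper, and it assumes the predictions are a 2-D array of class probabilities with one row per sample.

```python
import numpy as np


def write_predictions(probabilities, path):
    """Pick the most probable class per row and write one label per line."""
    labels = np.argmax(probabilities, axis=1)
    np.savetxt(path, labels, fmt="%d")
    return labels
```

For example, `write_predictions(out, "predictions.txt")` would produce the same one-label-per-line layout as the loop, where the file name is a placeholder.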