## 0. Importing libraries

In [285]:
import numpy as np
import pandas as pd
import pickle
import warnings
import codecs
import keras

warnings.filterwarnings("ignore")

In [95]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

In [264]:
from keras.layers import Input, Embedding, Dense
from keras.layers import Dropout, LSTM, Bidirectional
from keras.models import Model, load_model

from keras import backend as K
from keras.engine.topology import Layer, InputSpec

## 1. Preparing the data

Let's read the data into a **Pandas dataframe**. Each row holds the raw question text and the label associated with it. The objective is to build a model that classifies a new piece of text into the correct category.

In [269]:
# A multi-character separator is only supported by the Python parsing engine.
df = pd.read_csv("LabelledData (1).txt", delimiter=",,, ", engine="python", header=None, names=["text", "label"])

In [132]:
df.head(5)

| | text | label |
|---|---|---|
| 0 | how did serfdom develop in and then leave russ... | unknown |
| 1 | what films featured the character popeye doyle ? | what |
| 2 | how can i find a list of celebrities ' real na... | unknown |
| 3 | what fowl grabs the spotlight after the chines... | what |
| 4 | what is the full form of .com ? | what |


Shuffling the data

In [133]:
df = df.sample(frac=1)

In [270]:
df.label.value_counts()

what           607
who            401
unknown        272
affirmation    104
when            96
 what            2
 who             1
Name: label, dtype: int64

### 1.1 Basic data cleaning

There are some categories which are equivalent but are stored differently due to an extra space. Let's correct that.

In [134]:
# Strip stray whitespace so that " what" and " who" merge with "what" and "who".
df["label"] = df["label"].str.strip()

In [135]:
df.label.value_counts()

what           609
who            402
unknown        272
affirmation    104
when            96
Name: label, dtype: int64
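
As a quick sanity check on the cleaning step, the following minimal sketch merges labels that differ only by surrounding whitespace, using the counts printed before cleaning. The helper name `normalize_label_counts` is hypothetical and the snippet uses plain Python rather than Pandas.

```python
from collections import Counter

# Label counts as printed by value_counts() before cleaning.
raw_counts = {
    "what": 607, "who": 401, "unknown": 272,
    "affirmation": 104, "when": 96, " what": 2, " who": 1,
}


def normalize_label_counts(counts):
    """Merge labels that differ only by leading or trailing whitespace."""
    merged = Counter()
    for label, count in counts.items():
        merged[label.strip()] += count
    return dict(merged)


clean_counts = normalize_label_counts(raw_counts)
print(clean_counts)
```

The merged totals should match the cleaned `value_counts()` output shown above.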

In [229]:
question_to_number_map = {"what":0, "who":1, "unknown":2, "affirmation":3, "when":4}
number_to_question_map = {v: k for k, v in question_to_number_map.items()}

In [137]:
df.label = df.label.map(question_to_number_map)

In [138]:
df.head(5)

| | text | label |
|---|---|---|
| 1421 | has anyone used this on a brick driveway ? | 3 |
| 636 | what do flatfish eat ? | 0 |
| 617 | what did shostakovich write for rostropovich ? | 0 |
| 461 | where does your hair grow the fastest ? | 2 |
| 485 | what type of bridge is the golden gate bridge ? | 0 |


In [139]:
x, y = df.text.values, df.label.values

Now the data is ready to be processed.

In [140]:
x[:5]

array(['has anyone used this on a brick driveway ? ',
       'what do flatfish eat ? ',
       'what did shostakovich write for rostropovich ? ',
       'where does your hair grow the fastest ? ',
       'what type of bridge is the golden gate bridge ? '], dtype=object)

### 1.2 GloVe vectors

We'll use the pre-trained Global Vectors (**GloVe**) model to convert each word of our input questions into a 50-dimensional vector, in which every dimension represents an abstract language feature of the word.

These embeddings are freely available online at https://www.kaggle.com/devjyotichandra/glove6b50dtxt

In [141]:
GLOVE_FILE_PATH  = "glove.6B.50d.txt"
embeddings_index = {}
with codecs.open(GLOVE_FILE_PATH, 'r', 'utf-8') as f:
    for line in f:
        # Each line holds a word followed by its space-separated coefficients.
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Total %s word vectors.' % len(embeddings_index))

Total 400000 word vectors.
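
To make the file format concrete, here is a minimal sketch of the same parsing logic run on two made-up 3-dimensional entries instead of the real 50-dimensional file. The function name `load_glove` and the sample values are illustrative assumptions, and plain lists stand in for NumPy arrays.

```python
import io


def load_glove(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    index = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        index[values[0]] = [float(v) for v in values[1:]]
    return index


# Made-up entries; the real file has 50 coefficients per word.
fake_file = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -1.5\n")
toy_index = load_glove(fake_file)
print(toy_index["cat"])
```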


In [291]:
max_features = 4000
maxlen = 20

### 1.3 Tokenizer

The tokenizer splits each sentence into individual words and takes care of basic pre-processing steps such as lowercasing and stripping punctuation.

Each word is then converted into an integer, based on a fixed mapping that the tokenizer stores as a dict (`word_index`).

In [275]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(x)
sequences = tokenizer.texts_to_sequences(x)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 3675 unique tokens.
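
The idea behind the word-to-integer mapping can be sketched in plain Python. This is a simplified stand-in and not Keras's implementation: the helper names are hypothetical, and it strips every character in `string.punctuation`, which is an assumption made for brevity.

```python
import string
from collections import Counter

# Translation table that turns every punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()


def fit_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1)}


def to_sequence(text, word_index):
    """Convert a text to integers, silently dropping unseen words."""
    return [word_index[w] for w in tokenize(text) if w in word_index]


toy_texts = ["What is this ?", "what is that"]
toy_word_index = fit_word_index(toy_texts)
print(toy_word_index)
print(to_sequence("What is new?", toy_word_index))
```

Index 0 is left unused by the mapping, which is what allows it to serve as the padding value later.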


In [277]:
data = pad_sequences(sequences, maxlen=maxlen, padding='post')
labels = to_categorical(np.asarray(y))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (1483, 20)
Shape of label tensor: (1483, 5)
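
The two shapes above come from padding every sequence to `maxlen` and one-hot encoding every label. A minimal pure-Python sketch of both steps follows. The helper names `pad_post` and `one_hot` are hypothetical, and the truncation rule (keep the last `maxlen` tokens) is assumed to mirror the default `pad_sequences` behaviour.

```python
def pad_post(sequence, maxlen, value=0):
    """Keep at most the last `maxlen` items, then pad with `value` at the end."""
    kept = sequence[-maxlen:]
    return kept + [value] * (maxlen - len(kept))


def one_hot(label, num_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


print(pad_post([5, 3, 8], 5))           # short sequence gets trailing zeros
print(pad_post([1, 2, 3, 4, 5, 6], 4))  # long sequence is cut from the front
print(one_hot(3, 5))                    # label 3 ("affirmation") as a vector
```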


In [286]:
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

The tokenizer is also pickled so that it doesn't need to be re-created later.

### 1.4 Creating training and validation data

Since we have a very small dataset, we use as much of it as possible for training (90%) and keep the remaining 10% for validation. Once satisfied, we'll test on a random selection of questions from http://cogcomp.org/Data/QA/QC/train_1000.label

In [278]:
train_split = 0.9
# Number of samples in the training portion; the remainder is used for validation.
nb_train_samples = int(train_split * data.shape[0])
x_train = data[:nb_train_samples]
y_train = labels[:nb_train_samples]
x_val = data[nb_train_samples:]
y_val = labels[nb_train_samples:]

print('Training and validation set counts for each question type')
print(list(question_to_number_map.keys()))
print(y_train.sum(axis=0))
print(y_val.sum(axis=0))

Training and validation set counts for each question type
['what', 'who', 'unknown', 'affirmation', 'when']
[553. 357. 253.  95.  76.]
[56. 45. 19.  9. 20.]


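The class proportions are broadly similar in the training and validation sets. The split is not stratified, however, so smaller classes such as `when` vary more between the two.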
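
To compare the two sets more precisely, the counts printed above can be turned into per-class shares. This is a small illustrative calculation; the helper name `shares` is hypothetical.

```python
class_names = ["what", "who", "unknown", "affirmation", "when"]
train_counts = [553, 357, 253, 95, 76]   # from y_train.sum(axis=0)
val_counts = [56, 45, 19, 9, 20]         # from y_val.sum(axis=0)


def shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts)
    return [count / total for count in counts]


train_shares = shares(train_counts)
val_shares = shares(val_counts)
for name, tr, va in zip(class_names, train_shares, val_shares):
    print(f"{name:<12} train {tr:6.1%}   val {va:6.1%}")
```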

## 2. Training the model


### 2.1 Initializing the Embedding layer
We initialize the embedding layer with the GloVe embeddings

In [279]:
embedding_dim = 50

In [184]:
# Start from random values; index 0 is never assigned to a word (it is the padding value).
embedding_matrix = np.random.random((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # Words missing from the GloVe index keep their random initialization.
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        
embedding_layer = Embedding(len(word_index) + 1,
                            embedding_dim,
                            weights=[embedding_matrix],
                            input_length=maxlen,
                            trainable=True)

print('Word Embedding Layer Initialized!')

Word Embedding Layer Initialized!
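
The initialization logic can be checked on a toy vocabulary. In this sketch, the 3-dimensional vectors and the out-of-vocabulary word `zorblax` are made up for illustration. It shows that known words receive their pre-trained vector while unknown words and the padding row keep their random values.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_word_index = {"what": 1, "is": 2, "zorblax": 3}  # "zorblax" has no GloVe vector
toy_glove = {
    "what": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "is": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
dim = 3

# One row per word index, plus row 0 for padding.
toy_matrix = rng.random((len(toy_word_index) + 1, dim))
initial = toy_matrix.copy()

for word, i in toy_word_index.items():
    vector = toy_glove.get(word)
    if vector is not None:
        toy_matrix[i] = vector

print(toy_matrix)
```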


### 2.2 Defining the network architecture

We're using a bidirectional LSTM architecture. The input first passes through the Embedding layer, which converts the word indices into meaningful language representations. The Embedding layer is followed by a Bidirectional LSTM layer and then, in sequence, by Dropout, Dense, and Dropout layers. The final Dense layer outputs 5 values, corresponding to the 5 types of questions we want to differentiate between.

In [194]:
print('Build model...')
sequence_input = Input(shape=(maxlen,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
dropout = Dropout(0.5)(l_lstm)
dense2 = Dense(100)(dropout)
dropout2 = Dropout(0.5)(dense2)
preds = Dense(5, activation='softmax')(dropout2)
model = Model(sequence_input, preds)
print("model fitting - Bidirectional LSTM")
model.summary()

Build model...
model fitting - Bidirectional LSTM
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_8 (InputLayer)         (None, 20)                0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 20, 50)            183800    
_________________________________________________________________
bidirectional_8 (Bidirection (None, 200)               120800    
_________________________________________________________________
dropout_8 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 100)               20100     
_________________________________________________________________
dropout_9 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_12 (Dense)          
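
The parameter counts in the summary can be reproduced by hand from the layer definitions. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. The count for the final `Dense(5)` layer is derived from its definition and is an inference, since that row is cut off in the output above.

```python
vocab_rows = 3675 + 1      # unique tokens plus the padding index
embedding_dim = 50
lstm_units = 100

# Embedding: one vector per vocabulary row.
embedding_params = vocab_rows * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * one_direction

# Dense(100) on the 200-dimensional concatenated LSTM output.
dense_params = (2 * lstm_units) * 100 + 100

# Dense(5) output layer (inferred; not visible in the truncated summary).
output_params = 100 * 5 + 5

print(embedding_params, bilstm_params, dense_params, output_params)
```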

### 2.3 Training the model

The loss function and the evaluation metric are defined next. The **Adam** optimizer is used with its default learning rate of 0.001 to update the model weights. The model is trained for 10 epochs with a batch size of 32.

In [195]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])
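
For intuition about the loss, here is a minimal sketch of categorical cross-entropy for a single example. It is a simplified illustration rather than the Keras implementation, and both probability vectors are made up.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    # Clip to avoid log(0) for extreme predictions.
    clipped = [min(max(p, eps), 1.0 - eps) for p in y_pred]
    return -sum(t * math.log(p) for t, p in zip(y_true, clipped))


y_true = [0.0, 0.0, 0.0, 0.0, 1.0]            # one-hot target for "when"
confident = [0.02, 0.02, 0.03, 0.03, 0.90]    # made-up softmax output
unsure = [0.40, 0.10, 0.10, 0.10, 0.30]       # made-up softmax output

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The loss shrinks as the probability assigned to the correct class approaches 1, which is what training pushes the model towards.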

In [193]:
batch_size = 32
epoch_num = 10
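
These two settings determine how many weight updates the optimizer performs. A small worked calculation follows, assuming the 1334 training samples reported in the training log.

```python
import math

train_samples = 1334   # size of the training set
batch_size = 32
epochs = 10

# The last batch of each epoch is smaller when the sizes do not divide evenly.
steps_per_epoch = math.ceil(train_samples / batch_size)
last_batch_size = train_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)
```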

In [196]:
print('Train...')
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=epoch_num, batch_size=batch_size)
score, acc = model.evaluate(x_val, y_val,
                            batch_size=batch_size)

Train...
Train on 1334 samples, validate on 149 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


The model achieves a validation accuracy of **95.97%** on the 149 held-out samples. The model weights and architecture are saved to a file so that the model can be used directly later, without re-training.
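
With only 149 validation samples, each individual prediction moves the accuracy by roughly two thirds of a percentage point. The short calculation below works out how many correct predictions the reported figure implies. The count is inferred from the accuracy and is not reported in the training log.

```python
val_samples = 149
reported_accuracy = 0.9597

# Nearest whole number of correct predictions consistent with the reported value.
implied_correct = round(reported_accuracy * val_samples)
recomputed = implied_correct / val_samples
implied_errors = val_samples - implied_correct

print(implied_correct, implied_errors, f"{recomputed:.2%}")
```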

In [251]:
model.save("identity_predict.h5")

## 3. Using the trained model
The saved file contains

- the model architecture
- the model weights

and can be loaded easily for prediction.

In [265]:
model = load_model("identity_predict.h5")

Loading the tokenizer that was pickled during the processing stage.

In [287]:
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

In [283]:
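def evaluate_model(question):
    """Return the predicted question category for a single question string."""
    x_seq = pad_sequences(tokenizer.texts_to_sequences([question]),
                          maxlen=maxlen, padding='post')
    # The model outputs one probability per class; pick the most likely one.
    pred = np.argmax(model.predict(x_seq))

    return number_to_question_map[pred]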
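
The last step of the function, turning the model's output into a label, can be illustrated in isolation. In this sketch the probability vector is made up and `decode_prediction` is a hypothetical helper that also returns the confidence of the chosen class.

```python
number_to_question_map = {0: "what", 1: "who", 2: "unknown", 3: "affirmation", 4: "when"}


def decode_prediction(probabilities):
    """Return the most likely label and its probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return number_to_question_map[best], probabilities[best]


# Made-up softmax output for a single question.
label, confidence = decode_prediction([0.10, 0.05, 0.05, 0.10, 0.70])
print(label, confidence)
```

Returning the probability alongside the label makes it possible to flag low-confidence predictions instead of trusting every answer equally.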

In [288]:
evaluate_model("What is your name?")

'what'

In [289]:
evaluate_model("Is there a cab available for airport?")

'affirmation'

In [290]:
evaluate_model("Where do I get good Lebanese food?")

'unknown'

In [284]:
evaluate_model("What is the busiest air travel season ?")

'what'

In [262]:
evaluate_model("What time does the train leave?")

'when'

The model has learned that the above question belongs to the `when` category even though it starts with "what".