# Intent Classification

### Data Preprocessing

In [1]:
import numpy as np
import json
import os

The raw data is in JSON and contains extra information about the segmentation of the commands. The raw files are processed and written to the processed data folder.

In [2]:
rdata_path = './raw_data'
data_path = './processed_data'

The raw data folder holds one subdirectory per intent, which gives us 7 classes.

In [74]:
classes = sorted(os.listdir(rdata_path))  # sorted so the class order is reproducible
classes

['AddToPlaylist',
 'BookRestaurant',
 'GetWeather',
 'PlayMusic',
 'RateBook',
 'SearchCreativeWork',
 'SearchScreeningEvent']

In [4]:
def process_data(jsfile, clas):
    """
    Take an opened JSON file and return a string containing the
    commands of the given intent class, one command per line.
    """
    d = json.load(jsfile)
    text = ''
    # Each command is stored as a list of segments; join their 'text' fields.
    for item in [command['data'] for command in d[clas]]:
        for dic in item:
            text += dic.get('text')
        text += '\n'
    return text
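
To make the expected input layout concrete, here is a minimal sketch that applies the same concatenation logic to a tiny in-memory JSON document. The sample commands, the `entity` key and the helper name `join_segments` are made up for illustration. Only the layout (a top-level key named after the class, each command holding a `data` list of `text` segments) is inferred from the function above.

```python
import io
import json

# Hypothetical sample mimicking the layout the function above expects.
sample = {
    "GetWeather": [
        {"data": [{"text": "Will it rain in "},
                  {"text": "Paris", "entity": "city"},  # 'entity' is an assumed extra field
                  {"text": " tomorrow"}]},
        {"data": [{"text": "weather forecast"}]},
    ]
}

def join_segments(jsfile, clas):
    """Concatenate the text segments of each command, one command per line."""
    d = json.load(jsfile)
    lines = []
    for command in d[clas]:
        lines.append(''.join(seg.get('text', '') for seg in command['data']))
    return '\n'.join(lines) + '\n'

result = join_segments(io.StringIO(json.dumps(sample)), "GetWeather")
print(result)
```

The segmentation metadata is dropped and only the plain command text survives, which is all the classifier needs.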

In [5]:
# The PlayMusic file contains latin-1 encoded text, so it is read and written with that encoding
os.makedirs(data_path, exist_ok=True)  # make sure the output folder exists
for clas in classes:
    if clas == 'PlayMusic':
        enc = 'latin-1'
    else:
        enc = 'utf-8'
    # Opening the train and validate files and writing the processed data
    with open(rdata_path+'/'+clas+'/train_'+clas+'_full.json', encoding=enc) as jsfile:
        text = process_data(jsfile, clas)
    with open(data_path+'/train_'+clas+'.txt', 'w', encoding=enc) as txtfile:
        txtfile.write(text)
    with open(rdata_path+'/'+clas+'/validate_'+clas+'.json', encoding=enc) as jsfile:
        text = process_data(jsfile, clas)
    with open(data_path+'/validate_'+clas+'.txt', 'w', encoding=enc) as txtfile:
        txtfile.write(text)
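
As a small illustration of why the encoding matters, the sketch below shows a latin-1 byte sequence that cannot be decoded as UTF-8. The example string is made up and is not taken from the PlayMusic file.

```python
# 'é' is a single byte (0xE9) in latin-1, but that byte alone is not valid UTF-8.
raw = 'Beyoncé'.encode('latin-1')

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

decoded = raw.decode('latin-1')
print(utf8_ok, decoded)
```

Opening such a file with the wrong `encoding` argument would fail in the same way, which is why the loop above switches encodings per class.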

Next we load the processed data, split contractions into separate tokens, and attach a one-hot label to each command.

In [72]:
train_txt = []
train_label = []
test_txt = []
test_labels = []
for i, clas in enumerate(classes):
    label=[0]*len(classes)
    label[i]=1
    if clas=='PlayMusic':
        enc='latin-1'
    else:
        enc='utf-8'
    with open(data_path+'/train_'+clas+'.txt', encoding=enc) as txtfile:
        for line in txtfile:
            train_txt.append(line.replace('\n','')\
                             .replace("'ve", " 've")\
                             .replace("'s", " 's")\
                             .replace("n't", " n't"))
            train_label.append(label)
    with open(data_path+'/validate_'+clas+'.txt', encoding=enc) as txtfile:
        for line in txtfile:
            test_txt.append(line.replace('\n','')\
                             .replace("'ve", " 've")\
                             .replace("'s", " 's")\
                             .replace("n't", " n't"))
            test_labels.append(label)
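
The two per-line steps in the loop above (detaching contractions and building a one-hot label) can be sketched in isolation as follows. The helper names `split_contractions` and `one_hot` are hypothetical and the sample sentence is made up. The contractions are presumably split so that tokens such as `'s` and `n't` can be looked up as separate words in the embedding vocabulary.

```python
def split_contractions(line):
    """Detach common English clitics so they become separate tokens."""
    for suffix in ("'ve", "'s", "n't"):
        line = line.replace(suffix, " " + suffix)
    return line.rstrip('\n')

def one_hot(index, num_classes):
    """Return a one-hot list, e.g. [0, 1, 0] for index 1 of 3 classes."""
    label = [0] * num_classes
    label[index] = 1
    return label

example = split_contractions("I haven't rated Tom's book\n")
print(example)
print(one_hot(1, 3))
```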

Here we choose the sequence length. It is set to the 98th percentile of the command lengths (in whitespace-separated tokens) in the training data, so only the longest commands, roughly 2% of them, get truncated.

In [75]:
ls=[]
for c in train_txt:
    ls.append(len(c.split()))
maxLen=int(np.percentile(ls, 98))
maxLen

17
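
A minimal sketch of why a percentile is used rather than the maximum: a single very long command would otherwise dictate the padded length of every sequence. The token counts below are made up for illustration.

```python
import numpy as np

# Made-up token counts: 99 ordinary commands and one outlier.
lengths = [6] * 50 + [10] * 49 + [40]

max_len_98 = int(np.percentile(lengths, 98))   # ignores the outlier
max_len_100 = max(lengths)                     # dominated by the outlier

# Share of commands that fit without truncation at the percentile length.
covered = sum(l <= max_len_98 for l in lengths) / len(lengths)
print(max_len_98, max_len_100, covered)
```

With the percentile cut-off, almost every command still fits while the padded sequences stay short.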

Next we load the GloVe word embeddings. They are not included in the repo because of their size. Please download them from [here](https://nlp.stanford.edu/projects/glove/) and place the 50-dimensional file in the processed data path as `glove50.txt`.

In [8]:
embeddings_index={}
with open(data_path+'/glove50.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
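
For reference, each line of a GloVe text file holds a token followed by its vector components, separated by spaces. The sketch below runs the same parsing loop on two made-up lines with 3-dimensional vectors (the real file has 50 components per word).

```python
import io
import numpy as np

# Two made-up lines in GloVe's text format: token, then vector components.
fake_glove = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "cat 0.5 0.0 -1.5\n"
)

toy_index = {}
for line in fake_glove:
    values = line.split()
    # First field is the word, the rest are parsed as float32 coefficients.
    toy_index[values[0]] = np.asarray(values[1:], dtype='float32')

print(toy_index['cat'])
```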

Tokenizing the sequences

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
max_num_words = 40000
embedding_dim=len(embeddings_index['the'])
tokenizer = Tokenizer(num_words=max_num_words)
tokenizer.fit_on_texts(train_txt)
train_sequences = tokenizer.texts_to_sequences(train_txt)
train_sequences = pad_sequences(train_sequences, maxlen=maxLen, padding='post')
test_sequences = tokenizer.texts_to_sequences(test_txt)
test_sequences = pad_sequences(test_sequences, maxlen=maxLen, padding='post')
word_index = tokenizer.word_index
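
To show what the padding step does to the id sequences, here is a NumPy re-implementation written for illustration only. It is not the Keras code. The helper name `pad_post` is hypothetical. It mirrors `padding='post'` together with the default `truncating='pre'`, under which a sequence longer than `maxlen` loses its first tokens.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Zero-pad at the end; over-long sequences keep their last `maxlen` ids."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]            # mirrors truncating='pre'
        out[row, :len(trimmed)] = trimmed  # remaining positions stay 0
    return out

padded = pad_post([[4, 7], [1, 2, 3, 4, 5]], maxlen=4)
print(padded)
```

Index 0 is never assigned to a word by the tokenizer, so it can safely serve as the padding value.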

Some stats needed for the initialization of the embedding matrix

In [10]:
all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
emb_mean,emb_std = all_embs.mean(), all_embs.std()
emb_mean,emb_std

(0.020940498, 0.6441043)

Constructing the embedding matrix

In [11]:
num_words = min(max_num_words, len(word_index)) + 1
# Start from random vectors with the same mean/std as GloVe;
# words not found in the embedding index keep their random row.
embedding_matrix = np.random.normal(emb_mean, emb_std, (num_words, embedding_dim))
for word, i in word_index.items():
    if i >= max_num_words:
        break
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
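
The same initialization can be traced on a toy vocabulary. All names and vectors below are made up for illustration. A word with a pretrained vector gets that vector, and an unknown word keeps its random row.

```python
import numpy as np

rng = np.random.default_rng(0)

toy_vectors = {'play': np.array([1.0, 2.0], dtype='float32')}
toy_word_index = {'play': 1, 'zzzunknown': 2}   # index 0 is reserved for padding

dim = 2
matrix = rng.normal(0.0, 1.0, (len(toy_word_index) + 1, dim))
random_row = matrix[2].copy()

for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        matrix[i] = vec   # known word: copy the pretrained vector
    # unknown word: keep the random row drawn above

print(matrix)
```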

### RNN Model Creation

In [14]:
from keras.models import Model, Sequential
from keras.layers import Dense, Input, Dropout, LSTM, Activation, Bidirectional
from keras.layers import Embedding

Building the model: a bidirectional LSTM layer followed by a second, unidirectional LSTM layer and a dense classification head.

In [22]:
model = Sequential()
model.add(Embedding(num_words, embedding_dim, trainable=True, weights=[embedding_matrix]))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(128, return_sequences=True, recurrent_dropout=0.1, dropout=0.1), 'concat'))
model.add(Dropout(0.3))
model.add(LSTM(128, return_sequences=False, recurrent_dropout=0.1, dropout=0.1))
model.add(Dropout(0.3))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(len(classes), activation='softmax'))

In [23]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 50)          573150    
_________________________________________________________________
dropout_5 (Dropout)          (None, None, 50)          0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 256)         183296    
_________________________________________________________________
dropout_6 (Dropout)          (None, None, 256)         0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 128)               197120    
_________________________________________________________________
dropout_7 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)                8256      
__________
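
The parameter counts in the summary can be checked by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias). The helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

embedding_dim, units = 50, 128

bilstm = 2 * lstm_params(embedding_dim, units)   # forward + backward LSTM
lstm2 = lstm_params(2 * units, units)            # input is the concatenated 256-d output
dense = dense_params(units, 64)
vocab_rows = 573150 // embedding_dim             # rows of the embedding matrix

print(bilstm, lstm2, dense, vocab_rows)
```

These values match the `Param #` column above: 183296 for the bidirectional layer, 197120 for the second LSTM and 8256 for the first dense layer. The embedding layer's 573150 parameters correspond to 11463 rows of 50 dimensions.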

In [24]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

### Training the Model

In [25]:
model.fit(train_sequences, np.array(train_label), epochs=16,
          batch_size=64, shuffle=True,
          validation_data=(test_sequences, np.array(test_labels)))

Train on 13931 samples, validate on 701 samples
Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


<keras.callbacks.History at 0x233b19d0eb8>

Training a bit further with SGD to see whether it improves on the Adam result. Recompiling changes the optimizer but keeps the weights learned so far.

In [26]:
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['acc'])

In [None]:
model.fit(train_sequences, np.array(train_label), epochs=16,
          batch_size=64, shuffle=True,
          validation_data=(test_sequences, np.array(test_labels)))

### Inspection

Assessing where the model went wrong

In [39]:
test_preds = model.predict(test_sequences)

In [54]:
false_preds=np.nonzero(~np.equal(np.argmax(test_preds,1),np.argmax(test_labels,1)))[0]
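
The expression above compares the arg-max of the predicted probabilities with the arg-max of the one-hot labels and keeps the indices where they differ. A toy version with made-up numbers:

```python
import numpy as np

# Made-up softmax outputs for three samples and three classes.
probs = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.6, 0.3, 0.1]])
# Made-up one-hot labels for the same samples.
onehot = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [1, 0, 0]])

wrong = np.nonzero(np.argmax(probs, axis=1) != np.argmax(onehot, axis=1))[0]
accuracy = 1 - len(wrong) / len(onehot)
print(wrong, accuracy)
```

Only the second sample is misclassified here (predicted class 1, labelled class 2), so `wrong` holds the single index 1.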

As the output below shows, the misclassified commands are understandable: their wording and format closely resemble those of the predicted class.

In [73]:
for ind in false_preds:
    print('The command is: {}, The label is:{}, The prediction is:{}\n'\
          .format(test_txt[ind],
                  classes[np.argmax(test_labels[ind])],
                  classes[np.argmax(test_preds[ind])]))

The command is: When is sunrise for AR, The label is:GetWeather, The prediction is:SearchScreeningEvent

The command is: Where is Belgium located, The label is:GetWeather, The prediction is:BookRestaurant

The command is: Live In L.aJoseph Meyer please, The label is:PlayMusic, The prediction is:SearchCreativeWork

The command is: Where can I see The Prime Ministers: The Pioneers, The label is:SearchScreeningEvent, The prediction is:SearchCreativeWork

The command is: I want to see Medal for the General, The label is:SearchScreeningEvent, The prediction is:SearchCreativeWork

The command is: I want to see Shattered Image., The label is:SearchScreeningEvent, The prediction is:SearchCreativeWork

The command is: I want to see Outcast., The label is:SearchScreeningEvent, The prediction is:SearchCreativeWork
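
To see which confusion dominates, the (label, prediction) pairs can be tallied. The sketch below transcribes the seven pairs from the output above by hand. In the notebook they could be built from `false_preds` instead.

```python
from collections import Counter

# (label, prediction) pairs transcribed from the printed output above.
errors = [
    ('GetWeather', 'SearchScreeningEvent'),
    ('GetWeather', 'BookRestaurant'),
    ('PlayMusic', 'SearchCreativeWork'),
    ('SearchScreeningEvent', 'SearchCreativeWork'),
    ('SearchScreeningEvent', 'SearchCreativeWork'),
    ('SearchScreeningEvent', 'SearchCreativeWork'),
    ('SearchScreeningEvent', 'SearchCreativeWork'),
]

confusions = Counter(errors)
top_pair, top_count = confusions.most_common(1)[0]
print(top_pair, top_count)
```

Four of the seven errors are SearchScreeningEvent commands predicted as SearchCreativeWork, which fits the observation that "I want to see <title>" reads much like a search for a creative work.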



### Creating Convolutional Model

This model was inspired by this [repo](https://github.com/ajinkyaT/CNN_Intent_Classification)

In [83]:
from keras.layers import Flatten, Input
from keras.models import Model
from keras.layers import Reshape, Dropout, Concatenate
from keras.layers import Conv2D, MaxPool2D, AvgPool2D

In [105]:
filter_sizes= [2,3,5]
num_filters = 400

inp = Input(shape=(maxLen,))
x = Embedding(num_words, embedding_dim, trainable=True, weights=[embedding_matrix])(inp)
x = Dropout(0.1)(x)
xreshape = Reshape((maxLen, embedding_dim, 1))(x)
conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(xreshape)
conv_3 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(xreshape)
conv_5 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(xreshape)

maxpool_2 = MaxPool2D(pool_size=(maxLen - filter_sizes[0] + 1, 1), strides=(1,1), padding='valid')(conv_2)
maxpool_3 = MaxPool2D(pool_size=(maxLen - filter_sizes[1] + 1, 1), strides=(1,1), padding='valid')(conv_3)
maxpool_5 = MaxPool2D(pool_size=(maxLen - filter_sizes[2] + 1, 1), strides=(1,1), padding='valid')(conv_5)

avgpool_2 = AvgPool2D(pool_size=(maxLen - filter_sizes[0] + 1, 1), strides=(1,1), padding='valid')(conv_2)
avgpool_3 = AvgPool2D(pool_size=(maxLen - filter_sizes[1] + 1, 1), strides=(1,1), padding='valid')(conv_3)
avgpool_5 = AvgPool2D(pool_size=(maxLen - filter_sizes[2] + 1, 1), strides=(1,1), padding='valid')(conv_5)

x = Concatenate(axis=1)([maxpool_2, maxpool_3, maxpool_5, avgpool_2, avgpool_3, avgpool_5])
x = Flatten()(x)
x = Dropout(0.2)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.2)(x)
out = Dense(len(classes), activation='softmax')(x)
model2 = Model(inp, out)
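
The shape logic of the convolutional branches can be traced with plain NumPy. This is a sketch of the idea, not the Keras implementation. It uses random weights, omits biases, and uses 4 filters instead of 400 to stay small. Each kernel spans `k` tokens and the full embedding width, so a 'valid' convolution yields `maxLen - k + 1` responses per filter, and pooling over all of them leaves one value per filter.

```python
import numpy as np

max_len, emb_dim, n_filters = 17, 50, 4   # 4 filters instead of 400 for brevity
rng = np.random.default_rng(0)
sentence = rng.normal(size=(max_len, emb_dim))   # stand-in for one embedded command

max_feats, avg_feats = [], []
for k in (2, 3, 5):
    kernels = rng.normal(size=(n_filters, k, emb_dim))
    # 'valid' convolution over time: one window per start position.
    windows = np.stack([sentence[t:t + k] for t in range(max_len - k + 1)])
    # Dot product of every window with every kernel, followed by ReLU.
    feature_map = np.maximum(0, np.einsum('tke,fke->tf', windows, kernels))
    assert feature_map.shape == (max_len - k + 1, n_filters)
    max_feats.append(feature_map.max(axis=0))    # max over all positions
    avg_feats.append(feature_map.mean(axis=0))   # average over all positions

features = np.concatenate(max_feats + avg_feats)
print(features.shape)
```

With three kernel sizes and two pooling types there are six pooled vectors, so the flattened feature vector has `6 * n_filters` entries. With the 400 filters used in the model above, that is 2400 inputs to the dense layer.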

In [106]:
model2.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_8 (InputLayer)            (None, 17)           0                                            
__________________________________________________________________________________________________
embedding_10 (Embedding)        (None, 17, 50)       573150      input_8[0][0]                    
__________________________________________________________________________________________________
dropout_25 (Dropout)            (None, 17, 50)       0           embedding_10[0][0]               
__________________________________________________________________________________________________
reshape_7 (Reshape)             (None, 17, 50, 1)    0           dropout_25[0][0]                 
__________________________________________________________________________________________________
conv2d_19 

In [107]:
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])

### Training the Model

In [108]:
model2.fit(train_sequences, np.array(train_label), epochs=16,
           batch_size=64, shuffle=True,
           validation_data=(test_sequences, np.array(test_labels)))

Train on 13931 samples, validate on 701 samples
Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


<keras.callbacks.History at 0x233e1b03e80>

### Inspection

Assessing where the model went wrong

In [109]:
test_preds = model2.predict(test_sequences)

In [110]:
false_preds=np.nonzero(~np.equal(np.argmax(test_preds,1),np.argmax(test_labels,1)))[0]

Similar confusions as before:

In [111]:
for ind in false_preds:
    print('The command is: {}, The label is:{}, The prediction is:{}\n'\
          .format(test_txt[ind],
                  classes[np.argmax(test_labels[ind])],
                  classes[np.argmax(test_preds[ind])]))

The command is:  playlist called Hands Up, The label is:AddToPlaylist, The prediction is:SearchCreativeWork

The command is: When is sunrise for AR, The label is:GetWeather, The prediction is:BookRestaurant

The command is: Where is Belgium located, The label is:GetWeather, The prediction is:BookRestaurant

The command is: Live In L.aJoseph Meyer please, The label is:PlayMusic, The prediction is:SearchCreativeWork

The command is: Put What Color Is Your Sky by Alana Davis on the stereo., The label is:PlayMusic, The prediction is:AddToPlaylist

The command is: Where can I see The Prime Ministers: The Pioneers, The label is:SearchScreeningEvent, The prediction is:SearchCreativeWork

The command is: Can I see Ellis Island Revisited in 1 minute, The label is:SearchScreeningEvent, The prediction is:GetWeather

The command is: I want to see Shattered Image., The label is:SearchScreeningEvent, The prediction is:SearchCreativeWork

The command is: I want to see Outcast., The label is:SearchScr