# Multi Label Text Classifier using LSTM
Deep learning gives strong results on many NLP tasks. One such task is text classification: deciding which category a piece of text belongs to.

Keras is a framework for implementing deep learning models quickly and easily. TensorFlow, Theano, or CNTK can serve as its backend.
    
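* Traditionally in NLP, each word is encoded as an integer ID before it is fed to the model.
* LSTM networks are RNNs designed to mitigate the vanishing-gradient problem of plain RNNs (exploding gradients are usually handled separately, for example by gradient clipping).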

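I use the two points above to build a "multi label text classifier". Each request here carries exactly one label, so strictly speaking this is multi-class classification.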

Steps:
1. Preprocess the data
2. Transform the data into the format the model expects
3. Create a simple LSTM model (many-to-one) and train it on the data
4. Check the predictions

## Step 1: Preprocessing the data
* Before preprocessing, look at the data in its raw format and decide how to clean it.
* After looking at the data, I decided to remove the markers "BOS" and "EOS", numbers, and the special characters `'` and `.` (removing numerals and special characters is a common practice, though whether it helps depends on the task).
* Each line contains a request and its label, separated by a tab.
* I split each line into request and label.
* I removed the newline character from each label.

In [1]:
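import re

request = []
labels = []
# Each line of the file holds "<request>\t<label>\n"
with open('final_raw.txt', 'r') as f:
    for line in f:
        # Strip the begin/end-of-sentence markers
        line = line.replace("BOS ", "")
        line = line.replace(" EOS", "")
        # Drop digits, apostrophes and periods
        line = re.sub('[0-9]+', '', line)
        line = line.replace("'", "")
        line = line.replace(".", "")
        # Split into request and label, removing the trailing newline from the label
        fields = line.split('\t')
        request.append(fields[0])
        labels.append(fields[1].replace("\n", ""))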
    

Let's have a look at the data:
* request
* labels

In [2]:
( request[0], labels[0] )

('i want to fly from baltimore to dallas round trip', 'atis_flight')

## Step 2: Data Transformation

Let's encode the labels so that they can be given to the model as integers.

In [3]:
from sklearn.preprocessing import LabelEncoder
labelencoder_labels = LabelEncoder()
labels = labelencoder_labels.fit_transform(labels)

In [4]:
(labels.shape, labels[0].shape, labels[89].shape, labels[489].shape)

((4524,), (), (), ())

This means that `labels` is a one-dimensional array with 4524 entries, one integer per request.

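Let's check how many distinct labels there are. `LabelEncoder` numbers the classes from 0, so the number of classes is the largest value plus one (21 here, given the maximum of 20 shown below).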

In [5]:
max(labels)

20
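
As a rough illustration of the encoding step, here is a plain-Python sketch of what integer label encoding amounts to, assuming the integers are assigned to the sorted unique labels (which is how `LabelEncoder` behaves). The sample labels are illustrative; only `atis_flight` appears in the output above.

```python
# Hypothetical sample labels; only 'atis_flight' is taken from the data shown above.
sample_labels = ["atis_flight", "atis_airfare", "atis_flight", "atis_airline"]

# Sorted unique labels get consecutive integers starting at 0.
classes = sorted(set(sample_labels))
label_to_id = {name: idx for idx, name in enumerate(classes)}
encoded = [label_to_id[name] for name in sample_labels]

# Ids start at 0, so the number of classes is the largest id plus one.
num_classes = max(encoded) + 1
print(classes, encoded, num_classes)
```

The same arithmetic applied to a largest id of 20 gives 21 classes, which is the number of output units the classifier needs.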

**Transforming requests**
* Tokenise each request. In simple words, each word is converted to an integer.
* If "the" is the most frequent word in the whole dataset, it is tokenised as 1. If "if" is the second most frequent word, it is tokenised as 2, and so on.
* The code is below. (`num_words` caps the vocabulary: only the most frequent `num_words - 1` words are kept; the rest are dropped.)

In [6]:
import keras.preprocessing.text
from keras.preprocessing import sequence

tk = keras.preprocessing.text.Tokenizer(num_words=2000,
                                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                   lower=True,
                                   split=" ",
                                   char_level=False)
tk.fit_on_texts(request)

Using TensorFlow backend.


In [7]:
req_vec = tk.texts_to_sequences(request)

Let's have a look at a tokenised request.

In [8]:
req_vec[100]

[10, 6, 4, 62, 2, 20, 1, 11, 14]
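
To make the numbering concrete, here is a plain-Python sketch of frequency-ranked tokenising. It is not the Keras implementation, only an approximation of the idea under simple assumptions (lowercasing and whitespace splitting); the helper names `fit_word_index` and `texts_to_ids` and the toy requests are made up for illustration.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency: the most frequent word gets id 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count, descending; id 0 stays free for padding.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words=None):
    """Replace words by ids, dropping unknown words and ids >= num_words."""
    result = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        result.append(ids)
    return result

# Toy data, not taken from the real dataset.
toy_requests = ["flights from dallas", "fares from dallas to boston", "from boston"]
word_index = fit_word_index(toy_requests)
print(word_index)
print(texts_to_ids(["fares from dallas"], word_index, num_words=3))
```

With `num_words=3` only ids 1 and 2 survive, which mirrors how a small `num_words` silently drops rarer words from every request.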

In [9]:
import numpy as np
req_vec = np.asarray(req_vec, dtype=object)  # rows have different lengths, so store them as objects
( req_vec.shape )

(4524,)

Having a more detailed look at individual lengths:

In [10]:
( len(req_vec[0]), len(req_vec[50]), len(req_vec[100]), len(req_vec[1000]) )

(10, 14, 9, 4)

`request` contains all the queries.
Let's compute some simple statistics on it.

In [11]:
import numpy as np
# len() on a string counts characters (spaces included), not words
req_lens = [len(r) for r in request]

aa = np.asarray(req_lens)
(aa.max(), aa.min(), aa.mean())

(255, 3, 63.291777188328915)

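Because `len(request[i])` measures a string, these are lengths in characters, not words:
* 255 characters at most
* 3 characters at least
* about 63 characters on average

These numbers are used below to choose the padded input length.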
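
A quick check of what `len` measures on a request string, using one request from this dataset. This is only a sketch to show the difference between character count and word count, assuming words are separated by whitespace.

```python
sample = 'show me the fares from dallas to san francisco'

n_chars = len(sample)           # counts every character, spaces included
n_words = len(sample.split())   # counts whitespace-separated words

print(n_chars, n_words)
```

The word count is the figure that corresponds to the number of token ids produced for a request, so it is the more natural basis for choosing a padded length.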

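* I choose the width of the input matrix from the average request length. I usually take about twice the average and use that as the number of columns (twice 63 is roughly 120). Since the average above is in characters, 120 is generous for the number of tokens per request (the sampled requests above have 4 to 14 tokens).
* Then I pad the shorter requests with zeros so that all rows have the same length and the data forms a regular matrix.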

In [12]:
max_len = 120
req_vec = sequence.pad_sequences(req_vec,maxlen=max_len, padding = 'post')

In [13]:
( len(req_vec[0]), len(req_vec[50]), len(req_vec[100]), len(req_vec[1000]) )

(120, 120, 120, 120)
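
The padding step can be sketched in plain Python as follows. This is an approximation of `pad_sequences` with `padding='post'`, not the Keras code itself; the helper name `pad_post` is made up, and dropping ids from the front of an over-long sequence is an assumption meant to mirror the Keras default `truncating='pre'`.

```python
def pad_post(seq, max_len, value=0):
    """Pad at the end with `value`; keep only the last max_len ids if too long."""
    # Assumption: over-long sequences lose ids from the front.
    trimmed = seq[-max_len:]
    return trimmed + [value] * (max_len - len(trimmed))

tokens = [10, 6, 4, 62, 2, 20, 1, 11, 14]
padded = pad_post(tokens, max_len=12)
print(padded)
```

Zero works as the padding value because the tokeniser never assigns id 0 to a real word.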

Looking at the data after transformation

In [14]:
( req_vec.shape, request[100], req_vec[100] )

((4524, 120),
 'show me the fares from dallas to san francisco',
 array([10,  6,  4, 62,  2, 20,  1, 11, 14,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0]))

## Step 3: Building the model

In [15]:
import keras

In [16]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

In [17]:
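vocab_size = 800   # NOTE: the tokenizer was built with num_words=2000, so token ids may exceed 799
cols = 120         # padded sequence length (max_len)
embedding_dim = 1  # size of each word vector (originally named `batch`)
model = Sequential([
            # The Keras 1 `dropout=0.2` argument of Embedding no longer exists in Keras 2+;
            # a SpatialDropout1D layer after the embedding is the usual replacement.
            Embedding(vocab_size, embedding_dim, input_length=cols),
            LSTM(500),
            Dense(256),
            Dense(128),
            # NOTE: labels run from 0 to 20, i.e. 21 classes; 20 units leave label 20 without an output
            Dense(20, activation='softmax')
                  ])
# The labels are integer-encoded, which is why I use sparse categorical cross-entropy.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=["accuracy"])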


Fitting the model for 3 epochs to have a quick look.

Keras can split off a validation set for you: `validation_split = 0.2` holds out the last 20% of the rows for validation.

In [18]:
model.fit(req_vec, labels, 
          batch_size=32, 
          epochs=3,
          validation_split = 0.2
           )

Train on 3619 samples, validate on 905 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x265abef3e48>
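
The sample counts in the log can be reproduced by hand. The sketch below assumes that `validation_split` holds out the last fraction of the rows, before any shuffling, and that the cut index is rounded down; the helper name `split_at` is made up for illustration.

```python
def split_at(n_samples, validation_split):
    """Index separating the training rows from the validation rows at the end."""
    return int(n_samples * (1.0 - validation_split))

n_samples = 4524
cut = split_at(n_samples, 0.2)
n_train, n_val = cut, n_samples - cut
print(n_train, n_val)
```

Because the held-out rows are simply the last ones in the array, it can be worth shuffling the requests and labels together before calling `fit` if the file is ordered in any way.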

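The model is not learning: accuracy, val_loss, and val_acc do not improve from epoch to epoch, and the predictions below come out as `nan`.

Some things to check and try:
1. Make the output layer match the labels. The labels run from 0 to 20, which is 21 classes, but the last Dense layer has 20 units; an out-of-range label is a likely cause of the `nan` values.
2. Make `vocab_size` in the Embedding layer match the tokenizer (`num_words=2000`), so that no token ID falls outside the embedding table.
3. Use a suitable optimizer and loss function.
4. Use regularizers and dropout.
5. Try different learning rates.

I will try all of these soon.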
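
One thing worth checking first is whether every label has an output unit. The sketch below is a plain-Python version of sparse categorical cross-entropy for a single sample, not what Keras does internally; it only illustrates that a 20-unit softmax has no probability to offer for label id 20. A real framework may raise an error or produce `nan` in this situation instead of the explicit `ValueError` used here.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(probs, label):
    """Loss for one sample: minus the log-probability of the true class id."""
    if not 0 <= label < len(probs):
        raise ValueError(f"label {label} has no output unit (only {len(probs)} units)")
    return -math.log(probs[label])

probs = softmax([0.0] * 20)              # a 20-unit output layer with equal scores
print(sparse_cross_entropy(probs, 19))   # fine: ids 0..19 exist
try:
    sparse_cross_entropy(probs, 20)      # id 20 would be the 21st class
except ValueError as err:
    print(err)
```

Sizing the last layer as `max(labels) + 1` units avoids this mismatch.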

Let's predict:
* Copy one of the tokenised requests and run a quick prediction.

In [21]:
testing = [[10,  6,  4, 62,  2, 20,  1, 11, 14,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0] ]
testing = np.asarray(testing)
print(testing.shape)
testing = testing.reshape(1, 120)  # reshape returns a new array; the shape is already (1, 120) here
print(type(testing))
testing.shape

(1, 120)
<class 'numpy.ndarray'>


(1, 120)

In [22]:
preds = model.predict(testing)

In [23]:
(preds)

array([[ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
         nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]], dtype=float32)

Let's predict on a new sentence.

In [26]:
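sentence = ' any flights from dallas to salt city'
# Tokenizing: texts_to_sequences expects a list of texts, so wrap the sentence in a list
sen_vec = tk.texts_to_sequences([sentence])
# Padding: pad the new sentence (not req_vec), the same way as the training data
sen_vec = sequence.pad_sequences(sen_vec, maxlen=max_len, padding='post')

In [27]:
preds = model.predict(sen_vec)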

In [28]:
(preds)

array([[ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
         nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]], dtype=float32)
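
Once a model produces usable probabilities, the last step is to turn a row of `preds` into a label name. The sketch below shows one way to do that in plain Python, with a guard for `nan` output like the one above; the helper name `decode_prediction` and the class names are illustrative, not taken from the trained model.

```python
import math

def decode_prediction(probs, classes):
    """Return the most probable class name, or None if the output is unusable."""
    if any(math.isnan(p) for p in probs):
        return None  # a nan output carries no information
    best = max(range(len(probs)), key=lambda i: probs[i])
    return classes[best]

# Illustrative class names in id order.
classes = ["atis_airfare", "atis_flight", "atis_ground_service"]
print(decode_prediction([0.1, 0.7, 0.2], classes))
print(decode_prediction([float("nan")] * 3, classes))
```

With the encoder from Step 2, `labelencoder_labels.classes_` holds the class names in id order, and `labelencoder_labels.inverse_transform` maps ids back to names.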

Sorry for the useless model.