# Multi Label Text Classifier using word embeddings and LSTM
Deep learning gives strong results on many natural language processing (NLP) tasks. One such task is text classification, which determines the category of the text the machine is handling.

Keras is a framework for implementing deep learning models quickly and easily. TensorFlow, Theano, or CNTK can be used as the Keras backend.
    
* Traditionally in NLP, each word is encoded as an integer ID and fed to the model. IDs do not preserve contextual information, so word embeddings are used instead. Every word gets a vector in a vector space learned from the whole corpus. Put simply, the word 'the' is represented as a vector rather than as an ID.
* LSTM networks are RNNs designed to mitigate the vanishing-gradient problem of plain RNNs (exploding gradients are usually handled separately, for example with gradient clipping).
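
To make the difference between IDs and embeddings concrete, here is a minimal sketch. The vocabulary and the 3-dimensional vectors are made-up values for illustration only; real embeddings are learned from the corpus.

```python
import numpy as np

# Toy vocabulary: every word gets an arbitrary integer ID.
word_to_id = {"the": 0, "flight": 1, "plane": 2}

# Made-up 3-dimensional embeddings, one row per ID (illustrative values only).
embedding_matrix = np.array([
    [0.1, 0.0, 0.2],   # the
    [0.9, 0.8, 0.1],   # flight
    [0.8, 0.9, 0.2],   # plane
])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sentence = ["the", "flight"]
ids = [word_to_id[w] for w in sentence]   # ID encoding of the sentence
vectors = embedding_matrix[ids]           # embedding lookup, one row per word

# IDs carry no notion of similarity; vectors can place related words close together.
sim_related = cosine(embedding_matrix[1], embedding_matrix[2])    # flight vs plane
sim_unrelated = cosine(embedding_matrix[0], embedding_matrix[1])  # the vs flight
```

With these toy values, 'flight' and 'plane' end up much closer to each other than 'the' and 'flight', which is the kind of structure an ID cannot express.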

I use the two points above to build a "multi label text classifier". Here each request carries exactly one label chosen from many classes.

Steps:
1. Preprocess and transform the data.
2. Create a word embedding model.
3. Create a simple LSTM model (many-to-one) and train it on the data.
4. Check the predictions.

In [1]:
import numpy as np

## Step 1: Preprocessing the data
* Before preprocessing, look at the data in its raw format and decide how to preprocess it.
* After looking at the data, I decided to remove the tokens "BOS" and "EOS", numbers, and the special characters "'" and "." (removing numerals and special characters is a common practice).
* Each line contains a request and its label, separated by a tab.
* I split each line into request and label.
* I removed the newline character from the labels.

In [2]:
import re
request = []
labels = []

# read the file line by line; 'with' closes it even if an error occurs
with open('final_raw.txt', 'r') as f:
    for line in f:
        # drop the sentence markers, digits, apostrophes and periods
        line = line.replace("BOS ", "")
        line = line.replace(" EOS", "")
        line = re.sub('[0-9]+', '', line)
        line = line.replace("'", "")
        line = line.replace(".", "")
        # each line is "<request>\t<label>\n"
        parts = line.split('\t')
        request.append(parts[0].split(' '))
        labels.append(parts[1].replace("\n", ""))
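
As a worked example, the same cleaning steps can be wrapped in a small function and applied to single lines. The function name `clean_line` and both raw lines are assumptions made for illustration; the first line is reconstructed to match the first request in the dataset.

```python
import re

def clean_line(line):
    """Apply the cleaning steps to one raw line and return (tokens, label)."""
    line = line.replace("BOS ", "").replace(" EOS", "")
    line = re.sub(r"[0-9]+", "", line)
    line = line.replace("'", "").replace(".", "")
    text, label = line.split("\t")[:2]
    return text.split(" "), label.replace("\n", "")

# Assumed raw lines, written out for illustration.
sample = "BOS i want to fly from baltimore to dallas round trip EOS\tatis_flight\n"
tokens, label = clean_line(sample)

# Removing digits leaves two adjacent spaces, so split(' ') yields an empty token.
tokens_with_gap, _ = clean_line("BOS flight 747 to dallas EOS\tatis_flight\n")
```

The second call shows where an empty-string token can come from: once the digits are gone, two spaces sit next to each other. Splitting with `text.split()` instead of `text.split(" ")` would drop such empty tokens, at the cost of changing the vocabulary and the length statistics.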

Let's have a look at the data:
* request
* labels

In [3]:
( request[0], labels[0] )

(['i',
  'want',
  'to',
  'fly',
  'from',
  'baltimore',
  'to',
  'dallas',
  'round',
  'trip'],
 'atis_flight')

# Data Transformation
Let's encode the labels so that they can be given to the model as integers.

In [4]:
from sklearn.preprocessing import LabelEncoder
labelencoder_labels = LabelEncoder()
labels_encoded = labelencoder_labels.fit_transform(labels)
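
LabelEncoder numbers the distinct labels in sorted order. The sketch below reproduces that mapping with NumPy alone on a few hypothetical labels, and shows how to get the number of classes and how to map integers back to label names.

```python
import numpy as np

# Hypothetical sample labels; the real ones come from final_raw.txt.
sample_labels = ["atis_flight", "atis_airfare", "atis_flight", "atis_ground_service"]

# np.unique returns the sorted distinct labels and, for each input label,
# its index in that sorted list.
classes, encoded = np.unique(sample_labels, return_inverse=True)

n_classes = len(classes)     # the output layer needs one unit per class
decoded = classes[encoded]   # map integer codes back to label names
```

In the notebook itself, `labelencoder_labels.classes_` holds the sorted label names and `labelencoder_labels.inverse_transform` does the decoding.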

# Step 2: Create word embeddings.
    
You can obtain word embeddings in two ways:

1. Train your own word embedding model on the given text dataset.
2. Use a pretrained model such as Google's 'word2vec' or 'glove'.

When using pretrained 'word2vec' or 'glove' embeddings, you need to remove (or otherwise handle) the words in the text dataset that are not present in those models.
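
A minimal sketch of that filtering step, assuming the pretrained vocabulary is available as a Python set. The names `pretrained_vocab` and `drop_oov` are hypothetical, and the tiny vocabulary stands in for a real one.

```python
# Hypothetical vocabulary of a pretrained model; a real one is far larger.
pretrained_vocab = {"i", "want", "to", "fly", "from", "dallas"}

def drop_oov(tokens, vocab):
    """Keep only the tokens that the embedding model knows."""
    return [tok for tok in tokens if tok in vocab]

tokens = ["i", "want", "to", "fly", "from", "baltimore", "to", "dallas"]
kept = drop_oov(tokens, pretrained_vocab)
```

Here 'baltimore' is dropped because it is missing from the toy vocabulary. Dropping is only one option; mapping unknown words to a shared placeholder vector is another.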

I prefer to train my own model on the text corpus, believing that it captures more of the corpus's context.

**Gensim is a Python library that creates word embeddings for a given text corpus.**

In [5]:
from gensim.models import Word2Vec

Using TensorFlow backend.


In [6]:
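# train model; Word2Vec accepts a list of token lists directly
# (np.asarray on lists of different lengths raises an error on recent NumPy versions)
model_vec = Word2Vec(request, min_count=1)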

In [7]:
# summarize the loaded model
print(model_vec)

Word2Vec(vocab=712, size=100, alpha=0.025)


In [11]:
# summarize vocabulary
# (gensim 4 and later expose the vocabulary as model_vec.wv.key_to_index)
words = list(model_vec.wv.vocab)
print(words)
print(len(words))

['', 'charlotte', 'plane', 'montreal', 'lunch', 'kansas', 'that', 'reaches', 'jose', 'philadelphia', 'mornings', 'rental', 'st', 'florida', 'connection', 'they', 'friday', 'whatre', 'advertises', 'hi', 'october', 'thursday', 'francisco', 'may', 'airports', 'anywhere', 'these', 'airport', 'amount', 'toward', 'least', 'sundays', 'near', 'approximately', 'new', 'airplane', 'ill', 'hello', 'eastern', 'boeing', 'cars', 'january', 'wednesday', 'express', 'sixteenth', 'explain', 'trans', 's', 'gets', 'the', 'looking', 'fn', 'rate', 'eye', 'service', 'capacities', 'arrival', 'describe', 'april', 'jfk', 'flight', 'august', 'people', 'meals', 'direct', 'supper', 'los', 'houston', 'twenty', 'eighth', 'planes', 'their', 'reservation', 'petersburg', 'snack', 'offer', 'repeat', 'know', 'york', 'week', 'thats', 'cincinnati', 'most', 'spend', 'no', 'lets', 'charges', 'next', 'world', 'put', 'another', 'departure', 'day', 'missouri', 'coach', 'airlines', 'tuesdays', 'shortest', 'departing', 'co', 'some

Looking at the vector of the word 'trip':

In [12]:
# access vector for one word (through .wv; indexing the model directly is deprecated)
print(model_vec.wv['trip'])

[-0.65366089 -0.03637737 -0.05059821 -0.0171887   0.0455776   0.38865346
 -0.16616102  0.04226137 -0.06991297  0.0603858   0.12815918  0.01084718
  0.09571978  0.36944807 -0.25016913 -0.24622206  0.00678377  0.41599265
 -0.23819438 -0.06295195  0.14006706  0.08455258  0.00498152  0.15439513
  0.21382695 -0.00066623 -0.07322553 -0.17985065 -0.41274625  0.00866815
  0.24029277 -0.35575405  0.49114737 -0.00744834 -0.01922256 -0.07308241
  0.41536006  0.21369734 -0.46222466  0.23555312  0.06116557 -0.08631554
  0.09496447  0.26096189  0.08722508 -0.1143515  -0.25571516 -0.44265786
 -0.09551213  0.44629547  0.1956607  -0.32173955  0.0765483  -0.48595366
 -0.11774267 -0.38030055  0.28741893 -0.30913919 -0.09047089 -0.1554341
  0.62935293  0.13833532  0.29244468  0.07281834 -0.17708285  0.25091544
 -0.06994165  0.47397357 -0.34605703  0.13467559 -0.0412305  -0.12756269
 -0.24033183  0.09237041 -0.27368605 -0.19603567 -0.24889509  0.20308998
 -0.0433677   0.0097833  -0.40524617 -0.54449654 -0.

Looking at the statistics of the corpus we have. 

In [13]:
# number of tokens in each request
req_lens = [len(tokens) for tokens in request]

aa = np.asarray(req_lens)
(aa.max(), aa.min(), aa.mean())

(46, 1, 11.280946065428823)

The statistics here and in the previous notebook don't match, and I don't know why. Let's move on; I will look into the statistics later, as it is a lower priority.

The average number of words per request is about 11.

I will use roughly twice the average as the sequence length, so each sentence is trimmed to at most 20 words.
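
Before fixing the cut-off, it can help to check how many sentences it would actually shorten. A small sketch with NumPy follows; the lengths in `sample_lens` are hypothetical, and in the notebook the array `aa` would take their place.

```python
import numpy as np

# Hypothetical sentence lengths; in the notebook, use the array `aa` instead.
sample_lens = np.array([10, 7, 12, 46, 1, 15, 9, 22])
max_len = 20

# Fraction of sentences longer than the cut-off, i.e. those that lose words.
share_truncated = float((sample_lens > max_len).mean())

# A high percentile is another common way to pick the cut-off.
p95 = float(np.percentile(sample_lens, 95))
```

With the toy lengths, two of the eight sentences exceed 20 words, so a quarter of them would be truncated.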

In [14]:
for i in range(0,len(request)):
    request[i] = request[i][:20]

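I am not going to disturb the original corpus. I create a zero-filled NumPy array of shape (4524, 20, 100), i.e. 4524 sentences, 20 words per sentence, and a 100-dimensional vector for each word.

Because the array starts out as zeros, sentences shorter than 20 words are implicitly zero-padded, which eliminates a separate padding step. Quite wonderful, isn't it?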

In [15]:
# one row per request, 20 time steps, 100-dimensional word vectors
hh = np.zeros(shape=(len(request), 20, 100))
for i in range(len(request)):
    for j in range(len(request[i])):
        hh[i][j] = model_vec.wv[request[i][j]]
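
The same fill-and-pad idea can be sketched as a reusable function. Everything here is illustrative: `sentences_to_array` is a hypothetical helper, and a plain dict of toy 3-dimensional vectors stands in for the trained embedding model. Unlike the loop above, this version also truncates inside the function and leaves unknown words as zero rows.

```python
import numpy as np

def sentences_to_array(sentences, vectors, max_len, dim):
    """Stack word vectors into a zero-padded array of shape (n, max_len, dim)."""
    out = np.zeros((len(sentences), max_len, dim), dtype=np.float32)
    for i, tokens in enumerate(sentences):
        for j, tok in enumerate(tokens[:max_len]):   # truncate long sentences
            if tok in vectors:                        # unknown words stay all-zero
                out[i, j] = vectors[tok]
    return out

# Toy 3-dimensional vectors standing in for the Word2Vec model.
toy_vectors = {
    "i": np.array([1.0, 0.0, 0.0]),
    "want": np.array([0.0, 1.0, 0.0]),
}
toy_sentences = [["i", "want"], ["want", "unknown", "i", "i", "i"]]
toy_array = sentences_to_array(toy_sentences, toy_vectors, max_len=4, dim=3)
```

The first sentence fills two time steps and leaves two as padding; the second is cut to four tokens, with a zero row where the unknown word sits.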

Looking at the shape, word and its vector representation

In [17]:
(hh.shape, request[0][1], hh[0][1] )

((4524, 20, 100),
 'want',
 array([-0.7389788 , -0.04055923,  0.08641576,  0.12363553,  0.05098152,
         0.17420503, -0.25286803,  0.02319203, -0.08540826, -0.03309162,
         0.11532205, -0.03572461, -0.01597139,  0.32943872, -0.06826755,
        -0.11031719, -0.02668942,  0.30229938, -0.08880879,  0.01215467,
         0.05174172,  0.1629716 , -0.14065567,  0.2304832 ,  0.03587618,
         0.06667246,  0.05996433, -0.29377291, -0.38158241, -0.09248393,
         0.05427686, -0.23276503,  0.4110254 , -0.08192627, -0.13393882,
        -0.28103629,  0.27582783,  0.28076112, -0.30359653,  0.05260054,
        -0.01397794, -0.00971313,  0.1106336 ,  0.21021052,  0.2282887 ,
        -0.09043355, -0.31616017, -0.40376154, -0.05114204,  0.18393269,
         0.28197241, -0.22838278, -0.0699352 , -0.34867066, -0.07239054,
        -0.26409933,  0.03571077, -0.26893169,  0.02000365, -0.18976863,
         0.35511482, -0.04331189,  0.22344229,  0.04168072, -0.09777664,
         0.10513742, -0.

# Step 3: Creating LSTM model

In [18]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

In [19]:
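lstm = Sequential([
    LSTM(20, input_shape=(20, 100)),
    # note: softmax on a hidden layer is unusual; 'relu' or 'tanh' is the common choice
    Dense(100, activation='softmax'),
    # note: the 20 output units must cover every distinct label in labels_encoded
    Dense(20, activation='softmax')
])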

In [20]:
lstm.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=["accuracy"])

In [22]:
lstm.fit(hh, labels_encoded, 
          batch_size=32, 
          epochs=3,
          validation_split = 0.2
           )

Train on 3619 samples, validate on 905 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1e148300cf8>

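Accuracy, val_loss, and val_acc do not improve from epoch to epoch.

Some ways to solve this:
1. Use a suitable optimizer.
2. Use a suitable loss function.
3. Use regularizers and dropout.
4. Try different learning rates.

It is also worth checking that the final Dense layer has at least as many units as there are distinct labels: with sparse_categorical_crossentropy, a label index outside the output range is a common source of NaN losses.

I will try all of these soon.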
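
One cheap check that can be run before trying the options above is whether the training data itself is consistent with the model. The sketch below is an assumption about what might be wrong, not a confirmed diagnosis; `check_training_inputs` is a hypothetical helper, and the toy arrays stand in for `hh` and `labels_encoded`.

```python
import numpy as np

def check_training_inputs(features, labels_encoded, n_output_units):
    """Return a list of human-readable problems found in the training data."""
    problems = []
    labels_encoded = np.asarray(labels_encoded)
    # NaN or infinite inputs propagate straight into the loss.
    if not np.isfinite(features).all():
        problems.append("features contain NaN or infinite values")
    # Sparse categorical cross-entropy expects labels in 0..n_output_units-1.
    if labels_encoded.min() < 0 or labels_encoded.max() >= n_output_units:
        problems.append(
            "label indices span %d..%d but the output layer has %d units"
            % (labels_encoded.min(), labels_encoded.max(), n_output_units)
        )
    return problems

# Toy data: one label (21) does not fit into a 20-unit output layer.
toy_features = np.zeros((4, 20, 100))
toy_labels = np.array([0, 3, 21, 7])
issues = check_training_inputs(toy_features, toy_labels, n_output_units=20)
no_issues = check_training_inputs(toy_features, toy_labels, n_output_units=22)
```

In the notebook this would be called as `check_training_inputs(hh, labels_encoded, 20)`. If it reports a label range problem, sizing the last layer as `Dense(len(labelencoder_labels.classes_), activation='softmax')` is the usual remedy.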

Another interesting thing is that acc, val_loss, and val_acc are the same for both the 'LSTM with word embedding model' and the 'LSTM without word embedding model', even though the architectures are completely different. How fascinating.

Predicting.

Here is the sentence that I want to predict:

In [23]:
test_sentence = 'ground transportation in salt like city'

Preprocess the sentence:
1. Split the sentence into tokens (words).
2. Create a zero-filled NumPy test array.
3. Go through each word, get its vector, and store it in the array.

In [24]:
# splitting using space as separator
test_sentence = test_sentence.split(' ')
# creating a zero-filled numpy array: 1 sentence, 20 time steps, 100 dimensions
test_vec = np.zeros(shape=(1,20,100))
# getting the vector of each word (raises KeyError for words outside the training vocabulary)
for kk in range(len(test_sentence)):
    test_vec[0][kk] = model_vec.wv[test_sentence[kk]]

**prediction**

In [25]:
preds = lstm.predict(test_vec)

Looking at the predictions

In [26]:
( preds )

array([[ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
         nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan]], dtype=float32)
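
Once a model produces usable probabilities, the predicted label is the class with the highest score. A minimal sketch of that decoding step, with a guard for the NaN case seen here, is shown below. `decode_predictions`, the class names, and the probabilities are all made up for illustration.

```python
import numpy as np

def decode_predictions(preds, classes):
    """Return the most probable class name for each row of `preds`."""
    preds = np.asarray(preds)
    if np.isnan(preds).any():
        raise ValueError("predictions contain NaN; the model did not train properly")
    return [classes[i] for i in preds.argmax(axis=1)]

# Toy class names and probabilities standing in for real model output.
toy_classes = ["atis_airfare", "atis_flight", "atis_ground_service"]
toy_preds = np.array([[0.1, 0.2, 0.7]])
top = decode_predictions(toy_preds, toy_classes)
```

In the notebook, `labelencoder_labels.classes_` would supply the class names in the same order as the encoded labels.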

Sorry for the useless model. 

# Additional
1. You can save the embedding model and reuse it in other projects by loading it back.
   * If you want to use other popular embedding models (like 'word2vec' and 'glove'), you can load them too.
2. It's wise to dump the data at multiple stages using 'pickle'. You can then load the preprocessed data and try out different models on it. I have not done so here, as it's a small project.
3. You can use bcolz to save large arrays. Bcolz saves the array to local disk, which saves a whole lot of time.
4. You can save the LSTM model and load it back later.
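
A minimal sketch of the pickle checkpoint mentioned in the list, using only the standard library. The sample data and the file name `preprocessed.pkl` are assumptions; in the notebook, `request` and `labels` would be dumped instead.

```python
import os
import pickle
import tempfile

# Small stand-ins for the preprocessed `request` and `labels` lists.
request_sample = [["i", "want", "to", "fly"], ["ground", "transportation"]]
labels_sample = ["atis_flight", "atis_ground_service"]

# Write to a temporary directory so the example leaves the project folder untouched.
path = os.path.join(tempfile.mkdtemp(), "preprocessed.pkl")

# Dump both lists in one file after preprocessing.
with open(path, "wb") as f:
    pickle.dump({"request": request_sample, "labels": labels_sample}, f)

# Later, or in another notebook, load them back without redoing the preprocessing.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.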

In [None]:
# save the embedding model (gensim's native format)
model_vec.save('model_vec.bin')

In [None]:
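# loading the embedding model
# model_vec.save() writes gensim's native format, so load it with Word2Vec.load().
# KeyedVectors.load_word2vec_format() only reads files in the C word2vec format,
# e.g. ones written by model_vec.wv.save_word2vec_format(..., binary=True),
# which is why it raised an error on this file.
from gensim.models import Word2Vec
model_vec = Word2Vec.load('model_vec.bin')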