<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/NLP/NLPModel_MultiClass_Keras_CM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Multi-label Model with Keras Custom Model

In this notebook, we build a custom Keras model that classifies text into categories. The model is multi-label: it can predict more than one category for a single text.

This notebook is adapted from:
 - https://stackoverflow.blog/2019/05/06/predicting-stack-overflow-tags-with-googles-cloud-ai/
 - https://www.youtube.com/watch?v=OHIEZ-Scek8

In [4]:
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
from sklearn.preprocessing import MultiLabelBinarizer
import tensorflow as tf
from tensorflow.keras.preprocessing import text

* **Fetch data**

We use the Stack Overflow question tag classification dataset.

In [5]:
df = pd.read_csv('https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv')
df.head()

Unnamed: 0,post,tags
0,what is causing this behavior in our c# datet...,c#
1,have dynamic html load as if it was in an ifra...,asp.net
2,how to convert a float value in to min:sec i ...,objective-c
3,.net framework 4 redistributable just wonderi...,.net
4,trying to calculate and print the mean and its...,python


* **Clean data**

In [6]:
# Remove entries with null tags 
df = df[pd.notnull(df['tags'])]

# Get a fraction of the entries
df = df.sample(frac=0.5, random_state=99).reset_index(drop=True)

# Shuffle the rows
df = shuffle(df, random_state=22)
df = df.reset_index(drop=True)

# Create a label column from the tags column
df['class_label'] = df['tags'].factorize()[0]

print(f'Number of labelled examples : {len(df)}')
df.head()

Number of labelled examples : 20000


Unnamed: 0,post,tags,class_label
0,how do i move something in rails i m a progr...,ruby-on-rails,0
1,c# how to output specific array searches t...,c#,1
2,integer.parseint and string format with decima...,java,2
3,compilation problem while upgrading a website ...,.net,3
4,query to list out the records by comparing max...,sql,4


* **Prepare data**

  - *Encoding Tags As Multi-Hot Arrays*

> Encoding labels is pretty simple using Scikit-learn’s MultiLabelBinarizer. Since a single question can have multiple tags, we’ll want our model to output multi-hot arrays.

Binarize the labels

In [7]:
tags_split = [tags.split(',') for tags in df['tags'].values]
tag_encoder = MultiLabelBinarizer()
tags_encoded = tag_encoder.fit_transform(tags_split)

print(f'Binary label matrix: ({tags_encoded.shape[0]} examples) x ({tags_encoded.shape[1]} labels)')
tags_encoded

Binary label matrix: (20000 examples) x (20 labels)


array([[0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [8]:
print(f'Class labels: {tag_encoder.classes_}')

Class labels: ['.net' 'android' 'angularjs' 'asp.net' 'c' 'c#' 'c++' 'css' 'html' 'ios'
 'iphone' 'java' 'javascript' 'jquery' 'mysql' 'objective-c' 'php'
 'python' 'ruby-on-rails' 'sql']
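
The rows shown so far each carry a single tag, so the value of the multi-hot format is easier to see on an input with several tags per entry. The following pure-Python sketch mirrors the idea behind `MultiLabelBinarizer`; the `multi_hot` helper and the sample tag strings are illustrative assumptions, not part of the dataset or of scikit-learn.

```python
def multi_hot(tag_lists):
    """Return (classes, matrix) where matrix[i][j] is 1 if classes[j] is in tag_lists[i]."""
    # Sorted unique labels, analogous to the encoder's classes_ attribute
    classes = sorted({tag for tags in tag_lists for tag in tags})
    index = {tag: j for j, tag in enumerate(classes)}
    matrix = []
    for tags in tag_lists:
        row = [0] * len(classes)
        for tag in tags:
            row[index[tag]] = 1  # one column per label, set when the label is present
        matrix.append(row)
    return classes, matrix

# Hypothetical comma-separated tag strings, split the same way as in the notebook
sample_tags = ['python,pandas', 'java', 'python']
classes, encoded = multi_hot([tags.split(',') for tags in sample_tags])

print(classes)
print(encoded)
```

The first row has two 1s because that entry carries two tags, which is exactly the case a one-hot encoding cannot represent.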


Split the binary label matrix into training and evaluation sets.

In [9]:
num_tags = len(tags_encoded[0])
train_size = int(len(df)*0.8)

print(f'Number of labels : {num_tags}')
print(f'Number of examples in the training dataset : {train_size}')
print(f'Number of examples in the evaluation dataset : {len(df)-train_size}')

Number of labels : 20
Number of examples in the training dataset : 16000
Number of examples in the evaluation dataset : 4000


In [10]:
y_train = tags_encoded[: train_size]
y_test = tags_encoded[train_size:]

print(f'Binarized labels for training - binary matrix of ({y_train.shape[0]} training examples) x ({y_train.shape[1]} labels)')
print(f'Binarized labels for evaluation - binary matrix of ({y_test.shape[0]} evaluation examples) x ({y_test.shape[1]} labels)')

Binarized labels for training - binary matrix of (16000 training examples) x (20 labels)
Binarized labels for evaluation - binary matrix of (4000 evaluation examples) x (20 labels)


 - *Text*

In [11]:
train_post = df['post'].values[:train_size]
test_post = df['post'].values[train_size:]

print(f'Number of training texts : {len(train_post)}')
print(f'Number of evaluation texts : {len(test_post)}')

Number of training texts : 16000
Number of evaluation texts : 4000


Tokenize the texts

 > Imagine each input to your model as a bag of Scrabble tiles, where each tile is a word from your input sentence instead of a letter. Since it’s a “bag” of words, this approach cannot understand the order of words in a sentence, but it can detect the presence or absence of certain words. To make this work, you need to choose a vocabulary that takes the top N most frequently used words from your entire text corpus. This vocabulary will be the only words your model can understand.
 
 > Now we’re ready to create our Keras Tokenizer object. When we instantiate it we’ll need to choose a vocabulary size. Remember that this is the top N most frequent words our model will extract from our text data. This number is a hyperparameter, so you should experiment with different values based on the number of unique words in your text corpus. If you pick something too low, your model will only recognize words that are common across all text inputs (like ‘the’, ‘in’, etc.). A vocab size that’s too large will recognize too many words from each question such that input matrices become mostly 1s. 
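
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of a binary word-presence matrix built from a top-N vocabulary. It is only an approximation of what the Keras `Tokenizer` does: the helpers `fit_vocabulary` and `texts_to_binary_matrix` and the two-sentence corpus are hypothetical, splitting is done on whitespace only, and the sketch assumes (as the Keras tokenizer does, to my knowledge) that index 0 is reserved, so a vocabulary size of N leaves room for the N-1 most frequent words.

```python
from collections import Counter

def fit_vocabulary(texts, vocab_size):
    """Map the most frequent words to indices 1..vocab_size-1 (index 0 stays unused)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    most_common = [word for word, _ in counts.most_common(vocab_size - 1)]
    return {word: i for i, word in enumerate(most_common, start=1)}

def texts_to_binary_matrix(texts, word_index, vocab_size):
    """One row per text; column j is 1.0 if the word with index j occurs in the text."""
    matrix = []
    for t in texts:
        row = [0.0] * vocab_size
        for word in t.lower().split():
            j = word_index.get(word)
            if j is not None:  # words outside the vocabulary are ignored
                row[j] = 1.0
        matrix.append(row)
    return matrix

# Made-up corpus, used only to fit the vocabulary
corpus = ['how to sort a list in python', 'how to parse json in java']
word_index = fit_vocabulary(corpus, vocab_size=5)

# Word order is lost: only presence or absence of known words is recorded
matrix = texts_to_binary_matrix(['sort a list in java'], word_index, vocab_size=5)
print(word_index)
print(matrix)
```

Each row has exactly `vocab_size` columns regardless of the text length, which is what lets a fixed-size dense layer consume it.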

In [0]:
from tensorflow.keras.preprocessing import text
class TextPreprocessor(object):
    # Class to contain text processor functionalities
    def __init__(self, vocab_size):
        self._vocab_size = vocab_size
        self._tokenizer = None
    def create_tokenizer(self, text_list):
        tokenizer = text.Tokenizer(num_words = self._vocab_size)
        tokenizer.fit_on_texts(text_list)
        self._tokenizer = tokenizer
    def transform_text(self, text_list):
        text_matrix = self._tokenizer.texts_to_matrix(text_list)
        return text_matrix

In [0]:
# Instantiate the Text Processor
VOCAB_SIZE = 500
processor = TextPreprocessor(VOCAB_SIZE)
processor.create_tokenizer(train_post)

In [0]:
# Get the tokenized version of the training and evaluation texts 
X_train = processor.transform_text(train_post)
X_test = processor.transform_text(test_post)

In [15]:
print(f'Training data expressed as a token matrix: ({X_train.shape[0]} examples) x ({X_train.shape[1]} tokens)')
X_train

Training data expressed as a token matrix: (16000 examples) x (500 tokens)


array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 1., ..., 0., 0., 1.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])

In [16]:
print(f'Evaluation data expressed as a token matrix: ({X_test.shape[0]} examples) x ({X_test.shape[1]} tokens)')
X_test

Evaluation data expressed as a token matrix: (4000 examples) x (500 tokens)


array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])

* **Train model**

> We’ve got our model inputs and outputs formatted, so now it’s time to actually build the model. The Keras Sequential Model API is my favorite way to do this since the code makes it easy to visualize each layer of your model. We can define our model in 5 lines of code. 

> This is a deep model because it has 2 hidden layers in between the input and output layer. We don’t really care about the output of these hidden layers, but our model will use them to represent more complex relationships in our data. The first layer takes our 500-element vocabulary vector as input and transforms it into a 50-neuron layer. Then it takes this 50-neuron layer and transforms it into a 25-neuron layer. 50 and 25 here (layer size) are hyperparameters, you should experiment with what works best for your own dataset. What does that activation='relu' part mean? The activation function is how the model computes the output of each layer. We don’t need to know exactly how this is implemented (thanks Keras!) so I won’t get into the details of ReLU here, but you can read more about it if you’d like. The size of our last layer will be equivalent to the number of tags in our dataset (in this case 5). We do care about the output of this layer, so let’s understand why we used the sigmoid activation function. Sigmoid will convert each of our 5 outputs to a value between 0 and 1 indicating the probability that a specific label corresponds with that input.

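> Notice that because a question can have multiple tags in this model, the sigmoid output does not add up to 1. If a question could only have exactly one tag, we’d use the Softmax activation function instead and the 5-element output array would add up to 1.

The quoted post works with 5 tags. The dataset in this notebook has 20, so the last layer here has 20 units and the model returns a 20-element array.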
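
The difference between the two activations can be checked numerically. This is a small sketch using only the standard library; the three logits are made-up values, not outputs of the model in this notebook.

```python
import math

def sigmoid(logits):
    """Squash each logit independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    """Turn the logits into one probability distribution that sums to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, -3.0]  # hypothetical raw scores for three labels
sig = sigmoid(logits)
soft = softmax(logits)

print('sigmoid:', sig, 'sum =', sum(sig))
print('softmax:', soft, 'sum =', sum(soft))
```

With sigmoid, the first two labels can both score above 0.5 at the same time, which is what a multi-label model needs. With softmax, the labels compete for a single unit of probability mass.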

In [33]:
def create_model(vocab_size, num_tags):
    # Create a keras sequential model

    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(50, input_shape = (vocab_size,), activation='relu'))
    model.add(tf.keras.layers.Dense(25, activation='relu'))
    model.add(tf.keras.layers.Dense(num_tags, activation='sigmoid'))
    model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
    return model
    
# Create model
model = create_model(VOCAB_SIZE, num_tags)
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 100)               50100     
_________________________________________________________________
dense_13 (Dense)             (None, 50)                5050      
_________________________________________________________________
dense_14 (Dense)             (None, 20)                1020      
Total params: 56,170
Trainable params: 56,170
Non-trainable params: 0
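_________________________________________________________________

Note that this recorded summary shows hidden layers of 100 and 50 units, so it comes from a run with larger layers than the `Dense(50)` and `Dense(25)` defined in `create_model` above.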
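
The `Param #` column of a summary like this one can be reproduced by hand: a dense layer with `n` inputs and `m` units has `n * m` weights plus `m` biases. The sketch below applies that formula to the layer sizes printed in the summary (100, 50, 20) and to the sizes written in `create_model` (50, 25, 20), both with a 500-element input; `dense_param_counts` is a hypothetical helper, not a Keras function.

```python
def dense_param_counts(input_size, layer_sizes):
    """Parameter count of each fully connected layer: weights plus biases."""
    counts = []
    n_inputs = input_size
    for units in layer_sizes:
        counts.append(n_inputs * units + units)
        n_inputs = units  # this layer's output feeds the next layer
    return counts

summary_counts = dense_param_counts(500, [100, 50, 20])  # sizes shown in the summary
code_counts = dense_param_counts(500, [50, 25, 20])      # sizes written in create_model

print(summary_counts, sum(summary_counts))
print(code_counts, sum(code_counts))
```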


In [34]:
# Train model
model.fit(X_train, y_train, epochs = 20, batch_size=128, validation_split=0.1)
print('Eval loss/accuracy:{}'.format(model.evaluate(X_test, y_test, batch_size = 128)))

Train on 14400 samples, validate on 1600 samples
Epoch 1/20
...
Epoch 20/20
Eval loss/accuracy:[0.07750917875766754, 0.97651243]
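
The accuracy figure deserves some care. Assuming it is computed element-wise over the label matrix (which, as far as I know, is how Keras resolves `'accuracy'` when the loss is `binary_crossentropy`), every correctly predicted 0 counts as a hit. With 20 labels and few positives per example, even a model that predicts nothing scores high. The sketch below illustrates this on made-up labels; both helper functions are hypothetical and the data is not taken from the notebook.

```python
def per_label_accuracy(y_true, y_pred):
    """Fraction of individual label cells predicted correctly."""
    total = sum(len(row) for row in y_true)
    hits = sum(t == p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return hits / total

def exact_match_accuracy(y_true, y_pred):
    """Fraction of examples whose whole label row is predicted correctly."""
    return sum(tr == pr for tr, pr in zip(y_true, y_pred)) / len(y_true)

num_labels = 20
# Made-up ground truth: each of 100 examples has exactly one of 20 labels
y_true = [[1 if j == i % num_labels else 0 for j in range(num_labels)] for i in range(100)]
# A degenerate "model" that never predicts any label
y_all_zero = [[0] * num_labels for _ in y_true]

per_label = per_label_accuracy(y_true, y_all_zero)
exact = exact_match_accuracy(y_true, y_all_zero)
print(per_label, exact)
```

Under these assumptions the all-zero predictor gets 19 of every 20 cells right, so the per-label score is best read against that baseline rather than against 0.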


* **Test our model locally**

In [0]:
class CustomModelPrediction(object):

  def __init__(self, model, processor):
    self._model = model
    self._processor = processor

  def predict(self, instances, **kwargs):
    preprocessed_data = self._processor.transform_text(instances)
    predictions = self._model.predict(preprocessed_data)
    return predictions.tolist()


In [0]:
test_requests = [
  "Get the Row(s) which have the max value in groups using groupby. How do I find all rows in a pandas dataframe which have the max value for count column, after grouping by ['Sp','Mt'] columns?",
  "I have a basic question below to help try get my head around functions in python (following the LPTHW tutorials in prep for uni). Could someone explain the syntax below, and whether I am correct with my assumptions? I understand that the print_two_again is the name of the function, but what is the purpose of having the arg1, arg2 in the parenthesis next to it? Is it to call the `steve` `testing` into the print command below? or do those strings go directing into the print command?"
]

In [47]:
classifier = CustomModelPrediction(model, processor)
results = classifier.predict(test_requests)
print(results)

[[0.07212814688682556, 0.0014442205429077148, 9.882450103759766e-05, 0.37868261337280273, 0.00010716915130615234, 0.14114779233932495, 0.0005007386207580566, 0.0008513331413269043, 0.0007471442222595215, 1.1414289474487305e-05, 0.0007198154926300049, 0.012458831071853638, 3.3229589462280273e-05, 0.007745265960693359, 0.6430821418762207, 0.00032147765159606934, 7.2479248046875e-05, 0.008889257907867432, 0.010358572006225586, 0.02623969316482544], [2.473592758178711e-06, 2.294778823852539e-06, 2.086162567138672e-07, 1.767277717590332e-05, 0.0015439391136169434, 8.046627044677734e-07, 0.000331878662109375, 0.0, 0.00019782781600952148, 2.950429916381836e-06, 0.0009021461009979248, 6.705522537231445e-05, 3.606081008911133e-06, 2.5331974029541016e-06, 2.2649765014648438e-06, 0.0001264810562133789, 3.129243850708008e-05, 0.9987879991531372, 0.0008686482906341553, 2.682209014892578e-05]]


In [50]:
for i, probs in enumerate(results):
  for idx, predprob in enumerate(probs):
    if predprob > 0.5:
      print(f'Example {i} - Predicted label : {tag_encoder.classes_[idx]} - prob {predprob}')
      print()

Example 0 - Predicted label : mysql - prob 0.6430821418762207

Example 1 - Predicted label : python - prob 0.9987879991531372
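
The thresholding step can also be wrapped in a small function that returns the labels instead of printing them. This is a sketch only: `decode_predictions` is a hypothetical helper, and the class names and probabilities below are made up for illustration. In the notebook it could presumably be called as `decode_predictions(results, tag_encoder.classes_)`.

```python
def decode_predictions(probabilities, classes, threshold=0.5):
    """For each example, list the (label, probability) pairs above the threshold."""
    decoded = []
    for probs in probabilities:
        labels = [(classes[j], p) for j, p in enumerate(probs) if p > threshold]
        labels.sort(key=lambda pair: pair[1], reverse=True)  # most confident first
        decoded.append(labels)
    return decoded

# Made-up class names and sigmoid outputs for three examples
classes = ['java', 'mysql', 'python']
probabilities = [[0.10, 0.64, 0.01], [0.70, 0.05, 0.99], [0.20, 0.30, 0.40]]

decoded = decode_predictions(probabilities, classes)
print(decoded)
```

An example can come back with several labels or with none at all, since each sigmoid output is compared with the threshold independently.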

