<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/NLP/NLPModel_MultiClass_Keras_CM2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Model with Keras

In this notebook we train a custom Keras model to predict text categories. To vectorize the text, we use the integer word-index sequences produced by `keras.preprocessing.text.Tokenizer`, padded to a fixed length.

Keras Tokenizer: https://keras.io/preprocessing/text/

Notebook adapted from: https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb

In [0]:
import warnings
warnings.filterwarnings('ignore')

## Fetch data

We use the 20 Newsgroups dataset, which scikit-learn can download directly.

In [0]:
import pandas as pd
import numpy as np 

In [2]:
from sklearn.datasets import fetch_20newsgroups
train_raw_df = fetch_20newsgroups(subset='train')
test_raw_df = fetch_20newsgroups(subset='test')

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [3]:
print(f'Number of raw training examples: {len(train_raw_df.data)}')
print(f'Number of raw test examples: {len(test_raw_df.data)}')

Number of raw training examples: 11314
Number of raw test examples: 7532


In [4]:
category_names = np.unique(np.array(train_raw_df.target_names))
print(f'Number of different categories : {len(category_names)}')
print(f'Category list: {category_names}')

Number of different categories : 20
Category list: ['alt.atheism' 'comp.graphics' 'comp.os.ms-windows.misc'
 'comp.sys.ibm.pc.hardware' 'comp.sys.mac.hardware' 'comp.windows.x'
 'misc.forsale' 'rec.autos' 'rec.motorcycles' 'rec.sport.baseball'
 'rec.sport.hockey' 'sci.crypt' 'sci.electronics' 'sci.med' 'sci.space'
 'soc.religion.christian' 'talk.politics.guns' 'talk.politics.mideast'
 'talk.politics.misc' 'talk.religion.misc']


In [5]:
print('Example of entry:')
print(f'\t - LABEL : {train_raw_df.target[0]} - {train_raw_df.target_names[train_raw_df.target[0]]}')
print(f'\t - {train_raw_df.data[0]}')

Example of entry:
	 - LABEL : 7 - rec.autos
	 - From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







## Prepare data for the model

In [0]:
x_train = train_raw_df.data
y_train = train_raw_df.target

x_test = test_raw_df.data
y_test = test_raw_df.target

## Model training

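The model does not take raw text as input; it takes a vectorized representation of the text. We therefore wrap the vectorizer and the model in a single class, so that raw text can be passed in directly for both training and prediction.

The vectorization uses the Keras `Tokenizer`: each text becomes a sequence of integer word indices, padded or truncated to a fixed length and fed to an embedding layer.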
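
The classifier defined in the next cell vectorizes text with the Keras `Tokenizer` and `pad_sequences`. As a rough illustration of that idea, here is a minimal pure-Python sketch. It is a simplified stand-in, not the Keras implementation: the helper names `build_word_index` and `texts_to_padded_sequences` are hypothetical, the text is only lowercased and split on whitespace, and the left-padding and left-truncation mirror what I understand to be the `pad_sequences` defaults.

```python
from collections import Counter


def build_word_index(texts, word_cnt):
    """Map the `word_cnt` most frequent words to indices 1..word_cnt (hypothetical helper)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i + 1 for i, (word, _) in enumerate(ranked[:word_cnt])}


def texts_to_padded_sequences(texts, word_index, oov_index, max_len):
    """Turn texts into fixed-length integer sequences; 0 is reserved for padding."""
    rows = []
    for text in texts:
        # Words outside the vocabulary are mapped to the out-of-vocabulary index.
        seq = [word_index.get(word, oov_index) for word in text.lower().split()]
        seq = seq[-max_len:]                            # keep only the last max_len tokens
        rows.append([0] * (max_len - len(seq)) + seq)   # pad on the left with zeros
    return rows


texts = ["the car is fast", "the car is a funky car"]
word_cnt = 3
word_index = build_word_index(texts, word_cnt)
oov_index = word_cnt + 1  # same convention as the class: OOV gets index word_cnt + 1
padded = texts_to_padded_sequences(texts, word_index, oov_index, max_len=5)

print(word_index)
print(padded)
```

Every row has the same length, so the rows can be stacked into a single integer array and passed to an `Embedding` layer whose `input_dim` covers the indices 0 through `word_cnt + 1`.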

In [7]:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin

from keras.models import Model, Input
from keras.layers import Dense, LSTM, Dropout, Embedding, SpatialDropout1D, Bidirectional, concatenate
from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

class KerasTextClassifier:
    __author__ = "Edward Ma"
    __copyright__ = "Copyright 2018, Edward Ma"
    __credits__ = ["Edward Ma"]
    __license__ = "Apache"
    __version__ = "2.0"
    __maintainer__ = "Edward Ma"
    __email__ = "makcedward@gmail.com"
    
    OOV_TOKEN = "UnknownUnknown"
    
    def __init__(self, 
                 max_word_input, word_cnt, word_embedding_dimension, labels, 
                 batch_size, epoch, validation_split,
                 verbose=0):
        self.verbose = verbose
        self.max_word_input = max_word_input
        self.word_cnt = word_cnt
        self.word_embedding_dimension = word_embedding_dimension
        self.labels = labels
        self.batch_size = batch_size
        self.epoch = epoch
        self.validation_split = validation_split
        
        self.label_encoder = None
        self.classes_ = None
        self.tokenizer = None
        
        self.model = self._init_model()
        self._init_label_encoder(y=labels)
        self._init_tokenizer()
        
    def _init_model(self):
        input_layer = Input((self.max_word_input,))
        text_embedding = Embedding(
            input_dim=self.word_cnt+2, output_dim=self.word_embedding_dimension,
            input_length=self.max_word_input, mask_zero=False)(input_layer)
        
        text_embedding = SpatialDropout1D(0.5)(text_embedding)
        
        bilstm = Bidirectional(LSTM(units=256, return_sequences=True, recurrent_dropout=0.5))(text_embedding)
        x = concatenate([GlobalAveragePooling1D()(bilstm), GlobalMaxPooling1D()(bilstm)])
        x = Dropout(0.5)(x)
        x = Dense(128, activation="relu")(x)
        x = Dropout(0.5)(x)
        
        output_layer = Dense(units=len(self.labels), activation="softmax")(x)
        model = Model(input_layer, output_layer)
        model.compile(
            optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
        return model
    
    def _init_tokenizer(self):
        self.tokenizer = Tokenizer(
            num_words=self.word_cnt+1, split=' ', oov_token=self.OOV_TOKEN)
    
    def _init_label_encoder(self, y):
        self.label_encoder = LabelEncoder()
        self.label_encoder.fit(y)
        self.classes_ = self.label_encoder.classes_
        
    def _encode_label(self, y):
        return self.label_encoder.transform(y)
        
    def _decode_label(self, y):
        return self.label_encoder.inverse_transform(y)
    
    def _get_sequences(self, texts):
        seqs = self.tokenizer.texts_to_sequences(texts)
        return pad_sequences(seqs, maxlen=self.max_word_input, value=0)
    
    def _preprocess(self, texts):
        # Placeholder only.
        return [text for text in texts]
        
    def _encode_feature(self, x):
        self.tokenizer.fit_on_texts(self._preprocess(x))
        self.tokenizer.word_index = {e: i for e,i in self.tokenizer.word_index.items() if i <= self.word_cnt}
        self.tokenizer.word_index[self.tokenizer.oov_token] = self.word_cnt + 1
        return self._get_sequences(self._preprocess(x))
        
    def fit(self, X, y):
        """
            Train the model on X (features) and y (labels).
        
            :param X: List of sentences
            :param y: List of labels
        """
        
        encoded_x = self._encode_feature(X)
        encoded_y = self._encode_label(y)
        
        self.model.fit(encoded_x, encoded_y, 
                       batch_size=self.batch_size, epochs=self.epoch, 
                       validation_split=self.validation_split)
        
    def predict_proba(self, X, y=None):
        encoded_x = self._get_sequences(self._preprocess(X))
        return self.model.predict(encoded_x)
    
    def predict(self, X, y=None):
        y_pred = np.argmax(self.predict_proba(X), axis=1)
        return self._decode_label(y_pred)

Using TensorFlow backend.
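
The `predict` method above takes the row-wise argmax of the class probabilities and maps the resulting indices back to labels. The sketch below shows that step in isolation with NumPy. The probability values and class names are made up for illustration, and indexing into a `classes` array stands in for `LabelEncoder.inverse_transform`.

```python
import numpy as np

# Hypothetical softmax outputs for 3 documents over 4 classes (each row sums to 1).
probabilities = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.25, 0.25, 0.40, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])

# Hypothetical class labels, playing the role of label_encoder.classes_.
classes = np.array(["autos", "crypt", "med", "space"])

# Index of the most probable class for each document.
predicted_index = np.argmax(probabilities, axis=1)

# Map the indices back to the original labels.
predicted_labels = classes[predicted_index]

print(predicted_index)
print(predicted_labels)
```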


In [12]:
classifier = KerasTextClassifier(max_word_input=100, word_cnt=30000, word_embedding_dimension=100, 
                    labels=list(set(y_train.tolist())), batch_size=128, epoch=1, validation_split=0.1)
classifier.model.summary()

Model: "model_5"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 100, 100)     3000200     input_5[0][0]                    
__________________________________________________________________________________________________
spatial_dropout1d_5 (SpatialDro (None, 100, 100)     0           embedding_5[0][0]                
__________________________________________________________________________________________________
bidirectional_5 (Bidirectional) (None, 100, 512)     731136      spatial_dropout1d_5[0][0]        
____________________________________________________________________________________________
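
The parameter counts in the summary can be checked by hand. The following sketch assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights and a bias) and uses the constructor arguments from the cell above.

```python
word_cnt = 30000
embedding_dim = 100
lstm_units = 256

# Embedding: one vector of size embedding_dim for every index in [0, word_cnt + 1].
embedding_params = (word_cnt + 2) * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and one bias per unit.
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units

# Bidirectional wraps two independent LSTMs and concatenates their outputs.
bilstm_params = 2 * lstm_params
bilstm_output_width = 2 * lstm_units

print(embedding_params)     # 3000200
print(bilstm_params)        # 731136
print(bilstm_output_width)  # 512
```

These values agree with the `embedding_5` and `bidirectional_5` rows of the summary.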

In [16]:
classifier.fit(x_train, y_train)

Train on 10182 samples, validate on 1132 samples
Epoch 1/1
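
The sample counts in the log follow from `validation_split=0.1`. According to the Keras documentation, the validation data is taken from the last samples of the input, before shuffling. The sketch below reproduces the numbers, assuming the split point is truncated to an integer; that matches the log but is an assumption about the internals of this Keras version.

```python
n_samples = 11314        # number of raw training examples reported earlier
validation_split = 0.1

# Assumed rule: the first `split_at` samples are used for training, the rest for validation.
split_at = int(n_samples * (1.0 - validation_split))
n_train = split_at
n_val = n_samples - split_at

print(n_train, n_val)  # 10182 1132
```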


## Model evaluation

In [0]:
from sklearn.metrics import classification_report

predictions = classifier.predict(x_test)

print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.76      0.62      0.68       319
           1       0.65      0.70      0.67       389
           2       0.70      0.71      0.71       394
           3       0.61      0.69      0.65       392
           4       0.76      0.77      0.76       385
           5       0.78      0.64      0.71       395
           6       0.75      0.86      0.80       390
           7       0.85      0.74      0.79       396
           8       0.88      0.85      0.87       398
           9       0.82      0.85      0.83       397
          10       0.89      0.86      0.88       399
          11       0.88      0.80      0.84       396
          12       0.42      0.60      0.49       393
          13       0.84      0.72      0.78       396
          14       0.83      0.84      0.83       394
          15       0.81      0.89      0.85       398
          16       0.63      0.77      0.69       364
          17       0.96    