> This kernel is based on the work of http://hunterheidenreich.com/blog/elmo-word-vectors-in-keras/

# 1. Kernel Overview

## 1.1 Definition

In today's world, **text classification/segmentation/categorization** (for example, ticket categorization in a call centre, email classification, or log category detection) is a common task. With the huge volume of data out there, it is nearly impossible to do this manually. Let's try to solve this problem automatically using machine learning and natural language processing tools.

## 1.2 Problem Statement

The BBC articles dataset (2126 records) consists of two features: the text and the associated category, namely
1. Sport 
2. Business 
3. Politics 
4. Tech 
5. Others

**Our task is to train a multiclass classification model on the mentioned dataset.**

## 1.3 Metrics

**Accuracy** - the number of correct predictions as a fraction of all predictions made.

**Precision** - also called positive predictive value; the fraction of retrieved
instances that are relevant.

**F1 score** - the harmonic mean of precision and recall, combining both into a
single score.

**Recall** - also known as sensitivity; the fraction of all relevant instances that
were retrieved.

**Why these metrics?** - Accuracy estimates how often the model predicts the correct category overall. Precision tells us, for each category, how many of the instances predicted as that category really belong to it, i.e. how relevant the model's results are. Recall tells us how many of the instances that truly belong to a category the model actually found, i.e. true positives relative to true positives plus false negatives. The F1 score combines precision and recall into a single estimate.
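
To make these definitions concrete, here is a minimal sketch that derives all four metrics from a confusion matrix in plain Python. The function name `per_class_metrics` and the toy 3-class matrix are assumptions made for illustration only; they are not part of this kernel, which relies on `sklearn.metrics` for the real evaluation. Rows are true classes and columns are predicted classes, the same orientation `sklearn.metrics.confusion_matrix` uses.

```python
def per_class_metrics(cm):
    """Return (accuracy, [(precision, recall, f1), ...]) for a square confusion matrix.

    cm[i][j] is the number of samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    # Accuracy: correct predictions (the diagonal) over all predictions.
    accuracy = sum(cm[i][i] for i in range(n)) / total

    report = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum: TP + FP
        actual_k = sum(cm[k])                          # row sum: TP + FN
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        # F1: harmonic mean of precision and recall.
        f1 = 2 * precision * recall / denom if denom else 0.0
        report.append((precision, recall, f1))
    return accuracy, report


# Toy example (hypothetical data): 6 samples, 3 classes.
toy_cm = [
    [1, 1, 0],
    [0, 2, 0],
    [1, 0, 1],
]
accuracy, report = per_class_metrics(toy_cm)
print("accuracy:", round(accuracy, 4))
for k, (p, r, f) in enumerate(report):
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Reading the metrics per class like this is useful for a multiclass problem, because a single accuracy number can hide a class that the model almost never predicts.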

## 1.4 Machine Learning Model Considered:

We will be using **ELMo embeddings with Keras** for this use case.

Explaining ELMo and Keras is outside the scope of this kernel. Kindly refer to external sources.

# 2. Data Exploration

### Step 2.1 Load Dataset

In [10]:
import pandas as pd

data=pd.read_csv(r"../input/sst5-dataset/SST5_master_train.csv")

In [11]:
# data = data.rename(columns={"Processed_Reviews":"text"})
data.head()

Unnamed: 0.1,Unnamed: 0,label,review,type,Processed_Reviews
0,0,4,The Rock is destined to be the 21st Century 's...,train,the rock is destined to be the 21st century ne...
1,1,5,The gorgeously elaborate continuation of `` Th...,train,the gorgeously elaborate continuation of the l...
2,2,4,Singer/composer Bryan Adams contributes a slew...,train,singer composer bryan adam contributes slew of...
3,3,3,You 'd think by now America would have had eno...,train,you think by now america would have had enough...
4,4,4,Yet the act is still charming here .,train,yet the act is still charming here


# 3. Implementation

### Step 2.2 Map Textual labels to numeric using Label Encoder

In [27]:
from sklearn.preprocessing import LabelEncoder
df2 = pd.DataFrame()
df2["text"] = data["Processed_Reviews"]
df2["label"] = data['label']
df2['text'] = df2['text'].astype(str)
df2.dtypes

text     object
label     int64
dtype: object

In [28]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
# Use a set so each membership test is O(1) instead of a scan over a list
stop = set(stopwords.words('english'))
df2['text'] = df2['text'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))
df2['text'].head()

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0    rock destined 21st century new conan going mak...
1    gorgeously elaborate continuation lord ring tr...
2    singer composer bryan adam contributes slew so...
3    think america would enough plucky british ecce...
4                               yet act still charming
Name: text, dtype: object

In [29]:
df2.head()

Unnamed: 0,text,label
0,rock destined 21st century new conan going mak...,4
1,gorgeously elaborate continuation lord ring tr...,5
2,singer composer bryan adam contributes slew so...,4
3,think america would enough plucky british ecce...,3
4,yet act still charming,4


In [30]:
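# Find the 10 least frequent tokens in the corpus and drop them from every review
freq = pd.Series(' '.join(df2['text']).split()).value_counts()[-10:]
rare_words = set(freq.index)
df2['text'] = df2['text'].apply(lambda text: " ".join(word for word in text.split() if word not in rare_words))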
df2['text'].head()

0    rock destined 21st century new conan going mak...
1    gorgeously elaborate continuation lord ring tr...
2    singer composer bryan adam contributes slew so...
3    think america would enough plucky british ecce...
4                               yet act still charming
Name: text, dtype: object
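
The two cleaning cells above remove stop words and then the rarest tokens. As a rough, dependency-free sketch of the same idea, the snippet below filters by a minimum token count rather than by "the 10 least frequent tokens", which is a variation and not this kernel's exact method. The `STOP_WORDS` set, the `clean_corpus` function, the `min_count` parameter and the sample sentences are all hypothetical and exist only for illustration; the kernel itself uses NLTK's English stop-word list.

```python
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "to", "be", "a", "of"}


def clean_corpus(texts, stop_words=STOP_WORDS, min_count=2):
    """Drop stop words, then drop tokens that occur fewer than min_count times."""
    # Step 1: whitespace-tokenize and remove stop words.
    tokenized = [
        [tok for tok in text.split() if tok not in stop_words]
        for text in texts
    ]
    # Step 2: count tokens over the whole corpus.
    counts = Counter(tok for toks in tokenized for tok in toks)
    # Step 3: keep only tokens that are frequent enough.
    return [
        " ".join(tok for tok in toks if counts[tok] >= min_count)
        for toks in tokenized
    ]


sample = [
    "the rock is destined to be a star",
    "the rock is charming",
    "a charming star",
]
cleaned = clean_corpus(sample)
print(cleaned)
```

A count threshold behaves more predictably than a fixed "last 10" slice when many tokens are tied at the lowest frequency, though whether either filter helps ELMo, which is contextual, is something to verify experimentally.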

### Step 2.3 Import the Libraries

In [31]:
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm
import re
import time
import pickle
pd.set_option('display.max_colwidth', 200)

In [32]:
import tensorflow_hub as hub
import tensorflow as tf

embed = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

W1215 23:08:07.414207 139775319913856 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14
W1215 23:08:33.904659 139775319913856 deprecation.py:323] From /opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py:3632: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.


### Step 2.4 Convert Sentences to ELMo Vectors

In [33]:
import tensorflow as tf
import tensorflow_hub as hub
import pandas as pd
from sklearn import preprocessing
import keras
import numpy as np


y = list(df2['label'])
x = list(df2['text'])

le = preprocessing.LabelEncoder()
le.fit(y)

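def encode(le, labels):
    # Map raw labels to integer ids, then to one-hot rows
    enc = le.transform(labels)
    return keras.utils.to_categorical(enc)

def decode(le, one_hot):
    # Expects a 2-D array of one-hot rows or class probabilities
    dec = np.argmax(one_hot, axis=1)
    return le.inverse_transform(dec)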


x_enc = x
y_enc = encode(le, y)

Using TensorFlow backend.
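
To illustrate the label round trip performed by `encode` and `decode`, here is a minimal pure-Python sketch. The helpers `fit_classes`, `one_hot_encode` and `one_hot_decode` are hypothetical stand-ins written for this example; they only mimic what `LabelEncoder`, `keras.utils.to_categorical` and `np.argmax(..., axis=1)` do in the cell above, and the sample labels are made up.

```python
def fit_classes(labels):
    """Sorted unique labels, the same ordering LabelEncoder stores in classes_."""
    return sorted(set(labels))


def one_hot_encode(classes, labels):
    """Turn each label into a one-hot row with one column per class."""
    index = {label: i for i, label in enumerate(classes)}
    return [
        [1.0 if col == index[label] else 0.0 for col in range(len(classes))]
        for label in labels
    ]


def one_hot_decode(classes, rows):
    """Pick the highest-scoring column of each row and map it back to a label."""
    return [
        classes[max(range(len(row)), key=row.__getitem__)]
        for row in rows
    ]


labels = [4, 5, 4, 3, 1]           # hypothetical sentiment labels
classes = fit_classes(labels)      # [1, 3, 4, 5]
one_hot = one_hot_encode(classes, labels)
restored = one_hot_decode(classes, one_hot)
print(one_hot[0], restored)
```

Note that decoding takes rows of scores, so it works equally on one-hot targets and on softmax outputs, but not on labels that have already been decoded into a flat list.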


### Step 2.5 Split the dataset into train and test sets

In [34]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(np.asarray(x_enc), np.asarray(y_enc), test_size=0.2, random_state=42)

In [35]:
x_train.shape

(7716,)

### Step 2.6 Train Keras neural model with ELMo embeddings

In [39]:
from keras.layers import Input, Lambda, Dense
from keras.models import Model
import keras.backend as K

def ELMoEmbedding(x):
    # Squeeze only the feature axis so a batch of size 1 keeps its batch dimension
    return embed(tf.squeeze(tf.cast(x, tf.string), axis=1), signature="default", as_dict=True)["default"]

input_text = Input(shape=(1,), dtype=tf.string)
embedding = Lambda(ELMoEmbedding, output_shape=(1024, ))(input_text)

dense = Dense(256, activation='relu')(embedding)

pred = Dense(5, activation='softmax')(dense)

model = Model(inputs=[input_text], outputs=pred)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())  
    session.run(tf.tables_initializer())
    history = model.fit(x_train, y_train, epochs=5, batch_size=128)
    model.save_weights('./elmo-model.h5')

with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model.load_weights('./elmo-model.h5')  
    predicts = model.predict(x_test, batch_size=128)

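# Decode one-hot rows back to the original labels. The ndim check keeps this
# cell safe to re-run: once y_test is already decoded (1-D), calling decode()
# on it again raises "AxisError: axis 1 is out of bounds for array of dimension 1".
if y_test.ndim > 1:
    y_test = decode(le, y_test)
y_preds = decode(le, predicts)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5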

# 4. Results

In [37]:
from sklearn import metrics

print(metrics.confusion_matrix(y_test, y_preds))

print(metrics.classification_report(y_test, y_preds))

from sklearn.metrics import accuracy_score

print("Accuracy of ELMO is:",accuracy_score(y_test,y_preds))

[[125  52   2  44   8]
 [171 142  13 154  25]
 [ 61  66  12 189  45]
 [ 26  50  11 292 151]
 [ 12   7   1 106 164]]
              precision    recall  f1-score   support

           1       0.32      0.54      0.40       231
           2       0.45      0.28      0.35       505
           3       0.31      0.03      0.06       373
           4       0.37      0.55      0.44       530
           5       0.42      0.57      0.48       290

   micro avg       0.38      0.38      0.38      1929
   macro avg       0.37      0.39      0.35      1929
weighted avg       0.38      0.38      0.34      1929

Accuracy of ELMO is: 0.3810264385692068


In [38]:
print(tf.__version__)
print(keras.__version__)

1.13.1
2.2.4


> **Past work on this dataset achieved at most 95.22% accuracy. Keras with ELMo embeddings achieved 95.5%, but a BERT base model without any preprocessing still did better at 97.75%.**
>
> [BERT model](https://www.kaggle.com/sarthak221995/textclassification-97-77-accuracy-bert)

# 5. Future Improvements on this kernel:

* Explore preprocessing steps on data.
* Explore other models as baseline.
* Make this notebook more informative and illustrative.
* Explanation of the ELMo embeddings model.
* Spend more time on data exploration.
* And many more...

# 6. References

> This kernel is based on the work of http://hunterheidenreich.com/blog/elmo-word-vectors-in-keras/