<a href="https://colab.research.google.com/github/rickskyy/nlp-course/blob/master/BBC_articles_classification_elmo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Kernel Overview

## 1.1 Definition

Today, **text classification/segmentation/categorization** (for example ticket categorization in a call centre, email classification, or log category detection) is a common task. With the huge volume of data available, it is nearly impossible to do this manually. Let's try to solve this problem automatically using machine learning and natural language processing tools.

## 1.2 Problem Statement

The BBC articles dataset (2225 records) consists of two features: the article text and its associated category, which is one of:
1. Sport
2. Business
3. Politics
4. Tech
5. Entertainment

**Our task is to train a multiclass classification model on the mentioned dataset.**

## 1.3 Metrics

**Accuracy** - the number of correct predictions as a fraction of all predictions made.

**Precision** - (also called positive predictive value) the fraction of instances predicted as a class that actually belong to it: TP / (TP + FP).

**F1_score** - the harmonic mean of precision and recall, so it is high only when both are high.

**Recall** - (also known as sensitivity) the fraction of instances of a class that the model correctly retrieves: TP / (TP + FN).

**Why these metrics?** - Accuracy gives the overall share of correct predictions. Precision tells us how many of the articles assigned to a category really belong to it, i.e. how relevant the model's results are. Recall tells us how many of the articles that belong to a category the model actually finds (true positives relative to true positives plus false negatives). F1 score combines precision and recall into a single number.
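
As a small illustration of the definitions above, the sketch below computes accuracy and per-class precision, recall and F1 from a confusion matrix in plain Python. The 3-class matrix and the helper name `per_class_metrics` are made up for the example; rows are assumed to be true classes and columns predicted classes, the same layout scikit-learn uses.

```python
def per_class_metrics(cm):
    """Return accuracy and a list of (precision, recall, f1) per class.

    cm[i][j] = number of samples with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true class differs
        fn = sum(cm[k]) - tp                        # true k, predicted as another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append((precision, recall, f1))
    return accuracy, scores

# hypothetical 3-class confusion matrix (not from the BBC dataset)
toy_cm = [
    [8, 2, 0],
    [1, 9, 0],
    [1, 0, 9],
]
accuracy, scores = per_class_metrics(toy_cm)
print("accuracy:", round(accuracy, 3))
for k, (p, r, f) in enumerate(scores):
    print(k, round(p, 3), round(r, 3), round(f, 3))
```

For a multiclass problem these per-class values are usually averaged (macro or weighted), which is what `classification_report` prints in its last rows.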

## 1.4 Machine Learning Model Considered

We will be using **ELMo embeddings with Keras** for this use case.

An introduction to ELMo and Keras is outside the scope of this kernel; please refer to external sources.

# 2. Data Exploration

### Step 2.1 Load Dataset

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
import pandas as pd

In [0]:
# change path to your copy of dataset
data = pd.read_csv(r"/content/drive/My Drive/bbc_articles/bbc-text.csv", usecols=['category','text'])
data.head(10)

| | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |
| 5 | politics | howard hits back at mongrel jibe michael howar... |
| 6 | politics | blair prepares to name poll date tony blair is... |
| 7 | sport | henman hopes ended in dubai third seed tim hen... |
| 8 | sport | wilkinson fit to face edinburgh england captai... |
| 9 | entertainment | last star wars not for children the sixth an... |


# 3. Implementation

In [0]:
from sklearn import metrics, preprocessing, model_selection
from sklearn.metrics import accuracy_score
import keras
from keras.layers import Input, Lambda, Dense
from keras.models import Model
import keras.backend as K
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import string
import pandas as pd
import re
import spacy
# import the stop list from the public `text` module; the old
# sklearn.feature_extraction.stop_words path is deprecated in newer scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from spacy.lang.en import English

spacy.load('en')
parser = English()

Using TensorFlow backend.




In [0]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Step 3.1 Label encoding

In [0]:
def encode(le_enc, labels):
    enc = le_enc.transform(labels)
    return keras.utils.to_categorical(enc)

def decode(le_enc, one_hot):
    dec = np.argmax(one_hot, axis=1)
    return le_enc.inverse_transform(dec)
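
To make the round trip concrete, here is a dependency-free sketch of what `encode` and `decode` do: labels are mapped to integer ids in sorted order (as `LabelEncoder` does), expanded to one-hot rows, and recovered with an arg-max. The helper names `one_hot_encode` and `one_hot_decode` and the sample labels are illustrative only, not part of the notebook.

```python
def one_hot_encode(classes, labels):
    """Map each label to a one-hot row; `classes` must be sorted and unique."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index[label]] = 1.0
        rows.append(row)
    return rows

def one_hot_decode(classes, rows):
    """Pick the class with the highest score in each row (arg-max)."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in rows]

labels = ["tech", "sport", "business", "sport"]
classes = sorted(set(labels))            # ['business', 'sport', 'tech']
encoded = one_hot_encode(classes, labels)
decoded = one_hot_decode(classes, encoded)
print(encoded[0])   # one-hot row for "tech"
print(decoded)
```

The same arg-max step is what turns the softmax probabilities returned by `model.predict` back into category names.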

### Step 3.2 Data cleaning

In [0]:
# Stop words and special characters 
STOPLIST = set(stopwords.words('english') + list(ENGLISH_STOP_WORDS)) 
SYMBOLS = list(string.punctuation) + ["-", "...", "“", "”", "''"]

In [0]:
# Data Cleaner and tokenizer
def tokenize_text(text):
    
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = text.lower()
    
    tokens = parser(text)
    
    # lemmatization
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tokens]
    
    # remove stop words and special characters
    tokens = filter(lambda tok: tok.lower() not in STOPLIST, tokens)
    tokens = filter(lambda tok: tok not in SYMBOLS, tokens)
    
    # remove small words
    tokens = filter(lambda tok: len(tok) >= 3, tokens)
    
    # remove remaining tokens that are not alphabetic
    tokens = filter(lambda tok: tok.isalpha(), tokens)
    
    # note: set() keeps each word once and discards the original word order
    return ' '.join(set(tokens))
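
The cleaning steps can be tried without spaCy or NLTK using the simplified sketch below. It is an approximation, not the notebook's pipeline: it skips lemmatization, uses a tiny hypothetical stop list (`TOY_STOPLIST`), and de-duplicates with `dict.fromkeys` so the word order stays deterministic, whereas `set` in `tokenize_text` does not guarantee an order.

```python
import re

# tiny stand-in for the combined NLTK + scikit-learn stop list (assumption)
TOY_STOPLIST = {"the", "in", "of", "with", "and", "are", "for"}

def simple_clean(text):
    """Lowercase, tokenize, and filter a document, keeping each word once."""
    text = text.strip().replace("\n", " ").replace("\r", " ").lower()
    tokens = re.findall(r"\S+", text)                      # whitespace tokenization
    tokens = [t.strip(".,;:!?\"'()") for t in tokens]      # trim surrounding punctuation
    tokens = [t for t in tokens if t not in TOY_STOPLIST]  # drop stop words
    tokens = [t for t in tokens if len(t) >= 3]            # drop short words
    tokens = [t for t in tokens if t.isalpha()]            # drop non-alphabetic tokens
    return " ".join(dict.fromkeys(tokens))                 # unique words, first-seen order

sample = "TV future in the hands of viewers.\nViewers with 2 home theatre systems!"
print(simple_clean(sample))
```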

In [0]:
df = pd.DataFrame()
df['text'] = data['text']
df['text'] = df["text"].apply(lambda x: tokenize_text(x))
df['text'][0]

'end advertise tim network suggest trend plasma enhance crystal service essentially big anybody liquid replay content possible forward abide theatre function display president good personalise grow instant keynote portable provider forget time different important tivotogo reflect live available ultimately guide worry issue deliver firm europe old launch senior talk hour tell today communication dvr radically push expert company brand directtv starcom record build business video recorder concern particularly hanlon technology engine box impact suit partnership challenge pcs market year future say connection gadget lose play microsoft las broadcaster digital telecom new uptake gather pause help example accord personal stacey announce showcase window website scheduler like gate producer bbc hard speech allow young tvs advert want tivo generation kind rewind recognise room add happen google book channel term press cable mediavest lack home vega high hume instead capability humax multimedia

In [0]:
# add text classes
df['class'] = data['category']

In [0]:
# keep only the 4000 most frequent words, which removes rare words
# TODO use TF-IDF
freq = pd.Series(' '.join(df['text']).split()).value_counts()[:4000]
print(sum(freq))

df['text'] = df['text'].apply(lambda doc: " ".join(w for w in doc.split() if w in freq.index))
df['text']

237206


0       end advertise tim network suggest trend enhanc...
1       david financial reply prosecutor bernie phone ...
2       andy unknown way progress rugby head list engl...
3       end boston charlton crystal saturday knock rep...
4       andy major julia big hit knock december releas...
                              ...                        
2220    number healthy big december decent jump streng...
2221    door worker immigration specific metropolitan ...
2222    promise fantastic big summer customer air act ...
2223    end escape solve need apparently suggest schoo...
2224    victory end ball progress score want record ob...
Name: text, Length: 2225, dtype: object
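
The filtering step above keeps only the most frequent words of the corpus. Below is a minimal standard-library sketch of the same idea, using `collections.Counter` instead of pandas; the toy corpus, the cut-off of 3 words and the helper name `keep_top_words` are assumptions for illustration.

```python
from collections import Counter

def keep_top_words(docs, vocab_size):
    """Keep only words that are among the `vocab_size` most frequent in `docs`."""
    counts = Counter(word for doc in docs for word in doc.split())
    vocab = {word for word, _ in counts.most_common(vocab_size)}
    return [" ".join(w for w in doc.split() if w in vocab) for doc in docs]

# toy corpus: market=3, price=3, match=2, everything else appears once
docs = [
    "market share price market",
    "match goal market price",
    "match price election",
]
filtered = keep_top_words(docs, vocab_size=3)
print(filtered)
```

The notebook applies the same cut with a vocabulary of 4000 words.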

### Step 3.3 Data preparation

In [0]:
# Data preparation
X = df['text'].tolist()
y = df['class'].tolist()

# Label encoding
le_enc = preprocessing.LabelEncoder()
le_enc.fit(y)

y_en = encode(le_enc, y)
y_en.shape

(2225, 5)

In [0]:
# split the dataset into training and testing datasets
x_train, x_test, y_train, y_test = model_selection.train_test_split(np.asarray(X), np.asarray(y_en), test_size=0.2, random_state=42)

In [0]:
x_train.shape
y_train.shape

(1780, 5)

### Step 3.4 Build model

In [0]:
# get elmo from tensorflow hub
import tensorflow_hub as hub
import tensorflow as tf

embed = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True)

In [0]:
# ELMo Embedding
def ELMoEmbedding(x):
    # squeeze only the feature axis so a batch of size 1 keeps its batch dimension
    return embed(tf.squeeze(tf.cast(x, tf.string), axis=1), signature="default", as_dict=True)["default"]
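
A note on the squeeze step: the input tensor has shape `(batch, 1)`, and a squeeze with no `axis` argument removes every dimension of size 1, including the batch dimension when a batch happens to contain a single sample. The sketch below shows the effect with NumPy, whose `squeeze` follows the same rule; the sample strings are placeholders.

```python
import numpy as np

batch_of_16 = np.array([["some text"]] * 16)   # shape (16, 1), like a full batch
batch_of_1 = np.array([["some text"]])         # shape (1, 1), e.g. a last partial batch

# without an axis, every size-1 dimension is dropped
full_shape_all = np.squeeze(batch_of_16).shape
single_shape_all = np.squeeze(batch_of_1).shape
# with axis=1, only the feature axis is dropped
single_shape_axis = np.squeeze(batch_of_1, axis=1).shape
print(full_shape_all, single_shape_all, single_shape_axis)
```

With 1780 training and 445 test samples and a batch size of 16, no batch has exactly one element, so the issue only shows up with other batch sizes or when predicting on a single article.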

In [0]:
input_text = Input(shape=(1,), dtype=tf.string)
embedding = Lambda(ELMoEmbedding, output_shape=(1024, ))(input_text)
dense = Dense(256, activation='relu')(embedding)
pred = Dense(5, activation='softmax')(dense)
model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

INFO:tensorflow:Saver not created because there are no variables in the graph to restore
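
The model ends in a 5-unit softmax layer trained with categorical cross-entropy. As a rough, dependency-free sketch of what those two pieces compute for a single article (the logits are made-up numbers, and `softmax` and `categorical_crossentropy` here are hand-written helpers, not the Keras implementations):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

logits = [2.0, 0.5, 0.1, -1.0, 0.3]           # hypothetical scores for the 5 categories
probs = softmax(logits)
loss_if_class_0 = categorical_crossentropy([1, 0, 0, 0, 0], probs)
loss_if_class_3 = categorical_crossentropy([0, 0, 0, 1, 0], probs)
print(probs, loss_if_class_0, loss_if_class_3)
```

The loss is small when the true class receives a high probability and grows as that probability shrinks, which is what training minimizes.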

In [0]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())  
    session.run(tf.tables_initializer())
    history = model.fit(x_train, y_train, epochs=1, batch_size=16)
    model.save_weights('./elmo-model-v2.h5')

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Epoch 1/1

In [0]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model.load_weights('./elmo-model-v2.h5')  
    predicts = model.predict(x_test, batch_size=16)

# 4. Results

In [0]:
# decode test labels
y_test = decode(le_enc, y_test)
# decode predicted labels
y_preds = decode(le_enc, predicts)

In [0]:
from sklearn import metrics

print(metrics.confusion_matrix(y_test, y_preds))

print(metrics.classification_report(y_test, y_preds))

print("Accuracy of ELMO is:",accuracy_score(y_test,y_preds))

[[91  1  7  0  2]
 [ 3 75  1  0  2]
 [ 0  0 83  0  0]
 [ 0  0  1 97  0]
 [ 4  5  0  1 72]]
               precision    recall  f1-score   support

     business       0.93      0.90      0.91       101
entertainment       0.93      0.93      0.93        81
     politics       0.90      1.00      0.95        83
        sport       0.99      0.99      0.99        98
         tech       0.95      0.88      0.91        82

     accuracy                           0.94       445
    macro avg       0.94      0.94      0.94       445
 weighted avg       0.94      0.94      0.94       445

Accuracy of ELMO is: 0.9393258426966292
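
One way to read the confusion matrix above is to list its largest off-diagonal entries, which are the most frequent mistakes. The sketch below does this in plain Python with the matrix copied from the output; the class order is taken from the classification report (alphabetical, as produced by `LabelEncoder`), and the helper name `top_confusions` is illustrative.

```python
CLASSES = ["business", "entertainment", "politics", "sport", "tech"]
CM = [
    [91, 1, 7, 0, 2],
    [3, 75, 1, 0, 2],
    [0, 0, 83, 0, 0],
    [0, 0, 1, 97, 0],
    [4, 5, 0, 1, 72],
]

def top_confusions(cm, classes, k=3):
    """Return the k largest off-diagonal cells as (count, true, predicted)."""
    errors = [
        (cm[i][j], classes[i], classes[j])
        for i in range(len(cm))
        for j in range(len(cm))
        if i != j and cm[i][j] > 0
    ]
    return sorted(errors, key=lambda e: e[0], reverse=True)[:k]

# overall accuracy = correct predictions (diagonal) / all predictions
accuracy = sum(CM[i][i] for i in range(len(CM))) / sum(map(sum, CM))
print(round(accuracy, 4))
for count, true_label, predicted in top_confusions(CM, CLASSES):
    print(f"{true_label} -> {predicted}: {count}")
```

On this run the most common error is business articles predicted as politics (7 cases), followed by tech predicted as entertainment (5) and tech predicted as business (4).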


My result after 1 epoch is 93.93% accuracy on the test set. Better preprocessing is key to achieving higher accuracy.

The best result found for this dataset is 97.77%:

https://www.kaggle.com/sarthak221995/textclassification-97-77-accuracy-bert

# 5. References

> This kernel is based on the work of:
>
> - https://www.kaggle.com/sarthak221995/textclassification-95-5-accuracy-elmo/notebook
> - https://www.kaggle.com/saikumar587/text-classification-elmo/notebook