# Multi-Class Classification using TensorFlow Hub

Data Set - [AG News Subset Dataset](https://www.tensorflow.org/datasets/catalog/ag_news_subset)

This notebook walks through a working example of using a TF Hub embedding layer for a multi-class classification problem: news articles labelled with one of four classes [0, 1, 2, 3].

The main steps are:

* Acquire Data
* EDA
* Modelling
* Evaluation

In [51]:
import os, re
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

import shutil, string

from tensorflow.keras import layers
from tensorflow.keras import losses

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.8.0
Eager mode:  True
Hub version:  0.12.0
GPU is NOT AVAILABLE


### Download dataset from TFDS

In [2]:
(train_data, validation_data, test_data), ds_info  = tfds.load('ag_news_subset', 
                      split=('train[:60%]', 'train[60%:]', 'test'),
                      as_supervised=True,
                      with_info=True)

Downloading and preparing dataset ag_news_subset/1.0.0 (download: 11.24 MiB, generated: 35.79 MiB, total: 47.03 MiB) to /root/tensorflow_datasets/ag_news_subset/1.0.0...



Dataset ag_news_subset downloaded and prepared to /root/tensorflow_datasets/ag_news_subset/1.0.0. Subsequent calls will reuse this data.


## EDA

In [3]:
# First, let's print the dataset info provided by TensorFlow Datasets
# It is returned by tfds.load when with_info=True. That's pretty cool, btw!

In [4]:
ds_info

tfds.core.DatasetInfo(
    name='ag_news_subset',
    version=1.0.0,
    description='AG is a collection of more than 1 million news articles.
News articles have been gathered from more than 2000  news sources by ComeToMyHead in more than 1 year of activity.
ComeToMyHead is an academic news search engine which has been running since July, 2004.
The dataset is provided by the academic comunity for research purposes in data mining (clustering, classification, etc),
information retrieval (ranking, search, etc), xml, data compression, data streaming,
and any other non-commercial activity.
For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above.
It is used as a text classification benchmark in the following paper:
Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advanc

In [5]:
# We have 120000 records for training, and 7600 for testing
# Total unique classes are 4 - 0,1,2,3
# Let's print first few examples

In [6]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(4)))
train_examples_batch, train_labels_batch

(<tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.',
        b'Reuters - Major League Baseball\\Monday announced a decision on the appeal filed by Chicago Cubs\\pitcher Kerry Wood regarding a suspension stemming from an\\incident earlier this season.',
        b'President Bush #39;s  quot;revenue-neutral quot; tax reform needs losers to balance its winners, and people claiming the federal deduction for state and local taxes may be in administration planners #39; sights, news reports say.',
        b'Britain will run out of leading scientists unless science education is improved, says Professor Colin Pillinger.'],
       dtype=object)>,
 <tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 1, 2, 3])>)

In [7]:
# Let's check total examples for train, validation and test

In [8]:
print(f'Total size of the Training dataset : {tf.data.experimental.cardinality(train_data)}')
print(f'Total size of the Validation dataset : {tf.data.experimental.cardinality(validation_data)}')
print(f'Total size of the Test dataset : {tf.data.experimental.cardinality(test_data)}')

Total size of the Training dataset : 72000
Total size of the Validation dataset : 48000
Total size of the Test dataset : 7600
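The 72000 and 48000 counts follow directly from the `train[:60%]` and `train[60%:]` slices of the 120000-example train split. A minimal sketch of that arithmetic, assuming the percentage boundary falls on a whole example (as it does here); `split_by_percent` is an illustrative helper, not a TFDS function.

```python
TOTAL_TRAIN = 120_000  # size of the full 'train' split reported by TFDS

def split_by_percent(total, percent):
    """Return the sizes of the [:percent%] and [percent%:] slices."""
    first = total * percent // 100  # integer arithmetic avoids float rounding
    return first, total - first

train_size, validation_size = split_by_percent(TOTAL_TRAIN, 60)
print(train_size, validation_size)
```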


In [9]:
# Next, the unique labels. These are already listed in the dataset description, but the method is still good to know!
# Since TF datasets are lazily evaluated, the next code block might be slow

In [10]:
text, labels = tuple(zip(*train_data))

np_text = np.array(text)
np_labels = np.array(labels)

print('Unique Labels for training : ', list(set(np_labels)))

Unique Labels for training :  [0, 1, 2, 3]
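Besides listing the unique labels, it can be useful to check how evenly the classes are represented. The sketch below counts labels with `collections.Counter` on a small made-up list; `toy_labels` and `class_distribution` are hypothetical names, and in the notebook the input would presumably be something like `np_labels.tolist()`.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, share_of_total)} sorted by label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (count, count / total) for label, count in sorted(counts.items())}

# Made-up labels for illustration only, not taken from the dataset.
toy_labels = [3, 1, 2, 3, 0, 1, 3, 2]
for label, (count, share) in class_distribution(toy_labels).items():
    print(f"class {label}: {count} examples ({share:.1%})")
```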


In [11]:
# Next, a few useful TFDS helpers

In [19]:
# Get number of unique classes
print(f"No of unique classes : {ds_info.features['label'].num_classes}")


# Get num of examples by the split
print(f"Total training examples (Training and Validation) : {ds_info.splits['train'].num_examples}")
print(f"Total testing examples  : {ds_info.splits['test'].num_examples}")

No of unique classes : 4
Total training examples (Training and Validation) : 120000
Total testing examples  : 7600
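The integer labels are easier to read once they are mapped to class names. A minimal sketch, assuming the AG News label order is World, Sports, Business, Sci/Tech (in the notebook this should be checkable via `ds_info.features['label'].names`); `CLASS_NAMES` and `label_names` are illustrative names that do not appear in the original notebook.

```python
# Assumed AG News label order; verify against ds_info.features['label'].names.
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

def label_names(label_ids):
    """Map integer class ids to human-readable class names."""
    return [CLASS_NAMES[i] for i in label_ids]

# Labels of the four sample articles printed earlier in the EDA section.
batch_labels = [3, 1, 2, 3]
print(label_names(batch_labels))
```

Under that assumption, the AMD chip article maps to Sci/Tech and the baseball article to Sports, which matches their content.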


## Modelling

### Embedding Layer 

We use an embedding layer for text classification. One way to represent text is to convert each sentence into an embedding vector.
Using a pre-trained text embedding as the first layer has three advantages:

* You don't have to worry about text preprocessing,
* Benefit from transfer learning,
* The embedding has a fixed size, so it's simpler to process.

For this example, we use a pre-trained text embedding model from TensorFlow Hub called [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2), which is trained on a Google News corpus. It converts each input text into a 50-dimensional embedding vector.
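To illustrate why a fixed-size embedding is simpler to process, the toy sketch below maps sentences of different lengths to vectors of the same length by averaging per-word vectors. Everything here is an assumption for illustration: the 3-dimensional `TOY_VECTORS` table is made up, and mean pooling is only a stand-in, since the Hub module uses its own learned 50-dimensional vectors and its own way of combining them.

```python
# Hypothetical 3-dimensional word vectors (the real model learns 50-dim vectors).
TOY_VECTORS = {
    "chip": [0.9, 0.1, 0.0],
    "new": [0.2, 0.2, 0.2],
    "baseball": [0.0, 0.8, 0.1],
    "season": [0.1, 0.6, 0.3],
}
UNKNOWN = [0.0, 0.0, 0.0]  # vector used for words missing from the table

def embed_sentence(sentence):
    """Average the word vectors of a sentence into one fixed-size vector."""
    tokens = sentence.lower().split()
    vectors = [TOY_VECTORS.get(token, UNKNOWN) for token in tokens]
    if not vectors:
        return list(UNKNOWN)
    dim = len(UNKNOWN)
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

short = embed_sentence("new chip")
long = embed_sentence("baseball season opens with a new chip sponsor")
print(len(short), len(long))  # both vectors have the same length
```

Because every sentence ends up with the same dimensionality, the dense layers that follow can use a fixed input shape regardless of article length.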

In [36]:
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 50), dtype=float32, numpy=
array([[ 0.13279007,  0.06140124,  0.1747397 , -0.01384087, -0.00910476,
        -0.03726622,  0.07974008,  0.08505542, -0.15469442, -0.07710762,
        -0.5860853 ,  0.38640746, -0.17650622, -0.12226384,  0.22213276,
         0.37066036,  0.01225427,  0.11542831,  0.20145267,  0.16040364,
         0.03724775, -0.1422108 ,  0.04388927,  0.00514791, -0.22648726,
        -0.10230689,  0.06203717,  0.09426294,  0.04055819,  0.18911201,
         0.2816111 , -0.09024968,  0.04349989, -0.30649066,  0.20486301,
        -0.39136994,  0.25492623, -0.06430516,  0.16803294, -0.0635931 ,
         0.09554254, -0.05217019, -0.10079663,  0.259143  , -0.16179433,
        -0.18240969,  0.05787944,  0.00377896,  0.1353013 ,  0.35294548],
       [ 0.2955813 ,  0.134404  ,  0.09672645,  0.1042643 , -0.14633738,
         0.21999091, -0.2732706 ,  0.056431  ,  0.3784275 , -0.14614943,
         0.0726692 ,  0.12335153,  0.07059986, -0.2501442 ,  0.3267967 ,
 

### Define the model

In [61]:
model = tf.keras.Sequential()
model.add(hub_layer)
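# Hidden layer. Note: softmax is an unusual choice for a hidden layer; 'relu' is the more common activation.
model.add(tf.keras.layers.Dense(16, activation='softmax'))
# Output layer: one logit per class (no activation, because the loss is computed with from_logits=True)
model.add(tf.keras.layers.Dense(4))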

model.summary()

Model: "sequential_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer_3 (KerasLayer)  (None, 50)                48190600  
                                                                 
 dense_19 (Dense)            (None, 16)                816       
                                                                 
 dense_20 (Dense)            (None, 4)                 68        
                                                                 
Total params: 48,191,484
Trainable params: 48,191,484
Non-trainable params: 0
_________________________________________________________________
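The dense-layer parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic, taking the embedding parameter count from the summary as given; `dense_params` is an illustrative helper, not a Keras function.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus biases (units) of a dense layer."""
    return inputs * units + units

EMBEDDING_PARAMS = 48_190_600     # KerasLayer count taken from the summary
hidden = dense_params(50, 16)     # 50-dim embedding -> 16 units
output = dense_params(16, 4)      # 16 units -> 4 class logits
total = EMBEDDING_PARAMS + hidden + output
print(hidden, output, total)
```

Almost all trainable parameters sit in the embedding layer, because it was created with `trainable=True`.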


### Compile the model

In [62]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
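The loss above takes integer labels and raw logits. As a rough illustration of what that means for a single example, the sketch below applies a softmax to the logits and takes the negative log of the probability assigned to the true class. It is a plain-Python approximation of the idea, not the Keras implementation, and the logits are made up.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability of the integer class `label`."""
    return -math.log(softmax(logits)[label])

logits = [2.0, 1.0, 0.1, -1.0]  # made-up logits for the 4 classes
loss_if_class_0 = sparse_categorical_crossentropy(logits, 0)
loss_if_class_3 = sparse_categorical_crossentropy(logits, 3)
print(loss_if_class_0, loss_if_class_3)
```

The loss is small when the largest logit belongs to the true class and grows as the true class receives less probability.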

### Training

In [66]:
# Run tf.functions eagerly: easier to debug, but slower than graph execution
tf.config.run_functions_eagerly(True)

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=1,
                    validation_data=validation_data.batch(512),
                    verbose=1)



## Evaluation

In [68]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

15/15 - 1s - loss: 1.0451 - accuracy: 0.6834 - 1s/epoch - 96ms/step
loss: 1.045
accuracy: 0.683
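Since the final layer outputs logits, a predicted class is the index of the largest logit for each example, and accuracy is the share of predictions that match the true labels. A minimal sketch with made-up logits; `predict_classes` and `accuracy` are illustrative helpers, and with the real model the logits would presumably come from `model.predict(...)`.

```python
def predict_classes(batch_logits):
    """Return the index of the largest logit for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in batch_logits]

def accuracy(predicted, true):
    """Fraction of predictions that equal the true labels."""
    return sum(p == t for p, t in zip(predicted, true)) / len(true)

# Made-up logits for two examples and four classes.
toy_logits = [[0.1, 2.3, -0.5, 0.0], [1.2, 0.3, 0.2, 3.1]]
predicted = predict_classes(toy_logits)
print(predicted, accuracy(predicted, [1, 2]))
```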


68% test accuracy after 1 epoch, with no text preprocessing.
This could likely be improved by training for more epochs or by using a deeper network.
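One possible preprocessing step is to clean up the markup residue visible in the sample articles, such as `#39;` for an apostrophe, `quot;` for a quotation mark, and backslashes used as line separators. The sketch below is a guess based only on the four samples printed earlier; `clean_text` is a hypothetical helper, and whether it improves accuracy has not been measured here.

```python
import re

def clean_text(text):
    """Repair entity residue seen in the sample articles."""
    text = re.sub(r"\s*#39;", "'", text)          # "AMD #39;s" -> "AMD's"
    text = re.sub(r"\s+quot;(?=\s)", '"', text)   # closing quote: drop the space before it
    text = text.replace("quot;", '"')             # remaining (opening) quotes
    text = text.replace("\\", " ")                # backslash used as a separator
    return re.sub(r"\s+", " ", text).strip()      # collapse repeated whitespace

print(clean_text("AMD #39;s new dual-core Opteron chip"))
print(clean_text("President Bush #39;s  quot;revenue-neutral quot; tax reform"))
```

In the notebook the inputs are byte strings inside a `tf.data` pipeline, so a function like this would need to be adapted (for example with `tf.strings.regex_replace`) before being applied with `dataset.map`.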