<div style="background-color:#3c7852; display:block; padding:10px;"><h1 style="color:#fff">Multiclass Text Classification using TF.Hub</h1></div>
<div style="padding:3px;">&nbsp;</div>

## What is Transfer Learning?

Transfer learning reuses a model that was pre-trained on a similar type of data (text, images) as the starting point for a new task. TensorFlow Hub provides pre-trained networks and components that can be plugged into a new model and trained further on the new data.

## Dataset

In this kernel we explore a multiclass text classification problem with a deep learning model. If the target (response) variable contains more than two class labels, the dataset is considered multinomial, or multiclass.


The dataset is a collection of consumer complaints about financial products and services that were sent to companies for a response.

## Key Variables

In this dataset, `consumer_complaint_narrative` is the free-text field in which consumers describe their complaint about a financial product or service. `product` is the target variable, which is predicted from the complaint text.


<div style="background-color:#e0d52f; display:block; padding:10px;margin-bottom:4px;"><h2 style="color:#000">Table of contents</h2></div>
<div style="padding:3px;">&nbsp;</div>

* [Load and extract dataset](#load_data)
* [Split Train/Holdout and Dev Set](#split_data)
* [Handling Imbalanced Data](#compute_weights)
* [Data to Tensors](#data_to_tensor)
* [Target Encoding](#target_encoding)
* [Transfer Learning](#transfer_learning)
* [Train Model](#train_model)
* [Predict Data](#predict_data)


## Import Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Tensorflow packages
import tensorflow as tf
from tensorflow import keras
import tensorflow_hub as hub

# SKlearn packages
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle, class_weight

import warnings
warnings.filterwarnings('ignore')

# Show full column contents (None means no truncation; -1 is no longer accepted by recent pandas)
pd.set_option('display.max_colwidth', None)


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

def dir_watch(dirname):
    # Walk the directory tree and print the full path of every file
    for root, _, filenames in os.walk(dirname):
        for filename in filenames:
            print(os.path.join(root, filename))

# input dir
dir_watch('/kaggle/input/')
        

<a id="load_data"></a>

## Load and Extract Data

In [None]:
# Load the dataset from csv file
cfpb_data = pd.read_csv('/kaggle/input/us-consumer-finance-complaints/consumer_complaints.csv')

In [None]:
cfpb_data.isnull().sum()

## Extract Data

We keep only the records whose `consumer_complaint_narrative` is not null and use them for training.

In [None]:
non_na_complaints = np.where(~cfpb_data['consumer_complaint_narrative'].isna())

In [None]:
len(non_na_complaints[0])

In [None]:
cfpb_extract = cfpb_data.loc[non_na_complaints]

# Reset the index
cfpb_extract.reset_index(inplace=True)

In [None]:
cfpb_extract.info()

In [None]:
# Columns of interest
key_cols = ['product', 'consumer_complaint_narrative']

cfpb_extract[key_cols][:3]

In [None]:
cfpb_extract['product'].value_counts()

In [None]:
# Plot the target variable
cfpb_extract['product'].value_counts().plot(kind='bar')

<a id="split_data"></a>


## Split the dataset

### Train, Holdout and Dev Split
The dataset is split into three portions: one to train the model, one for validation (holdout) and one for testing (dev). Two successive 80/20 splits are used, which gives roughly a 64/16/20 ratio overall.

In [None]:
# Split the full data into train and test sets with an 80/20 ratio
X_train_full, X_test_full = train_test_split(cfpb_extract[key_cols], test_size=0.2, random_state=111)

# Split the training portion again 80/20 (about 64/16 of the full data)
X_train, X_valid = train_test_split(X_train_full, test_size=0.2, random_state=111)

In [None]:
print(f"Shape of X_train: {X_train.shape}, X_valid: {X_valid.shape}" )
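
The overall proportions produced by two chained `train_test_split` calls can be checked with a little arithmetic. The sketch below is illustrative only; `split_fractions` is a hypothetical helper and not part of scikit-learn.

```python
def split_fractions(test_size, valid_size):
    """Return the overall (train, valid, test) fractions of two chained splits.

    test_size  -- fraction held out by the first split
    valid_size -- fraction of the remaining data held out by the second split
    """
    test = test_size
    valid = (1.0 - test_size) * valid_size
    train = 1.0 - test - valid
    return train, valid, test


# Two successive 80/20 splits, as in the cell above
train, valid, test = split_fractions(0.2, 0.2)
print(f"train={train:.2f}, valid={valid:.2f}, test={test:.2f}")

# An exact 60/20/20 split would need valid_size=0.25 in the second call
alt_train, alt_valid, alt_test = split_fractions(0.2, 0.25)
print(f"train={alt_train:.2f}, valid={alt_valid:.2f}, test={alt_test:.2f}")
```

The second split is applied to the 80% that remains after the first one, so its `test_size` is a fraction of that remainder rather than of the whole dataset.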

<a id="compute_weights"></a>
## Handling imbalanced class data

A key technique for handling imbalanced classes is **computing class weights**. Each class is weighted according to the number of samples it has in the dataset. We use the `compute_class_weight` function from the `sklearn.utils.class_weight` module to calculate the weights.

Classes with more samples receive lower weights, and classes with fewer samples receive higher weights.
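
As a rough illustration of what the `'balanced'` heuristic computes, the sketch below reproduces its formula, `n_samples / (n_classes * class_count)`, in plain Python. The helper `balanced_class_weights` and the toy labels are made up for this example; the notebook itself relies on scikit-learn.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Sort the class names so the order is deterministic
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}


# Toy labels (made up for illustration): one frequent class, two rarer ones
toy_labels = ["Mortgage"] * 6 + ["Credit card"] * 3 + ["Payday loan"] * 1
toy_weights = balanced_class_weights(toy_labels)

for cls, weight in toy_weights.items():
    print(f"{cls}: {weight:.3f}")
```

The rarest class ends up with the largest weight, so its errors count more heavily in the loss during training.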

In [None]:
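# Recent scikit-learn versions require these arguments to be passed by keyword
class_weights = list(class_weight.compute_class_weight(class_weight='balanced',
                                                      classes=np.unique(cfpb_extract['product']),
                                                      y=cfpb_extract['product']))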


class_weights

In [None]:
# Convert the list to a dictionary keyed by class index
# (the indices follow the sorted order returned by np.unique)
weights = dict(enumerate(class_weights))

In [None]:
X_train['consumer_complaint_narrative'][:2]

<a id="data_to_tensor"></a>
## Convert Dataset into Tensors

In this step we convert the data into tensors, the data structure required to train the neural network.

We use `tf.data.Dataset.from_tensor_slices(tuple)`; see the [`tf.data.Dataset` documentation](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) for more on `tf.data`.

Our dependent variable is `product` and the independent variable is `consumer_complaint_narrative`.

In [None]:
train_tensor = tf.data.Dataset.from_tensor_slices((X_train['consumer_complaint_narrative'].values, X_train['product'].values))
test_tensor = tf.data.Dataset.from_tensor_slices((X_test_full['consumer_complaint_narrative'].values, X_test_full['product'].values))
valid_tensor = tf.data.Dataset.from_tensor_slices((X_valid['consumer_complaint_narrative'].values, X_valid['product'].values))

In [None]:
for corpus, target in train_tensor.take(5):
    print("\nTarget: {} \nData: {}".format(target, corpus))


<a id="target_encoding"></a>
## Target Encoding

We will create a **StaticHashTable** that maps each target class name to an integer index. Sample code for creating a static hash table can be found [here](https://gist.github.com/venkat-krish/a21808db141c58bea87bc309fccaa042).
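
Conceptually, target encoding is a name-to-index lookup followed by one-hot encoding. The following plain-Python sketch mimics that idea with a dictionary; `build_lookup`, `one_hot` and the class subset are hypothetical stand-ins for illustration, not the TensorFlow API.

```python
def build_lookup(class_names):
    """Map each class name to its position in the sorted list of names."""
    return {name: index for index, name in enumerate(sorted(class_names))}


def one_hot(index, depth):
    """Return a list of `depth` zeros with a 1.0 at `index`.

    An out-of-range index (such as a -1 default) yields all zeros,
    which mirrors how tf.one_hot treats out-of-range indices.
    """
    return [1.0 if position == index else 0.0 for position in range(depth)]


toy_classes = ["Mortgage", "Credit card", "Debt collection"]  # made-up subset
lookup = build_lookup(toy_classes)

encoded = {}
for label in ["Debt collection", "Unknown product"]:
    index = lookup.get(label, -1)  # -1 plays the role of the table's default value
    encoded[label] = one_hot(index, len(toy_classes))
    print(label, index, encoded[label])
```

A label that is missing from the table maps to the default index and therefore to an all-zero vector, which is worth checking for when new classes appear in the data.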

In [None]:
products = np.unique(cfpb_extract['product'])

products

In [None]:

# Build a static hash table that maps each class name to an integer index
def target_encoding(unique_targets):
    
    key_tensor = tf.constant(unique_targets) # class names as strings
    value_tensor = tf.range(len(unique_targets), dtype=tf.int64) # indices 0 .. n_classes - 1, explicitly int64
    
    hash_table = tf.lookup.StaticHashTable(
                    tf.lookup.KeyValueTensorInitializer(
                        keys = key_tensor, 
                        values = value_tensor), -1
                )
    
    return hash_table

# Target encoded table
target_encoded = target_encoding(products)

# tf.function compiles the lookup into the TensorFlow graph
@tf.function
def target_enc(t):
    return target_encoded.lookup(t)


def display_batchwise(dataset, bsize=5):
    for data, label in dataset.take(bsize):
        print("Data:{}\nTarget:{}\n".format(data.numpy(), label.numpy()))
        
def one_hot_labelencoding(text, label):
    # Use the number of classes rather than a hard-coded depth
    return text, tf.one_hot(target_enc(label), len(products))

In [None]:
next(iter(train_tensor))

In [None]:
# Transform the labels into binary variables
train_data_f = train_tensor.map(one_hot_labelencoding)
valid_data_f = valid_tensor.map(one_hot_labelencoding)
test_data_f = test_tensor.map(one_hot_labelencoding)

In [None]:
train_data, train_labels = next(iter(train_data_f.batch(5)))

In [None]:
train_data, train_labels

<a id="transfer_learning"></a>
## Transfer Learning using TF.Hub


TensorFlow Hub is a library for sharing pre-trained model components. In this notebook we use the **NNLM English 128 dim** ([source](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1)) model to embed our text corpus.

In [None]:
pretrained_url = 'https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1'

# Hub layer for embedding the text corpus
hub_layer = hub.KerasLayer(pretrained_url, output_shape=[128], 
                          input_shape=[], 
                          dtype=tf.string, 
                          trainable=True)

# Look at the hub layer
hub_layer(train_data[:1])

In [None]:
def build_model(embed_layer, output_shape):
    model = tf.keras.Sequential()
    
    model.add(embed_layer)
    
    for unit in [128, 128, 64, 32]:
        model.add(tf.keras.layers.Dense(unit, activation='relu'))
        model.add(tf.keras.layers.Dropout(0.3))
    
    model.add(tf.keras.layers.Dense(output_shape, activation='softmax'))
    
    return model


In [None]:
output_shape = len(products)

# NN model
model = build_model(hub_layer, output_shape)

model.summary()
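
The size of the dense head stacked on the embedding can be estimated by hand. This is a back-of-the-envelope sketch that assumes 11 product classes and a 128-dimensional embedding; `dense_params` is a hypothetical helper, and the parameters of the embedding layer itself are not counted.

```python
def dense_params(n_inputs, n_units):
    """Parameters in a fully connected layer: one weight per input/unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


embedding_dim = 128                 # output size of the embedding layer
hidden_units = [128, 128, 64, 32]   # hidden layer sizes used in build_model
n_classes = 11                      # assumption: 11 product classes

layer_sizes = [embedding_dim] + hidden_units + [n_classes]
head_params = [dense_params(n_in, n_out)
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(head_params)
print(sum(head_params))
```

Dropout layers add no parameters, so these figures should match the `Dense` rows reported by `model.summary()` when the class count is 11.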

<a id="train_model"></a>

## Train model

In [None]:
# Compile the model. The output layer already applies softmax,
# so the loss receives probabilities, not logits.

model.compile(loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
             optimizer='adam',
             metrics=['accuracy'])
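
Whether the loss should treat its inputs as logits depends on the model's last layer: if that layer already applies softmax, a loss that applies softmax again sees a flattened distribution. The following is a minimal, library-free sketch of that effect; the scores are made up and the helper functions are illustrative, not Keras internals.

```python
import math


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [v - max(values) for v in values]  # subtract the max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_index):
    """Categorical cross-entropy for a single one-hot target."""
    return -math.log(probabilities[true_index])


logits = [4.0, 1.0, 0.0]   # made-up raw scores; class 0 is the true class
probs = softmax(logits)

loss_once = cross_entropy(probs, 0)             # softmax applied once
loss_twice = cross_entropy(softmax(probs), 0)   # softmax applied twice

print(f"softmax once:  {loss_once:.3f}")
print(f"softmax twice: {loss_twice:.3f}")
```

With three classes, applying softmax twice can never push the loss below about 0.55, because the inputs to the second softmax are confined to the range 0 to 1.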

In [None]:
# Shuffle the train and validation data, then batch all three sets
train_data_f = train_data_f.shuffle(60000).batch(512) 
valid_data_f = valid_data_f.shuffle(20000).batch(512)
test_data_f = test_data_f.batch(512)

In [None]:
# Fit the model on the training data
history = model.fit(train_data_f,
                    epochs=10,
                    validation_data=valid_data_f,
                    class_weight=weights,
                    verbose=1)

In [None]:
results = model.evaluate(test_data_f)

<a id="predict_data"></a>
## Predict the test data

In [None]:
# Take the first batch of the test set (up to 512 examples)
test_data, test_labels = next(iter(test_data_f))

In [None]:
y_preds = model.predict(test_data)

In [None]:
y_preds.argmax(axis=1)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(test_labels.numpy().argmax(axis=1), y_preds.argmax(axis=1)))
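
To make the report easier to read, the sketch below shows how per-class precision and recall are derived from true and predicted class indices. It is a simplified, plain-Python illustration with made-up values; `precision_recall` is a hypothetical helper and not the scikit-learn implementation.

```python
def precision_recall(y_true, y_pred, positive_class):
    """Precision and recall for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive_class and p == positive_class)
    predicted_pos = sum(1 for _, p in pairs if p == positive_class)
    actual_pos = sum(1 for t, _ in pairs if t == positive_class)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up class indices, standing in for the argmax of labels and predictions
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

scores = {cls: precision_recall(toy_true, toy_pred, cls) for cls in sorted(set(toy_true))}
for cls, (precision, recall) in scores.items():
    print(f"class {cls}: precision={precision:.2f}, recall={recall:.2f}")
```

Precision asks how many predictions of a class were right, while recall asks how many true members of the class were found; on imbalanced data the two can differ widely for the rare classes.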