# W & J technical test, part 1
- by David Ricardo Vivas Ordóñez

For this first challenge we will build a news category classifier using TensorFlow and the [Transformers library](https://github.com/huggingface/transformers) by the [Hugging Face team](https://github.com/huggingface). This library provides a relatively unified API for using and training a large number of Transformer models, such as BERT and GPT-2.

We will rely on the [library documentation](https://huggingface.co/transformers/) and a [working example](https://www.kaggle.com/foolofatook/news-classification-using-bert). We will run this notebook in a Google Colab TPU environment.


## Dependency loading, environment preparation

In [1]:
!pip install -q transformers

In [2]:
### download the dataset
!gdown --id 18g0n5IrhTc_7uJlUTYjnavgnjkPrPVJp

Downloading...
From: https://drive.google.com/uc?id=18g0n5IrhTc_7uJlUTYjnavgnjkPrPVJp
To: /content/News_Category_Dataset_v2.json
83.9MB [00:00, 149MB/s] 


In [3]:
import numpy as np 
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_datasets as tfds
import pandas as pd
import transformers
import sklearn
import seaborn as sns

from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, TFAutoModel

In [4]:
# Taken from https://www.kaggle.com/philculliton/a-simple-tf-2-1-notebook
# Detect hardware, return appropriate distribution strategy
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection. No parameters necessary if TPU_NAME environment variable is set. On Kaggle this is always the case.
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy() # default distribution strategy in Tensorflow. Works on CPU and single GPU.

print("REPLICAS: ", strategy.num_replicas_in_sync)

INFO:absl:Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0
Running on TPU  grpc://10.81.140.218:8470
INFO:tensorflow:Initializing the TPU system: grpc://10.81.140.218:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)


REPLICAS:  8


## Data loading

In [5]:
df = pd.read_json('/content/News_Category_Dataset_v2.json', lines=True)                                            # load dataset from drive

# merge duplicate and similar categories; comment out this block to keep the raw categories
category_aliases = {
    'ARTS': 'ARTS & CULTURE',
    'CULTURE & ARTS': 'ARTS & CULTURE',
    'WORLDPOST': 'THE WORLDPOST',
    'PARENTS': 'PARENTING',
    'GREEN': 'ENVIRONMENT',
    'TASTE': 'FOOD & DRINK',
    'STYLE': 'STYLE & BEAUTY',
    'COLLEGE': 'EDUCATION',
}
df['category'] = df['category'].replace(category_aliases)

df = df.sample(frac=1)                                                                                             # shuffling
n_classes = df.category.nunique()                                                                                  # count number of unique categories
df['category'] = pd.Categorical(df['category'])                                                                    
df['category_label'] = df['category'].cat.codes
categories = df['category'].cat.categories
one_hot_labels = tf.keras.utils.to_categorical(df['category_label'], num_classes=n_classes, dtype = 'int32')       # create one_hot representation of each category
dataset_size = len(one_hot_labels)
df['category_label_one_hot'] = one_hot_labels.tolist()
df['string_inputs'] = df['headline'] + df['short_description']
df = df.sort_index(axis=1)                                                                                         # sort columns alphabetically
df.head()

| | authors | category | category_label | category_label_one_hot | date | headline | link | short_description | string_inputs |
|---|---|---|---|---|---|---|---|---|---|
| 43742 | Sue-Lin Wong, Reuters | ENVIRONMENT | 8 | [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ... | 2016-11-01 | In Rare Move, China Criticizes Trump Plan To E... | https://www.huffingtonpost.com/entry/china-don... | “I believe a wise political leader should take... | In Rare Move, China Criticizes Trump Plan To E... |
| 37929 | Julia Brucculieri | ENTERTAINMENT | 7 | [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ... | 2017-01-06 | Ed Sheeran Just Blessed Us With Not One But Tw... | https://www.huffingtonpost.com/entry/ed-sheera... | Get ready to sing. | Ed Sheeran Just Blessed Us With Not One But Tw... |
| 138521 | Brandon Turner, Contributor\nVP of Content at ... | BUSINESS | 2 | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 2013-11-23 | The Top 10 Mistakes 20-Somethings Make Regardi... | https://www.huffingtonpost.com/entry/the-top-1... | At this point in life, many 20-somethings are ... | The Top 10 Mistakes 20-Somethings Make Regardi... |
| 170217 | | COMEDY | 3 | [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 2012-12-23 | Rejected Folgers Christmas Commercial Goes Hor... | https://www.huffingtonpost.com/entry/rejected-... | Wait, what's going on here? Unless they're not... | Rejected Folgers Christmas Commercial Goes Hor... |
| 161491 | Oyster.com, Contributor\nThe Hotel Tell-All | TRAVEL | 27 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 2013-03-26 | The Top 10 Easter Brunches In The U.S. | https://www.huffingtonpost.com/entry/the-top-1... | On the menus: vanilla-dipped brioche French to... | The Top 10 Easter Brunches In The U.S.On the m... |
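
As a minimal, library-free sketch of what the label preparation above does, the snippet below builds integer codes and one-hot vectors for a toy list of categories. The toy labels are hypothetical; the sorted ordering is meant to mirror how `pd.Categorical` orders string categories by default, and the one-hot step is analogous to `tf.keras.utils.to_categorical`.

```python
# Toy labels (hypothetical), standing in for df['category'].
labels = ["TRAVEL", "COMEDY", "BUSINESS", "COMEDY"]

# Integer codes follow the sorted unique categories.
categories = sorted(set(labels))
code_of = {name: code for code, name in enumerate(categories)}
codes = [code_of[name] for name in labels]

# One-hot encoding: a vector of zeros with a single 1 at the class index.
n_classes = len(categories)
one_hot = [[1 if i == code else 0 for i in range(n_classes)] for code in codes]

print(categories)  # ['BUSINESS', 'COMEDY', 'TRAVEL']
print(codes)       # [2, 1, 0, 1]
print(one_hot[0])  # [0, 0, 1]
```

Keeping `categories` around is what later lets a predicted class index be mapped back to a category name.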


## Data preprocessing and model declaration




We will now tokenize our dataset. The Transformers Auto API lets us retrieve the appropriate tokenizer from the name of a pre-trained model. We have tested this notebook with both `bert-base-uncased` and `distilgpt2`.

In [6]:
model_name = "bert-base-uncased"                                  # use for BERT
cls_token  = 0                                                    # use for BERT

# model_name = 'distilgpt2'                                         # use for gpt-2
# cls_token = -1                                                    # use for gpt-2

d_input = 2**7                                                    # chosen dimensionality for the input

tokenizer = AutoTokenizer.from_pretrained(model_name)
# tokenizer.pad_token = tokenizer.eos_token                         # use for gpt-2. Define pad token as eos token for gpt-2 tokenizer

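input_features = np.array(tokenizer.batch_encode_plus(              # tokenize our inputs
                    df['string_inputs'].astype('str').tolist(),     # plain list of strings
                    padding='max_length',                           # pad every sequence to max_length
                    truncation = True,                              # cut sequences longer than max_length
                    max_length=d_input)['input_ids'])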
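
To make the effect of the `max_length` and truncation settings concrete, here is a simplified sketch of how token id sequences end up with a fixed length. The helper `pad_or_truncate`, the token ids and the padding id `0` are illustrative assumptions; the real tokenizer also handles special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return token_ids cut or right-padded to exactly max_length entries."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

short = [101, 7592, 102]            # hypothetical ids for a short headline
long = list(range(1, 11))           # hypothetical ids for a long one

print(pad_or_truncate(short, 5))    # [101, 7592, 102, 0, 0]
print(pad_or_truncate(long, 5))     # [1, 2, 3, 4, 5]
```

Because every row has the same length, the result can be stacked into a single rectangular array, which is what allows the `np.array(...)` conversion in the cell above.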



Let's define some training parameters. We will use the tf.data API with [autotuned prefetching](https://www.tensorflow.org/api_docs/python/tf/data/experimental#AUTOTUNE) and the Keras pipeline for fast training.

In [7]:
EPOCHS = 10
BATCH_SIZE = 32*strategy.num_replicas_in_sync
AUTO = tf.data.experimental.AUTOTUNE                           # as suggested in kaggle.com/philculliton/a-simple-tf-2-1-notebook
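
As a quick worked example of the batch size arithmetic, assuming the 8 replicas reported in the log above ("REPLICAS:  8"), the global batch is split evenly across the TPU cores. The helper name `global_batch_size` is hypothetical.

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Batch size seen by Keras when each replica processes per_replica_batch examples."""
    return per_replica_batch * num_replicas

per_replica_batch = 32
print(global_batch_size(per_replica_batch, 8))  # 256 on an 8-core TPU
print(global_batch_size(per_replica_batch, 1))  # 32 on CPU or a single GPU
```

Scaling the batch with the number of replicas keeps the per-core workload constant when the notebook moves between hardware.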

Now we split our examples into a training set and a test set. We do not use a validation set because, due to time constraints, we do not intend to tune hyperparameters. We call .repeat() on the training tf.data.Dataset so that it can be iterated indefinitely across epochs, and prefetch batches so that data loading overlaps with training.



In [8]:
test_fraction = 0.25
training_examples, test_examples ,training_labels, test_labels = train_test_split(input_features, one_hot_labels, test_size = test_fraction)
training_set = tf.data.Dataset.from_tensor_slices((training_examples, training_labels)).repeat().batch(BATCH_SIZE).prefetch(AUTO)
test_set = tf.data.Dataset.from_tensor_slices((test_examples, test_labels)).batch(BATCH_SIZE)
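
The interplay between a repeated dataset and a fixed number of steps per epoch can be sketched without TensorFlow. The generator below is only an analogy for `.repeat().batch(...)`, using a hypothetical 10-example dataset: the stream never ends, so the number of steps decides where an epoch stops.

```python
from itertools import cycle, islice

def batches(examples, batch_size):
    """Yield consecutive batches forever, wrapping around the data."""
    stream = cycle(examples)
    while True:
        yield list(islice(stream, batch_size))

examples = list(range(10))          # hypothetical dataset of 10 examples
batch_size = 4
steps_per_epoch = len(examples) // batch_size   # 2 full batches per epoch

stream = batches(examples, batch_size)
for epoch in range(2):
    epoch_batches = [next(stream) for _ in range(steps_per_epoch)]
    print(epoch, epoch_batches)
# 0 [[0, 1, 2, 3], [4, 5, 6, 7]]
# 1 [[8, 9, 0, 1], [2, 3, 4, 5]]
```

Examples left over at the end of one epoch are picked up at the start of the next, so no data is permanently dropped.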

Let's now declare and compile our model on the TPU. The Transformers Auto API provides a Keras layer (`tf.keras.layers.Layer`) encapsulating the entire transformer, which we can freely use via the Keras sequential or functional API.

We will stack a softmax dense layer with dropout regularization on top of the transformer for multi-class classification. cls_token indicates the position of the classification token for a given architecture:
- For BERT this token is at position 0 of the latent representation.
- GPT-2 was not trained with an explicit classification token, but we can choose the last position, as it inherits context from the entire sentence.
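
The position choice described above amounts to picking one hidden vector per sequence. A small pure-Python sketch, with made-up hidden states and a hypothetical helper `select_position`, shows what the slice `[:, cls_token, :]` selects for `cls_token = 0` and `cls_token = -1`.

```python
# Hypothetical transformer output for a batch of 2 sequences,
# 3 token positions each, hidden size 2: shape (batch, position, hidden).
hidden_states = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],
]

def select_position(batch, position):
    """Pure-Python equivalent of batch[:, position, :]."""
    return [sequence[position] for sequence in batch]

print(select_position(hidden_states, 0))   # [[0.1, 0.2], [1.1, 1.2]]  (BERT setting)
print(select_position(hidden_states, -1))  # [[0.5, 0.6], [1.5, 1.6]]  (GPT-2 setting)
```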

In [9]:
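def declare_transformer(transformer_layer, d_input, cls_token, n_classes, dropout_frac = 0.25):
    """Stack dropout and a softmax dense layer on the transformer output at position cls_token."""
    inputs = tf.keras.layers.Input(shape=(d_input,), dtype=tf.int32)    # token ids; named `inputs` to avoid shadowing the built-in input()
    transformer_output = transformer_layer(inputs)[0]                   # last hidden states: (batch, d_input, hidden)
    cls_output = transformer_output[:, cls_token, :]                    # representation at the classification position
    dropout_output = tf.keras.layers.Dropout(dropout_frac)(cls_output)
    output = tf.keras.layers.Dense(n_classes, activation='softmax')(dropout_output)
    model = tf.keras.Model(inputs=inputs, outputs=output)
    
    return model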
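
To illustrate what the final softmax layer produces and how it is scored against one-hot labels by categorical cross-entropy (the loss this notebook compiles the model with), here is a small standalone sketch. The logits are made-up numbers and the two helper functions are simplified stand-ins, not the Keras implementations.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

logits = [2.0, 1.0, 0.1]                        # hypothetical scores for 3 classes
probs = softmax(logits)
loss = categorical_crossentropy([1, 0, 0], probs)
print(probs, loss)
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1, which is what training pushes the dense layer and the transformer weights towards.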

In [10]:
loss = 'categorical_crossentropy'

with strategy.scope():
    transformer_layer = TFAutoModel.from_pretrained(model_name)
    model = declare_transformer(transformer_layer = transformer_layer, 
                                d_input = d_input, 
                                cls_token = cls_token, 
                                n_classes = len(categories),
                                dropout_frac = 0.25)
    model.compile(tf.keras.optimizers.Adam(learning_rate=3e-5), loss=loss, metrics=['accuracy'])

model.summary()

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 128)]             0         
_________________________________________________________________
tf_bert_model (TFBertModel)  ((None, 128, 768), (None, 109482240 
_________________________________________________________________
tf_op_layer_strided_slice (T [(None, 768)]             0         
_________________________________________________________________
dropout_37 (Dropout)         (None, 768)               0         
_________________________________________________________________
dense (Dense)                (None, 33)                25377     
Total params: 109,507,617
Trainable params: 109,507,617
Non-trainable params: 0
_________________________________________________________________
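
The parameter counts in the summary can be checked by hand. The short calculation below uses only numbers printed in the summary above (hidden size 768, 33 output classes, and the `TFBertModel` parameter count).

```python
hidden_size = 768                   # width of the BERT output, from the summary
n_classes = 33                      # output units of the Dense layer, from the summary

dense_params = hidden_size * n_classes + n_classes   # weights + biases
transformer_params = 109_482_240                     # TFBertModel row of the summary
total_params = transformer_params + dense_params

print(dense_params)   # 25377
print(total_params)   # 109507617
```

The classification head accounts for only a tiny fraction of the trainable parameters; almost all of them belong to the transformer being fine-tuned.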


## Training and testing

Now we can train our model:

In [11]:
n_steps = len(training_labels) // BATCH_SIZE

train_log = model.fit(
                training_set,
                steps_per_epoch=n_steps,
                epochs=EPOCHS,
                verbose = 1)

Epoch 1/10
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


And measure its performance on the test set:

In [12]:
predictions = model.predict(test_set, verbose = 0)
one_hot_preds = np.argmax(predictions, axis = 1)
true_classes = np.argmax(test_labels, axis = 1)
pred_category = [categories[x] for x in one_hot_preds]
true_category = [categories[x] for x in true_classes]
accuracy = sklearn.metrics.accuracy_score(true_category, pred_category)   # signature is (y_true, y_pred)
print("test set accuracy is {} for {}".format(accuracy, model_name))

test set accuracy is 0.7226271557732903 for bert-base-uncased
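
The evaluation step reduces to taking the arg-max of each probability vector and counting matches. A dependency-free sketch with hypothetical predictions and labels for four examples:

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical softmax outputs and one-hot labels for 4 test examples.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
test_labels = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]]

predicted = [argmax(p) for p in predictions]
actual = [argmax(t) for t in test_labels]
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

print(predicted, actual, accuracy)  # [0, 1, 2, 0] [0, 1, 0, 0] 0.75
```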


After 10 training epochs, we observed a test-set accuracy of around 72% for BERT and around 70% for GPT-2 on the cleaned dataset (the one with merged categories).