# Hugging Face Demonstration with TensorFlow
## Text Classification
### TP Goter
### January 25, 2021

In [None]:
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd
from transformers import TFBertModel, TFBertForSequenceClassification, BertTokenizer
from transformers import TFTrainer, TFTrainingArguments
import os
from pprint import pprint
import tensorflow as tf
from tqdm import tqdm
from tensorflow.keras import layers, initializers

In [45]:
print(f'Numpy version: {np.__version__}')
print(f'Pandas version: {pd.__version__}')
print(f'TensorFlow version: {tf.__version__}')

Numpy version: 1.18.5
Pandas version: 1.1.2
TensorFlow version: 2.3.1


## Load pre-trained BERT-Base model with tokenizer

In [25]:
#model = TFBertModel.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Using Subset of AG News Dataset

#### Sourced from [this GitHub repo](https://github.com/mhjabreel/CharCnn_Keras)
AG is a collection of more than 1 million news articles. The articles were gathered from more than 2000 news sources by ComeToMyHead over more than a year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

In [26]:
model.summary()

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_75 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  3076      
Total params: 109,485,316
Trainable params: 109,485,316
Non-trainable params: 0
_________________________________________________________________
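The parameter counts in the summary can be checked by hand. The short sketch below assumes the BERT-Base hidden size of 768 (the same width that appears in the `(30522, 768)` embedding shape later in this notebook) and treats the classifier as a single dense layer with one weight per hidden unit per label plus one bias per label.

```python
# Worked arithmetic for the model summary above.
hidden_size = 768           # BERT-Base hidden width (assumed here)
num_labels = 4              # passed to from_pretrained(..., num_labels=4)
encoder_params = 109_482_240  # "bert (TFBertMainLayer)" row in the summary

# Dense layer: a (hidden_size x num_labels) weight matrix plus num_labels biases.
classifier_params = hidden_size * num_labels + num_labels
total_params = encoder_params + classifier_params

print(classifier_params)  # 3076
print(total_params)       # 109485316
```

Only these 3,076 classifier weights are newly initialized, which is why the loading message recommends training on a downstream task.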


In [5]:
TRAINING_DATA = './ag_news_csv/train.csv'
VAL_DATA = './ag_news_csv/test.csv' 

### Gather Training Data
Read each CSV file into a pandas DataFrame and name the columns for easier access. The regex replacement swaps the backslashes that stand in for line breaks in the descriptions with spaces.

In [6]:
train_df = pd.read_csv(TRAINING_DATA, header=None)
train_df.columns = ['label', 'title', 'desc']
train_df.desc = train_df.desc.replace(r'\\', ' ', regex=True)

test_df = pd.read_csv(VAL_DATA, header=None)
test_df.columns = ['label', 'title', 'desc']
test_df.desc = test_df.desc.replace(r'\\', ' ', regex=True)
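With `regex=True`, pandas applies the pattern using Python's `re` semantics, so the effect of `r'\\'` can be seen on a single string. The sample text below is an illustrative string written for this sketch, not a row read from the CSV.

```python
import re

# A description containing a literal backslash where a line break used to be.
# The doubled backslash in the Python literal is ONE backslash in the string.
raw = "Wall Street's dwindling\\band of ultra-cynics"

# r'\\' is a regex matching a single literal backslash.
cleaned = re.sub(r'\\', ' ', raw)

print(cleaned)  # Wall Street's dwindling band of ultra-cynics
```

Without this step the tokenizer would see `dwindling\band` as one odd token sequence rather than two ordinary words.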

In [7]:
train_df.label.value_counts()

4    30000
3    30000
2    30000
1    30000
Name: label, dtype: int64

In [8]:
test_df.label.value_counts()

3    1900
2    1900
1    1900
4    1900
Name: label, dtype: int64

In [9]:
# The labels are a column in the data frame - pop them into their own object
train_labels = train_df.label.values
train_labels = train_labels -1

# Get the training sentences
train_sentences = train_df.desc.values

# The labels are a column in the data frame - pop them into their own object
test_labels = test_df.label.values
test_labels = test_labels -1

# Get the test sentences
test_sentences = test_df.desc.values
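Subtracting 1 moves the labels from the CSV's 1..4 range to 0..3, which is what a sparse categorical loss with `num_labels=4` expects. The sketch below mirrors `train_labels - 1` on a plain list. The class names are an assumption taken from the ordering commonly listed in the dataset's `classes.txt`; verify them against your copy.

```python
# Assumed class names, in the order of the original 1-based labels.
CLASS_NAMES = ['World', 'Sports', 'Business', 'Sci/Tech']

raw_labels = [3, 4, 1, 2]                      # labels as stored in the CSV
zero_based = [label - 1 for label in raw_labels]

# Every index must be a valid position in a 4-way output layer.
assert all(0 <= idx < len(CLASS_NAMES) for idx in zero_based)

names = [CLASS_NAMES[idx] for idx in zero_based]
print(zero_based)  # [2, 3, 0, 1]
print(names)       # ['Business', 'Sci/Tech', 'World', 'Sports']
```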

In [10]:
def create_dataset(sequences, labels):
    input_ids = []
    attention_mask = []
    token_ids = []
    for sent in tqdm(sequences):
        encoded_dict = tokenizer.encode_plus(sent,
                     add_special_tokens = True,
                     padding = 'max_length',
                     max_length = 128,
                     truncation = True,
                     return_attention_mask = True,
                     return_token_type_ids = True,
                     return_tensors = 'tf') 
        input_ids.append(tf.reshape(encoded_dict['input_ids'],[-1]))
        #token_ids.append(tf.reshape(encoded_dict['token_type_ids'],[-1]))
        attention_mask.append(tf.reshape(encoded_dict['attention_mask'],[-1]))
    
    dataset = tf.data.Dataset.from_tensor_slices(({'input_ids': input_ids,
                                              #'token_type_ids':token_ids,
                                              'attention_mask':attention_mask}, labels))
    
    
    return dataset
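The `padding='max_length'`, `truncation=True`, and `max_length=128` arguments give every example the same length, which is what lets `from_tensor_slices` stack them. A minimal pure-Python sketch of that behaviour for a single sequence follows. `pad_or_truncate` is a hypothetical helper, not part of `transformers`; the ids 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` ids of the `bert-base-uncased` vocabulary.

```python
def pad_or_truncate(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    # Leave room for the two special tokens, dropping tokens from the end.
    body = token_ids[: max_length - 2]
    ids = [cls_id] + body + [sep_id]
    mask = [1] * len(ids)            # 1 marks real tokens
    n_pad = max_length - len(ids)
    # Padding positions get pad_id and a mask of 0 so attention ignores them.
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


short_ids, short_mask = pad_or_truncate([7592, 2088], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_length=6)

print(short_ids, short_mask)  # [101, 7592, 2088, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 1000, 1001, 1002, 1003, 102] [1, 1, 1, 1, 1, 1]
```

The attention mask is the only signal that tells the model which positions are padding, so it has to travel with `input_ids` into the dataset.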
    

## Run the function above for both the training and test data

In [11]:
training_dataset = create_dataset(train_sentences, train_labels)

test_dataset = create_dataset(test_sentences, test_labels)

100%|██████████| 120000/120000 [02:19<00:00, 862.99it/s] 
100%|██████████| 7600/7600 [00:09<00:00, 840.81it/s]


In [31]:
batched_training_dataset = training_dataset.shuffle(100).batch(8)
batched_test_dataset = test_dataset.shuffle(100).batch(8)
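`shuffle(100)` does not shuffle the whole dataset; it samples from a rolling buffer of 100 elements. The pure-Python sketch below imitates that scheme to show how local the mixing is. `buffered_shuffle` is a hypothetical helper written for illustration and is not how `tf.data` is implemented internally.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in buffer-shuffled order, imitating Dataset.shuffle."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered element
        buffer[idx] = item           # and replace it with the incoming one
    rng.shuffle(buffer)              # drain whatever is left
    yield from buffer


out = list(buffered_shuffle(range(1000), buffer_size=100))

# Nothing is lost or duplicated ...
assert sorted(out) == list(range(1000))
# ... but an element can only move earlier by fewer than buffer_size places.
assert all(value - position < 100 for position, value in enumerate(out))
```

With 120,000 training rows, a buffer of 100 only mixes near neighbours; a buffer closer to the dataset size gives a more uniform shuffle at the cost of memory. Shuffling the test set does not change aggregate evaluation metrics.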

In [18]:
METRICS = [
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall'),
]
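The `METRICS` list is not passed to `model.compile` in the next cell, and Keras `Precision` and `Recall` threshold predictions against 0/1 targets, which does not line up directly with integer labels and 4-class logits. One alternative is to compute per-class precision and recall from predicted class ids (for example, the argmax of the logits). The helper below is a hypothetical pure-Python sketch of that calculation, with made-up labels and predictions.

```python
def per_class_precision_recall(y_true, y_pred, num_classes):
    """Return {class_id: (precision, recall)} from integer class ids."""
    results = {}
    pairs = list(zip(y_true, y_pred))
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results[c] = (precision, recall)
    return results


# Made-up example: six items, four classes.
y_true = [0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 0, 0, 2]
scores = per_class_precision_recall(y_true, y_pred, num_classes=4)

print(scores[1])  # (1.0, 0.5)
print(scores[3])  # (0.0, 0.0)
```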

In [35]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss)
model.fit(batched_training_dataset, epochs=40, steps_per_epoch=20)
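`from_logits=True` tells the loss that the model outputs raw scores, so the softmax is applied inside the loss. The sketch below works through the same quantity for one example with the standard library; `sparse_ce_from_logits` is a hypothetical helper, not a Keras function.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Equal logits mean a uniform guess over 4 classes: loss = ln(4).
uniform = sparse_ce_from_logits([0.0, 0.0, 0.0, 0.0], label=2)
# A higher logit on the correct class lowers the loss.
confident = sparse_ce_from_logits([2.0, 0.0, 0.0, 0.0], label=0)

print(round(uniform, 4))  # 1.3863
assert confident < uniform
```

A training loss that starts near 1.39 and falls is therefore the expected pattern for a freshly initialized 4-way classifier head.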

The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.


Epoch 1/40


Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<tensorflow.python.keras.callbacks.History at 0x7fcd23d32b10>
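The `epochs` and `steps_per_epoch` settings fix how much data the run can touch. The arithmetic below uses the values from the `fit` call, the batch size of 8, and the 120,000 training rows counted earlier.

```python
# Training budget implied by the fit() call above.
epochs = 40
steps_per_epoch = 20
batch_size = 8
train_rows = 120_000

optimizer_steps = epochs * steps_per_epoch     # gradient updates in total
examples_seen = optimizer_steps * batch_size   # upper bound on rows consumed
fraction = examples_seen / train_rows

print(optimizer_steps)        # 800
print(examples_seen)          # 6400
print(f'{fraction:.1%}')      # 5.3%
```

So this demonstration fine-tunes on at most about 5% of the training set, which keeps the run short on a CPU.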

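## Cells below would be used for distributed training
`TFTrainer` runs each training step under a `tf.distribute.Strategy`, so these cells need a distribution strategy and a distributed dataset. The `ValueError` below arises because the model was created earlier in the notebook, outside that strategy's scope; instantiating the model inside `training_args.strategy.scope()` is the usual remedy.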

In [15]:
training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    max_steps=30,              # total # of training steps
    per_device_train_batch_size=4,  # batch size per device during training
    per_device_eval_batch_size=4,   # batch size for evaluation
    evaluation_strategy = 'steps',
    eval_steps = 20,
    warmup_steps=5,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=training_dataset,         # training dataset
    eval_dataset=test_dataset            # evaluation dataset
)

In [16]:
trainer.train()

The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.


ValueError: in user code:

    /Users/tom/anaconda3/lib/python3.7/site-packages/transformers/trainer_tf.py:672 distributed_training_steps  *
        self.args.strategy.run(self.apply_gradients, inputs)
    /Users/tom/anaconda3/lib/python3.7/site-packages/transformers/trainer_tf.py:635 apply_gradients  *
        self.optimizer.apply_gradients(list(zip(gradients, self.model.trainable_variables)))
    /Users/tom/anaconda3/lib/python3.7/site-packages/transformers/optimization_tf.py:232 apply_gradients  *
        return super(AdamWeightDecay, self).apply_gradients(zip(grads, tvars), name=name, **kwargs)
    /Users/tom/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:519 apply_gradients  **
        self._create_all_weights(var_list)
    /Users/tom/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:704 _create_all_weights
        self._create_slots(var_list)
    /Users/tom/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/adam.py:127 _create_slots
        self.add_slot(var, 'm')
    /Users/tom/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:757 add_slot
        .format(strategy, var))

    ValueError: Trying to create optimizer slot variable under the scope for tf.distribute.Strategy (<tensorflow.python.distribute.one_device_strategy.OneDeviceStrategy object at 0x7fce65620950>), which is different from the scope used for the original variable (<tf.Variable 'tf_bert_for_sequence_classification/bert/embeddings/word_embeddings/weight:0' shape=(30522, 768) dtype=float32, numpy=
    array([[-0.01018257, -0.06154883, -0.02649689, ..., -0.01985357,
            -0.03720997, -0.00975152],
           [-0.01170495, -0.06002603, -0.03233192, ..., -0.01681456,
            -0.04009988, -0.0106634 ],
           [-0.01975381, -0.06273633, -0.03262176, ..., -0.01650258,
            -0.04198876, -0.00323178],
           ...,
           [-0.02176224, -0.0556396 , -0.01346345, ..., -0.00432698,
            -0.0151355 , -0.02489496],
           [-0.04617237, -0.05647721, -0.00192082, ...,  0.01568751,
            -0.01387033, -0.00945213],
           [ 0.00145601, -0.08208051, -0.01597912, ..., -0.00811687,
            -0.04746607,  0.07527421]], dtype=float32)>). Make sure the slot variables are created under the same strategy scope. This may happen if you're restoring from a checkpoint outside the scope


In [17]:
tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU')]

In [43]:
tokenizer.encode('spellingz')

[101, 11379, 2480, 102]

In [44]:
tokenizer.decode(2480)

'# # z'
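The split of `spellingz` into two ids reflects WordPiece tokenization, which matches the longest known prefix and then continues with `##`-prefixed pieces. The sketch below is a simplified greedy longest-match-first version with a tiny toy vocabulary invented for illustration; it is not the `transformers` implementation, and `wordpiece` is a hypothetical helper.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry ##
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]                      # no piece fits: whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this sketch, not the real BERT vocab).
toy_vocab = {'spell', 'spelling', '##ing', '##z', '##s'}

print(wordpiece('spellingz', toy_vocab))  # ['spelling', '##z']
print(wordpiece('spells', toy_vocab))     # ['spell', '##s']
print(wordpiece('xyz', toy_vocab))        # ['[UNK]']
```

In the notebook output, ids 101 and 102 are the `[CLS]` and `[SEP]` markers, 11379 is the word piece for `spelling`, and 2480 is the continuation piece for `z`. To see the piece itself rather than the spaced-out `'# # z'`, `tokenizer.convert_ids_to_tokens([2480])` is the more direct call.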