# Art of Bert: Unlocking the full potential of transformers for your business

## Follow me or checkout my work 

### Social Media

* [LinkedIN](https://www.linkedin.com/in/thushanganegedara/)
* [Twitter](https://twitter.com/thush89)
* [Medium](https://thushv89.medium.com)
* [YouTube](https://www.youtube.com/channel/UC1HkxV8PtmWRyQ39MfzmtGA)


### Video Courses

* [Machine Translation in Python](https://www.datacamp.com/courses/machine-translation-in-python)

### Books

<table align='left'>
    <td>
        <a target="_blank" href="https://www.manning.com/books/tensorflow-in-action"><img src="images/manning.png" width='160px' /></a>
    </td>
</table>

<table align='left'>
    <td>
        <a target="_blank" href="https://www.amazon.com.au/Natural-Language-Processing-TensorFlow-Ganegedara/dp/1788478312/ref=sr_1_40?dchild=1&keywords=nlp+with+tensorflow&qid=1615628750&sr=8-40"><img src="images/packt.png" width='180px' /></a>
    </td>
</table>


## Importing libraries and other important information

In [96]:
import pandas as pd
from datasets import load_dataset
import numpy as np
# TF 2.4
# Cuda 11.0
# CuDNN 8.0.5 (CUDA 11.0)
# at nvidia-smi if you encounter Failed to initialize NVML: Driver/library version mismatch
# simply restart the computer and see if it goes away
import tensorflow as tf
import os

print(tf.config.list_physical_devices('GPU'))

random_seed=4321
from transformers.trainer_utils import set_seed

set_seed(random_seed)

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Download the dataset

In [2]:
dataset = load_dataset('financial_phrasebank', 'sentences_50agree', script_version="master" )

Reusing dataset financial_phrasebank (/home/thushv89/.cache/huggingface/datasets/financial_phrasebank/sentences_50agree/1.0.0/8573a5b5922d152c7b77924429a18b5546458c179db2685eb266b227d51d1b6b)


## Print the data

In [3]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 4846
    })
})


In [4]:
# Print a few samples in the data
for s,l in zip(dataset['train']['sentence'][:10], dataset['train']['label'][:10]):
    print(s)
    print('\tLabel: {}'.format(l))

According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .
	Label: 1
Technopolis plans to develop in stages an area of no less than 100,000 square meters in order to host companies working in computer technologies and telecommunications , the statement said .
	Label: 1
The international electronic industry company Elcoteq has laid off tens of employees from its Tallinn facility ; contrary to earlier layoffs the company contracted the ranks of its office workers , the daily Postimees reported .
	Label: 0
With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability .
	Label: 2
According to the company 's updated strategy for the years 2009-2012 , Basware targets a long-term net sales growth in the range of 20 % -40 % with an operating profit margin of 10 % -20 % of net sales .
	

## How many examples per label?

In [5]:
# How many examples per label
print(pd.Series(dataset['train']['label']).value_counts())

1    2879
2    1363
0     604
dtype: int64


##  Split to train/validation/testing data

In [6]:
inputs, labels = np.array(dataset['train']['sentence']).reshape(-1,1), np.array(dataset['train']['label'])

In [7]:
from imblearn.under_sampling import OneSidedSelection, NearMiss, RandomUnderSampler
import numpy as np

n=75 # Number of instances for each class for train/validation sets

# Define a random undersample that will give equal amounts of data for each label randomly
rus = RandomUnderSampler(sampling_strategy={0:n, 1:n, 2:n}, random_state=random_seed)
rus.fit_resample(inputs, labels)

# Get test indices from the random undersampler
test_inds = rus.sample_indices_
test_x, test_y = inputs[test_inds], np.array(labels)[test_inds]
print("Test statistics")
print(pd.Series(test_y).value_counts())

# Get rest (train + valid)
rest_inds = [i for i in range(inputs.shape[0]) if i not in test_inds]
rest_x, rest_y = inputs[rest_inds], labels[rest_inds]

# Get valid indices from the random undersampler
rus.fit_resample(rest_x, rest_y)
valid_inds = rus.sample_indices_
valid_x, valid_y = rest_x[valid_inds], rest_y[valid_inds]
print("Valid statistics")
print(pd.Series(valid_y).value_counts())

# Rest goes in training
train_inds = [i for i in range(rest_x.shape[0]) if i not in valid_inds]
train_x, train_y = rest_x[train_inds], rest_y[train_inds]
print("Train statistics")
print(pd.Series(train_y).value_counts())

Test statistics
2    75
1    75
0    75
dtype: int64
Valid statistics
2    75
1    75
0    75
dtype: int64
Train statistics
1    2729
2    1213
0     454
dtype: int64


## Sample sentences

In [8]:
test_x[:10,0].tolist()

['Finnish communication electronics components supplier Scanfil Oyj Tuesday said sales in the first half of 2006 will be 15 % lower than during the same period a year ago .',
 'Finnish Exel Composites , a technology company that designs , manufactures , and markets composite profiles and tubes for various industrial applications , reports its net sales decreased by 0.6 % in the second quarter of 2010 to EUR 19.2 mn from EUR 19.3 mn in the corresponding period in 2009 .',
 'Earnings per share ( EPS ) amounted to a loss of to EUR0 .06 .',
 "In the first half of 2008 , the Bank 's operating profit fell to EUR 11.8 mn from EUR 18.9 mn , while net interest income increased to EUR 20.9 mn from EUR 18.8 mn in the first half of 2007 .",
 "Last year 's third quarter result had been burdened by costs stemming from restructuring in the US .",
 "HELSINKI ( AFX ) - KCI Konecranes said that Franklin Resources Inc 's share of voting rights in the Finnish cranes company fell last week to 4.65 pct from

## Get to know BERT: Understanding the crux of BERT

URL: https://huggingface.co/bert-base-uncased

In [9]:
model_zoo = {}
tokenizer_zoo = {}

### Download model + tokenizer

In [13]:
from transformers import BertTokenizer, TFBertModel

# Downloading and saving the tokenizer + model
tokenizer_zoo['bert'] = BertTokenizer.from_pretrained('bert-base-uncased')
model_zoo['bert'] = TFBertModel.from_pretrained("bert-base-uncased")


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [11]:
text = "Nokia s U.S. shares were 3.3 percent lower at $ 12.73 today than what it was yesterday ."

### Understanding the tokenizer

In [12]:
# For demonstration purpose - It's already downloaded from the code cell above
tokenizer_zoo['bert'] = BertTokenizer.from_pretrained('bert-base-uncased')

# Encode to IDs
encoded_ids = tokenizer_zoo['bert'](text, return_tensors='tf')
print(encoded_ids)

# Get the tokens from the ids
encoded_tokens = tokenizer_zoo['bert'].convert_ids_to_tokens(encoded_ids['input_ids'].numpy()[0])
print(encoded_tokens)

{'input_ids': <tf.Tensor: shape=(1, 27), dtype=int32, numpy=
array([[  101, 22098,  1055,  1057,  1012,  1055,  1012,  6661,  2020,
         1017,  1012,  1017,  3867,  2896,  2012,  1002,  2260,  1012,
         6421,  2651,  2084,  2054,  2009,  2001,  7483,  1012,   102]],
      dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 27), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 27), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1]], dtype=int32)>}
['[CLS]', 'nokia', 's', 'u', '.', 's', '.', 'shares', 'were', '3', '.', '3', 'percent', 'lower', 'at', '$', '12', '.', '73', 'today', 'than', 'what', 'it', 'was', 'yesterday', '.', '[SEP]']


### Getting the model output

In [13]:
def get_model_output(text, model, tokenizer):
    """ Generate the model output for a given text """
    # [CLS] is automatically added by the tokenizer
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)
    return output

In [14]:
# Get the output of BERT
model_zoo['bert'] = TFBertModel.from_pretrained("bert-base-uncased")
output = get_model_output(text, model_zoo['bert'], tokenizer_zoo['bert'])

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [15]:
# Checking various attributes of the final output
print(output.pooler_output.shape)
print(output.last_hidden_state.shape)
print(output.attentions.shape)

(1, 768)
(1, 27, 768)


AttributeError: 'NoneType' object has no attribute 'shape'

### Auxiliary outputs with BERT

In [16]:
from transformers import BertTokenizer, TFBertModel, BertConfig

# We are passing in special arguments in the config to BERT
config = BertConfig.from_pretrained('bert-base-uncased', output_attentions=True, output_hidden_states=False)

model_zoo['bert_aux_outputs'] = TFBertModel.from_pretrained("bert-base-uncased", config=config)

# Get the output after specifying special args
output = get_model_output(text, model_zoo['bert_aux_outputs'], tokenizer_zoo['bert'])



Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [17]:
print(model_zoo['bert_aux_outputs'] .config)

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_attentions": true,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.3.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



In [18]:
print(output.pooler_output.shape)
print(output.last_hidden_state.shape)
print(output.attentions[0].shape)

(1, 768)
(1, 27, 768)
(1, 12, 27, 27)


## Visualize attention

You can select a layer by changing the layer and different colors depicts different attention heads

In [20]:
import torch
from bertviz import head_view

head_view([torch.from_numpy(layer_attn.numpy()) for layer_attn in output[-1]], encoded_tokens)

<IPython.core.display.Javascript object>

## Two worlds apart: How BERT is misunderstanding sentences

In [22]:
# Let's see some common words in our corpus
from collections import Counter

# We use a counter to count the frequency of individual words
cnt = Counter([w for doc in train_x[:,0].tolist() for w in doc.split() ])
print('Vocabulary size: {}'.format(len(cnt)))
print('\n', '='*50, '\n')
print("Common words ... ")
print(cnt.most_common(100))


Vocabulary size: 12323


Common words ... 
[('.', 4339), ('the', 4292), (',', 4225), ('of', 2899), ('in', 2461), ('and', 2364), ('to', 2263), ('a', 1467), ('The', 1248), ('for', 1019), ("'s", 916), ('is', 848), ('will', 795), ('EUR', 791), ('company', 735), ('from', 672), ('on', 613), ('its', 550), ('has', 525), ('be', 506), ('with', 498), ('by', 485), ('said', 484), (')', 461), ('(', 459), ('Finnish', 452), ('as', 446), ('mn', 424), ('that', 398), ('million', 398), ('at', 397), ('%', 384), ('sales', 368), (':', 333), ('was', 330), ('profit', 328), ('it', 325), ('net', 313), ('Finland', 306), ('an', 281), ('-', 277), ('are', 267), ('2009', 266), ('2008', 247), ('mln', 247), ('m', 246), ('``', 234), ('period', 231), ('year', 229), ('new', 228), ('2007', 215), ('share', 207), ('business', 207), ('2010', 204), ('have', 201), ('Oyj', 201), ('which', 200), ("''", 200), ('quarter', 197), ('market', 197), ('In', 196), ('also', 188), ('$', 188), ('shares', 181), ('services', 175), ('up', 167),

In [18]:
from sklearn.metrics.pairwise import cosine_distances

def get_distance(s1,s2, model, tokenizer):
    """ Computes the distance between two sentence representations from BERT """
    o1 = get_model_output(s1, model, tokenizer)
    o2 = get_model_output(s2, model, tokenizer)

    d = cosine_distances(o1.pooler_output, o2.pooler_output)[0][0]
    return d


In [24]:
# Sentences that are not similar
s1 = "my fishing net was torn"
s2 = "the profit margin was up"
d = get_distance(s1, s2, model_zoo['bert'], tokenizer_zoo['bert'])
print("Distance - {} / {}: {}".format(s1, s2, d))

# Sentences that shouldn't be similar but has a common word
s1 = "my fishing net was torn"
s2 = "the net profit margin was up"
d = get_distance(s1, s2, model_zoo['bert'], tokenizer_zoo['bert'])
print("Distance - {} / {}: {}".format(s1, s2, d))

s1 = "john bought an orange"
s2 = "john bougth Apple shares"
d = get_distance(s1, s2, model_zoo['bert'], tokenizer_zoo['bert'])
print("Distance - {} / {}: {}".format(s1, s2, d))

s1 = "john bought an apple"
s2 = "john bougth Apple shares"
d = get_distance(s1, s2, model_zoo['bert'], tokenizer_zoo['bert'])
print("Distance - {} / {}: {}".format(s1, s2, d))

# Sentences that should be similar but shown different
s1 = "john bought Apple stocks"
s2 = "john bougth Tesla shares"
d = get_distance(s1, s2, model_zoo['bert'], tokenizer_zoo['bert'])
print("Distance - {} / {}: {}".format(s1, s2, d))

Distance - my fishing net was torn / the profit margin was up: 0.10874581336975098
Distance - my fishing net was torn / the net profit margin was up: 0.05190145969390869
Distance - john bought an orange / john bougth Apple shares: 0.026924610137939453
Distance - john bought an apple / john bougth Apple shares: 0.028050243854522705
Distance - john bought Apple stocks / john bougth Tesla shares: 0.22359681129455566


## Finbert to the rescue!

We will first download the Finbert tokenizer.
1. Go to https://github.com/yya518/FinBERT and download the [Finbert-Uncased](https://gohkust-my.sharepoint.com/:t:/g/personal/imyiyang_ust_hk/EX3C-KM9bTxOjdttsPslLZUBw_mh9Jdh8PB0WTv6b2tEIA?e=DYBVJY) vocabulary file.
2. Create the folder structure `data/fin_vocab`, relative the path of this notebook (i.e. if this notebook is located in `/home/<user>/code/Demo/odsc/Art_of_BERT.ipynb`, create the folder `/home/<user>/code/Demo/odsc/data/fin_vocab`.
3. Copy the downloaded file to `data/fin_vocab` and rename it to `vocab.txt`

In [11]:
from transformers import BertTokenizer, TFBertModel

# Download FinBERT tokenizer
# Before running the model below download the vocab file by following the instruction above
tokenizer_zoo['finbert'] = BertTokenizer.from_pretrained(os.path.join('data', 'fin_vocab'))


## Tokenizers of BERT and FinBERT

In [15]:
text = "401(k) Asset Turnover Ratio Blockchain Bankruptcy Demonetization FAANG Stocks Fintech Home Equity Loan Indemnity Insurance Liquidity Macroeconomics"

# Output of BERT
encoded_ids = tokenizer_zoo['bert'](text, return_tensors='tf')
encoded_tokens = tokenizer_zoo['bert'].convert_ids_to_tokens(encoded_ids['input_ids'].numpy()[0])
print("Bert says ...\n")
print(encoded_tokens)
print('\n','='*50, '\n')

# Output of FinBERT
encoded_ids = tokenizer_zoo['finbert'](text, return_tensors='tf')
encoded_tokens = tokenizer_zoo['finbert'].convert_ids_to_tokens(encoded_ids['input_ids'].numpy()[0])
print('FinBERT says ...\n')
print(encoded_tokens)

Bert says ...

['[CLS]', '401', '(', 'k', ')', 'asset', 'turnover', 'ratio', 'block', '##chai', '##n', 'bankruptcy', 'demon', '##eti', '##zation', 'faa', '##ng', 'stocks', 'fin', '##tech', 'home', 'equity', 'loan', 'ind', '##em', '##nity', 'insurance', 'liquid', '##ity', 'macro', '##economic', '##s', '[SEP]']


FinBERT says ...

['[CLS]', '401', '(', 'k', ')', 'asset', 'turnover', 'ratio', 'block', '##chain', 'bankruptcy', 'demo', '##net', '##ization', 'faa', '##ng', 'stocks', 'fin', '##tech', 'home', 'equity', 'loan', 'indemnity', 'insurance', 'liquidity', 'macroeconomic', '##s', '[SEP]']


### Load the FinBERT model

Here we download the FinBERT model

1. Go to https://github.com/yya518/FinBERT and download the [FinBERT-FinVocab-Uncased](https://gohkust-my.sharepoint.com/:f:/g/personal/imyiyang_ust_hk/EksJcamJpclJlbMweFfB5DQB1XrsxURYN5GSqZw3jmSeSw?e=KAyhsX)
2. Create a folder `models/finbert` relative to the path of this notebook.
3. Extract the files (there should be two `config.json` and `pytorch_model.bin`) to `models/finbert`.

In [16]:
# Load FinBERT from downloaded files
# Before running the code below download the files by following the instruction above
model_zoo['finbert'] = TFBertModel.from_pretrained(os.path.join('models','finbert'), from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already

In [19]:
# Sentences that are not similar
s1 = "my fishing net was torn"
s2 = "the profit margin was up"
d = get_distance(s1, s2, model_zoo['bert'], tokenizer_zoo['finbert'])
print("Distance - {} / {}: {}".format(s1, s2, d))

# Sentences that shouldn't be similar but has a common word
s1 = "my fishing net was torn"
s2 = "the net profit margin was up"
d = get_distance(s1, s2, model_zoo['bert'], tokenizer_zoo['finbert'])
print("Distance - {} / {}: {}".format(s1, s2, d))

s1 = "john bought an orange"
s2 = "john bougth Apple shares"
d = get_distance(s1, s2, model_zoo['bert'], tokenizer_zoo['finbert'])
print("Distance - {} / {}: {}".format(s1, s2, d))

s1 = "john bought an apple"
s2 = "john bougth Apple shares"
d = get_distance(s1, s2, model_zoo['bert'], tokenizer_zoo['finbert'])
print("Distance - {} / {}: {}".format(s1, s2, d))

# Sentences that should be similar but shown different
s1 = "john bought Apple stocks"
s2 = "john bougth Tesla shares"
d = get_distance(s1, s2, model_zoo['bert'], tokenizer_zoo['finbert'])
print("Distance - {} / {}: {}".format(s1, s2, d))

Distance - my fishing net was torn / the profit margin was up: 0.31897926330566406
Distance - my fishing net was torn / the net profit margin was up: 0.3376138210296631
Distance - john bought an orange / john bougth Apple shares: 0.12963992357254028
Distance - john bought an apple / john bougth Apple shares: 0.19487375020980835
Distance - john bought Apple stocks / john bougth Tesla shares: 0.04805094003677368


## In retrospective - You don't want to trust BERT as your financial advisor 

Distance - my fishing net was torn / the profit margin was up

* `BERT`:    0.10874581336975098
* `FinBERT`: 0.31897926330566406

Distance - my fishing net was torn / the net profit margin was up

* `BERT`:    0.05190145969390869
* `FinBERT`: 0.3376138210296631

Distance - john bought an orange / john bougth Apple shares

* `BERT`:    0.026924610137939453
* `FinBERT`: 0.12963992357254028

Distance - john bought an apple / john bougth Apple shares

* `BERT`:    0.028050243854522705
* `FinBERT`: 0.19487375020980835

Distance - john bought Apple stocks / john bougth Tesla shares

* `BERT`:    0.22359681129455566
* `FinBERT`: 0.04805094003677368


## Finbert for financial sentiment classification

Training: https://huggingface.co/transformers/v3.0.2/training.html

### Download the model

In [20]:
from transformers import TFBertForSequenceClassification

# Let's now create BERT + classifier model to solve the usecase
num_labels = 3
model_zoo['finbert+classifier'] = TFBertForSequenceClassification.from_pretrained(os.path.join('models','finbert'), num_labels=num_labels, from_pt=True)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Compiling the model

In [21]:
# Compile the model with an optimizer + loss funciton

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model_zoo['finbert+classifier'].compile(optimizer=optimizer, loss=loss, metrics=tf.keras.metrics.SparseCategoricalAccuracy())
model_zoo['finbert+classifier'].summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109751808 
_________________________________________________________________
dropout_111 (Dropout)        multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  2307      
Total params: 109,754,115
Trainable params: 109,754,115
Non-trainable params: 0
_________________________________________________________________


### Get the sequence length

In [22]:
# See some summary stats of the sequence length of the data

seq_length = []
for x in train_x:
    seq_length.append(len(tokenizer_zoo['finbert'](x[0])['input_ids']))

print(pd.Series(seq_length).describe(percentiles=[0.25,0.5,0.75,0.9]))
print(pd.Series(seq_length).median())

count    4396.00000
mean       30.24818
std        13.43368
min         5.00000
25%        20.00000
50%        28.00000
75%        38.00000
90%        49.00000
max       134.00000
dtype: float64
28.0


### Convert raw examples to features 

In [95]:
def examples_to_features(x, tokenizer):
    """ Convert an example to a dictionary BERT model accepts """
    res = tokenizer(x, padding=True, truncation=True, max_length=50, return_tensors='tf')
    return dict(list(res.items()))

train_tokens = examples_to_features(train_x.reshape(-1).tolist(), tokenizer_zoo['finbert'])
train_labels = train_y.reshape(-1,1)

valid_tokens = examples_to_features(valid_x.reshape(-1).tolist(), tokenizer_zoo['finbert'])
valid_labels = valid_y.reshape(-1,1)


# Sample data
sample_text = "Company shares are down"
print(examples_to_features(sample_text, tokenizer_zoo["finbert"]))

{'input_ids': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[  3,  37, 203,  21, 269,   4]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1]], dtype=int32)>}


### Training the model

You can train your model with more options, using the [`Trainer`](https://huggingface.co/transformers/main_classes/trainer.html) provided in the library.

In [24]:
# Finetune FinBert on the financial sentiment classification task

model_zoo['finbert+classifier'].fit(
    train_tokens, train_labels, 
    validation_data=(valid_tokens, valid_labels), 
    epochs=3
)

Epoch 1/3
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7ff777fb4110> is not a module, class, method, function, traceback, frame, or code object
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7ff777fb4110> is not a module, class, method, function, traceback, frame, or code object

Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7ff600a187f0>

### Save the model

In [25]:
# Create folders
if not os.path.exists(os.path.join('models', 'finbert_classifier')):
    os.makedirs(os.path.join('models', 'finbert_classifier'))
if not os.path.exists(os.path.join('tokenizers', 'finbert_classifier')):
    os.makedirs(os.path.join('tokenizers', 'finbert_classifier'))

# Save the modle
model_zoo['finbert+classifier'].save_pretrained(os.path.join('models', 'finbert_classifier'))

# Save the tokenizer
tokenizer_zoo['finbert'].save_pretrained(os.path.join('tokenizers', 'finbert_classifier'))

('tokenizers/finbert_classifier/tokenizer_config.json',
 'tokenizers/finbert_classifier/special_tokens_map.json',
 'tokenizers/finbert_classifier/vocab.txt',
 'tokenizers/finbert_classifier/added_tokens.json')

### Load and test the model 

In [30]:
# Load the model
model_zoo['finbert+classifier'] = TFBertForSequenceClassification.from_pretrained(
    os.path.join('models','finbert_classifier'), num_labels=num_labels)

# Load the tokenizer
tokenizer_zoo['finbert'] = BertTokenizer.from_pretrained(os.path.join('tokenizers', 'finbert_classifier'))

# Compile the model so it can be used for predictions
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model_zoo['finbert+classifier'].compile(optimizer=optimizer, loss=loss, metrics=tf.keras.metrics.SparseCategoricalAccuracy())

Some layers from the model checkpoint at models/finbert_classifier were not used when initializing TFBertForSequenceClassification: ['dropout_111']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at models/finbert_classifier.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [31]:
# Get test performance
test_tokens = examples_to_features(test_x.reshape(-1).tolist(), tokenizer_zoo['finbert'])
test_labels = test_y.reshape(-1,1)

model_zoo['finbert+classifier'].evaluate(test_tokens, test_labels)



[0.4354636073112488, 0.8577777743339539]

## Training BERT on the same data

In [32]:
from transformers import TFBertForSequenceClassification
import tensorflow.keras.backend as K
K.clear_session()

# Define a BERT + classifier for comparison
num_labels = 3
model_zoo['bert+classifier'] = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)

# Compile the model so that we can train it
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model_zoo['bert+classifier'].compile(optimizer=optimizer, loss=loss, metrics=tf.keras.metrics.SparseCategoricalAccuracy())
model_zoo['bert+classifier'].summary()

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  2307      
Total params: 109,484,547
Trainable params: 109,484,547
Non-trainable params: 0
_________________________________________________________________


In [33]:
# Train the model
model_zoo['bert+classifier'].fit(
    train_tokens, train_labels, 
    validation_data=(valid_tokens, valid_labels), 
    epochs=3
)

# Evaluate the model on test data
model_zoo['bert+classifier'].evaluate(test_tokens, test_labels)

Epoch 1/3
Epoch 2/3
Epoch 3/3


[1.0434657335281372, 0.4933333396911621]

## Training BERT from Scratch

### Generating the data

In [71]:
from transformers.data.data_collator import DataCollatorForLanguageModeling

def generate_masked_data(x, tokenizer):
    """ Given a text corpus this generates masked data for pretraining """
        
    if isinstance(x, np.ndarray):
        x = x.ravel().tolist()
    
    # Create encoded input from the tokenizer
    encoded_input = tokenizer(x, padding='max_length', max_length=50, truncation=True, return_special_tokens_mask=True, return_tensors='pt')
    
    # Randomly mask words and create inputs/outputs
    # Masked words are represented by [MASK] in the input
    # Trivial words (unmasked) in the output are represented by -100
    lm_inputs, lm_outputs = DataCollatorForLanguageModeling(
        tokenizer, mlm_probability=0.3
    ).mask_tokens(
        encoded_input['input_ids'], encoded_input['special_tokens_mask']
    )
    
    # We get torch tensors, convert them to numpy
    lm_inputs = lm_inputs.numpy()
    lm_outputs = lm_outputs.numpy()
    
    return lm_inputs, lm_outputs

# Generate training masked data
train_lm_inputs, train_lm_outputs = generate_masked_data(train_x, tokenizer['bert'])

# See the shape
print(train_lm_inputs.shape)
print(train_lm_outputs.shape)

# See some samples
print(train_lm_inputs[:3])
print(train_lm_outputs[:3])

(4396, 50)
(4396, 50)
[[  101  2429   103 12604  1010  1996  2194  2038  2053   103  2000   103
    103  2537  2000  3607  1010   103   103  2003  2073  1996  2194  2003
    103  1012   102     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0]
 [  101   103 17699  3488  2000  4503  1999  5711  2019  2181  1997  2053
   2625   103  2531   103  2199   103   103   103  2344  2000  3677  3316
   2551  1999  3274  6786  1998 12108  1010   103  4861   103   103   102
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0]
 [  101  1996  2248   103  3068  2194  3449 19800  4160  2038   103  2125
  15295  1997  5126  2013  2049 21169   120  1025 10043  2000 14623  3913
    103  1996  2194   103  1996  6938  1997  2049  2436  3667  1010  1996
    103  2695 14428  2229  2988  1012   102     0     0     0     0     0
      0     0]]
[[ -100  -100  2000  -100  -100  -100  -100 

### Download the MLM model

In [91]:
from transformers import TFBertForMaskedLM
import tensorflow.keras.backend as K
K.clear_session()

model_zoo['bert-mlm'] = TFBertForMaskedLM.from_pretrained('bert-base-uncased')


All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


### Compile and train the model

In [92]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy
import tensorflow as tf
from transformers.modeling_tf_utils import TFMaskedLanguageModelingLoss

def masked_sparse_categorical_crossentropy(y_true, y_pred):
    """ Actual computations that go through (Educational puproses) """
    
    # Reshape inputs/targets 
    y_true = tf.reshape(y_true,[-1,])
    y_pred = tf.reshape(y_pred, [-1, y_pred.shape[2]])
    
    # Create a mask where y_true is not equal -100
    mask = tf.not_equal(y_true, -100)
    
    # Get the elements that satisfies the mask from y_true and y_pred
    valid_true = tf.boolean_mask(y_true, mask)
    valid_pred = tf.boolean_mask(y_pred, mask)
    
    # Get the loss
    return SparseCategoricalCrossentropy(
        from_logits=True, 
        reduction='sum'
    )(valid_true, valid_pred)
    
def masked_sparse_categorical_crossentropy_v2(y_true, y_pred):
    """ Using the built in loss (Recommended) """
    return TFBertPreTrainingLoss().compute_loss(labels={"labels": y_true}, logits=[y_pred])

# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
model_zoo['bert-mlm'].compile(optimizer=optimizer, loss=TFMaskedLanguageModelingLoss().compute_loss)



In [94]:
# Input to the model needs to have this format
inputs = {'input_ids': train_lm_inputs, 'token_type_ids': np.zeros_like(train_lm_inputs)}

# Train the model
model_zoo['bert-mlm'].fit(train_lm_inputs, train_lm_outputs, batch_size=4)



<tensorflow.python.keras.callbacks.History at 0x7f3138a71240>

### Code used for debugging

**Context**: I was getting `nan` values in the loss. Turns out, I was using the `cased` BERT model with an `uncased` tokenizer. The difference in the vocabulary size was the culprit. This code below was used to debug the model.

In [93]:
for i in range(0, 500, 1):
    
    b_x, b_y = train_lm_inputs[i:i+1], train_lm_outputs[i:i+1]
    inp = {'input_ids': b_x, 'token_type_ids': np.zeros_like(b_x), 'labels': b_y}
    out = model_zoo['bert-mlm'](inp)
    loss_2 = TFMaskedLanguageModelingLoss().compute_loss(b_y, out.logits)
    
    if tf.math.is_nan(tf.reduce_sum(out.loss)):
        print(out.loss)
        print(loss_2)
        print(b_x)
        print(b_y)
        
    if tf.math.is_nan(tf.reduce_sum(loss_2)):
        print(out.loss)
        print(loss_2)
        print(b_x)
        print(b_y)
        





KeyboardInterrupt: 