<a href="https://colab.research.google.com/github/sravanisasu/BERT_Regression/blob/main/FinBERT_10K.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Set up the GPU**

In [1]:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


**Clone data from GitHub**

In [2]:
!git clone https://github.com/sravanisasu/10k-sample

Cloning into '10k-sample'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 9557 (delta 1), reused 1 (delta 0), pack-reused 9548
Receiving objects: 100% (9557/9557), 158.15 MiB | 20.90 MiB/s, done.
Resolving deltas: 100% (336/336), done.
Checking out files: 100% (10020/10020), done.


**Necessary imports and installations for the implementation of FinBERT Architecture**

In [3]:
%pip install sentencepiece
%pip install transformers

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.95
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0d

In [4]:
import tensorflow_hub as hub
import tensorflow as tf
import os as os
import regex as re
import pandas as pd
import numpy as np
from transformers import BertTokenizer,BertConfig
from transformers import TFBertModel
from keras.models import Model
from keras import optimizers
from keras.metrics import MeanSquaredError
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt

**Create a FinBERT model from the transformers library**

In [5]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [6]:
# git clone only works on whole repositories, so fetch the two files directly from their raw URLs
!wget -nc https://raw.githubusercontent.com/sravanisasu/analyst_tone/main/config.json
!wget -nc https://raw.githubusercontent.com/sravanisasu/analyst_tone/main/FinVocab-Uncased.txt


In [7]:
# vocab_path is not a BertConfig option; the vocabulary is loaded by the tokenizer further below
config = BertConfig.from_pretrained('/content/config.json')
# from_pt=True converts the PyTorch checkpoint into TensorFlow weights
FinBERT_model = TFBertModel.from_pretrained('/content/drive/MyDrive/Colab Notebooks/pytorch_model.bin',config=config,from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already

**Functions to preprocess input 10-K documents and output values**

In [8]:
######## Function to extract the input text from the files ########
def process_inp_doc(path_file) :

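  # read the whole document and close the file afterwards
  with open(path_file,encoding='utf8') as f:
    file_text = f.read()

  # remove digits and punctuation; note that %-: is a character range (% through :),
  # so & ' ( ) * + , - . / are removed as well
  file_data = re.sub(r'[\d$%-:;!]', '', file_text)
  # remove the <PAGE> markers that were used for page numbers
  file_data = re.sub(r'<PAGE>', '', file_data)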

  return file_data

######## Function to extract the output values from the file ########
def process_out(company_id,output_file):
  
  with open(output_file,'r', encoding='utf-8') as m_file :
    for line in m_file.readlines():
      if company_id == line.split()[1]:
        return line.split()[0]
    print("not found")
  return None

######## Function to pre-process the documents from meta-file of a given year ########
def pre_processing(meta_file,output_file):
  
  with open(meta_file,'r', encoding='utf-8') as m_file :
    
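    # take the year from the file name (e.g. '2007.meta.txt' -> '2007') instead of a fixed path depth
    year = os.path.basename(meta_file).split('.')[0]
    dir_path = os.path.join(os.path.dirname(meta_file), year + '.tok')
    data =[]
    
    for line in m_file.readlines():
      company_id = line.split()[0]
      inp_path_file = os.path.join(dir_path, company_id + '.mda')

      # get input sentences from the company document
      inp_sentences = process_inp_doc(inp_path_file)
    
      # get output value for the company; skip the company if no value is listed
      out_value = process_out(company_id,output_file)
      if out_value is None:
        continue
      out_values = float(out_value)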

      #insert values into the data list
      data.append({'text':inp_sentences,'value':out_values})

  return data
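
The character class used in `process_inp_doc` is easy to misread: inside the brackets, `%-:` is a range from `%` to `:`, not three separate characters. The minimal sketch below lists every printable ASCII character that the pattern removes. It uses the standard-library `re` module, whereas the notebook imports the third-party `regex` package under the same name; the assumption is that both treat this simple pattern identically. The sample sentence is made up for illustration.

```python
import re

# Same pattern as in process_inp_doc
pattern = re.compile(r'[\d$%-:;!]')

# Every printable ASCII character that the pattern matches
removed = ''.join(chr(c) for c in range(32, 127) if pattern.fullmatch(chr(c)))
print(removed)

# Hypothetical sample text, not taken from the 10-K data
sample = "Net sales rose 12% (see note-4); <PAGE> costs: $3.5M!"
cleaned = pattern.sub('', sample)
print(cleaned)
```

Apostrophes, parentheses, commas and periods are stripped along with the digits. If only the characters written out in the class were meant to be removed, the hyphen would have to be escaped (`\-`) or moved to the end of the class.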

**Functions to get the embeddings (token, mask, segment) and to encode the text for the model**

In [9]:
######## Function to get the encoded values ######## 
def FinBERT_encode(sentences, tokenizer, MAX_SEQ_LEN=512):

  all_tokens = []
  all_masks = []
  all_segments = []
  for sentence in sentences:
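    stokens = tokenizer.tokenize(sentence)
    # keep only the last MAX_SEQ_LEN-2 tokens, leaving room for [CLS] and [SEP]
    stokens = stokens[-(MAX_SEQ_LEN-2):]
    stokens = ["[CLS]"] + stokens + ["[SEP]"]

    token_ids = tokenizer.convert_tokens_to_ids(stokens)

    # pad the ids with 0 up to MAX_SEQ_LEN; the mask marks real tokens with 1 and padding with 0
    ids = token_ids + [0] * (MAX_SEQ_LEN-len(token_ids))
    masks = [1]*len(token_ids) + [0] * (MAX_SEQ_LEN - len(token_ids))
    # single-segment input, so every segment id is 0
    segments = [0] * (MAX_SEQ_LEN)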

    all_tokens.append(ids)
    all_masks.append(masks)
    all_segments.append(segments)

  return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
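
To make the truncation and padding in `FinBERT_encode` concrete, here is a plain-Python sketch of the same steps applied to integer ids. The function name `encode_ids` and the ids 101, 102 and 0 for `[CLS]`, `[SEP]` and padding are assumptions for illustration only; the real ids come from `FinVocab-Uncased.txt`.

```python
def encode_ids(token_ids, max_seq_len, cls_id=101, sep_id=102, pad_id=0):
    """Mirror FinBERT_encode for one document that is already a list of ids.

    cls_id, sep_id and pad_id are hypothetical placeholder values.
    """
    if max_seq_len <= 2:
        raise ValueError("max_seq_len must leave room for [CLS] and [SEP]")
    # Keep the LAST max_seq_len - 2 ids, i.e. the end of the document
    kept = token_ids[-(max_seq_len - 2):]
    ids = [cls_id] + kept + [sep_id]
    n_pad = max_seq_len - len(ids)
    mask = [1] * len(ids) + [0] * n_pad      # 1 = real token, 0 = padding
    segments = [0] * max_seq_len             # single segment
    return ids + [pad_id] * n_pad, mask, segments


short_ids, short_mask, _ = encode_ids([7, 8, 9], max_seq_len=8)
long_ids, long_mask, _ = encode_ids(list(range(1, 11)), max_seq_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Note that the slice keeps the tail of a long document, so for most 10-K sections the model sees the closing part of the text rather than its opening.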

**Data Preprocessing**

In [10]:
with tf.device('/device:GPU:0'):
  ######## extracting text and storing it in dataframes ########
  data_train = pre_processing('/content/10k-sample/2007.meta.txt','/content/10k-sample/2007.logvol.+12.txt')
  data_train.extend(pre_processing('/content/10k-sample/2008.meta.txt','/content/10k-sample/2008.logvol.+12.txt'))
  data_train.extend(pre_processing('/content/10k-sample/2009.meta.txt','/content/10k-sample/2009.logvol.+12.txt'))
  train_df = pd.DataFrame(data_train,columns=['text','value'])
  print("Length of training data",len(data_train))

  data_test = pre_processing('/content/10k-sample/2010.meta.txt','/content/10k-sample/2010.logvol.+12.txt')
  test_df = pd.DataFrame(data_test,columns=['text','value'])
  print("Length of testing data",len(data_test))

  print("SAMPLE INPUT TEXT AND VOLATILITY VALUES")
  print(train_df.sample(5)[['text','value']])
  print(test_df.sample(5)[['text','value']])

Length of training data 7571
Length of testing data 2439
SAMPLE INPUT TEXT AND VOLATILITY VALUES
                                                   text    value
7469  item # management s discussion and analysis of... -3.45425
1751  item # management s discussion and analysis of... -3.97542
1602  item # management s discussion and analysis of... -3.76826
6803  item # management s discussion and analysis of... -3.30715
7499  item # management s discussion and analysis of... -3.02023
                                                   text    value
737   item # management s discussion and analysis of... -3.52022
546   item # management s discussion and analysis of... -4.11226
734   item # management s discussion and analysis of... -3.46107
2438  item # management s discussion and analysis of... -3.40038
431   item # management s discussion and analysis of... -3.77565


In [11]:
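# keep only documents with more than 256 whitespace-separated words
train_df = train_df.loc[train_df["text"].apply(lambda x: len(x.split()))>256]
print(train_df)
# about 88.7% of the training documents remain
test_df = test_df.loc[test_df["text"].apply(lambda x: len(x.split()))>256]
print(test_df)
# about 89.3% of the test documents remain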

                                                   text    value
0     item # management s discussion and analysis of... -3.46398
1     item # management s discussion and analysis of... -3.58048
2     item # management s discussion and analysis of... -3.87840
3     item # management s discussion and analysis of... -3.37969
4     item # management s discussion and analysis of... -4.34506
...                                                 ...      ...
7566  item # management s discussion and analysis of... -2.75096
7567  item # management s discussion and analysis of... -3.46372
7568  item # management s discussion and analysis of... -2.94439
7569  item # management s discussion and analysis of... -3.27556
7570  item # management s discussion and analysis of... -3.33055

[6717 rows x 2 columns]
                                                   text    value
0     item # management s discussion and analysis of... -3.87816
1     item # management s discussion and analysis of... -3.45482
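
The length filter in this cell keeps only documents with more than 256 whitespace-separated words. A small sketch of the same rule in plain Python follows, together with the retained share implied by the printed row counts (6717 of the 7571 training documents). The helper name `long_enough` and the toy documents are hypothetical.

```python
MIN_WORDS = 256  # threshold used in the notebook


def long_enough(text, min_words=MIN_WORDS):
    """True if the text has strictly more than min_words whitespace-separated words."""
    return len(text.split()) > min_words


# Toy documents: 300 words, 2 words, exactly 256 words
docs = ["word " * 300, "too short", "word " * 256]
kept = [d for d in docs if long_enough(d)]
print(len(kept))

# Share of training documents that survive the filter, from the printed counts
retained = 6717 / 7571
print(f"{retained:.1%}")
```

The comparison is strict, so a document with exactly 256 words is dropped.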


In [12]:
MAX_SEQ_LEN = 512

vocab_path = '/content/FinVocab-Uncased.txt'
######## extracting tokens from dataframes ########

tokenizer = BertTokenizer(vocab_file = vocab_path, do_lower_case = True, do_basic_tokenize = True)

with tf.device('/device:GPU:0'):

  #### training 
  # input encoding
  sentences = train_df.text.values
  FinBERT_train_input = FinBERT_encode(sentences, tokenizer, MAX_SEQ_LEN)
  # output values
  FinBERT_train_output = train_df.value.values

  #### test
  # input encoding
  sentences = test_df.text.values
  FinBERT_test_input = FinBERT_encode(sentences, tokenizer, MAX_SEQ_LEN)
  # output values
  FinBERT_test_output = test_df.value.values

In [13]:
FinBERT_train_output = np.array(FinBERT_train_output).reshape(len(FinBERT_train_output),1)
FinBERT_test_output = np.array(FinBERT_test_output).reshape(len(FinBERT_test_output),1)

**Function that defines the model architecture**

In [17]:
def get_model():

  input_word_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32,name="input_word_ids")
  input_mask = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32,name="input_mask")
  segment_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32,name="segment_ids")

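  # pass the mask and segment ids by keyword so each tensor reaches the intended BERT argument
  model_output = FinBERT_model(input_word_ids, attention_mask=input_mask, token_type_ids=segment_ids)
  # token-level hidden states, shape (batch, MAX_SEQ_LEN, hidden_size)
  clf_output = model_output.last_hidden_state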
      
  net = tf.keras.layers.GlobalMaxPool1D()(clf_output)
  net = tf.keras.layers.Dense(1, activation='linear')(net)
  out = tf.keras.layers.Dense(1, activation='linear', name='output')(net)

  model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)

  opt = optimizers.Adam(learning_rate=0.05)
  model.compile(optimizer=opt, loss='mse')

  return model
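
`GlobalMaxPool1D` takes, for each hidden dimension, the maximum over all 512 token positions. It appears that the layer receives no mask in this model, so padded positions would take part in the maximum as well. The plain-Python sketch below shows the pooling operation and a masked variant for comparison; `global_max_pool` and the toy vectors are hypothetical and not part of the notebook.

```python
def global_max_pool(hidden, mask=None):
    """Max over the sequence axis for one document.

    hidden: list of per-token vectors (lists of floats), all the same length.
    mask:   optional list of 0/1 flags; positions with 0 are ignored.
    """
    if mask is not None:
        hidden = [vec for vec, keep in zip(hidden, mask) if keep]
    # zip(*hidden) iterates over hidden dimensions, one column at a time
    return [max(column) for column in zip(*hidden)]


# Three token positions with two hidden dimensions; the last position is padding
hidden_states = [[0.1, -2.0], [0.5, -1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

unmasked = global_max_pool(hidden_states)
masked = global_max_pool(hidden_states, attention_mask)
print(unmasked, masked)
```

Whether the difference matters in practice depends on the values BERT produces at padded positions, which this sketch does not measure.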

In [18]:
model = get_model()
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 512)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 512)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 512)]        0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     TFBaseModelOutputWit 109751808   input_word_ids[0][0]             
                                                                 input_mask[0][0]           

**Fit the Model**

In [None]:
n_splits = 3
epochs = 8
batch_size = 10
with tf.device('/device:GPU:0'):
  kf = KFold(n_splits=n_splits)
  history =[]
  train_loss=[]
  vald_loss=[]
  fold = 1
  for train_index, test_index in kf.split(FinBERT_train_input[0]):
    
    checkpoint_filepath = 'FinBERT_results/CheckPoints/FinBERT_checkpoint'+str(fold)
    model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=False,
    monitor='loss',
    mode='min',
    save_best_only=True)

    train_history = model.fit(
                              [FinBERT_train_input[0][train_index],FinBERT_train_input[1][train_index],FinBERT_train_input[2][train_index]],#input
                              FinBERT_train_output[train_index],#output
                              epochs=epochs, #epochs
                              verbose=1 ,
                              batch_size = batch_size,
                              callbacks=[model_checkpoint_callback]
                          )
    model_best = get_model()
    model_best.load_weights(checkpoint_filepath)
    
    fold+=1
    loss_T = model_best.evaluate([FinBERT_train_input[0][train_index],FinBERT_train_input[1][train_index],FinBERT_train_input[2][train_index]]
                                       , FinBERT_train_output[train_index], verbose=0)
    loss_V = model_best.evaluate([FinBERT_train_input[0][test_index],FinBERT_train_input[1][test_index],FinBERT_train_input[2][test_index]]
                                      , FinBERT_train_output[test_index], verbose=0)
    print(loss_T,loss_V)
    train_loss.append(loss_T)
    vald_loss.append(loss_V)
    history.append(train_history)


Epoch 1/8
INFO:tensorflow:Assets written to: FinBERT_results/CheckPoints/FinBERT_checkpoint1/assets

Epoch 2/8
INFO:tensorflow:Assets written to: FinBERT_results/CheckPoints/FinBERT_checkpoint1/assets

Epoch 3/8
INFO:tensorflow:Assets written to: FinBERT_results/CheckPoints/FinBERT_checkpoint1/assets

Epoch 4/8
INFO:tensorflow:Assets written to: FinBERT_results/CheckPoints/FinBERT_checkpoint1/assets

Epoch 5/8
Epoch 6/8
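
`KFold(n_splits=3)` without `shuffle=True` cuts the sample indices into contiguous blocks, and the first `n_samples % n_splits` folds receive one extra sample. The sketch below reproduces that index arithmetic in plain Python; the function `contiguous_folds` is a hypothetical stand-in for illustration, not the scikit-learn implementation.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for unshuffled k-fold splitting."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds are one sample larger than the rest
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(contiguous_folds(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)

# Fold sizes for the 6717 filtered training documents
sizes = [len(val) for _, val in contiguous_folds(6717, 3)]
print(sizes)
```

Because the training frame is built by appending 2007, 2008 and 2009 in that order, unshuffled folds follow the same order. Passing `shuffle=True` together with a `random_state` to `KFold` would mix the years across folds, if that is what is wanted.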

**Plot the results**

In [None]:
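# make sure the output directory exists before the figure is saved
os.makedirs('FinBERT_results/Plots', exist_ok=True)
plt.plot(train_loss, label = "Training Loss")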
plt.plot(vald_loss, label = "Validation Loss")
# naming the x axis 
plt.xlabel('Folds') 
# naming the y axis 
plt.ylabel('Error') 
# function to show the plot 
plt.legend()
plt.savefig('FinBERT_results/Plots/FinBERT_loss_check.png')

In [None]:
test_loss = []
with tf.device('/device:GPU:0'):
    
    for i in range(n_splits):

        checkpoint_filepath = 'FinBERT_results/CheckPoints/FinBERT_checkpoint'+str(i+1) 
        best_model = get_model()
        # load the weights into the model that is evaluated below (best_model, not model_best)
        best_model.load_weights(checkpoint_filepath)
        # predict on the first 50 test samples only: slice each of the three input arrays
        predicted = best_model.predict([inp[0:50] for inp in FinBERT_test_input])
        
        loss_test = best_model.evaluate([FinBERT_test_input[0],FinBERT_test_input[1],FinBERT_test_input[2]]
                                          , FinBERT_test_output, verbose=0)
        print("Test Error for the fold ",i+1," is",loss_test )
        
        
        plt.plot(predicted[0:50], label = "Predicted Values")  
        plt.plot(FinBERT_test_output[0:50], label = "Actual Values")
        # naming the x axis 
        plt.xlabel('Test Samples') 
        # naming the y axis 
        plt.ylabel('Output Values') 
        # function to show the plot 
        plt.legend()
        textstr = "Test Error for the fold "+ str(i+1)+" is "+str(np.round(loss_test,3))
        plt.gcf().text(0, -0.25, textstr, fontsize=14)
        plt.savefig('FinBERT_results/Plots/FinBERT_fold'+str(i+1)+'.png',bbox_inches='tight')
        plt.clf()

        test_loss.append(loss_test)
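
Since the model is compiled with `loss='mse'` and no extra metrics, the value returned by `evaluate` is the mean squared error between predicted and actual log-volatility. A minimal sketch of that quantity in plain Python, with made-up values (the helper `mse` and the numbers are hypothetical):

```python
def mse(y_true, y_pred):
    """Mean squared error over paired targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical log-volatility targets and predictions
actual = [-3.5, -4.0, -3.0]
predicted_values = [-3.4, -3.7, -3.3]

error = mse(actual, predicted_values)
print(round(error, 4))
```

Taking the square root of this value gives an error in the same units as the log-volatility targets, which can be easier to interpret.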

In [None]:
import matplotlib.pyplot as plt
data=[]
data.append(train_loss)
data.append(vald_loss)
data.append(test_loss)
  
fig = plt.figure()  
# Creating axes instance 
ax = fig.add_axes([0, 0, 1, 1]) 
  
# Creating plot 
ax.boxplot(data)

ax.set_xticklabels(['Training', 'Validation','Test']) 

# naming the y axis 
plt.ylabel('MSE Loss')
plt.title("Box plot for Training, Validation and Test Loss")
textstr ='Training Loss  : '+str(np.round(np.mean(train_loss),3))+' ('+str(np.round(np.std(train_loss),3))+')\n'+'Validation Loss  : '+str(np.round(np.mean(vald_loss),3))+' ('+str(np.round(np.std(vald_loss),3))+')\n'+'Test Loss  : '+str(np.round(np.mean(test_loss),3))+' ('+str(np.round(np.std(test_loss),3))+')'
plt.gcf().text(0, -0.25, textstr, fontsize=14)
# save the plot
plt.savefig('FinBERT_results/Plots/block_FinBERT.png',bbox_inches='tight')

print('Training Loss: %.3f (%.3f)' % (np.mean(train_loss), np.std(train_loss)))
print('Validation Loss: %.3f (%.3f)' % (np.mean(vald_loss), np.std(vald_loss)))
print('Test Loss: %.3f (%.3f)' % (np.mean(test_loss), np.std(test_loss)))
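
The summary lines report the mean of the per-fold losses with `np.std` in parentheses. By default `np.std` uses the population formula (`ddof=0`), which gives a noticeably smaller number than the sample formula when there are only three folds. The sketch below illustrates the difference with the standard-library `statistics` module; the loss values are made up and are not results from this notebook.

```python
import statistics

# Hypothetical per-fold losses, for illustration only
fold_losses = [0.20, 0.26, 0.29]

mean_loss = statistics.mean(fold_losses)
pop_std = statistics.pstdev(fold_losses)    # population formula, like np.std(..., ddof=0)
sample_std = statistics.stdev(fold_losses)  # sample formula, like np.std(..., ddof=1)

print('Loss: %.3f (population std %.3f, sample std %.3f)' % (mean_loss, pop_std, sample_std))
```

With n folds the two differ by a factor of sqrt(n / (n - 1)), about 1.22 for three folds.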