## Financial news categorization/sentiment analysis using NLP techniques


Sentiment analysis is the statistical analysis of simple sentiment cues. Essentially, it involves the statistical analysis of polarized statements (i.e., statements with a positive, negative, or neutral sentiment), which are usually collected from social media posts, reviews, and news articles. Financial sentiment analysis is a challenging task due to the specialized language and the lack of labeled data in that domain.


In our case, we will focus on two different tasks.


1. **Category tagger**: Create an NLP classifier capable of assigning a financial category to a text derived from the financial industry.

The Twitter Financial News dataset is an English-language dataset containing an annotated corpus of finance-related tweets. It is used to classify finance-related tweets by topic.

The dataset holds 21,107 documents annotated with 20 labels:

topics = {
    "LABEL_0": "Analyst Update",
    "LABEL_1": "Fed | Central Banks",
    "LABEL_2": "Company | Product News",
    "LABEL_3": "Treasuries | Corporate Debt",
    "LABEL_4": "Dividend",
    "LABEL_5": "Earnings",
    "LABEL_6": "Energy | Oil",
    "LABEL_7": "Financials",
    "LABEL_8": "Currencies",
    "LABEL_9": "General News | Opinion",
    "LABEL_10": "Gold | Metals | Materials",
    "LABEL_11": "IPO",
    "LABEL_12": "Legal | Regulation",
    "LABEL_13": "M&A | Investments",
    "LABEL_14": "Macro",
    "LABEL_15": "Markets",
    "LABEL_16": "Politics",
    "LABEL_17": "Personnel Change",
    "LABEL_18": "Stock Commentary",
    "LABEL_19": "Stock Movement"
}
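
The dataset stores each topic as an integer, while the dictionary above is keyed by strings such as `LABEL_3`. A minimal sketch of turning those keys into an integer lookup follows. The names `id2topic` and `topic_name` are hypothetical, introduced here only for illustration, and just three entries are repeated to keep the example short.

```python
# Subset of the `topics` dictionary above (the full version has 20 entries).
topics = {
    "LABEL_0": "Analyst Update",
    "LABEL_3": "Treasuries | Corporate Debt",
    "LABEL_19": "Stock Movement",
}

# Hypothetical lookup: integer class id -> human-readable topic.
id2topic = {int(key.split("_")[1]): name for key, name in topics.items()}


def topic_name(label_id: int) -> str:
    """Return the topic for an integer label, or a placeholder if unknown."""
    return id2topic.get(label_id, "Unknown")


print(topic_name(3))
```

A lookup like this makes model predictions and confusion matrices easier to read than raw class indices.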

2. **Sentiment tagger**: Create an NLP classifier capable of assigning a sentiment score (positive, negative, neutral) to text derived from the financial industry. Additionally, we will use a powerful pre-trained model, fine-tuned on financial data, to assign scores to financial headlines, social media posts, and similar text.


## Pre-requisites:


High-level Python library requirements:

- PyTorch
- Hugging Face Transformers
- Pandas
- NumPy
- scikit-learn
- NLTK and spaCy (used for text pre-processing)


In [1]:
# %%capture
# ! pip install pandas
# ! pip install numpy
# ! pip install matplotlib
# ! pip install scikit-learn
# ! pip install transformers
# ! pip install torch
# ! pip install nltk
# ! pip install spacy
# ! pip install tensorflow
# ! pip install tensorflow-metal

## **Step 1: Pulling the data together**


Download and inspect the data from the various sources:

1. Financial Phrasebank: https://huggingface.co/datasets/financial_phrasebank (human-annotated)

2. Financial tweets topics dataset: https://huggingface.co/datasets/zeroshot/twitter-financial-news-topic/viewer/default/train?p=169 (human-annotated)

Think of any pre-processing functions that you might need to apply for downstream tasks:

- converting the text to lowercase,
- removing punctuation,
- tokenizing the text,
- removing stop words and empty strings,
- lemmatizing tokens.

As always, pick a framework for data analysis and data exploration.

In [2]:
import pandas as pd
import string

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import spacy
from spacy.lang.en import English

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
import nltk
nltk.download("wordnet")
nltk.download('punkt')

In [4]:
# read the Financial PhraseBank data (sentences on which all annotators agree)
# directory = 'FinancialPhraseBank-v1.0'
# files = os.listdir(directory)
# files = [file for file in files if file.startswith('Sentences_')]

data = []

with open("FinancialPhraseBank-v1.0/Sentences_AllAgree.txt", "r", encoding="latin1") as f:
    for line in f:
        # each line has the form "<sentence>@<label>"; split on the last "@"
        # so that an "@" inside the sentence does not break the parsing
        sentence, label = line.strip().rsplit("@", 1)
        data.append(dict(sentence=sentence, label=label))

finantial = pd.DataFrame(data)
finantial

Unnamed: 0,sentence,label
0,"According to Gran , the company has no plans t...",neutral
1,"For the last quarter of 2010 , Componenta 's n...",positive
2,"In the third quarter of 2010 , net sales incre...",positive
3,Operating profit rose to EUR 13.1 mn from EUR ...,positive
4,"Operating profit totalled EUR 21.1 mn , up fro...",positive
...,...,...
2259,Operating result for the 12-month period decre...,negative
2260,HELSINKI Thomson Financial - Shares in Cargote...,negative
2261,LONDON MarketWatch -- Share prices ended lower...,negative
2262,Operating profit fell to EUR 35.4 mn from EUR ...,negative


In [5]:
def clean_text(text):
    # lowercase
    text = text.lower()

    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    #tokenize text
    tokens = word_tokenize(text)

    # Remove stop words and empty strings
    stop_words = spacy.lang.en.stop_words.STOP_WORDS
    tokens = [
        token for token in tokens
        if token not in stop_words
            and token.strip() != ''
    ]

    # lemmatize tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return tokens

In [6]:
# map the labels
labels = {
    "neutral": 0,
    "positive": 1,
    "negative": 2
}

In [7]:
# process financial data
fin_df = finantial.copy()
fin_df["tokens"] = fin_df["sentence"].apply(clean_text)
fin_df["cleaned_sentence"] = fin_df["tokens"].apply(lambda x: " ".join(x))

fin_df["label"] = fin_df["label"].map(labels)
fin_df

Unnamed: 0,sentence,label,tokens,cleaned_sentence
0,"According to Gran , the company has no plans t...",0,"[according, gran, company, plan, production, r...",according gran company plan production russia ...
1,"For the last quarter of 2010 , Componenta 's n...",1,"[quarter, 2010, componenta, s, net, sale, doub...",quarter 2010 componenta s net sale doubled eur...
2,"In the third quarter of 2010 , net sales incre...",1,"[quarter, 2010, net, sale, increased, 52, eur,...",quarter 2010 net sale increased 52 eur 2055 mn...
3,Operating profit rose to EUR 13.1 mn from EUR ...,1,"[operating, profit, rose, eur, 131, mn, eur, 8...",operating profit rose eur 131 mn eur 87 mn cor...
4,"Operating profit totalled EUR 21.1 mn , up fro...",1,"[operating, profit, totalled, eur, 211, mn, eu...",operating profit totalled eur 211 mn eur 186 m...
...,...,...,...,...
2259,Operating result for the 12-month period decre...,2,"[operating, result, 12month, period, decreased...",operating result 12month period decreased prof...
2260,HELSINKI Thomson Financial - Shares in Cargote...,2,"[helsinki, thomson, financial, share, cargotec...",helsinki thomson financial share cargotec fell...
2261,LONDON MarketWatch -- Share prices ended lower...,2,"[london, marketwatch, share, price, ended, low...",london marketwatch share price ended lower lon...
2262,Operating profit fell to EUR 35.4 mn from EUR ...,2,"[operating, profit, fell, eur, 354, mn, eur, 6...",operating profit fell eur 354 mn eur 688 mn 20...


## **Step 2: Train and fine-tune various NLP classifiers on financial news datasets**



#### **2.1 Let's start with a simple baseline (of your own choice).** For example, build a logistic regression model based on pre-trained word embeddings or TF-IDF vectors of the financial news corpus.


Build a baseline model with the **Financial Phrasebank dataset**. What are the limitations of these baseline models?

### Baseline Model
Logistic Regression

In [8]:
vectorizer = TfidfVectorizer(max_features=1_000)

In [9]:
# train/test split (80/20), stratified by label
fin_train, fin_test = train_test_split(
    fin_df,
    test_size=0.2,
    random_state=42,
    stratify=fin_df["label"]
)

In [10]:
# train: fit the vocabulary and IDF weights on the training split only
fin_train_tfidf = vectorizer.fit_transform(fin_train["cleaned_sentence"])
y_fin_train = fin_train["label"]

In [11]:
# test: reuse the fitted vectorizer without refitting
fin_test_tfidf = vectorizer.transform(fin_test["cleaned_sentence"])
y_fin_test = fin_test["label"]


In [12]:
model = LogisticRegression()
model.fit(fin_train_tfidf, y_fin_train)

y_train_pred = model.predict(fin_train_tfidf)
train_accuracy = accuracy_score(y_fin_train, y_train_pred)

y_test_pred = model.predict(fin_test_tfidf)
test_accuracy = accuracy_score(y_fin_test, y_test_pred)

print(f"Train accuracy: {train_accuracy:.2f}")
print(f"Test accuracy: {test_accuracy:.2f}")


Train accuracy: 0.89
Test accuracy: 0.83
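
One limitation worth illustrating: TF-IDF over single words is a bag-of-words representation, so it discards word order. The sketch below uses plain token counts rather than `TfidfVectorizer` (a simplification assumed here to keep the example dependency-free), and the two sentences are invented examples, not taken from the dataset. They carry opposite meanings yet produce identical features.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count lowercase whitespace-separated tokens, ignoring their order."""
    return Counter(sentence.lower().split())


# Two invented sentences with opposite financial meaning.
a = "profit rose while costs fell"
b = "profit fell while costs rose"

# Same multiset of words, so any order-free model sees identical inputs.
print(bag_of_words(a) == bag_of_words(b))
```

A contextual model such as BERT reads the tokens in sequence, which is one reason to expect it to separate cases like these.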



#### **2.2 Compare the baseline with a pre-trained model that is specialized for the finance domain. Download and use the FinBERT model from Hugging Face**

Model source: https://huggingface.co/ProsusAI/finbert

Once you have downloaded the model, run inference and compute performance metrics to get a sense of how the specialized pre-trained model fares against the baseline model. Use the Hugging Face Transformers library to download the model and run inference on it. For large datasets or long text sequences, running time on a CPU can be considerable.

For more information on the model: Araci, D. (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models.

### BERT model

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler, random_split

from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup


In [14]:
fb_tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert', do_lower_case=True)
fb_model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')

In [15]:
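def predict_label(text):
    # Tokenize with the special [CLS]/[SEP] tokens that BERT models expect,
    # truncating anything beyond the model's maximum input length.
    tokens = fb_tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        truncation=True,
        return_tensors="pt",
    )
    # Inference only: no gradients are needed.
    with torch.no_grad():
        output = fb_model(**tokens)
    probabilities = torch.nn.functional.softmax(output.logits, dim=-1)
    # NOTE: this is the model's own class index. Its meaning is given by
    # fb_model.config.id2label and may not match the `labels` dict above.
    label = torch.argmax(probabilities).item()
    return label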

In [16]:
# prepare the df
fb_df = finantial.copy()
fb_df["label"] = fb_df["label"].map(labels)

fb_df

Unnamed: 0,sentence,label
0,"According to Gran , the company has no plans t...",0
1,"For the last quarter of 2010 , Componenta 's n...",1
2,"In the third quarter of 2010 , net sales incre...",1
3,Operating profit rose to EUR 13.1 mn from EUR ...,1
4,"Operating profit totalled EUR 21.1 mn , up fro...",1
...,...,...
2259,Operating result for the 12-month period decre...,2
2260,HELSINKI Thomson Financial - Shares in Cargote...,2
2261,LONDON MarketWatch -- Share prices ended lower...,2
2262,Operating profit fell to EUR 35.4 mn from EUR ...,2


In [17]:
# predict the label
fb_df["predicted_label"] = fb_df["sentence"].apply(predict_label)
fb_df

Unnamed: 0,sentence,label,predicted_label
0,"According to Gran , the company has no plans t...",0,2
1,"For the last quarter of 2010 , Componenta 's n...",1,0
2,"In the third quarter of 2010 , net sales incre...",1,2
3,Operating profit rose to EUR 13.1 mn from EUR ...,1,0
4,"Operating profit totalled EUR 21.1 mn , up fro...",1,2
...,...,...,...
2259,Operating result for the 12-month period decre...,2,1
2260,HELSINKI Thomson Financial - Shares in Cargote...,2,1
2261,LONDON MarketWatch -- Share prices ended lower...,2,0
2262,Operating profit fell to EUR 35.4 mn from EUR ...,2,1


In [18]:
# accuracy
fb_accuracy = accuracy_score(fb_df["label"], fb_df["predicted_label"])
print(f"FinBERT accuracy: {fb_accuracy:.2f}")


FinBERT accuracy: 0.05


In [19]:
# confusion matrix
matrix = confusion_matrix(fb_df["label"], fb_df["predicted_label"])
matrix

array([[  32,   17, 1342],
       [ 405,   15,  150],
       [  31,  212,   60]])
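
An accuracy of 0.05 on a three-class problem is far below chance, and the confusion matrix shows most predictions concentrated in one off-diagonal cell per row. That pattern is what a label-index mismatch would produce: the `labels` dictionary above uses neutral=0, positive=1, negative=2, whereas the model may order its classes differently. The sketch below assumes FinBERT's order is positive=0, negative=1, neutral=2 (an assumption to verify against `fb_model.config.id2label`) and re-reads the matrix printed above under that mapping.

```python
# Confusion matrix copied from the output above.
# Rows: true label in the notebook's encoding (0=neutral, 1=positive, 2=negative).
# Columns: raw class index predicted by the model.
matrix = [
    [32, 17, 1342],
    [405, 15, 150],
    [31, 212, 60],
]

# ASSUMPTION: model index 0=positive, 1=negative, 2=neutral.
# model_index_for[k] is the model's column for notebook label k.
model_index_for = [2, 0, 1]

# Reorder columns so that column k means "predicted notebook label k".
remapped = [[row[j] for j in model_index_for] for row in matrix]

correct = sum(remapped[k][k] for k in range(3))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

print(f"Accuracy after remapping: {accuracy:.3f}")
```

Under this assumption the same predictions would score roughly 0.87, so the label mapping should be checked before drawing conclusions about the model.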

#### **2.3 (Advanced) Fine-tune a pre-trained model, such as a base BERT model, on a small labeled dataset**

General-purpose models are not effective enough because of the specialized language used in a financial context. We hypothesize that pre-trained language models can help with this problem because they require fewer labeled examples and they can be further trained on domain-specific corpora.

In recent years the NLP community has seen many breakthroughs, especially the shift to transfer learning. Models like ELMo, fast.ai's ULMFiT, the Transformer, and OpenAI's GPT have allowed researchers to achieve state-of-the-art results on multiple benchmarks and have provided the community with large, high-performing pre-trained models. This shift is often described as NLP's ImageNet moment, by analogy with the shift in computer vision a few years ago, when the lower layers of deep networks with millions of parameters, trained on one task, began to be reused and fine-tuned for other tasks rather than training new networks from scratch.

One of the most significant recent milestones in the evolution of NLP is the release of Google's BERT, which has been described as the beginning of a new era in NLP. In our case, we are going to explore a pre-trained model called FinBERT, already tuned on a financial corpus. I specifically recommend the Hugging Face library for ease of implementation.

*What is HuggingFace?* Hugging Face’s Transformers is an open-source library that provides thousands of pre-trained models to perform various tasks on texts such as text classification, named entity recognition, translation, and more. The library has a unified, high-level API for these models and supports a wide range of languages and model architectures.


Here are various tutorials for fine-tuning BERT: https://drlee.io/fine-tuning-hugging-faces-bert-transformer-for-sentiment-analysis-69b976e6ac5d and https://skimai.com/fine-tuning-bert-for-sentiment-analysis/. I especially recommend this one: http://mccormickml.com/2019/07/22/BERT-fine-tuning/

The dataset on which to fine-tune a BERT-based model was introduced earlier: the **Financial tweets topics dataset**.

*ALERT*: Running or training a large language model like BERT or FinBERT can take a long time on a CPU. Although BERT is very large and has millions of parameters, fine-tuning usually needs only 2-4 epochs. You can also explore Google Colab, which offers limited access to free GPUs and may be better suited to this task, especially if training is required.

Finally, compare the previous baseline with the fine-tuned FinBERT.

In [None]:
# Load BertForSequenceClassification, the pretrained BERT model with a single
# linear classification layer on top.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels=20, # The number of output labels: one per topic (20 here).
                    # Use 2 for binary classification.
    # output_attentions=False, # Whether the model returns attentions weights.
    # output_hidden_states=False, # Whether the model returns all hidden-states.
)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [3]:
# read twitter data
twitter_train  = pd.read_csv("twitter/topic_train.csv")
twitter_train.head()

Unnamed: 0,text,label
0,Here are Thursday's biggest analyst calls: App...,0
1,Buy Las Vegas Sands as travel to Singapore bui...,0
2,"Piper Sandler downgrades DocuSign to sell, cit...",0
3,"Analysts react to Tesla's latest earnings, bre...",0
4,Netflix and its peers are set for a ‘return to...,0


In [111]:
# read twitter data
twitter_valid = pd.read_csv("twitter/topic_valid.csv")
twitter_valid.head()

Unnamed: 0,text,label
0,Analyst call of the day for @CNBCPro subscribe...,0
1,"Loop upgrades CSX to buy, says it's a good pla...",0
2,BofA believes we're already in a recession — a...,0
3,JPMorgan sees these derivative plays as best w...,0
4,Morgan Stanley's Huberty sees Apple earnings m...,0


In [137]:
# concat the train and valid data
twitter = pd.concat([twitter_train, twitter_valid], axis=0, ignore_index=True)

### Loading and Preprocessing the Data
Using https://mccormickml.com/2019/07/22/BERT-fine-tuning/

In [138]:
MAX_TEXT_LEN = 128

In [141]:
max_text_len = twitter["text"].apply(lambda x: tokenizer.encode(x, add_special_tokens=True)).apply(len).max()
max_text_len

147
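
The longest tweet encodes to 147 tokens, which exceeds `MAX_TEXT_LEN = 128`, so a few inputs will be truncated while most will be padded. The following is a simplified, pure-Python sketch of that behaviour, not the tokenizer's actual implementation. `pad_or_truncate` is a hypothetical helper, and the ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and padding ids visible in the encoded tensors printed in this notebook.

```python
def pad_or_truncate(token_ids, max_len, sep_id=102, pad_id=0):
    """Fit a [CLS] ... [SEP] id sequence to exactly max_len positions.

    Returns (input_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding.
    """
    if len(token_ids) > max_len:
        # Keep the first max_len - 1 ids and re-append [SEP] at the end.
        token_ids = token_ids[: max_len - 1] + [sep_id]
    n_real = len(token_ids)
    n_pad = max_len - n_real
    input_ids = token_ids + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    return input_ids, attention_mask


short_ids = [101, 4965, 5869, 102]        # 4 tokens, needs padding
long_ids = [101] + [2000] * 10 + [102]    # 12 tokens, needs truncation

print(pad_or_truncate(short_ids, 8))
print(pad_or_truncate(long_ids, 8))
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside `input_ids` during training.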

In [142]:
def encode_text(text):
    # Use the bert-base-uncased tokenizer that matches the model being fine-tuned.
    tokens = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=MAX_TEXT_LEN,
        return_tensors="pt",
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
    )
    return tokens

# encode the text
twitter["tokens"] = twitter["text"].apply(encode_text)
twitter

Unnamed: 0,text,label,tokens
0,Here are Thursday's biggest analyst calls: App...,0,"[input_ids, token_type_ids, attention_mask]"
1,Buy Las Vegas Sands as travel to Singapore bui...,0,"[input_ids, token_type_ids, attention_mask]"
2,"Piper Sandler downgrades DocuSign to sell, cit...",0,"[input_ids, token_type_ids, attention_mask]"
3,"Analysts react to Tesla's latest earnings, bre...",0,"[input_ids, token_type_ids, attention_mask]"
4,Netflix and its peers are set for a ‘return to...,0,"[input_ids, token_type_ids, attention_mask]"
...,...,...,...
21102,Dollar bonds of Chinese developers fall as str...,3,"[input_ids, token_type_ids, attention_mask]"
21103,Longer maturity Treasury yields have scope to ...,3,"[input_ids, token_type_ids, attention_mask]"
21104,Pimco buys €1bn of Apollo buyout loans from ba...,3,"[input_ids, token_type_ids, attention_mask]"
21105,Analysis: Banks' snubbing of junk-rated loan f...,3,"[input_ids, token_type_ids, attention_mask]"


In [143]:
twitter["input_ids"] = twitter["tokens"].apply(lambda x: x["input_ids"])
twitter["attention_mask"] = twitter["tokens"].apply(lambda x: x["attention_mask"])

twitter

Unnamed: 0,text,label,tokens,input_ids,attention_mask
0,Here are Thursday's biggest analyst calls: App...,0,"[input_ids, token_type_ids, attention_mask]","[[tensor(101), tensor(2182), tensor(2024), ten...","[[tensor(1), tensor(1), tensor(1), tensor(1), ..."
1,Buy Las Vegas Sands as travel to Singapore bui...,0,"[input_ids, token_type_ids, attention_mask]","[[tensor(101), tensor(4965), tensor(5869), ten...","[[tensor(1), tensor(1), tensor(1), tensor(1), ..."
2,"Piper Sandler downgrades DocuSign to sell, cit...",0,"[input_ids, token_type_ids, attention_mask]","[[tensor(101), tensor(11939), tensor(5472), te...","[[tensor(1), tensor(1), tensor(1), tensor(1), ..."
3,"Analysts react to Tesla's latest earnings, bre...",0,"[input_ids, token_type_ids, attention_mask]","[[tensor(101), tensor(18288), tensor(10509), t...","[[tensor(1), tensor(1), tensor(1), tensor(1), ..."
4,Netflix and its peers are set for a ‘return to...,0,"[input_ids, token_type_ids, attention_mask]","[[tensor(101), tensor(20907), tensor(1998), te...","[[tensor(1), tensor(1), tensor(1), tensor(1), ..."
...,...,...,...,...,...
21102,Dollar bonds of Chinese developers fall as str...,3,"[input_ids, token_type_ids, attention_mask]","[[tensor(101), tensor(7922), tensor(9547), ten...","[[tensor(1), tensor(1), tensor(1), tensor(1), ..."
21103,Longer maturity Treasury yields have scope to ...,3,"[input_ids, token_type_ids, attention_mask]","[[tensor(101), tensor(2936), tensor(16736), te...","[[tensor(1), tensor(1), tensor(1), tensor(1), ..."
21104,Pimco buys €1bn of Apollo buyout loans from ba...,3,"[input_ids, token_type_ids, attention_mask]","[[tensor(101), tensor(14255), tensor(12458), t...","[[tensor(1), tensor(1), tensor(1), tensor(1), ..."
21105,Analysis: Banks' snubbing of junk-rated loan f...,3,"[input_ids, token_type_ids, attention_mask]","[[tensor(101), tensor(4106), tensor(1024), ten...","[[tensor(1), tensor(1), tensor(1), tensor(1), ..."


In [144]:
print(
    twitter.loc[1, "text"],
    twitter.loc[1, "input_ids"],
    sep="\n"
)

Buy Las Vegas Sands as travel to Singapore builds, Wells Fargo says  https://t.co/fLS2w57iCz
tensor([[  101,  4965,  5869,  7136, 13457,  2004,  3604,  2000,  5264, 16473,
          1010,  7051, 23054,  2758, 16770,  1024,  1013,  1013,  1056,  1012,
          2522,  1013, 13109,  2015,  2475,  2860, 28311, 18682,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

In [145]:
input_ids = twitter["input_ids"].tolist()
input_ids = torch.cat(input_ids, dim=0)
input_ids, len(input_ids)

(tensor([[  101,  2182,  2024,  ...,     0,     0,     0],
         [  101,  4965,  5869,  ...,     0,     0,     0],
         [  101, 11939,  5472,  ...,     0,     0,     0],
         ...,
         [  101, 14255, 12458,  ...,     0,     0,     0],
         [  101,  4106,  1024,  ...,     0,     0,     0],
         [  101,  1057,  1012,  ...,     0,     0,     0]]),
 21107)

In [146]:
attention_masks = twitter["attention_mask"].tolist()
attention_masks = torch.cat(attention_masks, dim=0)
attention_masks, len(attention_masks)

(tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 21107)

In [147]:
labels = twitter["label"].tolist()
labels = torch.tensor(labels)
labels, len(labels)

(tensor([0, 0, 0,  ..., 3, 3, 3]), 21107)

### Training & Validation Split
90% for training and 10% for validation

In [148]:
# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples (seeded for reproducibility).
train_dataset, val_dataset = random_split(
    dataset,
    [train_size, val_size],
    generator=torch.Generator().manual_seed(42),
)

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

18,996 training samples
2,111 validation samples


In [150]:
# The DataLoader needs to know our batch size for training, so we specify it
# here. For fine-tuning BERT on a specific task, the authors recommend a batch
# size of 16 or 32.
batch_size = 32

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
    train_dataset,  # The training samples.
    sampler=RandomSampler(train_dataset), # Select batches randomly
    batch_size=batch_size # Trains with this batch size.
)

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
    val_dataset, # The validation samples.
    sampler=SequentialSampler(val_dataset), # Pull out batches sequentially.
    batch_size=batch_size # Evaluate with this batch size.
)
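
The batch size determines how many optimizer steps one epoch takes. Here is a short check of the arithmetic, using the sample counts printed above and assuming the `DataLoader` keeps the final, smaller batch (its default behaviour when `drop_last` is not set):

```python
import math

train_size, val_size, batch_size = 18_996, 2_111, 32

# The last batch may be smaller than batch_size, hence the ceiling.
train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)
last_val_batch = val_size - (val_batches - 1) * batch_size

print(train_batches, val_batches, last_val_batch)
```

Two epochs of 594 training batches give the 1,188 total steps that the learning rate scheduler is configured with.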

In [151]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 201 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                  (30522, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (

In [152]:

# Note: this uses PyTorch's own AdamW implementation (torch.optim.AdamW).
# AdamW is Adam with decoupled weight decay.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5, # learning rate; run_glue.py default is 5e-5, the tutorial this is adapted from used 2e-5
    eps=1e-8 # Adam epsilon; run_glue.py default is 1e-8
)

In [153]:
# Number of training epochs. The BERT authors recommend between 2 and 4.
# We run 2 here, the low end of that range.
epochs = 2

# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0, # Default value in run_glue.py
    num_training_steps=total_steps
)

print("total_steps", total_steps)
print("epochs", epochs)

total_steps 1188
epochs 2
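
`get_linear_schedule_with_warmup` raises the learning rate linearly from 0 during warmup and then decays it linearly to 0 at the final step. The function below is a plain-Python sketch of that shape for illustration, not the library's code. `linear_lr` is a hypothetical name, and the values plugged in are the ones used in this notebook (`lr=5e-5`, no warmup, 1,188 steps).

```python
def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Start of training, end of the first epoch, end of training.
for step in (0, 594, 1188):
    print(step, linear_lr(step, base_lr=5e-5, warmup_steps=0, total_steps=1188))
```

With no warmup the rate starts at its full value, is halved by the end of the first epoch, and reaches zero at the last step.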


### Aux functions

In [154]:
import numpy as np

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
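
`flat_accuracy` is computed per batch, and the validation loop averages those per-batch values. When the final batch is smaller than the rest, that average gives it the same weight as a full batch, so it can differ slightly from the accuracy over all samples. A toy illustration with invented numbers (a full batch of 4 and a final batch of 2):

```python
# Each inner list holds 1 for a correct prediction and 0 for a wrong one.
batches = [
    [1, 1, 1, 1],  # full batch: 4/4 correct
    [0, 1],        # smaller final batch: 1/2 correct
]

# Mean of per-batch accuracies (what averaging flat_accuracy does).
mean_of_batches = sum(sum(b) / len(b) for b in batches) / len(batches)

# Accuracy over all samples pooled together.
pooled = sum(sum(b) for b in batches) / sum(len(b) for b in batches)

print(mean_of_batches, pooled)
```

With 32-sample batches and a single short batch the difference is small, but accumulating correct counts and dividing once avoids it entirely.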

In [155]:
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))

    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [156]:
# If there's a GPU available...
if torch.cuda.is_available():
    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

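# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

# Move the model to the selected device so it matches the batches sent there.
model = model.to(device)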

No GPU available, using the CPU instead.


### Training Loop

In [157]:
%env CUDA_LAUNCH_BLOCKING=1

env: CUDA_LAUNCH_BLOCKING=1


In [158]:
import random
import numpy as np

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll store a number of quantities such as training and validation loss,
# validation accuracy, and timings.
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

In [159]:
# For each epoch...
for epoch_i in range(0, epochs):

    # ========================================
    #               Training
    # ========================================

    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. Don't be misled--the call to
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):
        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed = format_time(time.time() - t0)

            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader.
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids
        #   [1]: attention masks
        #   [2]: labels
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because
        # accumulating the gradients is "convenient while training RNNs".
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()

        # Perform a forward pass (evaluate the model on this training batch).
        # The documentation for this `model` function is here:
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        # It returns different outputs depending on what arguments
        # are given and what flags are set. For our usage here, it returns
        # the loss (because we provided labels) and the "logits"--the model
        # outputs prior to activation.
        outputs = model(
            b_input_ids,
            token_type_ids=None,
            attention_mask=b_input_mask,
            labels=b_labels
        )
        loss = outputs[0]
        logits = outputs[1]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value
        # from the tensor.
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)

    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))

    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:

        # Unpack this validation batch from our dataloader.
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using
        # the `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids
        #   [1]: attention masks
        #   [2]: labels
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():
            # Forward pass, calculate logit predictions.
            # token_type_ids is the same as the "segment ids", which
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here:
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            # Get the "logits" output by the model. The "logits" are the output
            # values prior to applying an activation function like the softmax.
            outputs = model(
                b_input_ids,
                token_type_ids=None,
                attention_mask=b_input_mask,
                labels=b_labels
            )
            loss = outputs[0]
            logits = outputs[1]

        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of test sentences, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)


    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)

    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...
  Batch    40  of    594.    Elapsed: 0:02:15.
  Batch    80  of    594.    Elapsed: 0:04:27.
  Batch   120  of    594.    Elapsed: 0:06:39.
  Batch   160  of    594.    Elapsed: 0:08:52.
  Batch   200  of    594.    Elapsed: 0:11:04.
  Batch   240  of    594.    Elapsed: 0:13:16.
  Batch   280  of    594.    Elapsed: 0:15:31.
  Batch   320  of    594.    Elapsed: 0:17:46.
  Batch   360  of    594.    Elapsed: 0:19:59.
  Batch   400  of    594.    Elapsed: 0:22:13.
  Batch   440  of    594.    Elapsed: 0:24:26.
  Batch   480  of    594.    Elapsed: 0:26:39.
  Batch   520  of    594.    Elapsed: 0:29:05.
  Batch   560  of    594.    Elapsed: 0:31:18.

  Average training loss: 0.30
  Training epoch took: 0:33:10

Running Validation...
  Accuracy: 0.93
  Validation Loss: 0.23
  Validation took: 0:00:54

Training...
  Batch    40  of    594.    Elapsed: 0:02:14.
  Batch    80  of    594.    Elapsed: 0:04:33.
  Batch   120  of    594.    Elapsed: 0:06:52.
  Batch   160  of    5

Running Validation...
  Accuracy: 0.94
  Validation Loss: 0.21
  Validation took: 0:00:54

### Saving the model

import os

# Saving best practices: if you use the default file names for the model, you can reload it using from_pretrained()

output_dir = './two/'

# Create output directory if needed
os.makedirs(output_dir, exist_ok=True)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
fb_tokenizer.save_pretrained(output_dir)

# Good practice: save your training arguments together with the trained model
# torch.save(args, os.path.join(output_dir, 'training_args.bin'))

Saving model to ./two/


('./two/tokenizer_config.json',
 './two/special_tokens_map.json',
 './two/vocab.txt',
 './two/added_tokens.json')
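
To check that the save worked, the directory can be loaded back with `from_pretrained()`. The following is a minimal sketch under stated assumptions, not the notebook's own code: it uses the generic `AutoTokenizer` and `AutoModelForSequenceClassification` classes (the classes used for training may differ), it assumes a `transformers` release recent enough that the tokenizer is callable, and the helper names `softmax`, `load_tagger` and `predict` are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_tagger(model_dir="./two/"):
    """Reload the tokenizer and model written by save_pretrained()."""
    # Imported lazily so softmax() can be used without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def predict(text, tokenizer, model):
    """Return (predicted class index, class probabilities) for one text."""
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Without labels, the first model output holds the logits.
        logits = model(**inputs)[0][0].tolist()
    probs = softmax(logits)
    return probs.index(max(probs)), probs
```

A typical call would be `tokenizer, model = load_tagger()` followed by `predict(some_headline, tokenizer, model)`. The returned index refers to the label ids used during training.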

## **Step 3: Deployment of the sentiment/category tagger on financial news or social media posts**

Let's now turn our attention to a live deployment of the financial news tagger. Things can get quite complicated, especially if we add streaming data, so it is best to keep the deployment lightweight. There are three main pieces. Let's explore them:


- Build a local dashboard/app (e.g. using Streamlit or another web application framework of your choice). A small UI that displays the sentiment tagger in action is enough to demonstrate the practical application of your model.


- Build a financial news/alerts scraper pipeline, and filter by entity if you want to narrow your search. In a real-world setting, you’d likely want to build a more robust infrastructure for ingesting new examples, handling any preprocessing, and outputting predictions. Here are some options for where to scrape data (real-time data might be expensive or limited):

    - <span style="color:blue">*Social Media Posts*</span>: Pull historical or live data from Twitter or Reddit. There are public APIs with extensive documentation for them.
    - <span style="color:blue">*OpenBB*</span>: Open research investment platform. It aggregates financial news across the world and has an API to access them.
    - <span style="color:blue">*Financial news outlets*</span>: for example, Yahoo Finance.
    
An example pipeline: read in a stream of tweets, use a lightweight sentiment analysis engine (BERT might not be a good fit here) to assign a bullish/neutral/bearish score to each tweet, and then track how the cumulative score changes over time.
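
The cumulative scoring idea in this pipeline can be sketched in a few lines. This is only an illustration: the numeric mapping in `SCORES`, the keyword rule in `keyword_tagger` (a toy stand-in for a real lightweight sentiment engine) and the example tweets are all invented for the sketch.

```python
from itertools import accumulate

# Assumed mapping from sentiment label to a numeric score.
SCORES = {"bullish": 1, "neutral": 0, "bearish": -1}


def keyword_tagger(text):
    """Toy stand-in for a lightweight sentiment engine (hypothetical rule)."""
    lowered = text.lower()
    if any(word in lowered for word in ("beats", "surges", "rallies")):
        return "bullish"
    if any(word in lowered for word in ("misses", "plunges", "downgrade")):
        return "bearish"
    return "neutral"


def cumulative_sentiment(labels):
    """Running total of sentiment scores, in arrival order."""
    return list(accumulate(SCORES[label] for label in labels))


# Invented example tweets, in the order they arrive.
tweets = [
    "ACME beats earnings expectations",
    "ACME schedules annual shareholder meeting",
    "Analysts downgrade ACME after guidance cut",
    "ACME surges on buyback news",
]
labels = [keyword_tagger(tweet) for tweet in tweets]
print(labels)
print(cumulative_sentiment(labels))  # [1, 1, 0, 1]
```

In a streaming setting the same running total could be kept per ticker and updated as each new tweet arrives, instead of being recomputed over a list.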
    
    
- Build an inference endpoint for the tagging model. Within your infrastructure, you can deploy and load the resulting model. One way is to build a REST API endpoint that is only queried locally (on your laptop).
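
A local-only endpoint of this kind can be sketched with nothing but the Python standard library. The sketch below makes several assumptions: the `/tag` path, the JSON request shape `{"text": ...}`, port 8000, and the function names are all hypothetical, and `tag_text` is a placeholder that returns a fixed label where the real model call would go.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def tag_text(text):
    """Placeholder for the real model call (hypothetical); returns a fixed label."""
    return {"text": text, "label": "neutral"}


def handle_tag_request(raw_body):
    """Parse a JSON request body and return (status_code, response_dict)."""
    try:
        text = json.loads(raw_body)["text"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'text' field"}
    return 200, tag_text(text)


class TaggerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/tag":
            self.send_error(404, "Unknown path")
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_tag_request(self.rfile.read(length))
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8000):
    """Bind to 127.0.0.1 only, so the endpoint is reachable just from this machine."""
    return HTTPServer(("127.0.0.1", port), TaggerHandler)


# To start serving, call: make_server().serve_forever()
```

Keeping the request parsing in `handle_tag_request` separate from the HTTP handler makes it possible to test the logic without opening a socket. A web framework would work just as well here; the standard library is used only to keep the sketch dependency-free.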



Extra: You could explore or quantify correlations with the market for a list of selected stocks.
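
One simple way to quantify such a relationship is the Pearson correlation between a daily sentiment series and a return series for the same stock. The sketch below is only a starting point: the function name `pearson` is hypothetical, and the two series are made-up numbers for illustration, not market data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers for illustration only:
# net daily sentiment score and next-day return (in percent) for one stock.
daily_sentiment = [3, -1, 0, 4, -2]
next_day_return = [0.5, -0.3, 0.1, 0.9, -0.2]
print(round(pearson(daily_sentiment, next_day_return), 3))
```

A high correlation on a short sample says little by itself, so any real analysis would need a longer history and a check against lagged and shuffled series.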