### Measuring Topic Sentiment with FinancialBERT
There are several ways to approach this problem. Here, we reuse what we already have from the topic modelling step, namely the fitted topic models, and apply sentiment analysis at document level. For that purpose we use a model hosted on Hugging Face that improves on FinBERT. The advantage of this approach is that we do not have to rely on external API vendors or submit our data to them. Under the right circumstances, and with an appropriate budget, the problem could also be solved directly with prompt engineering.

We start by loading the documents for each section and then the corresponding topic models.


In [1]:
import pandas as pd
from bertopic import BERTopic
import pickle
sections = [f'Section{s}' for s in ['1', '1A', '7'] ]
topic_models = {s: BERTopic.load(f'../topic_models/topic_models_{s}') for s in sections}

# Load the split_sections_file
with open("../data/split_sections_text.pickle", "rb") as file:
    split_sections_text = pickle.load(file)




In [27]:
topics_and_docs = {s: tm.get_document_info(split_sections_text[s]['text']) for s,tm in topic_models.items() }
for s,v in topics_and_docs.items():
    topics_and_docs[s]['Timestamp'] = pd.Series(data=[meta['filedAt'] for meta in split_sections_text[s]['meta']])


In [3]:
topics_and_docs[sections[0]]

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Representative_document,Timestamp
0,"item 1. business assurant, inc. was incorporat...",458,458_global lifestyle_credit insurance_global a...,"[global lifestyle, credit insurance, global au...",,global lifestyle - credit insurance - global a...,False,2023-02-17 16:12:13-05:00
1,", voluntary homeowners insurance and other spe...",-1,-1_reports_sec_form_employees,"[reports, sec, form, employees, form 10, repor...",,reports - sec - form - employees - form 10 - r...,False,2023-02-17 16:12:13-05:00
2,"emerging technologies and operations, includin...",-1,-1_reports_sec_form_employees,"[reports, sec, form, employees, form 10, repor...",,reports - sec - form - employees - form 10 - r...,False,2023-02-17 16:12:13-05:00
3,partnerships with major clients and prospects ...,732,732_client_client relationships_support client...,"[client, client relationships, support clients...",,client - client relationships - support client...,False,2023-02-17 16:12:13-05:00
4,"we are focused on strategically attracting, de...",1062,1062_shared success_guiding principles_clients...,"[shared success, guiding principles, clients p...",,shared success - guiding principles - clients ...,False,2023-02-17 16:12:13-05:00
...,...,...,...,...,...,...,...,...
105993,"many cases, under our customer agreements and ...",2,2_patents_intellectual property_patent applica...,"[patents, intellectual property, patent applic...",,patents - intellectual property - patent appli...,False,2019-12-20 17:04:07-05:00
105994,officer joseph w. logan & # 160 ; & # 160 ; sa...,1561,1561_synopsys_runkel_joined synopsys_foon chan,"[synopsys, runkel, joined synopsys, foon chan,...",,synopsys - runkel - joined synopsys - foon cha...,False,2019-12-20 17:04:07-05:00
105995,and as our president and a member of our board...,1561,1561_synopsys_runkel_joined synopsys_foon chan,"[synopsys, runkel, joined synopsys, foon chan,...",,synopsys - runkel - joined synopsys - foon cha...,False,2019-12-20 17:04:07-05:00
105996,joseph w. logan serves as our sales and corpor...,1561,1561_synopsys_runkel_joined synopsys_foon chan,"[synopsys, runkel, joined synopsys, foon chan,...",,synopsys - runkel - joined synopsys - foon cha...,False,2019-12-20 17:04:07-05:00


Now we can apply sentiment analysis to each document. We will use a version of [FinancialBERT](https://huggingface.co/Sigma/financial-sentiment-analysis) fine-tuned on the financial_phrasebank dataset. At the time of writing, this model is the top performer on this dataset.


In [4]:
from transformers import pipeline
import torch
device = 'mps' if torch.backends.mps.is_available() else 'cpu'
device = 'cuda' if torch.cuda.is_available() else device
classifier = pipeline(model='Sigma/financial-sentiment-analysis', task='sentiment-analysis', device=device)

We can now call the classifier to predict a label for each document, together with the probability associated with that label. We then compute a sentiment value from those scores: the score itself if the label is positive, its negative if the label is negative, and zero if the label is neutral. We also have to map the labels, because the authors of the model use 'LABEL_0', 'LABEL_1' and 'LABEL_2', which correspond to 'NEGATIVE', 'NEUTRAL' and 'POSITIVE' respectively. We start by running the classifier on the documents of every section.
In [5]:

sentiment_labels = {s: classifier(topics_and_docs[s]['Document'].tolist()) for s in topics_and_docs}
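
Each result returned by the classifier is a dictionary with a `label` and a `score`. As a minimal sketch of how the signed sentiment described above could be derived from such results, the snippet below works on a few hypothetical outputs (the values are made up for illustration, and `signed_sentiment` is a helper name introduced here, not part of the notebook).

```python
# Hypothetical classifier outputs: one dict per document, holding the raw
# label and the probability the model assigns to that label.
example_outputs = [
    {"label": "LABEL_2", "score": 0.5},
    {"label": "LABEL_0", "score": 0.75},
    {"label": "LABEL_1", "score": 0.25},
]

# Label mapping described in the text above.
LABEL_NAMES = {"LABEL_0": "NEGATIVE", "LABEL_1": "NEUTRAL", "LABEL_2": "POSITIVE"}


def signed_sentiment(output):
    """Return +score for POSITIVE, -score for NEGATIVE and 0.0 for NEUTRAL."""
    name = LABEL_NAMES[output["label"]]
    if name == "POSITIVE":
        return output["score"]
    if name == "NEGATIVE":
        return -output["score"]
    return 0.0


signed_scores = [signed_sentiment(o) for o in example_outputs]
print(signed_scores)  # [0.5, -0.75, 0.0]
```

The resulting value lies in [-1, 1], so scores from different documents can be averaged per topic and period.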

In [55]:
import json
with open("../api/assets/sentiment_labels.json", "w") as file:
    json.dump(sentiment_labels, file)

In [36]:
import pandas as pd
import numpy as np
documents_df = {}
for s in topics_and_docs:
    documents_df[s] = pd.DataFrame(sentiment_labels[s]).add_suffix('_sigma_fsa')


# Dictionary to map sentiment and add sentiment to topics_and_docs
sentiment_mapping = {'LABEL_0': 'NEGATIVE', 'LABEL_1': 'NEUTRAL', 'LABEL_2': 'POSITIVE'}
for s in topics_and_docs:
    topics_and_docs[s]['label_sigma_fsa'] = documents_df[s]['label_sigma_fsa'].map(sentiment_mapping)
    topics_and_docs[s]['score_sigma_fsa'] = documents_df[s]['score_sigma_fsa']
    topics_and_docs[s]['sentiment_sigma_fsa'] = np.where(topics_and_docs[s]['label_sigma_fsa'] =='NEGATIVE', -1 *topics_and_docs[s]['score_sigma_fsa'], topics_and_docs[s]['score_sigma_fsa'])
    topics_and_docs[s].loc[topics_and_docs[s]['label_sigma_fsa']=='NEUTRAL','sentiment_sigma_fsa'] = 0
    # Parse the timestamps again, this time as timezone-aware datetimes normalised to UTC.
    # Keep in mind that the original -05:00 offset becomes +00:00; the instant itself is unchanged.
    topics_and_docs[s]['Timestamp'] = pd.to_datetime(topics_and_docs[s]['Timestamp'], utc=True)
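
The effect of the UTC conversion in the cell above can be illustrated with the standard library alone. This is only a sketch using one timestamp from the table shown earlier; the variable names are chosen here for illustration.

```python
from datetime import datetime, timezone

# Timestamp taken from the table above; it carries a -05:00 offset.
filed_at = datetime.fromisoformat("2023-02-17 16:12:13-05:00")

# Converting to UTC shifts the wall-clock time by five hours,
# but the value still refers to the same instant.
filed_at_utc = filed_at.astimezone(timezone.utc)

print(filed_at_utc.isoformat())  # 2023-02-17T21:12:13+00:00
```

One consequence worth keeping in mind: a filing submitted late in the evening of 31 December at -05:00 falls on 1 January in UTC, and would therefore land in the following year once the data is aggregated yearly.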


In [37]:
for s in topics_and_docs:
    print(topics_and_docs[s].info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105998 entries, 0 to 105997
Data columns (total 11 columns):
 #   Column                   Non-Null Count   Dtype              
---  ------                   --------------   -----              
 0   Document                 105998 non-null  object             
 1   Topic                    105998 non-null  int64              
 2   Name                     105998 non-null  object             
 3   Representation           105998 non-null  object             
 4   Representative_Docs      0 non-null       float64            
 5   Top_n_words              105998 non-null  object             
 6   Representative_document  105998 non-null  bool               
 7   Timestamp                105998 non-null  datetime64[ns, UTC]
 8   label_sigma_fsa          105998 non-null  object             
 9   score_sigma_fsa          105998 non-null  float64            
 10  sentiment_sigma_fsa      105998 non-null  float64            
dtypes: bool(1), d

### Aggregated Sentiment
Now that the individual documents are scored, we can aggregate them at a user-defined frequency (yearly in this case) and return the resulting time series.

In [46]:
import numpy as np
agg_sentiment = {}
for s in topics_and_docs:
    print(s)
    # Per topic: resample yearly, then take the mean and median sentiment.
    agg_sentiment[s] = (
        topics_and_docs[s].set_index('Timestamp')
        .groupby(['Topic', 'Name'])
        .resample(rule='1Y')
        .agg({'sentiment_sigma_fsa': [np.mean, np.median]})
        .reset_index(level=(0, 1, 2))
    )
    # Flatten the two-level columns, e.g. to 'sentiment_sigma_fsamean'.
    agg_sentiment[s].columns = [col_1 + col_2 for col_1, col_2 in agg_sentiment[s].columns]

Section1
Section1A
Section7


Since we want one time series per topic, we group by Topic and Name and compute the mean and median sentiment for each yearly period.
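
To make the aggregation concrete, here is a minimal pure-Python sketch of the same idea: bucket the signed scores by topic and calendar year, then reduce each bucket to its mean and median. The records are hypothetical, and unlike `resample` this sketch does not emit empty periods (which is where the null values in the aggregated frame come from).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

# Hypothetical (topic, timestamp, signed sentiment) records.
records = [
    (6, datetime(2020, 3, 1), -0.5),
    (6, datetime(2020, 11, 15), -1.0),
    (6, datetime(2021, 2, 20), 0.0),
    (15, datetime(2021, 2, 25), 0.5),
]

# Collect the scores that fall into each (topic, calendar year) bucket.
buckets = defaultdict(list)
for topic, timestamp, score in records:
    buckets[(topic, timestamp.year)].append(score)

# Reduce every bucket to its mean and median.
yearly = {
    key: {"mean": mean(scores), "median": median(scores)}
    for key, scores in sorted(buckets.items())
}

for (topic, year), stats in yearly.items():
    print(topic, year, stats)
```

The mean is sensitive to a few strongly negative documents, while the median is more robust, which is one reason to keep both.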

In [56]:
topics_and_docs[sections[0]]['Timestamp'] = pd.to_datetime(topics_and_docs[sections[0]]['Timestamp'], utc=True)
sentiment = topics_and_docs[sections[0]].set_index('Timestamp').groupby(['Topic','Name'])['sentiment_sigma_fsa'].resample(rule='1Y').mean().reset_index(level=(0,1,2,))

In [47]:
agg_sentiment[sections[1]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7404 entries, 0 to 7403
Data columns (total 5 columns):
 #   Column                     Non-Null Count  Dtype              
---  ------                     --------------  -----              
 0   Topic                      7404 non-null   int64              
 1   Name                       7404 non-null   object             
 2   Timestamp                  7404 non-null   datetime64[ns, UTC]
 3   sentiment_sigma_fsamean    7380 non-null   float64            
 4   sentiment_sigma_fsamedian  7380 non-null   float64            
dtypes: datetime64[ns, UTC](1), float64(2), int64(1), object(1)
memory usage: 289.3+ KB


In [50]:
import plotly.express as px

for s in agg_sentiment:
    sentiment = agg_sentiment[s]
    fig = px.line(data_frame=sentiment.loc[sentiment.Topic.isin(range(1, 20)),:], x='Timestamp', y='sentiment_sigma_fsamean', color='Name')
    fig.update_layout(
        title=f'Topic sentiment -{s}',
        xaxis_title='Timestamp',
        yaxis_title='Average Sentiment per period'
    )

    fig.show()

Above, we can see how the sentiment evolved for topics 1 to 19 in each section. Although an exhaustive analysis is not possible from the charts alone, we can see the sentiment around topic 6 (COVID-19) in Section 1 improving over time. In fact, the period ending 31 December 2020 shows a very negative mean sentiment for that topic, which reflects the immense impact that COVID-19 had during 2020. In Section 7 we also see a decrease in sentiment for topic 15, which relates to aircraft and airlines. Again, this probably reflects worries about airlines, which were accentuated during the pandemic: the minimum in January 2022 corresponds to the annual reports for 2021, a very difficult year for airlines. It is important to note that, in the context of annual filings, sentiment is backward-looking, sometimes with a significant lag.

We now save the sentiment data so that the web service can use it. The file will be large and needs to be tracked with Git LFS.

In [54]:
with open('../api/assets/topics_and_docs_sentiment.json', 'w') as f:
    # Serialise each section's DataFrame to a JSON string, then dump the dictionary of strings.
    topics_and_docs_sentiment = {s: v.to_json() for s, v in topics_and_docs.items()}
    json.dump(topics_and_docs_sentiment, f)
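
Note that the file written above stores each section as a JSON string inside a JSON object, so a consumer has to decode twice. The sketch below illustrates this with the standard library only; the small dictionary is a hypothetical stand-in for what `DataFrame.to_json()` produces with its default column-oriented layout.

```python
import json

# Hypothetical stand-in for one section: this string plays the role of the
# value returned by DataFrame.to_json() (a JSON document stored as a str).
frame_as_json = json.dumps({"Topic": {"0": 6}, "sentiment_sigma_fsa": {"0": -0.5}})
payload = json.dumps({"Section1": frame_as_json})

# First decode: recovers the dictionary of sections, whose values are still strings.
sections_decoded = json.loads(payload)
assert isinstance(sections_decoded["Section1"], str)

# Second decode: recovers the column -> index -> value mapping.
frame_decoded = json.loads(sections_decoded["Section1"])
print(frame_decoded["sentiment_sigma_fsa"]["0"])  # -0.5
```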