### Measuring Topic Sentiment
There are various ways of approaching this problem. Here, we will use the information that we already have from the topic modelling, namely the
topic models. We then apply sentiment at document level. For that purpose we will use a model that is hosted in hugging face, and that consistutes an improvement over FinBert. The advantage of this approach is that we don't have to use the external api vendors and submit our data to them. However, under the right circunstances, and with the appropriate budget considerations, this problem can be solved with prompt engineering directly. We load the data containing the documents for each section and then load the topic models.


In [1]:
import pandas as pd
from bertopic import BERTopic
import pickle
sections = [f'Section{s}' for s in ['1', '1A', '7'] ]
topic_models = {s: BERTopic.load(f'../topic_models/topic_models_{s}') for s in sections}

# Load the split_sections_file
with open("../data/split_sections_text.pickle", "rb") as file:
    split_sections_text = pickle.load(file)


  @numba.jit()
  @numba.jit()
  @numba.jit()
  @numba.jit()


In [2]:
topics_and_docs = {s: tm.get_document_info(split_sections_text[s]['text']) for s,tm in topic_models.items() }
for s,v in topics_and_docs.items():
    topics_and_docs[s]['Timestamp'] = pd.Series(data=[meta['filedAt'] for meta in split_sections_text[s]['meta']])


ValueError: All arrays must be of the same length

In [32]:
topics_and_docs[sections[0]]

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Representative_document,Timestamp
0,"item 1. business assurant, inc. was incorporat...",169,169_lender_global housing_flood insurance_hous...,"[lender, global housing, flood insurance, hous...",,lender - global housing - flood insurance - ho...,False,2023-02-17 16:12:13-05:00
1,"of december 31, 2022, we had $ 33. 12 billion ...",31,31_start 8226_end table_table_table end,"[start 8226, end table, table, table end, tabl...",,start 8226 - end table - table - table end - t...,False,2023-02-17 16:12:13-05:00
2,partnerships with major clients and prospects ...,31,31_start 8226_end table_table_table end,"[start 8226, end table, table, table end, tabl...",,start 8226 - end table - table - table end - t...,False,2023-02-17 16:12:13-05:00
3,we continued to strengthen partnerships with k...,31,31_start 8226_end table_table_table end,"[start 8226, end table, table, table end, tabl...",,start 8226 - end table - table - table end - t...,False,2023-02-17 16:12:13-05:00
4,"year, we have maintained a strong balance shee...",169,169_lender_global housing_flood insurance_hous...,"[lender, global housing, flood insurance, hous...",,lender - global housing - flood insurance - ho...,False,2023-02-17 16:12:13-05:00
...,...,...,...,...,...,...,...,...
71062,a single vendor. at the same time & # 8212 ; b...,-1,-1_160_table_environmental_financial,"[160, table, environmental, financial, vice, t...",,160 - table - environmental - financial - vice...,False,2019-12-20 17:04:07-05:00
71063,"ag ). we also compete with other eda vendors, ...",-1,-1_160_table_environmental_financial,"[160, table, environmental, financial, vice, t...",,160 - table - environmental - financial - vice...,False,2019-12-20 17:04:07-05:00
71064,"many cases, under our customer agreements and ...",3,3_patents_intellectual property_trademarks_pat...,"[patents, intellectual property, trademarks, p...",,patents - intellectual property - trademarks -...,False,2019-12-20 17:04:07-05:00
71065,of engineering and senior vice president of ma...,33,33_university_vice president_vice_executive vice,"[university, vice president, vice, executive v...",,university - vice president - vice - executive...,False,2019-12-20 17:04:07-05:00


Now, we can apply sentiment to each of the document. We will use a fine tuned version of  [FinancialBERT](https://huggingface.co/Sigma/financial-sentiment-analysis) on the financial_phrasebank dataset. This model is currently the top performer on this dataset.


In [62]:
from transformers import pipeline
import torch
device = 'mps' if torch.backends.mps.is_available() else 'cpu'
device = 'cuda' if torch.cuda.is_available() else device
classifier = pipeline(model='Sigma/financial-sentiment-analysis', task='sentiment-analysis', device=device)

We can now call the classifier to predict the label and the probability associated with that label. We then compute a sentiment value using those scores if the label is positive, or their symetric if the label is negative. We also have to perform a mapping of labels because the authors of the model use
the labels 'LABEL_0', 'LABEL_1' amd 'LABEL_2' which map to 'NEGATIVE', 'NEUTRAL', and 'POSITIVE' respectively. We start by

In [None]:
sentiment_labels = {s: classifier(topics_and_docs[s]['Document'].tolist()) for s in topics_and_docs}

In [64]:
import pandas as pd
documents_df = pd.DataFrame(sentiment_labels).add_suffix('_sigma_fsa')
import numpy as np

# Dictionary to map sentiment
sentiment_mapping = {'LABEL_0': 'NEGATIVE', 'LABEL_1': 'NEUTRAL', 'LABEL_2': 'POSITIVE'}
for s in topics_and_docs:
    topics_and_docs[s]['label_sigma_fsa'] = documents_df['label_sigma_fsa'].map(sentiment_mapping)
    topics_and_docs[s]['sentiment_sigma_fsa'] = np.where(topics_and_docs[s]['label_sigma_fsa'] =='NEGATIVE', -1 *topics_and_docs[s]['score_sigma_fsa'], topics_and_docs[s]['score_sigma_fsa'])
    topics_and_docs[s].loc[topics_and_docs[s]['label_sigma_fsa']=='NEUTRAL','sentiment_sigma_fsa'] = 0
    # We need to parse the date again
    topics_and_docs[sections[0]]['Timestamp'] = pd.to_datetime(topics_and_docs[sections[0]]['Timestamp'], utc=True)


NameError: name 'sentiment_labels' is not defined

In [63]:
agg_sentiment = {}
import numpy as np
for s in topics_and_docs:
    agg_sentiment[s] =topics_and_docs[s].set_index('Timestamp').groupby(['Topic','Name']).resample(rule='1Y').agg({'sentiment_sigma_fsa': [np.mean, np.median]}).reset_index(level=(0,1,2,))

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'

In [49]:
topics_and_docs[sections[0]].sort_values('score_sigma_fsa', ascending=False)

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Representative_document,Timestamp,label_sigma_fsa,score_sigma_fsa,sentiment_sgima_fsa,sentiment_sigma_fsa
64048,"details. in 2018, ametek achieved sales of $ 4...",329,329_emg_eig 8217_power industrial_electromecha...,"[emg, eig 8217, power industrial, electromecha...",,emg - eig 8217 - power industrial - electromec...,False,2019-02-21 15:22:45-05:00,POSITIVE,0.998357,0.998357,0.998357
19455,dividends strong general insurance performance...,848,848_aig_business aig_international group_aig p...,"[aig, business aig, international group, aig p...",,aig - business aig - international group - aig...,False,2023-02-17 13:24:05-05:00,POSITIVE,0.998063,0.998063,0.998063
63977,responsible for successfully driving these str...,-1,-1_160_table_environmental_financial,"[160, table, environmental, financial, vice, t...",,160 - table - environmental - financial - vice...,False,2023-02-21 13:37:14-05:00,POSITIVE,0.998050,0.998050,0.998050
63995,"2021, the company posted record sales, operati...",-1,-1_160_table_environmental_financial,"[160, table, environmental, financial, vice, t...",,160 - table - environmental - financial - vice...,False,2022-02-22 13:58:32-05:00,POSITIVE,0.997955,0.997955,0.997955
17438,"brand dealers and distributors. further, finan...",-1,-1_160_table_environmental_financial,"[160, table, environmental, financial, vice, t...",,160 - table - environmental - financial - vice...,False,2023-02-28 15:55:52-05:00,POSITIVE,0.997925,0.997925,0.997925
...,...,...,...,...,...,...,...,...,...,...,...,...
26825,item 1a. & # 8220 ; risk factors & # 8221 ; fo...,142,142_smelting_smelter_pt smelting_copper concen...,"[smelting, smelter, pt smelting, copper concen...",,smelting - smelter - pt smelting - copper conc...,False,2021-02-16 15:38:33-05:00,NEUTRAL,0.000000,0.990650,0.000000
26826,##bdenum disulfide. we operate molybdenum roas...,-1,-1_160_table_environmental_financial,"[160, table, environmental, financial, vice, t...",,160 - table - environmental - financial - vice...,False,2021-02-16 15:38:33-05:00,NEUTRAL,0.000000,0.985354,0.000000
26827,. we continue to review our mine development a...,669,669_block cave_grasberg block_metric tons_ore ...,"[block cave, grasberg block, metric tons, ore ...",,block cave - grasberg block - metric tons - or...,False,2021-02-16 15:38:33-05:00,NEUTRAL,0.000000,0.581226,0.000000
26828,with our existing mines focusing on opportunit...,142,142_smelting_smelter_pt smelting_copper concen...,"[smelting, smelter, pt smelting, copper concen...",,smelting - smelter - pt smelting - copper conc...,False,2021-02-16 15:38:33-05:00,NEUTRAL,0.000000,0.875626,0.000000


Since we want to aggregate across topics, we keep topic, name and compute the mean and median sentiment.

In [56]:
topics_and_docs[sections[0]]['Timestamp'] = pd.to_datetime(topics_and_docs[sections[0]]['Timestamp'], utc=True)
sentiment = topics_and_docs[sections[0]].set_index('Timestamp').groupby(['Topic','Name'])['sentiment_sigma_fsa'].resample(rule='1Y').mean().reset_index(level=(0,1,2,))

In [57]:
sentiment.head()

Unnamed: 0,Topic,Name,Timestamp,sentiment_sigma_fsa
0,-1,-1_160_table_environmental_financial,2019-12-31 00:00:00+00:00,0.116944
1,-1,-1_160_table_environmental_financial,2020-12-31 00:00:00+00:00,0.131976
2,-1,-1_160_table_environmental_financial,2021-12-31 00:00:00+00:00,0.168374
3,-1,-1_160_table_environmental_financial,2022-12-31 00:00:00+00:00,0.183749
4,-1,-1_160_table_environmental_financial,2023-12-31 00:00:00+00:00,0.179892


In [61]:
import plotly.express as px
fig = px.line(data_frame=sentiment.loc[sentiment.Topic.isin(range(1, 20)),:], x='Timestamp', y='sentiment_sigma_fsa', color='Name')
fig.update_layout(
    title='Frequency over Time',
    xaxis_title='Timestamp',
    yaxis_title='Average Sentiment per period'
)

fig.show()

Above, we can see how the sentiment evolved for the top 20 periods. Although, while using only the chart, is not possible to do an exhaustive analysis, we can see an improvement of the sentiment around the topic 6 (COVID-19). Let's do this for the remaining two sections,