In [1]:
# !pip install torch torchvision


In [2]:
# !pip install transformers

## Comparing a prebuilt RoBERTa financial sentiment analyzer to FinVADER

Note: NumPy had to be downgraded to 1.24.4 for compatibility with the transformers library.

In [83]:
import numpy as np
import torch 
import pandas as pd
import torchvision
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from transformers import pipeline
from transformers import AutoConfig
import datetime
# torch.set_printoptions(edgeitems=2, precision=2, linewidth=75)


In [4]:
# Use the Mac GPU (MPS) if available, otherwise the CPU; change for other hardware if needed
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

In [5]:
# Keep only the rows that contain an article (i.e. rows with a headline)
articles = pd.read_csv('./data/complete_next_open.csv')
articles = articles[articles['Headline'].notna()]
articles['Text'].iloc[0]

"The segment is an invaluable asset to Apple's overall business."

We import the DistilRoBERTa financial sentiment analyzer with the most downloads on Hugging Face.

In [26]:
model_name = 'mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis'
config = AutoConfig.from_pretrained(model_name)
config.max_position_embeddings

514

In [28]:
sentiment_pipeline = pipeline(model = model_name,
                              device = device, 
                              batch_size = 8,
                              truncation = True)

Running an example computation

In [7]:
articles = articles[['Headline', 'Text', 'finvader_tot']]

In [49]:
articles.iloc[1]['Headline'] + '. '+articles.iloc[1]['Text']

'Jim Cramer Gives His Opinion On Bank Of America, Sarepta, Wendy\'s And More. On CNBC\'s "Mad Money Lightning Round", Jim Cramer said Bank of America Corp (NYSE: BAC) is too inexpensive.'

In [18]:
sentiment_pipeline(articles.iloc[1]['Text'])
# Returns a list of dictionaries, one per input

[{'label': 'negative', 'score': 0.9783570170402527}]

Now let's apply this to the whole dataframe.

One thing to be wary of is that some of the texts are longer than the model allows (`max_position_embeddings` is 514, which for RoBERTa-style models typically leaves 512 usable tokens). To get around this, the basic function below could be updated to split and aggregate: cut the text into shorter pieces, score each piece, and average the scores. Splitting at sentence ends would make the cuts less arbitrary. However, aggregation is not an ideal way to capture sentiment: the model scores each chunk independently, whereas a human reader interprets each passage in light of the others.
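The splitting-and-averaging idea can be sketched roughly as follows. This is only an illustration under several assumptions: `split_into_chunks` and `aggregate_sentiment` are hypothetical helpers (not part of this notebook or of transformers), the word count is used as a crude stand-in for the token count, and `classify` is assumed to be any callable that returns a score per label for a piece of text (for example, a wrapper around `sentiment_pipeline`).

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List


def split_into_chunks(text: str, max_words: int = 300) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    Word count is only a rough proxy for the model's token count (assumption).
    A single sentence longer than max_words is kept whole, so it would still
    need truncation downstream.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(' '.join(current))
    return chunks


def aggregate_sentiment(
    text: str,
    classify: Callable[[str], Dict[str, float]],
    max_words: int = 300,
) -> Dict[str, float]:
    """Average the per-label scores over all chunks of the text.

    A label that is missing from a chunk's result contributes 0 for that chunk.
    """
    chunks = split_into_chunks(text, max_words)
    totals: Dict[str, float] = defaultdict(float)
    for chunk in chunks:
        for label, score in classify(chunk).items():
            totals[label] += score
    return {label: total / len(chunks) for label, total in totals.items()}


# Toy stand-in for the real model, used only to show the mechanics.
def toy_classifier(chunk: str) -> Dict[str, float]:
    return {'positive': 1.0} if 'good' in chunk else {'negative': 1.0}


averaged = aggregate_sentiment('This is good. This is bad.', toy_classifier, max_words=3)
```

With the toy classifier, the two sentences land in separate chunks and the averaged scores are split evenly between the two labels, which also illustrates the weakness noted above: each chunk is judged in isolation.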

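The most straightforward method is truncation, which is also not ideal because everything past the limit is ignored. However, it is the easiest to implement, so it is what we use below (via `truncation = True` in the pipeline).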

In [22]:
def sentiment_scores(text:str):
    result = sentiment_pipeline(text)[0]
    return pd.Series([result['label'], result['score']])


In [65]:
sentiment_scores(articles.apply(lambda x: x['Headline'] +'. ' + x['Text'], axis = 1).iloc[4])



0    positive
1     0.99965
dtype: object

In [62]:
articles[['rob_sentiment', 'rob_score']] = articles.apply(lambda x: x['Headline'] +'. '+ x['Text'], axis =1).apply(sentiment_scores)
articles.head()

Unnamed: 0,Headline,Text,finvader_tot,rob_sentiment,rob_score,fin_sentiment
150,Don't Underestimate Apple's iPhone Business,The segment is an invaluable asset to Apple's ...,0.0396,positive,0.669296,neutral
153,Jim Cramer Gives His Opinion On Bank Of Americ...,"On CNBC's ""Mad Money Lightning Round"", Jim Cra...",0.0129,neutral,0.994898,neutral
154,Uber And Waymo Seeking Outside Funding For Aut...,Commercially viable autonomous vehicle (AV) te...,-0.3215,negative,0.995782,negative
158,A Closer Look At Mastercard's Key Value Drivers,Mastercard has consistently beat street estima...,0.8922,positive,0.999699,positive
164,Did Wells Fargo CEO Tim Sloan Earn His $1 Mill...,We learned this week that the scandal-plagued ...,0.2869,positive,0.99965,positive


In [36]:
def fin_sentiment(finscore: float) -> str:
    threshold = .1
    if finscore > threshold:
        return 'positive'
    elif finscore < -threshold:
        return 'negative'
    else:
        return 'neutral'

In [67]:
articles['fin_sentiment'] = articles['finvader_tot'].apply(fin_sentiment)
articles.head()

Unnamed: 0,Headline,Text,finvader_tot,rob_sentiment,rob_score,fin_sentiment
150,Don't Underestimate Apple's iPhone Business,The segment is an invaluable asset to Apple's ...,0.0396,positive,0.669296,neutral
153,Jim Cramer Gives His Opinion On Bank Of Americ...,"On CNBC's ""Mad Money Lightning Round"", Jim Cra...",0.0129,neutral,0.994898,neutral
154,Uber And Waymo Seeking Outside Funding For Aut...,Commercially viable autonomous vehicle (AV) te...,-0.3215,negative,0.995782,negative
158,A Closer Look At Mastercard's Key Value Drivers,Mastercard has consistently beat street estima...,0.8922,positive,0.999699,positive
164,Did Wells Fargo CEO Tim Sloan Earn His $1 Mill...,We learned this week that the scandal-plagued ...,0.2869,positive,0.99965,positive


In [68]:
comparisons = articles['fin_sentiment'] == articles['rob_sentiment']
print(f'Agrees on {comparisons.mean():.2%}')

Agrees on 56.99%


In [86]:
old = pd.read_csv('./data/complete_next_open.csv')
old.head()


Unnamed: 0,Publishing Time,Market Date,Ticker,Sector,finvader_neg,finvader_neu,finvader_pos,finvader_tot,Source,Headline,Text,URL,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,,2019-03-01,AAPL,Technology,,,,,,,,,41.887973,42.097075,41.553888,42.053814,103544800,0.0,0.0
1,,2019-03-01,ABBV,Healthcare,,,,,,,,,62.740368,63.589807,62.354977,62.99992,8567900,0.0,0.0
2,,2019-03-01,AMZN,Technology,,,,,,,,,82.7565,83.712997,82.550003,83.586502,99498000,0.0,0.0
3,,2019-03-01,BAC,Finance,,,,,,,,,25.918994,26.201778,25.812949,25.90132,45771500,0.0,0.0
4,,2019-03-01,GOOGL,Technology,,,,,,,,,56.549999,57.5,56.549999,57.425999,34086000,0.0,0.0


In [87]:
new = pd.merge(old, articles, on = ['Headline', 'Text', 'finvader_tot'], how = 'outer')
new.head()

Unnamed: 0,Publishing Time,Market Date,Ticker,Sector,finvader_neg,finvader_neu,finvader_pos,finvader_tot,Source,Headline,...,Open,High,Low,Close,Volume,Dividends,Stock Splits,rob_sentiment,rob_score,fin_sentiment
0,2019-10-23 20:18:00+00:00,2019-10-24,AMZN,Technology,0.077,0.616,0.306,0.6053,Zacks Investment Research,"""Alexa, Play the News"": Amazon Launches Fire T...",...,88.554497,89.417,88.013496,89.039001,88922000,0.0,0.0,positive,0.938327,positive
1,2019-03-26 23:41:00+00:00,2019-03-27,GOOGL,Technology,0.0,1.0,0.0,0.0,The Motley Fool,"""Alphabet Earnings"" Mark Your Calendar",...,59.596001,59.596001,58.211498,58.900501,29428000,0.0,0.0,neutral,0.999879,neutral
2,2021-08-19 15:49:28+00:00,2021-08-20,NVDA,Technology,0.0,0.823,0.177,0.3909,Business Insider,"""Demand continues to outpace supply"": Here's w...",...,199.565678,208.290614,198.986675,207.801468,67574100,0.0,0.0,positive,0.998652,positive
3,2020-08-15 12:07:00+00:00,2020-08-17,AAPL,Technology,0.072,0.807,0.121,-0.3193,The Motley Fool,"""Fortnite"" Publisher Epic Games Wants to Chang...",...,113.552773,113.577234,111.498185,112.129234,119561600,0.0,0.0,neutral,0.999724,negative
4,2022-12-29 00:08:20+00:00,2022-12-29,AAPL,Technology,0.068,0.779,0.153,0.0775,Zacks Investment Research,"""Krampus Rally"" Threatens Last Sessions of 2022",...,126.944168,129.41382,126.686298,128.550934,75703700,0.0,0.0,positive,0.995475,neutral


In [88]:
new['Market Date'] = pd.to_datetime(new['Market Date'])
new.set_index('Market Date', inplace = True)
new.sort_index(inplace = True)

In [93]:
new.head()
new.iloc[203]

Publishing Time                            2019-03-18 17:45:02+00:00
Ticker                                                          AAPL
Sector                                                    Technology
finvader_neg                                                   0.044
finvader_neu                                                   0.595
finvader_pos                                                   0.361
finvader_tot                                                  0.6657
Source                                               The Motley Fool
Headline           A Foolish Take: The iPhone's Market Share in t...
Text               Apple's flagship device continues to enjoy a h...
URL                https://www.fool.com/investing/2019/03/18/a-fo...
Open                                                       45.269685
High                                                       45.423508
Low                                                        44.685636
Close                             

In [94]:
# new.to_csv('data/prebuit_rob_sentiment.csv')

The two analyzers agree on only about 57% of the articles, which is a noticeable difference! But one thing we haven't incorporated is the softmax score given by RoBERTa. For instance, the first article in our dataframe is labelled differently by RoBERTa and FinVADER, but the RoBERTa softmax score is only around 0.67, so the model is less confident about it.

However, reading the article text does suggest that the RoBERTa label is the more accurate one, at least compared to the FinVADER score.
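One way to bring the softmax score into the comparison is to recompute the agreement rate using only the articles where RoBERTa is confident. The sketch below uses the five rows shown by `articles.head()` above as sample data; `agreement_rate` is a hypothetical helper and the 0.9 cutoff is an arbitrary choice for illustration, not a tuned value.

```python
import pandas as pd

# The five rows displayed by articles.head() earlier in the notebook.
sample = pd.DataFrame({
    'rob_sentiment': ['positive', 'neutral', 'negative', 'positive', 'positive'],
    'rob_score': [0.669296, 0.994898, 0.995782, 0.999699, 0.99965],
    'fin_sentiment': ['neutral', 'neutral', 'negative', 'positive', 'positive'],
})


def agreement_rate(df: pd.DataFrame, min_confidence: float = 0.0) -> float:
    """Share of rows where both labels match, among rows with rob_score >= min_confidence."""
    confident = df[df['rob_score'] >= min_confidence]
    if confident.empty:
        return float('nan')
    return float((confident['rob_sentiment'] == confident['fin_sentiment']).mean())


overall = agreement_rate(sample)              # all five rows
confident_only = agreement_rate(sample, 0.9)  # drops the low-confidence first row
```

On these five rows the only disagreement is the low-confidence first article, so filtering it out raises the agreement; whether the same pattern holds on the full dataframe would need to be checked by passing `articles` instead of `sample`.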

## Fine-tuning our own analyzer?

It might be worth fine-tuning our own model to see how it compares. Of course, we would need a proper dataset, and reusing the one that the pretrained model above was trained on would be redundant.

## Comparisons via older models

One thing we should do is see how the older models perform with these sentiment scores in place of the FinVADER ones. However, we effectively lose the scalar score: instead of a continuous value, we only have a categorical label (positive, neutral, or negative) plus a confidence to describe each article. 'Averaging' to get the sentiment for a day is thus much harder.

Some solutions to this averaging problem include:
1. Taking counts of the articles per label and using those as our input instead.
2. Adjusting the FinVADER scores using the RoBERTa ones?
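As a rough sketch of the first option, the per-day label counts can be built with a crosstab. The dataframe below is made-up toy data with the same column names as `new` (after resetting its index), not real results. The `signed_score` variant at the end is an additional assumption of this sketch rather than something proposed above: it maps each label to a sign and weights it by the softmax score so that a daily mean is defined again.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    'Market Date': pd.to_datetime(['2019-03-01', '2019-03-01', '2019-03-01', '2019-03-04']),
    'rob_sentiment': ['positive', 'negative', 'positive', 'neutral'],
    'rob_score': [0.99, 0.95, 0.67, 0.99],
})

labels = ['negative', 'neutral', 'positive']

# Option 1: number of articles per label on each market date.
daily_counts = (
    pd.crosstab(toy['Market Date'], toy['rob_sentiment'])
    .reindex(columns=labels, fill_value=0)  # keep all three columns even if a label is absent
)

# Assumed alternative: signed, confidence-weighted score averaged per day.
sign = toy['rob_sentiment'].map({'positive': 1, 'neutral': 0, 'negative': -1})
toy['signed_score'] = sign * toy['rob_score']
daily_signed = toy.groupby('Market Date')['signed_score'].mean()
```

The counts keep the three classes separate, which suits models that accept several input features; the signed score collapses them into one number per day, closer in shape to `finvader_tot`, at the cost of treating a confident neutral article as exactly zero.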