# FinSent Model 0.1

Applies a pretrained sentiment classifier for finance tweets to recent tweets, to estimate overall market sentiment.

Tweet Sources:

These are the Twitter List IDs. Both are private and cannot be accessed by accounts other than mine:

1. '1561585052949282816' : Finance News outlets on Twitter
2. '1561588376641609728' : Finance Influencers

Model:

RoBERTa-base, fine-tuned first on a finance news headlines dataset and then on a finance tweets dataset. More info: https://github.com/samyuktsriram/nlp-2022

### Loading a model

In [1]:
#Loading a model and prepping for predictions

#Installing

!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.0-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.0 tokenizers-0.12.1 transformers-4.21.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)

In [2]:
import scipy
import sklearn
import numpy as np

from transformers import AutoTokenizer, DataCollatorWithPadding, TFAutoModelForSequenceClassification, create_optimizer
from transformers.keras_callbacks import KerasMetricCallback
from tensorflow.keras.callbacks import TensorBoard

import datasets

import tensorflow as tf
from datasets import load_dataset, load_metric

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
#Setting up task and model:
task = 'sst2' #Similar to problem statement from GLUE - supervised sentiment classification on Stanford Sentiment Treebank
model_checkpoint = 'roberta-base' #Make sure the model is compatible with classification tasks
#Here are some models for classification: roberta-base, roberta-large, ProsusAI/finbert

#vocab for distilbert = 30522
batch_size = 16 #This might need to be tweaked based on task and model.

metric = load_metric('glue', task)

#Preprocessing

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

#Defining Loss and Model

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
num_labels = 3 #for tweet sentiments

model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels = num_labels)


#Compiling the model

num_epochs = 2
#batches_per_epoch = len(encoded_dataset['train']) // batch_size -> is 157 in another example
#total_train_steps = int(batches_per_epoch * num_epochs) -> is 314 in another example


#create_optimizer() is AdamW with weight and learning rate decay
optimizer, schedule = create_optimizer(
    init_lr = 3e-5, num_warmup_steps = 0, num_train_steps = 10 #Random number, we aren't training so does not matter
)

model.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

model.load_weights('/content/drive/MyDrive/roberta_base_2_sentfin/twitter_model')

Downloading builder script:   0%|          | 0.00/1.84k [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading merges.txt:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

Downloading tf_model.h5:   0%|          | 0.00/627M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f3de41469d0>

In [5]:
sentiments = ['positive', 'neutral', 'negative']
input_tweet = "Microsoft records massive surge in Q1 profits"

inputs = tokenizer(input_tweet, return_tensors="tf")

logits = model(**inputs).logits

predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])

print(f'logits: {logits}')
print(f'Input tweet: {input_tweet}')
print(f'Highest probability prediction: {sentiments[predicted_class_id]}')


logits: [[ 3.5974314 -1.2441466 -2.4829462]]
Input tweet: Microsoft records massive surge in Q1 profits
Highest probability prediction: positive
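
The logits printed above are unnormalized scores, so they do not directly say how confident the model is. As a minimal sketch, the softmax that turns them into probabilities can be reproduced with NumPy alone. The logit values are copied from the output above; the `softmax` helper is written here for illustration and is not part of the notebook.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtracting the row maximum avoids overflow and does not change the result
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

sentiments = ['positive', 'neutral', 'negative']
logits = [[3.5974314, -1.2441466, -2.4829462]]  # copied from the cell output above

probs = softmax(logits)
for label, p in zip(sentiments, probs[0]):
    print(f'{label}: {p:.4f}')
```

For this tweet the positive class receives roughly 0.99 of the probability mass, which is why it would pass a 90% confidence filter.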


### Testing Model

This section defines the get_predictions() and filtered_pred() helper functions.

In [6]:
def get_predictions(input_tweet, model, verbose=False, softmax=True, return_preds=False):

  '''Prints the highest probability prediction for input_tweet.
  If verbose is True, also prints the logits and the input tweet.
  If softmax is True, also prints the class probabilities.
  If return_preds is True, returns a score (1 = positive, 0 = neutral, -1 = negative) instead of printing the label.'''

  sentiments = ['positive', 'neutral', 'negative']
  inputs = tokenizer(input_tweet, return_tensors="tf") #Uses the global tokenizer defined above

  logits = model(**inputs).logits

  predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])

  if verbose:
    print(f'logits: {logits}')
    print(f'Input tweet: {input_tweet}')

  if softmax:
    probs = tf.nn.softmax(logits)
    print(f'Probabilities: {probs}')

  if return_preds:
    #Map class ids 0/1/2 to scores 1/0/-1
    return (predicted_class_id - 1) * -1
  else:
    print(f'Highest probability prediction: {sentiments[predicted_class_id]}')

In [16]:
def filtered_pred(input_tweet, model):

  '''Returns a score (1 = positive, 0 = neutral, -1 = negative) for the highest probability class,
  but only if that probability is at least 90%. Otherwise returns None; filter these out later.'''

  inputs = tokenizer(input_tweet, return_tensors="tf") #Uses the global tokenizer defined above

  logits = model(**inputs).logits
  probs = tf.nn.softmax(logits)

  #Skip low-confidence predictions
  if float(tf.reduce_max(probs)) < 0.90:
    return None

  predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
  #Map class ids 0/1/2 to scores 1/0/-1
  return (predicted_class_id - 1) * -1
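
The 90% filter and the `(predicted_class_id - 1) * -1` mapping can be checked without loading the model. The sketch below applies the same two rules to plain probability vectors with NumPy. `score_from_probs` is a hypothetical helper written for illustration, and the probability vectors are made up.

```python
import numpy as np

def score_from_probs(probs, threshold=0.90):
    """Return 1 / 0 / -1 for positive / neutral / negative,
    or None when the top probability is below the threshold."""
    probs = np.asarray(probs, dtype=np.float64)
    if probs.max() < threshold:
        return None
    class_id = int(probs.argmax())
    # Same mapping as (class_id - 1) * -1: ids 0/1/2 become scores 1/0/-1
    return 1 - class_id

# Made-up probability vectors in the order ['positive', 'neutral', 'negative']
examples = [
    [0.95, 0.03, 0.02],  # confident positive
    [0.02, 0.96, 0.02],  # confident neutral
    [0.01, 0.04, 0.95],  # confident negative
    [0.60, 0.30, 0.10],  # below the threshold, filtered out
]

scores = [score_from_probs(p) for p in examples]
print(scores)  # [1, 0, -1, None]
```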


### Getting tweets

In [7]:
!pip install tweepy==4.10

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tweepy==4.10
  Downloading tweepy-4.10.0-py3-none-any.whl (94 kB)
Collecting requests<3,>=2.27.0
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
Installing collected packages: requests, tweepy
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: tweepy
    Found existing installation: tweepy 3.10.0
    Uninstalling tweepy-3.10.0:
      Successfully uninstalled tweepy-3.10.0
Successfully installed requests-2.28.1 tweepy-4.10.0


In [8]:
import os
import tweepy
import numpy as np

In [9]:
#Tokens for the twitter API, be sure to replace with your own. Current access: Read Only

consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''

bearer_token = ''
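
One way to avoid pasting secrets into the notebook is to read them from environment variables. This is only a sketch: the `TWITTER_*` variable names and the `load_twitter_credentials` helper are assumptions made for illustration, not something the notebook or tweepy defines.

```python
import os

def load_twitter_credentials(env=None):
    """Read the five tokens from environment variables.

    Returns (creds, missing): a dict of values and a list of names that were not set.
    The TWITTER_* variable names are hypothetical; use whatever names you exported.
    """
    env = os.environ if env is None else env
    names = ['consumer_key', 'consumer_secret', 'access_token',
             'access_token_secret', 'bearer_token']
    creds = {name: env.get('TWITTER_' + name.upper(), '') for name in names}
    missing = [name for name, value in creds.items() if not value]
    return creds, missing

creds, missing = load_twitter_credentials()
if missing:
    print(f'Missing credentials: {missing}')
```

The returned dict uses the same names as the variables above, so `creds['consumer_key']` and the others can be passed to `tweepy.Client` in place of the hard-coded strings.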

In [10]:
#client = tweepy.Client(bearer_token)

client = tweepy.Client(
    consumer_key = consumer_key,
    consumer_secret = consumer_secret,
    access_token = access_token,
    access_token_secret = access_token_secret
)

In [11]:
#Look at the api documentation to find more info on capabilities

response = client.get_list_tweets('1561585052949282816', max_results=10, user_auth=True)
print(response.meta)

{'result_count': 97, 'next_token': '7140dibdnow9c7btw4232le1odqzylwttzi6b9jmy0e2g'}
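
The `next_token` in `response.meta` is how the API signals that more pages are available. The sketch below shows the pagination loop on its own, with a stand-in fetch function so it runs offline. `collect_tweets` and `fake_fetch` are hypothetical names, and the page contents are made up.

```python
from types import SimpleNamespace

def collect_tweets(fetch_page, max_pages=5):
    """Follow next_token until it runs out or max_pages is reached.

    fetch_page(pagination_token) must return an object with .data and .meta,
    like the response object printed above.
    """
    tweets, token = [], None
    for _ in range(max_pages):
        response = fetch_page(token)
        tweets.extend(response.data or [])
        token = (response.meta or {}).get('next_token')
        if token is None:
            break
    return tweets

# Stand-in for the API: two made-up pages keyed by the token that requests them
_pages = {
    None: SimpleNamespace(data=['t1', 't2'], meta={'result_count': 2, 'next_token': 'abc'}),
    'abc': SimpleNamespace(data=['t3'], meta={'result_count': 1}),
}

def fake_fetch(token):
    return _pages[token]

all_tweets = collect_tweets(fake_fetch)
print(all_tweets)  # ['t1', 't2', 't3']
```

With a live client, `fetch_page` could plausibly be `lambda token: client.get_list_tweets('1561585052949282816', max_results=100, pagination_token=token, user_auth=True)`. tweepy also provides `tweepy.Paginator` for the same purpose. Check both against the documentation for the installed tweepy version.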


In [17]:
tweets = response.data

sentiment_list = []

for tweet in tweets:
  out = filtered_pred(tweet.text, model=model)
  sentiment_list.append(out)
  print('===============')

tf.Tensor(0.8910456, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.8262472, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.55459845, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.63921905, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.8458203, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.7083158, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.7703504, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.7964092, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.49463502, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.5000442, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.6573041, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.67772615, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.6804699, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.76084787, shape=(), dtype=float32)
we're skipping this one
tf.Tensor(0.684

In [22]:
#filtered_pred returns None for low-confidence tweets, so this list comprehension removes them before averaging
np.mean([val for val in sentiment_list if val is not None])

#The mean is a coarse summary. Counting how often positive and negative predictions occur could give a more informative datapoint.

0.24
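
As a sketch of the frequency-based alternative mentioned in the comment, the scores can be counted per class alongside the mean. `sentiment_summary` is a hypothetical helper, and the example list is made up rather than taken from the run above.

```python
from collections import Counter

def sentiment_summary(sentiment_list):
    """Count positive / neutral / negative scores, ignoring filtered (None) entries."""
    kept = [val for val in sentiment_list if val is not None]
    counts = Counter(kept)
    n = len(kept)
    return {
        'n_kept': n,
        'n_filtered': len(sentiment_list) - n,
        'positive': counts[1],
        'neutral': counts[0],
        'negative': counts[-1],
        'mean': sum(kept) / n if n else None,  # None when every tweet was filtered out
    }

# Made-up scores in the same format filtered_pred returns
example = [1, None, 0, -1, 1, None, 1]
summary = sentiment_summary(example)
print(summary)
```

Reporting `n_filtered` next to the mean also shows how many tweets the confidence filter discarded, which matters when only a few predictions survive it.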