<a href="https://colab.research.google.com/github/xiaowei-v/HW4-/blob/main/HW4_huggingfacebert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis: Hugging Face Transformers and Application**


# Learning Objectives
This lecture gives a brief introduction to the basics of the Hugging Face transformers library and its application to sentiment analysis with pre-trained models. As an example, we use FinBERT, a pre-trained model for the financial domain, to carry out sentiment analysis on a dataset of financial news headlines. You are free to explore models from other domains for other purposes.

### 1. Basic Introduction to Hugging Face and the transformers library



*   Context learning: BERT embeddings
*   Getting familiar with transformers and pipeline

### 2. Example Sentiment Analysis on Financial News Dataset


*   Pre-processing
*   Sentiment analysis using pre-trained model











In [3]:
! pip install transformers 
! pip install torch torchvision

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.1
Looking in indexes: https://pypi.org/simple, https://us

In [4]:
! pip install pysentiment2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pysentiment2
  Downloading pysentiment2-0.1.1-py3-none-any.whl (1.9 MB)
Installing collected packages: pysentiment2
Successfully installed pysentiment2-0.1.1


In [1]:
import scipy 
print(scipy.__version__)

1.7.3


In [5]:
import transformers 
print(transformers.__version__)

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline
from scipy.special import softmax

import pandas as pd
import numpy as np

import torch
import torch.nn.functional as F

4.26.1


## 1. Introduction to basic operations

 * Pre-processing: Tokenization

In [6]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [58]:
sample_text = "Although the result is relatively good, it is far from satisfactory."

In [48]:
tokens = tokenizer.tokenize(sample_text)
tokens

['although',
 'the',
 'result',
 'is',
 'relatively',
 'good',
 ',',
 'it',
 'is',
 'far',
 'from',
 'satisfactory',
 '.']

In [49]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[2348,
 1996,
 2765,
 2003,
 4659,
 2204,
 1010,
 2009,
 2003,
 2521,
 2013,
 23045,
 1012]

In [25]:
tokenize_sample_text = tokenizer(sample_text, return_tensors = 'pt')
tokenize_sample_text

{'input_ids': tensor([[  101,  2348,  1996,  2765,  2003,  4659,  2204,  1010,  2009,  2003,
          2521,  2013, 23045,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [28]:
tokenize_sample_text = tokenizer(sample_text)
tokenize_sample_text

{'input_ids': [101, 2348, 1996, 2765, 2003, 4659, 2204, 1010, 2009, 2003, 2521, 2013, 23045, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
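
The three fields returned by the tokenizer are easiest to read on a batch whose sentences differ in length. The sketch below does not call the transformers library. It pads two id lists by hand, assuming a padding id of 0 (the `[PAD]` id in `bert-base-uncased`) and reusing ids that appear in the outputs above. The helper name `pad_batch` is hypothetical.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every id list to the longest length and build the matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, token_type_ids, attention_mask = [], [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # single-sentence inputs belong to segment 0 everywhere
        token_type_ids.append([0] * max_len)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

# "[CLS] is good [SEP]" and "[CLS] the result is good [SEP]"
batch = pad_batch([[101, 2003, 2204, 102],
                   [101, 1996, 2765, 2003, 2204, 102]])
print(batch["input_ids"][0])       # [101, 2003, 2204, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 1, 0, 0]
```

With the real tokenizer, passing a list of strings together with `padding=True` produces the same kind of padded batch.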

In [29]:
tokenize_sample_text['input_ids']

[101,
 2348,
 1996,
 2765,
 2003,
 4659,
 2204,
 1010,
 2009,
 2003,
 2521,
 2013,
 23045,
 1012,
 102]

In [32]:
print(tokenizer.convert_ids_to_tokens(tokenize_sample_text['input_ids']))

['[CLS]', 'although', 'the', 'result', 'is', 'relatively', 'good', ',', 'it', 'is', 'far', 'from', 'satisfactory', '.', '[SEP]']


In [30]:
tokenizer.decode(tokenize_sample_text['input_ids'])

'[CLS] although the result is relatively good, it is far from satisfactory. [SEP]'

 * Special tokens

In [43]:
tokenizer.cls_token, tokenizer.cls_token_id

('[CLS]', 101)

In [45]:
tokenizer.sep_token, tokenizer.sep_token_id

('[SEP]', 102)

In [46]:
tokenizer.unk_token, tokenizer.unk_token_id

('[UNK]', 100)
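
To make the role of the special tokens concrete, here is a minimal sketch of WordPiece-style tokenization, the greedy longest-match-first scheme that BERT tokenizers are based on. It is a toy, not the library's implementation: the vocabulary `toy_vocab` is invented for illustration, the helper names `wordpiece` and `encode` are hypothetical, and the real tokenizer additionally handles punctuation splitting and text normalization.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

# toy vocabulary, invented for this example
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "far", "from", "un", "##satisf", "##actory"}
print(encode("Far from unsatisfactory zzz", toy_vocab))
# ['[CLS]', 'far', 'from', 'un', '##satisf', '##actory', '[UNK]', '[SEP]']
```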



*   Output Probability



In [50]:
input_sample = tokenizer(sample_text, return_tensors = 'pt')
input_sample

{'input_ids': tensor([[  101,  2348,  1996,  2765,  2003,  4659,  2204,  1010,  2009,  2003,
          2521,  2013, 23045,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [51]:
classifier_sample = BertForSequenceClassification.from_pretrained('bert-base-uncased')


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [52]:
logit_sample = classifier_sample(**input_sample).logits  # call the model directly rather than .forward()
logit_sample

tensor([[0.2468, 0.0139]], grad_fn=<AddmmBackward0>)

In [53]:
output_prob_sample = softmax(logit_sample.detach().cpu().numpy(), axis=-1)  # normalize each row separately
output_prob_sample

array([[0.55796474, 0.44203523]], dtype=float32)

In [54]:
output_sample = classifier_sample(**input_sample)
pred_sample = F.softmax(output_sample.logits, dim=-1)
pred_sample

tensor([[0.5580, 0.4420]], grad_fn=<SoftmaxBackward0>)
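
The softmax step turns each row of logits into probabilities that sum to one. The sketch below is a plain-Python version, written only to show the arithmetic; the helper name `softmax_rows` is hypothetical. It normalizes every row on its own, which matters once several texts are classified in one batch.

```python
import math

def softmax_rows(logits):
    """Apply softmax to each row of a list of logit rows."""
    probs = []
    for row in logits:
        shift = max(row)  # subtracting the row maximum keeps exp() numerically stable
        exps = [math.exp(x - shift) for x in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

# logits copied from the output above
probs = softmax_rows([[0.2468, 0.0139]])
print([round(p, 4) for p in probs[0]])  # [0.558, 0.442]
```

Because the classification head of `bert-base-uncased` is randomly initialized, these probabilities carry no sentiment information yet; they only illustrate the conversion.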

* Comparison 


> Benchmark Model: LM Dictionary
The pysentiment2 library performs dictionary-based sentiment analysis. It provides two dictionaries: Harvard IV-4 for general text and the Loughran and McDonald Financial Sentiment Dictionary for financial text.




In [55]:
import pysentiment2 as ps
lm = ps.LM()

In [59]:
LM_tokens = lm.tokenize(sample_text)

# display the tokens of LM
LM_tokens

['result', 'rel', 'far', 'satisfactori']

In [60]:
lm.get_score(LM_tokens)

{'Positive': 1,
 'Negative': 0,
 'Polarity': 0.9999990000010001,
 'Subjectivity': 0.24999993750001562}
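
The scores above can be reproduced by hand. The sketch below assumes that polarity and subjectivity are computed from the positive and negative word counts with a small constant of 1e-6 in the denominators; this assumption matches the numbers printed above, but the function `lm_scores` is a hypothetical helper, not part of pysentiment2.

```python
def lm_scores(n_positive, n_negative, n_tokens, eps=1e-6):
    """Dictionary-based polarity and subjectivity from word counts."""
    # polarity: balance of positive against negative words, in [-1, 1]
    polarity = (n_positive - n_negative) / (n_positive + n_negative + eps)
    # subjectivity: share of tokens that carry any sentiment
    subjectivity = (n_positive + n_negative) / (n_tokens + eps)
    return {"Polarity": polarity, "Subjectivity": subjectivity}

# 1 positive word, 0 negative words, 4 tokens after LM tokenization
scores = lm_scores(1, 0, 4)
print(scores)
```

With a single positive word and no negative words, the dictionary method scores the sample sentence as almost fully positive, even though the sentence reads as a complaint.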

## 2. Sentiment Analysis Example: Financial News Sentiment Analysis

In this part we apply a BERT model to sentiment analysis on a financial news dataset from Kaggle. You can download the original dataset and inspect its features [here](https://www.kaggle.com/datasets/miguelaenlle/massive-stock-news-analysis-db-for-nlpbacktests/code).

To build a model for a specific domain (finance), we use the pre-trained model FinBERT, a financial domain-specific language model based on BERT and pre-trained on 4.9 billion tokens of financial text. Its goal is to enhance financial NLP research and practice. You can find detailed information and tutorials [here](https://github.com/yya518/FinBERT).

This is a simple example of a domain-specific pre-trained model. You are free to explore models for other fields; many such models and their datasets are available on the Hugging Face website: https://huggingface.co/models.

Our project follows these steps:


1.   Load and pre-process the dataset
2.   Load the FinBERT model and tokenizer
3.   Sentiment classification
4.   Apply the classifier to the dataset



**Step 1: Load the dataset and pre-process it**

In [61]:
from google.colab import files
upload = files.upload()

Saving kaggle.json to kaggle.json


In [63]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content"

In [64]:
! kaggle datasets download -d miguelaenlle/massive-stock-news-analysis-db-for-nlpbacktests --unzip

Downloading massive-stock-news-analysis-db-for-nlpbacktests.zip to /content
 91% 191M/210M [00:02<00:00, 96.6MB/s]
100% 210M/210M [00:02<00:00, 88.6MB/s]


In [65]:
# load in dataset as pandas dataframe
df = pd.read_csv('/content/analyst_ratings_processed.csv')
df.head()

   Unnamed: 0                                              title                       date  stock
0         0.0            Stocks That Hit 52-Week Highs On Friday  2020-06-05 10:30:00-04:00      A
1         1.0         Stocks That Hit 52-Week Highs On Wednesday  2020-06-03 10:45:00-04:00      A
2         2.0                      71 Biggest Movers From Friday  2020-05-26 04:30:00-04:00      A
3         3.0       46 Stocks Moving In Friday's Mid-Day Session  2020-05-22 12:45:00-04:00      A
4         4.0  B of A Securities Maintains Neutral on Agilent...  2020-05-22 11:38:00-04:00      A


In [73]:
df.isna().sum()

Unnamed: 0    1289
title            0
date          1289
stock         2578
dtype: int64

In [74]:
df.shape

# we drop all the null values directly because we have a rather large sample size
df.dropna(inplace=True)

In [75]:
df.shape

(1397891, 4)

In [76]:
df.isna().sum()

Unnamed: 0    0
title         0
date          0
stock         0
dtype: int64

**Step 2: Load the FinBERT Model and Tokenizer**

The FinBERT model consists of two modules:
* BertTokenizer: tokenizes the raw text input into word tokens
* BertForSequenceClassification: the FinBERT forward model that outputs the label probabilities


Download or load the pre-trained, fine-tuned model weights and instantiate the classifier for this task.

In [78]:
model_dir = 'yiyanghkust/finbert-tone'
token_dir = 'yiyanghkust/finbert-tone'

labels_map = {0:'neutral', 1:'positive', 2:'negative'}

In [79]:
# load the tokenizer (FinVocab)
finBERT_tokenizer = BertTokenizer.from_pretrained(token_dir)

# load the FinBERT model weights
fin_Bert_engine = BertForSequenceClassification.from_pretrained(model_dir, num_labels=3)

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/533 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/439M [00:00<?, ?B/s]

FinBERT Vocabulary


> FinBERT builds its own financial vocabulary, FinVocab, from the corpus of financial texts. FinVocab includes a substantial number of finance-specific terms that are rarely used in general text and are missing from BERT's BaseVocab.


> In total, the uncased FinVocab contains 30,873 tokens, similar in size to the original BERT vocabulary.








In [None]:
from collections import OrderedDict

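def get_top_players(data, n=30, order=False, reverse=False):
  '''
  Get the first n items of a dictionary after sorting by value
  (here: vocabulary tokens sorted by token id).
  Sorts ascending by default; set 'reverse' to True to sort descending.
  Returns a dictionary, or an 'OrderedDict' if 'order' is true.
  '''
  top = sorted(data.items(), key=lambda x: x[1], reverse=reverse)[:n]
  if order:
    return OrderedDict(top)
  return dict(top)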

In [None]:
# derive the FinBERT vocabulary
fin_vocab_dict = finBERT_tokenizer.get_vocab()

In [None]:
# show the 50 vocabulary tokens with the lowest token ids
get_top_players(fin_vocab_dict, n=50, reverse=False)

In [None]:
fin_Bert_engine.classifier

Linear(in_features=768, out_features=3, bias=True)

In [87]:
d = df.iloc[0, :].copy()  # copy the row so that adding a field does not trigger SettingWithCopyWarning

In [92]:
d['tokens'] = d[['title']].apply(finBERT_tokenizer.tokenize)


In [93]:
d

Unnamed: 0                                                  0.0
title                   Stocks That Hit 52-Week Highs On Friday
date                                  2020-06-05 10:30:00-04:00
stock                                                         A
tokens        title    [stocks, that, hit, 52, -, week, high...
Name: 0, dtype: object

**Step 3: Sentiment Classification**

* Example Input: "We also believe that there's generally way too much optimism in Techland with a recession very likely to hit next year and many of out favorite forward Tech spending indicators already heading south."

In [None]:
input_text = '''
We also believe that there's generally way too much optimism 
in Techland with a recession very likely to hit next year and 
many of out favorite forward Tech spending indicators already heading south.'''

print(input_text)


We also believe that there's generally way too much optimism 
in Techland with a recession very likely to hit next year and 
many of out favorite forward Tech spending indicators already heading south.


In [39]:
input_tensor = finBERT_tokenizer(input_text, return_tensors = 'pt', padding = True)
input_tensor

{'input_ids': tensor([[    3,    13,    67,   127,    15,   112,  5674,    58,   316,   788,
          1727,   406, 10950,    10,  4579,  4099,    20,    11,  5091,   190,
           419,     9,  2484,   165,    62,     8,   321,     7,   263, 16079,
           663,  4579,   741,  3719,   943,  4555,  1270,    48,     4]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [41]:
input_tensor = finBERT_tokenizer(input_text, return_tensors = 'pt', padding = True)
logit_tensor = fin_Bert_engine(**input_tensor).logits  # call the model directly rather than .forward()
logit_tensor

tensor([[-0.0680, -1.8548,  3.9033]], grad_fn=<AddmmBackward0>)

Convert the logits to softmax probabilities and display them as a pandas DataFrame. The negative class receives by far the highest probability.

In [43]:
output_prob = softmax(logit_tensor.detach().cpu().numpy(), axis=-1)  # normalize each row separately
pd.DataFrame(output_prob).rename(columns=labels_map)

    neutral  positive  negative
0  0.018442  0.003089  0.978469
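
To go from probabilities to a single predicted label, take the index of the largest probability and look it up in the label mapping. The following is a plain-Python sketch that uses the logits printed above and the same index order as `labels_map` (0 neutral, 1 positive, 2 negative). The helper name `predict_label` is hypothetical.

```python
import math

labels_map = {0: "neutral", 1: "positive", 2: "negative"}

def predict_label(logits):
    """Return the most probable label and its softmax probability."""
    shift = max(logits)  # for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return labels_map[best], probs[best]

label, score = predict_label([-0.0680, -1.8548, 3.9033])
print(label, round(score, 3))  # negative 0.978
```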


In [None]:
# we follow the previous logic to define a function that returns the class probabilities for each text
def SentimentAnalyzer(doc):
  '''
  Feed the input text to the model and get the class probabilities for the input text
  Input:
       a string: raw text that has not been pre-processed
  Returns a numpy array with the predicted probabilities, in the order given by labels_map
  '''
  input_tensor = finBERT_tokenizer(doc, padding=True, return_tensors="pt")
  with torch.no_grad():  # inference only, so no gradients are needed
    outputs = fin_Bert_engine(**input_tensor)
  predictions = F.softmax(outputs.logits, dim=-1)
  output_prob = predictions.detach().cpu().numpy()
  return output_prob

In [103]:
# a simpler way of going through the same process using the pipeline API
# build the pipeline once; creating it inside the function would repeat the setup on every call
finbert_pipe = pipeline("sentiment-analysis", model=fin_Bert_engine, tokenizer=finBERT_tokenizer)

def SentimentAnalyzer_pipe(doc):
  '''
  Feed the input text to the pipeline and get the classification for the input text
  Input:
       a string: raw text that has not been pre-processed
  Returns the corresponding label
  '''
  results = finbert_pipe(doc)
  return results[0]['label']
  

In [104]:
SentimentAnalyzer_pipe("We also believe that there's generally way too much optimism in Techland with a recession very likely to hit next year and many of out favorite forward Tech spending indicators already heading south.")

'Negative'

In [25]:
finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer)

sentences = ["there is a shortage of capital, and we need extra financing",  
             "growth is strong and we have plenty of liquidity", 
             "there are doubts about our finances", 
             "profits are flat"]
results = nlp(sentences)
print(results)  #LABEL_0: neutral; LABEL_1: positive; LABEL_2: negative

[{'label': 'Negative', 'score': 0.9966173768043518}, {'label': 'Positive', 'score': 1.0}, {'label': 'Negative', 'score': 0.9999710321426392}, {'label': 'Neutral', 'score': 0.9889442920684814}]


**Step 4: Apply the function to the dataframe to label each instance**

In [95]:
d = df.iloc[0:100, :].copy()  # work on a copy so that adding a column does not trigger SettingWithCopyWarning
d

    Unnamed: 0                                              title                       date  stock
0          0.0            Stocks That Hit 52-Week Highs On Friday  2020-06-05 10:30:00-04:00      A
1          1.0         Stocks That Hit 52-Week Highs On Wednesday  2020-06-03 10:45:00-04:00      A
2          2.0                      71 Biggest Movers From Friday  2020-05-26 04:30:00-04:00      A
3          3.0       46 Stocks Moving In Friday's Mid-Day Session  2020-05-22 12:45:00-04:00      A
4          4.0  B of A Securities Maintains Neutral on Agilent...  2020-05-22 11:38:00-04:00      A
..         ...                                                ...                        ...    ...
95        95.0  Barclays Maintains Equal-Weight on Agilent Tec...  2019-10-09 08:10:00-04:00      A
96        96.0  Shares of several healthcare companies are tra...  2019-10-08 10:36:00-04:00      A
97        97.0  Shares of several healthcare companies are tra...  2019-10-02 10:33:00-04:00      A
98        98.0  Shares of several healthcare companies are tra...  2019-09-05 15:34:00-04:00      A


In [105]:
d['label'] = d['title'].apply(SentimentAnalyzer_pipe)


In [106]:
d

    Unnamed: 0                                              title                       date  stock     label
0          0.0            Stocks That Hit 52-Week Highs On Friday  2020-06-05 10:30:00-04:00      A   Neutral
1          1.0         Stocks That Hit 52-Week Highs On Wednesday  2020-06-03 10:45:00-04:00      A   Neutral
2          2.0                      71 Biggest Movers From Friday  2020-05-26 04:30:00-04:00      A   Neutral
3          3.0       46 Stocks Moving In Friday's Mid-Day Session  2020-05-22 12:45:00-04:00      A   Neutral
4          4.0  B of A Securities Maintains Neutral on Agilent...  2020-05-22 11:38:00-04:00      A  Positive
..         ...                                                ...                        ...    ...       ...
95        95.0  Barclays Maintains Equal-Weight on Agilent Tec...  2019-10-09 08:10:00-04:00      A  Negative
96        96.0  Shares of several healthcare companies are tra...  2019-10-08 10:36:00-04:00      A  Negative
97        97.0  Shares of several healthcare companies are tra...  2019-10-02 10:33:00-04:00      A  Negative
98        98.0  Shares of several healthcare companies are tra...  2019-09-05 15:34:00-04:00      A  Positive
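
Labeling one title at a time becomes slow on the full dataset of about 1.4 million rows. One option, sketched below, is to pass the titles to the classifier in fixed-size batches. The helper `label_in_batches` and the stand-in `fake_classifier` are hypothetical; the stand-in exists only so that the sketch runs without downloading a model, and its labels are not FinBERT predictions.

```python
def label_in_batches(texts, classify_batch, batch_size=32):
    """Run `classify_batch` on consecutive slices of `texts` and collect the labels."""
    labels = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        labels.extend(classify_batch(chunk))
    return labels

# Hypothetical stand-in for a model: one label per input text.
def fake_classifier(chunk):
    return ["Positive" if "Highs" in text else "Neutral" for text in chunk]

titles = [
    "Stocks That Hit 52-Week Highs On Friday",
    "71 Biggest Movers From Friday",
    "46 Stocks Moving In Friday's Mid-Day Session",
]
labels = label_in_batches(titles, fake_classifier, batch_size=2)
print(labels)  # ['Positive', 'Neutral', 'Neutral']
```

With the real model, `classify_batch` could wrap the pipeline, for example `lambda chunk: [r['label'] for r in nlp(chunk)]`, since the pipeline accepts a list of strings and returns one result per string.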


In [112]:
df['title'].str.extract(r'(.*).(.*)')[1][0]

''
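
The empty string is a consequence of the pattern rather than of the data. The sketch below reproduces the behaviour with Python's `re` module on the first title. It assumes the intent was to split at a literal period, which the notebook does not state.

```python
import re

title = "Stocks That Hit 52-Week Highs On Friday"

# An unescaped '.' matches any character, so the greedy first group takes
# everything except the last character and the second group stays empty.
greedy = re.search(r"(.*).(.*)", title).groups()
print(greedy)  # ('Stocks That Hit 52-Week Highs On Frida', '')

# Escaping the dot requires a literal period; this title has none, so there is no match.
escaped = re.search(r"(.*)\.(.*)", title)
print(escaped)  # None
```

In pandas, `str.extract` returns `NaN` for rows where the pattern does not match, so the escaped pattern would yield missing values for titles without a period.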