<a href="https://colab.research.google.com/github/xiaowei-v/HW4-/blob/main/HW4_huggingfacebert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis: Hugging Face Transformers and Application**


# Learning Objectives
This lecture gives a brief introduction to the basics of the BERT model (Bidirectional Encoder Representations from Transformers) and its application to sentiment analysis using pre-trained models. As an example, we use the pre-trained FinBERT model to carry out sentiment analysis on a dataset of financial news headlines. You are free to explore models in other domains for different purposes.

### 1. Basic Introduction to Hugging Face and the transformers library



*   Contextual learning: BERT embeddings
*   Getting familiar with transformers and pipeline

### 2. Example Sentiment Analysis on Financial News Dataset


*   Pre-processing
*   Sentiment analysis using pre-trained model











In [None]:
! pip install transformers 
! pip install torch torchvision
! pip install pysentiment2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.1
Looking in indexes: https://pypi.org/simple, https://us

In [None]:
import scipy 
print(scipy.__version__)

import transformers 
print(transformers.__version__)

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline
from scipy.special import softmax

import pandas as pd
import numpy as np

import torch
import torch.nn.functional as F

1.7.3
4.26.1


## 1. Introduction to basic operations

*  Step 1: Preprocess (tokenize) the input data
*  Step 2: Load the pre-trained model
*  Step 3: Feed the data to the model to make a prediction



-----------------------------------

- Tokenization

In [None]:
sample_text = "Although the result is relatively good, it is not satisfactory."

In [None]:
# create a tokenizer from the pre-trained model we will use
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
# tokenize the text and check the keys in the result
tokenize_sample_text = tokenizer(sample_text, return_tensors = 'pt')
print(tokenize_sample_text.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
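
To make the three keys more concrete, here is a minimal pure-Python sketch of how a batch of two sequences of different lengths could be padded. The helper `pad_batch` is hypothetical (it is not part of `transformers`), the token IDs are the ones that appear in the outputs further below, and the padding ID 0 is assumed to be the `[PAD]` ID of `bert-base-uncased`.

```python
# Hypothetical helper, not a transformers API: pad ID lists to a common length.
PAD_ID = 0  # assumption: [PAD] has ID 0 in the bert-base-uncased vocabulary

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Return input_ids, token_type_ids and attention_mask for single-sentence inputs."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, token_type_ids, attention_mask = [], [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # all zeros because every input is a single sentence (first segment only)
        token_type_ids.append([0] * max_len)
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}

batch = pad_batch([
    [101, 2009, 2003, 2025, 2204, 1012, 102],  # [CLS] it is not good . [SEP]
    [101, 2009, 2003, 2204, 102],              # [CLS] it is good [SEP]
])
print(batch["attention_mask"])
```

In the notebook itself the tokenizer does this work; with a single sentence there is nothing to pad, which is why the attention mask shown later is all ones.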


In [None]:
# compare the tokens, their vocabulary IDs, and the input_ids the model receives
tokens = tokenizer.tokenize(sample_text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_id_asinput = tokenize_sample_text.get('input_ids')
matching = pd.DataFrame([tokens, token_ids],index = None).transpose()

In [None]:
print(token_id_asinput)
matching

tensor([[  101,  2348,  1996,  2765,  2003,  4659,  2204,  1010,  2009,  2003,
          2025, 23045,  1012,   102]])


Unnamed: 0,0,1
0,although,2348
1,the,1996
2,result,2765
3,is,2003
4,relatively,4659
5,good,2204
6,",",1010
7,it,2009
8,is,2003
9,not,2025


In [None]:
# see which tokens the IDs 101 and 102 correspond to
tokenize_sample_text = tokenizer(sample_text)
print([tokenizer.ids_to_tokens[x] for x in tokenize_sample_text['input_ids']])

['[CLS]', 'although', 'the', 'result', 'is', 'relatively', 'good', ',', 'it', 'is', 'not', 'satisfactory', '.', '[SEP]']


--------------------------------------------------

- Special Tokens 

In [None]:
tokenizer.cls_token, tokenizer.cls_token_id

('[CLS]', 101)

In [None]:
tokenizer.sep_token, tokenizer.sep_token_id

('[SEP]', 102)
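
The relationship between `tokenizer.tokenize`, `convert_tokens_to_ids`, and the special tokens can be sketched without the library. The snippet below is only an illustration: `toy_vocab` is a small dictionary copied from the token/ID pairs printed in this notebook (not the full BERT vocabulary), and `toy_encode` is a hypothetical helper.

```python
# Toy vocabulary copied from the token/ID pairs shown in this notebook.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "although": 2348, "the": 1996, "result": 2765, "is": 2003,
    "relatively": 4659, "good": 2204, ",": 1010, "it": 2009,
    "not": 2025, "satisfactory": 23045, ".": 1012,
}

def toy_encode(tokens, vocab):
    """Wrap the tokens in [CLS] ... [SEP] and map every token to its ID."""
    wrapped = ["[CLS]"] + list(tokens) + ["[SEP]"]
    return [vocab[token] for token in wrapped]

tokens = ["although", "the", "result", "is", "relatively", "good", ",",
          "it", "is", "not", "satisfactory", "."]
toy_ids = toy_encode(tokens, toy_vocab)
print(toy_ids)
```

The resulting list should match the `input_ids` tensor printed earlier: the 12 word-level IDs, with 101 added at the start and 102 at the end.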

-----------------



*   Output Probability



In [None]:
input_sample = tokenizer(sample_text, return_tensors = 'pt')
input_sample

{'input_ids': tensor([[  101,  2348,  1996,  2765,  2003,  4659,  2204,  1010,  2009,  2003,
          2025, 23045,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
classifier_sample = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
logit_sample = classifier_sample.forward(**input_sample).logits
logit_sample

tensor([[0.2205, 0.0364]], grad_fn=<AddmmBackward0>)

In [None]:
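# axis=-1 normalizes each row separately; scipy's default (axis=None) normalizes over the whole array
output_prob_sample = softmax(logit_sample.detach().cpu().numpy(), axis=-1)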
output_prob_sample

array([[0.5459147, 0.4540854]], dtype=float32)

In [None]:
output_sample = classifier_sample(**input_sample)
pred_sample = F.softmax(output_sample.logits, dim=-1)
pred_sample

tensor([[0.5459, 0.4541]], grad_fn=<SoftmaxBackward0>)
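
Both cells above turn logits into probabilities with a softmax over the last dimension. As a rough illustration of the arithmetic, the pure-Python sketch below applies a row-wise softmax to the logits printed above; `softmax_rows` is a hypothetical helper, and the second row of logits is made up.

```python
import math

def softmax_rows(logit_rows):
    """Apply a numerically stable softmax to each row (one row per input text)."""
    probs = []
    for row in logit_rows:
        row_max = max(row)  # subtracting the max avoids overflow in exp
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

# First row: the logits printed above. Second row: made-up logits for illustration.
probs = softmax_rows([[0.2205, 0.0364], [2.0, -1.0]])
print([[round(p, 4) for p in row] for row in probs])
```

Each row sums to 1 on its own, which is what `dim=-1` requests in `F.softmax`. Note that the loading message above reports weights that were not initialized from the checkpoint, so these particular probabilities should not be read as a meaningful sentiment prediction.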

## **2. Sentiment Analysis Example: Financial News Sentiment Analysis**

In this part, we apply a BERT model to sentiment analysis on a financial news dataset from Kaggle. You can download the original dataset and inspect its features [here](https://www.kaggle.com/datasets/miguelaenlle/massive-stock-news-analysis-db-for-nlpbacktests/code).

To build a model for a specific domain (finance), we use the pre-trained model FinBERT, a financial domain-specific language model based on BERT and pre-trained on 4.9 billion tokens of financial text. Its goal is to enhance financial NLP research and practice. You can find detailed information and tutorials [here](https://github.com/yya518/FinBERT).

This is a simple example of a domain-specific pre-trained model. You are free to explore other models for other fields. Many such models and their corresponding datasets can be found on the Hugging Face website: https://huggingface.co/models.

Our project follows these steps:


1.   Cleaning the dataset (drop null etc.)
2.   Import FinBERT Model
3.   Pre-processing --- Tokenization
4.   Output processing



**Step 1: Load the dataset and preprocess it**

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content"

In [None]:
! kaggle datasets download -d miguelaenlle/massive-stock-news-analysis-db-for-nlpbacktests

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.8/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.8/dist-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


In [None]:
# load in dataset as pandas dataframe
df = pd.read_csv('/content/analyst_ratings_processed.csv')
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title,date,stock
0,0,0.0,Stocks That Hit 52-Week Highs On Friday,2020-06-05 10:30:00-04:00,A
1,1,1.0,Stocks That Hit 52-Week Highs On Wednesday,2020-06-03 10:45:00-04:00,A
2,2,2.0,71 Biggest Movers From Friday,2020-05-26 04:30:00-04:00,A
3,3,3.0,46 Stocks Moving In Friday's Mid-Day Session,2020-05-22 12:45:00-04:00,A
4,4,4.0,B of A Securities Maintains Neutral on Agilent...,2020-05-22 11:38:00-04:00,A


In [None]:
df.isna().sum()

Unnamed: 0      0
Unnamed: 0.1    0
title           0
date            0
stock           0
dtype: int64

In [None]:
df.shape

# we drop all the null values directly because we have a rather large sample size
df.dropna(inplace=True)

In [None]:
df.shape

(500, 5)

In [None]:
df.isna().sum()

Unnamed: 0      0
Unnamed: 0.1    0
title           0
date            0
stock           0
dtype: int64

**Step 2: Launch the FinBERT Model to Implement Tokenization**

The FinBERT model consists of two modules:
* BertTokenizer: tokenizes the raw text input into word tokens
* BertForSequenceClassification: the FinBERT forward model that outputs the label probabilities


Download or load the pre-trained, fine-tuned model weights and instantiate the classifier for this task.

In [None]:
model_dir = 'yiyanghkust/finbert-tone'
token_dir = 'yiyanghkust/finbert-tone'

labels_map = {0:'neutral', 1:'positive', 2:'negative'}

In [None]:
#load the tokenizer(FinVocab)
finBERT_tokenizer = BertTokenizer.from_pretrained(token_dir)

#load the FinBERT model weight 
fin_Bert_engine = BertForSequenceClassification.from_pretrained(model_dir, num_labels = 3)


**Step 3: Sentiment Classification**

* Example Input: "We also believe that there's generally way too much optimism in Techland with a recession very likely to hit next year and many of our favorite forward Tech spending indicators already heading south."

Convert the logits to soft probabilities (for example, the probability of the negative class), which can then be stored in a pandas DataFrame.

In [None]:
nlp = pipeline("sentiment-analysis", model=fin_Bert_engine, tokenizer=finBERT_tokenizer)

sentences = [df.iloc[0, :]['title']]
results = nlp(sentences)
print(results)  #LABEL_0: neutral; LABEL_1: positive; LABEL_2: negative

[{'label': 'Neutral', 'score': 0.999387264251709}]
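
Conceptually, the pipeline output above is a softmax over the model's logits followed by picking the most probable class and looking up its label. The sketch below mimics that post-processing in pure Python. It is an illustration only: `logits_to_prediction` is a hypothetical helper, the logits are made up rather than produced by FinBERT, and the label spelling in `id2label` is assumed from the output above and the index order in the code comment.

```python
import math

# Assumption: index order as in the comment above (0 neutral, 1 positive, 2 negative).
id2label = {0: "Neutral", 1: "Positive", 2: "Negative"}

def logits_to_prediction(logits, id2label):
    """Turn one row of logits into a pipeline-style {'label', 'score'} dict."""
    row_max = max(logits)
    exps = [math.exp(x - row_max) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    return {"label": id2label[best], "score": probs[best]}

# Made-up logits for illustration only; they are not FinBERT outputs.
example = logits_to_prediction([4.0, -2.0, -3.5], id2label)
print(example["label"])
```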


In [None]:
# a simpler way of going through the same process using the pipeline method
def SentimentAnalyzer_pipe(doc):
  '''
  Feed the input text to the model and return its predicted sentiment label.
  Input:
       doc: a raw (unprocessed) string
  Returns:
       the predicted label as a string, e.g. 'Neutral' or 'Positive'
  '''
  # reuse the `nlp` pipeline created in the previous cell instead of rebuilding it on every call
  results = nlp(doc)
  return results[0]['label']
  

**Step 4: Apply the function to the dataframe to label each instance**

In [None]:
df['label'] = df['title'].apply(SentimentAnalyzer_pipe)
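
Calling the model once per row works for 500 headlines but can be slow on larger data. Since the pipeline also accepts a list of strings (as in Step 3), one possible alternative is to label the titles in chunks. The sketch below is a hypothetical helper, not part of `transformers`; `fake_classify` and the headlines are made-up stand-ins so that it runs without downloading a model.

```python
def label_in_batches(texts, classify, batch_size=32):
    """Call `classify` on slices of `texts` and collect the predicted labels in order."""
    labels = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        labels.extend(result["label"] for result in classify(chunk))
    return labels

# Stand-in for the real pipeline: returns the same list-of-dicts shape as the output in Step 3.
def fake_classify(chunk):
    return [{"label": "Positive" if "Buy" in text else "Neutral", "score": 1.0}
            for text in chunk]

# Made-up headlines for illustration.
headlines = ["Biggest Movers From Friday",
             "Analyst Maintains Buy Rating",
             "Stocks Moving In The Mid-Day Session"]
demo_labels = label_in_batches(headlines, fake_classify, batch_size=2)
print(demo_labels)
```

In the notebook, something like `df['label'] = label_in_batches(df['title'].tolist(), nlp)` would play the same role as the `apply` call above, assuming `nlp` is the pipeline created in Step 3.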

In [None]:
df

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title,date,stock,label
0,0,0.0,Stocks That Hit 52-Week Highs On Friday,2020-06-05 10:30:00-04:00,A,Neutral
1,1,1.0,Stocks That Hit 52-Week Highs On Wednesday,2020-06-03 10:45:00-04:00,A,Neutral
2,2,2.0,71 Biggest Movers From Friday,2020-05-26 04:30:00-04:00,A,Neutral
3,3,3.0,46 Stocks Moving In Friday's Mid-Day Session,2020-05-22 12:45:00-04:00,A,Neutral
4,4,4.0,B of A Securities Maintains Neutral on Agilent...,2020-05-22 11:38:00-04:00,A,Positive
...,...,...,...,...,...,...
495,495,501.0,Benzinga's Top #PreMarket Gainers,2013-11-15 08:16:00-05:00,A,Positive
496,496,502.0,UPDATE: Citigroup Reiterates on Agilent Techno...,2013-11-15 08:07:00-05:00,A,Positive
497,497,503.0,Citigroup Maintains Buy on Agilent Technologie...,2013-11-15 07:42:00-05:00,A,Positive
498,498,504.0,"Agilent Technologies, Inc. Sees Q1 EPS $0.65-0...",2013-11-14 16:06:00-05:00,A,Neutral


## Comparison
Benchmark model: the Loughran and McDonald (LM) dictionary. The `pysentiment2` library performs dictionary-based sentiment analysis. It provides two dictionaries: Harvard IV-4 for general sentiment analysis and the Loughran and McDonald Financial Sentiment Dictionaries for financial sentiment analysis.

In [None]:
compare_text = "sss"

In [None]:
import pysentiment2 as ps
lm = ps.LM()

LM_tokens = lm.tokenize(compare_text)

# display the tokens of LM
LM_tokens

['sss']

In [None]:
lm.get_score(LM_tokens)

{'Positive': 0, 'Negative': 0, 'Polarity': 0.0, 'Subjectivity': 0.0}
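
To see how a dictionary-based score of this shape can arise, here is a toy sketch. It assumes the scores are simple count-based ratios, Polarity = (Pos - Neg) / (Pos + Neg) and Subjectivity = (Pos + Neg) / number of tokens, with a small constant in the denominators to avoid division by zero. The word lists and `toy_dictionary_score` are made up for illustration and are not the library's implementation or the real LM dictionary.

```python
EPSILON = 1e-6  # assumption: small constant that guards against division by zero

# Tiny made-up word lists; the real LM dictionary is far larger.
POSITIVE_WORDS = {"gain", "strong", "improve"}
NEGATIVE_WORDS = {"loss", "weak", "decline"}

def toy_dictionary_score(tokens):
    """Count positive and negative tokens and derive polarity and subjectivity ratios."""
    pos = sum(1 for token in tokens if token in POSITIVE_WORDS)
    neg = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    polarity = (pos - neg) / (pos + neg + EPSILON)
    subjectivity = (pos + neg) / (len(tokens) + EPSILON)
    return {"Positive": pos, "Negative": neg,
            "Polarity": polarity, "Subjectivity": subjectivity}

toy_score = toy_dictionary_score(["strong", "gain", "despite", "weak", "demand"])
empty_score = toy_dictionary_score(["sss"])  # no dictionary words, as in the cell above
print(toy_score)
print(empty_score)
```

A text with no dictionary words, such as the placeholder `"sss"`, gets zero counts and therefore zero polarity and subjectivity, which is consistent with the output above.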