<a href="https://colab.research.google.com/github/xiaowei-v/HW4-/blob/main/14_Sentiment_Analysis_Using_BERT_Xiaowei_Guo__Chris_Zhang.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 <font size="9">**Sentiment Analysis:**</font> 

 <font size="9">**BERT Model, Huggingface and Application**</font> 

<br/><br/> 

# Learning Objectives
This lecture gives a brief introduction to the basics of the BERT model (Bidirectional Encoder Representations from Transformers) and its application to sentiment analysis using pre-trained models. As an example, we use the pre-trained FinBERT model to carry out sentiment analysis on a dataset of financial news headlines. You are free to explore models in other domains for other purposes.

###  Follow These Steps:


*   Fetch all the materials
*   Open the link to Google Colab
*   Click on Files on the left and upload the dataset csv file which is used for sentiment analysis


### 1. Basic Introduction to BERT and Transformer Library

*   What is BERT:


> We introduce a new **language representation model** called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the **pre-trained** BERT model can be **fine-tuned with just one additional output layer** to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. Refer to the original article for further information: https://arxiv.org/abs/1810.04805

*   Getting familiar with transformers and pipeline


> Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. 

> Basic logic of the pipeline: 
raw text -> representation -> pre-trained model -> output layer according to specific task


### 2. Application: Sentiment Analysis on Financial News Dataset


*   Pre-processing
*   Sentiment analysis following the pipeline


In [None]:
# Run the following command to install required packages
#! pip install transformers 
#! pip install torch torchvision
#! pip install pysentiment2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.1 tokenizers-0.13.2 transformers-4.26.1
Looking in indexes: https://pypi.org/simple, https://us

In [None]:
# import all the libraries
import transformers 
print(transformers.__version__)

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

import pandas as pd
import numpy as np

import torch
import torch.nn.functional as F

1.7.3
4.26.1


## **1. Introduction to basic operations**

*  Step 1: Preprocess (tokenize) the input data 
*  Step 2: Import the pre-trained model 
*  Step 3: Feed the data to the model to make a prediction  



-----------------------------------

- Tokenization

In [None]:
sample_text = "Although the result is relatively good, it is not satisfactory."

In [None]:
# create the tokenizer that matches the pre-trained model we will use
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

<b>The tokenizer files we downloaded are relatively small (the vocabulary is only about 232 KB), but the pre-trained model is much larger, as we will see later.</b>

In [None]:
# tokenize the text and check the keys in the result
tokenize_sample_text = tokenizer(sample_text, return_tensors = 'pt')
print(tokenize_sample_text.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])


In [None]:
# see the full tokenization 
tokenize_sample_text = tokenizer(sample_text)
token_id_asinput = tokenize_sample_text.get('input_ids')
print('input format: ', tokenize_sample_text, '\n')
print('tokenized text: ',[tokenizer.ids_to_tokens[x] for x in tokenize_sample_text['input_ids']],'\n')
print('token id: ', token_id_asinput)

input format:  {'input_ids': [101, 2348, 1996, 2765, 2003, 4659, 2204, 1010, 2009, 2003, 2025, 23045, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} 

tokenized text:  ['[CLS]', 'although', 'the', 'result', 'is', 'relatively', 'good', ',', 'it', 'is', 'not', 'satisfactory', '.', '[SEP]'] 

token id:  [101, 2348, 1996, 2765, 2003, 4659, 2204, 1010, 2009, 2003, 2025, 23045, 1012, 102]
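
To make the mapping between tokens and ids concrete, here is a minimal sketch that rebuilds the lookup for the sample sentence with a tiny hand-made vocabulary. The ids are copied from the output above; `toy_vocab`, `toy_encode` and `toy_decode` are hypothetical names used only for illustration and are not part of the transformers library. The real tokenizer also lowercases the text and splits unknown words into sub-word pieces, which this sketch ignores.

```python
# Toy vocabulary: only the entries needed for the sample sentence,
# with ids copied from the bert-base-uncased output shown above.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102, "although": 2348, "the": 1996,
    "result": 2765, "is": 2003, "relatively": 4659, "good": 2204,
    ",": 1010, "it": 2009, "not": 2025, "satisfactory": 23045, ".": 1012,
}
toy_ids_to_tokens = {idx: tok for tok, idx in toy_vocab.items()}


def toy_encode(tokens):
    """Wrap already-split tokens in [CLS] ... [SEP] and map them to ids."""
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]
    return [toy_vocab[tok] for tok in wrapped]


def toy_decode(ids):
    """Map ids back to their token strings."""
    return [toy_ids_to_tokens[i] for i in ids]


tokens = ["although", "the", "result", "is", "relatively", "good", ",",
          "it", "is", "not", "satisfactory", "."]
ids = toy_encode(tokens)
print(ids)
print(toy_decode(ids))
```

The round trip shows that `input_ids` is nothing more than a list of vocabulary indices with the two special tokens added at the ends.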


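**input_ids:** the integer id assigned to each token; it is a vocabulary index, not an embedding vector \
**token_type_ids:** the segment each token belongs to; all 0 here because the input is a single sentence \
**attention_mask:** 1 marks a real token; 0 marks a padding token that the tokenizer adds so that all sequences in a batch have the same length 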
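
The sample sentence has no padding, so its attention mask is all ones. The sketch below shows, under simplifying assumptions, what happens when two sequences of different length are batched together. `pad_batch` is a hypothetical helper written for this illustration (the real tokenizer does this for you when padding is enabled), and the pad id of 0 is the `[PAD]` id of `bert-base-uncased`.

```python
PAD_ID = 0  # assumption: id of the [PAD] token (0 in bert-base-uncased)


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every id list to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The full sample sentence and a shorter one ("it is good."), both with
# [CLS]=101 and [SEP]=102 added; the ids are taken from the output above.
long_ids = [101, 2348, 1996, 2765, 2003, 4659, 2204, 1010,
            2009, 2003, 2025, 23045, 1012, 102]
short_ids = [101, 2009, 2003, 2204, 1012, 102]

batch = pad_batch([long_ids, short_ids])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sequence is filled with pad ids, and the zeros in its mask tell the model to ignore those positions.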

--------------------------------------------------



*   Output Probability



In [None]:
# import the model 
classifier_sample = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at


<b>The model we imported is large, around 440 MB, which can be impractical to download and load in a local Jupyter notebook, so we use Colab.</b>

In [None]:
# pass the input into the classifier
# the model expects PyTorch tensors, so tokenize again with return_tensors='pt'
model_inputs = tokenizer(sample_text, return_tensors='pt')
return_result = classifier_sample(**model_inputs)

# get the logits score which is the raw score for each category before normalization
logit_sample = return_result.logits

print('result format: ', return_result)
print('logit part: ', logit_sample)

result format:  SequenceClassifierOutput(loss=None, logits=tensor([[0.1843, 0.1328]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
logit part:  tensor([[0.1843, 0.1328]], grad_fn=<AddmmBackward0>)


What are logits:

> The vector of raw (non-normalized) predictions that the last layer of the model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.



<b>For more information about the result of the .forward() function, including loss, grad_fn, etc.:</b> https://huggingface.co/transformers/v3.0.2/model_doc/bert.html

In [None]:
# use the softmax function to normalize the raw scores
# returns the probability for each label
pred_sample = F.softmax(logit_sample, dim=-1)
print('Probability distribution: ', pred_sample.detach().numpy())

Probability distribution:  [[0.51289326 0.4871068 ]]
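
As a sanity check on the softmax step, the same normalization can be reproduced with the standard library alone. This is a minimal sketch that uses the two rounded logit values printed earlier, so the result should agree with the probabilities above only up to rounding (roughly 0.513 and 0.487).

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.1843, 0.1328]  # rounded logits printed for the sample sentence
probs = softmax(logits)
print([round(p, 4) for p in probs])
```

Because the two logits are almost equal, the two probabilities are both close to 0.5.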


For details in model output: https://huggingface.co/docs/transformers/main_classes/output

## **2. Sentiment Analysis Example: Financial News Sentiment Analysis**

In this part we practice applying a BERT model to sentiment analysis on a financial news dataset from Kaggle. Because of time limits, we use only part of the original dataset. You may download the original dataset and check the features [here](https://www.kaggle.com/datasets/miguelaenlle/massive-stock-news-analysis-db-for-nlpbacktests/code).

To build a model for a specific domain (finance), we use the pre-trained model FinBERT, a financial domain-specific language model based on BERT that was pre-trained on financial communication text totalling 4.9 billion tokens. The goal is to enhance financial NLP research and practice. You may find detailed information and tutorials [here](https://github.com/yya518/FinBERT).

This is a simple example of a domain-specific pre-trained model. You are free to explore models for other fields. Many such models and corresponding datasets can be found on the Hugging Face website: https://huggingface.co/models.

Our project follows these steps:


1.   Cleaning the dataset (drop null etc.)
2.   Import FinBERT Model
3.   Pipeline and function for sentiment detection
4.   Apply it to the dataframe

_____________________________________________________________________________________________

**Step 1: Load the dataset and preprocess it**

In [None]:
# load in dataset as pandas dataframe
df = pd.read_csv('/content/analyst_ratings_processed.csv')
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title,date,stock
0,0,0.0,Stocks That Hit 52-Week Highs On Friday,2020-06-05 10:30:00-04:00,A
1,1,1.0,Stocks That Hit 52-Week Highs On Wednesday,2020-06-03 10:45:00-04:00,A
2,2,2.0,71 Biggest Movers From Friday,2020-05-26 04:30:00-04:00,A
3,3,3.0,46 Stocks Moving In Friday's Mid-Day Session,2020-05-22 12:45:00-04:00,A
4,4,4.0,B of A Securities Maintains Neutral on Agilent...,2020-05-22 11:38:00-04:00,A


In [None]:
# check na values
df.isna().sum()

Unnamed: 0      0
Unnamed: 0.1    0
title           0
date            0
stock           0
dtype: int64

In [None]:
# check data shape
df.shape

# we drop all the null values directly because we have a rather large sample size
df.dropna(inplace=True)

In [None]:
# check null values again
df.isna().sum()

Unnamed: 0      0
Unnamed: 0.1    0
title           0
date            0
stock           0
dtype: int64

**Step 2: Launch the FinBERT Model to Implement Tokenization**

The FinBERT model consists of two modules:
* BertTokenizer: tokenize the raw text input into word tokens
* BertForSequenceClassification: the FinBERT forward model that outputs the label probabilities 


Download/load the pretrained/fine-tuned model weights and instantiate the classifier for this task

In [None]:
# define the directory of the pre-trained model we use
model_dir = 'yiyanghkust/finbert-tone'
token_dir = 'yiyanghkust/finbert-tone'

In [None]:
#load the tokenizer(FinVocab)
finBERT_tokenizer = BertTokenizer.from_pretrained(token_dir)

#load the FinBERT model weight 
fin_Bert_engine = BertForSequenceClassification.from_pretrained(model_dir, num_labels = 3)

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/533 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/439M [00:00<?, ?B/s]

**Step 3: Sentiment Classification**


Use the pipeline abstraction to apply the pre-trained model to a specific task

> The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the task summary for examples of use. Refer to https://huggingface.co/docs/transformers/main_classes/pipelines for details.



Example:
* Example Input: "We also believe that there's generally way too much optimism in Techland with a recession very likely to hit next year and many of out favorite forward Tech spending indicators already heading south."

In [None]:
# use the pipeline function to run all of the previous steps in one call
# initialize the pipeline: specify the task, the model and the tokenizer
nlp = pipeline("sentiment-analysis", model=fin_Bert_engine, tokenizer=finBERT_tokenizer)

# the input raw text
sentences = ['''
We also believe that there's generally way too much optimism in Techland 
with a recession very likely to hit next year and many of out favorite 
forward Tech spending indicators already heading south.''']

# output a list with one dictionary (label and score) per input sentence
results = nlp(sentences)
print(results)  # LABEL_0: neutral; LABEL_1: positive; LABEL_2: negative

[{'label': 'Neutral', 'score': 0.999387264251709}]
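
The last step inside the pipeline can be sketched in a few lines: apply softmax to the three logits, pick the most probable class and translate its index into a label name. This is an illustrative sketch rather than the library's actual code. The label order follows the comment in the cell above, `to_prediction` is a hypothetical helper, and the logits are made-up values.

```python
import math

# Label order as noted in the cell above: 0 = neutral, 1 = positive, 2 = negative.
ID2LABEL = {0: "Neutral", 1: "Positive", 2: "Negative"}


def to_prediction(logits, id2label=ID2LABEL):
    """Turn one row of logits into a {'label', 'score'} dictionary."""
    shift = max(logits)  # for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
    return {"label": id2label[best], "score": probs[best]}


hypothetical_logits = [4.0, -2.5, -3.0]  # made-up values, for illustration only
prediction = to_prediction(hypothetical_logits)
print(prediction["label"], round(prediction["score"], 4))
```

The returned dictionary has the same two keys as each element of the pipeline output, which is why the score can be read as the probability of the winning label.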


In [None]:
# define a function to carry out sentiment analysis using the pipeline
# build the pipeline once, outside the function, so it is not re-created for every row
sentiment_pipe = pipeline("sentiment-analysis", model=fin_Bert_engine, tokenizer=finBERT_tokenizer)

def SentimentAnalyzer_pipe(doc):
  '''
  Feed the input text to the model and get the classification for the input text
  Input:
       a string: raw text that has not been pre-processed
  Returns the corresponding label
  '''
  results = sentiment_pipe(doc)
  return results[0]['label']
  

**Step 4: Apply the function to the dataframe to label each instance**

In [None]:
# use pandas apply method to apply the sentiment analyzer function to the 'title' column
# label each row with the sentiment by adding the 'label' column
df['label'] = df['title'].apply(SentimentAnalyzer_pipe)
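
Calling the model once per row can be slow on a large dataframe. One possible alternative, sketched here under assumptions, is to send the headlines to the classifier in batches, since a pipeline also accepts a list of strings. `batched`, `label_in_batches` and `fake_classifier` are hypothetical helpers; the stand-in classifier only exists so that the sketch runs without downloading a model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def label_in_batches(texts, classify, batch_size=32):
    """Apply `classify` (list of str -> list of {'label': ...}) batch by batch."""
    labels = []
    for chunk in batched(texts, batch_size):
        labels.extend(result["label"] for result in classify(chunk))
    return labels


def fake_classifier(chunk):
    """Hypothetical stand-in for the pipeline object so the sketch runs offline."""
    return [{"label": "Neutral", "score": 1.0} for _ in chunk]


headlines = [
    "Stocks That Hit 52-Week Highs On Friday",
    "71 Biggest Movers From Friday",
    "46 Stocks Moving In Friday's Mid-Day Session",
]
labels = label_in_batches(headlines, fake_classifier, batch_size=2)
print(labels)
```

In the notebook, the `nlp` pipeline object created in Step 3 could be passed as `classify`, for example `df['label'] = label_in_batches(df['title'].tolist(), nlp)`. A suitable batch size depends on the available memory.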

In [None]:
# check the labels the function generates
df

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title,date,stock,label
0,0,0.0,Stocks That Hit 52-Week Highs On Friday,2020-06-05 10:30:00-04:00,A,Neutral
1,1,1.0,Stocks That Hit 52-Week Highs On Wednesday,2020-06-03 10:45:00-04:00,A,Neutral
2,2,2.0,71 Biggest Movers From Friday,2020-05-26 04:30:00-04:00,A,Neutral
3,3,3.0,46 Stocks Moving In Friday's Mid-Day Session,2020-05-22 12:45:00-04:00,A,Neutral
4,4,4.0,B of A Securities Maintains Neutral on Agilent...,2020-05-22 11:38:00-04:00,A,Positive
...,...,...,...,...,...,...
495,495,501.0,Benzinga's Top #PreMarket Gainers,2013-11-15 08:16:00-05:00,A,Positive
496,496,502.0,UPDATE: Citigroup Reiterates on Agilent Techno...,2013-11-15 08:07:00-05:00,A,Positive
497,497,503.0,Citigroup Maintains Buy on Agilent Technologie...,2013-11-15 07:42:00-05:00,A,Positive
498,498,504.0,"Agilent Technologies, Inc. Sees Q1 EPS $0.65-0...",2013-11-14 16:06:00-05:00,A,Neutral
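
A quick way to inspect the result is to count how often each label occurs. The sketch below does this with the standard library for the nine rows visible in the truncated table above, so the numbers describe only those rows and not the whole dataset.

```python
from collections import Counter

# Labels of the nine rows visible in the truncated table above (not the full column).
visible_labels = ["Neutral", "Neutral", "Neutral", "Neutral", "Positive",
                  "Positive", "Positive", "Positive", "Neutral"]

counts = Counter(visible_labels)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.0%})")
```

On the full dataframe, `df['label'].value_counts()` gives the same kind of summary.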
