# Sentiment Analysis
Sentiment analysis (https://en.wikipedia.org/wiki/Sentiment_analysis) is the use of natural language processing to identify, extract, quantify, and study subjective information. A common application is classifying online and social media posts as positive, negative, or neutral.

In this notebook we assess three models for sentiment analysis:
- VADER
- FinBERT
- XLNet


# VADER
The VADER model (Valence Aware Dictionary and sEntiment Reasoner) is well suited to scoring the sentiment of comments as positive or negative. It is a rule-based NLP model built on a sentiment lexicon tuned for social media text, so it doesn't require re-training. It is sufficient to provide a curated list of positive and negative word labels, and a small sample of labeled data is enough to test the effectiveness of the model.
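
As a minimal sketch of how VADER might be applied, assuming the third-party `vaderSentiment` package is installed: its `SentimentIntensityAnalyzer.polarity_scores` method returns a dictionary with `neg`, `neu`, `pos`, and `compound` entries. The ±0.05 cutoff on the compound score is the threshold conventionally suggested for VADER, not a value tuned for this project, and the helper names `label_from_compound` and `vader_sentiment` are illustrative.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def vader_sentiment(text):
    """Score text with VADER and return (label, scores).

    Assumes `pip install vaderSentiment`. The import is done lazily so the
    thresholding helper above can be used without the package installed.
    """
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(text)  # keys: neg, neu, pos, compound
    return label_from_compound(scores["compound"]), scores
```

Calling `vader_sentiment("some comment")` would return the label together with the raw scores, which makes it easy to compare against a small labeled sample.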

Since VADER is not a deep learning model, we won't explore it further here.


# FinBERT
FinBERT (https://github.com/ProsusAI/finBERT) is an NLP model for financial sentiment analysis based on BERT. The model is pre-trained and fine-tuned on financial text, and is described in detail here: https://arxiv.org/pdf/1908.10063.pdf. General-purpose pre-trained NLP models are often inadequate because of the specialized language and terms used in the financial domain. FinBERT addresses this problem by training on datasets specific to the financial context.

The following code shows prediction examples with FinBERT.


In [2]:
import sys
sys.path.append("../../finBERT")

from finbert.finbert import predict
from transformers import AutoModelForSequenceClassification

In [6]:
import os

text = """\
Shares in the spin-off of South African e-commerce group Naspers surged more than 25% in the first minutes of their market debut in Amsterdam on Wednesday.

Bob van Dijk, CEO of Naspers and Prosus Group poses at Amsterdam's stock exchange, as Prosus begins trading on the Euronext stock exchange in Amsterdam, Netherlands, September 11, 2019. REUTERS/Piroschka van de Wouw
Prosus comprises Naspers’ global empire of consumer internet assets, with the jewel in the crown a 31% stake in Chinese tech titan Tencent.

There is "way more demand than is even available, so that’s good," said the CEO of Euronext Amsterdam, Maurice van Tilburg. "It’s going to be an interesting hour of trade after opening this morning."

Euronext had given an indicative price of 58.70 euros per share for Prosus, implying a market value of 95.3 billion euros ($105 billion).

The shares jumped to 76 euros on opening and were trading at 75 euros at 0719 GMT.
"""

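# Load the fine-tuned FinBERT classifier (3 classes) from the local checkout
model = AutoModelForSequenceClassification.from_pretrained(
    '../../finBERT/models/classifier_model/finbert-sentiment',
    num_labels=3,
    cache_dir=None,
)

# Classify the text sentence by sentence and save the results as CSV
output = "predictions.csv"
predict(text, model, write_to_csv=True, path=os.path.join('.', output))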

08/01/2021 20:05:00 - INFO - finbert.utils -   *** Example ***
08/01/2021 20:05:00 - INFO - finbert.utils -   guid: 0
08/01/2021 20:05:00 - INFO - finbert.utils -   tokens: [CLS] shares in the spin - off of south african e - commerce group nas ##pers surged more than 25 % in the first minutes of their market debut in amsterdam on wednesday . [SEP]
08/01/2021 20:05:00 - INFO - finbert.utils -   input_ids: 101 6661 1999 1996 6714 1011 2125 1997 2148 3060 1041 1011 6236 2177 17235 7347 18852 2062 2084 2423 1003 1999 1996 2034 2781 1997 2037 3006 2834 1999 7598 2006 9317 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/01/2021 20:05:00 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/01/2021 20:05:00 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

| | sentence | logit | prediction | sentiment_score |
|---|---|---|---|---|
| 0 | Shares in the spin-off of South African e-comm... | [0.595314, 0.23879628, 0.16588964] | positive | 0.356518 |
| 1 | Bob van Dijk, CEO of Naspers and Prosus Group ... | [0.37831756, 0.2276785, 0.3940039] | neutral | 0.150639 |
| 2 | REUTERS/Piroschka van de Wouw\nProsus comprise... | [0.58364475, 0.17245562, 0.24389963] | positive | 0.411189 |
| 3 | There is "way more demand than is even availab... | [0.48395312, 0.19740734, 0.31863958] | positive | 0.286546 |
| 4 | "It’s going to be an interesting hour of trade... | [0.5186951, 0.25127164, 0.23003323] | positive | 0.267423 |
| 5 | Euronext had given an indicative price of 58.7... | [0.60320425, 0.1752505, 0.22154525] | positive | 0.427954 |
| 6 | The shares jumped to 76 euros on opening and w... | [0.609923, 0.19815238, 0.1919246] | positive | 0.411771 |
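
To help read this output: the three values in each `logit` entry sum to 1, so they behave like class probabilities. The `prediction` column matches the largest value if the order is taken to be [positive, negative, neutral], and `sentiment_score` matches the first value minus the second (for row 0, 0.595314 - 0.23879628 ≈ 0.356518). The sketch below reproduces both columns under that assumed label order, which is inferred from the numbers above rather than taken from the FinBERT source. The names `LABELS` and `summarize` are illustrative.

```python
# Assumed label order, inferred from the output table (not read from FinBERT).
LABELS = ["positive", "negative", "neutral"]


def summarize(probs):
    """Return (prediction, sentiment_score) for one row of class probabilities."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    score = probs[0] - probs[1]  # P(positive) - P(negative)
    return LABELS[best], score


rows = [
    [0.595314, 0.23879628, 0.16588964],  # row 0 of the output
    [0.37831756, 0.2276785, 0.3940039],  # row 1 of the output
]
for probs in rows:
    prediction, score = summarize(probs)
    print(prediction, round(score, 6))
```

A score near +1 indicates a confidently positive sentence and a score near -1 a confidently negative one. Row 1 shows how a neutral prediction can still carry a mildly positive score.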


# XLNet
XLNet is a Transformer-based NLP model, proposed as an alternative to BERT, that features a generalized autoregressive (AR) pre-training method. By contrast, BERT is an autoencoding (AE) language model. The research paper for XLNet can be found at https://arxiv.org/abs/1906.08237. The paper argues that BERT suffers from a pretrain-finetune discrepancy, because the [MASK] tokens used during pre-training never appear at fine-tuning time. XLNet addresses this with an autoregressive formulation that learns bidirectional contexts by maximizing the expected likelihood over permutations of the factorization order.
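
The idea of learning bidirectional context autoregressively can be illustrated with a toy example. This is only a sketch of the permutation concept, not XLNet's implementation, which realizes it through attention masks and two-stream attention. The function name `visible_context` and the sample tokens are made up for illustration.

```python
def visible_context(order):
    """Map each token position to the positions it may condition on.

    `order` is a factorization order: a permutation of token positions.
    A position sees every position that comes earlier in `order`.
    """
    context = {}
    seen = []
    for position in order:
        context[position] = sorted(seen)
        seen.append(position)
    return context


tokens = ["shares", "surged", "in", "amsterdam"]

left_to_right = visible_context([0, 1, 2, 3])  # standard autoregressive order
permuted = visible_context([2, 0, 3, 1])       # one sampled permutation

for name, ctx in [("left-to-right", left_to_right), ("permuted", permuted)]:
    visible = [tokens[i] for i in ctx[1]]
    print(f"{name}: 'surged' is predicted from {visible}")
```

Under the left-to-right order, "surged" is predicted from "shares" alone. Under the permuted order, it is predicted from words on both sides, without inserting any mask token into the input.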

For this project we adapted the XLNet implementation by Shanay Ghag described at https://medium.com/swlh/using-xlnet-for-sentiment-classification-cfa948e65e85.

A Python module for XLNet is provided by Hugging Face at https://huggingface.co/transformers/model_doc/xlnet.html.

The following changes were made to the original implementation from Shanay Ghag to match the settings of FinBERT:
- Number of classes expanded from 2 (positive, negative) to 3 (positive, negative, neutral)
- Max sequence length decreased from 512 to 64
- Batch size increased from 4 to 32
- Training epochs increased from 3 to 10
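
The settings listed above can be captured side by side as in the following sketch. The `TrainingSettings` class, its field names, and the `steps_per_epoch` helper are hypothetical and are not taken from either implementation. Only the numeric values come from the list.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSettings:
    """Hypothetical container for the hyperparameters listed above."""
    num_classes: int
    max_seq_len: int
    batch_size: int
    epochs: int

    def steps_per_epoch(self, num_examples):
        """Number of batches needed to cover the dataset once."""
        return math.ceil(num_examples / self.batch_size)


ORIGINAL = TrainingSettings(num_classes=2, max_seq_len=512, batch_size=4, epochs=3)
MODIFIED = TrainingSettings(num_classes=3, max_seq_len=64, batch_size=32, epochs=10)

for name, cfg in [("original", ORIGINAL), ("modified", MODIFIED)]:
    total = cfg.steps_per_epoch(1000) * cfg.epochs
    print(f"{name}: {total} optimizer steps for 1000 examples")
```

For a hypothetical dataset of 1000 examples, the larger batch size reduces the number of steps per epoch even though the number of epochs is higher.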

In addition, the code was restructured as two Python classes: XLNetSentiment and XLNetSentimentTrain.

The modified version is published here:
https://github.com/rrmorris2102/ucsd-mle/tree/main/xlnet

The following code shows a prediction example with XLNet.

In [10]:
import sys
sys.path.append("../../xlnet")

from xlnet import XLNetSentiment

model_file = '../../xlnet/models/xlnet_model.bin'
xlnet = XLNetSentiment(model_file, batchsize=1)

text = "Movie is the worst one I have ever seen!! The story has no meaning at all"
results = xlnet.predict(text)
print(results)

device cuda:0


Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.weight', 'logits_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

{'positive_score': 0.008360231295228004, 'negative_score': 0.9822444319725037, 'neutral_score': 0.009395359084010124, 'text': 'Movie is the worst one I have ever seen!! The story has no meaning at all', 'sentiment': 'negative'}
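
In the result above, the `sentiment` field matches the class with the highest score. A minimal sketch of that selection follows, assuming this is how the label is derived. That assumption is consistent with the output shown but has not been checked against the `XLNetSentiment` source, and the helper name `top_sentiment` is illustrative.

```python
def top_sentiment(result):
    """Return the class name whose `<name>_score` entry is largest."""
    suffix = "_score"
    scores = {
        key[: -len(suffix)]: value
        for key, value in result.items()
        if key.endswith(suffix)
    }
    return max(scores, key=scores.get)


# Scores copied from the prediction output above.
result = {
    "positive_score": 0.008360231295228004,
    "negative_score": 0.9822444319725037,
    "neutral_score": 0.009395359084010124,
}
print(top_sentiment(result))
```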
