# Sentiment analysis of financial data using Hugging Face NLP models

This notebook introduces documenting an NLP model with the ValidMind Developer Framework. The use case is sentiment analysis of financial news data from the financial phrasebank dataset (https://huggingface.co/datasets/financial_phrasebank).

The notebook covers:

- Initializing the ValidMind Developer Framework
- Running various tests to quickly generate documentation about the data and model

## Before you begin

To use the ValidMind Developer Framework with a Jupyter notebook, first get your Python environment ready, then install and initialize the client library.

If you don't already have one, you should also [create a documentation project](https://docs.validmind.ai/guide/create-your-first-documentation-project.html) on the ValidMind platform. You will use this project to upload your documentation and test results.

## Install the client library


In [None]:
%pip install -q validmind

## Initialize the client library

In a browser, go to the **Client Integration** page of your documentation project and click **Copy to clipboard** next to the code snippet. This code snippet gives you the API key, API secret, and project identifier to link your notebook to your documentation project.

::: {.column-margin}
::: {.callout-tip}
This step requires a documentation project. [Learn how you can create one](https://docs.validmind.ai/guide/create-your-first-documentation-project.html).
:::
:::

Next, replace this placeholder with your own code snippet:

In [None]:
## Replace the code below with the code snippet from your project ## 

import validmind as vm
  
vm.init(
    api_host = "https://api.prod.validmind.ai/api/v1/tracking",
    api_key = "...",
    api_secret = "...",
    project = "..."
)
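
If you prefer not to paste the API key and secret directly into the notebook, one option is to read them from environment variables and pass them to `vm.init()`. The following is a minimal sketch under that assumption: the variable names `VM_API_KEY`, `VM_API_SECRET`, and `VM_API_PROJECT` and the helper `load_vm_credentials` are hypothetical, not something the ValidMind library defines.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
ENV_KEYS = {
    "api_key": "VM_API_KEY",
    "api_secret": "VM_API_SECRET",
    "project": "VM_API_PROJECT",
}


def load_vm_credentials(env=os.environ):
    """Collect the keyword arguments for vm.init() from environment variables."""
    missing = [name for name in ENV_KEYS.values() if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[name] for arg, name in ENV_KEYS.items()}


# Usage, once the variables are set and validmind is imported as vm:
# vm.init(
#     api_host="https://api.prod.validmind.ai/api/v1/tracking",
#     **load_vm_credentials(),
# )
```

Failing early with the list of missing variables makes a misconfigured environment easier to diagnose than a rejected API call.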

### Preview the template

A template predefines sections for your documentation project and provides a general outline to follow, making the documentation process much easier.

You will upload documentation and test results into this template later on. For now, use the `vm.preview_template()` function from the ValidMind library to look at the structure the template provides, and note the empty sections:

In [None]:
vm.preview_template()

### Load dataset

In this section, we'll load the financial phrasebank dataset, which will be the foundation for our sentiment analysis tasks.

In [None]:
import pandas as pd

df = pd.read_csv('./datasets/sentiments.csv')
sample = df.sample(10)
sample
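
Beyond eyeballing a random sample, it can help to check how the sentiment labels are distributed, since a heavily skewed dataset affects how test results should be read. The following is a minimal sketch using only the standard library; `class_balance` is a hypothetical helper, and the toy labels stand in for `df["Sentiment"]`, whose actual values may differ.

```python
from collections import Counter


def class_balance(labels):
    """Return each label's share of the total, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Toy labels standing in for df["Sentiment"]; the real values may differ.
toy_labels = ["neutral", "positive", "neutral", "negative"]
print(class_balance(toy_labels))
```

In the notebook, the same helper should accept the DataFrame column directly, for example `class_balance(df["Sentiment"])`, because `Counter` works on any iterable of hashable values.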

## NLP data quality tests

Before we proceed with the analysis, it's crucial to ensure the quality of our NLP data. We can run the `data_preparation` section of the documentation template to validate the data's integrity and suitability.

In [None]:
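# Wrap the full DataFrame, naming the text and label columns
vm_ds = vm.init_dataset(
    dataset=df,
    text_column="Sentence",
    target_column="Sentiment",
)

# Run only the tests in the template's data_preparation section
text_data_test_plan = vm.run_documentation_tests(section="data_preparation", dataset=vm_ds)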

## Hugging Face transformers

## 1. Hugging Face: FinancialBERT-Sentiment-Analysis

https://huggingface.co/ahmedrachid/FinancialBERT-Sentiment-Analysis

Let's now explore integrating and testing FinancialBERT, a model designed specifically for sentiment analysis in the financial domain.

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
model = BertForSequenceClassification.from_pretrained(
    "ahmedrachid/FinancialBERT-Sentiment-Analysis", num_labels=3
)
tokenizer = BertTokenizer.from_pretrained("ahmedrachid/FinancialBERT-Sentiment-Analysis")

# Wrap model and tokenizer in a pipeline that takes raw text as input
hfmodel = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
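
A text-classification pipeline such as `hfmodel` returns one dictionary per input text, with a `label` and a `score` key. Label casing can differ between models and from the values in the dataset's `Sentiment` column, so a small normalization helper can be handy when inspecting predictions by hand. The following is a minimal sketch: `to_labels` is a hypothetical helper, and the sample output is made up for illustration rather than taken from a real model run.

```python
def to_labels(pipeline_output):
    """Pull the lowercase label string out of each prediction dict."""
    return [prediction["label"].lower() for prediction in pipeline_output]


# Illustrative stand-in for `hfmodel(["...", "..."])`; these values are made up,
# not real model output, and the label strings depend on the model's config.
example_output = [
    {"label": "Positive", "score": 0.98},
    {"label": "neutral", "score": 0.87},
]

print(to_labels(example_output))
```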

### Initialize VM dataset

In [None]:
# Load a test dataset with 100 rows only
vm_test_ds = vm.init_dataset(
    dataset=df.head(100),
    text_column="Sentence",
    target_column="Sentiment",
)

### Initialize VM model

When initializing a VM model, we pre-calculate predictions on the test dataset. This operation can take a long time for large datasets.

In [None]:
vm_model_1 = vm.init_model(
    hfmodel,
    test_ds=vm_test_ds,
)
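
To get a rough idea of how long the pre-calculation might take before committing to a larger dataset, one option is to time the pipeline on a handful of rows and extrapolate linearly. The following is a minimal sketch under that assumption: `estimate_seconds` is a hypothetical helper, not part of ValidMind, and the linear extrapolation ignores model loading, batching, and hardware effects.

```python
import time


def estimate_seconds(predict, texts, total_rows, sample_size=5):
    """Time `predict` on a small sample and extrapolate linearly to `total_rows`."""
    sample = list(texts[:sample_size])
    if not sample:
        raise ValueError("Need at least one text to time")
    start = time.perf_counter()
    predict(sample)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * total_rows


# Stand-in for the Hugging Face pipeline so the sketch runs without a model.
# In the notebook you would pass `hfmodel` and df["Sentence"].tolist() instead.
def fake_predict(batch):
    return [{"label": "neutral", "score": 1.0} for _ in batch]


estimate = estimate_seconds(fake_predict, ["a", "b", "c"], total_rows=100)
print(f"Estimated {estimate:.4f} s for 100 rows")
```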

### Run model validation tests

It's possible to run a subset of tests on the documentation template by passing a `section` parameter to `run_documentation_tests()`. Let's run only the tests in the `model_development` section, which cover model validation.

In [None]:
full_suite = vm.run_documentation_tests(
    section="model_development",
    dataset=vm_test_ds,
    model=vm_model_1,
)

## 2. Hugging Face: distilroberta-finetuned-financial-news-sentiment-analysis

https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis

The distilroberta model was fine-tuned on the phrasebank dataset: https://huggingface.co/datasets/financial_phrasebank.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")
hfmodel = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

### Initialize VM model

In [None]:
vm_model_2 = vm.init_model(
    hfmodel,
    test_ds=vm_test_ds,
)

In [None]:
full_suite = vm.run_documentation_tests(
    section="model_development",
    dataset=vm_test_ds,
    model=vm_model_2,
    models=[vm_model_1],
)

## 3. Hugging Face: financial-roberta-large-sentiment

https://huggingface.co/soleimanian/financial-roberta-large-sentiment

The financial-roberta-large model is another financial sentiment analysis model trained on large amounts of data including:

- Financial Statements
- Earnings Announcements
- Earnings Call Transcripts
- Corporate Social Responsibility (CSR) Reports
- Environmental, Social, and Governance (ESG) News
- Financial News
- Etc.

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("soleimanian/financial-roberta-large-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("soleimanian/financial-roberta-large-sentiment")
hfmodel = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)


In [None]:
vm_model_3 = vm.init_model(
    hfmodel,
    test_ds=vm_test_ds,
)

In [None]:
full_suite = vm.run_documentation_tests(
    section="model_development",
    dataset=vm_test_ds,
    model=vm_model_3,
    models=[vm_model_1, vm_model_2]
)
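
The later test runs pass the earlier models alongside the current one, which suggests the results are meant to be read side by side. For a quick sanity check outside the framework, accuracy can also be computed by hand from each model's predicted labels. The following is a minimal sketch: `accuracy` and `rank_by_accuracy` are hypothetical helpers, and the toy labels are made up for illustration, not results from the three models above.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def rank_by_accuracy(predictions_by_model, expected):
    """Return (model name, accuracy) pairs sorted from best to worst."""
    scores = {
        name: accuracy(predicted, expected)
        for name, predicted in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up labels for illustration only; not real model output.
toy_expected = ["positive", "neutral", "negative", "neutral"]
toy_predictions = {
    "model_a": ["positive", "neutral", "negative", "positive"],
    "model_b": ["neutral", "neutral", "negative", "positive"],
}

for name, score in rank_by_accuracy(toy_predictions, toy_expected):
    print(f"{name}: {score:.2f}")
```

Accuracy alone can mislead on imbalanced classes, so treat it as a first look rather than a replacement for the documented test results.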