# Sentiment analysis of financial data using Hugging Face LLMs
This notebook introduces how to document an LLM using the ValidMind Developer Framework. The use case is a sentiment analysis of financial phrase data (https://huggingface.co/datasets/financial_phrasebank).

The notebook covers:

- Initializing the ValidMind Developer Framework
- Running various tests to quickly generate documentation about the data and model

## Before you begin

To use the ValidMind Developer Framework with a Jupyter notebook, you first need to get your Python environment ready, then install and initialize the client library.

If you don't already have one, you should also [create a documentation project](https://docs.validmind.ai/guide/create-your-first-documentation-project.html) on the ValidMind platform. You will use this project to upload your documentation and test results.

## Install the client library


In [1]:
# %pip install --upgrade validmind

## Initialize the client library

In a browser, go to the **Client Integration** page of your documentation project and click **Copy to clipboard** next to the code snippet. This code snippet gives you the API key, API secret, and project identifier to link your notebook to your documentation project.

::: {.column-margin}
::: {.callout-tip}
This step requires a documentation project. [Learn how you can create one](https://docs.validmind.ai/guide/create-your-first-documentation-project.html).
:::
:::

Next, replace this placeholder with your own code snippet:

In [2]:
import validmind as vm

vm.init(
  api_host = "http://localhost:3000/api/v1/tracking",
  api_key = "dd4abeb23264f4784e1932204a47965d",
  api_secret = "1aba00ce6500a58b4605c59e42e0c5c83526080a648855b988f99a7827e4a06e",
  project = "cliop8llc003x32rlklophmdl"
)

2023-08-17 16:34:06,662 - INFO(validmind.api_client): Connected to ValidMind. Project: nlp model sensitivity analysis - Initial Validation (cliop8llc003x32rlklophmdl)
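Pasting the API key and secret directly into a cell makes them easy to leak when the notebook is shared. As a minimal sketch of one alternative, the values can be read from environment variables instead. The variable names (`VM_API_HOST`, `VM_API_KEY`, `VM_API_SECRET`, `VM_PROJECT`) and the helper `load_vm_credentials` are hypothetical and not part of the ValidMind library; only the keyword names mirror the `vm.init()` call above.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
_ENV_TO_ARG = {
    "VM_API_HOST": "api_host",
    "VM_API_KEY": "api_key",
    "VM_API_SECRET": "api_secret",
    "VM_PROJECT": "project",
}


def load_vm_credentials(env=None):
    """Collect the keyword arguments used by the vm.init() call above."""
    env = os.environ if env is None else env
    # Fail early with a readable message if anything is unset or empty.
    missing = [name for name in _ENV_TO_ARG if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[name] for name, arg in _ENV_TO_ARG.items()}
```

With the variables set, the initialization cell could then read `vm.init(**load_vm_credentials())`.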


### Load dataset

In [3]:
import pandas as pd
import numpy as np
df = pd.read_csv('./datasets/sentiments.csv')
sample = df.sample(10)
sample

Unnamed: 0,Sentiment,Sentence
4190,negative,Finnish soapstone processing and fireplaces ma...
208,positive,"Ragutis , which is controlled by the Finnish b..."
1886,neutral,`` Subscribers can browse free numbers with th...
4619,negative,When this information was released on 5 Septem...
4211,neutral,Kesko has previously published a stock exchang...
159,neutral,Aviation Systems Maintenance is based in Kansa...
2019,positive,in Q1 '10 19 April 2010 - Finnish forest machi...
914,positive,This resulted in improved sales figures in Swe...
3088,neutral,"Previously , Grimaldi held a 46.43 pct stake i..."
3876,neutral,The orange-handled scissors from Fiskars are p...
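Before running quality tests, it can help to see how the three sentiment labels are balanced. The sketch below uses only the standard library and the ten labels from the random sample shown above; `label_shares` is a hypothetical helper, and a ten-row sample says little about the full dataset.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the total, most common label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Labels of the ten sampled rows displayed above.
sample_labels = [
    "negative", "positive", "neutral", "negative", "neutral",
    "neutral", "positive", "positive", "neutral", "neutral",
]
print(label_shares(sample_labels))
```

In the notebook, the same helper could be applied to the whole column with `label_shares(df["Sentiment"])`.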


## NLP data quality tests

In [4]:
vm_ds = vm.init_dataset(dataset=df, type="generic", text_column='Sentence', target_column="Sentiment")
text_data_test_plan = vm.run_test_plan("text_data_quality",
                                       dataset=vm_ds)

2023-08-17 16:34:06,688 - INFO(validmind.client): The 'type' argument to init_dataset() argument is deprecated and no longer required.
2023-08-17 16:34:06,688 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


HBox(children=(Label(value='Running test plan...'), IntProgress(value=0, max=14)))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anilsorathiya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


VBox(children=(HTML(value='<h2>Results for <i>Text Data Quality</i> Test Plan:</h2><hr>'), HTML(value='<div cl…

## Hugging Face transformers

## Define model wrapper

In [5]:
from transformers import pipeline
from dataclasses import dataclass
@dataclass
class Sentiment_HuggingFace:
    predicted_prob_values = None

    def __init__(self, pipeline_task, model_name=None, tokenizer=None):
        self.model_name = model_name
        self.pipeline_task = pipeline_task
        # Use the model passed to the constructor, not a global `model` variable.
        self.model = pipeline(pipeline_task, model=model_name, tokenizer=tokenizer)

    def predict(self, data):
        data = [str(datapoint) for datapoint in data]
        results = self.model(data)
        results_df = pd.DataFrame(results)
        self.predicted_prob_values = results_df.score.values
        return results_df.label.values

    def predict_proba(self):
        if self.predicted_prob_values is None:
            raise ValueError("First run predict method to retrieve predicted probabilities")
        return self.predicted_prob_values


In [6]:
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

model = BertForSequenceClassification.from_pretrained("ahmedrachid/FinancialBERT-Sentiment-Analysis",num_labels=3)
tokenizer = BertTokenizer.from_pretrained("ahmedrachid/FinancialBERT-Sentiment-Analysis")
sentiment_model_hf = Sentiment_HuggingFace("sentiment-analysis",model_name=model, tokenizer=tokenizer)

df_test = df.head(15)

y_pred = sentiment_model_hf.predict(df_test.Sentence.values.tolist())
y_pred_prob = sentiment_model_hf.predict_proba()
df_results = df_test.copy()
df_results['y_pred'] = y_pred
df_results['y_pred_prob'] = y_pred_prob
df_results.head(10)

Unnamed: 0,Sentiment,Sentence,y_pred,y_pred_prob
0,neutral,"According to Gran , the company has no plans t...",neutral,0.988819
1,neutral,Technopolis plans to develop in stages an area...,neutral,0.999853
2,negative,The international electronic industry company ...,negative,0.999682
3,positive,With the new production plant the company woul...,positive,0.999855
4,positive,According to the company 's updated strategy f...,positive,0.999766
5,positive,FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag...,positive,0.999806
6,positive,"For the last quarter of 2010 , Componenta 's n...",positive,0.999853
7,positive,"In the third quarter of 2010 , net sales incre...",positive,0.999842
8,positive,Operating profit rose to EUR 13.1 mn from EUR ...,positive,0.999813
9,positive,"Operating profit totalled EUR 21.1 mn , up fro...",positive,0.999816
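A quick way to summarize a table like this is to compare the `Sentiment` and `y_pred` columns directly. The following is a minimal standard-library sketch; `accuracy` and `confusion_counts` are hypothetical helpers, and the lists reproduce only the ten rows displayed above, all of which match.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty list")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


# True and predicted labels of the ten rows displayed above.
y_true = ["neutral", "neutral", "negative"] + ["positive"] * 7
y_pred = ["neutral", "neutral", "negative"] + ["positive"] * 7

print(accuracy(y_true, y_pred))
print(confusion_counts(y_true, y_pred))
```

In the notebook, the inputs could come from `df_results["Sentiment"].tolist()` and `df_results["y_pred"].tolist()`. Fifteen rows is far too few to draw conclusions about model quality.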


## 1. Hugging Face: FinancialBERT-Sentiment-Analysis
https://huggingface.co/ahmedrachid/FinancialBERT-Sentiment-Analysis

In [7]:
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

model = BertForSequenceClassification.from_pretrained("ahmedrachid/FinancialBERT-Sentiment-Analysis",num_labels=3)
tokenizer = BertTokenizer.from_pretrained("ahmedrachid/FinancialBERT-Sentiment-Analysis")
hfmodel = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)


### Initialize VM dataset

In [8]:
vm_test_ds = vm.init_dataset(
    dataset=df_test,
    text_column="Sentence",
    target_column="Sentiment",
)

2023-08-17 16:34:23,597 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


### Initialize VM model

In [9]:

vm_model = vm.init_model(
    hfmodel,
    train_ds=vm_test_ds,
    test_ds=vm_test_ds,
)

In [10]:
full_suite = vm.run_test_suite(
    "binary_classifier_model_validation",
    dataset=vm_test_ds,
    model=vm_model
)

HBox(children=(Label(value='Running test suite...'), IntProgress(value=0, max=34)))

Note that pos_label (set to 'negative') is ignored when average != 'binary' (got 'micro'). You may use labels=[pos_label] to specify a single positive class.
2023-08-17 16:34:24,897 - ERROR(validmind.vm_models.

VBox(children=(HTML(value='<h2>Test Suite Results: <i style="color: #DE257E">Binary Classifier Model Validatio…
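The `pos_label` note in the output above has the wording of a scikit-learn warning: with three sentiment labels and `average='micro'`, there is no single positive class to score. Micro averaging instead pools true positives, false positives, and false negatives across all labels. The sketch below is a plain-Python illustration of that idea, not scikit-learn's implementation; `micro_f1` is a hypothetical helper.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over every label, then score once."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1  # predicted this label and it was correct
            elif p == label:
                fp += 1  # predicted this label but the truth differs
            elif t == label:
                fn += 1  # missed a row that truly has this label
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

When every row has exactly one true label and one predicted label, each wrong prediction counts once as a false positive and once as a false negative, so micro-averaged F1 equals plain accuracy.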

## 2. Hugging Face: distilroberta-finetuned-financial-news-sentiment-analysis
https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis

In [11]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")
hfmodel = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)


### Initialize VM model

In [12]:
vm_model = vm.init_model(
    hfmodel,
    test_ds=vm_test_ds,
    train_ds=vm_test_ds,
)

In [13]:
full_suite = vm.run_test_suite(
    "binary_classifier_model_validation",
    dataset=vm_test_ds,
    model=vm_model
)

HBox(children=(Label(value='Running test suite...'), IntProgress(value=0, max=34)))

2023-08-17 16:34:29,578 - ERROR(validmind.vm_models.test_plan): Failed to run test 'pfi': The 'estimator' parameter of permutation_importance must be an object implementing 'fit'. Got <transformers.pipelines.text_classification.TextClassificationPipeline object at 0x3321a5580> instead.
2023-08-17 16:34:29,579 - ERROR(validmind.vm_models.test_plan): Failed to run test 'pr_curve': Model requires a implemention of predict_proba method with 1 argument that is tensor features matrix
2023-08-17 16:34:29,579 - ERROR(validmind.vm_models.test_plan): Failed to run test 'roc_curve': Model requires a implemention of predict_proba method with 1 argument that is tensor features matrix
2023-08-17 16:34:29,580 - ERROR(validmind.vm_models.test_plan): Failed to run test 'psi': Model requires a implemention of predict_proba method with 1 argument that is tensor features matrix
2023-08-17 16:34:29,580 - ERROR(validmind.vm_models.test_plan): Failed to run test 'shap': Model TextClassificationPipeline not s

VBox(children=(HTML(value='<h2>Test Suite Results: <i style="color: #DE257E">Binary Classifier Model Validatio…
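The `pr_curve`, `roc_curve`, and `psi` errors above report that the model needs a `predict_proba` method that takes one argument. A raw pipeline object has no such method, and the `Sentiment_HuggingFace` wrapper defined earlier exposes `predict_proba()` without arguments and keeps only the score of the top label. The following is a minimal sketch of a wrapper with a one-argument `predict_proba`. It assumes the wrapped callable returns, for each input text, a list of `{"label": ..., "score": ...}` dicts covering every label. With transformers this is assumed to be what a text-classification pipeline returns when called with `top_k=None`, which should be checked against the installed version. `ProbabilityWrapper` and `stub_classifier` are hypothetical names, and this notebook does not confirm whether the ValidMind test suite accepts such a wrapper.

```python
class ProbabilityWrapper:
    """Wrap a text-classification callable so predict_proba takes the data."""

    def __init__(self, classifier, labels):
        # classifier: callable mapping a list of strings to, per input,
        # a list of {"label": ..., "score": ...} dicts covering every label.
        self.classifier = classifier
        # Fixed label order that defines the columns of the probability matrix.
        self.labels = list(labels)

    def predict_proba(self, data):
        """Return one row of per-label scores for each input text."""
        texts = [str(item) for item in data]
        rows = []
        for scores in self.classifier(texts):
            by_label = {entry["label"]: entry["score"] for entry in scores}
            rows.append([by_label[label] for label in self.labels])
        return rows

    def predict(self, data):
        """Return the highest-scoring label for each input text."""
        rows = self.predict_proba(data)
        return [self.labels[row.index(max(row))] for row in rows]


def stub_classifier(texts):
    """Stand-in for a real pipeline so the sketch runs without a model download."""
    return [
        [
            {"label": "negative", "score": 0.1},
            {"label": "neutral", "score": 0.2},
            {"label": "positive", "score": 0.7},
        ]
        for _ in texts
    ]


wrapper = ProbabilityWrapper(stub_classifier, ["negative", "neutral", "positive"])
print(wrapper.predict(["Operating profit rose."]))
```

Keeping the label order explicit makes each column of the returned matrix refer to the same class for every row, which curve-based metrics depend on.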

## 3. Hugging Face: financial-roberta-large-sentiment

https://huggingface.co/soleimanian/financial-roberta-large-sentiment

In [14]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("soleimanian/financial-roberta-large-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("soleimanian/financial-roberta-large-sentiment")
hfmodel = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)


In [15]:
vm_model = vm.init_model(
    hfmodel,
    test_ds=vm_test_ds,
    train_ds=vm_test_ds,
)

In [16]:
full_suite = vm.run_test_suite(
    "binary_classifier_model_validation",
    dataset=vm_test_ds,
    model=vm_model
)

HBox(children=(Label(value='Running test suite...'), IntProgress(value=0, max=34)))

2023-08-17 16:34:36,142 - ERROR(validmind.vm_models.test_plan): Failed to run test 'pfi': The 'estimator' parameter of permutation_importance must be an object implementing 'fit'. Got <transformers.pipelines.text_classification.TextClassificationPipeline object at 0x1773ea670> instead.
2023-08-17 16:34:36,143 - ERROR(validmind.vm_models.test_plan): Failed to run test 'pr_curve': Model requires a implemention of predict_proba method with 1 argument that is tensor features matrix
2023-08-17 16:34:36,143 - ERROR(validmind.vm_models.test_plan): Failed to run test 'roc_curve': Model requires a implemention of predict_proba method with 1 argument that is tensor features matrix
2023-08-17 16:34:36,144 - ERROR(validmind.vm_models.test_plan): Failed to run test 'psi': Model requires a implemention of predict_proba method with 1 argument that is tensor features matrix
2023-08-17 16:34:36,144 - ERROR(validmind.vm_models.test_plan): Failed to run test 'shap': Model TextClassificationPipeline not s

VBox(children=(HTML(value='<h2>Test Suite Results: <i style="color: #DE257E">Binary Classifier Model Validatio…