# Summarization of financial data using Hugging Face NLP models

This notebook introduces how to document an NLP model using the ValidMind Developer Framework. The use case is the summarization of financial news, illustrated with the CNN / DailyMail dataset (https://huggingface.co/datasets/cnn_dailymail). It covers:

- Initializing the ValidMind Developer Framework
- Running various tests to quickly generate documentation about the data and model

## Before you begin

To use the ValidMind Developer Framework with a Jupyter notebook, you first need to get your Python environment ready, then install and initialize the client library.

If you don't already have one, you should also [create a documentation project](https://docs.validmind.ai/guide/create-your-first-documentation-project.html) on the ValidMind platform. You will use this project to upload your documentation and test results.

## Install the client library

In [1]:
%pip install -q validmind

Note: you may need to restart the kernel to use updated packages.


## Initialize the client library

In a browser, go to the **Client Integration** page of your documentation project and click **Copy to clipboard** next to the code snippet. This code snippet gives you the API key, API secret, and project identifier to link your notebook to your documentation project.

::: {.column-margin}
::: {.callout-tip}
This step requires a documentation project. [Learn how you can create one](https://docs.validmind.ai/guide/create-your-first-documentation-project.html).
:::
:::

Next, replace this placeholder with your own code snippet:

In [2]:
## Replace the code below with the code snippet from your project ## 

import validmind as vm

vm.init(
  api_host = "....",
  api_key = "...",
  api_secret = "...",
  project = "..."
)

2023-09-25 22:18:29,149 - INFO(validmind.api_client): Connected to ValidMind. Project: Summarization (Hugging Face) - Initial Validation (clmqvzvql0005v28hl2tjzf4e)
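If you would rather not paste credentials directly into the notebook, one option is to read them from environment variables. The sketch below is illustrative only: the variable names (`VM_API_HOST`, `VM_API_KEY`, `VM_API_SECRET`, `VM_PROJECT`) and the helper `load_vm_credentials` are hypothetical and are not defined by ValidMind. The helper only builds the keyword arguments used in the `vm.init` call above.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
ENV_TO_ARG = {
    "VM_API_HOST": "api_host",
    "VM_API_KEY": "api_key",
    "VM_API_SECRET": "api_secret",
    "VM_PROJECT": "project",
}


def load_vm_credentials(environ=None):
    """Collect the vm.init keyword arguments from environment variables."""
    environ = os.environ if environ is None else environ
    # Fail early with a clear message if any value is missing or empty.
    missing = [name for name in ENV_TO_ARG if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: environ[name] for name, arg in ENV_TO_ARG.items()}
```

With the variables set, the initialization call would become `vm.init(**load_vm_credentials())`.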


## CNN / DailyMail dataset

The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.

In [4]:
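from datasets import load_dataset

# Download (or load from cache) version 3.0.0 of the CNN / DailyMail dataset
cnn_dataset = load_dataset('cnn_dailymail', '3.0.0')

# Convert each split to a pandas DataFrame
train_df = cnn_dataset['train'].to_pandas()
val_df = cnn_dataset['validation'].to_pandas()
test_df = cnn_dataset['test'].to_pandas()

# Keep only the source text and reference summary, and the first 20 rows
train_df = train_df[['article', 'highlights']]
train_df = train_df.head(20)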
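Before handing the data to ValidMind, it can help to check how much shorter the reference summaries are than the articles. The following is a minimal sketch in plain Python, not part of the original notebook; the helper `compression_ratio` and the toy row are made up for illustration, and only the column names `article` and `highlights` come from the dataset.

```python
def compression_ratio(article: str, highlights: str) -> float:
    """Return summary length divided by article length, counted in whitespace-separated words."""
    article_words = len(article.split())
    summary_words = len(highlights.split())
    # Guard against empty articles to avoid a division by zero.
    return summary_words / article_words if article_words else 0.0


# Toy row for illustration only; real rows would come from the DataFrame.
toy_rows = [
    {
        "article": "Stocks rallied on Monday after the central bank held rates steady",
        "highlights": "Stocks rallied after rates held",
    },
]

ratios = [compression_ratio(row["article"], row["highlights"]) for row in toy_rows]
print(ratios)
```

On the real data, the same loop could run over `train_df.to_dict("records")`.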

In [5]:
# train_df already holds only the first 20 rows, so head(100) keeps all of them
df = train_df.head(100)
vm_ds = vm.init_dataset(
    dataset=df,
    text_column="article",
    target_column="highlights",
)

2023-09-25 22:18:37,294 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


In [6]:
text_data_test_plan = vm.run_test_plan(
    "text_data_quality",
    dataset=vm_ds,
)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/andres/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



In [7]:
from transformers import pipeline, T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

summarizer_model = pipeline(
    task="summarization",
    model=model,
    tokenizer=tokenizer,
    min_length=0,
    max_length=60,
    truncation=True,
    model_kwargs={"cache_dir": '/Documents/Huggin_Face/'},
)  # Note: We specify cache_dir to use predownloaded models.

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
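To see what the summarizer produces before running any test plan, its output can be unpacked into plain strings. The sketch below assumes the usual Hugging Face summarization output shape, a list of dictionaries with a `summary_text` key. The helper `extract_summaries` and the stand-in `fake_summarizer` are hypothetical and exist only so the example runs without downloading a model.

```python
def extract_summaries(summarizer, articles):
    """Run a summarization callable over a list of texts and return plain strings."""
    outputs = summarizer(list(articles))
    summaries = []
    for out in outputs:
        # Some configurations return a list of candidates per input; take the first.
        if isinstance(out, list):
            out = out[0]
        summaries.append(out["summary_text"])
    return summaries


def fake_summarizer(texts):
    """Stand-in for a real pipeline: 'summarizes' by keeping the first three words."""
    return [{"summary_text": " ".join(text.split()[:3])} for text in texts]


print(extract_summaries(fake_summarizer, ["one two three four five"]))
```

With the pipeline defined above, the call would look like `extract_summaries(summarizer_model, df["article"].tolist())`.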


In [8]:
# Evaluate the model on a 10-row subset to keep the run short
df_test = df.head(10)

vm_test_ds = vm.init_dataset(
    dataset=df_test,
    text_column="article",
    target_column="highlights",
)

2023-09-25 22:18:44,964 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


In [9]:
vm_model = vm.init_model(
    summarizer_model,
    test_ds=vm_test_ds,
)

In [10]:
vm.test_plans.list_plans()

ID,Name,Description
classifier_metrics,ClassifierMetrics,Test plan for sklearn classifier metrics
classifier_validation,ClassifierPerformance,Test plan for sklearn classifier models
classifier_model_diagnosis,ClassifierDiagnosis,Test plan for sklearn classifier model diagnosis tests
prompt_validation,PromptValidation,Test plan for prompt validation
tabular_dataset_description,TabularDatasetDescription,Test plan to extract metadata and descriptive  statistics from a tabular dataset
tabular_data_quality,TabularDataQuality,Test plan for data quality on tabular datasets
time_series_data_quality,TimeSeriesDataQuality,Test plan for data quality on time series datasets
time_series_univariate,TimeSeriesUnivariate,Test plan to perform time series univariate analysis.
time_series_multivariate,TimeSeriesMultivariate,Test plan to perform time series multivariate analysis.
time_series_forecast,TimeSeriesForecast,Test plan to perform time series forecast tests.


In [11]:
vm.test_plans.describe_plan("summarization_metrics")

ID,Name,Description,Tests
summarization_metrics,SummarizationMetrics,Test plan for Summarization metrics,RougeMetrics (Metric) TokenDisparity (Metric) BleuScore (Metric) BertScore (Metric) ContextualRecall (Metric)
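To make the `RougeMetrics` entry more concrete, here is a minimal sketch of ROUGE-1, which measures unigram overlap between a reference summary and a generated one. This is a simplified hand-rolled version with whitespace tokenization and lowercasing as assumptions; it is not ValidMind's implementation, and library implementations may apply stemming or other preprocessing.

```python
from collections import Counter


def rouge_1(reference: str, candidate: str) -> dict:
    """Compute ROUGE-1 precision, recall, and F1 from clipped unigram counts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())

    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

ROUGE-2 follows the same pattern with bigrams, and ROUGE-L replaces the n-gram overlap with the longest common subsequence.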


In [12]:
config = {
    "rouge_metric": {
        "rouge_metrics": ["rouge-1", "rouge-2", "rouge-l"],
    },
}
summarization_metrics = vm.run_test_plan(
    "summarization_metrics",
    model=vm_model,
    config=config,
)


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

