# BERT Embeddings for Financial News Articles

This notebook demonstrates how to use a BERT model to create embeddings for news articles. The setup matches the Oliver Wyman NewsTrack model and serves as a proof of concept (POC) for embedding model support in the ValidMind Developer Framework.

## ValidMind at a glance

ValidMind's platform enables organizations to identify, document, and manage model risks for all types of models, including AI/ML models, LLMs, and statistical models. As a model developer, you use the ValidMind Developer Framework to automate documentation and validation tests, and then use the ValidMind AI Risk Platform UI to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.

If this is your first time trying out ValidMind, you can make use of the following resources alongside this notebook:

- [Get started](https://docs.validmind.ai/guide/get-started.html) — The basics, including key concepts, and how our products work
- [Get started with the ValidMind Developer Framework](https://docs.validmind.ai/guide/get-started-developer-framework.html) —  The path for developers, more code samples, and our developer reference

## Before you begin

::: {.callout-tip}
### New to ValidMind? 
For access to all features available in this notebook, create a free ValidMind account. 

Signing up is FREE — [**Sign up now**](https://app.prod.validmind.ai)
:::

If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).

## Install the client library

The client library provides Python support for the ValidMind Developer Framework. To install it:

In [None]:
%pip install -q validmind

## Initialize the client library

ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the client library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

Get your code snippet:

1. In a browser, log into the [Platform UI](https://app.prod.validmind.ai).

2. In the left sidebar, navigate to **Model Inventory** and click **+ Register new model**.

3. Enter the model details and click **Continue**. ([Need more help?](https://docs.validmind.ai/guide/register-models-in-model-inventory.html))

4. Go to **Getting Started** and click **Copy snippet to clipboard**.

Next, replace this placeholder with your own code snippet:

In [None]:
# Replace with your code snippet

import validmind as vm

vm.init(
    api_host="...",
    api_key="...",
    api_secret="...",
    project="..."
)
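
If you prefer not to paste credentials directly into the notebook, one option is to read them from environment variables and pass them to `vm.init()`. The following is a minimal sketch of that idea. The environment variable names and the `load_vm_credentials` helper are hypothetical and are not part of the ValidMind library; only the four `vm.init()` argument names come from the snippet above.

```python
import os

# Hypothetical environment variable names chosen for this sketch.
# Map each one to the vm.init() keyword argument it should fill.
ENV_TO_ARG = {
    "MY_VM_API_HOST": "api_host",
    "MY_VM_API_KEY": "api_key",
    "MY_VM_API_SECRET": "api_secret",
    "MY_VM_PROJECT": "project",
}


def load_vm_credentials(env=os.environ):
    """Build the keyword arguments for vm.init() from environment variables."""
    missing = [name for name in ENV_TO_ARG if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[name] for name, arg in ENV_TO_ARG.items()}


# Usage (commented out because it needs real credentials):
# import validmind as vm
# vm.init(**load_vm_credentials())
```

Failing early on a missing variable keeps a misconfigured environment from surfacing later as an authentication error.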

## Initialize the Python environment

Next, let's import the necessary libraries and set up your Python environment for data analysis:

In [None]:
import pandas as pd
from transformers import pipeline

### Preview the documentation template

A template predefines sections for your model documentation and provides a general outline to follow, making the documentation process much easier.

You will upload documentation and test results into this template later on. For now, take a look at the structure that the template provides with the `vm.preview_template()` function from the ValidMind library and note the empty sections:

In [None]:
vm.preview_template()

## Load the sample dataset

The sample dataset used here is a set of news articles stored in a Feather file, which the code below reads from a local path.

To use the dataset, load it into a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), a two-dimensional tabular data structure that makes use of rows and columns, and then keep only the `title_snippet_processed` text column:

In [None]:

# The dataset file should be located at the root of the `notebooks/` directory.
df = pd.read_feather('../20231026 SampleData_filtered.f')

# Keep only the text column that will be embedded.
news_articles_df = df[['title_snippet_processed']].copy()
news_articles_df.head()
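
Before embedding the articles, it can help to check the text column for empty or repeated entries. The sketch below runs such a check on a small made-up DataFrame. The `summarize_text_column` helper and the toy rows are illustrative assumptions; only the column name `title_snippet_processed` comes from this notebook.

```python
import pandas as pd


def summarize_text_column(df: pd.DataFrame, column: str) -> dict:
    """Count rows, blank entries, and duplicate entries in a text column."""
    if column not in df.columns:
        raise KeyError(f"Column {column!r} not found in the DataFrame")
    text = df[column]
    # Treat missing values and whitespace-only strings as blank.
    is_blank = text.isna() | (text.fillna("").astype(str).str.strip() == "")
    return {
        "rows": int(len(text)),
        "blank": int(is_blank.sum()),
        "duplicates": int(text.dropna().duplicated().sum()),
    }


# Toy data standing in for the real news articles.
toy_df = pd.DataFrame(
    {
        "title_snippet_processed": [
            "stocks rally after earnings",
            "",
            None,
            "stocks rally after earnings",
        ]
    }
)
summary = summarize_text_column(toy_df, "title_snippet_processed")
print(summary)
```

On the real data, the same call would be `summarize_text_column(news_articles_df, "title_snippet_processed")`.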

### Initialize a ValidMind dataset object

Before you can run a test suite, which is just a collection of tests, you must first initialize a ValidMind dataset object using the [`init_dataset`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) function from the ValidMind (`vm`) module.

This notebook passes two arguments to the function:

- `dataset` — the raw dataset that you want to analyze
- `text_column` — the name of the text column in the dataset 


In [None]:
vm_dataset = vm.init_dataset(
    dataset=news_articles_df,
    text_column="title_snippet_processed",
)

## Document the model

To document the model with the ValidMind Developer Framework, you load the embedding model, initialize a test dataset and a ValidMind model object you can use for testing, and then run the documentation tests.

In [None]:

# Load a Hugging Face feature-extraction pipeline backed by bert-base-uncased.
embedding_model = pipeline(
    'feature-extraction',
    model='bert-base-uncased',
    tokenizer='bert-base-uncased',
)
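
For a single input string, a `feature-extraction` pipeline returns one vector per token, as a nested list shaped roughly `[1, num_tokens, hidden_size]` (the hidden size is 768 for `bert-base-uncased`). To compare whole articles you usually need one vector per article. The sketch below shows mean pooling as one common way to get there. It uses random numbers in place of real pipeline output so that it runs without downloading the model, and it is an assumption for illustration: this notebook does not state how ValidMind pools token vectors internally.

```python
import numpy as np


def mean_pool(token_vectors) -> np.ndarray:
    """Average per-token vectors into a single fixed-size vector."""
    arr = np.asarray(token_vectors, dtype=np.float32)
    if arr.ndim == 3:
        # Drop the leading batch axis of size 1.
        arr = arr[0]
    return arr.mean(axis=0)


# Stand-in for pipeline output: 1 text, 12 tokens, 768 dimensions.
rng = np.random.default_rng(0)
fake_output = rng.normal(size=(1, 12, 768)).tolist()

article_vector = mean_pool(fake_output)
print(article_vector.shape)  # (768,)
```

With the real model, `mean_pool(embedding_model("some headline"))` would be the equivalent call.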

In [None]:
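# The test dataset wraps the same DataFrame as `vm_dataset` above.
vm_test_ds = vm.init_dataset(
    dataset=news_articles_df,
    text_column="title_snippet_processed",
)

# Register the embedding pipeline as a ValidMind model linked to the test dataset.
vm_model = vm.init_model(
    embedding_model,
    test_ds=vm_test_ds,
)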

In [None]:
config = {
    "validmind.model_validation.embeddings.StabilityAnalysisKeyword": {
        "keyword_dict": {
            'investors': 'shareholders',
            'tech': 'technology',
            'are': 'exist',
            'this': 'that',
            'after': 'post',
            'by': 'via',
            'announced': 'declared',
            'have': 'possess',
            'global': 'worldwide',
            'industry': 'sector',
            'major': 'primary',
        }
    }
}
# Run the tests in the "data_preparation" section of the template using the config above.
document_tests = vm.run_documentation_tests(
    model=vm_model,
    dataset=vm_dataset,
    section="data_preparation",
    config=config,
)
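
The `keyword_dict` above pairs each keyword with a near-synonym. A plausible reading is that the stability test swaps these words in the input text and checks how far the embeddings move. The sketch below illustrates that idea with a whole-word, case-sensitive substitution and a cosine similarity helper. Both functions are hypothetical and written for this example; how `StabilityAnalysisKeyword` actually matches words or scores similarity is not described in this notebook.

```python
import re

import numpy as np


def swap_keywords(text: str, keyword_dict: dict) -> str:
    """Replace whole-word occurrences of each key with its mapped value."""
    if not keyword_dict:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in keyword_dict) + r")\b"
    )
    return pattern.sub(lambda match: keyword_dict[match.group(1)], text)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sample_keywords = {"investors": "shareholders", "tech": "technology"}
original = "tech investors back fintech startup"
perturbed = swap_keywords(original, sample_keywords)
print(perturbed)  # technology shareholders back fintech startup
```

Note that `fintech` is left unchanged because only whole words are matched. A stable embedding model would be expected to give a `cosine_similarity` close to 1.0 between the vectors of the original and perturbed text.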