# Summarization of financial data using a Large Language Model (LLM)

This notebook introduces how to document an LLM using the ValidMind Developer Framework. The use case is the summarization of financial news articles (https://huggingface.co/datasets/cnn_dailymail). It covers:

- Initializing the ValidMind Developer Framework
- Running various tests to quickly generate documentation about the data and model


## Before you begin

::: {.callout-tip}
### New to ValidMind? 
To access the ValidMind Platform UI, you'll need an account.

Signing up is FREE — **[Create your account](https://app.prod.validmind.ai)**.
:::

If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).


## Install the client library

In [None]:
# %pip install --upgrade validmind

## Initialize the client library

ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the client library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

Get your code snippet:

1. In a browser, log into the [Platform UI](https://app.prod.validmind.ai).

2. In the left sidebar, navigate to **Model Inventory** and click **+ Register new model**.

3. Enter the model details, making sure to select **LLM-based Text Summarization** as the template and **Marketing/Sales - Analytics** as the use case, and click **Continue**. ([Need more help?](https://docs.validmind.ai/guide/register-models-in-model-inventory.html))

4. Go to **Getting Started** and click **Copy snippet to clipboard**.

Next, replace this placeholder with your own code snippet:

In [None]:
# Replace with your code snippet

import validmind as vm

vm.init(
    api_host="https://api.prod.validmind.ai/api/v1/tracking",
    api_key="...",
    api_secret="...",
    project="..."
)

## Use case

In financial news, accurate interpretation of information is key. Given the volume of articles published, we use Hugging Face LLMs to summarize financial articles. This notebook focuses on evaluating these text summarization models using validation metrics. These metrics do not only measure the quality of the data and the model's performance; they also check for bias or toxicity in the summaries. This helps assure developers and users that the summarized content is both trustworthy and compliant with AI principles.

## 1. Setup

### Import Libraries

In [None]:
from transformers import pipeline
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import textwrap
from tabulate import tabulate
from IPython.display import display, HTML
from rouge import Rouge
import plotly.graph_objects as go
import nltk
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
from pprint import pprint
import torch
import string
import plotly.express as px
import plotly.subplots as sp
from collections import Counter
from itertools import combinations
from dataclasses import dataclass

### Preprocessing functions

In [None]:
def add_id_column(df):
    """
    Adds an 'ID' column to the dataframe.

    Args:
    - df (pd.DataFrame): The input dataframe.

    Returns:
    - pd.DataFrame: The dataframe with an added 'ID' column.
    """
    df.insert(0, 'ID', range(1, 1 + len(df)))
    return df


def load_text_data(filepath, num_records=5):
    """
    Load a CSV file, limit the number of records, and add an 'ID' column.

    Args:
    - filepath (str): Path to the CSV file.
    - num_records (int): Number of records to load.

    Returns:
    - pd.DataFrame: The dataframe with the specified number of records and an added 'ID' column.
    """
    df = pd.read_csv(filepath)
    df = df.head(num_records)
    return add_id_column(df)
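
The loading step above (read a CSV, keep the first few records, prepend a 1-based `ID`) can be tried without a file on disk. The following is a minimal sketch using only the standard library rather than pandas; the helper name `load_records` and the sample CSV text are hypothetical and exist only for illustration.

```python
import csv
import io
from itertools import islice

# Hypothetical CSV content standing in for a file on disk.
CSV_TEXT = "topic,input\nmarkets,Stocks rose.\nrates,Bonds fell.\nfx,The dollar was flat.\n"


def load_records(handle, num_records=5):
    """Read at most num_records rows and prepend a 1-based 'ID' field."""
    reader = csv.DictReader(handle)
    return [{"ID": i, **row} for i, row in enumerate(islice(reader, num_records), start=1)]


rows = load_records(io.StringIO(CSV_TEXT), num_records=2)
print(rows)
```

Limiting the number of records keeps the later model calls fast while you experiment; raise `num_records` once the pipeline works end to end.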

In [None]:
import json
import pandas as pd


def load_toxic_prompt_data(file_path, threshold=None):
    """
    Reads a .jsonl file and returns a dataframe of its records.

    :param file_path: Path to the .jsonl file.
    :param threshold: Optional toxicity threshold. If provided, only records where the prompt
                      or the continuation toxicity score reaches this threshold are returned.
    :return: DataFrame with all records, or only the matching ones when a threshold is given.
    """
    # Load records
    records = []
    with open(file_path, "r") as file:
        for line in file:
            data = json.loads(line)
            records.append(data)

    # Prepare data for dataframe
    data_list = []
    for record in records:
        prompt_toxicity = record["prompt"].get("toxicity", None)
        cont_toxicity = record["continuation"].get("toxicity", None)

        if threshold is not None:
            # Compare against None explicitly so that a valid score of 0.0 is not treated as missing
            if prompt_toxicity is None or cont_toxicity is None:
                continue  # If we don't have a toxicity value, skip the record
            elif prompt_toxicity < threshold and cont_toxicity < threshold:
                continue  # If neither toxicity value reaches the threshold, skip the record

        row = {
            "Filename": record["filename"],
            "Begin": record["begin"],
            "End": record["end"],
            "Challenging": record["challenging"],
            "Prompt Text": record["prompt"]["text"],
            "Prompt Profanity": record["prompt"].get("profanity", "N/A"),
            "Prompt Sexually Explicit": record["prompt"].get("sexually_explicit", "N/A"),
            "Prompt Identity Attack": record["prompt"].get("identity_attack", "N/A"),
            "Prompt Flirtation": record["prompt"].get("flirtation", "N/A"),
            "Prompt Threat": record["prompt"].get("threat", "N/A"),
            "Prompt Insult": record["prompt"].get("insult", "N/A"),
            "Prompt Severe Toxicity": record["prompt"].get("severe_toxicity", "N/A"),
            "Prompt Toxicity": record["prompt"].get("toxicity", "N/A"),
            "Continuation Text": record["continuation"]["text"],
            "Cont. Severe Toxicity": record["continuation"].get("severe_toxicity", "N/A"),
            "Cont. Toxicity": record["continuation"].get("toxicity", "N/A"),
            "Cont. Profanity": record["continuation"].get("profanity", "N/A"),
            "Cont. Sexually Explicit": record["continuation"].get("sexually_explicit", "N/A"),
            "Cont. Identity Attack": record["continuation"].get("identity_attack", "N/A"),
            "Cont. Flirtation": record["continuation"].get("flirtation", "N/A"),
            "Cont. Threat": record["continuation"].get("threat", "N/A"),
            "Cont. Insult": record["continuation"].get("insult", "N/A")
        }
        data_list.append(row)

    # Convert list of dicts to dataframe
    df = pd.DataFrame(data_list)

    return df
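
The threshold rule in `load_toxic_prompt_data` can be checked in isolation. The following is a minimal sketch, assuming records shaped like the ones the function reads (a `prompt` and a `continuation` object, each with an optional `toxicity` score); the helper name `passes_threshold` and the sample records are hypothetical.

```python
import json

# Hypothetical in-memory JSONL sample; real files carry many more fields.
SAMPLE_JSONL = """\
{"prompt": {"text": "a", "toxicity": 0.9}, "continuation": {"text": "b", "toxicity": 0.1}}
{"prompt": {"text": "c", "toxicity": 0.2}, "continuation": {"text": "d", "toxicity": 0.3}}
{"prompt": {"text": "e"}, "continuation": {"text": "f", "toxicity": 0.8}}
"""


def passes_threshold(record, threshold=None):
    """Return True if the record should be kept for the given threshold."""
    if threshold is None:
        return True  # no filtering requested
    prompt_tox = record["prompt"].get("toxicity")
    cont_tox = record["continuation"].get("toxicity")
    if prompt_tox is None or cont_tox is None:
        return False  # a missing score cannot be compared
    # Keep the record when at least one score reaches the threshold
    return prompt_tox >= threshold or cont_tox >= threshold


records = [json.loads(line) for line in SAMPLE_JSONL.splitlines() if line.strip()]
kept = [r["prompt"]["text"] for r in records if passes_threshold(r, threshold=0.5)]
print(kept)
```

Note that a record is dropped when either score is missing, even if the other score is high; that mirrors the skip rule in the function above.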

In [None]:
def _format_cell_text(text, width=50):
    """Private function to format a cell's text."""
    return '\n'.join([textwrap.fill(line, width=width) for line in text.split('\n')])


def _format_dataframe_for_tabulate(df):
    """Private function to format the entire DataFrame for tabulation."""
    df_out = df.copy()

    # Format all string columns
    for column in df_out.columns:
        # Check if column is of type object (likely strings)
        if df_out[column].dtype == object:
            df_out[column] = df_out[column].apply(_format_cell_text)
    return df_out


def _dataframe_to_html_table(df):
    """Private function to convert a DataFrame to an HTML table."""
    headers = df.columns.tolist()
    table_data = df.values.tolist()
    return tabulate(table_data, headers=headers, tablefmt="html")


def display_nice(df, num_rows=None):
    """Primary function to format and display a DataFrame."""
    if num_rows is not None:
        df = df.head(num_rows)
    formatted_df = _format_dataframe_for_tabulate(df)
    html_table = _dataframe_to_html_table(formatted_df)
    display(HTML(html_table))
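
The cell formatting used by `display_nice` wraps each line of a cell separately, so that line breaks already present in the text are preserved. A minimal sketch of that idea, using only `textwrap` (the function name `wrap_cell` and the sample text are illustrative):

```python
import textwrap


def wrap_cell(text, width=50):
    """Wrap each line of a cell independently so existing line breaks survive."""
    return "\n".join(textwrap.fill(line, width=width) for line in text.split("\n"))


cell = "First paragraph of a long article body.\nSecond paragraph, kept on its own line."
wrapped = wrap_cell(cell, width=20)
print(wrapped)
```

Calling `textwrap.fill` on the whole string instead would merge the two paragraphs into one, which is why the split on `"\n"` comes first.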

In [None]:
def add_list_to_df(df, column_data, column_name):
    """
    Adds a new column to the dataframe df that contains the given data.

    Parameters:
    - df (pd.DataFrame): The original pandas DataFrame.
    - column_data (list/array): List/array of data to be added as a new column.
    - column_name (str): Name of the new column.

    Returns:
    - pd.DataFrame: DataFrame with an additional column.
    """

    df = df.copy()  # Make an explicit copy of the DataFrame

    # Check if the length of column_data matches the number of rows in the DataFrame
    if len(column_data) != len(df):
        raise ValueError(
            f"The length of column_data ({len(column_data)}) does not match the number of rows in the DataFrame ({len(df)}).")

    # Add the column_data to the DataFrame
    df[column_name] = column_data

    return df

In [None]:
def add_summaries_to_df(df, summaries):
    """
    Adds a new column 'summary_X' to the dataframe df that contains the given summaries, where X is an incremental number.

    Parameters:
    - df: The original pandas DataFrame.
    - summaries: List/array of summarized texts.

    Returns:
    - A new DataFrame with an additional summary column, ordered as 'topic', 'input', 'reference_summary', then the summary columns, then any remaining columns.
    """

    df = df.copy()  # Make an explicit copy of the DataFrame

    # Check if the length of summaries matches the number of rows in the DataFrame
    if len(summaries) != len(df):
        raise ValueError(
            f"The number of summaries ({len(summaries)}) does not match the number of rows in the DataFrame ({len(df)}).")

    # Determine the name for the new summary column
    col_index = 1
    col_name = 'summary_1'
    while col_name in df.columns:
        col_index += 1
        col_name = f'summary_{col_index}'

    # Add the summaries to the DataFrame
    df[col_name] = summaries

    # Rearrange the DataFrame columns to have 'topic' first, then the original 'input', followed by summary columns
    summary_columns = [col for col in df.columns if col.startswith('summary')]
    other_columns = [col for col in df.columns if col not in summary_columns
                     + ['topic', 'input', 'reference_summary']]

    columns_order = ['topic', 'input', 'reference_summary'] + \
        sorted(summary_columns) + other_columns
    df = df[columns_order]

    return df
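
The column-naming rule in `add_summaries_to_df` picks the first unused `summary_N` name. A minimal sketch of just that rule, in plain Python (the helper name `next_summary_column` is hypothetical):

```python
def next_summary_column(existing_columns):
    """Return the first 'summary_N' name (N = 1, 2, ...) not already in use."""
    index = 1
    while f"summary_{index}" in existing_columns:
        index += 1
    return f"summary_{index}"


columns = ["topic", "input", "reference_summary", "summary_1", "summary_2"]
new_column = next_summary_column(columns)
print(new_column)
```

One thing to keep in mind: `add_summaries_to_df` orders the summary columns with `sorted()`, which is lexicographic, so `summary_10` would be placed before `summary_2` once there are ten or more summaries.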

### POC Validation Metrics

In [None]:
import evaluate
toxicity = evaluate.load("toxicity")

In [None]:
import plotly.graph_objects as go


def hf_toxicity_plot(df, params):
    """
    Compute toxicity scores for texts and then plot line plots for each text column
    where the generated text score surpasses the given threshold.

    Parameters:
    - df (pd.DataFrame): The dataframe containing texts.
    - params (dict): Parameters containing toxicity evaluation object, column names, and score threshold.
    """

    # Extract necessary parameters
    toxicity = params["hf_toxicity_obj"]
    text_columns = params["text_columns"]
    score_threshold = params.get("score_threshold", 0)  # default to 0 if not provided

    # Create figure
    fig = go.Figure()

    # Track the largest plotted index across all columns for the threshold line
    max_index = 0

    for col in text_columns:
        # Get list of texts from dataframe
        texts = df[col].tolist()

        # Compute toxicity for texts
        toxicity_scores = toxicity.compute(predictions=texts)['toxicity']

        # Keep only the (index, score) pairs where the score is above the threshold
        above = [(i, score) for i, score in enumerate(toxicity_scores)
                 if score > score_threshold]
        indices = [i for i, _ in above]
        filtered_scores = [score for _, score in above]
        if indices:
            max_index = max(max_index, indices[-1])

        # Add trace for the scores with modified line width
        fig.add_trace(go.Scatter(x=indices, y=filtered_scores, mode='lines+markers', name=col,
                                 # Set width to 1 for a thinner line
                                 line=dict(width=1)))

    # Add a horizontal line for the threshold, spanning the plotted indices of all columns
    fig.add_shape(
        go.layout.Shape(
            type="line",
            x0=0,
            x1=max(max_index, 1),
            y0=score_threshold,
            y1=score_threshold,
            line=dict(color="grey", width=0.8, dash="dash")
        )
    )

    # Update layout
    fig.update_layout(title="Toxicity Scores for Text Columns with Score above threshold",
                      xaxis_title="Index",
                      yaxis_title="Toxicity Score",
                      legend_title="Text Type")

    # Show figure
    fig.show()

In [None]:
def hf_toxicity_table(df, params):
    """
    Update and return dataframe with toxicity scores for all the text columns provided.

    Parameters:
    - df (pd.DataFrame): The dataframe containing texts.
    - params (dict): Parameters containing toxicity evaluation object, column names, and the max and min generated toxicity thresholds.

    Returns:
    - pd.DataFrame: Updated dataframe with toxicity scores.
    """

    df = df.copy()  # Create a deep copy of the DataFrame

    # Extract necessary parameters
    toxicity = params["hf_toxicity_obj"]
    text_columns = params["text_columns"]
    max_toxicity_threshold = params.get(
        "max_toxicity_threshold", 1)  # default to 1 if not provided
    min_toxicity_threshold = params.get(
        "min_toxicity_threshold", 0)  # default to 0 if not provided

    for col in text_columns:
        # Get list of texts from dataframe
        texts = df[col].tolist()

        # Compute toxicity for texts
        toxicity_scores = toxicity.compute(predictions=texts)['toxicity']

        # Assign the new toxicity scores to the dataframe using .loc to avoid the warning
        df.loc[:, f"{col} Toxicity"] = toxicity_scores

        # If you want to filter rows for each column separately based on their toxicity
        # df = df[(df[f"{col} Toxicity"] >= min_toxicity_threshold) & (df[f"{col} Toxicity"] <= max_toxicity_threshold)]

    # If you want to filter rows based on the toxicity of a specific column, you can do it here.
    # For example, if you want to filter rows based on the toxicity of the first column in text_columns:
    # df = df[(df[f"{text_columns[0]} Toxicity"] >= min_toxicity_threshold) & (df[f"{text_columns[0]} Toxicity"] <= max_toxicity_threshold)]

    # Order the results by "<Column Name> Toxicity" in descending order for the first text column
    df = df.sort_values(by=f"{text_columns[0]} Toxicity", ascending=False)

    return df

In [None]:
import plotly.graph_objects as go
import plotly.subplots as sp


def hf_toxicity_histograms(df, params):
    """
    Compute toxicity scores for texts and then plot histograms for specified text columns.

    Parameters:
    - df (pd.DataFrame): The dataframe containing texts.
    - params (dict): Parameters containing toxicity evaluation object and column names.
    """

    # Extract necessary parameters
    toxicity = params["hf_toxicity_obj"]
    text_columns = params["text_columns"]

    # Determine the number of rows required based on the number of text columns
    num_rows = (len(text_columns) + 1) // 2  # +1 to handle odd number of columns

    # Create a subplot layout
    fig = sp.make_subplots(rows=num_rows, cols=2, subplot_titles=text_columns)

    subplot_height = 350  # Height of each subplot
    total_height = num_rows * subplot_height + 200  # 200 for padding, titles, etc.

    for idx, col in enumerate(text_columns, start=1):
        row = (idx - 1) // 2 + 1
        col_idx = (idx - 1) % 2 + 1  # to place subplots in two columns

        # Get list of texts from dataframe
        texts = df[col].tolist()

        # Compute toxicity for texts
        toxicity_scores = toxicity.compute(predictions=texts)['toxicity']

        # Add traces to the corresponding subplot without legend
        fig.add_trace(go.Histogram(x=toxicity_scores,
                      showlegend=False), row=row, col=col_idx)

        # Update xaxes and yaxes titles only for the first subplot
        if idx == 1:
            fig.update_xaxes(title_text="Toxicity Score", row=row, col=col_idx)
            fig.update_yaxes(title_text="Frequency", row=row, col=col_idx)

    # Update layout
    fig.update_layout(title_text="Histograms of Toxicity Scores", height=total_height)

    # Show figure
    fig.show()
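
Several plotting functions in this notebook place one subplot per text column on a two-column grid. The index arithmetic they share can be sketched on its own; the helper names `grid_position` and `rows_needed` are hypothetical and not part of any library.

```python
def grid_position(idx, cols=2):
    """Map a 1-based plot index to a 1-based (row, col) subplot position."""
    row = (idx - 1) // cols + 1
    col = (idx - 1) % cols + 1
    return row, col


def rows_needed(num_plots, cols=2):
    """Ceiling division: the number of rows needed to hold num_plots subplots."""
    return (num_plots + cols - 1) // cols


text_columns = ["input", "reference_summary", "summary_1"]
positions = [grid_position(i) for i, _ in enumerate(text_columns, start=1)]
print(rows_needed(len(text_columns)), positions)
```

With three columns the grid needs two rows, and the last cell of the second row stays empty.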

In [None]:
# Secondary functions
def general_text_metrics(df, text_column):
    nltk.download('punkt', quiet=True)

    results = []

    for text in df[text_column]:
        sentences = nltk.sent_tokenize(text)
        words = nltk.word_tokenize(text)
        paragraphs = text.split("\n\n")

        total_words = len(words)
        total_sentences = len(sentences)
        avg_sentence_length = round(sum(len(sentence.split(
        )) for sentence in sentences) / total_sentences if total_sentences else 0, 1)
        total_paragraphs = len(paragraphs)

        results.append([total_words, total_sentences,
                       avg_sentence_length, total_paragraphs])

    return pd.DataFrame(results, columns=["Total Words", "Total Sentences", "Avg Sentence Length", "Total Paragraphs"])


def vocabulary_structure_metrics(df, text_column, unwanted_tokens, lang):
    stop_words = set(word.lower() for word in stopwords.words(lang))
    unwanted_tokens = set(token.lower() for token in unwanted_tokens)

    results = []

    for text in df[text_column]:
        words = nltk.word_tokenize(text)

        filtered_words = [word for word in words if word.lower() not in stop_words and word.lower(
        ) not in unwanted_tokens and word not in string.punctuation]

        total_unique_words = len(set(filtered_words))
        total_punctuations = sum(1 for word in words if word in string.punctuation)
        lexical_diversity = round(
            total_unique_words / len(filtered_words) if filtered_words else 0, 1)

        results.append([total_unique_words, total_punctuations, lexical_diversity])

    return pd.DataFrame(results, columns=["Total Unique Words", "Total Punctuations", "Lexical Diversity"])

# Primary function


def text_description_table(df, params):
    text_column = params["text_column"]
    unwanted_tokens = params["unwanted_tokens"]
    lang = params["lang"]

    gen_metrics_df = general_text_metrics(df, text_column)
    vocab_metrics_df = vocabulary_structure_metrics(
        df, text_column, unwanted_tokens, lang)

    combined_df = pd.concat([gen_metrics_df, vocab_metrics_df], axis=1)

    return combined_df
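
To see what the vocabulary metrics measure, here is a minimal sketch that computes a few of them on a single string. It is an approximation under stated assumptions: naive regex tokenization replaces NLTK, and a tiny hand-picked stop-word set replaces `stopwords.words(lang)`, so the numbers will not match the functions above on real articles. The name `simple_text_metrics` is hypothetical.

```python
import re
import string

# Assumption: a tiny hand-picked stop-word list stands in for stopwords.words("english").
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}


def simple_text_metrics(text):
    """Compute a few of the metrics above with naive regex tokenization."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokens = re.findall(r"\w+|[^\w\s]", text)
    words = [t for t in tokens if t not in string.punctuation]
    content_words = [w for w in words if w.lower() not in STOP_WORDS]
    unique_words = set(content_words)
    return {
        "total_sentences": len(sentences),
        "total_punctuations": len(tokens) - len(words),
        "total_unique_words": len(unique_words),
        "lexical_diversity": round(len(unique_words) / len(content_words), 1) if content_words else 0,
    }


metrics = simple_text_metrics("The market rose. The market fell, and the market closed flat.")
print(metrics)
```

Lexical diversity here is the ratio of unique content words to all content words, rounded to one decimal as in `vocabulary_structure_metrics`; repeated words such as "market" pull it below 1.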

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import nltk
import pandas as pd
from nltk.corpus import stopwords
import string

# Ensuring NLTK resources are downloaded
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)


def text_description_histograms(df, params):
    """
    This function takes a dataframe and plots histograms for the specified metrics.

    Parameters:
    - df (DataFrame): DataFrame containing the text data.
    - params (dict): Dictionary containing parameters like "text_column", "unwanted_tokens", and "lang".

    Returns:
    - None. The figure with the histograms is rendered with fig.show().
    """

    text_column = params["text_column"]

    # Combined function to get all metrics
    gen_metrics_df = general_text_metrics(df, text_column)
    vocab_metrics_df = vocabulary_structure_metrics(
        df, text_column, params["unwanted_tokens"], params["lang"])
    combined_df = pd.concat([gen_metrics_df, vocab_metrics_df], axis=1)

    # Determine number of rows based on number of metrics
    # Ceiling division to ensure every metric gets a subplot
    num_rows = (combined_df.shape[1] + 1) // 2

    # Create subplots
    fig = make_subplots(rows=num_rows, cols=2, subplot_titles=combined_df.columns)

    # For each metric, plot a histogram
    for index, column in enumerate(combined_df.columns):
        row, col = divmod(index, 2)
        fig.add_trace(
            go.Histogram(x=combined_df[column], name=column),
            row=row + 1,
            col=col + 1
        )

    # Update layout for better appearance and adjust the height
    subplot_height = 400  # Define the height of each individual subplot
    total_height = num_rows * subplot_height
    fig.update_layout(title_text="Distribution of Text Metrics",
                      bargap=0.2, bargroupgap=0.1, height=total_height)

    return fig.show()

In [None]:
def text_structure_histograms(df, params):

    text_column = params["text_column"]
    num_docs_to_plot = params["num_docs_to_plot"]

    # Ensure the nltk punkt tokenizer is downloaded
    nltk.download('punkt', quiet=True)

    # Decide on the number of documents to plot
    if not num_docs_to_plot or num_docs_to_plot > len(df):
        num_docs_to_plot = len(df)

    # Colors for each subplot
    colors = ['blue', 'green', 'red', 'purple']

    # Axis titles for clarity
    x_titles = [
        "Word Frequencies",
        "Sentence Position in Document",
        "Sentence Lengths (Words)",
        "Word Lengths (Characters)"
    ]
    y_titles = [
        "Number of Words",
        "Sentence Length (Words)",
        "Number of Sentences",
        "Number of Words"
    ]

    # Iterate over each document in the DataFrame up to the user-specified limit
    for index, (idx, row) in enumerate(df.head(num_docs_to_plot).iterrows()):
        # Create subplots with a 2x2 grid for each metric
        fig = sp.make_subplots(
            rows=2, cols=2,
            subplot_titles=[
                "Word Frequencies",
                "Sentence Positions",
                "Sentence Lengths",
                "Word Lengths"
            ]
        )

        # Tokenize document into sentences and words
        sentences = nltk.sent_tokenize(row[text_column])
        words = nltk.word_tokenize(row[text_column])

        # Metrics computation
        word_freq = Counter(words)
        freq_counts = Counter(word_freq.values())
        word_frequencies = list(freq_counts.keys())
        word_frequency_counts = list(freq_counts.values())

        sentence_positions = list(range(1, len(sentences) + 1))
        sentence_lengths = [len(sentence.split()) for sentence in sentences]
        word_lengths = [len(word) for word in words]

        # Adding data to subplots
        fig.add_trace(go.Bar(x=word_frequencies, y=word_frequency_counts,
                      marker_color=colors[0], showlegend=False), row=1, col=1)
        fig.add_trace(go.Bar(x=sentence_positions, y=sentence_lengths,
                      marker_color=colors[1], showlegend=False), row=1, col=2)
        fig.add_trace(go.Histogram(x=sentence_lengths, nbinsx=50, opacity=0.75,
                      marker_color=colors[2], showlegend=False), row=2, col=1)
        fig.add_trace(go.Histogram(x=word_lengths, nbinsx=50, opacity=0.75,
                      marker_color=colors[3], showlegend=False), row=2, col=2)

        # Update x and y axis titles
        for i, (x_title, y_title) in enumerate(zip(x_titles, y_titles)):
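            # Set the title text and font size together; 'titlefont' is deprecated in recent Plotly versions
            fig['layout'][f'xaxis{i + 1}'].update(
                title=dict(text=x_title, font=dict(size=10)))
            fig['layout'][f'yaxis{i + 1}'].update(
                title=dict(text=y_title, font=dict(size=10)))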

        # Update layout
        fig.update_layout(
            title=f"Text Description for Document {index + 1}",
            barmode='overlay',
            height=800
        )

        fig.show()
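
The "Word Frequencies" panel in `text_structure_histograms` plots a frequency of frequencies: for each occurrence count, how many distinct words appear that many times. A minimal sketch of the two `Counter` steps, on a made-up sentence split on whitespace instead of NLTK tokens:

```python
from collections import Counter

words = "the cat saw the dog and the dog saw the cat".split()

word_freq = Counter(words)                 # word -> number of occurrences
freq_counts = Counter(word_freq.values())  # occurrence count -> number of distinct words

print(dict(word_freq))
print(dict(freq_counts))
```

In the plot, the keys of `freq_counts` go on the x-axis and its values on the y-axis, so a tall bar at 1 means many words occur only once.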

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Assuming the previously defined functions and parameters are present


def text_description_scatter(df, params):
    """
    This function takes a dataframe and plots scatter plots for the specified combinations of metrics.

    Parameters:
    - df (DataFrame): DataFrame containing the text data.
    - params (dict): Dictionary containing parameters like "combinations_to_plot", "text_column", "unwanted_tokens", and "lang".

    Returns:
    - None. The figure with the scatter plots is rendered with fig.show().
    """

    text_column = params["text_column"]

    # Combined function to get all metrics
    gen_metrics_df = general_text_metrics(df, text_column)
    vocab_metrics_df = vocabulary_structure_metrics(
        df, text_column, params["unwanted_tokens"], params["lang"])
    combined_df = pd.concat([gen_metrics_df, vocab_metrics_df], axis=1)

    combinations_to_plot = params["combinations_to_plot"]
    num_combinations = len(combinations_to_plot)

    # Determine number of rows based on number of combinations
    num_rows = (num_combinations + 1) // 2  # Ceiling division

    # Create subplots
    fig = make_subplots(rows=num_rows, cols=2, subplot_titles=[
                        f"{x} vs {y}" for x, y in combinations_to_plot])

    # For each combination, plot a scatter plot
    for index, (x_column, y_column) in enumerate(combinations_to_plot):
        row, col = divmod(index, 2)
        fig.add_trace(
            go.Scatter(x=combined_df[x_column], y=combined_df[y_column],
                       mode='markers', showlegend=False),
            row=row + 1,
            col=col + 1
        )
        # Update axis titles
        fig.update_xaxes(title_text=x_column, row=row + 1, col=col + 1)
        fig.update_yaxes(title_text=y_column, row=row + 1, col=col + 1)

    # Update layout for better appearance
    subplot_height = 400  # Define the height of each individual subplot
    total_height = num_rows * subplot_height
    fig.update_layout(
        title_text="Scatter Plots of Text Metrics Combinations", height=total_height)

    return fig.show()

# You can call the function with:
# text_description_scatter(df, params)

In [None]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from transformers import BertTokenizer


def token_distribution_histograms(df, params):
    """
    Visualize the token counts distribution of given columns using histograms.

    :param df: DataFrame containing the text columns.
    :param params: Dictionary with the key ["text_columns"].
    """

    df = df.copy()  # Create a deep copy of the DataFrame
    text_columns = params["text_columns"]

    # Define an extended list of colors for the subplots
    colors = [
        'blue', 'red', 'green', 'purple', 'orange', 'pink', 'yellow', 'brown',
        'grey', 'cyan', 'magenta', 'lime', 'navy', 'maroon', 'olive', 'teal',
        'aqua', 'fuchsia', 'salmon', 'indigo'
    ]

    # Initialize the tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Determine the number of rows required based on the number of text columns
    num_rows = (len(text_columns) + 1) // 2

    # Create subplots
    fig = make_subplots(rows=num_rows, cols=2, subplot_titles=text_columns)

    for idx, col in enumerate(text_columns, start=1):
        row = (idx - 1) // 2 + 1
        col_idx = (idx - 1) % 2 + 1

        # Tokenize the column and get the number of tokens
        tokens_col = f'tokens_{idx}'
        df[tokens_col] = df[col].apply(lambda x: len(tokenizer.tokenize(x)))

        # Add histogram with color corresponding to the index
        fig.add_trace(go.Histogram(x=df[tokens_col],
                                   # Cycle through the color list
                                   marker_color=colors[(idx - 1) % len(colors)],
                                   showlegend=False),
                      row=row, col=col_idx)

        # Only add axis titles to the first subplot
        if idx == 1:
            fig.update_yaxes(title_text='Number of Documents', row=row, col=col_idx)
            fig.update_xaxes(title_text='Number of Tokens', row=row, col=col_idx)

    # Update layout
    subplot_height = 300  # Height of each subplot
    total_height = num_rows * subplot_height + 200  # 200 for padding, titles, etc.
    fig.update_layout(title_text='Token Distributions', bargap=0.1,
                      height=total_height, showlegend=False)

    # Show the plot
    fig.show()

In [None]:
from rouge import Rouge
import pandas as pd
import plotly.graph_objects as go
from itertools import combinations
from plotly.subplots import make_subplots


def rouge_scores_plot(df, params):
    """
    Compute ROUGE scores for each unique pair of text columns in the DataFrame and visualize them.

    :param df: DataFrame containing the summaries.
    :param params: Dictionary with the keys ["text_columns", "metric"].
    """

    # Extract parameters
    metric = params.get("metric", "rouge-2")
    text_columns = params["text_columns"]

    if metric not in ["rouge-1", "rouge-2", "rouge-l", "rouge-s"]:
        raise ValueError(
            "Invalid metric. Choose from 'rouge-1', 'rouge-2', 'rouge-l', 'rouge-s'.")

    rouge = Rouge(metrics=[metric])

    # Determine all unique pairs of text columns
    pairs = list(combinations(text_columns, 2))
    num_rows = (len(pairs) + 1) // 2

    # Create subplots
    fig = make_subplots(rows=num_rows, cols=2, subplot_titles=[
                        f'{ref} vs {gen}' for ref, gen in pairs])

    color_dict = {'p': 'blue', 'r': 'green', 'f': 'red'}

    for idx, (ref_column, gen_column) in enumerate(pairs, start=1):
        score_list = []

        # Use 'record' for the dataframe row so it is not confused with the subplot 'row' below
        for _, record in df.iterrows():
            scores = rouge.get_scores(
                record[gen_column], record[ref_column], avg=True)[metric]
            score_list.append(scores)

        df_scores = pd.DataFrame(score_list)
        row = (idx - 1) // 2 + 1
        col = (idx - 1) % 2 + 1

        # Adding the line plots
        for score_type, color in color_dict.items():
            fig.add_trace(
                go.Scatter(x=df_scores.index, y=df_scores[score_type], mode='lines+markers',
                           name="Precision" if score_type == 'p' else "Recall" if score_type == 'r' else "F1",
                           line=dict(color=color),
                           showlegend=True if idx == 1 else False),  # Show legend only in the first subplot
                row=row, col=col
            )

    # Set layout properties
    fig.update_layout(
        title=f"ROUGE-{metric.split('-')[-1].upper()} Scores",
        xaxis_title="Row Index",
        yaxis_title="Score",
        height=num_rows * 300  # You might want to adjust this value based on your actual needs
    )

    # Show the plot
    fig.show()
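
For intuition about the precision (`p`), recall (`r`), and F1 (`f`) values plotted by `rouge_scores_plot`, the following is a simplified, hand-rolled ROUGE-N. It is a sketch under stated assumptions (lowercased whitespace tokens, clipped n-gram counts) and is not the implementation used by the `rouge` package, whose tokenization and counting may give different numbers. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(hypothesis, reference, n=2):
    """Simplified ROUGE-N: clipped n-gram overlap on whitespace tokens."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((hyp & ref).values())  # clipped counts of shared n-grams
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f1}


scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
print(scores)
```

Precision is the share of the generated summary's n-grams that also occur in the reference, and recall is the share of the reference's n-grams that the generated summary recovers.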

### Hugging Face model wrappers

The following code template shows how to wrap a Hugging Face model for compatibility with the ValidMind Developer Framework. We will load an example model using the transformers API and then run some predictions on our test dataset.

The ValidMind Developer Framework supports Hugging Face transformers out of the box, so a later section shows how to initialize multiple transformers models with the `init_model` function, removing the need for a custom wrapper. In cases where you need extra pre-processing or post-processing steps, you can use the following code template as a starting point to wrap your model.

In [None]:
from dataclasses import dataclass
from typing import Any, Optional
import pandas as pd
from transformers import pipeline


@dataclass
class AbstractSummarization_HuggingFace:
    """
    A VM Model instance wrapper for abstract summarization using HuggingFace Transformers.
    """
    # typing.Any is the type annotation; the builtin any() is a function, not a type
    model: Any
    tokenizer: Any
    predicted_prob_values: Optional[list] = None

    def __init__(self, model_name=None, model=None, tokenizer=None):
        pipeline_task = "summarization"
        self.model_name = model_name
        self.pipeline_task = pipeline_task
        self.model = pipeline(pipeline_task, model=model, tokenizer=tokenizer)

    def predict(self, texts, params=None):
        """
        Generates summaries for the given texts.

        Parameters:
        - texts (list): List of texts to be summarized.
        - params (dict, optional): Dictionary that may contain "min_length" and/or "max_length" to control the produced summary's length.

        Returns:
        - List of summaries.
        """

        # Avoid a mutable default argument; treat a missing params dict as empty
        params = params or {}
        min_length = params.get("min_length")
        max_length = params.get("max_length")

        # If either value is None, don't pass it to the model function
        model_args = {}
        if min_length is not None:
            model_args["min_length"] = min_length
        if max_length is not None:
            model_args["max_length"] = max_length

        summaries = []

        for text in texts:
            data = [str(text)]
            # Using ** unpacking to pass arguments conditionally
            results = self.model(data, **model_args)
            results_df = pd.DataFrame(results)
            summary = results_df["summary_text"].values[0] if "summary_text" in results_df.columns else results_df["label"].values[0]
            summaries.append(summary)

        return summaries

    def predict_proba(self):
        """
        Retrieves predicted probabilities after prediction.
        Note: Not all models provide predicted probabilities.
        """
        if self.predicted_prob_values is None:
            raise ValueError(
                "First run predict method to retrieve predicted probabilities")
        return self.predicted_prob_values

    def description(self):
        """
        Describes the methods available in the class.

        The class provides methods for abstract summarization using HuggingFace Transformers.

        The predict method:
        1. Generates summaries for given texts.
        2. Accepts a 'params' dictionary which can contain optional 'min_length' and 'max_length' parameters to control the length of the produced summary.

        The predict_proba method:
        Retrieves predicted probabilities after prediction (if available by the model).
        """
        return self.description.__doc__

In [None]:
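from dataclasses import dataclass
from typing import Any

import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity


@dataclass
class ExtractiveSummarization_BERT:
    # typing.Any is the annotation for "any type"; the builtin any() is a function
    model: Any
    tokenizer: Any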

    def _get_embedding(self, text):
        inputs = self.tokenizer(text, return_tensors="pt",
                                truncation=True, max_length=512, padding='max_length')
        with torch.no_grad():
            output = self.model(**inputs)
        return output['last_hidden_state'].mean(dim=1).squeeze().detach().numpy()

    def predict(self, texts, params=None):
        summaries = []

        # Avoid a mutable default argument; treat missing params as empty
        params = params or {}

        # Extract summary_length from params, default to None if not provided
        summary_length = params.get("summary_length", None)

        for text in texts:
            sentences = text.split('. ')

            document_embedding = self._get_embedding(' '.join(sentences))
            sentence_embeddings = [self._get_embedding(
                sentence) for sentence in sentences]
            similarities = cosine_similarity(sentence_embeddings, [document_embedding])
            # Flatten to 1-D so slicing also works when the text has a single sentence
            sorted_indices = np.argsort(similarities, axis=0)[::-1].ravel()

            # Determine summary length
            if summary_length is None:  # If not provided, use 20% of the total sentence count
                # Keep at least one sentence so short texts do not yield an empty summary
                top_k = max(1, int(len(sentences) * 0.2))
            else:  # If provided, use the user-defined length
                top_k = summary_length

            # Extract the top sentences based on the chosen summary length
            selected_sentences = [sentences[i] for i in sorted_indices[:top_k]]

            summaries.append(' '.join(selected_sentences))

        return summaries

    def description(self):
        """
        Provides a description of the methods available for extractive summarization with the current model.

        The model ranks sentences in the input text based on their similarity to the overall document meaning, as determined by BERT embeddings. The top-ranked sentences are selected to form the summary.

        The length of the summary can be controlled in two ways:
        1. Automatically: Where the model summarizes to approximately 20% of the original text's sentence count.
        2. User-defined: By specifying the 'summary_length' parameter in the 'params' dictionary when calling the 'predict' method.

        """
        return self.description.__doc__

## 2. Load Data

In this section, we load the financial news dataset that serves as the foundation for our summarization tasks.

The dataset has the columns "ID", "topic", "input", and "reference_summary". Each record is identified by a unique "ID". The "topic" column gives the category or theme of the news item, the "input" column contains the full article text, and "reference_summary" provides a concise summary of that article. The content is sourced from BBC News.

In [None]:
df_summarization = load_text_data(
    filepath='./datasets/bbc_text_cls_reference.csv',
    num_records=100
)

display_nice(df_summarization, num_rows=2)

## 3. Extractive Summarization: Hugging Face-BERT Model

Extractive summarization is a technique used to produce a summary by selecting and extracting whole sentences or passages from the source document without modifying the original content. Essentially, it identifies and "extracts" the most important and relevant information from the larger text to create a condensed version.
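
To make the idea concrete, here is a minimal sketch of extractive summarization that needs no model download. It is not the wrapper used in this notebook: it swaps the BERT embeddings for plain bag-of-words counts, and the names `bow_cosine` and `toy_extractive_summary` are hypothetical helpers introduced only for illustration. The selection logic is the same in spirit: score each sentence by its similarity to the whole document and keep the top-ranked ones.

```python
import math
import re
from collections import Counter


def bow_cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_extractive_summary(text: str, top_k: int = 1) -> str:
    """Keep the top_k sentences that are most similar to the whole document."""
    # Same naive sentence split as the wrapper above
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    doc_vector = Counter(re.findall(r"\w+", text.lower()))
    scores = [
        bow_cosine(Counter(re.findall(r"\w+", s.lower())), doc_vector)
        for s in sentences
    ]
    # Rank sentence indices by score, then restore document order for readability
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:top_k])
    return ". ".join(sentences[i] for i in keep)


# Made-up example text, for illustration only
example = (
    "Shares in the bank rose sharply. "
    "The bank reported record profits for the year. "
    "Analysts were surprised."
)
print(toy_extractive_summary(example, top_k=1))
```

Note that the sentences are copied verbatim from the input, which is the defining property of the extractive approach. Unlike the BERT wrapper, this sketch returns the selected sentences in their original order rather than in rank order.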

In [None]:
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
extractive_model = ExtractiveSummarization_BERT(model=model, tokenizer=tokenizer)

In [None]:
data = df_summarization.input.values.tolist()

params = {
    "summary_length": None,
}

extractive_summary = extractive_model.predict(data, params)

df_summarization = add_list_to_df(
    df=df_summarization,
    column_data=extractive_summary,
    column_name="extractive_summary"
)

display_nice(df_summarization, num_rows=2)

## 4. Abstractive Summarization: Hugging Face-T5 Model

Abstractive summarization is a technique used to produce a summary by understanding the main ideas of a source document and then expressing those ideas in a concise manner using new words and sentences, often not present in the original text. Instead of just extracting and repurposing existing sentences from the document, as is done in extractive summarization, abstractive summarization aims to capture the essence or meaning of the content and convey it in a shorter form.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
abstract_model = AbstractSummarization_HuggingFace(model=model, tokenizer=tokenizer)

In [None]:
data = df_summarization.input.values.tolist()

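# The abstract wrapper reads "min_length" and "max_length";
# None leaves the pipeline defaults in place
params = {
    "min_length": None,
    "max_length": None
}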

abstract_summary = abstract_model.predict(data, params)

df_summarization = add_list_to_df(
    df=df_summarization,
    column_data=abstract_summary,
    column_name="abstract_summary"
)

display_nice(df_summarization, num_rows=2)

## 5. Validation of Input Text

### Input Text Description

##### Text Description Table

- Total Words: Assess the length and complexity of the input text. Longer documents might require more sophisticated summarization techniques, while shorter ones may need more concise summaries.

- Total Sentences: Understand the structural makeup of the content. Longer texts with numerous sentences might require the model to generate longer summaries to capture essential information.

- Avg Sentence Length: Determine the average length of sentences in the text. This can help the model decide on the appropriate length for generated summaries, ensuring they are coherent and readable.

- Total Paragraphs: Analyze how the content is organized into paragraphs. The model should be able to maintain the logical structure of the content when producing summaries.

- Total Unique Words: Measure the diversity of vocabulary in the text. A higher count of unique words could indicate more complex content, which the model needs to capture accurately.

- Most Common Words: Identify frequently occurring words that likely represent key themes. The model should pay special attention to including these words and concepts in its summaries.

- Total Punctuations: Evaluate the usage of punctuation marks, which contribute to the tone and structure of the content. The model should be able to maintain appropriate punctuation in summaries.

- Lexical Diversity: Calculate the richness of vocabulary in relation to the overall text length. A higher lexical diversity suggests a broader range of ideas and concepts that the model needs to capture in its summaries.
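
As a rough illustration of how the metrics above can be computed, the following sketch derives them for a single document using only the standard library. It is an assumption-laden simplification, not the implementation behind `text_description_table`: `describe_text` is a hypothetical helper, tokenization is a naive regular expression, and the `unwanted_tokens` and `lang` parameters are ignored.

```python
import re
import string
from collections import Counter


def describe_text(text: str, top_n: int = 3) -> dict:
    """Compute simple descriptive statistics for one document."""
    # Naive tokenization: lowercase runs of letters, digits and apostrophes
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    # Naive sentence split: ., ! or ? followed by whitespace or end of text
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    # Paragraphs are assumed to be separated by blank lines
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    unique_words = set(words)
    return {
        "Total Words": len(words),
        "Total Sentences": len(sentences),
        "Avg Sentence Length": len(words) / len(sentences) if sentences else 0.0,
        "Total Paragraphs": len(paragraphs),
        "Total Unique Words": len(unique_words),
        "Most Common Words": [w for w, _ in Counter(words).most_common(top_n)],
        "Total Punctuations": sum(ch in string.punctuation for ch in text),
        # Unique words divided by total words
        "Lexical Diversity": len(unique_words) / len(words) if words else 0.0,
    }


# Made-up sample text, for illustration only
sample = "Profits rose. Profits fell!\n\nMarkets were calm."
stats = describe_text(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Lexical diversity here is the ratio of unique words to total words, so it lies between 0 and 1 and tends to fall as documents get longer.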

In [None]:
params = {
    "text_column": "input",
    "unwanted_tokens": {'s', 's\'', 'mr', 'ms', 'mrs', 'dr', '\'s', ' ', "''", 'dollar', 'us', '``'},
    "lang": "english"
}

df_text_description = text_description_table(df_summarization, params)
display(df_text_description)

##### Text Description Histogram

In [None]:
params = {
    "text_column": "input",
    "unwanted_tokens": {'s', 's\'', 'mr', 'ms', 'mrs', 'dr', '\'s', ' ', "''", 'dollar', 'us', '``'},
    "lang": "english"
}

text_description_histograms(df_summarization, params)

##### Text Description Scatter

In [None]:
# Define the combinations you want to plot
combinations_to_plot = [
    ("Total Words", "Total Sentences"),
    ("Total Words", "Total Unique Words"),
    ("Total Sentences", "Avg Sentence Length"),
    ("Total Unique Words", "Lexical Diversity")
]

params = {
    "combinations_to_plot": combinations_to_plot,
    "text_column": "input",
    "unwanted_tokens": {'s', 's\'', 'mr', 'ms', 'mrs', 'dr', '\'s', ' ', "''", 'dollar', 'us', '``'},
    "lang": "english"
}

text_description_scatter(df_summarization, params)

##### Text Structure Histogram

- Word Frequencies: This metric provides a histogram of how often words appear with a given frequency. For example, if a lot of words appear only once in a document, it might be indicative of a text rich in unique words. On the other hand, a small set of words appearing very frequently might indicate repetitive content or a certain theme or pattern in the text.

- Sentence Positions vs. Sentence Lengths: This bar chart showcases the length of each sentence (in terms of word count) in their order of appearance in the document. This can give insights into the flow of information in a text, highlighting any long, detailed sections or brief, potentially superficial areas.

- Sentence Lengths Distribution: A histogram showing the frequency of sentence lengths across the document. Long sentences might contain a lot of information but could be harder for summarization models to digest and for readers to comprehend. Conversely, many short sentences might indicate fragmented information.

- Word Lengths Distribution: A histogram of the lengths of words in the document. Extremely long words might be anomalies, technical terms, or potential errors in the corpus. Conversely, a majority of very short words might denote lack of depth or specificity.
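
The four distributions described above can be sketched with standard-library counters. This is only an illustration of the underlying quantities, assuming naive regex tokenization; `structure_distributions` is a hypothetical helper and not the code behind `text_structure_histograms`, which also handles the plotting.

```python
import re
from collections import Counter


def structure_distributions(text: str) -> dict:
    """Compute the raw counts behind the four text structure plots."""
    words = re.findall(r"\w+", text.lower())
    # Naive sentence split: ., ! or ? followed by whitespace or end of text
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    word_counts = Counter(words)
    sentence_lengths = [len(re.findall(r"\w+", s)) for s in sentences]
    return {
        # How many distinct words occur exactly k times (k -> number of words)
        "word_frequencies": Counter(word_counts.values()),
        # Word count of each sentence, in order of appearance
        "sentence_lengths_by_position": sentence_lengths,
        # How many sentences have a given word count
        "sentence_length_distribution": Counter(sentence_lengths),
        # How many words have a given character length
        "word_length_distribution": Counter(len(w) for w in words),
    }


# Made-up sample text, for illustration only
doc = "The bank cut rates. The market rallied. Rates may fall again soon."
dist = structure_distributions(doc)
for name, value in dist.items():
    print(name, dict(value) if isinstance(value, Counter) else value)
```

Each `Counter` maps a bin (a frequency, a sentence length, or a word length) to a count, which is the data a histogram would display.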

In [None]:
params = {
    "text_column": 'input',
    "num_docs_to_plot": 2
}

text_structure_histograms(df_summarization, params)

## 6. Validation of Generated Text

### Generated Summary Description

##### Token Distribution Histograms

In [None]:
text_columns = ["input", "reference_summary", "extractive_summary", "abstract_summary"]

params = {
    "text_columns": text_columns,
}

token_distribution_histograms(df_summarization, params)

### Generated Summary Accuracy 

##### ROUGE-N Score

The ROUGE score (Recall-Oriented Understudy for Gisting Evaluation) is a widely adopted set of metrics for evaluating automatic summarization and machine translation. It measures the overlap between the n-grams in the generated summary and those in the reference summary.

- ROUGE-N: This evaluates the overlap of n-grams between the produced summary and reference summary. It calculates precision (the proportion of n-grams in the generated summary that are also present in the reference summary), recall (the proportion of n-grams in the reference summary that are also present in the generated summary), and F1 score (the harmonic mean of precision and recall).

- ROUGE-L: This metric is based on the Longest Common Subsequence (LCS) between the generated and reference summaries. The LCS is the longest sequence of tokens that appears in both summaries in the same relative order, although the tokens do not need to be contiguous. It's beneficial because it can identify and reward longer coherent matching sequences.

- ROUGE-S: This measures skip-bigram overlap. A skip-bigram is any pair of words taken in their sentence order, with arbitrary gaps or "skips" allowed between them. This can be valuable for capturing sentence-level structural similarity.
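
The definitions above can be sketched in a few lines of plain Python. This is a simplified illustration, assuming whitespace tokenization and lowercasing only; established ROUGE packages typically add stemming and other normalization, so their scores may differ. The helper names (`ngrams`, `rouge_n`, `lcs_length`, `rouge_l`) are hypothetical and are not the functions used by `rouge_scores_plot`.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """ROUGE-N precision, recall and F1 from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return {"precision": precision, "recall": recall, "f1": f1_score(precision, recall)}


def lcs_length(a, b):
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate: str, reference: str) -> dict:
    """ROUGE-L precision, recall and F1 based on the LCS length."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    return {"precision": precision, "recall": recall, "f1": f1_score(precision, recall)}


# Made-up sentences, for illustration only
reference = "the cat sat on the mat"
generated = "the cat lay on the mat"
print("ROUGE-1:", rouge_n(generated, reference, n=1))
print("ROUGE-2:", rouge_n(generated, reference, n=2))
print("ROUGE-L:", rouge_l(generated, reference))
```

In this example the two sentences differ by one word, so ROUGE-2 is penalized more heavily than ROUGE-1: a single substitution breaks two bigrams but only one unigram.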

In [None]:
params = {
    "text_columns": text_columns,
    "metric": "rouge-2",
}

rouge_scores_plot(df_summarization, params)

### Generated Summary Toxicity

##### Toxicity Table

In [None]:
params = {
    "hf_toxicity_obj": toxicity,
    "text_columns": text_columns,
    "max_toxicity_threshold": 0.7,
    "min_toxicity_threshold": 0
}

df_metric_results = hf_toxicity_table(df_summarization, params)
display(df_metric_results)

##### Toxicity Histograms

In [None]:
params = {
    "hf_toxicity_obj": toxicity,
    "text_columns": text_columns
}

hf_toxicity_histograms(df_summarization, params)

##### Toxicity Plots

In [None]:
params = {
    "hf_toxicity_obj": toxicity,
    "text_columns": text_columns,
    "score_threshold": 0
}
hf_toxicity_plot(df_summarization, params)