# Summarization of financial data using Hugging Face LLM models
This notebook introduces how to document an LLM using the ValidMind Developer Framework. The use case presented is the summarization of financial data (https://huggingface.co/datasets/financial_phrasebank). It covers:

- Initializing the ValidMind Developer Framework
- Running various tests to quickly generate documentation about the data and model

## Before you begin

To use the ValidMind Developer Framework with a Jupyter notebook, you need to install and initialize the client library first, along with getting your Python environment ready.

If you don't already have one, you should also [create a documentation project](https://docs.validmind.ai/guide/create-your-first-documentation-project.html) on the ValidMind platform. You will use this project to upload your documentation and test results.

## Install the client library

In [1]:
# %pip install --upgrade validmind

## Initialize the client library

In a browser, go to the **Client Integration** page of your documentation project and click **Copy to clipboard** next to the code snippet. This code snippet gives you the API key, API secret, and project identifier to link your notebook to your documentation project.

::: {.column-margin}
::: {.callout-tip}
This step requires a documentation project. [Learn how you can create one](https://docs.validmind.ai/guide/create-your-first-documentation-project.html).
:::
:::

Next, replace this placeholder with your own code snippet:

In [2]:
## Replace the code below with the code snippet from your project ## 


import validmind as vm

vm.init(
  api_host = "http://localhost:3000/api/v1/tracking",
  api_key = "2494c3838f48efe590d531bfe225d90b",
  api_secret = "4f692f8161f128414fef542cab2a4e74834c75d01b3a8e088a1834f2afcfe838",
  project = "cllnq0ckr000273y6ev40pmb5"
)

import sys
print(sys.executable)

2023-08-29 17:28:33,136 - INFO(validmind.api_client): Connected to ValidMind. Project: [11] Credit Risk Scorecard - Initial Validation (cllnq0ckr000273y6ev40pmb5)


/Users/juanvalidmind/Library/Caches/pypoetry/virtualenvs/validmind-X_uvMH0R-py3.10/bin/python
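
Hard-coding the API key and secret in a notebook makes them easy to leak when the notebook is shared. The following is a minimal sketch of one alternative, assuming you export the values as environment variables beforehand. The variable names (`VM_DEMO_API_KEY` and so on) and the helper `load_vm_credentials` are hypothetical and not part of the ValidMind library; only the `vm.init` keyword arguments come from the snippet above.

```python
import os


def load_vm_credentials(env=os.environ):
    """Collect vm.init() keyword arguments from environment variables.

    The variable names below are illustrative assumptions; use whatever
    names fit your own setup.
    """
    names = {
        "api_host": "VM_DEMO_API_HOST",
        "api_key": "VM_DEMO_API_KEY",
        "api_secret": "VM_DEMO_API_SECRET",
        "project": "VM_DEMO_PROJECT",
    }
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in names.items()}


# Usage (commented out so the sketch runs without a ValidMind connection):
# import validmind as vm
# vm.init(**load_vm_credentials())
```

With this approach the notebook itself never contains the secret, so it can be committed or shared as-is.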


### Import Libraries

In [3]:
from transformers import pipeline
from transformers import BertTokenizer, BertModel
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.metrics.pairwise import cosine_similarity
import textwrap
import numpy as np
import pandas as pd
from pprint import pprint
import torch

from dataclasses import dataclass

### Preprocessing functions

In [4]:
import pandas as pd
import textwrap
from tabulate import tabulate
from IPython.display import display, HTML

def _format_cell_text(text, width=50):  
    """Private function to format a cell's text."""
    return '\n'.join([textwrap.fill(line, width=width) for line in text.split('\n')])

def _format_dataframe_for_tabulate(df):
    """Private function to format the entire DataFrame for tabulation."""
    df_out = df.copy()
    
    # Format all string columns
    for column in df_out.columns:
        if df_out[column].dtype == object:  # Check if column is of type object (likely strings)
            df_out[column] = df_out[column].apply(_format_cell_text)
    return df_out

def _dataframe_to_html_table(df):
    """Private function to convert a DataFrame to an HTML table."""
    headers = df.columns.tolist()
    table_data = df.values.tolist()
    return tabulate(table_data, headers=headers, tablefmt="html")

def display_formatted_dataframe(df, num_rows=None):
    """Primary function to format and display a DataFrame."""
    if num_rows is not None:
        df = df.head(num_rows)
    formatted_df = _format_dataframe_for_tabulate(df)
    html_table = _dataframe_to_html_table(formatted_df)
    display(HTML(html_table))
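
To see what the wrapping helper does to a single cell, here is a small standalone sketch that uses only `textwrap`. The sample string is made up for illustration, and `format_cell_text` simply mirrors the logic of `_format_cell_text` above.

```python
import textwrap


def format_cell_text(text, width=50):
    """Wrap each newline-separated line of a cell to at most `width` characters."""
    return "\n".join(textwrap.fill(line, width=width) for line in text.split("\n"))


# Made-up sample text; the explicit "\n" is preserved as a line break.
sample = "Quarterly profits rose sharply on the back of higher advertising sales.\nRevenue also grew."
wrapped = format_cell_text(sample, width=30)
print(wrapped)
```

Existing line breaks are kept, and each line is wrapped independently, which is why long article bodies stay readable inside the HTML table.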


In [5]:
def add_summaries_to_df(df, summaries):
    """
    Adds a new column 'summary_X' to the dataframe df that contains the given summaries, where X is an incremental number.

    Parameters:
    - df: The original pandas DataFrame.
    - summaries: List/array of summarized texts.

    Returns:
    - A new DataFrame with an additional summary column, ordered as 'topic', 'input', 'reference_summary', then the summary columns, then any remaining columns.
    """

    df = df.copy()  # Make an explicit copy of the DataFrame

    # Check if the length of summaries matches the number of rows in the DataFrame
    if len(summaries) != len(df):
        raise ValueError(f"The number of summaries ({len(summaries)}) does not match the number of rows in the DataFrame ({len(df)}).")

    # Determine the name for the new summary column
    col_index = 1
    col_name = 'summary_1'
    while col_name in df.columns:
        col_index += 1
        col_name = f'summary_{col_index}'

    # Add the summaries to the DataFrame
    df[col_name] = summaries

    # Rearrange the DataFrame columns to have 'topic' first, then the original 'input', followed by summary columns
    summary_columns = [col for col in df.columns if col.startswith('summary')]
    other_columns = [col for col in df.columns if col not in summary_columns + ['topic', 'input', 'reference_summary']]
    
    columns_order = ['topic', 'input', 'reference_summary'] + sorted(summary_columns) + other_columns
    df = df[columns_order]

    return df
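
The column-naming and ordering logic can be tried out without a DataFrame. The sketch below works on plain lists of column names; `next_summary_column` is a hypothetical helper written for this illustration. It also shows one caveat, under the assumption that more than nine summary columns are ever added: plain `sorted()` compares names as strings, so sorting on the numeric suffix may be preferable.

```python
def next_summary_column(columns):
    """Return the first 'summary_N' name (N = 1, 2, ...) not already in `columns`."""
    index = 1
    while f"summary_{index}" in columns:
        index += 1
    return f"summary_{index}"


columns = ["topic", "input", "reference_summary", "summary_1", "summary_2"]
new_column = next_summary_column(columns)
print(new_column)

# Plain sorted() orders names as strings, so 'summary_10' lands before 'summary_2'.
names = ["summary_2", "summary_10", "summary_1"]
string_order = sorted(names)
print(string_order)

# Sorting on the numeric suffix keeps the columns in the order they were added.
numeric_order = sorted(names, key=lambda name: int(name.rsplit("_", 1)[1]))
print(numeric_order)
```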


In [6]:
import pandas as pd
from rouge import Rouge
import plotly.graph_objects as go

def calculate_rouge_scores(df, ref_column, gen_column, metric="rouge-2"):
    """
    Compute ROUGE scores for each row in the DataFrame.

    :param df: DataFrame containing the summaries
    :param ref_column: Column name for the reference summaries
    :param gen_column: Column name for the generated summaries
    :param metric: Type of ROUGE metric ("rouge-1", "rouge-2", "rouge-l", "rouge-s")
    :return: DataFrame with ROUGE scores for each row
    """
    if metric not in ["rouge-1", "rouge-2", "rouge-l", "rouge-s"]:
        raise ValueError("Invalid metric. Choose from 'rouge-1', 'rouge-2', 'rouge-l', 'rouge-s'.")
    
    rouge = Rouge(metrics=[metric])
    score_list = []
    
    for _, row in df.iterrows():
        scores = rouge.get_scores(row[gen_column], row[ref_column], avg=True)[metric]
        score_list.append(scores)
    
    return pd.DataFrame(score_list)

def visualize_rouge_scores(df_scores):
    """
    Visualize ROUGE scores using Plotly line plots for each row.

    :param df_scores: DataFrame of ROUGE scores.
    """
    fig = go.Figure()

    # Adding the line plots
    fig.add_trace(go.Scatter(x=df_scores.index, y=df_scores['p'], mode='lines+markers', name='Precision'))
    fig.add_trace(go.Scatter(x=df_scores.index, y=df_scores['r'], mode='lines+markers', name='Recall'))
    fig.add_trace(go.Scatter(x=df_scores.index, y=df_scores['f'], mode='lines+markers', name='F1 Score'))

    fig.update_layout(
        title="ROUGE Scores for Each Row",
        xaxis_title="Row Index",
        yaxis_title="Score"
    )
    fig.show()

### POC Validation Metrics

In [7]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from collections import Counter
import string
import plotly.express as px
from itertools import combinations

# First function
def general_text_metrics(df, text_column):
    nltk.download('punkt', quiet=True)
    
    results = []

    for text in df[text_column]:
        sentences = nltk.sent_tokenize(text)
        words = nltk.word_tokenize(text)
        paragraphs = text.split("\n\n")

        total_words = len(words)
        total_sentences = len(sentences)
        avg_sentence_length = round(sum(len(sentence.split()) for sentence in sentences) / total_sentences if total_sentences else 0, 1)
        total_paragraphs = len(paragraphs)

        results.append([total_words, total_sentences, avg_sentence_length, total_paragraphs])

    return pd.DataFrame(results, columns=["Total Words", "Total Sentences", "Avg Sentence Length", "Total Paragraphs"])

# Second function
def vocabulary_structure_metrics(df, text_column, unwanted_tokens, num_top_words, lang):
    nltk.download('stopwords', quiet=True)  # Stop-word lists are a separate NLTK resource
    stop_words = set(word.lower() for word in stopwords.words(lang))
    unwanted_tokens = set(token.lower() for token in unwanted_tokens)

    results = []

    for text in df[text_column]:
        words = nltk.word_tokenize(text)

        filtered_words = [word for word in words if word.lower() not in stop_words and word.lower() not in unwanted_tokens and word not in string.punctuation]

        total_unique_words = len(set(filtered_words))
        total_punctuations = sum(1 for word in words if word in string.punctuation)
        lexical_diversity = round(total_unique_words / len(filtered_words) if filtered_words else 0, 1)

        results.append([total_unique_words, total_punctuations, lexical_diversity])

    return pd.DataFrame(results, columns=["Total Unique Words", "Total Punctuations", "Lexical Diversity"])

# Wrapper function that combines the outputs
def text_metrics_summary_table(df, params):
    text_column = params["text_column"]
    unwanted_tokens = params["unwanted_tokens"]
    num_top_words = params["num_top_words"]
    lang = params["lang"]
    
    gen_metrics_df = general_text_metrics(df, text_column)
    vocab_metrics_df = vocabulary_structure_metrics(df, text_column, unwanted_tokens, num_top_words, lang)
    
    combined_df = pd.concat([gen_metrics_df, vocab_metrics_df], axis=1)
    
    return combined_df

# Function to plot scatter plots for specified combinations using Plotly
def plot_specified_scatter(df, combinations_to_plot):
    for metric1, metric2 in combinations_to_plot:
        fig = px.scatter(df, x=metric1, y=metric2, title=f"Scatter Plot: {metric1} vs {metric2}")
        fig.show()


In [8]:
import pandas as pd
import plotly.graph_objects as go
import plotly.subplots as sp
import nltk
from collections import Counter

def text_metrics_summary_plot(df, text_column, num_docs_to_plot=None):
    # Ensure the nltk punkt tokenizer is downloaded
    nltk.download('punkt', quiet=True)
    
    # Decide on the number of documents to plot
    if not num_docs_to_plot or num_docs_to_plot > len(df):
        num_docs_to_plot = len(df)

    # Colors for each subplot
    colors = ['blue', 'green', 'red', 'purple']

    # Axis titles for clarity
    x_titles = [
        "Word Frequencies",
        "Sentence Position in Document",
        "Sentence Lengths (Words)",
        "Word Lengths (Characters)"
    ]
    y_titles = [
        "Number of Words",
        "Sentence Length (Words)",
        "Number of Sentences",
        "Number of Words"
    ]

    # Iterate over each document in the DataFrame up to the user-specified limit
    for index, (idx, row) in enumerate(df.head(num_docs_to_plot).iterrows()):
        # Create subplots with a 2x2 grid for each metric
        fig = sp.make_subplots(
            rows=2, cols=2, 
            subplot_titles=[
                "Word Frequencies", 
                "Sentence Positions",
                "Sentence Lengths", 
                "Word Lengths"
            ]
        )
        
        # Tokenize document into sentences and words
        sentences = nltk.sent_tokenize(row[text_column])
        words = nltk.word_tokenize(row[text_column])
        
        # Metrics computation
        word_freq = Counter(words)
        freq_counts = Counter(word_freq.values())
        word_frequencies = list(freq_counts.keys())
        word_frequency_counts = list(freq_counts.values())
        
        sentence_positions = list(range(1, len(sentences) + 1))
        sentence_lengths = [len(sentence.split()) for sentence in sentences]
        word_lengths = [len(word) for word in words]
        
        # Adding data to subplots
        fig.add_trace(go.Bar(x=word_frequencies, y=word_frequency_counts, marker_color=colors[0], showlegend=False), row=1, col=1)
        fig.add_trace(go.Bar(x=sentence_positions, y=sentence_lengths, marker_color=colors[1], showlegend=False), row=1, col=2)
        fig.add_trace(go.Histogram(x=sentence_lengths, nbinsx=50, opacity=0.75, marker_color=colors[2], showlegend=False), row=2, col=1)
        fig.add_trace(go.Histogram(x=word_lengths, nbinsx=50, opacity=0.75, marker_color=colors[3], showlegend=False), row=2, col=2)

        # Update x and y axis titles
        for i, (x_title, y_title) in enumerate(zip(x_titles, y_titles)):
            fig['layout'][f'xaxis{i+1}'].update(title=x_title, titlefont=dict(size=10))
            fig['layout'][f'yaxis{i+1}'].update(title=y_title, titlefont=dict(size=10))

        # Update layout
        fig.update_layout(
            title=f"Summary Metrics for Document {index+1}",
            barmode='overlay',
            height=800
        )
        
        fig.show()

### Hugging Face summarization wrappers

The following code template showcases how to wrap a Hugging Face model for compatibility with the ValidMind Developer Framework. We will load an example model using the transformers API and then run some predictions on our test dataset.

The ValidMind Developer Framework supports Hugging Face transformers out of the box, so a later section shows how to initialize multiple transformers models with the `init_model` function, which removes the need for a custom wrapper. When you need extra pre-processing or post-processing steps, you can use the following code template as a starting point to wrap your model.

In [9]:
from transformers import pipeline

@dataclass
class AbstractSummarization_HuggingFace:
    """
    A VM Model instance wrapper only requires a predict method
    """

    predicted_prob_values = None

    def __init__(self, pipeline_task, model_name=None, model=None, tokenizer=None):
        self.model_name = model_name
        self.pipeline_task = pipeline_task
        self.model = pipeline(pipeline_task, model=model, tokenizer=tokenizer)

    def predict(self, data):
        data = [str(datapoint) for datapoint in data]
        results = self.model(data)
        results_df = pd.DataFrame(results)
        # Summarization pipelines return a 'summary_text' field, while
        # classification pipelines return 'label' and 'score'
        if "summary_text" in results_df.columns:
            return results_df["summary_text"].values
        self.predicted_prob_values = results_df.score.values
        return results_df.label.values

    def predict_proba(self):
        if self.predicted_prob_values is None:
            raise ValueError("First run predict method to retrieve predicted probabilities")
        return self.predicted_prob_values

In [10]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import torch
from dataclasses import dataclass
from typing import Any

@dataclass
class ExtractiveSummarization_BERT:
    model: Any
    tokenizer: Any

    def _get_embedding(self, text):
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding='max_length')
        with torch.no_grad():
            output = self.model(**inputs)
        return output['last_hidden_state'].mean(dim=1).squeeze().detach().numpy()

    def predict(self, texts, params):
        summaries = []
        
        for text in texts:
            # Split on blank lines; each resulting chunk is treated as one "sentence" to rank.
            sentences = text.split("\n\n")
            
            document_embedding = self._get_embedding(' '.join(sentences))
            sentence_embeddings = [self._get_embedding(sentence) for sentence in sentences]
            similarities = cosine_similarity(sentence_embeddings, [document_embedding])
            # ravel() keeps a 1-D index array even when the text has a single chunk
            # (squeeze() would collapse it to a 0-d array that cannot be sliced or iterated).
            sorted_indices = np.argsort(similarities, axis=0)[::-1].ravel()

            if params["method"] == "percentage":
                top_k = int(len(sentences) * params["value"])
                selected_sentences = [sentences[i] for i in sorted_indices[:top_k]]

            elif params["method"] == "fixed_sentences":
                top_k = params["value"]
                selected_sentences = [sentences[i] for i in sorted_indices[:top_k]]

            elif params["method"] == "word_count":
                selected_sentences = []
                total_words = 0
                for index in sorted_indices:
                    current_sentence_words = len(sentences[index].split())
                    # If adding the current sentence doesn't exceed the word limit, add it.
                    if total_words + current_sentence_words <= params["value"]:
                        total_words += current_sentence_words
                        selected_sentences.append(sentences[index])
                    # Once the word limit is reached or exceeded, stop adding more sentences.
                    if total_words >= params["value"]:
                        break

            else:
                raise ValueError("Invalid method specified.")
            
            summaries.append(' '.join(selected_sentences))
        
        return summaries
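
The ranking and `word_count` selection steps of the extractive wrapper can be followed on a toy example without loading BERT. The sketch below is an illustration only: the two-dimensional embeddings are made-up numbers standing in for mean-pooled BERT vectors, and `cosine` and `select_by_word_count` are hypothetical helpers that mirror the selection loop above in plain Python.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_by_word_count(sentences, embeddings, doc_embedding, max_words):
    """Rank sentences by similarity to the document, then fill a word budget."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(embeddings[i], doc_embedding),
        reverse=True,
    )
    selected, total_words = [], 0
    for i in ranked:
        n_words = len(sentences[i].split())
        # Add the sentence only if it fits in the remaining budget.
        if total_words + n_words <= max_words:
            selected.append(sentences[i])
            total_words += n_words
        # Stop once the budget is used up.
        if total_words >= max_words:
            break
    return selected


sentences = [
    "Profits rose sharply this quarter.",
    "The weather was mild.",
    "Advertising sales drove the profit rise.",
]
# Hypothetical embeddings: the first and third sentences point in roughly
# the same direction as the document vector, the second does not.
embeddings = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]
doc_embedding = [1.0, 0.2]

summary = select_by_word_count(sentences, embeddings, doc_embedding, max_words=11)
print(summary)
```

Note that the selected sentences come out in similarity order rather than document order, which is also how the wrapper above assembles its summaries.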

## 0. Load Data

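In this section, we load the dataset that serves as the foundation for the summarization tasks: `bbc_text_cls_reference.csv`, truncated to its first 100 rows. The helper functions above expect it to provide `topic`, `input`, and `reference_summary` columns.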

In [11]:
import pandas as pd

df = pd.read_csv('./datasets/bbc_text_cls_reference.csv')
df = df.head(100)

## 1. Extractive Summarization: Hugging Face-BERT Model

In [12]:
model = BertModel.from_pretrained("bert-base-uncased")  
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
extractive_model = ExtractiveSummarization_BERT(model=model, tokenizer=tokenizer)

In [13]:
df_raw = df.copy()
data = df_raw.input.values.tolist()

params = {
    "method": "word_count",
    "value": 200
}

list_summary_1 = extractive_model.predict(data, params)

In [14]:
df_all_summaries = add_summaries_to_df(df_raw, list_summary_1)
display_formatted_dataframe(df_all_summaries, num_rows=2)

topic,input,reference_summary,summary_1
business,"Ad sales boost Time Warner profit Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier. The firm, which is now one of the biggest investors in Google, benefited from sales of high- speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL. Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding. Time Warner's fourth quarter profits were slightly better than analysts' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. ""Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility,"" chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins. 
TimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.","The firm, which is now one of the biggest investors in Google, benefited from sales of high- speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL. Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier. Ad sales boost Time Warner profit","Time Warner's fourth quarter profits were slightly better than analysts' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. ""Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility,"" chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins. 
The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL. Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier. Ad sales boost Time Warner profit"
business,"Dollar gains on Greenspan speech The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise. And Alan Greenspan highlighted the US government's willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan's speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. ""I think the chairman's taking a much more sanguine view on the current account deficit than he's taken for some time,"" said Robert Sinche, head of currency strategy at Bank of America in New York. ""He's taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next."" Worries about the deficit concerns about China do, however, remain. China's currency remains pegged to the dollar and the US currency's sharp falls in recent months have therefore made Chinese export prices highly competitive. But calls for a shift in Beijing's policy have fallen on deaf ears, despite recent comments in a major Chinese newspaper that the ""time is ripe"" for a loosening of the peg. The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve's decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates. The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. 
The recent falls have partly been the result of big budget deficits, as well as the US's yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments. The White House will announce its budget on Monday, and many commentators believe the deficit will remain at close to half a trillion dollars.",The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise. Dollar gains on Greenspan speech,"And Alan Greenspan highlighted the US government's willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan's speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. ""I think the chairman's taking a much more sanguine view on the current account deficit than he's taken for some time,"" said Robert Sinche, head of currency strategy at Bank of America in New York. ""He's taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next."" The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise. Dollar gains on Greenspan speech"


## 2. Validation

### General Text Metrics

- **Total Words:** Assess the length and complexity of the input text. Longer documents might require more sophisticated summarization techniques, while shorter ones may need more concise summaries.

- **Total Sentences:** Understand the structural makeup of the content. Longer texts with numerous sentences might require the model to generate longer summaries to capture essential information.

- **Avg Sentence Length:** Determine the average length of sentences in the text. This can help the model decide on the appropriate length for generated summaries, ensuring they are coherent and readable.

- **Total Paragraphs:** Analyze how the content is organized into paragraphs. The model should be able to maintain the logical structure of the content when producing summaries.

- **Total Unique Words:** Measure the diversity of vocabulary in the text. A higher count of unique words could indicate more complex content, which the model needs to capture accurately.

- **Most Common Words:** Identify frequently occurring words that likely represent key themes. The model should pay special attention to including these words and concepts in its summaries.

- **Total Punctuations:** Evaluate the usage of punctuation marks, which contribute to the tone and structure of the content. The model should be able to maintain appropriate punctuation in summaries.

- **Lexical Diversity:** Calculate the richness of vocabulary in relation to the overall text length. A higher lexical diversity suggests a broader range of ideas and concepts that the model needs to capture in its summaries.
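
To make these definitions concrete, the sketch below computes several of the metrics by hand on a short made-up text. It is a simplification, not the notebook's implementation: it uses regular expressions in place of the NLTK tokenizers, counts only non-punctuation tokens as words, and skips stop-word filtering.

```python
import re
import string

# Made-up text with two paragraphs and three sentences.
text = "Profits rose. Profits rose again, and sales rose too.\n\nAnalysts were surprised."

# Simplified tokenization (an assumption; the notebook uses NLTK tokenizers).
paragraphs = text.split("\n\n")
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
tokens = re.findall(r"\w+|[^\w\s]", text)
words = [t.lower() for t in tokens if t not in string.punctuation]

total_words = len(words)
total_sentences = len(sentences)
avg_sentence_length = round(total_words / total_sentences, 1)
total_paragraphs = len(paragraphs)
total_unique_words = len(set(words))
total_punctuations = sum(1 for t in tokens if t in string.punctuation)
# Unique words divided by all words: closer to 1 means less repetition.
lexical_diversity = round(total_unique_words / total_words, 2)

print(total_words, total_sentences, avg_sentence_length, total_paragraphs)
print(total_unique_words, total_punctuations, lexical_diversity)
```

Here 9 of the 12 words are distinct, giving a lexical diversity of 0.75; the repeated "profits" and "rose" are what pull it below 1.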

In [None]:
params = {
    "text_column": "input",
    "unwanted_tokens": {'s', 's\'', 'mr', 'ms', 'mrs', 'dr', '\'s', ' ', "''", 'dollar', 'us', '``'},
    "num_top_words": 3,
    "lang": "english"
}

summary_table = text_metrics_summary_table(df_raw, params)
display(summary_table)

In [None]:
# Define the combinations you want to plot
combinations_to_plot = [
    ("Total Words", "Total Sentences"),
    ("Total Words", "Total Unique Words"),
    ("Total Sentences", "Avg Sentence Length"),
    ("Total Unique Words", "Lexical Diversity")
]

plot_specified_scatter(summary_table, combinations_to_plot)

### Text Metrics Distributions

- **Word Frequencies:** This metric provides a histogram of how often words appear with a given frequency. For example, if a lot of words appear only once in a document, it might be indicative of a text rich in unique words. On the other hand, a small set of words appearing very frequently might indicate repetitive content or a certain theme or pattern in the text.

- **Sentence Positions vs. Sentence Lengths:** This bar chart showcases the length of each sentence (in terms of word count) in their order of appearance in the document. This can give insights into the flow of information in a text, highlighting any long, detailed sections or brief, potentially superficial areas.

- **Sentence Lengths Distribution:** A histogram showing the frequency of sentence lengths across the document. Long sentences might contain a lot of information but could be harder for summarization models to digest and for readers to comprehend. Conversely, many short sentences might indicate fragmented information.

- **Word Lengths Distribution:** A histogram of the lengths of words in the document. Extremely long words might be anomalies, technical terms, or potential errors in the corpus. Conversely, a majority of very short words might denote lack of depth or specificity.
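
The word-frequency histogram is a "frequency of frequencies", which is easy to misread. The following sketch, using a made-up sentence and whitespace tokenization as a stand-in for NLTK, shows how the x and y values of that chart and of the word-length histogram are derived.

```python
from collections import Counter

# Made-up sentence, split on whitespace for simplicity.
words = "the bank raised rates and the bank cut costs and the bank grew".split()

# Step 1: how often does each word occur?
word_freq = Counter(words)

# Step 2: how many distinct words share each occurrence count?
# Keys are the x-axis (frequency), values are the y-axis (number of words).
freq_counts = Counter(word_freq.values())
print(dict(freq_counts))

# Word-length histogram: word length in characters -> number of tokens.
length_hist = Counter(len(word) for word in words)
print(dict(length_hist))
```

In this example five words occur once, one word occurs twice, and two words occur three times, so the bar at frequency 1 is the tallest.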

In [None]:
text_metrics_summary_plot(df_raw, "input", 2)

**ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation)**

The ROUGE score is a widely adopted set of metrics used for evaluating automatic summarization and machine translation. It fundamentally measures the overlap between the n-grams in the generated summary and those in the reference summary.

- **ROUGE-N**: This evaluates the overlap of n-grams between the produced summary and reference summary. It calculates precision (the proportion of n-grams in the generated summary that are also present in the reference summary), recall (the proportion of n-grams in the reference summary that are also present in the generated summary), and F1 score (the harmonic mean of precision and recall).

- **ROUGE-L**: This metric is based on the Longest Common Subsequence (LCS) between the generated and reference summaries. The LCS is the longest sequence of tokens that appears in both summaries in the same order, although the tokens do not need to be contiguous. It's beneficial because it identifies and rewards longer in-order matches without requiring a fixed n-gram length.

- **ROUGE-S**: This measures skip-bigram overlap: any pair of words in their sentence order counts as a bigram, with arbitrary gaps ("skips") allowed between the two words. This can be valuable to capture sentence-level structure similarity.
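
The definitions above can be checked by hand on a single sentence pair. The sketch below is a plain-Python illustration of ROUGE-N and of the LCS length behind ROUGE-L; it is not the implementation used by the `rouge` package, and the helper names and example sentences are made up for this purpose.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(generated, reference, n=2):
    """Precision, recall and F1 of n-gram overlap (whitespace tokenization)."""
    gen, ref = ngrams(generated.split(), n), ngrams(reference.split(), n)
    overlap = sum((gen & ref).values())  # clipped count of shared n-grams
    p = overlap / sum(gen.values()) if gen else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


def lcs_length(a, b):
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


reference = "the dollar rose against the euro"
generated = "the dollar rose sharply against euro"

scores = rouge_n(generated, reference, n=2)
lcs = lcs_length(generated.split(), reference.split())
print(scores, lcs)
```

Only two of the five bigrams on each side match ("the dollar" and "dollar rose"), so ROUGE-2 precision and recall are both 0.4. The LCS, in contrast, has length 5 ("the dollar rose against euro"), because the inserted word "sharply" breaks bigrams but not the in-order subsequence.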

In [None]:
metric = "rouge-2"
df_scores = calculate_rouge_scores(
    df_all_summaries, 
    ref_column="reference_summary", 
    gen_column="summary_1", 
    metric=metric)
visualize_rouge_scores(df_scores)

## 3. Abstractive Summarization: Hugging Face-T5 Model