# Evaluating Summarization with TruLens

In this notebook, we will evaluate a summarization application based on the [DialogSum dataset](https://github.com/cylnlp/dialogsum) using a number of different metrics. These fall into two main categories:
1. Ground truth agreement: For this set of metrics, we will measure how similar the generated summary is to a human-created ground truth. We will use four different measures: BERT score, BLEU, ROUGE and a measure where an LLM is prompted to produce a similarity score.
2. Groundedness: For this measure, we will estimate if the generated summary can be traced back to parts of the original transcript.

### Dependencies
Let's first install the packages that this notebook depends on. Uncomment these lines to run.

In [None]:
# !pip install bert_score==0.3.13 \
#              evaluate==0.4.0 \
#              absl-py==1.4.0 \
#              rouge-score==0.1.2 \
#              pandas \
#              tenacity

For the latest metrics, install TruLens from the development branch.

In [None]:
# !pip install git+https://github.com/truera/trulens.git@ss/comparison_scores#subdirectory=trulens_eval

### Download and load data
Now we will download a portion of the DialogSum dataset from GitHub.

In [None]:
import pandas as pd

In [None]:
!wget -O dialogsum.dev.jsonl https://raw.githubusercontent.com/cylnlp/dialogsum/main/DialogSum_Data/dialogsum.dev.jsonl

In [None]:
file_path_dev = 'dialogsum.dev.jsonl'
dev_df = pd.read_json(path_or_buf=file_path_dev, lines=True)
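
The file is in JSON Lines format, meaning one JSON object per line, which is why `lines=True` is passed to `pd.read_json`. The following is a minimal sketch of that behavior on an in-memory sample. The two records and their field values are made up for illustration and are not taken from DialogSum.

```python
import io

import pandas as pd

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample_jsonl = io.StringIO(
    '{"fname": "dev_0", "dialogue": "#Person1#: Hi.", "summary": "A greeting."}\n'
    '{"fname": "dev_1", "dialogue": "#Person1#: Bye.", "summary": "A farewell."}\n'
)

# lines=True tells pandas to parse each line as a separate record (row).
sample_df = pd.read_json(sample_jsonl, lines=True)

print(sample_df.shape)
print(list(sample_df.columns))
```

Each line becomes one row, and the keys of the JSON objects become the columns.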

Let's preview the data to make sure that it was loaded properly.

In [None]:
dev_df.head(10)

## Create a simple summarization app and instrument it

We will create a simple summarization app based on the OpenAI ChatGPT model and instrument it for use with TruLens.

In [None]:
from trulens_eval.tru_custom_app import instrument
from trulens_eval.tru_custom_app import TruCustomApp

In [None]:
import openai

class DialogSummaryApp:
    
    @instrument
    def summarize(self, dialog):
        summary = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                    {"role": "system", "content": """Summarize the given dialog into 1-2 sentences based on the following criteria: 
                     1. Convey only the most salient information; 
                     2. Be brief; 
                     3. Preserve important named entities within the conversation; 
                     4. Be written from an observer perspective; 
                     5. Be written in formal language. """},
                    {"role": "user", "content": dialog}
                ]
            )["choices"][0]["message"]["content"]
        return summary
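
The cell above calls `openai.ChatCompletion.create`, which was removed in version 1.0 of the `openai` package. If a newer version is installed, the same request might look like the sketch below. The names `SYSTEM_PROMPT`, `build_messages` and `summarize_v1` are illustrative and do not come from the notebook or from TruLens, and the sketch assumes that `OPENAI_API_KEY` is set in the environment.

```python
# System prompt copied from the app above, as a single string.
SYSTEM_PROMPT = (
    "Summarize the given dialog into 1-2 sentences based on the following criteria: "
    "1. Convey only the most salient information; "
    "2. Be brief; "
    "3. Preserve important named entities within the conversation; "
    "4. Be written from an observer perspective; "
    "5. Be written in formal language."
)


def build_messages(dialog: str) -> list:
    """Build the chat messages: the system instructions, then the dialog."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": dialog},
    ]


def summarize_v1(dialog: str, model: str = "gpt-3.5-turbo") -> str:
    """Hypothetical variant of `summarize` for openai>=1.0."""
    # Imported lazily so the sketch can be loaded without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model=model,
        messages=build_messages(dialog),
    )
    return completion.choices[0].message.content
```

To use it with the instrumented app, the body of `summarize` could delegate to `summarize_v1(dialog)`.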

## Initialize Database and view dashboard

In [None]:
from trulens_eval import Tru
tru = Tru()
# If you have a database you can connect to, use a URL. For example:
# tru = Tru(database_url="postgresql://hostname/database?user=username&password=password")

In [None]:
tru.run_dashboard()

## Write feedback functions

We will now create the feedback functions that will evaluate the app. Recall that we are evaluating against two criteria:
1. Ground truth agreement: For this set of metrics, we will measure how similar the generated summary is to a human-created ground truth. We will use four different measures: BERT score, BLEU, ROUGE and a measure where an LLM is prompted to produce a similarity score.
2. Groundedness: For this measure, we will estimate if the generated summary can be traced back to parts of the original transcript.
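
To build intuition for the overlap-based measures in the first category, here is a simplified sketch of a unigram-overlap F1 score, which is the idea behind ROUGE-1. This is not how TruLens computes its scores (the feedback functions below rely on the packages installed earlier), and `unigram_f1` is a hypothetical helper written only for this illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over overlapping lowercase, whitespace-separated tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
score_same = unigram_f1("the cat sat on the mat", reference)
score_partial = unigram_f1("the cat lay on the rug", reference)

print(score_same)     # 1.0
print(score_partial)  # 4 of 6 tokens overlap on each side, so about 0.667
```

A score like this rewards shared wording only. BERT score and the LLM-based measure are meant to also capture similarity in meaning when the wording differs.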

In [None]:
from trulens_eval import Feedback, feedback
from trulens_eval.feedback import GroundTruthAgreement

We build the golden set from the dataset we downloaded.

In [None]:
golden_set = dev_df[['dialogue', 'summary']].rename(columns={'dialogue': 'query', 'summary': 'response'}).to_dict('records')
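
The line above produces a list of dictionaries with `query` and `response` keys. The sketch below runs the same transformation on a toy DataFrame so the resulting shape is easy to see. The rows and the extra `topic` column are invented for this example.

```python
import pandas as pd

# Toy stand-in for dev_df; the values are made up for illustration.
toy_df = pd.DataFrame(
    {
        "dialogue": ["#Person1#: Hi.", "#Person1#: Bye."],
        "summary": ["A greeting.", "A farewell."],
        "topic": ["greeting", "farewell"],
    }
)

# Keep only the two columns we need, rename them, and emit one dict per row.
toy_golden_set = (
    toy_df[["dialogue", "summary"]]
    .rename(columns={"dialogue": "query", "summary": "response"})
    .to_dict("records")
)

print(toy_golden_set[0])
# {'query': '#Person1#: Hi.', 'response': 'A greeting.'}
```

Since the feedback functions are attached with `on_input_output()`, the reference summary is presumably looked up by matching the app input against `query`, so the dialog passed to the app should match the stored text exactly.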

In [None]:
ground_truth_collection = GroundTruthAgreement(golden_set)
f_groundtruth = Feedback(ground_truth_collection.agreement_measure).on_input_output()
f_bert_score = Feedback(ground_truth_collection.bert_score).on_input_output()
f_bleu = Feedback(ground_truth_collection.bleu).on_input_output()
f_rouge = Feedback(ground_truth_collection.rouge).on_input_output()
# Groundedness between each context chunk and the response.
grounded = feedback.Groundedness()
f_groundedness = feedback.Feedback(grounded.groundedness_measure).on_input().on_output().aggregate(grounded.grounded_statements_aggregator)

## Create the app and wrap it

Now we are ready to wrap our summarization app with TruLens as a `TruCustomApp`. Each time the app is called, TruLens will log the inputs, outputs and any instrumented intermediate steps, and evaluate them with the feedback functions we created.

In [None]:
app = DialogSummaryApp()
#print(app.summarize(dev_df.dialogue[498]))

In [None]:
ta = TruCustomApp(app, app_id='Summarize_v1', feedbacks = [f_groundtruth, f_groundedness, f_bert_score, f_bleu, f_rouge])

We can test a single run of the app like so. The record should show up on the dashboard.

In [None]:
ta.with_record(app.summarize, dialog=dev_df.dialogue[498])

We'll make a lot of queries in a short amount of time, so we use tenacity to retry failed requests with exponential backoff and make sure that most of them eventually go through.

In [None]:
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff


In [None]:
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def run_with_backoff(doc):
    return ta.with_record(app.summarize, dialog=doc)
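
For readers unfamiliar with the pattern, here is a standard-library sketch of retrying with exponential backoff and random jitter. It illustrates the general technique only and is not tenacity's implementation. `call_with_backoff` and `flaky` are hypothetical names, and the `sleep` and `rng` parameters exist so the example can run instantly and deterministically.

```python
import random
import time


def call_with_backoff(func, max_attempts=6, base=1.0, cap=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call func(), retrying on any exception with a jittered, growing wait."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The wait window doubles on every attempt, up to `cap` seconds.
            window = min(cap, base * 2 ** (attempt - 1))
            # Wait a random fraction of the window to spread out retries.
            sleep(rng() * window)


# Simulated request that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


waits = []
# Record the waits instead of sleeping; rng fixed at 1.0 for determinism.
result = call_with_backoff(flaky, sleep=waits.append, rng=lambda: 1.0)
print(result, waits)  # ok [1.0, 2.0]
```

The random component keeps many clients that fail at the same moment from retrying in lockstep.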


In [None]:
for pair in golden_set:
    llm_response = run_with_backoff(pair["query"])
    print(llm_response)

And that's it! This might take a few minutes to run. Once it finishes, you can explore the dashboard to see how well your app does.