## "**DSPy**: Programming with Foundation Models" - Demo
## Part 1 - Introduction

In [None]:
import pandas as pd
import dspy

## Configuration

In [None]:
def load_credentials(path_to_credentials: str) -> str:
    with open(path_to_credentials, "r") as _f:
        return _f.read()

llm = dspy.OpenAI(
    deployment_id="gpt-35-turbo",
    api_key=load_credentials("credentials.txt"),
    api_base="https://your_endpoint_name.openai.azure.com/",
    api_provider="azure",
    api_version="2023-03-15-preview",
    model_type="chat"
)

dspy.settings.configure(lm=llm)

In [None]:
# DSPy keeps a "history" of the calls made to your LLM. So far, the history is empty.
# IMPORTANT NOTE: DSPy caches LLM calls.
# Querying an LLM under the same "conditions" (input, signature, etc.) returns the cached response.
llm.inspect_history()

### Which LLMs are supported?

Both cloud-based and local models are supported. At the moment:
* among cloud-based providers, `OpenAI` is supported, while Google models (`PaLM2`, `Gemini`, ...) are not;
* among local models, you will find `HFModel`, `vLLModel`, and `OllamaModel`.

Full information is available at [DSPy | Documentation | Language Models](https://dspy-docs.vercel.app/docs/building-blocks/language_models).

## Building Blocks

### Signatures & Modules

**A signature is a declarative specification of input/output behavior of a DSPy module.**
* It states _what_ to do rather than _how_ to do it;
* Writing signatures is far more modular, adaptive, and reproducible than writing prompts.
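
To illustrate what a signature string such as `'review -> sentiment'` declares, here is a small sketch that splits it into input and output field names. `parse_inline_signature` is a hypothetical helper written for this demo, not DSPy's own parser.

```python
def parse_inline_signature(signature: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into (input field names, output field names)."""
    if signature.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {signature!r}")
    inputs_part, outputs_part = signature.split("->")
    inputs = [name.strip() for name in inputs_part.split(",") if name.strip()]
    outputs = [name.strip() for name in outputs_part.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output")
    return inputs, outputs


sentiment_fields = parse_inline_signature("review -> sentiment")
qa_fields = parse_inline_signature("context, question -> answer")
```

The string only names the fields; how they are turned into an actual prompt is left to the module that consumes the signature.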

#### "inline" Signature

In [None]:
review = "Wszystko OK, polecam!"

classify = dspy.Predict('review -> sentiment') # inline signature
classify(review=review)

Many DSPy modules (except `dspy.Predict`) return auxiliary information by expanding your signature under the hood. For example, `dspy.ChainOfThought` also provides a `rationale`:

In [None]:
classify = dspy.ChainOfThought('review -> sentiment')
result = classify(review=review)
print(f"Sentiment: {result.sentiment}")
print(f"\nRationale: {result.rationale}")

#### "class-based" Signature

In [None]:
class BasicSentimentAnalysis(dspy.Signature):
    """review -> sentiment"""
    
    review = dspy.InputField()
    sentiment = dspy.OutputField()

In [None]:
classify = dspy.Predict(BasicSentimentAnalysis)
classify(review=review)

In [None]:
# In class-based Signatures, the docstring is part of the prompt
# Input and output fields can be enriched with descriptions via `desc`
class ReviewScoreEstimation(dspy.Signature):
    """review -> score"""
    
    review = dspy.InputField()
    # `desc` will be used in the prompt by the model
    score = dspy.OutputField(
        desc="An integer between 1 and 5, where 1 represents extremely negative sentiment and 5 extremely positive"
    )

In [None]:
classify = dspy.Predict(ReviewScoreEstimation)
classify(review=review)

Let's inspect the `llm` history!
**Note: `dspy` caches the LLM's interactions**, so calling the LLM with the same input and signature will output the same response.

In [None]:
llm.inspect_history(n=1)

### Programs

Let's now use what we have learned so far to build our first `dspy` program!

(Dummy) Business case: 

_"Recent analysis of the market suggests that consumers tend to mention specific product parameters when writing bad reviews"_


The BO of product reviews in your company wants to validate the hypothesis above.
We want to use a GenAI approach to prototype a solution and deliver results quickly. The BO won't accept a black-box solution, as she wants to know why the model made specific decisions. We also have a fixed budget allocated for GPT-3.5-turbo tests, so we want to query the model as little as possible.

Given this, the specifications of our LLM program are:

* Get the input review;
* Predict the review score with a prediction rationale (business explanation);
* If (and only if) the score is negative, look for parameters (cost optimization);
* Produce a final output.

#### Dataset

In [None]:
# Dataset source: https://github.com/allegro/klejbenchmark-allegroreviews
reviews = [
    {"id": "69", "text": "Jakość dźwięku tragedia. Najgorsze ze słuchawek jakiekolwiek miałem okazje używać. Na 100% nie jest to produkt oryginalny Sony. Taka firma nigdy nie wypuściłaby  takiego złomu. Typowe podróby, 1000 razy gorsze od oryginalnego chińczyka. Nie ma sensu kupować nawet do rozmów telefonicznych, chyba ze kupimy je w firmowym salonie Sony, bo podejrzewam ze wówczas jakość ich będzie zadowalająca.", "ground_truth": "1"},
    {"id": "122", "text": "Antena naprawdę duża  , obudowa plastikowa w moim przypadku przyszła dość porysowana , z minusów  bardzo tandetny uchwyt montażowy wykonany z aluminiowego kątownika który powoduje drżenie przy dużych wiatrach , zysk ponad 20dBi  a będzie jeszcze lepiej po skróceniu przewodu . Adapter na złącze SMA w zestawie , szkoda,że zrobiony na tak cienkim przewodzie. ogólnie bardzo dobra antena z małym ale.", "ground_truth": "4"},
    {"id": "392", "text": "Klawiatura ma słabe ledy i są źle rozprowadzone, gdzie indziej jest jasno a gdzie indziej ciemno, klawisze się zcierają. Myszka jak dla mnie nie ma wad, dobrze mi się nią obsługuje, podkładka się szybko ztarła, a od początku wypakowania była zgięta, rogi podkładki są podwinięte. Słuchawki maja przeciętny dźwięk, materiał zastosowany po dłuższym użytkowaniu robi się twardy i nieprzyjemny, co jakiś czas muszę je zdejmować, bo bolą mnie od nich uszy i głowa, mikrofon nie trzeszczy", "ground_truth": "2"},
    {"id": "396", "text": "Produkt niezgodny ze specyfikacją. Według opisu ma gniazdo jack  3,5 mm do podłączenia zestawu słuchawkowego. W praktyce otrzymałem jakąś zubożoną wersję produktu bez tego gniazda. A jedną z głównych funkcji różnych akcesoriów jabra jest ich wzajemna kompatibilność. Dodatkowo mam wrażenie, że produkt nie jest nowy, brakuje mu folii zabezpieczającej (patrz foto - na pierwszym zdjęciu jak powinna wyglądać folia zabezpieczająca, na drugim zdjęciu - na moim urządzeniu tej folii było brak)", "ground_truth": "1"},
    {"id": "276", "text": "STRASZNIE GRUBE RAMKI! Bez piłowania się nie obyło, co i tak nie wystarczyło, bo o ile można spiłować boki na szerokość/długość, to raczej trudno spiłować boki na grubość/głębokość, chyba, że ktoś posiada specjalistyczne narzędzia typu dremel... Co do rozdzielczości to nie mam zastrzeżeń. Jakość w miarę podobna jak w oryginale. Tylko sprawne oko może zauważyć różnicę. Podsumowując, jeśli chcesz zaoszczędzić i kupić ten wyświetlacz za 54zł to potem się nie dziw, że ci on nie pasuje do obudowy i będziesz musiał piłować pilnikami i papierem ściernym. ", "ground_truth": "3"},
    {"id": "355", "text": "Rewelacyjne kolumny, zdecydowanie lepsza jakość dzwieku od wersji bez SE która również posiadam. Taga to najlepsze kolumny w tej cenie słuchałem Jamo i Yamahy ale Taga zdecydowanie mi bardziej podchodzi zarówno optycznie jak i barwa dzwięku. Wersja SE ma zdecydowanie bardziej klarowny dzwięk i świetne panele z tyłu do podłączanie kabli POLECAM!!!! Lakier fortepianowy to ogromna zaleta tych kolumn wygląda dobrze nawet po 2 latach nic sie z tym nie dziej złego :)", "ground_truth": "5"},
]

df = pd.DataFrame(reviews)
pd.set_option('max_colwidth', 800)
df

The core data type for data in DSPy is `Example`. You will use Examples to represent items in your training set and test set.

In [None]:
train_examples = []
for _, row in df.iterrows():
    train_examples.append(dspy.Example(review=row["text"]).with_inputs("review"))

In [None]:
train_examples[0]

#### DSPy Program

In [None]:
# We are going to use `ReviewScoreEstimation` signature for score estimation
# But we need an additional signature for parameters extraction
class ParametersExtractor(dspy.Signature):
    """review -> ['parameter_name_1', 'parameter_name_2', ...]"""
    
    review = dspy.InputField(
        desc="A product review written by a consumer in a Polish e-commerce"
    )
    parameters = dspy.OutputField(
        desc="A list of extracted product parameter names"
    )

In [None]:
class ReviewAnalyzer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.review_score_estimator = dspy.ChainOfThought(ReviewScoreEstimation)
        self.parameters_extractor = dspy.Predict(ParametersExtractor)

        self.score_threshold = 3

    def forward(self, review):        
        score_estimator_response = self.review_score_estimator(review=review.review)
        score = int(score_estimator_response.score)
        
        if score > self.score_threshold:
            return dspy.Prediction(
                review=review.review,
                score=score,
                business_explanation=score_estimator_response.rationale
            )
            
        mentioned_parameters = self.parameters_extractor(review=review.review).parameters
            
        return dspy.Prediction(
            review=review.review,
            score=score,
            mentioned_parameters=mentioned_parameters,
            business_explanation=score_estimator_response.rationale
        )
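
One fragile spot in the program above is `int(score_estimator_response.score)`: it raises a `ValueError` whenever the model answers with anything other than a bare integer (for example `"Score: 2"`). A more defensive parser could look like the sketch below. `parse_score` is a hypothetical helper, and taking the first in-range integer is an assumption about how such answers are usually phrased.

```python
import re
from typing import Optional


def parse_score(raw: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in `raw` that lies in [low, high], else None."""
    for match in re.findall(r"\d+", raw):
        value = int(match)
        if low <= value <= high:
            return value
    return None


examples = ["4", "Score: 2", "2/5", "I would rate it 5.", "negative"]
parsed = [parse_score(text) for text in examples]
```

Returning `None` rather than raising lets the caller decide what to do with an unparseable answer, e.g., skip the review or retry the call.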

#### Let's try our program!

In [None]:
input_example = train_examples[0]

In [None]:
review_analyzer = ReviewAnalyzer()

In [None]:
# Call the module directly (this dispatches to `forward`) rather than invoking `forward` yourself
analysis = review_analyzer(input_example)
analysis

#### How can we improve our results?

A few ideas:

* Improve the Signatures;
* Use more advanced `dspy` blocks in the pipeline, e.g., replace `Predict` with `ChainOfThought`;
* Get more data, define a metric, and let the pipeline **train** by itself! (more on this later)

## Part 2 - Optimization

For this part, we are going to use a tutorial provided by the DSPy authors.

The tutorial is about compiling a RAG pipeline using the HotPotQA dataset and ColBERTv2 as the retriever module.

![rag_schema](rag-schema.png)

In [None]:
# DSPy provides a free-to-use retriever model for educational purposes
# We are adding the retriever `rm` to our configuration. This will be used when calling `dspy.Retrieve`
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(lm=llm, rm=colbertv2_wiki17_abstracts)

#### HotPotQA Dataset

In [None]:
from dspy.datasets import HotPotQA

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(devset)

In [None]:
train_example = trainset[0]
print(f"Question: {train_example.question}")
print(f"Answer: {train_example.answer}")

In [None]:
dev_example = devset[10]
print(f"Question: {dev_example.question}")
print(f"Answer: {dev_example.answer}")
print(f"Relevant Wikipedia Titles: {dev_example.gold_titles}")

In [None]:
retrieve = dspy.Retrieve(k=3)
topK_passages = retrieve(dev_example.question).passages

print(f"Top {retrieve.k} passages for question: {dev_example.question} \n", '-' * 30, '\n')

for idx, passage in enumerate(topK_passages):
    print(f'{idx+1}]', passage, '\n')

#### DSPy program

In [None]:
class GenerateAnswer(dspy.Signature):
    """question, context -> answer"""

    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField()

In [None]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

In [None]:
basic_rag = RAG()

### Evaluation

Let's now define some new key components of DSPy in order to perform automatic evaluation of our pipelines.

In [None]:
# We have a positive match when:
# (generated answer and true answer match exactly) AND (the retrieved context does actually contain that answer)
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM
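
To see what the two checks in this metric mean, here is a dependency-free sketch of the same idea. The normalization (lowercasing, stripping punctuation and English articles) is an assumption in the style of SQuAD evaluation, not a copy of DSPy's internals, and the `SimpleNamespace` objects are stand-ins for `dspy.Example` and `dspy.Prediction`.

```python
import re
import string
from types import SimpleNamespace


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(example, pred) -> bool:
    return normalize(example.answer) == normalize(pred.answer)


def passage_match(example, pred) -> bool:
    # The gold answer must appear in at least one retrieved passage.
    target = normalize(example.answer)
    return any(target in normalize(passage) for passage in pred.context)


def sketch_metric(example, pred, trace=None) -> bool:
    return exact_match(example, pred) and passage_match(example, pred)


example = SimpleNamespace(question="Who wrote Hamlet?", answer="William Shakespeare")
grounded = SimpleNamespace(
    answer="william shakespeare.",
    context=["Hamlet is a tragedy by William Shakespeare."],
)
ungrounded = SimpleNamespace(
    answer="William Shakespeare",
    context=["Hamlet is a tragedy."],
)
```

The `ungrounded` prediction gets the answer right but fails the metric, because nothing in its retrieved context supports that answer.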

In [None]:
from dspy.evaluate import Evaluate

evaluate_hotpot = Evaluate(
    devset=devset,
    metric=validate_context_and_answer,
    num_threads=32,
    display_progress=True,
    display_table=15,
)

In [None]:
evaluate_hotpot(basic_rag)

### Optimizers (a.k.a. Teleprompters)

A DSPy optimizer is an algorithm that can tune the parameters of a DSPy program (i.e., the prompts and/or the LM weights) to maximize the metrics you specify, like accuracy. There are many built-in optimizers in DSPy, which apply vastly different strategies. A typical DSPy optimizer takes three things:

* Your DSPy program. This may be a single module (e.g., `dspy.Predict`) or a complex multi-module program.

* Your metric. This is a function that evaluates the output of your program, and assigns it a score (higher is better).

* A few training inputs. This may be very small (i.e., only 5 or 10 examples) and incomplete (only inputs to your program, without any labels).

### Optimizer 1: BootstrapFewShot

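`BootstrapFewShot` builds few-shot demonstrations for the prompts of your program.
* few-shot examples may come directly from a training set (labeled demos);
* few-shot examples may be generated by the program itself and kept only if they pass the metric (bootstrapped demos).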
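
The bootstrapping idea can be sketched in a few lines of plain Python: run a program on the training examples and keep only the traces that pass the metric, up to a limit. This is a simplified illustration, not the `BootstrapFewShot` implementation; `bootstrap_demos`, `toy_program`, and `toy_metric` are hypothetical names.

```python
from typing import Callable


def bootstrap_demos(
    program: Callable[[str], str],
    trainset: list[dict],
    metric: Callable[[dict, str], bool],
    max_bootstrapped_demos: int,
) -> list[dict]:
    """Collect program outputs that pass the metric, to reuse as few-shot demos."""
    demos = []
    for example in trainset:
        if len(demos) >= max_bootstrapped_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction})
    return demos


def toy_program(question: str) -> str:
    # Stand-in for an LLM pipeline: it only "knows" two answers.
    known = {"2 + 2?": "4", "Capital of France?": "Paris"}
    return known.get(question, "unknown")


def toy_metric(example: dict, prediction: str) -> bool:
    return example["answer"] == prediction


toy_trainset = [
    {"question": "2 + 2?", "answer": "4"},
    {"question": "Largest planet?", "answer": "Jupiter"},
    {"question": "Capital of France?", "answer": "Paris"},
]
demos = bootstrap_demos(toy_program, toy_trainset, toy_metric, max_bootstrapped_demos=2)
```

The failed example ("Largest planet?") is silently dropped, which is why even a weak program can produce a small set of good demonstrations.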

In [None]:
from dspy.teleprompt import BootstrapFewShot

teleprompter_few_shot = BootstrapFewShot(
    metric=validate_context_and_answer,       # the metric to optimize
    max_bootstrapped_demos=2,                 # max number of bootstrapped (program-generated) demos
    max_labeled_demos=4,                      # max number of labeled few-shot examples taken from the training set
    max_rounds=2,                             # how many times to repeat the bootstrapping process before stopping
    teacher_settings={}                       # optional settings (e.g., a different LM) for the teacher program that generates the bootstrapped demos
)

# Compile!
compiled_few_shot_rag = teleprompter_few_shot.compile(RAG(), trainset=trainset)
compiled_few_shot_rag.save('compiled_rag.json')

In [None]:
evaluate_hotpot(compiled_few_shot_rag)

In [None]:
llm.inspect_history()

In [None]:
import json
from pprint import pprint

# What does our "frozen" RAG pipeline look like?
with open("compiled_rag.json", "r") as json_file:
    compiled_rag_json = json_file.read()
    
pprint(json.loads(compiled_rag_json))

### Optimizer 2: SignatureOptimizer

From the docs: Generates and refines new instructions for each step, and optimizes them with coordinate ascent.

https://github.com/stanfordnlp/dspy/blob/main/dspy/teleprompt/signature_opt.py
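
For intuition about "coordinate ascent", here is a toy sketch: each step of a pipeline has a set of candidate instructions, and we improve one step at a time while keeping the others fixed. This illustrates the general technique only, not the code in `signature_opt.py`; the step names, candidate instructions, and `toy_score` are all made up for the example.

```python
from typing import Callable


def coordinate_ascent(
    candidates: dict[str, list[str]],
    initial: dict[str, str],
    score: Callable[[dict[str, str]], float],
    rounds: int = 2,
) -> dict[str, str]:
    """Optimize one step's instruction at a time, holding the others fixed."""
    best = dict(initial)
    best_score = score(best)
    for _ in range(rounds):
        for step, options in candidates.items():
            for option in options:
                trial = {**best, step: option}
                trial_score = score(trial)
                if trial_score > best_score:
                    best, best_score = trial, trial_score
    return best


def toy_score(instructions: dict[str, str]) -> float:
    # Stand-in for running the pipeline on a dev set and averaging a metric.
    return float(
        ("keywords" in instructions["rewrite_query"])
        + ("short" in instructions["generate_answer"])
        + ("cite" in instructions["generate_answer"])
    )


candidates = {
    "rewrite_query": [
        "Rewrite the question.",
        "Rewrite the question as search keywords.",
    ],
    "generate_answer": [
        "Answer the question.",
        "Answer with a short phrase.",
        "Answer with a short phrase and cite the context.",
    ],
}
initial = {"rewrite_query": "Rewrite the question.", "generate_answer": "Answer the question."}
best_instructions = coordinate_ascent(candidates, initial, toy_score)
```

In a real optimizer, the candidates would be proposed by a prompt model and each call to the scoring function would cost LLM requests, which is why parameters such as `breadth` and `depth` matter for the budget.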

In [None]:
from dspy.teleprompt import SignatureOptimizer

teleprompter_signature = SignatureOptimizer(
    metric=validate_context_and_answer,
    breadth=10,                          # The number of new prompts to generate at each iteration. Default=10.
    depth=3,                             # The number of times we should ask our prompt model to generate new prompts, with the history of the past prompts as input. Default=3
    verbose=True
)

In [None]:
# Compile!
kwargs = dict(num_threads=32, display_progress=True, display_table=0)
compiled_rag_signature = teleprompter_signature.compile(RAG(), devset=devset, eval_kwargs=kwargs)

In [None]:
evaluate_hotpot(compiled_rag_signature)

In [None]:
llm.inspect_history()

### Testing our compiled programs "in production"

In [None]:
user_question = "Who invented the computer?"

In [None]:
basic_rag(user_question)

In [None]:
compiled_few_shot_rag(user_question)

In [None]:
compiled_rag_signature(user_question)

## Challenge

* Set up an evaluation pipeline for the `ReviewAnalyzer` DSPy program;
  * What should the metric be? Hint: we want a correctly estimated score and correctly extracted parameters.
* Use an optimizer to improve the basic prompt;
* Try out different modules from the DSPy module zoo.
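
As one possible starting point for the metric, the sketch below gives half of the credit for an exact score match and half for the overlap between extracted and expected parameters. Everything here is an assumption: the dataset above has no gold parameters, so the `parameters` label is hypothetical; the 50/50 weighting is arbitrary; and the program's `mentioned_parameters` output is assumed to be already parsed into a list of strings.

```python
from types import SimpleNamespace


def review_analysis_metric(example, pred, trace=None) -> float:
    """0.5 for a correct score + 0.5 * Jaccard overlap of parameter names."""
    score_ok = int(example.score) == int(pred.score)
    gold = {p.strip().lower() for p in example.parameters}
    found = {p.strip().lower() for p in getattr(pred, "mentioned_parameters", [])}
    if not gold and not found:
        overlap = 1.0  # nothing expected and nothing extracted
    else:
        overlap = len(gold & found) / len(gold | found)
    return 0.5 * float(score_ok) + 0.5 * overlap


# Stand-ins for a labeled `dspy.Example` and a `dspy.Prediction`.
negative_example = SimpleNamespace(score="1", parameters=["sound quality", "authenticity"])
negative_pred = SimpleNamespace(score=1, mentioned_parameters=["Sound quality", "price"])

positive_example = SimpleNamespace(score="5", parameters=[])
positive_pred = SimpleNamespace(score=5)  # no parameters extracted for positive reviews

partial = review_analysis_metric(negative_example, negative_pred)
perfect = review_analysis_metric(positive_example, positive_pred)
```

Whether a partially correct parameter list should count at all, and how much, is exactly the kind of decision to agree on with the BO before optimizing against it.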

Enjoy programming -- not prompting -- LLMs!