# Supercharge LLM Apps with DSPy and Langfuse

Prompt engineering, the craft of writing precise instructions for LLMs, is time-consuming and iterative. Debugging LLM behavior is also hard, because these models are largely "black boxes". In addition, visibility into the performance and cost of an LLM application is essential for optimization and scalability, both key requirements for any production-grade setup.

## The LLM Ecosystem
The LLM ecosystem is still in its early stages, and a number of tools and frameworks are being developed to address these challenges. DSPy, from Stanford University, takes a distinctive approach to formalizing LLM-based app development. Langfuse, on the other hand, streamlines and operationalizes aspects of LLM app maintenance. In brief:
- **[DSPy](https://dspy-docs.vercel.app/)** provides a modular and composable framework for building LLM applications, abstracting away the complexities of prompt engineering and enabling developers to focus on the core logic of their applications.
- **[Langfuse](https://langfuse.com/docs)** offers a comprehensive observability platform for LLM apps, providing deep insights into model performance, cost, and user interactions.

By combining DSPy and Langfuse, developers can build LLM applications whose prompting logic is expressed as modular programs and whose runtime behavior, latency, and cost can be traced and inspected.

### Langfuse Setup
We will use the self-hosting option for Langfuse, which is based on ``docker`` and ``docker compose``.
Steps:
- Clone the Langfuse repository: ``git clone https://github.com/langfuse/langfuse.git``
- Change into the repository directory: ``cd langfuse``
- Start the Docker containers: ``docker compose up``
> The last step spins up one container for Langfuse and another for Postgres. You can change settings through the ``.env`` or ``docker-compose.yml`` files.

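Before moving on, it can help to confirm that the self-hosted server is actually reachable. The sketch below assumes Langfuse exposes a health endpoint at `/api/public/health` and that the server listens on `http://localhost:3000`. The helper names `langfuse_health_url` and `is_langfuse_up` are hypothetical and not part of the Langfuse SDK.

```python
import urllib.request


def langfuse_health_url(host: str) -> str:
    """Build the health-check URL for a Langfuse host (endpoint path is an assumption)."""
    return host.rstrip("/") + "/api/public/health"


def is_langfuse_up(host: str, fetch=None, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200.

    `fetch` is an injectable callable (url, timeout) -> status code, so the
    check can be exercised without a running server.
    """
    def _default_fetch(url, timeout):
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status

    fetch = fetch or _default_fetch
    try:
        return fetch(langfuse_health_url(host), timeout) == 200
    except OSError:
        # Covers connection errors and urllib's URLError/HTTPError
        return False
```

Calling `is_langfuse_up("http://localhost:3000")` once the containers have started should return `True` if the endpoint assumption holds for your Langfuse version.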
### Imports and Config

In [None]:
# !pip3 install dspy-ai==2.5.2
# !pip3 install langfuse==2.51.2
# !pip3 install chromadb==0.5.5

In [18]:
import os
import sys
import dspy
from dsp.utils import deduplicate
from dspy.retrieve.chromadb_rm import ChromadbRM
from dsp.trackers.langfuse_tracker import LangfuseTracker

import chromadb
from chromadb.utils import embedding_functions

from langfuse import Langfuse

import random
import itertools
from scraper_utils import NB_Markdown_Scraper
from IPython.display import display, Markdown

In [2]:
config = {
    'LANGFUSE_PUBLIC_KEY': 'XXXX',
    'LANGFUSE_SECRET_KEY': 'XXXX',
    'LANGFUSE_HOST': 'http://localhost:3000',
    'OPENAI_API_KEY': 'XXXX',
    'OPENAI_BASE_URL': '',
    'OPENAI_PROVIDER': '',
    'CHROMA_DB_PATH': './chromadb/',
    'CHROMA_COLLECTION_NAME': 'supercharged_workshop_collection',
    'CHROMA_EMB_MODEL': 'all-MiniLM-L6-v2'
}

In [3]:
os.environ["LANGFUSE_PUBLIC_KEY"] = config.get('LANGFUSE_PUBLIC_KEY')
os.environ["LANGFUSE_SECRET_KEY"] = config.get('LANGFUSE_SECRET_KEY')
os.environ["LANGFUSE_HOST"] = config.get('LANGFUSE_HOST')
os.environ["OPENAI_API_KEY"] = config.get('OPENAI_API_KEY')
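Hard-coding keys in a notebook makes them easy to leak. As an alternative, the sketch below reads the same settings from the environment and fails early when a secret is missing. The names `DEFAULTS`, `SECRETS`, and `load_config` are hypothetical helpers introduced for illustration; the keys and default values mirror the `config` dictionary above.

```python
import os

# Non-secret settings with the defaults used in this notebook
DEFAULTS = {
    "LANGFUSE_HOST": "http://localhost:3000",
    "CHROMA_DB_PATH": "./chromadb/",
    "CHROMA_COLLECTION_NAME": "supercharged_workshop_collection",
    "CHROMA_EMB_MODEL": "all-MiniLM-L6-v2",
}
# Settings that must be supplied by the user
SECRETS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "OPENAI_API_KEY")


def load_config(env=None):
    """Build the config dict from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [key for key in SECRETS if not env.get(key)]
    if missing:
        raise KeyError(f"missing required settings: {', '.join(missing)}")
    cfg = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    cfg.update({key: env[key] for key in SECRETS})
    return cfg
```

With the three secrets exported in the shell, `config = load_config()` can stand in for the literal dictionary.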

In [4]:
# setup Langfuse tracker
langfuse_tracker = LangfuseTracker(session_id='supercharger001')

In [5]:
# instantiate language-model for DSPy
llm_model = dspy.OpenAI(
    api_key=config.get('OPENAI_API_KEY'),
    model='gpt-4o-mini'
)

## Prepare Dataset

In [11]:
nb_scraper = NB_Markdown_Scraper([f'../module_0{i}' for i in range(1,5)])
nb_scraper.scrape_markdowns()

In [12]:
with open("./dspy_content.tsv", "w") as record_file:
    for k,v in nb_scraper.notebook_md_dict.items():
        record_file.write(f"{k}\t{v}\n")
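Notebook markdown usually contains newlines and sometimes tabs, so a plain `f"{k}\t{v}\n"` record may not read back as one row per notebook. A minimal sketch of a safer round trip, using the standard `csv` module with a tab delimiter, is shown below. `write_tsv` and `read_tsv` are hypothetical helper names, and the sample records are made up.

```python
import csv
import io


def write_tsv(records, handle):
    """Write {name: text} pairs as tab-separated rows, quoting fields when needed."""
    writer = csv.writer(handle, delimiter="\t", quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    for name, text in records.items():
        writer.writerow([name, text])


def read_tsv(handle):
    """Read rows written by write_tsv back into a dict."""
    return {name: text for name, text in csv.reader(handle, delimiter="\t")}


# Illustrative records; real values would come from nb_scraper.notebook_md_dict
sample = {"nb_a": "# Title\n\tindented line", "nb_b": 'plain "quoted" text'}
buffer = io.StringIO()
write_tsv(sample, buffer)
buffer.seek(0)
restored = read_tsv(buffer)
```

When writing to a real file, opening it with `newline=""` is the mode the `csv` documentation recommends.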

In [13]:
# Prefix each notebook name with a running counter so every id is unique
doc_ids = [
    f'{ctr}_{k}'
    for ctr, k in enumerate(nb_scraper.notebook_md_dict, start=1)
]

### Ingest Data into ChromaDB
> Ensure the Chroma server is running in a separate terminal:
> ``$ chroma run --path ./chromadb``

In [6]:
chroma_emb_fn = embedding_functions.\
                    SentenceTransformerEmbeddingFunction(
                        model_name=config.get(
                            'CHROMA_EMB_MODEL'
                        )
                    )
client = chromadb.HttpClient()



In [11]:
# if collection exists, fetch it with the same embedding function used at creation
collection = client.get_collection(
    config.get('CHROMA_COLLECTION_NAME'), embedding_function=chroma_emb_fn)

In [10]:
collection = client.create_collection(
    config.get('CHROMA_COLLECTION_NAME'),
    embedding_function=chroma_emb_fn,
    metadata={"hnsw:space": "cosine"}
)
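The two cells above cover the "already exists" and "create new" cases separately, which means one of them fails depending on the state of the database. A minimal sketch of a single idempotent call is shown below, assuming the client offers `get_or_create_collection` as the `chromadb` client does. `ensure_collection` is a hypothetical wrapper name.

```python
def ensure_collection(client, name, embedding_function, space="cosine"):
    """Fetch the collection if it exists, otherwise create it.

    `client` is expected to behave like a chromadb client; `space` sets the
    HNSW distance metric, matching the metadata used in this notebook.
    """
    return client.get_or_create_collection(
        name=name,
        embedding_function=embedding_function,
        metadata={"hnsw:space": space},
    )
```

In this notebook the call would look like `collection = ensure_collection(client, config.get('CHROMA_COLLECTION_NAME'), chroma_emb_fn)`.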

In [19]:
# Add to collection
collection.add(
    documents=[v for _,v in nb_scraper.notebook_md_dict.items()], 
    ids=doc_ids, # must be unique for each doc
)

### Test Retrieval using ChromaDB Client

In [12]:
results = collection.query(
    query_texts=["RLHF"], # Chroma will embed using the function we provided
    n_results=3 # how many results to return
)
print(results['ids'][0])
print(results['distances'][0])
#print([i[:100] for j in results['documents'] for i in j])

['6_module_03_03_RLHF_phi2', '10_module_04_06_supercharge_llm_apps', '2_module_01_02_getting_started']
[0.6175035195275418, 0.7261012146561765, 0.8062081214907408]
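The numbers printed above are distances, so smaller means closer. Assuming Chroma's `cosine` space reports distance as one minus cosine similarity, the values can be converted back to similarities as sketched below. The helper name is hypothetical; the ids and distances are copied from the output above.

```python
def cosine_similarity_from_distance(distance: float) -> float:
    """Convert a cosine distance (assumed to be 1 - similarity) into a similarity."""
    return 1.0 - distance


ids = [
    "6_module_03_03_RLHF_phi2",
    "10_module_04_06_supercharge_llm_apps",
    "2_module_01_02_getting_started",
]
distances = [0.6175035195275418, 0.7261012146561765, 0.8062081214907408]

for doc_id, dist in zip(ids, distances):
    print(f"{doc_id}: similarity={cosine_similarity_from_distance(dist):.4f}")
# 6_module_03_03_RLHF_phi2: similarity=0.3825
# 10_module_04_06_supercharge_llm_apps: similarity=0.2739
# 2_module_01_02_getting_started: similarity=0.1938
```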


### Setup ChromaDB as DSPy Retriever 

In [13]:
retriever_model = ChromadbRM(
    config.get('CHROMA_COLLECTION_NAME'),
    config.get('CHROMA_DB_PATH'),
    embedding_function=chroma_emb_fn,
    client=client,
    k=5
)

# Test Retrieval
results = retriever_model("RLHF")
for result in results:
    display(Markdown(f"__Document__::{result.long_text[:100]}... \n"))
    display(Markdown(f">- __Document id__::{result.id} \n>- __Document score__::{result.score}"))

__Document__::# Quick Overview of RLFH

The performance of Language Models until GPT-3 was kind of amazing as-is. ... 


>- __Document id__::6_module_03_03_RLHF_phi2 
>- __Document score__::0.6174977412306334

__Document__::... 


>- __Document id__::10_module_04_06_supercharge_llm_apps 
>- __Document score__::0.7260969660795557

__Document__::# Getting Started : Text Representation
<img src="./assets/banner_notebook_1.jpg">


The NLP domain ... 


>- __Document id__::2_module_01_02_getting_started 
>- __Document score__::0.8062083377747705

__Document__::# Text Generation <a target="_blank" href="https://colab.research.google.com/github/raghavbali/llm_w... 


>- __Document id__::3_module_02_02_simple_text_generator 
>- __Document score__::0.8826038964887366

__Document__::# <img src="./assets/dspy_logo.png" width="2%"> DSPy: Beyond Prompting
---
<img src="./assets/dspy_b... 


>- __Document id__::12_module_04_05_dspy_demo 
>- __Document score__::0.9200280698248913

## Prepare DSPy Program

In [14]:
# Set up the LM and RM
dspy.settings.configure(lm=llm_model,rm=retriever_model)

In [15]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often less than 50 words")

In [16]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
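To make the control flow of `RAG.forward` explicit, here is a library-free sketch of the same retrieve-then-generate pattern. It is not how DSPy is implemented internally. The corpus, `toy_retrieve`, and `toy_generate` are made-up stand-ins for the Chroma retriever and the language model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Answer:
    context: List[str]
    answer: str


def rag_forward(question: str,
                retrieve: Callable[[str], List[str]],
                generate: Callable[[List[str], str], str],
                k: int = 3) -> Answer:
    """Retrieve the top-k passages, then generate an answer conditioned on them."""
    context = retrieve(question)[:k]
    return Answer(context=context, answer=generate(context, question))


# Toy stand-ins (hypothetical): word-overlap retrieval and a template "generator"
CORPUS = [
    "LoRA freezes base weights and learns low-rank updates.",
    "Beam search keeps the top candidate sequences while decoding.",
    "RLHF aligns a model using human preference data.",
]


def toy_retrieve(question: str) -> List[str]:
    words = set(question.lower().split())
    return sorted(CORPUS, key=lambda p: -len(words & set(p.lower().split())))


def toy_generate(context: List[str], question: str) -> str:
    return f"Based on {len(context)} passage(s): {context[0]}"


result = rag_forward("what is rlhf", toy_retrieve, toy_generate, k=2)
```

In the DSPy version, `dspy.Retrieve` plays the role of `retrieve` and `dspy.ChainOfThought(GenerateAnswer)` plays the role of `generate`.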

## Let us Answer Some Questions

In [69]:
# Note: no DSPy optimizer is applied here, so this is the uncompiled (zero-shot) program
compiled_rag = RAG()

In [70]:
my_questions = [
    "List the models covered in module03",
    "Brief summary of module02",
    "What is LLaMA?"
]

for question in my_questions:
    # Get the prediction. This contains `pred.context` and `pred.answer`.
    pred = compiled_rag(question)
    
    display(Markdown(f"__Question__: {question}"))
    display(Markdown(f"__Predicted Answer__: _{pred.answer}_"))
    display(Markdown("__Retrieved Contexts (truncated):__"))
    for idx,cont in enumerate(pred.context):
        print(f"{idx+1}. {cont[:200]}..." )
        print()
    display(Markdown('---'))

__Question__: List the models covered in module03

__Predicted Answer__: _The models covered in module 03 include LLaMA 3.1, Chinchilla, and Gopher._

__Retrieved Contexts (truncated):__

1. # Scaling Neural Nets and Efficient Training

We have covered quite some ground in previous 2 modules and observed the steady increase in size and performance of the models. These gains come at huge c...

2. # Prompt Engineering
<img src="./assets/pe_banner.jpg">

Prompt Engineering is this thrilling new discipline that opens the door to a world of possibilities with large language models (LLMs).

As a pr...

3. # Text Generation <a target="_blank" href="https://colab.research.google.com/github/raghavbali/llm_workshop/blob/main/module_02/02_simple_text_generator.ipynb">
  <img src="https://colab.research.goog...



---

__Question__: Brief summary of module02

__Predicted Answer__: _Module 02 focuses on text generation using pre-trained models like GPT-2, explaining foundation models, decoding strategies (greedy, beam search, sampling), and the impact of temperature on randomness. It also discusses limitations like long-range context and hallucination._

__Retrieved Contexts (truncated):__

1. # Prompt Engineering
<img src="./assets/pe_banner.jpg">

Prompt Engineering is this thrilling new discipline that opens the door to a world of possibilities with large language models (LLMs).

As a pr...

2. # Text Generation <a target="_blank" href="https://colab.research.google.com/github/raghavbali/llm_workshop/blob/main/module_02/02_simple_text_generator.ipynb">
  <img src="https://colab.research.goog...

3. # Scaling Neural Nets and Efficient Training

We have covered quite some ground in previous 2 modules and observed the steady increase in size and performance of the models. These gains come at huge c...



---

__Question__: What is LLaMA?

__Predicted Answer__: _LLaMA is a language model from Meta.AI, available in sizes 8B, 70B, and 405B, and it outperforms many existing LLMs on various benchmarks._

__Retrieved Contexts (truncated):__

1. # Open Source Vs Close Sourced LLMs

Similar to any other piece of technology, LLMs are available in all flavours and license types. While some of the most popular offerings are closed source (OpenAI ...

2. # Scaling Neural Nets and Efficient Training

We have covered quite some ground in previous 2 modules and observed the steady increase in size and performance of the models. These gains come at huge c...

3. # Retrieval Augmented LLM App
<img src="./assets/rap_banner.jpeg">

We have covered quite some ground in terms of understanding and building components for:
- Text Representation
- NLP Tasks
- Pretrai...



---

## Langfuse: Understanding Costs

<img src ='./assets/langfuse_dashboard.png'>

---

<img src = './assets/langfuse_traces.png'>

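The cost figures on a dashboard like this come down to token counts multiplied by per-token prices. The sketch below shows that arithmetic only. The prices and token counts are illustrative placeholders, not figures taken from Langfuse or from any provider's price list, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Cost in USD given token counts and prices quoted per one million tokens."""
    total = (prompt_tokens * input_price_per_1m
             + completion_tokens * output_price_per_1m)
    return total / 1_000_000


# Placeholder numbers: 1200 prompt tokens, 150 completion tokens,
# 0.15 USD per 1M input tokens and 0.60 USD per 1M output tokens
cost = estimate_cost(1200, 150, input_price_per_1m=0.15, output_price_per_1m=0.60)
print(f"{cost:.6f}")  # 0.000270
```

Because retrieved passages are sent as part of the prompt, the number of passages (`num_passages` in the `RAG` module) directly affects the prompt-token term.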
## Testing a Langfuse Dataset using Ollama (LLaMA 3.1)

In [23]:
langfuse = langfuse_tracker.langfuse
ollama_dspy = dspy.OllamaLocal(model='llama3.1', temperature=0.5)

# Use the local Ollama model as the LM; keep the Chroma retriever as the RM
dspy.settings.configure(lm=ollama_dspy, rm=retriever_model)

In [24]:
# get annotated dataset
annotated_dataset = langfuse.get_dataset("llm_workshop_rag")

In [25]:
# test rag using ollama
ollama_rag = RAG()
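The evaluation loop that follows recovers the question and the reference answer from the stored dataset text by splitting on the `Question: ` and `Answer: ` markers. The sketch below isolates that parsing so it can be checked on its own. The function names and the sample strings are hypothetical, and the real dataset items may be laid out differently.

```python
def extract_question(prompt_text: str) -> str:
    """Return the text after the last 'Question: ' marker, up to the end of that line."""
    return prompt_text.split("Question: ")[-1].split("\n")[0]


def extract_answer(response_text: str) -> str:
    """Return the text after the last 'Answer: ' marker."""
    return response_text.split("Answer: ")[-1]


# Made-up sample in the shape the loop appears to expect
sample_prompt = "Context: some passages\nQuestion: What is LLaMA?\nReasoning: Let's think"
sample_response = "Reasoning: it is a model family\nAnswer: A language model from Meta"

question_text = extract_question(sample_prompt)
answer_text = extract_answer(sample_response)
```

If a marker is absent, `str.split` returns the whole string unchanged, so a missing `Question: ` prefix yields the first line of the prompt rather than an error.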

In [27]:
for item in annotated_dataset.items:
    # Recover the question and reference answer from the stored prompt/response text
    question = item.input[0]['content'].split('Question: ')[-1].split('\n')[0]
    answer = item.expected_output['content'].split('Answer: ')[-1]
    o_pred = ollama_rag(question)
    # Link this run to the dataset item; `trace_id` identifies the linked trace
    with item.observe(
        run_name='ollama_experiment',
        run_description='compare LLaMA3.1 RAG vs GPT4o-mini RAG',
        run_metadata={"model": "llama3.1"},
    ) as trace_id:
        langfuse.score(
            trace_id=trace_id,  # attach the score to the dataset-run trace
            name="visual-eval",
            # any float value
            value=1.0,
            comment="LLaMA3.1 is very verbose",
        )
    langfuse.trace(input=question, output=o_pred.answer, metadata={'model': 'LLaMA3.1'})
    display(Markdown(f"__Question__: {question}"))
    display(Markdown(f"__Predicted Answer (LLaMA 3.1)__: {o_pred.answer}"))
    display(Markdown(f">__Annotated Answer (GPT-4o-mini)__: _{answer}_"))

__Question__: Brief summary of module02

__Predicted Answer (LLaMA 3.1)__: Here is a brief summary of module02:

* LoRA (Low-Rank Adaptation) technique for fine-tuning large models:
	+ Freezes base model weights
	+ Decomposes weight update matrix into lower rank matrices, reducing updates by 100-1000x
* qLoRA: Combines quantization and LoRA to further improve efficiency
* Model Parameters:
	+ Model size: 405 billion parameters
	+ Training dataset: 15 trillion data points
* GPU Performance and Compute Time:
	+ Compute required for training large models
	+ Cost of training large models
* Scaling Laws:
	+ Insights from the paper "Scaling Laws for Neural Language Models"

>__Annotated Answer (GPT-4o-mini)__: _Module 02 focuses on text generation using pre-trained models like GPT-2, explaining foundation models, decoding strategies (greedy, beam search, sampling), and the impact of temperature on randomness. It also discusses limitations like long-range context and hallucination._

__Question__: What is LLaMA?

__Predicted Answer (LLaMA 3.1)__: It seems like you're trying to follow along with a workshop on Large Language Models (LLMs) and their applications. However, the question about LLaMA was not fully answered.

To provide a complete answer:

Llama is a large language model developed by Meta AI. It's designed for natural language processing tasks such as text generation, translation, and more. Like other popular LLMs like BERT and RoBERTa, Llama uses self-supervised learning to learn patterns in language from vast amounts of text data.

Now, let's get back to the original question: "Fine-Tuning PEFT - SFT and LLM Landscape - Vector Databases - Libraries and Frameworks".

To answer this question:

The topic seems to be

>__Annotated Answer (GPT-4o-mini)__: _LLaMA is a language model from Meta.AI, available in sizes 8B, 70B, and 405B, and it outperforms many existing LLMs on various benchmarks._