## LLM Comparison

When building an LLM application, we have hundreds of models to choose from, each with different cost, latency, and performance characteristics. Importantly, LLM performance can vary widely across use cases. Rather than relying on standard benchmarks or leaderboard rankings, we want to evaluate each candidate LLM on the use case we actually need.

This kind of comparison is a core use case of TruLens. In this example, we'll walk through how to build a simple LangChain app and evaluate it across three different models: `google/flan-t5-small`, `google/gemma-2-2b`, and `gpt-4o-mini`.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/trulens/blob/main/examples/expositional/frameworks/langchain/langchain_model_comparison.ipynb)

### Import libraries

In [None]:
!pip install trulens trulens-apps-langchain trulens-providers-huggingface trulens-providers-openai langchain langchain_community langchain_huggingface langchain_openai huggingface_hub

In [None]:
import os

from langchain.prompts import PromptTemplate

# Imports main tools:
from trulens.core import Feedback
from trulens.core import TruSession
from trulens.apps.langchain import TruChain
from trulens.providers.huggingface import Huggingface
from trulens.providers.openai import OpenAI as fOpenAI

session = TruSession()

### Set API Keys

For this example, we need API keys for Hugging Face and OpenAI.

In [None]:
os.environ["HUGGINGFACE_API_KEY"] = "..."
os.environ["OPENAI_API_KEY"] = "..."
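Before moving on, it can help to confirm that both keys are set and that the `"..."` placeholders have been replaced. The following is a minimal sketch using only the standard library; `missing_keys` is a hypothetical helper, not part of TruLens or LangChain, and the variable names are the two used in the cell above.

```python
import os

# Variable names taken from the cell above.
REQUIRED_KEYS = ("HUGGINGFACE_API_KEY", "OPENAI_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return required names that are unset, empty, or still the "..." placeholder."""
    return [name for name in required if env.get(name, "") in ("", "...")]


absent = missing_keys(os.environ)
if absent:
    print("Set these environment variables before continuing:", ", ".join(absent))
```

The check only looks at whether a value is present; it does not verify that a key is valid for either service.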

### Set up prompt template

In [None]:
template = """Question: {question}

Answer: """
prompt = PromptTemplate(template=template, input_variables=["question"])
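To see the text each model will receive, the template can be filled in by hand. This sketch uses plain `str.format` as a stand-in for `PromptTemplate.format`, on the assumption that the default f-string-style template format is in use; the question is one of the prompts used later in this example.

```python
# Same template string as in the cell above.
template = """Question: {question}

Answer: """

# Plain str.format stands in for PromptTemplate.format here (assumption:
# the default f-string-style template format is in use).
rendered = template.format(question="What is the capital of Thailand?")
print(rendered)
```

The rendered string ends with `Answer: `, so each model is asked to continue the text with its answer.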

### Set up feedback functions

In [None]:
# API endpoints for models used in feedback functions:
hugs = Huggingface()
openai = fOpenAI()

# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(openai.relevance).on_input_output()
# By default this will evaluate feedback on main app input and main app output.

all_feedbacks = [f_qa_relevance]
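To make the shape of such a feedback function concrete, here is a toy stand-in: a plain function that takes the app input and the app output and returns a score between 0 and 1. `toy_relevance` is hypothetical and uses crude word overlap; it is not how `openai.relevance` works, which asks an LLM to grade the answer.

```python
import re


def toy_relevance(question: str, answer: str) -> float:
    """Fraction of distinct question words that also appear in the answer."""
    q_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    a_words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    if not q_words:
        return 0.0
    return len(q_words & a_words) / len(q_words)


score = toy_relevance(
    "What is the capital of Thailand?",
    "The capital of Thailand is Bangkok.",
)
print(score)
```

Word overlap rewards answers that merely repeat the question, which is one reason an LLM-based grader is the better choice for a real comparison.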

### Load the models and build the chains

In [None]:
from huggingface_hub import login

login(token=os.environ['HUGGINGFACE_API_KEY'])

In [None]:
from langchain.chains.llm import LLMChain
from langchain_huggingface import HuggingFaceEndpoint
from langchain_openai import OpenAI

# initialize the models
hub_llm_smallflan = HuggingFaceEndpoint(
    model="google/flan-t5-small", 
    temperature=1e-10, 
    max_new_tokens=250
)

gemma_2b = HuggingFaceEndpoint(
    model="google/gemma-2-2b", 
    temperature=1e-10
)

gpt = OpenAI(model="gpt-4o-mini")

# create prompt template > LLM chain
smallflan_chain = LLMChain(prompt=prompt, llm=hub_llm_smallflan)

gemma_2b_chain = LLMChain(prompt=prompt, llm=gemma_2b)

openai_gpt_chain = LLMChain(prompt=prompt, llm=gpt)

# TruLens instrumentation.
smallflan_app_recorder = TruChain(
    app_name="langchain_model_comparison", 
    app_version="small_flan", 
    app=smallflan_chain, 
    feedbacks=all_feedbacks
)

gemma_2b_app_recorder = TruChain(
    app_name="langchain_model_comparison", 
    app_version="gemma-2b", 
    app=gemma_2b_chain, 
    feedbacks=all_feedbacks
)

openai_gpt_app_recorder = TruChain(
    app_name="langchain_model_comparison", 
    app_version="GPT-4o-mini", 
    app=openai_gpt_chain, 
    feedbacks=all_feedbacks
)

### Run the application with all 3 models

In [None]:
prompts = [
    "Who won the superbowl in 2010?",
    "What is the capital of Thailand?",
    "Who developed the theory of evolution by natural selection?",
]

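# Use `question` as the loop variable so it does not shadow the
# `prompt` PromptTemplate defined earlier.
for question in prompts:
    with smallflan_app_recorder as recording:
        smallflan_chain(question)
    with gemma_2b_app_recorder as recording:
        gemma_2b_chain(question)
    with openai_gpt_app_recorder as recording:
        openai_gpt_chain(question)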
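The three `with` blocks repeat the same pattern, so the loop can also be sketched generically. `run_all` is a hypothetical helper, and the stand-ins below (`nullcontext` and lambdas) only mimic a recorder and a chain so that the sketch runs without any API keys.

```python
from contextlib import nullcontext


def run_all(pairs, questions):
    """Call each app on each question inside its recorder context.

    `pairs` maps a version label to a (recorder, app) tuple, where the
    recorder is a context manager and the app is a callable.
    """
    outputs = {label: [] for label in pairs}
    for question in questions:
        for label, (recorder, app) in pairs.items():
            with recorder:
                outputs[label].append(app(question))
    return outputs


# Stand-ins for illustration only; in the notebook these would be the
# TruChain recorders and LLMChain objects defined earlier.
demo_pairs = {
    "upper": (nullcontext(), lambda q: q.upper()),
    "length": (nullcontext(), lambda q: len(q)),
}
demo_outputs = run_all(demo_pairs, ["hi", "hello"])
print(demo_outputs)
```

Keeping the recorder and the app together in one mapping makes it harder to accidentally run a chain under the wrong recorder when more models are added.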

### Run the TruLens dashboard

In [None]:
from trulens.dashboard import run_dashboard

run_dashboard(session)