## Comparing models for our different tasks

In this notebook, we are going to run another model, Flan-T5-Large, alongside Granite-7B-Instruct and compare how the two behave.

Flan-T5-Large is much smaller: it runs without a GPU and uses only 4 GB of RAM. But is it up to the task?

### Requirements and Imports

If you selected the right workbench image to launch, as per the Lab's instructions, you should already have all the needed libraries. If not, uncomment the first line in the next cell to install the required packages.

In [None]:
# Uncomment the following line only if you have not selected the right workbench image, or are using this notebook outside of the workshop environment.
# !pip install --no-cache-dir --no-dependencies --disable-pip-version-check -r requirements.txt

import json
import time

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceTextGenInference, VLLMOpenAI

### LangChain pipeline

We are now going to define two different LLM endpoints and two different LangChain pipelines, one per model.

In [None]:
# LLM Inference Server URL
inference_server_url = "http://granite-7b-instruct-predictor.ic-shared-llm.svc.cluster.local:8080"

# LLM definition
llm = VLLMOpenAI(           # We use the vLLM OpenAI-compatible API client, but the model runs on OpenShift AI, not OpenAI.
    openai_api_key="EMPTY",   # That is why we don't need an OpenAI key here.
    openai_api_base=f"{inference_server_url}/v1",
    model_name="granite-7b-instruct",
    top_p=0.92,
    temperature=0.01,
    max_tokens=512,
    presence_penalty=1.03,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Flan-T5-Large LLM Inference Server URL
inference_server_url_flan_t5 = "http://llm-flant5.ic-shared-llm.svc.cluster.local:3000/"

# LLM definition
llm_flant5 = HuggingFaceTextGenInference(
    inference_server_url=inference_server_url_flan_t5,
    max_new_tokens=96,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)

The **template** will be the same for both models.

In [None]:
template="""<|system|>
You are a helpful, respectful and honest assistant.
Always assist with care, respect, and truth. Respond with utmost utility yet securely.
Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.
I will give you a text, then ask a question about it. Give a precise and as concise as possible answer to this question.
<|user|>
### TEXT:
{text}

### QUESTION:
{query}

### ANSWER:
<|assistant|>
"""
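# The declared variables must match the placeholders used in the template: {text} and {query}.
prompt = PromptTemplate(input_variables=["text", "query"], template=template)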
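
The template above relies on two placeholders, `{text}` and `{query}`. A quick way to double-check which variable names a template expects is to parse it with Python's standard `string.Formatter`. The sketch below assumes the template uses `str.format`-style placeholders; `template_fields` and `sample_template` are illustrative names that are not part of the notebook.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Return the named placeholders found in a str.format-style template."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text, so we filter those out.
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Shortened stand-in for the notebook's template (illustrative only).
sample_template = "### TEXT:\n{text}\n\n### QUESTION:\n{query}\n\n### ANSWER:\n"

fields = template_fields(sample_template)
print(sorted(fields))  # ['query', 'text']
```

Running the same helper on the full `template` string should list the names to pass when invoking the pipeline.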

And we can now create two **conversation** objects that we will use to query the two models.

In [None]:
conversation = prompt | llm
conversation_flant5 = prompt | llm_flant5

We are now ready to query the models!

In this example, we are only going to query one claim and see what happens. Of course, feel free to try with different ones.

In [None]:
filename = 'claims/claim1.json'

# Opening JSON file
claims = {}
with open(filename, 'r') as file:
    data = json.load(file)
claims[filename] = data

# Content and queries
text_input = f"Subject: {claims[filename]['subject']}\nContent:\n{claims[filename]['content']}"
sentiment_query = "What is the sentiment of the person sending this claim?"
location_query = "Where does the event the claim is related to happen?"
time_query = "When does the event the claim is related to happen?"

# Analyze the claim
print(f"***************************")
print(f"* Claim: {filename}")
print(f"***************************")
print("Original content:")
print("-----------------")
print(f"Subject: {claims[filename]['subject']}\nContent:\n{claims[filename]['content']}\n\n")
print('Analysis with Granite-7B-Instruct:')
print("--------")
start_granite = time.time()
print(f"- Sentiment: ")
conversation.invoke(input={"text": text_input, "query": sentiment_query});
print("\n- Location: ")
conversation.invoke(input={"text": text_input, "query": location_query});
print("\n- Time: ")
conversation.invoke(input={"text": text_input, "query": time_query});
print("\n\n                          ----====----\n")
end_granite = time.time()
print('Analysis with Flan-T5-Large:')
print("--------")
start_flan = time.time()
print(f"- Sentiment: ")
conversation_flant5.invoke(input={"text": text_input, "query": sentiment_query});
print("\n- Location: ")
conversation_flant5.invoke(input={"text": text_input, "query": location_query});
print("\n- Time: ")
conversation_flant5.invoke(input={"text": text_input, "query": time_query});
print("\n\n                          ----====----\n")
end_flan = time.time()

print(f"Granite analysis time: {end_granite - start_granite:.2f} seconds")
print(f"Flan analysis time: {end_flan - start_flan:.2f} seconds")

As you can see, Flan-T5-Large may produce some of the results faster, as it is only a 770-million-parameter model. However, those results are less accurate or less detailed. It works to some extent, but the results are nowhere near those from Granite-7B-Instruct, which is a 7-billion-parameter model.
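
The timings printed above cover all three queries at once for each model. To see where the time goes, it can help to measure each query separately. Here is a minimal sketch, assuming a helper named `time_queries` (hypothetical, not part of the notebook) and a fake model function so that it runs without any inference endpoint.

```python
import time
from typing import Callable


def time_queries(ask: Callable[[str], object], queries: list[str]) -> dict[str, float]:
    """Call `ask` on each query and return the elapsed seconds per query."""
    timings = {}
    for query in queries:
        start = time.perf_counter()  # monotonic clock, suited to measuring durations
        ask(query)
        timings[query] = time.perf_counter() - start
    return timings


# Hypothetical stand-in for a model call, so the sketch runs without an endpoint.
def fake_model(query: str) -> str:
    time.sleep(0.01)
    return f"answer to: {query}"


timings = time_queries(fake_model, ["sentiment", "location", "time"])
for query, seconds in timings.items():
    print(f"{query}: {seconds:.3f} seconds")
print(f"total: {sum(timings.values()):.3f} seconds")
```

In this notebook, `ask` could be something like `lambda q: conversation.invoke(input={"text": text_input, "query": q})`, and the same for `conversation_flant5`. Keep in mind that a single run is only an indication: network latency and server load also affect the numbers.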

The art of working with LLMs is to find the right balance between the performance and accuracy you require and the resources and costs involved.

Therefore, it's important to have confidence checks in place to make sure that, as your data changes or your model evolves, you always get the behaviour you expect.
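
One simple form such a check could take is a keyword test: for each question, verify that the model's answer contains at least one of the terms you consider acceptable. The sketch below is only an illustration under that assumption. The `check_answers` helper, the canned answers, and the expected keywords are all made up for the example and do not come from the workshop's claims.

```python
from typing import Callable


def check_answers(
    ask: Callable[[str], str], expectations: dict[str, list[str]]
) -> dict[str, bool]:
    """For each query, report whether the answer contains an accepted keyword."""
    results = {}
    for query, keywords in expectations.items():
        answer = ask(query).lower()  # case-insensitive comparison
        results[query] = any(keyword.lower() in answer for keyword in keywords)
    return results


# Hypothetical canned answers standing in for a real model call.
canned = {
    "What is the sentiment?": "The sender sounds frustrated.",
    "Where did it happen?": "I cannot tell.",
}

# Hypothetical expectations: accepted keywords per query.
expectations = {
    "What is the sentiment?": ["frustrated", "negative", "upset"],
    "Where did it happen?": ["springfield"],
}

results = check_answers(lambda q: canned[q], expectations)
for query, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'} - {query}")
```

With a real pipeline, `ask` would wrap a call such as `conversation.invoke(...)`, assuming it returns the generated text as a string. Keyword matching is crude, but it is cheap to run every time the data or the model changes.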