## Comparing models for our different tasks

In this notebook, we are going to use a second model, **Qwen2.5-0.5B-w8a8**, alongside **Granite-3.1-8B-Instruct** and see how it behaves.

Qwen2.5-0.5B-w8a8 is a much smaller, quantized model (think of it as "compressed"). It runs without a GPU and uses only 5 GB of RAM. But is it up to the task?

### Requirements and Imports

If you selected the right workbench image as per the Lab's instructions, you should already have all the needed libraries. If not, uncomment the `pip install` line in the next cell to install the required packages.

In [None]:
# Uncomment the following line only if you have not selected the right workbench image, or are using this notebook outside of the workshop environment.
# !pip install --no-cache-dir --no-dependencies --disable-pip-version-check -r requirements.txt

import json
import time

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.chat import SystemMessagePromptTemplate, HumanMessagePromptTemplate
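# Imported from langchain_core, since the langchain.callbacks path is deprecated in recent releases
from langchain_core.callbacks.streaming_stdout import StreamingStdOutCallbackHandler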

### Langchain pipeline

We are now going to define two different LLM endpoints, one per model, so that each can be queried through its own LangChain pipeline.

In [None]:
# LLM Inference Server URL
inference_server_url = "http://granite-3-1-8b-instruct-predictor.ic-shared-llm.svc.cluster.local:8080"

# LLM definition
llm = ChatOpenAI(
    openai_api_key="EMPTY",   # Private model, we don't need a key
    openai_api_base=f"{inference_server_url}/v1",
    model_name="granite-3-1-8b-instruct",
    temperature=0.01,
    max_tokens=512,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
    top_p=0.9,
    presence_penalty=0.5,
    model_kwargs={
        "stream_options": {"include_usage": True}
    }
)

In [None]:
# Qwen2.5 LLM Inference Server URL
inference_server_url_qwen = "http://qwen-predictor.ic-shared-llm.svc.cluster.local:8080"

# LLM definition
llm_qwen = ChatOpenAI(
    openai_api_key="EMPTY",   # Private model, we don't need a key
    openai_api_base=f"{inference_server_url_qwen}/v1",
    model_name="qwen2.5",
    temperature=0.01,
    max_tokens=128,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
    top_p=0.9,
    presence_penalty=0.5,
    model_kwargs={
        "stream_options": {"include_usage": True}
    }
)
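
The two definitions above differ only in the server URL, the model name, and `max_tokens`. As a minimal sketch (not part of the lab's code), the shared settings could come from one helper, so both models are guaranteed to use the same sampling parameters. The name `build_llm_kwargs` is hypothetical, and the callbacks are left out so the helper needs no LangChain import.

```python
def build_llm_kwargs(server_url: str, model_name: str, max_tokens: int) -> dict:
    """Return keyword arguments for ChatOpenAI; only three values vary per model."""
    return {
        "openai_api_key": "EMPTY",  # Private model, we don't need a key
        "openai_api_base": f"{server_url}/v1",
        "model_name": model_name,
        "temperature": 0.01,
        "max_tokens": max_tokens,
        "streaming": True,
        "top_p": 0.9,
        "presence_penalty": 0.5,
        "model_kwargs": {"stream_options": {"include_usage": True}},
    }


granite_kwargs = build_llm_kwargs(
    "http://granite-3-1-8b-instruct-predictor.ic-shared-llm.svc.cluster.local:8080",
    "granite-3-1-8b-instruct",
    max_tokens=512,
)
qwen_kwargs = build_llm_kwargs(
    "http://qwen-predictor.ic-shared-llm.svc.cluster.local:8080",
    "qwen2.5",
    max_tokens=128,
)

# In the notebook, these could then be passed on as, for example:
# llm = ChatOpenAI(callbacks=[StreamingStdOutCallbackHandler()], **granite_kwargs)
```

Keeping the differences in the call arguments makes it easy to see at a glance what is being compared: the model and its output budget, nothing else.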

The **template** will be the same for both models.

In [None]:
template = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(
        """You are a helpful, respectful and honest assistant.
        Always assist with care, respect, and truth. Respond with utmost utility yet securely.
        Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.
        I will give you a text, then ask a question about it. Give a precise and as concise as possible answer to this question.
        """),
    HumanMessagePromptTemplate.from_template(
        """### TEXT:
        {text}

        ### QUESTION:
        {query}

        ### ANSWER:"""
    )
])

We are now ready to query the models!

In this example, we are only going to query one claim and see what happens. Of course, feel free to try with different ones.

In [None]:
filename = 'claims/claim1.json'

# Opening JSON file
claims = {}
with open(filename, 'r') as file:
    data = json.load(file)
claims[filename] = data

# Content and queries
text_input = f"Subject: {claims[filename]['subject']}\nContent:\n{claims[filename]['content']}"
sentiment_query = "What is the sentiment of the person sending this claim?"
location_query = "Where does the event the claim is related to happen?"
time_query = "When does the event the claim is related to happen?"

# Analyze the claim
print("***************************")
print(f"* Claim: {filename}")
print("***************************")
print("Original content:")
print("-----------------")
print(f"Subject: {claims[filename]['subject']}\nContent:\n{claims[filename]['content']}\n\n")
print('Analysis with Granite-3.1-8B-Instruct:')
print("--------")
start_granite = time.time()
print("- Sentiment: ")
prompt = template.invoke({"text": text_input, "query": sentiment_query})
sentiment_granite = llm.invoke(input=prompt)
print("\n- Location: ")
prompt = template.invoke({"text": text_input, "query": location_query})
location_granite = llm.invoke(input=prompt)
print("\n- Time: ")
prompt = template.invoke({"text": text_input, "query": time_query})
time_granite = llm.invoke(input=prompt)
print("\n\n                          ----====----\n")
end_granite = time.time()
print('Analysis with Qwen:')
print("--------")
start_qwen = time.time()
print("- Sentiment: ")
prompt = template.invoke({"text": text_input, "query": sentiment_query})
sentiment_qwen = llm_qwen.invoke(input=prompt)
print("\n- Location: ")
prompt = template.invoke({"text": text_input, "query": location_query})
location_qwen = llm_qwen.invoke(input=prompt)
print("\n- Time: ")
prompt = template.invoke({"text": text_input, "query": time_query})
time_qwen = llm_qwen.invoke(input=prompt)
print("\n\n                          ----====----\n")
end_qwen = time.time()

print(f"Granite analysis time: {end_granite - start_granite:.2f} seconds")
print(f"Qwen analysis time: {end_qwen - start_qwen:.2f} seconds")
print("\n\n                          ----====----\n")
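
The cell above times each model by hand with paired `time.time()` calls. A small helper can do the same for any callable. The sketch below uses `time.perf_counter()`, which is intended for measuring intervals, and a stand-in function instead of a real model call. Both `timed` and `fake_llm_call` are hypothetical names, not part of the lab.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_llm_call(query: str) -> str:
    """Stand-in for llm.invoke(...) so the sketch runs without a model server."""
    time.sleep(0.05)
    return f"answer to: {query}"


answer, seconds = timed(fake_llm_call, "Where does the event happen?")
print(f"{answer!r} took {seconds:.2f} seconds")
```

With the real models, the same helper could be used as `sentiment_qwen, seconds = timed(llm_qwen.invoke, input=prompt)`, which gives a per-query timing instead of one total per model.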

As you can see, **Qwen2.5-0.5B-w8a8** can fairly easily be run on a CPU, as it is a compressed model with only 500 million parameters. However, its results may be less accurate or detailed, may contain hallucinations and unneeded text, and take much more time to produce. So it works to some extent, but the results are nowhere near those from Granite-3.1-8B-Instruct, an 8-billion-parameter model running on a GPU.

The art of working with LLMs is to find the right balance between the performance and accuracy you require and the resources they take, along with the costs involved.

Therefore, it's important to have confidence checks in place to make sure that, as your data changes or your model evolves, you always get the behaviour you expect.
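
As a minimal sketch of what such a confidence check could look like, the function below flags an answer that is empty, too long, or missing a term you expect for a given claim. The function name, the length limit, and the sample answers are all illustrative assumptions, not output from either model.

```python
from typing import List


def check_answer(answer: str, expected_terms: List[str], max_chars: int = 200) -> List[str]:
    """Return a list of problems found in a model answer (empty list means pass)."""
    problems = []
    text = answer.strip()
    if not text:
        problems.append("empty answer")
    if len(text) > max_chars:
        problems.append(f"answer longer than {max_chars} characters")
    lowered = text.lower()
    for term in expected_terms:
        if term.lower() not in lowered:
            problems.append(f"missing expected term: {term}")
    return problems


# Made-up answers, for illustration only
good = "The event happened in Los Angeles."
bad = "I am not sure. " * 20

print(check_answer(good, ["Los Angeles"]))
print(check_answer(bad, ["Los Angeles"]))
```

Checks like this are deliberately simple: they will not judge whether an answer is correct, but they can catch obvious regressions, such as rambling or off-topic output, when you swap one model for another.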