# Querying LLMs (Chatbots)

We will use [LangChain](https://www.langchain.com/), an open-source library for making applications with LLMs.


## The Language Model
We'll use models from [HuggingFace](https://huggingface.co/), a website that has tools and models for machine learning.
For this task, we’ll use the open-weights LLM
[meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B).
This is a small model with only 1 billion parameters.
It should be possible to use on most laptops.

```{admonition} Model types
`meta-llama/Llama-3.2-1B` is a *base model*.
Base models have been trained on large text corpora, but not *fine-tuned* to a specific task.
Many models are also available in versions that have been fine-tuned to follow instructions, called *instruct* or *chat* models.
Instruct models are more suitable for use in applications like chatbots.
```

## Model Location
We should tell the HuggingFace library where to store its data. If you’re running on Educloud/Fox in project ec443, the model is stored at the path below.

In [None]:
%env HF_HOME=/fp/projects01/ec443/huggingface/cache/
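
The `%env` line above is an IPython/Jupyter magic command, so it only works inside a notebook. In a plain Python script, a rough equivalent is to set the variable through `os.environ`. This is a minimal sketch that assumes the same cache path as above. The variable should be set before any HuggingFace library is imported, because the cache location is read at import time.

```python
import os

# Set the cache location before importing transformers or langchain,
# so that the HuggingFace libraries pick it up when they are loaded.
os.environ["HF_HOME"] = "/fp/projects01/ec443/huggingface/cache/"

print(os.environ["HF_HOME"])
```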

## Loading the Model
To use the model, we create a *pipeline*.
A pipeline can consist of several processing steps, but in this case, we only need one step.
We can use the method `HuggingFacePipeline.from_model_id()`, which automatically downloads the specified model from HuggingFace.

In [None]:
from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id='meta-llama/Llama-3.2-1B',
    task='text-generation',
    device=0,
    pipeline_kwargs={
        'max_new_tokens': 100,
        #'temperature': 0.3,
        #'num_beams': 4,
        #'do_sample': True
    }
)

We can give some arguments to the pipeline:
- `model_id`: the name of the model on HuggingFace
- `task`: the task you want to use the model for
- `device`: the GPU hardware device to use. If we don't specify a device, no GPU will be used.
- `pipeline_kwargs`: additional parameters that are passed to the model.
    - `max_new_tokens`: maximum number of new tokens to generate
    - `do_sample`: by default, the most likely next word is always chosen, which makes the output deterministic. We can introduce some randomness by sampling among the most likely words instead.
    - `temperature`: the temperature controls the statistical *distribution* of the next word and is usually between 0 and 1. A low temperature increases the probability of the most likely words. A high temperature increases the probability of outputting a rare word. The temperature only has an effect when `do_sample` is `True`. Model makers often recommend a temperature setting, which we can use as a starting point.
    - `num_beams`: by default, the model works with a single sequence of tokens/words. With beam search, the program builds multiple sequences at the same time and selects the best one at the end.
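
To make the effect of `do_sample` and `temperature` more concrete, here is a small pure-Python sketch. It does not call the model. The three candidate words and their scores are made up for illustration, and `softmax_with_temperature` is a hypothetical helper, not a function from LangChain or HuggingFace. The sketch divides the scores by the temperature before converting them to probabilities, which is the usual way temperature is applied.

```python
import math
import random


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next words (illustration only).
words = ["ship", "boat", "zeppelin"]
scores = [2.0, 1.0, 0.1]

# Print the probabilities at a low, a neutral and a high temperature.
for temperature in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Greedy decoding (like do_sample=False): always pick the highest-scoring word.
greedy = words[scores.index(max(scores))]

# Sampling (like do_sample=True): draw words according to the probabilities.
rng = random.Random(0)
sampled = rng.choices(words, weights=softmax_with_temperature(scores, 1.0), k=5)
print(greedy, sampled)
```

At a low temperature, almost all of the probability goes to the top-scoring word. At a high temperature, the probabilities move closer together, so rare words are drawn more often.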


## Making a Prompt
We can use a *prompt* to tell the language model how to answer.
The prompt should contain a few short, helpful instructions.
In addition, we provide a placeholder for the user's messages.
LangChain replaces it with the actual messages when we invoke the chatbot.

First, we import the library functions that we need:

In [None]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

Next, we make the list of messages that the prompt is built from.
It consists of a system message to the model and a placeholder for the user's messages.

In [None]:
messages = [
    SystemMessage("You are a pirate chatbot who always responds in pirate speak in whole sentences!"),
    MessagesPlaceholder(variable_name="messages")
]

This list of messages is then used to make the actual prompt:

In [None]:
prompt = ChatPromptTemplate.from_messages(messages)

LangChain processes input in *chains* that can consist of several steps.
Now we define our chain, which sends the prompt to the LLM.

In [None]:
chatbot = prompt | llm
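
The `|` operator above connects two steps so that the output of the first becomes the input of the second. The following toy sketch shows the general idea with plain Python. The `Step` class and the two example steps are hypothetical stand-ins written for this illustration. They are not LangChain classes, and LangChain's own implementation is more elaborate.

```python
class Step:
    """A hypothetical stand-in for a chain step; not a LangChain class."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Two toy steps: one builds a prompt string, the other pretends to be a model.
make_prompt = Step(lambda question: f"System: Talk like a pirate.\nHuman: {question}")
fake_llm = Step(lambda text: text + "\nAI: Arr!")

# Compose the steps into a chain and run it on one input.
chain = make_prompt | fake_llm
print(chain.invoke("Who are you?"))
```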

The chatbot is complete, and we can try it out by invoking it:

In [None]:
result = chatbot.invoke([HumanMessage("Who are you?")])
print(result)

Each time we invoke the chatbot, it starts fresh.
It has no memory of our previous conversation.
It's possible to add memory, but that requires more programming.

In [None]:
result = chatbot.invoke([HumanMessage("Tell me about your ideal boat?")])
print(result)
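
One simple way to give a chatbot memory is to keep a list of all messages so far and send the whole list on every turn. The sketch below shows only this bookkeeping. `fake_chatbot` is a hypothetical stand-in that lets the example run without a model, and the `(role, text)` tuples are an assumption made for illustration. With the real chatbot, one would presumably collect `HumanMessage` and `AIMessage` objects instead and pass the list to `chatbot.invoke`.

```python
def fake_chatbot(messages):
    """Hypothetical stand-in for the real chatbot; reports how many messages it got."""
    return f"Arr, I have seen {len(messages)} message(s) so far!"


history = []  # all messages of the conversation, as (role, text) pairs


def chat(user_text, bot=fake_chatbot):
    """Add the user's message to the history, ask the bot, and store its reply."""
    history.append(("human", user_text))
    reply = bot(history)  # the bot receives the whole conversation, not just the last message
    history.append(("ai", reply))
    return reply


print(chat("Who are you?"))
print(chat("What did I just ask you?"))
```

Because the history grows with every turn, a real application would also need to limit its length so that it fits within the model's context window.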

## Exercises

```{admonition} Exercise 1
:class: tip

The model `meta-llama/Llama-3.2-1B` is a small model and will yield low accuracy on many tasks.
To benefit from the power of the GPU, we should use a larger model.
Try to change the code in the pirate example to use the model `mistralai/Mistral-7B-Instruct-v0.3` instead.
How does this change the output?
```

```{admonition} Exercise 2
:class: tip

Continue using the model `mistralai/Mistral-7B-Instruct-v0.3`.
Try to change the temperature parameter, first to 0.9, then to 2.0 and 5.0.
How does this change the output?
```