# CP5403 Prac 1 - part 1
In this prac we'll explore how to run pre-existing LLMs using Hugging Face (https://huggingface.co). Hugging Face is a widely used machine learning/AI repository for models and datasets. It also provides access to inference services (i.e., running models on cloud services in order to use more computing power).

Hugging Face hosts [open-weights](https://promptmetheus.com/resources/llm-knowledge-base/open-weights-model) text-to-text LLMs such as the Gemma series by Google, the LLaMA series by Meta AI, GPT-2 and gpt-oss by OpenAI, and many more. Models for other tasks are also hosted, such as text-to-image (e.g., Stable Diffusion), text-to-speech, and text-to-video.

## Google Colab runtime
The CP5403 pracs are Jupyter Notebooks that are intended to be run using Google Colab. By default, Google Colab runs your code on a CPU only, but it also allows limited free access to a T4 GPU runtime, which significantly speeds up training and inference for machine learning. **Make sure to switch to the T4 GPU runtime before running any code -- in the top right, next to `Connect`, click the down arrow and then choose `Change runtime type`. Under `Hardware accelerator` select `T4 GPU`, then click save.**

## Sign up for Hugging Face
If you don't already have a Hugging Face account, visit https://huggingface.co/join

## Generate an access token
Some of the models available on Hugging Face are "gated"; they require you to agree to terms of service before you can download and use the model. Using these models requires an access token.

Visit https://huggingface.co/settings/tokens and click the button labelled `+ Create new token`. Select the `Read` token type and provide your token with a name (e.g., `prac-read`). Click `Create token`.

A `Save your Access Token` popup will appear. Click copy, and then paste your token below between the quotation marks.

In [None]:
HF_TOKEN = "" # paste your token here

from huggingface_hub import login
login(token=HF_TOKEN)

Note: this is not a safe way to store your tokens; we are doing it this way for now just so we can get started quickly. Do not store your token in a public place or share it.
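One common alternative is to keep the token out of the notebook text and read it from an environment variable instead. The sketch below assumes the variable is called `HF_TOKEN`; the helper name `read_hf_token` is hypothetical and is not part of `huggingface_hub`.

```python
import os
from getpass import getpass


def read_hf_token(env_var="HF_TOKEN", prompt_if_missing=False):
    """Return the token stored in an environment variable, or None if it is absent.

    Hypothetical helper for illustration only; the variable name is an assumption.
    """
    token = os.environ.get(env_var)
    if not token and prompt_if_missing:
        # getpass hides what you type, so the token is not echoed into the cell output
        token = getpass("Hugging Face token: ")
    return token or None
```

With this in place, the login call could become `login(token=read_hf_token(prompt_if_missing=True))`, so the token never appears in a saved copy of the notebook.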

## Allow access to Gemma model
Visit https://huggingface.co/google/gemma-3-4b-it and click `Acknowledge access`. Now you will be able to use this model.

In [None]:
from transformers import pipeline

# Use the instruction-tuned ("it") variant, matching the model page acknowledged above
pipe = pipeline("text-generation", model="google/gemma-3-4b-it")

## Running the LLM model with `pipeline`

Run the code below. It may take a few minutes to download the model.

In [None]:
from transformers import pipeline

pipe = pipeline("text-generation", model="google/gemma-3-4b-it")

The code above first imports the `pipeline` abstraction from Hugging Face's `transformers` module (see https://huggingface.co/docs/transformers/en/main_classes/pipelines).

We then create a `pipeline` for text generation with the `google/gemma-3-4b-it` model.

This model contains 4 billion parameters, hence it can take a while to download. Now we'll use the model to generate an answer to a question.

In [None]:
# We pass a list of messages to the pipeline.
# For Gemma models, the messages need to be provided in the following format.
messages = [
    {"role": "user", "content": "What is a koala? Explain in two sentences."}
]

# Pass the messages to the pipeline and retrieve the LLM's response
response = pipe(messages)

# The response comes back as a list containing the conversation so far.
# We can see the full format by printing it.
print(response)

This model has been tuned for *instruction following*. We provide a list containing a single message, and that message identifies its role: here it comes from the *user* (rather than the *assistant*).

```
messages = [
    {"role": "user", "content": "What is a koala? Explain in two sentences."}
]
```

Each message in the list is a dictionary that defines a `role` and its `content`. The format required for generation varies depending on the particular model.

We then just pass those messages to the `pipe` object using function call syntax (i.e., parentheses) and get back a response.

```
response = pipe(messages)
```

The response is a list that includes a single dictionary, mapping `'generated_text'` to another list of dictionaries. The first dictionary contains the message we provided from the user, and the second contains a response with the role of assistant. Because LLMs are probabilistic, the output will vary, but you should have seen a response like

```
[{'generated_text': [{'role': 'user', 'content': 'What is a koala? Explain in two sentences.'}, {'role': 'assistant', 'content': 'Koalas are adorable, herbivorous marsupials native to Australia, known for spending almost their entire lives in eucalyptus trees. They primarily eat eucalyptus leaves and have a specialized digestive system to handle their toxic diet!'}]}]
```

So, we can grab the assistant's text for the response with code like `assistant_text = response[0]["generated_text"][-1]["content"]`.
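As a minimal sketch of that indexing, the example below uses a hard-coded response with the same shape as the output shown above (the assistant text is shortened), so it runs without loading the model. The helper name `last_assistant_text` is hypothetical.

```python
# Sample response with the same structure as the pipeline output shown above
response = [{"generated_text": [
    {"role": "user", "content": "What is a koala? Explain in two sentences."},
    {"role": "assistant", "content": "Koalas are herbivorous marsupials native to Australia."},
]}]


def last_assistant_text(response):
    """Return the content of the final message in the first generated conversation."""
    conversation = response[0]["generated_text"]  # list of role/content dictionaries
    last_message = conversation[-1]               # the newest message is at the end
    if last_message["role"] != "assistant":
        raise ValueError("Expected the last message to come from the assistant")
    return last_message["content"]


assistant_text = last_assistant_text(response)
print(assistant_text)
```

Checking the role before returning the content guards against accidentally reading back the user's own message.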

## Varying generation parameters
When you give the pipeline a message, the first thing it does is break the text into tokens. These are the basic units an LLM can understand. A token might represent a whole word, part of a word, punctuation, numeral, or whitespace. Each token has a unique ID number in the model’s vocabulary. The model never sees raw text, only these token IDs. Later on, those IDs get turned into numerical vectors (embeddings), which the neural network can process. We will look more at tokens and the tokenisation process next week.
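To make the idea of token IDs concrete, here is a toy sketch. It is not Gemma's real tokeniser: it only splits on words, punctuation and whitespace (real tokenisers also split words into sub-word pieces), and the vocabulary is built on the fly from the example sentence. The function names are hypothetical.

```python
import re


def toy_tokenise(text):
    """Split text into runs of word characters, single punctuation marks, and whitespace runs."""
    return re.findall(r"\w+|[^\w\s]|\s+", text)


def build_vocab(tokens):
    """Give each distinct token an ID, numbered in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


text = "What is a koala? A koala is a marsupial."
tokens = toy_tokenise(text)
vocab = build_vocab(tokens)
token_ids = [vocab[token] for token in tokens]  # this list of IDs is what a model would see

print(tokens)
print(token_ids)
```

Note that repeated tokens such as `koala` map to the same ID each time, while `A` and `a` get different IDs because the toy vocabulary is case-sensitive.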

To create text, the model generates a series of tokens that are then converted into text. Generation takes place by looking at what the model considers the most probable next token to follow what it has previously seen. The generation is usually *probabilistic*; the model *samples* one of the most likely tokens according to a **temperature** parameter.

The lower the temperature, the more likely the model is to choose a higher probability token. The higher the temperature, the more likely the model is to choose a lower probability token. Temperature is usually a number larger than 0 and smaller than 2.
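The effect of temperature can be sketched with the usual formulation, in which the model's raw scores (logits) are divided by the temperature before being turned into probabilities with a softmax. The candidate tokens and scores below are made up for illustration and do not come from a real model.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing each score by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical next-token candidates and scores (not taken from a real model)
candidates = ["marsupial", "bear", "animal", "banana"]
logits = [3.0, 2.0, 1.5, -1.0]

for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sample one token at temperature 1.0, using a seeded generator for repeatability
rng = random.Random(0)
probs = softmax_with_temperature(logits, 1.0)
print(rng.choices(candidates, weights=probs, k=1)[0])
```

At a low temperature nearly all of the probability sits on the highest-scoring token; at a high temperature the distribution flattens, so unlikely tokens such as `banana` are sampled more often.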

Different models will use different default temperature values according to their generation config. Our model does not set a default temperature. In this case, the pipeline will use 0.7 by default.

Try out the code below, with different values for temperature.

In [None]:
messages = [
    {"role": "user", "content": "What is a koala?"}
]

response = pipe(messages, temperature=3.0, top_k=None, top_p=None)
print(response[0]["generated_text"][-1]["content"])

In [None]:
pipe.model.generation_config

In [None]:
messages = [
    {"role": "user", "content": "What is a koala?"}
]

outputs = []
response = pipe(messages, do_sample=False)
outputs.append(response[0]["generated_text"][-1]["content"])

response = pipe(messages, do_sample=False)
outputs.append(response[0]["generated_text"][-1]["content"])

print(outputs)

In [None]:
print(response)

In [None]:
pipe.model.generation_config

In [None]:
pipe.model.config

## Conversing with the LLM

In [None]:
user_input = input("Enter your text: ")
message_history = []
while user_input != "quit":
    message_history.append({"role": "user", "content": user_input})
    response = pipe(message_history)
    message_history = response[0]["generated_text"]
    last_assistant_response = response[0]["generated_text"][-1]["content"]
    print('\n' + "*" * 30)
    print("Assistant: ")
    print("*" * 30)
    print(last_assistant_response)
    print("*" * 30 + '\n')
    user_input = input("Enter your text: ")

## Creating a chatbot app with `Gradio`

In [None]:
import gradio as gr

def chat_response(message, history):
    """
    This is the core function that Gradio's ChatInterface will call.
    - 'message' is the new text input from the user.
    - 'history' is the conversation history, managed by Gradio.
    """
    
    # 1. Convert Gradio's history format to the format the model expects
    # Gradio history is a list of lists: [["user", "assistant"], ["user", "assistant"]]
    # Your model expects a list of dicts: [{"role": "user", "content": ...}, ...]
    messages = []
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    
    # 2. Add the latest user message
    messages.append({"role": "user", "content": message})
    
    # 3. Call the model pipeline
    response = pipe(messages)
    
    # 4. Extract just the last assistant response to return to the UI
    last_assistant_response = response[0]["generated_text"][-1]["content"]
    
    # Gradio's ChatInterface expects a single string response.
    # It will automatically append it to the chat display.
    return last_assistant_response

# --- Create and launch the Gradio Chat Interface ---
demo = gr.ChatInterface(
    fn=chat_response,
    title="My AI Chatbot",
    description="Enter your text and chat with the AI.",
    examples=[["Hello!"], ["How does a transformer model work?"]],
    # retry_btn=None,
    # undo_btn="Delete Previous",
    # clear_btn="Clear Chat",
)

demo.launch()
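The history conversion inside `chat_response` can be tested on its own, without Gradio or the model. The sketch below is a hypothetical helper, `history_to_messages`, written under the assumption that the history may arrive either as `[user, assistant]` pairs (as the comments above describe) or as dictionaries that already carry `role` and `content` keys, depending on the Gradio version and settings in use.

```python
def history_to_messages(history):
    """Convert chat history into a list of {"role", "content"} dictionaries.

    Hypothetical helper: accepts [user, assistant] pairs or ready-made
    role/content dictionaries (an assumption about possible history formats).
    """
    messages = []
    for entry in history:
        if isinstance(entry, dict):
            # Already in role/content form, so copy only the two fields we need
            messages.append({"role": entry["role"], "content": entry["content"]})
        else:
            user_msg, assistant_msg = entry
            messages.append({"role": "user", "content": user_msg})
            if assistant_msg is not None:
                messages.append({"role": "assistant", "content": assistant_msg})
    return messages


pair_history = [["Hello!", "Hi there."]]
dict_history = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
]
print(history_to_messages(pair_history))
print(history_to_messages(dict_history))
```

Both calls produce the same list of messages, which could then be extended with the newest user message before being passed to `pipe`.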