# Generation with LLMs

Import the required libraries and classes.

In [5]:
import torch

from typing import Tuple
from transformers import AutoModelForCausalLM, AutoTokenizer, BatchEncoding, PreTrainedTokenizer, PreTrainedModel



We will use ```Llama-2-7b-chat``` and ```Llama-3-8B-Instruct```, which are fine-tuned chat versions of the corresponding base models.
Because each model was fine-tuned with a specific prompt template, we use the same template at inference time so that the model sees its input in the format it was trained on.

In [6]:
MODELS = {
    "llama2": "meta-llama/Llama-2-7b-chat-hf",
    "llama3": "meta-llama/Meta-Llama-3-8B-Instruct",
}

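# Each template has two slots: the system prompt first, then the user input.
# The Llama 3 assistant header is followed by a blank line, as in the official prompt format.
TEMPLATES = {
    "llama2": "<s>[INST] <<SYS>>\n{}\n<</SYS>>\n\n{} [/INST]",
    "llama3": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
}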
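
To see what the model actually receives, the template can be filled with a system prompt and a user message. The sketch below repeats the Llama 2 template so that it runs on its own; `build_prompt` is a hypothetical helper name, not part of Transformers.

```python
# Llama 2 chat template, copied from TEMPLATES so the snippet is self-contained.
LLAMA2_TEMPLATE = "<s>[INST] <<SYS>>\n{}\n<</SYS>>\n\n{} [/INST]"


def build_prompt(template: str, system_prompt: str, user_input: str) -> str:
    """Fill the two positional slots: system prompt first, then user input."""
    return template.format(system_prompt, user_input)


prompt = build_prompt(
    LLAMA2_TEMPLATE,
    "You are a pizza ordering assistant.",
    "Hello, I would like a pizza.",
)
print(prompt)
```

Transformers tokenizers also provide `apply_chat_template`, which builds the prompt from the template shipped with the model; the manual templates here keep the format visible.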

Functions for loading a model and generating responses.

In [7]:
def load_model(model_name: str, dtype: str) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
    # Default to full precision; "bf16" stores the weights in half the memory.
    torch_dtype = torch.bfloat16 if dtype == "bf16" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",  # place the weights on the available devices automatically
        torch_dtype=torch_dtype,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

def generate(
    model: PreTrainedModel,
    inputs: BatchEncoding,
    tokenizer: PreTrainedTokenizer,
    max_seq_length: int,
) -> str:
    output = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=max_seq_length,  # counts prompt tokens plus generated tokens
        pad_token_id=tokenizer.eos_token_id,
    )
    # The output starts with the prompt tokens; decode only what was generated after them.
    return tokenizer.decode(
        output[0][len(inputs.input_ids[0]) :], skip_special_tokens=True
    )
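
Two details of `generate` are easy to miss: `max_length` counts the prompt tokens as well as the generated ones, and the returned sequence starts with the prompt, which is why the function slices it off before decoding. A minimal sketch with plain Python lists standing in for token tensors (the token IDs and helper names are made up for illustration):

```python
from typing import List


def new_token_budget(prompt_len: int, max_length: int) -> int:
    """Tokens left for the response when max_length includes the prompt."""
    return max(0, max_length - prompt_len)


def strip_prompt(output_ids: List[int], prompt_len: int) -> List[int]:
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_len:]


prompt_ids = [1, 518, 25580, 29962]        # hypothetical prompt token IDs
output_ids = prompt_ids + [4587, 3236, 2]  # hypothetical prompt + generated IDs

print(new_token_budget(len(prompt_ids), max_length=128))  # 124
print(strip_prompt(output_ids, len(prompt_ids)))          # [4587, 3236, 2]
```

With a long prompt, little or no room is left for the response. `model.generate` also accepts `max_new_tokens`, which limits only the generated part.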

Parameters and input for the generation.

In [8]:
model_name = "llama2"
chat_template = TEMPLATES[model_name]
model_name = MODELS[model_name]

dtype = "f32"  # or "bf16"
max_seq_length = 128

system_prompt = "You are a pizza ordering assistant."
input = "User: Hello, I would like a pizza. System: "

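Load the model and tokenizer based on the parameters. The Llama repositories on the Hugging Face Hub are gated: if your account has not been granted access, or you are not logged in, loading fails with an error like the one below.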

In [9]:
model, tokenizer = load_model(model_name, dtype)

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf.
401 Client Error. (Request ID: Root=1-673c9923-773fef7618333ce17d4a6a84;16a3b0d7-9799-489d-8418-60cf6a71302e)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/config.json.
Access to model meta-llama/Llama-2-7b-chat-hf is restricted. You must have access to it and be authenticated to access it. Please log in.
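
The error above means that the session is not authenticated, or that the account has not been granted access on the model page. One way to authenticate is to pass an access token to `huggingface_hub.login`. The sketch below assumes the token is stored in the `HF_TOKEN` environment variable; `read_hf_token` and `authenticate` are hypothetical helper names.

```python
import os
from typing import Optional


def read_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the access token from the environment, or None if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def authenticate() -> None:
    """Log in to the Hugging Face Hub with the token from the environment."""
    token = read_hf_token()
    if token is None:
        raise RuntimeError("Set HF_TOKEN to a Hugging Face access token first.")
    # Imported here so that reading the token works without the package installed.
    from huggingface_hub import login

    login(token=token)


# Call once before load_model(...):
# authenticate()
```

Authentication alone is not enough: access to the repository must also have been granted to the account that owns the token.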

Prepare the input and generate a response.

In [6]:
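# Format and tokenize the input.
# The template already contains the beginning-of-sequence token,
# so the tokenizer must not add a second one.
input_text = chat_template.format(system_prompt, input)
inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).to(model.device)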

# Generate a response
response = generate(model, inputs, tokenizer, max_seq_length)
print(response)

 Of course! What can I get for you today? 🍕

Would you like a classic margherita or something more adventurous like the "Meat Lover's Delight"? 🤔

Please let me know if you have any dietary restrictions or preferences, and I'll be happy to help you find the perfect pizza.


# Run on Cluster

* Connect to the [university VPN](https://servicedesk.unitn.it/sd/it/service/vpn?id=unitrento_v2_service_card&table=sc_cat_item&sys_id=e5a898fec35bfd104cbb7055df013171).

* Connect to the cluster:

    ```ssh firstname.lastname@marzola.disi.unitn.it```

* Copy the example sbatch file, which requests the resources and runs the job:

    ```cp /data/hmd_2024/example.sbatch .```


#### example.sbatch
```bash
#!/bin/bash
#SBATCH --gres=gpu:1            # request 1 gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2       # request 2 cpus
#SBATCH -N 1

#SBATCH --job-name=hmd_example  # job name to be displayed in the queue
#SBATCH --output=%x-%j.out      # output file name
#SBATCH --error=%x-%j.err

#SBATCH --partition edu5        # partition
#SBATCH -t 0:5:0                # request 5 minutes

/data/hmd_2024/run "$@"         # all parameters given to 'example.sbatch' are passed to '/data/hmd_2024/run'
```

#### Parameters

Parameters for the underlying Python script:

```
usage: python -m query_model [-h] [--system-prompt SYSTEM_PROMPT] [--dtype {f32,bf16}] [--max_seq_length MAX_SEQ_LENGTH] [--return-full] [--dotenv-path DOTENV_PATH] {llama2,llama3} INPUT_TEXT

Query a specific model with a given input.

positional arguments:
  {llama2,llama3}       The model to query.
  INPUT_TEXT            The input to query the model with.

options:
  -h, --help            show this help message and exit
  --system-prompt SYSTEM_PROMPT
                        The system prompt to use for the model. (default: )
  --dtype {f32,bf16}    The data type to use for the model. (default: f32)
  --max_seq_length MAX_SEQ_LENGTH
                        The maximum sequence length to use for the model. (default: 128)
  --return-full         Return the full output. (default: False)
  --dotenv-path DOTENV_PATH
                        The path to the .env file. (default: .env)
```
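
The help text above is enough to reconstruct an equivalent command-line interface. The following is a sketch rebuilt from that help output with `argparse`, not the actual source of `query_model`; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="python -m query_model",
        description="Query a specific model with a given input.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("model", choices=["llama2", "llama3"], help="The model to query.")
    parser.add_argument("input_text", metavar="INPUT_TEXT",
                        help="The input to query the model with.")
    parser.add_argument("--system-prompt", default="",
                        help="The system prompt to use for the model.")
    parser.add_argument("--dtype", choices=["f32", "bf16"], default="f32",
                        help="The data type to use for the model.")
    parser.add_argument("--max_seq_length", type=int, default=128,
                        help="The maximum sequence length to use for the model.")
    parser.add_argument("--return-full", action="store_true",
                        help="Return the full output.")
    parser.add_argument("--dotenv-path", default=".env",
                        help="The path to the .env file.")
    return parser


args = build_parser().parse_args([
    "--system-prompt", "You are a pizza ordering assistant.",
    "llama2",
    "User: Hello, I would like a pizza. System: ",
])
print(args.model, args.dtype, args.max_seq_length)  # llama2 f32 128
```

Note that `--max_seq_length` is spelled with underscores while the other options use hyphens, so it has to be typed exactly as shown.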

#### Run model to generate a response
Use ```sbatch``` to submit a job to the cluster.
All the settings for the job are in ```example.sbatch```.
Any arguments after the script name are passed on to the Python code.

```sbatch example.sbatch --system-prompt "You are a pizza ordering assistant." llama2 "User: Hello, I would like a pizza. System: "```
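
Because the prompts contain spaces, they must reach `sbatch` as single arguments. A minimal sketch of building the same command from Python, assuming `example.sbatch` is in the current directory; `build_sbatch_command` is a hypothetical helper.

```python
import shlex
from typing import List


def build_sbatch_command(model: str, input_text: str, system_prompt: str = "") -> List[str]:
    """Assemble the sbatch call; each list element is passed as one argument."""
    cmd = ["sbatch", "example.sbatch"]
    if system_prompt:
        cmd += ["--system-prompt", system_prompt]
    cmd += [model, input_text]
    return cmd


cmd = build_sbatch_command(
    "llama2",
    "User: Hello, I would like a pizza. System: ",
    system_prompt="You are a pizza ordering assistant.",
)
# shlex.join quotes the arguments so the line can be pasted into a shell.
print(shlex.join(cmd))
```

On the cluster, the list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting altogether.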

