## <font color='purple'>__Using Transformers to run LLMs__</font>

#### <font color='purple'>Basic workflow for using transformers</font>
1. Load the model parameters
2. Convert prompt query to tokens
3. Call model to process tokens and generate response tokens
4. Decode tokens to text response

#### <font color='purple'>Run Llama2 model with transformers</font>  
Llama2 model on the Hugging Face site: https://huggingface.co/meta-llama/Llama-2-13b-chat-hf  
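
When choosing among the Llama 2 sizes, a rough weights-only memory estimate can help judge whether a model is likely to fit on a single GPU. The sketch below multiplies a nominal parameter count by the bytes stored per parameter. The function name and the nominal counts (7, 13, and 70 billion) are illustrative assumptions, and real usage is higher because of activations, the KV cache, and framework overhead.

```python
# Rough weights-only memory estimate (illustrative; ignores activations and KV cache).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_memory_gb(n_params_billion: float, precision: str = "int8") -> float:
    """Return the approximate memory for the weights in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9


# Nominal sizes taken from the model names; actual parameter counts differ slightly.
for size in (7, 13, 70):
    fp16 = estimate_weight_memory_gb(size, "fp16")
    int8 = estimate_weight_memory_gb(size, "int8")
    print(f"Llama-2-{size}b: ~{fp16:.0f} GB in fp16, ~{int8:.0f} GB in 8-bit")
```

Under these assumptions, 8-bit loading roughly halves the weight footprint relative to fp16.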
##### Python script


In [None]:
##################################
# transformers using Llama2 model
##################################

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
import time
import pandas as pd
from pathlib import Path

# Set up model and directory info
llm_dir = "/kellogg/data/llm_models_opensource/llama2_meta_huggingface"
llm_model = "meta-llama/Llama-2-7b-chat-hf"
# llm_model = "meta-llama/Llama-2-13b-chat-hf"
# llm_model = "meta-llama/Llama-2-70b-chat-hf"

# 1. Load the model parameters
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(llm_model,cache_dir=llm_dir, device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(llm_model, cache_dir=llm_dir)

# For Llama2 chat models, enclose your prompt in [INST] and [/INST]
query = "[INST] Tell a fun fact about Kellogg Business School. [/INST]"

# 2. Convert prompt query to tokens
# With device_map="auto", send the inputs to the device where the model's first parameters live
model_input = tokenizer(query, return_tensors="pt").to(model.device)

# Generation settings passed to model.generate()
customize_setting = {
    "max_new_tokens": 400,
    "do_sample": True,
    "temperature": 0.8,
}
print("=== Customized settings:")
for key, value in customize_setting.items():
    print(f"    {key}: {value}")

# 3. Call model to process tokens and generate response tokens
outputs = model.generate(**model_input, **customize_setting)

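# 4. Decode tokens to text response
# Note: outputs[0] holds the prompt tokens followed by the generated tokens,
# so the decoded text includes the query as well as the response.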
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("====================")
print(f"LLM model: {llm_model}")
print(f"Query: {query}")
print("Response: ")
print(decoded)
print("====================")

# Logging
finished_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
columns = ["llm_model", "query", "response", "finished_time"]
row = [llm_model, query, decoded, finished_time]
for key, value in customize_setting.items():
    columns.append(key)
    row.append(value)
df = pd.DataFrame([row], columns=columns)
llm_name = llm_model.split("/")[-1]
log_file = Path(f"./log_{llm_name}.csv")
df.to_csv(log_file, index=False, mode='a', header=not log_file.exists())
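
The `[INST]` wrapping and the response handling in the script above can be factored into two small helpers. This is a minimal sketch, not part of the original script: the names `build_llama2_prompt` and `extract_response` are hypothetical, and the optional `<<SYS>>` system-prompt block is an assumption based on the Llama 2 chat prompt format. Because the decoded output contains the prompt followed by the reply, the second helper keeps only the text after the last `[/INST]` tag.

```python
from typing import Optional


def build_llama2_prompt(user_message: str, system_prompt: Optional[str] = None) -> str:
    """Wrap a user message in Llama 2 chat tags, optionally with a system prompt."""
    if system_prompt:
        # Assumed system-prompt layout for Llama 2 chat models.
        return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
    return f"[INST] {user_message} [/INST]"


def extract_response(decoded: str) -> str:
    """Return the text after the last [/INST] tag, i.e. the model's reply."""
    return decoded.rsplit("[/INST]", 1)[-1].strip()


query = build_llama2_prompt("Tell a fun fact about Kellogg Business School.")
print(query)

# Hypothetical decoded output: the prompt followed by a placeholder reply.
print(extract_response(query + " (model reply goes here)"))
```

Logging the extracted reply instead of the full decoded string would keep the `query` and `response` columns from overlapping.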

##### Slurm script
The following script submits the GPU job on Quest. To run on the Kellogg GPU nodes instead, change the `#SBATCH -A` and `#SBATCH -p` lines to:  
```  
#SBATCH -A kellogg
#SBATCH -p kellogg
```  
__# run_batch_llama2.sh__  
```
#!/bin/bash
# Run on Quest

#SBATCH -A your_quest_allocation_account
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 0:30:00
#SBATCH --mem=40G

module purge
module load mamba/23.1.0
source /hpc/software/mamba/23.1.0/etc/profile.d/conda.sh
source activate /kellogg/software/envs/gpu-llama2

python test_llama2.py
```
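
Since the Quest and Kellogg variants differ only in the `-A` and `-p` lines, one way to avoid maintaining two copies is to render the script from a template. The sketch below is illustrative only: `SLURM_TEMPLATE` and `render_slurm_script` are hypothetical names, and the remaining lines are copied from the script above.

```python
# Template mirroring run_batch_llama2.sh; only account, partition, and script name vary.
SLURM_TEMPLATE = """#!/bin/bash
#SBATCH -A {account}
#SBATCH -p {partition}
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 0:30:00
#SBATCH --mem=40G

module purge
module load mamba/23.1.0
source /hpc/software/mamba/23.1.0/etc/profile.d/conda.sh
source activate /kellogg/software/envs/gpu-llama2

python {script}
"""


def render_slurm_script(account: str, partition: str, script: str = "test_llama2.py") -> str:
    """Fill in the allocation account, partition, and Python script name."""
    return SLURM_TEMPLATE.format(account=account, partition=partition, script=script)


# Kellogg variant: both the account and the partition are "kellogg".
kellogg_job = render_slurm_script(account="kellogg", partition="kellogg")
print(kellogg_job)
```

The rendered text can be written to a `.sh` file and submitted in the usual way.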

#### <font color='purple'>Run Mistral model with transformers</font>  
Check out scripts/transformers/test_mistral.py.  

#### <font color='purple'>Run Gemma model with transformers</font>  
Check out scripts/transformers/test_gemma.py.  