# How to Use Transformers to run LLMs

## Basic workflow for using transformers
<font color='red'>1. Load the model parameters</font>  
<font color='orange'>2. Convert prompt query to tokens</font>  
<font color='darkblue'>3. Call model to process tokens and generate response tokens</font>  
<font color='green'>4. Decode tokens to text response</font>  
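
The four steps above can be sketched as two small functions. This is a minimal sketch, not the project's own code: the function names `load_model_and_tokenizer` and `generate_response` are hypothetical, and it assumes a CUDA GPU plus the `transformers` and `accelerate` packages (the latter is needed for `device_map="auto"`).

```python
# Hypothetical helper names; a sketch of the four-step workflow.

def load_model_and_tokenizer(model_name, cache_dir=None):
    """Step 1: load the model parameters and the matching tokenizer."""
    # Imported inside the function so the module can be imported
    # even where transformers is not installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_name, cache_dir=cache_dir, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
    return model, tokenizer


def generate_response(model, tokenizer, prompt, device="cuda", **generate_kwargs):
    """Steps 2-4: tokenize the prompt, generate, and decode."""
    # Step 2: convert the prompt text to token IDs on the target device.
    model_input = tokenizer(prompt, return_tensors="pt").to(device)
    # Step 3: call the model to produce response token IDs.
    outputs = model.generate(**model_input, **generate_kwargs)
    # Step 4: decode the first (and only) sequence back to text.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

A call would look like `generate_response(model, tokenizer, "[INST] Hello [/INST]", max_new_tokens=50)`. Extra keyword arguments are passed straight through to `model.generate`.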

## Run Llama2 model with transformers  
Model page on Hugging Face (13b chat variant; the script below defaults to the 7b variant): https://huggingface.co/meta-llama/Llama-2-13b-chat-hf  
### Python script
__# test_llama2.py__  
  
from transformers import AutoTokenizer, AutoModelForCausalLM  
from transformers import BitsAndBytesConfig  
import time  
import pandas as pd  
from pathlib import Path  
  
start_time = time.time()  

_# Set up model and directory info_    
llm_dir = "/kellogg/data/llm_models_opensource/llama2_meta_huggingface"  
llm_model = "meta-llama/Llama-2-7b-chat-hf"  
_# llm_model = "meta-llama/Llama-2-13b-chat-hf"_  
_# llm_model = "meta-llama/Llama-2-70b-chat-hf"_  
  
<font color='red'>quantization_config = BitsAndBytesConfig(load_in_8bit=True)</font>   
<font color='red'>model = AutoModelForCausalLM.from_pretrained(llm_model,cache_dir=llm_dir, device_map="auto", quantization_config=quantization_config)</font>   
<font color='red'>tokenizer = AutoTokenizer.from_pretrained(llm_model, cache_dir=llm_dir)</font>   
  
print(f"=== Loading time: {time.time() - start_time} seconds")  
  
_# For Llama 2 chat models, enclose your prompt in [INST] and [/INST]_  
query = "[INST] Tell a fun fact about Kellogg Business School. [/INST]"  
  
<font color='orange'>device = "cuda"</font>  
<font color='orange'>model_input = tokenizer(query, return_tensors="pt").to(device)</font>  

_# Settings for LLM model_    
customize_setting = {  
    "max_new_tokens": 400,  
    "do_sample": True,  
    "temperature": 0.8,  
}  
print("=== Customized settings:")  
for key, value in customize_setting.items():  
    print(f"    {key}: {value}")  
  
<font color='darkblue'>outputs = model.generate(**model_input, **customize_setting)</font>  
<font color='green'>decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)</font>  
  
print("====================")  
print(f"LLM model: {llm_model}")  
print(f"Query: {query}")  
print("Response: ")  
print(decoded)  
print("====================")  
  
end_time = time.time()  
execution_time = end_time - start_time  
print(f"Execution time: {execution_time} seconds")  
finished_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())  
print(f"Finished at: {finished_time}")  
print("====================")  
  
_# Logging_  
columns = ["llm_model", "query", "response", "finished_time"]  
row = [llm_model, query, decoded, finished_time]  
for key, value in customize_setting.items():  
    columns.append(key)  
    row.append(value)  
df = pd.DataFrame([row], columns=columns)  
llm_name = llm_model.split("/")[-1]  
log_file = Path(f"./log_{llm_name}.csv")  
df.to_csv(log_file, index=False, mode='a', header=not log_file.exists())  
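
The script above wraps the prompt in `[INST]` and `[/INST]` by hand. A small helper can do this consistently and can also add an optional system prompt using the Llama 2 `<<SYS>>` markers. This is a sketch: the function name `build_llama2_prompt` is hypothetical, and it covers only a single-turn prompt.

```python
def build_llama2_prompt(user_message, system_prompt=None):
    """Wrap a single-turn message in the Llama 2 chat markers.

    Hypothetical helper, not part of transformers.
    """
    user_message = user_message.strip()
    if system_prompt:
        # The optional system prompt sits inside <<SYS>> ... <</SYS>>
        # within the first [INST] block.
        return (
            f"[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"


# Reproduces the query string used in test_llama2.py.
query = build_llama2_prompt("Tell a fun fact about Kellogg Business School.")
print(query)
```

Recent versions of transformers also provide `tokenizer.apply_chat_template`, which may be preferable when the installed version supports it.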
  


### Slurm script
The following script runs the GPU job on Quest. To run on the Kellogg GPU nodes instead, change the `#SBATCH -A` and `#SBATCH -p` lines to:  
```  
#SBATCH -A kellogg
#SBATCH -p kellogg
```  
__# run_batch_llama2.sh__  
```
#!/bin/bash
# Run on Quest

#SBATCH -A your_quest_allocation_account
#SBATCH -p gengpu
#SBATCH --gres=gpu:a100:1
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 0:30:00
#SBATCH --mem=40G

module purge
module load mamba/23.1.0
source /hpc/software/mamba/23.1.0/etc/profile.d/conda.sh
source activate /kellogg/software/envs/gpu-llama2

python test_llama2.py
```
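
Whether a given checkpoint fits on the requested GPU depends mainly on the size of its weights. The rough estimate below multiplies the parameter count by the bytes per parameter. It is a sketch under stated assumptions: the parameter counts are the nominal values from the model names (7b taken as 7e9, and so on), the function name `estimate_weight_memory_gb` is hypothetical, and activations, the KV cache, and framework overhead are ignored, so real usage will be higher.

```python
def estimate_weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


# Nominal parameter counts read off the model names (an assumption).
nominal_sizes = {
    "Llama-2-7b-chat-hf": 7e9,
    "Llama-2-13b-chat-hf": 13e9,
    "Llama-2-70b-chat-hf": 70e9,
}

for name, num_parameters in nominal_sizes.items():
    fp16_gb = estimate_weight_memory_gb(num_parameters, 16)
    int8_gb = estimate_weight_memory_gb(num_parameters, 8)  # load_in_8bit=True
    print(f"{name}: ~{fp16_gb:.0f} GB in 16-bit, ~{int8_gb:.0f} GB in 8-bit")
```

By this estimate, 8-bit loading roughly halves the weight footprint compared with 16-bit, but the 70b model may still need more than one GPU depending on the memory of the A100 allocated.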

## Run Mistral model with transformers  
Check out scripts/test_mistral.py.  

## Run Gemma model with transformers  
Check out scripts/test_gemma.py.  