## Explore how to use a model that is hosted on Hugging Face
#### The Hugging Face "transformers" library offers two ways to run inference, with different levels of control:
* **Option 1:** Use the "pipeline" function for fast model setup
* **Option 2:** Call tokenisation, generation and decoding separately, which lets you control each step (for example the chat template and the generation parameters)

In [None]:
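# install the packages if needed
# ("accelerate" is required for device_map="auto"; "gc" is part of the Python standard library)
# !pip install transformers huggingface_hub torch accelerate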

In [None]:
# import packages
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import notebook_login
import torch
import gc 

In [None]:
# check GPU availability
# if you selected "GPU T4 x2", then you should see 2 instances of Tesla T4 with 15GB each.
!nvidia-smi
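The `nvidia-smi` table can also be read programmatically, which is handy for comparing memory use before and after loading a model. The following is a minimal sketch: the helper name `gpu_memory_report` is hypothetical, and it simply returns an empty list when `nvidia-smi` is not available on the machine.

```python
import shutil
import subprocess


def gpu_memory_report():
    """Return one 'name, used, total' line per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.used,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    # keep only non-empty lines, one per GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


for line in gpu_memory_report():
    print(line)
```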

## Option 1: Using the pipeline function 

In [None]:
# use Pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device: ", device)
model = pipeline(task = "text-generation",
                model = "Qwen/Qwen2-7B-Instruct",
                device_map = "auto",
                torch_dtype = torch.float16) # specify precision for model weights
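To see why the precision of the weights matters on 15GB GPUs, a rough back-of-the-envelope estimate helps. The sketch below assumes a nominal 7 billion parameters (the "7B" in the model name taken at face value; the real count differs somewhat) and counts the weights only, not activations or the generation cache. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_parameters, bytes_per_parameter):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / 2**30


nominal_params = 7_000_000_000  # assumption: "7B" taken at face value

fp32_gib = weight_memory_gib(nominal_params, 4)  # float32 uses 4 bytes per weight
fp16_gib = weight_memory_gib(nominal_params, 2)  # float16 uses 2 bytes per weight
print(f"float32: {fp32_gib:.1f} GiB, float16: {fp16_gib:.1f} GiB")
# float32: 26.1 GiB, float16: 13.0 GiB
```

Under these assumptions the float16 weights alone come close to filling a single T4, so letting `device_map = "auto"` distribute the model over both GPUs leaves more room for generation.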

In [None]:
# check that model is loaded on the GPU
!nvidia-smi

In [None]:
# prompt the model
output = model("Explain how to win a hackathon using AI", max_new_tokens = 250)
print(output[0]["generated_text"])

In [None]:
# delete model from GPU
del model  # drop the Python reference to the pipeline

# PyTorch keeps the freed GPU memory in its cache,
# so also run garbage collection and empty the CUDA cache
gc.collect()
torch.cuda.empty_cache()

In [None]:
# check that the GPU memory has been released
!nvidia-smi

## Option 2: Manually do model loading, tokenisation, generation and decoding

In [None]:
# log in with an access token (copy it from your Hugging Face account settings)
notebook_login()

In [None]:
# load model and tokenizer manually, this time for a model that requires an access token
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device: ", device)
model = AutoModelForCausalLM.from_pretrained("google/gemma-1.1-7b-it",
                device_map = "auto",
                torch_dtype = torch.float16)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-7b-it")
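The next cell calls `tokenizer.apply_chat_template`, which wraps the chat messages in the turn markers the model was trained on. To make that step less opaque, here is a simplified, hand-written imitation of a Gemma-style format. The function `format_gemma_style_chat` is hypothetical and omits details such as the beginning-of-sequence token; the real template ships with the tokenizer and may differ, so always use `apply_chat_template` in practice.

```python
def format_gemma_style_chat(messages, add_generation_prompt=True):
    """Hypothetical, simplified imitation of a Gemma-style chat template."""
    parts = []
    for message in messages:
        # Gemma-style templates label the assistant turn "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # an open model turn signals that the model should answer next
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example_chat = [{"role": "user", "content": "Give me a short introduction to large language models."}]
print(format_gemma_style_chat(example_chat))
```

Setting `add_generation_prompt=False` leaves out the open model turn, which is what you would want when formatting a finished conversation rather than asking for a reply.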

In [None]:
# your prompt
prompt = "Give me a short introduction to large language models."
chat = [
    { "role": "user", "content": prompt },
]

# apply the correct template to your prompt
text = tokenizer.apply_chat_template(
    chat,
    tokenize=False,
    add_generation_prompt=True
)

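# tokenize the prompt
# (the chat template already inserted the special tokens, so do not add them a second time)
model_inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(device)

# generate output (passing the attention mask along with the input ids)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)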

# extract only the response tokens
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

# decode tokens back to text
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
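The slicing step is needed because, for decoder-only models like this one, `generate` returns the prompt tokens followed by the newly generated tokens. The toy example below shows the same list comprehension on plain Python lists; the token ids are made up for illustration.

```python
# toy token ids standing in for real tensors (made-up values)
prompt_ids = [[2, 106, 1645, 108]]
full_output_ids = [[2, 106, 1645, 108, 9413, 708, 1]]

# drop the first len(prompt) tokens of each output to keep only the response
response_ids = [
    output[len(prompt):] for prompt, output in zip(prompt_ids, full_output_ids)
]
print(response_ids)  # [[9413, 708, 1]]
```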