## Preface

### Getting the model

Llama models are hosted in several places, including [llama.com](https://www.llama.com/llama-downloads), [Hugging Face](https://huggingface.co/meta-llama), and [Kaggle](https://www.kaggle.com/organizations/metaresearch/models).

Meta currently requires you to request access before you can download the models. The request form is found on whichever platform you plan to download from. In my experience, responses were very fast.

For this demo, I decided to use Hugging Face, as I found their instructions the easiest to follow.

## Example

First, we will install the Hugging Face CLI, which helps with logins and authentication, along with transformers and accelerate.

In [None]:
!pip install -U "huggingface_hub[cli]"
!pip install transformers
!pip install accelerate
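After the installs finish, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch; `installed_versions` is a hypothetical helper name of my own, and it only relies on the standard library's `importlib.metadata`.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Distribution names as used with pip above; torch usually comes preinstalled on Colab.
print(installed_versions(["huggingface_hub", "transformers", "accelerate", "torch"]))
```

Any entry that prints as `None` means the corresponding `pip install` did not take effect in the current kernel.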

Once everything is installed, we can log in. I opted to create an access token and store it in the notebook's secrets.

In [2]:
from google.colab import userdata
from huggingface_hub import login

login(userdata.get('HF_TOKEN'))
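The cell above only works inside Colab, since `google.colab` is not importable elsewhere. As a rough sketch of how the same notebook could run in other environments, the hypothetical helper below (`get_hf_token` is my own name, not part of any library) tries the Colab secrets first and falls back to an environment variable of the same name.

```python
import os


def get_hf_token(name="HF_TOKEN"):
    """Look up a token in Colab secrets if available, else in the environment."""
    token = None
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        token = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook. Fall through to the environment variable.
        token = None
    return token or os.environ.get(name)
```

The result can then be passed to `login(...)` exactly as before. If it returns `None`, no token was found in either place.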

Import a couple of libraries that will be used in later cells, including transformers and PyTorch.

In [5]:
from transformers import AutoTokenizer
import transformers
import torch

import datetime

Designate the model we want to use. It has to be downloaded on first use, but it is cached and available for future queries from that point on. I am using the model id of a predefined tokenizer hosted inside a model repo on Hugging Face, but a directory path also works if the model is already local.

In [None]:
start = datetime.datetime.now()
print(start)

l3_model = "meta-llama/Llama-3.2-1B-Instruct"
l3_tokenizer = AutoTokenizer.from_pretrained(l3_model)

l3_pipeline = transformers.pipeline(
    "text-generation",
    model=l3_model,
    tokenizer=l3_tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

print(datetime.datetime.now() - start)
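Since both a model id and a local directory are accepted, one option is to prefer a local copy when it exists. This is only a sketch under my own assumptions: `resolve_model_source` and the `models/` folder layout are hypothetical, and checking for `config.json` is a simple heuristic for "this looks like a saved model directory".

```python
from pathlib import Path


def resolve_model_source(model_id, local_root="models"):
    """Return a local directory for model_id if one exists, else the id itself.

    Assumes (hypothetically) that local copies are stored under
    <local_root>/<model_id> and contain a config.json file.
    """
    candidate = Path(local_root) / model_id
    if (candidate / "config.json").is_file():
        return str(candidate)
    # No local copy found, so let the library fetch it from the hub by id.
    return model_id
```

The returned string could be passed wherever `l3_model` is used above, for example `AutoTokenizer.from_pretrained(resolve_model_source(l3_model))`.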

In [None]:
start = datetime.datetime.now()
print(start)

sequences = l3_pipeline(
    'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=l3_tokenizer.eos_token_id,
    truncation=True,
    max_length=400,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

print(datetime.datetime.now() - start)
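The start/print/subtract timing pattern repeats in every cell. If that gets tedious, it could be wrapped in a small context manager. This is a sketch rather than anything the notebook requires; `timed` and its `sink` parameter are hypothetical names of my own, and it uses `time.perf_counter` instead of `datetime` since that clock is intended for measuring durations.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=print):
    """Report how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        sink(f"{label}: {elapsed:.2f}s")


# Usage sketch: wrap any slow step.
with timed("sleep demo"):
    time.sleep(0.01)
```

A generation call would then sit inside a `with timed("generation"):` block in place of the manual `start` bookkeeping.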

Example with the larger CodeLlama model.

In [None]:
start = datetime.datetime.now()
print(start)

cl_model = "meta-llama/CodeLlama-7b-hf"
#cl_model = "meta-llama/Llama-2-7b-hf"
cl_tokenizer = AutoTokenizer.from_pretrained(cl_model)

cl_pipeline = transformers.pipeline(
    "text-generation",
    model=cl_model,
    tokenizer=cl_tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

print(datetime.datetime.now() - start)

Send the input prompt and generation settings to the pipeline, then check the result. Some models, such as the Instruct variants, expect more intricate prompt formatting.

In [None]:
# Llama-2-style instruction layout. The tokenizer normally prepends the BOS
# token (<s>) itself, so it is not written into the template.
base_prompt = "[INST]\n<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}[/INST]"

# Named `prompt` rather than `input` to avoid shadowing the Python builtin.
prompt = base_prompt.format(
    system_prompt="Answer the following programming questions as a knowledgeable engineer.",
    user_prompt="What casting options are there in C++?",
)

print(prompt)

start = datetime.datetime.now()
print(start)

sequences = cl_pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=cl_tokenizer.eos_token_id,
    truncation=True,
    max_length=400,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

print(datetime.datetime.now() - start)
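If the `[INST]` / `<<SYS>>` layout is needed for several prompts, it may be easier to build it with a small function than to fill in a template string each time. The sketch below is my own hypothetical helper (`build_inst_prompt`), following the commonly used Llama-2-style spacing. It assumes the tokenizer adds the beginning-of-sequence token, so none is written here, and it covers a single turn only.

```python
def build_inst_prompt(user_prompt, system_prompt=None):
    """Build a single-turn, Llama-2-style instruction prompt.

    The system prompt is optional. When given, it is wrapped in
    <<SYS>> ... <</SYS>> and placed before the user text.
    """
    body = user_prompt.strip()
    if system_prompt:
        body = f"<<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{body}"
    return f"[INST] {body} [/INST]"


print(build_inst_prompt(
    "What casting options are there in C++?",
    system_prompt="Answer the following programming questions as a knowledgeable engineer.",
))
```

The returned string can be passed to the pipeline in the same way as the formatted template above.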