# Run Llama 3.1 in a notebook

This example shows how to load a [Llama 3.1 model](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/) on Verily Workbench and run inference with it, using the [Hugging Face](https://huggingface.co/) libraries and model access.

You'll need a Hugging Face [account](https://huggingface.co/join) and [access token](https://huggingface.co/settings/tokens). You'll also need to apply for *approval to access the Llama 3.1 model files*. You'll find a link to do that on the Hugging Face page of any of the Llama models, e.g.:
https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct

This notebook uses the Hugging Face `transformers` library for model inference, and uses the Llama 3.1 8B-Instruct model:
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

Create a **Verily Workbench JupyterLab notebook environment** to run this example. **Use 8 CPUs and 1 V100 GPU**.  With that configuration, the notebook costs ~$3.01/hr to run.  \
Pick the **TensorFlow image** when you create the notebook environment. (Below, we'll install `torch`. This gives us a newer version of `torch` than the one used by the PyTorch notebook environment, for better memory management.)

Note: the larger Llama 3.1 models need more powerful GPUs and will not run with the above configuration.  See the end of this example for a bit more discussion on this.

## Setup

Before you get started, make sure you have your Huggingface access token available.

First, install some libraries:

In [None]:
!pip install -U transformers torch accelerate

**Restart the kernel before proceeding**.

Do some imports, and set the model ID.

In [None]:
import transformers
import torch
import accelerate

# The above notebook configuration will not support the 70B model.  See the end of the notebook for more discussion.
# model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"


In [None]:
%env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Load the model. Before running the following cell, edit `YOUR_HF_ACCESS_TOKEN` to **use your access token**.

You'll only need to download the model files once; after that, they'll load from the notebook environment's file system.

In [None]:
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
    token="YOUR_HF_ACCESS_TOKEN"
)
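
A token pasted directly into a cell is easy to leak when the notebook is shared or committed. As a minimal sketch of an alternative, the token could be read from an environment variable, with a hidden interactive prompt as a fallback. The helper name `resolve_hf_token` and its parameters are hypothetical, and the variable name `HF_TOKEN` is an assumption about how your environment is set up.

```python
import getpass
import os


def resolve_hf_token(env=None, prompt=getpass.getpass):
    """Return a Hugging Face token from the environment, else prompt for it.

    Hypothetical helper for illustration; it is not part of `transformers`.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # Fall back to a hidden prompt so the token never appears in the
        # notebook source or in its saved output.
        token = prompt("Hugging Face access token: ").strip()
    if not token:
        raise ValueError("No Hugging Face access token provided.")
    return token
```

With this in place, the `token=` argument of the `transformers.pipeline(...)` call could be set to `resolve_hf_token()` rather than a literal string.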

## Run inference on the model and view the response

We'll formulate the prompt as a list of messages, each tagged with a 'role': first the instructions for the 'system', and then the 'user' query.

In [None]:
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Tell me about the field of biomedical research."},
]

In [None]:
%%time
response = pipeline(
    messages,
    max_new_tokens=512,
)
chat = response[0]['generated_text']


For the 8B model on a V100, each inference call will take ~2-5 mins.


In [None]:
print(response[0]['generated_text'][-1]['content'])

### Append a new query to the existing prompt context

We can maintain the existing context as we ask the model additional questions.

In [None]:
chat.append({"role": "user", "content": "Describe what a GWAS is"})

In [None]:
%%time
response = pipeline(
    chat,
    max_new_tokens=256,
)
chat = response[0]['generated_text']

In [None]:
print(response[0]['generated_text'][-1]['content'])


Note that because we appended the new query to the existing conversation, which still includes the original system message, the model continues to respond in "pirate speak".

Next, try refining the query. Note that we don't need to provide additional context on what "more" means, because the model sees the whole conversation so far.

In [None]:
chat.append({"role": "user", "content": "Tell me more."})

In [None]:
%%time
response = pipeline(
    chat,
    max_new_tokens=512,
)
chat = response[0]['generated_text']

In [None]:
print(response[0]['generated_text'][-1]['content'])
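
The append-then-generate pattern used in this section can be wrapped in a small helper. The sketch below is illustrative only: `ask` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that mimics the response shape used in the cells above so the example runs without a GPU or a model download.

```python
def ask(chat, question, generate, max_new_tokens=256):
    """Append a user turn, run generation, and return (updated_chat, reply).

    `generate` is any callable that follows the calling convention used in
    this notebook: generate(messages, max_new_tokens=...) returns a list whose
    first item holds the full conversation under "generated_text".
    """
    messages = list(chat) + [{"role": "user", "content": question}]
    response = generate(messages, max_new_tokens=max_new_tokens)
    updated_chat = response[0]["generated_text"]
    return updated_chat, updated_chat[-1]["content"]


def fake_generate(messages, max_new_tokens=256):
    """Stand-in for the real pipeline; returns a canned assistant turn."""
    reply = {
        "role": "assistant",
        "content": f"Arr! ({len(messages)} messages so far)",
    }
    return [{"generated_text": messages + [reply]}]


chat_demo = [{"role": "system", "content": "You are a pirate chatbot."}]
chat_demo, reply = ask(chat_demo, "Describe what a GWAS is", fake_generate)
chat_demo, reply = ask(chat_demo, "Tell me more.", fake_generate)
print(reply)
```

With the real model, the notebook's `pipeline` object would take the place of `fake_generate`, for example `chat, reply = ask(chat, "Tell me more.", pipeline, max_new_tokens=512)`.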

## Augment the prompt with information from a relevant document

Download the "Introduction to Verily Workbench" document (in Markdown format) as taken from the Workbench [support site](https://support.workbench.verily.com/):

**TODO**: update to use main branch path.

In [None]:
!mkdir -p documents
!wget https://raw.githubusercontent.com/verily-src/workbench-examples/amyu/llama31/ml_examples/llama31/overview.md -O documents/overview.md

Read the file into a string:

In [None]:
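# Use the same relative path that the wget command above wrote to, so this
# works regardless of the directory the notebook is run from.
file_path = 'documents/overview.md'
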
with open(file_path, 'r') as file:
    file_content = file.read()

We'll first try a query without using this supplementary information:

In [None]:
messages = [
    {"role": "user", "content": "Tell me about Verily Workbench."},
]

In [None]:
%%time
response = pipeline(
    messages,
    max_new_tokens=512,
)

In [None]:
print(response[0]['generated_text'][-1]['content'])

The above response will **likely not be very accurate** (the larger Llama 3.1 models would typically do a bit better).  

We can include a bit more information about Verily Workbench in the prompt to the model. We'll do that by including the Verily Workbench 'Overview' content from the Workbench [support site](https://support.workbench.verily.com/), that we downloaded above. This information will help the model summarize more accurately.

In [None]:
messages = [
    {"role": "system", "content": file_content},
    {"role": "user", "content": "Tell me about Verily Workbench."},
]

In [None]:
%%time
response = pipeline(
    messages,
    max_new_tokens=512,
)
chat = response[0]['generated_text']

In [None]:
print(response[0]['generated_text'][-1]['content'])

This response should look more accurate.
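
Longer prompts use more GPU memory, so it can help to cap how much of a document goes into the prompt and to tell the model how to use it. The following is a minimal sketch of that idea, offered as a variation on the cells above rather than the method this notebook uses. The helper name `build_grounded_messages`, the instruction wording, and the `max_chars` default are all hypothetical, and a character count is only a rough proxy for the token count.

```python
def build_grounded_messages(document, question, max_chars=20000):
    """Build a chat prompt that carries `document` as system context.

    `max_chars` is an illustrative limit, not a value tuned for Llama 3.1.
    """
    context = document.strip()
    if len(context) > max_chars:
        # Cut at the last paragraph break before the limit, if there is one,
        # so the context does not end mid-sentence.
        cut = context.rfind("\n\n", 0, max_chars)
        context = context[: cut if cut > 0 else max_chars].rstrip()
    system = (
        "Answer using the reference document below. "
        "If the document does not cover the question, say so.\n\n"
        + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The returned list has the same shape as `messages` above, so it could be passed to `pipeline(...)` in the same way, for example `build_grounded_messages(file_content, "Tell me about Verily Workbench.")`.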

## Cleanup

GPUs can be expensive; be sure to stop or delete your notebook environment when you are done.

## Experimenting with a larger Llama 3.1 model

If you want to experiment with the Llama 3.1 70B model, try creating a notebook environment that uses 2 A100s and has more disk space.  This environment must be created via the [Verily Workbench CLI](https://support.workbench.verily.com/docs/guides/cli/cli_install_and_run/), as the UI does not support all of the necessary configuration options:

```
wb resource create gcp-notebook --name llama3170b --machine-type=a2-highgpu-2g  --vm-image-family=tf-ent-latest-gpu --vm-image-project=deeplearning-platform-release  --data-disk-size 800 --accelerator-type NVIDIA_TESLA_A100 --accelerator-core-count=2 --install-gpu-driver=true
```

Most of the examples in this notebook should run with that configuration, with the exception of the last section, "Augment the prompt with information from a relevant document".  That section will likely cause an out-of-memory (OOM) error.
Each inference call to the 70B model will take ~2 hours with this configuration.
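
A back-of-the-envelope estimate helps explain these limits. The sketch below computes the memory needed for the model weights alone at 2 bytes per parameter (bfloat16) and compares it with the available GPU memory. The parameter counts are nominal, and the GPU memory sizes (16 GB for a V100, 40 GB per A100) are assumptions to verify against your actual VM. When the weights do not fit, `device_map="auto"` may place some layers on the CPU or disk, which is one plausible contributor to slow inference.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in GB (10**9 bytes).

    bfloat16 stores each parameter in 2 bytes. Activations, the KV cache,
    and framework overhead come on top of this figure.
    """
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts and assumed GPU memory sizes (verify on your VM).
configs = [
    ("8B on 1 x V100 (assumed 16 GB)", 8e9, 16),
    ("70B on 2 x A100 (assumed 40 GB each)", 70e9, 2 * 40),
]
for label, params, gpu_gb in configs:
    need = weight_memory_gb(params)
    fits = "fit" if need < gpu_gb else "do not fit"
    print(f"{label}: ~{need:.0f} GB of weights, {fits} in {gpu_gb} GB of GPU memory")
```

This counts only the weights, so it is a lower bound on what a run actually needs.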



## Provenance

(You can ignore the `huggingface/tokenizers` warnings in the following.)

In [None]:
!date

In [None]:
!pip freeze

In [None]:
!grep ^processor /proc/cpuinfo | wc -l

In [None]:
!grep "^MemTotal:" /proc/meminfo

---

Copyright 2024 Verily Life Sciences LLC

Use of this source code is governed by a BSD-style \
license that can be found in the LICENSE file or at \
https://developers.google.com/open-source/licenses/bsd