<a href="https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt Engineering with Llama 3.1

Prompt engineering is using natural language to produce a desired response from a large language model (LLM).

This interactive guide covers prompt engineering & best practices with Llama 3.1.

## Introduction

### Why now?

[Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762) introduced the world to transformer neural networks (originally for machine translation). Transformers ushered in an era of generative AI, with diffusion models for image creation and large language models (`LLMs`) as **programmable deep learning networks**.

Programming foundational LLMs is done with natural language – it doesn't require training/tuning like ML models of the past. This has opened the door to a massive amount of innovation and a paradigm shift in how technology can be deployed. The science/art of using natural language to program language models to accomplish a task is referred to as **Prompt Engineering**.

### Llama Models

In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama Chat, Code Llama, Llama Guard). These are general-purpose, state-of-the-art LLMs.

Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.

#### Llama 3.1
1. `llama-3.1-8b` - base pretrained 8 billion parameter model
1. `llama-3.1-70b` - base pretrained 70 billion parameter model
1. `llama-3.1-405b` - base pretrained 405 billion parameter model
1. `llama-3.1-8b-instruct` - instruction fine-tuned 8 billion parameter model
1. `llama-3.1-70b-instruct` - instruction fine-tuned 70 billion parameter model
1. `llama-3.1-405b-instruct` - instruction fine-tuned 405 billion parameter model (flagship)


#### Llama 3
1. `llama-3-8b` - base pretrained 8 billion parameter model
1. `llama-3-70b` - base pretrained 70 billion parameter model
1. `llama-3-8b-instruct` - instruction fine-tuned 8 billion parameter model
1. `llama-3-70b-instruct` - instruction fine-tuned 70 billion parameter model (flagship)

#### Llama 2
1. `llama-2-7b` - base pretrained 7 billion parameter model
1. `llama-2-13b` - base pretrained 13 billion parameter model
1. `llama-2-70b` - base pretrained 70 billion parameter model
1. `llama-2-7b-chat` - chat fine-tuned 7 billion parameter model
1. `llama-2-13b-chat` - chat fine-tuned 13 billion parameter model
1. `llama-2-70b-chat` - chat fine-tuned 70 billion parameter model (flagship)


Code Llama is a code-focused LLM built on top of Llama 2. It is also available in various sizes and fine-tunes:

#### Code Llama
1. `codellama-7b` - code fine-tuned 7 billion parameter model
1. `codellama-13b` - code fine-tuned 13 billion parameter model
1. `codellama-34b` - code fine-tuned 34 billion parameter model
1. `codellama-70b` - code fine-tuned 70 billion parameter model
1. `codellama-7b-instruct` - code & instruct fine-tuned 7 billion parameter model
1. `codellama-13b-instruct` - code & instruct fine-tuned 13 billion parameter model
1. `codellama-34b-instruct` - code & instruct fine-tuned 34 billion parameter model
1. `codellama-70b-instruct` - code & instruct fine-tuned 70 billion parameter model
1. `codellama-7b-python` - Python fine-tuned 7 billion parameter model
1. `codellama-13b-python` - Python fine-tuned 13 billion parameter model
1. `codellama-34b-python` - Python fine-tuned 34 billion parameter model
1. `codellama-70b-python` - Python fine-tuned 70 billion parameter model

## Getting an LLM

Large language models are deployed and accessed in a variety of ways, including:

1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama on your MacBook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).
    * Best for privacy/security or if you already have a GPU.
1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama on cloud providers like AWS, Azure, GCP, and others.
    * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).
1. **Hosted API**: Calling LLMs directly via an API. Many companies provide Llama inference APIs, including AWS Bedrock, Replicate, Anyscale, Together, and others.
    * Easiest option overall.

### Hosted APIs

Hosted APIs are the easiest way to get started. We'll use them here. There are usually two main endpoints:

1. **`completion`**: generate a response to a given prompt (a string).
1. **`chat_completion`**: generate the next message in a list of messages, enabling more explicit instruction and context for use cases like chatbots.
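
To make the difference concrete, here is a minimal sketch of the two input shapes. The `role`/`content` dictionary layout mirrors the helper functions defined in the Notebook Setup section; the `message` helper, the `system` role line, and the example strings are illustrative assumptions, and no API call is made.

```python
from typing import Dict, List

# A `completion`-style input is a single string.
completion_prompt: str = "The typical color of the sky is: "

def message(role: str, content: str) -> Dict[str, str]:
    """Hypothetical helper: build one role-tagged chat message."""
    return {"role": role, "content": content}

# A `chat_completion`-style input is an ordered list of role-tagged messages.
chat_messages: List[Dict[str, str]] = [
    message("system", "Answer in one word."),  # assumed system instruction
    message("user", "What color is the sky?"),
    message("assistant", "Blue."),
    message("user", "And at sunset?"),  # the model would generate the next message
]

for m in chat_messages:
    print(f"{m['role']:>9}: {m['content']}")
```

The list form lets earlier turns and instructions travel with every request, which is what makes it a better fit for chatbots.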

## Tokens

LLMs process inputs and outputs in chunks called *tokens*. Think of these, roughly, as words – each model will have its own tokenization scheme. For example, this sentence...

> Our destiny is written in the stars.

...is tokenized into `["Our", " destiny", " is", " written", " in", " the", " stars", "."]` for Llama 3. See [this interactive tokenizer tool](https://tiktokenizer.vercel.app/?model=meta-llama%2FMeta-Llama-3-8B) to try it yourself.

Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).

Each model has a maximum context length that your prompt cannot exceed. That's 128K tokens for Llama 3.1, 4K for Llama 2, and 100K for Code Llama.
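
As a rough illustration of checking a prompt against these limits, the sketch below uses the rounded figures quoted above. The family names, the 4-characters-per-token rule of thumb, and the `reserved_for_output` budget are all assumptions made for this example; an exact count needs the model's own tokenizer.

```python
# Rounded context limits taken from the paragraph above (in tokens).
CONTEXT_LIMITS = {
    "llama-3.1": 128_000,
    "llama-2": 4_000,
    "code-llama": 100_000,
}

# Assumption: roughly 4 characters per token for English text.
CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> int:
    """Very rough token estimate; not a substitute for the real tokenizer."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_context(prompt: str, family: str, reserved_for_output: int = 512) -> bool:
    """Check whether the estimated prompt size leaves room for a response."""
    return estimate_tokens(prompt) + reserved_for_output <= CONTEXT_LIMITS[family]

sentence = "Our destiny is written in the stars."
long_prompt = "word " * 20_000  # about 100,000 characters

print(estimate_tokens(sentence))
print(fits_in_context(sentence, "llama-2"))
print(fits_in_context(long_prompt, "llama-2"))
print(fits_in_context(long_prompt, "llama-3.1"))
```

The estimate for the sample sentence lands close to, but not exactly on, the eight tokens listed earlier, which is why it should only be used as an early warning.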


## Notebook Setup

The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3.1 chat using [Groq](https://console.groq.com/playground?model=llama3-70b-8192).

To install the prerequisites, run:

In [1]:
import sys
!{sys.executable} -m pip install groq


In [16]:
from google.colab import userdata
# Read the Groq API key from Colab's secrets (stored here under the name 'groq1')
a = userdata.get('groq1')

In [17]:
# Never print the key itself; only confirm that it was loaded
print("Groq API key loaded:", a is not None)


In [18]:
import os
from typing import Dict, List
from groq import Groq

# Get a free API key from https://console.groq.com/keys
os.environ["GROQ_API_KEY"] = a


#LLAMA3_405B_INSTRUCT = "llama-3.1-405b-reasoning" # Note: Groq currently only gives access here to paying customers for 405B model
LLAMA3_70B_INSTRUCT = "llama-3.1-70b-versatile"
LLAMA3_8B_INSTRUCT = "llama-3.1-8b-instant"

DEFAULT_MODEL = LLAMA3_70B_INSTRUCT

client = Groq()

def assistant(content: str):
    return { "role": "assistant", "content": content }

def user(content: str):
    return { "role": "user", "content": content }

def chat_completion(
    messages: List[Dict],
    model: str = DEFAULT_MODEL,
    temperature: float = 0.6,
    top_p: float = 0.9,
) -> str:
    response = client.chat.completions.create(
        messages=messages,
        model=model,
        temperature=temperature,
        top_p=top_p,
    )
    return response.choices[0].message.content


def completion(
    prompt: str,
    model: str = DEFAULT_MODEL,
    temperature: float = 0.6,
    top_p: float = 0.9,
) -> str:
    return chat_completion(
        [user(prompt)],
        model=model,
        temperature=temperature,
        top_p=top_p,
    )

def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):
    print(f'==============\n{prompt}\n==============')
    response = completion(prompt, model)
    print(response, end='\n\n')


### Completion APIs

Let's try Llama 3.1!

In [19]:
complete_and_print("The typical color of the sky is: ")

The typical color of the sky is: 
Blue.



In [20]:
complete_and_print("which model version are you?")

which model version are you?
I'm a large language model, my model version is InstructLLaMA, and my knowledge cutoff is currently December 2023, but I don't have a specific version number.



In [22]:
def print_tuned_completion(temperature: float, top_p: float):
    response = completion("Write a haiku about llamas", temperature=temperature, top_p=top_p)
    print(f'[temperature: {temperature} | top_p: {top_p}]\n{response.strip()}\n')

print_tuned_completion(0.01, 0.01)
print_tuned_completion(0.01, 0.01)
# These two generations are highly likely to be the same

print_tuned_completion(1.0, 1.0)
print_tuned_completion(1.0, 1.0)
# These two generations are highly likely to be different

[temperature: 0.01 | top_p: 0.01]
Softly gentle eyes
Llama's gentle, fuzzy form
Misty mountain home

[temperature: 0.01 | top_p: 0.01]
Softly gentle eyes
Llama's gentle, fuzzy form
Misty mountain home

[temperature: 1.0 | top_p: 1.0]
Soft, woolly creature
Ears perked up with gentle gaze
Llama's gentle soul

[temperature: 1.0 | top_p: 1.0]
Fuzzy, gentle eyes
Softly pads across the land
Majestic delight
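
The cell above varies `temperature` and `top_p` without showing what they do. As a rough illustration, the sketch below applies both to a made-up next-token distribution: temperature divides the logits before the softmax, and top-p keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`. The vocabulary and logits are invented for this example, and real inference engines may differ in detail.

```python
import math
from typing import Dict

def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalize the survivors

# Hypothetical logits for the token that follows "Llamas are".
toy_logits = {" fuzzy": 2.0, " gentle": 1.5, " tall": 0.5, " purple": -1.0}

for temperature, top_p in [(0.01, 0.01), (1.0, 1.0)]:
    probs = top_p_filter(softmax_with_temperature(toy_logits, temperature), top_p)
    rounded = {tok: round(p, 3) for tok, p in probs.items()}
    print(f"[temperature: {temperature} | top_p: {top_p}] {rounded}")
```

With both values near zero a single token absorbs almost all of the probability, which is consistent with the two identical haikus above; at 1.0 several tokens remain plausible, so repeated generations tend to diverge.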



## Prompting Techniques

### Explicit Instructions

Detailed, explicit instructions produce better results than open-ended prompts:

In [None]:
complete_and_print(prompt="Describe quantum physics in one short sentence of no more than 12 words")
# Returns a succinct explanation of quantum physics that mentions particles and states existing simultaneously.

You can think of explicit instructions as rules and restrictions on how Llama 3.1 responds to your prompt.

- Stylization
    - `Explain this to me like a topic on a children's educational network show teaching elementary students.`
    - `I'm a software engineer using large language models for summarization. Summarize the following text in under 250 words:`
    - `Give your answer like an old timey private investigator hunting down a case step by step.`
- Formatting
    - `Use bullet points.`
    - `Return as a JSON object.`
    - `Use less technical terms and help me apply it in my work in communications.`
- Restrictions
    - `Only use academic papers.`
    - `Never give sources older than 2020.`
    - `If you don't know the answer, say that you don't know.`

Explicit instructions can also make results more specific, for example by limiting responses to recently created sources.
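
A minimal sketch of that idea: the hypothetical `build_prompt` helper below appends a list of restrictions to a request, so the open-ended and restricted versions can be compared side by side. The request text and the helper are assumptions made for illustration; the restriction strings come from the list above.

```python
from typing import List

def build_prompt(request: str, instructions: List[str]) -> str:
    """Append explicit instructions to a request, one per line."""
    if not instructions:
        return request
    rules = "\n".join(f"- {rule}" for rule in instructions)
    return f"{request}\n\nFollow these rules:\n{rules}"

request = "Explain the latest advances in large language models to me."

# Open-ended version: no restrictions at all.
open_ended = build_prompt(request, [])

# Restricted version: the same request plus explicit rules.
restricted = build_prompt(
    request,
    [
        "Only use academic papers.",
        "Never give sources older than 2020.",
        "If you don't know the answer, say that you don't know.",
    ],
)

print(open_ended)
print("---")
print(restricted)
```

Either string can be passed to `complete_and_print` to compare how the restrictions change the response.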

In [31]:
# prompt: accept a csv file and ask llama to explain it

import csv

# Uses completion() from the Notebook Setup cell above.

def analyze_csv_with_llama(csv_file_path):
    """
    Reads a CSV file and asks Llama to explain its contents.
    """
    try:
        # newline='' lets the csv module handle line endings inside quoted fields
        with open(csv_file_path, 'r', newline='') as file:
            reader = csv.reader(file)
            header = next(reader, None)  # Get the header row (None if the file is empty)
            data = list(reader)  # Read the remaining data rows

        if header is None:
            # An empty file gives the model nothing to interpret, so stop here
            print(f"Error: CSV file at {csv_file_path} is empty")
            return

        prompt = f"Can you give me global interpretation for the 2 assessments for this specific athlete. The header is {header}. The data is:\n{data}"

        explanation = completion(prompt)  # Get explanation from Llama
        print(explanation)

    except FileNotFoundError:
        print(f"Error: CSV file not found at {csv_file_path}")
    except Exception as e:
        print(f"An error occurred: {e}")


# Example usage (replace the path below with the actual path to your CSV)
analyze_csv_with_llama('/content/FMS&MKS.csv')

Based on the provided data, here's a global interpretation of the two assessments for the specific athlete:

**Musculoskeletal Assessment:**

The musculoskeletal assessment reveals several key findings that suggest the athlete has a complex pattern of movement and postural dysfunctions. The main findings include:

1. **Right-sided dominance**: The athlete presents with a right-sided dominance, characterized by a higher right hip, increased weight-bearing on the right side, and a convex spine towards the right.
2. **Upper quarter restrictions**: The athlete has restricted shoulder internal rotation on the right side, tender upper trapezius, and rhomboids and lats insertion. This suggests a potential issue with scapular stability and mobility.
3. **Lower quarter restrictions**: The athlete has restricted bilateral hip external rotation, tight right glute medius, and tight left psoas. This suggests a potential issue with hip mobility and stability.
4. **Postural dysfunctions**: The athlet
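
The CSV cell above interpolates Python lists directly into the prompt, which adds brackets and quotes the model has to read past. One possible alternative, sketched below, renders the header and rows back into CSV text and caps the number of rows so the prompt stays small. The `rows_to_prompt_table` helper, its `max_rows` default, and the sample data are assumptions made for this example.

```python
import csv
import io
from typing import List, Optional

def rows_to_prompt_table(
    header: Optional[List[str]],
    rows: List[List[str]],
    max_rows: int = 50,
) -> str:
    """Render rows as CSV text, keeping at most max_rows data rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows[:max_rows])
    omitted = len(rows) - max_rows
    if omitted > 0:
        # Tell the model that the table was truncated
        buffer.write(f"... ({omitted} more row(s) omitted)\n")
    return buffer.getvalue()

# Made-up sample data standing in for a real file.
header = ["name", "value"]
rows = [["alpha", "2"], ["beta", "3"], ["gamma", "1"]]

table = rows_to_prompt_table(header, rows, max_rows=2)
prompt = f"Explain the following CSV data:\n\n{table}"
print(prompt)
```

The resulting `prompt` string can be passed to `completion` in place of the list-based prompt.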

In [24]:
# prompt: accept an excel file and ask llama to explain it

from google.colab import files
import pandas as pd

uploaded = files.upload()

if not uploaded:
    # No file was selected; indexing the empty result previously raised
    # "IndexError: list index out of range"
    print("No file was uploaded. Re-run this cell and choose an Excel file.")
else:
    # Use the first uploaded file, whatever its name is
    excel_filename = next(iter(uploaded))

    try:
        df = pd.read_excel(excel_filename)
        excel_data_string = df.to_string()  # Convert the DataFrame to a string

        # Uses completion() from the Notebook Setup cell above
        prompt = f"Explain the following Excel data:\n\n{excel_data_string}"
        explanation = completion(prompt)
        print(explanation)
    except Exception as e:
        print(f"An error occurred: {e}")

## Additional References
- [PromptingGuide.ai](https://www.promptingguide.ai/)
- [LearnPrompting.org](https://learnprompting.org/)
- [Lil'Log Prompt Engineering Guide](https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/)


## Author & Contact

Edited by [Dalton Flanagan](https://www.linkedin.com/in/daltonflanagan/) (dalton@meta.com) with contributions from Mohsen Agsen, Bryce Bortree, Ricardo Juan Palma Duran, Kaolin Fire, Thomas Scialom.