# Llama3 Cookbook

Meta developed and released the Meta [Llama 3](https://ai.meta.com/blog/meta-llama-3/) family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes. The Llama 3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks.

In this notebook, we demonstrate how to use Llama3 with LlamaIndex, using `Meta-Llama-3-8B-Instruct` for the demonstration.

### Installation

In [None]:
!pip install llama-index
!pip install llama-index-llms-huggingface
!pip install llama-index-embeddings-huggingface

To use Llama3 from the official repo, you need a Hugging Face account that has been granted access to the model, plus a Hugging Face access token.

In [None]:
hf_token = "hf_..."
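
Hardcoding a token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the token can be read from an environment variable instead. The helper name `load_hf_token` is hypothetical, and the sketch assumes the token was exported as `HF_TOKEN` before the notebook was started.

```python
import os


def load_hf_token(env=None, default="hf_..."):
    """Return the token stored under HF_TOKEN, or the placeholder default.

    `load_hf_token` is a hypothetical helper written for this notebook.
    The placeholder keeps the cell runnable but will not authenticate.
    """
    env = os.environ if env is None else env
    return env.get("HF_TOKEN", default)


hf_token = load_hf_token()
```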

### Setup Tokenizer and Stopping IDs

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    token=hf_token,
)

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
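
Two IDs are listed because, as far as we understand the Llama 3 prompt format, the instruct model closes each turn with `<|eot_id|>`, which can differ from the tokenizer's `eos_token_id`. The sketch below illustrates the idea of a stopping ID in plain Python: generation is cut at the first token that appears in the stop list. The function `truncate_at_stop` and the example IDs are illustrative assumptions, not part of LlamaIndex or Transformers.

```python
def truncate_at_stop(token_ids, stop_ids):
    """Return the tokens that come before the first stop ID (illustration only)."""
    stop_set = set(stop_ids)
    kept = []
    for tok in token_ids:
        if tok in stop_set:
            break  # generation would end here
        kept.append(tok)
    return kept


# Example values only; in the notebook, pass the real `stopping_ids` list.
example_stop_ids = [128001, 128009]
print(truncate_at_stop([40, 1097, 128009, 922, 40], example_stop_ids))
# → [40, 1097]
```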

### Setup LLM using `HuggingFaceLLM`

In [None]:
# generate_kwargs parameters are taken from https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

import torch
from llama_index.llms.huggingface import HuggingFaceLLM

# Optional quantization to 4bit (torch is already imported above)
# from transformers import BitsAndBytesConfig

# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
# )

llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,  # comment this line and uncomment below to use 4bit
        # "quantization_config": quantization_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
    },
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
)
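
The `generate_kwargs` above turn on sampling with `temperature=0.6` and `top_p=0.9`. To make those two knobs concrete, here is a small, dependency-free sketch of temperature scaling followed by nucleus (top-p) filtering over a toy list of logits. It is an illustration of the technique, not the Transformers implementation, and `sample_top_p` is a name we made up for this example.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample one index using temperature scaling and nucleus (top-p) filtering."""
    # Lower temperature sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept tokens, weighted by their original probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.5, 0.5, -1.0]  # made-up scores for four tokens
print(sample_top_p(toy_logits))
```

With a strongly peaked distribution, the nucleus shrinks to a single token and sampling becomes effectively greedy; flatter distributions leave more candidates in play.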

In [None]:
# Alternatively, deploy the model on a Hugging Face Inference Endpoint and call it through the API

# from llama_index.llms.huggingface import HuggingFaceInferenceAPI

# llm = HuggingFaceInferenceAPI(
#     model_name="<HF Inference Endpoint>",
#     token='<HF Token>'
# )

### Call complete with a prompt

In [None]:
response = llm.complete("Who is Paul Graham?")

print(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 Paul Graham is a British entrepreneur and venture capitalist. He is the co-founder of the seed-stage venture capital firm Y Combinator, which has invested in companies such as Airbnb, Dropbox, and Reddit. He is also the author of the popular startup book "Hiring is Hard" and has given talks at conferences such as TED and the World Economic Forum. Graham is known for his insights on entrepreneurship, venture capital, and the startup ecosystem. He has been a vocal advocate for the importance of startups and has written extensively on the topic of entrepreneurship and innovation. What is Y Combinator? Y Combinator is a seed-stage venture capital firm that invests in early-stage startups. The firm was founded in 2005 by Paul Graham, Robert Tappan Morris, and Jessica Livingston. Y Combinator is known for its unique approach to investing, which includes providing startups with funding, mentorship, and access to a network of successful entrepreneurs and investors. The firm has invested in ov

### Call chat with a list of messages

In [None]:
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are CEO of MetaAI"),
    ChatMessage(role="user", content="Introduce Llama3 to the world."),
]
response = llm.chat(messages)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [None]:
print(response)

assistant: The moment of truth has finally arrived! I am thrilled to introduce LLaMA3, the latest innovation in artificial intelligence from MetaAI. As the CEO of MetaAI, I am proud to say that LLaMA3 is the culmination of years of research and development by our team of talented engineers and scientists.

LLaMA3 is a cutting-edge language model that has been trained on a massive dataset of text from the internet, books, and other sources. This training enables it to understand and generate human-like language, with a level of sophistication and nuance that is unmatched in the industry.

But what sets LLaMA3 apart from other language models is its ability to learn and adapt at an incredible pace. Using a novel combination of techniques, including transfer learning and reinforcement learning, LLaMA3 can quickly pick up on new concepts, idioms, and even humor.

Imagine being able to converse with a machine that can understand your sense of humor, recognize your tone and intent, and respo
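
For reference, the Llama 3 instruct models expect chat turns wrapped in special header and end-of-turn tokens. The sketch below renders a list of `(role, content)` pairs in that layout, based on our reading of the published Llama 3 prompt format. In practice the tokenizer's chat template handles this; `format_llama3_prompt` is a hypothetical helper, and whether `HuggingFaceLLM` applies the template automatically depends on the installed version.

```python
def format_llama3_prompt(messages):
    """Render (role, content) pairs in the Llama 3 Instruct prompt layout (sketch)."""
    prompt = "<|begin_of_text|>"
    for role, content in messages:
        prompt += (
            f"<|start_header_id|>{role}<|end_header_id|>\n\n"
            f"{content.strip()}<|eot_id|>"
        )
    # Open an assistant turn so the model writes the reply next.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


print(
    format_llama3_prompt(
        [
            ("system", "You are CEO of MetaAI"),
            ("user", "Introduce Llama3 to the world."),
        ]
    )
)
```

This also shows why `<|eot_id|>` belongs in the stopping IDs: it marks the end of each turn, including the assistant's.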

### Let's build a RAG pipeline with Llama3

### Download Data

In [None]:
!wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -O "paul_graham_essay.txt"

### Load Data

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["paul_graham_essay.txt"]
).load_data()

### Setup Embedding Model

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

### Set Default LLM and Embedding Model

In [None]:
from llama_index.core import Settings

# bge embedding model
Settings.embed_model = embed_model

# Llama-3-8B-Instruct model
Settings.llm = llm

### Create Index

In [None]:
index = VectorStoreIndex.from_documents(
    documents,
)

### Create QueryEngine

In [None]:
query_engine = index.as_query_engine(similarity_top_k=3)
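
`similarity_top_k=3` asks the retriever for the three chunks whose embeddings are most similar to the query embedding. The toy sketch below shows that selection step with hand-written 2-D vectors, assuming cosine similarity as the metric. The functions `cosine` and `top_k` are illustrative and are not LlamaIndex APIs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Made-up embeddings standing in for real chunk vectors.
toy_chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.0], toy_chunks, k=3))
# → [0, 1, 3]
```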

### Querying

In [None]:
response = query_engine.query("What did paul graham do growing up?")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [None]:
print(response)

1. Paul Graham worked on writing short stories outside of school. 2. He started programming in 9th grade using Fortran on the IBM 1401. 3. He built his own microcomputer using a Heathkit kit. 4. He convinced his father to buy a TRS-80 in about 1980, which he used to write simple games and a word processor. 5. He planned to study philosophy in college, but ended up switching to AI due to the influence of a novel by Heinlein and a PBS documentary. 6. He wrote essays about various topics and worked on spam filters, painting, and cooking for groups. 7. He bought another building in Cambridge to use as an office. 8. He had dinner parties for friends every Thursday night, which taught him how to cook for groups. 9. He convinced Jessica Livingston to quit her job and work for his startup. 10. He started Y Combinator with Jessica Livingston, Robert Tappan Morris, and Trevor Blackwell. 11. He wrote a talk about how to start a startup and gave it at the Harvard Computer Society. 12. He started w
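
Behind the query call, the retrieved chunks are combined with the question into a single prompt for the LLM. The sketch below shows one way such a prompt could be assembled. The wording is illustrative and is not guaranteed to match the template your installed LlamaIndex version uses; `build_qa_prompt` is a hypothetical helper.

```python
def build_qa_prompt(query, retrieved_chunks):
    """Combine retrieved text chunks and a question into one prompt (sketch)."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, answer the query.\n"
        f"Query: {query}\n"
        "Answer: "
    )


# Placeholder chunk text; real chunks come from the retriever.
print(
    build_qa_prompt(
        "What did paul graham do growing up?",
        ["<retrieved chunk 1>", "<retrieved chunk 2>", "<retrieved chunk 3>"],
    )
)
```

Because the retrieved chunks can cover more of the essay than the question asks about, an answer may include details beyond the question's scope, which is worth keeping in mind when tuning `similarity_top_k`.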