<a href="https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/use_cases/RAG/HelloLlamaCloud.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## This demo app shows:
* How to run Llama 3.1 in the cloud hosted on Replicate
* How to use LangChain to ask Llama general questions and follow up questions
* How to use LangChain to load a recent web page - Hugging Face's [blog post on Llama 3](https://huggingface.co/blog/llama3) - and chat about it. This is the well-known RAG (Retrieval Augmented Generation) method, which lets an LLM such as Llama 3 answer questions about data that was not publicly available when it was trained, or about your own data. RAG is one way to reduce LLM hallucination

**Note** We will be using [Replicate](https://replicate.com/meta/meta-llama-3.1-405b-instruct) to run the examples here. You will first need to sign in to Replicate with your GitHub account, then create a free API token [here](https://replicate.com/account/api-tokens) that you can use for a while. You can also use other Llama 3.1 cloud providers such as [Groq](https://console.groq.com/), [Together](https://api.together.xyz/playground/language/meta-llama/Llama-3-8b-hf), or [Anyscale](https://app.endpoints.anyscale.com/playground) - see Section 2 of the Getting to Know Llama [notebook](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Getting_to_know_Llama.ipynb) for more information.

Let's start by installing the necessary packages:
- sentence-transformers for text embeddings
- FAISS for vector store (similarity search) capabilities
- LangChain for the RAG tooling used in this demo

In [1]:
!pip install langchain
!pip install sentence-transformers
!pip install faiss-cpu
!pip install bs4
!pip install replicate

In [1]:
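from getpass import getpass
import os

# Read the token from the environment; prompt for it if it is not set
REPLICATE_API_TOKEN = os.getenv("REPLICATE_API_TOKEN") or getpass("Enter your Replicate API token: ")
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN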


Next, we call the Llama 3.1 405B chat model from Replicate. You can also use the Llama 3 8B or 70B model by replacing the `model` name with the respective model URL.

In [3]:
from langchain_community.llms import Replicate

# generation settings passed through to the Replicate model
model_kwargs = {"temperature": 0.75, "max_length": 500, "top_p": 1}

llm = Replicate(
    model="meta/meta-llama-3.1-405b-instruct",
    model_kwargs=model_kwargs,
)



With the model set up, you are now ready to ask some questions. Here is an example of the simplest way to ask the model some general questions.

In [16]:
# question = "who wrote the book Innovator's dilemma?"
question = "What's new with Llama 3?"
answer = llm.invoke(question)
print(answer)

I'm excited to share with you what's new with Llama 3! Llama 3 is an updated version of the AI model that I'm a part of, and it brings several improvements and features. Here are some of the notable updates:

1. **Improved accuracy and understanding**: Llama 3 has been trained on a massive dataset of text from various sources, which enables it to better comprehend and respond to a wide range of questions and topics.
2. **Enhanced contextual understanding**: Llama 3 can now better understand the context of a conversation, allowing it to provide more accurate and relevant responses.
3. **Increased knowledge base**: Llama 3 has been updated with a vast amount of new knowledge, including but not limited to, recent events, scientific breakthroughs, and cultural phenomena.
4. **Better handling of idioms and colloquialisms**: Llama 3 can now better understand and respond to idiomatic expressions, colloquialisms, and figurative language.
5. **More engaging and conversational tone**: Llama 3 is

Next, we follow up on the response with "tell me more".

Because the chat history is not passed along, Llama has no context for the follow-up, does not know what "more" refers to, and treats it as a brand-new query.


In [6]:
# chat history is not passed, so Llama doesn't have the context and treats this as a new query
followup = "tell me more"
followup_answer = llm.invoke(followup)
print(followup_answer)

As a helpful assistant, my purpose is to assist and provide useful information to users like you. I can help with a wide range of topics, from answering simple questions to providing more in-depth information on various subjects.

Here are some examples of what I can help with:

1. **General knowledge**: I can provide information on history, science, technology, literature, and more.
2. **Language translation**: I can translate text from one language to another, including popular languages such as Spanish, French, German, Chinese, and many more.
3. **Writing and proofreading**: I can help with writing and proofreading tasks, such as suggesting alternative phrases, correcting grammar and spelling errors, and providing feedback on clarity and coherence.
4. **Math and calculations**: I can perform mathematical calculations, from simple arithmetic to more complex equations, and provide explanations for mathematical concepts.
5. **Conversation and chat**: I can engage in natural-sounding co

To get around this, we need to provide the model with the history of the chat.

To do this, we will use [`ConversationBufferMemory`](https://python.langchain.com/docs/modules/memory/types/buffer) to pass the chat history to the model, which lets it handle follow-up questions.
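
As a rough illustration of the idea (not LangChain's actual implementation), a buffer memory can be thought of as a list of past turns that is replayed in front of every new input. The `BufferMemory` class, its `build_prompt` method, and the `Human:`/`AI:` labels below are hypothetical names chosen for this sketch.

```python
class BufferMemory:
    """Toy stand-in for a conversation buffer: stores every turn verbatim."""

    def __init__(self):
        self.turns = []  # list of (human_text, ai_text) pairs

    def save_context(self, human_text, ai_text):
        # Record one completed exchange
        self.turns.append((human_text, ai_text))

    def build_prompt(self, new_input):
        # Replay the whole history, then append the new input
        lines = []
        for human_text, ai_text in self.turns:
            lines.append(f"Human: {human_text}")
            lines.append(f"AI: {ai_text}")
        lines.append(f"Human: {new_input}")
        lines.append("AI:")
        return "\n".join(lines)


memory_sketch = BufferMemory()
memory_sketch.save_context("Who wrote The Innovator's Dilemma?", "Clayton M. Christensen.")
prompt = memory_sketch.build_prompt("tell me more")
print(prompt)
```

Because the earlier exchange is part of the text the model sees, "tell me more" now has something to refer to. The trade-off is that the prompt grows with every turn.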

In [7]:
# using ConversationBufferMemory to pass memory (chat history) for follow up questions
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=False,
)


Once this is set up, let us repeat the steps from before and ask the model a simple question.

Then we pass the question and answer back into the model for context along with the follow up question.

In [8]:
# restart from the original question
# (the output below was produced with the commented-out "Innovator's dilemma" question from above)
answer = conversation.predict(input=question)
print(answer)

The book "The Innovator's Dilemma" was written by Clayton M. Christensen, an American business consultant and academic. It was first published in 1997 and has since become a classic in the field of business innovation and disruption. Christensen was a Harvard Business School professor and is widely recognized for his work on innovation and disruption, and this book is considered one of his most influential works. Would you like to know more about the book's main ideas or Christensen's other works?


In [9]:
# pass context (previous question and answer) along with the follow up "tell me more" to Llama, which now knows what the follow up refers to
memory.save_context({"input": question},
                    {"output": answer})
followup_answer = conversation.predict(input=followup)
print(followup_answer)

The book "The Innovator's Dilemma" by Clayton M. Christensen is a seminal work that explores the concept of disruption in the business world. Christensen argues that well-established companies can fail when they are faced with a new, innovative technology or business model that disrupts their existing market. This is because these companies often prioritize short-term profits and existing customer needs over investing in new, unproven technologies that may cannibalize their existing business.

Christensen identifies two types of innovations: sustaining innovations, which improve existing products or services, and disruptive innovations, which create new markets or disrupt existing ones. He argues that companies should focus on creating separate teams or organizations to pursue disruptive innovations, as these efforts often require different resources, processes, and cultures than the core business.

One of the key examples Christensen uses in the book is the story of the hard disk driv

Next, let's explore using Llama 3.1 to answer questions with documents as context.
This lets us extend Llama 3.1's knowledge with new information without needing to fine-tune the model.

In [11]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
import bs4

loader = WebBaseLoader(["https://huggingface.co/blog/llama3"])
docs = loader.load()


In [12]:
docs

[Document(metadata={'source': 'https://huggingface.co/blog/llama3', 'title': "Welcome Llama 3 - Meta's new open LLM", 'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.'}, page_content='\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \nWelcome Llama 3 - Meta\'s new open LLM\n\n\n\n\n\n\n\n\nHugging Face\n\n\n\n\n\n\n\n\t\t\t\t\tModels\n\n\t\t\t\t\tDatasets\n\n\t\t\t\t\tSpaces\n\n\t\t\t\t\tPosts\n\n\t\t\t\t\tDocs\n\n\n\n\n\t\t\tSolutions\n\t\t\n\nPricing\n\t\t\t\n\n\n\n\n\n\nLog In\n\t\t\t\t\nSign Up\n\t\t\t\t\t\n\n\n\n\t\t\t\t\t\tBack to Articles\n\n\n\n\n\n\t\tWelcome Llama 3 - Meta’s new open LLM\n\t\n\n\nPublished\n\t\t\t\tApril 18, 2024\nUpdate on GitHub\n\n\n\t\tUpvote\n\n\t\t275\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n+269\n\n\n\nphilschmid\nPhilipp Schmid\n\n\n\n\n\nosanseviero\nOmar Sanseviero\n\n\n\n\n\npcuenq\nPedro Cuenca\n\n\n\n\n\nybelkada\nYounes Belkada\n\n\n\n\n\nlvwerra\nLean

We need to store our document in a vector store. LangChain supports more than 30 vector stores (DBs).
For this example we will use [FAISS](https://github.com/facebookresearch/faiss), a popular open source vector store by Facebook.
For other vector stores, especially if you need to store a large amount of data, see [here](https://python.langchain.com/docs/integrations/vectorstores).

We will also use the `HuggingFaceEmbeddings` and `RecursiveCharacterTextSplitter` classes imported above to assist in storing the documents.

In [13]:
# Split the document into chunks with a specified chunk size
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
all_splits = text_splitter.split_documents(docs)

# Store the document into a vector store with a specific embedding model
vectorstore = FAISS.from_documents(all_splits, HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))




To store the documents, we split them into chunks using [`RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) and create vector representations of these chunks using [`HuggingFaceEmbeddings`](https://www.google.com/search?q=langchain+hugging+face+embeddings) before storing them in our vector database.

In general, you should use larger chunk sizes for highly structured text such as code and smaller sizes for less structured text. You may need to experiment with different chunk sizes and overlap values to find the best numbers.
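
To make the roles of chunk size and overlap concrete, here is a minimal sketch of a fixed-size character chunker. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally (that class also tries to split on separators such as paragraph and line breaks); the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.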

We then use `RetrievalQA` to retrieve the relevant chunks from the vector database and pass them to the model as additional context about Llama 3, which extends what it can answer beyond its training data. Llama 3.1 also benefits here from its 128K-token context window.

For each question, LangChain runs a semantic similarity search against the vector database, then passes the search results to Llama as context for answering the question.
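
The retrieve-then-answer flow can be sketched with the standard library alone. This is an illustration under simplifying assumptions: word counts stand in for the dense vectors a real embedding model produces, and the helper names (`embed`, `cosine`, `retrieve`, `build_rag_prompt`) and the prompt wording are hypothetical, not LangChain's.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (a real model returns dense vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_rag_prompt(query, chunks, k=2):
    """Place the retrieved chunks in front of the question."""
    context = "\n\n".join(retrieve(query, chunks, k))
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


toy_chunks = [
    "The new tokenizer expands the vocabulary size to 128,256 tokens.",
    "Hugging Face hosts models, datasets, and spaces.",
    "Llama Guard 2 was fine-tuned on Llama 3 8B for safety.",
]
query = "What is the vocabulary size of the tokenizer?"
top = retrieve(query, toy_chunks, k=1)
rag_prompt = build_rag_prompt(query, toy_chunks, k=1)
print(rag_prompt)
```

Only the chunk about the tokenizer ends up in the prompt, so the model answers from that text rather than from memory alone.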

In [15]:
# use LangChain's RetrievalQA to associate Llama 3 with the loaded documents stored in the vector db
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever()
)

question = "What's new with Llama 3?"
# .invoke() replaces the deprecated direct call qa_chain({...})
result = qa_chain.invoke({"query": question})
print(result['result'])


According to the context, several things are new with Llama 3:

1. Four new open LLM models by Meta based on the Llama 2 architecture, with 8B and 70B parameters and base and instruct-tuned versions.
2. A new tokenizer that expands the vocabulary size to 128,256 tokens (from 32K tokens in Llama 2).
3. A larger vocabulary that can encode text more efficiently and potentially yield stronger multilingualism.
4. A new version of Llama Guard (Llama Guard 2) that was fine-tuned on Llama 3 8B for safety.

These changes aim to improve the performance and capabilities of Llama 3 compared to its predecessor, Llama 2.


Now, let's bring it all together by incorporating follow-up questions.

First, we ask a follow-up question without giving the model the context of the previous conversation.
Without this context, the answer we get does not relate to our original question.

In [17]:
# no chat history passed, so Llama 3 doesn't know what the question refers to and cannot give a specific answer
result = qa_chain.invoke({"query": "Based on what architecture?"})
print(result['result'])

I don't have enough information to provide a specific answer about the architecture of Llama 3 beyond what is mentioned in the text, such as the use of Grouped-Query Attention (GQA) in the 8B version. If you are looking for more detailed architectural information, it is not provided in the given context.


As we did before with `ConversationChain`, we now give the model the context of our previous question, this time using the `ConversationalRetrievalChain` class, so that we can ask follow-up questions.
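
Conceptually, a conversational retrieval chain first asks the LLM to rewrite the follow-up as a standalone question using the chat history, then runs retrieval with that rewritten question. The sketch below only builds such a "condense" prompt from `(question, answer)` tuples; the function names are hypothetical and the wording approximates LangChain's default template, which may differ between versions.

```python
def format_chat_history(chat_history):
    """Render (question, answer) tuples as a plain-text transcript."""
    lines = []
    for human_text, ai_text in chat_history:
        lines.append(f"Human: {human_text}")
        lines.append(f"Assistant: {ai_text}")
    return "\n".join(lines)


def build_condense_prompt(chat_history, followup):
    """Ask an LLM to rewrite a follow-up as a standalone question."""
    return (
        "Given the following conversation and a follow up question, "
        "rephrase the follow up question to be a standalone question.\n\n"
        f"Chat History:\n{format_chat_history(chat_history)}\n"
        f"Follow Up Input: {followup}\n"
        "Standalone question:"
    )


history_sketch = [
    ("What's new with Llama 3?", "Four new open LLM models with 8B and 70B parameters."),
]
condense_prompt = build_condense_prompt(history_sketch, "Based on what architecture?")
print(condense_prompt)
```

Given this prompt, a model could plausibly rewrite the follow-up into something like "What architecture is Llama 3 based on?", which is a question the retriever can actually match against the stored chunks.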

In [18]:
# use ConversationalRetrievalChain to pass chat history for follow up questions
from langchain.chains import ConversationalRetrievalChain
chat_chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True)

In [19]:
# let's ask the original question "What's new with Llama 3?" again
result = chat_chain.invoke({"question": question, "chat_history": []})
print(result['answer'])

Llama 3 introduces several new features and improvements, including:

1. Four new open LLM models with 8B and 70B parameters, each with base and instruct-tuned versions.
2. A new tokenizer that expands the vocabulary size to 128,256, allowing for more efficient text encoding and potentially stronger multilingualism.
3. A larger context length of 8K tokens.
4. The release of Llama Guard 2, a safety fine-tune version of Llama Guard that was fine-tuned on Llama 3 8B.

These changes aim to improve the performance and versatility of the Llama model, while also making it more accessible and user-friendly.


In [22]:
chat_chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=False)

In [23]:
# this time we pass chat history along with the follow up, so Llama can resolve what the question refers to
chat_history = [(question, result["answer"])]
followup = "Based on what architecture?"
followup_answer = chat_chain.invoke({"question": followup, "chat_history": chat_history})
print(followup_answer['answer'])

Llama 3 is based on the Llama 2 architecture.


In [24]:
# further follow ups are possible by appending each exchange to chat_history like this:
chat_history.append((followup, followup_answer["answer"]))
more_followup = "What changes in vocabulary size?"
more_followup_answer = chat_chain.invoke({"question": more_followup, "chat_history": chat_history})
print(more_followup_answer['answer'])

The vocabulary size in Llama 3 is 128,256. This is an increase from the previous version, Llama 2, which had a vocabulary size of 32,000 tokens (32K). The larger vocabulary in Llama 3 allows for more efficient encoding of text and potentially stronger multilingual capabilities.


**Note:** If results get cut off, you can set "max_new_tokens" in the Replicate call above to a larger number (as shown below) to avoid the cutoff.

```python
model_kwargs={"temperature": 0.01, "top_p": 1, "max_new_tokens": 1000}
```