Reference: https://www.datacamp.com/tutorial/llama-3-1-rag

## Prerequisites:
- Install [ollama](https://ollama.com/)
- Run ollama from the terminal: `ollama run llama3.1`



In [1]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Access environment variables
AZURE_OPENAI_EMBEDDING_MODEL = os.getenv('AZURE_OPENAI_EMBEDDING_MODEL')
AZURE_OPENAI_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT')
AZURE_OPENAI_API_KEY = os.getenv('AZURE_OPENAI_API_KEY')
AZURE_OPENAI_API_VERSION = os.getenv('AZURE_OPENAI_API_VERSION')
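`os.getenv` returns `None` for a variable that is not set, so a typo in the `.env` file only shows up later as an authentication or endpoint error. A minimal sketch of an early check follows; the helper name `missing_env_vars` is hypothetical and not part of `dotenv` or the original notebook.

```python
import os

# The four variables read in the cell above.
REQUIRED_VARS = [
    "AZURE_OPENAI_EMBEDDING_MODEL",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

One way to use it is to call `missing_env_vars(REQUIRED_VARS)` right after `load_dotenv()` and raise an error if the returned list is not empty.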

In [2]:
from langchain_community.document_loaders import WebBaseLoader

# List of URLs to load documents from
urls = [
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]
# Load documents from the URLs
docs = [WebBaseLoader(url).load() for url in urls]


USER_AGENT environment variable not set, consider setting it to identify your requests.
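The warning above is harmless, but it can probably be avoided by setting `USER_AGENT` before the loader is imported, since the check appears to run at import time. A small sketch, assuming that behavior; the value `rag-notebook-example/0.1` is a placeholder, not a required string.

```python
import os

# Set an identifying user agent before importing WebBaseLoader.
# setdefault keeps any value that is already present in the environment.
os.environ.setdefault("USER_AGENT", "rag-notebook-example/0.1")
```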


In [3]:
docs_list = [item for sublist in docs for item in sublist]
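Each `WebBaseLoader(url).load()` call returns a list, so `docs` is a list of lists and the comprehension above flattens it by one level. An equivalent sketch with `itertools.chain`, using made-up strings in place of documents:

```python
from itertools import chain

# Stand-in for the per-URL lists of documents (values are illustrative only).
nested = [["a1", "a2"], ["b1"], []]

# Flatten one level, preserving order.
flat = list(chain.from_iterable(nested))
print(flat)  # ['a1', 'a2', 'b1']
```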

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize a text splitter with specified chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)
# Split the documents into chunks
doc_splits = text_splitter.split_documents(docs_list)
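To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators and measures length with the tiktoken encoder); the function `chunk_tokens` is hypothetical and operates on an already tokenized list.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens. Simplified illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]


tokens = list(range(10))
print(chunk_tokens(tokens, chunk_size=4, chunk_overlap=0))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With `chunk_overlap=0`, as in the cell above, no token appears in two chunks. A positive overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a boundary.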

In [5]:
len(doc_splits)

194

In [6]:
from langchain_community.vectorstores import SKLearnVectorStore
from langchain_openai import AzureOpenAIEmbeddings
# Create embeddings for documents and store them in a vector store
vectorstore = SKLearnVectorStore.from_documents(
    documents=doc_splits,
    embedding=AzureOpenAIEmbeddings(
        model=AZURE_OPENAI_EMBEDDING_MODEL,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        openai_api_version=AZURE_OPENAI_API_VERSION
        )
)
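# Return the 4 most similar chunks per query (k is passed through search_kwargs)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})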
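Conceptually, the retriever embeds the query and returns the chunks whose embeddings are closest to it. The sketch below shows that idea with cosine similarity over toy 2-D vectors. The vectors and the helper names `cosine_similarity` and `top_k` are made up for illustration, and the actual vector store may use a different metric or index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings"; real ones have hundreds or thousands of dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))  # [0, 2]
```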

In [7]:
query = "What is CoT?"
retrieved_docs = retriever.invoke(query)

In [8]:
retrieved_docs[0].__dict__

{'id': None,
 'metadata': {'id': 'daa40931-65ca-461d-bf9c-e1fbba38f7b6',
  'source': 'https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/',
  'title': "Prompt Engineering | Lil'Log",
  'description': 'Prompt Engineering, also known as In-Context Prompting, refers to methods for how to communicate with LLM to steer its behavior for desired outcomes without updating the model weights. It is an empirical science and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics.\nThis post only focuses on prompt engineering for autoregressive language models, so nothing with Cloze tests, image generation or multimodality models.',
  'language': 'en'},
 'page_content': 'Two main types of CoT prompting:',
 'type': 'Document'}

In [9]:
from IPython.display import display, Markdown

for indx, doc in enumerate(retrieved_docs, start=1):
    md_content = f"""### Document {indx}

**Source:** {doc.metadata['source']}

**Title:** {doc.metadata['title']}

**Description:** {doc.metadata['description']}

**Language:** {doc.metadata['language']}

**Content:** {doc.page_content}
    """
    display(Markdown(md_content))

### Document 1

**Source:** https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/

**Title:** Prompt Engineering | Lil'Log

**Description:** Prompt Engineering, also known as In-Context Prompting, refers to methods for how to communicate with LLM to steer its behavior for desired outcomes without updating the model weights. It is an empirical science and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics.
This post only focuses on prompt engineering for autoregressive language models, so nothing with Cloze tests, image generation or multimodality models.

**Language:** en

**Content:** Two main types of CoT prompting:
    

### Document 2

**Source:** https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/

**Title:** Prompt Engineering | Lil'Log

**Description:** Prompt Engineering, also known as In-Context Prompting, refers to methods for how to communicate with LLM to steer its behavior for desired outcomes without updating the model weights. It is an empirical science and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics.
This post only focuses on prompt engineering for autoregressive language models, so nothing with Cloze tests, image generation or multimodality models.

**Language:** en

**Content:** Few-shot CoT. It is to prompt the model with a few demonstrations, each containing manually written (or model-generated) high-quality reasoning chains.
    

### Document 3

**Source:** https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/

**Title:** Prompt Engineering | Lil'Log

**Description:** Prompt Engineering, also known as In-Context Prompting, refers to methods for how to communicate with LLM to steer its behavior for desired outcomes without updating the model weights. It is an empirical science and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics.
This post only focuses on prompt engineering for autoregressive language models, so nothing with Cloze tests, image generation or multimodality models.

**Language:** en

**Content:** Fig. 3. Comparing CoT and PoT. (Image source: Chen et al. 2022).
External APIs#
TALM (Tool Augmented Language Models; Parisi et al. 2022) is a language model augmented with text-to-text API calls. LM is guided to generate |tool-call and tool input text conditioned on task input text to construct API call requests. When |result shows up, the specified tool API is called and the returned result gets appended to the text sequence. The final output is generated following |output token.
    

### Document 4

**Source:** https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/

**Title:** Prompt Engineering | Lil'Log

**Description:** Prompt Engineering, also known as In-Context Prompting, refers to methods for how to communicate with LLM to steer its behavior for desired outcomes without updating the model weights. It is an empirical science and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics.
This post only focuses on prompt engineering for autoregressive language models, so nothing with Cloze tests, image generation or multimodality models.

**Language:** en

**Content:** Definition: Classify the sentiment of the given movie review, "positive" or "negative".
Input: i'll bet the video game is a lot more fun than the film.
Output:
Self-Consistency Sampling#
Self-consistency sampling (Wang et al. 2022a) is to sample multiple outputs with temperature > 0 and then selecting the best one out of these candidates.
The criteria for selecting the best candidate can vary from task to task. A general solution is to pick majority vote. For tasks that are easy to validate such as a programming question with unit tests, we can simply run through the interpreter and verify the correctness with unit tests.
Chain-of-Thought (CoT)#
Chain-of-thought (CoT) prompting (Wei et al. 2022) generates a sequence of short sentences to describe reasoning logics step by step, known as reasoning chains or rationales, to eventually lead to the final answer. The benefit of CoT is more pronounced for complicated reasoning tasks, while using large models (e.g. with more than 50B parameters). Simple tasks only benefit slightly from CoT prompting.
Types of CoT prompts#
    

In [10]:
from langchain_ollama import ChatOllama
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Define the prompt template for the LLM
prompt = PromptTemplate(
    template="""You are an assistant for question-answering tasks.
    Use the following documents to answer the question.
    If you don't know the answer, just say that you don't know.
    Use three sentences maximum and keep the answer concise:
    Question: {question}
    Documents: {documents}
    Answer:
    """,
    input_variables=["question", "documents"],
)
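The two placeholders are filled in when the chain is invoked. As a rough illustration of the resulting text, plain `str.format` on a shortened copy of the template behaves much like the default f-string style substitution; the document text here is a made-up example.

```python
# Shortened copy of the template, for illustration only.
template = (
    "You are an assistant for question-answering tasks.\n"
    "Question: {question}\n"
    "Documents: {documents}\n"
    "Answer:"
)

filled = template.format(
    question="What is CoT?",
    documents="Chain-of-thought prompting produces step-by-step reasoning.",
)
print(filled)
```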

In [11]:
# Initialize the LLM with Llama 3.1 model
llm = ChatOllama(
    model="llama3.1",
    temperature=0,
)

In [12]:
# Create a chain combining the prompt template and LLM
rag_chain = prompt | llm | StrOutputParser()

In [13]:
# Define the RAG application class
class RAGApplication:
    def __init__(self, retriever, rag_chain):
        self.retriever = retriever
        self.rag_chain = rag_chain
    def run(self, question):
        # Retrieve relevant documents
        documents = self.retriever.invoke(question)
        # Extract content from retrieved documents
        doc_texts = "\n".join([doc.page_content for doc in documents])
        # Get the answer from the language model
        answer = self.rag_chain.invoke({"question": question, "documents": doc_texts})
        return answer
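The retrieve-then-generate flow can be exercised without Ollama or Azure by swapping in stand-ins that expose the same `invoke` method. This is a sketch for testing the plumbing only; `FakeRetriever`, `EchoChain`, and `run_rag` are hypothetical names, and the document texts are invented.

```python
from types import SimpleNamespace


class FakeRetriever:
    """Stand-in retriever that always returns the same two documents."""

    def invoke(self, query):
        return [
            SimpleNamespace(page_content="CoT means chain-of-thought."),
            SimpleNamespace(page_content="It prompts step-by-step reasoning."),
        ]


class EchoChain:
    """Stand-in chain that echoes its inputs instead of calling an LLM."""

    def invoke(self, inputs):
        return f"Q: {inputs['question']} | D: {inputs['documents']}"


def run_rag(retriever, chain, question):
    # Same steps as the run method: retrieve, join contents, generate.
    documents = retriever.invoke(question)
    doc_texts = "\n".join(doc.page_content for doc in documents)
    return chain.invoke({"question": question, "documents": doc_texts})


result = run_rag(FakeRetriever(), EchoChain(), "What is CoT?")
print(result)
```

Because the stand-ins are deterministic, the output shows exactly what text would reach the model, with one retrieved chunk per line.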

In [14]:
# Initialize the RAG application
rag_application = RAGApplication(retriever, rag_chain)
# Example usage
question = "What is CoT?"
answer = rag_application.run(question)
print("Question:", question)
print("Answer:", answer)

Question: What is CoT?
Answer: Chain-of-Thought (CoT) prompting generates a sequence of short sentences to describe reasoning logics step by step, known as reasoning chains or rationales, to eventually lead to the final answer. This type of prompting is more beneficial for complicated reasoning tasks and uses large models with over 50B parameters. It involves providing a few demonstrations, each containing manually written high-quality reasoning chains.
