## **1. Loading Documents for RAG with LangChain**

### **The standard RAG workflow**

<img src="./images/rag.png" width=50%>

To enable a RAG workflow, we first need to set up our data sources for retrieval. This means loading the documents to build up the knowledge base, splitting them into chunks to be processed, and creating numerical representations of the text called embeddings. These embeddings, or vectors, are stored in a vector database for future retrieval.

<img src="./images/data-for-retrieval.png" width=50%>

Document loading (CSV, PDF, HTML) is covered in the section __1. Langchain document loaders__ of the chapter [3 - Retrieval Augmented Generation (RAG).ipynb](../Developing%20LLM%20Applications%20with%20LangChain/3%20-%20Retrieval%20Augmented%20Generation%20(RAG).ipynb) in the course [Developing LLM applications with LangChain](../Developing%20LLM%20Applications%20with%20LangChain).

In short:

- To load CSV files, we use the `CSVLoader` class
- To load PDF files, we use the `PyPDFLoader` class
- To load HTML files, we use the `UnstructuredHTMLLoader` class

All three classes are imported from the `langchain_community.document_loaders` module, as in the cell below.
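To make the shape of a loader's result concrete, here is a plain-Python stand-in, not the LangChain implementation. It assumes a hypothetical `Doc` class and a hypothetical `load_csv_rows` helper that mimic the two attributes used in this notebook, `page_content` and `metadata`. The sample data is made up, and the exact text layout and metadata keys produced by the real `CSVLoader` may differ.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_rows(csv_text: str, source: str) -> list:
    """Turn each CSV row into one Doc (illustrative helper, not a LangChain API)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row_number, row in enumerate(reader):
        # Join "column: value" pairs so the row reads as plain text
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append(Doc(page_content=content, metadata={"source": source, "row": row_number}))
    return docs


# Made-up sample data, used only for illustration
sample_csv = "name,price\napple,1.2\npear,0.8\n"
docs = load_csv_rows(sample_csv, source="fruit.csv")

print(len(docs))
print(docs[0].page_content)
print(docs[0].metadata)
```

The point of the sketch is the one-record-per-row idea: each loaded item carries its text and a small dictionary describing where it came from, which is what the splitting and retrieval steps rely on later.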

In [11]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

html_loader = UnstructuredHTMLLoader('./datasets/datacamp-blog.html')

document = html_loader.load()
first_document = document[0]

print("Content:", first_document.page_content)
print("Metadata:", first_document.metadata)

Content: Skip to main content

HomeBlogPython

How to Learn Python From Scratch in 2024: An Expert Guide

Discover how to learn Python, its applications, and the demand for Python skills. Start your Python journey today ​​with our comprehensive guide.

Updated Jul 2024 · 19 min read

Share

As one of the most popular programming languages out there, many people want to learn Python. But how do you go about getting started? In this guide, we explore everything you need to know to begin your learning journey, including a step-by-step guide and learning plan and some of the most useful resources to help you succeed.

What is Python?

Python is a high-level, interpreted programming language created by Guido van Rossum and first released in 1991. It is designed with an emphasis on code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.

Python supports multiple programming paradigms, including pr

## **2. Text splitting, embeddings and vector storage**

Text splitting is covered in the section __2. Splitting external data for retrieval__ of the chapter [3 - Retrieval Augmented Generation (RAG).ipynb](../Developing%20LLM%20Applications%20with%20LangChain/3%20-%20Retrieval%20Augmented%20Generation%20(RAG).ipynb) in the course [Developing LLM applications with LangChain](../Developing%20LLM%20Applications%20with%20LangChain).

__Note__: The DataCamp course imports the splitters from the `langchain_text_splitters` package. Here we import them from `langchain.text_splitter` instead, which is part of the main `langchain` package and exposes the same splitter classes in the version used for this notebook.

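As a reminder:
- **CharacterTextSplitter**: splits on a single separator and merges the pieces up to `chunk_size`; a piece that contains no separator can still exceed `chunk_size`.
- **RecursiveCharacterTextSplitter**: tries a list of separators in order, falling back to the next one whenever a piece is still longer than `chunk_size`.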
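
The fallback idea behind recursive splitting can be sketched in plain Python. This is a deliberately simplified illustration, not LangChain's implementation: the hypothetical `recursive_split` function below neither merges small pieces back together nor applies `chunk_overlap`, both of which the real splitter does.

```python
def recursive_split(text: str, separators: list, chunk_size: int) -> list:
    """Simplified recursive splitting: no merging of small pieces, no overlap."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        # Pieces that are still too long fall through to the next separator
        pieces.extend(recursive_split(part, rest, chunk_size))
    return pieces


sample = (
    "Short paragraph.\n\n"
    "A much longer paragraph that will not fit into a single chunk of forty characters."
)
pieces = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=40)

print(pieces)
print([len(piece) for piece in pieces])
```

Because this sketch skips the merge step, the long paragraph ends up as one piece per word. The real splitter would recombine neighbouring words until the next one no longer fits, which is why its output consists of sentence-sized chunks.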

In [16]:
from langchain.text_splitter import CharacterTextSplitter

text = """Machine learning is a fascinating field.\n\nIt involves algorithms and models that can learn from data.
These models can then make predictions or decisions without being explicitly programmed to perform the task.\n
This capability is increasingly valuable in various industries, from finance to healthcare.\n\n
There are many types of machine learning, including supervised, unsupervised, and reinforcement learning.\n
Each type has its own strengths and applications."""

text_splitter = CharacterTextSplitter(
    separator='\n\n',
    chunk_size=100,
    chunk_overlap=10
)

chunks = text_splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])

Created a chunk of size 168, which is longer than the specified 100
Created a chunk of size 106, which is longer than the specified 100


['Machine learning is a fascinating field.', 'It involves algorithms and models that can learn from data.\nThese models can then make predictions or decisions without being explicitly programmed to perform the task.', 'This capability is increasingly valuable in various industries, from finance to healthcare.', 'There are many types of machine learning, including supervised, unsupervised, and reinforcement learning.', 'Each type has its own strengths and applications.']
[40, 168, 91, 105, 49]


In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=['\n\n', '\n', " ", ""],
    chunk_size=100,
    chunk_overlap=10
)

chunks = splitter.split_text(text)

print(chunks)
print([len(chunk) for chunk in chunks])

['Machine learning is a fascinating field.', 'It involves algorithms and models that can learn from data.', 'These models can then make predictions or decisions without being explicitly programmed to perform', 'perform the task.', 'This capability is increasingly valuable in various industries, from finance to healthcare.', 'There are many types of machine learning, including supervised, unsupervised, and reinforcement', 'learning.', 'Each type has its own strengths and applications.']
[40, 59, 98, 17, 91, 95, 9, 49]


**Splitting documents**

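To split loaded documents instead of a raw string, we load the file with a document loader and call `.split_documents()` in place of `.split_text()`. Each resulting chunk is itself a document with `page_content` and `metadata`.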

In [6]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader('./datasets/rag-paper.pdf')
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = splitter.split_documents(documents)

# Print the first chunk, then the character length of every chunk
print(chunks[0])
print([len(chunk.page_content) for chunk in chunks])

page_content='Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks
Patrick Lewis†‡, Ethan Perez⋆,
Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,
Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†
†Facebook AI Research; ‡University College London; ⋆New York University;
plewis@fb.com
Abstract
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-
stream NLP tasks. However, their ability to access and precisely manipulate knowl-
edge is still limited, and hence on knowledge-intensive tasks, their performance
lags behind task-speciﬁc architectures. Additionally, providing provenance for their
decisions and updating their world knowledge remain open research problems. Pre-
trained models with a differentiable access mechanism to explicit non-parametric' metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'La

### **Embedding and storage**

Remember that embeddings are numerical representations of text. Embedding models aim to capture the "meaning" of the text, and the resulting numbers give the text's position in a high-dimensional vector space.

<img src="./images/embedding.png" width=50% height=50%>

Vector stores are databases specifically designed to store and retrieve this high-dimensional vector data.

When documents are embedded and stored, similar documents are located closer together in the vector space. When the RAG application receives a user input, it will be embedded and used to query the database, returning the most similar documents.
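
The "closest documents" idea can be illustrated with a toy example. The three-dimensional vectors below are made up for illustration (real embedding models return hundreds or thousands of dimensions), and cosine similarity is only one common choice of metric; the metric a particular vector store uses by default may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" keyed by the text they represent
store = {
    "RAG combines retrieval with generation.": [0.9, 0.1, 0.0],
    "Retrievers fetch relevant passages.": [0.8, 0.3, 0.1],
    "Bananas are rich in potassium.": [0.0, 0.2, 0.9],
}
query_vector = [0.9, 0.1, 0.05]  # pretend embedding of the user question

# Rank stored texts from most to least similar to the query
ranked = sorted(
    store,
    key=lambda text: cosine_similarity(store[text], query_vector),
    reverse=True,
)
top_k = ranked[:2]  # plays the role of k=2 in a retriever

print(top_k)
```

The two retrieval-related sentences rank above the unrelated one because their vectors point in nearly the same direction as the query vector.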

Embedding and storage are also covered in the section __RAG storage and retrieval using vector databases__ of the chapter [3 - Retrieval Augmented Generation (RAG).ipynb](../Developing%20LLM%20Applications%20with%20LangChain/3%20-%20Retrieval%20Augmented%20Generation%20(RAG).ipynb) in the course [Developing LLM applications with LangChain](../Developing%20LLM%20Applications%20with%20LangChain).

As a reminder, we will use an embedding model from OpenAI and store the vectors in a Chroma vector database.

### **Building an LCEL retrieval chain**

The retrieval chain will take a question input, insert it into the chain using `RunnablePassthrough` and assign it to "question". `RunnablePassthrough` allows inputs to be inserted into chains unchanged.

We retrieve the relevant documents from the vector store and assign to "context", integrate both of these into a prompt template, pass the prompt to the model to generate an output, and parse the output into our favored format, such as a string. Before building the chain, we need to create three components: a retriever, which is derived from our vector store, a prompt template for combining the user question and retrieved context, and a model to generate the response.

<img src='./images/lcel-chain.png' width=50%>


This is again covered in the chapter [3 - Retrieval Augmented Generation (RAG).ipynb](../Developing%20LLM%20Applications%20with%20LangChain/3%20-%20Retrieval%20Augmented%20Generation%20(RAG).ipynb) of the Course [Developing LLM applications with LangChain](../Developing%20LLM%20Applications%20with%20LangChain).

The only difference is that the earlier chapter used `ChatPromptTemplate.from_messages([("human", message)])`, whereas here we use the `.from_template("""prompt""")` method.

As a reminder:

__Building a chain using LCEL and RunnablePassthrough__:  <br>
We start by opening parentheses so we can define our chain over multiple lines. Then, create a `dictionary` that takes the input from `RunnablePassthrough`, assigns it to `"question"`, and uses it to query and retrieve chunks from the retriever, which are assigned to `"context"`. <br>
`RunnablePassthrough` is essentially a _placeholder_ in our chain that allows us to pass data through without modifying it.  <br>
The retrieved `"context"` and user `"question"` are then passed into the prompt using the LCEL pipe, and then into the `LLM`. Finally, a string output parser is used to parse the model output as a string.
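
The data flow described above can be mimicked in plain Python. This is an analogy under stated assumptions, not how LangChain implements it: `passthrough`, `toy_retriever`, and `fan_out` are hypothetical helpers, and the retriever returns a canned chunk instead of querying a vector store.

```python
def passthrough(value):
    """Return the input unchanged, like the role RunnablePassthrough plays in the chain."""
    return value


def toy_retriever(question):
    """Hypothetical retriever: returns a canned chunk instead of querying a vector store."""
    return ["RAG models retrieve passages before generating an answer."]


def fan_out(steps, value):
    """Call every step with the same input and collect the results under each key."""
    return {key: step(value) for key, step in steps.items()}


template = "Context: {context}\nQuestion: {question}"

question = "What does a RAG model do?"

# The same question string feeds both entries of the dictionary
prompt_inputs = fan_out({"context": toy_retriever, "question": passthrough}, question)

# The filled-in prompt is what would be piped on to the model
prompt_text = template.format(**prompt_inputs)

print(prompt_inputs)
print(prompt_text)
```

The key point is that a single input is sent to every entry of the dictionary: the retriever turns it into context, while the passthrough hands it on untouched so the prompt can still quote the original question.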

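Our vector store will contain chunks from the RAG academic paper, so we will `invoke` the chain to ask for the paper's findings. The first cell below estimates the embedding cost of the chunks with `tiktoken`; the second cell builds the vector store and the chain, then runs the query.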

In [14]:
import tiktoken

# Load the encoder for the OpenAI text-embedding-3-small model
enc = tiktoken.encoding_for_model("text-embedding-3-small")

# Calculate tokens for each chunk
tokens_per_chunk = [len(enc.encode(chunk.page_content)) for chunk in chunks]
total_tokens = sum(tokens_per_chunk)

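# Cost calculation, assuming the text-embedding-3-small rate of $0.02 per 1M tokens
# (= $0.00002 per 1K tokens); check current pricing before relying on this figure
cost_per_1k_tokens = 0.00002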

# Display detailed information
print(f'Number of chunks: {len(chunks)}')
print(f'Tokens per chunk: {tokens_per_chunk}')
print(f'Total tokens: {total_tokens:,}')
print(f'Estimated cost: ${(cost_per_1k_tokens * total_tokens/1000):.4f}')

# Optional: Display average tokens per chunk
avg_tokens = total_tokens / len(chunks)
print(f'Average tokens per chunk: {avg_tokens:.1f}')

Number of chunks: 92
Tokens per chunk: [234, 198, 211, 151, 312, 231, 225, 188, 225, 178, 219, 275, 280, 240, 113, 264, 230, 222, 216, 259, 91, 208, 210, 213, 215, 234, 139, 433, 240, 229, 216, 237, 266, 256, 253, 210, 220, 91, 484, 295, 337, 183, 216, 206, 193, 192, 203, 198, 217, 297, 283, 142, 263, 289, 291, 328, 186, 285, 296, 293, 277, 227, 289, 281, 309, 302, 230, 307, 272, 272, 279, 296, 273, 308, 281, 308, 258, 273, 126, 206, 195, 234, 55, 203, 203, 232, 208, 209, 99, 263, 203, 66]
Total tokens: 21,653
Estimated cost: $0.0004
Average tokens per chunk: 235.4


In [7]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import os
import openai

# Load the PDF document
loader = PyPDFLoader('./datasets/rag-paper.pdf')
documents = loader.load()

# Split the document into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

# Set up the OpenAI API key
openai.api_key = os.environ["OPENAI_API_KEY"]

# Create embeddings model
embedding_model = OpenAIEmbeddings(
    api_key=openai.api_key,
    model='text-embedding-3-small'
)

# Create vector store
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory='./datasets/chromadb'
)

# Create retriever
retriever = vector_store.as_retriever(
    search_type='similarity',
    search_kwargs={"k": 2}
)

# Define the prompt template
prompt = ChatPromptTemplate.from_template("""
Use the following pieces of context to answer the question at the end.
If you don't know the answer, say that you don't know.
Context: {context}
Question: {question}
""")

# Initialize the language model
llm = ChatOpenAI(model='gpt-4o-mini', api_key=openai.api_key, temperature=0)

# Create the chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Invoke the chain with the correct input format
result = chain.invoke("What are the key findings or results presented in the paper?")
print(result)

The key findings or results presented in the paper include:

1. **Factuality and Specificity**: RAG models are found to be more factual and specific than BART for Jeopardy question generation.

2. **Generation Diversity**: The study investigates generation diversity by calculating the ratio of distinct n-grams to total n-grams generated by different models. It is shown that RAG-Sequence generates more diverse outputs than RAG-Token, and both RAG models are significantly more diverse than BART without the need for diversity-promoting decoding.

3. **Retrieval Mechanism Effectiveness**: The paper assesses the effectiveness of the retrieval mechanism in RAG by conducting ablations where the retriever is frozen during training. The results indicate that learned retrieval improves performance across all tasks. 

4. **Gold Article Retrieval**: It is noted that a gold article is present in the top 10 retrieved articles in 90% of cases, and relevant information is retrieved in 71% of cases. 

