## **1. Loading and splitting code files**

In the previous chapter, we looked at PDF, CSV, and HTML files. Now we're going to extend this a little further to Python and Markdown files.

### **Loading Markdown files (.md)**

The `UnstructuredMarkdownLoader` class can be used to load markdown files the same way as the other file formats we've looked at before: by instantiating the class with the file path and calling the `.load()` method to load the file into memory. We could integrate these documents into a RAG application to read code documentation and make recommendations.

__Note__: `UnstructuredMarkdownLoader` requires the `markdown` package to be installed.

In [11]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader('./datasets/README.md')
markdown_content = loader.load()

print(markdown_content[0])

page_content='🦜️🔗 LangChain

⚡ Build context-aware reasoning applications ⚡

Looking for the JS/TS library? Check out LangChain.js.

To help you ship LangChain apps to production faster, check out LangSmith. LangSmith is a unified developer platform for building, testing, and monitoring LLM applications. Fill out this form to speak with our sales team.

Quick Install

With pip: bash pip install langchain

With conda: bash conda install langchain -c conda-forge

🤔 What is LangChain?

LangChain is a framework for developing applications powered by large language models (LLMs).

For these applications, LangChain simplifies the entire application lifecycle:

Open-source libraries: Build your applications using LangChain's open-source building blocks, components, and third-party integrations. Use LangGraph to build stateful agents with first-class streaming and human-in-the-loop support.

Productionization: Inspect, monitor, and evaluate your apps with LangSmith so that you can constantly o

### **Loading Python files (.py)**

Imagine we have a codebase and would like a way to ask questions about it. We could achieve this by integrating Python files into a RAG application. The `PythonLoader` class and its `.load()` method can be used to load these files into memory. The resulting documents have `.page_content` and `.metadata` attributes for accessing the document's details. Remember that parsing Python files can be tricky, because Python has its own syntax with _imports_, _classes_, _functions_ and much more that needs to be preserved during chunking.

In [1]:
from langchain_community.document_loaders import PythonLoader

loader = PythonLoader('./datasets/chatbot.py')

python_data = loader.load()

print(python_data[0])

page_content='from abc import ABC, abstractmethod

class LLM(ABC):
  @abstractmethod
  def complete_sentence(self, prompt):
    pass

class OpenAI(LLM):
  def complete_sentence(self, prompt):
    return prompt + " ... OpenAI end of sentence."
  
class Anthropic(LLM):
  def complete_sentence(self, prompt):
    return prompt + " ... Anthropic end of sentence."

class ChatBot:
  def _get_llm(self, provider):
    if provider == "OpenAI":
      return OpenAI()
    elif provider == "Anthropic":
      return Anthropic()
    
  def chat(self, prompt, provider):
    # Return an llm object, then call complete_sentence()
    llm = self._get_llm(provider)
    return llm.complete_sentence(prompt)' metadata={'source': './datasets/chatbot.py'}


### **Splitting code files**

We will use our _best_ tool for document splitting: `RecursiveCharacterTextSplitter`. We set `chunk_size` and `chunk_overlap` as usual, split the documents with the `.split_documents()` method, and print the content of each chunk.

In [2]:
from langchain_community.document_loaders import PythonLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# The python file is already loaded in the previous code cell

python_split = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=10)   # We didn't specify the separators, so it will use the default ones - which are ["\n\n", "\n", " ", ""]

chunks = python_split.split_documents(python_data)

for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")

Chunk 1:
from abc import ABC, abstractmethod

class LLM(ABC):
  @abstractmethod
  def complete_sentence(self, prompt):
    pass

Chunk 2:
class OpenAI(LLM):
  def complete_sentence(self, prompt):
    return prompt + " ... OpenAI end of sentence."
  
class Anthropic(LLM):

Chunk 3:
def complete_sentence(self, prompt):
    return prompt + " ... Anthropic end of sentence."



Notice that the split between chunks two and three splits the Anthropic class, and because chunks are processed separately, key context has been lost. Our current strategy is naive because it doesn't consider structures like classes and functions. Let's change this!

### **Splitting by language**

We split the loaded Python file using `RecursiveCharacterTextSplitter` again, but this time we use the `.from_language()` method. This method has a `language` argument, which refers to the programming language and which we set to `Language.PYTHON`; the rest of the arguments stay the same.

This changes the default separators list from the hierarchy of paragraphs, sentences, and words, so that the splitter tries splitting on class and function definitions before moving on to the standard separators.
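To make the role of the separator hierarchy concrete, here is a minimal pure-Python sketch of recursive splitting. It is a simplified illustration, not LangChain's implementation: the `split_recursive` helper is hypothetical, chunk overlap is ignored, and the Python-aware separator list is an assumed, shortened version of what a language-aware splitter might use.

```python
def split_recursive(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text]

    # Pick the highest-priority separator that actually occurs in the text.
    for idx, sep in enumerate(separators):
        if sep and sep in text:
            break
    else:
        # No usable separator left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Split, keeping the separator at the start of each following piece.
    parts = text.split(sep)
    pieces = [parts[0]] + [sep + part for part in parts[1:]]

    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece  # greedily merge small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too large: recurse with the lower-priority separators.
            chunks.extend(split_recursive(piece, separators[idx + 1:], chunk_size))
    if current:
        chunks.append(current)
    return chunks


# Two small classes separated by a whitespace-only line, as in chatbot.py.
source = (
    "class OpenAI(LLM):\n"
    "  def complete_sentence(self, prompt):\n"
    "    return prompt + \" ... OpenAI end of sentence.\"\n"
    "  \n"
    "class Anthropic(LLM):\n"
    "  def complete_sentence(self, prompt):\n"
    "    return prompt + \" ... Anthropic end of sentence.\"\n"
)

generic_separators = ["\n\n", "\n", " ", ""]
python_separators = ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""]  # assumed, simplified

generic = split_recursive(source, generic_separators, chunk_size=150)
aware = split_recursive(source, python_separators, chunk_size=150)

for name, chunks in [("generic", generic), ("python-aware", aware)]:
    print(f"--- {name} ---")
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i + 1}:\n{chunk}\n")
```

With the generic list, the first chunk ends on the `class Anthropic(LLM):` header and the method body lands in the next chunk. With the class separator tried first, each class stays in its own chunk.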

In [3]:
from langchain.text_splitter import Language    # We already imported the other required modules in the previous code cell

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=150, chunk_overlap=10
)

chunks = python_splitter.split_documents(python_data)
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")

Chunk 1:
from abc import ABC, abstractmethod

class LLM(ABC):
  @abstractmethod
  def complete_sentence(self, prompt):
    pass

Chunk 2:
class OpenAI(LLM):
  def complete_sentence(self, prompt):
    return prompt + " ... OpenAI end of sentence."

Chunk 3:
class Anthropic(LLM):
  def complete_sentence(self, prompt):
    return prompt + " ... Anthropic end of sentence."



As we can see, the splitter was able to split on class definitions, so all of that context is kept together.

Note that this approach isn't foolproof: depending on the size of the classes and functions relative to the `chunk_size`, we may get differing results.

In [6]:
loader = PythonLoader('./datasets/rag.py')

python_data = loader.load()

print(python_data[0].page_content)

__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import shutil
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

openai_api_key = os.environ["OPENAI_API_KEY"]

loader = PyPDFLoader("rag_paper.pdf")
documents = loader.load()
# Split the documents into manageable chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(documents)

# Initialize the a

In [7]:
# Create a Python-aware recursive character splitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=300, chunk_overlap=100
)

# Split the Python content into chunks
chunks = python_splitter.split_documents(python_data)

for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")

Chunk 1:
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI
from langchain_huggingface import HuggingFaceEmbeddings

Chunk 2:
from langchain_openai import ChatOpenAI
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

Chunk 3:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import shutil
import getpass
import os



## **2. Advanced splitting methods**

### **Limitations of our current splitting strategies**

- Splits are naive (not context-aware)
  - Ignores context of surrounding text
- Splits are made using characters, not tokens
  - LLMs break text into tokens, or smaller units of text, for processing
  - Splitting on characters risks exceeding the model's __context window__
    - __Context window__: the maximum number of tokens (text units) that can be processed at once

We'll introduce methods to make our splitter more aware of the document's context and enable splitting on tokens:
- `SemanticChunker`
- `TokenTextSplitter`

### **Splitting on tokens**

When we split on tokens, the `chunk_size` and `chunk_overlap` refer to the _number of tokens_ in the chunk, rather than characters, so a `chunk_size` of five means we can have a _maximum of five tokens in the chunk_.

<img src='./images/tokensplit.png' width=50%>
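The windowing logic behind `chunk_size` and `chunk_overlap` can be sketched in plain Python. This is only an illustration: the `window_tokens` helper is hypothetical, and whitespace-separated words stand in for real model tokens, which a tokenizer would produce differently.

```python
def window_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
        start += step
    return chunks


# Assumption: words act as stand-in "tokens" for this illustration.
tokens = "Mary had a little lamb its fleece was white as snow".split()
token_chunks = window_tokens(tokens, chunk_size=5, chunk_overlap=2)

for i, chunk in enumerate(token_chunks):
    print(f"Chunk {i + 1} ({len(chunk)} tokens): {' '.join(chunk)}")
```

Each chunk holds at most five tokens, and the last two tokens of one chunk reappear as the first two of the next.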

In [None]:
import tiktoken      # We use this library to count tokens
from langchain.text_splitter import TokenTextSplitter

example_string = "Mary had a little lamb, it's fleece was white as snow."

encoding = tiktoken.encoding_for_model('gpt-4o-mini')
splitter = TokenTextSplitter(encoding_name=encoding.name, chunk_size=10, chunk_overlap=2)

chunks = splitter.split_text(example_string)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Chunk 1:
Mary had a little lamb, it's fleece was white

Chunk 2:
 was white as snow.



Let's now check whether each chunk stays within the `chunk_size` of 10 tokens.

In [14]:
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\nNo. tokens: {len(encoding.encode(chunk))}\n{chunk}\n")

Chunk 1:
No. tokens: 10
Mary had a little lamb, it's fleece was white

Chunk 2:
No. tokens: 5
 was white as snow.



### **Semantic splitting**

To perform semantic splitting, we'll need an embedding model to generate the text embeddings used to detect shifts in topic. We'll use a model from OpenAI. We instantiate the semantic splitting class, passing the embedding model, along with two additional parameters:
- `breakpoint_threshold_type`, which sets the metric by which embeddings are compared, and
- `breakpoint_threshold_amount`, which sets the metric's threshold at which to perform the split.

Like other splitters, we use the `.split_documents()` method to apply the splitter, in this case to the __rag-paper.pdf__ academic paper. With the threshold amount set to `0.8`, the first chunk ends after the opening sentence of the abstract.

__Note__: The SemanticChunker is a module from the `langchain_experimental` library, which needs to be installed.

To read more about SemanticChunker, see [here](https://api.python.langchain.com/en/latest/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html).
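The idea of splitting where the topic shifts can be sketched without any external service. The example below is an assumption-laden toy: the two-dimensional vectors are hand-written stand-ins for real embeddings, `semantic_chunks` is a hypothetical helper, and it uses a plain absolute distance threshold rather than the threshold types that `SemanticChunker` offers.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_chunks(sentences, vectors, threshold):
    """Start a new chunk whenever consecutive sentences are further apart than threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "RAG retrieves documents before generating.",
    "Retrieved passages ground the answer.",
    "Bananas are rich in potassium.",
]
# Toy vectors (assumed): the first two point in a similar direction, the third does not.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

semantic_groups = semantic_chunks(sentences, vectors, threshold=0.3)
for i, chunk in enumerate(semantic_groups):
    print(f"Chunk {i + 1}: {chunk}")
```

The first two sentences stay together because their vectors are close, while the unrelated third sentence starts a new chunk.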

In [15]:
from langchain_openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.document_loaders import PyPDFLoader
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

loader = PyPDFLoader('./datasets/rag-paper.pdf')
data = loader.load()

embeddings = OpenAIEmbeddings(api_key=openai.api_key, model="text-embedding-3-small")

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_amount=0.8,   # with the "gradient" type, this amount is used as a percentile (0-100) of the distance gradient, so lower values produce more split points
    breakpoint_threshold_type="gradient"
)

chunks = semantic_splitter.split_documents(data)
print(chunks[0])

page_content='Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks
Patrick Lewis†‡, Ethan Perez⋆,
Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,
Mike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†
†Facebook AI Research; ‡University College London; ⋆New York University;
plewis@fb.com
Abstract
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-
stream NLP tasks.' metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-04-13T00:48:38+00:00', 'author': '', 'keywords': '', 'moddate': '2021-04-13T00:48:38+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': './datasets/rag-paper.pdf', 'total_pages': 19, 'page': 0, 'page_label': '1'}


## **3. Optimizing document retrieval**

<img src='./images/r-in-rag.png' width=50%>

So far, our document retrieval has consisted of a vector database containing embedded documents. The input to the RAG application is then used to query the vectors, using a distance metric to determine which vectors are closest and therefore most similar and relevant. This type of retrieval is known as **dense retrieval**.

  <div style="display: flex;">
    <!-- Left Column -->
    <div style="width: 45%; padding: 10px;">
    <b>Dense Retrieval</b><br><br>
    Encode each chunk as a single <b>dense</b> vector, meaning that most of its component values are <b>non-zero</b>. <br><br>
    <img src='./images/dense.png' width=58%>
    <ul>
      <li><b>Pros</b>: Capturing semantic meaning</li>
      <li><b>Cons</b>: </li>
        <ul>
         <li>Computationally expensive</li>
         <li>May struggle to capture rare words or highly specific technical terms</li>
        </ul>
    </ul>
    </div>
    <!-- Right Column -->
    <div style="width: 48%; padding: 10px;">
    <b>Sparse Retrieval</b><br>
    Encode using <b>word matching</b> with mostly <b>zero</b> components. It is a method of finding information by matching specific keywords or terms in a query with those in documents. The resulting vectors contain many zeros, with only a few non-zero terms, which is why they are said to be "sparse". 
    <div>
    <img src='./images/sparse.png' width=90%>
    <ul>
      <li><b>Pros</b>: Precise, explainable, rare-word handling</li>
      <li><b>Cons</b>: </li>
        <ul>
         <li>Limited generalizability</li>
         <li>Does not capture the semantic meaning of the text</li>
        </ul>
    </ul>
    </div>
    </div>
</div>
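To see why word-matching vectors are called sparse, here is a small sketch that encodes text as term counts over a shared vocabulary. The helper names and the three example sentences are made up for illustration, and raw term counts are the simplest possible encoding rather than what a production retriever would use.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())


def term_count_vector(text, vocabulary):
    """One component per vocabulary term: how often the term occurs in the text."""
    tokens = tokenize(text)
    return [tokens.count(term) for term in vocabulary]


documents = [
    "Python was created by Guido van Rossum",
    "Python is popular for machine learning",
    "PyTorch is a Python library for machine learning",
]

vocabulary = sorted({term for doc in documents for term in tokenize(doc)})
doc_vectors = [term_count_vector(doc, vocabulary) for doc in documents]
query_vector = term_count_vector("When was Python created?", vocabulary)

# Score each document by the dot product with the query vector (keyword overlap).
overlaps = [sum(q * d for q, d in zip(query_vector, vec)) for vec in doc_vectors]
best = overlaps.index(max(overlaps))

non_zero = sum(1 for value in query_vector if value)
print(f"Vocabulary size: {len(vocabulary)}, non-zero query components: {non_zero}")
print(f"Best match: {documents[best]}")
```

Most components of the query vector are zero because the query mentions only a few vocabulary terms, and the match is fully explainable: the winning document shares the most keywords with the query.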

### **Sparse retrieval methods**

Two popular methods for encoding text into sparse vectors are:
- **TF-IDF** (Term Frequency-Inverse Document Frequency) <br>
     Encodes documents using the words that make the document unique: Creates a sparse vector that measures a term's frequency in a document and rarity in other documents. This helps in identifying words that best represent the document's unique content.
- **BM25** (Best Matching 25)<br>
     Helps mitigate high-frequency words from saturating the encoding: BM25 is an improvement on TF-IDF that prevents high-frequency words from being over-emphasized in the encoding.
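The scoring idea behind BM25 can be sketched in a few lines of standard-library Python. This is one common formulation, shown under assumptions: the `bm25_scores` helper is hypothetical, `k1=1.5` and `b=0.75` are typical default choices, and library implementations may use a slightly different IDF term.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # The term-frequency part saturates as tf grows and is
            # normalised by document length relative to the average.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


chunks = [
    "Python was created by Guido van Rossum and released in 1991.",
    "Python is a popular language for machine learning (ML).",
    "The PyTorch library is a popular Python library for AI and ML.",
]

scores = bm25_scores("When was Python created?", chunks)
best_chunk = chunks[scores.index(max(scores))]
print(best_chunk)
```

The word "Python" appears in every chunk, so it contributes little to any score. The first chunk wins because it also contains "was" and "created", which are rare across the collection.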

**BM25 retrieval**

The `BM25Retriever` class can be used to create a retriever from documents or text, just like the retrievers we have already used. We can use the `.from_texts()` method to create the retriever from a list of strings. The `k` value sets the number of items returned by the retriever when invoked.


In [18]:
from langchain_community.retrievers import BM25Retriever

chunks = [
    "Python was created by Guido van Rossum and released in 1991.",
    "Python is a popular language for machine learning (ML).",
    "The PyTorch library is a popular Python library for AI and ML."
]

bm25_retriever = BM25Retriever.from_texts(chunks, k=3)

results = bm25_retriever.invoke("When was Python created?")
print("Most Relevant Document:")
print(results[0].page_content)

Most Relevant Document:
Python was created by Guido van Rossum and released in 1991.


Looking at all three statements again, we can see that BM25 returned the statement whose terms match the input and are also rare across the other statements.

**BM25 in RAG**

We'll create a RAG system to integrate a DataCamp blog post on RAG with an LLM. 

The first step is the same as before, but using the `.from_documents()` method as we're dealing with document chunks and not strings this time. Then, we use the same LCEL syntax as a standard dense retrieval RAG to integrate the retriever with a prompt template and LLM. Remember that `RunnablePassthrough` allows us to insert the input unchanged into the chain.

In [40]:
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_community.retrievers import BM25Retriever
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_experimental.text_splitter import SemanticChunker 
import os
import openai

openai.api_key = os.getenv('OPENAI_API_KEY')

html_loader = UnstructuredHTMLLoader('./datasets/what-is-rag-blog.html')

document = html_loader.load()

embeddings = OpenAIEmbeddings(api_key=openai.api_key, model="text-embedding-3-small")

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_amount=0.8,
    breakpoint_threshold_type="gradient"
)

chunks = semantic_splitter.split_documents(document)

retriever = BM25Retriever.from_documents(
    documents=chunks,
    k=5
)

prompt = ChatPromptTemplate.from_template("""
Use the following pieces of context to answer the question at the end.
Context: {context}
Question: {question}
""")

llm = ChatOpenAI(model='gpt-4o-mini', api_key=openai.api_key, temperature=0)


chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

result = chain.invoke("How can LLM hallucination impact a RAG application?")
print(result)

LLM hallucination can significantly impact a RAG (Retrieval-Augmented Generation) application by leading to the generation of inaccurate or misleading information. When an LLM produces content that is not grounded in the retrieved documents or factual data, it can create confusion and reduce the reliability of the insights derived from the application. This can be particularly problematic in contexts like market research, where accurate information is crucial for decision-making. To mitigate these issues, advanced techniques such as reranking and multi-step reasoning can be implemented to enhance the accuracy and relevance of the generated outputs, thereby addressing the challenges posed by hallucination.


## **4. Introduction to RAG evaluation**

Because our RAG architecture is made up of several processes, there are a few places where performance can be measured. 

<img src='./images/rag-eval.png' width=60%>

- We can evaluate the retrieval process to check if the retrieved documents are relevant to the query
- We can evaluate the generation process to see if the LLM hallucinated or misinterpreted the prompt
- We can evaluate the final output to measure the performance of the whole system.


### **Output accuracy: string evaluation**

To perform string evaluation, we need to define a prompt template and large language model to use for evaluation. The prompt template instructs the model to compare the strings and evaluate the model output for correctness, returning correct or incorrect. The model temperature is also set to zero to minimize variability.

```python
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Grading prompt: the model compares the predicted answer with the reference answer
prompt_template = """You are an expert professor specialized in grading students' answers to questions.
You are grading the following question:{query}
Here is the real answer:{answer}
You are grading the following predicted answer:{result}
Respond with CORRECT or INCORRECT:
Grade:"""


prompt = PromptTemplate(
    input_variables=["query", "answer", "result"],
    template=prompt_template
)

# Temperature 0 keeps the grading as deterministic as possible
eval_llm = ChatOpenAI(temperature=0, model="gpt-4o-mini", openai_api_key="...")
```

We initialize `LangChainStringEvaluator` from `LangSmith`, which is LangChain's platform for evaluating LLM applications. This evaluator first takes `"qa"`, which sets the evaluator to assess correctness, and also the LLM and prompt template to use. We then call the `.evaluate_strings()` method on the model prediction, reference answer, and input query to perform the evaluation.

```python
from langsmith.evaluation import LangChainStringEvaluator

qa_evaluator = LangChainStringEvaluator(
    "qa",
    config={
        "llm": eval_llm,
        "prompt": prompt  # the PromptTemplate defined above
    }
)

score = qa_evaluator.evaluator.evaluate_strings(
    prediction=predicted_answer,
    reference=ref_answer,
    input=query
)
```

A score of zero indicates that the predicted response was incorrect when compared to the reference answer.

```python
print(f"Score: {score}")
```

Output: <br>

```python
Score: {'reasoning': 'INCORRECT', 'value': 'INCORRECT', 'score': 0}
```

And we can see here that the model response was deemed incorrect, which makes sense on reviewing it again:

```python
query = "What are the main components of RAG architecture?"
predicted_answer = "Training and encoding"
ref_answer = "Retrieval and Generation"
```
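To make the link between the grader's verdict and the numeric score concrete, here is a small sketch of how a CORRECT or INCORRECT reply could be mapped to a score. The `grade_to_score` helper is hypothetical and only illustrates the idea; it is not how LangSmith parses the model output internally.

```python
def grade_to_score(grader_reply):
    """Map a textual verdict to a value/score pair (hypothetical helper)."""
    verdict = grader_reply.strip().upper()
    # Check INCORRECT first, because "INCORRECT" also contains "CORRECT".
    if verdict.startswith("INCORRECT"):
        return {"value": "INCORRECT", "score": 0}
    if verdict.startswith("CORRECT"):
        return {"value": "CORRECT", "score": 1}
    # Anything else is treated as an unparseable verdict.
    return {"value": None, "score": None}


print(grade_to_score("INCORRECT"))
print(grade_to_score(" correct "))
```

Under this mapping, the "Training and encoding" prediction above, graded INCORRECT, would receive a score of 0.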

### **RAGAS framework**

RAGAS was designed to evaluate both the retrieval and generation components of a RAG application. We will cover one metric for each component: faithfulness and context precision.

<img src='./images/ragas-score.png' width=50%>


- **Faithfulness**
  Assesses whether the generated output represents the retrieved documents well. It is calculated as:

  $$
  \text{Faithfulness} = \frac{\text{Number of claims that can be inferred from the context}}{\text{Total number of claims}}
  $$

  Because faithfulness is a proportion, _it is normalized to between zero and one_, where a higher score indicates greater faithfulness.
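As a quick worked example of the formula above, suppose an answer makes four claims and three of them can be inferred from the retrieved context. The `faithfulness_score` helper and the hand-written verdicts below are assumptions for illustration; in RAGAS the claims are extracted and checked with an LLM.

```python
def faithfulness_score(claim_supported):
    """Fraction of claims that can be inferred from the context."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Hand-labelled verdicts (assumed): True means the context supports the claim.
verdicts = [True, True, True, False]
example_faithfulness = faithfulness_score(verdicts)
print(f"Faithfulness: {example_faithfulness}")
```

Three supported claims out of four give a faithfulness of 0.75.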

Ragas integrates nicely with LangChain, and the first step involves defining the models for the evaluator to use: one for generation and another for embeddings. Next, we define an evaluation chain, passing it the faithfulness metric from ragas and the two models we defined.

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.integrations.langchain import EvaluatorChain
from ragas.metrics import faithfulness

llm = ChatOpenAI(model= "gpt-4o-mini", api_key="...")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key="...")

faithfulness_chain = EvaluatorChain(
    metric=faithfulness,
    llm=llm,
    embeddings=embeddings
)
```

To evaluate a model's response, we call the chain, passing it a dictionary with `"question"`, `"answer"`, and `"contexts"` keys. `"question"` is the query sent to the RAG application, `"answer"` is the response, and `"contexts"` are the document chunks available to the model. A perfect faithfulness score of one indicates that the model's response could be fully inferred from the context provided.

```python
eval_result = faithfulness_chain({
    "question": "How does the RAG model improve question answering with LLMs?",
    "answer": "The RAG model improves question answering by combining the retrieval of documents...",
    "contexts": [
        "The RAG model integrates document retrieval with LLMs by first retrieving relevant passages...",
        "By incorporating retrieval mechanisms, RAG leverages external knowledge sources, allowing the...",
        ]
})
print(eval_result)
```

Output:<br>
```python
'faithfulness': 1.0
```

### **Context precision**

Context precision measures how relevant the retrieved documents are to the query. 
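One way to picture the metric is as a rank-aware average: for every position that holds a relevant chunk, take the precision up to that position, then average those values. The sketch below assumes this rank-weighted definition, as described in the Ragas documentation. The `context_precision_at_k` helper and the hand-labelled relevance flags are hypothetical; in practice an LLM judges relevance.

```python
def context_precision_at_k(relevance):
    """Average of precision@k over the positions that hold a relevant chunk."""
    relevant_seen = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            weighted_sum += relevant_seen / k  # precision@k at this position
    return weighted_sum / relevant_seen if relevant_seen else 0.0


# Relevance of each retrieved chunk, in ranked order (assumed labels).
print(context_precision_at_k([True, True]))         # every chunk relevant
print(context_precision_at_k([True, False, True]))  # an irrelevant chunk ranked second
print(context_precision_at_k([False, False]))       # nothing relevant retrieved
```

Relevant chunks ranked near the top push the score toward one, while irrelevant chunks ranked above relevant ones pull it down.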

A context precision score closer to one means the retrieved context is highly relevant. The only change we need to make to the faithfulness evaluation chain is to import and use the `context_precision` metric instead.

```python
from ragas.metrics import context_precision

llm = ChatOpenAI(model="gpt-4o-mini", api_key="...")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key="...")

context_precision_chain = EvaluatorChain(
    metric=context_precision,
    llm=llm,
    embeddings=embeddings
)
```

The `context_precision_chain` similarly takes a dictionary with `"question"`, `"contexts"`, and `"ground_truth"` keys, representing the input query, the retrieved documents, and the ground truth document that should have been retrieved. Printing the results, we can see that we achieved a high context precision, indicating that the retrieval process is returning highly relevant documents.

```python
eval_result = context_precision_chain({
    "question": "How does the RAG model improve question answering with large language models?",
    "ground_truth": "The RAG model improves question answering by combining the retrieval of...",
    "contexts": [
        "The RAG model integrates document retrieval with LLMs by first retrieving...",
        "By incorporating retrieval mechanisms, RAG leverages external knowledge sources...",
    ]
})

print(f"Context Precision: {eval_result['context_precision']}")
```

Output:<br>
```python
Context Precision: 0.999999995
```