# RAG with Langchain

## What you will learn in this course 🧐🧐

Once you know how to manipulate Documents and store them in a VectorDB, you have everything you need to perform RAG. We will still use Langchain to do so. In this course, you will learn:

* How to query a VectorDB
* How to use Retrievers

## Demo Setup 

For this demo, you will need to: 

* Run a Weaviate Database 
```bash 
docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.27.0
```

* Populate the VectorDB with the code we wrote in the previous lesson. For a standard environment setup, you can use Docker:

```bash 
docker run -v $(pwd):/home/jovyan -p 8888:8888 jupyter/datascience-notebook
```

<Note type="note">

Yes, you will have two containers running at the same time. That's completely fine 👌

</Note>
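
Before populating the database, it can help to confirm that the Weaviate container is actually up. Here is a minimal sketch using only the standard library. It assumes Weaviate's readiness endpoint `/v1/.well-known/ready` on the HTTP port published above. The helper name `weaviate_is_ready` is made up for this example. If you call it from inside the Jupyter container, `localhost` will likely need to be replaced by `host.docker.internal`, as the comment in the setup code suggests.

```python
import urllib.request


def weaviate_is_ready(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if Weaviate answers its readiness probe with HTTP 200."""
    url = f"{base_url}/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, DNS failures and HTTP error statuses
        return False


# Usage (hypothetical): call this before running the population code
# print(weaviate_is_ready())
```

If this returns `False`, check that the first `docker run` command is still running before going further.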

Then follow the code below:

In [1]:
# install package
%pip install -Uqq langchain-weaviate
%pip install langchain langchain_mistralai -q
%pip install -qU langchain-community beautifulsoup4
%pip install -qU weaviate-client
%pip install sentence-transformers -q 
%pip install transformers -q

Note: you may need to restart the kernel to use updated packages.


In [None]:
from langchain_community.document_loaders import RecursiveUrlLoader
from langchain_weaviate.vectorstores import WeaviateVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings  # the `langchain.embeddings` import path is deprecated
from transformers import AutoTokenizer
from bs4 import BeautifulSoup
import weaviate

# Add a BeautifulSoup extractor
# This function reads the HTML fetched by our loader
# and returns it in a more readable form
def bs4_extractor(html: str) -> str:
    """Extract only the title and paragraphs from HTML content"""
    try:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')
        
        # Extract the title
        title = soup.title.string if soup.title else "No title found"
        
        # Extract all paragraphs
        paragraphs = [p.get_text() for p in soup.find_all('p')]
        
        # Combine title and paragraphs into a single string
        extracted_content = title + "\n" + "\n".join(paragraphs)
    
        return extracted_content
    
    except Exception as e:
        return f"An error occurred: {str(e)}"

# Instantiate a loader
loader = RecursiveUrlLoader(
    "https://starwars.fandom.com/wiki/Jedha", # Everything about Jedha
    max_depth=1, # How deep the crawler follows links (here we effectively don't follow any, to keep the amount of data small)
    use_async=False,
    extractor=bs4_extractor, # Function that turns the raw HTML into text (for example, you could write one that only extracts <table></table> elements)
    metadata_extractor=None, # Same idea as above, but for metadata
    timeout=10, # Maximum time in seconds before a timeout error is raised
    continue_on_failure=True, # Continue to crawl even if there are some parsing errors
    prevent_outside=True, # Prevent loading URLs that are not children of the root URL -> Good to prevent attacks
    # check out full documentation if you want to read about all arguments - https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html#langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.__init__
)

# Now we need to load the actual documents 
docs = loader.load()

# Here we use a pretrained tokenizer from Hugging Face, so chunk length is
# counted in tokens rather than raw characters
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

# Instantiate a splitter
# There are plenty of different splitters, see the resources below to learn more
# No chunk size is passed here, so the splitter's default is used
# (no trailing comma after the call, otherwise `splitter` becomes a tuple)
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer)

# Now create the splits
splitted_docs = splitter.split_documents(docs)

client = weaviate.connect_to_local(
    #host="host.docker.internal",  # Use host.docker.internal if you are running it inside a docker container
    port=8080,
    grpc_port=50051,
)

# Instantiate the embeddings model
embeddings = HuggingFaceEmbeddings()

# Now we can load our documents into our Database 
# Depending on the amount of data 
# The time necessary to execute the cell will vary
vectorstore = WeaviateVectorStore.from_documents(
    splitted_docs, 
    embeddings, 
    client=client, 
    by_text=False, 
    tenant="Wookieepedia", # Name of the tenant created inside the collection (the collection name itself is auto-generated, see the log below)
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2025-May-16 12:07 PM - langchain_weaviate.vectorstores - INFO - Tenant Wookieepedia does not exist in index LangChain_59d4f6a4963e4befa58a225f7ce05e4c. Creating tenant.


<Note type="note">

The above code is simply the full code from the previous lecture. Refer back to it if you want more details.

</Note>
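
As a reminder of what "splitting by tokens" means in the setup code, here is a minimal sketch of token-based chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: whitespace-separated words stand in for real tokens, and the helper `split_by_tokens` and its parameters are made up for this illustration.

```python
def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into chunks of at most `chunk_size` tokens, sharing `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    tokens = text.split()  # stand-in for a real tokenizer
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text
        if start + chunk_size >= len(tokens):
            break
    return chunks


chunks = split_by_tokens("one two three four five six seven", chunk_size=3, chunk_overlap=1)
print(chunks)
```

Each chunk repeats the last token of the previous one, so a sentence cut at a chunk boundary still keeps some of its context.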

## What are Retrievers 🫴

To query a VectorDB, you use what Langchain calls a **Retriever**: the component whose job is to *retrieve* the data relevant to a query from your database.

There are various algorithms behind retrievers. Some use unsupervised machine learning, others use LLMs, and others rely on simple word matching. You can find the list of all retrievers here:

* [All Langchain Retrievers](https://python.langchain.com/docs/integrations/retrievers/)
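
To make the idea concrete, here is a minimal sketch of what a similarity-based retriever does: turn the query and each document into a vector, score them, and return the top `k`. This is an illustration only, not Langchain's implementation. The "embedding" is a plain bag-of-words count, and the names `embed`, `cosine_similarity` and `retrieve` are made up for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


documents = [
    "Jedha was a small desert moon",
    "Kyber crystals power lightsabers",
    "The Holy City was a place of pilgrimage on Jedha",
]
print(retrieve("desert moon Jedha", documents, k=1))
```

A real retriever swaps the word counts for dense vectors produced by an embedding model and delegates the ranking to the VectorDB, but the retrieve-by-similarity logic is the same.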


You can use one of the retrievers listed above, but VectorDBs often come with pre-built retrievers that you can use as well! Let's see how that works right now:

In [4]:
# Retrieve Mistral API key from .env
from dotenv import load_dotenv

load_dotenv()

True

In [5]:
from langchain_mistralai import ChatMistralAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain import hub

llm = ChatMistralAI(model="mistral-large-latest")

# Build a retriever that returns the 2 chunks most similar to the question
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2, "tenant": "Wookieepedia"})

# Create prompt. 
# This can also be found at hub.pull("rlm/rag-prompt")
prompt = """
You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

Question: {question} 

Context: {context} 

Answer:
"""

# This is the basic chat prompt template
# You can then add a MessagesPlaceholder etc.
# to add memory to your LLM app!
# `from_template` wraps the string above in a single human message
prompt = ChatPromptTemplate.from_template(prompt)

# Helper function that joins all the documents returned by the retriever
# into one big string, which is then placed at {context} in the prompt above
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# The chain first receives a question from the user
# "context" is populated by `retriever`, which fetches the documents relevant to the question
# The retrieved documents then go straight to the `format_docs` function
# At the same time, "question" is passed unchanged to the next step of the chain (the `prompt`)
# This is done by `RunnablePassthrough`, whose purpose is to pass its input through the chain as-is
# Finally, the output is parsed as a string
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What was the initial name of Jedha?")

'The initial name of Jedha was NiJedha. Jedha, also known as the Pilgrim Moon, the Cold Moon, or the Kyber Heart, was formerly known as NiJedha. NiJedha was the Holy City of Jedha, a spiritual hub for many faiths.'
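
To see what the first steps of the chain do with the question, here is a plain-Python sketch of the same data flow without Langchain or an LLM. The names `Doc`, `fake_retriever`, `PROMPT` and `build_prompt` are made up for this illustration, and the retriever returns hard-coded documents instead of querying Weaviate.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a Langchain Document: only the text content."""
    page_content: str


def fake_retriever(question: str) -> list[Doc]:
    # A real retriever would query the VectorDB with the question
    return [Doc("Jedha was a desert moon."), Doc("It was home to the Holy City.")]


def format_docs(docs: list[Doc]) -> str:
    return "\n\n".join(doc.page_content for doc in docs)


PROMPT = "Question: {question}\n\nContext: {context}\n\nAnswer:"


def build_prompt(question: str) -> str:
    # Same shape as the dict at the start of the chain:
    # the question feeds both the retrieval branch and the pass-through branch
    inputs = {
        "context": format_docs(fake_retriever(question)),
        "question": question,
    }
    return PROMPT.format(**inputs)


print(build_prompt("What is Jedha?"))
```

In the real chain, the resulting prompt is then sent to the LLM and the reply is converted to a string by `StrOutputParser`.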

## Resources 📚📚

* [Weaviate - Langchain](https://python.langchain.com/docs/integrations/vectorstores/weaviate/)
* [`RecursiveUrlLoader`](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html#langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.__init__)
* [HuggingFace Tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer)
* [All Langchain Splitters](https://python.langchain.com/api_reference/text_splitters/index.html)