# Manipulate Vector Databases

## What you will learn in this course 🧐🧐

When you want to perform RAG, the hardest part is usually not getting an LLM to query your VectorDB. It is populating your VectorDB the right way. In this course, you will learn:

* How to create a VectorDB 
* How to populate a VectorDB 
* How to query a VectorDB 

## Create a VectorDB


Let's start by creating a Vector Database. There are a number of VectorDBs to choose from, but as of today we recommend [Weaviate](https://weaviate.io/). It provides a free cluster and is one of the most mature products available. 

To create your VectorDB:

* Create an account on Weaviate 
* Then create a *Sandbox* Cluster 
* Select the right region and you're good to go 🏎️💨

<Video video="https://vimeo.com/1022125822" />


## Demo Setup 

Now let's prepare our notebook for the demo. As usual, we advise you to run everything within a container. This time, however, it is very useful to mount a local volume to work with.

Here is the command you should run:

```bash 
docker run -v $(pwd):/home/jovyan -p 8888:8888 jupyter/datascience-notebook
```

Then make sure you have the packages below installed:

In [1]:
# install package
%pip install -Uqq langchain-weaviate
%pip install langchain langchain_mistralai -q
%pip install -qU langchain-community beautifulsoup4
%pip install -qU weaviate-client
%pip install sentence-transformers -q 
%pip install transformers

Note: you may need to restart the kernel to use updated packages.


Alright now we are ready to roll! 🤘 


## Load documents 

The first thing we need to do is load some documents to populate our database. It works in three steps:

1. You load your document (a CSV file, a Google Drive document, a PowerPoint...)
2. You extract the text contained in that document and split it into chunks 
3. You convert these chunks of text into Embeddings and store them in your VectorDB 
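
The three steps above can be sketched with the standard library only. This is a toy illustration, not the course pipeline: the `Record` class, the function names and the letter-frequency "embedding" are all assumptions made for the example, and a real project would use a Document Loader, a text splitter and an embedding model instead.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One row of the toy store: a chunk of text plus its vector."""
    text: str
    vector: list[float]


def load(raw_documents: list[str]) -> list[str]:
    # Step 1: a real loader would read files or web pages; here the text is given.
    return [doc.strip() for doc in raw_documents if doc.strip()]


def chunk(text: str, size: int = 40) -> list[str]:
    # Step 2: cut the text into pieces of at most `size` characters.
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(text: str) -> list[float]:
    # Step 3a: toy embedding (letter frequencies), NOT a real embedding model.
    letters = "abcdefghijklmnopqrstuvwxyz"
    lowered = text.lower()
    return [lowered.count(letter) / max(len(lowered), 1) for letter in letters]


def populate(raw_documents: list[str]) -> list[Record]:
    # Step 3b: store every chunk together with its vector.
    store: list[Record] = []
    for document in load(raw_documents):
        for piece in chunk(document):
            store.append(Record(text=piece, vector=embed(piece)))
    return store


store = populate(["Jedha was a small desert moon which orbited the planet NaJedha."])
print(len(store), "chunks stored, each with a", len(store[0].vector), "dimensional vector")
```

Each stored row keeps the original text next to its vector, which is what lets a query return readable passages later on.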

Let's tackle the first step 💪

### Choose a Document Loader 

One of the neat features of Langchain is its large number of integrations with other tools. Depending on your needs, you can choose among many types of Document Loaders. 

For this example, let's use `RecursiveUrlLoader`, which recursively scrapes all child links from a root URL and parses them into Documents. Let's say we want to know everything about Jedha. We could read everything about it on [Wookieepedia](https://starwars.fandom.com/wiki/Main_Page).
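
To get a feel for what "recursively scrape child links" means, here is a minimal sketch of a depth-limited crawl that ignores links outside the root URL. It is not the library's implementation: the `LINKS` dictionary is a made-up link graph standing in for real pages, and `is_child` and `crawl` are hypothetical helpers. The depth convention is chosen so that a depth of 1 reads only the root page.

```python
from urllib.parse import urlparse

# Hypothetical link graph standing in for real web pages.
ROOT = "https://example.org/wiki/Jedha"
LINKS = {
    "https://example.org/wiki/Jedha": [
        "https://example.org/wiki/Jedha/History",
        "https://other.example.com/ads",
    ],
    "https://example.org/wiki/Jedha/History": [
        "https://example.org/wiki/Jedha/History/Battle",
    ],
}


def is_child(url: str, root: str) -> bool:
    """True when `url` is on the same host and under the root path."""
    u, r = urlparse(url), urlparse(root)
    return u.netloc == r.netloc and u.path.startswith(r.path)


def crawl(root: str, max_depth: int) -> list[str]:
    """Visit pages level by level, stopping once `max_depth` levels are read."""
    visited: list[str] = []
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            if url in visited:
                continue
            visited.append(url)
            for link in LINKS.get(url, []):
                if is_child(link, root):  # skip links outside the root
                    next_frontier.append(link)
        frontier = next_frontier
    return visited


print(crawl(ROOT, max_depth=1))
print(crawl(ROOT, max_depth=2))
```

Raising the depth by one adds one more level of child pages, so the amount of data can grow quickly on a wiki.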

In [1]:
from langchain_community.document_loaders import RecursiveUrlLoader
from bs4 import BeautifulSoup

# Add a BeautifulSoup Extractor 
# This function reads the HTML fetched by our Loader
# and parses it into a more readable form
def bs4_extractor(html: str) -> str:
    """Extract only titles and paragraphs of an HTML content"""
    try:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')
        
        # Extract the title
        title = soup.title.string if soup.title else "No title found"
        
        # Extract all paragraphs
        paragraphs = [p.get_text() for p in soup.find_all('p')]
        
        # Combine title and paragraphs into a single string
        extracted_content = title + "\n" + "\n".join(paragraphs)
    
        return extracted_content
    
    except Exception as e:
        return f"An error occurred: {str(e)}"

# Instantiate a loader
loader = RecursiveUrlLoader(
    "https://starwars.fandom.com/wiki/Jedha", # Everything about Jedha
    max_depth=1, # How deep the crawler follows links (with 1, only the root page is loaded, which limits the amount of data)
    use_async=False,
    extractor=bs4_extractor, # Function that turns the raw HTML into text (if you only wanted <table></table> content, you could write a function for that)
    metadata_extractor=None, # Same idea, for metadata
    timeout=10, # Maximum time in seconds before a timeout error is raised
    continue_on_failure=True, # Keep crawling even if some pages fail to load or parse
    prevent_outside=True, # Do not load URLs that are not children of the root URL -> good to prevent attacks
    # check out full documentation if you want to read about all arguments - https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html#langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.__init__
)

# Now we need to load the actual documents 
docs = loader.load()
docs[:5]

[Document(metadata={'source': 'https://starwars.fandom.com/wiki/Jedha', 'content_type': 'text/html; charset=UTF-8', 'title': 'Jedha | Wookieepedia | Fandom', 'description': "Jedha, also known as the Pilgrim Moon, the Cold Moon, or the Kyber Heart, and formerly known as NiJedha, was a small desert moon which orbited the planet NaJedha. Located in the Jedha system of the galaxy's Mid Rim, the moon had a cold climate due to its lasting winter. The historical and...", 'language': 'en'}, page_content='Jedha | Wookieepedia | Fandom\nWookieepedia\nTo remove ads, create an account.Join Wookieepedia today!\n\nREAD MORE\n\n\nContent approaching.\n\nTales of Enlightenment: New Prospects, The High Republic: Convergence, Tales of Enlightenment: A Different Perspective, The High Republic Adventures (2022) 4, Peace and Unity, Star Wars: The High Republic (Marvel Comics 2022), The High Republic: The Battle of Jedha, The High Republic Adventures (2022) 5, The High Republic Adventures (2022) 6, The High
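
The `extractor` argument accepts any function that takes an HTML string and returns text. As a sketch of the table-only idea mentioned in the comments above, the function below keeps only table cells. It swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra dependencies. `TableTextParser` and `table_extractor` are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every table cell, one list per table row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_extractor(html: str) -> str:
    """Return only table content, cells separated by ' | ', one row per line."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    lines = [" | ".join(cell.strip() for cell in row) for row in parser.rows if row]
    return "\n".join(lines)


sample = (
    "<p>intro</p><table>"
    "<tr><th>Moon</th><th>Planet</th></tr>"
    "<tr><td>Jedha</td><td>NaJedha</td></tr>"
    "</table>"
)
print(table_extractor(sample))
```

A function like this could be passed as `extractor=table_extractor` in place of `bs4_extractor`.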

Great, we have content! 👏 Now we need to split that content into chunks. This is a best practice to make good use of your model's context window and avoid flooding it with useless information. We want the model to focus only on the small parts of the document that are relevant. 

That is why we need chunks. Let's see how to create them:

In [2]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

# Here we use a pretrained tokenizer offered by Hugging Face, so that chunk sizes
# are measured in tokens rather than characters
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

# Instantiate a splitter 
# There are plenty of different splitters, see below to learn more
# No chunk_size is passed here, so the splitter's default chunk size applies
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer)

# Now create splits 
splitted_docs = splitter.split_documents(docs)

# Compare the number of docs with the number of splitted_docs 
print("Successfully split documents 📃")
print(f"Initial number of documents:{len(docs)}\nNumber of split documents: {len(splitted_docs)}")

Successfully split documents 📃
Initial number of documents:1
Number of split documents: 4
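
To make the idea of chunks more concrete, here is a minimal sketch of a fixed-size splitter with overlap. It is a deliberately simplified technique: `RecursiveCharacterTextSplitter` additionally tries to cut on separators such as paragraph breaks, which this sketch does not do. The function name `split_with_overlap` is an assumption made for the example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters sharing `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap repeats the end of one chunk at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.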


<Note type="tip">

In this particular example, `RecursiveUrlLoader` has a method called `load_and_split()` that you could have used. Here we kept loading and splitting as two separate steps so that each one is easier to understand. 

If you want to learn more: 

* [`.load_and_split()`](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html#langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.load_and_split)

</Note>


<Note type="note" title="AutoTokenizer 🤔">

Why did we use the `.from_huggingface_tokenizer()` method, and how is it built? 

You could also use *simpler* yet powerful text splitters in Langchain. You can find the list here:

- [All Langchain Splitters](https://python.langchain.com/api_reference/text_splitters/index.html)

All these splitters are pretty powerful for structured formats (Markdown, Python, etc.), but less so for pure text (plain strings). With `.from_huggingface_tokenizer()`, the splitter measures chunk length in tokens, using a pretrained tokenizer, rather than in characters, which is closer to how a model counts text. That is why we used HuggingFace Tokenizers. 

If you want to learn more about HuggingFace Tokenizers, feel free to read this documentation:

* [HuggingFace Tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer)

</Note>
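
The effect of measuring length in tokens rather than characters can be sketched with a pluggable length function. This is an illustration under assumptions: `whitespace_token_count` is a rough regex-based stand-in, not the BERT tokenizer, and `pack_sentences` is a hypothetical greedy packer rather than a Langchain splitter.

```python
import re
from typing import Callable


def whitespace_token_count(text: str) -> int:
    # Stand-in tokenizer: counts words and punctuation marks, not BERT word pieces.
    return len(re.findall(r"\w+|[^\w\s]", text))


def pack_sentences(sentences: list[str], max_length: int,
                   length_function: Callable[[str], int]) -> list[str]:
    """Greedily group sentences so that each chunk stays within `max_length`."""
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and length_function(candidate) > max_length:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sentences = [
    "Jedha was a small desert moon.",
    "It orbited the planet NaJedha.",
    "Its climate was cold.",
]
# Same text, two ways of measuring length, two different sets of chunks.
by_characters = pack_sentences(sentences, max_length=40, length_function=len)
by_tokens = pack_sentences(sentences, max_length=15, length_function=whitespace_token_count)
print(len(by_characters), "chunks by characters,", len(by_tokens), "chunks by tokens")
```

Only the length function changes between the two calls, which is the same lever `.from_huggingface_tokenizer()` pulls on the splitter.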

## Load all these documents into our Database 

Alright, now let's load our documents into our VectorDB. If you use Weaviate Cloud, this is the moment to go back to your Weaviate account and grab your API Key and your DB URI 👇

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/get_weaviate_info.png)

There are two ways to use Weaviate: Weaviate Cloud or Weaviate on-premise. For our example we will use the latter, so we are going to run another container using the Weaviate image:

```bash
docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.27.0
```

In [None]:
import weaviate

client = weaviate.connect_to_local(
    #host="host.docker.internal",  # Use host.docker.internal if you are running it inside a docker container
    port=8080,
    grpc_port=50051,
)

# Verify that this is ready
print(client.is_ready())

True
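
Since the host differs depending on whether the notebook runs inside a container, one option is to read the connection settings from environment variables. The sketch below only builds the keyword arguments. The variable names `WEAVIATE_HOST`, `WEAVIATE_HTTP_PORT` and `WEAVIATE_GRPC_PORT` are hypothetical, as is the helper `local_connection_settings`.

```python
import os
from typing import Mapping, Optional


def local_connection_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for a local Weaviate connection from environment variables."""
    env = os.environ if env is None else env
    return {
        # The three variable names below are hypothetical, chosen for this example.
        "host": env.get("WEAVIATE_HOST", "localhost"),
        "port": int(env.get("WEAVIATE_HTTP_PORT", "8080")),
        "grpc_port": int(env.get("WEAVIATE_GRPC_PORT", "50051")),
    }


# Simulate a notebook running inside a Docker container.
settings = local_connection_settings({"WEAVIATE_HOST": "host.docker.internal"})
print(settings)
```

The resulting dictionary could then be unpacked into the call used above, as in `weaviate.connect_to_local(**settings)`.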


## Create Embeddings

Perfect, now that we are connected to the Database, the next thing we need to do is create Embeddings before we can actually send the documents to Weaviate. For that we can use `HuggingFaceEmbeddings()`, a free Embedding tool built on HuggingFace models 🤗

In [6]:
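# Import from langchain_community (installed above); the older langchain.embeddings path is deprecated
from langchain_community.embeddings import HuggingFaceEmbeddings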
embeddings = HuggingFaceEmbeddings()

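
An embedding is simply a list of numbers, and texts with similar meaning should end up with vectors that point in similar directions. A minimal sketch of cosine similarity, a common way to compare such vectors, is shown here. The three vectors are made up for illustration and do not come from a real model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors; real embeddings have hundreds of dimensions.
moon = [0.9, 0.1, 0.0]
planet = [0.8, 0.3, 0.1]
recipe = [0.0, 0.2, 0.9]

print("moon vs planet:", round(cosine_similarity(moon, planet), 3))
print("moon vs recipe:", round(cosine_similarity(moon, recipe), 3))
```

A score close to 1 means the two vectors point in almost the same direction, while a score near 0 means they have little in common.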

## Store documents into vector store

Alright, now for the final piece: storing these documents in the vector store. We will use the Langchain integration `WeaviateVectorStore`.

In [7]:
from langchain_weaviate.vectorstores import WeaviateVectorStore

# Now we can load our documents into our Database 
# Depending on the amount of data 
# The time necessary to execute the cell will vary
vectorstore = WeaviateVectorStore.from_documents(
    splitted_docs, 
    embeddings, 
    client=client, 
    by_text=False, 
    tenant="Wookieepedia", # Tenant name for multi-tenancy (the collection itself gets an auto-generated LangChain_... name)
)

vectorstore

2025-May-16 12:06 PM - langchain_weaviate.vectorstores - INFO - Tenant Wookieepedia does not exist in index LangChain_18453ab13f04419291a9ed975091ab30. Creating tenant.


<langchain_weaviate.vectorstores.WeaviateVectorStore at 0x300ae98b0>

And now we can test our vector store and retrieve relevant information about what we stored! 

In [8]:
query = "What was the initial name of Jedha?"
docs = vectorstore.similarity_search(
    query, 
    k=2,
    tenant="Wookieepedia"
)

# Print the content of each result
for i, doc in enumerate(docs):
    print(f"\n## DOCUMENT {i+1}:\n")
    print(doc.page_content)


## DOCUMENT 1:

Jedha | Wookieepedia | Fandom
Wookieepedia
To remove ads, create an account.Join Wookieepedia today!

READ MORE


Content approaching.

Tales of Enlightenment: New Prospects, The High Republic: Convergence, Tales of Enlightenment: A Different Perspective, The High Republic Adventures (2022) 4, Peace and Unity, Star Wars: The High Republic (Marvel Comics 2022), The High Republic: The Battle of Jedha, The High Republic Adventures (2022) 5, The High Republic Adventures (2022) 6, The High Republic Adventures (2022) 7, The High Republic: Path of Vengeance, The High Republic: Cataclysm, The High Republic: Quest for Planet X, Tales of Villainy: The Gaze Electric, Reign of the Empire: The Mask of Fear, Star Wars Jedi: Survivor, Guardians of the Whills, Shu-Torun Lives, Star Wars Book IX: The Ashes of Jedha, The Veteran, Alphabet Squadron, Galaxy's Edge 3, Dawn of Rebellion, Star Wars: Rogue One: The Ultimate Visual Guide, Star Wars: The High Republic: Chronicles of the Jedi, U
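
Conceptually, a similarity search scores the stored vectors against the query vector and returns the `k` best matches within the requested tenant. The sketch below does this by brute force in memory. It is a toy stand-in, not how Weaviate works internally: `ToyVectorStore` is a hypothetical class and the 2-dimensional vectors are made up.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database, partitioned by tenant."""

    def __init__(self):
        self._rows = {}  # tenant name -> list of (vector, text)

    def add(self, tenant, vector, text):
        self._rows.setdefault(tenant, []).append((vector, text))

    def similarity_search(self, query_vector, k, tenant):
        # Score every stored vector of this tenant, then keep the k best.
        scored = [(cosine(query_vector, vec), text) for vec, text in self._rows.get(tenant, [])]
        best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        return [text for _, text in best]


store = ToyVectorStore()
store.add("Wookieepedia", [0.9, 0.1], "Jedha was formerly known as NiJedha.")
store.add("Wookieepedia", [0.2, 0.9], "The moon had a cold climate.")
store.add("Wookieepedia", [0.7, 0.4], "Jedha orbited the planet NaJedha.")
store.add("OtherTenant", [1.0, 0.0], "Unrelated tenant data.")

results = store.similarity_search([1.0, 0.0], k=2, tenant="Wookieepedia")
for i, text in enumerate(results):
    print(f"DOCUMENT {i + 1}: {text}")
```

Data stored under another tenant never shows up in the results, which is the point of passing `tenant` both when storing and when searching.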

## Resources 📚📚

* [Weaviate - Langchain](https://python.langchain.com/docs/integrations/vectorstores/weaviate/)
* [`RecursiveUrlLoader`](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html#langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.__init__)
* [HuggingFace Tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer)
* [All Langchain Splitters](https://python.langchain.com/api_reference/text_splitters/index.html)