[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/llm-agent-frameworks/langchain/loading-data/langchain-simple-pdf-multitenant.ipynb)

# Multilingual RAG with per-tenant filtering across multiple PDFs, using LangChain and OpenAI

In [1]:
# let's install the packages we need
%pip install -Uq langchain-weaviate langchain-community > /dev/null 2>&1
%pip install -q langchain-openai tiktoken langchain pypdf > /dev/null 2>&1

Note: you may need to restart the kernel to use updated packages.


You must have a valid OpenAI API key in the `OPENAI_API_KEY` environment variable.
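Because the key is forwarded to OpenAI on every vectorization and generation call, it can help to fail early when the variable is missing. A minimal sketch, assuming a hypothetical helper named `require_env` (it is not part of Weaviate or LangChain):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


# Usage (commented out so the sketch runs without a real key):
# openai_key = require_env("OPENAI_API_KEY")
```

The returned value could then be passed in the `X-OpenAi-Api-Key` header instead of calling `os.environ.get` directly, which silently returns `None` when the variable is absent.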

In [2]:
import weaviate, os

client = weaviate.connect_to_embedded(
    headers={
        "X-OpenAi-Api-Key": os.environ.get("OPENAI_API_KEY"),  # Replace with your OpenAI key
    }
)

print("Client is Ready?", client.is_ready())

{"action":"startup","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"Feature flag LD integration disabled: could not locate WEAVIATE_LD_API_KEY env variable","time":"2025-12-10T20:54:21-03:00"}
{"action":"startup","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2025-12-10T20:54:21-03:00"}
{"action":"startup","auto_schema_enabled":{},"build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"auto schema enabled setting is set to \"\u0026{\u003cnil\u003e {{{} {0 0}} 0 0 {{} 0} {{} 0}} true}\"","time":"2025-12-10T20:54:21-03:00"}

Client is Ready? True


Let's check our client and server versions:

In [3]:
print(f"Client: {weaviate.__version__}, Server: {client.get_meta().get('version')}")

Client: 4.15.4, Server: 1.30.5


## Let's create our Collection beforehand

This ensures the collection is created with a vectorizer and a generative configuration.
Make sure the embedding model (and dimensions) configured on the collection match the ones passed to LangChain.
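One way to keep the two configurations from drifting apart is to define the embedding settings once and reuse them. A minimal sketch; the names `EMBEDDING_MODEL`, `EMBEDDING_DIMENSIONS`, and `embedding_settings` are illustrative and not part of either library:

```python
# Single source of truth for the embedding settings (values taken from this notebook).
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 512


def embedding_settings() -> dict:
    """Keyword arguments shared by the collection vectorizer and the LangChain embeddings."""
    return {"model": EMBEDDING_MODEL, "dimensions": EMBEDDING_DIMENSIONS}


# Both calls in this notebook take `model` and `dimensions`, so the same dict
# can be unpacked into each of them:
# wvc.config.Configure.Vectorizer.text2vec_openai(**embedding_settings())
# OpenAIEmbeddings(**embedding_settings())
```

If the collection and the LangChain embeddings used different models or dimensions, query vectors would not be comparable with the stored ones.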

In [4]:
from weaviate import classes as wvc
# clear this collection before creating it
client.collections.delete("WikipediaLangChainMT")
# make sure the collection uses the vectorizer we want
collection = client.collections.create(
    name="WikipediaLangChainMT",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small",
        dimensions=512
    ),
    generative_config=wvc.config.Configure.Generative.openai(
        model="gpt-4o-mini",
    ),
    multi_tenancy_config=wvc.config.Configure.multi_tenancy(
        enabled=True, auto_tenant_creation=True, auto_tenant_activation=True
    )
)

{"action":"load_all_shards","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"error","msg":"failed to load all shards: context canceled","time":"2025-12-10T20:54:25-03:00"}
{"action":"hnsw_prefill_cache_async","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2025-12-10T20:54:25-03:00","wait_for_cache_prefill":false}
{"build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"Completed loading shard collection1_9lo68dI5wYDu in 3.841917ms","time":"2025-12-10T20:54:25-03:00"}
{"action":"hnsw_vector_cache_prefill","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","count":3000,"index_id":"main","level":"info","limit":100000

Now we have a Weaviate client and a multi-tenant collection!
Let's read our two PDF files, [brazil-wikipedia-article-text.pdf](./brazil-wikipedia-article-text.pdf) and [netherlands-wikipedia-article-text.pdf](./netherlands-wikipedia-article-text.pdf).

Then we chunk them and ingest them using LangChain.

**Note:** each PDF file is ingested into its own tenant.
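The one-tenant-per-file mapping can be made explicit before ingesting. A minimal sketch, assuming a hypothetical naming convention in which the first dash-separated token of the file name is the tenant; `PDF_BY_TENANT` and `tenant_for` are illustrative names:

```python
from pathlib import Path

# Map each tenant name to the PDF that should be ingested into it.
PDF_BY_TENANT = {
    "brazil": "brazil-wikipedia-article-text.pdf",
    "netherlands": "netherlands-wikipedia-article-text.pdf",
}


def tenant_for(pdf_path: str) -> str:
    """Derive a tenant name from the first dash-separated token of the file name."""
    return Path(pdf_path).name.split("-", 1)[0].lower()


for tenant, pdf in PDF_BY_TENANT.items():
    # Sanity check: the naming convention and the explicit mapping agree.
    assert tenant_for(pdf) == tenant
    print(f"{pdf} -> tenant '{tenant}'")
```

In an ingestion loop, each `(tenant, pdf)` pair would be loaded, split, and passed to `WeaviateVectorStore.from_documents(..., tenant=tenant)`.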

In [5]:
collection.config.get(simple=False).multi_tenancy_config.enabled

True

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

from langchain_community.document_loaders import PyPDFLoader

from langchain_weaviate.vectorstores import WeaviateVectorStore

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)

# import first article
loader = PyPDFLoader("brazil-wikipedia-article-text.pdf", extract_images=False)
docs = loader.load_and_split(text_splitter)
print(f"GOT {len(docs)} chunks for Brazil")
db = WeaviateVectorStore.from_documents(docs, embeddings, client=client, index_name="WikipediaLangChainMT", tenant="brazil")


# import second article
loader = PyPDFLoader("netherlands-wikipedia-article-text.pdf", extract_images=False)
docs = loader.load_and_split(text_splitter)
print(f"GOT {len(docs)} chunks for Netherlands")
db = WeaviateVectorStore.from_documents(docs, embeddings, client=client, index_name="WikipediaLangChainMT", tenant="netherlands")

2025-Dec-10 08:54 PM - langchain_weaviate.vectorstores - INFO - Tenant brazil does not exist in index WikipediaLangChainMT. Creating tenant.


GOT 247 chunks for Brazil


{"action":"hnsw_prefill_cache_async","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2025-12-10T20:54:28-03:00","wait_for_cache_prefill":false}
{"build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"Created shard wikipedialangchainmt_brazil in 1.520125ms","time":"2025-12-10T20:54:28-03:00"}
{"action":"hnsw_vector_cache_prefill","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2025-12-10T20:54:28-03:00","took":37125}
2025-Dec-10 08:54 PM - langchain_weaviate.vectorstores - INFO - Tenant netherlands does not exist in index WikipediaLangChainMT. Creating tenant.

GOT 274 chunks for Netherlands


{"action":"hnsw_prefill_cache_async","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2025-12-10T20:54:32-03:00","wait_for_cache_prefill":false}
{"build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"Created shard wikipedialangchainmt_netherlands in 2.376625ms","time":"2025-12-10T20:54:32-03:00"}
{"action":"hnsw_vector_cache_prefill","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2025-12-10T20:54:32-03:00","took":47042}


In [7]:
collection = client.collections.use("WikipediaLangChainMT")

Let's count how many objects each tenant holds.

In [8]:
response = collection.with_tenant("brazil").aggregate.over_all(total_count=True)
print("Brazil Tenant:", response)
response = collection.with_tenant("netherlands").aggregate.over_all(total_count=True)
print("Netherlands Tenant:", response)

Brazil Tenant: AggregateReturn(properties={}, total_count=247)
Netherlands Tenant: AggregateReturn(properties={}, total_count=274)


Now, how many objects do we have per source?

In [9]:
response = collection.with_tenant("brazil").aggregate.over_all(group_by="source")
for group in response.groups:
    print(group.grouped_by.value, group.total_count)

brazil-wikipedia-article-text.pdf 247


LangChain added some metadata, such as `source` and `page`. Let's fetch one object to inspect it.

In [10]:
obj = collection.with_tenant("brazil").query.fetch_objects(limit=1, include_vector=True).objects[0]

In [11]:
obj.properties.keys()

dict_keys(['creationdate', 'page_label', 'text', 'title', 'producer', 'creator', 'page', 'source', 'total_pages'])

In [12]:
print(obj.properties.get("source"))
print(obj.properties.get("page"))
print(obj.properties.get("text"))

brazil-wikipedia-article-text.pdf
0.0
explorer Pedro Álvares Cabral, who claimed the discovered land for the Portuguese Empire. Brazil remained a Portuguese
colony until 1808 when the capital of the empire was transferred from Lisbon to Rio de Janeiro. In 1815, the colony was
elevated to the rank of kingdom upon the formation of the United Kingdom of Portugal, Brazil and the Algarves.


## Let's ask in French about content written in English
Let's run RAG directly, using only Weaviate.

In [13]:
# This is our query
query_text = "Quelle est la nourriture traditionnelle de ce pays?"
# This is our prompt
prompt = f"Answer the question and make sure to use all the provided context: {query_text}. Answer in English"
# let's get our collection scoped to a specific tenant
collection_tenant = collection.with_tenant("brazil")

query = collection_tenant.generate.near_text(
    query=query_text,
    auto_limit=2, # we want the first 2 semantic groups only
    grouped_task=prompt
)
print('Objects matched:', len(query.objects))
print("Text", query.generative.text)

Objects matched: 8
Text The traditional food of Brazil is diverse and varies by region, reflecting the country's mix of indigenous and immigrant populations. Some notable dishes include:

1. **Feijoada** - Considered the national dish of Brazil, it is a hearty stew of black beans with pork or beef.
2. **Farofa** - A dish made from toasted cassava flour, often served as a side.
3. **Moqueca** - A fish stew made with coconut milk, tomatoes, onions, and peppers.
4. **Acarajé** - A deep-fried ball made from black-eyed peas, typically filled with shrimp and served with spicy sauce.
5. **Pão de queijo** - Cheese bread made from cassava flour and cheese.
6. **Coxinha** - A popular snack that is a fried pastry filled with shredded chicken.
7. **Brigadeiros** - Chocolate fudge balls that are a common dessert.

Additionally, a typical Brazilian meal often includes rice and beans, accompanied by beef, salad, and fried eggs. The national beverage is coffee, and cachaça, a liquor made from sugar ca

## Objects used for this generative search

In [14]:
for obj in query.objects[0:5]:
    print("#### page:", obj.properties.get("page"), "####")
    print(obj.properties.get("text"))

#### page: 13.0 ####
flour (farofa). Fried potatoes, fried cassava, fried banana, fried meat and fried cheese are very often eaten in lunch and
served in most typical restaurants. Popular snacks are pastel (a fried pastry); coxinha (a variation of chicken croquete); pão
de queijo (cheese bread and cassava flour / tapioca); pamonha (corn and milk paste); esfirra (a variation of Lebanese
#### page: 13.0 ####
Cuisine
Brazilian cuisine varies greatly by region, reflecting the country's varying mix of indigenous and immigrant populations. This
has created a national cuisine marked by the preservation of regional differences. Examples are Feijoada, considered the
country's national dish; and regional foods such as beiju, feijão tropeiro, vatapá, moqueca, polenta (from Italian cuisine) and
#### page: 13.0 ####
acarajé (from African cuisine).
The national beverage is coffee and cachaça is Brazil's native liquor. Cachaça is distilled from sugar cane and is the main
ingredient in the national co

Because we used `collection.with_tenant("brazil")`, both the search and the generation are restricted to that tenant.

Let's query our other tenant.

In [15]:
# let's get our collection with a specific tenant
collection_tenant = collection.with_tenant("netherlands")

# and we will ask in Portuguese, but prompt it to answer in English
query_text = "Qual é a comida tradicional deste país?"
prompt = f"Answer the question and make sure to use all the provided context: {query_text}. Answer in English"

query = collection_tenant.generate.near_text(
    query=query_text,
    auto_limit=2, # we want the first 2 semantic groups only
    grouped_task=prompt
)

print(query.generative.text)

The traditional food of the Netherlands includes a variety of pastries and savory dishes. Notable examples are:

1. **Vlaai** - A type of pie from Limburg.
2. **Moorkop** and **Bossche Bol** - Typical pastries from Brabant.
3. **Worstenbroodje** - A roll with a sausage of ground beef, known as sausage bread.
4. **Kibbeling** - Battered white fish that has become a national fast food.
5. **Kruidkoek** - Spiced cakes, with varieties like Groninger koek.
6. **Fries roggebrood** - Hard textured rye bread.
7. **Stroopwafel** - A cookie filled with syrup, often made with a lot of butter and sugar.

The traditional alcoholic beverages include beer and **Jenever**, a type of gin. The Dutch diet traditionally consisted of potatoes, meat, and seasonal vegetables, reflecting the agricultural roots of the country.
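The two cells above differ only in the tenant and the question, so the pattern can be wrapped in a small function. A minimal sketch, assuming a hypothetical helper named `ask_tenant`; it uses only the calls already shown in this notebook (`with_tenant`, `generate.near_text`) and takes the collection as a parameter:

```python
def ask_tenant(collection, tenant: str, question: str, auto_limit: int = 2):
    """Run a grouped generative search restricted to a single tenant."""
    prompt = (
        "Answer the question and make sure to use all the provided context: "
        f"{question}. Answer in English"
    )
    # with_tenant() scopes both retrieval and generation to this tenant
    scoped = collection.with_tenant(tenant)
    return scoped.generate.near_text(
        query=question,
        auto_limit=auto_limit,
        grouped_task=prompt,
    )


# Usage (requires the connected `collection` from the cells above):
# result = ask_tenant(collection, "netherlands", "Qual é a comida tradicional deste país?")
# print(result.generative.text)
```

Passing the tenant as an argument keeps the scoping explicit at every call site, which makes it harder to query the wrong tenant by accident.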


## Using LangChain to query data and answer questions on specific tenants

Up until now, we used LangChain to ingest our data and queried Weaviate directly.

Now let's use LangChain for querying as well. As you may have noticed, ingesting data with LangChain returns a vector store.

We can reuse that vector store or instantiate a new one. Let's instantiate a new one that points at the existing collection.

**Note**: the tenant is passed at query time, either as an argument to the search call or through the retriever's search arguments.

In [16]:
from langchain_openai import OpenAIEmbeddings
from langchain_weaviate import WeaviateVectorStore

embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)
db = WeaviateVectorStore(embedding=embeddings, client=client, index_name="WikipediaLangChainMT", text_key="text")

### We can now search our data

In [17]:
# we can now do a similarity search on objects of a specific tenant
docs = db.similarity_search(
    query="traditional food",
    tenant="brazil",
    return_uuids=True,
    return_properties=["source", "title", "page" ],
    k=5
)
for doc in docs:
    print(doc.metadata)
    print(doc.page_content)
    print("###" * 5)

{'title': 'Brazil - Wikipedia Text Only, Convert to PDF', 'page': 13.0, 'source': 'brazil-wikipedia-article-text.pdf', 'uuid': '115d897d-ef3c-4018-b644-ec0a1a80eee8'}
flour (farofa). Fried potatoes, fried cassava, fried banana, fried meat and fried cheese are very often eaten in lunch and
served in most typical restaurants. Popular snacks are pastel (a fried pastry); coxinha (a variation of chicken croquete); pão
de queijo (cheese bread and cassava flour / tapioca); pamonha (corn and milk paste); esfirra (a variation of Lebanese
###############
{'title': 'Brazil - Wikipedia Text Only, Convert to PDF', 'page': 13.0, 'source': 'brazil-wikipedia-article-text.pdf', 'uuid': '88b85ab7-2e3e-4c48-9bb7-5c0de727758a'}
Cuisine
Brazilian cuisine varies greatly by region, reflecting the country's varying mix of indigenous and immigrant populations. This
has created a national cuisine marked by the preservation of regional differences. Examples are Feijoada, considered the
country's national dish; a

In [18]:
# now for netherlands
# we can now do a similarity search on objects of a specific tenant
docs = db.similarity_search(
    query="traditional food",
    tenant="netherlands",
    return_uuids=True,
    return_properties=["source", "title", "page" ],
    k=5
)
for doc in docs:
    print(doc.metadata)
    print(doc.page_content)
    print("###" * 5)

{'title': 'Netherlands - Wikipedia Text Only, Convert to PDF', 'page': 14.0, 'source': 'netherlands-wikipedia-article-text.pdf', 'uuid': 'a163a99f-7adb-4726-a6dc-8f705933381b'}
widely available and typical for the region. 
Kibbeling
, once a local delicacy consisting of small chunks of battered white fish, has
become a national fast food, just as lekkerbek.
The Southern Dutch cuisine consists of the cuisines of the Dutch provinces of North Brabant and Limburg and the Flemish Region in
###############
{'title': 'Netherlands - Wikipedia Text Only, Convert to PDF', 'page': 14.0, 'source': 'netherlands-wikipedia-article-text.pdf', 'uuid': 'dcb8d483-d412-47be-88e6-959d9fed0f21'}
amount of fish. The various dried sausages, belonging to the metworst-family of Dutch sausages are found throughout this region.
Also smoked sausages are common, of which (
Gelderse
) 
rookworst
 is the most renowned. Larger sausages are often eaten
alongside 
stamppot
, 
hutspot
 or 
zuurkool
 (sauerkraut); whereas

### Filter by a property
The `page` property is added automatically by LangChain.

See the Weaviate docs for more on combining [multiple operands](https://weaviate.io/developers/weaviate/api/graphql/filters#multiple-operands) and [nested filters](https://weaviate.io/developers/weaviate/search/filters#nested-filters).

In [19]:
# change the filter or the tenant below to get chunks from different pages / countries
from weaviate.classes.query import Filter

where_filter = Filter.by_property("page").greater_or_equal(15)
docs = db.similarity_search("traditional food", filters=where_filter, tenant="netherlands", k=3)
print(docs)

[Document(metadata={'creationdate': datetime.datetime(2023, 10, 31, 22, 1, 32, tzinfo=datetime.timezone.utc), 'page_label': '16', 'title': 'Netherlands - Wikipedia Text Only, Convert to PDF', 'producer': 'Qt 4.8.6', 'creator': 'wkhtmltopdf 0.12.2.1', 'page': 15.0, 'source': 'netherlands-wikipedia-article-text.pdf', 'total_pages': 16.0}, page_content='https://web.archive.org/web/20090919141403/http://www.volksk.....icle455140.ece/Prehistorische_akker_gevonden_bij_Swifterbant\nhttps://web.archive.org/web/20171010141442/https://openacces.....20Age%20a%20critical%20review%5B1%5D_Redacted.pdf?\nsequence=1\nhttps://openaccess.leidenuniv.nl/bitstream/handle/1887/19822.....20Age%20a%20critical%20review%5B1%5D_Redacted.pdf?\nsequence=1'), Document(metadata={'creationdate': datetime.datetime(2023, 10, 31, 22, 1, 32, tzinfo=datetime.timezone.utc), 'page_label': '16', 'title': 'Netherlands - Wikipedia Text Only, Convert to PDF', 'producer': 'Qt 4.8.6', 'creator': 'wkhtmltopdf 0.12.2.1', 'page': 15
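In the output above, `page` is `15.0` while `page_label` is `'16'`: the `page` property is a zero-based index, whereas `page_label` holds the printed page number. A small sketch of the conversion, using a hypothetical helper named `page_index`:

```python
def page_index(printed_page: int) -> int:
    """Convert a 1-based printed page number to the 0-based `page` value."""
    if printed_page < 1:
        raise ValueError("printed page numbers start at 1")
    return printed_page - 1


# Printed page 16 corresponds to page index 15
print(page_index(16))

# The index can then feed the filter used above, for example:
# where_filter = Filter.by_property("page").greater_or_equal(page_index(16))
```

With 16 pages in total, a filter on `page >= 15` therefore matches only the last page, which here is mostly reference URLs.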

### You can also do question answering

In [20]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

from weaviate.classes.query import Filter

# client = weaviate.connect_to_weaviate_cloud(...)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)
db = WeaviateVectorStore.from_documents([], embeddings, client=client, index_name="WikipediaLangChainMT")

# you can add your filters like this
where_filter = Filter.by_property("page").greater_or_equal(2)

# we want our retriever to filter the results and use a specific tenant
retriever = db.as_retriever(search_kwargs={
    "filters": where_filter, 
    "tenant": "brazil"
})

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

llm = ChatOpenAI(model="gpt-4o-mini")
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What is the traditional food of this country?"})
print(response["answer"])

The traditional food of Brazil includes Feijoada, which is considered the national dish, along with other regional specialties like moqueca, vatapá, and various fried snacks. A typical meal often consists of rice and beans accompanied by beef, salad, fried potatoes, and a fried egg. Desserts such as brigadeiros and bolo de rolo are also popular.
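The retriever above is pinned to one tenant through `search_kwargs`. To build the same chain for another tenant, the search arguments could be produced by a small function. A minimal sketch, assuming a hypothetical helper named `tenant_search_kwargs`; the `tenant` and `filters` keys are the ones used in the cell above, and `k` is the same result-count argument passed to `similarity_search` earlier:

```python
from typing import Any, Optional


def tenant_search_kwargs(tenant: str, filters: Optional[Any] = None, k: int = 4) -> dict:
    """Build the `search_kwargs` dict for a retriever pinned to one tenant."""
    kwargs: dict = {"tenant": tenant, "k": k}
    if filters is not None:
        # Only include the filter when one is given
        kwargs["filters"] = filters
    return kwargs


# Usage (requires the `db` vector store and `where_filter` from the cell above):
# retriever = db.as_retriever(search_kwargs=tenant_search_kwargs("netherlands", where_filter))
```

Because the tenant lives in the retriever's search arguments, each tenant needs its own retriever (and chain) instance.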


In [21]:
# let's close our connection
client.close()

{"action":"restapi_management","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"Shutting down... ","time":"2025-12-10T20:54:47-03:00","version":"1.30.5"}
{"action":"restapi_management","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:8079","time":"2025-12-10T20:54:47-03:00","version":"1.30.5"}
{"action":"telemetry_push","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"telemetry terminated","payload":"\u0026{MachineID:1836d2a8-efad-48d7-9c7e-dd6859dcb085 Type:TERMINATE Version:1.30.5 ObjectsCount:523 OS:darwin Arch:arm64 UsedModules:[generative-openai text2vec-openai] CollectionsCount:3}","time":"2025-12-10T20:54:48-03:00"}