[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/llm-agent-frameworks/langchain/loading-data/langchain-simple-pdf.ipynb)

# Multilingual RAG with filtering across multiple PDFs, using LangChain and OpenAI

In [1]:
# let's install the packages we need
%pip install -Uqq langchain-weaviate langchain-community
%pip install langchain-openai tiktoken langchain pypdf

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-openai 0.3.27 requires langchain-core<1.0.0,>=0.3.66, but you have langchain-core 1.1.3 which is incompatible.
langchain 0.3.26 requires langchain-core<1.0.0,>=0.3.66, but you have langchain-core 1.1.3 which is incompatible.
langchain 0.3.26 requires langchain-text-splitters<1.0.0,>=0.3.8, but you have langchain-text-splitters 1.0.0 which is incompatible.
Note: you may need to restart the kernel to use updated packages.
Collecting langchain-core<1.0.0,>=0.3.66 (from langchain-openai)
  Using cached langchain_core-0.3.80-py3-none-any.whl.metadata (3.2 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.8 (from langchain)
  Using cached langchain_text_splitters-0.3.11-py3-none-any.whl.metadata (1.8 kB)
Using cached langchain_core-0.3.80-py3-none-any.whl (450 kB)
Using cached langchain_tex

You must have a valid OpenAI API key in the `OPENAI_API_KEY` environment variable.
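A small guard can make a missing key fail early with a clear message, before the embedded server starts. This is only a minimal sketch: `require_env` is a hypothetical helper, not part of Weaviate or LangChain.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or raise a clear error.

    `environ` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


# Usage in this notebook (commented out so the sketch has no side effects):
# openai_key = require_env("OPENAI_API_KEY")
```

The returned value could then be passed in the `headers` argument when connecting.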

In [2]:
import os

import weaviate

client = weaviate.connect_to_embedded(
    headers={
        "X-OpenAi-Api-Key": os.environ.get("OPENAI_API_KEY"),  # your OpenAI key, read from the environment
    }
)

print("Client is Ready?", client.is_ready())

{"action":"startup","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"Feature flag LD integration disabled: could not locate WEAVIATE_LD_API_KEY env variable","time":"2025-12-10T20:41:44-03:00"}
{"action":"startup","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2025-12-10T20:41:44-03:00"}
{"action":"startup","auto_schema_enabled":{},"build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"auto schema enabled setting is set to \"\u0026{\u003cnil\u003e {{{} {0 0}} 0 0 {{} 0} {{} 0}} true}\"","time":"2025-12-10T20:41:44-03:00"}
{"build_git_commi

Client is Ready? True


Let's check our client and server versions:

In [3]:
print(f"Client: {weaviate.__version__}, Server: {client.get_meta().get('version')}")

Client: 4.15.4, Server: 1.30.5


## Let's create our Collection beforehand

This ensures the collection is created with a vectorizer and a generative config.
Make sure to use the same embedding model (and dimensions) when creating the collection and when passing the embeddings to LangChain.

In [4]:
from weaviate import classes as wvc
# clear this collection before creating it
client.collections.delete("WikipediaLangChain")
# let's make sure the vectorizer is the one we want
collection = client.collections.create(
    name="WikipediaLangChain",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small",
        dimensions=512
    ),
    generative_config=wvc.config.Configure.Generative.openai(
        model="gpt-4o-mini",
    ),
)

{"action":"load_all_shards","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"error","msg":"failed to load all shards: context canceled","time":"2025-12-10T20:41:46-03:00"}
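One way to keep the collection and LangChain in agreement is to define the embedding settings once and reuse them on both sides. The sketch below is an assumption about how you might organize this, not something the notebook requires. The constant and function names are hypothetical, and the values are the ones used in this notebook.

```python
# Single source of truth for the embedding settings (values taken from this notebook).
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 512


def embedding_settings():
    """Keyword arguments shared by the collection vectorizer and the LangChain embedder."""
    return {"model": EMBEDDING_MODEL, "dimensions": EMBEDDING_DIMENSIONS}


def check_same_vector_space(collection_settings, query_settings):
    """Raise if the two sides would embed text with different settings."""
    if collection_settings != query_settings:
        raise ValueError(
            f"Embedding settings differ: {collection_settings} != {query_settings}"
        )


check_same_vector_space(embedding_settings(), embedding_settings())  # passes silently
```

Both `text2vec_openai(...)` and `OpenAIEmbeddings(...)` take `model` and `dimensions` in this notebook, so the same dictionary could be unpacked into each call with `**embedding_settings()`.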


Now we have a Weaviate client and an empty collection!
Let's read our two PDF files, [brazil-wikipedia-article-text.pdf](./brazil-wikipedia-article-text.pdf) and [netherlands-wikipedia-article-text.pdf](./netherlands-wikipedia-article-text.pdf).

Then we chunk them and ingest them using LangChain.

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

from langchain_community.document_loaders import PyPDFLoader

from langchain_weaviate.vectorstores import WeaviateVectorStore

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)

# import first article
loader = PyPDFLoader("brazil-wikipedia-article-text.pdf", extract_images=False)
docs = loader.load_and_split(text_splitter)
print(f"GOT {len(docs)} chunks for Brazil")
db = WeaviateVectorStore.from_documents(docs, embeddings, client=client, index_name="WikipediaLangChain")


# import second article
loader = PyPDFLoader("netherlands-wikipedia-article-text.pdf", extract_images=False)
docs = loader.load_and_split(text_splitter)
print(f"GOT {len(docs)} chunks for Netherlands")
db = WeaviateVectorStore.from_documents(docs, embeddings, client=client, index_name="WikipediaLangChain")

{"action":"hnsw_prefill_cache_async","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2025-12-10T20:41:47-03:00","wait_for_cache_prefill":false}
{"build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"Completed loading shard collection1_9lo68dI5wYDu in 1.971417ms","time":"2025-12-10T20:41:47-03:00"}
{"action":"hnsw_vector_cache_prefill","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","count":3000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2025-12-10T20:41:47-03:00","took":133208}
{"action":"telemetry_push","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","

GOT 247 chunks for Brazil
GOT 274 chunks for Netherlands
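To build some intuition for `chunk_size=400` and `chunk_overlap=50`, here is a deliberately simplified sketch of overlapping chunks. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraphs and sentences. It only illustrates how a fixed window and an overlap interact. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=400, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# A synthetic 1000-character string stands in for a PDF page.
sample = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])  # 3 [400, 400, 300]
print(chunks[0][-50:] == chunks[1][:50])      # True: neighbours share 50 characters
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.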


Let's first get a handle to our collection.

In [6]:
collection = client.collections.use("WikipediaLangChain")

Let's count how many objects we have in total.

In [7]:
response = collection.aggregate.over_all(total_count=True)
print(response)

AggregateReturn(properties={}, total_count=521)


Now, how many objects do we have per source?

In [8]:
response = collection.aggregate.over_all(group_by="source")
for group in response.groups:
    print(group.grouped_by.value, group.total_count)

netherlands-wikipedia-article-text.pdf 274
brazil-wikipedia-article-text.pdf 247
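Conceptually, the grouped aggregation above is a count per distinct `source` value. The sketch below shows the same idea in plain Python on a handful of made-up objects. The data is hypothetical and far smaller than the real collection.

```python
from collections import Counter

# Hypothetical stand-ins for stored objects; only the `source` property matters here.
objects = [
    {"source": "brazil-wikipedia-article-text.pdf", "page": 0},
    {"source": "brazil-wikipedia-article-text.pdf", "page": 1},
    {"source": "netherlands-wikipedia-article-text.pdf", "page": 0},
    {"source": "netherlands-wikipedia-article-text.pdf", "page": 1},
    {"source": "netherlands-wikipedia-article-text.pdf", "page": 2},
]

# Group by `source` and count the members of each group.
counts = Counter(obj["source"] for obj in objects)
for source, total in counts.most_common():
    print(source, total)
```

The per-group counts always add up to the total count, which is a quick sanity check after ingestion (247 + 274 = 521 in this notebook).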


LangChain added some metadata, such as `source` and `page`. Let's get one object.

In [9]:
# named `obj` to avoid shadowing Python's built-in `object`
obj = collection.query.fetch_objects(limit=1, include_vector=True).objects[0]

In [10]:
obj.properties.keys()

dict_keys(['creationdate', 'page_label', 'total_pages', 'source', 'page', 'title', 'creator', 'text', 'producer'])

In [11]:
print(obj.properties.get("source"))
print(obj.properties.get("page"))
print(obj.properties.get("text"))

netherlands-wikipedia-article-text.pdf
2.0
"Frisian freedom"), which resented the imposition of the feudal system.
Around 1000 AD, due to several agricultural developments, the economy started to develop at a fast pace, and the higher
productivity allowed workers to farm more land or become tradesmen. Towns grew around monasteries and castles, and a


## Let's ask in French about content in English
Let's do RAG directly, using only Weaviate.

In [12]:
# This is our query
query = "Quelle est la nourriture traditionnelle de ce pays?"
# This is our prompt
prompt = f"Answer the question and make sure to use all the provided context: {query}. Answer in English"
# let's filter the search so that only this specific file is used
source_file = "brazil-wikipedia-article-text.pdf"
#source_file = "netherlands-wikipedia-article-text.pdf"

query = collection.generate.near_text(
    query=query,
    filters=wvc.query.Filter.by_property("source").equal(source_file),
    auto_limit=2, # we want the first 2 semantic groups only
    grouped_task=prompt
)
print('Objects matched:', len(query.objects))
print("Text", query.generative.text)

Objects matched: 8
Text The traditional food of Brazil is diverse and varies by region, reflecting the country's mix of indigenous and immigrant populations. Some notable examples include:

- **Feijoada**: Considered the national dish of Brazil, it is a hearty stew of black beans with pork or beef.
- **Beiju**: A type of tapioca pancake.
- **Feijão Tropeiro**: A dish made with beans, eggs, and manioc flour.
- **Vatapá**: A creamy dish made from bread, shrimp, coconut milk, and peanuts.
- **Moqueca**: A fish stew made with coconut milk and palm oil.

Common meals often consist of rice and beans, typically served with beef, salad, french fries, and a fried egg. Popular snacks include **pastel** (fried pastry), **coxinha** (chicken croquette), and **pão de queijo** (cheese bread). Desserts such as **brigadeiros** (chocolate fudge balls) and **bolo de rolo** (roll cake with guava paste) are also popular.

The national beverage is coffee, and **cachaça**, a liquor made from sugar cane, is t

## Objects used for this generative search

In [13]:
for obj in query.objects[0:5]:
    print("#### page:", obj.properties.get("page"), "####")
    print(obj.properties.get("text"))

#### page: 13.0 ####
Cuisine
Brazilian cuisine varies greatly by region, reflecting the country's varying mix of indigenous and immigrant populations. This
has created a national cuisine marked by the preservation of regional differences. Examples are Feijoada, considered the
country's national dish; and regional foods such as beiju, feijão tropeiro, vatapá, moqueca, polenta (from Italian cuisine) and
#### page: 13.0 ####
flour (farofa). Fried potatoes, fried cassava, fried banana, fried meat and fried cheese are very often eaten in lunch and
served in most typical restaurants. Popular snacks are pastel (a fried pastry); coxinha (a variation of chicken croquete); pão
de queijo (cheese bread and cassava flour / tapioca); pamonha (corn and milk paste); esfirra (a variation of Lebanese
#### page: 13.0 ####
acarajé (from African cuisine).
The national beverage is coffee and cachaça is Brazil's native liquor. Cachaça is distilled from sugar cane and is the main
ingredient in the national co
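The `auto_limit=2` argument in the query above asks Weaviate to keep results only up to the second "jump" in distance, rather than a fixed number of objects. The sketch below is a simplified illustration of that idea and not Weaviate's actual algorithm. The fixed `jump` threshold and the distances are made up for the example.

```python
def autocut(distances, groups, jump=0.05):
    """Keep results up to the `groups`-th jump in a sorted list of distances.

    `jump` is a hypothetical fixed threshold; Weaviate detects jumps
    differently, so treat this only as an illustration of the idea.
    """
    if groups < 1:
        raise ValueError("groups must be >= 1")
    jumps_seen = 0
    for i in range(1, len(distances)):
        if distances[i] - distances[i - 1] > jump:
            jumps_seen += 1
            if jumps_seen == groups:
                return distances[:i]  # cut right before the group after the jump
    return list(distances)  # fewer jumps than requested: keep everything


# Made-up distances, sorted ascending, with two clear jumps.
distances = [0.40, 0.41, 0.42, 0.55, 0.56, 0.57, 0.58, 0.59, 0.80, 0.81]
print(len(autocut(distances, groups=1)))  # 3
print(len(autocut(distances, groups=2)))  # 8
```

This is why the number of matched objects is not known in advance: it depends on where the distances cluster for a given query.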

Note that we used a filter, so the content is searched, and the answer is generated, only from that specific PDF.
Let's change the filter to the second PDF file.

In [14]:
# We can filter again, this time for the Netherlands
source_file = "netherlands-wikipedia-article-text.pdf"
# and we will ask in Portuguese; the prompt below asks for the answer in English
query = "Qual é a comida tradicional deste país?"

prompt = f"Answer the question and make sure to use all the provided context: {query}. Answer in English"

# now generating the answer using Wikipedia

query = collection.generate.near_text(
    query=query,
    filters=wvc.query.Filter.by_property("source").equal(source_file),
    auto_limit=2, # we want the first 2 semantic groups only
    grouped_task=prompt
)

print(query.generative.text)

The traditional food of the Netherlands includes several notable dishes. One popular item is **kibbeling**, which consists of small chunks of battered white fish and has become a national fast food. Another traditional dish is **lekkerbek**. The typical Dutch dinner traditionally consists of potatoes, a portion of meat, and seasonal vegetables. 

In terms of pastries, the **Vlaai** from Limburg and the **Moorkop** and **Bossche Bol** from Brabant are well-known. Additionally, **worstenbroodje**, which is a roll with a sausage of ground beef, is a popular savory pastry. 

For sweets, **stroopwafel** is a famous cookie that contains a lot of butter and sugar, often filled with something like almond paste, known as **gevulde koek**. The traditional alcoholic beverages include beer and **Jenever**.


And of course, we can use different filters to get different content for our questions.

In [15]:
# We can also filter on multiple sources
query = "What is a common cultural aspect of those two countries?"
# now generating the answer using Wikipedia
source_files = ["netherlands-wikipedia-article-text.pdf", "brazil-wikipedia-article-text.pdf"]
prompt = f"Answer the question and make sure to use all the provided context: {query}. Answer in English"

query = collection.generate.near_text(
    query=query,
    filters=wvc.query.Filter.by_property("source").contains_any(source_files),
    auto_limit=3,
    grouped_task=prompt
)

print(query.generative.text)

A common cultural aspect of Brazil and the Netherlands is their rich and diverse culinary traditions, which reflect a blend of indigenous and immigrant influences. In Brazil, the cuisine varies greatly by region, showcasing a mix of indigenous ingredients and dishes influenced by African, Portuguese, and other European cultures, with Feijoada being a notable national dish. Similarly, the Netherlands has a culinary heritage that includes traditional foods like cheese and Dutch pastries, as well as influences from its colonial past, particularly from Indonesia and Suriname. Both countries celebrate their unique culinary identities, which are shaped by their historical interactions and cultural exchanges.
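The two filter operators used so far, `equal` and `contains_any`, can be thought of as predicates over an object's properties. The sketch below mimics them in plain Python for a single-valued property like `source`. It ignores tokenization and other server-side details, and the helper functions are hypothetical, not the Weaviate client API.

```python
def equal(prop, value):
    """Predicate: the property equals exactly one value."""
    return lambda properties: properties.get(prop) == value


def contains_any(prop, values):
    """Predicate: the property matches at least one of the given values."""
    allowed = set(values)
    return lambda properties: properties.get(prop) in allowed


chunks = [
    {"source": "brazil-wikipedia-article-text.pdf", "page": 13},
    {"source": "netherlands-wikipedia-article-text.pdf", "page": 14},
    {"source": "some-other-file.pdf", "page": 1},  # hypothetical third file
]

is_brazil = equal("source", "brazil-wikipedia-article-text.pdf")
is_either = contains_any(
    "source",
    ["netherlands-wikipedia-article-text.pdf", "brazil-wikipedia-article-text.pdf"],
)

only_brazil = [c for c in chunks if is_brazil(c)]
both_countries = [c for c in chunks if is_either(c)]
print(len(only_brazil), len(both_countries))  # 1 2
```

With only two files in the collection, `contains_any` over both sources selects everything. It becomes useful once more documents are ingested.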


# Using LangChain to query data and answer questions

Up until now, we used LangChain to ingest our data, and we queried Weaviate directly.

Now, let's also use LangChain to query. As you may have noticed, LangChain returns a vector store after ingesting our data.

We can reuse that vector store or initiate a new one. Let's initiate a new one that points at the existing collection.

In [16]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)
db = WeaviateVectorStore(embedding=embeddings, client=client, index_name="WikipediaLangChain", text_key="text")

### We can now search our data

In [17]:
# we can now do a similarity search on all objects
docs = db.similarity_search(
    query="traditional food",
    return_uuids=True,
    return_properties=["source", "title", "page"],
    k=5
)
for doc in docs:
    print(doc.metadata)
    print(doc.page_content)
    print("###" * 5)

{'title': 'Netherlands - Wikipedia Text Only, Convert to PDF', 'source': 'netherlands-wikipedia-article-text.pdf', 'page': 14.0, 'uuid': 'a5a9b02b-9e59-4d6a-aac1-71bc0775d8cb'}
widely available and typical for the region. 
Kibbeling
, once a local delicacy consisting of small chunks of battered white fish, has
become a national fast food, just as lekkerbek.
The Southern Dutch cuisine consists of the cuisines of the Dutch provinces of North Brabant and Limburg and the Flemish Region in
###############
{'title': 'Netherlands - Wikipedia Text Only, Convert to PDF', 'source': 'netherlands-wikipedia-article-text.pdf', 'page': 14.0, 'uuid': '40651f32-4456-4737-8bce-97fc149f95c7'}
amount of fish. The various dried sausages, belonging to the metworst-family of Dutch sausages are found throughout this region.
Also smoked sausages are common, of which (
Gelderse
) 
rookworst
 is the most renowned. Larger sausages are often eaten
alongside 
stamppot
, 
hutspot
 or 
zuurkool
 (sauerkraut); whereas

### Filter by a property
The property `source` is automatically added by LangChain.

Read more on how to combine [multiple operands](https://weaviate.io/developers/weaviate/api/graphql/filters#multiple-operands) and [nested filters](https://weaviate.io/developers/weaviate/search/filters#nested-filters).

In [18]:
# change below to get chunks from a different file / country
source_file = "brazil-wikipedia-article-text.pdf"
#source_file = "netherlands-wikipedia-article-text.pdf"
where_filter = wvc.query.Filter.by_property("source").equal(source_file)
docs = db.similarity_search("traditional food", filters=where_filter, k=3)
print(docs)

[Document(metadata={'creationdate': datetime.datetime(2023, 10, 31, 22, 3, 6, tzinfo=datetime.timezone.utc), 'page_label': '14', 'total_pages': 16.0, 'source': 'brazil-wikipedia-article-text.pdf', 'page': 13.0, 'title': 'Brazil - Wikipedia Text Only, Convert to PDF', 'creator': 'wkhtmltopdf 0.12.2.1', 'producer': 'Qt 4.8.6'}, page_content='flour (farofa). Fried potatoes, fried cassava, fried banana, fried meat and fried cheese are very often eaten in lunch and\nserved in most typical restaurants. Popular snacks are pastel (a fried pastry); coxinha (a variation of chicken croquete); pão\nde queijo (cheese bread and cassava flour / tapioca); pamonha (corn and milk paste); esfirra (a variation of Lebanese'), Document(metadata={'producer': 'Qt 4.8.6', 'page_label': '14', 'total_pages': 16.0, 'source': 'brazil-wikipedia-article-text.pdf', 'page': 13.0, 'title': 'Brazil - Wikipedia Text Only, Convert to PDF', 'creator': 'wkhtmltopdf 0.12.2.1', 'creationdate': datetime.datetime(2023, 10, 31, 
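A filtered similarity search can be pictured as two steps: keep only the objects that pass the filter, then rank what is left by vector similarity and take the top `k`. The brute-force sketch below illustrates that on toy 2-dimensional vectors. Weaviate itself uses an approximate HNSW index rather than scanning everything, so this is a mental model and not its implementation. All names and vectors here are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_top_k(query_vector, items, predicate, k=3):
    """Apply the filter first, then rank the survivors by similarity."""
    candidates = [item for item in items if predicate(item)]
    candidates.sort(
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return candidates[:k]


# Toy data: real embeddings in this notebook have 512 dimensions.
items = [
    {"source": "brazil-wikipedia-article-text.pdf", "text": "feijoada", "vector": [1.0, 0.0]},
    {"source": "brazil-wikipedia-article-text.pdf", "text": "geography", "vector": [0.0, 1.0]},
    {"source": "netherlands-wikipedia-article-text.pdf", "text": "stamppot", "vector": [0.9, 0.1]},
]

query_vector = [1.0, 0.1]  # pretend this is the embedding of "traditional food"
hits = filtered_top_k(
    query_vector,
    items,
    predicate=lambda item: item["source"] == "brazil-wikipedia-article-text.pdf",
    k=1,
)
print([hit["text"] for hit in hits])  # ['feijoada']
```

Note that the Dutch chunk is close to the query too, but the filter removes it before ranking.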

### You can also do question answering

In [19]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

from weaviate.classes.query import Filter

# client = weaviate.connect_to_weaviate_cloud(...)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)
db = WeaviateVectorStore.from_documents([], embeddings, client=client, index_name="WikipediaLangChain")

source_file = "brazil-wikipedia-article-text.pdf"
#source_file = "netherlands-wikipedia-article-text.pdf"
where_filter = Filter.by_property("source").equal(source_file)

# we want our retriever to filter the results
retriever = db.as_retriever(search_kwargs={"filters": where_filter})

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

llm = ChatOpenAI(model="gpt-4o-mini")
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What is the traditional food of this country?"})
print(response["answer"])



Brazil's traditional food varies by region, with notable dishes including Feijoada, beiju, and vatapá. A typical meal often consists of rice and beans served with beef, salad, and fried eggs. Popular snacks include coxinha and pão de queijo.
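The "stuff" strategy in the chain above simply places all retrieved chunks into the `{context}` slot of the system prompt. The sketch below approximates that step in plain Python. It is a rough picture, not LangChain's implementation. `stuff_prompt` is a hypothetical helper, and the blank-line separator and shortened template are assumptions.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know."
    "\n\n"
    "{context}"
)


def stuff_prompt(system_template, documents, question, separator="\n\n"):
    """Join all retrieved chunks and insert them into the prompt's context slot."""
    context = separator.join(documents)
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]


# Hypothetical retrieved chunks (shortened from the Brazil article).
retrieved = [
    "Examples are Feijoada, considered the country's national dish.",
    "The national beverage is coffee and cachaça is Brazil's native liquor.",
]

messages = stuff_prompt(
    SYSTEM_TEMPLATE, retrieved, "What is the traditional food of this country?"
)
for role, content in messages:
    print(f"[{role}]\n{content}\n")
```

Because every chunk goes into a single prompt, the number and size of retrieved chunks directly affect prompt length, which is one reason to keep chunks small and to filter by `source`.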


In [20]:
# let's close our connection
client.close()

{"action":"restapi_management","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"Shutting down... ","time":"2025-12-10T20:42:17-03:00","version":"1.30.5"}
{"action":"restapi_management","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:8079","time":"2025-12-10T20:42:17-03:00","version":"1.30.5"}
{"action":"telemetry_push","build_git_commit":"62dcafac32","build_go_version":"go1.24.3","build_image_tag":"HEAD","build_wv_version":"1.30.5","level":"info","msg":"telemetry terminated","payload":"\u0026{MachineID:85ee30ca-f2c2-4319-bb6e-eae79c18ffa2 Type:TERMINATE Version:1.30.5 ObjectsCount:2 OS:darwin Arch:arm64 UsedModules:[generative-openai text2vec-openai] CollectionsCount:2}","time":"2025-12-10T20:42:17-03:00"}
{"build_git_commit":"62dcafac32","build_go_version":"go1.24.3"