[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/llm-agent-frameworks/langchain/loading-data/langchain-simple-pdf-multitenant.ipynb)

# Multilanguage RAG filtering by multiple PDFs per tenant with LangChain and OpenAI

In [1]:
# Let's install the packages this recipe needs
%pip install -Uqq langchain-weaviate langchain-community
%pip install langchain-openai tiktoken langchain pypdf

Note: you may need to restart the kernel to use updated packages.


You must have a valid OpenAI API key set in the `OPENAI_API_KEY` environment variable.
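A minimal sketch of failing early when the key is missing, instead of finding out on the first vectorization call. The helper name `require_env` is hypothetical and not part of any library used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


# Usage in this notebook (uncomment once the key is exported):
# openai_key = require_env("OPENAI_API_KEY")
```

Calling it before connecting keeps the error message close to its cause.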

In [2]:
import os

import weaviate

client = weaviate.connect_to_embedded(
    headers={
        # Your OpenAI API key, forwarded to the OpenAI modules
        "X-OpenAI-Api-Key": os.environ.get("OPENAI_API_KEY"),
    }
)

print("Client is Ready?", client.is_ready())


Started /Users/dudanogueira/.cache/weaviate-embedded: process ID 42937


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-09-02T14:54:41-03:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-09-02T14:54:41-03:00"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-09-02T14:54:41-03:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50050","time":"2024-09-02T14:54:41-03:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:8079","time":"2024-09-02T14:54:41-03:00"}


Client is Ready? True


{"level":"info","msg":"Completed loading shard testcollection_64HzLSXOVuGn in 3.742875ms","time":"2024-09-02T14:54:42-03:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-09-02T14:54:42-03:00","took":60708}
{"level":"info","msg":"Completed loading shard testcollection2_AhtmABIV35w1 in 3.796208ms","time":"2024-09-02T14:54:42-03:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-09-02T14:54:42-03:00","took":53250}
{"level":"info","msg":"Completed loading shard wikipedialangchain_TYt3LayzT5YG in 2.826458ms","time":"2024-09-02T14:54:42-03:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-09-02T14:54:42-03:00","took":6213375}


## Let's create our collection beforehand

In [3]:
from weaviate import classes as wvc

# Delete any existing collection with this name before creating a new one
client.collections.delete("WikipediaLangChainMT")
# Create it explicitly so the vectorizer and multi-tenancy settings are the ones we want
collection = client.collections.create(
    name="WikipediaLangChainMT",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    generative_config=wvc.config.Configure.Generative.openai(),
    multi_tenancy_config=wvc.config.Configure.multi_tenancy(
        enabled=True, auto_tenant_creation=True, auto_tenant_activation=True
    )
)

Now we have a Weaviate client and a multi-tenant collection!
Let's read our two PDF files, [brazil-wikipedia-article-text.pdf](./brazil-wikipedia-article-text.pdf) and [netherlands-wikipedia-article-text.pdf](./netherlands-wikipedia-article-text.pdf).

Then we chunk them and ingest them using LangChain.

**Note:** each PDF file is ingested into its own tenant.
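The notebook hard-codes the tenant names (`brazil`, `netherlands`). As a sketch of one possible convention, the tenant could instead be derived from the file name. The helper `tenant_from_filename` and the "first token before the dash" rule are assumptions made for illustration, not something LangChain or Weaviate provides.

```python
from pathlib import Path


def tenant_from_filename(pdf_path: str) -> str:
    """Derive a tenant name from the first dash-separated token of the file name."""
    stem = Path(pdf_path).stem  # e.g. "brazil-wikipedia-article-text"
    return stem.split("-", 1)[0].lower()


pdf_files = [
    "brazil-wikipedia-article-text.pdf",
    "netherlands-wikipedia-article-text.pdf",
]
# Map every PDF to the tenant it should be ingested into
tenant_by_pdf = {pdf: tenant_from_filename(pdf) for pdf in pdf_files}
print(tenant_by_pdf)
```

With a mapping like this, the ingestion code can loop over the files instead of repeating the same block once per PDF.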

In [4]:
# Confirm that multi-tenancy is enabled on the collection
collection.config.get(simple=False).multi_tenancy_config.enabled

True

In [5]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_weaviate.vectorstores import WeaviateVectorStore


text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
embeddings = OpenAIEmbeddings()

# import first article
loader = PyPDFLoader("brazil-wikipedia-article-text.pdf", extract_images=False)
docs = loader.load_and_split(text_splitter)
print(f"GOT {len(docs)} docs for Brazil")
db = WeaviateVectorStore.from_documents(docs, embeddings, client=client, index_name="WikipediaLangChainMT", tenant="brazil")


# import second article
loader = PyPDFLoader("netherlands-wikipedia-article-text.pdf", extract_images=False)
docs = loader.load_and_split(text_splitter)
print(f"GOT {len(docs)} docs for Netherlands")
db = WeaviateVectorStore.from_documents(docs, embeddings, client=client, index_name="WikipediaLangChainMT", tenant="netherlands")

2024-Sep-02 02:54 PM - langchain_weaviate.vectorstores - INFO - Tenant brazil does not exist in index WikipediaLangChainMT. Creating tenant.


GOT 247 docs for Brazil


{"level":"info","msg":"Created shard wikipedialangchainmt_brazil in 1.15075ms","time":"2024-09-02T14:55:01-03:00"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-09-02T14:55:01-03:00","took":38292}
2024-Sep-02 02:55 PM - langchain_weaviate.vectorstores - INFO - Tenant netherlands does not exist in index WikipediaLangChainMT. Creating tenant.


GOT 274 docs for Netherlands


{"level":"info","msg":"Created shard wikipedialangchainmt_netherlands in 1.772209ms","time":"2024-09-02T14:55:04-03:00"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-09-02T14:55:04-03:00","took":52125}


In [6]:
# lets first get our collection
collection = client.collections.get("WikipediaLangChainMT")

Let's count how many objects each tenant holds.

In [7]:
response = collection.with_tenant("brazil").aggregate.over_all(total_count=True)
print("Brazil Tenant:", response)
response = collection.with_tenant("netherlands").aggregate.over_all(total_count=True)
print("Netherlands Tenant:", response)

Brazil Tenant: AggregateReturn(properties={}, total_count=247)
Netherlands Tenant: AggregateReturn(properties={}, total_count=274)
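With more than a couple of tenants, the same count can be gathered in a loop. This is a small sketch that relies only on the `with_tenant(...).aggregate.over_all(total_count=True)` call used above; the function name `count_per_tenant` is hypothetical.

```python
def count_per_tenant(collection, tenant_names):
    """Return {tenant: total_count} using one aggregate call per tenant."""
    counts = {}
    for name in tenant_names:
        result = collection.with_tenant(name).aggregate.over_all(total_count=True)
        counts[name] = result.total_count
    return counts


# Usage with the collection from this notebook:
# print(count_per_tenant(collection, ["brazil", "netherlands"]))
```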


Now, how many objects do we have per source?

In [8]:
response = collection.with_tenant("brazil").aggregate.over_all(group_by="source")
for group in response.groups:
    print(group.grouped_by.value, group.total_count)

brazil-wikipedia-article-text.pdf 247


LangChain added some metadata, such as `source` and `page`. Let's fetch one object.

In [9]:
# Named `obj` to avoid shadowing Python's built-in `object`
obj = collection.with_tenant("brazil").query.fetch_objects(limit=1).objects[0]

In [10]:
obj.properties.keys()

dict_keys(['text', 'page', 'source'])

In [11]:
print(obj.properties.get("source"))
print(obj.properties.get("page"))
print(obj.properties.get("text"))

brazil-wikipedia-article-text.pdf
5.0
Brazil has the most known species of plants (55,000), freshwater fish (3,000), and mammals (over 689). It also ranks third
on the list of countries with the most bird species (1,832) and second with the most reptile species (744). The number of
fungal species is unknown but is large. Brazil is second only to Indonesia as the country with the most endemic species.
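Note that the page comes back as `5.0` rather than `5`, presumably because the auto-created `page` property uses a floating-point number type. A minimal sketch of turning it back into an integer for display; `page_as_int` is a hypothetical helper.

```python
def page_as_int(properties: dict) -> int:
    """Return the stored page value as an integer (e.g. 5.0 -> 5)."""
    page = properties["page"]
    if float(page) != int(page):
        raise ValueError(f"Unexpected fractional page value: {page!r}")
    return int(page)


example = {"source": "brazil-wikipedia-article-text.pdf", "page": 5.0}
print(page_as_int(example))  # prints 5
```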


## Let's ask in French about English content in a specific tenant

In [12]:
# Let's do RAG directly with Weaviate, without LangChain

# This is our prompt (French for "What is the traditional food of this country?")
generate_task = "Quelle est la nourriture traditionnelle de ce pays?"

# Scope every query below to the "brazil" tenant
collection_tenant = collection.with_tenant("brazil")

query = collection_tenant.generate.near_text(
    query="traditional food",
    limit=10,
    grouped_task=generate_task,
)
print(query.generated)

La nourriture traditionnelle du Brésil comprend des plats tels que la farine (farofa), les pommes de terre frites, la cassave frite, la banane frite, la viande frite et le fromage frit. Les snacks populaires incluent le pastel (une pâtisserie frite), la coxinha (une variation de croquette de poulet), le pão de queijo (pain au fromage et farine de manioc / tapioca), la pamonha (pâte de maïs et de lait), l'esfirra (une variation de pâtisserie libanaise) et l'acarajé (de la cuisine africaine). Les plats traditionnels brésiliens comprennent la feijoada, considérée comme le plat national du pays, ainsi que des plats régionaux tels que le beiju, le feijão tropeiro, le vatapá, la moqueca, la polenta (de la cuisine italienne), le kibbeh (de la cuisine arabe), l'empanada et l'empada. Les desserts brésiliens incluent des brigadeiros (boules de fudge au chocolat), du bolo de rolo (gâteau roulé à la goiabada), de la cocada (une douceur à la noix de coco), des beijinhos (truffes à la noix de coco e

These were some of the objects used for this generation:

In [13]:
for obj in query.objects[:10]:
    print(obj.properties)

{'text': 'flour (farofa). Fried potatoes, fried cassava, fried banana, fried meat and fried cheese are very often eaten in lunch and\nserved in most typical restaurants. Popular snacks are pastel (a fried pastry); coxinha (a variation of chicken croquete); pão\nde queijo (cheese bread and cassava flour / tapioca); pamonha (corn and milk paste); esfirra (a variation of Lebanese', 'page': 13.0, 'source': 'brazil-wikipedia-article-text.pdf'}
{'text': "Cuisine\nBrazilian cuisine varies greatly by region, reflecting the country's varying mix of indigenous and immigrant populations. This\nhas created a national cuisine marked by the preservation of regional differences. Examples are Feijoada, considered the\ncountry's national dish; and regional foods such as beiju, feijão tropeiro, vatapá, moqueca, polenta (from Italian cuisine) and", 'page': 13.0, 'source': 'brazil-wikipedia-article-text.pdf'}
{'text': 'pastry); kibbeh (from Arabic cuisine); empanada (pastry) and empada, little salt pies f

Note that the query was scoped to the `brazil` tenant, so content was searched and generated only from the PDF ingested into that tenant.
Let's switch to the tenant that holds the second PDF file.

In [14]:
# Same question, now for the Netherlands tenant.
# The prompt is in Portuguese ("What is the traditional food of this country?") and asks for an answer in English
generate_task = "Qual é a comida tradicional deste país? Answer in English"
# Get the collection scoped to a specific tenant
collection_tenant = collection.with_tenant("netherlands")

query = collection_tenant.generate.near_text(
    query="traditional food",
    limit=10,
    grouped_task=generate_task,
)

print(query.generated)

The traditional food of the Netherlands typically consists of potatoes, meat, and seasonal vegetables for dinner. The diet was historically high in carbohydrates and fat, reflecting the needs of laborers. Some typical Dutch foods include mayonnaise, whole-grain mustards, chocolate, buttermilk, seafood like herring and mussels, cookies, stroopwafel, beer, and Jenever. The cuisine varies by region, with different specialties in the north, south, and western parts of the country. Some regional dishes include Fries roggebrood, Kibbeling, and worstenbroodje. Dairy products, cheeses like Gouda and Edam, and various sausages are also common in Dutch cuisine.


## Using LangChain to query data and answer questions on specific tenants

Up until now, we used LangChain to ingest our data, and we queried Weaviate directly.

Now, let's also use LangChain to query. As you may have noticed, ingesting data with LangChain returns a vector store.

We can reuse that vector store or create a new one. Let's create a new one that points at the existing collection.

**Note**: the `tenant` parameter is passed at query time, either directly to the search call or through the retriever's search arguments.

In [15]:
embeddings = OpenAIEmbeddings()
db = WeaviateVectorStore(embedding=embeddings, client=client, index_name="WikipediaLangChainMT", text_key="text")

### We can now search our data

In [16]:
# Run a similarity search on the objects of a specific tenant
docs = db.similarity_search("traditional food", tenant="brazil")
print(docs)

[Document(metadata={'page': 7.0, 'source': 'brazil-wikipedia-article-text.pdf'}, page_content='accounting for 32% of the total trade. Other large trading partners include the United States, Argentina, the Netherlands and\nCanada. Its automotive industry is the eighth-largest in the world. In the food industry, Brazil was the second-largest\nexporter of processed foods in the world in 2019. The country was the second-largest producer of pulp in the world and the'), Document(metadata={'page': 7.0, 'source': 'brazil-wikipedia-article-text.pdf'}, page_content="making up 6.6% of total GDP.\nBrazil is one of the largest producers of various agricultural commodities, and also has a large cooperative sector that\nprovides 50% of the food in the country. It has been the world's largest producer of coffee for the last 150 years. Brazil is the\nworld's largest producer of sugarcane, soy, coffee and orange; is one of the top 5 producers of maize, cotton, lemon,"), Document(metadata={'page': 10.0, 

In [17]:
# Now the same similarity search for the Netherlands tenant
docs = db.similarity_search("traditional food", tenant="netherlands")
print(docs)

[Document(metadata={'page': 14.0, 'source': 'netherlands-wikipedia-article-text.pdf'}, page_content='(in its modern form) and \nZeeuwse bolus\n are\ngood examples. Cookies are also produced in great number and tend to contain a lot of butter and sugar, like \nstroopwafel\n, as well\nas a filling of some kind, mostly almond, like \ngevulde koek\n. The traditional alcoholic beverages of this region are beer (strong pale\nlager) and \nJenever'), Document(metadata={'page': 14.0, 'source': 'netherlands-wikipedia-article-text.pdf'}, page_content='widely available and typical for the region. \nKibbeling\n, once a local delicacy consisting of small chunks of battered white fish, has\nbecome a national fast food, just as lekkerbek.\nThe Southern Dutch cuisine consists of the cuisines of the Dutch provinces of North Brabant and Limburg and the Flemish Region in'), Document(metadata={'page': 14.0, 'source': 'netherlands-wikipedia-article-text.pdf'}, page_content='amount of fish. The various dried
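Printing the raw list of documents is hard to read. Here is a small sketch that prints one line per result instead. It only relies on the `metadata` and `page_content` attributes visible in the output above; `summarize_docs` is a hypothetical helper.

```python
def summarize_docs(docs, width: int = 80):
    """Return one short line per document: source, page, and a text preview."""
    lines = []
    for doc in docs:
        source = doc.metadata.get("source", "?")
        page = doc.metadata.get("page", "?")
        # Collapse newlines and repeated spaces before truncating the preview
        preview = " ".join(doc.page_content.split())[:width]
        lines.append(f"{source} (page {page}): {preview}")
    return lines


# Usage with the search results above:
# for line in summarize_docs(docs):
#     print(line)
```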

### Filter by a property
The `page` property is added automatically by LangChain.

See the Weaviate documentation for more on combining [multiple operands](https://weaviate.io/developers/weaviate/api/graphql/filters#multiple-operands) and [nested filters](https://weaviate.io/developers/weaviate/search/filters#nested-filters).

In [18]:
# Keep only chunks whose `page` value is 15 or higher
where_filter = wvc.query.Filter.by_property("page").greater_or_equal(15)
docs = db.similarity_search("traditional food", filters=where_filter, tenant="netherlands")
print(docs)

[Document(metadata={'page': 15.0, 'source': 'netherlands-wikipedia-article-text.pdf'}, page_content='nationally and internationally\nPhotographs\nAn album of photos of Holland (Netherlands) in 1935 and 1958\nhttps://openaccess.leidenuniv.nl/bitstream/handle/1887/1108/171_060.pdf?sequence=1\nhttp://www.volkskrant.nl/wetenschap/article455140.ece/Prehistorische_akker_gevonden_bij_Swifterbant'), Document(metadata={'page': 15.0, 'source': 'netherlands-wikipedia-article-text.pdf'}, page_content='sequence=1\nhttps://books.google.com/books?id=8isNLCXfNycC&pg=PA411\nhttps://books.google.com/books?id=8isNLCXfNycC&pg=PA508\nhttps://web.archive.org/web/20120114182245/http://www.digita.....story.uh.edu/database/article_display_printable.cfm?HHID=682\nhttp://www.digitalhistory.uh.edu/database/article_display_printable.cfm?HHID=682\nhttp://www.lrb.co.uk/v23/n07/murray-sayle/japan-goes-dutch'), Document(metadata={'page': 15.0, 'source': 'netherlands-wikipedia-article-text.pdf'}, page_content='https://

### You can also do some question answering

In [19]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from weaviate.classes.query import Filter

# client = weaviate.connect_to_weaviate_cloud(...)

embeddings = OpenAIEmbeddings()
# Passing an empty document list creates the vector store without ingesting new data
db = WeaviateVectorStore.from_documents([], embeddings, client=client, index_name="WikipediaLangChainMT")

# You can add your filters like this
where_filter = Filter.by_property("page").greater_or_equal(2)

# we want our retriever to filter the results and use a specific tenant
retriever = db.as_retriever(search_kwargs={
    "filters": where_filter, 
    "tenant": "brazil"
})

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

llm = ChatOpenAI(model="gpt-4o-mini")
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What is the traditional food of this country?"})
print(response["answer"])

One of the traditional foods of Brazil is Feijoada, which is considered the country's national dish. Additionally, regional foods like beiju, feijão tropeiro, vatapá, and moqueca also reflect the diverse culinary heritage of the country.
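To answer the same question for another tenant, only the retriever's search arguments need to change. A minimal sketch of building them in one place: `tenant_search_kwargs` is a hypothetical helper, and it assumes that `k` (the number of results) is forwarded to the similarity search in the same way `tenant` and `filters` are above.

```python
def tenant_search_kwargs(tenant: str, filters=None, k: int = 4) -> dict:
    """Build the search_kwargs dict for a retriever scoped to one tenant."""
    kwargs = {"tenant": tenant, "k": k}
    if filters is not None:
        kwargs["filters"] = filters
    return kwargs


# Usage with the vector store from this notebook:
# retriever_nl = db.as_retriever(search_kwargs=tenant_search_kwargs("netherlands"))
print(tenant_search_kwargs("brazil"))
```

Keeping the tenant in a single argument makes it harder to build a retriever that accidentally queries the wrong tenant.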


In [20]:
# Let's close the client, which shuts down the embedded server
client.close()

{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2024-09-02T14:57:30-03:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:8079","time":"2024-09-02T14:57:30-03:00"}
