[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/llm-agent-frameworks/langchain/loading-data/langchain-simple-pdf.ipynb)

# Multilanguage RAG with filtering across multiple PDFs, using LangChain and OpenAI

In [None]:
# Let's install the packages we need
%pip install -Uqq langchain-weaviate langchain-community
%pip install langchain-openai tiktoken langchain pypdf

You must have a valid OpenAI API key in the `OPENAI_API_KEY` environment variable.
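A missing key only surfaces later as an authentication error from OpenAI, so it can help to fail early. The helper below is a minimal sketch of such a check; `require_env` is a hypothetical name and not part of Weaviate or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return value


# Usage (raises RuntimeError if the variable is missing or empty):
# api_key = require_env("OPENAI_API_KEY")
```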

In [None]:
import weaviate, os

client = weaviate.connect_to_embedded(
    headers={
        # Read from the environment; set OPENAI_API_KEY before running this cell
        "X-OpenAi-Api-Key": os.environ.get("OPENAI_API_KEY"),
    }
)

print("Client is Ready?", client.is_ready())

Started /Users/dudanogueira/.cache/weaviate-embedded: process ID 41028


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-09-02T14:43:58-03:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-09-02T14:43:58-03:00"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-09-02T14:43:58-03:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50050","time":"2024-09-02T14:43:58-03:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:8079","time":"2024-09-02T14:43:58-03:00"}


Client is Ready? True


{"level":"info","msg":"Completed loading shard testcollection_64HzLSXOVuGn in 3.169709ms","time":"2024-09-02T14:43:59-03:00"}
{"level":"info","msg":"Completed loading shard testcollection2_AhtmABIV35w1 in 3.222792ms","time":"2024-09-02T14:43:59-03:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-09-02T14:43:59-03:00","took":46208}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-09-02T14:43:59-03:00","took":41625}
{"level":"info","msg":"Completed loading shard wikipedialangchain_m1fT5DHE8evq in 2.359167ms","time":"2024-09-02T14:43:59-03:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-09-02T14:43:59-03:00","took":43500}


## Let's create our Collection beforehand

This ensures the collection is created with a vectorizer and a generative config.
Make sure the embedding model configured here is the same one you pass to LangChain, so that the stored vectors and the query vectors are comparable.

In [3]:
from weaviate import classes as wvc
# Delete any existing collection with this name so we start clean
client.collections.delete("WikipediaLangChain")
# Create it explicitly so its vectorizer and generative module are the ones we want
collection = client.collections.create(
    name="WikipediaLangChain",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    generative_config=wvc.config.Configure.Generative.openai(),
)

{"level":"info","msg":"Created shard wikipedialangchain_TYt3LayzT5YG in 838.042µs","time":"2024-09-02T14:44:03-03:00"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-09-02T14:44:03-03:00","took":38959}


Now we have a Weaviate client and an empty collection!
Let's read our two PDF files, [brazil-wikipedia-article-text.pdf](./brazil-wikipedia-article-text.pdf) and [netherlands-wikipedia-article-text.pdf](./netherlands-wikipedia-article-text.pdf).

Then we chunk them and ingest them using LangChain.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

from langchain_community.document_loaders import PyPDFLoader

from langchain_weaviate.vectorstores import WeaviateVectorStore

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
embeddings = OpenAIEmbeddings()

# import first article
loader = PyPDFLoader("brazil-wikipedia-article-text.pdf", extract_images=False)
docs = loader.load_and_split(text_splitter)
print(f"GOT {len(docs)} docs for Brazil")
db = WeaviateVectorStore.from_documents(docs, embeddings, client=client, index_name="WikipediaLangChain")


# import second article
loader = PyPDFLoader("netherlands-wikipedia-article-text.pdf", extract_images=False)
docs = loader.load_and_split(text_splitter)
print(f"GOT {len(docs)} docs for Netherlands")
db = WeaviateVectorStore.from_documents(docs, embeddings, client=client, index_name="WikipediaLangChain")

GOT 247 docs for Brazil
GOT 274 docs for Netherlands
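To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs and lines); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share their boundary characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once a new window would contain nothing beyond the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

With the notebook's settings (`chunk_size=400`, `chunk_overlap=50`), the overlap means a sentence cut at a chunk boundary is still partly present in the next chunk.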


In [5]:
# Let's first get our collection
collection = client.collections.get("WikipediaLangChain")

Let's count how many objects we have in total.

In [6]:
response = collection.aggregate.over_all(total_count=True)
print(response)

AggregateReturn(properties={}, total_count=521)


Now, how many objects do we have per source?

In [7]:
response = collection.aggregate.over_all(group_by="source")
for group in response.groups:
    print(group.grouped_by.value, group.total_count)

netherlands-wikipedia-article-text.pdf 274
brazil-wikipedia-article-text.pdf 247
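Conceptually, grouping by `source` is a count per distinct property value. The sketch below mimics that on plain dictionaries; the file names and records are made up for illustration, and in the notebook the counting is done by Weaviate itself.

```python
from collections import Counter

# Hypothetical chunk metadata, shaped like the properties LangChain stores
chunk_properties = [
    {"source": "a.pdf", "page": 0},
    {"source": "a.pdf", "page": 1},
    {"source": "b.pdf", "page": 0},
]

# Count how many chunks came from each source file
counts = Counter(props["source"] for props in chunk_properties)
for source, total in counts.items():
    print(source, total)
```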


LangChain added some metadata, such as `source` and `page`. Let's fetch one object.

In [8]:
# Named `obj` to avoid shadowing Python's built-in `object`
obj = collection.query.fetch_objects(limit=1).objects[0]

In [9]:
obj.properties.keys()

dict_keys(['text', 'page', 'source'])

In [10]:
print(obj.properties.get("source"))
print(obj.properties.get("page"))
print(obj.properties.get("text"))

netherlands-wikipedia-article-text.pdf
0.0
Belgium to the south, with a North Sea coastline to the north and west. It also has
a border with France on the split island of Saint Martin in the Caribbean. It shares
maritime borders with the United Kingdom, Germany and Belgium. The official
language is Dutch, with West Frisian as a secondary official language in the


## Let's ask in French about content in English

In [11]:
# Let's do RAG directly, using only Weaviate

# This is our prompt. French for: "What is the traditional food of this country?"
generate_task = "Quelle est la nourriture traditionnelle de ce pays?"
# Filter the search so that only this specific file is used
source_file = "brazil-wikipedia-article-text.pdf"
#source_file = "netherlands-wikipedia-article-text.pdf"

query = collection.generate.near_text(
    query="traditional food",
    filters=wvc.query.Filter.by_property("source").equal(source_file),
    limit=10,
    grouped_task=generate_task,
)
print(query.generated)

La nourriture traditionnelle du Brésil comprend des plats tels que la farine (farofa), les pommes de terre frites, la cassave frite, la banane frite, la viande frite et le fromage frit. Des snacks populaires incluent le pastel (une pâtisserie frite), la coxinha (une variation de croquette de poulet), le pão de queijo (pain au fromage et farine de manioc / tapioca), la pamonha (pâte de maïs et de lait), l'esfirra (une variation de pâtisserie libanaise) et l'acarajé (de la cuisine africaine). Les plats traditionnels incluent la feijoada, considérée comme le plat national du pays, ainsi que des plats régionaux tels que le beiju, le feijão tropeiro, le vatapá, la moqueca, la polenta (de la cuisine italienne), le kibbeh (de la cuisine arabe), l'empanada et l'empada. Les desserts brésiliens comprennent des douceurs comme les brigadeiros (boules de fudge au chocolat), le bolo de rolo (gâteau roulé à la goiabada), la cocada (une douceur à la noix de coco), les beijinhos (truffes à la noix de c

These are some of the objects used for this generation:

In [12]:
for obj in query.objects[:10]:
    print(obj.properties)

{'text': 'flour (farofa). Fried potatoes, fried cassava, fried banana, fried meat and fried cheese are very often eaten in lunch and\nserved in most typical restaurants. Popular snacks are pastel (a fried pastry); coxinha (a variation of chicken croquete); pão\nde queijo (cheese bread and cassava flour / tapioca); pamonha (corn and milk paste); esfirra (a variation of Lebanese', 'page': 13.0, 'source': 'brazil-wikipedia-article-text.pdf'}
{'text': "Cuisine\nBrazilian cuisine varies greatly by region, reflecting the country's varying mix of indigenous and immigrant populations. This\nhas created a national cuisine marked by the preservation of regional differences. Examples are Feijoada, considered the\ncountry's national dish; and regional foods such as beiju, feijão tropeiro, vatapá, moqueca, polenta (from Italian cuisine) and", 'page': 13.0, 'source': 'brazil-wikipedia-article-text.pdf'}
{'text': 'pastry); kibbeh (from Arabic cuisine); empanada (pastry) and empada, little salt pies f

Note that we used a filter, so the search and the generated answer draw only on that specific PDF.
Let's change the filter to the second PDF file.

In [13]:
# Now let's filter for the Netherlands
# Portuguese for: "What is the traditional food of this country?"
generate_task = "Qual é a comida tradicional deste país? Answer in English"
# Generate the answer using only the Netherlands article
source_file = "netherlands-wikipedia-article-text.pdf"

query = collection.generate.near_text(
    query="traditional food",
    filters=wvc.query.Filter.by_property("source").equal(source_file),
    limit=10,
    grouped_task=generate_task,
)

print(query.generated)

The traditional food of the Netherlands typically consists of potatoes, meat, and seasonal vegetables for dinner. The diet was historically high in carbohydrates and fat, reflecting the needs of laborers. Some typical Dutch foods include mayonnaise, whole-grain mustards, chocolate, buttermilk, seafood like herring and mussels, and pastries like stroopwafel and gevulde koek. The cuisine varies by region, with different specialties in the north, south, and western parts of the country. Beer and Jenever are traditional alcoholic beverages in the region.


And of course, we can use different filters and get different content for our questions.

In [14]:
# We can also filter for multiple sources
generate_task = "What do the foods of those two countries have in common?"
# Generate the answer using both Wikipedia articles
source_files = ["netherlands-wikipedia-article-text.pdf", "brazil-wikipedia-article-text.pdf"]

query = collection.generate.near_text(
    query="traditional food",
    filters=wvc.query.Filter.by_property("source").contains_any(source_files),
    limit=10,
    grouped_task=generate_task,
)

print(query.generated)

Both Brazil and the Netherlands have a variety of fried foods in their cuisine. In Brazil, fried potatoes, fried cassava, fried banana, fried meat, and fried cheese are commonly eaten, while in the Netherlands, fried fish dishes like kibbeling and lekkerbek are popular. Additionally, both countries have a tradition of using flour in their dishes, such as in Brazilian farofa and Dutch cookies and pastries.
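The difference between `equal` and `contains_any` on a single-valued property such as `source` can be pictured with plain Python predicates. This is only an illustration of the matching semantics, not Weaviate's implementation; the helper functions and file names below are hypothetical.

```python
from typing import Callable, Iterable


def equal(prop: str, value: object) -> Callable[[dict], bool]:
    """Match objects whose property is exactly `value`."""
    return lambda props: props.get(prop) == value


def contains_any(prop: str, values: Iterable[object]) -> Callable[[dict], bool]:
    """Match objects whose property is any one of `values`."""
    allowed = set(values)
    return lambda props: props.get(prop) in allowed


# Hypothetical stored objects
stored = [
    {"source": "a.pdf", "page": 0},
    {"source": "b.pdf", "page": 3},
    {"source": "c.pdf", "page": 1},
]

only_a = [p for p in stored if equal("source", "a.pdf")(p)]
a_or_b = [p for p in stored if contains_any("source", ["a.pdf", "b.pdf"])(p)]
print(len(only_a), len(a_or_b))
```

Keep in mind that `limit=10` applies to the combined set of matching objects, so with two sources the ten retrieved chunks are not necessarily split evenly between them.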


## Using Langchain to query data and answer questions

Up until now, we used LangChain to ingest our data and queried Weaviate directly.

Now let's also use LangChain to query. As you may have noticed, after ingesting our data, LangChain returns a vector store.

We can reuse that vector store or initiate a new one. Let's initiate a new one that points at the existing collection.

In [17]:
embeddings = OpenAIEmbeddings()
db = WeaviateVectorStore(embedding=embeddings, client=client, index_name="WikipediaLangChain", text_key="text")

### We can now search our data

In [None]:
# We can now run a similarity search across all objects
docs = db.similarity_search("traditional food", return_uuids=True)
print(docs)

[Document(metadata={'page': 14.0, 'source': 'netherlands-wikipedia-article-text.pdf'}, page_content='(in its modern form) and \nZeeuwse bolus\n are\ngood examples. Cookies are also produced in great number and tend to contain a lot of butter and sugar, like \nstroopwafel\n, as well\nas a filling of some kind, mostly almond, like \ngevulde koek\n. The traditional alcoholic beverages of this region are beer (strong pale\nlager) and \nJenever'), Document(metadata={'page': 14.0, 'source': 'netherlands-wikipedia-article-text.pdf'}, page_content='widely available and typical for the region. \nKibbeling\n, once a local delicacy consisting of small chunks of battered white fish, has\nbecome a national fast food, just as lekkerbek.\nThe Southern Dutch cuisine consists of the cuisines of the Dutch provinces of North Brabant and Limburg and the Flemish Region in'), Document(metadata={'page': 14.0, 'source': 'netherlands-wikipedia-article-text.pdf'}, page_content='amount of fish. The various dried

### Filter by a property
The `source` property is automatically added by LangChain.

Read more on how to add [multiple operands](https://weaviate.io/developers/weaviate/api/graphql/filters#multiple-operands) and [nested filters](https://weaviate.io/developers/weaviate/search/filters#nested-filters).

In [20]:
# Change the file below to get chunks from a different file / country
source_file = "brazil-wikipedia-article-text.pdf"
#source_file = "netherlands-wikipedia-article-text.pdf"
where_filter = wvc.query.Filter.by_property("source").equal(source_file)
docs = db.similarity_search("traditional food", filters=where_filter)
print(docs)

[Document(metadata={'page': 7.0, 'source': 'brazil-wikipedia-article-text.pdf'}, page_content='accounting for 32% of the total trade. Other large trading partners include the United States, Argentina, the Netherlands and\nCanada. Its automotive industry is the eighth-largest in the world. In the food industry, Brazil was the second-largest\nexporter of processed foods in the world in 2019. The country was the second-largest producer of pulp in the world and the'), Document(metadata={'page': 7.0, 'source': 'brazil-wikipedia-article-text.pdf'}, page_content="making up 6.6% of total GDP.\nBrazil is one of the largest producers of various agricultural commodities, and also has a large cooperative sector that\nprovides 50% of the food in the country. It has been the world's largest producer of coffee for the last 150 years. Brazil is the\nworld's largest producer of sugarcane, soy, coffee and orange; is one of the top 5 producers of maize, cotton, lemon,"), Document(metadata={'page': 10.0, 

### You can also do some question answering

In [21]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

from weaviate.classes.query import Filter

# client = weaviate.connect_to_weaviate_cloud(...)

embeddings = OpenAIEmbeddings()
db = WeaviateVectorStore.from_documents([], embeddings, client=client, index_name="WikipediaLangChain")

source_file = "brazil-wikipedia-article-text.pdf"
#source_file = "netherlands-wikipedia-article-text.pdf"
where_filter = Filter.by_property("source").equal(source_file)

# we want our retriever to filter the results
retriever = db.as_retriever(search_kwargs={"filters": where_filter})

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

llm = ChatOpenAI(model="gpt-4o-mini")
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What is the traditional food of this country?"})
print(response["answer"])

One of the traditional foods of Brazil is Feijoada, which is considered the country's national dish. Other regional foods include beiju, feijão tropeiro, vatapá, and moqueca. Brazilian cuisine reflects a rich mix of indigenous and immigrant influences.
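To picture what the "stuff" step does with the retrieved chunks, the sketch below joins them into one string and substitutes it for `{context}` in the system prompt. It is a simplified illustration under the assumption that chunks are joined with a blank line; the function names and sample texts are hypothetical, and the real chain also handles document formatting and message construction.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)


def stuff_documents(chunks: list[str], separator: str = "\n\n") -> str:
    """Concatenate all retrieved chunks into a single context string."""
    return separator.join(chunks)


def render_system_prompt(template: str, chunks: list[str]) -> str:
    """Fill the {context} placeholder with the stuffed chunks."""
    return template.format(context=stuff_documents(chunks))


# Made-up chunk texts standing in for what the retriever returns
retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
prompt_text = render_system_prompt(SYSTEM_TEMPLATE, retrieved)
print(prompt_text)
```

Because every retrieved chunk ends up in the prompt, the retriever's filter directly controls which PDF the model is allowed to answer from.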


In [22]:
# Let's close our embedded server
client.close()

{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2024-09-02T14:49:55-03:00"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:8079","time":"2024-09-02T14:49:55-03:00"}
