In [1]:
%load_ext autoreload
%autoreload 2

Necessary dependencies

In [2]:
# !pip install -U langchain_community tiktoken langchainhub langchain langgraph langchain-text-splitters
# !pip install -U langchain-nomic
# !pip install ollama
# !pip install -qU langchain-ollama
# !pip install --upgrade --quiet  rank_bm25 > /dev/null
# !pip install -U langchain-community faiss-cpu langchain-openai tiktoken
#!pip install --upgrade --quiet  scikit-learn
# !pip install -U langchain-chroma

In [28]:
from langchain.retrievers import EnsembleRetriever
from langchain_core.retrievers import BaseRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.retrievers import TFIDFRetriever
from langchain_community.vectorstores import FAISS
from langchain_chroma import Chroma
from langchain_community.embeddings import OllamaEmbeddings

## Preparing the data

We will create chunks using **RecursiveCharacterTextSplitter**. Each chunk keeps its origin page as metadata, so when a search returns two or more chunks from the same page we can retrieve the entire page instead and avoid losing too much context.
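
The page-expansion idea described above can be sketched in plain Python. This is only an illustration under stated assumptions: `Chunk` is a hypothetical stand-in for a LangChain `Document`, `pages` is an assumed dict mapping page number to full page text, and `expand_to_pages` and `min_hits` are names made up for this example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def expand_to_pages(retrieved, pages, min_hits=2):
    """Swap chunks for their full page when a page is hit `min_hits` times or more."""
    hits = Counter(chunk.metadata["page"] for chunk in retrieved)
    result, seen = [], set()
    for chunk in retrieved:
        page = chunk.metadata["page"]
        if hits[page] >= min_hits:
            # Emit the full page once, at the position of its first chunk
            if page not in seen:
                seen.add(page)
                result.append(pages[page])
        else:
            result.append(chunk.page_content)
    return result
```

Pages that are hit only once keep their single chunk, so the context only grows where several hits suggest the page is relevant.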

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

In [7]:
# chunk_size is an upper bound: chunks are at most 450 characters, and many are shorter.
# Consecutive chunks can share up to 50 characters (chunk_overlap), which helps preserve cohesiveness between them
text_splitter = RecursiveCharacterTextSplitter(chunk_size=450,
                                               chunk_overlap=50,
                                               length_function=len,
                                               is_separator_regex=False,
                                               )
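
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a naive fixed-window chunker. It is not the recursive, separator-aware algorithm that `RecursiveCharacterTextSplitter` uses; `sliding_chunks` is a hypothetical helper that only shows how a maximum size and an overlap interact.

```python
def sliding_chunks(text, size, overlap):
    """Cut `text` into windows of at most `size` characters sharing `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once the current window reaches the end of the text
        if start + size >= len(text):
            break
        start += step
    return chunks
```

The real splitter prefers to break on paragraph, line, and word boundaries, which is why its chunks are usually shorter than the maximum.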

### We are going to use the manual from the famous game Age of Empires 2

In [8]:
aoe2 = PyPDFLoader("docs/Age_of_Empires_2_Manual.pdf").load()

In [9]:
# Sneak Peek
aoe2[0:10]

[Document(metadata={'source': 'docs/Age_of_Empires_2_Manual.pdf', 'page': 0}, page_content='Information in this document, including URL and other Internet Web site references, is subject to\nchange without notice. The example companies, organizations, products, people and eventsdepicted herein are fictitious unless otherwise noted. No association with any real company,\norganization, product, person or event is intended or should be inferred. Complying with all\napplicable copyright laws is the responsibility of the user. Without limiting the rights undercopyright, no part of this document may be reproduced, stored in or introduced into a retrieval\nsystem, or transmitted in any form or by any means (electronic, mechanical, photocopying,\nrecording, or otherwise), or for any purpose, without the express written permission of MicrosoftCorporation.\nMicrosoft may have patents, patent applications, trademarks, copyrights, or other intellectual\nproperty rights covering subject matter in t

#### Let's chunk it and add the page number as metadata

In [10]:
aoe2_chunks = []
for page in aoe2:
    aoe2_chunks.extend(text_splitter.create_documents([page.page_content], [page.metadata]))

In [27]:
aoe2_chunks[:10]

[Document(metadata={'source': 'docs/Age_of_Empires_2_Manual.pdf', 'page': 0}, page_content='Information in this document, including URL and other Internet Web site references, is subject to\nchange without notice. The example companies, organizations, products, people and eventsdepicted herein are fictitious unless otherwise noted. No association with any real company,\norganization, product, person or event is intended or should be inferred. Complying with all'),
 Document(metadata={'source': 'docs/Age_of_Empires_2_Manual.pdf', 'page': 0}, page_content='applicable copyright laws is the responsibility of the user. Without limiting the rights undercopyright, no part of this document may be reproduced, stored in or introduced into a retrieval\nsystem, or transmitted in any form or by any means (electronic, mechanical, photocopying,\nrecording, or otherwise), or for any purpose, without the express written permission of MicrosoftCorporation.'),
 Document(metadata={'source': 'docs/Age_of_E

In [12]:
len(aoe2_chunks)

749

# BM25
Unfortunately, the BM25 retriever provided by LangChain can't be saved locally, so we will need to rebuild it from the chunks later.

In [13]:
# BM25 Inverted Index
bm25_retriever = BM25Retriever.from_documents(aoe2_chunks)
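
Since the retriever itself cannot be saved, one option is to persist the chunks and rebuild the index from them later. The sketch below uses only the standard library; `save_chunks`, `load_chunks`, and the file name are hypothetical, and the chunks are only assumed to expose `page_content` and `metadata` attributes.

```python
import json
from pathlib import Path


def save_chunks(chunks, path):
    """Write each chunk's text and metadata to a JSON file."""
    records = [
        {"page_content": chunk.page_content, "metadata": chunk.metadata}
        for chunk in chunks
    ]
    Path(path).write_text(json.dumps(records), encoding="utf-8")


def load_chunks(path):
    """Read the records back as plain dicts."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Later, each record can be turned back into a `Document(page_content=..., metadata=...)` and the list passed to `BM25Retriever.from_documents` again.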

In [14]:
bm25_retriever.k = 10
bm25_retriever.invoke("Villagers")

[Document(metadata={'source': 'docs/Age_of_Empires_2_Manual.pdf', 'page': 89}, page_content='Stone Mining, Gold Mining, Stone Shaft Mining, Gold ShaftMining (Lumber Camp, Mining Camp); Heavy Plow (Town\nCenter)\nBuild speed — Treadmill Crane (University)\nYour units resistant to other Monks — Faith (Monastery)\nVillagers perform the economic work for your civilization. They chop wood, mine stone'),
 Document(metadata={'source': 'docs/Age_of_Empires_2_Manual.pdf', 'page': 28}, page_content='The more villagers at work gathering resources, the faster your stockpile grows. Villagers\ncan deposit the resources more quickly if you build Mills near sources of food, LumberCamps near forests, and Mining Camps near stone and gold mines.\nResearching the following technologies improves your villagers’ gathering abilities:'),
 Document(metadata={'source': 'docs/Age_of_Empires_2_Manual.pdf', 'page': 117}, page_content='ARCHERVILLAGER(MALE)VILLAGER(FEMALE)\nTRADE COG\nSCOUT CAVALRYTRANSPORTSHIPGALLE

# TF-IDF
TF-IDF is a simpler sparse baseline that is sufficient for this example. Unlike the BM25 retriever, it can be saved locally.

https://python.langchain.com/v0.2/docs/integrations/retrievers/tf_idf/

In [15]:
tfidf_retriever = TFIDFRetriever.from_documents(aoe2_chunks)

In [17]:
tfidf_retriever.k = 10
tfidf_retriever.invoke("Villagers")

[Document(metadata={'source': 'docs/Age_of_Empires_2_Manual.pdf', 'page': 50}, page_content='see Chapters III and IV):\nz Create new villagers.\nz Deposit all resources (wood, food, gold, and stone) into your stockpile.\nz Advance to the next age.\nz Research technology that improves your villagers and buildings.\nz Ring the town bell to garrison villagers safely inside during enemy\nattack.B uildings'),
 Document(metadata={'source': 'docs/Age_of_Empires_2_Manual.pdf', 'page': 28}, page_content='The more villagers at work gathering resources, the faster your stockpile grows. Villagers\ncan deposit the resources more quickly if you build Mills near sources of food, LumberCamps near forests, and Mining Camps near stone and gold mines.\nResearching the following technologies improves your villagers’ gathering abilities:'),
 Document(metadata={'source': 'docs/Age_of_Empires_2_Manual.pdf', 'page': 6}, page_content='you can pay for improvements to your civilization. For more information abou

In [16]:
# Save (the argument is treated as a folder path, so this creates a directory named "tfidf_aoe2.pkl")
tfidf_retriever.save_local("tfidf_aoe2.pkl")

# Indexing with Dense vectors

We will be using the embeddings from nomic

In [18]:
embedding = OllamaEmbeddings(model="nomic-embed-text:latest")  # 768 dims

# Chroma
Chroma DB implements approximate nearest neighbor (**ANN**) search via **HNSW**.

In [50]:
# CHROMA (HNSW)
chroma_vectorstore = Chroma(persist_directory="chroma_aoe2",  embedding_function=embedding)
chroma_vectorstore.add_documents(aoe2_chunks)

['03deb97d-80a9-4823-887c-3cc15e7a9d79',
 '514cd30a-ef39-486e-ab5b-9a9131f49423',
 'add713c6-6627-4fbb-8ec7-b50a093aee30',
 '78a3600c-8277-455a-aaea-b987e5a58ffa',
 '37b6f517-2004-43e6-8b1d-0ed217ce277d',
 'edf65033-ef50-43c0-89da-4e9be32f0321',
 'bc96adfa-958a-41df-86d3-98c2742f8d2f',
 '952c5a9b-d971-4651-aff6-252a6a9ea72f',
 '17dd9cdb-2c11-4f8c-9b8b-4ce1b6465ded',
 '5eeeb8ee-8847-4cf5-9974-82437e310d06',
 'cfa6a5fa-8005-43e5-b2b5-aa6e1e4e4fcc',
 '42be01ad-3a6c-4114-8960-7bd42327d161',
 '58984a19-597f-4d54-9c28-f2ecf7373c87',
 'd81f5893-6525-4175-a7c2-85a6006f3a0b',
 '0a6c2d1b-4912-4a36-a921-bcdfa2ba6b41',
 'af2bbdfc-bfaf-4464-9b5e-8cd351e71951',
 '384c42c8-479b-4380-9108-f9891305c686',
 '6cdb0b91-9a7a-448d-bac9-e42eee07ffcd',
 'e9053da1-5fbf-46b5-90a4-adc8ff254927',
 'af1c218c-04ca-4aa9-98db-fd6df39efa57',
 '72c307ef-bbe1-4804-9766-55b9288103e1',
 '58eb8e47-2af9-4249-89b4-c8b06c7ec0a8',
 '1d89b75c-0ded-4d13-8012-4f9108c2438e',
 '291ddf33-41c2-42a3-b0b7-b1570d9dcf65',
 'a4d0de78-08bc-

In [51]:
chroma_retriever = chroma_vectorstore.as_retriever(search_kwargs={"k": 6})
chroma_retriever.invoke("Villagers")

[Document(metadata={'page': 26, 'source': 'docs/Age_of_Empires_2_Manual.pdf'}, page_content='24 Chapter III  -  Building Your EmpireChapter III\nPutting your villagers to work\nVillagers are invaluable to your civilization. Their primary function\nis to gather wood, food, gold, and stone from the land and deposit it\nin your stockpile. They also construct buildings and repair damaged\nbuildings, boats, and siege weapons. In a pinch, they can evenengage in combat. Fishing Ships also contribute to population count'),
 Document(metadata={'page': 89, 'source': 'docs/Age_of_Empires_2_Manual.pdf'}, page_content='and gold, hunt, forage, fish, herd sheep, and farm. They also construct buildings andrepair damaged buildings, ships, and siege weapons. If necessary, they can also engage in\ncombat. Villager gender is randomly determined when you create a new villager. They\nperform the same tasks regardless of their gender.\nThe great percentage of people in the Middle Ages were\npeasants, serfs, 

# FAISS

In [22]:
# FAISS: from_documents builds a flat (exact, brute-force) L2 index by default, not HNSW or Product Quantization
faiss_vectorstore = FAISS.from_documents(aoe2_chunks, embedding=embedding)
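
As far as we know, LangChain's `FAISS.from_documents` builds a flat L2 index by default, which means every query is compared against every stored vector. The sketch below shows that exact search in plain Python, as a contrast with the approximate HNSW search used by Chroma. `l2_top_k` is a hypothetical helper, and real indexes do this with optimized vectorized code.

```python
import math


def l2_top_k(query, vectors, k):
    """Return the indices of the k vectors closest to `query` by Euclidean distance."""
    ranked = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return ranked[:k]


# Toy 2-dimensional "embeddings"; the real ones have 768 dimensions
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
nearest = l2_top_k([0.9, 0.9], vectors, k=2)
```

Exact search is perfectly adequate for 749 chunks; approximate indexes start to pay off with much larger collections.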

In [25]:
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 5})
faiss_retriever.invoke("Villagers")

[Document(metadata={'source': 'docs/Age_of_Empires_2_Manual.pdf', 'page': 26}, page_content='24 Chapter III  -  Building Your EmpireChapter III\nPutting your villagers to work\nVillagers are invaluable to your civilization. Their primary function\nis to gather wood, food, gold, and stone from the land and deposit it\nin your stockpile. They also construct buildings and repair damaged\nbuildings, boats, and siege weapons. In a pinch, they can evenengage in combat. Fishing Ships also contribute to population count'),
 Document(metadata={'source': 'docs/Age_of_Empires_2_Manual.pdf', 'page': 89}, page_content='and gold, hunt, forage, fish, herd sheep, and farm. They also construct buildings andrepair damaged buildings, ships, and siege weapons. If necessary, they can also engage in\ncombat. Villager gender is randomly determined when you create a new villager. They\nperform the same tasks regardless of their gender.\nThe great percentage of people in the Middle Ages were\npeasants, serfs, 

In [26]:
# Save
faiss_vectorstore.save_local("faiss_aoe2")

## Asking single shot questions

In [29]:
def make_llama_3_prompt(user, system="", context=""):
    # Default to an empty system block so the f-string below never references an undefined name
    system_prompt = ""
    if system != "":
        system_prompt = (
            f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        )
    return f"<|begin_of_text|>{system_prompt}<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

In [30]:
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3.1:latest")

Let's check that it works.

In [31]:
llm.invoke("hello").content

'Hello! How can I assist you today?'


In [45]:
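def custom_retriever(retriever: BaseRetriever, user_query: str, k: int) -> str:
    # Vector-store retrievers read k from search_kwargs; BM25 and TF-IDF expose it as an attribute
    if hasattr(retriever, "search_kwargs"):
        retriever.search_kwargs["k"] = k
    else:
        retriever.k = k
    retrieved_docs = retriever.invoke(user_query)
    context = ""
    for doc in retrieved_docs:
        context += f"Extracted from page {doc.metadata['page']} \n{doc.page_content} \n\n"
        # Show which pages were retrieved
        print(doc.metadata['page'])
    return context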

def query(user_query:str, retriever: BaseRetriever,  llm: ChatOllama, k=5):
    context = custom_retriever(retriever, user_query, k)
    system_prompt = ("You are a helpful assistant. Your role is to help people find their way around the rules and mechanics of the famous game Age of Empires 2. "
                     "Answer using the following context. "
                     f"<CONTEXT>{context}</CONTEXT> "
                     "Keep your answers brief, 50 words at most. "
                     "If the answer is not contained in the context, say you don't know.")
    prompt = make_llama_3_prompt(user_query, system_prompt)
    answer = llm.invoke(prompt)
    return answer.content
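
One thing to keep in mind: `ChatOllama` is a chat model, and Ollama normally applies the model's chat template on its own, so a hand-built string with Llama 3 header tokens will probably be wrapped in a user turn rather than parsed as separate system and user messages. A possible alternative is to pass role-tagged messages, which LangChain chat models accept as `(role, content)` tuples. The helper name `build_messages` and the shortened prompt wording are assumptions for this sketch.

```python
def build_messages(user_query, context):
    """Build role-tagged messages instead of a raw Llama 3 template string."""
    system = (
        "You are a helpful assistant for the rules and mechanics of Age of Empires 2. "
        "Answer using the following context. "
        f"<CONTEXT>{context}</CONTEXT> "
        "Keep your answers brief, 50 words at most. "
        "If the answer is not contained in the context, say you don't know."
    )
    return [("system", system), ("human", user_query)]
```

With this helper, the call inside `query` would become `llm.invoke(build_messages(user_query, context)).content`.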

In [46]:
answer = query("How can I increase the number of villagers, give me a good a tactic", bm25_retriever, llm, k=5)
print(answer)

37
30
22
37
38
According to page 30, you can increase the population limit by changing it in the "Population" box in the pregame settings. A good tactic is to set a high population limit early on, especially if you're playing as a civilization that benefits from a large number of villagers, such as the Franks or the Mongols.


In [44]:
answer = query("How can I increase the number of villagers, give me a good a tactic", tfidf_retriever, llm, k=5)
print(answer)

37
30
22
37
38
Extracted from page 30 

You can increase the population limit before starting a game by changing it in the Population box in the pregame settings.


In [35]:
answer = query("How can I increase the number of villagers, give me a good a tactic", chroma_retriever, llm, k=5)
print(answer)

According to page 30, you can increase the population limit by changing it in the "Population" box in the pregame settings.


In [36]:
answer = query("How can I increase the number of villagers, give me a good a tactic", faiss_retriever, llm, k=5)
print(answer)

According to page 30, you can increase the population limit by changing it in the "Population" box in the pre-game settings. This will allow you to support more villagers, military units, or ships.


In [None]:
# For later

# # initialize the ensemble retriever
# ensemble_retriever = EnsembleRetriever(
#     retrievers=[bm25_retriever, faiss_retriever], weights=[1, 0]
# )
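
For context on the weights above: `EnsembleRetriever` merges the result lists with weighted Reciprocal Rank Fusion, so `weights=[1, 0]` would let only the BM25 ranking contribute to the scores. The sketch below is a simplified, plain-Python version of that fusion over document ids. `weighted_rrf` is a hypothetical helper, and the constant `c=60` is the commonly used default.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of ids: each list adds weight / (rank + c) to an id's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


sparse = ["a", "b", "c"]   # e.g. ids ranked by BM25
dense = ["c", "a", "d"]    # e.g. ids ranked by FAISS
fused = weighted_rrf([sparse, dense], weights=[0.5, 0.5])
```

Documents that rank well in both lists rise to the top, which is the main reason to combine a sparse and a dense retriever.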