## Prerequisites
To run this example, you should:
1. Install [ollama](https://ollama.com/download)
2. Start YDB using the [docker-compose file](../../docker-compose.yml)

In [1]:
!pip install -qU pip
import os
os.environ["GRPC_VERBOSITY"] = "NONE"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Prepare dataset

Real dataset from [Hugging Face](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/viewer/default/train?p=2&views%5B%5D=train):

In [2]:
!pip install -qU datasets
from datasets import load_dataset

ds = load_dataset("Cohere/wikipedia-22-12-simple-embeddings")

ds["train"][0]["text"]

'The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.'

To keep the local example simple, we will use only a subset of this dataset:

In [3]:
from langchain_core.documents import Document


N = 10000
documents_to_upload = [Document(ds["train"][i]["text"]) for i in range(N)]

documents_to_upload[123]

Document(metadata={}, page_content='As early as the 1820s some scientists were discussing climate change: sunlight heats the surface of the Earth, and Joseph Fourier suggested that some of the heat radiated from the surface is trapped by the atmosphere before it can escape into space. This is called the greenhouse effect.')

Fake dataset from a local file:

In [4]:
with open("fake_wiki_ydb.md") as file:
    fake_data = file.read()

fake_data[:500]

'## Overview\nYDB is a fictional technology campus city designed as a dedicated space for innovation, work, and everyday life. YDB is located in a neutral zone and operates outside the jurisdiction of any nation-state. Officially, YDB presents itself as an independent innovation territory.\n\n## Name origin\nThe name YDB is not officially decoded. In project documentation, YDB appears as a placeholder code from an international design competition. There are rumors that YDB refers to an internal joke '

In [5]:
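# The splitters live in the langchain-text-splitters package;
# the old langchain.text_splitter path is deprecated.
from langchain_text_splitters import MarkdownHeaderTextSplitter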

markdown_header_text_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("##", "Header")],
)

markdown_document_splits = markdown_header_text_splitter.split_text(
    fake_data
)

markdown_document_splits[0]

Document(metadata={'Header': 'Overview'}, page_content='YDB is a fictional technology campus city designed as a dedicated space for innovation, work, and everyday life. YDB is located in a neutral zone and operates outside the jurisdiction of any nation-state. Officially, YDB presents itself as an independent innovation territory.')
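
To make the splitting step more concrete, here is a minimal pure-Python sketch of what splitting on `##` headers amounts to. It is not the `MarkdownHeaderTextSplitter` implementation: the helper `split_on_h2` and the sample text are hypothetical, and only level-2 headers are handled.

```python
def split_on_h2(markdown_text):
    """Split text into (header, body) pairs at lines starting with '## '.

    Hypothetical helper for illustration only; a real splitter covers
    more cases (nested headers, fenced code, and so on).
    """
    sections = []
    header, body_lines = None, []
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            # Close the previous section before starting a new one.
            if header is not None or body_lines:
                sections.append((header, "\n".join(body_lines).strip()))
            header, body_lines = line[3:].strip(), []
        else:
            body_lines.append(line)
    # Flush the last open section.
    if header is not None or body_lines:
        sections.append((header, "\n".join(body_lines).strip()))
    return sections


sample = (
    "## Overview\nYDB is a fictional city.\n\n"
    "## Name origin\nThe name is not decoded.\n"
)
sections = split_on_h2(sample)
for header, body in sections:
    print(header, "->", body)
```

Each pair corresponds roughly to one `Document`: the header text plays the role of `metadata["Header"]` and the body plays the role of `page_content`.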

Let's merge the real and fake wiki documents:

In [6]:
documents_to_upload.extend(markdown_document_splits)

## Prepare vector store

First, we have to create an embedding function. In this example, we will use an open-source model from [Hugging Face](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).

In [7]:
!pip install -qU langchain-huggingface

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)


In [8]:
!pip install -qU langchain-ydb

Then, let's create the YDB vector store:

In [13]:
from langchain_ydb.vectorstores import YDB, YDBSettings

vector_store = YDB(
    embeddings,
    config=YDBSettings(
        host="localhost",
        port=2136,
        database="/local",
        table="langchain_ydb_local_rag",
        index_enabled=True,
        index_config_clusters=64,
        drop_existing_table=True,
    ),
)

Finally, let's load prepared documents:

In [14]:
ids = vector_store.add_documents(documents_to_upload, batch_size=100)

Processing batches...: 100%|██████████| 101/101 [01:31<00:00,  1.10it/s]
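
The progress bar reports 101 batches, which is what you would expect when 10,000 real documents plus a handful of fake-wiki sections are uploaded with `batch_size=100`. The sketch below shows the arithmetic; the helper `iter_batches` and the count of 12 fake sections are assumptions made for illustration, not the store's internal batching code.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 10,000 real documents plus a hypothetical 12 fake-wiki sections.
docs = list(range(10_012))
batches = list(iter_batches(docs, 100))

# The number of batches is the ceiling of len(docs) / batch_size.
expected = math.ceil(len(docs) / 100)
print(len(batches), expected, len(batches[-1]))
```

Any number of extra sections between 1 and 100 gives the same total of 101 batches; only the size of the last batch changes.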


Let's check how this vector store works:

In [15]:
vector_store.similarity_search("social network", k=1)

[Document(metadata={}, page_content='A January 2009 Compete.com study ranked Facebook as the most used social networking service by worldwide monthly active users. "Entertainment Weekly" put the site on its end-of-the-decade "best-of" list. It said, "How on earth did we stalk our exes, remember our co-workers birthdays, bug our friends, and play a rousing game of Scrabulous before Facebook?" Quantcast estimates Facebook had 138.9 million monthly different U.S. visitors in May 2011. According to "Social Media Today", in April 2010 about 41.6% of the U.S. population had a Facebook account. Facebook\'s growth started to slow down in some areas. The site lost 7 million active users in the United States and Canada in May 2011 relative to previous statistics.')]

In [16]:
vector_store.similarity_search("YDB city", k=1)

[Document(metadata={'Header': 'Overview'}, page_content='YDB is a fictional technology campus city designed as a dedicated space for innovation, work, and everyday life. YDB is located in a neutral zone and operates outside the jurisdiction of any nation-state. Officially, YDB presents itself as an independent innovation territory.')]
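
Conceptually, a similarity search embeds the query and returns the stored documents whose vectors are closest to it. Below is a minimal brute-force sketch using cosine similarity over toy 3-dimensional vectors. The vectors, the names, and the choice of cosine similarity are assumptions for illustration; a real store typically relies on an index rather than scanning every row, and its metric depends on configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the names of the k vectors most similar to query_vec."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Toy embeddings (made up); real ones have hundreds of dimensions.
toy_vectors = {
    "facebook": [0.9, 0.1, 0.0],
    "ydb_overview": [0.1, 0.9, 0.2],
    "clock": [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.1]
best = top_k(query, toy_vectors, k=1)
print(best)
```

Here the query vector points in nearly the same direction as `ydb_overview`, so that entry is ranked first.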

To use the vector store in a chain, we convert it to the retriever interface:

In [18]:
retriever = vector_store.as_retriever()
retriever.invoke("what is YDB city")[0]

Document(metadata={'Header': 'Overview'}, page_content='YDB is a fictional technology campus city designed as a dedicated space for innovation, work, and everyday life. YDB is located in a neutral zone and operates outside the jurisdiction of any nation-state. Officially, YDB presents itself as an independent innovation territory.')

## Prepare LLM

In this example, we will use a locally hosted LLM served by ollama.

In [19]:
!pip install -qU langchain-ollama

from langchain_ollama.llms import OllamaLLM

llm = OllamaLLM(model="llama3.1")

llm.invoke("what is YDB city?")

'Yamnalladyn-Depe (also known as YDB or YD for short) is an archaeological site in modern-day Turkmenistan. The name "YDB" originates from the Yamna, Djeitun, and Dereivka cultures of the Eurasian steppes.\n\nThis region has yielded some remarkable discoveries about the origins of agriculture and the beginning of settled societies in human history. \n\nThe site was dated using different methods, including radiocarbon dating (14C) as well as optically stimulated luminescence (OSL), a method that measures how much light is emitted by minerals when exposed to infrared radiation.\n\nExcavations conducted at YDB and other nearby locations have uncovered signs of early agriculture, including the presence of wheat, barley, legumes, and domesticated sheep.'

Without any retrieved context, the model knows nothing about our fictional YDB city, so its answer is unrelated to the fake wiki.

## All Together: RAG chain

In [22]:
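# Import from langchain_core directly; the langchain.schema.* and
# langchain.prompts paths are legacy re-exports.
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate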

system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentences maximum and keep the answer concise. "
    "Context: {context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {
        "context": retriever | format_docs,
        "input": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

chain.invoke("What is YDB city?")

"YDB is a fictional technology campus city designed as a dedicated space for innovation, work, and everyday life. It's located in a neutral zone outside the jurisdiction of any nation-state and operates as an independent innovation territory. The city features modular design with interconnected clusters blending residential, work, and research functions."
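
To show what the chain does at each step, here is a plain-Python sketch of the same data flow: retrieve passages, join them into a context string, fill the prompt, and call the model. All four functions are hypothetical stand-ins (the retriever returns canned text and the "LLM" only reports what it received), so this illustrates the flow rather than how LangChain executes it.

```python
def fake_retriever(question):
    """Stand-in for the retriever: returns canned passages."""
    return [
        "YDB is a fictional technology campus city.",
        "YDB credits are the native digital currency.",
    ]


def join_passages(passages):
    """Concatenate passages into one context string."""
    return "\n\n".join(passages)


def build_prompt(context, question):
    """Fill the system and human messages, like the prompt template."""
    system = (
        "Use the given context to answer the question. "
        f"Context: {context}"
    )
    return [("system", system), ("human", question)]


def fake_llm(messages):
    """Stand-in for the LLM: reports how many messages it received."""
    return f"received {len(messages)} messages"


def run_chain(question):
    # The question feeds both the retriever and the human message.
    context = join_passages(fake_retriever(question))
    messages = build_prompt(context, question)
    return fake_llm(messages)


print(run_chain("What is YDB city?"))
```

The question is used twice, once to fetch context and once as the human message, which mirrors the dictionary with `"context"` and `"input"` keys at the head of the chain.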

In [23]:
chain.invoke("What currency is used in YDB city?")

"The currency used in YDB city is YDB credits, which is a native digital currency. It's pegged to system productivity rather than external market values. Exchanges to fiat currencies can be made at licensed kiosks."

In [24]:
chain.invoke("Where can I exchange money in YDB city?")

"You can exchange your YDB credits for fiat currencies at the licensed kiosks located throughout the city. These kiosks offer conversions to external currencies, providing access to traditional financial systems outside of YDB's internal economy. You don't need to visit a bank or currency exchange office."