# Basic RAG

https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

## 1 - Text and Document

LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has two attributes:

- `page_content`: a string representing the content;
- `metadata`: a dict containing arbitrary metadata.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.
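To make the "chunk of a larger document" idea concrete, here is a minimal sketch that splits one text into sentence-level chunks, each carrying its own metadata. `SimpleDocument`, `split_into_chunks`, and the `chunk` metadata key are hypothetical stand-ins written for illustration; they only mirror the two attributes described above and are not part of LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors Document's two attributes."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_into_chunks(text: str, source: str) -> list[SimpleDocument]:
    """Turn each non-empty sentence of `text` into its own chunk."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [
        SimpleDocument(page_content=s, metadata={"source": source, "chunk": i})
        for i, s in enumerate(sentences)
    ]


chunks = split_into_chunks(
    "My name is Bob. I have one brother. My brother is Jim.", "my-text"
)
for chunk in chunks:
    print(chunk.page_content, chunk.metadata)
```

Every chunk keeps the same `source`, so a retrieved chunk can still be traced back to the text it came from.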

In [2]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="My name is Bob",
        metadata={"source": "my-text"},
    ),
    Document(
        page_content="I have one brother",
        metadata={"source": "my-text"},
    ),
    Document(
        page_content="My brother is Jim",
        metadata={"source": "my-text"},
    ),
    Document(
        page_content="I love playing soccer",
        metadata={"source": "my-text"},
    ),
    Document(
        page_content="Ronaldo is best football player",
        metadata={"source": "random-text"},
    ),
    Document(
        page_content="Dog is cute animal",
        metadata={"source": "radom-text"},
    ),
]

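Here we've generated six documents with metadata intended to indicate two distinct "sources". Because the last document's source is misspelled as "radom-text", the store will actually see three distinct values.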

## Vector Store

Vector search is a common way to store and search over unstructured data (such as unstructured text). 

The idea is to store numeric vectors that are associated with the text. 

Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics to identify related data in the store.
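As a rough illustration of a similarity metric, the sketch below computes cosine similarity over made-up 3-dimensional vectors. The vectors and the `cosine_similarity` helper are assumptions for demonstration only; a real embedding model produces vectors with far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" standing in for what an embedding model would return.
toy_vectors = {
    "I love playing soccer": [0.9, 0.1, 0.0],
    "Dog is cute animal": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.1]  # pretend embedding of a sports-related query

# The stored text whose vector points most nearly the same way as the query wins.
best_match = max(
    toy_vectors, key=lambda text: cosine_similarity(query_vector, toy_vectors[text])
)
print(best_match)
```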

Let's start by defining the embedding function.

In [3]:
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

LangChain integrates with several vector stores. First we will use Chroma, which stores the vectors locally in a SQLite database.

In [4]:
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents,
    embedding=embedding,  # reuse the embedding function defined above
    persist_directory="doc-db",  # on-disk location for the collection; the call reportedly crashes without it
)

We can also store plain strings directly with `from_texts`:

In [5]:
texts = [
    'My name is Bob',
    'I have one brother',
    'My brother is Jim',
    'I love soccer',
    'Ronaldo is the best football player',
    'Dog is cute animal',
]

vector1 = Chroma.from_texts(
    texts,
    embedding=embedding,
    persist_directory="text-db",  # on-disk location for the collection; the call reportedly crashes without it
)

## 2 - Vector Store in Pinecone

In [6]:
from langchain_pinecone import PineconeVectorStore 

index_name = "db"  # assumed to already exist in your Pinecone project

pinecone_store = PineconeVectorStore.from_texts(
    texts, embedding, index_name=index_name
)



## Query Vector

In [8]:
vectorstore.similarity_search('footbal')

[Document(metadata={'source': 'my-text'}, page_content='I love playing soccer'),
 Document(metadata={'source': 'my-text'}, page_content='I love playing soccer'),
 Document(metadata={'source': 'random-text'}, page_content='Ronaldo is best football player'),
 Document(metadata={'source': 'random-text'}, page_content='Ronaldo is best football player')]

In [9]:
vectorstore.similarity_search_with_score("Footbal")

[(Document(metadata={'source': 'my-text'}, page_content='I love playing soccer'),
  0.29745159439630414),
 (Document(metadata={'source': 'my-text'}, page_content='I love playing soccer'),
  0.29745159439630414),
 (Document(metadata={'source': 'random-text'}, page_content='Ronaldo is best football player'),
  0.3417359514292345),
 (Document(metadata={'source': 'random-text'}, page_content='Ronaldo is best football player'),
  0.3417359514292345)]
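The scores in the output above increase down the list, which suggests they are distances rather than similarities, so a lower score means a closer match. The sketch below illustrates that kind of ranking with squared Euclidean distance over made-up 2-dimensional vectors. The metric, the helper names, and the numbers are assumptions for illustration, not a statement of what this Chroma collection uses internally.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance: 0.0 means the vectors are identical."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_by_distance(query: list[float], store: dict, k: int = 2) -> list[tuple]:
    """Return the k (text, distance) pairs closest to the query, nearest first."""
    scored = [(text, squared_l2(query, vector)) for text, vector in store.items()]
    scored.sort(key=lambda pair: pair[1])  # ascending: smallest distance first
    return scored[:k]


# Made-up vectors standing in for real embeddings.
toy_store = {
    "I love playing soccer": [0.9, 0.1],
    "Ronaldo is best football player": [0.8, 0.3],
    "Dog is cute animal": [0.1, 0.9],
}

ranked = top_k_by_distance([1.0, 0.0], toy_store, k=2)
for text, distance in ranked:
    print(text, round(distance, 3))
```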

In [10]:
cat_emb = embedding.embed_query("cat")

vectorstore.similarity_search_by_vector(cat_emb)

[Document(metadata={'source': 'radom-text'}, page_content='Dog is cute animal'),
 Document(metadata={'source': 'radom-text'}, page_content='Dog is cute animal'),
 Document(metadata={'source': 'my-text'}, page_content='My name is Bob'),
 Document(metadata={'source': 'my-text'}, page_content='My name is Bob')]

In [11]:
vectorstore.similarity_search("water", k=5)

[Document(metadata={'source': 'my-text'}, page_content='I love playing soccer'),
 Document(metadata={'source': 'my-text'}, page_content='I love playing soccer'),
 Document(metadata={'source': 'my-text'}, page_content='My name is Bob'),
 Document(metadata={'source': 'my-text'}, page_content='My name is Bob'),
 Document(metadata={'source': 'radom-text'}, page_content='Dog is cute animal')]

## Modify Vector DB

In [None]:
# Simple string ids ("0", "1", ...) give each document a fixed key in the store
ids = [str(i) for i in range(len(documents))]

vectorstore.add_documents(documents=documents, ids=ids)
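Each result in the query outputs above appears twice, which is consistent with the same documents having been written to the persisted collection more than once, for example by rerunning a cell. Passing explicit ids is one common way to make inserts repeatable. The sketch below shows the idea with a hypothetical in-memory `TinyStore`; it assumes the store treats an id as a unique key and is not the Chroma API.

```python
class TinyStore:
    """Hypothetical in-memory store keyed by id."""

    def __init__(self):
        self._records = {}
        self._auto_id = 0

    def add_texts(self, texts, ids=None):
        if ids is None:
            # No ids given: every call invents fresh keys, so reruns duplicate.
            ids = []
            for _ in texts:
                ids.append(f"auto-{self._auto_id}")
                self._auto_id += 1
        for record_id, text in zip(ids, texts):
            # An existing id is overwritten; a new id adds a record.
            self._records[record_id] = text

    def __len__(self):
        return len(self._records)


sample_texts = ["My name is Bob", "I have one brother"]

without_ids = TinyStore()
without_ids.add_texts(sample_texts)
without_ids.add_texts(sample_texts)  # simulated rerun of the cell

with_ids = TinyStore()
stable_ids = [str(i) for i in range(len(sample_texts))]
with_ids.add_texts(sample_texts, ids=stable_ids)
with_ids.add_texts(sample_texts, ids=stable_ids)  # rerun overwrites instead

print(len(without_ids), len(with_ids))
```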

## Retriever

LangChain VectorStore objects do not subclass Runnable, and so cannot immediately be integrated into LangChain Expression Language chains.

LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations) and are designed to be incorporated in LCEL chains.

We can create a simple version of this ourselves, without subclassing Retriever. If we choose what method we wish to use to retrieve documents, we can create a runnable easily. Below we will build one around the similarity_search method:

In [16]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

# .bind() passes k=1 (return only the top result) every time the retriever is called
retriever = RunnableLambda(vectorstore.similarity_search).bind(k=1)

# .batch() runs .invoke() on each input in parallel using a thread pool executor
retriever.batch(["soccer", "family"])

[[Document(metadata={'source': 'my-text'}, page_content='I love playing soccer')],
 [Document(metadata={'source': 'my-text'}, page_content='I have one brother')]]
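For intuition, the two ideas used in that cell (fixing an argument ahead of time, then mapping over inputs in a thread pool) can be imitated with the standard library alone. This is only an analogy: `toy_search` and `batch` are hypothetical helpers, and the word-overlap ranking stands in for a real vector search.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def toy_search(query: str, k: int = 4) -> list[str]:
    """Hypothetical search: rank a fixed corpus by words shared with the query."""
    corpus = ["I love playing soccer", "I have one brother", "Dog is cute animal"]
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus, key=lambda text: -len(query_words & set(text.lower().split()))
    )
    return ranked[:k]


# Comparable in spirit to .bind(k=1): k is fixed for every later call.
top_one = partial(toy_search, k=1)


def batch(func, inputs):
    """Call func on each input in a thread pool, keeping the input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, inputs))


results = batch(top_one, ["soccer", "brother"])
print(results)
```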

Vectorstores implement an as_retriever method that will generate a Retriever, specifically a VectorStoreRetriever. These retrievers include specific search_type and search_kwargs attributes that identify what methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following:

In [18]:
retriever = vectorstore.as_retriever(
    search_type="similarity", # alternatives: mmr, similarity_score_threshold
    search_kwargs={"k": 1},
)

retriever.batch(["soccer", "family"])

[[Document(metadata={'source': 'my-text'}, page_content='I love playing soccer')],
 [Document(metadata={'source': 'my-text'}, page_content='I have one brother')]]
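The way a search type and its keyword arguments select and parameterize a store method can be pictured as a small dispatcher. The sketch below is a simplified illustration with hypothetical `ToyStore` and `ToyRetriever` classes and fixed, made-up similarity scores (the query is ignored); it is not how `VectorStoreRetriever` is implemented.

```python
class ToyStore:
    """Hypothetical store holding (text, similarity) pairs, best first."""

    def __init__(self, scored_texts):
        self._scored = sorted(scored_texts, key=lambda pair: pair[1], reverse=True)

    def similarity_search(self, query, k=4):
        return [text for text, _ in self._scored[:k]]

    def search_with_threshold(self, query, score_threshold=0.5, k=4):
        return [text for text, score in self._scored[:k] if score >= score_threshold]


class ToyRetriever:
    """Picks a store method from search_type and forwards search_kwargs to it."""

    def __init__(self, store, search_type="similarity", search_kwargs=None):
        self.store = store
        self.search_type = search_type
        self.search_kwargs = search_kwargs or {}

    def invoke(self, query):
        if self.search_type == "similarity":
            return self.store.similarity_search(query, **self.search_kwargs)
        if self.search_type == "similarity_score_threshold":
            return self.store.search_with_threshold(query, **self.search_kwargs)
        raise ValueError(f"unsupported search_type: {self.search_type}")


store = ToyStore([
    ("I love playing soccer", 0.9),
    ("Ronaldo is best football player", 0.8),
    ("Dog is cute animal", 0.2),
])

top = ToyRetriever(store, "similarity", {"k": 1}).invoke("soccer")
filtered = ToyRetriever(
    store, "similarity_score_threshold", {"score_threshold": 0.5}
).invoke("soccer")
print(top, filtered)
```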

## Chain Together

In [19]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-3.5-turbo")

In [20]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | model
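To see how data moves through a chain of this shape, the sketch below imitates it with plain functions: the same question is sent to both entries of the dict, and the results fill the prompt template. All names here (`toy_retriever`, `passthrough`, `run_parallel`, `fill_prompt`, `toy_chain`) are hypothetical, the retriever returns a fixed string list, and the model call is left out, so treat it as an analogy rather than LangChain's implementation.

```python
TEMPLATE = (
    "Answer this question using the provided context only.\n\n"
    "{question}\n\n"
    "Context:\n"
    "{context}\n"
)


def toy_retriever(question: str) -> list[str]:
    """Hypothetical retriever: returns a fixed context for any question."""
    return ["My brother is Jim"]


def passthrough(value):
    """Return the input unchanged, so the question reaches the prompt as-is."""
    return value


def run_parallel(steps: dict, value) -> dict:
    """Call every step with the same input and collect the results by key."""
    return {key: step(value) for key, step in steps.items()}


def fill_prompt(variables: dict) -> str:
    """Substitute the collected values into the template placeholders."""
    return TEMPLATE.format(**variables)


def toy_chain(question: str) -> str:
    variables = run_parallel(
        {"context": toy_retriever, "question": passthrough}, question
    )
    return fill_prompt(variables)


prompt_text = toy_chain("who is Jim?")
print(prompt_text)
```

In this sketch the retrieved list is inserted using its string form, so the brackets and quotes appear in the prompt text. Joining the retrieved texts into one string before formatting would give a cleaner context.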

In [21]:
response = rag_chain.invoke("who is Jim?")

print(response.content)

Jim is the brother of the speaker.
