In [9]:
import os

from second_brain.config import settings

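# Export the OpenAI key as an environment variable so that ChatOpenAI
# (used further down) can pick it up.
os.environ["OPENAI_API_KEY"] = settings.OPENAI_API_KEY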

In [10]:
from langchain_mongodb.retrievers import (
    MongoDBAtlasParentDocumentRetriever,
)

from second_brain.application.rag import get_splitter
from second_brain.application.rag.embeddings import EmbeddingModelBuilder

embedding_model = EmbeddingModelBuilder().get_model()
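# Parent-document retrieval: small child chunks are embedded and searched,
# and the larger parent chunks they were cut from are returned as context.
parent_doc_retriever = MongoDBAtlasParentDocumentRetriever.from_connection_string(
    connection_string=settings.MONGODB_URI,
    embedding_model=embedding_model,
    child_splitter=get_splitter(200),  # small chunks used for the vector search
    parent_splitter=get_splitter(800),  # larger chunks returned to the caller
    database_name=settings.MONGODB_DATABASE_NAME,
    collection_name="rag",
    text_key="page_content",
    search_kwargs={"k": 10},  # top-k passed to the vector search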
)
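
To make the child/parent split concrete, here is a minimal, library-free sketch of the idea behind a parent-document retriever. It is not the `langchain_mongodb` implementation: the helper names (`split_text`, `build_index`, `retrieve_parents`) are hypothetical, chunks are cut by character count (the unit used by `get_splitter` is not shown in this notebook), and a shared-word count stands in for embedding similarity.

```python
import re


def split_text(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def words(text: str) -> set[str]:
    """Lowercased word set, used by the toy similarity score below."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs: list[str], parent_size: int, child_size: int):
    """Return (parents, children), where each child is (parent_id, text)."""
    parents: list[str] = []
    children: list[tuple[int, str]] = []
    for doc in docs:
        for parent in split_text(doc, parent_size):
            parent_id = len(parents)
            parents.append(parent)
            children.extend((parent_id, c) for c in split_text(parent, child_size))
    return parents, children


def retrieve_parents(query: str, parents: list[str], children, k: int) -> list[str]:
    """Rank child chunks, then return their parents without duplicates."""
    q = words(query)
    # Toy score: number of shared words (an assumption standing in for
    # vector similarity between embeddings).
    ranked = sorted(children, key=lambda c: len(q & words(c[1])), reverse=True)
    seen, result = set(), []
    for parent_id, _ in ranked[:k]:
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result


docs = [
    "vllm is an inference engine. " * 10,
    "two tower models embed queries and candidates. " * 10,
]
parents, children = build_index(docs, parent_size=80, child_size=20)
hits = retrieve_parents("inference engine", parents, children, k=10)
print(len(children), "children ->", len(hits), "parents")
```

Because several of the top-k child chunks can share one parent, the number of parents returned is often smaller than `k`. That is consistent with the retriever calls later in this notebook returning fewer than 10 documents.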

In [11]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

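# Build the prompt inputs: the retrieved documents are joined into a single
# context string, and the question is passed through unchanged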
retrieve = {
    "context": parent_doc_retriever
    | (lambda docs: "\n\n".join([d.page_content for d in docs])),
    "question": RunnablePassthrough(),
}
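# Note: the trailing backslash joins the first line with {context}, so the
# retrieved context follows the instruction on the same line of the prompt
template = """Answer the question based only on the following context. If no context is provided, respond with I DON'T KNOW: \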
{context}

Question: {question}
"""
# Define the chat prompt
prompt = ChatPromptTemplate.from_template(template)
# Define the model to be used for chat completion
llm = ChatOpenAI(temperature=0, model="gpt-4o-2024-11-20")
# Parse output as a string
parse_output = StrOutputParser()
# Naive RAG chain: retrieve context, fill the prompt, call the model, return text
rag_chain = retrieve | prompt | llm | parse_output
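
For readers new to the `|` composition, the data flow of the chain above can be written as plain function calls. This is only a sketch: `fake_retriever` and `fake_llm` are hypothetical stand-ins (no MongoDB or OpenAI call is made), and `run_chain` is an illustrative name, not a LangChain API. The template text mirrors the one defined above.

```python
from typing import Callable

TEMPLATE = (
    "Answer the question based only on the following context. "
    "If no context is provided, respond with I DON'T KNOW: {context}\n\n"
    "Question: {question}\n"
)


def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for the retriever; returns canned passages."""
    return ["RLHF fine-tunes a model with a reward model.", "PPO is often used."]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the chat model; reports the prompt size."""
    return f"(model reply to a {len(prompt)}-character prompt)"


def run_chain(question: str, retriever: Callable, llm: Callable) -> str:
    # Step 1: both dict branches receive the same input, the raw question.
    context = "\n\n".join(retriever(question))
    inputs = {"context": context, "question": question}
    # Step 2: fill the prompt template with the two values.
    prompt = TEMPLATE.format(**inputs)
    # Step 3: call the model. Step 4: return its reply as a plain string.
    return str(llm(prompt))


print(run_chain("What is RLHF?", fake_retriever, fake_llm))
```

The point of the sketch is that the single input string is used twice: once as the search query for the retriever and once as the `question` slot of the prompt.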

In [12]:
answer = rag_chain.invoke("How can I optimize LLMs for inference?")
print(answer)


Based on the provided context, you can optimize LLMs for inference by using **vllm**, which is described as a high-throughput and memory-efficient inference and serving engine for large language models (LLMs). Additionally, techniques like quantization and LoRA can be employed to reduce memory and compute resource requirements.


In [13]:
answer = rag_chain.invoke("What is RLHF?")
print(answer)

RLHF, or Reinforcement Learning from Human Feedback, is a technique used to fine-tune machine learning models, particularly language models (LMs), by incorporating human feedback. It involves training a reward model based on human annotations, which can include human-generated text or labels of human preferences between model outputs. This reward model is then used to guide the optimization of the language model's behavior, often through reinforcement learning algorithms like PPO (Proximal Policy Optimization). RLHF aims to align model outputs with human preferences, but it is resource-intensive and depends heavily on the quality of human annotations.


In [20]:
answer = rag_chain.invoke("How does Tensorflow Recommenders work?")
print(answer)

TensorFlow Recommenders (TFRS) is a library designed to simplify the development of recommendation systems. It allows developers to build custom models, such as two-tower architectures, for deep retrieval tasks. In the two-tower setup, one neural network tower generates embeddings for queries, while another tower generates embeddings for candidate items. These embeddings are mapped to a shared embedding space, where the similarity between a query and a candidate is determined by calculating the dot product of their embeddings. This approach enables efficient and scalable candidate retrieval by precomputing candidate embeddings and focusing on query embedding computation and similarity search during serving.


In [24]:
# Inspect what the retriever returns: print the first 100 characters of each parent document
for i, doc in enumerate(parent_doc_retriever.invoke("How does Tensorflow Recommenders work?")):
    print(i, "  ", "-" * 100)
    print(doc.page_content[:100])

0    ----------------------------------------------------------------------------------------------------
[Contact sales ](https://cloud.google.com/contact/)[Get started for free ](https://console.cloud.goo
1    ----------------------------------------------------------------------------------------------------
## Background

To meet low latency serving requirements, large-scale recommenders are often deployed


In [25]:
# Same inspection for a second query
for i, doc in enumerate(parent_doc_retriever.invoke("What is RAGAS?")):
    print(i, "  ", "-" * 100)
    print(doc.page_content[:100])

0    ----------------------------------------------------------------------------------------------------
Share

![Advanced Retrieval-Augmented Generation \(RAG\) implements pre-retrieval, retrieval, and po
1    ----------------------------------------------------------------------------------------------------
[![](data:image/svg+xml;charset=utf-8,%3Csvg%20height='1200'%20width='2400'%20xmlns='http://www.w3.o
2    ----------------------------------------------------------------------------------------------------
[![](data:image/svg+xml;charset=utf-8,%3Csvg%20height='670'%20width='1200'%20xmlns='http://www.w3.or
3    ----------------------------------------------------------------------------------------------------
Image By Author

Retrieval Augmented Generation (RAG) has been around for a while, taking many forms
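
Several of the previews above start with Markdown links or inline images, which makes the first 100 characters hard to read. One possible way to get cleaner previews is sketched below. The `preview` helper is hypothetical (it is not part of `second_brain` or LangChain), and its regular expressions only cover simple `![alt](src)` and `[text](url)` forms.

```python
import re

IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")  # ![alt](src)
LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")  # [text](url)


def preview(text: str, width: int = 100) -> str:
    """Return a short, readable preview of Markdown-flavored text."""
    text = IMAGE.sub("", text)       # drop inline images entirely
    text = LINK.sub(r"\1", text)     # keep only the visible link text
    text = " ".join(text.split())    # collapse newlines and repeated spaces
    return text[:width]


sample = (
    "[![](data:image/svg+xml,abc)](https://example.com)\n\n"
    "Image By Author\n\nRetrieval Augmented Generation (RAG) has been around"
)
print(preview(sample))
```

In the inspection loops above, `print(preview(doc.page_content))` could then replace `print(doc.page_content[:100])`. This only changes what is printed, not what the chain sends to the model.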
