# RAG with LangChain

This notebook walks through the following steps:
1. Data ingestion from one repository
2. Splitting (chunking) the documents
3. Embedding the splits and creating a vector database (VB)
4. A simple query against the VB
5. An advanced query using a retriever and chains

---

## Setup

Clone the `sage-website` repository into the notebook's working directory and `pip install` the following:
* langchain
* langchain_community
* unstructured
* markdown
* chromadb (required by the `Chroma` vector store)

In [86]:
from tqdm.notebook import tqdm
# Integrations live in langchain_community; the older `langchain.*` import paths are deprecated
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
from langchain.text_splitter import MarkdownTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

## Data Ingestion
> Integrate code

In [33]:
# Load the Markdown files directly inside docs/about/ (the glob is not recursive)
md_path = "./sage-website/docs/about/"
md_loader = DirectoryLoader(md_path, glob='./*.md', loader_cls=UnstructuredMarkdownLoader)
md_docs = md_loader.load()
md_docs

[Document(metadata={'source': 'sage-website/docs/about/architecture.md'}, page_content='sidebar_label: Architecture sidebar_position: 2\n\nArchitecture\n\nThe cyberinfrastructure consists of coordinating hardware and software services enabling AI at the edge. Below is a quick summary of the different infrastructure pieces, starting at the highest-level and zooming into each component to understand the relationships and role each plays.\n\nHigh-Level Infrastructure\n\nThere are 2 main components of the cyberinfrastructure: - Nodes that exist at the edge - The cloud that hosts services and storage systems to facilitate running “science goals” @ the edge\n\nEvery edge node maintains connections to 2 core cloud components: one to a Beehive and one to a Beekeeper\n\nBeekeeper\n\nThe Beekeeper is an administrative server that allows system administrators to perform actions on the nodes such as gather health metrics and perform software updates. All nodes "phone home" to their Beekeeper and m
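Before loading, it can help to confirm which files the glob actually matches. The following is a minimal sketch using only the standard library; `list_markdown_files` is a hypothetical helper (not part of LangChain), and it only approximates the loader's non-recursive `*.md` matching.

```python
from pathlib import Path


def list_markdown_files(directory, pattern="*.md"):
    """Return the sorted files matching `pattern` directly inside `directory`.

    Hypothetical helper: a rough stand-in for checking what a non-recursive
    glob would pick up. Returns an empty list if the directory is missing.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(p for p in root.glob(pattern) if p.is_file())


# Same path as the loader cell above; prints 0 if the repo is not cloned yet
files = list_markdown_files("./sage-website/docs/about/")
print(len(files), "Markdown files found")
```

If the count is zero, the repository is probably not cloned where the notebook expects it.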

## Splitting
> Test with different splitters

In [55]:
# Break up text into chunks
chunk_size = 300
chunk_overlap = 0
md_splitter = MarkdownTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
splits = md_splitter.split_documents(md_docs)

In [57]:
splits

[Document(metadata={'source': 'sage-website/docs/about/architecture.md'}, page_content='sidebar_label: Architecture sidebar_position: 2\n\nArchitecture'),
 Document(metadata={'source': 'sage-website/docs/about/architecture.md'}, page_content='The cyberinfrastructure consists of coordinating hardware and software services enabling AI at the edge. Below is a quick summary of the different infrastructure pieces, starting at the highest-level and zooming into each component to understand the relationships and role each plays.'),
 Document(metadata={'source': 'sage-website/docs/about/architecture.md'}, page_content='High-Level Infrastructure\n\nThere are 2 main components of the cyberinfrastructure: - Nodes that exist at the edge - The cloud that hosts services and storage systems to facilitate running “science goals” @ the edge'),
 Document(metadata={'source': 'sage-website/docs/about/architecture.md'}, page_content='Every edge node maintains connections to 2 core cloud components: one to 
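To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of a character-based sliding window. This is an assumption-laden illustration, not how `MarkdownTextSplitter` works internally (that splitter also breaks on Markdown structure); `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap=0):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `chunk_overlap` characters. Illustration only:
    a real splitter also tries to break on separators such as headings.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With `chunk_overlap = 0`, as in the cell above, no text is shared between neighboring chunks; a nonzero overlap repeats the tail of each chunk at the start of the next, which can help when an answer straddles a chunk boundary.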

## Embedding & Vector Database Creation
> Try different model and VB

In [61]:
# This model was already installed through rag-basic.ipynb
ollama_emb = OllamaEmbeddings(model='mxbai-embed-large')

In [62]:
# Create a Chroma VB from a list of Documents
# Run `ollama serve` in a terminal first so the embedding model is reachable
vectordb = Chroma.from_documents(splits, ollama_emb)

## Query + Generation
> Can change prompt

In [79]:
llm = Ollama(model='llama3.1')

In [82]:
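# Returns (Document, score) pairs; the score is an L2 distance, so lower means more similar
# An optional `filter` argument can restrict the results by metadata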
question = "What is the Sage project?"
context = vectordb.similarity_search_with_score(query=question, k=4)
context

[(Document(metadata={'source': 'sage-website/docs/about/overview.md'}, page_content='sidebar_label: Overview sidebar_position: 1\n\nSage: A distributed software-defined sensor network.\n\nWhat is Sage?'),
  189.1428985595703),
 (Document(metadata={'source': 'sage-website/docs/about/overview.md'}, page_content='Geographically distributed sensor systems that include cameras, microphones, and weather and air quality stations can generate such large volumes of data that fast and efficient analysis is best performed by an embedded computer connected directly to the sensor. Sage is exploring new techniques for'),
  215.6899871826172),
 (Document(metadata={'source': 'sage-website/docs/about/overview.md'}, page_content='Sage is deploying sensor nodes that support machine learning frameworks in environmental testbeds in California, Colorado, and Kansas and in urban environments in Illinois and Texas. The reusable cyberinfrastructure running on these testbeds will give climate, traffic, and ecos
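The scores above are distances, so the smallest number marks the best match. A toy sketch of distance-based ranking, using made-up 2-D vectors in place of real embeddings, might look like this. The names `l2_distance` and `top_k` are hypothetical, and the exact metric a vector store uses may differ (for example, it may be squared).

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, items, k):
    """Return the k (text, distance) pairs closest to `query_vec`.

    `items` is a list of (text, vector) pairs; results are sorted so the
    smallest distance (most similar) comes first.
    """
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in items]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up vectors standing in for embeddings of three chunks
items = [
    ("architecture", [0.0, 1.0]),
    ("faq", [3.0, 4.0]),
    ("overview", [1.0, 0.0]),
]
print(top_k([1.0, 0.0], items, k=2))
```

Because the ranking is ascending, a cutoff on the score would keep results *below* a threshold, not above it.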

In [83]:
template = "Using this context: {context}. Respond to this question: {question}"
prompt = ChatPromptTemplate.from_template(template)
prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='Using this context: {context}. Respond to this question: {question}'))])

In [94]:
# Feed the retrieved (Document, score) pairs to the LLM as-is; the context is not formatted or summarized first
simple_chain = prompt | llm
print(simple_chain.invoke({"context": context, "question": question}))

According to the provided context, the Sage project appears to be a distributed software-defined sensor network. It involves deploying geographically distributed sensors that collect various types of data (e.g., environmental conditions, audio recordings) and utilizing machine learning frameworks on these nodes for analysis. The goal is to facilitate efficient and fast processing of this data through edge compute applications and cloud-based databases.


## Query + Chain Generation
> Can make retriever more specific

In [97]:
# TODO: integrate this into the chains below so the prompt receives plain text instead of Document objects
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
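To see why `format_docs` matters, the sketch below compares the stringified documents with the joined page contents. `FakeDocument` is a hypothetical stand-in for LangChain's `Document`, used here so the example runs without any dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    # Same helper as in the cell above: keep only the text, drop the metadata
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    FakeDocument("Sage: A distributed software-defined sensor network.",
                 {"source": "overview.md"}),
    FakeDocument("What is Sage?", {"source": "overview.md"}),
]

raw_context = str(docs)            # includes class names and metadata
clean_context = format_docs(docs)  # plain text only

print(raw_context)
print("---")
print(clean_context)
```

In the chains in this section, one common way to wire the helper in would be `"context": retriever | format_docs`, assuming the installed LangChain version supports piping a retriever into a plain function.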

In [90]:
# Retrieve only the closest chunk (k=1); the LLM then essentially summarizes what was retrieved
retriever = vectordb.as_retriever(search_type='similarity', search_kwargs={"k": 1})

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is the project Sage?"))

According to the provided context, Sage is a "distributed software-defined sensor network".


In [93]:
# Same chain with k=4: more retrieved chunks give the LLM more context to summarize
retriever = vectordb.as_retriever(search_type='similarity', search_kwargs={"k": 4})

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is the project Sage?"))

According to the provided context, Sage is described as:

"...Geographically distributed sensor systems that include cameras, microphones, and weather and air quality stations..."

Additionally, it's mentioned that Sage is "exploring new techniques for" analyzing data from these systems.

However, a more comprehensive description of Sage is found in another document, which states that Sage is:

"A distributed software-defined sensor network."

This suggests that Sage is an initiative or project focused on developing and deploying a decentralized, software-driven sensor network to collect and analyze data from various environmental sensors.


In [92]:
# Same chain with k=10: the answer draws on many more chunks
retriever = vectordb.as_retriever(search_type='similarity', search_kwargs={"k": 10})

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is the project Sage?"))

According to the provided documents, Sage is a distributed software-defined sensor network that explores new techniques for efficient data analysis in environmental and urban settings. The project involves deploying sensor nodes that support machine learning frameworks in various testbeds across different states in the US. Sage aims to provide climate, traffic, and ecosystem scientists with reusable cyberinfrastructure running on these testbeds.

In simpler terms, Sage is a project that uses distributed sensors and software-defined networks to collect and analyze large volumes of data related to environmental and urban conditions, enabling more efficient and effective research and decision-making.
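The three cells above differ only in `k`, so the comparison could be expressed as a loop. The sketch below shows that shape with stub functions in place of the real retriever and LLM; `answer_with_k`, `fake_retrieve`, `fake_generate`, and the shortened `CHUNKS` strings are all hypothetical placeholders, not the notebook's actual components.

```python
TEMPLATE = "Using this context: {context}. Respond to this question: {question}"

# Shortened placeholder chunks, not the full text stored in the VB
CHUNKS = [
    "Sage: A distributed software-defined sensor network.",
    "Sage is deploying sensor nodes that support machine learning frameworks.",
    "Every edge node maintains connections to 2 core cloud components.",
]


def fake_retrieve(question, k):
    """Stub retriever: returns the first k chunks regardless of the question."""
    return CHUNKS[:k]


def fake_generate(prompt):
    """Stub LLM: echoes the prompt so the effect of k is visible."""
    return prompt


def answer_with_k(retrieve, generate, question, k):
    """Retrieve k chunks, fill the prompt template, and pass it to `generate`."""
    context = "\n\n".join(retrieve(question, k))
    return generate(TEMPLATE.format(context=context, question=question))


for k in (1, 2, 3):
    answer = answer_with_k(fake_retrieve, fake_generate,
                           "What is the project Sage?", k)
    print(f"k={k}: prompt length {len(answer)} characters")
```

Swapping the stubs for a real retriever and LLM call would make it easy to compare answers side by side; a larger `k` gives the model more context, at the cost of a longer prompt.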
