# Enhanced LangChain RAG

This notebook shows the following:
1. loading Markdown, Python, and Jupyter Notebook files from three different Sage repos
2. embedding that data with the "mxbai-embed-large" model and storing it in a Chroma vector database
3. querying that vector database with RAG chains using llama3.1
4. testing different prompt templates
5. testing code generation

---

## Pipeline

#### Imports

In [None]:
from tqdm.notebook import tqdm
from langchain.vectorstores import Chroma
from langchain.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader, PythonLoader, NotebookLoader
from langchain.text_splitter import MarkdownTextSplitter, RecursiveCharacterTextSplitter, PythonCodeTextSplitter
from langchain.embeddings import OllamaEmbeddings
from langchain.llms import Ollama
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

#### Data Ingestion
Using specific data loaders

In [None]:
paths = ["./sage-website/docs/", "./sage-data-client/", "./pywaggle/"]

In [None]:
# Recursively loads all files matching "glob" from each directory in "paths" using the given "loader_cls"
# Returns a combined list of Documents from all paths
def repo_class_loader(paths, glob, loader_cls):
    combined_docs = []
    for path in paths:
        dir_loader = DirectoryLoader(path, glob=glob, loader_cls=loader_cls, recursive=True)
        docs = dir_loader.load()
        combined_docs.extend(docs)

    return combined_docs
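
Before loading, it can help to check how many files each glob is likely to pick up, since a mistyped path would otherwise just contribute zero documents. The sketch below uses only the standard library; `count_matching_files` is a hypothetical helper, not part of LangChain, and its counts may differ slightly from what `DirectoryLoader` ends up loading.

```python
from collections import Counter
from pathlib import Path


def count_matching_files(paths, patterns=("*.md", "*.py", "*.ipynb")):
    """Count files under each path that match each glob pattern, recursively."""
    counts = Counter()
    for path in paths:
        root = Path(path)
        if not root.is_dir():
            # A missing checkout would otherwise silently contribute nothing
            print(f"warning: {path} is not a directory")
            continue
        for pattern in patterns:
            counts[pattern] += sum(1 for p in root.rglob(pattern) if p.is_file())
    return counts
```

Calling `count_matching_files(paths)` with the `paths` list above gives a rough per-type file count to compare against the lengths of the loaded document lists.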

In [None]:
%%time

# Load all Markdown files
md_docs = repo_class_loader(paths, './*.md', UnstructuredMarkdownLoader)
md_docs


CPU times: user 1.06 s, sys: 68.8 ms, total: 1.13 s
Wall time: 1.18 s


[Document(metadata={'source': 'sage-website/docs/contact-us.md'}, page_content='sidebar_label: Contact us\n\nContact us\n\nEmail\n\nFor support, general questions, or comments, you can always reach us at:\n\nsupport@waggle-edge.ai\n\nMessage Board\n\nWe also encourage developers and users to start a new topic or issue on the Waggle sensor message board:\n\nGitHub Discussions'),
 Document(metadata={'source': 'sage-website/docs/reference-guides/sesctl.md'}, page_content="sidebar_label: sesctl sidebar_position: 2\n\nsesctl: a tool to schedule jobs in Waggle edge computing\n\nThe tool sesctl is a command-line tool that communicates with an Edge scheduler in the cloud to manage user jobs. Users can create, edit, submit, suspend, and remove jobs via the tool.\n\nInstallation\n\nThe tool can be downloaded from the edge scheduler repository and be run on person's desktop or laptop.\n\n:::note Please make sure to download the correct version of the tool based on the system architecture. For exa

In [None]:
%%time

# Load all python files
py_docs = repo_class_loader(paths, './*.py', PythonLoader)
py_docs


CPU times: user 9.31 ms, sys: 12 ms, total: 21.3 ms
Wall time: 37.4 ms


[Document(metadata={'source': 'sage-data-client/tests/test_query.py'}, page_content='import unittest\nimport sage_data_client\nfrom io import BytesIO\nfrom datetime import datetime, timedelta\nimport pandas as pd\n\n\nclass TestQuery(unittest.TestCase):\n    def assertValueResponse(self, df):\n        self.assertIn("name", df.columns)\n        df.name.str\n        self.assertIn("timestamp", df.columns)\n        df.timestamp.dt\n        self.assertIn("value", df.columns)\n\n    def test_empty_response(self):\n        self.assertValueResponse(\n            sage_data_client.query(\n                start="-2000d",\n                filter={\n                    "name": "should.not.every.exist.XYZ",\n                },\n            )\n        )\n\n    def test_check_one_of_head_or_tail(self):\n        with self.assertRaises(ValueError):\n            sage_data_client.query(\n                start="2021-01-01T10:30:00",\n                end="2021-01-01T10:31:00",\n                head=3,\n    

In [None]:
%%time

# Load all notebook files
ipynb_docs = repo_class_loader(paths, './*.ipynb', NotebookLoader)
ipynb_docs


CPU times: user 70.9 ms, sys: 49.8 ms, total: 121 ms
Wall time: 166 ms


[Document(metadata={'source': 'sage-data-client/examples/plotting_example.ipynb'}, page_content='\'markdown\' cell: \'[\'# Basic Plotting Example\']\'\n\n\'code\' cell: \'[\'import sage_data_client\']\'\n\n\'markdown\' cell: \'["First, we\'ll query the last 7 days of temperature data from W022\'s BME680 sensor."]\'\n\n\'code\' cell: \'[\'df = sage_data_client.query(\\n\', \'    start="-7d",\\n\', \'    filter={\\n\', \'        "name": "env.temperature",\\n\', \'        "vsn": "W022",\\n\', \'        "sensor": "bme680",\\n\', \'    }\\n\', \')\']\'\n\n\'markdown\' cell: \'["Next, we\'ll plot a simple line chart."]\'\n\n\'code\' cell: \'[\'df.set_index("timestamp").value.plot()\']\'\n\n\'markdown\' cell: \'["Finally, we\'ll plot the temperature distribution."]\'\n\n\'code\' cell: \'[\'df.value.hist(bins=100)\']\'\n\n\'code\' cell: \'[]\'\n\n'),

#### Splitting
Using specific text splitters

In [None]:
chunk_size = 2000
chunk_overlap = 0
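
To make the two parameters concrete, here is a deliberately naive fixed-width splitter. It is not the algorithm LangChain uses (the splitters below cut on separators such as headings, class/function boundaries, and newlines, then merge pieces up to `chunk_size`); `naive_split` is a hypothetical helper that only illustrates what "size" and "overlap" mean.

```python
def naive_split(text, chunk_size, chunk_overlap=0):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "abcdefghij"
print(naive_split(text, chunk_size=4, chunk_overlap=0))
# ['abcd', 'efgh', 'ij']
print(naive_split(text, chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `chunk_overlap = 0`, as used here, no text is repeated between neighbouring chunks, so a sentence that straddles a boundary is only ever seen in two halves.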

In [None]:
%%time

# Split Markdown files
md_splitter = MarkdownTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
md_splits = md_splitter.split_documents(md_docs)
md_splits

CPU times: user 4.06 ms, sys: 769 μs, total: 4.83 ms
Wall time: 4.88 ms


[Document(metadata={'source': 'sage-website/docs/contact-us.md'}, page_content='sidebar_label: Contact us\n\nContact us\n\nEmail\n\nFor support, general questions, or comments, you can always reach us at:\n\nsupport@waggle-edge.ai\n\nMessage Board\n\nWe also encourage developers and users to start a new topic or issue on the Waggle sensor message board:\n\nGitHub Discussions'),
 Document(metadata={'source': 'sage-website/docs/reference-guides/sesctl.md'}, page_content="sidebar_label: sesctl sidebar_position: 2\n\nsesctl: a tool to schedule jobs in Waggle edge computing\n\nThe tool sesctl is a command-line tool that communicates with an Edge scheduler in the cloud to manage user jobs. Users can create, edit, submit, suspend, and remove jobs via the tool.\n\nInstallation\n\nThe tool can be downloaded from the edge scheduler repository and be run on person's desktop or laptop.\n\n:::note Please make sure to download the correct version of the tool based on the system architecture. For exa

In [None]:
%%time

# Split Python files
py_splitter = PythonCodeTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
py_splits = py_splitter.split_documents(py_docs)
py_splits

CPU times: user 3.87 ms, sys: 726 μs, total: 4.6 ms
Wall time: 5.14 ms


[Document(metadata={'source': 'sage-data-client/tests/test_query.py'}, page_content='import unittest\nimport sage_data_client\nfrom io import BytesIO\nfrom datetime import datetime, timedelta\nimport pandas as pd'),
 Document(metadata={'source': 'sage-data-client/tests/test_query.py'}, page_content='class TestQuery(unittest.TestCase):\n    def assertValueResponse(self, df):\n        self.assertIn("name", df.columns)\n        df.name.str\n        self.assertIn("timestamp", df.columns)\n        df.timestamp.dt\n        self.assertIn("value", df.columns)\n\n    def test_empty_response(self):\n        self.assertValueResponse(\n            sage_data_client.query(\n                start="-2000d",\n                filter={\n                    "name": "should.not.every.exist.XYZ",\n                },\n            )\n        )\n\n    def test_check_one_of_head_or_tail(self):\n        with self.assertRaises(ValueError):\n            sage_data_client.query(\n                start="2021-01-01T10

In [None]:
%%time

# Split Notebook files
ipynb_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
ipynb_splits = ipynb_splitter.split_documents(ipynb_docs)
ipynb_splits

CPU times: user 448 μs, sys: 16 μs, total: 464 μs
Wall time: 527 μs


[Document(metadata={'source': 'sage-data-client/examples/plotting_example.ipynb'}, page_content='\'markdown\' cell: \'[\'# Basic Plotting Example\']\'\n\n\'code\' cell: \'[\'import sage_data_client\']\'\n\n\'markdown\' cell: \'["First, we\'ll query the last 7 days of temperature data from W022\'s BME680 sensor."]\'\n\n\'code\' cell: \'[\'df = sage_data_client.query(\\n\', \'    start="-7d",\\n\', \'    filter={\\n\', \'        "name": "env.temperature",\\n\', \'        "vsn": "W022",\\n\', \'        "sensor": "bme680",\\n\', \'    }\\n\', \')\']\'\n\n\'markdown\' cell: \'["Next, we\'ll plot a simple line chart."]\'\n\n\'code\' cell: \'[\'df.set_index("timestamp").value.plot()\']\'\n\n\'markdown\' cell: \'["Finally, we\'ll plot the temperature distribution."]\'\n\n\'code\' cell: \'[\'df.value.hist(bins=100)\']\'\n\n\'code\' cell: \'[]\''),
 Document(metadata={'source': 'sage-data-client/examples/contrib/geospatial-mapping-example.ipynb'}, page_content='\'markdown\' cell: \'[\'### Invest

In [None]:
# combine splits
combined_splits = md_splits + py_splits + ipynb_splits
combined_splits

[Document(metadata={'source': 'sage-website/docs/contact-us.md'}, page_content='sidebar_label: Contact us\n\nContact us\n\nEmail\n\nFor support, general questions, or comments, you can always reach us at:\n\nsupport@waggle-edge.ai\n\nMessage Board\n\nWe also encourage developers and users to start a new topic or issue on the Waggle sensor message board:\n\nGitHub Discussions'),
 Document(metadata={'source': 'sage-website/docs/reference-guides/sesctl.md'}, page_content="sidebar_label: sesctl sidebar_position: 2\n\nsesctl: a tool to schedule jobs in Waggle edge computing\n\nThe tool sesctl is a command-line tool that communicates with an Edge scheduler in the cloud to manage user jobs. Users can create, edit, submit, suspend, and remove jobs via the tool.\n\nInstallation\n\nThe tool can be downloaded from the edge scheduler repository and be run on person's desktop or laptop.\n\n:::note Please make sure to download the correct version of the tool based on the system architecture. For exa
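
Printing the full list is hard to scan, so a small summary may be more useful as a sanity check. The sketch below assumes only that each split exposes `metadata["source"]` and `page_content`, as the Documents above do; `summarize_splits` and the demo data are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath
from types import SimpleNamespace


def summarize_splits(splits):
    """Return (chunks per file extension, length of the longest chunk)."""
    per_extension = Counter(
        PurePosixPath(split.metadata["source"]).suffix for split in splits
    )
    longest = max((len(split.page_content) for split in splits), default=0)
    return per_extension, longest


# Stand-ins with the same two attributes as a LangChain Document (made-up data)
demo_splits = [
    SimpleNamespace(metadata={"source": "docs/a.md"}, page_content="x" * 120),
    SimpleNamespace(metadata={"source": "docs/b.md"}, page_content="x" * 80),
    SimpleNamespace(metadata={"source": "tests/test_query.py"}, page_content="x" * 300),
]
per_extension, longest = summarize_splits(demo_splits)
print(dict(per_extension), longest)  # {'.md': 2, '.py': 1} 300
```

Running `summarize_splits(combined_splits)` should show how the chunks are distributed across file types and confirm that no chunk exceeds `chunk_size`.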

#### Embedding & Vector Database Creation
> Try a different model and vector database

In [None]:
# This model was already installed through rag-basic.ipynb
ollama_emb = OllamaEmbeddings(model='mxbai-embed-large')

In [None]:
%%time

# Creates a Chroma vector database from a list of Documents
# Make sure to run `ollama serve` in a terminal first to host the model
# ~399 embeddings
vectordb = Chroma.from_documents(combined_splits, ollama_emb)

print("\n\n")




CPU times: user 1.89 s, sys: 374 ms, total: 2.26 s
Wall time: 8min 48s


---

## Testing Different Templates

#### Setup

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
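
`format_docs` drops the metadata, so the model never sees where a chunk came from. One possible variant, sketched here as an assumption rather than something used in this notebook, labels each chunk with its source path; `format_docs_with_sources` is a hypothetical name.

```python
from types import SimpleNamespace


def format_docs_with_sources(docs):
    """Join chunks like format_docs, but prefix each one with its source path."""
    return "\n\n".join(
        f"[source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    )


# Stand-in for a LangChain Document (made-up content)
demo_doc = SimpleNamespace(
    metadata={"source": "sage-website/docs/contact-us.md"},
    page_content="Contact us",
)
print(format_docs_with_sources([demo_doc]))
```

It can be swapped into any of the chains below in place of `format_docs`, at the cost of a slightly longer prompt.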

In [None]:
llm = Ollama(model='llama3.1')

In [None]:
retriever = vectordb.as_retriever(search_type='similarity', search_kwargs={"k": 5})
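
Conceptually, a similarity retriever embeds the question and returns the `k` stored chunks whose vectors are closest to it. The toy sketch below uses cosine similarity over hand-written 2-D vectors; the distance metric Chroma actually applies depends on how the collection is configured, and `cosine` and `top_k` are hypothetical helpers.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k):
    """indexed is a list of (text, vector); return the k texts closest to the query."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up vectors standing in for real embeddings
indexed = [("sky", [1.0, 0.0]), ("sage", [0.0, 1.0]), ("waggle", [0.1, 0.9])]
print(top_k([0.0, 1.0], indexed, k=2))  # ['sage', 'waggle']
```

Note that a top-`k` search always returns `k` chunks, however unrelated the question is, so the prompt has to tell the model what to do when the context does not help.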

#### Simple Prompt

In [None]:
## Simple template and prompt
simple_template = "Using the following this context: {context}. Respond to this question: {question}"
simple_prompt = ChatPromptTemplate.from_template(simple_template)
simple_prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='Using the following this context: {context}. Respond to this question: {question}'))])

In [None]:
%%time

## Simple Chain About Sage
simple_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | simple_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in simple_chain.stream("What is the Sage project?"):
    print(chunk, end="", flush=True)

print("\n\n")

The Sage project appears to be a distributed software-defined sensor network, as stated in the context provided.


CPU times: user 79.5 ms, sys: 42.8 ms, total: 122 ms
Wall time: 1min 17s
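
The pipe syntax can hide what happens at each step. Written as plain function calls, the chain above roughly does the following. This is a simplified sketch: the real `ChatPromptTemplate` produces chat messages rather than a plain string, streaming is omitted, and `run_rag`, `fake_retrieve`, and `fake_llm` are hypothetical stand-ins so the example runs without Ollama or Chroma.

```python
def run_rag(question, retrieve, format_docs, template, llm):
    """One pass through the pipeline: retrieve, format, fill the template, generate."""
    docs = retrieve(question)                       # retriever
    context = format_docs(docs)                     # retriever | format_docs
    prompt = template.format(context=context, question=question)  # prompt step
    return llm(prompt)                              # llm | StrOutputParser()


def fake_retrieve(question):
    # Always returns the same made-up chunk
    return ["Sage is a distributed software-defined sensor network."]


def fake_llm(prompt):
    # Reports the prompt size instead of generating text
    return f"(model saw {len(prompt)} characters)"


template = "Using this context: {context}. Respond to this question: {question}"
print(run_rag("What is the Sage project?", fake_retrieve, "\n\n".join, template, fake_llm))
```

The question is used twice, once to retrieve context and once verbatim in the prompt, which is what the `RunnablePassthrough()` branch of the dictionary expresses.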


In [None]:
%%time

## Simple Chain NOT About Sage
simple_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | simple_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in simple_chain.stream("Why is the sky blue?"):
    print(chunk, end="", flush=True)

print("\n\n")

You didn't ask a question about Chirpstack or the provided text. You appeared to be presenting a block of text and then asked "Why is the sky blue?", which isn't related to the content.

If you'd like to ask something specific, I'd be happy to help with your question about Chirpstack!


CPU times: user 189 ms, sys: 96.4 ms, total: 286 ms
Wall time: 2min 35s


#### Enhanced Prompt

In [None]:
## Enhanced template and prompt
enhanced_template = """You are an assistant in question-answering tasks.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Always say "thanks for asking!" at the end of the answer.

{context}

Question: {question}"""

enhanced_prompt = ChatPromptTemplate.from_template(enhanced_template)
enhanced_prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='You are an assistant in question-answering tasks.\nUse the following pieces of context to answer the question at the end.\nIf you don\'t know the answer, just say that you don\'t know, don\'t try to make up an answer.\nAlways say "thanks for asking!" at the end of the answer.\n\n{context}\n\nQuestion: {question}'))])

In [None]:
%%time

## Enhanced Chain About Sage
enhanced_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | enhanced_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in enhanced_chain.stream("What is the Sage project?"):
    print(chunk, end="", flush=True)

print("\n\n")

The Sage project is a distributed software-defined sensor network, which enables fast and efficient analysis of large volumes of data from geographically distributed sensors, such as cameras, microphones, and weather stations. It uses machine learning algorithms to process data and transmits results over the network to central computer servers.

thanks for asking!


CPU times: user 147 ms, sys: 65.9 ms, total: 213 ms
Wall time: 1min 20s


In [None]:
%%time

## Enhanced Chain NOT About Sage
enhanced_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | enhanced_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in enhanced_chain.stream("Why is the sky blue?"):
    print(chunk, end="", flush=True)

print("\n\n")

I don't know. 

Thanks for asking!


CPU times: user 95.6 ms, sys: 56 ms, total: 152 ms
Wall time: 2min 12s
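
The sign-off instruction gives a cheap way to check whether the model followed the template. The capitalization varies between runs ("thanks" above, "Thanks" here), so a case-insensitive comparison seems safer. `ends_with_signoff` is a hypothetical helper, not part of LangChain.

```python
def ends_with_signoff(answer, signoff="thanks for asking!"):
    """True if the answer ends with the sign-off, ignoring case and trailing whitespace."""
    return answer.strip().lower().endswith(signoff.lower())


print(ends_with_signoff("I don't know. \n\nThanks for asking!\n"))  # True
print(ends_with_signoff("The Sage project appears to be a sensor network."))  # False
```

Collecting the streamed chunks into one string, for example with `"".join(chain.stream(question))`, gives a value that can be passed to this check.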


#### Spanish Prompt

In [None]:
## Enhanced template and prompt
spanish_template = """You are an assistant in question-answering tasks that speaks Spanish.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Give your answer in Spanish.

{context}

Question: {question}"""

spanish_prompt = ChatPromptTemplate.from_template(spanish_template)
spanish_prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant in question-answering tasks that speaks Spanish.\nUse the following pieces of context to answer the question at the end.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\nGive your answer in Spanish.\n\n{context}\n\nQuestion: {question}"))])

In [None]:
%%time

## Spanish Chain About Sage
spanish_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | spanish_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in spanish_chain.stream("What is the Sage project?"):
    print(chunk, end="", flush=True)

print("\n\n")

La pregunta es sobre el proyecto "Sage". Según la información proporcionada, Sage es una red de sensores definida por software distribuida. Es un sistema geográficamente distribuido que incluye cámaras, microfones y estaciones meteorológicas y de calidad del aire, entre otros. El objetivo de Sage es explorar nuevas técnicas para aplicar algoritmos de aprendizaje automático a los datos de sensores inteligentes y crear software reutilizable que pueda ejecutar programas en la computadora embebida y transmitir los resultados a servidores de ordenadores centrales.

En resumen, el proyecto Sage es una red de sensores distribuida que utiliza técnicas de aprendizaje automático para analizar datos de diferentes fuentes y proporcionar información valiosa para científicos que estudian la impacto de fenómenos naturales y urbanización en ecosistemas naturales y infraestructura urbana.


CPU times: user 361 ms, sys: 132 ms, total: 493 ms
Wall time: 2min 4s


---

## Testing More Questions

In [None]:
%%time

## Enhanced Chain
enhanced_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | enhanced_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in enhanced_chain.stream("What is the role of Waggle in Sage?"):
    print(chunk, end="", flush=True)

print("\n\n")

Waggle provides a number of interfaces which other computing and HPC systems can build on top of, allowing for monitoring data from the edge and triggering actions when values exceed a threshold or an unusual event is detected. It also enables running plugins @ the edge through its operating system, which includes k3s to create a protected & isolated run-time environment.

thanks for asking!


CPU times: user 186 ms, sys: 126 ms, total: 312 ms
Wall time: 3min 16s


In [None]:
%%time

## Enhanced Chain
enhanced_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | enhanced_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in enhanced_chain.stream("How can I get in contact with the Sage team?"):
    print(chunk, end="", flush=True)

print("\n\n")

According to the provided document, it's recommended that you start by reading the Sage Overview in the Sage docs to understand what Sage is and how it might connect to your work. It also suggests ensuring that your team members have accounts in the Sage Portal, and reviewing the Edge apps tutorial in the Sage docs.

Additionally, it's mentioned that during the Hackathon, they will review how to use the portal and parts of the Edge app tutorial. However, having done some preliminary work ahead of time will allow more time for their team to provide support for your unique application.

So, to get in contact with the Sage team, you should start by following these steps:

1. Read the Sage Overview in the Sage docs.
2. Ensure that your team members have accounts in the Sage Portal.
3. Review the Edge apps tutorial in the Sage docs.

By doing so, you'll be well-prepared for the Hackathon and can get support from their team.

Thanks for asking!


CPU times: user 403 ms, sys: 181 ms, total: 5

#### Changing the prompt and retriever for better results

In [None]:
## Sage template and prompt
## Changed to be extremely specific
sage_template = """You are an expert on the Sage project, and you only answer any question related to it.
Use the following pieces of context about the Sage project to answer the question at the end.
Provide as much detail as possible in answering the question from the provided context.
Your answer should be super informative.
If you don't know the answer or the question is unrelated to Sage, just say that you don't know, don't try to make up an answer.
If asked to provide code, only generate code provided in the context.
Always say "Thanks for asking!" at the end of the answer.

{context}

Question: {question}"""

sage_prompt = ChatPromptTemplate.from_template(sage_template)
sage_prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='You are an expert on the Sage project, and you only answer any question related to it.\nUse the following pieces of context about the Sage project to answer the question at the end.\nProvide as much detail as possible in answering the question from the provided context.\nYour answer should be super informative.\nIf you don\'t know the answer or the question is unrelated to Sage, just say that you don\'t know, don\'t try to make up an answer.\nIf asked to provide code, only generate code provided in the context.\nAlways say "Thanks for asking!" at the end of the answer.\n\n{context}\n\nQuestion: {question}'))])

In [None]:
sage_retriever = vectordb.as_retriever(search_type='similarity', search_kwargs={"k":15})
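
Raising `k` from 5 to 15 triples the amount of context sent with every question. A rough upper bound on the context length can be estimated from `chunk_size` and the separator used in `format_docs`. This is a back-of-the-envelope sketch: real chunks are usually shorter than `chunk_size`, and whether the result fits the model's context window depends on how Ollama is configured.

```python
def max_context_chars(k, chunk_size, separator="\n\n"):
    """Upper bound on the characters produced by joining k chunks of at most chunk_size."""
    return k * chunk_size + max(k - 1, 0) * len(separator)


print(max_context_chars(5, 2000))   # 10008
print(max_context_chars(15, 2000))  # 30028
```

The larger prompt is a plausible contributor to the longer wall times in the cells that use `sage_retriever`.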

In [None]:
%%time

## Sage Chain
sage_chain = (
    {"context": sage_retriever | format_docs, "question": RunnablePassthrough()}
    | sage_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in sage_chain.stream("Are there any specific examples on how Sage nodes are used?"):
    print(chunk, end="", flush=True)

print("\n\n")

Yes, according to the provided context, the Wild Sage Node (or Wild Waggle Node) is mentioned as a custom built weather-proof enclosure intended for remote outdoor installation. It features software and hardware resilience via a custom operating system and custom circuit board.

Additionally, there's an example of using Sage Data Client to watch the data stream and print nodes where the internal temperature exceeds a threshold. This suggests that Sage nodes are used in various settings, possibly for monitoring environmental conditions or other types of data.

The Node installation manual (https://sagecontinuum.org/docs/installation-manuals/wsn-manual) also mentions details on how to install and use Wild Sage Nodes, which implies that these devices have practical applications in real-world scenarios.

Thanks for asking!


CPU times: user 335 ms, sys: 129 ms, total: 464 ms
Wall time: 5min 27s


In [None]:
%%time

## Sage Chain
sage_chain = (
    {"context": sage_retriever | format_docs, "question": RunnablePassthrough()}
    | sage_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in sage_chain.stream("Can you give me a breakdown of the Sage infrastructure?"):
    print(chunk, end="", flush=True)

print("\n\n")

Based on the provided text, here is a breakdown of the Sage infrastructure:

1. **Waggle**: A computing platform that provides a number of interfaces which other computing and HPC systems can build on top.
	* Allows for cloud compute & HPC on edge data
2. **Sage Data Client**: An official Python API client for interacting with the Sage data service.
	* Provides a simple query function to talk to the data API
	* Returns results in an easy-to-use Pandas DataFrame
3. **Globus**: A login system used by Sage, which requires organization credentials for access.
4. **ECR (Elastic Container Registry)**: A container registry service where apps can be published and deployed.
5. **Nodes**: Physical devices that support machine learning frameworks and store data in environmental testbeds or urban environments.
6. **Sage Data Service**: A platform that provides a data API for querying and retrieving data from nodes.

The Sage infrastructure seems to be designed for managing large volumes of sensor 

---

## Testing Code Generation

In [None]:
%%time

## Sage Chain
sage_chain = (
    {"context": sage_retriever | format_docs, "question": RunnablePassthrough()}
    | sage_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in sage_chain.stream("Can you provide me code that captures an image and publishes it?"):
    print(chunk, end="", flush=True)

print("\n\n") # Result: Perfect code!

Here's the code snippet from the context:

```python
import numpy as np
from waggle.plugin import Plugin
from waggle.data.vision import Camera

def main():
    with Plugin() as plugin:
        # open camera and take snapshot
        with Camera() as camera:
            snapshot = camera.snapshot()

            # compute mean color
            mean_color = compute_mean_color(snapshot.data)

            # publish mean color
            plugin.publish("color.mean.r", mean_color[0], timestamp=snapshot.timestamp)
            plugin.publish("color.mean.g", mean_color[1], timestamp=snapshot.timestamp)
            plugin.publish("color.mean.b", mean_color[2], timestamp=snapshot.timestamp)

if __name__ == "__main__":
    main()
```

This code captures an image using the `Camera` class, computes its mean color, and publishes it with a message name of "color.mean.r", "color.mean.g", and "color.mean.b" respectively. 

Please note that you might need to install the required libraries (`numpy`, `wag

In [None]:
%%time

long_question = """
Can you provide me all the necessary code for two plugins? 
The first plugin captures an image from the camera, processes it with the function “count_pedestrians” and publishes the number with the name “number.pedestrians”. 
The second plugin subscribes to that data stream, captures a picture, runs the same “count_pedestrians” function, and prints out which image had the most pedestrians with the associated timestamp. 
Write it as two separate files. 
"""


## Sage Chain
sage_chain = (
    {"context": sage_retriever | format_docs, "question": RunnablePassthrough()}
    | sage_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in sage_chain.stream(long_question):
    print(chunk, end="", flush=True)

print("\n\n") # Result: almost perfect. Did not include timestamps and no comparison logic

Here are the two plugins written in Python:

**Plugin 1: `camera_plugin.py`**
```python
import numpy as np
from waggle.plugin import Plugin
from waggle.data.vision import Camera

def count_pedestrians(image):
    # TO DO: implement pedestrian counting function
    return np.random.randint(0, 100)  # placeholder for now

with Plugin() as plugin:
    camera = Camera()
    snapshot = camera.snapshot()
    pedestrians = count_pedestrians(snapshot.data)
    plugin.publish("number.pedestrians", pedestrians)
```

**Plugin 2: `subscriber_plugin.py`**
```python
import numpy as np
from waggle.plugin import Plugin, Subscriber
from waggle.data.vision import Camera

def count_pedestrians(image):
    # TO DO: implement pedestrian counting function
    return np.random.randint(0, 100)  # placeholder for now

with Plugin() as plugin:
    subscriber = Subscriber()
    while True:
        data = subscriber.subscribe("number.pedestrians")
        if data is not None:
            snapshot = Camera().snaps

In [None]:
%%time

long_question = """
Write me a plugin that accesses stored bird recordings and compares them to what the current node is picking up on a microphone to detect if there are any similar birds. 
These results are published. You may assume the function to compare is given as compare_birds.
"""


## Sage Chain
sage_chain = (
    {"context": sage_retriever | format_docs, "question": RunnablePassthrough()}
    | sage_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in sage_chain.stream(long_question):
    print(chunk, end="", flush=True)

print("\n\n") # Result: Gave rough outline and made up a few functions

Here's an example of how you could implement such a plugin:

```python
class BirdWatcherPlugin(Plugin):
    def __init__(self, name="BirdWatcher"):
        self.name = name
        self.bird_database = None

    def setup(self, camera):
        self.bird_database = DataRepository.get_instance().get_bird_database()

    def process_sample(self, sample):
        # Compare the current microphone data with stored bird recordings
        similar_birds = []
        for bird in self.bird_database:
            if compare_birds(sample.data, bird['audio_data']):
                similar_birds.append(bird['name'])

        # Publish the results
        plugin.upload_string(f"Detected {len(similar_birds)} birds: {', '.join(similar_birds)}")
```

You would need to implement `compare_birds` function to compare two audio samples and return True if they are similar, False otherwise.

Also, this example assumes that you have a `DataRepository.get_instance()` method to get the instance of the Data Reposi

In [None]:
%%time

## Sage Chain
sage_chain = (
    {"context": sage_retriever | format_docs, "question": RunnablePassthrough()}
    | sage_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in sage_chain.stream("Can you provide me the code to make a first-person shooter video game?"):
    print(chunk, end="", flush=True)

print("\n\n")

I'm afraid I can't do that. Creating a fully-fledged first-person shooter (FPS) game is a complex task that requires significant expertise in programming, computer graphics, game design, and sound design. 

However, if you'd like, I can provide some general information or point you to resources about FPS games.

Alternatively, if you have any specific questions about how to implement certain aspects of an FPS game (e.g., collision detection, camera control, etc.), I'll do my best to help with that!


CPU times: user 306 ms, sys: 188 ms, total: 493 ms
Wall time: 6min 50s


In [None]:
%%time

## Sage Chain
sage_chain = (
    {"context": sage_retriever | format_docs, "question": RunnablePassthrough()}
    | sage_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in sage_chain.stream("Can you provide me code that makes a small blue box in python?"):
    print(chunk, end="", flush=True)

print("\n\n") # Result: still generated code even though the prompt says not to answer questions unrelated to Sage

Here's an example of how to create a small blue box using Python and the Tkinter library:

```python
import tkinter as tk

# Create the main window
root = tk.Tk()

# Set the size of the window
root.geometry("200x100")

# Create a blue rectangle (box)
blue_box = tk.Frame(root, bg="blue", width=50, height=20)

# Place the blue box at position (0, 30) on the window
blue_box.place(x=0, y=30)

# Run the application
root.mainloop()
```

This code will create a small blue rectangle in a window.


CPU times: user 368 ms, sys: 224 ms, total: 592 ms
Wall time: 4min 8s
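
Judging generated code by eye gets tedious across many prompts. A first automated pass could pull the fenced blocks out of an answer, check that they parse, and list which modules they import. The sketch below uses only the standard library and never executes the generated code. It assumes answers contain only Python fences, and a block whose closing fence was cut off will not be found. All names are hypothetical.

```python
import ast
import re

# Built at runtime so this example does not contain a literal fence
FENCE_MARK = "`" * 3
FENCE = re.compile(FENCE_MARK + r"(?:python)?\n(.*?)" + FENCE_MARK, re.DOTALL)


def extract_python_blocks(answer):
    """Return the contents of every complete fenced block in an LLM answer."""
    return [block.strip() for block in FENCE.findall(answer)]


def syntax_ok(code):
    """True if the code parses as Python; the code is never executed."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def imported_modules(code):
    """Top-level module names imported by the code, e.g. {'tkinter'}."""
    modules = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules
```

For the answer above, `imported_modules` would report `tkinter`, which is a quick signal that the code has nothing to do with `waggle` or `sage_data_client`.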


In [None]:
%%time
## In manual augmentation, I provided clean dataframes and asked it to provide code in two files

long_question = """
Can you generate the code to graph the line plot and histogram of the temperature and pressure for the last three hours for node W0B0?
"""


## Sage Chain
sage_chain = (
    {"context": sage_retriever | format_docs, "question": RunnablePassthrough()}
    | sage_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in sage_chain.stream(long_question):
    print(chunk, end="", flush=True)

print("\n\n") # Result: there is no "meta.vsn" filter key (the example notebook filters on "vsn"), and I couldn't get HoloViews to work

Here's how you can do it. I will use the `sage_data_client` library to query the data from the past 3 hours.

```python
import sage_data_client
import pandas as pd
import holoviews as hv
from holoviews import opts

# Set up Holoviews with a suitable backend (in this case, Plotly)
hv.extension('bokeh')
opts.defaults(
    opts.Scatter(line_width=2),
)

temperature_df = sage_data_client.query(
    start="-3h",
    filter={
        "name": "env.temperature",
        "meta.vsn": "W0B0"
    }
)

pressure_df = sage_data_client.query(
    start="-3h",
    filter={
        "name": "env.pressure",
        "meta.vsn": "W0B0"
    }
)


# Create the plots
temperature_line_plot = temperature_df.hvplot.line(x='timestamp', y='value')
pressure_line_plot = pressure_df.hvplot.line(x='timestamp', y='value')

temperature_histogram = temperature_df.hvplot.hist(x='value', bins=20)
pressure_histogram = pressure_df.hvplot.hist(x='value', bins=20)

# Display the plots
temperature_line_plot
pressure_line_plot

t

In [None]:
%%time
## In manual augmentation, I provided clean dataframes and asked it to provide code in two files

long_question = """
Can you generate the code to plot the line plot and histogram of the temperature and pressure for the last three hours for node W0B0?
Use matplotlib for plotting.
"""


## Sage Chain
sage_chain = (
    {"context": sage_retriever | format_docs, "question": RunnablePassthrough()}
    | sage_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in sage_chain.stream(long_question):
    print(chunk, end="", flush=True)

print("\n\n") 
# Result: the generated code unpacks plt.hist into two values ("temperature_hist, _"), but plt.hist returns three (n, bins, patches)
## also, temperature and pressure are on completely different scales, so plotting both on one axis is hard to read

Based on your specifications, I will write the code to answer your question. Here it is:

```python
import sage_data_client
import matplotlib.pyplot as plt
import pandas as pd

# query temperature data for last 3h
temperature_df = sage_data_client.query(start="-3h", filter={"name": "env.temperature"})

# query pressure data for last 3h
pressure_df = sage_data_client.query(start="-3h", filter={"name": "env.pressure"})

# filter data for node W0B0
filtered_temperature_df = temperature_df[temperature_df["meta.vsn"] == "W0B0"]
filtered_pressure_df = pressure_df[pressure_df["meta.vsn"] == "W0B0"]

# create line plot of temperature and pressure
plt.figure(figsize=(10, 6))
plt.plot(filtered_temperature_df["timestamp"], filtered_temperature_df["value"], label="Temperature")
plt.plot(filtered_pressure_df["timestamp"], filtered_pressure_df["value"], label="Pressure")

# add title and labels
plt.title("Temperature and Pressure for Node W0B0 (Last 3h)")
plt.xlabel("Timestamp")
plt.ylabel("Value")


In [None]:
%%time
## In manual augmentation, I provided clean dataframes and asked it to provide code in two files

long_question = """
Can you generate the code to plot the line plot and histogram of the temperature and pressure for the last three hours for node W0B0?
Use two separate functions, one for temperature and one for pressure, and plot with matplotlib.
"""


## Sage Chain
sage_chain = (
    {"context": sage_retriever | format_docs, "question": RunnablePassthrough()}
    | sage_prompt ## Up to here it produces the prompt
    | llm
    | StrOutputParser()
)

for chunk in sage_chain.stream(long_question):
    print(chunk, end="", flush=True)

print("\n\n") 

Here is how you can create a code that generates line plots and histograms for both temperature and pressure data for the last 3 hours:

```python
import sage_data_client
import pandas as pd
from metpy.plots import USCOUNTIES
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

def plot_temperature():
    # query temperature data for node W0B0 and last 3 hours
    df = sage_data_client.query(
        start='-3h',
        filter={
            "name": "env.temperature",
            "meta.vsn": "W0B0"
        }
    )
    
    # plot line graph of temperature over time
    plt.figure(figsize=(10,6))
    plt.plot(df['timestamp'], df['value'])
    plt.title('Temperature (degF) for Node W0B0')
    plt.xlabel('Time')
    plt.ylabel('Temperature (degF)')
    plt.show()

def plot_pressure():
    # query pressure data for node W0B0 and last 3 hours
    df = sage_data_client.query(
        start='-3h',
        filter={
            "name": "env.pressure",
 

<img src="sdc-plots/plot.png">

<img src="sdc-plots/hist.png">