# Building a Private Q&A Assistant Using MongoDB and an Open-Source Model

## Introduction

This notebook demonstrates how to implement a document Question-and-Answer (Q&A) task using SuperDuperDB together with an open-source model and MongoDB. It provides a step-by-step guide and an explanation of each component involved in the process.

A document Q&A system built with SuperDuperDB, an open-source model, and MongoDB has applications in many real-life scenarios:

1. **Customer Support Chatbots:** Enable a chatbot to answer customer queries by extracting information from documents, manuals, or knowledge bases stored in MongoDB or any other SuperDuperDB supported database using Q&A.

2. **Legal Document Analysis:** Facilitate legal professionals in quickly extracting relevant information from legal documents, statutes, and case laws, improving efficiency in legal research.

3. **Medical Data Retrieval:** Assist healthcare professionals in obtaining specific information from medical documents, research papers, and patient records for quick reference during diagnosis and treatment.

4. **Educational Content Assistance:** Enhance educational platforms by enabling students to ask questions related to course materials stored in a MongoDB database, providing instant and accurate responses.

5. **Technical Documentation Search:** Support software developers and IT professionals in quickly finding solutions to technical problems by querying documentation and code snippets stored in MongoDB or any other database supported by SuperDuperDB. We did that!

6. **HR Document Queries:** Simplify HR processes by allowing employees to ask questions about company policies, benefits, and procedures, with answers extracted from HR documents stored in MongoDB or any other database supported by SuperDuperDB.

7. **Research Paper Summarization:** Enable researchers to pose questions about specific topics, automatically extracting relevant information from a MongoDB repository of research papers to generate concise summaries.

8. **News Article Information Retrieval:** Empower users to inquire about specific details or background information from a database of news articles stored in MongoDB or any other database supported by SuperDuperDB, enhancing their understanding of current events.

9. **Product Information Queries:** Improve e-commerce platforms by allowing users to ask questions about product specifications, reviews, and usage instructions stored in a MongoDB database.

These use cases show how versatile and practical a document Q&A system built with SuperDuperDB, an open-source model, and MongoDB can be across different industries and domains.

All of this is possible with zero friction using SuperDuperDB. Now, back to the notebook.

## Prerequisites

Before starting the implementation, make sure you have the required libraries installed by running the following commands:

In [1]:
# !pip install superduperdb
# !pip install vllm
# !pip install sentence_transformers numpy==1.24.4
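
Before running the rest of the notebook, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical (it is not part of SuperDuperDB), and the module names are assumed to match the packages installed above.

```python
import importlib.util

def check_packages(names):
    """Map each top-level module name to True if it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Module names assumed to match the packages listed in the install commands
status = check_packages(["superduperdb", "vllm", "sentence_transformers", "numpy"])
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

Any package reported as missing can be installed by uncommenting the corresponding `pip` line above.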

## Connect to datastore 

First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can set the `MONGODB_URI` environment variable to match your specific setup.
Here are some examples of MongoDB URIs:

* For testing (default connection): `mongomock://test`
* Local MongoDB instance: `mongodb://localhost:27017`
* MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`
* MongoDB Atlas: `mongodb+srv://<username>:<password>@<atlas_cluster>/<database>`
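
To see what each part of such a URI means, the sketch below splits a few of the examples above with the standard library. The helper `describe_uri` is hypothetical and only illustrates generic URI structure; how SuperDuperDB itself interprets each component (for example, the `test` in `mongomock://test`) is not confirmed here.

```python
from urllib.parse import urlsplit

def describe_uri(uri):
    """Split a connection URI into the parts that matter when debugging a connection."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/") or None,
        "has_credentials": parts.username is not None,
    }

for uri in [
    "mongomock://test",
    "mongodb://localhost:27017",
    "mongodb://superduper:superduper@mongodb:27017/documents",
]:
    print(describe_uri(uri))
```

A quick check like this can catch a missing port or database name before the connection is attempted.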

In [2]:
from superduperdb import superduper
from superduperdb.backends.mongodb import Collection
import os

mongodb_uri = os.getenv("MONGODB_URI", "mongomock://test")

# SuperDuperDB now handles your MongoDB database
# It just super dupers your database
db = superduper(mongodb_uri, artifact_store='filesystem://./data/')

collection = Collection('questiondocs')

2024-Mar-29 11:23:39.49| INFO     | Taruns-Laptop.local| superduperdb.base.build:65   | Data Client is ready. mongomock.MongoClient('localhost', 27017)
2024-Mar-29 11:23:39.50| INFO     | Taruns-Laptop.local| superduperdb.base.build:38   | Connecting to Metadata Client with engine:  mongomock.MongoClient('localhost', 27017)
2024-Mar-29 11:23:39.50| INFO     | Taruns-Laptop.local| superduperdb.base.build:148  | Connecting to compute client: None
2024-Mar-29 11:23:39.50| INFO     | Taruns-Laptop.local| superduperdb.base.datalayer:89   | Building Data Layer


## Load Dataset

In this example, we use the internal textual data from the `superduperdb` project's API documentation. The objective is to create a chatbot that can offer information about the project. You can either load the data from your local project or use the provided data.

If you have the SuperDuperDB project locally and want to load the latest version of the API documentation, run the following cell:

In [3]:
import glob
import re

ROOT = '../docs/hr/content/docs/'

STRIDE = 3    # stride, in number of lines
WINDOW = 25   # window length, in number of lines

files = sorted(glob.glob(f'{ROOT}/**/*.md', recursive=True))

def get_chunk_link(chunk, file_name):
    # Get the original link of the chunk
    file_link = file_name[:-3].replace(ROOT, 'https://docs.superduperdb.com/docs/docs/')
    # If the chunk has subtitles, the link to the first subtitle will be used first.
    first_title = (re.findall(r'(^|\n)## (.*?)\n', chunk) or [(None, None)])[0][1]
    if first_title:
        # Convert subtitles and splice URLs
        first_title = first_title.lower()
        first_title = re.sub(r'[^a-zA-Z0-9]', '-', first_title)
        file_link = file_link + '#' + first_title
    return file_link

def create_chunk_and_links(file, file_prefix=ROOT):
    # Read lines without their trailing newlines so that the
    # '\n'.join below does not double every line break
    with open(file, 'r') as f:
        lines = f.read().splitlines()
    if len(lines) > WINDOW:
        chunks = ['\n'.join(lines[i: i + WINDOW]) for i in range(0, len(lines), STRIDE)]
    else:
        chunks = ['\n'.join(lines)]
    return [{'txt': chunk, 'link': get_chunk_link(chunk, file)} for chunk in chunks]


all_chunks_and_links = sum([create_chunk_and_links(file) for file in files], [])
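
The interplay between `STRIDE` and `WINDOW` is easier to see on a toy input. The sketch below applies the same slicing logic to ten short lines with a smaller window and stride; `sliding_windows` and the toy data are illustrative names, not part of the notebook's pipeline.

```python
def sliding_windows(lines, window, stride):
    """Return overlapping chunks of up to `window` lines, one starting every `stride` lines."""
    if len(lines) <= window:
        return [lines]
    return [lines[i:i + window] for i in range(0, len(lines), stride)]

toy_lines = [f"line {n}" for n in range(10)]
toy_chunks = sliding_windows(toy_lines, window=4, stride=3)
for chunk in toy_chunks:
    print(chunk)
```

Consecutive chunks overlap by `window - stride` lines, and the chunks near the end of a file are shorter than `window` because the slice runs past the last line.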

Otherwise, you can load the data from an external source. The text chunks include code snippets and explanations, which will be utilized to construct the document Q&A chatbot.

In [4]:
# Use !curl to download the 'superduperdb_docs.json' file
!curl -O https://datas-public.s3.amazonaws.com/superduperdb_docs.json

import json
from IPython.display import Markdown

# Open the downloaded JSON file and load its contents into 'all_chunks_and_links'
with open('superduperdb_docs.json') as f:
    all_chunks_and_links = json.load(f)


View the chunk content:

In [5]:
from IPython.display import Markdown

# 'all_chunks_and_links' is a list of dicts, each with a 'txt' (markdown) and a 'link' field
chunk_and_link = all_chunks_and_links[48]
print(chunk_and_link['link'])
Markdown(chunk_and_link['txt'])

https://docs.superduperdb.com/docs/docs/data_integrations/sql#setup


---
sidebar_position: 3
---

# SQL

`superduperdb` supports SQL databases via the [`ibis` project](https://ibis-project.org/).
With `superduperdb`, queries may be built which conform to the `ibis` API, with additional
support for complex data-types and vector-searches.

## Setup

The first step in working with an SQL table, is to define a table and schema

```python
from superduperdb.backends.ibis.query import Table
from superduperdb.backends.ibis.field_types import dtype
from superduperdb import Encoder, Schema

my_enc = Encoder('my-enc')

schema = Schema('my-schema', fields={'img': my_enc, 'text': dtype('str'), 'rating': dtype('int')})

db = superduper()
```


The chunks of text contain both code snippets and explanations, making them valuable for constructing a document Q&A chatbot. The combination of code and explanations enables the chatbot to provide comprehensive and context-aware responses to user queries.

As usual we insert the data. The `Document` wrapper allows `superduperdb` to handle records with special data types such as images,
video, and custom data-types.

In [6]:
from superduperdb import Document

# Insert multiple documents into the collection
insert_ids = db.execute(collection.insert_many([Document(chunk_and_link) for chunk_and_link in all_chunks_and_links]))
print(insert_ids[:5])

2024-Mar-29 11:23:43.07| INFO     | Taruns-Laptop.local| superduperdb.backends.local.compute:34   | Submitting job. function:<function callable_job at 0x1388c8c20>
2024-Mar-29 11:23:43.25| SUCCESS  | Taruns-Laptop.local| superduperdb.backends.local.compute:40   | Job submitted on <superduperdb.backends.local.compute.LocalComputeBackend object at 0x13d731210>.  function:<function callable_job at 0x1388c8c20> future:9ac9c7af-96dd-42fa-a1db-21b67f964669
([ObjectId('66065767927452587a8bb01f'), ObjectId('66065767927452587a8bb020'), ObjectId('66065767927452587a8bb021'), ObjectId('66065767927452587a8bb022'), ObjectId('66065767927452587a8bb023'), ObjectId('66065767927452587a8bb024'), ObjectId('66065767927452587a8bb025'), ObjectId('66065767927452587a8bb026'), ObjectId('66065767927452587a8bb027'), ObjectId('66065767927452587a8bb028'), ObjectId('66065767927452587a8bb029'), ObjectId(

## Create a Vector-Search Index

To enable question-answering over your documents, set up a standard `superduperdb` vector-search index using `sentence_transformers` (other options include `torch`, `openai`, `transformers`, etc.).

A `Model` is a wrapper around a self-built or ecosystem model, such as `torch`, `transformers`, `openai`.

In [7]:
import sentence_transformers
from superduperdb import Model, ObjectModel, vector

model = Model(
    identifier='embedding', 
    object=sentence_transformers.SentenceTransformer('BAAI/bge-large-en-v1.5'),
    encoder=vector(shape=(1024,)),
    predict_method='encode', # Specify the prediction method
    postprocess=lambda x: x.tolist(),  # Define postprocessing function
    batch_predict=True, # Generate predictions for a set of observations all at once 
    datatype=vector(shape=(1024,))
)

In [8]:
# Use a distinct name so the imported `vector` helper is not shadowed
embedding = model.predict_one('This is a test')
print('vector size: ', len(embedding))

2024-Mar-29 11:23:50.95| INFO     | Taruns-Laptop.local| superduperdb.components.component:375  | Initializing ObjectModel : embedding
2024-Mar-29 11:23:50.95| INFO     | Taruns-Laptop.local| superduperdb.components.component:378  | Initialized  ObjectModel : embedding successfully
vector size:  1024


A `Listener` essentially deploys a `Model` to "listen" to incoming data, computes outputs, and then saves the results in the database via `db`.

In [9]:
# Import the Listener class from the superduperdb module
from superduperdb import Listener


# Create a Listener instance with the specified model, key, and selection criteria
listener = Listener(
    model=model,          # The model to be used for listening
    key='txt',            # The key field in the documents to be processed by the model
    select=collection.find()  # The selection criteria for the documents
)
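
As a mental model of what a listener does, the sketch below applies a function to one key of each newly inserted record and stores the result. `ToyListener` is a purely conceptual stand-in written for this illustration; it is not SuperDuperDB's implementation, and the word-counting "model" only takes the place of an embedding model.

```python
class ToyListener:
    """Conceptual stand-in: apply `model` to `key` of each record and keep the output."""

    def __init__(self, model, key):
        self.model = model
        self.key = key
        self.outputs = {}  # record id -> model output

    def on_insert(self, records):
        # Compute and store one output per incoming record
        for record in records:
            self.outputs[record["_id"]] = self.model(record[self.key])

# A trivial "model" that counts words, standing in for an embedding model
toy_listener = ToyListener(model=lambda text: len(text.split()), key="txt")
toy_listener.on_insert([
    {"_id": 1, "txt": "vector search with superduperdb"},
    {"_id": 2, "txt": "hello world"},
])
print(toy_listener.outputs)
```

In the real system the stored outputs are embedding vectors, which is what makes them usable for the vector index described next.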

A `VectorIndex` wraps a `Listener`, allowing its outputs to be searchable.

In [11]:
# Import the VectorIndex class from the superduperdb module
from superduperdb import VectorIndex

# Add a VectorIndex to the SuperDuperDB database with the specified identifier and indexing listener
_ = db.add(
    VectorIndex(
        identifier='my-index',        # Unique identifier for the VectorIndex
        indexing_listener=listener    # Listener to be used for indexing documents
    )
)

2024-Mar-29 11:23:54.09| INFO     | Taruns-Laptop.local| superduperdb.components.component:375  | Initializing DataType : dill
2024-Mar-29 11:23:54.09| INFO     | Taruns-Laptop.local| superduperdb.components.component:378  | Initialized  DataType : dill successfully
2024-Mar-29 11:23:58.91| INFO     | Taruns-Laptop.local| superduperdb.components.component:375  | Initializing DataType : dill
2024-Mar-29 11:23:58.91| INFO     | Taruns-Laptop.local| superduperdb.components.component:378  | Initialized  DataType : dill successfully
2024-Mar-29 11:24:04.09| INFO     | Taruns-Laptop.local| superduperdb.backends.local.compute:34   | Submitting job. function:<function method_job at 0x1388c8cc0>

992it [00:00, 86414.04it/s]

2024-Mar-29 11:24:04.25| INFO     | Taruns-Laptop.local| superduperdb.components.component:375  | Initializing ObjectModel : embedding
2024-Mar-29 11:24:05.91| INFO     | Taruns-Laptop.local| superduperdb.components.component:378  | Initialized  ObjectModel : embedding successfully
2024-Mar-29 11:26:52.19| INFO     | Taruns-Laptop.local| superduperdb.components.model:661  | Adding 992 model outputs to `db`
2024-Mar-29 11:26:57.48| SUCCESS  | Taruns-Laptop.local| superduperdb.backends.local.compute:40   | Job submitted on <superduperdb.backends.local.compute.LocalComputeBackend object at 0x13d731210>.  function:<function method_job at 0x1388c8cc0> future:e0e95c3d-cbe0-443b-b34c-f1ec225e9947
2024-Mar-29 11:27:03.94| INFO     | Taruns-Laptop.local| superduperdb.backends.local.compute:34   | Submitting job. function:<function callable_job at 0x1388c8c20>
2024-Mar-29 11:27:04.70| INFO

Loading vectors into vector-table...: 992it [00:00, 1971.13it/s]

2024-Mar-29 11:27:05.21| SUCCESS  | Taruns-Laptop.local| superduperdb.backends.local.compute:40   | Job submitted on <superduperdb.backends.local.compute.LocalComputeBackend object at 0x13d731210>.  function:<function callable_job at 0x1388c8c20> future:61d2520b-80e7-4aab-80ce-367742859e40

In [12]:
# Execute a find_one operation on the SuperDuperDB collection
document = db.execute(collection.find_one())
document['txt']


In [None]:
from superduperdb import Document as D
from IPython.display import Markdown, display

# Define the query for the search
# query = 'Code snippet how to create a `VectorIndex` with a torchvision model'
query = 'can you explain vector-indexes with `superduperdb`?'

# Run a vector search for the 5 documents most similar to the query
result = db.execute(
    collection
        .like(D({'txt': query}), vector_index='my-index', n=5)
        .find()
)

# Display a horizontal rule to separate results
display(Markdown('---'))

# Display each document's 'txt' field and separate them with a horizontal rule
for r in result:
    display(Markdown(r['txt']))
    display(r['link'])
    display(Markdown('---'))
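
Conceptually, a vector search embeds the query and ranks the stored embeddings by similarity to it. The sketch below does this by brute force on three toy vectors. Cosine similarity is an assumption made for illustration; the similarity measure actually used by `my-index` is not stated in this notebook, and `top_n` and `toy_index` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query_vector, indexed_vectors, n):
    """Return the ids of the `n` stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vec), doc_id)
        for doc_id, vec in indexed_vectors.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:n]]

toy_index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
print(top_n([2.0, 0.0, 0.0], toy_index, n=2))
```

The query vector has a different length than `doc-a` but points in the same direction, so it still ranks first: cosine similarity ignores magnitude.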

## Create an LLM Component

In this step, an LLM component is created and added to the system. This component is essential for the Q&A functionality:

In [None]:
from superduperdb.ext.llm.vllm import VllmModel

# Define the prompt for the llm model
prompt_template = (
    'Use the following description and code snippets about SuperDuperDB to answer this question about SuperDuperDB\n'
    'Do not use any other information you might have learned about other python packages\n'
    'Only base your answer on the code snippets retrieved and provide a very concise answer\n'
    '{context}\n\n'
    'Here\'s the question:{input}\n'
    'answer:'
)

# Create an instance of llm with the specified model and prompt
llm = VllmModel(identifier='llm',
                 model_name='mistralai/Mistral-7B-Instruct-v0.2', 
                 prompt_template=prompt_template,
                 inference_kwargs={"max_tokens":512})

# Add the llm instance
db.add(llm)

# Print information about the models in the SuperDuperDB database
print(db.show('model'))
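
The `{context}` and `{input}` placeholders in the prompt template are ordinary Python format fields. The sketch below fills a shortened, hypothetical template with two toy chunks to show what the final prompt text looks like. Joining the retrieved chunks with newlines is an assumption; how `VllmModel` assembles the context internally is not shown in this notebook.

```python
# Shortened, hypothetical template with the same two placeholders as above
toy_template = (
    'Use the following snippets to answer the question.\n'
    '{context}\n\n'
    "Here's the question:{input}\n"
    'answer:'
)

# Toy stand-ins for the chunks a vector search would retrieve
retrieved_chunks = [
    'A `VectorIndex` wraps a `Listener`.',
    'A `Listener` deploys a `Model`.',
]

filled_prompt = toy_template.format(
    context='\n'.join(retrieved_chunks),
    input='What does a VectorIndex wrap?',
)
print(filled_prompt)
```

Printing the filled prompt like this is a cheap way to check that the instructions, context, and question appear in the intended order before sending anything to the model.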

## Ask Questions to Your Docs

Finally, you can ask questions about the documents. You can target specific queries and use the power of MongoDB for vector-search and filtering rules. Here's an example of asking a question:

In [None]:
from superduperdb import Document
from IPython.display import Markdown

def question_the_doc(question):
    # Use the SuperDuperDB model to generate a response based on the search term and context
    output, sources = db.predict(
        model_name='llm',
        input=question,
        context_select=(
            collection
                .like(Document({'txt': question}), vector_index='my-index', n=5)
                .find()
        ),
        context_key='txt',
    )
    
    # Get the reference links corresponding to the answer context
    links = '\n'.join(sorted(set([source.unpack()['link'] for source in sources])))
    
    # Display the generated response using Markdown
    return Markdown(output.content + f'\n\nrefs: \n\n{links}')

In [None]:
question_the_doc("can you explain vector-indexes with `superduperdb`?")

In [None]:
question_the_doc("What databases and AI frameworks does SuperDuperDB support?")
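
The reference list in `question_the_doc` is built by collecting the `link` of every source, dropping duplicates, and sorting. The sketch below shows that step in isolation on toy data; the `example.com` links and `toy_sources` are placeholders, and plain dicts stand in for the unpacked source documents.

```python
# Toy sources: two chunks come from the same page, so its link appears twice
toy_sources = [
    {'txt': 'chunk 1', 'link': 'https://example.com/docs/b'},
    {'txt': 'chunk 2', 'link': 'https://example.com/docs/a#setup'},
    {'txt': 'chunk 3', 'link': 'https://example.com/docs/b'},
]

# A set removes duplicates; sorting makes the output order deterministic
links = '\n'.join(sorted({source['link'] for source in toy_sources}))
print(links)
```

Because overlapping chunks from the same page often rank together, deduplicating keeps the reference list short.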

## Now you can build an API as well, just like we did

### FastAPI "Question the Docs" App Tutorial

The following tutorial guides you through setting up a basic FastAPI application that answers questions about documentation. It covers both local development and deployment to the Fly.io platform:

https://github.com/SuperDuperDB/chat-with-your-docs-backend