# Semantic Search and Q&A using KDB.AI and OpenAI

This guide demonstrates how to use KDB.AI to run fast and scalable semantic vector search on unstructured text documents, using the OpenAI API to generate embeddings.

Semantic search allows users to perform searches based on the meaning or similarity of the data rather than exact matches. It works by converting the query into a vector representation and then finding similar vectors in the database. This way, even if the query and the data in the database are not identical, the system can identify and retrieve the most relevant results based on their semantic meaning.
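
To make the idea concrete, here is a minimal sketch that ranks a few texts against a query by cosine similarity. The three-dimensional vectors are made up for illustration (they are not produced by any model), and `cosine_similarity` is a helper defined here, not part of KDB.AI or OpenAI.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real models emit far more dimensions.
documents = {
    "comets from other star systems": [0.9, 0.1, 0.0],
    "recipes for sourdough bread": [0.0, 0.2, 0.9],
    "asteroids passing through the galaxy": [0.8, 0.3, 0.1],
}
# Stands in for the embedding of a query such as "interstellar objects".
query_vector = [1.0, 0.2, 0.0]

# Most similar first: no text shares a word with the query, yet the two
# astronomy texts rank above the baking one because their vectors point
# in a similar direction.
ranked = sorted(
    documents,
    key=lambda text: cosine_similarity(query_vector, documents[text]),
    reverse=True,
)
print(ranked)
```

A vector database performs the same kind of comparison, but over many high-dimensional vectors and with an index that avoids scanning every one.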

### Aim
In this tutorial, we'll walk you through performing semantic search on documents, using PDFs as an example, with KDB.AI as the vector store and OpenAI for language embeddings. We will cover the following topics:

1. Setup
2. Load PDF Data
3. Create Sentence Vector Embeddings
4. Store Embeddings in KDB.AI
5. Run similarity search on KDB.AI
6. Setup Q&A using ChatGPT and KDB.AI
7. Delete the KDB.AI Table

---

## 1. Setup

### Install dependencies

In order to successfully run this sample, the [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` file must be completed.
This will ensure that you have installed all of the relevant packages and versions needed for this sample.
If you have not completed these setup steps, please navigate to the repository's `README.md` file and follow the steps detailed there.

In [1]:
# Load Data
import pypdf
from nltk.tokenize import sent_tokenize
import json

In [2]:
# Embeddings
import numpy as np
import pandas as pd
from typing import List

In [3]:
# Vector DB
import os
import tiktoken
import getpass
import openai
import kdbai_client as kdbai
import time

## 2. Load PDF Data

### Read Text From PDF Document

We use `pypdf` for PDF processing and `nltk` for sentence tokenization. The code below extracts the text from each page of the PDF and then splits it into sentences.

The PDF we are using is [this research paper](https://arxiv.org/pdf/2308.05801.pdf) presenting information on the formation of Interstellar Objects in the Milky Way.

In [4]:
# Read PDF file
with open("data/research_paper.pdf", "rb") as pdf_file:
    pdf_pages = pypdf.PdfReader(pdf_file).pages
    page_list = [page.extract_text() for page in pdf_pages]

In [5]:
# Concatenate text from each page
full_pdf_text = "".join(page_list)

### Split The Text Into Sentences

<div class="alert alert-block alert-warning">
    <b>Note: </b>
    Before running the following line of code, please ensure that you have installed the English sentence tokenizer as stated in the `README.md` file in this repository.
</div>

In [6]:
# Split the PDF into sentences
pdf_sentences = sent_tokenize(full_pdf_text)
len(pdf_sentences)

591

#### Error-Tip

If you skipped the steps in `README.md` and get an error saying the resource 'punkt' was not found, run the following:
- import nltk
- nltk.download('punkt')

In [7]:
# Check the content
pdf_sentences[0]

'Draft version August 14, 2023\nTypeset using L ATEX default style in AASTeX631\nThe Galactic Interstellar Object Population: A Framework for Prediction and Inference\nMatthew J. Hopkins\n ,1Chris Lintott\n ,1Michele T. Bannister\n ,2J.'

In [8]:
# Create Dataframe and verify
df = pd.DataFrame({'Sentences': pdf_sentences})
df.head()

Unnamed: 0,Sentences
0,"Draft version August 14, 2023\nTypeset using L..."
1,"Ted Mackereth\n ,3, 4, 5, ∗and\nJohn C. Forbes..."
2,We define a novel framework: firstly to predic...
3,We predict the spatial and compositional distr...
4,Selecting ISO water mass\nfraction as an examp...


## 3. Create Vector Embeddings

Next, we use the OpenAI API to create embeddings for our collection of sentences.

### Selecting an OpenAI Embedding Model

Many embedding models are available, and they differ mainly in the data they were trained on. Selecting a model means matching its training domain and task to yours, while also weighing the benefits of models trained on larger datasets. We will use an OpenAI model; see [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) for the details.

This tutorial will use the `text-embedding-ada-002` pre-trained model. This embedding model can create sentence and document embeddings that can be used for a wide variety of tasks including semantic search which makes it a good choice for our needs.

### Define OpenAI Client

<div class="alert alert-block alert-warning">
    <b>Note: </b>
    You'll need an OpenAI account and associated API key to proceed.
</div>

You can [create a free account](https://beta.openai.com/signup) to get started. For OpenAI code details:
 
> See [cookbook.openai.com](https://cookbook.openai.com)

It provides example code and guides for accomplishing common tasks with the [OpenAI API](https://platform.openai.com/docs/introduction).

In [9]:
# Setup OpenAI and input the API keys created on your OpenAI account
OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY

OpenAI API Key:········


In [10]:
# Define OpenAI Client
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)

### Define Embedding Model

In [11]:
emb_model = "text-embedding-ada-002"

### Generate Embeddings using this model

In [12]:
# Define Embeddings Function
def get_embedding_vec(input):
  """Returns the embeddings vector for a given input"""
  return client.embeddings.create(input=input,model=emb_model).data[0].embedding

In [13]:
# Create and Verify Embeddings
df['Embeddings']= df['Sentences'].apply(get_embedding_vec)
len(df['Embeddings'][0])

#### Error-Tip

If you hit a rate limit error, either send much smaller batches of sentences or add payment details to your OpenAI account to raise the limit. Alternatively, this dataframe is shared as a CSV file, which you can load and check as follows:
- df = pd.read_csv('data/openai_embedded_data.csv')
- df['Embeddings'] = df['Embeddings'].apply(json.loads)
- len(df['Embeddings'][0])
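
Another common way to cope with rate limits is to retry with exponential backoff. The helper below is a sketch, not part of the original sample: `with_backoff` and its parameters are hypothetical names, and the `flaky` function only simulates a failing API call so the example runs offline.

```python
import time

def with_backoff(func, max_attempts=5, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Simulated call that fails twice before succeeding (stand-in for an API call).
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = with_backoff(flaky, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In the notebook this could wrap the embedding call, for example `with_backoff(lambda: get_embedding_vec(sentence), retry_on=(openai.RateLimitError,))`, assuming the installed `openai` package exposes `RateLimitError` as version 1.x does.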

In [14]:
# Save the Embedded DF to CSV
df_store = df.copy()
df_store['Embeddings'] = df_store['Embeddings'].apply(json.dumps)
df_store.to_csv('output/openai_embedded_data.csv', index=False)
del df_store
df.head()

Unnamed: 0,Sentences,Embeddings
0,"Draft version August 14, 2023\nTypeset using L...","[-0.00248929625377059, 0.006371329538524151, -..."
1,"Ted Mackereth\n ,3, 4, 5, ∗and\nJohn C. Forbes...","[0.006372543051838875, 0.002492946805432439, -..."
2,We define a novel framework: firstly to predic...,"[0.0013814108679071069, 0.0014908972661942244,..."
3,We predict the spatial and compositional distr...,"[0.014146137051284313, -0.0005565343308262527,..."
4,Selecting ISO water mass\nfraction as an examp...,"[0.01576417125761509, 0.004918665625154972, 0...."


## 4. Store Embeddings in KDB.AI

With the embeddings created, we need to store them in a vector database to enable efficient searching.

### Define KDB.AI Session

<div class="alert alert-block alert-warning">
    <b>Note: </b>
    You'll need a KDB.AI account and associated API key to proceed.
</div>

KDB.AI comes in two offerings:

> - [KDB.AI Cloud](https://trykdb.kx.com/kdbai/signup/) - For experimenting with smaller generative AI projects with a vector database in our cloud.
> - [KDB.AI Server](https://trykdb.kx.com/kdbaiserver/signup/) - For evaluating large scale generative AI applications on-premises or on your own cloud provider.

Depending on which you use there will be different setup steps and connection details required.

##### Option 1. KDB.AI Cloud

To use KDB.AI Cloud, you will need two session details - a URL endpoint and an API key.
To get these you can sign up for free [here](https://trykdb.kx.com/kdbai/signup).

You can connect to a KDB.AI Cloud session using `kdbai.Session` and passing the session URL endpoint and API key details from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system containing your KDB.AI Cloud portal details, these variables will automatically be used to connect.
If these do not exist, it will prompt you to enter your KDB.AI Cloud portal session URL endpoint and API key details.

In [15]:
# Input details of KDB.AI Endpoint and API Keys from your account 
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else getpass.getpass("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass.getpass("KDB.AI API key: ")
)

KDB.AI endpoint: ········
KDB.AI API key: ········


In [16]:
# Define KDB.AI Session
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

##### Option 2. KDB.AI Server

To use KDB.AI Server, you will need to download and run your own container.
To do this, you will first need to sign up for free [here](https://trykdb.kx.com/kdbaiserver/signup/). 

You will receive an email with the required license file and bearer token needed to download your instance.
Follow instructions in the signup email to get your session up and running.

Once the [setup steps](https://code.kx.com/kdbai/gettingStarted/kdb-ai-server-setup.html) are complete you can then connect to your KDB.AI Server session using `kdbai.Session` and passing your local endpoint.

In [17]:
# Define the session for above case
# session = kdbai.Session(endpoint="http://localhost:8082")

### Define Vector DB Table Schema

The next step is to define a schema for our KDB.AI table where we will store our embeddings. Our table will have two columns.

At this point you will select the index and metric you want to use for searching.

With KDB.AI we have the choice between HNSW (Hierarchical Navigable Small World), IVF, IVFPQ and Flat indexing methods. Generally, for semantic search of documents, the HNSW indexing method might be more suitable. Here's why:

- **Search Speed and Approximation**: HNSW is designed for fast approximate nearest neighbor searches. It can efficiently handle high-dimensional data, which is common in natural language processing tasks involving text documents.
- **Semantic Representation**: The OpenAI `text-embedding-ada-002` model used in this example generates embeddings that capture semantic meaning. HNSW is well-suited for indexing such embeddings and performing semantic searches.
- **Scalability**: HNSW is scalable and can handle large datasets effectively, making it suitable for applications with a vast number of documents.

HNSW provides approximate search results, meaning that the nearest neighbors might not be exact matches but are close in terms of similarity.
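
For comparison, an exact search scans every stored vector and sorts by distance. The sketch below shows that baseline in plain Python; it illustrates the concept only and is not how KDB.AI implements any of its indexes. `l2_distance` and `exact_nearest` are hypothetical helper names.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_nearest(query, vectors, n=1):
    """Return the indices of the n vectors closest to query, scanning all of them."""
    order = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return order[:n]

# Tiny made-up dataset; a real table holds one 1536-dimensional vector per sentence.
vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
nearest = exact_nearest([1.0, 1.1], vectors, n=2)
print(nearest)
```

The cost of this scan grows linearly with the number of vectors. An approximate index such as HNSW visits only a fraction of them, which is why it can occasionally miss a true nearest neighbour.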

In [18]:
# Define Schema
openai_pdf_schema = {
    "columns": [
        {"name": "Sentences", "pytype": "str"},
        {
            "name": "Embeddings",
            "vectorIndex": {"dims": 1536, "metric": "L2", "type": "hnsw"},
        },
    ]
}
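
The `dims` value of 1536 matches the length of the vectors returned by `text-embedding-ada-002`. Before inserting, it can be worth checking that every embedding has that length. The helper below is a sketch with a hypothetical name, `check_dimensions`, and it uses made-up vectors rather than real embeddings.

```python
def check_dimensions(vectors, expected_dims):
    """Return the positions of vectors whose length differs from expected_dims."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dims]

# Two well-formed vectors and one truncated vector, all made up.
sample = [[0.1] * 1536, [0.2] * 1536, [0.3] * 10]
bad_rows = check_dimensions(sample, expected_dims=1536)
print(bad_rows)
```

In the notebook, `check_dimensions(df['Embeddings'], 1536)` should return an empty list if every sentence was embedded successfully.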

### Create Vector DB Table

Use the KDB.AI `create_table` function to create a table that matches the defined schema in the vector database.

In [19]:
# First ensure the table does not already exist in database
try:
    session.table("openai_pdf").drop()
    time.sleep(5)
except kdbai.KDBAIException:
    pass

In [20]:
# Create empty table openai_pdf
table = session.create_table("openai_pdf", openai_pdf_schema)

We can use `query` to see our table exists but is empty.

In [21]:
# Confirm that table exists and is empty after creation
table.query()

Unnamed: 0,Sentences,Embeddings


### Add Embedded Data to KDB.AI Table

In [22]:
# Insert dataframe into KDB.AI table
table.insert(df)

True

### Verify Data Has Been Inserted

Running `table.query()` should show us that data has been added.

In [23]:
# Confirm if table has the data just inserted
table.query()

Unnamed: 0,Sentences,Embeddings
0,"Draft version August 14, 2023\nTypeset using L...","[-0.00248929625377059, 0.006371329538524151, -..."
1,"Ted Mackereth\n ,3, 4, 5, ∗and\nJohn C. Forbes...","[0.006372543051838875, 0.002492946805432439, -..."
2,We define a novel framework: firstly to predic...,"[0.0013814108679071069, 0.0014908972661942244,..."
3,We predict the spatial and compositional distr...,"[0.014146137051284313, -0.0005565343308262527,..."
4,Selecting ISO water mass\nfraction as an examp...,"[0.01576417125761509, 0.004918665625154972, 0...."
...,...,...
586,"2021, ApJ, 922, 189,\ndoi: 10.3847/1538-4357/a...","[-0.00573846697807312, -0.006318436004221439, ..."
587,"2020,\nNature Methods, 17, 261, doi: 10.1038/s...","[0.0037363762967288494, -0.0014329071855172515..."
588,"A., Frinchaboy, P. M., et al.","[0.0008434861665591598, -0.01327595580369234, ..."
589,"2013, AJ, 146, 81, doi: 10.1088/0004-6256/146/...","[-0.01480394322425127, 0.0031033621635288, -0...."


## 5. Run similarity search on KDB.AI

Now that the embeddings are stored in KDB.AI, we can perform semantic search using `search`. 

First, we embed our search term using the OpenAI model as before. Then we search our index to return the three most similar vectors.

In [24]:
search_term1 = "number of interstellar objects in the milky way"

In [25]:
# Get the embedding of the search term
vec_search_term1 = get_embedding_vec(search_term1)

### Searching with number of nearest neighbours set to 3

In [26]:
# Fetching Results with n=3
results1 = table.search([vec_search_term1], n=3)
results1[0]

Unnamed: 0,Sentences,Embeddings,__nn_distance
0,"Ted Mackereth\n ,3, 4, 5, ∗and\nJohn C. Forbes...","[0.006372543051838875, 0.002492946805432439, -...",0.210837
1,"In this work, we develop\nthis method and appl...","[0.009463178925216198, 0.020949233323335648, -...",0.296038
2,"Keywords: Interstellar objects (52), Small Sol...","[-0.002602155553176999, -0.00834919698536396, ...",0.303887


### Searching the closest neighbour

In [27]:
# Fetching Results with default n
results2 = table.search([vec_search_term1])
results2[0]['Sentences']

0    Ted Mackereth\n ,3, 4, 5, ∗and\nJohn C. Forbes...
Name: Sentences, dtype: object

### Printing sentences alongside scores for the search results within n = 3

In [28]:
for index, row in results1[0].iterrows():
    sentence = row['Sentences']
    nn_distance = row['__nn_distance']
    print(f"{index + 1}. {sentence} (Score: {nn_distance:.3f})")

1. Ted Mackereth
 ,3, 4, 5, ∗and
John C. Forbes
2
1Department of Physics, University of Oxford, Denys Wilkinson Building, Keble Road, Oxford, OX1 3RH, UK
2School of Physical and Chemical Sciences—Te Kura Mat¯ u, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand
3Just Group plc, Enterprise House, Bancroft road, Reigate, Surrey RH2 7RP, UK
4Canadian Institute for Theoretical Astrophysics, University of Toronto, 60 St. George Street, Toronto, ON, M5S 3H8, Canada
5Dunlap Institute for Astronomy and Astrophysics, University of Toronto, 50 St. George Street, Toronto, ON M5S 3H4, Canada
ABSTRACT
The Milky Way is thought to host a huge population of interstellar objects (ISOs), numbering
approximately 1015pc−3around the Sun, which are formed and shaped by a diverse set of processes
ranging from planet formation to galactic dynamics. (Score: 0.211)
2. In this work, we develop
this method and apply it to the stellar population of the Milky Way, estimated with data from t
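
The value printed as "Score" is the `__nn_distance` column, so with the `L2` metric chosen in the schema a smaller value means a closer match. OpenAI's documentation describes its embeddings as normalized to length 1; for such vectors, squared L2 distance and cosine similarity are related by `d² = 2 - 2·cos`, so both produce the same ranking. The sketch below checks that identity on made-up vectors. Whether KDB.AI reports the distance or its square is not stated here, so treat the numbers only as relative.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Two made-up vectors, scaled to length 1 like OpenAI embeddings.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cosine = sum(x * y for x, y in zip(a, b))
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))

# For unit vectors the two quantities carry the same information.
print(round(cosine, 4), round(squared_l2, 4), round(2 - 2 * cosine, 4))
```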

## 6. Setup Q&A using ChatGPT and KDB.AI

This section implements a question answering system with ChatGPT, KDB.AI and OpenAI embeddings.
First we ask ChatGPT to answer the query on its own, then we ask it to answer from a chunk of the PDF text, and finally we repeat this using context retrieved from KDB.AI with OpenAI embeddings. 

### Define a query

We define our query to be asked.

In [29]:
query = 'What is Milky Way thought to host?'

<div class="alert alert-block alert-warning">
    <b>Note: </b>
    This can also be taken as a user input, but we use a fixed query here to keep the results simple.
</div>

### Using ChatGPT to answer

First we call the OpenAI API with a ChatGPT model to answer the above query.

#### Define ChatGPT Model

In [30]:
GPT_MODEL = "gpt-3.5-turbo"

#### Fetching result using the above ChatGPT model

In [31]:
response = openai.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about scientific papers.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

The Milky Way is thought to host a variety of astronomical objects and phenomena. These include:

1. Stars: The Milky Way is home to billions of stars, including our own Sun. These stars vary in size, age, and composition.

2. Planets: The Milky Way is believed to host numerous planets, both within our own solar system and around other stars. These exoplanets may have diverse characteristics and potential for habitability.

3. Nebulae: Nebulae are vast clouds of gas and dust. The Milky Way contains various types of nebulae, such as emission nebulae (e.g., the Orion Nebula) and reflection nebulae (e.g., the Pleiades).

4. Star clusters: The Milky Way contains both open star clusters (e.g., the Pleiades) and globular star clusters (e.g., Omega Centauri). These clusters are groups of stars that formed together and are gravitationally bound.

5. Black holes: The Milky Way is believed to harbor a supermassive black hole at its center, known as Sagittarius A*. Additionally, there may be nume

### Creating chunks of data from the stored PDF text

We first use the earlier stored PDF text and break it into chunks to be processed by ChatGPT separately.
This will be used to create different queries to fetch the answers.

In [32]:
# Define the function to split text into chunks
def split_text_into_chunks(text, max_tokens_per_chunk=4096):
    """Split text into chunks of whole words.

    Despite the parameter name, the limit is measured in characters
    (via len), not in model tokens.
    """
    words = text.split()
    if not words:
        return []  # nothing to split

    chunks = []
    current_chunk = words[0]
    for word in words[1:]:
        if len(current_chunk) + len(word) + 1 <= max_tokens_per_chunk:  # +1 for the space between words
            current_chunk += ' ' + word
        else:
            chunks.append(current_chunk)
            current_chunk = word

    chunks.append(current_chunk)
    return chunks

In [33]:
# Split the text into chunks of at most 4096 characters each (the function measures length with len, not in tokens)
chunks = split_text_into_chunks(full_pdf_text, max_tokens_per_chunk=4096)
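
If you need the limit to be a real token budget, one option is to make the length measure pluggable. The variant below is a sketch with hypothetical names (`split_by_budget`, `measure`); by default it counts characters, like the function above.

```python
def split_by_budget(text, budget, measure=len):
    """Greedily pack whitespace-separated words into chunks with measure(chunk) <= budget.

    A single word larger than the budget still becomes its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and measure(candidate) > budget:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Character budget (default measure).
demo = split_by_budget("the quick brown fox jumps over the lazy dog", budget=15)
print(demo)

# Same function with a different measure: at most two words per chunk.
by_words = split_by_budget("a b c d e", budget=2, measure=lambda s: len(s.split()))
print(by_words)
```

To count model tokens instead, you could pass something like `measure=lambda s: len(enc.encode(s))` with `enc = tiktoken.encoding_for_model(GPT_MODEL)`. This assumes `tiktoken` is installed and can load the encoding; note that it re-encodes the candidate chunk for every word, which is simple but slow on long texts.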

In [34]:
# Print the chunks if needed using below:
# for i, chunk in enumerate(chunks, start=1):
#     print(f"Chunk {i}: {chunk}")

### Using chunked data with ChatGPT to fetch answers

We now use the same ChatGPT model and function to fetch the answer to the query from the data chunks created from the PDF.

In [35]:
# Finding Answer of a Question from a pre-selected chunk
# You can query all chunks in a loop as well but we have limited it for the example
query = f"""Use the below (chunk of) article to answer the subsequent question. If the answer cannot be found, write "I don't know."

Article:
\"\"\"
{chunks[0]}
\"\"\"

Question: What is Milky Way thought to host?"""

response = openai.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about scientific papers.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

The Milky Way is thought to host a huge population of interstellar objects (ISOs).
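
The cell above only looks at `chunks[0]`. A loop over all chunks could look like the sketch below, which stops at the first chunk that yields a real answer. `first_answer` and `fake_ask` are hypothetical names; `fake_ask` stands in for the model call so the example runs without an API key.

```python
def first_answer(chunks, question, ask, unknown="I don't know."):
    """Ask about each chunk in order; return (chunk_index, answer) for the first real answer."""
    for index, chunk in enumerate(chunks):
        prompt = (
            "Use the below (chunk of) article to answer the subsequent question. "
            f'If the answer cannot be found, write "{unknown}"\n\n'
            f'Article:\n"""\n{chunk}\n"""\n\nQuestion: {question}'
        )
        answer = ask(prompt).strip()
        if answer != unknown:
            return index, answer
    return None, unknown

# Stand-in for the chat completion call (assumption: returns the reply text).
def fake_ask(prompt):
    return "A huge population of ISOs." if "interstellar" in prompt else "I don't know."

found = first_answer(
    ["intro text", "interstellar objects abound"],
    "What does it host?",
    fake_ask,
)
print(found)
```

With the real API, `ask` would wrap the `openai.chat.completions.create` call shown above and return `response.choices[0].message.content`. Each chunk costs one request, which is the overhead that retrieving only the relevant sentences from a vector database avoids.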


### Using KDB.AI to fetch answer from closest record

Now we demonstrate the use of KDB.AI to fetch the answer to the query.

#### Define the search term

In [36]:
search_term2 = "What is Milky Way thought to host?"

#### Create embedding of the search term and then search the KDB.AI table for result

In [37]:
vec_search_term2 = get_embedding_vec(search_term2)
results3 = table.search([vec_search_term2])

#### Finding answer for the Question using ChatGPT model fetching results from KDB.AI

In [38]:
query = f"""Use the below (chunk of) article to answer the subsequent question. If the answer cannot be found, write "I don't know."

Article:
\"\"\"
{results3[0]['Sentences'].str.cat(sep=' ')}
\"\"\"

Question: What is Milky Way thought to host?"""

response = openai.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about scientific papers.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

The Milky Way is thought to host a huge population of interstellar objects (ISOs).
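
The search above uses the default number of neighbours, so the model sees only the single closest sentence. If you retrieve more results, the context has to be assembled within a size limit. The sketch below shows one way to do that; `build_context` and `max_chars` are hypothetical names, and the sentences are made up.

```python
def build_context(sentences, max_chars=2000):
    """Join retrieved sentences in rank order, stopping before max_chars is exceeded."""
    picked, used = [], 0
    for sentence in sentences:
        cleaned = " ".join(sentence.split())  # collapse PDF line breaks and extra spaces
        if picked and used + len(cleaned) + 1 > max_chars:
            break
        picked.append(cleaned)
        used += len(cleaned) + 1
    return " ".join(picked)

# Made-up search results, closest first; the last one is too long to fit.
retrieved = [
    "The Milky Way is thought\nto host ISOs.",
    "Keywords: Interstellar objects",
    "x" * 5000,
]
context = build_context(retrieved, max_chars=100)
print(context)
```

In the notebook, the input could be the `Sentences` column of a search with more neighbours, such as `table.search([vec_search_term2], n=3)[0]['Sentences']`.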


#### Showing query results directly from the KDB.AI table

In [39]:
table.query(filter=[("like", "Sentences", "*Milky Way*")])

Unnamed: 0,Sentences,Embeddings
0,"Ted Mackereth\n ,3, 4, 5, ∗and\nJohn C. Forbes...","[0.006372543051838875, 0.002492946805432439, -..."
1,"In this work, we develop\nthis method and appl...","[0.009463178925216198, 0.020949233323335648, -..."
2,2.APOGEE AND STELLAR DENSITY MODELLING\nTo pre...,"[-0.005855499301105738, -0.0030739670619368553..."
3,While APOGEE’s main sample is not representati...,"[-0.006474930793046951, 0.012832699343562126, ..."
4,"APOGEE is a near-infrared,\nhigh-resolution ( ...","[0.013261232525110245, -0.0006685049156658351,..."
5,To restrict our sample to the Milky Way’s disk...,"[0.006987773813307285, 0.01712271198630333, 0...."
6,Density Modelling of Red Giants across the Gal...,"[-0.00954350270330906, 0.004846843425184488, -..."
7,To build our model of the Milky Way disk betwe...,"[0.004540927708148956, 0.020696885883808136, -..."
8,The two main distinct chemodynamical populatio...,"[0.008082730695605278, 0.01970331184566021, -0..."
9,This approach gives us simple but accurate mod...,"[-0.008088158443570137, 0.025545835494995117, ..."


## 7. Delete the KDB.AI Table

Once finished with the table, it is best practice to drop it.

In [40]:
table.drop()

True

### We hope you found this sample helpful!