# Lab 10

Today, we'll explore Pinecone, a vector database. This lab is based on [two tutorials](https://docs.pinecone.io/examples/notebooks) from the Pinecone developers.

## Setup

If you haven't already, set up accounts with Pinecone and OpenAI:

Pinecone - https://app.pinecone.io

Once you create an account they will ask you a few questions to get it set up. Choose Python as the language. The rest you can answer however you like.

OpenAI API - https://platform.openai.com/overview

Note, if you are creating a new account you should get some free credits that you can use. If you already have an account, the cost for the lab should be very small.


In [0]:
# api key from app.pinecone.io
pinecone_api_key = 'INSERT_YOUR_PINECONE_KEY_HERE'

# api key from platform.openai.com
openai_api_key = 'INSERT_YOUR_OPENAI_KEY_HERE'

Now, let's install the libraries we need:

In [0]:
!pip install -qU \
  openai==0.27.7 \
  pinecone-client==3.1.0 \
  pinecone-datasets==0.7.0 \
  sentence-transformers==2.2.2 \
  tqdm

Note: some Windows users have also found that they needed to install `pyarrow-11.0.0`.

## Data Download

In this notebook we will use a pre-processed dataset from Pinecone Datasets.

If you are curious about what pre-processing they did, see [this notebook](https://github.com/pinecone-io/examples/blob/master/learn/search/semantic-search/semantic-search.ipynb).

In [0]:
from pinecone_datasets import load_dataset

dataset = load_dataset('quora_all-MiniLM-L6-bm25')

# drop the original metadata column; the blob column is used as metadata instead
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)

# we will use 80K rows of the dataset between rows 240K -> 320K
dataset.documents.drop(dataset.documents.index[320_000:], inplace=True)
dataset.documents.drop(dataset.documents.index[:240_000], inplace=True)

# Print out a sample from the dataset to show what we are working with
dataset.head()

In [0]:
print(len(dataset))

## Creating an Index

Now that the data is ready, we can set up an index to store it.

We begin by initializing our connection to Pinecone. This is where your API key is needed.

In [0]:
import os
from pinecone import Pinecone

# configure client
pc = Pinecone(api_key=pinecone_api_key)

Now we set up our index specification. This allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all available providers and regions [here](https://docs.pinecone.io/docs/projects).

In [0]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index called `semantic-search-fast`. It's important that we align the index `dimension` and `metric` parameters with those required by the `MiniLM-L6` model.

In [0]:
import time

index_name = 'semantic-search-fast'

existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=384,  # dimensionality of minilm
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()
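
The choice of `metric` matters less than it may seem when the embeddings are unit length. As a minimal sketch, assuming the stored and query vectors are normalized to length 1 (the toy vectors and helper names below are made up for illustration), the dot product and cosine similarity give the same score:

```python
import math

def dot(a, b):
    # plain dot product of two equal-length vectors
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # scale a vector to unit length
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    # cosine similarity: dot product divided by the product of the norms
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# toy 3-dimensional vectors standing in for 384-dimensional embeddings
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

print(round(dot(a, b), 6), round(cosine(a, b), 6))
```

If the vectors were not normalized, the two metrics could rank results differently, which is why the index metric should match what the embedding model expects.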

Upsert the data to put it in the database (this can take 2-5 minutes):

In [0]:
from tqdm.auto import tqdm

for batch in tqdm(dataset.iter_documents(batch_size=500), total=160):
    index.upsert(batch)
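
The `total=160` passed to `tqdm` is simply the 80,000 rows divided into batches of 500. The sketch below shows the arithmetic with a hypothetical `iter_batches` helper; it is only an illustration of batching and not how `iter_documents` is implemented:

```python
from itertools import islice
import math

def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

n_rows, batch_size = 80_000, 500
expected_batches = math.ceil(n_rows / batch_size)
actual_batches = sum(1 for _ in iter_batches(range(n_rows), batch_size))
print(expected_batches, actual_batches)
```

Sending rows in batches keeps each request small while avoiding one network round trip per row.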

## Making Queries

Now that our index is populated we can begin making queries. We are performing a semantic search for *similar questions*, so we should embed and search with another question.

Note that we use the same model that produced the embeddings in the dataset (`all-MiniLM-L6-v2`). That's critical: otherwise the query vector and the stored vectors live in different vector spaces and cannot be meaningfully compared.

In [0]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)
model

Now let's query.

In [0]:
query = "which city has the highest population in the world?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
xc

In the returned response `xc` we can see the questions most relevant to our query. There are no exact matches, but the returned questions ask about similar topics. We can reformat this response to make it a little easier to read:

In [0]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

These are good results. Let's change the wording of the query to see whether we still surface similar results.

In [0]:
query = "which metropolis has the highest number of people?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

Here, the query uses different terms from those in the returned documents: we replaced **"city"** with **"metropolis"** and **"population"** with **"number of people"**.

Despite the different wording and the *lack* of term overlap between the query and the returned documents, we still get highly relevant results. This is the power of *semantic search*.
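
Conceptually, a query with `top_k` scores the query vector against the stored vectors and returns the highest-scoring ones. The following is a brute-force sketch over a made-up toy index (the helper `brute_force_query` and the vectors are hypothetical); a real vector database relies on index structures rather than a full scan:

```python
import heapq

def dot(a, b):
    # plain dot product of two equal-length vectors
    return sum(x * y for x, y in zip(a, b))

def brute_force_query(vectors, query, top_k):
    """Return the top_k (score, id) pairs ranked by dot product, highest first."""
    scored = ((dot(vec, query), vec_id) for vec_id, vec in vectors.items())
    return heapq.nlargest(top_k, scored)

# toy 2-dimensional "index": id -> vector
toy_index = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}

matches = brute_force_query(toy_index, query=[1.0, 0.0], top_k=2)
print(matches)
```

Vectors pointing in nearly the same direction as the query score highest, regardless of which words produced them.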

## Task 1

Try changing the model and querying again. You can find alternative models [here](https://sbert.net/docs/pretrained_models.html). Note that you will need to choose one with the same dimensionality (384). Clicking on the "info" symbol next to the model names will tell you information including their dimensionality.

Find a model that gives similar results and a model that gives different results.
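
A quick guard can catch a dimension mismatch before a query is sent. This is a small sketch with a hypothetical helper, `check_dimension`, that uses plain lists in place of real embeddings:

```python
def check_dimension(vector, index_dimension):
    """Raise a clear error if a query vector cannot be used with the index."""
    if len(vector) != index_dimension:
        raise ValueError(
            f"query vector has {len(vector)} dimensions, "
            f"but the index expects {index_dimension}"
        )
    return vector

INDEX_DIMENSION = 384  # matches the `dimension` used when creating the index

check_dimension([0.0] * 384, INDEX_DIMENSION)      # passes silently
try:
    check_dimension([0.0] * 768, INDEX_DIMENSION)  # e.g. a 768-dimensional model
except ValueError as err:
    print(err)
```

In the notebook, the vector would come from `model.encode(query).tolist()`, and a loaded `SentenceTransformer` also reports its output size through `get_sentence_embedding_dimension()`.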

In [0]:
# TODO

# Solution

query = "which metropolis has the highest number of people?"

# Similar
model2 = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2', device=device)
xq = model2.encode(query).tolist()
xc = index.query(vector=xq, top_k=5, include_metadata=True)
print("Similar")
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")
print()

# Different
model3 = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L3-v2', device=device)
xq = model3.encode(query).tolist()
xc = index.query(vector=xq, top_k=5, include_metadata=True)
print("Different")
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

# Retrieval Enhanced Generative Question Answering

Next, we will see how these queries can be used with an LLM to generate better outputs.

We will again use data that has already been prepared (for details, see [this notebook](https://github.com/pinecone-io/examples/blob/master/learn/generation/openai/gen-qa-openai.ipynb)).

In [0]:
from pinecone_datasets import load_dataset

dataset = load_dataset('youtube-transcripts-text-embedding-ada-002')

# drop the original metadata column; the blob column is used as metadata instead
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)

# Print a sample of the data
dataset.head()

Again, we will set up a Pinecone index:

In [0]:
index_name = 'gen-qa-openai-fast'
# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of text-embedding-ada-002
        metric='cosine',
        spec=spec
    )
    # wait for index to be initialized (as we did for the first index)
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)
# connect to index
index = pc.Index(index_name)
# view index stats
index.describe_index_stats()

As in the previous section, we'll insert the data into the database (this can take 5-10 minutes):

In [0]:
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(batch)

Now we've added all of the YouTube transcript chunks to the index. With that, we can move on to retrieval and then answer generation.

## Retrieval

To search through our documents we first need to create a query vector `xq`. Using `xq` we will retrieve the most relevant chunks from the transcript dataset. To create that query vector we must use the same `text-embedding-ada-002` embedding model from OpenAI that produced the dataset's embeddings. For this, you need an [OpenAI API key](https://platform.openai.com/).

In [0]:
import openai

openai.api_key = openai_api_key

embed_model = "text-embedding-ada-002"

In [0]:
query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

res = openai.Embedding.create(
    input=[query],
    engine=embed_model
)

# retrieve from Pinecone
xq = res['data'][0]['embedding']

# get relevant contexts (including the questions)
res = index.query(vector=xq, top_k=2, include_metadata=True)

res

We write some functions to handle the retrieval and completion steps:

In [0]:
import time

limit = 3750  # maximum number of characters of context to put in the prompt

def retrieve(query):
    # embed the query with the same model used for the dataset
    res = openai.Embedding.create(
        input=[query],
        engine=embed_model
    )
    xq = res['data'][0]['embedding']

    # get relevant contexts, retrying while the freshly upserted
    # vectors become available in the index
    contexts = []
    time_waited = 0
    while time_waited < 60 * 12:
        res = index.query(vector=xq, top_k=3, include_metadata=True)
        contexts = [x['metadata']['text'] for x in res['matches']]
        if len(contexts) >= 3:
            break
        print(f"Retrieved {len(contexts)} contexts, sleeping for 15 seconds...")
        time.sleep(15)
        time_waited += 15

    if len(contexts) < 3:
        print("Timed out waiting for contexts to be retrieved.")
        contexts = ["No contexts retrieved. Try to answer the question yourself!"]

    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts until hitting the character limit
    selected = []
    for context in contexts:
        if len("\n\n---\n\n".join(selected + [context])) >= limit:
            break
        selected.append(context)
    prompt = prompt_start + "\n\n---\n\n".join(selected) + prompt_end
    return prompt


def complete(prompt):
    # instructions
    sys_prompt = "You are a helpful assistant that always answers questions."
    # query the chat model
    res = openai.ChatCompletion.create(
        model='gpt-3.5-turbo-0613',
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    return res['choices'][0]['message']['content'].strip()
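
The effect of the character budget `limit` on the prompt can be seen in isolation. Below is a minimal sketch with a hypothetical helper, `select_contexts`, and made-up placeholder contexts; it keeps adding contexts until the joined text would reach the limit:

```python
SEPARATOR = "\n\n---\n\n"  # 7 characters between consecutive contexts

def select_contexts(contexts, limit):
    """Keep contexts, in order, while the joined text stays under limit characters."""
    selected = []
    for context in contexts:
        if len(SEPARATOR.join(selected + [context])) >= limit:
            break
        selected.append(context)
    return selected

# placeholder contexts of 10 characters each
toy_contexts = ["a" * 10, "b" * 10, "c" * 10]

for budget in (1, 30, 100):
    print(budget, len(select_contexts(toy_contexts, budget)))
```

Because the budget counts characters rather than contexts, a very small value leaves no room for any context at all.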

In [0]:
# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(query)
query_with_contexts

In [0]:
# then we complete the context-infused query
complete(query_with_contexts)

And we get a pretty great answer straight away, specifying to use _multiple-rankings loss_ (also called _multiple negatives ranking loss_).



## Task 2

Try adjusting the number of contexts down to 1, to see the impact on retrieval quality.

In [0]:
# TODO

# Solution

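# `limit` is a character budget, not a count of contexts; with limit = 1
# no retrieved context fits, so the prompt is built without any context
limit = 1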

query_with_contexts = retrieve(query)
print(query_with_contexts)

complete(query_with_contexts)

## Pack up

Once you're done with the lab, delete the indices to save resources:

In [0]:
pc.delete_index('gen-qa-openai-fast')
pc.delete_index('semantic-search-fast')
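
If one of the indexes was never created, deleting it may raise an error. As a sketch, a hypothetical helper `delete_if_exists` can check first, using only the `list_indexes().names()` and `delete_index` calls already used in this lab. The `FakeClient` below is a stand-in so the example runs without a Pinecone account; in the notebook you would pass `pc` instead:

```python
def delete_if_exists(client, index_name):
    """Delete index_name via client if it is listed; report whether it was deleted."""
    if index_name in client.list_indexes().names():
        client.delete_index(index_name)
        return True
    return False


class _FakeIndexList:
    # mimics the object returned by list_indexes(), exposing names()
    def __init__(self, names):
        self._names = names

    def names(self):
        return list(self._names)


class FakeClient:
    # in-memory stand-in for the Pinecone client, for illustration only
    def __init__(self, names):
        self._names = set(names)

    def list_indexes(self):
        return _FakeIndexList(self._names)

    def delete_index(self, name):
        self._names.remove(name)


client = FakeClient({"semantic-search-fast"})
print(delete_if_exists(client, "semantic-search-fast"))  # index existed
print(delete_if_exists(client, "gen-qa-openai-fast"))    # index was never created
```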