# Semantic Search with Pinecone and OpenAI

**What is semantic search?**
- Semantic search is a search technique that uses the intent and contextual meaning behind a query to deliver more relevant results.

**Lexical search vs Semantic search**
- Semantic search aims to understand the meaning behind user queries and provide contextually relevant results, while lexical search relies on literal matching of keywords.
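To make the contrast concrete, here is a minimal sketch comparing a literal keyword-overlap score with cosine similarity over vectors. The three-dimensional "embeddings" are hand-made values chosen for illustration, not output from any model, and the helper names `lexical_score` and `cosine_similarity` are assumptions introduced for this example.

```python
import math

def lexical_score(query: str, document: str) -> float:
    """Fraction of query tokens (whitespace-split, lowercased) found literally in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (illustrative only, not real embeddings).
toy_embeddings = {
    "How do I fix a bug?": [0.9, 0.1, 0.0],
    "Ways to debug broken code": [0.8, 0.2, 0.1],
    "Best pizza toppings": [0.0, 0.1, 0.9],
}

query = "How do I fix a bug?"
for text in ["Ways to debug broken code", "Best pizza toppings"]:
    lex = lexical_score(query, text)
    sem = cosine_similarity(toy_embeddings[query], toy_embeddings[text])
    print(f"{text!r}: lexical={lex:.2f}, semantic={sem:.2f}")
```

Both candidate texts share no words with the query, so the lexical score cannot tell them apart, while the vector comparison ranks the debugging question far above the unrelated one.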

**Pinecone Indexes**
- An index in Pinecone is the data structure used to efficiently organize and search high-dimensional vector embeddings. It's where we store our vectors so they can be retrieved later.
	- There are different types of indexes that can store our vectors. In this notebook, we're going to use the starter index on the free tier. **We can only create one starter index**.
	
	Read more: https://docs.pinecone.io/docs/indexes

**Our workflow is as follows:**

1. Embedding and indexing:
	- Generate vector embeddings of our text data using the OpenAI Embedding API.
	- Upload these vector embeddings into Pinecone.
2. Searching:
	- Pass the query text through the OpenAI Embedding API once again.
	- Use the resulting vector embedding as a query and send it to Pinecone.
	- Receive semantically similar text in response, even if it shares no keywords with the query.
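The two phases above can be sketched end to end without any external service. In this sketch, `toy_embed` is a hypothetical stand-in for the OpenAI Embedding API (it only counts letters, so it captures no real meaning), and `InMemoryIndex` is a brute-force stand-in for a Pinecone index. Only the shape of the workflow carries over to the real pipeline.

```python
import math

Vector = list[float]

def toy_embed(text: str) -> Vector:
    """Hypothetical embedding: letter counts a-z, scaled to unit length."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

class InMemoryIndex:
    """Brute-force stand-in for a vector index."""

    def __init__(self) -> None:
        self.records: dict[str, tuple[Vector, str]] = {}

    def upsert(self, record_id: str, vector: Vector, text: str) -> None:
        self.records[record_id] = (vector, text)

    def query(self, vector: Vector, top_k: int) -> list[tuple[float, str]]:
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(vector, stored)), text)
            for stored, text in self.records.values()
        ]
        return sorted(scored, reverse=True)[:top_k]

def build_and_search(texts: list[str], query: str, top_k: int = 2) -> list[tuple[float, str]]:
    index = InMemoryIndex()
    # 1. Embedding and indexing
    for i, text in enumerate(texts):
        index.upsert(str(i), toy_embed(text), text)
    # 2. Searching: embed the query the same way, then look up nearest vectors
    return index.query(toy_embed(query), top_k)

questions = ["What is recursion?", "How do loops work?", "Explain Big O notation."]
for score, text in build_and_search(questions, "What is recursion?"):
    print(f"{score:.2f}: {text}")
```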

## Setup:

We need to set up our environment and retrieve API keys for OpenAI and Pinecone.
1. Run this command in the terminal to install Pinecone, OpenAI, and python-dotenv using pip:

    ``` bash
    pip install pinecone-client openai python-dotenv
    # This command will install Pinecone, OpenAI, and dotenv for python.
    ```
2. Run this command in the terminal to verify that the installation was successful.

    ``` bash
    pip list
    # This lists all installed Python packages. Locate `pinecone-client`, `openai`, and `python-dotenv` to verify that they were installed.
    ```

3. Create a `.env` file and get your API keys for OpenAI and Pinecone. 
	- For OpenAI, get your API key at https://platform.openai.com/api-keys.
	- For Pinecone, get your API key at https://app.pinecone.io/.
	- Paste both API keys into the `.env` file.

    ``` bash
    # Your .env should look like this. Replace the values of OPENAI_API_KEY and PINECONE_API_KEY with your actual OpenAI and Pinecone API keys.
    OPENAI_API_KEY=sk-###
    PINECONE_API_KEY=###
    ```
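Before making any API calls, it can help to confirm that both keys are actually visible to Python. The sketch below uses only the standard library. The helper name `missing_keys` is an assumption introduced here, and in the notebook you would call `load_dotenv(find_dotenv())` first so that the values from `.env` are present in the environment.

```python
import os
from typing import Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")

def missing_keys(env: Mapping[str, str] = os.environ,
                 required: tuple[str, ...] = REQUIRED_KEYS) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]

# In the notebook, call load_dotenv(find_dotenv()) before this check.
missing = missing_keys()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All API keys are set.")
```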

## Let's get started

Let's initialize the OpenAI client that we'll use to call the embeddings API:

In [55]:
import os
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv

# Load the .env
load_dotenv(find_dotenv())

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

We can now create embeddings with an OpenAI embedding model as follows:

In [56]:
MODEL = "text-embedding-3-small"

response = client.embeddings.create(
    input="Hello villagers",
    model=MODEL
)

# Print the embedding vector (1536 values by default for this model)
print(response.data[0].embedding)

[0.054234445095062256, -0.012024283409118652, -0.026923412457108498, 0.01757209002971649, -0.012961030937731266, -0.0059919534251093864, 0.014931431040167809, 0.013025633990764618, -0.0071588498540222645, -0.007485903799533844, -0.020172370597720146, -0.02647119015455246, -0.006952926982194185, 0.0027274691965430975, -0.02001086249947548, -0.015682443976402283, -0.03488576412200928, 0.029733654111623764, -0.015795499086380005, 0.06344041228294373, 0.03280230984091759, 0.006932738237082958, -0.02230427786707878, -0.02238503284752369, 0.0036924805026501417, 0.02306336723268032, -0.02721412666141987, -0.00802695658057928, 0.09380394965410233, -0.07532741129398346, -0.016845302656292915, -0.042089030146598816, 0.0005435759667307138, -0.03310917690396309, 0.014026984572410583, -0.001265820348635316, -0.006464364472776651, -0.06990073621273041, -0.022449636831879616, -0.005313619039952755, 0.024339281022548676, -0.025066068395972252, -0.030815759673714638, -0.007615110371261835, 0.0026164324



### Creating an index using the web interface


We can create an index using the web interface:
- Navigate to the Pinecone console: https://app.pinecone.io
- Go to the index page and click the 'Create index' button.
- Enter a name for your index, then configure the dimension and metric.
    - We're using the `text-embedding-3-small` embedding model. Its default dimension is 1536, but the `dimensions` parameter of the OpenAI Embedding API lets us shorten it to 512. Read more about the OpenAI Embedding API parameters: https://platform.openai.com/docs/api-reference/embeddings/create
    - The recommended metric is cosine. Read more: https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-dimensions
- The capacity mode should default to starter on the free tier.
- Click the 'Create index' button below.
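To build some intuition for what shortening an embedding means, here is a rough manual sketch: keep the first values of the vector, then rescale the result to unit length so that cosine comparisons stay meaningful. This illustrates the idea only and is not a claim about how the API implements `dimensions` internally. The helper name `shorten_embedding` and the sample numbers are assumptions made for this example.

```python
import math

def shorten_embedding(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` values and rescale them to unit length."""
    truncated = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        return truncated
    return [x / norm for x in truncated]

# Toy 3-dimensional vector shortened to 2 dimensions.
short = shorten_embedding([3.0, 4.0, 12.0], 2)
print(short)
print(math.sqrt(sum(x * x for x in short)))  # length of the shortened vector
```

Whatever dimension you choose, the index dimension and the `dimensions` value used when embedding must match, otherwise upserts and queries will be rejected.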

### Creating an index programmatically

Let's create the connection to the Pinecone Vector Database.

In [57]:
import os
from pinecone import Pinecone, PodSpec
from dotenv import load_dotenv, find_dotenv

# Load the .env
load_dotenv(find_dotenv())

# Initialize Pinecone
pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))


We can now create an index programmatically as follows:

In [58]:
import time

INDEX_NAME = "semantic-search-openai-01"
DIMENSIONS = 512

# Check if index exists, create it if not
if INDEX_NAME not in pc.list_indexes().names():
    pc.create_index(
        name=INDEX_NAME,
        dimension=DIMENSIONS,
        metric="cosine",
        spec=PodSpec(environment="gcp-starter")
    )
	
    # Wait for the index to be initialized
    print("Waiting for index to be ready...")
    while not pc.describe_index(INDEX_NAME).status["ready"]:
        time.sleep(1)
    print("Index is ready.")
    
# Connect to the index
index = pc.Index(INDEX_NAME)
time.sleep(1)
# View index stats
index.describe_index_stats()

Waiting for index to be ready...
Index is ready.


{'dimension': 512,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

## Populating the Index
We can use the upsert operation to write records into an index namespace. If a record ID already exists, upsert overwrites the entire record. To update only part of a record, use the update operation instead. Read more: https://docs.pinecone.io/docs/upsert-data
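The difference between the two operations can be illustrated with a plain dictionary standing in for an index. This is a toy model, not the Pinecone client: the module-level `records` dict and both functions are assumptions for illustration, with parameter names loosely mirroring the `values` and `set_metadata` arguments of Pinecone's update operation.

```python
from typing import Optional

records: dict[str, dict] = {}

def upsert(record_id: str, values: list[float], metadata: dict) -> None:
    """Write a record, replacing any existing record with the same ID entirely."""
    records[record_id] = {"values": values, "metadata": dict(metadata)}

def update(record_id: str,
           values: Optional[list[float]] = None,
           set_metadata: Optional[dict] = None) -> None:
    """Change only the supplied parts of an existing record."""
    record = records[record_id]
    if values is not None:
        record["values"] = values
    if set_metadata:
        record["metadata"].update(set_metadata)

upsert("q1", [0.1, 0.2], {"text": "What is recursion?", "topic": "basics"})
upsert("q1", [0.3, 0.4], {"text": "What is recursion?"})  # whole record replaced, "topic" is gone
update("q1", set_metadata={"topic": "algorithms"})        # vector untouched, one field added
print(records["q1"])
```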

Let's create our simple dataset of programming questions.

In [59]:
# Define your custom dataset
custom_dataset = [
    "What is a variable in programming?",
    "How do you declare a function in Python?",
    "Explain the concept of object-oriented programming.",
    "What is the difference between a list and a tuple in Python?",
    "What is recursion and when would you use it?",
    "How do you handle errors in programming?",
    "What is the purpose of version control systems like Git?",
    "Explain the difference between compile-time and runtime errors.",
    "What is the role of a constructor in object-oriented programming?",
    "What are the main principles of clean code?",
    "What is the difference between '==' and 'is' in Python?",
    "How does inheritance work in object-oriented programming?",
    "What are the benefits of using object-oriented programming?",
    "How do you optimize code for performance?",
    "What is the purpose of comments in code?",
    "Explain the concept of Big O notation.",
    "How do you debug a program when it's not working as expected?",
    "What are the main data types in programming?",
    "What is an API and how do you use it?",
    "How do you handle concurrency in programming?",
    "What is the purpose of unit testing?",
    "Explain the difference between procedural and object-oriented programming.",
    "What are the advantages of using libraries and frameworks in programming?",
    "What are some common data structures and their use cases?",
    "How do you manage dependencies in a project?",
    "What is the role of a package manager in programming?",
    "What is the purpose of a loop in programming?",
    "How do you optimize database queries?",
    "What is the role of indexing in databases?",
    "What is a callback function and when would you use one?",
]

Now let's upload the dataset to Pinecone.

In [60]:
# This is only to show us a progress bar to help us know if the upload is completed. 
from tqdm.auto import tqdm

# Number of questions to embed and upsert per request
batch_size = 10

# Iterate over batches of data
for i in tqdm(range(0, len(custom_dataset), batch_size)):
    # Set end position of current batch
    i_end = min(i + batch_size, len(custom_dataset))
    
    # Get batch of lines and IDs
    lines_batch = custom_dataset[i: i_end]
    ids_batch = [str(n) for n in range(i, i_end)]
    
    # Create embeddings for the batch
    res = client.embeddings.create(input=lines_batch, model=MODEL, dimensions=DIMENSIONS)
    embeds = [record.embedding for record in res.data]
    
    # Prepare metadata for the batch
    meta = [{'text': line} for line in lines_batch]
    
    # Combine embeddings, IDs, and metadata for upsert
    to_upsert = zip(ids_batch, embeds, meta)
    
    # Upsert the batch into Pinecone index
    index.upsert(vectors=list(to_upsert))


100%|██████████| 3/3 [00:03<00:00,  1.06s/it]
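The batching logic in the loop above can also be factored into two small helpers, which makes it easier to test without calling any API. This is a sketch under stated assumptions: `batched` and `build_records` are hypothetical names, and the dictionary record shape (`id`, `values`, `metadata`) is assumed here as an alternative to the `(id, vector, metadata)` tuples used in the cell above.

```python
from typing import Iterator

def batched(items: list[str], batch_size: int) -> Iterator[tuple[int, list[str]]]:
    """Yield (start_index, batch) pairs that cover `items` in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

def build_records(start: int, texts: list[str], embeddings: list[list[float]]) -> list[dict]:
    """Pair each text with a string ID, its vector, and its metadata."""
    return [
        {"id": str(start + offset), "values": vector, "metadata": {"text": text}}
        for offset, (text, vector) in enumerate(zip(texts, embeddings))
    ]

# Toy run with made-up one-dimensional vectors in place of real embeddings.
for start, batch in batched(["a", "b", "c", "d", "e"], 2):
    fake_vectors = [[float(len(text))] for text in batch]
    print(build_records(start, batch, fake_vectors))
```

Deriving IDs from the position in the dataset keeps them stable across runs, so re-running the upload overwrites the same records instead of creating duplicates.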


## Sending a query
Now that we've populated our index, we're ready to start searching through it. Searching works a lot like indexing. First, we write a question or topic we're interested in. Then, we use the same OpenAI embedding model, with the same `dimensions` value, to create a vector for it.

In [63]:
query = "What is object-oriented programming?"

VECTOR = client.embeddings.create(input=query, model=MODEL, dimensions=DIMENSIONS).data[0].embedding

Once we have our vector, we use it to search the index with Pinecone's query function. Read more: https://docs.pinecone.io/docs/query-data. The response from Pinecone includes our original text in the metadata field. Let's print the `top_k` most similar questions and their similarity scores.

Read more: https://docs.pinecone.io/reference/query

In [64]:
# Pass the embedding as a flat list of floats, not a nested list
res = index.query(vector=VECTOR, top_k=5, include_metadata=True)
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.92: Explain the concept of object-oriented programming.
0.81: What are the benefits of using object-oriented programming?
0.74: Explain the difference between procedural and object-oriented programming.
0.65: How does inheritance work in object-oriented programming?
0.58: What is the role of a constructor in object-oriented programming?


Great! Our semantic search pipeline identifies and returns the most semantically similar questions or topics from our vector index.
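Because a query always returns `top_k` results, even weakly related ones, it is sometimes useful to drop matches below a similarity cutoff. The sketch below works on plain dictionaries shaped like the matches printed above. The helper name `filter_matches` and the 0.7 threshold are assumptions for illustration, and a sensible cutoff depends on your data and embedding model.

```python
def filter_matches(matches: list[dict], min_score: float) -> list[dict]:
    """Keep only the matches whose similarity score is at least `min_score`."""
    return [match for match in matches if match["score"] >= min_score]

# Sample matches copied from the output above.
sample_matches = [
    {"score": 0.92, "metadata": {"text": "Explain the concept of object-oriented programming."}},
    {"score": 0.74, "metadata": {"text": "Explain the difference between procedural and object-oriented programming."}},
    {"score": 0.58, "metadata": {"text": "What is the role of a constructor in object-oriented programming?"}},
]

for match in filter_matches(sample_matches, min_score=0.7):
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
```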