# Vector Databases for Embeddings with Pinecone

## Creating a Pinecone Client

### Question:
Throughout the course, you'll write Python code to interact with Pinecone via the Pinecone Python client. As a first step, you'll need to create your own Pinecone API key. Pinecone API keys used in this course's exercises will not be stored in any way.

To create a key, you'll first need to create a Pinecone starter account, which is free, by visiting [Pinecone's website](https://www.pinecone.io/). Next, navigate to the API Keys page to create your key.

### Instructions:
- Import the class used to create a Pinecone client from `pinecone`.
- Instantiate the Pinecone class, passing in your API key.

In [3]:
# Import the Pinecone client class
from pinecone import Pinecone

# Initialize the client, replacing "api_key" with your own key
pc = Pinecone(api_key = "api_key")
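
Hard-coding a key in a notebook makes it easy to leak. A minimal sketch of one alternative, assuming the key has been exported in an environment variable named `PINECONE_API_KEY` (both the variable name and the `load_api_key` helper are illustrative and not part of the course):

```python
import os

def load_api_key(env_var="PINECONE_API_KEY"):
    """Return the API key stored in env_var, or raise if it is missing."""
    # env_var is an assumed name; use whatever your environment defines
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Usage (requires the pinecone package and a real key):
# from pinecone import Pinecone
# pc = Pinecone(api_key=load_api_key())
```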

## Exercise: Your First Pinecone Index

### Question:
With your Pinecone client initialized, you're all set to begin creating an index! Indexes store records (vectors and their associated metadata) and serve queries and other operations on them. As you progress through the course, you'll see how these different steps build up to a modern AI system built on a vector database.

If you accidentally create a valid index that doesn't meet the specifications detailed in the instructions, you'll need to add the following code before your `.create_index()` code to delete it and re-create it:

```python
pc.delete_index('my-first-index')
```

In [5]:
from pinecone import ServerlessSpec

# Create your Pinecone index
pc.create_index(
    name='my-first-index',
    dimension=256,
    spec=ServerlessSpec(
        cloud='aws',
        region='us-east-1'
    )
)

{
    "name": "my-first-index",
    "metric": "cosine",
    "host": "my-first-index-b2pnslq.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "vector_type": "dense",
    "dimension": 256,
    "deletion_protection": "disabled",
    "tags": null
}
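
Rather than deleting and re-creating by hand, index creation can be guarded so that re-running the cell is harmless. The sketch below assumes a client that exposes `list_indexes().names()` and `create_index()`, as recent versions of the Pinecone Python client do; `ensure_index` is a hypothetical helper name, not part of the library.

```python
def ensure_index(pc, name, dimension, spec, metric="cosine"):
    """Create the index only if no index with this name exists yet.

    Returns True if an index was created, False if it already existed.
    """
    if name in pc.list_indexes().names():
        return False
    pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True

# Usage with a real client (assumes `pc` and `ServerlessSpec` from above):
# ensure_index(pc, "my-first-index", 256,
#              ServerlessSpec(cloud="aws", region="us-east-1"))
```

Note that this sketch does not check whether an existing index has the requested dimension or metric; an index created with the wrong settings still has to be deleted first.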

## Exercise: Connecting to an Index

### Question:
To begin ingesting vectors and performing vector manipulations in your newly created Pinecone index, you'll first need to connect to it! The resulting index object has a number of methods for ingesting, manipulating, and exploring the contents of the index in Python.

The `Pinecone` class has already been imported for you and will be available throughout the course.

### Instructions:
- Initialize the Pinecone connection with your API key.
- Connect to the `"my-first-index"` index.
- Print key statistics about the index using an index method.

In [6]:
# Connect to your index
index = pc.Index("my-first-index")

# Print the index statistics
print(index.describe_index_stats())

{'dimension': 256,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}


## Exercise: Deleting an Index

### Question:
If you have an index that has gone stale, perhaps it's time to delete it! **Deleting an index will also delete all of the data it contains, so be cautious when doing this in your own projects!**

### Instructions:
- Initialize the Pinecone connection with your API key.
- Delete the index you've been working with: `"my-first-index"`.
- List your indexes to verify it has been deleted.

In [7]:
# Delete your Pinecone index
pc.delete_index('my-first-index')

# List your indexes
print(pc.list_indexes())

[{
    "name": "datacamp-index",
    "metric": "cosine",
    "host": "datacamp-index-b2pnslq.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "vector_type": "dense",
    "dimension": 1536,
    "deletion_protection": "disabled",
    "tags": null
}, {
    "name": "pinecone-datacamp",
    "metric": "cosine",
    "host": "pinecone-datacamp-b2pnslq.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "vector_type": "dense",
    "dimension": 1536,
    "deletion_protection": "disabled",
    "tags": null
}, {
    "name": "dotproduct-index",
    "metric": "dotproduct",
    "host": "dotproduct-index-b2pnslq.svc.aped-4627-b74a.pinecone.io",
    "spec": {


## Exercise: Checking Dimensionality

### Question:
You now have the know-how to begin ingesting vectors into a new Pinecone index! Before you jump in, you should check that your vectors are compatible with the dimensionality of your new index.

A list of dictionaries containing records to ingest has been provided as `vectors`. Here's a preview of its structure:

```python
vectors = [
    {
        "id": "0",
        "values": [0.025525547564029694, ..., 0.0188823901116848],
        "metadata": {"genre": "action", "year": 2024}
    },
    ...
]
```

In [8]:
vectors = [
    {
        "id": "0",
        "values": [0.0] * 1536,  # Example 1536-dimensional vector
        "metadata": {"genre": "action", "year": 2024}
    },
    {
        "id": "1",
        "values": [0.1] * 1536,
        "metadata": {"genre": "comedy", "year": 2023}
    }
]

# Check that each vector has a dimensionality of 1536
vector_dims = [len(vector["values"]) == 1536 for vector in vectors]
print("All vectors have the correct dimensionality:", all(vector_dims))

All vectors have the correct dimensionality: True
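
When the check prints `False`, it helps to know which records are at fault. A small sketch, using a hypothetical helper `find_dimension_mismatches` that returns the IDs and lengths of any offending vectors (the sample records are invented for the example):

```python
def find_dimension_mismatches(vectors, expected_dim):
    """Return (id, actual_length) pairs for vectors whose length is wrong."""
    return [
        (v["id"], len(v["values"]))
        for v in vectors
        if len(v["values"]) != expected_dim
    ]

sample = [
    {"id": "0", "values": [0.0] * 1536},
    {"id": "1", "values": [0.1] * 256},  # deliberately the wrong size
]
print(find_dimension_mismatches(sample, expected_dim=1536))
# prints [('1', 256)]
```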


## Exercise: Ingesting Vectors with Metadata

### Question:
It's ingesting time! You'll be ingesting `vectors`, a list of dictionaries containing the vector values, IDs, and associated metadata. The records are already in a format that can be inserted directly into the index without further manipulation.

Here's another reminder about the structure of `vectors`:

```python
vectors = [
    {
        "id": "0",
        "values": [0.025525547564029694, ..., 0.0188823901116848],
        "metadata": {"genre": "action", "year": 2024}
    },
    ...
]
```

In [11]:
# Define the index name
index_name = "datacamp-index"

# Connect to your index
index = pc.Index(index_name)

vectors = [
    {
        "id": "0",
        "values": [0.025525547564029694] * 1536,
        "metadata": {"genre": "action", "year": 2024}
    },
    {
        "id": "1",
        "values": [0.0188823901116848] * 1536,
        "metadata": {"genre": "comedy", "year": 2023}
    }
]

# Convert each record to an (id, values, metadata) tuple;
# upsert() also accepts the dictionaries as-is
upsert_data = [(v["id"], v["values"], v["metadata"]) for v in vectors]

# Ingest the vectors and metadata
index.upsert(vectors=upsert_data)

# Print the index statistics
stats = index.describe_index_stats()
print("Index Statistics:", stats)

Index Statistics: {'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 1000},
                'namespace1': {'vector_count': 50},
                'namespace2': {'vector_count': 50}},
 'total_vector_count': 1100,
 'vector_type': 'dense'}


## Exercise: Fetching Vectors

### Question:
In this exercise, you've been provided with a list of `ids` containing IDs of different records in your `'datacamp-index'` index. You'll use these IDs to retrieve the associated records and explore their metadata.

### Instructions:
- Initialize the Pinecone connection with your API key.
- Retrieve the vectors with IDs in the `ids` list from the connected index.
- Create a list of dictionaries containing the metadata from each record in `fetched_vectors`.

In [15]:
# List of IDs to fetch
ids = ['0', '1']

# Fetch vectors from the connected Pinecone index
fetched_response = index.fetch(ids=ids)

# Extract metadata safely
metadatas = [
    fetched_response.vectors[id].metadata
    for id in ids if id in fetched_response.vectors
]

# Print metadata
print("Fetched Metadata:", metadatas)

Fetched Metadata: [{'genre': 'action', 'year': 2024.0}, {'genre': 'comedy', 'year': 2023.0}]


## Returning the Most Similar Vectors  

Querying vectors is foundational to many AI applications. It involves embedding a user input, comparing it to the vectors in the database, and returning the most similar vectors.  

In this exercise, you've been provided with a mystery vector called `vector`, and you'll use it to query your index called **"datacamp-index"**.  

### Instructions
- Initialize the Pinecone connection with your API key.  
- Retrieve the **three records** with vectors that are most similar to `vector`.  


In [16]:
# Example vector (ensure it matches the dimensionality of your index, e.g., 1536)
vector = [0.01] * 1536  # Replace with an actual query vector

# Retrieve the top three most similar records
query_result = index.query(
    vector=vector,  # The query vector
    top_k=3,  # Number of most similar results
    include_metadata=True  # Retrieve metadata for better context
)

# Print query results
print("Most Similar Vectors:", query_result)

Most Similar Vectors: {'matches': [{'id': '1',
              'metadata': {'genre': 'comedy', 'year': 2023.0},
              'score': 0.999999821,
              'values': []},
             {'id': '0',
              'metadata': {'genre': 'action', 'year': 2024.0},
              'score': 0.999999642,
              'values': []},
             {'id': '419', 'score': 0.0761137679, 'values': []}],
 'namespace': '',
 'usage': {'read_units': 6}}
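
The near-perfect scores for IDs `'0'` and `'1'` follow from the cosine metric: cosine similarity compares direction only, and the query and both example vectors upserted earlier are constant vectors that point the same way. A pure-Python sketch of the calculation, for illustration only (Pinecone computes the score server-side, presumably at lower floating-point precision, which would explain scores slightly below 1):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.01] * 1536
record_0 = [0.025525547564029694] * 1536
record_1 = [0.0188823901116848] * 1536

# Both values are 1.0 up to floating-point rounding
print(cosine_similarity(query, record_0))
print(cosine_similarity(query, record_1))
```

Because magnitude is ignored, placeholder vectors like these cannot be told apart by a cosine index; real embeddings differ in direction and therefore produce a meaningful ranking.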


## Filtering Queries  

In this exercise, you'll practice querying the **"datacamp-index"** Pinecone index. You'll connect to the index and query it using the vector provided to retrieve similar vectors. You'll also use **metadata filtering** to optimize your querying and return the most relevant search results.  

### Instructions
- Initialize the Pinecone connection with your API key.  
- Retrieve the **MOST similar** vector to the vector provided.  
- Use **metadata filtering** to only search through vectors where the metadata **'year' equals 2024**.  


In [17]:
# Retrieve the MOST similar vector with the year 2024
query_result = index.query(
    vector = vector,
    filter =  {
        "year":{"$eq":2024}
    },
    top_k = 1
)
print(query_result)

{'matches': [{'id': '0', 'score': 0.999999642, 'values': []}],
 'namespace': '',
 'usage': {'read_units': 5}}


## Multiple Metadata Filters  

Pinecone allows for multiple **metadata filters** in a single query, enabling more precise search results. In this exercise, you'll query the **"datacamp-index"** Pinecone index using **multiple filters** with comparison operators.  

### Instructions
- Initialize the Pinecone connection with your API key.  
- Retrieve the **MOST similar** vector to the vector provided.  
- Use **metadata filtering** to **only** search through vectors where:  
  - The **"genre"** metadata is `"thriller"`.  
  - The **"year"** is **less than 2018**.  


In [18]:
# Retrieve the MOST similar vector with genre and year filters
query_result = index.query(
    vector=vector,
    top_k=1,
    filter={
        "genre": {"$eq": "thriller"},
        "year" : {"$lt": 2018}
    }
)
print(query_result)

{'matches': [], 'namespace': '', 'usage': {'read_units': 5}}
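
Listing several fields in one filter dictionary combines them with a logical AND. Pinecone's filter language also has an explicit `$and` operator (alongside `$or`), so the same condition can be written out in full. The sketch below only builds the dictionary; `thriller_before` is a hypothetical helper name.

```python
def thriller_before(year):
    """Build an explicit $and filter: genre == 'thriller' AND year < `year`."""
    return {
        "$and": [
            {"genre": {"$eq": "thriller"}},
            {"year": {"$lt": year}},
        ]
    }

metadata_filter = thriller_before(2018)
print(metadata_filter)

# Usage with a connected index (assumes `index` and `vector` from above):
# index.query(vector=vector, top_k=1, filter=metadata_filter)
```

An empty `matches` list, as in the output above, means that no record in the searched namespace satisfied both conditions.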


# Defining a function for chunking
To be able to batch upserts in a reproducible way, you'll need to define a function to split your list of vectors into chunks.

The built-in `itertools` module has already been imported for you.

## Instructions
- Convert the iterable input into an iterator.
- Slice it into chunks of size `batch_size` using the `itertools` module.
- Yield the current chunk.


In [24]:
import itertools

def chunks(iterable, batch_size=100):
    """A helper function to break an iterable into chunks of size batch_size."""
    # Convert the iterable into an iterator
    it = iter(iterable)
    # Slice the iterator into chunks of size batch_size
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        # Yield the chunk
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))
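
To see what the generator produces, the sketch below repeats the function so that it runs on its own and chunks 250 placeholder items; the final chunk holds whatever is left over.

```python
import itertools

def chunks(iterable, batch_size=100):
    """Yield successive tuples of at most batch_size items."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

sizes = [len(chunk) for chunk in chunks(range(250), batch_size=100)]
print(sizes)  # prints [100, 100, 50]
```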

# Batching upserts in chunks
In this exercise, you'll practice ingesting vectors into the `'datacamp-index'` Pinecone index in series, batch-by-batch.

- Upsert the vectors in `vectors` into `'datacamp-index'` in batches of 100 vectors.


In [25]:
# Upsert vectors in batches of 100
for chunk in chunks(vectors, batch_size=100):
    index.upsert(vectors=chunk)

# Retrieve statistics of the connected Pinecone index
print(index.describe_index_stats())


{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 1000},
                'namespace1': {'vector_count': 50},
                'namespace2': {'vector_count': 50}},
 'total_vector_count': 1100,
 'vector_type': 'dense'}


# Batching upserts in parallel
In this exercise, you'll practice ingesting vectors into the `'datacamp-index'` Pinecone index in parallel. You'll need to connect to the index, upsert vectors in batches asynchronously, and check the updated metrics of the `'datacamp-index'` index.

## Instructions
1. Initialize the Pinecone client to allow **20 simultaneous requests**.
2. Upsert the vectors in `vectors` in batches of **200 vectors per request asynchronously**, configuring **20 simultaneous requests**.
3. Print the updated metrics of the `'datacamp-index'` Pinecone index.


In [26]:
# Upsert vectors in batches of 200 vectors
with pc.Index('datacamp-index', pool_threads=20) as index:
    async_results = [
        index.upsert(vectors=chunk, async_req=True)
        for chunk in chunks(vectors, batch_size=200)
    ]
    # Block until every asynchronous request has completed
    for async_result in async_results:
        async_result.get()

# Retrieve statistics of the connected Pinecone index
print(index.describe_index_stats())

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 1000},
                'namespace1': {'vector_count': 50},
                'namespace2': {'vector_count': 50}},
 'total_vector_count': 1100,
 'vector_type': 'dense'}


# Upserting vectors for semantic search
Time to embed some text data and upsert the vectors and metadata into your `'pinecone-datacamp'` index! You've been given a dataset named `squad_dataset.csv`, and a sample of **200 rows** has been loaded into the DataFrame, `df`.

In this exercise, you will interact with the OpenAI API to use their embedding model. **You don't need to create or use your own API key**—a valid OpenAI client has been created for you and assigned to the `client` variable.

Your task is to:
- Embed the text using OpenAI's API.
- Upsert the embeddings and metadata into the Pinecone index under the namespace **squad_dataset**.

## Instructions
1. **Initialize** the Pinecone client with your API key (**OpenAI client is already available as `client`**).
2. **Extract** the `'id'`, `'text'`, and `'title'` metadata from each row in the batch.
3. **Encode texts** using `'text-embedding-3-small'` from OpenAI with **dimensionality 1536**.
4. **Upsert** the vectors and metadata into **Pinecone** under the namespace `'squad_dataset'`.


In [None]:
import time
from uuid import uuid4

import numpy as np
import openai
import pandas as pd

# Connect to the index (the Pinecone client `pc` was created earlier)
index = pc.Index('pinecone-datacamp')
# The course provides `client`; outside it, supply your own OpenAI API key
client = openai.OpenAI(api_key="OPENAI_key")
df = pd.read_csv("squad_dataset.csv")

# Define batch size
batch_limit = 100

# Function to process and upsert batches
def process_and_upsert(batch):
    """Encodes texts, generates embeddings, and upserts them into Pinecone."""
    try:
        # Extract metadata from each row
        metadatas = [{
            "text_id": str(row['id']),  # Ensure ID is a string
            "text": row['text'],
            "title": row['title']
        } for _, row in batch.iterrows()]

        # Convert texts to a list
        texts = batch['text'].tolist()

        # Generate unique IDs for each text
        ids = [str(uuid4()) for _ in range(len(texts))]

        # Encode texts using OpenAI's embedding model
        response = client.embeddings.create(input=texts, model="text-embedding-3-small")
        embeds = [np.array(x.embedding) for x in response.data]

        # Ensure embeddings match text count
        if len(embeds) != len(texts):
            raise ValueError("Mismatch between embeddings and texts!")

        # Upsert vectors along with metadata into the specified namespace
        index.upsert(vectors=list(zip(ids, embeds, metadatas)), namespace='squad_dataset')
        print(f"Successfully upserted {len(texts)} vectors.")

    except Exception as e:
        print(f"Error processing batch: {e}")

# Iterate over the DataFrame in batches of at most batch_limit rows
for start in range(0, len(df), batch_limit):
    process_and_upsert(df.iloc[start:start + batch_limit])
    time.sleep(0.5)

### Querying vectors for semantic search  
In this exercise, you'll create a query vector from the question, 'What is in front of the Notre Dame Main Building?'. Using this embedded query, you'll query the 'squad_dataset' namespace from the 'pinecone-datacamp' index and return the top five most similar vectors.  

#### Instructions  

- Initialize the Pinecone client with your API key (the OpenAI client is available as `client`).  
- Create a query vector by embedding the query provided with the same OpenAI embedding model you used for embedding the other vectors.  
- Query the "squad_dataset" namespace using `query_emb`, returning the top five most similar results.  


In [36]:
query = "What is in front of the Notre Dame Main Building?"

# Create the query vector
query_response = client.embeddings.create(
    input=query,
    model="text-embedding-3-small"
)
query_emb = query_response.data[0].embedding

# Query the index and retrieve the top five most similar vectors
retrieved_docs = index.query(
    vector = query_emb,
    top_k = 5,
    namespace = 'squad_dataset',
)

for result in retrieved_docs['matches']:
    print(f"{result['id']}: {round(result['score'], 2)}")
    print('\n')

123b1f17-f34f-4b19-ba5f-0503f9353d5a: 0.46
f64af2fb-616a-400e-b06d-2ff8f7f2a25d: 0.46
2e85bf0f-eba9-40a7-aa82-25bff072bcdb: 0.46
bd6280dc-9853-4849-acc2-e8f5ab7635ab: 0.46
b29ed957-b6ac-4cdb-9279-4d42b748ebd2: 0.29


### Upserting YouTube transcripts  
In the following exercises, you'll build a chatbot that can answer questions about YouTube videos by ingesting video transcripts and additional metadata into your `'pinecone-datacamp'` index.  

To start, you'll prepare data from the `youtube_rag_data.csv` file and upsert the vectors with all of their metadata into the `'pinecone-datacamp'` index. The data is provided in the DataFrame `youtube_df`.  

Here's an example transcript from the `youtube_df` DataFrame:  

**id:**  
35Pdoyi6ZoQ-t0.0  

**title:**  
Training and Testing an Italian BERT - Transformers From Scratch #4  

**text:**  
Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini-series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it...  

**url:**  
[https://youtu.be/35Pdoyi6ZoQ](https://youtu.be/35Pdoyi6ZoQ)  

**published:**  
01-01-2024  

#### Instructions  
- **Initialize** the Pinecone client with your API key (the OpenAI client is available as `client`).  
- **Extract** the `'id'`, `'text'`, `'title'`, `'url'`, and `'published'` metadata from each row.  
- **Encode** texts using `'text-embedding-3-small'` from OpenAI.  
- **Upsert** the vectors and metadata to a namespace called `'youtube_rag_dataset'`.  


In [37]:
batch_limit = 100

youtube_df = pd.read_csv("youtube_rag_data.csv")

# Slice the DataFrame into batches of at most batch_limit rows
for start in range(0, len(youtube_df), batch_limit):
    batch = youtube_df.iloc[start:start + batch_limit]
    # Extract the metadata from each row
    metadatas = [{
      "text_id": row['id'],
      "text": row['text'],
      "title": row['title'],
      "url": row['url'],
      "published": row['published']} for _, row in batch.iterrows()]
    texts = batch['text'].tolist()

    ids = [str(uuid4()) for _ in range(len(texts))]

    # Encode texts using OpenAI
    response = client.embeddings.create(input=texts, model="text-embedding-3-small")
    embeds = [np.array(x.embedding) for x in response.data]

    # Upsert vectors to the correct namespace
    index.upsert(vectors=list(zip(ids, embeds, metadatas)), namespace='youtube_rag_dataset')

print(index.describe_index_stats())

Dimension: 1536
Index Fullness: 0.0
Namespaces:
  squad_dataset: 801 vectors
  youtube_rag_dataset: 200 vectors
Total Vector Count: 1001


# Building a Retrieval Function

## Overview
A key process in the Retrieval Augmented Generation (RAG) workflow is retrieving data from the database. In this exercise, you'll design a custom function called `retrieve()` that will perform this crucial process in the final exercise of the course.

## Instructions
- **Initialize** the Pinecone client with your API key (the OpenAI client is available as `client`).
- **Define** the function `retrieve()` that takes four parameters: `query`, `top_k`, `namespace`, and `emb_model`.
- **Embed** the input query using the `emb_model` argument.
- **Retrieve** the `top_k` similar vectors to `query_emb` with metadata, specifying the `namespace` provided to the function as an argument.


In [39]:
# Define a retrieve function that takes four arguments: query, top_k, namespace, and emb_model
def retrieve(query, top_k, namespace, emb_model):
    # Encode the input query using OpenAI
    query_response = client.embeddings.create(
        input=query,
        model=emb_model
    )

    query_emb = query_response.data[0].embedding

    # Query the index using the query_emb
    docs = index.query(vector=query_emb, top_k=top_k, namespace=namespace, include_metadata=True)

    retrieved_docs = []
    sources = []
    for doc in docs['matches']:
        retrieved_docs.append(doc['metadata']['text'])
        sources.append((doc['metadata']['title'], doc['metadata']['url']))

    return retrieved_docs, sources

documents, sources = retrieve(
  query="How to build next-level Q&A with OpenAI",
  top_k=3,
  namespace='youtube_rag_dataset',
  emb_model="text-embedding-3-small"
)
print(documents)
print(sources)

"and our model into a pipeline, into a Q&A pipeline. So again, we get this pipeline from the Transformers library. So we come down here, do from Transformers, import pipeline. And now what we want to do is just initialize a pipeline object. So to do that, we just write pipeline. And then in here, what we need to add is a model type. So obviously, you can see up here, we have all of these different tasks. So summarization, text generation and so on. The Transformers library needs to understand, or this pipeline object needs to understand which one of those pipelines or functions we are intending to use. So to tell it that we want to do question answering, we just write question answering. And that basically sets the wrapper of the pipeline to handle question answering formats. So we'll see our input and for our input, we will be passing a context and a question. So we'll see that it will convert into the right structure that we need for question answering, which is the CLS context separ

# RAG Question Answering Function

## Overview
You're almost there! The final piece in the Retrieval Augmented Generation (RAG) workflow is to integrate the retrieved documents with a question-answering model.

A `prompt_with_context_builder()` function has already been defined and made available to you. This function takes the documents retrieved from the Pinecone index and integrates them into a prompt that the question-answering model can ingest:

```python
def prompt_with_context_builder(query, docs):
    delim = '\n\n---\n\n'
    prompt_start = 'Answer the question based on the context below.\n\nContext:\n'
    prompt_end = f'\n\nQuestion: {query}\nAnswer:'

    prompt = prompt_start + delim.join(docs) + prompt_end
    return prompt
```
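
As a quick illustration of the prompt layout, here is the same function applied to two placeholder documents (the query and passages are invented for the example):

```python
def prompt_with_context_builder(query, docs):
    delim = '\n\n---\n\n'
    prompt_start = 'Answer the question based on the context below.\n\nContext:\n'
    prompt_end = f'\n\nQuestion: {query}\nAnswer:'
    return prompt_start + delim.join(docs) + prompt_end

example_prompt = prompt_with_context_builder(
    "What is a vector database?",
    ["First placeholder passage.", "Second placeholder passage."],
)
print(example_prompt)
```

The passages appear under `Context:`, separated by a `---` line, and the prompt ends with `Answer:` so that the model's completion starts with the answer itself.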

## Instructions

- Initialize the Pinecone client with your API key (the OpenAI client is available as `client`).
- Retrieve the three most similar documents to the query text from the `'youtube_rag_dataset'` namespace.
- Generate a response to the provided `prompt` and `sys_prompt` using OpenAI's `'gpt-4o-mini'` model, specified using the `chat_model` function argument.

In [40]:
query = "How to build next-level Q&A with OpenAI"

# Retrieve the top three most similar documents and their sources
documents, sources = retrieve(query, top_k=3, namespace='youtube_rag_dataset', emb_model="text-embedding-3-small")

prompt_with_context = prompt_with_context_builder(query, documents)
print(prompt_with_context)

def question_answering(prompt, sources, chat_model):
    sys_prompt = "You are a helpful assistant that always answers questions."

    # Use OpenAI chat completions to generate a response
    res = client.chat.completions.create(
        model=chat_model,
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    answer = res.choices[0].message.content.strip()
    answer += "\n\nSources:"
    for source in sources:
        answer += "\n" + source[0] + ": " + source[1]

    return answer

answer = question_answering(
  prompt=prompt_with_context,
  sources=sources,
  chat_model='gpt-4o-mini')
print(answer)

Answer the question based on the context below.

Context:
and our model into a pipeline, into a Q&A pipeline. So again, we get this pipeline from the Transformers library. So we come down here, do from Transformers, import pipeline. And now what we want to do is just initialize a pipeline object. So to do that, we just write pipeline. And then in here, what we need to add is a model type. So obviously, you can see up here, we have all of these different tasks. So summarization, text generation and so on. The Transformers library needs to understand, or this pipeline object needs to understand which one of those pipelines or functions we are intending to use. So to tell it that we want to do question answering, we just write question answering. And that basically sets the wrapper of the pipeline to handle question answering formats. So we'll see our input and for our input, we will be passing a context and a question. So we'll see that it will convert into the right structure that we ne