<a href="https://colab.research.google.com/github/sudarshan-koirala/youtube-stuffs/blob/main/llamaindex/Llama_index_pinecone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama-Index with Pinecone

In this notebook, we demonstrate how to use the `llama-index` (previously GPT Index) library with Pinecone for semantic search.

In [None]:
%%capture
!pip install llama-index datasets pinecone-client openai transformers

We start by loading the [SQuAD dataset from Hugging Face](https://huggingface.co/datasets/squad), which contains question-answer pairs based on Wikipedia articles.

The dataset is loaded with the Hugging Face [Datasets](https://github.com/huggingface/datasets) library (GitHub repo).

We'll then convert the dataset into a pandas DataFrame and keep only the unique 'context' fields, which are the text passages that the questions are based on.

In [None]:
from datasets import load_dataset

data = load_dataset('squad', split='train')
data = data.to_pandas()[['id', 'context', 'title']]
data.drop_duplicates(subset='context', keep='first', inplace=True)
data.head()

In [None]:
data.shape
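
As a rough illustration of what the de-duplication step does, the sketch below keeps the first record for each distinct `context` value, using plain Python instead of pandas. The rows are made-up placeholders, not real SQuAD records, and `dedupe_by_context` is a hypothetical helper introduced only for this example.

```python
# Toy rows standing in for SQuAD records (made-up values, not real data).
rows = [
    {"id": "a1", "context": "Passage one.", "title": "T1"},
    {"id": "a2", "context": "Passage one.", "title": "T1"},
    {"id": "b1", "context": "Passage two.", "title": "T2"},
]


def dedupe_by_context(records):
    """Keep the first record seen for each distinct 'context' value."""
    seen = set()
    unique = []
    for record in records:
        if record["context"] not in seen:
            seen.add(record["context"])
            unique.append(record)
    return unique


unique_rows = dedupe_by_context(rows)
print([r["id"] for r in unique_rows])  # ['a1', 'b1']
```

Several questions in SQuAD share one passage, so de-duplicating on `context` leaves one row per passage rather than one row per question.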

The following code transforms our DataFrame into a list of Document objects, ready for indexing with llama_index.

Each document contains the following:
- text passage
- a unique id
- an extra field for the article title.

In [None]:
from llama_index import Document

docs = []

for i, row in data.iterrows():
    docs.append(Document(
        text=row['context'],
        doc_id=row['id'],
        extra_info={'title': row['title']}
    ))
print(len(docs))
docs[0]

In [None]:
documents = docs[:100]
len(documents)

### Indexing in Pinecone

[Pinecone](https://www.pinecone.io/) is a managed vector database service designed for machine learning applications. We're using it in this context to store and retrieve embeddings generated by our language model, enabling efficient and scalable semantic similarity-based search.

Get your API key and environment name, both available for **free** in the [Pinecone console](https://app.pinecone.io/), then create a new index.

The index has a dimension of 1536, matching the size of the vectors produced by the `text-embedding-ada-002` model we'll be using, and uses cosine similarity, the recommended metric for comparing those vectors.
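
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula on tiny made-up vectors; it is not how Pinecone computes scores internally, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal: 0.0
```

The dimension check mirrors why the index dimension must match the embedding model: vectors of different sizes cannot be compared.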

In [None]:
import os

import pinecone

# find API key in console at app.pinecone.io (replace the placeholder below)
os.environ['PINECONE_API_KEY'] = 'PINECONE_API_KEY'
# environment is found next to API key in the console
os.environ['PINECONE_ENVIRONMENT'] = 'asia-southeast1-gcp'

# initialize connection to pinecone
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT']
)

# create the index if it does not exist already
index_name = 'llama-index-pinecone'
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,
        metric='cosine'
    )

# connect to the index
pinecone_index = pinecone.Index(index_name)
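
The cell above writes placeholder keys directly into `os.environ`. If you would rather set the keys outside the notebook, one possible pattern is to read them and fail early when they are missing. The sketch below assumes a hypothetical helper, `require_env`, which is not part of `pinecone` or `llama_index`; `EXAMPLE_KEY` is a dummy variable used only for the demonstration.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running this cell"
        )
    return value


# Demo only: a dummy variable so the example runs without real credentials.
os.environ["EXAMPLE_KEY"] = "dummy-value"
print(require_env("EXAMPLE_KEY"))
```

With real credentials exported in the environment, the same helper could supply the `api_key` and `environment` arguments instead of the hard-coded assignments.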

Here, we're initializing a `PineconeVectorStore` with our previously created Pinecone index. This object will serve as the storage and retrieval interface for our document embeddings in Pinecone's vector database.

In [None]:
from llama_index.vector_stores import PineconeVectorStore

# we can select a namespace (acts as a partition in an index)
namespace = '' # default namespace

vector_store = PineconeVectorStore(pinecone_index=pinecone_index, namespace=namespace)

Next, we initialize the `GPTVectorStoreIndex` with our list of `Document` objects, using the `PineconeVectorStore` as storage and the `OpenAIEmbedding` model for embeddings.

`StorageContext` is used to configure the storage setup, and `ServiceContext` sets up the embedding model. The `GPTVectorStoreIndex` handles the indexing and querying process, making use of the provided storage and service contexts.

In [None]:
import os

from llama_index import GPTVectorStoreIndex, StorageContext, ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding

# setup our storage (vector db)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)

# https://platform.openai.com/account/api-keys
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# setup the index/query process, i.e. the embedding model (and completion model if used)
embed_model = OpenAIEmbedding(model='text-embedding-ada-002', embed_batch_size=100)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

index = GPTVectorStoreIndex.from_documents(
    documents, storage_context=storage_context,
    service_context=service_context
)
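
The `embed_batch_size=100` setting asks the embedding model to send texts in groups of up to 100 per request instead of one at a time. The snippet below is a rough sketch of that batching idea only, not the library's actual implementation; `batched` is a hypothetical helper and the texts are placeholders.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 250 placeholder text chunks split into batches of at most 100.
texts = [f"chunk {i}" for i in range(250)]
sizes = [len(batch) for batch in batched(texts, 100)]
print(sizes)  # [100, 100, 50]
```

Larger batches mean fewer requests for the same number of text chunks, which is the main reason to raise the batch size.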

Finally, we can build a query engine from the `index` we built and use it to run a query.

In [None]:
query_engine = index.as_query_engine()
res = query_engine.query("In what year was the college of engineering established at the University of Notre Dame?")
print(res)
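
Conceptually, a vector-store query embeds the question, ranks the stored passages by similarity to it, and hands the best matches to the language model to compose an answer. The sketch below illustrates only the ranking step, under strong simplifying assumptions: `toy_embed` is a made-up bag-of-words stand-in for a real embedding model, the passages are invented, and none of these helpers belong to `llama_index` or Pinecone.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy 'embedding': a word-count vector. A stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query, passages, k=2):
    """Return the k (score, passage) pairs most similar to the query."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(p)), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Invented passages for illustration only.
passages = [
    "the college of engineering was established long ago",
    "pinecone stores vectors for similarity search",
    "the library holds many volumes",
]
results = top_k("when was the college of engineering established", passages, k=1)
print(results[0][1])
```

A real embedding model captures meaning rather than exact word overlap, so it can match a question to a passage that uses different wording.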

In [None]:
data.context[20]

In [None]:
query_engine = index.as_query_engine()
res = query_engine.query("When was the First Year of Studies program established at the University of Notre Dame?")
print(res)

Delete the Pinecone index once it is no longer needed.

In [None]:
pinecone.delete_index(index_name)

If you want to learn more about Pinecone, you can visit the [Pinecone GitHub examples](https://github.com/pinecone-io/examples/tree/master).

---