# Vector Database: Pinecone
A vector database is a database designed for searching and analyzing data by similarity. Instead of storing only text or numbers, it stores data as vectors: lists of numbers that represent different characteristics of the data.

For example, if you're storing information about movies, instead of just saving the titles and actors' names, you might save each movie as a vector that includes numbers representing things like the genre, the year it was made, the ratings it received, and so on.

Vector databases are purpose-built to handle the unique structure of vector embeddings. They index vectors for easy search and retrieval by comparing values and finding those that are most similar to one another.
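
To make "finding the most similar vectors" concrete, here is a minimal pure-Python sketch of a brute-force search using cosine similarity (the same metric chosen for the index later in this notebook). The three-number movie vectors and the names `cosine_similarity` and `most_similar` are made up for illustration. A real vector database would typically use an index structure instead of scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, top_k=2):
    """Return the top_k (id, score) pairs, best match first."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical toy vectors: [action, romance, comedy] scores per movie.
movies = {
    "action_movie": [0.9, 0.1, 0.2],
    "romcom": [0.1, 0.9, 0.8],
    "action_comedy": [0.8, 0.1, 0.7],
}

query_vector = [1.0, 0.0, 0.3]  # "mostly action, a little comedy"
print(most_similar(query_vector, movies))
```

The two action-heavy movies rank above the romantic comedy because their vectors point in nearly the same direction as the query vector.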

In [2]:
!pip install sentence-transformers pinecone-client -q


## 1. Imports

In [3]:
import pandas as pd

## 2. Load the dataset
The dataset used in this experiment is a small subset of the [Quora Question Pairs dataset](https://www.kaggle.com/datasets/quora/question-pairs-dataset) from Kaggle.

In [4]:
data = pd.read_csv("./questions_modified.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


Here, `question1` and `question2` are columns containing questions asked on Quora, and the `is_duplicate` column indicates whether the two questions are duplicates of each other.

In [5]:
len(data)

200

In [6]:
data[data['is_duplicate']==1].head()

Unnamed: 0.1,Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
5,5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
7,7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
11,11,11,23,24,How do I read and find my YouTube comments?,How can I see all my Youtube comments?,1
12,12,12,25,26,What can make Physics easy to learn?,How can you make physics easy to learn?,1
13,13,13,27,28,What was your first sexual experience like?,What was your first sexual experience?,1


## 3. Pinecone
Pinecone is a vector database that lets you create and search indices of vector embeddings from any model or data source.

In [33]:
from pinecone import Pinecone, PodSpec

In [34]:
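import os
# Read the API key from an environment variable instead of hard-coding it in the notebook.
# PINECONE_API_KEY is an assumed variable name; set it before running this cell.
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"], environment='gcp-starter')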

Next, let's load a sentence-transformer model to generate the embeddings.

In [12]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')

In [13]:
embedding = model.encode("Hello, from Nepal")
len(embedding)

768

The model produces 768-dimensional embeddings, so the index below is created with `dimension=768`.

In [36]:
pc.create_index(name="question-search", dimension=768, metric="cosine", spec=PodSpec(
        environment='gcp-starter',
        pod_type='starter'
    ))

Get a handle to the created index:

In [37]:
index = pc.Index('question-search')

In [38]:
print(index)

<pinecone.data.index.Index object at 0x7e7a9ffe7790>


We will store each row as a tuple of three values: `id`, `vector`, and `metadata`.

In [40]:
# [(id,vector,metadata)]

In [42]:
question_list = []
for i, row in data.iterrows():
  question_list.append(
      (
        str(row['id']),
        model.encode(row['question1']).tolist(),
        {
            'is_duplicate': int(row['is_duplicate']),
            'question1': row['question1']
        }
      )
  )
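  # Insert only 50 vectors at a time.
  if len(question_list)==50:
    index.upsert(vectors=question_list)
    question_list = []

# Upsert any vectors left over when the row count is not a multiple of 50.
if question_list:
  index.upsert(vectors=question_list)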
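
The batching pattern in the loop above can also be factored into a small helper that yields fixed-size batches, including a final partially filled one. This is only a sketch: `batched` and `records` are hypothetical names, the records are dummy stand-ins for the `(id, vector, metadata)` tuples, and the `index.upsert` call is left as a comment so the snippet runs without a Pinecone account.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items, including the last partial one."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Dummy stand-ins for the (id, vector, metadata) tuples built in the notebook.
records = [(str(i), [0.0, 0.0], {"n": i}) for i in range(120)]

for batch in batched(records, 50):
    # In the notebook, this is where index.upsert(vectors=batch) would go.
    print(len(batch))
```

With 120 records and a batch size of 50, the loop sees batches of 50, 50, and 20 items, so no rows are silently dropped at the end.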

Now, let's search for a particular question.

In [44]:
query = "How can we simplify the learning process of Physics?"
# Encode the query as a single flat vector and ask for the two nearest neighbours.
xq = model.encode(query).tolist()
result = index.query(vector=xq, top_k=2, include_metadata=True)

In [45]:
print(result)

{'matches': [{'id': '12',
              'metadata': {'is_duplicate': 1.0,
                           'question1': 'What can make Physics easy to learn?'},
              'score': 0.822684,
              'values': []},
             {'id': '90',
              'metadata': {'is_duplicate': 0.0,
                           'question1': 'What is the best reference book for '
                                        'physics class 11th?'},
              'score': 0.456025362,
              'values': []}],
 'namespace': '',
 'usage': {'read_units': 6}}
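
The two scores above differ a lot (about 0.82 versus 0.46), which suggests filtering out weak matches before showing them to a user. The sketch below does this on a plain-dict copy of the printed response. `filter_matches`, `sample_result`, and the 0.7 cutoff are assumptions chosen for illustration, not values recommended by Pinecone.

```python
def filter_matches(result, min_score):
    """Keep only the matches whose similarity score is at least min_score."""
    return [match for match in result["matches"] if match["score"] >= min_score]


# Plain-dict copy of the response printed above.
sample_result = {
    "matches": [
        {
            "id": "12",
            "metadata": {"is_duplicate": 1.0,
                         "question1": "What can make Physics easy to learn?"},
            "score": 0.822684,
            "values": [],
        },
        {
            "id": "90",
            "metadata": {"is_duplicate": 0.0,
                         "question1": "What is the best reference book for physics class 11th?"},
            "score": 0.456025362,
            "values": [],
        },
    ],
    "namespace": "",
}

# 0.7 is an arbitrary cutoff for this example; tune it on your own data.
for match in filter_matches(sample_result, min_score=0.7):
    print(match["id"], match["metadata"]["question1"])
```

Only the question with id `12` passes the cutoff here; the reference-book question is dropped.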


Another example:

In [47]:
query = "How to leverage internet for business"
# Encode the query as a single flat vector and ask for the two nearest neighbours.
xq = model.encode(query).tolist()
result = index.query(vector=xq, top_k=2, include_metadata=True)

In [48]:
print(result)

{'matches': [{'id': '78',
              'metadata': {'is_duplicate': 0.0,
                           'question1': 'How can I make money through the '
                                        'Internet?'},
              'score': 0.579001486,
              'values': []},
             {'id': '28',
              'metadata': {'is_duplicate': 0.0,
                           'question1': 'What is best way to make money '
                                        'online?'},
              'score': 0.52560991,
              'values': []}],
 'namespace': '',
 'usage': {'read_units': 6}}


This is how semantic search works. Semantic search is like having a search engine that understands the meaning behind words, not just the words themselves. Instead of matching keywords, it tries to understand the intent and context of a search query to provide more relevant results.
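
For contrast, here is a rough sketch of what plain keyword matching would see for the Physics query from earlier. It measures word overlap (Jaccard similarity over lowercase tokens) between the query and its top match. This is a simple stand-in chosen for illustration, not how any particular keyword search engine works, and `tokens` and `keyword_overlap` are hypothetical helper names.

```python
import re


def tokens(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(a, b):
    """Jaccard similarity: shared words divided by all distinct words."""
    words_a, words_b = tokens(a), tokens(b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


query_text = "How can we simplify the learning process of Physics?"
top_match = "What can make Physics easy to learn?"

print(sorted(tokens(query_text) & tokens(top_match)))  # ['can', 'physics']
print(round(keyword_overlap(query_text, top_match), 3))  # 0.143
```

The two questions share only two words, so the keyword overlap is low, yet the embedding-based search above scored this pair at about 0.82 because the questions mean nearly the same thing.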