# Semantic search example

Taken from https://docs.pinecone.io/docs/semantic-text-search

## 1. Install deps

Copy-pasted into a terminal, not run in Jupyter.

1. pinecone-client: Pinecone Python client (installed with the `grpc` extra)
2. datasets: Hugging Face datasets library
3. sentence-transformers: Models that transform sentences into vectors

In [4]:
%pip install "pinecone-client[grpc]"==2.2.1 \
  datasets==2.12.0 \
  sentence-transformers==2.2.2

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Note: you may need to restart the kernel to use updated packages.


In [5]:
import pinecone
import os

%load_ext dotenv
%dotenv

# Load Pinecone API key
api_key = os.getenv('PINECONE_API_KEY')

pinecone.init(
    api_key=api_key,
    environment="northamerica-northeast1-gcp"  # find next to API key in console
)
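
`os.getenv` returns `None` when the variable is unset, so a missing key would only surface later as an authentication error. A minimal sketch of a fail-fast check, assuming a hypothetical helper `require_env` (it is not part of pinecone or dotenv):

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset/empty.

    Hypothetical helper for illustration; not part of pinecone or python-dotenv.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could replace the lookup above as `api_key = require_env('PINECONE_API_KEY')`.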

## Read the Quora dataset from Hugging Face

In [6]:
from datasets import load_dataset

dataset = load_dataset('quora', split='train[240000:320000]')
dataset

Found cached dataset quora (/home/hp/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04)


Dataset({
    features: ['questions', 'is_duplicate'],
    num_rows: 80000
})

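Extract the questions. Each record holds its question texts in a list under `text`; flatten them into a single list and drop duplicates.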

In [7]:
questions = []

for record in dataset['questions']:
    questions.extend(record['text'])

# Clear duplicates
questions = list(set(questions))
print('\n'.join(questions[:5]))
print(len(questions))

Will Obama go down in history as a great president?
What if hitler won the war?
Do younger girls care about how ripped their boyfriend is?
How was the Alioth star named?
How can I learn to hack seriously?
136057
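
`list(set(...))` removes duplicates but does not preserve order, and string hashing is randomized per Python process by default, so the list order (and the position-based IDs assigned during the upsert) can differ between runs. A minimal sketch of an order-preserving alternative, shown on sample strings rather than the Quora data; `dedupe_keep_order` is a hypothetical helper name:

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys keep insertion order (Python 3.7+), so they come out first-seen
    return list(dict.fromkeys(items))


sample = ["What is AI?", "How do I cook rice?", "What is AI?"]
unique = dedupe_keep_order(sample)
print(unique)  # ['What is AI?', 'How do I cook rice?']
```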


## Build embeddings with the `all-MiniLM-L6-v2` model from sentence_transformers

This converts each question into a vector. `all-MiniLM-L6-v2` is a pretrained sentence-transformers model.

In [8]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print("CUDA unavailable, using CPU")

model = SentenceTransformer('all-MiniLM-L6-v2', device = device)
model


SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

SentenceTransformer params:
1. `max_seq_length = 256`: Maximum number of tokens encoded per input; anything beyond that is truncated.
2. `word_embedding_dimension = 384`: Each sentence is encoded as a 384-dimensional vector.
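
The printed model also shows a `Pooling` stage with `pooling_mode_mean_tokens: True` followed by `Normalize()`. A rough sketch of what those two stages do, in plain Python on toy 2-dimensional token vectors; `mean_pool` and `l2_normalize` are illustrative names, and this simplified version ignores padding tokens, which real mean pooling has to exclude:

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one sentence vector (no padding mask)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


tokens = [[1.0, 2.0], [5.0, 6.0]]  # toy token embeddings, not model output
sentence = l2_normalize(mean_pool(tokens))
```

Because the output vectors have unit length, their cosine similarity equals their dot product.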

## Create a new Pinecone index and populate it with transformed questions

In [15]:
from tqdm.auto import tqdm

index_name = 'semantic-search'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=model.get_sentence_embedding_dimension(),
        metric='cosine'
    )

index = pinecone.GRPCIndex(index_name)

# Upsert

batch_size = 128

for i in tqdm(range(0, len(questions), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(questions))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch- we store the question in metadata
    metadatas = [{'text': text} for text in questions[i:i_end]]

    # create embeddings with SentenceTransformer
    xc = model.encode(questions[i:i_end])
    
    # create records list for upsert
    records = zip(ids, xc, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)

# check number of records in the index
index.describe_index_stats()

100%|██████████| 1063/1063 [20:14<00:00,  1.14s/it]


{'dimension': 384,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 136057}},
 'total_vector_count': 136057}
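
The upsert loop above walks the question list in slices of `batch_size`, with a shorter final slice. A small sketch of that slicing as a reusable generator, run on toy data; `iter_batches` is a hypothetical helper, not a Pinecone function:

```python
import math


def iter_batches(items, batch_size):
    """Yield (start, end, chunk) for consecutive slices of `items`."""
    for start in range(0, len(items), batch_size):
        end = min(start + batch_size, len(items))
        yield start, end, items[start:end]


# Toy data: 5 items in batches of 2 gives chunks of 2, 2 and 1
batches = list(iter_batches(["a", "b", "c", "d", "e"], 2))

# 136057 questions in batches of 128 gives the 1063 iterations shown by tqdm
n_batches = math.ceil(136057 / 128)
```

The `start` and `end` values correspond to `i` and `i_end` in the loop, so the string IDs can still be built with `range(start, end)`.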

## Time to query

In [17]:
query = 'which is the worst city in the world?'

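# xq: the query embedded as a vector
xq = model.encode(query).tolist()

# named `results` so it does not shadow the `xc` embeddings from the upsert loop
results = index.query(xq, top_k=5, include_metadata=True)

for result in results['matches']:
    print(f"{result['score']}: {result['metadata']['text']}")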

0.8286282: Which is the worst place to live in the world?
0.81821454: What is the worst place in the world to live?
0.7996257: Which are the worst cities of India?
0.7930561: What are the most dangerous cities in the world?
0.76766825: Which is the top worst country in the world?
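
Since the index was created with `metric = 'cosine'`, the scores above are cosine similarities between the query vector and the stored question vectors. A brute-force sketch of what a cosine top-k lookup computes, on toy 2-dimensional vectors; this is only an illustration and not how Pinecone searches internally, and `cosine` and `top_k` are hypothetical helper names:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, records, k):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), text) for vec, text in records]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy 2-dimensional "embeddings"; the model's real ones have 384 dimensions
records = [
    ([1.0, 0.0], "east"),
    ([0.0, 1.0], "north"),
    ([1.0, 1.0], "north-east"),
]
results = top_k([1.0, 0.1], records, k=2)
for score, text in results:
    print(f"{score:.3f}: {text}")
```

Scanning every vector like this costs time proportional to the number of stored records per query, which is the work a vector index is meant to avoid.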
