# Lesson 1 - Semantic Search

Welcome to Lesson 1.

To access the `requirements.txt` file, go to `File` and click on `Open`.

I hope you enjoy this course!

### Import the Needed Packages

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install pinecone



In [None]:
!pip install DLAIUtils

ERROR: Could not find a version that satisfies the requirement DLAIUtils (from versions: none)
ERROR: No matching distribution found for DLAIUtils

In [None]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone, ServerlessSpec


import os
import time
import torch

In [None]:
from tqdm.auto import tqdm

### Load the Dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("sentence-transformers/quora-duplicates", "pair", split="train")

In [None]:
dataset[:5]

{'anchor': ['Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?',
  'How can I be a good geologist?',
  'How do I read and find my YouTube comments?',
  'What can make Physics easy to learn?',
  'What was your first sexual experience like?'],
 'positive': ["I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?",
  'What should I do to be a great geologist?',
  'How can I see all my Youtube comments?',
  'How can you make physics easy to learn?',
  'What was your first sexual experience?']}

In [None]:
questions = []
for record in dataset:
    questions.append(record['anchor'])
    questions.append(record['positive'])
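    # Optionally include negative examples as well. Note: the "pair" subset
    # shown above exposes only 'anchor' and 'positive', so this line needs a
    # subset that actually has a 'hard_negative' column.
    # questions.append(record['hard_negative'])

questions = list(set(questions))  # remove duplicates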
print('\n'.join(questions[:10]))
print('-' * 50)
print(f'Number of questions: {len(questions)}')

Have you ever seen ghost in your real life?
What is 2g spectrum all about?
Do you burn more calories swimming or running?
Is there a proof that there are infinitely many transcendental numbers?
How will the scrapping of Rs 500 and Rs 1000 notes help in reducing black money and corruption?
How do I get to know what kind of person I am?
What does a register do in a computer?
How can I live?
How can I make my life beautiful and enjoyable?
What does being human really mean?
--------------------------------------------------
Number of questions: 149596
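
One side effect of `list(set(questions))` is that the resulting order is not guaranteed to be the same from run to run, so the first 10 questions printed above (and any later slice of the list) may differ between sessions. As a minimal sketch, assuming reproducible ordering matters to you, duplicates can be dropped while keeping first-seen order. The helper name `dedupe_keep_order` and the sample list are illustrative, not part of the course code.

```python
def dedupe_keep_order(items):
    """Return items with duplicates removed, keeping first-seen order."""
    # dict keys preserve insertion order (Python 3.7+), so duplicates are
    # dropped without shuffling the remaining items.
    return list(dict.fromkeys(items))


# Tiny illustrative sample, reusing two questions from the dataset preview.
sample = [
    "How can I be a good geologist?",
    "What can make Physics easy to learn?",
    "How can I be a good geologist?",
]

unique = dedupe_keep_order(sample)
print(unique)
print(f"Number of questions: {len(unique)}")
```

With this variant, `questions = dedupe_keep_order(questions)` would be a drop-in replacement for the `list(set(...))` line.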


### Check CUDA and Set Up the Model

**Note**: "Checking CUDA" means checking whether a GPU is available for faster compute. This course runs on CPUs, so some code cells may take a little longer to run.

We use the *all-MiniLM-L6-v2* sentence-transformers model, which maps sentences to a 384-dimensional dense vector space.

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print('Sorry no cuda.')
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

Sorry no cuda.


In [None]:
query = 'which city is the most populated in the world?'
xq = model.encode(query)
xq.shape

(384,)

### Set Up Pinecone

In [None]:
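# Read the key from the environment if it is set; otherwise paste it in place of "".
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY", "")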

In [None]:
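from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=PINECONE_API_KEY)

# DLAIUtils could not be installed above, so create_dlai_index_name is not
# importable here. This local stand-in is an assumption: it appends a Unix
# timestamp to the prefix, which matches the shape of the name printed below.
def create_dlai_index_name(prefix):
    return f"{prefix}-{int(time.time())}"

INDEX_NAME = create_dlai_index_name("dl-ai")

# Delete any existing index with the same name before recreating it
if INDEX_NAME in [index["name"] for index in pc.list_indexes()]:
    pc.delete_index(INDEX_NAME)

print("Index name:", INDEX_NAME)

# Create a new index in a region available on the free plan
pc.create_index(
    name=INDEX_NAME,
    dimension=model.get_sentence_embedding_dimension(),
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # region changed
)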

index = pc.Index(INDEX_NAME)
print(index)

Index name: dl-ai-1755633153
<pinecone.db_data.index.Index object at 0x7fbe3fc630b0>
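
The index above is created with `metric="cosine"`. As a rough illustration of what that metric measures, here is a minimal pure-Python sketch of cosine similarity on small made-up vectors. The function name and the three-dimensional vectors are hypothetical; the real embeddings have 384 dimensions, and this is not how Pinecone computes scores internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, chosen only to show the two extremes.
v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # v scaled by 2: same direction
orthogonal = [3.0, 0.0, -1.0]      # dot product with v is 0

print(cosine_similarity(v, same_direction))
print(cosine_similarity(v, orthogonal))
```

Because the score depends only on direction and not on vector length, two embeddings that point the same way score close to 1 regardless of their magnitudes.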


### Create Embeddings and Upsert to Pinecone

In [None]:
batch_size = 200
vector_limit = 10000

questions_to_process = questions[:vector_limit]  # keep only the first vector_limit questions

import json
from tqdm import tqdm

for i in tqdm(range(0, len(questions_to_process), batch_size)):
    i_end = min(i + batch_size, len(questions_to_process))
    ids = [str(x) for x in range(i, i_end)]
    metadatas = [{'text': text} for text in questions_to_process[i:i_end]]
    xc = model.encode(questions_to_process[i:i_end])
    records = list(zip(ids, xc, metadatas))
    index.upsert(vectors=records)

100%|██████████| 50/50 [01:58<00:00,  2.37s/it]
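
The progress bar shows 50 iterations because 10,000 questions are split into batches of 200. The index arithmetic in the loop above can be checked in isolation with a small sketch; `batch_ranges` is a hypothetical helper name introduced here, not part of the course code.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in chunks."""
    for start in range(0, total, batch_size):
        # min() keeps the last batch from running past the end of the list.
        yield start, min(start + batch_size, total)


ranges = list(batch_ranges(10000, 200))
print(len(ranges), ranges[0], ranges[-1])

# When the total is not a multiple of the batch size, the last batch is shorter.
uneven = list(batch_ranges(450, 200))
print(uneven)
```

The `min(...)` guard is what makes the loop safe when `vector_limit` is not an exact multiple of `batch_size`.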


In [None]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 10000}},
 'total_vector_count': 10000,
 'vector_type': 'dense'}

### Run Your Query

In [None]:
# Small helper function so we can repeat queries later
def run_query(query):
    embedding = model.encode(query).tolist()
    results = index.query(top_k=10, vector=embedding, include_metadata=True, include_values=False)
    for result in results['matches']:
        print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

In [None]:
run_query('which city has the highest population in the world?')

0.72: Which is the costliest city in the world to live in?
0.71: What's the best city in the world?
0.67: Which city in the world is the most beautiful to live in?
0.56: Who is the richest country in the world?
0.55: Which Country in the world has the most literate and educated citizens?
0.55: Which is the highest mountain in the world?
0.54: Where is the most beautiful place on the Earth?
0.53: Which is the happiest country in the world and why?
0.53: What do you think the most beautiful country in the whole world?
0.53: Which place is the most beautiful place in every country?


In [None]:
query = 'how do i make chocolate cake?'
run_query(query)

0.94: How can I make a delicious chocolate cake?
0.6: How do I bake a cake without an oven?
0.58: What should I do if I eat moldy chocolate?
0.53: What can happen if my dog ate chocolate cake? How dangerous can it be?
0.53: How do you make frosting without butter?
0.51: What should I do if a dog eats chocolate?
0.49: What are the risks of eating moldy chocolate?
0.48: Why do dogs like chocolate?
0.45: How do you make cotton candy flavoring? How is cotton candy made?
0.45: Why does dark chocolate taste so nasty?
