<a href="https://colab.research.google.com/github/ua-datalab/NLP-Speech/blob/main/Intro_to_Semantic_Search/Intro_to_Semantic_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><h1>Introduction to Semantic Search</h1></center>

Question: How are the nouns in this image connected to one another?

![](https://raw.githubusercontent.com/ua-datalab/NLP-Speech/main/Intro_to_Semantic_Search/Semantic-Search.png)

[source](https://www.singlegrain.com/blog/x/semantic-seo/)

# Housekeeping
1. Check that the recording is on
2. Check audio and screenshare
3. Share link to notebook in chat
4. Light mode and font size
5. Cohere signup and API key gen


# Overview

Semantic search operates at the level of meaning and context. Given an archive of documents, it sorts through a high volume of content and returns an answer to the query by aggregating related sources.

The results of the semantic search bring together the aggregated information from several sources, to enrich the response.

There are two important things a semantic search needs to return:
- Accurate information that is related to the query
- Rankings for the results returned, based on how close they are to the query.

# What are the components needed for semantic search?
- A **language model** with **embeddings** that connects words within a vocabulary and within sentences.
- A dataset with information in a query-response or topic-description format
- An index for speeding up the retrieval process
- A user-provided query, appropriately processed



# Representing human language in a computer-readable format

  - We need information on how words are connected to each other in order to map their meaning.
  - We utilize a language model that **assigns syntactic and semantic tags** to words
    - It connects words to their **lexemes**, finds relations between them, and builds context.
    - This enables the system to connect synonyms, hypernyms, and hyponyms, and to see words as more than blocks of letters.
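
As a rough illustration of the relations listed above, the sketch below walks from a word up to its more general terms using a tiny hand-built lexicon. The `HYPERNYMS` and `SYNONYMS` dictionaries and both helper functions are made up for this example; they stand in for what a language model or lexical resource would provide.

```python
# Toy lexicon: hand-written for illustration, not taken from any real resource.
HYPERNYMS = {
    "poodle": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
}
SYNONYMS = {
    "dog": {"hound"},
    "hound": {"dog"},
}


def hypernym_chain(word):
    """Return the list of increasingly general terms above `word`."""
    chain = []
    while word in HYPERNYMS:
        word = HYPERNYMS[word]
        chain.append(word)
    return chain


def related(word_a, word_b):
    """True if the words are synonyms or one is a hypernym of the other."""
    if word_b in SYNONYMS.get(word_a, set()):
        return True
    return word_b in hypernym_chain(word_a) or word_a in hypernym_chain(word_b)


print(hypernym_chain("poodle"))      # ['dog', 'mammal', 'animal']
print(related("poodle", "animal"))   # True
print(related("cat", "dog"))         # False: siblings, not hypernym/hyponym
```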

# Semantic and keyword searches: different utilities

Keyword searches are a great, lightweight tool for finding documents with the words we need. When combined with search tools and conditionals, they can access and retrieve information based on the user input. But they have some limitations:
-  They cannot disambiguate between contexts as the engine does not have information on entities and relations (Apple the company and computers, vs apple the fruit and peanut butter).
- They require exact query words
- They can be hacked with things like keyword stuffing, because they look for words, not how they are used

Semantic search uses all the building blocks for NLP, as well as the building blocks of LLMs, and operates at a meaning level. It is a heavier system, but has higher utility.
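
The keyword-search limitations above can be reproduced with a few lines of plain Python. This is a minimal sketch of exact-word matching, not how any particular search engine works; `keyword_search` and the sample documents are invented for illustration.

```python
import re


def keyword_search(query, documents):
    """Return documents that contain every word of the query (exact match)."""
    query_words = set(re.findall(r"\w+", query.lower()))
    hits = []
    for doc in documents:
        doc_words = set(re.findall(r"\w+", doc.lower()))
        if query_words <= doc_words:  # every query word appears in the document
            hits.append(doc)
    return hits


documents = [
    "Apple released a new laptop this week.",
    "An apple a day goes well with peanut butter.",
    "Keyword stuffing: cheap car cheap car cheap car.",
    "How to buy an affordable automobile.",
]

# Both "Apple" documents match: the company and the fruit are not told apart.
print(keyword_search("apple", documents))
# Only the stuffed document matches; the relevant "automobile" one is missed.
print(keyword_search("cheap car", documents))
```

A meaning-based search would be expected to rank the "affordable automobile" document highly for the second query, even though it shares no words with it.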

# Some Definitions


- Entity- briefly, nouns in a given text
- Query- the user-provided input that will be used to perform the search
- Vector representations
  - Mathematical representation of data (such as words) in a multi-dimensional space, as an array of numbers.
  - Can capture relationships among data points using the numerical representation.
  - Work at the word level
- Embeddings
  - A type of vector representation that captures relationships at the lexical and sentence levels
  - Words that share meaning are closer to each other than those that don't
  - It is **learned from data**
  - Used to capture semantic content, relationships, or patterns between objects.
  - They map discrete objects (like words or items) into continuous vector spaces.

<center><img src="https://raw.githubusercontent.com/ua-datalab/NLP-Speech/main/Intro_to_Semantic_Search/vector-representation.png" width="500"/> </center>

[source](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0231189#pone.0231189.ref009)
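
To make "words that share meaning are closer to each other" concrete, here is a small sketch that measures closeness with cosine similarity. The three-dimensional vectors are invented for illustration (learned embeddings have hundreds of dimensions), and `cosine_sim` is a helper written for this example.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up "embeddings": the numbers are chosen by hand, not learned from data.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

# Related words point in similar directions, so their similarity is higher.
print(cosine_sim(toy_vectors["king"], toy_vectors["queen"]))
print(cosine_sim(toy_vectors["king"], toy_vectors["apple"]))
```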
- Semantic web: a representation of the world wide web that is machine interpretable, where information is interconnected

<center><img src="https://raw.githubusercontent.com/ua-datalab/NLP-Speech/refs/heads/main/Intro_to_Semantic_Search/RDF_example_extended.svg" width="500"/> </center>

[source](https://en.wikipedia.org/wiki/File:RDF_example.svg)
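
Interconnected, machine-interpretable information of this kind is usually stored as subject-predicate-object triples. The sketch below is a toy in-memory version of that idea; the predicate names and the `find_triples` helper are assumptions made for this example, not part of any RDF library.

```python
# Each fact is a (subject, predicate, object) triple, in the spirit of RDF.
# The predicate names are invented for this example.
triples = [
    ("Dune", "is_a", "Novel"),
    ("Dune", "written_by", "Frank Herbert"),
    ("Frank Herbert", "is_a", "Person"),
    ("Dune (2021 film)", "based_on", "Dune"),
]


def find_triples(subject=None, predicate=None, obj=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


print(find_triples(subject="Dune"))   # everything stated about the novel
print(find_triples(obj="Dune"))       # everything that points back to it
```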





# History: Google and Semantic Search

- Information is represented in the form of knowledge graphs, as entities and relations. Leads to RankBrain in 2015.
- [Google introduces Hummingbird](https://www.searchenginejournal.com/google-algorithm-history/hummingbird-update/)
- [BERT- bidirectional encoder representations from transformers](https://blog.google/products/search/search-language-understanding-bert/)

# A human approach to semantic search

Consider how you would interpret the following query, and break it down into information that you will need in order to provide a result:

> What is Dune?

# Some working examples

We will call pre-trained models or embeddings, and implement semantic search in a few ways:
1. We provide the documents and utilize embeddings to retrieve information.
2. We query a dataset created with information of a specific nature (factual questions and answers, a Wikipedia corpus), and assess whether our search returns usable results.

An example of a Wikipedia dataset with embeddings:  https://huggingface.co/datasets/olmer/wiki_mpnet_embeddings

# Cohere

[An AI platform](https://docs.cohere.com/docs/models) that offers pre-trained SOTA models, an easy set of multi-platform tools to run them in, as well as educational content.

One of the simple out-of-the-box options for LLM research.

## Sign up and access an API key.

1. [Click here to sign up to Cohere](https://cohere.ai/).
2. After signing in, open the Cohere dashboard
3. Copy your API key from the API key section

<img src="https://raw.githubusercontent.com/ua-datalab/NLP-Speech/main/Intro_to_Semantic_Search/cohere_dashboard.png" width="150"/>

## Example 1: build a simple semantic search engine with Cohere.

How do we build features like StackOverflow's "similar questions" feature?

Basic breakdown of steps:
1. Get an archive of questions.
2. [Embed](https://docs.cohere.ai/embed-reference/) the archive to power the semantic search.
3. Create a search function using an index and nearest neighbor search.
4. Return all questions in the question bank which bear a high degree of similarity to the user query.

Source for code: https://github.com/cohere-ai/notebooks/blob/main/notebooks/guides/Basic_Semantic_Search.ipynb

### Setup

NOTE: User input required



In [None]:
# set up and imports:
!pip install "cohere<5" umap-learn altair annoy datasets tqdm

In [None]:

# import libraries
import numpy as np
import re
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

import cohere
from tqdm import tqdm
from datasets import load_dataset
import umap
import altair as alt
from annoy import AnnoyIndex
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)

### Let's access a language model and open the cohere API.

In [None]:
model_name = "embed-english-v3.0"
#api_key = ""
api_key = open('cohere_api.txt').read().strip()  # strip the trailing newline, if any
input_type_embed = "search_document"

# Now we'll set up the cohere client.
co = cohere.Client(api_key)
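
If you would rather not keep the key in a file next to the notebook, one option is to check an environment variable first and fall back to the file. This is only a sketch: `read_api_key` is a helper written for this example, and the variable name `COHERE_API_KEY` is our own choice, not something the Cohere client requires.

```python
import os


def read_api_key(env_var="COHERE_API_KEY", path="cohere_api.txt"):
    """Return the key from the environment if set, else from a local file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    with open(path, encoding="utf8") as key_file:
        return key_file.read().strip()


# Demo only: a placeholder value under a made-up variable name.
os.environ["DEMO_KEY"] = "not-a-real-key\n"
print(read_api_key(env_var="DEMO_KEY"))
```

In the notebook, `api_key = read_api_key()` could then replace the line that reads the file directly.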

### Examining the dataset

Next, we load our dataset. Today, we are using the [trec](https://www.tensorflow.org/datasets/catalog/trec) dataset, which is made up of questions and their categories; the questions serve as our archive. For this, we use the `datasets` library, which can access all datasets on Hugging Face.

In [None]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("trec")
# you will need to enter 'y' in the box below

In [None]:
# Inspect dataset description
ds_builder.info.description

In [None]:
# Get dataset
dataset = load_dataset("trec", split="train")

# Import into a pandas dataframe, take only the first 3000 rows
# We can change this
df = pd.DataFrame(dataset)[:3000]

# Preview the data to ensure it has loaded correctly
df.head(10)



### Create embeddings for the `trec` question bank:

In [None]:
# Get the embeddings
embeds = co.embed(texts=list(df['text']),
                  model=model_name,
                  input_type=input_type_embed).embeddings
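
Embedding endpoints commonly cap how many texts a single request may carry, and whether the installed SDK splits a long list for you depends on its version. The sketch below shows the batching idea on its own, under stated assumptions: `embed_in_batches` and `fake_embed` are hypothetical helpers, and the default `batch_size` of 96 is an assumed limit that you should check against the current API documentation.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=96):
    """Call `embed_fn` on each batch and concatenate the resulting vectors."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


# Stand-in embedder for the demo: maps each text to [length, word count].
def fake_embed(batch):
    return [[len(text), len(text.split())] for text in batch]


texts = [f"question number {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed, batch_size=96)
print(len(vectors))  # 250
```

With the client from this notebook, `embed_fn` could be `lambda batch: co.embed(texts=batch, model=model_name, input_type=input_type_embed).embeddings`.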

In [None]:
# Check the dimensions of the embeddings
embeds = np.array(embeds)
embeds.shape

Next, we need an index that can store the embeddings and access them in an optimized way. This example uses [Annoy](https://github.com/spotify/annoy).

After building the index with our dataset, we can use it to retrieve the nearest neighbors either of existing questions, or of new questions that we embed.

In [None]:
# Create the search index, pass the size of embedding
search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('test.ann')

### Providing a user search.

Edit the string in the `query` variable to ask your own question.

Note: We have only called the first 3000 rows in the dataset. We may need more or less data to get the right answer.

In [None]:
query = "Who is the president of the US?"
input_type_query = "search_query"

# Get the query's embedding
query_embed = co.embed(texts=[query],
                  model=model_name,
                  input_type=input_type_query).embeddings

# Retrieve the nearest neighbors
similar_item_ids = search_index.get_nns_by_vector(query_embed[0],10,
                                                include_distances=True)
# Format the results
query_results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'],
                             'distance': similar_item_ids[1]})


print(f"Query:'{query}'\nNearest neighbors:")
print(query_results) # NOTE: Your results might look slightly different to ours.
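
The `distance` column is Annoy's `angular` distance, which the Annoy README describes as the Euclidean distance between normalized vectors, that is sqrt(2 * (1 - cosine similarity)). The two small helpers below (written for this example) convert between the two scales, which can make the results easier to read.

```python
import math


def angular_to_cosine(distance):
    """Convert an Annoy angular distance to cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


def cosine_to_angular(similarity):
    """Inverse conversion; max() guards against tiny negative rounding errors."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - similarity)))


# 0 means identical direction, 2 means opposite direction.
for d in (0.0, 1.0, 2.0):
    print(f"distance={d:.1f} -> cosine similarity={angular_to_cosine(d):.1f}")
```

A distance of about 1.41 (the square root of 2) corresponds to a cosine similarity of 0, meaning the two vectors are unrelated in direction.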

### Thoughts:
When we have a small and limited dataset, our search results are limited as well: the nearest neighbors can only be as relevant as the closest questions the dataset happens to contain.

For semantic search at scale, we need embeddings from a pre-trained model, computed over a much larger corpus.

## Example 2: Searching Wikipedia with Cohere's language model

### Load and examine the dataset:


In [None]:
import torch
# Load at most max_docs documents and their precomputed embeddings
max_docs = 5000
docs_stream = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)

docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    if len(docs) >= max_docs:
        break

doc_embeddings = torch.tensor(doc_embeddings)

In [None]:
# Check the shape of the output. It should have max_docs (5000) embedding vectors, with 768 values each
doc_embeddings.shape

In [None]:
# Get the query, then embed it
query = 'When did Albert Einstein die?'
response = co.embed(texts=[query], model='multilingual-22-12')
query_embedding = response.embeddings
query_embedding = torch.tensor(query_embedding)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query)
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'], "\n")
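
The scoring step above is a dot product followed by a top-k selection. Here is the same idea in plain Python on made-up two-dimensional vectors; `top_k_by_dot` is a helper written for this example. It also shows one property worth keeping in mind: unlike cosine similarity, the dot product rewards longer vectors.

```python
import heapq


def top_k_by_dot(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k highest dot-product scores."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


# Made-up vectors standing in for real embeddings.
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [3.0, 0.0]]
query_vec = [1.0, 0.0]

# Document 3 points the same way as document 0 but is three times longer,
# so it gets the higher dot-product score.
print(top_k_by_dot(query_vec, doc_vecs, k=2))  # [(3, 3.0), (0, 1.0)]
```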

### Thoughts
In this example, the results contain a lot more information than we asked for. Our limitations continue to be data size (1000-5000 documents) and ranking (when information is extracted from a document, it may not be in the same format as the question asked).

## Visualising distance and clusters

In [None]:
#@title Plot the archive we created{display-mode: "form"}

# UMAP reduces the dimensions from 1024 to 2 dimensions that we can plot
reducer = umap.UMAP(n_neighbors=20)
umap_embeds = reducer.fit_transform(embeds)
# Prepare the data to plot an interactive visualization
# using Altair
df_explore = pd.DataFrame(data={'text': df['text']})
df_explore['x'] = umap_embeds[:,0]
df_explore['y'] = umap_embeds[:,1]

# Plot
chart = alt.Chart(df_explore).mark_circle(size=60).encode(
    x=alt.X('x', scale=alt.Scale(zero=False)),
    y=alt.Y('y', scale=alt.Scale(zero=False)),
    tooltip=['text']
).properties(
    width=700,
    height=400
)
chart.interactive()

# DistilBERT

Next, we will try another available option: a model pre-trained on entries from the website Quora.

This example works with a much larger corpus (up to 100,000 questions). With this code, you can play with more models and options.

Source for code: https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/semantic-search



In [None]:
!pip install sentence_transformers

In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import time
import gzip
import os
import torch
import csv

if not torch.cuda.is_available():
  print("Warning: No GPU found. Please add GPU to your notebook")

In [None]:
# Load model and limit size to 100k:
model_name = 'quora-distilbert-multilingual'
model = SentenceTransformer(model_name)

url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 100000

# Download the dataset if it is not already present
if not os.path.exists(dataset_path):
    print("Download dataset")
    util.http_get(url, dataset_path)

# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        corpus_sentences.add(row['question1'])
        if len(corpus_sentences) >= max_corpus_size:
            break

        corpus_sentences.add(row['question2'])
        if len(corpus_sentences) >= max_corpus_size:
            break

corpus_sentences = list(corpus_sentences)
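
One side effect of collecting sentences in a `set` is that the order of `corpus_sentences`, and therefore each `corpus_id`, can differ between runs, because Python randomizes string hashing per process. If reproducible IDs matter, an order-preserving deduplication is one alternative. The helper below is a sketch written for this example, not part of the original code.

```python
def unique_in_order(sentences, max_size):
    """Deduplicate while keeping first-seen order, stopping at `max_size`."""
    seen = {}  # dicts preserve insertion order
    for sentence in sentences:
        if sentence not in seen:
            seen[sentence] = None
            if len(seen) >= max_size:
                break
    return list(seen)


questions = ["a?", "b?", "a?", "c?", "d?"]
print(unique_in_order(questions, max_size=3))  # ['a?', 'b?', 'c?']
```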

In [None]:
print("Encode the corpus. This might take a while")
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_tensor=True)

###############################
print("Corpus loaded with {} sentences / embeddings".format(len(corpus_sentences)))

### Setting up a function to encode the user query:

In [None]:
# Function that searches the corpus and prints questions that match our search
def search(inp_question):
    start_time = time.time()
    question_embedding = model.encode(inp_question, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings)
    end_time = time.time()
    hits = hits[0]  #Get the hits for the first query

    print("Input question:", inp_question)
    print("Results (after {:.3f} seconds):".format(end_time-start_time))
    for hit in hits[0:5]:
        print("\t{:.3f}\t{}".format(hit['score'], corpus_sentences[hit['corpus_id']]))

## Sample queries and Similarity

In [None]:
search("How can I learn Java online?")

In [None]:
search("What is the capital of the France?")

# Optional: Ollama

[Link to a demo that uses Ollama for semantic search, with a user-provided dataset](https://github.com/ua-datalab/NLP-Speech/blob/main/Intro_to_Semantic_Search/Semantic_Search_Ollama.ipynb)

Learn more in our GenAI workshops: https://github.com/ua-datalab/Generative-AI/wiki/Running-LLM-Locally:-Ollama

# Some more at-scale demos

- [Semantic experiences- a set of examples for integrating semantic information in NLP](https://research.google.com/semanticexperiences/)
- [Semantris- ML powered word association tetris](https://research.google.com/semantris/)


# References
1. [GloVe: Global Vectors for Word Representation](https://aclanthology.org/D14-1162/)
2. [Wide range screening of algorithmic bias in word embedding models using large sentiment lexicons reveals underreported bias types](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0231189#pone.0231189.ref009)
3. [LLM Vector Embedding](https://llmbuilt.com/llm-vector-embedding/)
4. [Integrating Vector Databases with LLMs: A Hands-On Guide](https://mlengineering.medium.com/integrating-vector-databases-with-llms-a-hands-on-guide-82d2463114fb)
5. [What are vector embeddings?](https://www.elastic.co/what-is/vector-embedding)
6. [Using A Large Language Model For Entity Extraction](https://cobusgreyling.medium.com/using-a-large-language-model-for-entity-extraction-6fffb988eb15)