<a href="https://colab.research.google.com/github/rorisDS/workshop_semantic_search/blob/main/Workshop_Semantic_Search_EN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# From Text to Knowledge: Building a Semantic Search Engine

This notebook is part of the presentation available in [this repository](https://github.com/rorisDS/workshop_semantic_search). It explores the use of NLP models to extract semantic information from texts and perform intelligent searches.

It covers the fundamental principles of Machine Learning and NLP, combining theory with practical examples and code that guide participants through the implementation of a semantic search engine. The goal is to cover the full path from generating embeddings to applying them in advanced queries.

This notebook includes all the necessary code to experiment with the presented concepts, along with explanations and comments that facilitate understanding of the technical aspects covered.

- Author: [Víctor Manuel Alonso Rorís](https://www.linkedin.com/in/victor-roris/)
- Date: 11th March, 2025

In [1]:
from IPython.display import clear_output

# Install the necessary libraries
!pip install requests
!pip install pymupdf
!pip install faiss-cpu
!pip install python-dotenv
!pip install langchain openai
!pip install -U langchain-community

# Pre-installed by default in Google Colab
# !pip install transformers
# !pip install sentence-transformers

clear_output() # Clear all the logs generated in the output!

## Machine Learning

Machine Learning (ML) is a branch of Artificial Intelligence that enables computers to learn patterns from data without being explicitly programmed for each task. Through algorithms, ML models identify relationships within data and can make predictions or decisions automatically.

Learning can take different forms; one of the most important is supervised learning, where the model is trained on labeled data.
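
As a minimal sketch of supervised learning, assume a toy labeled dataset in which every label is twice its input. The data, the learning rate, and the number of epochs are illustrative choices, not part of the workshop material. A single weight can then be fitted with gradient descent:

```python
# Toy labeled dataset (hypothetical): inputs x with labels y = 2 * x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

def train(data, lr=0.01, epochs=200):
    """Fit y = w * x by minimizing the mean squared error with gradient descent."""
    w = 0.0  # start from an uninformed weight
    for _ in range(epochs):
        # Gradient of mean((w * x - y) ** 2) with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

w = train(data)
print(f"Learned weight: {w:.4f}")
```

The labels are what make this supervised: the error between the prediction `w * x` and the known answer `y` drives every update.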

One of the most powerful techniques within Machine Learning is the use of Artificial Neural Networks (ANNs). These are inspired by the functioning of the human brain and are composed of layers of interconnected artificial neurons.

### Artificial Neural Networks

A neural network is composed of layers connected to each other, following a specific architecture. Each layer receives a set of inputs, processes them by applying weights and activation functions, and generates an output that is passed to the next layer. As the model is trained, it adjusts these weights to improve its accuracy on the target task.
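
What a single layer computes can be sketched without any library. The example below assumes a hypothetical layer with 3 inputs and 2 neurons; the weights and biases are made up for illustration.

```python
def relu(values):
    """Element-wise ReLU: negative values become 0."""
    return [max(0.0, v) for v in values]

def dense_layer(inputs, weights, biases):
    """Compute relu(W @ inputs + b) for one fully connected layer.

    weights has one row per output neuron; each row holds one weight per input.
    """
    pre_activation = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
    return relu(pre_activation)

# Hypothetical layer with 3 inputs and 2 output neurons
inputs = [1.0, 2.0, 3.0]
weights = [[0.5, -1.0, 1.0],   # neuron 1
           [-1.0, 0.0, 0.0]]   # neuron 2
biases = [0.1, 0.2]

layer_output = dense_layer(inputs, weights, biases)
print(layer_output)
```

The second neuron ends up with a negative weighted sum, so the activation function sets its output to zero. Training would change `weights` and `biases` so that the outputs get closer to the desired ones.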

![red_neuronal](https://raw.githubusercontent.com/rorisDS/workshop_semantic_search/refs/heads/main/images/neural_network.png)

### Machine Learning Libraries: PyTorch and TensorFlow

To simplify the process of implementing neural networks, high-level libraries such as [PyTorch](https://pytorch.org/tutorials/) and [TensorFlow](https://www.tensorflow.org/) have emerged, making it easier to create, train, and deploy Machine Learning and Deep Learning models.

Both libraries allow leveraging the power of GPUs to accelerate training and offer tools to manage large amounts of data. Thanks to them, developing neural networks has become much more accessible.


![pytorch_meme](https://raw.githubusercontent.com/rorisDS/workshop_semantic_search/refs/heads/main/images/pytorch_meme.png)


### Let's play

We define a simple model (*neural network*) using PyTorch

In [None]:
import torch

class TinyModel(torch.nn.Module):

    def __init__(self):
        super().__init__()

        # Layer 1
        self.linear1 = torch.nn.Linear(100, 200)  # Matrix of size 100x200
        self.activation = torch.nn.ReLU()         # Non-linear function

        # Layer 2
        self.linear2 = torch.nn.Linear(200, 10)   # Matrix of size 200x10
        self.softmax = torch.nn.Softmax(dim=1)   # Non-linear function

    def forward(self, x):
        # Where x is a vector of size: 1x100
        x = self.linear1(x)      # Matrix multiplication: 1x100 * 100x200 = 1x200
        x = self.activation(x)  # f(1x200) => 1x200
        x = self.linear2(x)     # Matrix multiplication: 1x200 * 200x10 = 1x10
        x = self.softmax(x)    # f(1x10) => 1x10
        return x

We instantiate the model we have defined and check that the weights are initially assigned **randomly**

In [None]:
tinymodel = TinyModel()
tinymodel

TinyModel(
  (linear1): Linear(in_features=100, out_features=200, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=200, out_features=10, bias=True)
  (softmax): Softmax(dim=1)
)
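
The layer sizes shown above determine how many trainable parameters the model has. As a quick sanity check, the count can be worked out by hand; `linear_param_count` is a hypothetical helper written for this sketch, not a PyTorch function.

```python
def linear_param_count(in_features, out_features):
    """Parameters of a fully connected layer: one weight matrix plus one bias vector."""
    return in_features * out_features + out_features

layer1 = linear_param_count(100, 200)  # 100*200 weights + 200 biases
layer2 = linear_param_count(200, 10)   # 200*10 weights + 10 biases
total = layer1 + layer2
print(f"linear1: {layer1}, linear2: {layer2}, total: {total}")
```

In the notebook, the same total should be obtainable with `sum(p.numel() for p in tinymodel.parameters())`.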

In [None]:
print('Model parameters:')
for param in tinymodel.parameters():
    print(f"\t - Layer shape: {param.shape}")
    print(f"\t - Parameter values: {param}")
    print("\n\n")

Model parameters:
	 - Layer shape: torch.Size([200, 100])
	 - Parameter values: Parameter containing:
tensor([[ 0.0633, -0.0281,  0.0222,  ...,  0.0329,  0.0949, -0.0339],
        [ 0.0689, -0.0004, -0.0701,  ...,  0.0314,  0.0488,  0.0780],
        [-0.0619,  0.0046,  0.0879,  ...,  0.0621, -0.0494, -0.0843],
        ...,
        [-0.0674, -0.0516,  0.0099,  ...,  0.0037,  0.0067, -0.0194],
        [-0.0596,  0.0633,  0.0020,  ..., -0.0047,  0.0361,  0.0561],
        [-0.0635, -0.0875,  0.0652,  ..., -0.0957,  0.0725,  0.0385]],
       requires_grad=True)



	 - Layer shape: torch.Size([200])
	 - Parameter values: Parameter containing:
tensor([-0.0728,  0.0252, -0.0752, -0.0614,  0.0061,  0.0481, -0.0245,  0.0853,
         0.0440, -0.0361,  0.0119, -0.0554, -0.0101,  0.0252,  0.0661, -0.0266,
        -0.0665,  0.0517, -0.0252,  0.0553, -0.0130, -0.0881,  0.0660, -0.0744,
        -0.0091,  0.0815, -0.0925,  0.0610,  0.0428,  0.0926,  0.0470, -0.0564,
        -0.0651,  0.0289,  0.0665, 

Let's test how to predict with a model!

In [None]:
# Randomly generate an input vector of size: 1x100
x = torch.rand(1, 100)
print(f"Input vector ({x.shape}): {x}")

Input vector (torch.Size([1, 100])): tensor([[0.0230, 0.1339, 0.2984, 0.0569, 0.8461, 0.4045, 0.0106, 0.8515, 0.0085,
         0.7592, 0.5130, 0.3543, 0.4052, 0.8502, 0.3839, 0.6691, 0.4884, 0.9264,
         0.5466, 0.4687, 0.0610, 0.5027, 0.9052, 0.7645, 0.2529, 0.9052, 0.4440,
         0.8076, 0.8721, 0.5948, 0.6468, 0.7233, 0.5089, 0.0158, 0.3298, 0.1886,
         0.0869, 0.4712, 0.8874, 0.2572, 0.3252, 0.2349, 0.8436, 0.7764, 0.0037,
         0.2635, 0.1523, 0.0096, 0.5009, 0.7812, 0.6462, 0.1576, 0.8967, 0.1458,
         0.7308, 0.6310, 0.5613, 0.8280, 0.9921, 0.6338, 0.4049, 0.5843, 0.6172,
         0.9512, 0.1176, 0.9452, 0.5368, 0.6396, 0.2191, 0.6313, 0.6074, 0.1004,
         0.7480, 0.0536, 0.5951, 0.5639, 0.9991, 0.1301, 0.6595, 0.4566, 0.2220,
         0.1856, 0.3512, 0.8640, 0.1718, 0.6600, 0.2704, 0.8503, 0.3609, 0.5587,
         0.8409, 0.9374, 0.7814, 0.4407, 0.8359, 0.0105, 0.5884, 0.8495, 0.6542,
         0.8062]])


In [None]:
# Predict using the current model (parameters assigned randomly)
y = tinymodel(x)
print(f"Output vector ({y.shape}): {y}")

Output vector (torch.Size([1, 10])): tensor([[0.1004, 0.0899, 0.1047, 0.0984, 0.1104, 0.0905, 0.1069, 0.0965, 0.1022,
         0.1002]], grad_fn=<SoftmaxBackward0>)
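
The output values are all close to 0.1 because the final softmax turns the 10 raw scores into probabilities that sum to 1, and an untrained model tends to produce scores that are close to each other. A minimal pure-Python sketch, where the `logits` values are hypothetical and not taken from the model above:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(v - highest) for v in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores that are close to each other
logits = [0.02, -0.05, 0.04, 0.00, 0.07, -0.06, 0.05, -0.01, 0.03, 0.01]
probs = softmax(logits)
print([round(p, 4) for p in probs])
print(f"Sum: {sum(probs):.4f}")
```

After training, one of the scores would typically dominate and the corresponding probability would move toward 1.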


Any model can be trained to adjust its weights or parameters to the training dataset. However, this process is beyond the scope of this workshop.

If you want to learn how to train a model in PyTorch, you can check out the following link: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html

### Your Turn  

- **Repeat the model execution**: Instantiate the model again and run the operations to verify that the weights are indeed initialized randomly.  
- **Modify the architecture**: Add more layers or change the dimension of the matrices and observe how it affects the model's behavior.  
- **Explore other activation functions**: Research and try different non-linear functions available in PyTorch: [Activation functions documentation](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity).  

## Tokenizer  

**Tokenizers** are key components in **NLP**. Their purpose is to translate text into data that models can process. Since models only work with numerical data, the main goal of tokenizers is to convert input text into **numerical vectors**.  


![tokenizer](https://raw.githubusercontent.com/rorisDS/workshop_semantic_search/refs/heads/main/images/NLP_tokenizer.png)

### HuggingFace

[**HuggingFace**](https://huggingface.co/docs) is an organization that offers tools and platforms for **Artificial Intelligence**, especially focused on **NLP**.  

Among the services provided by HuggingFace is the [**Model Hub**](https://huggingface.co/docs/hub/index), a platform that allows the community to store, discover, and share models. This makes it easier to access both models trained by individual contributors and models developed by HuggingFace itself. You can check how to use the Hub here: https://huggingface.co/docs/hub/models-the-hub  

One of HuggingFace's most prominent tools is the [**Transformers library**](https://huggingface.co/docs/transformers/index), which provides various mechanisms that abstract away the direct implementation of **NLP** functionalities, including **tokenization**. You can take a quick look at how to use it here: https://huggingface.co/docs/transformers/quicktour  




### How Tokenization Works

A **tokenizer** in **NLP** is a fundamental tool that converts raw text into smaller units called **tokens**, which can be words, subwords, characters, or special symbols. This segmentation makes it easier for NLP models to analyze, process, and understand the text.  

**Tokenization** algorithms split text into *subwords*. These algorithms are based on the principle that frequently used words should not be broken into smaller subwords, but rare words should be decomposed into meaningful subwords.  

For example, *"unexpectedly"* might be considered a rare word and could be split into *"unexpected"* and *"ly"*. These parts are likely to appear more frequently as independent subwords, while still preserving the meaning of *"unexpectedly"* through the combined meaning of *"unexpected"* and *"ly"*.  
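
The split described above can be sketched as a greedy longest-match search over a vocabulary, which is the strategy WordPiece-style tokenizers such as BERT's apply at inference time. The tiny `vocab` below is hypothetical; real vocabularies are learned from a corpus and contain tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subwords.

    Continuation pieces are prefixed with '##', following the BERT convention.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first and shrink it until it is in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

# Hypothetical toy vocabulary
vocab = {"unexpected", "##ly", "token", "##izer", "works"}

print(wordpiece_tokenize("unexpectedly", vocab))
print(wordpiece_tokenize("tokenizer", vocab))
print(wordpiece_tokenize("works", vocab))
```

A frequent word such as *works* stays whole because it is in the vocabulary, while a rarer word is decomposed into pieces that are.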

These subwords can be mapped to **numerical IDs**. A **tokenizer** consists of a **vocabulary** that allows *direct conversion* between subword and ID. This tokenization dictionary acts simply as a mapping table between each token and a unique integer.  
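
A minimal sketch of such a mapping table, assuming a made-up six-entry vocabulary (the IDs are arbitrary and do not match those of any real tokenizer):

```python
# Hypothetical toy vocabulary: each token maps to a unique integer ID
token_to_id = {"[UNK]": 0, "Learning": 1, "about": 2, "token": 3, "##izer": 4, "works": 5}
id_to_token = {idx: tok for tok, idx in token_to_id.items()}

def encode(tokens):
    """Look up the ID of each token; unknown tokens fall back to [UNK]."""
    return [token_to_id.get(tok, token_to_id["[UNK]"]) for tok in tokens]

def decode(ids):
    """Reverse lookup from IDs back to tokens."""
    return [id_to_token[i] for i in ids]

toy_tokens = ["Learning", "about", "token", "##izer", "works"]
toy_ids = encode(toy_tokens)
print(toy_ids)
print(decode(toy_ids))
```

Because the table is a plain dictionary, the conversion is reversible, and the numeric value of an ID says nothing about what the token means.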

It is important to note that this conversion should not be confused with an **embedding**, as it does not provide any semantic or contextual information about the text. While an **embedding** represents semantic relationships between words by placing them in a vector space where proximity indicates similarity in meaning, the assignment of IDs by the **tokenizer** is purely an indexing process. Each token receives an arbitrary and fixed number, without considering context or meaning in relation to other tokens.  

If you want to learn more: https://huggingface.co/learn/nlp-course/chapter2/4

### Let's play!

Selecting a model to use

In [None]:
# Using the HuggingFace Model Hub, we select a model
model_id = "bert-base-cased"

# If you want, you can visit the Hub and search for any other model: https://huggingface.co/models

Instantiating the `Tokenizer` for the selected model

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)


Writing a text to tokenize

In [None]:
sequence = "Learning about how a tokenizer works"

Splitting the text into tokens

In [None]:
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Learning', 'about', 'how', 'a', 'token', '##izer', 'works']


Mapping each token to a numerical ID (**direct conversion using the tokenizer's internal dictionary!**)

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[9681, 1164, 1293, 170, 22559, 17260, 1759]


We can convert the numerical IDs back to tokens and thus reconstruct the original text!

In [None]:
decoded_string = tokenizer.decode(ids)

print(decoded_string)

Learning about how a tokenizer works


### Your Turn  

- **Try other models**: Explore different models in the [HuggingFace Model Hub](https://huggingface.co/models). Do they generate the same tokens for the same text?  
- **Experiment with made-up words**: What happens when you input non-existent words or words from other languages?  
- **Analyze the first tokens**: What are the tokens represented by the first 10 IDs (0-9)? Does the tokenizer only represent words?  



### Insights  

Subwords can be mapped to numerical IDs using a tokenization dictionary, which simply acts as a lookup table between each token and a unique integer. It is important to note that this conversion should not be confused with an embedding, as it does not provide any semantic or contextual information about the text. While an embedding represents the semantic relationships between words by placing them in a vector space where proximity indicates similarity in meaning, the assignment of IDs by the tokenizer is purely an indexing process. Each token is assigned an arbitrary and fixed number, without considering the context or meaning of the token in relation to others.

## NLP Tasks  

Natural Language Processing (NLP) encompasses various tasks that enable machines to effectively understand, interpret, and generate text. These tasks use Machine Learning models and neural networks to extract meaning and context from human language.  

Some of these tasks are illustrated in the following image:  

![nlp_tasks](https://raw.githubusercontent.com/rorisDS/workshop_semantic_search/refs/heads/main/images/nlp_tasks.png)  

To learn more about the nature and functioning of these tasks, you can visit the Hugging Face website: [Hugging Face Tasks](https://huggingface.co/tasks).  

Additionally, Hugging Face offers a large collection of pretrained models covering these tasks in its repository: [Hugging Face Models](https://huggingface.co/models).  


### Let's Play  

Below are some code examples for performing various Natural Language Processing (NLP) tasks using the [Transformers](https://huggingface.co/docs/transformers/quicktour) library from Hugging Face. To do this, we will use models available in the [Hugging Face Model Hub](https://huggingface.co/models), exploring different ways to make predictions using models, tokenizers, and [pipelines](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines).  

#### Translation  

In this example, we will directly use the `model` and the `tokenizer` to generate the prediction.  

In [None]:
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer

# Model available through the Hugging Face hub:
#   https://huggingface.co/razwand/opus-mt-en-mul-finetuned_en_sp_translator
model_key = "razwand/opus-mt-en-mul-finetuned_en_sp_translator"

# Instantiate model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_key)
tokenizer = AutoTokenizer.from_pretrained(model_key)
model.eval()

# Translate
sentence = 'flowers are white'
input_ids = tokenizer(sentence, return_tensors='pt')
outputs = model.generate(**input_ids)
# The model output consists of tokens that must be converted back to text using the tokenizer
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(outputs)

las flores son blancas


#### Text Classification

In this case, we will use a specific `pipeline` for text classification to abstract the execution of the `tokenizer` and the `model`.

In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import TextClassificationPipeline

# Model available through the Hugging Face hub:
#   https://huggingface.co/dima806/news-category-classifier-distilbert
model_key = "dima806/news-category-classifier-distilbert"

# Instantiate tokenizer, model, and pipeline
tokenizer = AutoTokenizer.from_pretrained(model_key)
model = AutoModelForSequenceClassification.from_pretrained(model_key)
nlp = TextClassificationPipeline(model=model, tokenizer=tokenizer)

text_news = 'Reducing emissions is essential to promote ecosystem regeneration'
print(nlp(text_news))

text_news = "A tight match that was decided in overtime by a penalty"
print(nlp(text_news))


[{'label': 'GREEN', 'score': 0.8641167879104614}]
[{'label': 'SPORTS', 'score': 0.9832993149757385}]
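
The `score` in each result is a probability for the predicted label. The last step of the pipeline can be sketched as follows, assuming it applies a softmax to the model's raw scores and reports the best class. The logits and the `POLITICS` label are made up for illustration.

```python
import math

def postprocess(logits, id2label):
    """Convert raw classifier scores into a {'label', 'score'} dict for the top class."""
    highest = max(logits)
    exps = [math.exp(v - highest) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits and label mapping, for illustration only
id2label = {0: "GREEN", 1: "SPORTS", 2: "POLITICS"}
result = postprocess([0.3, 4.1, -1.2], id2label)
print(result)
```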


#### Summarization

Finally, we demonstrate how to use the `pipeline` directly to perform the prediction.

In [None]:
from transformers import pipeline

# Model available through the Hugging Face hub:
#   https://huggingface.co/tennessejoyce/titlewave-t5-base
model_key = "tennessejoyce/titlewave-t5-base"

summarizer = pipeline('summarization', model=model_key)
article = """In this workshop, participants will explore Neural Networks, NLP, and embeddings to build a semantic search engine using Python. They will learn how neural networks process text, how embeddings capture meaning, and how to use pre-trained models to convert text into vector representations. Using Hugging Face Transformers, TensorFlow, and FAISS, they will implement a search function that retrieves results based on semantic similarity rather than keywords. By the end, attendees will have a working NLP-powered search engine and a solid understanding of modern language models and their applications. No prior experience required!"""
summarizer(article)


[{'summary_text': 'How to build a semantic search engine using Python?'}]

#### Your Turn
* **Try other NLP tasks on your own**

## Embeddings  

The embedding of a text is its representation in a high-dimensional vector space. This embedding is expressed as a numerical vector that captures the semantic and contextual relationships of the text. In this space, texts with similar meanings tend to be located near each other, allowing similarity measurement.  

![embedding_relations](https://raw.githubusercontent.com/rorisDS/workshop_semantic_search/refs/heads/main/images/embeddings_relations.png)  

The use of NLP models allows generating embeddings that capture semantic relationships between words, phrases, or documents. Models such as `Word2Vec`, `GloVe`, `FastText`, and `Transformers` (e.g., `BERT`, `Sentence Transformers`) train these embeddings to reflect similarities in meaning, even when the words do not exactly match. These vectors can be used in tasks such as semantic search, text classification, content recommendation, and clustering, enabling algorithms to better understand language and find connections beyond simple word matching.  


### Sentence Transformers

[Sentence Transformers](https://sbert.net/) is a Python library built on top of Hugging Face `transformers` that abstracts and simplifies the management of models for generating embeddings. In other words, it is a library designed to generate vector representations of texts using deep learning models optimized for semantic search, text classification, clustering, and more. It is maintained by Hugging Face.  


### Let's play

Let's see how to obtain embeddings using NLP models.

In [None]:
# Model available through the Hugging Face hub:
# https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
model_key = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"

In [None]:
# Texts to generate embeddings
sentences = ["Hello", "Good morning", "Goodbye"]

Obtaining embeddings using `transformers` is somewhat complex, as it involves several calculations on the model's output vector.

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load the model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_key)
model = AutoModel.from_pretrained(model_key)
model.eval()

# Tokenize the phrases
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Calculate the token embeddings
model_output = model(**encoded_input)

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Aggregate the token embeddings to obtain the sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

sentence_embeddings.shape

Sentence embeddings:
tensor([[-0.0416, -0.0471, -0.0119,  ...,  0.2190, -0.0018, -0.1378],
        [-0.0352, -0.1824, -0.0093,  ...,  0.1286, -0.0311, -0.0367],
        [-0.0307,  0.0730, -0.0137,  ...,  0.1305,  0.1191, -0.0615]],
       grad_fn=<DivBackward0>)


torch.Size([3, 768])
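
The mean pooling step used above can be illustrated on a toy example without tensors. This is a sketch for a single sentence; `masked_mean` and the numbers are hypothetical, chosen so the effect of the attention mask is easy to see.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            summed = [s + v for s, v in zip(summed, vector)]
    count = max(sum(attention_mask), 1e-9)  # guard against division by zero
    return [s / count for s in summed]

# Hypothetical sentence with 2 real tokens and 1 padding token, embedding size 2
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_embeddings, attention_mask)
print(pooled)
```

The padding vector `[9.0, 9.0]` does not influence the result, which is why the attention mask is needed when sentences of different lengths are batched together.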

`Sentence Transformers` allows abstracting the complexity of `transformers` for obtaining embeddings.

In [None]:
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer(model_key)
sentence_embeddings = model.encode(sentences)

print("Sentence embeddings:")
print(sentence_embeddings)

Sentence embeddings:
[[-0.04155214 -0.04714641 -0.01188689 ...  0.21895139 -0.00177131
  -0.13775115]
 [-0.0351808  -0.18238163 -0.00932718 ...  0.12861395 -0.03109084
  -0.03672406]
 [-0.03066282  0.0730494  -0.0137241  ...  0.13048011  0.11908463
  -0.06153324]]


###  Your Turn

- **Word Order**: Check if the embedding of *"the red car"* and *"the car red"* is the same. Does the model capture the difference in word order?

- **Try other models**: Explore different models in the [Hugging Face Model Hub](https://huggingface.co/models). Do they return the same embeddings for the same texts?

- **Text length**: Generate embeddings for sentences of different lengths that express the same idea (e.g., *"The dog sleeps"* vs. *"The dog is peacefully sleeping in its bed"*). How does the number of words affect the embedding?

- **Out-of-vocabulary words**: Try using made-up or very rare words. Does the model generate an embedding or return something different?

## Calculating Distances Between Embeddings

To compare the relationship between two embeddings, distance or similarity measures are used. The most common metric in NLP is **Cosine Similarity**, which measures the cosine of the angle between two vectors in space.  

**Cosine Similarity** is calculated as:  

![](https://raw.githubusercontent.com/rorisDS/workshop_semantic_search/refs/heads/main/images/cosine_similarity.png)  

where 𝐴 and 𝐵 are the embedding vectors, 𝐴⋅𝐵 is their dot product, and ∥𝐴∥ and ∥𝐵∥ are their norms (magnitudes). Its value ranges between -1 and 1, where:  
- **1** indicates maximum similarity,  
- **0** indicates the vectors are orthogonal (unrelated), and  
- **-1** indicates complete opposition.  

This metric is ideal in NLP because it captures similarity regardless of the vector's magnitude, focusing only on semantic orientation.  
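
The three reference values can be checked with a few lines of plain Python. The 2-dimensional vectors are made up for illustration, and `cosine_sim` is a hypothetical helper name.

```python
import math

def cosine_sim(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_sim([1.0, 0.0], [3.0, 0.0])   # same direction, different magnitude
orthogonal = cosine_sim([1.0, 0.0], [0.0, 2.0])       # perpendicular vectors
opposite = cosine_sim([1.0, 0.0], [-2.0, 0.0])        # opposite direction

print(same_direction, orthogonal, opposite)
```

The first pair differs in length by a factor of 3 and still reaches the maximum value, which illustrates that only the direction of the vectors matters.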



### Let's play

In [None]:
from sentence_transformers import SentenceTransformer

sentences = ["Hello", "Good morning", "red car"]

# Model available through the Hugging Face hub:
# https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
model_key = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
model = SentenceTransformer(model_key)
sentence_embeddings = model.encode(sentences, convert_to_numpy=True)

In [None]:
embedding1 = sentence_embeddings[0]
embedding2 = sentence_embeddings[1]
embedding3 = sentence_embeddings[2]

Cosine Similarity Calculation Using `NumPy`

In [None]:
import numpy as np

dot_product = np.dot(embedding1, embedding2)
magnitude_1 = np.linalg.norm(embedding1)
magnitude_2 = np.linalg.norm(embedding2)

cosine_similarity = dot_product / (magnitude_1 * magnitude_2)
print(f"Cosine similarity using NumPy: {cosine_similarity}")

Cosine similarity using NumPy: 0.7888635396957397


Cosine Similarity Calculation Using `scikit-learn`

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity_result = cosine_similarity(embedding1.reshape(1, -1), embedding2.reshape(1, -1))
print(f"Cosine similarity using scikit-learn: {cosine_similarity_result[0][0]}")

Cosine similarity using scikit-learn: 0.7088038921356201


Cosine Similarity Calculation Using `sentence-transformers`

In [None]:
similarity = model.similarity(embedding1, embedding2)
print(f"Cosine similarity using sentence-transformers: {similarity.tolist()[0][0]}")

Cosine similarity using sentence-transformers: 0.7088038921356201


Comparing Similarity Between Different Phrases with Different Semantics

In [None]:
print(f"Similarity between texts '{sentences[0]}' and '{sentences[1]}': {model.similarity(sentence_embeddings[0], sentence_embeddings[1]).tolist()[0][0]}")
print(f"Similarity between texts '{sentences[0]}' and '{sentences[2]}': {model.similarity(sentence_embeddings[0], sentence_embeddings[2]).tolist()[0][0]}")

Similarity between texts 'Hello' and 'Good morning': 0.7088038921356201
Similarity between texts 'Hello' and 'red car': 0.20039469003677368


### Your Turn  

- **Phrase Similarity**: Calculate the similarity between the embeddings of phrases with similar meanings, such as *"The weather is nice today"* and *"It's good weather"*. What values do you get?  

- **Synonyms and Antonyms**: Compare the distance between the embeddings of synonyms and antonyms. Are the results consistent with their meanings?  

- **Days of the Week**: Calculate the distance between the embeddings of the days of the week. Do you notice any patterns?  

- **Try Other Models**: Explore different models on the [Hugging Face Model Hub](https://huggingface.co/models). Do you get the same similarity values for the same words?  

- **Cross-Language Comparison**: Obtain embeddings for phrases in different languages, such as *"Hello, how are you?"* and *"Hola, ¿cómo estás?"*. Are they similar if the model is multilingual?  

- **Explore Other Metrics**: Try different distance or similarity metrics for embeddings and compare the results (check the implementations in [sklearn](https://scikit-learn.org/stable/api/sklearn.metrics.html#module-sklearn.metrics.pairwise)).


## Dense search

Once we represent text as embeddings, we need an efficient way to search through large volumes of data. Instead of comparing each vector one by one, which would be very costly, we use vector databases that optimize this search through specialized structures.  

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/75/Base_de_datos_de_vectores_o_vectorial.jpg/1024px-Base_de_datos_de_vectores_o_vectorial.jpg" alt="vector_database" width="700"/>  

Tools like [FAISS](https://ai.meta.com/tools/faiss/) (by Meta) or [Annoy](https://github.com/spotify/annoy) (by Spotify) allow for fast and scalable similarity searches based on cosine similarity (or related metrics such as inner product and Euclidean distance). These vector search tools enable **Dense Search**, which refers to searching in dense embedding spaces.
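
For reference, the naive approach that these tools try to improve on is a brute-force scan: score the query against every stored vector and keep the best `k`. The sketch below uses made-up 2-dimensional vectors and hypothetical helper names.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def brute_force_search(query, vectors, k):
    """Score the query against every stored vector and return the k best (index, score) pairs."""
    scored = [(i, cosine_sim(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 2-dimensional "embeddings"
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
top_hits = brute_force_search([1.0, 0.0], vectors, k=2)
print(top_hits)
```

Every query touches all stored vectors, so the cost grows linearly with the size of the collection. That is acceptable for a handful of passages and becomes the bottleneck for millions.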

### Faiss

[FAISS](https://ai.meta.com/tools/faiss/) (Facebook AI Similarity Search) is an open-source library developed by Meta for efficient similarity search over large collections of embeddings. It is optimized to handle millions or even billions of vectors, making it possible to quickly find those most similar to a given query.

FAISS achieves this efficiency through features such as:

* **Approximate indexes**: Instead of comparing every vector individually, FAISS uses techniques such as clustering and quantization to reduce the number of computations without sacrificing too much accuracy.
* **GPU support**: Accelerates nearest-neighbor search through parallel processing.
* **Different distance metrics**: Such as cosine distance, Euclidean distance, and others, adapting to different applications.

Thanks to these features, FAISS is widely used in semantic search engines, recommendation systems, and other applications that require efficient search in high-dimensional spaces.
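
The clustering idea behind approximate indexes can be sketched in plain Python: vectors are grouped around centroids ahead of time, and a query is compared only against the vectors in the closest group. This is a simplified illustration with hand-picked centroids and made-up data, not FAISS's actual implementation, which learns the centroids from the data and can inspect several groups per query.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the given vector."""
    return min(range(len(centroids)), key=lambda c: squared_distance(vector, centroids[c]))

def build_buckets(vectors, centroids):
    """Assign every vector index to its nearest centroid (the clustering step)."""
    buckets = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        buckets[nearest_centroid(v, centroids)].append(i)
    return buckets

def approximate_search(query, vectors, centroids, buckets, k):
    """Rank only the vectors stored in the bucket closest to the query."""
    candidates = buckets[nearest_centroid(query, centroids)]
    ranked = sorted(candidates, key=lambda i: squared_distance(query, vectors[i]))
    return ranked[:k]

# Hypothetical data: two well-separated groups and hand-picked centroids
vectors = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
centroids = [[0.1, 0.1], [5.0, 5.0]]

buckets = build_buckets(vectors, centroids)
hits = approximate_search([4.9, 5.0], vectors, centroids, buckets, k=2)
print(buckets)
print(hits)
```

Only half of the vectors are compared with the query here. The trade-off is that a true neighbor sitting in a different bucket would be missed, which is why such indexes are called approximate.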

### Let's play

In [2]:
from sentence_transformers import SentenceTransformer
from typing import List, Optional

class Embedder:

  def __init__(self, model_key: Optional[str] = None):
    if model_key is None:
      # Model available through the Hugging Face hub:
      # https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
      model_key = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
    self.model = SentenceTransformer(model_key, trust_remote_code=True)

  def embed_passages(self, passages: List[str]):
    # Generates embeddings for a list of passages
    return self.model.encode(passages, convert_to_numpy=True)

  def embed_query(self, query: str):
    # Some models may have a different function for encoding queries!
    return self.model.encode(query, convert_to_numpy=True)

embedder = Embedder()

In [7]:
# Index passages
passages = [
    # Noise
    "hello", "good morning", "red car", "workshop", "the stove is on",
    "I think, therefore I am",
    # Texts about food
    "the campus food tastes like cardboard",
    "eating at a Michelin-starred restaurant is worthy of study",
    "my mom's cooking should be studied at the university"
]

# Generate embeddings for the passages
passage_embeddings = embedder.embed_passages(passages)

Build the index

In [8]:
import faiss

# Embedding dimensions
d = passage_embeddings[0].shape[0]
print(f"Embedding dimension: {d}")

# Create an index for embeddings of dimension d
index = faiss.IndexFlatIP(d)

Embedding dimension: 768


In [9]:
# Add the embeddings to the index
index.add(passage_embeddings)

In [19]:
# Generate an embedding for the query
query = ["How is the food at the university?"]

query_embedding = embedder.embed_query(query=query)

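k = 4  # Retrieve the k closest neighbors
# Note: IndexFlatIP returns inner products, so despite the name "distances",
# higher values mean more similar passages
distances, indices = index.search(query_embedding, k)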

# Display the results
for it, (query_indice, query_distance) in enumerate(zip(indices, distances)):
  print(f"Query: {query[it]}")
  print()
  for idx, dist in zip(query_indice, query_distance):
      print(f"Index: {idx}, Distance: {dist}")
      print("Text: ", passages[idx])
      print()
  print("--------------\n")

Query: How is my mother's food?

Index: 8, Distance: 5.882347106933594
Text:  my mom's cooking should be studied at the university

Index: 6, Distance: 4.879648208618164
Text:  the campus food tastes like cardboard

Index: 7, Distance: 4.0698347091674805
Text:  eating at a Michelin-starred restaurant is worthy of study

Index: 1, Distance: 2.2087652683258057
Text:  good morning

--------------

Query: How is your Internet connection?

Index: 1, Distance: 3.4579219818115234
Text:  good morning

Index: 0, Distance: 3.045036792755127
Text:  hello

Index: 5, Distance: 2.6832027435302734
Text:  I think, therefore I am

Index: 4, Distance: 1.9340052604675293
Text:  the stove is on

--------------
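A note on reading these numbers: `IndexFlatIP` ranks by inner product, and values above 1 (such as 5.88) indicate that the embeddings are not unit-length, so vector length influences the ranking. The following is a minimal sketch with made-up 2-D vectors (not real embeddings) showing how normalizing the vectors turns the inner product into cosine similarity. With FAISS, `faiss.normalize_L2` can be applied to the arrays before `index.add` and `index.search` to get the same effect.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (all-zero rows are left as zeros)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Toy vectors standing in for embeddings (hypothetical values)
passage_vecs = np.array([[3.0, 0.0], [1.0, 1.0]], dtype="float32")
query_vec = np.array([[1.0, 1.0]], dtype="float32")

# Raw inner product: the longer vector can win even if it points elsewhere
raw_scores = query_vec @ passage_vecs.T

# Inner product of unit vectors equals cosine similarity
cosine_scores = l2_normalize(query_vec) @ l2_normalize(passage_vecs).T

print("Raw inner products:", raw_scores)
print("Cosine similarities:", cosine_scores)
```

In this toy case the raw inner product prefers the first passage, while cosine similarity prefers the second, which points in the same direction as the query.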



### Your Turn  

- **Modify the Query**: Change the query to something entirely different (e.g., *"What is the color of the car?"* or *"What appliance is working?"*). Are the retrieved passages still relevant? How do the distances change?

- **Increase the Number of Neighbors**: Modify the value of `k` to retrieve more (or fewer) nearest neighbors. Do the results become less relevant as `k` increases?  

- **Add More Passages:** Expand the `passages` list with **at least 10 new entries** on different topics (e.g., technology, sports, music, history). Do the results stay relevant, or does the additional data introduce noise?  

- **Experiment with Multiple Queries**: Modify the code to **handle multiple queries simultaneously** (e.g., `query=["How is my mom's food?", "How is your Internet connection?"]`). How does the index respond to multiple searches?

- **Use a Different Model**: Replace the current model with another one from the [Hugging Face Model Hub](https://huggingface.co/models). Do you observe any improvement or degradation in the search accuracy?

## Building a Semantic Search Engine

A semantic search engine retrieves relevant information based on the **meaning** of the text, rather than relying solely on **exact keyword matches**. To achieve this, we convert both the reference texts and the queries into **embeddings** and use a **vector database** to find the most similar results.

![semantic_search](https://raw.githubusercontent.com/rorisDS/workshop_semantic_search/refs/heads/main/images/buscsemant.png)

The construction of a semantic search engine consists of **two main stages**:

- **Indexing:**  
    In this phase, the reference materials are **converted into embeddings** and stored in a **vector database** (such as FAISS) to enable efficient retrieval.  
- **Searching:**  
    When a query is entered, its **embedding is generated** and compared against the indexed documents to retrieve the most relevant ones based on **semantic similarity**.  

Next, we will explore each of these stages in detail and learn how to **implement them in practice**.

### Indexing

The indexing process is divided into several phases within the **data ingestion pipeline (ETL)**:

1. **Text Extraction from Documents**: The first step is to **extract the content** from the documents to be indexed. These documents may come in different formats, such as **PDF, DOCX, or TXT**. Tools like **[PyMuPDF](https://pymupdf.readthedocs.io/en/latest/)** are used to extract text from PDF documents, ensuring that all relevant information is captured for further processing.  

2. **Text Splitting into Chunks**: Since documents are often lengthy, they need to be **split into smaller chunks**, called **passages** or **fragments**. This makes the search more precise, because each result points to a short, focused fragment rather than to a whole document. The text can be divided into blocks based on a **fixed number of words or sentences**.  

3. **Converting Fragments into Embeddings**: Each text fragment is converted into an **embedding**, which is a numerical representation in a **multidimensional space**. This is achieved using **deep learning models**, such as **Sentence Transformers**. These models transform the meaning of the text into **vectors**, enabling **semantic similarity comparisons** instead of relying on exact word matches.  

4. **Indexing in a Vector Database**: The generated embeddings are stored in a **specialized vector database**, such as **FAISS** (Facebook AI Similarity Search). FAISS enables **fast and efficient searches** using advanced vector comparison techniques, identifying the most relevant fragments for a given query.  

![indexation](https://raw.githubusercontent.com/rorisDS/workshop_semantic_search/refs/heads/main/images/buscsemant_indexation.png)
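One common refinement of the chunking step is to let consecutive chunks share some words, so that a sentence cut at a chunk boundary still appears whole in at least one chunk. The sketch below is an illustration only: the function name `split_with_overlap` and its default values are assumptions, not part of the workshop code.

```python
from typing import List

def split_with_overlap(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Tiny example with 10 "words" so the overlap is easy to see
sample = " ".join(f"w{i}" for i in range(10))
chunks = split_with_overlap(sample, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Larger overlaps reduce the risk of losing context at the boundaries, at the cost of storing and searching more chunks.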


In [20]:
import requests
import fitz  # PyMuPDF
import os
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

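def download_pdf(url: str, output_path: str):
    """Downloads a PDF from a URL"""
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # Fail early instead of saving an HTTP error page as a PDF
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"PDF downloaded: {output_path}")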

def pdf_to_text(pdf_path: str) -> str:
    """Converts the content of a PDF into plain text"""
    # The context manager closes the file once the text has been extracted
    with fitz.open(pdf_path) as doc:
        text = "\n".join(page.get_text("text") for page in doc)
    return text

def split_text(text: str, chunk_size: int = 200) -> list:
    """Splits the text into chunks"""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

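_MODEL_CACHE = {}

def create_embeddings(passages: list, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Converts text passages into embeddings"""
    # Reuse the loaded model across calls instead of reloading it for every query
    if model_name not in _MODEL_CACHE:
        _MODEL_CACHE[model_name] = SentenceTransformer(model_name)
    embeddings = _MODEL_CACHE[model_name].encode(passages, convert_to_numpy=True)
    return embeddings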

def store_embeddings(embeddings, index_path="faiss_index.bin"):
    """Indexes embeddings in Faiss"""
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings)
    faiss.write_index(index, index_path)
    print(f"Embeddings saved in {index_path}")

In [21]:
# 0. Retrieve the document to be processed
pdf_url = "https://d1.awsstatic.com/aws-analytics-content/OReilly_book_Natural-Language-and-Search_web.pdf"  # Example URL
pdf_path = "document.pdf"
download_pdf(pdf_url, pdf_path)

# 1. Extract text
text = pdf_to_text(pdf_path)

# 2. Split into chunks
passages = split_text(text)

# 3. Generate embeddings
embeddings = create_embeddings(passages)

# 4. Store embeddings in the vector database
index_path = "faiss_index.bin"
store_embeddings(embeddings, index_path)
print("Indexing completed.")

PDF downloaded: document.pdf


Embeddings saved in faiss_index.bin
Indexing completed.


### Search

Once the documents have been indexed and stored in a vector database, the semantic search engine can efficiently respond to user queries. This process is divided into several phases:

1. **Query Conversion to Embedding**: When a user enters a query, the system first converts it into an embedding. This must be done with the same embedding model used to index the documents (e.g., Sentence Transformers), so that the query and the passages are represented in the same vector space and can be compared.

2. **Searching for the Embedding in the Vector Database**: Once the query's numerical representation is obtained, it is compared with the embeddings stored in the vector database (such as FAISS). FAISS uses efficient search algorithms to find the closest vectors in terms of similarity.

3. **Identifying the Most Similar Passages**: Once the most similar embeddings are identified, their identifiers are used to retrieve each passage. Finally, the corresponding textual content is presented to the user as a response.

![search](https://raw.githubusercontent.com/rorisDS/workshop_semantic_search/refs/heads/main/images/buscsemant_query.png)
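To make the second phase concrete, here is a minimal NumPy sketch of what an exhaustive ("flat") index does conceptually: compare the query against every stored vector and keep the closest ones. The helper `brute_force_search` and the toy vectors are illustrative assumptions. It uses squared Euclidean distance, which is the quantity `IndexFlatL2` reports, so smaller values mean closer matches.

```python
import numpy as np

def brute_force_search(query_vec: np.ndarray, stored: np.ndarray, top_k: int = 2):
    """Return (squared L2 distances, row ids) of the top_k closest stored vectors."""
    diffs = stored - query_vec                       # broadcast over all rows
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)   # row-wise squared norm
    order = np.argsort(sq_dists)[:top_k]             # smallest distance first
    return sq_dists[order], order

# Toy 2-D "embeddings" (hypothetical values)
stored = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]], dtype="float32")
query_vec = np.array([0.9, 0.1], dtype="float32")

dists, ids = brute_force_search(query_vec, stored, top_k=2)
for dist, idx in zip(dists, ids):
    print(f"id={idx}  squared distance={dist:.4f}")
```

This brute-force approach is exact but scales linearly with the number of stored vectors, which is why libraries such as FAISS also offer approximate index types for large collections.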

In [22]:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def load_index(index_path="faiss_index.bin"):
    """Load the index from the specified file path"""
    index = faiss.read_index(index_path)
    return index

def search(query, index, passages, top_k=5):
    """Search the top_k most similar passages to the query"""

    # 1. Convert the query to embedding
    query_embedding = create_embeddings([query])

    # 2. Search for the top_k nearest vectors
    # (the index was built with IndexFlatL2, so a smaller distance means a closer match)
    distances, indices = index.search(query_embedding, top_k)

    # 3. Identify the passages and their associated distance
    results = [
      (passages[idx], distances[0][i]) for i, idx in enumerate(indices[0])
    ]

    return results

In [23]:
# Load the index and texts
index_path = "faiss_index.bin"

index = load_index(index_path)

# Perform search
test_query = "How many dimensions should an embedding have?"
results = search(test_query, index, passages)

print("Search Results:")
for passage, score in results:
    print(f"Score: {score:.4f}\n{passage}\n")

Search Results:
Score: 1.2135
helps reduce nonsensical implications, but how many more dimen‐ sions is enough? 14 | Chapter 3: Vectors: Representing Semantic Information Dimensionality One answer is to use as many dimensions as there are words in the vocabulary. Let aardvark = (1,0,0,0,...), abacus = (0,1,0,0,...), apple = (0,0,1,0,...), and so on. With this encoding, no two words can be added to give another word, since no word’s representation can have more than one 1. This is often called sparse, keyword, or one-hot encoding, and the corresponding vectors are known as sparse or one-hot vectors. Sparse vectors are quite useful in search and other natural language tasks, and until quite recently they were considered to be the state of the art. But sparse vectors can’t combine in a meaningful way (no word has more than one “1”). Take the words red and fruit. Adding their vectors should not produce the vector corresponding to aardvark or banana. But it would make sense if it equaled the

### Your Turn

* **Add new materials**: Find more references (e.g., books, articles) and index their content to perform queries on them.

* **Evaluate performance in specialized domains**: Test with niche texts, such as legal documentation or scientific articles. If the search engine doesn't return good results, what factor is limiting it? Is it FAISS? The embedding model? The chunk size? The text extraction?

* **Try different models**: Explore the [Hugging Face Model Hub](https://huggingface.co/models) and try alternatives. Are the results consistent for the same queries?

* **Adjust chunk size**: During indexing, experiment with different chunk sizes and evaluate how they affect the results.

* **Explore other vector databases**: How does performance change if you use Annoy or another alternative instead of FAISS?

The code presented in this notebook is a basic starting point. In a real-world scenario, you should consider that:

* **Reference materials can come from multiple sources and formats** (PDFs, documents, web pages, databases, etc.).

* **Indexing should be dynamic**: You don't always index everything at once; new documents may be added later.

* **Managing metadata is key**: FAISS only returns a numeric identifier for each text chunk, so you need to store metadata that allows you to locate the original document and the exact position of the indexed fragment.

Many other optimizations can improve the performance and robustness of the system. If you want to build a real semantic search engine, it's time to keep exploring and improving!
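As an illustration of the metadata and dynamic-indexing points, the sketch below keeps a side table that maps each vector id to its source document and chunk position. It assumes a flat FAISS index, where vectors receive sequential ids (0, 1, 2, ...) in the order they are added. The names `ChunkMetadata` and `MetadataStore` are hypothetical, and in a real system this table would be persisted alongside the index file.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ChunkMetadata:
    source: str        # file name or URL of the original document
    chunk_number: int  # position of the chunk within that document
    text: str          # the chunk itself, to show it in the results

class MetadataStore:
    """Side table aligned with the sequential ids of a flat vector index."""

    def __init__(self) -> None:
        self._records: List[ChunkMetadata] = []

    def add_document(self, source: str, chunks: List[str]) -> range:
        """Register the chunks of one document; returns the ids they receive."""
        first_id = len(self._records)
        for number, chunk in enumerate(chunks):
            self._records.append(ChunkMetadata(source, number, chunk))
        return range(first_id, len(self._records))

    def lookup(self, vector_id: int) -> ChunkMetadata:
        # FAISS uses -1 for "no result", so negative ids must not be indexed
        if not 0 <= vector_id < len(self._records):
            raise KeyError(f"Unknown vector id: {vector_id}")
        return self._records[vector_id]

store = MetadataStore()
store.add_document("guide.pdf", ["intro text", "setup steps"])
new_ids = store.add_document("rules.pdf", ["how to win"])  # added later

hit = store.lookup(2)
print(f"{hit.source} (chunk {hit.chunk_number}): {hit.text}")
```

The embeddings of each new document would be added to the index in the same order as its chunks are registered here, so that both structures stay aligned.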

    

## Application of a Semantic Search Engine: RAG

The concept of RAG (Retrieval-Augmented Generation) combines two components: retrieval and generation. Instead of generating an answer solely from the knowledge internalized in the LLM, the model first retrieves relevant information from an external database or search index (such as a semantic search engine), and then uses this retrieved information to generate a more precise and contextualized response.

This approach enhances the ability of LLMs to answer complex and specialized questions by giving them access, at query time, to external information that was not part of their training data. Thus, RAG combines the best of semantic search with text generation, improving both the accuracy and relevance of the answers.

![esquema_rag](https://raw.githubusercontent.com/rorisDS/workshop_semantic_search/refs/heads/main/images/rag.png)
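The "augmented" part of RAG essentially comes down to placing the retrieved passages in the prompt before the question. The following is a minimal sketch of that step, independent of any framework: the function `build_rag_prompt`, the prompt wording, and the sample passages are assumptions for illustration, and the resulting string would then be sent to the LLM.

```python
from typing import List

def build_rag_prompt(question: str, passages: List[str]) -> str:
    """Combine retrieved passages and the user question into a single prompt."""
    # Number the passages so the model (and the reader) can tell them apart
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say 'I don't know'.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Sample passages standing in for the output of the semantic search
retrieved = [
    "The default username is admin.",
    "The default password is TIME followed by the last 4 characters of the MAC address.",
]
prompt = build_rag_prompt("What is the default password of the router?", retrieved)
print(prompt)
```

Instructing the model to rely only on the supplied context is a common way to reduce invented answers, although it does not eliminate them.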



### LangChain

[**LangChain**](https://python.langchain.com/docs/tutorials/) is an open-source library designed to facilitate the creation of applications that integrate **Large Language Models** (**LLMs**) with various external data sources, such as knowledge bases, databases, or **semantic search engines**. LangChain allows the construction of complex and customized workflows for tasks such as information retrieval, text generation, and interaction with APIs (such as the one offered by **OpenAI**).

**LangChain** simplifies the development of powerful and scalable applications, helping to create search and text generation systems that efficiently leverage the power of **LLMs**.

### OpenAI

**[OpenAI](https://openai.com/)** is an organization that develops advanced models for generating and understanding text, images, and more. Through its **API**, **OpenAI** provides access to powerful models like **GPT-3** and **GPT-4**, enabling integration of these models into applications.

The **OpenAI** API is accessible via **HTTP** requests. To use it, you need to obtain an API key, which can be generated on the **OpenAI** website. To learn more, visit the following link: https://platform.openai.com/docs/quickstart



### Let's play!

* OpenAI API KEY

Enter your **OpenAI** API Key to use the **OpenAI** API (how to obtain the API key: https://platform.openai.com/docs/quickstart)

In [24]:
# Enter your OpenAI API Key
openai_api_key = "YOUR_OPENAI_API_KEY"

# --- This process is for me so I don't share my token with everyone while doing the workshop!!!
from google.colab import userdata
try:
  secret_openai_api_key = userdata.get('OPENAI_API_KEY')
except userdata.SecretNotFoundError:
  secret_openai_api_key = None  # Assign None if secret is not found

if secret_openai_api_key is not None:
  openai_api_key = secret_openai_api_key

import os
# LangChain gets the API KEY from the environment variables
os.environ["OPENAI_API_KEY"] = openai_api_key

* Without RAG

We query **OpenAI** directly, without providing any context.

In [34]:
from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.0
)

# Test 1: Question about a technical document
# query = "What is the default account and password for the TP-LINK C1200 router? Say 'I don't know' if you don't know the answer"  # If you want to force it not to invent an answer
query = "What is the default account and password for the TP-LINK C1200 router?"
response = chat.predict(query)
print("Question:", query)
print("Model's response:", response)
print("\n-----\n")

# Test 2: Question about a board game
# query = "Describe how to win the Game of Thrones board game? Say 'I don't know' if you don't know the answer"  # If you want to force it not to invent an answer
query = "Describe how to win the Game of Thrones board game?"
response = chat.predict(query)
print("Question:", query)
print("Model's response:", response)
print("\n-----\n")

# Test 3: Question about a book
# query = "What is the significance of the black box? Say 'I don't know' if you don't know the answer"  # If you want to force it not to invent an answer
query = "What is the significance of the black box?"
response = chat.predict(query)
print("Question:", query)
print("Model's response:", response)
print("\n-----\n")

Question: What is the default account and password for the TP-LINK C1200 router?
Model's response: The default username and password for the TP-LINK C1200 router is usually:

Username: admin
Password: admin

-----

Question: Describe how to win the Game of Thrones board game?
Model's response: To win the Game of Thrones board game, players must strategically navigate the political landscape of Westeros, build alliances, and conquer territories to gain control of the Iron Throne. Here are some key strategies to help you win the game:

1. Form alliances: Forming alliances with other players can help you gain control of territories and fend off attacks from rival houses. Make sure to choose your allies wisely and be prepared to betray them if necessary.

2. Expand your influence: Focus on expanding your influence across the map by conquering territories and building strongholds. This will give you more power and resources to use in future battles.

3. Manage your resources: Make sure to m

* With RAG

We now include context in the **OpenAI** query, retrieved through semantic search.

In [35]:
import requests
import fitz
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS


pdf_urls = [
    # TP-LINK C1200 router setting up instructions
    "https://www.time.com.my/wp-content/uploads/assets/pdf/TPLinkRouter_SetupGuide_Wifi.pdf",

    # Game of Thrones board game rules book
    "https://cdn.1j1ju.com/medias/05/ff/1f-a-game-of-thrones-the-board-game-rulebook.pdf",

    # "The Lottery" story by Shirley Jackson
    "https://bpb-us-e2.wpmucdn.com/sites.middlebury.edu/dist/d/2396/files/2019/09/jackson_lottery.pdf"
]

# Download PDFs
pdf_files = []
for pdf_url in pdf_urls:
  pdf_file = pdf_url.split("/")[-1]
  response = requests.get(pdf_url)
  with open(pdf_file, "wb") as f:
      f.write(response.content)
  pdf_files.append(pdf_file)

# Convert PDF to text using PyMuPDF
texts = []
for pdf_file in pdf_files:
  doc = fitz.open(pdf_file)
  text = "\n".join([page.get_text() for page in doc])
  texts.append(text)

# Split text into chunks using LangChain!
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = []
for text in texts:
  doc_chunks = text_splitter.split_text(text)
  chunks.extend(doc_chunks)

# Create document embedder using HuggingFace
model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
embedder = HuggingFaceEmbeddings(model_name=model_name)

# Create the FAISS vector store using LangChain's implementation!
vector_store = FAISS.from_texts(chunks, embedder)

In [37]:
import os
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI


# Configure the OpenAI LLM (Language Model)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0)

# Set up the RetrievalQA chain (LangChain) combining semantic search with generation
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)


# Test 1: Question about a technical document
query = "What is the default account and password for the TP-LINK C1200 router?"
response = rag_chain(query)
print("Question:", query)
print("Model's answer:", response["result"])
print("\n-----\n")

# Test 2: Question about a board game
query = "Describe how to win the Game of Thrones board game?"
response = rag_chain(query)
print("Question:", query)
print("Model's answer:", response["result"])
print("\n-----\n")

# Test 3: Question about a book
query = "What is the significance of the black box?"
response = rag_chain(query)
print("Question:", query)
print("Model's answer:", response["result"])
print("\n-----\n")

Question: What is the default account and password for the TP-LINK C1200 router?
Model's answer: The default username for the TP-LINK C1200 router is "admin" and the default password is "TIME" followed by the last 4 characters of the MAC address.

-----

Question: Describe how to win the Game of Thrones board game?
Model's answer: To win the Game of Thrones board game, a player must either control the most areas containing either a Castle or Stronghold at the end of the 10th game round or immediately win if they control seven such areas at any time during the game. Players achieve this by mustering armies, conquering territory, forming alliances, and strategically placing Order tokens during the Planning Phase to gain control over Castles and Strongholds. Ultimately, the player who dominates the most key areas in Westeros will claim victory and the Iron Throne.

-----

Question: What is the significance of the black box?
Model's answer: The black box in Shirley Jackson's "The Lottery" 

### Your Turn

- **Test general knowledge queries**: Ask questions where the context is part of common knowledge, such as: *"What is the capital of Spain?"*. Do you need RAG to get a good answer, or can the model answer these on its own?

- **Create specific questions**: Analyze the reference material you've indexed and create questions whose answers can only be found in that content. Check how the system responds **with** and **without** RAG.

- **Integrate new information sources**: Add new documents to the index obtained from the web. For example, you can use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to extract content from HTML pages.

- **Explore different embedding models**: Review the [HuggingFace model hub](https://huggingface.co/models) and try other embedding models. What differences do you notice in the answers?

- **Explore different OpenAI models**: Review the [OpenAI model hub](https://platform.openai.com/docs/models) and try other LLMs by changing the model name in `ChatOpenAI(model_name="xxxxx", temperature=0.0)`. What differences do you observe in the answers?

- **Adjust system parameters**: Experiment with different values for `chunk_size`, `chunk_overlap`, and `k` (the number of retrieved documents). Reflect on how these parameters affect the quality of the answers.

- **Adjust OpenAI parameters**: Experiment with different values between 0 and 1 for the `temperature` parameter in `ChatOpenAI(model_name="gpt-3.5-turbo", temperature=x.x)`. How does this parameter affect the answers?
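For the exercise on integrating web content, the key step is turning HTML into plain text before chunking it. The sketch below uses Python's built-in `html.parser` instead of BeautifulSoup, purely so that it runs without extra dependencies. The class `TextExtractor` and the sample page are illustrative assumptions, and BeautifulSoup handles messy real-world HTML more robustly.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Small inline page instead of a real download
page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Router guide</h1><p>Default user: <b>admin</b></p>"
    "<script>var x=1;</script></body></html>"
)
page_text = html_to_text(page)
print(page_text)
```

The resulting text can go through the same splitting, embedding, and indexing steps as the text extracted from the PDFs.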






## Conclusion

You've made it to the end! Throughout this notebook, we have explored essential tools and concepts to begin working with language models and build a basic semantic search engine.

I encourage you to keep learning and diving deeper into the world of NLP and artificial intelligence, where the possibilities are vast and constantly evolving.

Don't forget to check out the presentation available in [the repository](https://github.com/rorisDS/workshop_semantic_search).

Notebook created by [Víctor Manuel Alonso Rorís](https://www.linkedin.com/in/victor-roris/) on March 11, 2025.