# Information retrieval

Information Retrieval (IR) is a Natural Language Processing (NLP) task whose objective is to retrieve relevant information from a corpus of text, based on a query. For this purpose, text encoder models are trained to represent texts as `embeddings` (i.e., vectors of numbers) that capture their meaning. The same model then encodes the query, and the query embedding is expected to lie closer to the embeddings of the texts that help answer it.
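
To make the notion of "closer" more concrete, here is a minimal sketch of the cosine similarity between embeddings. The vectors are hand-made 3-dimensional values chosen for illustration (real `bge-m3` embeddings have 1024 dimensions), and `cosine_similarity` is a helper written for this example, not a function of the library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot    = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made "embeddings" (illustrative values, not model outputs)
query     = [1.0, 0.0, 1.0]
related   = [0.9, 0.1, 0.8]
unrelated = [0.0, 1.0, 0.0]

sim_related   = cosine_similarity(query, related)
sim_unrelated = cosine_similarity(query, unrelated)
print(sim_related > sim_unrelated)  # True
```

A text pointing in roughly the same direction as the query gets a score close to 1, while an orthogonal one gets a score of 0.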

This notebook shows how to perform Information Retrieval to retrieve relevant spans of text within a set of documents. It is divided into 3 sections : 
1. **Document processing** : the 1st step is to extract spans of text from all the desired documents.
2. **Text encoding** : the 2nd step is to encode all the spans of text using an appropriate embedding model. For this demonstration, we will use the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) model.
3. **Search query** : the last step is to encode the query with the same model, and retrieve the spans with the lowest distance (or highest similarity) to the query embedding. 

## 1. Document processing

The document processing step aims to extract texts from documents. The `parse_document` method accepts filenames, directories and filename patterns (as below), and returns a list of paragraphs. Paragraphs are more convenient than the full extracted text for further processing, like encoding the texts ;) The method currently accepts `.txt`, `.md`, `.pdf` and `.docx` file formats, and more will be added in the future !
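
To give an intuition of what "a list of paragraphs" with section information can look like, here is a simplified sketch for markdown text. It is **not** the implementation of `parse_document` : the `split_markdown` function is hypothetical, it only handles headings and blank lines, and it would wrongly treat a `#` comment inside a code block as a title.

```python
def split_markdown(text, filename = None):
    """Split markdown text into paragraphs, tracking the titles of the enclosing sections."""
    paragraphs, titles, buffer = [], [], []

    def flush():
        # store the lines accumulated so far as a single paragraph
        if buffer:
            paragraphs.append({
                'text'           : ' '.join(buffer),
                'section_titles' : list(titles),
                'filename'       : filename,
            })
            buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith('#'):
            flush()
            level = len(stripped) - len(stripped.lstrip('#'))
            # keep only the titles of the parent sections, then add the new title
            del titles[level - 1 :]
            titles.append(stripped.lstrip('#').strip())
        elif stripped:
            buffer.append(stripped)
        else:
            flush()
    flush()
    return paragraphs

sample = """# Title

Intro paragraph.

## Usage

First line
second line.
"""
paragraphs = split_markdown(sample, filename = 'README.md')
for paragraph in paragraphs:
    print(paragraph['section_titles'], '->', paragraph['text'])
```

Keeping the section titles and the filename next to each paragraph is what later allows grouping the texts and displaying where a retrieved passage comes from.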

For this demonstration, I will use the `README` files from all my GitHub repositories. This also makes it easier to evaluate the relevance of the retrieved documents ! 

In [1]:
import pandas as pd

from utils.text import parse_document

documents = parse_document('../**/README.md')
documents = pd.DataFrame(documents)
print('# texts : {}'.format(len(documents)))
documents.head()

# texts : 323


| | text | type | section | section_titles | filename | language |
|---|---|---|---|---|---|---|
| 0 | # :yum: Data processing utilities | text | 1.0 | [:yum: Data processing utilities, Project stru... | ../data_processing/README.md | |
| 1 | Check the CHANGELOG file to have a global over... | text | 1.0 | [:yum: Data processing utilities, Project stru... | ../data_processing/README.md | |
| 2 | ## Project structure | text | 1.1 | [:yum: Data processing utilities, Project stru... | ../data_processing/README.md | |
| 3 | Check the provided notebooks to have an overvi... | text | 1.1 | [:yum: Data processing utilities, Project stru... | ../data_processing/README.md | |
| 4 | ├── example_data : data used for the de... | code | 1.1 | [:yum: Data processing utilities, Project stru... | ../data_processing/README.md | bash |


## 2. Text encoding

Now that all the texts are extracted along with additional information (like section titles / filename), we can encode them into embeddings ! For this purpose, let's initialize a `TextEncoder` model with the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) model. Then, we can use this model to encode the texts by using the `embed` method. 

`embed` is a batched function, meaning that you can provide the `batch_size` argument to control the number of texts to encode in parallel. An important aspect to consider is that texts encoded in parallel are padded in order to form a rectangular matrix (i.e., the shorter texts get padding values at the end, so that all texts within a batch have the same length). The function has been optimized by sorting the texts by length, in order to minimize padding. However, it remains worthwhile to correctly tune the `batch_size`, as it has a large impact on performance ! My recommendation would be to use a small value, around 8. 
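
The effect of sorting on padding can be illustrated with a small sketch. The token counts below are made-up values and `padding_cost` is a helper written for this example (it is not part of the library) : it simply counts how many padding positions are needed when the texts are batched in a given order.

```python
def padding_cost(lengths, batch_size):
    """Total number of padding positions when batching texts in the given order."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch  = lengths[start : start + batch_size]
        # every text is padded up to the longest text of its batch
        total += sum(max(batch) - length for length in batch)
    return total

# Hypothetical token counts for 8 texts
lengths = [5, 120, 8, 110, 6, 115, 7, 125]

unsorted_cost    = padding_cost(lengths, batch_size = 4)
sorted_cost      = padding_cost(sorted(lengths), batch_size = 4)
large_batch_cost = padding_cost(sorted(lengths), batch_size = 8)
print(unsorted_cost, sorted_cost, large_batch_cost)  # 484 36 504
```

In this toy case, sorting reduces the padding from 484 to 36 positions, while a single large batch brings it back to 504 even with sorted texts. This is one possible reason why a small `batch_size` can be a good choice when text lengths vary a lot.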

The model is compiled using `XLA` by default, which explains why some calls are slower than the subsequent ones : the function has to be retraced, typically when a new input shape is encountered. 

On the 1st call, the official `transformers` model is downloaded and converted to my `keras` implementation of the `XLMRoberta` architecture. This conversion requires the `torch` library to be installed. Once done, the model is saved in regular keras format under the `pretrained_models/{name}` folder for subsequent loading. 

In [1]:
from models.encoder.text_encoder import TextEncoder

model = TextEncoder(pretrained = 'BAAI/bge-m3', name = 'bge-m3')
print(model)

Loading weights from `pretrained_models/bge-m3/saving/ckpt-0000.weights.h5`
TextEncoder `bge-m3` initialized successfully !

Model :
- Inputs 	: unknown
- Outputs 	: unknown
- Number of layers 	: 26
- Number of parameters 	: 567.756 Millions
- Model not compiled yet

Transfer-learning from : BAAI/bge-m3
Already trained on 0 epochs (0 steps)

- Embedding dim   : 1024
- Distance metric : cosine
- Language : multi
- Vocabulary (size = 250002) : ['<s>', '<pad>', '</s>', '<unk>', ',', '.', '▁', 's', '▁de', '-', '▁a', 'a', ':', 'e', 'i', '▁(', ')', '▁i', 't', 'n', '▁-', '▁la', '▁en', '▁in', '▁na', ...]



In [6]:
from tqdm import tqdm

vectors = model.predict(
    documents  = '../**/README.md',
    batch_size = 8,

    save        = False,
    chunk_size  = 256,
    group_by    = ('filename', 'section_titles'),
    primary_key = ('filename', 'text'),

    tqdm = tqdm,
)
print(vectors)

<VectorDatabase path=documents.db key=('filename', 'text') length=133>


## 3. Search query

The final step is to encode the query, then compute the `cosine similarity` (or any other distance/similarity metric) between the embedded query and all embedded texts, and take the top-k with the best score ! All these steps are performed internally by the `search` method of the `DenseVectors` class ;)

We can observe that the best results are correctly related to `embeddings`, and even better, the best retrieved passage correctly defines the notion of embeddings !

It is worth mentioning that the model will retrieve passages whether they are relevant or not, as it simply provides a score for each passage. Therefore, if the query does not have any relevant span in the provided texts, it will return irrelevant spans. Nonetheless, the scores for such irrelevant passages (in the 2nd example) are lower than those for relevant ones (in the 1st example) ;) 
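
The steps described above (scoring, top-k selection, and optionally discarding low scores) can be sketched in a few lines. This is a simplified illustration in pure Python, not the actual implementation of the library : `search_top_k` and its `min_score` argument are hypothetical names, and the 2-dimensional vectors are toy values.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search_top_k(query, embeddings, k = 3, min_score = None):
    """Return up to `k` (index, score) pairs, best first, optionally dropping low scores."""
    scores = [cosine_similarity(query, embedding) for embedding in embeddings]
    ranked = sorted(enumerate(scores), key = lambda pair: pair[1], reverse = True)[:k]
    if min_score is not None:
        ranked = [(index, score) for index, score in ranked if score >= min_score]
    return ranked

# Toy 2-dimensional "embeddings" (illustrative values, not model outputs)
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query      = [1.0, 0.1]

top_2    = search_top_k(query, embeddings, k = 2)
filtered = search_top_k(query, embeddings, k = 2, min_score = 0.9)
print([index for index, _ in top_2])     # [0, 2]
print([index for index, _ in filtered])  # [0]
```

Without `min_score`, the function always returns `k` results, no matter how low their scores are. A threshold is a simple way to discard irrelevant passages, but its value is an assumption to tune on your own data : in the examples below, the scores of relevant and irrelevant passages are not that far apart.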

### Example 1

In [7]:
import json

from utils import to_json

query = 'What is an embedding ?'

res = model.retrieve(query, vectors, k = 3)[0]
print(json.dumps(to_json(res), indent = 4))

[
    {
        "section_titles": [
            ":yum: Encoder networks",
            "Project structure"
        ],
        "language": "bash",
        "filename": "../encoders/README.md",
        "text": "text encoder that uses pretrained embedding models\n\u251c\u2500\u2500 pretrained_models\n\u251c\u2500\u2500 unitests\n\u251c\u2500\u2500 utils\n\u251c\u2500\u2500 speaker_verification.ipynb\n\u2514\u2500\u2500 information_retrieval.ipynb Check the main project for more information about the unextended modules / structure / main classes.\n\n **Important Note** : this project is the keras 3 extension of the siamese network project. All features are not available yet. Once the convertion will be completely finished, the siamese networks project will be removed in favor of this one.",
        "score": 0.48156794905662537
    },
    {
        "type": "text",
        "section_titles": [
            ":yum: Text To Speech (TTS)",
            "Multi-speaker Text-To-Speech",
            "Aut

In [4]:
for paragraph in res:
    print('Text from file `{filename}` - section {section_titles}\nScore : {score:.3f}\n{text}\n'.format(**paragraph))


Text from file `../encoders/README.md` - section [':yum: Encoder networks', 'Project structure']
Score : 0.482
text encoder that uses pretrained embedding models
├── pretrained_models
├── unitests
├── utils
├── speaker_verification.ipynb
└── information_retrieval.ipynb Check the main project for more information about the unextended modules / structure / main classes.

 **Important Note** : this project is the keras 3 extension of the siamese network project. All features are not available yet. Once the convertion will be completely finished, the siamese networks project will be removed in favor of this one.

Text from file `../text_to_speech/README.md` - section [':yum: Text To Speech (TTS)', 'Multi-speaker Text-To-Speech', 'Automatic voice cloning with the `SV2TTS` architecture', 'The basic intuition']
Score : 0.471
2. This pre-trained `Speaker Encoder (SE)` is then used to encode the voice of the speaker to clone.
 3. The produced embedding is then concatenated with the output of th

### Example 2 : irrelevant query

In [9]:
query = 'What is the meaning of life ?'

res = model.retrieve(query, vectors, k = 3)[0]

for paragraph in res:
    print('Text from file `{filename}`\nScore : {score:.3f}\n{text}\n'.format(**paragraph))


Text from file `../detection/README.md`
Score : 0.403
1]` |
| Applications | General detection + classification | Medical image detection / object extraction |
| Model architecture | Full CNN 2D downsampling to `(grid_h, grid_w)` | Full CNN with downsampling and upsampling |
| Post processing | Decode output to get position of boxes | Thresholding pixel confidence |
| Model mechanism | Split image into grid and detect boxes in each grid cell | Downsample the image and upsample it to give probability of object for each pixel |
| Support multi-label classification | Yes, by design | Yes, but not its main application | \* This is the classical output shape of `YOLO` models. The last dimension is `[x, y, w, h, confidence, * class_score]`

 More advanced strategies also exist, differing from the standard methodologies described above. This aims to be a simple introduction to object detection and segmentation.

Text from file `../yui-mhcp/README.md`
Score : 0.400
<h2 align="center">
<p> :yum