# Information retrieval

Information Retrieval (IR) is a Natural Language Processing (NLP) task in which the objective is to retrieve relevant information from a corpus of text, based on a query. For this purpose, text encoder models are trained to represent texts as `embeddings` (i.e., vectors of numbers) that aim to capture the meaning of the text. The same model also encodes the query, and ideally the query embedding ends up closer to the embeddings of the texts that help answer it.

This notebook shows how to perform Information Retrieval to retrieve relevant spans of text within a set of documents. It is divided into 3 sections :
1. **Document processing** : the 1st step is to extract spans of text from all the desired documents.
2. **Text encoding** : the 2nd step is to encode all the spans of text using an appropriate embedding model. For this demonstration, we will use the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) model.
3. **Search query** : finally, the last step is to encode the query with the same model, and retrieve the spans with the lowest distance (or highest similarity) to the query embedding.

## 1. Document processing

The document processing step aims to extract texts from documents. The `parse_document` method accepts filenames, directories and filename formats (as below), and returns a list of paragraphs. This is more convenient than full-text extraction for further processing, like encoding the texts ;) The method currently accepts the `.txt`, `.md`, `.pdf` and `.docs` file formats, and more will be added in the future !
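To give an intuition of what paragraph-level extraction with section titles looks like, here is a minimal sketch for markdown text. It is not the implementation of `parse_document` : the `split_markdown_paragraphs` helper is hypothetical, and it makes simplifying assumptions (paragraphs are separated by blank lines, and every line starting with `#` is a heading, even inside code blocks).

```python
def split_markdown_paragraphs(text, filename = None):
    """Hypothetical helper : splits markdown text into paragraphs, tracking the active section titles"""
    paragraphs, titles, buffer = [], [], []

    def flush():
        # stores the current paragraph (if any) along with a copy of the active titles
        if buffer:
            paragraphs.append({
                'filename' : filename,
                'section_titles' : list(titles),
                'text' : '\n'.join(buffer)
            })
            buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith('#'):
            flush()
            level = len(stripped) - len(stripped.lstrip('#'))
            # keeps only the parent titles, then adds the new one
            del titles[level - 1 :]
            titles.append(stripped.lstrip('#').strip())
        elif not stripped:
            # a blank line closes the current paragraph
            flush()
        else:
            buffer.append(stripped)
    flush()
    return paragraphs

sample = '# Title\n\nIntro paragraph.\n\n## Section A\n\nFirst line\nsecond line.\n\n## Section B\n\nLast paragraph.'
paragraphs = split_markdown_paragraphs(sample, filename = 'README.md')
for paragraph in paragraphs:
    print(paragraph['section_titles'], '->', repr(paragraph['text']))
```

Keeping the section titles next to each paragraph is what later makes it possible to display where a retrieved passage comes from.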

For this demonstration, I will use the `README` files from all my GitHub repositories. This also makes it easier to evaluate the relevance of the retrieved documents !

In [None]:
import pandas as pd

from utils.text import parse_document

documents = parse_document('../**/README.md')
documents = pd.DataFrame(documents)
print('# texts : {}'.format(len(documents)))
documents.head()

## 2. Text encoding

Now that we have all the texts extracted with additional information (like section title / filename), we can encode them using embeddings ! For this purpose, let's initialize a `TextEncoder` model with the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) model. Then, we can use this model to encode the texts by using the `embed` method. 

`embed` is a batched function, meaning that you can provide the `batch_size` argument to control the number of texts encoded in parallel. An important aspect to consider is that texts are padded when processed in parallel, in order to form a rectangular matrix (i.e., the shorter texts have zero-values at the end so that all texts within a batch have the same length). The function has been optimized by sorting the texts by length, in order to minimize padding. However, it is still worth tuning the `batch_size` carefully, as it has a large impact on performance ! My recommendation would be to use a small value, around 8.
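The effect of sorting by length on padding can be illustrated with a small `numpy` sketch. This is not the code used by the model : `pad_batch` and `count_padding` are hypothetical helpers, the token sequences are made up, and `0` is assumed to be the padding value.

```python
import numpy as np

def pad_batch(sequences, pad_value = 0):
    """Pads variable-length sequences with `pad_value` to form a rectangular matrix"""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_value, dtype = np.int32)
    for i, seq in enumerate(sequences):
        batch[i, : len(seq)] = seq
    return batch

def count_padding(sequences, batch_size, sort_by_length):
    """Counts the total number of padded positions over all batches"""
    if sort_by_length:
        sequences = sorted(sequences, key = len)

    n_pad = 0
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start : start + batch_size]
        n_pad += pad_batch(chunk).size - sum(len(seq) for seq in chunk)
    return n_pad

# made-up token sequences alternating short and long texts
lengths = [3, 50, 4, 48, 5, 52, 2, 49]
sequences = [list(range(1, n + 1)) for n in lengths]

unsorted_padding = count_padding(sequences, batch_size = 4, sort_by_length = False)
sorted_padding = count_padding(sequences, batch_size = 4, sort_by_length = True)
print('Padding without sorting : {}'.format(unsorted_padding))
print('Padding with sorting    : {}'.format(sorted_padding))
```

On this toy input, sorting reduces the padded positions from 195 to 15, because short texts are no longer batched together with long ones.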

The model is compiled using `XLA` by default, which explains why some calls are slower than subsequent ones, due to retracing. 

At the 1st call, the official `transformers` model will be downloaded and converted to my `keras` implementation of the `XLMRoberta` architecture. This conversion requires the `torch` library to be installed. Once done, the model will be saved in the regular `keras` format under the `pretrained_models/{name}` folder for subsequent loading.

In [None]:
from models.encoder.text_encoder import TextEncoder

model = TextEncoder(pretrained = 'BAAI/bge-m3', name = 'bge-m3')
print(model)

In [5]:
from tqdm import tqdm

vectors = model.predict(
    documents,
    batch_size = 8,

    save         = False,
    chunk_size   = 256,
    group_by    = ('filename', 'section_titles'),
    primary_key = ('filename', 'text'),
    missmatch_mode = 'ignore',
    
    tqdm = tqdm,
)
print(vectors)

100%|███████████████████████████████████████████████████████████████████████████████████| 16/16 [00:01<00:00,  9.63it/s]

- # data    : 127
- Dimension : 1024
- Columns (primary ('filename', 'text')) : ('section_titles', 'filename', 'text', 'chunks', 'type', 'section')






## 3. Search query

The final step is to encode the query, then compute the `cosine similarity` (or any other distance/similarity metric) between the embedded query and all embedded texts, and keep the top-k texts with the best scores ! All these steps are performed internally by the `search` method of the `DenseVectors` class ;)
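For intuition, the similarity computation and top-k selection can be sketched in a few lines of `numpy`. This is an illustration on made-up 3-dimensional vectors, not the actual implementation of `search` : the `cosine_top_k` function is hypothetical.

```python
import numpy as np

def cosine_top_k(query_vector, text_vectors, k = 3):
    """Returns the indices and scores of the `k` texts most similar to the query"""
    # normalizes the vectors so that the dot product equals the cosine similarity
    query = query_vector / np.linalg.norm(query_vector)
    texts = text_vectors / np.linalg.norm(text_vectors, axis = 1, keepdims = True)

    scores = texts @ query
    # sorts by decreasing score, and keeps the k best
    top = np.argsort(-scores)[: k]
    return top, scores[top]

# made-up embeddings : 4 texts of dimension 3
text_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0]
])
query_vector = np.array([2.0, 0.1, 0.0])

indices, scores = cosine_top_k(query_vector, text_vectors, k = 2)
print(indices, scores)
```

Here the query mostly points in the direction of the 1st text, so it is ranked first, followed by the 3rd text, which shares part of that direction.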

In the 1st example, we can observe that the best results are indeed related to `embeddings`; even better, the best retrieved passage correctly defines the notion of embeddings !

It is worth mentioning that the model will retrieve passages whether or not they are relevant, as it simply provides a score for each passage. Therefore, if no span in the provided texts is relevant to the query, it will return irrelevant spans. Nonetheless, as can be observed in the 2nd example, the scores for such irrelevant passages are lower than those of relevant ones (in the 1st example) ;)
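One possible way to exploit this score gap is to discard results below a minimum score. The sketch below is only an assumption about how this could be done : `filter_by_score` is a hypothetical helper (not part of `DenseVectors`), the results are plain dictionaries with made-up texts, and the `0.45` threshold is arbitrary and would have to be tuned for each model and corpus.

```python
def filter_by_score(results, min_score):
    """Hypothetical helper : keeps only the results with a score of at least `min_score`"""
    return [res for res in results if res['score'] >= min_score]

# made-up results; the 3rd score is purely illustrative
results = [
    {'text' : 'passage A', 'score' : 0.505},
    {'text' : 'passage B', 'score' : 0.423},
    {'text' : 'passage C', 'score' : 0.310}
]

kept = filter_by_score(results, min_score = 0.45)
print([res['text'] for res in kept])
```

A fixed threshold remains a heuristic : the score ranges of relevant and irrelevant passages may overlap, so it is safer to validate it on a few known queries first.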

### Example 1

In [7]:
query = 'What is an embedding ?'

res = vectors.search(query, k = 3)
print(res)

- # data    : 3
- Dimension : 1024
- Columns (primary ('filename', 'text')) : ('section_titles', 'filename', 'text', 'chunks', 'type', 'section', 'score')



In [8]:
for paragraph in res:
    print('Text from file `{filename}` - section {section_titles}\nScore : {score:.3f}\n{text}\n'.format(** paragraph))


Text from file `../text_to_speech/README.md` - section [':yum: Text To Speech (TTS)', 'Multi-speaker Text-To-Speech', 'Automatic voice cloning with the `SV2TTS` architecture', 'The basic intuition']
Score : 0.505
This model basically takes as input an audio sample (5-10 sec) from a speaker, and encodes it on a *d*-dimensional vector, named the `embedding`. This embedding aims to capture relevant information about the speaker's voice (e.g., `frequencies`, `rythm`, `pitch`, ...). 
2. This pre-trained `Speaker Encoder (SE)` is then used to encode the voice of the speaker to clone.
3. The produced embedding is then concatenated with the output of the `Tacotron-2` encoder part, such that the `Decoder` has access to both the encoded text and the speaker embedding.

The objective is that the `Decoder` will learn to use the `speaker embedding` to copy its prosody / intonation / ... to read the text with the voice of this speaker.



Text from file `../text_to_speech/README.md` - section [':yum

### Example 2 : irrelevant query

In [9]:
query = 'What is the meaning of life ?'

res = vectors.search(query, k = 3)
print(res)

- # data    : 3
- Dimension : 1024
- Columns (primary ('filename', 'text')) : ('section_titles', 'filename', 'text', 'chunks', 'type', 'section', 'score')



In [10]:
for paragraph in res:
    print('Text from file `{filename}` - section {section_titles}\nScore : {score:.3f}\n{text}\n'.format(** paragraph))


Text from file `../yui-mhcp/README.md` - section ['Covered topics', 'Contacts and licence', 'Terms of use']
Score : 0.423
- [x] Search text in audios / videos
- [x] Live transcription / subtitle generation : some models tend to be accurate enough for transcription, like the `Whisper` family of models !
- [x] Text-To-Speech logger : `logging`-based logger that reads your logs with `TTS` models
- [x] Optical Character Recognition (OCR) : this projects allows to detect text in an image, and performs OCR on the detected text

All topics are released in separate repositories to make it easier to learn / experiments with dedicated codes and ressources.

\* It is a demonstration code to show how to subclass `BaseModel`. I will add a dedicated repository later for general classification (text / image / ... ).

## Contacts and licence

Contacts :
- **Mail** : `yui-mhcp@tutanota.com`
- **Discord** : yui0732

### Terms of use



Text from file `../base_dl_project/README.md` - section [':yum: Base