# Task 2: Embedding for a similarity search using AI

## Explanation of Embedding and Similarity Search

### 1. **What is Embedding?**
Embedding is the process of converting textual data (words, sentences, or documents) into numerical vectors in a high-dimensional space. These vectors capture the semantic meaning of the text, allowing us to perform operations like similarity searches.

### Visual Representation:

Imagine a 3D space where each word is represented as a point. Words with similar meanings (e.g., "king" and "queen") are closer together, while unrelated words (e.g., "king" and "banana") are farther apart.

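Embedding isolated words loses their context, so for a document such as a lab report the comparison only makes sense when whole sentences (or chunks of text) are embedded.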
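To make the geometric picture concrete, the sketch below computes the cosine similarity between a few hand-picked 3-dimensional vectors. The coordinates are invented purely for illustration and are not the output of any model; real models such as `all-MiniLM-L6-v2` produce 384-dimensional vectors learned from data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (made-up coordinates, not model output)
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

for word in ("queen", "banana"):
    score = cosine_similarity(toy_vectors["king"], toy_vectors[word])
    print(f"king vs {word}: {score:.3f}")
```

With these toy values, "king" scores much higher against "queen" than against "banana", which is the behaviour a similarity search relies on.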


| **Library/Model**           | **Description**                                                                 | **Documentation Link**                                      |
|------------------------------|---------------------------------------------------------------------------------|------------------------------------------------------------|
| `sentence-transformers`      | Pre-trained models for creating embeddings for semantic similarity and search.  | [sentence-transformers Documentation](https://www.sbert.net/) |
| `FAISS`                      | Library for efficient similarity search and clustering of dense vectors.        | [FAISS Documentation](https://faiss.ai/)                   |
| `gensim`                     | Topic modeling and document similarity using word embeddings.                   | [gensim Documentation](https://radimrehurek.com/gensim/)    |
| `spaCy`                      | NLP library with support for word vectors and similarity comparisons.           | [spaCy Documentation](https://spacy.io/)                   |
| `transformers` (Hugging Face)| Provides pre-trained transformer models for embeddings and NLP tasks.           | [transformers Documentation](https://huggingface.co/docs/transformers/) |
| `OpenAI API`                 | Embeddings and other NLP tasks using OpenAI's hosted models.                    | [OpenAI API Documentation](https://platform.openai.com/docs/) |
| `TensorFlow Hub`             | Pre-trained models for embeddings and other machine learning tasks.             | [TensorFlow Hub Documentation](https://www.tensorflow.org/hub) |
| `Universal Sentence Encoder` | Pre-trained model for sentence-level embeddings.                                | [USE Documentation](https://tfhub.dev/google/collections/universal-sentence-encoder/1) |


## Method 1: Sentence-transformers

In [1]:
%pip install torch torchvision
%pip install transformers sentence-transformers


In [2]:
from sentence_transformers import SentenceTransformer
import json

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

with open("highlighted_output.txt", "r") as file:
    txt_doc_text = file.read()
    # Convert the text document into a list of words
    words = txt_doc_text.split(" ")
    
print(words)
# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(words)
print(embeddings.shape)
# Shape is (number of words, 384): all-MiniLM-L6-v2 produces 384-dimensional vectors
# 3. Load the JSON data (assumed to be the blood test questionnaire)
with open('questionnaire.json', 'r') as json_file:
    questionnaire_data = json.load(json_file)

# 4. Extract the question titles (the JSON root is a list of question objects)
question_titles = [question["questionTitle"] for question in questionnaire_data]
print(question_titles)
keywords = model.encode(question_titles)
# 5. Calculate the cosine similarity between every word and every question title
similarities = model.similarity(embeddings, keywords)
print(similarities)
# similarities has shape (number of words, number of questions)
# 6. Report every word/question pair whose similarity exceeds 0.5
for i in range(similarities.shape[0]):
    for j in range(similarities.shape[1]):
        if similarities[i][j] > 0.5:
            print(f"Similarity between word '{words[i]}' and question '{question_titles[j]}': {similarities[i][j]:.4f}")


['Copie', 'électronique\n ', '\nN°', 'FINESS', ':', '34', '3', '', '2\n7', 'av', 'du', '', 'de', '', '-', '', '-', '', '(:', , , '51', , '15', '7:', , , '53', '15', '44\n', '', '-', 'Biologiste(s)', 'Médical(aux)\nDocteur', ' ', ' \nCABINET', 'MEDICAL', '"LA', '"Madame', ' ', '\n250', '', 'DES', '\n', '', '', '(100)\nX', 'Demande', 'n°', '01/02/', '-LABO--TPEdité', 'le,', 'lundi', '1', 'février', '2021\nCopie', 'à', ':', 'Docteur', ' ', ' ,', 'DR\n', '', '\nCopie', 'à', ':', 'Docteur', ' ', ' ,', 'DR', '', '', '\nPatient', 'né(e)', ' ', 'le', '\nFSE', 'Tiers', 'payant', '', '-', '\nPrélèvements', 'effectués', 'par', 'le', 'laboratoire', 'le', '01/02/21', 'à', '10H27\nVos', 'résultats', 'sur', 'internet', ':', 'Accès', 'sécurisé,', 'rapide,', 'gratuit,', 'pratique,', 'écoresponsable\n1)', 'Communiquez', 'votre', 'mail', 'au', 'laboratoire\n2)', 'Recevez', 'un', 'email', 'dès', 'que', 'vos', 'résultats', 'sont', 'disponibles\n3)', 'Cliquez', 'sur', 'le', 'lien\nINFORMATION', 'COVID-19\nR

## Method 2: FAISS 

FAISS only performs the similarity search; it does not create the embeddings, so the vectors still come from `sentence-transformers`.
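As a rough mental model of what an exhaustive ("flat") L2 index does, the sketch below performs the same kind of search in plain Python. `flat_l2_search` is a hypothetical helper written for this illustration, not part of the FAISS API; like `IndexFlatL2`, it reports squared Euclidean distances.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(index_vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to query."""
    scored = sorted(
        (squared_l2(vec, query), i) for i, vec in enumerate(index_vectors)
    )
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Made-up 2-dimensional vectors standing in for real embeddings
stored = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
distances, indices = flat_l2_search(stored, query=[0.9, 0.1], k=2)
print(indices, distances)
```

FAISS does the same comparison far more efficiently and on whole batches of queries, but the inputs and outputs have the same meaning: a smaller distance means a closer match.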

In [None]:
%pip install faiss-cpu
import faiss
print(faiss.__version__)  # Prints the installed version ('1.11.0' when this was written)

In [4]:
import faiss
import json
import re
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Read the document and split it into sentences at ., ! or ? followed by whitespace
with open("highlighted_output.txt", "r") as file:
    txt_doc_text = file.read()
sentences = re.split(r'(?<=[.!?])\s+', txt_doc_text)
print(sentences)

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)  # (number of sentences, 384)

# 3. Load the JSON data (assumed to be the blood test questionnaire)
with open('questionnaire.json', 'r') as json_file:
    questionnaire_data = json.load(json_file)

# 4. Extract the question titles (the JSON root is a list of question objects)
question_titles = [question["questionTitle"] for question in questionnaire_data]
print(question_titles)
keywords = model.encode(question_titles)

# 5. Create a FAISS index
dimension = embeddings.shape[1]  # Dimension of the embeddings
index = faiss.IndexFlatL2(dimension)  # Exact (brute-force) L2 distance index
index.add(embeddings)  # Add the sentence embeddings to the index

# 6. Search for the nearest neighbors of each question
k = 5  # Number of nearest neighbors to search for
D, I = index.search(keywords, k)  # D: distances, I: indices of nearest neighbors

# 7. Print the results
for i in range(len(question_titles)):
    print(f"Question: {question_titles[i]}")
    for j in range(k):
        if 0 <= I[i][j] < len(sentences):  # FAISS returns -1 when fewer than k results exist
            print(f"  Nearest sentence: {sentences[I[i][j]]!r}, Distance: {D[i][j]:.4f}")
    print()  # Blank line for better readability

['Copie électronique\n  \nN° FINESS : \n -  -  (:  7: \n  - Biologiste(s) Médical(aux)\nDocteur    \nCABINET MEDICAL " "Madame   \n\n   (100)\nX Demande n° 01/02/ -LABO--TPEdité le, lundi 1 février 2021\nCopie à : Docteur    , DR\n\nCopie à : Docteur    , DR \nPatient né(e)   le \nFSE Tiers payant  - \nPrélèvements effectués par le laboratoire le 01/02/21 à 10H27\nVos résultats sur internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable\n1) Communiquez votre mail au laboratoire\n2) Recevez un email dès que vos résultats sont disponibles\n3) Cliquez sur le lien\nINFORMATION COVID-19\nRendez-vous sur notre site internet dédié pour connaître notre organisation : https:// .fr/depistage-covid-19/\nHématologie\nValeurs de référence\nAntériorités\n✔ Hémogramme\n(Sang total - Variation d\'impédance, photométrie, cytométrie en flux)  - \n\n4,97\nHématies ........................................', 'Hémoglobine ....................................4,94 Téra/L\n13,6 g/dL3,80 à 5,9011,5
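The output above suggests that punctuation-based splitting works poorly on a lab report, where dotted leaders separate each label from its value. The sketch below applies the same regular expression to two hand-written samples (both invented for illustration, loosely modelled on the report layout) and compares it with a simple line-based split; `split_lines` is a hypothetical alternative, not something the notebook uses.

```python
import re

# Same pattern as in the FAISS cell: split after ., ! or ? followed by whitespace
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')


def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return [part for part in SENTENCE_BOUNDARY.split(text) if part.strip()]


def split_lines(text):
    """Alternative for tabular reports: treat every non-empty line as one unit."""
    return [line.strip() for line in text.splitlines() if line.strip()]


prose = "Results are ready. Click the link! Any questions?"
report = "Hématies ........ 4,94 Téra/L\nHémoglobine ........ 13,6 g/dL"

print(split_sentences(prose))   # three clean sentences
print(split_sentences(report))  # labels end up separated from their values
print(split_lines(report))      # one entry per report line
```

On the report sample, the punctuation rule cuts right after the dotted leader, so a label and its measurement land in different chunks, whereas the line-based split keeps them together.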

## Method 3: Universal Sentence Encoder
TensorFlow is giving me some issues in this environment; I will try again later.

>**WARNING**: This snippet won't work here because of conflicting installations. I recommend that you create another project folder with its own virtual environment, then copy this code into a notebook there together with [highlighted_output.txt](highlighted_output.txt) and [questionnaire.json](questionnaire.json). Once there, this snippet should work without any issues. The project is no longer active, so it may not be worth considering.

Do not run these installations in this notebook, or the rest of this file will probably stop working.

```python
%pip install tensorflow_hub
%pip install numpy
```

This is the code that you should copy and paste in another project folder.

```python
import tensorflow_hub as hub
import numpy as np
import json

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
  return model(input)
with open("highlighted_output.txt", "r") as file:
    txt_doc_text = file.read()
    # Group the text into chunks of 10 space-separated words to approximate sentences
    words = txt_doc_text.split(" ")
    sentences = [' '.join(words[i:i+10]) for i in range(0, len(words), 10)]
print(sentences)

embeddings = embed(sentences)

# 1. Load your JSON data (assuming it's the blood test questionnaire)
with open('questionnaire.json', 'r') as json_file:
    questionnaire_data = json.load(json_file)

# 2. Extract question titles (FIXED: iterate directly over the list)
question_titles = [question["questionTitle"] for question in questionnaire_data]
print(question_titles)
keywords = embed(question_titles)

# 3. Calculate the embedding similarities
similarities = np.inner(embeddings, keywords)
print (similarities.shape)
k = 5

# Get the top-k most similar sentences for each question
top_k_indices = np.argsort(similarities, axis=0)[-k:][::-1]  # shape: (k, num_questions)
top_k_scores = np.take_along_axis(similarities, top_k_indices, axis=0)

# Print the results
for q_idx, title in enumerate(question_titles):
    print(f"Question: {title}")
    for rank in range(k):
        sent_idx = top_k_indices[rank, q_idx]
        score = top_k_scores[rank, q_idx]
        if sent_idx < len(sentences):
            print(f"  Nearest sentence: {sentences[sent_idx]!r}, Similarity: {score:.4f}")
    print()  # Newline for better readability
```

## Method 4: Gensim

In [None]:
%pip install scipy
%pip install gensim
%pip install scikit-learn

Depending on your system, you may be asked to install a compiler.


In [None]:
import gensim.downloader as api
import json
import re
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load word2vec model
wv = api.load('word2vec-google-news-300')

with open("highlighted_output.txt", "r") as file:
    txt_doc_text = file.read()
    # Keep only ASCII letters, digits and whitespace (note: this also strips accented characters such as 'é')
    txt_doc_text = re.sub(r'[^a-zA-Z0-9\s]', '', txt_doc_text)
    words = [word for word in txt_doc_text.split() if word in wv]  # Only keep words that exist in the vocabulary (repeated words are kept)

# Get embeddings for words in the document
doc_embeddings = np.array([wv[word] for word in words])


with open('questionnaire.json', 'r') as json_file:
    questionnaire_data = json.load(json_file)

# Extract question titles
question_titles = [question["questionTitle"] for question in questionnaire_data]

# Process each question
for question in question_titles:
    # Split question into words and get valid embeddings
    question_words = [word for word in re.sub(r'[^a-zA-Z0-9\s]', '', question).split() if word in wv]
    
    if not question_words:  # Skip if no valid words found
        print(f"Question: {question} - No valid words found in vocabulary")
        continue
        
    # Average the word vectors for the question
    question_embedding = np.mean([wv[word] for word in question_words], axis=0)
    
    # Calculate cosine similarities
    similarities = cosine_similarity([question_embedding], doc_embeddings)[0]
    
    # Get top 5 most similar words from document
    top_indices = np.argsort(similarities)[-5:][::-1]
    
    print(f"Question: {question}")
    for idx in top_indices:
        print(f"  Similar word: {words[idx]} (similarity: {similarities[idx]:.4f})")
    print()

Question: Date of Blood Test
  Similar word: Date (similarity: 0.6027)
  Similar word: test (similarity: 0.3969)
  Similar word: test (similarity: 0.3969)
  Similar word: test (similarity: 0.3969)
  Similar word: test (similarity: 0.3969)

Question: Hemoglobin (Hb) Level
  Similar word: SGOT (similarity: 0.6210)
  Similar word: SGPT (similarity: 0.5785)
  Similar word: TSH (similarity: 0.5227)
  Similar word: Thyroxine (similarity: 0.5198)
  Similar word: LDL (similarity: 0.4935)

Question: White Blood Cell (WBC) Count
  Similar word: Lymphocytes (similarity: 0.4249)
  Similar word: Monocytes (similarity: 0.3786)
  Similar word: Thyroxine (similarity: 0.2854)
  Similar word: cholesterol (similarity: 0.2827)
  Similar word: SGOT (similarity: 0.2814)

Question: Platelet Count
  Similar word: SGOT (similarity: 0.4548)
  Similar word: Lymphocytes (similarity: 0.4430)
  Similar word: Monocytes (similarity: 0.4176)
  Similar word: SGPT (similarity: 0.4090)
  Similar word: MDRD (similarity: 0
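In this method each question is represented by the average of its word vectors, and words missing from the vocabulary are skipped. The sketch below shows that averaging step with a made-up 3-dimensional vocabulary (`toy_wv` is hypothetical and stands in for the 300-dimensional word2vec model).

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


# Hypothetical 3-dimensional stand-ins for real word2vec vectors
toy_wv = {
    "platelet": [1.0, 0.0, 2.0],
    "count": [3.0, 2.0, 0.0],
}

# "level" is not in the toy vocabulary, so it is dropped before averaging
question_words = [w for w in "platelet count level".split() if w in toy_wv]
question_embedding = mean_vector([toy_wv[w] for w in question_words])
print(question_words, question_embedding)
```

Averaging ignores word order and gives every word the same weight, which may partly explain why unrelated abbreviations can rank highly in the results above.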

## Method 5: spaCy

In [None]:
%pip install -U pip setuptools wheel
%pip install -U spacy


>**WARNING**: Remember to also download the language model from the terminal (for example `python -m spacy download fr_core_news_sm`). See the spaCy installation [guide](https://spacy.io/usage).

In [5]:
import json
import re

import spacy
from sklearn.metrics.pairwise import cosine_similarity

# Load the small French pipeline once. Note: the "sm" models ship without static
# word vectors, so .vector falls back to context-sensitive tensors.
nlp = spacy.load("fr_core_news_sm")

with open("highlighted_output.txt", "r") as file:
    txt_doc_text = file.read()
    # Keep only ASCII letters, digits and whitespace (this also strips accented characters)
    txt_doc_text = re.sub(r'[^a-zA-Z0-9\s]', '', txt_doc_text)

with open('questionnaire.json', 'r') as json_file:
    questionnaire_data = json.load(json_file)

# Run the pipeline on the document; each token exposes a .vector
embeddings = nlp(txt_doc_text)

for question in questionnaire_data:
    question_embedding = nlp(question["questionTitle"])

    similarities = cosine_similarity( [question_embedding.vector], [word.vector for word in embeddings])
    top_indices = similarities[0].argsort()[-5:][::-1]  # Get top 5 most similar words
    print(f"question:{question['questionTitle']}")
    for idx in top_indices:
        print(f"Similar word:  {embeddings[idx].text}  (similarity: {similarities[0][idx]:.4f})")





question:Date of Blood Test
Similar word:  Vitamine  (similarity: 0.6415)
Similar word:  Vitamine  (similarity: 0.6415)
Similar word:  Vitamine  (similarity: 0.6415)
Similar word:  Filtration  (similarity: 0.6285)
Similar word:  Page  (similarity: 0.6193)
question:Hemoglobin (Hb) Level
Similar word:    (similarity: 0.6331)
Similar word:    (similarity: 0.6321)
Similar word:  HAS  (similarity: 0.6271)
Similar word:  HAS  (similarity: 0.6128)
Similar word:  Cholestrol  (similarity: 0.6117)
question:White Blood Cell (WBC) Count
Similar word:    (similarity: 0.6166)
Similar word:    (similarity: 0.6156)
Similar word:  Calcul  (similarity: 0.6027)
Similar word:  HAS  (similarity: 0.6011)
Similar word:  HAS  (similarity: 0.5935)
question:Platelet Count
Similar word:    (similarity: 0.6867)
Similar word:    (similarity: 0.6710)
Similar word:    (similarity: 0.6701)
Similar word:    (similarity: 0.6700)
Similar word:    (similarity: 0.6693)
question:Blood Gose Level
Similar word:  Docteur  (si
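The output above contains whitespace-only tokens and the same word repeated several times, which hides the informative matches. One possible post-processing step, sketched here with a hypothetical helper `top_k_unique` and made-up scores, is to skip blanks and duplicates before taking the top k.

```python
def top_k_unique(tokens, scores, k=5):
    """Return up to k (token, score) pairs, best first, skipping blanks and repeats."""
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
    seen = set()
    results = []
    for token, score in ranked:
        text = token.strip()
        if not text or text in seen:
            continue  # ignore whitespace-only tokens and tokens already reported
        seen.add(text)
        results.append((text, score))
        if len(results) == k:
            break
    return results


# Made-up tokens and scores that mimic the shape of the spaCy output
tokens = ["Vitamine", "  ", "Vitamine", "Filtration", "\n", "Page"]
scores = [0.64, 0.69, 0.64, 0.63, 0.67, 0.62]
print(top_k_unique(tokens, scores, k=3))
```

In the spaCy cell, the same idea could be applied by passing each token's `.text` together with its similarity score.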

## Method 6: Tensorflow_hub + Keras
TensorFlow is giving me some issues in this environment; I will try again later.

>**WARNING**: This snippet won't work here because of conflicting installations. I recommend that you create another project folder with its own virtual environment, then copy this code into a notebook there together with [highlighted_output.txt](highlighted_output.txt) and [questionnaire.json](questionnaire.json). Once there, this snippet should work without any issues.



```python
%pip install tensorflow-hub
%pip install scikit-learn
%pip install numpy
```

Do not run this code in this notebook.
```python
import tensorflow_hub as hub
import json
import re
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load the model (try Universal Sentence Encoder for better results)
hub_url = "https://tfhub.dev/google/universal-sentence-encoder/4"  # Better for phrases
embed = hub.load(hub_url)

# Load and preprocess the document
with open("highlighted_output.txt", "r") as file:
    text = file.read()

# Better splitting: Split by double newlines or section headers
chunks = re.split(r'\n\s*\n|\b(?:Date|Patient|Result|Test):', text)  # Adjust based on your doc structure
chunks = [chunk.strip() for chunk in chunks if chunk.strip() and len(chunk.split()) > 3]  # Remove small fragments

# Generate embeddings for each chunk
chunk_embeddings = embed(chunks).numpy()

# Load questionnaire
with open('questionnaire.json', 'r') as f:
    questions = json.load(f)

# Compare each question to document chunks
for q in questions:
    question_text = q["questionTitle"]
    q_embedding = embed([question_text]).numpy()
    
    similarities = cosine_similarity(q_embedding, chunk_embeddings)
    top_idx = np.argmax(similarities[0])  # Get the single best match
    
    print(f"\nQuestion: {question_text}")
    print(f"Best Match (Similarity: {similarities[0][top_idx]:.3f}):")
    print(f"  {chunks[top_idx][:150]}...")  # Print first 150 chars of best chunk
```

***The results were very poor in any case, so this method is not really worth testing.***

# Paid Embedding Methods


## Method 7: Mistral-Embed 

In [None]:
%pip install mistralai

In [6]:
import json
import numpy as np
from collections import defaultdict
from mistralai import Mistral
import os

# Initialize Mistral client
api_key = os.environ.get("MISTRAL_API_KEY")
client = Mistral(api_key=api_key)
model = "mistral-embed"

# 1. Load and chunk the document text
with open('highlighted_output.txt', 'r') as file:
    text = file.read()

chunk_size = 100
words = text.split()
chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# 2. Get embeddings for document chunks
print("Generating document embeddings...")
word_embeddings = []
for chunk in chunks:
    response = client.embeddings.create(model=model, inputs=[chunk])
    word_embeddings.append(np.array(response.data[0].embedding))
word_embeddings = np.vstack(word_embeddings)  # Convert to numpy array

# 3. Load questions
with open('questionnaire.json', 'r') as f:
    questionnaire_data = json.load(f)
question_titles = [q["questionTitle"] for q in questionnaire_data]

# 4. Get embeddings for questions
print("Generating question embeddings...")
question_embeddings_response = client.embeddings.create(
    model=model,
    inputs=question_titles,
)
question_embeddings = np.array([d.embedding for d in question_embeddings_response.data])

# 5. Find top 5 matches for each question
print("Finding matches...")
top_matches = defaultdict(list)

for q_idx, q_emb in enumerate(question_embeddings):
    # Calculate cosine similarity
    norms = np.linalg.norm(word_embeddings, axis=1) * np.linalg.norm(q_emb)
    sims = np.dot(word_embeddings, q_emb) / (norms + 1e-8)
    
    # Get top 5 indices
    top_indices = np.argsort(sims)[-5:][::-1]
    
    # Store results
    for idx in top_indices:
        chunk_start = idx * chunk_size
        chunk_end = min((idx + 1) * chunk_size, len(words))
        matched_text = " ".join(words[chunk_start:chunk_end])
        similarity = sims[idx]
        top_matches[question_titles[q_idx]].append({
            "similarity": float(similarity),
            "text": matched_text,
            "chunk_index": idx
        })

# 6. Print results
for question, matches in top_matches.items():
    print(f"\nQuestion: {question}")
    print("Top 5 matches:")
    for i, match in enumerate(matches, 1):
        print(f"{i}. [Similarity: {match['similarity']:.3f}]")
        print(f"   Text: {match['text']}")
        print(f"   Chunk index: {match['chunk_index']}")

Generating document embeddings...
Generating question embeddings...
Finding matches...

Question: Date of Blood Test
Top 5 matches:
1. [Similarity: 0.798]
   Text: 27,4 pg 33,0 g/dL 13,734,0 à 53,041,8 76,0 à 96,084,0 24,4 à 34,027,7 31,0 à 36,032,9 10 à 1614,1 Leucocytes .......................................6,5 Giga/L3,8 à 11,07,8 Polynucléaires neutrophiles .. . Polynucléaires éosinophiles ... Polynucléaires basophiles ..... Lymphocytes ................... Monocytes ....................... 8,4 mmol/L 58,1 % 3,1 % 0,9 % 32,4 % 5,5 % ✔ Plaquettes ........................................... 3,780 G/L 0,200 G/L 0,060 G/L 2,110 G/L 0,360 G/L (Sang total - Variation d'impédance - Beckman Coulter)  -  4,240 0,200 0,00 à 0,110,060 1,00 à 4,802,870 0,15 à 1,000,420 228 Giga/L150 à 445 213 9,7 fLInf. à 11,010,2  V.P.M. .............................................. 1,40 à 7,700,02
   Chunk index: 2
2. [Similarity: 0.783]
   Text: Antériorités Hormones de la fertilité Date des dernières règle

The first few matches are a decent approximation. The results are far from accurate but could do the job; pricing still needs to be checked. Hopefully other options can do better.
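The Mistral cell splits the document into consecutive 100-word chunks and later maps a chunk index back to its span of words. The sketch below shows that bookkeeping on a tiny made-up sentence with a chunk size of 3; `chunk_words` and `chunk_span` are hypothetical helper names introduced for this illustration.

```python
def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


def chunk_span(chunk_index, chunk_size, total_words):
    """Word offsets (start, end) covered by a chunk, with end exclusive."""
    start = chunk_index * chunk_size
    return start, min(start + chunk_size, total_words)


sample = "one two three four five six seven"
chunks = chunk_words(sample, chunk_size=3)
print(chunks)
print(chunk_span(2, 3, len(sample.split())))
```

The chunk size is a trade-off: larger chunks mean fewer API calls, but each match then points to a longer passage, so it is harder to tell which value in the chunk answers the question.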

## spaCy NER benchmark

```bash
python -m spacy download fr_core_news_sm
```


In [None]:
%pip install -U pip setuptools wheel
%pip install spacy



In [None]:
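import spacy
nlp = spacy.load("fr_core_news_sm")
# meta_OP is assumed to be a list of text lines defined in an earlier cell (not shown here)
meta_OP = ' '.join(meta_OP)
doc = nlp(meta_OP)
# Print each recognized entity with its predicted label
for ent in doc.ents:
    print(ent.text, ent.label_)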

  ORG

N LOC
 PER
de  PER
  PER
CABINET MISC
MEDICAL ORG
 ORG
Copie LOC
Docteur   ORG
  ORG
DR  MISC
X Demande MISC
 MISC
 MISC
 MISC
FSE Tiers ORG
01/02/21 LOC
Edité MISC
Copie LOC
Docteur   ORG
  ORG
DR 

 MISC
Communiquez PER
Recevez PER
Cliquez LOC
INFORMATION MISC
Rendez LOC
Hématologie
Hémogramme PER
Sang total LOC
Antériorités
 PER
Téra LOC
Hémoglobine LOC
Hématocrite LOC
V.G.M. LOC
fL MISC
T.C.M.H. MISC
Index d'anisocytose MISC
Leucocytes ORG
Giga LOC
Polynucléaires neutrophiles ORG
G LOC
Polynucléaires éosinophiles ORG
G LOC
Polynucléaires basophiles ORG
G LOC
Lymphocytes LOC
G LOC
Monocytes LOC
G LOC
Plaquettes ............................................ 228 Giga/L 150 MISC
Sang total LOC
V.P.M. LOC
Inf LOC
Code de la Santé Publique MISC
LABORATOIRE DE BIOLOGIE MÉDICALE ORG
rue  LOC
 ORG
Laboratoires de biologie médicale ORG
www. .fr LOC
MEMBRE ORG
Demande n° 01/02/21-9- MISC
 PER
Sang total LOC
Protéines Sériques LOC
Vitamine B7 MISC
Vitamine B8 PER
Vitamine H PER
ABC ORG
F

## Meta Llama-3.3-70B-Versatile NER Benchmark

In [None]:
%pip install groq

In [None]:
import os


from groq import Groq

print("before OCR:")
with open('task2_prompts/prompting_tesseract.txt', 'r', encoding='utf-8') as file:
    print(file.read())

client = Groq(

    # This is the default and can be omitted

    api_key=os.environ.get("GROQ_API_KEY"),

)


chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": (
                "You are a medical document processing expert performing named entity recognition. "
                "Extract and tag the following entities from the clinical text:\n"
                "1. <MEDICATION> - Drug names, supplements, vaccines\n"
                "2. <DOSAGE> - Amounts, frequencies, durations\n"
                "3. <CONDITION> - Diagnoses, symptoms, complaints\n"
                "4. <PROCEDURE> - Tests, surgeries, interventions\n"
                "5. <LAB_VALUE> - Results with units/reference ranges\n"
                "6. <BODY_PART> - Anatomical locations\n"
                "7. <BIOMARKER> - Medical indicators (cholesterol, HbA1c, blood pressure)\n"
                "8. <RISK_FACTOR> - Health risks (smoking, family history)\n\n"
                "Rules:\n"
                "- Preserve original text exactly (no corrections)\n"
                "- Tag full phrases (e.g., '<BIOMARKER>total cholesterol</BIOMARKER> <LAB_VALUE>210 mg/dL</LAB_VALUE>')\n"
                "- Use <BIOMARKER> for medical indicators that aren't conditions\n"
                "- Use <RISK_FACTOR> for behavioral/environmental factors\n"
                "- Include negations (e.g., '<CONDITION>no chest pain</CONDITION>')\n"
                "- For uncertain matches, use <POSSIBLE_...> tags\n"
                "- Return only the annotated text with no additional commentary"
            )
        },
        {
            "role": "user",
            "content": (
                "Perform NER on this OCR-extracted medical document:\n\n"
                "=== INPUT TEXT ===\n"
                f"{open('task2_prompts/prompting_tesseract.txt', encoding='utf-8').read()}\n"
                "=== END INPUT ==="
            )
        }
    ],
    model="llama-3.3-70b-versatile",
    temperature=0.2,  # Reduce randomness for factual accuracy
    max_tokens=4096   # Upper bound on the length of the generated answer
)



print(chat_completion.choices[0].message.content)

before OCR:
Copie électronique LABORATOIRE DE BIOLOGIE MEDICALE C LABO    N° FINESS :   -  -  @:  &:    - Biologiste(s) Médical(aux) Docteur     Madame    CABINET MEDICAL " "      (100) Copie a : Docteur    , DR  X Demande n° 01/02/ -LABO--TP Edité le, lundi 1 février 2021 Copie a : Docteur    , DR Patient né(e)   le   FSE Tiers payant  -  Prélevements effectués par le laboratoire le 01/02/21 a 10H27 Vos résultats sur internet : Accés sécurisé, rapide, gratuit, pratique, @coresponsable 1) Communiquez votre mail au laboratoire 2) Recevez un email dés que vos résultats sont disponibles 3) Cliquez sur le lien INFORMATION COVID-19 Rendez-vous sur notre site internet dédié pour connaitre notre organisation : https:// .fr/depistage-covid-19/ ia = Hematologie Valeurs de référence Antériorités v Hémogramme (Sang total - Variation d'impédance, photométrie, cytométrie en flux)  -   HEMatieS .......cccccecececeeeeeeeeeeeeeeeeeeeees 4,94 Téra/L 3,80 a 5,90 4,97 HEMOGIODING ........cceeeeeeeeeeeeee

Good, promising output.
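Because the prompt asks the model to return the text annotated with tags such as `<BIOMARKER>...</BIOMARKER>`, the response can be turned into structured data with a small parser. The sketch below is one possible approach; the sample string is written by hand to match the tag format in the prompt and is not actual model output.

```python
import re

# Matches <LABEL>text</LABEL>; the closing tag must repeat the opening label
TAG_PATTERN = re.compile(r"<([A-Z_]+)>(.*?)</\1>", re.DOTALL)


def extract_entities(annotated_text):
    """Return (label, text) pairs in document order."""
    return [
        (match.group(1), match.group(2).strip())
        for match in TAG_PATTERN.finditer(annotated_text)
    ]


# Hypothetical annotated text, hand-written for illustration
sample = (
    "<BIOMARKER>Hémoglobine</BIOMARKER> <LAB_VALUE>13,6 g/dL</LAB_VALUE> "
    "<POSSIBLE_PROCEDURE>Hémogramme</POSSIBLE_PROCEDURE>"
)
for label, text in extract_entities(sample):
    print(f"{label}: {text}")
```

This simple pattern does not handle nested tags, and a model may occasionally emit malformed tags, so the extracted pairs should be validated before they are used.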

## Google gemma2-9b-it NER Benchmark

In [None]:
%pip install groq

Note: you may need to restart the kernel to use updated packages.


In [12]:
import os


from groq import Groq

print("before OCR:")
with open('task2_prompts/prompting_tesseract.txt', 'r', encoding='utf-8') as file:
    print(file.read())

client = Groq(

    # This is the default and can be omitted

    api_key=os.environ.get("GROQ_API_KEY"),

)


chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": (
                "You are a medical document processing expert performing named entity recognition. "
                "Extract and tag the following entities from the clinical text:\n"
                "1. <MEDICATION> - Drug names, supplements, vaccines\n"
                "2. <DOSAGE> - Amounts, frequencies, durations\n"
                "3. <CONDITION> - Diagnoses, symptoms, complaints\n"
                "4. <PROCEDURE> - Tests, surgeries, interventions\n"
                "5. <LAB_VALUE> - Results with units/reference ranges\n"
                "6. <BODY_PART> - Anatomical locations\n"
                "7. <BIOMARKER> - Medical indicators (cholesterol, HbA1c, blood pressure)\n"
                "8. <RISK_FACTOR> - Health risks (smoking, family history)\n\n"
                "Rules:\n"
                "- Preserve original text exactly (no corrections)\n"
                "- Tag full phrases (e.g., '<BIOMARKER>total cholesterol</BIOMARKER> <LAB_VALUE>210 mg/dL</LAB_VALUE>')\n"
                "- Use <BIOMARKER> for medical indicators that aren't conditions\n"
                "- Use <RISK_FACTOR> for behavioral/environmental factors\n"
                "- Include negations (e.g., '<CONDITION>no chest pain</CONDITION>')\n"
                "- For uncertain matches, use <POSSIBLE_...> tags\n"
                "- Return only the annotated text with no additional commentary"
            )
        },
        {
            "role": "user",
            "content": (
                "Perform NER on this OCR-extracted medical document:\n\n"
                "=== INPUT TEXT ===\n"
                f"{open('task2_prompts/prompting_tesseract.txt', encoding='utf-8').read()}\n"
                "=== END INPUT ==="
            )
        }
    ],
    model="gemma2-9b-it",
    temperature=0.2,  # Reduce randomness for factual accuracy
    max_tokens=4096   # Upper bound on the length of the generated answer
)



print(chat_completion.choices[0].message.content)


before OCR:
Copie électronique LABORATOIRE DE BIOLOGIE MEDICALE C LABO    N° FINESS :   -  -  @:  &:    - Biologiste(s) Médical(aux) Docteur     Madame    CABINET MEDICAL " "      (100) Copie a : Docteur    , DR  X Demande n° 01/02/ -LABO--TP Edité le, lundi 1 février 2021 Copie a : Docteur    , DR Patient né(e)   le   FSE Tiers payant  -  Prélevements effectués par le laboratoire le 01/02/21 a 10H27 Vos résultats sur internet : Accés sécurisé, rapide, gratuit, pratique, @coresponsable 1) Communiquez votre mail au laboratoire 2) Recevez un email dés que vos résultats sont disponibles 3) Cliquez sur le lien INFORMATION COVID-19 Rendez-vous sur notre site internet dédié pour connaitre notre organisation : https:// .fr/depistage-covid-19/ ia = Hematologie Valeurs de référence Antériorités v Hémogramme (Sang total - Variation d'impédance, photométrie, cytométrie en flux)  -   HEMatieS .......cccccecececeeeeeeeeeeeeeeeeeeeees 4,94 Téra/L 3,80 a 5,90 4,97 HEMOGIODING ........cceeeeeeeeeeeeee

Entity discrimination is imperfect, but the output is good for the number of parameters; 9B is not much.

## deepseek-r1-distill-llama-70b NER Benchmark

In [None]:
%pip install groq

Note: you may need to restart the kernel to use updated packages.


In [9]:
import os

from groq import Groq

# Show the OCR-extracted text before it is sent to the model
print("before OCR:")
with open('task2_prompts/prompting_tesseract.txt', 'r', encoding='utf-8') as file:
    print(file.read())

client = Groq(
    # This is the default and can be omitted
    api_key=os.environ.get("GROQ_API_KEY"),
)


chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": (
                "You are a medical document processing expert performing named entity recognition. "
                "Extract and tag the following entities from the clinical text:\n"
                "1. <MEDICATION> - Drug names, supplements, vaccines\n"
                "2. <DOSAGE> - Amounts, frequencies, durations\n"
                "3. <CONDITION> - Diagnoses, symptoms, complaints\n"
                "4. <PROCEDURE> - Tests, surgeries, interventions\n"
                "5. <LAB_VALUE> - Results with units/reference ranges\n"
                "6. <BODY_PART> - Anatomical locations\n"
                "7. <BIOMARKER> - Medical indicators (cholesterol, HbA1c, blood pressure)\n"
                "8. <RISK_FACTOR> - Health risks (smoking, family history)\n\n"
                "Rules:\n"
                "- Preserve original text exactly (no corrections)\n"
                "- Tag full phrases (e.g., '<BIOMARKER>total cholesterol</BIOMARKER> <LAB_VALUE>210 mg/dL</LAB_VALUE>')\n"
                "- Use <BIOMARKER> for medical indicators that aren't conditions\n"
                "- Use <RISK_FACTOR> for behavioral/environmental factors\n"
                "- Include negations (e.g., '<CONDITION>no chest pain</CONDITION>')\n"
                "- For uncertain matches, use <POSSIBLE_...> tags\n"
                "- Return only the annotated text with no additional commentary"
            )
        },
        {
            "role": "user",
            "content": (
                "Perform NER on this OCR-extracted medical document:\n\n"
                "=== INPUT TEXT ===\n"
                f"{open('task2_prompts/prompting_tesseract.txt', encoding='utf-8').read()}\n"
                "=== END INPUT ==="
            )
        }
    ],
    model="deepseek-r1-distill-llama-70b",
    temperature=0.2,  # Reduce randomness for factual accuracy
    max_tokens=4096   # Ensure long documents are fully processed
)



print(chat_completion.choices[0].message.content)


before OCR:
Copie électronique LABORATOIRE DE BIOLOGIE MEDICALE C LABO    N° FINESS :   -  -  @:  &:    - Biologiste(s) Médical(aux) Docteur     Madame    CABINET MEDICAL " "      (100) Copie a : Docteur    , DR  X Demande n° 01/02/ -LABO--TP Edité le, lundi 1 février 2021 Copie a : Docteur    , DR Patient né(e)   le   FSE Tiers payant  -  Prélevements effectués par le laboratoire le 01/02/21 a 10H27 Vos résultats sur internet : Accés sécurisé, rapide, gratuit, pratique, @coresponsable 1) Communiquez votre mail au laboratoire 2) Recevez un email dés que vos résultats sont disponibles 3) Cliquez sur le lien INFORMATION COVID-19 Rendez-vous sur notre site internet dédié pour connaitre notre organisation : https:// .fr/depistage-covid-19/ ia = Hematologie Valeurs de référence Antériorités v Hémogramme (Sang total - Variation d'impédance, photométrie, cytométrie en flux)  -   HEMatieS .......cccccecececeeeeeeeeeeeeeeeeeeeees 4,94 Téra/L 3,80 a 5,90 4,97 HEMOGIODING ........cceeeeeeeeeeeeee

The model did not respect the rules in the prompt: it returned too much additional text instead of only the annotated document.
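Whether a model respected the "preserve original text exactly" and "no additional commentary" rules can be checked automatically. The following is a minimal sketch, assuming the output only uses upper-case tags like the ones listed in the prompt; `strip_entity_tags` and `preserves_source` are hypothetical helper names, not part of the notebook.

```python
import re

# Matches opening and closing entity tags such as <LAB_VALUE> or </POSSIBLE_CONDITION>
TAG_PATTERN = re.compile(r"</?[A-Z_]+>")


def strip_entity_tags(annotated: str) -> str:
    """Remove entity tags, leaving the text the model was asked to preserve."""
    return TAG_PATTERN.sub("", annotated)


def preserves_source(source: str, annotated: str) -> bool:
    """Return True if the annotated output equals the source once tags are removed.

    Whitespace is normalized first, because models often reflow line breaks.
    """
    def normalize(value: str) -> str:
        return " ".join(value.split())

    return normalize(strip_entity_tags(annotated)) == normalize(source)


source = "Hématies 4,94 Téra/L"
good = "<BIOMARKER>Hématies</BIOMARKER> <LAB_VALUE>4,94 Téra/L</LAB_VALUE>"
bad = "Here is the annotated text: " + good  # extra commentary breaks the rule

print(preserves_source(source, good))  # True
print(preserves_source(source, bad))   # False
```

A check like this only detects added or altered text; it says nothing about whether the tags themselves are correct.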

## OpenAI Embeddings text-embedding-3-small Benchmark

In [28]:
%pip install openai
%pip install faiss-cpu

In [39]:
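import json

with open('highlighted_output.txt', 'r', encoding='utf-8') as file:
    text = file.read()
# Split the document into words
text = text.replace('\n', ' ').split(' ')

string_size = 100  # Words per chunk
print(len(text) // string_size)

# Group the words into chunks of `string_size`; the last chunk holds the remainder.
# Stepping through the list avoids an empty trailing chunk, which the embeddings API rejects.
# NOTE: `input` shadows the Python builtin; the name is kept because later cells use it.
input = [
    ' '.join(text[start:start + string_size])
    for start in range(0, len(text), string_size)
]

with open('questionnaire.json', 'r', encoding='utf-8') as json_file:
    questionnaire_data = json.load(json_file)

# Extract question titles
question_titles = [question["questionTitle"] for question in questionnaire_data]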

19
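The same fixed-size word chunking is repeated for every embedding benchmark in this notebook, so it can be factored into a helper. This is only a sketch: `chunk_words` and its `overlap` parameter are hypothetical and not part of the notebook. With `overlap=0` it yields the same complete and remainder chunks as the cell above, and a non-zero overlap is one possible way to keep a label and its value together when they fall on a chunk boundary.

```python
def chunk_words(text: str, size: int = 100, overlap: int = 0) -> list[str]:
    """Split text into chunks of `size` words, optionally sharing `overlap` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not text.strip():
        return []

    # Same word splitting as the notebook cell: newlines become spaces first
    words = text.replace("\n", " ").split(" ")
    step = size - overlap

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # The last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, size=4))
# ['w0 w1 w2 w3', 'w4 w5 w6 w7', 'w8 w9']
print(chunk_words(sample, size=4, overlap=2))
# ['w0 w1 w2 w3', 'w2 w3 w4 w5', 'w4 w5 w6 w7', 'w6 w7 w8 w9']
```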


In [40]:
from openai import OpenAI
client = OpenAI()

response = client.embeddings.create(
    input=input,
    model="text-embedding-3-small"
)

print(response.data[0].embedding)

questions = client.embeddings.create(
    input=question_titles,
    model="text-embedding-3-small"
)



[0.001367782591842115, 0.012216448783874512, 0.015802567824721336, -0.015369080938398838, -0.012623663991689682, 0.002957234624773264, -0.03520439192652702, 0.0036025389563292265, -0.027349082753062248, -0.0834396630525589, 0.022186648100614548, -0.07818527519702911, 0.01534280925989151, -0.019309870898723602, -0.039618074893951416, 0.080812469124794, 0.04907597228884697, -0.011067052371799946, 0.018324673175811768, 0.0015073522226884961, 0.06688833981752396, -0.057745710015296936, 0.033785708248615265, -0.0525175966322422, -0.003796294331550598, -0.05222860351204872, 0.00789471622556448, -0.0663629025220871, 0.017930595204234123, -0.00579624529927969, 0.03523066267371178, -0.025720223784446716, 0.035940006375312805, 0.02581217512488365, -0.0905856266617775, 0.03147377818822861, -0.002206842415034771, 0.015263993293046951, -0.007750220596790314, -0.0097534554079175, 0.0039079501293599606, -0.033339083194732666, -0.0060556805692613125, 0.033076364547014236, 0.009365944191813469, -0.0096

In [None]:
import numpy as np
import faiss

# 1. Prepare the embeddings as float32 NumPy arrays (FAISS requires float32)
# Document chunks (the candidate answers):
answer_embeddings = np.array([item.embedding for item in response.data], dtype=np.float32)
# Questions:
question_embeddings = np.array([item.embedding for item in questions.data], dtype=np.float32)

dimension = answer_embeddings.shape[1]  # Embedding dimension

# 2. Normalize in place so the inner product equals cosine similarity
#    (a safeguard: OpenAI embeddings should already be unit-length)
faiss.normalize_L2(answer_embeddings)
faiss.normalize_L2(question_embeddings)

# 3. Create an exact inner-product index (cosine similarity on normalized vectors)
index = faiss.IndexFlatIP(dimension)

# 4. Add answer embeddings to the index
index.add(answer_embeddings)

# 5. Search for the nearest neighbors
k = 5  # Number of nearest neighbors
D, I = index.search(question_embeddings, k)  # D: similarity scores, I: indices

# 6. Print the results
for i, title in enumerate(question_titles):
    print(f"Question: {title}")
    for j in range(k):
        idx = I[i][j]
        if idx >= 0:  # FAISS returns -1 when fewer than k vectors are indexed
            print(f"  Nearest answer: {input[idx]}, Similarity: {D[i][j]:.4f}")


Question: Date of Blood Test
  Nearest answer: laboratoire le 01/02/21 à 10H27 Vos résultats sur internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable 1) Communiquez votre mail au laboratoire 2) Recevez un email dès que vos résultats sont disponibles 3) Cliquez sur le lien INFORMATION COVID-19 Rendez-vous sur notre site internet dédié pour connaître notre organisation : https:// .fr/depistage-covid-19/ Hématologie Valeurs de référence Antériorités ✔ Hémogramme (Sang total - Variation d'impédance, photométrie, cytométrie en flux)  -   4,97 Hématies ........................................ Hémoglobine ....................................4,94 Téra/L 13,6 g/dL3,80 à 5,9011,5 à 17,5 7,1 à 10,913,8 Hématocrite ...................................... V.G.M. ............................................. T.C.M.H. .......................................... C.C.M.H. .......................................... Index d'anisocytose ...........................41,1 % 83,1 fL, Similarity: 
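To make the scores above easier to interpret, here is a dependency-free sketch of what an inner-product search over L2-normalized vectors computes: the cosine similarity between the query and every stored vector, sorted in descending order. The function names and the toy vectors are illustrative assumptions; FAISS does the same computation far more efficiently.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_k_cosine(query, corpus, k):
    """Return the k best (score, index) pairs by cosine similarity."""
    q = l2_normalize(query)
    scored = []
    for index, vector in enumerate(corpus):
        v = l2_normalize(vector)
        # The dot product of two unit vectors is their cosine similarity
        scored.append((sum(a * b for a, b in zip(q, v)), index))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-D vectors, for illustration only
corpus = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
query = [2.0, 0.0]
for score, index in top_k_cosine(query, corpus, k=2):
    print(index, round(score, 4))
# 0 1.0
# 2 0.7071
```

Because only the direction of a vector matters after normalization, the vector `[3.0, 3.0]` scores the same as `[1.0, 1.0]` would.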

## OpenAI Embeddings text-embedding-3-large Benchmark

In [None]:
%pip install openai
%pip install faiss-cpu

In [42]:
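import json

with open('highlighted_output.txt', 'r', encoding='utf-8') as file:
    text = file.read()
# Split the document into words
text = text.replace('\n', ' ').split(' ')

string_size = 100  # Words per chunk
print(len(text) // string_size)

# Group the words into chunks of `string_size`; the last chunk holds the remainder.
# Stepping through the list avoids an empty trailing chunk, which the embeddings API rejects.
# NOTE: `input` shadows the Python builtin; the name is kept because later cells use it.
input = [
    ' '.join(text[start:start + string_size])
    for start in range(0, len(text), string_size)
]

with open('questionnaire.json', 'r', encoding='utf-8') as json_file:
    questionnaire_data = json.load(json_file)

# Extract question titles
question_titles = [question["questionTitle"] for question in questionnaire_data]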

19


In [43]:
from openai import OpenAI
client = OpenAI()

response = client.embeddings.create(
    input=input,
    model="text-embedding-3-large"
)

print(response.data[0].embedding)

questions = client.embeddings.create(
    input=question_titles,
    model="text-embedding-3-large"
)



[-0.008174805901944637, 0.03273143991827965, 0.0005592493689619005, 0.0077519710175693035, 0.004969315603375435, -0.011597754433751106, -0.02245454117655754, 0.011573592200875282, 0.019232941791415215, 0.0602116733789444, 0.003753162221983075, -0.04339493066072464, -0.011863536201417446, -0.0003790411865338683, 0.02876887284219265, 0.004776019603013992, -0.09104236960411072, -0.015528104268014431, 0.003833702066913247, -0.02976756915450096, -0.002317537320777774, -0.020988713949918747, 0.036178551614284515, 0.0327153317630291, -0.009761443361639977, 0.018379218876361847, 0.01527843065559864, 0.022051841020584106, -0.002973938127979636, 0.010534626431763172, -0.008247291669249535, 0.004121632315218449, 0.0019178577931597829, -0.026980886235833168, 0.03685508668422699, -0.02931654453277588, -0.03511542081832886, 0.014609948731958866, -0.0007253630319610238, 0.02221292071044445, -0.008263399824500084, -0.012209857814013958, -0.018975215032696724, 0.003372610779479146, -0.01220180373638868

In [44]:
import numpy as np
import faiss

# 1. Prepare the embeddings as float32 NumPy arrays (FAISS requires float32)
# Document chunks (the candidate answers):
answer_embeddings = np.array([item.embedding for item in response.data], dtype=np.float32)
# Questions:
question_embeddings = np.array([item.embedding for item in questions.data], dtype=np.float32)

dimension = answer_embeddings.shape[1]  # Embedding dimension

# 2. Normalize in place so the inner product equals cosine similarity
#    (a safeguard: OpenAI embeddings should already be unit-length)
faiss.normalize_L2(answer_embeddings)
faiss.normalize_L2(question_embeddings)

# 3. Create an exact inner-product index (cosine similarity on normalized vectors)
index = faiss.IndexFlatIP(dimension)

# 4. Add answer embeddings to the index
index.add(answer_embeddings)

# 5. Search for the nearest neighbors
k = 5  # Number of nearest neighbors
D, I = index.search(question_embeddings, k)  # D: similarity scores, I: indices

# 6. Print the results
for i, title in enumerate(question_titles):
    print(f"Question: {title}")
    for j in range(k):
        idx = I[i][j]
        if idx >= 0:  # FAISS returns -1 when fewer than k vectors are indexed
            print(f"  Nearest answer: {input[idx]}, Similarity: {D[i][j]:.4f}")


Question: Date of Blood Test
  Nearest answer: 27,4 pg 33,0 g/dL 13,734,0 à 53,041,8 76,0 à 96,084,0 24,4 à 34,027,7 31,0 à 36,032,9 10 à 1614,1 Leucocytes .......................................6,5 Giga/L3,8 à 11,07,8 Polynucléaires neutrophiles .. . Polynucléaires éosinophiles ... Polynucléaires basophiles ..... Lymphocytes ................... Monocytes ....................... 8,4 mmol/L 58,1 % 3,1 % 0,9 % 32,4 % 5,5 % ✔ Plaquettes ........................................... 3,780 G/L 0,200 G/L 0,060 G/L 2,110 G/L 0,360 G/L (Sang total - Variation d'impédance - Beckman Coulter)  -  4,240 0,200 0,00 à 0,110,060 1,00 à 4,802,870 0,15 à 1,000,420 228 Giga/L150 à 445 213 9,7 fLInf. à 11,010,2  V.P.M. .............................................. 1,40 à 7,700,02, Similarity: 0.3970
  Nearest answer: laboratoire le 01/02/21 à 10H27 Vos résultats sur internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable 1) Communiquez votre mail au laboratoire 2) Recevez un email dès que vos
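Comparing embedding models by reading the printed neighbors is hard to do consistently. One option is to label, by hand, which chunk should answer each question and then measure how often that chunk appears in the top k results. The sketch below assumes such labels exist; `hit_rate_at_k` is a hypothetical helper and the values are toy data, not results from this notebook.

```python
def hit_rate_at_k(ranked_indices, expected_indices, k):
    """Fraction of questions whose expected chunk appears in the top-k results."""
    if len(ranked_indices) != len(expected_indices):
        raise ValueError("one expected index is needed per question")
    hits = sum(
        1
        for ranking, expected in zip(ranked_indices, expected_indices)
        if expected in list(ranking)[:k]
    )
    return hits / len(expected_indices)


# Toy values for illustration only; they are not results from this notebook.
toy_rankings = [[3, 0, 7], [5, 2, 1], [4, 9, 6]]  # e.g. rows of the FAISS index array I
toy_expected = [0, 1, 8]                          # hand-labelled chunk index per question

print(hit_rate_at_k(toy_rankings, toy_expected, k=1))  # 0.0
print(hit_rate_at_k(toy_rankings, toy_expected, k=3))  # 2 of 3 questions hit
```

In the notebook, `ranked_indices` would be the array `I` returned by `index.search`, computed once per embedding model with the same chunks and questions.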

## GPT-4.1 NER Benchmark


In [None]:
%pip install openai
%pip install python-dotenv

In [45]:
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # Load OPENAI_API_KEY from a local .env file
client = OpenAI()


response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "system",
            "content": (
                "You are a medical document processing expert performing named entity recognition. "
                "Extract and tag the following entities from the clinical text:\n"
                "1. <MEDICATION> - Drug names, supplements, vaccines\n"
                "2. <DOSAGE> - Amounts, frequencies, durations\n"
                "3. <CONDITION> - Diagnoses, symptoms, complaints\n"
                "4. <PROCEDURE> - Tests, surgeries, interventions\n"
                "5. <LAB_VALUE> - Results with units/reference ranges\n"
                "6. <BODY_PART> - Anatomical locations\n"
                "7. <BIOMARKER> - Medical indicators (cholesterol, HbA1c, blood pressure)\n"
                "8. <RISK_FACTOR> - Health risks (smoking, family history)\n\n"
                "Rules:\n"
                "- Preserve original text exactly (no corrections)\n"
                "- Tag full phrases (e.g., '<BIOMARKER>total cholesterol</BIOMARKER> <LAB_VALUE>210 mg/dL</LAB_VALUE>')\n"
                "- Use <BIOMARKER> for medical indicators that aren't conditions\n"
                "- Use <RISK_FACTOR> for behavioral/environmental factors\n"
                "- Include negations (e.g., '<CONDITION>no chest pain</CONDITION>')\n"
                "- For uncertain matches, use <POSSIBLE_...> tags\n"
                "- Return only the annotated text with no additional commentary"
            )
        },
        {
            "role": "user",
            "content": (
                "Perform NER on this OCR-extracted medical document:\n\n"
                "=== INPUT TEXT ===\n"
                f"{open('task2_prompts/prompting_tesseract.txt', encoding='utf-8').read()}\n"
                "=== END INPUT ==="
            )
        }
    ],
)
print(response.output_text)

Copie électronique LABORATOIRE DE BIOLOGIE MEDICALE C LABO    N° FINESS :   -  -  @:  &:    - Biologiste(s) Médical(aux) Docteur     Madame    CABINET MEDICAL " "      (100) Copie a : Docteur    , DR  X Demande n° 01/02/ -LABO--TP Edité le, lundi 1 février 2021 Copie a : Docteur    , DR Patient né(e)   le   FSE Tiers payant  -  Prélevements effectués par le laboratoire le 01/02/21 a 10H27 Vos résultats sur internet : Accés sécurisé, rapide, gratuit, pratique, @coresponsable 1) Communiquez votre mail au laboratoire 2) Recevez un email dés que vos résultats sont disponibles 3) Cliquez sur le lien INFORMATION COVID-19 Rendez-vous sur notre site internet dédié pour connaitre notre organisation : https:// .fr/depistage-covid-19/ ia = <PROCEDURE>Hematologie</PROCEDURE> Valeurs de référence Antériorités v <PROCEDURE>Hémogramme (Sang total - Variation d'impédance, photométrie, cytométrie en flux)</PROCEDURE>  -   <BIOMARKER>HEMatieS</BIOMARKER> .......cccccecececeeeeeeeeeeeeeeeeeeeees <LAB_VAL
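Inline tags are convenient to read but awkward to process further. The following sketch, which assumes flat (non-nested) upper-case tags like those in the output above, turns annotated text into a list of `(label, text)` pairs; `extract_entities` is a hypothetical helper name.

```python
import re

# An opening tag, the shortest possible content, then the matching closing tag
ENTITY_PATTERN = re.compile(r"<([A-Z_]+)>(.*?)</\1>", re.DOTALL)


def extract_entities(annotated: str) -> list[tuple[str, str]]:
    """Return (label, text) pairs for every well-formed tagged span."""
    return [
        (match.group(1), match.group(2).strip())
        for match in ENTITY_PATTERN.finditer(annotated)
    ]


# Fragment taken from the model output above
sample = (
    "<PROCEDURE>Hematologie</PROCEDURE> Valeurs de référence "
    "<BIOMARKER>HEMatieS</BIOMARKER>"
)
print(extract_entities(sample))
# [('PROCEDURE', 'Hematologie'), ('BIOMARKER', 'HEMatieS')]
```

Unclosed tags, such as one cut off by a token limit, are silently skipped by this pattern, so the number of extracted entities can be lower than the number of opening tags.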

## Gemini Embeddings gemini-embedding-exp-03-07 Benchmark

In [2]:
%pip install google-genai
%pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import json
import os

from dotenv import load_dotenv
from google import genai
from google.genai import types

load_dotenv()  # Load GEMINI_API_KEY from a local .env file

with open('highlighted_output.txt', 'r', encoding='utf-8') as file:
    text = file.read()
# Split the document into words
text = text.replace('\n', ' ').split(' ')

string_size = 100  # Words per chunk
print(len(text) // string_size)

# Group the words into chunks of `string_size`; the last chunk holds the remainder.
# NOTE: `input` shadows the Python builtin; the name is kept because later cells use it.
input = [
    ' '.join(text[start:start + string_size])
    for start in range(0, len(text), string_size)
]

with open('questionnaire.json', 'r', encoding='utf-8') as json_file:
    questionnaire_data = json.load(json_file)

# Extract question titles
question_titles = [question["questionTitle"] for question in questionnaire_data]



api_key=os.environ.get("GEMINI_API_KEY")
client = genai.Client(api_key=api_key)

embedded_text = client.models.embed_content(
        model="gemini-embedding-exp-03-07",
        contents=input,
        config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY")
)
embedded_questions = client.models.embed_content(
        model="gemini-embedding-exp-03-07",
        contents=question_titles,
        config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY")
)

print(embedded_text.embeddings)


19
[ContentEmbedding(values=[0.0026502106, 0.008502779, 0.009475291, -0.053564295, -0.015400698, -0.013331938, 0.006530855, 0.018381117, 0.0066022095, -0.00070538645, -0.0033250195, -0.013561771, -0.0065685147, 0.026363129, 0.1394144, -0.0004907195, -0.023095483, 0.031286646, 0.020168921, -0.024655141, 0.0006349347, 0.01217627, -0.013523396, -0.018797139, 0.01112137, -0.0049443827, 0.0028578339, 0.02265747, 0.046683036, 0.004209134, 0.009926889, 0.01506248, -0.019852577, 0.0043136417, 0.0248335, 0.015933774, 0.018594943, 0.018945266, -0.0025141907, 0.04095917, -0.008698146, 0.013745759, -0.0050982293, -0.008466959, 0.029022219, 0.011110399, 0.0042902124, -0.021294344, -0.0021094964, 0.05215766, 1.130807e-05, -0.008344076, -0.009024458, -0.17364371, 0.01052119, -0.011629976, -0.0054811854, 0.01541639, 0.016484214, 0.0012304478, -0.011329807, 0.034219477, -0.01675533, -0.022414355, -0.011387596, -0.015967937, 0.009515224, -0.009694126, -0.013955953, -0.020946965, -0.013111636, 0.00654786

In [None]:
import numpy as np
import faiss

# Convert Gemini embeddings to numpy arrays
answer_embeddings = np.array([e.values for e in embedded_text.embeddings], dtype=np.float32)
question_embeddings = np.array([e.values for e in embedded_questions.embeddings], dtype=np.float32)

dimension = answer_embeddings.shape[1]

# Normalize for cosine similarity
faiss.normalize_L2(answer_embeddings)
faiss.normalize_L2(question_embeddings)

# Create the FAISS index for cosine similarity
index = faiss.IndexFlatIP(dimension)
index.add(answer_embeddings)

# Search for the nearest neighbors
k = 5
D, I = index.search(question_embeddings, k)

# Print the results
for i, title in enumerate(question_titles):
    print(f"Question: {title}")
    for j in range(k):
        idx = I[i][j]
        if idx >= 0:  # FAISS returns -1 when fewer than k vectors are indexed
            print(f"  Nearest answer: {input[idx]}, Similarity: {D[i][j]:.4f}")


Question: Date of Blood Test
  Nearest answer: Valeurs de référence Antériorités Hormones de la fertilité Date des dernières règles ...............30/01/21 ✔ F.S.H. ................................................ 8,3 UI/L (Sang - Electrochimiluminescence - Roche)  -  Cf tableau Valeurs de référence FEMME DE PLUS DE 17 ans Phase folliculaire Pic ovulatoire Phase lutéale Postménopause : 3.5 à 12.5 : 4.7 à 21.5 : 1.7 à 7.7 : 25.8 à 134.8  Valeurs de référence HOMME DE PLUS DE 17 ans Hommes : 1.5 à 12.4   Biologiste Page 3/5   Demande n° 01/02/ ✔ L.H. .................................................... (Sang - Electrochimiluminescence - Roche)  -  ME    6,6, Similarity: 0.8170
  Nearest answer: Copie électronique    N° FINESS :   -  -  (:  7:    - Biologiste(s) Médical(aux) Docteur     CABINET MEDICAL " "Madame        (100) X Demande n° 01/02/ -LABO--TPEdité le, lundi 1 février 2021 Copie à : Docteur    , DR  Copie à : Docteur    , DR  Patient né(e)   le  FSE Tiers payant  -  Prélèvement