## Search engine for medical data

In this project, we build a search engine for medical data from the TREC-COVID challenge. In keeping with the teaching approach of the Information Retrieval course, we proceed step by step: from the simplest model to more advanced ones, analyzing at each stage the performance obtained on our real data.

### Our objective
**Implement, compare, and analyze different information retrieval models, applying state-of-the-art methods directly to the TREC-COVID Round 1 corpus, a collection of scientific documents about COVID-19.**


We start by loading the data: first the documents (here, articles about COVID), then the queries and the qrels for those queries.

We use the ir_datasets library, which provides structured access to the documents, the queries, and the relevance judgments (qrels). This lets us:

- Load Round 1 of the corpus directly, without a manual download.

- Avoid scraping the data from links such as:

     - https://ir.nist.gov/covidSubmit/data/topics-rnd1.xml (for the queries)

     - https://ir.nist.gov/covidSubmit/data/qrels-rnd1.txt (for the qrels)



#### Final structure of our data
- Documents: scientific articles (title, abstract, etc.)

- Queries: medical questions asked as part of the challenge

- Qrels: relevance judgments indicating, for each query, how relevant each judged document is


### Loading the data with the ir_datasets library

In [1]:
import pandas as pd
import ir_datasets
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from tqdm import tqdm

# Load Round 1 only
dataset = ir_datasets.load("cord19/trec-covid/round1")

# At this point we only have an object provided by the ir_datasets library,
# which exposes an interface to the documents, queries, relevance judgments, etc.

Next, we collect the documents of the dataset we just loaded into a DataFrame, so that we can inspect its shape, its first rows, and so on.

In [2]:
# Preparation
docs = []
for doc in tqdm(dataset.docs_iter(), desc="Loading documents"):
    docs.append({
        "doc_id": doc.doc_id,
        "title": doc.title,
        "abstract": doc.abstract
    })

# Convert to a DataFrame
df_docs = pd.DataFrame(docs)

# Display
print(df_docs.shape)
print(df_docs.columns)
df_docs.head()


Loading documents: 51078it [00:00, 175144.85it/s]

(51078, 3)
Index(['doc_id', 'title', 'abstract'], dtype='object')





Unnamed: 0,doc_id,title,abstract
0,xqhn0vbp,Airborne rhinovirus detection and effect of ul...,"BACKGROUND: Rhinovirus, the most common cause ..."
1,gi6uaa83,Discovering human history from stomach bacteria,Recent analyses of human pathogens have reveal...
2,le0ogx1s,A new recruit for the army of the men of death,"The army of the men of death, in John Bunyan's..."
3,fy4w7xz8,Association of HLA class I with severe acute r...,BACKGROUND: The human leukocyte antigen (HLA) ...
4,0qaoam29,A double epidemic model for the SARS propagation,BACKGROUND: An epidemic of a Severe Acute Resp...


### Loading the queries and the qrels

We use the loaded **dataset** object to obtain the queries and the qrels.

In [3]:
## Loading the queries

queries_iter = dataset.queries_iter()  # or dataset.queries


Let us look at the first query, then convert the queries to a DataFrame as we did for the documents.

In [4]:
first_query = next(dataset.queries_iter())
print(first_query)


TrecQuery(query_id='1', title='coronavirus origin', description='what is the origin of COVID-19', narrative="seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans")


In [5]:
queries = []
for q in dataset.queries_iter():
    queries.append({
        "query_id": q.query_id,
        "title": q.title,
        "description": q.description,
        "narrative": q.narrative
    })

queries_df = pd.DataFrame(queries)
queries_df.head()


Unnamed: 0,query_id,title,description,narrative
0,1,coronavirus origin,what is the origin of COVID-19,seeking range of information about the SARS-Co...
1,2,coronavirus response to weather changes,how does the coronavirus respond to changes in...,seeking range of information about the SARS-Co...
2,3,coronavirus immunity,will SARS-CoV2 infected people develop immunit...,seeking studies of immunity developed due to i...
3,4,how do people die from the coronavirus,what causes death from Covid-19?,Studies looking at mechanisms of death from Co...
4,5,animal models of COVID-19,what drugs have been active against SARS-CoV o...,Papers that describe the results of testing d...


In [6]:
## Loading the qrels

qrels_iter = dataset.qrels_iter()

We proceed as we did for the queries.

In [7]:
qrels = []
for qrel in dataset.qrels_iter():
    qrels.append({
        "query_id": qrel.query_id,
        "doc_id": qrel.doc_id,
        "relevance": qrel.relevance
    })

qrels_df = pd.DataFrame(qrels)
qrels_df.head()


Unnamed: 0,query_id,doc_id,relevance
0,1,010vptx3,2
1,1,02f0opkr,1
2,1,04ftw7k9,0
3,1,05qglt1f,0
4,1,0604jed8,0
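
The `relevance` column is graded: the rows shown above contain the values 0, 1 and 2. As a minimal sketch, assuming that 0 means "judged, but not relevant" (an assumption worth checking against the TREC-COVID guidelines), the judgments can be grouped per query so that the relevant documents are easy to select. The helper names `build_qrels_lookup` and `relevant_doc_ids` are hypothetical, and the five rows are copied from the table above.

```python
from collections import defaultdict

# The five qrels rows displayed above, as (query_id, doc_id, relevance) tuples.
sample_qrels = [
    ("1", "010vptx3", 2),
    ("1", "02f0opkr", 1),
    ("1", "04ftw7k9", 0),
    ("1", "05qglt1f", 0),
    ("1", "0604jed8", 0),
]

def build_qrels_lookup(rows):
    """Group judgments by query: query_id -> {doc_id: relevance}."""
    lookup = defaultdict(dict)
    for query_id, doc_id, relevance in rows:
        lookup[query_id][doc_id] = relevance
    return dict(lookup)

def relevant_doc_ids(lookup, query_id, min_relevance=1):
    """Return the doc_ids judged at or above min_relevance for one query.

    Assumption: a grade of 0 means the document was judged not relevant.
    """
    judged = lookup.get(query_id, {})
    return {doc_id for doc_id, rel in judged.items() if rel >= min_relevance}

qrels_lookup = build_qrels_lookup(sample_qrels)
relevant = relevant_doc_ids(qrels_lookup, "1")
print(sorted(relevant))  # ['010vptx3', '02f0opkr']
```

Keeping the grade in the lookup, rather than only the doc_id, leaves room for both binary metrics (through a threshold) and graded ones.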


### Simple cleaning

For now, we apply only a simple cleaning:

- Removal of NaN values
- Concatenation of **title + abstract**
- Simple tokenization (lowercasing, stop-word removal)

### Preparing the tools

In [8]:
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

tokenizer = RegexpTokenizer(r'\w+')  # Keeps runs of word characters (letters, digits, underscore)
stop_words = ENGLISH_STOP_WORDS


### Simple cleaning function

In [9]:
def clean_text(text):
    tokens = tokenizer.tokenize(text.lower())  # lowercase + tokenization
    return [token for token in tokens if token not in stop_words]  # keep only tokens that are not stop words
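
To see what this function does on a concrete string, here is a self-contained sketch of the same three steps (lowercasing, `\w+` tokenization, stop-word filtering) using only the standard library. `clean_text_demo` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop list only stands in for scikit-learn's `ENGLISH_STOP_WORDS`, so results on other strings may differ from `clean_text`.

```python
import re

# Hypothetical mini stop list, for illustration only; the notebook uses
# scikit-learn's much larger ENGLISH_STOP_WORDS.
DEMO_STOP_WORDS = {"the", "of", "is", "what", "to", "and", "a", "in"}

def clean_text_demo(text):
    """Lowercase, extract runs of word characters, drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in DEMO_STOP_WORDS]

print(clean_text_demo("What is the origin of COVID-19?"))
# ['origin', 'covid', '19']
```

Note that `\w+` splits `COVID-19` into `covid` and `19`, because the hyphen is not a word character.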


### Applying the cleaning to the documents (corpus)

In [None]:

## Combine the title and the abstract
documents_df = pd.DataFrame([
    {
        "doc_id": doc.doc_id,
        "text": (doc.title or "") + " " + (doc.abstract or "")
    }
    for doc in dataset.docs_iter()
])

# Apply the cleaning
documents_df["tokens"] = documents_df["text"].fillna("").apply(clean_text)



In [14]:
documents_df.head(3)

Unnamed: 0,doc_id,text,tokens
0,xqhn0vbp,Airborne rhinovirus detection and effect of ul...,"[airborne, rhinovirus, detection, effect, ultr..."
1,gi6uaa83,Discovering human history from stomach bacteri...,"[discovering, human, history, stomach, bacteri..."
2,le0ogx1s,A new recruit for the army of the men of death...,"[new, recruit, army, men, death, army, men, de..."


#### Cleaning the queries

In [13]:
queries_df["tokens"] = queries_df["title"].fillna("").apply(clean_text)


We clean only the query titles: the title is the short keyword form of the query, which is what a user would naturally type into a search engine.

*description* and *narrative* spell out the full intent of the topic, but we do not use them here (title-only queries are a common baseline setting in TREC evaluations).

In [16]:
queries_df.head(3)

Unnamed: 0,query_id,title,description,narrative,tokens
0,1,coronavirus origin,what is the origin of COVID-19,seeking range of information about the SARS-Co...,"[coronavirus, origin]"
1,2,coronavirus response to weather changes,how does the coronavirus respond to changes in...,seeking range of information about the SARS-Co...,"[coronavirus, response, weather, changes]"
2,3,coronavirus immunity,will SARS-CoV2 infected people develop immunit...,seeking studies of immunity developed due to i...,"[coronavirus, immunity]"


### Binary model

Here we implement a binary (Boolean) model and look at its performance; we close with some comments on this model.

- Inverted index (each word with the list of documents in which it appears)
- Boolean queries
- Retrieval of the documents that contain given terms

We will run a simple evaluation (query coverage).

### Building an inverted index over the documents we just cleaned

In [23]:
from collections import defaultdict

def create_inverted_index(docs_tokens):
    """
    docs_tokens: DataFrame with columns ['doc_id', 'tokens']
    Returns an inverted index: word → set of doc_ids
    """
    index = defaultdict(set)
    for _, row in docs_tokens.iterrows():
        doc_id = row["doc_id"]
        for token in set(row["tokens"]):  # set avoids processing duplicates
            index[token].add(doc_id)  # fill the inverted index
    return dict(index)

# Build the index
inverted_index = create_inverted_index(documents_df[["doc_id", "tokens"]])
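
To make the resulting structure concrete, here is a minimal sketch of the same idea on three made-up documents. The names `toy_docs` and `build_toy_index` and the documents themselves are hypothetical, and the sketch uses a plain dictionary instead of a DataFrame.

```python
from collections import defaultdict

# Hypothetical tokenized documents (not taken from the corpus).
toy_docs = {
    "d1": ["coronavirus", "origin", "bat"],
    "d2": ["coronavirus", "vaccine"],
    "d3": ["origin", "species", "origin"],
}

def build_toy_index(docs):
    """Map each token to the set of doc_ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)  # a set absorbs repeated tokens
    return dict(index)

toy_index = build_toy_index(toy_docs)
print(sorted(toy_index["coronavirus"]))  # ['d1', 'd2']
print(sorted(toy_index["origin"]))       # ['d1', 'd3']
```

The index records only whether a word occurs in a document: "origin" appears twice in `d3` but is stored once, which is exactly the information a binary model uses.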


In [18]:
docs_token = documents_df[['doc_id', 'tokens']]

In [122]:
documents_df['tokens'].head(3)

0    [airborne, rhinovirus, detection, effect, ultr...
1    [discovering, human, history, stomach, bacteri...
2    [new, recruit, army, men, death, army, men, de...
Name: tokens, dtype: object

In [22]:
for _,row in docs_token.iterrows():
    print(row['doc_id'])

xqhn0vbp
gi6uaa83
le0ogx1s
fy4w7xz8
0qaoam29
...

### Boolean search with the index we just built

Here we run an AND search.

In [47]:
def boolean_search(query_tokens, index):
    """
    Runs a Boolean search with the AND operator.
    query_tokens: list of cleaned tokens
    index: dictionary word ==> set(doc_ids)
    Returns the set of matching doc_ids
    """
    if not query_tokens:
        return set()

    result = index.get(query_tokens[0], set())
    for token in query_tokens[1:]:
        result = result & index.get(token, set())  # intersection
        if not result:
            break  # optimization: stop as soon as no document is left
    return result
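
The behaviour of an AND query can be checked by hand on a tiny index. The sketch below is a variant, not the notebook's function: `and_search` and `toy_index` are hypothetical, and it additionally intersects the shortest posting sets first, a common optimization that does not change the result.

```python
# Hypothetical toy index: word -> set of doc_ids.
toy_index = {
    "coronavirus": {"d1", "d2"},
    "origin": {"d1", "d3"},
    "vaccine": {"d2"},
}

def and_search(query_tokens, index):
    """Return the doc_ids that contain every query token."""
    if not query_tokens:
        return set()
    # Unknown tokens get an empty posting set, which empties the result.
    postings = sorted((index.get(t, set()) for t in query_tokens), key=len)
    result = set(postings[0])  # copy, so the index itself is never modified
    for posting in postings[1:]:
        result &= posting
        if not result:
            break  # nothing can match any more
    return result

print(and_search(["coronavirus", "origin"], toy_index))    # {'d1'}
print(and_search(["coronavirus", "pangolin"], toy_index))  # set()
```

The second query shows how strict AND is: a single term absent from the index is enough to return no document at all.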



#### Function that uses the search to return the documents with their titles and abstracts

In [52]:
def boolean_search_documents(query_text, index, documents_df):
    """
    - query_text: raw query text (not tokenized)
    - index: inverted index (word → doc_ids)
    - documents_df: DataFrame with ['doc_id', 'text', 'tokens']

    Returns a list of dictionaries with 'doc_id', 'title', 'abstract'
    """
    query_tokens = clean_text(query_text)
    matched_ids = boolean_search(query_tokens, index)

    # Retrieve the documents matching the doc_ids found
    results = documents_df[documents_df['doc_id'].isin(matched_ids)].copy()

    # Heuristic split of the combined text: everything before the first ". " is treated as the title, the rest as the abstract
    results['title'] = results['text'].apply(lambda t: t.split(". ")[0] if ". " in t else t[:100])
    results['abstract'] = results['text'].apply(lambda t: t if ". " not in t else ". ".join(t.split(". ")[1:]))

    return results[['doc_id', 'title', 'abstract']].to_dict(orient='records')



### Example use of this Boolean search engine

In [57]:
query = "origin of coronavirus"
results_df = boolean_search_documents(query, inverted_index, documents_df)

for row in results_df:
    print(f"\n {row['doc_id']}")
    print(f" Title: {row['title']}")
    print(f" Abstract: {row['abstract'][:300]}...")



 1wswi7us
 Title: Relationship of SARS-CoV to other pathogenic RNA viruses explored by tetranucleotide usage profiling BACKGROUND: The exact origin of the cause of the Severe Acute Respiratory Syndrome (SARS) is still an open question
 Abstract: The genomic sequence relationship of SARS-CoV with 30 different single-stranded RNA (ssRNA) viruses of various families was studied using two non-standard approaches. Both approaches began with the vectorial profiling of the tetra-nucleotide usage pattern V for each virus. In approach one, a distanc...

 8zwsi4nk
 Title: Date of origin of the SARS coronavirus strains BACKGROUND: A new respiratory infectious epidemic, severe acute respiratory syndrome (SARS), broke out and spread throughout the world
 Abstract: By now the putative pathogen of SARS has been identified as a new coronavirus, a single positive-strand RNA virus. RNA viruses commonly have a high rate of genetic mutation. It is therefore important to know the mutation rate of the SARS

#### For a nicer display

In [96]:
import pandas as pd
import re
from IPython.display import display, HTML

def highlight_keywords(text, keywords):
    """
    Highlights the keywords found in a text.
    """
    if not text:
        return ""
    for kw in keywords:
        pattern = re.compile(rf'\b({re.escape(kw)})\b', re.IGNORECASE)
        text = pattern.sub(r'<mark style="background-color: #ffd54f;">\1</mark>', text)
    return text

def display_search_results(results_list, query, max_rows=10, show_scores=True):
    if not results_list:
        display(HTML("<p style='color:red;'>No results found.</p>"))
        return

    df = pd.DataFrame(results_list[:max_rows])
    query_tokens = [t.lower() for t in re.findall(r'\w+', query)]

    df["title"] = df["title"].apply(lambda x: highlight_keywords(x, query_tokens))
    df["abstract"] = df["abstract"].apply(lambda x: highlight_keywords(x, query_tokens))

    columns = ["doc_id", "title", "abstract"]
    if show_scores and "score" in df.columns:
        columns.insert(1, "score")

    styled_table = df[columns].style.set_table_attributes("style='width:100%; border-collapse:collapse'")
    styled_table = styled_table.set_properties(**{
        'border': '1px solid #ccc',
        'padding': '8px',
        'text-align': 'left'
    })
    styled_table = styled_table.set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#208A9D'), ('color', 'white'), ('font-weight', 'bold')]}
    ])

    display(styled_table)
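
One possible pitfall of highlighting keyword by keyword is that a later keyword (for instance `mark` or `color`) can match inside the markup inserted for an earlier one. A hedged alternative, sketched below, builds a single alternation pattern and substitutes in one pass over the original text. `highlight_once` is a hypothetical helper and uses a bare `<mark>` tag for brevity.

```python
import re

def highlight_once(text, keywords):
    """Wrap every keyword occurrence in <mark> using a single regex pass."""
    if not text or not keywords:
        return text or ""
    # Longest keywords first, so a longer keyword wins over its own prefix.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(kw) for kw in ordered)
    pattern = re.compile(rf"\b({alternation})\b", re.IGNORECASE)
    return pattern.sub(r"<mark>\1</mark>", text)

print(highlight_once("Origin of the SARS coronavirus",
                     ["coronavirus", "origin", "mark"]))
# <mark>Origin</mark> of the SARS <mark>coronavirus</mark>
```

Because the substitution runs once over the original text, the keyword `mark` cannot match the tags that the function itself inserts.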


In [97]:
query = "coronavirus origin"
results_df = boolean_search_documents(query, inverted_index, documents_df)
display_search_results(results_df, query)


Unnamed: 0,doc_id,title,abstract
0,1wswi7us,Relationship of SARS-CoV to other pathogenic RNA viruses explored by tetranucleotide usage profiling BACKGROUND: The exact origin of the cause of the Severe Acute Respiratory Syndrome (SARS) is still an open question,"The genomic sequence relationship of SARS-CoV with 30 different single-stranded RNA (ssRNA) viruses of various families was studied using two non-standard approaches. Both approaches began with the vectorial profiling of the tetra-nucleotide usage pattern V for each virus. In approach one, a distance measure of a vector V, based on correlation coefficient was devised to construct a relationship tree by the neighbor-joining algorithm. In approach two, a multivariate factor analysis was performed to derive the embedded tetra-nucleotide usage patterns. These patterns were subsequently used to classify the selected viruses. RESULTS: Both approaches yielded relationship outcomes that are consistent with the known virus classification. They also indicated that the genome of RNA viruses from the same family conform to a specific pattern of word usage. Based on the correlation of the overall tetra-nucleotide usage patterns, the Transmissible Gastroenteritis Virus (TGV) and the Feline CoronaVirus (FCoV) are closest to SARS-CoV. Surprisingly also, the RNA viruses that do not go through a DNA stage displayed a remarkable discrimination against the CpG and UpA di-nucleotide (z = -77.31, -52.48 respectively) and selection for UpG and CpA (z = 65.79,49.99 respectively). Potential factors influencing these biases are discussed. CONCLUSION: The study of genomic word usage is a powerful method to classify RNA viruses. The congruence of the relationship outcomes with the known classification indicates that there exist phylogenetic signals in the tetra-nucleotide usage patterns, that is most prominent in the replicase open reading frames."
1,8zwsi4nk,"Date of origin of the SARS coronavirus strains BACKGROUND: A new respiratory infectious epidemic, severe acute respiratory syndrome (SARS), broke out and spread throughout the world","By now the putative pathogen of SARS has been identified as a new coronavirus, a single positive-strand RNA virus. RNA viruses commonly have a high rate of genetic mutation. It is therefore important to know the mutation rate of the SARS coronavirus as it spreads through the population. Moreover, finding a date for the last common ancestor of SARS coronavirus strains would be useful for understanding the circumstances surrounding the emergence of the SARS pandemic and the rate at which SARS coronavirus diverge. METHODS: We propose a mathematical model to estimate the evolution rate of the SARS coronavirus genome and the time of the last common ancestor of the sequenced SARS strains. Under some common assumptions and justifiable simplifications, a few simple equations incorporating the evolution rate (K) and time of the last common ancestor of the strains (T(0)) can be deduced. We then implemented the least square method to estimate K and T(0 )from the dataset of sequences and corresponding times. Monte Carlo stimulation was employed to discuss the results. RESULTS: Based on 6 strains with accurate dates of host death, we estimated the time of the last common ancestor to be about August or September 2002, and the evolution rate to be about 0.16 base/day, that is, the SARS coronavirus would on average change a base every seven days. We validated our method by dividing the strains into two groups, which coincided with the results from comparative genomics. CONCLUSION: The applied method is simple to implement and avoid the difficulty and subjectivity of choosing the root of phylogenetic tree. 
Based on 6 strains with accurate date of host death, we estimated a time of the last common ancestor, which is coincident with epidemic investigations, and an evolution rate in the same range as that reported for the HIV-1 virus."
2,0gmtnkbh,Mutational dynamics of the SARS coronavirus in cell culture and human populations isolated in 2003 BACKGROUND: The SARS coronavirus is the etiologic agent for the epidemic of the Severe Acute Respiratory Syndrome,"The recent emergence of this new pathogen, the careful tracing of its transmission patterns, and the ability to propagate in culture allows the exploration of the mutational dynamics of the SARS-CoV in human populations. METHODS: We sequenced complete SARS-CoV genomes taken from primary human tissues (SIN3408, SIN3725V, SIN3765V), cultured isolates (SIN848, SIN846, SIN842, SIN845, SIN847, SIN849, SIN850, SIN852, SIN3408L), and five consecutive Vero cell passages (SIN2774_P1, SIN2774_P2, SIN2774_P3, SIN2774_P4, SIN2774_P5) arising from SIN2774 isolate. These represented individual patient samples, serial in vitro passages in cell culture, and paired human and cell culture isolates. Employing a refined mutation filtering scheme and constant mutation rate model, the mutation rates were estimated and the possible date of emergence was calculated. Phylogenetic analysis was used to uncover molecular relationships between the isolates. RESULTS: Close examination of whole genome sequence of 54 SARS-CoV isolates identified before 14(th )October 2003, including 22 from patients in Singapore, revealed the mutations engendered during human-to-Vero and Vero-to-human transmission as well as in multiple Vero cell passages in order to refine our analysis of human-to-human transmission. Though co-infection by different quasipecies in individual tissue samples is observed, the in vitro mutation rate of the SARS-CoV in Vero cell passage is negligible. The in vivo mutation rate, however, is consistent with estimates of other RNA viruses at approximately 5.7 × 10(-6 )nucleotide substitutions per site per day (0.17 mutations per genome per day), or two mutations per human passage (adjusted R-square = 0.4014). 
Using the immediate Hotel M contact isolates as roots, we observed that the SARS epidemic has generated four major genetic groups that are geographically associated: two Singapore isolates, one Taiwan isolate, and one North China isolate which appears most closely related to the putative SARS-CoV isolated from a palm civet. Non-synonymous mutations are centered in non-essential ORFs especially in structural and antigenic genes such as the S and M proteins, but these mutations did not distinguish the geographical groupings. However, no non-synonymous mutations were found in the 3CLpro and the polymerase genes. CONCLUSIONS: Our results show that the SARS-CoV is well adapted to growth in culture and did not appear to undergo specific selection in human populations. We further assessed that the putative origin of the SARS epidemic was in late October 2002 which is consistent with a recent estimate using cases from China. The greater sequence divergence in the structural and antigenic proteins and consistent deletions in the 3' – most portion of the viral genome suggest that certain selection pressures are interacting with the functional nature of these validated and putative ORFs."
3,7ecliy9q,Inhibition of cytokine gene expression and induction of chemokine genes in non-lymphatic cells infected with SARS coronavirus BACKGROUND: SARS coronavirus (SARS-CoV) is the etiologic agent of the severe acute respiratory syndrome,"SARS-CoV mainly infects tissues of non-lymphatic origin, and the cytokine profile of those cells can determine the course of disease. Here, we investigated the cytokine response of two human non-lymphatic cell lines, Caco-2 and HEK 293, which are fully permissive for SARS-CoV. RESULTS: A comparison with established cytokine-inducing viruses revealed that SARS-CoV only weakly triggered a cytokine response. In particular, SARS-CoV did not activate significant transcription of the interferons IFN-α, IFN-β, IFN-λ1, IFN-λ2/3, as well as of the interferon-induced antiviral genes ISG56 and MxA, the chemokine RANTES and the interleukine IL-6. Interestingly, however, SARS-CoV strongly induced the chemokines IP-10 and IL-8 in the colon carcinoma cell line Caco-2, but not in the embryonic kidney cell line 293. CONCLUSION: Our data indicate that SARS-CoV suppresses the antiviral cytokine system of non-immune cells to a large extent, thus buying time for dissemination in the host. However, synthesis of IP-10 and IL-8, which are established markers for acute-stage SARS, escapes the virus-induced silencing at least in some cell types. Therefore, the progressive infiltration of immune cells into the infected lungs observed in SARS patients could be due to the production of these chemokines by the infected tissue cells."
4,xad3k5aw,"Recombinant Canine Coronaviruses in Dogs, Europe Coronaviruses of potential recombinant origin with porcine transmissible gastroenteritis virus (TGEV), referred to as a new subtype (IIb) of canine coronavirus (CCoV), were recently identified in dogs in Europe","To assess the distribution of the TGEV-like CCoV subtype, during 2001–2008 we tested fecal samples from dogs with gastroenteritis. Of 1,172 samples, 493 (42.06%) were positive for CCoV. CCoV-II was found in 218 samples, and CCoV-I and CCoV-II genotypes were found in 182. Approximately 20% of the samples with CCoV-II had the TGEV-like subtype; detection rates varied according to geographic origin. The highest and lowest rates of prevalence for CCoV-II infection were found in samples from Hungary and Greece (96.87% and 3.45%, respectively). Sequence and phylogenetic analyses showed that the CCoV-IIb strains were related to prototype TGEV-like strains in the 5′ and the 3′ ends of the spike protein gene."
5,jcu3pasy,Recombination in Avian Gamma-Coronavirus Infectious Bronchitis Virus Recombination in the family Coronaviridae has been well documented and is thought to be a contributing factor in the emergence and evolution of different coronaviral genotypes as well as different species of coronavirus,"However, there are limited data available on the frequency and extent of recombination in coronaviruses in nature and particularly for the avian gamma-coronaviruses where only recently the emergence of a turkey coronavirus has been attributed solely to recombination. In this study, the full-length genomes of eight avian gamma-coronavirus infectious bronchitis virus (IBV) isolates were sequenced and along with other full-length IBV genomes available from GenBank were analyzed for recombination. Evidence of recombination was found in every sequence analyzed and was distributed throughout the entire genome. Areas that have the highest occurrence of recombination are located in regions of the genome that code for nonstructural proteins 2, 3 and 16, and the structural spike glycoprotein. The extent of the recombination observed, suggests that this may be one of the principal mechanisms for generating genetic and antigenic diversity within IBV. These data indicate that reticulate evolutionary change due to recombination in IBV, likely plays a major role in the origin and adaptation of the virus leading to new genetic types and strains of the virus."
6,v08cs51n,"Review of Bats and SARS Bats have been identified as a natural reservoir for an increasing number of emerging zoonotic viruses, including henipaviruses and variants of rabies viruses","Recently, we and another group independently identified several horseshoe bat species (genus Rhinolophus) as the reservoir host for a large number of viruses that have a close genetic relationship with the coronavirus associated with severe acute respiratory syndrome (SARS). Our current research focused on the identification of the reservoir species for the progenitor virus of the SARS coronaviruses responsible for outbreaks during 2002–2003 and 2003–2004. In addition to SARS-like coronaviruses, many other novel bat coronaviruses, which belong to groups 1 and 2 of the 3 existing coronavirus groups, have been detected by PCR. The discovery of bat SARS-like coronaviruses and the great genetic diversity of coronaviruses in bats have shed new light on the origin and transmission of SARS coronaviruses."
7,ezi2mret,SARS-associated Coronavirus Transmitted from Human to Pig SEVERE ACUTE RESPIRATORY SYNDROME–ASSOCIATED: coronavirus (SARS-CoV) was isolated from a pig during a survey for possible routes of viral transmission after a SARS epidemic,Sequence and epidemiology analyses suggested that the pig was infected by a SARS-CoV of human origin.
8,buzyani0,"Origin, diversity, and maturation of human antiviral antibodies analyzed by high-throughput sequencing Our understanding of how antibodies are generated and function could help develop effective vaccines and antibody-based therapeutics against viruses such as HIV-1, SARS coronavirus (SARS CoV), and Hendra and Nipah viruses (henipaviruses)","Although broadly neutralizing antibodies (bnAbs) against the HIV-1 were observed in patients, elicitation of such bnAbs remains a major challenge when compared to other viral targets. We previously hypothesized that HIV-1 could have evolved a strategy to evade the immune system due to absent or very weak binding of germline antibodies to the conserved epitopes that may not be sufficient to initiate and/or maintain an effective immune response. To further explore our hypothesis, we used the 454 sequence analysis of a large naïve library of human IgM antibodies which had been used for selecting antibodies against SARS CoV receptor-binding domain (RBD), and soluble G proteins (sG) of henipaviruses. We found that the human IgM repertoires from the 454 sequencing have diverse germline usages, recombination patterns, junction diversity, and a lower extent of somatic mutation. In this study, we identified antibody maturation intermediates that are related to bnAbs against the HIV-1 and other viruses as observed in normal individuals, and compared their genetic diversity and somatic mutation level along with available structural and functional data. Further computational analysis will provide framework for understanding the underlying genetic and molecular determinants related to maturation pathways of antiviral bnAbs that could be useful for applying novel approaches to the design of effective vaccine immunogens and antibody-based therapeutics."
9,vexo81k5,"Identification and Characterization of a Novel Alpaca Respiratory Coronavirus Most Closely Related to the Human Coronavirus 229E In 2007, a novel coronavirus associated with an acute respiratory disease in alpacas (Alpaca Coronavirus, ACoV) was isolated","Full-length genomic sequencing of the ACoV demonstrated the genome to be consistent with other Alphacoronaviruses. A putative additional open-reading frame was identified between the nucleocapsid gene and 3'UTR. The ACoV was genetically most similar to the common human coronavirus (HCoV) 229E with 92.2% nucleotide identity over the entire genome. A comparison of spike gene sequences from ACoV and from HCoV-229E isolates recovered over a span of five decades showed the ACoV to be most similar to viruses isolated in the 1960’s to early 1980’s. The true origin of the ACoV is unknown, however a common ancestor between the ACoV and HCoV-229E appears to have existed prior to the 1960’s, suggesting virus transmission, either as a zoonosis or anthroponosis, has occurred between alpacas and humans."


### Evaluating this Boolean search engine on the queries and qrels that match the documents used to build the index

### A global evaluation function that we will use throughout this project

In [77]:
import numpy as np

def precision_at_k(predicted, relevant, k):
    if not predicted or not relevant:
        return 0.0
    predicted_k = predicted[:k]
    hits = len(set(predicted_k) & set(relevant))
    return hits / k

def recall_at_k(predicted, relevant, k):
    if not predicted or not relevant:
        return 0.0
    predicted_k = predicted[:k]
    return len(set(predicted_k) & set(relevant)) / len(relevant)

def reciprocal_rank(predicted, relevant):
    for idx, doc_id in enumerate(predicted):
        if doc_id in relevant:
            return 1 / (idx + 1)
    return 0.0

def dcg_at_k(relevances, k):
    relevances = np.array(relevances)[:k]
    return np.sum(relevances / np.log2(np.arange(2, relevances.size + 2)))

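def ndcg_at_k(predicted, relevant, k):
    # Binary gains: 1 if the document is in `relevant`, else 0
    relevances = [1 if doc in relevant else 0 for doc in predicted[:k]]
    dcg = dcg_at_k(relevances, k)
    # The ideal ranking is a reordering of the retrieved list only,
    # so relevant documents that were not retrieved do not lower the score
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal_relevances, k)
    return dcg / idcg if idcg > 0 else 0.0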

def evaluate_model(search_function, queries_df, qrels_df, k=10):
    precision_scores = []
    recall_scores = []
    rr_scores = []
    ndcg_scores = []

    for _, row in queries_df.iterrows():
        qid = str(row["query_id"])
        query_text = row["title"]

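        # Keeps every judged doc_id for this query, whatever its relevance grade
        relevant_docs = qrels_df[qrels_df["query_id"] == qid]["doc_id"].tolist()
        if not relevant_docs:
            continue

        results = search_function(query_text, top_k=k)
        predicted_docs = [r["doc_id"] for r in results]
        if not predicted_docs:
            continue  # queries with no retrieved document do not enter the averages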

        precision_scores.append(precision_at_k(predicted_docs, relevant_docs, k))
        recall_scores.append(recall_at_k(predicted_docs, relevant_docs, k))
        rr_scores.append(reciprocal_rank(predicted_docs, relevant_docs))
        ndcg_scores.append(ndcg_at_k(predicted_docs, relevant_docs, k))

    return {
        "precision@k": np.mean(precision_scores),
        "recall@k": np.mean(recall_scores),
        "MRR": np.mean(rr_scores),
        "nDCG@k": np.mean(ndcg_scores)
    }
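
The `ndcg_at_k` function above normalizes by the best possible ordering of the retrieved top-k. A common alternative, sketched below under the same binary-relevance assumption, builds the ideal ranking from the full set of relevant documents, so a list with few relevant hits is penalized even when they are ranked first. The name `ndcg_at_k_full_ideal` is hypothetical and is not used elsewhere in this notebook.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a list of gains, rank 1 first."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k_full_ideal(predicted, relevant, k):
    """nDCG@k with binary gains, normalized by an ideal list built from all relevant docs."""
    relevant = set(relevant)
    gains = [1 if doc in relevant else 0 for doc in predicted[:k]]
    ideal = [1] * min(k, len(relevant))  # best case: relevant documents fill the top ranks
    idcg = dcg(ideal)
    return dcg(gains) / idcg if idcg > 0 else 0.0

# One relevant hit at rank 1, out of 3 relevant documents, with k = 3
score = ndcg_at_k_full_ideal(["d1", "x", "y"], ["d1", "d2", "d3"], k=3)
print(round(score, 4))
```

With the notebook's `ndcg_at_k`, the same list scores 1.0, because the single hit already sits in the best position among the three retrieved documents.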


In [78]:
# Wrapper that supplies the extra arguments expected by the Boolean search
def boolean_search_eval_wrapper(query_text, top_k=10):
    results = boolean_search_documents(query_text, inverted_index, documents_df)
    return results[:top_k]

# Evaluation
results_boolean = evaluate_model(boolean_search_eval_wrapper, queries_df, qrels_df, k=10)

# Print the results
print("Boolean engine results:")
for metric, score in results_boolean.items():
    print(f"- {metric}: {score:.4f}")


Boolean engine results:
- precision@k: 0.3185
- recall@k: 0.0133
- MRR: 0.5346
- nDCG@k: 0.5915


### Commentary

These are the first evaluation results for the Boolean search engine. With k = 10, precision is about 0.32, MRR 0.53 and nDCG 0.59, while recall stays near 0.013: modest performance overall.

These scores reflect several structural limitations of the Boolean model:

- No graded relevance:

    - The Boolean engine returns the documents that match the query exactly, without measuring how relevant or how semantically close each one is.

    - There is no ranking score: all returned documents are treated as equal.

- Word order is ignored:

    - The engine ignores syntax, semantics and word order, so results are often approximate or incomplete.

- Lack of flexibility:

    - A query with several keywords has to match those exact terms, so highly relevant documents that contain only some of them, or that use synonyms, are missed.

- Low recall:

    - Recall is very low (~1.3%). Part of this comes from the cutoff itself: with only 10 results per query, recall@10 cannot exceed 10 divided by the number of relevant documents.

    - It also reflects how strict Boolean queries are: they ignore the variety of natural language.
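
To put the recall figure in perspective, recall@k has a ceiling that depends only on k and on the number of relevant documents for the query. The sketch below computes that ceiling; the counts of relevant documents are hypothetical and are not taken from the qrels.

```python
def max_recall_at_k(k, n_relevant):
    """Best recall@k any ranking can reach when n_relevant documents are relevant."""
    if n_relevant <= 0:
        return 0.0
    return min(k, n_relevant) / n_relevant

# Hypothetical numbers of relevant documents per query
for n_relevant in (5, 100, 500):
    print(n_relevant, max_recall_at_k(10, n_relevant))
```

A query with hundreds of relevant documents can therefore only reach a recall@10 of a few percent, whatever the model.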

### Bag-of-words model (binary representation)

- Document-term matrix with presence flags (0/1): each row is a document, each column is a word, and each cell is 1 if the word occurs in the document and 0 otherwise
- Similarity with the query is based on the number of shared words

We use a naive comparison (cosine similarity on binary vectors)


### Building the binary document-term matrix 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_binary = CountVectorizer(binary=True)  # We want 0 or 1, not frequencies
X_binary = vectorizer_binary.fit_transform(documents_df["tokens"].apply(lambda tokens: " ".join(tokens)))

# apply(lambda tokens: " ".join(tokens)): the tokens are joined back into a single string, which is the input CountVectorizer expects


This is a sparse matrix: only the non-zero cells (the 1s) are stored, which avoids allocating memory for the vast majority of cells, which are 0.

In [None]:
X_binary  # sparse matrix of shape n_documents x n_words

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 3824759 stored elements and shape (51078, 115898)>
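
The figures in this output show how sparse the matrix is. The short computation below only reuses the shape and the number of stored elements reported above.

```python
n_docs, n_terms = 51078, 115898   # shape reported for X_binary
stored = 3824759                  # stored (non-zero) elements reported for X_binary

density = stored / (n_docs * n_terms)
terms_per_doc = stored / n_docs

print(f"density: {density:.4%}")
print(f"average number of distinct terms per document: {terms_per_doc:.1f}")
```

Roughly 0.065% of the cells are non-zero, which corresponds to about 75 distinct terms per document.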

In [None]:
vectorizer_binary  # holds the word -> column index mapping

### Transforming a query 

In [None]:
def vectorize_query_binary(query_text):
    query_tokens = clean_text(query_text) 
    query_str = " ".join(query_tokens)  # Join the tokens into one string so the vectorizer can transform it
    return vectorizer_binary.transform([query_str])  # Same vectorizer as for the documents, hence the same vocabulary


### Computing cosine similarity 

We compute the similarity between the query and every document

In [98]:
from sklearn.metrics.pairwise import cosine_similarity

def search_binary(query_text, top_k=10):
    query_vec = vectorize_query_binary(query_text)
    scores = cosine_similarity(query_vec, X_binary).flatten()
    top_indices = scores.argsort()[::-1][:top_k]
    return documents_df.iloc[top_indices][['doc_id', 'text']].assign(score=scores[top_indices]).to_dict(orient='records')


def search_binary_document(query_text, documents_df, top_k=10):
    """
    Binary search with enriched output, sorted by score.
    """
    raw_results = search_binary(query_text, top_k=top_k)

    if not raw_results:
        return []

    # Extract the doc_ids and their scores
    matched_ids = [r['doc_id'] for r in raw_results]
    score_dict = {r['doc_id']: r['score'] for r in raw_results}

    # Fetch the matching documents
    results = documents_df[documents_df["doc_id"].isin(matched_ids)].copy()

    # Heuristic split: the first sentence is used as the title, the rest as the abstract
    results["title"] = results["text"].apply(lambda t: t.split(". ")[0] if ". " in t else t[:100])
    results["abstract"] = results["text"].apply(lambda t: t if ". " not in t else ". ".join(t.split(". ")[1:]))

    # Attach the scores, then sort
    results["score"] = results["doc_id"].map(score_dict)
    results = results.sort_values(by="score", ascending=False)

    return results[["doc_id", "title", "abstract", "score"]].to_dict(orient="records")



In [99]:
### Example on a query: 

query = "origin of COVID-19"

search_binary_document(query_text=query, documents_df=documents_df, top_k=3)

[{'doc_id': 'gv1k7u7j',
  'title': 'Strategies to trace back the origin of COVID-19 ',
  'abstract': 'Strategies to trace back the origin of COVID-19 ',
  'score': 0.7745966692414834},
 {'doc_id': '450z0tv1',
  'title': 'The Anesthesiologist and COVID-19 ',
  'abstract': 'The Anesthesiologist and COVID-19 ',
  'score': 0.6666666666666669},
 {'doc_id': 's7idehep',
  'title': 'Medicine: before COVID-19, and after ',
  'abstract': 'Medicine: before COVID-19, and after ',
  'score': 0.6666666666666669}]

In [100]:
query = "origin of COVID-19"
results = search_binary_document(query, documents_df, top_k=3)
display_search_results(results, query)


Unnamed: 0,doc_id,score,title,abstract
0,gv1k7u7j,0.774597,Strategies to trace back the origin of COVID-19,Strategies to trace back the origin of COVID-19
1,450z0tv1,0.666667,The Anesthesiologist and COVID-19,The Anesthesiologist and COVID-19
2,s7idehep,0.666667,"Medicine: before COVID-19, and after","Medicine: before COVID-19, and after"
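
The scores above can be checked by hand. For binary vectors, cosine similarity reduces to the number of shared terms divided by the square root of the product of the two term counts. The sketch below assumes that cleaning keeps exactly the tokens listed in the code (stop words such as "of", "to", "the" and "back" removed); these token lists are an assumption, since the actual preprocessing output is not shown here.

```python
import math

def binary_cosine(query_terms, doc_terms):
    """Cosine similarity between two 0/1 term vectors, given as collections of terms."""
    q, d = set(query_terms), set(doc_terms)
    if not q or not d:
        return 0.0
    return len(q & d) / math.sqrt(len(q) * len(d))

# Assumed tokens after cleaning (hypothetical, for illustration)
query = ["origin", "covid", "19"]
doc_a = ["strategies", "trace", "origin", "covid", "19"]  # "Strategies to trace back the origin of COVID-19"
doc_b = ["anesthesiologist", "covid", "19"]               # "The Anesthesiologist and COVID-19"

print(binary_cosine(query, doc_a))  # 3 / sqrt(3 * 5)
print(binary_cosine(query, doc_b))  # 2 / sqrt(3 * 3)
```

Under these assumptions the two values are about 0.7746 and 0.6667, consistent with the scores reported for gv1k7u7j and 450z0tv1. It also shows why short titles that merely mention COVID-19 score so high: they share two of the three query terms and contain almost nothing else to dilute the match.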


In [101]:
results_binary = evaluate_model(search_binary, queries_df, qrels_df, k=10)

print("Bag-of-words (binary) model results:")
for metric, score in results_binary.items():
    print(f"- {metric}: {score:.4f}")


Bag-of-words (binary) model results:
- precision@k: 0.0633
- recall@k: 0.0024
- MRR: 0.3228
- nDCG@k: 0.3796


### Commentary

The binary bag-of-words model has very low precision (6%) and near-zero recall (0.2%): few relevant documents appear in the top results. A likely explanation is that the 0/1 representation captures neither how often a word occurs nor how informative it is, so the scores discriminate poorly between documents.

### Term frequency (TF): a variant of the bag-of-words model

- Each weight is the number of occurrences of the word in the document.

- The matrix is therefore weighted rather than binary.

- Cosine similarity is computed against a query vectorized the same way.
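
As a minimal illustration of what counting occurrences changes, the sketch below computes cosine similarity on raw term counts for two toy documents. The token lists are hypothetical; with a binary representation both documents would get the same score, since they contain the same distinct terms.

```python
import math
from collections import Counter

def tf_cosine(query_tokens, doc_tokens):
    """Cosine similarity between raw term-count vectors."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_d = math.sqrt(sum(c * c for c in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

query = ["covid", "origin"]
doc_once = ["covid", "origin", "bat", "market"]                        # each query term once
doc_repeated = ["covid", "covid", "covid", "origin", "bat", "market"]  # "covid" repeated

print(tf_cosine(query, doc_once))
print(tf_cosine(query, doc_repeated))
```

The document that repeats a query term scores higher, which is the intended effect, but it also means that a very frequent word can dominate the score.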

#### Building the document-term matrix from word frequencies in the documents 

In [102]:
from sklearn.feature_extraction.text import CountVectorizer

# Reuse the previously cleaned documents
tf_vectorizer = CountVectorizer(tokenizer=lambda x: x, lowercase=False)  # the tokens are already cleaned, so they are passed through as-is
X_tf = tf_vectorizer.fit_transform(documents_df["tokens"])




#### Function to vectorize the query with the same vocabulary 

In [103]:
def vectorize_query_tf(query_text):
    tokens = clean_text(query_text)
    return tf_vectorizer.transform([tokens])


#### Search function using cosine similarity 

In [104]:
from sklearn.metrics.pairwise import cosine_similarity

def search_tf(query_text, top_k=10):
    query_vec = vectorize_query_tf(query_text)
    scores = cosine_similarity(query_vec, X_tf).flatten()
    top_indices = scores.argsort()[::-1][:top_k]
    
    results = documents_df.iloc[top_indices].copy()
    results["score"] = scores[top_indices]
    
    # Split title / abstract
    results["title"] = results["text"].apply(lambda t: t.split(". ")[0] if ". " in t else t[:100])
    results["abstract"] = results["text"].apply(lambda t: t if ". " not in t else ". ".join(t.split(". ")[1:]))
    
    return results[["doc_id", "title", "abstract", "score"]].to_dict(orient="records")


#### Search example 

In [108]:
query = "origin of COVID-19"
results = search_tf(query, top_k=10)
display_search_results(results, query)


Unnamed: 0,doc_id,score,title,abstract
0,gv1k7u7j,0.774597,Strategies to trace back the origin of COVID-19,Strategies to trace back the origin of COVID-19
1,3gbxbg5f,0.666667,COVID-19 and Interconnectedness,COVID-19 and Interconnectedness
2,pttcysvc,0.666667,"COVID-19, a pandemic or not?","COVID-19, a pandemic or not?"
3,450z0tv1,0.666667,The Anesthesiologist and COVID-19,The Anesthesiologist and COVID-19
4,s7idehep,0.666667,"Medicine: before COVID-19, and after","Medicine: before COVID-19, and after"
5,s4dqx9en,0.6506,Plasma Metabolomic and Lipidomic Alterations Associated with COVID-19 The pandemic of the coronavirus disease 2019 (COVID-19) has become a global public health crisis,"COVID-19 is marked by its rapid progression from mild to severe conditions, particularly in the absence of adequate medical care. However, the physiological changes associated with COVID-19 are barely understood. In this study, we performed untargeted metabolomic and lipidomic analyses of plasma from a cohort of COVID-19 patients who had experienced different symptoms. We found the metabolite and lipid alterations exhibit apparent correlation with the course of disease in these COVID-19 patients, indicating that the development of COVID-19 affected patient metabolism. Moreover, many of the metabolite and lipid alterations, particularly ones associated with hepatic functions, have been found to align with the progress and severity of COVID-19. This work provides valuable knowledge about blood biomarkers associated with COVID-19 and potential therapeutic targets, and presents important resource for further studies of COVID-19 pathogenesis."
6,awitk3se,0.620337,"COVID-19 (Novel Coronavirus 2019) - recent trends The World Health Organization (WHO) has issued a warning that, although the 2019 novel coronavirus (COVID-19) from Wuhan City (China), is not pandemic, it should be contained to prevent the global spread","The COVID-19 virus was known earlier as 2019-nCoV. As of 12 February 2020, WHO reported 45,171 cases and 1115 deaths related to COVID-19. COVID-19 is similar to Severe Acute Respiratory Syndrome coronavirus (SARS-CoV) virus in its pathogenicity, clinical spectrum, and epidemiology. Comparison of the genome sequences of COVID-19, SARS-CoV, and Middle East Respiratory Syndrome coronavirus (MERS-CoV) showed that COVID-19 has a better sequence identity with SARS-CoV compared to MERS CoV. However, the amino acid sequence of COVID-19 differs from other coronaviruses specifically in the regions of 1ab polyprotein and surface glycoprotein or S-protein. Although several animals have been speculated to be a reservoir for COVID-19, no animal reservoir has been already confirmed. COVID-19 causes COVID-19 disease that has similar symptoms as SARS-CoV. Studies suggest that the human receptor for COVID-19 may be angiotensin-converting enzyme 2 (ACE2) receptor similar to that of SARS-CoV. The nucleocapsid (N) protein of COVID-19 has nearly 90% amino acid sequence identity with SARS-CoV. The N protein antibodies of SARS-CoV may cross react with COVID-19 but may not provide cross-immunity. In a similar fashion to SARS-CoV, the N protein of COVID-19 may play an important role in suppressing the RNA interference (RNAi) to overcome the host defense. This mini-review aims at investigating the most recent trend of COVID-19."
7,sjyrr2bn,0.604398,"COVID-19: A promising cure for the global panic Abstract The novel Coronavirus disease 2019 (COVID-19) is caused by SARS-CoV-2, which is the causative agent of a potentially fatal disease that is of great global public health concern","The outbreak of COVID-19 is wreaking havoc worldwide due to inadequate risk assessment regarding the urgency of the situation. The COVID-19 pandemic has entered a dangerous new phase. When compared with SARS and MERS, COVID-19 has spread more rapidly, due to increased globalization and adaptation of the virus in every environment. Slowing the spread of the COVID-19 cases will significantly reduce the strain on the healthcare system of the country by limiting the number of people who are severely sick by COVID-19 and need hospital care. Hence, the recent outburst of COVID-19 highlights an urgent need for therapeutics targeting SARS-CoV-2. Here, we have discussed the structure of virus; varying symptoms among COVID-19, SARS, MERS and common flu; the probable mechanism behind the infection and its immune response. Further, the current treatment options, drugs available, ongoing trials and recent diagnostics for COVID-19 have been discussed. We suggest traditional Indian medicinal plants as possible novel therapeutic approaches, exclusively targeting SARS-CoV-2 and its pathways."
8,84ib5ol5,0.598785,"Experience and suggestion of medical practices for burns during the outbreak of COVID-19 Abstract COVID-19 is spreading almost all over the world at present, which is infected by 2019 novel coronavirus (2019-nCoV)","It was epidemic firstly in Hubei province of China. The Chinese government has formally set COVID-19 in the statutory notification and control system for infectious diseases according to the Law of the People’s Republic of China on the Prevention and Treatment of Infectious Diseases. China currently is still struggling to defend COVID-19 though great achievements and progresses are being got. Burn Department is one of sections in clinics with the highest infectious risk of COVID-19. Based on our own experiences and the guidelines on the diagnosis and treatment of COVID-19 (7th Version) with other regulations and literatures, we put forward this experience and suggestion of medical practices for burns during the outbreak of COVID-19. We hope these experiences and suggestions could benefit for our international colleagues during the pandemic of the COVID-19."
9,3aozpdr3,0.596285,"CT Imaging and Differential Diagnosis of COVID-19 Since the beginning of 2020, coronavirus disease 2019 (COVID-19) has spread throughout China",This study explains the findings from lung computed tomography images of some patients with COVID-19 treated in this medical institution and discusses the difference between COVID-19 and other lung diseases.


#### Evaluating the TF search engine

In [107]:
results_tf = evaluate_model(search_tf, queries_df, qrels_df, k=10)

print("Bag-of-words (TF) model results:")
for metric, score in results_tf.items():
    print(f"- {metric}: {score:.4f}")


Bag-of-words (TF) model results:
- precision@k: 0.1533
- recall@k: 0.0060
- MRR: 0.3457
- nDCG@k: 0.4516


### Commentary 
The scores improve over the binary model: precision@10 rises from 0.06 to 0.15 and nDCG@10 from 0.38 to 0.45.

As noted earlier, the binary engine ignored word frequency. This engine still ignores context (semantics) and word order, but it does better than the binary model. It remains below the scores reported above for the Boolean engine (precision@10 of 0.32, nDCG@10 of 0.59). Note that evaluate_model skips queries for which a search returns no results, so the two averages may not cover the same set of queries.

### TF-IDF (Term Frequency-Inverse Document Frequency)

- Here we add term rarity to the previous engine (TF)
- We evaluate it with the same metrics: precision, recall, nDCG and MRR

**MRR**: Mean Reciprocal Rank. For each query, the score is the inverse of the rank of the first relevant document: 1 if it is at position 1, 1/2 if it is at position 2, and so on (0 if no relevant document is returned). Averaging these scores over all queries gives the MRR; a value of 1 means the engine always places a relevant document first.
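
As a small worked example of the MRR definition, the sketch below averages the reciprocal rank over three hypothetical queries. The function mirrors the `reciprocal_rank` helper defined earlier in the notebook; the document ids are made up.

```python
def reciprocal_rank(predicted, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(predicted, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

# Three hypothetical queries: first relevant hit at rank 1, at rank 2, and never
runs = [
    (["a", "b", "c"], {"a"}),
    (["x", "b", "c"], {"b"}),
    (["x", "y", "z"], {"q"}),
]
mrr = sum(reciprocal_rank(p, r) for p, r in runs) / len(runs)
print(mrr)  # (1 + 1/2 + 0) / 3 = 0.5
```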

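### Building the TF-IDF matrix 

Here, instead of raw frequencies, we use the weights tfidf = tf x idf, where tf is the number of occurrences of the word in the document (sometimes divided by the document length) and idf is the logarithm of the total number of documents divided by the number of documents in which the word appears. By default, scikit-learn's TfidfVectorizer uses a smoothed idf, ln((1 + N) / (1 + df)) + 1, and L2-normalizes each document vector.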
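
A minimal sketch of this weighting, using the formula that scikit-learn documents for TfidfVectorizer's default settings (raw counts for tf, smoothed idf, L2 normalization). It is an illustration on a hypothetical toy corpus, not the library's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with raw counts, smoothed idf and L2 normalization."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in df.items()}
    vectors = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf

# Toy corpus (hypothetical): "covid" is in every document, "origin" in only one
docs = [["covid", "origin"], ["covid", "children"], ["covid", "pandemic"]]
vectors, idf = tfidf_vectors(docs)
print(idf["covid"], idf["origin"])
print(vectors[0])
```

In the first document, the rare term "origin" ends up with a larger weight than "covid", which appears everywhere and therefore helps little to tell documents apart.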

In [109]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(documents_df["tokens"].apply(lambda tokens: " ".join(tokens)))

In [110]:
documents_df.shape

(51078, 3)

#### Vectorizing the query with the same vectorizer 

In [111]:
def vectorize_query_tfidf(query_text):
    tokens = clean_text(query_text)
    query_string = " ".join(tokens)
    return vectorizer_tfidf.transform([query_string])


#### Search by cosine similarity 

In [112]:
from sklearn.metrics.pairwise import cosine_similarity

def search_tfidf(query_text, top_k=10):
    query_vec = vectorize_query_tfidf(query_text)
    scores = cosine_similarity(query_vec, X_tfidf).flatten()
    top_indices = scores.argsort()[::-1][:top_k]
    
    results = documents_df.iloc[top_indices].copy()
    results["score"] = scores[top_indices]
    results["title"] = results["text"].apply(lambda t: t.split(". ")[0] if ". " in t else t[:100])
    results["abstract"] = results["text"].apply(lambda t: t if ". " not in t else ". ".join(t.split(". ")[1:]))
    
    return results[["doc_id", "title", "abstract", "score"]].to_dict(orient="records")


#### Test on the query 

In [114]:
query = "origin of COVID-19"
display_search_results(search_tfidf(query, top_k=3), query)


Unnamed: 0,doc_id,score,title,abstract
0,gv1k7u7j,0.660618,Strategies to trace back the origin of COVID-19,Strategies to trace back the origin of COVID-19
1,pttcysvc,0.584592,"COVID-19, a pandemic or not?","COVID-19, a pandemic or not?"
2,13veedct,0.554319,COVID-19 infection in children,COVID-19 infection in children


#### Evaluating the TF-IDF search engine 

In [115]:
results_tfidf = evaluate_model(search_tfidf, queries_df, qrels_df, k=10)

print("TF-IDF model results:")
for metric, score in results_tfidf.items():
    print(f"- {metric}: {score:.4f}")


TF-IDF model results:
- precision@k: 0.5167
- recall@k: 0.0203
- MRR: 0.6581
- nDCG@k: 0.7274


### Commentary

Adding term rarity through TF-IDF markedly improves the search engine. Every metric goes up compared with the TF model: precision@10 from 0.15 to 0.52, MRR from 0.35 to 0.66, and nDCG@10 from 0.45 to 0.73, which indicates that relevant documents are now ranked higher in the results.

By down-weighting words that appear in most documents, the model focuses on the terms that distinguish a query. To go further, we will explore more advanced preprocessing to refine the relevance of the returned documents.

### Advanced preprocessing applied to the TF-IDF engine 

We will now:

1. Apply lemmatization (reducing words to their canonical form) with spaCy to make the texts more uniform.

2. Use a synonym dictionary (for example UMLS or a simple medical thesaurus) to detect terminological variants that are common in medical documents. This should:

    - Broaden query coverage (retrieve more relevant documents).

    - Improve recall without sacrificing too much precision.
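
The synonym step could look like the sketch below: each query token is followed by its known variants before vectorization. The `SYNONYMS` table and the `expand_query` function are hypothetical placeholders; a real table would have to be derived from a resource such as UMLS.

```python
# Hypothetical synonym table, for illustration only
SYNONYMS = {
    "covid": ["coronavirus", "ncov"],
    "origin": ["source"],
}

def expand_query(tokens, synonyms=SYNONYMS):
    """Return the query tokens followed by their known synonyms, without duplicates."""
    expanded = []
    for token in tokens:
        for term in [token, *synonyms.get(token, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded

print(expand_query(["origin", "covid"]))
```

Expanding only the query keeps the document index unchanged, at the cost of giving every variant the same weight as the original term.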

### Lemmatizing the documents (title and abstract are already combined) 

In [118]:
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [119]:
import spacy
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Initialization
nlp = spacy.load("en_core_web_sm")
tokenizer = RegexpTokenizer(r'\w+')
stop_words = ENGLISH_STOP_WORDS

def preprocess_advanced(text):
    """
    Clean and lemmatize the text.
    """
    doc = nlp(text.lower())
    # Keep alphabetic tokens that are in neither stop-word list (scikit-learn's and spaCy's)
    tokens = [token.lemma_ for token in doc if token.is_alpha and token.text not in stop_words and not token.is_stop]
    return " ".join(tokens)


### Applying it to the documents 

In [None]:
from tqdm import tqdm

tqdm.pandas()

# Add a new "processed_text" column
documents_df["processed_text"] = documents_df["text"].progress_apply(preprocess_advanced)


  2%|▏         | 1022/51078 [00:31<26:10, 31.88it/s]