# Université Paul Sabatier

EMIND1G1 - Foundations of Information Retrieval

**Lab 3 (TP 3)**

Instructor: José G. Moreno

2023

## Lab 3 (TP 3). Evaluating an information retrieval system

Evaluation is a complex step in information retrieval. One of the conferences that has contributed most to progress in this area is TREC (http://trec.nist.gov/). In this lab we use one of its tools to evaluate search engines.

Evaluation requires a file containing the "ground truth" or "gold standard", usually called a qrel. For each query, this file lists the identifiers of the documents judged relevant and non-relevant. We also need the result files produced by the search engine under evaluation.

In this lab we use a single qrel file and several result files (each result file is evaluated separately).

Continuing from TP2, consider the sentence "Thomas and Mario are strikers playing in Munich". It is turned into three queries: "Thomas", "Mario" and "Munich". Each query has documents considered correct (relevant) and incorrect (non-relevant). Document retrieval is performed by your information retrieval system; deciding whether a document is relevant, however, is a manual step. We consider the following documents relevant for each query:

> **Thomas** and **Mario** are strikers playing in **Munich**
>
>Thomas <br>
>* http://simple.wikipedia.org/wiki/Thomas_Müller
>
>
>Mario <br>
>* http://simple.wikipedia.org/wiki/Mario_Gómez <br>
>* http://simple.wikipedia.org/wiki/Mario_Götze
>
>
>Munich <br>
>* http://simple.wikipedia.org/wiki/FC_Bayern_Munich

Now you only need to put your results for each query into the TREC format to evaluate them. For simplicity, we use the [pytrec_eval](https://github.com/cvangysel/pytrec_eval) library, a wrapper around the [trec_eval](http://trec.nist.gov/trec_eval/trec_eval.8.1.tar.gz) software.

For reference, here is the qrel file for the three queries above:

```
101 0 Thomas_Müller 1
101 0 Thomas_Edison 0
101 0 Thomas_the_Apostle 0
102 0 Mario_Gómez 1
102 0 Mario_Götze 1
103 0 FC_Bayern_Munich 1
```

Note that pytrec_eval takes the qrel as a dictionary rather than a file.

The first column is the query identifier (there are three distinct values, one per query), followed by a zero (0), then the identifier of the judged document (the title of the Wikipedia page), and finally a value indicating whether the document is relevant (1) or not (0). Note also that the qrel contains both relevant and non-relevant documents.

Next, create the result file from the output of the program you wrote in the previous labs. For reference, here is a result file from one system:
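Because pytrec_eval expects a dictionary, the qrel lines shown above can be converted with a few lines of Python. The sketch below is only an illustration: the helper name `parse_qrel` is hypothetical (it is not part of pytrec_eval), and it assumes whitespace-separated columns and document identifiers without spaces.

```python
def parse_qrel(text):
    """Convert qrel lines ("qid 0 docid relevance") into a nested dict."""
    qrel = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        qid, _iteration, docid, relevance = line.split()
        # pytrec_eval expects string query ids and integer relevance labels
        qrel.setdefault(qid, {})[docid] = int(relevance)
    return qrel


qrel_text = """
101 0 Thomas_Müller 1
101 0 Thomas_Edison 0
101 0 Thomas_the_Apostle 0
102 0 Mario_Gómez 1
102 0 Mario_Götze 1
103 0 FC_Bayern_Munich 1
"""

example_qrel = parse_qrel(qrel_text)
print(example_qrel["102"])  # {'Mario_Gómez': 1, 'Mario_Götze': 1}
```

The resulting dictionary has the same shape as the `qrel` variable declared by hand later in this notebook.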

```
101	Q0	Thomas_Edison	1	  5.5	STANDARD
101	Q0	Thomas_Müller	2	  4.4	STANDARD
101	Q0	Thomas_the_Apostle	3	  3.3	STANDARD
101	Q0	Isiah_Thomas	4	  2.2	STANDARD
101	Q0	Thomas_Aquinas	5	  1.1	STANDARD
102	Q0	Mario	1	  5.5	STANDARD
102	Q0	Super_Mario	2	  4.4	STANDARD
102	Q0	Super_Mario_Bros.	3	  3.3	STANDARD
102	Q0	Super_Mario_Bros._2	4	  2.2	STANDARD
102	Q0	Mario_(series)	5	  1.1	STANDARD
102	Q0	Super_Mario_World	6	  1.0	STANDARD
102	Q0	Super_Mario_Bros._3	7	  0.9	STANDARD
102	Q0	New_Super_Mario_Bros.	8	  0.8	STANDARD
102	Q0	Mario_Gómez	9	  0.7	STANDARD
102	Q0	Mario_Party_4	10	  0.6	STANDARD
103	Q0	Munich	1	  5.5	STANDARD
103	Q0	FC_Bayern_Munich	2	  4.4	STANDARD
103	Q0	Munich_Airport	3	  3.3	STANDARD
103	Q0	Munich_Agreement	4	  2.2	STANDARD
103	Q0	Munich_Rural_District	5	  1.1	STANDARD
```

As with the qrels, pytrec_eval takes a system's results as a dictionary rather than a file.

The first column is the query identifier (the same as in the qrel), followed by the literal `Q0`, then the identifier of the document retrieved by your system (the title of the Wikipedia page), then the document's rank in the results, then the similarity score given by the chosen weighting model, and finally the system identifier (your name, for example).

Once the qrel and result files are built, the trec_eval evaluation software can be used to obtain the evaluation results. For simplicity, however, we use pytrec_eval: declare the two dictionaries (qrel and run), create a ```RelevanceEvaluator``` and call its ```evaluate``` method, as shown in the example below.
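A result file in this six-column format can be turned into the dictionary form in the same way. This is a minimal sketch under stated assumptions: `parse_run` is a hypothetical helper name, columns are separated by tabs or spaces, and only the score is kept because the dictionary form maps each document to its score (the rank and system tag are dropped).

```python
def parse_run(text):
    """Convert TREC run lines into {query_id: {doc_id: score}}."""
    run = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        qid, _q0, docid, _rank, score, _tag = line.split()
        run.setdefault(qid, {})[docid] = float(score)
    return run


run_text = (
    "103\tQ0\tMunich\t1\t  5.5\tSTANDARD\n"
    "103\tQ0\tFC_Bayern_Munich\t2\t  4.4\tSTANDARD\n"
    "103\tQ0\tMunich_Airport\t3\t  3.3\tSTANDARD\n"
)

example_run = parse_run(run_text)
print(example_run)
# {'103': {'Munich': 5.5, 'FC_Bayern_Munich': 4.4, 'Munich_Airport': 3.3}}
```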

In [1]:
!pip install pytrec_eval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytrec_eval
  Downloading pytrec_eval-0.5.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pytrec_eval
  Building wheel for pytrec_eval (setup.py) ... done
  Created wheel for pytrec_eval: filename=pytrec_eval-0.5-cp39-cp39-linux_x86_64.whl size=293188 sha256=063f2ae6da9ed6bdb1e192942e1bf59742dd3585bc4c53c137f71e37830f2650
  Stored in directory: /root/.cache/pip/wheels/e9/91/35/6059501bca98e27e0b4f91ecaaff86c95ca7f4919ff22f0d54
Successfully built pytrec_eval
Installing collected packages: pytrec_eval
Successfully installed pytrec_eval-0.5


Creating the example files

In [2]:
import pytrec_eval
import json
import pandas as pd

In [3]:
qrel = {
    '101': {
        'Thomas_Müller': 1,
        'Thomas_Edison': 0,
        'Thomas_the_Apostle': 0,
    },
    '102': {
        'Mario_Gómez': 1,
        'Mario_Götze': 1,
    },
    '103': {
        'FC_Bayern_Munich': 1,
    },
}


In [4]:
run = {
    '101': {
        'Thomas_Edison': 5.5,
        'Thomas_Müller': 4.4,
        'Thomas_the_Apostle': 3.3,
        'Isiah_Thomas': 2.2,
        'Thomas_Aquinas': 1.1,
    },
    '102': {
        'Mario': 10.10,
        'Super_Mario': 9.9,
        'Super_Mario_Bros.': 8.8,
        'Super_Mario_Bros._2': 7.7,
        'Mario_(series)': 6.6,
        'Super_Mario_World': 5.5,
        'Super_Mario_Bros._3': 4.4,
        'New_Super_Mario_Bros.': 3.3,
        'Mario_Gómez': 2.2,
        'Mario_Party_4': 1.1,
    },
    '103': {
        'Munich': 5.5,
        'FC_Bayern_Munich': 4.4,
        'Munich_Airport': 3.3,
        'Munich_Agreement': 2.2,
        'Munich_Rural_District': 1.1,
    },
}


Evaluating the example

In [5]:
evaluator = pytrec_eval.RelevanceEvaluator(
    qrel, {'map', 'ndcg'})

pd.DataFrame(evaluator.evaluate(run)).T

query_id,map,ndcg
101,0.5,0.63093
102,0.055556,0.184576
103,0.5,0.63093


Each row corresponds to one of the three queries, and each column holds the value of one evaluation metric for that query.

### 1. Queries

Run the following queries through your system and generate the results in the format described above (the ```run``` variable):
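To see where these numbers come from, the two metrics can be recomputed by hand for query 102. The sketch below uses the textbook definitions of average precision and nDCG (gain divided by log2(rank + 1)); it is not pytrec_eval's implementation, it ignores trec_eval's tie-breaking rules, and the function names are chosen here for illustration only.

```python
import math


def ranked_docs(scores):
    """Document ids sorted by decreasing score."""
    return sorted(scores, key=scores.get, reverse=True)


def average_precision(judgments, scores):
    """Mean of precision@rank over the ranks where a relevant document appears."""
    n_relevant = sum(1 for rel in judgments.values() if rel > 0)
    if n_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs(scores), start=1):
        if judgments.get(doc, 0) > 0:
            hits += 1
            total += hits / rank
    return total / n_relevant


def ndcg(judgments, scores):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(judgments.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked_docs(scores), start=1))
    ideal = sorted((rel for rel in judgments.values() if rel > 0), reverse=True)
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Query 102 from the example: two relevant documents, one retrieved at rank 9
judgments_102 = {"Mario_Gómez": 1, "Mario_Götze": 1}
scores_102 = {
    "Mario": 10.10, "Super_Mario": 9.9, "Super_Mario_Bros.": 8.8,
    "Super_Mario_Bros._2": 7.7, "Mario_(series)": 6.6, "Super_Mario_World": 5.5,
    "Super_Mario_Bros._3": 4.4, "New_Super_Mario_Bros.": 3.3,
    "Mario_Gómez": 2.2, "Mario_Party_4": 1.1,
}

ap_102 = average_precision(judgments_102, scores_102)
ndcg_102 = ndcg(judgments_102, scores_102)
print(round(ap_102, 6), round(ndcg_102, 6))  # 0.055556 0.184576
```

The average precision is (1/9) / 2 because only one of the two relevant documents is retrieved, at rank 9; `Mario_Götze` is never returned and contributes zero.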

```
ID:100+i
Thomas and Mario are strikers playing in Munich

ID:200+i
Leo scored two goals and assisted Puyol to ensure a 4–0 quarter-final victory over Bayern

ID:300+i
Skype software for Mac

ID:400+i
Cowboys fans petition Obama to oust Jones

ID:500+i
Kate and Henry are known for being devoted to the Anglican church
```



### 2. Qrels

Use the qrel ***qreltp*** declared below.

In [6]:
qreltp = {
    '101': {
        'Thomas_Müller': 1,
        'Thomas_Edison': 0,
        'Thomas_the_Apostle': 0,
    },
    '102': {
        'Mario_Gómez': 1,
        'Mario_Götze': 1,
    },
    '103': {
        'FC_Bayern_Munich': 1,
    },
    '201': {
        'Lionel_Messi': 1,
    },
    '202': {
        'Carles_Puyol': 1,
    },
    '203': {
        'FC_Bayern_Munich': 1,
    },
    '301': {
        'Skype': 1,
    },
    '302': {
        'Mac_OS': 1,
    },
    '401': {
        'Dallas_Cowboys': 1,
    },
    '402': {
        'Barack_Obama': 1,
    },
    '403': {
        'Jerry_Jones': 1,
    },
    '501': {
        'Catherine_Duchess_of_Cambridge': 1,
    },
    '502': {
        'Prince_Harry': 1,
    },
    '503': {
        'Anglicanism': 1,
    },
}

### 3. Configurations
Generate at least 5 different configurations of your system with 100 results each and evaluate them. How do you explain your results?

### 4. Results
With the same 5 configurations, generate 1000 results and evaluate them. Do some metrics differ? Why?
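Comparing 100 and 1000 results only makes sense if the two runs really differ in their cutoff. One way to control this is to truncate a run dictionary explicitly. The helper below is a hypothetical sketch (the name `truncate_run` and the parameter `k` are not part of pytrec_eval); it keeps the `k` highest-scoring documents of every query.

```python
def truncate_run(run, k):
    """Keep only the k highest-scoring documents of every query."""
    truncated = {}
    for qid, scores in run.items():
        top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
        truncated[qid] = dict(top)
    return truncated


# Query 103 from the example run
sample_run = {
    "103": {
        "Munich": 5.5,
        "FC_Bayern_Munich": 4.4,
        "Munich_Airport": 3.3,
        "Munich_Agreement": 2.2,
        "Munich_Rural_District": 1.1,
    },
}

print(truncate_run(sample_run, 2))
# {'103': {'Munich': 5.5, 'FC_Bayern_Munich': 4.4}}
print(truncate_run(sample_run, 1))
# {'103': {'Munich': 5.5}}
```

With `k=1` the only relevant document for query 103 (`FC_Bayern_Munich`, rank 2) falls outside the cutoff, which illustrates how a deeper cutoff can change the metrics when relevant documents sit lower in the ranking.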

### 5. Analysis
Compare the results of the different configurations. Which metrics changed?
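Per-query tables are hard to compare across configurations, so it can help to average each metric over the queries. The sketch below assumes the per-query results have the nested shape `{query_id: {metric: value}}`; `mean_per_metric` is a hypothetical helper, and the sample values are the ones from the worked example above, not results of a real configuration.

```python
def mean_per_metric(per_query):
    """Average every metric over all evaluated queries."""
    totals = {}
    for metrics in per_query.values():
        for name, value in metrics.items():
            totals[name] = totals.get(name, 0.0) + value
    n_queries = len(per_query)
    return {name: total / n_queries for name, total in totals.items()}


# Per-query values taken from the example evaluation (queries 101 to 103)
example_results = {
    "101": {"map": 0.5, "ndcg": 0.63093},
    "102": {"map": 0.055556, "ndcg": 0.184576},
    "103": {"map": 0.5, "ndcg": 0.63093},
}

# One entry per configuration; add your own runs under names of your choice
results_by_config = {"example": example_results}
summaries = {name: mean_per_metric(res) for name, res in results_by_config.items()}

for name, summary in summaries.items():
    print(name, {metric: round(value, 4) for metric, value in summary.items()})
```

Note that queries missing from a run are simply absent from the per-query results, so the averages of two configurations are only comparable when they cover the same set of queries.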

In [7]:
!wget "https://drive.google.com/uc?id=16rd8rFNR5qtjaX_vtZ__nqopq7Q6pSFu&confirm=t&uuid=0cc61c9f-387b-4218-b1e1-6421fb83d11e&at=ALgDtsx9J1qAF9DesqbqTEOfuLsR:1675797919874" -O pd_index.zip
!unzip pd_index.zip

--2023-03-17 20:57:38--  https://drive.google.com/uc?id=16rd8rFNR5qtjaX_vtZ__nqopq7Q6pSFu&confirm=t&uuid=0cc61c9f-387b-4218-b1e1-6421fb83d11e&at=ALgDtsx9J1qAF9DesqbqTEOfuLsR:1675797919874
Resolving drive.google.com (drive.google.com)... 173.194.216.100, 173.194.216.138, 173.194.216.101, ...
Connecting to drive.google.com (drive.google.com)|173.194.216.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-14-9s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/l0gvbe8dcdtea55ppqkmo0h6hjtdkc2j/1679086650000/12754483704616509995/*/16rd8rFNR5qtjaX_vtZ__nqopq7Q6pSFu?uuid=0cc61c9f-387b-4218-b1e1-6421fb83d11e [following]
--2023-03-17 20:57:38--  https://doc-14-9s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/l0gvbe8dcdtea55ppqkmo0h6hjtdkc2j/1679086650000/12754483704616509995/*/16rd8rFNR5qtjaX_vtZ__nqopq7Q6pSFu?uuid=0cc61c9f-387b-4218-b1e1-6421fb83d11e
Resolving doc-14-9s-docs.googleusercontent.com

In [8]:
# install PyTerrier with pip
!pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier
# initialize the JVM
import pyterrier as pt
if not pt.started():
  pt.init()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-terrier
  Cloning https://github.com/terrier-org/pyterrier.git to /tmp/pip-install-kyfu05vc/python-terrier_de7a5854497047afa61a4f25ab449e5f
  Running command git clone --filter=blob:none --quiet https://github.com/terrier-org/pyterrier.git /tmp/pip-install-kyfu05vc/python-terrier_de7a5854497047afa61a4f25ab449e5f
  Resolved https://github.com/terrier-org/pyterrier.git to commit dc7997ed4bb4bbaf78f639a511bfe92fcd290c02
  Preparing metadata (setup.py) ... done
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting pyjnius>=1.4.2
  Downloading pyjnius-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 16.6 MB/s eta 0:00:00
Collecting matchpy
  Downloading matchpy-0.5.5-py3-none-any.

terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



In [9]:
# set the JAVA_HOME variable (os.environ persists in this process; the `!export` line runs in a separate subshell)
import os
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'
!export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

In [10]:
indexref = pt.autoclass("org.terrier.querying.IndexRef").of(os.path.join("/content/pd_index", "data.properties"))

In [11]:
!unzip pd_index2.zip

unzip:  cannot find or open pd_index2.zip, pd_index2.zip.zip or pd_index2.zip.ZIP.


In [12]:
indexref2 = pt.autoclass("org.terrier.querying.IndexRef").of(os.path.join("/content/pd_index2", "data.properties"))

In [13]:
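def systeme(paragraph, model, index):
  # One query per capitalized word of the paragraph
  mots = paragraph.split()
  mots_majuscules = [mot for mot in mots if mot[0].isupper()]

  # Build the retriever once and reuse it for every query word
  retriever = pt.BatchRetrieve(index, wmodel=model, metadata=["docno","title","url"])

  res = []
  for mot in mots_majuscules:
    # NOTE: the cutoff is hardcoded here to the top 100 results
    res.append(retriever.search(mot)[['title','score']].head(100))

  return res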

In [14]:
text1 = "Thomas and Mario are strikers playing in Munich"
text2 = "Leo scored two goals and assisted Puyol to ensure a 4–0 quarter-final victory over Bayern"
text3 = "Skype software for Mac"
text4 = "Cowboys fans petition Obama to oust Jones"
text5 = "Kate and Henry are known for being devoted to the Anglican church"

In [15]:
def creatdict(requetes,model,index):
  size = 100  # NOTE: never used below; the cutoff is the head(100) inside systeme
  cpt = 101
  dics = {}
  for r in requetes :
    docs = systeme(r, model, index)
    for d in range(len(docs)) :
      dic = {}
      num = cpt + (d)
      for t in range(len(docs[d]['title'])):
        dic[docs[d]['title'][t].replace(" ","_")] = docs[d]['score'][t]
      dics[str(num)] = dic

    cpt = 100 + cpt
  return dics
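The query identifiers produced by `creatdict` follow from two rules in the code above: `systeme` keeps the capitalized words of each sentence, and the counter starts at 101 and jumps by 100 per sentence. The standalone sketch below reproduces only that numbering, without any retrieval; the helper names `capitalized_words` and `query_ids` are introduced here for illustration.

```python
texts = [
    "Thomas and Mario are strikers playing in Munich",
    "Leo scored two goals and assisted Puyol to ensure a 4–0 quarter-final victory over Bayern",
    "Skype software for Mac",
    "Cowboys fans petition Obama to oust Jones",
    "Kate and Henry are known for being devoted to the Anglican church",
]


def capitalized_words(text):
    """Words whose first character is an uppercase letter."""
    return [word for word in text.split() if word[0].isupper()]


def query_ids(sentences):
    """Map each query id (100 * sentence number + word position) to its query word."""
    ids = {}
    for i, sentence in enumerate(sentences, start=1):
        for j, word in enumerate(capitalized_words(sentence), start=1):
            ids[str(100 * i + j)] = word
    return ids


ids = query_ids(texts)
print(ids)
# {'101': 'Thomas', '102': 'Mario', '103': 'Munich',
#  '201': 'Leo', '202': 'Puyol', '203': 'Bayern',
#  '301': 'Skype', '302': 'Mac',
#  '401': 'Cowboys', '402': 'Obama', '403': 'Jones',
#  '501': 'Kate', '502': 'Henry', '503': 'Anglican'}
```

These fourteen identifiers are the same keys as in `qreltp`, which is what lets the evaluator match each run entry to its relevance judgments.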

In [16]:
requetes = [text1,text2,text3,text4,text5]

In [17]:
# 3- Configurations
# configuration 1
run1 = creatdict(requetes,"BM25",indexref)
evaluator = pytrec_eval.RelevanceEvaluator(qreltp, {'map', 'ndcg','P_1000'})
pd.DataFrame(evaluator.evaluate(run1)).T

query_id,map,P_1000,ndcg
101,0.0,0.0,0.0
102,0.0,0.0,0.0
103,0.0,0.0,0.0
201,0.0,0.0,0.0
202,0.5,0.001,0.63093
203,0.029412,0.001,0.194959
301,1.0,0.001,1.0
302,1.0,0.001,1.0
401,0.018182,0.001,0.172195
402,0.111111,0.001,0.30103


In [18]:
# configuration 2
run2 = creatdict(requetes,"LGD",indexref)
evaluator = pytrec_eval.RelevanceEvaluator(qreltp, {'map', 'ndcg','P_1000'})
pd.DataFrame(evaluator.evaluate(run2)).T

query_id,map,P_1000,ndcg
101,0.0,0.0,0.0
102,0.0,0.0,0.0
103,0.0,0.0,0.0
201,0.0,0.0,0.0
202,0.5,0.001,0.63093
203,0.03125,0.001,0.19824
301,1.0,0.001,1.0
302,1.0,0.001,1.0
401,0.018182,0.001,0.172195
402,0.2,0.001,0.386853


In [19]:
# configuration 3
run3 = creatdict(requetes,"IFB2",indexref)
evaluator = pytrec_eval.RelevanceEvaluator(qreltp, {'map', 'ndcg','P_1000'})
pd.DataFrame(evaluator.evaluate(run3)).T

query_id,map,P_1000,ndcg
101,0.0,0.0,0.0
102,0.0,0.0,0.0
103,0.0,0.0,0.0
201,0.0,0.0,0.0
202,0.5,0.001,0.63093
203,0.03125,0.001,0.19824
301,1.0,0.001,1.0
302,1.0,0.001,1.0
401,0.018182,0.001,0.172195
402,0.2,0.001,0.386853


In [20]:
# configuration 4
run4 = creatdict(requetes,"DFIC",indexref)
evaluator = pytrec_eval.RelevanceEvaluator(qreltp, {'map', 'ndcg','P_1000'})
pd.DataFrame(evaluator.evaluate(run4)).T

query_id,map,P_1000,ndcg
101,0.0,0.0,0.0
102,0.0,0.0,0.0
103,0.012346,0.001,0.157293
201,0.0,0.0,0.0
202,0.5,0.001,0.63093
203,0.111111,0.001,0.30103
301,1.0,0.001,1.0
302,0.5,0.001,0.63093
401,0.021739,0.001,0.180031
402,1.0,0.001,1.0


In [21]:
# configuration 5
run5 = creatdict(requetes,"DLH",indexref)
evaluator = pytrec_eval.RelevanceEvaluator(qreltp, {'map', 'ndcg','P_1000'})
pd.DataFrame(evaluator.evaluate(run5)).T

query_id,map,P_1000,ndcg
101,0.0,0.0,0.0
102,0.0,0.0,0.0
103,0.0,0.0,0.0
201,0.0,0.0,0.0
202,0.333333,0.001,0.5
203,0.043478,0.001,0.218104
301,1.0,0.001,1.0
302,0.5,0.001,0.63093
401,0.020408,0.001,0.177184
402,0.090909,0.001,0.278943


In [22]:
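# 4- Results
def creatdict2(requetes,model,index):
  size = 1000  # NOTE: never used below; systeme still truncates to head(100), so these runs match the 100-result runs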
  cpt = 101
  dics = {}
  for r in requetes :
    docs = systeme(r, model, index)
    for d in range(len(docs)) :
      dic = {}
      num = cpt + (d)
      for t in range(len(docs[d]['title'])):
        dic[docs[d]['title'][t].replace(" ","_")] = docs[d]['score'][t]
      dics[str(num)] = dic

    cpt = 100 + cpt
  return dics

In [23]:
# configuration 1
run1 = creatdict2(requetes,"BM25",indexref)
evaluator = pytrec_eval.RelevanceEvaluator(qreltp, {'map', 'ndcg','P_1000'})
pd.DataFrame(evaluator.evaluate(run1)).T

query_id,map,P_1000,ndcg
101,0.0,0.0,0.0
102,0.0,0.0,0.0
103,0.0,0.0,0.0
201,0.0,0.0,0.0
202,0.5,0.001,0.63093
203,0.029412,0.001,0.194959
301,1.0,0.001,1.0
302,1.0,0.001,1.0
401,0.018182,0.001,0.172195
402,0.111111,0.001,0.30103


In [24]:
# configuration 2
run2 = creatdict2(requetes,"LGD",indexref)
evaluator = pytrec_eval.RelevanceEvaluator(qreltp, {'map', 'ndcg','P_1000'})
pd.DataFrame(evaluator.evaluate(run2)).T

query_id,map,P_1000,ndcg
101,0.0,0.0,0.0
102,0.0,0.0,0.0
103,0.0,0.0,0.0
201,0.0,0.0,0.0
202,0.5,0.001,0.63093
203,0.03125,0.001,0.19824
301,1.0,0.001,1.0
302,1.0,0.001,1.0
401,0.018182,0.001,0.172195
402,0.2,0.001,0.386853


In [25]:
# configuration 3
run3 = creatdict2(requetes,"IFB2",indexref)
evaluator = pytrec_eval.RelevanceEvaluator(qreltp, {'map', 'ndcg','P_1000'})
pd.DataFrame(evaluator.evaluate(run3)).T

query_id,map,P_1000,ndcg
101,0.0,0.0,0.0
102,0.0,0.0,0.0
103,0.0,0.0,0.0
201,0.0,0.0,0.0
202,0.5,0.001,0.63093
203,0.03125,0.001,0.19824
301,1.0,0.001,1.0
302,1.0,0.001,1.0
401,0.018182,0.001,0.172195
402,0.2,0.001,0.386853


In [26]:
# configuration 4
run4 = creatdict2(requetes,"DFIC",indexref)
evaluator = pytrec_eval.RelevanceEvaluator(qreltp, {'map', 'ndcg','P_1000'})
pd.DataFrame(evaluator.evaluate(run4)).T

query_id,map,P_1000,ndcg
101,0.0,0.0,0.0
102,0.0,0.0,0.0
103,0.012346,0.001,0.157293
201,0.0,0.0,0.0
202,0.5,0.001,0.63093
203,0.111111,0.001,0.30103
301,1.0,0.001,1.0
302,0.5,0.001,0.63093
401,0.021739,0.001,0.180031
402,1.0,0.001,1.0


In [27]:
# configuration 5
run5 = creatdict2(requetes,"DLH",indexref)
evaluator = pytrec_eval.RelevanceEvaluator(qreltp, {'map', 'ndcg','P_1000'})
pd.DataFrame(evaluator.evaluate(run5)).T

query_id,map,P_1000,ndcg
101,0.0,0.0,0.0
102,0.0,0.0,0.0
103,0.0,0.0,0.0
201,0.0,0.0,0.0
202,0.333333,0.001,0.5
203,0.043478,0.001,0.218104
301,1.0,0.001,1.0
302,0.5,0.001,0.63093
401,0.020408,0.001,0.177184
402,0.090909,0.001,0.278943


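5- Analysis and conclusion:

The evaluation tables suggest that "DFIC" is the best-performing model: it is the only configuration with a non-zero score on query 103 (MAP 0.012346) and it reaches a MAP of 1.0 on query 402, although it drops to 0.5 on query 302, where BM25, LGD and IFB2 reach 1.0.

There is no difference between the configurations with size = 100 and size = 1000: the tables are identical. This is expected with the code above, because `systeme` always keeps `head(100)` and the `size` variable is never used, so both series of runs contain the same documents.

Moreover, adding the "P_1000" metric brings no further information about relevance: it only takes the values 0.0 and 0.001, i.e. it only indicates whether one relevant document was retrieved.

Finally, the results depend on the queries that were constructed, and the metrics used broadly agree with one another on the results obtained.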