# SDM With PL2

### Step 1: Import everything and load variables

In [1]:
import pyterrier as pt
import pandas as pd
from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory, persist_and_normalize_run
import json
from tqdm import tqdm

ensure_pyterrier_is_loaded()
input_directory, output_directory = get_input_directory_and_output_directory('./iranthology-dataset-tira')


Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


I will use a small hardcoded example located in ./iranthology-dataset-tira.
The output directory is /tmp/


### Step 2: Load the Data

In [9]:
queries = pt.io.read_topics(input_directory + '/milestone-1-topics.xml', format='trecxml')

#qrels = open(input_directory + '/qrels.txt', "r")

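# Read one JSON object per line; the context manager closes the file afterwards.
with open(input_directory + '/documents.jsonl', 'r') as f:
    documents = [json.loads(line) for line in f]
# Keep only the fields we want to index.
documents = [{'docno': d['docno'], 'text': d['text'], 'title': d['original_document']['title'], 'abstract': d['original_document']['abstract']} for d in documents]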

print('We look at the first document:\n')
print(documents[0])

Step 2: Load the data.
We look at the first document:

{'docno': '2019.sigirconf_workshop-2019birndl.0', 'text': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019 ', 'title': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019', 'abstract': ''}


### Step 3: Create the Index

In [4]:
print('Step 3: Create the Index.')

!rm -Rf ./index
iter_indexer = pt.IterDictIndexer("./index", meta={'docno' : 100, 'title': 10240, 'abstract': 10240, 'text': 10240}, blocks=True)
index_ref = iter_indexer.index(tqdm(documents))

Step 3: Create the Index.


100%|█████████████████████████████████████████████| 53673/53673 [01:06<00:00, 811.33it/s]


11:54:57.297 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 3 empty documents
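
The index above is built with `blocks=True`, which stores term positions in addition to term frequencies. The proximity operators used later by SDM need these positions. The following is a minimal sketch of the idea, not Terrier's index format: `build_positional_index`, `docs_with_adjacent_terms`, and the toy documents are hypothetical names made up for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {docno: [positions]} (lowercased, whitespace tokens)."""
    index = defaultdict(lambda: defaultdict(list))
    for docno, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][docno].append(pos)
    return index

def docs_with_adjacent_terms(index, first, second):
    """Return docnos in which `second` occurs directly after `first`."""
    hits = []
    for docno, positions in index[first].items():
        following = set(index[second].get(docno, []))
        if any(p + 1 in following for p in positions):
            hits.append(docno)
    return sorted(hits)

# Toy documents (illustrative only, not taken from the dataset).
toy_docs = {
    'd1': 'detecting fake news on social media',
    'd2': 'news about fake reviews',
}
toy_index = build_positional_index(toy_docs)
print(docs_with_adjacent_terms(toy_index, 'fake', 'news'))
```

Without stored positions, both toy documents would look the same for the query "fake news", because both contain the two terms.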


### PL2

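PL2 is a probabilistic weighting model from the Divergence from Randomness (DFR) framework. It ranks documents by how likely they are to be relevant to a query, based on how often the query terms occur in a document and how they are distributed across the collection.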

In [48]:
pl2 = pt.BatchRetrieve(index_ref, wmodel="PL2")
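
To make the scoring more concrete, here is a rough sketch of the textbook PL2 formula for a single query term in a single document. It is an illustration under assumptions, not Terrier's implementation: the function name `pl2_term_score`, its parameters, and the toy numbers are ours, `tf` must be greater than zero, and `c` is the length-normalisation parameter.

```python
import math

def pl2_term_score(tf, doc_len, avg_doc_len, coll_freq, n_docs, c=1.0):
    """Textbook PL2 score of one query term in one document (tf > 0)."""
    # Normalisation 2: rescale tf by document length relative to the average.
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)
    # Mean of the Poisson model: expected occurrences of the term per document.
    lam = coll_freq / n_docs
    # Information content under the Poisson model (Stirling approximation).
    info = (tfn * math.log2(tfn / lam)
            + (lam - tfn) * math.log2(math.e)
            + 0.5 * math.log2(2.0 * math.pi * tfn))
    # Laplace after-effect: dampen the gain by 1 / (tfn + 1).
    return info / (tfn + 1.0)

# Toy numbers (made up): a rare term versus a very common term.
rare = pl2_term_score(tf=3, doc_len=100, avg_doc_len=100, coll_freq=1000, n_docs=50000)
common = pl2_term_score(tf=3, doc_len=100, avg_doc_len=100, coll_freq=100000, n_docs=50000)
print(rare > common)
```

A term that is rare in the collection but frequent in one document deviates strongly from the random (Poisson) expectation, so it contributes more to the score.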

### Step 4: Create Retrieval Pipeline

Originally we wanted to try out neural scoring systems like BERT, which take a much more context-based approach to ranking. The notebook and Docker setup restricted us, which is why we decided against it. SDM (Sequential Dependence Model) rewrites the query by adding proximity constraints for query terms that appear next to each other, so documents in which those terms occur close together score higher. This is why we thought it could be more accurate. We tried SDM query expansion together with different scoring systems; this notebook uses PL2.

In [49]:
sdm = pt.rewrite.SequentialDependence()
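
The following sketch illustrates the kind of rewriting SDM performs. It is a simplification under assumptions: `sdm_expand` is a hypothetical helper of ours, it ignores the weights that SDM assigns to the different parts of the query, and the query that PyTerrier actually produces differs in detail. `#1(...)` stands for an exact ordered phrase and `#uw8(...)` for an unordered window of 8 tokens.

```python
def sdm_expand(query, window=8):
    """Append ordered and unordered proximity operators for adjacent term pairs."""
    terms = query.split()
    # Adjacent pairs: (t1, t2), (t2, t3), ...
    pairs = [f'{a} {b}' for a, b in zip(terms, terms[1:])]
    ordered = [f'#1({p})' for p in pairs]            # exact bigram
    unordered = [f'#uw{window}({p})' for p in pairs]  # both terms within a window
    return ' '.join(terms + ordered + unordered)

print(sdm_expand('fake news detection'))
```

A single-term query has no adjacent pairs, so the sketch returns it unchanged.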

The query is first expanded with SDM, then the documents are ranked with the PL2 model.

In [None]:
retrieval_pipeline = sdm >> pl2
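
The `>>` operator chains two stages so that the output of the left one becomes the input of the right one. As a rough sketch of how such an operator can be built in Python (the `Stage` class and the toy stages are hypothetical and are not PyTerrier's transformer classes):

```python
class Stage:
    """A pipeline stage that wraps a plain function."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, data):
        return self.fn(data)

    def __rshift__(self, other):
        # `self >> other` returns a new stage that runs self first, then other.
        return Stage(lambda data: other(self(data)))

# Toy stages (illustrative only).
lowercase = Stage(str.lower)
tokenize = Stage(str.split)

toy_pipeline = lowercase >> tokenize
print(toy_pipeline('Fake News'))
```

In the notebook, `sdm` plays the role of the first stage (query rewriting) and `pl2` the role of the second (retrieval).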

### Step 5: Create the run

In [51]:
print('Step 5: Create Run.')

run = retrieval_pipeline(queries)

Step 5: Create Run.


In [52]:
print('We look at the first 10 results of the run (query has been expanded):\n')
run.head(10)

We look at the first 10 results of the run (query has been expanded):



| | qid | docid | docno | rank | score | query | query_0 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 14568 | 2019.wsdm_conference-2019.40 | 0 | 15.375928 | relevant documents show bigram fake news synon... | relevant documents show the bigram of fake ne... |
| 1 | 1 | 33491 | 2018.wwwconf_conference-2018c.140 | 1 | 14.010381 | relevant documents show bigram fake news synon... | relevant documents show the bigram of fake ne... |
| 2 | 1 | 21754 | 2020.cikm_conference-2020.118 | 2 | 13.814559 | relevant documents show bigram fake news synon... | relevant documents show the bigram of fake ne... |
| 3 | 1 | 14643 | 2019.wsdm_conference-2019.115 | 3 | 13.800234 | relevant documents show bigram fake news synon... | relevant documents show the bigram of fake ne... |
| 4 | 1 | 7566 | 2008.sigirconf_conference-2008.116 | 4 | 13.045042 | relevant documents show bigram fake news synon... | relevant documents show the bigram of fake ne... |
| 5 | 1 | 21379 | 2009.cikm_conference-2009.191 | 5 | 12.202793 | relevant documents show bigram fake news synon... | relevant documents show the bigram of fake ne... |
| 6 | 1 | 30099 | 2019.wwwconf_conference-2019c.232 | 6 | 12.099965 | relevant documents show bigram fake news synon... | relevant documents show the bigram of fake ne... |
| 7 | 1 | 17492 | 2021.ecir_conference-20212.9 | 7 | 11.895309 | relevant documents show bigram fake news synon... | relevant documents show the bigram of fake ne... |
| 8 | 1 | 13000 | 2020.clef_conference-2020w.52 | 8 | 11.771914 | relevant documents show bigram fake news synon... | relevant documents show the bigram of fake ne... |
| 9 | 1 | 14162 | 2020.fire_conference-2020w.66 | 9 | 11.751446 | relevant documents show bigram fake news synon... | relevant documents show the bigram of fake ne... |


### Step 6: Persist Run

In [135]:
print('Step 6: Persist Run.')

persist_and_normalize_run(run, output_file=output_directory, system_name='multi-field', depth=100)

Step 6: Persist Run.
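
The notebook does not show the file that is written. Assuming it resembles the common TREC run format (`qid Q0 docno rank score system`), which is an assumption on our side, the normalisation could look roughly like the sketch below. `to_trec_lines` is a hypothetical helper, not part of the TIRA library, and ranks starting at 1 are also an assumption.

```python
from itertools import groupby

def to_trec_lines(rows, system_name, depth=100):
    """Sort by score per query, cut at `depth`, and format TREC-style lines."""
    lines = []
    # Sort by query id, then by descending score, so groupby sees each query once.
    ordered = sorted(rows, key=lambda r: (r['qid'], -r['score']))
    for qid, group in groupby(ordered, key=lambda r: r['qid']):
        for rank, row in enumerate(list(group)[:depth], start=1):
            lines.append(f"{qid} Q0 {row['docno']} {rank} {row['score']} {system_name}")
    return lines

# Two rows from the run shown above, deliberately out of order.
toy_rows = [
    {'qid': '1', 'docno': '2018.wwwconf_conference-2018c.140', 'score': 14.010381},
    {'qid': '1', 'docno': '2019.wsdm_conference-2019.40', 'score': 15.375928},
]
for line in to_trec_lines(toy_rows, system_name='multi-field'):
    print(line)
```

The `depth` argument plays the same role as `depth=100` in the call above: only the top results per query are kept.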


### Self-Reflection

#### Iris

I spent a lot of my time on Milestone 3 trying to configure Docker and the other things I needed in order to work with PyTerrier. In the end everything worked out, but it took an awful lot of time, which left less for the actual PyTerrier work. It was difficult to figure out ways to rank the documents, especially because there wasn't much information beyond the PyTerrier documentation. Still, it involved a lot of trying out different approaches and learning about them.

#### Justin

The tasks for Milestone 3 were a bit more fun for me because we could try different approaches to find a better retrieval pipeline. Unfortunately, we had problems understanding the tutorial, which dataset we should use, and whether we had to register it in TIRA. We then thought we could evaluate our approaches directly in TIRA, and we wondered why we did not get results instantly. Once we knew that the evaluation is done manually, we could concentrate on finding approaches. But we did not find out how to evaluate our approaches correctly.