# IR Lab Tutorial: Data Access from Java (or any other language)

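This tutorial shows how to use the outputs of [TIRA](https://www.tira.io)/[TIREx](https://www.tira.io) components from any programming language. Each `tira-cli download` call prints the directory that holds the downloaded files, which can then be inspected and read with standard tooling.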

### Basics: Access to Documents and Queries

In [4]:
!tira-cli download --dataset longeval-tiny-train-20240315-training

INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
/root/.tira/extracted_datasets/None/longeval-tiny-train-20240315-training/input-data


In [5]:
!ls /root/.tira/extracted_datasets/None/longeval-tiny-train-20240315-training/input-data

documents.jsonl.gz  metadata.json  queries.jsonl  queries.xml


In [6]:
!zcat /root/.tira/extracted_datasets/None/longeval-tiny-train-20240315-training/input-data/documents.jsonl.gz|head -2 

{"docno": "doc062200109610", "text": "\n\nEDF\n-\nGDF School-Valentine (25480)\n- Opening of electricity and gas meter Opening of your electricity or gas meter at \u00c9cole-Valentin on the Enedis/ErDF or GrDF network with papernest Free and non-binding service Announcement\n- papernest is not a partner of EDF.\nThank you.\nYour request has been taken into account A counsellor will call you back to the I understood\nIt seems that there is an error with our service Try again Opening your electricity or gas meter at \u00c9cole-Valentin on the Enedis/ErDF or GrDF network\nwith agence-france-electricite.fr Call the Me to call back Simple and quick: 5 minutes is enough No commitment or cancellation fee On 13 users Announcement\n- agency-france-electricite.fr\nis not a partner of Edf Contacts and rates of Engie gas offers to \u00c9cole-Valentin\nEngie\n, formerly SFM\nSuez, is one of the main suppliers of energy in Franche-Comt\u00e9 and throughout France.\nThe company emerged from the merge
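
The same file can be read programmatically by decompressing it and parsing one JSON object per line. The following is a minimal sketch in Python that uses only the standard library; the helper name `iter_jsonl` is hypothetical, and the same pattern should carry over to Java or other languages with a gzip stream and a JSON parser.

```python
import gzip
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one dict per non-empty line of a .jsonl or .jsonl.gz file."""
    path = Path(path)
    # gzip.open in text mode decompresses on the fly; plain files use open().
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Directory printed by the `tira-cli download` call above.
    data_dir = Path(
        "/root/.tira/extracted_datasets/None/"
        "longeval-tiny-train-20240315-training/input-data"
    )
    if data_dir.exists():
        for doc in iter_jsonl(data_dir / "documents.jsonl.gz"):
            # Each document has the keys "docno" and "text" (see output above).
            print(doc["docno"], len(doc["text"]))
            break
```

Streaming with a generator avoids loading the whole collection into memory, which matters for the larger document files.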

### Advanced: Query Performance Prediction

Paper: [An Enhanced Evaluation Framework for Query Performance Prediction](https://iiia.dei.unipd.it/research/papers/2021/ECIR2021-FZCFS.pdf)

In [10]:
!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/qpptk/all-predictors

INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/qpptk-all-predictors-clef-labs.zip
	This is only used for last spot checks before archival to Zenodo.
Download: 100%|█████████████████████████████| 969k/969k [00:00<00:00, 3.82MiB/s]
Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/qpptk
/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/qpptk/2024-02-27-21-19-19/output


In [13]:
!head -2 /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/qpptk/2024-02-27-21-19-19/output/queries.jsonl

{"qid":"q06223196","max-idf":4.0958698531,"avg-idf":3.1749690458,"scq":80.9333491421,"max-scq":47.6725992603,"avg-scq":40.4666745711,"var":3.9709926948,"max-var":2.0144139575,"avg-var":1.9854963474,"wig+5":6.182578941,"nqc+5":0.0088258559,"smv+5":0.0066671207,"clarity+5+100":4.5015133219,"wig+10":5.9957913108,"nqc+10":0.0187822357,"smv+10":0.0156152256,"clarity+10+100":4.4665091851,"wig+20":5.4968548214,"nqc+20":0.0439798072,"smv+20":0.0408074922,"clarity+20+100":4.4414408844,"wig+50":5.0234189643,"nqc+50":0.0423631289,"smv+50":0.0329943025,"clarity+50+100":4.2113685018,"wig+100":4.7201972152,"nqc+100":0.0396683849,"smv+100":0.0257729994,"clarity+100+100":4.0947997525,"wig+1000":3.4686269255,"nqc+1000":0.0437217743,"smv+1000":0.0317818778,"clarity+1000+100":3.7476998003}
{"qid":"q062228","max-idf":3.4870977938,"avg-idf":3.4870977938,"scq":44.6419616273,"max-scq":44.6419616273,"avg-scq":44.6419616273,"var":2.5749288345,"max-var":2.5749288345,"avg-var":2.5749288345,"wig+5":5.992612212,"n
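
Each record holds one value per predictor, keyed by the predictor name. As an illustration, the sketch below extracts a single predictor and orders the queries by its value. The sample records are shortened copies of the two lines above, and the function name `predictor_scores` is hypothetical.

```python
import json

# Shortened copies of the two records shown above (only three predictors kept).
SAMPLE = """\
{"qid":"q06223196","max-idf":4.0958698531,"avg-idf":3.1749690458,"wig+5":6.182578941}
{"qid":"q062228","max-idf":3.4870977938,"avg-idf":3.4870977938,"wig+5":5.992612212}"""


def predictor_scores(lines, predictor):
    """Map qid -> value of one predictor, skipping records that lack it."""
    scores = {}
    for line in lines:
        record = json.loads(line)
        if predictor in record:
            scores[record["qid"]] = record[predictor]
    return scores


scores = predictor_scores(SAMPLE.splitlines(), "avg-idf")
# Order the queries from the lowest to the highest predictor value.
ranking = sorted(scores, key=scores.get)
print(ranking)  # ['q06223196', 'q062228']
```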

### Advanced: Query Segmentation

Paper: [Query Segmentation Revisited](https://webis.de/publications.html?q=segmentation#hagen_2011a)

In [39]:
!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/ows/query-segmentation-hyb-a

INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/ows-query-segmentation-hyb-a-clef-labs.zip
	This is only used for last spot checks before archival to Zenodo.
Download: 100%|█████████████████████████████| 275k/275k [00:00<00:00, 2.49MiB/s]
Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/ows
/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/ows/2024-02-25-08-12-47/output


In [42]:
!head -5 /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/ows/2024-02-25-08-12-47/output/queries.jsonl

{"qid":"q062214880","originalQuery":"papillomavirus","segmentationApproach":"hyb-a","segmentation":["papillomavirus"]}
{"qid":"q06225490","originalQuery":"weight of a car","segmentationApproach":"hyb-a","segmentation":["weight of","a car"]}
{"qid":"q06225371","originalQuery":"solar panel self-consumption","segmentationApproach":"hyb-a","segmentation":["solar panel","self-consumption"]}
{"qid":"q062213796","originalQuery":"Potato patty","segmentationApproach":"hyb-a","segmentation":["potato","patty"]}
{"qid":"q062214645","originalQuery":"my job centre","segmentationApproach":"hyb-a","segmentation":["my","job centre"]}
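
One way to use the `segmentation` field is to turn multi-word segments into phrases when formulating a query. The sketch below assumes a retrieval system that treats double-quoted text as a phrase; that syntax and the function name `to_phrase_query` are assumptions, not part of the segmentation output.

```python
import json

# One record copied from the output above.
SAMPLE = (
    '{"qid":"q06225371","originalQuery":"solar panel self-consumption",'
    '"segmentationApproach":"hyb-a",'
    '"segmentation":["solar panel","self-consumption"]}'
)


def to_phrase_query(segmentation):
    """Quote multi-word segments and leave single-word segments bare."""
    parts = []
    for segment in segmentation:
        parts.append(f'"{segment}"' if " " in segment else segment)
    return " ".join(parts)


record = json.loads(SAMPLE)
print(record["qid"], to_phrase_query(record["segmentation"]))
# q06225371 "solar panel" self-consumption
```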


### Advanced: Query Intent

Paper: [ORCAS-I: Queries Annotated with Intent using Weak Supervision](https://dl.acm.org/doi/abs/10.1145/3477495.3531737)

In [43]:
!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/dossier/pre-retrieval-query-intent

INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/dossier-pre-retrieval-query-intent-clef-labs.zip
	This is only used for last spot checks before archival to Zenodo.
Download: 100%|█████████████████████████████| 272k/272k [00:00<00:00, 2.59MiB/s]
Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/dossier
/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/dossier/2024-02-26-19-27-33/output


In [49]:
!head -5 /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/dossier/2024-02-26-19-27-33/output/queries.jsonl

{"qid":"q06223196","intent_prediction":"Abstain"}
{"qid":"q062228","intent_prediction":"Abstain"}
{"qid":"q062287","intent_prediction":"Abstain"}
{"qid":"q06223261","intent_prediction":"Transactional"}
{"qid":"q062291","intent_prediction":"Abstain"}
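
A quick way to get an overview of the predictions is to count how often each label occurs. The following sketch does this for the five records shown above; `intent_counts` is a hypothetical helper name.

```python
import json
from collections import Counter

# The five records from the output above.
SAMPLE = """\
{"qid":"q06223196","intent_prediction":"Abstain"}
{"qid":"q062228","intent_prediction":"Abstain"}
{"qid":"q062287","intent_prediction":"Abstain"}
{"qid":"q06223261","intent_prediction":"Transactional"}
{"qid":"q062291","intent_prediction":"Abstain"}"""


def intent_counts(lines):
    """Count how many queries received each intent label."""
    return Counter(
        json.loads(line)["intent_prediction"] for line in lines if line.strip()
    )


counts = intent_counts(SAMPLE.splitlines())
print(counts.most_common())  # [('Abstain', 4), ('Transactional', 1)]
```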


### Advanced: Corpus Graph

Paper: [Adaptive Re-Ranking with a Corpus Graph](https://arxiv.org/pdf/2208.08942.pdf)

In [14]:
!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/seanmacavaney/corpus-graph

INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/seanmacavaney/2024-03-21-12-46-50/output


In [17]:
!zcat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/seanmacavaney/2024-03-21-12-46-50/output/documents.jsonl.gz|head -2

{"doc_id": "doc062209106001", "neighbors": ["doc062210211350", "doc062210406947", "doc062210607383", "doc062210507204", "doc062210607043", "doc062210612025", "doc062210300409", "doc062210413453", "doc062210414356", "doc062210402464", "doc062210407620", "doc062210503771", "doc062210405214", "doc062210700385", "doc062210204782"]}
{"doc_id": "doc062209106002", "neighbors": ["doc062208706086", "doc062206408053", "doc062208906995", "doc062209009751", "doc062208807503", "doc062208805517", "doc062208704530", "doc062208906844", "doc062209008454", "doc062206304488", "doc062206511806", "doc062208900657", "doc062206305100", "doc062209201327", "doc062208908335"]}

gzip: stdout: Broken pipe
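
Note that these records identify documents by `doc_id`, whereas `documents.jsonl.gz` in the input data uses `docno`. The sketch below loads the records into an adjacency map and collects neighbours of already retrieved documents as additional candidates. It is a simplified illustration, not the adaptive re-ranking algorithm from the paper; the neighbour lists are shortened to three entries, and both function names are hypothetical.

```python
import json

# Records from the output above, with the neighbour lists cut to three entries.
SAMPLE = """\
{"doc_id": "doc062209106001", "neighbors": ["doc062210211350", "doc062210406947", "doc062210607383"]}
{"doc_id": "doc062209106002", "neighbors": ["doc062208706086", "doc062206408053", "doc062208906995"]}"""


def load_graph(lines):
    """Build a dict that maps each doc_id to its list of neighbours."""
    graph = {}
    for line in lines:
        record = json.loads(line)
        graph[record["doc_id"]] = record["neighbors"]
    return graph


def neighbour_candidates(graph, ranked_doc_ids, per_doc=2):
    """Collect the first `per_doc` listed neighbours of each ranked document,
    skipping documents that are already ranked or already collected."""
    seen = set(ranked_doc_ids)
    candidates = []
    for doc_id in ranked_doc_ids:
        for neighbour in graph.get(doc_id, [])[:per_doc]:
            if neighbour not in seen:
                seen.add(neighbour)
                candidates.append(neighbour)
    return candidates


graph = load_graph(SAMPLE.splitlines())
print(neighbour_candidates(graph, ["doc062209106001"]))
# ['doc062210211350', 'doc062210406947']
```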


### Advanced: Query Expansion with LLMs

Approaches:

- ir-benchmarks/tu-dresden-03/qe-gpt3.5-cot
- ir-benchmarks/tu-dresden-03/qe-gpt3.5-sq-zs
- ir-benchmarks/tu-dresden-03/qe-gpt3.5-sq-fs
- ir-benchmarks/tu-dresden-03/qe-llama-cot
- ir-benchmarks/tu-dresden-03/qe-llama-sq-zs
- ir-benchmarks/tu-dresden-03/qe-llama-sq-fs
- ir-benchmarks/tu-dresden-03/qe-flan-ul2-cot
- ir-benchmarks/tu-dresden-03/qe-flan-ul2-sq-zs
- ir-benchmarks/tu-dresden-03/qe-flan-ul2-sq-fs

In [36]:
!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/tu-dresden-03/qe-gpt3.5-cot

INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/tu-dresden-03-qe-gpt3.5-cot-clef-labs.zip
	This is only used for last spot checks before archival to Zenodo.
Download: 100%|█████████████████████████████| 620k/620k [00:00<00:00, 4.01MiB/s]
Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-03
/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-03/2024-03-10-19-13-34/output


In [37]:
!zcat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-03/2024-03-10-19-13-34/output/queries.jsonl.gz|head -2

{"qid":"q06223196","query":"A car shelter is a structure designed to provide protection and coverage for vehicles, such as cars, trucks, and motorcycles. It is typically made of materials like metal, wood, or fabric and can come in various forms such as carports, garages, or portable shelters.\n\nThe rationale for using a car shelter is to protect vehicles from various elements and environmental factors that can cause damage. This includes protection from sunlight, rain, snow, hail, and wind, which can lead to fading of paint, rusting, corrosion, and other forms of damage. Additionally, a car shelter can also provide security by keeping the vehicle out of sight and reducing the risk of theft or vandalism.\n\nOverall, a car shelter helps to extend the lifespan of vehicles, maintain their appearance, and provide a safe and secure storage space."}
{"qid":"q062228","query":"The term \"airport\" refers to a facility where aircraft can take off and land, as well as receive services such as f

### Advanced: DocT5Query

Paper: [From doc2query to docTTTTTquery](https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf)

In [18]:
!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/seanmacavaney/DocT5Query

INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/doc-t5-query/2024-03-19-19-46-01.zip
	This is only used for last spot checks before archival to Zenodo.
Download: 100%|███████████████████████████| 60.8M/60.8M [00:03<00:00, 16.6MiB/s]
Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/seanmacavaney
/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/seanmacavaney/2024-03-19-19-46-01/output


In [19]:
!zcat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/seanmacavaney/2024-03-19-19-46-01/output/documents.jsonl.gz|head -2

{"doc_id": "doc062211608898", "querygen": "when is fiesta in dominicana?\nwhen is fiesta dominicana\nwhen is fiesta dinner"}
{"doc_id": "doc062214401851", "querygen": "when is spectacle coming\nwhen does spectacle show start\nwhat is the name of the three little pigs circus"}

gzip: stdout: Broken pipe
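
The `querygen` field stores the generated queries as one string, separated by newlines. A minimal sketch of how they could be split and appended to a document's text before indexing is shown below; the function names are hypothetical, and appending with a newline is only one possible choice.

```python
import json

# One record copied from the output above (raw string keeps the JSON escapes).
SAMPLE = r'{"doc_id": "doc062211608898", "querygen": "when is fiesta in dominicana?\nwhen is fiesta dominicana\nwhen is fiesta dinner"}'


def generated_queries(record):
    """Split the newline-separated `querygen` string into a list of queries."""
    return [q.strip() for q in record["querygen"].split("\n") if q.strip()]


def expanded_text(doc_text, record):
    """Append the generated queries to the original document text."""
    return doc_text + "\n" + "\n".join(generated_queries(record))


record = json.loads(SAMPLE)
print(generated_queries(record))
# ['when is fiesta in dominicana?', 'when is fiesta dominicana', 'when is fiesta dinner']
```

As with the corpus graph, the join key here is `doc_id`, so it has to be matched against `docno` in the original documents.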


### Advanced: Query Entity Linking

Paper: [Query Interpretations from Entity-Linked Segmentations](https://webis.de/publications.html?q=Query#kasturia_2022)

In [26]:
!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/marcel-gohsen/entity-linking

INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/marcel-gohsen-entity-linking-clef-labs.zip
	This is only used for last spot checks before archival to Zenodo.
Download: 100%|███████████████████████████| 1.82M/1.82M [00:00<00:00, 7.86MiB/s]
Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen
/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen/2024-02-22-05-05-35/output


In [28]:
!cat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen/2024-02-22-05-05-35/output/queries.jsonl|head -4

{"qid":"q06223196","query":"Car shelter","entities":[{"begin":4,"end":11,"mention":"shelter","url":"https://en.wikipedia.org/wiki/Shelter_(charity)","score":0.19759679572763686},{"begin":4,"end":11,"mention":"shelter","url":"https://en.wikipedia.org/wiki/Shelter_(band)","score":0.13618157543391188},{"begin":4,"end":11,"mention":"shelter","url":"https://en.wikipedia.org/wiki/Shelter_(building)","score":0.11615487316421896},{"begin":0,"end":3,"mention":"car","url":"https://en.wikipedia.org/wiki/car","score":0.09153180278509597},{"begin":4,"end":11,"mention":"shelter","url":"https://en.wikipedia.org/wiki/Shelter_Records","score":0.06675567423230974},{"begin":4,"end":11,"mention":"shelter","url":"https://en.wikipedia.org/wiki/Shelter_(2007_film)","score":0.04672897196261682},{"begin":4,"end":11,"mention":"shelter","url":"https://en.wikipedia.org/wiki/Shelter","score":0.044058744993324434},{"begin":4,"end":11,"mention":"shelter","url":"https://en.wikipedia.org/wiki/Shelter_(Porter_Robinson_
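
Each query comes with several candidate entities per mention, identified by the `begin` and `end` offsets and weighted by `score`. The following sketch keeps only the highest-scoring candidate for each mention span. The sample is shortened to four of the candidates shown above, and `best_entity_per_mention` is a hypothetical helper.

```python
import json

# Shortened copy of the first record above (four of its candidates).
SAMPLE = """{"qid":"q06223196","query":"Car shelter","entities":[
 {"begin":4,"end":11,"mention":"shelter","url":"https://en.wikipedia.org/wiki/Shelter_(charity)","score":0.19759679572763686},
 {"begin":4,"end":11,"mention":"shelter","url":"https://en.wikipedia.org/wiki/Shelter_(band)","score":0.13618157543391188},
 {"begin":4,"end":11,"mention":"shelter","url":"https://en.wikipedia.org/wiki/Shelter_(building)","score":0.11615487316421896},
 {"begin":0,"end":3,"mention":"car","url":"https://en.wikipedia.org/wiki/car","score":0.09153180278509597}]}"""


def best_entity_per_mention(entities):
    """Keep the highest-scoring candidate for every (begin, end) span."""
    best = {}
    for entity in entities:
        span = (entity["begin"], entity["end"])
        if span not in best or entity["score"] > best[span]["score"]:
            best[span] = entity
    # Return the winners in the order in which the mentions occur in the query.
    return [best[span] for span in sorted(best)]


record = json.loads(SAMPLE)
for entity in best_entity_per_mention(record["entities"]):
    print(entity["mention"], entity["url"])
# car https://en.wikipedia.org/wiki/car
# shelter https://en.wikipedia.org/wiki/Shelter_(charity)
```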

### Advanced: Genre Classification

Paper: [Web Genre Analysis: Use Cases, Retrieval Models, and Implementation Issues](https://webis.de/publications.html?q=genre#stein_2010b)

In [30]:
!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/tu-dresden-01/genre-mlp

INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
Download: 2.51MiB [00:00, 5.04MiB/s]
Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-01
/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-01/2024-03-18-18-34-17/output


In [32]:
!zcat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-01/2024-03-18-18-34-17/output/documents.jsonl.gz|head -4

{"docno":"doc062200602177","predicted_label":"Shop","probability_Discussion":0.0344156198,"probability_Shop":0.4675025358,"probability_Download":0.0412818462,"probability_Articles":0.037098362,"probability_Help":0.0942853201,"probability_Linklists":0.0314797933,"probability_Porttrait private":0.0184404896,"probability_Protrait non private":0.2754960331}
{"docno":"doc062200206592","predicted_label":"Help","probability_Discussion":0.0556134054,"probability_Shop":0.0598814937,"probability_Download":0.0131974471,"probability_Articles":0.0504586852,"probability_Help":0.5047823773,"probability_Linklists":0.0353957292,"probability_Porttrait private":0.0223725693,"probability_Protrait non private":0.2582982928}
{"docno":"doc062210912628","predicted_label":"Help","probability_Discussion":0.0549635234,"probability_Shop":0.0319441333,"probability_Download":0.0316361181,"probability_Articles":0.1685541632,"probability_Help":0.3903540373,"probability_Linklists":0.0427212461,"probability_Porttrait p
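
The class probabilities are stored as flat keys with the prefix `probability_`. The sketch below collects them into one dictionary and checks that the most probable class matches `predicted_label`. The record is the first line of the output above, with the key names copied exactly as they appear in the file; `genre_probabilities` is a hypothetical helper.

```python
import json

# First record from the output above; key spellings are kept as in the file.
SAMPLE = (
    '{"docno":"doc062200602177","predicted_label":"Shop",'
    '"probability_Discussion":0.0344156198,"probability_Shop":0.4675025358,'
    '"probability_Download":0.0412818462,"probability_Articles":0.037098362,'
    '"probability_Help":0.0942853201,"probability_Linklists":0.0314797933,'
    '"probability_Porttrait private":0.0184404896,'
    '"probability_Protrait non private":0.2754960331}'
)

PREFIX = "probability_"


def genre_probabilities(record):
    """Return {genre: probability} for all keys that start with the prefix."""
    return {
        key[len(PREFIX):]: value
        for key, value in record.items()
        if key.startswith(PREFIX)
    }


record = json.loads(SAMPLE)
probabilities = genre_probabilities(record)
top_genre = max(probabilities, key=probabilities.get)
print(top_genre, top_genre == record["predicted_label"])  # Shop True
```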

### Advanced: Text Features (e.g., Readability, Coherence)

Tool: spaCy

In [33]:
!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/tu-dresden-04/spacy-document-features

INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
Download: 16.2MiB [00:01, 8.54MiB/s]
Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-04
/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-04/2024-03-18-18-16-47/output


In [34]:
!zcat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/tu-dresden-04/2024-03-18-18-16-47/output/documents.jsonl.gz|head -4

{"docno":"doc062200602177","entropy":19.989747459,"perplexity":480216431.4089455605,"per_word_perplexity":589222.6152257001,"first_order_coherence":0.4858336708,"second_order_coherence":0.4434301508,"flesch_reading_ease":50.059375,"flesch_kincaid_grade":13.5746187943,"smog":14.2530751775,"gunning_fog":16.9131205674,"automated_readability_index":15.6784361702,"coleman_liau_index":11.1828085106,"lix":54.9069148936,"rix":7.5,"pos_prop_ADJ":0.0429447853,"pos_prop_ADP":0.082208589,"pos_prop_ADV":0.0122699387,"pos_prop_AUX":0.0196319018,"pos_prop_CCONJ":0.0159509202,"pos_prop_DET":0.0588957055,"pos_prop_INTJ":0.0,"pos_prop_NOUN":0.226993865,"pos_prop_NUM":0.0355828221,"pos_prop_PART":0.0147239264,"pos_prop_PRON":0.0429447853,"pos_prop_PROPN":0.1521472393,"pos_prop_PUNCT":0.1263803681,"pos_prop_SCONJ":0.0061349693,"pos_prop_SYM":0.0036809816,"pos_prop_VERB":0.0625766871,"pos_prop_X":0.0049079755,"token_length_mean":4.7602836879,"token_length_median":4.0,"token_length_std":3.3800599985,"senten
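
Since every record is a flat mapping from feature name to number, a subset of the features can be pulled into a vector, for example as input for a downstream model. The sketch below works on a shortened copy of the record above; which features to select is an arbitrary choice made for illustration, and `feature_vector` is a hypothetical helper.

```python
import json

# Shortened copy of the record above (only a handful of the features kept).
SAMPLE = (
    '{"docno":"doc062200602177","flesch_reading_ease":50.059375,'
    '"smog":14.2530751775,"first_order_coherence":0.4858336708,'
    '"token_length_mean":4.7602836879}'
)


def feature_vector(record, names):
    """Return the values of the requested features in the given order.

    Raises KeyError if a requested feature is missing from the record.
    """
    return [float(record[name]) for name in names]


record = json.loads(SAMPLE)
names = ["flesch_reading_ease", "smog", "token_length_mean"]
print(record["docno"], feature_vector(record, names))
# doc062200602177 [50.059375, 14.2530751775, 4.7602836879]
```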

### Advanced: Query Interpretation

Paper: [Query Interpretations from Entity-Linked Segmentations](https://webis.de/publications.html?q=Query#kasturia_2022)

In [20]:
!tira-cli download --dataset longeval-train-20230513-training --approach ir-benchmarks/marcel-gohsen/query-interpretation

INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
INFO:root:No settings given in /root/.tira/.tira-settings.json. I will use defaults.
Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/query-processors-in-progress/marcel-gohsen-query-interpretation-clef-labs.zip
	This is only used for last spot checks before archival to Zenodo.
Download: 100%|█████████████████████████████| 191k/191k [00:00<00:00, 2.40MiB/s]
Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen
/root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen/2024-02-23-07-19-23/output


In [24]:
!cat /root/.tira/extracted_runs/ir-benchmarks/longeval-train-20230513-training/marcel-gohsen/2024-02-23-07-19-23/output/queries.jsonl|head -4

{"qid":"q06223196","query":"Car shelter","interpretations":[{"id":0,"interpretation":["car shelter"],"relevance":0.0,"containedEntities":[],"contextWords":["shelter","car"],"score":0.0}]}
{"qid":"q062228","query":"airport","interpretations":[{"id":0,"interpretation":["https://en.wikipedia.org/wiki/airport"],"relevance":0.7456366828462253,"containedEntities":["https://en.wikipedia.org/wiki/airport"],"contextWords":[],"score":0.7456366828462253}]}
{"qid":"q062287","query":"antivirus comparison","interpretations":[{"id":0,"interpretation":["antivirus comparison"],"relevance":0.0,"containedEntities":[],"contextWords":["comparison","antivirus"],"score":0.0}]}
{"qid":"q06223261","query":"free antivirus","interpretations":[{"id":0,"interpretation":["free antivirus"],"relevance":0.0,"containedEntities":[],"contextWords":["antivirus","free"],"score":0.0}]}
cat: write error: Broken pipe
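
Each query carries a list of `interpretations`, each with a `score` and the entities it contains. A minimal sketch of selecting the highest-scoring interpretation and reading its `containedEntities` follows; the two records are copied from the output above, and the function names are hypothetical.

```python
import json

# The first two records from the output above.
SAMPLE = """\
{"qid":"q06223196","query":"Car shelter","interpretations":[{"id":0,"interpretation":["car shelter"],"relevance":0.0,"containedEntities":[],"contextWords":["shelter","car"],"score":0.0}]}
{"qid":"q062228","query":"airport","interpretations":[{"id":0,"interpretation":["https://en.wikipedia.org/wiki/airport"],"relevance":0.7456366828462253,"containedEntities":["https://en.wikipedia.org/wiki/airport"],"contextWords":[],"score":0.7456366828462253}]}"""


def top_interpretation(record):
    """Return the interpretation with the highest score, or None if absent."""
    interpretations = record.get("interpretations", [])
    if not interpretations:
        return None
    return max(interpretations, key=lambda item: item["score"])


def linked_entities(record):
    """Return the entities contained in the top interpretation."""
    top = top_interpretation(record)
    return top["containedEntities"] if top else []


entities_by_qid = {}
for line in SAMPLE.splitlines():
    record = json.loads(line)
    entities_by_qid[record["qid"]] = linked_entities(record)
print(entities_by_qid)
# {'q06223196': [], 'q062228': ['https://en.wikipedia.org/wiki/airport']}
```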
