# IR Lab Tutorial (Research Oriented): Genre Classification

This tutorial shows how to reuse genre classification approaches on all datasets available in [TIREx](https://www.tira.io/tirex).
Genre classification aims to predict the purpose of a web page: for example, whether the page is meant to inform users, to help them, or to sell them something.

Genre predictions can help to slice an index by page type, or they can serve as features in learning-to-rank pipelines.
For background, see the [corresponding paper](https://webis.de/publications.html?q=web+genre#stein_2010b).

In [3]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client
ensure_pyterrier_is_loaded()
import pandas as pd
import pyterrier as pt
from tqdm import tqdm

tira = Client()

In [5]:
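# Load the TREC Web Track 2009 dataset (ClueWeb09) through the ir_datasets integration.
pt_dataset = pt.get_dataset('irds:clueweb09/en/trec-web-2009')
# Use the 'query' field of the topics as the query text.
topics = pt_dataset.get_topics('query')

# Load the BM25 submission from TIRA as a PyTerrier transformer.
bm25 = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 Re-Rank (tira-ir-starter-pyterrier)', pt_dataset)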


In [6]:
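# Load the genre predictions of the 'genre-mlp' submission as a transformer that annotates documents.
genre_mlp_classifier = tira.pt.transform_documents('ir-benchmarks/tu-dresden-01/genre-mlp', pt_dataset)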

Download: 2.66MiB [00:00, 19.7MiB/s]


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_runs/ir-benchmarks/clueweb09-en-trec-web-2009-20230107-training/tu-dresden-01


In [36]:
(bm25 >> genre_mlp_classifier)(topics[topics['qid'] == '21']).head(5)

| | qid | query | rank | docno | predicted_label | probability_Shop | probability_Linklists | probability_Protrait non private |
|---|---|---|---|---|---|---|---|---|
| 0 | 21 | volvo | 1 | clueweb09-zh0015-47-11207 | Shop | 0.430557 | 0.050686 | 0.303563 |
| 1 | 21 | volvo | 2 | clueweb09-en0035-03-39670 | Shop | 0.566059 | 0.035091 | 0.255652 |
| 2 | 21 | volvo | 3 | clueweb09-zh0033-92-44184 | Protrait non private | 0.225574 | 0.050969 | 0.553595 |
| 3 | 21 | volvo | 4 | clueweb09-ja0009-84-31373 | Shop | 0.52504 | 0.071335 | 0.173629 |
| 4 | 21 | volvo | 5 | clueweb09-en0028-06-13844 | Shop | 0.429051 | 0.043311 | 0.389856 |

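One way to use these predictions for slicing is to keep only the documents of a single genre and renumber the ranks. The sketch below is a minimal illustration with plain pandas, using the five rows for query 21 copied from the output above so that it runs without TIRA or PyTerrier. The helper `keep_genre` and its `min_probability` parameter are hypothetical names introduced here; they are not part of the TIRA client.

```python
import pandas as pd

# Top-5 rows for query 21 ("volvo"), copied from the output above.
results = pd.DataFrame({
    "qid": ["21"] * 5,
    "query": ["volvo"] * 5,
    "rank": [1, 2, 3, 4, 5],
    "docno": [
        "clueweb09-zh0015-47-11207",
        "clueweb09-en0035-03-39670",
        "clueweb09-zh0033-92-44184",
        "clueweb09-ja0009-84-31373",
        "clueweb09-en0028-06-13844",
    ],
    "predicted_label": ["Shop", "Shop", "Protrait non private", "Shop", "Shop"],
    "probability_Shop": [0.430557, 0.566059, 0.225574, 0.52504, 0.429051],
})


def keep_genre(df, genre, min_probability=0.0):
    """Hypothetical helper: keep rows predicted as `genre`, then renumber ranks."""
    column = f"probability_{genre}"
    mask = (df["predicted_label"] == genre) & (df[column] >= min_probability)
    kept = df[mask].sort_values("rank").reset_index(drop=True)
    kept["rank"] = list(range(1, len(kept) + 1))
    return kept


# All documents classified as "Shop".
shop_only = keep_genre(results, "Shop")
# Only documents classified as "Shop" with a probability of at least 0.5.
confident_shop = keep_genre(results, "Shop", min_probability=0.5)

print(shop_only[["rank", "docno", "probability_Shop"]])
print(confident_shop[["rank", "docno", "probability_Shop"]])
```

In a real pipeline the same filter could presumably be applied to the data frame returned by `(bm25 >> genre_mlp_classifier)(topics)`; the threshold of 0.5 is an arbitrary choice for illustration.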

In [18]:
columns = ['qid', 'query', 'rank', 'docno', 'predicted_label', 'probability_Download', 'probability_Linklists', 'probability_Porttrait private']
(bm25 >> genre_mlp_classifier)(topics[topics['qid'] == '20']).head(10)[columns]

| | qid | query | rank | docno | predicted_label | probability_Download | probability_Linklists | probability_Porttrait private |
|---|---|---|---|---|---|---|---|---|
| 0 | 20 | defender | 1 | clueweb09-en0039-89-31095 | Download | 0.874826 | 0.028537 | 0.006056 |
| 1 | 20 | defender | 2 | clueweb09-en0075-44-23924 | Download | 0.882751 | 0.027567 | 0.005506 |
| 2 | 20 | defender | 3 | clueweb09-enwp01-41-00429 | Porttrait private | 0.007222 | 0.201657 | 0.383109 |
| 3 | 20 | defender | 4 | clueweb09-enwp02-01-20435 | Linklists | 0.062379 | 0.356414 | 0.073092 |
| 4 | 20 | defender | 5 | clueweb09-enwp00-54-21495 | Linklists | 0.063029 | 0.371884 | 0.072177 |
| 5 | 20 | defender | 6 | clueweb09-enwp01-93-20555 | Linklists | 0.058611 | 0.346424 | 0.082329 |
| 6 | 20 | defender | 7 | clueweb09-enwp00-55-22209 | Linklists | 0.059132 | 0.351548 | 0.082686 |
| 7 | 20 | defender | 8 | clueweb09-enwp00-54-21491 | Linklists | 0.059481 | 0.344593 | 0.08333 |
| 8 | 20 | defender | 9 | clueweb09-enwp00-50-22413 | Linklists | 0.059481 | 0.344593 | 0.08333 |
| 9 | 20 | defender | 10 | clueweb09-enwp00-50-22412 | Linklists | 0.059481 | 0.344593 | 0.08333 |

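To illustrate the learning-to-rank use case, the sketch below turns the genre probabilities of the top-10 results for query 20 into per-document feature vectors and also counts how often each genre is predicted. It uses only the Python standard library and the values copied from the output above. Only three of the probability columns are shown in that output, so a real feature vector would likely include more genre columns; the names `FEATURE_NAMES`, `features`, and `genre_counts` are assumptions made for this example.

```python
from collections import Counter

# Order of the probability columns used as features (as shown in the output above).
FEATURE_NAMES = [
    "probability_Download",
    "probability_Linklists",
    "probability_Porttrait private",
]

# (docno, predicted_label, Download, Linklists, Porttrait private) for query 20.
rows = [
    ("clueweb09-en0039-89-31095", "Download", 0.874826, 0.028537, 0.006056),
    ("clueweb09-en0075-44-23924", "Download", 0.882751, 0.027567, 0.005506),
    ("clueweb09-enwp01-41-00429", "Porttrait private", 0.007222, 0.201657, 0.383109),
    ("clueweb09-enwp02-01-20435", "Linklists", 0.062379, 0.356414, 0.073092),
    ("clueweb09-enwp00-54-21495", "Linklists", 0.063029, 0.371884, 0.072177),
    ("clueweb09-enwp01-93-20555", "Linklists", 0.058611, 0.346424, 0.082329),
    ("clueweb09-enwp00-55-22209", "Linklists", 0.059132, 0.351548, 0.082686),
    ("clueweb09-enwp00-54-21491", "Linklists", 0.059481, 0.344593, 0.08333),
    ("clueweb09-enwp00-50-22413", "Linklists", 0.059481, 0.344593, 0.08333),
    ("clueweb09-enwp00-50-22412", "Linklists", 0.059481, 0.344593, 0.08333),
]

# One feature vector per document, in the order given by FEATURE_NAMES.
features = {docno: list(probabilities) for docno, _, *probabilities in rows}

# How often each genre is predicted among the top-10 results.
genre_counts = Counter(label for _, label, *_ in rows)

print(genre_counts.most_common())
print(features["clueweb09-en0039-89-31095"])
```

Such vectors could then be passed to a learning-to-rank model together with the retrieval score. The last three documents have identical probabilities, which may point to near-duplicate pages, so the genre features alone cannot tell them apart.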