The paper associated with this repository and dataset is currently under review. This repository contains all the code to re-run the crawling and reproduce the experiments.
You can find the ready-to-use dataset at https://files.webis.de/corpora/corpora-webis/known-item-question-performance-prediction/.
The code related to the analyses is contained in `tomt-parsing`.
We analyse the distribution of each category that is assigned to at least one TOMT question. Because different spellings led to redundancies, we treat category names as case-insensitive. These are the top-7 categories that we extracted in `tomt-parsing` (a minimal counting sketch follows the list):
- song, 161385 questions
- movie, 158543 questions
- video, 67642 questions
- music, 47591 questions
- book, 47578 questions
- 2000s, 38200 questions
- game, 32961 questions
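
For illustration, a minimal counting sketch of this analysis (not the notebook's exact code); the input file name and the `categories` column are assumptions:

```python
from collections import Counter

import pandas as pd

# Hypothetical input: a JSONL dump of TOMT questions where each row has a
# `categories` list (the actual file and column names may differ).
questions = pd.read_json("tomt-questions.jsonl", lines=True)

# Lowercase every category name so that spelling variants such as
# "Song" and "song" are merged before counting.
category_counts = Counter(
    category.lower()
    for categories in questions["categories"]
    for category in categories
)

for category, count in category_counts.most_common(7):
    print(f"{category}, {count} questions")
```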
`analyse-tomt-solved.ipynb` contains the code to analyse the top categories of the solved questions and the waiting time until a question is solved.
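
A minimal sketch of the waiting-time computation, assuming `created_utc` and `solved_utc` are Unix timestamps (the file name and thresholds are illustrative):

```python
import pandas as pd

# Hypothetical input: the solved questions with `created_utc` and
# `solved_utc` as Unix timestamps.
solved = pd.read_json("tomt-solved.jsonl", lines=True)

# Waiting time between posting the question and the Gold Answer.
solved["waiting_time"] = pd.to_datetime(
    solved["solved_utc"], unit="s"
) - pd.to_datetime(solved["created_utc"], unit="s")

# Share of questions solved within a day, week, month, or year.
for label, delta in [("day", "1D"), ("week", "7D"), ("month", "30D"), ("year", "365D")]:
    share = (solved["waiting_time"] <= pd.Timedelta(delta)).mean()
    print(f"solved within a {label}: {share:.1%}")
```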
`extract-solved-comment-test.ipynb` contains our code for extracting Gold Answers from the TOMT subreddit dataset. It loads the TOMT subreddit dataset and extends it with the `solved_utc`, `chosen_answer`, and `links_on_answer_path` attributes.
In addition, we rename the Reddit attributes `selftext` to `content` and `title` to `subject`, as is common in other Q&A datasets, such as Yahoo!-Answers.
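
The renaming step could look like the following sketch (the toy frame stands in for the loaded dataset):

```python
import pandas as pd

# Toy stand-in for the loaded TOMT subreddit dataset.
questions = pd.DataFrame(
    [{"title": "[TOMT] 90s song", "selftext": "It goes like ..."}]
)

# Align Reddit's field names with those common in other Q&A datasets.
questions = questions.rename(columns={"selftext": "content", "title": "subject"})

# New attributes added by the extraction, filled later per question.
for column in ("solved_utc", "chosen_answer", "links_on_answer_path"):
    questions[column] = None
```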
We iterate through all questions and call the `find_gold_answer(qa)` method, which traverses the comment tree of each question. It expects a pandas DataFrame row with all the required attributes (i.e. `author`, `created_utc`, `link_flair_text`, `num_comments`, and `comments`).
It extracts all the links on the comment path to the Gold Answer and determines the `solved_utc` value, which corresponds to the `created_utc` of the Gold Answer.
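
As a rough sketch of this traversal (not the notebook's exact implementation), assuming `comments` holds nested dicts with `author`, `body`, `created_utc`, and `replies`:

```python
import re

GOLD_KEYWORDS = ("yes", "thank", "solved", "amazing")
URL_PATTERN = re.compile(r"https?://\S+")


def find_gold_answer(qa, keywords=GOLD_KEYWORDS):
    """Sketch of the Gold Answer extraction (assumed comment-dict layout).

    If the question's author replies to a comment with one of the
    keywords, that comment is taken as the Gold Answer; all links on the
    root-to-answer path are collected along the way.
    """

    def walk(comment, path):
        path = path + [comment]
        for reply in comment.get("replies", []):
            # A keyword reply by the asker marks `comment` as the Gold Answer.
            if reply.get("author") == qa["author"] and any(
                keyword in reply.get("body", "").lower() for keyword in keywords
            ):
                links = [
                    url
                    for c in path
                    for url in URL_PATTERN.findall(c.get("body", ""))
                ]
                return comment["created_utc"], comment["body"], links
        for reply in comment.get("replies", []):
            result = walk(reply, path)
            if result is not None:
                return result
        return None

    for root in qa["comments"]:
        result = walk(root, [])
        if result is not None:
            return result  # (solved_utc, chosen_answer, links_on_answer_path)
    return None
```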
The Gold Answer heuristic can be customized. It is based on the presence of the keywords "yes", "thank", "solved", and "amazing", a choice that worked adequately in pilot experiments and in the precision and recall experiments (see below). Another possible adaptation is to also consider questions without a `link_flair_text` value, i.e. questions that may be solved but are not officially marked as `Solved`.
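
Building on the sketch above, such an adaptation might look like this (the flair values and the extra keyword are made up):

```python
# Hypothetical customization: an extended keyword list, plus scanning
# questions without a flair (possibly solved but never marked "Solved").
custom_keywords = GOLD_KEYWORDS + ("that's it",)

candidates = questions[
    (questions["link_flair_text"] == "Solved")
    | questions["link_flair_text"].isna()
]
gold = candidates.apply(
    lambda qa: find_gold_answer(qa, keywords=custom_keywords), axis=1
)
```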
We measure the precision and recall of our approach by annotating 50 random questions from the corpus and 50 questions for which our heuristic extracts a solved answer (these questions were not used to develop the rules). Our approach achieves a precision of 92% and a recall of 78%. `qpp-experiments/extract-precision-recall-samples.ipynb` contains the code that extracts these random questions. These samples can also be found in `sample-data` as `csv` files.
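
Continuing from the sketches above, the two samples could be drawn along these lines (the seed and output file names are assumptions):

```python
# 50 random questions from the corpus (recall sample) and 50 questions
# for which the heuristic extracted an answer (precision sample).
recall_sample = questions.sample(n=50, random_state=42)
precision_sample = questions[questions["chosen_answer"].notna()].sample(
    n=50, random_state=42
)

recall_sample.to_csv("sample-data/recall-sample.csv", index=False)
precision_sample.to_csv("sample-data/precision-sample.csv", index=False)
```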
We provide sample data extracted in `analyse-tomt-solved.ipynb`. The `sample-data` folder contains sample questions from the TOMT subreddit that were solved within a day, a week, a month, a year, or later. There are also two random datasets for the recall and precision experiments.