
pt.Experiment consistency in missing qids in topics/qrels/runs #228

Merged
3 commits merged into master, Sep 17, 2021

Conversation

seanmacavaney
Collaborator

fixes #226

s_metric = rev_mapping.get(metric, str(metric))
aggregators[s_metric].add(evalMeasuresDict[q][s_metric])
evalMeasuresDict = {m: agg.result() for m, agg in aggregators.items()}
evalMeasuresDict = _ir_measures_to_dict(
Collaborator Author

This change may be controversial. When batch_size is set, it no longer evaluates after each batch, but rather keeps all the results and evaluates everything at the end.

Alternatively, I could update so it filters the qrels each time.
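The "filters the qrels each time" alternative could look roughly like this — a minimal sketch with illustrative names (not PyTerrier's actual API), where the placeholder body just counts qrels per qid in place of real evaluation:

```python
import pandas as pd

def evaluate_in_batches(qrels, run_batches):
    """Process result batches one at a time, restricting the qrels to the
    qids present in each batch, so all results need not be held in memory."""
    per_query = {}
    for batch in run_batches:  # each batch is a DataFrame of results
        batch_qids = batch["qid"].unique()
        batch_qrels = qrels[qrels["qid"].isin(batch_qids)]
        # real evaluation of (batch, batch_qrels) would go here;
        # as a stand-in we record the number of judged docs per qid
        for qid in batch_qids:
            per_query[qid] = len(batch_qrels[batch_qrels["qid"] == qid])
    return per_query
```

A qid that appears in a batch but not in the qrels simply gets no judged documents, which is where the missing-qid consistency question from #226 comes in.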

@@ -95,7 +97,17 @@ def _ir_measures_to_dict(
metric = m.measure
metric = rev_mapping.get(metric, str(metric))
rtr[m.query_id][metric] = m.value
if backfill_qids is not None:
Collaborator Author

It was necessary to move the backfilling down to this level, where rev_mapping is available. It was actually buggy before: if a family name was given and perquery was set, the family name would appear with a 0 in the summary rather than the member names. Tests catch this now.

@@ -56,15 +57,15 @@ def _color_cols(data, col_type,
}

def _convert_measures(metrics : MEASURES_TYPE) -> Tuple[Sequence[BaseMeasure], Dict[BaseMeasure,str]]:
from ir_measures import convert_trec_name
from ir_measures import parse_trec_measure
Collaborator Author

Same function; the old name was deprecated.

@@ -346,6 +354,8 @@ def _apply_round(measure, value):
evalDict[name] = evalMeasuresDict
else:
import builtins
if mrt_needed:
Collaborator Author

mrt only applies when perquery is false.

@cmacdonald
Contributor

Sorry, evaluation after each batch is definitely preferred - if you have lots of queries, this avoids keeping ALL of them in memory, which is especially a challenge for the MSMARCO dev/eval sets.

@seanmacavaney
Collaborator Author

yep, makes sense, I'll update.

@@ -289,6 +299,14 @@ def _apply_round(measure, value):
# the commented variant would drop queries not having any RELEVANT labels
# topics = topics.merge(qrels[qrels["label"] > 0][["qid"]].drop_duplicates())
topics = topics.merge(qrels[["qid"]].drop_duplicates())
if len(topics) == 0:
raise ValueError('There is no overlap between the query IDs found in the topics and qrels. If this is intentional, set filter_qrels=False and drop_unused=False.')
Contributor

qids

@@ -214,6 +223,7 @@ def Experiment(
Applying a batch_size is useful if you have large numbers of topics, and/or if your pipeline requires large amounts of temporary memory
during a run.
drop_unused(bool): If True, will drop topics from the topics dataframe that have qids not appearing in the qrels dataframe.
filter_qrels(bool): If True, will drop topics from the qrels dataframe that have qids not appearing in the topics dataframe.
Contributor

Can filter_qrels and drop_unused both be set to True? Surely not.
One wonders if we should give these similar names:
filter_by_topics and filter_by_qrels?
or filter_by='topics'

Collaborator Author

Yeah, you can have both true. It's essentially left outer join, right outer join, and inner join.

If qrels include A, B and topics include B, C:

  • drop_unused=True filter_qrels=True gives B
  • drop_unused=True filter_qrels=False gives A, B
  • drop_unused=False filter_qrels=True gives B, C
  • drop_unused=False filter_qrels=False gives A, B, C

I'm in favour of giving them more similar names. filter_by_topics (formerly filter_qrels) and filter_by_qrels (formerly drop_unused) sound reasonable to me. Should probably pull from kwargs and give a warning if drop_unused is used for backward compatibility.
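The four combinations can be demonstrated with a toy pandas sketch (the helper name and union-of-qids framing are illustrative, not the actual Experiment code):

```python
import pandas as pd

qrels = pd.DataFrame({"qid": ["A", "B"], "docno": ["d1", "d2"], "label": [1, 1]})
topics = pd.DataFrame({"qid": ["B", "C"], "query": ["b", "c"]})

def evaluated_qids(topics, qrels, drop_unused, filter_qrels):
    """Return the sorted union of qids that would take part in evaluation."""
    if drop_unused:   # keep only topics whose qid appears in the qrels
        topics = topics.merge(qrels[["qid"]].drop_duplicates())
    if filter_qrels:  # keep only qrels whose qid appears in the (filtered) topics
        qrels = qrels.merge(topics[["qid"]].drop_duplicates())
    return sorted(set(topics["qid"]) | set(qrels["qid"]))

# drop_unused=True,  filter_qrels=True  -> ["B"]            (inner join)
# drop_unused=True,  filter_qrels=False -> ["A", "B"]       (right outer)
# drop_unused=False, filter_qrels=True  -> ["B", "C"]       (left outer)
# drop_unused=False, filter_qrels=False -> ["A", "B", "C"]  (full outer)
```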

Contributor

Agreed. We may need something in the documentation about this filtering.

Are we sure

  • filter_by='topics'
  • filter_by='qrels'
  • filter_by='both'
  • filter_by=None

shouldn't be the correct nomenclature?

Collaborator Author

I've been thinking about it in terms of separate decisions the user could make, and as a result, a single option that does both could be confusing.

  1. Do you want to skip topics that do not appear in the qrels? If so, set filter_by_qrels=True.
  2. Do you want to evaluate across all topics from the qrels, even if they do not appear in the requested topics? If so, set filter_by_topics=False. I'd expect this to be particularly uncommon and be limited to situations like the one brought up in pt.Experiment consistency in missing qids in topics/qrels/runs #226.

I could probably be convinced otherwise, though.
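The backward-compatible handling of the old names suggested earlier could be sketched like this — the defaults, error handling, and function name are assumptions for illustration, not PyTerrier's actual signature:

```python
import warnings

def resolve_filter_args(filter_by_topics=True, filter_by_qrels=False, **kwargs):
    """Accept deprecated kwarg names (per the renaming discussed:
    filter_qrels -> filter_by_topics, drop_unused -> filter_by_qrels),
    warning when the old names are used."""
    if "drop_unused" in kwargs:
        warnings.warn("drop_unused is deprecated; use filter_by_qrels",
                      DeprecationWarning)
        filter_by_qrels = kwargs.pop("drop_unused")
    if "filter_qrels" in kwargs:
        warnings.warn("filter_qrels is deprecated; use filter_by_topics",
                      DeprecationWarning)
        filter_by_topics = kwargs.pop("filter_qrels")
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")
    return filter_by_topics, filter_by_qrels
```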

Contributor

Ok, you convinced me. Carry on.

if filter_qrels:
qrels = qrels.merge(topics[["qid"]].drop_duplicates())
if len(qrels) == 0:
raise ValueError('There is no overlap between the query IDs found in the topics and qrels. If this is intentional, set filter_qrels=False and drop_unused=False.')
Contributor

ditto. qids please rather than query IDs.

@@ -1,6 +1,7 @@
from collections import defaultdict
from warnings import warn
import os
import itertools
Contributor

is itertools used?

@@ -95,7 +97,17 @@ def _ir_measures_to_dict(
metric = m.measure
metric = rev_mapping.get(metric, str(metric))
rtr[m.query_id][metric] = m.value
if backfill_qids is not None:
Contributor

I think we need a comment explaining "what backfill is/means"
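For context, "backfill" here appears to mean: any qid present in the qrels that received no evaluated results gets a 0 for every measure, so that summary averages cover all judged queries rather than only the answered ones. A minimal sketch with illustrative names (not the actual implementation):

```python
def backfill_missing_qids(rtr, backfill_qids, measures):
    """Fill in a 0.0 for every measure for each qid that is judged
    (appears in backfill_qids) but absent from the per-query results."""
    for qid in backfill_qids:
        if qid not in rtr:
            rtr[qid] = {m: 0.0 for m in measures}
    return rtr
```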

docs/experiments.rst: comment resolved (outdated)
@cmacdonald cmacdonald merged commit c6d7c97 into master Sep 17, 2021
@cmacdonald cmacdonald added this to the 0.7 milestone Sep 17, 2021
@cmacdonald cmacdonald deleted the irm branch December 20, 2021 18:04