Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Function Score Query #2396

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

alex-au-922
Copy link

@alex-au-922 alex-au-922 commented May 12, 2024

The implementation of FunctionScoreQuery which accepts an existing query and a closure for applying new scoring. You may consider this query as a .map() function which does not filter the documents out, but rather change their score.

Closes #2395

@adamreichold
Copy link
Contributor

How you would say this relates to the existing tweak_score? Would that also suffice to handle your use case?

@adamreichold
Copy link
Contributor

Ah sorry, you wrote that in the linked issue...

@adamreichold adamreichold changed the title Issue 2395 Add Function Score Query May 12, 2024
@alex-au-922
Copy link
Author

Ah sorry, you wrote that in the linked issue...

Sorry that I'm not quite used to contribute to large OSS projects so I'm not sure the exact practices on linking Issues and PR. Will add the issue link in the PR next time

@adamreichold
Copy link
Contributor

Regarding the formatting errors, note that you'll need to use a nightly toolchain to apply the correct formatting configuration.

}
}

impl DocSet for FunctionScorer {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you implement the other methods too? seek for instance, has a default impemnetation but it will be catastrophically slow.

use crate::query::{EnableScoring, Explanation, Query, Scorer, Weight};
use crate::{DocId, DocSet, Score, SegmentReader, TantivyError};

type Function = Arc<dyn Fn(&SegmentReader, Score, DocId) -> Score + Send + Sync + 'static>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If your fucntion needs to access a column from a fast field, you don't want to access it at each call.

The function should be a factory that takes a SegmentReader and returns a scoring function

@alex-au-922
Copy link
Author

I have removed the use of SegmentReader in the current version of FunctionScoreQuery as the query could be too versatile that users cannot optimize their query performance with the wrong use of SegmentReader.

The use of SegmentReader should be thoroughly discussed. We can even just limit them with the use of fastfield only.

The current signature should only be a simple Fn(Score) -> Score which can simplify complicated nested boosting cases and some impossible maths like reciprocal, logarithm etc..

@adamreichold
Copy link
Contributor

The use of SegmentReader should be thoroughly discussed. We can even just limit them with the use of fastfield only.

I think @fulmicoton's proposed approach is very right on target and has precedence in other Tantivy API, i.e. make the user pass a fallible factory function which given a segment reader will produce an infallible per-document scoring function.

@alex-au-922
Copy link
Author

I think @fulmicoton's proposed approach is very right on target and has precedence in other Tantivy API, i.e. make the user pass a fallible factory function which given a segment reader will produce an infallible per-document scoring function.

What comes up in my mind is a new enum like

pub enum ScoreFunction {
    ScoreOnly(Arc<dyn Fn(Score) -> Score + Send + Sync + 'static>),
    ScoreWithDetails(Arc<dyn Fn(&SegmentReader, Score, DocId) -> Score + Send + Sync + 'static>),
}

pub struct FunctionScoreQuery {
    query: Box<dyn Query>,
    score_function: ScoreFunction,
}

so that we can match different types of ScoreFunction. However, this cannot prevent users from using the ScoreWithDetails when they don't need the SegmentReader at all.

If we accept users to dynamically use the segment reader based on different conditions, we have to include it in the closure input parameter and in the end each call will still calling the segment reader. Or do you have any suggestions?

@adamreichold
Copy link
Contributor

The idea is not allow the scoring function access to the segment reader but only to the relevant column itself, because after that column has been loaded, access to the values becomes essentially infallible.

type ScoreFunction = Arc<dyn Fn(DocId, Score) -> Score + Send + Sync + 'static>;

type ScoreFunctionFactory = Arc<dyn Fn(&SegmentReader) -> crate::Result<ScoreFunction> + Send + Sync + 'static>;

(The score function will access the column retrieved from the segment reader in the factory by capturing it.)

@alex-au-922
Copy link
Author

Yes, that makes sense by utilizing the type checker system. However, I'm still thinking how can we be benefited with not accessing the segment reader by every call, as @fulmicoton mentioned:

If your fucntion needs to access a column from a fast field, you don't want to access it at each call.

I think it's better to separate two structs as FunctionScoreQuery (only dealing with the score) and a FunctionSegmentQuery (accessing the segment reader)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding Function Score Query
3 participants