Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate to methods of Index #15

Merged
merged 13 commits into from
Aug 19, 2022
Merged

Conversation

tmpfs
Copy link
Collaborator

@tmpfs tmpfs commented Aug 19, 2022

Closes #13.

@tmpfs
Copy link
Collaborator Author

tmpfs commented Aug 19, 2022

@marcus-pousette, I think this is more or less there. Later I think it would be nice to spend a bit of time improving the API documentation but for the moment I think this is good enough.

One thing that bothers me a bit is that query() takes a mutable reference to the Index and that seems counter-intuitive to what a query function should do. I wonder if there is any way to remove the mutability or otherwise to use interior mutability so we could use &self - maybe better in a separate issue/PR.

Now that query() is a method of Index it makes more sense to have the
query helper functions in the index module.

Keep the query module just for tests.

Some improvements to the handling of test utility functions and flatten
the score module hierarchy for better documentation output.
@marcus-pousette
Copy link
Collaborator

marcus-pousette commented Aug 19, 2022

Thanks for the PR! @tmpfs

The only thing I am considering is that, if it is necessary to have "everything" is in the index file (query just contain tests with this PR). I see why you made this changes (because the query fn is now a member of Index). I wonder if it would be possible to have separation of this, so I as a reader can more quickly get an overview/insight into different parts of this library without having to read through a long index file. I am also taking height for a future where the query capabilities are much more feature rich. Which would make the index file even larger if the same approach is used.

For example you could perhaps make

use crate::query::query;

impl Index {
    fn query(....) {
        query(...)
    }
}

instead of

impl Index {
    fn query(....) {
       " ----- 100 lines of code ----- "
    }
}

fn query_helper_method_1 ( ... ) {}

fn query_helper_method_2 ( ... ) {}

What do you think?

I fully agree that passing a mutable index to the query is not right / looks weird. The reason for this is that is that I am vacuuming removed documents during query time since you might be iterating over them. Not sure if this feature is worth the price of having to pass a mutable reference. Maybe it would be better to return pointers to nodes that you later can remove instead, or something.

@tmpfs
Copy link
Collaborator Author

tmpfs commented Aug 19, 2022

Sure @marcus-pousette that sounds reasonable to separate out the modules. I will update this PR with those changes soonish.

I thought that automatic vacuuming was the reason for the mutable reference and thanks for clarifying that. I would argue that it is not worthwhile to automatically vacuum. When i was integrating with my project i had a nicely encapsulated index but had to expose a mutable reference in order to query and i would like to remove that as the owner of the index should only be able to mutate it.

For consumers of the library that need vacuuming then having the query ignore the removed documents is enough i think. Eventually the index will be purged when vacuum is called explicitly.

@tmpfs
Copy link
Collaborator Author

tmpfs commented Aug 19, 2022

@marcus-pousette, please take another look, I think it works better now; in particular the trait bounds for T are clearer and more consistent we only define them twice (in the index and query modules) for each impl Index block 👍

@marcus-pousette
Copy link
Collaborator

Yes, looks great now!

Now. query module is completely independent of index (i.e. there can be an index without any query capabilities potentially).

@tmpfs
Copy link
Collaborator Author

tmpfs commented Aug 19, 2022

Yes, we could put query behind a feature flag if necessary. I would imagine we would certainly want to do that for the fuzzy matching feature but not sure whether it makes sense for query - is the index useful without query?

@marcus-pousette
Copy link
Collaborator

I thought that automatic vacuuming was the reason for the mutable reference and thanks for clarifying that. I would argue that it is not worthwhile to automatically vacuum. When i was integrating with my project i had a nicely encapsulated index but had to expose a mutable reference in order to query and i would like to remove that as the owner of the index should only be able to mutate it.

For consumers of the library that need vacuuming then having the query ignore the removed documents is enough i think. Eventually the index will be purged when vacuum is called explicitly.

I agree with you. I realize now that there might even be immediate performance benefits to do this your way since we are to remove some checks. I created an issue for this: #16

@marcus-pousette
Copy link
Collaborator

marcus-pousette commented Aug 19, 2022

Yes, we could put query behind a feature flag if necessary. I would imagine we would certainly want to do that for the fuzzy matching feature but not sure whether it makes sense for query - is the index useful without query?

Query capabilities will always be necessary in someway but now at least you can implement your own query module and not include this query module when depending on this library (if there is a feature flag).

I am not sure how far back one need to go to implement the fuzzy matching. But I am quite sure you do not need to reinvent everything in the query module, there must be some smart abstraction to make that allows existing functionality coexist nicely with fuzzy matching features. We could take this in the #12 issue, but maybe also think about what fuzzy matching could represent in a bigger picture, for example, many "fuzzy" matching systems today are using synonym/word2vec modules to measure distances from what you searching for and what exist in the index. Fuzzy matching is in some sense a property where we set the allowed "error distance" where distance could represent Levenshtein distance, or whatever distance type that is feasible for your need. Continuing on this, perhaps it would be interesting to pass an "error" function, that allows you during query to set the precision of your query. This error function could itself be a function of some distance measurement algorithm, like Levenshtein or word2vec word distance norm

@marcus-pousette marcus-pousette merged commit a2cef1d into quantleaf:master Aug 19, 2022
@tmpfs
Copy link
Collaborator Author

tmpfs commented Aug 20, 2022

Thanks for landing this @marcus-pousette, I agree that we shouldn't need to make many changes to support fuzzy matching. Allowing complete control of query through a feature flag is possible but I think a better approach is to allow complete control of the query behavior (scoring, matching, filtering, sorting) via the arguments to query.

First I will deal with #16 and I want to spend a bit more time on the documentation to help my understanding of the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use methods rather than functions
2 participants