Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FM index cursor enable node access #2061

Closed
tloka opened this issue Aug 19, 2020 · 14 comments · Fixed by #2076
Closed

FM index cursor enable node access #2061

tloka opened this issue Aug 19, 2020 · 14 comments · Fixed by #2076
Labels
feature/proposal a new feature or an idea of

Comments

@tloka
Copy link
Contributor

tloka commented Aug 19, 2020

Hey,

I have one more request concerning the FM index cursor. While with #2048 you provide cerealisation for FM index cursors (which is great and clearly resolves #2044 ), this unfortunately does not resolve a different issue I have with the current implementation of the cursor.

Need

For my FM index algorithm, I need to compare the locations of FM index cursors. I use this for sorting, removing duplicates, detecting nodes that are "contained" in other nodes, etc.. Therefore, I need access to the cursor members to compare cursors with each other, either manually or by enabling the implementation of operators such as < to use stl functions.

Solution

To be able to do this with SeqAn3, I would need to access at least the node of FM index cursors which is currently not possible (private member). Would it be possible to provide access to the node of FM index cursors? As these are not too big, I guess a getter returning a copy would in principle be fine in case you want to ensure that corruption of a cursor is not possible, though returning a reference or a getter to access each node member individually would probably be more efficient.

Notes

I am aware that == and != operators for FM index cursors are defined. Still, this is not sufficient for my purpose as I need very special types of comparisons which would probably not be appropriate for general purpose (e.g., in some of my comparisons the depth should not be included in the comparison, which it currently is for the == operator which generally also makes sense). This is why I would like to have access to the members and define my own comparisons instead of asking for operators directly.

@tloka tloka added the feature/proposal a new feature or an idea of label Aug 19, 2020
@marehr
Copy link
Member

marehr commented Aug 24, 2020

Core Meeting 2020-08-24:

Enrico wanted to talk with you on Wednesday, to understand your use case. The initial thought was to make (lower|upper)_bound public available (low-level) functions, but we have to see if that fits your use case.

@eseiler
Copy link
Member

eseiler commented Aug 26, 2020

We can implement getter functions for what you need.

Do you use the fm_index_cursor with parent_lb, parent_rb and node?
Or the bi_fm_index_cursor with fwd_lb, fwd_rb, rev_lb, rev_rb, parent_lb, parent_rb and depth?

What do you use and what kind of access (do you want to modify them) would you need on which members?
Implementation should be straight forward.

@tloka
Copy link
Contributor Author

tloka commented Aug 27, 2020

Thanks for your responses and the feedback.

I am currently using the fm_index_cursor, though it is possible that I switch to the bidirectional index at some point.
As far as I understand, the actual position in the index is fully described in the node!? Thus, currently accessing the node would probably be sufficient for now. The counterpart in the bidirectional index would be fw and rev boundaries and the depth.

I don't need to modify the values, and I think it generally makes sense that the members can't be modified from outside to prevent corrupt state of a cursor. Therefore I would propose to return by value or const ref.

@eseiler
Copy link
Member

eseiler commented Aug 27, 2020

I added the node access to fm_index_cursors in #2076
We can add it for bi_fm_index_cursors later

@marehr
Copy link
Member

marehr commented Aug 27, 2020

As far as I understand, the actual position in the index is fully described in the node!?

I'm sorry, but I still don't understand what you need. It seems that you don't need the node itself, but some value that can be computed / retrieved from the node. Can you try out @eseiler Pr and write us exactly what members of that node you need?

As we talked about this in the core meeting it seemed that the (suffix?) interval, i.e. lower- and upper-bound, would suffice for your use case.

If we can, I would prefer to keep the node an internal implementation detail and give a stronger interface for values that are needed.

@eseiler
Copy link
Member

eseiler commented Aug 27, 2020

If you just want the position, you only need the node that encapsulates all the information.

The fm_index_cursor also stores parent_lb and parent_rb but that is only needed when cycling back, i.e. going up in the tree - in this case you need to know where you were before.

@tloka
Copy link
Contributor Author

tloka commented Aug 27, 2020

As far as I understand, the actual position in the index is fully described in the node!?

I'm sorry, but I still don't understand what you need. It seems that you don't need the node itself, but some value that can be computed / retrieved from the node. Can you try out @eseiler Pr and write us exactly what members of that node you need?

As we talked about this in the core meeting it seemed that the (suffix?) interval, i.e. lower- and upper-bound, would suffice for your use case.

If we can, I would prefer to keep the node an internal implementation detail and give a stronger interface for values that are needed.

I can actually tell you without checking out the PR:
In the end, I just need the bounds (and maybe the depth), which is nearly the full node (except for the last character).
Individual getters to the bounds and the depth would therefore also be totally fine for me.
@eseiler I don't need the parents.

@eseiler
Copy link
Member

eseiler commented Aug 27, 2020

We can also add 3 getters get the members of the node.
But then we return 3 integers that together have the semantics of a node in a tree instead of just returning what it is - a node in the tree.

@marehr
Copy link
Member

marehr commented Aug 31, 2020

2020-08-31 Resolution:

We discussed what a cursor and a node of a cursor would be, and we came to the conclusion that a cursor is already the "user" abstraction for a node and that we should add the bounds and "depth" to the fm-cursor.

We should find a different name for depth because it is tied to a tree.

@tloka
Copy link
Contributor Author

tloka commented Aug 31, 2020

Could be something like suffix_length / suffix_size!?

@eseiler
Copy link
Member

eseiler commented Aug 31, 2020

The depth is already implemented as query_length

@eseiler
Copy link
Member

eseiler commented Aug 31, 2020

I updated my PR #2076

@eseiler
Copy link
Member

eseiler commented Sep 1, 2020

@tloka
Would you prefer two member function left_bound() and right_bound() or one member function that returns a pair suffix_array_bounds().

I think, the usual use case would be to look at both bounds anyways, so returning a pair may be a better interface.

@tloka
Copy link
Contributor Author

tloka commented Sep 2, 2020

This is a pure design decision. I am personally not the biggest fan of using pairs in general as it often is not directly obvious from the code what the meaning of first and second is. Though, in this case it would be pretty intuitive that first is left and second is right.

However, design decisions should imho be made by the core team, so just go for whatever is the best / most consistent from your perspective.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature/proposal a new feature or an idea of
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants