-
Notifications
You must be signed in to change notification settings - Fork 355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use cursorMark for fetching IDs in sitemap generator. #1253
Conversation
@EreMaijala, I've merged your verbose mode into master with some minor adjustments, and I've added a timer so we can use verbosity for benchmarking. I generated sitemaps against an index of about 1.3 million records and got these results:
I think this suggests that we should NOT remove the terms component support, as it is clearly still the most performant option when viable. However, I would certainly favor replacing the old search mode with the cursor-based search mode. Let me know if you'd like me to take care of reintroducing the terms mode, or if you'd like me to do it. In any case, thanks for working on this -- it's definitely a major improvement for cases where terms will not work... but it seems like it may also still be worth investigating a way to make terms work in a distributed environment. |
Oh, the terms component is pretty impressive, it seems. |
@EreMaijala, you are correct about hidden filters, etc. -- that was the original use case that led to the creation of a search-based alternative to use of the terms component. (It's actually mentioned in the comments of the .ini file... though maybe we need to do a better job of drawing people's attention to this detail). |
I saw the comment but as far as I can see the search-based alternative didn't use any filters, so even though the comment is there, something more is needed to actually use the filters. Unless I'm missing something, obviously.. Anyway, I suppose the question is if it still makes sense to keep the terms version. If it does, I've no objection to that. |
@EreMaijala, when the code was originally written, it was all going through classes that applied hidden filters, etc., automatically, and it "just worked." It's possible that things have been refactored since that time and it's not doing what it used to -- I haven't re-tested in ages. Regarding the terms version, I think it's worth keeping since hidden filters are not used by everyone, and the performance gains are tremendous. Maybe it's worth considering changing the default, though, and noting that you can switch to terms if you don't need certain functionality. It's probably better to have the most functionally complete rather than the most performant option as the default. |
@demiankatz Did you verify that the terms method actually works? It looks to me like there's a bug in VuFindSearch\Backend\Solr\Response\Json\NamedList::toArray method. It didn't work unless I change the foreach to access $this->list, and then the terms method is way slower than cursorMark for me. |
@EreMaijala, that's odd -- during my tests I confirmed that I got identical output from the terms method and from your revised code... and we currently use the terms method in production. Is it possible that the response format is different on your end because of your cloud configuration? If you spin up the generic test instance do you see the same problem? |
@EreMaijala, alternatively, could you tell me what PHP version you're using? Is it possible there's some incompatibility around the Iterator interface? Calling foreach on |
@demiankatz I just verified that in the master branch I get the contents in the NamedList, but the toArray method returns an empty list. |
@EreMaijala, how are you testing this? I can try to reproduce the situation on this end... but I find this very puzzling. Are you seeing the problem when you run the test that I cited? |
@demiankatz No problem with the tests (I assume |
So the problem was that I was accidentally testing with a sharded index that for some reason returns different results. |
It seems that I actually finished adding back the terms method but failed to push it. Anyway, should be done now, but please double-check. |
We tested it on our side and it works fine. Here are some numbers for our index (30 million documents) :
As we don't want to specify the shards names for SOLR Cloud, on our side we will use the cursor-based search mode. Thanks for this ! |
config/vufind/sitemap.ini
Outdated
; 'cursorMark' (the default, most compatible but slower method), or 'terms' (the | ||
; faster option). Note that 'terms' method does not support hidden filters or | ||
; other limiting options and requires that the index has terms enabled. | ||
retrievalMode = cursorMark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a strong reason to replace the 'search' mode with 'cursorMark'? Even though you are using a cursor mark, it is still a search-based retrieval method, so I don't think renaming it (and thus potentially breaking existing search-based configurations) is necessary. Let me know if you feel otherwise!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not really. It just started as another method, but I've changed it back.
@EreMaijala, I had a brainstorm about how to simplify the code a little bit -- better encapsulating the cursorMark- and terms-specific logic within the appropriate functions instead of having them bleed into the top-level generateForBackend method. Now the "last term in set" and "cursor mark" are treated generically in a uniform way. Please take a look at commit 2a7fbe7 and let me know what you think! If you're happy with that, I think this can be merged. |
@demiankatz Looks good to me, and cleaner! |
No description provided.