Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use cursorMark for fetching IDs in sitemap generator. #1253

Merged
merged 6 commits into from
Nov 8, 2018

Conversation

EreMaijala
Copy link
Contributor

No description provided.

@demiankatz
Copy link
Member

demiankatz commented Oct 30, 2018

@EreMaijala, I've merged your verbose mode into master with some minor adjustments, and I've added a timer so we can use verbosity for benchmarking. I generated sitemaps against an index of about 1.3 million records and got these results:

  • existing search mode: 414 seconds
  • your cursor-based search mode: 193 seconds
  • terms component: 6 seconds

I think this suggests that we should NOT remove the terms component support, as it is clearly still the most performant option when viable. However, I would certainly favor replacing the old search mode with the cursor-based search mode.

Let me know if you'd like me to take care of reintroducing the terms mode, or if you'd like me to do it.

In any case, thanks for working on this -- it's definitely a major improvement for cases where terms will not work... but it seems like it may also still be worth investigating a way to make terms work in a distributed environment.

@EreMaijala
Copy link
Contributor Author

Oh, the terms component is pretty impressive, it seems.
One thing I started thinking, though, and this wouldn't be supported by the terms component I guess: shouldn't the sitemap generation honour any hidden filters or sources setting in searches.ini? Since the sitemap consumer wouldn't be able to access them anyway, they shouldn't be included, no?

@demiankatz
Copy link
Member

@EreMaijala, you are correct about hidden filters, etc. -- that was the original use case that led to the creation of a search-based alternative to use of the terms component. (It's actually mentioned in the comments of the .ini file... though maybe we need to do a better job of drawing people's attention to this detail).

@EreMaijala
Copy link
Contributor Author

I saw the comment but as far as I can see the search-based alternative didn't use any filters, so even though the comment is there, something more is needed to actually use the filters. Unless I'm missing something, obviously..

Anyway, I suppose the question is if it still makes sense to keep the terms version. If it does, I've no objection to that.

@demiankatz
Copy link
Member

@EreMaijala, when the code was originally written, it was all going through classes that applied hidden filters, etc., automatically, and it "just worked." It's possible that things have been refactored since that time and it's not doing what it used to -- I haven't re-tested in ages.

Regarding the terms version, I think it's worth keeping since hidden filters are not used by everyone, and the performance gains are tremendous. Maybe it's worth considering changing the default, though, and noting that you can switch to terms if you don't need certain functionality. It's probably better to have the most functionally complete rather than the most performant option as the default.

@EreMaijala
Copy link
Contributor Author

@demiankatz Did you verify that the terms method actually works? It looks to me like there's a bug in VuFindSearch\Backend\Solr\Response\Json\NamedList::toArray method. It didn't work unless I change the foreach to access $this->list, and then the terms method is way slower than cursorMark for me.

@demiankatz
Copy link
Member

@EreMaijala, that's odd -- during my tests I confirmed that I got identical output from the terms method and from your revised code... and we currently use the terms method in production. Is it possible that the response format is different on your end because of your cloud configuration? If you spin up the generic test instance do you see the same problem?

@demiankatz
Copy link
Member

@EreMaijala, alternatively, could you tell me what PHP version you're using? Is it possible there's some incompatibility around the Iterator interface? Calling foreach on $this is a bit weird, but it ought to work. There is test coverage for it at https://github.com/vufind-org/vufind/blob/master/module/VuFindSearch/tests/unit-tests/src/VuFindTest/Backend/Solr/Response/Json/NamedListTest.php#L76 -- does this test pass on your system?

@EreMaijala
Copy link
Contributor Author

@demiankatz I just verified that in the master branch I get the contents in the NamedList, but the toArray method returns an empty list.

@demiankatz
Copy link
Member

@EreMaijala, how are you testing this? I can try to reproduce the situation on this end... but I find this very puzzling. Are you seeing the problem when you run the test that I cited?

@EreMaijala
Copy link
Contributor Author

@demiankatz No problem with the tests (I assume phing phpunit runs this one too). PHP 7.2.10. The index I'm testing with has the same terms component setup in solrconfig.xml as in VuFind's own.

@EreMaijala
Copy link
Contributor Author

So the problem was that I was accidentally testing with a sharded index that for some reason returns different results.

@EreMaijala
Copy link
Contributor Author

It seems that I actually finished adding back the terms method but failed to push it. Anyway, should be done now, but please double-check.

@liowalter
Copy link
Contributor

We tested it on our side and it works fine. Here are some numbers for our index (30 million documents) :

  • on SOLR4 (no solrcloud) with the terms component : ~5 minutes
  • on SOLR7, distributed with 4 shards, with cursor-based search mode : ~1 hour

As we don't want to specify the shards names for SOLR Cloud, on our side we will use the cursor-based search mode. Thanks for this !

liowalter pushed a commit to swissbib/vufind that referenced this pull request Nov 6, 2018
liowalter pushed a commit to swissbib/vufind that referenced this pull request Nov 6, 2018
; 'cursorMark' (the default, most compatible but slower method), or 'terms' (the
; faster option). Note that 'terms' method does not support hidden filters or
; other limiting options and requires that the index has terms enabled.
retrievalMode = cursorMark
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a strong reason to replace the 'search' mode with 'cursorMark'? Even though you are using a cursor mark, it is still a search-based retrieval method, so I don't think renaming it (and thus potentially breaking existing search-based configurations) is necessary. Let me know if you feel otherwise!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not really. It just started as another method, but I've changed it back.

@demiankatz
Copy link
Member

@EreMaijala, I had a brainstorm about how to simplify the code a little bit -- better encapsulating the cursorMark- and terms-specific logic within the appropriate functions instead of having them bleed into the top-level generateForBackend method. Now the "last term in set" and "cursor mark" are treated generically in a uniform way. Please take a look at commit 2a7fbe7 and let me know what you think! If you're happy with that, I think this can be merged.

@EreMaijala
Copy link
Contributor Author

@demiankatz Looks good to me, and cleaner!

@demiankatz demiankatz merged commit 5c5df5b into vufind-org:master Nov 8, 2018
@EreMaijala EreMaijala deleted the sitemap-cursormark branch September 22, 2020 07:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants