Use cursorMark for fetching IDs in sitemap generator. #1253

EreMaijala · 2018-10-23T14:48:20Z

No description provided.

demiankatz · 2018-10-30T13:34:39Z

@EreMaijala, I've merged your verbose mode into master with some minor adjustments, and I've added a timer so we can use verbosity for benchmarking. I generated sitemaps against an index of about 1.3 million records and got these results:

existing search mode: 414 seconds
your cursor-based search mode: 193 seconds
terms component: 6 seconds

I think this suggests that we should NOT remove the terms component support, as it is clearly still the most performant option when viable. However, I would certainly favor replacing the old search mode with the cursor-based search mode.

Let me know if you'd like me to take care of reintroducing the terms mode, or if you'd like me to do it.

In any case, thanks for working on this -- it's definitely a major improvement for cases where terms will not work... but it seems like it may also still be worth investigating a way to make terms work in a distributed environment.

EreMaijala · 2018-10-30T13:41:47Z

Oh, the terms component is pretty impressive, it seems.
One thing I started thinking, though, and this wouldn't be supported by the terms component I guess: shouldn't the sitemap generation honour any hidden filters or sources setting in searches.ini? Since the sitemap consumer wouldn't be able to access them anyway, they shouldn't be included, no?

demiankatz · 2018-10-30T13:44:54Z

@EreMaijala, you are correct about hidden filters, etc. -- that was the original use case that led to the creation of a search-based alternative to use of the terms component. (It's actually mentioned in the comments of the .ini file... though maybe we need to do a better job of drawing people's attention to this detail).

EreMaijala · 2018-10-30T13:47:15Z

I saw the comment but as far as I can see the search-based alternative didn't use any filters, so even though the comment is there, something more is needed to actually use the filters. Unless I'm missing something, obviously..

Anyway, I suppose the question is if it still makes sense to keep the terms version. If it does, I've no objection to that.

demiankatz · 2018-10-30T13:51:02Z

@EreMaijala, when the code was originally written, it was all going through classes that applied hidden filters, etc., automatically, and it "just worked." It's possible that things have been refactored since that time and it's not doing what it used to -- I haven't re-tested in ages.

Regarding the terms version, I think it's worth keeping since hidden filters are not used by everyone, and the performance gains are tremendous. Maybe it's worth considering changing the default, though, and noting that you can switch to terms if you don't need certain functionality. It's probably better to have the most functionally complete rather than the most performant option as the default.

EreMaijala · 2018-10-31T11:20:13Z

@demiankatz Did you verify that the terms method actually works? It looks to me like there's a bug in VuFindSearch\Backend\Solr\Response\Json\NamedList::toArray method. It didn't work unless I change the foreach to access $this->list, and then the terms method is way slower than cursorMark for me.

demiankatz · 2018-10-31T12:08:00Z

@EreMaijala, that's odd -- during my tests I confirmed that I got identical output from the terms method and from your revised code... and we currently use the terms method in production. Is it possible that the response format is different on your end because of your cloud configuration? If you spin up the generic test instance do you see the same problem?

demiankatz · 2018-10-31T12:10:59Z

@EreMaijala, alternatively, could you tell me what PHP version you're using? Is it possible there's some incompatibility around the Iterator interface? Calling foreach on $this is a bit weird, but it ought to work. There is test coverage for it at https://github.com/vufind-org/vufind/blob/master/module/VuFindSearch/tests/unit-tests/src/VuFindTest/Backend/Solr/Response/Json/NamedListTest.php#L76 -- does this test pass on your system?

EreMaijala · 2018-10-31T12:13:40Z

@demiankatz I just verified that in the master branch I get the contents in the NamedList, but the toArray method returns an empty list.

demiankatz · 2018-10-31T12:18:10Z

@EreMaijala, how are you testing this? I can try to reproduce the situation on this end... but I find this very puzzling. Are you seeing the problem when you run the test that I cited?

EreMaijala · 2018-10-31T12:26:47Z

@demiankatz No problem with the tests (I assume phing phpunit runs this one too). PHP 7.2.10. The index I'm testing with has the same terms component setup in solrconfig.xml as in VuFind's own.

EreMaijala · 2018-10-31T12:44:43Z

So the problem was that I was accidentally testing with a sharded index that for some reason returns different results.

EreMaijala · 2018-11-06T14:56:56Z

It seems that I actually finished adding back the terms method but failed to push it. Anyway, should be done now, but please double-check.

liowalter · 2018-11-06T15:02:17Z

We tested it on our side and it works fine. Here are some numbers for our index (30 million documents) :

on SOLR4 (no solrcloud) with the terms component : ~5 minutes
on SOLR7, distributed with 4 shards, with cursor-based search mode : ~1 hour

As we don't want to specify the shards names for SOLR Cloud, on our side we will use the cursor-based search mode. Thanks for this !

demiankatz · 2018-11-06T19:00:19Z

config/vufind/sitemap.ini

+;  'cursorMark' (the default, most compatible but slower method), or 'terms' (the
+;    faster option). Note that 'terms' method does not support hidden filters or
+;    other limiting options and requires that the index has terms enabled.
+retrievalMode = cursorMark


Is there a strong reason to replace the 'search' mode with 'cursorMark'? Even though you are using a cursor mark, it is still a search-based retrieval method, so I don't think renaming it (and thus potentially breaking existing search-based configurations) is necessary. Let me know if you feel otherwise!

No, not really. It just started as another method, but I've changed it back.

demiankatz · 2018-11-07T17:44:27Z

@EreMaijala, I had a brainstorm about how to simplify the code a little bit -- better encapsulating the cursorMark- and terms-specific logic within the appropriate functions instead of having them bleed into the top-level generateForBackend method. Now the "last term in set" and "cursor mark" are treated generically in a uniform way. Please take a look at commit 2a7fbe7 and let me know what you think! If you're happy with that, I think this can be merged.

EreMaijala · 2018-11-08T07:15:43Z

@demiankatz Looks good to me, and cleaner!

EreMaijala and others added 2 commits October 23, 2018 17:46

Use cursorMark for fetching IDs in sitemap generator.

fc96095

Merge branch 'master' into sitemap-cursormark

020589c

Resurrect terms method.

d62af91

liowalter pushed a commit to swissbib/vufind that referenced this pull request Nov 6, 2018

test vufind pull request vufind-org#1253

8af3c27

liowalter pushed a commit to swissbib/vufind that referenced this pull request Nov 6, 2018

resurect terms method (copied from pull request vufind-org#1253)

d08bde2

demiankatz reviewed Nov 6, 2018

View reviewed changes

Rename mode back to 'search' and revive introduction of $retrievalMode.

014b7a2

liowalter mentioned this pull request Nov 7, 2018

Test Google Indexing gegen Solr7 cluster swissbib/vufind#657

Closed

demiankatz added 2 commits November 7, 2018 11:54

Merge branch 'master' into sitemap-cursormark

bbfcaf5

Code simplification.

2a7fbe7

demiankatz merged commit 5c5df5b into vufind-org:master Nov 8, 2018

EreMaijala deleted the sitemap-cursormark branch September 22, 2020 07:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use cursorMark for fetching IDs in sitemap generator. #1253

Use cursorMark for fetching IDs in sitemap generator. #1253

EreMaijala commented Oct 23, 2018

demiankatz commented Oct 30, 2018 •

edited

Loading

EreMaijala commented Oct 30, 2018

demiankatz commented Oct 30, 2018

EreMaijala commented Oct 30, 2018

demiankatz commented Oct 30, 2018

EreMaijala commented Oct 31, 2018

demiankatz commented Oct 31, 2018

demiankatz commented Oct 31, 2018

EreMaijala commented Oct 31, 2018

demiankatz commented Oct 31, 2018

EreMaijala commented Oct 31, 2018

EreMaijala commented Oct 31, 2018

EreMaijala commented Nov 6, 2018

liowalter commented Nov 6, 2018

demiankatz Nov 6, 2018

EreMaijala Nov 7, 2018

demiankatz commented Nov 7, 2018

EreMaijala commented Nov 8, 2018

Use cursorMark for fetching IDs in sitemap generator. #1253

Use cursorMark for fetching IDs in sitemap generator. #1253

Conversation

EreMaijala commented Oct 23, 2018

demiankatz commented Oct 30, 2018 • edited Loading

EreMaijala commented Oct 30, 2018

demiankatz commented Oct 30, 2018

EreMaijala commented Oct 30, 2018

demiankatz commented Oct 30, 2018

EreMaijala commented Oct 31, 2018

demiankatz commented Oct 31, 2018

demiankatz commented Oct 31, 2018

EreMaijala commented Oct 31, 2018

demiankatz commented Oct 31, 2018

EreMaijala commented Oct 31, 2018

EreMaijala commented Oct 31, 2018

EreMaijala commented Nov 6, 2018

liowalter commented Nov 6, 2018

demiankatz Nov 6, 2018

Choose a reason for hiding this comment

EreMaijala Nov 7, 2018

Choose a reason for hiding this comment

demiankatz commented Nov 7, 2018

EreMaijala commented Nov 8, 2018

demiankatz commented Oct 30, 2018 •

edited

Loading