-
Notifications
You must be signed in to change notification settings - Fork 348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Change Tracker information to Sitemap #3641
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @mtrojan-ub, this looks like a great start! I've left some suggestions below for possible simplifications/reorganizations. I'll be interested to hear what @EreMaijala thinks when he has a chance, too.
module/VuFind/src/VuFind/Sitemap/Plugin/Index/CursorMarkIdFetcher.php
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the progress on this, it's looking very good to me. I see some tests are still failing, so please let me know if you need any help sorting that out. Also, one question below...
@@ -212,7 +212,7 @@ public function getIds( | |||
|
|||
$params->set('rows', $limit); | |||
$params->set('start', $offset); | |||
$params->set('fl', $this->getConnector()->getUniqueKey()); | |||
$params->add('fl', $this->getConnector()->getUniqueKey()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to work fine, but I can't find documentation indicating that repeating the fl parameter is expected behavior. I originally had something like this in mind:
$fl = implode(',', $params->get('fl'));
$flParts = explode(',', $fl);
$flParts[] = $this->getConnector()->getUniqueKey();
$params->set('fl', implode(',', array_unique($flParts)));
Maybe my solution is overcomplicated and unnecessary, but it feels like it might stick a little closer to standard Solr behavior. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember this PR? #3368
Deduplication is now handled by the ParamBag class and is active by default so we should be able to use it here for free without writing additional code. Or do you think there will be side effects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mtrojan-ub, my concern is that this method creates a Solr URL containing fl=last_indexed&fl=id
. I would expect fl=last_indexed,id
instead based on what the documentation says. As I noted, repeating the fl
parameter seems to work correctly, but since I can't find any documentation that it is supposed to work, I'm concerned that it might be a fragile solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right! We should avoid generating these kind of URLs.
Just one more question: Is this a more generic problem? There seems to be no standard for passing HTTP array parameter across all kinds of systems (=Search Backends):
https://stackoverflow.com/questions/6243051/how-to-pass-an-array-within-a-query-string
At the moment, there is an implementation in ParamBag::request()
which takes a paramBag and converts the params into a form which can later be used for a HTTP request. But if there is no general solution for this problem, should the logic then instead be moved from the ParamBag into e.g. the QueryBuilder implementation of each system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a tricky problem with no standardized answer, since URL processing is very system-specific. There are even cases where different parameters need different handling -- e.g. in this situation, we would prefer to concatenate fl
values with a separator, but other Solr parameters like fq
are definitely intended to be repeated (and that's the case this code was originally written to support).
One solution might be to come up with a variety of value combination algorithms, and let the request()
method take a parameter which maps field names to algorithms, so it can process things appropriately. Or it could even have a signature like string $defaultCombineAlgorithm, array $combineAlgorithmByField
. Some algorithms might need to be passed in as arrays so we can give them arguments -- e.g. if we wanted a "combine with delimiter" algorithm that could use different delimiters and/or escaping strategies.
I'm not sure whether or not it's worth developing this type of solution -- I think, at least for the purposes of this PR, my proposed code should do the job -- but I'm certainly open to discussing further and starting work in a separate PR if there's enough benefit to justify the effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, that might generate lots of effort and might only be worth while if we run into the same problem on multiple occasions, so for now i just added your proposed code snippet as a workaround.
Thanks!
Thanks for the review, @demiankatz! I already fixed part of the tests myself, but failed to fix the other ones yet so if you have some time left to help here i'd gladly appreciate it! (But no need to rush) |
@mtrojan-ub, I'll wait and see how you decide to handle the |
@mtrojan-ub, I fixed broken tests, and while I was working on them, I also modernized the test code and added coverage for the functionality added here. Running tests also revealed some problems with my proposed fl merging solution, so I improved that code as well. Please take another look at my recent changes and let me know if you have any concerns. If you do not, I think this is ready to merge, but I'd like to wait to hear from @EreMaijala first just in case he sees anything that we have missed. |
Thanks! Looks good to me now! |
One more note on the whole topic: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @mtrojan-ub! I'm approving this but will wait to hear from @EreMaijala before merging it.
Regarding the search engine APIs, that sounds like a good candidate for some kind of console utility (which would probably need a database table or other persistence mechanism to keep track of what has been sent out and what hasn't). Maybe it's worth opening a JIRA ticket to keep track of the idea. Would you like to do that, or should I?
@demiankatz: Good idea! I created a JIRA ticket for this one. |
Thanks, @mtrojan-ub! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
For whatever it's worth I don't really like the idea of sending updates to search engines via APIs. While it probably makes sense in some cases, it necessarily puts some search engines in advantage of others, and I'd prefer at least some resemblance of neutrality. :)
This also feels like an area where APIs might change or get deprecated with some frequency, leading to ongoing maintenance work. So I'm not incredibly enthusiastic about it either, but if people want it and somebody is willing to develop it, I certainly wouldn't turn code away... it's just not a high personal priority for me. :-) |
... as mentioned in today's community call, here's a first attempt to add the lastmod field.
Basically works, but i'd really appreciate feedback before working on the details! (e.g. changing test cases and so on)
TODO