Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Change Tracker information to Sitemap #3641

Merged
merged 14 commits into from
May 16, 2024

Conversation

mtrojan-ub
Copy link
Contributor

@mtrojan-ub mtrojan-ub commented May 7, 2024

... as mentioned in today's community call, here's a first attempt to add the lastmod field.
Basically works, but i'd really appreciate feedback before working on the details! (e.g. changing test cases and so on)

TODO

Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @mtrojan-ub, this looks like a great start! I've left some suggestions below for possible simplifications/reorganizations. I'll be interested to hear what @EreMaijala thinks when he has a chance, too.

config/vufind/sitemap.ini Outdated Show resolved Hide resolved
config/vufind/sitemap.ini Outdated Show resolved Hide resolved
config/vufind/sitemap.ini Outdated Show resolved Hide resolved
Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the progress on this, it's looking very good to me. I see some tests are still failing, so please let me know if you need any help sorting that out. Also, one question below...

@@ -212,7 +212,7 @@ public function getIds(

$params->set('rows', $limit);
$params->set('start', $offset);
$params->set('fl', $this->getConnector()->getUniqueKey());
$params->add('fl', $this->getConnector()->getUniqueKey());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to work fine, but I can't find documentation indicating that repeating the fl parameter is expected behavior. I originally had something like this in mind:

        $fl = implode(',', $params->get('fl'));
        $flParts = explode(',', $fl);
        $flParts[] = $this->getConnector()->getUniqueKey();
        $params->set('fl', implode(',', array_unique($flParts)));

Maybe my solution is overcomplicated and unnecessary, but it feels like it might stick a little closer to standard Solr behavior. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember this PR? #3368

Deduplication is now handled by the ParamBag class and is active by default so we should be able to use it here for free without writing additional code. Or do you think there will be side effects?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtrojan-ub, my concern is that this method creates a Solr URL containing fl=last_indexed&fl=id. I would expect fl=last_indexed,id instead based on what the documentation says. As I noted, repeating the fl parameter seems to work correctly, but since I can't find any documentation that it is supposed to work, I'm concerned that it might be a fragile solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right! We should avoid generating these kind of URLs.

Just one more question: Is this a more generic problem? There seems to be no standard for passing HTTP array parameter across all kinds of systems (=Search Backends):
https://stackoverflow.com/questions/6243051/how-to-pass-an-array-within-a-query-string

At the moment, there is an implementation in ParamBag::request() which takes a paramBag and converts the params into a form which can later be used for a HTTP request. But if there is no general solution for this problem, should the logic then instead be moved from the ParamBag into e.g. the QueryBuilder implementation of each system?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a tricky problem with no standardized answer, since URL processing is very system-specific. There are even cases where different parameters need different handling -- e.g. in this situation, we would prefer to concatenate fl values with a separator, but other Solr parameters like fq are definitely intended to be repeated (and that's the case this code was originally written to support).

One solution might be to come up with a variety of value combination algorithms, and let the request() method take a parameter which maps field names to algorithms, so it can process things appropriately. Or it could even have a signature like string $defaultCombineAlgorithm, array $combineAlgorithmByField. Some algorithms might need to be passed in as arrays so we can give them arguments -- e.g. if we wanted a "combine with delimiter" algorithm that could use different delimiters and/or escaping strategies.

I'm not sure whether or not it's worth developing this type of solution -- I think, at least for the purposes of this PR, my proposed code should do the job -- but I'm certainly open to discussing further and starting work in a separate PR if there's enough benefit to justify the effort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, that might generate lots of effort and might only be worth while if we run into the same problem on multiple occasions, so for now i just added your proposed code snippet as a workaround.

Thanks!

@mtrojan-ub
Copy link
Contributor Author

Thanks for the review, @demiankatz! I already fixed part of the tests myself, but failed to fix the other ones yet so if you have some time left to help here i'd gladly appreciate it! (But no need to rush)

@demiankatz
Copy link
Member

@mtrojan-ub, I'll wait and see how you decide to handle the fl situation before I do any work on tests here... but once that piece is settled, I'll be happy to help with tests if time permits.

@demiankatz demiankatz added this to the 10.0 milestone May 8, 2024
@demiankatz demiankatz requested a review from EreMaijala May 8, 2024 16:00
@demiankatz
Copy link
Member

@mtrojan-ub, I fixed broken tests, and while I was working on them, I also modernized the test code and added coverage for the functionality added here.

Running tests also revealed some problems with my proposed fl merging solution, so I improved that code as well.

Please take another look at my recent changes and let me know if you have any concerns. If you do not, I think this is ready to merge, but I'd like to wait to hear from @EreMaijala first just in case he sees anything that we have missed.

@mtrojan-ub
Copy link
Contributor Author

Thanks! Looks good to me now!

@mtrojan-ub
Copy link
Contributor Author

One more note on the whole topic:
There is a certain trend right now to actively publish your new URLs via an API to the search provider, instead of waiting until the crawler detects your new URLs at some point. Here are some examples for Google Indexing API as well as Bing IndexNow. The number of URLs you can register per day is usually limited, so it might be useful to keep the current sitemap approach in addition, but we should keep an eye on this topic and see how it develops over the next couple of years. Maybe it will be useful to provide some support for these kind of APIs in the future.

Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @mtrojan-ub! I'm approving this but will wait to hear from @EreMaijala before merging it.

Regarding the search engine APIs, that sounds like a good candidate for some kind of console utility (which would probably need a database table or other persistence mechanism to keep track of what has been sent out and what hasn't). Maybe it's worth opening a JIRA ticket to keep track of the idea. Would you like to do that, or should I?

@mtrojan-ub
Copy link
Contributor Author

@demiankatz: Good idea! I created a JIRA ticket for this one.

@demiankatz
Copy link
Member

Thanks, @mtrojan-ub!

Copy link
Contributor

@EreMaijala EreMaijala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

For whatever it's worth I don't really like the idea of sending updates to search engines via APIs. While it probably makes sense in some cases, it necessarily puts some search engines in advantage of others, and I'd prefer at least some resemblance of neutrality. :)

@demiankatz demiankatz merged commit c012952 into vufind-org:dev May 16, 2024
7 checks passed
@demiankatz demiankatz deleted the upstream_sitemap branch May 16, 2024 17:46
@demiankatz
Copy link
Member

For whatever it's worth I don't really like the idea of sending updates to search engines via APIs. While it probably makes sense in some cases, it necessarily puts some search engines in advantage of others, and I'd prefer at least some resemblance of neutrality. :)

This also feels like an area where APIs might change or get deprecated with some frequency, leading to ongoing maintenance work. So I'm not incredibly enthusiastic about it either, but if people want it and somebody is willing to develop it, I certainly wouldn't turn code away... it's just not a high personal priority for me. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants