Improve MARC DOI extraction. #2473
@mtrojan-ub / @EreMaijala, it was brought to my attention that VuFind's default MARC DOI extraction left quite a bit to be desired - in addition to failing to account for variations of the DOI base URLs, it also did not properly URL-decode DOIs containing special characters. I believe that this PR should fix both of those issues. What do you think?
Looks good to me, but I'll defer final judgement to @mtrojan-ub.
 */
public Set<String> getDoisFromUrlWithRegEx(final Record record, String fieldSpec, String regEx, String groupIndex) {
    // Build the regular expression:
    Pattern pattern = Pattern.compile(regEx);
This method might be called millions of times, and it will always compile the same pattern over and over again. Can we use some kind of cache for compiled patterns?
Or maybe we could use something like the CachedPattern class in this article:
https://stackoverflow.com/questions/13420321/does-pattern-compile-cache
(We should also keep in mind that the cache should be thread-safe)
Thanks, @mtrojan-ub -- what do you think of e24ee32?
That's definitely better! However, for an optimal Java-like solution you might wanna use computeIfAbsent() instead of a hardcoded if ... get ... else ... put block, because it's the most elementary thread-safe operation. This is already used in several other parts of the code, here's an example:
? null : this.transliterators.computeIfAbsent(transliterationRules, rules ->
return sanitizedConfigCache.computeIfAbsent(sanitizedCacheKey, retVal -> {
So it should basically be one line of code:
Pattern pattern = patternCache.computeIfAbsent(regEx, Pattern::compile);
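For readers following the thread, here is a minimal, self-contained sketch of the caching idea discussed above. It is not the code from the PR: the class name `PatternCacheSketch`, the field `PATTERN_CACHE`, and the helper `cachedPattern` are hypothetical, and it assumes the cache is a `ConcurrentHashMap` keyed by the regex source string, which is what makes `computeIfAbsent()` safe to call from several threads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical illustration; names are not taken from the PR.
class PatternCacheSketch {
    // Assumption: one shared cache, keyed by the regex source string.
    private static final Map<String, Pattern> PATTERN_CACHE = new ConcurrentHashMap<>();

    // Compiles the pattern on first use and returns the cached instance afterwards.
    // computeIfAbsent() takes the key first, then a function that builds the value from it.
    static Pattern cachedPattern(String regEx) {
        return PATTERN_CACHE.computeIfAbsent(regEx, Pattern::compile);
    }

    public static void main(String[] args) {
        Pattern first = cachedPattern("10\\.\\d+/\\S+");
        Pattern second = cachedPattern("10\\.\\d+/\\S+");
        // Both calls should hand back the same compiled object.
        System.out.println(first == second);
    }
}
```

Compared with a hand-written if ... get ... else ... put block, this avoids the window in which two threads both miss the cache and compile the same expression.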
Good point, @mtrojan-ub, that's a lot nicer -- see 1d867dc.
(Also accidentally introduced unwanted whitespace into the comment in that commit -- fixed in a subsequent commit; sorry about that!)
Looks fine now, thanks!
Thanks again for the very helpful input, @mtrojan-ub!
see comment related to CachedPattern
This PR offers more sophisticated, regex-driven DOI extraction -- this method makes it possible to extract DOIs from a wider variety of URLs without having to add many lines of prefix-based checking. It also works better when URLs have proxy prefixes included.
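To make the idea concrete, the following is a rough sketch of regex-driven DOI extraction with URL decoding. It is an illustration under stated assumptions, not the implementation in this PR: the class `DoiUrlSketch`, the method `extractDoi`, and the regular expression itself are hypothetical, and the example URLs are made up. It assumes Java 10 or later for the `URLDecoder.decode(String, Charset)` overload.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; the pattern below is not VuFind's default.
class DoiUrlSketch {
    // Matches doi.org or dx.doi.org anywhere in the string and captures the DOI in group 1.
    private static final Pattern DOI_URL =
        Pattern.compile("(?:dx\\.)?doi\\.org/(10\\.\\S+)", Pattern.CASE_INSENSITIVE);

    static Optional<String> extractDoi(String url) {
        Matcher matcher = DOI_URL.matcher(url);
        // find() rather than matches(), so text before the DOI host (such as a proxy prefix) is skipped.
        if (!matcher.find()) {
            return Optional.empty();
        }
        String raw = matcher.group(1);
        try {
            // URLDecoder treats '+' as a space, so protect literal plus signs before decoding.
            return Optional.of(URLDecoder.decode(raw.replace("+", "%2B"), StandardCharsets.UTF_8));
        } catch (IllegalArgumentException e) {
            // Malformed percent-escapes: fall back to the undecoded value.
            return Optional.of(raw);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractDoi("https://proxy.example.edu/login?url=https://doi.org/10.1000/abc%3C1%3E"));
        System.out.println(extractDoi("https://example.org/not-a-doi"));
    }
}
```

A single expression like this covers the http/https and dx.doi.org variants without a separate prefix check for each, which is the motivation described above.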
TODO