Improve MARC DOI extraction. #2473

Merged: 7 commits into vufind-org:dev on Jun 23, 2022

@demiankatz demiankatz commented Jun 21, 2022

This PR offers more sophisticated, regex-driven DOI extraction -- this method makes it possible to extract DOIs from a wider variety of URLs without having to add many lines of prefix-based checking. It also works better when URLs have proxy prefixes included.

TODO

  • Add URL decoding capabilities

@demiankatz demiankatz changed the title from "Improve DOI extraction." to "Improve MARC DOI extraction." on Jun 21, 2022
@demiankatz
Member Author

@mtrojan-ub / @EreMaijala, it was brought to my attention that VuFind's default MARC DOI extraction left quite a bit to be desired - in addition to failing to account for variations of the DOI base URLs, it also did not properly URL-decode DOIs containing special characters. I believe that this PR should fix both of those issues. What do you think?
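To illustrate the idea, here is a minimal sketch of regex-driven DOI extraction followed by URL decoding. It is not the code from this PR: the class name `DoiRegexSketch`, the helper `extractDois`, the sample pattern `DOI_REGEX`, and the plain list of URL strings (standing in for values read from MARC fields) are all assumptions made for the example.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoiRegexSketch {
    // Hypothetical pattern: anything ending in "doi.org/" followed by a DOI
    // starting with "10.". The greedy ".*" lets a proxy prefix precede the
    // DOI resolver's host name.
    static final String DOI_REGEX = "https?://.*doi\\.org/(10\\.[^?#\\s]+)";

    // Returns the URL-decoded contents of the chosen capture group for every
    // URL that matches the expression.
    static Set<String> extractDois(List<String> urls, String regEx, int groupIndex) {
        Pattern pattern = Pattern.compile(regEx);
        Set<String> result = new LinkedHashSet<>();
        for (String url : urls) {
            Matcher matcher = pattern.matcher(url);
            if (matcher.find()) {
                // The Charset overload of decode() requires Java 10 or later.
                // Note that URLDecoder also turns '+' into a space.
                String raw = matcher.group(groupIndex);
                result.add(URLDecoder.decode(raw, StandardCharsets.UTF_8));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "https://doi.org/10.1000/182",
            "https://proxy.example.edu/login?url=https://dx.doi.org/10.1234/abc%3C123%3E",
            "https://example.org/not-a-doi");
        System.out.println(extractDois(urls, DOI_REGEX, 1));
    }
}
```

A single expression like this covers both the plain and the proxied URL, which is the advantage over checking a list of fixed prefixes.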

Contributor

@EreMaijala EreMaijala left a comment

Looks good to me, but I'll defer final judgement to @mtrojan-ub.

*/
public Set<String> getDoisFromUrlWithRegEx(final Record record, String fieldSpec, String regEx, String groupIndex) {
// Build the regular expression:
Pattern pattern = Pattern.compile(regEx);
Contributor

This method might be called millions of times, and it will always compile the same pattern over and over again. Can we use some kind of cache for compiled patterns?

Or maybe we could use something like the CachedPattern class in this article:
https://stackoverflow.com/questions/13420321/does-pattern-compile-cache

(We should also keep in mind that the cache should be thread-safe)

Member Author

Thanks, @mtrojan-ub -- what do you think of e24ee32?

Contributor

That's definitely better! However, for a more idiomatic Java solution you might want to use computeIfAbsent() instead of a hand-written if ... get ... else ... put block, because on a ConcurrentHashMap it performs the lookup and the insert as a single atomic operation. This is already used in several other parts of the code; here are two examples:

? null : this.transliterators.computeIfAbsent(transliterationRules, rules ->

return sanitizedConfigCache.computeIfAbsent(sanitizedCacheKey, retVal -> {

So it should basically be one line of code:
Pattern pattern = patternCache.computeIfAbsent(regEx, Pattern::compile);
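For reference, `Map.computeIfAbsent` takes the key as its first argument and the mapping function as its second. A minimal, self-contained sketch of such a cache follows; the class name `PatternCacheSketch`, the field `PATTERN_CACHE`, and the helper `cachedPattern` are hypothetical and not taken from the PR, and the use of `ConcurrentHashMap` is an assumption about how thread safety would be achieved.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

class PatternCacheSketch {
    // Keyed by the regular expression string. ConcurrentHashMap makes
    // computeIfAbsent atomic, so each expression is compiled at most once
    // even when several threads ask for it at the same time.
    private static final Map<String, Pattern> PATTERN_CACHE = new ConcurrentHashMap<>();

    // Returns the cached Pattern for regEx, compiling it on first use.
    static Pattern cachedPattern(String regEx) {
        return PATTERN_CACHE.computeIfAbsent(regEx, Pattern::compile);
    }

    public static void main(String[] args) {
        Pattern first = cachedPattern("10\\.\\d+/\\S+");
        Pattern second = cachedPattern("10\\.\\d+/\\S+");
        // Both calls return the same compiled instance.
        System.out.println(first == second);
    }
}
```

Because compiled `Pattern` objects are immutable and safe to share between threads, only the map itself needs to be concurrent; each caller still creates its own `Matcher`.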

Member Author

Good point, @mtrojan-ub, that's a lot nicer -- see 1d867dc.

Member Author

(I also accidentally introduced unwanted whitespace into the comment in that commit; it is fixed in a subsequent commit. Sorry about that!)

Contributor

Looks fine now, thanks!

Member Author

Thanks again for the very helpful input, @mtrojan-ub!

Contributor

@mtrojan-ub mtrojan-ub left a comment

See the comment related to CachedPattern.

@demiankatz demiankatz merged commit 82ceab7 into vufind-org:dev Jun 23, 2022
EreMaijala pushed a commit to EreMaijala/vufind that referenced this pull request Jan 13, 2023