Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VUFIND-1630] Alphabrowse: new normalizer for titles based on SolrMarc titleSortLower #3024

Merged

Conversation

damien-git
Copy link
Contributor

@damien-git damien-git commented Aug 4, 2023

This makes sure the same normalizer is used for indexing and browsing. It uses a new normalizer based on SolrMarc titleSortLower. The same field as before is used for indexing (title_sort).

TODO

  • Merge related PR in vufind-browse-handler first.
  • Implement XML equivalent title_sort behavior
  • Add equivalent to SolrMarc cleanData in XML formatting logic
  • Make sure Windows .bat file works the same as Linux .sh file.
  • Add Mink test (if time permits)
  • Add changelog note (reindex needed for XML records)

Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this started! In addition to the issue highlighted below, two other questions/comments:

1.) Why is solrmarc_core.jar included in this PR? Did you mean to replace the browse handler jars instead?

2.) We'll need to implement equivalent XML processing to avoid breaking browse headings related to XML input. Right now, all of our XSLTs have lines like this:

https://github.com/vufind-org/vufind/blob/dev/import/xsl/dspace.xsl#L140

I think we need to add a PHP version of titleSortLower to \VuFind\XSLT\Import\VuFind and modify the XSLT examples to wrap this around the existing stripArticles call.

If you don't want to spend time on this because it's not relevant to your use cases, we can add a TODO checkbox and I'll try to find time to help with it when I'm back from vacation!

index-alphabetic-browse.sh Outdated Show resolved Hide resolved
index-alphabetic-browse.sh Outdated Show resolved Hide resolved
Copy link

@todolson todolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not yet able to test the main issue of the new normalizer, but see in-file comments about BITNAMI_SOLR_HOME.

@damien-git
Copy link
Contributor Author

It's my turn to be on vacation, and I won't be able to address all the questions for a few days. I am fine with naming this variable differently, SOLR_JAR_PATH sounds good.
The last entry for the CLASSPATH should probably be ${BITNAMI_SOLR_HOME}/server/lib/ext/* instead (because log4j*.jar might not work).
The change to use a new env variable was intentional, because we use Solr separately from Vufind (in a different container) and we currently have to recreate a special environment with a vendor link to make things work without changing the scripts. It would be nice to be able to use the indexing script without the Vufind structure dependency.

@demiankatz
Copy link
Member

@damien-git, would it make sense to set up SOLR_JAR_PATH by first checking for the existence of $VUFIND_HOME/solr/vendor, and then failing over to $BITNAMI_SOLR_HOME as a possible next resort?

I don't think using a more generic search path is a huge problem, though if I'm remembering correctly, we may have used the log4j*.jar pattern because using the entire directory was slowing things down by searching too many unneeded jars.

In any case, I hope you have a great vacation, and we can resume this conversation when we're both back in the office!

@damien-git
Copy link
Contributor Author

damien-git commented Aug 10, 2023

would it make sense to set up SOLR_JAR_PATH by first checking for the existence of $VUFIND_HOME/solr/vendor, and then failing over to $BITNAMI_SOLR_HOME as a possible next resort?

I think it makes sense to use SOLR_JAR_PATH if it is defined, and use $VUFIND_HOME/solr/vendor otherwise, which is what the script does now. Because SOLR_JAR_PATH is not used for anything else, so if it is defined we can assume users want it to be used for this script.

I don't think using a more generic search path is a huge problem, though if I'm remembering correctly, we may have used the log4j*.jar pattern because using the entire directory was slowing things down by searching too many unneeded jars.

I think log4j*.jar would not work, as Java only understands * for names in the classpath. It might slow things down a bit, yes. I guess we could improve the CLASSPATH creation by looking for log4j*.jar in bash and assemble the related CLASSPATH part. Edit: I implemented that.

Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @damien-git, this is looking good to me.

I've just merged vufind-org/vufind-browse-handler#43, so this PR can be the place where we pull the updated browse handler code into the dev branch. I notice that you've only updated the browse handler jar here, though; should the browse indexer also be updated?

Also, one other question below...

index-alphabetic-browse.sh Outdated Show resolved Hide resolved
@demiankatz demiankatz added this to the 9.1 milestone Aug 31, 2023
Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay getting back to this -- I'm still catching up on work that piled up during WOLFcon! It looks like most of the issues under discussion have been resolved, but I had one more question about the existing code when reviewing the state of things. Once we settle this conversation, I'll work on updating the Windows .bat file to match. In the meantime, if time permits, I'll get started on updating the XML indexing logic to match.

index-alphabetic-browse.sh Outdated Show resolved Hide resolved
@demiankatz
Copy link
Member

@damien-git, I've just implemented XML title_sort normalization. While I obviously wouldn't guarantee that this is 100% equivalent to what SolrMarc is doing, it's at least a great deal closer and much less likely to result in browse handler inconsistencies. Feedback on my implementation is welcome!

@damien-git
Copy link
Contributor Author

Great, we had put that into a future sprint, but I will look at your implementation instead.

@damien-git
Copy link
Contributor Author

@demiankatz OK, I took a look. I am not sure how close you are trying to be with the SolrMarc implementation, but I think we could improve 2 things:

@demiankatz
Copy link
Member

@damien-git, thanks for taking another look at this!

A few comments:

1.) I failed to find the cleanData method when I was trying to make sense of the SolrMarc code. I couldn't figure out what eCleanVal was doing, and it seems you've found the link that I missed. I'll close this gap in the PHP version as soon as time permits!

2.) The stripAccents function is not turning a string entirely into ASCII -- the Latin-ASCII transformer only selects Latin accented characters and converts those to their ASCII equivalents while leaving everything else alone. One of my UNIT tests applies the transform to Japanese text to confirm that it is unaffected. There are probably still ways we could do better, but I think for common Latin accent scenarios, the transformation should be equivalent.

3.) I'm still a bit confused about the Log4j situation. In recent VuFind releases, SolrMarc's Log4j has been replaced with Reload4j (which maintains the 1.x API but backports security fixes). This is a temporary measure until SolrMarc upgrades to Log4j 2.x. However, all the necessary .jar files for this are found in import/lib -- so I'm not sure why extra work should be needed for Log4j jars if the import/lib folder is on the classpath. I'm quite short on time this week, but if you need me to do any experimentation around this, let me know and I will try as soon as time permits.

@damien-git
Copy link
Contributor Author

OK. Regarding 3) I was not aware that reload4j was implementing the Log4j API. We run our code on a container separate from VuFind, so we copy only the jars we need into a temporary "import/lib" structure. Currently we are only copying marc4j from there. It is entirely possible that it will work with reload4j, and that we can simplify the classpath and remove Log4j. I will try that as soon as I can.

@demiankatz
Copy link
Member

Thanks, @damien-git -- fingers crossed that that helps. :-)

@damien-git
Copy link
Contributor Author

Yes, it worked. I will update the PR accordingly.

…ll implement its API and it is already on the classpath
@demiankatz
Copy link
Member

Thanks for the updates, @damien-git -- I should be able to work on the Windows port of the import script tomorrow.

In the meantime, I've added a port of the cleanData logic in commit beff9a5. I did a fairly literal port of the SolrMarc logic, even though in the context of titleSortLower, many of the subtleties get wiped out anyway. I haven't taken the time to carefully test the individual support methods, but I think the overall effect should be reasonable in most cases. I certainly welcome the addition of further test cases if anyone has time and thinks it's worthwhile.

@damien-git
Copy link
Contributor Author

Great, thanks !

@demiankatz
Copy link
Member

@damien-git, I have finished porting the index-alphabetic-browse logic to the Windows .bat file, so I think this is now functionally complete. I might try to add another Mink test case if time permits, if there's something in the sample data we can use to reproduce the problem (and thus prevent future regressions). I've asked @EreMaijala for a review as well, in case he has any concerns about how this might interact with his RecordManager ecosystem.

Copy link
Contributor

@EreMaijala EreMaijala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Solr jar finding bit would warrant a separate commit, but other than that this looks ok to me. Just a small comment about a comment. :)

index-alphabetic-browse.bat Outdated Show resolved Hide resolved
index-alphabetic-browse.sh Outdated Show resolved Hide resolved
@demiankatz
Copy link
Member

Thanks, @EreMaijala! I've opened #3088 to extract the SOLR_JAR_PATH logic into a separate commit and revise the comment to be more generic as you suggested. Once that's merged, I'll resolve the conflicts here.

Copy link
Contributor

@EreMaijala EreMaijala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready to go now, though I'll wait for @EreMaijala's double-check before merging.

As promised, I added a few more tests to the Mink suite for AlphaBrowse. We could certainly still do a more thorough job of testing, especially if we added some test records to specifically highlight specific behaviors. Most of the tests I added pass both here and on the dev branch, so I'm not necessarily proving a whole lot. However, the "accent stripping" case works here and fails on dev, so we do have at least some confirmation that this version improves on the dev version.

@demiankatz demiankatz merged commit 342a245 into vufind-org:dev Sep 14, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants