Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sentence-case #26

Merged
merged 17 commits into from
Aug 9, 2023
Merged

sentence-case #26

merged 17 commits into from
Aug 9, 2023

Conversation

retorquere
Copy link
Contributor

Simpler sentence-caser. I've tried my best to make sure it is in line with the coding style for Zotero, but utilities does not have an eslint config.

@dstillman
Copy link
Member

Can we add a test to https://github.com/zotero/utilities/blob/master/test/tests/utilitiesTest.js? Assuming you have the strings separately, we can just put a JSON file with the pairs in https://github.com/zotero/utilities/tree/master/test/data and load with loadSampleData().

Once you've run npm i once, you can run tests in this repo with npm test.

@retorquere
Copy link
Contributor Author

According to the README, npm i && npm test should run the tests, but I get

no such file or directory, open 'test/../resource/schema/global/schema.json'

@retorquere
Copy link
Contributor Author

I've dumped my strings at https://gist.github.com/retorquere/8fb5a14a0b0f0a60db3df5313a258d5c . I haven't inspected everyone I must admit, this is accrued testing over the years based on user reports. I'd be happy to add tests.

@dstillman
Copy link
Member

You probably didn't do a recursive clone. (Almost all Zotero repos require recursive clones.)

You can use git submodule update --init --recursive to initialize the submodules.

@retorquere
Copy link
Contributor Author

retorquere commented Aug 5, 2023

great, thanks. But when we're talking about loadSampleData, shouldn't I be adding them to test/tests/utilities_itemTest.js? never mind, I see the pattern.

@retorquere
Copy link
Contributor Author

tests have been added.

let masked = text.replace(/<[^>]+>/g, (match, i) => {
preserve.push({ start: i, end: i + match.length, description: 'markup' });
return '\uFFFD'.repeat(match.length);
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above two sections don't seem to be doing anything, at least with the sample input. Do we need them? If so, we should add some test data for them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The (sub-)sentence-start ones? One of them did fail a test when I removed them, but I've added two new samples. But in the mail I received you seemed to be pointing towards the (sub-)sentence start handlers, here on the site it looks like you're referring to the markup handler. I'll look into the markup handler tonight -- certainly that should hit on something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two blocks above my comment here — protect nocase and mask html tags with characters…. Nothing failed when I removed those (but still setting masked properly).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that's just weird. I'll look into it tonight - I have a 6 hour drive in front of me right now.

Copy link
Contributor Author

@retorquere retorquere Aug 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It captures in-word markup. I will grant that I do not have a non-synthetic case at hand, but I've added a synthetic case.

@dstillman
Copy link
Member

What about the nocase?

@retorquere
Copy link
Contributor Author

retorquere commented Aug 7, 2023

I missed that testcase, added it now.

@dstillman
Copy link
Member

dstillman commented Aug 7, 2023

"How to Get an <span class=\"nocase\">A</span> Without Trying": "How to get an <span class=\"nocase\">A</span> without trying",

"Effects of Open- and Closed-System Temperature Changes on Blood <span class=\"nocase\">O</span>₂-Binding Characteristics of <span class=\"nocase\">Atlantic</span> Bluefin Tuna (<span class=\"nocase\"><i>Thunnus thynnus</i></span>)": "Effects of open- and closed-system temperature changes on blood <span class=\"nocase\">O</span>₂-binding characteristics of <span class=\"nocase\">Atlantic</span> bluefin tuna (<span class=\"nocase\"><i>Thunnus thynnus</i></span>)",

I don't think either of these is really appropriate as an example. CSL styles don't sentence-case, so despite the name, the sole point of nocase is to avoid automatic title-casing of foreign words, scientific terms, and mixed-case words like iPhone (though citeproc-js should arguably just avoid capitalizing the last one). So other than Thunnus thynnus, which maybe would be protected before sentence-casing, none of these other strings would/should be. I know we're only testing the sentence-caser here, but I don't think we want to imply anywhere (even to developers looking at the code) that someone should be putting nocase around A, Atlantic, or O. Short of perhaps AI, there's no way to avoid manually recapitalizing proper nouns as a human being after sentence-casing a string, and that's not going to change with this better sentence-caser.

@retorquere
Copy link
Contributor Author

I'd have no issue with Zotero making that call. For BBT it's a live case, but for Zotero it doesn't need to be.

@dstillman
Copy link
Member

dstillman commented Aug 7, 2023

I'm just suggesting we go with something like this:

"Migration and the Origins of <i><span class=\"nocase\">Homo sapiens</span></i>": "Migration and the origins of <i><span class=\"nocase\">Homo sapiens</span></i>",

Which better reflects how nocase is meant to be used.

@dstillman
Copy link
Member

(Which actually makes me wonder if citeproc-js actually looks for <span class="nocase"> or just class="nocase". I.e., does <i class="nocase"> work?)

@retorquere
Copy link
Contributor Author

retorquere commented Aug 7, 2023

I'm just suggesting we go with something like this:

"Migration and the Origins of <i><span class=\"nocase\">Homo sapiens</span></i>": "Migration and the origins of <i><span class=\"nocase\">Homo sapiens</span></i>",

Which better reflects how nocase is meant to be used.

Ah I see, I hadn't scrutinized the sample, this is how I got it from a user. I can update this tonight.

@retorquere
Copy link
Contributor Author

(Which actually makes me wonder if citeproc-js actually looks for <span class="nocase"> or just class="nocase". I.e., does <i class="nocase"> work?)

It wouldn't be hard to add that; the BBT html-to-latex supports it but I haven't documented that. Do you know what markup citeproc supports beyond b/i/sup/sub BTW?

@retorquere
Copy link
Contributor Author

I've replaced the nocase tests with the proposed sample

@northword
Copy link

Maybe some of these idea can carry over? But I'd suggest e.g. testing for \p{Lu} rather than A-Z, and I think string.charAt would have problems with multibyte characters. I've opened a new PR at #26.

It looks great!

I've tried adding chem elements detection to the sentencecaser, but it flags "No" as an element instead of as the adverb/adjective/noun. edit: also: "Be", "Au" (French), "Na" (Portuguese).

oh! Sorry, I hadn't thought about this at all. Perhaps we can apply chemical elements only when the title is in English? Or unmatch chemical elements that are function words in other languages.
This seems to be a bit of a hassle, maybe it can be done with a plugin patch as well.

@retorquere
Copy link
Contributor Author

quote-protection has been removed.

@dstillman dstillman merged commit 27c9818 into zotero:master Aug 9, 2023
1 check passed
@dstillman
Copy link
Member

Great! Thank you!

dstillman added a commit to zotero/zotero that referenced this pull request Aug 9, 2023
@retorquere
Copy link
Contributor Author

retorquere commented Aug 9, 2023

Super. This will show up in the next beta I take it? If so, when could we expect the next beta?

@dstillman
Copy link
Member

Beta 32, out now

northword added a commit to northword/zotero-format-metadata that referenced this pull request Aug 9, 2023
@northword
Copy link

northword commented Aug 9, 2023

The problem is that when the text is in all caps, the result is still in all caps, e.g. "NITROUS-OXIDE EMISSIONS FROM VEHICLES", which obviously doesn't have a specific word that needs to be capitalized, so I think that we can convert every word to lowercase when uppercase == uppercase with lowercase in all text.

@retorquere
Copy link
Contributor Author

That wouldn't be hard to add.

@dstillman
Copy link
Member

dstillman commented Aug 9, 2023

Oh, yes, that's a fairly significant regression.

@retorquere
Copy link
Contributor Author

#27

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants