Capitalization: Capitalize all title-fields for language "en" #383

Closed
retorquere opened this Issue Oct 17, 2015 · 127 comments

4 participants
Owner

retorquere commented Oct 17, 2015

@nickbart1980 says:

BBT should convert all titles to title-case if the ‘Language’ field is empty or starts with ‘en’, excluding, however, skip words, and strings enclosed in <span class="nocase">…</span>.
‘All titles’ means title, volume-title, container-title, collection-title, including their ‘short’ forms.
Titles in entries with a non-empty ‘Language’ field that does not start with ‘en’ should be left alone (see the notes on \MakeSentenceCase, biblatex manual 4.6.4, and compare the man page of pandoc-citeproc, which has to do the inverse conversion when using a biblatex database – as would, BTW, any import of bib(la)tex into Zotero).
For bibtex, which does not have a langid field and thus cannot distinguish languages, I would guess that the complete title fields of non-English titles should be wrapped in braces to prevent bibtex from messing with capitalisation.
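
The language rule in this request can be sketched as follows. This is a minimal illustration, not BBT's code: the function name `should_title_case` and the constant `TITLE_FIELDS` are hypothetical, the case-insensitive comparison is an assumption, and the ‘short’ forms of the fields are left out because the request does not spell out their names.

```python
# Fields named in the request; their 'short' forms are omitted here
# because the request does not give their exact names.
TITLE_FIELDS = ("title", "volume-title", "container-title", "collection-title")


def should_title_case(language):
    """Return True when the 'Language' field is empty or starts with 'en'.

    Trimming whitespace and ignoring case are assumptions of this sketch.
    """
    lang = (language or "").strip().lower()
    return lang == "" or lang.startswith("en")


if __name__ == "__main__":
    for value in ("", "en-US", "de"):
        print(repr(value), should_title_case(value))
```

Titles in entries whose language fails this check would be exported unchanged.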

Owner

retorquere commented Oct 17, 2015

Do you mean for CSL JSON or for BBT? I'm not entirely certain about capitalizing user input; my idea is that BBT discloses user intent as best as possible given the impedance mismatch between the formats. User intent for capitalization is, I think, best expressed by the user capitalizing titles as desired.

njbart commented Oct 17, 2015

If users enter On the prosodies of the Greek and Latin languages in Zotero, this is rendered as “On the Prosodies of the Greek and Latin Languages” in a title-case style, e.g., Chicago, and as “On the prosodies of the Greek and Latin languages” in a sentence-case style, e.g., APA.

To get the same in bibtex and biblatex, there is no other option than to convert the title to On the Prosodies of the {Greek} and {Latin} Languages; this is the only way to have it rendered as “On the Prosodies of the Greek and Latin Languages” in Chicago, and as “On the prosodies of the Greek and Latin languages” in APA.
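
The conversion described here can be sketched roughly as below. The function name `to_bibtex_title` and the small-word list are assumptions made for illustration only. The sketch treats any word that already contains an uppercase letter as one to protect, ignores punctuation attached to words, and does not force the last word to be capitalised.

```python
# Illustrative small-word list; an assumption, not an official bib(la)tex list.
SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for",
               "in", "of", "on", "or", "the", "to"}


def to_bibtex_title(title):
    """Turn a sentence-case title into a title-case, brace-protected one."""
    out = []
    for i, word in enumerate(title.split()):
        if i == 0:
            # The first word is capitalised by every style, so no braces.
            out.append(word[:1].upper() + word[1:])
        elif any(ch.isupper() for ch in word):
            # Capitalised in the source: keep as typed and protect it.
            out.append("{" + word + "}")
        elif word.lower() in SMALL_WORDS:
            # Small words stay lowercase in title case.
            out.append(word)
        else:
            out.append(word[:1].upper() + word[1:])
    return " ".join(out)


if __name__ == "__main__":
    print(to_bibtex_title("On the prosodies of the Greek and Latin languages"))
```

With the example title from this comment, the sketch yields the braced form quoted above.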

This, I would argue, respects user intent as best as possible.

Owner

retorquere commented Oct 17, 2015

Interesting. Which processor renders it that way? Not BibTeX then.

I'm still not entirely convinced. Adding braces around {Greek} is the only way to disclose to LaTeX you want to keep capitalisation. The easiest way to disclose that you want to have a certain capitalisation, but only if the style demands it, is to Capitalise the Source Sentence.

If you input On the Prosodies of the Greek and Latin Languages, does the processor you have in mind do the right thing when a style does not do title-casing?

Owner

retorquere commented Oct 17, 2015

BTW, how should this interact with caps preservation? Surely you wouldn't want On the prosodies of the Greek and Latin languages to be translated to On the {Prosodies} of the {Greek} and {Latin} {Languages}?

Owner

retorquere commented Oct 17, 2015

Or would you want non-capitalised non-filler words to be capitalised, and capitalised non-filler words to be braced? How about something like iPod? This would be capitalised in this scheme. I'm not too keen on the <span class="nocase">…</span> workaround. It seems easier to provide a "capitalise this title" function to Zotero to just fix the input (assuming such can easily be done).

Owner

retorquere commented Oct 17, 2015

Ugh, I can't add things to the reference edit pane without some crazy shady monkey patching. That is going to be too brittle. Operating on the whole reference is not a problem, though.

Owner

retorquere commented Oct 17, 2015

Why specifically for English though? Doesn't this apply to other languages equally?

retorquere added a commit that referenced this issue Oct 17, 2015

retorquere added a commit that referenced this issue Oct 17, 2015

retorquere added a commit that referenced this issue Oct 17, 2015

Owner

retorquere commented Oct 17, 2015

I have some ideas on how to get this to work, but I'll probably put it behind a preference

retorquere added a commit that referenced this issue Oct 17, 2015

Owner

retorquere commented Oct 17, 2015

Is the list of fields that should be capitalised the same as the list that should get preserve caps?

adunning commented Oct 17, 2015

To clarify your earlier question, this doesn't need to be applied to CSL JSON, since citeproc handles the capitalization already; Zotero recommends that all titles be stored sentence-case.

Owner

retorquere commented Oct 17, 2015

That was my earlier point actually. Why not have the user store the titles sentence-cased in the first place?

adunning commented Oct 17, 2015

retorquere added a commit that referenced this issue Oct 17, 2015

retorquere added the question label and removed the enhancement label Oct 17, 2015

Owner

retorquere commented Oct 17, 2015

I've taken a few stabs at it but it gets increasingly messy and fragile. I'm sorry, but I'm not going to honor this one.

retorquere closed this Oct 17, 2015

njbart commented Oct 17, 2015

> Why specifically for English though? Doesn't this apply to other languages equally?

No, only English has both title-case and sentence-case styles.

njbart commented Oct 17, 2015

> Or would you want non-capitalised non-filler words to be capitalised, and capitalised non-filler words to be braced?

Exactly.

> How about something like iPod? This would be capitalised in this scheme.

EDIT: “iPod” shouldn’t be capitalised by BBT, and it should be protected.

njbart commented Oct 17, 2015

> Is the list of fields that should be capitalised the same as the list that should get preserve caps?

Yes. bib(la)tex needs titles in title case, and those words that must not be lowercased again by sentence-case styles such as biblatex-apa need protection.

njbart commented Oct 17, 2015

> I'm sorry, but I'm not going to honor this one.

That’d be a pity. It’s necessary since the conventions of bib(la)tex and CSL are incompatible: bib(la)tex expects titles in title-case, and words that must not be lowercased must be protected, but CSL expects titles in sentence-case, and words that must not be uppercased must be protected. (The latter doesn’t happen so very often, but without protection CSL title-case styles would turn, e.g., “nm” (nanometer) into “Nm” (Newtonmeter), something that should really be avoided.)

<span class="nocase">…</span>, BTW, is officially supported by citeproc-js and pandoc-citeproc.

Over at pandoc, we’ve been through this whole exercise when writing pandoc-citeproc’s biblatex -> CSL converter (the inverse of what I’d like BBT to do), but it’s not that complicated after all, and seems to work great.
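
The bib(la)tex side of the incompatibility described in this comment can be illustrated with a rough emulation of what a sentence-case style does to a title: everything outside braces is lowercased except the first character, and braced words are kept as typed. This is a simplified sketch under that assumption, not the actual algorithm of BibTeX or biblatex; the function name `sentence_case_like_bibtex` is hypothetical, and nested braces are not handled.

```python
import re

# Matches one brace group without nested braces, e.g. "{Greek}".
BRACED = re.compile(r"(\{[^{}]*\})")


def sentence_case_like_bibtex(title):
    """Lowercase a title-case title, keeping {braced} words as typed."""
    out = []
    first = True  # True until the first non-empty piece has been emitted
    for part in BRACED.split(title):
        if part.startswith("{") and part.endswith("}"):
            # Protected: drop the braces, keep the capitalisation.
            out.append(part[1:-1])
        else:
            lowered = part.lower()
            if first and lowered:
                lowered = lowered[0].upper() + lowered[1:]
            out.append(lowered)
        if part:
            first = False
    return "".join(out)


if __name__ == "__main__":
    print(sentence_case_like_bibtex(
        "On the Prosodies of the {Greek} and {Latin} Languages"))
```

Without the braces, "Greek" and "Latin" would be lowercased along with everything else, which is why the export needs to add them.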

Owner

retorquere commented Oct 17, 2015

But how would I know that iPod should be excluded from capitalization? And why would it not be better to assume title-case and convert title-case to sentence-case for CSL? That seems to be a lot simpler to me.

retorquere reopened this Oct 17, 2015

Owner

retorquere commented Oct 17, 2015

This one is not going to be easy. It will require rethinking of the way I convert the HTML-ish input to LaTeX.

njbart commented Nov 16, 2015

When it has caps, I would say.

Owner

retorquere commented Nov 16, 2015

Sweet, that's simply the current behavior.

Owner

retorquere commented Nov 16, 2015

There's still a fair number of cases where I think the title caser doesn't do the right thing: https://bitbucket.org/fbennett/citeproc-js/issues/191/a-is-uppercased-in-the-title-caser

Owner

retorquere commented Nov 17, 2015

I've worked around most of those by feeding the titlecaser just plain text. So we're getting close on this one.

What should be done with The City of To-morrow? The CSL title caser wants to make it The City of to-Morrow. I can enter it in my bibliography as The City of <span class="nocase">To-morrow</span>, but that will always prevent downcasing by the bibliography processor, even where the style demands sentence case. The same goes for Organising/Disorganising the Breakthrough Motif: the title caser makes it Organising/disorganising the Breakthrough Motif, but protecting it with nocase is too strong. Ideas?

Owner

retorquere commented Nov 17, 2015

Is there a list of words that biblatex expects to be lowercase in title case? I know "and" and "or" are supposed to stay lowercase, but what about words like "after"?

njbart commented Nov 17, 2015

> … feeding the titlecaser just plain text.

I’m still puzzled why you seem to be having such difficulties with <span class="nocase"> and the citeproc-js titlecaser. I’m using <span class="nocase"> a lot in Zotero, and never encountered anything unexpected.

Still, <span class="nocase"> should work in all circumstances, and if it doesn’t, I would report it as a citeproc-js bug.

> The City of To-morrow

Hmm, in Zotero, from a title [The City of To-morrow], and using “Create bibliography from item” with Chicago-author-date, I get “The City of To-Morrow” (which seems correct for a title-case style).

> Organising/disorganising

Again, I think that’s a citeproc-js bug.

> … list of words that biblatex expects to be lowercase in titlecase

There’s no official bib(la)tex list; bib(la)tex expects the user to enter titles in correct title case (which some styles then convert to sentence case; never the other way around).

Style manuals differ a little here, but the citeproc-js list of small words is a good approximation.

njbart commented Nov 17, 2015

BTW, citeproc-js is currently changing some of the titlecaser’s details, and from what it looks like neither quotes nor parentheses, nor HTML-like markup will protect against case conversion from now on. See
http://sourceforge.net/p/xbiblio/mailman/xbiblio-devel/thread/CAJgpGgAGORo22rX8wRoV2Gd1fmX3iuzbXzwm829sRQ-9i%3DMcmg%40mail.gmail.com/#msg34605413

Owner

retorquere commented Nov 17, 2015

> I’m still puzzled why you seem to be having such difficulties with <span class="nocase"> and the citeproc-js titlecaser. I’m using <span class="nocase"> a lot in Zotero, and never encountered anything unexpected.
>
> Still, <span class="nocase"> should work in all circumstances, and if it doesn’t, I would report it as a citeproc-js bug.

It does, but sometimes it just dies when I feed it valid input; other times, it just doesn't title-case right; it seems the nocase is handled properly, but it sometimes appears to throw off its idea on whether it is mid-sentence, end-sentence or start-of-sentence.

> Hmm, in Zotero, from a title [The City of To-morrow], and using “Create bibliography from item” with Chicago-author-date, I get “The City of To-Morrow” (which seems correct for a title-case style).

The official recommendation is however to enter titles in sentence-case, right? So that would have to be The city of to-morrow. If I enter The City of To-morrow, caps preservation will kick in to make that The {{City}} of {{To-morrow}}.

> Organising/disorganising
> Again, I think that’s a citeproc-js bug.

OK, so I could just wait this one out.

> There’s no official bib(la)tex list; bib(la)tex expects the user to enter titles in correct title case (which some styles then convert to sentence case; never the other way around).

But then what is "correct title case"? I'm going with the smallwords from the CSL titlecaser in any case.

Wow, that thread is active! The progression there seems promising, so I'll just wait for the results of that.

But there's another reason I may want to feed only plain text to the title caser: BBT supports <pre>...</pre> (or <script>) for raw LaTeX, and I don't want the titlecaser to make any changes in there. I also don't want to wrap <pre> in nocase, since that would then invoke caps preservation for each pre section. What I did before is remove the pre sections and replace them with markers (\x02...\x03) so the title caser wouldn't see them, and put them back when the title caser is done. I figured if I need to work around that anyhow, I might as well just not feed the titlecaser any markup.

Easiest for BBT would be if citeproc-js supported such use of pre/script, but I think it's wholly specific to BBT.
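
The marker workaround described in this comment could look roughly like the sketch below. It is an illustration rather than BBT's code: the function name `title_case_outside_pre` is hypothetical, the title caser is passed in as a plain callable, and putting a numeric index between \x02 and \x03 is an assumption, since the comment does not say what the markers contain.

```python
import re

PRE = re.compile(r"<pre>.*?</pre>", re.DOTALL)
MARKER = re.compile("\x02(\\d+)\x03")


def title_case_outside_pre(title, title_caser):
    """Run title_caser on a title while hiding <pre> sections from it."""
    stash = []

    def hide(match):
        # Remember the whole <pre>...</pre> section and leave a marker
        # that contains no letters for the title caser to change.
        stash.append(match.group(0))
        return "\x02" + str(len(stash) - 1) + "\x03"

    masked = PRE.sub(hide, title)
    cased = title_caser(masked)
    # Put the stashed sections back where their markers ended up.
    return MARKER.sub(lambda m: stash[int(m.group(1))], cased)


if __name__ == "__main__":
    # str.upper stands in for a real title caser here.
    print(title_case_outside_pre("energy of <pre>\\alpha</pre> particles",
                                 str.upper))
```

The sketch only works if the title caser passes the control characters and digits through unchanged, which is also the premise of the workaround itself.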

Owner

retorquere commented Nov 17, 2015

BTW the title caser doesn't deal with words in quotes consistently;

'Of' uppercased, 'meaning' not:
input:  The meaning of 'meaning'
output: The Meaning Of 'meaning'

'Example' uppercased even though it is in quotes.
input:  Test of special chars "this for example" and the end
output: Test of Special Chars "this for Example" and the End

I've added both cases to the citeproc-js issue tracker, but it looks like I can't post to the xbiblio thread you linked to.

njbart commented Nov 17, 2015

Owner

retorquere commented Nov 17, 2015

Ah, mailing list, not forum. Looks like a lot of these issues were in fact already handled; I've pulled in the latest citeproc and things look near perfect. Tests running again.

Owner

retorquere commented Nov 17, 2015

OK, so just 6 or so more title caser problems and this feature should be finished.

Owner

retorquere commented Nov 19, 2015

I've released the other recent changes we concocted as part of 1.6.6; I'll release this one when the tests go green, pending changes in the citeproc titlecaser. You seem to be in the loop on this -- can you alert me when you think something has changed? I'm also watching the citeproc-js issues list.

Owner

retorquere commented Nov 30, 2015

Activity on the citeproc title caser has been a little low lately, so I've given another one a shot; only these cases do not pass, and if I remove "that" from the shortwords list (the CSL title caser does have it, but it seems to be smart about "that is") I get this. Neither is perfect, but the first seems preferable over the existing title caser.

What do you think?

Owner

retorquere commented Nov 30, 2015

Sorry, that should have been this for the version that doesn't have "that" in the smallWords list.

Owner

retorquere commented Nov 30, 2015

Adding "their" to the smallwords list leaves a single failing case but one that also fails in the same way with the CSL title caser.

Owner

retorquere commented Dec 2, 2015

I see no activity on citeproc-js currently, and the alternative titlecaser passes all my tests, so I've merged to master. Next release will have the feature.
