Truecase not working #265

waltsatan · 2016-09-15T01:42:32Z

I'm working on a cultural history project with hundreds of hours of audio. When the transcripts were created many years ago, they were done in ALL CAPS. CoreNLP is working great at identifying our names so we can link to their profile pages, but the truecase annotator doesn't seem to be doing anything. I have stanford-corenlp-3.6.0-models-english.jar on my path, and both the server and command-line versions load the annotator and produce the expected log output:

[/127.0.0.1:53700] API call w/annotators tokenize,ssplit,pos,lemma,ner,truecase
NORVELL BROWN WAS LEAD MAN AT THE DUKE
21:34:53.950 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
21:34:53.963 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
21:34:53.966 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [3.2 sec].
21:34:57.167 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
21:34:57.168 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [8.5 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [3.7 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [4.7 sec].
21:35:14.142 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator truecase
loadClassifier=edu/stanford/nlp/models/truecase/truecasing.fast.qn.ser.gz
mixedCaseMapFile=edu/stanford/nlp/models/truecase/MixDisambiguation.list
classBias=INIT_UPPER:-0.7,UPPER:-0.7,O:0
Loading classifier from edu/stanford/nlp/models/truecase/truecasing.fast.qn.ser.gz ... done [4.7 sec].

and a snippet of the output:

{"index":5,"word":"MAN","originalText":"MAN","lemma":"MAN","characterOffsetBegin":23,"characterOffsetEnd":26,"pos":"NNP","ner":"O","truecase":"O","truecaseText":"MAN"}

As you can see, the truecaseText property of MAN is MAN.
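A quick way to spot this symptom across a whole transcript is to scan the JSON tokens for entries whose truecaseText is identical to the original text. The sketch below is a minimal illustration that uses only the field names visible in the snippet above; the helper name unchanged_tokens is hypothetical and not part of CoreNLP.

```python
import json


def unchanged_tokens(tokens):
    """Return the words whose truecaseText equals the original text.

    `tokens` is a list of token dicts shaped like the CoreNLP JSON
    snippet in the report. Hypothetical helper, not a CoreNLP API.
    """
    return [
        t["word"]
        for t in tokens
        if t.get("truecaseText") == t.get("originalText", t["word"])
    ]


# The token from the report, copied verbatim.
token = json.loads(
    '{"index":5,"word":"MAN","originalText":"MAN","lemma":"MAN",'
    '"characterOffsetBegin":23,"characterOffsetEnd":26,"pos":"NNP",'
    '"ner":"O","truecase":"O","truecaseText":"MAN"}'
)
print(unchanged_tokens([token]))  # ['MAN']
```

An unchanged token is not always an error (acronyms legitimately stay uppercase), so a high share of unchanged tokens is the signal to look for, not any single hit.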

I've also tried setting truecase.model to edu/stanford/nlp/models/truecase/truecasing.fast.qn.ser.gz, but get the same results.

Anyone encountered this?

manning · 2016-10-09T19:57:54Z

Sorry, this should work, but I think there are definitely issues with the truecaser released with CoreNLP v3.6.0. We'll see if we can fix this in version 3.7, but until then, I think the way to get decent output is to first lowercase the input (such as with the Unix command tr '[:upper:]' '[:lower:]' < input.txt > output.txt) and then to run CoreNLP on the output to truecase it. Trying to truecase uppercase text is failing.
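For server users, the same lowercase-first workaround can be applied just before the text is sent. The following is a minimal sketch, not an official client: it assumes a CoreNLP server listening on http://localhost:9000 (an assumption, adjust to your setup), and the function names are hypothetical. It only builds the request; sending it is left to the caller.

```python
import json
import urllib.parse
import urllib.request


def lowercase_text(text: str) -> str:
    """Python counterpart of the tr '[:upper:]' '[:lower:]' step."""
    return text.lower()


def build_truecase_request(text: str, server: str = "http://localhost:9000"):
    """Build (but do not send) a POST request with lowercased text.

    The server address is an assumption; the annotator list mirrors
    the one used in the original report.
    """
    props = {
        "annotators": "tokenize,ssplit,pos,lemma,ner,truecase",
        "outputFormat": "json",
    }
    url = server + "/?properties=" + urllib.parse.quote(json.dumps(props))
    return urllib.request.Request(url, data=lowercase_text(text).encode("utf-8"))


req = build_truecase_request("NORVELL BROWN WAS LEAD MAN AT THE DUKE")
print(req.data)  # b'norvell brown was lead man at the duke'
```

Passing the request to urllib.request.urlopen(req) and decoding the response as JSON should give tokens with truecaseText fields, provided the server is running with the English models loaded.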

waltsatan · 2016-10-10T03:40:51Z

A friend who had experience with CoreNLP suggested this fix and I was able to get it working; however, there were still fragments of sentences that would get entirely uppercased. Tweaking the bias down to a certain threshold would keep the text lowercase, even though there were some capitalizations that should have been identified, so something is definitely amiss in the truecase module. Let me know if you'd like some samples if the bug's source isn't already known and identified. Thanks!
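For experiments with the bias, it can help to build the bias string programmatically rather than editing it by hand. The sketch below parses and re-serializes a string in the format shown in the startup log (classBias=INIT_UPPER:-0.7,UPPER:-0.7,O:0). The helper names and the value -1.5 are purely illustrative, and the thread does not name the property used to pass the string to the annotator, so check the TrueCaseAnnotator documentation for that.

```python
def parse_bias(spec: str) -> dict:
    """Parse 'LABEL:value,LABEL:value' into a dict of floats."""
    bias = {}
    for item in spec.split(","):
        label, value = item.split(":")
        bias[label.strip()] = float(value)
    return bias


def format_bias(bias: dict) -> str:
    """Serialize a bias dict back into the comma-separated format."""
    return ",".join(f"{label}:{value:g}" for label, value in bias.items())


# Default bias as printed in the startup log of the report.
default = parse_bias("INIT_UPPER:-0.7,UPPER:-0.7,O:0")

# Illustrative value only: lower the UPPER bias, leave the rest alone.
lowered = dict(default, UPPER=-1.5)
print(format_bias(lowered))  # INIT_UPPER:-0.7,UPPER:-1.5,O:0
```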

manning · 2016-12-04T19:14:34Z

There were at least 2 issues with the v3.6 truecaser. One was that the model didn't work well at all; the version 3.5 model was much better. The other was bugs in the annotator, which meant that it didn't work on uppercase text, only on lowercase text. Both of these things have been fixed for 3.7.0.

The output for

NORVELL BROWN WAS LEAD MAN AT THE DUKE

is now:

NORVELL Brown was lead man at the Duke

So, I'm going to close this for now. However, I accept that beyond these bugs, the model could still be better (we'd like to get "Norvell"). But that part is a research question of improving the model, and it is never likely to be perfect. If there are particular things that it always gets wrong, and you want to send some text with the correct answers, at some point we could include it in the training text, which may well help with performance on your and similar applications.

waltsatan · 2017-01-03T03:06:20Z

I've been testing the new version and there are definitely improvements in the truecaser. I found a bug that seems to be causing lots of the trouble in my usage. Our transcripts were done in all caps and in shorthand, so lots of "and"s are "&"s. The appearance of this character (and other punctuation) seems to throw the truecaser wildly off. Here's an example: (run with default bias)

Input: WE TOOK OUR SHOES OFF & WE SAT BY THE FIRE ALL WINTER (no period)

Output: All words remain uppercased.

If I change & to 'and', it works fine. (We took our shoes off and we sat by the fire all winter)

If I add a period to the end with the ampersand, I get:

WE TOOK OUR SHOES OFF & we sat by the fire all winter.

Changing the & to 'and' also then makes the sentence output properly.

Hope this little tidbit provides some insight for truecase improvements.

Alan
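Until the truecaser handles this case, the substitution described above can be automated as a preprocessing step. This is a stopgap sketch under the assumption that every standalone "&" in the transcripts means "and"; the function name is hypothetical, and the step does alter the transcript text, so it is probably best applied to a working copy.

```python
import re

# Match "&" only when it stands alone between whitespace (or at the
# start or end of the text), so tokens such as "AT&T" are left untouched.
_STANDALONE_AMP = re.compile(r"(?<!\S)&(?!\S)")


def expand_ampersands(text: str) -> str:
    """Replace each standalone '&' with the word 'and'."""
    return _STANDALONE_AMP.sub("and", text)


print(expand_ampersands("WE TOOK OUR SHOES OFF & WE SAT BY THE FIRE ALL WINTER"))
# WE TOOK OUR SHOES OFF and WE SAT BY THE FIRE ALL WINTER
```

Other punctuation reported to confuse the truecaser would need its own handling; this sketch covers only the ampersand case from the example.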

manning added the bug label Oct 9, 2016

manning added this to the v.3.7 milestone Oct 9, 2016

manning closed this as completed Dec 4, 2016
