Truecase not working #265
Comments
Sorry, this should work, but I think there are definitely issues with the truecaser released with CoreNLP v3.6.0. We'll see if we can fix this in version 3.7, but until then, I think the way to get decent output is to first lowercase the input (such as with the Unix command
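For anyone preprocessing outside the shell, the lowercasing workaround described above could look roughly like this in Python. This is only a sketch: the function name `lowercase_for_truecaser` is made up for illustration and is not part of CoreNLP, and the result still has to be passed to the truecase annotator separately.

```python
def lowercase_for_truecaser(text: str) -> str:
    """Lowercase all-caps text before handing it to a truecaser.

    Hypothetical helper, not a CoreNLP API: it only applies the
    "lowercase the input first" workaround suggested for v3.6.0.
    """
    return text.lower()


if __name__ == "__main__":
    # Sample sentence taken from this thread.
    print(lowercase_for_truecaser("NORVELL BROWN WAS LEAD MAN AT THE DUKE"))
```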
A friend who had experience with CoreNLP suggested this fix and I was able to get it working. However, there were still fragments of sentences that would get entirely uppercased. Tweaking the bias down to a certain threshold would keep the text lowercase even though there were some capitalizations that should have been identified, so something is definitely amiss in the truecase module. Let me know if you'd like some samples if the bug's source isn't already known and identified. Thanks!
There were at least 2 issues with the v3.6 truecaser. One was that the model didn't work well at all; the version 3.5 model was much better. The other was bugs in the annotator, so that the annotator didn't work on uppercase text, only on lowercase text. Both of these things have been fixed for 3.7.0.

The output for NORVELL BROWN WAS LEAD MAN AT THE DUKE is now:

NORVELL Brown was lead man at the Duke

So, I'm going to close this for now. However, I accept that beyond these bugs, the model could still be better (we'd like to get "Norvell"). But that part is a research question of improving the model, and it is never likely to be perfect. If there are particular things that it always gets wrong, and you want to send some text with the correct answers, at some point we could include it in the training text, which may well help with performance on your and similar applications.
I've been testing the new version and there are definitely improvements in the truecaser. I found a bug that seems to be causing lots of the trouble in my usage. Our transcripts were done in all caps and in shorthand, so lots of "and"s are "&"s. The appearance of this character (and other punctuation) seems to throw the truecaser wildly off. Here's an example (run with default bias):

Input: WE TOOK OUR SHOES OFF & WE SAT BY THE FIRE ALL WINTER (no period)
Output: All words remain uppercased.

If I change & to 'and', it works fine. (We took our shoes off and we sat by the fire all winter)

If I add a period to the end with the ampersand, I get:

WE TOOK OUR SHOES OFF & we sat by the fire all winter.

Changing the & to 'and' also then makes the sentence output properly. Hope this little tidbit provides some insight for truecase improvements.

Alan
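Until the truecaser handles this itself, the manual substitution described above (changing & to 'and') could be automated before the text is annotated. The following is a minimal Python sketch, not anything provided by CoreNLP: the helper name `expand_ampersands` and the choice to replace only stand-alone ampersands (so that tokens such as AT&T are left alone) are assumptions made for illustration.

```python
import re

# Match "&" only when it stands alone between whitespace (or at the
# start/end of the text), so embedded uses such as "AT&T" are untouched.
_STANDALONE_AMPERSAND = re.compile(r"(?<!\S)&(?!\S)")


def expand_ampersands(text: str, replacement: str = "and") -> str:
    """Replace stand-alone ampersands with a word before truecasing.

    Hypothetical preprocessing helper; it does not call CoreNLP.
    """
    return _STANDALONE_AMPERSAND.sub(replacement, text)


if __name__ == "__main__":
    # Sample sentence taken from this thread.
    print(expand_ampersands("WE TOOK OUR SHOES OFF & WE SAT BY THE FIRE ALL WINTER"))
```

Whether other punctuation needs similar treatment is not established in this thread, so the sketch deliberately handles the ampersand only.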
I'm working on a cultural history project with hundreds of hours of audio. When the transcripts were created many years ago, they were done in ALL CAPS. CoreNLP is working great at identifying our names so we can link to their profile pages, but the truecase annotator doesn't seem to be doing anything. I have the stanford-corenlp-3.6.0-models-english.jar in my path, and both the server and command-line versions load the annotator and produce output as expected:
and a snippet of the output:
As you can see, the truecaseText property of MAN is MAN.
I've also tried setting truecase.model to edu/stanford/nlp/models/truecase/truecasing.fast.qn.ser.gz, but get the same results. Anyone encountered this?