When I run train_ner with the BrownClusters feature enabled, I get the following output:
Loading train data: done, 8158 sentences
Loading heldout data: done, 899 sentences
Parsing feature templates: Form '0000000' is present twice in Brown cluster file 'clusters/cs_brown_1000'!
Cannot initialize feature template sentence processor 'BrownClusters' from line 'BrownClusters/2 clusters/cs_brown_1000' of feature templates file!
Why exactly can't Form '0000000' be present twice?
It seems like an unnecessary limitation. As far as I know, all words with the same prefix belong to one cluster. Therefore, any additional bits after the chosen prefix are irrelevant.
(E.g., with a prefix of length 20, any bits after the 20th bit are irrelevant.)
Am I missing something?
Best regards.
Simon Let
The error means that NameTag thinks word 0000000 is in multiple clusters (and fails because it does not know which cluster to use).
The input file for the BrownClusters feature should contain lines with cluster<tab>lemma -- don't you have it reversed (i.e., lemma<tab>cluster)? The "form 0000000" looks more like a cluster.
If you really have one form present multiple times in the file, you have to decide which one to use yourself.
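To check which of the two cases applies, something like the following might help. It is only a rough sketch and not part of NameTag: the helper names `find_duplicate_forms` and `keep_first` are made up for illustration, and it assumes every line is cluster<tab>form, ignoring any further columns.

```python
from collections import defaultdict


def parse_line(line):
    """Split one line into (cluster, form); return None for blank or malformed lines."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        return None
    return parts[0], parts[1]


def find_duplicate_forms(lines):
    """Return {form: [clusters...]} for every form listed more than once."""
    clusters_by_form = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        cluster, form = parsed
        clusters_by_form[form].append(cluster)
    return {form: clusters
            for form, clusters in clusters_by_form.items()
            if len(clusters) > 1}


def keep_first(lines):
    """Keep only the first cluster seen for each form (one possible policy)."""
    seen = set()
    kept = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        cluster, form = parsed
        if form not in seen:
            seen.add(form)
            kept.append(f"{cluster}\t{form}")
    return kept
```

You could feed it the lines of `clusters/cs_brown_1000`, e.g. `find_duplicate_forms(open(path, encoding="utf-8"))`. If the reported "forms" all look like bit strings such as 0000000, the columns are most likely reversed. If they are real words, `keep_first` shows one way of deciding which cluster to keep; keeping the first occurrence is an arbitrary choice.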