Why can't two words have the same Brown cluster representation? #5

curusarn · 2017-07-26T12:57:09Z

When I run train_ner with the BrownClusters feature enabled, I get the following output:

Loading train data: done, 8158 sentences
Loading heldout data: done, 899 sentences
Parsing feature templates: Form '0000000' is present twice in Brown cluster file 'clusters/cs_brown_1000'!
Cannot initialize feature template sentence processor 'BrownClusters' from line 'BrownClusters/2 clusters/cs_brown_1000' of feature templates file!

Why exactly can't Form '0000000' be present twice?
It seems like an unnecessary limitation. As far as I know, all words with the same prefix belong to one cluster, so any bits after the chosen prefix are irrelevant.
(E.g., with a prefix of length 20, any bits after the 20th bit are irrelevant.)
Am I missing something?

Best regards.
Simon Let


foxik · 2017-07-26T13:13:32Z

The error means that NameTag thinks word 0000000 is in multiple clusters (and fails because it does not know which cluster to use).

The input file for the BrownClusters feature should contain lines of the form cluster<tab>lemma. Don't you have it reversed (i.e., lemma<tab>cluster)? The "form 0000000" looks more like a cluster.

If you really have one form present multiple times in the file, you have to decide which one to use yourself.
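A quick way to check the file before training is sketched below. It is not part of NameTag; it only assumes the cluster<tab>lemma layout described above, and the function names `find_duplicate_forms` and `looks_reversed` are hypothetical helpers invented for this illustration. The second function is a rough heuristic: if the second column contains nothing but 0/1 bit strings, the columns are probably swapped.

```python
from collections import defaultdict


def find_duplicate_forms(lines):
    """Return {form: [clusters]} for every form listed on more than one line.

    Assumes each line is cluster<tab>form; lines without a tab are skipped.
    """
    clusters_by_form = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        cluster, form = parts[0], parts[1]
        clusters_by_form[form].append(cluster)
    return {f: c for f, c in clusters_by_form.items() if len(c) > 1}


def looks_reversed(lines):
    """Heuristic: True if every second-column value is a non-empty 0/1 string."""
    seen_any = False
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        seen_any = True
        second = parts[1]
        if not second or set(second) - {"0", "1"}:
            return False
    return seen_any
```

For example, passing `open("clusters/cs_brown_1000", encoding="utf-8")` to `find_duplicate_forms` would list the forms that need a manual decision, if there are any.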

curusarn · 2017-07-27T12:34:48Z

The BrownClusters feature file was indeed reversed (lemma<tab>cluster). With the columns in the right order, it works as expected.
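For anyone hitting the same error, swapping the two columns can be sketched roughly like this. It assumes a tab-separated file with the lemma first and the cluster second; `swap_columns` is a hypothetical helper name, not something NameTag provides, and any columns after the first two are passed through unchanged.

```python
def swap_columns(lines):
    """Yield each tab-separated line with its first two columns exchanged.

    Lines with fewer than two columns are yielded unchanged.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            parts[0], parts[1] = parts[1], parts[0]
        yield "\t".join(parts) + "\n"
```

The result can be written to a new file with `out.writelines(swap_columns(src))`, where `src` and `out` are open file objects.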

Thanks for your time.

curusarn closed this as completed Jul 27, 2017
