Why can't two words have same brown cluster representation? #5

Closed
curusarn opened this issue Jul 26, 2017 · 2 comments

@curusarn

When I run train_ner with the BrownClusters feature enabled, I get the following output:

Loading train data: done, 8158 sentences
Loading heldout data: done, 899 sentences
Parsing feature templates: Form '0000000' is present twice in Brown cluster file 'clusters/cs_brown_1000'!
Cannot initialize feature template sentence processor 'BrownClusters' from line 'BrownClusters/2 clusters/cs_brown_1000' of feature templates file!

Why exactly can't Form '0000000' be present twice?
It seems like an unnecessary limitation. As far as I know, all words with the same prefix belong to one cluster, so any additional bits after the chosen prefix are irrelevant.
(E.g., with a prefix of length 20, any bits after the 20th bit are irrelevant.)
Am I missing something?

Best regards.
Simon Let

@foxik

foxik commented Jul 26, 2017

The error means that NameTag thinks word 0000000 is in multiple clusters (and fails because it does not know which cluster to use).

The input file for the BrownCluster feature should contain lines with cluster<tab>lemma -- don't you have it reversed (i.e., lemma<tab>cluster)? The "form 0000000" looks more like a cluster.

If you really have one form present multiple times in the file, you have to decide which one to use yourself.
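To tell the two cases apart (reversed columns versus a genuinely repeated form), a quick check along these lines may help. This is only a sketch, not part of NameTag: the function name `find_duplicate_forms` and the sample words are made up for illustration, and it assumes the cluster<tab>lemma layout described above.

```python
from collections import defaultdict

def find_duplicate_forms(lines):
    """Return {form: [clusters]} for every form listed with more than one cluster.

    Assumes each line is cluster<TAB>form; lines with fewer than two
    columns are skipped.
    """
    clusters_by_form = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        cluster, form = parts[0], parts[1]
        clusters_by_form[form].append(cluster)
    return {f: c for f, c in clusters_by_form.items() if len(c) > 1}

# Hypothetical sample data in the expected cluster<TAB>form order.
correct = ["0000000\tpes\n", "0000000\tles\n", "0000001\tdum\n"]
# The same data with the columns reversed (form<TAB>cluster).
reversed_cols = ["pes\t0000000\n", "les\t0000000\n", "dum\t0000001\n"]

print(find_duplicate_forms(correct))
print(find_duplicate_forms(reversed_cols))
```

If the reported duplicates look like bit strings and the values listed for them look like words, the columns are most likely reversed.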

@curusarn

The BrownCluster feature file was reversed (lemma<tab>cluster). Works as expected.

Thanks for your time.
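For anyone hitting the same error, the column swap that resolved it could look roughly like this. It is a sketch under assumptions: `swap_columns` is a hypothetical helper, not a NameTag tool, and it assumes a two-column, tab-separated file.

```python
def swap_columns(lines):
    """Turn lemma<TAB>cluster lines into cluster<TAB>lemma lines.

    Lines with fewer than two tab-separated columns are passed
    through unchanged.
    """
    swapped = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            swapped.append(line)
            continue
        swapped.append(f"{parts[1]}\t{parts[0]}\n")
    return swapped

# Hypothetical input in the reversed (lemma<TAB>cluster) order.
fixed = swap_columns(["pes\t0000000\n", "dum\t0000001\n"])
print("".join(fixed), end="")
```

To apply it to a real file, read the lines with `open(path, encoding="utf-8")` and write the result to a new file rather than overwriting the original.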
