Error using POS tagger #17

Closed
rlvoyer opened this issue Dec 10, 2014 · 6 comments

@rlvoyer

rlvoyer commented Dec 10, 2014

Hi there. I'm trying to use your POS tagger and I'm getting the following error when I attempt to train on a very small sample (10 sentences) from the Penn Treebank WSJ dataset. Any thoughts as to what I'm doing wrong?

In [2]: from redshift.tagger import train

In [3]: train(open('wsj.10.txt', 'r').read(), 'redshift_model')
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-16d6fd520844> in <module>()
----> 1 train(open('wsj.10.txt', 'r').read(), 'redshift_model')

/Library/Python/2.7/site-packages/redshift/tagger.so in redshift.tagger.train (redshift/tagger.cpp:2391)()

/Library/Python/2.7/site-packages/redshift/tagger.so in redshift.tagger.Tagger.train_sent (redshift/tagger.cpp:4013)()

/Library/Python/2.7/site-packages/thinc/learner.so in thinc.learner.LinearModel.update (thinc/learner.cpp:2395)()

AssertionError: 
@rlvoyer
Author

rlvoyer commented Dec 10, 2014

I think I've tracked that assertion down to here:

https://github.com/honnibal/thinc/blob/master/thinc/learner.pyx#L99

But I'm unclear as to why my class label is negative.

@syllog1sm
Owner

Hi,
Thanks for your patience and persistence! Sorry I haven't had much time to help yet.

How is the data in wsj.10.txt formatted? Are the tests passing for you?

This test shows passing a single training example to the train function: https://github.com/syllog1sm/redshift/blob/develop/tests/test_tagger.py

@rlvoyer
Author

rlvoyer commented Dec 11, 2014

wsj.10.txt is PTB-formatted:

Why/WRB is/VBZ the/DT stock/NN market/NN suddenly/RB so/RB volatile/JJ ?/. 

This seems to be the expected format for the Input.from_pos constructor.
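
As a quick sanity check on the input file, the word/TAG format above can be validated before training. This is a minimal sketch, not redshift's own parsing code: `check_tagged_line` is a hypothetical helper, and splitting on the last slash is an assumption about how tokens that contain a slash should be handled.

```python
# Hypothetical helper, not part of redshift: checks that every
# whitespace-separated token in a line has the form word/TAG.
def check_tagged_line(line):
    """Return a list of (word, tag) pairs; raise ValueError on a malformed token."""
    pairs = []
    for token in line.split():
        # Assumption: split on the LAST slash, so a word that itself
        # contains '/' keeps its text and only the final part is the tag.
        word, sep, tag = token.rpartition('/')
        if not sep or not word or not tag:
            raise ValueError('malformed token: %r' % token)
        pairs.append((word, tag))
    return pairs


sample = 'Why/WRB is/VBZ the/DT stock/NN market/NN suddenly/RB so/RB volatile/JJ ?/.'
pairs = check_tagged_line(sample)
print(pairs[:3])
```

Running this over each line of wsj.10.txt would flag tokens with a missing word or tag. It would not catch the assertion reported here if the file is well formed.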

I tried running the tests and two of them fail. As the output below shows, both failures come from the same AssertionError that I mentioned above:

➜  redshift git:(develop) ✗ py.test
========================================================= test session starts ==========================================================
platform darwin -- Python 2.7.6 -- py-1.4.26 -- pytest-2.6.4
collected 24 items 

tests/test_ae.py .............
tests/test_edit_ae.py .....
tests/test_lexicon.py .
tests/test_parser.py E
tests/test_tagger.py ...E

================================================================ ERRORS ================================================================
_____________________________________________________ ERROR at setup of test_parse _____________________________________________________

    @pytest.fixture
    def train_dir():
        import redshift.parser
>       redshift.parser.train(train_str, model_dir)

tests/test_parser.py:20: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
redshift/parser.pyx:111: in redshift.parser.train (redshift/parser.cpp:3039)
    parser.tagger.train_sent(py_sent)
redshift/tagger.pyx:122: in redshift.tagger.Tagger.train_sent (redshift/tagger.cpp:4013)
    self.guide.update(counts)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   AssertionError

thinc/learner.pyx:81: AssertionError
______________________________________________________ ERROR at setup of test_tag ______________________________________________________

    @pytest.fixture
    def train_dir():
        import redshift.tagger
        sent_strs = []
        for sent_str in train_str.strip().split('\n\n'):
            sent = []
            for tok_str in sent_str.strip().split('\n'):
                fields = tok_str.split()
                sent.append('%s/%s' % (fields[1], fields[3]))
            sent_strs.append(' '.join(sent))
        train_pos = '\n'.join(sent_strs)
>       redshift.tagger.train(train_pos, model_dir)

tests/test_tagger.py:27: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
redshift/tagger.pyx:43: in redshift.tagger.train (redshift/tagger.cpp:2391)
    tagger.train_sent(sent)
redshift/tagger.pyx:122: in redshift.tagger.Tagger.train_sent (redshift/tagger.cpp:4013)
    self.guide.update(counts)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   AssertionError

thinc/learner.pyx:81: AssertionError
================================================== 22 passed, 2 error in 1.71 seconds ==================================================

Are you able to reproduce this? I'm running on OS X 10.10, using Python 2.7.6.

@syllog1sm
Owner

Okay, I think I've fixed this.

The underlying problem is that I've broken the perceptron code out into its own module, thinc, and I'd been building redshift against my local version of that library instead of the one on pip.

Try pulling the new version and running "pip install -r requirements.txt" to get thinc 1.50. Then run "fab clean make test".
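
To confirm which thinc version is actually installed after reinstalling, something like the following could be used. It is only a sketch: it assumes Python 3.8 or later for `importlib.metadata` (newer than the Python 2.7 used in this thread), and `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


# Hypothetical helper: look up the version of an installed distribution.
def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version('thinc')
print('thinc installed: %s' % (found if found is not None else 'no'))
```

If the reported version differs from the one pinned in requirements.txt, a stale local build is a likely cause.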

@rlvoyer
Author

rlvoyer commented Dec 14, 2014

Yay. Tests pass and I've trained a tagger. Thanks!

@syllog1sm
Owner

Great! Thanks for the bug reports. Let me know if you have any other problems.
