Access to tagset in ShiftReduceParser #41

reckart · 2014-12-03T23:03:05Z

It would be nice if the ShiftReduceParser exposed a tagSet() method which would basically do a

return model.knownStates

Currently, I need to use reflection to access ShiftReduceParser.model and BaseModel.knownStates to extract the tag set.
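For readers unfamiliar with the workaround, here is a minimal sketch of reading two nested private fields via reflection. DemoParser and DemoBaseModel are hypothetical stand-ins for ShiftReduceParser and BaseModel, and the Set<String> type of knownStates is an assumption, not something confirmed in this thread.

```java
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for BaseModel; the field type is an assumption.
class DemoBaseModel {
    private final Set<String> knownStates =
            new LinkedHashSet<>(Arrays.asList("NP", "VP", "@NP"));
}

// Hypothetical stand-in for ShiftReduceParser.
class DemoParser {
    private final DemoBaseModel model = new DemoBaseModel();
}

class ReflectionSketch {
    // Reads a (possibly private) field, searching the class and its superclasses.
    static Object readField(Object target, String name) {
        for (Class<?> c = target.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Field f = c.getDeclaredField(name);
                f.setAccessible(true);
                return f.get(target);
            } catch (NoSuchFieldException e) {
                // Not declared on this class; try the superclass next.
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + name, e);
            }
        }
        throw new IllegalStateException("No field named " + name);
    }

    // Follows parser.model.knownStates using the field names mentioned above.
    @SuppressWarnings("unchecked")
    static Set<String> knownStates(Object parser) {
        Object model = readField(parser, "model");
        return (Set<String>) readField(model, "knownStates");
    }

    public static void main(String[] args) {
        System.out.println(knownStates(new DemoParser()));
    }
}
```

This breaks silently whenever a field is renamed, which is the reason for asking for a public accessor.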


whitten · 2014-12-04T15:58:36Z

Is tagset() the most useful method to expose? What kind of information do you need to properly understand/interpret the information in the tagset, and is it available (already exposed)? Will exposing the tagset require exposing other information simply to make it useful?

reckart · 2014-12-04T16:16:53Z

I just need access to the raw tags in the model. I already know how to use the LanguagePack to convert them into basic categories, that I have to ignore the "@" tags because these are part of the internal binarized trees, and that I may have to strip the grammatical function from the tag (also using the LanguagePack).
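A rough sketch of that post-processing, under stated assumptions: stripFunction is a hypothetical stand-in for the LanguagePack step and simply cuts at the first "-" after the first character, which is cruder than what a real LanguagePack does. The sample states are illustrative only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class TagCleanupSketch {
    // Hypothetical stand-in for the LanguagePack: drops a "-FUNCTION" suffix.
    // Labels that start with '-' (bracket-style labels) are returned unchanged.
    static String stripFunction(String tag) {
        if (tag.startsWith("-")) {
            return tag;
        }
        int i = tag.indexOf('-', 1);
        return i < 0 ? tag : tag.substring(0, i);
    }

    // Skips "@" states from the binarized trees and normalizes the rest.
    static Set<String> cleanTags(Iterable<String> rawStates) {
        Set<String> result = new TreeSet<>();
        for (String state : rawStates) {
            if (state.startsWith("@")) {
                continue;
            }
            result.add(stripFunction(state));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("NP", "@NP", "NP-SBJ", "VP", "@VP", "PP-LOC");
        System.out.println(cleanTags(raw));
    }
}
```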

Knowing the tagset gives us a hint (not certainty!) on whether a model is semantically compatible with another model.

In DKPro Core, we try to extract tagset information from all models. Cf. the DKPro Core UIMA wrapper code for the Stanford parser [1]. We use the extracted tagset information to:

- determine what tagset is (probably) being used by a model - some models don't come with decent documentation
- verify that tagsets don't change between revisions of a model
- if they change, update our code or our mappings of tags to coarse-grained tags and other metadata
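The revision check in the list above boils down to a set difference in both directions. A minimal sketch follows; the class name, method name, and the sample tags are made up for illustration and do not come from any actual model.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class TagsetDiffSketch {
    // Returns the tags that are in 'expected' but not in 'actual'.
    static Set<String> missingFrom(Set<String> expected, Set<String> actual) {
        Set<String> missing = new TreeSet<>(expected);
        missing.removeAll(actual);
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative tag sets only.
        Set<String> oldRevision = new TreeSet<>(Arrays.asList("DT", "NN", "VB", "JJ"));
        Set<String> newRevision = new TreeSet<>(Arrays.asList("DT", "NN", "VB", "RB"));

        System.out.println("removed: " + missingFrom(oldRevision, newRevision));
        System.out.println("added:   " + missingFrom(newRevision, oldRevision));
    }
}
```

Two empty differences mean the tag sets match; anything else flags a mapping that may need updating.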

[1] https://code.google.com/p/dkpro-core-gpl/source/browse/de.tudarmstadt.ukp.dkpro.core-gpl/trunk/de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl/src/main/java/de/tudarmstadt/ukp/dkpro/core/stanfordnlp/StanfordParser.java?r=606#556

AngledLuffa · 2014-12-08T19:54:38Z

Done


reckart · 2014-12-09T10:59:00Z

Thanks! :) (I guess the commit comes later).

AngledLuffa · 2014-12-09T17:15:32Z

I recall something about the pushing to github no longer working. I don't
know anything more than that, as I am no longer directly connected with the
group.

gangeli · 2014-12-14T05:50:04Z

The script pushing to Github should be working again -- is the commit still not showing up?

reckart · 2014-12-14T09:51:32Z

It is possible to tie commits to issues by including the issue number in the commit message (cf. link below). Doing so causes commits to show up in an issue. I assumed you do that, so I didn't even check the actual commits list to search for a related commit.

https://guides.github.com/features/issues/
used e.g. here: nlplab/brat#1084

manning · 2014-12-19T04:29:52Z

I see you added a knownStates() method, John, but that doesn't actually give Richard what he wants, since those states are the phrasal category set not the tag set. Actually, the current sr parser doesn't store the category set anywhere. However, I'm writing a method which in a rather brittle way extracts the tag set from the features. It seems like it will work. But it might be worth it in the future just to add the tag set to the srparser models. It wouldn't take much extra space in what are already huge models, and I agree with Richard that it is useful information to have to check model compatibility. Indeed, we have an integration test that does that now even!

AngledLuffa · 2014-12-19T04:38:10Z

Perhaps mistakenly I assumed Richard wanted phrasal categories. POS tags
must be extracted from the tagger which is used instead of the parser.

The srparser could theoretically add the list of expected tags at training
time by looking at the tagger. We could even go back and add that to the
existing models if we want. I think that extracting it from the features
is not a great way of doing it.

John

manning · 2014-12-19T05:07:15Z

I admit the way I have done it isn't great, but an srparser model does have an implicit tag set, reflecting the set of tags it was trained on. And it has proven to be a great data integrity/compatibility check to have this available. For instance, I now know that the Spanish SR parser models have a tag set incompatibility problem versus the PCFG and tagger models (perhaps because they are older?). They're missing the tags: de0000, faa, fia, pe000000, vaic000, vsic000, which are present in the latter two.

reckart · 2014-12-19T07:44:53Z

I'm fine with the states (knownStates). Through earlier conversations with you, I (think I) know pretty well how to derive the actual tagset from those. At least I get consistent tagsets extracted across all the different parsers using different APIs (shift-reduce, pcfg, rnn, etc.).

manning closed this as completed in e60dba5 Dec 19, 2014
