Scan omitted Grammar tagging in many instances #60

Closed
destatez opened this issue Nov 24, 2016 · 14 comments

Comments

@destatez
Contributor

destatez commented Nov 24, 2016

We should identify below all of the grammar abbreviations that occur and should have grammar tagging around them, e.g. adv. for an adverb. It should be possible to develop a script that does a global replace (adding the tagging) for each instance that is not already tagged. The list of abbreviations can be extracted from section "I. GENERAL." at the beginning of the XML file.

Most of the current instances of tagging occur after the <form...> tag-pair and the <etym...> tag-pair and before the first <sense...> tag-pair, but there are also current instances that are part of the contents of a <sense...> tag-pair. A decision will need to be made when developing and running this script: should the "replacements" be made only before the <sense...> tag-pair, or wherever the abbreviations occur?
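
A minimal sketch of what such a script might look like, for discussion only. It assumes the grammar tag is `<pos>` with no attributes and that each entry can be handled as a string; the three-item abbreviation list and the function name `tag_untagged` are hypothetical stand-ins, not taken from the A-S file. The `before_sense_only` flag corresponds to the decision described above.

```python
import re

# Hypothetical sample; the real list would come from section "I. GENERAL."
ABBREVIATIONS = ["adv.", "adj.", "prep."]


def tag_untagged(entry_xml, abbreviations=ABBREVIATIONS, before_sense_only=True):
    """Wrap abbreviations that are not already tagged in <pos>...</pos>.

    Assumes the tag is written exactly as <pos> (no attributes).
    """
    head, sep, tail = entry_xml, "", ""
    if before_sense_only:
        # Only touch the text that precedes the first <sense...> tag.
        head, sep, tail = entry_xml.partition("<sense")

    # Longest abbreviations first so a short one cannot shadow a longer one.
    alternation = "|".join(
        re.escape(a) for a in sorted(abbreviations, key=len, reverse=True)
    )
    # First alternative consumes an already tagged span so it is left alone;
    # second alternative captures a bare abbreviation at a word start.
    pattern = re.compile(r"<pos>.*?</pos>|(?<![\w<])(" + alternation + r")")

    def wrap(match):
        if match.group(1) is None:
            return match.group(0)  # already tagged, keep unchanged
        return "<pos>" + match.group(1) + "</pos>"

    return pattern.sub(wrap, head) + sep + tail


entry = (
    "<entry><form><orth>ἀεί</orth></form>, adv., <pos>adj.</pos> "
    "<sense>an adv. here</sense></entry>"
)
print(tag_untagged(entry))
print(tag_untagged(entry, before_sense_only=False))
```

A string-based replace like this is easy to review in a diff, but it knows nothing about XML structure, so the output would still need to be checked by hand.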

@cbearden
Member

cbearden commented Nov 25, 2016 via email

@destatez
Contributor Author

destatez commented Nov 25, 2016 via email

@cbearden
Member

cbearden commented Nov 26, 2016 via email

@destatez
Contributor Author

destatez commented Nov 26, 2016 via email

@destatez
Contributor Author

destatez commented Nov 27, 2016 via email

@dowens76
Member

Dave, you've already been approved for the group using your Gmail address. I approved you almost immediately. Try sending an email to
text-abbott-smith-project@googlegroups.com.

@cbearden
Member

cbearden commented Nov 27, 2016 via email

@toddlprice

Re: par. 2 of the 1st post: Yes, I do think that the grammar abbreviations even in the Sense sections should be tagged. This might be a bit beyond the original scope of making a digital representation of A-S, so perhaps this should wait until Stage 2 and be considered part of the UGL. What I mean is that I see use for it where the grammar tags in UGL can be linked to UGG so that these grammatical concepts are explained in our Grammar. That is beyond the Stage 1 goal.

@toddlprice

Just to clarify, as part of digitizing A-S, we do want the grammar abbreviations to have tagging around them. This is valid and needed for stage 1. But linking those tags to UGG needs to wait until stage 2.

@destatez
Contributor Author

I have run across an issue on this topic. I searched the XML for the POS "keywords" and found some instances that are part of a description, as well as what I would call viable instances. I have attached some examples of the search output and need a little clarification on what should and shouldn't be tagged. The keywords that I used are listed below. The search finds any word that starts with the keyword, which is why I had to qualify some keywords to keep other words out of the results.
adj, adv, article, conj, interj, num, part, prep, pron, subst, art. (and NOT article), super (and NOT superscript), noun (and NOT pron), verb (and NOT adv)

Non-tagged-POS.txt
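
For reference, a prefix search with exclusions along these lines could be sketched as follows. This is not the search that produced the attached file; the function name `find_keyword_hits`, the tokenizing pattern, and the sample sentence are all assumptions made for illustration.

```python
import re


def find_keyword_hits(text, keyword, exclude=()):
    """Return the words in text that start with keyword,
    skipping any word that starts with one of the excluded prefixes."""
    hits = []
    # A "word" here is a run of letters that may contain or end with periods.
    for word in re.findall(r"[A-Za-z][A-Za-z.]*", text):
        lower = word.lower()
        if lower.startswith(keyword) and not any(
            lower.startswith(prefix) for prefix in exclude
        ):
            hits.append(word)
    return hits


sample = "The article (art.) is superscript; super. form, adv. and adverb."
print(find_keyword_hits(sample, "art", exclude=["article"]))
print(find_keyword_hits(sample, "super", exclude=["superscript"]))
print(find_keyword_hits(sample, "adv"))
```

Note that a bare keyword such as `adv` also picks up running-text words like "adverb", which is the kind of hit that needs a manual decision.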

@toddlprice

I think the examples in your txt file (verb, part and art) should not be tagged. It looks like ptcp. should be tagged since it is used in lexical entries rather than in 'running text'.

@destatez
Contributor Author

I am concerned about the current state of the pos tags in A-S. There are currently 53 different "values" that are tagged in the XML (see A_S_XML_pos_instance_text.txt). {I combined instances that were abbreviations or variations of abbreviations for those listed.} There are a total of 357 instances where these are tagged, with 29 of these being within the sense data (see A_S_pos_sense_Instances.txt). The remainder are within the orth data or etym data, which is where I would have expected them. My questions, as they relate to automating the tagging of the XML file, are:

  1. Should I tag only instances that are within the orth or etym data, or should I also include the instances in the sense data?
  2. What text should I search for to do this tagging? {I put the list from the Issue in file: Possible_pos_values.txt, where I moved article, part, and verb to the DO NOT include list.} Could you review this list and move any other values that you believe I should "ignore" for this tagging to the DO NOT include list?
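
Counts of this kind (distinct values, total instances, instances inside sense data) could be gathered with something like the sketch below. It is not the method used for the attached files. It assumes elements named `pos` and `sense` with no XML namespace, which may not match the real file; the function name `summarize_pos` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_pos(xml_text):
    """Count <pos> values overall and how many <pos> sit inside a <sense>.

    Assumes un-namespaced <pos> and <sense> element names.
    """
    root = ET.fromstring(xml_text)
    # Identify every <pos> that has a <sense> ancestor.
    inside_sense = {
        id(pos) for sense in root.iter("sense") for pos in sense.iter("pos")
    }
    values = Counter()
    sense_count = 0
    for pos in root.iter("pos"):
        values[(pos.text or "").strip()] += 1
        if id(pos) in inside_sense:
            sense_count += 1
    return values, sense_count


sample = (
    "<div>"
    "<entry><form><orth>a</orth></form> <pos>adv.</pos>"
    "<sense>x <pos>adj.</pos></sense></entry>"
    "<entry><pos>adv.</pos></entry>"
    "</div>"
)
values, sense_count = summarize_pos(sample)
print(len(values), sum(values.values()), sense_count)
```

If the real file declares a namespace, the element names passed to `iter` would need the namespace prefix in `{uri}name` form.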

A_S_XML_pos_instance_text.txt

A_S_pos_sense_Instances.txt

Possible_pos_values.txt

@toddlprice

Only tag what is in orth and etym data.

@destatez
Contributor Author

Updated XML with only 13 changes needed when scope was reduced to orth & etym.
