Scan omitted Grammar tagging in many instances #60
Comments
Hi David,
This sounds good. I have a pair of scripts ('scripts/find_foreign.py' and
'scripts/fix_foreign.py') that I used to find segments of Greek & Hebrew
text that weren't enclosed in <foreign> and to add the <foreign> tags.
Possibly they could be adapted to this purpose as well. I may not get to
that immediately, so others may beat me to the punch with a different
approach.
I think there are a number of ways we could use scripts or XQuery to make
the analysis and fixing of the markup faster. My immediate focus is on
making the document valid TEI/OSIS again.
All the best,
Chuck
On Thu, Nov 24, 2016 at 2:18 PM, David Statezni ***@***.***> wrote:
We should identify below all of the grammar abbreviations that occur
and that should have grammar tagging around them, e.g. adv. for an
adverb. A script could then do a global replace (adding the tagging)
for each instance that is not already tagged. The list of these can
be extracted from the front matter.
Most of the current instances of tagging occur after the <form...>
tag-pair and the <etym...> tag-pair and before the first <sense...>
tag-pair, but there are also current instances that are a part of the
contents of a <sense...> tag-pair. A decision will need to be made, when
developing and running this script, whether the "replacements" should only
be made before the <sense...> tag-pair or whether they should be made
wherever the abbreviations occur.
|
Charles
That sounds like a plan. I have been using Perl to do all sorts of global
replacements for the ULB, UDB, Notes, tW, etc. Either tool can do the job.
My thought on this topic and on Issue 59 was to wait until all
manual editing is complete and then use the scripts to "catch" any instances
that were missed by the editors.
Dave
|
Hi Dave,
Would it be good to have a channel for general communications about the
project, so as not to overload the Github 'issues' feature with more
general topics? I don't know any way to contact you other than responding
to this issue.
There is a Google Group ("TExT: Abbott-Smith Project"), but the last posts
in it were from me, about my efforts to tag Greek & Hebrew with <foreign>,
about a year ago. For instance, I don't know anything about the work of
manual review that is evidently going on (which is great news!).
I'd like to get the XML file into valid shape, but I don't want to make
life harder for those trying to merge my work with the results of their
manual review. Also, I think we'll need to discuss some markup choices.
Would it make sense to use the Google Group for general coordination and
discussion, or is there another, better channel?
All the best,
Chuck
|
Charles
I just got connected to that Google Group. That sounds like a good means
of communication. We really need to get Chapel and possibly Todd connected
to it, since they are the leads. I cc'd them on this reply. I am just an
editor and tool-guy.
Dave
|
Charles
It's taking some time to get approved for that Google Group, though I
thought that I had received a message saying that I was. So I can't answer
you via a post against your latest topic. You can either wait until I get
approved, or you can pass me your email address and I can send you a message
about what the editors are doing. Your pick.
Dave
|
Dave, you've already been approved for the group using your Gmail address. I approved you almost immediately. Try sending an email to |
Hi Dave,
I was able to see your post to the group with the subject "Group
Acceptance". Looks like you are able to post now. If you didn't get a copy
of the reply in your email inbox, perhaps you just need to edit your email
preference settings for the group.
I'm looking forward to hearing about what's going on with the dictionary. I
see you're with Wycliffe, which is very cool.
All the best,
Chuck
|
Re: par. 2 of the 1st post: Yes, I do think that the grammar abbreviations even in the Sense sections should be tagged. This might be a bit beyond the original scope of making a digital representation of A-S, so perhaps this should wait until Stage 2 and be considered part of the UGL. What I mean is that I see use for it where the grammar tags in UGL can be linked to UGG so that these grammatical concepts are explained in our Grammar. That is beyond the Stage 1 goal. |
Just to clarify, as part of digitizing A-S, we do want the grammar abbreviations to have tagging around them. This is valid and needed for stage 1. But linking those tags to UGG needs to wait until stage 2. |
I have run across an issue on this topic. I have done searches of the XML looking for the POS "keywords" and have found instances of these that are a part of a description, as well as what I would call viable instances. I have attached some examples of the search output and need a little clarification on what should be and what shouldn't be tagged. The keywords that I used were as follows. The search would find any word that started with the keyword. That was why I had to qualify some to preclude others from appearing in the search. |
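A minimal sketch of the kind of prefix search described above, written in Python to match the existing helpers under scripts/. The keyword list actually used was not preserved in this thread, so the keywords, the exclusions, and the sample sentence below are all hypothetical. `find_prefix_hits` is an illustrative name, not an existing project function.

```python
import re


def find_prefix_hits(text, keyword, exclude_prefixes=()):
    """Return every word in text that starts with keyword.

    Words that begin with any of exclude_prefixes are skipped, which is
    one way to "qualify" a keyword so that unrelated words drop out.
    """
    # One negative lookahead per excluded prefix, applied at the word start.
    guard = "".join(f"(?!{re.escape(p)})" for p in exclude_prefixes)
    # \w*  : the rest of the word after the keyword
    # \.?  : keep a trailing period so abbreviations such as "art." stand out
    pattern = re.compile(rf"\b{guard}{re.escape(keyword)}\w*\.?")
    return [m.group(0) for m in pattern.finditer(text)]


# Hypothetical running text, not taken from the lexicon.
sample = "adv. of place; used with the article and a participle, particularly in art."

print(find_prefix_hits(sample, "art"))
print(find_prefix_hits(sample, "art", exclude_prefixes=["article"]))
print(find_prefix_hits(sample, "part", exclude_prefixes=["particular"]))
```

Keeping the trailing period in each hit makes it easier to separate abbreviation forms from ordinary words in running text when reviewing the output by hand.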
I think the examples in your txt file (verb, part and art) should not be tagged. It looks like ptcp. should be tagged since it is used in lexical entries rather than in 'running text'. |
I am concerned about the current state of the pos tags in A-S. There are currently 53 different "values" that are tagged in the XML (see A_S_XML_pos_instance_text.txt; I combined instances that were abbreviations or variations of abbreviations for those listed). There are a total of 357 instances where these are tagged, with 29 of these being within the sense data (see A_S_pos_sense_Instances.txt). The remainder are within the orth data or etym data, which is where I would have expected them. My questions, as they relate to automating the tagging of the XML file, are:
|
Only tag what is in orth and etym data. |
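To see how much existing markup falls inside or outside the sense data, the `<pos>` elements can be counted by position. The following is only a sketch: the sample entry is invented, the element names `pos` and `sense` are taken from this discussion, and the real file's structure and namespace may differ. `count_pos` and `local_name` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical entry; the real file's structure and namespace may differ.
SAMPLE = """<entry>
  <form><orth>ἄνω</orth></form>
  <etym>from ἀνά</etym>
  <pos>adv.</pos>
  <sense>above; with the <pos>art.</pos> as <pos>adj.</pos></sense>
</entry>"""


def local_name(tag):
    """Drop a '{namespace}' prefix so 'pos' matches with or without one."""
    return tag.rsplit("}", 1)[-1]


def count_pos(root):
    """Count <pos> elements, split by whether a <sense> ancestor encloses them."""
    counts = Counter()

    def walk(elem, in_sense):
        # Once inside a <sense>, every descendant counts as sense data.
        in_sense = in_sense or local_name(elem.tag) == "sense"
        if local_name(elem.tag) == "pos":
            counts["in_sense" if in_sense else "outside_sense"] += 1
        for child in elem:
            walk(child, in_sense)

    walk(root, False)
    return counts


counts = count_pos(ET.fromstring(SAMPLE))
print(counts["outside_sense"], counts["in_sense"])
```

Running the same count before and after an automated pass gives a quick check that the pass only touched the intended scope.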
Updated the XML; only 13 changes were needed once the scope was reduced to orth & etym. |
We should identify below all of the grammar abbreviations that occur and that should have grammar tagging around them, e.g. adv. for an adverb. A script could then do a global replace (adding the tagging) for each instance that is not already tagged. The list of these can be extracted from section "I. GENERAL." at the beginning of the XML file.
Most of the current instances of tagging occur after the <form...> tag-pair and the <etym...> tag-pair and before the first <sense...> tag-pair, but there are also current instances that are a part of the contents of a <sense...> tag-pair. A decision will need to be made, when developing and running this script, whether the "replacements" should only be made before the <sense...> tag-pair or whether they should be made wherever the abbreviations occur.
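The global replace described here could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's script: the abbreviation list is a made-up subset (the real one would come from section "I. GENERAL."), the wrapper element is assumed to be a plain `<pos>`, and `tag_abbreviations` is a hypothetical name. The `skip_sense` flag stands in for the open decision about whether to tag inside `<sense>`.

```python
import re

# Hypothetical subset; the real list would be extracted from the XML file.
ABBREVIATIONS = ["adv.", "adj.", "ptcp."]

TAG = re.compile(r"(<[^>]+>)")          # splits the text into markup and text pieces
TAG_NAME = re.compile(r"<(/?)([\w:.-]+)")


def tag_abbreviations(xml_text, abbreviations, skip_sense=True):
    """Wrap untagged abbreviations in <pos>...</pos>.

    Text already inside <pos> is left alone, so the pass can be re-run safely.
    With skip_sense=True, text inside <sense> is left alone as well.
    """
    # Longest first, so a longer abbreviation wins over a shorter one it contains.
    alts = "|".join(re.escape(a) for a in sorted(abbreviations, key=len, reverse=True))
    # The lookbehind stops matches that begin in the middle of a word.
    abbr_re = re.compile(rf"(?<![\w.])(?:{alts})")
    blocked = {"pos", "sense"} if skip_sense else {"pos"}

    depth = 0  # how many blocked elements currently enclose the text
    out = []
    for i, piece in enumerate(TAG.split(xml_text)):
        if i % 2 == 1:  # odd pieces are the captured markup
            m = TAG_NAME.match(piece)
            if m and m.group(2) in blocked and not piece.endswith("/>"):
                depth += -1 if m.group(1) else 1
            out.append(piece)
        elif depth == 0:
            out.append(abbr_re.sub(lambda mo: f"<pos>{mo.group(0)}</pos>", piece))
        else:
            out.append(piece)
    return "".join(out)


sample = ("<entry><form><orth>ἄνω</orth></form> adv. <pos>adj.</pos> "
          "<sense>as adv. of place</sense></entry>")
print(tag_abbreviations(sample, ABBREVIATIONS))
print(tag_abbreviations(sample, ABBREVIATIONS, skip_sense=False))
```

Tracking element depth rather than matching `<sense>...</sense>` with a single pattern keeps nested senses protected. The sketch assumes element names carry no namespace prefix, and it does not restrict tagging to any particular part of an entry beyond the `<sense>` switch.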