Further <ls> cleanup #136

Closed
funderburkjim opened this issue Jul 8, 2022 · 19 comments

@funderburkjim
Contributor

This issue is devoted to the continuation (from #134) of the cleanup of ls markup in mw.txt.
Two fertile areas:

876 matches in 874 lines for "<ls[^<]* and"
Example: 
OLD
<ls>Mn. ix, 49 and 51.</ls>  
NEW
<ls>Mn. ix, 49</ls> and <ls n="Mn. ix,">51.</ls>

1098 matches in 1085 lines for "<ls[^<]*;"
Example:
OLD
<ls>Mn. iii, 257; v, 73</ls>
NEW
<ls>Mn. iii, 257</ls>; <ls n="Mn.">v, 73</ls>
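
As a rough illustration of the two searches and rewrites above, here is a minimal Python sketch. The function names and the rule used to build the n attribute are assumptions made for this example (they are not taken from the issue136 scripts), and only the simplest two-part case from each example is handled.

```python
import re

# The two search patterns quoted above.
AND_PATTERN = re.compile(r"<ls[^<]* and")
SEMI_PATTERN = re.compile(r"<ls[^<]*;")

# Simplest cases only: one plain <ls> holding exactly two parts.
TWO_PART_AND = re.compile(r"<ls>([^<;]*?) and ([^<;]*)</ls>")
TWO_PART_SEMI = re.compile(r"<ls>([^<;]*); ([^<;]*)</ls>")


def count_matches(lines, pattern):
    """Return (total matches, number of lines with at least one match)."""
    total = 0
    hit_lines = 0
    for line in lines:
        n = len(pattern.findall(line))
        total += n
        hit_lines += 1 if n else 0
    return total, hit_lines


def split_and(text):
    """Split '<ls>A x, 1 and 2</ls>' into two <ls> elements."""
    def repl(m):
        first, second = m.group(1), m.group(2)
        # Assumption: the shared prefix is everything before the last token.
        prefix = first.rsplit(" ", 1)[0]
        return f'<ls>{first}</ls> and <ls n="{prefix}">{second}</ls>'
    return TWO_PART_AND.sub(repl, text)


def split_semicolon(text):
    """Split '<ls>A x, 1; y, 2</ls>' into two <ls> elements."""
    def repl(m):
        first, second = m.group(1), m.group(2)
        # Assumption: the abbreviation is the first space-delimited token.
        abbrev = first.split(" ", 1)[0]
        return f'<ls>{first}</ls>; <ls n="{abbrev}">{second}</ls>'
    return TWO_PART_SEMI.sub(repl, text)
```

Multi-word abbreviations and references with three or more parts would need more care, so output of this kind is best treated as a list of proposals to review by hand.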
@gasyoun
Member

gasyoun commented Jul 9, 2022

fertile areas

💯

funderburkjim added a commit that referenced this issue Jul 15, 2022
funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Jul 15, 2022
@funderburkjim
Contributor Author

This week's changes to mw.txt, primarily to ls markup, are now completed.
The work is done in the issue136 directory.

The sequence of changes is in files change_1.txt through change_4.txt, with corresponding notes in the readme.txt file of the issue136 directory.

The change_all.txt file lists all 4221 lines changed in mw.txt.

The ls markup in mw.txt has received considerable attention in the last few weeks.
Thanks to @Andhrabharati and @gasyoun for pointing out areas that needed attention.

At the moment, I don't have in mind further lines of improvement to ls markup in mw.txt, and will likely return to ls markup improvements in pw and pwg.

@Andhrabharati
Contributor

Andhrabharati commented Jul 15, 2022

@funderburkjim may want to consider removing the 3 <s1 slp1="..."> tags inside <ls> strings:

change <ls>Kielhorn., <s1 slp1="mahABAzya">Mahābhāṣya</s1>, vol. i, preface, p.9 f.</ls>
as <ls>Kielhorn., Mahābhāṣya, vol. i, preface, p.9 f.</ls>

change <ls>VS. (<s1 slp1="kARva">Kāṇva</s1>) ii, 24</ls>
as <ls>VS. (Kāṇva.) ii, 24</ls>

change <ls>YajurV. <s1 slp1="parIS">Parīś</s1>. xv</ls>
as <ls>YajurV. Parīś. xv</ls>
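
A minimal Python sketch of this kind of change, assuming the s1 elements always have the single-attribute form shown in the three examples. The function name is made up for illustration. It only unwraps the tag and does not add the period that the second example puts after Kāṇva.

```python
import re

# An <ls> element, with or without attributes, up to its closing tag.
LS_ELEMENT = re.compile(r"<ls\b[^>]*>.*?</ls>")
# An <s1> element of the form seen in the examples above.
S1_ELEMENT = re.compile(r'<s1 slp1="[^"]*">(.*?)</s1>')


def strip_s1_inside_ls(text):
    """Replace <s1 ...>X</s1> by X, but only inside <ls>...</ls>."""
    return LS_ELEMENT.sub(lambda m: S1_ELEMENT.sub(r"\1", m.group(0)), text)
```

Any s1 markup outside of ls elements is left untouched.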

@Andhrabharati
Contributor

Andhrabharati commented Jul 15, 2022

And also correct the space-between-digits [0-9] [0-9] errors inside the <ls> (45 occurrences, which link to a wrong place) and <pc> (4 occurrences, which lead to a wrong page) blocks.
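
For the <ls> part, a search along these lines could list the candidates. This is only a sketch with a hypothetical function name. It reports the suspect elements and does not try to repair them, since the right fix (dropping the space, or something else) has to be checked against the print. The <pc> cases would need their own pattern.

```python
import re

# An <ls> element; group 1 is its text content.
LS_ELEMENT = re.compile(r"<ls\b[^>]*>(.*?)</ls>")
# The pattern from the comment above: digit, space, digit.
DIGIT_SPACE_DIGIT = re.compile(r"[0-9] [0-9]")


def ls_with_digit_gap(line):
    """Return the <ls> elements of a line whose content has 'digit space digit'."""
    return [
        m.group(0)
        for m in LS_ELEMENT.finditer(line)
        if DIGIT_SPACE_DIGIT.search(m.group(1))
    ]
```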

@Andhrabharati
Contributor

Andhrabharati commented Jul 15, 2022

Some more minor corrections, on a quick look--

  1. change <ls>R , i, 27, 7</ls>
    as <ls>R. i, 27, 7</ls>

  2. change <ls>Daś. -</ls>
    as <ls>Daś.</ls>

  3. change <ls>L. -</ls>
    as <ls>L.</ls>

  4. change <ls>Rājat. -</ls>
    as <ls>Rājat.</ls>

  5. change <ls>R. -</ls>
    as <ls>R.</ls>

  6. change <ls>L., also</ls>
    as <ls>L.</ls>, also

  7. change <ls>ŚBr., as</ls>
    as <ls>ŚBr.</ls>, as

  8. change <ls>Beta bengalensis</ls>
    as <bot>Beta bengalensis</bot>

  9. change <ls>T., but according to</ls>; <ls>Uṇ. i, 67</ls>
    as <ls>T.</ls>, but according to <ls>Uṇ. i, 67</ls>

  10. change <ls>MBh. etc</ls>,
    as <ls>MBh.</ls> etc.

  11. change <ls>Kāv. etc.</ls>
    as <ls>Kāv.</ls> etc.

And it is not out of place to mention that many punctuation errors, as at cases 9 & 10 above, are seen throughout the text; and a handful cases of hyphen marks at wrong places which indicate the <hom> numbers of the next entry word are present.
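
Two of the patterns in the list above (a trailing " -" as in cases 2 to 5, and a trailing "etc." as in case 11) are regular enough to sketch in Python. The function name is hypothetical, and punctuation that follows the closing tag, as in case 10, is deliberately not handled here.

```python
import re

# '<ls>Daś. -</ls>' : a stray ' -' just before the closing tag.
TRAILING_HYPHEN = re.compile(r"<ls>([^<]*?) -</ls>")
# '<ls>Kāv. etc.</ls>' : 'etc.' (or 'etc') swallowed by the ls element.
TRAILING_ETC = re.compile(r"<ls>([^<]*?) etc\.?</ls>")


def tidy_ls_tail(text):
    """Drop a trailing ' -' and move a trailing 'etc.' outside of <ls>."""
    text = TRAILING_HYPHEN.sub(r"<ls>\1</ls>", text)
    text = TRAILING_ETC.sub(r"<ls>\1</ls> etc.", text)
    return text
```

The remaining cases (6 to 9) mix English words into the ls content and probably need individual attention.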

@funderburkjim
Contributor Author

@Andhrabharati Will take a look at these flaws; is there a systematic way to find other instances like 9, 10 ?

@funderburkjim funderburkjim reopened this Jul 15, 2022
@Andhrabharati
Contributor

Andhrabharati commented Jul 16, 2022

a handful cases of hyphen marks at wrong places which indicate the <hom> numbers of the next entry word are present.

Though these are not related to <ls> items, they are more important, as they concern the headwords (HWs) and metalines themselves; 6 of them can be found with the regex -[0-9].<info .

There is also a <hom>-related issue, #131, that @funderburkjim has yet to look at!
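
A small Python sketch for applying that regex line by line, so that each hit comes with its line number. The function name is an assumption for this example. Note that the "." in the regex matches any single character.

```python
import re

# The regex quoted above: hyphen, digit, any one character, then '<info'.
HOM_HYPHEN = re.compile(r"-[0-9].<info")


def find_hom_hyphens(lines):
    """Return (line number, line) pairs for lines matching the regex."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if HOM_HYPHEN.search(line)
    ]
```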

funderburkjim added a commit to sanskrit-lexicon/csl-pywork that referenced this issue Jul 21, 2022
funderburkjim added a commit that referenced this issue Jul 21, 2022
funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Jul 21, 2022
@funderburkjim
Contributor Author

further cleanup.

These take into account suggestions since previous commit.
The change transactions are change_5.txt.
All in all, about 800 lines were changed.

A large number of 'new' ls abbreviations were added to mwauth as 'Unknown'.
(Also, the Maṇḍ. abbreviation was given tooltip -- see the 'pywork' commit above.)

@Andhrabharati could help by providing tooltips for these Unknown cases.
ls_abbrev_instances_unknown.txt file has instances of most of these Unknown cases.

An attempt was made to define programmatically a 'normal' ls instance. Using this rule,
there remain about 40 'abnormal' instances identified in file lsabnormal_5.txt.
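
The rule actually used is not spelled out in this thread. Purely as an illustration of what such a check can look like, here is a Python sketch under one assumed definition: an ls instance is 'normal' when it starts with a known abbreviation and is followed only by roman numerals, digits, commas, periods and spaces, optionally ending in "f." or "ff.". The names and the definition are hypothetical.

```python
import re

# An <ls> element; group 1 is the optional n attribute, group 2 the content.
LS_ELEMENT = re.compile(r'<ls(?: n="([^"]*)")?>([^<]*)</ls>')
# Assumed shape of the part that follows the abbreviation.
REFERENCE_TAIL = re.compile(r"^[ ,.ivxlcdm0-9]*(?: ff?\.)?$")


def is_normal(content, known_abbrevs):
    """True if content is a known abbreviation plus a plain reference."""
    # Try longer abbreviations first so that 'MBh.' wins over 'M.'.
    for abbrev in sorted(known_abbrevs, key=len, reverse=True):
        if content.startswith(abbrev):
            return bool(REFERENCE_TAIL.match(content[len(abbrev):]))
    return False


def abnormal_instances(line, known_abbrevs):
    """Return the <ls> elements of a line that fail the assumed rule."""
    found = []
    for m in LS_ELEMENT.finditer(line):
        n_attr, body = m.group(1), m.group(2)
        # The n attribute carries the part omitted in print.
        content = f"{n_attr} {body}" if n_attr else body
        if not is_normal(content, known_abbrevs):
            found.append(m.group(0))
    return found
```

With a small set such as {"Mn.", "MBh.", "L."}, this accepts `<ls>Mn. iii, 257</ls>` and flags `<ls>L., also</ls>` and `<ls>Beta bengalensis</ls>`.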

If there are no more correction suggestions for 'ls' in mw, this issue can be closed and I'll take
a look at the hom issue mentioned above.

@Andhrabharati
Contributor

I see that the space-between-digits fixes and the other corrections related to <ls> items have all been taken into account now.

The remaining point in this issue is #136 (comment), which could be done before going to #131, or kept in mind while doing the <hom> corrections.

I would prefer that they be corrected here, as these are not marked <hom> explicitly.

Jim can decide on the action accordingly and close this issue.

@Andhrabharati
Contributor

A large number of 'new' ls abbreviations were added to mwauth as 'Unknown'. (Also, the Maṇḍ. abbreviation was given tooltip -- see the 'pywork' commit above.)

@Andhrabharati could help by providing tooltips for these Unknown cases. ls_abbrev_instances_unknown.txt file has instances of most of these Unknown cases.

An attempt was made to define programmatically a 'normal' ls instance. Using this rule, there remain about 40 'abnormal' instances identified in file lsabnormal_5.txt.

If @funderburkjim would like to do it here, I can certainly help resolve these, but I would suggest doing this piece of work when issue #135 is taken up (it concerns the same matter and also has some more relevant points).

I hope to hear Jim's opinion.

@Andhrabharati
Contributor

Andhrabharati commented Jul 21, 2022

Some small extra corrections related to spaces:

  1. There are 112 double space instances in the mw.txt, that are to be made single spaces.
  2. 10 cases of " ,", 9 cases of " ;" and 10 cases of " )" to have the preceding space deleted.
  3. 3 dangling > to be deleted.
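
The first two items lend themselves to a mechanical pass. A minimal Python sketch follows, with a hypothetical function name. It assumes that a space before a comma, semicolon or closing parenthesis is never intended, which should be checked against the actual instances before applying anything. The dangling > cases are not covered.

```python
import re


def tidy_spaces(text):
    """Collapse runs of spaces and drop a space before ',', ';' or ')'."""
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r" ([,;)])", r"\1", text)
    return text
```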

@Andhrabharati
Contributor

Andhrabharati commented Jul 21, 2022

In the tooltip.txt,

99.98 ib. int the same place [Cologne Addition] Title

to be corrected as

99.98 ib. in the same place [Cologne Addition] Title

@Andhrabharati
Contributor

Andhrabharati commented Jul 21, 2022

There are two instances of ** under <L>229710 and <L>237718 in the mw.txt, that may be deleted.

Incidentally, the <ls>ĀpastPray.</ls>, which is at <L>237718, is without a tooltip, but is not listed in either abbrevlist_unknown.txt or in ls_abbrev_instances_unknown.txt, though present in both mwauth.txt and tooltip.txt.

Checking these 4 files against each other, I noticed that mwauth.txt and tooltip.txt each have 168 "Unknown reference" entries still to be expanded, whereas abbrevlist_unknown.txt and ls_abbrev_instances_unknown.txt each list just 147.

What is the reason for the difference of 21 between the two sets of files?

The 21 additional entries in tooltip.txt are--

ĀpGṛh.
ĀpastPray.
Śak. (Chézy)
Śak. (Pi.)
AV. Paipp.
AV., SBE.
Kaegi, Der Ṛgveda
Ludwig, RV.
Muir's Sanskrit Texts
Muir, S. T.
Pañc. B.
Pat. (K.)
R. (B)
R. (B.)
R. G.
R. [B.]
RV. AnuvAnukr.
SV.Anukr.
Uttamac.²
YajurV. Parīś.
Zachariae, Beiträge

In these, Śak. (Pi.) occurs 18 times!!
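
A comparison of this kind comes down to a set difference. In the Python sketch below the function name is hypothetical, and reading the abbreviations out of the four files is left out because their formats are not shown in this thread.

```python
def only_in_first(first, second):
    """Return the sorted abbreviations that are in first but not in second."""
    return sorted(set(first) - set(second))
```

If the 147 abbreviations in abbrevlist_unknown.txt are all among the 168 marked unknown in tooltip.txt, the result has 168 - 147 = 21 entries, matching the list above.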

@Andhrabharati
Contributor

Andhrabharati commented Jul 21, 2022

Resolving 9 out of 10 instances of <ls n="Unknown">:

  1. under <L>67611, <ls n="Unknown">lii, 19</ls> to be made as <ls n="AV.Pariś.">lii, 19</ls>, taking the prev. ls item (AV.Pariś.) as the ref.
    [cf. PWG entry of govITI.]

  2. under <L>71651, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref.
    [cf. pwk entry candraketu and the Ind. St. 14 thereupon (p. 159).]

  3. under <L>71652, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref.
    [cf. pwk entry candrakeSa and the Ind. St. 14 thereupon (p. 159).]

  4. under <L>71680, <ls n="Unknown">xv</ls> to be made as <ls n="Vīrac.">xv.</ls>, taking the prev. ls item (Vīrac.) as the ref.
    [cf. pwk entry candracUqa and the Ind. St. 14 thereupon (p. 159).]

  5. under <L>71827, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref.
    [cf. pwk entry candravikrama and the Ind. St. 14 thereupon (p. 159).]

  6. under <L>71874, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref.
    [cf. pwk entry candrasena and the Ind. St. 14 thereupon (p. 159).]

  7. under <L>84603, <s1 slp1="saMgIta-darpaRa">Saṃgīta-darpaṇa</s1>, <ls n="Unknown">vi</ls> to be made as <ls>Saṃgīta-darpaṇa, vi</ls>

  8. under <L>95073.91, <ls n="Unknown">52, 5</ls> to be made as <ls n="R.">2, 52, 5</ls>; this is a print correction, and has the prev. ls item (MBh. &c.) as the ref.
    [cf. pwk entry darh having "— 5) दृढ꣫ , दृळ्ह꣫" and the PWG entry darh having "°स्थूण R. 2, 105, 16. नौ 2, 52, 5."]

  9. under <L>95074.05, <ls n="Unknown">52, 5</ls> to be made as <ls n="R.">2, 52, 5</ls>; this is a print correction, and has the prev. ls item (MBh. &c.) as the ref.
    [cf. pwk entry darh having "— 5) दृढ꣫ , दृळ्ह꣫" and the PWG entry darh having "°स्थूण R. 2, 105, 16. नौ 2, 52, 5."]
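
Items 1 to 6 all follow the same idea: take the abbreviation of the previous ls item. A Python sketch of that idea is given below. The function names are hypothetical, and the abbreviation is assumed to be the first space-delimited token of the previous item. The output is a list of proposals only. Items 8 and 9 show why each one needs checking, since there the previous item (MBh.) is not the right reference.

```python
import re

# An <ls> element; group 1 is the optional n attribute, group 2 the content.
LS_ELEMENT = re.compile(r'<ls(?: n="([^"]*)")?>([^<]*)</ls>')


def first_token(text):
    """Return the first space-delimited token of text, or None if empty."""
    parts = text.split()
    return parts[0] if parts else None


def propose_unknown_fixes(line):
    """Return (old, proposed) pairs for each <ls n="Unknown"> in a line."""
    previous = None
    proposals = []
    for m in LS_ELEMENT.finditer(line):
        n_attr, body = m.group(1), m.group(2)
        if n_attr == "Unknown":
            if previous is not None:
                proposals.append((m.group(0), f'<ls n="{previous}">{body}</ls>'))
            continue
        # Remember the abbreviation of the most recent resolved item.
        source = n_attr if n_attr is not None else body
        previous = first_token(source) or previous
    return proposals
```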

Shouldn't the last two "दृढ (or दृळ्ह॑)" be marked as or-group candidates?

There are plenty more of such "unmarked groups", separated out as diff. HWs in the whole data of mw.txt.
#132 (comment)

@Andhrabharati
Contributor

Andhrabharati commented Jul 21, 2022

@funderburkjim

while you're on this MW work, would you mind generating the IAST version of mw.txt again, so that I can work better (or rather, faster) with it?
[The version I have is more than a year old (Apr 2021); many updates have been made to the text since then.]

funderburkjim added a commit to sanskrit-lexicon/csl-pywork that referenced this issue Jul 21, 2022
funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Jul 21, 2022
funderburkjim added a commit that referenced this issue Jul 22, 2022
@funderburkjim
Contributor Author

funderburkjim commented Jul 22, 2022

2nd batch of corrections.

These take into account the preceding comments by @Andhrabharati.
About 170 lines changed in mw.txt.
Change transaction details are in change_6.txt.

Note 1: The one remaining n="Unknown" was solved:

 <L>81877<pc>433,1<k1>tattvaboDa
   knowledge or understanding of truth, <ls n="Sarvad.">xii, 46</ls>
   [cf. PWG, and MW tattvaprakASa]

Note 2: Shouldn't the last two "दृढ (or दृळ्ह॑)" be marked as or-group candidates?

  They are already so marked in
   <L>95073.9<pc>490,2<k1>df|a and <L>95074<pc>490,2<k1>dfQa
which have the 'or' markup: <info or="95074,dfQa;95073.9,df|a"/>
 The `or` markup is not repeated for the '2a' subsidiary entries.
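
For reference, the value of the or attribute is a semicolon-separated list of "L-number,key" pairs. A minimal Python sketch for reading it, with a hypothetical function name and assuming the self-closing form shown above:

```python
import re

# The or-group markup as it appears above.
OR_INFO = re.compile(r'<info or="([^"]*)"/>')


def parse_or_group(text):
    """Return the (L-number, key) pairs of the first or-group in text."""
    m = OR_INFO.search(text)
    if m is None:
        return []
    return [tuple(item.split(",", 1)) for item in m.group(1).split(";")]
```

The L-numbers are kept as strings so that values such as 95073.9 come through unchanged.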

The iast version of revised mw.txt is temp_mw_6_iast.zip.

The unknown ls abbreviations file is revised and contains 170 items: abbrevlist_unknown.txt

Instances of the abbreviations with unknown tooltips are in
ls_abbrev_instances_unknown1.txt
and ls_abbrev_instances_unknown1_iast.txt, based on temp_mw_6_iast.txt.

@funderburkjim
Contributor Author

We can discuss tooltips for the unknown literary source abbreviations under #135.
The best format for me would be via an edit of abbrevlist_unknown.txt, where
each Unknown reference text is replaced by the appropriate tooltip text for the abbreviation.
The ls_abbrev_instances_unknown1_iast.txt file might be helpful in examining the cases.

Perhaps now we can consider this #136 closeable?

@Andhrabharati
Contributor

@funderburkjim

Wonderful updates!
And, thanks for the IAST file.

I am about to finish resolving the unknown reference entries (just another 15 remain).
Will post the results in #135.

@Andhrabharati
Contributor

And you can close this issue now.

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Jul 27, 2022