Request of SLP1 text of all dictionaries #7

Closed
drdhaval2785 opened this issue Oct 8, 2014 · 36 comments

Comments

@drdhaval2785
Contributor

Jim,
If we can have the SLP1 text of all dictionaries in their respective repositories, we would be in a better position to play around with them and pick out the errors.
e.g. I could access MW and PWG in SLP1.
This way we could point out 91 possible errors in issue #2.
If lists for the other dictionaries are provided similarly, we may find still more errors by comparing patterns.

@gasyoun
Member

gasyoun commented Oct 8, 2014

If you mean a headword-only list, I can make one for any dictionary; see https://github.com/gasyoun/SanskritLexicography/tree/master/HeadwordLists for samples.

@drdhaval2785
Contributor Author

Headword only list should suffice for cleaning up headwords first. The step to cleaning the entries is a bit more tedious.
Right now - yes - only Headword lists of all dictionaries from Cologne site.

@funderburkjim
Contributor

An slp1 form of the headwords (for all dictionaries EXCEPT MW) is in file Xhw2.txt, as part of the Xxml download for the dictionary. For instance, for PW, go to download page for PW, and download pwxml.zip. In that download are several files, including:

pwhw2.txt the headwords in slp1 (a colon-separated file - headword in middle, as I recall)

pw.xml This has the Sanskrit Devanagari words in <s>X</s> elements, with X being in slp1.

Same for other dictionaries.

Is this enough to work from?
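
For pulling the headwords out of an Xhw2.txt file, a minimal sketch follows. It assumes each line has exactly three colon-separated fields with the headword as the second one. That matches the "headword in middle" recollection above but has not been checked against the actual files, and the sample lines are invented for illustration.

```python
def read_headwords(lines):
    """Yield the middle field of each colon-separated line.

    Assumed layout (not verified against the real files):
        <field>:<headword>:<field>
    """
    for line in lines:
        fields = line.strip().split(":")
        if len(fields) != 3:
            continue  # skip blank lines and lines that do not fit the assumed layout
        yield fields[1]


# Invented sample lines; a real run would iterate over
# open("pwhw2.txt", encoding="utf-8") instead.
sample = ["1-0001:aMSa:12,15", "", "1-0001:aMSaka:16,18"]
headwords = list(read_headwords(sample))
print(headwords)
```

If the real files carry more fields, or put the headword elsewhere, only the length check and the index need to change.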

@drdhaval2785
Contributor Author

@funderburkjim - I would not waste time in doing what @gasyoun is good at.
@gasyoun - Please provide the list of slp1 of all possible dictionaries at https://github.com/gasyoun/SanskritLexicography/tree/master/HeadwordLists

@gasyoun
Member

gasyoun commented Oct 18, 2014

@drdhaval2785 "list of slp1" - sure, let me master the batch part with my VBEE scripts, just in two weeks I'll be there. Doing manually one by one is no fun part.

@drdhaval2785
Contributor Author

@gasyoun
Now your PhD is over - I guess these two weeks time is over.

@drdhaval2785
Contributor Author

@gasyoun
Is this on todo list ?
Otherwise we close the issue

@gasyoun
Member

gasyoun commented Mar 21, 2015

@drdhaval2785 I reread all the comments and still can't get it - do you want the full text of the dictionaries in SLP1 instead of just headwords? My converters can do harm, because the tags are different and could get lost.

@funderburkjim
Contributor

Ever since this issue was raised in Oct 2014, I have made it an objective to convert the base form of each digitization
from HK to SLP1. By the base form of a digitization for dictionary X I mean X.txt. (e.g., pwg.txt, md.txt, etc.)
This task has been done for 21 of 36 of the dictionaries (see below)

Let me explain a little more, using PWG as an example.

  • The original digitization of PWG exists as a text file named pwg_orig.txt. This is the file as obtained from
    Thomas. It has at least two features which make it hard to work with:
    • It uses an old encoding for 8-bit ascii characters, called CP1252 (code-page 1252).
    • Devanagari is coded using the Harvard-Kyoto (HK) transliteration.
  • pwg_orig_utf8.txt converts the cp1252 encoding of extended ascii to the current standard utf8 encoding.
    This is a comparatively straightforward conversion, and there is an inverse conversion in order to validate
    that no information loss occurs.
  • pwg_orig_utf8_slp1.txt Here the coding of Devanagari is converted from HK to SLP1. This is actually
    what I call the 'base form' of the digitization. All the corrections we make are 'installed' starting with
    this version.
  • pwg.txt is the current corrected form of the dictionary. It is also in the utf8 encoding, and Devanagari
    is coded as SLP1.

The construction of the slp1 base version (pwg_orig_utf8_slp1.txt) is surprisingly tricky. The reason is
that there are various minor oddities in the HK coding. One especially tricky part is the use of the period
punctuation mark in coding of text which is Devanagari. The period in 'standard' HK and SLP1 is used to
represent the daRqa. However, this period also commonly occurs in the bilingual dictionaries as English
(or German, etc.) punctuation. In some digitizations, Thomas has used a vertical bar in Devanagari to
represent the daRqa, and a period to represent non-Sanskrit punctuation. But usually there are
inconsistencies in the use of the period in text marked as Devanagari, and this question has to be
addressed. It is a challenge and makes the construction of the slp1 form non-trivial, tedious and non-enjoyable. That's my excuse for why some
of the dictionaries with HK-coded Devanagari have not been converted to SLP1 yet.
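
To make the HK to SLP1 step concrete, here is a rough sketch of transcoding only the text inside {#...#} Devanagari delimiters. It is not the conversion script actually used at Cologne. The mapping table is the standard HK/SLP1 correspondence plus one non-standard digraph (n~ for the palatal nasal) as an example of the kind of oddity mentioned above, and the function names are made up for illustration. It does nothing about the period problem.

```python
import re

# HK sequences whose SLP1 spelling differs; anything not listed is identical.
HK_TO_SLP1 = {
    "lRR": "X", "lR": "x", "RR": "F", "R": "f",   # vocalic r and l
    "ai": "E", "au": "O",                          # diphthongs
    "kh": "K", "gh": "G", "ch": "C", "jh": "J",    # aspirates
    "Th": "W", "Dh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "G": "N", "J": "Y", "T": "w", "D": "q", "N": "R",  # nasals and retroflexes
    "z": "S", "S": "z",                            # sibilants swap
    "n~": "Y",                                     # non-standard HK palatal nasal
}

# Longest keys first, so that e.g. "Th" is matched before "T".
_TOKEN = re.compile(
    "|".join(re.escape(k) for k in sorted(HK_TO_SLP1, key=len, reverse=True))
)
_DEVA = re.compile(r"\{#(.*?)#\}")


def hk_to_slp1(text):
    """Transcode one HK string to SLP1 in a single left-to-right pass."""
    return _TOKEN.sub(lambda m: HK_TO_SLP1[m.group(0)], text)


def transcode_line(line):
    """Transcode only the {#...#} spans of a line; other text is untouched."""
    return _DEVA.sub(lambda m: "{#" + hk_to_slp1(m.group(1)) + "#}", line)


print(transcode_line("Hund {#zvan#} m."))
```

A single pass matters here: HK G becomes SLP1 N, and HK N becomes SLP1 R, so applying replacements one after another would convert some letters twice.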

Here's a list where the Devanagari IS converted to SLP1 in the base form and there is an
x_orig_utf8_slp1.txt form of the dictionary:

ACC,AP90,AP,BEN,BOR,BUR,CAE,CCS,GST,MCI,MD,MW72,PWG,PW,SCH,SHS,SKD,WIL,YAT

Here's the list where there is no x_orig_utf8_slp1.txt form:

MW, VCP have SLP1 coding. See below.
BHS, GRA, SNP, STC, VEI have no Devanagari. All Sanskrit is in AS (Anglicized Sanskrit).

These 4 dictionaries have a small amount of HK-coded Devanagari, and are a secondary TODO list:
IEG, INM (Devanagari only in preface), PE (26 instances), PUI (3 instances)

These dictionaries have substantial HK-coded Devanagari. They form the main TODO list:
AE, BOP, KRM, MWE, PD, PGN
  • MW Devanagari is already in SLP1 form. There are files mw_orig.txt and mw_orig_utf8.txt.
    mw_orig.txt is the form of MW1899 that Thomas provided way back in 2006 when Peter and I
    first became involved in the Cologne Sanskrit-Lexicon project that Thomas began in the 1990s.
    The current reference form for MW is mw.xml.
  • VCP The base form has a different file name: vcp0.txt, which is utf8 and has text in SLP1.

Incidentally, the x.xml files for all these dictionaries have Devanagari coded as SLP1.

@gasyoun
Member

gasyoun commented Mar 24, 2015

Let me repeat. I'm afraid to ask questions to Jim. Because when he starts to answer it's a new entry in a to-be published Encyclopædia. If not chapter. I hope you understand now Dhaval why I can't do all the tricks and could only add more mess. Jim is scientific from alpha to omega.

  1. 21 of 36 since Oct 2014 means we might hope for full SLP1isation by early 2015. Half done in six months, as a background task. This is crucial at least in the part that is connected with the headwords, although none of the deeper issues comes out at this level.
  2. call CP1252 -> called CP1252
    danqa -> danda
  3. inverse conversion in order to validate that no information loss occurs - what script does it?
    Did not get pwg.txt how it's different from pwg_orig_utf8_slp1.txt. I was thinking it is based on the .xml file.
  4. "Thomas has used a vertical bar in Devanagari to represent the danqa, and a period to represent non-Sanskrit punctuation" - should we keep this practice in the future? What would lessen our pain?
  5. Hope that the list of where there is no x_orig_utf8_slp1.txt will pass in the order described, so get MW, VCP, AE and PD in a year or so.
  6. base form has a different file name: vcp0.txt should we unify, before it's too late?

@funderburkjim
Contributor

re: what script does inverse conversion? The script cp1252_to_utf8.py converts from cp1252 to utf8.
The script utf8_to_cp1252.py does the inverse conversion, from utf8 to cp1252.

These scripts are part of the xml downloads for each dictionary.
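
The round-trip check can be sketched in a few lines. This is not the content of the two scripts named above, only an illustration of the idea using Python's built-in codecs; the function names and the sample bytes are made up.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode cp1252 bytes as utf8 bytes."""
    return raw.decode("cp1252").encode("utf-8")


def utf8_to_cp1252(raw: bytes) -> bytes:
    """Inverse conversion: re-encode utf8 bytes as cp1252 bytes."""
    return raw.decode("utf-8").encode("cp1252")


def round_trip_ok(original: bytes) -> bool:
    """True if converting to utf8 and back reproduces the input byte for byte."""
    return utf8_to_cp1252(cp1252_to_utf8(original)) == original


# Invented sample: 0xF6 is o-umlaut and 0x96 is an en dash in cp1252.
sample = b"W\xf6rterbuch \x96 S. 12"
print(round_trip_ok(sample))
```

Note that Python's cp1252 codec leaves a handful of byte values undefined (for example 0x81), so decoding a file that contains them raises UnicodeDecodeError rather than losing data silently.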

@funderburkjim
Contributor

re 'Did not get pwg.txt how it's different from pwg_orig_utf8_slp1.txt.'

pwg.txt contains corrections; when you or anyone submits a correction for pwg, this correction gets
installed into pwg.txt; however, pwg_orig_utf8_slp1.txt does not get these corrections.

You could think of pwg.txt as the 'latest version' of pwg_orig_utf8_slp1.txt.

pwg.xml is created from pwg.txt (by script make_xml.py).

@funderburkjim
Contributor

Re Thomas' use (within coding of Devanagari) of vertical bar for danda, and period for English punctuation.

This convention is true in many dictionaries, but not all, as I recall.

We shouldn't keep this in coding of Devanagari. Since we have decided to use SLP1 as the coding system
for Devanagari, we should follow the SLP1 conventions. In SLP1, the period represents danda.

But this then leaves open the question of how to represent, in SLP1, a 'true' period. The answer I've used
is to take the true periods out of SLP1 - the true period is not Devanagari, so should not be included as
part of a section of text identified as coding Devanagari.

For instance, suppose we see in a dictionary the English sentence: The word for dog is श्वन्.
Thomas would typically code this as The word for dog is {#zvan.#} (note period inside {##}).
If, in conversion to SLP1 this was coded as The word for dog is {#Svan.#} and if this were then
converted back to Devanagari, we would see The word for dog is श्वन्।, which disagrees with the
original sentence because the period of the original sentence has been treated as a danda.
The solution is to have the SLP1 conversion of the sentence be The word for dog is {#Svan#}. (i.e., to move the period outside of the scope of the {##} Devanagari delimiters).

This is the approach taken in converting from Thomas' HK coding (e.g. pwg_orig_utf8.txt) to an SLP1
coding (pwg_orig_utf8_slp1.txt). This task is accomplished by a script called 'transcode.py' which is
in the convertwork directory of the xml download for pwg.
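
The "move the period outside the delimiters" step might look roughly like the sketch below. It is not taken from transcode.py. It makes the strong assumption that a period immediately before the closing #} is always English punctuation, which will be wrong wherever that period really is a danda, so in practice such cases would need review. The function name is made up.

```python
import re

# A period directly before the closing "#}" of a Devanagari span.
# [^#]* keeps the match inside a single {#...#} span.
_FINAL_PERIOD = re.compile(r"\{#([^#]*)\.#\}")


def move_final_period_out(line):
    """Rewrite {#text.#} as {#text#}. wherever a span ends in a period.

    Assumption: a span-final period is never a real danda.
    """
    return _FINAL_PERIOD.sub(r"{#\1#}.", line)


print(move_final_period_out("The word for dog is {#Svan.#}"))
```

Periods in the middle of a span are left alone, since those are the ones most likely to be genuine dandas.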

@funderburkjim
Contributor

Re: vcp0 - Yes, I probably should change this file name for the sake of uniformity.

Re MW: MW(1899) is the odd man out. The base form is mw.xml. There is not likely to be a
mw_orig_utf8_slp1.txt. Devanagari is coded as SLP1 in mw.xml.

@gasyoun
Member

gasyoun commented Mar 24, 2015

"true periods out of SLP1 - the true period is not Devanagari, so should not be included as part of a section of text identified as coding Devanagari" - oh, so this is where the fun starts. But I understand the concerns and agree. Some RegEx magic in your python scripts will bring Thomas' idea to a standard that will be usable in both directions.

@funderburkjim
Contributor

  1. Changed name of vcp0.txt to vcp_orig_utf8_slp1.txt, so the name of this base form is consistent with others.
  2. Constructed SLP1 base form for mwe (mwe_orig_utf8_slp1.txt).

@funderburkjim
Contributor

Constructed SLP1 base form for 'ae'

@gasyoun
Member

gasyoun commented Mar 27, 2015

Is AE different because of the non-pratipadika forms, or what?

@funderburkjim
Contributor

@gasyoun The conversion of Devanagari coding in AE from HK to SLP1 only pertains to entries, since the
headwords are English. For all of the dictionaries, the conversion to SLP1 applies not just to the headwords,
but to all of the Devanagari coded originally as HK. So the non-pratipadika forms (as in AP) are not an issue.
The reason it is complicated usually has to do with rather 'trivial' issues, like use of non-standard HK (such as
n~ instead of the usual HK J for palatal nasal), and the much trickier issue of 'English' periods in Devanagari.

@funderburkjim
Contributor

The Devanagari in the base form of PD has now been converted to SLP1. Only three more dictionaries still need significant conversion to SLP1: BOP, KRM, PGN. I'll aim to do those soon.

@gasyoun
Member

gasyoun commented Mar 29, 2015

"n~ instead of the usual HK J" and "'English' periods in Devanagari" are not trivial at all, because you never know ahead what's before you. So actually what you do is not just conversion, it's cleanup and better markup.

@funderburkjim
Contributor

BOP now converted to SLP1. Similar issues with n~ and danda/period resolved.

@gasyoun
Member

gasyoun commented Mar 31, 2015

KRM, PGN left, hurray!

@funderburkjim
Contributor

KRM now converted to SLP1. Similar issues with n~ and danda/period resolved.

Considerable work would be required to improve the markup of KRM, so that its displays may more closely
correspond to the printed page. Here are some issues. (The headwords are roots in DAtupAWa form,
so for instance 'gamx'):

  • In the scan, the footnotes are mentioned as a superscript in the body of an entry and the
    text of the footnotes appear at the bottom of a page. In the digitization, the footnote text appears
    within the body of an entry at its place of mention. This is one factor that obscures a comparison
    between the display and the scans.
  • In the scans, the body of the entry often has a tabular form. But the current markup does not
    permit a reconstruction of this tabular form in a display.

Such a task requires input of a Sanskrit Scholar, who understands the nature of the information in
this text.

@gasyoun
Member

gasyoun commented Apr 1, 2015

Not a Sanskrit scholar, but someone who understands layout coding. It'll have to be delayed for better
times, which will take years to reach us, I guess. KRM markup is of 25th priority, I would propose.

@funderburkjim
Contributor

PGN now converted to SLP1. Similar issues with n~ and danda/period resolved.

In PGN, Devanagari text only occurs in material that is not, currently, part of the pgn.xml (and thus not part
of the displays). This material is (probably) present in the Chapter Footnotes of PGN.

It is something of a 'force' to represent the digitization of PGN as a dictionary like the 'real' dictionaries
MW, PWG, etc. This observation likely applies to several of the other so-called 'specialized' dictionaries
of the Cologne Sanskrit-Lexicon.

This completes the primary SLP1-ization of the dictionaries (the primary TODO list mentioned in the comment of March 21).

There only remains the secondary 'TODO' list in this SLP1-fest.

@gasyoun
Member

gasyoun commented Apr 2, 2015

Is there a Chapter Footnotes file of PGN, or is it the non-OCRed part? So IEG, INM, PE, PUI are left. Is there something I can help with?

@drdhaval2785
Contributor Author

drdhaval2785 commented Apr 2, 2015 via email

@funderburkjim
Contributor

Conversion to SLP1 completed for the secondary TODO list: IEG, INM, PE, PUI.
This completes the conversion to SLP1 for all 36 dictionaries.
To recap, dictionaries STC, GRA, SNP, BHS, VEI have no 'x_orig_utf8_slp1.txt' form since they have no
Devanagari. There is also no mw_orig_utf8_slp1.txt, since the base form for MW (1899) is mw.xml.
For each of the other 30 dictionaries, there is an x_orig_utf8_slp1.txt base form.

@funderburkjim
Contributor

Regarding "Once SLP1 for all dicts are available we would have more candidates for comparison of faultfinder":

Actually, the headwords for all the dictionaries have ALWAYS been in SLP1 form (except for the three
English-Sanskrit dictionaries, of course). Recall that, if X is one of these dictionaries, then Xhw2.txt
consists of the headwords in SLP1. This was true even before this conversion work. The conversion
work dealt with the Devanagari text in X.txt, as Devanagari in X.txt was, before SLP1 conversion, still represented in the HK form that Thomas' original digitizations provided.

Admittedly this was confusing. At least this one confusion is now removed in the digitizations.

At any rate, I definitely agree with the sentiment that we should finish the headword checking process via
faultfinder for those dictionaries whose headword-differences generated by faultfinder have not yet been
examined. These dictionaries are listed in the 'faultfinder TODO(1)' section of issue 90. The dictionary in this list with the largest set of faultfinder candidates is PD. Finishing this task that Dhaval began will be an important milestone in our correction process.

@drdhaval2785
Contributor Author

drdhaval2785 commented Apr 3, 2015 via email

@gasyoun
Member

gasyoun commented Apr 3, 2015

You'll have a chance to get back, Dhaval. There are still some tiny issues left.

@drdhaval2785
Contributor Author

@funderburkjim
It seems that the sanhw1.txt file has not been updated in last three months.
I see a lot of changes pouring in and changes installed.
Time to give a new sanhw1.txt file to the world.

@funderburkjim
Contributor

sanhw1.txt revised, as mentioned elsewhere. Currently, it is awkward to revise sanhw1.txt on Github (run a script at Cologne, download to local Github CORRECTIONS repository, sync to Github.)

That's my excuse for irregular revisions.

@gasyoun
Member

gasyoun commented Nov 5, 2015

@drdhaval2785 there was an update a week ago. Not sure what you meant.

@drdhaval2785
Contributor Author

Great. Now we have sanhw1.txt and sanhw2.txt mostly updated.
Let's close the issue.
