Request of SLP1 text of all dictionaries #7

drdhaval2785 · 2014-10-08T09:21:02Z

Jim,
If we can have the SLP1 text of all dictionaries on their respective repositories, we would be in a better position to play around with them and pick out the errors.
E.g. I could access MW and PWG in SLP1.
This way we could point out 91 possible errors in issue #2.
If lists for the other dicts are provided similarly, we may get still more errors by comparing patterns.

gasyoun · 2014-10-08T09:30:43Z

If you mean a headword-only list, I can make any; see the samples at https://github.com/gasyoun/SanskritLexicography/tree/master/HeadwordLists.

drdhaval2785 · 2014-10-08T09:41:49Z

Headword only list should suffice for cleaning up headwords first. The step to cleaning the entries is a bit more tedious.
Right now - yes - only Headword lists of all dictionaries from Cologne site.

funderburkjim · 2014-10-09T21:28:26Z

An slp1 form of the headwords (for all dictionaries EXCEPT MW) is in file Xhw2.txt, as part of the Xxml download for the dictionary. For instance, for PW, go to download page for PW, and download pwxml.zip. In that download are several files, including:

pwhw2.txt the headwords in slp1 (a colon-separated file - headword in middle, as I recall)

pw.xml This has the Sanskrit Devanagari words in <s>X</s> elements, with X being in slp1.

Same for other dictionaries.
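
A minimal sketch of pulling the slp1 strings out of the `<s>X</s>` elements mentioned above. It assumes the elements are not nested and do not span lines; the sample string and the function name `slp1_spans` are made up for illustration and are not taken from pw.xml or from the download's scripts.

```python
import re
from collections import Counter

# Matches one <s>...</s> element; assumes no nesting and no line breaks inside.
S_ELEMENT = re.compile(r"<s>(.*?)</s>")

def slp1_spans(xml_text):
    """Return the text of every <s>...</s> element, in document order."""
    return S_ELEMENT.findall(xml_text)

# Hypothetical sample line, not real dictionary data.
sample = "<body><s>gam</s> gehen; vgl. <s>gacCati</s> und <s>gam</s></body>"
spans = slp1_spans(sample)
counts = Counter(spans)  # frequency of each slp1 string
print(spans)
print(counts.most_common(1))
```

A frequency table of this kind is one way to start comparing patterns across dictionaries; a real XML parser would be the safer choice if the markup turns out to be more complex than assumed here.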

Is this enough to work from?

drdhaval2785 · 2014-10-18T16:16:50Z

@funderburkjim - I would not waste time in doing what @gasyoun is good at.
@gasyoun - Please provide the list of slp1 of all possible dictionaries at https://github.com/gasyoun/SanskritLexicography/tree/master/HeadwordLists

gasyoun · 2014-10-18T16:26:59Z

@drdhaval2785 "list of slp1" - sure, let me master the batch part with my VBEE scripts, just in two weeks I'll be there. Doing manually one by one is no fun part.

drdhaval2785 · 2014-11-17T04:04:37Z

@gasyoun
Now your PhD is over - I guess these two weeks time is over.

drdhaval2785 · 2015-03-21T11:35:25Z

@gasyoun
Is this on todo list ?
Otherwise we close the issue

gasyoun · 2015-03-21T18:47:58Z

@drdhaval2785 reread all the comments and still can't get it - you want the full text of the dictionaries in SLP1 instead of just headwords? My converters can do harm, because the tags are different and could get lost.

funderburkjim · 2015-03-21T23:24:28Z

Ever since this issue was raised in Oct 2014, I have made it an objective to convert the base form of each digitization
from HK to SLP1. By the base form of a digitization for dictionary X I mean X.txt. (e.g., pwg.txt, md.txt, etc.)
This task has been done for 21 of 36 of the dictionaries (see below).

Let me explain a little more, using PWG as an example.

The original digitization of PWG exists as a text file named pwg_orig.txt. This is the file as obtained from
Thomas. It has at least two features which make it hard to work with:
- It uses an old encoding for 8-bit ascii characters, called CP1252 (code-page 1252).
- Devanagari is coded using the Harvard-Kyoto (HK) transliteration.

The files derived from it are:
- pwg_orig_utf8.txt converts the cp1252 encoding of extended ascii to the current standard utf8 encoding.
  This is a comparatively straightforward conversion, and there is an inverse conversion in order to validate
  that no information loss occurs.
- pwg_orig_utf8_slp1.txt Here the coding of Devanagari is converted from HK to SLP1. This is actually
  what I call the 'base form' of the digitization. All the corrections we make are 'installed' starting with
  this version.
- pwg.txt is the current corrected form of the dictionary. It is also in the utf8 encoding, and Devanagari
  is coded as SLP1.

The construction of the slp1 base version (pwg_orig_utf8_slp1.txt) is surprisingly tricky. The reason is
that there are various minor oddities in the HK coding. One especially tricky part is the use of the period
punctuation mark in coding of text which is Devanagari. The period in 'standard' HK and SLP1 is used to
represent the daRqa. However, this period also commonly occurs in the bilingual dictionaries as English
(or German, etc.) punctuation. In some digitizations, Thomas has used a vertical bar in Devanagari to
represent the daRqa, and a period to represent non-Sanskrit punctuation. But usually there are
inconsistencies in the use of the period in text marked as Devanagari, and this question has to be
addressed. It is a challenge and makes the construction of the slp1 form non-trivial, tedious and non-enjoyable. That's my excuse for why some
of the dictionaries with HK-coded Devanagari have not been converted to SLP1 yet.
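
To make the HK-to-SLP1 step concrete, here is a minimal sketch of the letter mapping alone, using greedy longest-match over the standard HK and SLP1 tables. It is not the actual conversion script; the name `hk_to_slp1` is hypothetical, and the sketch assumes its input is a string already known to be HK-coded Devanagari. It leaves the period untouched, so the daRqa/punctuation question described above is not addressed here.

```python
# HK -> SLP1 pairs that differ; letters not listed are identical in both schemes.
HK_TO_SLP1 = {
    "lRR": "X", "RR": "F", "lR": "x", "R": "f",      # vocalic r and l
    "ai": "E", "au": "O",                            # diphthongs
    "kh": "K", "gh": "G", "G": "N",                  # velars
    "ch": "C", "jh": "J", "J": "Y",                  # palatals
    "Th": "W", "T": "w", "Dh": "Q", "D": "q", "N": "R",  # retroflexes
    "th": "T", "dh": "D", "ph": "P", "bh": "B",      # dental and labial aspirates
    "z": "S", "S": "z",                              # sibilants
}

def hk_to_slp1(text):
    """Transcode an HK string to SLP1, trying the longest HK unit first."""
    out = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if len(chunk) == size and chunk in HK_TO_SLP1:
                out.append(HK_TO_SLP1[chunk])
                i += size
                break
        else:
            # Character is the same in both schemes (or is not a letter).
            out.append(text[i])
            i += 1
    return "".join(out)

for word in ("zvan", "daNDa", "dhAtupATha"):
    print(word, "->", hk_to_slp1(word))
```

Output is built from the input position by position, so an SLP1 letter that has just been emitted (such as `N` for HK `G`) is never transcoded a second time.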

Here's a list where the Devanagari IS converted to SLP1 in the base form and there is an
x_orig_utf8_slp1.txt form of the dictionary:

ACC,AP90,AP,BEN,BOR,BUR,CAE,CCS,GST,MCI,MD,MW72,PWG,PW,SCH,SHS,SKD,WIL,YAT

Here's the list where there is no x_orig_utf8_slp1.txt form:

MW,VCP have slp1 coding. See below
BHS,GRA,SNP,STC,VEI  have no Devanagari. All Sanskrit is in AS (Anglicized Sanskrit)

These 4 dictionaries have a small amount of HK coded Devanagari, and are a secondary TODO list.
IEG,INM (Devanagari only in preface), PE (26 instances), PUI (3 instances)

These dictionaries have substantial HK coded Devanagari.  They form the main TODO list.
AE,BOP,KRM,MWE,PD,PGN

MW Devanagari is already in SLP1 form. There are files mw_orig.txt and mw_orig_utf8.txt.
mw_orig.txt is the form of MW1899 that Thomas provided way back in 2006 when Peter and I
first became involved in the Cologne Sanskrit-Lexicon project that Thomas began in the 1990s.
The current reference form for MW is mw.xml.
VCP The base form has a different file name: vcp0.txt, which is utf8 and has text in SLP1.

Incidentally, the x.xml files for all these dictionaries have Devanagari coded as SLP1.

gasyoun · 2015-03-24T05:07:21Z

Let me repeat. I'm afraid to ask questions to Jim. Because when he starts to answer it's a new entry in a to-be published Encyclopædia. If not chapter. I hope you understand now Dhaval why I can't do all the tricks and could only add more mess. Jim is scientific from alpha to omega.

21 of 36 since Oct 2014 means we might hope for full SLP1isation by early 2015. Half done in six months, as a background task. This is crucial at least in the part that is connected with the headwords, although none of the deeper issues comes out at this level.
call CP1252 -> called CP1252
danqa -> danda
inverse conversion in order to validate that no information loss occurs - what script does it?
Did not get pwg.txt how it's different from pwg_orig_utf8_slp1.txt. I was thinking it is based on the .xml file.
Thomas has used a vertical bar in Devanagari to represent the danqa, and a period to represent non-Sanskrit punctuation - should we keep this practice in the future? What would lessen our pain?
Hope that the list of where there is no x_orig_utf8_slp1.txt will pass in the order described, so get MW, VCP, AE and PD in a year or so.
base form has a different file name: vcp0.txt should we unify, before it's too late?

funderburkjim · 2015-03-24T19:37:19Z

re: what script does inverse conversion? The script cp1252_to_utf8.py converts from cp1252 to utf8.
The script utf8_to_cp1252.py does the inverse conversion, from utf8 to cp1252.

These scripts are part of the xml downloads for each dictionary.
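
The round-trip check can be sketched as follows. This is not the content of the two scripts named above, only an illustration of the idea using Python's built-in codecs; the function names and the sample text are assumptions made for the example.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode CP1252 bytes as UTF-8."""
    return data.decode("cp1252").encode("utf-8")

def utf8_to_cp1252(data: bytes) -> bytes:
    """Inverse conversion: re-encode UTF-8 bytes as CP1252."""
    return data.decode("utf-8").encode("cp1252")

# Hypothetical sample with characters outside 7-bit ascii.
original = "Wörterbuch ‘gehen’ Fuß".encode("cp1252")
converted = cp1252_to_utf8(original)
restored = utf8_to_cp1252(converted)

# If the round trip reproduces the input byte for byte, nothing was lost.
assert restored == original
print(len(original), len(converted))
```

Python's cp1252 codec raises UnicodeDecodeError on the five byte values that CP1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so a file containing them would need separate handling.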

funderburkjim · 2015-03-24T19:41:18Z

re 'Did not get pwg.txt how it's different from pwg_orig_utf8_slp1.txt.'

pwg.txt contains corrections; when you or anyone submits a correction for pwg, this correction gets
installed into pwg.txt; however, pwg_orig_utf8_slp1.txt does not get these corrections.

You could think of pwg.txt as the 'latest version' of pwg_orig_utf8_slp1.txt.

pwg.xml is created from pwg.txt (by script make_xml.py).

funderburkjim · 2015-03-24T20:08:43Z

Re Thomas' use (within coding of Devanagari) of vertical bar for danda, and period for English punctuation.

This convention is true in many dictionaries, but not all, as I recall.

We shouldn't keep this in coding of Devanagari. Since we have decided to use SLP1 as the coding system
for Devanagari, we should follow the SLP1 conventions. In SLP1, the period represents danda.

But this then leaves open the question of how to represent, in SLP1, a 'true' period? The answer I've used
is to take the true periods out of SLP1 - the true period is not Devanagari, so should not be included as
part of a section of text identified as coding Devanagari.

For instance, suppose we see in a dictionary the English sentence: The word for dog is श्वन्.
Thomas would typically code this as The word for dog is {#zvan.#} (note period inside {##}).
If, in conversion to SLP1 this was coded as The word for dog is {#Svan.#} and if this were then
converted back to Devanagari, we would see The word for dog is श्वन्।, which disagrees with the
original sentence because the period of the original sentence has been treated as a danda.
The solution is to have the SLP1 conversion of the sentence be The word for dog is {#Svan#}. (i.e., to move the period outside of the scope of the {##} Devanagari delimiters).

This is the approach taken in converting from Thomas' HK coding (e.g. pwg_orig_utf8.txt) to an SLP1
coding (pwg_orig_utf8_slp1.txt). This task is accomplished by a script called 'transcode.py' which is
in the convertwork directory of the xml download for pwg.
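
A minimal sketch of the period-moving step, not the code of transcode.py. It assumes that a period immediately before the closing `#}` is always English punctuation and that `#` never occurs inside a span; neither assumption holds for every dictionary, which is why each one needs its own inspection. The function name `move_trailing_period` is hypothetical.

```python
import re

# A {#...#} span whose last character before the closing delimiter is a period.
TRAILING_PERIOD = re.compile(r"\{#([^#]*?)\.#\}")

def move_trailing_period(line):
    """Move a span-final period to just after the closing #} delimiter."""
    return TRAILING_PERIOD.sub(r"{#\1#}.", line)

before = "The word for dog is {#zvan.#}"
after = move_trailing_period(before)
print(after)
```

Periods elsewhere inside a span are left alone, since those may be genuine dandas.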

funderburkjim · 2015-03-24T20:13:31Z

Re: vcp0 - Yes, I probably should change this file name for the sake of uniformity.

Re MW: MW(1899) is the odd man out. The base form is mw.xml. There is not likely to be a
mw_orig_utf8_slp1.txt. Devanagari is coded as SLP1 in mw.xml.

gasyoun · 2015-03-24T20:13:44Z

"true periods out of SLP1 - the true period is not Devanagari, so should not be included as part of a section of text identified as coding Devanagari" - oh, so that's where the fun starts. But I understand the concerns and agree. Some RegEx magic in your Python scripts will bring Thomas' idea to a standard that will be usable in both directions.

funderburkjim · 2015-03-26T21:39:59Z

Changed name of vcp0.txt to vcp_orig_utf8_slp1.txt, so the name of this base form is consistent with others.
Constructed SLP1 base form for mwe (mwe_orig_utf8_slp1.txt).

funderburkjim · 2015-03-27T18:56:39Z

Constructed SLP1 base form for 'ae'.

gasyoun · 2015-03-27T21:14:23Z

AE different because of the non-pratipadika forms or what?

funderburkjim · 2015-03-29T00:34:27Z

@gasyoun The conversion of Devanagari coding in AE from HK to SLP1 only pertains to entries, since the
headwords are English. For all of the dictionaries, the conversion to SLP1 applies not just to the headwords,
but to all of the Devanagari coded originally as HK. So the non-pratipadika forms (as in AP) are not an issue.
The reason it is complicated usually has to do with rather 'trivial' issues, like use of non-standard HK (such as
n~ instead of the usual HK J for palatal nasal), and the much trickier issue of 'English' periods in Devanagari.
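
The n~ issue can be illustrated with a small normalization pass run before transcoding. This is a sketch under the assumption that Devanagari is delimited by `{#...#}` as in the examples earlier in this thread; the function name is hypothetical and the sample line is invented.

```python
import re

# One {#...#} Devanagari span; assumes spans do not nest.
DEVA_SPAN = re.compile(r"\{#(.*?)#\}")

def normalize_palatal_nasal(line):
    """Replace non-standard n~ with standard HK J, only inside {#...#} spans."""
    def fix(match):
        return "{#" + match.group(1).replace("n~", "J") + "#}"
    return DEVA_SPAN.sub(fix, line)

sample = "n~ outside is kept, but {#pan~ca#} is normalized"
normalized = normalize_palatal_nasal(sample)
print(normalized)
```

Restricting the replacement to the delimited spans keeps any n~ in the surrounding non-Devanagari text unchanged.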

funderburkjim · 2015-03-29T00:37:02Z

The Devanagari in the base form of PD has now been converted to SLP1. Only three more dictionaries have significant conversions to SLP1 : BOP,KRM,PGN. I'll aim to do those soon.

gasyoun · 2015-03-29T03:50:42Z

n~ instead of the usual HK J and 'English' periods in Devanagari are not trivial at all. Because you never know ahead what's before you. So actually what you do is not just conversion, it's cleanup and better markup.

funderburkjim · 2015-03-31T01:20:29Z

BOP now converted to SLP1. Similar issues with n~ and danda/period resolved.

gasyoun · 2015-03-31T05:07:58Z

KRM, PGN left, hurray!

funderburkjim · 2015-04-01T20:52:11Z

KRM now converted to SLP1. Similar issues with n~ and danda/period resolved.

Considerable work would be required to improve the markup of KRM, so that its displays may more closely
correspond to the printed page. Here are some issues. (The headwords are roots in DAtupAWa form,
so for instance 'gamx'):

- In the scan, the footnotes are mentioned as a superscript in the body of an entry and the
  text of the footnotes appears at the bottom of a page. In the digitization, the footnote text appears
  within the body of an entry at its place of mention. This is one factor that obscures a comparison
  between the display and the scans.
- In the scans, the body of the entry often has a tabular form. But the current markup does not
  permit a reconstruction of this tabular form in a display.

Such a task requires input of a Sanskrit Scholar, who understands the nature of the information in
this text.

gasyoun · 2015-04-01T21:37:31Z

Not a Sanskrit scholar, but someone who understands layout coding. It'll have to be delayed for better
times, which will take years to reach us, I guess. KRM markup is of 25th priority, I would propose.

funderburkjim · 2015-04-01T23:51:43Z

PGN now converted to SLP1. Similar issues with n~ and danda/period resolved.

In PGN, Devanagari text only occurs in material that is not, currently, part of the pgn.xml (and thus not part
of the displays). This material is (probably) present in the Chapter Footnotes of PGN.

It is something of a 'force' to represent the digitization of PGN as a dictionary like the 'real' dictionaries
MW, PWG, etc. This observation likely applies to several of the other so-called 'specialized' dictionaries
of the Cologne Sanskrit-Lexicon.

This completes the primary SLP1-ization of the dictionaries (the primary TODO list mentioned in the comment of March 21).

There only remains the secondary 'TODO' list in this SLP1-fest.

gasyoun · 2015-04-02T05:09:26Z

Is there a Chapter Footnotes file for PGN, or is it the non-OCRed part? So IEG, INM, PE, PUI are left. Is there something I can help with?

drdhaval2785 · 2015-04-02T05:16:09Z

I would term this job super quick, Jim. Pity that I couldn't be actively involved. Actually I set the house on fire and then ran away. Satisfying indeed, the way Jim responds. Once SLP1 for all dicts is available we would have more candidates for comparison by faultfinder.

funderburkjim · 2015-04-02T21:52:57Z

Conversion to SLP1 completed for the secondary TODO list: IEG, INM, PE, PUI.
This completes the conversion to SLP1 for all 36 dictionaries.
To recap, dictionaries STC,GRA,SNP,BHS,VEI have no 'x_orig_utf8_slp1.txt' form since they have no
Devanagari. There is also no mw_orig_utf8_slp1.txt, since the base form for MW (1899) is mw.xml.
For each of the other 30 dictionaries, there is an x_orig_utf8_slp1.txt base form.

funderburkjim · 2015-04-02T22:05:58Z

Regarding "Once SLP1 for all dicts are available we would have more candidates for comparison of faultfinder":

Actually, the headwords for all the dictionaries have ALWAYS been in SLP1 form (except for the three
English-Sanskrit dictionaries, of course). Recall that, if X is one of these dictionaries, then Xhw2.txt
consists of the headwords in SLP1. This was true even before this conversion work. The conversion
work dealt with the Devanagari text in X.txt, as Devanagari in X.txt was, before SLP1 conversion, still represented in the HK form that Thomas' original digitizations provided.

Admittedly this was confusing. At least this one confusion is now removed in the digitizations.

At any rate, I definitely agree with the sentiment that we should finish the headword checking process via
faultfinder for those dictionaries whose headword-differences generated by faultfinder have not yet been
examined. These dictionaries are listed in the 'faultfinder TODO(1)' section of issue 90. The dictionary in this list with the largest set of faultfinder candidates is PD. Finishing this task that Dhaval began will be an important milestone in our correction process.

drdhaval2785 · 2015-04-03T03:46:18Z

Thanks for clarifying the matter. I was under the wrong impression.

gasyoun · 2015-04-03T15:20:07Z

You'll have a chance to get back, Dhaval. There are still some tiny issues left.

drdhaval2785 · 2015-10-31T04:06:35Z

@funderburkjim
It seems that the sanhw1.txt file has not been updated in the last three months.
I see a lot of changes pouring in and changes installed.
Time to give a new sanhw1.txt file to the world.

funderburkjim · 2015-11-04T22:58:38Z

sanhw1.txt revised, as mentioned elsewhere. Currently, it is awkward to revise sanhw1.txt on Github (run a script at Cologne, download to local Github CORRECTIONS repository, sync to Github.)

That's my excuse for irregular revisions.

gasyoun · 2015-11-05T19:51:07Z

@drdhaval2785 there was an update a week ago. Not sure what you meant.

drdhaval2785 · 2015-11-16T08:46:45Z

Great. Now we have sanhw1.txt and sanhw2.txt mostly updated.
Let's close the issue.

This was referenced Apr 10, 2015

Conversion of PWG to SLP1 sanskrit-lexicon/PWG#11

Closed

Adding of Russian Etymologies Started sanskrit-lexicon/PWG#6

Closed

drdhaval2785 closed this as completed Nov 16, 2015
