Request of SLP1 text of all dictionaries #7
Comments
If you mean a headword-only list, I can make any; see the samples at https://github.com/gasyoun/SanskritLexicography/tree/master/HeadwordLists. |
A headword-only list should suffice for cleaning up the headwords first. The step of cleaning the entries is a bit more tedious. |
An slp1 form of the headwords (for all dictionaries EXCEPT MW) is in file Xhw2.txt, as part of the Xxml download for the dictionary. For instance, for PW, go to the download page for PW and download pwxml.zip. That download contains several files, including:
- pwhw2.txt: the headwords in slp1 (a colon-separated file, with the headword in the middle, as I recall)
- pw.xml: this has the Sanskrit Devanagari words in <s>X</s> elements, with X being in slp1.
Same for other dictionaries. Is this enough to work from? |
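As a rough illustration of the `<s>X</s>` convention described above, here is a minimal Python sketch that pulls the SLP1 strings out of one line of a dictionary XML file. The function name and the sample line are made up for illustration; real pw.xml records carry more markup, and this is not one of the Cologne scripts.

```python
import re

# Matches the contents of <s>...</s> elements, which hold SLP1-coded Sanskrit.
S_ELEMENT = re.compile(r"<s>(.*?)</s>")

def slp1_strings(xml_line):
    """Return the SLP1 strings found in one line of a dictionary XML file."""
    return S_ELEMENT.findall(xml_line)

# Hypothetical sample line, not copied from pw.xml.
sample = "<s>agni</s> m. Feuer; <s>agnim</s> acc."
print(slp1_strings(sample))
```

Working only inside the `<s>` elements leaves the surrounding tags and non-Sanskrit text untouched, which matters if the strings are later passed through a transliteration converter.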
@funderburkjim - I would not waste time in doing what @gasyoun is good at. |
@drdhaval2785 "list of slp1" - sure, let me master the batch part with my VBEE scripts; I'll be there in just two weeks. Doing it manually one by one is no fun. |
@drdhaval2785 I reread all the comments and still don't get it: do you want the full text of the dictionaries in SLP1 instead of just the headwords? My converters can do harm, because the tags are different and could get lost. |
Ever since this issue was raised in Oct 2014, I have made it an objective to convert the base form of each digitization Let me explain a little more, using PWG as an example.
The construction of the slp1 base version (pwg_orig_utf8_slp1.txt) is surprisingly tricky. The reason is Here's a list where the Devanagari IS converted to SLP1 in the base form and there is an
Here's the list where there is no x_orig_utf8_slp1.txt form:
Incidentally, the x.xml files for all these dictionaries have Devanagari coded as SLP1. |
Let me repeat. I'm afraid to ask questions to Jim. Because when he starts to answer it's a new entry in a to-be published Encyclopædia. If not chapter. I hope you understand now Dhaval why I can't do all the tricks and could only add more mess. Jim is scientific from alpha to omega.
re: what script does inverse conversion? The script cp1252_to_utf8.py converts from cp1252 to utf8. These scripts are part of the xml downloads for each dictionary. |
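For readers unfamiliar with this kind of re-encoding, the sketch below shows the general idea in Python. It is not the content of cp1252_to_utf8.py; the function names are hypothetical and the inverse direction is included only as an assumption about what an inverse conversion would look like.

```python
def cp1252_to_utf8_bytes(data: bytes) -> bytes:
    """Re-encode cp1252 bytes as UTF-8 (illustrative, not the Cologne script)."""
    return data.decode("cp1252").encode("utf-8")

def utf8_to_cp1252_bytes(data: bytes) -> bytes:
    """Assumed inverse: re-encode UTF-8 bytes as cp1252.

    Raises UnicodeEncodeError if the text contains a character
    that cp1252 cannot represent.
    """
    return data.decode("utf-8").encode("cp1252")

# 0xE9 is 'é' in cp1252; in UTF-8 the same character takes two bytes.
original = b"caf\xe9"
converted = cp1252_to_utf8_bytes(original)
print(converted)
print(utf8_to_cp1252_bytes(converted) == original)
```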
re 'Did not get pwg.txt how it's different from pwg_orig_utf8_slp1.txt.' pwg.txt contains corrections; when you or anyone submits a correction for pwg, this correction gets You could think of pwg.txt as the 'latest version' of pwg_orig_utf8_slp1.txt. pwg.xml is created from pwg.txt (by script make_xml.py). |
Re Thomas' use (within coding of Devanagari) of vertical bar for danda, and period for English punctuation. This convention is true in many dictionaries, but not all, as I recall. We shouldn't keep this in coding of Devanagari. Since we have decided to use SLP1 as the coding system But this then leaves open the question of how to represent, in SLP1, a 'true' period? The answer I've used For instance, suppose we see in a dictionary the English sentence: This is the approach taken in converting from Thomas' HK coding (e.g. pwg_orig_utf8.txt) to an SLP1 |
Re: vcp0 - Yes, I probably should change this file name for the sake of uniformity. Re MW: MW(1899) is the odd man out. The base form is mw.xml. There is not likely to be a |
Constructed SLP1 base form for 'ae' |
AE different because of the non-pratipadika forms or what? |
@gasyoun The conversion of Devanagari coding in AE from HK to SLP1 only pertains to entries, since the |
The Devanagari in the base form of PD has now been converted to SLP1. Only three more dictionaries have significant conversions to SLP1 : BOP,KRM,PGN. I'll aim to do those soon. |
BOP now converted to SLP1. Similar issues with n~ and danda/period resolved. |
KRM, PGN left, hurray! |
KRM now converted to SLP1. Similar issues with n~ and danda/period resolved. Considerable work would be required to improve the markup of KRM, so that its displays may more closely
Such a task requires input of a Sanskrit Scholar, who understands the nature of the information in |
Not a Sanskrit scholar, but someone who understands layout coding. It'll have to be delayed for better |
PGN now converted to SLP1. Similar issues with n~ and danda/period resolved. In PGN, Devanagari text only occurs in material that is not, currently, part of the pgn.xml (and thus not part It is something of a 'force' to represent the digitization of PGN as a dictionary like the 'real' dictionaries This completes the primary SLP1-ization of the dictionaries (the primary TODO list mentioned in the comment of March 21). There only remains the secondary 'TODO' list in this SLP1-fest. |
Is there a Chapter Footnotes file for PGN, or is it the non-OCRed part? So IEG, INM, PE, PUI are left. Is there something I can help with? |
I would term this job super quick, Jim. Pity that I couldn't be actively
involved. Actually, I set the house on fire and then ran away. The way Jim
responds is satisfying indeed.
Once SLP1 for all dicts is available, we will have more candidates for
faultfinder comparison.
Conversion to SLP1 completed for the secondary TODO list: IEG, INM, PE, PUI. |
regarding Actually, the headwords for all the dictionaries have ALWAYS been in SLP1 form (except for the three Admittedly this was confusing. At least this one confusion is now removed in the digitizations. At any rate, I definitely agree with the sentiment that we should finish the headword checking process via |
Thanks for clarifying the matter. I was under the wrong impression.
You'll have a chance to get back, Dhaval. There are still some tiny issues left. |
@funderburkjim |
sanhw1.txt revised, as mentioned elsewhere. Currently, it is awkward to revise sanhw1.txt on Github (run a script at Cologne, download to local Github CORRECTIONS repository, sync to Github.) That's my excuse for irregular revisions. |
@drdhaval2785 there was an update a week ago. Not sure what you meant. |
Great. Now we have sanhw1.txt and sanhw2.txt mostly updated. |
Jim,
If we can have the SLP1 text of all dictionaries in their respective repositories, we would be in a better position to play around with them and pick out the errors.
e.g. I could access MW and PWG in SLP1.
This way we could point out 91 possible errors in issue #2.
If the lists for the other dicts are similarly provided, we may find still more errors by comparing patterns.
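One simple way such a comparison could work, sketched in Python under loud assumptions: the headword lists are already loaded as SLP1 strings per dictionary, and a headword attested in only one dictionary is treated as a candidate for review (not a confirmed error). The function names and sample data are made up; this is not the method used for issue #2.

```python
from collections import defaultdict

def headwords_by_dictionary(lists):
    """Map each headword to the set of dictionary codes that contain it."""
    seen = defaultdict(set)
    for dict_code, headwords in lists.items():
        for hw in headwords:
            seen[hw].add(dict_code)
    return seen

def singletons(lists):
    """Return (headword, dictionary) pairs for headwords found in exactly one dictionary."""
    seen = headwords_by_dictionary(lists)
    return sorted((hw, next(iter(dicts))) for hw, dicts in seen.items() if len(dicts) == 1)

# Made-up sample data in SLP1; "devva" stands in for a typo.
sample = {
    "MW": ["agni", "aSva", "deva"],
    "PWG": ["agni", "aSva", "devva"],
    "PW": ["agni", "deva"],
}
print(singletons(sample))
```

Rare but genuine words will also show up as singletons, so the output is only a list to check by hand.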