pw revisions based on AB version(s), continued #102

Closed
funderburkjim opened this issue Nov 14, 2023 · 26 comments

Comments

@funderburkjim
Contributor

This issue continues the revisions of PW digitization at #88, based upon work done by @Andhrabharati.
We start with AB's temp_pw_ab_17.zip

funderburkjim added a commit that referenced this issue Nov 14, 2023
@funderburkjim
Contributor Author

Working directory for this issue: pwkissues/issue102.

temp_pw_17a.txt -- 4 changes from temp_pw_ab_17.txt above. see changes_17a.txt.

@Andhrabharati changed the title from "unmarked abbreviations, continued" to "pw revisions based on AB version(s), continued" on Nov 14, 2023
@funderburkjim
Contributor Author

@Andhrabharati
Do you have a revision for pwkvn that I should apply to the cdsl version? I should incorporate your pwkvn changes before attempting to merge pwkvn into pwk.

@Andhrabharati

Do you have a revision for pwkvn that I should apply to the cdsl version?

Yes I do, @funderburkjim !
I did some work, esp. to bring the pwkvn to the same format as the main pw.txt (apart from many other points).

BTW, I see that the transcoder now gives some errors on the revised file (it outputs a file only up to, but not including, the first metaline). I wanted to use it for "proofing" the pwkvn file once:

C:\pw-transcode> python pw_transcode.py slp1 deva .\pwkvn.txt .\pwkvn_deva.txt
Traceback (most recent call last):
  File "C:\pw-transcode\pw_transcode.py", line 149, in <module>
    lineout = convert_metaline(line,tranin,tranout)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pw-transcode\pw_transcode.py", line 71, in convert_metaline
    k1a = transcode(k1,tranin,tranout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pw-transcode\pw_transcode.py", line 56, in transcode
    y = transcoder.transcoder_processString(x,tranin,tranout)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pw-transcode\transcoder.py", line 257, in transcoder_processString
    transcoder_fsm(from1,to)
  File "C:\pw-transcode\transcoder.py", line 74, in transcoder_fsm
    tree = ET.parse(filein)
           ^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\xml\etree\ElementTree.py", line 1203, in parse
    tree.parse(source, parser)
  File "C:\Program Files\Python312\Lib\xml\etree\ElementTree.py", line 568, in parse
    self._root = parser._parse_whole(source)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
PS C:\pw-transcode>

Could you please tell me why the problem is occurring?
Same problem occurs with the pw_AB file as well.

@Andhrabharati

Anyway, here is the pwkvn file to "adopt" for cdsl usage--
pwkvn_AB v.1.zip

@Andhrabharati

Andhrabharati commented Nov 15, 2023

Now, coming to my pw_AB v.2 file, I had already mentioned it earlier.

I see that the original typed text also has the volume-page notation, as seen at the text given by Thomas recently, while commenting about the das. abbreviation.
image

The dot between volume & page got missed in the present pw.txt, and probably Jim might not mind bringing it back.

Otherwise, I shall revert the correction in my v.2 file before giving it out (to start the next phase of corrections in cdsl pw.txt).

@funderburkjim
Contributor Author

the transcoder file is giving some errors now on the revised file

I do not find this problem when I run locally for pwkvn.

This may be a python version problem.
My local version is 3.9.1

And your error message shows Python312. (version 3.12).

Can you use version 3.9 of python?

Also note -- While my local conversion of pwkvn gave no error, I DID find an
'invertibility' problem -- as for instance {#SruteratiTI-<lb/>kftA#} -- the
problem is with the <lb/> within {#X#}.
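
A minimal sketch of how such spans could be located before transcoding. The function name `find_lb_in_sanskrit` is hypothetical (it is not part of pw_transcode.py), and it assumes each `{#...#}` span opens and closes on a single line:

```python
import re

# A {#...#} span marks Sanskrit text in SLP1; non-greedy so that
# several spans on one line are matched separately.
SANSKRIT_SPAN = re.compile(r"\{#(.*?)#\}")

def find_lb_in_sanskrit(lines):
    """Return (line_number, span) pairs where <lb/> sits inside {#...#}."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for match in SANSKRIT_SPAN.finditer(line):
            if "<lb/>" in match.group(1):
                hits.append((lineno, match.group(0)))
    return hits
```

Running this over the lines of pwkvn.txt would list every span that needs the line break moved outside the Sanskrit markup or removed.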

@funderburkjim
Contributor Author

The dot between volume & page got missed in the present pw.txt, and probably Jim might not mind bringing it back.

{%niederwerfen…,…niederhauen%} , [Page4.013-3]  (FROM THOMAS COMMENT)

I think that '.' in [Page4.013-3] was made by Thomas for his convenience (he was confused by [Page4013-3] since there is no page 4013 in the pdf.)

The '.' is not part of pw.txt, nor of the display of pw. Nor has it been previously.

So there is nothing to 'bring back'.

@Andhrabharati

Andhrabharati commented Nov 15, 2023

Even I got confused with this at times; so looked around and thought of changing the pc as v-p-c, as in other cdsl works.

Even if not to "bring back", would you mind changing it @funderburkjim ?
[Of course, as I had already mentioned in my above posting, it might need changes at many places, not a single (and easy) task!]

@Andhrabharati

Andhrabharati commented Nov 15, 2023

@funderburkjim

Uninstalled Python 3.12 and installed Python 3.9; but still the same error appears for me--

PS C:\pw-transcode> python pw_transcode.py slp1 deva .\pwkvn.txt .\pwkvn_deva.txt
Traceback (most recent call last):
  File "C:\pw-transcode\pw_transcode.py", line 149, in <module>
    lineout = convert_metaline(line,tranin,tranout)
  File "C:\pw-transcode\pw_transcode.py", line 71, in convert_metaline
    k1a = transcode(k1,tranin,tranout)
  File "C:\pw-transcode\pw_transcode.py", line 56, in transcode
    y = transcoder.transcoder_processString(x,tranin,tranout)
  File "C:\pw-transcode\transcoder.py", line 257, in transcoder_processString
    transcoder_fsm(from1,to)
  File "C:\pw-transcode\transcoder.py", line 74, in transcoder_fsm
    tree = ET.parse(filein)
  File "C:\Program Files\Python39\lib\xml\etree\ElementTree.py", line 1224, in parse
    tree.parse(source, parser)
  File "C:\Program Files\Python39\lib\xml\etree\ElementTree.py", line 580, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
PS C:\pw-transcode>

Would you please give me the converted pwkvn_deva file for now, so that I can start proofing it?

@Andhrabharati

BTW, where did you find {#SruteratiTI-<lb/>kftA#}?
My file has only {#SruteratiTIkftA#}!!

@funderburkjim
Contributor Author

I see that the 'pwkvn' file also uses the 'v-page-col' form for 'pc'.

When I get to the task of integrating pwkvn into pw, that may be the time
to change 'pc' in pw.txt from 'vpage-c' to 'v-page-col'.

@funderburkjim
Contributor Author

{#SruteratiTI-<lb/>kftA#} appears in the current csl-orig pwkvn.txt at line 1094

@Andhrabharati

I see that the 'pwkvn' file also uses the 'v-page-col' form for 'pc'.

When I get to the task of integrating pwkvn into pw, that may be the time to change 'pc' in pw.txt from 'vpage-c' to 'v-page-col'.

Good to hear this!

So, shall I post my v.2 file as is now? [along with the steps involved in converting ab_17 to that form]
Or should I wait till your perusal of my pwkvn work is over?

@funderburkjim
Contributor Author

Version confusion!

In the comments above, we've mentioned both pw and pwkvn.
and we've mentioned both AB.V1. and AB.V2.
You have uploaded a pwkvn_AB_v.1.txt.
You have requested pwkvn_deva from me.
Should this be a conversion of

  • your pwkvn_AB_v.1.txt ? OR
  • current cdsl pwkvn.txt?

You asked shall I post my v.2 file as is now? [along with the steps involved in converting ab_17 to that form].

  • If your revised pw (17) is ready, yes, do post it -- It will be the basis of work in this issue.

@Andhrabharati

My copy pwkvn_AB_v.1.txt

@Andhrabharati

Andhrabharati commented Nov 15, 2023

I had made v.2 (long back, over 2 months ago) from my v.1 file that was posted initially for the abbr. work.

And I have been updating the same with your successive steps from 1 to 16, so far.

I shall post the file tomorrow, as I have just shut down my system and am on my mobile now.

@funderburkjim
Contributor Author

post the file tomorrow -- Sounds good.

@funderburkjim
Contributor Author

why the conversion problem?

The python errors above are occurring at line 74 of transcoder.py
tree = ET.parse(filein), where filein is the name of one of the transcoder files (in the transcoder directory).

the et_example folder contains a published simple example of using ET.parse.

@Andhrabharati If you try this example on your local system, does it work?
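
Since the error is reported at line 1, column 0 of the file passed to ET.parse, another check is to parse each transcoder XML file on its own and look at how the failing file begins. This is only a diagnostic sketch: `check_xml_files` is a hypothetical helper, and the directory name in the usage note is an assumption about the local layout.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def check_xml_files(directory):
    """Map each unparseable .xml file name to the error and its first bytes."""
    problems = {}
    for path in sorted(Path(directory).glob("*.xml")):
        head = path.read_bytes()[:40]  # enough to show what the file starts with
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            problems[path.name] = f"{err}; file starts with {head!r}"
    return problems
```

For example, `print(check_xml_files("transcoder"))` run from C:\pw-transcode would show whether any of the local transcoder files does not begin with XML at all.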

@Andhrabharati

Here is the result--

PS C:\pw-transcode\test> python test1.py
data {}
Liechtenstein 1
Singapore 4
Panama 68
PS C:\pw-transcode\test>

@Andhrabharati

Andhrabharati commented Nov 16, 2023

ab_17 to ab_17a (adjustments)

Step-1: merging the separate [Pagexxx] lines "into" the other lines.

(a) <LEND>\n[ -> <LEND> [ ;; 682113 -> 679088 (-3025)
(b) \[Page(.*?)\]\n<LEND> -> <LEND> \[Page\1\] ;; 679088 -> 679068 (-20)
(c) ([^\n])\n\[Page(.*?)\]\n -> \1 \[Page\2\] ;; 679068 -> 673636 (-5432)
(d) ] <div n= -> ]\n<div n= ;; 673636 -> 674062 (426)
Now we have equal line numbers (674062) in pw_ab_17a and pw (AB v.2), facilitating comparison.
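
Step-1 could be sketched in Python roughly as follows. This is an illustration, not the script actually used: the function name is hypothetical, and it assumes the replacement in (c) ends with a space (so that the following line is joined as well), which is what makes (d) necessary.

```python
import re

def merge_page_lines(text):
    """Fold stand-alone [Page...] lines into neighbouring lines (Step-1 a-d)."""
    # (a) a [Page...] line directly after <LEND> joins the <LEND> line
    text = text.replace("<LEND>\n[", "<LEND> [")
    # (b) a [Page...] line directly before <LEND> moves behind it
    text = re.sub(r"\[Page(.*?)\]\n<LEND>", r"<LEND> [Page\1]", text)
    # (c) any other [Page...] line is joined with the lines before and after
    text = re.sub(r"([^\n])\n\[Page(.*?)\]\n", r"\1 [Page\2] ", text)
    # (d) a <div n=...> that was joined by (c) starts its own line again
    text = text.replace("] <div n=", "]\n<div n=")
    return text
```

The order matters: (a) and (b) handle the page markers next to `<LEND>` first, so that (c) only sees markers inside an entry body.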

Step-2: merging consecutive <ls n="Chr.

</ls>. <ls n="Chr.(.*?)"> -> '. '

Step-3: removing the italic terminations around [Pagexxx]

%} \[Page(.*?)\] {% -> ' [Page\1] '
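
Steps 2 and 3 could be sketched like this. The function names and the sample strings in the usage note are made up for illustration. The dots in the Step-2 pattern are escaped here so that they match only a literal '.', which is an assumption about the intent of the pattern as posted.

```python
import re

def merge_chr_refs(text):
    """Step-2: merge a following <ls n="Chr."> into the preceding <ls>."""
    return re.sub(r'</ls>\. <ls n="Chr\.(.*?)">', ". ", text)

def unbreak_italics(text):
    """Step-3: keep italics open across an inline [Page...] marker."""
    return re.sub(r"%\} \[Page(.*?)\] \{%", r" [Page\1] ", text)
```

With these, `merge_chr_refs('<ls>Chr. 1,2</ls>. <ls n="Chr.">3,4</ls>')` gives `<ls>Chr. 1,2. 3,4</ls>`, and `unbreak_italics("{%foo%} [Page1001-2] {%bar%}")` gives `{%foo [Page1001-2] bar%}`.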

Step-4: Changing the page & column numbers after <pc> and [Page

(a) Insert a '-' after the first (volume) digit.
(b) Change the ending (column) digit -[123] to a letter -[abc] resp.
After these changes, we have the two files differing in about 7000+ lines.
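
A sketch of the Step-4 conversion, assuming every reference has the shape one volume digit, page digits, '-', column digit 1 to 3. The function name is hypothetical, and the `<pc>5116-1` input in the usage note is inferred from the `<pc>5-116-a` form rather than copied from the file.

```python
import re

COLUMN_LETTER = {"1": "a", "2": "b", "3": "c"}

# Matches the reference after <pc> or [Page: volume digit, page digits, column digit.
PAGE_REF = re.compile(r"(<pc>|\[Page)(\d)(\d+)-([123])")

def convert_page_refs(text):
    """Rewrite vppp-c references as v-ppp-letter (Step-4 a and b)."""
    def repl(match):
        prefix, volume, page, column = match.groups()
        return f"{prefix}{volume}-{page}-{COLUMN_LETTER[column]}"
    return PAGE_REF.sub(repl, text)
```

For example, `convert_page_refs("[Page4013-3]")` returns `[Page4-013-c]` and `convert_page_refs("<pc>5116-1")` returns `<pc>5-116-a`.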

temp_pw_ab_17a.zip and pw (AB v2).zip

The majority of changes are--
(a) fetching more 'grouped' HWs in the file
(b) clubbing of ls-entities together
(c) punctuation

@funderburkjim
Contributor Author

@Andhrabharati
Have been able to programmatically reproduce your temp_pw_ab_17a.txt from temp_pw_17a.txt. Work in issue102/step1.
Your description of these changes was of great help!
See change_diff_4.txt for 21 additional corrections you made but did not mention.

I'll switch to pwkvn now (#103) before further investigation of your changes in temp_pw_AB_v2.txt

@Andhrabharati

Andhrabharati commented Nov 17, 2023

@funderburkjim

All the 20 "[Pagexxx]" changes mentioned in your addl. corrections file above were "included" in the Step-4 in my notes.

And then there is one mistake (!?) in my file, which you have noted at <L>89441<pc>5-116-a<k1>yajus<k2>ya/jus<e>100

<L>89441<pc>5-116-a<k1>yajus<k2>ya/jus<e>100
440422 old <div n="2">— d〉 <ab>Bez.</ab> {%eines <ab>best.</ab> Spruches%} <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.</ls>) <ls n="Chr.">1,3</ls> (zweimal).
;
440422 new <div n="2">— d〉 <ab>Bez.</ab> {%eines <ab>best.</ab> Spruches%} <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.. 1,3</ls> (zweimal).

;; AB note
The complex old form
old <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.</ls>) <ls n="Chr.">1,3</ls>
should change not to
new <ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.. 1,3</ls>
but to the simpler
new <ls>NṚS. TĀP. UP. 1,3</ls> (in der <ls>Bibl. ind.</ls>)
similar to the 475321 content <ls>NṚS. UP. 1,3</ls> in der <ls>Bibl. ind.</ls>
[as done in my v.2 file]
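
One possible explanation for this stray result, assuming the Step-2 pattern was applied exactly as written in the notes: the unescaped '.' after `</ls>` matches any character, including the ')' in this line. The sketch below reproduces the 'new' fragment quoted above from the 'old' one; the variable names are for illustration only.

```python
import re

OLD = ('<ls>NṚS. TĀP. UP.</ls> (in der <ls>Bibl. ind.</ls>) '
       '<ls n="Chr.">1,3</ls>')

AS_WRITTEN = r'</ls>. <ls n="Chr.(.*?)">'   # '.' matches any character
ESCAPED = r'</ls>\. <ls n="Chr\.(.*?)">'    # '.' matches only a literal dot

# The loose pattern swallows ')' and merges 1,3 into the Bibl. ind. reference.
loose = re.sub(AS_WRITTEN, ". ", OLD)
# The strict pattern does not match here, so the line is left unchanged.
strict = re.sub(ESCAPED, ". ", OLD)
```

Here `loose` ends in `(in der <ls>Bibl. ind.. 1,3</ls>`, while `strict` equals `OLD`, so a line like this one would be left for manual handling.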

funderburkjim added a commit to sanskrit-lexicon/csl-pywork that referenced this issue Nov 27, 2023
funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Nov 27, 2023
funderburkjim added a commit to sanskrit-lexicon/csl-websanlexicon that referenced this issue Nov 27, 2023
funderburkjim added a commit to sanskrit-lexicon/csl-apidev that referenced this issue Nov 27, 2023
funderburkjim added a commit that referenced this issue Nov 27, 2023
funderburkjim added a commit that referenced this issue Nov 27, 2023
@funderburkjim
Contributor Author

Resolve the 7000 differences

The work is done in the issue102/step2 directory.
The first small step was to remove the ';;' comments in AB version and make corresponding changes in cdsl version, resulting in temp_pw_v1_0.txt and temp_pw_v2_0.txt (v1=cdsl, v2=AB).
See diff_AB_v2_v2_0.txt for diff temp_pw_AB_v2.txt temp_pw_v2_0.txt.

change_v2_1.txt (39 changes) documents the further changes to AB version temp_pw_v2_0.txt.

change_v1_1.txt (7252 changes) documents the further changes to cdsl version temp_pw_v1_0.txt.

Respectively applying these changes yields temp_pw_v1_1.txt and temp_pw_v2_1.txt.

diff temp_pw_v1_1.txt temp_pw_v2_1.txt | wc -l
0 diffs. These files are identical, so all differences are resolved.
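
A rough Python counterpart of this check, for anyone without `diff` at hand. It counts differing lines rather than diff output lines, and it relies on the two versions having the same number of lines; the function name is hypothetical.

```python
def count_differing_lines(lines_a, lines_b):
    """Count positions at which two equally long line lists differ."""
    if len(lines_a) != len(lines_b):
        raise ValueError("the two versions have different line counts")
    return sum(1 for a, b in zip(lines_a, lines_b) if a != b)
```

Called on the lines of temp_pw_v1_1.txt and temp_pw_v2_1.txt, a result of 0 would mean the same thing as the empty diff above.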

This is the version pushed to the csl-orig repository at the commit mentioned in the comment above.

Other repositories also required some changes for xml-validation and proper behavior of the displays. Notably, the new [Page v-ppp-c] format is now used for page links, as requested in a previous comment.

image

@funderburkjim
Contributor Author

@Andhrabharati I think this issue may now be closed. Agree?

funderburkjim added a commit that referenced this issue Nov 27, 2023
@Andhrabharati

@funderburkjim

Out of the 32 Misc. changes done in the AB version, I've noticed 5 corrections (at 36207, 36215, 324234, 339882 and 426957)--
corrections.txt

You may correct these in the cdsl text also.

The rest are mostly related to italic marking, to which I deliberately didn't pay much attention earlier (having thought of doing a full text reading once; and I would probably take this up quite soon).

Glad that the CDSL and AB versions are now tallying!!

I have just returned home from a long journey (and too tired), and shall look at the rest of the actions (that you had taken) tomorrow.

@funderburkjim
Contributor Author

@Andhrabharati @maltenth has been working on two types of corrections

  • German word spelling errors in italicized text
  • Spelling inconsistencies in <bot> tags.

Let's defer your further work (including your small 'corrections' file) until this work with Thomas is finished.
Thus, I'm closing this issue.

BTW, I am doubtful of the 'print change' suggestions of your corrections file, but we can discuss further in another issue.
