words in different languages #3
@Andhrabharati Here are words in bhs.txt that are
The remaining 164 words in |
Jmd is an abbr. for Jemand (= someone, somebody), and it is a German word.
This is what I had mentioned about the fr- and ger- marking (limited to italic strings) at the very beginning.
I knew very well that more italic strings need to be marked yet; and I've noticed quite a few non-italic strings as well, that belong to other languages. Hence was my request to you to try the programmatic approach using the spellchecker lists.
Here I meant continue marking the words using the spellchecker_french.txt and spellchecker_german.txt under the eng_error_lang folder. But, I presume that you now have started checking the words I had marked so far; I have marked the full italic string as a single lang., though it contained other language words, say like the ones you have pointed above [cucullus, admonitio, admonere : latin; balustrade : english; and trāyin, Śakti, Durgā (not urgā) : Sanskrit.] |
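The programmatic approach requested above could be sketched roughly as follows. This is only an illustration, not the script actually used in the issues/issue3 work: it assumes italic strings are delimited as `{%...%}` (as in the `{%werden%}` example quoted in this thread) and that a spellchecker list such as spellchecker_german.txt holds one word per line. The function names `load_wordlist` and `foreign_hits` are hypothetical.

```python
import re

# Assumption: italic strings in bhs.txt look like {%...%}.
ITALIC = re.compile(r"\{%(.+?)%\}")
# Runs of letters (Unicode-aware), excluding digits and underscore.
WORD = re.compile(r"[^\W\d_]+")


def load_wordlist(lines):
    """Build a lowercase word set from an iterable of lines (one word per line)."""
    return {w.strip().lower() for w in lines if w.strip()}


def foreign_hits(text_lines, wordlist):
    """Yield (line_number, italic_string, matched_words) for every italic
    string that contains at least one word from the word list."""
    for lnum, line in enumerate(text_lines, start=1):
        for m in ITALIC.finditer(line):
            words = [w for w in WORD.findall(m.group(1)) if w.lower() in wordlist]
            if words:
                yield lnum, m.group(1), words
```

Because the matched words are reported individually, an italic string that mixes languages shows up with only the words that belong to the given list, which leaves the decision on how to tag the rest to manual review. Swapping the `ITALIC` pattern for a plain word scan would extend the same check to non-italic text.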
addition german, french markup
Work done in issues/issue3 directory.
@Andhrabharati anything else to do regarding this issue? |
I had looked at the readme file and then the change_1_ger file. I think, your work is apparently limited to the italic strings alone (as seen in the change_1_ger file). So, the "MarvinJWendt" word-list is also not quite complete, like the spellchecker list!! This is just the result of a quick search and, I think, more could be lying in the text. |
So, you need to "traverse" in some different path to identify the words fully [I cannot and should not dare giving you tips and tricks!!], or leave the task to me to take up at sometime later. |
Found some more French/German italicized text. This is based on an examination of italicized text containing non-English word(s): check3.txt. @Andhrabharati Have I missed any?
Glad that my post is taken by you in good spirit; I had felt later that my wordings are somewhat in 'negative shade'. I guess, there could be few more botanical (latin) names (you had listed/marked 10 now). And, pl. do the similar exercise with non-italic text too, for completeness. |
And, would you pl. post your latest file? Just found that you had got werden in line 5531- {%beklommen werden%}, but missed it in line 55371- {%werden%}! |
recueillements is marked at And did you get |
Just an example-- And let's have these marked with |
oversites handled
See 'Additional changes' in check3a_edit.txt.
Yes, I've seen some more words being in English borrowed from other languages 'as is', and it is debatable whether to mark them as the 'parent' language words!! One way that I feel a sure 'proper' manner is to decide by the context-- if occurring in the other language work (identifiable by the author's name and/or the work), it could be treated as the foreign 'parent' word. |
Note esp. that the ls, ab and lang tags are now increased further. |
Please upload your bhs_ab_2 version so I can resolve the differences. e.g., My latest version has 309 `fr, compared to your 314. |
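A comparison of tag counts between two versions, like the 309 versus 314 mentioned above, could be sketched as below. This is a minimal illustration and not the script used in the compare sub-directory: it assumes the markup uses XML-style tags such as `<fr>...</fr>`, `<ger>...</ger>`, `<ab>...</ab>` and `<ls>...</ls>`, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Matches opening tags only (closing tags start with "</" and are skipped).
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*>")


def tag_counts(text):
    """Count opening tags by tag name."""
    return Counter(OPEN_TAG.findall(text))


def count_differences(mine, theirs):
    """Return {tag: (count_in_mine, count_in_theirs)} for tags whose counts differ."""
    a, b = tag_counts(mine), tag_counts(theirs)
    return {t: (a[t], b[t]) for t in sorted(set(a) | set(b)) if a[t] != b[t]}
```

Only the tags whose counts differ are returned, so the output points directly at the tag names that need a line-by-line comparison.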
Here it is, @funderburkjim -- And, pl. be noted that I have done some addl. corrections too, apart from updating the taggings. |
Thanks. I'll focus on the tag counts of your table for now. |
Once you are done with this phase, pl. post your file, and probably close the issue. Then I can take-up resolving the (latest) unidentified (or doubtful) ab- and ls- tags [as updated by you, using my AB_2 file], in another issue. |
additional revisions
temp_bhs_ab_3.zip contains the end result. Work done in compare sub-directory. Generally, the abbreviation markup changes of bhs.ab.2 were accepted. After resolving the abbreviation changes, I also identified and applied the remaining differences. temp_bhs_ab_3.txt is now the latest csl-orig version for bhs, and is the basis of the displays. The 'tooltip' files (for general abbreviations and literary source abbreviations) were also modified to be consistent.
Many of these (esp. for ls) are currently only 'placeholders', with '?' as the tooltip. These need to be resolved. I'll open another issue for this tooltip revision. @Andhrabharati If you accept temp_bhs_ab_3.txt, we can close this issue 3.
See how odd the modifier apostrophe looks at these places (of course, this is a font dependent issue!); we never see such forms in any French print! The caron-forms are what are seen in print. As such, I suggest using ď (U+010F), ľ (U+013E) and Ľ (U+013D) at these places. AB:
AB: agreed, I had erroneously marked this as German.
This is purely a German form [occurred 4500+ times in pwk and 6500+ times in PWG], and I suggest changing both the places where it occurred thus (which were picked up from the resp. German Wörterbuch)--
PS. The expansion of
AB: agreed
Ledder is a Low German form, and I find this https://wordsense.eu site quite useful in identifying the words and languages.
AB: agreed
Firstly, this list has missed I suggest retaining all these with lat-tagging; these are all latin phrases (that were brought into English language as is), not abbr.s in any manner. I had followed the point that I mentioned above in marking these thus.
AB: agreed; and as I do in manual marking, you had also erred here
I had earlier marked it properly as BTW, just noticed that I had missed the ending letter long ī at this tagging. I think it is appropriate to mark it somehow as a language; but it is not a big deal to break our heads over.
AB: disagree; here 'Caraka' is not referring to the legendary proponent of Ayurveda (that is ls-tagged), but to some king. No tagging needed here.
[Same for the next two as well; so, not elaborating them.] AB: not a big point to disagree; but just like to mention that there are many cases of ab- and ls- entities occurring with and without a following dot throughout the text. We should treat it as the author's style, instead of trying to 'normalise' them!
AB: agree for these two changes. |
Re French apostrophe I disagree with The U+010E has some other purpose, I think. Similarly, https://en.wiktionary.org/wiki/%C4%BE says that I think in our work, the |
OK, @funderburkjim; now, I see that Google shows plenty of french pages with d̕ etc. [LATIN SMALL LETTER D WITH COMMA ABOVE RIGHT, 0064 + 0315]. Of course, Unicode chart itself recommends using 02BC instead of this-- Why not approach someone more knowledgeable in French to confirm and conclude the matter, say Sampada or Odile? |
Have sent email requesting help from Odile:
In the cdsl versions of Burnouf and Stchoupak, the simple apostrophe U+0027 is used. Odile has contributed extensively to these digitizations. |
comment on the meaning of the
Here is Odile's reply:
Also see https://en.wikipedia.org/wiki/Right_single_quotation_mark. My conclusion: Currently, we are using u02bc for the apostrophe (1523 instances) in bhs, both within Let's keep the apostrophe with u02bc. |
I do agree, @funderburkjim ! I have some reservation in using the right_single_quotation_mark, as it conflicts with my matching_pairs 'logic' (for the same reason, I had resorted to the '〉' in place of closing parenthesis mark ')' though it is present thus in the print). Let's stick to the u02bc, as I did in GRA, pw set etc. recently. |
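For reference, the candidate characters discussed in this thread can be listed with Python's standard `unicodedata` module, and stray variants can be folded into U+02BC. This is only a sketch: which variants should be folded is an assumption here (plain U+0027 and U+2019), and in practice the folding would have to be limited to the French strings, since U+0027 may serve other purposes elsewhere in the text. The names `describe` and `normalize_apostrophes` are hypothetical.

```python
import unicodedata

# Characters mentioned in the discussion above.
CANDIDATES = ["\u0027", "\u2019", "\u02bc", "\u0315", "\u010f", "\u013e"]


def describe(ch):
    """Return the code point and the official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Assumption: only these two variants are folded into U+02BC.
FOLD = {ord("\u0027"): "\u02bc", ord("\u2019"): "\u02bc"}


def normalize_apostrophes(french_string):
    """Replace apostrophe variants with U+02BC in a single French string."""
    return french_string.translate(FOLD)


if __name__ == "__main__":
    for ch in CANDIDATES:
        print(describe(ch))
```

Keeping a single code point for the apostrophe also leaves the quotation marks free for any matching-pairs check of the kind mentioned above.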
Great!
further revisions
I think all the items mentioned in comment above have been handled. @Andhrabharati Please check if I've missed anything. |
Please note correction made to documentation 'changes_bhs_ab_3a' for kaqambA. |
Good; so, it's time to close this issue? |
Revisions now installed at Cologne. |
You need to update the meta2 file in the download sets. Just seen that the details on it are somewhat obsolete now, namely the tag counts and some of the extended characters/counts. |
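The extended-character counts for such a meta file could be regenerated with something like the sketch below. The actual layout of bhs-meta2.txt is not shown in this thread, so the report format here is purely an assumption, and "extended" is taken to mean any character outside ASCII. The function names are hypothetical.

```python
import unicodedata
from collections import Counter


def extended_char_counts(text):
    """Count every non-ASCII character in the text."""
    return Counter(ch for ch in text if ord(ch) > 127)


def report(text):
    """Return one line per extended character: code point, count, Unicode name."""
    lines = []
    for ch, n in sorted(extended_char_counts(text).items()):
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X} {n} {name}")
    return lines
```

Running this on the installed bhs.txt and comparing the result with the counts in the meta file would show which entries have gone stale.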
So many, would never expect |
The revisions to bhs.txt discussed in #1 provide markup which identifies the language of various phrases. The summary (from bhs-meta2.txt) is
In the displays, such text is shown in 'brown' color, and marked with tooltip (e.g. 'French language' for <fr>X</fr>).
In #1, @Andhrabharati suggested
A few possible problems with this markup have been noticed by random observation, e.g.,
under Anantarya, unmittelbare Folge should be marked with <ger>
This issue opened as reminder of this idea to enhance bhs digitization.