non-english mw words #127

Closed
funderburkjim opened this issue Dec 1, 2021 · 41 comments

Comments

@funderburkjim
Contributor

This comment is one branch of #99.

By various means (which I'll describe below tomorrow), a list of 1509 words was developed which

  • are composed just of Normal a-zA-Z characters
  • which are confirmed to occur in mw (outside of markup)
  • which are not found in consulted English dictionaries (en_US and en_GB of 'enchant')
@funderburkjim
Contributor Author

The end results are in two text files:

Suggested first task

Although the Enchant dictionaries do not recognize these words as English, a large portion of them appear to be English words after all. Using a browser 'define: X' search and/or https://www.merriam-webster.com/, some of these will be found directly.
I suggest that the first task is to separate the words in words_mw_noneng.txt into two piles, depending on whether the word is found (with a plausible definition) in one of these online sources.

This is just a suggestion. What do you think @AnnaRybakovaT ? Good place to start? What else would you
need from me to get started?

@AnnaRybakovaT
Contributor

Dear Jim,
I am very glad to continue working under your guidance!
During the day I will read everything and try to start this task. If I have any questions, I will of course ask for your help.

@AnnaRybakovaT
Contributor

Dear Jim,
The 1st task is more than clear. Just let me know how you would prefer me to separate the 2 groups of words. Does this way suit ("nf" - not found, "found" - the word exists in online sources)?
image

@funderburkjim
Contributor Author

@AnnaRybakovaT Your system of identification of the two cases looks consistent, and easy to work with. 👍 When this step is done, the next step will be to examine further the nf (not-found) in the context of MW usage -- we'll think about this further when the time comes.

@funderburkjim
Contributor Author

Construction details

This is a documentation summary of the files constructed leading up to the two files mentioned above.
As we proceed further with the analysis, some of the small details may be relevant.
The actual programmatic steps are detailed in the readme,

  • Start with unique words extracted.txt, which has 51298 lines, each containing a 'word' derived somehow from the MW digitization.
  • the above separated into three parts
    • 3 words_arabic.txt
    • 107 words_nonascii.txt words containing a non-ascii character
      • Some of these words need recoding in mw.txt
    • 51188 (temp_words_00.txt) -- the remaining words.
  • The remaining words were analyzed into two parts:
    • 24305 words_01.txt were those words consisting only of alphabetic characters ([a-zA-Z]) except for possible ending punctuation [.,;:?!] ; and that ending punctuation removed, and duplicates removed (e.g. the words 'Arab' and 'Arab,' both resolve to 'Arab')
    • 7825 words_other.txt words with non-alphabetic characters. In these, any ending punctuation was retained. There is room for further examination of sub-categories of these (numbers, hyphenated words, words beginning with a hyphen, probably other subcases).
  • words_01 was separated into two groups, based on whether the word was found in mw.txt.
    • all mw text within markup was EXCLUDED when searching for words.
    • specifically, in each line, a space character replaced any occurrences of these regex:
      • '<ab.*?</ab>', '<s>.*?</s>', '<ls.*?</ls>', '<info.*?/>', '<bot>.*?</bot>','<hom>.*?</hom>', '<etym>.*?</etym>', '<lang.*?</lang>', '<lex.*?</lex>','<s1.*?</s1>'
      • then the resulting text was split into words by `re.split(r'\b',text)`
      • finally, each such word was tested to be in words_01.txt
    • 23744 words_mw.txt words from words_01 and in mw.txt
    • words_notmw.txt words from words_01 not found in mw.txt
      • These also should be examined further. This is partly where the relation between this definition of 'words in mw' and the definition used to derive starting set unique words extracted.txt is important.
        • For example, why does the starting set of words contain bypersons as a word but this is not found in mw by the current analysis?
  • The list words_mw.txt is then divided into three pieces, depending on whether a word is identified as English or not. As mentioned, this determination is made by reliance upon two of the enchant English dictionaries
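The steps listed above can be sketched roughly as follows. This is a minimal illustration and not the actual program (that is described in the readme). The function names are hypothetical, the markup patterns are copied from the list above, and two points are assumptions: that several ending punctuation characters may be stripped, and that the patterns are applied one after another. The English check accepts any objects with a `check(word)` method, which is the interface of pyenchant's `enchant.Dict("en_US")`. Only a two-way English/non-English split is shown, since the three-way division is not spelled out in this comment.

```python
import re

# Ending punctuation removed when forming words_01 (from the list above).
END_PUNCT = ".,;:?!"

# Markup excluded before searching mw.txt (patterns copied from the list above).
MARKUP_PATTERNS = [
    r'<ab.*?</ab>', r'<s>.*?</s>', r'<ls.*?</ls>', r'<info.*?/>',
    r'<bot>.*?</bot>', r'<hom>.*?</hom>', r'<etym>.*?</etym>',
    r'<lang.*?</lang>', r'<lex.*?</lex>', r'<s1.*?</s1>',
]


def split_alphabetic(words):
    """Split raw words into (words_01, words_other).

    words_01: purely [a-zA-Z] after stripping ending punctuation, duplicates removed.
    words_other: everything else, with ending punctuation retained.
    """
    words_01, words_other, seen = [], [], set()
    for w in words:
        stripped = w.rstrip(END_PUNCT)  # assumption: strip one or more ending marks
        if re.fullmatch(r'[a-zA-Z]+', stripped):
            if stripped not in seen:
                seen.add(stripped)
                words_01.append(stripped)
        else:
            words_other.append(w)
    return words_01, words_other


def strip_markup(line):
    """Replace each excluded markup span with a space (assumed: one pattern at a time)."""
    for pat in MARKUP_PATTERNS:
        line = re.sub(pat, ' ', line)
    return line


def partition_by_mw(words_01, mw_lines):
    """Return (words_mw, words_notmw): words found or not found outside markup."""
    wanted, found = set(words_01), set()
    for line in mw_lines:
        for token in re.split(r'\b', strip_markup(line)):
            if token in wanted:
                found.add(token)
    words_mw = [w for w in words_01 if w in found]
    words_notmw = [w for w in words_01 if w not in found]
    return words_mw, words_notmw


def partition_by_english(words, dictionaries):
    """Return (english, nonenglish); a word is English if any dictionary accepts it."""
    english, nonenglish = [], []
    for w in words:
        if any(d.check(w) for d in dictionaries):
            english.append(w)
        else:
            nonenglish.append(w)
    return english, nonenglish
```

With pyenchant installed, the last step might be called as `partition_by_english(words_mw, [enchant.Dict("en_US"), enchant.Dict("en_GB")])`.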

@AnnaRybakovaT
Contributor

When this step is done, the next step will be to examine further the nf (not-found) in the context of MW usage -- we'll think about this further when the time comes.

Dear Jim,
What do you think: would it be better to add some comments on the "not found" words from the beginning? For now I see 3 categories:

  • not English origin: names, plants, geographical names
  • Misspelled English words (I give the correct spelling in a comment)
  • rare cases which need deeper investigation.

image

This is only my suggestion. If it is better to do the 1st step as you described above (I mean, to add only "found" and "nf"), I will do it that way.

@funderburkjim
Contributor Author

Adding those extra comments to the 'nf' is fine, since it will help in the next step of further analysis of the nf.

@gasyoun
Member

gasyoun commented Dec 5, 2021

Seems @AnnaRybakovaT is where she belongs to again, thanks @funderburkjim for the guidance.

@AnnaRybakovaT
Contributor

Dear Jim,
I am still working with the file words_mw_noneng.txt
The third part is ready (you can see the temporary results in the file words_mw_noneng_temp.txt). If you have any comments, please let me know (I will try to take them into account in further analysis).

@funderburkjim
Contributor Author

Hi, Anna -- we must have been communicating telepathically, as I was thinking 'Where is Anna?' earlier today!
Will take a look at what you've been doing in the next day or two.

@gasyoun
Member

gasyoun commented Dec 14, 2021

we must have been communicating telepathically

Indeed. I heard your question one day before you did and asked Anna to push what she has. The task is big, so I proposed she split it into parts. Let's have our annual call on the 26th of December? @funderburkjim @drdhaval2785 @AnnaRybakovaT @Andhrabharati @SergeA ? Last time it was around 12:00 Moscow time, or?

@funderburkjim
Contributor Author

Noon Moscow would be 8PM in New York (my time zone). That time ok with me.

I suggest one discussion point be how to proceed with less from me. I want to spend considerably more time on (a) improving my Sanskrit literacy, (b) a long-standing mathematics project ignored for almost 4 years now. There is a huge backlog of sanskrit-lexicon tasks that are currently assigned to me. I aim to address these, but at a less intensive pace.* Perhaps others will adopt some of these tasks, or perhaps others may wish to move the sanskrit-lexicon project into new directions. It will be interesting to see how things unfold.

  • Getting Indishe spruch (boesp) into a link target for PW(K) and PWG, and improving the ls markup of PWG and nailing down the ls tooltips for these two dictionaries -- these sanskrit-lexicon tasks are top of mind for me at the moment.

@Andhrabharati
Contributor

Andhrabharati commented Dec 15, 2021

@funderburkjim

Many of these words (if not all) could be traced in the mw text, by regex searching for the word followed by [^\.], i.e., xxx[^\.]

You seem to have missed some of these, as you had removed the ending punctuation mark!!

As such, you may want to update your lists above after checking.
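A minimal sketch of the search suggested here, assuming `text` holds the content of mw.txt (the function name is hypothetical). Note that the pattern as written also matches the word as part of a longer word, e.g. 'Arab' inside 'Arabic'.

```python
import re


def occurs_without_period(word, text):
    """True if `word` occurs followed by any character other than a period (xxx[^\\.])."""
    return re.search(re.escape(word) + r'[^.]', text) is not None
```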

@funderburkjim
Contributor Author

@AnnaRybakovaT Thoughts looking at 'words_mw_noneng_temp.txt'

  • You have apparently been looking at instances in mw.txt also. Your marking of 'typo' is excellent, as this
    will indicate corrections that need to be made. Noticed 'Vallasor' also needs to be marked typo.
  • You have also marked 20 or so as 'print change'. It is less certain how to handle these, but fine that you
    marked them. For example, consider Carroway 3; nf; plant "Caraway" (print change) . Maybe we should
    invent a new markup <probably n="Caraway">Carroway</probably> that would provide a tooltip to users, but would leave the 'Carroway' spelling in place. [There might be a better name for the tag 'probably'].
    There are a few different types of print (such as 'cornifex') that apparently have other issues besides spelling.
  • The brief comments by many 'nf' I find good, such as Catarkot 1; nf; geographical name "Chatarkot".
    Have you considered similar comments for some of the obscure 'found' words, such as Cambay 1; found ?
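To make the 'probably' idea above concrete, here is a small sketch of how a printed spelling could be wrapped in the proposed markup while leaving the spelling itself in place. The tag is only a proposal at this point, and the function name is hypothetical.

```python
import re


def mark_probably(line, printed, probable):
    """Wrap whole-word occurrences of `printed` in <probably n="...">...</probably>."""
    pattern = re.compile(r'\b%s\b' % re.escape(printed))
    replacement = '<probably n="%s">%s</probably>' % (probable, printed)
    # A function is used as the replacement so that it is inserted literally.
    return pattern.sub(lambda m: replacement, line)
```

For example, `mark_probably('seed of Carroway, etc.', 'Carroway', 'Caraway')` gives `seed of <probably n="Caraway">Carroway</probably>, etc.`. The sketch does not guard against wrapping a word that is already marked.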

We should probably somehow make use of the accepted words (i.e., those whose spelling we decide to leave unchanged) in mw.txt. For example, the word 'Capricornus' appears in AP90.txt and is one that you 'found'.
If we do a similar study of words in AP90, then we should build on your work, and therefore accept 'Capricornus' as ok, even though it is not among the Enchant English words. [Note 'Carroway' also in AP90].

It seems you have examined about 56% of the cases. Keep going!

@funderburkjim
Contributor Author

@Andhrabharati re words_notmw.txt

Note that within my analysis (see Construction details note above)
all mw text within markup was EXCLUDED when searching for words

For example 'Acacia' appears in words_notmw.txt. Within mw.txt, this word DOES occur 113 times,
but always within a 'bot' element, e.g. <bot>Acacia Sirissa</bot>.
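A rough way to check this kind of case is to count how often a word occurs inside and outside a given element. The sketch below is only an illustration (the function name is hypothetical), and it handles only tags written without attributes, such as `<bot>`.

```python
import re


def count_inside_outside(word, lines, tag='bot'):
    """Return (inside, outside): whole-word counts within <tag>...</tag> and elsewhere."""
    element = re.compile(r'<%s>.*?</%s>' % (tag, tag))
    word_re = re.compile(r'\b%s\b' % re.escape(word))
    inside = outside = 0
    for line in lines:
        total = len(word_re.findall(line))
        # Blank out the element, then count what is left.
        remaining = len(word_re.findall(element.sub(' ', line)))
        outside += remaining
        inside += total - remaining
    return inside, outside
```

A word with `outside == 0` would land in words_notmw.txt under the analysis described above, even though it occurs in mw.txt.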

@Andhrabharati
Contributor

Yes, I checked that they are all marked now, but they were not when I was working on this (during March 2021).

These are the 4 lines from the mw_iast.txt (dt 04.04.21) by you, which was the last one I had considered (after which I stopped tracking the mw, and shifted to other works)-

<L>44900<pc>257,1<k1>karṇamoṭā<k2>kárṇa—moṭā<e>3 <s>kárṇa—moṭā</s> ¦ <lex>f.</lex> Acacia arabica, <ls>L.</ls><info lex="f"/> <LEND>
<L>46461<pc>264,1<k1>kavarī<k2>kavarī<e>1B ¦ Acacia arabica or another plant, <ls>Npr.</ls><info lex="inh"/> <LEND>
<L>85434<pc>448,3<k1>tīkṣṇakaṇṭaka<k2>tīkṣṇá—kaṇṭaka<e>3A ¦ Acacia arabica, <ls>Npr.</ls><info lex="inh"/> <LEND>
<L>148230<pc>745,3<k1>bhaṇḍila<k2>bhaṇḍila<e>3A ¦ Acacia or <bot>Mimosa Sirissa</bot>, <ls>L.</ls><info lex="inh"/> <LEND>

Anyway, there are only about 500 words in "words_notmw.txt", so it is not a big issue to discuss further.
[All those might have got updated in the later days.]

@AnnaRybakovaT
Contributor

  • Have you considered similar comments for some of the obscure 'found' words, such as Cambay 1; found ?

Dear Jim,
Many thanks for your comments. Now I am more confident that everything is going well.
Regarding the obscure 'found' words - I can double check and write short explanations.

@Andhrabharati
Contributor

Let's have our annual call on the 26th of December? @funderburkjim @drdhaval2785 @AnnaRybakovaT @Andhrabharati @SergeA ?

What would be the agenda, @gasyoun?
And do you think I have a role to "play"?

@gasyoun
Member

gasyoun commented Dec 16, 2021

What would be the agenda

One does not know in advance.

And do you think I have a role to "play"?

Yes, it will increase in 2022-2032.

Getting Indishe spruch (boesp) into a link target for PW(K) and PWG

Sounds like a plan.

long-standing mathematics project ignored for almost 4 years now

Can I send you a mathematician to help out so you can ignore it even longer?

spend considerably more time on (a) improving my Sanskrit literacy

As for Sanskrit literacy - may I know what exactly you want to read?

invent a new markup <probably n="Caraway">Carroway</probably> that would provide a tooltip to users, but would leave the 'Carroway' spelling in place

Exactly, a kind of ghostword or newEnglish. But as we have German dictionaries with the same issues, perhaps ghostword could be used?

accept 'Capricornus' as ok, even though it is not among the Enchant English words

Exactly.

Regarding the obscure 'found' words - I can double check and write short explanations.

So glad @AnnaRybakovaT is back - not only beautiful, but smart and hard-working she is.

@funderburkjim
Contributor Author

what do you want to read?

For starters, Kale's Hitopadesha, Lanman reader stories, Bhagavad Gita, Peter's Ramopakhyana, maybe Indishe Spruch verses -- I would like to be able to dip into any of these and sight read with ease.

@Andhrabharati
Contributor

  • Getting Indishe spruch (boesp) into a link target for PW(K) and PWG, and improving the ls markup of PWG and nailing down the ls tooltips for these two dictionaries -- these sanskrit-lexicon tasks are top of mind for me at the moment.

@funderburkjim
I would like you to include Ramayana and Mahabharata as link targets, as they are among the major ones, and SCH for ls markup, as it goes with pwk and PWG as a set; and then take a break.

I am presently working on SCH and am likely to post the results before the end of this month.

@AnnaRybakovaT
Contributor

AnnaRybakovaT commented Jan 24, 2022

Dear Jim,
Finally I have finished analyzing this file. The results are contained in the file:
https://github.com/sanskrit-lexicon/MWS/blob/master/mws_issue_99/apps/unique_eng/words_mw_noneng_1.txt


Addendum to Anna's comment of Jan 24, 2022 (Jim)
Anna's file was renamed (01-22-2024) to

words_mw_noneng_1.txt

@gasyoun
Member

gasyoun commented Jan 24, 2022

I am presently working on SCH and am likely to post the results before the end of this month.

May you never feel weakness.

Finally I have finished analyzing this file. The results are contained in the file:

Absolutely impressed.

Kale's Hitopadesha, Lanman reader stories, Bhagavad Gita, Peter's Ramopakhyana, maybe Indishe Spruch verses

It's good you started with Kale. Indishe Spruch are mostly hard to understand, as is sometimes Bhagavad Gita. Peter's Ramopakhyana is interesting, but still more advanced than Lanman reader stories.

@Andhrabharati
Contributor

Good work done, @AnnaRybakovaT; you indeed are a smart worker as @gasyoun mentioned above.

I have just seen that there are some omissions and errors in your file, and I'm sure @funderburkjim will review them all before incorporating them into the Cologne files.

Here are a few quick ones-

Galmei 1; nf; German word for Calamine
Habush 2; nf; a plant name in Bengali; look at the SKD entry हपुषा.
Mooltan 1; nf; a place name (Multān)
annumeration 2; nf; Addition to a former number (Webster's)
antiphlegmatic 2; nf; anti-phlegmatic (used to reduce phlegm)
nonne 1; nf; a Latin word used in interrogation
-----------------
Chandoiu 1; nf; //looks like a Sanskrit word// this is a typo for Chandom. (abbr. for Chandomanjari)

@AnnaRybakovaT
Contributor

I have just seen that there are some omissions and errors in your file

Thanks a lot for checking and for explaining the missing cases (I had no idea what they could be)!!!

@Andhrabharati
Contributor

Andhrabharati commented Jan 28, 2022

@funderburkjim

Would you mind regenerating the "latest" iast and deva files for the mw.txt?

I have noticed quite a few issues that need correction, and thought of doing a complete proofing once and for all. This time, I estimate a time-frame of about 6-8 months for the full proofing.

Hope to see your response soon on this.

@Andhrabharati
Contributor

@drdhaval2785

Would you be interested to do this [as @funderburkjim is either not interested in this proposal, or did not "see" this above post yet (being busy on PWG ls working)]?

Or else, I will take up some other big work for a long term, starting a few days from now.

@drdhaval2785
Contributor

If you want new Devanagari files, I can make them.
I am not sure about IAST though.

@drdhaval2785
Contributor

https://github.com/sanskrit-lexicon/csl-devanagari/blob/main/v02/mw/mw.txt is the latest MW Devanagari version.

@Andhrabharati
Contributor

Andhrabharati commented Jul 28, 2022

In the last file by @AnnaRybakovaT at the #127 (comment), both

Rakshases 2; nf; Rakshasas &
Ushases 1; nf; Ushas

are proper in the text, being the plurals of Rakshas & Ushas respectively, and no change is required in those words.
Hope @funderburkjim would take this into account, while he 'works' on this file he has copied elsewhere.

@Andhrabharati
Contributor

@funderburkjim

I had seen you copying Anna's work after a gap of 6 months; and now another year-and-half has elapsed.
I hope you will consider looking into her file and acting upon it sometime soon.

@funderburkjim
Contributor Author

@Andhrabharati Am taking up review of words_mw_noneng_1.txt.

funderburkjim added a commit to sanskrit-lexicon/csl-corrections that referenced this issue Jan 25, 2024
funderburkjim added a commit to sanskrit-lexicon/csl-pywork that referenced this issue Jan 25, 2024
funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Jan 25, 2024
@funderburkjim
Contributor Author

processing of nonenglish words.

Work directory is unique_eng.

  • words_mw_noneng_2.txt has my annotations of @AnnaRybakovaT's file words_mw_noneng_1.txt.
    • My comments are marked ';; xxxx'
  • items generating a change in mw.txt indicated by ';; 2024 ...'
  • about 200 lines of mw.txt changed. See also changes_2.txt or the csl-orig commit above
  • About 40 of these were marked as print-changes 'PRINT CHANGE', and were posted to mw_printchange.txt (see csl-corrections commit above).
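For anyone who wants to process the annotated file programmatically, here is a minimal parsing sketch. The line layout is inferred from examples quoted in this thread ('word count; status; comment', followed by optional ';; ...' review notes) and may not cover every line of the actual file; the function and key names are hypothetical.

```python
def parse_annotation(line):
    """Parse one line such as 'Carroway 3; nf; plant "Caraway" (print change)'."""
    head, *notes = line.split(';;')
    notes = [n.strip() for n in notes]
    # At most three fields before the review notes: 'word count', status, comment.
    fields = [f.strip() for f in head.split(';', 2)]
    word, count = fields[0].rsplit(None, 1)
    return {
        'word': word,
        'count': int(count),
        'status': fields[1] if len(fields) > 1 else '',    # 'found' or 'nf'
        'comment': fields[2] if len(fields) > 2 else '',
        'notes': notes,                                     # the ';; ...' annotations
        'changed_2024': any(n.startswith('2024') for n in notes),
        'print_change': any('PRINT CHANGE' in n for n in notes),
    }
```

Filtering on `changed_2024` or `print_change` would then give the lines that led to changes in mw.txt or to entries in mw_printchange.txt, assuming the markers are used consistently.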

For a few old words, these were useful:

For Latin words, sometimes this was useful: https://www.online-latin-dictionary.com/latin-english-dictionary.php

@funderburkjim
Contributor Author

Further research and usage

There is a lot of good information in the research by @AnnaRybakovaT and @Andhrabharati. Not clear where to put it so that it may be available when needed another time. Maybe where @drdhaval2785 has put his word studies.

@Andhrabharati
Contributor

@funderburkjim

Though you have mentioned that (Anna's and) my 'research' contained some good info, you had ignored/skipped this post above.

@Andhrabharati Andhrabharati reopened this Jan 25, 2024
@Andhrabharati
Contributor

A quick look into the 40 print-changes prompted me to comment thus--

cerebralisation 1; nf; cerebralization (typo);; 2024 correction L=110300 niveSa PRINT CHANGE

;; AB there are a few more cases of such 's-z' variants-- realization (5) vs. realisation (4); cauterization (4) vs. cauterisation (1)
;; AB these American and British spelling variations may be seen throughout the MW text (see for e.g. courtezan, courtesan)
;; AB thus, I feel that this particular 'print-change' correction should be reverted.
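Such 's-z' pairs could be listed systematically before deciding on individual corrections. A small sketch, assuming `counts` is a hypothetical dictionary mapping each word to its number of occurrences in mw.txt; it covers only the -isation/-ization pattern, so pairs like courtezan/courtesan would need separate handling.

```python
def sz_variant_pairs(counts):
    """Return [(s_form, s_count, z_form, z_count)] where both spellings occur."""
    pairs = []
    for word in sorted(counts):
        if 'isation' in word:
            z_form = word.replace('isation', 'ization')
            if z_form in counts:
                pairs.append((word, counts[word], z_form, counts[z_form]))
    return pairs
```

A word whose other spelling never occurs (as appears to be the case for 'cerebralisation') does not show up in the result, so it still has to be judged on its own.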

@Andhrabharati
Contributor

Andhrabharati commented Jan 25, 2024

Another piece of information that I wanted to present here--

anum 1; nf; maybe "per annum" (in this case - print change) ;; no change. anum in pw, but otherwise not found

This does not mean "per annum" as Anna thought; the context (pw uses "per anum" in several other places) suggests the meaning "from/by the anus", anum being an inflected form of the Latin word anus.

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Jan 25, 2024
funderburkjim added a commit to sanskrit-lexicon/csl-corrections that referenced this issue Jan 25, 2024
funderburkjim added a commit that referenced this issue Jan 25, 2024
@funderburkjim
Contributor Author

@Andhrabharati Revised per your comment(s). For details, see commits above.

@Andhrabharati
Contributor

I presumed that these two plurals also would/should be marked, as <ns>Aṅgirases</ns> was.

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Jan 25, 2024
@funderburkjim
Contributor Author

@Andhrabharati <ns> markup added. See commits above.

5 participants