non-english mw words #127

funderburkjim · 2021-12-01T04:41:39Z

This comment is one branch of #99.

By various means (which I'll describe below tomorrow), a list of 1509 words was developed. These words:

- are composed only of plain a-zA-Z characters
- are confirmed to occur in mw (outside of markup)
- are not found in the consulted English dictionaries (en_US and en_GB of 'enchant')
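
The three criteria above can be read as a simple filter. What follows is only a minimal sketch, not the actual program: the function name `non_english_words` and the toy dictionaries are hypothetical. With pyenchant installed, one would presumably pass `enchant.Dict("en_US").check` and `enchant.Dict("en_GB").check` as the checks instead.

```python
import re

ALPHA_ONLY = re.compile(r"[a-zA-Z]+")

def non_english_words(candidates, mw_words, english_checks):
    """Keep candidates that are a-zA-Z only, occur in mw, and fail every dictionary check."""
    result = []
    for word in candidates:
        if not ALPHA_ONLY.fullmatch(word):
            continue  # contains something other than a-zA-Z
        if word not in mw_words:
            continue  # not confirmed to occur in mw (outside of markup)
        if any(check(word) for check in english_checks):
            continue  # accepted as English by at least one dictionary
        result.append(word)
    return result

# Toy stand-ins for the en_US and en_GB dictionaries (assumption, for illustration only).
toy_us = {"color", "plant"}.__contains__
toy_gb = {"colour", "plant"}.__contains__

sample = non_english_words(
    ["plant", "colour", "Carroway", "a-b", "Galmei"],
    {"plant", "colour", "Carroway", "a-b"},
    [toy_us, toy_gb],
)
```

Here 'Galmei' drops out only because the toy `mw_words` set omits it; in the real data it does occur in mw.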

funderburkjim · 2021-12-01T04:55:46Z

The end results are in two text files:

words_mw_noneng.txt shows each word and number of instances found in mw.
instance_mw_noneng.txt shows, for the same list, all the lines in MW where the word occurs.

Suggested first task

Although the Enchant dictionaries do not find these words as English words, nonetheless, a large portion of them look to actually be English words. Using Browser 'define: X' and/or https://www.merriam-webster.com/,
some of these will be found directly.
I suggest that the first task is to separate the words in words_mw_noneng.txt into two piles, depending on whether the word is found (with a plausible definition) in one of these online sources.

This is just a suggestion. What do you think @AnnaRybakovaT ? Good place to start? What else would you
need from me to get started?

AnnaRybakovaT · 2021-12-01T08:26:58Z

Dear Jim,
I am very glad to continue working under your guidance!
During the day I will read everything and try to start this task. If I have any questions, of course I will ask for your help.

AnnaRybakovaT · 2021-12-01T15:49:07Z

Dear Jim,
The 1st task - everything is more than clear. Only let me know how you would prefer me to mark the 2 groups of words. Does this way suit ("nf" - not found, "found" - the word exists in online sources)?

funderburkjim · 2021-12-01T17:38:51Z

@AnnaRybakovaT Your system of identification of the two cases looks consistent, and easy to work with. 👍 When this step is done, the next step will be to examine further the nf (not-found) in the context of MW usage -- we'll think about this further when the time comes.

funderburkjim · 2021-12-01T23:04:46Z

Construction details

This is a documentation summary of the files constructed leading up to the two files mentioned above.
As we proceed further with the analysis, some of the small details may be relevant.
The actual programmatic steps are detailed in the readme.

Start with unique words extracted.txt, which has 51298 lines, each containing a 'word' derived somehow from the MW digitization.
The above was separated into three parts:
- 3 words_arabic.txt
- 107 words_nonascii.txt words containing a non-ascii character
  - Some of these words need recoding in mw.txt
- 51188 (temp_words_00.txt) -- the remaining words.
The remaining words were analyzed into two parts:
- 24305 words_01.txt words consisting only of alphabetic characters ([a-zA-Z]) except for possible ending punctuation [.,;:?!]; that ending punctuation was removed, and then duplicates were removed (e.g. the words 'Arab' and 'Arab,' both resolve to 'Arab')
- 7825 words_other.txt words with non-alphabetic characters. In these, any ending punctuation was retained. There is room for further examination of sub-categories of these (numbers, hyphenated words, words beginning with a hyphen, probably other subcases).
words_01 was separated into two groups, based on whether the word was found in mw.txt.
- all mw text within markup was EXCLUDED when searching for words.
- specifically, in each line, a space character replaced any occurrences of these regex:
  - '<ab.*?</ab>', '<s>.*?</s>', '<ls.*?</ls>', '<info.*?/>', '<bot>.*?</bot>','<hom>.*?</hom>', '<etym>.*?</etym>', '<lang.*?</lang>', '<lex.*?</lex>','<s1.*?</s1>'
  - then the resulting text was split into words by `re.split(r'\b',text)`
  - finally, each such word was tested to be in words_01.txt
- 23744 words_mw.txt words from words_01 and in mw.txt
- words_notmw.txt words from words_01 not found in mw.txt
  - These also should be examined further. This is partly where the relation between this definition of 'words in mw' and the definition used to derive starting set unique words extracted.txt is important.
    - For example, why does the starting set of words contain bypersons as a word but this is not found in mw by the current analysis?
The list words_mw.txt is then divided into three pieces, depending on whether a word is identified as English or not. As mentioned, this determination is made by reliance upon two of the enchant English dictionaries
- 21954 words_mw_US.txt using the 'en_US' dictionary
- 281 words_mw_GB.txt additional words identified as English using the 'en_GB' dictionary
- 1509 words_mw_noneng.txt the remaining words of words_mw.txt
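
The normalization and markup-exclusion steps listed above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the readme: the helper names (`normalize`, `strip_markup`, `words_outside_markup`) are hypothetical, and `normalize` strips any run of ending punctuation, which may differ from what was actually done. The regex list and the `re.split(r'\b', text)` call are taken from the description.

```python
import re

# Markup whose content is excluded when searching for words (from the list above).
MARKUP_PATTERNS = [
    r'<ab.*?</ab>', r'<s>.*?</s>', r'<ls.*?</ls>', r'<info.*?/>',
    r'<bot>.*?</bot>', r'<hom>.*?</hom>', r'<etym>.*?</etym>',
    r'<lang.*?</lang>', r'<lex.*?</lex>', r'<s1.*?</s1>',
]

def normalize(word):
    """Drop ending punctuation; return None unless what remains is purely a-zA-Z."""
    stripped = word.rstrip('.,;:?!')
    return stripped if re.fullmatch(r'[a-zA-Z]+', stripped) else None

def strip_markup(line):
    """Replace every occurrence of the excluded markup with a space."""
    for pattern in MARKUP_PATTERNS:
        line = re.sub(pattern, ' ', line)
    return line

def words_outside_markup(line, known_words):
    """Split the stripped line at word boundaries and keep tokens that are known words."""
    tokens = re.split(r'\b', strip_markup(line))
    return [token for token in tokens if token in known_words]

# Made-up sample line in the style of mw.txt (assumption, for illustration only).
sample_line = 'Acacia or <bot>Mimosa Sirissa</bot>, <ls>L.</ls>'
found = words_outside_markup(sample_line, {'Acacia', 'Mimosa', 'Sirissa'})
```

Note that splitting on the empty match `\b` relies on Python 3.7 or later; earlier versions of `re.split` did not split on empty matches.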

AnnaRybakovaT · 2021-12-02T17:06:16Z

When this step is done, the next step will be to examine further the nf (not-found) in the context of MW usage -- we'll think about this further when the time comes.

Dear Jim,
What do you think: would it be better to add some comments on the "not found" words from the beginning? So far I see 3 categories:

- words not of English origin: names, plants, geographical names
- misspelled English words (I give the correct spelling in a comment)
- rare cases which need deeper investigation.

This is only my suggestion. If it is better to do the 1st step as you described above (I mean, to add only "found" and "nf"), I will do it that way.

funderburkjim · 2021-12-02T18:09:43Z

Adding those extra comments to the 'nf' is fine, since it will help in the next step of further analysis of the nf.

gasyoun · 2021-12-05T23:23:02Z

Seems @AnnaRybakovaT is back where she belongs, thanks @funderburkjim for the guidance.

AnnaRybakovaT · 2021-12-14T18:12:13Z

Dear Jim,
I am still working with the file words_mw_noneng.txt
The third part is ready (you can see the temporary results in the file words_mw_noneng_temp.txt). If you have any comments, please let me know (I will try to take them into account in further analysis).

AnnaRybakovaT · 2021-12-14T18:12:29Z

https://github.com/sanskrit-lexicon/MWS/blob/master/mws_issue_99/apps/unique_eng/words_mw_noneng_temp.txt

funderburkjim · 2021-12-14T19:05:44Z

Hi, Anna -- we must have been communicating telepathically, as I was thinking 'Where is Anna?' earlier today!
Will take a look at what you've been doing in the next day or two.

gasyoun · 2021-12-14T20:17:48Z

we must have been communicating telepathically

Indeed. I heard your question one day before you asked it, and asked Anna to push what she has. The task is big, so I proposed she split it into parts. Let's have our annual call on the 26th of December? @funderburkjim @drdhaval2785 @AnnaRybakovaT @Andhrabharati @SergeA ? Last time it was around 12:00 Moscow time, or?

funderburkjim · 2021-12-15T03:57:19Z

Noon Moscow would be 8PM in New York (my time zone). That time ok with me.

I suggest one discussion point be how to proceed with less from me. I want to spend considerably more time on (a) improving my Sanskrit literacy, (b) a long-standing mathematics project ignored for almost 4 years now. There is a huge backlog of sanskrit-lexicon tasks that are currently assigned to me. I aim to address these, but at a less intensive pace.* Perhaps others will adopt some of these tasks, or perhaps others may wish to move the sanskrit-lexicon project into new directions. It will be interesting to see how things unfold.

Getting Indishe spruch (boesp) into a link target for PW(K) and PWG, and improving the ls markup of PWG and nailing down the ls tooltips for these two dictionaries -- these sanskrit-lexicon tasks are top of mind for me at the moment.

Andhrabharati · 2021-12-15T04:33:42Z

words_notmw.txt words from words_01 not found in mw.txt

@funderburkjim

Many of these words (if not all) could be traced in the mw text, by regex searching for the word followed by [^\.], i.e., xxx[^\.]

You seem to have missed some of these, as you had removed the ending punctuation mark!!

As such, you may want to update your lists above, after checking.

funderburkjim · 2021-12-15T18:21:06Z

@AnnaRybakovaT Thoughts looking at 'words_mw_noneng_temp.txt'

You have apparently been looking at instances in mw.txt also. Your marking of 'typo' is excellent, as this
will indicate corrections that need to be made. Noticed 'Vallasor' also needs to be marked typo.
You have also marked 20 or so as 'print change'. It is less certain how to handle these, but fine that you
marked them. For example, consider Carroway 3; nf; plant "Caraway" (print change) . Maybe we should
invent a new markup <probably n="Caraway">Carroway</probably> that would provide a tooltip to users, but would leave the 'Carroway' spelling in place. [There might be a better name for the tag 'probably'].
There are a few different types of print (such as 'cornifex') that apparently have other issues besides spelling.
The brief comments by many 'nf' I find good, such as Catarkot 1; nf; geographical name "Chatarkot".
Have you considered similar comments for some of the obscure 'found' words, such as Cambay 1; found ?

We should probably somehow make use of the accepted words (i.e., those whose spelling we decide to leave unchanged) in mw.txt. For example, the word 'Capricornus' appears in AP90.txt and is one that you 'found'.
If we do a similar study of words in AP90, then we should build on your work, and therefore accept 'Capricornus' as ok, even though it is not among the Enchant English words. [Note 'Carroway' also in AP90].

It seems you have examined about 56% of the cases. Keep going!
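
A progress figure like the one above could be computed along these lines. This is a minimal sketch: it assumes every line has the form `word count; status; comment` as in the examples quoted above, that unexamined lines carry only `word count`, and the function names are hypothetical.

```python
def parse_annotation(line):
    """Split 'word count; status; comment' into its fields (status and comment are optional)."""
    parts = [part.strip() for part in line.split(';', 2)]
    word, count = parts[0].rsplit(None, 1)
    status = parts[1] if len(parts) > 1 and parts[1] else None
    comment = parts[2] if len(parts) > 2 else ''
    return {'word': word, 'count': int(count), 'status': status, 'comment': comment}

def fraction_examined(lines):
    """Fraction of non-blank lines already marked 'found' or 'nf'."""
    records = [parse_annotation(line) for line in lines if line.strip()]
    if not records:
        return 0.0
    done = sum(1 for record in records if record['status'] in ('found', 'nf'))
    return done / len(records)

# The first two lines are quoted from the file; the third is a made-up unexamined entry.
sample_lines = [
    'Carroway 3; nf; plant "Caraway" (print change)',
    'Cambay 1; found',
    'Zzzword 2',
]
first = parse_annotation(sample_lines[0])
progress = fraction_examined(sample_lines)
```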

funderburkjim · 2021-12-15T18:28:37Z

@Andhrabharati re words_notmw.txt

Note that within my analysis (see Construction details note above)
all mw text within markup was EXCLUDED when searching for words

For example 'Acacia' appears in words_notmw.txt. Within mw.txt, this word DOES occur 113 times,
but always within a 'bot' element, e.g. <bot>Acacia Sirissa</bot>.
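
To check a case like 'Acacia', one could count occurrences inside and outside a given element separately. The sketch below is an illustration only: `count_inside_outside` is a hypothetical helper, the sample lines are made up, and it handles a single attribute-free tag such as `<bot>` rather than the full exclusion list.

```python
import re

def count_inside_outside(lines, word, tag='bot'):
    """Count whole-word occurrences of `word` inside <tag>...</tag> and in the remaining text."""
    element = re.compile(rf'<{tag}>.*?</{tag}>')
    target = re.compile(rf'\b{re.escape(word)}\b')
    inside = outside = 0
    for line in lines:
        for match in element.finditer(line):
            inside += len(target.findall(match.group(0)))
        # Blank out the elements, then count what is left.
        outside += len(target.findall(element.sub(' ', line)))
    return inside, outside

# Made-up lines in the style of mw.txt (assumption, for illustration only).
sample_lines = [
    '<bot>Acacia Sirissa</bot>, <ls>L.</ls>',
    'Acacia or <bot>Mimosa Sirissa</bot>, <ls>L.</ls>',
]
acacia_counts = count_inside_outside(sample_lines, 'Acacia')
```

A word whose outside count is zero would land in words_notmw.txt under the analysis described above, even though it does occur in mw.txt.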

Andhrabharati · 2021-12-15T19:08:31Z

Yes, checked that they are all marked now; but they weren't at the time of my working those days (during March 2021).

These are the 4 lines from the mw_iast.txt (dt 04.04.21) by you, which was the last one I had considered (after which I stopped tracking the mw, and shifted to other works)-

<L>44900<pc>257,1<k1>karṇamoṭā<k2>kárṇa—moṭā<e>3 <s>kárṇa—moṭā</s> ¦ <lex>f.</lex> Acacia arabica, <ls>L.</ls><info lex="f"/> <LEND>
<L>46461<pc>264,1<k1>kavarī<k2>kavarī<e>1B ¦ Acacia arabica or another plant, <ls>Npr.</ls><info lex="inh"/> <LEND>
<L>85434<pc>448,3<k1>tīkṣṇakaṇṭaka<k2>tīkṣṇá—kaṇṭaka<e>3A ¦ Acacia arabica, <ls>Npr.</ls><info lex="inh"/> <LEND>
<L>148230<pc>745,3<k1>bhaṇḍila<k2>bhaṇḍila<e>3A ¦ Acacia or <bot>Mimosa Sirissa</bot>, <ls>L.</ls><info lex="inh"/> <LEND>

Anyways, there are just about 500 words in the "words_notmw.txt", so it is not a big issue needing more discussion.
[All those might have got updated in the later days.]

AnnaRybakovaT · 2021-12-16T15:58:00Z

Have you considered similar comments for some of the obscure 'found' words, such as Cambay 1; found ?

Dear Jim,
Many thanks for your comments. Now I am more confident that everything is going well.
Regarding the obscure 'found' words - I can double check and write short explanations.

Andhrabharati · 2021-12-16T16:13:45Z

Let's have our annual call on the 26th of December? @funderburkjim @drdhaval2785 @AnnaRybakovaT @Andhrabharati @SergeA ?

What would be the agenda, @gasyoun?
And do you think I have a role to "play"?

gasyoun · 2021-12-16T22:05:14Z

What would be the agenda

One does not know in advance.

And do you think I have a role to "play"?

Yes, it will increase in 2022-2032.

Getting Indishe spruch (boesp) into a link target for PW(K) and PWG

Sounds like a plan.

long-standing mathematics project ignored for almost 4 years now

Can I send you a mathematician to help out so you can ignore it even longer?

spend considerably more time on (a) improving my Sanskrit literacy

As per Sanskrit literacy - may I know what exactly you want to read?

invent a new markup Carroway that would provide a tooltip to users, but would leave the 'Carroway' spelling in place

Exactly, kind of ghostword or newEnglish. But as we have German dictionaries with the same issues, maybe ghostword could be used?

accept 'Capricornus' as ok, even though it is not among the Enchant English words

Exactly.

Regarding the obscure 'found' words - I can double check and write short explanations.

So glad @AnnaRybakovaT is back - not only beautiful, but smart and hard working she is.

funderburkjim · 2021-12-16T23:04:27Z

what do you want to read?

For starters, Kale's Hitopadesha, Lanman reader stories, Bhagavad Gita, Peter's Ramopakhyana, maybe Indishe Spruch verses -- I would like to be able to dip into any of these and sight read with ease.

Andhrabharati · 2021-12-17T05:39:05Z

Getting Indishe spruch (boesp) into a link target for PW(K) and PWG, and improving the ls markup of PWG and nailing down the ls tooltips for these two dictionaries -- these sanskrit-lexicon tasks are top of mind for me at the moment.

@funderburkjim
I would like you to include Ramayana and Mahabharata as link targets, which are some of the major ones; and SCH for ls markup, as it goes with pwk and PWG as a set; and then take a break.

I am presently working on SCH and likely to be posting the results, before this month ending.

AnnaRybakovaT · 2022-01-24T21:53:21Z

words_mw_noneng.txt shows each word and number of instances found in mw.

Dear Jim,
Finally I have finished analyzing this file. The results are contained in the file:
https://github.com/sanskrit-lexicon/MWS/blob/master/mws_issue_99/apps/unique_eng/words_mw_noneng_1.txt

Addendum to Anna's comment of Jan 24, 2022 (Jim)
Anna's file was renamed (01-22-2024) to

words_mw_noneng_1.txt

gasyoun · 2022-01-24T22:13:35Z

I am presently working on SCH and likely to be posting the results, before this month ending.

May you never feel weakness.

Finally I have finished analyzing this file. The results are contained in the file:

Absolutely impressed.

Kale's Hitopadesha, Lanman reader stories, Bhagavad Gita, Peter's Ramopakhyana, maybe Indishe Spruch verses

It's good you started with Kale. Indishe Spruch are mostly hard to understand, as is sometimes Bhagavad Gita. Peter's Ramopakhyana is interesting, but still more advanced than Lanman reader stories.

Andhrabharati · 2022-01-25T15:49:47Z

Good work done, @AnnaRybakovaT; you indeed are a smart worker as @gasyoun mentioned above.

Just seen that there are some missings and errors in your file, and I'm sure @funderburkjim would be reviewing them all over before incorporating them into Cologne files.

Here are a few quick ones-

Galmei 1; nf; German word for Calamine
Habush 2; nf; a plant name in Bengali; look at the SKD entry हपुषा.
Mooltan 1; nf; a place name (Multān)
annumeration 2; nf; Addition to a former number (Webster's)
antiphlegmatic 2; nf; anti-phlegmatic (used to reduce phlegm)
nonne 1; nf; a Latin word used in interrogation
-----------------
Chandoiu 1; nf; //looks like a Sanskrit word// this is a typo for Chandom. (abbr. for Chandomanjari)

AnnaRybakovaT · 2022-01-25T16:49:38Z

Just seen that there are some missings and errors in your file

Thanks a lot for checking and for explaining the missing cases (I had no idea what they could be)!!!

Andhrabharati · 2022-01-28T03:24:34Z

@funderburkjim

Would you mind regenerating the "latest" iast and deva files for the mw.txt?

I have noticed quite a few issues that need corrections, and thought of doing a complete proofing once for all. This time, I estimate a time-frame of about 6-8 months for the full proofing.

Hope to see your response soon on this.

Andhrabharati · 2022-03-01T03:06:48Z

@drdhaval2785

Would you be interested to do this [as @funderburkjim is either not interested in this proposal, or did not "see" this above post yet (being busy on PWG ls working)]?

Or else, I will take up some other big work for a long term, starting a few days from now.

drdhaval2785 · 2022-03-01T04:43:57Z

If you want new Devanagari files, I can do that.
I am not sure about IAST though.

drdhaval2785 · 2022-03-01T04:59:41Z

https://github.com/sanskrit-lexicon/csl-devanagari/blob/main/v02/mw/mw.txt is the latest MW Devanagari version.

Andhrabharati · 2022-07-28T07:47:47Z

In the last file by @AnnaRybakovaT at the #127 (comment), both

Rakshases 2; nf; Rakshasas &
Ushases 1; nf; Ushas

are proper in the text, being the plural of Rakshas & Ushas respectively, and no change required in those words.
Hope @funderburkjim would take this into account, while he 'works' on this file he has copied elsewhere.

Andhrabharati · 2024-01-16T11:11:41Z

@funderburkjim

I had seen you copying Anna's work after a gap of 6 months; and now another year-and-half has elapsed.
Hope you might consider looking into her file and act upon the same, sometime sooner.

funderburkjim · 2024-01-22T19:21:05Z

@Andhrabharati Am taking up review of words_mw_noneng_1.txt.


funderburkjim · 2024-01-25T02:31:50Z

processing of nonenglish words.

Work directory is unique_eng.

words_mw_noneng_2.txt has my annotations of @AnnaRybakovaT's file words_mw_noneng_1.txt.
- My comments are marked ';; xxxx'.
- Items generating a change in mw.txt are indicated by ';; 2024 ...'.
- About 200 lines of mw.txt changed. See also changes_2.txt or the csl-orig commit above.
- About 40 of these were marked as print changes ('PRINT CHANGE') and were posted to mw_printchange.txt (see csl-corrections commit above).

For a few old words, these were useful:

For Latin words, sometimes this was useful: https://www.online-latin-dictionary.com/latin-english-dictionary.php

funderburkjim · 2024-01-25T02:36:04Z

Further research and usage

There is a lot of good information in the research by @AnnaRybakovaT and @Andhrabharati. Not clear where to put it so that it may be available when needed another time. Maybe where @drdhaval2785 has put his word studies.

Andhrabharati · 2024-01-25T03:38:34Z

@funderburkjim

Though you have mentioned that (Anna's and) my 'research' contained some good info, you had ignored/skipped this post above.

Andhrabharati · 2024-01-25T05:42:18Z

A quick look into the 40 print-changes prompted me to comment thus--

cerebralisation 1; nf; cerebralization (typo);; 2024 correction L=110300 niveSa PRINT CHANGE

;; AB there are a few more cases of such 's-z' variants-- realization (5) vs. realisation (4); cauterization (4) vs. cauterisation (1)
;; AB these American and British spelling variations may be seen throughout the MW text (see for e.g. courtezan, courtesan)
;; AB thus, I feel that this particular 'print-change' correction should be reverted.
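
Such 's-z' pairs could be listed from a word-count table before any one of them is treated as a typo. The following is a rough sketch: `s_z_variant_pairs` is a hypothetical helper, it only looks at the -isation/-ization ending, and the counts are the ones quoted in this comment.

```python
def s_z_variant_pairs(counts):
    """List (z_form, z_count, s_form, s_count) where both spellings occur in the counts."""
    pairs = []
    for word, n in sorted(counts.items()):
        if 'isation' in word:
            z_form = word.replace('isation', 'ization')
            if z_form in counts:
                pairs.append((z_form, counts[z_form], word, n))
    return pairs

# Counts as quoted above; 'cerebralisation' has no z-spelled counterpart in this table.
counts = {
    'realization': 5, 'realisation': 4,
    'cauterization': 4, 'cauterisation': 1,
    'cerebralisation': 1,
}
pairs = s_z_variant_pairs(counts)
```

An s-form with no z-form partner (like 'cerebralisation' here) is not flagged by this check, so it says nothing either way about that word; it only shows that both spellings coexist elsewhere in the text.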

Andhrabharati · 2024-01-25T07:46:39Z

Another info, that I wanted to present here--

anum 1; nf; maybe "per annum" (in this case - print change) ;; no change. anum in pw, but otherwise not found

This does not indicate "per annum" as Anna thought; for the context (there are some more places that pw has used "per anum") seems to mean "from/by anus", anum being the inflected form of Anus (Latin word).


funderburkjim · 2024-01-25T17:32:23Z

@Andhrabharati Revised per your comment(s). For details, see commits above.

Andhrabharati · 2024-01-25T17:46:17Z

I presumed that these two plurals also would/should be marked, as <ns>Aṅgirases</ns> was.


funderburkjim · 2024-01-25T18:07:03Z

@Andhrabharati <ns> markup added. See commits above.

funderburkjim added a commit that referenced this issue Dec 1, 2021

non-english mw words, #127, first commit

68735e2

funderburkjim added a commit to sanskrit-lexicon/csl-corrections that referenced this issue Jan 25, 2024

MW print changes.

c24c402

Ref: sanskrit-lexicon/MWS#127

funderburkjim added a commit to sanskrit-lexicon/csl-pywork that referenced this issue Jan 25, 2024

MW: mwab_input.txt.

ef72088

Ref: sanskrit-lexicon/MWS#127

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Jan 25, 2024

MW: non-english word corrections.

85b222f

Ref: sanskrit-lexicon/MWS#127

funderburkjim added a commit that referenced this issue Jan 25, 2024

Process corrections from words_mw_noneng_2.txt. #127

534d153

funderburkjim closed this as completed Jan 25, 2024

Andhrabharati reopened this Jan 25, 2024

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Jan 25, 2024

MW:110300 Revert cerebralization to cerebralisation.

b651036

Ref: sanskrit-lexicon/MWS#127

funderburkjim added a commit to sanskrit-lexicon/csl-corrections that referenced this issue Jan 25, 2024

MW: revise mw_printchange re cerebralisation.

8a0dc74

Ref: sanskrit-lexicon/MWS#127

funderburkjim added a commit that referenced this issue Jan 25, 2024

Revised per AB comments. #127

55c3e35

funderburkjim closed this as completed Jan 25, 2024

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Jan 25, 2024

MW: mark rAkshases and Ushases with <ns>' tag.

b8959ba

Ref: sanskrit-lexicon/MWS#127

funderburkjim added a commit that referenced this issue Jan 25, 2024

Add <ns> markup to Rakshases, Ushases.

34311ff

Ref: #127 (comment)
