Corrections to Burnouf IAST #420

Open
funderburkjim opened this issue Sep 27, 2018 · 12 comments
@funderburkjim
Contributor

In the review of Sanskrit coding conventions, it was noted (see):

However, the non-italic Sanskrit proper names have not been converted
to modern IAST; with @sanskritisampada 's help to identify the non-italic Sanskrit words, these will
also soon be converted to modern IAST.

This work has now been done. This issue aims to provide some documentation.

@funderburkjim
Contributor Author

funderburkjim commented Sep 27, 2018

plainwords

The files mentioned are in this Burnouf/iastwork directory.

Some Sanskrit words in Burnouf appear in plain (non-italic) text.
These have not yet been converted to standard IAST; they remain in
Burnouf's own IAST. We want to identify these words and then change their spelling
to standard IAST for Sanskrit.

The first step is to generate a list of all 'words' that appear in plain text in the Burnouf digitization.
There are about 20000 distinct such words identified.
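As a rough illustration of this first step, the sketch below counts distinct plain-text words. It assumes, purely for illustration, that italic text is wrapped in `<i>...</i>` and that any other markup can be stripped as generic tags; the actual Burnouf markup and the actual program may differ.

```python
import re
import unicodedata
from collections import Counter

# Assumption: italic spans look like <i>...</i>. This tag is hypothetical.
ITALIC = re.compile(r"<i>.*?</i>", re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")
# A 'word' here is a run of Unicode letters (no digits or underscores).
WORD = re.compile(r"[^\W\d_]+")


def plain_word_counts(text):
    """Return a Counter of words that occur outside italic spans."""
    # Compose diacritics so that letters such as 'ṛ' are single characters.
    text = unicodedata.normalize("NFC", text)
    text = ITALIC.sub(" ", text)   # drop italic text entirely
    text = ANY_TAG.sub(" ", text)  # drop remaining tags, keep their content
    return Counter(WORD.findall(text))
```

Calling `plain_word_counts` on the full digitization text would give each distinct plain word together with its frequency.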

Of course, many of these words are French. Using the pyenchant Python library with a French dictionary, we can identify many of the 20000 words as French.

Note on pyenchant:
Although still available and working with Python 2.7, apparently this library is no longer maintained. Here is the repository for it. It would be good to know if there is some replacement for this library that works with Python 3, since Python 2 will become obsolete in 2020.

This filter resulted in plainwords_french.txt (15202 words) and plainwords_other.txt (4843 words).

Each of these files shows a word on each line, along with how often it occurs (as a plain word) in the Burnouf digitization.

The program using pyenchant is fr_pyenchant.py.
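The split into French and other words might look roughly like the sketch below. This is not the code of fr_pyenchant.py; the function names and the "word count" line format are assumptions. The dictionary check is passed in as a function so the sketch does not depend on any particular spell-checking library.

```python
def split_by_dictionary(counts, is_french):
    """Split {word: frequency} into (french, other) using a checker function.

    is_french is any callable returning True when the dictionary knows the word.
    """
    french, other = {}, {}
    for word, freq in counts.items():
        # Try the lowercase form too, so sentence-initial capitals still match.
        known = is_french(word) or is_french(word.lower())
        (french if known else other)[word] = freq
    return french, other


def format_lines(counts):
    """One 'word frequency' line per word, most frequent first (assumed format)."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{word} {freq}" for word, freq in ordered]
```

With pyenchant installed and a French dictionary available, `enchant.Dict("fr_FR").check` could be passed as `is_french`.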

funderburkjim added a commit that referenced this issue Sep 27, 2018
@funderburkjim
Contributor Author

funderburkjim commented Sep 27, 2018

Initial work

An HTML file (plainwords_other.html) was prepared to provide context for the 4500 'other' words.

At this point, the task was turned over to @sanskritisampada. Her goal was to mark the French words (with an 'F') and the Sanskrit words (with an 'S') in the list of plainwords_other words.

Even with context, this is a difficult task, partly because of the nature of the Burnouf dictionary, which contains:

  • cognate words in many languages
  • scientific (Latinate) names of plants and animals
  • modern versions of place names
  • probable French words borrowed from Sanskrit
  • probably other word categories not yet noticed

@funderburkjim
Contributor Author

Google word detection tool

Sampada reported that this word identification was quite slow-going. This prompted a search for
ways to speed up the process, and somewhere along the line I became aware of the language detection
functionality of Google Translate. In particular, there is a Python API, as described [here].

This was adapted for the current purposes in the sample_detect.py program.

After merging with what had been done thus far, the result was burnouf_sampada_detect.txt.
Note that each line now has, in addition to the word and its frequency,

  • a placeholder for the word identification
  • the language according to the Google language detection tool
  • a confidence number, also provided by the language detection tool.
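The line layout just described could be produced along the lines of the sketch below. The tab-separated format, the `?` placeholder, and the function name are assumptions, not the actual format of burnouf_sampada_detect.txt. The detector is passed in as a function that returns a dictionary with `language` and `confidence` keys.

```python
def annotate(word_counts, detect, placeholder="?"):
    """Build one tab-separated line per word.

    word_counts: iterable of (word, frequency) pairs.
    detect: callable taking a word and returning a dict with the keys
            'language' and 'confidence' (assumed result shape).
    """
    lines = []
    for word, freq in word_counts:
        result = detect(word)
        fields = [
            word,
            str(freq),
            placeholder,                      # to be replaced by F, S or P
            result["language"],
            f"{result['confidence']:.2f}",
        ]
        lines.append("\t".join(fields))
    return lines
```

As far as I know, `detect_language` on the Google Cloud Translation v2 client returns a dictionary of this shape for a single string, but whether sample_detect.py uses that client is not stated here.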

Interestingly, even though the language detection is often quite odd, Sampada found it sped
up the process of identification.

The end result of the identifications thus far is in burnouf_sampada_detect_all.txt, with

  • 1525 words marked as French
  • 968 marked as Sanskrit
  • 86 marked as place names ('P')

All in all, about 2579 were marked, and 2264 remain unmarked.

@funderburkjim
Contributor Author

French corrections

During the process of marking, Sampada identified many spelling corrections for French words.
With some editing, these were converted into about 270 digitization correction transactions.

@funderburkjim
Contributor Author

Sanskrit corrections and markup

The plainwords identified as Sanskrit were examined with regard to their spelling correctness in light of
modern IAST spelling conventions. As mentioned in the discussion of Burnouf's use of diacritics in representing Sanskrit words, many of these conventions differ from the modern IAST conventions.
Spelling changes were made so that the resulting digitization uses modern IAST spellings for these non-italic Sanskrit words.

After these modernization changes, the resulting Sanskrit words were converted to SLP1 and compared to the spellings of headwords in the Monier-Williams dictionary. This resulted in several spelling
corrections (for instance, 'Crishna' was changed to 'Kṛṣṇa' in 3 places).
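The conversion and comparison can be sketched as follows. This is a simplified illustration, not the program actually used: it lowercases the word, matches two-letter IAST units greedily, and ignores ambiguous cases such as a real 'a' + 'i' hiatus. The headword set is a stand-in for the Monier-Williams headword list.

```python
import unicodedata

# IAST units whose SLP1 form differs; all other letters map to themselves.
IAST_TO_SLP1 = {
    "ā": "A", "ī": "I", "ū": "U", "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "ai": "E", "au": "O", "ṃ": "M", "ṁ": "M", "ḥ": "H",
    "kh": "K", "gh": "G", "ṅ": "N", "ch": "C", "jh": "J", "ñ": "Y",
    "ṭh": "W", "ṭ": "w", "ḍh": "Q", "ḍ": "q", "ṇ": "R",
    "th": "T", "dh": "D", "ph": "P", "bh": "B", "ś": "S", "ṣ": "z",
}


def iast_to_slp1(word):
    """Convert an IAST word to SLP1 (greedy, longest unit first)."""
    text = unicodedata.normalize("NFC", word).lower()
    out, i = [], 0
    while i < len(text):
        for size in (2, 1):
            chunk = text[i:i + size]
            if chunk in IAST_TO_SLP1:
                out.append(IAST_TO_SLP1[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # letter is the same in IAST and SLP1
            i += 1
    return "".join(out)


def not_in_headwords(words, headwords):
    """Return the words whose SLP1 form is missing from the headword set."""
    return [w for w in words if iast_to_slp1(w) not in headwords]
```

Words returned by `not_in_headwords` are candidates for manual review; they are not necessarily misspelled.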

The identified Sanskrit words, whether needing correction or not, were entered in a form which maintains their identification as Sanskrit words:

<s1 slp1="tretAyuga">Tretāyuga</s1>

This markup form had previously been used for a similar purpose in the revision to the MW digitization.
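A minimal sketch of producing and reading back this markup, assuming the SLP1 spelling has already been computed; the helper names are hypothetical:

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Matches the <s1 slp1="...">...</s1> form shown above.
S1 = re.compile(r'<s1 slp1="([^"]*)">(.*?)</s1>')


def s1_tag(display, slp1):
    """Wrap a Sanskrit word in s1 markup carrying its SLP1 spelling."""
    return f"<s1 slp1={quoteattr(slp1)}>{escape(display)}</s1>"


def tagged_words(line):
    """Return (slp1, display) pairs for every s1 element in a line."""
    return S1.findall(line)
```

For example, `s1_tag("Tretāyuga", "tretAyuga")` reproduces the markup line shown above.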

All the Sanskrit plain word digitization changes are present in the manualByLine_sancorr.txt file.

@gasyoun
Member

gasyoun commented Sep 29, 2018

'Crishna' was changed to 'Kṛṣṇa'

This is amazing.

@drdhaval2785
Contributor

@funderburkjim can the IAST conversion in BUR be treated as finished?

All in all, about 2579 were marked, and 2264 remain unmarked.

This line stopped me from pressing the close button.

@funderburkjim
Contributor Author

There appears to be more that could be done to improve Burnouf, starting with further
examination based on burnouf_sampada_detect_all.txt.

@gasyoun
Member

gasyoun commented Dec 19, 2020

There appears to be more that could be done to improve Burnouf

Let a French Sanskrit scholar be born and finalize it.

@sanskritisampada

sanskritisampada commented Dec 19, 2020 via email

@gasyoun
Member

gasyoun commented Dec 19, 2020

I could contribute further after the AP 90 task is complete.

You're a true miracle, Sampada.

@funderburkjim
Contributor Author

@sanskritisampada Good idea
