Update TEI parser to handle Hellenistic and Imperial texts #9
Some of the new texts use a peculiar betacode encoding with diacriticals that precede the letter they apply to. (Diacriticals normally follow the letter they apply to, except in the case of capital letters marked with a Argon. 2.232 Argon 4.757 Callim. Hymn. 3.1 Dion. 1.129 Am I on the right track in wanting to apply diacriticals to the letter that follows in these cases, or do they look like more textual errors?
With #11 fixed, aratus.xml is working. The others are hung up on the beta code issues mentioned in the previous comment.
These are all errors. For Argon. 2.232 (which I think should be 2.244): In most other editions of these texts, the diacritics visually precede a capital letter, e.g. Argon. 2.757 in Diogenes (ed. Fraenkel 1961/1970; NB the soft breathing is nestled in the circumflex): The beta code may be trying to achieve this by preposing the diacritics in the code. The mistake at Argon. 2.244 is that the capital symbol should come first, followed by the breathing and then the accent. Cf. the correctly encoded "*)=wfilai" (Ὦφιλαι Argon. 1.657). The same goes for Argon. 4.757, which should be *)=iri. For Callim. Hymn. 3.1, the correct code is "ou)"; the problem is that the beta code is trying to begin a parenthetical statement (i.e. open a parenthesis) before the word. Cf. Diogenes ad loc.: The mistake continues when it attempts to close the parenthetical remark at the end of the line: laqe/sqai). I think the correct codes for parentheses are "[1" and "]1". Dion. 1.129 is trying to open and close a quotation embedded within direct speech (i.e. a quotation within a quotation). It appears to be using the diacritic as a single quotation mark.
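The ordering rule described above (capital marker first, then breathing, then accent) can be sketched in Python. This is only an illustration, not code from tei2csv: the function name `fix_capital_order` is hypothetical, the misordered inputs in the usage note are invented, and it assumes the only marks involved are breathings `)` `(` and accents `/` `\` `=`. It also treats any parenthesis next to `*` as a breathing, so it cannot tell a misplaced breathing from a real opening parenthesis.

```python
import re

# Assumed mark inventory: breathings ")" "(" and accents "/" "\" "=".
BREATHINGS = ")("
ACCENTS = "/\\="

# Marks before and/or after the capital marker "*", then the letter itself.
_CAPITAL = re.compile(r"([()/\\=]*)\*([()/\\=]*)([a-z])")


def fix_capital_order(word):
    """Rewrite capital-letter sequences as '*' + breathing + accent + letter."""

    def repl(match):
        marks = match.group(1) + match.group(2)
        breathing = "".join(c for c in marks if c in BREATHINGS)
        accent = "".join(c for c in marks if c in ACCENTS)
        return "*" + breathing + accent + match.group(3)

    return _CAPITAL.sub(repl, word)
```

Under these assumptions, a hypothetical input such as `)=*wfilai` or `*=)wfilai` would be rewritten to `*)=wfilai`, while an already correct `*)=wfilai` is left unchanged. Any automatic rewrite of this kind would still need a human check wherever a parenthesis might be punctuation.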
It's disheartening to see these errors in the source. It would not be so bad if not for the fact that we cannot detect all of them automatically—only the ones where they happen to trip up some other checking rule. We could attempt to infer, for example, that
There's no simple way (short of having a human expert look at it) to know whether it is supposed to represent (αλφα βετἀ γαμμα) or (αλφα βετα) γαμμἀ. Anyway, here's the current complete list of lines that cause beta code errors:
Beta codes fixes in argonautica.xml (#9)
Beta code fixes in nonnusdionysiaca.xml (#9)
Beta code fixes in quintussmyrnaeus.xml (#9)
Beta code fixes in theocritus.xml (#9)
Agreed. A lot of the errors in Dion. arise from an attempt to encode quotations within quotations. It seems the encoder used parentheses as single quotation marks to indicate direct speech within direct speech, e.g. Dion. 1.129. I wasn't sure how to encode this properly; the TLG Beta Code Manual (http://stephanus.tlg.uci.edu/encoding/BCM.pdf) seems to suggest code that Perseus Beta Code does not use (e.g. the bigram "3 encodes a single left quotation). It appears I have two options: 1) use , which I think would confuse things since it is within a quote; 2) replace "(" and ")" with an apostrophe (cf. Dion. 48.559 h)io/nes *na/coio, boh/sate: 'numfi/e *qhseu=,). I went with option 2. Ideally there would be specific code for direct quotations embedded within direct quotations; such information would be valuable for other research interests, especially in narratology. Most if not all of the errors in Theocritus were incorrect sequences of capitalization, breathing, and accent; apparently it was encoded with a different notion of the correct order.
Quotations don't have to be represented with beta code only. TEI has a
Every file can be processed now. In eb9c9d8 I added the new texts to Makefile and make.sh. Here are the line number warnings from processing the new files. For whatever reason, a lot of lines are out of order in the XML file compared to their stated numbers. (It's not just the files that are new in this ticket; the ones in the older corpus have it as well.)
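For readers who want to reproduce this kind of warning, here is a minimal Python sketch of an ordering check. It is not the check tei2csv performs: the function name `out_of_order` is hypothetical, and it assumes line numbers have already been extracted as plain integers in document order (real line labels may not be purely numeric).

```python
def out_of_order(line_numbers):
    """Return (index, previous, current) for each number that does not increase."""
    warnings = []
    prev = None
    for i, n in enumerate(line_numbers):
        # A line is flagged when its stated number is not greater than
        # the number of the line immediately before it in the file.
        if prev is not None and n <= prev:
            warnings.append((i, prev, n))
        prev = n
    return warnings
```

Each line is compared only with its immediate predecessor, so one misplaced line typically produces a single warning rather than flagging everything after it.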
#8 added new texts, but their format differs slightly from the previously existing ones and tei2csv cannot handle them yet.
As an example, some of the new texts use lowercase
<div type="book" ...>
instead of the uppercase <div type="Book" ...>
that tei2csv currently conservatively checks for.
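One way to tolerate both spellings is to compare the `type` attribute case-insensitively. The Python sketch below is illustrative only and does not reflect how tei2csv is actually written: the helper `book_divs` and the sample document are hypothetical, and it assumes un-namespaced `div` elements.

```python
import xml.etree.ElementTree as ET


def book_divs(root):
    """Return <div> elements whose type is "book", ignoring letter case."""
    return [d for d in root.iter("div") if d.get("type", "").lower() == "book"]


# Hypothetical sample mixing both spellings plus an unrelated div type.
sample = ET.fromstring(
    '<body>'
    '<div type="Book" n="1"/>'
    '<div type="book" n="2"/>'
    '<div type="poem" n="3"/>'
    '</body>'
)
books = [d.get("n") for d in book_divs(sample)]
```

Lowercasing the attribute keeps the check conservative about which element and attribute are examined while accepting either capitalization.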