# Working with XML and TEI

## XML

In progress

## TEI

The Text Encoding Initiative (TEI) is one of the longest-standing and most robust digital humanities projects. Thanks to the sustained work of the TEI Consortium, scholars have extensive tools at their disposal for encoding the nuances of textual data in a systematic way. Once a text has been organized and encoded, though, it can be difficult to tell what comes next for your encoded and shared documents. There are a number of possibilities related to display and archiving, but let's take a look at the analytical affordances of such structured data.

TEI is a flavor of XML, and we can use the same tools to access it that we used in the scraping chapter for HTML. First let's read in some text. We have a TEI-encoded version of Dostoevsky's _The Brothers Karamazov_, [available from Project Gutenberg](http://www.gutenberg.org/browse/authors/d#a314).

In [2]:
# import the BeautifulSoup library
from bs4 import BeautifulSoup

# store the filename of the text.
filename = 'corpus/brothers_karamazov.tei'

# read in the filename, store it temporarily as a variable called text.
with open(filename, 'r') as fin:
    text = fin.read()

# take the text, turn it into a BeautifulSoup object, and store in a variable called tei.
tei = BeautifulSoup(text, 'xml')

We now have access to the encoded text and can manipulate it in much the same way that we would an HTML file. It is worth noting that knowing exactly _what_ to query for depends to a large degree on knowledge of your object of study. No two TEI files will be formatted or encoded in exactly the same way, so you will need to closely examine your materials early on. Let's take a look at the first part of the TEI file.

In [3]:
tei.teiHeader

Note that the source material presented tags to us in camel case, with certain letters capitalized. We had to preserve that capitalization when querying the TEI for the teiHeader. Neglecting to do so returns nothing:

In [36]:
tei.teiheader

But if we had used a different parser to work with the TEI, à la

BeautifulSoup(text, 'lxml')

instead of

BeautifulSoup(text, 'xml')

the result would have flattened every tag name to lowercase, meaning that preserving the capitalization in our queries would have returned nothing! The bottom line: know your data. Looking at the TEI file itself tells us that a teiHeader tag exists with data in it, so if a query for it comes back empty, the problem lies with the query rather than the data.
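One way to see the difference for yourself is to parse the same small fragment with both parsers and list the tag names each one reports. The sketch below is only an illustration, assuming both `bs4` and `lxml` are installed; the `sample` string is an invented stand-in, not an excerpt from the Dostoevsky file.

```python
from bs4 import BeautifulSoup

# a tiny invented TEI-like fragment (not taken from the real file)
sample = '<TEI><teiHeader><fileDesc><author>Fyodor Dostoevsky</author></fileDesc></teiHeader></TEI>'

# the 'xml' parser keeps tag names exactly as written
xml_soup = BeautifulSoup(sample, 'xml')
# the 'lxml' parser treats the input as HTML and lowercases tag names
html_soup = BeautifulSoup(sample, 'lxml')

# collect the distinct tag names that each parser produced
xml_names = sorted({tag.name for tag in xml_soup.find_all(True)})
html_names = sorted({tag.name for tag in html_soup.find_all(True)})

print(xml_names)
print(html_names)

# each spelling only matches under one of the two parsers
print(xml_soup.teiHeader is not None, xml_soup.teiheader is None)
print(html_soup.teiheader is not None, html_soup.teiHeader is None)
```

Listing the tag names like this before writing any queries is a quick way to learn how a particular file spells its tags under the parser you have chosen.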

Using this knowledge, we could use the TEI tags to build up a workable text. Let's go ahead and read the text in again with the same 'xml' parser. The tags we query from here on (p, q, and foreign) are all lowercase in the source, so capitalization will not trip us up:

In [8]:
# read in the filename, store it temporarily as a variable called text.
with open(filename, 'r') as fin:
    text = fin.read()

# take the text, turn it into a BeautifulSoup object, and store in a variable called tei.
tei = BeautifulSoup(text, 'xml')
# find every p (paragraph) tag in the document.
paragraphs = tei.find_all('p')
# look at the paragraph at index 100.
paragraphs[100]

Using this same approach we could pull out all the text of those paragraphs, stripping away the tags using the `.text` attribute available on BeautifulSoup tags.

In [37]:
text_of_paragraphs = [paragraph.text for paragraph in paragraphs]
text_of_paragraphs[100]

'\nFyodor Pavlovitch, for the last time, your compact, do you\nhear? Behave properly or I will pay you out! Miüsov had time to\nmutter again.\n'
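The extracted string still carries the line breaks of the source file. If those get in the way of later analysis, one common option is to collapse every run of whitespace into a single space. The helper name `normalize_whitespace` below is hypothetical, and this is just one possible cleanup step, not part of the original workflow.

```python
def normalize_whitespace(raw):
    """Collapse runs of spaces and newlines into single spaces and trim the ends."""
    return ' '.join(raw.split())


# the paragraph text shown in the output above
raw = '\nFyodor Pavlovitch, for the last time, your compact, do you\nhear? Behave properly or I will pay you out! Miüsov had time to\nmutter again.\n'

clean = normalize_whitespace(raw)
print(clean)
```

The same helper could be applied to every paragraph with a list comprehension such as `[normalize_whitespace(t) for t in text_of_paragraphs]`.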

Of course, the whole point of TEI is that we actually care about the ways that the tags are interacting with the text itself. In our previous example, the encoder used the `<q>` tag to mark [quoted material](http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-q.html). We could use this tag to pull out every piece of text the encoder marked this way, from direct speech to single words held at a rhetorical distance:

In [38]:
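# get the text of every q (quoted material) tag
quotes = [q.text for q in tei.find_all('q')]
# count the quotations (only the last expression in a cell is displayed, so this value is not shown)
len(quotes)
# print the first nine quotations
print(quotes[0:9])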

['landowner', 'romantic', "One would think that you'd got a promotion, Fyodor Pavlovitch,\nyou seem so pleased in spite of your sorrow,", 'Lord, now lettest Thou Thy\nservant depart in peace,', 'clericals.', 'Those innocent eyes slit my soul up like a razor,', 'from the halter,', 'wronged', 'possessed\nby devils.']


The number of things that you'll be able to pull out of any particular text depends, ultimately, on the encoding itself. Marking a text up at all, but especially with TEI, is a deeply interpretive act. You'll want to look closely at your encoded text to get a sense of the options available to you. With the example Dostoevsky text we could do many things, but the most basic might involve looking at the attributes of a tag to get a clearer sense of particular pieces of text.

In [39]:
import nltk

# find all instances of language marked as foreign
foreign_text = tei.find_all('foreign')
# look at the lang attribute of the first tag, where the encoder has stored information about the language of the text.
foreign_text[0].get('lang')
# get the lang attribute of every tag
language_markings = [instance.get('lang') for instance in foreign_text]
# get a set consisting of all the unique language flags.
set(language_markings)
# count how often each language flag occurs
nltk.FreqDist(language_markings)

FreqDist({'fr': 54, 'la': 5, None: 83, 'de': 11})

There are only three languages marked for the text. Of those markings, French phrases far outnumber the German or Latin ones. But 'None', meaning a foreign tag with no lang attribute at all, is more frequent still. This might be an opportunity to clean up the TEI, as we could go through and assign language categories to those tags manually. This is a good reminder that the results of your text analysis should never be taken for granted. They're the results of human intervention, interpretation, and error all the way down.
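To put a number on how much of the markup lacks a language flag, the same tally can be made with `collections.Counter` from the standard library. This is a minimal sketch that rebuilds the list of flags from the counts reported above rather than from the TEI file itself, so it runs on its own; the variable names `unmarked_share` and `marked` are illustrative.

```python
from collections import Counter

# rebuild the list of language flags from the counts reported above
language_markings = ['fr'] * 54 + ['la'] * 5 + [None] * 83 + ['de'] * 11

counts = Counter(language_markings)
total = sum(counts.values())

# share of foreign tags that carry no lang attribute
unmarked_share = counts[None] / total

# languages that were actually marked, most frequent first
marked = [(lang, n) for lang, n in counts.most_common() if lang is not None]

print(total)
print(round(unmarked_share, 2))
print(marked)
```

With these counts, just over half of the 153 foreign tags carry no language flag at all, which gives a sense of how much manual cleanup the file would need.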

We might use this information to pull out all the text of a particular language:

In [36]:
french_snippets = [instance.text for instance in foreign_text if instance.get('lang') == 'fr']
french_snippets[0:9]

['Il faudrait les\ninventer',
 "J'ai bu l'ombre d'un\ncocher qui avec l'ombre d'une brosse frottait l'ombre d'une carrosse.",
 'un\nchevalier parfait',
 'chevalier',
 'arrière-pensée',
 "coup d'état",
 'poseurs',
 'plus de noblesse que de sincérité',
 'plus de\nsincérité que de noblesse']

Here we go through the text and use the 'lang' attribute to check whether a particular snippet is in the language we care about, French in this case. If it is, we keep its text.
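The same check can be generalized so that every language is collected in one pass rather than one filter per language. The sketch below assumes `bs4` and `lxml` are installed and runs on an invented fragment: the two French phrases echo the output above, while the surrounding sentences and the Latin and unmarked phrases are made up for illustration. The name `snippets_by_language` is hypothetical.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# an invented fragment; only the French phrases come from the output above
sample = (
    '<body>'
    '<p>He was a <foreign lang="fr">chevalier</foreign> to the end.</p>'
    '<p>They spoke of a <foreign lang="fr">coup d\'état</foreign> and of '
    '<foreign lang="la">pro et contra</foreign>.</p>'
    '<p>An unmarked phrase: <foreign>nota bene</foreign>.</p>'
    '</body>'
)
soup = BeautifulSoup(sample, 'xml')

# map each language flag (or None when the attribute is missing) to its snippets
snippets_by_language = defaultdict(list)
for instance in soup.find_all('foreign'):
    snippets_by_language[instance.get('lang')].append(instance.text)

print(dict(snippets_by_language))
```

Keeping the `None` group alongside the marked languages makes it easy to review the untagged snippets later and assign them a language by hand.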