# Working with XML and TEI

The Text Encoding Initiative (TEI) is one of the longest-standing and most robust digital humanities projects. Thanks to the sustained work of the TEI Consortium, scholars have extensive tools at their disposal for encoding the nuances of textual data in a systematic way. Once a text has been organized and encoded, though, it can be difficult to tell what comes next for your encoded and shared documents. There are a number of possibilities related to display and archiving, but let's take a look at the analytical affordances of such structured data.

TEI is a flavor of XML, and we can use the same tools to access it that we used in the scraping chapter for HTML. First, let's read in some text. We have a TEI-encoded version of Dostoevsky's _The Brothers Karamazov_, [available from Project Gutenberg](http://www.gutenberg.org/browse/authors/d#a314).

In [20]:
%load_ext soup


In [33]:
# import the BeautifulSoup library
from bs4 import BeautifulSoup

# store the filename of the text.
filename = 'corpus/brothers_karamazov.tei'

# read in the contents of the file and store them in a variable called text.
with open(filename, 'r') as fin:
    text = fin.read()

# take the text, turn it into a BeautifulSoup object, and store in a variable called tei.
tei = BeautifulSoup(text, 'xml')

We now have access to the encoded text and can manipulate it in much the same way that we would an HTML file. It is worth noting that knowing exactly _what_ to query for depends to a large degree on knowledge of your object of study. No two TEI files will be formatted or encoded in exactly the same way, so you will need to closely examine your materials early on. Let's take a look at the first part of the TEI file.

In [35]:
tei.teiHeader
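
Beyond the header, one quick way to examine your materials is to survey which tags a document actually uses before writing any queries. The following is a minimal sketch rather than part of the original notebook: the `sample` string is an invented stand-in for the real file, and `tag_counts` is a hypothetical helper name. It assumes the `lxml` package is installed, since BeautifulSoup's `'xml'` parser depends on it.

```python
from collections import Counter

from bs4 import BeautifulSoup

# A tiny, invented TEI-like document standing in for the real file.
sample = """<TEI>
  <teiHeader>
    <fileDesc>
      <publicationStmt><publisher>Sample</publisher></publicationStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><p>First paragraph.</p><p>Second paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def tag_counts(soup):
    """Count how often each tag name occurs in a parsed document (hypothetical helper)."""
    # find_all(True) matches every tag, whatever its name.
    return Counter(tag.name for tag in soup.find_all(True))


sample_tei = BeautifulSoup(sample, 'xml')
counts = tag_counts(sample_tei)

# Print tag names from most to least frequent.
for name, count in counts.most_common():
    print(name, count)
```

Running the same helper on the `tei` object from above should give a rough inventory of the tags in the novel, spelled exactly as the encoders wrote them.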

Note that the source material presented tags to us in camel case, with certain letters capitalized. We had to preserve that same capitalization when querying the TEI for the teiHeader. Neglecting to do so would return no results (BeautifulSoup gives back None when it cannot find a tag):

In [36]:
tei.teiheader

But if we had used a different parser to work with the TEI, a la

BeautifulSoup(text, 'lxml')

instead of

BeautifulSoup(text, 'xml')

the result would have flattened out all of our capitalization, meaning that preserving the capitalization in our queries would have returned nothing! The 'lxml' parser treats its input as HTML, where tag names are case-insensitive, so it converts every tag name to lowercase. The bottom line: know your data. Looking at the TEI would tell us that a teiHeader tag exists with data in it, so it must be there somewhere.
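
To see the difference between the two parsers side by side, here is a minimal sketch that is not part of the original notebook. The `sample` string is an invented stand-in for the real file, and the variable names are illustrative only. It assumes both `bs4` and `lxml` are installed.

```python
import re

from bs4 import BeautifulSoup

# A tiny, invented TEI-like snippet standing in for the real file.
sample = (
    '<TEI><teiHeader><fileDesc><publicationStmt>'
    '<publisher>Sample</publisher>'
    '</publicationStmt></fileDesc></teiHeader></TEI>'
)

# The 'xml' parser keeps tag names exactly as written.
as_xml = BeautifulSoup(sample, 'xml')
# The 'lxml' parser treats the input as HTML and lowercases tag names.
as_html = BeautifulSoup(sample, 'lxml')

print('xml,  teiHeader:', as_xml.teiHeader is not None)
print('xml,  teiheader:', as_xml.teiheader is not None)
print('lxml, teiHeader:', as_html.teiHeader is not None)
print('lxml, teiheader:', as_html.teiheader is not None)

# If you are unsure how a file was parsed, a case-insensitive
# pattern finds the tag under either parser.
any_case = re.compile(r'^teiheader$', re.IGNORECASE)
header_from_xml = as_xml.find(any_case)
header_from_html = as_html.find(any_case)
```

The case-insensitive lookup is only a fallback. Choosing the `'xml'` parser for TEI and matching the capitalization in the source keeps queries predictable.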