# Encoding Du Bois's Epigraphs
 
For today's lesson, we're going to think about how encoding affects how we read different kinds of texts.

We'll look at what we can do with a particular kind of encoding technique: eXtensible Markup Language (XML).


## Character and file encoding

A quick note before we read in our file. Last week, Nati asked why we specify `encoding='utf-8'` when we open our text file.

Well, UTF-8 is a character encoding (one way of storing Unicode text as bytes). We need to specify a character encoding because (gasp!) computers don't actually know what text is. Character encodings are systems that map characters to numbers: each character is given a specific ID number, and the encoding determines how that number is stored. This way, computers can read and represent characters.

UTF-8 is part of an encoding system called Unicode. Before 2007, the most common form of encoding on the web was ASCII (which was first published in 1963 in the US and had limited support for characters in other languages). Unicode is the information standard for encoding and representing text from the world's writing systems.

> For more on the limits of Unicode's universal encoding standard, see Aditya Mukerjee's ["I Can Text You a Pile of Poo But I Can't Write My Name" (2015)](https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name). Since 2015, Unicode has expanded to include some of the omissions that Mukerjee mentions (there is now support for Bengali characters), but it is still far from a universal representation of characters in writing systems.

You can check any character's "code point," or place in the Unicode universe, with the function `ord()`.

In [123]:
ord('h')

104

In [124]:
ord('🐈')

128008
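
To connect those code points back to characters and to UTF-8, here is a small sketch that uses only built-in Python. The variable names (`cat_bytes` and so on) are made up for illustration; `chr()`, `hex()`, and `.encode()` are standard Python.

```python
# chr() is the inverse of ord(): it turns a code point back into a character
letter = chr(104)
cat = chr(128008)
print(letter, cat)

# Code points are usually written in hexadecimal (this one is "U+1F408")
cat_hex = hex(ord('🐈'))
print(cat_hex)

# .encode() shows the bytes that UTF-8 actually stores for each character.
# 'h' fits in one byte, while the cat emoji needs four.
letter_bytes = 'h'.encode('utf-8')
cat_bytes = '🐈'.encode('utf-8')
print(letter_bytes, len(letter_bytes))
print(cat_bytes, len(cat_bytes))
```

So the code point is the character's ID number, and UTF-8 is one particular way of writing that number down as bytes in a file.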

In [128]:
sample_text_default = open('sample-text.txt', encoding='utf-8').read()
print(sample_text_default)

***
This is an example of curly quotation marks:
“She said, ‘I won’t bungle the encoding!’”
***

***
This is an example of an emoji:
🐈
***

***This is an example of Russian:
Говорили, что на набережной появилось новое лицо: дама с собачкой.
(It was said that a new person had appeared on the sea-front: a lady with a little dog.)

***
This is an example of Chinese:
如果我们想学习中文短篇小说怎么办？
(What if we want to study Chinese short stories?)
***


Let's try opening our sample text file using an older, single-byte encoding standard that extends ASCII, like ['iso-8859-1'](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) (also known as Latin-1).

In [130]:
sample_text_iso = open('sample-text.txt', encoding='iso-8859-1').read()
print(sample_text_iso)

***
This is an example of curly quotation marks:
âShe said, âI wonât bungle the encoding!ââ
***

***
This is an example of an emoji:
ð
***

***This is an example of Russian:
ÐÐ¾Ð²Ð¾ÑÐ¸Ð»Ð¸, ÑÑÐ¾ Ð½Ð° Ð½Ð°Ð±ÐµÑÐµÐ¶Ð½Ð¾Ð¹ Ð¿Ð¾ÑÐ²Ð¸Ð»Ð¾ÑÑ Ð½Ð¾Ð²Ð¾Ðµ Ð»Ð¸ÑÐ¾: Ð´Ð°Ð¼Ð° Ñ ÑÐ¾Ð±Ð°ÑÐºÐ¾Ð¹.
(It was said that a new person had appeared on the sea-front: a lady with a little dog.)

***
This is an example of Chinese:
å¦ææä»¬æ³å­¦ä¹ ä¸­æç­ç¯å°è¯´æä¹åï¼
(What if we want to study Chinese short stories?)
***


What do you notice?
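
One way to see where the garbled characters come from is to reproduce the mix-up on a single word. This is only a sketch: the sample word and the variable names are invented for illustration, and it doesn't need our text file.

```python
original = 'won’t'

# Step 1: UTF-8 stores the curly apostrophe (’) as three bytes
utf8_bytes = original.encode('utf-8')
print(utf8_bytes)

# Step 2: ISO-8859-1 assumes one byte per character,
# so those three bytes are read as three separate characters
garbled = utf8_bytes.decode('iso-8859-1')
print(len(original), len(garbled))

# Step 3: undoing the wrong step (encode as ISO-8859-1, decode as UTF-8)
# recovers the original text, because no bytes were lost along the way
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired == original)
```

The bytes in the file never changed; only our interpretation of them did.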

----

## Reading and encoding the text of *The Souls of Black Folk*: facsimile, OCR, XML 

What if we wanted to work with the epigraphs in a text like W.E.B. Du Bois's *The Souls of Black Folk* (1903)?

How would we go about identifying just the epigraphs in our text?
 
### HathiTrust's facsimile 1903 copy of *The Souls of Black Folk*

https://babel.hathitrust.org/cgi/pt?id=hvd.32044010329985&view=1up&seq=20&skin=2021

![image](../_images/du-bois-epigraph1.png)

### OCR (optical character recognition)-generated text for HathiTrust's 1903 copy of *The Souls of Black Folk*
https://babel.hathitrust.org/cgi/ssd?id=hvd.32044010329985;seq=20

![image](../_images/du-bois-epigraph1-OCR.png)


Anything we notice?

What might make this text difficult to work with if we wanted to examine Du Bois's epigraphs?

### XML: Documenting the American South's TEI edition of *The Souls of Black Folk*


The UNC Chapel Hill project [Documenting the American South (DocSouth)](https://docsouth.unc.edu/index.html) has digitized and encoded a number of primary sources, including [*The Souls of Black Folk* (1903)](https://docsouth.unc.edu/church/duboissouls/menu.html).

Here's what [the XML version of *The Souls of Black Folk*](https://docsouth.unc.edu/church/duboissouls/dubois.xml) looks like:

![image](../_images/du-bois-xml.png)

We're going to use this edition because its editors and transcribers have gone through and hand-encoded various characteristics of the text: things like page breaks and chapter headings, but also genre shifts (between prose and verse or song) and the presence of foreign languages in this English-language text. Most importantly for our purposes, they've marked different elements of the text, like "title," "chapter," and "epigraph."

## Parsing XML with Python
To make use of the rich metadata encoded into an XML file, we're going to use a Python library called [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/#) to extract data from a text file that has been "marked up" with additional tags to identify the different parts of the text. 

Let's set up `BeautifulSoup`. I've downloaded the [XML file from DocSouth](https://docsouth.unc.edu/church/duboissouls/dubois.xml) to our course files and saved it as "du-bois-the-soul-of-black-folk.xml".

In [None]:
# Run the line below to install the XML parser, lxml, and the Python library BeautifulSoup4
!pip install lxml beautifulsoup4 

In [131]:
# import Beautiful Soup

from bs4 import BeautifulSoup

# Define the filepath to our TEI-encoded XML file
tei_doc = '../_datasets/texts/literature/du-bois-the-soul-of-black-folk.xml'

# Read in our XML file as a BeautifulSoup object using the `open()` function to read it:
with open(tei_doc, 'r', encoding='utf-8') as tei:
    soup = BeautifulSoup(tei, 'lxml')


What have we just created? We've just created a `BeautifulSoup` object that we've called "soup." 

### Getting the names of all the tags that appear in the XML document

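BeautifulSoup has a function called `.find_all()` that lets us find all appearances of a tag; when it is called with no arguments, it returns every tag in the document. Because the `'lxml'` parser we chose treats the file as HTML, the list will begin with `html` and `body` tags that are not in the TEI file, and all tag names will be lowercased. Here, we're going to use a `for` loop to iterate over all the tags that appear in the document and print the names of those tags (not the text they enclose):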

In [132]:
for tag in soup.find_all():
    print(tag.name)

html
body
tei.2
teiheader
filedesc
titlestmt
title
emph
author
funder
respstmt
resp
name
respstmt
resp
name
respstmt
resp
name
editionstmt
edition
date
extent
publicationstmt
publisher
pubplace
date
availability
p
a
sourcedesc
biblfull
titlestmt
title
title
author
editionstmt
edition
extent
publicationstmt
pubplace
publisher
date
authority
notesstmt
note
encodingdesc
projectdesc
p
hi
editorialdecl
p
p
p
p
p
p
p
p
p
p
classdecl
taxonomy
bibl
title
edition
profiledesc
langusage
language
language
language
language
language
textclass
keywords
list
item
item
item
item
revisiondesc
change
date
respstmt
name
resp
item
change
date
respstmt
name
resp
item
change
date
respstmt
name
resp
item
change
date
respstmt
name
resp
item
text
front
div1
p
figure
p
div1
p
figure
p
div1
p
titlepage
doctitle
titlepart
lb
titlepart
byline
docauthor
docedition
docimprint
pubplace
publisher
docdate
pb
docimprint
lb
publisher
docdate
docedition
lb
docimprint
lb
div1
pb
p
lb
lb
div1
pb
p
p
p
pb
p
p
hi
hi
p
closer
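
A list this long is hard to scan, so one option is to count how often each tag name appears. The sketch below is an illustration rather than part of the lesson's notebook: it runs on a tiny made-up XML snippet (`sample_xml`) and uses Python's built-in `'html.parser'` so that it works without our course file. With the real document you would use `soup` in place of `sample_soup`.

```python
from collections import Counter
from bs4 import BeautifulSoup

# A tiny, invented stand-in for the TEI file
sample_xml = """
<text>
  <div1>
    <epigraph><l>First made-up line</l><l>Second made-up line</l><bibl>A POET.</bibl></epigraph>
    <p>Some prose.</p>
  </div1>
</text>
"""
sample_soup = BeautifulSoup(sample_xml, 'html.parser')

# Count every tag name, then print the names from most to least common
tag_counts = Counter(tag.name for tag in sample_soup.find_all())
for name, count in tag_counts.most_common():
    print(name, count)
```

A count like this makes it easier to spot which tags (such as `epigraph`) are worth a closer look.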


### Extracting all the contents of a particular tag

Are there any tags that we want to look at?

Let's look at the "epigraph" tag! To call up the first appearance of a tag in a `BeautifulSoup` object, we write the name of our object, a period (`.`), and the name of the tag, like so:

In [39]:
soup.epigraph

<epigraph>
<q direct="unspecified">
<lg type="verse">
<lg type="stanza">
<l>O water, voice of my heart, crying in the sand,</l>
<l>All night long crying with a mournful cry,</l>
<l>As I lie and listen, and cannot understand</l>
<l>The voice of my heart in my side or the voice of the sea,</l>
<l>O water, crying for rest, is it I, is it I?</l>
<l>All night long the water is crying to me.</l>
</lg>
<lg type="stanza">
<l>Unresting water, there shall never be rest</l>
<l>Till the last moon droop and the last tide fail,</l>
<l>And the fire of the end begin to burn in the west;</l>
<l>And the heart shall be weary and wonder and cry like the sea,</l>
<l>All life long crying without avail,</l>
<l>As the water all night long is crying to me.</l>
</lg>
</lg>
<bibl>ARTHUR SYMONS.</bibl>
</q>
</epigraph>

### Getting just the text enclosed by a particular tag (stripping away all other XML tags)

Great! We have the first epigraph. But what if we just wanted to see the text, without any of the markup? We can use the `print()` and `.get_text()` functions.

In [133]:
print(soup.epigraph.get_text())





O water, voice of my heart, crying in the sand,
All night long crying with a mournful cry,
As I lie and listen, and cannot understand
The voice of my heart in my side or the voice of the sea,
O water, crying for rest, is it I, is it I?
All night long the water is crying to me.


Unresting water, there shall never be rest
Till the last moon droop and the last tide fail,
And the fire of the end begin to burn in the west;
And the heart shall be weary and wonder and cry like the sea,
All life long crying without avail,
As the water all night long is crying to me.


ARTHUR SYMONS.
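
The blank lines in that output come from the line breaks between tags in the XML. If they get in the way, here are two possible ways to tidy them up, sketched on a made-up epigraph (`sample_xml`) and parsed with the built-in `'html.parser'` so the example runs on its own. The `separator` and `strip` arguments are part of BeautifulSoup's `.get_text()`.

```python
from bs4 import BeautifulSoup

# An invented epigraph with the same shape as the ones in our file
sample_xml = """<epigraph>
<lg type="stanza">
<l>First made-up line,</l>
<l>Second made-up line.</l>
</lg>
<bibl>A POET.</bibl>
</epigraph>"""
sample_soup = BeautifulSoup(sample_xml, 'html.parser')

# Option 1: collect the text of each <l> (line of verse) tag separately
lines = [line.get_text(strip=True) for line in sample_soup.epigraph.find_all('l')]
print('\n'.join(lines))

# Option 2: ask .get_text() to drop whitespace-only pieces
# and join what is left with a single newline
compact = sample_soup.epigraph.get_text(separator='\n', strip=True)
print(compact)
```

Option 1 keeps only the verse; Option 2 keeps everything inside the epigraph, including the attribution.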




### Getting all appearances of a tag
Great! But this is only the first epigraph in the essay collection. What if we wanted *all* the epigraphs?

There are two ways. 

**Method 1: We could use the `.select()` function, which is good for selecting all appearances of a tag.** 

Here's an example: extracting all appearances of the `<bibl>` tag that appear under an `<epigraph>` tag. 

In [53]:
soup.select("epigraph bibl")

[<bibl>ARTHUR SYMONS.</bibl>,
 <bibl>Lowell.</bibl>,
 <bibl>Byron.</bibl>,
 <bibl>SCHILLER.</bibl>,
 <bibl>WHITTIER.</bibl>,
 <bibl>OMAR KHAYYÁM (FITZGERALD).</bibl>,
 <bibl>THE SONG OF SOLOMON.</bibl>,
 <bibl>WILLIAM VAUGHN MOODY.</bibl>,
 <bibl>MRS. BROWNING.</bibl>,
 <bibl>FIONA MACLEOD.</bibl>,
 <bibl>SWINBURNE.</bibl>,
 <bibl>TENNYSON.</bibl>,
 <bibl>MRS. BROWNING.</bibl>,
 <bibl>NEGRO SONG.</bibl>]
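
Since `.select()` gives us a list of tags, we can pull out just the names with a list comprehension. Here is a sketch on invented sample data (`sample_xml`, parsed with the built-in `'html.parser'`); it also shows that a `<bibl>` outside an `<epigraph>` is left out.

```python
from bs4 import BeautifulSoup

# Invented sample: two epigraphs, plus one <bibl> that is NOT inside an epigraph
sample_xml = """
<div1>
  <note><bibl>NOT AN EPIGRAPH.</bibl></note>
  <epigraph><q><l>A made-up line.</l><bibl>A POET.</bibl></q></epigraph>
  <epigraph><q><l>Another made-up line.</l><bibl>ANOTHER POET.</bibl></q></epigraph>
</div1>
"""
sample_soup = BeautifulSoup(sample_xml, 'html.parser')

# "epigraph bibl" means: every <bibl> tag nested anywhere inside an <epigraph> tag
attributions = [bibl.get_text() for bibl in sample_soup.select('epigraph bibl')]
print(attributions)
```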

**Method 2: We could write a `for` loop and we could use the BeautifulSoup `.find_all()` function to find all appearances of a particular tag.** 

Here's an example using a `for` loop to get all the appearances of the epigraph tag and its contents.

In [134]:
for entry in soup.find_all('epigraph'):
    print(entry)

<epigraph>
<q direct="unspecified">
<lg type="verse">
<lg type="stanza">
<l>O water, voice of my heart, crying in the sand,</l>
<l>All night long crying with a mournful cry,</l>
<l>As I lie and listen, and cannot understand</l>
<l>The voice of my heart in my side or the voice of the sea,</l>
<l>O water, crying for rest, is it I, is it I?</l>
<l>All night long the water is crying to me.</l>
</lg>
<lg type="stanza">
<l>Unresting water, there shall never be rest</l>
<l>Till the last moon droop and the last tide fail,</l>
<l>And the fire of the end begin to burn in the west;</l>
<l>And the heart shall be weary and wonder and cry like the sea,</l>
<l>All life long crying without avail,</l>
<l>As the water all night long is crying to me.</l>
</lg>
</lg>
<bibl>ARTHUR SYMONS.</bibl>
</q>
</epigraph>
<epigraph>
<q direct="unspecified">
<lg type="verse">
<l>Careless seems the great Avenger;</l>
<l>History's lessons but record</l>
<l>One death-grapple in the darkness</l>
<l>'Twixt old systems and

### Getting all appearances of a tag (just the text)

What makes the `for` loop method handy is that you can easily apply other functions to the output, like the `.get_text()` function. 

Here's an example of a `for` loop that iterates over all the epigraphs to print out the full text:

In [41]:
for entry in soup.find_all('epigraph'):
    print(entry.get_text())





O water, voice of my heart, crying in the sand,
All night long crying with a mournful cry,
As I lie and listen, and cannot understand
The voice of my heart in my side or the voice of the sea,
O water, crying for rest, is it I, is it I?
All night long the water is crying to me.


Unresting water, there shall never be rest
Till the last moon droop and the last tide fail,
And the fire of the end begin to burn in the west;
And the heart shall be weary and wonder and cry like the sea,
All life long crying without avail,
As the water all night long is crying to me.


ARTHUR SYMONS.





Careless seems the great Avenger;
History's lessons but record
One death-grapple in the darkness
'Twixt old systems and the Word;
Truth forever on the scaffold,
Wrong forever on the throne;
Yet that scaffold sways the future,
And behind the dim unknown
Standeth God within the shadow
Keeping watch above His own.

Lowell.





From birth till death enslaved; in word, in deed, unmanned!

Hereditary bondsmen!

### Using a `for` loop to extract the contents of a tag and add them to a list

In [47]:
# create an empty list called "epigraphs"
epigraphs = []

# use a `for` loop to iterate over all the epigraph tags in our document
# and add each tag (with its contents) to our list, "epigraphs"
for entry in soup.find_all('epigraph'):
    epigraphs.append(entry)
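
The same pattern can also be written as a list comprehension, which some people find easier to read. This is a sketch on made-up sample data (`sample_xml`, parsed with the built-in `'html.parser'`) rather than our course file; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample with two short epigraphs
sample_xml = """
<div1>
  <epigraph><l>A made-up line.</l><bibl>A POET.</bibl></epigraph>
  <epigraph><l>Another made-up line.</l><bibl>ANOTHER POET.</bibl></epigraph>
</div1>
"""
sample_soup = BeautifulSoup(sample_xml, 'html.parser')

# The loop version: start with an empty list and append each tag
epigraphs_loop = []
for entry in sample_soup.find_all('epigraph'):
    epigraphs_loop.append(entry)

# The list comprehension version: the same result in one line
epigraphs_comp = [entry for entry in sample_soup.find_all('epigraph')]

# Both lists hold the same tags, and len() tells us how many we found
print(len(epigraphs_comp))
print(epigraphs_loop == epigraphs_comp)
```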

In [63]:
# Let's look at the first entry of our new list
epigraphs[0]

<epigraph>
<q direct="unspecified">
<lg type="verse">
<lg type="stanza">
<l>O water, voice of my heart, crying in the sand,</l>
<l>All night long crying with a mournful cry,</l>
<l>As I lie and listen, and cannot understand</l>
<l>The voice of my heart in my side or the voice of the sea,</l>
<l>O water, crying for rest, is it I, is it I?</l>
<l>All night long the water is crying to me.</l>
</lg>
<lg type="stanza">
<l>Unresting water, there shall never be rest</l>
<l>Till the last moon droop and the last tide fail,</l>
<l>And the fire of the end begin to burn in the west;</l>
<l>And the heart shall be weary and wonder and cry like the sea,</l>
<l>All life long crying without avail,</l>
<l>As the water all night long is crying to me.</l>
</lg>
</lg>
<bibl>ARTHUR SYMONS.</bibl>
</q>
</epigraph>

In [66]:
# Print out just the text
for entry in epigraphs:
    print(entry.get_text())





O water, voice of my heart, crying in the sand,
All night long crying with a mournful cry,
As I lie and listen, and cannot understand
The voice of my heart in my side or the voice of the sea,
O water, crying for rest, is it I, is it I?
All night long the water is crying to me.


Unresting water, there shall never be rest
Till the last moon droop and the last tide fail,
And the fire of the end begin to burn in the west;
And the heart shall be weary and wonder and cry like the sea,
All life long crying without avail,
As the water all night long is crying to me.


ARTHUR SYMONS.





Careless seems the great Avenger;
History's lessons but record
One death-grapple in the darkness
'Twixt old systems and the Word;
Truth forever on the scaffold,
Wrong forever on the throne;
Yet that scaffold sways the future,
And behind the dim unknown
Standeth God within the shadow
Keeping watch above His own.

Lowell.





From birth till death enslaved; in word, in deed, unmanned!

Hereditary bondsmen!
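
Once the epigraphs are in a list, a possible next step is to separate each epigraph's verse from its attribution. The sketch below stores them as dictionaries. It uses invented sample data (`sample_xml`, parsed with the built-in `'html.parser'`), and the keys `'attribution'` and `'lines'` are names chosen just for this example. It also assumes that some epigraphs might have no `<bibl>` tag, and records `None` in that case.

```python
from bs4 import BeautifulSoup

# Invented sample: the second epigraph has no <bibl> tag
sample_xml = """
<div1>
  <epigraph><l>A made-up line,</l><l>and a second one.</l><bibl>A POET.</bibl></epigraph>
  <epigraph><l>An unattributed line.</l></epigraph>
</div1>
"""
sample_soup = BeautifulSoup(sample_xml, 'html.parser')

records = []
for entry in sample_soup.find_all('epigraph'):
    # .find() returns the first matching tag, or None if there isn't one
    bibl = entry.find('bibl')
    verse = [line.get_text(strip=True) for line in entry.find_all('l')]
    records.append({
        'attribution': bibl.get_text(strip=True) if bibl else None,
        'lines': verse,
    })

for record in records:
    print(record['attribution'], len(record['lines']))
```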