Many NLP pipelines and tools expect plain text (.txt) input. If your text is only available in an XML- or TEI-tagged version, the following code strips the tags and writes a text-only version of the document to a plain text file.

The following example uses Alexander Pope's translation of Homer's Iliad, [available from Project Gutenberg](https://www.gutenberg.org/ebooks/6130), stored in the corpus directory as iliad.tei.

In [1]:
# import the Beautiful Soup library
from bs4 import BeautifulSoup

# store the tagged text's filepath 
filename = 'corpus/iliad.tei'

# read in the file and store its contents in a variable called raw_text
with open(filename, 'r') as file_in:
    raw_text = file_in.read()

# take the raw_text, turn it into a BeautifulSoup object, and store it in a variable called soup
soup = BeautifulSoup(raw_text, 'lxml')

# pull all of the text from the BeautifulSoup object and store it as a variable called processed_tei
processed_tei = soup.text

# name the new file and write it 
with open('new_file.txt', 'w') as file_out:
    file_out.write(processed_tei)
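
Note that soup.text returns every text node in the document, so for a TEI file the metadata inside the header is written out along with the body. If that is not wanted, one option is to remove the header element before extracting the text. The following is a minimal sketch and not part of the original workflow: the sample string is invented for illustration, and it uses Python's built-in 'html.parser' so that it runs without lxml. This parser lowercases tag names, so the header is looked up as 'teiheader'.

```python
from bs4 import BeautifulSoup

# a tiny invented TEI document: a header with metadata and a one-line body
sample = (
    "<TEI><teiHeader><fileDesc><titleStmt><title>The Iliad</title>"
    "</titleStmt></fileDesc></teiHeader>"
    "<text><body><p>Achilles' wrath, to Greece the direful spring</p>"
    "</body></text></TEI>"
)

soup = BeautifulSoup(sample, 'html.parser')

# find the header (tag names are lowercased by this parser) and remove it
header = soup.find('teiheader')
if header is not None:
    header.decompose()

# only the text outside the header remains
body_text = soup.get_text()
print(body_text)
```

Whether the header should be kept depends on the analysis; for word counts or topic models it usually adds noise, but check your own files before discarding it.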

The following loops over a directory of tagged documents (here named file_in) and writes each one as a plain text file to a file_out directory. The script does not create file_out, so you must create that directory before running it.

In [3]:
import os
from bs4 import BeautifulSoup

def all_files(folder_name):
    """Return the paths of all non-hidden files under folder_name."""
    texts = []
    for (root, _, files) in os.walk(folder_name):
        for fn in files:
            # skip hidden files such as .DS_Store
            if fn.startswith('.'):
                continue
            texts.append(os.path.join(root, fn))
    return texts

corpus_input = 'file_in'
fns = all_files(corpus_input)

corpus_output = 'file_out'

all_texts = []

for fn in fns:
    # read the tagged file and strip its tags
    with open(fn, 'r') as file_in:
        raw_text = file_in.read()
    soup = BeautifulSoup(raw_text, 'lxml')
    text = soup.text
    all_texts.append(text)
    # swap the original extension, whatever its length, for .txt
    base_name = os.path.splitext(os.path.basename(fn))[0]
    with open(os.path.join(corpus_output, base_name + '.txt'), 'w') as file_out:
        file_out.write(text)

# optional: print how many files were converted (0 means no files were found in file_in)
print(len(all_texts))

0
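
The loop above depends on two file-handling details: the output directory has to exist, and each output name is the input name with its extension replaced by .txt. Both can be handled with the standard library. The following is a sketch under assumptions rather than part of the original script: the helper name output_path_for is hypothetical, the file names are made up, and the demonstration runs in a temporary directory so that it does not touch your corpus.

```python
import os
import tempfile

def output_path_for(input_path, output_dir):
    """Hypothetical helper: return the .txt path in output_dir for input_path."""
    # create the output directory if it is missing; do nothing if it exists
    os.makedirs(output_dir, exist_ok=True)
    # splitext separates the final extension from the rest of the file name
    base_name, _extension = os.path.splitext(os.path.basename(input_path))
    return os.path.join(output_dir, base_name + '.txt')

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, 'file_out')
    tei_path = output_path_for('file_in/iliad.tei', out_dir)
    html_path = output_path_for('file_in/odyssey.html', out_dir)
    dir_was_created = os.path.isdir(out_dir)
    tei_name = os.path.basename(tei_path)
    html_name = os.path.basename(html_path)

print(tei_name, html_name)
```

With a helper like this, a four-letter extension such as .html is handled the same way as .tei or .xml, and the directory no longer has to be created by hand.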
