## Extracting Metadata from TEI Text

### Step 1 - Set up Directory and Import Libraries
We will use pathlib to locate files, lxml to parse the TEI XML, and pandas to organize the data.

In [12]:
from pathlib import Path
from lxml import etree
import pandas as pd

home = Path.home()
tei_dir = home / "romantic_poets_project/tei_files"
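
Before going further, it can help to confirm that the TEI folder exists and contains the files we expect. The sketch below is one way to do that; `list_tei_files` is a hypothetical helper (not part of the project), and a temporary folder stands in for tei_files so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def list_tei_files(directory):
    """Return the .xml files in `directory`, sorted by name.

    Hypothetical helper for illustration; raises if the folder is missing.
    """
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"TEI directory not found: {directory}")
    return sorted(directory.glob("*.xml"))


# A throwaway folder stands in for the project's tei_files directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.xml").touch()
    Path(tmp, "a.xml").touch()
    Path(tmp, "notes.txt").touch()  # ignored: not an .xml file
    found = [p.name for p in list_tei_files(tmp)]

print(found)
```

In the notebook itself, the same check would be `list_tei_files(tei_dir)`.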

### Step 2 - Create an Extraction Function
The function parses each file with lxml.etree and then:
1. Declares the TEI namespace so that XPath queries can match the TEI XML elements.
2. Creates a dictionary of metadata keys to populate with values from the TEI file.
3. Uses XPath to search the document for each piece of metadata, storing an empty string when an element is missing.
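
To see why the namespace declaration matters, the steps above can be sketched on a small inline document. The TEI fragment, author name, and ref value below are made up for illustration; only the namespace URI and the XPath to the author element come from the extraction function.

```python
from lxml import etree

# Hypothetical, minimal TEI header used only for illustration.
sample = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample Poem</title>
        <author ref="#sample_author">Sample Author</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""

root = etree.fromstring(sample)
ns = {"tei": "http://www.tei-c.org/ns/1.0"}

# Without a prefix, XPath looks for elements in no namespace and finds nothing.
unprefixed = root.xpath("//author")

# With the prefix mapped to the TEI namespace, the element is found.
prefixed = root.xpath(
    "//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author", namespaces=ns
)

# Populate a dictionary, falling back to "" when nothing matches.
record = {
    "author": prefixed[0].text.strip() if prefixed and prefixed[0].text else "",
    "author_reference": prefixed[0].get("ref", "") if prefixed else "",
}

print(unprefixed)
print(record)
```

The unprefixed query returns an empty list even though the element is present, which is the usual symptom of a missing namespace mapping.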

We are extracting a reduced set of metadata focused on key research details:
- Author
- Poem and collection titles
- Print publication details
- Digital publication details
- Composition and manuscript information, where available

In [130]:
def extract_metadata(file): 
    tree = etree.parse(file)
    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    data = {}

    # file details (key uses an underscore, like the other column names)
    data["file_name"] = Path(file).name
    
    # author details (if several <author> elements are present, the last one is kept)
    ref = ""
    author = ""
    file_author = tree.xpath("//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author", namespaces=ns) 
    for a in file_author:
        ref = a.get("ref", "")
        author = a.text.strip() if a.text else ""         
    data["author"] = author
    data["author_reference"] = ref
 

    # source details 
    coll_titles = tree.xpath("//tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:monogr/tei:title", namespaces=ns)
    coll_title = ""
    coll_subtitle = ""
    poem_title = ""
    for t in coll_titles:
        text = t.text.strip() if t.text else ""
        level = t.get("level")
        ttype = t.get("type") 
        if level == "m" and ttype == "main": 
            coll_title = text
        elif level == "m" and ttype == "sub":
            coll_subtitle = text
        elif level == "a" and ttype == "main":
            poem_title = text
    data["poem_title"] = poem_title
    data["print_collection_title"] = coll_title
    data["print_collection_subtitle"] = coll_subtitle
    
    # print publication details
    pub_editor = tree.xpath("//tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:monogr/tei:editor/text()", namespaces=ns)
    pub_editor = pub_editor[0] if pub_editor else ""
    data["print_editor"] = pub_editor
    
    pub_date = tree.xpath("//tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:monogr/tei:imprint/tei:date/text()", namespaces=ns)
    pub_date = pub_date[0] if pub_date else ""
    data["print_publication_date"] = pub_date
    
    publisher = tree.xpath("//tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:monogr/tei:imprint/tei:publisher/text()", namespaces=ns)
    publisher = publisher[0] if publisher else ""
    data["print_publisher"] = publisher
    
    pub_place = tree.xpath("//tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:monogr/tei:imprint/tei:pubPlace/text()", namespaces=ns)
    pub_place = pub_place[0] if pub_place else ""
    data["print_publication_location"] = pub_place

    # archival details 
    digital_lib = tree.xpath("//tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:monogr/tei:orgName/text()", namespaces=ns)
    digital_lib = digital_lib[0] if digital_lib else ""
    data["digitizing_institution"] = digital_lib
    
    archive_elem = tree.xpath("//tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:biblStruct/tei:note/tei:ptr", namespaces=ns)
    archive_link = archive_elem[0].get("target", "") if archive_elem else ""
    data["digital_archive_link"] = archive_link

    # composition and manuscript details, if available
    orig_date = tree.xpath("//tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:bibl/tei:date/text()", namespaces=ns)
    orig_date = orig_date[0] if orig_date else ""
    data["composition_date"] = orig_date

    composition_notes = tree.xpath("string(//tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:bibl/tei:note)", namespaces=ns).strip()
    data["notes_on_composition"] = composition_notes

    manuscript_loc = tree.xpath("//tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:bibl/tei:orgName/text()", namespaces=ns)
    manuscript_loc = manuscript_loc[0] if manuscript_loc else ""
    data["manuscript_location"] = manuscript_loc
    
    return data
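
Most of the lookups in the function follow the same pattern: run a text() query, take the first result, and fall back to an empty string. As a possible refactoring, that pattern could be wrapped in a small helper. The sketch below is an assumption rather than part of the project: `first_text`, `NS`, and `IMPRINT` are hypothetical names, and the sample TEI fragment is invented. Unlike the function above, the helper also strips surrounding whitespace.

```python
from lxml import etree

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
IMPRINT = (
    "//tei:teiHeader/tei:fileDesc/tei:sourceDesc"
    "/tei:biblStruct/tei:monogr/tei:imprint"
)


def first_text(tree, xpath, default=""):
    """Return the first result of a text() query, stripped, or `default`."""
    results = tree.xpath(xpath, namespaces=NS)
    return results[0].strip() if results else default


# Hypothetical fragment: a publisher is present, a place of publication is not.
sample = etree.fromstring(b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><monogr>
    <imprint>
      <publisher>
        Sample Press
      </publisher>
    </imprint>
  </monogr></biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>""")

publisher = first_text(sample, IMPRINT + "/tei:publisher/text()")
pub_place = first_text(sample, IMPRINT + "/tei:pubPlace/text()")

print(repr(publisher), repr(pub_place))
```

Each three-line lookup in the function would then shrink to a single call, at the cost of one more name to maintain.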
        

### Step 3 - Extract Metadata with Extraction Function
We loop through the TEI files to process them as a batch.
The metadata for each file is returned as a dictionary whose keys are the metadata fields and whose values are the corresponding data for that text. Each dictionary is appended to a list of records.

In [131]:
metadata_records = []
for file in sorted(tei_dir.glob("*.xml")):  # sorted so rows come out in a reproducible order
    metadata = extract_metadata(file)
    metadata_records.append(metadata)
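
A single malformed file will stop the loop above with an lxml error. If that becomes a problem, one option is to catch the parse error and keep a list of the files that failed. The sketch below shows the idea in isolation: `parse_all` is a hypothetical stand-in for the loop, and two throwaway files (one well-formed, one broken) replace the real TEI folder.

```python
import tempfile
from pathlib import Path

from lxml import etree


def parse_all(directory):
    """Try to parse every .xml file; return (parsed, failed) lists of file names."""
    parsed, failed = [], []
    for file in sorted(Path(directory).glob("*.xml")):
        try:
            etree.parse(str(file))
        except etree.XMLSyntaxError:
            # Record the problem file and carry on with the rest of the batch.
            failed.append(file.name)
            continue
        parsed.append(file.name)
    return parsed, failed


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.xml").write_text(
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"/>', encoding="utf-8"
    )
    Path(tmp, "bad.xml").write_text("<TEI><unclosed></TEI>", encoding="utf-8")
    parsed, failed = parse_all(tmp)

print(parsed, failed)
```

In the notebook, the same try/except could wrap the call to extract_metadata, with the failed file names printed at the end of the run.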

### Step 4 - Transform Metadata into a DataFrame, Export to CSV
Using pandas, we convert the list of dictionaries into a DataFrame. Each metadata record becomes one row with columns for each metadata field. 
Finally, we set the output path and export the DataFrame as a CSV to the designated metadata folder. 

In [137]:
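df = pd.DataFrame(metadata_records)
output_path = home / "romantic_poets_project/metadata_csv_outputs/metadata.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)  # create the output folder if it does not exist yet
df.to_csv(output_path, index=False)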

The metadata CSV is now available in the metadata_csv_outputs folder. The notebook can be rerun whenever new TEI XML files are added to the tei_files folder.
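
To confirm the export, the CSV can be read back into pandas. The sketch below uses made-up records and a temporary folder rather than the project data. Passing `dtype=str` and `keep_default_na=False` to `pd.read_csv` is a suggestion, not something the notebook requires: it keeps dates as text and empty fields as empty strings instead of NaN.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented records in the same shape as the extraction output.
records = [
    {"file_name": "poem_a.xml", "author": "Sample Author", "composition_date": ""},
    {"file_name": "poem_b.xml", "author": "Sample Author", "composition_date": "1798"},
]
df = pd.DataFrame(records)

with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp) / "metadata_csv_outputs" / "metadata.csv"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)

    # Read everything back as text so empty fields stay "" and years stay strings.
    round_trip = pd.read_csv(output_path, dtype=str, keep_default_na=False)

print(round_trip)
```

Without those two arguments, pandas would read the empty composition date as NaN and convert the column to a floating-point number.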

If additional metadata fields from the TEI files are needed beyond this reduced set, the extraction function can be updated to include them.
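
As an illustration of such an update, the sketch below adds a language code. It assumes the files record the language in profileDesc/langUsage/language with an ident attribute, which is a standard TEI location but may not be present in this project's files. `extract_language` and the sample fragment are hypothetical; in practice the two lines inside the function would be added to extract_metadata.

```python
from lxml import etree

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment; the project's files may not contain <language>.
sample = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Poem</title></titleStmt>
    </fileDesc>
    <profileDesc>
      <langUsage><language ident="en">English</language></langUsage>
    </profileDesc>
  </teiHeader>
</TEI>"""


def extract_language(tree):
    """Return one extra field, following the same pattern as the other lookups."""
    data = {}
    codes = tree.xpath(
        "//tei:teiHeader/tei:profileDesc/tei:langUsage/tei:language/@ident",
        namespaces=NS,
    )
    data["language_code"] = codes[0] if codes else ""
    return data


extra = extract_language(etree.fromstring(sample))
print(extra)
```

Because every field falls back to an empty string, a new column like this can be added without breaking files that lack the element.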

The script extract_tei_metadata.py works exactly the same way as this notebook.