# 12-ssda-xml-parser

> Parsing XML transcription files into a dataframe of entries with their volume and folio metadata.

In this notebook, we create a parser that reads an XML file.  The file contains a large amount of metadata about each of the entries from which we want to extract entities.  Each entry belongs to a book (with a source identifier), and each book has a number of folios (i.e., pages, with one folio for the front and a separate one for the back), each with its own unique identifier.  Because the data is stored as XML, we parse out the individual components and build a dataframe in which each row is an entry with its respective metadata.
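
As a rough illustration of the target layout, the sketch below flattens a small nested book/folio/entry structure into one row per entry. The sample values and the `flatten` helper are invented for illustration; only the column names and the `folio-entry` numbering follow the parsers defined later in this notebook.

```python
# Hypothetical sample: one book (volume) with two folios, each holding entries.
book = {
    "vol_titl": "Sample volume",
    "vol_id": "0001",
    "folios": [
        {"fol_id": "10", "entries": ["first entry text", "second entry text"]},
        {"fol_id": "11", "entries": ["third entry text"]},
    ],
}

def flatten(book):
    """Return one dict per entry, carrying the book and folio metadata along."""
    rows = []
    for folio in book["folios"]:
        # Entry numbering restarts at 1 on every folio.
        for entry_no, text in enumerate(folio["entries"], start=1):
            rows.append({
                "vol_titl": book["vol_titl"],
                "vol_id": book["vol_id"],
                "fol_id": folio["fol_id"],
                "text": text,
                "entry_no": f'{folio["fol_id"]}-{entry_no}',
            })
    return rows

rows = flatten(book)
for row in rows:
    print(row["entry_no"], row["text"])
```

Passing a list of dicts like `rows` to `pd.DataFrame` gives one dataframe row per entry.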

In [None]:
# default_exp xml_parser

In [None]:
#export
import pandas as pd

Below is an example of the XML files we expect as input, showing the header metadata and two entries:

## Analysis and coding
The following code extracts the relevant parts of the metadata and constructs a dataframe which organizes the results by row.

In [None]:
#export
def parse_xml(file_name):
    """Parse an XML file with element-based metadata into a dataframe, one row per entry."""
    vol_titls = []
    vol_ids = []
    entry_txts = []
    entry_ids = []
    fol_ids = []
    curr_vol_titl = ""
    curr_vol_id = ""
    curr_fol_id = ""
    curr_entry = ""
    entry_id = 0  # defined up front so an <entry> before any <itemIdentifier> cannot raise

    in_entry = False

    # the context manager closes the file even if parsing raises
    with open(file_name, "r", encoding='utf-8') as master_xml:
        for line in master_xml:
            first_open = line.find('<')
            second_open = line.find('<', first_open + 1)
            if (first_open != -1) and (second_open != -1):
                # <tag>content</tag> on one line: keep the text between the tags
                line_content = line[line.find('>') + 1:second_open]
            elif line.rstrip('\n').endswith('-'):
                # word hyphenated across a line break: drop the hyphen
                line_content = line.rstrip('\n')[:-1]
            else:
                # plain text line: swap the newline for a space
                line_content = line.rstrip('\n') + ' '

            if "<volumeTitle>" in line:
                # set current volume title
                curr_vol_titl = line_content
            elif "<volumeIdentifier>" in line:
                # set current volume identifier
                curr_vol_id = line_content
            elif "<itemIdentifier>" in line:
                # set current folio id and restart the entry counter
                entry_id = 0
                curr_fol_id = line_content
            elif "<entry>" in line:
                # start a new entry
                entry_id += 1
                in_entry = True
                curr_entry = ""
            elif in_entry and ("</entry>" not in line):
                # add line to current entry
                curr_entry += line_content
            elif in_entry and ("</entry>" in line):
                # close the entry and append all current variables to lists
                in_entry = False
                vol_titls.append(curr_vol_titl)
                vol_ids.append(curr_vol_id)
                fol_ids.append(curr_fol_id)
                entry_txts.append(curr_entry)
                entry_ids.append(curr_fol_id + '-' + str(entry_id))

    columns = {'vol_titl':vol_titls, 'vol_id':vol_ids, 'fol_id':fol_ids, 'text':entry_txts, 'entry_no':entry_ids}

    df = pd.DataFrame(columns)
    return df
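
The per-line text extraction in `parse_xml` distinguishes three cases: a line holding a complete `<tag>content</tag>` pair, a line ending in a hyphen, and an ordinary text line. A minimal sketch of that logic as a standalone helper follows; `extract_line_content` and the sample lines are hypothetical and not part of the exported module.

```python
def extract_line_content(line):
    """Follow the three cases parse_xml uses to pull text out of one line."""
    first_open = line.find('<')
    second_open = line.find('<', first_open + 1)
    if first_open != -1 and second_open != -1:
        # <tag>content</tag> on a single line: keep what sits between the tags
        return line[line.find('>') + 1:second_open]
    text = line.rstrip('\n')
    if text.endswith('-'):
        # word hyphenated across a line break: drop the hyphen, add no space
        return text[:-1]
    # ordinary text line: add a space so words on adjacent lines stay separated
    return text + ' '

# Hypothetical sample lines, one per case
print(repr(extract_line_content("<volumeTitle>Baptisms</volumeTitle>\n")))
print(repr(extract_line_content("bapti-\n")))
print(repr(extract_line_content("zed the child\n")))
```

Keeping this in one function makes each case easy to check in isolation before running the parser on a full file.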

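The following parser handles a second XML layout, in which volume and folio metadata are stored as attributes of the `<volume>` and `<image>` tags rather than as child elements.  The cell after it shows the first five rows of a processed XML document: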

In [None]:
#export

def parse_xml_v2(path_to_xml):
    """Parse an XML file with attribute-based metadata into a dataframe, one row per entry."""
    vol_titls = []
    vol_ids = []
    entry_txts = []
    entry_ids = []
    fol_ids = []
    curr_vol_titl = ""
    curr_vol_id = ""
    curr_fol_id = ""
    curr_entry = ""
    entry_id = 0  # defined up front so an <entry> before any <image> cannot raise

    in_entry = False

    # the context manager closes the file even if parsing raises
    with open(path_to_xml, "r", encoding='utf-8') as master_xml:
        for line in master_xml:
            if "<volume" in line:
                # read the title="..." and id="..." attributes of the volume tag
                title_start = line.find('"', line.find("title=")) + 1
                title_end = line.find('"', title_start)
                curr_vol_titl = line[title_start:title_end]
                id_start = line.find('"', line.find("id=")) + 1
                id_end = line.find('"', id_start)
                curr_vol_id = line[id_start:id_end]
            elif "<image" in line:
                # new folio: restart the entry counter and read its id="..." attribute
                entry_id = 0
                id_start = line.find('"', line.find("id=")) + 1
                id_end = line.find('"', id_start)
                curr_fol_id = line[id_start:id_end]
            elif "<entry" in line:
                entry_id += 1
                in_entry = True
                curr_entry = ""
            elif in_entry and ("</entry>" not in line):
                text = line.rstrip('\n')
                if curr_entry.endswith('-'):
                    # previous line ended mid-word: drop the hyphen and join directly
                    curr_entry = curr_entry[:-1] + text
                elif curr_entry:
                    curr_entry += ' ' + text
                else:
                    curr_entry = text
            elif in_entry and ("</entry>" in line):
                in_entry = False
                vol_titls.append(curr_vol_titl)
                vol_ids.append(curr_vol_id)
                fol_ids.append(curr_fol_id)
                entry_txts.append(curr_entry)
                entry_ids.append(curr_fol_id + '-' + str(entry_id))

    columns = {'vol_titl':vol_titls, 'vol_id':vol_ids, 'fol_id':fol_ids, 'text':entry_txts, 'entry_no':entry_ids}

    df = pd.DataFrame(columns)
    return df
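
Inside an entry, `parse_xml_v2` joins consecutive lines with a single space, except when the text collected so far ends in a hyphen, in which case the hyphen is dropped and the next line is glued on directly. The sketch below isolates that rule; `join_entry_lines` and the sample lines are hypothetical.

```python
def join_entry_lines(lines):
    """Join the raw lines of one entry into a single string."""
    entry = ""
    for line in lines:
        text = line.rstrip('\n')
        if entry.endswith('-'):
            # previous line ended mid-word: drop the hyphen and glue directly
            entry = entry[:-1] + text
        elif entry:
            # normal line break: separate the lines with one space
            entry += ' ' + text
        else:
            entry = text
    return entry

# Hypothetical two-line entry with a word split across the line break
print(join_entry_lines(["Domingo veinte y dos de Fe-\n", "brero yo baptize\n"]))
```

Note that a hyphen on the very last line of an entry is kept, because the rule only fires when another line follows.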

In [None]:
#no_test

test_df = parse_xml_v2("transcriptions\\239746.xml")
test_df.head()

| | vol_titl | vol_id | fol_id | text | entry_no |
|---|---|---|---|---|---|
| 0 | Baptisms - 1793-1807 | 239746 | 1013 | 1. María Dolores Sanchez Lunes, día veinte y ... | 1013-1 |
| 1 | Baptisms - 1793-1807 | 239746 | 1013 | 2. Antonio Guillo Miercoles, día veinte de No... | 1013-2 |
| 2 | Baptisms - 1793-1807 | 239746 | 1014 | 3. María Juana Francisca Fish Domingo, día ve... | 1014-1 |
| 3 | Baptisms - 1793-1807 | 239746 | 1014 | 4. Maria Teresa Camel Domingo, día veinte y q... | 1014-2 |
| 4 | Baptisms - 1793-1807 | 239746 | 1015 | Maria Josefa Andrea de la Puente Miércoles, d... | 1015-1 |


In [None]:
#export

def xml_v2_to_json(path_to_xml):
    """Parse an attribute-based XML file into a list of image dicts, each holding its entries."""
    curr_entry = ""
    entry_id = 0  # defined up front so an <entry> before any <image> cannot raise
    in_entry = False

    images = []
    curr_img_dict = None

    # the context manager closes the file even if parsing raises
    with open(path_to_xml, "r", encoding='utf-8') as master_xml:
        for line in master_xml:
            if "<image" in line:
                # a new image starts: store the previous one before replacing it
                if curr_img_dict is not None:
                    images.append(curr_img_dict)
                entry_id = 0
                id_start = line.find('"', line.find("id=")) + 1
                id_end = line.find('"', id_start)
                curr_img_id = line[id_start:id_end]
                type_start = line.find('"', line.find("type=")) + 1
                type_end = line.find('"', type_start)
                curr_img_type = line[type_start:type_end]
                if line.find("number=") == -1:
                    # the number attribute is optional
                    curr_img_num = None
                else:
                    num_start = line.find('"', line.find("number=")) + 1
                    num_end = line.find('"', num_start)
                    curr_img_num = line[num_start:num_end]
                curr_img_dict = {"id": curr_img_id, "type": curr_img_type, "number": curr_img_num, "entries": []}
            elif "<entry" in line:
                entry_id += 1
                in_entry = True
                curr_entry = ""
            elif in_entry and ("</entry>" not in line):
                text = line.rstrip('\n')
                if curr_entry.endswith('-'):
                    # previous line ended mid-word: drop the hyphen and join directly
                    curr_entry = curr_entry[:-1] + text
                elif curr_entry:
                    curr_entry += ' ' + text
                else:
                    curr_entry = text
            elif in_entry and ("</entry>" in line):
                in_entry = False
                if curr_img_dict is not None:
                    curr_img_dict["entries"].append({"id": entry_id, "text": curr_entry})

    # the last image has no following <image> tag to trigger its append, so add it here
    if curr_img_dict is not None:
        images.append(curr_img_dict)

    return images

In [None]:
#no_test

test = xml_v2_to_json("transcriptions\\15834.xml")
print(test[:5])

[{'id': '1033', 'type': 'jpg', 'number': '1r', 'entries': [{'id': 1, 'text': '[margin]: Juana. Esc.va Domingo veinte y dos de [roto] y nueve yo Thomas de Orvera baptize, y pusse [roto] s.tos oleos a Juana de nacion Mina esclava de[roto] Juan Joseph de Justis fueron sus P.P. Joseph Salcedo y Ana de Santiago su mugger, y lo firmé. [signed]: Thomas de Orvera'}, {'id': 2, 'text': '[margin]: Paula. Esc.a Juebes veinte y tres de feb.o de mil sietec.tos. y diez y nueve Yo Thomas de Orvera baptizé, y pusse los santos15 oleos á Paula h. l.16 de Juan Joseph, y Maria Josepha esc.s del Capitan D. Luis Hurtado de Mendoza fue su Padrino Bartholome Rixo, y lo firmé. [signed]: Thomas de Orvera'}, {'id': 3, 'text': '[margin]: Maria Esc.a Miercoles prim.o de feb.o de mil siete.tos y diez y nueve Yo Thomas de Orvera baptizé, y pusse los santos oleos á Maria h. l. de Juan, y Josepha esc.s del Capitan Antonio Benites fue su Madrina Ysabel Mendez, y lo firmé. [signed]: Thomas de Orvera'}, {'id': 4, 'text': 
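
Despite its name, `xml_v2_to_json` returns a Python list of dicts rather than a JSON string. If a JSON document is what we need, the standard `json` module can serialize that structure directly. The sketch below uses a small hypothetical sample with the same shape as the output above.

```python
import json

# Hypothetical sample with the same shape as the list returned by xml_v2_to_json.
images = [
    {"id": "1033", "type": "jpg", "number": "1r",
     "entries": [{"id": 1, "text": "Yo Thomas de Orvera baptizé"}]},
]

# ensure_ascii=False keeps accented characters readable instead of \uXXXX escapes
as_json = json.dumps(images, ensure_ascii=False, indent=2)
print(as_json)

# Round trip back to Python objects to confirm nothing was lost
assert json.loads(as_json) == images
```

When writing the string to disk, opening the file with `encoding='utf-8'` keeps the accented characters intact.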

In [None]:
#export

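def retrieve_volume_metadata(path_to_xml):
    """Return the attributes of the <volume> tag as a dict (None for a missing attribute)."""
    volume_metadata = {}
    metadata_fields = ["type", "country", "state", "city", "institution", "id", "title"]

    # the context manager closes the file when the block exits
    with open(path_to_xml, "r", encoding='utf-8') as xml:
        for line in xml:
            if "<volume" in line:
                for field in metadata_fields:
                    # anchor on field=" so a short name such as "id" cannot match inside another value
                    marker = field + '="'
                    start = line.find(marker)
                    if start == -1:
                        volume_metadata[field] = None
                        continue
                    start += len(marker)
                    volume_metadata[field] = line[start:line.find('"', start)]

    return volume_metadata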

In [None]:
#no_test

test = retrieve_volume_metadata("transcriptions\\15834.xml")
print(test)

{'type': 'baptism', 'country': 'Cuba', 'state': 'Matanzas', 'city': 'Matanzas', 'institution': 'Catedral de San Carlos Borromeo', 'id': '15834', 'title': 'Libro 1 de Bautismos de Pardos y Morenos, 1719 - 1752, Parroquia de San Carlos de Matanzas'}
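
Pulling attributes out by name is easier to get right with a regular expression, since a plain substring search for a short name like `id` can also match inside a value such as `Florida`. The following is a minimal sketch of that alternative, not the method used in the exported module; the sample tag is hypothetical and assumes double-quoted attributes on a single line.

```python
import re

# Hypothetical opening tag, shaped like the one retrieve_volume_metadata reads.
line = '<volume type="baptism" state="Florida" city="St. Augustine" id="239746" title="Baptisms">'

# Each match is a (name, value) pair for one name="value" attribute.
attributes = dict(re.findall(r'(\w+)="([^"]*)"', line))

print(attributes["id"])
print(attributes["state"])
```

Because the pattern requires `="` directly after the name, the `id` inside `Florida` is never mistaken for the `id` attribute.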


In [None]:
#no_test

from nbdev.export import notebook2script
notebook2script()

