# Parsing XML

In Chapter 11, we wrote an XML parser for the neoplasm taxonomy. While parsing
the file, our script automatically checked whether the file is well-formed
XML (i.e., whether it conforms to the rules of XML syntax). Had there been any nonconforming
lines or characters anywhere in the 10+ megabyte (MB) neoplasm taxonomy
file, our script would have stopped and reported the location in the file where the first syntactic error
occurred. Let us write a script whose only purpose is to check XML documents for
proper syntax.

In [1]:
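import xml.sax

# make_parser() returns a streaming (SAX) parser; with no handler attached,
# parse() only checks that the file is well formed.
parser = xml.sax.make_parser()
# A syntax error raises xml.sax.SAXParseException, which names the line and column.
parser.parse('./K11946_Files/NEOCL.XML')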

## Script Algorithm: Parsing XML

1. Import an XML parsing module.
2. Create a new parser object.
3. Using a (parsing) method available in the parsing module, provide the method
with the name of the file you wish to parse.
4. The parsing module will send a message to your screen if any parts of the file
are not well formed.
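
The steps above can also be wrapped in a small function that catches the parser's exception and returns the error location as a message, rather than letting a traceback print. This is a minimal sketch, not the chapter's script: the function name `check_well_formed` is hypothetical, and it parses a string rather than a file so that it can be tried without the neoplasm taxonomy.

```python
import xml.sax


def check_well_formed(xml_text):
    """Return 'well formed', or a message locating the first syntax error."""
    try:
        # ContentHandler() is the do-nothing base handler; only the syntax check matters here.
        xml.sax.parseString(xml_text.encode("utf-8"), xml.sax.ContentHandler())
    except xml.sax.SAXParseException as err:
        return "line {}, column {}: {}".format(
            err.getLineNumber(), err.getColumnNumber(), err.getMessage()
        )
    return "well formed"


print(check_well_formed("<a><b>text</b></a>"))
print(check_well_formed("<a><b>text</a>"))
```

The second call reports an error on line 1, because the `<b>` tag is never closed.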

## Analysis: Parsing XML

This script takes just a few lines of code, and parses XML files very quickly. The script
determines whether the XML file is well formed. For this example, I deliberately
opened the 10+ MB neocl.xml file, and created a syntax error by removing the end
of the stage tag a few lines from the end of the file. The script found the error and
reported the file location where the error occurred. Syntax errors, when they occur,
are always detected.
There are basically two types of XML parsing methods: stream methods (such as
our script), and DOM (Document Object Model) methods.
Stream methods parse through the file, much like any text parser, until an XML
“event” occurs (such as an encounter with the beginning of an XML tag, or the end of
an XML tag). When an event occurs, information is collected that must be reconciled
with subsequent events (e.g., every tag must have an end tag, and child tags must end
before the parent tag ends). The streaming parsers permit users to add additional commands
to be executed during an event.
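
As an illustration of adding commands to stream events, here is a minimal sketch of a handler that counts each tag and tracks the deepest nesting level. The class name `TagCounter`, the helper `count_tags`, and the sample XML are hypothetical; only the `xml.sax` calls come from the Python standard library.

```python
import xml.sax
from collections import Counter


class TagCounter(xml.sax.ContentHandler):
    """Count start-tag events and record the deepest nesting seen."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.depth = 0
        self.max_depth = 0

    def startElement(self, name, attrs):
        # Called by the parser each time a start tag is encountered.
        self.counts[name] += 1
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def endElement(self, name):
        # Called by the parser each time an end tag is encountered.
        self.depth -= 1


def count_tags(xml_text):
    handler = TagCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler


sample = "<taxonomy><class><name>a</name></class><class><name>b</name></class></taxonomy>"
result = count_tags(sample)
print(result.counts["class"], result.max_depth)
```

The parser never holds the whole document in memory; the handler keeps only the counters it needs.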
DOM parsers build a model of the XML structure (i.e., all the XML objects and
their relationships to each other). DOMs allow us to use the relationships among
XML objects in applications. The drawback of DOM parsers is that iterations up and
down the relational model, as the XML document is parsed, slow the script. A large
XML document (many megabytes) with a complex XML structure can take a very
long time to parse.
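
For comparison, a DOM parser loads the entire document into a tree before any question can be asked of it. The sketch below uses `xml.dom.minidom` from the standard library; the function name `child_names` and the tiny sample document are hypothetical.

```python
from xml.dom import minidom


def child_names(xml_text, parent_tag):
    """List the tag names of the element children of every <parent_tag>."""
    dom = minidom.parseString(xml_text)  # builds the full tree in memory
    names = []
    for parent in dom.getElementsByTagName(parent_tag):
        for node in parent.childNodes:
            if node.nodeType == node.ELEMENT_NODE:
                names.append(node.tagName)
    return names


sample = "<neoplasm><mesoderm><coelomic/><subcoelomic/></mesoderm></neoplasm>"
print(child_names(sample, "mesoderm"))
```

Walking parent-child relationships is convenient here, but the memory and time cost grows with the size of the document.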
Because I tend to use big XML documents, with many child elements, that have
long lineages, I use stream parsers exclusively. I suspect that healthcare workers, who
use large XML data sets, will tend to rely on stream parsers.

# Dublin Core Metadata

The Dublin Core consists of about 15 data elements, selected by a group of librarians,
that specify the kind of file information a librarian might use to describe a file, index
the described file, and retrieve files based on included information.
There are many publicly available documents that describe the Dublin Core elements:
http://www.ietf.org/rfc/rfc2731.txt
The Dublin Core elements can be inserted into HTML documents, simple XML
documents, or RDF documents. A public document explains exactly how the Dublin
Core elements can be used in these file formats:
http://dublincore.org/documents/usageguide/#rdfxml
An example of a simple, and shortened, Dublin Core file description in RDF format
is shown in Figure 18.1:
Because RDF is a dialect of XML, we can parse RDF files with the same scripts
that parse XML files. Because XML (and RDF) are ASCII files, they can be inserted
into the header sections of image files. When Dublin Core RDF is inserted into an
image file, it can be easily extracted and used to identify the file, and the individual
Dublin Core elements can be combined with Dublin Core elements from other files
to organize a wide range of data sources.
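
To illustrate parsing Dublin Core RDF with an ordinary XML parser, here is a minimal sketch that collects the `dc:` elements into a dictionary. The sample document, its values, and the function name `dublin_core_elements` are hypothetical stand-ins, not the book's figure; the namespace URIs are the standard RDF and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample; a real description would carry more elements.
SAMPLE_RDF = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.org/image.jpg">
    <dc:title>Sample image</dc:title>
    <dc:creator>A. Author</dc:creator>
    <dc:description>A hypothetical description</dc:description>
  </rdf:Description>
</rdf:RDF>"""


def dublin_core_elements(rdf_text):
    """Map each Dublin Core element name to its text content."""
    root = ET.fromstring(rdf_text)
    found = {}
    for element in root.iter():
        # ElementTree writes qualified names as "{namespace}localname".
        if element.tag.startswith("{" + DC_NS + "}"):
            name = element.tag.split("}", 1)[1]
            found[name] = (element.text or "").strip()
    return found


print(dublin_core_elements(SAMPLE_RDF))
```

Dictionaries produced this way from many files can be merged or indexed by element name.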

# Insert an RDF Document into an Image File

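It is easy to insert an RDF document into the header of an image file, and it is
just as easy to extract the RDF triples. The script in this section converts a JPEG
image to PNG and stores the RDF document in the PNG file as a text chunk.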

In [2]:
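from PIL import Image, PngImagePlugin

def pngsave(im, file):
    # Copy the text entries of im.info into PNG text chunks, then save.
    meta = PngImagePlugin.PngInfo()
    for k, v in im.info.items():
        # im.info can also hold non-text values (e.g., a dpi tuple); skip those.
        if isinstance(v, (str, bytes)):
            meta.add_text(k, v)
    im.save(file, "PNG", pnginfo=meta)

# Convert the JPEG to PNG, because the text is stored in PNG text chunks.
image = Image.open("./K11946_Files/3320.jpg")
image.save("./K11946_Files/3320.png")
# Read the RDF document as text.
with open("./K11946_Files/RDF_DESC.XML", "r") as rdf_file:
    description = rdf_file.read()
im = Image.open("./K11946_Files/3320.png")
im.info["description"] = description
pngsave(im, "./K11946_Files/3320.png")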

## Script Algorithm: Insert an RDF Document into an Image File

1. Prepare your RDF document. In this case, we will use the RDF file containing
a few Dublin Core elements, available at
http://www.julesberman.info/book/rdf_desc.xml
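2. Open an image file. In this case, we use the JPEG file 3320.jpg, available at
http://www.julesberman.info/book/3320.jpg
3. Convert the image to PNG, and insert the RDF document into the PNG header as a text chunk named "description".
4. Save the file.
5. Extract the header comments. The script stops after saving; the text chunks can be read back when the PNG file is reopened.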
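
To show what inserting and extracting a header text chunk involves, here is a minimal sketch that uses only the standard library rather than the imaging library used in the script. It relies on the PNG file layout (an 8-byte signature followed by length/type/data/CRC chunks, with `tEXt` chunks holding a keyword, a zero byte, and Latin-1 text). All function names are hypothetical, and `tiny_png` builds a 1x1 test image so the example runs without any image file.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Assemble one PNG chunk: length, type, data, CRC of type+data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def tiny_png():
    """Build a 1x1, 8-bit grayscale PNG to experiment on."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # one filter byte plus one pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


def insert_text(png, key, text):
    """Return a copy of png with a tEXt chunk placed right after IHDR."""
    ihdr_end = 8 + 12 + 13  # signature + chunk overhead + 13 bytes of IHDR data
    payload = key.encode("latin-1") + b"\0" + text.encode("latin-1")
    return png[:ihdr_end] + make_chunk(b"tEXt", payload) + png[ihdr_end:]


def extract_text(png):
    """Collect every tEXt chunk into a {keyword: text} dictionary."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    found, pos = {}, 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\0")
            found[key.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return found


tagged = insert_text(tiny_png(), "description", "<rdf:RDF/>")
print(extract_text(tagged))
```

With an imaging library, the same text is normally available from the opened image's metadata, so hand-parsing like this is mainly useful for seeing where the RDF document actually lives in the file.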

## Analysis: Insert an RDF Document into an Image File

When you include Dublin Core elements in your image headers, you accomplish several
very important goals at once:
1. You provide your colleagues with important descriptive information about
the image.
2. You provide indexing services and search engines with information that they can
extract, from your Web-residing images, that permits others to find your images.
3. If you provide copyright information and language that fully explains the
rights of the creator and the user, you can ensure that anyone who acquires
your image will have the information they need to use your intellectual property
in a responsible and legal manner.
4. You turn your image into a mini-database, that can be integrated with other
database files.

# Insert an Image File into an RDF Document

Though we distinguish text files from binary files, all files are actually binary files.
Sequential bytes of 8 bits are converted to ASCII equivalents, and if the ASCII equivalents are alphanumeric, we call the file a text file. If the ASCII values of 8-bit
sequential file chunks are nonalphanumeric, we call the files binary files.
Standard format image files are always binary files. Because RDF syntax is a pure
ASCII file format, image binaries cannot be directly pasted into an RDF document.
However, binary files can be interconverted to and from ASCII format, using a simple
software utility.

In [3]:
import base64, re

# Read the image as raw bytes.
with open("./K11946_Files/3320.jpg", "rb") as image_file:
    image_string = image_file.read()
# encodebytes() returns bytes; decode to str so it can be joined with the RDF text.
encoded = base64.encodebytes(image_string).decode("ascii")
with open("./K11946_Files/RDF_DESC.XML", "r") as rdf_file:
    rdf_string = rdf_file.read()
# Splitting on the tag name yields three pieces: the text up to the opening tag,
# the description itself, and the text after the closing tag.
rdflist = re.split(r'dc:description>', rdf_string)
contents = rdflist[0] + "dc:description>BEGIN\n" + \
    encoded + "END\n" + rdflist[1] + "dc:description>" + rdflist[2]
with open("c:/ftp/rdf_image.xml", "w") as rdf_out:
    rdf_out.write(contents)

## Script Algorithm: Insert an Image File into an RDF Document

1. Import the base64 module, which is part of the Python standard library.
2. Use any image file. In the example, we use 3320.jpg, available for download at
http://www.julesberman.info/book/3320.jpg.
3. Put the entire contents of the image file into a string variable.
4. Encode the contents of the image file into base64, using the encoding method
from the base64 module.
5. Open the RDF file. In this example, we will use the rdf_desc.xml file, available
at
http://www.julesberman.info/book/rdf_desc.xml
6. Split the file on the <dc:description> tag, and put the base64-encoded string
into this tagged data section.
7. Mark the base64 text with “BEGIN” and “END.”
8. Put the modified contents of the rdf_desc.xml file, now containing the base64
representation of the image file, into a new file, named rdf_image.xml.
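
The steps above can be sketched as a pair of functions that work on strings, so the round trip (image bytes into RDF, and back out again) can be checked without any files. The names `embed_image` and `extract_image` are hypothetical, and the extraction step is an addition that the chapter's script does not perform. The sketch assumes, as the script does, that `END` sits on a line of its own directly after the base64 text.

```python
import base64
import re


def embed_image(rdf_text, image_bytes):
    """Insert a BEGIN/END-marked base64 block after the opening <dc:description> tag."""
    encoded = base64.encodebytes(image_bytes).decode("ascii")
    block = "BEGIN\n" + encoded + "END\n"
    return rdf_text.replace("<dc:description>", "<dc:description>" + block, 1)


def extract_image(rdf_text):
    """Recover the original bytes from the BEGIN/END block."""
    # A line consisting only of "END" cannot be base64 output, because base64
    # lines always have a length that is a multiple of four.
    match = re.search(r"BEGIN\n(.*?)^END$", rdf_text, re.DOTALL | re.MULTILINE)
    if match is None:
        raise ValueError("no BEGIN/END block found")
    return base64.decodebytes(match.group(1).encode("ascii"))


rdf = "<rdf:Description><dc:description>Normal colon</dc:description></rdf:Description>"
data = bytes(range(256)) * 4  # stand-in for image bytes
combined = embed_image(rdf, data)
print(extract_image(combined) == data)
```

Decoding the extracted block and writing it to a file opened in binary mode would reproduce the original image.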

## Analysis: Insert an Image File into an RDF Document

The abbreviated output is shown in Figure 18.2.
The full file exceeds a megabyte in length. The central section of the base64 image
block is removed, to permit us to see the structure of the output file.
The sample script is not particularly robust. It requires the presence of a Dublin
Core description tag appearing in an exact format (i.e., <dc:description>). Otherwise,
the script would just fail. The script inelegantly shoves the base64 representation of
the binary image data into the Dublin Core description field. If this were a real RDF
implementation, you would prepare a specific RDF tag for the base64 data, and you
would prepare an external Schema document that defined the tag and its properties.
The script shows us that RDF files can hold binary data files (represented as base64
ASCII strings). There may be instances when you might prefer to insert an image file
into an RDF document, rather than inserting an RDF document into an image file. This might be the case when a single RDF file must contain information on multiple
different image files. Although it is nice to know that the option of inserting image
data into an RDF file is available, in most instances, you will simply point to the external
image file (using its Web address), and retrieve the image data from its URL.

# RDF Schema

RDF has a formal way of defining objects (and their properties, but we will not discuss
properties here). This is called RDF Schema. You can think of RDF Schema as a
dictionary for the terms in an RDF data document. RDF Schema is written in RDF
syntax. This means that all RDF Schemas are RDF documents and consist of statements
in the form of triples.
The important point about RDF Schemas is that they clarify the relationships
among classes of objects in a knowledge domain. Here is an example of Class relationships
formally specified as a Schema in RDF:
<rdfs:Class rdf:ID="Neoplasm">
<rdfs:subClassOf
rdfs:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Neural_crest">
<rdfs:subClassOf
neo:resource="#Neoplasm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Germ_cell">
<rdfs:subClassOf
neo:resource="#Neoplasm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Mesoderm">
<rdfs:subClassOf
neo:resource="#Neoplasm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Coelomic">
<rdfs:subClassOf
neo:resource="#Mesoderm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Sub_coelomic">
<rdfs:subClassOf
neo:resource="#Mesoderm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Sub_coelomic_gonadal">
<rdfs:subClassOf
neo:resource="#Sub_coelomic"/>
</rdfs:Class>
RDF schemas can be transformed into directed graphs (graphs consisting of connected
nodes and arcs and directions for the arcs). The process of transforming an
RDF Schema into a graphic representation requires a special software application,
such as GraphViz.

# Visualizing an RDF Schema with GraphViz

GraphViz is a free, open source application that produces graphic representations of
hierarchical structures that are described in the GraphViz scripting language.
As an example, here is the hierarchical organization of the Neoplasm Classification,
described in the GraphViz scripting language:
digraph G {
size="10,16";
ranksep="1.75";
node [style=filled color=gray65];
Neoplasm [label="Neoplasm"];
node [style=filled color=lightgray];
EndodermEctoderm
[label="Endoderm/\nEctoderm"];
NeuralCrest [label="Neural Crest"];
GermCell [label="Germ cell"];
Neoplasm -> EndodermEctoderm;
Neoplasm -> Mesoderm;
Neoplasm -> GermCell;
Neoplasm -> Trophectoderm;
Neoplasm -> Neuroectoderm;
Neoplasm -> NeuralCrest;
node [style=filled color=gray95];
Trophectoderm -> Molar;
Trophectoderm -> Trophoblast;
EndodermEctoderm -> Odontogenic;
EndodermEctodermPrimitive
[label="Endoderm/Ectoderm\nPrimitive"];
EndodermEctoderm -> EndodermEctodermPrimitive;
Endocrine
[label="Endoderm/Ectoderm\nEndocrine"];
EndodermEctoderm -> Endocrine;
EndodermEctoderm -> Parenchymal;
Odontogenic
[label="Endoderm/Ectoderm\nOdontogenic"];
EndodermEctoderm -> Surface;
MesodermPrimitive
[label="Mesoderm\nPrimitive"];
Mesoderm -> MesodermPrimitive;
Mesoderm -> Subcoelomic;
Mesoderm -> Coelomic;
NeuroectodermPrimitive
[label="Neuroectoderm\nPrimitive"];
NeuroectodermNeuralTube
[label="Central Nervous\nSystem"];
Neuroectoderm -> NeuroectodermPrimitive;
Neuroectoderm -> NeuroectodermNeuralTube;
NeuralCrestMelanocytic
[label="Melanocytic"];
NeuralCrestPrimitive
[label="Neural Crest\nPrimitive"];
NeuralCrestEndocrine
[label="Neural Crest\nEndocrine"];
PeripheralNervousSystem
[label="Peripheral\nNervous System"];
NeuralCrestOdontogenic
[label="Neural Crest\nOdontogenic"];
NeuralCrest -> NeuralCrestPrimitive;
NeuralCrest -> PeripheralNervousSystem;
NeuralCrest -> NeuralCrestEndocrine;
NeuralCrest -> NeuralCrestMelanocytic;
NeuralCrest -> NeuralCrestOdontogenic;
GermCell -> Differentiated;
GermCell -> Primordial;
}
By eliminating the lowest level of subclasses, we can generate a simpler schematic
(Figure 18.3).
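
One way to produce such a simplified schematic is to keep only the edges within a fixed number of levels below the root class. The sketch below is an assumption about how the pruning might be done, not the method used for the figure; the function names `edges_within_depth` and `to_dot` are hypothetical, and the hierarchy is assumed to be a tree (no cycles).

```python
from collections import defaultdict


def edges_within_depth(edges, root, max_depth):
    """Keep the (parent, child) edges that lie within max_depth levels of root."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    kept, frontier = [], [(root, 0)]
    while frontier:
        node, depth = frontier.pop(0)  # breadth-first walk from the root
        if depth >= max_depth:
            continue
        for child in children[node]:
            kept.append((node, child))
            frontier.append((child, depth + 1))
    return kept


def to_dot(edges):
    """Write the edges as a minimal GraphViz digraph."""
    lines = ["digraph G {"]
    lines += ["{} -> {};".format(parent, child) for parent, child in edges]
    lines.append("}")
    return "\n".join(lines)


edges = [
    ("Neoplasm", "Mesoderm"),
    ("Neoplasm", "GermCell"),
    ("Mesoderm", "Coelomic"),
    ("GermCell", "Primordial"),
]
print(to_dot(edges_within_depth(edges, "Neoplasm", 1)))
```

Raising `max_depth` restores the deeper subclasses one level at a time.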

# Obtaining GraphViz

GraphViz is free software. The GraphViz download site is
http://www.graphviz.org/Download.php
Windows® users can download graphviz-2.14.1.exe (5,614,329 bytes). You can
install the software by running the .exe file. GraphViz has subapplications: dot, fdp, twopi, neato, and circo. The twopi application,
which we use here, creates graphs that have a radial layout.
Extensive information on GraphViz is available at
http://www.graphviz.org/

# Converting a Data Structure to GraphViz

If you work with RDF (and every biomedical professional should understand how
RDF is used to specify data), you will want a method that can instantaneously
render a schematic of your RDF Schema (ontology) or of any descendant section of
your Schema.
Because the GraphViz language is designed with a similar purpose as RDF
Schema—to describe the relationships among hierarchical classes of object—it is
always possible to directly translate an RDF Schema into the GraphViz language. This
is a type of poor man’s metaprogramming (using a programming language to generate
another program). When an RDF Schema has been translated into the GraphViz
language, the GraphViz software can display the class structure as a graph.

In [None]:
#!/usr/local/bin/python
import re

with open("schema.txt", "r") as in_file, open("schema.dot", "w") as out_file:
    print("digraph G {", file=out_file)
    print('size="15,15";', file=out_file)
    print('ranksep="3.00";', file=out_file)
    clump = ""
    for line in in_file:
        # The end tag closes one class definition; process what was collected.
        if re.match(r'</rdfs:Class>', line):
            father = ""
            child = ""
            clump = clump.replace("\n", " ")
            # The father class is the name after "#" in the resource attribute.
            fathermatch = re.search(r':resource="[a-zA-Z0-9:/_.\-]*#([a-zA-Z_]+)"', clump)
            if fathermatch:
                father = fathermatch.group(1)
            # The child class is the value of rdf:ID.
            childmatch = re.search(r'rdf:ID="([a-zA-Z_]+)"', clump)
            if childmatch:
                child = childmatch.group(1)
            print(father + " -> " + child + ";", file=out_file)
            clump = ""
        else:
            clump = clump + line
    print("}", file=out_file)

## Script Algorithm: Converting a Data Structure to GraphViz

1. Open the file containing the Schema relationships, in RDF syntax (available
at www.julesberman.info/book/schema.txt).
2. Open an output file, to write the transformed class relationships, in the
GraphViz language.
3. Print the first lines of the GraphViz file, which begins with a statement indicating
that a digraph will follow (instructions for a directed graph), its size,
and the length of the separator lines between classes.
4. Parse through the RDF classes, using the end tag “</rdfs:Class>” to indicate
the end of one class definition and the beginning of the next class definition.
5. Obtain the name of the class and the name of the class to which the class is a
subclass (i.e., the name of the child class’s father).
All of the Schema class statements will have a form equivalent to the following
example:
<rdfs:Class rdf:ID="Neoplasm">
<rdfs:subClassOf
rdfs:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
The class name appears in quotes, after "rdf:ID=". The superclass name appears
at the end of a resource statement, after the "#" character (here,
resource="http://www.w3.org/2000/01/rdf-schema#Class"). Use regular expressions to obtain the name of the child class
and the father class from each RDF Schema statement.
6. Print to the output file each encountered child and father class in a GraphViz
statement of the following general form:
father class -> child class;
7. After the schema is parsed, print “}” to the output file, to close the GraphViz
script.
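
The algorithm can also be sketched as a function that takes the Schema text and returns the GraphViz text, which makes it easy to try on a short fragment. This is a hedged variant, not the chapter's script: the name `schema_to_dot` and the sample fragment are hypothetical, it omits the size and ranksep lines, and it skips any class block in which either name is missing.

```python
import re

CLASS_BLOCK = re.compile(r"<rdfs:Class\b(.*?)</rdfs:Class>", re.DOTALL)
CHILD = re.compile(r'rdf:ID="([A-Za-z_]+)"')
FATHER = re.compile(r':resource="[^"#]*#([A-Za-z_]+)"')

# Hypothetical two-class fragment in the same form as the Schema statements.
SAMPLE_SCHEMA = """<rdfs:Class rdf:ID="Mesoderm">
<rdfs:subClassOf
neo:resource="#Neoplasm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Coelomic">
<rdfs:subClassOf
neo:resource="#Mesoderm"/>
</rdfs:Class>"""


def schema_to_dot(schema_text):
    """Translate rdfs:Class blocks into 'father -> child;' GraphViz statements."""
    lines = ["digraph G {"]
    for block in CLASS_BLOCK.findall(schema_text):
        child = CHILD.search(block)
        father = FATHER.search(block)
        if child and father:
            lines.append("{} -> {};".format(father.group(1), child.group(1)))
    lines.append("}")
    return "\n".join(lines)


print(schema_to_dot(SAMPLE_SCHEMA))
```

Matching whole class blocks with one regular expression replaces the line-by-line accumulation, at the cost of reading the entire Schema into memory.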

## Analysis: Converting a Data Structure to GraphViz

The output is the script, schema.dot, which is equivalent to the digraph (GraphViz
language) script shown at the beginning of this section.
After installing GraphViz, we can create the image schema.png, from the schema.dot
specification by invoking the twopi subapplication on a command line.
c:\ftp>twopi -Tpng schema.dot -o schema.png
GraphViz produced the graph shown in Figure 18.3, from a GraphViz script produced
by transformations on an RDF Schema.
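
The same command can be issued from a Python script. The sketch below assumes that GraphViz is installed and that `twopi` is on the system path; the function names `twopi_command` and `render` are hypothetical. It uses only the `-Tpng` and `-o` options shown in the command line above.

```python
import shutil
import subprocess


def twopi_command(dot_path, png_path):
    """Build the argument list for the twopi command line shown above."""
    return ["twopi", "-Tpng", dot_path, "-o", png_path]


def render(dot_path, png_path):
    """Run twopi if it is installed; return False when it cannot be found."""
    if shutil.which("twopi") is None:
        return False
    subprocess.run(twopi_command(dot_path, png_path), check=True)
    return True


print(twopi_command("schema.dot", "schema.png"))
```

Calling `render("schema.dot", "schema.png")` after the conversion script has run would produce the image in one step.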