
Welcome to chapter eighteen of Methods in Medical Informatics! Extensible Markup Language (XML) is a data organization standard. In its most basic form, XML is a method for marking up files so that every piece of data is surrounded by bracketed text that describes the data (e.g., \<number>5\</number>). This chapter introduces working with XML in depth. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics - Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# Parsing XML

In Chapter 11, we wrote an XML parser for the neoplasm taxonomy. While parsing
the file, our script automatically checked to determine that the file is properly formed
XML (i.e., if the XML syntax is properly formatted). Had there been any formatting error in the neoplasm taxonomy
file, our script would have indicated the specific lines in the file where an error
occurred. Let us write a script whose only purpose is to check XML documents for
proper syntax.*

*Note: An XML document is only considered well formed if it has a proper XML header, contains text in a readable format, and follows the general rules for tagging data.*

> This script will utilize the file [neocl.xml](https://datamine.unc.edu/datafiles_jm/). neocl.xml is the Neoplasm Classification formatted as an XML document. Additional information [here](https://datamine.unc.edu/datafiles_jm/)

**Description adapted from page 250 of "Methods in Medical Informatics"*

In [1]:
import xml.sax
import pprint
parser = xml.sax.make_parser( )
parser.parse('./K11946_Files/NEOCL.XML')
print('No errors found!')

No errors found!


## Script Algorithm: Parsing XML

Import an XML parsing module.*

In [2]:
import xml.sax
import pprint

Create a new parser object.

In [3]:
parser = xml.sax.make_parser( )

Using a (parsing) method available in the parsing module, provide the method
with the name of the file you wish to parse.

In [4]:
parser.parse('./K11946_Files/NEOCL.XML')

The parsing module will send a message to your screen if any parts of the file
are not well formed.

In [5]:
print('No errors found!')

No errors found!


**This section is adapted from section 18.1.1, "Script Algorithm", of page 250 from "Methods in Medical Informatics".*

## Analysis: Parsing XML

This script takes just a few lines of code and parses XML files very quickly. Its only job is to determine whether the XML file is well formed. In this case, it checked a 10+ MB document in only a few seconds.*

**This section is adapted from section 18.1.2, "Analysis", of page 252 in "Methods in Medical Informatics".*
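
The checker above stops at the first problem it meets. The sketch below shows one way to capture where that problem is, rather than letting the exception end the script. It works on short in-memory strings instead of NEOCL.XML; the function name `check_well_formed` and the two sample documents are assumptions made for illustration.

```python
import xml.sax


def check_well_formed(xml_text):
    """Return None if xml_text is well formed, otherwise (line, column, message)."""
    try:
        # parseString raises SAXParseException at the first well-formedness error
        xml.sax.parseString(xml_text.encode("utf-8"), xml.sax.ContentHandler())
    except xml.sax.SAXParseException as err:
        return (err.getLineNumber(), err.getColumnNumber(), err.getMessage())
    return None


# Two small sample documents (illustrative only)
good_xml = "<record>\n<number>5</number>\n</record>"
bad_xml = "<record>\n<number>5</numbr>\n</record>"  # end tag is misspelled

print(check_well_formed(good_xml))
print(check_well_formed(bad_xml))
```

Returning the position as a tuple makes it easy to log the error or to check many files in a loop without stopping at the first malformed one.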

# Resource Description Framework

An important framework for this chapter is the Resource Description Framework (RDF). RDF is a variant of XML that uses the same tagging format as XML. In RDF, data and metadata are paired with an identified object, forming a data "triple". This is demonstrated by the example below:

"Mr. Rheeus" "blood glucose level" "77"

The data is the number, "77". The metadata is the descriptor "blood glucose level". The specific object is "Mr. Rheeus". Here is the same triple expressed with XML tags:

\<Description>
\<Description_object>Mr. Rheeus\</Description_object>
\<Blood_glucose_level>77\</Blood_glucose_level>
\</Description>

This structure indicates that a blood glucose level of 77 belongs to Mr. Rheeus. RDF has its own syntax for expressing these triples. Here is the same triple in RDF format:

\<rdf:Description rdf:about="http://www.patient_info.com/lab.htm#Mr_Rheeus">
\<lab:Blood_glucose_level>77\</lab:Blood_glucose_level>
\</rdf:Description>

In RDF, objects are specified using a web address or some other unique identifier that distinguishes the object from all others. In this chapter, we will examine a specific RDF annotation style known as the Dublin Core.*

**This section was adapted from pages 252 - 253 of "Methods in Medical Informatics"*
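
To make the idea of a triple concrete, here is a minimal sketch that reads the RDF example above and returns (object, metadata, data) tuples. It uses the standard library module `xml.etree.ElementTree`. The `lab` namespace URI and the function name `extract_triples` are assumptions made so that the fragment becomes a complete, parseable document.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LAB_NS = "http://www.patient_info.com/lab#"  # hypothetical namespace for the lab: prefix

rdf_text = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:lab="{LAB_NS}">
<rdf:Description rdf:about="http://www.patient_info.com/lab.htm#Mr_Rheeus">
<lab:Blood_glucose_level>77</lab:Blood_glucose_level>
</rdf:Description>
</rdf:RDF>"""


def extract_triples(text):
    """Return a list of (object, metadata, data) tuples from simple RDF/XML."""
    root = ET.fromstring(text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree stores tags as "{namespace}name"; keep only the name
            predicate = prop.tag.split("}", 1)[-1]
            triples.append((subject, predicate, (prop.text or "").strip()))
    return triples


triples = extract_triples(rdf_text)
for triple in triples:
    print(triple)
```

This handles only the flat pattern shown in this chapter (one description, child elements holding literal values); a full RDF parser covers many more forms.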

# Dublin Core Metadata

The Dublin Core consists of about 15 data elements
that specify the kind of file information a librarian might use to describe a file, index
the file, and retrieve files based on included information.
There are numerous publicly available documents that describe the Dublin Core elements:
http://www.ietf.org/rfc/rfc2731.txt
The Dublin Core elements can be inserted into a variety of document types (e.g., HTML, XML, RDF). A public document explains exactly how the Dublin
Core elements can be used in these file formats:
http://dublincore.org/documents/usageguide/#rdfxml

We can parse RDF files with the same scripts that parse XML files, since RDF is a subtype of XML. XML and RDF files can also be inserted into the header sections of image files. When a Dublin Core RDF document is inserted into the header of an image file, it can be extracted to identify the file.*

**Description adapted from page 254 of "Methods in Medical Informatics"*
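
Because RDF is XML, the same `xml.sax` module used at the start of this chapter can pull Dublin Core elements out of an RDF document. The sketch below is one possible approach, not the book's script: the handler class `DublinCoreHandler` is a name invented here, and the sample text reproduces the Dublin Core document used later in this chapter.

```python
import xml.sax

RDF_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://www.julesberman.info/">
    <dc:creator>Jules J. Berman</dc:creator>
    <dc:title>Methods in Medical Informatics</dc:title>
    <dc:description>
    Medical Informatics methods and algorithms in Perl, Python, and Ruby
    </dc:description>
    <dc:date>2010</dc:date>
</rdf:Description>
</rdf:RDF>
"""


class DublinCoreHandler(xml.sax.ContentHandler):
    """Collect the text of every element whose tag starts with 'dc:'."""

    def __init__(self):
        super().__init__()
        self.elements = {}
        self._current = None
        self._buffer = []

    def startElement(self, name, attrs):
        if name.startswith("dc:"):
            self._current = name
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it
        if self._current is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._current:
            # Collapse line breaks and indentation into single spaces
            self.elements[name] = " ".join("".join(self._buffer).split())
            self._current = None


handler = DublinCoreHandler()
xml.sax.parseString(RDF_TEXT.encode("utf-8"), handler)
for tag, value in handler.elements.items():
    print(tag, "=", value)
```

Matching on the literal `dc:` prefix is a simplification; it assumes the document binds the Dublin Core namespace to that prefix, as the sample does.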

# Insert an RDF Document into an Image File

It is easy to insert an RDF document into the header of a JPEG image file, and it is
just as easy to extract the RDF triples.*

> This script will utilize the files [3320.jpg](https://datamine.unc.edu/datafiles_jm/), [3320.png](https://datamine.unc.edu/datafiles_jm/), and [RDF_DESC.XML](https://datamine.unc.edu/datafiles_jm/). 3320.jpg is an example JPEG image. 3320.png is the same image in PNG format. RDF_DESC.XML is an XML file containing a few Dublin Core elements. Additional information [here](https://datamine.unc.edu/datafiles_jm/)

**Description adapted from page 254 of "Methods in Medical Informatics"*

In [3]:
def pngsave(im, file):
    from PIL import PngImagePlugin
    meta = PngImagePlugin.PngInfo()
    for k,v in im.info.items():
        meta.add_text(k, v, 0)
    im.save(file, "PNG", pnginfo=meta)
from PIL import Image
image = Image.open("./K11946_Files/3320.jpg")
image.save("./K11946_Files/3320.png")
rdf_file = open("./K11946_Files/RDF_DESC.XML", "rb")
description = rdf_file.read()
rdf_file.close()
im = Image.open("./K11946_Files/3320.png")
im.info["description"] = description
pngsave(im, "./K11946_Files/3320.png")
# RDF Header for Image
im.info

{'description': b'<?xml version="1.0" encoding="UTF-8"?>\n<rdf:RDF\n   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n   xmlns:dc="http://purl.org/dc/elements/1.1/">\n<rdf:Description rdf:about="http://www.julesberman.info/">\n    <dc:creator>Jules J. Berman</dc:creator>\n    <dc:title>Methods in Medical Informatics</dc:title>\n    <dc:description>\n    Medical Informatics methods and algorithms in Perl, Python, and Ruby\n    </dc:description>\n    <dc:date>2010</dc:date>\n</rdf:Description>\n</rdf:RDF>\n'}

## Script Algorithm: Insert an RDF Document into an Image File

Prepare your RDF document. In this case, we will use the RDF file containing a few Dublin Core elements. Open an image file. In this case, we use the JPEG file 3320.jpg, which the script first saves as a PNG copy.*

In [4]:
def pngsave(im, file):
    from PIL import PngImagePlugin
    meta = PngImagePlugin.PngInfo()
    for k,v in im.info.items():
        meta.add_text(k, v, 0)
    im.save(file, "PNG", pnginfo=meta)
from PIL import Image
image = Image.open("./K11946_Files/3320.jpg")
image.save("./K11946_Files/3320.png")
rdf_file = open("./K11946_Files/RDF_DESC.XML", "rb")
description = rdf_file.read()
rdf_file.close()
im = Image.open("./K11946_Files/3320.png")

Insert the RDF document into the image header. In this script, the document is stored under the "description" key of the PNG copy and written out as PNG text metadata. Save the file. Extract the header comments.

In [9]:
im.info["description"] = description
pngsave(im, "./K11946_Files/3320.png")
# RDF Header for Image
im.info

{'description': b'<?xml version="1.0" encoding="UTF-8"?>\n<rdf:RDF\n   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n   xmlns:dc="http://purl.org/dc/elements/1.1/">\n<rdf:Description rdf:about="http://www.julesberman.info/">\n    <dc:creator>Jules J. Berman</dc:creator>\n    <dc:title>Methods in Medical Informatics</dc:title>\n    <dc:description>\n    Medical Informatics methods and algorithms in Perl, Python, and Ruby\n    </dc:description>\n    <dc:date>2010</dc:date>\n</rdf:Description>\n</rdf:RDF>\n'}

**This section is adapted from section 18.3.1, "Script Algorithm", of page 255 from "Methods in Medical Informatics".*

## Analysis: Insert an RDF Document into an Image File

When you include Dublin Core elements in your image file headers, you accomplish several
important goals:*
1. You provide important descriptive information about
the image.
2. You provide information that search engines can extract from your online images, permitting others to find them.
3. If you provide copyright information, you can ensure that anyone who uses your image will have the information needed to use your intellectual property in a legal manner.
4. You turn your image into a mini-database. This allows your file to be integrated with other database files.

**This section is adapted from section 18.3.2, "Analysis", of page 256 from "Methods in Medical Informatics".*
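
The script above relies on Pillow to write the metadata. To show what "inserting text into the header" means at the byte level, the sketch below builds a tiny 1x1 PNG with only the standard library, adds a tEXt chunk, and reads it back. It is an illustration under stated assumptions, not a replacement for Pillow: the function names are invented here, the text is limited to Latin-1 characters, and the CRC values of existing chunks are not verified.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """A PNG chunk is: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def minimal_png():
    """Build a 1x1 8-bit grayscale PNG holding a single black pixel."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


def iter_chunks(png_bytes):
    """Yield (type, data) for each chunk after the signature."""
    pos = len(PNG_SIGNATURE)
    while pos < len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[pos:pos + 4])
        yield png_bytes[pos + 4:pos + 8], png_bytes[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def insert_text(png_bytes, keyword, text):
    """Insert a tEXt chunk (keyword, NUL, text) right after the IHDR chunk."""
    data = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    (ihdr_length,) = struct.unpack(">I", png_bytes[8:12])
    ihdr_end = len(PNG_SIGNATURE) + 12 + ihdr_length
    return png_bytes[:ihdr_end] + make_chunk(b"tEXt", data) + png_bytes[ihdr_end:]


def read_text(png_bytes):
    """Return all tEXt chunks as a {keyword: text} dictionary."""
    found = {}
    for chunk_type, data in iter_chunks(png_bytes):
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
    return found


rdf_snippet = "<dc:creator>Jules J. Berman</dc:creator>"
tagged_png = insert_text(minimal_png(), "description", rdf_snippet)
print(read_text(tagged_png))
```

The image data itself is untouched: the metadata travels in its own chunk, which is why a viewer can display the picture while a script reads the description.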

# Insert an Image File into an RDF Document

Although we may distinguish text files from binary files, all files are actually binary files.
Bytes of 8 bits are converted to ASCII equivalents. If the ASCII equivalents are alphanumeric, we label the file a text file. If the ASCII values are not alphanumeric, we label the file a binary file.
Standard format image files are always binary files. Since RDF syntax is an
ASCII file format, image files cannot be pasted directly into an RDF document.
However, binary files can be converted to and from ASCII format, using a simple
script.*

> This script will utilize the files [3320.jpg](https://datamine.unc.edu/datafiles_jm/) and [RDF_DESC.XML](https://datamine.unc.edu/datafiles_jm/). 3320.jpg is an example JPEG image. RDF_DESC.XML is an XML file containing a few Dublin Core elements. Additional information [here](https://datamine.unc.edu/datafiles_jm/)

**Description adapted from pages 256-257 of "Methods in Medical Informatics"*
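
The conversion between binary and ASCII mentioned above can be sketched in a few lines with the standard library `base64` module. The 256 sample bytes below are a stand-in for the contents of an image file (an assumption made to keep the example self-contained).

```python
import base64

# Stand-in for the bytes of an image file: every possible byte value once
binary_data = bytes(range(256))

# encodebytes() returns bytes; decoding as ASCII gives ordinary text
encoded_text = base64.encodebytes(binary_data).decode("ascii")

# Going back: turn the text into bytes again and reverse the encoding
restored = base64.decodebytes(encoded_text.encode("ascii"))

print(encoded_text.splitlines()[0])
print(restored == binary_data)
```

Every character of `encoded_text` is a letter, a digit, `+`, `/`, `=`, or a line break, so the text can be placed inside an XML element without escaping.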

In [7]:
import base64, re
# Read the image file as raw bytes
image_file = open("./K11946_Files/3320.jpg", "rb")
image_string = image_file.read()
image_file.close()
contents = ""
# encodebytes() returns bytes; decode to obtain plain base64 text
encoded = base64.encodebytes(image_string).decode("ascii")
rdf_file = open("./K11946_Files/RDF_DESC.XML", "r")
rdf_string = rdf_file.read()
rdf_file.close()
# Splitting yields three parts: before the open tag, the tag's text, and after the close tag
rdflist = re.split(r'dc:description>', rdf_string)
contents = rdflist[0] + "dc:description>BEGIN\n" + encoded + "END\n" + rdflist[1] + "dc:description>" + rdflist[2]
# Write the modified RDF document to the output file
rdf_out = open("crdf_image.xml", "w")
print(contents, file=rdf_out)
rdf_out.close()

## Script Algorithm: Insert an Image File into an RDF Document

Import the base64 module, which is part of the Python standard library. Use any image file. In the example, we use 3320.jpg. Read the entire contents of the image file into a variable.*

In [10]:
import base64, re
image_file = open("./K11946_Files/3320.jpg", "rb")
image_string = image_file.read()
image_file.close()
contents = ""

Encode the contents of the image file into base64, using the encoding method
from the external module.

In [12]:
# encodebytes() returns bytes; decode to obtain plain base64 text
encoded = base64.encodebytes(image_string).decode("ascii")

Open the RDF file. In this example, we will use the rdf_desc.xml file.

In [13]:
rdf_file = open("./K11946_Files/RDF_DESC.XML", "r")
rdf_string = rdf_file.read()
rdf_file.close()

Split the file on the \<dc:description> tag, and put the base64-encoded string
into this tagged data section. Mark the base64 text with "BEGIN" and "END". Put the modified contents of the rdf_desc.xml file, now containing the base64
representation of the image file, into a new file, named crdf_image.xml.

In [14]:
# Splitting yields three parts: before the open tag, the tag's text, and after the close tag
rdflist = re.split(r'dc:description>', rdf_string)
contents = rdflist[0] + "dc:description>BEGIN\n" + encoded + "END\n" + rdflist[1] + "dc:description>" + rdflist[2]
# Write the modified RDF document to the output file
rdf_out = open("crdf_image.xml", "w")
print(contents, file=rdf_out)
rdf_out.close()

**This section is adapted from section 18.4.1, "Script Algorithm", of page 257 from "Methods in Medical Informatics".*

## Analysis: Insert an Image File into an RDF Document

The sample script is not particularly flexible. It requires a Dublin
Core description tag appearing in an exact format (i.e., \<dc:description>). Otherwise,
the script will fail and throw an error. The script directly pushes the base64 representation of
the binary image data into the Dublin Core description field. If this were a real RDF
implementation, you would prepare beforehand a specific RDF tag for the base64 data, and you
would prepare an external Schema document that also defined the tag and its properties.
The script simply demonstrates that RDF files can hold binary data files (represented as base64
ASCII strings). There may be instances when you might prefer to insert an image file
into an RDF document, rather than inserting an RDF document into an image file. This may be the case when a single RDF file must contain information on multiple
different image files. Although it is nice to know that the option of inserting image
data into an RDF file is available, in most instances, you will simply point to the external
image file (using its Web address), and retrieve the image data from its URL.*

**This section is adapted from section 18.4.2, "Analysis", of pages 258-259 in "Methods in Medical Informatics".*
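
The script above only shows the insertion step. The sketch below pairs it with the reverse step, recovering the binary data from between the "BEGIN" and "END" markers. It is an illustration rather than the book's code: the function names are invented here, the RDF text is a shortened copy of the sample document, and a 1,024-byte sequence stands in for the image.

```python
import base64
import re

RDF_TEXT = """<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://www.julesberman.info/">
    <dc:title>Methods in Medical Informatics</dc:title>
    <dc:description>
    Medical Informatics methods and algorithms in Perl, Python, and Ruby
    </dc:description>
</rdf:Description>
</rdf:RDF>
"""


def embed_payload(rdf_text, payload):
    """Place base64 text, marked by BEGIN and END, inside dc:description."""
    marker = "<dc:description>"
    if marker not in rdf_text:
        raise ValueError("no <dc:description> tag found")
    encoded = base64.encodebytes(payload).decode("ascii")
    return rdf_text.replace(marker, marker + "BEGIN\n" + encoded + "END\n", 1)


def extract_payload(rdf_text):
    """Return the embedded bytes, or None if no BEGIN/END block is present."""
    # END must stand alone on a line, so base64 text containing the
    # letters E, N, D in sequence is not mistaken for the marker
    match = re.search(r"BEGIN\n(.*?)^END$", rdf_text, re.DOTALL | re.MULTILINE)
    if match is None:
        return None
    return base64.decodebytes(match.group(1).encode("ascii"))


payload = bytes(range(256)) * 4  # stand-in for image bytes
combined = embed_payload(RDF_TEXT, payload)
print(extract_payload(combined) == payload)
```

Raising an explicit error when the tag is missing makes the inflexibility described above visible to the user instead of failing with an index error.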

# RDF Schema

RDF has a formal way of defining objects, called an RDF Schema. An RDF Schema is similar to a
dictionary for the terms in an RDF data document. An RDF Schema is written using RDF
syntax. This means that all RDF Schemas are RDF documents and consist of statements
in the form of data triples.
The important point about RDF Schemas is that they clarify the relationships
among classes of objects in a knowledge domain. Here is an example of class relationships
formally specified as a Schema in RDF:

<rdfs:Class rdf:ID="Neoplasm">
<rdfs:subClassOf
rdfs:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Neural_crest">
<rdfs:subClassOf
neo:resource="#Neoplasm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Germ_cell">
<rdfs:subClassOf
neo:resource="#Neoplasm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Mesoderm">
<rdfs:subClassOf
neo:resource="#Neoplasm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Coelomic">
<rdfs:subClassOf
neo:resource="#Mesoderm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Sub_coelomic">
<rdfs:subClassOf
neo:resource="#Mesoderm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Sub_coelomic_gonadal">
<rdfs:subClassOf
neo:resource="#Sub_coelomic"/>
</rdfs:Class>

RDF Schemas can be transformed into directed graphs, which consist of connected
nodes and arcs, with a direction for each arc. The process of transforming an
RDF Schema into a graphic representation requires a special software application,
such as GraphViz. We will explore GraphViz as we move through this chapter.*

**Description adapted from pages 259-260 of "Methods in Medical Informatics"*
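
One practical use of the subclass statements is to trace the lineage of any class back to the root. The sketch below stores each class with its parent, using the pairs from the Schema example, and walks upward. The dictionary name `parent_of` and the function `lineage` are illustrative choices, not part of RDF.

```python
# Child class -> parent class, taken from the Schema example
parent_of = {
    "Neoplasm": "Class",
    "Neural_crest": "Neoplasm",
    "Germ_cell": "Neoplasm",
    "Mesoderm": "Neoplasm",
    "Coelomic": "Mesoderm",
    "Sub_coelomic": "Mesoderm",
    "Sub_coelomic_gonadal": "Sub_coelomic",
}


def lineage(name):
    """Return the chain of classes from name up to the root class."""
    chain = [name]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


print(" -> ".join(lineage("Sub_coelomic_gonadal")))
```

Because every class has exactly one parent in this Schema, a plain dictionary is enough; a Schema with multiple inheritance would need a list of parents per class.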

# Visualizing an RDF Schema with GraphViz

GraphViz is a free, open source application that produces graphic representations of
hierarchical structures that are described using the GraphViz scripting language.
As an example, here is the hierarchical organization of the Neoplasm Classification,
described in the GraphViz scripting language:

digraph G {<br>
size="10,16";<br>
ranksep="1.75";<br>
node [style=filled color=gray65];<br>
Neoplasm [label="Neoplasm"];<br>
node [style=filled color=lightgray];<br>
EndodermEctoderm<br>
[label="Endoderm/\nEctoderm"];<br>
NeuralCrest [label="Neural Crest"];<br>
GermCell [label="Germ cell"];<br>
Neoplasm -> EndodermEctoderm;<br>
Neoplasm -> Mesoderm;<br>
Neoplasm -> GermCell;<br>
Neoplasm -> Trophectoderm;<br>
Neoplasm -> Neuroectoderm;<br>
Neoplasm -> NeuralCrest;<br>
node [style=filled color=gray95];<br>
Trophectoderm -> Molar;<br>
Trophectoderm -> Trophoblast;<br>
EndodermEctoderm -> Odontogenic;<br>
EndodermEctodermPrimitive<br>
[label="Endoderm/Ectoderm\nPrimitive"];<br>
EndodermEctoderm -> EndodermEctodermPrimitive;<br>
Endocrine<br>
[label="Endoderm/Ectoderm\nEndocrine"];<br>
EndodermEctoderm -> Endocrine;<br>
EndodermEctoderm -> Parenchymal;<br>
Odontogenic<br>
[label="Endoderm/Ectoderm\nOdontogenic"];<br>
EndodermEctoderm -> Surface;<br>
MesodermPrimitive<br>
[label="Mesoderm\nPrimitive"];<br>
Mesoderm -> MesodermPrimitive;<br>
Mesoderm -> Subcoelomic;<br>
Mesoderm -> Coelomic;<br>
NeuroectodermPrimitive<br>
[label="Neuroectoderm\nPrimitive"];<br>
NeuroectodermNeuralTube<br>
[label="Central Nervous\nSystem"];<br>
Neuroectoderm -> NeuroectodermPrimitive;<br>
Neuroectoderm -> NeuroectodermNeuralTube;<br>
NeuralCrestMelanocytic<br>
[label="Melanocytic"];<br>
NeuralCrestPrimitive<br>
[label="Neural Crest\nPrimitive"];<br>
NeuralCrestEndocrine<br>
[label="Neural Crest\nEndocrine"];<br>
PeripheralNervousSystem<br>
[label="Peripheral\nNervous System"];<br>
NeuralCrestOdontogenic<br>
[label="Neural Crest\nOdontogenic"];<br>
NeuralCrest -> NeuralCrestPrimitive;<br>
NeuralCrest -> PeripheralNervousSystem;<br>
NeuralCrest -> NeuralCrestEndocrine;<br>
NeuralCrest -> NeuralCrestMelanocytic;<br>
NeuralCrest -> NeuralCrestOdontogenic;<br>
GermCell -> Differentiated;<br>
GermCell -> Primordial;<br>
}<br>

By eliminating the lowest level of subclasses, we can generate a simpler schematic.*

**Description adapted from pages 260-262 of "Methods in Medical Informatics"*
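
Eliminating the lowest level of subclasses can be done by hand, or with a short script that keeps only the arcs within a chosen distance of the root. The sketch below is one way to do it, using a subset of the arcs listed above. The function names and the `max_depth` parameter are assumptions introduced for illustration.

```python
from collections import deque

# A subset of the (parent, child) arcs from the GraphViz script above
EDGES = [
    ("Neoplasm", "Mesoderm"),
    ("Neoplasm", "GermCell"),
    ("Neoplasm", "Trophectoderm"),
    ("Trophectoderm", "Molar"),
    ("Trophectoderm", "Trophoblast"),
    ("Mesoderm", "Subcoelomic"),
    ("Mesoderm", "Coelomic"),
    ("GermCell", "Differentiated"),
    ("GermCell", "Primordial"),
]


def depths_from(root, edges):
    """Breadth-first search giving each reachable node its distance from root."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


def to_dot(edges, root, max_depth):
    """Emit a digraph containing only arcs whose child lies within max_depth."""
    depth = depths_from(root, edges)
    lines = ["digraph G {"]
    for parent, child in edges:
        if depth.get(child, max_depth + 1) <= max_depth:
            lines.append(f"{parent} -> {child};")
    lines.append("}")
    return "\n".join(lines)


print(to_dot(EDGES, "Neoplasm", 1))
```

With `max_depth` set to 1, only the arcs leaving Neoplasm remain; raising it to 2 restores the full subset.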

# Obtaining GraphViz

GraphViz is free, open source software. The GraphViz download site is
http://www.graphviz.org/Download.php
On Windows, you can
install the software by running the downloaded .exe file. GraphViz has several subapplications: dot, fdp, twopi, neato, and circo. The twopi application,
which we use in this chapter, creates graphs that have a radial layout.
Extensive information on GraphViz is available at
http://www.graphviz.org/*

**Description adapted from pages 262-263 of "Methods in Medical Informatics"*

# Converting a Data Structure to GraphViz

If you work with RDF, you will want a method that can instantaneously
render a schematic of your RDF Schema (ontology) or of any descendant section of
your Schema.
Because the GraphViz language serves a purpose similar to that of an RDF
Schema (both describe relationships among named objects), it is
possible to directly translate an RDF Schema into the GraphViz language. This
is an example of metaprogramming (using a programming language to generate
another program). Once the script has translated an RDF Schema into the GraphViz
language, the GraphViz software can display the class structure as a graph.*

> This script will utilize the file [SCHEMA.TXT](https://datamine.unc.edu/datafiles_jm/). This is a text file containing a schema in RDF syntax. Additional information [here](https://datamine.unc.edu/datafiles_jm/)

**Description adapted from page 263 of "Methods in Medical Informatics"*

In [10]:
#!/usr/local/bin/python
import re, string
in_file = open('./K11946_Files/SCHEMA.TXT', "r")
out_file = open("schema.dot", "w")
# Write the GraphViz header lines to the output file
print("digraph G {", file=out_file)
print("size=\"15,15\";", file=out_file)
print("ranksep=\"3.00\";", file=out_file)
clump = ""
for line in in_file:
    # The end tag closes one class definition
    namematch = re.match(r'\<\/rdfs\:Class>', line)
    if (namematch):
        father = ""
        child = ""
        clump = re.sub(r'\n', ' ', clump)
        # The father class is the name after "#" in the resource attribute
        fathermatch = re.search(r'\:resource\=\"[a-zA-Z0-9\:\/\_\.\-]*#([a-zA-Z\_]+)\"', clump)
        if fathermatch:
            father = fathermatch.group(1)
        # The child class is the value of rdf:ID
        childmatch = re.search(r'rdf\:ID\=\"([a-zA-Z\_]+)\"', clump)
        if childmatch:
            child = childmatch.group(1)
        print(father + " -> " + child + ";", file=out_file)
        clump = ""
    else:
        clump = clump + line
print("}", file=out_file)
in_file.close()
out_file.close()

## Script Algorithm: Converting a Data Structure to GraphViz

Open the file containing the Schema relationships, in RDF syntax. Open an output file, to write the transformed class relationships, in the
GraphViz language.*

In [None]:
#!/usr/local/bin/python
import re, string
in_file = open('./K11946_Files/SCHEMA.TXT', "r")
out_file = open("schema.dot", "w")

Print the first lines of the GraphViz file, which begins with a statement indicating
that a digraph will follow (instructions for a directed graph), its size,
and the length of the separator lines between classes.

In [None]:
# Write the GraphViz header lines to the output file
print("digraph G {", file=out_file)
print("size=\"15,15\";", file=out_file)
print("ranksep=\"3.00\";", file=out_file)
clump = ""

Parse through the RDF classes, using the end tag "\</rdfs:Class>" to indicate
the end of one class definition and the beginning of the next class definition.
Obtain the name of the class and the name of its superclass (i.e., the name of the child class's father).

All of the Schema class statements will have a form equivalent to the following
example:

<rdfs:Class rdf:ID="Neoplasm">
<rdfs:subClassOf
rdfs:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>

The class name appears in quotes, after "rdf:ID=". The superclass name appears
at the end of a resource statement, after the "#" character (here, "Class" at the end of "http://www.w3.org/2000/01/rdf-schema#Class"). Use regular expressions to obtain the name of the child class
and the father class from each RDF Schema statement.

Print to the output file each encountered child and father class in a GraphViz
statement of the following general form:

father class -> child class;

After the schema is parsed, print "}" to the output file, to close the GraphViz
script.

In [None]:
for line in in_file:
    # The end tag closes one class definition
    namematch = re.match(r'\<\/rdfs\:Class>', line)
    if (namematch):
        father = ""
        child = ""
        clump = re.sub(r'\n', ' ', clump)
        # The father class is the name after "#" in the resource attribute
        fathermatch = re.search(r'\:resource\=\"[a-zA-Z0-9\:\/\_\.\-]*#([a-zA-Z\_]+)\"', clump)
        if fathermatch:
            father = fathermatch.group(1)
        # The child class is the value of rdf:ID
        childmatch = re.search(r'rdf\:ID\=\"([a-zA-Z\_]+)\"', clump)
        if childmatch:
            child = childmatch.group(1)
        print(father + " -> " + child + ";", file=out_file)
        clump = ""
    else:
        clump = clump + line
print("}", file=out_file)
in_file.close()
out_file.close()

**This section is adapted from section 18.8.1, "Script Algorithm", of pages 263-264 from "Methods in Medical Informatics".*
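
The same translation can be tried without any input file. The sketch below applies the idea to a short in-memory Schema and returns the GraphViz text as a string. It differs from the script above in two labeled ways: it matches whole class blocks with one regular expression instead of accumulating lines, and it skips any class for which a child or father name cannot be found. The names `SCHEMA_TEXT` and `schema_to_dot` are invented for this example.

```python
import re

SCHEMA_TEXT = '''<rdfs:Class rdf:ID="Neoplasm">
<rdfs:subClassOf
rdfs:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Mesoderm">
<rdfs:subClassOf
neo:resource="#Neoplasm"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Sub_coelomic">
<rdfs:subClassOf
neo:resource="#Mesoderm"/>
</rdfs:Class>
'''

# Everything between a class start tag and its end tag
CLASS_BLOCK = re.compile(r'<rdfs:Class\b(.*?)</rdfs:Class>', re.DOTALL)
# The child name is the value of rdf:ID
CHILD = re.compile(r'rdf:ID="([A-Za-z_]+)"')
# The father name follows "#" inside the resource attribute
FATHER = re.compile(r':resource="[^"#]*#([A-Za-z_]+)"')


def schema_to_dot(schema_text):
    """Translate rdfs:Class blocks into 'father -> child;' GraphViz lines."""
    lines = ["digraph G {"]
    for block in CLASS_BLOCK.findall(schema_text):
        child = CHILD.search(block)
        father = FATHER.search(block)
        if child and father:
            lines.append(f"{father.group(1)} -> {child.group(1)};")
    lines.append("}")
    return "\n".join(lines)


print(schema_to_dot(SCHEMA_TEXT))
```

Returning a string rather than writing directly to a file makes the translation easy to inspect before it is saved as schema.dot.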

## Analysis: Converting a Data Structure to GraphViz

The output is the script, schema.dot, which has the same form as the digraph (GraphViz
language) script shown earlier in this chapter.
After installing GraphViz, we can create the image schema.png from the schema.dot
specification by invoking the twopi subapplication on a command line.<br><br>
c:\ftp>twopi -Tpng schema.dot -o schema.png*

**This section is adapted from section 18.8.2, "Analysis", of pages 265-266 in "Methods in Medical Informatics".*
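
The twopi command can also be launched from Python, which is convenient when the schema.dot file is generated in the same notebook. The sketch below assumes GraphViz is installed and on the system path; if twopi or the input file is missing, it does nothing. The function names are illustrative.

```python
import os
import shutil
import subprocess


def twopi_command(dot_path, png_path):
    """Build the command line shown above as a list of arguments."""
    return ["twopi", "-Tpng", dot_path, "-o", png_path]


def render(dot_path, png_path):
    """Run twopi if it is installed and the input exists; report whether it ran."""
    if shutil.which("twopi") is None or not os.path.exists(dot_path):
        return False
    subprocess.run(twopi_command(dot_path, png_path), check=True)
    return True


print(" ".join(twopi_command("schema.dot", "schema.png")))
print(render("schema.dot", "schema.png"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.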