# Managing and Using Metadata

## Required Libraries

xml.etree.ElementTree

requests

In [1]:
import xml.etree.ElementTree as ET
import requests

## Metadata attachment

In this notebook we will see how metadata published in tag-based formats such as XML can be put to use.

<img src="https://www.republica.com/wp-content/uploads/2017/04/grito.jpg " width="250">

Let's start by describing a couple of objects, beginning with a painting: "El grito" (The Scream) by Edvard Munch.

* Title: El grito
* Creator: Edvard Munch
* Subject: Painting
* Description: Oil painting of a man screaming
* Publisher: National Gallery of Norway
* Contributor: -
* Date: August 2006
* Type: Oil
* Format: Canvas, dimensions
* Identifier: GNN-12312
* Source: -
* Language: -
* Relation: -
* Coverage: -
* Rights: Gallery admission

Dublin Core can also describe scientific datasets. Let's try it with:

https://zenodo.org/record/3372754#.XcFkhE9Kg5k

<div class="alert alert-warning" role="alert" style="margin: 10px">
<p>**Tip**</p>

<p>You can find metadata in the repository itself</p>
</div>

* Title: 
* Creator: 
* Subject:  
* Description: 
* Publisher: 
* Contributor:
* Date: 
* Type: 
* Format: 
* Identifier: 
* Source: 
* Language:
* Relation:
* Coverage:  
* Rights:

From these descriptions we can create XML documents that machines (scripts, software, and so on) can interpret.

El grito:

```XML
  <dc:contributor></dc:contributor>
  <dc:coverage></dc:coverage>
  <dc:creator></dc:creator>
  <dc:date>August 2006</dc:date>
  <dc:description></dc:description>
  <dc:format>Canvas</dc:format>
  <dc:identifier></dc:identifier>
  <dc:language></dc:language>
  <dc:publisher>National Gallery of Norway</dc:publisher>
  <dc:relation></dc:relation>
  <dc:rights></dc:rights>
  <dc:source></dc:source>
  <dc:subject></dc:subject>
  <dc:title>El grito</dc:title>
  <dc:type>Oil</dc:type>
```

Dataset:
```XML
  <dc:contributor> </dc:contributor>
  <dc:coverage> </dc:coverage>
  <dc:creator></dc:creator>
  <dc:date></dc:date>
  <dc:subject></dc:subject>
  <dc:description></dc:description>
  <dc:format>  </dc:format>
  <dc:identifier></dc:identifier>
  <dc:language> </dc:language>
  <dc:publisher></dc:publisher>
  <dc:relation> </dc:relation>
  <dc:rights> </dc:rights>
  <dc:source> </dc:source>
  <dc:title></dc:title>
  <dc:type></dc:type>
```
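A description like the ones above can also be written out by a script rather than by hand. The following is a minimal sketch, assuming a hypothetical wrapper element named `metadata` and using only the title, creator and identifier listed for the painting; it is one possible way to serialize the fields, not a prescribed one.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# Ask ElementTree to serialize this namespace with the conventional "dc" prefix.
ET.register_namespace("dc", DC_NS)

# A few fields from the description of the painting; empty fields are left out.
painting = {
    "title": "El grito",
    "creator": "Edvard Munch",
    "identifier": "GNN-12312",
}

# "metadata" is a hypothetical wrapper element chosen for this sketch.
root = ET.Element("metadata")
for term, value in painting.items():
    element = ET.SubElement(root, f"{{{DC_NS}}}{term}")
    element.text = value

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Parsing `xml_text` again with `ET.fromstring` gives back the same three elements, so the round trip can be used as a quick sanity check.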

Now let's see how to handle this data in Python. We will use the xml.etree.ElementTree module from the standard library.

For the XML document to be well formed, the Dublin Core prefix "dc:" must be declared, that is, bound to its namespace URL. To do so, we add the following header before the data:

```XML
<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
```

And do not forget to close the root element at the end:

```XML
</searchRetrieveResponse>
```

In [2]:
import xml.etree.ElementTree as ET
dc_xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">


     <dc:contributor>Edvard Munch </dc:contributor>
  <dc:coverage>Lugar indeterminado</dc:coverage>
  <dc:creator>Edvard Munch </dc:creator>
  <dc:date>1910</dc:date>
  <dc:description>Cuadro...</dc:description>
  <dc:format>Oleo sobre carton</dc:format>
  <dc:identifier>id_museo_grito</dc:identifier>
  <dc:language></dc:language>
  <dc:publisher>Galeria nacional de Oslo</dc:publisher>
  <dc:relation>cuadro1, cuadro2, cuadro3</dc:relation>
  <dc:rights>Acceso al museo</dc:rights>
  <dc:source></dc:source>
  <dc:title>El gripo</dc:title>
  <dc:type>Cuadro</dc:type>



</searchRetrieveResponse>'''

tree = ET.fromstring(dc_xml)
tree

<Element 'searchRetrieveResponse' at 0x7f923c9c4db8>

To walk through the elements of the XML we have built, we can use a loop, bearing in mind that the information we want is inside 'searchRetrieveResponse':

In [6]:
for table in tree.iter('searchRetrieveResponse'):  # Build an iterator starting at the root of the tree
    for child in table:
        print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}contributor Edvard Munch 
{http://purl.org/dc/elements/1.1/}coverage Lugar indeterminado
{http://purl.org/dc/elements/1.1/}creator Edvard Munch 
{http://purl.org/dc/elements/1.1/}date 1910
{http://purl.org/dc/elements/1.1/}description Cuadro...
{http://purl.org/dc/elements/1.1/}format Oleo sobre carton
{http://purl.org/dc/elements/1.1/}identifier id_museo_grito
{http://purl.org/dc/elements/1.1/}language None
{http://purl.org/dc/elements/1.1/}publisher Galeria nacional de Oslo
{http://purl.org/dc/elements/1.1/}relation cuadro1, cuadro2, cuadro3
{http://purl.org/dc/elements/1.1/}rights Acceso al museo
{http://purl.org/dc/elements/1.1/}source None
{http://purl.org/dc/elements/1.1/}title El gripo
{http://purl.org/dc/elements/1.1/}type Cuadro


Note that, because we use the 'dc:' prefix and declare it as bound to the URL 'http://purl.org/dc/elements/1.1/', each tag is shown in the form {URL}name, for example {http://purl.org/dc/elements/1.1/}contributor.
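When only the short element name is needed, the `{URL}name` form can be split back into its two parts. The sketch below uses a small hand-written document and a hypothetical helper, `split_tag`, which is not part of ElementTree.

```python
import xml.etree.ElementTree as ET

xml_text = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>El grito</dc:title>
  <note>no namespace</note>
</record>"""

def split_tag(tag):
    """Split '{uri}local' into (uri, local); un-namespaced tags get an empty uri."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

root = ET.fromstring(xml_text)
parts = [split_tag(child.tag) for child in root]
for uri, local in parts:
    print(local, "<-", uri or "(no namespace)")
```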

Try displaying the metadata you created for the painting and for the dataset:

In [7]:
for table in tree.iter('searchRetrieveResponse'):
    for child in table:
        print(child.text)

Edvard Munch 
Lugar indeterminado
Edvard Munch 
1910
Cuadro...
Oleo sobre carton
id_museo_grito
None
Galeria nacional de Oslo
cuadro1, cuadro2, cuadro3
Acceso al museo
None
El gripo
Cuadro


Calling findall() on the tree, we can find all the child elements with a given tag.

In [8]:
relation = tree.findall('{http://purl.org/dc/elements/1.1/}relation')
print(relation)

[<Element '{http://purl.org/dc/elements/1.1/}relation' at 0x7f923c9d55e8>]


Bear in mind that what we get back is a list of elements, that is, fragments of the XML document, so we iterate over it as before:

In [9]:
for child in relation:
    print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}relation cuadro1, cuadro2, cuadro3
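Besides `findall()`, ElementTree offers `find()` and `findtext()`, which return a single result. The sketch below, on a small hand-written fragment, contrasts what each call returns, including the cases of an empty element and a missing one.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
xml_text = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:relation>cuadro1, cuadro2, cuadro3</dc:relation>
  <dc:language></dc:language>
</record>"""
root = ET.fromstring(xml_text)

all_relations = root.findall(DC + "relation")   # list of elements (possibly empty)
first_relation = root.find(DC + "relation")     # first matching element, or None
relation_text = root.findtext(DC + "relation")  # text of the first match
empty_text = root.find(DC + "language").text    # empty element: text is None
missing = root.find(DC + "source")              # no such element: None

print(len(all_relations), relation_text, empty_text, missing)
```

Because `find()` can return `None`, it is worth checking the result before reading `.text` on elements that may be absent.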


XML uses prefixes to avoid repeating the full namespace URL every time, as the header shows:

```XML
<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
```

For example, every time we use a Dublin Core element we write the dc: prefix, which refers to the declaration:

xmlns:dc="http://purl.org/dc/elements/1.1/"

In ElementTree, however, tags are matched by their full URL. That gets cumbersome, so we can define a namespace dictionary and use the prefix as well:

In [10]:
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed

tree.find('dc:rights',namespaces).text

'Acceso al museo'
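The prefixes in the dictionary are local to the script: only the URLs have to match the document. As a sketch, the fragment below (hand-written, with a made-up `recordIdentifier` element and value) maps the OCLC namespace to a shorter prefix than the one the XML itself declares.

```python
import xml.etree.ElementTree as ET

xml_text = """<response xmlns:dc="http://purl.org/dc/elements/1.1/"
                        xmlns:oclcterms="http://purl.org/oclc/terms/">
  <dc:rights>Acceso al museo</dc:rights>
  <oclcterms:recordIdentifier>12345</oclcterms:recordIdentifier>
</response>"""
root = ET.fromstring(xml_text)

# Keys are free to differ from the prefixes in the document; the URLs are what count.
namespaces = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "oclc": "http://purl.org/oclc/terms/",
}

by_prefix = root.find("dc:rights", namespaces)
by_uri = root.find("{http://purl.org/dc/elements/1.1/}rights")
record_id = root.findtext("oclc:recordIdentifier", namespaces=namespaces)

print(by_prefix.text, record_id)
```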

Besides tags and values, XML documents can contain attributes. Given the following example, let's see how to get the list of attributes and their values:

In [13]:
dc_xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:contributor>asdsadsad</dc:contributor>
<dc:coverage>dfsd</dc:coverage>
<dc:creator>sadsa</dc:creator>
<dc:date>sadas</dc:date>
<dc:description atributo1="valor1" atributo2="valor2" lang="ES">sadsa</dc:description>
<dc:format>sadasd</dc:format>
<dc:identifier>sadsad</dc:identifier>
<dc:language>asdasd</dc:language>
<dc:publisher>wqewq</dc:publisher>
<dc:relation >wqeqw</dc:relation>
<dc:rights>ffefe</dc:rights>
<dc:source>vfvf</dc:source>
<dc:title>wqewqe</dc:title>
<dc:type>ewfrb</dc:type>
</searchRetrieveResponse>'''

tree2 = ET.fromstring(dc_xml)

In [14]:
tree2.find('dc:description',namespaces).attrib

{'atributo1': 'valor1', 'atributo2': 'valor2', 'lang': 'ES'}

Once you know the attribute names, you can extract their values. Attributes give additional information about the content of a tag; for example, the language of the description could be added as an attribute.

In [15]:
print(tree2.find('dc:description',namespaces).attrib['atributo1'])
print(tree2.find('dc:description',namespaces).attrib['atributo2'])


valor1
valor2
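Indexing `attrib` raises `KeyError` when an attribute is missing, so `Element.get()` with a default is often more convenient. The sketch below, on a hand-written fragment, also reads an `xml:lang` attribute, which ElementTree exposes under the expanded XML namespace.

```python
import xml.etree.ElementTree as ET

xml_text = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description atributo1="valor1" xml:lang="es">sadsa</dc:description>
</record>"""
namespaces = {"dc": "http://purl.org/dc/elements/1.1/"}
description = ET.fromstring(xml_text).find("dc:description", namespaces)

# .get() returns a default instead of raising KeyError when the attribute is absent.
present = description.get("atributo1")
absent = description.get("atributo3", "n/a")

# The reserved xml: prefix is expanded to its namespace URL, like any other prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
language = description.get(XML_LANG)

for name, value in description.items():
    print(name, "=", value)
```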


Let's analyze a more complex XML document, starting by downloading it:

In [57]:
import requests

response = requests.get('https://gist.githubusercontent.com/vivien/580729/raw/651d1b216357c0d7d9fc47075071fb482e11fb36/dublincore-example.xml')
if response.status_code == 200:
    with open("./dublincore-example.xml", 'wb') as f:
        f.write(response.content)

<div class="alert alert-warning" role="alert" style="margin: 10px">
<p>**Remember!**</p>

<p>Jupyter can run some shell commands</p>
</div>

In [17]:
ls

dublincore-example.xml  metadataIntro.ipynb


And we load it in Python:

In [58]:
tree = ET.parse('dublincore-example.xml')
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed
for table in tree.iter('{http://www.loc.gov/zing/srw/}searchRetrieveResponse'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}version 1.1
{http://www.loc.gov/zing/srw/}numberOfRecords 33587
{http://www.loc.gov/zing/srw/}records 

{http://www.loc.gov/zing/srw/}nextRecordPosition 11
{http://www.loc.gov/zing/srw/}resultSetIdleTime None
{http://www.loc.gov/zing/srw/}echoedSearchRetrieveRequest 



In [59]:
all_records = tree.findall('{http://www.loc.gov/zing/srw/}records')
print(all_records)

[<Element '{http://www.loc.gov/zing/srw/}records' at 0x7f923c09c228>]


In [60]:
for table in tree.iter('{http://www.loc.gov/zing/srw/}record'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http:

In [61]:
for table in tree.iter('{http://www.loc.gov/zing/srw/}recordData'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 



In [62]:
for table in tree.iter('{http://www.loc.gov/zing/srw/}oclcdcs'):
    for child in table:
        print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}creator Snelling, Lauraine.
{http://purl.org/dc/elements/1.1/}date c2003
{http://purl.org/dc/elements/1.1/}description "Ruby Torvald sets out on a daunting journey with her young sister, Opal, to hopefully see their long-lost father once more and claim the promised inheritance. But instead of the treasure they expected, the sisters discover something most shocking." -- Book Cover.
{http://purl.org/dc/elements/1.1/}format 320 p. ; 22 cm.
{http://purl.org/dc/elements/1.1/}identifier 0764290762
{http://purl.org/dc/elements/1.1/}identifier 9780764290763
{http://purl.org/dc/elements/1.1/}identifier 0764222228
{http://purl.org/dc/elements/1.1/}identifier 9780764222221
{http://purl.org/dc/elements/1.1/}language eng
{http://purl.org/dc/elements/1.1/}publisher Bethany House Publishers
{http://purl.org/dc/elements/1.1/}relation Dakotah treasures ; 1
{http://purl.org/dc/elements/1.1/}subject Inheritance and succession--Fiction.
{http://purl.org/dc/elements/1.1/}s

In [63]:
table = tree.findall('.//{http://purl.org/dc/elements/1.1/}identifier')
for child in table:
    print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}identifier 0764290762
{http://purl.org/dc/elements/1.1/}identifier 9780764290763
{http://purl.org/dc/elements/1.1/}identifier 0764222228
{http://purl.org/dc/elements/1.1/}identifier 9780764222221
{http://purl.org/dc/elements/1.1/}identifier 0060277327
{http://purl.org/dc/elements/1.1/}identifier 9780060277321
{http://purl.org/dc/elements/1.1/}identifier 0060277335 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 9780060277338 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 0590189239 (hc)
{http://purl.org/dc/elements/1.1/}identifier 9780590189231 (hc)
{http://purl.org/dc/elements/1.1/}identifier 0671759353
{http://purl.org/dc/elements/1.1/}identifier 9780671759353
{http://purl.org/dc/elements/1.1/}identifier 0316236438 (lib. bdg.) 
{http://purl.org/dc/elements/1.1/}identifier 9780316236430 (lib. bdg.)
{http://purl.org/dc/elements/1.1/}identifier 0316236608 (pbk.)
{http://purl.org/dc/elements/1.1/}identifier 9780316236607 (pbk.)
{http://p

In [64]:
relation = tree.findall('.//{http://purl.org/dc/elements/1.1/}identifier')
for elem in relation:
    print(elem.tag, elem.text)

{http://purl.org/dc/elements/1.1/}identifier 0764290762
{http://purl.org/dc/elements/1.1/}identifier 9780764290763
{http://purl.org/dc/elements/1.1/}identifier 0764222228
{http://purl.org/dc/elements/1.1/}identifier 9780764222221
{http://purl.org/dc/elements/1.1/}identifier 0060277327
{http://purl.org/dc/elements/1.1/}identifier 9780060277321
{http://purl.org/dc/elements/1.1/}identifier 0060277335 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 9780060277338 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 0590189239 (hc)
{http://purl.org/dc/elements/1.1/}identifier 9780590189231 (hc)
{http://purl.org/dc/elements/1.1/}identifier 0671759353
{http://purl.org/dc/elements/1.1/}identifier 9780671759353
{http://purl.org/dc/elements/1.1/}identifier 0316236438 (lib. bdg.) 
{http://purl.org/dc/elements/1.1/}identifier 9780316236430 (lib. bdg.)
{http://purl.org/dc/elements/1.1/}identifier 0316236608 (pbk.)
{http://purl.org/dc/elements/1.1/}identifier 9780316236607 (pbk.)
{http://p
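Searching the whole tree with `.//` returns the identifiers of every book in one flat list. To keep them grouped by record, the search can be run inside each `record` element instead. The sketch below uses a reduced, hand-written sample that mimics the nesting of the downloaded file; the pairing of titles and identifiers in it is illustrative.

```python
import xml.etree.ElementTree as ET

# Reduced sample with the same nesting as the SRW response (not the real file).
xml_text = """<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/"
                        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <records>
    <record><recordData><oclcdcs>
      <dc:title>Ruby</dc:title>
      <dc:identifier>0764290762</dc:identifier>
      <dc:identifier>9780764290763</dc:identifier>
    </oclcdcs></recordData></record>
    <record><recordData><oclcdcs>
      <dc:title>Ruby Holler</dc:title>
      <dc:identifier>0060277327</dc:identifier>
    </oclcdcs></recordData></record>
  </records>
</searchRetrieveResponse>"""

namespaces = {
    "srw": "http://www.loc.gov/zing/srw/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
root = ET.fromstring(xml_text)

books = []
for record in root.iterfind(".//srw:record", namespaces):
    # Paths starting with ".//" are relative to the current record only.
    books.append({
        "title": record.findtext(".//dc:title", namespaces=namespaces),
        "identifiers": [e.text for e in record.iterfind(".//dc:identifier", namespaces)],
    })

for book in books:
    print(book["title"], "->", ", ".join(book["identifiers"]))
```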

## XPATH

XPath is a language for building expressions that traverse and process an XML document. The idea is similar to using regular expressions to select parts of plain, unstructured text, except that XPath searches and selects according to the hierarchical structure of the XML.

<table border="1" class="docutils">
<colgroup>
<col width="30%">
<col width="70%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Syntax</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">tag</span></code></td>
<td>Selects all child elements with the given tag.
For example, <code class="docutils literal notranslate"><span class="pre">spam</span></code> selects all child elements
named <code class="docutils literal notranslate"><span class="pre">spam</span></code>, and <code class="docutils literal notranslate"><span class="pre">spam/egg</span></code> selects all
grandchildren named <code class="docutils literal notranslate"><span class="pre">egg</span></code> in all children named
<code class="docutils literal notranslate"><span class="pre">spam</span></code>.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">*</span></code></td>
<td>Selects all child elements.  For example, <code class="docutils literal notranslate"><span class="pre">*/egg</span></code>
selects all grandchildren named <code class="docutils literal notranslate"><span class="pre">egg</span></code>.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">.</span></code></td>
<td>Selects the current node.  This is mostly useful
at the beginning of the path, to indicate that it’s
a relative path.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">//</span></code></td>
<td>Selects all subelements, on all levels beneath the
current  element.  For example, <code class="docutils literal notranslate"><span class="pre">.//egg</span></code> selects
all <code class="docutils literal notranslate"><span class="pre">egg</span></code> elements in the entire tree.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">..</span></code></td>
<td>Selects the parent element.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">[@attrib]</span></code></td>
<td>Selects all elements that have the given attribute.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">[@attrib='value']</span></code></td>
<td>Selects all elements for which the given attribute
has the given value.  The value cannot contain
quotes.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">[tag]</span></code></td>
<td>Selects all elements that have a child named
<code class="docutils literal notranslate"><span class="pre">tag</span></code>.  Only immediate children are supported.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">[tag='text']</span></code></td>
<td>Selects all elements that have a child named
<code class="docutils literal notranslate"><span class="pre">tag</span></code> whose complete text content, including
descendants, equals the given <code class="docutils literal notranslate"><span class="pre">text</span></code>.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">[position]</span></code></td>
<td>Selects all elements that are located at the given
position.  The position can be either an integer
(1 is the first position), the expression <code class="docutils literal notranslate"><span class="pre">last()</span></code>
(for the last position), or a position relative to
the last position (e.g. <code class="docutils literal notranslate"><span class="pre">last()-1</span></code>).</td>
</tr>
</tbody>
</table>
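Several of the expressions in the table can be tried on a small document. The sketch below uses a made-up `library` document (element names, years and the helper `titles` are assumptions for illustration) and the subset of XPath that ElementTree supports.

```python
import xml.etree.ElementTree as ET

xml_text = """<library>
  <book lang="en"><title>Ruby</title><year>2003</year></book>
  <book lang="es"><title>El grito</title></book>
  <book lang="en"><title>Bad apple</title><year>2009</year></book>
</library>"""
root = ET.fromstring(xml_text)

def titles(path):
    """Return the <title> text of every <book> matched by the given path."""
    return [book.findtext("title") for book in root.findall(path)]

english = titles("book[@lang='en']")          # attribute equals a value
with_year = titles("book[year]")              # has a <year> child
by_title = titles("book[title='El grito']")   # child text equals a value
last_book = titles("book[last()]")            # positional predicate
parents = root.findall(".//year/..")          # parent of every <year>

print(english, with_year, by_title, last_book, len(parents))
```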

As you can see, you need to understand the hierarchy of the XML to get at the information.

Can you get the titles of the resources described in the XML?

<div class="alert alert-warning" role="alert" style="margin: 10px">
<p>**Hint**</p>

<p>Use './/' to search at every level below the current element, followed by the namespace and the name of the element to find ({http://purl.org/dc/elements/1.1/}title)</p>
</div>

In [25]:
relation = tree.findall('.//{http://purl.org/dc/elements/1.1/}title')
for elem in relation:
    print(elem.tag, elem.text)

{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby Holler 
{http://purl.org/dc/elements/1.1/}title Through my eyes 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title I'm not Julia Roberts 
{http://purl.org/dc/elements/1.1/}title I am not Julia Roberts
{http://purl.org/dc/elements/1.1/}title The chaos king 
{http://purl.org/dc/elements/1.1/}title Bad apple 
{http://purl.org/dc/elements/1.1/}title The Wall and the Wing 
{http://purl.org/dc/elements/1.1/}title Good girls 


Do the same using the namespace dictionary:

In [65]:
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed
relation = tree.findall('.//dc:title',namespaces)
for elem in relation:
    print(elem.tag, elem.text)

{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby Holler 
{http://purl.org/dc/elements/1.1/}title Through my eyes 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title I'm not Julia Roberts 
{http://purl.org/dc/elements/1.1/}title I am not Julia Roberts
{http://purl.org/dc/elements/1.1/}title The chaos king 
{http://purl.org/dc/elements/1.1/}title Bad apple 
{http://purl.org/dc/elements/1.1/}title The Wall and the Wing 
{http://purl.org/dc/elements/1.1/}title Good girls 


Example with EML

In [27]:
import requests

response = requests.get('https://zenodo.org/record/841691/files/amt_prototype.xml')
if response.status_code == 200:
    with open("./amt_prototype.xml", 'wb') as f:
        f.write(response.content)
        


In [28]:
ls

amt_prototype.xml  dublincore-example.xml  metadataIntro.ipynb


In more complex standards such as EML, the base XML can have a nested hierarchy: each element can have from 0 to N children, which in turn form new subtrees.

In [4]:
tree = ET.parse('amt_prototype.xml')
root = tree.getroot()

for table in root.iter():
    for child in table:
        if len(child)==0:
            print(child.tag, "|", child.text)

alternateIdentifier | 
10.5281/zenodo.841183

title | water reservoir of Cuerda del Pozo
organizationName | IFCA
electronicMailAddress | marco@ifca.unican.es
salutation | Mr
givenName | Jesus Marco
surName | De Lucas
deliveryPoint | Avda Castros s/n
city | Santander
postalCode | 39005
country | Spain
organizationName | IFCA
electronicMailAddress | aguilarf@ifca.unican.es
role | guardian
givenName | Fernando
surName | Aguilar
deliveryPoint | Avda Castros s/n
city | Santander
postalCode | 39005
country | Spain
para | The CTD 60 is a precision probe for oceanographic and limnological measurements of physical, chemical and optical parameters up to a depth of 2000 m. It allows the simultaneous measurement of following parameters: Pressure (depth), temperature, conductivity, raw O2, REDOX, dissolved oxygen, pH, Oxigen Saturation, Salinity.
keyword | measure
keyword | water reservoir
keyword | sensor
keyword | physical and chemical parameters
geographicDescription | water reservoir
westBoundi
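The flat listing above loses track of where each value sits in the hierarchy (for instance, which `organizationName` belongs to which party). One possible alternative is a recursive walk that records the path to each leaf. The sketch below runs on a hand-written fragment shaped like the EML file, and `leaf_paths` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment shaped like the EML output above (not the full file).
xml_text = """<dataset>
  <title>water reservoir of Cuerda del Pozo</title>
  <creator>
    <individualName><givenName>Jesus Marco</givenName><surName>De Lucas</surName></individualName>
    <organizationName>IFCA</organizationName>
  </creator>
</dataset>"""

def leaf_paths(element, prefix=""):
    """Yield (path, text) for every element that has no children."""
    path = f"{prefix}/{element.tag}"
    if len(element) == 0:
        yield path, (element.text or "").strip()
    for child in element:
        yield from leaf_paths(child, path)

leaves = list(leaf_paths(ET.fromstring(xml_text)))
for path, text in leaves:
    print(path, "|", text)
```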

Explore a little: project name, authors, list of attributes...

In [31]:
#elementos = tree.findall('/dataset[1]/creator/individualName/salutation')
elementos = tree.findall('.//attributeList/attribute[@id="1465311292527"]/attributeName')
for e in elementos:
    print(e.tag + ":", e.text)

attributeName: date


In [32]:
elementos = tree.findall('.//dataset')
for e in elementos:
    print(e.tag + ":", e.text)
    for i in e.iter():
        print(i.tag + ":", i.text)

dataset:  

dataset:  

title: water reservoir of Cuerda del Pozo
creator:  
individualName: None
salutation: Mr
givenName: Jesus Marco
surName: De Lucas
organizationName: IFCA
address: None
deliveryPoint: Avda Castros s/n
city: Santander
postalCode: 39005
country: Spain
electronicMailAddress: marco@ifca.unican.es
associatedParty: None
individualName: None
givenName: Fernando
surName: Aguilar
organizationName: IFCA
address: None
deliveryPoint: Avda Castros s/n
city: Santander
postalCode: 39005
country: Spain
electronicMailAddress: aguilarf@ifca.unican.es
role: guardian
abstract: None
para: The CTD 60 is a precision probe for oceanographic and limnological measurements of physical, chemical and optical parameters up to a depth of 2000 m. It allows the simultaneous measurement of following parameters: Pressure (depth), temperature, conductivity, raw O2, REDOX, dissolved oxygen, pH, Oxigen Saturation, Salinity.
keywordSet: None
keyword: measure
keyword: water reservoir
keyword: sensor
key

In [33]:
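dataset = ET.SubElement(root,'dataset')  # Appends a new, empty <dataset> child to root (it does not look up the existing one)
for table in dataset.iter():
    print(table.tag, table.text)  # Print the element being visited, not a `child` left over from an earlier cell

dataset None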


# Personal Exercise

## Exercise 1

Using the full example of the DataCite metadata schema, print the elements that are equivalent to those proposed by Dublin Core (one per line). You may have to combine several fields of the metadata file into one (for example, for coverage, the coordinates plus the place name).

* Title: 
* Creator: 
* Subject:  
* Description: 
* Publisher: 
* Contributor:
* Date: 
* Type: 
* Format: 
* Identifier: 
* Source: 
* Language:
* Relation:
* Coverage:  
* Rights:

Resource: https://schema.datacite.org/meta/kernel-3.1/example/datacite-example-full-v3.1.xml

In [5]:
import xml.etree.ElementTree as ET
import requests

response = requests.get('https://schema.datacite.org/meta/kernel-3.1/example/datacite-example-full-v3.1.xml')
if response.status_code == 200:
    with open("./datacite-example-full-v3.1.xml", 'wb') as f:
        f.write(response.content)

In [36]:
ls

amt_prototype.xml               dublincore-example.xml
datacite-example-full-v3.1.xml  metadataIntro.ipynb


In [13]:
tree = ET.parse('datacite-example-full-v3.1.xml')
# Note: the 'dc' prefix is bound here to the DataCite kernel-3 namespace, not to Dublin Core
namespaces = {'dc': 'http://datacite.org/schema/kernel-3'}

# DataCite elements that play the role of each Dublin Core term ('source' finds nothing in this file)
DClist = ['title','creatorName', 'subject', 'description', 'publisher', 'contributorName', 'date', 'resourceType', 'format',
          'identifier', 'source', 'language', 'relatedIdentifier']

for name in DClist:
    for child in tree.findall('.//dc:' + name, namespaces):
        print(child.tag, "|", child.text)

# Coverage combines the coordinates with the place name
find1 = tree.findall('.//dc:geoLocationPoint',namespaces)
find2 = tree.findall('.//dc:geoLocationPlace',namespaces)

for child1, child2 in zip(find1, find2):
    print('coverage', "|", child1.text + ', ' + child2.text)

for child in tree.findall('.//dc:rights',namespaces):
    print(child.tag, "|", child.text)


{http://datacite.org/schema/kernel-3}title | Full DataCite XML Example
{http://datacite.org/schema/kernel-3}title | Demonstration of DataCite Properties.
{http://datacite.org/schema/kernel-3}creatorName | Miller, Elizabeth
{http://datacite.org/schema/kernel-3}subject | 000 computer science
{http://datacite.org/schema/kernel-3}description | 
            XML example of all DataCite Metadata Schema v3.1 properties.
        
{http://datacite.org/schema/kernel-3}publisher | DataCite
{http://datacite.org/schema/kernel-3}contributorName | Starr, Joan
{http://datacite.org/schema/kernel-3}date | 2014-10-17
{http://datacite.org/schema/kernel-3}resourceType | XML
{http://datacite.org/schema/kernel-3}format | application/xml
{http://datacite.org/schema/kernel-3}identifier | 10.5072/example-full
{http://datacite.org/schema/kernel-3}language | en-us
{http://datacite.org/schema/kernel-3}relatedIdentifier | http://data.datacite.org/application/citeproc+json/10.5072/example-full
{http://datacite.org/sc
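To print each Dublin Core term on a single line, as the exercise asks, the lookups can be driven by a mapping from each term to the DataCite elements that feed it. The sketch below is one possible arrangement on a hand-written fragment: the place name and coordinates are made-up sample values, and the `dcite` prefix and the `mapping` dictionary are names chosen for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the DataCite kernel-3 namespace; the place and
# coordinates are made-up sample values, not copied from the official example.
xml_text = """<resource xmlns="http://datacite.org/schema/kernel-3">
  <titles><title>Full DataCite XML Example</title></titles>
  <creators><creator><creatorName>Miller, Elizabeth</creatorName></creator></creators>
  <geoLocations><geoLocation>
    <geoLocationPoint>41.85 -2.75</geoLocationPoint>
    <geoLocationPlace>Cuerda del Pozo</geoLocationPlace>
  </geoLocation></geoLocations>
</resource>"""

ns = {"dcite": "http://datacite.org/schema/kernel-3"}
# Dublin Core term -> DataCite element names whose text is joined into one value.
mapping = {
    "title": ["title"],
    "creator": ["creatorName"],
    "coverage": ["geoLocationPoint", "geoLocationPlace"],
    "source": ["source"],  # no counterpart in this fragment
}

root = ET.fromstring(xml_text)
dublin_core = {}
for term, names in mapping.items():
    values = [e.text for n in names for e in root.iterfind(f".//dcite:{n}", ns)]
    dublin_core[term] = ", ".join(values)

for term, value in dublin_core.items():
    print(term, "|", value)
```

Terms with no matching element end up as empty strings, which makes the gaps in the mapping visible in the output.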

## Exercise 2

List every tag in the XML document together with its attributes (if it has any).

In [24]:
tree = ET.parse('datacite-example-full-v3.1.xml')
root = tree.getroot()

for table in root.iter():
    for child in table:
        if child.attrib != {}:
            print(child.attrib, '\n', child.tag, '\n')
        else:
            print(child.tag, '\n')

{'identifierType': 'DOI'} 
 {http://datacite.org/schema/kernel-3}identifier 

{http://datacite.org/schema/kernel-3}creators 

{http://datacite.org/schema/kernel-3}titles 

{http://datacite.org/schema/kernel-3}publisher 

{http://datacite.org/schema/kernel-3}publicationYear 

{http://datacite.org/schema/kernel-3}subjects 

{http://datacite.org/schema/kernel-3}contributors 

{http://datacite.org/schema/kernel-3}dates 

{http://datacite.org/schema/kernel-3}language 

{'resourceTypeGeneral': 'Software'} 
 {http://datacite.org/schema/kernel-3}resourceType 

{http://datacite.org/schema/kernel-3}alternateIdentifiers 

{http://datacite.org/schema/kernel-3}relatedIdentifiers 

{http://datacite.org/schema/kernel-3}sizes 

{http://datacite.org/schema/kernel-3}formats 

{http://datacite.org/schema/kernel-3}version 

{http://datacite.org/schema/kernel-3}rightsList 

{http://datacite.org/schema/kernel-3}descriptions 

{http://datacite.org/schema/kernel-3}geoLocations 

{http://datacite.org/schema/ke

## Exercise 3

Show the different identifiers in the document in this form: Identifier [type] = [identifier]

Example: Identifier DOI = 10.3122/121321

In [35]:
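tree = ET.parse('datacite-example-full-v3.1.xml')
root = tree.getroot()

for table in root.iter():
    for child in table:
        # Test each substring explicitly: `'identifier' or 'Identifier' in child.tag` is always true
        if 'identifier' in child.tag or 'Identifier' in child.tag:
            # The kind of identifier is stored in an attribute whose name ends in 'Type'
            for name, value in child.attrib.items():
                if name.endswith('Type'):
                    print('Identifier', value, '=', child.text)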


In [42]:
tree = ET.parse('datacite-example-full-v3.1.xml')
root = tree.getroot()

for table in root.iter():
    for child in table:
        # Parentheses are needed: `A and 'identifier' or B` is true for any element that has attributes
        if child.attrib and ('identifier' in child.tag or 'Identifier' in child.tag):
            print('Identifier', child.attrib, ' = ', child.text, '\n')

Identifier {'identifierType': 'DOI'}  =  10.5072/example-full 

Identifier {'schemeURI': 'http://orcid.org/', 'nameIdentifierScheme': 'ORCID'}  =  0000-0001-5000-0007 

Identifier {'schemeURI': 'http://orcid.org/', 'nameIdentifierScheme': 'ORCID'}  =  0000-0002-7285-027X 

Identifier {'alternateIdentifierType': 'URL'}  =  http://schema
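To get the one-line-per-identifier format the exercise asks for, the type can be read from the attribute whose name ends in `Type`. The sketch below works on a hand-written fragment (the alternate identifier URL is a placeholder) and assumes that identifier elements follow that naming pattern.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; the alternate identifier value is a placeholder.
xml_text = """<resource xmlns="http://datacite.org/schema/kernel-3">
  <identifier identifierType="DOI">10.5072/example-full</identifier>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="URL">http://example.org/record</alternateIdentifier>
  </alternateIdentifiers>
  <titles><title titleType="Subtitle">Demonstration</title></titles>
</resource>"""

lines = []
for element in ET.fromstring(xml_text).iter():
    local_name = element.tag.split("}")[-1]
    if not local_name.lower().endswith("identifier"):
        continue  # skips containers such as alternateIdentifiers and unrelated tags
    # Assumption: the kind of identifier is in an attribute whose name ends in "Type".
    kinds = [value for name, value in element.attrib.items() if name.endswith("Type")]
    if kinds:
        lines.append(f"Identifier {kinds[0]} = {element.text}")

print("\n".join(lines))
```

Elements such as `nameIdentifier`, which describe their scheme with a differently named attribute, would need their own rule under this approach.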