# Metadata Management and Use

## Required Libraries

xml.etree.ElementTree

requests

In [2]:
import xml.etree.ElementTree as ET
import requests

## Metadata attachment

In this notebook we will explore how to make use of metadata published in tag-based formats such as XML.

<img src="https://www.republica.com/wp-content/uploads/2017/04/grito.jpg " width="250">

Let us start by describing a couple of objects, beginning with a painting: "El Grito" (The Scream) by Edvard Munch.

* Title: El Grito
* Creator: Edvard Munch
* Subject: painting, artwork, scream
* Description: Painting of a man screaming on a bridge
* Publisher: National Gallery of Norway
* Contributor:
* Date: 
* Type: 
* Format: 
* Identifier: 
* Source: 
* Language:
* Relation:
* Coverage:  
* Rights:

Dublin Core can also describe scientific datasets. Let us try with:

https://zenodo.org/record/3372754#.XcFkhE9Kg5k

<div class="alert alert-warning" role="alert" style="margin: 10px">
<p>**Tip**</p>

<p>You can find metadata in the repository itself</p>
</div>

* Title: 
* Creator: 
* Subject:  
* Description: 
* Publisher: 
* Contributor:
* Date: 
* Type: 
* Format: 
* Identifier: 
* Source: 
* Language:
* Relation:
* Coverage:  
* Rights:

From these descriptions we can create XML documents that machines (scripts, software, and so on) can interpret.

El Grito:

```XML
  <dc:contributor></dc:contributor>
  <dc:coverage></dc:coverage>
  <dc:creator>Edvard Munch</dc:creator>
  <dc:date></dc:date>
  <dc:subject>painting, artwork, scream</dc:subject>
  <dc:description>Painting of a man screaming on a bridge</dc:description>
  <dc:format></dc:format>
  <dc:identifier></dc:identifier>
  <dc:language></dc:language>
  <dc:publisher>National Gallery of Norway</dc:publisher>
  <dc:relation></dc:relation>
  <dc:rights></dc:rights>
  <dc:source></dc:source>
  <dc:title>El Grito</dc:title>
  <dc:type></dc:type>
```

Dataset:
  ```XML
 <dc:contributor> </dc:contributor>
  <dc:coverage> </dc:coverage>
  <dc:creator></dc:creator>
  <dc:date></dc:date>
  <dc:subject></dc:subject>
  <dc:description></dc:description>
  <dc:format>  </dc:format>
  <dc:identifier></dc:identifier>
  <dc:language> </dc:language>
  <dc:publisher></dc:publisher>
  <dc:relation> </dc:relation>
  <dc:rights> </dc:rights>
  <dc:source> </dc:source>
  <dc:title></dc:title>
  <dc:type></dc:type>
```
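Records like the two above can also be generated from Python instead of being typed by hand. The following is a minimal sketch that relies only on `xml.etree.ElementTree`; the wrapper element name `metadata` and the helper `build_dc_record` are hypothetical names chosen for illustration and are not part of the original notebook.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# Ask ElementTree to serialize this namespace with the "dc" prefix
ET.register_namespace("dc", DC_NS)

def build_dc_record(fields, root_tag="metadata"):
    """Build an element tree with one dc:<name> child per dictionary entry.

    root_tag is a hypothetical wrapper name, not a Dublin Core term.
    """
    root = ET.Element(root_tag)
    for name, value in fields.items():
        elem = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        elem.text = value
    return root

record = build_dc_record({"title": "El Grito", "creator": "Edvard Munch"})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Building the document from a dictionary keeps the field names in one place and leaves the namespace declaration to the library.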

Now let us see how to handle this data in Python. To do so, we will use the xml library (the xml.etree.ElementTree module imported above).

For the XML document to be well formed, the Dublin Core prefix "dc:" must be declared, that is, bound to the namespace that defines it. To do that, we add the following header before the data:

```XML
<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
```

And we must not forget to close the root element at the end:

```XML
</searchRetrieveResponse>
```
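To see why the header matters, here is a small sketch that parses the same element with and without the namespace declaration. The helper `is_well_formed` and the `record` wrapper element are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The dc: prefix is used but never declared
without_ns = "<record><dc:title>El Grito</dc:title></record>"
# The same content with the prefix bound to the Dublin Core namespace
with_ns = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>El Grito</dc:title></record>"
)

print(is_well_formed(without_ns))
print(is_well_formed(with_ns))
```

The first document is rejected because the parser cannot resolve the prefix; the second one parses normally.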

In [3]:
import xml.etree.ElementTree as ET
dc_xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">


     <dc:contributor>Edvard Munch </dc:contributor>
  <dc:coverage>Lugar indeterminado</dc:coverage>
  <dc:creator>Edvard Munch </dc:creator>
  <dc:date>1910</dc:date>
  <dc:description>Cuadro...</dc:description>
  <dc:format>Oleo sobre carton</dc:format>
  <dc:identifier>id_museo_grito</dc:identifier>
  <dc:language></dc:language>
  <dc:publisher>Galeria nacional de Oslo</dc:publisher>
  <dc:relation>cuadro1, cuadro2, cuadro3</dc:relation>
  <dc:rights>Acceso al museo</dc:rights>
  <dc:source></dc:source>
  <dc:title>El grito</dc:title>
  <dc:type>Cuadro</dc:type>



</searchRetrieveResponse>'''

tree = ET.fromstring(dc_xml)
tree

<Element 'searchRetrieveResponse' at 0x0000019A4ED08130>

To walk through the elements of the XML we have built, we can use a loop, keeping in mind that the information we are interested in sits under the root element 'searchRetrieveResponse':

In [4]:
for table in tree.iter('searchRetrieveResponse'):
    for child in table:
        print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}contributor Edvard Munch 
{http://purl.org/dc/elements/1.1/}coverage Lugar indeterminado
{http://purl.org/dc/elements/1.1/}creator Edvard Munch 
{http://purl.org/dc/elements/1.1/}date 1910
{http://purl.org/dc/elements/1.1/}description Cuadro...
{http://purl.org/dc/elements/1.1/}format Oleo sobre carton
{http://purl.org/dc/elements/1.1/}identifier id_museo_grito
{http://purl.org/dc/elements/1.1/}language None
{http://purl.org/dc/elements/1.1/}publisher Galeria nacional de Oslo
{http://purl.org/dc/elements/1.1/}relation cuadro1, cuadro2, cuadro3
{http://purl.org/dc/elements/1.1/}rights Acceso al museo
{http://purl.org/dc/elements/1.1/}source None
{http://purl.org/dc/elements/1.1/}title El grito
{http://purl.org/dc/elements/1.1/}type Cuadro


Note that, because we use the 'dc:' prefix and declare that it is bound to the URL 'http://purl.org/dc/elements/1.1/', each tag appears in the form {URL}name, for example {http://purl.org/dc/elements/1.1/}contributor.
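When only the short name is needed, the {URL}name form can be split with plain string operations. This is a minimal sketch; `split_tag` is a hypothetical helper, not an ElementTree function.

```python
def split_tag(tag):
    """Split '{namespace}local' into (namespace, local).

    Tags without a namespace return (None, tag).
    """
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return None, tag

print(split_tag("{http://purl.org/dc/elements/1.1/}contributor"))
print(split_tag("searchRetrieveResponse"))
```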

Try displaying the metadata you created for the painting and the dataset:

In [5]:
for table in tree.iter('searchRetrieveResponse'):
    for child in table:
        print(child.text)

Edvard Munch 
Lugar indeterminado
Edvard Munch 
1910
Cuadro...
Oleo sobre carton
id_museo_grito
None
Galeria nacional de Oslo
cuadro1, cuadro2, cuadro3
Acceso al museo
None
El grito
Cuadro


Using findall() on the tree, we can find all the direct children of the current element that have a given tag.

In [6]:
relation = tree.findall('{http://purl.org/dc/elements/1.1/}relation')
print(relation)

[<Element '{http://purl.org/dc/elements/1.1/}relation' at 0x0000019A4ED0A7A0>]


Keep in mind that findall() returns a list of Element objects, that is, pieces of the XML document, so we have to iterate over it as before:

In [7]:
for child in relation:
    print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}relation cuadro1, cuadro2, cuadro3
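ElementTree offers two close relatives of findall() that are handy when a single value is expected. The sketch below compares them on a small sample document; the `record` wrapper and its contents are made up for illustration.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
sample = ET.fromstring(
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:relation>cuadro1</dc:relation>"
    "<dc:relation>cuadro2</dc:relation>"
    "</record>"
)

all_relations = sample.findall(DC + "relation")  # list with every match
first_relation = sample.find(DC + "relation")    # first match only
missing = sample.find(DC + "source")             # None when nothing matches
# findtext() returns the text directly, or the default when there is no match
source_text = sample.findtext(DC + "source", default="(not set)")

print([e.text for e in all_relations])
print(first_relation.text, missing, source_text)
```

Because find() can return None, it is worth checking the result before reading `.text` from it.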


XML uses prefixes so that the full URL of a vocabulary does not have to be repeated every time. We can see them declared in the header:

```XML
<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
```

For example, every time we want to use a Dublin Core element, we use the dc: prefix, which is equivalent to referring to the declaration:

xmlns:dc="http://purl.org/dc/elements/1.1/"

However, ElementTree in Python expects the full URL in each tag name. That gets cumbersome, so we can define a namespace mapping and use the prefix there as well:

In [8]:
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed

tree.find('dc:rights',namespaces).text

'Acceso al museo'

Besides tags and values, XML documents can contain attributes. Given the following example, let us see how to get the list of attributes and their values:

In [9]:
dc_xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?><?xml-stylesheet type="text/xsl" href="/webservices/catalog/xsl/searchRetrieveResponse.xsl"?>
<searchRetrieveResponse xmlns:oclcterms="http://purl.org/oclc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:contributor>asdsadsad</dc:contributor>
<dc:coverage>dfsd</dc:coverage>
<dc:creator>sadsa</dc:creator>
<dc:date>sadas</dc:date>
<dc:description atributo1="valor1" atributo2="valor2">sadsa</dc:description>
<dc:format>sadasd</dc:format>
<dc:identifier>sadsad</dc:identifier>
<dc:language>asdasd</dc:language>
<dc:publisher>wqewq</dc:publisher>
<dc:relation >wqeqw</dc:relation>
<dc:rights>ffefe</dc:rights>
<dc:source>vfvf</dc:source>
<dc:title>wqewqe</dc:title>
<dc:type>ewfrb</dc:type>
</searchRetrieveResponse>'''

tree2 = ET.fromstring(dc_xml)

In [10]:
tree2.find('dc:description',namespaces).attrib

{'atributo1': 'valor1', 'atributo2': 'valor2'}

Once you know the names of these attributes, you can extract their values. Attributes are useful for attaching additional information to the content of a tag. For example, the language could be added as an attribute of the description.

In [11]:
print(tree2.find('dc:description',namespaces).attrib['atributo1'])
print(tree2.find('dc:description',namespaces).attrib['atributo2'])


valor1
valor2
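As a sketch of the language idea, the standard `xml:lang` attribute can be attached to the description. ElementTree exposes it under the fixed XML namespace, and `Element.get()` avoids a KeyError when an attribute is absent. The sample record and the `reviewer` attribute name are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace, so its key is namespace-qualified
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
namespaces = {"dc": "http://purl.org/dc/elements/1.1/"}

doc = ET.fromstring(
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:description xml:lang="es">Cuadro de un hombre gritando</dc:description>'
    "</record>"
)

description = doc.find("dc:description", namespaces)
language = description.get(XML_LANG)               # value of xml:lang
reviewer = description.get("reviewer", "unknown")  # default for a missing attribute

print(language, reviewer)
```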


Let us analyze a more complex XML document, starting by downloading it:

In [12]:
import requests

response = requests.get('https://gist.githubusercontent.com/vivien/580729/raw/651d1b216357c0d7d9fc47075071fb482e11fb36/dublincore-example.xml')
if response.status_code == 200:
    with open("./dublincore-example.xml", 'wb') as f:
        f.write(response.content)

<div class="alert alert-warning" role="alert" style="margin: 10px">
<p>**Remember!**</p>

<p>Jupyter lets you run certain shell commands</p>
</div>

In [13]:
ls

 El volumen de la unidad C no tiene etiqueta.
 El número de serie del volumen es: CE62-6A44

 Directorio de c:\Users\Ruben\Proyectos\master\asignaturas\data life

13/11/2023  18:22    <DIR>          .
13/11/2023  18:22    <DIR>          ..
13/11/2023  18:14            13.595 amt_prototype.xml
13/11/2023  18:22             3.072 datacite-example.xml
16/11/2023  15:57            19.563 dublincore-example.xml
15/11/2023  16:08            81.930 metadataIntro_2023.ipynb
               4 archivos        118.160 bytes
               2 dirs  20.964.851.712 bytes libres


And we load it in Python:

In [14]:
tree = ET.parse('dublincore-example.xml')
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed
for table in tree.iter('{http://www.loc.gov/zing/srw/}searchRetrieveResponse'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}version 1.1
{http://www.loc.gov/zing/srw/}numberOfRecords 33587
{http://www.loc.gov/zing/srw/}records 

{http://www.loc.gov/zing/srw/}nextRecordPosition 11
{http://www.loc.gov/zing/srw/}resultSetIdleTime None
{http://www.loc.gov/zing/srw/}echoedSearchRetrieveRequest 



In [15]:
all_records = tree.findall('{http://www.loc.gov/zing/srw/}records')
print(all_records)

[<Element '{http://www.loc.gov/zing/srw/}records' at 0x0000019A4ED2C810>]


In [16]:
for table in tree.iter('{http://www.loc.gov/zing/srw/}record'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http://www.loc.gov/zing/srw/}recordPacking xml
{http://www.loc.gov/zing/srw/}recordData 

{http://www.loc.gov/zing/srw/}recordSchema info:srw/schema/1/dc
{http:

In [17]:
for table in tree.iter('{http://www.loc.gov/zing/srw/}recordData'):
    for child in table:
        print(child.tag, child.text)

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 

{http://www.loc.gov/zing/srw/}oclcdcs 



In [18]:
for table in tree.iter('{http://www.loc.gov/zing/srw/}oclcdcs'):
    for child in table:
        print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}creator Snelling, Lauraine.
{http://purl.org/dc/elements/1.1/}date c2003
{http://purl.org/dc/elements/1.1/}description "Ruby Torvald sets out on a daunting journey with her young sister, Opal, to hopefully see their long-lost father once more and claim the promised inheritance. But instead of the treasure they expected, the sisters discover something most shocking." -- Book Cover.
{http://purl.org/dc/elements/1.1/}format 320 p. ; 22 cm.
{http://purl.org/dc/elements/1.1/}identifier 0764290762
{http://purl.org/dc/elements/1.1/}identifier 9780764290763
{http://purl.org/dc/elements/1.1/}identifier 0764222228
{http://purl.org/dc/elements/1.1/}identifier 9780764222221
{http://purl.org/dc/elements/1.1/}language eng
{http://purl.org/dc/elements/1.1/}publisher Bethany House Publishers
{http://purl.org/dc/elements/1.1/}relation Dakotah treasures ; 1
{http://purl.org/dc/elements/1.1/}subject Inheritance and succession--Fiction.
{http://purl.org/dc/elements/1.1/}s

In [19]:
table = tree.findall('.//{http://purl.org/dc/elements/1.1/}identifier')
for child in table:
    print(child.tag, child.text)

{http://purl.org/dc/elements/1.1/}identifier 0764290762
{http://purl.org/dc/elements/1.1/}identifier 9780764290763
{http://purl.org/dc/elements/1.1/}identifier 0764222228
{http://purl.org/dc/elements/1.1/}identifier 9780764222221
{http://purl.org/dc/elements/1.1/}identifier 0060277327
{http://purl.org/dc/elements/1.1/}identifier 9780060277321
{http://purl.org/dc/elements/1.1/}identifier 0060277335 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 9780060277338 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 0590189239 (hc)
{http://purl.org/dc/elements/1.1/}identifier 9780590189231 (hc)
{http://purl.org/dc/elements/1.1/}identifier 0671759353
{http://purl.org/dc/elements/1.1/}identifier 9780671759353
{http://purl.org/dc/elements/1.1/}identifier 0316236438 (lib. bdg.) 
{http://purl.org/dc/elements/1.1/}identifier 9780316236430 (lib. bdg.)
{http://purl.org/dc/elements/1.1/}identifier 0316236608 (pbk.)
{http://purl.org/dc/elements/1.1/}identifier 9780316236607 (pbk.)
{http://p



In [20]:
relation = tree.findall('.//{http://purl.org/dc/elements/1.1/}identifier')
for elem in relation:
    print(elem.tag, elem.text)

{http://purl.org/dc/elements/1.1/}identifier 0764290762
{http://purl.org/dc/elements/1.1/}identifier 9780764290763
{http://purl.org/dc/elements/1.1/}identifier 0764222228
{http://purl.org/dc/elements/1.1/}identifier 9780764222221
{http://purl.org/dc/elements/1.1/}identifier 0060277327
{http://purl.org/dc/elements/1.1/}identifier 9780060277321
{http://purl.org/dc/elements/1.1/}identifier 0060277335 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 9780060277338 (lib. bdg)
{http://purl.org/dc/elements/1.1/}identifier 0590189239 (hc)
{http://purl.org/dc/elements/1.1/}identifier 9780590189231 (hc)
{http://purl.org/dc/elements/1.1/}identifier 0671759353
{http://purl.org/dc/elements/1.1/}identifier 9780671759353
{http://purl.org/dc/elements/1.1/}identifier 0316236438 (lib. bdg.) 
{http://purl.org/dc/elements/1.1/}identifier 9780316236430 (lib. bdg.)
{http://purl.org/dc/elements/1.1/}identifier 0316236608 (pbk.)
{http://purl.org/dc/elements/1.1/}identifier 9780316236607 (pbk.)
{http://p
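The flat listings above mix the identifiers of every book. To keep the fields of each record together, the Dublin Core elements can be grouped per `record`. This is a minimal sketch on a small inline sample that imitates the nesting printed above (records, record, recordData, oclcdcs); the sample content and the helper `records_as_dicts` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SRW = "{http://www.loc.gov/zing/srw/}"
DC = "{http://purl.org/dc/elements/1.1/}"

sample = ET.fromstring('''
<records xmlns="http://www.loc.gov/zing/srw/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record><recordData><oclcdcs>
    <dc:title>Ruby</dc:title>
    <dc:identifier>0764290762</dc:identifier>
    <dc:identifier>9780764290763</dc:identifier>
  </oclcdcs></recordData></record>
  <record><recordData><oclcdcs>
    <dc:title>Ruby Holler</dc:title>
    <dc:identifier>0060277327</dc:identifier>
  </oclcdcs></recordData></record>
</records>''')

def records_as_dicts(root):
    """Return one dict per record, mapping each DC field name to a list of values."""
    result = []
    for record in root.iter(SRW + "record"):
        fields = defaultdict(list)
        for elem in record.iter():
            if elem.tag.startswith(DC):
                # Strip the namespace so keys read 'title', 'identifier', ...
                fields[elem.tag[len(DC):]].append((elem.text or "").strip())
        result.append(dict(fields))
    return result

books = records_as_dicts(sample)
for book in books:
    print(book)
```

Values are kept in lists because Dublin Core fields such as identifier can repeat within a single record.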



## XPath

XPath is a language for building expressions that traverse and process an XML document. The idea is similar to using regular expressions to select parts of plain text, except that XPath searches and selects according to the hierarchical structure of the XML. ElementTree supports only a subset of XPath:

<table border="1" class="docutils">
<colgroup>
<col width="30%">
<col width="70%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Syntax</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">tag</span></code></td>
<td>Selects all child elements with the given tag.
For example, <code class="docutils literal notranslate"><span class="pre">spam</span></code> selects all child elements
named <code class="docutils literal notranslate"><span class="pre">spam</span></code>, and <code class="docutils literal notranslate"><span class="pre">spam/egg</span></code> selects all
grandchildren named <code class="docutils literal notranslate"><span class="pre">egg</span></code> in all children named
<code class="docutils literal notranslate"><span class="pre">spam</span></code>.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">*</span></code></td>
<td>Selects all child elements.  For example, <code class="docutils literal notranslate"><span class="pre">*/egg</span></code>
selects all grandchildren named <code class="docutils literal notranslate"><span class="pre">egg</span></code>.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">.</span></code></td>
<td>Selects the current node.  This is mostly useful
at the beginning of the path, to indicate that it’s
a relative path.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">//</span></code></td>
<td>Selects all subelements, on all levels beneath the
current  element.  For example, <code class="docutils literal notranslate"><span class="pre">.//egg</span></code> selects
all <code class="docutils literal notranslate"><span class="pre">egg</span></code> elements in the entire tree.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">..</span></code></td>
<td>Selects the parent element.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">[@attrib]</span></code></td>
<td>Selects all elements that have the given attribute.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">[@attrib='value']</span></code></td>
<td>Selects all elements for which the given attribute
has the given value.  The value cannot contain
quotes.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">[tag]</span></code></td>
<td>Selects all elements that have a child named
<code class="docutils literal notranslate"><span class="pre">tag</span></code>.  Only immediate children are supported.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">[tag='text']</span></code></td>
<td>Selects all elements that have a child named
<code class="docutils literal notranslate"><span class="pre">tag</span></code> whose complete text content, including
descendants, equals the given <code class="docutils literal notranslate"><span class="pre">text</span></code>.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">[position]</span></code></td>
<td>Selects all elements that are located at the given
position.  The position can be either an integer
(1 is the first position), the expression <code class="docutils literal notranslate"><span class="pre">last()</span></code>
(for the last position), or a position relative to
the last position (e.g. <code class="docutils literal notranslate"><span class="pre">last()-1</span></code>).</td>
</tr>
</tbody>
</table>

As you can see, you need to understand the hierarchy of the XML in order to get at the information.
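Several rows of the table can be tried on a small document. The sketch below uses a made-up `library` document (the element and attribute names are assumptions for illustration) to exercise attribute predicates, child-text predicates, and positions.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<library>
  <book lang="en"><title>Ruby</title><year>2003</year></book>
  <book lang="es"><title>El Grito</title><year>1910</year></book>
  <book lang="en"><title>Bad apple</title><year>2007</year></book>
</library>""")

# [@attrib='value']: books whose lang attribute is 'en'
english = [b.findtext("title") for b in doc.findall("./book[@lang='en']")]
# [tag='text']: books that have a title child with exactly this text
by_title = doc.findall("./book[title='Ruby']")
# [position]: first and last book (positions start at 1)
first_title = doc.find("./book[1]/title").text
last_title = doc.find("./book[last()]").findtext("title")

print(english, len(by_title), first_title, last_title)
```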

Can you get the titles of the resources described in the XML?

<div class="alert alert-warning" role="alert" style="margin: 10px">
<p>**Help**</p>

<p>Use './/' to search every level below the element the tree starts from, followed by the namespace and the name of the element to find ({http://purl.org/dc/elements/1.1/}title)</p>
</div>

In [21]:
relation = tree.findall('.//{http://purl.org/dc/elements/1.1/}title')
for elem in relation:
    print(elem.tag, elem.text)

{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby Holler 
{http://purl.org/dc/elements/1.1/}title Through my eyes 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title I'm not Julia Roberts 
{http://purl.org/dc/elements/1.1/}title I am not Julia Roberts
{http://purl.org/dc/elements/1.1/}title The chaos king 
{http://purl.org/dc/elements/1.1/}title Bad apple 
{http://purl.org/dc/elements/1.1/}title The Wall and the Wing 
{http://purl.org/dc/elements/1.1/}title Good girls 


Do the same using the namespace mapping:

In [22]:
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'} # add more as needed
relation = tree.findall('.//dc:title',namespaces)
for elem in relation:
    print(elem.tag, elem.text)

{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby Holler 
{http://purl.org/dc/elements/1.1/}title Through my eyes 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title Ruby 
{http://purl.org/dc/elements/1.1/}title I'm not Julia Roberts 
{http://purl.org/dc/elements/1.1/}title I am not Julia Roberts
{http://purl.org/dc/elements/1.1/}title The chaos king 
{http://purl.org/dc/elements/1.1/}title Bad apple 
{http://purl.org/dc/elements/1.1/}title The Wall and the Wing 
{http://purl.org/dc/elements/1.1/}title Good girls 


## Example with EML

In [23]:
import requests

response = requests.get('https://zenodo.org/record/841691/files/amt_prototype.xml')
if response.status_code == 200:
    with open("./amt_prototype.xml", 'wb') as f:
        f.write(response.content)
        


In [24]:
ls

 El volumen de la unidad C no tiene etiqueta.
 El número de serie del volumen es: CE62-6A44

 Directorio de c:\Users\Ruben\Proyectos\master\asignaturas\data life

13/11/2023  18:22    <DIR>          .
13/11/2023  18:22    <DIR>          ..
16/11/2023  15:57            13.595 amt_prototype.xml
13/11/2023  18:22             3.072 datacite-example.xml
16/11/2023  15:57            19.563 dublincore-example.xml
15/11/2023  16:08            81.930 metadataIntro_2023.ipynb
               4 archivos        118.160 bytes
               2 dirs  20.958.400.512 bytes libres


In more complex standards, the underlying XML can have a nested hierarchy, as is the case with EML (Ecological Metadata Language). Each element can then have from 0 to N "children", forming new subtrees.

In [118]:
tree = ET.parse('amt_prototype.xml')
root = tree.getroot()

for table in root.iter():
    for child in table:
        if len(child)==0:
            print(child.tag, child.text)

attributeName: date
attributeName: Temperature
attributeName: Press
attributeName: Conductivity
attributeName: Salinity
attributeName: Dissolved oxygen
attributeName: rawO2
attributeName: Oxygen Saturation
attributeName: ph
attributeName: Redox
alternateIdentifier 
10.5281/zenodo.841183

title water reservoir of Cuerda del Pozo
organizationName IFCA
electronicMailAddress marco@ifca.unican.es
salutation Mr
givenName Jesus Marco
surName De Lucas
deliveryPoint Avda Castros s/n
city Santander
postalCode 39005
country Spain
organizationName IFCA
electronicMailAddress aguilarf@ifca.unican.es
role guardian
givenName Fernando
surName Aguilar
deliveryPoint Avda Castros s/n
city Santander
postalCode 39005
country Spain
para The CTD 60 is a precision probe for oceanographic and limnological measurements of physical, chemical and optical parameters up to a depth of 2000 m. It allows the simultaneous measurement of following parameters: Pressure (depth), temperature, conductivity, raw O2, REDOX, di
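With deeply nested documents, a flat listing like the one above loses track of where each value sits. One option is to print the full path from the root to every leaf. This is a minimal sketch on an inline sample that imitates part of the EML structure; `leaf_paths` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def leaf_paths(elem, prefix=""):
    """Yield (path, text) for every element without children, depth first."""
    path = f"{prefix}/{elem.tag}"
    if len(elem) == 0:
        yield path, (elem.text or "").strip()
    for child in elem:
        yield from leaf_paths(child, path)

sample = ET.fromstring(
    "<dataset><title>water reservoir</title>"
    "<creator><individualName><givenName>Jesus Marco</givenName>"
    "<surName>De Lucas</surName></individualName></creator></dataset>"
)

paths = list(leaf_paths(sample))
for path, text in paths:
    print(path, "=", text)
```

The printed paths also make it easier to write the XPath expression for a given field.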

Explore a little: project name, authors, list of attributes...

In [120]:
#elementos = tree.findall('/dataset[1]/creator/individualName/salutation')
elementos = tree.findall('.//attributeList/attribute[@id="1465311292527"]/attributeName')
for e in elementos:
    print(e.tag + ":", e.text)

attributeName: date


In [27]:
elementos = tree.findall('.//dataset')
for e in elementos:
    print(e.tag + ":", e.text)
    for i in e.iter():
        print(i.tag + ":", i.text)

dataset:  

dataset:  

title: water reservoir of Cuerda del Pozo
creator:  
individualName: None
salutation: Mr
givenName: Jesus Marco
surName: De Lucas
organizationName: IFCA
address: None
deliveryPoint: Avda Castros s/n
city: Santander
postalCode: 39005
country: Spain
electronicMailAddress: marco@ifca.unican.es
associatedParty: None
individualName: None
givenName: Fernando
surName: Aguilar
organizationName: IFCA
address: None
deliveryPoint: Avda Castros s/n
city: Santander
postalCode: 39005
country: Spain
electronicMailAddress: aguilarf@ifca.unican.es
role: guardian
abstract: None
para: The CTD 60 is a precision probe for oceanographic and limnological measurements of physical, chemical and optical parameters up to a depth of 2000 m. It allows the simultaneous measurement of following parameters: Pressure (depth), temperature, conductivity, raw O2, REDOX, dissolved oxygen, pH, Oxigen Saturation, Salinity.
keywordSet: None
keyword: measure
keyword: water reservoir
keyword: sensor
key

In [28]:
# root.find() returns the existing 'dataset' element; ET.SubElement() would append a new, empty one
dataset = root.find('.//dataset')
for elem in dataset.iter():
    print(elem.tag, elem.text)


# Individual Exercises

## Exercise 1

Starting from the full example of the DataCite metadata schema, print the elements that are equivalent to those proposed by Dublin Core (one per line). You may have to combine several fields of the metadata file into one (for example, in coverage, the coordinates plus the place name).

* Title: 
* Creator: 
* Subject:  
* Description: 
* Publisher: 
* Contributor:
* Date: 
* Type: 
* Format: 
* Identifier: 
* Source: 
* Language:
* Relation:
* Coverage:  
* Rights:

Resource: https://schema.datacite.org/meta/kernel-3.1/example/datacite-example-full-v3.1.xml

In [40]:
import requests

response = requests.get('https://schema.datacite.org/meta/kernel-3.1/example/datacite-example-full-v3.1.xml')
if response.status_code == 200:
    with open("./datacite-example.xml", 'wb') as f:
        f.write(response.content)

In [41]:
# Helper that takes the tree, the namespace and the tag to look up,
# so the same code does not have to be repeated for every tag
def get_datacite_params(tree, xmldata, tag):
    datacite_tag = xmldata + tag
    parts = []

    for table in tree.iter(datacite_tag):
        if len(table) == 0: # No children: the element's own text is the value
            parts.append(table.text or "")
            continue
        for child in table:
            if (child.text or "").strip(): # The child carries the value directly
                parts.append(child.text)
            else: # Empty text means the value sits in a nested '...Name' element
                parts.extend(c.text or "" for c in child.iter() if "Name" in c.tag)
    return ", ".join(parts)

In [42]:
import xml.etree.ElementTree as ET
tree = ET.parse('datacite-example.xml')
xmldata = '{http://datacite.org/schema/kernel-3}'

print("Title:", get_datacite_params(tree, xmldata, 'titles'))
print("Creator:", get_datacite_params(tree, xmldata, 'creators'))   
print("Subject:", get_datacite_params(tree, xmldata, 'subjects'))   
print("Description:", get_datacite_params(tree, xmldata, 'descriptions'))
print("Publisher:", get_datacite_params(tree, xmldata, 'publisher'))
print("Contributor:", get_datacite_params(tree, xmldata, 'contributors'))
print("Date:", get_datacite_params(tree, xmldata, 'dates'))
print("Type:", get_datacite_params(tree, xmldata, 'resourceType'))
print("Format:", get_datacite_params(tree, xmldata, 'formats'))
print("Identifier:", get_datacite_params(tree, xmldata, 'identifier'))
print("Source:", get_datacite_params(tree, xmldata, 'source')) 
print("Language:", get_datacite_params(tree, xmldata, 'language'))  
print("Relation:", get_datacite_params(tree, xmldata, 'relation'))  
print("Coverage:", get_datacite_params(tree, xmldata, 'coverage'))  
print("Rights:", get_datacite_params(tree, xmldata, 'rightsList'))     

Title: Full DataCite XML Example, Demonstration of DataCite Properties.
Creator: Miller, Elizabeth
Subject: 000 computer science
Description: 
            XML example of all DataCite Metadata Schema v3.1 properties.
        
Publisher: DataCite
Contributor: Starr, Joan
Date: 2014-10-17
Type: XML
Format: application/xml
Identifier: 10.5072/example-full
Source: 
Language: en-us
Relation: 
Coverage: 
Rights: CC0 1.0 Universal
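Source, Relation and Coverage come out empty because DataCite has no elements with those names. A possible mapping, offered here as an assumption rather than as the expected solution, is to read Relation from `relatedIdentifier` and Coverage from `geoLocation`. The sketch below runs on a small inline sample with illustrative values; `relation` and `coverage` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

NS = {"d": "http://datacite.org/schema/kernel-3"}

# Illustrative sample, not the downloaded file
sample = ET.fromstring('''
<resource xmlns="http://datacite.org/schema/kernel-3">
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="arXiv" relationType="IsReviewedBy">arXiv:0706.0001</relatedIdentifier>
  </relatedIdentifiers>
  <geoLocations>
    <geoLocation>
      <geoLocationPoint>31.233 -67.302</geoLocationPoint>
      <geoLocationPlace>Atlantic Ocean</geoLocationPlace>
    </geoLocation>
  </geoLocations>
</resource>''')

def relation(root):
    """Join every related identifier as '<relationType> <identifier>'."""
    items = root.findall(".//d:relatedIdentifier", NS)
    return ", ".join(f"{e.get('relationType')} {e.text}" for e in items)

def coverage(root):
    """Combine the place name and the coordinates of each geoLocation."""
    parts = []
    for loc in root.findall(".//d:geoLocation", NS):
        place = loc.findtext("d:geoLocationPlace", default="", namespaces=NS)
        point = loc.findtext("d:geoLocationPoint", default="", namespaces=NS)
        parts.append(f"{place} ({point})" if point else place)
    return ", ".join(parts)

print("Relation:", relation(sample))
print("Coverage:", coverage(sample))
```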


## Exercise 2

List every tag in the XML document together with its attributes (if it has any)

In [44]:
root = tree.getroot()

for table in root.iter():
    print(table.tag) # Print each parent tag
    for child in table:
        if len(child)==0:
            print(child.tag, child.text) # Print each leaf child tag
            for attribute, value in child.attrib.items():
                print(f"- {attribute}: {value}") # Print each attribute of the child tag

{http://datacite.org/schema/kernel-3}resource
{http://datacite.org/schema/kernel-3}identifier 10.5072/example-full
- identifierType: DOI
{http://datacite.org/schema/kernel-3}publisher DataCite
{http://datacite.org/schema/kernel-3}publicationYear 2014
{http://datacite.org/schema/kernel-3}language en-us
{http://datacite.org/schema/kernel-3}resourceType XML
- resourceTypeGeneral: Software
{http://datacite.org/schema/kernel-3}version 3.1
{http://datacite.org/schema/kernel-3}identifier
{http://datacite.org/schema/kernel-3}creators
{http://datacite.org/schema/kernel-3}creator
{http://datacite.org/schema/kernel-3}creatorName Miller, Elizabeth
{http://datacite.org/schema/kernel-3}nameIdentifier 0000-0001-5000-0007
- schemeURI: http://orcid.org/
- nameIdentifierScheme: ORCID
{http://datacite.org/schema/kernel-3}affiliation DataCite
{http://datacite.org/schema/kernel-3}creatorName
{http://datacite.org/schema/kernel-3}nameIdentifier
{http://datacite.org/schema/kernel-3}affiliation
{http://datacit

## Exercise 3

Show the different identifiers in that document in this form: Identifier [type] = [identifier]

Example: Identifier DOI = 10.3122/121321

In [45]:
for table in tree.iter(): # Iterate over every element in the tree
    if 'identifier' in table.tag.lower(): # Only handle tags whose name contains 'identifier'
        for attrib in table.attrib:
            if 'identifiertype' in attrib.lower(): # Look for the tag's '...IdentifierType' attribute
                value = table.attrib.get(attrib)
                print('Identifier', value, '=', table.text)

Identifier DOI = 10.5072/example-full
Identifier URL = http://schema.datacite.org/schema/meta/kernel-3.1/example/datacite-example-full-v3.1.xml
Identifier URL = http://data.datacite.org/application/citeproc+json/10.5072/example-full
Identifier arXiv = arXiv:0706.0001


## Exercise 4

Modify the XML document "amt_prototype.xml" so that every attribute includes its unit (complete it where necessary).

Once edited, display the list of attributes with their units.

Example:

"Attribute: Salinity | Unit: psu"

In [47]:
# The document already contains the unit for each attribute
tree = ET.parse('amt_prototype.xml')

elementos = tree.findall('.//attribute') # Find every 'attribute' element
for attrib in elementos:
    name = attrib.find('attributeName').text # Use the 'attributeName' child as the name
    unitElement = attrib.find('.//unit') # Look for a 'unit' element anywhere inside the 'attribute'
    unit = 'None' # Default when there is no 'unit' element or it is empty
    if unitElement is not None:
        for unitChild in unitElement: # The unit name is the text of the child of 'unit'
            unit = unitChild.text
    print("Attribute:", name, "| Unit:", unit)

Attribute: date | Unit: None
Attribute: Temperature | Unit: celsius
Attribute: Press | Unit: dbar
Attribute: Conductivity | Unit: mS/cm
Attribute: Salinity | Unit: psu
Attribute: Dissolved oxygen | Unit: milligramsPerLiter
Attribute: rawO2 | Unit: dimensionless
Attribute: Oxygen Saturation | Unit: dimensionless
Attribute: ph | Unit: dimensionless
Attribute: Redox | Unit: dimensionless
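The listing above only reads the units; the date attribute still has none. As a sketch of the editing step, the code below adds a unit wherever it is missing. It works on a simplified inline sample in which `unit` hangs directly from `attribute` (in the real file it is nested deeper), and both `ensure_unit` and the choice of `dimensionless` as the fallback unit are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified sample: the first attribute has no unit, the second already has one
sample = ET.fromstring(
    "<attributeList>"
    "<attribute><attributeName>date</attributeName></attribute>"
    "<attribute><attributeName>Salinity</attributeName>"
    "<unit><customUnit>psu</customUnit></unit></attribute>"
    "</attributeList>"
)

def ensure_unit(attribute, unit_name):
    """Add <unit><customUnit>unit_name</customUnit></unit> only if no unit exists."""
    unit = attribute.find(".//unit")
    if unit is None:
        unit = ET.SubElement(attribute, "unit")
    if len(unit) == 0:
        ET.SubElement(unit, "customUnit").text = unit_name
    return unit[0].text

for attribute in sample.findall(".//attribute"):
    name = attribute.findtext("attributeName")
    unit = ensure_unit(attribute, "dimensionless")
    print("Attribute:", name, "| Unit:", unit)
```

On a tree loaded with ET.parse(), calling its write() method afterwards would save the edited document to disk.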
