# Upload XML schemas to CDCS

This notebook is for managing the AM Bench schema and XML documents.

The AM Bench XML schema implements a data model that describes the Additive Manufacturing Benchmark 2022 series data. The data model provides a robust set of metadata for the build processes, their resulting specimens, and the measurements made on them in the context of the AM Bench 2022 project. Project scientists enter the metadata in Excel spreadsheets. Python scripts defined in <code>ambench.mapping</code> and <code>Excel to XML.ipynb</code> translate the Excel files into XML documents compliant with the schema. Both the schema and the resulting XML documents are uploaded into a <b>private</b> AM Bench CDCS database instance (henceforth called CDCS) using the REST API provided by <code>pycdcs</code> (https://github.com/usnistgov/pycdcs).

If you wish to access the private CDCS site, please contact Lyle E. Levine (lyle.levine@nist.gov).

This notebook covers the following:
* Uploading schemas and associated XML documents to the <b>private</b> CDCS.
* Replacing existing schemas and associated XML documents in CDCS with new versions of the schemas and the XML documents.

_Please note that this notebook does not provide scripts to migrate XML documents when their schema changes from one version to a later version._


In [None]:
import lxml.etree as ET
import pandas
import os
from pathlib import Path
import xmlschema
import getpass
from cdcs import CDCS
import requests
import json
import glob
import uuid
import sys
import importlib

In [1]:
# If you run this notebook in SciServer, please run this cell after uncommenting the import statement below.

# import SciServer.Authentication as sauth

# Steps of uploading or replacing XML schemas and XML documents in CDCS 

To upload an XML schema to CDCS, you first need to create a <code>Template</code>. A <code>Template</code> is a name assigned to an XML schema, including all its versions, in CDCS. By default, the Template is set to the namespace of the corresponding XML schema. If a set of schemas depend on one another, a single template is sufficient to represent the whole set. Therefore, <code>CONFIG.TEMPLATE</code> requires only one value.


Pass the top-level schema (<code>CONFIG.ROOT_SCHEMA</code>), which contains the root element, as an argument to the <code>loadSchema</code> function below.


* Create a configuration file in JSON format. An example is given in <code>default_config.json</code>.
* Put all schema files in the folder that <code>XSD</code> points to in your configuration file.
1. Instantiate the <code>__CONFIG</code> class (Step 1).
2. Connect to CDCS (Step 2).
3. Make workspace folders to put valid and invalid XML files (Step 3).
4. Build XML validator from schema (Step 4).
5. Define utility function (Step 5).
6. Get template from CDCS (Step 6).
   Check whether a template named <code>TEMPLATE</code>, as given in the configuration, already exists in CDCS.
 
    1. If it does not, do the following:
        1. Create a template named <code>TEMPLATE</code> in CDCS, upload all schemas into CDCS, and link them to the template (Step 6).
        2. Upload the XML files to the template in CDCS (Step 10).
   
   2. Otherwise, the XML documents in CDCS need to be migrated to the new version of the schema while their &lt;pid&gt;s remain unchanged. 
       1. Get the versions of the template (Step 6).
       2. Download all XML documents in the template from CDCS (Step 7).
       3. Make them valid against the new version of the schemas (Step 8).
       4. Upload the new version of the schemas (Step 9).
       5. Upload all valid XML documents into the CDCS template (Step 10).
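
Because migration must leave each document's &lt;pid&gt; unchanged, it can help to compare the &lt;pid&gt; values before and after a document is fixed. The following is a minimal sketch using only the standard library. It assumes that each document stores its identifier in an element whose local name is `pid`; where that element sits in the AM Bench schema is not assumed. The helper names `extract_pid` and `pid_unchanged` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_pid(xml_text):
    """Return the text of the first element whose local name is 'pid', or None."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Strip a namespace prefix such as '{http://example.org/ns}pid'.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "pid":
            return (elem.text or "").strip()
    return None


def pid_unchanged(old_xml, new_xml):
    """True when both documents carry the same, non-empty pid."""
    old_pid = extract_pid(old_xml)
    return bool(old_pid) and old_pid == extract_pid(new_xml)


# Made-up documents for illustration only.
old_doc = "<record><pid>https://example.org/pid/1</pid><name>a</name></record>"
new_doc = "<record><pid>https://example.org/pid/1</pid><title>a</title></record>"
print(pid_unchanged(old_doc, new_doc))
```

A check like this could be run on each fixed file before it is placed in the folder of valid documents.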



## 1. Instantiate __CONFIG class
- To run this notebook, create your own configuration file in JSON format. See the example in <code>default_config.json</code>.
- Pass the path to your JSON file as the <code>conf_json</code> argument to the constructor of the <code>__CONFIG</code> class defined in <code>config.py</code>. If no argument is passed, <code>default_config.json</code> is used.
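
The actual `__CONFIG` class lives in `config.py` and is not reproduced here. As a rough illustration of the idea only, the sketch below loads a JSON file and exposes its keys as attributes. The key names are the ones this notebook reads (`TEMPLATE`, `ROOT_SCHEMA`, `XSD`, `AMBENCH_URL`, `USER`, `PASS`, `pyUTILS_path`); the class name `SketchConfig` and all values are hypothetical placeholders.

```python
import json
import os
import tempfile


class SketchConfig:
    """Hypothetical stand-in: expose the keys of a JSON file as attributes."""

    def __init__(self, conf_json="default_config.json"):
        with open(conf_json, encoding="utf-8") as f:
            for key, value in json.load(f).items():
                setattr(self, key, value)


# Placeholder values only; a JSON null is loaded as None in Python.
example = {
    "TEMPLATE": "example-template",
    "ROOT_SCHEMA": "root.xsd",
    "XSD": "./xsd/",
    "AMBENCH_URL": "https://cdcs.example.org/",
    "USER": None,
    "PASS": None,
    "pyUTILS_path": "./python",
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "your_config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(example, f, indent=2)
    cfg = SketchConfig(conf_json=path)

print(cfg.TEMPLATE, cfg.USER is None)
```

Leaving `USER` and `PASS` as null keeps credentials out of the file, in which case the notebook asks for them interactively.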

In [None]:
# Import config and instantiate __CONFIG class.
import sys

import config
from config import __CONFIG

CONFIG = __CONFIG(conf_json = "./your_config.json")


In [None]:
# If USER or PASS is null in your configuration, you are prompted to enter it interactively.
# For an anonymous user, enter nothing at the prompts when you run this cell.

if CONFIG.USER is None:
    CONFIG.USER = input('username: ')
if CONFIG.PASS is None:
    CONFIG.PASS = getpass.getpass('enter password ')

AUTH=(CONFIG.USER, CONFIG.PASS)    


In [None]:
# Include the directory path for the required Python modules. 

sys.path.insert(0, CONFIG.pyUTILS_path)
import ambench.cdcs_utils
from ambench.cdcs_utils import AMBench2022, xmlschema
from ambench.mapping import new_mapper


## 2. Create AMBench2022 instance
* AMBench2022 is a wrapper class derived from the <code>CDCS</code> class of <code>pycdcs</code>.

In [None]:
ambench2022=AMBench2022(CONFIG.TEMPLATE,CONFIG.AMBENCH_URL,auth=AUTH)

## 3. Make workspace folders to put valid and invalid XML files

Create temporary workspace folders <code>VALID_XML</code> and <code>INVALID_XML</code>. Download all existing XML files in the CDCS template and check their validity against the new version of the schemas. Put compliant XML files in the <code>VALID_XML</code> folder and noncompliant ones in the <code>INVALID_XML</code> folder. <code>XML_WORKSPACE</code> is the path to the folder where these folders will be created.


In [3]:
# If you run this notebook in a SciServer Compute container, create XML_WORKSPACE
# in your SciServer scratch user volume, whose folder path is given below.
# Before running this cell, replace {SCISERVER_USER} with your SciServer username and uncomment both lines.
#
# SCISERVER_USER=sauth.getKeystoneUserWithToken(token).userName
# XML_WORKSPACE=f"/home/idies/workspace/Temporary/{SCISERVER_USER}/scratch/AMBENCH/XML_TEMP"

In [None]:
# Otherwise, enter the path to your folder below and uncomment the line.

# XML_WORKSPACE="" 

In [None]:
VALID_XML=f"{XML_WORKSPACE}/VALID"
INVALID_XML=f"{XML_WORKSPACE}/INVALID"
os.makedirs(VALID_XML,exist_ok=True)
os.makedirs(INVALID_XML,exist_ok=True)


## 4. Build XML validator from schema
Build an XML schema object from the schema files using the <code>xmlschema</code> library. This object provides a schema validator that is used to validate XML documents.

In [None]:
# Build XML schema objects from XSD files using xmlschema library

TITLE_PREFIX=''
TEMPLATE=f'{TITLE_PREFIX}{CONFIG.TEMPLATE}'
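# Join folder and file name so the path works with or without a trailing separator in CONFIG.XSD.
xsd_filename=os.path.join(CONFIG.XSD,CONFIG.ROOT_SCHEMA)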
SCHEMA=xmlschema.XMLSchema(xsd_filename,build=False)
SCHEMA.build()
SCHEMA.validity

## 5. Define utility function 
Define a function deleting all XML files in <code>VALID_XML</code> and <code>INVALID_XML</code>.

In [None]:
def clearTemps():
    '''
    Remove all XML files in VALID_XML and INVALID_XML folders.
    '''
    for folder in (VALID_XML, INVALID_XML):
        filelist = glob.glob(os.path.join(folder, "*.xml"))
        for f in filelist:
            os.remove(f)
        print("removed",len(filelist),"files from",folder)

## 6. Get template from CDCS

<code>loadSchema</code> creates the template if it doesn't exist in CDCS and uploads the schemas to the template in CDCS.

In [None]:
try:
    ambench2022=AMBench2022(TEMPLATE,CONFIG.AMBENCH_URL,auth=AUTH)
    if ambench2022.template is None:
        print("Template",TEMPLATE,"does not yet exist, trying to create it now")
        ambench2022.loadSchema(CONFIG.XSD,TITLE_PREFIX,CONFIG.ROOT_SCHEMA)
    tversionsms=ambench2022.get_template_managers(title=TEMPLATE)
    if len(tversionsms)>0:
        CURRENT=tversionsms['current'][0]
    else:
        CURRENT = None
    TEMPLATE_VERSIONS=ambench2022.get_templates(title=TEMPLATE,current=False) 
except Exception as e:
    print(e)
    raise(e)

In [None]:
CURRENT

## 7. Get all existing XML documents and validate them against the new schemas
Download all XML documents in template from CDCS and validate them against new schemas.
The valid documents are put in <code>VALID_XML</code> folder and the invalid ones in <code>INVALID_XML</code>.

In [None]:
AMDocs=ambench2022.get_records(template=ambench2022.template)
print(len(AMDocs))
AMDocs.head(3)

In [None]:
clearTemps()

In [None]:
valid_ids=[]
valids=[]
invalid_ids=[]
invalids={}

for t in AMDocs.itertuples():
    is_valid=SCHEMA.is_valid(t.xml_content)
    fname=t.title
    if not(fname.endswith(".xml")):
        fname=fname+".xml"
    if is_valid:
        valid_ids.append(t.id)
        valids.append(t)
        with open(f"{VALID_XML}/{fname}","w") as f:
            f.write(t.xml_content)
    else:
        invalid_ids.append(t.id)
        with open(f"{INVALID_XML}/{fname}","w") as f:
            f.write(t.xml_content)
        try:
            SCHEMA.validate(t.xml_content)
        except Exception as e:
            invalids[t.title]=e

print(len(valid_ids),"VALID")
print(len(invalid_ids),"INVALID")

## 8. Fix invalid XML documents according to new schema.

This step is outside the scope of this notebook. Once you have fixed the invalid XML documents, put them in the <code>VALID_XML</code> folder.
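
Before fixing documents by hand, it may help to see which validation errors recur. The sketch below groups error messages by their first line. It assumes a dictionary shaped like `invalids` from Step 7, that is, a document title mapped to the exception raised during validation. The sample data and the helper name `summarize_errors` are made up for illustration, and real error messages may call for a different grouping key.

```python
from collections import Counter


def summarize_errors(invalids):
    """Count how many documents share each first line of an error message."""
    counts = Counter()
    for error in invalids.values():
        lines = str(error).strip().splitlines()
        counts[lines[0] if lines else "<empty message>"] += 1
    # Most frequent messages first.
    return counts.most_common()


# Hypothetical errors standing in for real validation exceptions.
sample = {
    "doc1.xml": ValueError("missing element 'pid'\nPath: /record"),
    "doc2.xml": ValueError("missing element 'pid'\nPath: /record"),
    "doc3.xml": ValueError("invalid date"),
}

for message, count in summarize_errors(sample):
    print(count, message)
```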

In [None]:
if len(invalid_ids)>0:
    print("WARNING")
print(len(invalid_ids),"INVALID FILES WERE FOUND")

## 9. Upload new schema
Once all XML documents are valid against the new version of the schema, you can upload the new schema to CDCS.

In [None]:
ambench2022.loadSchema(CONFIG.XSD,TITLE_PREFIX,CONFIG.ROOT_SCHEMA)


## 10. Determine new CURRENT

Find the <code>ID</code> of the new template in CDCS and make it <code>CURRENT</code>.

In [None]:
OLD_CURRENT=CURRENT
try:
    ambench2022=AMBench2022(CONFIG.TEMPLATE,CONFIG.AMBENCH_URL,auth=AUTH)
    if ambench2022.template is None:
        print("Template",TEMPLATE,"does not yet exist, trying to create it now")  # Possibly redundant: the template was already created in Step 6.
        ambench2022.loadSchema(CONFIG.XSD,TITLE_PREFIX,CONFIG.ROOT_SCHEMA)
    tversionsms=ambench2022.get_template_managers(title=CONFIG.TEMPLATE)
    if len(tversionsms)>0:
        CURRENT=tversionsms['current'][0]
    else:
        CURRENT = None
    TEMPLATE_VERSIONS=ambench2022.get_templates(title=CONFIG.TEMPLATE,current=False)
    print("new current:",CURRENT,"old current:",OLD_CURRENT)
except Exception as e:
    print(e)
    raise(e)

## 10. Migrate valid XML files to new CDCS template

In [None]:
template_id=CURRENT
if template_id is not None:
    r=ambench2022.migrate(template_id,valid_ids)
    if r.status_code <200 or r.status_code >=400:
        print("PROBLEM:",r.content)
    else:
        print("Migration succeeded")
else:
    print("ERROR, no CURRENT template_id detected")
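
The status check above treats any code from 200 to 399 as success. The same test can be written as a small helper, shown here as a sketch (the name `is_success` is hypothetical and is not part of `pycdcs`):

```python
def is_success(status_code):
    """Mirror the check above: 2xx and 3xx responses count as success."""
    return 200 <= status_code < 400


for code in (200, 302, 404, 500):
    print(code, is_success(code))
```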