# Upload XML schemas to CDCS
This notebook manages the XML schemas for the metadata describing AM Bench 2022 series samples, build processes, and measurements. It uploads new schemas to the <b>private</b> AM Bench CDCS database instance (henceforth called CDCS) or replaces existing schemas in CDCS with their new versions, and it updates the XML files already in CDCS so that they match the new schemas. If you wish to access the private CDCS site, please contact Lyle E. Levine (lyle.levine@nist.gov).

The steps for uploading new schemas or updating existing ones are as follows:
* Put all schema files inside the folder given by <code>XSD</code> in <code>__CONFIG</code>.
* Check whether a template with the title <code>TEMPLATE</code>, as defined in <code>__CONFIG</code>, already exists in CDCS. A template is the name of a group of versions of an XML schema.
  * If it does not, do the following:
   * Upload the schemas and create a template for each schema in CDCS.
   * Upload the XML files to the CDCS template named by <code>TEMPLATE</code> in <code>__CONFIG</code>.
  * Otherwise, the XML documents in CDCS have to be migrated so that they comply with the new version of the template while their &lt;pid&gt;s remain unchanged:
   * Download all XML documents in the template and delete them from CDCS.
   * Make them valid against the new schemas.
   * Upload both the new schemas and the updated XML documents to CDCS.
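
The two branches above can be summarized in a small, self-contained sketch. The function <code>plan_steps</code> and its step labels are hypothetical names used only for illustration; they are not part of pycdcs or of the ambench utilities.

```python
from typing import List


def plan_steps(template_exists: bool) -> List[str]:
    """Return the ordered steps for the two branches described above.

    The step names are illustrative labels, not functions of any library.
    """
    if not template_exists:
        # First upload: there is nothing in CDCS to migrate yet.
        return ["upload_schemas", "create_templates", "upload_xml_documents"]
    # The template already exists: existing documents must follow the new version.
    return [
        "download_xml_documents",
        "delete_xml_documents_from_cdcs",
        "make_documents_valid_against_new_schemas",
        "upload_new_schemas",
        "upload_updated_xml_documents",
    ]


print(plan_steps(template_exists=False))
print(plan_steps(template_exists=True))
```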
  

In [None]:
import lxml.etree as ET
import pandas
import os
from pathlib import Path
import xmlschema
import getpass
from cdcs import CDCS
import requests
import json
import glob
import uuid
import sys
import importlib
import SciServer.Authentication as sauth

# Instantiate the __CONFIG class
- To run this notebook, create your own configuration file in JSON format. See the example given in default_config.json.
- Pass the path to your JSON file as the argument of the constructor of the <code>__CONFIG</code> class defined in config.py. If no argument is passed to the constructor, default_config.json is used.
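
As a rough illustration of how such a JSON-backed configuration could behave, here is a minimal sketch. <code>SketchConfig</code> is a hypothetical stand-in, not the real <code>__CONFIG</code> from config.py, and it assumes that the JSON keys mirror the attribute names used in this notebook (for example <code>TEMPLATE</code>, <code>USER</code>); the values are placeholders.

```python
import json
import tempfile
from pathlib import Path


class SketchConfig:
    """Hypothetical stand-in for __CONFIG; the real class lives in config.py."""

    def __init__(self, conf_json="default_config.json"):
        with open(conf_json, encoding="utf-8") as fh:
            settings = json.load(fh)
        # Expose every JSON key as an attribute, e.g. cfg.TEMPLATE.
        for key, value in settings.items():
            setattr(self, key, value)


# Write a throwaway configuration file so the sketch runs on its own.
# The keys and values are placeholders, not real settings.
example = {"TEMPLATE": "example-template", "XSD": "./XSD/", "USER": None, "PASS": None}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "myconfig.json"
    path.write_text(json.dumps(example), encoding="utf-8")
    cfg = SketchConfig(conf_json=str(path))

print(cfg.TEMPLATE, cfg.USER)
```

A JSON <code>null</code> is read as <code>None</code>, which is the case the interactive username/password prompt in this notebook checks for.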

In [None]:
# Import config and instantiate __CONFIG class.
import sys

import config
from config import __CONFIG

CONFIG = __CONFIG(conf_json = "./myconfig-sciserver.json")


In [None]:
# If USER or PASS are None in the configuration settings, enter them interactively.
# For an anonymous user, do not enter anything.

if CONFIG.USER is None:
    CONFIG.USER = input('username: ')
if CONFIG.PASS is None:
    CONFIG.PASS = getpass.getpass('enter password ')

AUTH=(CONFIG.USER, CONFIG.PASS)    


In [None]:
# Include the directory path for the required Python modules.

sys.path.insert(0, CONFIG.pyUTILS_path)
import ambench.cdcs_utils
from ambench.cdcs_utils import AMBench2022, xmlschema
from ambench.mapping import new_mapper
#from ambench.cdcs_utils import *

# Create AMBench2022 instance
* AMBench2022 is a wrapper class whose base class is CDCS from pycdcs. It adds methods for querying the CDCS instance and for uploading XML schemas and documents to it.

In [None]:
ambench2022=AMBench2022(CONFIG.TEMPLATE,CONFIG.AMBENCH_URL,auth=AUTH)

# Define utility function

In [None]:
def clearTemps():
    '''
    Remove all XML files in VALID_XML and INVALID_XML folders.
    '''
    # VALID_XML and INVALID_XML are globals defined in a later cell.
    for folder in (VALID_XML, INVALID_XML):
        filelist = glob.glob(os.path.join(folder, "*.xml"))
        for f in filelist:
            os.remove(f)
        print("removed",len(filelist),"files from",folder)

In [None]:
# # TEST!!!
# sauth.getToken()
# import SciServer.CasJobs as cj
# cj.executeQuery("select * from information_schema.tables","AMBench")

# 0. Define parameters
Inputs:
* URL of the AM Bench CDCS instance
* folder with the XSD files to be uploaded/updated
* file name of the XSD file defining the root element (AMDocs.xsd)
* name of the CDCS template

Objects:
* pycdcs CDCS instance
* list of ids of all versions of the template
* id of the current version
* xmlschema instance for the schema

# Make workspace folders for valid and invalid XML files

Create temporary workspace folders <code>VALID_XML</code> and <code>INVALID_XML</code> for the downloaded and updated XML files. <code>XML_WORKSPACE</code> is the path to the folder in which these two folders are created. (<b>TODO: improve this step.</b>)

If you run this notebook in a SciServer Compute container, create <code>XML_WORKSPACE</code> in your SciServer scratch user volume; its folder path is given in a cell below.
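
Outside SciServer, one possible way to set up the workspace is sketched below. It uses a temporary directory as a placeholder for <code>XML_WORKSPACE</code>; the helper <code>make_workspace</code> is a hypothetical name, and the <code>VALID</code>/<code>INVALID</code> subfolder names follow the cells below.

```python
import tempfile
from pathlib import Path


def make_workspace(xml_workspace):
    """Create VALID and INVALID subfolders and return their paths."""
    root = Path(xml_workspace)
    valid = root / "VALID"
    invalid = root / "INVALID"
    for folder in (valid, invalid):
        # parents=True creates XML_WORKSPACE itself if it is missing;
        # exist_ok=True makes the call safe to repeat.
        folder.mkdir(parents=True, exist_ok=True)
    return valid, invalid


# A temporary directory stands in for a real scratch location here.
with tempfile.TemporaryDirectory() as scratch:
    valid_dir, invalid_dir = make_workspace(Path(scratch) / "XML_TEMP")
    created = (valid_dir.is_dir(), invalid_dir.is_dir())
    print(created)
```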

In [None]:
# If using a SciServer Compute container, use the folder path given in the two lines below.

# SCISERVER_USER=sauth.getKeystoneUserWithToken(token).userName
# XML_WORKSPACE=f"/home/idies/workspace/Temporary/{SCISERVER_USER}/scratch/AMBENCH/XML_TEMP"

#============================================================================================#

# Otherwise enter path to your folder below
# XML_WORKSPACE="" 

In [None]:
XML_WORKSPACE=f"/home/idies/workspace/Temporary/jkim485/scratch/AMBENCH/XML_TEMP" ##TEST!!

In [None]:
VALID_XML=f"{XML_WORKSPACE}/VALID"
INVALID_XML=f"{XML_WORKSPACE}/INVALID"
os.makedirs(VALID_XML,exist_ok=True)
os.makedirs(INVALID_XML,exist_ok=True)


In [None]:
# Build XML schema objects from XSD files using xmlschema library

TITLE_PREFIX='' # DO WE NEED TITLE_PREFIX???????????
TEMPLATE=f'{TITLE_PREFIX}{CONFIG.TEMPLATE}'
xsd_filename=f'{CONFIG.XSD}{CONFIG.ROOT_SCHEMA}'
SCHEMA=xmlschema.XMLSchema(xsd_filename,build=False)
SCHEMA.build()
SCHEMA.validity

# Get templates from CDCS.

A <code>Template</code> is the name of a group of versions of an XML schema in CDCS. By default, the Template is set to the namespace of the corresponding XML schema. If there are dependencies among the schemas in the <code>XSD</code> folder defined in <code>__CONFIG</code>, it is sufficient to pass the top-level schema (<code>ROOT_SCHEMA</code> of <code>__CONFIG</code>) as an argument of the <code>loadSchema</code> function below.
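
To make the idea of a Template as a group of versions concrete, here is a small illustrative model. <code>TemplateGroup</code> and the version ids are hypothetical and do not reflect the pycdcs data structures; the sketch also assumes that a newly uploaded version becomes the current one, which may not hold for every CDCS setup.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TemplateGroup:
    """Illustrative model of a Template: a title plus its schema versions."""
    title: str
    versions: List[str] = field(default_factory=list)  # version ids, oldest first
    current: Optional[str] = None                       # id of the current version

    def add_version(self, version_id: str) -> None:
        # Assumption: a newly uploaded version becomes the current one.
        self.versions.append(version_id)
        self.current = version_id


group = TemplateGroup(title="example-template")
group.add_version("id-v1")
old_current = group.current
group.add_version("id-v2")
print("old current:", old_current, "new current:", group.current)
```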

In [None]:
try:
    ambench2022=AMBench2022(TEMPLATE,CONFIG.AMBENCH_URL,auth=AUTH)
    if ambench2022.template is None:
        print("Template",TEMPLATE,"does not yet exist, trying to create it now")
        ambench2022.loadSchema(CONFIG.XSD,TITLE_PREFIX,CONFIG.ROOT_SCHEMA)
#     tversionsms=ambench2022.get_template_managers(title=TEMPLATE)
    tversionsms=ambench2022.get_template_managers()  ###THIS LINE TEST
    print(tversionsms)
    if len(tversionsms)>0:
        CURRENT=tversionsms['current'][0]
    else:
        CURRENT = None
#     TEMPLATE_VERSIONS=ambench2022.get_templates(title=TEMPLATE,current=False) 
    TEMPLATE_VERSIONS=ambench2022.get_templates(title=None,current=False) # this line test

    print(TEMPLATE_VERSIONS)

except Exception as e:
    # Show the error, then re-raise it with the original traceback.
    print(e)
    raise

In [None]:
CURRENT

# 1. Check all loaded XML docs
For the current version of the template!

In [None]:
AMDocs=ambench2022.get_records(template=ambench2022.template)
print(len(AMDocs))
AMDocs.head(3)

for all versions

# 2. Check validity of retrieved XML docs with respect to the new schema
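
The check in this step amounts to partitioning the retrieved records with a validity predicate. The standalone sketch below shows that pattern on toy data. It uses a well-formedness test from the standard library as a stand-in predicate, which is much weaker than schema validation; in this notebook the predicate would be <code>SCHEMA.is_valid</code>. The names <code>partition</code> and <code>is_well_formed</code> are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple


def is_well_formed(xml_text: str) -> bool:
    """Stand-in predicate: checks XML well-formedness only, not schema validity."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def partition(records: Dict[str, str],
              is_valid: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Split record titles into (valid, invalid), adding '.xml' when missing."""
    valid, invalid = [], []
    for title, xml_text in records.items():
        fname = title if title.endswith(".xml") else title + ".xml"
        (valid if is_valid(xml_text) else invalid).append(fname)
    return valid, invalid


# Toy records: the second one has a mismatched closing tag.
records = {"good": "<doc><pid>1</pid></doc>", "bad.xml": "<doc><pid>1</doc>"}
valid_files, invalid_files = partition(records, is_well_formed)
print(valid_files, invalid_files)
```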

In [None]:
clearTemps()

In [None]:
valid_ids=[]
valids=[]
invalid_ids=[]
invalids={}

for t in AMDocs.itertuples():
    is_valid=SCHEMA.is_valid(t.xml_content)
#     if not(is_valid):
#         print(t.title,is_valid)
    fname=t.title
    if not(fname.endswith(".xml")):
        fname=fname+".xml"
    if is_valid:
        valid_ids.append(t.id)
        valids.append(t)
        with open(f"{VALID_XML}/{fname}","w") as f:
            f.write(t.xml_content)
    else:
        invalid_ids.append(t.id)
        with open(f"{INVALID_XML}/{fname}","w") as f:
            f.write(t.xml_content)
        try:
            SCHEMA.validate(t.xml_content)
        except Exception as e:
            invalids[t.title]=e
#             print(e,"\n=====\n")
print(len(valid_ids),"VALID")
print(len(invalid_ids),"INVALID")

# 3. Deal with invalid XML docs
For now, keep them in CDCS; they will remain linked to the old version of the schema and hence be less visible.
Eventually we can keep them and rerun the XML creation from the "raw" metadata (Excel file), just making sure the pid is set to the correct value.
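
A minimal sketch of how the pid could be carried over into a regenerated document is given below. It assumes a non-namespaced <code>&lt;pid&gt;</code> element directly under the root, which is a guess about the document layout; the helper <code>set_pid</code>, the element names, and the values are all hypothetical.

```python
import xml.etree.ElementTree as ET


def set_pid(xml_text: str, pid: str) -> str:
    """Return xml_text with its <pid> element set to pid.

    Assumes a non-namespaced <pid> child directly under the root, which
    is a guess about the document layout; adapt the lookup to the schema.
    """
    root = ET.fromstring(xml_text)
    node = root.find("pid")
    if node is None:
        # Create the element when the regenerated document lacks one.
        node = ET.SubElement(root, "pid")
    node.text = pid
    return ET.tostring(root, encoding="unicode")


# Hypothetical values for illustration only.
regenerated = "<AMDoc><pid>placeholder</pid><name>sample</name></AMDoc>"
fixed = set_pid(regenerated, "original-pid-value")
print(fixed)
```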

In [None]:
if len(invalid_ids)>0:
    print("WARNING")
print(len(invalid_ids),"INVALID FILES WERE FOUND")

# 5. Upload new schema

In [None]:
ambench2022.loadSchema(CONFIG.XSD,TITLE_PREFIX,CONFIG.ROOT_SCHEMA)
# XSD,TITLE_PREFIX,ROOT_SCHEMA

## Determine new CURRENT

In [None]:
OLD_CURRENT=CURRENT
try:
    ambench2022=AMBench2022(CONFIG.TEMPLATE,CONFIG.AMBENCH_URL,auth=AUTH)
    if ambench2022.template is None:
        print("Template",TEMPLATE,"does not yet exist, trying to create it now")
        ambench2022.loadSchema(CONFIG.XSD,TITLE_PREFIX,CONFIG.ROOT_SCHEMA)
    tversionsms=ambench2022.get_template_managers(title=CONFIG.TEMPLATE)
    if len(tversionsms)>0:
        CURRENT=tversionsms['current'][0]
    else:
        CURRENT = None
    TEMPLATE_VERSIONS=ambench2022.get_templates(title=CONFIG.TEMPLATE,current=False)
    print("new current:",CURRENT,"old current:",OLD_CURRENT)
except Exception as e:
    # Show the error, then re-raise it with the original traceback.
    print(e)
    raise

# 6. generate pyxb classes from new schema
requires
<pre>
%pip install pyxb
</pre>
Do this in terminal

# 7. Deal with valid XML docs: migrate them to new template

In [None]:
# Find the id of the new template
template_id=CURRENT
# for k,t in todo.items():
#     if t['title'] == TEMPLATE:
#         template_id=t['id']
#         break
if template_id is not None:
    r=ambench2022.migrate(template_id,valid_ids)
    if r.status_code <200 or r.status_code >=400:
        print("PROBLEM:",r.content)
    else:
        print("Migration succeeded")
else:
    print("ERROR, no CURRENT template_id detected")