# Evaluation, analysis, and reporting on your metadata collection

The first step is to extract every node that contains text, whether element or attribute, into a CSV. This flattens the XML while retaining all information except the order of elements (the XSL has a parameter to extract that information as well, if you're interested in extending the code to test the content of an element).

Second, we create a version of the data that contains only the XPaths from the FAIR recommendation you've created. To do this, use the XPaths that correspond to the FAIR recommendation concepts you're including and, in some cases, just the element name. This instantiation of the recommendation does not spell out every child element necessary for the recommendation; instead, it is applied in a way that captures all of the child elements in use. The result therefore contains all of the metadata that a site used to add context to the concepts in the recommendation.

Next, these CSV files are analyzed for occurrence.

Finally, to compare directly the child elements each site uses, we pivot the data into a table that holds, for each concept, the completeness percentage of its most frequently occurring child element, and then visualize that completeness.
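As a rough sketch of the pivot step, assuming the per-collection completeness values are already available in "long" form, pandas can reshape them so that each collection becomes a column. The column names `Collection`, `Concept`, and `Completeness%`, the site names, and the numbers are all invented for illustration; they are not the names the EARmd module uses.

```python
import pandas as pd

# Hypothetical long-form completeness data: one row per (collection, concept).
completeness = pd.DataFrame({
    "Collection": ["siteA", "siteA", "siteB", "siteB"],
    "Concept": ["title", "contact", "title", "contact"],
    "Completeness%": [100.0, 95.0, 100.0, 60.0],
})

# One row per concept, one column per collection.
table = completeness.pivot(index="Concept", columns="Collection",
                           values="Completeness%")

# Put the rows in the order of the recommendation (an ElementOrder-style list).
table = table.reindex(["title", "contact"])
print(table)
```

With the collections side by side, differences between sites show up as differences along a single row.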

This is the NbMeta metadata record for this notebook:

[Create_a_metadataset_nbmeta.json](../metadata/Evaluate_Analyze_and_Report_metadata_nbmeta.json)

## Prepare the notebook

* import modules
* define variables
* define recommendation

In [1]:
import sys
import os
import pandas as pd
import gzip
import shutil
import subprocess
import tarfile

#import local python module
sys.path.append(os.path.join(os.path.dirname(sys.path[0]),'../scripts'))
import EARmd as md

os.makedirs("../data/FAIR", exist_ok=True)

# create a list of each collection's name
collectionsToProcess = [name for name in os.listdir("../collection") if not name.startswith('.') ]

#### FAIR_EML for DataONE Member Nodes
<p>EML elements that contain the metadata concepts for FAIR Metadata at DataONE</p>
<p>Adapted from: </p>
    
[text to name the workshop](link_to_workshopAgenda)

In [2]:
# create a pattern to look for elements used in fulfilling the community's stated information needs
elements = ["attributeList/attribute/attributeDefinition",
            'attributeLabel',
            '/eml:eml/@xsi:schemaLocation',# recommended
            '/eml:eml/@packageId',
            '/eml:eml/@system',# optional
            '/eml:eml/access',# optional
            '/eml:eml/dataset/alternateIdentifier',
            '/eml:eml/dataset/title', 
            '/eml:eml/dataset/creator',
            '/eml:eml/dataset/contact',# required
            '/eml:eml/dataset/metadataProvider',
            '/eml:eml/dataset/associatedParty',
            '/eml:eml/dataset/publisher',
            '/eml:eml/dataset/pubDate',
            '/eml:eml/dataset/abstract',
            '/eml:eml/project/funding',
            '/eml:eml/dataset/project/abstract',
            '/eml:eml/dataset/keywordSet',
            '/eml:eml/dataset/project/keywordSet',
            '/eml:eml/dataset/intellectualRights',
            'physical/distribution',
            "/eml:eml/dataset/coverage/geographicCoverage",
            "/eml:eml/dataset/coverage/taxonomicCoverage",
            "/eml:eml/dataset/coverage/temporalCoverage",
            '/eml:eml/dataset/maintenance',
            '/eml:eml/dataset/methods',
            '/eml:eml/dataset/project',
            '/eml:eml/dataset/dataTable',
            '/eml:eml/dataset/spatialRaster',
            '/eml:eml/dataset/spatialVector',
            '/eml:eml/dataset/storedProcedure',
            '/eml:eml/dataset/view',
            '/eml:eml/dataset/otherEntity',
            "/eml:eml/dataset/dataTable/attributeList",
            "/eml:eml/dataset/spatialRaster/attributeList",
            "/eml:eml/dataset/spatialVector/attributeList",
            "/eml:eml/dataset/storedProcedure/attributeList",
            "/eml:eml/dataset/view/attributeList",
            "/eml:eml/dataset/otherEntity/attributeList",
            "/eml:eml/dataset/dataTable/constraint",
            "/eml:eml/dataset/spatialRaster/constraint",
            "/eml:eml/dataset/spatialVector/constraint",
            "/eml:eml/dataset/storedProcedure/constraint",
            "/eml:eml/dataset/view/constraint",
            "/eml:eml/dataset/otherEntity/constraint",
            '/eml:eml/additionalMetadata',
            'enumeratedDomain',
            'precision',
            'qualityControl',
            'missingValueCode',
            'entityDescription'
           ]

# A dictionary mapping the recommendation xpaths to the relevant sub-element.
RecDict = {'/eml:eml/project/funding': 'funding',
            'attributeLabel': 'attributeLabel',
            'enumeratedDomain': 'enumeratedDomain',
            'qualityControl': 'qualityControl',
            'precision': 'precision',
            'missingValueCode': 'missingValueCode',
            'entityDescription': 'entityDescription',
            '/eml:eml/@xsi:schemaLocation': "xsi:schemaLocation",
            "/eml:eml/@packageId": "packageId",
            '/eml:eml/@system': 'system',
            "/eml:eml/access": "access",
            '/eml:eml/dataset/alternateIdentifier': "alternateIdentifier",
            "/eml:eml/dataset/title": "title",
            "/eml:eml/dataset/creator": "creator",
            "/eml:eml/dataset/contact": "contact",
            "/eml:eml/dataset/metadataProvider": "metadataProvider",
            "/eml:eml/dataset/associatedParty": "associatedParty",
            "/eml:eml/dataset/publisher": "publisher",
            "/eml:eml/dataset/pubDate": "pubDate",
            "/eml:eml/dataset/abstract": "abstract",
            '/eml:eml/dataset/project/abstract': "abstract",
            "/eml:eml/dataset/keywordSet": "keywordSet",
            "/eml:eml/dataset/project/keywordSet": "keywordSet",
            "/eml:eml/dataset/intellectualRights": "intellectualRights",
            "/eml:eml/dataset/maintenance": "maintenance",
            "/eml:eml/dataset/methods": "methods",
            "/eml:eml/dataset/project": "project",
            'physical/distribution': 'distribution',
            "/eml:eml/dataset/dataTable/attributeList": "attributeList",
            "/eml:eml/dataset/spatialRaster/attributeList": "attributeList",
            "/eml:eml/dataset/spatialVector/attributeList": "attributeList",
            "/eml:eml/dataset/storedProcedure/attributeList": "attributeList",
            "/eml:eml/dataset/view/attributeList": "attributeList",
            "/eml:eml/dataset/otherEntity/attributeList": "attributeList",
            "/eml:eml/dataset/dataTable/constraint": "constraint",
            "/eml:eml/dataset/spatialRaster/constraint": "constraint",
            "/eml:eml/dataset/spatialVector/constraint": "constraint",
            "/eml:eml/dataset/storedProcedure/constraint": "constraint",
            "/eml:eml/dataset/view/constraint": "constraint",
            "/eml:eml/dataset/otherEntity/constraint": "constraint",
            "/eml:eml/dataset/dataTable": "[entity]",
            "/eml:eml/dataset/spatialRaster": "[entity]",
            "/eml:eml/dataset/spatialVector": "[entity]",
            "/eml:eml/dataset/storedProcedure": "[entity]",
            "/eml:eml/dataset/view": "[entity]",
            "/eml:eml/dataset/otherEntity": "[entity]",
            "/eml:eml/dataset/project": "project",
            "/eml:eml/dataset/coverage/geographicCoverage": 'geographicCoverage',
            "/eml:eml/dataset/coverage/taxonomicCoverage": 'taxonomicCoverage',
            "/eml:eml/dataset/coverage/temporalCoverage": 'temporalCoverage',
            "attributeList/attribute/attributeDefinition": 'attributeDefinition',
            '/eml:eml/additionalMetadata': 'additionalMetadata',
            "Number of Records": "Number of Records"
           }
# define the recommendation level of each element
LevelOrder = ["Number of Records",'Findable','Findable','Findable','Findable','Findable','Findable','Findable','Findable',
              'Findable','Findable','Findable','Findable','Findable','Accessible','Accessible','Interoperable','Interoperable',
              'Interoperable','Interoperable','Interoperable','Interoperable','Interoperable','Interoperable','Interoperable',
              'Interoperable','Interoperable','Interoperable','Reusable','Reusable','Reusable','Reusable','Reusable']

# create a list to order the table that corresponds with the order of the FAIR recommendation levels. 
ElementOrder = ["Number of Records",
                'xsi:schemaLocation',
                'packageId',
                'system',
                'access',
                'alternateIdentifier',
                'title', 
                'creator',
                'contact',
                'metadataProvider',
                'associatedParty',
                'publisher',
                'pubDate',
                'abstract',
                'keywordSet',
                'intellectualRights',
                'distribution',
                'geographicCoverage',
                'taxonomicCoverage',
                'temporalCoverage',
                'maintenance',
                'methods',
                'qualityControl',
                'project',
                '[entity]',
                'entityDescription',
                'attributeList',
                'attributeLabel',
                'attributeDefinition',
                'enumeratedDomain',
                'missingValueCode',
                'precision',
                'constraint',
                'additionalMetadata']
# Used to order a dataframe in the order of the recommendation
ConceptOrder = ['Number of Records','','','','','','','','','','','','','','','','','','','','','','','','','','','','']

## Evaluation using the AllNodes.xsl transform

This XSL is standards agnostic. AllNodes works with any number of valid XML records, regardless of their standards compliance or creativity.
The transform flattens the XML of every record in a directory into a CSV. For each element or attribute that has text, the XSL writes a row containing the directory name, file name, text content, and XPath of that node.
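To illustrate what "flattening" means here, the sketch below uses Python's standard-library ElementTree instead of the AllNodes.xsl transform that this notebook actually runs. The function name `flatten_record`, the sample record, and its identifier are made up for this example; namespaces and the directory and file name columns are left out for brevity.

```python
import xml.etree.ElementTree as ET

def flatten_record(xml_text):
    """Return (xpath, text) rows for every element or attribute that holds text."""
    rows = []

    def walk(node, path):
        # Each attribute becomes its own row, using the /@name convention.
        for name, value in node.attrib.items():
            rows.append((f"{path}/@{name}", value))
        # The element itself is recorded only when it has non-whitespace text.
        text = (node.text or "").strip()
        if text:
            rows.append((path, text))
        for child in node:
            walk(child, f"{path}/{child.tag}")

    root = ET.fromstring(xml_text)
    walk(root, f"/{root.tag}")
    return rows

# Invented, un-namespaced sample record (not real metadata).
sample = """<eml packageId="doi:10.0000/example" system="example">
  <dataset>
    <title>Example title</title>
    <creator><individualName><surName>Doe</surName></individualName></creator>
  </dataset>
</eml>"""

for xpath, text in flatten_record(sample):
    print(xpath, "|", text)
```

Container elements such as `dataset` produce no row of their own because they hold no text, which is why the flattened output preserves content but not the order of elements.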


In [3]:
# install a java runtime
import os       #importing os to set environment variable
def install_java():
  !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null      #install openjdk
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"     #set environment variable
  !java -version       #check java version
install_java()
# use the list of collections to run the evaluation for each collection
for collection in collectionsToProcess:
    # Build the command that runs the evaluation XSL.
    # Java must be installed; the first item of cmd is the path to the java executable.
    xpath_eval_file = "../data/FAIR/" + str(collection) + "_XpathEvaluated.csv"
    cmd = ["/usr/bin/java",
           '-jar', "../scripts/saxon-b-9.0.jar",
           '-xsl:' + "../scripts/AllNodes.xsl",
           '-s:' + "../scripts/dummy.xml",
           '-o:' + xpath_eval_file,
           'recordSetPath=' + "../collection/" + str(collection) + "/"]
    # run the transform; passing the list directly avoids shell quoting problems
    subprocess.run(cmd, check=True)
    # gzip the csv, then remove the uncompressed copy once both files are closed
    gzxpath_eval_file = xpath_eval_file + '.gz'
    with open(xpath_eval_file, 'rb') as f, gzip.open(gzxpath_eval_file, 'wb') as gzf:
        shutil.copyfileobj(f, gzf)
    os.remove(xpath_eval_file)

/bin/sh: apt-get: command not found
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)


## Analysis using the EARmd.py module
The module has already been used to retrieve the records via the Requests module. Now we take advantage of the flat structure of the evaluated metadataset and use pandas to analyze the metadata for the elements in the FAIR recommendation we've built. This process yields two versions of the dataset: the complete output of the evaluation, and the subset identified by the recommendation pattern. Each version is organized differently. Both have an analysis called XpathOccurrence applied, which returns several measures of how often each XPath occurs in the collection's records. The most important of these for our purposes is the percentage of records that contain each element.



In [4]:
for collection in collectionsToProcess:
    # places for all the evaluated and analyzed data
    XpathEvaluated = os.path.join("../data/FAIR/", collection + "_XpathEvaluated.csv.gz")
    XpathOccurrence = os.path.join("../data/FAIR/", collection +'_XpathOccurrence.csv')

    # Read in the evaluated metadata
    EvaluatedDF = pd.read_csv(XpathEvaluated)

    # Use the dataframe above and apply the XpathOccurrence function from EARmd (imported as md)
    md.XpathOccurrence(EvaluatedDF, collection, XpathOccurrence)
    
    # Apply the recommendation to the collection
    md.applyRecommendation(elements, 'FAIR', collection)

## Create reports 

#### All Elements Usage
* The first row is the number of records. Use the *RecordCount* column
* Rows are the XPaths found in any record in the collection
* Columns are XpathCount, RecordCount, AverageOccurrencePerRecord, CollectionOccurrence%
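
The four columns listed above could be derived from the flattened data roughly as follows. This is a sketch, not the EARmd implementation: the input column names `Record` and `XPath` are assumptions, the data is invented, and taking the average over only the records that contain the XPath is one possible reading of AverageOccurrencePerRecord.

```python
import pandas as pd

# Hypothetical flattened evaluation: one row per text-bearing node.
evaluated = pd.DataFrame({
    "Record": ["a.xml", "a.xml", "a.xml", "b.xml", "c.xml", "c.xml"],
    "XPath": ["/eml:eml/dataset/title",
              "/eml:eml/dataset/creator",
              "/eml:eml/dataset/creator",
              "/eml:eml/dataset/title",
              "/eml:eml/dataset/title",
              "/eml:eml/dataset/creator"],
})

total_records = evaluated["Record"].nunique()

# XpathCount: how many times the xpath occurs in the collection.
# RecordCount: how many distinct records contain it.
occurrence = (
    evaluated.groupby("XPath")
    .agg(XpathCount=("Record", "count"), RecordCount=("Record", "nunique"))
    .reset_index()
)

# Assumption: the average is taken over records that contain the xpath.
occurrence["AverageOccurrencePerRecord"] = (
    occurrence["XpathCount"] / occurrence["RecordCount"]
)
occurrence["CollectionOccurrence%"] = (
    100 * occurrence["RecordCount"] / total_records
)
print(occurrence)
```

In this toy data `creator` occurs three times but in only two of the three records, so its record-level occurrence is lower than its raw count suggests.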

#### FAIR Elements Usage
* Same as the All Elements Usage analysis, but limited to the elements, and their children, that occur in the conceptual recommendation.

We first apply the list of XPaths from a "50-thousand-foot view": instead of explicitly looking for each child element of /eml:eml/dataset/contact, we look for XPaths that contain /eml:eml/dataset/contact. This lets us create a version of the evaluation that contains the elements important to fulfilling specific recommendation needs. It also gives additional insight into how element choices shift over time.
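
The "contains" matching described above can be sketched with a plain substring test. How `md.applyRecommendation` does this internally is not shown in this notebook, so treat the helper `matches_any`, the column names, and the sample rows as assumptions made for illustration.

```python
import pandas as pd

# Hypothetical occurrence table with a few EML-style xpaths.
occurrence = pd.DataFrame({
    "XPath": ["/eml:eml/dataset/contact/individualName/surName",
              "/eml:eml/dataset/contact/electronicMailAddress",
              "/eml:eml/dataset/purpose/para",
              "/eml:eml/dataset/dataTable/attributeList/attribute/attributeLabel"],
    "RecordCount": [10, 7, 3, 5],
})

# A full parent xpath and a bare element name, as in the elements list.
patterns = ["/eml:eml/dataset/contact", "attributeLabel"]

def matches_any(xpath, patterns):
    """True when the xpath contains at least one pattern as a plain substring."""
    return any(p in xpath for p in patterns)

mask = occurrence["XPath"].map(lambda x: matches_any(x, patterns))
recommended = occurrence[mask]
print(recommended)
```

Both `contact` children are kept without being listed individually, and the bare element name `attributeLabel` matches wherever it appears in a path.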

#### FAIR Concepts Usage
* Take the occurrence percentage of the most-used child element under each recommendation-level parent element and assign it to the parent, giving a high-level view of recommendation compliance over time.
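
A minimal sketch of that roll-up, assuming a RecDict-style mapping from XPath pattern to concept name: each XPath is assigned to the first pattern it contains, and the concept takes the maximum occurrence percentage among its children. The helper `concept_for`, the shortened mapping, and the percentages are hypothetical, and with overlapping patterns the first-match rule used here may differ from what the EARmd module does.

```python
import pandas as pd

# Hypothetical recommendation-level occurrence data.
fair_occurrence = pd.DataFrame({
    "XPath": ["/eml:eml/dataset/contact/individualName/surName",
              "/eml:eml/dataset/contact/electronicMailAddress",
              "/eml:eml/dataset/title",
              "/eml:eml/dataset/dataTable/attributeList/attribute/attributeLabel"],
    "CollectionOccurrence%": [80.0, 95.0, 100.0, 40.0],
})

# Shortened RecDict-style mapping: xpath pattern -> concept name.
rec_dict = {
    "/eml:eml/dataset/contact": "contact",
    "/eml:eml/dataset/title": "title",
    "attributeLabel": "attributeLabel",
}

def concept_for(xpath, mapping):
    """Return the concept of the first pattern found in the xpath, else None."""
    for pattern, concept in mapping.items():
        if pattern in xpath:
            return concept
    return None

fair_occurrence["Concept"] = fair_occurrence["XPath"].map(
    lambda x: concept_for(x, rec_dict)
)

# The concept's completeness is that of its most-used child element.
concept_completeness = (
    fair_occurrence.groupby("Concept")["CollectionOccurrence%"].max()
)
print(concept_completeness)
```

Using the maximum means a concept counts as present in a record set as often as its best-populated child, regardless of which child a given site prefers.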

Use the analyzed data to create reports for each collection. All reports are created as Excel spreadsheets.

#### Visualize FAIR Fitness
* Visualize the FAIR completeness percentage for your collection as a way to estimate the likelihood that the catalog will address the FAIR information needs of your data users and producers.
<p>Gordon, S 2019 Is your metadata catalog in shape?. Zenodo. https://doi.org/10.5281/zenodo.2558631</p>



### Create a FAIRness report on our collection or collections 

In [5]:
os.makedirs("../reports/FAIR", exist_ok=True)

#for collection in collectionsToProcess:
    # places for all the combined data and combined Report
DataDestination = os.path.join('../reports/FAIR', "combinedCollections.xlsx")
XpathOccurrence = os.path.join("../data/FAIR", 'combinedCollections_XpathOccurrence.csv')

FAIROccurrence = os.path.join("..", "data", "FAIR", 'combinedCollections_FAIRoccurrence.csv')
FAIRConcept = os.path.join('..','data','FAIR', 'combinedCollections_FAIRcompleteness.csv')
FAIRGraph = os.path.join('..','data','FAIR', 'combinedCollections_FAIR_.png')

# combine the absolute occurrence analysis for a site through time
XpathOccurrenceToCombine = [os.path.join("../data/FAIR", name) for name in os.listdir("../data/FAIR") if name.endswith('_XpathOccurrence.csv') ]
md.CombineXPathOccurrence(XpathOccurrenceToCombine,
                          XpathOccurrence, to_csv=True)

# Build lists of recommendation specific occurrence analysis for a site through time  
FAIRoccurrenceToCombine = [os.path.join("../data/FAIR", name) for name in os.listdir("../data/FAIR") if name.endswith('_FAIROccurrence.csv') ]

# utilize function to combine the recommendation specific analyses 
md.CombineAppliedRecommendation(collection, elements, 'FAIR', FAIRoccurrenceToCombine)

# create recommendation pivot tables and radar graphs to assess the parent elements' usage through time
md.Collection_ConceptAnalysis(collection, 'FAIR', RecDict, LevelOrder, ConceptOrder, ElementOrder, collectionsToProcess)

# write the full-quality image to Google Drive and get a link to insert next to the lower-quality picture in the Google Sheet
MyfolderID = '1UJNvXdlLO-4QwYKESr7B5N4hWXrkKRTY'
FAIRGraphLink = md.WriteToGoogle(FAIRGraph, folderID=MyfolderID, Convert=None, Link=True)
                                   
#create Excel report on all analyses, write additional functions on data to provide some collection analytics
md.CombinationSpreadsheet(XpathOccurrence, FAIROccurrence,
                          FAIRConcept, FAIRGraph,
                          FAIRGraphLink, DataDestination
                         )
# write the spreadsheet to Google Drive, convert to Sheet
md.WriteToGoogle(DataDestination, folderID=MyfolderID, Convert=True)