### While the evaluator identifies and assigns concept names to content in many geoscience metadata dialects, records also contain much content that the dialect definition does not cover. This notebook compares the output of KnownNodes.xsl and AllNodes.xsl to identify XPaths that have no dialect definition, then appends the resulting dataframe to the KnownNodes dataframe. The result is a complete representation of the metadata.

Import the Python packages needed.

In [1]:
import pandas as pd


Read in the defined content for BCO-DMO's GeoTraces collection, written in the ISO dialect.

In [2]:
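# Load the nodes that have a concept defined in the ISO dialect
KnownNodesDF = pd.read_csv('../data/BCO-DMO/GeoTraces_ISO_dataKnown.csv')
# Unique XPaths that have a dialect definition
KnownPaths = KnownNodesDF.XPath.unique()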
KnownNodesDF

FileNotFoundError: File b'../data/BCO-DMO/GeoTraces_ISO_dataKnown.csv' does not exist

Read in all the content for BCO-DMO's GeoTraces collection, written in the ISO dialect.

In [3]:
AllNodesDF = pd.read_csv('../data/BCO-DMO/GeoTraces_ISO_dataAll.csv')
AllNodesDF

Unnamed: 0,Record,XPath,Content
0,dataset_3470.xml,/gmi:MI_Metadata/@xsi:schemaLocation,http://www.isotc211.org/2005/gmi http://www.ng...
1,dataset_3470.xml,/gmi:MI_Metadata/gmd:fileIdentifier/gco:Charac...,http://lod.bco-dmo.org/id/dataset/3470
2,dataset_3470.xml,/gmi:MI_Metadata/gmd:language/gco:CharacterString,eng; USA
3,dataset_3470.xml,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,utf8
4,dataset_3470.xml,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,http://www.isotc211.org/2005/resources/Codelis...
5,dataset_3470.xml,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,utf8
6,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_Sco...,dataset
7,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_Sco...,http://www.isotc211.org/2005/resources/Codelis...
8,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_Sco...,dataset
9,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_Sco...,005


Create a new dataframe from the rows in AllNodesDF whose XPath does not occur in KnownNodesDF, then append it to KnownNodesDF to create ConceptIfKnownDF, sorted by record and concept. After dropping duplicates and filling the nulls in the Concept column with "Unknown", save the result to CSV.

In [24]:
UnknownNodesDF = AllNodesDF.loc[~((AllNodesDF['Record'].isin(KnownNodesDF['Record'])) & AllNodesDF['XPath'].isin(KnownPaths))]
UnknownNodesDF

Unnamed: 0,Record,XPath,Content
0,dataset_3470.xml,/gmi:MI_Metadata/@xsi:schemaLocation,http://www.isotc211.org/2005/gmi http://www.ng...
1,dataset_3470.xml,/gmi:MI_Metadata/gmd:fileIdentifier/gco:Charac...,http://lod.bco-dmo.org/id/dataset/3470
2,dataset_3470.xml,/gmi:MI_Metadata/gmd:language/gco:CharacterString,eng; USA
3,dataset_3470.xml,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,utf8
4,dataset_3470.xml,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,http://www.isotc211.org/2005/resources/Codelis...
5,dataset_3470.xml,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_Chara...,utf8
7,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_Sco...,http://www.isotc211.org/2005/resources/Codelis...
8,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_Sco...,dataset
9,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_Sco...,005
10,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevelName/gco:Ch...,"Highest level of data collection, from a commo..."


In [26]:
# Rows whose record and XPath both appear in KnownNodesDF count as known; keep the rest
UnknownNodesDF = AllNodesDF.loc[~((AllNodesDF['Record'].isin(KnownNodesDF['Record'])) & AllNodesDF['XPath'].isin(KnownNodesDF['XPath']))]
#UnknownNodesDF = AllNodesDF[~((AllNodesDF.Record.isin(KnownNodesDF.Record)) & (AllNodesDF.XPath.isin(KnownNodesDF.XPath)))]
# Stack known and unknown rows, then order by record and concept
ConceptIfKnownDF = pd.concat([KnownNodesDF, UnknownNodesDF], axis=0).sort_values(['Record', 'Concept'])
# Drop exact duplicate rows and label missing values as "Unknown"
uniqueCIKDF = ConceptIfKnownDF.drop_duplicates().fillna('Unknown')
uniqueCIKDF.to_csv('../data/BCO-DMO/GeoTraces_ISO_Evaluated.csv', index=False)
uniqueCIKDF

Unnamed: 0,Concept,Content,Record,XPath
2,Abstract,"Nanomolar concentrations of PO4, NO3, NO2 (sur...",dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
261,Acknowledgement,Funding provided by NSF Ocean Sciences (NSF OC...,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
262,Acknowledgement,Funding provided by NSF Ocean Sciences (NSF OC...,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
80,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:contact/gmd:CI_Responsibl...
81,Address,"Department of Ocean, Earth, and Atmospheric Sc...",dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
82,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:contentInfo/gmd:MD_Featur...
83,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:distributionInfo/gmd:MD_D...
84,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:metadataMaintenance/gmd:M...
243,AssociatedDIFs,U.S. GEOTRACES,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
244,AssociatedDIFs,U.S. GEOTRACES NAT,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...


In [4]:
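# Simpler filter: keep rows whose XPath has no dialect definition in any record
UnknownNodesDF = AllNodesDF[~AllNodesDF.XPath.isin(KnownNodesDF.XPath)]
# Stack known and unknown rows, then order by record and concept
ConceptIfKnownDF = pd.concat([KnownNodesDF, UnknownNodesDF], axis=0).sort_values(['Record', 'Concept'])
# Drop exact duplicate rows and label missing values as "Unknown"
uniqueCIKDF = ConceptIfKnownDF.drop_duplicates().fillna('Unknown')
uniqueCIKDF.to_csv('../data/BCO-DMO/GeoTraces_ISO_Evaluated.csv', index=False)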
uniqueCIKDF

Unnamed: 0,Concept,Content,Record,XPath
2,Abstract,"Nanomolar concentrations of PO4, NO3, NO2 (sur...",dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
261,Acknowledgement,Funding provided by NSF Ocean Sciences (NSF OC...,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
262,Acknowledgement,Funding provided by NSF Ocean Sciences (NSF OC...,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
80,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:contact/gmd:CI_Responsibl...
81,Address,"Department of Ocean, Earth, and Atmospheric Sc...",dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
82,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:contentInfo/gmd:MD_Featur...
83,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:distributionInfo/gmd:MD_D...
84,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:metadataMaintenance/gmd:M...
243,AssociatedDIFs,U.S. GEOTRACES,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
244,AssociatedDIFs,U.S. GEOTRACES NAT,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD...
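
The filters above test Record and XPath membership independently (or XPath alone), so a row is treated as known whenever its XPath is defined in any record. If a stricter per-record comparison is ever wanted, one option is an anti-join on the (Record, XPath) pair using `DataFrame.merge` with `indicator=True`. The sketch below is illustrative only: the toy dataframes and the helper name `unknown_rows` are assumptions and are not part of this notebook. It also fills nulls in the Concept column only, which matches the description above more closely than a blanket `fillna('Unknown')`.

```python
import pandas as pd


def unknown_rows(all_df, known_df, keys=("Record", "XPath")):
    """Return rows of all_df whose key combination is absent from known_df.

    Hypothetical helper: an anti-join on the given key columns.
    """
    keys = list(keys)
    # Use unique key pairs only, so the merge cannot multiply rows of all_df.
    known_keys = known_df[keys].drop_duplicates()
    merged = all_df.merge(known_keys, on=keys, how="left", indicator=True)
    # "left_only" marks rows that found no matching (Record, XPath) pair.
    unknown = merged.loc[merged["_merge"] == "left_only"]
    return unknown.drop(columns="_merge").reset_index(drop=True)


# Toy data standing in for the KnownNodes and AllNodes CSVs (made-up values).
known = pd.DataFrame({
    "Concept": ["Abstract"],
    "Record": ["a.xml"],
    "XPath": ["/root/abstract"],
    "Content": ["text"],
})
everything = pd.DataFrame({
    "Record": ["a.xml", "a.xml", "b.xml"],
    "XPath": ["/root/abstract", "/root/extra", "/root/abstract"],
    "Content": ["text", "more", "other"],
})

unknown = unknown_rows(everything, known)
combined = pd.concat([known, unknown], axis=0).sort_values(["Record", "Concept"])
# Fill only the Concept column so that missing Content values stay null.
combined = combined.fillna({"Concept": "Unknown"})
```

In this toy example the row for `b.xml` is kept as unknown because no concept was assigned in that record, whereas an XPath-only `isin` filter would discard it. Whether that distinction matters depends on whether the KnownNodes output covers every record in the collection.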


I noticed that some ISO dialect definitions create exact duplicate rows, so the next cell drops those from the dataframe and saves another version of the CSV. To double-check, I reran using just saxon:path() for the absolute location and still saw the same reduction in row count, so it may be a better output. I'm not sure there will be a need for it in dialects whose XPaths resolve to the level of the content rather than to a higher-order element in the hierarchy.