### The evaluator identifies and assigns concept names to content in many geoscience metadata dialects, but much of the content in a record has no definition in the dialect. This notebook compares the output of KnownNodes.xsl and AllNodes.xsl to identify XPaths that have no dialect definition, then appends those rows to the KnownNodes dataframe. The result is a complete representation of the metadata.

Import the Python packages needed.

In [68]:
import pandas as pd


Read in the defined content for BCO-DMO's GeoTraces collection, written in the ISO dialect.

In [69]:
KnownNodesDF = pd.read_csv('../data/BCO-DMO/GeoTraces_ISO_dataKnown.csv')
KnownNodesDF

Unnamed: 0,Record,Concept,XPath,Content
0,dataset_3470.xml,Resource Title,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:title,GT10 - Nanomolar Nutrients - Surface from the U.S. GEOTRACES NAT project of the U.S. GEOTRACES program
1,dataset_3470.xml,Resource Creation/Revision Date,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:date/gmd:CI_Date/gmd:date/gco:Date,2013-02-27
2,dataset_3470.xml,Abstract,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:abstract,"Nanomolar concentrations of PO4, NO3, NO2 (surface) Dataset Description: &lt;p&gt;Nanomolar concentrations of PO4, NO3, NO2 - Surface Transects.&l..."
3,dataset_3470.xml,Topic Category,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:topicCategory/gmd:MD_TopicCategoryCode,oceans
4,dataset_3470.xml,Theme Keyword,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:descriptiveKeywords/gmd:MD_Keywords/gmd:keyword,cruise_id
5,dataset_3470.xml,Theme Keyword,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:descriptiveKeywords/gmd:MD_Keywords/gmd:keyword,date
6,dataset_3470.xml,Theme Keyword,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:descriptiveKeywords/gmd:MD_Keywords/gmd:keyword,time
7,dataset_3470.xml,Theme Keyword,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:descriptiveKeywords/gmd:MD_Keywords/gmd:keyword,latitude
8,dataset_3470.xml,Theme Keyword,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:descriptiveKeywords/gmd:MD_Keywords/gmd:keyword,longitude
9,dataset_3470.xml,Theme Keyword,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:descriptiveKeywords/gmd:MD_Keywords/gmd:keyword,sample


Read in all the content for BCO-DMO's GeoTraces collection, written in the ISO dialect.

In [70]:
AllNodesDF = pd.read_csv('../data/BCO-DMO/GeoTraces_ISO_dataAll.csv')
AllNodesDF

Unnamed: 0,Record,XPath,Content
0,dataset_3470.xml,/gmi:MI_Metadata/@xsi:schemaLocation,http://www.isotc211.org/2005/gmi http://www.ngdc.noaa.gov/metadata/published/xsd/schema.xsd
1,dataset_3470.xml,/gmi:MI_Metadata/gmd:fileIdentifier/gco:CharacterString,http://lod.bco-dmo.org/id/dataset/3470
2,dataset_3470.xml,/gmi:MI_Metadata/gmd:language/gco:CharacterString,eng; USA
3,dataset_3470.xml,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_CharacterSetCode,utf8
4,dataset_3470.xml,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_CharacterSetCode/@codeList,http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#MD_CharacterSetCode
5,dataset_3470.xml,/gmi:MI_Metadata/gmd:characterSet/gmd:MD_CharacterSetCode/@codeListValue,utf8
6,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_ScopeCode,dataset
7,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_ScopeCode/@codeList,http://www.isotc211.org/2005/resources/Codelist/gmxCodelists.xml#MD_ScopeCode
8,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_ScopeCode/@codeListValue,dataset
9,dataset_3470.xml,/gmi:MI_Metadata/gmd:hierarchyLevel/gmd:MD_ScopeCode/@codeSpace,005


Create a new dataframe, UnknownNodesDF, from the rows in AllNodesDF whose XPath does not occur in KnownNodesDF. Append it to KnownNodesDF to create ConceptIfKnownDF, sort by record and concept, then save the result to CSV.

In [71]:
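# Rows whose XPath has no dialect definition (XPath absent from KnownNodesDF)
UnknownNodesDF = AllNodesDF[~AllNodesDF.XPath.isin(KnownNodesDF.XPath)]
# Stack known and unknown rows; unknown rows have no Concept value
ConceptIfKnownDF = pd.concat([KnownNodesDF, UnknownNodesDF], axis=0).sort_values(['Record', 'Concept'])
ConceptIfKnownDF.to_csv('../data/BCO-DMO/data.csv', index=False)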
ConceptIfKnownDF

Unnamed: 0,Concept,Content,Record,XPath
2,Abstract,"Nanomolar concentrations of PO4, NO3, NO2 (surface) Dataset Description: &lt;p&gt;Nanomolar concentrations of PO4, NO3, NO2 - Surface Transects.&l...",dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:abstract
261,Acknowledgement,Funding provided by NSF Ocean Sciences (NSF OCE) Award Number: OCE-0926423 Award URL: http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0926423,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:credit
262,Acknowledgement,Funding provided by NSF Ocean Sciences (NSF OCE) Award Number: OCE-0926092 Award URL: http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0926092,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:credit
80,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:contact/gmd:CI_ResponsibleParty/gmd:contactInfo/gmd:CI_Contact/gmd:address/gmd:CI_Address/gmd:deliveryPoint
81,Address,"Department of Ocean, Earth, and Atmospheric Sciences 4500 Elkhorn Ave",dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:pointOfContact/gmd:CI_ResponsibleParty/gmd:contactInfo/gmd:CI_Contact/gmd:ad...
82,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:contentInfo/gmd:MD_FeatureCatalogueDescription/gmd:featureCatalogueCitation/gmd:CI_Citation/gmd:citedResponsibleParty/gmd:CI_...
83,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:distributionInfo/gmd:MD_Distribution/gmd:distributor/gmd:MD_Distributor/gmd:distributorContact/gmd:CI_ResponsibleParty/gmd:co...
84,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:metadataMaintenance/gmd:MD_MaintenanceInformation/gmd:contact/gmd:CI_ResponsibleParty/gmd:contactInfo/gmd:CI_Contact/gmd:addr...
243,AssociatedDIFs,U.S. GEOTRACES,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:aggregationInfo/gmd:MD_AggregateInformation/gmd:aggregateDataSetIdentifier/g...
244,AssociatedDIFs,U.S. GEOTRACES NAT,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:aggregationInfo/gmd:MD_AggregateInformation/gmd:aggregateDataSetIdentifier/g...

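The filter-and-append step above can be sketched on a toy example. This is a minimal illustration, not the collection data: the record name, XPaths, and content values are made up, and `sort=False` is passed to `pd.concat` only to keep the column order stable.

```python
import pandas as pd

# Hypothetical stand-in for the KnownNodes output (values are invented).
known = pd.DataFrame({
    "Record": ["a.xml", "a.xml"],
    "Concept": ["Resource Title", "Abstract"],
    "XPath": ["/root/title", "/root/abstract"],
    "Content": ["A title", "An abstract"],
})

# Hypothetical stand-in for the AllNodes output: same record, one extra XPath.
all_nodes = pd.DataFrame({
    "Record": ["a.xml", "a.xml", "a.xml"],
    "XPath": ["/root/title", "/root/abstract", "/root/language"],
    "Content": ["A title", "An abstract", "eng"],
})

# Keep only the rows whose XPath never appears in the known set.
unknown = all_nodes[~all_nodes["XPath"].isin(known["XPath"])]

# Stack known and unknown rows; the unknown rows get NaN in Concept.
combined = (
    pd.concat([known, unknown], axis=0, sort=False)
    .sort_values(["Record", "Concept"])
    .reset_index(drop=True)
)
print(combined)
```

Rows without a concept sort to the end of each record because `sort_values` places NaN last by default. Note also that `isin` compares on XPath alone across the whole collection, so an XPath that is defined in one record is treated as known in every record.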

I noticed that some ISO dialect definitions create exact duplicate rows, so the next cell drops those from the dataframe and saves another version of the CSV. To double-check, I reran using just saxon:path() for the absolute location and still saw the same reduction in row count, so this may be the better output. I'm not sure this step will be needed for dialects whose XPaths resolve to the level of the content rather than to a higher-order element in the hierarchy.

In [72]:
# Drop rows that are exact duplicates across every column
uniqueCIKDF = ConceptIfKnownDF.drop_duplicates()
uniqueCIKDF.to_csv('../data/BCO-DMO/data2.csv', index=False)
uniqueCIKDF

Unnamed: 0,Concept,Content,Record,XPath
2,Abstract,"Nanomolar concentrations of PO4, NO3, NO2 (surface) Dataset Description: &lt;p&gt;Nanomolar concentrations of PO4, NO3, NO2 - Surface Transects.&l...",dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:abstract
261,Acknowledgement,Funding provided by NSF Ocean Sciences (NSF OCE) Award Number: OCE-0926423 Award URL: http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0926423,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:credit
262,Acknowledgement,Funding provided by NSF Ocean Sciences (NSF OCE) Award Number: OCE-0926092 Award URL: http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0926092,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:credit
80,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:contact/gmd:CI_ResponsibleParty/gmd:contactInfo/gmd:CI_Contact/gmd:address/gmd:CI_Address/gmd:deliveryPoint
81,Address,"Department of Ocean, Earth, and Atmospheric Sciences 4500 Elkhorn Ave",dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:pointOfContact/gmd:CI_ResponsibleParty/gmd:contactInfo/gmd:CI_Contact/gmd:ad...
82,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:contentInfo/gmd:MD_FeatureCatalogueDescription/gmd:featureCatalogueCitation/gmd:CI_Citation/gmd:citedResponsibleParty/gmd:CI_...
83,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:distributionInfo/gmd:MD_Distribution/gmd:distributor/gmd:MD_Distributor/gmd:distributorContact/gmd:CI_ResponsibleParty/gmd:co...
84,Address,WHOI MS#36,dataset_3470.xml,/gmi:MI_Metadata/gmd:metadataMaintenance/gmd:MD_MaintenanceInformation/gmd:contact/gmd:CI_ResponsibleParty/gmd:contactInfo/gmd:CI_Contact/gmd:addr...
243,AssociatedDIFs,U.S. GEOTRACES,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:aggregationInfo/gmd:MD_AggregateInformation/gmd:aggregateDataSetIdentifier/g...
244,AssociatedDIFs,U.S. GEOTRACES NAT,dataset_3470.xml,/gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:aggregationInfo/gmd:MD_AggregateInformation/gmd:aggregateDataSetIdentifier/g...

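To see how much the deduplication removes, the duplicates can be counted before they are dropped. The sketch below uses invented rows (the record name and XPaths are hypothetical) to show that `drop_duplicates` removes only rows that match in every column, so the same content at a different XPath is kept.

```python
import pandas as pd

# Hypothetical rows: the first two are identical, the third differs only in XPath.
rows = pd.DataFrame({
    "Concept": ["Address", "Address", "Address"],
    "Content": ["WHOI MS#36", "WHOI MS#36", "WHOI MS#36"],
    "Record": ["a.xml", "a.xml", "a.xml"],
    "XPath": [
        "/root/contact/address",
        "/root/contact/address",
        "/root/distributor/address",
    ],
})

# duplicated() flags every repeat of an earlier, fully identical row.
n_exact = int(rows.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each identical row.
unique_rows = rows.drop_duplicates()

print(f"dropped {len(rows) - len(unique_rows)} of {len(rows)} rows")
```

Comparing `len(rows)` with `len(unique_rows)` gives the reduction in row count, which is the figure to compare between evaluator runs.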