### Identifying records that share identifiers at an LTER site to determine whether the concepts in the recommendation were a focus of the record re-upload effort

During my analysis of LTER sites through time, while sorting records for complete examples of the conceptual LTER recommendation for Completeness (cite esip wiki), I discovered that some records in the LTER member node of DataONE share identifiers. I assumed this was a mistake, so I compared the counts and saw the same numbers for almost every concept and element. When I investigated the content, I found the same titles, identifiers, and creators. I browsed the BNZ, BNZ_2005, and BNZ_2015 reports, as well as the AND, AND_2005, and AND_2015 reports, in Excel (great for data browsing), where each record name is the unique portion of the record's identifier element.

I thought something must be fishy with my analysis or my dataset, so I retraced how I created it and went back to the SOLR source. First, I confirmed that the SOLR query I used to create the dataset included any record made that year, regardless of whether it was later obsoleted. The idea was that anything published that year (and here we conflated upload and publish date a bit, imho) should reflect the understanding of the recommendation at the time. This is evident, for example, in records 1.16, 1.17, and 1.18.

This can make a huge difference for a site like BNZ that actively iterates on a record when it is created. In the query below, we can see that there are over 5000 records,

http://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA+AND+authoritativeMN:*LTER+AND+identifier:%27-lter-bnz%27&fl=obsoletes,obsoletedBy,dateUploaded,datePublished,dataUrl&rows=6446&sort=dateUploaded+asc&facet=true&facet.missing=true&facet.limit=-1&facet.range=dateUploaded&facet.range.start=2005-01-01T00:00:00Z&facet.range.end=2018-12-31T23:59:59.999Z&facet.range.gap=%2B1YEAR&wt=xml

but when we add in the additional constraint "-obsoletedBy:\*" we see fewer than half that:

http://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA+AND+authoritativeMN:*LTER+-obsoletedBy:*+AND+identifier:%27-lter-bnz%27&fl=obsoletes,obsoletedBy,dateUploaded,datePublished,dataUrl&rows=6446&sort=dateUploaded+asc&facet=true&facet.missing=true&facet.limit=-1&facet.range=dateUploaded&facet.range.start=2005-01-01T00:00:00Z&facet.range.end=2018-12-31T23:59:59.999Z&facet.range.gap=%2B1YEAR&wt=xml
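The two queries above differ only in the `-obsoletedBy:*` clause. A minimal sketch of building both from one template, so that difference stays explicit, might look like the following. The helper name `build_query` and the reduced parameter set (no facet options) are illustrative assumptions, not part of the original workflow, and no request is actually sent. `urlencode` percent-encodes characters such as `:` and `*`, so the resulting URLs look different from the hand-written ones but carry the same query.

```python
from urllib.parse import urlencode

BASE = "https://cn.dataone.org/cn/v2/query/solr/"


def build_query(site, include_obsoleted=True, rows=6446):
    """Return a DataONE Solr query URL for one LTER site's metadata records.

    `site` is the lowercase site code used in identifiers, e.g. "bnz".
    """
    q = "formatType:METADATA AND authoritativeMN:*LTER"
    if not include_obsoleted:
        # keep only records that no newer version has replaced
        q += " -obsoletedBy:*"
    q += f" AND identifier:'-lter-{site}'"
    params = {
        "q": q,
        "fl": "identifier,obsoletes,obsoletedBy,dateUploaded,datePublished,dataUrl",
        "rows": rows,
        "sort": "dateUploaded asc",
        "wt": "xml",
    }
    return BASE + "?" + urlencode(params)


all_records_url = build_query("bnz")
current_only_url = build_query("bnz", include_obsoleted=False)
print(all_records_url)
print(current_only_url)
```

Swapping the site code (for example `"and"`) would produce the matching pair of queries for another site.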

However, obsoleted records were generally iterated on within the same year, so they are not complete reflections of the understanding of the recommendation and probably skewed the results. Take this snippet from the query above, with an additional request for the record's identifier:

http://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA+AND+authoritativeMN:*LTER+-obsoletedBy:*+AND+identifier:%27-lter-bnz%27&fl=identifier,obsoletes,obsoletedBy,dateUploaded,datePublished,dataUrl&rows=6446&sort=dateUploaded+asc&facet=true&facet.missing=true&facet.limit=-1&facet.range=dateUploaded&facet.range.start=2005-01-01T00:00:00Z&facet.range.end=2018-12-31T23:59:59.999Z&facet.range.gap=%2B1YEAR&wt=xml

Here we see quick iterations on a record that is then not obsoleted by anything else:

However, when we look at the 2015 uploads, we see that records 1.17 and 1.18, along with a new version, 1.19, are not obsoleted by anything, yet all are record versions for the same ongoing temperature measurement data at Bonanza Creek.
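A check for this pattern can be sketched in a few lines, assuming identifiers follow the `scope.dataset.revision` form seen in `knb-lter-bnz.1.18`. The function names are hypothetical, and the sample list contains the three revisions mentioned above plus one invented entry for contrast.

```python
from collections import defaultdict


def split_package_id(package_id):
    """Split 'scope.dataset.revision' into (series, revision number)."""
    series, revision = package_id.rsplit(".", 1)
    return series, int(revision)


def series_with_multiple_current(records):
    """Return {series: [revisions]} for series with 2+ unobsoleted revisions.

    `records` is an iterable of (package_id, obsoleted_by) pairs, where
    obsoleted_by is None when nothing has replaced the record.
    """
    current = defaultdict(list)
    for package_id, obsoleted_by in records:
        if obsoleted_by is None:
            series, revision = split_package_id(package_id)
            current[series].append(revision)
    return {s: sorted(r) for s, r in current.items() if len(r) > 1}


records = [
    ("knb-lter-bnz.1.17", None),
    ("knb-lter-bnz.1.18", None),
    ("knb-lter-bnz.1.19", None),
    ("example-site.9.1", None),  # invented entry: a single current revision
]
flagged = series_with_multiple_current(records)
print(flagged)  # {'knb-lter-bnz.1': [17, 18, 19]}
```

A series that appears in the result has several versions that all claim to be current, which is the situation described for record 1 at Bonanza Creek.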

When we visit the dataUrl for records named 1.18 in the analysis we see they share the same @packageId and are both unobsoleted records:

https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fknb-lter-bnz%2F1%2F17


https://cn.dataone.org/cn/v2/resolve/knb-lter-bnz.1.18



How did these records from different dataUrls get the same name, making this noticeable? During data collection I decided that file names should not contain "/", to keep paths and file names distinct. This mistake, or rather, "... this happy little accident" (cite bob ross?!), made the record name column matchable. It is probably best to identify the records that share an identifier and obsolete one of the versions, so that discovery results are clear. Tracing successive versions of the same record would be difficult in the post-2015 records, but the previous versions should provide a decent amount of curation guidance. This highlights the need for automated tests of content across holdings, not just for a specific record, particularly for interdisciplinary repositories that serve as a secondary point of discovery to the originating organization. It also perhaps highlights a need to use existing identifiers from within the record rather than assigning a repo-specific id.
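As a rough illustration of such a holdings-wide test, the sketch below groups unobsoleted records by the package id found inside the record and reports any package id held under more than one repository identifier. The field names (`identifier`, `packageId`, `obsoletedBy`), the function name, and all sample values are hypothetical; a real check would read them from the harvested metadata.

```python
from collections import defaultdict


def shared_package_ids(holdings):
    """Return {packageId: [repository identifiers]} for package ids
    carried by two or more unobsoleted records."""
    by_package = defaultdict(list)
    for record in holdings:
        if record["obsoletedBy"] is None:
            by_package[record["packageId"]].append(record["identifier"])
    return {pid: ids for pid, ids in by_package.items() if len(ids) > 1}


# Invented holdings: two current records carry the same internal package id
holdings = [
    {"identifier": "repo-url/example/1/7", "packageId": "example.1.8", "obsoletedBy": None},
    {"identifier": "example.1.8", "packageId": "example.1.8", "obsoletedBy": None},
    {"identifier": "example.2.1", "packageId": "example.2.1", "obsoletedBy": "example.2.2"},
    {"identifier": "example.2.2", "packageId": "example.2.2", "obsoletedBy": None},
]
collisions = shared_package_ids(holdings)
print(collisions)
```

Run across a whole collection, an empty result would mean every current record has a distinct internal identifier; anything else is a candidate for curation.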

It may also signify a need for the originating community to have open-source software (OSS) that can directly compare the content of records for resources they created across the different discovery-focused resharing repositories.

Let's see if both of these trends continue in other LTER sites to determine if a new dataset should be created of just the unobsoleted records.

In [4]:
import os
import glob
import pandas as pd
import MDeval as md


In [6]:
from contextlib import contextmanager


@contextmanager
def cd(newdir):
    # temporarily change the working directory, restoring it on exit
    prevdir = os.getcwd()
    os.chdir(os.path.expanduser(newdir))
    try:
        yield
    finally:
        os.chdir(prevdir)


# Organization is assumed to be defined in an earlier cell (not shown here)
DirectoryChoice = '../data/'
# change to the chosen directory
with cd(DirectoryChoice):
    # identify all XPath occurrence files (*_XpathOccurrence.csv);
    # a relative, recursive pattern is assumed to be the intent here
    CollectionComparisons = glob.glob('**/*_XpathOccurrence.csv', recursive=True)
    # sort so the files are combined in a stable order
    CollectionComparisons.sort()
    # where to put the combined file
    DataDestination = (
        '../' + Organization + '/' +
        Organization + '_xpathOccurrence.csv'
    )
    md.CombineEvaluatedMetadata(CollectionComparisons, DataDestination)