# Data mining big metadata at DataONE: Creating a (meta)dataset
###  Identifying and preparing EML records uploaded to DataONE by individual LTER sites, 2005-2018 for analysis

This notebook chronicles the research process in creating a dataset to look at the evolution of LTER metadata, as it was uploaded to DataONE. 

This is a description of the hardware and software used in this research:

[Hardware_and_Software.md](../scripts/Hardware_and_Software.md)

This is the NbMeta metadata record for this notebook:

[Create_a_metadataset_nbmeta.json](../metadata/Create_a_metadataset_nbmeta.json)

During a 2016 investigation of LTER recommendation completeness through time, it was discovered that records from a later year were no more likely to be complete than those of an earlier year. This result came from a dataset that was a random sample of LTER. 

[2016 AGU poster](https://agu.confex.com/agu/fm16/mediafile/Handout/Paper197117/DoCommunityRecommendationsImproveMetadataFINAL.pdf)

The data on the investigation into the LTER DataONE Member node's useage of the LTER Recommendation seemed to indicate that either there was not an organization knowledge of how to use recommendation or the recommendation was followed to varying degrees by individuals. Upon learning that LTER was actually comprised of several sites with varying start dates and technical knowledge, I decided to investigate if the DataONE metadata catalog for LTER could be reliably broken up by site and by year to see how the community evolved the useage of the recommendations.

I learned from the EDI that the packageid element and DataONE identifier convention included the three letter site abbreviation. With this information I began to dig into DataONE's central node to see if I could build a dataset that could allow me to investigate each site over time. 

This notebook describes the data mining and analysis used to decide which records would make a dataset that could be used to investigate the knowledge and understanding of the LTER Recommendation for completeness.

Once the records are identified they are downloaded to directories following the convention "Site__Year".

This precise dataset building is neccessary to get a clear view as we expect to see that different sites have different rates of improvement from different starting points. Additionally, even though they are all a part of LTER, they have different information needs and are likely to use slightly different metadata .

 To do this we'll need a list of all the sites and some metadata about records from DataONE's Solr index to see if it's possible to use part of the record name or idenifier to create collections of records for each site.

I obtained a list of sites from https://lternet.edu/site/

* Andrews Forest LTER (AND)
* Arctic LTER (ARC)
* Baltimore Ecosystem Study (BES)
* Beaufort Lagoon Ecosystem LTER (BLE)
* Bonanza Creek LTER (BNZ)
* California Current Ecosystem LTER (CCE)
* Cedar Creek Ecosystem Science Reserve (CDR)
* Central Arizona – Phoenix LTER (CAP)
* Coweeta LTER (CWT)
* Florida Coastal Everglades LTER (FCE)
* Georgia Coastal Ecosystems LTER (GCE)
* Harvard Forest LTER (HFR)
* Hubbard Brook LTER (HBR)
* Jornada Basin LTER (JRN)
* Kellogg Biological Station LTER (KBS)
* Konza Prairie LTER (KNZ)
* LTER Network (NWK)
* LTER Network Communications Office (NCO)
* Luquillo LTER (LUQ)
* McMurdo Dry Valleys LTER (MCM)
* Moorea Coral Reef LTER (MCR)
* Niwot Ridge LTER (NWT)
* North Temperate Lakes LTER (NTL)
* Northeast U.S. Shelf (NES)
* Northern Gulf of Alaska (NGA)
* Palmer Antarctica LTER (PAL)
* Plum Island Ecosystems LTER (PIE)
* Santa Barbara Coastal LTER (SBC)
* Sevilleta LTER (SEV)
* Virginia Coast Reserve LTER (VCR)

### DataONE SOLR queries 

I queried DataONE using their SOLR index: https://cn.dataone.org/cn/v2/query/solr  

First I determined how many metadata records were in the LTER collection at DataONE: http://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA+AND+authoritativeMN:*LTER&rows=1

On November 29th the LTER network had 77301 metadata records in DataONE. The response contains a subsection of the indexed metadata from the record. It's fairly complete and contains a couple of elements in the doc element that look promising, namely the identifier and the dataUrl.





<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">18</int>
<lst name="params">
<str name="q">formatType:METADATA AND authoritativeMN:*LTER</str>
<str name="rows">1</str>
</lst>
</lst>
<result name="response" numFound="77194" start="0">
<doc>
<str name="id">doi:10.6073/AA/knb-lter-sbc.1016.4</str>
<str name="identifier">doi:10.6073/AA/knb-lter-sbc.1016.4</str>
<str name="formatId">eml://ecoinformatics.org/eml-2.0.1</str>
<str name="formatType">METADATA</str>
<long name="size">15736</long>
<str name="checksum">dbb22b723005888b41fef9b8e2a0a42b</str>
<str name="submitter">uid=sbc,o=LTER,dc=ecoinformatics,dc=org</str>
<str name="checksumAlgorithm">MD5</str>
<str name="rightsHolder">uid=sbc,o=LTER,dc=ecoinformatics,dc=org</str>
<bool name="replicationAllowed">false</bool>
<str name="obsoletes">doi:10.6073/AA/knb-lter-sbc.1016.3</str>
<str name="obsoletedBy">doi:10.6073/AA/knb-lter-sbc.1016.5</str>
<date name="dateUploaded">2009-04-27T23:00:00Z</date>
<date name="updateDate">2009-04-27T23:00:00Z</date>
<date name="dateModified">2012-06-26T17:14:17.18Z</date>
<str name="datasource">urn:node:LTER</str>
<str name="authoritativeMN">urn:node:LTER</str>
<arr name="replicaMN">
<str>urn:node:CN</str>
<str>urn:node:LTER</str>
<.../>
<str name="dataUrl">
https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-sbc.1016.4
<.../>
</doc>
</result>
</response>

Let's look at 5 records to determine if we can identify the sites by the record's identifier or dataUrl elements in a consistent way.

http://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA+AND+authoritativeMN:*LTER&fl=identifier,dataUrl&rows=5

It looks like the identifier and the DataUrl contain site abbreviations from three of the sites in just 5 responses, and one, "bes" repeats three times in the same section of the string.

```xml
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">16</int>
<lst name="params">
<str name="q">formatType:METADATA AND authoritativeMN:*LTER</str>
<str name="fl">identifier,dataUrl</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="77194" start="0">
<doc>
<str name="identifier">doi:10.6073/AA/knb-lter-sbc.1016.4</str>
<str name="dataUrl">
https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-sbc.1016.4
</str>
</doc>
<doc>
<str name="identifier">doi:10.6073/AA/knb-lter-arc.584.1</str>
<str name="dataUrl">
https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-arc.584.1
</str>
</doc>
<doc>
<str name="identifier">doi:10.6073/AA/knb-lter-bes.10.37</str>
<str name="dataUrl">
https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.10.37
</str>
</doc>
<doc>
<str name="identifier">doi:10.6073/AA/knb-lter-bes.12.48</str>
<str name="dataUrl">
https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.12.48
</str>
</doc>
<doc>
<str name="identifier">doi:10.6073/AA/knb-lter-bes.432.53</str>
<str name="dataUrl">
https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.432.53
</str>
</doc>
</result>
</response>
```

 Are there a lot of records with this convention? Back to Solr to add parameters to our query. Let's see how many records contain the substring lter-bes in their identifiers.

http://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA+AND+authoritativeMN:*LTER+AND+identifier:%27-lter-bes%27&rows=0

Over 4000 records! I think we're on to something, how about the rest of the sites?

In [2]:
# combine the results and create a table with records that share the same identifier grouped together

import glob
import pandas as pd
import sys 
import os
import random
import subprocess

from IPython.core.display import display, HTML

sys.path.append(os.path.join(os.path.dirname(sys.path[0]),'../scripts'))

import MDeval as md


Sites = ['and','arc','bes','ble','bnz',
         'cce','cdr','cap','cwt','fce',
         'gce','hfr','hbr','jrn','kbs',
         'knz','luq','mcm', 'mcr','nwk',
         'nco','nwt','ntl','nes','nga',
         'pal','pie','sbc','sev','sgs','vcr']
# create a url for each of the sites
for site in Sites:
    dataURLprefix = "http://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA+AND+authoritativeMN:*LTER+AND+identifier:%27-lter-"
    dataURLsuffix = "%27&rows=0"
    hyperlink = str(dataURLprefix + site + dataURLsuffix)
    href = '<a href="' + hyperlink + '"/>' + hyperlink + '</a>'
    display(HTML(href))

### Site record counts

I retrieved total collection size for each site on October 30th, 2018 using the above query urls

* Andrews Forest LTER (AND) = 429
* Arctic LTER (ARC) = 4495
* Baltimore Ecosystem Study (BES) = 4320
* Beaufort Lagoon Ecosystem (BLE) = 0
* Bonanza Creek LTER (BNZ) = 5501
* California Current Ecosystem LTER (CCE) = 219
* Cedar Creek Ecosystem Science Reserve (CDR) = 3063
* Central Arizona – Phoenix LTER (CAP) = 1775
* Coweeta LTER (CWT) = 459
* Florida Coastal Everglades LTER (FCE) = 2216
* Georgia Coastal Ecosystems LTER (GCE) = 3777
* Harvard Forest LTER (HFR) = 2199
* Hubbard Brook LTER (HBR) = 406
* Jornada Basin LTER (JRN) = 450
* Kellogg Biological Station LTER (KBS) = 927
* Konza Prairie LTER (KNZ) = 515
* LTER Network (NWK) = 6
* LTER Network Communications Office (NCO) = 0
* Luquillo LTER (LUQ) = 571
* McMurdo Dry Valleys LTER (MCM) = 918
* Moorea Coral Reef LTER (MCR) = 449
* Niwot Ridge LTER (NWT) = 464
* North Temperate Lakes LTER (NTL) = 1337
* Northeast U.S. Shelf (NES) = 2
* Northern Gulf of Alaska (NGA) = 0 
* Palmer Antarctica LTER (PAL) = 308
* Plum Island Ecosystems LTER (PIE) = 1190
* Santa Barbara Coastal LTER (SBC) = 1165
* Sevilleta LTER (SEV) = 1113
* Shortgrass Steppe (No longer funded by NSF LTER) (SGS) = 504
* Virginia Coast Reserve LTER (VCR) = 1965



### Something doesn't add up!

Grand total found at sites: 40743

Total DataONE gave: 77194

That means almost half the collection's identifiers did not follow the convention or were otherwise unavailable. Are there other LTER sites not in their list?

I found Landsat records while searching for a string that contained and.

LTER-Landsat-LEDAPS had 20593 records. They are all uploaded at the same time, so it is likely a batch transform to EML from CSDGM, and would not yield any data concerning improvement through time.

Leaving a total of 15,858 records still unaccounted for. there are identifier substrings like lter-nin that have a couple hundred records, but for the majority of the records we can use the identifier to search and retrieve by site.

40743 records is still a lot of metadata and most sites have hundreds if not thousands of records. Are the record uploads spread throughout the years in question?

Let's look at the returned queries created below to see if we can break up the records by year uploaded using the dateUploaded element. Most sites appear to upload multiple years worth of records in their first year, and continue to deposit some records for most of the following years.

### Do LTER sites deposit through time?
Once we've removed the sites with less than 10 records, we can facet the Solr query for each of the 26 sites to count records for each of the remaining sites and each year.

In [3]:
# rebuild the sites list with collections from the LTER list that have more than 10 records that follow the predictible identifier pattern

Sites = ['and','arc','bes','bnz','cce',
         'cdr','cap','cwt','fce','gce',
         'hfr','hbr','jrn','kbs','knz',
         'luq','mcm', 'mcr','nwt','ntl',
         'pal','pie','sbc','sev','sgs','vcr']
# create a hyperlink that queries each site
for site in Sites:
    dataURLprefix = "http://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA+AND+authoritativeMN:*LTER+AND+identifier:%27-lter-"
    dataURLsuffix = "%27&fl=dateUploaded,datePublished,dataUrl&rows=0&sort=dateUploaded+asc&facet=true&facet.missing=true&facet.limit=-1&facet.range=dateUploaded&facet.range.start=2005-01-01T00:00:00Z&facet.range.end=2018-12-31T23:59:59.999Z&facet.range.gap=%2B1YEAR&wt=xml"
    hyperlink = str(dataURLprefix + site + dataURLsuffix)
    href = '<a href="' + hyperlink + '"/>' + site + '</a>'
    display(HTML(href))

#### Sample from GCE

```xml 
<lst name="dateUploaded">
<lst name="counts">
<int name="2005-01-01T00:00:00Z">1443</int>
<int name="2006-01-01T00:00:00Z">158</int>
<int name="2007-01-01T00:00:00Z">31</int>
<int name="2008-01-01T00:00:00Z">66</int>
<int name="2009-01-01T00:00:00Z">8</int>
<int name="2010-01-01T00:00:00Z">63</int>
<int name="2011-01-01T00:00:00Z">3</int>
<int name="2012-01-01T00:00:00Z">65</int>
<int name="2013-01-01T00:00:00Z">42</int>
<int name="2014-01-01T00:00:00Z">47</int>
<int name="2015-01-01T00:00:00Z">982</int>
<int name="2016-01-01T00:00:00Z">41</int>
<int name="2017-01-01T00:00:00Z">23</int>
<int name="2018-01-01T00:00:00Z">53</int>
</lst>
<str name="gap">+1YEAR</str>
<date name="start">2005-01-01T00:00:00Z</date>
<date name="end">2019-01-01T00:00:00Z</date>
```

As you can see, some years have significantly more activity but there is at least some demonstration of metadata records uploaded in each of the 14 years. Other sites will have years they do not upload metadata and the directory structure should reflect that. 
### Using python to create, submit, and retrieve the SOLR queries's responses as XML

Here we are creating similar queries but instead of counts we are interested in not just the site and the dateUploaded, we'd also like to gather the identifier, dataUrl, and if the record was obsoleted and which record it was obsoleted by.

In [4]:
Sites = ['and','arc','bes','bnz','cce',
         'cdr','cap','cwt','fce','gce',
         'hfr','hbr','jrn','kbs','knz',
         'luq','mcm', 'mcr','nwt','ntl',
         'pal','pie','sbc','sev','sgs','vcr']

dataURLprefix = "http://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA+AND+authoritativeMN:*LTER+AND+identifier:%27-lter-"
dataURLsuffix = "%27&fl=identifier,dateUploaded,datePublished,dataUrl,obsoletes,obsoletedBy&rows=6000&sort=dateUploaded+asc&facet=true&facet.missing=true&facet.limit=-1&facet.range=dateUploaded&facet.range.start=2005-01-01T00:00:00Z&facet.range.end=2018-12-31T23:59:59.999Z&facet.range.gap=%2B1YEAR&wt=xml" 
filePath = "../Solr/Responses/"
if not os.path.exists(filePath):
    os.makedirs(filePath)
fileType = ".xml"

dataURLs = [dataURLprefix + Site + dataURLsuffix for Site in Sites]
   

XMLnames = [filePath + Site.upper() + fileType for Site in Sites]

# download the query responses
md.get_records(dataURLs, XMLnames, well_formed=False)


### Digging in to the Solr responses 

We need to determine a few things about the records returned in the query to see if all of them are appropriate for inclusion in the metadataset.

First, do the LTER records get updated in DataONE, or do LTER sites contribute records without active curation?

Second, if records get updated, what is the rate of curation? Are records iteratively created in a short period of time? If so, the iterations to a complete record may add noise to measures of yearly collection completeness. Are records improved over a series of years? These records would be valuable as they may reflect an improvement in the understanding of the levels of the LTER Recommendation for Completeness. This means we would not want to only collect unobsoleted records.

Finally are there any issues with collections? Metadata quality is not merely curation of an individual record, but creating a collection of records that serve the needs of the user. Can analysis tools expose opportunities for the curators to improve the clarity of their communication of contextualizing data that turns datasets into information for the user?

To see if contributors are curating the representation of their resources, let's use the SolrResponses directory of XML. If we follow a chain of obsoletedBy elements, we can see that record versions are the final part of the identifier. We can apply an XSL transform to extract the base identifier and year, the record identifier, and the year, and write it to a csv for each site. This will result in a table that allows us to see if records are iteratively created and to what degree active curation plays a role in the metadata collection. We can also check to see if record identifiers are reused in the collection.

In [5]:
# driving the XSL transformation for each of the SolrResponses

os.makedirs("../Solr/Requests/", exist_ok=True)
os.makedirs("../Solr/Versions/", exist_ok=True)

# use Sites list to iterate over the creation of a dataset of csv describing each record at each site

for Site in Sites:
    version = ["/usr/bin/java",
           '-jar', "../scripts/saxon-b-9.0.jar",
           '-xsl:' + "../scripts/findIdentifiers.xsl",
           '-s:' + "../Solr/Responses/" + Site.upper() + ".xml",
           '-o:' + "../Solr/Versions/" + Site.upper() + '.csv'
          ]  
 
    subprocess.run(' '.join(version), shell=True, check=True)

In [6]:

# combine data from csv into one dataframe
CombinedVersionsDF = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "../Solr/Versions/", "*.csv"))))
CombinedVersionsDF.head(5)

Unnamed: 0,baseIDyear,Identifier,Year,ObsoletedByID,dataUrl
0,arc.584 2005,arc.584.1,2005,,https://cn.dataone.org/cn/v2/resolve/doi%3A10....
1,arc.994 2005,arc.994.10,2005,arc.994.11,https://cn.dataone.org/cn/v2/resolve/knb-lter-...
2,arc.1406 2005,arc.1406.7,2005,,https://cn.dataone.org/cn/v2/resolve/knb-lter-...
3,arc.1388 2005,arc.1388.2,2005,arc.1388.3,https://cn.dataone.org/cn/v2/resolve/doi%3A10....
4,arc.1596 2005,arc.1596.1,2005,arc.1596.2,https://cn.dataone.org/cn/v2/resolve/doi%3A10....


### Do records get updated in DataONE by LTER sites?

To check this, let's get all of the results of the FindIdentifiers.xsl transform into a dataframe. We make an assumption based on observing obsoleted chains in the Solr metadata about the identifier. We create a baseID which we define as the part of the identifier string before the last period and attach it to a year so that when a record is updated the previous version gets an "obsoletedBy" field in the SolrResponses. If a record is obsoleted by another record, there will be a "Yes" in the Obsoleted Column. We can use this information to count how many records there are that have an obsoletedBy element and divide this by the number of records identified by the SolrResponses metadata.

### Let's choose a baseID that has a lot of use and then compare the Identifier with the identifier in the Obsoleted

In [7]:
#Split the year off of the baseIDyear so we can look at the whole collection
BaseCombinedVersionsDF = CombinedVersionsDF
BaseCombinedVersionsDF['baseIDyear'] = CombinedVersionsDF['baseIDyear'].str.split(' ').apply(pd.Series, 1)
# Identify base ID use by counting up the baseIDyear data that share the same values
uniqueBaseIDcount = BaseCombinedVersionsDF.groupby('baseIDyear')["dataUrl"].nunique().sort_values(ascending=False)
uniqueBaseIDcount.head(5)

baseIDyear
bes.1      36
bes.401    36
bes.4      36
bes.14     34
bes.424    33
Name: dataUrl, dtype: int64

Use a top result to create a table and sort it

In [8]:
# use 'bes.4' to limit the rows of the data frame to just records we think are related
sampleRecordRows = BaseCombinedVersionsDF[BaseCombinedVersionsDF['baseIDyear']=='bes.4'].sort_values('Identifier')
pd.set_option('max_colwidth',400)
sampleRecordRows


Unnamed: 0,baseIDyear,Identifier,Year,ObsoletedByID,dataUrl
66,bes.4,bes.4.19,2006,bes.4.20,https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.4.19
54,bes.4,bes.4.20,2006,bes.4.21,https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.4.20
73,bes.4,bes.4.21,2006,bes.4.22,https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.4.21
75,bes.4,bes.4.22,2006,bes.4.23,https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.4.22
86,bes.4,bes.4.23,2006,bes.4.24,https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.4.23
76,bes.4,bes.4.24,2006,bes.4.26,https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.4.24
74,bes.4,bes.4.26,2006,bes.4.27,https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.4.26
69,bes.4,bes.4.27,2006,bes.4.28,https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.4.27
65,bes.4,bes.4.28,2006,bes.4.29,https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.4.28
87,bes.4,bes.4.29,2006,bes.4.30,https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-bes.4.29


It appears as although many records follow the expected pattern, there are related records that share the same baseID. In the bes.4 example there are two records that are present in the data, and while record 4.19 is sequentially (by the hundreths) versioned versions we'd expect to see such as bes.4.35 aren't a part of the collection anymore, and there are some jumps such as 4.39 to 4.41, there are also records like 4.4 and 4.5 that are merely related to similar datasets (or documents)

#### How many records have these LTER Sites uploaded to DataONE?

In [9]:
RecordTotal = len(CombinedVersionsDF.index)
print("There are", RecordTotal, "records described in the SolrResponse metadata records.")

There are 40922 records described in the SolrResponse metadata records.


In [10]:
ObsoletedCount = len(CombinedVersionsDF[(CombinedVersionsDF['ObsoletedByID']!='None')])
print('There are', ObsoletedCount, 'records that have been obsoleted by a newer version.')

There are 18860 records that have been obsoleted by a newer version.


#### What is the total percentage of records that have been obsoleted by newer versions?

In [11]:
                
print("{:.2%}".format(ObsoletedCount / RecordTotal),'of records are obsoleted by other records in the DataONE LTER metadata holdings.')

46.09% of records are obsoleted by other records in the DataONE LTER metadata holdings.


#### Are multiple versions uploaded and obsoleted in a given year?
The reason this question is important is that if record versions are iterated upon in rapid succession, including them all in a measure of collection completeness would probably introduce noise into the dataset. We can see from the previous table that records are infact iterated rapidly, so we need to find a way to only retrieve the most recent version within a year.

### Wow some groups are frequently uploading new versions of records on DataONE!
It turns out that although just over half of the records are not updated within a year there are quite a few records that are, some quite heavily.  

Out of curiousity, lets count the identifiers

In [12]:
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "../Solr/Versions/", "*.csv"))))
dfg = df.groupby('Identifier')['Year'].nunique().sort_values(ascending=False)
dfg.head(10)


Identifier
gce.317.17    2
cap.226.10    2
gce.455.8     2
bnz.100.16    2
bnz.100.17    2
bnz.100.19    2
hfr.108.28    2
mcm.9021.6    2
gce.455.7     2
cap.224.9     2
Name: Year, dtype: int64

Woah, how many identifiers share a record?

In [13]:
dfg.ge(2).value_counts(ascending=True)

True      4909
False    30957
Name: Year, dtype: int64

There are 4908 identifiers that are used for more than one unobsoleted record.

In [14]:
ids = CombinedVersionsDF["Identifier"]
duplicateIdentifiers = CombinedVersionsDF[ids.isin(ids[ids.duplicated()])].sort_values(["Identifier",'Year'])
duplicateIDdf = duplicateIdentifiers.drop("baseIDyear", axis=1)
duplicateIDdf.head()

Unnamed: 0,Identifier,Year,ObsoletedByID,dataUrl
47,and.2720.8,2005,,https://cn.dataone.org/cn/v2/resolve/knb-lter-and.2720.8
256,and.2720.8,2015,,https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fknb-lter-and%2F2720%2F8
48,and.2721.6,2005,,https://cn.dataone.org/cn/v2/resolve/knb-lter-and.2721.6
258,and.2721.6,2015,,https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fknb-lter-and%2F2721%2F6
67,and.2722.5,2005,and.2722.6,https://cn.dataone.org/cn/v2/resolve/doi%3A10.6073%2FAA%2Fknb-lter-and.2722.5


#### This looks like a serious problem, 12% of records are using a packageID already assigned to another record. 
Now remember, I normalized the sub-string that I extract from the full identifier. I chose to do this to improve consistency in my file names and to remove potential conficts with using "/" in filepaths. The identifiers are more akin to the /eml:eml/@packageID in the XML this way, just without the prefix of "knb-lter", so it is likely that the identifiers in the records share this confilict. Regardless of Solr identifier naming convention this is how the string is represented at the end of the dataUrl as well. 

Are these records actually the same, or do they just share an identifier? More concerningly, why are the original versions of the records not obsoleted by the new version? Worse, why are records that were obsoleted in 2005 being reintroduced to the collection as active results?

In [15]:
print("{:.2%}".format(4905/40735))

12.04%


#### Thats a sizable collection. Let's get a random sample of these records to compare the contents directly. 
Since there are over 4500 Identifiers and more than 9000 dataUrls in the list of duplicated records to download, let's take a random sample for further inquiry.

In [16]:
IDs = duplicateIDdf.Identifier.unique().tolist()
IDcount = len(IDs)
IDcount

5056

In [17]:

sampleIDs = random.sample(IDs, round(IDcount/99))
len(sampleIDs)


51

In [18]:


UniqueDuplicatedIDdf = duplicateIDdf[duplicateIDdf["Identifier"].isin(sampleIDs)]
# create collections of year to avoid records overwriting each other and to retain the data in the dataframe
for Year in UniqueDuplicatedIDdf.Year.unique().tolist():
    YearDuplicateIDdf = UniqueDuplicatedIDdf[UniqueDuplicatedIDdf["Year"]==Year]
    filePath = "../collections/duplicateIdentifiers/" + str(Year) + "/"
    if not os.path.exists(filePath):
        os.makedirs(filePath)
    fileType = ".xml"

    dataURLs = YearDuplicateIDdf.dataUrl.tolist()

    XMLnames = [filePath + Identifier + fileType for Identifier in YearDuplicateIDdf.Identifier.tolist()]
    
    md.get_records(dataURLs, XMLnames, well_formed=False)

There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-sbc.1006.8


### Evaluate the absolute content of the XML collections and create a table of metadata verticals so that content of an element can be compared across each instance of a shared identifier.

This will allow us to compare the records and see if they actually describe the same resource. Our first step is to evaluate the collection for content and structure.  

To create a table that allows us to compare records side by side, we'll use MDeval to arrange the collection evaluation ito a table with xpaths as rows and records as columns. We will add in the value 'No Content' if the record does not contain the xpaths we've chose to look at. Since these records generally have over a hundred different xpaths, we'll limit the output to some basic citation elements.

In [19]:
# use xsl to evaluate each XML year collection

os.makedirs("../data/duplicateIdentifiers", exist_ok=True)
for Year in UniqueDuplicatedIDdf.Year.unique().tolist():
    cmd = ["/usr/bin/java",
           '-jar', "../scripts/saxon-b-9.0.jar",
           '-xsl:' + "../scripts/AllNodes.xsl",
           '-s:' + "../scripts/dummy.xml",
           '-o:' + "../data/duplicateIdentifiers/"+ str(Year) + "_XpathEvaluated.csv",
           'recordSetPath=' + "../collections/duplicateIdentifiers/" + str(Year) + "/"]
 
    subprocess.run(' '.join(cmd), shell=True, check=True)

pd.set_option('max_colwidth',100)

# create a pattern to identify common citation concepts

LTERrecElementSample = ['/eml:eml/@packageId',
 '/eml:eml/dataset/title',
 '/eml:eml/dataset/creator/individualName/surName',
 '/eml:eml/dataset/pubDate',
 '/eml:eml/dataset/abstract',
 '/eml:eml/dataset/keywordSet/keyword']

LTERrecElementsPattern = '|'.join(LTERrecElementSample)

# combine data from csv into one dataframe
CombinedDataDF = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "../data/duplicateIdentifiers/", "*.csv"))))
CombinedDataDF_LTERsample = CombinedDataDF[CombinedDataDF["XPath"].str.contains(LTERrecElementsPattern)]
XpathContent = md.recordXpathContent(CombinedDataDF_LTERsample)
RecordVerticalsDF = XpathContent.sort_values(by=['Record','Collection'])
RecordVerticalsDF.transpose()

Unnamed: 0,15,95,27,48,37,49,38,50,39,51,...,89,26,90,12,91,13,92,93,14,94
,,,,,,,,,,,,,,,,,,,,,
Collection,2006,2016,2010,2015,2013,2015,2013,2015,2013,2015,...,2015,2009,2015,2005,2015,2005,2015,2015,2005,2015
Record,arc.10006.7.xml,arc.10006.7.xml,arc.10081.4.xml,arc.10081.4.xml,arc.10227.2.xml,arc.10227.2.xml,arc.10251.2.xml,arc.10251.2.xml,arc.10333.2.xml,arc.10333.2.xml,...,sbc.1006.8.xml,sbc.5004.5.xml,sbc.5004.5.xml,vcr.113.16.xml,vcr.113.16.xml,vcr.55.14.xml,vcr.55.14.xml,vcr.67.17.xml,vcr.75.23.xml,vcr.75.23.xml
/eml:eml/@packageId,knb-lter-arc.10006.7,knb-lter-arc.10006.7,knb-lter-arc.10081.4,knb-lter-arc.10081.4,knb-lter-arc.10227.2,knb-lter-arc.10227.2,knb-lter-arc.10251.2,knb-lter-arc.10251.2,knb-lter-arc.10333.2,knb-lter-arc.10333.2,...,knb-lter-sbc.1006.8,knb-lter-sbc.5004.5,knb-lter-sbc.5004.5,knb-lter-vcr.113.16,knb-lter-vcr.113.16,knb-lter-vcr.55.14,knb-lter-vcr.55.14,knb-lter-vcr.67.17,knb-lter-vcr.75.23,knb-lter-vcr.75.23
/eml:eml/dataset/abstract/para,"Leaf area, biomass, foliar carbon and nitrogen by species for destructive vegetation harvests. P...",,Yearly file describing the physical/chemical values recorded at various lakes near Toolik Resear...,Yearly file describing the physical/chemical values recorded at various lakes near Toolik Resear...,Total Chlorophyll a and primary production data for Toolik Lake and surrounding lakes near the A...,Total Chlorophyll a and primary production data for Toolik Lake and surrounding lakes near the A...,Yearly file describing the physical/chemical values recorded at various lakes near Toolik Resear...,Yearly file describing the physical/chemical values recorded at various lakes near Toolik Resear...,"Depth profiles of NO3-N NH4-N PO-4, TDN, TDP, PP, PN and PC for lakes at the Arctic LTER Toolik ...","Depth profiles of NO3-N NH4-N PO-4, TDN, TDP, PP, PN and PC for lakes at the Arctic LTER Toolik ...",...,,Precipitation was collected by the Santa Barbara County Flood Control District at Cold Springs B...,Precipitation was collected by the Santa Barbara County Flood Control District at Cold Springs B...,"This contains coordinates (UTM zone 18N, NAD27) for permanent vegetation plots on Parramore Isla...","This contains coordinates (UTM zone 18N, NAD27) for permanent vegetation plots on Parramore Isla...",Soil water samples were collected at two depths (15 cm and 50 cm) with porus cup lysimeters from...,Soil water samples were collected at two depths (15 cm and 50 cm) with porus cup lysimeters from...,"This is a 20-year record of small mammal trapping from the Powdermill Biological Station, Rector...",This dataset contains information on nutrient concentrations a the Virginia Coast Reserve Long-T...,This dataset contains information on nutrient concentrations a the Virginia Coast Reserve Long-T...
/eml:eml/dataset/abstract/section/para,,,,,,,,,,,...,The data described here were collected on LTER06 which took place from 2003-02-25 to 2003-03-06 ...,,,,,,,,,
/eml:eml/dataset/abstract/section/para/literalLayout,,"Leaf area, biomass, foliar carbon and nitrogen by species for destructive vegetation harvests. P...",,,,,,,,,...,,,,,,,,,,
/eml:eml/dataset/abstract/section/title,,,,,,,,,,,...,There are 4 basic types of measurements:,,,,,,,,,
/eml:eml/dataset/creator/individualName/surName,Shaver,Shaver,"Giblin, Luecke, Kling","Giblin, Luecke, Kling",Giblin,Giblin,"Miller, Giblin, Kipphut","Miller, Giblin, Kipphut","Giblin, Miller","Giblin, Miller",...,"Brzezinski, Carlson, Siegel, Washburn",Melack,Melack,"Richardson, Shugart, Porter","Richardson, Shugart, Porter",Day,Day,Merritt,"McGlathery, Christian","McGlathery, Christian"
/eml:eml/dataset/keywordSet/keyword,"leaf area, normalized difference vegetation index, foliar nitrogen, leaf carbon, carbon to nitro...","specific leaf area, primary production, organic matter, leaves, leaf area, inorganic nutrients, ...","alkalinity, ammonium, anions, aquatic ecosystems, arctic, bacteria, bacterial abundance, biogeoc...","alkalinity, ammonium, anions, aquatic ecosystems, arctic, bacteria, bacterial abundance, biogeoc...","chlorophyll a, pheophytin, arctic lakes, lakes","chlorophyll a, pheophytin, arctic lakes, lakes","chemistry, depth, pH, conductivity, temperature, photosynthetically active radiation, light, dis...","chemistry, depth, pH, conductivity, temperature, photosynthetically active radiation, light, dis...","phosphorus, ammonium, nitrate, total dissolved nitrogen, total dissolved phosphorus, particulate...","phosphorus, ammonium, nitrate, total dissolved nitrogen, total dissolved phosphorus, particulate...",...,"LTER, Santa Barbara Coastal, Ocean_biogeochemistry, phytoplankton, UNOLS_cruises, Biomass, Carbo...","hydrology, precipitation, COLDSPRINGS210, Rain, precipitation amount","hydrology, precipitation, COLDSPRINGS210, Rain, precipitation amount",System State/Condition,System State/Condition,"Inorganic Nutrients, Disturbance, soil water, nitrogen, fertilization, ammonium, nitrate, phosph...","Inorganic Nutrients, Disturbance, soil water, nitrogen, fertilization, ammonium, nitrate, phosph...","Populations, Biodiversity, populations, small mammals, trapping","Inorganic Nutrients, System State/Condition, nitrogen, phosphorus, nitrite, Ammonium, water qual...","Inorganic Nutrients, System State/Condition, nitrogen, phosphorus, nitrite, Ammonium, water qual..."


### Records do share common content

But they are not exactly the same records. Occasionally there are slight differences in element usage in the different versions, but a more in-depth survey of elements in the two records show no difference in the content for the metadata concept other than dataset access information being about Pasta, and the elements used in abstract.

So now we see that the records are actually slightly different, but not over the content that a DataONE user is likely to search over. This is a problem because potentially useful datasets will get pushed out of the top results by all of the repetitions and unobsoleted records. 

While this may have to do with the way that the membernode uploaded the metadata, it clearly shows that metadata curation needs an EAR not just for the creator, but centralized repositories need to turn a listening EAR to the collections they recieve as well so they can best serve their users with accurate results the user can trust.

Since we are interested in records created and uploaded in particular years, these repeated records do not add clarity, and thus we will not add them. If records were merely new versions and LTER continued to use the Solr obsoletedBy element, or this massive reupload didnt reupload previously obsoleted records alongside the most current new version, I could see including them in the year collection for a site, but I'd rather get a clear picture of the metadata created in those years as it contains the time when Pasta came on line and LTER made a big push to improve their metadata records for the data packages they share through DataONE. LEt's check to see if anyone is obsoleting records anymore, and try to create a set of requests that still contain data on the years 2015-2018.

#### Are any of the records created after 2014 in the original dataframe obsoleted by any other record?

As we can see from the results below, a couple of the LTER sites have continued to obsolete records in the DataONE holdings, but no groups obsoleted the previous versions of the records we've identified as reintroduced in 2015

In [20]:
IsObsoletedDF = CombinedVersionsDF[CombinedVersionsDF['ObsoletedByID']!='None']
IsObsoletedDF[IsObsoletedDF['Year'].isin(['2015', '2016', '2017', '2018'])]

Unnamed: 0,baseIDyear,Identifier,Year,ObsoletedByID,dataUrl
4349,arc.20036,arc.20036.3,2017,arc.20036.4,https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%...
4769,bnz.414,bnz.414.9,2015,bnz.414.10,https://cn.dataone.org/cn/v2/resolve/knb-lter-bnz.414.9
4770,bnz.451,bnz.451.10,2015,bnz.451.11,https://cn.dataone.org/cn/v2/resolve/knb-lter-bnz.451.10
4771,bnz.451,bnz.451.12,2015,bnz.451.14,https://cn.dataone.org/cn/v2/resolve/knb-lter-bnz.451.12
4772,bnz.450,bnz.450.12,2015,bnz.450.14,https://cn.dataone.org/cn/v2/resolve/knb-lter-bnz.450.12
4773,bnz.453,bnz.453.9,2015,bnz.453.10,https://cn.dataone.org/cn/v2/resolve/knb-lter-bnz.453.9
4774,bnz.455,bnz.455.5,2015,bnz.455.7,https://cn.dataone.org/cn/v2/resolve/knb-lter-bnz.455.5
4775,bnz.455,bnz.455.8,2015,bnz.455.9,https://cn.dataone.org/cn/v2/resolve/knb-lter-bnz.455.8
4776,bnz.455,bnz.455.10,2015,bnz.455.11,https://cn.dataone.org/cn/v2/resolve/knb-lter-bnz.455.10
4777,bnz.412,bnz.412.11,2015,bnz.412.13,https://cn.dataone.org/cn/v2/resolve/knb-lter-bnz.412.11


### Results 

Records have the same titles and content, with some changes that have to do with the pasta system, but not anything having to do with the contextualizing information for the dataset. Additionally very few records are obsoleted after 2015, and none of the records that were obsoleted before 2015. Including records after the pasta system came online would introduce a lot of noise into the metadataset, and it looks like records that were previously added to DataONE and obsoleted are reintroduced as active records so in 2015-2018 we have records being duplicated and unobsoleted, exposing reliability issues for the discoverability of datasets the DataONE repository holds.

While this spells out the value of not only the content creators using qualitative and quantitative checks, it describes the need for all repositories to ensure that the metadata in their active collections is accurate and useful for their users. It means the records available in DataONE still need active curation from DataONE. It is not enough to simply rely on the uploaders for the quality and completeness of records.

We can utilize our identification of the duplicated record's dataUrls to remove the duplicate records from the SOLR/Requests csvs to just download new records. This allows us a more accurate view into the evolution of the metadata's records over time and gives us the opportunity to look at perhaps the most interesting time period The only remaining concern is the trend of not utilizing the ObsoletedBy version chaining, but this may be the result of using an external system for the creation of records. 


In [21]:
# create csv of records to download

Sites = ['and','arc','bes','bnz','cce',
         'cdr','cap','cwt','fce','gce',
         'hfr','hbr','jrn','kbs','knz',
         'luq','mcm', 'mcr','nwt','ntl',
         'pal','pie','sbc','sev','sgs','vcr']
dataURLprefix = "http://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA+AND+authoritativeMN:*LTER+AND+identifier:%27-lter-"
dataURLsuffix = "%27+AND+dateUploaded:[2005-01-01T00:00:00Z%20TO%202018-12-31T23:59:59.999Z]&fl=identifier,dateUploaded,datePublished,dataUrl,obsoletes,obsoletedBy&rows=6000&sort=dateUploaded+asc&facet=true&facet.missing=true&facet.limit=-1&facet.range=dateUploaded&facet.range.start=2005-01-01T00:00:00Z&facet.range.end=2014-12-31T23:59:59.999Z&facet.range.gap=%2B1YEAR&wt=xml" 
filePath = "../Solr/Responses/"
if not os.path.exists(filePath):
    os.makedirs(filePath)
fileType = ".xml"

dataURLs = [dataURLprefix + Site + dataURLsuffix for Site in Sites]
   

XMLnames = [filePath + Site.upper() + fileType for Site in Sites]

# download the query responses
md.get_records(dataURLs, XMLnames, well_formed=False)

### Using XSL to transform the metadata about metadata into the metadata collections we want to analyze

Now that we have a XML document for each site that lists a download url and a date uploaded for each record as well as an obsoleted by that contains an identifier.

Running these csv with the getRecords function results in directories being created for each site and year records were uploaded. For example, the records for AND from 2005 were downloaded to [../metadata/AND__2005](../metadata/AND/2005) 

### Downloading the metadataset
Now that we have the metadata for records from the years we want to select from, we need to ensure that we only download the records that did not get obsoleted that year so we get just the version of the record that is not obsoleted that year, which should give us the best understanding of recommendation completeness for the time period, since we see a high degree of versioning to records in short time periods by some LTER sites in the repository.

To do this we'll apply an XSL transform to the XML records our Solr queries generated for each site. The transform will utilize a key of record identifiers and their upload years that the obsloetedBy and identifier can be checked against for each record. The transform will return a csv that we can use with the MDeval getRecords function to create directories for the records we want to include in the metadataset. Before we use the dataframe with the function, we will use the list of records that share names and remove the ones after 2014 from the set of requests. MDeval.getRecords will download the records into the correct directory.

In [22]:
# run xsl to get most recent version for each year

os.makedirs("../Solr/Requests/", exist_ok=True)
for Site in Sites:
    cmd = ["/usr/bin/java",
           '-jar', "../scripts/saxon-b-9.0.jar",
           '-xsl:' + "../scripts/YearlyCollectionOrganizationMostRecentVersion.xsl",
           '-s:' + "../Solr/Responses/" + Site + ".xml",
           '-o:' + "../Solr/Requests/" + Site.upper() + '.csv'
          ]
 
    subprocess.run(' '.join(cmd), shell=True, check=True)

In [23]:
'''
create list of dataUrls that are duplicates of previous years and use it
to remove the 12% of the collection that is noise to the dataset
'''
years = [2015,2016,2017,2018]

removeDupes = duplicateIDdf[duplicateIDdf.Year.isin(years)]
removeDupes_dataUrl_list = removeDupes.dataUrl.tolist()
Records = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "../Solr/Requests/", "*.csv"))))
NoDupesDF = Records[~Records.recordURL.isin(removeDupes_dataUrl_list)]


In [24]:
# use the above dataframe to download records
recordURL = NoDupesDF.recordURL.tolist()
recordPath = NoDupesDF.recordPath.tolist()
for path in recordPath:
    os.makedirs(os.path.join('../collections',path.split('/')[-3], path.split('/')[-2]), exist_ok=True)

md.get_records(recordURL, recordPath, well_formed=False)

There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1135.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1145.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1573.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1575.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1188.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1364.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1472.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1479.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1578.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1167.2
There was an error downloading from https://cn.dataone.org/cn/v2/resol

#### Just under one percent of records aren't available

Let's try to download the records again to ensure that it wasn't just a temporary error.

In [25]:
''' 
I created a csv of the urls in the above error messages to find the proper line in the original
request and try to download those files again
'''


ErrorRequests = pd.read_csv("../scripts/missingRecord_dataUrls.csv", header=None)[0]

TryAgainDF = NoDupesDF[NoDupesDF['recordURL'].isin(ErrorRequests)]

recordURL = TryAgainDF.recordURL.tolist()
recordPath = TryAgainDF.recordPath.tolist()

md.get_records(recordURL, recordPath, well_formed=False)


There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1135.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1145.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1573.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1575.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1188.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1364.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1472.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1479.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1578.2
There was an error downloading from https://cn.dataone.org/cn/v2/resolve/knb-lter-arc.1167.2
There was an error downloading from https://cn.dataone.org/cn/v2/resol

In [26]:
# created second error list to compare with inital failures to see if any succeeded

secondTryDF = pd.read_csv("../scripts/2ndTry_missingRecord_dataUrls.csv", header=None)[0]

secondTryList = secondTryDF.tolist()

len(ErrorRequests) - len(secondTryDF)

0

### Consistent failures on the records in question

I'm not sure what the significance is of records being in the Solr index but not existing in the holdings. In any event it is not desired behavior on a discovery platform such as DataONE, even if it isn't the typical interface. Some of those data url look like DOI. Can we find the metadata that way?

https://www.doi%3A10.6073%2FAA%2Fknb-lter-cdr.129133.123

It turns out that even the DOI of these records has been terminated. These records don't exist anymore. By identifying the problem and alerting the repository, these errors were fixed, which enhances the clarity of the collection and allows users of the repository to have a better chance to find the datasets they want for their research.


We didn't get any more records. Lets remove any empty records created for downloads that failed, and, if they exist, any empty directories created to hold them.

In [27]:
# use 2ndTry_missingRecord_dataUrls to identify records we need to delete.

DeleteDF = Records.loc[Records['recordURL'].isin(secondTryDF)]

# create a list of filepaths to delete
recordPath = DeleteDF.recordPath.tolist()

# remove records that were DL errors
for record in recordPath:
    if os.path.isfile(record):
        os.remove(record)
# delete directories if they are empty (shouldn't be any, but...)
top = '../collections/LTER/'
for root, dirs, files in os.walk(top, topdown=False):
    for name in dirs:
        dir_path = os.path.join(root, name)
        if not os.listdir(dir_path):  # An empty list is False
            os.rmdir(os.path.join(root, name))
            
# Test to see if any failed downloads are still there

for record in recordPath:
    if os.path.isfile(record):
        print(record)

Now that we have our metadataset we can look at the structure and content of the records with some questions in mind.

* Is there an influence of the EML Best Practices for LTER Sites recommendation on each site as they started contributing to DataONE? 
* Do the sites improve their use of the narrative recommendation by submitting more complete records as the years progress? 
* Was there a measurable effect when Pasta and other automated came online in the fields used?
* What are the checks for Pasta?
* Was there a change in records when the recommendation changed in 2011?
* Did records become more similar over time?
* Did they become more or less focused on elements in the recommendations over time
* Did groups become more or less verbose over time? 
* Do the sites agree on what quality metadata is?
* What is the most complete site__year for the recommendation?
* What collection has most improved the completeness of their metadata for the recommendations?
* Have any collections not changed?

To try and address these questions, let's utilize another notebook that uses a module called EARmd.py.

### [Evaluate, Analyze, Report - LTER site uploads 2005-2018](Evaluate_Analyze_and_Report_metadata.ipynb)