# Pipeline Stage 2
Several files that help manage the data synthesis process for SGCN species are stored with the root collection item in ScienceBase. This notebook caches those files for processing. It was also used initially to document those data objects for later reference and for validation with JSON Schema.

I added a return_data parameter to the cache_sgcn_metadata() function in the pysgcn package, which handles retrieving and caching the additional metadata files from the SGCN collection item, so that it can return the actual data from those files for building schema documentation. I also added a build_definitions parameter to the generate_json_schema() function in the pySppIn utils module. It triggers an interactive input step during schema generation that prompts for a title and description for the schema itself and for each property in the schema. By looping over the individual metadata files and calling the schema generation function with this parameter, we can supply a small amount of additional documentation that essentially serves as entity/attribute metadata for the package. I then dumped those schema documents to files for later reference in data validation steps.

Right now, the generate_json_schema() function works for relatively simple cases like these, where we essentially just have tables of attributes and values. It does not yet work for more complex nested data structures. I added the functions for working with schemas to pySppIn because they apply to other cases well beyond the SGCN, but we will ultimately want to either contribute the code to some other project or build out a more general data management package.
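
To make the "tables of attributes and values" case concrete, here is a minimal sketch of how a schema could be inferred from a list of flat records. It is not the pySppIn implementation: the function name `infer_flat_schema` and its `definitions` argument are hypothetical, and the `definitions` mapping stands in for the interactive prompts that build_definitions triggers. The sketch returns a JSON string only because the notebook below passes the result of generate_json_schema() through json.loads(), and it assumes the schema describes a single record.

```python
import json


def infer_flat_schema(records, title="", description="", definitions=None):
    """Build a simple JSON Schema for a list of flat dict records.

    Hypothetical helper for illustration only. `definitions` maps a
    property name to a (title, description) tuple and replaces the
    interactive prompts used in the notebook.
    """
    definitions = definitions or {}
    # Map Python types to JSON Schema type names; anything else falls back to "string"
    type_names = {
        str: "string",
        bool: "boolean",
        int: "integer",
        float: "number",
        type(None): "null",
    }

    properties = {}
    for record in records:
        for key, value in record.items():
            # Nested structures are out of scope, matching the limitation described above
            if isinstance(value, (dict, list)):
                raise ValueError(f"nested value in '{key}' is not supported")
            if key not in properties:
                # The type is taken from the first value seen for each property
                prop = {"type": type_names.get(type(value), "string")}
                if key in definitions:
                    prop["title"], prop["description"] = definitions[key]
                properties[key] = prop

    # A property is required only if every record carries it
    required = [k for k in properties if all(k in r for r in records)]

    schema = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "title": title,
        "description": description,
        "type": "object",
        "properties": properties,
        "required": required,
    }
    return json.dumps(schema)


# Usage with a sample record taken from the output further down
sample = [{"status": "Endangered", "abb": "E", "order": "1"}]
schema = json.loads(
    infer_flat_schema(
        sample,
        title="Hierarchy for FWS Listing Status",
        definitions={"abb": ("Abbreviation", "Abbreviated listing status")},
    )
)
print(schema["properties"]["abb"])
```

Passing the titles and descriptions as data rather than prompting for them would also make the documentation step repeatable, at the cost of the convenience of filling them in while looking at a sample record.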

In [1]:
import json

from IPython.display import display

import pysgcn
import pysppin

# Helper objects from the two packages used throughout this notebook
sgcn = pysgcn.sgcn.Sgcn()
pysppin_utils = pysppin.utils.Utils()


In [2]:
sgcn_meta = sgcn.cache_sgcn_metadata(return_data=True)

In [3]:
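for title, records in sgcn_meta.items():
    print(title)
    # Show the first record as a sample of the file's structure
    display(records[0])
    # build_definitions=True prompts for titles and descriptions (see the output below)
    schema = json.loads(pysppin_utils.generate_json_schema(records, build_definitions=True))
    # Write the schema to <title>_schema.json, with spaces replaced by underscores
    with open(f"{title.replace(' ', '_')}_schema.json", "w") as f:
        json.dump(schema, f)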

Historic 2005 SWAP National List


{'scientific_name': '?Agastoschizomus n.sp.'}

schema title: Historic 2005 SWAP National List
schema description: List of scientific names extracted from the original State Wildlife Action Plans used for historical purposes to identify species that were on the 2005 National List
title for scientific_name: Scientific Name
description for scientific_name: Scientific name used in the 2005 National List
SGCN ITIS Overrides


{'ScientificName_original': 'Vermiforma pinus',
 'taxonomicAuthorityID': 'http://services.itis.gov/?q=tsn:950011'}

schema title: SGCN ITIS Overrides
schema description: Scientific names linked to ITIS Taxonomic Serial Numbers by hand used to compensate for insufficient name matches
title for ScientificName_original: Original Scientific Name
description for ScientificName_original: Scientific name string supplied in the state/territory submissions
title for taxonomicAuthorityID: Taxonomic Authority ID
description for taxonomicAuthorityID: Identifier in URL form for the ITIS Taxonomic Serial Number
Taxonomic Group Mappings


{'name': 'Amphibia', 'rank': 'Class', 'sgcntaxonomicgroup': 'Amphibians'}

schema title: Taxonomic Group Mappings
schema description: Set of logical mappings from names in the taxonomic hierarchy to logical taxa groups used in the SGCN
title for name: Name
description for name: Name at taxonomic rank
title for rank: Rank
description for rank: Taxonomic rank from ITIS or WoRMS
title for sgcntaxonomicgroup: SGCN Taxonomic Group
description for sgcntaxonomicgroup: Logical taxa group name used in the SGCN
Hierarchy for FWS Listing Status


{'status': 'Endangered', 'abb': 'E', 'order': '1'}

schema title: Hierarchy for FWS Listing Status
schema description: Ordered hierarchy for determining the listing status to display for species found on the FWS List of Threatened and Endangered species to account for cases where there are multiple listing statuses for a given species
title for status: Status
description for status: Full name of listing status designation
title for abb: Abbreviation
description for abb: Abbreviated listing status
title for order: Order
description for order: Numeric order for selecting a single listing status to display
NatureServe National Conservation Status Descriptions


{'code': 'NNR', 'definition': 'Unranked'}

schema title: NatureServe National Conservation Status Descriptions
schema description: Mappings of NatureServe National Conservation Status codes to descriptions
title for code: Code
description for code: NatureServe National Conservation Status code
title for definition: NatureServe National Conservation Status description
description for definition: NatureServe National Conservation Status definition
Fish and Wildlife Service Endangered Species Program Species Status Codes


{'tess_code': 'Endangered',
 'definition': 'A species in danger of extinction throughout all or a significant portion of its range.'}

schema title: Fish and Wildlife Service Endangered Species Program Species Status Codes
schema description: Definitions of FWS listing status designations
title for tess_code: TESS Code
description for tess_code: Listing status designation code used in the FWS Ecological Conservation Online System
title for definition: Definition
description for definition: Full description of the listing status code
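
The schema files written above are intended for later data validation steps. As a rough illustration of what such a step checks, the sketch below compares one flat record against a schema using only the standard library. The function `check_record` is hypothetical, the schema literal is an assumed shape (the generated files may differ), and only the `required` and `type` keywords are handled. A real validation step would more likely rely on a full JSON Schema validator such as the jsonschema package.

```python
def check_record(record, schema):
    """Return a list of problems found when comparing one flat record to a schema.

    Hypothetical helper for illustration only. It checks `required` and
    simple `type` keywords and is not a full JSON Schema validator.
    """
    python_types = {
        "string": str,
        "integer": int,
        "number": (int, float),
        "boolean": bool,
    }
    problems = []

    # Every required property must be present
    for name in schema.get("required", []):
        if name not in record:
            problems.append(f"missing required property: {name}")

    # Values of known properties must match the declared type;
    # properties not described in the schema are ignored
    for name, value in record.items():
        prop = schema.get("properties", {}).get(name)
        if prop is None:
            continue
        expected = python_types.get(prop.get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{name} should be of type {prop['type']}")

    return problems


# Assumed schema shape for the NatureServe status descriptions file
schema = {
    "title": "NatureServe National Conservation Status Descriptions",
    "type": "object",
    "properties": {
        "code": {"type": "string"},
        "definition": {"type": "string"},
    },
    "required": ["code", "definition"],
}

good = check_record({"code": "NNR", "definition": "Unranked"}, schema)
bad = check_record({"code": 5}, schema)
print(good)
print(bad)
```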
