# DRAFT Analysis Report

## Context

## Objective

- Identify a list of endpoints representing where the data is stored or catalogued.
- Harvest the endpoints where possible.
- Analyse the harvested data for the FAIRness & interoperability work done so far.
- Report the results & translate them into recommendations.

## Analysis methods & results
- For each of the given endpoints, a manual assessment was made of the FAIRness & interoperability of the offered data.
- The FAIRness & interoperability assessment consisted of checking:
    - how easily can data be found and accessed?
    - what is the data granularity, i.e. to what level is data available?
    - is data semantically unambiguous and interoperable?
    - how easily can data be integrated?

These analyses can be consulted in the Jupyter notebooks.

## Findings 

General findings:

The overall level of FAIRness is good at a basic level.

1. Given the endpoints, documentation and domain-specific knowledge, data is generally findable, accessible and usable.  

2. Interoperability/standardization  
A certain level of standardization is present in the offered data:
    - at the level of the service/endpoint: some services were developed following a standard (e.g. SensorThings API, Swagger APIs, ERDDAP).
    - at the level of the data offered by the service, e.g.:
        - use of the same column headers across similar kinds of services.
        - in some cases, use of standard identifiers such as ORCID iDs, URLs, ...

However, there is still room for improvement:  (maybe move this to recommendations)
    - unambiguous semantics: use of codes, and more (and more consistent) use of externally defined terms
    - alignment of data model structure across endpoints: currently the internal data structure is exposed via the endpoints (good practice seen in the ICOS SPARQL endpoint: offer a DCAT description). The domain of ocean observation would benefit from a common data model structure to expose data, given that data is similar in kind/type/nature across endpoints.


3. Domain-specific knowledge is required to be able to access and use the data  
    - at the level of the type of endpoint and data format:
    in order to access the data via the given endpoint, one must know how to navigate that type of endpoint, be it a JSON API, an ERDDAP server or a SPARQL endpoint.  
    Additionally, one must also know how to work with the file format in which the data is offered (within this project, most data is offered as netCDF, JSON, ...).

    - at the level of the data model:
    in order to use the data, one must know what the data (points) represent, and this requires knowledge of how the data is modelled.  

--> consequences:
    - domain-specific knowledge is either available in-house or needs to be obtained through a learning curve.
    - this makes data less interoperable & (re-)usable
    - ...
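
As a rough illustration of the endpoint-specific knowledge described in finding 3, the sketch below builds a request URL for three endpoint types. All base URLs, the dataset ID and the variable names are hypothetical placeholders, and the `v1.1` path segment assumes a SensorThings 1.1 deployment; the endpoints analysed in the notebooks may differ in all of these.

```python
from urllib.parse import quote, urlencode


def sensorthings_url(base, top=10):
    """SensorThings API: entities are REST resources, paged with OData-style options."""
    # Assumes a v1.1 deployment; keep "$" unescaped so the option reads "$top".
    return f"{base}/v1.1/Observations?" + urlencode({"$top": top}, safe="$")


def erddap_url(base, dataset_id, variables, start_time):
    """ERDDAP tabledap: the file extension selects the output format,
    followed by a comma-separated variable list and '&'-separated constraints."""
    # ">" is percent-encoded as %3E; the time value is percent-encoded as well.
    constraint = "time%3E=" + quote(start_time, safe="")
    return f"{base}/tabledap/{dataset_id}.csv?{','.join(variables)}&{constraint}"


def sparql_url(endpoint, query):
    """SPARQL protocol: the query text is sent in the 'query' parameter of a GET request."""
    return f"{endpoint}?" + urlencode({"query": query})


if __name__ == "__main__":
    # Hypothetical endpoints, for illustration only.
    print(sensorthings_url("https://example.org/sta"))
    print(erddap_url("https://example.org/erddap", "demo_dataset",
                     ["time", "temperature"], "2023-01-01T00:00:00Z"))
    print(sparql_url("https://example.org/sparql",
                     "SELECT * WHERE { ?s ?p ?o } LIMIT 5"))
```

Even in this small sketch, each endpoint type needs its own conventions for paging, filtering and output format, before the returned file format is considered at all.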

4. Granularity of data offered by endpoint varies:
    - sometimes to file level, other times to observation level
    - in most cases, one can get to the observation/measurement level with additional steps (e.g. retrieval of data from within files)
    - ...

5. Data integration is possible but hindered by the required domain-specific knowledge & ambiguous semantics
    - this also makes mistakes more likely when combining data
    - ...
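
To make the integration risk concrete, here is a minimal sketch of combining records from two endpoints that use different field names and units for the same quantity. The records, field names and the Fahrenheit/Celsius difference are invented for illustration and were not observed in the analysed endpoints; `sea_water_temperature` is borrowed from the CF standard names as an example of an externally defined term.

```python
# Hypothetical records as two endpoints might return them.
endpoint_a = [{"TEMP": 12.4, "time": "2023-05-01T00:00:00Z"}]
endpoint_b = [{"sea_temp_degF": 54.3, "timestamp": "2023-05-01T00:00:00Z"}]


def identity(value):
    return value


def fahrenheit_to_celsius(value):
    return (value - 32.0) * 5.0 / 9.0


# Explicit mapping: local field name -> (shared term, converter to the shared unit).
# The shared unit is assumed to be degrees Celsius.
MAPPINGS = {
    "endpoint_a": {
        "TEMP": ("sea_water_temperature", identity),
        "time": ("time", identity),
    },
    "endpoint_b": {
        "sea_temp_degF": ("sea_water_temperature", fahrenheit_to_celsius),
        "timestamp": ("time", identity),
    },
}


def harmonise(record, mapping):
    """Rename fields to shared terms and convert values; fail loudly on unmapped fields."""
    harmonised = {}
    for field, value in record.items():
        if field not in mapping:
            raise KeyError(f"no mapping defined for field {field!r}")
        term, convert = mapping[field]
        harmonised[term] = convert(value)
    return harmonised


combined = (
    [harmonise(r, MAPPINGS["endpoint_a"]) for r in endpoint_a]
    + [harmonise(r, MAPPINGS["endpoint_b"]) for r in endpoint_b]
)
print(combined)
```

Without such a mapping, which today has to be written by someone with domain knowledge, the two temperature values would be silently combined in different units. Self-descriptive data would let the mapping be derived from the data itself.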

## Recommendations

Overall FAIRness & interoperability of services is good at a basic level. Data is findable, accessible & usable.

However, if data is to be used at a wider scale, one cannot assume domain-specific knowledge to be present. Without more self-descriptive data, analysis mistakes are very likely to occur, and hence data should become more self-descriptive.

Recommendations for improved self-descriptiveness of data are formulated at 2 levels:

1. Description of available services
    - describe the offered services (~ the endpoints analysed) via Linked Data (LD) ---> todo: provide an example!!
        - more of a quick fix,
        - to make it easier to find one's way around the available services & data, & to determine more quickly which service one can use given the in-house knowledge (e.g. having someone who can work with JSON APIs, S3 buckets, ...)
    - more use of external standard terms,
        - e.g. use of ORCID iDs, ROR IDs, URLs for observed parameters, ...
        - but it would be best if the chosen terms are aligned across stakeholders (see also the second recommendation)
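
One possible shape for such a Linked Data service description is sketched below as JSON-LD, using the DCAT `DataService` class. This is an assumption about what a description could look like, not an agreed profile: every URL and identifier is a hypothetical placeholder, and the choice of properties would need to be aligned with stakeholders.

```python
import json

# Hypothetical description of one service; all IRIs are placeholders.
service_description = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://example.org/services/erddap",
    "@type": "dcat:DataService",
    "dct:title": "Example ERDDAP server",
    # What kind of endpoint this is, so users can match it to in-house skills.
    "dct:conformsTo": {"@id": "https://example.org/specs/erddap"},
    "dcat:endpointURL": {"@id": "https://example.org/erddap"},
    "dcat:endpointDescription": {"@id": "https://example.org/erddap/docs"},
    # Publisher identified by an external identifier (placeholder, not a real ROR ID).
    "dct:publisher": {"@id": "https://ror.org/EXAMPLE"},
    "dcat:servesDataset": [
        {"@id": "https://example.org/datasets/demo_dataset"},
    ],
}

print(json.dumps(service_description, indent=2))
```

A catalogue of such descriptions, one per endpoint, would let a user filter services by type before investing in learning any of them.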


2. Common data model
    - develop a data model, with various stakeholders from the community, to better describe the offered data:
      agree on the common entities that are described (instruments, events, observations, ...), and with which properties they're described
    - this would make it easier to integrate data from different sources
    - and make this data available as Linked Open Data (LOD) in the future
        - one can get inspiration from the ARGO & ICOS data models in their respective SPARQL endpoints
        - other common ontologies:
            - prov (PROV-O)
            - dct (DCMI Metadata Terms)
            - ssn (Semantic Sensor Network ontology)
            - qb (RDF Data Cube vocabulary) --> very suitable for the description of netCDF files
            - dcat (Data Catalog Vocabulary)
            - ...

3. The overall complexity of the systems hindered quantitative analysis.

4. We'll do an analysis of the degree of alignment between data properties --> TODO: make lists of properties for each endpoint
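
Once the property lists exist, one simple way to quantify alignment would be pairwise Jaccard similarity between the property sets of the endpoints. The sketch below assumes exact name matching, and the endpoint names and property lists are invented; the real lists still have to be compiled.

```python
from itertools import combinations

# Hypothetical property lists per endpoint; to be replaced by the compiled lists.
properties = {
    "endpoint_a": {"time", "latitude", "longitude", "temperature"},
    "endpoint_b": {"time", "lat", "lon", "temperature", "salinity"},
    "endpoint_c": {"time", "latitude", "longitude", "salinity"},
}


def jaccard(a, b):
    """Share of properties that two endpoints have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_alignment(props):
    """Jaccard score for every unordered pair of endpoints."""
    return {
        (x, y): jaccard(props[x], props[y])
        for x, y in combinations(sorted(props), 2)
    }


scores = pairwise_alignment(properties)
for (x, y), score in scores.items():
    print(f"{x} vs {y}: {score:.2f}")
```

Exact name matching counts `lat` and `latitude` as different properties, so low scores partly reflect naming differences and not differences in content. That gap is itself a measure of how much shared, externally defined terms would help.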
