# DRAFT Analysis Report

## Context

FAIRness = the degree to which data follows the FAIR principles (Findable, Accessible, Interoperable, and Reusable). These principles aim to ensure that research data is easy to locate, retrieve, and integrate, and that it can be reused by others in future research.

Interoperability = the ability of different systems, services, or applications to work together seamlessly.

## Objective

- Identify a list of endpoints representing where the data is stored/catalogued,
- Harvest the endpoints where possible,
- Analyse the harvested data for the FAIR/interoperability work done so far,
- Report the results and translate them into recommendations.

## Analysis methods & results

A list of services, with their endpoint and documentation, was obtained and is available here:  
[confluence page](https://icos-ri.atlassian.net/wiki/spaces/GEOR/pages/372801570/Your+input+contribution+to+GEORGE+WP4+Task+1.+i+roadmap)  


Each of the endpoints listed there was assessed for FAIRness and interoperability.  
The assessment took the following key considerations into account:
- Standardization:  
Services that adhere to widely accepted standards (like SPARQL and JSON) are generally more interoperable.
- Data Format:  
Services that provide data in structured, machine-readable formats are more interoperable than those that rely on raw file sharing.
- Integration Complexity:  
The ease with which a service can be integrated with other systems affects its interoperability.  
More specifically: how easily the data can be accessed and its semantics understood.


These analyses can be consulted in the Jupyter notebooks.  

## Findings 

Overall, there is good FAIRness and interoperability at a basic level. Given the endpoints, the documentation and domain-specific knowledge, data is generally findable, accessible and usable.  

Interoperability is hindered by the need for domain-specific knowledge to access and use the data:  
- at the level of the type of endpoint and the data format:  
in order to access the data via a given endpoint, one must know how to navigate that type of endpoint, be it a JSON API, an ERDDAP server or a SPARQL endpoint.  
Additionally, one must know how to work with the file format in which the data is offered (in this project, most data is offered as netCDF, JSON, ...).
- at the level of the data model:  
in order to use the data, one must know what the data points represent, which requires knowledge of how the data is modelled.  
(This created a threshold for the quantitative analysis of the services. Note that I come from a data background and am not familiar with the domain, so I had to learn how to access and use various file formats (netCDF, ...), plus the steps in the other notebooks.)
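
To illustrate the first point, the sketch below builds a request URL for three of the endpoint types discussed in this report. It is only an illustration: all hosts, dataset IDs, variable names and the file path layout are hypothetical placeholders, not the endpoints analysed. The ERDDAP function assumes the usual `tabledap` URL pattern, and the SPARQL function assumes a query sent via HTTP GET as allowed by the SPARQL 1.1 Protocol.

```python
from urllib.parse import quote, urlencode

# NOTE: every host, dataset ID, variable name and path below is a
# hypothetical placeholder, not one of the endpoints analysed.


def erddap_tabledap_url(base, dataset_id, variables, constraints=(), file_type="csv"):
    """Pattern: <base>/tabledap/<datasetID>.<fileType>?<variables>&<constraints>."""
    query = ",".join(variables)
    for constraint in constraints:
        # Percent-encode operators such as '>' while keeping '=' readable.
        query += "&" + quote(constraint, safe="=")
    return f"{base}/tabledap/{dataset_id}.{file_type}?{query}"


def sparql_get_url(endpoint, query):
    """A SPARQL query can be sent via GET in the 'query' parameter."""
    return f"{endpoint}?{urlencode({'query': query})}"


def file_server_url(template, **parts):
    """File servers have no query interface; the path layout is server-specific."""
    return template.format(**parts)


erddap_url = erddap_tabledap_url(
    "https://example.org/erddap",
    "demo_dataset",
    ["time", "latitude", "longitude", "temperature"],
    ["time>=2020-01-01"],
)
sparql_url = sparql_get_url(
    "https://example.org/sparql", "SELECT * WHERE { ?s ?p ?o } LIMIT 5"
)
ftp_url = file_server_url(
    "ftp://example.org/data/{platform}/{year}/{platform}_{year}.nc",
    platform="buoy42",
    year=2021,
)

for url in (erddap_url, sparql_url, ftp_url):
    print(url)
```

Each function encodes a different piece of knowledge (URL pattern, query language, directory layout) that a user must have before any data is retrieved, which is the access-level threshold described above.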

todo: make a distinction between interoperability at the level of data access and at the level of the data themselves

The types/kinds of services can be ordered as follows: 

- File Server:
    - Interoperability: Low
    - 2 stars
    - Explanation: 
        - File servers primarily store and share files, in formats like text, CSV or NetCDF.  
        - Accessing data requires additional steps (e.g. constructing a URI from a template pattern to retrieve data via FTP).
        - While the data is accessible, the lack of structured, queryable interfaces means that other systems must download and process the files before they can use the data, which limits interoperability.


- ERDDAP Server:
    - Interoperability: Medium (~ possibly high if metadata contains links to e.g. BODC terms)
    - 3 stars
    - Explanation: 
        - An ERDDAP (Environmental Research Division's Data Access Program) server provides data mainly in NetCDF format.
        - Analysis of the alignment of properties across the ERDDAP servers shows a degree of similarity in naming, but the majority of properties appear to be unique to a single server.
        - Due to the limited use of persistent identifiers and standard terms, a great deal of semantic ambiguity remains, which limits interoperability.
        - ERDDAP servers offer more interoperability than a file server because they support a variety of output formats and standard protocols, allowing for easier integration with other systems. However, the different servers expose varying properties, sometimes with unclear semantics, which limits interoperability.
        


- JSON APIs:
    - Interoperability: High
    - 3 stars
    - Explanation: 
        - JSON APIs provide data in a structured format (JSON), which is widely used and easy to consume across various platforms and programming languages. 
        - The API design and data models follow standard protocols (e.g. SensorThings protocol, Swagger API documentation), which facilitates integration of the data into other applications, mobile apps and systems.
        - There is some occurrence of URLs and standard terms, but the overall limited use of persistent identifiers hinders interoperability.


- SPARQL Endpoint:
    - Interoperability: Very High
    - 4 stars
    - Explanation: 
        - A SPARQL endpoint provides a way to query RDF (Resource Description Framework) data using the SPARQL query language. RDF is a standard model for data interchange on the web, and SPARQL is a W3C-standardized query language. 
        - This combination allows for highly interoperable data sharing across different systems, particularly in the context of linked data and the Semantic Web. 
        - SPARQL endpoints are particularly powerful in environments where data integration from multiple sources is required.
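
The property alignment mentioned for the ERDDAP servers can be quantified in several ways. The sketch below shows one simple option, assuming a Jaccard overlap on normalised property names; it is not necessarily the method used in the notebooks, and the server names and property lists are made-up placeholders.

```python
from itertools import combinations


def normalise(name):
    """Lower-case and drop separators so 'Sea_Water_Temp' matches 'seawatertemp'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def normalise_all(properties_by_server):
    """Map each server to the set of its normalised property names."""
    return {
        server: {normalise(p) for p in props}
        for server, props in properties_by_server.items()
    }


def jaccard(a, b):
    """Share of names common to both sets, relative to all names in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_alignment(properties_by_server):
    """Jaccard overlap for every pair of servers."""
    names = normalise_all(properties_by_server)
    return {
        (s1, s2): jaccard(names[s1], names[s2])
        for s1, s2 in combinations(sorted(names), 2)
    }


def unique_properties(properties_by_server):
    """Properties that appear on exactly one server."""
    names = normalise_all(properties_by_server)
    result = {}
    for server, props in names.items():
        others = set().union(*(p for s, p in names.items() if s != server))
        result[server] = props - others
    return result


# Hypothetical property lists, for illustration only.
example = {
    "server_a": ["time", "latitude", "longitude", "TEMP", "PSAL"],
    "server_b": ["time", "latitude", "longitude", "sea_water_temperature", "salinity"],
}

print(pairwise_alignment(example))
print(unique_properties(example))
```

Name matching alone cannot tell that `TEMP` and `sea_water_temperature` may denote the same quantity. That gap is exactly what persistent identifiers and shared vocabulary terms would close.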

Explanation of star data (see [link](https://5stardata.info/en/)) ...  


However, there is still room for improvement (maybe move this to recommendations):

- unambiguous semantics: use of codes, and more (and more consistent) use of externally defined terms
- alignment of the data model structure across endpoints: currently the internal data structure is exposed via the endpoints. Good practice is seen in the ICOS SPARQL endpoint, which offers a DCAT description. The domain of ocean observation would benefit from a common data model structure to expose data, given that the data is similar in kind/type/nature across endpoints.


## Recommendations

Overall, the FAIRness and interoperability of the services is good at a basic level. Data is findable, accessible and usable.

However, if the data is to be used at a wider scale, one cannot assume that domain-specific knowledge is present, so the data should become more self-descriptive.  
Without more self-descriptive data, analysis mistakes are very likely to occur.

Recommendations for improved self-descriptiveness of the data are formulated at two levels:

1. Description of available services
    - provide a Linked Data (LD) description of the offered services (~ the endpoints analysed) ---> todo: provide an example!!
        - more of a quick fix
        - helps users find their way around the available services and data, and more quickly determine which service they can use given in-house knowledge (e.g. having someone who can work with a JSON API, S3 buckets, ...)  
        - common ontologies/standards used: [schema](https://schema.org/), [dcat](https://www.w3.org/TR/vocab-dcat-3/)  
    - more use of external standard terms
        - e.g. use of ORCID, ROR ID, URLs for observed parameters, [BODC standard vocabularies](https://vocab.nerc.ac.uk), ...
        - ideally the chosen terms are aligned across stakeholders (see also the second recommendation)


2. Common data model
    - develop a data model, together with various stakeholders from the community, to better describe the offered data
    - agree on the common entities to be described (instruments, events, observations, ...), and on the URIs and properties with which they are described
    - elevate the data to levels 4 and 5 of [5 star data](https://5stardata.info/en/) (developing this model community-wide would make it easier to point to other data and to have others point at yours, since the URIs will be commonly known)
    - this would make it easier to integrate data from different sources
    - and to make this data available as Linked Open Data (LOD) in the future
        - one can get inspiration from the ARGO & ICOS data models in their respective SPARQL endpoints
        - other common ontologies:
            - [prov](https://www.w3.org/TR/prov-o/)
            - [dct](https://www.dublincore.org/specifications/dublin-core/dcmi-terms/)
            - [ssn](https://www.w3.org/TR/vocab-ssn/)
            - [data cube](https://www.w3.org/TR/vocab-data-cube/) --> very suitable for the description of NetCDF files
            - [dcat](https://www.w3.org/TR/vocab-dcat-3/)
            - ...
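
As a possible starting point for the service description in recommendation 1, the sketch below builds a minimal JSON-LD document that describes one endpoint as a `dcat:DataService`. It is an illustration under assumptions: the service URI, title, endpoint URL and documentation URL are hypothetical, and the selection of properties (`dct:title`, `dcat:endpointURL`, `dcat:endpointDescription`, `dct:conformsTo`) is one reasonable minimum rather than an agreed profile.

```python
import json


def describe_service(service_uri, title, endpoint_url, description_url, conforms_to):
    """Return a minimal JSON-LD description of a service as a dcat:DataService."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": service_uri,
        "@type": "dcat:DataService",
        "dct:title": title,
        # Where the service can be called.
        "dcat:endpointURL": {"@id": endpoint_url},
        # Where its documentation (e.g. an API description) can be found.
        "dcat:endpointDescription": {"@id": description_url},
        # The standard the service implements, identified by a URI.
        "dct:conformsTo": {"@id": conforms_to},
    }


# Hypothetical service, for illustration only.
service = describe_service(
    service_uri="https://example.org/services/sparql",
    title="Example SPARQL endpoint",
    endpoint_url="https://example.org/sparql",
    description_url="https://example.org/docs/sparql",
    conforms_to="https://www.w3.org/TR/sparql11-protocol/",
)

print(json.dumps(service, indent=2))
```

Publishing one such description per endpoint would let a newcomer see at a glance which kind of service is offered and which standard it follows, before having to learn the endpoint itself.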


4. We'll do an analysis of the degree of alignment between data properties
~ identify & describe the steps needed for interoperability --> what is possible, what is possible with extra work & what is not possible

concrete example for data integration:
a ship at a location --> wants to know whether to go from location A to B,
depending on the value of parameter X at one location & at another location

4.B **documenting the self-descriptiveness of the endpoints** in each notebook (being RDF, other formats that are available, ...)

5. concrete steps to get to levels 4 & 5 ~ create an interoperability roadmap

6. how this can benefit the trails / steps needed .. (how this can help partners in the wp)

(note: emphasise the examples in the notebooks where the bottlenecks are)
