# Are Data Objects Referenced from Publications FAIR?

A list of data objects referenced from a set of publications is checked to determine their FAIRness.

Rather than referring only to data, these tests are applied to **data objects**: any data published to complement the publication. This includes raw data, supplementary data, processing data, tables, images, movies, and compilations containing one or more of such resources.

The tests to be performed are aimed at finding out if the data objects are: 
 - **F**indable:  can the object be found easily?
 - **A**ccessible: can the object be retrieved?
 - **I**nteroperable: can the object be accessed programmatically to extract data and metadata?
 - **R**eusable: can the object be readily used?

In [1]:
# library containing read and write functions for csv files
import lib.handle_csv as csvh

# managing files and file paths
from pathlib import Path

# library for handling URL searches
import lib.handle_urls as urlh

# add a progress bar (tqdm.tqdm_notebook is deprecated; import from tqdm.notebook)
from tqdm.notebook import tqdm as tqdm_notebook

# library for accessing system functions
import os

# import custom functions (common to various notebooks)
import processing_functions as pr_fns

## Findable
Most of the data objects are assumed to be findable because we were able to find links to them. However, some are references to other pages, instructions to contact the authors, or pointers to repositories that do not identify a specific record.

### Findability Score
A findability score was calculated for each data object as follows: assign 5 points if the object is referenced directly from the publication web page and further details about it (name, type and size) can be recovered just by accessing that reference.

Points are then deducted from this top score if, to find the referenced object:
- Special access to the publication is needed (download the pdf, get a password or token for mining the publication, other blocks) \[-1 point\]
- Human access to the publication online is required (there is no metadata or clear pattern to identify a reference on the publication landing page or the pdf version redirects to the article). \[-1 point\]
- Recovering the reference object details (name, type and size) requires more than a single query. \[-1 point\]
- The reference is wrong (broken link). \[-2 points\]
- The reference points to contact the authors or lookup a data repository without an ID. \[-4 points\]
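
The deduction rules above can be sketched as a small function. This is only an illustrative sketch: the function name `findability_score` and its boolean parameters are hypothetical and do not appear in the notebook, and clamping the result at 0 is an assumption, since the text does not say what happens when deductions exceed 5 points.

```python
def findability_score(needs_special_access=False,
                      needs_human_access=False,
                      needs_extra_queries=False,
                      broken_link=False,
                      no_specific_record=False):
    """Hypothetical helper: start at 5 and apply the listed deductions."""
    score = 5
    if needs_special_access:   # pdf download, password, token, other blocks
        score -= 1
    if needs_human_access:     # no metadata or clear pattern on the landing page
        score -= 1
    if needs_extra_queries:    # name, type and size need more than one query
        score -= 1
    if broken_link:            # the reference is wrong
        score -= 2
    if no_specific_record:     # contact the authors, or a repository without an ID
        score -= 4
    # Assumption: scores are not allowed to go below zero.
    return max(score, 0)
```

For example, a reference that is only recoverable from the pdf and whose link is broken would be scored with `findability_score(needs_special_access=True, broken_link=True)`.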

In [4]:
# get names and links for references in data mentions
data_reference, _ = csvh.get_csv_data('pub_data_fairness.csv', 'num')

for dr in tqdm_notebook(data_reference):
    if data_reference[dr]['ret_code'] == "" and data_reference[dr]['f_score'] == "":
        # try to get data object details from reference
        print("Article Link: https://doi.org/" + data_reference[dr]['doi'])
        ref_name = data_reference[dr]['name']
        ref_link = data_reference[dr]['data_url']
        print("Search for: Data Name:", ref_name, "data link:", ref_link)
        head = urlh.getPageHeader(ref_link)
        if head is not None:
            data_reference[dr]['ret_code'] = head.status_code 
            data_reference[dr]['resoruce_name'] = os.path.basename(head.url)
            if head.status_code == 200:
                #print (head.headers, head.url)
                if 'content-type' in head.headers.keys():
                    data_reference[dr]['ref_content'] = head.headers['content-type']
                if 'content-length' in head.headers.keys():
                    data_reference[dr]['ref_size'] = head.headers['content-length']
                data_reference[dr]['ref_redirect'] = head.url
            elif head.status_code == 302 or head.status_code == 301:
                #print(head, head.headers)
                data_reference[dr]['ref_redirect'] = head.headers['location']
                data_reference[dr]['resoruce_name'] = os.path.basename(head.headers['location'])
            else:
                print(head, head.headers)
        else:
            data_reference[dr]['f_score'] = 1
    elif data_reference[dr]['f_score'] == "":
        data_reference[dr]['f_score'] = 5
        #print ("start ", dr, data_reference[dr]['f_score'])
        if data_reference[dr]['html_mined'] == 'FALSE' and data_reference[dr]['pdf_mined'] == 'TRUE':
            data_reference[dr]['f_score'] -= 1 # the publication page is not accessible directly to get the DO 
            #print ("deduct pdf mined", dr, data_reference[dr]['f_score'])
        if data_reference[dr]['html_mined'] == 'FALSE' and data_reference[dr]['user_mined'] == 'TRUE':
            data_reference[dr]['f_score'] -= 1 # a human user needed to access the resource
            #print ("deduct manually mined", dr, data_reference[dr]['f_score'])
        if data_reference[dr]['ret_code'] in ['0','404']:
            data_reference[dr]['f_score'] -= 2 # there is a problem with the link
            #print ("deduct page not found", dr, data_reference[dr]['f_score'])
        elif data_reference[dr]['ret_code'] != '200' and 'doi.org' not in data_reference[dr]['data_url'].lower():
            # dois always redirect
            data_reference[dr]['f_score'] -= 1 # there is some form of redirect to get to the  object
            #print ("deduct page redirect", dr, data_reference[dr]['f_score'], data_reference[dr]['ret_code'])

        





In [5]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, 'pub_data_fairness.csv')

## Accessible
Having an identifier and a link does not guarantee access. Resources may be behind walls (login, redirections, emailing the owner to get them, or similar). This translates to the question: can we get the resource? Again, this is not a yes-or-no question. Getting the resource means having it at one's disposal.

### Accessibility score
An accessibility score was calculated for each data object as follows: assign 5 points if the reference allows direct download of the object just by accessing it. Points are then deducted from this top score for each additional step needed to obtain the referenced object:

- Special access to the publication is needed (get a password or token for mining the publication, or similar access blocks) [-1 point]
- Human access to the publication online is required (there is no metadata or clear pattern to identify a reference on the publication landing page or the pdf version redirects to the article). [-1 point]
- Recovering the reference object details (name, type and size) requires more than a single query. [-1 point]
- The reference is wrong (broken link). [-2 points]
- The reference points to contact the authors or lookup a data repository without an ID. [-4 points]
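
One way to turn the download attempt into a score is to compare what the reference advertised (content type and size from the header request) with what was actually retrieved. The sketch below assumes a record dictionary using this notebook's CSV column names (`ref_content`, `ref_size`, `do_type`, `do_size`). The function `accessibility_score` itself is hypothetical, and it covers only the type and size comparison, not the publication-access deductions listed above.

```python
def accessibility_score(record, downloaded=True):
    """Hypothetical helper: score a record after a download attempt."""
    # Nothing could be retrieved: the object is not accessible at all.
    if not downloaded:
        return 0
    score = 5
    # The retrieved type differs from the type announced by the reference.
    if record.get('do_type') != record.get('ref_content'):
        score -= 1
    # The retrieved size differs from the size announced by the reference.
    if 'do_size' in record and record['do_size'] != record.get('ref_size'):
        score -= 1
    return score
```

Values read back from a CSV file are strings, so sizes are compared as strings here; converting both to integers first would be a reasonable refinement.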


In [6]:
# get names and links for data object references
data_reference, _ = csvh.get_csv_data('pub_data_fairness.csv', 'num')
for dr in tqdm_notebook(data_reference):
    # if the data object has not been recovered before
    if data_reference[dr]['a_score'] == "":
        # try to get data object from reference
        print("Article Link: https://doi.org/" + data_reference[dr]['doi'])
        ref_name = data_reference[dr]['name']
        ref_link = data_reference[dr]['data_url']
        print("Search for: Data Name:", ref_name, "data link:", ref_link)
        
        if 'doi.org' in ref_link.lower():
            data_object = urlh.getObjectMetadata(ref_link)
            print(data_object)
            data_reference[dr]['got_object'] = True 
            data_reference[dr]['do_id'] = data_object['resource_url'] # assume url is the identifier for object
            data_reference[dr]['do_type'] = data_object['type'] # should match type in ref_content
            data_reference[dr]['do_metadata'] = data_object['metadata'] 
        else:
            data_object = urlh.getObject(ref_link)
            if data_object != {}:
                data_reference[dr]['got_object'] = True 
                data_reference[dr]['do_id'] = data_object['resource_url'] # assume url is the identifier for object
                data_reference[dr]['do_type'] = data_object['type'] # should match type in ref_content
                data_reference[dr]['do_size'] = data_object['size'] # should match size in ref_size
                data_reference[dr]['do_file'] = data_object['file_name'] 
            else:
                # score is 0 if the data cannot be downloaded
                data_reference[dr]['a_score'] = 0
    if data_reference[dr]['a_score'] != 0:
        data_reference[dr]['a_score'] = 5
        # type of object is different from availability check
        if data_reference[dr]['do_type'] != data_reference[dr]['ref_content']:
            data_reference[dr]['a_score'] -= 1
        # size of object is different from availability check
        if 'do_size' in data_reference[dr].keys() and data_reference[dr]['do_size'] != data_reference[dr]['ref_size']:
            data_reference[dr]['a_score'] -= 1
        # the file should exist and contain data of the specified type
        # (the file name is stored under 'do_file'; skip records without one)
        do_file = data_reference[dr].get('do_file')
        if do_file and not Path(do_file).is_file():
            data_reference[dr]['a_score'] = 0
    # debugging limit: stop after the record with key 3 (assumes integer keys)
    if dr == 3:
        break


Article Link: https://doi.org/10.1038/s41929-019-0334-3
Search for: Data Name: https://doi.org/10.17035/d.2019.0079744472 data link: https://doi.org/10.17035/d.2019.0079744472
trying to recover object from https://doi.org/10.17035/d.2019.0079744472
got something back
resource url https://data.crosscite.org/10.17035%2Fd.2019.0079744472
{'resource_url': 'https://data.crosscite.org/10.17035%2Fd.2019.0079744472', 'type': 'application/vnd.citationstyles.csl+json; charset=utf-8', 'metadata': {'type': 'article', 'id': 'https://doi.org/10.17035/d.2019.0079744472', 'categories': ['X-ray Photoelectron Spectroscopy (XPS)', 'Near Infrared Spectroscopy', 'Scanning Electron Microscopy', 'EXAFS', 'Gas Chromatography'], 'language': 'en', 'author': [{'family': 'MacIno', 'given': 'Margherita'}, {'family': 'Barnes', 'given': 'Alexandra J'}, {'family': 'Althahban', 'given': 'Sultan M'}, {'family': 'Qu', 'given': 'Ruiyang'}, {'family': 'Gibson', 'given': 'Emma K'}, {'family': 'Freakley', 'given': 'Simon J'

In [7]:
import requests
import json

req_head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
req_head['Accept'] = 'application/x-bibtex'
url_text = 'https://doi.org/10.17035/d.2019.0079744472'
response = requests.get(url_text, headers = req_head)
print('*************BibTex*******************')
print(response.content.decode())
#req_head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
print('*************Research Info Systems (RIS)*******************')
req_head['Accept'] = 'application/x-research-info-systems'
response = requests.get(url_text, headers = req_head)
print(response.content.decode())
print('*************VND Citation Styles CSL*******************')
req_head['Accept'] = 'application/vnd.citationstyles.csl+json'
response = requests.get(url_text, headers = req_head)
contents = response.content
contents_json = json.loads(contents.decode())
#contents_str = contents.decode('utf-8') 
print(json.dumps(contents_json, indent=4, sort_keys=True))

*************BibTex*******************
@misc{https://doi.org/10.17035/d.2019.0079744472,
  doi = {10.17035/D.2019.0079744472},
  url = {https://research.cardiff.ac.uk/converis/portal/detail/Dataset/79744472?auxfun=&lang=en_GB},
  author = {MacIno, Margherita and Barnes, Alexandra J and Althahban, Sultan M and Qu, Ruiyang and Gibson, Emma K and Freakley, Simon J and Dimitratos, Nikolaos and Kiely, Christopher J and Gao, Xiang and Beale, Andrew M and Bethell, Donald and He, Qian and Sankar, Meenakshisundaram and Hutchings, Graham J},
  keywords = {X-ray Photoelectron Spectroscopy (XPS), Near Infrared Spectroscopy, Scanning Electron Microscopy, EXAFS, Gas Chromatography},
  language = {en},
  title = {Tuning of catalytic sites in Pt/TiO2 catalysts for chemoselective hydrogenation of 3-nitrostyrene},
  publisher = {Cardiff University},
  year = {2019}
}

*************Research Info Systems (RIS)*******************
TY  - GEN
T1  - Tuning of catalytic sites in Pt/TiO2 catalysts for chemoselec

In [8]:
contents.decode()

'{\n  "type": "article",\n  "id": "https://doi.org/10.17035/d.2019.0079744472",\n  "categories": [\n    "X-ray Photoelectron Spectroscopy (XPS)",\n    "Near Infrared Spectroscopy",\n    "Scanning Electron Microscopy",\n    "EXAFS",\n    "Gas Chromatography"\n  ],\n  "language": "en",\n  "author": [\n    {\n      "family": "MacIno",\n      "given": "Margherita"\n    },\n    {\n      "family": "Barnes",\n      "given": "Alexandra J"\n    },\n    {\n      "family": "Althahban",\n      "given": "Sultan M"\n    },\n    {\n      "family": "Qu",\n      "given": "Ruiyang"\n    },\n    {\n      "family": "Gibson",\n      "given": "Emma K"\n    },\n    {\n      "family": "Freakley",\n      "given": "Simon J"\n    },\n    {\n      "family": "Dimitratos",\n      "given": "Nikolaos"\n    },\n    {\n      "family": "Kiely",\n      "given": "Christopher J"\n    },\n    {\n      "family": "Gao",\n      "given": "Xiang"\n    },\n    {\n      "family": "Beale",\n      "given": "Andrew M"\n    },\n    {\

In [9]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, 'pub_data_fairness.csv')

In [10]:
str(b'a string').encode().decode()
str("'a string'")

"'a string'"

## Interoperable
Access to a resource does not guarantee interoperability. A resource is interoperable if its data is stored in a format that is easy for both humans and machines to interpret. An object in an open format is therefore more interoperable than an object in a proprietary format.

### Interoperability score
The interoperability score is defined along the lines of [5 Star Open Data](https://5stardata.info/en/). The definition is relaxed by omitting the requirement to publish under an open license. The scoring is as follows:

- 1 if the data object is available on the Web (in whatever format).
- 2 if the data object is available as structured data (e.g., Excel instead of an image scan of a table).
- 3 if the data object is available in a non-proprietary open format (e.g., CSV instead of Excel).
- 4 if URIs are used to denote things, so that people can point at them.
- 5 if the data object is linked to other data to provide context.
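
The scale above can be sketched as a function that counts how many consecutive levels an object satisfies. The name `interoperability_score` and its boolean parameters are hypothetical. Treating the levels as cumulative (a level only counts if all lower levels are met) is an assumption based on how the 5 Star Open Data scheme is usually read, not something this notebook states.

```python
def interoperability_score(on_web, structured, open_format, uses_uris, linked):
    """Hypothetical helper: count consecutive satisfied levels, from 0 to 5."""
    levels = [on_web, structured, open_format, uses_uris, linked]
    score = 0
    for met in levels:
        # Assumption: levels are cumulative, so stop at the first unmet level.
        if not met:
            break
        score += 1
    return score
```

Under this reading, an Excel file published on the Web scores 2 (available and structured, but in a proprietary format), even if it is also identified by a URI.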


## Reusable
Finding, retrieving and interpreting an object is not all there is. For the resource to be reusable it needs to be a) licensed for use and b) in an appropriate format to guarantee long-term support (related to level 3 of the interoperability score).

