# Are Data Objects referenced from publications FAIR?

A list of data objects referenced from a set of publications is checked to determine their FAIRness.

Rather than looking only at references to data, these tests are applied to **data objects**: any data published to complement the publication. This includes raw data, supplementary data, processing data, tables, images, movies, and compilations containing one or more of such resources.

The tests aim to determine whether the data objects are:
 - **F**indable:  can the object be found easily?
 - **A**ccessible: can the object be retrieved?
 - **I**nteroperable: can the object be accessed programmatically to extract data and metadata?
 - **R**eusable: can the object be readily used?

In [3]:
# library containing read and write functions for csv files
import lib.handle_csv as csvh

# managing files and file paths
from pathlib import Path

# library for handling url searches
import lib.handle_urls as urlh

# add a progress bar (tqdm.tqdm_notebook is deprecated in favour of tqdm.notebook.tqdm)
from tqdm.notebook import tqdm as tqdm_notebook
    
# library for accessing system functions
import os

# import custom functions (common to various notebooks)
import processing_functions as pr_fns

## Findable
Most of the data objects are assumed to be findable because we were able to find links to them. However, some references point to other pages, ask the reader to contact the authors, or point to repositories without identifying a specific record.

### Findability Score
A findability score was calculated for each data object as follows: assign 5 points if the object is referenced directly from the publication web page and further details about it (name, type and size) can be recovered just by accessing that reference.

Points are then deducted from the top score if, to find the referenced object:
- Special access to the publication is needed (download the pdf, get a password or token for mining the publication, other blocks) \[-1 point\]
- Human access to the publication online is required (there is no metadata or clear pattern to identify a reference on the publication landing page or the pdf version redirects to the article). \[-1 point\]
- Recovering the reference object details (name, type and size) requires more than a single query. \[-1 point\]
- The reference is wrong (broken link). \[-2 points\]
- The reference points to contact the authors or lookup a data repository without an ID. \[-4 points\]
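
The deductions above can be sketched as a small function. This is only an illustration: the function name, its boolean parameters, and the clamping of the result at 0 are assumptions, not part of the notebook's code.

```python
def findability_score(special_access=False, human_access=False,
                      extra_queries=False, broken_link=False,
                      no_direct_reference=False):
    """Start from 5 points and subtract the deductions listed above."""
    score = 5
    if special_access:        # pdf download, password, token or similar block
        score -= 1
    if human_access:          # a person had to read the publication online
        score -= 1
    if extra_queries:         # name, type and size need more than one query
        score -= 1
    if broken_link:           # the reference is wrong
        score -= 2
    if no_direct_reference:   # "contact the authors" or repository without an ID
        score -= 4
    # assumption: the score is not allowed to go below 0
    return max(score, 0)


print(findability_score())                                       # 5
print(findability_score(special_access=True, broken_link=True))  # 2
print(findability_score(no_direct_reference=True))               # 1
```

With these weights, an object that can only be reached by contacting the authors scores 1, and combining several deductions can bring the score down to 0.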

In [None]:
# get names and links for references in data mentions
data_reference, _ = csvh.get_csv_data('pub_data_fairness.csv', 'num')

for dr in tqdm_notebook(data_reference):
    if data_reference[dr]['ret_code'] == "" and data_reference[dr]['f_score'] == "":
        # try to get data object details from reference
        print("Article Link: https://doi.org/" + data_reference[dr]['doi'])
        ref_name = data_reference[dr]['name']
        ref_link = data_reference[dr]['data_url']
        print("Search for: Data Name:", ref_name, "data link:", ref_link)
        head = urlh.getPageHeader(ref_link)
        if head is not None:
            data_reference[dr]['ret_code'] = head.status_code 
            data_reference[dr]['resoruce_name'] = os.path.basename(head.url)
            if head.status_code == 200:
                #print (head.headers, head.url)
                if 'content-type' in head.headers.keys():
                    data_reference[dr]['ref_content'] = head.headers['content-type']
                if 'content-length' in head.headers.keys():
                    data_reference[dr]['ref_size'] = head.headers['content-length']
                data_reference[dr]['ref_redirect'] = head.url
            elif head.status_code == 302 or head.status_code == 301:
                #print(head, head.headers)
                data_reference[dr]['ref_redirect'] = head.headers['location']
                data_reference[dr]['resoruce_name'] = os.path.basename(head.headers['location'])
            else:
                print(head, head.headers)
        else:
            data_reference[dr]['f_score'] = 1
    elif data_reference[dr]['f_score'] == "":
        data_reference[dr]['f_score'] = 5
        #print ("start ", dr, data_reference[dr]['f_score'])
        if data_reference[dr]['html_mined'] == 'FALSE' and data_reference[dr]['pdf_mined'] == 'TRUE':
            data_reference[dr]['f_score'] -= 1 # the publication page is not accessible directly to get the DO 
            #print ("deduct pdf mined", dr, data_reference[dr]['f_score'])
        if data_reference[dr]['html_mined'] == 'FALSE' and data_reference[dr]['user_mined'] == 'TRUE':
            data_reference[dr]['f_score'] -= 1 # a human user needed to access the resource
            #print ("deduct manually mined", dr, data_reference[dr]['f_score'])
        if data_reference[dr]['ret_code'] in ['0','404']:
            data_reference[dr]['f_score'] -= 2 # there is a problem with the link
            #print ("deduct page not found", dr, data_reference[dr]['f_score'])
        elif data_reference[dr]['ret_code'] != '200' and 'doi.org' not in data_reference[dr]['data_url'].lower():
            # dois always redirect
            data_reference[dr]['f_score'] -= 1 # there is some form of redirect to get to the  object
            #print ("deduct page redirect", dr, data_reference[dr]['f_score'], data_reference[dr]['ret_code'])

        

In [None]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, 'pub_data_fairness.csv')

## Accessible
Having an identifier and a link does not guarantee access. Resources may be behind walls (login, redirections, a request to email the owner, or similar). This translates to the question: can we get the resource? Again, this is not a yes-or-no question. Getting the resource means that the resource ends up at one's disposal.

### Accessibility score
An accessibility score was calculated for each data object as follows: 5 points if the reference allows direct download of the object just by accessing it. Points are then deducted from the top score for each additional step if, to obtain the referenced object:

- Special access to the publication is needed (get a password or token for mining the publication, or similar access blocks) [-1 point]
- Human access to the publication online is required (there is no metadata or clear pattern to identify a reference on the publication landing page or the pdf version redirects to the article). [-1 point]
- Recovering the reference object details (name, type and size) requires more than a single query. [-1 point]
- The reference is wrong (broken link). [-2 points]
- The reference points to contact the authors or lookup a data repository without an ID. [-4 points]
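
As a rough sketch, an accessibility score can also be derived from the outcome of the download itself. The function below starts at 5, returns 0 when nothing could be downloaded or the downloaded file is missing, and deducts a point each when the type or size of the downloaded object differs from what the header check reported. The function name and parameters are hypothetical; the checks follow the ones in this notebook's accessibility code cell.

```python
from pathlib import Path


def accessibility_score(downloaded, ref_type="", obj_type="",
                        ref_size="", obj_size="", file_name=""):
    """Score a data object from the result of a download attempt."""
    if not downloaded:
        return 0                      # the object could not be retrieved
    score = 5
    if obj_type and obj_type != ref_type:
        score -= 1                    # type differs from the header check
    if obj_size and str(obj_size) != str(ref_size):
        score -= 1                    # size differs from the header check
    if file_name and not Path(file_name).is_file():
        return 0                      # the downloaded file is not on disk
    return score


print(accessibility_score(False))                                     # 0
print(accessibility_score(True, "text/csv", "text/csv", "10", 10))    # 5
print(accessibility_score(True, "text/html", "text/csv", "10", "12")) # 3
```

Sizes are compared as strings here because header values arrive as text while a measured size is usually an integer; that normalisation is a design choice of this sketch.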


In [None]:
# get names and links for data object references
data_reference, _ = csvh.get_csv_data('pub_data_fairness.csv', 'num')
for dr in tqdm_notebook(data_reference):
    # if data objects has not been recovered before
    if data_reference[dr]['a_score'] == "":
        # try to get data object from reference
        print("Article Link: https://doi.org/" + data_reference[dr]['doi'])
        ref_name = data_reference[dr]['name']
        ref_link = data_reference[dr]['data_url']
        if data_reference[dr]['correct_url'] != "":
            ref_link = data_reference[dr]['correct_url']
        print("Search for: Data Name:", ref_name, "data link:", ref_link)
        
        if 'doi.org' in ref_link.lower():
            data_object = urlh.getObjectMetadata(ref_link)
            print(data_object)
            if data_object != {}:
                data_reference[dr]['got_object'] = True 
                data_reference[dr]['do_id'] = data_object['resource_url'] # assume url is the identifier for object
                data_reference[dr]['do_type'] = data_object['type'] # should match type in ref_content
                data_reference[dr]['do_metadata'] = data_object['metadata'] 
        else:
            data_object = urlh.getObject(ref_link)
            if data_object != {}:
                data_reference[dr]['got_object'] = True 
                data_reference[dr]['do_id'] = data_object['resource_url'] # assume url is the identifier for object
                data_reference[dr]['do_type'] = data_object['type'] # should match type in ref_content
                if 'size' in data_object.keys():
                    data_reference[dr]['do_size'] = data_object['size'] # should match size in ref_size
                data_reference[dr]['do_file'] = data_object['file_name'] 
            else:
                # score is 0 if the data cannot be downloaded
                data_reference[dr]['a_score'] = 0
    # scores read back from the CSV are strings, so compare as text to keep zero scores
    if str(data_reference[dr]['a_score']) != '0':
        data_reference[dr]['a_score'] = 5
        # type of object is different from availability check
        if 'do_type' in data_reference[dr].keys() and data_reference[dr]['do_type'] != data_reference[dr]['ref_content']:
            data_reference[dr]['a_score'] -= 1
        # size of object is different from availability check
        if 'do_size' in data_reference[dr].keys() and data_reference[dr]['do_size'] != data_reference[dr]['ref_size']:
            data_reference[dr]['a_score'] -= 1
        # the downloaded file (stored under 'do_file') should exist
        if data_reference[dr].get('do_file') and not Path(data_reference[dr]['do_file']).is_file():
            data_reference[dr]['a_score'] = 0


In [None]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, 'pub_data_fairness.csv')
    

## Interoperable
Access to a resource does not guarantee interoperability. A resource is interoperable if the data is stored in a format that is easy for both humans and machines to interpret, so an object in an open format is more interoperable than an object in a proprietary format.

### Interoperability score
The interoperability score is defined along the lines of [5 Star Open Data](https://5stardata.info/en/), using the first three levels. The definition is relaxed by omitting the requirement to publish under an open license for interoperability (it is used below for reusability). The maximum score is 3, assigned as follows:

- 1 if the data object is available on the Web (whatever format). 
- 2 if the data object is available as structured data (e.g., Excel instead of image scan of a table)
- 3 if the data object is available in a non-proprietary open format (e.g., CSV instead of Excel)
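
One way to assign these levels is to look at the file extension. The sketch below compares the final suffix of the file name against three sets of extensions; the function name and the choice to score unknown or missing extensions as 0 are assumptions made for illustration.

```python
from pathlib import Path

# extensions grouped by level; the grouping mirrors the lists used in this notebook
LEVEL_1 = {'.doc', '.pdf', '.tif', '.jpg', '.docx', '.ppt', '.pptx',
           '.mp4', '.mpg', '.htm', '.html', '.zip', '.avi'}
LEVEL_2 = {'.xls', '.xlsx'}
LEVEL_3 = {'.csv', '.cif'}


def interoperability_score(file_name):
    """Return 1, 2 or 3 according to the file's final extension."""
    suffix = Path(file_name).suffix.lower()
    if suffix in LEVEL_3:
        return 3
    if suffix in LEVEL_2:
        return 2
    if suffix in LEVEL_1:
        return 1
    return 0  # assumption: unknown or missing extension scores 0


print(interoperability_score("table_s1.XLSX"))   # 2
print(interoperability_score("structure.cif"))   # 3
print(interoperability_score("data.csv.zip"))    # 1
```

Comparing the suffix rather than searching for the extension anywhere in the name avoids false matches: `data.csv.zip` is scored as a zip archive, not as a CSV file.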

In [None]:
# get names and links for data object references
data_reference, _ = csvh.get_csv_data('pub_data_fairness.csv', 'num')
# check if files actually exist, and then assess if they are of type 1, 2, or 3
level_1_types = ['.doc','.pdf','.tif','.jpg', '.docx', '.ppt', '.pptx', '.mp4', '.mpg', '.htm', '.html', '.zip','.avi']
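# include the leading dot so that, e.g., 'cif' inside a file name is not mistaken for an extension
level_2_types = ['.xls','.xlsx']
level_3_types = ['.csv','.cif']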
for dr in tqdm_notebook(data_reference):
    do_file_name = data_reference[dr]['do_file']
    if do_file_name != "" and not Path(do_file_name).is_file():
        data_reference[dr]['file_missing'] = 'TRUE'
    elif do_file_name != "":
        data_reference[dr]['file_missing'] = 'FALSE'
        data_reference[dr]['file_size'] = Path(do_file_name).stat().st_size
    for lv1_type in level_1_types:
        if lv1_type in do_file_name:
            data_reference[dr]['i_score'] = 1
    for lv2_type in level_2_types:
        if lv2_type in do_file_name:
            data_reference[dr]['i_score'] = 2
    for lv3_type in level_3_types:
        if lv3_type in do_file_name:
            data_reference[dr]['i_score'] = 3

            

In [8]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, 'pub_data_fairness.csv')

## Reusable
Finding, retrieving and interpreting an object is not all there is. For the resource to be reusable it needs to be a) licensed for use and b) in an appropriate format to guarantee long-term support (closely related to the criteria for interoperability).

### Reusability score
The reusability score is also based on the [5 Star Open Data](https://5stardata.info/en/) levels, using the requirement for open licenses and the requirements for using identifiers and links to other data. In this case the score adds up to 3 points; a point is added for each of the following cases:
- 1 if the data object is available under an open license.
- 1 if identifiers (URI, DOI) are used to denote things, so that people can point at them.
- 1 if the data object is linked to other data to provide context.
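
The three cases can be sketched as an additive score. In this illustration the function names, the parameters, and the short list of open licenses are assumptions; license URLs are normalised so that `http`/`https` and trailing-slash variants of the same license compare equal.

```python
def normalise_license(url):
    """Lower-case the URL and drop the scheme and any trailing slash."""
    url = url.strip().lower()
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
    return url.rstrip("/")


# a few licenses treated as open in this sketch (illustrative, not exhaustive)
OPEN_LICENSES = {normalise_license(u) for u in (
    "http://creativecommons.org/licenses/by/3.0/",
    "https://creativecommons.org/licenses/by/4.0/",
    "http://creativecommons.org/licenses/by-nc/3.0/",
)}


def reusability_score(license_url="", has_identifier=False, is_linked=False):
    """Add one point per satisfied criterion, up to 3."""
    score = 0
    if license_url and normalise_license(license_url) in OPEN_LICENSES:
        score += 1   # available under an open license
    if has_identifier:
        score += 1   # denoted by a URI or DOI
    if is_linked:
        score += 1   # linked to other data for context
    return score


print(reusability_score("http://creativecommons.org/licenses/by/4.0", True, True))  # 3
print(reusability_score("http://www.springer.com/tdm", True, False))                # 1
```

Normalising the URLs keeps the list of known licenses short, since the same Creative Commons license often appears in several spellings in DOI metadata.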


In [14]:
data_reference, _ = csvh.get_csv_data('pub_data_fairness.csv', 'num')
# for doi marked data objects the information about license, identification
# and linking can be obtained by looking at the DOI metadata
# for supplementary data, we try to get the DOI metadata for the parent
# publication and assign the same license to the DO.
open_licenses = ['http://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html',
                 'http://creativecommons.org/licenses/by/3.0/',
                 'http://creativecommons.org/licenses/by/4.0',
                 'https://creativecommons.org/licenses/by/4.0',
                 'http://creativecommons.org/licenses/by/4.0/',
                 'http://creativecommons.org/licenses/by-nc/3.0/',
                 'http://doi.wiley.com/10.1002/tdm_license_1',
                 'https://creativecommons.org/licenses/by/4.0/']

not_open_licenses = ['http://www.springer.com/tdm',
                     'http://onlinelibrary.wiley.com/termsAndConditions#vor',
                     'http://doi.wiley.com/10.1002/tdm_license_1.1',
                     'http://rsc.li/journals-terms-of-use',
                     'http://pubs.acs.org/page/policy/authorchoice_termsofuse.html',
                     'http://onlinelibrary.wiley.com/termsAndConditions',
                     'https://www.elsevier.com/tdm/userlicense/1.0/',
                     'http://www.sciencemag.org/about/science-licenses-journal-article-reuse']

def license_is_open(a_license):
    # ask the user to classify any license that has not been seen before
    if a_license not in open_licenses and a_license not in not_open_licenses:
        assigned_lt = False
        while not assigned_lt:
            print(a_license)
            print('Assign license type:')
            print('\ta) Open')
            print('\tb) Not Open')
            print('\tSelect a or b:')
            lts = input()
            if lts == "a":
                open_licenses.append(a_license)
                assigned_lt = True
            elif lts == "b":
                not_open_licenses.append(a_license)
                assigned_lt = True
    return a_license in open_licenses
    
for dr in tqdm_notebook(data_reference):
    
    do_file_name = data_reference[dr]['do_file']
    if data_reference[dr]['r_score'] == "":
        data_reference[dr]['r_score'] = 0
        if data_reference[dr]['license'] != "":
            #print(data_reference[dr])
            data_reference[dr]['r_score'] += 1
        else:
            print("assign same license as publication")
            # Use publication DOI metadata copyright field
            doi_link = "https://doi.org/" + data_reference[dr]['doi']
            data_object = urlh.getObjectMetadata(doi_link)
            #print(data_object)
            if data_object != {}:
                #print(data_object['resource_url'], data_object['type'], data_object['metadata'])
                if 'license' in data_object['metadata']:
                    #print(str(type(data_object['metadata']['license'])))
                    if isinstance(data_object['metadata']['license'], list):
                        for license_item in data_object['metadata']['license']:
                            this_license = license_item['URL']
                            if license_is_open(this_license):
                                data_reference[dr]['r_score'] += 1
                            if data_reference[dr]['license'] == "":
                                data_reference[dr]['license'] = this_license
                            else:
                                data_reference[dr]['license'] += ", " + this_license
                    else: 
                        this_license = data_object['metadata']['license']['URL']
                        if license_is_open(this_license):
                            data_reference[dr]['r_score'] += 1
                        data_reference[dr]['license']=this_license
        # the resource is linked
        # this is a very relaxed view equating identifier to any link!
        if data_reference[dr]['user_mined'] == 'FALSE':
            data_reference[dr]['r_score'] += 1
        else:
            print("this one is not linked")
        # the link works 
        if data_reference[dr]['user_mined']=='FALSE' and \
           data_reference[dr]['ret_code'] in ['200', '301','302','303']:
            data_reference[dr]['r_score'] += 1
        else:
            print("the link does not work")
        #print(data_reference[dr])


In [16]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, 'pub_data_fairness.csv')