# Are Data Objects referenced from publications FAIR?

A list of data objects referenced from a set of publications is checked to determine their FAIRness.

Rather than looking only at references to data, these tests are applied to **data objects**: any data published to complement a publication. This includes raw data, supplementary data, processing data, tables, images, movies, and compilations containing one or more of such resources.

The tests aim to determine whether the data objects are:
 - **F**indable: can the object be found easily?
 - **A**ccessible: can the object be retrieved?
 - **I**nteroperable: can the object be accessed programmatically to extract data and metadata?
 - **R**eusable: can the object be readily used?

In [3]:
# library containing read and write functions for csv files
import lib.handle_csv as csvh

# managing files and file paths
from pathlib import Path

# library for handling url searches
import lib.handle_urls as urlh

# add a progress bar (tqdm.tqdm_notebook is deprecated in favour of tqdm.notebook.tqdm)
from tqdm.notebook import tqdm as tqdm_notebook

# library for accessing system functions
import os

# import custom functions (common to various notebooks)
import processing_functions as pr_fns

## Findable
Most of the data objects are assumed to be findable because we were able to find links to them. However, some links lead to other pages, ask the reader to contact the authors, or point to a repository without identifying a specific record.

### Findability Score
A findability score was calculated for each data object as follows: assign 5 points if the object is referenced directly from the publication web page and further details about it (name, type and size) can be recovered just by accessing that reference.

Points are then deducted from the top score in the following cases:
- Special access to the publication is needed (downloading the pdf, getting a password or token for mining the publication, or other blocks) \[-1 point\]
- Human access to the publication online is required (there is no metadata or clear pattern to identify a reference on the publication landing page or the pdf version redirects to the article). \[-1 point\]
- Recovering the referenced object's details (name, type and size) requires more than a single query. \[-1 point\]
- The reference is wrong (broken link). \[-2 points\]
- The reference only says to contact the authors or to look up a data repository without an ID. \[-4 points\]
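
The deductions listed above can be sketched as a small standalone function. This is only an illustration: the function name and its boolean parameters are hypothetical, and flooring the result at 0 is an assumption that the text does not state.

```python
def findability_score(special_access=False, human_access=False,
                      multiple_queries=False, broken_link=False,
                      no_identifier=False):
    """Start from 5 points and subtract the deductions from the list above."""
    score = 5
    if special_access:
        score -= 1  # pdf download, password, token or other block
    if human_access:
        score -= 1  # a person had to read the landing page or pdf
    if multiple_queries:
        score -= 1  # name, type and size need more than one query
    if broken_link:
        score -= 2  # the reference is wrong
    if no_identifier:
        score -= 4  # contact the authors, or repository without an ID
    # assumption: the score never drops below 0
    return max(score, 0)


print(findability_score(broken_link=True))
```

A reference that only says "available from the authors on request" would be scored with `findability_score(no_identifier=True)`.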

In [4]:
# get names and links for references in data mentions
data_reference, _ = csvh.get_csv_data('pub_data_fairness.csv', 'num')

for dr in tqdm_notebook(data_reference):
    ref = data_reference[dr]  # alias for the current row (same dict object)
    if ref['ret_code'] == "" and ref['f_score'] == "":
        # try to get data object details from reference
        print("Article Link: https://doi.org/" + ref['doi'])
        ref_name = ref['name']
        ref_link = ref['data_url']
        print("Search for: Data Name:", ref_name, "data link:", ref_link)
        head = urlh.getPageHeader(ref_link)
        if head is not None:
            # store as a string so it compares equal to values read back from the csv
            ref['ret_code'] = str(head.status_code)
            ref['resoruce_name'] = os.path.basename(head.url)
            if head.status_code == 200:
                if 'content-type' in head.headers:
                    ref['ref_content'] = head.headers['content-type']
                if 'content-length' in head.headers:
                    ref['ref_size'] = head.headers['content-length']
                ref['ref_redirect'] = head.url
            elif head.status_code in (301, 302):
                ref['ref_redirect'] = head.headers['location']
                ref['resoruce_name'] = os.path.basename(head.headers['location'])
            else:
                print(head, head.headers)
        else:
            # no header could be retrieved at all
            ref['f_score'] = 1
    elif ref['f_score'] == "":
        # header details were recorded on an earlier run: compute the score
        ref['f_score'] = 5
        if ref['html_mined'] == 'FALSE' and ref['pdf_mined'] == 'TRUE':
            ref['f_score'] -= 1  # the publication page is not accessible directly to get the DO
        if ref['html_mined'] == 'FALSE' and ref['user_mined'] == 'TRUE':
            ref['f_score'] -= 1  # a human user needed to access the resource
        if ref['ret_code'] in ['0', '404']:
            ref['f_score'] -= 2  # there is a problem with the link
        elif ref['ret_code'] != '200' and 'doi.org' not in ref['data_url'].lower():
            # DOIs always redirect, so they are not penalized here
            ref['f_score'] -= 1  # there is some form of redirect to get to the object

        

OrderedDict([('num', '1'), ('id', '1'), ('doi', '10.1038/s41929-019-0334-3'), ('type', 'data-availability'), ('name', 'https://doi.org/10.17035/d.2019.0079744472'), ('data_url', 'https://doi.org/10.17035/d.2019.0079744472'), ('ignore', ''), ('publisher', 'springer'), ('dup', 'FALSE'), ('Final Type', 'supporting'), ('data type', ''), ('Findable', 'TRUE'), ('Accessible', ''), ('Interoperable', ''), ('Reusable', ''), ('html_mined', 'TRUE'), ('pdf_mined', 'TRUE'), ('user_mined', 'FALSE'), ('notes', ''), ('ret_code', '302'), ('resoruce_name', '79744472?auxfun=&lang=en_GB'), ('ref_redirect', 'https://research.cardiff.ac.uk/converis/portal/detail/Dataset/79744472?auxfun=&lang=en_GB'), ('ref_content', ''), ('ref_size', ''), ('Findable_issue', ''), ('correction', ''), ('actual_name', ''), ('f_score', '5'), ('a_score', '')])
OrderedDict([('num', '2'), ('id', '1'), ('doi', '10.1038/s41929-019-0334-3'), ('type', 'supplementary'), ('name', 'Supplementary Figs. 1 and references'), ('data_url', 'http





In [5]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, 'pub_data_fairness.csv')

## Accessible
Having an identifier and a link does not guarantee access. Resources may be behind walls (logins, redirections, a request to email the owner, or similar). This translates to the question: can we get the resource? Again, this is not a yes-or-no question. Getting the resource means ending up with the resource at one's disposal.

### Accessibility score
An accessibility score was calculated for each data object as follows: assign 5 points if the reference allows direct download of the object just by accessing that reference. Points are then deducted from the top score for each additional step needed to obtain the referenced object:

- Special access to the publication is needed (getting a password or token for mining the publication, or similar access blocks) [-1 point]
- Human access to the publication online is required (there is no metadata or clear pattern to identify a reference on the publication landing page or the pdf version redirects to the article). [-1 point]
- Recovering the referenced object's details (name, type and size) requires more than a single query. [-1 point]
- The reference is wrong (broken link). [-2 points]
- The reference only says to contact the authors or to look up a data repository without an ID. [-4 points]
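
The code cell for this section applies a narrower set of checks than the list above: it compares what was actually retrieved with what the header check reported. A minimal sketch of that comparison as a standalone function follows. The function name and parameters are hypothetical, and comparing sizes as strings is an assumption made because values read from the csv file are strings.

```python
def accessibility_score(got_object, ref_content="", do_type=None,
                        ref_size="", do_size=None, file_exists=True):
    """Score 0 when nothing could be retrieved, otherwise start from 5."""
    if not got_object or not file_exists:
        return 0
    score = 5
    # type of the retrieved object differs from the header check
    if do_type is not None and do_type != ref_content:
        score -= 1
    # size of the retrieved object differs from the header check
    # (assumption: compare as strings, since csv values are strings)
    if do_size is not None and str(do_size) != str(ref_size):
        score -= 1
    return score


print(accessibility_score(True, "application/pdf", "text/html"))
```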


In [7]:
# get names and links for data object references
data_reference, _ = csvh.get_csv_data('pub_data_fairness.csv', 'num')
for dr in tqdm_notebook(data_reference):
    ref = data_reference[dr]  # alias for the current row (same dict object)
    # if the data object has not been recovered before
    if ref['a_score'] == "":
        # try to get data object from reference
        print("Article Link: https://doi.org/" + ref['doi'])
        ref_name = ref['name']
        ref_link = ref['data_url']
        if ref['correct_url'] != "":
            ref_link = ref['correct_url']
        print("Search for: Data Name:", ref_name, "data link:", ref_link)

        if 'doi.org' in ref_link.lower():
            # DOI links: recover the metadata record for the object
            data_object = urlh.getObjectMetadata(ref_link)
            print(data_object)
            if data_object != {}:
                ref['got_object'] = True
                ref['do_id'] = data_object['resource_url']  # assume url is the identifier for object
                ref['do_type'] = data_object['type']  # should match type in ref_content
                ref['do_metadata'] = data_object['metadata']
        else:
            data_object = urlh.getObject(ref_link)
            if data_object != {}:
                ref['got_object'] = True
                ref['do_id'] = data_object['resource_url']  # assume url is the identifier for object
                ref['do_type'] = data_object['type']  # should match type in ref_content
                if 'size' in data_object:
                    ref['do_size'] = data_object['size']  # should match size in ref_size
                ref['do_file'] = data_object['file_name']
            else:
                # score is 0 if the data cannot be downloaded
                ref['a_score'] = 0
    # scores read back from the csv are strings, so test for both 0 and '0'
    if ref['a_score'] not in (0, '0'):
        ref['a_score'] = 5
        # type of object is different from availability check
        if 'do_type' in ref and ref['do_type'] != ref['ref_content']:
            ref['a_score'] -= 1
        # size of object is different from availability check
        if 'do_size' in ref and ref['do_size'] != ref['ref_size']:
            ref['a_score'] -= 1
        # the file should exist and contain data of the specified type;
        # the name is read from the row ('do_file') because data_object
        # may be left over from a previous iteration
        do_file = ref.get('do_file', "")
        if do_file != "" and not Path(do_file).is_file():
            ref['a_score'] = 0


OrderedDict([('num', '1'), ('id', '1'), ('doi', '10.1038/s41929-019-0334-3'), ('type', 'data-availability'), ('name', 'https://doi.org/10.17035/d.2019.0079744472'), ('data_url', 'https://doi.org/10.17035/d.2019.0079744472'), ('ignore', ''), ('publisher', 'springer'), ('dup', 'FALSE'), ('Final Type', 'supporting'), ('data type', ''), ('Findable', 'TRUE'), ('Accessible', ''), ('Interoperable', ''), ('Reusable', ''), ('html_mined', 'TRUE'), ('pdf_mined', 'TRUE'), ('user_mined', 'FALSE'), ('notes', ''), ('ret_code', '302'), ('resoruce_name', '79744472?auxfun=&lang=en_GB'), ('ref_redirect', 'https://research.cardiff.ac.uk/converis/portal/detail/Dataset/79744472?auxfun=&lang=en_GB'), ('ref_content', ''), ('ref_size', ''), ('Findable_issue', ''), ('correct_url', ''), ('actual_name', ''), ('f_score', '5'), ('a_score', '')])
OrderedDict([('num', '2'), ('id', '1'), ('doi', '10.1038/s41929-019-0334-3'), ('type', 'supplementary'), ('name', 'Supplementary Figs. 1 and references'), ('data_url', 'htt


Article Link: https://doi.org/10.1038/s41929-019-0334-3
Search for: Data Name: https://doi.org/10.17035/d.2019.0079744472 data link: https://doi.org/10.17035/d.2019.0079744472
trying to recover object from https://doi.org/10.17035/d.2019.0079744472
got something back
resource url https://data.crosscite.org/10.17035%2Fd.2019.0079744472
{'resource_url': 'https://data.crosscite.org/10.17035%2Fd.2019.0079744472', 'type': 'application/vnd.citationstyles.csl+json; charset=utf-8', 'metadata': {'type': 'article', 'id': 'https://doi.org/10.17035/d.2019.0079744472', 'categories': ['X-ray Photoelectron Spectroscopy (XPS)', 'Near Infrared Spectroscopy', 'Scanning Electron Microscopy', 'EXAFS', 'Gas Chromatography'], 'language': 'en', 'author': [{'family': 'MacIno', 'given': 'Margherita'}, {'family': 'Barnes', 'given': 'Alexandra J'}, {'family': 'Althahban', 'given': 'Sultan M'}, {'family': 'Qu', 'given': 'Ruiyang'}, {'family': 'Gibson', 'given': 'Emma K'}, {'family': 'Freakley', 'given': 'Simon J'

[Errno 22] Invalid argument: 'do_files/downloadSupplement?doi=10.1002%2Fejoc.201800799&file=ejoc201800799-sup-0001-SupMat.pdf'
Article Link: https://doi.org/10.1039/c8ob00066b
Search for: Data Name: Supplementary information PDF (1211K) data link: http://www.rsc.org/suppdata/c8/ob/c8ob00066b/c8ob00066b1.pdf
trying to recover object from http://www.rsc.org/suppdata/c8/ob/c8ob00066b/c8ob00066b1.pdf
Article Link: https://doi.org/10.1021/acs.biochem.8b00169
Search for: Data Name: bi8b00169_si_001.pdf (607.12 kb) data link: https://pubs.acs.org/doi/suppl/10.1021/acs.biochem.8b00169/suppl_file/bi8b00169_si_001.pdf
trying to recover object from https://pubs.acs.org/doi/suppl/10.1021/acs.biochem.8b00169/suppl_file/bi8b00169_si_001.pdf
Article Link: https://doi.org/10.1021/acs.biochem.8b00169
Search for: Data Name: bi8b00169_si_002.pdf (3.63 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/acs.biochem.8b00169/suppl_file/bi8b00169_si_002.pdf
trying to recover object from https://pubs.acs.or

Article Link: https://doi.org/10.1038/s41467-018-03138-7
Search for: Data Name: Peer Review File data link: https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-03138-7/MediaObjects/41467_2018_3138_MOESM2_ESM.pdf
trying to recover object from https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-03138-7/MediaObjects/41467_2018_3138_MOESM2_ESM.pdf
Article Link: https://doi.org/10.1039/c8cp01022f
Search for: Data Name: 10.5286/ISIS.E.63530347 data link: https://doi.org/10.5286/ISIS.E.63530347
trying to recover object from https://doi.org/10.5286/ISIS.E.63530347
got something back
resource url https://data.crosscite.org/10.5286%2FISIS.E.63530347
{'resource_url': 'https://data.crosscite.org/10.5286%2FISIS.E.63530347', 'type': 'application/vnd.citationstyles.csl+json; charset=utf-8', 'metadata': {'type': 'article', 'id': 'https://doi.org/10.5286/isis.e.63530347', 'author': [{'literal': 'Professor Richard Catlow'}, {'literal': 'Dr Stewart Parker'}, {'literal': 'Dr 

[Errno 22] Invalid argument: 'do_files/downloadSupplement?doi=10.1002%2Fanie.201713115&file=anie201713115-sup-0001-misc_information.pdf'
Article Link: https://doi.org/10.1038/s41929-018-0197-z
Search for: Data Name: https://doi.org/10.5523/bris.1kp2f62x3klb02mfz2qymcmxmx data link: https://doi.org/10.5523/bris.1kp2f62x3klb02mfz2qymcmxmx
trying to recover object from https://doi.org/10.5523/bris.1kp2f62x3klb02mfz2qymcmxmx
got something back
resource url https://data.crosscite.org/10.5523%2Fbris.1kp2f62x3klb02mfz2qymcmxmx
{'resource_url': 'https://data.crosscite.org/10.5523%2Fbris.1kp2f62x3klb02mfz2qymcmxmx', 'type': 'application/vnd.citationstyles.csl+json; charset=utf-8', 'metadata': {'type': 'dataset', 'id': 'https://doi.org/10.5523/bris.1kp2f62x3klb02mfz2qymcmxmx', 'language': 'en', 'author': [{'family': 'Bedford', 'given': 'Robin'}, {'family': 'Messinis', 'given': 'Antonios'}], 'issued': {'date-parts': [[2018]]}, 'abstract': 'Data supporting Nature Catalysis paper', 'DOI': '10.5523/

[Errno 22] Invalid argument: 'do_files/downloadSupplement?doi=10.1002%2Fcctc.201900100&file=cctc201900100-sup-0001-misc_information.pdf'
Article Link: https://doi.org/10.1021/acs.iecr.8b00230
Search for: Data Name: ie8b00230_si_001.pdf (903.9 kb) data link: https://pubs.acs.org/doi/suppl/10.1021/acs.iecr.8b00230/suppl_file/ie8b00230_si_001.pdf
trying to recover object from https://pubs.acs.org/doi/suppl/10.1021/acs.iecr.8b00230/suppl_file/ie8b00230_si_001.pdf
Article Link: https://doi.org/10.1002/chem.201704151
Search for: Data Name: chem201704151-sup-0001-misc_information.pdf data link: https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fchem.201704151&file=chem201704151-sup-0001-misc_information.pdf
trying to recover object from https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fchem.201704151&file=chem201704151-sup-0001-misc_information.pdf
[Errno 22] Invalid argument: 'do_files/downloadSupplement?doi=10.1002

Article Link: https://doi.org/10.1039/c5cy01175b
Search for: Data Name: Supplementary information PDF (1188K) data link: http://www.rsc.org/suppdata/c5/cy/c5cy01175b/c5cy01175b1.pdf
trying to recover object from http://www.rsc.org/suppdata/c5/cy/c5cy01175b/c5cy01175b1.pdf
Article Link: https://doi.org/10.1039/c5cc06118k
Search for: Data Name: Supplementary information PDF (794K) data link: http://www.rsc.org/suppdata/c5/cc/c5cc06118k/c5cc06118k1.pdf
trying to recover object from http://www.rsc.org/suppdata/c5/cc/c5cc06118k/c5cc06118k1.pdf
Article Link: https://doi.org/10.1039/c5cc08956e
Search for: Data Name: Supplementary information PDF (667K) data link: http://www.rsc.org/suppdata/c5/cc/c5cc08956e/c5cc08956e1.pdf
trying to recover object from http://www.rsc.org/suppdata/c5/cc/c5cc08956e/c5cc08956e1.pdf
Article Link: https://doi.org/10.1039/c5cc08956e
Search for: Data Name: <img alt="image file: c5cc08956e-u1.tif" src="/image/article/2016/CC/c5cc08956e/c5cc08956e-u1.gif"/> data link:

Article Link: https://doi.org/10.1039/c4cp04693e
Search for: Data Name: Supplementary information PDF (1017K) data link: http://www.rsc.org/suppdata/cp/c4/c4cp04693e/c4cp04693e1.pdf
trying to recover object from http://www.rsc.org/suppdata/cp/c4/c4cp04693e/c4cp04693e1.pdf
Article Link: https://doi.org/10.1039/c4dt01309c
Search for: Data Name: Supplementary information PDF (1060K) data link: http://www.rsc.org/suppdata/dt/c4/c4dt01309c/c4dt01309c1.pdf
trying to recover object from http://www.rsc.org/suppdata/dt/c4/c4dt01309c/c4dt01309c1.pdf
Article Link: https://doi.org/10.1021/jp5081753
Search for: Data Name: jp5081753_si_001.pdf (2.31 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/jp5081753/suppl_file/jp5081753_si_001.pdf
trying to recover object from https://pubs.acs.org/doi/suppl/10.1021/jp5081753/suppl_file/jp5081753_si_001.pdf
Article Link: https://doi.org/10.1007/s11244-018-0893-6
Search for: Data Name: Supplementary material 1 (TIF 118 KB) data link: https://static-conten

Article Link: https://doi.org/10.1039/c5cc09780k
Search for: Data Name: Supplementary information PDF (481K) data link: http://www.rsc.org/suppdata/c5/cc/c5cc09780k/c5cc09780k1.pdf
trying to recover object from http://www.rsc.org/suppdata/c5/cc/c5cc09780k/c5cc09780k1.pdf
Article Link: https://doi.org/10.1039/c5cy02072g
Search for: Data Name: Supplementary information PDF (236K) data link: http://www.rsc.org/suppdata/c5/cy/c5cy02072g/c5cy02072g1.pdf
trying to recover object from http://www.rsc.org/suppdata/c5/cy/c5cy02072g/c5cy02072g1.pdf
Article Link: https://doi.org/10.1039/c6cc01599a
Search for: Data Name: Supplementary information PDF (679K) data link: http://www.rsc.org/suppdata/c6/cc/c6cc01599a/c6cc01599a1.pdf
trying to recover object from http://www.rsc.org/suppdata/c6/cc/c6cc01599a/c6cc01599a1.pdf
Article Link: https://doi.org/10.1039/c5cc04188k
Search for: Data Name: Supplementary information PDF (727K) data link: http://www.rsc.org/suppdata/c5/cc/c5cc04188k/c5cc04188k1.pdf
try

Article Link: https://doi.org/10.1039/c5cc08714g
Search for: Data Name: Supplementary information PDF (2418K) data link: http://www.rsc.org/suppdata/c5/cc/c5cc08714g/c5cc08714g1.pdf
trying to recover object from http://www.rsc.org/suppdata/c5/cc/c5cc08714g/c5cc08714g1.pdf
Article Link: https://doi.org/10.1039/c5cc08681g
Search for: Data Name: Supplementary information PDF (1585K) data link: http://www.rsc.org/suppdata/c5/cc/c5cc08681g/c5cc08681g1.pdf
trying to recover object from http://www.rsc.org/suppdata/c5/cc/c5cc08681g/c5cc08681g1.pdf
Article Link: https://doi.org/10.1002/cssc.201501225
Search for: Data Name: cssc201501225-sup-0001-misc_information.pdf data link: https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fcssc.201501225&file=cssc201501225-sup-0001-misc_information.pdf
trying to recover object from https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fcssc.201501225&file=cssc201501225-sup-0001-misc_inf

Article Link: https://doi.org/10.1021/ja5062467
Search for: Data Name: ja5062467_si_006.cif (55.99 kb) data link: https://pubs.acs.org/doi/suppl/10.1021/ja5062467/suppl_file/ja5062467_si_006.cif
trying to recover object from https://pubs.acs.org/doi/suppl/10.1021/ja5062467/suppl_file/ja5062467_si_006.cif
Article Link: https://doi.org/10.1021/ja5062467
Search for: Data Name: ja5062467_si_007.cif (22.35 kb) data link: https://pubs.acs.org/doi/suppl/10.1021/ja5062467/suppl_file/ja5062467_si_007.cif
trying to recover object from https://pubs.acs.org/doi/suppl/10.1021/ja5062467/suppl_file/ja5062467_si_007.cif
Article Link: https://doi.org/10.1021/ja5062467
Search for: Data Name: ja5062467_si_008.cif (45.64 kb) data link: https://pubs.acs.org/doi/suppl/10.1021/ja5062467/suppl_file/ja5062467_si_008.cif
trying to recover object from https://pubs.acs.org/doi/suppl/10.1021/ja5062467/suppl_file/ja5062467_si_008.cif
Article Link: https://doi.org/10.1021/ja5062467
Search for: Data Name: ja5062467_s

Article Link: https://doi.org/10.1038/nature16935
Search for: Data Name: Extended Data Figure 5 Spectroscopic analysis of the addition of zinc to georgeite. data link: https://www.nature.com/articles/nature16935/figures/9
trying to recover object from https://www.nature.com/articles/nature16935/figures/9
Article Link: https://doi.org/10.1038/nature16935
Search for: Data Name: Extended Data Figure 6 Representative DF-STEM and BF-STEM micrographs of zincian georgeite and zincian malachite, calcined at 300°C. data link: https://www.nature.com/articles/nature16935/figures/10
trying to recover object from https://www.nature.com/articles/nature16935/figures/10
Article Link: https://doi.org/10.1038/nature16935
Search for: Data Name: Extended Data Figure 7 X-ray diffraction analysis of calcined zincian georgeite and zincian malachite. data link: https://www.nature.com/articles/nature16935/figures/11
trying to recover object from https://www.nature.com/articles/nature16935/figures/11
Article L

Article Link: https://doi.org/10.1002/cctc.201900658
Search for: Data Name: cctc201900658-sup-0001-misc_information.pdf data link: https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fcctc.201900658&file=cctc201900658-sup-0001-misc_information.pdf
trying to recover object from https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fcctc.201900658&file=cctc201900658-sup-0001-misc_information.pdf
[Errno 22] Invalid argument: 'do_files/downloadSupplement?doi=10.1002%2Fcctc.201900658&file=cctc201900658-sup-0001-misc_information.pdf'
Article Link: https://doi.org/10.1039/c9cy01679a
Search for: Data Name: Supplementary information PDF (1118K) data link: http://www.rsc.org/suppdata/c9/cy/c9cy01679a/c9cy01679a1.pdf
trying to recover object from http://www.rsc.org/suppdata/c9/cy/c9cy01679a/c9cy01679a1.pdf
Article Link: https://doi.org/10.1002/cbic.201800606
Search for: Data Name: cbic201800606-sup-0001-misc_information.pdf dat

Article Link: https://doi.org/10.1021/acscatal.9b00160
Search for: Data Name: cs9b00160_si_001.pdf (3.07 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/acscatal.9b00160/suppl_file/cs9b00160_si_001.pdf
trying to recover object from https://pubs.acs.org/doi/suppl/10.1021/acscatal.9b00160/suppl_file/cs9b00160_si_001.pdf
Article Link: https://doi.org/10.1021/acscatal.9b00160
Search for: Data Name: 10.1021/acscatal.9b00160 data link: https://pubs.acs.org/doi/abs/10.1021/acscatal.9b00160
trying to recover object from https://pubs.acs.org/doi/abs/10.1021/acscatal.9b00160
Article Link: https://doi.org/10.1021/acscatal.9b00160
Search for: Data Name: PDF data link: https://pubs.acs.org/doi/suppl/10.1021/acscatal.9b00160/suppl_file/cs9b00160_si_001.pdf
trying to recover object from https://pubs.acs.org/doi/suppl/10.1021/acscatal.9b00160/suppl_file/cs9b00160_si_001.pdf
Article Link: https://doi.org/10.1002/cctc.201801067
Search for: Data Name: cctc201801067-sup-0001-misc_information.pdf dat

Article Link: https://doi.org/10.1039/c8cc01880d
Search for: Data Name: 1822415 data link: http://xlink.rsc.org/?ccdc=1822415&msid=c8cc01880d
trying to recover object from http://xlink.rsc.org/?ccdc=1822415&msid=c8cc01880d
Article Link: https://doi.org/10.1039/c8cc01880d
Search for: Data Name: 1822416 data link: http://xlink.rsc.org/?ccdc=1822416&msid=c8cc01880d
trying to recover object from http://xlink.rsc.org/?ccdc=1822416&msid=c8cc01880d
Article Link: https://doi.org/10.1039/c8cc01880d
Search for: Data Name: 10.1039/c8cc01880d data link: http://xlink.rsc.org/?DOI=c8cc01880d
trying to recover object from http://xlink.rsc.org/?DOI=c8cc01880d
Article Link: https://doi.org/10.1039/c8cc01880d
Search for: Data Name: 10.17861/14c23fe6-bc65-4806-ba5e-63642a6ad3e9 data link: https://dx.doi.org/10.17861/14c23fe6-bc65-4806-ba5e-63642a6ad3e9
trying to recover object from https://dx.doi.org/10.17861/14c23fe6-bc65-4806-ba5e-63642a6ad3e9
got something back
resource url https://data.crosscite.org/

[Errno 22] Invalid argument: 'do_files/downloadSupplement?doi=10.1002%2Fcssc.201501264&file=cssc201501264-sup-0001-misc_information.pdf'
Article Link: https://doi.org/10.1021/acscatal.7b03805
Search for: Data Name: cs7b03805_si_001.pdf (1.09 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/acscatal.7b03805/suppl_file/cs7b03805_si_001.pdf
trying to recover object from https://pubs.acs.org/doi/suppl/10.1021/acscatal.7b03805/suppl_file/cs7b03805_si_001.pdf
Article Link: https://doi.org/10.1021/acscatal.7b03805
Search for: Data Name: 10.1021/acscatal.7b03805 data link: https://pubs.acs.org/doi/abs/10.1021/acscatal.7b03805
trying to recover object from https://pubs.acs.org/doi/abs/10.1021/acscatal.7b03805
Article Link: https://doi.org/10.1021/acscatal.7b03805
Search for: Data Name: PDF data link: https://pubs.acs.org/doi/suppl/10.1021/acscatal.7b03805/suppl_file/cs7b03805_si_001.pdf
trying to recover object from https://pubs.acs.org/doi/suppl/10.1021/acscatal.7b03805/suppl_file/cs7b038

[Errno 22] Invalid argument: 'do_files/downloadSupplement?doi=10.1002%2Fcctc.201901955&file=cctc201901955-sup-0001-misc_information.pdf'
Article Link: https://doi.org/10.1039/c9sc04905c
Search for: Data Name: Supplementary information PDF (1648K) data link: http://www.rsc.org/suppdata/c9/sc/c9sc04905c/c9sc04905c1.pdf
trying to recover object from http://www.rsc.org/suppdata/c9/sc/c9sc04905c/c9sc04905c1.pdf
Article Link: https://doi.org/10.1039/d0sc01924k
Search for: Data Name: Supplementary information PDF (140K) data link: http://www.rsc.org/suppdata/d0/sc/d0sc01924k/d0sc01924k1.pdf
trying to recover object from http://www.rsc.org/suppdata/d0/sc/d0sc01924k/d0sc01924k1.pdf
Article Link: https://doi.org/10.1038/s41467-020-15445-z
Search for: Data Name: The data that support the plots in this paper and the other findings of this study are available from the corresponding authors on reasonable request. data link: 
trying to recover object from 
Invalid URL '': No schema supplied. Perhaps 

Article Link: https://doi.org/10.1038/s41563-020-0800-y
Search for: Data Name: Source Data Fig. 4 data link: https://static-content.springer.com/esm/art%3A10.1038%2Fs41563-020-0800-y/MediaObjects/41563_2020_800_MOESM4_ESM.xlsx
trying to recover object from https://static-content.springer.com/esm/art%3A10.1038%2Fs41563-020-0800-y/MediaObjects/41563_2020_800_MOESM4_ESM.xlsx
Article Link: https://doi.org/10.1038/s41563-020-0800-y
Search for: Data Name: Source Data Fig. 5 data link: https://static-content.springer.com/esm/art%3A10.1038%2Fs41563-020-0800-y/MediaObjects/41563_2020_800_MOESM5_ESM.xlsx
trying to recover object from https://static-content.springer.com/esm/art%3A10.1038%2Fs41563-020-0800-y/MediaObjects/41563_2020_800_MOESM5_ESM.xlsx
Article Link: https://doi.org/10.1038/s41563-020-0800-y
Search for: Data Name: Source Data Fig. S3 data link: https://static-content.springer.com/esm/art%3A10.1038%2Fs41563-020-0800-y/MediaObjects/41563_2020_800_MOESM6_ESM.xlsx
trying to recover obje

Article Link: https://doi.org/10.1016/j.apcatb.2016.12.066
Search for: Data Name: 1-s2.0-S0926337316310025-mmc1.doc data link: https://ars.els-cdn.com/content/image/1-s2.0-S0926337316310025-mmc1.doc
trying to recover object from https://ars.els-cdn.com/content/image/1-s2.0-S0926337316310025-mmc1.doc
Article Link: https://doi.org/10.1016/j.apcatb.2018.07.008
Search for: Data Name: 1-s2.0-S0926337318306167-mmc1.docx data link: https://ars.els-cdn.com/content/image/1-s2.0-S0926337318306167-mmc1.docx
trying to recover object from https://ars.els-cdn.com/content/image/1-s2.0-S0926337318306167-mmc1.docx
Article Link: https://doi.org/10.1016/j.apcatb.2018.07.072
Search for: Data Name: 1-s2.0-S0926337318307136-mmc1.docx data link: https://ars.els-cdn.com/content/image/1-s2.0-S0926337318307136-mmc1.docx
trying to recover object from https://ars.els-cdn.com/content/image/1-s2.0-S0926337318307136-mmc1.docx
Article Link: https://doi.org/10.1016/j.apcatb.2019.04.078
Search for: Data Name: 1-s2.0-S0

Article Link: https://doi.org/10.1016/j.susc.2016.10.005
Search for: Data Name: 10.17035/d.2016.0009299603 data link: http://dx.doi.org/10.17035/d.2016.0009299603
trying to recover object from http://dx.doi.org/10.17035/d.2016.0009299603
got something back
resource url https://data.crosscite.org/10.17035%2Fd.2016.0009299603
{'resource_url': 'https://data.crosscite.org/10.17035%2Fd.2016.0009299603', 'type': 'application/vnd.citationstyles.csl+json; charset=utf-8', 'metadata': {'type': 'dataset', 'id': 'https://doi.org/10.17035/d.2016.0009299603', 'categories': ['Surfaces and Interfaces', 'Catalysis at surfaces', 'Palladium Catalysts', 'Gold catalysis'], 'language': 'en', 'author': [{'literal': 'Sharpe R'}, {'literal': 'Counsell J'}, {'literal': 'Bowker M'}], 'issued': {'date-parts': [[2016]]}, 'abstract': 'The interaction of Au and Pd in bimetallic systems is important in a number of areas of technology, especially catalysis. In order to investigate the segregation behaviour in such sys

Article Link: https://doi.org/10.1128/aac.00564-19
Search for: Data Name: AAC.00564-19-s0001.pdf data link: https://aac.asm.org/highwire/filestream/207378/field_highwire_adjunct_files/0/AAC.00564-19-s0001.pdf
trying to recover object from https://aac.asm.org/highwire/filestream/207378/field_highwire_adjunct_files/0/AAC.00564-19-s0001.pdf
Article Link: https://doi.org/10.3390/surfaces2010001
Search for: Data Name: https://www.mdpi.com/2571-9637/2/1/1/s1 data link: https://www.mdpi.com/2571-9637/2/1/1/s1
trying to recover object from https://www.mdpi.com/2571-9637/2/1/1/s1
[Errno 22] Invalid argument: 'do_files/surfaces-02-00001-s001.pdf?version=1546517275'
Article Link: https://doi.org/10.3762/bjnano.10.191
Search for: Data Name: 2190-4286-10-191-S1.pdf data link: https://www.beilstein-journals.org/bjnano/content/supplementary/2190-4286-10-191-S1.pdf
trying to recover object from https://www.beilstein-journals.org/bjnano/content/supplementary/2190-4286-10-191-S1.pdf
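
The `[Errno 22] Invalid argument` lines in the log above all show local file names that still carry the URL query string (`?doi=...`, `?version=...`). A likely cause, assuming the notebook ran on a system such as Windows where `?` is not allowed in file names, is that the file name was taken from the raw URL. The internals of `urlh.getObject` are not shown here, so the following is only a hedged sketch of one way to derive a safe name. The function `safe_file_name` and the choice to prefer a `file=` query parameter are hypothetical.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit, parse_qs, unquote


def safe_file_name(url, directory="do_files"):
    """Build a local file name from a URL without query-string characters."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if "file" in query:
        # links such as .../downloadSupplement?doi=...&file=name.pdf
        name = query["file"][0]
    else:
        # otherwise use the last segment of the URL path
        name = PurePosixPath(unquote(parts.path)).name
    # replace characters that are not allowed in Windows file names
    name = re.sub(r'[<>:"/\\|?*]', "_", name)
    # assumption: fall back to a fixed name when the URL has no usable segment
    name = name or "unnamed"
    return str(PurePosixPath(directory) / name)


print(safe_file_name("http://www.rsc.org/suppdata/c9/cy/c9cy01679a/c9cy01679a1.pdf"))
```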



In [10]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, 'pub_data_fairness.csv')
    

## Interoperable
Access to a resource does not guarantee interoperability. A data object is interoperable if it is stored in a format that is easy for both humans and machines to interpret. An object in an open format is therefore more interoperable than an object in a proprietary format.

### Interoperability score
The interoperability score is defined along the lines of [5 Star Open Data](https://5stardata.info/en/), using the first three levels. The definition is relaxed by omitting the requirement to publish under an open license for interoperability (it is used below for reusability). The scoring is as follows:

- 1 if the data object is available on the Web (in whatever format).
- 2 if the data object is available as structured data (e.g., Excel instead of an image scan of a table).
- 3 if the data object is available in a non-proprietary open format (e.g., CSV instead of Excel).
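
The three levels above can be expressed as a small lookup on the file extension. The following is a minimal sketch rather than the notebook's own implementation: `interoperability_score` and the two suffix sets are illustrative names, the function compares the file suffix instead of searching the whole name, and it assumes that any file without a recognised structured or open suffix counts as level 1.

```python
from pathlib import PurePosixPath

# Illustrative suffix groups for levels 2 and 3 (assumed, not exhaustive)
LEVEL_2_SUFFIXES = {'.xls', '.xlsx'}
LEVEL_3_SUFFIXES = {'.csv', '.cif'}


def interoperability_score(file_name):
    """Return 0 when there is no file, otherwise 1, 2 or 3 from the suffix."""
    if not file_name:
        return 0
    # compare only the final extension, ignoring case
    suffix = PurePosixPath(file_name).suffix.lower()
    if suffix in LEVEL_3_SUFFIXES:
        return 3
    if suffix in LEVEL_2_SUFFIXES:
        return 2
    # anything else that was retrieved is treated as "available on the Web"
    return 1


print(interoperability_score('do_files/table_s1.xlsx'))
```

Comparing the suffix avoids accidental matches such as a PDF whose name happens to contain the letters `csv`.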

In [31]:
# get names and links for data object references
data_reference, _ = csvh.get_csv_data('pub_data_fairness.csv', 'num')
# check if files actually exist, and then assess if they are of type 1, 2, or 3
# extensions are matched as substrings of the file name, so each keeps its leading dot
level_1_types = ['.doc', '.pdf', '.tif', '.jpg', '.docx', '.ppt', '.pptx', '.mp4', '.mpg', '.htm', '.html', '.zip', '.avi']
level_2_types = ['.xls', '.xlsx']
level_3_types = ['.csv', '.cif']
for dr in tqdm_notebook(data_reference):
    do_file_name = data_reference[dr]['do_file']
    # record whether the downloaded file is present and, if so, its size in bytes
    if do_file_name != "" and not Path(do_file_name).is_file():
        data_reference[dr]['file_missing'] = 'TRUE'
    elif do_file_name != "":
        data_reference[dr]['file_missing'] = 'FALSE'
        data_reference[dr]['file_size'] = Path(do_file_name).stat().st_size
    # later loops overwrite earlier ones, so the highest matching level wins
    for lv1_type in level_1_types:
        if lv1_type in do_file_name:
            data_reference[dr]['i_score'] = 1
    for lv2_type in level_2_types:
        if lv2_type in do_file_name:
            data_reference[dr]['i_score'] = 2
    for lv3_type in level_3_types:
        if lv3_type in do_file_name:
            data_reference[dr]['i_score'] = 3

            

In [32]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, 'pub_data_fairness.csv')

## Reusable
Finding, retrieving and interpreting an object is not all there is. For the resource to be reusable it needs to be a) licensed for use and b) in an appropriate format to guarantee long-term support (closely related to the criteria for interoperability).

### Reusability score
The reusability score is also based on the [5 Star Open Data](https://5stardata.info/en/) levels, using the requirement for open licenses and the requirements for using identifiers and links to other data. In this case the scoring adds up to 3 points, with one point added for each of the following cases:
- 1 if the data object is available under an open license.
- 1 if URIs are used to denote things, so that people can point at the data object.
- 1 if the data object is linked to other data to provide context.
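
Because each criterion contributes one point, the score is simply a count of the criteria that hold. A minimal sketch follows, assuming the three checks have already been reduced to yes/no answers; `reusability_score` and its parameter names are hypothetical and do not appear elsewhere in this notebook.

```python
def reusability_score(has_open_license, has_uri, is_linked):
    """Add one point for each reusability criterion that holds (0 to 3)."""
    criteria = (has_open_license, has_uri, is_linked)
    # bool() normalises truthy values such as 'TRUE' or 1 before counting
    return sum(1 for criterion in criteria if bool(criterion))


# a data object with an open license and a URI, but no links to other data
print(reusability_score(True, True, False))
```

How each answer is obtained (for example, from repository metadata) is left open here.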



In [33]:
data_reference, _ = csvh.get_csv_data('pub_data_fairness.csv', 'num')


In [None]:
import requests
import json

# request the same DOI in three metadata formats by changing the Accept header
req_head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
req_head['Accept'] = 'application/x-bibtex'
url_text = 'https://doi.org/10.17035/d.2019.0079744472'
response = requests.get(url_text, headers = req_head)
print('*************BibTex*******************')
print(response.content.decode())
print('*************Research Info Systems (RIS)*******************')
req_head['Accept'] = 'application/x-research-info-systems'
response = requests.get(url_text, headers = req_head)
print(response.content.decode())
print('*************VND Citation Styles CSL*******************')
req_head['Accept'] = 'application/vnd.citationstyles.csl+json'
response = requests.get(url_text, headers = req_head)
# keep the raw bytes in `contents`; the next cell decodes them again
contents = response.content
contents_json = json.loads(contents.decode())
print(json.dumps(contents_json, indent=4, sort_keys=True))

In [None]:
contents.decode()

In [None]:
str(b'a string').encode().decode()
str("'a string'")