# Are Data Objects referenced from publications FAIR?

A list of data objects referenced from a set of publications is checked to determine their FAIRness.

Rather than looking only at references to data, these tests are applied to **data objects**, that is, any data published to complement the publication. This includes raw data, supplementary data, processing data, tables, images, movies, and compilations containing one or more of such resources.

The tests to be performed are aimed at finding out if the data objects are: 
 - **F**indable:  can the object be found easily?
 - **A**ccessible: can the object be retrieved?
 - **I**nteroperable: can the object be accessed programmatically to extract data and metadata?
 - **R**eusable: can the object be readily used?

In [1]:
# library containing read and write functions for csv files
import lib.handle_csv as csvh

# managing files and file paths
from pathlib import Path

# library for handling url searches
import lib.handle_urls as urlh

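# add a progress bar (tqdm.tqdm_notebook is deprecated, so import the
# notebook-aware tqdm under the name used in the rest of this notebook)
from tqdm.notebook import tqdm as tqdm_notebook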
    
# library for accessing system functions
import os

# import custom functions (common to various notebooks)
import processing_functions as pr_fns

## Findable
Most of the data objects are assumed to be findable because we were able to find links to them. However, some links lead to other pages, some only ask the reader to contact the authors, and some point to repositories without identifying a specific record.

### Findability Score
A findability score was calculated for each data object as follows: assign 5 points if the object is referenced directly from the publication web page and its details (name, type and size) can be recovered just by accessing that reference.

Points are then deducted from this top score if, to find the referenced object:
- Special access to the publication is needed (download the pdf, get a password or token for mining the publication, other blocks) \[-1 point\]
- Human access to the publication online is required (there is no metadata or clear pattern to identify a reference on the publication landing page or the pdf version redirects to the article). \[-1 point\]
- Recovering the reference object details (name, type and size) requires more than a single query. \[-1 point\]
- The reference is wrong (broken link). \[-2 points\]
- The reference only asks the reader to contact the authors or to look up a data repository without an ID. \[-4 points\]
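
The deductions listed above can be sketched as a small pure function. This is only an illustration of the arithmetic: the function name and the boolean parameters are hypothetical, and clamping the result at zero is an assumption that the text does not state.

```python
def findability_score(special_access=False, human_access=False,
                      extra_queries=False, broken_link=False,
                      no_specific_record=False):
    """Start from 5 points and apply the deductions from the list above."""
    score = 5
    if special_access:       # PDF download, password or token needed
        score -= 1
    if human_access:         # a person had to read the page to find the reference
        score -= 1
    if extra_queries:        # name, type and size need more than one query
        score -= 1
    if broken_link:          # the reference does not resolve
        score -= 2
    if no_specific_record:   # "contact the authors" or a repository without an ID
        score -= 4
    # Assumption: clamp at zero so combined deductions never go negative.
    return max(score, 0)


print(findability_score())
print(findability_score(special_access=True, broken_link=True))
```

Keeping the rule in one function makes it easy to check each deduction separately from the network requests made in the cell that follows.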

In [3]:
# get names and links for references in data mentions
do_refs = 'pdf_mentions202110_fairness.csv'

data_reference, _ = csvh.get_csv_data(do_refs, 'num')

for dr in tqdm_notebook(data_reference):
    if data_reference[dr]['ret_code'] == "" and data_reference[dr]['f_score'] == "":
        # try to get data object details from reference
        print("Article Link: https://doi.org/" + data_reference[dr]['doi'])
        ref_name = data_reference[dr]['name']
        ref_link = data_reference[dr]['data_url']
        print("Search for: Data Name:", ref_name, "data link:", ref_link)
        head = urlh.getPageHeader(ref_link)
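        if head is not None:
            # store the status code as text so that it compares equal to the
            # string values ('0', '404', '200') used in the scoring loop
            data_reference[dr]['ret_code'] = str(head.status_code)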
            data_reference[dr]['resoruce_name'] = os.path.basename(head.url)
            if head.status_code == 200:
                #print (head.headers, head.url)
                if 'content-type' in head.headers.keys():
                    data_reference[dr]['ref_content'] = head.headers['content-type']
                if 'content-length' in head.headers.keys():
                    data_reference[dr]['ref_size'] = head.headers['content-length']
                data_reference[dr]['ref_redirect'] = head.url
            elif head.status_code == 302 or head.status_code == 301:
                #print(head, head.headers)
                data_reference[dr]['ref_redirect'] = head.headers['location']
                data_reference[dr]['resoruce_name'] = os.path.basename(head.headers['location'])
            else:
                print(head, head.headers)
        else:
            data_reference[dr]['f_score'] = 1

# save the search results
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, do_refs)
    
#score the results
for dr in tqdm_notebook(data_reference):
    #evaluate and get scores
    if data_reference[dr]['f_score'] == "":
        data_reference[dr]['f_score'] = 5    
        #print ("start ", dr, data_reference[dr]['f_score'])
        if data_reference[dr]['html_mined'] == 'FALSE' and data_reference[dr]['pdf_mined'] == 'TRUE':
            data_reference[dr]['f_score'] -= 1 # the publication page is not accessible directly to get the DO 
            #print ("deduct pdf mined", dr, data_reference[dr]['f_score'])
        if data_reference[dr]['html_mined'] == 'FALSE' and data_reference[dr]['user_mined'] == 'TRUE':
            data_reference[dr]['f_score'] -= 1 # a human user needed to access the resource
            #print ("deduct manually mined", dr, data_reference[dr]['f_score'])
        if data_reference[dr]['ret_code'] in ['0','404']:
            data_reference[dr]['f_score'] -= 2 # there is a problem with the link
            #print ("deduct page not found", dr, data_reference[dr]['f_score'])
        elif data_reference[dr]['ret_code'] != '200' and not 'doi.org' in data_reference[dr]['data_url'].lower():
            # dois always redirect
            data_reference[dr]['f_score'] -= 1 # there is some form of redirect to get to the  object
            #print ("deduct page redirect", dr, data_reference[dr]['f_score'], data_reference[dr]['ret_code'])

# save the scores
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, do_refs)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for dr in tqdm_notebook(data_reference):


  0%|          | 0/151 [00:00<?, ?it/s]


## Accessible
Having an identifier and a link does not guarantee access. Resources may sit behind barriers (logins, redirections, a request to email the owner, or similar). This translates to the question: can we get the resource? Again, this is not a yes-or-no question. Getting the resource means that, at the end of the process, the resource is at one's disposal.

### Accessibility score
An accessibility score was calculated for each data object as follows: assign 5 points if the reference allows direct download of the object just by accessing it. Points are then deducted from this top score for each additional step if, to obtain the referenced object:

- Special access to the publication is needed (a password or token for mining the publication, or similar access blocks) [-1 point]
- Human access to the publication online is required (there is no metadata or clear pattern to identify a reference on the publication landing page or the pdf version redirects to the article). [-1 point]
- Recovering the reference object details (name, type and size) requires more than a single query. [-1 point]
- The reference is wrong (broken link). [-2 points]
- The reference only asks the reader to contact the authors or to look up a data repository without an ID. [-4 points]
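
As a rough illustration, the scoring cell in this section applies a narrower rule than the list above: it gives 0 when nothing could be downloaded, and otherwise starts from 5 and deducts a point whenever the type or size of the retrieved object disagrees with what the earlier header check reported. The sketch below mirrors that rule. The function name and parameters are hypothetical, and comparing sizes as strings is an assumption made because values read back from the CSV file are text.

```python
def accessibility_score(got_object, ref_content="", ref_size="",
                        do_type=None, do_size=None):
    """Score access to one data object on a 0 to 5 scale."""
    if not got_object:
        return 0                      # nothing could be retrieved
    score = 5
    # the retrieved type should match the content type seen in the header check
    if do_type is not None and do_type != ref_content:
        score -= 1
    # assumption: compare sizes as text, since CSV values are strings
    if do_size is not None and str(do_size) != str(ref_size):
        score -= 1
    return score


print(accessibility_score(True, "text/csv", "120", "text/csv", 120))
```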


In [4]:
# get names and links for data object references
data_reference, _ = csvh.get_csv_data(do_refs, 'num')
for dr in tqdm_notebook(data_reference):
    # if data objects has not been recovered before
    if data_reference[dr]['a_score'] == "":
        # try to get data object from reference
        print("Article Link: https://doi.org/" + data_reference[dr]['doi'])
        ref_name = data_reference[dr]['name']
        ref_link = data_reference[dr]['data_url']
        if data_reference[dr]['correct_url'] != "":
            ref_link = data_reference[dr]['correct_url']
        print("Search for: Data Name:", ref_name, "data link:", ref_link)
        
        if 'doi.org' in ref_link.lower():
            data_object = urlh.getObjectMetadata(ref_link)
            print(data_object)
            if data_object != {}:
                data_reference[dr]['got_object'] = True 
                data_reference[dr]['do_id'] = data_object['resource_url'] # assume url is the identifier for object
                data_reference[dr]['do_type'] = data_object['type'] # should match type in ref_content
                data_reference[dr]['do_metadata'] = data_object['metadata'] 
        else:
            data_object = urlh.getObject(ref_link)
            if data_object != {}:
                data_reference[dr]['got_object'] = True 
                data_reference[dr]['do_id'] = data_object['resource_url'] # assume url is the identifier for object
                data_reference[dr]['do_type'] = data_object['type'] # should match type in ref_content
                if 'size' in data_object.keys():
                    data_reference[dr]['do_size'] = data_object['size'] # should match size in ref_size
                data_reference[dr]['do_file'] = data_object['file_name'] 
            else:
                # score is 0 if the data cannot be downloaded
                data_reference[dr]['a_score'] = 0
    if data_reference[dr]['a_score'] != 0:
        data_reference[dr]['a_score'] = 5
        # type of object is different from availability check
        if 'do_type' in data_reference[dr].keys() and data_reference[dr]['do_type'] != data_reference[dr]['ref_content']:
            data_reference[dr]['a_score'] -= 1
        # size of object is different from availability check
        if 'do_size' in data_reference[dr].keys() and data_reference[dr]['do_size'] != data_reference[dr]['ref_size']:
            data_reference[dr]['a_score'] -= 1
        # the downloaded file should exist; read its name from this record
        # ('do_file') because data_object may belong to an earlier row
        do_file = data_reference[dr].get('do_file', "")
        if do_file != "" and not Path(do_file).is_file():
            data_reference[dr]['a_score'] = 0



Article Link: https://doi.org/10.1039/d0cp00793e
Search for: Data Name: Dataset for 'The electronic structure, surface properties, and in situ N2O decomposition of mechanochemically synthesised LaMnO3' data link: https://doi.org/10.5258/soton/d1342
trying to recover object metadata from https://doi.org/10.5258/soton/d1342
got something back
resource url https://data.crosscite.org/10.5258%2Fsoton%2Fd1342
{'resource_url': 'https://data.crosscite.org/10.5258%2Fsoton%2Fd1342', 'type': 'application/vnd.citationstyles.csl+json; charset=utf-8', 'metadata': {'type': 'dataset', 'id': 'https://doi.org/10.5258/soton/d1342', 'author': [{'family': 'Wells', 'given': 'Peter'}, {'family': 'Tierney', 'given': 'George'}, {'family': 'Rivas', 'given': 'Maria Elena'}, {'family': 'Mohammed', 'given': 'Khaled'}, {'family': 'Decarolis', 'given': 'Donato'}, {'family': 'Hayama', 'given': 'Shu'}, {'family': 'Venturini', 'given': 'Federica'}, {'family': 'Held', 'given': 'Georg'}, {'family': 'Arrigo', 'given': 'Ro

got something back
resource url https://data.crosscite.org/10.5517%2Fccdc.csd.cc25rk9t
{'resource_url': 'https://data.crosscite.org/10.5517%2Fccdc.csd.cc25rk9t', 'type': 'application/vnd.citationstyles.csl+json; charset=utf-8', 'metadata': {'type': 'dataset', 'id': 'https://doi.org/10.5517/ccdc.csd.cc25rk9t', 'categories': ['Crystal Structure', 'Experimental 3D Coordinates', 'Crystal System', 'Space Group', 'Cell Parameters', 'Crystallography', "{2,2'-[(2,2-dimethylpropane-1,3-diyl)bis(P,P-diphenylphosphorimidoyl)]bis(4,6-di-t-butylphenolato)}-(2-methylpropan-2-olato)-indium tetrahydrofuran solvate"], 'language': 'en', 'author': [{'literal': 'Yuntawattana, Nattawut'}, {'literal': 'McGuire, Thomas M.'}, {'literal': 'Durr, Christopher B.'}, {'literal': 'Buchard, Antoine'}, {'literal': 'Williams, Charlotte K.'}], 'issued': {'date-parts': [[2020]]}, 'abstract': 'Related Article: Nattawut Yuntawattana, Thomas M. McGuire, Christopher B. Durr, Antoine Buchard, Charlotte K. Williams|2020|Cat.S

got something back
resource url https://data.crosscite.org/10.5517%2Fccdc.csd.cc27l8ty
{'resource_url': 'https://data.crosscite.org/10.5517%2Fccdc.csd.cc27l8ty', 'type': 'application/vnd.citationstyles.csl+json; charset=utf-8', 'metadata': {'type': 'dataset', 'id': 'https://doi.org/10.5517/ccdc.csd.cc27l8ty', 'categories': ['Crystal Structure', 'Experimental 3D Coordinates', 'Crystal System', 'Space Group', 'Cell Parameters', 'Crystallography', 'catena-[(mu-5,5-dimethyl-14,17,20,23-tetraoxa-3,7-diazatricyclo[22.3.1.19,13]nonacosa-1(28),9(29),10,12,24,26-hexaene-28,29-diolato)-(mu-acetato)-sodium-zinc(ii) acetone solvate]'], 'language': 'en', 'author': [{'literal': 'Lindeboom, Wouter'}, {'literal': 'Fraser, Duncan A. X.'}, {'literal': 'Durr, Christopher B.'}, {'literal': 'Williams, Charlotte K.'}], 'issued': {'date-parts': [[2021]]}, 'abstract': 'Related Article: Wouter Lindeboom, Duncan A. X. Fraser, Christopher B. Durr, Charlotte K. Williams|2021|Chem.-Eur.J.|27|12224|doi:10.1002/chem

In [5]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, do_refs)
    

## Interoperable
Access to a resource does not guarantee interoperability. A resource is interoperable if the data is stored in a format that is easy for both humans and machines to interpret, so an object in an open format is more interoperable than an object in a proprietary format.

### Interoperability score
The interoperability score is defined along the lines of [5 Star Open Data](https://5stardata.info/en/), using the first three levels. The definition is relaxed by omitting the requirement to publish under an open license for interoperability (it is used below for reusability). The maximum score is 3, and it is assigned as follows:

- 1 if the data object is available on the Web (whatever format). 
- 2 if the data object is available as structured data (e.g., Excel instead of image scan of a table)
- 3 if the data object is available in a non-proprietary open format (e.g., CSV instead of Excel)
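
The three levels above can be sketched as a lookup on the file extension. The mapping below is a shortened, hypothetical version of the extension lists used in this notebook (which treats CIF as an open format), and returning `None` for an unrecognised extension is an assumption that stands for "left unscored".

```python
from pathlib import Path

# Hypothetical, shortened mapping from file extension to interoperability level.
LEVEL_BY_EXTENSION = {
    ".pdf": 1, ".doc": 1, ".docx": 1, ".jpg": 1, ".tif": 1, ".zip": 1,
    ".xls": 2, ".xlsx": 2,
    ".csv": 3, ".cif": 3,
}


def interoperability_level(file_name):
    """Return 1, 2 or 3 for a known extension, or None if it is not listed."""
    # Compare the extension only, so a name such as 'specific.pdf' is not
    # mistaken for a CIF file just because it contains the letters 'cif'.
    return LEVEL_BY_EXTENSION.get(Path(file_name).suffix.lower())


print(interoperability_level("table.csv"))
print(interoperability_level("specific.pdf"))
```

Matching on the suffix rather than on any substring of the name avoids promoting a file to a higher level by accident.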

In [6]:
# get names and links for data object references
data_reference, _ = csvh.get_csv_data(do_refs, 'num')
# check if files actually exist, and then assess if they are of type 1, 2, or 3
level_1_types = ['.doc','.pdf','.tif','.jpg', '.docx', '.ppt', '.pptx', '.mp4', '.mpg', '.htm', '.html', '.zip','.avi']
level_2_types = ['.xls','.xlsx']
level_3_types = ['.csv','.cif']
for dr in tqdm_notebook(data_reference):
    do_file_name = data_reference[dr]['do_file']
    if do_file_name != "" and not Path(do_file_name).is_file():
        data_reference[dr]['file_missing'] = 'TRUE'
    elif do_file_name != "":
        data_reference[dr]['file_missing'] = 'FALSE'
        data_reference[dr]['file_size'] = Path(do_file_name).stat().st_size
    # match on the file extension only, so that names which merely contain
    # 'cif' or 'xls' (e.g. 'specific.pdf') are not given a higher level
    extension = Path(do_file_name).suffix.lower()
    if extension in level_1_types:
        data_reference[dr]['i_score'] = 1
    elif extension in level_2_types:
        data_reference[dr]['i_score'] = 2
    elif extension in level_3_types:
        data_reference[dr]['i_score'] = 3

            


In [7]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, do_refs)

## Reusable
Finding, retrieving and interpreting an object is not all there is. For the resource to be reusable it needs to be a) licensed for use and b) in an appropriate format to guarantee long-term support (closely related to the criteria for interoperability).

### Reusability score
The reusability score is also based on the [5 Star Open Data](https://5stardata.info/en/) levels, using the requirement for open licenses and the requirements for using identifiers and links to other data. In this case the score adds up to 3 points; one point is added for each of the following cases:
- 1 if the data object is available under an open license.
- 1 if identifiers (URI, DOI) are used to denote things, so that people can point at them.
- 1 if the data object is linked to other data to provide context.
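
The three criteria above can be sketched as a sum of independent checks. The function name, its parameters and the one-entry license set are hypothetical and serve only to show how the points add up.

```python
def reusability_score(license_url, has_identifier, is_linked, open_licenses):
    """Add one point for each reusability criterion that is met."""
    score = 0
    if license_url in open_licenses:   # published under a known open license
        score += 1
    if has_identifier:                 # denoted by a URI or DOI
        score += 1
    if is_linked:                      # linked to other data for context
        score += 1
    return score


# Hypothetical example set; the notebook keeps a longer list of license URLs.
OPEN = {"https://creativecommons.org/licenses/by/4.0/"}
print(reusability_score("https://creativecommons.org/licenses/by/4.0/", True, False, OPEN))
```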


In [8]:
data_reference, _ = csvh.get_csv_data(do_refs, 'num')
# for doi marked data objects the information about license, identification
# and linking can be obtained by looking at the DOI metadata
# for supplementary data, we try to get the DOI metadata for the parent
# publication and assign the same license to the DO.
open_licenses = ['http://pubs.acs.org/page/policy/authorchoice_ccby_termsofuse.html',
                 'http://creativecommons.org/licenses/by/3.0/',
                 'http://creativecommons.org/licenses/by/4.0',
                 'https://creativecommons.org/licenses/by/4.0',
                 'http://creativecommons.org/licenses/by/4.0/',
                 'http://creativecommons.org/licenses/by-nc/3.0/',
                 'http://doi.wiley.com/10.1002/tdm_license_1',
                 'https://creativecommons.org/licenses/by/4.0/']

not_open_licenses = ['http://www.springer.com/tdm',
                     'http://onlinelibrary.wiley.com/termsAndConditions#vor',
                     'http://doi.wiley.com/10.1002/tdm_license_1.1',
                     'http://rsc.li/journals-terms-of-use',
                     'http://pubs.acs.org/page/policy/authorchoice_termsofuse.html',
                     'http://onlinelibrary.wiley.com/termsAndConditions',
                     'https://www.elsevier.com/tdm/userlicense/1.0/',
                     'http://www.sciencemag.org/about/science-licenses-journal-article-reuse']

def license_is_open(a_license):
    # ask the user to classify any license that is not in either list yet
    if a_license not in open_licenses and a_license not in not_open_licenses:
        assigned_lt = False
        while not assigned_lt:
            print(a_license)
            print('Assign license type:')
            print('\ta) Open')
            print('\tb) Not Open')
            print('\tSelect a or b:')
            lts = input()
            if lts == "a":
                open_licenses.append(a_license)
                assigned_lt = True
            elif lts == "b":
                not_open_licenses.append(a_license)
                assigned_lt = True
    return a_license in open_licenses
    
for dr in tqdm_notebook(data_reference):
    do_file_name = data_reference[dr]['do_file']
    if data_reference[dr]['r_score'] == "":
        data_reference[dr]['r_score'] = 0
        if data_reference[dr]['license'] != "":
            #print(data_reference[dr])
            data_reference[dr]['r_score'] += 1
        else:
            print("assing same license as publication")
            # Use publication DOI metadata copyright field
            doi_link = "https://doi.org/" + data_reference[dr]['doi']
            data_object = urlh.getObjectMetadata(doi_link)
            #print(data_object)
            if data_object != {}:
                #print(data_object['resource_url'], data_object['type'], data_object['metadata'])
                if 'license' in data_object['metadata']:
                    #print(str(type(data_object['metadata']['license'])))
                    if isinstance(data_object['metadata']['license'], list):
                        for license_item in data_object['metadata']['license']:
                            this_license = license_item['URL']
                            if license_is_open(this_license):
                                data_reference[dr]['r_score'] += 1
                            if data_reference[dr]['license'] == "":
                                data_reference[dr]['license'] = this_license
                            else:
                                data_reference[dr]['license'] += ", " + this_license
                    else: 
                        this_license = data_object['metadata']['license']['URL']
                        if license_is_open(this_license):
                            data_reference[dr]['r_score'] += 1
                        data_reference[dr]['license']=this_license
        # the resource is linked
        # this is a very relaxed view equating identifier to any link!
        if data_reference[dr]['user_mined']=='FALSE':
           data_reference[dr]['r_score'] += 1
        else:
           print("this one is not linked")
        # the link works 
        if data_reference[dr]['user_mined']=='FALSE' and \
           data_reference[dr]['ret_code'] in ['200', '301','302','303']:
            data_reference[dr]['r_score'] += 1
        else:
            print("the link does not work")
        #print(data_reference[dr])


assing same license as publication
trying to recover object metadata from https://doi.org/10.1039/d0cp00793e
got something back
resource url https://api.crossref.org/v1/works/10.1039%2Fd0cp00793e/transform
this one is not linked
the link does not work
assing same license as publication
trying to recover object metadata from https://doi.org/10.1021/jacs.0c07980
got something back
resource url https://api.crossref.org/v1/works/10.1021%2Fjacs.0c07980/transform
this one is not linked
the link does not work
assing same license as publication
trying to recover object metadata from https://doi.org/10.1021/jacs.0c07980
got something back
resource url https://api.crossref.org/v1/works/10.1021%2Fjacs.0c07980/transform
this one is not linked
the link does not work
assing same license as publication
trying to recover object metadata from https://doi.org/10.1021/jacs.0c07980
got something back
resource url https://api.crossref.org/v1/works/10.1021%2Fjacs.0c07980/transform
this one is not linked
the

In [None]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, do_refs)

In [None]:
# normalise the names of CCDC data objects and print the entries that still use the PDF file name
data_reference, _ = csvh.get_csv_data(do_refs, 'num')
for dr in tqdm_notebook(data_reference):
    file_name = data_reference[dr]['file']
    do_name = data_reference[dr]['name']
    do_url = data_reference[dr]['data_url']
    if file_name == do_name and 'ccdc' in do_url:
        do_name = "CCDC "+ do_name[:-4] + ": Crystal Structure"
        data_reference[dr]['name'] = do_name
    elif file_name == do_name and do_name[-3:].lower() == 'pdf':
        print(file_name)
    elif 'ccdc' in do_url:
        print(data_reference[dr]['ref_redirect'])
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, do_refs)  

dois_list = []
for dr in tqdm_notebook(data_reference):
    doi_id = data_reference[dr]['doi']
    file_name = data_reference[dr]['file']
    
    if not doi_id in dois_list:
        dois_list.append(doi_id)


In [None]:
len(dois_list)


In [None]:
print(dois_list)

In [None]:
files = ['aic.15095.pdf','aic.15415.pdf','aic.16687.pdf','ange.201703550.pdf','anie.201602930.pdf',
         'anie.201609557.pdf','anie.201612370.pdf','anie.201705753.pdf','anie.201710091.pdf','ange.201713115.pdf',
         'anie.201801400.pdf','cbic.201800606.pdf','cctc.201500545.pdf','cctc.201501260.pdf','cctc.201600729.pdf',
         'cctc.201600775.pdf','cctc.201600925.pdf','cctc.201601603.pdf','cctc.201700516.pdf','cctc.201701840.pdf',
         'cctc.201701946.pdf','cctc.201801067.pdf','cctc.201801299.pdf','cctc.201900100.pdf','cctc.201900658.pdf',
         'cctc.201900795.pdf','cctc.201901166.pdf','cctc.201901268.pdf','cctc.201901955.pdf','celc.201800052.pdf',
         'celc.201800478.pdf','celc.201800729.pdf','celc.201800770.pdf','chem.201605690.pdf','chem.201700496.pdf',
         'chem.201701013.pdf','chem.201703567.pdf','chem.201704151.pdf','chem.201805250.pdf','chem.201901188.pdf',
         'cphc.201600149.pdf','cplu.201500195.pdf','Adams_Metal_Oxide_Catalysts_for_Solar_Driven_Water_Splitting.pdf',
         'cssc.201501225.pdf','cssc.201501264.pdf','ejoc.201601388.pdf','Smart anime donors.pdf',
         'Bowker2015_Article_ThePhotocatalyticWindowPhoto-R.pdf','Hellier2018_Article_VOxFe2O3ShellCoreCatalystsForT.pdf',
         'Celorrio2018_Article_AMnO3ASrLaCaYPerovskiteOxidesA.pdf','Decarolis2018_Article_EffectOfParticleSizeAndSupport.pdf',
         'Greenaway2018_Article_OperandoSpectroscopicStudiesOf.pdf','s11244-018-0893-6.pdf',
         'Locke2018_Article_CatalysisOfTheOxygenEvolutionR.pdf','Zachariou2020_Article_TheEffectOfCo-feedingMethylAce.pdf',
         '1-s2.0-S0926860X17301485-main.pdf','1-s2.0-S0926860X18305817-main.pdf','1-s2.0-S0926337314006043-main.pdf',
         '1-s2.0-S0926337315002192-main.pdf','1-s2.0-S0926337316310025-main.pdf','1-s2.0-S0926337318306167-main.pdf',
         '1-s2.0-S0926337318307136-main.pdf','1-s2.0-S092633731930400X-main.pdf','1-s2.0-S0968089618313233-main.pdf',
         '1-s2.0-S0008622316306182-main.pdf','1-s2.0-S0920586118316456-main.pdf','1-s2.0-S0009250915002857-main.pdf',
         '1-s2.0-S0013468617312628-main.pdf','1-s2.0-S0013468618315172-main.pdf','1-s2.0-S2352152X18303815-main.pdf',
         '1-s2.0-S0360319916304268-main.pdf','1-s2.0-S0360319916324387-main.pdf','1-s2.0-S0360319916303135-main.pdf',
         '1-s2.0-S0021951714000876-main.pdf','1-s2.0-S0021951716001135-main.pdf','1-s2.0-S0021951718302124-main.pdf',
         '1-s2.0-S0021951719301459-main.pdf','1-s2.0-S0959652616320212-main.pdf','1-s2.0-S1572665717306963-main.pdf',
         '1-s2.0-S1572665718301826-main.pdf','1-s2.0-S1387181117307965-main.pdf','1-s2.0-S0039602816301509-main.pdf',
         '1-s2.0-S0039602816301996-main.pdf','acs.biochem.8b00169.pdf','acs.chemmater.5b00866.pdf',
         'acs.chemmater.7b02552.pdf','acs.iecr.8b00230.pdf','acs.iecr.9b04263.pdf','acs.inorgchem.5b02038.pdf',
         'acs.inorgchem.5b02233.pdf','acs.jcim.8b00940.pdf','acs.jctc.6b01131.pdf','acs.jpcc.6b04781.pdf',
         'acs.jpcc.6b11186.pdf','acs.jpcc.8b08420.pdf','acs.jpcc.9b05475.pdf','acs.macromol.5b00225.pdf',
         'acs.macromol.5b01293.pdf','acs.nanolett.9b01733.pdf','acs.organomet.8b00063.pdf','acsaem.8b00873.pdf',
         'acsami.6b02863.pdf','acscatal.0c00414.pdf','acscatal.0c00596.pdf','acscatal.5b00480.pdf',
         'acscatal.5b00481.pdf','acscatal.5b00625.pdf','acscatal.5b00754.pdf','acscatal.5b01327.pdf',
         'acscatal.5b01936.pdf','acscatal.6b00589.pdf','acscatal.6b00982.pdf','acscatal.6b02369.pdf',
         'acscatal.6b03190.pdf','acscatal.6b03237.pdf','acscatal.6b03641.pdf','acscatal.7b03805.pdf',
         'acscatal.8b00389.pdf','acscatal.8b01509.pdf','acscatal.8b02232.pdf','acscatal.8b03169.pdf',
         'acscatal.8b04564.pdf','acscatal.9b00160.pdf','acscatal.9b00685.pdf','acscatal.9b01820(1).pdf',
         'acscatal.9b05129.pdf','acsnano.8b09399.pdf','acsomega.9b03351.pdf','acsomega.9b03503.pdf',
         'acssuschemeng.8b03568.pdf','cm503433q.pdf','cs400683e.pdf','cs502038y.pdf','ja5062467.pdf',
         'ja512868a.pdf','jacs.5b09913.pdf','jacs.5b13070.pdf','jacs.6b00710.pdf','jacs.7b12621.pdf',
         'jacs.8b01920.pdf','jacs.9b02731.pdf','jp5081753.pdf','nn500963m.pdf','om5008055.pdf','om501252m.pdf',
         'nature16935.pdf','s41467-018-03138-7.pdf','s41467-020-15445-z.pdf','s41563-019-0562-6.pdf',
         's41563-020-0800-y.pdf','s41570-016-0002.pdf','s41589-018-0154-9.pdf','s41929-018-0197-z.pdf',
         's41929-018-0206-2.pdf','s41929-018-0213-3.pdf','s41929-019-0334-3.pdf','srep39392.pdf',
         'C4CC04024D.pdf','C4CP00753K.pdf','C4CP04693E.pdf','C4DT01309C.pdf','C4RA16127K.pdf','C4SC00545G.pdf',
         'C5CC04188K.pdf','C5CC06118K.pdf','C5CC08223D.pdf','C5CC08681G.pdf','C5CC08714G.pdf','C5CC08956E.pdf',
         'C5CC09780K.pdf','C5CP02512E.pdf','C5CY00732A.pdf','C5CY01175B.pdf','C5CY01650A.pdf','C5CY01726B.pdf',
         'C5CY02072G.pdf','C5RA19197A.pdf','C5SC03494A.pdf','C5TA08709K.pdf','C5TA10283A.pdf','C6CC01599A.pdf',
         'C6CP01209D.pdf','C6CP01311B.pdf','C6CP01494A.pdf','C6CY01105E.pdf','C6CY01129B.pdf','C6DT03565E.pdf',
         'C6FD00189K.pdf','C6GC01288D.pdf','C6ME00061D.pdf','C6RE00140H.pdf','C6SC04130B.pdf','C6TA00293E.pdf',
         'C6TB01774F.pdf','C7CP04144F.pdf','C7CY00184C.pdf','C7CY00798A.pdf','C7CY00875A.pdf','C7CY01553D.pdf',
         'C7DT01022B.pdf','C7DT02167D.pdf','c7dt04805j.pdf','C7FD00159B.pdf','C7FD00216E.pdf','C7FD00221A.pdf',
         'C7TA10892C.pdf','C8CC01880D.pdf','C8CC07444E.pdf','C8CP01022F.pdf','c8cp06736h.pdf','C8CY01483C.pdf',
         'C8DT04638G.pdf','C8DT05051A.pdf','C8FD00002F.pdf','C8FD00005K.pdf','C8NJ03632B.pdf','C8OB00066B.pdf',
         'C8SC03312A.pdf','C8TA02908C.pdf','C8TA12263F.pdf','C9CC02088H.pdf','C9CC02459J.pdf','C9CP00826H.pdf',
         'c9cp05476f.pdf','c9cy01679a.pdf','C9CY02371B.pdf','c9cy02473e.pdf','C9DT01634A.pdf','C9DT03590G.pdf',
         'C9NA00159J.pdf','C9NR04553H.pdf','C9RA03568K.pdf','C9SC03374B.pdf','C9SC04905C.pdf','C9SE01103J.pdf',
         'D0CP00032A.pdf','D0CP00704H.pdf','D0CP01196G.pdf','d0cy00036a.pdf','d0dt00007h.pdf','D0RA03871G.pdf',
         'd0sc01317j.pdf','D0SC01924K.pdf','D0SC02152K.pdf','d0sc02253e.pdf',
         'Silverwood_etal_2016_towards_microfluidic_reactors_for_in_situ_synchrotron_infrared_studies.pdf',
         'rspa.2016.0054.pdf','rspa.2016.0126.pdf','1399.full.pdf',
         'Antimicrobial Agents and Chemotherapy-2019-Tooke-e00564-19.full.pdf',
         'Development and characterization of thermally stable supported V W TiO2 catalysts for mobile NH3 SCR applications.pdf',
         'surfaces-02-00001.pdf','2190-4286-10-191.pdf']
file_sizes = {}
for file_name in files:
    pdf_file = "pdf_files/"+file_name
    if Path(pdf_file).is_file():
        print (file_name, Path(pdf_file).stat().st_size)

In [None]:
file_types = [['origin', 'excel', 'powerpoint'],['pdf'],['pdf'],['excel'],['pdf'],['pdf'],
              ['pdf'],['text', 'tar gz'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['video mpeg'],
              ['video mpeg'],['pdf'],['pdf'],['pdf'],['pdf'],['text', 'excel', 'zip','tif'],['pdf'],['pdf'],
              ['pdf'],['pdf'],[],['pdf'],['pdf'],['raw/nexus'],[],['pdf'],['pdf'],['doc'],['doc'],['pdf'],
              ['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],[],['doc'],['pdf'],['pdf'],['pdf'],['pdf'],
              ['zip'],['pdf'],['pdf'],['video avi'],['pdf'],['zip'],['pdf'],['pdf'],['video mp4'],['pdf'],
              ['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['excel'],['pdf'],['pdf'],
              ['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['cif'],['pdf'],['pdf'],['pdf'],['pdf'],
              ['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['image gif'],['raw/nexus','processed'],['pdf'],
              ['pdf'],['pdf'],['doc'],['doc'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],
              ['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['image gif'],['image gif'],['image gif'],
              ['image tif'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],
              ['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['powerpoint'],
              ['powerpoint'],['powerpoint'],['powerpoint'],['powerpoint'],['pdf'],['pdf'],['pdf'],['zip'],
              ['zip'],['pdf'],['pdf'],['cif'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['cif'],['pdf'],
              ['pdf'],['cif'],['html'],['pdf'],['pdf'],['cif'],['pdf'],['pdf'],['pdf'],['cif'],['pdf'],
              ['pdf'],['cif'],['cif'],['cif'],['cif'],['cif'],['cif'],['cif'],['cif'],['pdf'],['vamas', 'excel'],
              ['pdf'],['pdf'],['pdf'],['image jpg'],['html'],['html'],['html'],['html'],['html'],['html'],
              ['html'],['html'],['html'],['html'],['powerpoint'],['powerpoint'],['powerpoint'],['powerpoint'],
              ['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['cif'],['pdf'],['pdf'],['pdf'],
              ['video mp4'],['video mp4'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],
              ['text','origin', 'gatan dm4','tif','opus','compiled xas'],['raw/nexus'],['pdf'],['pdf'],['pdf'],
              ['pdf'],[],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['doc'],['pdf'],['excel','text'],
              ['pdf'],['pdf'],['athena','raw/tem'],['text','zip'],['pdf'],['zip'],['pdf'],['pdf'],['pdf'],['pdf'],
              ['doc'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['doc'],['pdf'],['pdf'],['pdf'],['pdf'],
              ['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],
              [ 'excel','origin', 'image jpg','athena project'],['pdf'],['pdf'],['pdf'],
              ['Origin', 'text', 'athena project','PNG'],['pdf'],['pdf'],['pdf'],[],['pdf'],['pdf'],['pdf'],['pdf'],
              ['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],['pdf'],[],['pdf'],['cif'],['cif'],['cif'],
              ['cif'],['xyz','excel','inp','shell','python'],['pdf'],['excel'],['excel'],['excel'],['excel'],
              ['excel'],['excel'],['excel'],['excel'],['excel'],['excel'],['excel'],['pdf'],['pdf'],['pdf'],
              ['zip'],['pdf'],['pdf'],['zip'],['pdf'],['pdf'],[],['pdf'],['pdf'],['doc'],['doc'],['doc'],['doc'],
              ['doc'],['doc'],['doc'],['doc'],['doc'],['doc'],['doc'],['pdf'],['doc'],['doc'],['doc'],['doc'],
              ['doc'],['doc'],['doc'],['doc'],['doc'],['doc'],['doc'],['doc'],['pdf'],['cif'],['doc'],['zip'],
              ['excel'],['excel'],[],['pdf'],['pdf'],['pdf'],['pdf']]

# count how many times each file type appears across all data objects
ft_summary = {}

for fts in file_types:
    for ft in fts:
        ft_summary[ft] = ft_summary.get(ft, 0) + 1

In [None]:
ft_summary
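
The tally above could also be sketched with `collections.Counter`. This is a minimal sketch, not part of the original analysis: `summarise_file_types`, its `lowercase` flag and the sample list are hypothetical names and made-up data used only for illustration. The optional lowercasing is an assumption about intent, since labels such as 'Origin' and 'origin' are otherwise counted separately.

```python
from collections import Counter

# Made-up sample in the same shape as file_types (one list of labels per data object)
sample_file_types = [['pdf'], ['pdf'], ['cif'], [], ['Origin', 'text'], ['origin'], ['pdf']]

def summarise_file_types(type_lists, lowercase=False):
    """Count how often each file type label occurs across all data objects."""
    labels = (ft for fts in type_lists for ft in fts)
    if lowercase:
        # assumption: labels differing only in case refer to the same file type
        labels = (ft.lower() for ft in labels)
    return Counter(labels)

sample_summary = summarise_file_types(sample_file_types)
sample_summary_merged = summarise_file_types(sample_file_types, lowercase=True)
print(sample_summary.most_common())
print(sample_summary_merged.most_common())
```

Calling `summarise_file_types(file_types)` on the full list should give the same counts as `ft_summary`, with `most_common()` returning them sorted by frequency.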

In [None]:
from IPython.display import display, HTML
display(HTML('<h1>Hello, world!</h1>'))

In [None]:
from IPython.display import IFrame

IFrame(src='do_files/41467_2021_21062_MOESM2_ESM.pdf', width=700, height=600)

In [None]:
url = 'http://api.scholexplorer.openaire.eu/v2/Links?sourcePublisher=Cambridge%20Crystallographic%20Data%20Centre&page='
# request the first 100 result pages; response keeps only the last page fetched
for i in range(0, 100):
    response = urlh.getPageFromURL(url + str(i))

In [None]:
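from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

# open the supplementary PDF and read its document-level metadata;
# the with block closes the file once the metadata has been read
with open('do_files/41467_2021_21062_MOESM2_ESM.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    print(doc.info)  # The "Info" metadata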

In [None]:
pdf_info = doc.info

In [None]:



# inspect a single link (the second result) of a parsed results page;
# data_results must already hold the JSON loaded from a Scholexplorer response
a_result = data_results['result'][1]

ds_doi = a_result['source']['Identifier'][0]['ID']
ds_title = a_result['source']['Title']
ds_published = a_result['source']['PublicationDate']

pub_doi = a_result['target']['Identifier'][0]['ID']
pub_title = a_result['target']['Title']
pub_published = a_result['target']['PublicationDate']

a_dl = {"ds_doi":ds_doi,'ds_title':ds_title, 'ds_published':ds_published, 'pub_doi':pub_doi, 
        'pub_title':pub_title, 'pub_published': pub_published}

In [None]:
print(ds_doi,ds_title, ds_published, pub_doi, pub_title, pub_published)
print (a_dl)

In [None]:
data_links = []
a_dl = {}
for a_result in data_results['result']:
    ds_doi = a_result['source']['Identifier'][0]['ID']
    ds_title = a_result['source']['Title']
    ds_published = a_result['source']['PublicationDate']

    pub_doi = a_result['target']['Identifier'][0]['ID']
    pub_title = a_result['target']['Title']
    pub_published = a_result['target']['PublicationDate']
    a_dl = {"ds_doi":ds_doi,'ds_title':ds_title, 'ds_published':ds_published, 'pub_doi':pub_doi, 
        'pub_title':pub_title, 'pub_published': pub_published}
    data_links.append(a_dl)
print (len(data_links))

In [None]:
import json
# library for handling url searches
import lib.handle_urls as urlh
url_base = 'http://api.scholexplorer.openaire.eu/v2/Links?sourcePublisher=Cambridge%20Crystallographic%20Data%20Centre&page='
data_links = []
a_dl = {}
for i in range(0,100):
    print (url_base + str(i))
    response = urlh.getPageFromURL(url_base + str(i))
    data_results = json.loads(response)

    for a_result in data_results['result']:
        ds_doi = a_result['source']['Identifier'][0]['ID']
        ds_title = a_result['source']['Title']
        ds_published = a_result['source']['PublicationDate']

        pub_doi = a_result['target']['Identifier'][0]['ID']
        pub_title = a_result['target']['Title']
        pub_published = a_result['target']['PublicationDate']
        a_dl = {"ds_doi":ds_doi,'ds_title':ds_title, 'ds_published':ds_published, 'pub_doi':pub_doi, 
            'pub_title':pub_title, 'pub_published': pub_published}
        data_links.append(a_dl)
    print (len(data_links))
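
The field extraction repeated in the cells above indexes `['Identifier'][0]` directly, which would raise an error for a record with no identifiers. A more defensive version could be sketched as follows. This is only an illustration: `extract_link` is a hypothetical helper, the key names are the ones already used in this notebook, and the sample record is made up rather than taken from the API.

```python
def extract_link(a_result):
    """Flatten one link record into a dictionary of DOIs, titles and dates.

    Missing fields are returned as None instead of raising an error.
    """
    record = {}
    for prefix, side in (('ds', 'source'), ('pub', 'target')):
        entity = a_result.get(side) or {}
        identifiers = entity.get('Identifier') or []
        record[prefix + '_doi'] = identifiers[0].get('ID') if identifiers else None
        record[prefix + '_title'] = entity.get('Title')
        record[prefix + '_published'] = entity.get('PublicationDate')
    return record

# Made-up record shaped like the entries of data_results['result']
sample_result = {
    'source': {'Identifier': [{'ID': '10.0000/example.dataset'}],
               'Title': 'Example dataset',
               'PublicationDate': '2020-01-01'},
    'target': {'Identifier': [], 'Title': None},
}

sample_link = extract_link(sample_result)
print(sample_link)
```

With such a helper, the body of the inner loop would reduce to `data_links.append(extract_link(a_result))`.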

In [None]:
# index the links from 1 and flatten line breaks in publication titles
data_links_dic = {}
for i, a_link in enumerate(data_links, start=1):
    if a_link['pub_title'] is not None:
        a_link['pub_title'] = a_link['pub_title'].replace('\n', ' ')
    data_links_dic[i] = a_link

if len(data_links_dic) > 0:
    csvh.write_csv_data(data_links_dic, "ccdc_scholix2.csv")

In [None]:
data_results

In [None]:
response = urlh.getPageFromURL("https://api.eventdata.crossref.org/v1/events?scholix=true&subj-id.prefix=10.5517")
response

In [None]:
data_results = json.loads(response)

In [None]:
data_results['message']['events'][0]

In [None]:
len(data_results['message']['events'])

In [None]:
import requests

response = requests.get('https://doi.org/10.5517/ccdc.csd.cc14dg66',  headers={"Accept": "text/bibliography; style=bibtex"})


In [None]:
response.content

In [None]:
# library containing read and write functions to csv file
import lib.handle_csv as csvh

# library for handling url searches
import lib.handle_urls as urlh

import json
url_base = 'http://api.scholexplorer.openaire.eu/v2/Links?sourcePid='
doi_list = ['10.1002/chem.202000067', '10.1016/j.jcat.2018.01.033', '10.1021/acscatal.9b03889', 
            '10.1039/d0cp01227k', '10.1039/d0cy01061h', '10.1098/rsta.2020.0058', '10.1098/rsta.2020.0063', 
            '10.1039/D0CY01608J', '10.1021/acs.est.0c04279', '10.1039/D0CP01192D', '10.1039/d0cy01779e', 
            '10.1021/acsenergylett.0c02614', '10.1039/d1fd00004g', '10.3390/catal10121370', '10.1039/d1gc00901j', '10.1038/s41467-021-21062-1', '10.1021/acscatal.0c05413', '10.1021/acscatal.0c04858', '10.1088/1361-648x/abfe16', '10.1088/1361-6463/abe9e1', '10.1039/d0sc03113e', '10.1007/s11244-021-01447-8', '10.1021/acs.organomet.1c00055', '10.1021/acscatal.0c05019', '10.1021/acs.inorgchem.1c00327', '10.1002/smsc.202100032', '10.1039/d0gc02295k', '10.1002/anie.201901592', '10.1021/acs.organomet.9b00845', '10.1021/jacs.9b13106', '10.1002/anie.202006807', '10.1021/jacs.0c07980', '10.1039/d0cy01484b', '10.1039/d0cy02164d', '10.1002/anie.202101180', '10.1002/chem.202101140', '10.1021/acsmacrolett.1c00216', '10.1002/anie.201810245', '10.1039/c9sc00385a', '10.1021/acs.macromol.8b01224', '10.1039/c9dt02918d', '10.1038/s41467-019-10481-w', '10.1002/ange.201901592', '10.1039/c9dt00595a', '10.1039/d1cy00238d', '10.1021/acs.inorgchem.8b02923', '10.1002/ange.202006807', '10.1002/anie.201814320', '10.1007/s10562-019-02876-7', '10.1021/acs.jpcc.9b09050', '10.1016/j.apcatb.2017.01.042', '10.1039/d0cc04036c', '10.1002/anie.202015016', '10.1039/d1ta01464a', '10.1002/smtd.202100512', '10.1107/s1600576720013576', '10.1039/d0cp00793e', '10.1039/d0ta01398f', '10.1007/s11244-021-01450-z', '10.1039/d0ta08351h', '10.1021/acssuschemeng.1c01451', '10.1002/cphc.201800721', '10.1021/acssuschemeng.8b04073', '10.1002/cctc.202100286', '10.1007/s11244-020-01245-8', '10.1021/acscatal.0c03620', '10.1016/j.cattod.2018.06.033', '10.1016/j.apcatb.2020.118752', '10.1016/j.joule.2020.07.024', '10.1002/anie.201814381', '10.1002/ange.201902857']
data_links = {}
a_dl = {}
ignore_types = ['References','IsReferencedBy']
for a_doi in doi_list:
    # build the query for this DOI, encoding the slash so it is safe in a URL
    query_url = url_base + a_doi.replace('/','%2f')
    print (query_url)
    response = urlh.getPageFromURL(query_url)
    data_results = json.loads(response)
    id_dl = len(data_links)
    for a_result in data_results['result']:
        if a_result['RelationshipType']['Name'] not in ignore_types:
            id_dl += 1
            source_doi = a_result['source']['Identifier'][0]['ID']
            source_title = a_result['source']['Title']
            source_published = a_result['source']['PublicationDate']

            target_doi = a_result['target']['Identifier'][0]['ID']
            target_title = a_result['target']['Title']
            target_published = a_result['target']['PublicationDate']
            
            rel_type = a_result['RelationshipType']['Name']

            a_dl = {"source_doi":source_doi,'source_title':source_title, 'source_published':source_published,
                    'target_doi':target_doi, 'target_title':target_title, 
                    'target_published': target_published, 'rel_type': rel_type}
            data_links[id_dl]=a_dl
    print (len(data_links))

if len(data_links) > 0:
    csvh.write_csv_data(data_links, "ccdc_scholix2.csv")

In [None]:
len(doi_list)
# library for handling url searches
import lib.handle_urls as urlh
url_base = 'http://api.scholexplorer.openaire.eu/v2/Links?sourcePid='
response = urlh.getPageFromURL(url_base + doi_list[0].replace('/','%2f'))
data_results = json.loads(response)
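
The query above encodes the DOI by replacing '/' with '%2f'. A more general way to build the same URL could use `urllib.parse.quote` from the standard library, which percent-encodes every reserved character. This is a sketch under that assumption; `build_links_url` is a hypothetical helper name, and `url_base` is the value already used in this notebook.

```python
from urllib.parse import quote

url_base = 'http://api.scholexplorer.openaire.eu/v2/Links?sourcePid='

def build_links_url(doi, base=url_base):
    """Return the query URL for a DOI, percent-encoding all reserved characters."""
    # safe='' makes quote() encode '/' as well, which it leaves untouched by default
    return base + quote(doi, safe='')

example_url = build_links_url('10.1002/chem.202000067')
print(example_url)
```

Note that `quote` writes the escape in upper case ('%2F'), which is equivalent to '%2f' in a URL.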

In [None]:

for a_result in data_results['result']:
    if a_result['RelationshipType']['Name'] != 'References':
        ds_doi = a_result['source']['Identifier'][0]['ID']
        ds_title = a_result['source']['Title']
        ds_published = a_result['source']['PublicationDate']

        pub_doi = a_result['target']['Identifier'][0]['ID']
        pub_title = a_result['target']['Title']
        pub_published = a_result['target']['PublicationDate']
        a_dl = {"ds_doi":ds_doi,'ds_title':ds_title, 'ds_published':ds_published, 'pub_doi':pub_doi, 
            'pub_title':pub_title, 'pub_published': pub_published}
        print(a_dl)