# Are Data Objects referenced from publications FAIR?

A list of data objects referenced from a set of publications is checked to determine their FAIRness.

Rather than looking only at references to data, these tests are applied to **data objects**: any data published to complement a publication. This includes raw data, supplementary data, processing data, tables, images, movies, and compilations containing one or more of these resources.

The tests to be performed are aimed at finding out if the data objects are: 
 - **F**indable:  can the object be found easily?
 - **A**ccessible: can the object be retrieved?
 - **I**nteroperable: can the object be accessed programmatically to extract data and metadata?
 - **R**eusable: can the object be readily used?

In [1]:
# Libraries
# library containing functions that read and write to csv files
import lib.handle_csv as csvh
# library for connecting to the db
import lib.handle_db as dbh
# library for handling text matchings
import lib.text_comp as txtc
# library for getting data from crossref
import lib.crossref_api as cr_api
# library for handling url searches
import lib.handle_urls as urlh
# managing files and file paths
from pathlib import Path
# add a progress bar (tqdm.notebook.tqdm replaces the deprecated tqdm.tqdm_notebook)
from tqdm.notebook import tqdm as tqdm_notebook
import tqdm
# library for handling json files
import json
# library for using regular expressions
import re
# library for handling http requests
import requests
# import custom functions (common to various notebooks)
import processing_functions as pr_fns


## Findable

Most of the data objects are assumed to be findable as we already have links to them. However, some are just references to contact the authors or point to repositories without identifying a specific record.
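
The distinction between these cases can be sketched as a small classifier. This is a minimal sketch and not the check used in this notebook: the function name `classify_findability`, the category labels, and the rule that a URL with an empty path points to a repository rather than to a record are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def classify_findability(data_url):
    """Roughly classify a data link (hypothetical helper, illustrative rules)."""
    url = (data_url or "").strip()
    if not url:
        # e.g. "data are available from the authors": there is nothing to resolve
        return "no link"
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return "invalid link"
    if parsed.path.strip("/") == "" and not parsed.query:
        # only a host name: a repository home page, not a specific record
        return "repository only"
    return "specific record"
```

Under these assumptions, `classify_findability("")` falls into the "no link" group, while a DOI link such as `https://doi.org/10.17035/d.2019.0079744472` counts as a specific record.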


In [21]:
# get names and links for references in data mentions
import os

data_reference, _ = csvh.get_csv_data('pub_data_mined.csv', 'num')
for dr in tqdm_notebook(data_reference):
    print("Article Link: https://doi.org/" + data_reference[dr]['doi'])
    ref_name = data_reference[dr]['name']
    ref_link = data_reference[dr]['data_url']
    print("Search for: Data Name:", ref_name, "data link:", ref_link)
    # request the page header for the data link
    head = urlh.getPageHeader(ref_link)

    if head is not None:
        data_reference[dr]['ret_code'] = head.status_code
        data_reference[dr]['resoruce_name'] = os.path.basename(head.url)
        if head.status_code == 200:
            # record content type and size when the server reports them
            if 'content-type' in head.headers:
                data_reference[dr]['ref_content'] = head.headers['content-type']
            if 'content-length' in head.headers:
                data_reference[dr]['ref_size'] = head.headers['content-length']
            data_reference[dr]['ref_redirect'] = head.url
        elif head.status_code in (301, 302):
            # redirect: record the target of the link and its file name
            data_reference[dr]['ref_redirect'] = head.headers['location']
            data_reference[dr]['resoruce_name'] = os.path.basename(head.headers['location'])
        else:
            # any other status code is printed for manual inspection
            print(head, head.headers)
    
    #while ref_name == "":
    #    print('Please enter the name of data object:')
    #    ref_name = input()
    #ref_link = data_mentions[dm]['ref_link']
    #while ref_link == "":
    #    print('Please enter the data object link:')
    #    ref_link = input()
    #data_mentions[dm]['ref_name'] = ref_name
    #data_mentions[dm]['ref_link'] = ref_link
                      

Article Link: https://doi.org/10.1038/s41929-019-0334-3
Search for: Data Name: https://doi.org/10.17035/d.2019.0079744472 data link: https://doi.org/10.17035/d.2019.0079744472
Article Link: https://doi.org/10.1038/s41929-019-0334-3
Search for: Data Name: Supplementary Figs. 1–11, Tables 1–4 and references data link: https://static-content.springer.com/esm/art%3A10.1038%2Fs41929-019-0334-3/MediaObjects/41929_2019_334_MOESM1_ESM.pdf
Article Link: https://doi.org/10.1021/acscatal.9b00685
Search for: Data Name: cs9b00685_si_001.pdf (1.18 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/acscatal.9b00685/suppl_file/cs9b00685_si_001.pdf
Article Link: https://doi.org/10.1021/acscatal.9b00685
Search for: Data Name: 10.25375/uct.8332652 data link: https://www.doi.org/10.25375/uct.8332652
Article Link: https://doi.org/10.1002/cctc.201901268
Search for: Data Name: cctc201901268-sup-0001-misc_information.pdf data link: https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplem

Article Link: https://doi.org/10.1007/s11244-018-0888-3
Search for: Data Name: Supplementary material 1 (DOCX 775 KB) data link: https://static-content.springer.com/esm/art%3A10.1007%2Fs11244-018-0888-3/MediaObjects/11244_2018_888_MOESM1_ESM.docx
Article Link: https://doi.org/10.1007/s11244-018-0887-4
Search for: Data Name: Supplementary material 1 (DOCX 1439 KB) data link: https://static-content.springer.com/esm/art%3A10.1007%2Fs11244-018-0887-4/MediaObjects/11244_2018_887_MOESM1_ESM.docx
Article Link: https://doi.org/10.1021/acsaem.8b00873
Search for: Data Name: ae8b00873_si_001.pdf (4.97 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/acsaem.8b00873/suppl_file/ae8b00873_si_001.pdf
Article Link: https://doi.org/10.1002/celc.201800729
Search for: Data Name: celc201800729-sup-0001-misc_information.pdf data link: https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fcelc.201800729&file=celc201800729-sup-0001-misc_information.pdf
Article Link: htt

Article Link: https://doi.org/10.1021/acs.organomet.8b00063
Search for: Data Name: om8b00063_si_002.xlsx (36.37 kb) data link: https://pubs.acs.org/doi/suppl/10.1021/acs.organomet.8b00063/suppl_file/om8b00063_si_002.xlsx
Article Link: https://doi.org/10.1002/cctc.201701946
Search for: Data Name: cctc201701946-sup-0001-misc_information.pdf data link: https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fcctc.201701946&file=cctc201701946-sup-0001-misc_information.pdf
Article Link: https://doi.org/10.1039/c7cy00798a
Search for: Data Name: Supplementary information PDF (517K) data link: http://www.rsc.org/suppdata/c7/cy/c7cy00798a/c7cy00798a1.pdf
Article Link: https://doi.org/10.1002/anie.201705753
Search for: Data Name: anie201705753-sup-0001-misc_information.pdf data link: https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fanie.201705753&file=anie201705753-sup-0001-misc_information.pdf
Article Link: https://doi.org/10.1002/ejoc.2016

Article Link: https://doi.org/10.1021/acscatal.5b00754
Search for: Data Name: PDF data link: https://pubs.acs.org/doi/suppl/10.1021/acscatal.5b00754/suppl_file/cs5b00754_si_001.pdf
Article Link: https://doi.org/10.1039/c4sc00545g
Search for: Data Name: Supplementary information PDF (591K) data link: http://www.rsc.org/suppdata/sc/c4/c4sc00545g/c4sc00545g1.pdf
Article Link: https://doi.org/10.1039/c4cc04024d
Search for: Data Name: Supplementary information PDF (1520K) data link: http://www.rsc.org/suppdata/cc/c4/c4cc04024d/c4cc04024d1.pdf
Article Link: https://doi.org/10.1039/c4cp04693e
Search for: Data Name: Supplementary information PDF (1017K) data link: http://www.rsc.org/suppdata/cp/c4/c4cp04693e/c4cp04693e1.pdf
Article Link: https://doi.org/10.1039/c4dt01309c
Search for: Data Name: Supplementary information PDF (1060K) data link: http://www.rsc.org/suppdata/dt/c4/c4dt01309c/c4dt01309c1.pdf
Article Link: https://doi.org/10.1021/jp5081753
Search for: Data Name: jp5081753_si_001.pdf 

Article Link: https://doi.org/10.1021/jacs.5b13070
Search for: Data Name: ja5b13070_si_004.zip (8.64 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/jacs.5b13070/suppl_file/ja5b13070_si_004.zip
Article Link: https://doi.org/10.1021/jacs.5b13070
Search for: Data Name: ja5b13070_si_005.zip (20.95 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/jacs.5b13070/suppl_file/ja5b13070_si_005.zip
Article Link: https://doi.org/10.1021/jacs.5b13070
Search for: Data Name: ja5b13070_si_001.pdf (2.28 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/jacs.5b13070/suppl_file/ja5b13070_si_001.pdf
Article Link: https://doi.org/10.1021/jacs.5b13070
Search for: Data Name: ja5b13070_si_004.zip (8.64 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/jacs.5b13070/suppl_file/ja5b13070_si_004.zip
Article Link: https://doi.org/10.1021/jacs.5b13070
Search for: Data Name: ja5b13070_si_005.zip (20.95 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/jacs.5b13070/suppl_file/ja5b13070_si_005.zi

Article Link: https://doi.org/10.1002/cctc.201601603
Search for: Data Name: cctc201601603-sup-0001-misc_information.pdf data link: https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fcctc.201601603&file=cctc201601603-sup-0001-misc_information.pdf
Article Link: https://doi.org/10.1039/c6sc04130b
Search for: Data Name: Supplementary information PDF (684K) data link: http://www.rsc.org/suppdata/c6/sc/c6sc04130b/c6sc04130b1.pdf
Article Link: https://doi.org/10.1039/c5cy01650a
Search for: Data Name: 10.17035/d.2015.100119 data link: http://dx.doi.org/10.17035/d.2015.100119
<Response [301]> {'Date': 'Fri, 13 Nov 2020 16:50:40 GMT', 'Connection': 'keep-alive', 'Cache-Control': 'max-age=3600', 'Expires': 'Fri, 13 Nov 2020 17:50:40 GMT', 'Location': 'https://dx.doi.org/10.17035/d.2015.100119', 'cf-request-id': '06641d39740000f41382a2e000000001', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report?s=92Xb0%2Bv4Wi3ck%2FD620t5K7OB3o1ZdT

<Response [303]> {'Connection': 'keep-alive', 'Content-Length': '150', 'Content-Type': 'text/html', 'Location': 'https://idp.nature.com/authorize?response_type=cookie&client_id=grover&redirect_uri=https%3A%2F%2Fwww.nature.com%2Farticles%2Fnature16935%2Ffigures%2F13', 'Server': 'Oscar Platform 0.437.0', 'Strict-Transport-Security': 'max-age=31536000;preload', 'X-Vcap-Request-Id': 'f91a620f-db69-438f-4ecf-238333010835', 'Via': '1.1 google, 1.1 varnish', 'X-Cdn-Origin': 'SNPaaS', 'Accept-Ranges': 'bytes', 'Date': 'Fri, 13 Nov 2020 16:50:41 GMT', 'X-Served-By': 'cache-lhr7352-LHR', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1605286242.837017,VS0,VE15', 'Vary': 'x-forwarded-host, upgrade-insecure-requests'}
Article Link: https://doi.org/10.1038/nature16935
Search for: Data Name: <span>Full size table</span> data link: https://www.nature.com/articles/nature16935/tables/1
<Response [303]> {'Connection': 'keep-alive', 'Content-Length': '150', 'Content-Type': 'text/html', 'Location': 

Article Link: https://doi.org/10.1039/c8dt04638g
Search for: Data Name: Supplementary information PDF (1455K) data link: http://www.rsc.org/suppdata/c8/dt/c8dt04638g/c8dt04638g1.pdf
Article Link: https://doi.org/10.1039/c9dt01634a
Search for: Data Name: Supplementary information PDF (1046K) data link: http://www.rsc.org/suppdata/c9/dt/c9dt01634a/c9dt01634a1.pdf
Article Link: https://doi.org/10.1021/acsnano.8b09399
Search for: Data Name: nn8b09399_si_001.pdf (2.76 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/acsnano.8b09399/suppl_file/nn8b09399_si_001.pdf
Article Link: https://doi.org/10.1021/acscatal.9b00160
Search for: Data Name: cs9b00160_si_001.pdf (3.07 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/acscatal.9b00160/suppl_file/cs9b00160_si_001.pdf
Article Link: https://doi.org/10.1021/acscatal.9b00160
Search for: Data Name: 10.1021/acscatal.9b00160 data link: https://pubs.acs.org/doi/abs/10.1021/acscatal.9b00160
Article Link: https://doi.org/10.1021/acscatal.9b00160

Article Link: https://doi.org/10.1002/aic.16687
Search for: Data Name: aic16687-sup-0001-FigureS1.doc data link: https://aiche.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Faic.16687&file=aic16687-sup-0001-FigureS1.doc
Article Link: https://doi.org/10.1021/acs.jpcc.6b11186
Search for: Data Name: jp6b11186_si_001.pdf (1.5 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/acs.jpcc.6b11186/suppl_file/jp6b11186_si_001.pdf
Article Link: https://doi.org/10.1021/acscatal.6b00982
Search for: Data Name: cs6b00982_si_001.pdf (2.28 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/acscatal.6b00982/suppl_file/cs6b00982_si_001.pdf
Article Link: https://doi.org/10.1039/c9sc03374b
Search for: Data Name: Supplementary information PDF (3485K) data link: http://www.rsc.org/suppdata/c9/sc/c9sc03374b/c9sc03374b1.pdf
Article Link: https://doi.org/10.1021/ja512868a
Search for: Data Name: ja512868a_si_001.pdf (2.22 MB) data link: https://pubs.acs.org/doi/suppl/10.1021/ja512868a/supp

Article Link: https://doi.org/10.1002/cctc.201901166
Search for: Data Name: cctc201901166-sup-0001-misc_information.pdf data link: https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fcctc.201901166&file=cctc201901166-sup-0001-misc_information.pdf
Article Link: https://doi.org/10.1039/c9cy02371b
Search for: Data Name: Supplementary information PDF (1877K) data link: http://www.rsc.org/suppdata/c9/cy/c9cy02371b/c9cy02371b1.pdf
Article Link: https://doi.org/10.1039/d0sc01317j
Search for: Data Name: Supplementary information PDF (3861K) data link: http://www.rsc.org/suppdata/d0/sc/d0sc01317j/d0sc01317j1.pdf
Article Link: https://doi.org/10.1038/s41563-019-0562-6
Search for: Data Name: All the relevant data are available from the authors, and/or are included with the manuscript. data link: 
Invalid URL '': No schema supplied. Perhaps you meant http://?
Article Link: https://doi.org/10.1038/s41563-019-0562-6
Search for: Data Name: Supplementary Informatio

Article Link: https://doi.org/10.1016/j.apcatb.2016.12.066
Search for: Data Name: 1-s2.0-S0926337316310025-mmc1.doc data link: https://ars.els-cdn.com/content/image/1-s2.0-S0926337316310025-mmc1.doc
Article Link: https://doi.org/10.1016/j.apcatb.2018.07.008
Search for: Data Name: 1-s2.0-S0926337318306167-mmc1.docx data link: https://ars.els-cdn.com/content/image/1-s2.0-S0926337318306167-mmc1.docx
Article Link: https://doi.org/10.1016/j.apcatb.2018.07.072
Search for: Data Name: 1-s2.0-S0926337318307136-mmc1.docx data link: https://ars.els-cdn.com/content/image/1-s2.0-S0926337318307136-mmc1.docx
Article Link: https://doi.org/10.1016/j.apcatb.2019.04.078
Search for: Data Name: 1-s2.0-S092633731930400X-mmc1.docx data link: https://ars.els-cdn.com/content/image/1-s2.0-S092633731930400X-mmc1.docx
Article Link: https://doi.org/10.1016/j.bmc.2018.10.015
Search for: Data Name: 1-s2.0-S0968089618313233-mmc1.docx data link: https://ars.els-cdn.com/content/image/1-s2.0-S0968089618313233-mmc1.docx


In [22]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, 'pub_data_mined_fair.csv')


## Reusable
Finding, retrieving and interpreting an object is not all there is. For a resource to be reusable it needs to be (a) licensed for use and (b) in a format suitable for long-term support (this relates to interoperability).
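
The two conditions could be checked along these lines. This is a minimal sketch under stated assumptions: the `record` dictionary with `license` and `format` keys, the helper name `reusability_issues`, and the set of formats treated as suitable for long-term support are hypothetical and are not produced by any cell in this notebook.

```python
# Formats assumed here to be suitable for long-term support (illustrative, not exhaustive)
LONG_TERM_FORMATS = {"csv", "txt", "json", "xml"}


def reusability_issues(record):
    """Return the reasons why a data object may not be reusable.

    `record` is a hypothetical dict with optional 'license' and 'format' keys.
    """
    issues = []
    if not (record.get("license") or "").strip():
        # condition (a): a license must be stated
        issues.append("no license stated")
    if (record.get("format") or "").lower() not in LONG_TERM_FORMATS:
        # condition (b): the format should be suitable for long-term support
        issues.append("format not suited to long-term support")
    return issues
```

An empty list means that neither condition failed for the record; each failed condition adds one entry, so the result can be stored alongside the other FAIR checks.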

## Interoperable
Access to a resource does not guarantee interoperability. A resource is interoperable if its data is stored in a format that is easy for both humans and machines to interpret, so an object in an open format is more interoperable than an object in a proprietary format.
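
One rough way to apply this is to look at the file extension in the data link. The sketch below is an assumption-laden illustration: the helper `classify_format` and the two extension sets are hypothetical, the sets are deliberately short, and a link that carries the file name in its query string (as some publisher download links do) is reported as unknown.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Illustrative and incomplete groupings; which formats belong where is an assumption
OPEN_FORMATS = {".csv", ".txt", ".json", ".xml"}
PROPRIETARY_FORMATS = {".doc", ".xls"}


def classify_format(data_url):
    """Classify a data link by the extension of the file in its path."""
    suffix = PurePosixPath(urlparse(data_url).path).suffix.lower()
    if suffix in OPEN_FORMATS:
        return "open"
    if suffix in PROPRIETARY_FORMATS:
        return "proprietary"
    return "unknown"
```

The response `content-type` header, where a server provides one, could be used in the same way when the link has no usable extension.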

## Accessible
Having an identifier and a link does not guarantee access. Resources may sit behind barriers such as a login or a requirement to request the data by email.
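
The status code and redirect target of a link already give a first indication of such barriers. The following is a minimal sketch, assuming that a redirect to an address containing words such as "login" or "authorize" signals restricted access; the helper `classify_access`, its labels and the `LOGIN_HINTS` tuple are hypothetical and not exhaustive.

```python
from urllib.parse import urlparse

# Assumed markers of a login or authorisation redirect (not exhaustive)
LOGIN_HINTS = ("login", "signin", "authorize", "idp.")


def classify_access(status_code, location=""):
    """Classify access from an HTTP status code and an optional redirect target."""
    if status_code == 200:
        return "open"
    if status_code in (301, 302, 303, 307, 308):
        target = urlparse(location)
        haystack = (target.netloc + target.path).lower()
        if any(hint in haystack for hint in LOGIN_HINTS):
            # redirected to an identity provider or login page
            return "restricted"
        return "redirect"
    if status_code in (401, 403):
        return "restricted"
    return "unavailable"
```

With these rules, a 303 response that points to an address like `https://idp.nature.com/authorize?...` is classed as restricted, whereas a 301 from `http://dx.doi.org/` to its `https` equivalent is an ordinary redirect that still needs to be followed.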

In [None]:
# functions for ChemDataExtractor
# not used for mining data references (supplementary/raw) or to get pdf metadata
from chemdataextractor import Document

# A function for getting a list of files from the directory
# This will be modified to get the list from a csv file
def get_files_list(source_dir):
    # sorted list of the PDF files found directly in source_dir
    return sorted(source_dir.glob('*.pdf'))

def cde_read_pdfs(a_file):
    # close the file once the document has been parsed
    with open(a_file, 'rb') as pdf_f:
        doc = Document.from_file(pdf_f)
    return doc

def find_doi(element_text):
    # raw string so that \d is read as a regex digit class;
    # the dot after "10" is escaped so it only matches a literal dot
    cr_re_01 = r'10\.\d{4,9}/[-._;()/:A-Z0-9]+'
    compare = re.search(cr_re_01, element_text, re.IGNORECASE)
    if compare is not None:
        return compare.group()
    return ""

def get_db_id(doi_value, db_name = "app_db.sqlite3"):
    db_conn = dbh.DataBaseAdapter(db_name)
    table = 'articles'   
    id_val = db_conn.get_value(table, "id", "doi", doi_value)
    db_conn.close()
    if id_val != None:
        return id_val[0]
    else:
        return 0

def get_db_title(doi_value, db_name = "app_db.sqlite3"):
    db_conn = dbh.DataBaseAdapter(db_name)
    table = 'articles'   
    id_val = db_conn.get_value(table, "title", "doi", doi_value)
    db_conn.close()
    if id_val != None:
        return id_val[0]
    else:
        return 0

def get_close_dois(str_name, db_name = "prev_search.sqlite3"):
    db_conn = dbh.DataBaseAdapter(db_name)
    search_in = 'articles'
    fields_required = "id, doi, title, pdf_file"
    filter_str = "doi like '%"+str_name+"%';"

    db_titles = db_conn.get_values(search_in, fields_required, filter_str)
    db_conn.close()
    return db_titles

In [None]:
import pdfminer
from pdfminer.high_level import extract_text

# functions for PDFminer

def get_pdf_text(pdf_file):
    return extract_text(pdf_file)

# get the paragraph fragments with references to data
def get_ref_sentences(pdf_text):
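    sentences = pdf_text.split("\n")
    groups = []
    # enumerate gives the position of each line, even when the same text repeats
    for idx, sentence in enumerate(sentences):
        if pr_fns.is_data_stmt(sentence.lower()):
            # previous, current and next line, kept inside the bounds of the list
            groups.append([i for i in (idx - 1, idx, idx + 1) if 0 <= i < len(sentences)])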
    reduced_groups = []
    for group in groups:
        idx_group = groups.index(group)
        if groups.index(group) > 0:
            set_g = set(group)
            # make the array before current a set
            set_bg = set(groups[idx_group - 1])
            # make the array after current a set
            set_ag = set()
            if idx_group + 1 < len(groups):    
                set_ag = set(groups[idx_group + 1])
            if len(set_bg.intersection(set_g)) > 0:
                ordered_union = list(set_bg.union(set_g))
                ordered_union.sort()
                reduced_groups.append(ordered_union)
            if len(set_ag.intersection(set_g)) > 0:
                ordered_union = list(set_ag.union(set_g))
                ordered_union.sort()
                reduced_groups.append(ordered_union)
            if len(reduced_groups) > 0:
                is_in_rg = False
                for a_rg in reduced_groups:
                    if set_g.issubset(a_rg):
                        is_in_rg = True
                        break
                if not is_in_rg:
                    reduced_groups.append(list(set_g))
    return_group = []
    for sentence_group in reduced_groups:
        full_sentence = ""
        for single_sentence in sentence_group:
            full_sentence += sentences[single_sentence].strip()
        return_group.append(full_sentence)
    return return_group

# get the paragraph fragments with references to data
def get_all_data_sentences(pdf_text):
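    sentences = pdf_text.split("\n")
    groups = []
    # enumerate gives the position of each line, even when the same text repeats
    for idx, sentence in enumerate(sentences):
        if 'data' in sentence.lower() or 'inform' in sentence.lower():
            # previous, current and next line, kept inside the bounds of the list
            groups.append([i for i in (idx - 1, idx, idx + 1) if 0 <= i < len(sentences)])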
    reduced_groups = []
    for group in groups:
        idx_group = groups.index(group)
        if groups.index(group) > 0:
            set_g = set(group)
            # make the array before current a set
            set_bg = set(groups[idx_group - 1])
            # make the array after current a set
            set_ag = set()
            if idx_group + 1 < len(groups):    
                set_ag = set(groups[idx_group + 1])
            if len(set_bg.intersection(set_g)) > 0:
                ordered_union = list(set_bg.union(set_g))
                ordered_union.sort()
                reduced_groups.append(ordered_union)
            if len(set_ag.intersection(set_g)) > 0:
                ordered_union = list(set_ag.union(set_g))
                ordered_union.sort()
                reduced_groups.append(ordered_union)
            if len(reduced_groups) > 0:
                is_in_rg = False
                for a_rg in reduced_groups:
                    if set_g.issubset(a_rg):
                        is_in_rg = True
                        break
                if not is_in_rg:
                    reduced_groups.append(list(set_g))
    return_group = []
    for sentence_group in reduced_groups:
        full_sentence = ""
        for single_sentence in sentence_group:
            full_sentence += sentences[single_sentence].strip()
        if not full_sentence in return_group:
            return_group.append(full_sentence)
    return return_group

# get the http strings from references to data
def get_http_ref(sentence):
    http_frag = ""
    if 'http' in sentence.lower():
        idx_http = sentence.lower().index('http')
        # keep the text from "http" up to the first space
        http_frag = sentence[idx_http:].split(" ", 1)[0]
        # drop a full stop that closes the sentence
        if http_frag.endswith("."):
            http_frag = http_frag[:-1]
    return http_frag

In [None]:
if len(data_mentions) > 0:
    csvh.write_csv_data(data_mentions, 'pdf_mentions_filtered_02.csv')

Get the name of the current app db file:

In [None]:
# app db file with path: db_files/app_db.sqlite3
ukchapp_db = "db_files/app_db.sqlite3"
while not Path(ukchapp_db).is_file():
    print('Please enter the name of app db file:')
    ukchapp_db = input()

## Use pdfminer to get metadata from pdf file

In [None]:
# get publication data from the ukch app
db_pubs = pr_fns.get_pub_app_data(ukchapp_db)

# get the list of DOIs already mined for data
input_file = 'pub_data_all.csv'
id_field = 'num'
processed, headings = csvh.get_csv_data(input_file, id_field)

# unique DOIs, in order of first appearance
processed_dois = []
for entry in processed:
    if processed[entry]['doi'] not in processed_dois:
        processed_dois.append(processed[entry]['doi'])

data_records = {}
data_mentions = {}
ref_count = mention_count = 0
for a_pub in tqdm_notebook(db_pubs):
    data_refs = []
    data_sents = []
    pub_id = a_pub[0]
    pub_title = a_pub[1]
    pub_doi = a_pub[2]
    pub_url = a_pub[3]
    pub_pdf = a_pub[4]
    pub_html = a_pub[5]
    if pub_pdf == 'None':
        print("*************************")
        print("Missing PDF for:", pub_doi)
        print("*************************")
    else:
        pdf_file = "pdf_files/" + pub_pdf
        if not Path(pdf_file).is_file():
            print("*************************")
            print("Missing file for:", pdf_file, "for", pub_doi)
            print("*************************")
        else: 
            print("PDF filename", pdf_file)
            pdf_text = get_pdf_text(pdf_file)
            ref_sentences = get_ref_sentences(pdf_text)
            data_sentences = get_all_data_sentences(pdf_text)
            for r_sentence in ref_sentences:
                dt_link = get_http_ref(r_sentence)
                if 'supplem' in r_sentence.lower():
                    data_refs.append({'type':'supplementary',"desc":r_sentence, 'data_url':dt_link})
                else:
                    data_refs.append({'type':'supporting',"desc":r_sentence, 'data_url':dt_link})
            for d_sentence in data_sentences:
                dt_link = get_http_ref(d_sentence)
                if 'supplem' in d_sentence.lower():
                    data_sents.append({'type':'supplementary',"desc":d_sentence, 'data_url':dt_link})
                else:
                    data_sents.append({'type':'supporting',"desc":d_sentence, 'data_url':dt_link})
    if data_refs != []:
        for data_ref in data_refs:
            data_record = {'id':pub_id, 'doi':pub_doi}    
            data_record.update(data_ref)
            data_records[ref_count] = data_record
            ref_count += 1
    if data_sents != []:
        for data_sent in data_sents:
            sentence_record = {'id':pub_id, 'doi':pub_doi}    
            sentence_record.update(data_sent)
            data_mentions[mention_count] = sentence_record
            mention_count += 1

In [None]:
if len(data_records) > 0:
    csvh.write_csv_data(data_records, 'pdf_data.csv')
    
if len(data_mentions) > 0:
    csvh.write_csv_data(data_mentions, 'pdf_mentions.csv')

In [None]:
# get names and links for references in data mentions
data_mentions, dm_fields = csvh.get_csv_data('pdf_mentions_filtered_02.csv', 'num')

for dm in data_mentions:
    print("https://doi.org/" + data_mentions[dm]['doi'])
    ref_name = data_mentions[dm]['ref_name']
    while ref_name == "":
        print('Please enter the name of data object:')
        ref_name = input()
    ref_link = data_mentions[dm]['ref_link']
    while ref_link == "":
        print('Please enter the data object link:')
        ref_link = input()
    data_mentions[dm]['ref_name'] = ref_name
    data_mentions[dm]['ref_link'] = ref_link
                      

In [None]:
len(data_records)