# Are Data Objects referenced from publications FAIR?

A list of data objects referenced from a set of publications is checked to determine their FAIRness.

Rather than only checking references to data, these tests are applied to **data objects**: any data published to complement the publication. This includes raw data, supplementary data, processing data, tables, images, movies, and compilations containing one or more such resources.

The tests to be performed are aimed at finding out if the data objects are: 
 - **F**indable:  can the object be found easily?
 - **A**ccessible: can the object be retrieved?
 - **I**nteroperable: can the object be accessed programmatically to extract data and metadata?
 - **R**eusable: can the object be readily used?

In [2]:
# Libraries
# library containing functions that read and write to csv files
import lib.handle_csv as csvh
# library for connecting to the db
import lib.handle_db as dbh
# library for handling text matchings
import lib.text_comp as txtc
# library for getting data from crossref
import lib.crossref_api as cr_api
# library for handling url searches
import lib.handle_urls as urlh
# managing files and file paths
from pathlib import Path
# add a progress bar
from tqdm import tqdm_notebook
import tqdm
# library for handling json files
import json
# library for using regular expressions
import re
# library for handling http requests
import requests
# library for accessing system functions
import os
# import custom functions (common to various notebooks)
import processing_functions as pr_fns



## Findable

Most of the data objects are assumed to be findable, since we were able to find links to them. However, some references point to other pages, ask the reader to contact the authors, or point to repositories without identifying a specific record.

### Findability Score
A findability score was calculated for each data object as follows.
The top score of 5 is given if the object is referenced directly from the publication web page and further details about it (name, type and size) can be recovered just by accessing that reference.
Points are then deducted from the top score for each additional step or problem in getting to the referenced object:
- Special access to the publication is needed (downloading the PDF, getting a password or token for mining the publication, or other access blocks). \[-1 point\]
- Human access to the publication online is required (there is no metadata or clear pattern to identify a reference on the publication landing page, or the PDF version redirects to the article). \[-1 point\]
- Recovering the referenced object's details (name, type and size) requires more than a single query. \[-1 point\]
- The reference is wrong (broken link). \[-2 points\]
- The reference asks the reader to contact the authors or to look up a data repository without an ID. \[-4 points\]
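
The rules above can be sketched as a small scoring function. This is only an illustration: the `ReferenceFacts` class and its field names are hypothetical and do not correspond to columns in `pub_data_fairness.csv`, and since no lower bound for the score is stated, none is applied here.

```python
from dataclasses import dataclass


@dataclass
class ReferenceFacts:
    """Observations about one data reference (all field names are hypothetical)."""
    needs_special_access: bool = False    # PDF download, password, token, ...
    needs_human_access: bool = False      # no metadata or pattern to mine the page
    needs_multiple_queries: bool = False  # name/type/size not available in one query
    broken_link: bool = False             # the reference does not resolve
    no_specific_record: bool = False      # contact the authors / repository without an ID


def findability_score(facts: ReferenceFacts) -> int:
    """Start from the top score of 5 and subtract the listed penalties."""
    score = 5
    if facts.needs_special_access:
        score -= 1
    if facts.needs_human_access:
        score -= 1
    if facts.needs_multiple_queries:
        score -= 1
    if facts.broken_link:
        score -= 2
    if facts.no_specific_record:
        score -= 4
    return score


# a reference that could only be mined from the PDF and whose link is broken
example = ReferenceFacts(needs_special_access=True, broken_link=True)
print(findability_score(example))
```

Keeping the observations separate from the arithmetic makes it possible to record why a reference lost points, not only the final number.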

In [10]:
# get names and links for references in data mentions
data_reference, _ = csvh.get_csv_data('pub_data_fairness.csv', 'num')

for dr in tqdm_notebook(data_reference):
    if data_reference[dr]['ret_code'] == "":
        # try to get data object details from reference
        print("Article Link: https://doi.org/" + data_reference[dr]['doi'])
        ref_name = data_reference[dr]['name']
        ref_link = data_reference[dr]['data_url']
        print("Search for: Data Name:", ref_name, "data link:", ref_link)
        head = urlh.getPageHeader(ref_link)

        if head is not None:
            data_reference[dr]['ret_code'] = head.status_code 
            data_reference[dr]['resoruce_name'] = os.path.basename(head.url)
            if head.status_code == 200:
                #print (head.headers, head.url)
                if 'content-type' in head.headers.keys():
                    data_reference[dr]['ref_content'] = head.headers['content-type']
                if 'content-length' in head.headers.keys():
                    data_reference[dr]['ref_size'] = head.headers['content-length']
                data_reference[dr]['ref_redirect'] = head.url
            elif head.status_code == 302 or head.status_code == 301:
                #print(head, head.headers)
                data_reference[dr]['ref_redirect'] = head.headers['location']
                data_reference[dr]['resoruce_name'] = os.path.basename(head.headers['location'])
            else:
                print(head, head.headers)
        else:
            data_reference[dr]['f_score'] = 1
    else:
        data_reference[dr]['f_score'] = 5
        if data_reference[dr]['html_mined'] == False and data_reference[dr]['pdf_mined'] == True:
            data_reference[dr]['f_score'] -= 1 # the publication page is not accessible directly to get the DO reference 
        if data_reference[dr]['html_mined'] == False and data_reference[dr]['user_mined'] == True:
            data_reference[dr]['f_score'] -= 1 # a human user needed to access the resource
        if data_reference[dr]['ret_code'] in [0, 404]:
            data_reference[dr]['f_score'] -= 2 # the reference is wrong (broken link)
        


In [9]:
if len(data_reference) > 0:
    csvh.write_csv_data(data_reference, 'pub_data_fairness.csv')


## Accessible
Having an identifier and a link does not guarantee access. Resources may sit behind access walls (a login, a request by email, or similar).
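
As a rough illustration of how such walls might be detected, the sketch below maps an HTTP status code (such as the `ret_code` collected earlier) to a coarse access label. The function name and the labels are assumptions made for this example, and the check is deliberately limited: a login page can also be served with status 200, so "open" here only means the server answered successfully.

```python
def classify_access(status_code: int) -> str:
    """Map an HTTP status code to a coarse, hypothetical access label."""
    if 200 <= status_code < 300:
        return "open"        # the server returned content (it may still be a login page)
    if status_code in (401, 403):
        return "restricted"  # authentication required or access forbidden
    if status_code in (404, 410):
        return "missing"     # the resource is not found or gone
    return "unknown"         # redirects, server errors, no response, ...


for code in (200, 403, 404, 500):
    print(code, classify_access(code))
```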

## Reusable
Finding, retrieving and interpreting an object is not all there is. For the resource to be reusable it needs to be a) licensed for use and b) in an appropriate format to guarantee long-term support (this relates to interoperability).

## Interoperable
Access to a resource does not guarantee interoperability. A resource is interoperable if the data is stored in a format that is easy for both humans and machines to interpret. An object in an open format is therefore more interoperable than an object in a proprietary format.
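
One simple way to approximate this distinction is to look at the file extension of the data link. The sketch below assumes the extension is present in the URL path; the two extension sets are illustrative assumptions, not a list used in this study, and anything outside them is reported as unknown.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# illustrative, incomplete sets chosen for this example only
OPEN_FORMATS = {".csv", ".tsv", ".txt", ".json", ".xml", ".png"}
PROPRIETARY_FORMATS = {".xls", ".doc", ".ppt"}


def format_openness(data_url: str) -> str:
    """Label a data link as 'open', 'proprietary' or 'unknown' from its file extension."""
    # urlparse drops the query string, so '?download=1' does not hide the extension
    suffix = PurePosixPath(urlparse(data_url).path).suffix.lower()
    if suffix in OPEN_FORMATS:
        return "open"
    if suffix in PROPRIETARY_FORMATS:
        return "proprietary"
    return "unknown"


print(format_openness("https://example.org/files/Table_S1.CSV"))
```

Links that resolve to a repository record rather than to a file have no extension, so the `content-type` header gathered in the findability step would be needed to classify them.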

In [None]:
# functions for ChemDataExtractor
# not used for mining data references (supplementary/raw) or to get pdf metadata
from chemdataextractor import Document

# A function for getting a list of files from the directory
# This will be modified to get the list from a csv file
def get_files_list (source_dir):
    # sorted list of all PDF paths found directly under source_dir (a pathlib.Path)
    return sorted(source_dir.glob('*.pdf'))

def cde_read_pdfs(a_file):
    # the context manager closes the file handle once the document is parsed
    with open(a_file, 'rb') as pdf_f:
        doc = Document.from_file(pdf_f)
    return doc

def find_doi(element_text):
    # raw string so the escapes reach the regex engine; the dot after "10" is matched literally
    cr_re_01 = r'10\.\d{4,9}/[-._;()/:A-Z0-9]+'
    compare = re.search(cr_re_01, element_text, re.IGNORECASE)
    if compare is not None:
        return compare.group()
    return ""

def get_db_id(doi_value, db_name = "app_db.sqlite3"):
    db_conn = dbh.DataBaseAdapter(db_name)
    table = 'articles'   
    id_val = db_conn.get_value(table, "id", "doi", doi_value)
    db_conn.close()
    if id_val != None:
        return id_val[0]
    else:
        return 0

def get_db_title(doi_value, db_name = "app_db.sqlite3"):
    db_conn = dbh.DataBaseAdapter(db_name)
    table = 'articles'   
    id_val = db_conn.get_value(table, "title", "doi", doi_value)
    db_conn.close()
    if id_val != None:
        return id_val[0]
    else:
        return 0

def get_close_dois(str_name, db_name = "prev_search.sqlite3"):
    db_conn = dbh.DataBaseAdapter(db_name)
    search_in = 'articles'
    fields_required = "id, doi, title, pdf_file"
    filter_str = "doi like '%"+str_name+"%';"

    db_titles = db_conn.get_values(search_in, fields_required, filter_str)
    db_conn.close()
    return db_titles

In [None]:
import pdfminer
from pdfminer.high_level import extract_text

# functions for PDFminer

def get_pdf_text(pdf_file):
    return extract_text(pdf_file)

# get the paragraph fragments with references to data
def get_ref_sentences(pdf_text):
    sentences = pdf_text.split("\n")
    groups=[]
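    # enumerate gives the true position of each line, even when lines repeat
    for idx, sentence in enumerate(sentences):
        if pr_fns.is_data_stmt(sentence.lower()):
            # previous, current and next line, clamped to valid indexes
            groups.append([max(idx - 1, 0), idx, min(idx + 1, len(sentences) - 1)])
    reduced_groups = []
    for idx_group, group in enumerate(groups):
        if idx_group > 0: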
            set_g = set(group)
            # make the array before current a set
            set_bg = set(groups[idx_group - 1])
            # make the array after current a set
            set_ag = set()
            if idx_group + 1 < len(groups):    
                set_ag = set(groups[idx_group + 1])
            if len(set_bg.intersection(set_g)) > 0:
                ordered_union = list(set_bg.union(set_g))
                ordered_union.sort()
                reduced_groups.append(ordered_union)
            if len(set_ag.intersection(set_g)) > 0:
                ordered_union = list(set_ag.union(set_g))
                ordered_union.sort()
                reduced_groups.append(ordered_union)
            if len(reduced_groups) > 0:
                is_in_rg = False
                for a_rg in reduced_groups:
                    if set_g.issubset(a_rg):
                        is_in_rg = True
                        break
                if not is_in_rg:
                    reduced_groups.append(list(set_g))
    return_group = []
    for sentence_group in reduced_groups:
        full_sentence = ""
        for single_sentence in sentence_group:
            full_sentence += sentences[single_sentence].strip()
        return_group.append(full_sentence)
    return return_group

# get the paragraph fragments with references to data
def get_all_data_sentences(pdf_text):
    sentences = pdf_text.split("\n")
    groups=[]
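    # enumerate gives the true position of each line, even when lines repeat
    for idx, sentence in enumerate(sentences):
        if 'data' in sentence.lower() or 'inform' in sentence.lower():
            # previous, current and next line, clamped to valid indexes
            groups.append([max(idx - 1, 0), idx, min(idx + 1, len(sentences) - 1)])
    reduced_groups = []
    for idx_group, group in enumerate(groups):
        if idx_group > 0: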
            set_g = set(group)
            # make the array before current a set
            set_bg = set(groups[idx_group - 1])
            # make the array after current a set
            set_ag = set()
            if idx_group + 1 < len(groups):    
                set_ag = set(groups[idx_group + 1])
            if len(set_bg.intersection(set_g)) > 0:
                ordered_union = list(set_bg.union(set_g))
                ordered_union.sort()
                reduced_groups.append(ordered_union)
            if len(set_ag.intersection(set_g)) > 0:
                ordered_union = list(set_ag.union(set_g))
                ordered_union.sort()
                reduced_groups.append(ordered_union)
            if len(reduced_groups) > 0:
                is_in_rg = False
                for a_rg in reduced_groups:
                    if set_g.issubset(a_rg):
                        is_in_rg = True
                        break
                if not is_in_rg:
                    reduced_groups.append(list(set_g))
    return_group = []
    for sentence_group in reduced_groups:
        full_sentence = ""
        for single_sentence in sentence_group:
            full_sentence += sentences[single_sentence].strip()
        if not full_sentence in return_group:
            return_group.append(full_sentence)
    return return_group

# get the http strings from references to data
def get_http_ref(sentence):
    http_frag = ""
    if 'http' in sentence.lower():
        idx_http = sentence.lower().index('http')
        # keep everything from 'http' up to the first space
        http_frag = sentence[idx_http:].split(" ", 1)[0]
        # drop a full stop that closes the sentence
        if http_frag.endswith("."):
            http_frag = http_frag[:-1]
    return http_frag
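
The grouping step in `get_ref_sentences` and `get_all_data_sentences` builds a three-line window around every matching line and then joins windows that share lines. The sketch below isolates that idea as a standard interval merge. It is an illustrative alternative under stated assumptions, not a drop-in replacement for the functions above: `merge_windows` is a hypothetical name and its output may differ from theirs in edge cases.

```python
def merge_windows(hit_indexes, n_lines):
    """Merge the [i-1, i, i+1] windows around each hit into sorted, non-overlapping index lists."""
    merged = []  # list of [start, end] pairs, both inclusive
    for idx in sorted(set(hit_indexes)):
        start = max(idx - 1, 0)          # clamp at the first line
        end = min(idx + 1, n_lines - 1)  # clamp at the last line
        if merged and start <= merged[-1][1]:
            # this window shares at least one line with the previous one: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [list(range(start, end + 1)) for start, end in merged]


lines = ["intro", "the data are", "available at", "https://example.org/x.", "", "other text"]
hits = [1, 3]  # positions of lines that mention data
for window in merge_windows(hits, len(lines)):
    print(" ".join(lines[i].strip() for i in window))
```

Sorting the hits first means each window only needs to be compared with the most recent merged window, which avoids the repeated subset checks.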

In [None]:
if len(data_mentions) > 0:
    csvh.write_csv_data(data_mentions, 'pdf_mentions_filtered_02.csv')

Get the name of the current app db file:

In [None]:
# app db file with path: db_files/app_db.sqlite3
ukchapp_db = "db_files/app_db.sqlite3"
while not Path(ukchapp_db).is_file():
    print('Please enter the name of app db file:')
    ukchapp_db = input()

## Use pdfminer to get metadata from pdf file

In [None]:
# get publication data from the ukch app
db_pubs = pr_fns.get_pub_app_data(ukchapp_db)

# get the list of dois already mined for data 
input_file = 'pub_data_all.csv'
id_field = 'num'
processed, headings = csvh.get_csv_data(input_file, id_field)
# collect the unique DOIs that were already processed, keeping their order
processed_dois = []
for entry in processed:
    if processed[entry]['doi'] not in processed_dois:
        processed_dois.append(processed[entry]['doi'])

data_records = {}
data_mentions = {}
ref_count = mention_count = 0
for a_pub in tqdm_notebook(db_pubs):
    data_refs = []
    data_sents = []
    pub_id = a_pub[0]
    pub_title = a_pub[1]
    pub_doi = a_pub[2]
    pub_url = a_pub[3]
    pub_pdf = a_pub[4]
    pub_html = a_pub[5]
    if pub_pdf == 'None':
        print("*************************")
        print("Missing PDF for:", pub_doi)
        print("*************************")
    else:
        pdf_file = "pdf_files/" + pub_pdf
        if not Path(pdf_file).is_file():
            print("*************************")
            print("Missing file for:", pdf_file, "for", pub_doi)
            print("*************************")
        else: 
            print("PDF filename", pdf_file)
            pdf_text = get_pdf_text(pdf_file)
            ref_sentences = get_ref_sentences(pdf_text)
            data_sentences = get_all_data_sentences(pdf_text)
            for r_sentence in ref_sentences:
                dt_link = get_http_ref(r_sentence)
                if 'supplem' in r_sentence.lower():
                    data_refs.append({'type':'supplementary',"desc":r_sentence, 'data_url':dt_link})
                else:
                    data_refs.append({'type':'supporting',"desc":r_sentence, 'data_url':dt_link})
            for d_sentence in data_sentences:
                dt_link = get_http_ref(d_sentence)
                if 'supplem' in d_sentence.lower():
                    data_sents.append({'type':'supplementary',"desc":d_sentence, 'data_url':dt_link})
                else:
                    data_sents.append({'type':'supporting',"desc":d_sentence, 'data_url':dt_link})
    if data_refs != []:
        for data_ref in data_refs:
            data_record = {'id':pub_id, 'doi':pub_doi}    
            data_record.update(data_ref)
            data_records[ref_count] = data_record
            ref_count += 1
    if data_sents != []:
        for data_sent in data_sents:
            sentence_record = {'id':pub_id, 'doi':pub_doi}    
            sentence_record.update(data_sent)
            data_mentions[mention_count] = sentence_record
            mention_count += 1

In [None]:
if len(data_records) > 0:
    csvh.write_csv_data(data_records, 'pdf_data.csv')
    
if len(data_mentions) > 0:
    csvh.write_csv_data(data_mentions, 'pdf_mentions.csv')

In [None]:
# get names and links for references in data mentions
data_mentions, dm_fields = csvh.get_csv_data('pdf_mentions_filtered_02.csv', 'num')

for dm in data_mentions:
    print("https://doi.org/" + data_mentions[dm]['doi'])
    ref_name = data_mentions[dm]['ref_name']
    while ref_name == "":
        print('Please enter the name of data object:')
        ref_name = input()
    ref_link = data_mentions[dm]['ref_link']
    while ref_link == "":
        print('Please enter the data object link:')
        ref_link = input()
    data_mentions[dm]['ref_name'] = ref_name
    data_mentions[dm]['ref_link'] = ref_link
                      

In [None]:
len(data_records)