In [1]:
import os
import pandas as pd
from datetime import date
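# "__file__" is quoted on purpose: notebooks do not define __file__, so this resolves to the current working directory
path = os.path.dirname(os.path.realpath("__file__"))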

In [2]:
# Import the parseRelish function from InputDataPreprocess.py
os.chdir('../../../code')
from InputDataPreprocess import parseRelish
# --------------------------------------------------------------
# Change to the directory where this tutorial is stored for relative paths to work
os.chdir(path)

The RELISH file is in JSON format and contains more information than we need. We wrote a function, parseRelish, that extracts the most important fields into a TSV file:

In [3]:
file = './data/RELISH/RELISH_v1.json'
# parseRelish converts the JSON file to a TSV file stored in the same folder and returns the set of PMIDs
pmidSet = parseRelish(file)
df = pd.read_csv('./data/RELISH/RELISH.tsv', sep='\t', header=None, names=['reference_pmid', 'assessment_pmid', 'relevance'])
display(df.head())

Unnamed: 0,reference_pmid,assessment_pmid,relevance
0,22569528,17928366,2
1,22569528,18562239,2
2,22569528,19052640,2
3,22569528,19060905,2
4,22569528,19242111,2
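
To make the conversion concrete, here is a minimal sketch of how a JSON file of relevance judgments could be flattened into TSV rows. It is not the actual parseRelish implementation: the JSON layout (a list of entries with a `pmid` and a `response` dictionary keyed by `relevant`, `partial` and `irrelevant`), the numeric label mapping and the function name `flatten_relish` are all assumptions made for illustration.

```python
import csv
import io
import json

# Assumed mapping from judgment category to numeric relevance label.
LABELS = {"relevant": 2, "partial": 1, "irrelevant": 0}


def flatten_relish(entries):
    """Yield (reference_pmid, assessment_pmid, relevance) rows.

    `entries` is assumed to be a list of dicts shaped like
    {"pmid": "...", "response": {"relevant": [...], "partial": [...], "irrelevant": [...]}}.
    """
    for entry in entries:
        reference = str(entry["pmid"])
        for category, label in LABELS.items():
            for assessed in entry.get("response", {}).get(category, []):
                yield reference, str(assessed), label


# Tiny hypothetical sample, not real RELISH data.
sample = json.loads("""
[{"pmid": "111",
  "response": {"relevant": ["222", "333"], "partial": ["444"], "irrelevant": []}}]
""")

rows = list(flatten_relish(sample))

# Write the rows as tab-separated text (to a string buffer here, to a file in practice).
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Each output row has the same three columns that the cell above assigns when reading the TSV file.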


We see that the RELISH file includes the reference PMIDs, the assessment PMIDs, and the relevance assessments. We want to retrieve the articles behind these PMIDs from the BioC API. Because the parsed data has already been cleaned, we recommend using the pmidSet returned by parseRelish for further analysis:

In [4]:
pmidList = sorted(pmidSet)[:1000] # sort the set (sorted returns a list) and keep the first 1000 PMIDs
print(pmidList[:10]) # print first 10 pmids

['16638632', '16735107', '16838294', '16891103', '17006532', '17055509', '17070090', '17089077', '17098331', '17115153']
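
Note that the PMIDs are strings, so `sorted` orders them lexicographically rather than numerically. For PMIDs of equal length, such as the eight-digit ones printed above, the two orders coincide, but they diverge as soon as the lengths differ. A small sketch with made-up identifiers:

```python
# Made-up PMID strings of different lengths (for illustration only).
pmids = {"9876543", "16638632", "123456"}

lexicographic = sorted(pmids)     # compares character by character
numeric = sorted(pmids, key=int)  # compares by integer value

print(lexicographic)
print(numeric)
```

If a numerically ordered subset matters for your analysis, passing `key=int` is one simple option.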


We wrote a function that automatically retrieves the titles and abstracts for a list of PMIDs from the [BioC-API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/). It also records which of the requested PMIDs have no API entry or lack a title or abstract. The retrieved titles and abstracts are collected in a tab-separated file (TSV).
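
As a rough illustration of the bookkeeping described above, the following sketch separates documents with both a title and an abstract from those missing either part. It is not the code in BioCAPIReader: the document layout (a `passages` list whose items carry an `infons` dictionary with a `type` and a `text` field) is an assumption modeled on BioC-style JSON, and `extract_title_abstract` is a hypothetical helper. The sample data is made up and no API request is made.

```python
def extract_title_abstract(document):
    """Return (title, abstract) from one BioC-style document dict; None where a part is missing."""
    parts = {"title": [], "abstract": []}
    for passage in document.get("passages", []):
        kind = passage.get("infons", {}).get("type")
        text = passage.get("text", "").strip()
        if kind in parts and text:
            parts[kind].append(text)
    title = " ".join(parts["title"]) or None
    abstract = " ".join(parts["abstract"]) or None
    return title, abstract


# Made-up documents in the assumed layout.
docs = [
    {"id": "1", "passages": [
        {"infons": {"type": "title"}, "text": "A title"},
        {"infons": {"type": "abstract"}, "text": "An abstract."},
    ]},
    {"id": "2", "passages": [
        {"infons": {"type": "title"}, "text": "Title only"},
    ]},
]

found, missing = {}, []
for doc in docs:
    title, abstract = extract_title_abstract(doc)
    if title and abstract:
        found[doc["id"]] = (title, abstract)
    else:
        missing.append(doc["id"])

print(found)
print(missing)
```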

In the following, we retrieve the titles and abstracts of the 1000 articles in pmidList:

In [5]:
# Import main function
os.chdir('../../../code/BioC-approach/')
from BioCAPIReader import main
os.chdir(path)

# define the path where the data will be stored
parentPath = './data/RELISH'

# execute function main to retrieve data
# Note: changing the log argument requires a kernel restart.
# Set delete_tmp=False to keep the temporary files from the BioC-API.
main(pmidList, parentPath, log=True, delete_tmp=True, chunk_size=400, processes=30)

2022-09-05 19:09:22,800 - INFO - Started script.
2022-09-05 19:09:22,801 - INFO - Number of PMIDs to process: 1000
2022-09-05 19:09:28,216 - INFO - Processing 3 chunk files.


  0%|          | 0/3 [00:00<?, ?it/s]

2022-09-05 19:09:29,591 - INFO - Titles, abstracts and pmids saved to tsv file (Path: ./data/RELISH/documents_20220905.tsv).
2022-09-05 19:09:29,612 - INFO - All titles and abstracts successfully retrieved.
2022-09-05 19:09:29,815 - INFO - Finished script.
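
The log reports three chunk files, which is consistent with splitting 1000 PMIDs into chunks of at most 400. How BioCAPIReader splits the list internally is not shown here; the following is only a sketch of one plain way to do it, and the helper name `chunk` is hypothetical.

```python
def chunk(items, chunk_size):
    """Split a list into consecutive sublists of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# Placeholder identifiers standing in for the 1000 PMIDs.
ids = [str(n) for n in range(1000)]
chunks = chunk(ids, 400)

print(len(chunks))               # 3
print([len(c) for c in chunks])  # [400, 400, 200]
```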


Let's have a look at the output file:

In [6]:
today = date.today().strftime("%Y%m%d") # define today's date, because the output is always named with today's date
documents = pd.read_csv(f'{parentPath}/documents_{today}.tsv', sep='\t')
print('Number of documents: ', len(documents))
display(documents.head())

Number of documents:  1000


Unnamed: 0,PMID,title,abstract
0,16638632,Analysis of degradation of bacterial cell divi...,"The identity of protease(s), which would degra..."
1,16735107,Screening of free-living rhizospheric bacteria...,Plant growth promoting rhizobacteria (PGPR) ar...
2,16838294,A conserved role for the nodal signaling pathw...,Nodal factors play crucial roles during embryo...
3,16891103,A type-1 metacaspase from Acanthamoeba castell...,The complete sequence of a type-1 metacaspase ...
4,17006532,Quantitative and evolutionary biology of alter...,Alternative splicing (AS) of pre-messenger RNA...
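
Because the file name contains the run date, the cell above only finds the file on the day the data was retrieved. One possible alternative is to pick the most recent `documents_*.tsv` in the folder. The helper name `latest_documents_file` is hypothetical, and the sketch relies on the `YYYYMMDD` suffix sorting chronologically as text.

```python
import tempfile
from pathlib import Path


def latest_documents_file(folder):
    """Return the documents_YYYYMMDD.tsv with the latest date stamp, or None if there is none."""
    candidates = sorted(Path(folder).glob("documents_*.tsv"))
    return candidates[-1] if candidates else None


# Demonstration in a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("20220901", "20220905", "20220830"):
        (Path(tmp) / f"documents_{stamp}.tsv").touch()
    newest_name = latest_documents_file(tmp).name
    print(newest_name)
```

In the notebook, the returned path could be passed to `pd.read_csv` in place of the date-based f-string.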


Now let's look at the missing PMIDs:

In [7]:
try:
    missing = pd.read_csv(f'{parentPath}/missing_{today}.tsv', sep='\t')
    print('Number of missing documents: ', len(missing))
    display(missing.head())
except FileNotFoundError:
    # no missing_<date>.tsv file was found, so nothing was reported as missing
    print('No abstracts or titles were missing.')

No abstracts or titles were missing.
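
Independently of the `missing_*.tsv` file, the result can be cross-checked by comparing the requested PMIDs against those in the output table. The sketch below uses made-up identifiers. In the notebook, `requested` would correspond to `pmidList` and `retrieved` to the `PMID` column of `documents`, which pandas most likely reads as integers, hence the conversion to strings.

```python
# Made-up identifiers: requested as strings, retrieved as integers.
requested = ["101", "102", "103", "104"]
retrieved = [101, 103, 104]

# Normalize to strings so both sides are comparable.
retrieved_ids = {str(pmid) for pmid in retrieved}
not_retrieved = [pmid for pmid in requested if pmid not in retrieved_ids]

print(not_retrieved)  # ['102']
```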
