<a href="https://colab.research.google.com/github/rvdinter/slr-study-selection/blob/main/SLR5_Data_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#SLR5 - Data Retrieval
This notebook enables researchers to retrieve citations through accessing multiple API's at once.


**Requirements:**


*   API keys for Pubmed and Elsevier API
*   For Elsevier's API, you must have a connection via an Academic IP address (Can be through VPN)



In [1]:
#@title Run this cell to install and import all requirements { display-mode: "form" }
!pip install elsapy
!pip install biopython
from Bio import Entrez
from elsapy.elsclient import ElsClient
from elsapy.elsprofile import ElsAuthor, ElsAffil
from elsapy.elsdoc import FullDoc, AbsDoc
from elsapy.elssearch import ElsSearch
import pandas as pd
import ipywidgets as widgets
import csv 
from Bio import Medline
import requests

Collecting elsapy
  Downloading https://files.pythonhosted.org/packages/d8/7b/934ef0e29ebc283d60ef9ae78c1f583ef5ca652144dc8215bba4de9d9fde/elsapy-0.5.0-py3-none-any.whl
Installing collected packages: elsapy
Successfully installed elsapy-0.5.0
Collecting biopython
[?25l  Downloading https://files.pythonhosted.org/packages/76/02/8b606c4aa92ff61b5eda71d23b499ab1de57d5e818be33f77b01a6f435a8/biopython-1.78-cp36-cp36m-manylinux1_x86_64.whl (2.3MB)
[K     |████████████████████████████████| 2.3MB 5.8MB/s 
Installing collected packages: biopython
Successfully installed biopython-1.78


In [2]:
#@title # SLR5 - Data Retrieval { display-mode: "form" }
#@markdown ---

#@markdown ### Enter a the following required lines:
elsevier_api_key = 'bb28136bcb09d86431fcbc948c0fda22' #@param {type:'string'}
springer_api_key = 'b16ffe0831f09468ac1c6b6d1aedb21c' #@param {type:'string'}
your_email = "Your.Name.Here@example.org" #@param {type:'string'}
Entrez.email = your_email
#@markdown ---

#@markdown ### Choose your databases
pubmed = True #@param {type:"boolean"}
sciencedirect = False #@param {type:"boolean"}
springer = True #@param {type:"boolean"}

#@markdown ---

#@markdown ### Put in your search query
search_query = "Automation OR Automate OR Automates OR Automating) AND (\"Systematic Literature Review\" OR \"Systematic Review\")" #@param {type:'string'}

#@markdown ### Choose your fields
#@markdown If you didn't choose the database to search, it does not matter what field you choose for that particular database

# EXAMPLE: Here, we request all possible fields for the PubMed database.
# print possible_fields_df if you want to use another field that Title/Abstract
# Through Entrez.einfo, you can also find all other databases and their fields.
# record = Entrez.read(Entrez.einfo(db='pubmed'))
# possible_fields_df = pd.DataFrame(record['DbInfo']['FieldList'])

pubmed_field = 'TIAB' #@param ['ALL', 'UID', 'FILT', 'TITL', 'WORD', 'MESH', 'MAJR', 'AUTH', 'JOUR', 'AFFL', 'ECNO', 'SUBS', 'PDAT', 'EDAT', 'VOL', 'PAGE', 'PTYP', 'LANG', 'ISS', 'SUBH', 'SI', 'MHDA', 'TIAB', 'OTRM', 'INVR', 'COLN', 'CNTY', 'PAPX', 'GRNT', 'MDAT', 'CDAT', 'PID', 'FAUT', 'FULL', 'FINV', 'TT', 'LAUT', 'PPDT', 'EPDT', 'LID', 'CRDT', 'BOOK', 'ED', 'ISBN', 'PUBN', 'AUCL', 'EID', 'DSO', 'AUID', 'PS', 'COIS'] {type:"raw"}

#@markdown ### Choose in which timeframe you want to search
#@markdown Springer only searches on year, not exact date
start_date = '2000/01/01' #@param {type:"date"}
end_date = '2021/01/01' #@param {type:"date"}

In [3]:
#@title Run this cell to execute your query { display-mode: "form" }
if pubmed:
    print('--- Pubmed ---')
    # Find all article IDs containing search query, sorted by relevance
    handle = Entrez.esearch(db="pubmed", retmax=200, term=search_query, sorted='relevance', idtype="acc",field=pubmed_field, mindate=start_date, maxdate=end_date)
    record = Entrez.read(handle)
    handle.close()

    # Retrieve all article data by ID
    idlist = record["IdList"]
    handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="text")
    docs = []

    # Parse data in medline format and save to file
    articles = Medline.parse(handle)
    for article in articles:
        docs.append(article)
    pubmed_df = pd.DataFrame(docs)
    pubmed_df = pubmed_df.rename(columns={'TI': 'title', 'AB':'abstract', 'AID': 'identifier', 'DP':'publicationDate'})
    pubmed_df['database'] = 'pubmed'
    print("The query interpreted by PubMed: {} \nThis query resulted into {} records found.".format(record['QueryTranslation'], record["Count"]))

if springer:
    print('--- Springer ---')
    # Springer allows a maximum of 100 returns
    retmax = 100
    x = requests.get(f'http://api.springernature.com/meta/v2/json?q={search_query}&api_key={springer_api_key}&p={retmax}&date-facet-mode=between&facet-start-year={start_date[:4]}&showAll=true&facet-end-year={end_date[:4]}')
    springer_df = pd.DataFrame(x.json()['records'])
    springer_df['database'] = 'springer'
    print("This query resulted into {} records found.".format(len(springer_df)))


if sciencedirect:
    print('--- ScienceDirect ---')
    ## Initialize client
    client = ElsClient(elsevier_api_key)

    # If get_all = True, then all results will be retrieved, with a maximum of 5000, 
    # otherwise, 20 results will be retrieved (1 API request)
    get_all = False

    ## Initialize doc search object using ScienceDirect and execute search, 
    #   retrieving all results
    doc_srch = ElsSearch(search_query, 'sciencedirect')
    doc_srch.execute(client, get_all=get_all)
    print("doc_srch has", doc_srch.len_res(), "results.")

    sciencedirect_df = doc_srch.results_df

if not pubmed and not springer and not sciencedirect:
    print('No database selected')
else:
    query_results = pd.concat([pubmed_df[['title', 'abstract', 'identifier', 'database', 'publicationDate']], springer_df[['title', 'abstract', 'identifier', 'database', 'publicationDate']]], ignore_index=True)
    query_results.to_excel('query_results.xlsx')
    print('Articles have been saved to query_results.xlsx')

--- Pubmed ---
The query interpreted by PubMed: Automation[Title/Abstract] OR Automate[Title/Abstract] OR Automates[Title/Abstract] OR Automating[Title/Abstract] AND ("Systematic Literature Review"[Title/Abstract] OR "Systematic Review"[Title/Abstract]) AND 2000/01/01[EDAT] : 2021/01/01[EDAT] 
This query resulted into 127 records found.
--- Springer ---
This query resulted into 100 records found.
Articles have been saved to query_results.xlsx


In [4]:
#@title #SLR6 - Study Selection
from IPython.display import HTML, display, clear_output
import ipywidgets as widgets
from sklearn.model_selection import train_test_split
import numpy as np

In [5]:
#@markdown ### Run this cell if you want to load the full dataset and split into train and test data
query_results = pd.read_excel('query_results.xlsx')
query_results['label'] = np.nan

Xtrain, Xtest, ytrain, ytest = train_test_split(query_results[['title', 'publicationDate', 'identifier', 'abstract']], query_results['label'], test_size=0.5)

train = pd.concat([Xtrain, ytrain], axis=1)
test = pd.concat([Xtest, ytest], axis=1)
train.to_excel('train.xlsx')
test.to_excel('test.xlsx')

In [6]:
#@markdown ### Run this cell to start manual classification of the train set
train_set = pd.read_excel('train.xlsx')
train_set = train_set[['title', 'publicationDate', 'identifier', 'abstract', 'label']]


already_classified = train_set[~train_set['label'].isna()]
train = train_set[train_set['label'].isna()].reset_index()


include = widgets.Button(description="Include")
exclude = widgets.Button(description="Exclude")
save_and_exit = widgets.Button(description="Save and Exit")
hbox = widgets.HBox([include, exclude, save_and_exit])
display(hbox)

i = 0

def on_button_clicked_include(b):
    global i
    train['label'][i] = True
    clear_output()
    if i < len(train):
        i += 1
        display(hbox)
        print('\nReview {} out of {}.\n\nTitle: {}\nYear: {}\nIdentifier: {}\nAbstract: {}'.format(i+1, len(train)+1, train['title'][i], train['publicationDate'][i], train['identifier'][i], train['abstract'][i]))
    else:
        print('All citations have been reviewed.')
        save_set = pd.concat([already_classified, train], ignore_index=True)
        save_set.to_excel('train.xlsx')
        print('Training set updated and saved.')

def on_button_clicked_exclude(b):
    global i
    train['label'][i] = False
    clear_output()
    if i < len(train):
        i += 1
        display(hbox)
        print('\nReview {} out of {}.\n\nTitle: {}\nYear: {}\nIdentifier: {}\nAbstract: {}'.format(i+1, len(train)+1, train['title'][i], train['publicationDate'][i], train['identifier'][i], train['abstract'][i]))
    else:
        print('All citations have been reviewed.')
        save_set = pd.concat([already_classified, train], ignore_index=True)
        save_set.to_excel('train.xlsx')
        print('Training set updated and saved.')

def on_button_clicked_save_and_exit(b):
    save_set = pd.concat([already_classified, train], ignore_index=True)
    save_set.to_excel('train.xlsx')
    print('Training set updated and saved.')


include.on_click(on_button_clicked_include)
exclude.on_click(on_button_clicked_exclude)
save_and_exit.on_click(on_button_clicked_save_and_exit)
print('\nReview {} out of {}.\n\nTitle: {}\nYear: {}\nIdentifier: {}\nAbstract: {}'.format(i+1, len(train)+1, train['title'][i], train['publicationDate'][i], train['identifier'][i], train['abstract'][i]))

HBox(children=(Button(description='Include', style=ButtonStyle()), Button(description='Exclude', style=ButtonS…


Review 5 out of 114.

Title: Effects of a multifaceted intervention to promote the use of intravenous iron sucrose complex instead of ferric carboxymaltose in patients admitted for more than 24 h
Year: 2021-02-01
Identifier: doi:10.1007/s00228-020-02993-y
Abstract: Purpose Although more practical for use, the impact of ferric carboxymaltose (FCM) on the hospital budget is considerable, and intravenous iron sucrose complex (ISC) represents a cost-saving alternative for the management of iron deficiency anemia in patients during hospitalization. The Drug Committee decided to reserve FCM for day hospitalizations and contraindications to ISC, especially allergy. ISC was available for prescription for all other situations. Methods The impact of a multifaceted intervention promoting a switch from FCM to ISC was evaluated using an interrupted time series model with segmented regression analysis. The standardized rate of the dispensing of FCM, ISC, and oral iron by the hospital pharmacy, as w