# Crossref API basics
## using Crossref Commons and Habanero

This notebook demonstrates basic literature searches in Crossref (https://search.crossref.org) through the Crossref API. It uses the Crossref Commons package:

https://gitlab.com/crossref/crossref_commons_py

and the Habanero package:

https://habanero.readthedocs.io/en/latest/index.html


Required packages and installation (in addition to standard packages):

     pip install crossref-commons
     
     pip install habanero
 

#### Check Python version and import general packages

In [1]:
# Check Python version for reproducibility and import general packages
import sys, os
print("Python version = ", sys.version)

Python version =  3.9.5 (default, Jun  4 2021, 12:28:51) 
[GCC 7.5.0]


# 1. Crossref Commons

We start with the Crossref Commons package and look at some basic functionality.

### 1.1 Crossref Commons imports and setup

Import some Crossref Commons modules and set up your Crossref user agent.

In [2]:
# import required packages
import crossref_commons.retrieval
from crossref_commons.relations import get_related
from crossref_commons.iteration import iterate_publications_as_json

Setting up the Crossref user agent below is optional, but it may help to get more reliable server responses.

In [3]:
# set up the Crossref user agent using an environment variable
os.environ['CR_API_AGENT'] = "polite user agent; including mailto:foo@bar.com"
os.environ.get('CR_API_AGENT')

'polite user agent; including mailto:foo@bar.com'

In [4]:
# set the contact e-mail address (mailto) using an environment variable
os.environ['CR_API_MAILTO'] = "foo@bar.com"
os.environ.get('CR_API_MAILTO')

'foo@bar.com'

### 1.2 Get reference details from DOI via Crossref Commons

Use Crossref to retrieve the details of a literature reference from its DOI. You can choose a DOI as input in the cell below:

In [5]:
# define DOI
doi_var = '10.1063/1.1699114'

#### Retrieve bibliographic information in JSON format

We can retrieve the bibliographic information linked to this DOI in different formats. Here we choose the JSON format.

In [6]:
# retrieve bibliographic information related to DOI
ref = crossref_commons.retrieval.get_publication_as_json(doi_var)

In [7]:
# print result
print(ref)

{'indexed': {'date-parts': [[2022, 2, 8]], 'date-time': '2022-02-08T13:26:13Z', 'timestamp': 1644326773327}, 'reference-count': 1, 'publisher': 'AIP Publishing', 'issue': '6', 'content-domain': {'domain': [], 'crossmark-restriction': False}, 'short-container-title': ['The Journal of Chemical Physics'], 'published-print': {'date-parts': [[1953, 6]]}, 'DOI': '10.1063/1.1699114', 'type': 'journal-article', 'created': {'date-parts': [[2005, 1, 5]], 'date-time': '2005-01-05T19:34:37Z', 'timestamp': 1104953677000}, 'page': '1087-1092', 'source': 'Crossref', 'is-referenced-by-count': 24174, 'title': ['Equation of State Calculations by Fast Computing Machines'], 'prefix': '10.1063', 'volume': '21', 'author': [{'given': 'Nicholas', 'family': 'Metropolis', 'sequence': 'first', 'affiliation': []}, {'given': 'Arianna W.', 'family': 'Rosenbluth', 'sequence': 'additional', 'affiliation': []}, {'given': 'Marshall N.', 'family': 'Rosenbluth', 'sequence': 'additional', 'affiliation': []}, {'given': 'Au

This is a lot of information! To access specific fields, we first look at how the data is structured. The information is stored in a structure called a 'dictionary', which has a 'key' for each information field. We can easily find out which keys are used here:

In [8]:
# get keys
ref.keys()

dict_keys(['indexed', 'reference-count', 'publisher', 'issue', 'content-domain', 'short-container-title', 'published-print', 'DOI', 'type', 'created', 'page', 'source', 'is-referenced-by-count', 'title', 'prefix', 'volume', 'author', 'member', 'reference', 'container-title', 'original-title', 'language', 'link', 'deposited', 'score', 'subtitle', 'short-title', 'issued', 'references-count', 'journal-issue', 'alternative-id', 'URL', 'relation', 'ISSN', 'issn-type', 'subject', 'published'])

Now we can get the information stored for each key.

In [9]:
# get authors
ref['author']

[{'given': 'Nicholas',
  'family': 'Metropolis',
  'sequence': 'first',
  'affiliation': []},
 {'given': 'Arianna W.',
  'family': 'Rosenbluth',
  'sequence': 'additional',
  'affiliation': []},
 {'given': 'Marshall N.',
  'family': 'Rosenbluth',
  'sequence': 'additional',
  'affiliation': []},
 {'given': 'Augusta H.',
  'family': 'Teller',
  'sequence': 'additional',
  'affiliation': []},
 {'given': 'Edward',
  'family': 'Teller',
  'sequence': 'additional',
  'affiliation': []}]

In [10]:
# get only first author
ref['author'][0]

{'given': 'Nicholas',
 'family': 'Metropolis',
 'sequence': 'first',
 'affiliation': []}

In [11]:
# get publication date
ref['published']

{'date-parts': [[1953, 6]]}

This can now be repeated for other fields, depending on what information we need.
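As a small illustration of combining several keys, the sketch below builds a one-line citation from a record shaped like `ref`. The helper name `format_citation` is hypothetical (it is not part of Crossref Commons), and the sample dictionary only copies fields shown in the output above, so the block runs without a network call.

```python
def format_citation(record):
    """Build a short one-line citation from a Crossref works record.

    Hypothetical helper for illustration; not part of Crossref Commons.
    """
    # Join "given family" for every author; either part may be missing.
    names = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in record.get("author", [])
    ]
    # 'title' and 'short-container-title' are lists; take the first entry if any.
    title = (record.get("title") or [""])[0]
    journal = (record.get("short-container-title") or [""])[0]
    # 'date-parts' looks like [[year, month, day]]; month and day are optional.
    year = record.get("published", {}).get("date-parts", [[None]])[0][0]
    return (
        f"{', '.join(names)} ({year}). {title}. "
        f"{journal} {record.get('volume', '')}, {record.get('page', '')}."
    )


# Sample record containing only fields shown in the output above.
sample_ref = {
    "author": [
        {"given": "Nicholas", "family": "Metropolis"},
        {"given": "Arianna W.", "family": "Rosenbluth"},
        {"given": "Marshall N.", "family": "Rosenbluth"},
        {"given": "Augusta H.", "family": "Teller"},
        {"given": "Edward", "family": "Teller"},
    ],
    "title": ["Equation of State Calculations by Fast Computing Machines"],
    "short-container-title": ["The Journal of Chemical Physics"],
    "published": {"date-parts": [[1953, 6]]},
    "volume": "21",
    "page": "1087-1092",
}

citation = format_citation(sample_ref)
print(citation)
```

Using `.get()` with defaults instead of `record['key']` keeps the helper from failing on records that lack one of the fields.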

### 1.3 Iterate queries with filter options

Crossref supports querying for specific information using different filter options. The available filters are described here:
https://docs.ropensci.org/rcrossref/articles/crossref_filters.html

#### Example 1): Filter for a specific funder defined by its funder ID and reference type 'journal-article'.

In [12]:
# Define Filter
filter = {'funder': '10.13039/501100000038', 'type': 'journal-article'}

Now we define a query to use together with this filter. We search for authors named 'muller' who are affiliated with a university.

In [13]:
# Define query
queries = {'query.author': 'muller', 'query.affiliation': 'university'}

In [14]:
# Now iterate over search results and print the DOI for each result
for p in iterate_publications_as_json(max_results=189, filter=filter, queries=queries):
  print(p['DOI'])

10.1080/03081087.2021.1965947
10.1021/jacs.7b07047
10.1096/fasebj.2021.35.s1.04751
10.1002/ange.202111977
10.1002/anie.202111977
10.1002/jbm.a.35392
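To make the roles of the filter and query dictionaries more concrete, the sketch below shows how such dictionaries could map onto a URL for the public Crossref REST API, where all filters go into one comma-separated `filter` parameter, field queries such as `query.author` are separate parameters, and `rows` sets the page size. The helper `build_works_url` is hypothetical, and Crossref Commons may assemble its requests differently; the block only builds a string and sends nothing.

```python
from urllib.parse import urlencode

# Public Crossref REST API endpoint for works.
WORKS_URL = "https://api.crossref.org/works"


def build_works_url(filters, queries, rows=20):
    """Return a /works URL for the given filter and query dictionaries.

    Hypothetical helper for illustration; Crossref Commons may assemble
    its requests differently.
    """
    params = dict(queries)  # e.g. {'query.author': 'muller'}
    # The REST API takes all filters in ONE parameter: name:value,name:value
    params["filter"] = ",".join(f"{name}:{value}" for name, value in filters.items())
    params["rows"] = rows  # number of results per page
    # Leave ':' ',' and '/' unescaped so the URL stays readable.
    return WORKS_URL + "?" + urlencode(params, safe=":,/")


# Same filter and query values as in Example 1 (named `filters` here to
# avoid shadowing Python's built-in `filter`).
filters = {"funder": "10.13039/501100000038", "type": "journal-article"}
queries = {"query.author": "muller", "query.affiliation": "university"}

url = build_works_url(filters, queries, rows=5)
print(url)
```

Opening the printed URL in a browser is a quick way to inspect the raw JSON that the API returns for a given combination of filters and queries.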


#### Example 2): Filter for specific author defined by ORCID and journal articles

ORCID is a mechanism to unambiguously identify an author. Authors can choose to create an ORCID iD, which then helps others to find their publications. If you don't know the ORCID iD of the author you're searching for (if it exists), you can search for the author's name directly on the ORCID webpage: https://orcid.org.

In [15]:
# Define another filter
filter = {'orcid': '0000-0003-4169-9324', 'type': 'journal-article'}

In [16]:
# Define query; in this case we look for affiliations containing "Texas"
queries = {'query.affiliation': 'Texas'}

In [17]:
# Do query and get results as DOI
for p in iterate_publications_as_json(max_results=189, filter=filter, queries=queries):
  print(p['DOI'])

10.1063/5.0059915
10.1063/5.0041022
10.1063/5.0007276
10.1063/5.0026133
10.1063/5.0064668
10.1063/1.5099194
10.1063/5.0032346
10.1063/5.0032836
10.1063/5.0060314
10.1063/1.5083627
10.1063/1.4976518
10.1063/1.5083040


These were just some of the many query and filter options provided by Crossref Commons. For more options, check the documentation page: https://gitlab.com/crossref/crossref_commons_py

# 2. Habanero

In this second part we use the Habanero package to access the Crossref API. Again, we only work through some simple examples to show basic Habanero functionality.

### 2.1 Habanero imports and setup

Import Habanero modules and set the server communication variables.

#### Import Modules

In [18]:
from habanero import Crossref

In [19]:
from habanero import counts

#### Server communication variables: set up user agent

This is again optional (see above).

In [20]:
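# create a client with the default settings; this is the object used below
cr = Crossref()
# the calls below only demonstrate the available options: each one creates a
# new object that is not assigned, so `cr` keeps its default settings
# set a different base url
Crossref(base_url = "http://some.other.url")
# set an api key
Crossref(api_key = "123456")
# set a mailto address to get into the "polite pool"
Crossref(mailto = "foo@bar.com")
# set an additional user-agent string
Crossref(ua_string = "foo bar")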

< Crossref 
URL: https://api.crossref.org
KEY: None
MAILTO: None
ADDITIONAL UA STRING: foo bar
>

### 2.2  Citation counts

Get the number of citations for a given paper, identified by its DOI, from the Crossref database.

In [21]:
# define DOI of paper
doi_var = '10.1371/journal.pone.0042793'

In [22]:
# get citation counts
print("number of citations = ", counts.citation_count(doi = doi_var))

number of citations =  47


### 2.3 Queries

The two example queries below test some of the query options documented at https://habanero.readthedocs.io/en/latest/modules/crossref.html

#### Example A: Query for a specific topic defined by keywords

In [23]:
# define search topic
search_topic = "Gauge field theory"

Now we run the query for the defined search topic. We limit the output to the first 10 results using the "limit" argument, and we filter for journal articles by setting 'type' to 'journal-article'.

In [24]:
# do query 
test_query = cr.works(query = search_topic, limit=10, filter = {'type': 'journal-article'})

In [25]:
# store main query result
query_result = test_query['message']

In [26]:
# the query result is provided as a dictionary structure with different keys
query_result.keys()

dict_keys(['facets', 'total-results', 'items', 'items-per-page', 'query'])

We limited our query to the first 10 results, which is what we get back. However, the response also tells us how many results there are in total for this query in the database.

In [27]:
# find out the total number of results in the database 
print("total results = ", query_result['total-results'])

total results =  1541840
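Since the total is far larger than the 10 items returned, retrieving more results means requesting further pages. The sketch below only does the bookkeeping: it computes which offsets would be needed for a given page size. The helper `page_offsets` is hypothetical and performs no network calls; the numbers are taken from the query above.

```python
import math


def page_offsets(total_results, page_size, max_items=None):
    """Yield the offset of each page needed to cover a result set.

    Hypothetical helper for illustration; it performs no network calls.
    """
    # Optionally cap how many items we are willing to retrieve in total.
    wanted = total_results if max_items is None else min(total_results, max_items)
    n_pages = math.ceil(wanted / page_size)
    for page in range(n_pages):
        yield page * page_size


# Numbers taken from the query above: 1541840 hits, 10 per page.
total = 1541840
print("pages needed for everything:", math.ceil(total / 10))

# Offsets for only the first 35 results, 10 per page.
offsets = list(page_offsets(total, page_size=10, max_items=35))
print(offsets)
```

Each offset could then be passed to `cr.works(...)` through its `offset` argument together with `limit`. For very deep result sets, the Crossref API documentation recommends cursor-based paging rather than large offsets, so offsets are best kept to the first few pages.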


Now we look at our 10 results, again by using the keys of the dictionary data structure.

In [28]:
# the information we need is stored under the key "items": a list in which each entry is again a dictionary with keys
query_result['items'][0].keys()

dict_keys(['indexed', 'reference-count', 'publisher', 'issue', 'content-domain', 'short-container-title', 'published-print', 'DOI', 'type', 'created', 'page', 'source', 'is-referenced-by-count', 'title', 'prefix', 'volume', 'author', 'member', 'published-online', 'container-title', 'link', 'deposited', 'score', 'issued', 'references-count', 'journal-issue', 'URL', 'ISSN', 'issn-type', 'subject', 'published'])

We can look at each of the keys and list items separately. Here are two examples:

In [29]:
# look at the DOI of the first item
query_result['items'][0]['DOI']

'10.1088/0253-6102/37/4/427'

In [30]:
# look at the journal title of the third item
query_result['items'][2]['container-title']

['Physical Review D']

We can also list all 10 results, selecting which information fields to display. Here we list the 10 results with publisher, DOI, author list, and title.

In [31]:
# Display list with selected fields
for n, item in enumerate(query_result['items']):
    print(n, '\t Publisher =',  item['publisher'], ',\tDOI =', item['DOI'])
    print(item['author'])
    print(item['title'])
    print('---------------------------------------------------------------------------------------------------')

0 	 Publisher = IOP Publishing ,	DOI = 10.1088/0253-6102/37/4/427
[{'given': 'Wu', 'family': 'Ning', 'sequence': 'first', 'affiliation': []}]
['Supersymmetric U(1) Gauge Field Theory with Massive Gauge Field']
---------------------------------------------------------------------------------------------------
1 	 Publisher = Elsevier BV ,	DOI = 10.1006/aphy.1996.0139
[{'given': 'Haruichi', 'family': 'Yabuki', 'sequence': 'first', 'affiliation': []}]
['Partially Gauge Invariant Field Configurations and the Gribov Horizon inSU(2) Gauge Field Theory']
---------------------------------------------------------------------------------------------------
2 	 Publisher = American Physical Society (APS) ,	DOI = 10.1103/physrevd.21.1067
[{'given': 'B.', 'family': 'Sakita', 'sequence': 'first', 'affiliation': []}]
['Field theory of strings as a collective field theory ofU(N)gauge fields']
---------------------------------------------------------------------------------------------------
3 	 Publish
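The two key listings shown earlier differ from each other, which suggests that records do not all carry the same fields; indexing with `item['author']` raises a `KeyError` when a field is absent. The sketch below shows one defensive way to extract the same four fields. The helper `summarize_item` is hypothetical (not part of Habanero); the first sample item copies fields from the output above, and the second is a made-up record used only to exercise the fallbacks.

```python
def summarize_item(item):
    """Return (publisher, DOI, author names, title) with safe fallbacks.

    Hypothetical helper for illustration; not part of Habanero.
    """
    # .get() returns a default instead of raising KeyError for absent fields.
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in item.get("author", [])
    ]
    titles = item.get("title") or ["<no title>"]
    return (
        item.get("publisher", "<no publisher>"),
        item.get("DOI", "<no DOI>"),
        authors or ["<no authors>"],
        titles[0],
    )


# Two items: the first copies fields from the output above,
# the second is a made-up record that lacks 'author' and 'title'.
sample_items = [
    {
        "publisher": "IOP Publishing",
        "DOI": "10.1088/0253-6102/37/4/427",
        "author": [{"given": "Wu", "family": "Ning", "sequence": "first", "affiliation": []}],
        "title": ["Supersymmetric U(1) Gauge Field Theory with Massive Gauge Field"],
    },
    {"publisher": "Example Press", "DOI": "10.0000/example"},
]

summaries = [summarize_item(item) for item in sample_items]
for n, (publisher, doi, authors, title) in enumerate(summaries):
    print(n, publisher, doi, "; ".join(authors), title, sep=" | ")
```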

#### Example B: Query for publications of a given author defined by name and ORCID

In the second query example, we look for the publications of a specific researcher, whom we identify by name and ORCID iD. If you don't know the ORCID iD of the author you're searching for (if it exists), you can search for the author's name directly on the ORCID webpage: https://orcid.org.

In [32]:
# Define author by name and ORCID
author_name = "Walter Thiel"
author_orcid = "0000-0001-6780-0350"

Now we run the query for the defined author. Instead of sorting by relevance (the default), we sort the results by the number of times they are referenced. The query is limited to the first 5 items.

In [33]:
# do query
test_query2 = cr.works(query_author = author_name, sort="is-referenced-by-count", limit=5, filter = {'orcid': author_orcid})

In [34]:
# store result in query2_result variable and get total number of results
query2_result = test_query2['message']
print("total results = ", query2_result['total-results'])

total results =  40


We can again display the results as a list. Here we choose to display the publication date, authors, title, journal name, volume, and issue. In contrast to the list above, we only list the authors' given and family names, not the entire author field. This makes the list more readable.

In [35]:
# display results as a list
for n, item in enumerate(query2_result['items']):
    print(n, '\t', 'publication date =', item['published']['date-parts'])
    # one single-element list per author, holding "given family"
    authors = [[author['given'] + " " + author['family']] for author in item['author']]
    print(authors)
    print(item['title'])
    print(item['container-title'], " volume =", item['volume'], ", issue =", item['issue'])
    print('---------------------------------------------------------------------------------------------------')

0 	 publication date = [[2017, 1, 3]]
[['Xing Gao'], ['Shuming Bai'], ['Daniele Fazzi'], ['Thomas Niehaus'], ['Mario Barbatti'], ['Walter Thiel']]
['Evaluation of Spin-Orbit Couplings with Linear-Response Time-Dependent Density Functional Methods']
['Journal of Chemical Theory and Computation']  volume = 13 , issue = 2
---------------------------------------------------------------------------------------------------
1 	 publication date = [[2017, 2, 7]]
[['Dragoş-Adrian Roşca'], ['Karin Radkowski'], ['Larry M. Wolf'], ['Minal Wagh'], ['Richard Goddard'], ['Walter Thiel'], ['Alois Fürstner']]
['Ruthenium-Catalyzed Alkyne trans-Hydrometalation: Mechanistic Insights and Preparative Implications']
['Journal of the American Chemical Society']  volume = 139 , issue = 6
---------------------------------------------------------------------------------------------------
2 	 publication date = [[2018, 2, 16]]
[['Alexandre Guthertz'], ['Markus Leutzsch'], ['Lawrence M. Wolf'], ['Puneet Gupta'], 
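The publication dates printed above are nested lists such as `[[2017, 1, 3]]`, and the earlier record had only a year and a month (`[[1953, 6]]`). The sketch below shows one way to turn such a `date-parts` field into a compact string. The helper `format_date_parts` is hypothetical and is not part of Habanero.

```python
def format_date_parts(date_field):
    """Turn {'date-parts': [[2017, 1, 3]]} into '2017-01-03'.

    Hypothetical helper for illustration. Month and day are optional,
    so [[1953, 6]] becomes '1953-06'.
    """
    parts = date_field.get("date-parts", [[]])[0]
    # Guard against an empty list or a missing year.
    if not parts or parts[0] is None:
        return "unknown"
    year, *rest = parts
    # Zero-pad month and day to two digits.
    return "-".join([str(year)] + [f"{p:02d}" for p in rest])


full_date = format_date_parts({"date-parts": [[2017, 1, 3]]})
partial_date = format_date_parts({"date-parts": [[1953, 6]]})
print(full_date)
print(partial_date)
```

In the display loop, `format_date_parts(item['published'])` could replace `item['published']['date-parts']` to print a plain date string instead of the nested list.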