# Using Python to Search and Retrieve Data from the Elsevier Scopus Database

Vincent Scalfani and Lance Simpson, Univ. of Alabama Libraries

These examples use the Elsevier Scopus API and the Python Scopus API-wrapper package, [pybliometrics](https://pybliometrics.readthedocs.io/en/stable/). Code was tested and sample data downloaded from the Scopus API on September 29, 2022 via http://api.elsevier.com and http://www.scopus.com. This tutorial content is intended to help facilitate academic research. Before continuing or reusing any of this code, please be aware of Elsevier's [API policies and appropriate use-cases](https://dev.elsevier.com/use_cases.html). You will also need to register with the [Elsevier Developer Portal](https://dev.elsevier.com/) to request an API key in order to use the Scopus API.

## 1. Conda Environment Setup

We are going to use the Ana(conda) package manager to set up our Python/Scopus development environment. See the documentation for [managing environments](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).

Here is the recipe we will use within the Anaconda Prompt Terminal:

```console
conda create --name my-scopus-env
conda activate my-scopus-env
conda install -c conda-forge jupyterlab matplotlib pandas pip
pip install pybliometrics

```

To launch a Jupyter notebook, type ``jupyter lab`` in the same terminal.

## 2. A five-minute introduction to pandas [1,2]

We'll use pandas DataFrames to work with the Scopus data returned by pybliometrics. [pandas](https://pandas.pydata.org/) is a popular Python library for data analysis and manipulation. The library extends the functionality of working with structured arrays in [NumPy](https://numpy.org/). See our previous workshops for more about pandas: https://github.com/ualibweb/UALIB_Workshops

A pandas [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) is a one-dimensional array, while a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) is a two-dimensional array (e.g., multiple columns). Both the Series and DataFrame structures contain an index [1].

### Basics

In [None]:
# import with common alias for numpy (np) and pandas (pd):
import numpy as np
import pandas as pd

In [None]:
# create some pandas series
# data from https://en.wikipedia.org/wiki/Melting_points_of_the_elements_(data_page)

atomic_number = pd.Series([4, 12, 20, 38, 56, 88])
symbol = pd.Series(['Be', 'Mg', 'Ca', 'Sr', 'Ba', 'Ra'])
melting_point = pd.Series([1287, 923, 842, 777, 727, 700]) # celsius

In [None]:
# create a data frame with headings from our data series
df = pd.DataFrame({'atomic_number': atomic_number, 'symbol': symbol, 'melting_point': melting_point})
df

In [None]:
# We can also create our own named index if we want:
myindex = pd.Series(['Elem0', 'Elem1', 'Elem2', 'Elem3', 'Elem4', 'Elem5'])
df['myindex'] = myindex
df.set_index('myindex', inplace=True)
df

In [None]:
# get info, also try help(df)
df.info()

In [None]:
# view column names
df.columns

**References:**

[1] https://jakevdp.github.io/PythonDataScienceHandbook/

[2] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame

### Dataframe Indexing

There are two main ways to select subsets of a DataFrame via indexing [3,6].

1. [pd.DataFrame.iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html): The iloc property uses integer-based indexing (e.g., `[i,j]`, where `i` is the row position and `j` is the column position).

2. [pd.DataFrame.loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc): The loc property is label-based (e.g., `[row_name, col_name]`).
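
One difference between the two properties that is easy to miss is how they slice. The short sketch below uses a small stand-in DataFrame (the name `demo` is only for illustration) to show that `iloc` slices exclude the end position, while `loc` slices include the end label.

```python
import pandas as pd

# small stand-in DataFrame with a labeled index
demo = pd.DataFrame(
    {"symbol": ["Be", "Mg", "Ca", "Sr"], "melting_point": [1287, 923, 842, 777]},
    index=["Elem0", "Elem1", "Elem2", "Elem3"],
)

# iloc slices by integer position and excludes the end position
by_position = demo.iloc[0:2, 0]

# loc slices by label and includes the end label
by_label = demo.loc["Elem0":"Elem2", "symbol"]

print(by_position.tolist())
print(by_label.tolist())
```

Here the positional slice returns two rows and the label slice returns three.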

In [None]:
df

In [None]:
# select a value at position row 0, col 0
df.iloc[0,0]

In [None]:
# now with loc
df.loc['Elem0','atomic_number']

In [None]:
# select a value at position row 0, col 1
df.iloc[0,1]

In [None]:
# now with loc
df.loc['Elem0','symbol']

In [None]:
# select an entire row or column
# iloc select all rows, column index position 2
df.iloc[:,2]

In [None]:
# loc equivalent
df.loc[:,'melting_point']

In [None]:
# can also use dot indexing
df.melting_point

In [None]:
# positional access within a column; .iloc works regardless of the index labels
df.melting_point.iloc[0]
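
Rows can also be selected by condition, which is the topic of references [4] and [5] below. The following is a minimal sketch on a stand-in DataFrame (`demo`, `high_mp`, and `has_a` are illustrative names, and the 800 degree cutoff is arbitrary): a boolean mask inside `loc` keeps only the rows where the mask is `True`.

```python
import pandas as pd

demo = pd.DataFrame(
    {"symbol": ["Be", "Mg", "Ca", "Sr", "Ba", "Ra"],
     "melting_point": [1287, 923, 842, 777, 727, 700]}
)

# rows where the melting point is above 800 (boolean mask inside loc)
high_mp = demo.loc[demo["melting_point"] > 800]

# rows where the symbol contains a lowercase "a" (string pattern match)
has_a = demo.loc[demo["symbol"].str.contains("a")]

print(high_mp)
print(has_a)
```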

**References:**

[3] http://swcarpentry.github.io/python-novice-gapminder/08-data-frames/index.html

[4] https://stackoverflow.com/questions/17071871/how-to-select-rows-from-a-dataframe-based-on-column-values

[5] https://stackoverflow.com/questions/27975069/how-to-filter-rows-containing-a-string-pattern-from-a-pandas-dataframe

[6] https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

### Dataframe Operations

See doc: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

In [None]:
# get the max of a column
df.melting_point.max()

In [None]:
# return the index label of the row with the max value
df.melting_point.idxmax()

In [None]:
# get sum
df.melting_point.sum()

In [None]:
# convert a column to a python list
a = df.atomic_number.tolist()
a

In [None]:
type(a)

In [None]:
# It's also possible to plot data directly from a DataFrame:
import matplotlib.pyplot as plt
df.plot.scatter(x = "atomic_number", y = "melting_point")
plt.xticks(size=16)
plt.yticks(size=16)
plt.ylabel('Melting Point (celsius)', size=16)
plt.xlabel('Atomic Number', size=16)
plt.show()

## 3. Python For Loops [7]

`for` loops allow repeated execution of code on a known collection of values such as a range of numbers or a list. A general syntax example is as follows:

```python
for item in items:
  do something
```

In [None]:
# put variables in a list
subjects = ["Nursing", "Engineering", "Math", "Science"]

for subject in subjects:
    print(subject)

In [None]:
# Here is another way using an index value
for i in range(len(subjects)):
    print(subjects[i])

In [None]:
# And even one more way! :)
for i, subject in enumerate(subjects):
    print(i, subject)

With the returned Scopus data, we will make use of Python's ability to have lists within lists, and then will reformat this data for further analysis. Here is an example:

In [None]:
# the use of \ here is for clarity, so we can put one list set per line
r = \
['vin',['batman','superman'],[100,90]],\
['tim',['Thor'],[80]], \
['amy',['guardians','venom','x-men','spiderman'],[95,90,70,80]]

r

In [None]:
# Outline of how to index into r
# load our scanned drawing
from IPython.display import Image
Image(filename='indexing_help.png')

Let's look at some of the indexing within this list and then "flatten" the list.

In [None]:
# First "row" or list within the list
r[0]

In [None]:
# element 0
r[0][0]

In [None]:
# element 1 within list 0
r[0][1]

In [None]:
# one more level
r[0][1][0]

In [None]:
# element 2
r[0][2]

In [None]:
# Get all names
names = []
for idx in range(len(r)):
    names.append(r[idx][0])  # list index, 0 for the first entry
print(names)

In [None]:
# Get all movie names
movies = []
for idx in range(len(r)):
    movies.append(r[idx][1])  # same list index, but now 1 for second position
print(movies)

In [None]:
# still a list of lists, so we can "flatten" this more with two loops
movies_flat = []
for idx in range(len(r)): # length of lists of lists
    for movie_idx in range(len(r[idx][1])): # length of individual list of movies
        movies_flat.append(r[idx][1][movie_idx]) # gets actual movie within list of movies
movies_flat

In [None]:
# Now add other data to make the entire original list flat
r_flat = []
for idx in range(len(r)): # length of lists of lists
    for movie_idx in range(len(r[idx][1])): # length of individual list of movies
        r_flat.append([r[idx][0], r[idx][1][movie_idx], r[idx][2][movie_idx]])
r_flat
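
The index-based loops above make every step explicit. As an alternative sketch, the same flattening can be written with tuple unpacking and `zip`, which pairs each movie with the value at the same position. The names `r_demo` and `r_flat_zip` are illustrative, and the data simply repeats `r` so the cell runs on its own.

```python
# same nested structure as r, repeated so this cell is self-contained
r_demo = (
    ['vin', ['batman', 'superman'], [100, 90]],
    ['tim', ['Thor'], [80]],
    ['amy', ['guardians', 'venom', 'x-men', 'spiderman'], [95, 90, 70, 80]],
)

# unpack each inner list, then pair movies with values position by position
r_flat_zip = [
    [name, movie, value]
    for name, movie_list, value_list in r_demo
    for movie, value in zip(movie_list, value_list)
]

for row in r_flat_zip:
    print(row)
```

Both approaches produce one `[name, movie, value]` row per movie; the loop version is easier to step through, while the comprehension is more compact.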

**References:**

[7] https://nbviewer.jupyter.org/github/jakevdp/WhirlwindTourOfPython/blob/master/07-Control-Flow-Statements.ipynb

## 4. Initial Pybliometrics Setup

The first time you run `import pybliometrics`, it will prompt you for your Elsevier Scopus API Key (apply for one here: https://dev.elsevier.com/),
which is then saved to a local config file. See the documentation:
https://pybliometrics.readthedocs.io/en/stable/configuration.html

**N.B. Keep your API key a secret**

In [None]:
import pybliometrics

In [None]:
import numpy as np
import pandas as pd
# import time (we'll use this later for delays)
import time

## 5. Scopus APIs

Scopus has a variety of different APIs, some of which are implemented in pybliometrics:

https://dev.elsevier.com/sc_api_spec.html

https://pybliometrics.readthedocs.io/en/stable/classes.html

Let's take a look at a few of the APIs:


### Abstract Retrieval API

In [None]:
# https://pybliometrics.readthedocs.io/en/stable/classes/AbstractRetrieval.html

from pybliometrics.scopus import AbstractRetrieval
a = AbstractRetrieval("2-s2.0-85109133923", view='FULL') # eid Elsevier identifier (we'll see how to get these below)

In [None]:
print(a)

In [None]:
type(a)

In [None]:
# help(a)

In [None]:
# There are > 50 different properties that you can extract.
# here are a few as examples:

a.abstract

In [None]:
a.doi

In [None]:
a.openaccess

### PlumX API

In [None]:
# https://pybliometrics.readthedocs.io/en/stable/classes/PlumXMetrics.html

from pybliometrics.scopus import PlumXMetrics
plum1 = PlumXMetrics("10.1186/1758-2946-3-33", id_type='doi')
print(plum1)

In [None]:
plum1.citation

In [None]:
# save each metric category to a DataFrame, then combine them:
df_capture1 = pd.DataFrame(plum1.capture)
df_citation1 = pd.DataFrame(plum1.citation)
df_mention1 = pd.DataFrame(plum1.mention)
df_social1 = pd.DataFrame(plum1.social_media)
df_use1 = pd.DataFrame(plum1.usage)

frames1 = [df_capture1, df_citation1, df_mention1, df_social1, df_use1]
df_totals1 = pd.concat(frames1)
df_totals1

### Scopus Search API

In [None]:
# API Doc: https://pybliometrics.readthedocs.io/en/stable/classes/ScopusSearch.html
# We can use standard field limiters like Abstract, Title, etc:
# Search Tips: https://dev.elsevier.com/sc_search_tips.html

from pybliometrics.scopus import ScopusSearch

# search for "chemical fingerprint" in the record abstract and "cheminformatics" in doc source title
q0 = ScopusSearch('ABS("chemical fingerprint") AND SRCTITLE (cheminformatics)', download=False)
q0.get_results_size()
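
Because a Scopus query is an ordinary Python string, it can be assembled from parts before being passed to `ScopusSearch`. The sketch below makes no API calls; `field_query` is a hypothetical helper written for this example, not part of pybliometrics, and it only reproduces the `FIELD(term)` pattern used in the query above.

```python
def field_query(field, term, quote=False):
    """Return one field-limited clause, e.g. TITLE(chemspider)."""
    if quote:
        # wrap multi-word phrases in double quotes
        term = f'"{term}"'
    return f"{field}({term})"

clauses = [
    field_query("ABS", "chemical fingerprint", quote=True),
    field_query("SRCTITLE", "cheminformatics"),
]

# combine the clauses with the AND operator
query = " AND ".join(clauses)
print(query)
```

The resulting string could then be used as the first argument to `ScopusSearch`, in the same way as the hand-written query.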

#### Number of Records for Author

In [None]:
# Scopus Author ID field (AU-ID): 7103233705, Frank S. Bates (Univ. of Minnesota)
q1 = ScopusSearch('AU-ID(7103233705)', download=False)
q1.get_results_size()

#### Download Record Data

In [None]:
q1 = ScopusSearch('AU-ID(7103233705)', download=True)

# save to dataframe
df1 = pd.DataFrame(q1.results)

In [None]:
# view column names
df1.columns

In [None]:
# number of rows
len(df1)

In [None]:
# view first 5 rows
df1.head(5)

In [None]:
# We can index data from our new dataframe, df1.
# For example, create a list of just the DOIs
dois = df1.doi.tolist()
dois[0:20] # print first 20

In [None]:
# Get a list of article titles
titles = df1.title.tolist()
titles[0:20]

In [None]:
# now a list of the cited by count
cited_by = df1.citedby_count.tolist()
cited_by[0:20]

In [None]:
# get sum of cited_by counts
sum(cited_by)

In [None]:
# get max cited_by
df1.citedby_count.max()

In [None]:
df1.citedby_count.idxmax()

In [None]:
# return the data for the idxmax value (looked up by label rather than hard-coded)
df1.loc[df1.citedby_count.idxmax()]

In [None]:
# Get a summary of statistics
df1.citedby_count.describe()
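
To look at more than the single most-cited record, `DataFrame.nlargest` returns the top rows for a column. The sketch below uses made-up placeholder records (`toy` is not real Scopus data) with the same `citedby_count` column name as `df1`.

```python
import pandas as pd

# made-up placeholder records; a real df1 would come from ScopusSearch results
toy = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C", "Paper D"],
    "citedby_count": [12, 250, 3, 87],
})

# the two most-cited rows, sorted from highest to lowest
top2 = toy.nlargest(2, "citedby_count")

# the single most-cited row, looked up by its index label
most_cited = toy.loc[toy["citedby_count"].idxmax()]

print(top2)
print(most_cited["title"])
```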

In [None]:
# plot a quick histogram
import matplotlib.pyplot as plt
df1.loc[:,'citedby_count'].hist(bins=75)
plt.ylabel('Frequency', size=16)
plt.xlabel('citedby', size=16)
ax = plt.gca()
ax.set_xlim(0,1000)
plt.show()

#### Automate several searches with a loop

In [None]:
author_list = [['Emy Decker', '36660678600'], ['Lindsey Lowry', '57210944451'], 
               ['Karen Chapman', '35783926100'], ['Kevin Walker', '56133961300'], 
               ['Sara Whitver', '57194760730']]


# Alternatively if you want to load data from file:

# import csv
# with open('authors.txt') as infile:
#           rows = csv.reader(infile, delimiter='\t')
#           author_list = list(rows)

In [None]:
# get number of Scopus records for each author
num_records = []
for author,authorID in author_list:
    
    # query search
    q = ScopusSearch('AU-ID' +'(' + authorID + ')', download=False)
    num = q.get_results_size()
    
    # compile saved scopus data into a list of lists               
    num_records.append([author, authorID, num])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)

In [None]:
num_records

#### Download Record Data

In [None]:
# Let's say we want the DOIs and cited by counts in a list
cites = []
for author,authorID in author_list:
    
    # query search
    q = ScopusSearch('AU-ID' +'(' + authorID + ')')
    
    # create a dataframe
    q_df = pd.DataFrame(q.results)
       
    # save DOIs to a list
    doi = q_df.doi.tolist()
    
    # save citedby_count to a list
    citedby_count = q_df.citedby_count.tolist()
       
    # compile saved scopus data into a list of lists               
    cites.append([author, doi, citedby_count])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)   

In [None]:
# The cites variable is a list of lists with the data
# view data for first three authors
cites[0:3]

In [None]:
# We can transform this into a flat list as follows
cites_flat = []
for authors in range(len(cites)):
    for doi in range(len(cites[authors][1])):
        cites_flat.append([cites[authors][0], cites[authors][1][doi], cites[authors][2][doi]])
cites_flat[0:25] # show first 25

In [None]:
# add to a DataFrame with named columns
cites_df = pd.DataFrame(cites_flat, columns=['author', 'doi', 'citedby_count'])
cites_df.head(25)
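
pandas can also do this flattening itself. The sketch below assumes pandas 1.3 or later, where `DataFrame.explode` accepts several list columns at once. The data are placeholders in the same `[author, [dois], [counts]]` layout as `cites`; the author names and DOIs are invented for illustration.

```python
import pandas as pd

# placeholder data in the same nested layout as cites
cites_demo = [
    ["Author A", ["10.0000/a1", "10.0000/a2"], [5, 0]],
    ["Author B", ["10.0000/b1"], [12]],
]

demo_df = pd.DataFrame(cites_demo, columns=["author", "doi", "citedby_count"])

# expand both list columns together so each DOI gets its own row
flat_df = demo_df.explode(["doi", "citedby_count"], ignore_index=True)

# exploded columns have object dtype; convert the counts back to integers
flat_df["citedby_count"] = flat_df["citedby_count"].astype(int)
print(flat_df)
```

The two list columns must have matching lengths in every row, which holds here because each DOI has exactly one count.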

#### Save Record Data to a file

Here is one method to loop over author queries and save all Scopus document data for each author to a file:

In [None]:
print(author_list)

In [None]:
# NOTE: this writes one file for each author dataset in your current directory

for authorName,authorID in author_list:
    
    # query search by Author ID
    q = ScopusSearch('AU-ID' +'(' + authorID + ')')
    
    # convert to dataframe
    df = pd.DataFrame(q.results)
    
    # Save to file
    df.to_csv(str(authorName).replace(' ','_') + "_" + str(authorID) + "_ScopusData" + ".tsv", sep = '\t', index=False)
    
    # delay two seconds between api calls to be nice to Elsevier servers
    time.sleep(2)
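
The file-naming step in the loop above can be tried without calling the API. The sketch below is an illustration only: `scopus_filename` is a hypothetical helper that reproduces the naming pattern, the DataFrame holds placeholder values, and the file is written to a temporary directory rather than the current one.

```python
import tempfile
from pathlib import Path

import pandas as pd

def scopus_filename(author_name, author_id):
    """Build a name like Karen_Chapman_35783926100_ScopusData.tsv."""
    return f"{author_name.replace(' ', '_')}_{author_id}_ScopusData.tsv"

# placeholder records standing in for q.results
toy = pd.DataFrame({"doi": ["10.0000/x1", "10.0000/x2"], "citedby_count": [4, 9]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / scopus_filename("Karen Chapman", "35783926100")
    # write tab-separated values without the index column
    toy.to_csv(out_path, sep="\t", index=False)
    # read the file back to confirm the round trip
    loaded = pd.read_csv(out_path, sep="\t")

print(out_path.name)
print(loaded)
```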

In [None]:
# load one of the files into pandas
df_author3 = pd.read_csv('Karen_Chapman_35783926100_ScopusData.tsv', delimiter='\t')
df_author3.head(5) # view first 5

In [None]:
# get info about citedby_count
df_author3.citedby_count.describe()

In [None]:
# get info about publication titles
df_author3.publicationName.describe()

#### Try a Title Search

In [None]:
# Search Scopus for all records containing 'ChemSpider' in the record title
q2 = ScopusSearch('TITLE(chemspider)',download=False)
q2.get_results_size()

In [None]:
# repeat this in a loop for several different searches
titleWord_list = ['chemspider', 'pubchem', 'chembl', 'reaxys', 'scifinder']

# get number of Scopus records for each title search
num_records_title = []
for titleWord in titleWord_list:
    
    # query search
    qt = ScopusSearch('TITLE' +'(' + titleWord + ')',download=False)
    numt = qt.get_results_size()
    
    # compile saved scopus data into a list of lists               
    num_records_title.append([titleWord,numt])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)

In [None]:
num_records_title

#### Download Title Match Record Data

In [None]:
# download records and create a list of selected metadata
titleWord_list = ['chemspider', 'pubchem', 'chembl', 'reaxys', 'scifinder']
scopus_title_data = []

for titleWord in titleWord_list:
    
    # query search
    qt = ScopusSearch('TITLE' +'(' + titleWord + ')') 
    
    # create the dataframe
    qt_df = pd.DataFrame(qt.results)
    
    # save DOIs to a list
    doi = qt_df.doi.tolist()
    
    # save title to a list
    title = qt_df.title.tolist()

    # save coverDate to a list
    coverDate = qt_df.coverDate.tolist()
    
    # compile saved scopus_title_data into a list of lists               
    scopus_title_data.append([titleWord, doi, title, coverDate])
    
    # delay one second between api calls to be nice to Elsevier servers
    time.sleep(1)

In [None]:
scopus_title_data[2]

In [None]:
# create a flat list of scopus_title_data
scopus_title_data_flat = []
for titleWord in range(len(scopus_title_data)):
    for doi in range(len(scopus_title_data[titleWord][1])):
        scopus_title_data_flat.append([scopus_title_data[titleWord][0], # titleWord
                                       scopus_title_data[titleWord][1][doi], # doi
                                       scopus_title_data[titleWord][2][doi], # title
                                       scopus_title_data[titleWord][3][doi]]) # coverdate

scopus_title_data_flat[0:5]

In [None]:
# add to a DataFrame and name the columns
title_df = pd.DataFrame(scopus_title_data_flat)


title_df.rename(columns={0:"titleWord",1: "doi",2: "title", 3: "coverDate"},
                            inplace=True)


pd.options.display.max_rows = 30
title_df

In [None]:
# add a new column with just the year of coverDate and convert to numeric
title_df['coverDate_year'] = title_df.coverDate.str[:4]
title_df['coverDate_year'] = pd.to_numeric(title_df['coverDate_year'])
title_df
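
Slicing the first four characters works when every `coverDate` is a `YYYY-MM-DD` string. As an alternative sketch under that same assumption, the column can be parsed as dates with `pd.to_datetime`, after which the year is available through the `.dt` accessor. The dates below are placeholders.

```python
import pandas as pd

# placeholder dates in the YYYY-MM-DD form assumed for coverDate
toy = pd.DataFrame({"coverDate": ["2019-03-01", "2021-11-15", "2021-01-31"]})

# parse the strings as dates, then pull out the year as an integer
toy["coverDate_year"] = pd.to_datetime(toy["coverDate"]).dt.year
print(toy)
```

Parsing to real dates also makes it possible to extract months or filter by date ranges later, at the cost of failing loudly if a value is not a valid date.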

In [None]:
# filter rows for ChEMBL results
chembl_df = title_df.loc[title_df['titleWord'].str.contains("chembl")]
chembl_df

In [None]:
# get counts by year and sort
chembl_df.loc[:,'coverDate_year'].value_counts().sort_index()

In [None]:
# plot a bar graph of chembl matches in Scopus by year
chembl_df.loc[:,'coverDate_year'].value_counts().sort_index().plot.bar(color='darkseagreen')
plt.ylabel("Number of ChEMBL title occurrences", size=12)
plt.xlabel('Year', size=12)
plt.show()
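
The same count can be produced for every title word at once instead of filtering one word at a time. The sketch below uses `pd.crosstab` on placeholder rows that share the `titleWord` and `coverDate_year` column names with `title_df`; the values are invented for illustration.

```python
import pandas as pd

# placeholder rows with the same column names as title_df
toy = pd.DataFrame({
    "titleWord": ["chembl", "chembl", "pubchem", "pubchem", "pubchem"],
    "coverDate_year": [2020, 2021, 2020, 2020, 2021],
})

# rows are years, columns are title words, cells are record counts
counts = pd.crosstab(toy["coverDate_year"], toy["titleWord"])
print(counts)
```

A table in this shape can be passed to `counts.plot.bar()` to compare the title words side by side for each year.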

## Before leaving, restart your computer to clear any conda and config data