# Querying the Scopus API

## 1. Initial Set Up

##### Importing required libraries.

In [1]:
import ScopusFunctions
import pandas as pd

Note: ScopusFunctions is a set of functions I built to query the API and extract the information I need. It requires a config.py file containing the Scopus API key, as well as the Python requests library; both are imported inside ScopusFunctions. Below, I use the help function to list the functions and their docstrings.

In [2]:
help(ScopusFunctions)

Help on module ScopusFunctions:

NAME
    ScopusFunctions - Created on Wed Apr  3 14:39:13 2019

DESCRIPTION
    @author: shannonhagerty

FUNCTIONS
    GetPaperIDsFromAuthID(authID)
        This function uses Author Id to query Scopus API and returns a list of
        ScopusID values assigned to each paper the author has in Scopus, if author
        ID is 0 then the list returned will just have one 0 value
    
    getArticleInfo(ScopusID)
        This function has one argument, ScopusID which is associated with each
        paper in Scopus Database, it then queries Scopus API and returns 5 
        variables 1)a list of authors, 2) a string the article title, 3) year 
        published, 4) Name of Journal holding publication, 5) A mostly fully 
        constructed citation, it returns zeroes for everything if ScopusID is 0
    
    getAuthIDFromName(FirstName, LastName, Affiliation='brandeis university')
        This function takes three arguments 1) First name of person, 
        2) 
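
The source of ScopusFunctions is not shown in this notebook, so the following is only a guess at the kind of request a function like getAuthIDFromName might build. It assumes the Elsevier Author Search endpoint, the `X-ELS-APIKey` header, and the AUTHFIRST/AUTHLAST/AFFIL search fields; the helper name `build_author_search_request` and the placeholder key are made up for illustration. Nothing is sent over the network here.

```python
# Hypothetical sketch: this is NOT the actual ScopusFunctions code.
AUTHOR_SEARCH_URL = "https://api.elsevier.com/content/search/author"


def build_author_search_request(first_name, last_name, api_key,
                                affiliation="brandeis university"):
    """Return the URL, headers, and params for an author search (no network call)."""
    query = (f"AUTHFIRST({first_name}) AND AUTHLAST({last_name}) "
             f"AND AFFIL({affiliation})")
    return {
        "url": AUTHOR_SEARCH_URL,
        "headers": {"X-ELS-APIKey": api_key, "Accept": "application/json"},
        "params": {"query": query},
    }


# "YOUR-KEY" is a placeholder; the real key would come from config.py.
request_spec = build_author_search_request("Mitch", "Cherniack", api_key="YOUR-KEY")

# With the requests library, this could then be sent as:
# requests.get(request_spec["url"], headers=request_spec["headers"],
#              params=request_spec["params"])
```

Separating request construction from the network call makes it possible to inspect the query string before spending any API quota.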

##### Setting file paths and names.

The input is a csv file with three columns named FirstName, LastName, and Dept.

In [3]:
FileWithNames = 'BrandeisComputerScience/FacultyList_ComputerScience.csv' 
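
Since everything downstream depends on those three column names, it can help to check them before running any queries. Here is a minimal sketch; the helper name `missing_columns` and the sample row are made up, and an in-memory file stands in for the real faculty list.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["FirstName", "LastName", "Dept"]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Made-up sample standing in for the real faculty file.
sample_csv = io.StringIO(
    "FirstName,LastName,Dept\n"
    "Ada,Example,Computer Science\n"
)
sample = pd.read_csv(sample_csv)

# An empty list means the file has every column the notebook expects.
problems = missing_columns(sample)
```

With the real file, the same check would be `missing_columns(pd.read_csv(FileWithNames))`.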

This is the path and file name for the csv that will be generated. It contains the columns FacultyLastName, FacultyFirstName, authList, title, Year, Journal, citation, and ScopusID, where each row is an article published by a faculty member from FileWithNames. A paper coauthored by several faculty members on the list appears once for each of them.

In [38]:
FileWithOutputPapers = 'BrandeisComputerScience/ComputerSciencePapers.csv'
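
Because a coauthored paper shows up once per faculty coauthor, counting distinct papers later means collapsing on ScopusID. A small sketch with made-up rows (the names and IDs are placeholders, not real data):

```python
import pandas as pd

# Made-up rows: paper "111" was coauthored by two faculty members.
papers_demo = pd.DataFrame({
    "FacultyLastName": ["Example", "Sample", "Example"],
    "ScopusID": ["111", "111", "222"],
})

# One row per distinct paper, keeping the first faculty member listed.
unique_papers = papers_demo.drop_duplicates(subset="ScopusID")

# Papers per faculty member still uses the full table, repeats included.
papers_per_faculty = papers_demo.groupby("FacultyLastName")["ScopusID"].nunique()
```

Keeping the repeats in the saved file and deduplicating only at analysis time preserves the link between each paper and every faculty coauthor.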

This is the name of the file that will list faculty who do not have a perfect AuthID match. These names need to be searched manually in Scopus to check whether they have no ID or have an ID that is associated with a former affiliation.

In [39]:
FileWithNamesToCheck = 'BrandeisComputerScience/FacultyCheck.csv'

After you get the file with names to check, look up each name manually. You should end up with a file with the columns FacultyLastName, FacultyFirstName, and AuthID.

You'll load that file back in to get the papers for the faculty you had to check manually.

In [40]:
CheckedFileName = 'BrandeisComputerScience/FacultyChecked.csv'

## 2. Generating dataframe of faculty publications

First, create the dataframes that will hold the output. The first is the Papers dataframe, which holds the paper information.

In [41]:
Papers= pd.DataFrame(columns=['FacultyLastName','FacultyFirstName','authList', 'title', 'Year', 'Journal', 'citation', 'ScopusID'])

Next is the dataframe that will hold the first and last names of faculty who need to be searched manually to get their Scopus ID because no perfect match was found. This happens if a new faculty member has a non-Brandeis affiliation associated with their account, or if the person really doesn't have an author page in Scopus.

In [42]:
FacultyCheck =pd.DataFrame(columns=['FacultyLastName','FacultyFirstName'])

Next we have a for loop that goes over FacultyList and uses the getAuthIDFromName function to get the author ID and the total number of documents associated with that author ID. If the author ID is returned as 0 (meaning we didn't get a perfect match), we add the name to the FacultyCheck dataframe. Otherwise, we pass the authID to the GetPaperIDsFromAuthID function, which returns a list of ScopusIDs (i.e. unique identifiers for each article). We then loop over that list and call the getArticleInfo function for each ID. Each row of the Papers dataframe gets the last and first name of the faculty member who prompted the query, then the author list, article title, year published, journal, a mostly complete citation, and the ScopusID. If elements of the citation aren't retrieved, they are simply left out of the citation.

In [43]:
FacultyList=pd.read_csv(FileWithNames)

In [44]:
for i in range(len(FacultyList)):
    firstName = FacultyList['FirstName'][i]
    lastName = FacultyList['LastName'][i]
    authID, docCount = ScopusFunctions.getAuthIDFromName(firstName, lastName)
    # Progress message: the loop can take a while, so show which faculty member is being queried.
    print('Retrieving ', docCount, ' documents for ', firstName, ' ', lastName)
    if authID == '0':
        # No perfect match: save the name for a manual check.
        # Last name goes first to match the FacultyCheck column order.
        FacultyCheck.loc[len(FacultyCheck.index) + 1] = [lastName, firstName]
    else:
        PaperIDList = ScopusFunctions.GetPaperIDsFromAuthID(authID)
        print('...retrieved ', len(PaperIDList), ' papers...')
        for j, paperID in enumerate(PaperIDList):
            print('...paper index', j + 1, ' of ', docCount)
            authList, title, Year, Journal, citation = ScopusFunctions.getArticleInfo(paperID)
            Papers.loc[len(Papers.index) + 1] = [lastName, firstName, authList, title, Year, Journal, citation, paperID]

Perfect Match Not Found Rick Alterman
Retrieving  0  documents for  Rick   Alterman
Retrieving  25  documents for  Mitch   Cherniack
...retrieved  25  papers...
...paper index 1  of  25
...paper index 2  of  25
...paper index 3  of  25
...paper index 4  of  25
...paper index 5  of  25
...paper index 6  of  25
...paper index 7  of  25
...paper index 8  of  25
...paper index 9  of  25
...paper index 10  of  25
...paper index 11  of  25
...paper index 12  of  25
...paper index 13  of  25
...paper index 14  of  25
...paper index 15  of  25
...paper index 16  of  25
...paper index 17  of  25
...paper index 18  of  25
...paper index 19  of  25
...paper index 20  of  25
...paper index 21  of  25
...paper index 22  of  25
...paper index 23  of  25
...paper index 24  of  25
...paper index 25  of  25
Retrieving  5  documents for  Antonella   DiLillo
...retrieved  5  papers...
...paper index 1  of  5
...paper index 2  of  5
...paper index 3  of  5
...paper index 4  of  5
...paper index 5  of  5
P

...paper index 2  of  49
...paper index 3  of  49
...paper index 4  of  49
...paper index 5  of  49
...paper index 6  of  49
...paper index 7  of  49
...paper index 8  of  49
...paper index 9  of  49
...paper index 10  of  49
...paper index 11  of  49
...paper index 12  of  49
...paper index 13  of  49
...paper index 14  of  49
...paper index 15  of  49
...paper index 16  of  49
...paper index 17  of  49
...paper index 18  of  49
...paper index 19  of  49
...paper index 20  of  49
...paper index 21  of  49
...paper index 22  of  49
...paper index 23  of  49
...paper index 24  of  49
...paper index 25  of  49
...paper index 26  of  49
...paper index 27  of  49
...paper index 28  of  49
...paper index 29  of  49
...paper index 30  of  49
...paper index 31  of  49
...paper index 32  of  49
...paper index 33  of  49
...paper index 34  of  49
...paper index 35  of  49
...paper index 36  of  49
...paper index 37  of  49
...paper index 38  of  49
...paper index 39  of  49
...paper index 40  o

NOTE: If your query covers a very large number of papers or people, it sometimes goes awry and you have to run it in pieces. This is why the loop prints which person and paper it is querying, so that you know where to break the work into chunks. The code will need to be adapted to do this.
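
One way to run the query in pieces is to slice the faculty list into fixed-size chunks and save partial results after each one. This is only a sketch: the helper `iter_chunks`, the chunk size, and the sample names are all made up, and the actual query loop is left as a comment.

```python
import pandas as pd


def iter_chunks(df, chunk_size):
    """Yield consecutive slices of df with at most chunk_size rows each."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


# Made-up faculty list standing in for FacultyList.
faculty_demo = pd.DataFrame({
    "FirstName": ["A", "B", "C", "D", "E"],
    "LastName": ["One", "Two", "Three", "Four", "Five"],
})

chunk_sizes = []
for chunk in iter_chunks(faculty_demo, chunk_size=2):
    # In real use, run the query loop on `chunk` here and save the partial
    # results (for example to a numbered csv) before moving on.
    chunk_sizes.append(len(chunk))
```

Each chunk keeps the row labels of the original dataframe, so a loop that indexes by position, like the one above, would need `chunk.reset_index(drop=True)` first.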

Okay, let's save the file with the names we need to check.

In [45]:
FacultyCheck.to_csv(FileWithNamesToCheck)

Now open that file and look up the names in Scopus. Delete any faculty who do not have a Scopus ID, and add a third column, AuthID, with the author IDs you were able to confirm. Save the result as the checked file, then load it back in and query for those faculty members' papers.

In [49]:
CheckedNames = pd.read_csv(CheckedFileName)
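
One thing to watch for when loading a hand-edited file: if any AuthID cell is left blank, pandas reads the whole column as floating point, and converting a value to a string can then pick up a trailing `.0`. A minimal sketch of a safer load, using a made-up in-memory file and a made-up ID:

```python
import io

import pandas as pd

# Made-up checked file: the second row has no AuthID yet.
checked_text = (
    "FacultyLastName,FacultyFirstName,AuthID\n"
    "Example,Ada,12345678900\n"
    "Sample,Bob,\n"
)

# Read AuthID as text so it is passed to the API exactly as typed.
checked_demo = pd.read_csv(io.StringIO(checked_text), dtype={"AuthID": str})

# Drop rows where no ID was found during the manual check.
checked_demo = checked_demo.dropna(subset=["AuthID"]).reset_index(drop=True)
auth_ids = checked_demo["AuthID"].tolist()
```

With the real file, the same `dtype` argument can be added to the `pd.read_csv(CheckedFileName)` call.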

Then we take that list and repeat the process we used for the full FacultyList to get the paper IDs and article info.

In [50]:
for i in range(len(CheckedNames)):
    PaperIDList= ScopusFunctions.GetPaperIDsFromAuthID(str(CheckedNames.loc[i, 'AuthID']))
    for j in range(len(PaperIDList)):
        paperID = PaperIDList[j]
        authList, title, Year, Journal, citation = ScopusFunctions.getArticleInfo(paperID)
        Papers.loc[(len(Papers.index)+1)] = [CheckedNames['FacultyLastName'][i], CheckedNames['FacultyFirstName'][i], authList, title, Year, Journal, citation, paperID]
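
Both loops in this notebook repeat the same inner steps, so they could share one helper. Here is a sketch of that idea; `collect_paper_rows` is a made-up name, and the two `fake_` functions are stand-ins so the example runs without an API key. It also builds the dataframe once from a list of rows instead of appending with `.loc` on every paper.

```python
import pandas as pd

PAPER_COLUMNS = ["FacultyLastName", "FacultyFirstName", "authList", "title",
                 "Year", "Journal", "citation", "ScopusID"]


def collect_paper_rows(last_name, first_name, auth_id,
                       get_paper_ids, get_article_info):
    """Return one row (a list) per paper for a single author ID."""
    rows = []
    for paper_id in get_paper_ids(auth_id):
        auth_list, title, year, journal, citation = get_article_info(paper_id)
        rows.append([last_name, first_name, auth_list, title, year,
                     journal, citation, paper_id])
    return rows


# Stand-in functions returning made-up data.
def fake_get_paper_ids(auth_id):
    return ["p1", "p2"]


def fake_get_article_info(paper_id):
    return (["A. Example"], f"Title {paper_id}", "2019",
            "Some Journal", "citation text")


rows = collect_paper_rows("Example", "Ada", "123",
                          fake_get_paper_ids, fake_get_article_info)
papers_demo = pd.DataFrame(rows, columns=PAPER_COLUMNS)
```

With the real module, the last two arguments would be `ScopusFunctions.GetPaperIDsFromAuthID` and `ScopusFunctions.getArticleInfo`.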


We now have as complete a list of papers as we're going to get for now, so we can save this dataframe.

In [51]:
Papers.to_csv(FileWithOutputPapers)
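
When the saved file is read back later, a column holding Python lists comes back as plain text. Assuming authList is a list of author names (as the getArticleInfo docstring suggests), a round trip might look like this sketch, which uses made-up data and an in-memory buffer in place of the real file:

```python
import ast
import io

import pandas as pd

# Made-up stand-in for the Papers dataframe.
papers_demo = pd.DataFrame({
    "authList": [["A. Example", "B. Sample"]],
    "title": ["A made-up title"],
})

buffer = io.StringIO()
papers_demo.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# Each list was written as its string form; parse it back into a real list.
reloaded["authList"] = reloaded["authList"].apply(ast.literal_eval)
```

`ast.literal_eval` only parses Python literals, so it is a safer choice than `eval` for text read from a file.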