# Scopus and RDSA References

**Abstract**: Rice University's Fondren Library started the Rice Digital Scholarship Archive (RDSA) in 2007 as a way to collect and preserve digital objects related to Rice University, and to make them available to researchers both within and outside the University. Making these objects available to the research world at large affords Fondren the opportunity to peek in on how scholars are using our materials, and to check whether RDSA materials are being properly cited, as other research sources are. This Notebook details one investigation that ultimately led to findings about implied researcher behavior in citing sources.

## Intro

Originally, I wanted to check Elsevier's Scopus, the largest abstract and citation database of peer-reviewed literature, to see how many articles could be traced to Rice University, or to the Rice Institute, as a source. (The Rice Institute is Rice University's original name, and the name survives in the Rice Institute Pamphlets collection in the RDSA.)

Scopus provides both a GUI search interface and the Scopus Search API for conducting inquiries. Each allows users to search the entire text of parsed reference citations:

```
GUI General Reference Search Results:
"Rice Institute" : 320 document results
"Rice University" : 2,510 document results
```

Of course, it became clear these results could be misleading: if the cited article itself is about Rice... well, that doesn't always mean that it is sourced *from* Rice. I switched to distinct GUI/API queries that searched the title of each citation (REFTITLE), as well as the source of the citation (REFSRCTITLE).

```
Search on REFTITLE:
"Rice Institute" : 30 document results
"Rice University" : 326 document results
```

```
Search on REFSRCTITLE:
"Rice Institute" : 293 document results
"Rice University" : 2,164 document results
```
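
As a rough sketch of how the field-restricted queries above might be sent to the Scopus Search API, the snippet below only builds the request URLs and sends nothing. The `build_search_url` helper is hypothetical, and the key is a placeholder, as elsewhere in this Notebook. The `REFTITLE` and `REFSRCTITLE` field codes are the ones used in the searches above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.elsevier.com/content/search/scopus"


def build_search_url(field, phrase, api_key="XXXXXXXXXXXXXXXXXXXX"):
    """Return a Scopus Search API URL for an exact-phrase search on one field."""
    # Field-restricted queries take the form FIELD("phrase").
    query = '%s("%s")' % (field, phrase)
    return SEARCH_ENDPOINT + "?" + urlencode({"query": query, "apiKey": api_key})


# Print the four URLs corresponding to the four searches reported above.
for field in ("REFTITLE", "REFSRCTITLE"):
    for phrase in ("Rice Institute", "Rice University"):
        print(build_search_url(field, phrase))
```

Keeping the query construction in one function makes it easy to add other field codes later without retyping the endpoint.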

If we look at these REFSRCTITLE searches plotted by the publication year of the citing articles, we can see an amazing jump: by 2013, the citing of Rice material was three and a half times more than it was in 2007, when the RDSA was founded.

![graph1.PNG](graph1.PNG)

Was this jump driven by the introduction of the RDSA? It seemed worth looking at how many Scopus articles specifically cite RDSA digital objects!

## Data Collection
Scopus's level of searchable metadata detail is top-notch. Full-text articles with cited references have the entirety of that information parsed into structured data fields, which makes it extremely easy to search and retrieve citation information, including cited web URLs:

![scopus-1.PNG](scopus-1.PNG)

Incidentally, the cornerstone of an RDSA object's citation is a URL -- technically, a URI -- known as the object's *permanent RDSA handle ID*, a persistent identifier (*permalink*) supplied by the [Corporation for National Research Initiatives](https://www.handle.net/), which begins with **hdl.handle.net/1911/**.

In addition, there is a non-permanent URL for each object that uses our repository's general web domain: **scholarship.rice.edu/**

So, for example, the RDSA article *"Natural Associativity and Commutativity"* has two (ultimately identical) identifying URLs:

https://scholarship.rice.edu/handle/1911/62865
and
http://hdl.handle.net/1911/62865

Since both of these types of URLs could turn up in Scopus articles, I decided to look for both forms in General Reference Scopus searches. Once I found an article containing one of the URLs in a cited reference, I could grab the RDSA URL and look at which RDSA materials were being used.
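
To illustrate that the two URL forms point at the same object, here is a minimal sketch that reduces either form to its bare handle. The `rdsa_handle` helper and its pattern are illustrative assumptions, not part of the original workflow, and the pattern covers only the two forms shown above.

```python
import re

# Matches either URL form and captures the numeric suffix after the 1911 prefix.
RDSA_URL = re.compile(
    r"https?://(?:hdl\.handle\.net|scholarship\.rice\.edu/handle)/1911/(\d+)"
)


def rdsa_handle(url):
    """Return the handle (e.g. '1911/62865') for an RDSA URL, or None."""
    match = RDSA_URL.search(url)
    return "1911/" + match.group(1) if match else None


# Both forms of the example article's URL reduce to the same handle.
print(rdsa_handle("https://scholarship.rice.edu/handle/1911/62865"))
print(rdsa_handle("http://hdl.handle.net/1911/62865"))
```

Comparing handles rather than full URLs means a reference is counted once regardless of which form the citing author chose.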

At the same time, I could also grab that citation's internal Scopus ID, creating a list of cited RDSA objects that could be cross-referenced with Scopus's Abstract Retrieval API to get further data.

![scopus-2.PNG](scopus-2.PNG)

The results from these two URL searches were combined and condensed, resulting in a list of 167 Scopus IDs. I converted this list into a JSON doc of Scopus IDs:
```
{  
   "scopus":[  
      85038912567,
      85020498833,
      85038619276,
      85031498308,
      85029651603,
      85020001078,
      85028570268,
      85026642128...
```

Further data gathering was accomplished with Python. Using the above JSON list, my Python script issued calls to the Abstract Retrieval API. The XML returned from each API call was then searched for references containing RDSA URLs, which were extracted.

**(Note: recreation of this method requires: (1) an API key from Elsevier, and (2) access to an institutional token from Elsevier. Elsevier provides Rice's on-campus IP range with an institutional token that allows a far greater amount of returned data from the APIs. My own API key has been obfuscated in the below Python scripts.)**
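
For anyone recreating this, one option is to keep the credentials out of the script entirely. Elsevier's APIs accept the key in an `X-ELS-APIKey` request header and an institutional token in `X-ELS-Insttoken`. The sketch below builds those headers from environment variables. The variable names `ELSEVIER_API_KEY` and `ELSEVIER_INST_TOKEN` and the `elsevier_headers` helper are hypothetical, and this is not how the scripts in this Notebook pass the key.

```python
import os


def elsevier_headers(env=os.environ):
    """Build Elsevier request headers from environment variables.

    The environment variable names are illustrative assumptions.
    """
    headers = {"X-ELS-APIKey": env["ELSEVIER_API_KEY"]}
    token = env.get("ELSEVIER_INST_TOKEN")
    if token:
        # Only send the institutional token when one is configured.
        headers["X-ELS-Insttoken"] = token
    return headers
```

The result could then be passed along with a request, for example `requests.get(url, headers=elsevier_headers())`.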

The below script uses a simple RegEx to look for RDSA URLs:

In [None]:
import json
import re

import requests
from bs4 import BeautifulSoup

# Load the list of citing-article Scopus IDs gathered from the URL searches.
with open("scopus.json") as json_input:
    data = json.load(json_input)

scopus_ids = data["scopus"]

# Raw strings keep the regex escapes intact.
handle_pattern = re.compile(r"hdl\.handle\.net/1911/")
domain_pattern = re.compile(r"scholarship\.rice\.edu")

for scopus_id in scopus_ids:
    r = requests.get('https://api.elsevier.com/content/abstract/scopus_id/' + str(scopus_id) + '?apiKey=XXXXXXXXXXXXXXXXXXXX')
    xml_received = r.text
    soup = BeautifulSoup(xml_received, "xml")
    reference = soup.find("ref-fulltext", string=handle_pattern)
    alt_reference = soup.find("ref-fulltext", string=domain_pattern)
    # find() returns None when nothing matches, so look up the
    # reference's item ID only after confirming a match exists.
    if reference:
        ref_id_a = reference.parent.find("itemid")
        print("%s||%s||%s" % (scopus_id, ref_id_a, reference))
    if alt_reference:
        ref_id_b = alt_reference.parent.find("itemid")
        print("%s||%s||%s" % (scopus_id, ref_id_b, alt_reference))

Unfortunately, the data in Scopus is not infallible. Some reference URLs were malformed, due to either author error or Scopus parsing error. I modified the script to look for variations of RDSA URLs with errant spaces and/or incorrect punctuation:

In [None]:
# Allow stray whitespace around the dots and slash of either URL form.
handle_pattern = re.compile(r"hdl\s*\.\s*handle\s*\.\s*net\s*/\s*1911")
domain_pattern = re.compile(r"scholarship\s*\.\s*rice\s*\.\s*edu")

for scopus_id in scopus_ids:
    r = requests.get('https://api.elsevier.com/content/abstract/scopus_id/' + str(scopus_id) + '?apiKey=XXXXXXXXXXXXXXXXXXXXXXXXX')
    xml_received = r.text
    soup = BeautifulSoup(xml_received, "xml")
    reference = soup.find("ref-fulltext", string=handle_pattern)
    alt_reference = soup.find("ref-fulltext", string=domain_pattern)
    for match in (reference, alt_reference):
        if match is None:
            continue
        # Prefer the SGR-typed item ID; fall back to any item ID on the reference.
        ref_id = match.parent.find("itemid", idtype="SGR") or match.parent.find("itemid")
        ref_link = match.parent.find("ref-website")
        # get_text() returns just the element's text, without the surrounding tags.
        print('||'.join([
            str(scopus_id),
            ref_id.get_text() if ref_id else 'None',
            match.get_text(),
            str(ref_link),
        ]))

Once the RDSA links are grabbed, they can be cleaned up and used to grab other kinds of data from the RDSA pages directly. Sample data:
```
{
   "data":[
      {
         "Article_Scopus_ID":85038912567,
         "Ref_website_cleaned":"https://scholarship.rice.edu/handle/1911/70290"
      },
      {
         "Article_Scopus_ID":85020498833,
         "Ref_website_cleaned":"https://scholarship.rice.edu/handle/1911/19169"
      },
      {
         "Article_Scopus_ID":85038619276,
         "Ref_website_cleaned":"https://scholarship.rice.edu/handle/1911/70350"
      }]}
```
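
The cleanup step that produces records shaped like the sample above might look roughly like the following. This is a sketch under assumptions: it expects the four-field `||`-delimited lines printed by the second script, the `to_handle_records` helper is hypothetical, and it rewrites every match to the `scholarship.rice.edu/handle` form used in the sample data.

```python
import json
import re

# Whitespace-tolerant match for either RDSA URL form; captures the handle suffix.
RDSA_LOOSE = re.compile(
    r"(?:hdl\s*\.\s*handle\s*\.\s*net"
    r"|scholarship\s*\.\s*rice\s*\.\s*edu\s*/\s*handle)"
    r"\s*/\s*1911\s*/\s*(\d+)"
)


def to_handle_records(output_lines):
    """Turn 'article_id||ref_id||reference text||link' lines into JSON-ready records."""
    records = []
    for line in output_lines:
        article_id, _ref_id, reference_text, _link = line.split("||", 3)
        match = RDSA_LOOSE.search(reference_text)
        if match:
            records.append({
                "Article_Scopus_ID": int(article_id),
                "Ref_website_cleaned": "https://scholarship.rice.edu/handle/1911/" + match.group(1),
            })
    return {"data": records}


# Hypothetical script output: the reference ID and reference text are invented.
sample = ["85038912567||12345||Example, A. Retrieved from http://hdl. handle.net/1911/ 70290||None"]
print(json.dumps(to_handle_records(sample), indent=3))
```

A URL whose digits are themselves split by a space would be truncated by this pattern, so any real cleanup pass would still want a manual check.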

In [None]:
import requests
from bs4 import BeautifulSoup
import json

# Load the cleaned RDSA links paired with their citing-article Scopus IDs.
with open("scopus_handles.json") as json_input:
    buffed = json.load(json_input)

for x in buffed["data"]:
    r = requests.get(x["Ref_website_cleaned"])
    html_received = r.text
    soup = BeautifulSoup(html_received, 'html.parser')
    refer_ence = soup.find("dim")
    if refer_ence:
        # decode_contents() returns the markup inside <dim>, without the outer tag.
        print(str(x["Article_Scopus_ID"]) + '||' + refer_ence.decode_contents())
    else:
        print(str(x["Article_Scopus_ID"]) + '||Nothing')

All of the output was later collected and collated in a spreadsheet.

## Analysis

![graph2.PNG](graph2.PNG)

**What do the API search results tell us?** For starters, our 167 results reference a total of 154 unique materials in the RDSA, encompassing theses and dissertations, Rice Institute pamphlets, faculty articles, and others. (The most frequently occurring source in our data is a [1974 Rice Institute pamphlet](https://scholarship.rice.edu/handle/1911/63159), which showed up in six articles.)
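
Counting unique materials and finding the most frequently cited one is a small tallying job. The sketch below assumes one normalized handle per citing article. The `summarize_handles` helper is hypothetical and the sample rows are made up, not the 167 actual results.

```python
from collections import Counter


def summarize_handles(handles):
    """Return (number of unique handles, most-cited handle, its count)."""
    counts = Counter(handles)
    top_handle, top_count = counts.most_common(1)[0]
    return len(counts), top_handle, top_count


# Made-up rows: one handle per citing article.
sample = ["1911/63159", "1911/62865", "1911/63159", "1911/70290"]
print(summarize_handles(sample))
```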

**How about citation counts?** The Python script that looks for RDSA URLs with errant spaces and/or incorrect punctuation is, at the same time, grabbing a second set of Scopus IDs. This is because each item in an article's reference section is ultimately connected back to its main Scopus database entry. Knowing the IDs for our RDSA materials means we can also grab their citation counts with slightly modified Python:

In [None]:
for scopus_id in scopus_ids:
    r = requests.get('https://api.elsevier.com/content/abstract/scopus_id/' + str(scopus_id) + '?apiKey=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
    xml_received = r.text
    soup = BeautifulSoup(xml_received,"xml")
    cited_count = soup.find("citedby-count")
    if cited_count:
        date_data = soup.find("prism:coverDate")
        # get_text() returns just the element's text, without the surrounding tags.
        print(str(scopus_id) + '||' + cited_count.get_text() + '||' + (date_data.get_text() if date_data else 'None'))

Roughly 64 percent of the RDSA materials in the results have only been cited once, and a full three-quarters have been cited three times or fewer. However, seven RDSA docs have gotten *a ton of use* in Scopus -- including three Rice Institute pamphlets, two faculty articles, and a thesis.
<br />

| Scopus Reference ID | Handle | Cited Count |
|--|--|--|
|8645875|https://scholarship.rice.edu/handle/1911/62733|51|
|33745146554|https://scholarship.rice.edu/handle/1911/19969|56|
|34548277499|https://scholarship.rice.edu/handle/1911/21679|96|
|3509551|https://scholarship.rice.edu/handle/1911/9176/|155|
|79951656251|http://scholarship.rice.edu/handle/1911/62229|163|
|8744485|http://scholarship.rice.edu/handle/1911/63159|194|
|3146836|http://scholarship.rice.edu/handle/1911/62865|230|
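
Shares like the ones quoted before the table can be computed directly from the collected citation counts. A minimal sketch follows, with a hypothetical `share_at_most` helper and made-up counts rather than the study's data.

```python
def share_at_most(counts, threshold):
    """Fraction of items whose citation count is at or below the threshold."""
    if not counts:
        return 0.0
    return sum(1 for c in counts if c <= threshold) / len(counts)


# Made-up counts for illustration only; not the study's data.
sample_counts = [1, 1, 1, 2, 3, 5, 51, 230]
print("cited once: %.1f%%" % (100 * share_at_most(sample_counts, 1)))
print("cited three times or fewer: %.1f%%" % (100 * share_at_most(sample_counts, 3)))
```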

Additionally, Scopus metadata has already indexed the subject matter of the articles referencing the RDSA objects:

In [None]:
for scopus_id in scopus_ids:
    r = requests.get('https://api.elsevier.com/content/abstract/scopus_id/' + str(scopus_id) + '?apiKey=XXXXXXXXXXXXX')
    xml_received = r.text
    soup = BeautifulSoup(xml_received,"xml")
    subjects = soup.find_all("subject-area")
    for subject in subjects:
        print(str(scopus_id) + '||' + str(subject))

These results were easily cleaned and condensed to reportable subject areas. While the natural/formal sciences of course took up quite a chunk of the pie in aggregate, there was a strong showing from the arts and social sciences as well:

![graph3.PNG](graph3.PNG)
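
The cleaning and condensing step could be sketched as below. It assumes each output line has the form `scopus_id||<subject-area ...>` and that the tag carries an `abbrev` attribute naming the broad subject area. That attribute name is an assumption about the returned XML, the `tally_subjects` helper is hypothetical, and the sample lines are invented.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tally_subjects(output_lines):
    """Count broad subject abbreviations in 'scopus_id||<subject-area ...>' lines."""
    tally = Counter()
    for line in output_lines:
        _scopus_id, tag_text = line.split("||", 1)
        element = ET.fromstring(tag_text)
        # Assumed attribute name; fall back to a placeholder when it is absent.
        tally[element.get("abbrev", "UNKNOWN")] += 1
    return tally


# Invented sample lines for illustration.
sample = [
    '1||<subject-area abbrev="MATH">Example mathematics area</subject-area>',
    '1||<subject-area abbrev="SOCI">Example social science area</subject-area>',
    '2||<subject-area abbrev="MATH">Another mathematics area</subject-area>',
]
print(tally_subjects(sample).most_common())
```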

## Extended Analysis

While all of these results were impressive, something nagged at me. I couldn't put my finger on it until I overlaid the number of 2007-2016 RDSA search results on top of the original REFSRCTITLE search results and saw the giant chasm between the two:

![graph4.PNG](graph4.PNG)

Full disclosure: I think it naive to assume everything in the RDSA should be found in Scopus. While we do have almost 3,000 scholarly publications in the RDSA and thousands upon thousands more dissertations, archival images, university-related documents, and cultural heritage collections, it would be utterly foolish to think that every journal article written by Rice faculty and staff will be deposited there, or that everything in the RDSA should be somehow referenced in Scopus.

That being said, I suspect that we are missing out on a limited number of citations that should explicitly link to the RDSA.

Take, for example, *Natural Associativity and Commutativity*, the top-cited RDSA document from our search results; written in 1963, it has 230 citations in Scopus between 1970 and 2017, with 75 in the past ten years.

Only *one* of these 75 citations, it turns out, links directly to the RDSA.

Is it possible that people have been citing a physical copy of the pamphlet for the last decade? I suppose; its author, Saunders Mac Lane, included it in a monograph in the late 1970s. But none of those articles reference the book -- they reference the pamphlet itself.

Part of the issue may be the references themselves; I noticed that some can be of varying accuracy, and Scopus (understandably) simply parses them at face value.

Take, for example, our Rice Institute Pamphlet that showed up six times in the search results. Its RDSA page provides a helpful example citation, and because academia uses several reference citation formats, we can't expect the example citation to be the format every article uses, right?

And yet, all six of the citations are subtly different from each other, in their use of author name form, document source title, and even which handle URL:

|Article ID|Citation Used in Article|
|--|--|
|85034101958|Turner, Victor. (1974). "Liminal to Liminoid, in Play, Flow and Ritual: An Essay in Comparative Symbology". Rice Institute Pamphlet - Rice University Studies 60/3: 53-92. https://scholarship.rice.edu/bitstream/handle/1911/63159/article_RIP603_part4.pdf (Accessed 3 June 2016).|
|84947289407|V.Turner, (1974). Liminal to liminoid, in play, flow, and ritual:An essay in comparative symbology. Rice University Studies, 60(3), 53–92. Retrieved from https://scholarship.rice.edu/handle/1911/63159|
|85006004128|Turner, V. (1974) ‘Liminal to liminoid, in play, flow, and ritual: an essay in comparative symbology’, Rice Institute pamphlet, Rice University Studies, 60 (3), available at: http://hdl.handle.net/1911/63159.|
|85021383745|Turner, V. (1974). Liminal to Liminoid, in Play, Flow, and Ritual: An Essay in Comparative Symbology. Rice Institute Pamphlet-Rice University Studies, 60, (3). Recuperado de http://hdl.handle.net/1911/63159.|
|85020977329|Turner, V. (1982). Liminal to liminoid, in play, flow, and ritual: An essay in comparative symbology. Rice University Studies, 60(3), 53-92. Retrieved from http://hdl.handle.net/1911/63159|
|84964008023|V.Turner, (1974). Liminal to liminoid, in play, flow, and ritual: An essay in comparative symbology. Rice University Studies, 60(3), 53–92. Retrieved from http://hdl.handle.net/1911/63159|

(All that being said, Scopus was still skilled enough to connect all six to the same master reference ID.)

The greater take-away from this investigation may be the necessity to remind scholars to properly cite our material in the RDSA. This is, of course, no easy feat with no simple solution -- after all, we already provide citation examples for virtually all of our data.

Interestingly, the problem may be the product of how disciplines prefer to cite research. For example, guidelines within both the Chicago Manual of Style and the Modern Language Association (MLA) generally prescribe the inclusion of URLs or permalinks for ebooks, journals, news or magazine articles consulted online. Meanwhile, the American Psychological Association (APA) citation style has, historically, shown preference toward using the "print citation information" for referencing a print article obtained from an online database: "By providing this information, you allow people to retrieve the print version if they do not have access to the database from which you retrieved the article." ([Source](https://owl.english.purdue.edu/owl/resource/560/10/)) Of course, times change, and APA does now prescribe using URLs; however, their wariness toward entropy of web links is fully apparent [here](http://www.apastyle.org/manual/related/electronic-sources.pdf).

Because of this, a possible way to start rectifying the situation might be to emphasize the importance of using our permalink RDSA handles, either directly on our repository or through outreach to faculty members. One of the resulting findings from the investigation was the realization that only half of the citations cite our handle URLs; the others cite the non-permanent **scholarship.rice.edu** domain, or link directly to the cited digital object's file:

![graph5.PNG](graph5.PNG)
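
Sorting the cited URLs into these three groups can be done with a few substring checks. A minimal sketch follows; the `classify_rdsa_link` helper and its category labels are hypothetical, and it assumes direct file links contain `/bitstream/` in the path, as in the first Turner citation above.

```python
def classify_rdsa_link(url):
    """Label a cited URL as 'handle', 'bitstream', 'repository', or 'other'."""
    lowered = url.lower()
    if "hdl.handle.net/1911/" in lowered:
        return "handle"
    if "scholarship.rice.edu" in lowered:
        # Direct links to an object's file carry /bitstream/ in the path.
        return "bitstream" if "/bitstream/" in lowered else "repository"
    return "other"


# URLs taken from the Turner citations above.
for url in (
    "http://hdl.handle.net/1911/63159",
    "https://scholarship.rice.edu/handle/1911/63159",
    "https://scholarship.rice.edu/bitstream/handle/1911/63159/article_RIP603_part4.pdf",
):
    print(classify_rdsa_link(url), url)
```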

In any case, it would be a start, if not a magic bullet solution.