This is a short tutorial on JSTOR Data for Research (DfR), which provides datasets of information about the articles and books on JSTOR. DfR is often used for text mining on academic articles. I was first introduced to DfR in 2016 in a Digital Humanities seminar and found it quite useful for historiographical research. DfR is a rather new service from JSTOR, so not many people are familiar with it yet, and I thought it would be nice to write this short tutorial on it.
DfR has changed a lot since 2016. It has a better user interface and a faster server, but one thing I don't like is that the data format has also changed. It used to provide citation information for every article and book in a CSV file; now it only provides metadata for each article and book in XML format. Fortunately, it is not hard to extract information from the XML files, and I have written a simple Python script for processing the data. Besides the metadata for each article, DfR also provides word (unigram), bigram, and trigram counts in TXT files. I have included some of my past data visualizations based on DfR at the end of this tutorial.
This is also my community contribution to the EDVA course.
You can sign up for an account on the homepage. It is free, but I don't think you can sign in through your library.
Click “Create a Dataset”
Search by keyword. You can refine the results with the panels on the left.
Once you are ready, click “Request Dataset”
Select the data you want. You will receive an email when the dataset is ready (usually within an hour, since the DfR server is quite fast now).
As I mentioned, the data from DfR comes in XML and TXT formats. You can find the data I requested in the data folder.
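If you want to double-check the folder layout before running the pre-processing script below, a quick glob will do. The paths here follow the layout of my request (data/metadata/ for the XML metadata and data/ngram1/ for the unigram counts); adjust them if your dataset is organized differently.

import glob

# Count the files in the dataset; the paths match my request's layout,
# so adjust them if your download is organized differently.
xml_files = glob.glob('data/metadata/*.xml')
txt_files = glob.glob('data/ngram1/*-ngram1.txt')
print(len(xml_files), "metadata XML files")
print(len(txt_files), "unigram TXT files")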
It is not hard to pre-process the data, and you can use my scripts to convert it to CSV or JSON format. Here I build a word count of the articles by publication year. BeautifulSoup is helpful for parsing the XML files. JSON works better for the data I wanted (you can imagine that a CSV file would get large, since there would be a lot of zero entries). I've pasted my pre-processing script here:
from bs4 import BeautifulSoup
import glob
import re
import json

total_freq = {}
# The json schema here is {pub-year: {word: count}}
for xml in glob.iglob('data/metadata/*.xml'):
    with open(xml) as f:
        bs = BeautifulSoup(f, "lxml-xml")
    # Extract the publication year from the <year> tag
    year = int(bs.year.get_text())
    total_freq[year] = total_freq.get(year, {})
    # The matching unigram file sits in data/ngram1/ with a -ngram1.txt suffix
    txt = xml.replace("metadata", "ngram1").replace(".xml", "-ngram1.txt")
    with open(txt) as t:
        for line in t:
            # Each line is a word and its count, separated by whitespace
            sub = re.split(r"\s+", line.strip())
            word = sub[0]
            count = int(sub[1])
            total_freq[year][word] = total_freq[year].get(word, 0) + count

with open("count.json", "w") as out:
    json.dump(total_freq, out)
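Once count.json is written, it is easy to load it back and take a quick look at the result. Here is a small sketch that prints the ten most frequent words for one publication year (note that json.dump turns the integer year keys into strings):

import json

# Load the {pub-year: {word: count}} dictionary written above.
with open("count.json") as f:
    total_freq = json.load(f)

# JSON object keys are always strings, so look the year up as a string.
year = "1990"  # pick any year that appears in your dataset
top = sorted(total_freq.get(year, {}).items(), key=lambda kv: kv[1], reverse=True)[:10]
for word, count in top:
    print(word, count)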
Of course, there is a lot more information you can extract from the XML files; explore them on your own. The XML is rather clean, and I have not been coding defensively, but for this small set of 789 records, my script works smoothly.
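As a starting point for exploring the rest of the metadata, here is a sketch that pulls a few extra fields from one XML file. The tag names journal-title and article-title are my guesses at what JATS-style metadata usually contains (only the year tag is used in my script above), so open one of your XML files first to check which tags are actually there:

from bs4 import BeautifulSoup
import glob

# Inspect one metadata file; the hyphenated tag names below are assumptions,
# so check an actual XML file from your dataset to see what it contains.
xml = next(glob.iglob('data/metadata/*.xml'))
with open(xml) as f:
    bs = BeautifulSoup(f, "lxml-xml")

for tag in ("journal-title", "article-title", "year"):
    node = bs.find(tag)
    print(tag, ":", node.get_text() if node else "(not found)")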
Here are some visualizations from the historiographical research I did in 2016, made with ggplot2. I plotted the five-year rolling average of the frequencies (as percentages) of several keywords in articles about Natsume Soseki. This is not difficult once you have clean data, like the JSON file generated above.
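My ggplot2 code is not included here, but the underlying computation is easy to reproduce in Python from count.json. Below is a sketch that turns one keyword's counts into yearly percentages and then a five-year rolling mean with pandas; the keyword "kokoro" is only a placeholder, so substitute the words you actually want to track.

import json
import pandas as pd

# Load the {pub-year: {word: count}} dictionary produced earlier.
with open("count.json") as f:
    total_freq = json.load(f)

keyword = "kokoro"  # placeholder: substitute the keyword you are tracking

rows = []
for year, counts in total_freq.items():
    total = sum(counts.values())
    if total == 0:
        continue
    # Frequency of the keyword as a percentage of all words in that year.
    rows.append({"year": int(year), "pct": 100 * counts.get(keyword, 0) / total})

df = pd.DataFrame(rows).sort_values("year").set_index("year")
# Five-year rolling average of the percentage.
df["rolling_pct"] = df["pct"].rolling(window=5, min_periods=1).mean()
print(df.tail())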