vignettes/known-quirks.Rmd

---
title: "Known Quirks of JSTOR/DfR Data"
author: "Thomas Klebel"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Known Quirks of JSTOR Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Collecting all the quirks
Data from JSTOR/DfR is unlike most other data you encounter when doing text 
analysis. First and foremost, the data about articles and books come from a
wide variety of journals and publishers. The level of detail and certain formats
vary because of this. `jstor` tries to deal with this situation with two
strategies:

- try to recognise the format and read data accordingly
- if this is not possible, read data as "raw" as possible, i.e. without any 
conversions

An example for the first case are references. Four different ways how
references can be specified are known at this time, and all are imported in 
specific ways to deal this variation. There might however be other formats,
which should lead to an informative error when trying to import them via 
`jst_get_references()`.

An example for the latter case are page numbers. Most of the time, the entries
for page numbers are simply `42`, or `61`. This is as expected, and could be
parsed as integers. Sometimes, there are characters present, like `M77`. This
would pose no problem either, we could simply extract all digits via regex and
parse as character. Unfortunately, sometimes the page is specified like this:
`v75i2p84`. Extracting all digits would result in `75284`, which is wrong by
a long shot. Since there might be other ways of specifying pages, `jstor` does
not attempt to parse the pages to integers when importing. However, it
offers a set of convenience functions which deal with a few common cases
(see `jst_augment()` and below).

There are many other problems or peculiarities like this. This vignette tries to
list as many as possible, and offer solutions for dealing with them.
Unfortunately I have neither the time nor the interest to wade through all the
data which you could get from DfR in order to find all possible quirks. The
following list is thus inevitably incomplete. If you encounter new 
quirks/peculiarities, it would be greatly appreciated if you sent me an email,
or [opened an issue at GitHub](https://github.com/ropensci/jstor/issues).
I will then include your findings in future version of this vignette, so this 
vignette can
be a starting point for everybody who conducts text analysis with data from
JSTOR/DfR.


# Data augmentation
After importing data via `jst_get_article()`, there are at least two tasks you
might typically want to undertake:

- Merge different identifiers for journals into one, so you can filter journals.
- Convert pages from character into integers and calculate the total number
of pages per article.

There are four functions which help you to streamline this process:

- `jst_clean_page()` attempts to turn a character vector with pages into an integer
vector.
- `jst_add_total_pages()` adds a column with the total number of pages per
article.
- `jst_unify_journal_id()` merges different identifiers for journals into one.
- `jst_augment()` wraps the above functions for convenience.

# Known quirks

In the following sections, known issues with data from DfR are described in
greater detail.

## Page numbers
Page numbers are a mess. Besides the issues mentioned above, page numbers might
sometimes be specified as "pp. 1234-83" as in
[this article from the American Journal of 
Sociology](https://www.jstor.org/stable/10.1086/659100).
Of course, this results in `first_page = 1234` and `last_page = 83`, and the
computed number of total pages from `jst_get_total_pages()` will be
negative. There is currently no general solution for this issue.


### Calculating total pages
As outlined above, page numbers come in very different forms. Besides this
problem, there is actually another issue. Imagine you would like to quantify 
the lengths of articles. Obviously you will need information on the first
and the last page of the articles. Furthermore, the pages need to be
parsed properly: you will run into troubles if you calculate page numbers like
`75284 - 42 + 1`, in case the number was parsed badly. `jst_clean_page()` tries
to do this properly, based on a few known possibilities:

- "2" -> 2
- "A2" -> 2
- "v75i2p84" -> 84


Parsing correctly is unfortunately not enough. Things like "Errata" might come
to haunt you. For example there might be an article with `first_page = 42` and 
`last_page = 362`, which
would leave you puzzled as to if this can be true^[Although it sounds absurd,
this can actually be true. There are some articles which are 200 pages long.
Obviously, they are not standard research articles. You will need to decide
if they fall into your sample or not.]. There could be a simple explanation:
the article might start on page 42, and end on page 65, and there is furthermore
an erratum on page 362. Technically, `last_page = 362` is true then, but it
will cause problems for calculating the total number of pages. Quite often,
there is information in another column which could resolve this: `page_range`, 
which in this case would look like `42 - 65, 362`. 

A small helper to deal with those situations is `jst_get_total_pages()`. It
works for page ranges, but also for first and last pages:

```{r, message=FALSE}
library(jstor)
library(dplyr)

input <- tibble::tribble(
  ~first_page,   ~last_page,    ~page_range,
  NA_real_,      NA_real_,      NA_character_, 
  1,             10,            "1 - 10",
  1,             10,            NA_character_,
  1,             NA_real_,      NA_character_, 
  1,             NA_real_,      "1-10",
  NA_real_,       NA_real_,      "1, 5-10",
  NA_real_,       NA_real_,      "1-4, 5-10",
  NA_real_,       NA_real_,      "1-4, C5-C10"
)

input %>% 
  mutate(n_pages = jst_get_total_pages(first_page, last_page, page_range))
```
This is actually identical to using `jst_add_total_pages()`:

```{r}
input %>% 
  jst_add_total_pages()
```


## Journal identifiers
Identifiers for the journal usually appear in three columns:

- `journal_doi`
- `journal_jcode`
- `journal_pub_id`

```{r, results='asis'}
sample_article <- jst_get_article(jst_example("article_with_references.xml")) 

knitr::kable(sample_article)
```

From my samples, it seems that the information in `journal_pub_id` is 
often missing, as is journal_doi. The most important identifier is thus
`journal_jcode`. In cases where both `journal_jcode` and `journal_pub_id` are
present, at least in my samples, the format of `journal_jcode` was different.
For consistency, `jst_unify_journal_id()` thus takes content of
`journal_pub_id` if it is present, and that of `journal_jcode` otherwise.

With this algorithm, it should be possible to reliably match them to 
general information about the respective journals, which are available from
`jst_get_journal_overview()`:

```{r}
sample_article %>% 
  jst_unify_journal_id() %>% 
  left_join(jst_get_journal_overview()) %>% 
  tidyr::gather(variable, value) %>% 
  knitr::kable()
```


## Duplicated ngrams
|Source                            |time span          |Part              |
|:---------------------------------|-----------------:|-----------------:|
|American Journal of Sociology     |Unknown           |Book Reviews      |

For the AJS, ngrams for book reviews are calculated per issue. There are 
numerous reviews per issue, and each of them has an identical file of ngrams,
containing ngrams for all book reviews of this issue. 

A possible strategy for dealing with this is either not to use those ngrams,
since they are calculated on all reviews in the issue, irrespective of 
whether actually all reviews of the given issue are in the sample or not.
Alternatively, one could group by issues, and only take one set of ngrams
per issue.

## Language codes
Information on langues is not consistent. For the sample article, `language` is
`eng`. 

```{r}
sample_article %>% 
  pull(language)
```

In other cases it might be `en`. It is thus advisable to take a quick look at
different variants via `distinct(meta_data, language)` or 
`count(meta_data, language)`. 


## Incorrect/odd references
When analysing data about references and footnotes, you will encounter many
inconsistencies and errors. Most of them are not due to errors from DfR, but 
stem simply from the fact, that humans make mistakes when creating manuscripts,
and not all errors with references are caught before printing.

### Problems with non-english characters
A common problem are names with non-english characters like german umlauts
(Ferdinand Tönnies) or nordic names (Gøsta Esping-Andersen). These will appear
in many different variations: Tonnies, Tönnies, Gosta, Gösta, etc.

### OCR-Issues
For older articles, you might encounter issues that stem from digitising text 
with OCR-software. A common problem is distinguishing `I` from `l`, like in
the phrase "In love". Depending on which names appear in your data, this might
lead to inconsistencies.

### Errors by article authors
There are many examples where authors make mistakes and your summary statistics
end up being skewed. 
[This article](https://www.jstor.org/stable/25074331?seq=11&refreqid=excelsior%3A6f9520d8aecff945ab2033fa66d3438e#page_scan_tab_contents) 
about "Ethics Education in the Workplace" cites the same items multiple times,
which is possibly an artifact. The advantage of using JSTOR/DfR data is, that 
you can inspect all sources and
check, if a specific pattern you see is an artifact or genuine.