## Planning

https://docs.openalex.org/about-the-data/work

So here's a process to build the data.

1. Grab $S_r$ retracted articles: `is_retracted=true`, `type=journal-article` ...minimum # of citations? ...PubMed ID?
1. Grab $S_u$ unretracted articles as... random? (but how in API?) roughly matched* groups? 1:200? 1:1000?
1. Repeat until sufficient data

\* Matching with...
- `is_retracted=false`
- `type=journal-article`
- Same year -- there are WAY more old unretracted articles in these data, and age matters a lot for citations
- Matching on N concepts could help the data maintain a certain amount of comparability
- Maybe limit to one area for that reason -- could require PubMed identifiers, which also might improve retraction measure
- Same institution might help avoid institutional confounds in data generation
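
A minimal sketch of how the matching criteria above could be turned into an OpenAlex `filter` string. The helper name `build_match_filter` and the placeholder venue ID are hypothetical; the filter keys are the ones used in the data-building cells of this notebook, and any extra matching key (concepts, institution) would need checking against the OpenAlex filter documentation.

```python
def build_match_filter(year, extra=None, work_type="journal-article"):
    """Build a filter string for unretracted works matched to a retracted one.

    `extra` is an optional dict of additional filter keys to match on,
    e.g. {"host_venue.id": "V0000000000"} (placeholder ID, not a real venue).
    """
    parts = [
        f"type:{work_type}",
        "is_retracted:false",
        f"publication_year:{year}",
    ]
    for key, value in (extra or {}).items():
        parts.append(f"{key}:{value}")
    # OpenAlex joins multiple filter conditions with commas
    return ",".join(parts)


print(build_match_filter(2015))
print(build_match_filter(2015, {"host_venue.id": "V0000000000"}))
```

Keeping the year mandatory and everything else optional reflects the note that age matters most for citations; the other keys can be added one at a time to see how much each one shrinks the pool.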

## Additional notes

### Retraction Watch Database

This sounds like the best source available: well curated and detailed. However, it's not publicly available and requires a DUA with an organization for research. Probably best for a longer-term project, but not good for an 8-day deadline :)

### open-retractions

We could also hit up the open-retractions API - https://github.com/open-retractions/open-retractions - using a DOI. The OpenAlex API documentation implies that they use Crossref, but it's not clear whether they also use PubMed info for `is_retracted`; open-retractions does supply both Crossref and PubMed.

## Setup and imports

In [20]:
%pip install pyarrow

distutils: /opt/conda/include/python3.8/UNKNOWN
sysconfig: /opt/conda/include/python3.8
user = False
home = None
root = None
prefix = None
Note: you may need to restart the kernel to use updated packages.


In [21]:
import math
import json
from pathlib import Path
import pickle
import time

from IPython.display import display, clear_output
import numpy as np
import pandas as pd
import requests


%matplotlib inline

# for i in range(10):
#     clear_output(wait=True)
#     print(i)
#     time.sleep(0.1)

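DATA_DIR = Path('./data')
DATA_DIR.mkdir(parents=True, exist_ok=True)  # touch() below fails if the directory is missing
UNRETRACTED_DIR = DATA_DIR / 'json_collecting'
# Sanity check that the data directory is writable
tester = DATA_DIR / 'test.txt'
tester.touch()
assert tester.exists()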

PER_PAGE = 200 # API max
SHARED_FILTERS = "type:journal-article,publication_year:>2010,publication_year:<2018"

## Data building

### We have available a good chunk of retracted articles

- 15,117 retracted journal articles with at least 1 citation: https://api.openalex.org/works?mailto=matvan@umich.edu&filter=type:journal-article,cited_by_count:>0&group_by=is_retracted
- Another 5,415 with 0 citations: https://api.openalex.org/works?mailto=matvan@umich.edu&filter=type:journal-article,cited_by_count:0&group_by=is_retracted

Unfortunately, we don't have the date of retraction in these data.

 

In [None]:
base_url = "https://api.openalex.org/works?mailto=matvan@umich.edu"
filtering = f"filter={SHARED_FILTERS},is_retracted:true"
pagination = f"per_page={PER_PAGE}&page=1"
url = "&".join((base_url, filtering, pagination))
j = requests.get(url).json()
print(url)
print(j['meta'])

n_records = j['meta']['count']
print('Records:', n_records)
n_pages = math.ceil(n_records / PER_PAGE)
print('Pages of records:', n_pages)

In [None]:
base_url = "https://api.openalex.org/works?mailto=matvan@umich.edu"
filtering = f"filter={SHARED_FILTERS},is_retracted:true"

records = []
for i in range(1, 1000):
    pagination = f"per_page={PER_PAGE}&page={i}"
    url = "&".join((base_url, filtering, pagination))
    j = requests.get(url).json()
    print(url)
    print(j['meta'])
    
    records += j['results']
    
    count = j['meta']['count']
    page = j['meta']['page']
    per_page = j['meta']['per_page']
    if per_page * page >= count:
        break
    time.sleep(1)
print('done')
print(len(records))
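
If the record count ever exceeds what page-number paging allows (OpenAlex documents a 10,000-result cap for basic paging), cursor paging is the alternative: start with `cursor=*` and follow `meta.next_cursor` until it comes back empty. Below is a rough sketch under those assumptions, not what this notebook actually ran. The names `fetch_json` and `collect_with_cursor` are hypothetical, and `urllib` stands in for `requests` only to keep the block standard-library only.

```python
import json
import time
import urllib.parse
import urllib.request


def fetch_json(url):
    """Fetch a URL and parse the body as JSON (stand-in for requests.get(url).json())."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def collect_with_cursor(base_url, filters, per_page=200, fetch=fetch_json, pause=1.0):
    """Collect every result for a filter by following meta.next_cursor.

    `base_url` is expected to already contain a query string
    (e.g. "...works?mailto=..."), as in the cells above.
    """
    records = []
    cursor = "*"  # assumed OpenAlex convention for starting a cursor walk
    while cursor:
        query = urllib.parse.urlencode(
            {"filter": filters, "per_page": per_page, "cursor": cursor}
        )
        page = fetch(f"{base_url}&{query}")
        records.extend(page["results"])
        if not page["results"]:
            break  # an empty page means there is nothing left
        cursor = page["meta"].get("next_cursor")
        if cursor:
            time.sleep(pause)  # stay polite between requests
    return records
```

Passing `fetch` as a parameter makes it easy to try the loop against canned responses before hitting the API, e.g. `collect_with_cursor(base_url, f"{SHARED_FILTERS},is_retracted:true")` for the real thing.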

In [None]:
# json is only about 20% inflated over pkl
with Path(DATA_DIR / 'raw_retracted.json').open('w', encoding='UTF-8') as outfile:
    json.dump(records, outfile)


### Now to pull the unretracted subsample

In [None]:
with Path(DATA_DIR / 'raw_retracted.json').open('r', encoding='UTF-8') as infile:
    raw_retracted_records = json.load(infile)

In [None]:
journal_years = list(set((rec['host_venue']['id'], rec['publication_year']) for rec in raw_retracted_records))
print(journal_years[:8])
print("Retracted sample:", len(raw_retracted_records))
print("Unique year/journal combos:", len(journal_years))
print("Maximum possible unretracted sample:", len(journal_years) * PER_PAGE)  # one page per combo

In [None]:
base_url = "https://api.openalex.org/works?mailto=matvan@umich.edu"
pagination = f"per_page={PER_PAGE}&page=1"

UNRETRACTED_DIR.mkdir(exist_ok=True)

for i, journal_year in enumerate(journal_years):
    clear_output(wait=True)
    print(i+1, 'of', len(journal_years), journal_year)
    
    journal_url, year = journal_year
    # Can't match the works with no venue id
    if not journal_url:
        continue
    journal_id = journal_url.replace("https://openalex.org/", "")

    filename = UNRETRACTED_DIR / f"{journal_id}-{year}.json"
    if filename.exists():
        print("Already collected", filename)
        print('')
        continue
        
    filtering = f"filter=type:journal-article,is_retracted:false,host_venue.id:{journal_id},publication_year:{year}"
    url = "&".join((base_url, filtering, pagination))
    print(url)
    j = requests.get(url).json()
    print(j['meta'])
    
    print('--->', filename)
    with filename.open('w', encoding='UTF-8') as outfile:
        json.dump(j, outfile)
    
    print('')
    time.sleep(2)

    
print('done')    

In [None]:
# next:
# validate each json as load()able
# -- this will take a minute but will be much easier to refetch any errors rather than deal with the entire json

for i, path in enumerate(UNRETRACTED_DIR.glob('*.json')):
    clear_output(wait=True)
    print(i+1, path)
    with path.open('r', encoding='UTF-8') as infile:
        json.load(infile)
print('done')
# if it doesn't error out, we're good
# I'm seeing 5259 instead of 5260 because of the one missing host_venue.id
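
Since the point of validating per-file is to refetch only what failed, here is a small sketch that collects the bad files instead of stopping at the first error. The function name `find_bad_json` is hypothetical, and the check for a `results` key assumes each file holds a full API response like the ones saved above.

```python
import json
from pathlib import Path


def find_bad_json(directory):
    """Return paths of *.json files that fail to parse or lack a 'results' key."""
    bad = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            with path.open("r", encoding="UTF-8") as infile:
                payload = json.load(infile)
        except (json.JSONDecodeError, UnicodeDecodeError):
            bad.append(path)  # truncated or corrupted file
            continue
        if not isinstance(payload, dict) or "results" not in payload:
            bad.append(path)  # parsed, but not a normal works response
    return bad
```

Deleting whatever `find_bad_json(UNRETRACTED_DIR)` returns and rerunning the collection cell should refetch just those journal/year combos, because that loop skips files that already exist.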