# Sample by Field

This notebook creates a sample of the preprint metadata in which the disciplinary fields, as expressed in the OpenAlex metadata, are equally sampled and the publication year is 2019 or later. The goal isn't to have a representative sample of all preprints, since we only see preprints that happen to be in OpenAlex. Instead, the dataset is meant to help build a hand-curated benchmark for evaluating different machine learning models and techniques. You will need a copy of the OpenAlex metadata that was collected in the [download-preprints](download-preprints.ipynb) notebook.

https://storage.cloud.google.com/cloud-ai-platform-e215f7f7-a526-4a66-902d-eb69384ef0c4/preprints/metadata.jsonl

In [1]:
import pandas

df = pandas.read_json('metadata.jsonl', lines=True)
len(df)

48221

In [2]:
df = df[df.publication_year >= 2019]
len(df)

19669
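As a small illustration of the two steps above, the following sketch loads JSON Lines data and applies the same year filter. The two records are made up for the example and contain only the columns needed here; the real metadata.jsonl has many more.

```python
import io
import pandas

# Two made-up records in JSON Lines form: one JSON object per line
jsonl = io.StringIO(
    '{"id": "W1", "publication_year": 2018}\n'
    '{"id": "W2", "publication_year": 2021}\n'
)

# lines=True tells pandas to parse each line as a separate record
toy = pandas.read_json(jsonl, lines=True)

# Same filter as above: keep works published in 2019 or later
recent = toy[toy.publication_year >= 2019]
print(len(toy), len(recent))
```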

Extract the subfield of each preprint's OpenAlex primary topic into a separate field column:

In [3]:
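# keep only preprints that have a primary topic; copy to avoid modifying a slice
df = df[df['primary_topic'].notna()].copy()
# use the display name of the topic's subfield as the field
df['field'] = df['primary_topic'].apply(lambda t: t['subfield']['display_name'])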
df['field'].value_counts()

field
Molecular Biology                   2161
Artificial Intelligence             1043
Astronomy and Astrophysics           984
Genetics                             727
Cognitive Neuroscience               701
                                    ... 
Museology                              1
Ceramics and Composites                1
Process Chemistry and Technology       1
General Dentistry                      1
Linguistics and Language               1
Name: count, Length: 218, dtype: int64
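The extraction step relies on each primary topic being a nested record with a subfield that carries a display name. The sketch below shows that lookup on toy data. The records and their exact shape are assumptions made for illustration, not actual OpenAlex output.

```python
import pandas

# Toy records mimicking the nested shape assumed for primary_topic
toy = pandas.DataFrame({
    "id": ["W1", "W2", "W3"],
    "primary_topic": [
        {"display_name": "Topic A", "subfield": {"display_name": "Genetics"}},
        {"display_name": "Topic B", "subfield": {"display_name": "Law"}},
        None,  # some works have no primary topic
    ],
})

# Drop rows without a topic, then pull out the subfield's display name
toy = toy[toy["primary_topic"].notna()].copy()
toy["field"] = toy["primary_topic"].apply(lambda t: t["subfield"]["display_name"])

field_counts = toy["field"].value_counts()
print(field_counts)
```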

Create a URL field:

In [4]:
df['url'] = df.pdf_path.apply(lambda p: "https://storage.cloud.google.com/cloud-ai-platform-e215f7f7-a526-4a66-902d-eb69384ef0c4/preprints/" + p)

In [6]:
fields = df['field'].dropna().unique()
fields

array(['Molecular Biology', 'Infectious Diseases',
       'Computer Vision and Pattern Recognition',
       'Renewable Energy, Sustainability and the Environment',
       'Astronomy and Astrophysics',
       'Atomic and Molecular Physics, and Optics',
       'Global and Planetary Change', 'Epidemiology',
       'Nature and Landscape Conservation', 'Automotive Engineering',
       'Endocrinology, Diabetes and Metabolism',
       'Radiology, Nuclear Medicine and Imaging',
       'Modeling and Simulation',
       'Cardiology and Cardiovascular Medicine',
       'Artificial Intelligence', 'Health, Toxicology and Mutagenesis',
       'Statistics and Probability',
       'General Economics, Econometrics and Finance', 'Oncology',
       'Otorhinolaryngology', 'Nuclear and High Energy Physics',
       'Cognitive Neuroscience', 'Immunology', 'Geology', 'Genetics',
       'Control and Systems Engineering', 'Clinical Psychology',
       'Physiology', 'Pulmonary and Respiratory Medicine',
       '

Select 3 preprints at random for each field, as long as the field has more than 3 matching preprints:

In [9]:
sample = pandas.DataFrame(columns=["id", "title", "url", "field"])

for field in fields:
    field_df = df[df.field == field]
    # only sample fields that have more than 3 preprints
    if len(field_df) > 3:
        s = field_df.sample(3)
        sample = pandas.concat([sample, s[["id", "title", "url", "field"]]])

sample

| | id | title | url | field |
|---|---|---|---|---|
| 31373 | https://openalex.org/W4394843835 | Single-nuclei histone modification profiling o... | https://storage.cloud.google.com/cloud-ai-plat... | Molecular Biology |
| 26147 | https://openalex.org/W3013783484 | Dynamical model of the CLC-2 ion channel revea... | https://storage.cloud.google.com/cloud-ai-plat... | Molecular Biology |
| 38874 | https://openalex.org/W2945212438 | easyCLIP Quantifies RNA-Protein Interactions a... | https://storage.cloud.google.com/cloud-ai-plat... | Molecular Biology |
| 18516 | https://openalex.org/W2901173781 | Moderate-to-High Levels of Pretreatment HIV Dr... | https://storage.cloud.google.com/cloud-ai-plat... | Infectious Diseases |
| 19080 | https://openalex.org/W4390722339 | Risk of Emergent Dolutegravir Resistance Mutat... | https://storage.cloud.google.com/cloud-ai-plat... | Infectious Diseases |
| ... | ... | ... | ... | ... |
| 32295 | https://openalex.org/W4245851024 | Interventions to reduce meat consumption by ap... | https://storage.cloud.google.com/cloud-ai-plat... | Small Animals |
| 35154 | https://openalex.org/W4226053155 | The Effects of Exposure to Information About A... | https://storage.cloud.google.com/cloud-ai-plat... | Small Animals |
| 45345 | https://openalex.org/W3142474082 | The Politics of Judicial Reform | https://storage.cloud.google.com/cloud-ai-plat... | Law |
| 45513 | https://openalex.org/W4214588293 | The New Judicial Governance: Courts, Data, and... | https://storage.cloud.google.com/cloud-ai-plat... | Law |
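Because the sampling is random, rerunning the cell above will generally pick different preprints. One possible variation, sketched here on made-up data, uses a groupby with a fixed random_state so the draw is repeatable. The names MIN_SIZE and PER_FIELD and the seed value 42 are assumptions for the example and do not come from the notebook.

```python
import pandas

# Hypothetical toy data: field "A" has 5 preprints, field "B" only 2
toy = pandas.DataFrame({
    "id": [f"W{i}" for i in range(7)],
    "field": ["A"] * 5 + ["B"] * 2,
})

MIN_SIZE = 3   # a field must have more than this many preprints
PER_FIELD = 3  # preprints to draw from each qualifying field

# Keep only fields that are large enough
eligible = toy.groupby("field").filter(lambda g: len(g) > MIN_SIZE)

# Draw a fixed-size sample per field; the seed makes the draw repeatable
toy_sample = eligible.groupby("field").sample(n=PER_FIELD, random_state=42)
print(toy_sample)
```

Fixing the seed trades a little randomness for the ability to regenerate the same benchmark candidates later, which can help when the hand curation happens over several sessions.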


In [10]:
sample.to_csv('data/benchmark.csv', index=False)
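Before hand curating the file, it may be worth confirming that the sample really is balanced. The helper below is a hypothetical check, not part of the original notebook: it verifies that every field appears exactly the expected number of times and that no preprint id is repeated.

```python
import pandas

def check_balanced(sample, per_field=3):
    """Return True if every field has exactly per_field rows and ids are unique."""
    counts = sample["field"].value_counts()
    return bool((counts == per_field).all()) and bool(sample["id"].is_unique)

# Made-up samples: one balanced, one with too few rows for its field
balanced = pandas.DataFrame({"id": ["W1", "W2", "W3"], "field": ["Law"] * 3})
lopsided = pandas.DataFrame({"id": ["W1", "W2"], "field": ["Law"] * 2})

print(check_balanced(balanced))
print(check_balanced(lopsided))
```

The same function could be applied to the result of reading data/benchmark.csv back in with pandas.read_csv.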