# Datasets

We published four datasets for training and evaluating the extraction of performance results from machine learning papers. This notebook describes their format and shows how to use our Python API to work with them conveniently. Due to licensing, the datasets consist of metadata and annotations only; they do not include the papers or the data extracted from them. However, we made a special effort to make the results of our extraction pipeline reproducible.

The cell below imports a simple loading function and defines the dataset URLs.

In [1]:
from axcell.helpers.datasets import read_arxiv_papers
from pathlib import Path

V1_URL = 'https://github.com/sotagents/axcell/releases/download/v1.0/'
ARXIV_PAPERS_URL = V1_URL + 'arxiv-papers.csv.xz'
SEGMENTED_TABLES_URL = V1_URL + 'segmented-tables.json.xz'
PWC_LEADERBOARDS_URL = V1_URL + 'pwc-leaderboards.json.xz'

## ArxivPapers

The **ArxivPapers** dataset is a corpus of over 100,000 scientific papers related to machine learning. In our work we use the corpus for self-supervised training of a ULMFiT language model (see the lm_training notebook) and for extraction of common abbreviations. The dataset is a CSV file with one row per paper and the following fields:
* arxiv_id - arXiv identifier with version
* archive_size - the file size in bytes of the e-print archive
* sha256 - SHA-256 hash of the e-print archive
* title - paper's title
* status - the text and tables extraction status for this paper, one of:
  + success,
  + no-tex - LaTeX source is unavailable,
  + processing-error - extraction issues,
  + withdrawn - the paper is withdrawn from arXiv
* sections - number of extracted sections and subsections
* tables - number of extracted tables

In [2]:
arxiv_papers = read_arxiv_papers(ARXIV_PAPERS_URL)

print(f'Number of papers:           {len(arxiv_papers):8}')
print(f'└── with LaTeX source:      {(~arxiv_papers.status.isin(["no-tex", "withdrawn"])).sum():8}')
print(f'Number of extracted tables: {arxiv_papers.tables.sum():8}')

Number of papers:             104710
└── with LaTeX source:         93811
Number of extracted tables:   277946
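
The CSV can also be inspected without the helper. The following is a minimal sketch using only the standard library. It assumes the file is xz-compressed (as the `.csv.xz` suffix suggests) and has a header row with the field names listed above. The function name `count_statuses` and the toy rows are illustrative and not part of the axcell API or the real dataset.

```python
import csv
import io
import lzma
from collections import Counter


def count_statuses(xz_file):
    """Count papers per extraction status in an xz-compressed CSV.

    `xz_file` is a path or a binary file object. The `status` column name
    comes from the field list above; the header row is an assumption.
    """
    with lzma.open(xz_file, mode='rt', encoding='utf-8', newline='') as f:
        return Counter(row['status'] for row in csv.DictReader(f))


# Toy rows for illustration only, not taken from the real dataset.
toy_csv = (
    'arxiv_id,archive_size,sha256,title,status,sections,tables\n'
    '0000.00001v1,1024,aa,Paper A,success,5,2\n'
    '0000.00002v1,2048,bb,Paper B,no-tex,0,0\n'
    '0000.00003v2,4096,cc,Paper C,success,7,3\n'
)
toy_file = io.BytesIO(lzma.compress(toy_csv.encode('utf-8')))
status_counts = count_statuses(toy_file)
print(status_counts)
```

To run it on the real data, download `arxiv-papers.csv.xz` first and pass the local path instead of the in-memory buffer.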


The arXiv id can be used to generate links to e-prints. Please read https://arxiv.org/help/bulk_data and play nice.

In [3]:
def get_eprint_link(paper):
    return f'http://export.arxiv.org/e-print/{paper.arxiv_id}'

links = arxiv_papers.apply(get_eprint_link, axis=1)
links.tail()

104705    http://export.arxiv.org/e-print/2002.08204v1
104706    http://export.arxiv.org/e-print/2002.08253v1
104707    http://export.arxiv.org/e-print/2002.08264v1
104708    http://export.arxiv.org/e-print/2002.08301v1
104709    http://export.arxiv.org/e-print/2002.08325v1
dtype: object
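
Since each row records the size and SHA-256 hash of the e-print archive, a downloaded archive can be checked against the metadata. The sketch below is one way to do this. It assumes the `sha256` field stores a hexadecimal digest, and `verify_eprint` is a hypothetical helper, not a function from axcell.

```python
import hashlib


def verify_eprint(data, expected_sha256, expected_size):
    """Return True if the archive bytes match the recorded size and hash.

    Assumes `expected_sha256` is a hex digest (an assumption about the
    dataset, not confirmed by the notebook).
    """
    if len(data) != int(expected_size):
        return False
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


# Toy bytes standing in for a downloaded e-print archive.
toy_archive = b'not a real e-print archive'
toy_row = {
    'archive_size': len(toy_archive),
    'sha256': hashlib.sha256(toy_archive).hexdigest(),
}

archive_ok = verify_eprint(toy_archive, toy_row['sha256'], toy_row['archive_size'])
tampered_ok = verify_eprint(toy_archive + b'x', toy_row['sha256'], toy_row['archive_size'])
print(archive_ok, tampered_ok)
```

A failed check suggests the downloaded archive differs from the one used to build the dataset, in which case extraction results may not match the annotations.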

## SegmentedTables & LinkedResults

The **SegmentedTables** dataset contains annotations of almost 2,000 tables. The dataset is a JSON array with one item per paper and the following fields:
* arxiv_id - arXiv identifier with version. The version may differ from the one in **ArxivPapers**,
* sha256 - SHA-256 hash of the e-print archive
* fold - one of 11 folds, e.g., img_class or speech_rec. Each paper is assigned to exactly one fold, even if it is related to more than one task,
* tables - array of table annotations
  + index - 0-based index of the table among the tables extracted from the paper,
  + leaderboard - a boolean denoting if this table is a leaderboard table,
  + ablation - a boolean denoting if this table is an ablation table (a table can be both a leaderboard and an ablation table),
  + dataset_text - datasets mentioned in table's caption, not normalized
  + segmentation - for leaderboard tables, a 2D array (list of lists) with one label per cell

Additionally, we annotated some of the tables with performance results; we call these annotations the **LinkedResults** dataset. Each table contains a 'records' array whose items contain:
* task, dataset, metric - task, dataset and metric names normalized across all papers from the **LinkedResults** dataset,
* value - normalized metric value,
* model - model name,
* row, column - 0-based location of the cell containing this result.

In [4]:
from axcell.helpers.datasets import read_tables_annotations

segmented_tables_annotations = read_tables_annotations(SEGMENTED_TABLES_URL)

leaderboards = (segmented_tables_annotations.tables.apply(
    lambda tables: len([t for t in tables if t['leaderboard']])
).sum())
ablations = (segmented_tables_annotations.tables.apply(
    lambda tables: len([t for t in tables if t['ablation']])
).sum())
records = (segmented_tables_annotations.tables.apply(
    lambda tables: sum([len(t['records']) for t in tables])
).sum())

print(f'Number of papers: {len(segmented_tables_annotations):8}')
print(f'Number of tables: {segmented_tables_annotations.tables.apply(len).sum():8}')
print(f'├── leaderboards: {leaderboards:8}')
print(f'└── ablations:    {ablations:8}')
print(f'Linked results:   {records:8}')

Number of papers:      352
Number of tables:     1994
├── leaderboards:      796
└── ablations:         468
Linked results:       1591
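
For analysis it is often easier to work with one flat row per linked result. The sketch below walks the nested structure described in this section, operating on plain Python dicts as they would come out of `json.load`. The name `iter_records`, the toy values, and the use of a float for `value` are assumptions made for illustration.

```python
def iter_records(papers):
    """Yield one flat dict per linked result.

    `papers` is a list of dicts shaped like the JSON items described
    above. Tables without a 'records' key are skipped (an assumption).
    """
    for paper in papers:
        for table in paper['tables']:
            for record in table.get('records', []):
                yield {
                    'arxiv_id': paper['arxiv_id'],
                    'table_index': table['index'],
                    **record,
                }


# Toy annotations for illustration only, not taken from the real dataset.
toy_papers = [
    {
        'arxiv_id': '0000.00001v1',
        'tables': [
            {
                'index': 0, 'leaderboard': True, 'ablation': False,
                'records': [
                    {'task': 'Toy Task', 'dataset': 'Toy Dataset',
                     'metric': 'Accuracy', 'value': 90.0,
                     'model': 'Toy Model', 'row': 2, 'column': 1},
                ],
            },
            {'index': 1, 'leaderboard': False, 'ablation': True, 'records': []},
        ],
    },
]

flat_records = list(iter_records(toy_papers))
print(flat_records)
```

The resulting list of dicts can be passed directly to `pandas.DataFrame` if a tabular view is preferred.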


## PWCLeaderboards

The **PWCLeaderboards** dataset is similar in structure to the **LinkedResults** dataset. It's a JSON array with one item per paper, containing:
* arxiv_id - arXiv identifier with version. The version corresponds to the version in **ArxivPapers**,
* tables
  + index - 0-based table index
  + records - as in **LinkedResults**

In [5]:
pwc_leaderboards = read_tables_annotations(PWC_LEADERBOARDS_URL)

records = (pwc_leaderboards.tables.apply(
    lambda tables: sum([len(t['records']) for t in tables])
).sum())

print(f'Number of papers: {len(pwc_leaderboards):8}')
print(f'Number of tables: {pwc_leaderboards.tables.apply(len).sum():8}')
print(f'Linked results:   {records:8}')

Number of papers:      731
Number of tables:     1278
Linked results:       5393
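
Because task, dataset and metric names are normalized, records from different papers can be grouped into leaderboards keyed by that triple. The following is a minimal sketch over plain dicts shaped like the JSON items described above. `group_by_leaderboard` and the toy data are hypothetical and only illustrate the structure.

```python
from collections import defaultdict


def group_by_leaderboard(papers):
    """Group records by their (task, dataset, metric) triple.

    Returns a dict mapping each triple to a list of
    (model, value, arxiv_id) tuples in input order.
    """
    boards = defaultdict(list)
    for paper in papers:
        for table in paper['tables']:
            for r in table['records']:
                key = (r['task'], r['dataset'], r['metric'])
                boards[key].append((r['model'], r['value'], paper['arxiv_id']))
    return dict(boards)


def make_record(metric, value, model):
    """Build a toy record; all values are made up for illustration."""
    return {'task': 'Toy Task', 'dataset': 'Toy Dataset', 'metric': metric,
            'value': value, 'model': model, 'row': 1, 'column': 1}


toy_leaderboards = [
    {'arxiv_id': '0000.00001v1', 'tables': [
        {'index': 0, 'records': [make_record('Accuracy', 90.0, 'Toy Model A')]},
    ]},
    {'arxiv_id': '0000.00002v1', 'tables': [
        {'index': 3, 'records': [
            make_record('Accuracy', 91.5, 'Toy Model B'),
            make_record('F1', 88.0, 'Toy Model B'),
        ]},
    ]},
]

boards = group_by_leaderboard(toy_leaderboards)
for key, entries in boards.items():
    print(key, entries)
```

The entries are deliberately left unsorted, since whether a higher or lower value is better depends on the metric.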
