# ResearchArcade Complete Tutorial

This tutorial demonstrates how to work with the ResearchArcade database, covering all node types and edge relationships.

## Table of Contents
1. [Setup](#setup)
2. [OpenReview Data](#openreview)
3. [ArXiv Papers](#arxiv-papers)
4. [ArXiv Authors](#arxiv-authors)
5. [ArXiv Categories](#arxiv-categories)
6. [ArXiv Figures](#arxiv-figures)
7. [ArXiv Tables](#arxiv-tables)
8. [ArXiv Sections](#arxiv-sections)
9. [ArXiv Paragraphs](#arxiv-paragraphs)
10. [Relationships/Edges](#relationships)
11. [Advanced Queries](#advanced-queries)

## 1. Setup <a name="setup"></a>

In [1]:
import sys
from pathlib import Path
from tqdm import tqdm
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..')))
from research_arcade.research_arcade import ResearchArcade
import pandas as pd
from datetime import datetime

### Choose Database Backend

#### CSV Based

In [None]:
# db_type = "csv"
# config = {
#     "csv_dir": "../data/my_research_arcade_data/"
# }

# research_arcade = ResearchArcade(db_type=db_type, config=config)

#### SQL Based

In [2]:
db_type = "sql"
config = {
    "host": "localhost",
    "dbname": "postgres",
    "user": "cl195",
    "password": "",
    "port": "5433"
}

research_arcade = ResearchArcade(db_type=db_type, config=config)

## 3. ArXiv Papers <a name="arxiv-papers"></a>

### Table Schema
- `id` (SERIAL PK)
- `arxiv_id` (VARCHAR, unique) - e.g., 1802.08773v3
- `base_arxiv_id` (VARCHAR) - e.g., 1802.08773
- `version` (INT) - e.g., 3
- `title` (TEXT)
- `abstract` (TEXT)
- `submit_date` (DATE)
- `metadata` (JSONB)

### Construct Table from API

In [3]:
config = {"arxiv_ids": ["2505.23559", "1903.03894v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_papers", config)

Paper 2505.23559 does not have metadata downloaded


#### Construct Table from CSV

In [4]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_papers_example.csv"}
research_arcade.construct_table_from_csv("arxiv_papers", config)

Error: CSV file ./examples/csv_data/csv_arxiv_papers_example.csv does not exist.


#### Construct Table from JSON

In [5]:
config = {"json_file": "./examples/json_data/json_arxiv_papers_example.json"}
research_arcade.construct_table_from_json("arxiv_papers", config)

Error: JSON file ./examples/json_data/json_arxiv_papers_example.json does not exist.


### Insert a Paper

In [6]:
# Example 1: Insert the famous "Attention is All You Need" paper
new_paper = {
    'arxiv_id': '1706.03762v7',
    'base_arxiv_id': '1706.03762',
    'version': 7,
    'title': 'Attention Is All You Need',
    'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.',
    'submit_date': '2017-06-12',
    'metadata': {'venue': 'NeurIPS 2017', 'pdf_url': 'https://arxiv.org/pdf/1706.03762.pdf'}
}

research_arcade.insert_node("arxiv_papers", node_features=new_paper)
print("Paper inserted successfully!")

Paper inserted successfully!


In [7]:
# Example 2: Insert BERT paper
bert_paper = {
    'arxiv_id': '1810.04805v2',
    'base_arxiv_id': '1810.04805',
    'version': 2,
    'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding',
    'abstract': 'We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.',
    'submit_date': '2018-10-11',
    'metadata': {'venue': 'NAACL 2019', 'citations': 50000}
}

research_arcade.insert_node("arxiv_papers", node_features=bert_paper)
print("BERT paper inserted successfully!")

BERT paper inserted successfully!


### Get All Papers

In [8]:
arxiv_papers_df = research_arcade.get_all_node_features("arxiv_papers")
print(f"Total papers in database: {len(arxiv_papers_df)}")
print("\nFirst 5 papers:")
print(arxiv_papers_df)

Total papers in database: 3

First 5 papers:
[(2, '1903.03894v4', '1903.03894', '4', 'GNNExplainer: Generating Explanations for Graph Neural Networks', "Graph Neural Networks (GNNs) are a powerful tool for machine learning on\ngraphs.GNNs combine node feature information with the graph structure by\nrecursively passing neural messages along edges of the input graph. However,\nincorporating both graph structure and feature information leads to complex\nmodels, and explaining predictions made by GNNs remains unsolved. Here we\npropose GNNExplainer, the first general, model-agnostic approach for providing\ninterpretable explanations for predictions of any GNN-based model on any\ngraph-based machine learning task. Given an instance, GNNExplainer identifies a\ncompact subgraph structure and a small subset of node features that have a\ncrucial role in GNN's prediction. Further, GNNExplainer can generate consistent\nand concise explanations for an entire class of instances. We formulate\nGNNE

### Get Specific Paper by ID

In [9]:
paper_id = {"arxiv_id": "1810.04805v2"}
paper_features = research_arcade.get_node_features_by_id("arxiv_papers", paper_id)
print("Paper details:")
print(paper_features.to_dict(orient="records")[0])

AttributeError: 'SQLArxivPapers' object has no attribute 'get_paper_by_id'

### Update a Paper

In [10]:
# Update metadata for a paper
updated_paper = {
    'arxiv_id': '1706.03762v7',
    'metadata': {
        'venue': 'NeurIPS 2017',
        'pdf_url': 'https://arxiv.org/pdf/1706.03762.pdf',
        'citations': 75000,
        'influential': True
    }
}

research_arcade.update_node("arxiv_papers", node_features=updated_paper)
print("Paper updated successfully!")

Paper updated successfully!


### Delete a Paper

In [11]:
# Delete a paper by ID
paper_id = {"arxiv_id": "1706.03762v7"}
deleted_paper = research_arcade.delete_node_by_id("arxiv_papers", paper_id)
print("Deleted paper:")
print(deleted_paper)

AttributeError: 'SQLArxivPapers' object has no attribute 'delete_paper_by_id'

## 4. ArXiv Authors <a name="arxiv-authors"></a>

### Table Schema
- `id` (SERIAL PK)
- `semantic_scholar_id` (VARCHAR, unique)
- `name` (VARCHAR)
- `homepage` (VARCHAR)

### Construct Table from API

In [None]:
# config = {"arxiv_ids": ["1903.03894v4", "1806.08804v4"], "dest_dir": "./download"}
# research_arcade.construct_table_from_api("arxiv_authors", config)

#### Construct Table from CSV

In [12]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_authors_example.csv"}
research_arcade.construct_table_from_csv("arxiv_authors", config)

Error: CSV file ./examples/csv_data/csv_arxiv_authors_example.csv does not exist.


#### Construct Table from JSON

In [13]:
config = {"json_file": "./examples/json_data/json_arxiv_authors_example.json"}
research_arcade.construct_table_from_json("arxiv_authors", config)

Error: JSON file ./examples/json_data/json_arxiv_authors_example.json does not exist.


### Insert Authors

In [14]:
# Insert authors from the Transformer paper
authors = [
    {
        'semantic_scholar_id': 'ss_ashish_vaswani',
        'name': 'Ashish Vaswani',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_noam_shazeer',
        'name': 'Noam Shazeer',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_niki_parmar',
        'name': 'Niki Parmar',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_jakob_uszkoreit',
        'name': 'Jakob Uszkoreit',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_llion_jones',
        'name': 'Llion Jones',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    }
]

for author in authors:
    research_arcade.insert_node("arxiv_authors", node_features=author)
    print(f"Inserted author: {author['name']}")

Inserted author: Ashish Vaswani
Inserted author: Noam Shazeer
Inserted author: Niki Parmar
Inserted author: Jakob Uszkoreit
Inserted author: Llion Jones


### Get All Authors

In [15]:
authors_df = research_arcade.get_all_node_features("arxiv_authors")
print(f"Total authors in database: {len(authors_df)}")
print("\nAll authors:")
print(authors_df)

Total authors in database: 14391

All authors:
[(1, '2288033664', 'Zhe Xu', 'https://www.semanticscholar.org/author/2288033664'), (2, '49025612', 'Daoyuan Chen', 'https://www.semanticscholar.org/author/49025612'), (3, '2283309816', 'Zhenqing Ling', 'https://www.semanticscholar.org/author/2283309816'), (4, '2237607166', 'Yaliang Li', 'https://www.semanticscholar.org/author/2237607166'), (5, '2288065597', 'Ying Shen', 'https://www.semanticscholar.org/author/2288065597'), (6, '2249533760', 'Ziyu Wan', 'https://www.semanticscholar.org/author/2249533760'), (7, '2296444976', 'Yunxiang Li', 'https://www.semanticscholar.org/author/2296444976'), (8, '2326075771', 'Yan Song', 'https://www.semanticscholar.org/author/2326075771'), (9, '2345961411', 'Hanjing Wang', 'https://www.semanticscholar.org/author/2345961411'), (10, '2284931437', 'Linyi Yang', 'https://www.semanticscholar.org/author/2284931437'), (11, '2351406620', 'Mark Schmidt', 'https://www.semanticscholar.org/author/2351406620'), (12, '2

### Get Specific Author by ID

In [None]:
author_id = {"semantic_scholar_id": 2288033664}
author_features = research_arcade.get_node_features_by_id("arxiv_authors", author_id)
print("Author details:")
print(author_features)

TypeError: SQLArxivAuthors.get_author_by_id() got an unexpected keyword argument 'semantic_scholar_id'

### Update an Author

In [19]:
updated_author = {
    'semantic_scholar_id': 'ss_ashish_vaswani',
    'homepage': 'https://ashishvaswani.com'
}

research_arcade.update_node("arxiv_authors", node_features=updated_author)
print("Author updated successfully!")

TypeError: SQLArxivAuthors.update_author() missing 1 required positional argument: 'id'

## 5. ArXiv Categories <a name="arxiv-categories"></a>

### Table Schema
- `id` (SERIAL PK)
- `name` (VARCHAR, unique)
- `description` (TEXT)

### Insert From API

In [21]:
config = {"arxiv_ids": ["1903.03894v4", "1806.08804v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_categories", config)

./download/1806.08804v4/1806.08804v4.tar.gz
paper with id 1806.08804v4 downloaded
{'id': '1903.03894v4', 'title': 'GNNExplainer: Generating Explanations for Graph Neural Networks', 'abstract': "Graph Neural Networks (GNNs) are a powerful tool for machine learning on\ngraphs.GNNs combine node feature information with the graph structure by\nrecursively passing neural messages along edges of the input graph. However,\nincorporating both graph structure and feature information leads to complex\nmodels, and explaining predictions made by GNNs remains unsolved. Here we\npropose GNNExplainer, the first general, model-agnostic approach for providing\ninterpretable explanations for predictions of any GNN-based model on any\ngraph-based machine learning task. Given an instance, GNNExplainer identifies a\ncompact subgraph structure and a small subset of node features that have a\ncrucial role in GNN's prediction. Further, GNNExplainer can generate consistent\nand concise explanations for an enti

#### Construct Table from CSV

In [22]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_categories_example.csv"}
research_arcade.construct_table_from_csv("arxiv_categories", config)

Error: CSV file ./examples/csv_data/csv_arxiv_categories_example.csv does not exist.


#### Construct Table from JSON

In [23]:
config = {"json_file": "./examples/json_data/json_arxiv_categories_example.json"}
research_arcade.construct_table_from_json("arxiv_categories", config)

Error: JSON file ./examples/json_data/json_arxiv_categories_example.json does not exist.


### Insert Categories

In [24]:
categories = [
    {
        'name': 'cs.CL',
        'description': 'Computation and Language (Natural Language Processing)'
    },
    {
        'name': 'cs.LG',
        'description': 'Machine Learning'
    },
    {
        'name': 'cs.AI',
        'description': 'Artificial Intelligence'
    },
    {
        'name': 'cs.CV',
        'description': 'Computer Vision and Pattern Recognition'
    },
    {
        'name': 'stat.ML',
        'description': 'Machine Learning (Statistics)'
    }
]

for category in categories:
    research_arcade.insert_node("arxiv_categories", node_features=category)
    print(f"Inserted category: {category['name']}")

Inserted category: cs.CL
Inserted category: cs.LG
Inserted category: cs.AI
Inserted category: cs.CV
Inserted category: stat.ML


### Get All Categories

In [25]:
categories_df = research_arcade.get_all_node_features("arxiv_categories")
print(f"Total categories: {len(categories_df)}")
print("\nAll categories:")
print(categories_df)

Total categories: 7

All categories:
[(1, 'cs.LG', None), (2, 'stat.ML', None), (4, 'cs.NE', None), (5, 'cs.SI', None), (7, 'cs.CL', 'Computation and Language (Natural Language Processing)'), (9, 'cs.AI', 'Artificial Intelligence'), (10, 'cs.CV', 'Computer Vision and Pattern Recognition')]


## 6. ArXiv Figures <a name="arxiv-figures"></a>

### Table Schema
- `id` (SERIAL PK)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `path` (VARCHAR)
- `caption` (TEXT)
- `label` (TEXT)
- `name` (TEXT)

### Insert Figures

In [26]:
# Insert figures for the Transformer paper
figures = [
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/figures/transformer_architecture.png',
        'caption': 'The Transformer model architecture. The left side shows the encoder stack and the right side shows the decoder stack.',
        'label': 'fig:architecture',
        'name': 'Figure 1'
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/figures/scaled_dot_product_attention.png',
        'caption': 'Scaled Dot-Product Attention and Multi-Head Attention mechanisms.',
        'label': 'fig:attention',
        'name': 'Figure 2'
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/figures/positional_encoding.png',
        'caption': 'Positional encoding visualization showing sine and cosine functions of different frequencies.',
        'label': 'fig:positional',
        'name': 'Figure 3'
    }
]

for figure in figures:
    research_arcade.insert_node("arxiv_figures", node_features=figure)
    print(f"Inserted {figure['name']}")

UndefinedObject: constraint "ux_arxiv_figures_name_notnull" for table "arxiv_figures" does not exist


### Get All Figures

In [27]:
figures_df = research_arcade.get_all_node_features("arxiv_figures")
print(f"Total figures: {len(figures_df)}")
print("\nAll figures:")
print(figures_df[['name', 'caption', 'label']])

TypeError: object of type 'NoneType' has no len()

## 7. ArXiv Tables <a name="arxiv-tables"></a>

### Table Schema
- `id` (SERIAL PK)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `path` (VARCHAR)
- `caption` (TEXT)
- `label` (TEXT)
- `table_text` (TEXT)

### Insert From API

In [28]:
config = {"arxiv_ids": ["1903.03894v4", "1806.08804v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_tables", config)

seed: ['1903.03894v4']
BFS_que.qsize(): 1
current paper: 1903.03894v4
Thread 136281040344640 Processing 1903.03894v4



gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now


Thread 136281040344640 Finished processing 1903.03894v4 (1/999999999) Time elapsed: 1.19s
'NoneType' object is not subscriptable
Thread 136281040344640 Failed to process 1903.03894v4
Thread 136284945086272 Finished processing 1 papers
Error: The file './download/output/1903.03894v4.json' was not found.
seed: ['1806.08804v4']
BFS_que.qsize(): 1
current paper: 1806.08804v4
Thread 136281040344640 Processing 1806.08804v4
Thread 136281040344640 Processing file paper-diffpool.tex
numbmer of citations in node info method: 0
Cannot find the bib file refs.bib
Thread 136281040344640 Finished processing 1806.08804v4 (1/999999999) Time elapsed: 1.68s
Thread 136284945086272 Finished processing 1 papers


#### Construct Table from CSV

In [29]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_tables_example.csv"}
research_arcade.construct_table_from_csv("arxiv_tables", config)

Error: CSV file ./examples/csv_data/csv_arxiv_tables_example.csv does not exist.


### Insert Categories

In [30]:
categories = [
    {
        'name': 'cs.CL',
        'description': 'Computation and Language (Natural Language Processing)'
    },
    {
        'name': 'cs.LG',
        'description': 'Machine Learning'
    },
    {
        'name': 'cs.AI',
        'description': 'Artificial Intelligence'
    },
    {
        'name': 'cs.CV',
        'description': 'Computer Vision and Pattern Recognition'
    },
    {
        'name': 'stat.ML',
        'description': 'Machine Learning (Statistics)'
    }
]

for category in categories:
    research_arcade.insert_node("arxiv_categories", node_features=category)
    print(f"Inserted category: {category['name']}")

Inserted category: cs.CL
Inserted category: cs.LG
Inserted category: cs.AI
Inserted category: cs.CV
Inserted category: stat.ML


### Get All Categories

In [31]:
categories_df = research_arcade.get_all_node_features("arxiv_categories")
print(f"Total categories: {len(categories_df)}")
print("\nAll categories:")
print(categories_df)

Total categories: 7

All categories:
[(1, 'cs.LG', None), (2, 'stat.ML', None), (4, 'cs.NE', None), (5, 'cs.SI', None), (7, 'cs.CL', 'Computation and Language (Natural Language Processing)'), (9, 'cs.AI', 'Artificial Intelligence'), (10, 'cs.CV', 'Computer Vision and Pattern Recognition')]


## 6. ArXiv Figures <a name="arxiv-figures"></a>

### Table Schema
- `id` (SERIAL PK)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `path` (VARCHAR)
- `caption` (TEXT)
- `label` (TEXT)
- `name` (TEXT)

### Insert From API

In [32]:
config = {"arxiv_ids": ["1903.03894v4", "1806.08804v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_figures", config)

seed: ['1903.03894v4']
BFS_que.qsize(): 1
current paper: 1903.03894v4
Thread 136281040344640 Processing 1903.03894v4



gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now


Thread 136281040344640 Finished processing 1903.03894v4 (1/999999999) Time elapsed: 1.04s
'NoneType' object is not subscriptable
Thread 136281040344640 Failed to process 1903.03894v4
Thread 136284945086272 Finished processing 1 papers
Error: The file './download/output/1903.03894v4.json' was not found.
An unexpected error occurred: constraint "ux_arxiv_figures_name_notnull" for table "arxiv_figures" does not exist



#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_figures_example.csv"}
research_arcade.construct_table_from_csv("arxiv_figures", config)

No new figures to import


#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_figures_example.json"}
research_arcade.construct_table_from_json("arxiv_figures", config)

No new figures to import


### Insert Tables

In [33]:
# Insert tables for the Transformer paper
tables = [
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/tables/model_variations.tex',
        'caption': 'Variations on the Transformer architecture with different hyperparameters.',
        'label': 'tab:variations',
        'table_text': 'Model | N | d_model | d_ff | h | d_k | d_v | P_drop | train time\nbase | 6 | 512 | 2048 | 8 | 64 | 64 | 0.1 | 12 hrs'
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/tables/wmt_results.tex',
        'caption': 'Performance of the Transformer on WMT 2014 English-German and English-French translation tasks.',
        'label': 'tab:wmt',
        'table_text': 'Model | EN-DE BLEU | EN-FR BLEU\nTransformer (base) | 27.3 | 38.1\nTransformer (big) | 28.4 | 41.8'
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/tables/parsing_results.tex',
        'caption': 'English constituency parsing results on WSJ test set.',
        'label': 'tab:parsing',
        'table_text': 'Model | WSJ 23 F1\nTransformer | 91.3'
    }
]

for table in tables:
    research_arcade.insert_node("arxiv_tables", node_features=table)
    print(f"Inserted table: {table['label']}")

Inserted table: tab:variations
Inserted table: tab:wmt
Inserted table: tab:parsing


### Get All Tables

In [35]:
tables_df = research_arcade.get_all_node_features("arxiv_tables")
print(f"Total tables: {len(tables_df)}")
print("\nAll tables:")
print(tables_df)

Total tables: 5

All tables:
[(1, '1806.08804v4', None, '\\caption{Classification accuracies in percent. The far-right column gives the relative increase in accuracy compared to the baseline \\textsc{GraphSage} approach.}', '\\label{tab:results}', '\\begin{tabular}{@{}clcccccc@{}}\\cmidrule[\\heavyrulewidth]{2-8}\n& \\multirow{3}{*}{\\vspace*{8pt}\\textbf{Method}}&\\multicolumn{5}{c}{\\textbf{Data Set}}\\\\\\cmidrule{3-8}\n& & {\\textsc{Enzymes}} & {\\textsc{D\\&D}} & {\\textsc{Reddit-Multi-12k}} & {\\textsc{Collab}} & {\\textsc{Proteins}} & {\\text{Gain}}\n\\\\ \\cmidrule{2-8}\n\\multirow{4}{*}{\\rotatebox{90}{\\hspace*{-6pt}Kernel}} \n& \\textsc{Graphlet}  & 41.03 & 74.85 &  21.73 & 64.66 & 72.91 &  \\\\ \n& \\textsc{Shortest-path} & 42.32 & 78.86 & 36.93 & 59.10  & 76.43 &   \\\\     \n& \\text{1-WL} &  53.43 & 74.02 &  39.03 &  78.61 & 73.76 &  \\\\     \n& \\text{WL-OA} & 60.13  & 79.04\t & 44.38  & 80.74  & 75.26  &   \\\\       \\cmidrule{2-8}\n% GNN\n& \\textsc{PatchySan} & -- 

## 8. ArXiv Sections <a name="arxiv-sections"></a>

### Table Schema
- `id` (SERIAL PK)
- `content` (TEXT)
- `title` (TEXT)
- `appendix` (BOOLEAN)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `section_in_paper_id` (INT)

### Insert From API

In [36]:
config = {"arxiv_ids": ["1903.03894v4", "1806.08804v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_sections", config)

seed: ['1903.03894v4']
BFS_que.qsize(): 1
current paper: 1903.03894v4
Thread 136281040344640 Processing 1903.03894v4



gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now


Thread 136281040344640 Finished processing 1903.03894v4 (1/999999999) Time elapsed: 1.04s
'NoneType' object is not subscriptable
Thread 136281040344640 Failed to process 1903.03894v4
Thread 136284945086272 Finished processing 1 papers
Error: The file './download/output/1903.03894v4.json' was not found.


#### Construct Table from CSV

In [37]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_sections_example.csv"}
research_arcade.construct_table_from_csv("arxiv_sections", config)

Error: CSV file ./examples/csv_data/csv_arxiv_sections_example.csv does not exist.


#### Construct Table from JSON

In [38]:
config = {"json_file": "./examples/json_data/json_arxiv_sections_example.json"}
research_arcade.construct_table_from_json("arxiv_sections", config)

Error: JSON file ./examples/json_data/json_arxiv_sections_example.json does not exist.


### Insert Sections

In [39]:
# Insert sections for the Transformer paper
sections = [
    {
        'content': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder...',
        'title': 'Introduction',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 1
    },
    {
        'content': 'Most competitive neural sequence transduction models have an encoder-decoder structure. Here, the encoder maps an input sequence of symbol representations...',
        'title': 'Background',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 2
    },
    {
        'content': 'Most neural sequence transduction models have an encoder-decoder structure. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers...',
        'title': 'Model Architecture',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 3
    },
    {
        'content': 'In this section we describe the training regime for our models...',
        'title': 'Training',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 4
    },
    {
        'content': 'On the WMT 2014 English-to-German translation task, the big transformer model outperforms the best previously reported models...',
        'title': 'Results',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 5
    },
    {
        'content': 'In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers...',
        'title': 'Conclusion',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 6
    }
]

for section in sections:
    research_arcade.insert_node("arxiv_sections", node_features=section)
    print(f"Inserted section: {section['title']}")

TypeError: SQLArxivSections.insert_section() got an unexpected keyword argument 'appendix'

### Get All Sections

In [41]:
sections_df = research_arcade.get_all_node_features("arxiv_sections")
print(f"Total sections: {sections_df}")
print("\nAll sections:")
print(sections_df)

Total sections: [(1, "\n\\label{sec:intro}\nIn recent years there has been a surge of interest in developing graph neural networks (GNNs)---general deep learning architectures that can operate over graph structured data, such as social network data \\cite{hamilton2017inductive,kipf2017semi,Vel+2018} or graph-based representations of molecules \\cite{dai2016discriminative,Duv+2015,Gil+2017}.\nThe general approach with GNNs is to view the underlying graph as a computation graph and learn neural network primitives that generate individual node embeddings by passing, transforming, and aggregating node feature information across the graph~\\cite{Gil+2017,hamilton2017inductive}.\nThe generated node embeddings can then be used as input to any differentiable prediction layer, e.g., for node classification \\cite{hamilton2017inductive} or link prediction \\cite{Sch+2017}, and the whole model can be trained in an end-to-end fashion. \n\n\n\n\nHowever, a major limitation of current GNN architectu

## 9. ArXiv Paragraphs <a name="arxiv-paragraphs"></a>

### Table Schema
- `id` (SERIAL PK)
- `paragraph_id` (INT)
- `content` (TEXT)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `paper_section` (TEXT)
- `section_id` (INT)
- `paragraph_in_paper_id` (INT)

### Insert From API

In [42]:
config = {"arxiv_ids": ["1903.03894v4", "1806.08804v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_paragraphs", config)

seed: ['1903.03894v4']
BFS_que.qsize(): 1
current paper: 1903.03894v4
Thread 136281040344640 Processing 1903.03894v4



gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now


Thread 136281040344640 Finished processing 1903.03894v4 (1/999999999) Time elapsed: 1.00s
'NoneType' object is not subscriptable
Thread 136281040344640 Failed to process 1903.03894v4
Thread 136284945086272 Finished processing 1 papers


100%|██████████| 2/2 [00:00<00:00, 973.95it/s]


Error loading ./download/output/1903.03894v4.json: [Errno 2] No such file or directory: './download/output/1903.03894v4.json'


100%|██████████| 2/2 [00:00<00:00, 193.35it/s]

Error loading ./download/output/1903.03894v4.json: [Errno 2] No such file or directory: './download/output/1903.03894v4.json'
1806.08804v4
Key to References: {'fig:assignment_vis': 'figures_3', 'tab:results': 'table_4', 'tab:results2': 'table_5'}
tab:results
tab:results2
Paper count:  1
Total nodes:  113
Total edges:  210
Paper nodes:  1
Figure nodes:  0
Table nodes:  2
Text nodes:  110
0





UndefinedObject: constraint "ux_arxiv_paragraphs_unique" for table "arxiv_paragraphs" does not exist


#### Construct Table from CSV

In [43]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paragraphs_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paragraphs", config)

Error: CSV file ./examples/csv_data/csv_arxiv_paragraphs_example.csv does not exist.


#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paragraphs_example.json"}
research_arcade.construct_table_from_json("arxiv_paragraphs", config)

No new paragraphs to import (all paragraphs already exist)


### Insert Paragraphs

In [44]:
# Insert paragraphs from the Introduction section
paragraphs = [
    {
        'paragraph_id': 1,
        'content': 'Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 1
    },
    {
        'paragraph_id': 2,
        'content': 'Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures. Recurrent models typically factor computation along the symbol positions of the input and output sequences.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 2
    },
    {
        'paragraph_id': 3,
        'content': 'Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 3
    },
    {
        'paragraph_id': 4,
        'content': 'Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 4
    },
    {
        'paragraph_id': 5,
        'content': 'In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 5
    }
]

for paragraph in paragraphs:
    research_arcade.insert_node("arxiv_paragraphs", node_features=paragraph)
    print(f"Inserted paragraph {paragraph['paragraph_id']} from {paragraph['paper_section']}")

UndefinedObject: constraint "ux_arxiv_paragraphs_unique" for table "arxiv_paragraphs" does not exist


### Get All Paragraphs

In [45]:
paragraphs_df = research_arcade.get_all_node_features("arxiv_paragraphs")
print(f"Total paragraphs: {len(paragraphs_df)}")
print("\nFirst 3 paragraphs:")
print(paragraphs_df[['paragraph_id', 'paper_section', 'content']].head(3))

TypeError: object of type 'NoneType' has no len()

## 10. Relationships/Edges <a name="relationships"></a>

This section demonstrates how to create and manage relationships between different entities.

### 10.2 ArXiv Citations (arxiv_citation)

#### Insert Citation

In [48]:
citation = {
    'citing_arxiv_id': '1810.04805v2',
    'cited_arxiv_id': '1706.03762v7',
    'bib_title': 'attention is all you need',
    'bib_key': 'something',
    'citing_sections': ['something'],
}
research_arcade.insert_edge("arxiv_citation", edge_features=citation)
print("Citation created!")

Citation created!


#### Construct Table from CSV

In [49]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paper_citation_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paper_citation", config)

Error: CSV file ./examples/csv_data/csv_arxiv_paper_citation_example.csv does not exist.


#### Construct Table from JSON

In [50]:
config = {"json_file": "./examples/json_data/json_arxiv_paper_citation_example.json"}
research_arcade.construct_table_from_json("arxiv_paper_citation", config)

Error: JSON file ./examples/json_data/json_arxiv_paper_citation_example.json does not exist.


#### Get All Citations

In [52]:
all_citations = research_arcade.get_all_edge_features("arxiv_citation")
print(f"Total citations: {len(all_citations)}")
print(all_citations)

Total citations: 1
[(1, '1810.04805v2', '1706.03762v7', 'attention is all you need', 'something', None, ['something'], [])]


#### Get Cited Papers

In [53]:
citing_paper = {'citing_paper_id': '1810.04805v2'}
cited_papers = research_arcade.get_neighborhood("arxiv_citation", primary_key=citing_paper)
print("Papers cited:")
print(cited_papers)

Papers cited:
[(1, '1810.04805v2', '1706.03762v7', 'attention is all you need', 'something', None, ['something'], [])]


#### Get Citing Papers

In [54]:
cited_paper = {'cited_paper_id': '1706.03762v7'}
citing_papers = research_arcade.get_neighborhood("arxiv_citation", primary_key=cited_paper)
print("Papers that cite:")
print(citing_papers)

Papers that cite:
[(1, '1810.04805v2', '1706.03762v7', 'attention is all you need', 'something', None, ['something'], [])]


#### Delete Citation

In [55]:
citation_id = {
    'citing_paper_id': '1810.04805v2',
    'cited_paper_id': '1706.03762v7'
}
research_arcade.delete_edge_by_id("arxiv_citation", primary_key=citation_id)
print("Citation deleted!")

Deleted citation: 1810.04805v2 -> 1706.03762v7
Citation deleted!


### 10.3 ArXiv Paper-Author (arxiv_paper_author)

#### Insert Paper-Author Relationships

In [56]:
paper_authors = [
    {'paper_arxiv_id': '1706.03762v7', 'author_id': 'ss_ashish_vaswani', 'author_sequence': 1},
    {'paper_arxiv_id': '1706.03762v7', 'author_id': 'ss_noam_shazeer', 'author_sequence': 2},
    {'paper_arxiv_id': '1706.03762v7', 'author_id': 'ss_niki_parmar', 'author_sequence': 3}
]
for relation in paper_authors:
    research_arcade.insert_edge("arxiv_paper_author", edge_features=relation)
    print(f"Linked author {relation['author_id']} (position {relation['author_sequence']})")

UndefinedObject: constraint "ux_arxiv_paper_authors_unique" for table "arxiv_paper_authors" does not exist


#### Construct Table from CSV

In [57]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paper_author_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paper_author", config)

Error: CSV file ./examples/csv_data/csv_arxiv_paper_author_example.csv does not exist.


#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paper_author_example.json"}
research_arcade.construct_table_from_json("arxiv_paper_author", config)

No new paper-author relationships to import


#### Get All Paper-Author Relationships

In [59]:
all_relations = research_arcade.get_all_edge_features("arxiv_paper_author")
print(f"Total relationships: {len(all_relations)}")
print(all_relations)

TypeError: object of type 'NoneType' has no len()

#### Get Authors for a Paper

In [61]:
paper_id = {'paper_arxiv_id': '1706.03762v7'}
authors = research_arcade.get_neighborhood("arxiv_paper_author", primary_key=paper_id)
print("Authors:")
print(authors)

Authors:
None


#### Get Papers by Author

In [62]:
author_id = {'author_id': 'ss_ashish_vaswani'}
papers = research_arcade.get_neighborhood("arxiv_paper_author", primary_key=author_id)
print("Papers by author:")
print(papers)

Papers by author:
None


#### Delete Paper-Author Link

In [63]:
relation_id = {'paper_arxiv_id': '1706.03762v7', 'author_id': 'ss_ashish_vaswani'}
research_arcade.delete_edge_by_id("arxiv_paper_author", primary_key=relation_id)
print("Relationship deleted!")

Relationship deleted!


### 10.4 ArXiv Paper-Category (arxiv_paper_category)

#### Insert Paper-Category Relationships

In [64]:
paper_categories = [
    {'paper_arxiv_id': '1706.03762v7', 'category_id': '1'},
    {'paper_arxiv_id': '1706.03762v7', 'category_id': '1'},
    {'paper_arxiv_id': '1706.03762v7', 'category_id': '2'}
]
for relation in paper_categories:
    research_arcade.insert_edge("arxiv_paper_category", edge_features=relation)
    print(f"Linked {relation['category_id']}")

UndefinedObject: constraint "ux_arxiv_paper_category_unique" for table "arxiv_paper_category" does not exist


#### Construct Table from CSV

In [65]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paper_category_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paper_category", config)

Error: CSV file ./examples/csv_data/csv_arxiv_paper_category_example.csv does not exist.


#### Construct Table from JSON

In [66]:
config = {"json_file": "./examples/json_data/json_arxiv_paper_category_example.json"}
research_arcade.construct_table_from_json("arxiv_paper_category", config)

Error: JSON file ./examples/json_data/json_arxiv_paper_category_example.json does not exist.


#### Get All Paper-Category Relationships

In [67]:
all_relations = research_arcade.get_all_edge_features("arxiv_paper_category")
print(f"Total relationships: {len(all_relations)}")
print(all_relations.head())

TypeError: object of type 'NoneType' has no len()

#### Get Categories for Paper

In [68]:
paper_id = {'paper_arxiv_id': '1706.03762v7'}
categories = research_arcade.get_neighborhood("arxiv_paper_category", primary_key=paper_id)
print("Categories:")
print(categories)

Categories:
None


#### Get Papers in Category

In [69]:
category_id = {'category_id': 'cs.LG'}
papers = research_arcade.get_neighborhood("arxiv_paper_category", primary_key=category_id)
print("Papers in category:")
print(papers)

Papers in category:
None


#### Delete Paper-Category Link

In [70]:
relation_id = {'paper_arxiv_id': '1706.03762v7', 'category_id': 'cs.AI'}
research_arcade.delete_edge_by_id("arxiv_paper_category", primary_key=relation_id)
print("Relationship deleted!")

Relationship deleted!


### 10.5 ArXiv Paper-Figure (arxiv_paper_figure)

#### Insert Paper-Figure Relationships

In [71]:
paper_figures = [
    {'paper_arxiv_id': '1706.03762v7', 'figure_id': 1},
    {'paper_arxiv_id': '1706.03762v7', 'figure_id': 2}
]
for relation in paper_figures:
    research_arcade.insert_edge("arxiv_paper_figure", edge_features=relation)
    print(f"Linked figure {relation['figure_id']})")

UndefinedObject: constraint "ux_arxiv_paper_figures_unique" for table "arxiv_paper_figures" does not exist


#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paper_figure_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paper_figure", config)

No new paper-figure relationships to import


#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paper_figure_example.json"}
research_arcade.construct_table_from_json("arxiv_paper_figure", config)

No new paper-figure relationships to import


#### Get Figures for Paper

In [None]:
paper_id = {'paper_arxiv_id': '1706.03762v7'}
figures = research_arcade.get_neighborhood("arxiv_paper_figure", primary_key=paper_id)
print("Figures:")
print(figures)

Figures:
  paper_arxiv_id  figure_id
0   1706.03762v7          1
1   1706.03762v7          2
2   1706.03762v7          3


### 10.6 ArXiv Paper-Table (arxiv_paper_table)

#### Insert Paper-Table Relationships

In [None]:
paper_tables = [
    {'paper_arxiv_id': '1706.03762v7', 'table_id': 1},
    {'paper_arxiv_id': '1706.03762v7', 'table_id': 2}
]
for relation in paper_tables:
    research_arcade.insert_edge("arxiv_paper_table", edge_features=relation)
    print(f"Linked table {relation['table_id']}")

Linked table 1
Linked table 2


#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paper_table_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paper_table", config)

No new paper-table relationships to import


#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paper_table_example.json"}
research_arcade.construct_table_from_json("arxiv_paper_table", config)

No new paper-table relationships to import


#### Get Tables for Paper

In [None]:
paper_id = {'paper_arxiv_id': '1706.03762v7'}
tables = research_arcade.get_neighborhood("arxiv_paper_table", primary_key=paper_id)
print("Tables:")


Tables:


### 10.7 ArXiv Paragraph-Reference (arxiv_paragraph_reference)

#### Insert Paragraph-Reference Relationships

In [None]:
paragraph_references = [
    {'paragraph_id': 1, 'paper_section': 'established approaches', 'paper_arxiv_id': '1706.03762v7', 'reference_label': "{something}", 'reference_type': 'figure'}
]

for relation in paragraph_references:
    research_arcade.insert_edge("arxiv_paragraph_reference", edge_features=relation)

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paragraph_reference_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paragraph_reference", config)

Successfully imported 10 paragraph-reference relationships from ./examples/csv_data/csv_arxiv_paragraph_reference_example.csv


#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paragraph_reference_example.json"}
research_arcade.construct_table_from_json("arxiv_paragraph_reference", config)

Error: JSON file ./examples/json_data/json_arxiv_paragraph_reference_example.json does not exist.


#### Get References in Paragraph

In [None]:
paragraph_id = {'paragraph_id': 1}
references = research_arcade.get_neighborhood("arxiv_paragraph_reference", primary_key=paragraph_id)
print("References:")
print(references)

References:
   id  paragraph_id           paper_section paper_arxiv_id  \
0   1             1  established approaches   1706.03762v7   
1   3             1            introduction   1706.03762v7   
2   6             1              background   1706.03762v7   
3   8             1            introduction   1810.04805v2   
4  11             1                   model   1810.04805v2   
5  12             1  established approaches   1706.03762v7   
6  14             1            introduction   1706.03762v7   
7  17             1              background   1706.03762v7   
8  19             1            introduction   1810.04805v2   
9  22             1                   model   1810.04805v2   

      reference_label reference_type  
0         {something}         figure  
1  bahdanau2014neural       citation  
2       fig:attention         figure  
3       fig:bert_arch         figure  
4      peters2018deep       citation  
5         {something}         figure  
6  bahdanau2014neural       cita

## Conclusion

This tutorial has covered:

1. Setting up the ResearchArcade database connection
2. Working with OpenReview data
3. CRUD operations for all ArXiv entity types:
   - Papers
   - Authors
   - Categories
   - Figures
   - Tables
   - Sections
   - Paragraphs
4. Creating relationships between entities:
   - Authorship
   - Citations
   - Paper-Category links
   - Paper-Figure/Table links
   - Paragraph-level references

For more information, refer to the ResearchArcade documentation.