# ResearchArcade Complete Tutorial

This tutorial demonstrates how to work with the ResearchArcade database, covering all node types and edge relationships.

## Table of Contents
1. [Setup](#setup)
2. [OpenReview Data](#openreview)
3. [ArXiv Papers](#arxiv-papers)
4. [ArXiv Authors](#arxiv-authors)
5. [ArXiv Categories](#arxiv-categories)
6. [ArXiv Figures](#arxiv-figures)
7. [ArXiv Tables](#arxiv-tables)
8. [ArXiv Sections](#arxiv-sections)
9. [ArXiv Paragraphs](#arxiv-paragraphs)
10. [Relationships/Edges](#relationships)
11. [Advanced Queries](#advanced-queries)

## 1. Setup <a name="setup"></a>

In [None]:
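import sys
from pathlib import Path

# Notebooks do not define __file__, so resolve the project root from the
# working directory (assumes the notebook runs from a subdirectory of the
# project root). Extend sys.path before importing the package.
project_root = Path.cwd().resolve().parent
sys.path.insert(0, str(project_root))

from tqdm import tqdm
from research_arcade.research_arcade import ResearchArcade
import pandas as pd
from datetime import datetime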

### Choose Database Backend

#### CSV Based

In [None]:
db_type = "csv"
config = {
    "csv_dir": "/data/jingjunx/my_research_arcade_data/"
}

research_arcade = ResearchArcade(db_type=db_type, config=config)

#### SQL Based (PostgreSQL)

In [None]:
db_type = "sql"
config = {
    "host": "localhost",
    "dbname": "iclr_openreview_database",
    "user": "jingjunx",
    "password": "",
    "port": "5432"
}

research_arcade = ResearchArcade(db_type=db_type, config=config)
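
The SQL config above hardcodes the connection details, including an empty password. As a minimal sketch, the same dict could be assembled from environment variables so credentials stay out of the notebook. The `RA_DB_*` variable names and the `build_sql_config` helper are hypothetical and not part of ResearchArcade; the defaults mirror the values in the cell above.

```python
import os


def build_sql_config(env=os.environ):
    """Assemble the PostgreSQL config dict from environment variables.

    The RA_DB_* names are hypothetical; defaults mirror the tutorial's values.
    """
    return {
        "host": env.get("RA_DB_HOST", "localhost"),
        "dbname": env.get("RA_DB_NAME", "iclr_openreview_database"),
        "user": env.get("RA_DB_USER", "jingjunx"),
        "password": env.get("RA_DB_PASSWORD", ""),
        "port": env.get("RA_DB_PORT", "5432"),
    }


# A plain dict stands in for os.environ here so the example is reproducible.
config = build_sql_config({"RA_DB_PASSWORD": "example-password"})
print(sorted(config))
```

The resulting `config` has the same keys as the hand-written one, so it can be passed to `ResearchArcade(db_type="sql", config=config)` unchanged.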

## 2. OpenReview Data <a name="openreview"></a>

### openreview_authors

#### Construct Table from API

In [None]:
config = {"venue": "ICLR.cc/2025/Conference"}
research_arcade.construct_table_from_api("openreview_authors", config)

#### Construct Table from CSV

In [None]:
config = {"csv_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/csv_data/csv_openreview_author_example.csv"}
research_arcade.construct_table_from_csv("openreview_authors", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/json_data/json_openreview_author_example.json"}
research_arcade.construct_table_from_json("openreview_authors", config)

#### Insert Node

In [None]:
new_author = {'venue': 'ICLR.cc/2025/Conference', 
              'author_openreview_id': '~ishmam_zabir1', 
              'author_full_name': 'ishmam zabir', 
              'email': '****@microsoft.com', 
              'affiliation': 'Microsoft', 
              'homepage': 'https://scholar.google.com/citations?user=X7bjzrUAAAAJ&hl=en&oi=ao', 
              'dblp': ''}
research_arcade.insert_node("openreview_authors", node_features=new_author)

## 3. ArXiv Papers <a name="arxiv-papers"></a>

### Table Schema
- `id` (SERIAL PK)
- `arxiv_id` (VARCHAR, unique) - e.g., 1802.08773v3
- `base_arxiv_id` (VARCHAR) - e.g., 1802.08773
- `version` (INT) - e.g., 3
- `title` (TEXT)
- `abstract` (TEXT)
- `submit_date` (DATE)
- `metadata` (JSONB)
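
The schema stores the versioned ID together with its two components, so `base_arxiv_id` and `version` can be derived from `arxiv_id` instead of being typed by hand. The sketch below shows one way to do that; `split_arxiv_id` is a hypothetical helper, not part of ResearchArcade, and it assumes the version is always written as a trailing `v<number>`.

```python
import re

# Base identifier followed by a trailing "v<digits>" version suffix.
_VERSIONED_ID = re.compile(r"^(?P<base>.+?)v(?P<version>\d+)$")


def split_arxiv_id(arxiv_id):
    """Split a versioned arXiv ID into (base_arxiv_id, version)."""
    match = _VERSIONED_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"No version suffix in arXiv ID: {arxiv_id!r}")
    return match.group("base"), int(match.group("version"))


base_arxiv_id, version = split_arxiv_id("1802.08773v3")
print(base_arxiv_id, version)
```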

### Insert a Paper

In [None]:
# Example 1: Insert the famous "Attention is All You Need" paper
new_paper = {
    'arxiv_id': '1706.03762v7',
    'base_arxiv_id': '1706.03762',
    'version': 7,
    'title': 'Attention Is All You Need',
    'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.',
    'submit_date': '2017-06-12',
    'metadata': {'venue': 'NeurIPS 2017', 'pdf_url': 'https://arxiv.org/pdf/1706.03762.pdf'}
}

research_arcade.insert_node("arxiv_papers", node_features=new_paper)
print("Paper inserted successfully!")

In [None]:
# Example 2: Insert BERT paper
bert_paper = {
    'arxiv_id': '1810.04805v2',
    'base_arxiv_id': '1810.04805',
    'version': 2,
    'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding',
    'abstract': 'We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.',
    'submit_date': '2018-10-11',
    'metadata': {'venue': 'NAACL 2019', 'citations': 50000}
}

research_arcade.insert_node("arxiv_papers", node_features=bert_paper)
print("BERT paper inserted successfully!")

### Get All Papers

In [None]:
arxiv_papers_df = research_arcade.get_all_node_features("arxiv_papers")
print(f"Total papers in database: {len(arxiv_papers_df)}")
print("\nFirst 5 papers:")
print(arxiv_papers_df.head())

### Get Specific Paper by ID

In [None]:
paper_id = {"arxiv_id": "1706.03762v7"}
paper_features = research_arcade.get_node_features_by_id("arxiv_papers", paper_id)
print("Paper details:")
print(paper_features.to_dict(orient="records")[0])

### Update a Paper

In [None]:
# Update metadata for a paper
updated_paper = {
    'arxiv_id': '1706.03762v7',
    'metadata': {
        'venue': 'NeurIPS 2017',
        'pdf_url': 'https://arxiv.org/pdf/1706.03762.pdf',
        'citations': 75000,
        'influential': True
    }
}

research_arcade.update_node("arxiv_papers", node_features=updated_paper)
print("Paper updated successfully!")
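
The update above resends `venue` and `pdf_url` alongside the new keys. This tutorial does not say whether `update_node` merges the `metadata` dict or overwrites it, so the sketch below assumes it overwrites and builds the full dict on the client side first. `merge_metadata` is a hypothetical helper.

```python
def merge_metadata(existing, changes):
    """Return a new dict with `changes` layered over `existing`."""
    merged = dict(existing)  # copy so the caller's dict is left untouched
    merged.update(changes)
    return merged


existing = {
    "venue": "NeurIPS 2017",
    "pdf_url": "https://arxiv.org/pdf/1706.03762.pdf",
}
changes = {"citations": 75000, "influential": True}

metadata = merge_metadata(existing, changes)
print(metadata)
```

The merged dict can then be used as the `metadata` value in the `update_node` call.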

### Delete a Paper

In [None]:
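# Delete a paper by ID. Later sections reference this paper, so re-run the
# "Insert a Paper" cell afterwards if you want to follow along.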
paper_id = {"arxiv_id": "1706.03762v7"}
deleted_paper = research_arcade.delete_node_by_id("arxiv_papers", paper_id)
print("Deleted paper:")
print(deleted_paper.to_dict(orient="records")[0])

## 4. ArXiv Authors <a name="arxiv-authors"></a>

### Table Schema
- `id` (SERIAL PK)
- `semantic_scholar_id` (VARCHAR, unique)
- `name` (VARCHAR)
- `homepage` (VARCHAR)

### Insert Authors

In [None]:
# Insert authors from the Transformer paper.
# NOTE: the IDs and the shared homepage URL below are placeholder values.
authors = [
    {
        'semantic_scholar_id': 'ss_ashish_vaswani',
        'name': 'Ashish Vaswani',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_noam_shazeer',
        'name': 'Noam Shazeer',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_niki_parmar',
        'name': 'Niki Parmar',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_jakob_uszkoreit',
        'name': 'Jakob Uszkoreit',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_llion_jones',
        'name': 'Llion Jones',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    }
]

for author in authors:
    research_arcade.insert_node("arxiv_authors", node_features=author)
    print(f"Inserted author: {author['name']}")
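
Because `semantic_scholar_id` is declared unique, a repeated ID in the input list would presumably be rejected or overwrite an earlier row. As a rough sketch, the list can be checked for repeated keys before the loop runs. `dedupe_by_key` is a hypothetical helper that works on plain dicts.

```python
def dedupe_by_key(records, key):
    """Split records into (first occurrences, later duplicates) by `key`."""
    seen = set()
    unique, duplicates = [], []
    for record in records:
        value = record[key]
        if value in seen:
            duplicates.append(record)
        else:
            seen.add(value)
            unique.append(record)
    return unique, duplicates


sample_authors = [
    {"semantic_scholar_id": "ss_ashish_vaswani", "name": "Ashish Vaswani"},
    {"semantic_scholar_id": "ss_noam_shazeer", "name": "Noam Shazeer"},
    {"semantic_scholar_id": "ss_ashish_vaswani", "name": "A. Vaswani"},
]

unique, duplicates = dedupe_by_key(sample_authors, "semantic_scholar_id")
print(len(unique), "unique,", len(duplicates), "duplicate")
```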

### Get All Authors

In [None]:
authors_df = research_arcade.get_all_node_features("arxiv_authors")
print(f"Total authors in database: {len(authors_df)}")
print("\nAll authors:")
print(authors_df)

### Get Specific Author by ID

In [None]:
author_id = {"semantic_scholar_id": "ss_ashish_vaswani"}
author_features = research_arcade.get_node_features_by_id("arxiv_authors", author_id)
print("Author details:")
print(author_features.to_dict(orient="records")[0])

### Update an Author

In [None]:
updated_author = {
    'semantic_scholar_id': 'ss_ashish_vaswani',
    'homepage': 'https://ashishvaswani.com'
}

research_arcade.update_node("arxiv_authors", node_features=updated_author)
print("Author updated successfully!")

## 5. ArXiv Categories <a name="arxiv-categories"></a>

### Table Schema
- `id` (SERIAL PK)
- `name` (VARCHAR, unique)
- `description` (TEXT)

### Insert Categories

In [None]:
categories = [
    {
        'name': 'cs.CL',
        'description': 'Computation and Language (Natural Language Processing)'
    },
    {
        'name': 'cs.LG',
        'description': 'Machine Learning'
    },
    {
        'name': 'cs.AI',
        'description': 'Artificial Intelligence'
    },
    {
        'name': 'cs.CV',
        'description': 'Computer Vision and Pattern Recognition'
    },
    {
        'name': 'stat.ML',
        'description': 'Machine Learning (Statistics)'
    }
]

for category in categories:
    research_arcade.insert_node("arxiv_categories", node_features=category)
    print(f"Inserted category: {category['name']}")

### Get All Categories

In [None]:
categories_df = research_arcade.get_all_node_features("arxiv_categories")
print(f"Total categories: {len(categories_df)}")
print("\nAll categories:")
print(categories_df)

## 6. ArXiv Figures <a name="arxiv-figures"></a>

### Table Schema
- `id` (SERIAL PK)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `path` (VARCHAR)
- `caption` (TEXT)
- `label` (TEXT)
- `name` (TEXT)

### Insert Figures

In [None]:
# Insert figures for the Transformer paper
figures = [
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/figures/transformer_architecture.png',
        'caption': 'The Transformer model architecture. The left side shows the encoder stack and the right side shows the decoder stack.',
        'label': 'fig:architecture',
        'name': 'Figure 1'
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/figures/scaled_dot_product_attention.png',
        'caption': 'Scaled Dot-Product Attention and Multi-Head Attention mechanisms.',
        'label': 'fig:attention',
        'name': 'Figure 2'
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/figures/positional_encoding.png',
        'caption': 'Positional encoding visualization showing sine and cosine functions of different frequencies.',
        'label': 'fig:positional',
        'name': 'Figure 3'
    }
]

for figure in figures:
    research_arcade.insert_node("arxiv_figures", node_features=figure)
    print(f"Inserted {figure['name']}")

### Get All Figures

In [None]:
figures_df = research_arcade.get_all_node_features("arxiv_figures")
print(f"Total figures: {len(figures_df)}")
print("\nAll figures:")
print(figures_df[['name', 'caption', 'label']])

## 7. ArXiv Tables <a name="arxiv-tables"></a>

### Table Schema
- `id` (SERIAL PK)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `path` (VARCHAR)
- `caption` (TEXT)
- `label` (TEXT)
- `table_text` (TEXT)

### Insert Tables

In [None]:
# Insert tables for the Transformer paper
tables = [
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/tables/model_variations.tex',
        'caption': 'Variations on the Transformer architecture with different hyperparameters.',
        'label': 'tab:variations',
        'table_text': 'Model | N | d_model | d_ff | h | d_k | d_v | P_drop | train time\nbase | 6 | 512 | 2048 | 8 | 64 | 64 | 0.1 | 12 hrs'
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/tables/wmt_results.tex',
        'caption': 'Performance of the Transformer on WMT 2014 English-German and English-French translation tasks.',
        'label': 'tab:wmt',
        'table_text': 'Model | EN-DE BLEU | EN-FR BLEU\nTransformer (base) | 27.3 | 38.1\nTransformer (big) | 28.4 | 41.8'
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/tables/parsing_results.tex',
        'caption': 'English constituency parsing results on WSJ test set.',
        'label': 'tab:parsing',
        'table_text': 'Model | WSJ 23 F1\nTransformer | 91.3'
    }
]

for table in tables:
    research_arcade.insert_node("arxiv_tables", node_features=table)
    print(f"Inserted table: {table['label']}")

### Get All Tables

In [None]:
tables_df = research_arcade.get_all_node_features("arxiv_tables")
print(f"Total tables: {len(tables_df)}")
print("\nAll tables:")
print(tables_df[['label', 'caption']])

## 8. ArXiv Sections <a name="arxiv-sections"></a>

### Table Schema
- `id` (SERIAL PK)
- `content` (TEXT)
- `title` (TEXT)
- `appendix` (BOOLEAN)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `section_in_paper_id` (INT)

### Insert Sections

In [None]:
# Insert sections for the Transformer paper
sections = [
    {
        'content': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder...',
        'title': 'Introduction',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 1
    },
    {
        'content': 'The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as basic building block...',
        'title': 'Background',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 2
    },
    {
        'content': 'Most neural sequence transduction models have an encoder-decoder structure. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers...',
        'title': 'Model Architecture',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 3
    },
    {
        'content': 'In this section we describe the training regime for our models...',
        'title': 'Training',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 4
    },
    {
        'content': 'On the WMT 2014 English-to-German translation task, the big transformer model outperforms the best previously reported models...',
        'title': 'Results',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 5
    },
    {
        'content': 'In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers...',
        'title': 'Conclusion',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 6
    }
]

for section in sections:
    research_arcade.insert_node("arxiv_sections", node_features=section)
    print(f"Inserted section: {section['title']}")

### Get All Sections

In [None]:
sections_df = research_arcade.get_all_node_features("arxiv_sections")
print(f"Total sections: {len(sections_df)}")
print("\nAll sections:")
print(sections_df[['title', 'section_in_paper_id', 'appendix']])

## 9. ArXiv Paragraphs <a name="arxiv-paragraphs"></a>

### Table Schema
- `id` (SERIAL PK)
- `paragraph_id` (INT)
- `content` (TEXT)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `paper_section` (TEXT)
- `section_id` (INT)
- `paragraph_in_paper_id` (INT)

### Insert Paragraphs

In [None]:
# Insert paragraphs from the Introduction section
paragraphs = [
    {
        'paragraph_id': 1,
        'content': 'Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 1
    },
    {
        'paragraph_id': 2,
        'content': 'Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures. Recurrent models typically factor computation along the symbol positions of the input and output sequences.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 2
    },
    {
        'paragraph_id': 3,
        'content': 'Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 3
    },
    {
        'paragraph_id': 4,
        'content': 'Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 4
    },
    {
        'paragraph_id': 5,
        'content': 'In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 5
    }
]

for paragraph in paragraphs:
    research_arcade.insert_node("arxiv_paragraphs", node_features=paragraph)
    print(f"Inserted paragraph {paragraph['paragraph_id']} from {paragraph['paper_section']}")
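
Numbering paragraphs by hand gets error-prone beyond a few entries. The sketch below generates the same record shape from a list of paragraph texts. `build_paragraph_records` is a hypothetical helper; both starting numbers are parameters because the tutorial does not state how `paragraph_id` and `paragraph_in_paper_id` relate.

```python
def build_paragraph_records(texts, paper_arxiv_id, section_title, section_id,
                            first_paragraph_id=1, first_in_paper=1):
    """Build arxiv_paragraphs-style dicts with consecutive numbering."""
    records = []
    for offset, text in enumerate(texts):
        records.append({
            "paragraph_id": first_paragraph_id + offset,
            "content": text,
            "paper_arxiv_id": paper_arxiv_id,
            "paper_section": section_title,
            "section_id": section_id,
            "paragraph_in_paper_id": first_in_paper + offset,
        })
    return records


generated = build_paragraph_records(
    ["First paragraph text.", "Second paragraph text."],
    paper_arxiv_id="1706.03762v7",
    section_title="Introduction",
    section_id=1,
)
for record in generated:
    print(record["paragraph_id"], record["content"])
```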

### Get All Paragraphs

In [None]:
paragraphs_df = research_arcade.get_all_node_features("arxiv_paragraphs")
print(f"Total paragraphs: {len(paragraphs_df)}")
print("\nFirst 3 paragraphs:")
print(paragraphs_df[['paragraph_id', 'paper_section', 'content']].head(3))

## 10. Relationships/Edges <a name="relationships"></a>

This section demonstrates how to create relationships between different entities.

### Paper-Author Relationships (arxiv_paper_authors)

In [None]:
# Create authorship relationships for the Transformer paper
paper_authors = [
    {
        'paper_arxiv_id': '1706.03762v7',
        'author_id': 'ss_ashish_vaswani',
        'author_sequence': 1
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'author_id': 'ss_noam_shazeer',
        'author_sequence': 2
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'author_id': 'ss_niki_parmar',
        'author_sequence': 3
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'author_id': 'ss_jakob_uszkoreit',
        'author_sequence': 4
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'author_id': 'ss_llion_jones',
        'author_sequence': 5
    }
]

for relation in paper_authors:
    research_arcade.insert_edge("arxiv_paper_authors", edge_features=relation)
    print(f"Linked author {relation['author_id']} to paper (position {relation['author_sequence']})")

### Paper-Category Relationships (arxiv_paper_category)

In [None]:
# Link papers to categories
paper_categories = [
    {
        'paper_arxiv_id': '1706.03762v7',
        'category_id': 1  # cs.CL
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'category_id': 2  # cs.LG
    },
    {
        'paper_arxiv_id': '1810.04805v2',
        'category_id': 1  # cs.CL
    },
    {
        'paper_arxiv_id': '1810.04805v2',
        'category_id': 3  # cs.AI
    }
]

for relation in paper_categories:
    research_arcade.insert_edge("arxiv_paper_category", edge_features=relation)
    print(f"Linked paper {relation['paper_arxiv_id']} to category {relation['category_id']}")
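
The hardcoded `category_id` values above only match if the categories were inserted into an empty table in the order shown. A safer sketch looks the IDs up by name. Here `category_rows` is a stand-in for something like `categories_df.to_dict(orient="records")`, and `category_ids_by_name` is a hypothetical helper.

```python
def category_ids_by_name(category_rows):
    """Map each category name to its database id."""
    return {row["name"]: row["id"] for row in category_rows}


# Stand-in for rows read back from the arxiv_categories table.
category_rows = [
    {"id": 1, "name": "cs.CL"},
    {"id": 2, "name": "cs.LG"},
    {"id": 3, "name": "cs.AI"},
]

lookup = category_ids_by_name(category_rows)
edges = [
    {"paper_arxiv_id": "1706.03762v7", "category_id": lookup[name]}
    for name in ("cs.CL", "cs.LG")
]
print(edges)
```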

### Citation Relationships (arxiv_citations)

In [None]:
# Create citation relationships (BERT cites Transformer)
citations = [
    {
        'citing_arxiv_id': '1810.04805v2',  # BERT
        'cited_arxiv_id': '1706.03762v7',   # Transformer
        'citing_sections': ['Introduction', 'Related Work', 'Model Architecture']
    }
]

for citation in citations:
    research_arcade.insert_edge("arxiv_citations", edge_features=citation)
    print(f"Created citation: {citation['citing_arxiv_id']} → {citation['cited_arxiv_id']}")
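
A citation edge is only meaningful when both papers exist in `arxiv_papers`. The sketch below reports any endpoint that is missing from a set of known IDs, which could be built from the `arxiv_id` column. `missing_endpoints` is a hypothetical helper, not a ResearchArcade call.

```python
def missing_endpoints(edges, known_ids):
    """Return the sorted arXiv IDs referenced by `edges` but absent from `known_ids`."""
    missing = set()
    for edge in edges:
        for field in ("citing_arxiv_id", "cited_arxiv_id"):
            if edge[field] not in known_ids:
                missing.add(edge[field])
    return sorted(missing)


sample_edges = [
    {"citing_arxiv_id": "1810.04805v2", "cited_arxiv_id": "1706.03762v7"},
]

# Only the Transformer paper is known, so the BERT ID is reported.
print(missing_endpoints(sample_edges, {"1706.03762v7"}))
```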

### Paper-Figure Relationships (arxiv_paper_figures)

In [None]:
# Link figures to papers
paper_figures = [
    {
        'paper_arxiv_id': '1706.03762v7',
        'figure_id': 1
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'figure_id': 2
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'figure_id': 3
    }
]

for relation in paper_figures:
    research_arcade.insert_edge("arxiv_paper_figures", edge_features=relation)
    print(f"Linked figure {relation['figure_id']} to paper {relation['paper_arxiv_id']}")

### Paper-Table Relationships (arxiv_paper_tables)

In [None]:
# Link tables to papers
paper_tables = [
    {
        'paper_arxiv_id': '1706.03762v7',
        'table_id': 1
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'table_id': 2
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'table_id': 3
    }
]

for relation in paper_tables:
    research_arcade.insert_edge("arxiv_paper_tables", edge_features=relation)
    print(f"Linked table {relation['table_id']} to paper {relation['paper_arxiv_id']}")

### Paragraph Citation Relationships (arxiv_paragraph_citations)

In [None]:
# Link specific paragraphs to cited papers
paragraph_citations = [
    {
        'paragraph_id': 1,
        'paper_section': 'Introduction',
        'citing_arxiv_id': '1810.04805v2',
        'cited_arxiv_id': '1706.03762v7',
        'bib_key': 'vaswani2017attention'
    },
    {
        'paragraph_id': 4,
        'paper_section': 'Related Work',
        'citing_arxiv_id': '1810.04805v2',
        'cited_arxiv_id': '1706.03762v7',
        'bib_key': 'vaswani2017attention'
    }
]

for relation in paragraph_citations:
    research_arcade.insert_edge("arxiv_paragraph_citations", edge_features=relation)
    print(f"Paragraph {relation['paragraph_id']} cites {relation['cited_arxiv_id']}")

### Paragraph References (arxiv_paragraph_references)

In [None]:
# Link paragraphs to internal references (figures, tables, equations)
paragraph_references = [
    {
        'paragraph_id': 10,
        'paper_section': 'Model Architecture',
        'paper_arxiv_id': '1706.03762v7',
        'reference_label': 'fig:architecture',
        'reference_type': 'figure'
    },
    {
        'paragraph_id': 12,
        'paper_section': 'Model Architecture',
        'paper_arxiv_id': '1706.03762v7',
        'reference_label': 'fig:attention',
        'reference_type': 'figure'
    },
    {
        'paragraph_id': 20,
        'paper_section': 'Results',
        'paper_arxiv_id': '1706.03762v7',
        'reference_label': 'tab:wmt',
        'reference_type': 'table'
    }
]

for relation in paragraph_references:
    research_arcade.insert_edge("arxiv_paragraph_references", edge_features=relation)
    print(f"Paragraph {relation['paragraph_id']} references {relation['reference_type']}: {relation['reference_label']}")
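
In the records above, `reference_type` always agrees with the prefix of `reference_label`. As a sketch, the type can be derived from the label so the two cannot drift apart. The prefix table is an assumption: `fig` and `tab` follow the labels used in this tutorial, while `eq` and the `"equation"` value are guesses for the equations case.

```python
# Assumed mapping from LaTeX-style label prefixes to reference types.
PREFIX_TO_TYPE = {"fig": "figure", "tab": "table", "eq": "equation"}


def reference_type_from_label(label):
    """Infer the reference type from a label such as 'fig:architecture'."""
    prefix, separator, _ = label.partition(":")
    if not separator or prefix not in PREFIX_TO_TYPE:
        raise ValueError(f"Cannot infer reference type from label: {label!r}")
    return PREFIX_TO_TYPE[prefix]


for label in ("fig:architecture", "tab:wmt"):
    print(label, "->", reference_type_from_label(label))
```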

### Paragraph-Table Relationships (arxiv_paragraph_tables)

In [None]:
# Link paragraphs that discuss specific tables
paragraph_tables = [
    {
        'paragraph_id': 18,
        'table_id': 1
    },
    {
        'paragraph_id': 20,
        'table_id': 2
    },
    {
        'paragraph_id': 25,
        'table_id': 3
    }
]

for relation in paragraph_tables:
    research_arcade.insert_edge("arxiv_paragraph_tables", edge_features=relation)
    print(f"Linked paragraph {relation['paragraph_id']} to table {relation['table_id']}")

### Paragraph-Figure Relationships (arxiv_paragraph_figures)

In [None]:
# Link paragraphs that discuss specific figures
paragraph_figures = [
    {
        'paragraph_id': 10,
        'figure_id': 1
    },
    {
        'paragraph_id': 12,
        'figure_id': 2
    },
    {
        'paragraph_id': 15,
        'figure_id': 3
    }
]

for relation in paragraph_figures:
    research_arcade.insert_edge("arxiv_paragraph_figures", edge_features=relation)
    print(f"Linked paragraph {relation['paragraph_id']} to figure {relation['figure_id']}")

## 11. Advanced Queries <a name="advanced-queries"></a>

This section combines the node and edge calls from earlier sections into multi-step queries.

### Batch Insert Multiple Papers

In [None]:
# Insert several papers in a loop (one insert_node call per paper)
papers_batch = [
    {
        'arxiv_id': '1409.0473v7',
        'base_arxiv_id': '1409.0473',
        'version': 7,
        'title': 'Neural Machine Translation by Jointly Learning to Align and Translate',
        'abstract': 'Neural machine translation is a recently proposed approach to machine translation...',
        'submit_date': '2014-09-01',
        'metadata': {'venue': 'ICLR 2015'}
    },
    {
        'arxiv_id': '1512.03385v1',
        'base_arxiv_id': '1512.03385',
        'version': 1,
        'title': 'Deep Residual Learning for Image Recognition',
        'abstract': 'Deeper neural networks are more difficult to train...',
        'submit_date': '2015-12-10',
        'metadata': {'venue': 'CVPR 2016'}
    }
]

for paper in papers_batch:
    research_arcade.insert_node("arxiv_papers", node_features=paper)
    
print(f"Batch inserted {len(papers_batch)} papers")
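
The loop above stops at the first exception, if `insert_node` raises on bad input (this tutorial does not say whether it does). As a sketch under that assumption, the wrapper below keeps going and collects failures for later inspection. `insert_all` and `fake_insert` are hypothetical; `fake_insert` stands in for the real call so the example runs without a database.

```python
def insert_all(records, insert_fn):
    """Call insert_fn on each record; return (inserted_ids, failures)."""
    inserted, failed = [], []
    for record in records:
        try:
            insert_fn(record)
        except Exception as exc:  # broad on purpose: the backend's error types are unknown
            failed.append((record.get("arxiv_id"), str(exc)))
        else:
            inserted.append(record.get("arxiv_id"))
    return inserted, failed


def fake_insert(record):
    """Stand-in for the real insert call; rejects records without a title."""
    if not record.get("title"):
        raise ValueError("title is required")


sample_batch = [
    {"arxiv_id": "1409.0473v7", "title": "Neural Machine Translation by Jointly Learning to Align and Translate"},
    {"arxiv_id": "1512.03385v1", "title": ""},
]

inserted, failed = insert_all(sample_batch, fake_insert)
print("inserted:", inserted)
print("failed:", failed)
```

With the real backend, `insert_fn` could be `lambda p: research_arcade.insert_node("arxiv_papers", node_features=p)`.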

### Query Papers by a Specific Author

In [None]:
# Get all papers by a specific author
author_id = "ss_ashish_vaswani"
author_papers = research_arcade.query_edges(
    "arxiv_paper_authors",
    filters={"author_id": author_id}
)

print(f"Papers by {author_id}:")
for _, paper in author_papers.iterrows():
    paper_details = research_arcade.get_node_features_by_id(
        "arxiv_papers",
        {"arxiv_id": paper['paper_arxiv_id']}
    )
    print(f"  - {paper_details['title'].iloc[0]}")

### Find All Papers Citing a Specific Paper

In [None]:
# Find all papers that cite the Transformer paper
cited_paper = "1706.03762v7"
citations = research_arcade.query_edges(
    "arxiv_citations",
    filters={"cited_arxiv_id": cited_paper}
)

print(f"Papers citing {cited_paper}:")
for _, citation in citations.iterrows():
    citing_paper = research_arcade.get_node_features_by_id(
        "arxiv_papers",
        {"arxiv_id": citation['citing_arxiv_id']}
    )
    print(f"  - {citing_paper['title'].iloc[0]}")
    print(f"    Cited in sections: {citation['citing_sections']}")

### Get Complete Paper Structure (Sections + Paragraphs)

In [None]:
# Get the complete structure of a paper
paper_id = "1706.03762v7"

# Get paper
paper = research_arcade.get_node_features_by_id("arxiv_papers", {"arxiv_id": paper_id})
print(f"Paper: {paper['title'].iloc[0]}\n")

# Get sections
sections = research_arcade.query_nodes(
    "arxiv_sections",
    filters={"paper_arxiv_id": paper_id}
)
sections = sections.sort_values('section_in_paper_id')

print("Paper Structure:")
for _, section in sections.iterrows():
    print(f"\n{section['section_in_paper_id']}. {section['title']}")
    
    # Get paragraphs for this section
    paragraphs = research_arcade.query_nodes(
        "arxiv_paragraphs",
        filters={
            "paper_arxiv_id": paper_id,
            "section_id": section['section_in_paper_id']
        }
    )
    paragraphs = paragraphs.sort_values('paragraph_id')
    
    for _, para in paragraphs.iterrows():
        print(f"  Para {para['paragraph_id']}: {para['content'][:100]}...")
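
The cell above issues one paragraph query per section. An alternative sketch fetches all paragraphs for the paper once and groups them in memory. `paragraph_rows` is a stand-in for the rows of a single paragraph query, for example via `to_dict(orient="records")`, and `group_by_section` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_section(paragraph_rows):
    """Group paragraph rows by section_id, each group sorted by paragraph_id."""
    grouped = defaultdict(list)
    for row in paragraph_rows:
        grouped[row["section_id"]].append(row)
    for rows in grouped.values():
        rows.sort(key=lambda row: row["paragraph_id"])
    return dict(grouped)


# Stand-in rows, deliberately out of order.
paragraph_rows = [
    {"section_id": 1, "paragraph_id": 2, "content": "Second intro paragraph"},
    {"section_id": 2, "paragraph_id": 6, "content": "First background paragraph"},
    {"section_id": 1, "paragraph_id": 1, "content": "First intro paragraph"},
]

by_section = group_by_section(paragraph_rows)
for section_id in sorted(by_section):
    ids = [row["paragraph_id"] for row in by_section[section_id]]
    print(section_id, ids)
```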

### Get All Figures and Tables for a Paper

In [None]:
# Get all figures and tables for a specific paper
paper_id = "1706.03762v7"

# Get figures
figures = research_arcade.query_nodes(
    "arxiv_figures",
    filters={"paper_arxiv_id": paper_id}
)

print("Figures:")
for _, fig in figures.iterrows():
    print(f"  {fig['name']}: {fig['caption']}")

# Get tables
tables = research_arcade.query_nodes(
    "arxiv_tables",
    filters={"paper_arxiv_id": paper_id}
)

print("\nTables:")
for _, tab in tables.iterrows():
    print(f"  {tab['label']}: {tab['caption']}")

### Find All Papers in a Specific Category

In [None]:
# Find all papers in the cs.CL category
category_id = 1  # cs.CL

paper_categories = research_arcade.query_edges(
    "arxiv_paper_category",
    filters={"category_id": category_id}
)

print("Papers in cs.CL category:")
for _, pc in paper_categories.iterrows():
    paper = research_arcade.get_node_features_by_id(
        "arxiv_papers",
        {"arxiv_id": pc['paper_arxiv_id']}
    )
    print(f"  - {paper['title'].iloc[0]}")

### Find Author Collaborations

In [None]:
# Find all co-authors of a specific author
author_id = "ss_ashish_vaswani"

# Get papers by this author
author_papers = research_arcade.query_edges(
    "arxiv_paper_authors",
    filters={"author_id": author_id}
)

# Get all co-authors
coauthors = set()
for _, paper_relation in author_papers.iterrows():
    paper_id = paper_relation['paper_arxiv_id']
    all_authors = research_arcade.query_edges(
        "arxiv_paper_authors",
        filters={"paper_arxiv_id": paper_id}
    )
    for _, author_relation in all_authors.iterrows():
        if author_relation['author_id'] != author_id:
            coauthors.add(author_relation['author_id'])

print(f"Co-authors of {author_id}:")
for coauthor_id in coauthors:
    author = research_arcade.get_node_features_by_id(
        "arxiv_authors",
        {"semantic_scholar_id": coauthor_id}
    )
    print(f"  - {author['name'].iloc[0]}")
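
The co-author set above records who collaborated but not how often. As a sketch, the same edge rows can be counted in memory. `edge_rows` is a stand-in for the contents of `arxiv_paper_authors`, the second paper ID is hypothetical, and `coauthor_counts` is not a ResearchArcade function.

```python
from collections import Counter


def coauthor_counts(edge_rows, author_id):
    """Count how many papers each co-author shares with `author_id`."""
    papers = {
        row["paper_arxiv_id"] for row in edge_rows if row["author_id"] == author_id
    }
    return Counter(
        row["author_id"]
        for row in edge_rows
        if row["paper_arxiv_id"] in papers and row["author_id"] != author_id
    )


edge_rows = [
    {"paper_arxiv_id": "1706.03762v7", "author_id": "ss_ashish_vaswani"},
    {"paper_arxiv_id": "1706.03762v7", "author_id": "ss_noam_shazeer"},
    {"paper_arxiv_id": "1706.03762v7", "author_id": "ss_niki_parmar"},
    # Hypothetical second paper, included only to show a count above one.
    {"paper_arxiv_id": "hypothetical-paper-v1", "author_id": "ss_ashish_vaswani"},
    {"paper_arxiv_id": "hypothetical-paper-v1", "author_id": "ss_noam_shazeer"},
]

counts = coauthor_counts(edge_rows, "ss_ashish_vaswani")
for coauthor, shared in counts.most_common():
    print(coauthor, shared)
```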

### Database Statistics

In [None]:
# Get statistics about the database
stats = {
    'papers': len(research_arcade.get_all_node_features("arxiv_papers")),
    'authors': len(research_arcade.get_all_node_features("arxiv_authors")),
    'categories': len(research_arcade.get_all_node_features("arxiv_categories")),
    'figures': len(research_arcade.get_all_node_features("arxiv_figures")),
    'tables': len(research_arcade.get_all_node_features("arxiv_tables")),
    'sections': len(research_arcade.get_all_node_features("arxiv_sections")),
    'paragraphs': len(research_arcade.get_all_node_features("arxiv_paragraphs")),
}

print("Database Statistics:")
print("=" * 40)
for entity, count in stats.items():
    print(f"{entity.capitalize():15s}: {count:5d}")
print("=" * 40)

## Best Practices

### Validation Before Insertion

In [None]:
# Always validate data before insertion
def validate_paper(paper_data):
    required_fields = ['arxiv_id', 'base_arxiv_id', 'version', 'title', 'abstract']
    for field in required_fields:
        if field not in paper_data or not paper_data[field]:
            raise ValueError(f"Missing required field: {field}")
    return True

# Example usage
try:
    new_paper = {
        'arxiv_id': '2023.12345v1',
        'base_arxiv_id': '2023.12345',
        'version': 1,
        'title': 'New Research Paper',
        'abstract': 'This is an abstract...'
    }
    
    if validate_paper(new_paper):
        research_arcade.insert_node("arxiv_papers", node_features=new_paper)
        print("Paper inserted successfully!")
except ValueError as e:
    print(f"Validation error: {e}")
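
`validate_paper` stops at the first missing field. A complementary sketch collects every problem in one pass and also checks that `arxiv_id` agrees with `base_arxiv_id` and `version`, following the schema examples. `find_paper_problems` is a hypothetical helper, and the consistency rule is an assumption drawn from those examples.

```python
def find_paper_problems(paper):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field in ("arxiv_id", "base_arxiv_id", "title", "abstract"):
        value = paper.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field} must be a non-empty string")

    version = paper.get("version")
    if isinstance(version, bool) or not isinstance(version, int) or version < 1:
        problems.append("version must be a positive integer")
    elif isinstance(paper.get("arxiv_id"), str) and isinstance(paper.get("base_arxiv_id"), str):
        # Assumed rule: arxiv_id is base_arxiv_id followed by "v<version>".
        expected = f"{paper['base_arxiv_id']}v{version}"
        if paper["arxiv_id"] != expected:
            problems.append(f"arxiv_id should be {expected!r}")
    return problems


good_paper = {
    "arxiv_id": "1706.03762v7",
    "base_arxiv_id": "1706.03762",
    "version": 7,
    "title": "Attention Is All You Need",
    "abstract": "The dominant sequence transduction models...",
}
bad_paper = dict(good_paper, version=3, title="")

print(find_paper_problems(good_paper))
print(find_paper_problems(bad_paper))
```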

## Conclusion

This tutorial has covered:

1. Setting up the ResearchArcade database connection
2. Working with OpenReview data
3. Insert and read operations for every ArXiv entity type, with update shown for papers and authors and delete shown for papers:
   - Papers
   - Authors
   - Categories
   - Figures
   - Tables
   - Sections
   - Paragraphs
4. Creating relationships between entities:
   - Authorship
   - Citations
   - Paper-Category links
   - Paper-Figure/Table links
   - Paragraph-level references
5. Advanced querying patterns
6. Best practices for data validation

For more information, refer to the ResearchArcade documentation.