# ResearchArcade Complete Tutorial

This tutorial demonstrates how to work with the ResearchArcade database, covering all node types and edge relationships.

## Table of Contents
1. [Setup](#Setup)
2. [Node Table Operations](#NodeTableOperations)

    2.1. [openreview_authors](#openreview_authors)

    2.2. [openreview_papers](#openreview_papers)

    2.3. [openreview_reviews](#openreview_reviews)

    2.4. [openreview_revisions](#openreview_revisions)

    2.5. [openreview_paragraphs](#openreview_paragraphs)

    2.6. [arxiv_papers](#arxiv_papers)

    2.7. [arxiv_authors](#arxiv_authors)

    2.8. [arxiv_categories](#arxiv_categories)

    2.9. [arxiv_figures](#arxiv_figures)

    2.10. [arxiv_tables](#arxiv_tables)

    2.11. [arxiv_sections](#arxiv_sections)

    2.12. [arxiv_paragraphs](#arxiv_paragraphs)

3. [Edge Table Operations](#EdgeTableOperations)

    3.1. [openreview_arxiv](#openreview_arxiv)

    3.2. [openreview_papers_authors](#openreview_papers_authors)

    3.3. [openreview_papers_reviews](#openreview_papers_reviews)

    3.4. [openreview_papers_revisions](#openreview_papers_revisions)

    3.5. [openreview_revisions_reviews](#openreview_revisions_reviews)

    3.6. [arxiv_citations](#arxiv_citations)

    3.7. [arxiv_papers_authors](#arxiv_papers_authors)

    3.8. [arxiv_papers_categories](#arxiv_papers_categories)

    3.9. [arxiv_papers_figures](#arxiv_papers_figures)

    3.10. [arxiv_papers_tables](#arxiv_papers_tables)

    3.11. [arxiv_paragraphs_references](#arxiv_paragraphs_references)

    3.12. [arxiv_paragraphs_citations](#arxiv_paragraphs_citations)

4. [Batch Processing](#BatchProcessing)

    4.1 [openreview conference](#batch_openreview_conference)

    4.2 [arxiv papers](#batch_arxiv_papers)

5. [Continuous Crawling](#ContinuousCrawling)

    5.1 [arxiv continuous crawling](#arxiv_continuous_crawling)

## Setup

In [1]:
import sys
from tqdm import tqdm
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..')))
from research_arcade.research_arcade import ResearchArcade



### Choose Database Backend

#### CSV Based

In [None]:
db_type = "csv"
config = {
    "csv_dir": "/YOUR/CSV/DIRECTORY/PATH",
}

research_arcade = ResearchArcade(db_type=db_type, config=config)

Created empty CSV file at ./csv2/openreview_arxiv.csv
Created empty CSV file at ./csv2/openreview_authors.csv
Created empty CSV file at ./csv2/openreview_papers_authors.csv
Created empty CSV file at ./csv2/openreview_papers_reviews.csv
Created empty CSV file at ./csv2/openreview_papers_revisions.csv
Created empty CSV file at ./csv2/openreview_papers.csv
Created empty CSV file at ./csv2/openreview_reviews.csv
Created empty CSV file at ./csv2/openreview_revisions_reviews.csv
Created empty CSV file at ./csv2/openreview_revisions.csv
Created empty CSV file at ./csv2/openreview_paragraphs.csv


#### SQL Based

In [None]:
db_type = "sql"
config = {
    "host": "localhost",
    "dbname": "DATABASE_NAME",
    "user": "USER_NAME",
    "password": "PASSWORD",
    "port": "5432"
}

research_arcade = ResearchArcade(db_type=db_type, config=config)
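Hard-coding a database password in a notebook makes it easy to leak. As a minimal sketch, the same SQL `config` dictionary can be built from environment variables. The `RA_DB_*` variable names are hypothetical and are not defined by ResearchArcade; only the dictionary keys come from the cell above.

```python
import os


def sql_config_from_env(env=os.environ):
    """Build the SQL config dict from environment variables.

    The RA_DB_* names are hypothetical; choose whatever fits your setup.
    Defaults mirror the placeholder values used in the tutorial cell.
    """
    return {
        "host": env.get("RA_DB_HOST", "localhost"),
        "dbname": env.get("RA_DB_NAME", "DATABASE_NAME"),
        "user": env.get("RA_DB_USER", "USER_NAME"),
        "password": env.get("RA_DB_PASSWORD", ""),
        "port": env.get("RA_DB_PORT", "5432"),
    }


config = sql_config_from_env()
# research_arcade = ResearchArcade(db_type="sql", config=config)
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.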

## NodeTableOperations

### openreview_authors

#### construct table from api

In [None]:
config = {"venue": "ICLR.cc/2025/Conference"}
research_arcade.construct_table_from_api("openreview_authors", config)

#### construct table from csv

In [None]:
config = {"csv_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/csv_data/csv_openreview_author_example.csv"}
research_arcade.construct_table_from_csv("openreview_authors", config)

#### construct table from json

In [None]:
config = {"json_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/json_data/json_openreview_author_example.json"}
research_arcade.construct_table_from_json("openreview_authors", config)
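The example files referenced above live on the original author's machine. To try the CSV route locally, a small input file can be generated first. The sketch below assumes the CSV columns mirror the keys of the `insert_node` example for `openreview_authors`; that layout is an assumption, so compare it with the example files shipped in the repository before relying on it.

```python
import csv
import os
import tempfile

# Assumed column layout: the same keys that the insert_node example uses.
AUTHOR_COLUMNS = ["venue", "author_openreview_id", "author_full_name",
                  "email", "affiliation", "homepage", "dblp"]


def write_author_csv(rows, path):
    """Write author rows (dicts keyed by AUTHOR_COLUMNS) to a CSV file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=AUTHOR_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return path


example_row = {
    "venue": "ICLR.cc/2025/Conference",
    "author_openreview_id": "~ishmam_zabir1",
    "author_full_name": "ishmam zabir",
    "email": "****@microsoft.com",
    "affiliation": "Microsoft",
    "homepage": "",
    "dblp": "",
}
csv_path = write_author_csv([example_row],
                            os.path.join(tempfile.mkdtemp(), "authors.csv"))
# config = {"csv_file": csv_path}
# research_arcade.construct_table_from_csv("openreview_authors", config)
```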

#### insert node

In [None]:
new_author = {'venue': 'ICLR.cc/2025/Conference', 
              'author_openreview_id': '~ishmam_zabir1', 
              'author_full_name': 'ishmam zabir', 
              'email': '****@microsoft.com', 
              'affiliation': 'Microsoft', 
              'homepage': 'https://scholar.google.com/citations?user=X7bjzrUAAAAJ&hl=en&oi=ao', 
              'dblp': ''}
research_arcade.insert_node("openreview_authors", node_features=new_author)

#### delete specific node by id

In [None]:
author_id = {"author_openreview_id": "~ishmam_zabir1"}
author_features = research_arcade.delete_node_by_id("openreview_authors", author_id)
print(author_features.to_dict(orient="records")[0])

#### get all nodes

In [None]:
openreview_authors_df = research_arcade.get_all_node_features("openreview_authors")
print(len(openreview_authors_df))


arxiv_papers_df = research_arcade.get_all_node_features("arxiv_papers")
print(len(arxiv_papers_df))

#### get specific node by id

In [None]:
author_id = {"author_openreview_id": "~ishmam_zabir1"}
author_features = research_arcade.get_node_features_by_id("openreview_authors", author_id)
print(author_features.to_dict(orient="records")[0])


paper_id = {"arxiv_papers": "1706.03762v7"}
paper_features = research_arcade.get_node_features_by_id("arxiv_papers", paper_id)
print(paper_features.to_dict(orient="records")[0])
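Indexing `[0]` on the record list raises `IndexError` when the lookup finds nothing. The helper below is a small sketch that returns `None` instead. It assumes the lookup returns a pandas-style DataFrame (as the `to_dict(orient="records")` calls suggest) and that a miss yields either an empty frame or `None`; the second point is an assumption about the library's behaviour.

```python
def first_record(df):
    """Return the first row of a DataFrame-like result as a dict, or None.

    Works with any object exposing pandas' ``to_dict(orient="records")``.
    """
    if df is None:
        return None
    records = df.to_dict(orient="records")
    return records[0] if records else None


# Usage with the lookups above (not executed here):
# author = first_record(
#     research_arcade.get_node_features_by_id("openreview_authors", author_id))
# if author is None:
#     print("author not found")
```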

#### update specific node by id

In [None]:
new_author = {'venue': 'ICLR.cc/2025/Conference', 
              'author_openreview_id': '~ishmam_zabir1', 
              'author_full_name': 'ishmam zabir', 
              'email': '****@microsoft.com', 
              'affiliation': 'Microsoft', 
              'homepage': 'https://scholar.google.com/citations?user=X7bjzrUAAAAJ&hl=en&oi=ao', 
              'dblp': ''}

research_arcade.update_node("openreview_authors", node_features=new_author)
author_id = {"author_openreview_id": "~ishmam_zabir1"}
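Before calling `insert_node` or `update_node`, it can help to check that a feature dictionary has the expected shape. The sketch below uses the author keys shown in this section; whether ResearchArcade itself requires every key is an assumption, and `check_node_features` is a hypothetical helper, not part of the library.

```python
AUTHOR_KEYS = {"venue", "author_openreview_id", "author_full_name",
               "email", "affiliation", "homepage", "dblp"}


def check_node_features(features, expected_keys, id_key):
    """Return a list of problems found in a node feature dict (empty if none)."""
    problems = []
    missing = sorted(expected_keys - set(features))
    extra = sorted(set(features) - expected_keys)
    if missing:
        problems.append(f"missing keys: {missing}")
    if extra:
        problems.append(f"unexpected keys: {extra}")
    if not features.get(id_key):
        problems.append(f"empty id field: {id_key}")
    return problems


candidate = {"venue": "ICLR.cc/2025/Conference",
             "author_openreview_id": "~ishmam_zabir1",
             "author_full_name": "ishmam zabir"}
for problem in check_node_features(candidate, AUTHOR_KEYS, "author_openreview_id"):
    print(problem)
```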


### openreview_papers

#### construct table from api

In [None]:
config = {"venue": "ICLR.cc/2025/Conference"}
research_arcade.construct_table_from_api("openreview_papers", config)

#### construct table from csv

In [None]:
config = {"csv_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/csv_data/csv_openreview_paper_example.csv"}
research_arcade.construct_table_from_csv("openreview_papers", config)

#### construct table from json

In [None]:
config = {"json_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/json_data/json_openreview_paper_example.json"}
research_arcade.construct_table_from_json("openreview_papers", config)

#### insert node

In [None]:
paper_features = {'venue': 'ICLR.cc/2025/Conference', 
                  'paper_openreview_id': 'zGej22CBnS', 
                  'title': 'Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles', 
                  'abstract': "Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies  how tokenization impacts  model performance by analyzing and comparing the stochastic behavior of tokenized models with their byte-level, or token-free, counterparts. We discover that, even when the two models are statistically equivalent, their predictive distributions over the next byte can be substantially different, a phenomenon we term as ``tokenization bias''. To fully characterize this phenomenon, we  introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution.  From this result, we develop a next-byte sampling algorithm  that eliminates tokenization bias without requiring further training or optimization. In other words, this enables zero-shot conversion of tokenized LMs into statistically equivalent token-free ones. We demonstrate its broad applicability with two use cases: fill-in-the-middle (FIM) tasks and model ensembles. In FIM tasks where input prompts may terminate mid-token, leading to out-of-distribution tokenization, our method mitigates performance degradation and achieves 18\\% improvement in FIM coding benchmarks, while consistently outperforming the standard token healing fix. For model ensembles where each model employs a distinct vocabulary, our approach enables seamless integration, resulting in improved performance up to 3.7\\% over individual models across various standard baselines in reasoning, knowledge, and coding. Code is available at:https: //github.com/facebookresearch/Exact-Byte-Level-Probabilities-from-Tokenized-LMs.", 
                  'paper_decision': 'ICLR 2025 Poster', 
                  'paper_pdf_link': '/pdf/cdd2212a20c4034029874cba11a05e081bfdb83e.pdf'}
research_arcade.insert_node("openreview_papers", node_features=paper_features)

#### delete specific node by id

In [None]:
paper_id = {"paper_openreview_id": "zGej22CBnS"}
paper_features = research_arcade.delete_node_by_id("openreview_papers", paper_id)
print(paper_features.to_dict(orient="records")[0])

#### get all nodes

In [None]:
openreview_papers_df = research_arcade.get_all_node_features("openreview_papers")
print(len(openreview_papers_df))

#### get specific node by id

In [None]:
paper_id = {"paper_openreview_id": "zGej22CBnS"}
paper_features = research_arcade.get_node_features_by_id("openreview_papers", paper_id)
print(paper_features.to_dict(orient="records")[0])

#### update specific node by id

In [None]:
new_paper_features = {'venue': 'ICLR.cc/2025/Conference', 
                  'paper_openreview_id': 'zGej22CBnS', 
                  'title': 'Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles', 
                  'abstract': "Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies  how tokenization impacts  model performance by analyzing and comparing the stochastic behavior of tokenized models with their byte-level, or token-free, counterparts. We discover that, even when the two models are statistically equivalent, their predictive distributions over the next byte can be substantially different, a phenomenon we term as ``tokenization bias''. To fully characterize this phenomenon, we  introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution.  From this result, we develop a next-byte sampling algorithm  that eliminates tokenization bias without requiring further training or optimization. In other words, this enables zero-shot conversion of tokenized LMs into statistically equivalent token-free ones. We demonstrate its broad applicability with two use cases: fill-in-the-middle (FIM) tasks and model ensembles. In FIM tasks where input prompts may terminate mid-token, leading to out-of-distribution tokenization, our method mitigates performance degradation and achieves 18\\% improvement in FIM coding benchmarks, while consistently outperforming the standard token healing fix. For model ensembles where each model employs a distinct vocabulary, our approach enables seamless integration, resulting in improved performance up to 3.7\\% over individual models across various standard baselines in reasoning, knowledge, and coding. Code is available at:https: //github.com/facebookresearch/Exact-Byte-Level-Probabilities-from-Tokenized-LMs.", 
                  'paper_decision': 'ICLR 2025 Poster', 
                  'paper_pdf_link': '/pdf/cdd2212a20c4034029874cba11a05e081bfdb83e.pdf'}
research_arcade.update_node("openreview_papers", node_features=new_paper_features)
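The `paper_pdf_link` field stores a site-relative path such as `/pdf/<hash>.pdf`. A minimal sketch for turning it into a downloadable URL follows, assuming the links are relative to `https://openreview.net`; the base URL is an assumption and is not stated in this tutorial.

```python
from urllib.parse import urljoin

OPENREVIEW_BASE = "https://openreview.net"  # assumed host for relative PDF links


def absolute_pdf_url(paper_pdf_link, base=OPENREVIEW_BASE):
    """Turn a stored relative link such as '/pdf/<hash>.pdf' into a full URL."""
    return urljoin(base, paper_pdf_link)


print(absolute_pdf_url("/pdf/cdd2212a20c4034029874cba11a05e081bfdb83e.pdf"))
```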

### openreview_reviews

#### construct table from api

In [None]:
config = {"venue": "ICLR.cc/2013/conference"}
research_arcade.construct_table_from_api("openreview_reviews", config)

#### construct table from csv

In [None]:
config = {"csv_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/csv_data/csv_openreview_review_example.csv"}
research_arcade.construct_table_from_csv("openreview_reviews", config)

#### construct table from json

In [None]:
config = {"json_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/json_data/json_openreview_review_example.json"}
research_arcade.construct_table_from_json("openreview_reviews", config)

#### insert node

In [None]:
review_features = {'venue': 'ICLR.cc/2025/Conference', 
                   'review_openreview_id': 'DHwZxFryth', 
                   'replyto_openreview_id': 'Yqbllggrmw', 
                   'writer': 'Authors', 
                   'title': 'Response by Authors', 
                   'content': {'Title': 'Response to Reviewer 7i95 (1/2)', 'Comment': '> The method does not improve much in the AlpacaEval 2.0 Score. The author should give a detailed explanation. And why not use metrics like length-controlled win rate?**Response:** Thank you for your careful observation and question. We would like to clarify that we are already using the length-controlled (LC) AlpacaEval 2.0 win-rate metric in our evaluations. We will make this clearer in the table header of Table 3.Regarding the fact that the AlpacaEval 2.0 scores on LLama-3 (8B) do not improve compared to the baselines, we believe this is because our base model, the instruction-finetuned LLama-3 (8B), is already trained to perform exceptionally well in terms of helpfulness, which is the focus of the AlpacaEval benchmark. Additionally, the preference dataset we used, UltraFeedback, may not provide significant further enhancement in the helpfulness aspect. This is supported by the slight decrease observed in the AlpacaEval score for the standard DPO baseline as well (see Table 3, results on LLama-3). Therefore, we think these AlpacaEval 2.0 results on LLama-3 (8B) may not indicate that SAIL is ineffective; it may be simply caused by an ill-suited combination of base model, finetuning dataset, and evaluation benchmark.We also further conducted experiments on the Zephyr (7B) model as the backbone, whose AlpacaEval 2.0 win-rate is lower. We still train on the UltraFeedback preference dataset and the other experiment setups are unchanged. \n'
                   'In this experiment, we see a larger improvement of the SAIL method compared to the standard DPO baseline (Zephyr-7B-Beta).|             | AlpacaEval 2.0 (LC) Win-Rate ||--------------------|------------------------------|| Base (Zephyr-7B-SFT-Full) | 6.4 %                        || DPO (Zephyr-7B-Beta)   | 13.2 %                       || SAIL-PP  | 15.9 %                       |> Authors should compare more advanced preference optimization algorithms like ORPO and SimPO. And current results are not impressive for the alignment community.**Response:** Thank you for raising this insightful point. We see ORPO and SimPO are two recent work which propose a different objective than the standard RLHF, and achieve remarkable improvements in terms of alignment performance and efficiency.Our work focus more on bringing standard RLHF to a bilevel optimization framework and propose an effective and efficient approximate algorithm on top of it. We can see some new preference optimization methods including ORPO and SimPO have one fundamental difference from our approach: they do not explicitly incorporate the KL regularization term. The absence of the KL regularization term allows these methods to optimize more aggressively for the reward function by deviating significantly from the reference model. In contrast, our approach is specifically grounded in the standard RLHF, where the KL regularization term ensures that the model remains aligned with the reference distribution while optimizing for the reward function. This distinction makes direct comparisons with ORPO or SimPO less meaningful theoretically, as those methods omit the KL regularization and adopt a fundamentally different optimization objective design.However, we think our work, although developed adhering to the standard RLHF setup, can be compatible and combined with some recent advanced preference optimization algorithms, despite their differences in optimization setups and objectives. \n'
                   'This is because we can reformulate their alignment problem as bilevel optimization, and go through the derivation as done in the paper. Taking SimPO as an example, we can treat their reward model definition (Equation (4) in '
                   'the SimPO paper) as the solution of the upper level optimization (replacing Equation (4) in our manuscript), and adopt their modified Bradley-Terry objective with reward margin (Equation (5) in the SimPO paper) to replace the standard one (Equation (10) in our manuscript). By applying these changes and rederiving the extra gradient terms, we can formulate an adaptation of our method to the SimPO objective. We will implement this combined algorithm, which adapt our methodology to the SimPO objective, and compare with the SimPO as a baseline.Recently many different alignment objectives and algorithms have emerged; it is an interesting question to discuss the compatibility and combination of our method with each objective. We will add more relevant discussions to the appendices, but due to the fact that the compatibility problem with each design is a non-trivial question, this process may incur considerably more work, and we hope the reviewer understands that this effort cannot be fully reflected by the rebuttal period. But we will continue to expand the discussion as the wide compatibility to other designs also strengthens our contribution to the community. We thank the reviewer for raising this insightful point.'}, 
                   'time': '2024-11-26 15:27:26'
}
research_arcade.insert_node("openreview_reviews", node_features=review_features)

#### delete specific node by id

In [None]:
review_id = {"review_openreview_id": "DHwZxFryth"}
review_features = research_arcade.delete_node_by_id("openreview_reviews", review_id)
print(review_features.to_dict(orient="records")[0])

#### get all nodes

In [None]:
openreview_reviews_df = research_arcade.get_all_node_features("openreview_reviews")
print(len(openreview_reviews_df))

#### get specific node by id

In [None]:
review_id = {"review_openreview_id": "DHwZxFryth"}
review_features = research_arcade.get_node_features_by_id("openreview_reviews", review_id)
print(review_features.to_dict(orient="records")[0])

#### update specific node by id

In [None]:
new_review_features = {'venue': 'ICLR.cc/2025/Conference', 
                   'review_openreview_id': 'DHwZxFryth', 
                   'replyto_openreview_id': 'Yqbllggrmw', 
                   'writer': 'Authors', 
                   'title': 'Response by Authors', 
                   'content': {'Title': 'Response to Reviewer 7i95 (1/2)', 'Comment': '> The method does not improve much in the AlpacaEval 2.0 Score. The author should give a detailed explanation. And why not use metrics like length-controlled win rate?**Response:** Thank you for your careful observation and question. We would like to clarify that we are already using the length-controlled (LC) AlpacaEval 2.0 win-rate metric in our evaluations. We will make this clearer in the table header of Table 3.Regarding the fact that the AlpacaEval 2.0 scores on LLama-3 (8B) do not improve compared to the baselines, we believe this is because our base model, the instruction-finetuned LLama-3 (8B), is already trained to perform exceptionally well in terms of helpfulness, which is the focus of the AlpacaEval benchmark. Additionally, the preference dataset we used, UltraFeedback, may not provide significant further enhancement in the helpfulness aspect. This is supported by the slight decrease observed in the AlpacaEval score for the standard DPO baseline as well (see Table 3, results on LLama-3). Therefore, we think these AlpacaEval 2.0 results on LLama-3 (8B) may not indicate that SAIL is ineffective; it may be simply caused by an ill-suited combination of base model, finetuning dataset, and evaluation benchmark.We also further conducted experiments on the Zephyr (7B) model as the backbone, whose AlpacaEval 2.0 win-rate is lower. We still train on the UltraFeedback preference dataset and the other experiment setups are unchanged. \n'
                   'In this experiment, we see a larger improvement of the SAIL method compared to the standard DPO baseline (Zephyr-7B-Beta).|             | AlpacaEval 2.0 (LC) Win-Rate ||--------------------|------------------------------|| Base (Zephyr-7B-SFT-Full) | 6.4 %                        || DPO (Zephyr-7B-Beta)   | 13.2 %                       || SAIL-PP  | 15.9 %                       |> Authors should compare more advanced preference optimization algorithms like ORPO and SimPO. And current results are not impressive for the alignment community.**Response:** Thank you for raising this insightful point. We see ORPO and SimPO are two recent work which propose a different objective than the standard RLHF, and achieve remarkable improvements in terms of alignment performance and efficiency.Our work focus more on bringing standard RLHF to a bilevel optimization framework and propose an effective and efficient approximate algorithm on top of it. We can see some new preference optimization methods including ORPO and SimPO have one fundamental difference from our approach: they do not explicitly incorporate the KL regularization term. The absence of the KL regularization term allows these methods to optimize more aggressively for the reward function by deviating significantly from the reference model. In contrast, our approach is specifically grounded in the standard RLHF, where the KL regularization term ensures that the model remains aligned with the reference distribution while optimizing for the reward function. This distinction makes direct comparisons with ORPO or SimPO less meaningful theoretically, as those methods omit the KL regularization and adopt a fundamentally different optimization objective design.However, we think our work, although developed adhering to the standard RLHF setup, can be compatible and combined with some recent advanced preference optimization algorithms, despite their differences in optimization setups and objectives. \n'
                   'This is because we can reformulate their alignment problem as bilevel optimization, and go through the derivation as done in the paper. Taking SimPO as an example, we can treat their reward model definition (Equation (4) in the SimPO paper) as the solution of the upper level optimization (replacing Equation (4) in our manuscript), and adopt their modified Bradley-Terry objective with reward margin (Equation (5) in the SimPO paper) to replace the standard one (Equation (10) in our manuscript). By applying these changes and rederiving the extra gradient terms, we can formulate an adaptation of our method to the SimPO objective. We will implement this combined algorithm, which adapt our methodology to the SimPO objective, and compare with the SimPO as a baseline.Recently many different alignment objectives and algorithms have emerged; it is an interesting question to discuss the compatibility and combination of our method with each objective. We will add more relevant discussions to the appendices, but due to the fact that the compatibility problem with each design is a non-trivial question, this process may incur considerably more work, and we hope the reviewer understands that this effort cannot be fully reflected by the rebuttal period. But we will continue to expand the discussion as the wide compatibility to other designs also strengthens our contribution to the community. We thank the reviewer for raising this insightful point.'}, 
                   'time': '2024-11-26 15:27:26'
}
research_arcade.update_node("openreview_reviews", node_features=new_review_features)
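Review nodes carry their timestamp as a plain string in the `time` field. To order a discussion thread chronologically, the string can be parsed with the standard library. The sketch below assumes every `time` value follows the `YYYY-MM-DD HH:MM:SS` layout seen in the example; the second entry in `thread` is a made-up placeholder used only to show the ordering.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches '2024-11-26 15:27:26'


def parse_review_time(value):
    """Parse the 'time' field of a review node into a datetime."""
    return datetime.strptime(value, TIME_FORMAT)


def sort_reviews_by_time(reviews):
    """Return review dicts ordered oldest first by their 'time' field."""
    return sorted(reviews, key=lambda r: parse_review_time(r["time"]))


thread = [
    {"review_openreview_id": "DHwZxFryth", "time": "2024-11-26 15:27:26"},
    # Hypothetical placeholder entry, not a real OpenReview id.
    {"review_openreview_id": "hypothetical_earlier_id", "time": "2024-11-20 09:00:00"},
]
for review in sort_reviews_by_time(thread):
    print(review["time"], review["review_openreview_id"])
```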

### openreview_revisions

#### construct table from api

##### get pdfs

In [None]:
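# A relative import (`from .get_pdfs import ...`) fails inside a notebook cell because
# the cell has no parent package. This form assumes the directory that contains
# get_pdfs.py is on sys.path; adjust it to match your checkout.
from get_pdfs import get_paper_pdf, get_revision_pdf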
import os
import openreview
import time

client_v1 = openreview.Client(baseurl='https://api.openreview.net')
client_v2 = openreview.api.OpenReviewClient(baseurl='https://api2.openreview.net')

venue = 'ICLR.cc/2017/conference'
pdf_dir = "/data/jingjunx/openreview_pdfs_2017/"
log_file = "./download_failed_ids_revisions_2017.log"
start_idx = 0
end_idx = 5

In [None]:
# Venues served by the legacy v1 API, grouped by the name of their submission invitation.
v1_blind_years = ("2018", "2019", "2020", "2021", "2022", "2023")
v1_plain_years = ("2013", "2014", "2017")

if any(year in venue for year in v1_blind_years + v1_plain_years):
    if any(year in venue for year in v1_blind_years):
        submissions = client_v1.get_all_notes(invitation=f'{venue}/-/Blind_Submission', details='revisions')
    else:
        submissions = client_v1.get_all_notes(invitation=f'{venue}/-/submission', details='revisions')

    # `not submissions` covers both None and an empty list
    if not submissions:
        print(f"No submissions found for venue: {venue}")
    else:
        for submission in tqdm(submissions[start_idx:end_idx]):
            # get paper openreview id
            paper_id = submission.id
            if "pdf" in submission.content:
                pdf_link = submission.content["pdf"]
                pdf_path = str(pdf_dir)+str(paper_id)+".pdf"
                if os.path.isfile(pdf_path):
                    continue
                else:
                    get_paper_pdf(pdf_link, pdf_path, log_file)
            
            revisions = client_v1.get_references(referent=paper_id, original=True)
            time.sleep(1)
            
            pdf_revisions_ids = []
            for revision in revisions:
                if "pdf" in revision.content:
                    pdf_revisions_ids.append(revision.id)
            
            if len(pdf_revisions_ids) <= 1:
                continue
            else:
                for pdf_revision_id in pdf_revisions_ids:
                    pdf_path = str(pdf_dir)+str(pdf_revision_id)+".pdf"
                    if os.path.isfile(pdf_path):
                        continue
                    else:
                        get_revision_pdf(venue, pdf_revision_id, pdf_path, log_file)
                        time.sleep(1)
else:
    submissions = client_v2.get_all_notes(invitation=f'{venue}/-/Submission', details='revisions')
    if not submissions:  # covers both None and an empty list
        print(f"No submissions found for venue: {venue}")
    else:
        for submission in tqdm(submissions[start_idx:end_idx]):
            decision = submission.content["venueid"]["value"].split('/')[-1]
            if decision == "Withdrawn_Submission":
                continue
            else:
                # get paper openreview id
                paper_id = submission.id
                if "pdf" in submission.content:
                    pdf_link = submission.content["pdf"]["value"]
                    pdf_path = str(pdf_dir)+str(paper_id)+".pdf"
                    if os.path.isfile(pdf_path):
                        continue
                    else:
                        get_paper_pdf(pdf_link, pdf_path, log_file)
                        
                revisions = client_v2.get_note_edits(note_id=paper_id)
                if len(revisions) <= 1:
                    continue
                else:
                    for revision in revisions:
                        pdf_revision_id = revision.id
                        pdf_path = str(pdf_dir)+str(pdf_revision_id)+".pdf"
                        if os.path.isfile(pdf_path):
                            continue
                        else:
                            time.sleep(1)
                            get_revision_pdf(venue, pdf_revision_id, pdf_path, log_file)
                            time.sleep(1)

##### construct the table

In [None]:
venue = "ICLR.cc/2017/conference"
filter_list = ["Under review as a conference paper at ICLR 2017", "Published as a conference paper at ICLR 2017"]
pdf_dir = "/data/jingjunx/openreview_pdfs_2017/"
log_file = "./log/failed_ids_revisions_2017.log"
config = {"venue": venue, "filter_list": filter_list, "pdf_dir": pdf_dir, "log_file": log_file}
research_arcade.construct_table_from_api("openreview_revisions", config)

#### construct table from csv

In [None]:
config = {"csv_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/csv_data/csv_openreview_revision_example.csv"}
research_arcade.construct_table_from_csv("openreview_revisions", config)

#### construct table from json

In [None]:
config = {"json_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/json_data/json_openreview_revision_example.json"}
research_arcade.construct_table_from_json("openreview_revisions", config)

#### insert node

In [None]:
revision_feature = {'venue': 'ICLR.cc/2025/Conference', 
                    'original_openreview_id': 'pbTVNlX8Ig', 
                    'revision_openreview_id': 'yfHQOp5zWc', 
                    'content': [{'section': '1 INTRODUCTION', 
                                 'after_section': None, 
                                 'context_after': '2 RELATED WORK ', 
                                 'paragraph_idx': 9, 
                                 'before_section': None, 
                                 'context_before': 'Published as a conference paper at ICLR 2025 tograd system in PyTorch, specifically tailored for our experimental setup, which is available at ', 
                                 'modified_lines': 'https://github.com/stephane-rivaud/PETRA. ', 
                                 'original_lines': 'https://github.com/streethagore/PETRA. ', 
                                 'after_paragraph_idx': None, 
                                 'before_paragraph_idx': None}], 
                    'time': '2025-03-14 15:35:37'}
research_arcade.insert_node("openreview_revisions", node_features=revision_feature)

#### delete specific node by id

In [None]:
revision_id = {"revision_openreview_id": "yfHQOp5zWc"}
revision_feature = research_arcade.delete_node_by_id("openreview_revisions", revision_id)
print(revision_feature.to_dict(orient="records")[0])

#### get all nodes

In [None]:
openreview_revisions_df = research_arcade.get_all_node_features("openreview_revisions")
print(len(openreview_revisions_df))

#### get specific node by id

In [None]:
revision_id = {"revision_openreview_id": "yfHQOp5zWc"}
revision_feature = research_arcade.get_node_features_by_id("openreview_revisions", revision_id)
print(revision_feature.to_dict(orient="records")[0])

#### update specific node by id

In [None]:
new_revision_features = {'venue': 'ICLR.cc/2025/Conference', 
                    'original_openreview_id': 'pbTVNlX8Ig', 
                    'revision_openreview_id': 'yfHQOp5zWc', 
                    'content': [{'section': '1 INTRODUCTION', 
                                 'after_section': None, 
                                 'context_after': '2 RELATED WORK ', 
                                 'paragraph_idx': 9, 
                                 'before_section': None, 
                                 'context_before': 'Published as a conference paper at ICLR 2025 tograd system in PyTorch, specifically tailored for our experimental setup, which is available at ', 
                                 'modified_lines': 'https://github.com/stephane-rivaud/PETRA. ', 
                                 'original_lines': 'https://github.com/streethagore/PETRA. ', 
                                 'after_paragraph_idx': None, 
                                 'before_paragraph_idx': None}], 
                    'time': '2025-03-14 15:35:37'}
research_arcade.update_node("openreview_revisions", node_features=new_revision_features)
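
Each entry in `content` records one changed paragraph: `original_lines` holds the old text and `modified_lines` the new text. The following is a minimal sketch, assuming only the dictionary layout shown above, that renders each change as a before/after pair. `summarize_revision` is a hypothetical helper for illustration and is not part of ResearchArcade.

```python
def summarize_revision(revision):
    """Return one human-readable line per changed paragraph in a revision record."""
    lines = []
    for change in revision.get("content", []):
        section = change.get("section") or "(unknown section)"
        idx = change.get("paragraph_idx")
        old = (change.get("original_lines") or "").strip()
        new = (change.get("modified_lines") or "").strip()
        lines.append(f"[{section} / paragraph {idx}] {old!r} -> {new!r}")
    return lines


# Trimmed copy of the revision record used in this section.
example_revision = {
    "revision_openreview_id": "yfHQOp5zWc",
    "content": [{
        "section": "1 INTRODUCTION",
        "paragraph_idx": 9,
        "modified_lines": "https://github.com/stephane-rivaud/PETRA. ",
        "original_lines": "https://github.com/streethagore/PETRA. ",
    }],
}

for line in summarize_revision(example_revision):
    print(line)
```

The same function could be applied to each row returned by `get_node_features_by_id` after converting it with `to_dict(orient="records")`.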

### openreview_paragraphs

#### construct table from api

##### get pdfs

In [None]:
from .get_pdfs import get_paper_pdf, get_revision_pdf
import os
import openreview
from tqdm import tqdm
import time

client_v1 = openreview.Client(baseurl='https://api.openreview.net')
client_v2 = openreview.api.OpenReviewClient(baseurl='https://api2.openreview.net')

venue = 'ICLR.cc/2025/Conference'
pdf_dir = "/data/jingjunx/openreview_pdfs_2025/"
log_file = "./download_failed_ids_revisions_2025.log"
start_idx = 0
end_idx = 5

In [None]:
# The listed years are served by API v1; any other venue falls through to API v2.
v1_blind_years = ("2018", "2019", "2020", "2021", "2022", "2023")
v1_plain_years = ("2013", "2014", "2017")
if any(year in venue for year in v1_blind_years + v1_plain_years):
    if any(year in venue for year in v1_blind_years):
        submissions = client_v1.get_all_notes(invitation=f'{venue}/-/Blind_Submission', details='revisions')
    elif any(year in venue for year in v1_plain_years):
        submissions = client_v1.get_all_notes(invitation=f'{venue}/-/submission', details='revisions')
        
    if submissions is None:
        print(f"No submissions found for venue: {venue}")
    else:
        for submission in tqdm(submissions[start_idx:end_idx]):
            # get paper openreview id
            paper_id = submission.id
            if "pdf" in submission.content:
                pdf_link = submission.content["pdf"]
                pdf_path = str(pdf_dir)+str(paper_id)+".pdf"
                if os.path.isfile(pdf_path):
                    continue
                else:
                    get_paper_pdf(pdf_link, pdf_path, log_file)
            
            revisions = client_v1.get_references(referent=paper_id, original=True)
            time.sleep(1)
            
            pdf_revisions_ids = []
            for revision in revisions:
                if "pdf" in revision.content:
                    pdf_revisions_ids.append(revision.id)
            
            if len(pdf_revisions_ids) <= 1:
                continue
            else:
                for pdf_revision_id in pdf_revisions_ids:
                    pdf_path = str(pdf_dir)+str(pdf_revision_id)+".pdf"
                    if os.path.isfile(pdf_path):
                        continue
                    else:
                        get_revision_pdf(venue, pdf_revision_id, pdf_path, log_file)
                        time.sleep(1)
else:
    submissions = client_v2.get_all_notes(invitation=f'{venue}/-/Submission', details='revisions')
    if submissions is None:
        print(f"No submissions found for venue: {venue}")
    else:
        for submission in tqdm(submissions[start_idx:end_idx]):
            decision = submission.content["venueid"]["value"].split('/')[-1]
            if decision == "Withdrawn_Submission":
                continue
            else:
                # get paper openreview id
                paper_id = submission.id
                if "pdf" in submission.content:
                    pdf_link = submission.content["pdf"]["value"]
                    pdf_path = str(pdf_dir)+str(paper_id)+".pdf"
                    if os.path.isfile(pdf_path):
                        continue
                    else:
                        get_paper_pdf(pdf_link, pdf_path, log_file)
                        
                revisions = client_v2.get_note_edits(note_id=paper_id)
                if len(revisions) <= 1:
                    continue
                else:
                    for revision in revisions:
                        pdf_revision_id = revision.id
                        pdf_path = str(pdf_dir)+str(pdf_revision_id)+".pdf"
                        if os.path.isfile(pdf_path):
                            continue
                        else:
                            time.sleep(1)
                            get_revision_pdf(venue, pdf_revision_id, pdf_path, log_file)
                            time.sleep(1)
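
The branching in the cell above picks an OpenReview client and a submission invitation from the year embedded in the venue ID. The following is a minimal sketch of that dispatch as a pure function, which makes the rule easy to check without network access. `pick_api` and the two constants are hypothetical names. The year lists are copied from the cell, not taken from OpenReview documentation.

```python
V1_BLIND_YEARS = ("2018", "2019", "2020", "2021", "2022", "2023")
V1_PLAIN_YEARS = ("2013", "2014", "2017")


def pick_api(venue):
    """Return (api_version, invitation) for a venue ID such as 'ICLR.cc/2025/Conference'."""
    if any(year in venue for year in V1_BLIND_YEARS):
        return "v1", f"{venue}/-/Blind_Submission"
    if any(year in venue for year in V1_PLAIN_YEARS):
        return "v1", f"{venue}/-/submission"
    # Every other venue falls through to API v2, as in the cell above.
    return "v2", f"{venue}/-/Submission"


print(pick_api("ICLR.cc/2025/Conference"))
print(pick_api("ICLR.cc/2017/conference"))
```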

##### construct the table

In [None]:
venue = "ICLR.cc/2025/Conference"
filter_list = ["Under review as a conference paper at ICLR 2025", "Published as a conference paper at ICLR 2025"]
pdf_dir = "/data/jingjunx/openreview_pdfs_2025/"
log_file = "./log/failed_ids_revisions_2025.log"
config = {"venue": venue, "filter_list": filter_list, "pdf_dir": pdf_dir, "log_file": log_file, "is_paper": True, "is_revision": True, "is_pdf_delete": False}
research_arcade.construct_table_from_api("openreview_paragraphs", config)

#### construct table from csv

In [None]:
config = {"csv_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/csv_data/csv_openreview_paragraphs_example.csv"}
research_arcade.construct_table_from_csv("openreview_paragraphs", config)

#### construct table from json

In [None]:
config = {"json_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/json_data/json_openreview_paragraphs_example.json"}
research_arcade.construct_table_from_json("openreview_paragraphs", config)

#### insert node

In [None]:
paragraph_feature = {'venue': 'xujj_test', 
                    'paper_openreview_id': 'xujj_test', 
                    'paragraph_idx': 1, 
                    'section': "xujj_test", 
                    'content': "xujj_test"}
research_arcade.insert_node("openreview_paragraphs", node_features=paragraph_feature)

#### delete specific node by id

In [None]:
paper_id = {"paper_openreview_id": "xujj_test"}
paragraph_feature = research_arcade.delete_node_by_id("openreview_paragraphs", paper_id)
print(len(paragraph_feature))
print(paragraph_feature.to_dict(orient="records")[0])

#### get all nodes

In [None]:
openreview_paragraphs_df = research_arcade.get_all_node_features("openreview_paragraphs")
print(len(openreview_paragraphs_df))

#### get specific node by id

In [None]:
paper_id = {"paper_openreview_id": "ryxB0Rtxx"}
paragraph_feature = research_arcade.get_node_features_by_id("openreview_paragraphs", paper_id)
print(paragraph_feature.to_dict(orient="records")[0])

### arxiv_papers

#### Table Schema

- `id` (SERIAL PK)
- `arxiv_id` (VARCHAR, unique) - e.g., 1802.08773v3
- `base_arxiv_id` (VARCHAR) - e.g., 1802.08773
- `version` (INT) - e.g., 3
- `title` (TEXT)
- `abstract` (TEXT)
- `submit_date` (DATE)
- `metadata` (JSONB)
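
The schema stores the versioned ID alongside its two parts (`base_arxiv_id` and `version`). The following is a minimal sketch, assuming IDs always end in a `v<number>` suffix as in the examples above, of deriving both parts from `arxiv_id`. `split_arxiv_id` is a hypothetical helper and is not part of ResearchArcade.

```python
import re

# Lazily capture everything before the final "v<digits>" suffix.
ARXIV_ID_PATTERN = re.compile(r"^(?P<base>.+?)v(?P<version>\d+)$")


def split_arxiv_id(arxiv_id):
    """Split a versioned arXiv ID into (base_arxiv_id, version)."""
    match = ARXIV_ID_PATTERN.match(arxiv_id)
    if match is None:
        raise ValueError(f"no version suffix in arXiv ID: {arxiv_id!r}")
    return match.group("base"), int(match.group("version"))


base, version = split_arxiv_id("1802.08773v3")
print(base, version)
```

Deriving the two fields from one source string avoids records in which `arxiv_id`, `base_arxiv_id`, and `version` disagree.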

#### Construct Table from API

In [None]:
config = {"arxiv_ids": ["1806.08804v4", "1903.03894v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_papers", config)

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_papers_example.csv"}
research_arcade.construct_table_from_csv("arxiv_papers", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_papers_example.json"}
research_arcade.construct_table_from_json("arxiv_papers", config)

#### Insert a Paper

In [None]:
# Example 1: Insert the famous "Attention Is All You Need" paper
new_paper = {
    'arxiv_id': '1706.03762v7',
    'base_arxiv_id': '1706.03762',
    'version': 7,
    'title': 'Attention Is All You Need',
    'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.',
    'submit_date': '2017-06-12',
    'metadata': {'venue': 'NeurIPS 2017', 'pdf_url': 'https://arxiv.org/pdf/1706.03762.pdf'}
}

research_arcade.insert_node("arxiv_papers", node_features=new_paper)
print("Paper inserted successfully!")

In [None]:
# Example 2: Insert BERT paper
bert_paper = {
    'arxiv_id': '1810.04805v2',
    'base_arxiv_id': '1810.04805',
    'version': 2,
    'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding',
    'abstract': 'We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.',
    'submit_date': '2018-10-11',
    'metadata': {'venue': 'NAACL 2019', 'citations': 50000}
}

research_arcade.insert_node("arxiv_papers", node_features=bert_paper)
print("BERT paper inserted successfully!")

#### Get All Papers

In [None]:
arxiv_papers_df = research_arcade.get_all_node_features("arxiv_papers")
print(f"Total papers in database: {len(arxiv_papers_df)}")
print("\nFirst 5 papers:")
print(arxiv_papers_df.head())

#### Get Specific Paper by ID

In [None]:
paper_id = {"arxiv_id": "1810.04805v2"}
paper_features = research_arcade.get_node_features_by_id("arxiv_papers", paper_id)
print("Paper details:")
print(paper_features.to_dict(orient="records")[0])

#### Update a Paper

In [None]:
# Update metadata for a paper
updated_paper = {
    'arxiv_id': '1706.03762v7',
    'metadata': {
        'venue': 'NeurIPS 2017',
        'pdf_url': 'https://arxiv.org/pdf/1706.03762.pdf',
        'citations': 75000,
        'influential': True
    }
}

research_arcade.update_node("arxiv_papers", node_features=updated_paper)
print("Paper updated successfully!")

#### Delete a Paper

In [None]:
# Delete a paper by ID
paper_id = {"arxiv_id": "1706.03762v7"}
deleted_paper = research_arcade.delete_node_by_id("arxiv_papers", paper_id)
print("Deleted paper:")
print(deleted_paper)

### arxiv_authors

#### Table Schema

- `id` (SERIAL PK)
- `semantic_scholar_id` (VARCHAR, unique)
- `name` (VARCHAR)
- `homepage` 

#### Construct Table from API

In [None]:
config = {"arxiv_ids": ["1903.03894v4", "1806.08804v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_authors", config)

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_authors_example.csv"}
research_arcade.construct_table_from_csv("arxiv_authors", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_authors_example.json"}
research_arcade.construct_table_from_json("arxiv_authors", config)

#### Insert Authors

In [None]:
# Insert authors from the Transformer paper
authors = [
    {
        'semantic_scholar_id': 'ss_ashish_vaswani',
        'name': 'Ashish Vaswani',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_noam_shazeer',
        'name': 'Noam Shazeer',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_niki_parmar',
        'name': 'Niki Parmar',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_jakob_uszkoreit',
        'name': 'Jakob Uszkoreit',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    },
    {
        'semantic_scholar_id': 'ss_llion_jones',
        'name': 'Llion Jones',
        'homepage': 'https://scholar.google.com/citations?user=oR9sCGYAAAAJ'
    }
]

for author in authors:
    research_arcade.insert_node("arxiv_authors", node_features=author)
    print(f"Inserted author: {author['name']}")

#### Get All Authors

In [None]:
authors_df = research_arcade.get_all_node_features("arxiv_authors")
print(f"Total authors in database: {len(authors_df)}")
print("\nAll authors:")
print(authors_df)

#### Get Specific Author by ID

In [None]:
author_id = {"semantic_scholar_id": "ss_ashish_vaswani"}
author_features = research_arcade.get_node_features_by_id("arxiv_authors", author_id)
print("Author details:")
print(author_features)

#### Update an Author

In [None]:
updated_author = {
    'semantic_scholar_id': 'ss_ashish_vaswani',
    'homepage': 'https://ashishvaswani.com'
}

research_arcade.update_node("arxiv_authors", node_features=updated_author)
print("Author updated successfully!")

### arxiv_categories

#### Table Schema

- `id` (SERIAL PK)
- `name` (VARCHAR, unique)
- `description` (TEXT)

#### Construct Table from API

In [None]:
config = {"arxiv_ids": ["1903.03894v4", "1806.08804v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_categories", config)

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_categories_example.csv"}
research_arcade.construct_table_from_csv("arxiv_categories", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_categories_example.json"}
research_arcade.construct_table_from_json("arxiv_categories", config)

#### Insert Categories

In [None]:
categories = [
    {
        'name': 'cs.CL',
        'description': 'Computation and Language (Natural Language Processing)'
    },
    {
        'name': 'cs.LG',
        'description': 'Machine Learning'
    },
    {
        'name': 'cs.AI',
        'description': 'Artificial Intelligence'
    },
    {
        'name': 'cs.CV',
        'description': 'Computer Vision and Pattern Recognition'
    },
    {
        'name': 'stat.ML',
        'description': 'Machine Learning (Statistics)'
    }
]

for category in categories:
    research_arcade.insert_node("arxiv_categories", node_features=category)
    print(f"Inserted category: {category['name']}")

#### Get All Categories

In [None]:
categories_df = research_arcade.get_all_node_features("arxiv_categories")
print(f"Total categories: {len(categories_df)}")
print("\nAll categories:")
print(categories_df)

### arxiv_figures

#### Table Schema

- `id` (SERIAL PK)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `path` (VARCHAR)
- `caption` (TEXT)
- `label` (TEXT)
- `name` (TEXT)

#### Insert Figures

In [None]:
# Insert figures for the Transformer paper
figures = [
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/figures/transformer_architecture.png',
        'caption': 'The Transformer model architecture. The left side shows the encoder stack and the right side shows the decoder stack.',
        'label': 'fig:architecture',
        'name': 'Figure 1'
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/figures/scaled_dot_product_attention.png',
        'caption': 'Scaled Dot-Product Attention and Multi-Head Attention mechanisms.',
        'label': 'fig:attention',
        'name': 'Figure 2'
    },
    {
        'paper_arxiv_id': '1706.03762v7',
        'path': '/figures/positional_encoding.png',
        'caption': 'Positional encoding visualization showing sine and cosine functions of different frequencies.',
        'label': 'fig:positional',
        'name': 'Figure 3'
    }
]

for figure in figures:
    research_arcade.insert_node("arxiv_figures", node_features=figure)
    print(f"Inserted {figure['name']}")

#### Get All Figures

In [None]:
figures_df = research_arcade.get_all_node_features("arxiv_figures")
print(f"Total figures: {len(figures_df)}")
print("\nAll figures:")
print(figures_df[['name', 'caption', 'label']])

### arxiv_tables

#### Table Schema

- `id` (SERIAL PK)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `path` (VARCHAR)
- `caption` (TEXT)
- `label` (TEXT)
- `table_text` (TEXT)

#### Construct Table from API

In [None]:
config = {"arxiv_ids": ["1903.03894v4", "1806.08804v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_tables", config)

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_tables_example.csv"}
research_arcade.construct_table_from_csv("arxiv_tables", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_tables_example.json"}
research_arcade.construct_table_from_json("arxiv_tables", config)

### arxiv_sections

#### Table Schema

- `id` (SERIAL PK)
- `content` (TEXT)
- `title` (TEXT)
- `appendix` (BOOLEAN)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `section_in_paper_id` (INT)

#### Construct Table from API

In [None]:
config = {"arxiv_ids": ["1903.03894v4", "1806.08804v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_sections", config)

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_sections_example.csv"}
research_arcade.construct_table_from_csv("arxiv_sections", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_sections_example.json"}
research_arcade.construct_table_from_json("arxiv_sections", config)

#### Insert Sections

In [None]:
# Insert sections for the Transformer paper
sections = [
    {
        'content': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder...',
        'title': 'Introduction',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 1
    },
    {
        'content': 'Most competitive neural sequence transduction models have an encoder-decoder structure. Here, the encoder maps an input sequence of symbol representations...',
        'title': 'Background',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 2
    },
    {
        'content': 'Most neural sequence transduction models have an encoder-decoder structure. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers...',
        'title': 'Model Architecture',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 3
    },
    {
        'content': 'In this section we describe the training regime for our models...',
        'title': 'Training',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 4
    },
    {
        'content': 'On the WMT 2014 English-to-German translation task, the big transformer model outperforms the best previously reported models...',
        'title': 'Results',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 5
    },
    {
        'content': 'In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers...',
        'title': 'Conclusion',
        'appendix': False,
        'paper_arxiv_id': '1706.03762v7',
        'section_in_paper_id': 6
    }
]

for section in sections:
    research_arcade.insert_node("arxiv_sections", node_features=section)
    print(f"Inserted section: {section['title']}")

#### Get All Sections

In [None]:
sections_df = research_arcade.get_all_node_features("arxiv_sections")
print(f"Total sections: {len(sections_df)}")
print("\nAll sections:")
print(sections_df[['title', 'section_in_paper_id', 'appendix']])

### arxiv_paragraphs

#### Table Schema

- `id` (SERIAL PK)
- `paragraph_id` (INT)
- `content` (TEXT)
- `paper_arxiv_id` (VARCHAR FK → papers.arxiv_id)
- `paper_section` (TEXT)
- `section_id` (INT)
- `paragraph_in_paper_id` (INT)
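
Rows passed to `insert_node` are plain dictionaries keyed by the columns above. The following is a minimal sketch of checking a record for missing keys before inserting it. It assumes that `id` is assigned by the database (it is a SERIAL key) and that every other column should be present. `missing_columns` and `ARXIV_PARAGRAPH_COLUMNS` are hypothetical names and are not part of ResearchArcade.

```python
ARXIV_PARAGRAPH_COLUMNS = (
    "paragraph_id",
    "content",
    "paper_arxiv_id",
    "paper_section",
    "section_id",
    "paragraph_in_paper_id",
)


def missing_columns(record, columns=ARXIV_PARAGRAPH_COLUMNS):
    """Return the schema columns that are absent from a record, in schema order."""
    return [name for name in columns if name not in record]


# Incomplete record: the two positional IDs are left out on purpose.
record = {
    "paragraph_id": 1,
    "content": "example text",
    "paper_arxiv_id": "1706.03762v7",
    "paper_section": "Introduction",
}
print(missing_columns(record))
```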

#### Construct Table from API

In [None]:
config = {"arxiv_ids": ["1903.03894v4", "1806.08804v4"], "dest_dir": "./download"}
research_arcade.construct_table_from_api("arxiv_paragraphs", config)

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paragraphs_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paragraphs", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paragraphs_example.json"}
research_arcade.construct_table_from_json("arxiv_paragraphs", config)

#### Insert Paragraphs

In [None]:
# Insert paragraphs from the Introduction section
paragraphs = [
    {
        'paragraph_id': 1,
        'content': 'Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 1
    },
    {
        'paragraph_id': 2,
        'content': 'Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures. Recurrent models typically factor computation along the symbol positions of the input and output sequences.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 2
    },
    {
        'paragraph_id': 3,
        'content': 'Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 3
    },
    {
        'paragraph_id': 4,
        'content': 'Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 4
    },
    {
        'paragraph_id': 5,
        'content': 'In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.',
        'paper_arxiv_id': '1706.03762v7',
        'paper_section': 'Introduction',
        'section_id': 1,
        'paragraph_in_paper_id': 5
    }
]

for paragraph in paragraphs:
    research_arcade.insert_node("arxiv_paragraphs", node_features=paragraph)
    print(f"Inserted paragraph {paragraph['paragraph_id']} from {paragraph['paper_section']}")

#### Get All Paragraphs

In [None]:
paragraphs_df = research_arcade.get_all_node_features("arxiv_paragraphs")
print(f"Total paragraphs: {len(paragraphs_df)}")
print("\nFirst 3 paragraphs:")
print(paragraphs_df[['paragraph_id', 'paper_section', 'content']].head(3))

## EdgeTableOperations

### openreview_arxiv

#### construct table from api

In [None]:
config = {"venue": "ICLR.cc/2017/conference"}
research_arcade.construct_table_from_api("openreview_arxiv", config)

#### construct table from csv

In [None]:
config = {"csv_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/csv_data/csv_openreview_arxiv_example.csv"}
research_arcade.construct_table_from_csv("openreview_arxiv", config)

#### construct table from json

In [None]:
config = {"json_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/json_data/json_openreview_arxiv_example.json"}
research_arcade.construct_table_from_json("openreview_arxiv", config)

#### insert edge

In [None]:
openreview_arxiv = {'venue': 'ICLR.cc/2025/Conference', 
                    'paper_openreview_id': 'zkNCWtw2fd', 
                    'arxiv_id': 'http://arxiv.org/abs/2408.10536v1', 
                    'title': 'Synergistic Approach for Simultaneous Optimization of Monolingual, Cross-lingual, and Multilingual Information Retrieval'
}
research_arcade.insert_edge("openreview_arxiv", openreview_arxiv)

#### delete specific edge by id

In [None]:
# Delete by OpenReview paper ID
openreview_id = {"paper_openreview_id": "zkNCWtw2fd"}
openreview_arxiv_df = research_arcade.delete_edge_by_id("openreview_arxiv", openreview_id)
print(openreview_arxiv_df.to_dict(orient="records")[0])

# Delete by arXiv ID
arxiv_id = {"arxiv_id": "http://arxiv.org/abs/2408.10536v1"}
openreview_arxiv_df = research_arcade.delete_edge_by_id("openreview_arxiv", arxiv_id)
print(openreview_arxiv_df.to_dict(orient="records")[0])

# Delete by both IDs
openreview_arxiv_id = {"paper_openreview_id": "zkNCWtw2fd", "arxiv_id": "http://arxiv.org/abs/2408.10536v1"}
openreview_arxiv_df = research_arcade.delete_edge_by_id("openreview_arxiv", openreview_arxiv_id)
print(openreview_arxiv_df.to_dict(orient="records")[0])

#### get all edges

In [None]:
openreview_arxiv_df = research_arcade.get_all_edge_features("openreview_arxiv")
print(len(openreview_arxiv_df))

#### get neighborhood by id

In [None]:
openreview_id = {"paper_openreview_id": "zkNCWtw2fd"}
openreview_arxiv_df = research_arcade.get_neighborhood("openreview_arxiv", openreview_id)
print(openreview_arxiv_df.to_dict(orient="records")[0])

arxiv_id = {"arxiv_id": "http://arxiv.org/abs/2408.10536v1"}
openreview_arxiv_df = research_arcade.get_neighborhood("openreview_arxiv", arxiv_id)
print(openreview_arxiv_df.to_dict(orient="records")[0])
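
The `arxiv_id` stored on this edge is a full abstract URL, while the `arxiv_papers` schema shows bare IDs such as `1802.08773v3`. The following is a minimal sketch, assuming the two forms need to be compared, of stripping the URL prefix. `bare_arxiv_id` is a hypothetical helper and is not part of ResearchArcade.

```python
from urllib.parse import urlparse


def bare_arxiv_id(value):
    """Return the bare arXiv ID from an abstract URL; pass bare IDs through unchanged."""
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.path.startswith("/abs/"):
        return parsed.path[len("/abs/"):]
    return value


print(bare_arxiv_id("http://arxiv.org/abs/2408.10536v1"))
print(bare_arxiv_id("1802.08773v3"))
```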

### openreview_papers_authors

#### construct table from api

In [None]:
config = {"venue": "ICLR.cc/2025/Conference"}
research_arcade.construct_table_from_api("openreview_papers_authors", config)

#### construct table from csv

In [None]:
config = {"csv_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/csv_data/csv_openreview_papers_authors_example.csv"}
research_arcade.construct_table_from_csv("openreview_papers_authors", config)

#### construct table from json

In [None]:
config = {"json_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/json_data/json_openreview_papers_authors_example.json"}
research_arcade.construct_table_from_json("openreview_papers_authors", config)

#### insert edge

In [None]:
paper_authors = [{'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'author_openreview_id': '~Elias_Stengel-Eskin1'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'author_openreview_id': '~Zaid_Khan1'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'author_openreview_id': '~Jaemin_Cho1'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'author_openreview_id': '~Mohit_Bansal2'}]
for item in paper_authors:
    research_arcade.insert_edge("openreview_papers_authors", item)

author_papers = [{'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': 'Xbl6t6zxZs', 'author_openreview_id': '~Elias_Stengel-Eskin1'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': 'fDcn3S8oAt', 'author_openreview_id': '~Elias_Stengel-Eskin1'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': 'j9wBgcxa7N', 'author_openreview_id': '~Elias_Stengel-Eskin1'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': 'zd0iX5xBhA', 'author_openreview_id': '~Elias_Stengel-Eskin1'}, 
                 {'venue': 'ICLR.cc/2024/Conference', 'paper_openreview_id': 'L4nOxziGf9', 'author_openreview_id': '~Elias_Stengel-Eskin1'}, 
                 {'venue': 'ICLR.cc/2024/Conference', 'paper_openreview_id': 'qL9gogRepu', 'author_openreview_id': '~Elias_Stengel-Eskin1'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'author_openreview_id': '~Elias_Stengel-Eskin1'}]
for item in author_papers:
    research_arcade.insert_edge("openreview_papers_authors", item)

paper_author = [{'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'author_openreview_id': '~Elias_Stengel-Eskin1'}]
for item in paper_author:
    research_arcade.insert_edge("openreview_papers_authors", item)

#### delete specific edge by id

In [None]:
# Delete by paper ID
paper_id = {"paper_openreview_id": "00SnKBGTsz"}
openreview_papers_authors = research_arcade.delete_edge_by_id("openreview_papers_authors", paper_id)
print(openreview_papers_authors.to_dict(orient="records"))

# Delete by author ID
author_id = {'author_openreview_id': '~Elias_Stengel-Eskin1'}
openreview_papers_authors = research_arcade.delete_edge_by_id("openreview_papers_authors", author_id)
print(openreview_papers_authors.to_dict(orient="records"))

# Delete by both paper ID and author ID
paper_author = {"paper_openreview_id": "00SnKBGTsz", 'author_openreview_id': '~Elias_Stengel-Eskin1'}
openreview_papers_authors = research_arcade.delete_edge_by_id("openreview_papers_authors", paper_author)
print(openreview_papers_authors.to_dict(orient="records"))

#### get all edges

In [None]:
openreview_papers_authors = research_arcade.get_all_edge_features("openreview_papers_authors")
print(len(openreview_papers_authors))

#### get neighborhood by id

In [None]:
paper_id = {"paper_openreview_id": "00SnKBGTsz"}
openreview_papers_authors = research_arcade.get_neighborhood("openreview_papers_authors", paper_id)
print(openreview_papers_authors.to_dict(orient="records"))

author_id = {'author_openreview_id': '~Elias_Stengel-Eskin1'}
openreview_papers_authors = research_arcade.get_neighborhood("openreview_papers_authors", author_id)
print(openreview_papers_authors.to_dict(orient="records"))
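
The calls above convert their results with `to_dict(orient="records")`, which gives a list of edge dictionaries. The following is a minimal sketch, assuming records shaped like the edges inserted earlier in this section, of grouping paper IDs by author. `papers_by_author` is a hypothetical helper and is not part of ResearchArcade.

```python
from collections import defaultdict


def papers_by_author(edge_records):
    """Map each author_openreview_id to the sorted list of its paper_openreview_id values."""
    grouped = defaultdict(set)
    for edge in edge_records:
        grouped[edge["author_openreview_id"]].add(edge["paper_openreview_id"])
    return {author: sorted(papers) for author, papers in grouped.items()}


# A few edges copied from the insert example in this section.
edges = [
    {"venue": "ICLR.cc/2025/Conference", "paper_openreview_id": "00SnKBGTsz", "author_openreview_id": "~Elias_Stengel-Eskin1"},
    {"venue": "ICLR.cc/2025/Conference", "paper_openreview_id": "00SnKBGTsz", "author_openreview_id": "~Zaid_Khan1"},
    {"venue": "ICLR.cc/2025/Conference", "paper_openreview_id": "Xbl6t6zxZs", "author_openreview_id": "~Elias_Stengel-Eskin1"},
]
print(papers_by_author(edges))
```

Collecting IDs in a set first means an edge that appears twice in the input is counted once.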

### openreview_papers_reviews

#### construct table from api

In [None]:
config = {"venue": "ICLR.cc/2017/conference"}
research_arcade.construct_table_from_api("openreview_papers_reviews", config)

#### construct table from csv

In [None]:
config = {"csv_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/csv_data/csv_openreview_papers_reviews_example.csv"}
research_arcade.construct_table_from_csv("openreview_papers_reviews", config)

#### construct table from json

In [None]:
config = {"json_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/json_data/json_openreview_papers_reviews_example.json"}
research_arcade.construct_table_from_json("openreview_papers_reviews", config)
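
For readers who want to prepare their own input files, here is a sketch that writes edge records to CSV and JSON with the standard library and reads them back. It assumes, without confirmation from the library, that the files simply mirror the edge dictionary keys used in this tutorial; the file names are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path

# Two paper-review edges taken from this tutorial.
edges = [
    {"venue": "ICLR.cc/2025/Conference", "paper_openreview_id": "00SnKBGTsz",
     "review_openreview_id": "7XT4kLWV2f", "title": "Official Review by Reviewer_wuGW",
     "time": "2024-11-01 14:52:22"},
    {"venue": "ICLR.cc/2025/Conference", "paper_openreview_id": "00SnKBGTsz",
     "review_openreview_id": "i3QgWgrJff", "title": "Official Review by Reviewer_rVo8",
     "time": "2024-11-04 02:37:10"},
]

# Hypothetical output locations in a temporary directory.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "papers_reviews.csv"
json_path = out_dir / "papers_reviews.json"

# CSV: one header row with the dict keys, then one row per edge.
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(edges[0].keys()))
    writer.writeheader()
    writer.writerows(edges)

# JSON: a list of records (assumed layout).
with json_path.open("w", encoding="utf-8") as f:
    json.dump(edges, f, indent=2)

# Read both files back to confirm the round trip preserves the records.
with csv_path.open(newline="", encoding="utf-8") as f:
    csv_rows = list(csv.DictReader(f))
json_rows = json.loads(json_path.read_text(encoding="utf-8"))
print(csv_rows == edges, json_rows == edges)
```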

#### insert edge

In [None]:
paper_review = {'venue': 'ICLR.cc/2025/Conference', 
                'paper_openreview_id': '00SnKBGTsz', 
                'review_openreview_id': '13mj0Rtn5W', 
                'title': 'Response by Authors', 
                'time': '2024-11-27 17:27:45'}
research_arcade.insert_edge("openreview_papers_reviews", paper_review)

paper_reviews = [{'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': '7XT4kLWV2f', 'title': 'Official Review by Reviewer_wuGW', 'time': '2024-11-01 14:52:22'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'i3QgWgrJff', 'title': 'Official Review by Reviewer_rVo8', 'time': '2024-11-04 02:37:10'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'GMsjHLXdOx', 'title': 'Official Review by Reviewer_c5nB', 'time': '2024-11-04 09:59:14'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'r8ZflFk3T7', 'title': 'Official Review by Reviewer_VQ9Y', 'time': '2024-11-06 00:15:47'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': '4CnQpVCYkF', 'title': 'Response by Authors', 'time': '2024-11-20 22:48:42'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'h1qvpjhRP3', 'title': 'Response by Authors', 'time': '2024-11-20 22:51:07'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'pOR42YNLtU', 'title': 'Response by Authors', 'time': '2024-11-20 22:55:04'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'Aq2tBtB0lt', 'title': 'Response by Authors', 'time': '2024-11-20 22:57:18'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'm1iUqPHpwk', 'title': 'Response by Authors', 'time': '2024-11-20 22:58:29'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': '66buacQmRe', 'title': 'Response by Authors', 'time': '2024-11-20 23:02:21'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'Bgr7Ol90m7', 'title': 'Response by Authors', 'time': '2024-11-22 23:11:06'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'H2h2K6a8x5', 'title': 'Response by Reviewer', 'time': '2024-11-23 10:04:58'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'la5jPwJU4g', 'title': 'Response by Authors', 'time': '2024-11-24 19:17:22'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'DjVKsUoFN2', 'title': 'Response by Reviewer', 'time': '2024-11-25 04:00:18'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'C3MhCuKhTf', 'title': 'Response by Authors', 'time': '2024-11-25 19:44:38'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'ZqwAYtcmhv', 'title': 'Response by Authors', 'time': '2024-11-25 19:45:43'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': '9OQJoesINr', 'title': 'Response by Reviewer', 'time': '2024-11-25 20:07:51'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'wqTNtVDwef', 'title': 'Response by Authors', 'time': '2024-11-26 03:32:30'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'NEsxOTkkIV', 'title': 'Response by Reviewer', 'time': '2024-11-26 20:00:00'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': '13mj0Rtn5W', 'title': 'Response by Authors', 'time': '2024-11-27 17:27:45'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'hWat8aFBRw', 'title': 'Response by Reviewer', 'time': '2024-11-27 11:34:03'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'wnsiUkDh00', 'title': 'Response by Authors', 'time': '2024-11-27 17:28:35'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'zpboemkkjR', 'title': 'Meta Review of Submission11063 by Area_Chair_eoLd', 'time': '2024-12-20 15:14:25'}, 
                 {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'review_openreview_id': 'kokKFEn2fw', 'title': 'Paper Decision', 'time': '2025-01-22 05:35:00'}
]
for item in tqdm(paper_reviews):
    research_arcade.insert_edge("openreview_papers_reviews", item)
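
Before inserting a long list like the one above, it can help to check the records first. The sketch below is a hypothetical pre-check (`check_review_edges` is not a ResearchArcade function): it verifies that each record has the five keys used in this section, that `time` parses as `YYYY-MM-DD HH:MM:SS`, and that no (paper, review) pair repeats.

```python
from datetime import datetime

# Keys used by the paper-review edge records in this tutorial.
REQUIRED_KEYS = {"venue", "paper_openreview_id", "review_openreview_id", "title", "time"}


def check_review_edges(records):
    """Return (valid_records, problems); problems are (index, message) pairs."""
    seen = set()
    valid, problems = [], []
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - set(rec)
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
            continue
        try:
            datetime.strptime(rec["time"], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            problems.append((i, f"bad time: {rec['time']!r}"))
            continue
        pair = (rec["paper_openreview_id"], rec["review_openreview_id"])
        if pair in seen:
            problems.append((i, "duplicate (paper, review) pair"))
            continue
        seen.add(pair)
        valid.append(rec)
    return valid, problems


good = {"venue": "ICLR.cc/2025/Conference", "paper_openreview_id": "00SnKBGTsz",
        "review_openreview_id": "7XT4kLWV2f", "title": "Official Review by Reviewer_wuGW",
        "time": "2024-11-01 14:52:22"}
sample = [
    good,
    dict(good),                                            # exact duplicate
    {**good, "review_openreview_id": "i3QgWgrJff", "time": "2024-11-04"},  # time lacks clock part
    {k: v for k, v in good.items() if k != "title"},      # missing key
]
valid, problems = check_review_edges(sample)
print(len(valid), [i for i, _ in problems])
```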

#### delete specific edge by id

In [None]:
paper_review_id = {"paper_openreview_id": "00SnKBGTsz", "review_openreview_id": "13mj0Rtn5W"}
openreview_papers_reviews = research_arcade.delete_edge_by_id("openreview_papers_reviews", paper_review_id)
print(openreview_papers_reviews.to_dict(orient="records"))

review_id = {"review_openreview_id": "13mj0Rtn5W"}
openreview_papers_reviews = research_arcade.delete_edge_by_id("openreview_papers_reviews", review_id)
print(openreview_papers_reviews.to_dict(orient="records"))

paper_id = {"paper_openreview_id": "00SnKBGTsz"}
openreview_papers_reviews = research_arcade.delete_edge_by_id("openreview_papers_reviews", paper_id)
print(openreview_papers_reviews.to_dict(orient="records"))

#### get all edges

In [None]:
openreview_papers_reviews = research_arcade.get_all_edge_features("openreview_papers_reviews")
print(len(openreview_papers_reviews))

#### get neighborhood by id

In [None]:
paper_id = {"paper_openreview_id": "00SnKBGTsz"}
openreview_papers_reviews = research_arcade.get_neighborhood("openreview_papers_reviews", paper_id)
print(openreview_papers_reviews.to_dict(orient="records"))

review_id = {"review_openreview_id": "13mj0Rtn5W"}
openreview_papers_reviews = research_arcade.get_neighborhood("openreview_papers_reviews", review_id)
print(openreview_papers_reviews.to_dict(orient="records"))

### openreview_papers_revisions

#### construct table from api

In [None]:
config = {"venue": "ICLR.cc/2025/Conference"}
research_arcade.construct_table_from_api("openreview_papers_revisions", config)

#### construct table from csv

In [None]:
config = {"csv_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/csv_data/csv_openreview_papers_revisions_example.csv"}
research_arcade.construct_table_from_csv("openreview_papers_revisions", config)

#### construct table from json

In [None]:
config = {"json_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/json_data/json_openreview_papers_revisions_example.json"}
research_arcade.construct_table_from_json("openreview_papers_revisions", config)

#### insert edge

In [None]:
paper_revision = {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'revision_openreview_id': 'dzL3IRBnE4', 'title': 'Camera_Ready_Revision', 'time': '2025-03-01 03:36:55'}
research_arcade.insert_edge("openreview_papers_revisions", paper_revision)

paper_revisions = [{'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'revision_openreview_id': 'oT4N28siLO', 'title': 'Camera_Ready_Revision', 'time': '2025-03-02 01:35:16'}, 
                   {'venue': 'ICLR.cc/2025/Conference', 'paper_openreview_id': '00SnKBGTsz', 'revision_openreview_id': 'dzL3IRBnE4', 'title': 'Camera_Ready_Revision', 'time': '2025-03-01 03:36:55'}]
for item in tqdm(paper_revisions):
    research_arcade.insert_edge("openreview_papers_revisions", item)

#### delete specific edge by id

In [None]:
paper_revision_id = {"paper_openreview_id": "00SnKBGTsz", "revision_openreview_id": "dzL3IRBnE4"}
paper_revision = research_arcade.delete_edge_by_id("openreview_papers_revisions", paper_revision_id)
print(paper_revision.to_dict(orient="records"))

revision_id = {"revision_openreview_id": "dzL3IRBnE4"}
paper_revision = research_arcade.delete_edge_by_id("openreview_papers_revisions", revision_id)
print(paper_revision.to_dict(orient="records"))

paper_id = {"paper_openreview_id": "00SnKBGTsz"}
paper_revision = research_arcade.delete_edge_by_id("openreview_papers_revisions", paper_id)
print(paper_revision.to_dict(orient="records"))

#### get all edges

In [None]:
openreview_papers_revisions = research_arcade.get_all_edge_features("openreview_papers_revisions")
print(len(openreview_papers_revisions))

#### get neighborhood by id

In [None]:
paper_id = {"paper_openreview_id": "00SnKBGTsz"}
paper_revision = research_arcade.get_neighborhood("openreview_papers_revisions", paper_id)
print(paper_revision.to_dict(orient="records"))

revision_id = {"revision_openreview_id": "dzL3IRBnE4"}
paper_revision = research_arcade.get_neighborhood("openreview_papers_revisions", revision_id)
print(paper_revision.to_dict(orient="records"))

### openreview_revisions_reviews

#### construct table based on existing tables

In [None]:
papers_reviews_df = research_arcade.get_all_edge_features("openreview_papers_reviews")
print(len(papers_reviews_df))
papers_revisions_df = research_arcade.get_all_edge_features("openreview_papers_revisions")
print(len(papers_revisions_df))
config = {"papers_reviews_df": papers_reviews_df, "papers_revisions_df": papers_revisions_df}
research_arcade.construct_table_from_api("openreview_revisions_reviews", config)
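
The tutorial does not say how revisions and reviews are paired when this table is built from the two existing edge tables. The sketch below shows one plausible rule, purely as an assumption: attach each review to the latest revision of the same paper whose timestamp is not later than the review. The library may use a different rule. `link_reviews_to_revisions` and all IDs in the toy data are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def link_reviews_to_revisions(papers_reviews, papers_revisions):
    """Assumed rule: link each review to the latest earlier-or-equal revision of its paper."""
    revisions_by_paper = defaultdict(list)
    for rev in papers_revisions:
        revisions_by_paper[rev["paper_openreview_id"]].append(
            (datetime.strptime(rev["time"], FMT), rev["revision_openreview_id"])
        )
    for revs in revisions_by_paper.values():
        revs.sort()  # chronological order per paper

    edges = []
    for review in papers_reviews:
        revs = revisions_by_paper.get(review["paper_openreview_id"], [])
        times = [t for t, _ in revs]
        idx = bisect_right(times, datetime.strptime(review["time"], FMT))
        if idx:  # idx == 0 means no revision precedes this review
            edges.append({
                "venue": review["venue"],
                "revision_openreview_id": revs[idx - 1][1],
                "review_openreview_id": review["review_openreview_id"],
            })
    return edges


# Toy data with made-up IDs.
revisions = [
    {"paper_openreview_id": "paperA", "revision_openreview_id": "rev1", "time": "2024-11-01 00:00:00"},
    {"paper_openreview_id": "paperA", "revision_openreview_id": "rev2", "time": "2024-11-20 00:00:00"},
]
reviews = [
    {"venue": "V", "paper_openreview_id": "paperA", "review_openreview_id": "r_early", "time": "2024-10-01 00:00:00"},
    {"venue": "V", "paper_openreview_id": "paperA", "review_openreview_id": "r_mid", "time": "2024-11-05 12:00:00"},
    {"venue": "V", "paper_openreview_id": "paperA", "review_openreview_id": "r_late", "time": "2024-11-25 12:00:00"},
]
linked = link_reviews_to_revisions(reviews, revisions)
print([(e["revision_openreview_id"], e["review_openreview_id"]) for e in linked])
```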

#### construct table from csv

In [None]:
config = {"csv_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/csv_data/csv_openreview_revisions_reviews_example.csv"}
research_arcade.construct_table_from_csv("openreview_revisions_reviews", config)

#### construct table from json

In [None]:
config = {"json_file": "/home/jingjunx/openreview_benchmark/Code/paper-crawler/examples/json_data/json_openreview_revisions_reviews_example.json"}
research_arcade.construct_table_from_json("openreview_revisions_reviews", config)

#### insert edge

In [None]:
revision_review = {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': 'wumckDPIQ3'}
research_arcade.insert_edge("openreview_revisions_reviews", revision_review)

revision_reviews = [{'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': 'wumckDPIQ3'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': '138cOdBpgA'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': 'yKh1fQYnUZ'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': 'Pvt0OjNSp2'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': 'MUhlEYyBD9'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': '2mqiS3J8wC'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': 'Er8QTorcyr'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': 'AvtD9uxRtX'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': '2tgxTGynNm'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': '5MKJE3sFsd'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': 'wViZ0H4ErF'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': '0c1It75dTb'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': 'PFwia9lcjP'}, 
                    {'venue': 'ICLR.cc/2025/Conference', 'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': 'ygCqaGNPee'}]
for item in tqdm(revision_reviews):
    research_arcade.insert_edge("openreview_revisions_reviews", item)

#### delete edge by id

In [None]:
revision_review_id = {'revision_openreview_id': 'cX02yuzwWI', 'review_openreview_id': 'wumckDPIQ3'}
revision_review = research_arcade.delete_edge_by_id("openreview_revisions_reviews", revision_review_id)
print(revision_review.to_dict(orient="records"))

review_id = {'review_openreview_id': 'wumckDPIQ3'}
revision_review = research_arcade.delete_edge_by_id("openreview_revisions_reviews", review_id)
print(revision_review.to_dict(orient="records"))

revision_id = {'revision_openreview_id': 'cX02yuzwWI'}
revision_review = research_arcade.delete_edge_by_id("openreview_revisions_reviews", revision_id)
print(revision_review.to_dict(orient="records"))

#### get all edges

In [None]:
openreview_revisions_reviews = research_arcade.get_all_edge_features("openreview_revisions_reviews")
print(len(openreview_revisions_reviews))

#### get neighborhood by id

In [None]:
revision_id = {'revision_openreview_id': 'cX02yuzwWI'}
revision_review = research_arcade.get_neighborhood("openreview_revisions_reviews", revision_id)
print(revision_review.to_dict(orient="records"))

review_id = {'review_openreview_id': 'wumckDPIQ3'}
revision_review = research_arcade.get_neighborhood("openreview_revisions_reviews", review_id)
print(revision_review.to_dict(orient="records"))

### arxiv_citations

#### Insert Citation

In [None]:
citation = {
    'citing_arxiv_id': '1810.04805v2',
    'cited_arxiv_id': '1706.03762v7',
    'bib_title': 'attention is all you need',
    'bib_key': 'something',
    'citing_sections': 'citing_sections',
}
research_arcade.insert_edge("arxiv_citation", edge_features=citation)
print("Citation created!")

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paper_citation_example.csv"}
research_arcade.construct_table_from_csv("arxiv_citation", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paper_citation_example.json"}
research_arcade.construct_table_from_json("arxiv_citation", config)

Error: JSON file ./examples/json_data/json_arxiv_paper_citation_example.json does not exist.


#### Get All Citations

In [None]:
all_citations = research_arcade.get_all_edge_features("arxiv_citation")
print(f"Total citations: {len(all_citations)}")
print(all_citations.head())

#### Get Cited Papers

In [None]:
citing_paper = {'citing_arxiv_id': '1810.04805v2'}
cited_papers = research_arcade.get_neighborhood("arxiv_citation", primary_key=citing_paper)
print("Papers cited:")
print(cited_papers)

#### Get Citing Papers

In [None]:
cited_paper = {'cited_arxiv_id': '1706.03762v7'}
citing_papers = research_arcade.get_neighborhood("arxiv_citation", primary_key=cited_paper)
print("Papers that cite:")
print(citing_papers)

#### Delete Citation

In [None]:
citation_id = {
    'citing_arxiv_id': '1810.04805v2',
    'cited_arxiv_id': '1706.03762v7'
}
research_arcade.delete_edge_by_id("arxiv_citation", primary_key=citation_id)
print("Citation deleted!")
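
Citation edges are directed, so the same edge answers two different questions depending on which end you start from. A small stdlib sketch of that idea, independent of ResearchArcade: `build_citation_index` is a hypothetical helper that builds both lookup directions from a list of edge dicts using the `citing_arxiv_id` and `cited_arxiv_id` fields from the insert cell. The second record uses a placeholder ID.

```python
from collections import defaultdict


def build_citation_index(citations):
    """Return (cites, cited_by): two dicts mapping an arXiv ID to a set of arXiv IDs."""
    cites = defaultdict(set)     # citing paper -> papers it cites
    cited_by = defaultdict(set)  # cited paper  -> papers that cite it
    for c in citations:
        cites[c["citing_arxiv_id"]].add(c["cited_arxiv_id"])
        cited_by[c["cited_arxiv_id"]].add(c["citing_arxiv_id"])
    return cites, cited_by


citations = [
    {"citing_arxiv_id": "1810.04805v2", "cited_arxiv_id": "1706.03762v7"},
    # Placeholder ID, for illustration only.
    {"citing_arxiv_id": "0000.00000v1", "cited_arxiv_id": "1706.03762v7"},
]
cites, cited_by = build_citation_index(citations)
print(sorted(cites["1810.04805v2"]))
print(sorted(cited_by["1706.03762v7"]))
```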

### arxiv_papers_authors

#### Insert Paper-Author Relationships

In [None]:
paper_authors = [
    {'paper_arxiv_id': '1706.03762v7', 'author_id': 'ss_ashish_vaswani', 'author_sequence': 1},
    {'paper_arxiv_id': '1706.03762v7', 'author_id': 'ss_noam_shazeer', 'author_sequence': 2},
    {'paper_arxiv_id': '1706.03762v7', 'author_id': 'ss_niki_parmar', 'author_sequence': 3}
]
for relation in paper_authors:
    research_arcade.insert_edge("arxiv_paper_author", edge_features=relation)
    print(f"Linked author {relation['author_id']} (position {relation['author_sequence']})")

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paper_author_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paper_author", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paper_author_example.json"}
research_arcade.construct_table_from_json("arxiv_paper_author", config)

#### Get All Paper-Author Relationships

In [None]:
all_relations = research_arcade.get_all_edge_features("arxiv_paper_author")
print(f"Total relationships: {len(all_relations)}")
print(all_relations.head(10))

#### Get Authors for a Paper

In [None]:
paper_id = {'paper_arxiv_id': '1706.03762v7'}
authors = research_arcade.get_neighborhood("arxiv_paper_author", primary_key=paper_id)
print("Authors:")
print(authors.sort_values('author_sequence'))

#### Get Papers by Author

In [None]:
author_id = {'author_id': 'ss_ashish_vaswani'}
papers = research_arcade.get_neighborhood("arxiv_paper_author", primary_key=author_id)
print("Papers by author:")
print(papers)

#### Delete Paper-Author Link

In [None]:
relation_id = {'paper_arxiv_id': '1706.03762v7', 'author_id': 'ss_ashish_vaswani'}
research_arcade.delete_edge_by_id("arxiv_paper_author", primary_key=relation_id)
print("Relationship deleted!")

### arxiv_papers_categories

#### Insert Paper-Category Relationships

In [None]:
paper_categories = [
    {'paper_arxiv_id': '1706.03762v7', 'category_id': '1'},
    {'paper_arxiv_id': '1706.03762v7', 'category_id': '1'},
    {'paper_arxiv_id': '1706.03762v7', 'category_id': '2'}
]
for relation in paper_categories:
    research_arcade.insert_edge("arxiv_paper_category", edge_features=relation)
    print(f"Linked {relation['category_id']}")

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paper_category_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paper_category", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paper_category_example.json"}
research_arcade.construct_table_from_json("arxiv_paper_category", config)

#### Get All Paper-Category Relationships

In [None]:
all_relations = research_arcade.get_all_edge_features("arxiv_paper_category")
print(f"Total relationships: {len(all_relations)}")
print(all_relations.head())

#### Get Categories for Paper

In [None]:
paper_id = {'paper_arxiv_id': '1706.03762v7'}
categories = research_arcade.get_neighborhood("arxiv_paper_category", primary_key=paper_id)
print("Categories:")
print(categories)

#### Get Papers in Category

In [None]:
category_id = {'category_id': 'cs.LG'}
papers = research_arcade.get_neighborhood("arxiv_paper_category", primary_key=category_id)
print("Papers in category:")
print(papers)

#### Delete Paper-Category Link

In [None]:
relation_id = {'paper_arxiv_id': '1706.03762v7', 'category_id': 'cs.AI'}
research_arcade.delete_edge_by_id("arxiv_paper_category", primary_key=relation_id)
print("Relationship deleted!")

### arxiv_papers_figures

#### Insert Paper-Figure Relationships

In [None]:
paper_figures = [
    {'paper_arxiv_id': '1706.03762v7', 'figure_id': 1},
    {'paper_arxiv_id': '1706.03762v7', 'figure_id': 2}
]
for relation in paper_figures:
    research_arcade.insert_edge("arxiv_paper_figure", edge_features=relation)
    print(f"Linked figure {relation['figure_id']}")

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paper_figure_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paper_figure", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paper_figure_example.json"}
research_arcade.construct_table_from_json("arxiv_paper_figure", config)

#### Get Figures for Paper

In [None]:
paper_id = {'paper_arxiv_id': '1706.03762v7'}
figures = research_arcade.get_neighborhood("arxiv_paper_figure", primary_key=paper_id)
print("Figures:")
print(figures)

### arxiv_papers_tables

#### Insert Paper-Table Relationships

In [None]:
paper_tables = [
    {'paper_arxiv_id': '1706.03762v7', 'table_id': 1},
    {'paper_arxiv_id': '1706.03762v7', 'table_id': 2}
]
for relation in paper_tables:
    research_arcade.insert_edge("arxiv_paper_table", edge_features=relation)
    print(f"Linked table {relation['table_id']}")

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paper_table_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paper_table", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paper_table_example.json"}
research_arcade.construct_table_from_json("arxiv_paper_table", config)

#### Get Tables for Paper

In [None]:
paper_id = {'paper_arxiv_id': '1706.03762v7'}
tables = research_arcade.get_neighborhood("arxiv_paper_table", primary_key=paper_id)
print("Tables:")
print(tables)


### arxiv_paragraphs_references

#### Insert Paragraph-Reference Relationships

In [None]:
paragraph_references = [
    {'paragraph_id': 1, 'paper_section': 'established approaches', 'paper_arxiv_id': '1706.03762v7', 'reference_label': "{something}", 'reference_type': 'figure'}
]

for relation in paragraph_references:
    research_arcade.insert_edge("arxiv_paragraph_reference", edge_features=relation)

#### Construct Table from CSV

In [None]:
config = {"csv_file": "./examples/csv_data/csv_arxiv_paragraph_reference_example.csv"}
research_arcade.construct_table_from_csv("arxiv_paragraph_reference", config)

#### Construct Table from JSON

In [None]:
config = {"json_file": "./examples/json_data/json_arxiv_paragraph_reference_example.json"}
research_arcade.construct_table_from_json("arxiv_paragraph_reference", config)

#### Get References in Paragraph

In [None]:
paragraph_id = {'paragraph_id': 1}
references = research_arcade.get_neighborhood("arxiv_paragraph_reference", primary_key=paragraph_id)
print("References:")
print(references)

### arxiv_paragraphs_citations

#### Insert Paragraph-Citation Relationships

In [None]:
# Link specific paragraphs to cited papers
paragraph_citations = [
    {
        'paragraph_id': 1,
        'paper_section': 'Introduction',
        'citing_arxiv_id': '1810.04805v2',
        'cited_arxiv_id': '1706.03762v7',
        'bib_key': 'vaswani2017attention'
    },
    {
        'paragraph_id': 4,
        'paper_section': 'Related Work',
        'citing_arxiv_id': '1810.04805v2',
        'cited_arxiv_id': '1706.03762v7',
        'bib_key': 'vaswani2017attention'
    }
]

for relation in paragraph_citations:
    research_arcade.insert_edge("arxiv_paragraph_citation", edge_features=relation)
    print(f"Paragraph {relation['paragraph_id']} cites {relation['cited_arxiv_id']}")

## BatchProcessing

### batch_openreview_conference

In [None]:
config = {"venue": "ICLR.cc/2025/Conference"}
research_arcade.construct_tables_from_venue(config)

### batch_arxiv_papers

In [None]:
# Example papers
arxiv_ids = ['1802.08773', '1806.02473', '2412.17767', '2507.10539', '2511.22036']

config = {
    'arxiv_ids': arxiv_ids,
    'dest_dir': os.getenv('PAPER_FOLDER_PATH')
}

# Pass the config dict (IDs plus download directory) rather than the bare ID list
research_arcade.construct_tables_from_arxiv_ids(config)
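
The IDs in this batch carry no version suffix, while earlier cells use versioned IDs such as `1706.03762v7`. When mixing the two, a small normalizer can help. The sketch below is a hypothetical helper (`split_arxiv_id` is not part of ResearchArcade) that handles only new-style IDs of the form `YYMM.NNNNN` with an optional `vN`; old-style IDs with an archive prefix are deliberately rejected.

```python
import re

# New-style arXiv ID: four digits, a dot, four or five digits, optional version.
ARXIV_ID_RE = re.compile(r"^(?P<base>\d{4}\.\d{4,5})(?:v(?P<version>\d+))?$")


def split_arxiv_id(arxiv_id):
    """Return (base_id, version) where version is an int or None."""
    m = ARXIV_ID_RE.match(arxiv_id.strip())
    if m is None:
        raise ValueError(f"not a new-style arXiv ID: {arxiv_id!r}")
    version = m.group("version")
    return m.group("base"), (int(version) if version else None)


raw_ids = ["1802.08773", "1706.03762v7", "1802.08773v2"]
# Deduplicate on the base ID while keeping first-seen order.
base_ids = list(dict.fromkeys(split_arxiv_id(i)[0] for i in raw_ids))
print(split_arxiv_id("1706.03762v7"))
print(base_ids)
```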


## ContinuousCrawling

### arxiv_continuous_crawling

In [6]:
research_arcade.continuous_crawling(interval_days=2, delay_days=2, paper_category='All', dest_dir="./download", arxiv_id_dest="./data")

Starting continuous crawl mode
  Interval: 2 days
  Delay: 2 days
  Field: All

[2025-12-23 11:48:57.377345] Starting crawl: 2025-12-18 to 2025-12-21
./download/arxiv_metadata_2025-12-18_2025-12-21.jsonl


Fetching 2025-12-21: 100%|██████████| 4/4 [01:43<00:00, 25.87s/it]


./download/arxiv_metadata_2025-12-18_2025-12-21.jsonl
Papers from 2025-12-18 to 2025-12-21: 2470 found
Saved 2470 processed arXiv IDs to data/processed_ids_2025-12-18_to_2025-12-21_All.txt
[2025-12-23 11:50:40.930826] Crawl completed successfully
./download/2512.16308/2512.16308.tar.gz
paper with id 2512.16308 downloaded
./download/2512.16304/2512.16304.tar.gz
paper with id 2512.16304 downloaded
./download/2512.16298/2512.16298.tar.gz
paper with id 2512.16298 downloaded
./download/2512.16283/2512.16283.tar.gz
paper with id 2512.16283 downloaded
./download/2512.16282/2512.16282.tar.gz
paper with id 2512.16282 downloaded
./download/2512.16271/2512.16271.tar.gz
paper with id 2512.16271 downloaded
./download/2512.16266/2512.16266.tar.gz
paper with id 2512.16266 downloaded
./download/2512.16258/2512.16258.tar.gz
paper with id 2512.16258 downloaded
./download/2512.16245/2512.16245.tar.gz
paper with id 2512.16245 downloaded


KeyboardInterrupt: 
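
The run above was stopped by hand, which is why the output ends with `KeyboardInterrupt`. If you prefer a clean message over a traceback, a long-running call can be wrapped as in the sketch below. `run_until_interrupted` and `fake_crawl` are hypothetical names used only for illustration; `fake_crawl` stands in for the real crawler and simulates the interruption.

```python
def run_until_interrupted(task, *args, **kwargs):
    """Run task(*args, **kwargs); report whether it finished or was interrupted."""
    try:
        task(*args, **kwargs)
    except KeyboardInterrupt:
        # Ctrl+C (or the notebook's stop button) lands here.
        return "interrupted"
    return "finished"


def fake_crawl(interrupt):
    """Stand-in for a long-running crawl; raises KeyboardInterrupt on request."""
    if interrupt:
        raise KeyboardInterrupt


status_stopped = run_until_interrupted(fake_crawl, interrupt=True)
status_done = run_until_interrupted(fake_crawl, interrupt=False)
print(status_stopped, status_done)
```

In a notebook, the same wrapper could be given `research_arcade.continuous_crawling` and its keyword arguments as `task` and `**kwargs`.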