Feature/ar5iv_flow #79

Open
wants to merge 5 commits into
base: main

Conversation

delta-river
Collaborator

@delta-river delta-river commented Nov 28, 2023

I added two scripts that start annotation with a single command, given an arXiv ID:

  1. tools/fetch_html.py:
    • Fetch LaTeXML-processed HTML from ar5iv.
    • It also cleans the HTML by removing unnecessary elements, e.g., script.
  2. tools/fetch_and_run.sh:
    • Sequentially execute tools/fetch_html.py, tools/preprocess.py, and server/__main__.py.
    • All the internal commands use the default options (including the directories, e.g., ./arxmliv, ./templates, and ./sources).
    • If an error occurs in one of the steps, the script immediately exits with the same exit code as the internal script.
    • Note that if either ./arxmliv/${paper_id}.html or ./sources/${paper_id}.html already exists, this script (more precisely, the internal fetch_html and preprocess scripts) exits immediately.
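
To illustrate the cleaning step described above, here is a minimal sketch that drops script elements using only the standard library. It is not the code in tools/fetch_html.py; the names ScriptStripper and strip_scripts are hypothetical, and the real script may remove other elements or use a different parser.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit an HTML document while dropping every <script> element.

    Hypothetical helper for illustration; not taken from tools/fetch_html.py.
    """

    def __init__(self):
        # Keep character references as written so they pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing script is dropped.
        if tag != "script":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Script bodies arrive here as raw data and are skipped.
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_scripts(html: str) -> str:
    """Return the document with all <script> elements removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling strip_scripts on a fetched page would return the same markup minus its script elements; everything else is passed through as parsed.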

The script can be used as follows:

$ bash ./tools/fetch_and_run.sh ${arxiv_id}
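
For reference, the control flow described above (run the steps in order, stop at the first failure, and propagate its exit code) can be sketched as follows. This is written in Python rather than bash and is not the content of tools/fetch_and_run.sh; run_steps and outputs_exist are hypothetical names, and the commands passed to run_steps are left to the caller because the actual script arguments are not shown here.

```python
import subprocess
import sys
from pathlib import Path


def outputs_exist(paper_id, dirs=("arxmliv", "sources")):
    """Return True if <dir>/<paper_id>.html already exists in any of the directories.

    The directory names mirror the defaults mentioned above; the check itself
    is an assumption about how the internal scripts guard against overwriting.
    """
    return any((Path(d) / f"{paper_id}.html").exists() for d in dirs)


def run_steps(steps):
    """Run each command in order and return the first non-zero exit code.

    Returns 0 only if every step succeeded; later steps are skipped after a failure.
    """
    for cmd in steps:
        code = subprocess.run(cmd).returncode
        if code != 0:
            return code
    return 0
```

A caller would pass the three commands as argument lists and hand the result to sys.exit, so the wrapper ends with the same exit code as the step that failed.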

@delta-river
Collaborator Author

It turned out that this code doesn't work.
Currently we only fetch papers that have "ar5iv-severity-ok" and reject papers with "ar5iv-severity-fatal". However, there are also "ar5iv-severity-warning" and "ar5iv-severity-error". We need to deal with them because most of the papers in our dataset have either "ar5iv-severity-warning" or "ar5iv-severity-error".

@delta-river
Collaborator Author

This problem has been fixed.
The severity levels are now handled as follows:

  • ar5iv-severity-ok -> simply fetch the HTML
  • ar5iv-severity-warning, ar5iv-severity-error -> fetch the HTML but emit a warning
  • ar5iv-severity-fatal -> exit without fetching the HTML
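
The policy above can be sketched as a small lookup. This is an illustration, not the code in tools/fetch_html.py: the names Action, find_severity, and decide are hypothetical, the sketch assumes the severity marker appears as a literal string somewhere in the page, and raising an error when no marker is found is also an assumption.

```python
import re
from enum import Enum


class Action(Enum):
    FETCH = "fetch"
    FETCH_WITH_WARNING = "fetch_with_warning"
    REJECT = "reject"


# Mapping taken from the list above.
POLICY = {
    "ok": Action.FETCH,
    "warning": Action.FETCH_WITH_WARNING,
    "error": Action.FETCH_WITH_WARNING,
    "fatal": Action.REJECT,
}

# Assumption: the marker can be found by searching the raw HTML text.
_SEVERITY = re.compile(r"ar5iv-severity-(ok|warning|error|fatal)")


def find_severity(html):
    """Return 'ok', 'warning', 'error', or 'fatal', or None if no marker is present."""
    match = _SEVERITY.search(html)
    return match.group(1) if match else None


def decide(html):
    """Map the page's severity marker to an action."""
    severity = find_severity(html)
    if severity is None:
        # Assumption: a page without a marker is treated as an error.
        raise ValueError("no ar5iv severity marker found")
    return POLICY[severity]
```

Keeping the mapping in one table makes it easy to see that only the fatal level blocks the fetch.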

@delta-river
Collaborator Author

I ran tools/fetch_and_run.sh on the 40 papers in the dataset to check whether the flow works.
The script worked for most of the papers (37/40):

  • 37 papers: no apparent problem (though patch_source is still required)
  • 2 papers (2107.10832, 2002.08046): mostly fine, except that the font of the main text (not the formulas) looks odd (i.e., it differs from the versions in the dataset)
  • 1 paper (1807.00939): couldn't fetch the ar5iv HTML (for some reason, https://ar5iv.labs.arxiv.org/html/1807.00939 redirects to https://arxiv.org/abs/1807.00939)

the one (2002.08046) in the dataset
image

the one fetched from ar5iv
image
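
As a possible follow-up for the 1807.00939 case, the redirect to the arXiv abstract page could be detected from the final URL of the request. The sketch below only covers the URL check; is_abs_fallback is a hypothetical name, and nothing like it is claimed to exist in tools/fetch_html.py.

```python
from urllib.parse import urlparse


def is_abs_fallback(final_url: str) -> bool:
    """Return True if a request ended on an arxiv.org abstract page.

    final_url is the URL after redirects, e.g. the value returned by
    urllib.request.urlopen(url).geturl().
    """
    parsed = urlparse(final_url)
    return parsed.hostname == "arxiv.org" and parsed.path.startswith("/abs/")
```

With such a check, the fetch step could report that no ar5iv HTML is available for the paper instead of saving the abstract page.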

@delta-river delta-river changed the title Feature/ar5iv_flow WIP: Feature/ar5iv_flow Jan 30, 2024
@delta-river delta-river changed the title WIP: Feature/ar5iv_flow Feature/ar5iv_flow Jan 31, 2024