how to process arXiv tex files without downloading? #51

Closed
irene622 opened this issue Jun 1, 2023 · 1 comment

irene622 commented Jun 1, 2023

I downloaded the arXiv tex files myself, without running scripts/arxiv-kickoff-download.sh.

My data structure is

my_arxiv_src
 |- papername1
      |- name.tex
 |- papername2
      |- name.tex
      |- other_name.tex

I want to preprocess my LaTeX data, so I ran
bash scripts/arxiv-kickoff-cleaning.sh
where arxiv-kickoff-cleaning.sh is the following:

#!/bin/bash

set -e

WORKERS=2

# load modules
module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 conda/pytorch_1.12.0
pip install -r arxiv_requirements.txt

export DATA_DIR="./my_arxiv_src"
export TARGET_DIR="./data/arxiv/processed"
export WORK_DIR="./work"

mkdir -p logs/arxiv/cleaning

# setup partitions
python run_clean.py --data_dir "$DATA_DIR" --target_dir "$TARGET_DIR" --workers $WORKERS --setup

# run cleaning in job array
sbatch scripts/arxiv-clean-slurm.sbatch

arxiv-kickoff-cleaning.sh runs with no errors, but the resulting files, arxiv_1.jsonl and arxiv_2.jsonl, have no content...

What should DATA_DIR and TARGET_DIR point to?
Is there a way to run this directly on LaTeX files?

mauriceweber (Collaborator) commented Jun 9, 2023

Hi @irene622 , thanks for your question!

The run_clean.py script expects your data to be organized the same way as when it is downloaded from the arXiv S3 bucket. The most straightforward approach is therefore to mirror that structure: package your my_arxiv_src into a tar file and store it in data/src. So you need something like this:

data/src
 |- my_arxiv_src.tar
     |- papername1.gz
         |- name.tex
     |- papername2.gz
         |- name.tex
         |- other_name.tex
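For reference, here is a minimal sketch of how such an archive could be assembled with Python's standard tarfile and gzip modules. The yymm-style directory name ("2306") and the inner format (each .gz being a gzipped tar of one project's .tex files) are assumptions modeled on the arXiv S3 layout, and package_projects is a hypothetical helper, not part of the repo:

```python
import gzip
import pathlib
import tarfile


def package_projects(src_dir: pathlib.Path, out_tar: pathlib.Path,
                     yymm: str = "2306") -> None:
    """Bundle per-paper project directories into one arXiv-style tar.

    Layout inside the tar mirrors the arXiv S3 dumps: <yymm>/<paper_id>.gz,
    where each .gz member is itself a (gzipped) tar of that project's
    .tex files.
    """
    staging = out_tar.parent / yymm
    staging.mkdir(parents=True, exist_ok=True)

    for proj in sorted(p for p in src_dir.iterdir() if p.is_dir()):
        gz_path = staging / f"{proj.name}.gz"
        # inner archive: tar the project's .tex files, gzip-compressed
        with gzip.open(gz_path, "wb") as gz:
            with tarfile.open(fileobj=gz, mode="w") as inner:
                for tex in sorted(proj.glob("*.tex")):
                    inner.add(tex, arcname=tex.name)

    # outer archive: one tar holding all <yymm>/<paper_id>.gz members
    with tarfile.open(out_tar, "w") as outer:
        for gz_path in sorted(staging.glob("*.gz")):
            outer.add(gz_path, arcname=f"{yymm}/{gz_path.name}")
```

After running this against the directory layout from the question, data/src/my_arxiv_src.tar contains 2306/papername1.gz and 2306/papername2.gz, which is the shape the cleaning script globs for.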

You can then call python run_clean.py --data_dir data/src --target_dir /dir/to/desired/output. Also check out this part of the code in the arxiv_cleaner module:

if tar_fp_list is None:
    def _tar_fp_iterator():
        for _tar_fp in self._data_dir.glob("*.tar"):
            yield _tar_fp
else:
    def _tar_fp_iterator():
        for _tar_fp in tar_fp_list:
            yield _tar_fp

failed = 0
processed = 0

for tar_fp in _tar_fp_iterator():
    print(f"[{get_timestamp()}][INFO] start processing {tar_fp}")

    with tempfile.TemporaryDirectory(dir=self._work_dir) as tmpdir:
        with tarfile.open(tar_fp) as tf:
            tf.extractall(members=tf.getmembers(), path=tmpdir)

        for proj_dir_or_file in pathlib.Path(tmpdir).rglob("*.gz"):
            # get arxiv id and month from the filename
            yymm = proj_dir_or_file.parent.stem
            arxiv_id = proj_dir_or_file.stem

            # load the tex source files (we also get the timestamp here)
            data = _tex_proj_loader(proj_dir_or_file)

            if data is None:
                failed += 1
                continue

            tex_files, timestamp = data
            processed += 1

            if processed > max_files > 0:
                break

            yield tex_files, yymm, arxiv_id, timestamp
        else:
            continue
        break

print(f"[{get_timestamp()}][INFO] # Failed loading : {failed}")
print(f"[{get_timestamp()}][INFO] done.")
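Note that the snippet derives yymm and arxiv_id purely from each .gz member's path (parent directory name and file stem), so the .gz files must sit inside a subdirectory of the tar, not at its root. A small sketch that previews which (yymm, arxiv_id) pairs the cleaner would see for a given tar (preview_ids is a hypothetical helper for checking your archive, not part of the repo):

```python
import pathlib
import tarfile


def preview_ids(tar_fp):
    """Yield (yymm, arxiv_id) pairs the same way the cleaner derives them:
    parent directory name and file stem of every .gz member in the tar."""
    with tarfile.open(tar_fp) as tf:
        for member in tf.getmembers():
            p = pathlib.Path(member.name)
            if member.isfile() and p.suffix == ".gz":
                yield p.parent.stem, p.stem
```

If this yields nothing, or the first element of each pair is empty or ".", the .gz members are laid out in a way the cleaner cannot parse, which would explain empty output files.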

Let me know if this helps!
