how to process arXiv tex files without downloading? #51

Closed
irene622 opened this issue Jun 1, 2023 · 1 comment

irene622 commented Jun 1, 2023

I downloaded the arXiv tex files myself, without running scripts/arxiv-kickoff-download.sh.

My data structure is

my_arxiv_src
 |- papername1
      |- name.tex
 |- papername2
      |- name.tex
      |- other_name.tex

I want to preprocess my LaTeX data, so I ran
bash scripts/arxiv-kickoff-cleaning.sh
where arxiv-kickoff-cleaning.sh is the following:

#!/bin/bash

set -e

WORKERS=2

# load modules
module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 conda/pytorch_1.12.0
pip install -r arxiv_requirements.txt

export DATA_DIR="./my_arxiv_src"
export TARGET_DIR="./data/arxiv/processed"
export WORK_DIR="./work"

mkdir -p logs/arxiv/cleaning

# setup partitions
python run_clean.py --data_dir "$DATA_DIR" --target_dir "$TARGET_DIR" --workers $WORKERS --setup

# run cleaning in job array
sbatch scripts/arxiv-clean-slurm.sbatch

arxiv-kickoff-cleaning.sh runs with no errors, but the resulting files, arxiv_1.jsonl and arxiv_2.jsonl, have no content...

What should DATA_DIR and TARGET_DIR point to?
Is there a way to run this directly on LaTeX files?

mauriceweber (Collaborator) commented Jun 9, 2023

Hi @irene622 , thanks for your question!

The run_clean.py script expects your data to be organized the same way as when it is downloaded from the arXiv S3 bucket. The most straightforward approach is therefore to mirror that structure: package your my_arxiv_src into a tar file and store it in data/src. So you need something like this:

data/src
 |- my_arxiv_src.tar
     |- papername1.gz
         |- name.tex
     |- papername2.gz
         |- name.tex
         |- other_name.tex
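For reference, here is a minimal sketch of how such an archive could be assembled with Python's standard tarfile and gzip modules. The yymm-style directory name ("2306") and the inner format (each .gz being a gzipped tar of one project's .tex files) are assumptions modeled on the arXiv S3 layout, and package_projects is a hypothetical helper, not part of the repo:

```python
import gzip
import pathlib
import tarfile


def package_projects(src_dir: pathlib.Path, out_tar: pathlib.Path,
                     yymm: str = "2306") -> None:
    """Bundle per-paper project directories into one arXiv-style tar.

    Layout inside the tar mirrors the arXiv S3 dumps: <yymm>/<paper_id>.gz,
    where each .gz member is itself a (gzipped) tar of that project's
    .tex files.
    """
    staging = out_tar.parent / yymm
    staging.mkdir(parents=True, exist_ok=True)

    for proj in sorted(p for p in src_dir.iterdir() if p.is_dir()):
        gz_path = staging / f"{proj.name}.gz"
        # inner archive: tar the project's .tex files, gzip-compressed
        with gzip.open(gz_path, "wb") as gz:
            with tarfile.open(fileobj=gz, mode="w") as inner:
                for tex in sorted(proj.glob("*.tex")):
                    inner.add(tex, arcname=tex.name)

    # outer archive: one tar holding all <yymm>/<paper_id>.gz members
    with tarfile.open(out_tar, "w") as outer:
        for gz_path in sorted(staging.glob("*.gz")):
            outer.add(gz_path, arcname=f"{yymm}/{gz_path.name}")
```

After running this against the directory layout from the question, data/src/my_arxiv_src.tar contains 2306/papername1.gz and 2306/papername2.gz, which is the shape the cleaning script globs for.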

You can then call python run_clean.py --data_dir data/src --target_dir /dir/to/desired/output. Also check out this part of the code in the arxiv_cleaner module:

if tar_fp_list is None:
    def _tar_fp_iterator():
        for _tar_fp in self._data_dir.glob("*.tar"):
            yield _tar_fp
else:
    def _tar_fp_iterator():
        for _tar_fp in tar_fp_list:
            yield _tar_fp

failed = 0
processed = 0

for tar_fp in _tar_fp_iterator():
    print(f"[{get_timestamp()}][INFO] start processing {tar_fp}")

    with tempfile.TemporaryDirectory(dir=self._work_dir) as tmpdir:
        with tarfile.open(tar_fp) as tf:
            tf.extractall(members=tf.getmembers(), path=tmpdir)

        for proj_dir_or_file in pathlib.Path(tmpdir).rglob("*.gz"):
            # get arxiv id and month from the filename
            yymm = proj_dir_or_file.parent.stem
            arxiv_id = proj_dir_or_file.stem

            # load the tex source files (we also get the timestamp here)
            data = _tex_proj_loader(proj_dir_or_file)

            if data is None:
                failed += 1
                continue

            tex_files, timestamp = data
            processed += 1

            if processed > max_files > 0:
                break

            yield tex_files, yymm, arxiv_id, timestamp
        else:
            continue
        break

print(f"[{get_timestamp()}][INFO] # Failed loading : {failed}")
print(f"[{get_timestamp()}][INFO] done.")
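Note that the snippet derives yymm and arxiv_id purely from each .gz member's path (parent directory name and file stem), so the .gz files must sit inside a subdirectory of the tar, not at its root. A small sketch that previews which (yymm, arxiv_id) pairs the cleaner would see for a given tar (preview_ids is a hypothetical helper for checking your archive, not part of the repo):

```python
import pathlib
import tarfile


def preview_ids(tar_fp):
    """Yield (yymm, arxiv_id) pairs the same way the cleaner derives them:
    parent directory name and file stem of every .gz member in the tar."""
    with tarfile.open(tar_fp) as tf:
        for member in tf.getmembers():
            p = pathlib.Path(member.name)
            if member.isfile() and p.suffix == ".gz":
                yield p.parent.stem, p.stem
```

If this yields nothing, or the first element of each pair is empty or ".", the .gz members are laid out in a way the cleaner cannot parse, which would explain empty output files.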

Let me know if this helps!
