
Issues encountered when processing big data #181

Open
lidiancracy opened this issue Sep 19, 2023 · 1 comment
Comments
@lidiancracy
Contributor

lidiancracy commented Sep 19, 2023

When using the process.sh script, I can process my test and validation datasets normally, but processing my training dataset fails silently, with no error output. I then added batching to extract.py, changing the directory scanning from scanning everything at once to scanning in batches. After these modifications, I got the expected output for the training data. The modified Python file is as follows; we added a "batch_size" parameter and updated the ExtractFeaturesForDirsList method:

#!/usr/bin/python

import itertools
import multiprocessing
import os
import sys
import shutil
import subprocess
from threading import Timer
from argparse import ArgumentParser
from subprocess import Popen, PIPE, STDOUT, call

# ......

def ExtractFeaturesForDirsList(args, dirs):
    tmp_dir = f"./tmp/feature_extractor{os.getpid()}/"
    if os.path.exists(tmp_dir):
        shutil.rmtree(tmp_dir, ignore_errors=True)
    os.makedirs(tmp_dir)
    try:
        for i in range(0, len(dirs), args.batch_size):  # use range and batch_size to process in batches
            batch_dirs = dirs[i:i + args.batch_size]
            # use a context manager so each batch's worker pool is closed before the next batch starts
            with multiprocessing.Pool(4) as p:
                p.starmap(ParallelExtractDir, zip(itertools.repeat(args), itertools.repeat(tmp_dir), batch_dirs))
            output_files = os.listdir(tmp_dir)
            for f in output_files:
                os.system("cat %s/%s" % (tmp_dir, f))
                os.remove(os.path.join(tmp_dir, f))  # delete processed files to make room for the next batch
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)


if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument("-maxlen", "--max_path_length", dest="max_path_length", required=False, default=8)
    parser.add_argument("-maxwidth", "--max_path_width", dest="max_path_width", required=False, default=2)
    parser.add_argument("-threads", "--num_threads", dest="num_threads", required=False, default=64)
    parser.add_argument("-j", "--jar", dest="jar", required=True)
    parser.add_argument("-dir", "--dir", dest="dir", required=False)
    parser.add_argument("-file", "--file", dest="file", required=False)
    parser.add_argument("-batch_size", "--batch_size", dest="batch_size", required=False, default=5, type=int)

    args = parser.parse_args()

    if args.file is not None:
        command = 'java -cp ' + args.jar + ' JavaExtractor.App --max_path_length ' + \
                  str(args.max_path_length) + ' --max_path_width ' + str(args.max_path_width) + ' --file ' + args.file
        os.system(command)
    elif args.dir is not None:
        subdirs = get_immediate_subdirectories(args.dir)
        to_extract = subdirs
        if len(subdirs) == 0:
            to_extract = [args.dir.rstrip('/')]
        ExtractFeaturesForDirsList(args, to_extract)
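The per-batch loop above can be sketched in isolation. This is a minimal, self-contained illustration of the batching pattern, not the actual extractor: `extract` is a hypothetical stand-in for the real per-directory extraction call.

```python
def chunks(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def process_in_batches(dirs, batch_size, extract):
    """Apply extract() to each directory, one fixed-size batch at a time."""
    results = []
    for batch in chunks(dirs, batch_size):
        results.extend(extract(d) for d in batch)
    return results


# Example: 7 directories with batch_size=3 split into batches of 3, 3, and 1.
batches = list(chunks(["d1", "d2", "d3", "d4", "d5", "d6", "d7"], 3))
```

Keeping each batch small bounds the amount of work handed to the worker pool at once, which is what appears to sidestep the silent failure on large training sets.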
@urialon
Collaborator

urialon commented Sep 19, 2023

Hi @lidiancracy ,
Thank you for your interest in our work and for your fix!

I think there are timeouts on the Java side, and by splitting the work into batches you avoid them.

Thanks again; we would love to adopt this fix if you send it as a PR.

Best,
Uri
