error of preprocess #50

Closed
lizhuo-1994 opened this issue May 27, 2020 · 12 comments

@lizhuo-1994

lizhuo-1994 commented May 27, 2020

Extracting paths from validation set...
Finished extracting paths from validation set
Extracting paths from test set...
Finished extracting paths from test set
Extracting paths from training set...
dir: data/train was not completed in time
Finished extracting paths from training set
Creating histograms from the training data
subtoken vocab size: 0
node vocab size: 0
target vocab size: 0
File: 1.test.raw.txt
Traceback (most recent call last):
File "preprocess.py", line 115, in <module>
max_contexts=int(args.max_contexts), max_data_contexts=int(args.max_data_contexts))
File "preprocess.py", line 53, in process_file
print('Average total contexts: ' + str(float(sum_total) / total))
ZeroDivisionError: float division by zero

Here is my preprocess.sh:
TRAIN_DIR=data/train
VAL_DIR=data/validation
TEST_DIR=data/tes
DATASET_NAME=1
MAX_DATA_CONTEXTS=1000
MAX_CONTEXTS=200
SUBTOKEN_VOCAB_SIZE=186277
TARGET_VOCAB_SIZE=26347
NUM_THREADS=1
PYTHON=python3.7

@lizhuo-1994
Author

I slightly modified the Java large dataset, but I did not change anything in code2seq itself. Could you please help me with this issue? Thanks a lot!

@urialon
Contributor

urialon commented May 27, 2020

Hi @lizhuo-1994 ,
Thank you for your interest in code2seq!

What is your Java version? Please run "java --version"

@urialon
Contributor

urialon commented May 27, 2020

Additionally, can you try running the extractor directly, without the Python wrapper:

java -cp JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --dir JavaExtractor/JPredict/src/main

@lizhuo-1994
Author

$ java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~16.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

$ java -cp JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --dir JavaExtractor/JPredict/src/main

Error: Could not find or load main class JavaExtractor.App

Here is my result. Thanks for helping!

@urialon
Contributor

urialon commented May 28, 2020

Did you run this from the main code2seq directory? Does the jar file exist?
Can you please run:
ls -lt JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar
?

If the file exists, then please run:

jar tvf JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar | grep JavaExtractor
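The same check can be scripted. Below is a minimal sketch, assuming only that a jar is an ordinary zip archive and that a class such as JavaExtractor.App is stored inside it as JavaExtractor/App.class. The helper name `jar_has_class` is hypothetical and is not part of code2seq.

```python
import os
import zipfile


def jar_has_class(jar_path, class_name):
    """Return True if the jar contains the compiled class.

    jar_path:   path to the .jar file (a zip archive).
    class_name: dotted class name, e.g. "JavaExtractor.App".
    Returns False when the jar file itself does not exist.
    """
    if not os.path.isfile(jar_path):
        return False
    # A dotted class name maps to a slash-separated .class entry in the jar.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

Calling `jar_has_class("JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar", "JavaExtractor.App")` from the main code2seq directory should return True when the jar was built correctly. A False result would be consistent with the "Could not find or load main class" error.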

@lizhuo-1994
Author

It works! It seems that I had lost the .jar file, and I have now restored it. Thanks for helping!

@lizhuo-1994
Author

But here is another problem:

Extracting paths from validation set...
Finished extracting paths from validation set
Extracting paths from test set...
Finished extracting paths from test set
Extracting paths from training set...
dir: data/train was not completed in time
Finished extracting paths from training set
Creating histograms from the training data
subtoken vocab size: 0
node vocab size: 0
target vocab size: 0
File: 1.test.raw.txt
Traceback (most recent call last):
File "preprocess.py", line 115, in <module>
max_contexts=int(args.max_contexts), max_data_contexts=int(args.max_data_contexts))
File "preprocess.py", line 53, in process_file
print('Average total contexts: ' + str(float(sum_total) / total))
ZeroDivisionError: float division by zero

@lizhuo-1994
Author

Maybe it is because of the timeout. I will try again, thanks!
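For readers who hit the same traceback: the failing line divides by `total`, so the error appears whenever the file being processed contributed zero examples. The sketch below shows one way such a report could be guarded. The function names `average_contexts` and `report` are hypothetical, and this is not the actual code in preprocess.py.

```python
def average_contexts(sum_total, total):
    """Return the average number of contexts per example.

    Returns None instead of raising ZeroDivisionError when no
    examples were read (total == 0).
    """
    if total == 0:
        return None
    return float(sum_total) / total


def report(file_name, sum_total, total):
    """Build a status line for one processed file."""
    avg = average_contexts(sum_total, total)
    if avg is None:
        # An empty input usually means the extraction step produced nothing.
        return "File: " + file_name + " has no examples; check the extraction step"
    return "Average total contexts: " + str(avg)
```

With a guard like this, an empty raw file yields a message that points back at the extraction step rather than a division error.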

@urialon
Contributor

urialon commented May 29, 2020

Yes, there are timeouts, and we originally used a 64-core machine to preprocess the datasets.
So using a smaller machine might trigger timeouts.
The exact time is defined here:
https://github.com/tech-srl/code2seq/blob/master/JavaExtractor/extract.py#L37

By default, 6 processes run in parallel (see: https://github.com/tech-srl/code2seq/blob/master/JavaExtractor/extract.py#L66) and each of them runs with 64 threads (see: https://github.com/tech-srl/code2seq/blob/master/preprocess.sh#L32).

To verify that preprocessing runs on a small dataset, you can try preprocessing the JavaExtractor itself. I.e., point the training+test+validation paths to JavaExtractor/JPredict/src/ and verify that it runs successfully within a few seconds or so.
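To illustrate how a per-directory timeout can leave the output empty, here is a minimal sketch built on the standard `subprocess.run` timeout. It is not the code in extract.py. The helper name `run_with_timeout` and the returned status strings are assumptions made for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run a command and report how it ended.

    Returns "ok" on exit code 0, "failed" on a non-zero exit code,
    and "timeout" if the command was killed for exceeding the limit.
    """
    try:
        completed = subprocess.run(
            cmd,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # The child is killed before the exception is raised, so any
        # partial results it was producing are lost.
        return "timeout"
    return "ok" if completed.returncode == 0 else "failed"


# A fast command finishes; a slow one is cut off by the limit.
fast = run_with_timeout([sys.executable, "-c", "pass"], 30)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(10)"], 0.5)
print(fast, slow)
```

On a machine with fewer cores than the one the defaults were tuned for, a larger limit or a smaller input directory makes the "timeout" outcome less likely.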

@lizhuo-1994
Author

Thanks for the explanation. I reconfigured it and now it seems to be working well.

BTW, it is really disk- and time-consuming, so I expect preprocessing to run for about 2-3 days.

@urialon
Contributor

urialon commented Jun 2, 2020

Unfortunately, that's right.
The preprocessing pipeline was designed to process millions of examples, and it is disk- and time-consuming.

I'm closing this issue for now; feel free to re-open if you have any additional questions.

@urialon urialon closed this as completed Jun 2, 2020
@walt676

walt676 commented Jan 5, 2021

> Thanks for the explanation. I reconfigured it and now it seems to be working well.
>
> BTW, it is really disk- and time-consuming, so I expect preprocessing to run for about 2-3 days.

Hello, may I ask what the specific configuration of your machine is and which parameters you ended up using?

Thanks a lot!
