error of preprocess #50

lizhuo-1994 · 2020-05-27T15:05:07Z

Extracting paths from validation set...
Finished extracting paths from validation set
Extracting paths from test set...
Finished extracting paths from test set
Extracting paths from training set...
dir: data/train was not completed in time
Finished extracting paths from training set
Creating histograms from the training data
subtoken vocab size: 0
node vocab size: 0
target vocab size: 0
File: 1.test.raw.txt
Traceback (most recent call last):
File "preprocess.py", line 115, in <module>
max_contexts=int(args.max_contexts), max_data_contexts=int(args.max_data_contexts))
File "preprocess.py", line 53, in process_file
print('Average total contexts: ' + str(float(sum_total) / total))
ZeroDivisionError: float division by zero

Here is my preprocess.sh:
TRAIN_DIR=data/train
VAL_DIR=data/validation
TEST_DIR=data/tes
DATASET_NAME=1
MAX_DATA_CONTEXTS=1000
MAX_CONTEXTS=200
SUBTOKEN_VOCAB_SIZE=186277
TARGET_VOCAB_SIZE=26347
NUM_THREADS=1
PYTHON=python3.7
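
Before a long preprocessing run, it can help to confirm that each configured directory exists and actually contains Java sources. The following is a minimal Python sketch, not part of code2seq: `count_java_files` is a hypothetical helper, and the directory names are simply copied from the configuration above.

```python
from pathlib import Path


def count_java_files(directory):
    """Return the number of .java files under directory, or None if it is missing."""
    root = Path(directory)
    if not root.is_dir():
        return None
    # rglob walks the directory tree recursively.
    return sum(1 for _ in root.rglob("*.java"))


if __name__ == "__main__":
    # Directory names copied verbatim from the preprocess.sh above.
    for name in ("data/train", "data/validation", "data/tes"):
        count = count_java_files(name)
        if count is None:
            print(f"{name}: directory not found")
        else:
            print(f"{name}: {count} .java files")
```

A directory reported as missing or containing zero files would yield no extracted examples for that split.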

lizhuo-1994 · 2020-05-27T15:07:39Z

I slightly modified the java-large dataset, but I did not change anything in code2seq. Could you please help me with this issue? Thanks a lot!

urialon · 2020-05-27T19:48:55Z

Hi @lizhuo-1994 ,
Thank you for your interest in code2seq!

What is your Java version? Please run "java -version" (the "--version" form only works on Java 9 and later).

urialon · 2020-05-27T19:51:37Z

Additionally - can you try to run the extractor directly, without the python wrapper:

java -cp JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --dir JavaExtractor/JPredict/src/main

lizhuo-1994 · 2020-05-28T01:35:55Z

$ java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~16.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

$ java -cp JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --dir JavaExtractor/JPredict/src/main

Error: Could not find or load main class JavaExtractor.App

Here is my result. Thanks for helping!

urialon · 2020-05-28T05:53:38Z

Did you run this from the main code2seq directory? Does the jar file exist?
Can you please run:

ls -lt JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar

If the file exists, then please run:

jar tvf JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar | grep JavaExtractor
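
The same two checks (does the jar exist, and does it contain the main class) can also be done from Python, since a jar is a zip archive. This is only a sketch: `jar_has_class` is a hypothetical helper, not something shipped with code2seq.

```python
import zipfile
from pathlib import Path


def jar_has_class(jar_path, class_name):
    """Return True if the jar exists and contains the compiled class."""
    path = Path(jar_path)
    if not path.is_file():
        return False
    # JavaExtractor.App is stored in the archive as JavaExtractor/App.class
    entry = class_name.replace(".", "/") + ".class"
    try:
        with zipfile.ZipFile(path) as jar:
            return entry in jar.namelist()
    except zipfile.BadZipFile:
        # The file exists but is not a readable zip/jar archive.
        return False


if __name__ == "__main__":
    jar = "JavaExtractor/JPredict/target/JavaExtractor-0.0.1-SNAPSHOT.jar"
    print(jar_has_class(jar, "JavaExtractor.App"))
```

A result of False corresponds to the "Could not find or load main class" situation: either the jar is missing or the class is not inside it.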

lizhuo-1994 · 2020-05-28T06:12:29Z

It works! It seems that I had lost the *.jar file and have now found it again. Thanks for helping!

lizhuo-1994 · 2020-05-28T13:15:44Z

But here is another problem:

Extracting paths from validation set...
Finished extracting paths from validation set
Extracting paths from test set...
Finished extracting paths from test set
Extracting paths from training set...
dir: data/train was not completed in time
Finished extracting paths from training set
Creating histograms from the training data
subtoken vocab size: 0
node vocab size: 0
target vocab size: 0
File: 1.test.raw.txt
Traceback (most recent call last):
File "preprocess.py", line 115, in <module>
max_contexts=int(args.max_contexts), max_data_contexts=int(args.max_data_contexts))
File "preprocess.py", line 53, in process_file
print('Average total contexts: ' + str(float(sum_total) / total))
ZeroDivisionError: float division by zero

lizhuo-1994 · 2020-05-28T13:35:13Z

Maybe it is because of the timeout. I will try again, thanks!
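
The vocabulary sizes of 0 and the division by zero in the log are consistent with an extracted file that contains no examples. A small Python sketch for checking this before running preprocess.py follows. It assumes one extracted example per line, and both `count_examples` and `average_contexts` are hypothetical helpers, not functions from preprocess.py.

```python
def count_examples(raw_path):
    """Count non-blank lines, assuming one extracted example per line."""
    with open(raw_path, encoding="utf-8", errors="replace") as f:
        return sum(1 for line in f if line.strip())


def average_contexts(sum_total, total):
    """Average that returns 0.0 instead of raising when there are no examples."""
    return float(sum_total) / total if total else 0.0
```

If `count_examples` returns 0 for a file such as the `1.test.raw.txt` named in the log, the extraction step for that split produced nothing, and the statistics step has nothing to average.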

urialon · 2020-05-29T17:51:10Z

Yes, there are timeouts. We originally used a 64-core machine to preprocess the datasets, so using a smaller machine might trigger them.
The exact time is defined here:
https://github.com/tech-srl/code2seq/blob/master/JavaExtractor/extract.py#L37

By default, 6 processes run in parallel (see: https://github.com/tech-srl/code2seq/blob/master/JavaExtractor/extract.py#L66), and each of them runs with 64 threads (see: https://github.com/tech-srl/code2seq/blob/master/preprocess.sh#L32).

To verify that preprocessing runs on a small dataset, you can try preprocessing the JavaExtractor itself. I.e., point the training+test+validation paths to JavaExtractor/JPredict/src/ and verify that it runs successfully within a few seconds or so.
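
To illustrate why a timeout can leave the output empty, here is a generic Python sketch of running a child process under a time limit. It is not the code in extract.py: `run_with_timeout` is a hypothetical helper, and the "slow extractor" is simulated with a sleeping Python child.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run command; return (finished_in_time, stdout_text)."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
            text=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; partial output is dropped here.
        return False, ""
    return True, result.stdout


if __name__ == "__main__":
    # Stand-in for a slow extractor: a child that sleeps longer than the limit.
    slow = [sys.executable, "-c", "import time; time.sleep(5); print('paths')"]
    finished, output = run_with_timeout(slow, timeout_seconds=1)
    print(finished, repr(output))
```

In this sketch a timed-out job contributes no output at all, which is the kind of outcome that would leave later steps with zero examples. Raising the limit or lowering the parallelism on a smaller machine are the two obvious adjustments.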

lizhuo-1994 · 2020-05-30T03:38:05Z

Thanks for the explanation. I re-configured it and now it seems to be working well.

BTW, it is really disk-consuming and time-consuming, so I think preprocessing will run for about 2-3 days.

urialon · 2020-06-02T05:16:09Z

Unfortunately, that's right.
The preprocessing pipeline was designed to process millions of examples, and it is disk- and time-consuming.

I'm closing this issue for now. Feel free to re-open it if you have any additional questions.

walt676 · 2021-01-05T09:57:08Z

> thanks for the explanation, I re-configured it and now it seems working well.
>
> BTW, it is really disk-consuming and time-consuming, so I think it would be running about 2-3days for preprocessing

Hello, may I ask for the specific configuration of your machine and the final parameters you used?

Thanks a lot!

urialon closed this as completed Jun 2, 2020
