Preprocess stuck #7

Open
nikhilrayaprolu opened this issue May 14, 2021 · 5 comments
Labels
bug Something isn't working

Comments

@nikhilrayaprolu

🐛 Bug

When I run python scripts/preprocess.py cnndm --mode pipeline
preprocessing gets stuck at this point:

image

Some of the oraclewords are also not generated.

image

Environment

  • fairseq Version (e.g., 1.0 or master): recommended commit
  • PyTorch Version (e.g., 1.0) : 1.8
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): source
  • Python version: 3.6.8
  • CUDA/cuDNN version: 10.2
@nikhilrayaprolu nikhilrayaprolu added the bug Something isn't working label May 14, 2021
@nikhilrayaprolu
Author

@jxhe @muggin

@geeraay

geeraay commented May 23, 2021

Hi @nikhilrayaprolu,

I faced the same problem as you. It happened because running the preprocessing script on the whole cnndm training dataset took more than 32GB of RAM. I would suggest splitting the train set into several parts, then merging the results after preprocessing of those parts has finished.

@nikhilrayaprolu
Author

Thanks for the reply, @geeraay.

@nikhilrayaprolu
Author

@geeraay can you explain in more detail how the splitting and merging are done? Any accompanying code would be really helpful.

@geeraay

geeraay commented May 26, 2021

I don't remember the exact steps I took back then, but the idea is this.

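I did something like
split -d -n l/${nsplit} /path-to-file/train.source /path-to-file/train.source.
With GNU split this creates ${nsplit} files numbered from 00: train.source.00, train.source.01, and so on. The -d flag is what gives numeric suffixes; without it the parts are named train.source.aa, train.source.ab, ...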

Then I renamed the generated files to
train_1.source, train_2.source, ..., train_${nsplit}.source.
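
The split-and-rename step above can also be sketched in Python. This is a minimal sketch, not the steps actually used in this thread: the function name split_lines and its parameters are hypothetical. It splits by line count rather than by byte size. That matters if a parallel file (assumed here to be train.target) has to be split too, because split -n l/N balances parts by bytes and will generally not cut two parallel files at the same line.

```python
from itertools import islice
from pathlib import Path


def split_lines(path, nsplit, out_dir, stem="train", ext="source"):
    """Split `path` into `nsplit` consecutive parts by line count.

    Hypothetical helper: parts are written to `out_dir` as
    {stem}_1.{ext} ... {stem}_{nsplit}.{ext}. The first `total % nsplit`
    parts receive one extra line.
    """
    path, out_dir = Path(path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # First pass: count lines without loading the whole file into memory.
    with path.open("rb") as f:
        total = sum(1 for _ in f)

    base, extra = divmod(total, nsplit)
    written = []
    # Second pass: stream consecutive runs of lines into the part files.
    with path.open("rb") as f:
        for i in range(1, nsplit + 1):
            n_lines = base + (1 if i <= extra else 0)
            part = out_dir / f"{stem}_{i}.{ext}"
            with part.open("wb") as out:
                out.writelines(islice(f, n_lines))
            written.append(part)
    return written
```

Calling split_lines once on train.source with ext="source" and once on the parallel file with its own ext, using the same nsplit, keeps the parts aligned as long as both files have the same number of lines.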

After that you could run
python scripts/preprocess.py cnndm --mode pipeline --split train_1,train_2,...,train_${nsplit}
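
Typing out the comma-separated part names by hand gets tedious for a large ${nsplit}. As a small sketch, the argument list can be generated instead. The helper name build_command is hypothetical; the script path and flags are taken from the command above as written, without checking them against the repository.

```python
def build_command(nsplit):
    """Return the preprocess command from this comment as an argument list.

    Hypothetical helper: the part names train_1 ... train_{nsplit} follow
    the renaming scheme described in this thread.
    """
    split_arg = ",".join(f"train_{i}" for i in range(1, nsplit + 1))
    return [
        "python", "scripts/preprocess.py", "cnndm",
        "--mode", "pipeline",
        "--split", split_arg,
    ]


if __name__ == "__main__":
    # Print the command for 4 parts so it can be pasted into a shell.
    print(" ".join(build_command(4)))
```

The list form can be passed directly to subprocess.run if the run should be started from Python.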

Wait until the preprocessing step is done. Then I manually concatenated the generated files, in part order, into one big train.source file.
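
The manual merge can be done with a few lines of Python instead of copy and paste. This is a sketch under assumptions: merge_parts is a hypothetical helper, and which per-part output files the preprocessing script produces is not stated in this thread, so the caller supplies the paths explicitly.

```python
def merge_parts(part_paths, merged_path, chunk_size=1 << 20):
    """Concatenate `part_paths`, in the order given, into `merged_path`.

    Hypothetical helper: files are streamed in chunks so memory use stays
    small. If a part does not end with a newline, one is added so that the
    last line of that part is not fused with the first line of the next.
    """
    with open(merged_path, "wb") as out:
        for part in part_paths:
            last = b"\n"  # an empty part contributes nothing
            with open(part, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                out.write(b"\n")
    return merged_path
```

Build the path list by index, for example with a list comprehension over range(1, nsplit + 1), rather than by sorting file names: a plain sort puts train_10 before train_2 and would scramble the order.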

Or you can simply use a machine with more RAM and preprocess without splitting the file.
