
c4/multilingual produces Dataflow job file too big (38MB >> 10MB) #2711

Closed
versae opened this issue Nov 9, 2020 · 115 comments
Labels
bug Something isn't working

Comments

@versae

versae commented Nov 9, 2020

Short description
We are trying to extract the Norwegian (and eventually other Nordic languages) portion of c4/multilingual. Since there is no easy way to download only the data for one language, we are processing the entire c4/multilingual corpus first.

Environment information

  • Operating System: Debian GNU/Linux 10

  • Python version: Python 3.8.5 (miniconda)

  • tensorflow-datasets/tfds-nightly version: 4.1.0 / 4.1.0.dev202011080107

  • tensorflow/tf-nightly version: 2.3.1 / 2.5.0.dev20201108 (tried with and without tf-nightly)

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?
    Yes, it does. We read all the issues related to C4 and incorporated the necessary changes: pinning the dill version and adding options for 450 workers and experiments=shuffle_mode=service in Apache Beam.

Reproduction instructions
On a clean VM with 8 vCPUs and 32 GB of RAM, we installed miniconda and ran the following commands:

DATASET_NAME=c4
DATASET_CONFIG=multilingual
GCP_PROJECT=...
GCS_BUCKET=...
GCS_BUCKET_REGION=...

# Add all 72 dumps
rm wet.paths.urls
echo "CC-MAIN-2013-20" >> wet.paths.urls
...
echo "CC-MAIN-2020-40" >> wet.paths.urls

# Put them in the bucket
for wetpath in `cat wet.paths.urls` ; do curl -s https://commoncrawl.s3.amazonaws.com/crawl-data/$wetpath/wet.paths.gz | gunzip | pv --name $wetpath --bytes | gsutil -q cp - "$GCS_BUCKET/tensorflow_datasets/downloads/manual/crawl-data/$wetpath/wet.paths" ; done

# Prepare requirements
rm /tmp/beam_requirements.txt
echo "tensorflow_datasets[$DATASET_NAME]" >> /tmp/beam_requirements.txt
echo "tfds-nightly[gcp,$DATASET_NAME]" >> /tmp/beam_requirements.txt
echo "google-apitools" >> /tmp/beam_requirements.txt
# there's an error with avro-python3 and dill, dill version needs to be fixed
# https://github.com/tensorflow/datasets/issues/2636#issuecomment-722551597
echo "dill==0.3.1.1" >> /tmp/beam_requirements.txt
python -m pip install tensorflow tf-nightly
python -m pip install -r /tmp/beam_requirements.txt

# Run main command
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME/$DATASET_CONFIG \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options=\
"region=$GCS_BUCKET_REGION,runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,"\
"dataflow_job_file=gs://$GCS_BUCKET/job_file.json,"\
"requirements_file=/tmp/beam_requirements.txt,max_num_workers=450,experiments=shuffle_mode=service" 2>&1 | tee nb-mc4.log

Link to logs
We removed information about our project and bucket in the logs:

Expected behavior
We expected the script to successfully launch the pipeline on Dataflow, but the JSON job file seems to be too big (37.5 MB when the max is 10 MB), so all we get is a "Your client issued a request that was too large" error message (formatted as an HTML page in the console output).

Sample of the output

  <title>Error 413 (Request Entity Too Large)!!1</title>
  <p><b>413.</b> <ins>That's an error.</ins>
  <p>Your client issued a request that was too large.</p>
  <ins>That's all we know.</ins>

Additional context
If there is any other way to extract a language portion of c4/multilingual we'd be eager to try it as well.

Update (March 1st, 2021): Instructions for successfully running the pipeline using one dump are detailed in #2711 (comment) below.
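
(For reference, a minimal sketch, assuming a finished c4/multilingual build already sits in your data_dir: each language is exposed as its own split, following the per-language naming seen later in this thread, e.g. "no" and "no-validation", so reading a single language afterwards would look roughly like this. The bucket path is a placeholder.)

import tensorflow_datasets as tfds

# Hypothetical: assumes the c4/multilingual build is already finished in this
# data_dir; producing that build is exactly what the rest of this issue is about.
ds = tfds.load(
    "c4/multilingual",
    split="no",  # per-language splits, e.g. "no", "no-validation"
    data_dir="gs://YOUR_BUCKET/tensorflow_datasets",
)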

@versae versae added the bug Something isn't working label Nov 9, 2020
@Conchylicultor
Member

@adarob FYI

@adarob
Member

adarob commented Nov 9, 2020

@versae I'd recommend generating all languages you want at the same time since it has to process all of the data either way. However, I think if you added a new config with only the languages you want, you'd avoid the JSON issue.

@Conchylicultor we would need to allow them to add a new config to the dataset with only the languages they are interested in. What's the recommended way for them to do that?

@versae
Author

versae commented Nov 9, 2020

Thanks for the quick replies, @Conchylicultor and @adarob. We thought about the custom config, but had trouble figuring out how to make it work. Even with a custom config, it's still not clear to us how to run the download_and_prepare code, either in command or script mode.

After inspecting the source code, it seems it should be possible to process the data for only one language, even if we have to download everything, but we're not sure how to do it.

@adarob
Member

adarob commented Nov 9, 2020

You should be able to add a config similar to 'multilingual', but with only the Nordic languages listed. Let's say you name it 'nordic'. Then you can call download_and_prepare with c4/nordic as the dataset.

You'd make the change to a local clone of the repo and then you can run pip install -e . to install it on the master VM.
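
(For anyone following along, a rough sketch of what such an entry might look like in a local clone of tensorflow_datasets/text/c4.py. The keyword arguments here are assumptions modelled on the existing "multilingual" entry and may not match your checkout exactly; versae's actual patch is attached further down the thread.)

# Hypothetical "nordic" config added next to the existing entries in c4.py.
# Copy whatever arguments the "multilingual" entry passes in your checkout,
# changing only the config name and the language list.
NORDIC_LANGUAGES = ["da", "is", "no", "sv"]

C4Config(
    "nordic",                    # selected on the command line as c4/nordic
    languages=NORDIC_LANGUAGES,  # only the languages you want splits for
    # ...remaining keyword arguments copied verbatim from "multilingual"...
),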

@versae
Author

versae commented Nov 9, 2020

I see. It feels unnecessarily complicated, but we'll give it a try 🤞 Really, it'd be great if some job-file size control were allowed or implemented directly in the download_and_prepare script. The official docs recommend restructuring the pipeline to avoid such errors. We hope that downloading only the Nordic or Norwegian languages will do the trick, but honestly the issue seems to run a bit deeper.

I wonder: if the limit is 10 MB and the current code generates a job file of almost 40 MB, is there a way to lift this limitation, even temporarily, so we can at least launch the job?

In any case, thanks for the help.

@adarob
Member

adarob commented Nov 9, 2020

The issue is the large number of splits (>100) produced by this pipeline. Limiting it to ~10 languages should reduce the JSON size by a factor of 10, I believe.

Other options to fix the deeper problem would involve either having the Dataflow team raise the limit or merging some of the downstream steps in the TFDS sharding and writing portions of the pipeline.

@versae
Author

versae commented Nov 9, 2020

Thinking about it: if the downloading and processing of the data happens in the workers, the information about the languages isn't available until the job has already been submitted and is executing, is it? Limiting the languages, or for that matter selecting specific splits, would still generate the same job file to be sent to Dataflow, I believe.

@adarob
Member

adarob commented Nov 9, 2020

The actual pipeline "blueprint" is pre-generated and sent to the Dataflow service, which is where I think your issue is. This blueprint includes all of the per-language stages.

@versae
Author

versae commented Nov 9, 2020

We finally tried our custom config for Norwegian (excluding no-validation), but the job file is still too big (10.2 MB), so Dataflow is rejecting the request for being 200 KB over the limit :\

@adarob
Member

adarob commented Nov 9, 2020

Can you try also removing c4_utils.UNKNOWN_LANGUAGE here

for lang in self.builder_config.languages + [c4_utils.UNKNOWN_LANGUAGE]:
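
(A minimal sketch of the suggested edit, with the loop body elided; only the languages listed in the config would then get splits:)

# before: also generates splits for text whose language was not recognized
for lang in self.builder_config.languages + [c4_utils.UNKNOWN_LANGUAGE]:
    ...

# after: restrict split generation to the configured languages only
for lang in self.builder_config.languages:
    ...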

@versae
Author

versae commented Nov 9, 2020

Tried with and without that line. The request now went through, but Dataflow still failed and complained about the size, since the job graph is still a tiny bit larger than 10 MB (~10.2 MB):

{
  "error": {
    "code": 400,
    "message": "(ffddc5dbfa10dae9): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
    "status": "INVALID_ARGUMENT"
  }
}

Adding the updated job_file.json (compressed) in case it's useful. There are a couple of big chunks of binary data taking up most of the size of the file.

Wondering if the 10 MB hard limit could be lifted, at least temporarily or for individual projects.

@adarob
Member

adarob commented Nov 10, 2020

I'll see if we can get some help from the dataflow team.

@versae
Author

versae commented Nov 10, 2020

Thank you so much.

@rezarokni

If possible, it's best to reduce the number of nodes in the graph, for example by multiplexing the values into a source. @adarob, we should look at the pipeline in more detail with that in mind.

In the meantime, you can make use of --experiments=upload_graph in the Dataflow pipeline arguments, which allows pipelines larger than 10 MB. Note, though, that things like the UI will have limitations with this experimental flag.

@versae
Author

versae commented Nov 11, 2020

Interesting. Does this experiments=upload_graph argument replace experiments=shuffle_mode=service, or is there a way to have them both?

@rezarokni

You can have both in a list
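
(At the Beam level, --experiments is an appending flag, so repeated values accumulate into one list. A minimal Python sketch of that behaviour, separate from the download_and_prepare wrapper used in this thread:)

from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions

# Repeated --experiments flags end up together in a single list.
options = PipelineOptions([
    "--experiments=shuffle_mode=service",
    "--experiments=upload_graph",
])
print(options.view_as(DebugOptions).experiments)
# expected: ['shuffle_mode=service', 'upload_graph']

With the comma-separated --beam_pipeline_options string used in this thread, the same effect is achieved by repeating the experiments= entry; the working command posted later does exactly this (experiments=shuffle_mode=service,experiments=use_runner_v2).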

@versae
Author

versae commented Nov 11, 2020

The job now went through and it appears workers are being started. Adding more languages to the config also works.

Just one question, though: should we use max_num_workers or num_workers directly?

Thanks!

@versae versae closed this as completed Nov 11, 2020
@adarob
Member

adarob commented Nov 11, 2020 via email

@versae
Author

versae commented Nov 12, 2020

It's been running for 23 hours now, but we don't see any progress in the process log, the Dataflow log, or the Dataflow graph. The total allocated HDD is 10.99 TB, which is insufficient for the entire corpus, although we don't know whether the corpus is downloaded in chunks, processed, and then discarded.

Moreover, nothing is being written to the bucket, and after a few hours we started to see a lot of messages like these:

I1112 06:43:57.139022 139788711368448 transport.py:183] Refreshing due to a 401 (attempt 1/2)
I1112 06:43:58.973949 139788711368448 transport.py:183] Refreshing due to a 401 (attempt 1/2)
I1112 07:18:52.325422 139788711368448 transport.py:183] Refreshing due to a 401 (attempt 1/2)
I1112 07:18:54.354025 139788711368448 transport.py:183] Refreshing due to a 401 (attempt 1/2)
I1112 08:18:31.006261 139788711368448 transport.py:183] Refreshing due to a 401 (attempt 1/2)

Nothing else is written to the process log. Our guess is that the 401 is some sort of unauthorized OAuth bearer token issue? But we don't know whether we should worry about it, just let it run for another couple of hours, or stop it right away. It's been some 24 expensive hours for us running 450 workers :)

@adarob
Member

adarob commented Nov 12, 2020

Do you not see any counters? @rezarokni, is this a side effect of upload_graph?

If the job is not crashing, my suspicion is that it's working. The input dataset is 71x the size of the original C4, so it's going to take quite a bit longer. I'm not sure how much longer, but it should be less than 71x as long.

@versae
Author

versae commented Nov 12, 2020

If by counters you mean the stages of each box in the diagram: a few boxes at the beginning are outlined in dashed green (started), but most of them are greyed out (not even started yet). All counters are at 0 (zero); none has reported success yet. Attaching a screenshot for reference.

[Screenshot: Dataflow pipeline graph]

PS: If this is off-topic now for the current issue, I can create another issue and move the discussion there.

@adarob
Member

adarob commented Nov 12, 2020

When the job starts running properly, you should see a "Custom Counters" section on the right as well, assuming the upload_graph option isn't disabling it somehow.

Just to be clear, have you downloaded all of the WET files to the manual directory?

@versae
Author

versae commented Nov 12, 2020

We have 72 files like this one in the bucket: BUCKET/tensorflow_datasets/downloads/manual/crawl-data/CC-MAIN-2013-20/wet.paths, one for each Common Crawl dump. Are the wet.paths files what we need, or should we have downloaded the actual contents of all the URLs inside the WET files into the bucket?

We used the following code to put the wet.paths files in the bucket:

# Add all 72 dumps
rm wet.paths.urls
echo "CC-MAIN-2013-20" >> wet.paths.urls
...
echo "CC-MAIN-2020-40" >> wet.paths.urls

# Put them in the bucket
for wetpath in `cat wet.paths.urls` ; do curl -s https://commoncrawl.s3.amazonaws.com/crawl-data/$wetpath/wet.paths.gz | gunzip | pv --name $wetpath --bytes | gsutil -q cp - "$GCS_BUCKET/tensorflow_datasets/downloads/manual/crawl-data/$wetpath/wet.paths" ; done

@versae
Author

versae commented Nov 13, 2020

Is there any other indicator or flag that the job is actually running?

@rezarokni

The experimental upload_graph option can cause issues in the UI.
If you click through to the logs from the UI, you should still see them in the logging UI, which should give some indication.
You may also be able to see some of the metrics using the gcloud CLI:
https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/metrics

@versae
Author

versae commented Nov 13, 2020

Attaching the Dataflow metrics as returned by the CLI command: c4-nordic-gen.metrics.txt. Nothing of interest there, we believe.

In the UI, the worker logs are empty, and the job log hasn't shown anything in the last 36 hours:
[Screenshot: empty job log]

The download_and_prepare script keeps printing the 401 message a couple of times every hour, always (attempt 1/2). We are starting to really worry that nothing is happening.

@versae
Author

versae commented Nov 13, 2020

Our only hope is that some workers look like this:
[Screenshot: worker activity]

@adarob
Member

adarob commented Nov 13, 2020 via email

@versae
Author

versae commented Dec 4, 2020

Is the custom config needed in the workers, or only by the main process launching the Dataflow job?

@adarob
Member

adarob commented Dec 4, 2020

The custom config is not needed by the workers. It's okay that the installed version doesn't include it.

Just to be clear, I was able to run this with no issue back when I submitted #2734. I'm not sure why you are still having trouble.

@rezarokni

rezarokni commented Dec 4, 2020

@versae, it's best to delete the comment with the txt file, as the raw info is not needed.

@versae
Author

versae commented Dec 4, 2020

@adarob, @rezarokni, thanks for all the help, we're all in good faith here :) I wish it just worked for me. After removing the local clone from the worker dependencies and getting rid of the experimental flag (the job graph file is now 1.3MB instead of 13MB), it seems we have reached a milestone. I can now see worker logs in the UI, and the first stage (counter) has completed successfully. We are now getting these tracebacks in the workers:

An exception was raised when trying to execute the workitem 7861000946831734554 : Traceback (most recent call last):
  ...
  File "/home/versae/datasets/tensorflow_datasets/text/c4.py", line 403, in download_wet_file
    name=f"{lang}-validation",
NameError: name 'uuid' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
  File "/home/versae/datasets/tensorflow_datasets/text/c4.py", line 403, in download_wet_file
    name=f"{lang}-validation",
NameError: name 'uuid' is not defined [while running 'Map(download_wet_file)']

Note: imports, functions and other variables defined in the global context of your __main__ file of your Dataflow pipeline are, by default, not available in the worker execution environment, and such references will cause a NameError, unless the --save_main_session pipeline option is set to True. Please see https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors for additional documentation on configuring your worker execution environment.

However, since there are issues installing tfds-nightly (#2827), I used regular tensorflow_datasets as a dependency in the workers (adding git+https://... did not work either, since git is not available on the workers). So I'm not sure whether the error I'm getting now is related to that or not.

@tvalentyn

I see a lot of Error syncing pod errors

This error indicates that a worker Docker container failed to start.
This typically happens when one of the containers is crashlooping on startup,
or a container image cannot be pulled.

If the container is crashlooping, look for an error message
describing the failure in worker-startup, harness-startup or docker logs.
Startup crashes are commonly caused by dependency issues with Python jobs,
which can be due to incompatible specified dependencies or network issues that
cause the specified repositories to be unreachable. Note that Apache Beam Python
SDK installed in the container should be the same as the SDK used to launch the
job.

If you are using a custom container image with your job and the workers are
unable to pull the image, verify that the image name and tag are correct,
and that the image is accessible by the Dataflow workers.

@versae
Author

versae commented Dec 6, 2020

After downgrading pip to "<20.0", we got it to work. It's been running for almost 18 hours now. So far we've had a few errors related to downloading WET files (an HTTP 503 for https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2015-27/segments/1435375090887.26/wet/CC-MAIN-20150627031810-00254-ip-10-179-60-89.ec2.internal.warc.wet.gz, an HTTP 500 for https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479627.17/wet/CC-MAIN-20190215224408-20190216010408-00436.warc.wet.gz, but also Remote end closed connection without response, Connection reset by peer, and Connection broken). Hopefully the impact of those errors is not too big.

It has shuffled almost 40TB of data from 132k WET files and counting (so nice to be able to see all the counters). I'm closing the issue and will write a small TL;DR for anyone following once it's done. Thank you so much for all the work, @adarob, @rezarokni, and @tvalentyn!

@versae versae closed this as completed Dec 6, 2020
@versae
Author

versae commented Dec 7, 2020

@adarob, @rezarokni, any way to estimate how long the process will take?

@versae
Author

versae commented Dec 7, 2020

If it crashes for any reason, is there a way to resume processing?

@tvalentyn

any way to estimate how long the process will take?

It's tricky and depends on the pipeline and even the dataset. Besides trying the pipeline on datasets of different sizes, it would be prudent to check that the pipeline can scale and that the data is ~evenly distributed between workers (no hot keys).

If it crashes for any reason, is there a way to resume processing?

I don't think that is possible at this time. However, Dataflow will retry a failing work item at least 4 times before giving up.

@versae
Author

versae commented Dec 8, 2020

We have enabled shuffle_mode=service; not sure if it mitigates the hot-key issue, but we haven't gotten any errors related to hot keys yet.

To be honest, we are a bit worried. This is costing us around $1000 a day. We are a small unit at the National Library of Norway trying to get a monolingual Norwegian BERT model released to the public. We are a non-profit European organization (no grant awarding, so not eligible for free GCP credits AFAICT). A rough estimate of the total time would be really helpful. It would be terrible to run out of funds before the process finishes, with no chance of resuming it later.

There is full activity on all the VMs (~80% CPU), and the job has processed 120TB of data so far. Here's a screenshot of the counters. Looking at the number of downloaded WET files (~320k), and assuming an average of 60k WET files per dump (there are 72), we estimate we are at about 15% of the processing after 60 hours, so we would still need 2 full weeks (400 hours) to get the Norwegian part of mC4. But we're not sure whether that's in any way realistic.

[Screenshot: counters]

@adarob
Member

adarob commented Dec 8, 2020

I've never built one of these datasets on Dataflow with more than 1 crawl (the default for English only). For multilingual, we used 72 crawls to help get enough data for the tail languages, but we only ran it on our internal system, which used many more workers. Based on this estimate (which actually covers only the first stage of processing), I'd highly suggest you reduce the number of crawls you use in the config.

@versae
Author

versae commented Dec 21, 2020

As you pointed out, it seems unreasonable for us to run this for all the crawls. It seems like even restricting it to one crawl will be a significant expense. We are trying to build a very large corpus for Norwegian (and the other Nordic languages), and this looks like a very good source. Are you able to give us a rough estimate (based on your experience) of how much data one crawl would give us? We do have access to the OSCAR dataset, which is also based on Common Crawl. Would you happen to know how a one- or two-crawl run would differ from that dataset?

@adarob
Member

adarob commented Dec 21, 2020 via email

@versae
Author

versae commented Dec 21, 2020

Sure, the languages we're interested in are Norwegian, Swedish, Danish, Icelandic, and Faroese (although I think that last one is not included in mC4). The ISO codes are no, sv, da, and is (Faroese would be fo).

@versae
Author

versae commented Dec 21, 2020

It'd also be great to know how many words or GB of raw text that would be.

@adarob
Member

adarob commented Dec 29, 2020

First off, I found a significant bottleneck that reduces the parallelism to ~71 instead of your number of workers. It will be fixed in #2895.

Here are the document counts from 1 Common Crawl dump:

beam:ParDo(_PredictLanguageFn):MetricName(namespace=language-filter, name=passed:da) | 3,163,698 (15 GB)
beam:ParDo(_PredictLanguageFn):MetricName(namespace=language-filter, name=passed:is) | 219,188 (1 GB)
beam:ParDo(_PredictLanguageFn):MetricName(namespace=language-filter, name=passed:no) | 2,617,118 (12 GB)
beam:ParDo(_PredictLanguageFn):MetricName(namespace=language-filter, name=passed:sv) | 5,900,912 (26 GB)
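
(Summed over the four languages, that is roughly 11.9 million documents, or about 54 GB of text, per dump.)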

@acul3

acul3 commented Jan 8, 2021

Hi @versae, thanks for creating the issue.
Did you manage to finish it successfully (extracting a specific language from mC4)?
If so, how long did it take?

I'm also planning to extract a specific language from mC4 using the Dataflow Beam pipeline.
Any tips on preparing it, or maybe code showing how to do it, would be helpful.
Thanks.

@versae
Author

versae commented Feb 9, 2021

Hi!

Sorry for the long hiatus. Thanks, @adarob. We're finally resuming work on this, and I hope to start the processing in the coming days. We might just start with the first dump, the last, and one in between, just to get an idea of how long it takes and how big it gets. The estimates of the number of documents per language really help, though.

@acul3, we were not able to finish the processing even after running 500 workers for over 36 hours. We'll prepare everything again and report back. If you have any insights, I'd be eager to hear them as well.

Cheers.

@versae
Author

versae commented Feb 25, 2021

I was finally able to try this again. A few things have changed and it seems the experimental flags are not needed anymore. The setup is as follows:

python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME/$DATASET_CONFIG \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="region=$GCS_BUCKET_REGION,runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATASET_NAME-$DATASET_CONFIG-1dump-gen,staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,dataflow_job_file=$GCS_BUCKET/job_file.json,requirements_file=/tmp/beam_requirements.txt,autoscaling_algorithm=NONE,num_workers=50" 2>&1 | tee nb-mc4-1dump.log

The beam_requirements.txt looks as follows:

apache-beam[gcp]
tensorflow-datasets[c4]
google-apitools
dill<0.3.2,>=0.3.1.1

The last main process logs:

INFO[dataflow_runner.py]: 2021-02-24T10:48:08.070Z: JOB_MESSAGE_DEBUG: Value "no-validation_write/GroupBucketsAndBoundaries/GroupByKey/Session" materialized.
INFO[dataflow_runner.py]: 2021-02-24T10:48:08.095Z: JOB_MESSAGE_DEBUG: Value "no-validation_write/GroupShards/Session" materialized.                 
INFO[dataflow_runner.py]: 2021-02-24T10:48:08.132Z: JOB_MESSAGE_DEBUG: Value "sv_write/GroupByBucket/Session" materialized.                          
INFO[dataflow_runner.py]: 2021-02-24T10:48:08.160Z: JOB_MESSAGE_DEBUG: Value "sv_write/GroupBucketsAndBoundaries/GroupByKey/Session" materialized.   
INFO[dataflow_runner.py]: 2021-02-24T10:48:08.196Z: JOB_MESSAGE_DEBUG: Value "sv_write/CombineBucketsSizes/CombineBucketsSizes/CombinePerKey/GroupByKey/Session" materialized.

I ran it for 24 hours on only one dump, CC-MAIN-2013-20. Everything ran fine for approximately 11 hours: I could see data being written to the bucket, counters counting, and worker CPUs at 90%-95% capacity. Then it all suddenly stopped. Worker CPU usage dropped to 4%, the counters stopped, and the bucket did not grow anymore. No critical or regular errors in the logs. No budget problem. Nothing. I really have no clue what is going on. I just restarted the process to see if anything changes.

[Screenshot: CPU utilization (All Workers)]

I'd be happy to open a new issue if that's more appropriate.

@Conchylicultor
Member

If the input pipeline has entered a phase where it is IO-bound, it might be normal for the CPU rate to drop.

@versae
Author

versae commented Feb 25, 2021

The throughput also drops to zero for all the pipeline stages, and although progress is at 100%, the stage is not marked as "succeeded". For example:

[Screenshot: stage at 100% but not marked as succeeded]

@versae
Author

versae commented Feb 26, 2021

It seems like I got past that problem on a second run, but it has now failed with a "No space left on device" error that I guess is coming from the workers as they try to write too many or too-large temporary files.

File "/usr/local/lib/python3.8/site-packages/dataflow_worker/shuffle.py", line 282, in __next__ return next(self.iterator)
File "/usr/local/lib/python3.8/site-packages/dataflow_worker/shuffle.py", line 240, in __iter__ chunk, next_position = self.reader.Read(start_position, end_position)
File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 135, in shuffle_client.PyShuffleReader.Read OSError: Shuffle read failed: b'INTERNAL: RPC error: IO error: /var/shuffle/sorted-dataset-1/1818: No space left on device, {"created":"@1614294826.207181169","description":"Error received from peer ipv4:10.128.0.123:12346","file":"third_party/grpc/src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"IO error: /var/shuffle/sorted-dataset-1/1818: No space left on device","grpc_status":13} when c4-nordic-1dump-gen-02250406-zsv5-harness-s9tq talking to c4-nordic-1dump-gen-02250406-zsv5-harness-rrsn:12346. [type.googleapis.com/util.MessageSetPayload=\'[dist_proc.dax.internal.TrailProto] { trail_point { source_file_loc { filepath: "dist_proc/dax/shuffle/sorter/shuffle_client_key_value_iterator.cc" line: 84 } } trail_point { source_file_loc { filepath: "dist_proc/dax/shuffle/sorter/chunked_sorted_reader.cc" line: 73 } } }\']'

        at __iter__ (/usr/local/lib/python3.8/site-packages/dataflow_worker/shuffle.py:441)
        at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:82)
        at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:80)
        at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:79)
        at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:64)
        at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:63)
        at execute (/usr/local/lib/python3.8/site-packages/dataflow_worker/executor.py:179)
        at do_work (/usr/local/lib/python3.8/site-packages/dataflow_worker/batchworker.py:649)

I'm not sure what a proper disk size for the workers is; I'm using whatever the default is now. I also don't know whether I should run it again with experiments=shuffle_mode=service or if that's just irrelevant at this point.

@adarob
Member

adarob commented Feb 26, 2021

experiments=shuffle_mode=service should reduce the files, but potentially only on GCS. @daphnei has also recently seen this issue. It appears to be new.

@tvalentyn

If the temporary files are created by shuffle, --experiments=shuffle_mode=service should help.
I think the default --disk_size_gb disk size is 250 GB; with the shuffle service it is 25 GB.

@versae
Author

versae commented Mar 1, 2021

It finished! 🎉 It successfully processed the 105TB and almost 2 billion files of the CC-MAIN-2013-20 dump. Using 75 workers, it took 31 hours; that's roughly 3.3TB per hour. I increased the disk size to 100GB (it might be unnecessarily big), forced v2 of the runner, and enabled shuffle mode. The total cost was around 950€.

DATASET_NAME=c4
DATASET_CONFIG=nordic
GCP_PROJECT=...
GCS_BUCKET=...
GCS_BUCKET_REGION=...

The command looks like this:

python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME/$DATASET_CONFIG \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="region=$GCS_BUCKET_REGION,runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATASET_NAME-$DATASET_CONFIG-1dump-gen,staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,dataflow_job_file=$GCS_BUCKET/job_file.json,requirements_file=/tmp/beam_requirements.txt,autoscaling_algorithm=NONE,disk_size_gb=100,num_workers=75,experiments=shuffle_mode=service,experiments=use_runner_v2," 2>&1 | tee nb-mc4-1dump.log

The beam_requirements.txt file contains:

apache-beam[gcp]
tensorflow-datasets[c4]
google-apitools
dill<0.3.2,>=0.3.1.1

And I patched 0db70eb locally to include a nordic config and to ignore the unknown languages: nordic_dump2013-20.patch.txt.

I did not delete the contents of the bucket from the previous run. Do runs cache any part of the process, so that after a failed run it takes less time to finish?

@daphnei
Contributor

daphnei commented Mar 9, 2021

Congrats on getting this working!

Out of curiosity, did you have to request a lot of quota increases from GCS to make it work, and if so, which quotas did you get increased?

The last time I tried this, I kept running into various out-of-quota warnings.

@sumanthd17

@versae Congratulations on getting this to work.

I would like to understand the finer details of this work. We are planning on doing this for English. Would it be possible to set up a meeting to discuss it?

Thanks in advance.

@versae
Author

versae commented Mar 9, 2021

Thanks, @daphnei, @sumanthd17.

@daphnei, the quota thing was the first problem I had. Thankfully, it was fixed by my manager. In my experience, having a budget estimate of resources and cost really helps when justifying how much money you need to run experiments.

@sumanthd17, I am not affiliated with Google. I'm sure @adarob and @rezarokni know the internals of their own work far better than I do. That being said, I'm happy to help.
