
c4/multilingual produces Dataflow job file too big (38MB >> 10MB) #2711

Closed
versae opened this issue Nov 9, 2020 · 115 comments
Labels
bug Something isn't working

Comments

@versae

versae commented Nov 9, 2020

Short description
We are trying to extract the Norwegian (and eventually other Nordic languages) portion of c4/multilingual. Since there is no easy way to download only the data for one language, we are processing the entire c4/multilingual corpus first.

Environment information

  • Operating System: Debian GNU/Linux 10

  • Python version: Python 3.8.5 (miniconda)

  • tensorflow-datasets/tfds-nightly version: 4.1.0 / 4.1.0.dev202011080107

  • tensorflow/tf-nightly version: 2.3.1 / 2.5.0.dev20201108 (tried with and without tf-nightly)

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?
    Yes, it does. We read all the issues related to C4 and incorporated the necessary changes: pinning the dill version and adding options for 450 workers and experiments=shuffle_mode=service in Apache Beam.

Reproduction instructions
On a clean VM with 8 vCPUs and 32 GB of RAM, we installed miniconda and ran the following commands:

DATASET_NAME=c4
DATASET_CONFIG=multilingual
GCP_PROJECT=...
GCS_BUCKET=...
GCS_BUCKET_REGION=...

# Add all 72 dumps
rm wet.paths.urls
echo "CC-MAIN-2013-20" >> wet.paths.urls
...
echo "CC-MAIN-2020-40" >> wet.paths.urls

# Put them in the bucket
for wetpath in `cat wet.paths.urls` ; do curl -s https://commoncrawl.s3.amazonaws.com/crawl-data/$wetpath/wet.paths.gz | gunzip | pv --name $wetpath --bytes | gsutil -q cp - "$GCS_BUCKET/tensorflow_datasets/downloads/manual/crawl-data/$wetpath/wet.paths" ; done

# Prepare requirements
rm /tmp/beam_requirements.txt
echo "tensorflow_datasets[$DATASET_NAME]" >> /tmp/beam_requirements.txt
echo "tfds-nightly[gcp,$DATASET_NAME]" >> /tmp/beam_requirements.txt
echo "google-apitools" >> /tmp/beam_requirements.txt
# there's an error with avro-python3 and dill, dill version needs to be fixed
# https://github.com/tensorflow/datasets/issues/2636#issuecomment-722551597
echo "dill==0.3.1.1" >> /tmp/beam_requirements.txt
python -m pip install tensorflow tf-nightly
python -m pip install -r /tmp/beam_requirements.txt

# Run main command
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME/$DATASET_CONFIG \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options=\
"region=$GCS_BUCKET_REGION,runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,"\
"dataflow_job_file=gs://$GCS_BUCKET/job_file.json,"\
"requirements_file=/tmp/beam_requirements.txt,max_num_workers=450,experiments=shuffle_mode=service" 2>&1 | tee nb-mc4.log

Link to logs
We removed information about our project and bucket in the logs:

Expected behavior
We expected the script to successfully launch the pipeline on Dataflow, but the JSON job file seems to be too big (37.5 MB when the max is 10 MB), so all we get is a "Your client issued a request that was too large" error message (formatted as an HTML page in the console output).

Sample of the output

  <title>Error 413 (Request Entity Too Large)!!1</title>
  <p><b>413.</b> <ins>That's an error.</ins>
  <p>Your client issued a request that was too large.</p>
  <ins>That's all we know.</ins>

Additional context
If there is any other way to extract a language portion of c4/multilingual we'd be eager to try it as well.

Update (March 1st, 2021): Instructions for successfully running the pipeline using one dump are detailed in #2711 (comment) below.
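
(For reference, a minimal sketch, assuming a finished c4/multilingual build already sits in your data_dir: each language is exposed as its own split, following the per-language naming seen later in this thread, e.g. "no" and "no-validation", so reading a single language afterwards would look roughly like this. The bucket path is a placeholder.)

import tensorflow_datasets as tfds

# Hypothetical: assumes the c4/multilingual build is already finished in this
# data_dir; producing that build is exactly what the rest of this issue is about.
ds = tfds.load(
    "c4/multilingual",
    split="no",  # per-language splits, e.g. "no", "no-validation"
    data_dir="gs://YOUR_BUCKET/tensorflow_datasets",
)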

@versae versae added the bug Something isn't working label Nov 9, 2020
@Conchylicultor
Member

@adarob FYI

@adarob
Member

adarob commented Nov 9, 2020

@versae I'd recommend generating all languages you want at the same time since it has to process all of the data either way. However, I think if you added a new config with only the languages you want, you'd avoid the JSON issue.

@Conchylicultor we would need to allow them to add a new config to the dataset with only the languages they are interested in. What's the recommended way for them to do that?

@versae
Author

versae commented Nov 9, 2020

Thanks for the quick replies, @Conchylicultor and @adarob. We thought about the custom config, but had trouble figuring out how to make it work. Even with a custom config, it's still not clear to us how to run the download_and_prepare code, either in command or script mode.

After inspecting the source code, it seems it should be possible to process the data for only one language, even if we have to download everything, but we're not sure how to do it.

@adarob
Member

adarob commented Nov 9, 2020

You should be able to add a config similar to 'multilingual', but with only the Nordic languages listed. Let's say you name it 'nordic'. Then you can call download_and_prepare with c4/nordic as the dataset.

You'd make the change to a local clone of the repo and then you can run pip install -e . to install it on the master VM.
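
(For anyone following along, a rough sketch of what such an entry might look like in a local clone of tensorflow_datasets/text/c4.py. The keyword arguments here are assumptions modelled on the existing "multilingual" entry and may not match your checkout exactly; versae's actual patch is attached further down the thread.)

# Hypothetical "nordic" config added next to the existing entries in c4.py.
# Copy whatever arguments the "multilingual" entry passes in your checkout,
# changing only the config name and the language list.
NORDIC_LANGUAGES = ["da", "is", "no", "sv"]

C4Config(
    "nordic",                    # selected on the command line as c4/nordic
    languages=NORDIC_LANGUAGES,  # only the languages you want splits for
    # ...remaining keyword arguments copied verbatim from "multilingual"...
),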

@versae
Author

versae commented Nov 9, 2020

I see. It feels unnecessarily complicated, but we'll give it a try 🤞 Really, it'd be great if some job-file size control were allowed or implemented directly in the download_and_prepare script. The official docs recommend restructuring the pipeline to avoid such errors. We hope that downloading only the Nordic or Norwegian languages will do the trick, but honestly the issue seems to run a bit deeper.

I wonder: if the limit is 10 MB and the current code generates a job file of almost 40 MB, is there a way to lift this limitation, even temporarily, so we can at least launch the job?

In any case, thanks for the help.

@adarob
Member

adarob commented Nov 9, 2020

The issue is the large number of splits (>100) produced by this pipeline. Limiting it to ~10 languages should reduce the JSON size by a factor of 10, I believe.

Other options to fix the deeper problem would involve either having the Dataflow team raise the limit or merging some of the downstream steps in the TFDS sharding and writing portions of the pipeline.

@versae
Author

versae commented Nov 9, 2020

Thinking about it: if the downloading and processing of the data happens in the workers, the information about the languages isn't available until the job has already been submitted and is executing, is it? Limiting the languages, or for that matter selecting specific splits, would still generate the same job file to be sent to Dataflow, I believe.

@adarob
Member

adarob commented Nov 9, 2020

The actual pipeline "blueprint" is pre-generated and sent to the Dataflow service, which is where I think your issue is. This blueprint includes all of the per-language stages.

@versae
Author

versae commented Nov 9, 2020

We finally tried our custom config for Norwegian (excluding no-validation), but the job file is still too big (10.2 MB), so Dataflow is rejecting the request for being 200 KB over the limit :\

@adarob
Member

adarob commented Nov 9, 2020

Can you try also removing c4_utils.UNKNOWN_LANGUAGE here

for lang in self.builder_config.languages + [c4_utils.UNKNOWN_LANGUAGE]:
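
(A minimal sketch of the suggested edit, with the loop body elided; only the languages listed in the config would then get splits:)

# before: also generates splits for text whose language was not recognized
for lang in self.builder_config.languages + [c4_utils.UNKNOWN_LANGUAGE]:
    ...

# after: restrict split generation to the configured languages only
for lang in self.builder_config.languages:
    ...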

@versae
Author

versae commented Nov 9, 2020

Tried with and without that line. The request now went through, but Dataflow still failed and complained about the size, since the job graph is still a tiny bit larger than 10 MB (~10.2 MB):

{
  "error": {
    "code": 400,
    "message": "(ffddc5dbfa10dae9): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
    "status": "INVALID_ARGUMENT"
  }
}

Adding the updated job_file.json (compressed) in case it's useful. There are a couple of big chunks of binary data taking up most of the size of the file.

Wondering if the 10 MB hard limit could be lifted, at least temporarily or for individual projects.

@adarob
Member

adarob commented Nov 10, 2020

I'll see if we can get some help from the dataflow team.

@versae
Author

versae commented Nov 10, 2020

Thank you so much.

@rezarokni

If possible, it's best to reduce the number of nodes in the graph, for example by multiplexing the values into a source. @adarob, we should look at the pipeline in more detail with that in mind.

In the meantime, you can make use of --experiments=upload_graph in the Dataflow pipeline arguments, which allows pipelines larger than 10 MB. Note, though, that things like the UI will have limitations with this experimental flag.

@versae
Author

versae commented Nov 11, 2020

Interesting. Does this experiments=upload_graph argument replace experiments=shuffle_mode=service, or is there a way to have them both?

@rezarokni

You can have both in a list
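
(At the Beam level, --experiments is an appending flag, so repeated values accumulate into one list. A minimal Python sketch of that behaviour, separate from the download_and_prepare wrapper used in this thread:)

from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions

# Repeated --experiments flags end up together in a single list.
options = PipelineOptions([
    "--experiments=shuffle_mode=service",
    "--experiments=upload_graph",
])
print(options.view_as(DebugOptions).experiments)
# expected: ['shuffle_mode=service', 'upload_graph']

With the comma-separated --beam_pipeline_options string used in this thread, the same effect is achieved by repeating the experiments= entry; the working command posted later does exactly this (experiments=shuffle_mode=service,experiments=use_runner_v2).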

@versae
Author

versae commented Nov 11, 2020

The job now went through and it appears workers are being started. Adding more languages to the config also works.

Just one question, though: should we use max_num_workers or num_workers directly?

Thanks!

@versae versae closed this as completed Nov 11, 2020
@adarob
Member

adarob commented Nov 11, 2020 via email

@versae
Author

versae commented Nov 12, 2020

It's been running for 23 hours now, but we don't see any progress in the process log, the Dataflow log, or the Dataflow graph. The total allocated HDD is 10.99 TB, which is insufficient for the entire corpus, although we don't know whether the corpus is downloaded in chunks, processed, and then discarded.

Moreover, nothing is being written to the bucket, and after a few hours we started to see a lot of messages like these:

I1112 06:43:57.139022 139788711368448 transport.py:183] Refreshing due to a 401 (attempt 1/2)
I1112 06:43:58.973949 139788711368448 transport.py:183] Refreshing due to a 401 (attempt 1/2)
I1112 07:18:52.325422 139788711368448 transport.py:183] Refreshing due to a 401 (attempt 1/2)
I1112 07:18:54.354025 139788711368448 transport.py:183] Refreshing due to a 401 (attempt 1/2)
I1112 08:18:31.006261 139788711368448 transport.py:183] Refreshing due to a 401 (attempt 1/2)

Nothing else is written to the process log. Our guess is that the 401 is some sort of unauthorized OAuth bearer token issue? But we don't know whether we should worry about it, just let it run for another couple of hours, or stop it right away. It's been some 24 expensive hours for us running 450 workers :)

@adarob
Member

adarob commented Nov 12, 2020

Do you not see any counters? @rezarokni, is this a side effect of upload_graph?

If the job is not crashing, my suspicion is that it's working. The input dataset is 71x the size of the original C4, so it's going to take quite a bit longer. I'm not sure how much longer, but it should be less than 71x as long.

@versae
Author

versae commented Nov 12, 2020

If by counters you mean the stages of each box in the diagram: a few boxes at the beginning are outlined in dashed green (started), but most of them are greyed out (not even started yet). All counters are at 0 (zero); none has reported success yet. Attaching a screenshot for reference.

[Screenshot: Dataflow pipeline graph]

PS: If this is off-topic now for the current issue, I can create another issue and move the discussion there.

@adarob
Member

adarob commented Nov 12, 2020

When the job starts running properly, you should see a "Custom Counters" section on the right as well, assuming the upload_graph option isn't disabling it somehow.

Just to be clear, have you downloaded all of the WET files to the manual directory?

@versae
Author

versae commented Nov 12, 2020

We have 72 files like this one in the bucket: BUCKET/tensorflow_datasets/downloads/manual/crawl-data/CC-MAIN-2013-20/wet.paths, one for each Common Crawl dump. Are the wet.paths files what we need, or should we have downloaded the actual contents of all the URLs inside the WET files into the bucket?

We used the following code to put the wet.paths files in the bucket:

# Add all 72 dumps
rm wet.paths.urls
echo "CC-MAIN-2013-20" >> wet.paths.urls
...
echo "CC-MAIN-2020-40" >> wet.paths.urls

# Put them in the bucket
for wetpath in `cat wet.paths.urls` ; do curl -s https://commoncrawl.s3.amazonaws.com/crawl-data/$wetpath/wet.paths.gz | gunzip | pv --name $wetpath --bytes | gsutil -q cp - "$GCS_BUCKET/tensorflow_datasets/downloads/manual/crawl-data/$wetpath/wet.paths" ; done

@versae
Author

versae commented Nov 13, 2020

Is there any other indicator or flag that the job is actually running?

@rezarokni

The experimental upload_graph option can cause issues in the UI.
If you click through to the logs from the UI, you should still see them in the logging UI, which should give some indication.
You may also be able to see some of the metrics using the gcloud CLI:
https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/metrics

@versae
Author

versae commented Nov 13, 2020

Attaching the Dataflow metrics as returned by the CLI command: c4-nordic-gen.metrics.txt. Nothing of interest there, we believe.

In the UI, the worker logs are empty, and the job log hasn't shown anything in the last 36 hours:
[Screenshot: empty job log]

The download_and_prepare script keeps printing the 401 message a couple of times every hour, always (attempt 1/2). We are starting to really worry that nothing is happening.

@versae
Author

versae commented Nov 13, 2020

Our only hope is that some workers look like this:
[Screenshot: worker activity]

@adarob
Member

adarob commented Nov 13, 2020 via email

@versae
Author

versae commented Dec 4, 2020

Is the custom config needed in the workers, or only by the main process launching the Dataflow job?

@adarob
Member

adarob commented Dec 4, 2020

The custom config is not needed by the workers. It's okay that the installed version doesn't include it.

Just to be clear, I was able to run this with no issue back when I submitted #2734. I'm not sure why you are still having trouble.

@rezarokni

rezarokni commented Dec 4, 2020

@versae, it's best to delete the comment with the txt file, as the raw info is not needed.

@versae
Author

versae commented Dec 4, 2020

@adarob, @rezarokni, thanks for all the help, we're all in good faith here :) I wish it just worked for me. After removing the local clone from the worker dependencies and getting rid of the experimental flag (the job graph file is now 1.3MB instead of 13MB), it seems we have reached a milestone. I can now see worker logs in the UI, and the first stage (counter) has completed successfully. We are now getting these tracebacks in the workers:

An exception was raised when trying to execute the workitem 7861000946831734554 : Traceback (most recent call last):
  ...
  File "/home/versae/datasets/tensorflow_datasets/text/c4.py", line 403, in download_wet_file
    name=f"{lang}-validation",
NameError: name 'uuid' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
  File "/home/versae/datasets/tensorflow_datasets/text/c4.py", line 403, in download_wet_file
    name=f"{lang}-validation",
NameError: name 'uuid' is not defined [while running 'Map(download_wet_file)']

Note: imports, functions and other variables defined in the global context of your __main__ file of your Dataflow pipeline are, by default, not available in the worker execution environment, and such references will cause a NameError, unless the --save_main_session pipeline option is set to True. Please see https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors for additional documentation on configuring your worker execution environment.

However, since there are issues installing tfds-nightly (#2827), I used regular tensorflow_datasets as a dependency in the workers (adding git+https://... did not work either, since git is not available on the workers). So I'm not sure whether the error I'm getting now is related to that or not.

@tvalentyn

I see a lot of Error syncing pod errors

This error indicates that a worker Docker container failed to start.
This typically happens when one of the containers is crashlooping on startup,
or a container image cannot be pulled.

If the container is crashlooping, look for an error message
describing the failure in worker-startup, harness-startup or docker logs.
Startup crashes are commonly caused by dependency issues with Python jobs,
which can be due to incompatible specified dependencies or network issues that
cause the specified repositories to be unreachable. Note that Apache Beam Python
SDK installed in the container should be the same as the SDK used to launch the
job.

If you are using a custom container image with your job and the workers are
unable to pull the image, verify that the image name and tag are correct,
and that the image is accessible by the Dataflow workers.

@versae
Author

versae commented Dec 6, 2020

After downgrading pip to "<20.0", we got it to work. It's been running for almost 18 hours now. So far we've had a few errors related to downloading WET files (an HTTP 503 for https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2015-27/segments/1435375090887.26/wet/CC-MAIN-20150627031810-00254-ip-10-179-60-89.ec2.internal.warc.wet.gz, an HTTP 500 for https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479627.17/wet/CC-MAIN-20190215224408-20190216010408-00436.warc.wet.gz, but also Remote end closed connection without response, Connection reset by peer, and Connection broken). Hopefully the impact of those errors is not too big.

It has shuffled almost 40TB of data from 132k WET files and counting (so nice to be able to see all the counters). I'm closing the issue and will write a small TL;DR for anyone following once it's done. Thank you so much for all the work, @adarob, @rezarokni, and @tvalentyn!

@versae versae closed this as completed Dec 6, 2020
@versae
Author

versae commented Dec 7, 2020

@adarob, @rezarokni, any way to estimate how long the process will take?

@versae
Author

versae commented Dec 7, 2020

If it crashes for any reason, is there a way to resume processing?

@tvalentyn

any way to estimate how long the process will take?

It's tricky and depends on the pipeline and even the dataset. Besides trying the pipeline on datasets of different sizes, it would be prudent to check that the pipeline can scale and that the data is ~evenly distributed between workers (no hot keys).

If it crashes for any reason, is there a way to resume processing?

I don't think that is possible at this time. However, Dataflow will retry a failing work item at least 4 times before giving up.

@versae
Author

versae commented Dec 8, 2020

We have enabled shuffle_mode=service; not sure if it mitigates the hot-key issue, but we haven't gotten any errors related to hot keys yet.

To be honest, we are a bit worried. This is costing us around $1000 a day. We are a small unit at the National Library of Norway trying to get a monolingual Norwegian BERT model released to the public. We are a non-profit European organization (no grant awarding, so not eligible for free GCP credits AFAICT). A rough estimate of the total time would be really helpful. It would be terrible to run out of funds before the process finishes, with no chance of resuming it later.

There is full activity on all the VMs (~80% CPU), and the job has processed 120TB of data so far. Here's a screenshot of the counters. Looking at the number of downloaded WET files (~320k), and assuming an average of 60k WET files per dump (there are 72), we estimate we are at about 15% of the processing after 60 hours, so we would still need 2 full weeks (400 hours) to get the Norwegian part of mC4. But we're not sure whether that's in any way realistic.

[Screenshot: counters]

@adarob
Member

adarob commented Dec 8, 2020

I've never built one of these datasets on Dataflow with more than 1 crawl (the default for English only). For multilingual, we used 72 crawls to help get enough data for the tail languages, but we only ran it on our internal system, which used many more workers. Based on this estimate (which actually covers only the first stage of processing), I'd highly suggest you reduce the number of crawls you use in the config.

@versae
Author

versae commented Dec 21, 2020

As you pointed out, it seems unreasonable for us to run this for all the crawls. It seems like even restricting it to one crawl will be a significant expense. We are trying to build a very large corpus for Norwegian (and the other Nordic languages), and this looks like a very good source. Are you able to give us a rough estimate (based on your experience) of how much data one crawl would give us? We do have access to the OSCAR dataset, which is also based on Common Crawl. Would you happen to know how a one- or two-crawl run would differ from that dataset?

@adarob
Member

adarob commented Dec 21, 2020 via email

@versae
Author

versae commented Dec 21, 2020

Sure, the languages we're interested in are Norwegian, Swedish, Danish, Icelandic, and Faroese (although I think that last one is not included in mC4). The ISO codes are no, sv, da, and is (Faroese would be fo).

@versae
Author

versae commented Dec 21, 2020

It'd also be great to know how many words or GB of raw text that would be.

@adarob
Member

adarob commented Dec 29, 2020

First off, I found a significant bottleneck that reduces the parallelism to ~71 instead of your number of workers. It will be fixed in #2895.

Here are the document counts from 1 Common Crawl dump:

beam:ParDo(_PredictLanguageFn):MetricName(namespace=language-filter, name=passed:da) | 3,163,698 (15 GB)
beam:ParDo(_PredictLanguageFn):MetricName(namespace=language-filter, name=passed:is) | 219,188 (1 GB)
beam:ParDo(_PredictLanguageFn):MetricName(namespace=language-filter, name=passed:no) | 2,617,118 (12 GB)
beam:ParDo(_PredictLanguageFn):MetricName(namespace=language-filter, name=passed:sv) | 5,900,912 (26 GB)
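
(Summed over the four languages, that is roughly 11.9 million documents, or about 54 GB of text, per dump.)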

@acul3

acul3 commented Jan 8, 2021

Hi @versae, thanks for creating the issue.
Did you manage to finish it successfully (extracting a specific language from mC4)?
If so, how long did it take?

I'm also planning to extract a specific language from mC4 using the Dataflow Beam pipeline.
Any tips on preparing it, or maybe code showing how to do it, would be helpful.
Thanks.

@versae
Author

versae commented Feb 9, 2021

Hi!

Sorry for the long hiatus. Thanks, @adarob. We're finally resuming work on this, and I hope to start the processing in the coming days. We might just start with the first dump, the last, and one in between, just to get an idea of how long it takes and how big it gets. The estimates of the number of documents per language really help, though.

@acul3, we were not able to finish the processing even after running 500 workers for over 36 hours. We'll prepare everything again and report back. If you have any insights, I'd be eager to hear them as well.

Cheers.

@versae
Author

versae commented Feb 25, 2021

I was finally able to try this again. A few things have changed and it seems the experimental flags are not needed anymore. The setup is as follows:

python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME/$DATASET_CONFIG \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="region=$GCS_BUCKET_REGION,runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATASET_NAME-$DATASET_CONFIG-1dump-gen,staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,dataflow_job_file=$GCS_BUCKET/job_file.json,requirements_file=/tmp/beam_requirements.txt,autoscaling_algorithm=NONE,num_workers=50" 2>&1 | tee nb-mc4-1dump.log

The beam_requirements.txt looks as follows:

apache-beam[gcp]
tensorflow-datasets[c4]
google-apitools
dill<0.3.2,>=0.3.1.1

The last main process logs:

INFO[dataflow_runner.py]: 2021-02-24T10:48:08.070Z: JOB_MESSAGE_DEBUG: Value "no-validation_write/GroupBucketsAndBoundaries/GroupByKey/Session" materialized.
INFO[dataflow_runner.py]: 2021-02-24T10:48:08.095Z: JOB_MESSAGE_DEBUG: Value "no-validation_write/GroupShards/Session" materialized.                 
INFO[dataflow_runner.py]: 2021-02-24T10:48:08.132Z: JOB_MESSAGE_DEBUG: Value "sv_write/GroupByBucket/Session" materialized.                          
INFO[dataflow_runner.py]: 2021-02-24T10:48:08.160Z: JOB_MESSAGE_DEBUG: Value "sv_write/GroupBucketsAndBoundaries/GroupByKey/Session" materialized.   
INFO[dataflow_runner.py]: 2021-02-24T10:48:08.196Z: JOB_MESSAGE_DEBUG: Value "sv_write/CombineBucketsSizes/CombineBucketsSizes/CombinePerKey/GroupByKey/Session" materialized.

I ran it for 24 hours on only one dump, CC-MAIN-2013-20. Everything ran fine for approximately 11 hours: I could see data being written to the bucket, counters counting, and worker CPUs at 90%-95% capacity. Then it all suddenly stopped. Worker CPU usage dropped to 4%, the counters stopped, and the bucket did not grow anymore. No critical or regular errors in the logs. No budget problem. Nothing. I really have no clue what is going on. I just restarted the process to see if anything changes.

[Screenshot: CPU utilization (All Workers)]

I'd be happy to open a new issue if that's more appropriate.

@Conchylicultor
Member

If the input pipeline has entered a phase where it is IO-bound, it might be normal for the CPU rate to drop.

@versae
Author

versae commented Feb 25, 2021

The throughput also drops to zero for all the pipeline stages, and although progress is at 100%, the stage is not marked as "succeeded". For example:

[Screenshot: stage at 100% but not marked as succeeded]

@versae
Author

versae commented Feb 26, 2021

It seems like I got past that problem on a second run, but it has now failed with a "No space left on device" error that I guess is coming from the workers as they try to write too many or too-large temporary files.

File "/usr/local/lib/python3.8/site-packages/dataflow_worker/shuffle.py", line 282, in __next__ return next(self.iterator)
File "/usr/local/lib/python3.8/site-packages/dataflow_worker/shuffle.py", line 240, in __iter__ chunk, next_position = self.reader.Read(start_position, end_position)
File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 135, in shuffle_client.PyShuffleReader.Read OSError: Shuffle read failed: b'INTERNAL: RPC error: IO error: /var/shuffle/sorted-dataset-1/1818: No space left on device, {"created":"@1614294826.207181169","description":"Error received from peer ipv4:10.128.0.123:12346","file":"third_party/grpc/src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"IO error: /var/shuffle/sorted-dataset-1/1818: No space left on device","grpc_status":13} when c4-nordic-1dump-gen-02250406-zsv5-harness-s9tq talking to c4-nordic-1dump-gen-02250406-zsv5-harness-rrsn:12346. [type.googleapis.com/util.MessageSetPayload=\'[dist_proc.dax.internal.TrailProto] { trail_point { source_file_loc { filepath: "dist_proc/dax/shuffle/sorter/shuffle_client_key_value_iterator.cc" line: 84 } } trail_point { source_file_loc { filepath: "dist_proc/dax/shuffle/sorter/chunked_sorted_reader.cc" line: 73 } } }\']'

        at __iter__ (/usr/local/lib/python3.8/site-packages/dataflow_worker/shuffle.py:441)
        at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:82)
        at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:80)
        at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:79)
        at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:64)
        at dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start (shuffle_operations.py:63)
        at execute (/usr/local/lib/python3.8/site-packages/dataflow_worker/executor.py:179)
        at do_work (/usr/local/lib/python3.8/site-packages/dataflow_worker/batchworker.py:649)

I'm not sure what a proper disk size for the workers is; I'm using whatever the default is now. I also don't know whether I should run it again with experiments=shuffle_mode=service or if that's just irrelevant at this point.

@adarob
Member

adarob commented Feb 26, 2021

experiments=shuffle_mode=service should reduce the files, but potentially only on GCS. @daphnei has also recently seen this issue. It appears to be new.

@tvalentyn

If the temporary files are created by shuffle, --experiments=shuffle_mode=service should help.
I think the default --disk_size_gb disk size is 250 GB; with the shuffle service it is 25 GB.

@versae
Author

versae commented Mar 1, 2021

It finished! 🎉 It successfully processed the 105TB and almost 2 billion files of the CC-MAIN-2013-20 dump. Using 75 workers, it took 31 hours; that's roughly 3.3TB per hour. I increased the disk size to 100GB (it might be unnecessarily big), forced v2 of the runner, and enabled shuffle mode. The total cost was around 950€.

DATASET_NAME=c4
DATASET_CONFIG=nordic
GCP_PROJECT=...
GCS_BUCKET=...
GCS_BUCKET_REGION=...

The command looks like this:

python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME/$DATASET_CONFIG \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="region=$GCS_BUCKET_REGION,runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATASET_NAME-$DATASET_CONFIG-1dump-gen,staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,dataflow_job_file=$GCS_BUCKET/job_file.json,requirements_file=/tmp/beam_requirements.txt,autoscaling_algorithm=NONE,disk_size_gb=100,num_workers=75,experiments=shuffle_mode=service,experiments=use_runner_v2," 2>&1 | tee nb-mc4-1dump.log

The beam_requirements.txt file contains:

apache-beam[gcp]
tensorflow-datasets[c4]
google-apitools
dill<0.3.2,>=0.3.1.1

And I patched 0db70eb locally to include a nordic config and to ignore the unknown languages: nordic_dump2013-20.patch.txt.

I did not delete the contents of the bucket from the previous run. Do runs cache any part of the process, so that after a failed run it takes less time to finish?

@daphnei
Contributor

daphnei commented Mar 9, 2021

Congrats on getting this working!

Out of curiosity, did you have to request a lot of quota increases from GCS to make it work, and if so, which quotas did you get increased?

The last time I tried this, I kept running into various out-of-quota warnings.

@sumanthd17

@versae Congratulations on getting this to work.

I would like to understand the finer details of this work. We are planning on doing this for English. Would it be possible to set up a meeting to discuss it?

Thanks in advance.

@versae
Author

versae commented Mar 9, 2021

Thanks, @daphnei, @sumanthd17.

@daphnei, the quota thing was the first problem I had. Thankfully, it was fixed by my manager. In my experience, having a budget estimate of resources and cost really helps when justifying how much money you need to run experiments.

@sumanthd17, I am not affiliated with Google. I'm sure @adarob and @rezarokni know the internals of their own work far better than I do. That being said, I'm happy to help.
