
Running C4 dataset pipeline on Cloud Dataflow - running time and resources #1931

Open
shepsels opened this issue Apr 22, 2020 · 33 comments

@shepsels

What I need help with / What I was wondering
I'm running the C4 Dataflow pipeline as described in this guide:
https://www.tensorflow.org/datasets/beam_datasets.
At first, I ran it without any restrictions, and it kept scaling up until it had used all of the free IP addresses across our entire Google Cloud account.
On the second run, we set max_workers to 20. It has been running for quite some time (~72 h), and we have no way to estimate how much longer it will run or whether there is an error (none are shown in the logs).

We'd be happy to know whether that is a reasonable running time, and to get some ways to inspect this pipeline and figure out our progress.

Thank you.

Environment information
(if applicable)

  • Python version: 3.7.4
  • tensorflow-datasets/tfds-nightly version: tfds-nightly
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow 2
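
On the "ways to inspect this pipeline" part of the question above: besides the Dataflow Jobs page in the Cloud Console (per-step progress, autoscaling history, worker logs), the same information is available from the CLI. A minimal sketch, assuming gcloud is authenticated for the project and the job was launched in the default us-central1 region:

# List active Dataflow jobs and note the JOB_ID of the c4 job
gcloud dataflow jobs list --region=us-central1 --status=active

# Overall job state and configuration
gcloud dataflow jobs describe JOB_ID --region=us-central1

# Per-step counters and element counts, useful for judging progress
gcloud dataflow metrics list JOB_ID --region=us-central1
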
@shepsels shepsels added the help label Apr 22, 2020
@Conchylicultor
Member

@adarob FYI

@adarob
Member

adarob commented Apr 22, 2020

I recently added this information to the T5 README (https://github.com/google-research/text-to-text-transfer-transformer#c4):

C4

The [C4][c4] dataset we created for unsupervised pre-training is available in TensorFlow Datasets, but it requires a significant amount of bandwidth for downloading the raw [Common Crawl][cc] scrapes (~7 TB) and compute for its preparation (~341 CPU-days). We suggest you take advantage of the [Apache Beam][beam] support in TFDS, which enables distributed preprocessing of the dataset and can be run on [Google Cloud Dataflow][gcd]. With 450 workers, the job should complete in ~18 hours.

After defining MY_PROJECT and MY_BUCKET appropriately, you can build the dataset on Dataflow from GCP using the following commands:

pip install tfds-nightly[c4,gcp]
echo 'tfds-nightly[c4]' > /tmp/beam_requirements.txt
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service"

@Conchylicultor
Member

@adarob Maybe we should also add the doc to the C4 description, or at least a link to the instructions: https://www.tensorflow.org/datasets/catalog/c4

@shepsels
Author

Thanks @adarob @Conchylicultor. We are using Beam and the download_and_prepare script. Good to know that nothing is stuck, just some more time to wait. And yes, I think it would be very useful to add that info to the main C4 info page; I guess that if I had known this before, I would have found a way to run this over the weekend on more CPUs.

Thank you again,
Paz.

@shepsels
Author

Hi again, I have a follow-up question. We started running the Dataflow pipeline ~25 hours ago, following your recommendations, with 7*64=448 CPU cores. The pipeline has already used 10,767 vCPU-hours, which is ~450 CPU-days. Is there a way for us to figure out how much longer it will be running? Or to understand whether something has gone wrong?

Thanks again,
Paz.
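
As a rough sanity check on those numbers (an aside; the assumption here is that the 448 vCPUs were busy for most of the ~25 elapsed hours):

python3 -c "print(448 * 24)"    # ~10752 vCPU-hours, which matches the ~10767 reported
python3 -c "print(10767 / 24)"  # ~449 CPU-days, already above the ~341 CPU-days quoted in the T5 README

So after ~25 hours the job had already consumed more compute than the published estimate for the whole preparation, which points to either the job being nearly done or the configuration differing from the README setup.
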

@shepsels shepsels reopened this Apr 24, 2020
@adarob
Member

adarob commented Apr 24, 2020 via email

@shepsels
Author

@adarob Not explicitly, no.

@shepsels
Author

@adarob Is this crucial (should I start again?), or will it just take some more time?

@adarob
Member

adarob commented Apr 24, 2020 via email

@shepsels
Author

I'd like to run it again with the correct parameters. But after it ran for so long before, I want to be 100% sure this configuration is complete and optimal. Can you please take a look and let me know if something is missing or not optimal?

python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="machine_type=n1-standard-64,disk_size_gb=5000,max_num_workers=7,project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service"

I added machine_type=n1-standard-64,disk_size_gb=5000,max_num_workers=7 because we have a quota on the number of machines in our billing account, and the default machine type is n1-standard-1 (which would require ~450 machines), so we would like to use 7 64-core machines instead.
I also set the disk size because I understood that the default is 250 GB per worker and I have fewer workers.
Is anything wrong with that? Will it do the job?
Thanks.

@adarob
Member

adarob commented Apr 24, 2020 via email

@roynirmal

@shepsels how long did it ultimately take to finish? I also cannot set 450 workers due to the quota limit, so I am planning to use the same beam_pipeline_options as yours.

@shepsels
Author

@roynirmal After some struggle, I raised the quota and ran it as @adarob suggested. It took less than 24 hours.

@roynirmal

@shepsels That's good to know! Can you tell me how you managed to raise the quota? I raised the CPU quota for us-central1 to 450; however, I am maxed out at 32 CPUs since apparently that's the global quota. I asked to raise it but got rejected. Any help is appreciated since I am up against a deadline.
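
For anyone hitting the same wall: the per-region CPUS quota and the project-wide quotas (including the global CPU limit that caps usage across all regions) are separate, and Dataflow workers are capped by whichever is lower, so both usually need to be raised. A quick way to see what is actually in effect (a sketch, assuming gcloud is pointed at the right project; the quotas are listed in the command output):

# Project-wide quotas, including the global CPU limit
gcloud compute project-info describe

# Per-region quotas for the region the job runs in (CPUS, IN_USE_ADDRESSES, DISKS_TOTAL_GB, ...)
gcloud compute regions describe us-central1
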

@adarob
Member

adarob commented Jul 29, 2020

If anyone wants to host the dataset in a public bucket for others to use, it is essentially free for the host with Requester Pays (https://cloud.google.com/storage/docs/requester-pays).
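
For reference, the Requester Pays setup described above comes down to one gsutil call on the host side, with downloaders then supplying their own billing project (a sketch; gs://some-public-c4-bucket is a hypothetical bucket name):

# Host: enable Requester Pays on the bucket holding the prepared dataset
gsutil requesterpays set on gs://some-public-c4-bucket

# Downloader: reads are billed to the downloader's own project, passed via -u
gsutil -u $MY_PROJECT -m cp -r gs://some-public-c4-bucket/tensorflow_datasets/c4 ./c4
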

@roynirmal

@adarob Good to know, I can host it once I have finished the download. With my current quota it looks like it will take around 10 days :\

@theTB

theTB commented Aug 11, 2020

@roynirmal @shepsels were you able to host the data by any chance? Would be really helpful, thanks!

@wnagele

wnagele commented Aug 17, 2020

I would also really appreciate it if I could use this dataset already processed. The cost of processing all of it is too high for my testing.

@roynirmal

Hey @theTB @wnagele, I did not download it in the end since the cost was too high! But I am open to the idea of sharing the cost of downloading the data.

@theTB

theTB commented Aug 23, 2020

Did you figure out the resources and cost required for processing? I tried using the free credits, but I keep hitting a quota limit on the number of in-use IP addresses, which doesn't allow me to scale to more workers (even though I have a lot more vCPUs available in my quota). Is there a workaround to avoid using so many IP addresses?

@roynirmal

Processing the whole dataset will definitely eat up the entire free credits.
I think Cloud Dataflow charges exorbitantly; I am not sure how much it would cost us even if we could indeed prepare the data in less than 24 hours. I was also allowed only 1 IP address, so the workaround is to use a single machine with multiple cores.
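
Another possible workaround for the in-use IP address quota (untested here; these are standard Beam/Dataflow worker options rather than anything TFDS-specific): run the workers without external IPs on a subnetwork that has Private Google Access enabled, so they do not count against the in-use address quota. The TFDS script appears to simply prefix each comma-separated item in --beam_pipeline_options with "--", so a bare flag should pass through, but that is worth double-checking. Something like:

--beam_pipeline_options="...,runner=DataflowRunner,no_use_public_ips,subnetwork=regions/us-central1/subnetworks/default"
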

@prashant-kikani

prashant-kikani commented Aug 29, 2020

Has anyone uploaded the cleaned C4 data (~750 GB)?

I also want to download only the clean C4 data, not the entire 7 TB CC data.

Thanks.

@adarob
Member

adarob commented Aug 29, 2020

@craffel

@feather820

> @roynirmal After some struggle, I raised the quota and ran it as @adarob suggested. It took less than 24 hours.

How much did it cost you to download the C4 dataset? I want to download it, but I'm afraid it will cost a lot.

@amchauhan

Hi @adarob @shepsels, are there any estimates on processing mC4 (the multilingual one, ≈26 TB) on GCP, in terms of both time and cost?

@spate141

spate141 commented Nov 8, 2022

@adarob @shepsels @feather820 @amchauhan
Came here after stumbling across many threads. If anyone has successfully processed the CC data, can we please have some numbers on the resources used and the time it took to process it?

@craffel
Contributor

craffel commented Nov 8, 2022

FYI, you can download a prepared and preprocessed C4 directly now: https://huggingface.co/datasets/allenai/c4
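
For anyone who only needs the prepared files, the Hugging Face copy can be pulled without any Beam/Dataflow setup. A sketch based on the usual git + git-lfs workflow for that repo (the exact include pattern for the English split is worth checking against the allenai/c4 dataset card):

git lfs install
# Clone LFS pointers only, then fetch just the English split
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
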

@spate141

spate141 commented Nov 8, 2022

Thanks @craffel, but I have a use case that requires processing the latest CC data dumps. The C4 from AllenAI seems to be the April 2019 version.

@kimsan0622

Hello @spate141,
I strongly recommend that you use AI2's preprocessed C4 release, as @craffel mentioned (I have also used it).
But if you want to process a new Common Crawl dump, it will cost you about 3,500 USD for one WET file (I spent 3,500 USD processing one WET file with 1,024 cores and the Dataflow API for the 'c4/en' split).

@spate141

spate141 commented Nov 8, 2022

@kimsan0622

@spate141
There was a mistake in my cost calculation. It cost 3,500 USD to process 2 WET files with Dataflow.

@versae

versae commented Nov 10, 2022

FWIW, these days it might be useful to try the olm-datasets approach, i.e., a single massive instance doing it all.

@spate141

Thanks @versae! olm-datasets seems pretty straightforward for processing CC data; that's exactly what I was looking for!
