
Running C4 dataset pipeline on Cloud Dataflow - running time and resources #1931

Open
shepsels opened this issue Apr 22, 2020 · 33 comments

@shepsels

What I need help with / What I was wondering
I'm running the C4 Dataflow pipeline as described in this guide:
https://www.tensorflow.org/datasets/beam_datasets.
At first, I ran it without any restrictions, and it kept scaling up until it had used all of the free IP addresses across our entire Google Cloud account.
On the second run, we set max_workers to 20. It has been running for quite some time (~72 h), and we have no way to estimate how much longer it will run or whether there is an error (none are shown in the logs).

We'd be happy to know whether that is a reasonable running time, and to get some ways to inspect this pipeline and figure out our progress.

Thank you.

Environment information
(if applicable)

  • Python version: 3.7.4
  • tensorflow-datasets/tfds-nightly version: tfds-nightly
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow 2
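
On the "ways to inspect this pipeline" part of the question above: besides the Dataflow Jobs page in the Cloud Console (per-step progress, autoscaling history, worker logs), the same information is available from the CLI. A minimal sketch, assuming gcloud is authenticated for the project and the job was launched in the default us-central1 region:

# List active Dataflow jobs and note the JOB_ID of the c4 job
gcloud dataflow jobs list --region=us-central1 --status=active

# Overall job state and configuration
gcloud dataflow jobs describe JOB_ID --region=us-central1

# Per-step counters and element counts, useful for judging progress
gcloud dataflow metrics list JOB_ID --region=us-central1
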
@shepsels shepsels added the help label Apr 22, 2020
@Conchylicultor
Member

@adarob FYI

@adarob
Member

adarob commented Apr 22, 2020

I recently added this information to the T5 README (https://github.com/google-research/text-to-text-transfer-transformer#c4):

C4

The [C4][c4] dataset we created for unsupervised pre-training is available in TensorFlow Datasets, but it requires a significant amount of bandwidth for downloading the raw [Common Crawl][cc] scrapes (~7 TB) and compute for its preparation (~341 CPU-days). We suggest you take advantage of the [Apache Beam][beam] support in TFDS, which enables distributed preprocessing of the dataset and can be run on [Google Cloud Dataflow][gcd]. With 450 workers, the job should complete in ~18 hours.

After defining MY_PROJECT and MY_BUCKET appropriately, you can build the dataset on Dataflow from GCP using the following commands:

pip install tfds-nightly[c4,gcp]
echo 'tfds-nightly[c4]' > /tmp/beam_requirements.txt
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service"

@Conchylicultor
Member

@adarob Maybe we should also add the doc to the C4 description, or at least a link to the instructions: https://www.tensorflow.org/datasets/catalog/c4

@shepsels
Author

Thanks @adarob @Conchylicultor. We are using Beam and the download_and_prepare script. Good to know that nothing is stuck, just some more time to wait. And yes, I think it would be very useful to add that info to the main C4 info page; I guess that if I had known this before, I would have found a way to run this over the weekend on more CPUs.

Thank you again,
Paz.

@shepsels
Author

Hi again, I have a follow-up question. We started running the Dataflow pipeline ~25 hours ago, following your recommendations, with 7*64=448 CPU cores. The pipeline has already used 10,767 vCPU-hours, which is ~450 CPU-days. Is there a way for us to figure out how much longer it will be running? Or to understand whether something has gone wrong?

Thanks again,
Paz.
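
As a rough sanity check on those numbers (an aside; the assumption here is that the 448 vCPUs were busy for most of the ~25 elapsed hours):

python3 -c "print(448 * 24)"    # ~10752 vCPU-hours, which matches the ~10767 reported
python3 -c "print(10767 / 24)"  # ~449 CPU-days, already above the ~341 CPU-days quoted in the T5 README

So after ~25 hours the job had already consumed more compute than the published estimate for the whole preparation, which points to either the job being nearly done or the configuration differing from the README setup.
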

@shepsels shepsels reopened this Apr 24, 2020
@adarob
Member

adarob commented Apr 24, 2020 via email

@shepsels
Author

@adarob Not explicitly, no.

@shepsels
Author

@adarob Is this crucial (should I start again?), or will it just take some more time?

@adarob
Member

adarob commented Apr 24, 2020 via email

@shepsels
Author

I'd like to run it again with the correct parameters. But after it ran for so long before, I want to be 100% sure this configuration is complete and optimal. Can you please take a look and let me know if something is missing or not optimal?

python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="machine_type=n1-standard-64,disk_size_gb=5000,max_num_workers=7,project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service"

I added machine_type=n1-standard-64,disk_size_gb=5000,max_num_workers=7 because we have a quota on the number of machines in our billing account, and the default machine type is n1-standard-1 (which would require ~450 machines), so we would like to use 7 64-core machines instead.
I also set the disk size because I understood that the default is 250 GB per worker and I have fewer workers.
Is anything wrong with that? Will it do the job?
Thanks.

@adarob
Member

adarob commented Apr 24, 2020 via email

@roynirmal

@shepsels how long did it ultimately take to finish? I also cannot set 450 workers due to the quota limit, so I am planning to use the same beam_pipeline_options as yours.

@shepsels
Author

@roynirmal After some struggle, I raised the quota and ran it as @adarob suggested. It took less than 24 hours.

@roynirmal

@shepsels That's good to know! Can you tell me how you managed to raise the quota? I raised the CPU quota for us-central1 to 450; however, I am maxed out at 32 CPUs since apparently that's the global quota. I asked to raise it but got rejected. Any help is appreciated since I am up against a deadline.
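
For anyone hitting the same wall: the per-region CPUS quota and the project-wide quotas (including the global CPU limit that caps usage across all regions) are separate, and Dataflow workers are capped by whichever is lower, so both usually need to be raised. A quick way to see what is actually in effect (a sketch, assuming gcloud is pointed at the right project; the quotas are listed in the command output):

# Project-wide quotas, including the global CPU limit
gcloud compute project-info describe

# Per-region quotas for the region the job runs in (CPUS, IN_USE_ADDRESSES, DISKS_TOTAL_GB, ...)
gcloud compute regions describe us-central1
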

@adarob
Member

adarob commented Jul 29, 2020

If anyone wants to host the dataset in a public bucket for others to use, it is essentially free for the host with Requester Pays (https://cloud.google.com/storage/docs/requester-pays).
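
For reference, the Requester Pays setup described above comes down to one gsutil call on the host side, with downloaders then supplying their own billing project (a sketch; gs://some-public-c4-bucket is a hypothetical bucket name):

# Host: enable Requester Pays on the bucket holding the prepared dataset
gsutil requesterpays set on gs://some-public-c4-bucket

# Downloader: reads are billed to the downloader's own project, passed via -u
gsutil -u $MY_PROJECT -m cp -r gs://some-public-c4-bucket/tensorflow_datasets/c4 ./c4
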

@roynirmal

@adarob Good to know, I can host it once I have finished the download. With my current quota it looks like it will take around 10 days :\

@theTB

theTB commented Aug 11, 2020

@roynirmal @shepsels were you able to host the data by any chance? Would be really helpful, thanks!

@wnagele

wnagele commented Aug 17, 2020

I would also really appreciate it if I could use this dataset already processed. The cost of processing all of it is too high for my testing.

@roynirmal

Hey @theTB @wnagele, I did not download it in the end since the cost was too high! But I am open to the idea of sharing the cost of downloading the data.

@theTB

theTB commented Aug 23, 2020

Did you figure out the resources and cost required for processing? I tried using the free credits, but I keep hitting a quota limit on the number of in-use IP addresses, which doesn't allow me to scale to more workers (even though I have a lot more vCPUs available in my quota). Is there a workaround to avoid using so many IP addresses?

@roynirmal

Processing the whole dataset will definitely eat up the entire free credits.
I think Cloud Dataflow charges exorbitantly; I am not sure how much it would cost us even if we could indeed prepare the data in less than 24 hours. I was also allowed only 1 IP address, so the workaround is to use a single machine with multiple cores.
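
Another possible workaround for the in-use IP address quota (untested here; these are standard Beam/Dataflow worker options rather than anything TFDS-specific): run the workers without external IPs on a subnetwork that has Private Google Access enabled, so they do not count against the in-use address quota. The TFDS script appears to simply prefix each comma-separated item in --beam_pipeline_options with "--", so a bare flag should pass through, but that is worth double-checking. Something like:

--beam_pipeline_options="...,runner=DataflowRunner,no_use_public_ips,subnetwork=regions/us-central1/subnetworks/default"
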

@prashant-kikani

prashant-kikani commented Aug 29, 2020

Has anyone uploaded the cleaned C4 data (~750 GB)?

I also want to download only the clean C4 data, not the entire 7 TB CC data.

Thanks.

@adarob
Member

adarob commented Aug 29, 2020

@craffel

@feather820

> @roynirmal After some struggle, I raised the quota and ran it as @adarob suggested. It took less than 24 hours.

How much did it cost you to download the C4 dataset? I want to download it, but I'm afraid it will cost a lot.

@amchauhan

Hi @adarob @shepsels, are there any estimates on processing mC4 (the multilingual one, ≈26 TB) on GCP, in terms of both time and cost?

@spate141

spate141 commented Nov 8, 2022

@adarob @shepsels @feather820 @amchauhan
Came here after stumbling across many threads. If anyone has successfully processed the CC data, can we please have some numbers on the resources used and the time it took to process it?

@craffel
Contributor

craffel commented Nov 8, 2022

FYI, you can download a prepared and preprocessed C4 directly now: https://huggingface.co/datasets/allenai/c4
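
For anyone who only needs the prepared files, the Hugging Face copy can be pulled without any Beam/Dataflow setup. A sketch based on the usual git + git-lfs workflow for that repo (the exact include pattern for the English split is worth checking against the allenai/c4 dataset card):

git lfs install
# Clone LFS pointers only, then fetch just the English split
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
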

@spate141

spate141 commented Nov 8, 2022

Thanks @craffel, but I have a use case that requires processing the latest CC data dumps. The C4 from AllenAI seems to be the April 2019 version.

@kimsan0622

Hello @spate141,
I strongly recommend that you use AI2's preprocessed C4 release, as @craffel mentioned (I have also used it).
But if you want to process a new Common Crawl dump, it will cost you about 3,500 USD for one WET file (I spent 3,500 USD processing one WET file with 1,024 cores and the Dataflow API for the 'c4/en' split).

@spate141

spate141 commented Nov 8, 2022

@kimsan0622

@spate141
There was a mistake in my cost calculation. It cost 3,500 USD to process 2 WET files with Dataflow.

@versae

versae commented Nov 10, 2022

FWIW, these days it might be useful to try the olm-datasets approach, i.e., a single massive instance doing it all.

@spate141

Thanks @versae! olm-datasets seems pretty straightforward for processing CC data; that's exactly what I was looking for!
