Running C4 dataset pipeline on Cloud Dataflow - running time and resources #1931
Comments
@adarob FYI
I recently added this information to the T5 README (https://github.com/google-research/text-to-text-transfer-transformer#c4):

The [C4][c4] dataset we created for unsupervised pre-training is available in TensorFlow Datasets, but it requires a significant amount of bandwidth for downloading the raw [Common Crawl][cc] scrapes (~7 TB) and compute for its preparation (~341 CPU-days). We suggest you take advantage of the [Apache Beam][beam] support in TFDS, which enables distributed preprocessing of the dataset and can be run on [Google Cloud Dataflow][gcd]. With 450 workers, the job should complete in ~18 hours. After defining MY_BUCKET and MY_PROJECT for your GCP setup, install the dependencies, write a requirements file for the Beam workers, and launch the job:

```
pip install tfds-nightly[c4,gcp]
echo 'tfds-nightly[c4]' > /tmp/beam_requirements.txt
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service"
```
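As a side note on monitoring (not from the README itself): the job's progress is visible on the Dataflow page of the Cloud Console, and, assuming the gcloud CLI is installed and authenticated, from the command line as well. A minimal sketch; the region is an assumption, adjust it to wherever your job runs:

```bash
# List active Dataflow jobs with their current state.
gcloud dataflow jobs list --region=us-central1 --status=active

# Inspect one job in detail (JOB_ID is a placeholder copied from the list output).
gcloud dataflow jobs describe JOB_ID --region=us-central1
```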
@adarob Maybe we should also add the doc in
Thanks @adarob @Conchylicultor. We are using Beam and the download_and_prepare script. Good to know that nothing is stuck and we just need to wait some more. And yes, I think it would be very useful to add that info to the main C4 info page; if I had known this beforehand, I would have found a way to run this over the weekend on more CPUs. Thank you again,
Hi again, I have a follow-up question. We started running the Dataflow pipeline ~25 hours ago, following your recommendations, with 7 × 64 = 448 CPU cores. The pipeline has already used 10,767 vCPU-hours, which is ~450 CPU-days. Is there a way for us to figure out how much longer it will run, or to tell whether something has gone wrong? Thanks again,
Did you enable the shuffle service?
@adarob Not explicitly, no.
@adarob Is this crucial (should I start again?), or will it just take some more time?
What exact command did you use to launch?

I haven't tried without the shuffle service. It will certainly take longer and use more memory, which could cause you to OOM. It's worth a try, though!
I'd like to run it again with the correct parameters. But after it ran for so long before, I want to be 100% sure this configuration is complete and optimal. Can you please take a look and let me know if something is missing or suboptimal?

```
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="machine_type=n1-standard-64,disk_size_gb=5000,max_num_workers=7,project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service"
```

I added machine_type=n1-standard-64,disk_size_gb=5000,max_num_workers=7 because we have a quota on the number of machines in our billing account, and the default machine type is n1-standard-1 (which would require ~450 machines), so we would like to use 7 64-core machines instead. I also set the disk size because I understood that the default is 250 GB per worker and I have fewer workers. Anything wrong with that? Will it do the work? Thanks.

I'm not positive it will use multiple cores on each machine by default, due to the Python global interpreter lock. As long as you had `experiments=shuffle_mode=service` with 450 workers before, you should have been using the same setup as I did, which completed in less than a day.
@shepsels How long did it ultimately take to finish? I also cannot get 450 workers due to a quota limit, so I am planning to use the same configuration.
@roynirmal After some struggle I got the quota raised and ran it as @adarob suggested. It took less than 24 hours.
@shepsels That's good to know! Can you tell me how you were successful in raising the quota? I raised the CPU quota for us-central1 to 450, but I am getting capped at 32 CPUs, since apparently that's the global quota. I asked to raise it but got rejected. Any help is appreciated, since I am running against a deadline.
If anyone wanted to host the dataset in a public bucket for others to share, it is essentially free for the host with Requester Pays (https://cloud.google.com/storage/docs/requester-pays).
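With Requester Pays enabled, downloaders cover the transfer costs themselves by supplying their own billing project. A minimal sketch, assuming the gsutil CLI is configured; the bucket name is a placeholder, not a real hosted copy:

```bash
# Copy prepared data out of a Requester Pays bucket.
# gs://some-c4-bucket is hypothetical; -u names the project billed for the transfer.
gsutil -u $MY_PROJECT -m cp -r gs://some-c4-bucket/c4 ./c4
```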
@adarob Good to know, I can host it once I have finished the download. With my current quota it looks like it will take around 10 days :\
@roynirmal @shepsels were you able to host the data by any chance? Would be really helpful, thanks!
I would also really appreciate it if I could use this dataset already processed. The cost of processing all of it is too high for my testing.
Did you figure out the resources and cost required for processing? I tried using the free credits, but I keep hitting a quota limit on the number of in-use IP addresses, which doesn't allow me to scale to more workers (even though I have a lot more vCPUs available in my quota). Is there a workaround to avoid using so many IP addresses?
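One possible workaround, offered as an assumption rather than something verified in this thread: Dataflow can run workers without external IP addresses (Beam's `--no_use_public_ips` flag), which sidesteps the in-use external IP quota, provided the worker subnetwork has Private Google Access enabled. A hedged sketch of how it might be threaded through the TFDS launcher; $MY_SUBNET is a placeholder, and whether the flag passes through `--beam_pipeline_options` may depend on the TFDS version:

```bash
# Hypothetical variant: keep Dataflow workers on internal IPs only.
# Requires a subnetwork with Private Google Access enabled.
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=c4/en \
  --data_dir=gs://$MY_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="project=$MY_PROJECT,job_name=c4,staging_location=gs://$MY_BUCKET/binaries,temp_location=gs://$MY_BUCKET/temp,runner=DataflowRunner,requirements_file=/tmp/beam_requirements.txt,experiments=shuffle_mode=service,no_use_public_ips,subnetwork=regions/us-central1/subnetworks/$MY_SUBNET"
```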
Processing the whole dataset will definitely eat up the entire free credits.
Has anyone uploaded the cleaned C4 data (~750 GB)? I would like to download only the clean C4 data, not the entire 7 TB of CC data. Thanks.
How much did it cost you to download the C4 dataset? I want to download it, but I'm afraid it will cost a lot.
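For rough intuition only, a back-of-envelope sketch based on the ~341 CPU-days figure from the README above; the per-vCPU-hour price is an illustrative assumption, not a quote, and Shuffle, disk, and network charges come on top:

```bash
# ~341 CPU-days of preprocessing = 341 * 24 = 8184 vCPU-hours.
# At an assumed ~$0.06 per vCPU-hour for Dataflow batch workers,
# compute alone is on the order of a few hundred USD.
echo $((341 * 24 * 6 / 100))   # ≈ 491 (USD, compute only, assumed price)
```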
@adarob @shepsels @feather820 @amchauhan
FYI, you can download a prepared and preprocessed C4 directly now: https://huggingface.co/datasets/allenai/c4
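A minimal sketch of one way to fetch the files, assuming the repository is served via git-lfs and that an en/ prefix matches the English split (the repo layout may have changed since this comment):

```bash
# Clone without downloading the large data blobs, then pull only the English split.
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
```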
Thanks @craffel, but I have a use case that requires processing the latest CC data dumps. The C4 from AllenAI seems to be the April 2019 version.
Hello, @spate141,
Hi @kimsan0622, it looks like using 75 workers on the 105 TB, almost 2 billion files of the CC-MAIN-2013-20 dump costs about 950€. Did you mean 3,500 USD?
@spate141
FWIW, these days it might be useful to try the olm-datasets pipeline.
Thanks @versae! olm-datasets seems pretty straightforward for processing CC data, that's exactly what I was looking for!
What I need help with / What I was wondering
I'm running the C4 Dataflow pipeline as described in this guide:
https://www.tensorflow.org/datasets/beam_datasets.
At first, I ran it without any restrictions, and it tried to scale up until it had used all of the free IP addresses across our entire Google Cloud account.
On the second run, we set max_workers to 20. It has been running for quite some time (~72 h), and we have no way to estimate how much longer it will run or whether something has gone wrong (no errors appear in the logs).
We would be happy to know whether that is a reasonable running time, and to learn of ways to inspect this pipeline and figure out our progress.
Thank you.
Environment information (if applicable)

tensorflow-datasets / tfds-nightly version: tfds-nightly
tensorflow / tensorflow-gpu / tf-nightly / tf-nightly-gpu version: tensorflow 2