c4/multilingual produces Dataflow job file too big (38MB >> 10MB) #2711
Comments
@adarob FYI |
@versae I'd recommend generating all languages you want at the same time since it has to process all of the data either way. However, I think if you added a new config with only the languages you want, you'd avoid the JSON issue. @Conchylicultor we would need to allow them to add a new config to the dataset with only the languages they are interested in. What's the recommended way for them to do that? |
Thanks for the quick replies, @Conchylicultor and @adarob. We thought about the custom config, but had trouble figuring out how to make it work. Even with a custom config, it's still not clear to us how to run the generation for it. After inspecting the source code, it seems it should be possible to only process the data for one language, even if we have to download everything, but we are not sure how to do it. |
You should be able to add a config similar to 'multilingual', but with only the Nordic languages listed. You'd make the change to a local clone of the repo and then run `download_and_prepare` against that clone. |
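To make the suggestion concrete, here is a rough sketch of what such a config could look like in a local clone of `tensorflow_datasets/text/c4.py`. Everything below is illustrative: the `C4Config` keyword arguments are assumed to mirror however the existing 'multilingual' config is declared in your copy of the file, and the language codes are just examples.

```python
# Hypothetical excerpt for a local clone of tensorflow_datasets/text/c4.py.
# C4Config and its keyword arguments are assumptions modeled on the existing
# 'multilingual' config in that file; adjust them to match your copy.
NORDIC_LANGUAGES = ["no", "da", "sv", "is"]  # example ISO 639-1 codes

nordic_config = C4Config(
    name="nordic",
    languages=NORDIC_LANGUAGES,
    # Keep the same Common Crawl dump list the 'multilingual' config uses.
)

# The new config would then be registered alongside the existing
# BUILDER_CONFIGS entries so it can be selected as "c4/nordic".
```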
I see. It feels unnecessarily complicated, but we'll give it a try 🤞 Really, it'd be great if some job-file size control were allowed or implemented directly within TFDS. I wonder: if the limit is 10MB and the current code generates a job file of almost 40MB, is there a way to lift this limitation, even temporarily, to at least be able to launch the job? In any case, thanks for the help. |
The issue is the large number of splits (>100) that are produced by this pipeline. Limiting to ~10 languages should reduce the json size by a factor of 10, I believe. Other options to fix the deeper problem would involve either having the DataFlow team raise the limit or merge some of the downstream steps in the TFDS sharding and writing portions of the pipeline. |
Thinking about it: if the downloading and processing of the data happens in the workers, the information about the languages is not available until the query is already sent and in execution, isn't it? Limiting the languages, or for that matter selecting specific splits, would still generate the same job file to be sent to Dataflow, I believe. |
The actual pipeline "blueprint" is pre-generated and sent to the DataFlow service, which is where I think your issue is. This blueprint includes all of the per-language stages. |
We finally tried our custom config for Norwegian (excluding …). |
Can you try also removing `c4_utils.UNKNOWN_LANGUAGE` here: `tensorflow_datasets/text/c4.py`, line 391 (at commit 0371b59)? |
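In terms of the config sketch above, that suggestion amounts to something like the following. Only `c4_utils.UNKNOWN_LANGUAGE` is the actual constant referenced; the surrounding names are assumptions.

```python
# Hypothetical: build the per-language split list without the
# unknown-language bucket, removing a few more stages from the job graph.
NORDIC_LANGUAGES = ["no", "da", "sv", "is"]  # illustrative, as above

# Before (sketch): splits also covered an "unknown language" bucket.
# split_languages = NORDIC_LANGUAGES + [c4_utils.UNKNOWN_LANGUAGE]

# After (sketch): only the configured languages get per-language splits.
split_languages = list(NORDIC_LANGUAGES)
```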
Tried with and without that line. The request now went through, but Dataflow still failed and complained about size, since the job graph is still a tiny bit larger than 10MB (~10.2MB):

```json
{
  "error": {
    "code": 400,
    "message": "(ffddc5dbfa10dae9): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
    "status": "INVALID_ARGUMENT"
  }
}
```

Adding the updated job file. Wondering if the hard limit of 10MB could be lifted, temporarily at least, or for individual projects. |
I'll see if we can get some help from the dataflow team. |
Thank you so much. |
If possible, it's best to reduce the number of nodes in the graph, for example by multiplexing the values into a source. @adarob we should look at the pipeline in more detail for that. In the meantime you can make use of `--experiments=upload_graph` in the Dataflow pipeline arguments, which allows larger-than-10MB pipelines. Note, though, that things like the UI will have limitations with this experimental flag. |
Interesting. Can this experiments flag be combined with the `shuffle_mode=service` one we are already passing? |
You can have both in a list |
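For illustration, multiple Dataflow experiments can indeed be supplied together as a list; below is a minimal Beam sketch in which the project, region, and bucket values are placeholders. With `download_and_prepare`, the same thing is expressed by repeating `experiments=...` inside the `--beam_pipeline_options` string, as the later commands in this thread show.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: several Dataflow experiments supplied as a list. All values other
# than the experiment names are placeholders.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="europe-west1",
    temp_location="gs://my-bucket/temp",
    experiments=["upload_graph", "shuffle_mode=service"],
)
print(options.get_all_options()["experiments"])
```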
The job now went through and it appears workers are being started. Adding more languages to the config also works. Just one question though: should we use `max_num_workers` or directly `num_workers`? Thanks! |
I think num_workers is the way to go, and also disabling auto scale.
|
It's been running for 23 hours now but we don't see any progress: not in the process log, the Dataflow log, or the Dataflow diagram. Total allocated HDD is 10.99TB, which is insufficient for the entire corpus, although we don't know whether the corpus is downloaded in chunks, processed, and then discarded. Moreover, nothing is being written to the bucket. And after a few hours we started to see a lot of messages mentioning HTTP 401 errors.
Nothing else is written in the process log. Our guess is that the 401 is some sort of unauthorized OAuth bearer token issue? But we don't know if we should worry about it, just let it run for another couple of hours, or stop it right away. It's been some 24 expensive hours for us running 450 workers :) |
Do you not see any counters? @rezarokni is this a side effect of `upload_graph`? If the job is not crashing, my suspicion is that it's working. The input dataset is 71x the size of the original C4, so it's going to take quite a bit longer. I'm not sure how much longer, but it should be less than 71x as long. |
If by counters you mean the stages of each box in the diagram, a few boxes at the beginning are outlined in dashed green (started), but most of them are greyed out (not even started yet). All counters are at 0 (zero), none reported to succeed yet. Attaching a screenshot for reference. PS: If this is off-topic now for the current issue, I can create another issue and move the discussion there. |
When the job starts running properly, you should see a "Custom Counters" section on the right as well. Just to be clear, have you downloaded all of the WET files to the manual directory? |
We have 72 files like this in the bucket. We used the following code to put the `wet.paths` listings there:

```bash
# Add all 72 dumps
rm wet.paths.urls
echo "CC-MAIN-2013-20" >> wet.paths.urls
...
echo "CC-MAIN-2020-40" >> wet.paths.urls

# Put them in the bucket
for wetpath in `cat wet.paths.urls` ; do
  curl -s https://commoncrawl.s3.amazonaws.com/crawl-data/$wetpath/wet.paths.gz \
    | gunzip \
    | pv --name $wetpath --bytes \
    | gsutil -q cp - "$GCS_BUCKET/tensorflow_datasets/downloads/manual/crawl-data/$wetpath/web.paths"
done
```
|
Is there any other indicator or flag that the job is actually running? |
The experimental upload_graph option can cause issues in the UI. |
Attaching the Dataflow metrics as returned by the CLI command: c4-nordic-gen.metrics.txt. Nothing of interest there, we believe. In the UI, worker logs are empty, and the job log hasn't returned anything in the last 36 hours. |
It sounds like you only downloaded the wet path files, not the actual WET files. I'm not sure why it's not just crashing when it can't find them in your manual directory. I can try to set it up to download on the workers instead of requiring you to do it ahead of time, but it's not straightforward due to how the TFDS download manager stores all of the files in the same directory. I'll spend a bit of time on it this morning.
|
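To illustrate the distinction being made here: the loop shown earlier only mirrors the `wet.paths` listings, while the generator expects the WET archives those listings point to, under the same manual directory. A rough sketch of what fetching them would involve follows; the bucket name, dump id, and directory layout are assumptions based on the earlier commands, not a verified recipe.

```python
# Hypothetical helper: mirror the actual WET archives (not just the wet.paths
# listings) into the TFDS manual directory on GCS. The bucket name, dump id,
# and directory layout are assumptions based on the commands shown earlier.
import urllib.request

import tensorflow as tf

GCS_MANUAL_DIR = "gs://my-bucket/tensorflow_datasets/downloads/manual"  # placeholder
DUMP = "CC-MAIN-2013-20"
CC_BASE = "https://commoncrawl.s3.amazonaws.com/"

# Each line of wet.paths is a path like crawl-data/.../*.warc.wet.gz,
# relative to the Common Crawl bucket.
with tf.io.gfile.GFile(f"{GCS_MANUAL_DIR}/crawl-data/{DUMP}/wet.paths") as f:
    wet_paths = [line.strip() for line in f if line.strip()]

for rel_path in wet_paths:
    dest = f"{GCS_MANUAL_DIR}/{rel_path}"
    if tf.io.gfile.exists(dest):
        continue  # already mirrored
    with urllib.request.urlopen(CC_BASE + rel_path) as resp, \
            tf.io.gfile.GFile(dest, "wb") as out:
        out.write(resp.read())  # each archive is on the order of 100 MB compressed
```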
Is the custom Config needed in the workers or only by the main process launching the Dataflow job? |
The custom config is not needed by the workers. It's okay that the installed version doesn't include it. Just to be clear, I was able to run this with no issue back when I submitted #2734. I'm not sure why you are still having trouble. |
@versae best to delete the comment with txt file as the raw info is not needed. |
@adarob, @rezarokni thanks for all the help, we're all in good faith here :) I wish it just worked for me. After removing the local clone from the worker dependencies and getting rid of the experimental flag (the job graph file is now 1.3MB instead of 13MB), it seems we have reached a milestone: I can now see worker logs in the UI and the first stage (counter) has completed successfully. We are now getting tracebacks in the workers, and there seem to be issues installing the requirements on them. |
This error indicates that a worker Docker container failed to start. If the container is crash-looping, look for an error message in the worker startup logs. If you are using a custom container image with your job, make sure the workers can actually pull and start that image. |
After downgrading pip to "<20.0" we got it to work. It's been running for almost 18 hours now. So far we've got a few errors related to the downloading of WET files (an HTTP 503 while downloading one of them). It has shuffled almost 40TB of data across 132k WET files and counting (so nice to be able to see all the counters). I'm closing the issue and will write a small TL;DR when it's done, for anyone following. Thank you so much for all the work, @adarob, @rezarokni, and @tvalentyn! |
@adarob, @rezarokni, any way to estimate how long the process will take? |
If it crashes for any reason, is there a way to resume processing? |
It's tricky and depends on the pipeline and even the dataset. Besides trying the pipeline on datasets of different sizes, it would be prudent to check that the pipeline can scale and that data is roughly evenly distributed between workers (no hot keys).
I don't think it is possible at this time. However, Dataflow will retry a failing work item at least 4 times before giving up. |
We have enabled … To be honest, we are a bit worried. This is costing us around $1000 a day. We are a small unit at the National Library of Norway trying to get a monolingual Norwegian BERT model released to the public. We are a non-profit European organization (no grant awarding, so not eligible for free GCP credits AFAICT). A rough estimate of the total time would be really helpful; it would be terrible to run out of funds before the process finishes, with no chance of resuming it later. There is full activity on all the VMs (~80% CPU), and it has processed 120TB of data so far. Here's a screenshot of the counters. Looking at the downloaded WET files (~320k), and assuming an average of 60k WET files per dump (there are 72), we estimate we are at about 15% of processing after 60 hours, so we would still need two full weeks (400 hours) to get the Norwegian part of mC4. But we are not sure if that's in any way realistic. |
I've never built one of these datasets on DataFlow with more than 1 crawl (the default for English only). For multilingual, we used 72 crawls to help get enough data for the tail languages, but we only ran it on our internal system which used many more workers. Based on this estimate (which is actually only for the first stage of processing), I'd highly suggest you reduce the number of crawls you use in the config. |
As you pointed out, it seems unreasonable for us to run this for all the crawls; it seems like even restricting it to one crawl will be a significant expense. We are trying to build a very large corpus for Norwegian (and the other Nordic languages), and this looks like a very good source. Are you able to give us a rough estimate (based on your experience) of how much data one crawl would give us? We do have access to the OSCAR dataset, which is also based on Common Crawl. Would you happen to know how C4 on one or two crawls would differ from that dataset? |
If you let me know the languages you're interested in, I can compute ~how many documents you'll get from one crawl.
|
Sure, the languages we're interested in are Norwegian, Swedish, Danish, Icelandic, and Faroese (although I think this last one is not included in mC4). The ISO codes are … |
It'd also be great to know how many words or GB of raw text that would be. |
First off, I found a significant bottleneck that reduces the parallelism to ~71 instead of your number of workers. It will be fixed in #2895. Here are the numbers of documents from one Common Crawl dump: |
Hi @versae, thanks for creating the issue... I'm also planning to extract a specific language from mC4 using the Dataflow/Beam dataset pipeline.
Hi! Sorry for the long hiatus. Thanks, @adarob. We're finally resuming work on this and I hope to start the processing in the coming days. We might just start with the first, last, and in between dumps just to get an idea of how long and how big. But the estimates in number of documents per language really help. @acul3, we were not able to finish the processing even after starting 500 workers for over 36 hours. We'll prepare everything again and report back. If you have any insights I'd be eager to know as well. Cheers. |
I was finally able to try this again. A few things have changed and it seems the experimental flags are not needed anymore. The setup is as follows:

```bash
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME/$DATASET_CONFIG \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="region=$GCS_BUCKET_REGION,runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATASET_NAME-$DATASET_CONFIG-1dump-gen,staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,dataflow_job_file=$GCS_BUCKET/job_file.json,requirements_file=/tmp/beam_requirements.txt,autoscaling_algorithm=NONE,num_workers=50" 2>&1 | tee nb-mc4-1dump.log
```

The last main process logs: … I ran it for 24 hours on only one dump. I'd be happy to open a new issue if that's more appropriate. |
If the input pipeline entered a phase where it is IO-bound, it might be normal for the CPU rate to drop.
It seems like I got past that problem on a second run, but it has now failed with a "No space left on the device" error that I guess is coming from the workers as they try to write too many or too large temporary files.
Not sure what a proper disk size for the workers would be; I'm using whatever the default is now. I also don't know if I should run it again with a larger disk size. |
If the temporary files are created by shuffle, using the Dataflow shuffle service (`experiments=shuffle_mode=service`) should keep them off the worker disks. |
It finished! 🎉 It successfully processed the 105TB and almost 2 billion files of the dump. The environment variables look like this:

```bash
DATASET_NAME=c4
DATASET_CONFIG=nordic
GCP_PROJECT=...
GCS_BUCKET=...
GCS_BUCKET_REGION=...
```

The command looks like this:

```bash
python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME/$DATASET_CONFIG \
  --data_dir=$GCS_BUCKET/tensorflow_datasets \
  --beam_pipeline_options="region=$GCS_BUCKET_REGION,runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATASET_NAME-$DATASET_CONFIG-1dump-gen,staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,dataflow_job_file=$GCS_BUCKET/job_file.json,requirements_file=/tmp/beam_requirements.txt,autoscaling_algorithm=NONE,disk_size_gb=100,num_workers=75,experiments=shuffle_mode=service,experiments=use_runner_v2," 2>&1 | tee nb-mc4-1dump.log
```

And I patched 0db70eb locally to include a … I did not delete the contents of the bucket from the previous run. Do runs cache any part of the process, so that after a failed run it takes less time to finish? |
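For anyone following along, once generation finishes the data can be read back from the same `data_dir`; a minimal sketch is below. The `c4/nordic` name follows the custom config used in this thread, the bucket path is a placeholder, and reading the custom config still requires the local clone that defines it.

```python
import tensorflow_datasets as tfds

# Load the generated dataset directly from the GCS data_dir used above.
# "c4/nordic" assumes the custom config from this thread; the bucket path
# is a placeholder.
ds = tfds.load(
    "c4/nordic",
    split="train",
    data_dir="gs://my-bucket/tensorflow_datasets",
)
for example in ds.take(1):
    print(example["url"].numpy(), example["text"].numpy()[:200])
```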
Congrats on getting this working! Out of curiosity, did you have to request a lot of quota increases from GCS to make it work, and if so what quotas did you get increases to? The last time I tried this, I kept running into various out-of-quota warnings. |
@versae Congratulations on getting this to work. I would like to understand the finer details. We are planning on doing this for English. Would it be possible to set up a meeting to discuss it? Thanks in advance. |
Thanks, @daphnei, @sumanthd17. @daphnei, the quota thing was the first problem I had; thankfully, it was fixed by my manager. In my experience, having a budget estimate of resources and cost really helps when justifying how much money you need to run experiments. @sumanthd17, I am not affiliated with Google. I'm sure that @adarob and @rezarokni know the internals of their own work way better than I do. That being said, happy to help. |
Short description
We are trying to extract the Norwegian (and eventually other Nordic languages) portion of `c4/multilingual`. Since there is no easy way to download only the data for one language, we are processing the entire `c4/multilingual` corpus first.

Environment information
Operating System: Debian GNU/Linux 10
Python version: Python 3.8.5 (miniconda)
`tensorflow-datasets`/`tfds-nightly` version: 4.1.0 / 4.1.0.dev202011080107
`tensorflow`/`tf-nightly` version: 2.3.1 / 2.5.0.dev20201108 (tried with and without `tf-nightly`)
Does the issue still exist with the latest `tfds-nightly` package (`pip install --upgrade tfds-nightly`)? Yes, it does. We read all issues related to C4 and incorporated the necessary changes: pinning the `dill` version and adding options for 450 workers and `experiments=shuffle_mode=service` in Apache Beam.

Reproduction instructions
On a clean VM with 8 vCPUs and 32GB of RAM, we installed miniconda and ran the following commands:
Link to logs
We removed information about our project and bucket in the logs:
Process log: nb-mc4.log
JSON job file: job_file.zip
Expected behavior
We would have expected the script to successfully launch the pipeline in Dataflow, but the JSON job file seems to be too big (37.5MB when the max is 10MB), therefore all we get is a "Your client issued a request that was too large" error message (formatted as an HTML page in the console output).

Sample of the output
Additional context
If there is any other way to extract a language portion of `c4/multilingual`, we'd be eager to try it as well.

Update (March 1st, 2021): Instructions to successfully run the pipeline using one dump are detailed in #2711 (comment).