
Cannot create an Artifact with 7 files of size 975244241 bytes #3823

Open
samuela opened this issue Jun 21, 2022 · 19 comments
Labels
c:artifacts Candidate for artifact branch

Comments


samuela commented Jun 21, 2022

I have the following script:

import wandb

if __name__ == "__main__":
  with wandb.init(
      project="private",
      entity="private",
      tags=["DELETEME"],
  ) as wandb_run:
    artifact = wandb.Artifact("deleteme", type="deleteme")

    for epoch in range(100):
      print(f"Epoch {epoch}")
      with artifact.new_file(f"checkpoint{epoch}", mode="wb") as f:
        contents = open("/dev/urandom", "rb").read(975244241)
        print(len(contents))
        f.write(contents)

Running this script I see the following error:

[nix-shell:/efs/research/lottery]$ python deleteme.py 
wandb: Currently logged in as: skainswo. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.12.18 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.16
wandb: Run data is saved locally in /tmp/wandb/run-20220621_073539-11aknk72
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run silvery-microwave-547
wandb: ⭐️ View project at https://wandb.ai/skainswo/playing-the-lottery
wandb: 🚀 View run at https://wandb.ai/skainswo/playing-the-lottery/runs/11aknk72
Epoch 0
975244241
Epoch 1
975244241
Epoch 2
975244241
Epoch 3
975244241
Epoch 4
975244241
Epoch 5
975244241
Epoch 6
975244241
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb:                                                                                
wandb: Synced silvery-microwave-547: https://wandb.ai/skainswo/playing-the-lottery/runs/11aknk72
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: /tmp/wandb/run-20220621_073539-11aknk72/logs
Traceback (most recent call last):
  File "/efs/research/lottery/deleteme.py", line 16, in <module>
    f.write(contents)
OSError: [Errno 28] No space left on device

[nix-shell:/efs/research/lottery]$ df -H
Filesystem                Size  Used Avail Use% Mounted on
devtmpfs                  3.3G     0  3.3G   0% /dev
tmpfs                      33G     0   33G   0% /dev/shm
tmpfs                      17G  7.6M   17G   1% /run
tmpfs                      33G  345k   33G   1% /run/wrappers
/dev/disk/by-label/nixos  136G  118G   11G  92% /
<censored IP>:/             9.3E  2.7G  9.3E   1% /efs
tmpfs                     6.5G  263k  6.5G   1% /run/user/1000

[nix-shell:/efs/research/lottery]$ 

which is incorrect, considering I have more than enough free space left on disk to fit the files. I've also replicated this on a machine with terabytes of disk space left and still see the same error.

For the sake of exact reproducibility, I'm running with the following shell.nix:

# Run with nixGL, eg `nixGLNvidia-510.47.03 python cifar10_convnet_run.py --test`

# To prevent JAX from allocating all GPU memory: XLA_PYTHON_CLIENT_PREALLOCATE=false
# To push build to cachix: nix-store -qR --include-outputs $(nix-instantiate shell.nix) | cachix push ploop

let
  # pkgs = import (/home/skainswo/dev/nixpkgs) { };

  # Last updated: 2022-05-16. Check for new commits at status.nixos.org.
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/556ce9a40abde33738e6c9eac65f965a8be3b623.tar.gz") {
    config.allowUnfree = true;
    # These actually cause problems for some reason. bug report?
    # config.cudaSupport = true;
    # config.cudnnSupport = true;
  };
in
pkgs.mkShell {
  buildInputs = with pkgs; [
    ffmpeg
    python3
    python3Packages.augmax
    python3Packages.einops
    python3Packages.flax
    python3Packages.ipython
    python3Packages.jax
    # See https://discourse.nixos.org/t/petition-to-build-and-cache-unfree-packages-on-cache-nixos-org/17440/14
    # as to why we don't use the source builds of jaxlib/tensorflow.
    (python3Packages.jaxlib-bin.override {
      cudaSupport = true;
    })
    python3Packages.matplotlib
    # python3Packages.pandas
    python3Packages.plotly
    # python3Packages.scikit-learn
    (python3Packages.tensorflow-bin.override {
      cudaSupport = false;
    })
    # Thankfully tensorflow-datasets does not have tensorflow as a propagatedBuildInput. If that were the case for any
    # of these dependencies, we'd be in trouble since Python does not like multiple versions of the same package in
    # PYTHONPATH.
    python3Packages.tensorflow-datasets
    python3Packages.tqdm
    python3Packages.wandb
    yapf
  ];

  # Don't clog EFS with wandb results. Wandb will create and use /tmp/wandb.
  WANDB_DIR = "/tmp";

  # Don't check this into version control!
  WANDB_API_KEY = "... secret ...";
}

In short: I'm running on NixOS 22.05, wandb client version 0.12.16, Python 3.9.12 on a p3.2xlarge EC2 instance. I've also replicated the issue on Ubuntu, and with and without a custom WANDB_DIR setting.

Why can't I save 7 files in an Artifact?

@MBakirWB

Hi @samuela,

Thank you for writing in about this. Can you please share the debug.log and debug-internal.log files associated with this crash? These files should be in a folder called wandb relative to the directory where the runs were initiated. You can email them to me directly at mohammad.bakir@wandb.com.

In the meantime can you please provide a response to the following:

Are you running the code directly on the ECS instance and setting a directory path (for example /home/ec2-user/SageMaker) that you are also using in your container? If this path isn't set correctly in the container, we default to the tmp directory inside the container, which will cause the disk-space error if not enough disk space has been allocated.


samuela commented Jun 23, 2022

@MBakirWB I created https://gist.github.com/samuela/6f9542584c8860a035870552edc7fac9 with all of the requested logs. In addition, I further reduced the reproduction a bit, and removed the custom WANDB_DIR for simplicity.

Somewhat interestingly, the latest run lasted longer, but still failed with the same error. Check out the logs in the gist for all the deets!

Are you running the code directly on the ECS instance and setting a directory path (for example /home/ec2-user/SageMaker) that you are also using in your container? If this path isn't set correctly in the container, we default to the tmp directory inside the container, which will cause the disk-space error if not enough disk space has been allocated.

I'm not sure I understand... There is no container or SageMaker technology at play here. I'm running directly on a VM. I can confirm that this is not due to writing in /tmp, since I'm able to reproduce using the ~/my/project/wandb/ default location (cf. this latest gist).

@MBakirWB

@samuela ,

Thank you for providing all the requested items and clarifying that you are running directly on a VM. I have escalated this to our engineering team to take a closer look and will update you as soon as I have feedback.

@MBakirWB

Hi @samuela, the team got back to me with the following.

What is the size of the VM you are using?

What might be happening here is that the MinIO storage inside the Docker container running on the VM is running out of disk space. Some of the main locations our application writes to are the /var/log and /vol/env directories; check the disk space available to them.

Also, with respect to artifacts specifically: the No space left on device error usually comes up because uploading an artifact causes it to be written to the cache directory (~/.cache/wandb), and the disk runs out of space when that happens. If the VM size is not the issue here, try setting the env var WANDB_CACHE_DIR=/tmp.
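
For reference, a minimal sketch of that suggestion (the cache path is only an illustration, and the env var should be set before the run is initialized):

import os
import wandb

# Point the artifact cache at a volume with plenty of free space *before*
# initializing wandb; "/tmp/wandb-cache" is only an example path.
os.environ["WANDB_CACHE_DIR"] = "/tmp/wandb-cache"

with wandb.init(project="private", entity="private") as wandb_run:
  artifact = wandb.Artifact("deleteme", type="deleteme")
  with artifact.new_file("checkpoint0", mode="wb") as f:
    f.write(open("/dev/urandom", "rb").read(1024))  # small placeholder payload
  wandb_run.log_artifact(artifact)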


samuela commented Jun 30, 2022

What is the size of the VM you are using?

Size in which dimension?

What might be happening here is the minio storage inside the docker container running on the VM is running out of disk space.

How are Docker or MinIO relevant here? I'm not using containers in any way.


samuela commented Jun 30, 2022

The gist linked above was run on an m6a.large instance with 166G available on /, more than enough to store the artifact.

@MBakirWB

Thanks for the clarification @samuela, I now recognize that I misinterpreted your initial and subsequent submissions. You are running the wandb client on an EC2 instance.

Revisiting earlier comments: we currently recommend that users have 2x the artifact's size in free space to upload it, since we write through the cache. Please revisit the second suggestion regarding the cache directory, which can be overridden by setting the WANDB_CACHE_DIR env var. The env vars are documented here.


samuela commented Jul 5, 2022

Following your advice I have set WANDB_CACHE_DIR=/tmp and have verified that I have comfortably more than the 2x space reportedly necessary to upload the artifact, but I'm still seeing the same error:

...
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb:
wandb:
wandb: Run history:
wandb:          epoch ▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▅▆▆▆▆▇▇▇▇███
wandb:  test_accuracy ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:      test_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: train_accuracy ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:     train_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:
wandb: Run summary:
wandb:          epoch 30
wandb:  test_accuracy 0.0947
wandb:      test_loss 2.45663
wandb: train_accuracy 0.0
wandb:     train_loss 0.0
wandb:
wandb: Synced upbeat-night-630: https://wandb.ai/skainswo/playing-the-lottery/runs/3rthz1nb
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: /tmp/wandb/run-20220704_234531-3rthz1nb/logs
Traceback (most recent call last):
  File "/efs/research/lottery/cifar10_vgg_run.py", line 277, in <module>
    f.write(flax.serialization.to_bytes(train_state))
OSError: [Errno 28] No space left on device
ip-172-30-0-201% df -H
Filesystem                Size  Used Avail Use% Mounted on
devtmpfs                  3.3G     0  3.3G   0% /dev
tmpfs                      33G     0   33G   0% /dev/shm
tmpfs                      17G  7.1M   17G   1% /run
tmpfs                      33G  345k   33G   1% /run/wrappers
/dev/disk/by-label/nixos  542G   26G  489G   5% /
172.30.0.75:/             9.3E  4.0G  9.3E   1% /efs
tmpfs                     6.5G   50k  6.5G   1% /run/user/1000
ip-172-30-0-201%

How should I proceed from here?


samuela commented Jul 5, 2022

I'm also seeing this issue when setting WANDB_DIR and WANDB_CACHE_DIR to /efs/tmp which has exabytes of storage available...


samuela commented Jul 5, 2022

FWIW this is an issue specific to wandb since

with artifact.new_file(f"checkpoint{epoch}", mode="wb") as f:
  f.write(flax.serialization.to_bytes(train_state.params))

errors out but

filename = f"/tmp/checkpoint{epoch}"
with open(filename, mode="wb") as f:
  f.write(flax.serialization.to_bytes(train_state.params))
artifact.add_file(filename)

works fine.


MBakirWB commented Jul 7, 2022

Hi @samuela,

Thanks for the update. I spoke to the team and they are recommending the following test:

  • Delete the wandb cache folder to ensure all checksums are deleted.
  • Comment out print(len(contents)) in your epoch loop.
  • Log the artifact prior to your epoch loop using wandb_run.log_artifact(...).
  • Call wandb_run.finish() after the loop.
artifact = wandb.Artifact("deleteme", type="deleteme")
wandb_run.log_artifact(artifact)
for epoch in range(100):
  print(f"Epoch {epoch}")
  with artifact.new_file(f"checkpoint{epoch}", mode="wb") as f:
    contents = open("/dev/urandom", "rb").read(975244241)
    # print(len(contents))
    f.write(contents)
wandb_run.finish()

@MBakirWB

Hi @samuela, following up on this. Do you still require support on this issue? Thanks


samuela commented Jul 13, 2022

Unfortunately I don't have the bandwidth to continue tracking this down, but I'm happy to help if you have any issues running the reproduction steps.


vanpelt commented Jul 13, 2022

@samuela artifact.new_file will create the file in the current working directory and then write through the cache. If you set WANDB_CACHE_DIR to the large volume and call os.chdir into a large volume before calling new_file, it should work.

We should consider staging these new files in WANDB_DIR by default in the future, but I think that's what's happening here.


vanpelt commented Jul 13, 2022

I take that back: it writes to a tempfile.TemporaryDirectory() from the Python stdlib. You can set _artifact_dir on an instance of the artifact to an empty directory on a partition that has sufficient space...
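
A minimal sketch of that workaround (the paths are placeholders, and _artifact_dir is a private, version-specific attribute, so treat this as a sketch rather than a supported API):

import os
import tempfile
import wandb

# Keep the artifact cache on a large volume (placeholder path; the directories
# are assumed to already exist).
os.environ["WANDB_CACHE_DIR"] = "/big-volume/wandb-cache"
# tempfile.TemporaryDirectory() honors TMPDIR, so pointing TMPDIR at the large
# volume should also move new_file's temporary staging directory there.
os.environ["TMPDIR"] = "/big-volume/tmp"
tempfile.tempdir = None  # force the stdlib to re-read TMPDIR

with wandb.init(project="private", entity="private") as wandb_run:
  artifact = wandb.Artifact("deleteme", type="deleteme")
  # Alternatively, per the comment above, point the private staging directory
  # at an empty directory on the large partition (version-specific):
  # artifact._artifact_dir = "/big-volume/artifact-staging"
  with artifact.new_file("checkpoint0", mode="wb") as f:
    f.write(open("/dev/urandom", "rb").read(1024))  # small placeholder payload
  wandb_run.log_artifact(artifact)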


dorbittonn commented Nov 4, 2022

As this issue states, I'm getting the No space left on device error as well.
I'm training on AWS Batch with a container and a mounted EFS volume.

[screenshot of the error]

I get the error even though I defined WANDB_DIR=efs/wandb and WANDB_CACHE_DIR=efs/wandb_cache....
If all of my logs and artifacts are stored off the machine's disk, why am I running out of storage?

@luisbergua

Hi @dorbittonn, thanks for writing in! I've followed up with you by replying to the email you sent to support. Thanks!

@kptkin kptkin added the c:artifacts Candidate for artifact branch label Mar 2, 2023