
File size is different from the sum of chunks (from meta.cat) #4289

Closed
aronneagu opened this issue Mar 8, 2023 · 2 comments

Comments

aronneagu (Contributor) commented Mar 8, 2023

Describe the bug
The filer holds two different views of the file's size (attributes.fileSize vs. the chunk sizes), which leads to file corruption on read.
The file is in a directory that is mounted from an Azure Storage remote.
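One way to observe the mismatch from outside weed shell is to compare the size the filer advertises with the bytes it actually serves. A minimal sketch, assuming the filer's HTTP endpoint is on localhost:8888 (the default) and that the Content-Length header reflects attributes.fileSize while the body is streamed from the chunk list:

# Advertised size (presumably derived from attributes.fileSize)
curl -sI "http://localhost:8888/buckets/bcminioresearch/models/datasets/credit/global_credit_data_monthly/live/_metadata" | grep -i '^content-length'
# Bytes actually streamed (assembled from the chunk list)
curl -s "http://localhost:8888/buckets/bcminioresearch/models/datasets/credit/global_credit_data_monthly/live/_metadata" | wc -c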

System Setup

> weed version
version 30GB 3.42 8821d6b1619b7a55b515a98391193297e77cbe52 linux amd64
> /usr/local/bin/weed -v=3 -logdir=/mnt/seaweed_bcminioresearch/logs server -master.dir=/mnt/seaweed_bcminioresearch/master -volume.index=leveldb -volume.max=8 -filer.dirListLimit=10000 -metricsPort=9327 -dir=/mnt/seaweed_bcminioresearch -s3 
> lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.10
Release:        22.10
Codename:       kinetic

Screenshots

# in weed shell
> meta.cat /buckets/bcminioresearch/models/datasets/credit/global_credit_data_monthly/live/_metadata
{
  "name":  "_metadata",
  "isDirectory":  false,
  "chunks":  [
    {
      "fileId":  "28,4231f9c5edef",
      "offset":  "0",
      "size":  "1225797",
      "modifiedTsNs":  "1678106055",
      "eTag":  "b53111d5",
      "sourceFileId":  "",
      "fid":  {
        "volumeId":  28,
        "fileKey":  "16945",
        "cookie":  4190498287
      },
      "sourceFid":  null,
      "cipherKey":  "",
      "isCompressed":  false,
      "isChunkManifest":  false
    }
  ],
  "attributes":  {
    "fileSize":  "1236149",
    "mtime":  "1678264028",
    "fileMode":  420,
    "uid":  0,
    "gid":  0,
    "crtime":  "0",
    "mime":  "",
    "ttlSec":  0,
    "userName":  "",
    "groupName":  [],
    "symlinkTarget":  "",
    "md5":  "",
    "rdev":  0,
    "inode":  "0"
  },
  "extended":  {},
  "hardLinkId":  "",
  "hardLinkCounter":  0,
  "content":  "",
  "remoteEntry":  {
    "storageName":  "azure",
    "lastLocalSyncTsNs":  "0",
    "remoteETag":  "0x8DB1FAEE885E3FA",
    "remoteMtime":  "1678264028",
    "remoteSize":  "1236149"
  },
  "quota":  "0"
}chunks 1 meta size: 116 gzip:144

You'll notice that attributes.fileSize and chunks[0].size are different.

Expected behavior
The file has only one chunk, so I would have expected those two values to be equal.
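As a quick consistency check, the sum of the chunk sizes can be compared against attributes.fileSize directly. A minimal sketch, assuming jq is installed and that the trailing "chunks N meta size ..." summary meta.cat appends after the closing brace is the only non-JSON output:

# Strip the fused "chunks N meta size..." summary, then compare the two sizes
echo "meta.cat /buckets/bcminioresearch/models/datasets/credit/global_credit_data_monthly/live/_metadata" | weed shell \
  | sed 's/}chunks .*/}/' \
  | jq '{chunkSum: ([.chunks[].size | tonumber] | add), fileSize: (.attributes.fileSize | tonumber)}'
# For the entry above, this prints chunkSum 1225797 vs fileSize 1236149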

aronneagu (Contributor Author) commented:

Does this look familiar to anyone? We have millions of files and are not entirely sure how to reproduce this. Is there any way to move this forward?

aronneagu (Contributor Author) commented:

@chrislusf I have found a way to reproduce this:

# Install rclone
apt install rclone -y
# Create rclone config
mkdir -p /root/.config/rclone || true
cat >/root/.config/rclone/rclone.conf <<EOF
[seaweedsa]
type = azureblob
account = seaweedsa
key = $AZURE_KEY
EOF
# Cleanup the Azure Storage account
rclone delete seaweedsa:seaweed/_metadata || true
# Copy original file to Azure Storage
rclone copyto _metadata.old seaweedsa:seaweed/_metadata || true
# Add remote to seaweed
echo "remote.configure -name=azure -type=azure -azure.account_name=seaweedsa -azure.account_key=$AZURE_KEY" | weed shell
echo "remote.mount -remote=azure/seaweed -dir=/buckets/seaweed" | weed shell
# Get file from seaweed (populate cache)
wget -qO /dev/null localhost:8888/buckets/seaweed/_metadata
# Update file in Azure Storage
rclone copyto _metadata.new seaweedsa:seaweed/_metadata || true
# Sync between seaweed and Azure Storage
echo "remote.meta.sync -dir=/buckets/seaweed" | weed shell
# Get file metadata
echo "meta.cat /buckets/seaweed/_metadata"| weed shell

You'll notice that chunks[0].size differs from both remoteEntry.remoteSize and attributes.fileSize, while remoteEntry.remoteSize equals attributes.fileSize.
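The same kind of check as above (again a sketch, assuming jq and that meta.cat's trailing summary is the only non-JSON output) puts the three fields side by side:

echo "meta.cat /buckets/seaweed/_metadata" | weed shell \
  | sed 's/}chunks .*/}/' \
  | jq '{chunkSize: (.chunks[0].size | tonumber), fileSize: (.attributes.fileSize | tonumber), remoteSize: (.remoteEntry.remoteSize | tonumber)}'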

The steps are

  1. SeaweedFS adds a remote (Azure Storage)
  2. A file is fetched from that remote (and thus cached on the filer)
  3. The file in Azure Storage is updated directly
  4. remote.meta.sync is run, to "inform" the filer that the file has been updated in the remote -> this updates remoteEntry.remoteSize and attributes.fileSize, but not the chunks

There are at least 2 possible solutions here

  1. When remote.meta.sync runs, if the file has been updated in the remote and there are chunks saved locally, invalidate (delete) them
  2. When remote.meta.sync runs, if the file has been updated in the remote, re-download the chunks from the remote

I'll leave that decision up to @chrislusf
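In the meantime, a manual workaround along the lines of option 1 may work: after syncing metadata, drop the stale local cache so the next read re-fetches from Azure. This is an untested sketch, assuming remote.uncache removes the locally cached chunks for a mounted directory:

echo "remote.meta.sync -dir=/buckets/seaweed" | weed shell
# Drop cached chunks so subsequent reads pull current content from the remote
echo "remote.uncache -dir=/buckets/seaweed" | weed shell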

The files referenced in the script can be downloaded from https://filetransfer.io/data-package/yBqqJKfK#link
