[S3API] Empty Etag returned when using Azure Storage as remote #3987

Closed
aronneagu opened this issue Nov 16, 2022 · 5 comments

Comments

@aronneagu
Contributor

Describe the bug
Fetching the ETag of a file on an Azure remote through the SeaweedFS S3 API, using the MinIO Python SDK, returns an empty result.

System Setup

> lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic
> weed version
version 30GB 3.33  linux amd64
> python3 --version
Python 3.8.0

Expected behavior
Fetching the ETag for the file should return the ETag as defined in the remote.

Steps to reproduce
Save the following as a bash script and run it. Before running the script, export the variable ACCOUNT_KEY (it was provided in private and starts with rwe...): export ACCOUNT_KEY=rwe...

#!/bin/bash
# Set env vars for ACCOUNT_KEY
export ACCOUNT_NAME="seaweedsa"
# Access Key of the server.
pip3 install minio
TMPDIR=$(mktemp -d)
mkdir "$TMPDIR/sync"
echo "Log files are saved in: $TMPDIR"
/usr/local/bin/weed -logdir=${TMPDIR} server -dir=${TMPDIR} -s3 2>/dev/null &
server_pid=$!
echo "remote.configure -name=azure -type=azure -azure.account_name=$ACCOUNT_NAME -azure.account_key=$ACCOUNT_KEY" | weed shell
echo "remote.mount -remote=azure/seaweed -dir=/buckets/seaweed" | weed shell
/usr/local/bin/weed -logdir=${TMPDIR}/sync filer.remote.sync -dir=/buckets/seaweed 2>&1 &
sync_pid=$!
cat <<EOF >min.py
from minio import Minio
import io, os
client = Minio("localhost:8333",secure=False)
result = client.put_object(
    "seaweed", "test-object", io.BytesIO(b"hello"), 5,
    metadata={"Uploaded-by": "aneagu"},
    content_type="application/parquet",
)
result = client.stat_object("seaweed","test-object")
print("The ETAG is: ")
# A bare expression prints nothing in a script, so print the value explicitly
print(result.etag)
EOF
python3 min.py
echo "Waiting for sync to complete" && sleep 2
kill $sync_pid
kill $server_pid

Screenshots
SeaweedFS is aware of the ETag in the remote (Azure Storage in this case), as can be seen in the remoteETag key:

# weed shell
> meta.cat /buckets/seaweed/test-object
{
  "name":  "test-object",
  "isDirectory":  false,
  "chunks":  [
    {
      "fileId":  "4,016fecacc6",
      "offset":  "0",
      "size":  "5",
      "modifiedTsNs":  "1668593275975362363",
      "eTag":  "XUFAKrxLKna5cZ2REBfFkg==",
      "sourceFileId":  "",
      "fid":  {
        "volumeId":  4,
        "fileKey":  "1",
        "cookie":  1877781702
      },
      "sourceFid":  null,
      "cipherKey":  "",
      "isCompressed":  false,
      "isChunkManifest":  false
    }
  ],
  "attributes":  {
    "fileSize":  "5",
    "mtime":  "1668593275",
    "fileMode":  432,
    "uid":  0,
    "gid":  0,
    "crtime":  "1668593275",
    "mime":  "application/parquet",
    "ttlSec":  0,
    "userName":  "",
    "groupName":  [],
    "symlinkTarget":  "",
    "md5":  "XUFAKrxLKna5cZ2REBfFkg==",
    "rdev":  0,
    "inode":  "0"
  },
  "extended":  {
    "X-Amz-Meta-Uploaded-By":  "YW5lYWd1"
  },
  "hardLinkId":  "",
  "hardLinkCounter":  0,
  "content":  "",
  "remoteEntry":  {
    "storageName":  "azure",
    "lastLocalSyncTsNs":  "1668593276051568680",
    "remoteETag":  "\"0x8DAC7BA6EA26EDD\"",
    "remoteMtime":  "1668593276",
    "remoteSize":  "5"
  },
  "quota":  "0"
}chunks 1 meta size: 218 gzip:246
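The encoded values in the output above can be read back with a few lines of Python. This is a minimal sketch, assuming the `eTag`/`md5` fields hold the base64 of the raw MD5 digest of the object body and that the extended metadata value is base64 of a UTF-8 string; the variable names are illustrative only.

```python
import base64
import hashlib

# Values copied from the meta.cat output above.
chunk_etag_b64 = "XUFAKrxLKna5cZ2REBfFkg=="
uploaded_by_b64 = "YW5lYWd1"

# Base64 of the raw MD5 digest of the uploaded body (b"hello" in the script).
expected_b64 = base64.b64encode(hashlib.md5(b"hello").digest()).decode("ascii")

# Hex form of the same digest, which is how S3 clients usually show an ETag.
etag_hex = base64.b64decode(chunk_etag_b64).hex()

# The extended attribute decodes back to the metadata set in put_object.
uploaded_by = base64.b64decode(uploaded_by_b64).decode("utf-8")

print(expected_b64 == chunk_etag_b64)
print(etag_hex)
print(uploaded_by)
```

For an object written through the S3 API, the chunk `eTag` and the `md5` attribute are both populated, which is the case that works.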

Additional context
Minio Python SDK: https://min.io/docs/minio/linux/developers/python/API.html

@aronneagu
Contributor Author

I think I have found something: it seems that when a file has exactly 1 chunk, the ETag header is empty. I will try to reproduce it next week.

@aronneagu
Contributor Author

aronneagu commented Nov 29, 2022

[Update]

Mount a remote (Azure Storage)

  • get metadata on file from remote (when the file is not in the Filer) -> returns ETag
  • download file (a copy is saved in filer)
  • get metadata on file (the file is now cached on Filer) -> missing ETag

NOTE: this error appears only when the file is 1 chunk in size

Steps to reproduce

#!/bin/bash
# Start weed server
export ACCOUNT_NAME="seaweedsa"
export WEED_BIN=$(which weed)
export TMPDIR=$(mktemp -d)
# Kill previous runs of weed server
echo "Stopping previous instances of weed server..."
pkill -9 weed

$WEED_BIN -logdir=${TMPDIR} server -dir=${TMPDIR} -s3 2>/dev/null &
echo "Starting weed server..."
server_pid=$!

# Mount remote
echo "remote.configure -name=azure -type=azure -azure.account_name=$ACCOUNT_NAME -azure.account_key=$ACCOUNT_KEY" | weed shell
echo "remote.mount -remote=azure/seaweed -dir=/buckets/seaweed" | weed shell

# Get Etag (file not cached)
echo "meta.cat /buckets/seaweed/old_data_period.pq" | weed shell
aws --endpoint-url http://localhost:8333 s3api head-object --bucket seaweed --key old_data_period.pq
# Download file
aws --endpoint-url http://localhost:8333 s3 cp s3://seaweed/old_data_period.pq .
# Get Etag (file cached)
echo "meta.cat /buckets/seaweed/old_data_period.pq" | weed shell
aws --endpoint-url http://localhost:8333 s3api head-object --bucket seaweed --key old_data_period.pq

kill $server_pid

Output

     1  Stopping previous instances of weed server...
     2  Starting weed server...
     3  master: localhost:9333 filers: [10.7.1.6:8888]
     4  > > master: localhost:9333 filers: [10.7.1.6:8888]
     5  > /buckets/seaweed/old_data_period.pq (create)
     6  > master: localhost:9333 filers: [10.7.1.6:8888]
     7  > {
     8    "name": "old_data_period.pq",
     9    "isDirectory": false,
    10    "chunks": [],
    11    "attributes": {
    12      "fileSize": "2604226",
    13      "mtime": "1669741451",
    14      "fileMode": 420,
    15      "uid": 0,
    16      "gid": 0,
    17      "crtime": "0",
    18      "mime": "",
    19      "ttlSec": 0,
    20      "userName": "",
    21      "groupName": [],
    22      "symlinkTarget": "",
    23      "md5": "",
    24      "rdev": 0,
    25      "inode": "0"
    26    },
    27    "extended": {},
    28    "hardLinkId": "",
    29    "hardLinkCounter": 0,
    30    "content": "",
    31    "remoteEntry": {
    32      "storageName": "azure",
    33      "lastLocalSyncTsNs": "0",
    34      "remoteETag": "0x8DAD22BBCD7370A",
    35      "remoteMtime": "1669741451",
    36      "remoteSize": "2604226"
    37    },
    38    "quota": "0"
    39  }chunks 0 meta size: 75 gzip:103
    40  > {
    41      "AcceptRanges": "bytes",
    42      "LastModified": "Tue, 29 Nov 2022 17:04:11 GMT",
    43      "ContentLength": 2604226,
    44      "ETag": "\"d41d8cd98f00b204e9800998ecf8427e-0\"",
    45      "ContentDisposition": "inline; filename=\"old_data_period.pq\"",
    46      "Metadata": {}
    47  }
    49  master: localhost:9333 filers: [10.7.1.6:8888]
    50  > {
    51    "name": "old_data_period.pq",
    52    "isDirectory": false,
    53    "chunks": [
    54      {
    55        "fileId": "3,013bc959aa",
    56        "offset": "0",
    57        "size": "2604226",
    58        "modifiedTsNs": "1669742752",
    59        "eTag": "",
    60        "sourceFileId": "",
    61        "fid": {
    62          "volumeId": 3,
    63          "fileKey": "1",
    64          "cookie": 1003051434
    65        },
    66        "sourceFid": null,
    67        "cipherKey": "",
    68        "isCompressed": false,
    69        "isChunkManifest": false
    70      }
    71    ],
    72    "attributes": {
    73      "fileSize": "2604226",
    74      "mtime": "1669741451",
    75      "fileMode": 420,
    76      "uid": 0,
    77      "gid": 0,
    78      "crtime": "0",
    79      "mime": "",
    80      "ttlSec": 0,
    81      "userName": "",
    82      "groupName": [],
    83      "symlinkTarget": "",
    84      "md5": "",
    85      "rdev": 0,
    86      "inode": "0"
    87    },
    88    "extended": {},
    89    "hardLinkId": "",
    90    "hardLinkCounter": 0,
    91    "content": "",
    92    "remoteEntry": {
    93      "storageName": "azure",
    94      "lastLocalSyncTsNs": "1669742752917245243",
    95      "remoteETag": "0x8DAD22BBCD7370A",
    96      "remoteMtime": "1669741451",
    97      "remoteSize": "2604226"
    98    },
    99    "quota": "0"
   100  }chunks 1 meta size: 123 gzip:151
   101  > {
   102      "AcceptRanges": "bytes",
   103      "LastModified": "Tue, 29 Nov 2022 17:04:11 GMT",
   104      "ContentLength": 2604226,
   105      "ContentDisposition": "inline; filename=\"old_data_period.pq\"",
   106      "Metadata": {}
   107  }

You'll see that line 39 says chunks 0, meaning the file is not saved on the Filer, and that line 44 contains the ETag. Later, after we download the file, chunks 1 on line 100 tells us that the Filer has saved the file, but lines 101-107 are now missing the ETag.
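The ETag on line 44 has the shape of an S3 multipart ETag: the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. The sketch below assumes that convention (whether SeaweedFS computes it exactly this way is an assumption, and `multipart_style_etag` is a hypothetical helper name), and shows that zero chunks would yield the observed value.

```python
import hashlib


def multipart_style_etag(chunk_md5_digests):
    """Return md5(concatenated raw chunk digests) plus '-<chunk count>'."""
    combined = hashlib.md5(b"".join(chunk_md5_digests)).hexdigest()
    return f"{combined}-{len(chunk_md5_digests)}"


# With no chunks, the input to MD5 is empty, so this is md5(b"") with "-0".
print(multipart_style_etag([]))
```

Under that assumption, the uncached ETag says nothing about the file contents; it reflects only that the entry has no local chunks yet.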

@chrislusf
Collaborator

The ways to calculate the ETag are different for SeaweedFS and Azure. Does it matter to you?

@aronneagu
Contributor Author

At the moment, I am not bothered that they are calculated differently, but the lack of an ETag is worrying. We are running Python packages (s3fs and pyarrow) that rely on the ETag header being present. I'd hope it doesn't really matter what the value is, as long as it's computed in a consistent fashion.
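To illustrate why a missing header is a problem for clients, here is a small sketch of a check over the two head-object responses shown earlier. It works on plain dicts and does not reproduce how s3fs or pyarrow handle the header; `etag_or_none` is a hypothetical helper.

```python
def etag_or_none(head_response):
    """Return the ETag without surrounding quotes, or None if it is absent."""
    etag = head_response.get("ETag")
    if not etag:
        return None
    return etag.strip('"')


# Response before the file is cached on the Filer (ETag present).
uncached = {
    "ContentLength": 2604226,
    "ETag": "\"d41d8cd98f00b204e9800998ecf8427e-0\"",
}
# Response after the file is cached on the Filer (ETag missing).
cached = {"ContentLength": 2604226}

print(etag_or_none(uncached))
print(etag_or_none(cached))
```

A client that uses the ETag as a cache key or version marker has nothing to compare against in the second case.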

@aronneagu
Contributor Author

aronneagu commented Nov 30, 2022

I've discovered that when files are fetched from the remote and saved to the Filer, their chunks have an empty eTag key, e.g.:

> meta.cat old_data_period.pq
{
  "name": "old_data_period.pq",
  "isDirectory": false,
  "chunks": [
    {
      "fileId": "17,7577ea1377",
      "offset": "0",
      "size": "2604226",
      "modifiedTsNs": "1669829772",
      "eTag": "",                        <- THIS IS EMPTY
      "sourceFileId": "",
      "fid": {
        "volumeId": 17,
        "fileKey": "117",
        "cookie": 2011829111
      },
      "sourceFid": null,
      "cipherKey": "",
      "isCompressed": false,
      "isChunkManifest": false
    }
  ],
  "attributes": {
    "fileSize": "2604226",
    "mtime": "1669741451",
    "fileMode": 420,
    "uid": 0,
    "gid": 0,
    "crtime": "0",
    "mime": "",
    "ttlSec": 0,
    "userName": "",
    "groupName": [],
    "symlinkTarget": "",
    "md5": "",                  <- THIS IS EMPTY TOO
    "rdev": 0,
    "inode": "0"
  },
  "extended": {},
  "hardLinkId": "",
  "hardLinkCounter": 0,
  "content": "",
  "remoteEntry": {
    "storageName": "azure",
    "lastLocalSyncTsNs": "1669829772428494400",
    "remoteETag": "0x8DAD22BBCD7370A",
    "remoteMtime": "1669741451",
    "remoteSize": "2604226"
  },
  "quota": "0"
}chunks 1 meta size: 131 gzip:159   <- CHUNKS=1 means file is saved on Filer

I've found that the missing ETag is most likely coming from here: https://github.com/seaweedfs/seaweedfs/blob/master/weed/server/filer_grpc_server_remote.go#L153

But I am not sure exactly how to set its value; it's probably some kind of md5sum. If someone would be kind enough to point me to the right function, I'd appreciate it.
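For reference, the populated `eTag` in the first comment (`XUFAKrxLKna5cZ2REBfFkg==`, for a body of `hello`) is consistent with a base64-encoded MD5 digest. The Python sketch below shows that computation done incrementally while data is read in blocks. It is only an illustration of the idea, not the Go code at the linked line, and `md5_base64_of_stream` is a hypothetical name.

```python
import base64
import hashlib
import io


def md5_base64_of_stream(stream, block_size=1024 * 1024):
    """Hash a binary stream block by block and return the base64 MD5 digest."""
    digest = hashlib.md5()
    while True:
        block = stream.read(block_size)
        if not block:
            break
        digest.update(block)
    return base64.b64encode(digest.digest()).decode("ascii")


# Small block size to exercise the incremental path on a 5-byte body.
print(md5_base64_of_stream(io.BytesIO(b"hello"), block_size=2))
```

Hashing while the data streams through avoids reading the object a second time just to compute its digest.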
