**FR** - Exemples d'interaction avec un service Ceph de type stockage objet (S3)
<hr> 

**EN** - Examples of how to interact with a Ceph object storage (S3) service

In [None]:
import boto3
import getpass # Used to input secrets

In [None]:
access_key = getpass.getpass(prompt='Access key: ')

In [None]:
secret_access_key = getpass.getpass(prompt='Secret access key: ') 

In [None]:
session = boto3.session.Session()

s3_client = session.client(
    service_name='s3',
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_access_key,
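    endpoint_url='https://142.98.30.204:8080', # Load balancer URL; boto3 ignores use_ssl when endpoint_url is given
    verify=False, # Skip TLS certificate verification (e.g. self-signed certificate)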
)

In [None]:
s3_client.list_buckets()

In [None]:
s3_client.create_bucket(Bucket="fst")
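
Re-running the cell above on a bucket that already exists may raise an error or be silently accepted, depending on the service. A minimal sketch of a guard, assuming the bucket is owned by the same credentials (so that it shows up in `list_buckets()`); `ensure_bucket` is a hypothetical helper name, not part of boto3.

```python
def ensure_bucket(client, name):
    """Create bucket `name` unless it already appears in list_buckets().

    Returns True if the bucket was created, False if it was already listed.
    """
    listed = client.list_buckets().get('Buckets', [])
    existing = {bucket['Name'] for bucket in listed}
    if name in existing:
        return False
    client.create_bucket(Bucket=name)
    return True
```

With the client defined above, `ensure_bucket(s3_client, "fst")` could replace the direct `create_bucket` call.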

In [None]:
s3_client.list_buckets()

In [None]:
s3_client.list_objects(Bucket="fst")

In [None]:
# Put file in bucket
# The `with` block closes the file once the upload has finished
with open('your_file_path', 'rb') as data:
    s3_client.put_object(Bucket="your_bucket", Key='filename_you_want', Body=data)

# Utilities

## List objects

In [None]:
# Loop through objects listed until no more pagination (IsTruncated: False)
bucket_name = 'fst'
objects = []

response = s3_client.list_objects(Bucket=bucket_name)

while True:
    for obj in response.get('Contents', []):
        objects.append(obj)

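    if response.get('IsTruncated', False):
        # If IsTruncated is True, there are more objects to retrieve.
        # NextMarker is only returned when a Delimiter is used; otherwise
        # the last key of the current page serves as the marker.
        marker = response.get('NextMarker') or response['Contents'][-1]['Key']
        response = s3_client.list_objects(Bucket=bucket_name, Marker=marker)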
    else:
        break

# Now, the 'objects' list contains all objects in the bucket.

if not objects:
    print("The list is empty.")
else:
    for obj in objects:
        print(f"Object: {obj['Key']}, Size: {obj['Size']} bytes")
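
The manual loop above can also be delegated to boto3's built-in paginator, which follows the continuation tokens itself. A minimal sketch, assuming the service supports `list_objects_v2` (the newer listing call, not the `list_objects` used above); `iter_objects` is a hypothetical helper name.

```python
def iter_objects(client, bucket):
    """Yield every object entry of `bucket`, page after page."""
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        # 'Contents' is absent from a page that holds no objects
        yield from page.get('Contents', [])
```

Usage with the names defined earlier would look like `for obj in iter_objects(s3_client, bucket_name): print(obj['Key'], obj['Size'])`.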

## Delete objects

In [None]:
# Delete all objects in <bucket_name>

while True:
    # List the objects still in the bucket (one page per iteration)
    response = s3_client.list_objects(Bucket=bucket_name)

    # Create a list of objects to delete
    objects_to_delete = [{'Key': obj['Key']} for obj in response.get('Contents', [])]
    print("Objects to delete list = " + str(objects_to_delete))

    if not objects_to_delete:
        print("Empty bucket.  No objects deleted.")
        break

    # Delete the objects
    delete_response = s3_client.delete_objects(
        Bucket=bucket_name,
        Delete={'Objects': objects_to_delete}
    )

    # Check the response for errors
    errors = delete_response.get('Errors', [])
    for error in errors:
        print(f"Error deleting object {error['Key']}: {error['Message']}")
    if errors:
        # Stop here: retrying the same keys would loop forever
        break

    # Deleted objects no longer appear in the listing, so the next
    # iteration simply lists from the start again; no marker is needed.
    if not response.get('IsTruncated', False):
        print("All objects deleted successfully.")
        break
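
The loop above stays within the S3 limit of 1000 keys per `delete_objects` request because a listing page returns at most 1000 entries by default. When the keys come from somewhere else (a list built beforehand, for instance), they need to be split first. A minimal sketch; `delete_batches` is a hypothetical helper name and the 1000-key limit is assumed to apply to the target service as it does to AWS S3.

```python
def delete_batches(keys, size=1000):
    """Split `keys` into lists shaped for the Delete={'Objects': ...} argument."""
    for start in range(0, len(keys), size):
        yield [{'Key': key} for key in keys[start:start + size]]
```

Each batch can then be passed as `s3_client.delete_objects(Bucket=bucket_name, Delete={'Objects': batch})`.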


## Extract filename.ext from a path
Useful for deriving the object key when uploading files to S3

In [None]:
import os

def get_key(fullpath):
    """
    Return the file name (with its extension) from a full file path.

    Parameters:
    fullpath (str): The full path to the file.

    Returns:
    str: The file name with its extension.
    """
    # basename keeps everything after the last path separator, so names
    # with several dots (archive.tar.gz) or with no extension also work
    return os.path.basename(fullpath)


In [None]:
get_key("my/path/and/filename.rs")
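
File names do not always have exactly one dot, which matters when a key is derived from a path. A short sketch of the edge cases using `pathlib.PurePosixPath`; `key_from_path` is a hypothetical name used only for this illustration.

```python
from pathlib import PurePosixPath

def key_from_path(fullpath):
    """Return the last path component, however many dots it contains."""
    return PurePosixPath(fullpath).name

print(key_from_path("my/path/and/filename.rs"))     # filename.rs
print(key_from_path("my/path/and/archive.tar.gz"))  # archive.tar.gz
print(key_from_path("my/path/and/README"))          # README
```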

## Big file upload
- get an `upload_id`
- chunk the file (5 GB is generally the maximum part size; every part except the last must be at least 5 MB)
- finish the multipart upload
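
Before chunking, it can help to check that a given chunk size is acceptable and to know how many parts it produces. A minimal sketch, assuming the limits documented for AWS S3 (parts between 5 MiB and 5 GiB, at most 10,000 parts); a Ceph deployment may be configured differently, and `plan_parts` is a hypothetical helper name.

```python
MIN_PART = 5 * 1024**2   # 5 MiB, assumed minimum (the last part may be smaller)
MAX_PART = 5 * 1024**3   # 5 GiB, assumed maximum
MAX_PARTS = 10_000       # assumed maximum number of parts per upload

def plan_parts(file_size, chunk_size):
    """Return how many parts a file of `file_size` bytes needs."""
    if not MIN_PART <= chunk_size <= MAX_PART:
        raise ValueError("chunk_size must be between 5 MiB and 5 GiB")
    # Integer ceiling division; an empty file still needs one (empty) part
    n_parts = max(1, (file_size + chunk_size - 1) // chunk_size)
    if n_parts > MAX_PARTS:
        raise ValueError("too many parts; use a larger chunk_size")
    return n_parts
```

For example, a 6 GiB file cut into 5 GiB chunks gives `plan_parts(6 * 1024**3, 5 * 1024**3) == 2`.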

In [None]:
# Specify the bucket and object key
bucket_name = 'your_bucket'
file_path = 'input_file'
object_key = get_key(file_path)

# Initiate Multipart Upload
response = s3_client.create_multipart_upload(Bucket=bucket_name, Key=object_key)
upload_id = response['UploadId'] # upload_id is needed by the next two steps

In [None]:
# Split the large file into smaller parts
# Each chunk is read fully into memory: lower chunk_size if RAM is limited
chunk_size = 5 * 1024 * 1024 * 1024 # 5 GB
parts = []
# Reuse file_path and object_key from the cell above so that every part
# goes to the key the multipart upload was initiated for
with open(file_path, 'rb') as f:
    i = 1
    while True:
        data = f.read(chunk_size)
        if not data:
            break
        part = s3_client.upload_part(
            Bucket=bucket_name,
            Key=object_key,
            PartNumber=i,
            UploadId=upload_id,
            Body=data
        )
        # The ETag of each part is required to complete the upload
        parts.append({'PartNumber': i, 'ETag': part['ETag']})
        i += 1

In [None]:
# Complete the multipart upload
s3_client.complete_multipart_upload(
    Bucket=bucket_name,
    Key=object_key,
    MultipartUpload={'Parts': parts},
    UploadId=upload_id
)
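
If one of the three steps fails, the parts already uploaded stay on the server until the multipart upload is aborted. A minimal sketch that chains the steps and calls `abort_multipart_upload` on any error; `upload_in_parts` is a hypothetical helper name, and `client` is expected to be a boto3 S3 client such as the `s3_client` defined earlier.

```python
def upload_in_parts(client, bucket, key, path, chunk_size):
    """Upload `path` to `bucket`/`key` in parts; abort the upload on failure."""
    upload_id = client.create_multipart_upload(Bucket=bucket, Key=key)['UploadId']
    parts = []
    try:
        with open(path, 'rb') as f:
            # iter(callable, sentinel) keeps reading until read() returns b''
            chunks = iter(lambda: f.read(chunk_size), b'')
            for number, data in enumerate(chunks, start=1):
                part = client.upload_part(
                    Bucket=bucket, Key=key, PartNumber=number,
                    UploadId=upload_id, Body=data,
                )
                parts.append({'PartNumber': number, 'ETag': part['ETag']})
        return client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={'Parts': parts},
        )
    except Exception:
        # Discard the parts uploaded so far, then re-raise the original error
        client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```

boto3 also offers a higher-level `upload_file` method that performs multipart uploads on its own; the explicit version is kept here because it mirrors the cells above.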

# Upload Zarr

A Zarr "file" is really a collection of files and directories, so the example below is effectively a how-to for uploading a directory. <br> <br>
The specific case illustrated here is a MinIO service that has "certificate issues". The usual scenario is a MinIO test setup that uses a self-signed certificate; this example instead uses the internal DNS name of one of the pods running the MinIO service (and its exposed `NodePort`), although this is not shown explicitly thanks to `getpass`. An external IP is of course necessary for production use.

In [None]:
import pathlib
from minio import Minio

# Given `client` (a `minio.Minio` instance), `bucket_name` and
# a Zarr file (a directory that has the name of the dataset and contains the series of files and sub-directories that make up the zarr "file")

def upload_zarr_directory(client, bucket_name, local_directory):
    # Validate the arguments
    if not isinstance(client, Minio):
        raise ValueError("client must be an instance of Minio")
    if not isinstance(bucket_name, str):
        raise ValueError("bucket_name must be a string")
    if not isinstance(local_directory, str):
        raise ValueError("local_directory must be a string")

    # Check if the bucket exists
    if not client.bucket_exists(bucket_name):
        raise ValueError("Bucket '{}' does not exist on the client".format(bucket_name))

    # Check that the Zarr "file" exists; it is a directory, not a regular file
    zarr_directory = pathlib.Path(local_directory)
    if not zarr_directory.is_dir():
        raise ValueError(f"{local_directory} is not a valid local directory.")

    # Upload every file, prefixing object names with the dataset (directory) name
    for file_path in zarr_directory.glob('**/*'):
        if file_path.is_file():
            relative_path = file_path.relative_to(zarr_directory).as_posix()
            object_name = f"{zarr_directory.name}/{relative_path}"
            client.fput_object(bucket_name, object_name, str(file_path))
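
Since the upload is a directory walk, it can be useful to preview the object names before sending anything. A minimal sketch that only uses the standard library; `zarr_object_names` is a hypothetical helper name, and prefixing each object with the directory name is an assumption about the desired bucket layout.

```python
import pathlib

def zarr_object_names(local_directory):
    """Yield (local file path, object name) pairs for every file in the tree."""
    root = pathlib.Path(local_directory)
    for file_path in sorted(root.rglob('*')):
        if file_path.is_file():
            # as_posix() keeps '/' as separator, which S3 keys expect
            relative = file_path.relative_to(root).as_posix()
            yield file_path, f"{root.name}/{relative}"
```

For a directory named `ds.zarr` containing `.zgroup` and `temp/0.0`, this yields the object names `ds.zarr/.zgroup` and `ds.zarr/temp/0.0`.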

# Synchronize local directory with a bucket

Users may want to keep a copy of a specific folder in a bucket, as a form of backup or as a way to access resources through the web (TTW).  The MinIO client (`mc`) can "[mirror](https://min.io/docs/minio/linux/reference/minio-mc/mc-mirror.html)" a local directory to a bucket, but this is a one-way process and it requires `mc` to be installed.  Installing it is fairly easy (just [download one file](https://min.io/docs/minio/linux/reference/minio-mc.html#install-mc), make it executable and run it), but there is a more comprehensive solution: [FUSE](https://en.wikipedia.org/wiki/Filesystem_in_Userspace), and more specifically in our case [s3fs](https://linuxbeast.com/aws-operations/how-to-install-s3fs-and-mount-an-s3-bucket-on-ubuntu-20-04/).<br><br>
As the latter page puts it:

```
The use case for S3fs is for anyone who needs to access Amazon S3 storage in a more traditional file system interface. This can be especially useful for backing up data, archiving files, or sharing data between different systems. With S3fs, you can interact with S3 as if it were a local file system, making it much easier to automate data transfer and retrieval processes. S3fs is also useful for organizations that use Amazon S3 as their primary storage solution, as it provides a more seamless way to access and manage the data stored there.
```

> NOTE: MinIO implements the AWS S3 API.  As such, software like `s3fs`, which is designed primarily for AWS S3, also works with other S3 implementations.  FUSE can also be used to mount Azure object storage as a local directory, but one must use Microsoft's [blobfuse2](https://learn.microsoft.com/en-us/azure/storage/blobs/blobfuse2-what-is) to do so.

In [None]:
# This is shown in a notebook cell but should be carried out directly on the command line (shown by the prompt sign `$`)
# With `s3fs` previously installed on your system
# $ s3fs destination_bucket_name local_directory -o passwd_file=path_to_your_creds_file -o url=url_to_your_minio_service -o use_path_request_style -o ssl_verify_hostname=0 -o no_check_certificate
# Command to verify your directory is indeed mounted onto a MinIO bucket :
# $ mount | grep s3fs
# s3fs on `local_directory` type fuse.s3fs (rw,nosuid,nodev,relatime,user_id=61144,group_id=61144)


Some of the options (`-o`):
- `passwd_file`: must contain one line structured as `access_key:secret_access_key`, that is, both credential items separated by a colon (`:`)
- `use_path_request_style`: apparently needed for MinIO
- `ssl_verify_hostname` and `no_check_certificate`: needed to bypass SSL issues.

> Note: SSL errors will most often go unnoticed.  The directory simply does not get mounted, and no failure is reported.
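
The file given to `passwd_file` can be generated from the credentials collected with `getpass` earlier. A minimal sketch, assuming a POSIX system; `write_s3fs_passwd` is a hypothetical helper name. The file is created with mode 600 because `s3fs` is expected to reject credential files that other users can read.

```python
import os

def write_s3fs_passwd(path, access_key, secret_access_key):
    """Write `access_key:secret_access_key` to `path`, readable by the owner only."""
    # Create (or truncate) the file with mode 600 from the start
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, 'w') as f:
        f.write(f"{access_key}:{secret_access_key}\n")
    # Tighten permissions in case the file already existed with wider ones
    os.chmod(path, 0o600)
```

The resulting path is what goes after `-o passwd_file=` in the `s3fs` command shown above.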