## Using object storage

Until now, in any experiment we have run on Chameleon, we had to re-download large training sets each time we launched a new compute instance to work on that data. For example, in our “GourmetGram” use case, we had to re-download the Project37 movielens dataset each time we brought up a compute instance to train or evaluate a model on that data.

For a longer-term project, we will want to persist large data sets beyond the lifetime of the compute instance. That way, we can download a very large data set *once* and then re-use it many times with different compute instances, without having to keep a compute instance “alive” all the time, or re-download the data. We will use the object storage service in Chameleon to enable this.

Of the various types of storage available in a cloud computing environment (object, block, file), object storage is the most appropriate for large training data sets. Object storage is cheap, and optimized for storing and retrieving large volumes of data, where the data is not modified frequently. (In object storage, there is no in-place modification of objects - only replacement - so it is not the best solution for files that are frequently modified.)

After you run this experiment, you will know how to:

-   create an object store container at CHI@TACC,
-   copy objects to it,
-   and mount it as a filesystem in a compute instance.

The object storage service is available at CHI@TACC or CHI@UC. In this tutorial, we will use CHI@TACC. The CHI@TACC object store can be accessed from a KVM@TACC VM instance.

### Object storage using the Horizon GUI

First, let’s try creating an object storage container from the OpenStack Horizon GUI.

Open the GUI for CHI@TACC:

-   from the [Chameleon website](https://chameleoncloud.org/hardware/)
-   click “Experiment” \> “CHI@TACC”
-   log in if prompted to do so
-   check the project drop-down menu near the top left (which shows e.g. “CHI-XXXXXX”), and make sure the correct project is selected.

In the menu sidebar on the left side, click on “Object Store” \> “Containers” and then, “Create Container”. You will be prompted to set up your container step by step using a graphical “wizard”.

-   Specify the name as <code>object-persist-<b>teamID</b></code> where in place of <code><b>teamID</b></code> you substitute your team ID (e.g. `project37` in our team's case).
-   Leave other settings at their defaults, and click “Submit”.

### Use `rclone` and authenticate to object store from a compute instance

We will want to connect to this object store from the compute instance we configured earlier, and copy some data to it!

For *write* access to the object store from the compute instance, we will need to authenticate with valid OpenStack credentials. To support this, we will create an *application credential*, which consists of an ID and a secret that allows a script or application to authenticate to the service.

An application credential is a good way for something like a data pipeline to authenticate, since it can be used non-interactively, and can be revoked easily in case it is compromised without affecting the entire user account.

In the menu sidebar on the left side of the Horizon GUI, click “Identity” \> “Application Credentials”. Then, click “Create Application Credential”.

-   In the “Name” field, use “data-persist”.
-   Set the “Expiration” date to the end date of the current semester. (Note that this will be in UTC time, not your local time zone.) This ensures that if your credential is leaked (e.g. you accidentally push it to a public GitHub repository), the damage is mitigated.
-   Click “Create Application Credential”.
-   Copy the “ID” and “Secret” displayed in the dialog, and save them in a safe place. You will not be able to view the secret again from the Horizon GUI. Then, click “Download openrc file” to have another copy of the secret.
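
To make the idea of an application credential more concrete, here is a minimal sketch of the request body that an OpenStack client sends to the identity service when it authenticates with one. It assumes the standard Keystone v3 token API (`POST /v3/auth/tokens`); the helper name `build_token_request` is our own, and in practice the client tool builds this request for us.

```python
import json


def build_token_request(app_cred_id: str, app_cred_secret: str) -> dict:
    """Build the JSON body of a Keystone v3 token request that
    authenticates with an application credential."""
    return {
        "auth": {
            "identity": {
                "methods": ["application_credential"],
                "application_credential": {
                    "id": app_cred_id,
                    "secret": app_cred_secret,
                },
            }
        }
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own ID and secret.
    body = build_token_request("APP_CRED_ID", "APP_CRED_SECRET")
    print(json.dumps(body, indent=2))
```

Note that the request carries only the credential's ID and secret, not the account password, which is why the credential can be revoked without affecting the rest of the user account.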

Now that we have an application credential, we can use it to allow an application to authenticate to the Chameleon object store service. There are several applications and utilities for working with OpenStack’s Swift object store service; we will use one called [`rclone`](https://github.com/rclone/rclone).

On the compute instance, install `rclone`:

``` bash
# run on node-persist
curl https://rclone.org/install.sh | sudo bash
```

We also need to modify the configuration file for FUSE (**F**ilesystem in **USE**rspace: the interface that allows user space applications to mount virtual filesystems), so that object store containers mounted by our user will be available to others, including Docker containers:

``` bash
# run on node-persist
# this line makes sure user_allow_other is un-commented in /etc/fuse.conf
sudo sed -i '/^#user_allow_other/s/^#//' /etc/fuse.conf
```
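
If you would like to see exactly what that `sed` expression changes before touching a system file, the following sketch applies the same substitution to a sample string in Python. The sample text and the function names are illustrative assumptions; the real file is `/etc/fuse.conf`.

```python
import re


def uncomment_user_allow_other(conf_text: str) -> str:
    """Remove the leading '#' from any line that starts with '#user_allow_other'.

    Mirrors: sed -i '/^#user_allow_other/s/^#//' /etc/fuse.conf
    """
    return re.sub(r"^#(?=user_allow_other)", "", conf_text, flags=re.MULTILINE)


def is_enabled(conf_text: str) -> bool:
    """Return True if an un-commented user_allow_other line is present."""
    return re.search(r"^user_allow_other\b", conf_text, flags=re.MULTILINE) is not None


# Sample text standing in for the contents of /etc/fuse.conf.
sample = "# Allow non-root users to specify the allow_other option\n#user_allow_other\n"
fixed = uncomment_user_allow_other(sample)
print(is_enabled(sample), is_enabled(fixed))  # prints: False True
```

Ordinary comment lines (those with a space after the `#`) are left untouched, and running the substitution a second time changes nothing.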

Next, create a configuration file for `rclone` with the ID and secret from the application credential you just generated:

``` bash
# run on node-persist
mkdir -p ~/.config/rclone
nano  ~/.config/rclone/rclone.conf
```

Paste the following into the config file, but substitute your own application credential ID and secret.

You will also need to substitute your own user ID. You can find it using “Identity” \> “Users” in the Horizon GUI; it is an alphanumeric string (*not* the human-readable user name).

    [chi_tacc]
    type = swift
    user_id = YOUR_USER_ID
    application_credential_id = APP_CRED_ID
    application_credential_secret = APP_CRED_SECRET
    auth = https://chi.tacc.chameleoncloud.org:5000/v3
    region = CHI@TACC

Use Ctrl+O and Enter to save the file, and Ctrl+X to exit `nano`.
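
As an alternative to editing the file by hand, the same configuration could be generated with a short script. This is a sketch using Python's standard `configparser` module, which writes the same INI-style layout shown above; the function name `write_rclone_conf` is hypothetical, and the demonstration writes to a scratch directory rather than to `~/.config/rclone/rclone.conf`.

```python
import configparser
import os
import tempfile
from pathlib import Path


def write_rclone_conf(path: Path, user_id: str, cred_id: str, cred_secret: str) -> None:
    """Write a [chi_tacc] remote section to an rclone-style config file."""
    # interpolation=None so that a '%' in the secret is stored literally.
    config = configparser.ConfigParser(interpolation=None)
    config["chi_tacc"] = {
        "type": "swift",
        "user_id": user_id,
        "application_credential_id": cred_id,
        "application_credential_secret": cred_secret,
        "auth": "https://chi.tacc.chameleoncloud.org:5000/v3",
        "region": "CHI@TACC",
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    # A newly created file gets owner-only permissions, since it holds a secret.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        config.write(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        conf = Path(scratch) / "rclone" / "rclone.conf"
        write_rclone_conf(conf, "YOUR_USER_ID", "APP_CRED_ID", "APP_CRED_SECRET")
        print(conf.read_text())
```

Generating the file this way keeps the secret out of your terminal scrollback, but the file itself is still sensitive and should not be committed to a repository.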

To test it, run

``` bash
# run on node-persist
rclone lsd chi_tacc:
```

and verify that you see your container listed. This confirms that `rclone` can authenticate to the object store.

### Retrieve code on the node

``` bash
# run on node-persist
git clone https://github.com/zuchuandatou/MLSysEngOps-NYU25Spring.git
cd MLSysEngOps-NYU25Spring
git checkout ziyi-huang/data-pipeline
```

### Create a pipeline to load training data into the object store

Next, we will prepare a simple ETL pipeline to get the Project37 movielens dataset into the object store. It will:

-   extract the data into a staging area (local filesystem on the instance)
-   transform the data, remapping user and movie IDs to contiguous integers and splitting the ratings into training, validation, and evaluation sets
-   and then load the data into the object store

We are going to define the pipeline stages inside a Docker compose file. All of the services in the compose file will share a common `project37` volume. Then, we have:

1.  A service to extract the Project37 movielens data from the Internet. This service runs a Python container image, downloads the dataset, and unzips it.

<!-- -->

    extract-data:
      container_name: etl_extract_data
      image: python:3.11
      user: root
      volumes:
        - project37:/data
      working_dir: /data
      command:
        - bash
        - -c
        - |
          set -e

          echo "Resetting dataset directory..."
          rm -rf /data/Project-37
          mkdir -p /data/Project-37
          cd /data/Project-37

          echo "Downloading MovieLens 192M zip..."
          curl -L "https://nyu.box.com/shared/static/5r8m3rvjfejcip7nqurf9jbpi5rw5ri1?dl=1" -o ml-192m.zip

          echo "Unzipping dataset..."
          unzip -q ml-192m.zip
          rm -f ml-192m.zip

          echo "Listing contents of /data after extract stage:"
          find /data/Project-37 -mindepth 1 -maxdepth 1 -type d

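2.  A service that runs a Python container image, and uses an inline Python script to remap user and movie IDs to contiguous integers and to randomly assign each rating to a training, validation, or evaluation file (roughly 70% / 15% / 15%).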

<!-- -->

    transform-data:
      container_name: etl_transform_data
      image: python:3.11
      volumes:
        - project37:/data
      working_dir: /data/Project-37
      command:
        - bash
        - -c
        - |
          set -e

          echo "Installing required Python packages..."
          pip install --no-cache-dir pandas scikit-learn

          echo "Transforming MovieLens 192M dataset..."
          python3 -c '
          import os
          import pandas as pd
          import random

          base_dir = "/data/Project-37"
          csv_path = os.path.join(base_dir, "ml-192m", "ratings.csv")
          data_name = "movielens_192m"
          output_dir = os.path.join(base_dir, "raw")
          os.makedirs(output_dir, exist_ok=True)

          train_path = os.path.join(base_dir, "training", f"{data_name}_train.txt")
          val_path = os.path.join(base_dir, "validation", f"{data_name}_val.txt")
          eval_path = os.path.join(base_dir, "evaluation", f"{data_name}_eval.txt")

          for p in [train_path, val_path, eval_path]:
            os.makedirs(os.path.dirname(p), exist_ok=True)

          # Pass 1: Build ID maps
          print("Pass 1: Building user/item ID maps...")
          user_set = set()
          item_set = set()
          for chunk in pd.read_csv(csv_path, chunksize=1_000_000):
            user_set.update(chunk["userId"].unique())
            item_set.update(chunk["movieId"].unique())

          user_map = {u: i + 1 for i, u in enumerate(sorted(user_set))}
          item_map = {m: i + 1 for i, m in enumerate(sorted(item_set))}

          # Pass 2: Remap + assign each row to split immediately
          print("Pass 2: Remapping + splitting rows to output...")
          split_probs = [0.7, 0.15, 0.15]

          with open(train_path, "w") as f_train, open(val_path, "w") as f_val, open(eval_path, "w") as f_eval:
            for chunk in pd.read_csv(csv_path, chunksize=1_000_000):
              chunk["user_id"] = chunk["userId"].map(user_map)
              chunk["item_id"] = chunk["movieId"].map(item_map)
              for _, row in chunk.iterrows():
                line = f"{int(row.user_id)}\t{int(row.item_id)}\n"
                r = random.random()
                if r < split_probs[0]:
                  f_train.write(line)
                elif r < split_probs[0] + split_probs[1]:
                  f_val.write(line)
                else:
                  f_eval.write(line)

          print("Stats:")
          print(f"Total users: { len(user_map) }, items: { len(item_map) }")
          print("Transform complete.")
          '

          echo "Listing contents after transform:"
          find /data/Project-37 -mindepth 1 -maxdepth 1 -type d


3.  And finally, a service that uses `rclone copy` to load the organized data into the object store. Note that we pass some arguments to `rclone copy` to increase the parallelism, so that the data is loaded more quickly. Also note that since the name of the container includes your **team ID**, we have specified it using an environment variable that must be set before this stage can run.

<!-- -->

    load-data:
      container_name: etl_load_data
      image: rclone/rclone:latest
      volumes:
        - project37:/data
        - ~/.config/rclone/rclone.conf:/root/.config/rclone/rclone.conf:ro
      entrypoint: /bin/sh
      command:
        - -c
        - |
          if [ -z "$RCLONE_CONTAINER" ]; then
            echo "ERROR: RCLONE_CONTAINER is not set"
            exit 1
          fi

          echo "Cleaning up existing contents of container..."
          rclone delete chi_tacc:$RCLONE_CONTAINER --rmdirs || true

          echo "Uploading data from /data/Project-37..."
          rclone copy /data/Project-37 chi_tacc:$RCLONE_CONTAINER \
            --progress \
            --transfers=32 \
            --checkers=16 \
            --multi-thread-streams=4 \
            --fast-list

          echo "Listing directories in container after load stage:"
          rclone lsd chi_tacc:$RCLONE_CONTAINER

These services are defined in `~/MLSysEngOps-NYU25Spring/ziyi-huang-data-pipeline/docker/docker-compose-etl.yaml`.
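
To see what the transform stage does to each rating, the remapping and splitting logic can be tried on a handful of rows. The sketch below mirrors the script's two steps without `pandas`; the toy ratings, the helper names, and the fixed random seed are assumptions added for illustration.

```python
import random


def remap_ids(values):
    """Map each distinct raw ID to a contiguous integer starting at 1, in sorted order."""
    return {v: i + 1 for i, v in enumerate(sorted(set(values)))}


def assign_split(r, probs=(0.7, 0.15, 0.15)):
    """Pick a split name from a uniform draw r in [0, 1)."""
    if r < probs[0]:
        return "train"
    if r < probs[0] + probs[1]:
        return "val"
    return "eval"


# Toy (userId, movieId) pairs standing in for rows of ratings.csv.
ratings = [(10, 500), (10, 700), (42, 500), (99, 1200)]
user_map = remap_ids(u for u, _ in ratings)
item_map = remap_ids(m for _, m in ratings)

rng = random.Random(0)  # fixed seed (an addition here) so reruns give the same split
for user, movie in ratings:
    split = assign_split(rng.random())
    print(f"{split}\t{user_map[user]}\t{item_map[movie]}")
```

Because each rating is assigned independently, the 70/15/15 proportions hold only approximately, and a given user's ratings can end up spread across all three files.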

Now, we can run the stages using Docker. (If we had a workflow orchestrator, we could use it to run the pipeline stages - but we don’t really need orchestration at this point.)

``` bash
# run on node-persist
cd ~/MLSysEngOps-NYU25Spring/ziyi-huang-data-pipeline/docker
```

``` bash
# run on node-persist
docker compose -f docker-compose-etl.yaml run extract-data
```

``` bash
# run on node-persist
docker compose -f docker-compose-etl.yaml run transform-data
```

For the last stage, the container name is not specified in the Docker compose YAML (since it has your **team ID** in it) - so we have to pass it as an environment variable first. Substitute your own **team ID** in the line below:

``` bash
# run on node-persist
export RCLONE_CONTAINER=object-persist-project37
docker compose -f docker-compose-etl.yaml run load-data
```
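
Short of a full workflow orchestrator, the three stages could also be chained by a small script that stops at the first failure. This is only a sketch: `run_pipeline` and `stage_command` are hypothetical names, the `runner` parameter exists so that the commands can be printed rather than executed, and the container name shown is the example value from above.

```python
import os
import subprocess

COMPOSE_FILE = "docker-compose-etl.yaml"
STAGES = ["extract-data", "transform-data", "load-data"]


def stage_command(stage: str) -> list[str]:
    """Build the docker compose invocation for one pipeline stage."""
    return ["docker", "compose", "-f", COMPOSE_FILE, "run", stage]


def run_pipeline(env: dict, runner=subprocess.run) -> list[str]:
    """Run the stages in order and return the names of the completed stages."""
    if not env.get("RCLONE_CONTAINER"):
        raise ValueError("RCLONE_CONTAINER is not set")
    completed = []
    for stage in STAGES:
        # With subprocess.run, check=True raises CalledProcessError on a
        # non-zero exit status, so later stages are skipped after a failure.
        runner(stage_command(stage), env=env, check=True)
        completed.append(stage)
    return completed


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    def print_only(cmd, **kwargs):
        print(" ".join(cmd))

    env = {**os.environ, "RCLONE_CONTAINER": "object-persist-project37"}
    run_pipeline(env, runner=print_only)
```

To actually execute the stages, call `run_pipeline(env)` without the `runner` argument from the directory that contains the compose file.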

Now our training data is loaded into the object store and ready to use for training! We can clean up the Docker volume used as the temporary staging area:

``` bash
# run on node-persist
docker container prune -f
docker volume rm project37-etl_project37
```

In the Horizon GUI, note that we can browse the object store and download any file from it. This container is independent of any compute instance - it will persist, and its data is still saved, even if we have no active compute instance. (In fact, we *already* have no active compute instance on CHI@TACC.)

### Mount an object store to local file system

Now that our data is safely inside the object store, we can use it anywhere - on a VM, on a bare metal site, on multiple compute instances at once, even outside of Chameleon - to train or evaluate a model. We would not have to repeat the ETL pipeline each time we want to use the data.

If we were working on a brand-new compute instance, we would need to download `rclone` and create the `rclone` configuration file at `~/.config/rclone/rclone.conf`, as we have above. Since we have already done these steps in order to load data into the object store, we don’t need to repeat them.

The next step is to create a mount point for the data in the local filesystem:

``` bash
# run on node-persist
sudo mkdir -p /mnt/object
sudo chown -R cc /mnt/object
sudo chgrp -R cc /mnt/object
```

Now finally, we can use `rclone mount` to mount the object store at the mount point (substituting your own **team ID** in the command below).

``` bash
# run on node-persist
rclone mount chi_tacc:object-persist-teamID /mnt/object --read-only --allow-other --daemon
```

Here,

-   `chi_tacc` tells `rclone` which section of its configuration file to use for authentication information
-   `object-persist-teamID` tells it what object store container to mount
-   `/mnt/object` says where to mount it

Since we only intend to read the data, we can mount it in read-only mode and it will be slightly faster; and we are also protected from accidental writes. We also specified `--allow-other` so that we can use the mount from Docker, and `--daemon` means the `rclone` process will be started in the background.

Run

``` bash
# run on node-persist
ls /mnt/object
```

and confirm that we can now see the Project37 movielens data directories (including `evaluation`, `training`, and `validation`) there.
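
A script that depends on the mount can make the same check programmatically. The following is a minimal sketch using only the Python standard library; `check_mount` is a hypothetical helper, and `/mnt/object` is the mount point created above.

```python
import os


def check_mount(path: str) -> str:
    """Describe whether `path` is an active mount point."""
    if not os.path.isdir(path):
        return f"{path}: directory does not exist"
    if not os.path.ismount(path):
        return f"{path}: exists but nothing is mounted there"
    return f"{path}: mounted, top-level entries: {sorted(os.listdir(path))}"


print(check_mount("/mnt/object"))
```

Checking `os.path.ismount` before reading guards against silently working with an empty directory when the `rclone` process is not running.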


Now, we can start a Docker container with access to that virtual “filesystem”, by passing that directory as a bind mount. Note that to mount a directory that is actually a FUSE filesystem inside a Docker container, we have to pass it using a slightly different `--mount` syntax, instead of the `-v` that we had used in previous examples.

``` bash
# run on node-persist
docker run -d --rm \
  -p 8888:8888 \
  --shm-size 8G \
  -e MOVIELENS_DATA_DIR=/mnt/Project-37 \
  -v ~/MLSysEngOps-NYU25Spring:/home/jovyan/work/ \
  --mount type=bind,source=/mnt/object,target=/mnt/Project-37,readonly \
  --name jupyter \
  quay.io/jupyter/pytorch-notebook:latest

```

Run

``` bash
# run on node-persist
docker logs jupyter
```

and look for a line like

    http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Paste this into a browser tab, but in place of 127.0.0.1, substitute the floating IP assigned to your instance, to open the Jupyter notebook interface that is running on your compute instance.

Then, find the `demo.ipynb` notebook. This notebook evaluates the `food11.pth` model on the evaluation set, which is **streamed from the object store**.
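
Streaming here simply means that file reads under the mount point are served over the network on demand. As a rough illustration (not the contents of `demo.ipynb`), the sketch below reads the evaluation split lazily in batches. It assumes the tab-separated `user_id`, `item_id` format and the file names written by the transform stage, and the `MOVIELENS_DATA_DIR` variable passed to the container; `iter_pairs` and `iter_batches` are hypothetical helper names.

```python
import os
from itertools import islice
from pathlib import Path


def iter_pairs(path):
    """Yield (user_id, item_id) integer pairs from a tab-separated file, one line at a time."""
    with open(path) as f:
        for line in f:
            user, item = line.rstrip("\n").split("\t")
            yield int(user), int(item)


def iter_batches(path, batch_size):
    """Group the pairs into lists of at most batch_size without loading the whole file."""
    pairs = iter_pairs(path)
    while batch := list(islice(pairs, batch_size)):
        yield batch


if __name__ == "__main__":
    data_dir = Path(os.environ.get("MOVIELENS_DATA_DIR", "/mnt/Project-37"))
    eval_file = data_dir / "evaluation" / "movielens_192m_eval.txt"
    if eval_file.exists():
        first = next(iter_batches(eval_file, 1024), [])
        print(f"Read {len(first)} pairs in the first batch; starts with {first[:1]}")
    else:
        print(f"{eval_file} not found; is the object store mounted?")
```

Since only the lines that are consumed get read, network traffic should appear in bursts as each batch is requested rather than as one large download up front.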

To validate that the data is streamed over the network, on the host, run

``` bash
# run on node-persist
sudo apt update
sudo apt -y install nload
nload ens3
```

to monitor the load on the network. Run the `demo.ipynb` notebook inside the Jupyter instance running on “node-persist”, while also watching the `nload` output.

Note the incoming data volume, which should be on the order of Mbits/second when a batch is being loaded.
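
If `nload` is not available, a similar measurement can be approximated by sampling the kernel's interface counters. The sketch below assumes a Linux host that exposes `/proc/net/dev` and the `ens3` interface name used above; the function names are our own.

```python
import time


def rx_bytes(proc_net_dev: str, iface: str) -> int:
    """Return the received-bytes counter for iface from /proc/net/dev text."""
    for line in proc_net_dev.splitlines():
        name, sep, counters = line.partition(":")
        if sep and name.strip() == iface:
            return int(counters.split()[0])  # first counter is bytes received
    raise KeyError(f"interface {iface!r} not found")


def mbits_per_second(bytes_before: int, bytes_after: int, seconds: float) -> float:
    """Convert a byte-counter difference over an interval to Mbit/s."""
    return (bytes_after - bytes_before) * 8 / seconds / 1e6


if __name__ == "__main__":
    iface, interval = "ens3", 1.0
    try:
        with open("/proc/net/dev") as f:
            before = rx_bytes(f.read(), iface)
        time.sleep(interval)
        with open("/proc/net/dev") as f:
            after = rx_bytes(f.read(), iface)
        print(f"{iface}: {mbits_per_second(before, after, interval):.2f} Mbit/s incoming")
    except (OSError, KeyError) as err:
        print(f"Could not read counters: {err}")
```

Running this in a loop while the notebook executes gives a coarse, one-sample-per-second view of the same incoming traffic that `nload` displays.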

Close the Jupyter container tab in your browser, and then stop the container with

``` bash
# run on node-persist
docker stop jupyter
```

since we will bring up a different Jupyter instance in the next section.

### Un-mount an object store

We’ll keep working with this object store in the next part, so you do not have to un-mount it now. But generally speaking, to stop `rclone` running and un-mount the object store, you would run

    fusermount -u /mnt/object

where you specify the path of the mount point.