This script reorganizes datasets stored in a Google Cloud Storage (GCS) bucket into training, development, and testing splits. It assumes the bucket is already structured as follows:

Bucket: 230-project-tiles
  - sentinel-tiles
      - q1
      - q2
      - q3
      - q4
  - mask-tiles
      - q1
      - q2
      - q3
      - q4

and reorganizes the GCS bucket into the following layout:

Bucket: 230-project-tiles
  - sentinel-tiles
      - dev
        - q1
        - q2
        - q3
        - q4
      - test
        - q1
        - q2
        - q3
        - q4
      - train
        - q1
        - q2
        - q3
        - q4
  - mask-tiles
      - dev
        - q1
        - q2
        - q3
        - q4
      - test
        - q1
        - q2
        - q3
        - q4
      - train
        - q1
        - q2
        - q3
        - q4

The script titled Final U-Net Model expects this layout and will not run correctly without it.

Key Features:
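1. **Matching Images and Masks**: Pairs each image with its mask by filename: a mask whose name starts with `mask_` is matched to the image with the same name starting with `image_`. Files without a counterpart are left untouched.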
2. **Random Splitting**: Randomly divides the dataset into `train`, `dev`, and `test` splits based on user-defined proportions.
3. **Split Preservation**: Organizes the splits into subdirectories (e.g., `sentinel-tiles/train/q1/` and `mask-tiles/train/q1/`).
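4. **Cloud Operations**: Uses the `google-cloud-storage` Python client library to copy files to their new paths and then delete the originals, so each file is effectively moved. Comment out the delete calls to keep the source files.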
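
The matching step can be tried without touching the bucket. The sketch below is a minimal illustration that works on plain object names, assuming the `mask_`/`image_` prefix convention used by the script; the helper name `match_pairs`, the sample object names, and the `.tif` extension are hypothetical.

```python
def match_pairs(image_names, mask_names):
    """Pair object names whose basenames differ only by the image_/mask_ prefix."""
    # Key each mask by the basename its image is expected to have.
    masks_by_image_name = {
        name.split("/")[-1].replace("mask_", "image_"): name
        for name in mask_names
        if not name.endswith("/")  # ignore folder placeholder objects
    }
    pairs = []
    for name in image_names:
        base = name.split("/")[-1]
        if base in masks_by_image_name:
            pairs.append((name, masks_by_image_name[base]))
    return pairs


# Hypothetical object names, for illustration only.
images = ["sentinel-tiles/q1/image_001.tif", "sentinel-tiles/q1/image_002.tif"]
masks = ["mask-tiles/q1/mask_001.tif", "mask-tiles/q1/mask_003.tif"]
pairs = match_pairs(images, masks)
print(pairs)
```

Only `image_001.tif` has a counterpart here, so it is the only pair returned; unmatched images and masks are skipped rather than raising an error.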

Functionality:
- Initializes a GCS client and accesses the specified bucket.
- Processes data for each quarter, listing and matching image and mask files.
- Shuffles and splits the data into train, dev, and test subsets.
- Moves the files into appropriate split directories in the same bucket.
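
The shuffle-and-split step can be sketched on an ordinary list. This is an illustration under assumptions, not the script itself: `split_items` is a hypothetical helper, and it uses a local `random.Random(seed)` instance where the script seeds the global `random` module.

```python
import random


def split_items(items, splits, seed=42):
    """Shuffle items reproducibly and slice them into train/dev/test lists."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)

    # int() truncates, so train and dev are rounded down.
    train_cutoff = int(len(shuffled) * splits["train"])
    dev_cutoff = train_cutoff + int(len(shuffled) * splits["dev"])

    return {
        "train": shuffled[:train_cutoff],
        "dev": shuffled[train_cutoff:dev_cutoff],
        "test": shuffled[dev_cutoff:],  # test takes whatever is left
    }


parts = split_items(range(10), {"train": 0.8, "dev": 0.1, "test": 0.1})
sizes = {name: len(files) for name, files in parts.items()}
print(sizes)
```

Because the three slices partition one shuffled list, every item lands in exactly one split, and the same seed always yields the same assignment.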

Parameters:
- `bucket_name`: Name of the GCS bucket containing the dataset.
- `source_image_prefix`: Prefix path to the image tiles in the bucket.
- `source_mask_prefix`: Prefix path to the mask tiles in the bucket.
- `splits`: Dictionary defining the proportions of the dataset to allocate to each split (e.g., `{"train": 0.8, "dev": 0.1, "test": 0.1}`).
- `seed`: Random seed for reproducibility in dataset splitting.
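
The script does not check the `splits` dictionary before moving files. A small guard like the one below could catch mistakes early; `validate_splits` is a hypothetical helper, not part of the script. Note that the script only reads the `train` and `dev` proportions and gives the remainder to `test`, so an inconsistent `test` value would otherwise go unnoticed.

```python
import math


def validate_splits(splits):
    """Raise ValueError unless splits has train/dev/test proportions summing to 1."""
    missing = {"train", "dev", "test"} - set(splits)
    if missing:
        raise ValueError(f"missing split(s): {sorted(missing)}")
    if any(p < 0 for p in splits.values()):
        raise ValueError("split proportions must be non-negative")
    total = sum(splits.values())
    # Compare with a tolerance: 0.8 + 0.1 + 0.1 is not exactly 1.0 in floating point.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"split proportions sum to {total}, expected 1.0")
    return splits


checked = validate_splits({"train": 0.8, "dev": 0.1, "test": 0.1})
```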

Usage:
- Update the `bucket_name`, `source_image_prefix`, and `source_mask_prefix` to match the structure of your GCS bucket.
- Define the desired split proportions in the `splits` dictionary.
- Execute the script to organize the dataset into training, development, and testing splits.

Dependencies:
- Requires `google-cloud-storage` library for interacting with GCS.
- Assumes a valid GCP authentication JSON file is available and set via `GOOGLE_APPLICATION_CREDENTIALS`.
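
Before creating the storage client, it can help to confirm that the credentials variable actually points at a file. The check below is a minimal sketch using only the standard library; `credentials_path` is a hypothetical helper, and it verifies only that the file exists, not that the key is valid.

```python
import os


def credentials_path(env=None):
    """Return the key file path from GOOGLE_APPLICATION_CREDENTIALS, or raise."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"credentials file not found: {path}")
    return path
```

Calling `credentials_path()` right after setting the environment variable fails fast with a clear message instead of a later authentication error.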

In [None]:
from google.cloud import storage
import random
import os
from google.colab import drive
drive.mount('/content/drive')

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path-to-key.json"

# Input Values
bucket_name = "230-project-tiles" # name of GCP bucket
source_image_prefix = "sentinel-tiles/" # folder for image tiles
source_mask_prefix = "mask-tiles/" # folder for mask tiles
splits = {"train": 0.8, "dev": 0.1, "test": 0.1} # train, dev, and test splits

def reorganize_dataset(bucket_name, source_image_prefix, source_mask_prefix, splits, seed=42):
    """
    Reorganize the GCP bucket dataset into train, dev, and test datasets.

    Args:
        bucket_name (str): Name of the GCP bucket.
        source_image_prefix (str): Prefix for image files (e.g., 'sentinel-tiles/').
        source_mask_prefix (str): Prefix for mask files (e.g., 'mask-tiles/').
        splits (dict): Dictionary specifying train, dev, and test splits (e.g., {"train": 0.8, "dev": 0.1, "test": 0.1}).
        seed (int): Random seed for reproducibility.
    """
    # Initialize GCP storage client
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    random.seed(seed)

    for quarter in ["q1/", "q2/", "q3/", "q4/"]:
        print(f"Processing {quarter}...")

        # List all images and masks for the quarter
        image_blobs = list(bucket.list_blobs(prefix=f"{source_image_prefix}{quarter}"))
        mask_blobs = list(bucket.list_blobs(prefix=f"{source_mask_prefix}{quarter}"))

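        # Match images with their corresponding masks by filename:
        # each mask is keyed by the basename its image should have ("mask_" -> "image_")
        mask_dict = {
            blob.name.split('/')[-1].replace("mask_", "image_"): blob
            for blob in mask_blobs
            if not blob.name.endswith('/')  # skip folder placeholder objects
        }
        matched_files = [
            (img_blob, mask_dict[img_blob.name.split('/')[-1]])
            for img_blob in image_blobs
            if img_blob.name.split('/')[-1] in mask_dict
        ]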

        # Shuffle and split files
        random.shuffle(matched_files)
        total_files = len(matched_files)
        train_cutoff = int(total_files * splits["train"])
        dev_cutoff = train_cutoff + int(total_files * splits["dev"])

        # Create splits; int() truncates the train and dev sizes, so test takes the remainder
        train_files = matched_files[:train_cutoff]
        dev_files = matched_files[train_cutoff:dev_cutoff]
        test_files = matched_files[dev_cutoff:]

        # Move files to new locations
        for split_name, split_files in [("train", train_files), ("dev", dev_files), ("test", test_files)]:
            for img_blob, mask_blob in split_files:
                # New paths for image and mask files
                new_image_path = f"{source_image_prefix}{split_name}/{quarter}{img_blob.name.split('/')[-1]}"
                new_mask_path = f"{source_mask_prefix}{split_name}/{quarter}{mask_blob.name.split('/')[-1]}"

                # Copy image and mask files
                bucket.copy_blob(img_blob, bucket, new_image_path)
                bucket.copy_blob(mask_blob, bucket, new_mask_path)

                # Delete the originals so the copy acts as a move; comment out both calls to keep them
                img_blob.delete()
                mask_blob.delete()

        print(f"Finished splitting {quarter}: {len(train_files)} train, {len(dev_files)} dev, {len(test_files)} test files.")

# Run the script
reorganize_dataset(bucket_name, source_image_prefix, source_mask_prefix, splits)

Mounted at /content/drive
Processing q1/...
Finished splitting q1/: 2156 train, 269 dev, 271 test files.
Processing q2/...
Finished splitting q2/: 882 train, 110 dev, 111 test files.
Processing q3/...
Finished splitting q3/: 749 train, 93 dev, 95 test files.
Processing q4/...
Finished splitting q4/: 1076 train, 134 dev, 135 test files.
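
The logged counts follow directly from the cutoff arithmetic: the train and dev sizes are truncated with `int()`, and test receives the remainder, which is why test is slightly larger than dev even though both are set to 0.1. The sketch below reproduces the counts above; the per-quarter totals are obtained by summing the logged numbers, and `split_counts` is a hypothetical helper.

```python
def split_counts(total, splits):
    """Return (train, dev, test) sizes using the script's truncating cutoffs."""
    train = int(total * splits["train"])
    dev = int(total * splits["dev"])
    return train, dev, total - train - dev


splits = {"train": 0.8, "dev": 0.1, "test": 0.1}
# Totals per quarter, summed from the log output above.
totals = {"q1": 2696, "q2": 1103, "q3": 937, "q4": 1345}
counts = {quarter: split_counts(total, splits) for quarter, total in totals.items()}
print(counts)
```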
