# Preparing Data for U-Net Model

**Author**: Sage McGinley-Smith  
**Class**: CS 230: Deep Learning  
**Date**: November 2024

This notebook tiles GeoTIFF images and uploads the resulting tiles to a Google Cloud Storage (GCS) bucket. It assumes the data is split into four quarters, with one image GeoTIFF and one mask GeoTIFF per quarter, named as follows:
- image_q1_2019.tif
- image_q2_2019.tif
- image_q3_2019.tif
- image_q4_2019.tif
- mask_q1_2019.tif
- mask_q2_2019.tif
- mask_q3_2019.tif
- mask_q4_2019.tif
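
As a rough illustration of this naming convention, the sketch below splits a filename into its kind (image or mask) and its quarter. The helper `parse_geotiff_name` and the `/data/` path are hypothetical and not part of this notebook, which uses simple substring checks instead.

```python
import os

QUARTERS = ("q1", "q2", "q3", "q4")


def parse_geotiff_name(path):
    """Return (kind, quarter) for a file named like image_q1_2019.tif.

    Hypothetical helper for illustration only.
    """
    stem = os.path.basename(path).split(".")[0]
    parts = stem.split("_")
    # Anything that is not a mask is treated as an image
    kind = "mask" if parts[0] == "mask" else "image"
    # Take the first underscore-separated part that is a quarter id
    quarter = next((p for p in parts if p in QUARTERS), None)
    if quarter is None:
        raise ValueError(f"No quarter identifier (q1-q4) in {stem!r}")
    return kind, quarter


print(parse_geotiff_name("/data/mask_q3_2019.tif"))  # ('mask', 'q3')
```

Matching on whole underscore-separated parts is a slightly stricter choice than a substring test; either approach works for the eight filenames listed above.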

### Key Features:
1. **Install and Load Dependencies**:
   - Sets up the required Python libraries, including `rasterio` for geospatial processing and `google-cloud-storage` for interacting with GCS.
   - Mounts Google Drive to access necessary files, such as GeoTIFFs and GCP authentication credentials.

2. **Tiling Functionality**:
   - Splits large GeoTIFF images into smaller, square tiles of a defined size (e.g., 128x128 pixels).
   - Distinguishes image tiles (e.g., satellite images) from mask tiles (e.g., segmentation masks) by checking whether the filename contains `mask`.

3. **Quarter-Based Organization**:
   - Automatically organizes tiles into folders corresponding to quarters (`q1`, `q2`, `q3`, `q4`) based on filenames.

4. **Google Cloud Storage Integration**:
   - Uploads the generated tiles to a specified GCS bucket.
   - Organizes tiles into the appropriate directories (`sentinel-tiles` or `mask-tiles`) within the bucket.

5. **Edge Case Handling**:
   - Skips tiles that do not match the expected size (e.g., partial tiles at the edges of an image whose dimensions are not multiples of the tile size).
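
To make the edge-case rule concrete, here is a minimal sketch that lists the top-left offsets of the tiles that fit entirely inside an image. The function `full_tile_origins` and the 300 x 260 example size are assumptions for illustration; the notebook itself reads every window and discards the ones that come back smaller than the tile size.

```python
def full_tile_origins(img_width, img_height, tile_size=128):
    """Return (col_off, row_off) for every tile that fits fully in the image.

    Hypothetical helper for illustration only.
    """
    origins = []
    # Stop early enough that col_off + tile_size never exceeds the width
    for col_off in range(0, img_width - tile_size + 1, tile_size):
        # Same rule for rows and the image height
        for row_off in range(0, img_height - tile_size + 1, tile_size):
            origins.append((col_off, row_off))
    return origins


# A 300 x 260 image holds a 2 x 2 grid of full 128-pixel tiles;
# the leftover 44 columns and 4 rows are dropped.
print(full_tile_origins(300, 260))  # [(0, 0), (0, 128), (128, 0), (128, 128)]
```

The number of tiles kept per file is therefore `(img_width // tile_size) * (img_height // tile_size)`, which can serve as a quick check against the number of objects that end up in the bucket.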

### How to Use:
1. **Setup**:
   - Install the necessary Python packages by running the installation cell.
   - Replace the placeholder value of the `GOOGLE_APPLICATION_CREDENTIALS` environment variable with the path to your GCP credentials JSON file.

2. **Define Tiling and Upload Functions**:
   - The notebook defines two reusable functions: `tile_and_upload`, which tiles one GeoTIFF and uploads each tile, and `upload_to_gcp`, which uploads a single file to the bucket.

3. **Configure Parameters**:
   - Specify the list of GeoTIFF file paths in `geotiff_paths`.
   - Set the target GCS bucket name in the `bucket_name` variable.

4. **Run the Tiling and Upload Workflow**:
   - The loop iterates over all specified GeoTIFF files, tiles them, and uploads the resulting tiles to the GCS bucket.

### Output:
- GeoTIFF tiles are uploaded to the specified GCS bucket in a structured format:

Bucket: 230-project-tiles
  - sentinel-tiles
      - q1
      - q2
      - q3
      - q4
  - mask-tiles
      - q1
      - q2
      - q3
      - q4

From this point, the tiles are ready to be split into train/dev/test folders in the Making Dev And Test Sets notebook.
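
The sketch below shows how an object name in this layout can be assembled from a tile's kind, quarter, and pixel offsets. It mirrors the naming pattern used in `tile_and_upload`, but the helper `tile_blob_name` itself is hypothetical and shown only for illustration.

```python
def tile_blob_name(kind, quarter, stem, col_off, row_off):
    """Build the GCS object name for one tile.

    Hypothetical helper; `stem` is the source filename without its extension.
    """
    # Masks and images go to separate top-level folders
    folder = "mask-tiles" if kind == "mask" else "sentinel-tiles"
    # Tile names record the column and row offset of the tile's top-left pixel
    tile_filename = f"{stem}_tile_{col_off}_{row_off}.tif"
    return f"{folder}/{quarter}/{tile_filename}"


print(tile_blob_name("image", "q1", "image_q1_2019", 0, 128))
# sentinel-tiles/q1/image_q1_2019_tile_0_128.tif
```

Because an image tile and its mask tile share the same offsets, the two can later be paired by comparing the `_tile_{col_off}_{row_off}` suffix of their names.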

# Install and Load Necessary Packages and Mount Drive

In [None]:
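# Install rasterio (GeoTIFF I/O) and the Google Cloud Storage client
!pip install rasterio google-cloud-storage
import os
import rasterio
import rasterio.windows  # explicit import; tile_and_upload uses rasterio.windows.Window
from google.cloud import storage
from google.colab import drive

# Mount Google Drive to reach the GeoTIFFs and the credentials file
drive.mount('/content/drive')
# Point the GCS client at the JSON key file (replace the placeholder path)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "insert path to .json key file in drive here"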

# Define Functions for Tiling and Bucketing
Tiles are 128 x 128 pixels, set by `tile_size` in `tile_and_upload`.

In [None]:
def upload_to_gcp(bucket_name, source_file_name, destination_blob_name):
    """Upload a local file to `destination_blob_name` in the GCS bucket `bucket_name`."""
    # The client reads credentials from GOOGLE_APPLICATION_CREDENTIALS
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print(f"File {source_file_name} uploaded to {destination_blob_name}.")

In [None]:
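def tile_and_upload(geotiff_path, bucket_name):
    """Split a GeoTIFF into square tiles and upload each full-size tile to GCS."""
    tile_size = 128  # Tile edge length in pixels
    filename = os.path.basename(geotiff_path).split(".")[0]  # Name without extension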

    # Determine the quarter based on the filename
    if "q1" in filename:
        quarter_folder = "q1"
    elif "q2" in filename:
        quarter_folder = "q2"
    elif "q3" in filename:
        quarter_folder = "q3"
    elif "q4" in filename:
        quarter_folder = "q4"
    else:
        raise ValueError("Filename does not contain a valid quarter identifier (q1, q2, q3, q4).")

    # Determine if the file is a mask or sentinel image
    if "mask" in filename:
        folder = "mask-tiles"
    else:
        folder = "sentinel-tiles"

    # Open the GeoTIFF and create tiles
    with rasterio.open(geotiff_path) as src:
        img_width, img_height = src.width, src.height

        # Step across the image; i is the column (x) offset, j is the row (y) offset
        for i in range(0, img_width, tile_size):
            for j in range(0, img_height, tile_size):
                window = rasterio.windows.Window(i, j, tile_size, tile_size)

                # Compute the tile's geotransform and read its pixels as (bands, rows, cols)
                transform = src.window_transform(window)
                tile_data = src.read(window=window)

                # Skip if the tile is smaller than expected (edge case)
                if tile_data.shape[1] != tile_size or tile_data.shape[2] != tile_size:
                    continue

                # Define tile filename and save path
                tile_filename = f"{filename}_tile_{i}_{j}.tif"
                tile_path = f"./{tile_filename}"

                # Save the tile
                with rasterio.open(
                    tile_path,
                    'w',
                    driver='GTiff',
                    height=tile_size,
                    width=tile_size,
                    count=src.count,
                    dtype=tile_data.dtype,
                    crs=src.crs,
                    transform=transform
                ) as dst:
                    dst.write(tile_data)

                # Upload to GCP in the appropriate folder structure
                destination_blob_name = f"{folder}/{quarter_folder}/{tile_filename}"
                upload_to_gcp(bucket_name, tile_path, destination_blob_name)

                # Remove local tile after uploading
                os.remove(tile_path)

# Loop Through Images and Tile + Upload Them

In [None]:
bucket_name = "230-project-tiles"
geotiff_paths = [
    "/content/drive/My Drive/Senior Project/Training_Data_Full_Quads/image_q1_2019.tif",
    "/content/drive/My Drive/Senior Project/Training_Data_Full_Quads/image_q2_2019.tif",
    "/content/drive/My Drive/Senior Project/Training_Data_Full_Quads/image_q3_2019.tif",
    "/content/drive/My Drive/Senior Project/Training_Data_Full_Quads/image_q4_2019.tif",
    "/content/drive/My Drive/Senior Project/Training_Data_Full_Quads/mask_q1_2019.tif",
    "/content/drive/My Drive/Senior Project/Training_Data_Full_Quads/mask_q2_2019.tif",
    "/content/drive/My Drive/Senior Project/Training_Data_Full_Quads/mask_q3_2019.tif",
    "/content/drive/My Drive/Senior Project/Training_Data_Full_Quads/mask_q4_2019.tif"
]

for geotiff_path in geotiff_paths:
    tile_and_upload(geotiff_path, bucket_name)