# VRTs and Cloud storage with GDAL


## Load Libraries

Check that GDAL is installed.

In [1]:
!gdalinfo --version

GDAL 3.6.3, released 2023/03/07


In [1]:
from osgeo import gdal
import subprocess
import json 
import pandas as pd
from google.cloud import storage
import os
import glob

In [2]:
os.environ['GS_NO_SIGN_REQUEST'] = 'YES'
os.environ['GDAL_NUM_THREADS'] = 'ALL_CPUS'

Log in to Google Cloud if needed.

In [5]:
!gcloud auth login --update-adc

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=32555940559.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=mYPQ3hyT7RCvpIPmkfq41j8FTvxvDq&access_type=offline&code_challenge=7_6J6-N7bY3Z4sZ1Tqfz5iC0ll_tLOPWl2zYkUu7bfY&code_challenge_method=S256


Application default credentials (ADC) were updated.

You are now logged in as [cnilsen@gmail.com].
Your current project is [swhm-dev].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


In [3]:
!gcloud config set project swhm-dev

Updated property [core/project].


## Functions

A function to list the blobs in a storage bucket.

In [4]:
def list_blobs_with_prefix(bucket_name, prefix, delimiter=None):
    """Lists all the blobs in the bucket that begin with the prefix.

    This can be used to list all blobs in a "folder", e.g. "public/".

    The delimiter argument can be used to restrict the results to only the
    "files" in the given "folder". Without the delimiter, the entire tree under
    the prefix is returned. For example, given these blobs:

        a/1.txt
        a/b/2.txt

    If you specify prefix ='a/', without a delimiter, you'll get back:

        a/1.txt
        a/b/2.txt

    However, if you specify prefix='a/' and delimiter='/', you'll get back
    only the file directly under 'a/':

        a/1.txt

    As part of the response, you'll also get back a blobs.prefixes entity
    that lists the "subfolders" under `a/`:

        a/b/
    """

    storage_client = storage.Client()

    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name, prefix=prefix, delimiter=delimiter)

    # Note: The call returns a response only when the iterator is consumed.
    blob_list = [blob.name for blob in blobs]

    if delimiter:
        # blobs.prefixes is populated only after the iterator is consumed.
        # Append the "subfolder" prefixes as plain strings, like the blob names.
        blob_list.extend(sorted(blobs.prefixes))

    return blob_list
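
The prefix and delimiter behavior described in the docstring can be tried out without touching Cloud Storage. The sketch below is a pure-Python emulation over a list of object names. `emulate_listing` is a hypothetical helper, not part of the `google-cloud-storage` API, and it only approximates how the service groups names.

```python
def emulate_listing(names, prefix="", delimiter=None):
    """Emulate prefix/delimiter listing over plain object names.

    Returns (files, prefixes): the names listed directly, and the
    "subfolder" prefixes that the delimiter collapses.
    """
    files, prefixes = [], set()
    for name in names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter and delimiter in rest:
            # Collapse everything past the first delimiter into one prefix.
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            files.append(name)
    return files, sorted(prefixes)


names = ["a/1.txt", "a/b/2.txt"]
print(emulate_listing(names, prefix="a/"))
print(emulate_listing(names, prefix="a/", delimiter="/"))
```

Without a delimiter both names come back as files. With `delimiter="/"`, only `a/1.txt` is a file and `a/b/` is reported as a prefix.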


### 1. Download images

In [5]:
def dl(lay_name):
    cmd = f'gsutil -m cp -R gs://swhm-image-exports/{lay_name} .'
    !{cmd}
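
`!{cmd}` only works inside IPython or Jupyter. If the pipeline is ever moved into a plain Python script, the same download could be run with `subprocess`. The following is a minimal sketch under that assumption. `build_download_cmd` and `download_layer` are hypothetical names, and the bucket default is the one used in `dl` above.

```python
import subprocess


def build_download_cmd(lay_name, bucket="swhm-image-exports", dest="."):
    """Return the gsutil copy command as an argument list (no shell quoting needed)."""
    return ["gsutil", "-m", "cp", "-R", f"gs://{bucket}/{lay_name}", dest]


def download_layer(lay_name):
    # check=True raises CalledProcessError when gsutil exits with an error,
    # so later pipeline steps do not run against a folder that was never downloaded.
    subprocess.run(build_download_cmd(lay_name), check=True)
```

Passing the command as a list avoids shell interpretation of the layer name, and `check=True` stops the pipeline at the first failed step.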
    


### 2. Reproject images

Saves reprojected images to the local `tmp/` directory (relative to the notebook, not the system `/tmp`).

In [6]:
def reproject(lay_name, target_crs):
    """Warp every .tif in the layer folder to target_crs and list the outputs in files.txt."""
    directory_path = lay_name
    os.makedirs("tmp", exist_ok=True)  # gdalwarp will not create the output folder

    warped = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".tif"):
            input_path = os.path.join(directory_path, filename)
            output_path = os.path.join("tmp", filename)
            cmd = f'gdalwarp -t_srs {target_crs} -overwrite {input_path} {output_path} -co NUM_THREADS=ALL_CPUS -co COMPRESS=LZW -co BIGTIFF=YES'
            !{cmd}
            warped.append(output_path)

    # Write the file list once, after all tiles are warped. Only the tiles from
    # this run are listed, so leftovers in tmp/ from other layers are ignored.
    with open("files.txt", "w") as file:
        for path in warped:
            file.write(path + "\n")
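
The file list that `gdalbuildvrt` reads is just one path per line, and it only needs to be written once all tiles are warped. A sketch of that step as a standalone helper follows. `write_file_list` is a hypothetical name, and the helper assumes the directory holds tiles for a single layer only, since it lists every `.tif` it finds.

```python
import os


def write_file_list(tif_dir, list_path):
    """Write one .tif path per line, sorted, and return the paths written."""
    paths = sorted(
        os.path.join(tif_dir, name)
        for name in os.listdir(tif_dir)
        if name.endswith(".tif")
    )
    with open(list_path, "w") as fh:
        for path in paths:
            fh.write(path + "\n")
    return paths
```

Sorting makes the list reproducible between runs, because `os.listdir` returns entries in arbitrary order.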

    
  



### 3. Build VRT



In [7]:
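def makevrt(lay_name):
    print('Making VRT...')
    # files.txt lists paths relative to the notebook directory (tmp/...),
    # so gdalbuildvrt must run from here. A `cd tmp |` prefix has no effect:
    # a pipe runs each side in its own subshell.
    cmd = 'gdalbuildvrt -input_file_list files.txt output.vrt'
    !{cmd}
    print('VRT Complete!')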
    

def makecog(): 
    print('Making COG..')
    cmd = f'''
    gdal_translate output.vrt cog.tif -of COG -co NUM_THREADS=ALL_CPUS -co COMPRESS=LZW -co BIGTIFF=YES
    '''
    !{cmd}
    print('COG Complete!')
    
def ul(lay_name):
    print('Uploading Layer...')
    cmd = f'gsutil cp cog.tif gs://live_data_layers/rasters/{lay_name}.tif'
    !{cmd}
    print('Layer upload complete!') 
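
`gdalbuildvrt` resolves the `tmp/...` entries in `files.txt` against the directory it runs in, so `makevrt` has to run from the notebook directory. Prefixing a command with `cd tmp |` would not change that: each side of a shell pipe runs in its own subshell, so the `cd` is discarded. The sketch below shows the difference between `|` and `&&` on a POSIX shell. `pwd_after` is a hypothetical helper written for this illustration.

```python
import os
import subprocess


def pwd_after(snippet, base):
    """Run a shell snippet from `base` and return the directory that `pwd` reports."""
    result = subprocess.run(
        snippet, shell=True, cwd=base, capture_output=True, text=True, check=True
    )
    # realpath normalizes symlinked locations so results compare cleanly.
    return os.path.realpath(result.stdout.strip())


# With a subdirectory named tmp inside `base`:
#   pwd_after("cd tmp | pwd", base)   reports `base` (the cd is lost)
#   pwd_after("cd tmp && pwd", base)  reports `base/tmp`
```

When a command really does need to run in another directory, `cd tmp && ...` in the shell string or the `cwd` argument of `subprocess.run` are the usual options.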

### 4. Wrapper Function

In [8]:
def run_pipeline(lay_name): 
    dl(lay_name)
    reproject(lay_name,'EPSG:3857')
    makevrt(lay_name)
    makecog()
    ul(lay_name)
    files = glob.glob(f'{lay_name}/*')
    for f in files:
        os.remove(f)
    os.rmdir(lay_name)
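
`run_pipeline` removes the downloaded layer folder, but the intermediate outputs (`tmp/`, `files.txt`, `output.vrt`, `cog.tif`) stay on disk when the next layer starts. A possible cleanup helper is sketched below. `remove_paths` is a hypothetical name, and whether to delete these files between layers is a design choice the notebook does not state.

```python
import os
import shutil


def remove_paths(paths):
    """Delete files and directories, silently skipping paths that do not exist."""
    for path in paths:
        if os.path.isdir(path):
            shutil.rmtree(path)  # also handles nested subfolders
        elif os.path.exists(path):
            os.remove(path)


# Example call at the end of a pipeline run (names taken from the cells above):
# remove_paths([lay_name, "tmp", "files.txt", "output.vrt", "cog.tif"])
```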

In [9]:
lay_name = "Runoff_mm"

In [22]:
makevrt(lay_name)

Making VRT...
0...10...20...30...40...50...60...70...80...90...100 - done.
VRT Complete!


In [10]:
makecog()

Making COG..
Input file size is 73740, 70537
0...10...20...30...40...50...60...70...80...90...100 - done.
COG Complete!


In [11]:
ul(lay_name)

Uploading Layer...
Copying file://cog.tif [Content-Type=image/tiff]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

/ [1 files][  5.8 GiB/  5.8 GiB]  657.1 KiB/s                                   
Operation completed over 1 objects/5.8 GiB.                                      
Layer upload complete!


## Get list of objects in data bucket

In [19]:
BUCKET_NAME = 'swhm-image-exports'
blobsout = list_blobs_with_prefix(BUCKET_NAME,'')

In [20]:
df = pd.DataFrame(blobsout, columns=['file_path'])#.iloc[1:]
#df['folder_name'] = df['file_path'].str.split(BUCKET_NAME, 1,expand = True)
df['gdal_path'] = df['file_path'].str.replace('gs://', '/vsigs/') 
df

| | file_path | gdal_path |
|---|---|---|
| 0 | Runoff_mm/Runoff_mm0000000000-0000000000.tif | Runoff_mm/Runoff_mm0000000000-0000000000.tif |
| 1 | Runoff_mm/Runoff_mm0000000000-0000023296.tif | Runoff_mm/Runoff_mm0000000000-0000023296.tif |
| 2 | Runoff_mm/Runoff_mm0000000000-0000046592.tif | Runoff_mm/Runoff_mm0000000000-0000046592.tif |
| 3 | Runoff_mm/Runoff_mm0000000000-0000069888.tif | Runoff_mm/Runoff_mm0000000000-0000069888.tif |
| 4 | Runoff_mm/Runoff_mm0000000000-0000093184.tif | Runoff_mm/Runoff_mm0000000000-0000093184.tif |
| 5 | Runoff_mm/Runoff_mm0000023296-0000000000.tif | Runoff_mm/Runoff_mm0000023296-0000000000.tif |
| 6 | Runoff_mm/Runoff_mm0000023296-0000023296.tif | Runoff_mm/Runoff_mm0000023296-0000023296.tif |
| 7 | Runoff_mm/Runoff_mm0000023296-0000046592.tif | Runoff_mm/Runoff_mm0000023296-0000046592.tif |
| 8 | Runoff_mm/Runoff_mm0000023296-0000069888.tif | Runoff_mm/Runoff_mm0000023296-0000069888.tif |
| 9 | Runoff_mm/Runoff_mm0000023296-0000093184.tif | Runoff_mm/Runoff_mm0000023296-0000093184.tif |
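
Blob names returned by `list_blobs` do not include a `gs://` scheme, which is why `gdal_path` is identical to `file_path` in the table above. GDAL's `/vsigs/` virtual file system expects paths of the form `/vsigs/<bucket>/<object>`. A minimal sketch of building such paths follows; `to_vsigs_path` is a hypothetical helper, not something the notebook defines.

```python
def to_vsigs_path(bucket_name, blob_name):
    """Build a GDAL /vsigs/ path from a bucket name and an object name."""
    return f"/vsigs/{bucket_name}/{blob_name.lstrip('/')}"


print(to_vsigs_path("swhm-image-exports", "Runoff_mm/Runoff_mm0000000000-0000000000.tif"))
```

Applied to the DataFrame, this could look like `df['gdal_path'] = [to_vsigs_path(BUCKET_NAME, p) for p in df['file_path']]`.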


In [21]:
# The layer name is the top-level "folder" of each blob path.
# Passing n as a keyword avoids the pandas warning about positional arguments.
df['layer_name'] = df['file_path'].str.split('/', n=1).str[0]
df


| | file_path | gdal_path | layer_name |
|---|---|---|---|
| 0 | Runoff_mm/Runoff_mm0000000000-0000000000.tif | Runoff_mm/Runoff_mm0000000000-0000000000.tif | Runoff_mm |
| 1 | Runoff_mm/Runoff_mm0000000000-0000023296.tif | Runoff_mm/Runoff_mm0000000000-0000023296.tif | Runoff_mm |
| 2 | Runoff_mm/Runoff_mm0000000000-0000046592.tif | Runoff_mm/Runoff_mm0000000000-0000046592.tif | Runoff_mm |
| 3 | Runoff_mm/Runoff_mm0000000000-0000069888.tif | Runoff_mm/Runoff_mm0000000000-0000069888.tif | Runoff_mm |
| 4 | Runoff_mm/Runoff_mm0000000000-0000093184.tif | Runoff_mm/Runoff_mm0000000000-0000093184.tif | Runoff_mm |
| 5 | Runoff_mm/Runoff_mm0000023296-0000000000.tif | Runoff_mm/Runoff_mm0000023296-0000000000.tif | Runoff_mm |
| 6 | Runoff_mm/Runoff_mm0000023296-0000023296.tif | Runoff_mm/Runoff_mm0000023296-0000023296.tif | Runoff_mm |
| 7 | Runoff_mm/Runoff_mm0000023296-0000046592.tif | Runoff_mm/Runoff_mm0000023296-0000046592.tif | Runoff_mm |
| 8 | Runoff_mm/Runoff_mm0000023296-0000069888.tif | Runoff_mm/Runoff_mm0000023296-0000069888.tif | Runoff_mm |
| 9 | Runoff_mm/Runoff_mm0000023296-0000093184.tif | Runoff_mm/Runoff_mm0000023296-0000093184.tif | Runoff_mm |


In [22]:
lay_names = df["layer_name"].unique()
lay_names

array(['Runoff_mm'], dtype=object)
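
The same set of layer names can be derived from the raw blob list without a DataFrame. The sketch below assumes, as in this bucket, that each layer lives in its own top-level folder; `top_level_folders` is a hypothetical helper.

```python
def top_level_folders(blob_names):
    """Return the sorted, unique first path component of each blob name."""
    return sorted({name.split("/", 1)[0] for name in blob_names if "/" in name})


blobs = [
    "Runoff_mm/Runoff_mm0000000000-0000000000.tif",
    "Runoff_mm/Runoff_mm0000000000-0000023296.tif",
]
print(top_level_folders(blobs))
```

Names without a `/` are skipped, so stray objects at the bucket root are not treated as layers.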

In [23]:
run_pipeline('Runoff_mm')

If you experience problems with multiprocessing on MacOS, they might be related to https://bugs.python.org/issue33725. You can disable multiprocessing by editing your .boto config or by adding the following flag to your command: `-o "GSUtil:parallel_process_count=1"`. Note that multithreading is still available even if you disable multiprocessing.

Copying gs://swhm-image-exports/Runoff_mm/Runoff_mm0000000000-0000069888.tif...
Copying gs://swhm-image-exports/Runoff_mm/Runoff_mm0000000000-0000023296.tif...
Copying gs://swhm-image-exports/Runoff_mm/Runoff_mm0000000000-0000000000.tif...
Copying gs://swhm-image-exports/Runoff_mm/Runoff_mm0000000000-0000046592.tif... 
Copying gs://swhm-image-exports/Runoff_mm/Runoff_mm0000000000-0000093184.tif... 
==> NOTE: You are downloading one or more large file(s), which would            
run significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help c

## Loop through layer names

Use this to run the pipeline for every layer in `lay_names`.

In [14]:
for lay_name in lay_names:
    run_pipeline(lay_name) 
    

If you experience problems with multiprocessing on MacOS, they might be related to https://bugs.python.org/issue33725. You can disable multiprocessing by editing your .boto config or by adding the following flag to your command: `-o "GSUtil:parallel_process_count=1"`. Note that multithreading is still available even if you disable multiprocessing.

CommandException: No URLs matched: gs://swhm-image-exports/T
CommandException: 1 file/object could not be transferred.


FileNotFoundError: [Errno 2] No such file or directory: 'T'
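
The `gs://swhm-image-exports/T` in the error suggests that `lay_names` may have been a plain string when this cell ran, so the loop iterated over its characters instead of over layer names. That is an inference from the output, not something the notebook confirms. Under that assumption, a small guard such as the hypothetical `as_layer_list` below would keep a lone string from being split.

```python
def as_layer_list(lay_names):
    """Return layer names as a list, wrapping a lone string instead of splitting it."""
    if isinstance(lay_names, str):
        return [lay_names]
    return list(lay_names)


print(as_layer_list("Runoff_mm"))
print(as_layer_list(["Runoff_mm", "Other_layer"]))
```

The loop would then read `for lay_name in as_layer_list(lay_names):`, which behaves the same for a list or array of names and runs once for a single string.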