# Extracting Time-Series from Cloud-Optimized GeoTIFFs (COGs)

This notebook shows how to use rasterio to efficiently extract pixel values from Cloud-Optimized GeoTIFF (COG) files hosted in cloud storage buckets.

We use GDAL Virtual Rasters (VRT) to create a virtual stacked image from separate images and query it using `rasterio.sample`. This method is fast because it fetches only the data required for the requested pixels instead of the entire file.

In [59]:
import os
import rasterio
from osgeo import gdal
import tempfile
import pandas as pd

In [3]:
os.environ['GS_NO_SIGN_REQUEST'] = 'YES'

In this example, we have a folder on Google Cloud Storage (GCS) that has 12 files, each representing soil moisture for one month.

```
soil_moisture_202301.tif
soil_moisture_202302.tif
soil_moisture_202303.tif
...
```

We want to sample pixel values from each of these at N different locations. 

## Creating a VRT

Creating the VRT is a one-time step that enables efficient querying of the datasets.

### GDAL Command Line Tool

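One can use the GDAL command-line tool to create a VRT and place it in the same GCS bucket as the files. The `-separate` flag puts each input file into its own band, which gives the stacked image we need.

`gdalbuildvrt -separate -input_file_list filelist.txt soil_moisture.vrt`

The file list contains the paths of the files, each with the `/vsigs/` prefix.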

```
/vsigs/spatialthoughts-public-data/terraclimate/soil_moisture_202301.tif
/vsigs/spatialthoughts-public-data/terraclimate/soil_moisture_202302.tif
/vsigs/spatialthoughts-public-data/terraclimate/soil_moisture_202303.tif
...
```
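
The file list can also be generated with a few lines of Python rather than typed by hand. This is a minimal sketch that assumes the bucket path and monthly naming pattern shown in the listing above. The output location (the system temp directory) is an arbitrary choice for illustration.

```python
import os
import tempfile

# Build the /vsigs/ paths for the 12 monthly files
# (bucket path and naming pattern assumed from the listing above)
prefix = '/vsigs/spatialthoughts-public-data/terraclimate/'
paths = [f'{prefix}soil_moisture_2023{month:02d}.tif' for month in range(1, 13)]

# Write one path per line, which is the layout -input_file_list reads
filelist_path = os.path.join(tempfile.gettempdir(), 'filelist.txt')
with open(filelist_path, 'w') as f:
    f.write('\n'.join(paths) + '\n')
```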


### Using GDAL Python API

In [None]:
# Create a VRT file in the temp directory
temp_dir = tempfile.gettempdir()
vrt_file = 'soil_moisture.vrt'
vrt_options = gdal.BuildVRTOptions(separate=True)
vrt_file_path = os.path.join(temp_dir, vrt_file)

# Add URLs to the files
urls = []
prefix = '/vsigs/spatialthoughts-public-data/terraclimate/'
for month in range(1, 13):
    image_id = f'soil_moisture_2023{month:02d}.tif'
    path = prefix + image_id
    urls.append(path)

# Create the VRT
gdal.BuildVRT(vrt_file_path, urls, options=vrt_options).FlushCache()

Once done, copy the VRT to the same GCS bucket as the source files.

## Sampling Pixel Values

In [45]:
vrt_file_path = '/vsigs/spatialthoughts-public-data/terraclimate/soil_moisture.vrt'

In [65]:
locations = [
    ('Location 1', 80.449, 18.728),
    ('Location 2', 79.1488, 15.2797),
    ('Location 3', 74.656, 25.144)
]

Sorting the coordinates by X and then Y can improve read performance. We keep the location IDs in the same sorted order so they can be matched to the results.

In [66]:
sorted_locations = sorted(locations, key=lambda loc: (loc[1], loc[2]))
sorted_ids = [loc[0] for loc in sorted_locations]
sorted_coords = [(loc[1], loc[2]) for loc in sorted_locations]
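
If the results are needed in the original input order, one option is to sort indices instead of the tuples themselves and then scatter the results back. This is a sketch only: `sorted_results` is a hypothetical placeholder standing in for the sampled values.

```python
locations = [
    ('Location 1', 80.449, 18.728),
    ('Location 2', 79.1488, 15.2797),
    ('Location 3', 74.656, 25.144)
]

# Sort the positions (not the tuples) by X and then Y
order = sorted(range(len(locations)),
               key=lambda i: (locations[i][1], locations[i][2]))
sorted_coords = [(locations[i][1], locations[i][2]) for i in order]

# Placeholder: one result per sorted coordinate (real code would sample here)
sorted_results = [f'value at {xy}' for xy in sorted_coords]

# Put each result back at the position of its original location
results = [None] * len(locations)
for pos, i in enumerate(order):
    results[i] = sorted_results[pos]
```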

Sample the VRT using `rasterio.sample.sample_gen()`. Each result is an array with one value per band, which here means one value per month.

In [67]:
%%time
with rasterio.open(vrt_file_path) as src:
    samples = rasterio.sample.sample_gen(src, sorted_coords)
    data = list(samples)

CPU times: user 60.2 ms, sys: 15.4 ms, total: 75.6 ms
Wall time: 76.6 ms


Convert the results to a DataFrame with one column per month.

In [68]:
df = pd.DataFrame(data, columns=[f'{month:02}' for month in range(1, 13)])
df.index = sorted_ids
df

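| | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Location 3 | 25.1 | 21.2 | 18.4 | 16.3 | 14.6 | 13.7 | 34.6 | 27.5 | 37.7 | 29.3 | 24.1 | 20.6 |
| Location 2 | 33.5 | 26.4 | 21.8 | 18.7 | 16.3 | 14.5 | 125.6 | 51.1 | 67.2 | 41.5 | 80.1 | 122.3 |
| Location 1 | 322.5 | 264.4 | 223.0 | 204.8 | 138.1 | 104.7 | 498.9 | 462.5 | 498.9 | 432.2 | 385.9 | 357.1 |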
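
For time-series work it is often handier to have dates as the index and locations as columns. The sketch below rebuilds the DataFrame from the values in the output table above so that it runs on its own. It assumes each month label refers to the first day of that month in 2023, following the file names.

```python
import pandas as pd

# Values copied from the output table above
df = pd.DataFrame(
    [[25.1, 21.2, 18.4, 16.3, 14.6, 13.7, 34.6, 27.5, 37.7, 29.3, 24.1, 20.6],
     [33.5, 26.4, 21.8, 18.7, 16.3, 14.5, 125.6, 51.1, 67.2, 41.5, 80.1, 122.3],
     [322.5, 264.4, 223.0, 204.8, 138.1, 104.7, 498.9, 462.5, 498.9, 432.2, 385.9, 357.1]],
    index=['Location 3', 'Location 2', 'Location 1'],
    columns=[f'{month:02}' for month in range(1, 13)],
)

# Transpose so that rows are months and columns are locations
ts = df.T

# Replace the month labels with dates (assumed: first day of each month, 2023)
ts.index = pd.to_datetime([f'2023-{month}-01' for month in ts.index])
ts.index.name = 'date'
```

From here, `ts['Location 1']` is a regular pandas time series that can be plotted or resampled.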
