# Co-locating Temperature Anomaly Values between CBP Monitoring Stations and Satellite SST

## Notebook Overview
### Tasks
- Select satellite SST and SST anomaly values for each location and date on which there is an in situ observation from the Chesapeake Bay Program (CBP)
- Save two csvs of the results: one for in situ and satellite SSTs, and a second for in situ and satellite SST anomalies

This notebook creates two csvs that hold the co-located validation data points from the Chesapeake Bay Program's (CBP) Water Quality dataset and the three satellite datasets (Geo-Polar, MUR, and OSTIA). Each csv contains one row for each day and station at which there was a CBP validation point and a matching satellite pixel.
Output: csvs of co-located temperature and temperature anomaly values from the satellite SST datasets and the in situ CBP data

## Analysis

In [1]:
import os
import warnings

import xarray as xr
import pandas as pd
import numpy as np

In [2]:
REPO_ROOT = '/Users/rwegener/repos/chesapeake_mhw'
SAVE_FIGS = False

## Read CBay Program In Situ Anomaly Data

In [18]:
# OLD NAMING CONVENTION:
# path = os.path.join(REPO_ROOT, 'data/interim', 'cbp_stations_climatology_anomaly_filtered.csv')
# NEW NAMING CONVENTION: 'cbp_temps_selected_stations_with_climatology.csv'
path_anom = os.path.join(REPO_ROOT, 'data/interim', 'cbp_temps_selected_stations_with_climatology.csv')
anom_raw = pd.read_csv(path_anom, parse_dates=[1])

In [19]:
path_sst = os.path.join(REPO_ROOT, 'data/interim', 'cbp_stations_climatology_raw_filtered.csv')
sst_raw = pd.read_csv(path_sst, parse_dates=[1])

## Helper Functions

Creating functions for repeated tasks

In [25]:
def get_satellite_sst(full_sst, lat, lon, time):
    '''
    For a given latitude, longitude, and time extract the SST value at that location
    in the given satellite dataset. Returns np.nan if the date is not found.
    :full_sst: 3D array of satellite SST from which to extract temperature values
    :lat: latitude value from the in situ dataset used to find the nearest SST pixel
    :lon: longitude value from the in situ dataset used to find the nearest SST pixel
    :time: time value from the in situ dataset to match in the SST dataset
    '''
    try:
        # time does NOT have nearest interpolation because we do not want adjacent days to
        # be selected
        matching_array = full_sst.sel(lat=lat, lon=lon, 
                                      method='nearest').sel(time=time.strftime('%Y-%m-%d')).values
        # Check only one value is returned, allowing for multiple array size return shapes
        if matching_array.ndim == 0:
            matching_sst = matching_array
        elif matching_array.ndim == 1 and matching_array.size == 1:
            matching_sst = matching_array[0]
        else:
            matching_sst = np.nan
            print('In situ date not found in satellite SST', time.strftime('%Y-%m-%d'), 'at',
                 lat, lon)
    except KeyError:
        # If a key error was raised the corresponding date was not found. 
        # Return nan for that sample location
        print('In situ date not found in satellite SST', time.strftime('%Y-%m-%d'), 'at',
            lat, lon)
        matching_sst = np.nan
    return matching_sst
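
The lookup rule used by `get_satellite_sst` (nearest pixel in space, exact match in time) can be illustrated without xarray. The following is a minimal numpy-only sketch, not the notebook's implementation; the helper names `nearest_index` and `lookup_sst` and the small grid are hypothetical and the values are made up.

```python
import numpy as np

def nearest_index(coords, value):
    """Index of the grid coordinate closest to value."""
    return int(np.abs(np.asarray(coords) - value).argmin())

def lookup_sst(sst, lats, lons, dates, lat, lon, date):
    """Nearest pixel in space, exact match in time; NaN if the date is absent."""
    if date not in dates:
        # No nearest-neighbour matching in time: adjacent days must not be used
        return float('nan')
    t = dates.index(date)
    return float(sst[t, nearest_index(lats, lat), nearest_index(lons, lon)])

# Hypothetical 2-day, 3x3 grid (values are made up for illustration)
lats = [37.0, 38.0, 39.0]
lons = [-77.0, -76.0, -75.0]
dates = ['2018-03-11', '2018-03-13']
sst = np.arange(18, dtype=float).reshape(2, 3, 3)

found = lookup_sst(sst, lats, lons, dates, 38.2, -76.4, '2018-03-11')
missing = lookup_sst(sst, lats, lons, dates, 38.2, -76.4, '2018-03-12')
```

Here `found` comes from the pixel at (38.0, -76.0) on the first day, while `missing` is NaN because 2018-03-12 is not in the date list.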

## Initialize the output dataframe

In [26]:
anom_raw = anom_raw[['Station', 'Latitude', 'Longitude', 'SampleDate', 'anom_cbp']]
wq_anom = anom_raw.copy()
wq_anom['anom_geopolar'] = -999
wq_anom['anom_mur'] = -999
wq_anom['anom_ostia'] = -999

In [27]:
sst_raw = sst_raw[['Station', 'Latitude', 'Longitude', 'SampleDate', 'MeasureValue']]
wq_sst = sst_raw.copy()
wq_sst['geopolar'] = -999
wq_sst['mur'] = -999
wq_sst['ostia'] = -999

## Extract SST value corresponding to CBP In situ observations

### Geo-Polar SST

In [28]:
# Open raw SST
path = os.path.join(
    REPO_ROOT, 'data/raw', 
    'L4_GHRSST-SSTfnd-Geo_Polar_Blended_Night-GLOB-v02.0-fv01.0_CB_20020901_20230831.nc'
)
geopolar = xr.open_dataset(path).analysed_sst
# convert kelvin to celsius & update metadata
geopolar.values = geopolar.values - 273.15
geopolar.attrs.update({'units': 'celsius',})

# Open calculated climatological SST
path = os.path.join(REPO_ROOT, 'data/interim', 'geopolar_climatology_chesapeake.nc')
geopolar_clim = xr.open_dataset(path).climatology

# Compute SST anomaly
geopolar_anom = geopolar - geopolar_clim

Timing Notes

Feb 1: ~22,000 rows: ~50 seconds


In [29]:
%%time

# Create a new column of the wq dataframe containing the corresponding geopolar anomaly value
wq_anom['anom_geopolar'] = wq_anom.apply(lambda x: get_satellite_sst(geopolar_anom, x.Latitude, x.Longitude, x.SampleDate), 
                                                axis=1)

In situ date not found in satellite SST 2018-03-12 at 39.36415 -75.88203
In situ date not found in satellite SST 2018-03-12 at 39.2437 -75.9249
In situ date not found in satellite SST 2018-03-12 at 38.1576 -76.598
In situ date not found in satellite SST 2018-03-12 at 38.3525 -77.2051
In situ date not found in satellite SST 2018-03-12 at 38.3626 -76.99063
In situ date not found in satellite SST 2018-03-12 at 38.6082 -77.1739
CPU times: user 23 s, sys: 527 ms, total: 23.6 s
Wall time: 23.9 s


In [30]:
# Create a new column of the wq dataframe containing the corresponding geopolar sst value
wq_sst['geopolar'] = wq_sst.apply(lambda x: get_satellite_sst(geopolar, x.Latitude, x.Longitude, x.SampleDate), 
                                                axis=1)

In situ date not found in satellite SST 2017-02-01 at 37.68346 -75.98966
In situ date not found in satellite SST 2017-02-01 at 37.77513 -75.97466
In situ date not found in satellite SST 2017-02-01 at 37.58124 -76.05799
In situ date not found in satellite SST 2017-02-01 at 37.41153 -76.07966
In situ date not found in satellite SST 2017-02-01 at 37.41153 -76.02466
In situ date not found in satellite SST 2018-03-06 at 37.11203 -75.97255
In situ date not found in satellite SST 2018-03-06 at 37.17631 -75.99165
In situ date not found in satellite SST 2017-02-01 at 37.21722 -76.39222
In situ date not found in satellite SST 2018-03-06 at 37.32776 -76.02228
In situ date not found in satellite SST 2018-03-06 at 37.44776 -75.98625
In situ date not found in satellite SST 2018-03-06 at 37.52291 -75.95171
In situ date not found in satellite SST 2018-03-06 at 37.30396 -76.01815
In situ date not found in satellite SST 2021-05-25 at 36.83611 -76.24444
In situ date not found in satellite SST 2021-05-25 

In [31]:
# delete large data structures to save memory
del geopolar
del geopolar_anom
del geopolar_clim

### MUR SST

In [32]:
# Open raw SST
path = os.path.join(
    REPO_ROOT, 'data/raw', 
    'MUR-JPL-L4_GHRSST-SSTfnd-GLOB-v02.0-fv04.1-20020901_20230831.nc'
)
mur = xr.open_dataset(path).analysed_sst
# convert kelvin to celsius & update metadata
mur.values = mur.values - 273.15
mur.attrs.update({'units': 'celsius',})

# Open calculated climatological SST
path = os.path.join(REPO_ROOT, 'data/interim/data_cache_Dec24', 'mur_climatology_chesapeake.nc')
mur_clim = xr.open_dataset(path).climatology

# Compute SST anomaly
mur_anom = mur - mur_clim

In [33]:
%%time

# Create a new column of the wq dataframe containing the corresponding mur anom value
wq_anom['anom_mur'] = wq_anom.apply(lambda x: get_satellite_sst(mur_anom, x.Latitude, x.Longitude, x.SampleDate), 
                                                axis=1)

In situ date not found in satellite SST 2022-11-09 at 39.36415 -75.88203
In situ date not found in satellite SST 2022-11-09 at 39.2437 -75.9249
In situ date not found in satellite SST 2022-11-09 at 38.5807 -76.0587
In situ date not found in satellite SST 2022-11-09 at 38.3525 -77.2051
In situ date not found in satellite SST 2022-11-09 at 38.3626 -76.99063
In situ date not found in satellite SST 2022-11-09 at 38.6082 -77.1739
CPU times: user 23.9 s, sys: 769 ms, total: 24.6 s
Wall time: 24.9 s


In [35]:
# Create a new column of the wq dataframe containing the corresponding mur sst value
wq_sst['mur'] = wq_sst.apply(lambda x: get_satellite_sst(mur, x.Latitude, x.Longitude, x.SampleDate), 
                                                axis=1)

In situ date not found in satellite SST 2022-11-09 at 39.36415 -75.88203
In situ date not found in satellite SST 2022-11-09 at 39.2437 -75.9249
In situ date not found in satellite SST 2022-11-09 at 38.80645 -75.90971
In situ date not found in satellite SST 2022-11-09 at 38.5807 -76.0587
In situ date not found in satellite SST 2022-11-09 at 38.20855 -75.80458
In situ date not found in satellite SST 2022-11-09 at 38.56508 -77.19345
In situ date not found in satellite SST 2022-11-09 at 38.3525 -77.2051
In situ date not found in satellite SST 2022-11-09 at 38.3626 -76.99063
In situ date not found in satellite SST 2022-11-09 at 38.6082 -77.1739
In situ date not found in satellite SST 2022-11-09 at 37.97417 -75.86388
In situ date not found in satellite SST 2022-11-09 at 38.16864 -75.94713
In situ date not found in satellite SST 2022-11-09 at 38.69787 -77.02317


In [36]:
# delete large data structures to save memory
del mur
del mur_anom
del mur_clim

### OSTIA SST

Note: OSTIA has many missing dates because the OSTIA file used here begins on 2007-01-01 (per its filename), while the in situ records begin in 2003.

In [48]:
# Open raw SST
path = os.path.join(
    REPO_ROOT, 'data/raw', 
    'METOFFICE-GLO-SST-L4-NRT-OBS-SST-V2_analysed_sst_77.47W-75.53W_36.78N-39.97N_2007-01-01-2023-09-01.nc'
)
ostia = xr.open_dataset(path).analysed_sst
ostia = ostia.rename({'latitude': 'lat', 'longitude': 'lon'})
# convert kelvin to celsius & update metadata
ostia.values = ostia.values - 273.15
ostia.attrs.update({'units': 'celsius',})

# Open calculated climatological SST
path = os.path.join(REPO_ROOT, 'data/interim/data_cache_Dec24', 'ostia_climatology_chesapeake.nc')
ostia_clim = xr.open_dataset(path).climatology

# Compute SST anomaly
ostia_anom = ostia - ostia_clim

In [49]:
%%time

# Create a new column of the wq dataframe containing the corresponding ostia anom value
wq_anom['anom_ostia'] = wq_anom.apply(lambda x: get_satellite_sst(ostia_anom, x.Latitude, x.Longitude, x.SampleDate), 
                                                axis=1)

In situ date not found in satellite SST 2003-01-15 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-03-12 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-04-09 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-04-23 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-05-07 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-05-21 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-06-18 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-07-09 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-07-23 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-08-06 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-08-20 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-09-17 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-09-23 at 39.44149 -76.02599
In situ date not found in satellite SST 2003-10-08 

In [50]:
# Ensure all values are plain floats and none are np.ndarray:
# unwrap arrays with tolist() and leave existing scalars (including NaN) unchanged
wq_anom['anom_ostia'] = wq_anom['anom_ostia'].map(
    lambda x: x.tolist() if isinstance(x, np.ndarray) else x
)
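
The need for this step comes from 0-d numpy arrays ending up in a dataframe column. The following is a small standalone sketch of unwrapping such values, assuming that values which are already plain floats (including NaN) should pass through unchanged; `to_float` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def to_float(x):
    """Unwrap 0-d or single-element arrays to a Python float; pass other values through."""
    if isinstance(x, np.ndarray) and x.size == 1:
        return float(x.reshape(-1)[0])
    return x

# Mixed inputs similar to what a row-wise lookup might return
values = [np.array(1.5), np.array([2.5]), np.nan, 3.0]
cleaned = [to_float(v) for v in values]
```

After this, every entry of `cleaned` is a Python float and the NaN is preserved rather than replaced.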

In [51]:
# Create a new column of the wq dataframe containing the corresponding ostia sst value
wq_sst['ostia'] = wq_sst.apply(lambda x: get_satellite_sst(ostia, x.Latitude, x.Longitude, x.SampleDate), 
                                                axis=1)

In situ date not found in satellite SST 2006-09-12 at 38.6 -77.25694
In situ date not found in satellite SST 2006-08-08 at 38.6 -77.25692
In situ date not found in satellite SST 2006-08-23 at 38.6 -77.25692
In situ date not found in satellite SST 2006-09-06 at 38.6 -77.25692
In situ date not found in satellite SST 2006-09-20 at 38.6 -77.25692
In situ date not found in satellite SST 2006-10-04 at 38.6 -77.25692
In situ date not found in satellite SST 2006-10-10 at 38.6 -77.25692
In situ date not found in satellite SST 2006-10-18 at 38.6 -77.25692
In situ date not found in satellite SST 2006-10-31 at 38.6 -77.25692
In situ date not found in satellite SST 2006-08-07 at 38.64028 -77.22222
In situ date not found in satellite SST 2006-09-12 at 38.64028 -77.22222
In situ date not found in satellite SST 2006-10-10 at 38.64028 -77.22222
In situ date not found in satellite SST 2006-08-28 at 38.3475 -77.3275
In situ date not found in satellite SST 2006-04-11 at 38.4205 -77.3532
In situ date not f

In [52]:
# delete large data structures to save memory
del ostia
del ostia_anom
del ostia_clim

### Cleaning Output

Remove rows that don't have corresponding observations from any of the satellites.

In [53]:
wq_anom = wq_anom[(~wq_anom['anom_mur'].isnull()) | (~wq_anom['anom_geopolar'].isnull()) | \
                (~wq_anom['anom_ostia'].isnull())]

In [54]:
# NOTE: unlike the anomaly filter above, this filter does not consider the 'ostia' column
wq_sst = wq_sst[(~wq_sst['mur'].isnull()) | (~wq_sst['geopolar'].isnull())]
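
The "keep a row if any satellite has a value" rule can also be written with `notna().any(axis=1)`. This is a sketch on made-up rows, assuming all three satellite columns should count toward the test; the column names follow `wq_sst`, and `sat_cols`, `has_any`, and `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; column names follow wq_sst
df = pd.DataFrame({
    'Station': ['A', 'B', 'C'],
    'geopolar': [10.0, np.nan, np.nan],
    'mur': [np.nan, np.nan, 12.0],
    'ostia': [np.nan, np.nan, np.nan],
})

sat_cols = ['geopolar', 'mur', 'ostia']
# True for rows where at least one satellite column is not null
has_any = df[sat_cols].notna().any(axis=1)
kept = df[has_any]
```

Station B is dropped because none of its satellite columns has a value. Listing the columns once makes it harder for the two filters to drift apart.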

Sort values by station and sample date

In [55]:
wq_anom = wq_anom.sort_values(['Station', 'SampleDate']).reset_index(drop=True)

In [56]:
wq_sst = wq_sst.sort_values(['Station', 'SampleDate']).reset_index(drop=True)

In [57]:
wq_anom

Unnamed: 0,Station,Latitude,Longitude,SampleDate,anom_cbp,anom_geopolar,anom_mur,anom_ostia
0,CB2.1,39.44149,-76.02599,2003-01-15,-2.038462,-0.741252,-0.803528,
1,CB2.1,39.44149,-76.02599,2003-03-12,-4.000000,-3.576567,-5.062172,
2,CB2.1,39.44149,-76.02599,2003-04-09,-5.912000,-2.275242,-5.617992,
3,CB2.1,39.44149,-76.02599,2003-04-23,0.288000,-2.860340,-6.881838,
4,CB2.1,39.44149,-76.02599,2003-05-07,-2.892593,-3.921206,-6.184295,
...,...,...,...,...,...,...,...,...
11128,WT8.3,38.84250,-76.53410,2022-08-09,2.217391,2.247086,,2.103586
11129,WT8.3,38.84250,-76.53410,2022-09-13,2.120455,1.351775,,1.260589
11130,WT8.3,38.84250,-76.53410,2022-10-12,-0.526087,-2.154529,,-2.478829
11131,WT8.3,38.84250,-76.53410,2022-11-14,1.795000,2.750237,,2.723844


### Save File

In [62]:
path = os.path.join(
    REPO_ROOT, 'data/processed', 
    'SSTanom_satellites_cbp_stations.csv'
    # 'anomaly_values_satellites_CBPstations_filtered.csv' OLD NAMING PATTERN
)
wq_anom.to_csv(path, index=False)

In [64]:
path = os.path.join(
    REPO_ROOT, 'data/processed', 
    'SST_satellites_cbp_stations.csv'
    # New name: sst_values_satellites_cbp_stations.csv
)
wq_sst.to_csv(path, index=False)

In [65]:
wq_sst

Unnamed: 0,Station,Latitude,Longitude,SampleDate,MeasureValue,geopolar,mur,ostia
0,1AAUA001.39,38.40000,-77.32000,2007-03-22,8.300,4.000000,,2.9700012
1,1AAUA001.39,38.40000,-77.32000,2007-06-18,26.200,21.970001,,20.51001
2,1AAUA001.39,38.40000,-77.32000,2007-10-29,14.500,18.019989,,17.01001
3,1AAUA001.39,38.40000,-77.32000,2007-12-10,5.200,10.750000,,11.369995
4,1AAUA001.39,38.40000,-77.32000,2008-02-04,4.900,3.670013,,4.0
...,...,...,...,...,...,...,...,...
35723,YRK031.24,37.50465,-76.79252,2008-07-22,29.460,,27.756989,
35724,YRK031.24,37.50465,-76.79252,2008-08-22,26.455,,26.091003,
35725,YRK031.24,37.50465,-76.79252,2008-09-17,25.105,,23.483002,
35726,YRK031.24,37.50465,-76.79252,2008-10-16,21.134,,21.225006,


## Depth Sensitivity (RMSE)

Note: consider deleting this section or moving it to another file. The cells below use `wq_anom` and its column names (`anom_cbp`, `anom_geopolar`, `anom_mur`) rather than the earlier `wq_sst` naming, on the assumption that `anom_cbp` corresponds to the old `MeasureAnomaly` column.

Computing RMSE here so I don't have to read the file in from another notebook.

In [None]:
wq_anom['geopolar_diff'] = wq_anom['anom_geopolar'] - wq_anom['anom_cbp']
wq_anom['mur_diff'] = wq_anom['anom_mur'] - wq_anom['anom_cbp']

In [None]:
# N counts only rows with a valid difference; .sum() skips NaN by default
N = len(wq_anom[~wq_anom['geopolar_diff'].isnull()])

rmse_geopolar = np.sqrt((wq_anom['geopolar_diff']**2).sum() / N)
print('rmse geopolar: ', rmse_geopolar)

N = len(wq_anom[~wq_anom['mur_diff'].isnull()])

rmse_mur = np.sqrt((wq_anom['mur_diff']**2).sum() / N)
print('rmse mur: ', rmse_mur)
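
The RMSE over rows with a valid difference can be written compactly with `np.nanmean`, which ignores NaN in the same way as dividing the NaN-skipping sum by the count of non-null rows. This is a minimal sketch on made-up numbers; `rmse` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def rmse(satellite, in_situ):
    """Root-mean-square error over pairs where both values are present."""
    diff = np.asarray(satellite, dtype=float) - np.asarray(in_situ, dtype=float)
    # nanmean averages only the non-NaN squared differences
    return float(np.sqrt(np.nanmean(diff ** 2)))

# Made-up values; the third pair is skipped because the satellite value is missing
sat = [1.0, 2.0, np.nan, 4.0]
obs = [0.0, 2.0, 3.0, 2.0]
result = rmse(sat, obs)
```

With these inputs the squared differences are 1, 0, and 4, so `result` is the square root of 5/3.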

| Depth Range | Geopolar RMSE | MUR RMSE |
| ----------- | ------------- | -------- |
| 0.5-3m      | 1.422         | 1.768    |
| 0.5-7m      | 1.477         | 1.763    |
| **1-3m**    | 1.38          | 1.75     |
| 1-7m        | 1.4749        | 1.757    |

The same values rounded to two decimal places:

| Depth Range | Geopolar RMSE | MUR RMSE |
| ----------- | ------------- | -------- |
| 0.5-3m      | 1.42          | 1.77     |
| 0.5-7m      | 1.48          | 1.76     |
| **1-3m**    | 1.38          | 1.75     |
| 1-7m        | 1.47          | 1.76     |