Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PR]: Update Regrid2 missing and fill value behaviors to align with CDAT and add unmapped_to_nan arg for output data #613

Merged
merged 16 commits into from Apr 10, 2024

Conversation

jasonb5
Copy link
Collaborator

@jasonb5 jasonb5 commented Mar 13, 2024

Description

Todo

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

Copy link

codecov bot commented Mar 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (472c04b) to head (fc17e7a).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #613   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           15        15           
  Lines         1521      1543   +22     
=========================================
+ Hits          1521      1543   +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jasonb5 jasonb5 changed the title Regridding nan update [PR]: Regridding nan update Mar 13, 2024
@tomvothecoder
Copy link
Collaborator

Hey @lee1043, at our past meeting on 3/13, Jason mentioned having an unmapped_to_nan=True option for the Regrid2 API. This enables us to have consistency with the xESMF API and keeps plots looking correct when data has missing values. As a reminder, this option applies to output data. For input data, this PR updates the Regrid2 API to use fill_value to align with CDAT's Regrid2 behavior.

Do you agree with this idea? Thanks.

@lee1043
Copy link
Collaborator

lee1043 commented Mar 15, 2024

@tomvothecoder I agree, thanks!

@tomvothecoder
Copy link
Collaborator

Hi @jasonb5, just checking to see if you can get this done by next Tues (4/2)? I would like to release v0.7.0 mid-next week.

I'm currently debugging #615 to see if I can get anywhere.

- Load dask arrays into memory using `.values`
Copy link
Collaborator

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#615 should be fixed with this commit, 85d1c7c (#613).

The issue was that Dask Arrays are being maintained for bounds and the data variable array. I added calls to convert Dask Arrays to Numpy Arrays using .compute() and replacing replaced calls to .data with .values.

I ran the example code in #615 and it works now. @lee1043 can you confirm?

# %%
import os

import xcdat as xc

target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)

input_data = os.path.join(
    "/p/css03/esgf_publish/",
    "CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/psl/gn/v20181218/",
    "psl_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc",
)

ds = xc.open_dataset(input_data)
ds_regrid = ds.regridder.horizontal("psl", target_grid, tool="regrid2")  # WORKS FINE

# %%
ds = xc.open_mfdataset(input_data)
ds_regrid_mf = ds.regridder.horizontal("psl", target_grid, tool="regrid2")  # WORKS FINE

# %%
ds_regrid.identical(ds_regrid_mf) # True

Comment on lines +273 to +286
def _get_latitude_weights(
bounds: List[Tuple[np.ndarray, np.ndarray]]
) -> List[np.ndarray]:
weights = []

for x, y in bounds:
cell_weight = np.sin(np.deg2rad(x)) - np.sin(np.deg2rad(y))
cell_weight = cell_weight.reshape((-1, 1))

weights.append(cell_weight)

return weights


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extracted this section of code into its own private function and used an explicit for loop to make it more readable compared to list comprehension (IMO).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I don't think Dask Arrays support np.deg2rad and/or np.sin with np.nan values, resulting in ValueError: cannot convert float NaN to integer in #615.

xcdat/regridder/regrid2.py Outdated Show resolved Hide resolved
@@ -505,4 +528,4 @@ def _get_bounds_ensure_dtype(ds, axis):
if bounds.dtype != np.float32:
bounds = bounds.astype(np.float32)

return bounds.data
return bounds.values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to use .values over .data to convert Dask Arrays to np.ndarray

@@ -113,7 +114,7 @@ def _regrid(
lon_mapping, lon_weights = _map_longitude(src_lon_bnds, dst_lon_bnds)

# convert to pure numpy
input_data = input_data_var.astype(np.float32).data
input_data = input_data_var.astype(np.float32).values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to use .values to convert Dask Arrays to .ndarray. Otherwise we get a TypeError: take() got an unexpected keyword argument 'mode' downstream with np.take().

@lee1043
Copy link
Collaborator

lee1043 commented Apr 1, 2024

I confirm this resolves #615.

@lee1043
Copy link
Collaborator

lee1043 commented Apr 2, 2024

It looks like the current version of the code in this PR could introduce unexpected behavior (guess it's a bug?) when there are NaN values in the original field and regrid using regrid2.

import os
import xcdat as xc
from pcmdi_metrics.utils import apply_landmask

input_data = os.path.join(
    "/p/css03/esgf_publish/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/ts/gn/v20200511",
    "ts_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc")

target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)

ds = xc.open_dataset(input_data)

# Apply land mask
ds_masked = ds.copy()
da_masked = apply_landmask(ds, "ts", keep_over="ocean")
ds_masked["ts"] = da_masked

ds_masked["ts"].isel(time=0).plot()

# regrid
ds_masked_regrid = ds_masked.regridder.horizontal("ts", target_grid, tool="regrid2")

ds_masked_regrid["ts"].isel(time=0).plot()

ds (raw data):
output1

ds_masked:
output2

ds_masked_regrid (from code in main branch) -- what I expect
output3

ds_masked_regrid (from this branch) -- what I get
output4

@jasonb5 @tomvothecoder do you have any idea?

@lee1043
Copy link
Collaborator

lee1043 commented Apr 2, 2024

For the above, it looks like 1e20 is filled for grid with NaN then do the regridding, which results 1e20 over the land and ~1e19 along the coastlines. If I plot using the below command, I get the 3rd plot in the above comment instead of the last one.

ds_masked_regrid["ts"].where(ds_masked_regrid["ts"] < 1e16).isel(time=0).plot()

@tomvothecoder
Copy link
Collaborator

Hi @jasonb5, just checking to see if you can get this done by next Tues (4/2)? I would like to release v0.7.0 mid-next week.

I'm currently debugging #615 to see if I can get anywhere.

@jasonb5 any updates? Also #615 is fixed. Jiwoo opened PR #634 to get that one in separately if this PR is going to take longer.

We can push back v0.7.0 another week if needed.

@jasonb5
Copy link
Collaborator Author

jasonb5 commented Apr 2, 2024

@tomvothecoder @lee1043 I'll have time today to finish this up.

@tomvothecoder
Copy link
Collaborator

Thanks for the update @jasonb5!

@jasonb5
Copy link
Collaborator Author

jasonb5 commented Apr 3, 2024

@lee1043 Fixed the issue, after regridding the missing values weren't replaced with Nan's again which resulted in the bad plot. There is still potentially an issue with coastline values ~1e19 so you still need ds_masked_regrid["ts"].where(ds_masked_regrid["ts"] < 1e16).isel(time=0).plot() to clean up the plot. I'm looking into this, might be a separate issue with regrid2.

@lee1043
Copy link
Collaborator

lee1043 commented Apr 3, 2024

@jasonb5 thank you very much for following up and for the update. I test with the latest code in this PR, which has changed 1e20 to nan. However, as you mentioned, there still are issues with coastlines that still concerns me. I think that indicates some of the original information is missed by averaging with 1e20, occurring along the coastlines in this case. Having ~1e19 numbers cause substantial impact on the statistical numbers when conducting model evaluation. I can exclude them by taking < 1e16, but still, that make the coastline grids as blank, which influences to the statistics. Is there any chance to make it those 1e20 or nan to be completely excluded before the regrinding process starts, so they won't affect to the regridded output at all?

ds_masked_regrid (from this branch) -- what I get

output5

By applying ds_masked_regrid["ts"].where(ds_masked_regrid["ts"] < 1e16).isel(time=0).plot()

output6

@jasonb5
Copy link
Collaborator Author

jasonb5 commented Apr 3, 2024

@lee1043 I'm investigating that issue how, I'll probably update this PR to just include a few fixes.

@lee1043
Copy link
Collaborator

lee1043 commented Apr 4, 2024

@jasonb5 thank you for following up. I opened PR #635 for my proposal on this matter. Could you please take a look? Please feel free to discard or merge it to this PR if that looks okay to you.

@jasonb5 jasonb5 changed the title [PR]: Regridding nan update [PR]: Regrid2 update Apr 4, 2024
@jasonb5
Copy link
Collaborator Author

jasonb5 commented Apr 4, 2024

@lee1043 I added another fix, your example still displays the same issue.

The following example works:

import os
import xcdat as xc
import numpy as np
from pcmdi_metrics.utils import apply_landmask, create_land_sea_mask

input_data = os.path.join(
    "/p/css03/esgf_publish/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/ts/gn/v20200511",
    "ts_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc")

target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)

ds = xc.open_dataset(input_data)

# Apply land mask
ds_masked = ds.copy()
mask = create_land_sea_mask(ds_masked)
# invert mask to keep ocean
ds_masked["mask"] = np.absolute(mask - 1)

# ds_masked["psl"].isel(time=0).plot()

# regrid
ds_masked_regrid = ds_masked.regridder.horizontal("ts", target_grid, tool="regrid2")

ds_masked_regrid["ts"].isel(time=0).plot()

I'm still looking into why your example still fails.

@lee1043
Copy link
Collaborator

lee1043 commented Apr 4, 2024

@lee1043 I added another fix, your example still displays the same issue.

The following example works:

import os
import xcdat as xc
import numpy as np
from pcmdi_metrics.utils import apply_landmask, create_land_sea_mask

input_data = os.path.join(
    "/p/css03/esgf_publish/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/ts/gn/v20200511",
    "ts_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc")

target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)

ds = xc.open_dataset(input_data)

# Apply land mask
ds_masked = ds.copy()
mask = create_land_sea_mask(ds_masked)
# invert mask to keep ocean
ds_masked["mask"] = np.absolute(mask - 1)

# ds_masked["psl"].isel(time=0).plot()

# regrid
ds_masked_regrid = ds_masked.regridder.horizontal("ts", target_grid, tool="regrid2")

ds_masked_regrid["ts"].isel(time=0).plot()

I'm still looking into why your example still fails.

Hi @jasonb5, thanks for following up. But I don't think the code in this comment is working. This is what I get which still has coastline issue. Seeing the result plot below, on top of colorbar color range indicates 1e19, which appears along coastline.
output7

It becomes more clear when plotting where >1e10
ds_masked_regrid["ts"].where(ds_masked_regrid["ts"]>1e10).isel(time=0).plot()

output8

I think the key issue here is that 1e20 is being used when interpolating.

In this PR, logic of the code is checking if there is attributes for FillValue or missing value, and if so, 1e20 is injected instead of NaN in the input variable (here). This input variable is passed to regrid2 (here), which include 1e20 when actually interpolating values.

Instead, I think inclusion of 1e20 when interpolating can be prevented by replacing 1e20 or any missing value if exists (nan_replace in the code) to actual NaN (here), before calling the regridding. I proposed this solution in #635. Could you please take a look and let me know if there is any concern?

@jasonb5
Copy link
Collaborator Author

jasonb5 commented Apr 4, 2024

@lee1043 Merged your change, and added a new one. Both pre-applied mask and passing mask as a variable should both work now.

Also note that how the mask is applied by cdms2 and passing the mask as a variable to xcdat differs from pcmdi_metrics apply_landmask function. The prior allows more data through, while the latter is tighter around the coastlines (not sure if this is a problem).

import os
import xcdat as xc
import numpy as np
import xarray as xr
import cdms2
import matplotlib.pyplot as plt
from pcmdi_metrics.utils import apply_landmask, create_land_sea_mask

fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(10,10))

input_data = "ts_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc"

target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)
cdms2_grid = cdms2.createUniformGrid(-90, 180, 1.0, 0, 360, 1.0)

ds = xc.open_dataset(input_data)
cdms2_ds = cdms2.open(input_data)
cdms2_var = cdms2_ds("ts")

mask = create_land_sea_mask(ds)
# cdms2 automatically inverts the mask
cdms2_mask = mask.values

cdms2_regrid = cdms2_var.regrid(cdms2_grid, mask=cdms2_mask, regridTool="regrid2")
cdms2_regrid = xr.DataArray(cdms2_regrid.data, dims=("time", "lat", "lon"))
cdms2_regrid = cdms2_regrid.where(cdms2_regrid!=1e20)
cdms2_regrid_t0 = cdms2_regrid.isel(time=0)

cdms2_regrid_t0.plot(ax=axs[0,0])
(cdms2_regrid_t0 - cdms2_regrid_t0).plot(ax=axs[0,1])

ds1 = ds.copy()
ds1_masked = apply_landmask(ds1, "ts", keep_over="ocean")
ds1["ts"] = ds1_masked

ds1_regrid = ds1.regridder.horizontal("ts", target_grid, tool="regrid2")
ds1_regrid_t0 = ds1_regrid["ts"].isel(time=0)
ds1_regrid_t0.plot(ax=axs[1,0])
# replace nan with 1e20 to show where data is missing
(ds1_regrid_t0.fillna(1e20) - cdms2_regrid_t0.fillna(1e20)).plot(ax=axs[1,1])

ds2 = ds.copy()
# need to manually invert the mask for ocean
ds2["mask"] = np.absolute(mask - 1)

ds2_regrid = ds2.regridder.horizontal("ts", target_grid, tool="regrid2")
ds2_regrid_t0 = ds2_regrid["ts"].isel(time=0)
ds2_regrid_t0.plot(ax=axs[2,0])
# replace nan with 1e20 to show where data is missing
(ds2_regrid_t0.fillna(1e20) - cdms2_regrid_t0.fillna(1e20)).plot(ax=axs[2,1])

plt.tight_layout()
plt.draw()

image

@lee1043
Copy link
Collaborator

lee1043 commented Apr 4, 2024

Hi @jasonb5, thank you for the update. I checked the updated code in this PR works well for my example code above.

Also note that how the mask is applied by cdms2 and passing the mask as a variable to xcdat differs from pcmdi_metrics apply_landmask function. The prior allows more data through, while the latter is tighter around the coastlines (not sure if this is a problem).

Thank you for pointing out. It is kind of intended behavior that in PMP I made the apply landmask function to be more strict when selecting either land or ocean grid. Default criteria for land is land fraction >= 0.8 while for the ocean < 0.2. I might have to rethink on this more to make it more consistent to what CDAT does, to keep minimal impact of xcdat conversion in PMP results.

@jasonb5
Copy link
Collaborator Author

jasonb5 commented Apr 5, 2024

@lee1043 Sounds good, just wanted to make a note.

@tomvothecoder I think this is good to merge, and I'll tackle #625 in the next release.

@tomvothecoder
Copy link
Collaborator

@jasonb5 Great! I'll do a quick final review and then merge.

@tomvothecoder tomvothecoder changed the title [PR]: Regrid2 update [PR]: Update Regrid2 to align with CDAT and add unmapped_to_nan arg for output data Apr 10, 2024
@tomvothecoder tomvothecoder changed the title [PR]: Update Regrid2 to align with CDAT and add unmapped_to_nan arg for output data [PR]: Update Regrid2 missing and fill value behaviors to align with CDAT and add unmapped_to_nan arg for output data Apr 10, 2024
@tomvothecoder
Copy link
Collaborator

I'm going to add the codecov token to the repo and GitHub Actions build to hopefully fix the codecov upload problem and prevent/limit future occurrences.

https://community.codecov.com/t/upload-issues-unable-to-locate-build-via-github-actions-api/3954

@tomvothecoder tomvothecoder merged commit 1f4d22a into main Apr 10, 2024
7 of 9 checks passed
@tomvothecoder tomvothecoder deleted the regrid_nan_update branch April 10, 2024 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug]: regrid2 with open_mfdataset [Bug]: Horizontal regridding from limited domain to global
3 participants