hrrr3.HRRR fails when fetching 2 day forecast #46

Open
SorooshMani-NOAA opened this issue Sep 27, 2022 · 15 comments
Comments

SorooshMani-NOAA (Contributor) commented Sep 27, 2022

6-hour cycles of HRRR (either on NOMADS or AWS) have 48-hour forecast data. However, when used in the following script, hrrr3.HRRR fails to download the data:

import pandas as pd
import numpy as np
from matplotlib.transforms import Bbox
from datetime import datetime, timedelta
from pyschism.forcing.nws.nws2 import hrrr3

box = Bbox([[-70, 44], [-69, 45]])

now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now-timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()

start = (last_cycle - timedelta(days=1)).replace(hour=0)
end = last_cycle + timedelta(days=2)

rndelta = end - start
rnday = rndelta.total_seconds() / 86400.

hrrr3.HRRR(start_date=start, rnday=rnday, record=2, bbox=box)

with the following error:

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "LIBPATH/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "LIBPATH/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "REPOPATH/pyschism/forcing/nws/nws2/hrrr3.py", line 129, in gen_sflux
    inventory = AWSGrib2Inventory(date, record, pscr)
  File "REPOPATH/pyschism/forcing/nws/nws2/hrrr3.py", line 46, in __init__
    for obj in page['Contents']:
KeyError: 'Contents'
"""

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In [59], line 1
----> 1 hrrr3.HRRR(start_date=start, rnday=rnday, record=2, bbox=box)

File REPOPATH/pyschism/forcing/nws/nws2/hrrr3.py:122, in HRRR.__init__(self, start_date, rnday, pscr, record, bbox)
    119         logger.info(f'npool is {npool}')
    120         pool = mp.Pool(npool)
--> 122         pool.starmap(self.gen_sflux, [(date, record, pscr) for date in datevector])
    123 #        [self.gen_sflux(date, record, pscr) for date in datevector]
    125         pool.close()

File LIBPATH/python3.9/multiprocessing/pool.py:372, in Pool.starmap(self, func, iterable, chunksize)
    366 def starmap(self, func, iterable, chunksize=None):
    367     '''
    368     Like `map()` method but the elements of the `iterable` are expected to
    369     be iterables as well and will be unpacked as arguments. Hence
    370     `func` and (a, b) becomes func(a, b).
    371     '''
--> 372     return self._map_async(func, iterable, starmapstar, chunksize).get()

File LIBPATH/python3.9/multiprocessing/pool.py:771, in ApplyResult.get(self, timeout)
    769     return self._value
    770 else:
--> 771     raise self._value

KeyError: 'Contents'

Note the line numbers might have shifted since I added some print calls to debug.

This does not fail if I try it with the old implementation of HRRR, hrrr.HRRR. However, the limitation of NOMADS is that it only has archive data from the day before:

...
from pyschism.forcing.nws import HRRR

hrrr = HRRR()
hrrr.write('RUNDIR_PATH', 2, start, rnday, True, True, True, box, True)
cuill (Member) commented Sep 30, 2022

@SorooshMani-NOAA This error happens when the requested data is not yet available on AWS.
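For context, `list_objects_v2` responses simply omit the `Contents` key when the prefix matches no objects, which is what surfaces as the `KeyError` in the traceback. A minimal sketch of a defensive page loop (a hypothetical helper, not the pyschism code):

```python
def collect_objects(pages):
    """Gather S3 objects across paginated list_objects_v2 responses.

    Pages with no matching objects omit the 'Contents' key entirely,
    so indexing page['Contents'] raises KeyError; .get() sidesteps
    that and lets the caller report a missing cycle instead of crashing.
    """
    data = []
    for page in pages:
        data.extend(page.get('Contents', []))
    return data

# Simulated responses: one page with hits, one page with no matches.
pages = [{'Contents': [{'Key': 'a.grib2'}, {'Key': 'b.grib2'}]}, {'KeyCount': 0}]
print(collect_objects(pages))  # → [{'Key': 'a.grib2'}, {'Key': 'b.grib2'}]
```

With an empty result, the caller could raise a clear "cycle not on AWS yet" error rather than the bare `KeyError: 'Contents'`.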

cuill (Member) commented Sep 30, 2022

@SorooshMani-NOAA
I would suggest you use gfs2 and hrrr3, because there are issues with the original gfs/hrrr data.

Below are plots for hrrr and hrrr3. As you can see, hrrr_orig puts data from the curvilinear grid onto the rectilinear grid, which results in the white area of missing values (very large values). SCHISM cannot recognize missing values and takes them as input.
[plot: hrrr_orig]

[plot: hrrr3]
I cannot remember the issues with the original gfs. After comparing it with Dan's sflux, there were some big errors between the two datasets. That's the reason I switched to the original grib2 format.

SorooshMani-NOAA (Contributor, Author) commented

@cuill, that's true, but the forecast data is actually available. For the 6-hour cycles, the data should be available on AWS about 2 hours after the actual cycle, just like on NOMADS. However, unlike the old hrrr.HRRR implementation, hrrr3 doesn't look for the forecast stamps; it just tries to find files with a specific date & time stamp (as far as I understand).

So if I want to forecast for 2 days, I need to look at a 6-hour cycle which has forecasts into the future. This is what I'm doing in the script above (snippet below):

now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now-timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()

start = (last_cycle - timedelta(days=1)).replace(hour=0)
end = last_cycle + timedelta(days=2)

To quote myself from the ticket you don't have access to:

...it seems that on cycles 00, 06, 12, 18 there's a 48 hour extended forecast (I didn't know this!). The same thing appears to be true in case of BDP S3 buckets for 6-hour cycles...

But the hrrr3 implementation only looks at the date stamp (e.g. 20220930 and t13z in noaa-hrrr-bdp-pds/hrrr.20220930/conus/hrrr.t13z.wrfsfcf01.grib2), not the forecast stamp (e.g. f01 in the same file). For HRRR hourly cycles, each time stamp (e.g. 20220930 and t13z) has 19 forecast files (f00, f01, ..., f18), while cycles 00, 06, 12, and 18 (e.g. 20220930 and t00z, t06z, ...) have 49 (f00, f01, ..., f48). hrrr.HRRR correctly fetches this data, but hrrr3 doesn't.
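To illustrate, here is a sketch of selecting keys by both the cycle stamp and the forecast stamp, following the key pattern quoted above (the helper and sample keys are hypothetical, not pyschism code):

```python
import re

def forecast_keys(keys, date_str, cycle, max_fxx):
    """Filter HRRR bucket keys to one cycle's surface forecast files.

    Keeps hrrr.<date>/conus/hrrr.t<cycle>z.wrfsfcfNN.grib2 for NN <= max_fxx,
    skipping the .idx sidecar files (the trailing $ excludes them).
    """
    tz = f"t{cycle:02d}z"
    pat = re.compile(rf"hrrr\.{date_str}/conus/hrrr\.{tz}\.wrfsfcf(\d{{2}})\.grib2$")
    out = []
    for key in keys:
        m = pat.match(key)
        if m and int(m.group(1)) <= max_fxx:
            out.append(key)
    return sorted(out)

keys = [
    "hrrr.20220930/conus/hrrr.t00z.wrfsfcf00.grib2",
    "hrrr.20220930/conus/hrrr.t00z.wrfsfcf01.grib2",
    "hrrr.20220930/conus/hrrr.t00z.wrfsfcf01.grib2.idx",
    "hrrr.20220930/conus/hrrr.t01z.wrfsfcf01.grib2",
]
# keeps only the two t00z .grib2 files, dropping .idx and other cycles
print(forecast_keys(keys, "20220930", 0, 48))
```

Passing max_fxx=48 for the 00/06/12/18 cycles and 18 for the hourly cycles would capture the distinction described above.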

SorooshMani-NOAA (Contributor, Author) commented Sep 30, 2022

I would suggest you use gfs2 and hrrr3 because there are issues with original gfs/hrrr data.

@cuill thanks for this information. I would really like to use hrrr3; it's the first thing I tried, but due to the issue I described in the previous comment, I cannot. I manually went to the S3 bucket and checked: the forecast file was there, but hrrr3 wouldn't pick it up, and I assumed that's because it's not looking at the forecast stamps.

SorooshMani-NOAA (Contributor, Author) commented

Here it seems to be using just the date-time stamp:

self.forecast_cycle = self.start_date  # nearest_cycle()
paginator = self.s3.get_paginator('list_objects_v2')
pages = paginator.paginate(
    Bucket=self.bucket,
    Prefix=f'hrrr.{self.forecast_cycle.strftime("%Y%m%d")}'
           f'/{self.product}/')
data = []
for page in pages:
    for obj in page['Contents']:
        data.append(obj)
self.cycle = self.forecast_cycle.hour
tz = 't{:02d}z'.format(self.cycle)
self.file_metadata = list(sorted([
    _['Key'] for _ in data
    if 'wrfsfcf' in _['Key'] and tz in _['Key'] and 'idx' not in _['Key']
]))
for key in self.file_metadata[1:record*24+1]:
    filename = pathlib.Path(self.tmpdir) / key
    filename.parent.mkdir(parents=True, exist_ok=True)
    with open(filename, 'wb') as f:
        logger.info(f'Downloading file {key}, ')
        try:
            self.s3.download_fileobj(self.bucket, key, f)

cuill (Member) commented Sep 30, 2022

@SorooshMani-NOAA
With your script, rnday is 3.5 days. HRRR has 49 forecast records.

hrrr3 considers both hindcast and forecast scenarios:

  • For hindcast, rnday is the real run period; set record = 1 (day), which will generate rnday files, each with 24 timestamps.
  • For forecast, if you start from today, set rnday = 1 and record = 2 (days), which will generate one file with 48 timestamps.
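The two scenarios can be summarized in a tiny sketch (illustrative only; the helper is hypothetical, and the 24/48 figures follow the description above):

```python
def expected_output(rnday: int, record: int, forecast: bool):
    """Files written and timestamps per file, per the rule above:
    hindcast writes rnday daily files of 24 timestamps each;
    forecast writes one file with record*24 timestamps."""
    if forecast:
        return 1, record * 24
    return rnday, 24

print(expected_output(rnday=3, record=1, forecast=False))  # → (3, 24)
print(expected_output(rnday=1, record=2, forecast=True))   # → (1, 48)
```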

As suggested by Dan, hrrr3 can take the forecast at t00z, t06z, t12z, or t18z, which have 49 points and cover a longer period than t01z, t02z, t03z, etc., which only have 19 points.

For example, if your nearest cycle is 2022093000, it will generate hrrr_2022093000.nc; likewise, if your nearest cycle is 2022093006, it will generate hrrr_2022093006.nc.
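The cycle-to-filename convention described here can be sketched as follows (hypothetical helpers, not the pyschism API):

```python
from datetime import datetime

def nearest_6h_cycle(t: datetime) -> datetime:
    """Floor a time to the previous 00/06/12/18Z cycle."""
    return t.replace(hour=t.hour - t.hour % 6, minute=0, second=0, microsecond=0)

def hrrr_filename(cycle: datetime) -> str:
    """hrrr_<YYYYMMDDHH>.nc, matching the names quoted above."""
    return cycle.strftime('hrrr_%Y%m%d%H.nc')

cycle = nearest_6h_cycle(datetime(2022, 9, 30, 7, 45))
print(hrrr_filename(cycle))  # → hrrr_2022093006.nc
```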

cuill (Member) commented Sep 30, 2022

Here is the script I just used to generate hrrr_2022093006.nc

from datetime import datetime, timedelta
from time import time

import numpy as np
import pandas as pd

from pyschism.mesh import Hgrid
from pyschism.forcing.nws.nws2.hrrr3 import HRRR

t0 = time()
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now-timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()

print(last_cycle)

rnday = 1
record = 2
hgrid = Hgrid.open('../../static/hgrid.gr3', crs='epsg:4326')

pscr = '/sciclone/pscr/lcui01/HRRR/'
hrrr = HRRR(start_date=last_cycle, rnday=rnday, pscr=pscr, record=record, bbox=hgrid.bbox)
print(f'It took {(time()-t0)/60} mins to process {record*24} files/day')

SorooshMani-NOAA (Contributor, Author) commented

@cuill, thank you for the information. So are you saying that to resolve my issue for an rnday=3.5 run, I need to use record=1 for the hindcast part with rnday=1.5, and after that call hrrr3 with record=2 for the forecast part with rnday=1, and it will fetch 2 days of forecast automatically because my cycle matches the 49-point forecast?

I'm OK with doing this in my scripts, but this is very implicit and opaque to a new user. From a usability perspective, I think it makes more sense for pyschism to take rnday into account internally, so that if I need 2 days of forecast I pass rnday=2 instead of 1. In my opinion, a user doesn't need to know the details of HRRR or GFS, etc.; they just need to know for what period they need data and how long their simulation is. Passing 1 instead of 2 for the forecast section implies that the user knows how HRRR data is organized, which defeats the purpose of having the pyschism abstraction.

Also, it might be beneficial if pyschism could automatically distinguish forecast vs. hindcast, so that I don't need to make multiple calls to hrrr3.HRRR to get both segments of the simulation.
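The split being asked for could be computed internally along these lines (a rough sketch, assuming whole days before the last cycle come from hindcast files and the remainder from the forecast cycle; split_run is hypothetical):

```python
from datetime import datetime, timedelta

def split_run(start: datetime, rnday: float, last_cycle: datetime):
    """Split a run into whole hindcast days plus a forecast segment,
    so the caller only supplies start and rnday."""
    end = start + timedelta(days=rnday)
    # Whole calendar days before the forecast cycle come from hindcast files.
    hindcast_days = max(0, (last_cycle.date() - start.date()).days)
    # Whatever extends past the cycle must come from its extended forecast.
    forecast_days = max(0.0, (end - last_cycle).total_seconds() / 86400.0)
    return hindcast_days, forecast_days

start = datetime(2022, 9, 29)
last_cycle = datetime(2022, 9, 30, 12)
print(split_run(start, 3.5, last_cycle))  # → (1, 2.0)
```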

Maybe this is not a high priority since "this just works" if I know what to do with it, but I think it's an important improvement for new users down the line.

cuill (Member) commented Sep 30, 2022

With your script, the values come out as follows:

last_cycle is 2022-09-30 12:00:00
start is 2022-09-29 00:00:00
rnday is 3.5

So you can set rnday = 2, because only two days (2022-09-29, 2022-09-30) are available for your cycle, and set record = 2; it will then generate hrrr_2022092900.nc and hrrr_2022093000.nc, both with 48 timestamps. The remaining 1.5 days will be covered by sflux1.

cuill (Member) commented Sep 30, 2022

No, I would say the remaining 0.5 days will be covered by sflux1, 1 day by hrrr_2022092900.nc, and 2 days by hrrr_2022093000.nc.

SorooshMani-NOAA (Contributor, Author) commented

I think there's a miscommunication. Let's say I want to do a simulation which is hindcast + forecast. Assume today is 9/30, the time is past 14 Zulu, and I have:

  • start is 2022-09-29 00:00:00
  • rnday is 3.5

I'd like to set up the simulation using HRRR (from hrrr3) + GFS (either the gfs or gfs2 implementation). How should I approach it? I thought the script at the top (ticket description) would be sufficient for setting up the HRRR component, but based on what you're saying, I need to do something else. Can you please write a short snippet showing how to achieve this?

Thanks!

cuill (Member) commented Sep 30, 2022

Here is the code to generate sflux1 from gfs2:

now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now-timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)
rnday = 2
record = 3
hgrid = Hgrid.open('../../static/hgrid.gr3', crs='epsg:4326')
pscr = '/sciclone/pscr/lcui01/GFS/'
GFS(start_date=start, rnday=rnday, pscr=pscr, record=record, bbox=hgrid.bbox)

This will generate
20220929/gfs_2022092900.nc
20220930/gfs_2022093000.nc

You'll link these two files into your sflux directory:

20220929/gfs_2022092900.nc -> sflux_air_1.0001.nc
20220929/gfs_2022092900.nc -> sflux_prc_1.0001.nc
20220929/gfs_2022092900.nc -> sflux_rad_1.0001.nc
20220930/gfs_2022093000.nc -> sflux_air_1.0002.nc
20220930/gfs_2022093000.nc -> sflux_prc_1.0002.nc
20220930/gfs_2022093000.nc -> sflux_rad_1.0002.nc

For sflux2 from hrrr3, the only change is record:
now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now-timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()
start = (last_cycle - timedelta(days=1)).replace(hour=0)
rnday = 2
record = 2
hgrid = Hgrid.open('../../static/hgrid.gr3', crs='epsg:4326')
pscr = '/sciclone/pscr/lcui01/HRRR/'
HRRR(start_date=start, rnday=rnday, pscr=pscr, record=record, bbox=hgrid.bbox)

This will generate:
20220929/hrrr_2022092900.nc
20220930/hrrr_2022093000.nc

You'll link these two files into your sflux directory:

20220929/hrrr_2022092900.nc -> sflux_air_2.0001.nc
20220929/hrrr_2022092900.nc -> sflux_prc_2.0001.nc
20220929/hrrr_2022092900.nc -> sflux_rad_2.0001.nc
20220930/hrrr_2022093000.nc -> sflux_air_2.0002.nc
20220930/hrrr_2022093000.nc -> sflux_prc_2.0002.nc
20220930/hrrr_2022093000.nc -> sflux_rad_2.0002.nc
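The link layout above is mechanical enough to script; a sketch (the helper is hypothetical, but the names follow the sflux_<var>_<n>.<NNNN>.nc pattern shown):

```python
def sflux_links(nc_files, component=2, variables=('air', 'prc', 'rad')):
    """Map daily netCDF files to SCHISM sflux link names,
    e.g. hrrr_2022092900.nc -> sflux_air_2.0001.nc."""
    links = {}
    for i, src in enumerate(nc_files, start=1):
        for var in variables:
            links[f'sflux_{var}_{component}.{i:04d}.nc'] = src
    return links

links = sflux_links(['20220929/hrrr_2022092900.nc', '20220930/hrrr_2022093000.nc'])
print(links['sflux_air_2.0002.nc'])  # → 20220930/hrrr_2022093000.nc
```

In practice each mapping would be created with `os.symlink(src, name)` inside the sflux directory.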

Per the image below, SCHISM will read data as marked by the red color:
[image: IMG_0016]

SorooshMani-NOAA (Contributor, Author) commented

@cuill thank you for your detailed explanation.

I have some follow-up questions:

  • What is the role of record vs rnday arguments?

If I remember correctly, rnday in SCHISM and other places in pyschism means the number of simulation days. However, here in your script it is specified as 2 as opposed to 3.5 (assuming I'm on the hour-12 cycle, so I'm doing a 1.5-day hindcast and a 2-day forecast).

  • Is 2 the number of GFS or HRRR files needed for this simulation?
  • Does rnday mean a different thing in hrrr3 and gfs2 implementations vs other places in pyschism?
  • Also does record in this case mean the number of simulation days in forecast mode?

Either way, as I said earlier, I think it's better to abstract this so that, without requiring exact knowledge, the user can just do

import pandas as pd
import numpy as np
from matplotlib.transforms import Bbox
from datetime import datetime, timedelta
from pyschism.forcing.nws.nws2 import hrrr3

box = Bbox([[-70, 44], [-69, 45]])

now = datetime.now()
last_cycle = np.datetime64(pd.DatetimeIndex([now-timedelta(hours=2)]).floor('6H').values[0], 'h').tolist()

start = (last_cycle - timedelta(days=1)).replace(hour=0)
end = last_cycle + timedelta(days=2)

rndelta = end - start
rnday = rndelta.total_seconds() / 86400.

hrrr3.HRRR(start_date=start, rnday=rnday, bbox=box)
gfs2.GFS(start_date=start, rnday=rnday, bbox=box)

and get the correct files. In my opinion, having rnday and start_date should be sufficient for pyschism to implement the logic internally and pick the right files for GFS and HRRR, without any additional information from the user.

Ideally, I would also think about abstracting the part where we link files into the actual sflux location.

cuill (Member) commented Oct 3, 2022

@SorooshMani-NOAA
Thanks for your suggestion. I'll add it to the to-do list, but I don't have an ETA at the moment; I have other priorities. You are welcome to open a PR.

SorooshMani-NOAA (Contributor, Author) commented

@cuill Sure, I understand there are other priorities. I just wanted to document our discussion and ideas. For now I don't have an immediate need for it; your scripts will help me with my use case. Thank you!
