## Data Prerequisites

All of our tutorials that train and evaluate a model use the CORA dataset, either in its original form or with some extensions. This notebook walks you through downloading the essential pieces of the dataset and explains the data structure NeuralCora expects, so that you can run all of the tutorials.

### Locate the CORA samples

This project ships the 2013 hourly coastal-ocean height cube and a matching land mask under `data/`. Run the cell below to confirm the files are present and adjust the paths if you placed the downloads elsewhere.


In [1]:
from pathlib import Path

def locate_repo_root(start: Path) -> Path:
    for candidate in [start] + list(start.parents):
        if (candidate / 'data').exists():
            return candidate
    raise FileNotFoundError("Could not find a 'data' directory relative to this notebook.")

REPO_ROOT = locate_repo_root(Path.cwd())
DATA_PATH = REPO_ROOT / 'data' / 'NY_2013_180_360.nc'
MASK_PATH = REPO_ROOT / 'data' / 'real_land_mask_180_360.nc'

print(f'Repository root: {REPO_ROOT}')
for path in (DATA_PATH, MASK_PATH):
    status = 'found' if path.exists() else 'missing'
    print(f'{path.name}: {status} @ {path}')


Repository root: /Users/Work/Library/CloudStorage/GoogleDrive-yunlong.pan@stonybrook.edu/My Drive/neuralcora
NY_2013_180_360.nc: found @ /Users/Work/Library/CloudStorage/GoogleDrive-yunlong.pan@stonybrook.edu/My Drive/neuralcora/data/NY_2013_180_360.nc
real_land_mask_180_360.nc: found @ /Users/Work/Library/CloudStorage/GoogleDrive-yunlong.pan@stonybrook.edu/My Drive/neuralcora/data/real_land_mask_180_360.nc


### Prepare hourly timestamps

The `time` coordinate in the NetCDF cube stores the hour index (0-8759) for 2013. The cell below converts those integers into actual timestamps, then optionally writes a new NetCDF file with the datetime-aware coordinate. Install the optional dependencies first if they are missing (for example, `pip install xarray netcdf4 pandas`).


In [2]:
import numpy as np

try:
    import pandas as pd
    import xarray as xr
except ImportError as exc:
    raise ImportError("Install xarray, netcdf4, and pandas before running this cell (pip install xarray netcdf4 pandas).") from exc

OUTPUT_PATH = DATA_PATH.with_name('NY_2013_180_360_datetime.nc')
OVERWRITE = False

ds = xr.load_dataset(DATA_PATH)
updated = ds
try:
    if 'time' not in ds:
        raise KeyError("Expected 'time' coordinate in the CORA dataset.")

    time_coord = ds['time'].values
    if np.issubdtype(time_coord.dtype, np.datetime64):
        print('Time coordinate already stores datetime64 values.')
    else:
        hours = np.asarray(time_coord, dtype=np.int64)
        base_timestamp = pd.Timestamp('2013-01-01 00:00:00')
        datetimes = base_timestamp + pd.to_timedelta(hours, unit='h')
        print(f'Converted {datetimes.size} hourly steps from integers to timestamps.')
        print('First five timestamps:', datetimes[:5])
        print('Last five timestamps:', datetimes[-5:])
        updated = ds.assign_coords(time=('time', datetimes))

    if OUTPUT_PATH.exists() and not OVERWRITE:
        print(f'Output already exists at {OUTPUT_PATH}. Set OVERWRITE = True to replace it.')
    else:
        updated.to_netcdf(OUTPUT_PATH)
        print(f'Saved dataset with datetime coordinates to {OUTPUT_PATH}')
finally:
    if updated is not ds:
        updated.close()
    ds.close()


Time coordinate already stores datetime64 values.
Saved dataset with datetime coordinates to /Users/Work/Library/CloudStorage/GoogleDrive-yunlong.pan@stonybrook.edu/My Drive/neuralcora/data/NY_2013_180_360_datetime.nc
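
As a quick sanity check on the hour-index convention described above, the following is a minimal standard-library sketch of the same mapping. The helper name `hour_index_to_datetime` is hypothetical and not part of NeuralCora; it only assumes the base timestamp `2013-01-01 00:00` used in the cell above.

```python
from datetime import datetime, timedelta

BASE = datetime(2013, 1, 1)  # same base timestamp as the conversion cell


def hour_index_to_datetime(hour_index: int) -> datetime:
    """Map an hour offset from 2013-01-01 00:00 to a timestamp (hypothetical helper)."""
    return BASE + timedelta(hours=hour_index)


# Number of hourly steps in 2013, which is not a leap year.
hours_in_2013 = int((datetime(2014, 1, 1) - BASE) / timedelta(hours=1))

print(hours_in_2013)                               # 8760
print(hour_index_to_datetime(0))                   # 2013-01-01 00:00:00
print(hour_index_to_datetime(hours_in_2013 - 1))   # 2013-12-31 23:00:00
```

The last valid index is therefore 8759, which matches the 0-8759 range stated for the `time` coordinate.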


### Validate the saved cube

After writing the new file, reload it to confirm the `time` coordinate now stores `datetime64[ns]` values.


In [None]:
try:
    import xarray as xr
except ImportError as exc:
    raise ImportError("Install xarray before running this validation cell (pip install xarray).") from exc

reloaded = xr.load_dataset(OUTPUT_PATH)
try:
    time_values = reloaded['time'].values
    print('Coordinate dtype:', time_values.dtype)
    print('First five timestamps:', time_values[:5])
    print('Last five timestamps:', time_values[-5:])
finally:
    reloaded.close()
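
Beyond checking the dtype, it can be useful to confirm that the time axis has no gaps. The sketch below is one possible way to do that with the standard library only; `is_regular_hourly` is a hypothetical helper, and the demo list stands in for the real coordinate. To apply it to the reloaded cube, one option (an assumption, not something this notebook does) is to convert the values first with `time_values.astype('datetime64[s]').tolist()`.

```python
from datetime import datetime, timedelta


def is_regular_hourly(timestamps) -> bool:
    """Return True if consecutive timestamps are exactly one hour apart."""
    one_hour = timedelta(hours=1)
    return all(
        later - earlier == one_hour
        for earlier, later in zip(timestamps, timestamps[1:])
    )


# Demo data: 48 consecutive hourly steps starting at 2013-01-01 00:00.
demo = [datetime(2013, 1, 1) + timedelta(hours=h) for h in range(48)]
print(is_regular_hourly(demo))      # True

# Dropping one step creates a two-hour gap, which the check detects.
with_gap = demo[:10] + demo[11:]
print(is_regular_hourly(with_gap))  # False
```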


### NumPy-only helper

If you only need raw `numpy.datetime64` values (for example, to create labels for plots), the helper below works without xarray. Replace the demo `hour_indices` array with your data.


In [None]:
import numpy as np

hour_indices = np.arange(0, 24)  # replace with your hour offsets
base_timestamp = np.datetime64('2013-01-01T00:00')
timestamps = base_timestamp + hour_indices.astype('timedelta64[h]')
print('Sample hourly timestamps:', timestamps[:5])


In [3]:
from pathlib import Path

def locate_repo_root(start: Path) -> Path:
    for candidate in [start] + list(start.parents):
        if (candidate / 'data').exists():
            return candidate
    raise FileNotFoundError("Could not find a 'data' directory relative to this notebook.")

REPO_ROOT = locate_repo_root(Path.cwd())
DATA_PATH = REPO_ROOT / 'data' / 'NY_2013_180_360_january.nc'
MASK_PATH = REPO_ROOT / 'data' / 'real_land_mask_180_360.nc'

print(f'Repository root: {REPO_ROOT}')
for path in (DATA_PATH, MASK_PATH):
    status = 'found' if path.exists() else 'missing'
    print(f'{path.name}: {status} @ {path}')


Repository root: /Users/yunlong/Library/Mobile Documents/com~apple~CloudDocs/Documents/GitHub/neuralcora
NY_2013_180_360_january.nc: found @ /Users/yunlong/Library/Mobile Documents/com~apple~CloudDocs/Documents/GitHub/neuralcora/data/NY_2013_180_360_january.nc
real_land_mask_180_360.nc: found @ /Users/yunlong/Library/Mobile Documents/com~apple~CloudDocs/Documents/GitHub/neuralcora/data/real_land_mask_180_360.nc


In [4]:
# !pip install xarray netcdf4 pandas

import xarray as xr

ds = xr.load_dataset(DATA_PATH)

print(ds)

<xarray.Dataset> Size: 386MB
Dimensions:    (time: 744, latitude: 180, longitude: 360)
Coordinates:
  * time       (time) datetime64[ns] 6kB 2013-01-01 ... 2013-01-31T23:00:00
  * latitude   (latitude) float64 1kB 40.25 40.26 40.26 ... 41.49 41.49 41.5
  * longitude  (longitude) float64 3kB -74.5 -74.49 -74.48 ... -71.26 -71.25
Data variables:
    zeta       (time, latitude, longitude) float64 386MB nan nan nan ... nan nan
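
The reported size follows from the array shape: `zeta` holds one `float64` value (8 bytes) per grid point and time step. A minimal sketch of that arithmetic, using a hypothetical helper `estimate_nbytes` and ignoring the few kilobytes taken by the coordinates:

```python
from math import prod


def estimate_nbytes(shape, itemsize=8):
    """Bytes needed for a dense array of the given shape (float64 uses 8 bytes per value)."""
    return prod(shape) * itemsize


# Shape taken from the printed dataset: (time, latitude, longitude).
january_bytes = estimate_nbytes((744, 180, 360))
print(f'{january_bytes / 1e6:.0f} MB')  # 386 MB
```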


In [5]:
# Select the first 21 days of January as a smaller demo subset
# 21 days * 24 hours per day = 504 time steps

january_ds = ds.isel(time=slice(0, 21*24))
print(january_ds)

<xarray.Dataset> Size: 261MB
Dimensions:    (time: 504, latitude: 180, longitude: 360)
Coordinates:
  * time       (time) datetime64[ns] 4kB 2013-01-01 ... 2013-01-21T23:00:00
  * latitude   (latitude) float64 1kB 40.25 40.26 40.26 ... 41.49 41.49 41.5
  * longitude  (longitude) float64 3kB -74.5 -74.49 -74.48 ... -71.26 -71.25
Data variables:
    zeta       (time, latitude, longitude) float64 261MB nan nan nan ... nan nan
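
The step counts in the two printed datasets can be checked with simple calendar arithmetic. This sketch assumes nothing beyond the standard library: January 2013 has 31 days (744 hourly steps), while the slice above keeps 504 steps, which corresponds to the first 21 days.

```python
import calendar

HOURS_PER_DAY = 24

# monthrange returns (weekday of the first day, number of days in the month).
days_in_january = calendar.monthrange(2013, 1)[1]

full_month_steps = days_in_january * HOURS_PER_DAY
demo_steps = 21 * HOURS_PER_DAY

print(days_in_january)   # 31
print(full_month_steps)  # 744
print(demo_steps)        # 504
```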


In [6]:
january_ds.to_netcdf(REPO_ROOT / 'data' / 'NY_2013_180_360_demo.nc')