# Data preparation

NeuralGCM models take and produce data defined on [37 pressure levels](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels), including the following variables, provided in SI units and on the NeuralGCM model's native grid:

- Inputs/outputs (on pressure levels): `u_component_of_wind`, `v_component_of_wind`, `geopotential`, `temperature`, `specific_humidity`, `specific_cloud_ice_water_content`, `specific_cloud_liquid_water_content`.
- Forcings (surface level only): `sea_surface_temperature`, `sea_ice_cover`.

## Regridding data

Preparing a dataset stored on a different horizontal grid for NeuralGCM requires two steps:

1. Horizontal regridding to a Gaussian grid. For fine-resolution data, conservative regridding is most appropriate (and is what we used to train NeuralGCM).
2. Filling in all missing values (NaN), to ensure all inputs are valid. Forcing fields like `sea_surface_temperature` are only defined over ocean in ERA5, and NeuralGCM's surface model also includes a mask that ignores values over land, but we still need to fill all NaN values to prevent them from leaking into our model outputs.
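
The two steps above can be illustrated with a minimal NumPy sketch on a 1-D toy field. This is not the Dinosaur implementation: `conservative_coarsen_1d` and `fill_nan_with_nearest_1d` are hypothetical helpers written for this example, and the `cell_area` weights are made-up stand-ins for grid-cell areas.

```python
import numpy as np


def conservative_coarsen_1d(values, weights, factor):
    """Area-weighted mean over consecutive groups of `factor` cells, skipping NaN."""
    values = np.asarray(values, dtype=float).reshape(-1, factor)
    weights = np.asarray(weights, dtype=float).reshape(-1, factor)
    # Cells with missing data contribute no weight to the coarse cell.
    weights = np.where(np.isnan(values), 0.0, weights)
    total = weights.sum(axis=1)
    summed = (np.nan_to_num(values) * weights).sum(axis=1)
    safe_total = np.where(total > 0, total, 1.0)
    # Coarse cells without any valid fine cell stay NaN.
    return np.where(total > 0, summed / safe_total, np.nan)


def fill_nan_with_nearest_1d(values):
    """Replace each NaN with the value of the nearest non-NaN cell."""
    values = np.asarray(values, dtype=float)
    valid = np.flatnonzero(~np.isnan(values))
    if valid.size == 0:
        raise ValueError('no valid values to fill from')
    idx = np.arange(values.size)
    # For each position, pick the index of the closest valid cell.
    nearest = valid[np.abs(idx[:, None] - valid[None, :]).argmin(axis=1)]
    return values[nearest]


# Toy "sea surface temperature" with NaN over land (illustrative values only).
fine_sst = np.array([np.nan, np.nan, 290.0, 292.0, np.nan, 286.0, 284.0, 284.0])
cell_area = np.array([1.0, 1.0, 1.0, 3.0, 2.0, 2.0, 1.0, 1.0])

coarse_sst = conservative_coarsen_1d(fine_sst, cell_area, factor=2)  # step 1
filled_sst = fill_nan_with_nearest_1d(coarse_sst)                    # step 2
```

The first coarse cell has no valid fine cells and therefore stays NaN after step 1, which is why step 2 is still needed even when regridding skips missing values.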

Utilities for both of these operations are packaged as part of Dinosaur. We'll show how to use them on the Zarr copy of ERA5 from the [ARCO-ERA5](https://github.com/google-research/arco-era5) project:

In [1]:
import numpy as np
import gcsfs
import xarray

# Pythonic file system for Google Cloud Storage (anonymous access)
gcs = gcsfs.GCSFileSystem(token='anon')

path = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'

# Open ERA5 as an xarray.Dataset with the variables required for NeuralGCM
full_era5 = xarray.open_zarr(gcs.get_mapper(path), chunks=None)

In [2]:
full_era5

In [3]:
# Time series for January 2022 at the 1000 hPa level for a single grid point
level_1000 = full_era5.sel(
    level=1000,
    latitude=40.75,
    longitude=286,
    time=slice('2022-01-01', '2022-01-31'),
)
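
The selection above uses `longitude=286` because ERA5 stores longitude in degrees east on the range 0 to 360. As a small sketch, a location given in the -180 to 180 convention can be converted before calling `.sel`; `to_0_360` is a hypothetical helper name used only for this example.

```python
def to_0_360(lon_deg):
    """Convert a longitude in degrees east from [-180, 180) to [0, 360)."""
    return lon_deg % 360.0


# 74 degrees west expressed in the 0-360 convention used by ERA5.
era5_longitude = to_0_360(-74.0)
```

Here `era5_longitude` is `286.0`, matching the value passed to `.sel`.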

In [4]:
# selected_vars = level_1000[['temperature', 'specific_humidity', 'specific_cloud_ice_water_content', 'specific_cloud_liquid_water_content']]

selected_vars = level_1000[['temperature', 'specific_humidity']]
selected_vars['sea_surface_temperature'] = level_1000['sea_surface_temperature']
# selected_vars['sea_ice_cover'] = level_1000['sea_ice_cover']
selected_vars

In [5]:
df = selected_vars.to_dataframe()

In [23]:
# Take an independent copy so that later changes to df_new cannot affect df
df_new = df.copy()

In [25]:
# Drop the first and last columns (selected by position) from both frames
df = df.drop(df.columns[[0, -1]], axis=1)
df_new = df_new.drop(df_new.columns[[0, -1]], axis=1)

In [31]:
# The third column holds the selected latitude (40.75) and the fifth the longitude (286.0)
df_new.columns = ['temperature', 'specific_humidity', 'latitude', 'level', 'longitude']
df_new = df_new.drop(columns=['level'])
df_new

Unnamed: 0,temperature,specific_humidity,latitude,longitude
0,282.569519,0.007179,40.75,286.0
1,282.419830,0.007194,40.75,286.0
2,282.261169,0.007183,40.75,286.0
3,282.115967,0.007143,40.75,286.0
4,282.051605,0.007127,40.75,286.0
...,...,...,...,...
737,268.130768,0.001132,40.75,286.0
738,268.398315,0.001179,40.75,286.0
739,268.476868,0.001247,40.75,286.0
740,269.385986,0.001144,40.75,286.0


In [32]:
df_new.to_csv('era5.csv', index=False)
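
Note that `index=False` omits the DataFrame index from the file. If the index carries time stamps that should be kept, one option is to write it out as well. The sketch below uses a hypothetical two-row frame with made-up values and in-memory buffers rather than the real `era5.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the exported table, indexed by time.
frame = pd.DataFrame(
    {'temperature': [282.5, 282.25], 'specific_humidity': [0.0071, 0.0072]},
    index=pd.to_datetime(['2022-01-01 00:00', '2022-01-01 01:00']).rename('time'),
)

without_index = io.StringIO()
frame.to_csv(without_index, index=False)  # the time stamps are not written

with_index = io.StringIO()
frame.to_csv(with_index)  # default index=True writes them as a 'time' column

without_index.seek(0)
with_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)
cols_with = list(pd.read_csv(with_index, parse_dates=['time']).columns)
```

Only the second file contains a `time` column when read back.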