# A new preprocessing script

What does the preprocessing script have to do? Prepare the raw SPCAM data for use in the NN data generator.

This entails:
1. Computing derived variables
2. Flattening to a sample dimension
3. Cropping levels and potentially lat/lon
4. Shuffling the training dataset
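
The four steps can be sketched on a tiny synthetic array. This is a minimal NumPy sketch, not the cbrain implementation: the array shapes, the `lev_slice` crop and the helper name `preprocess_sketch` are all made up for illustration.

```python
import numpy as np

def preprocess_sketch(ap, tend, dt, lev_slice, seed=0):
    """Hypothetical helper: the four steps on (time, lev, lat, lon) arrays."""
    # 1. Derived variable: state before physics = state after physics - tendency * dt
    bp = ap - tend * dt
    # 3. Crop levels (done before flattening so less data is moved around)
    bp = bp[:, lev_slice]
    # 2. Flatten time/lat/lon into one sample dimension -> (sample, lev)
    nlev = bp.shape[1]
    flat = np.moveaxis(bp, 1, -1).reshape(-1, nlev)
    # 4. Shuffle the samples with a reproducible permutation
    rng = np.random.default_rng(seed)
    return flat[rng.permutation(flat.shape[0])]

# Tiny synthetic data: 4 time steps, 5 levels, 2 lats, 3 lons
data_rng = np.random.default_rng(42)
ap = data_rng.normal(size=(4, 5, 2, 3)).astype(np.float32)
tend = data_rng.normal(size=(4, 5, 2, 3)).astype(np.float32)
samples = preprocess_sketch(ap, tend, np.float32(1800.0), slice(1, 4))
print(samples.shape)  # (24, 3)
```

Cropping before flattening and shuffling last is just one possible order; on the real dataset the shuffle is the expensive part because it touches every sample.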


The preprocessing script should live in the main directory or in a scripts directory, with all of its functions part of the cbrain module.

Then there is the question of how to do the normalization: where do I compute the normalization A and b?
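
One possible answer, sketched under assumptions: read `b` as a per-feature offset and `A` as a per-feature scale, compute both once from the training samples, and apply `(x - b) / A`. These notes do not define A and b, so this reading, the mean/std choice and the function names `fit_norm` and `apply_norm` are all assumptions.

```python
import numpy as np

def fit_norm(x, eps=1e-8):
    """Hypothetical: offset b (mean) and scale A (std) per feature column."""
    b = x.mean(axis=0)
    A = x.std(axis=0) + eps  # eps avoids division by zero for constant features
    return A, b

def apply_norm(x, A, b):
    """Apply the assumed normalization (x - b) / A."""
    return (x - b) / A

# Made-up training samples: 1000 samples, 4 features
rng = np.random.default_rng(0)
train = rng.normal(loc=250.0, scale=30.0, size=(1000, 4))
A, b = fit_norm(train)
train_norm = apply_norm(train, A, b)
```

With this reading, A and b would be computed once on the training set and reused unchanged for the validation set.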

And how do I handle the config file and the train/valid split?

## Read the multi-file dataset: xarray vs plain netcdf

I don't know how to benchmark this properly because of the RAM issue.

In [1]:
DATADIR = '/local/S.Rasp/sp32fbp_andkua/'

In [2]:
!ls $DATADIR/*h1* | head -10

/local/S.Rasp/sp32fbp_andkua//AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-01-01-00000.nc
/local/S.Rasp/sp32fbp_andkua//AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-01-02-00000.nc
/local/S.Rasp/sp32fbp_andkua//AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-01-03-00000.nc
/local/S.Rasp/sp32fbp_andkua//AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-01-04-00000.nc
/local/S.Rasp/sp32fbp_andkua//AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-01-05-00000.nc
/local/S.Rasp/sp32fbp_andkua//AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-01-06-00000.nc
/local/S.Rasp/sp32fbp_andkua//AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-01-07-00000.nc
/local/S.Rasp/sp32fbp_andkua//AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-01-08-00000.nc
/local/S.Rasp/sp32fbp_andkua//AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-01-09-00000.nc
/local/S.Rasp/sp32fbp_andkua//AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-01-10-00000.nc
ls: write error: Broken pipe


### xarray

In [3]:
import xarray as xr

In [4]:
%%time
ds = xr.open_mfdataset(
    DATADIR + 'AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-01-*-00000.nc',
    decode_times=False,
    concat_dim='time'
)

CPU times: user 1.35 s, sys: 140 ms, total: 1.49 s
Wall time: 2.37 s


In [5]:
%%time
ds = xr.open_mfdataset(
    DATADIR + 'AndKua_aqua_SPCAM3.0_sp_fbp32.cam2.h1.0000-*-*-00000.nc',
    decode_times=False,
    concat_dim='time',
    #parallel=True  # This makes the reading significantly slower...
)

CPU times: user 18.7 s, sys: 1.17 s, total: 19.8 s
Wall time: 26.4 s


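This is super fast. Is it a fluke? Probably not: `open_mfdataset` is lazy and only reads the metadata at this point, which is why the variables below show up as dask arrays.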

In [6]:
ds.time

<xarray.DataArray 'time' (time: 17520)>
array([0.000000e+00, 2.083333e-02, 4.166667e-02, ..., 3.649375e+02,
       3.649583e+02, 3.649792e+02])
Coordinates:
  * time     (time) float64 0.0 0.02083 0.04167 0.0625 ... 364.9 365.0 365.0
Attributes:
    long_name:  time
    units:      days since 0000-01-01 00:00:00
    calendar:   noleap
    bounds:     time_bnds

In [7]:
ds.TAP

<xarray.DataArray 'TAP' (time: 17520, lev: 30, lat: 64, lon: 128)>
dask.array<shape=(17520, 30, 64, 128), dtype=float32, chunksize=(48, 30, 64, 128)>
Coordinates:
  * lat      (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 79.53 82.31 85.1 87.86
  * lon      (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
  * lev      (lev) float64 3.643 7.595 14.36 24.61 ... 936.2 957.5 976.3 992.6
  * time     (time) float64 0.0 0.02083 0.04167 0.0625 ... 364.9 365.0 365.0
Attributes:
    units:      K
    long_name:  T after physics

### Plain netcdf

No need to even try that if xarray is this fast... Need to time it later!

## Prepare feature and target datasets (actually, make them one!)

In [8]:
input_vars = ['TBP', 'QBP', 'VBP', 'PS', 'SOLIN', 'SHFLX', 'LHFLX']
output_vars = ['TPHYSTND', 'PHQ', 'FSNT', 'FSNS', 'FLNT', 'FLNS', 'PRECT']

In [9]:
all_vars = input_vars + output_vars

### Compute derived variables and cut to correct time steps

Only include what is currently used!

#### Compute time step

In [10]:
dt = ds.time[:2].diff('time').values * 24 * 60 * 60

In [11]:
dt

array([1800.])

In [12]:
type(dt)

numpy.ndarray

In [13]:
import numpy as np

In [14]:
dt = dt.astype(np.float32)   # float32 matches the data; a float64 dt would upcast the products to float64
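
Casting `dt` down matters because it is a one-element array, not a Python scalar: multiplying a float32 field by a float64 array promotes the result to float64 and doubles its memory footprint. A small NumPy check of that promotion rule, with a made-up stand-in array `tend`:

```python
import numpy as np

# Time axis in days with 30-minute steps, as in ds.time
time_days = np.array([0.0, 1.0 / 48.0])
dt64 = np.diff(time_days) * 24 * 60 * 60   # shape (1,), float64, about 1800 s

tend = np.ones((2, 3), dtype=np.float32)   # stand-in for a float32 tendency field

print((tend * dt64).dtype)                     # float64: upcast
print((tend * dt64.astype(np.float32)).dtype)  # float32: preserved
print((tend * float(dt64[0])).dtype)           # float32: Python scalars do not upcast
```

Passing `dt` as a plain Python float would therefore be another way to keep the derived variables in float32.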

#### BP variables

In [15]:
# Dictionary defining the physical tendency associated with each variable
# Must be hardcoded!
phy_dict = {
    'TAP': 'TPHYSTND',
    'QAP': 'PHQ',
    'QCAP': 'PHCLDLIQ',
    'QIAP': 'PHCLDICE',
    'VAP': 'VPHYSTND',
    'UAP': 'UPHYSTND'
}

In [16]:
def compute_bp(ds, var):
    """GCM state at beginning of time step before physics.
    ?BP = ?AP - physical tendency * dt

    Relies on the globals `phy_dict` and `dt` (time step in seconds).

    Args:
        ds: entire xarray dataset
        var: BP variable name, e.g. 'TBP'

    Returns:
        bp: xarray DataArray containing just the BP variable, with the first time step cut.
    """
    base_var = var[:-2] + 'AP'
    return (ds[base_var] - ds[phy_dict[base_var]] * dt)[1:]   # Drop the first time step

In [17]:
def compute_flx(ds, var):
    """Cuts last time step from flux variables"""
    return ds[var][:-1]
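
To make the time alignment concrete: the BP variables drop the first time step and the flux variables drop the last one, so both end up one step shorter and line up index by index. A toy NumPy sketch for a single grid point, with made-up numbers throughout:

```python
import numpy as np

dt = 1800.0                                   # time step in seconds
tap = np.array([300.0, 301.0, 302.0, 303.0])  # made-up T after physics [K]
tphystnd = np.full(4, 1.0 / 1800.0)           # made-up tendency [K/s], about 1 K per step
shflx = np.array([10.0, 11.0, 12.0, 13.0])    # made-up surface flux [W/m2]

tbp = (tap - tphystnd * dt)[1:]   # before-physics state, first step dropped
flx = shflx[:-1]                  # flux, last step dropped

print(tbp.shape, flx.shape)       # both have 3 steps
print(tbp)                        # roughly tap[1:] minus 1 K
print(flx)
```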

In [18]:
%time TBP = compute_bp(ds, 'TBP')

CPU times: user 56 ms, sys: 8 ms, total: 64 ms
Wall time: 103 ms


In [19]:
%time SHFLX = compute_flx(ds, 'SHFLX')

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 6.34 ms


In [20]:
TBP

<xarray.DataArray (time: 17519, lev: 30, lat: 64, lon: 128)>
dask.array<shape=(17519, 30, 64, 128), dtype=float32, chunksize=(47, 30, 64, 128)>
Coordinates:
  * lat      (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 79.53 82.31 85.1 87.86
  * lon      (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
  * lev      (lev) float64 3.643 7.595 14.36 24.61 ... 936.2 957.5 976.3 992.6
  * time     (time) float64 0.02083 0.04167 0.0625 0.08333 ... 364.9 365.0 365.0

In [21]:
SHFLX

<xarray.DataArray 'SHFLX' (time: 17519, lat: 64, lon: 128)>
dask.array<shape=(17519, 64, 128), dtype=float32, chunksize=(48, 64, 128)>
Coordinates:
  * lat      (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 79.53 82.31 85.1 87.86
  * lon      (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
  * time     (time) float64 0.0 0.02083 0.04167 0.0625 ... 364.9 364.9 365.0
Attributes:
    units:        W/m2
    long_name:    Surface sensible heat flux
    cell_method:  time: mean

In [39]:
%time TBP_stack = TBP.stack(sample=('time', 'lat', 'lon'))   # Ok, this fills up my memory... Why?

CPU times: user 54.9 s, sys: 10.1 s, total: 1min 4s
Wall time: 1min 4s


In [40]:
TBP_stack

<xarray.DataArray (lev: 30, sample: 143515648)>
dask.array<shape=(30, 143515648), dtype=float64, chunksize=(30, 385024)>
Coordinates:
  * lev      (lev) float64 3.643 7.595 14.36 24.61 ... 936.2 957.5 976.3 992.6
  * sample   (sample) MultiIndex
  - time     (sample) float64 0.02083 0.02083 0.02083 ... 0.02083 0.02083
  - lat      (sample) float64 -87.86 -87.86 -87.86 ... -87.86 -87.86 -87.86
  - lon      (sample) float64 0.0 2.812 5.625 8.438 ... 73.12 75.94 78.75 81.56

In [22]:
%time SHFLX_stack = SHFLX.stack(sample=('time', 'lat', 'lon'))   # Memory jumps to 20 GB. Why!?

CPU times: user 1min 5s, sys: 9.92 s, total: 1min 15s
Wall time: 1min 15s


In [23]:
SHFLX_stack

<xarray.DataArray 'SHFLX' (sample: 143515648)>
dask.array<shape=(143515648,), dtype=float32, chunksize=(393216,)>
Coordinates:
  * sample   (sample) MultiIndex
  - time     (sample) float64 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
  - lat      (sample) float64 -87.86 -87.86 -87.86 ... -87.86 -87.86 -87.86
  - lon      (sample) float64 0.0 2.812 5.625 8.438 ... 73.12 75.94 78.75 81.56
Attributes:
    units:        W/m2
    long_name:    Surface sensible heat flux
    cell_method:  time: mean

In [29]:
TBP.lev.size

30
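
A back-of-the-envelope check helps to read the memory blow-up in the stacking cells. The sizes below come from the shapes printed in this notebook. The reshape at the end is only a sketch of a flatten that avoids a MultiIndex, on a small made-up array; it is not what `stack` does internally.

```python
import numpy as np

ntime, nlev, nlat, nlon = 17519, 30, 64, 128
nsample = ntime * nlat * nlon
print(nsample)                      # 143515648, matches the sample dimension

gb = 1e9
shflx_gb = nsample * 4 / gb         # SHFLX as float32: about 0.57 GB
tbp32_gb = nsample * nlev * 4 / gb  # TBP as float32: about 17.2 GB
tbp64_gb = nsample * nlev * 8 / gb  # TBP as float64: about 34.4 GB
print(shflx_gb, tbp32_gb, tbp64_gb)

# Flatten (time, lev, lat, lon) -> (sample, lev) with plain NumPy
small = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)
flat = np.moveaxis(small, 1, -1).reshape(-1, 3)
print(flat.shape)                   # (40, 3)
```

The SHFLX data alone is only about 0.57 GB, so the roughly 20 GB seen when stacking it cannot be the field itself. A plausible culprit, though not verified here, is the `sample` MultiIndex that gets built over all 143 million samples.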