# Train ML model for predictions of week 3-4 & 5-6

This notebook creates a machine learning model (`ML_model`) that predicts weeks 3-4 and 5-6 from `S2S` weeks 3-4 and 5-6 forecasts. Its predictions are compared against `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/).

# Synopsis

## Method: `local convolutional neural network`
We developed a `local convolutional neural network` for this challenge. The approach is a simplified version of the convolutional neural network architecture proposed in Scheuerer et al. (2020). We trained one model per variable and lead time, i.e. 4 models in total.
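
The count of four models follows from combining the two target variables with the two lead times. A minimal sketch of that bookkeeping, using the variable names `t2m` and `tp` and the labels `lead0` and `lead1` that appear in the prediction file names of this notebook. The mapping of `lead0` to weeks 3-4 and `lead1` to weeks 5-6 is an assumption, and `model_keys` is a hypothetical name:

```python
from itertools import product

variables = ['t2m', 'tp']          # target variables
lead_times = ['lead0', 'lead1']    # assumed: lead0 = weeks 3-4, lead1 = weeks 5-6

# one separately trained model per (variable, lead time) pair
model_keys = list(product(variables, lead_times))
print(len(model_keys), model_keys)
```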

## Data used

Training-input for Machine Learning model:
- renku datasets: all hindcasts of the target variables and the tercile edges as features, and all terciled hindcast-like observations as labels

Forecast-input for Machine Learning model:
- renku datasets: all forecasts of the target variables and the tercile edges

Compare Machine Learning model forecast against ground truth:
- renku datasets

## Resources used
See the Reproducibility section.

## Safeguards

All points have to be [x] checked. If not, your submission is invalid.

Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.
(Only in exceptions and if previous effort in reproducibility can be found, it may be allowed to improve readability and reproducibility after November 1st 2021.)

### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1) 

If the organizers suspect overfitting, your contribution can be disqualified.

  - [X] We did not use 2020 observations in training (explicit overfitting and cheating)
  - [X] We did not repeatedly verify our model on 2020 observations and incrementally improve our RPSS (implicit overfitting)
  - [X] We provide RPSS scores for the training period with script `skill_by_year`, see section 5.1
  - [X] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).
  - [X] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.
  - [X] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.
  - [X] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
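
The split principle and cross-validation listed above can be sketched on the level of hindcast years. This is only an illustration: the actual split used in `CNN_train_predict.py` is not shown in this notebook, so the hold-out size, the number of folds, and the helper names `split_years` and `k_fold_years` are hypothetical.

```python
hindcast_years = list(range(2000, 2020))   # 2000-2019, the hindcast period
test_year = 2020                           # withheld, never used for fitting or tuning

def split_years(years, n_validate):
    """Hold out the last n_validate years for validation (hypothetical choice)."""
    return years[:-n_validate], years[-n_validate:]

def k_fold_years(years, k):
    """Yield (train, validate) year lists for k contiguous folds."""
    fold_size = len(years) // k
    for i in range(k):
        start = i * fold_size
        # the last fold takes any remaining years
        stop = len(years) if i == k - 1 else start + fold_size
        validate = years[start:stop]
        train = years[:start] + years[stop:]
        yield train, validate

train_years, validate_years = split_years(hindcast_years, n_validate=4)
assert test_year not in train_years + validate_years

for train, validate in k_fold_years(hindcast_years, k=5):
    print('validate on', validate)
```

Splitting by whole years rather than by individual forecasts keeps strongly correlated neighbouring forecast dates on the same side of the split, which is one common way to limit data leakage.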

### Safeguards for Reproducibility
The notebook/code must be independently reproducible from scratch by the organizers (after the competition); if this is not possible, no prize is awarded.
  - [X] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)
  - [X] Code is well documented, readable and reproducible.
  - [X] Code to reproduce training and predictions is preferred to run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training pipelines that take weeks to train.

# Create Predictions

In [1]:
#!conda activate s2s-ai

In [2]:
# To run CNN_train_predict.py on the data in the repo, set data_path to horat_n.
# The command below was used to run the script on the server, where the data
# was stored in a different folder which is not backed up.
# It takes about 12 h using 25 cores on the specified architecture.
# Running CNN_train_predict.py creates an ensemble of 5 predictions.
# ! taskset --cpu-list 21-45 python CNN_train_predict.py

# Submission

In [3]:
import xarray as xr
xr.set_options(display_style='text')

from scripts import skill_by_year, assert_predictions_2020

import warnings
warnings.simplefilter("ignore") 

path_data = 'server'

In [4]:
def load_pred(pred_folder, years):
    """Load the smoothed `t2m` and `tp` predictions stored in one submission folder.

    Both lead times are concatenated along `lead_time`, and a `pred` dimension
    labelled with the folder name is added so that several folders can be combined.
    """
    das = []
    for v in ['t2m', 'tp']:
        # the file name pattern differs between '2020' and any other value of `years`
        if years == '2020':
            das_lead0 = xr.open_dataset(f'../submissions/{pred_folder}/global_prediction_{v}_lead0_{years}_smooth.nc')[v]
            das_lead1 = xr.open_dataset(f'../submissions/{pred_folder}/global_prediction_{v}_lead1_{years}_smooth.nc')[v]
        else:
            das_lead0 = xr.open_dataset(f'../submissions/{pred_folder}/global_prediction_{v}_lead0_smooth_{years}.nc')[v]
            das_lead1 = xr.open_dataset(f'../submissions/{pred_folder}/global_prediction_{v}_lead1_smooth_{years}.nc')[v]
        das.append(xr.concat([das_lead0, das_lead1], dim='lead_time'))
    return xr.merge(das).expand_dims(dim={'pred': [pred_folder]})
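
The two file name patterns handled by `load_pred` are easy to mix up, so the following sketch isolates just the path construction. `prediction_path` is a hypothetical helper that is not part of the repository; it only mirrors the f-strings used in `load_pred` and does not open any file.

```python
def prediction_path(pred_folder, variable, lead, years):
    """Build the NetCDF path that `load_pred` opens (hypothetical helper)."""
    # '2020' uses '<years>_smooth', every other value uses 'smooth_<years>'
    suffix = f'{years}_smooth' if years == '2020' else f'smooth_{years}'
    return f'../submissions/{pred_folder}/global_prediction_{variable}_lead{lead}_{suffix}.nc'

print(prediction_path('10', 't2m', 0, '2020'))
print(prediction_path('10', 'tp', 1, 'allyears'))
```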

In [5]:
#read predictions for test year
years = '2020'
ds_10 = load_pred('10', years)
ds_20 = load_pred('20', years)
ds_30 = load_pred('30', years)
ds_40 = load_pred('40', years)
ds_50 = load_pred('50', years)

In [6]:
#average over different predictions (obtained using different seeds)
average_pred_2020 = xr.concat([ds_10, ds_20, ds_30, ds_40, ds_50], 'pred').mean('pred')
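
Averaging over the `pred` dimension is an unweighted mean of the five ensemble members. The sketch below shows the same operation for a single grid point in plain Python, assuming each member provides three tercile probabilities (below, near, above normal). All numbers are made up for illustration.

```python
# made-up tercile probabilities (below, near, above normal) of five members at one grid point
members = [
    [0.30, 0.35, 0.35],
    [0.32, 0.33, 0.35],
    [0.28, 0.36, 0.36],
    [0.31, 0.34, 0.35],
    [0.29, 0.32, 0.39],
]

# unweighted mean over the member axis, category by category
n_members = len(members)
average = [sum(category) / n_members for category in zip(*members)]

print([round(p, 3) for p in average])
print(round(sum(average), 6))  # a mean of probability vectors is again a probability vector
```

Because each member's probabilities sum to one, their mean does as well, so no renormalisation is needed after averaging.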

In [7]:
average_pred_2020

In [8]:
# save average prediction as final prediction
average_pred_2020.to_netcdf('../submissions/ML_prediction_2020.nc')

In [None]:
#!git add ../submissions/ML_prediction_2020.nc
#!git add ML_forecast_template.ipynb

#!git commit -m "commit submission for my_method_name" # whatever message you want
#!git tag "submission-my_method_name-0.0.1" # if this is to be checked by scorer, only the last submitted==tagged version will be considered

#!git push --tags

# RPSS

In [9]:
# the data was stored in a different folder which is not backed up
skill_average_2020 = skill_by_year(average_pred_2020, cache_path='../../../../Data/s2s_ai/data')
print(skill_average_2020)

          RPSS
year          
2020  0.002364
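
To help interpret this value: the ranked probability skill score (RPSS) compares the ranked probability score (RPS) of a forecast with that of a reference forecast. The sketch below computes it for one tercile forecast at one grid point, assuming a climatological reference with equal probabilities of 1/3 per category. It is a simplification; how `skill_by_year` aggregates over grid points, lead times, and variables is not shown in this notebook, and the names `rps` and `rpss` as well as the forecast values are hypothetical.

```python
from itertools import accumulate

def rps(forecast, observed):
    """Ranked probability score: squared distance between cumulative probabilities."""
    cum_forecast = accumulate(forecast)
    cum_observed = accumulate(observed)
    return sum((f - o) ** 2 for f, o in zip(cum_forecast, cum_observed))

def rpss(forecast, observed, reference):
    """Skill relative to a reference: 1 is perfect, 0 matches the reference."""
    return 1 - rps(forecast, observed) / rps(reference, observed)

climatology = [1 / 3, 1 / 3, 1 / 3]   # equal chances for each tercile
observed = [0, 0, 1]                  # the upper tercile occurred
forecast = [0.30, 0.34, 0.36]         # made-up forecast leaning slightly towards the upper tercile

print(rpss(forecast, observed, climatology))
```

Under this definition, a score slightly above zero means the forecast is only marginally better than the reference.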


## RPSS for training period

In [14]:
#read predictions for train years
years = 'allyears'
ds_train_10 = load_pred('10', years)
ds_train_20 = load_pred('20', years)
ds_train_30 = load_pred('30', years)
ds_train_40 = load_pred('40', years)
ds_train_50 = load_pred('50', years)

In [15]:
#average over different predictions (obtained using different seeds)
average_pred_train = xr.concat([ds_train_10, ds_train_20, ds_train_30, ds_train_40, ds_train_50], 'pred').mean('pred')

In [16]:
# the data was stored in a different folder which is not backed up
skill_average_train = skill_by_year(average_pred_train, cache_path='../../../../Data/s2s_ai/data')
print(skill_average_train)

          RPSS
year          
2000  0.008699
2001  0.004106
2002  0.003021
2003  0.003090
2004  0.003638
2005  0.001265
2006  0.002902
2007  0.002793
2008  0.004395
2009  0.000433
2010  0.000719
2011  0.003563
2012  0.002182
2013  0.001284
2014  0.001288
2015 -0.000021
2016 -0.000114
2017 -0.000091
2018  0.001625
2019  0.000519


In [17]:
print(skill_average_train.mean())

RPSS    0.002265
dtype: float64
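
As a quick plausibility check, the mean can be recomputed from the yearly values printed in the table above. The sketch only copies those printed numbers into a dictionary (`rpss_by_year` is a hypothetical name) and does not call `skill_by_year`.

```python
# yearly RPSS values copied from the printed training-period table
rpss_by_year = {
    2000: 0.008699, 2001: 0.004106, 2002: 0.003021, 2003: 0.003090,
    2004: 0.003638, 2005: 0.001265, 2006: 0.002902, 2007: 0.002793,
    2008: 0.004395, 2009: 0.000433, 2010: 0.000719, 2011: 0.003563,
    2012: 0.002182, 2013: 0.001284, 2014: 0.001288, 2015: -0.000021,
    2016: -0.000114, 2017: -0.000091, 2018: 0.001625, 2019: 0.000519,
}

# unweighted mean over the 20 hindcast years
mean_rpss = sum(rpss_by_year.values()) / len(rpss_by_year)

# years in which the prediction scored below the reference
negative_years = [year for year, value in rpss_by_year.items() if value < 0]

print(round(mean_rpss, 6))
print(negative_years)
```

The recomputed mean is of the same order as the 2020 score of 0.002364, so the test-year skill is in line with the training period.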


# Reproducibility

## Memory

In [15]:
# https://phoenixnap.com/kb/linux-commands-check-memory-usage
!free -g

              total        used        free      shared  buff/cache   available
Mem:            236          72          35           1         127         160
Swap:             0           0           0


## CPU

In [16]:
!lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:            2
CPU MHz:             2500.704
CPU max MHz:         3300,0000
CPU min MHz:         1200,0000
BogoMIPS:            5000.30
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            30720K
NUMA node0 CPU(s):   0-11,24-35
NUMA node1 CPU(s):   12-23,36-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl x

## Software

In [1]:
!conda list
# this does not seem to work from within the notebook, so see the copied output below

/bin/sh: 1: conda: not found


# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             4.5                       1_gnu  
_pytorch_select           0.1                       cpu_0  
_tflow_select             2.3.0                       mkl  
abseil-cpp                20200923.3           h2531618_0  
absl-py                   0.13.0           py39h06a4308_0  
aiobotocore               1.3.3              pyhd3eb1b0_0  
aiohttp                   3.7.4            py39h27cfd23_1  
aioitertools              0.7.1              pyhd3eb1b0_0  
appdirs                   1.4.4              pyhd3eb1b0_0  
asciitree                 0.3.3                      py_2  
astor                     0.8.1            py39h06a4308_0  
astunparse                1.6.3                      py_0  
async-timeout             3.0.1            py39h06a4308_0  
attrs                     21.2.0             pyhd3eb1b0_0  
backcall                  0.2.0              pyhd3eb1b0_0  
beautifulsoup4            4.10.0             pyha770c72_0    conda-forge
blas                      1.0                         mkl  
blinker                   1.4              py39h06a4308_0  
bokeh                     2.3.3            py39h06a4308_0  
botocore                  1.20.106           pyhd3eb1b0_0  
bottleneck                1.3.2            py39hdd57654_1  
branca                    0.3.1                    pypi_0    pypi
brotli                    1.0.9                he6710b0_2  
brotlipy                  0.7.0           py39h27cfd23_1003  
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.17.1               h27cfd23_0  
ca-certificates           2021.5.30            ha878542_0    conda-forge
cached-property           1.5.2                      py_0  
cachetools                4.2.2              pyhd3eb1b0_0  
cdsapi                    0.5.1                    pypi_0    pypi
certifi                   2021.5.30        py39hf3d152e_0    conda-forge
cffi                      1.14.6           py39h400218f_0  
cfgrib                    0.9.9.0            pyhd8ed1ab_1    conda-forge
cftime                    1.5.0            py39h6323ea4_0  
chardet                   3.0.4           py39h06a4308_1003  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
click                     8.0.1              pyhd3eb1b0_0  
climetlab                 0.8.18                   pypi_0    pypi
climetlab-s2s-ai-challenge 0.8.0                    pypi_0    pypi
cloudpickle               1.6.0              pyhd3eb1b0_0  
configargparse            1.5.2                    pypi_0    pypi
coverage                  5.5              py39h27cfd23_2  
cryptography              3.4.7            py39hd23ed53_0  
curl                      7.78.0               h1ccaba5_0  
cycler                    0.10.0           py39h06a4308_0  
cython                    0.29.24          py39h295c915_0  
cytoolz                   0.11.0           py39h27cfd23_0  
dask                      2021.8.1           pyhd3eb1b0_0  
dask-core                 2021.8.1           pyhd3eb1b0_0  
debugpy                   1.4.1            py39h295c915_0  
decorator                 5.0.9              pyhd3eb1b0_0  
distributed               2021.8.1         py39h06a4308_0  
docopt                    0.6.2                      py_1    conda-forge
eccodes                   2.19.1               hea64003_0    conda-forge
ecmwf-api-client          1.6.1                    pypi_0    pypi
ecmwflibs                 0.3.14                   pypi_0    pypi
entrypoints               0.3              py39h06a4308_0  
fasteners                 0.16.3             pyhd3eb1b0_0  
findlibs                  0.0.2                    pypi_0    pypi
flatbuffers               2.0.0                h2531618_0  
folium                    0.12.1                   pypi_0    pypi
fonttools                 4.25.0             pyhd3eb1b0_0  
freetype                  2.10.4               h5ab3b9f_0  
fsspec                    2021.7.0           pyhd3eb1b0_0  
gast                      0.4.0              pyhd3eb1b0_0  
giflib                    5.2.1                h7b6447c_0  
google-auth               1.33.0             pyhd3eb1b0_0  
google-auth-oauthlib      0.4.1                      py_2  
google-pasta              0.2.0              pyhd3eb1b0_0  
grpcio                    1.36.1           py39h2157cd5_1  
h5netcdf                  0.11.0             pyhd8ed1ab_0    conda-forge
h5py                      2.10.0           py39hec9cf62_0  
hdf4                      4.2.13               h3ca952b_2  
hdf5                      1.10.6          nompi_h7c3c948_1111    conda-forge
heapdict                  1.0.1              pyhd3eb1b0_0  
icu                       68.1                 h2531618_0  
idna                      3.2                pyhd3eb1b0_0  
importlib-metadata        3.10.0           py39h06a4308_0  
intake                    0.6.3              pyhd3eb1b0_0  
intake-xarray             0.5.0              pyhd3eb1b0_0  
intel-openmp              2019.4                      243  
ipykernel                 6.2.0            py39h06a4308_1  
ipython                   7.26.0           py39hb070fc8_0  
ipython_genutils          0.2.0              pyhd3eb1b0_1  
jasper                    1.900.1           h07fcdf6_1006    conda-forge
jedi                      0.18.0           py39h06a4308_1  
jellyfish                 0.8.8                    pypi_0    pypi
jinja2                    3.0.1              pyhd3eb1b0_0  
jmespath                  0.10.0             pyhd3eb1b0_0  
joblib                    1.0.1              pyhd8ed1ab_0    conda-forge
jpeg                      9d                   h36c2ea0_0    conda-forge
jupyter_client            7.0.1              pyhd3eb1b0_0  
jupyter_core              4.7.1            py39h06a4308_0  
keras-preprocessing       1.1.2              pyhd3eb1b0_0  
kiwisolver                1.3.1            py39h2531618_0  
krb5                      1.19.2               hac12032_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.35.1               h7274673_9  
libaec                    1.0.5                h9c3ff4c_0    conda-forge
libcurl                   7.78.0               h0b77cf5_0  
libedit                   3.1.20210714         h7f8727e_0  
libev                     4.33                 h7b6447c_0  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.3.0               h5101ec6_17  
libgfortran-ng            7.5.0               ha8ba4b0_17  
libgfortran4              7.5.0               ha8ba4b0_17  
libgomp                   9.3.0               h5101ec6_17  
libllvm10                 10.0.1               he513fc3_3    conda-forge
libmklml                  2019.0.5                      0  
libnetcdf                 4.7.4           nompi_h56d31a8_107    conda-forge
libnghttp2                1.41.0               hf8bcb03_2  
libpng                    1.6.37               hbc83047_0  
libprotobuf               3.14.0               h8c45485_0  
libsodium                 1.0.18               h7b6447c_0  
libssh2                   1.9.0                h1ba5d50_1  
libstdcxx-ng              9.3.0               hd4cf53a_17  
libtiff                   4.2.0                h85742a9_0  
libwebp                   1.2.0                h89dd481_0  
libwebp-base              1.2.0                h27cfd23_0  
llvmlite                  0.36.0           py39h612dafd_4  
locket                    0.2.1            py39h06a4308_1  
lz4-c                     1.9.3                h295c915_1  
magics                    1.5.6                    pypi_0    pypi
markdown                  3.3.4            py39h06a4308_0  
markupsafe                2.0.1            py39h27cfd23_0  
matplotlib-base           3.4.2            py39hab158f2_0  
matplotlib-inline         0.1.2              pyhd3eb1b0_2  
mkl                       2020.2                      256  
mkl-service               2.3.0            py39he8ac12f_0  
mkl_fft                   1.3.0            py39h54f3939_0  
mkl_random                1.0.2            py39h63df603_0  
msgpack-python            1.0.2            py39hff7bd54_1  
multidict                 5.1.0            py39h27cfd23_2  
munkres                   1.1.4                      py_0  
nc-time-axis              1.3.1              pyhd8ed1ab_2    conda-forge
ncurses                   6.2                  he6710b0_1  
nest-asyncio              1.5.1              pyhd3eb1b0_0  
netcdf4                   1.5.4           nompi_py39hb3be4b9_103    conda-forge
ninja                     1.10.2               hff7bd54_1  
numba                     0.53.1           py39ha9443f7_0  
numcodecs                 0.8.0            py39h2531618_0  
numexpr                   2.7.3            py39hb2eb853_0  
numpy                     1.19.2           py39h89c1606_0  
numpy-base                1.19.2           py39h2ae0177_0  
oauthlib                  3.1.1              pyhd3eb1b0_0  
olefile                   0.46               pyhd3eb1b0_0  
openssl                   1.1.1k               h7f98852_0    conda-forge
opt_einsum                3.3.0              pyhd3eb1b0_1  
packaging                 21.0               pyhd3eb1b0_0  
pandas                    1.3.2            py39h8c16a72_0  
parso                     0.8.2              pyhd3eb1b0_0  
partd                     1.2.0              pyhd3eb1b0_0  
pdbufr                    0.9.0                    pypi_0    pypi
pexpect                   4.8.0              pyhd3eb1b0_3  
pickleshare               0.7.5           pyhd3eb1b0_1003  
pillow                    8.3.1            py39h5aabda8_0  
pip                       21.2.4           py37h06a4308_0  
prompt-toolkit            3.0.17             pyhca03da5_0  
properscoring             0.1                        py_0    conda-forge
protobuf                  3.14.0           py39h2531618_1  
psutil                    5.8.0            py39h27cfd23_1  
ptyprocess                0.7.0              pyhd3eb1b0_2  
pyasn1                    0.4.8              pyhd3eb1b0_0  
pyasn1-modules            0.2.8                      py_0  
pycparser                 2.20                       py_2  
pydap                     3.2.2           pyh9f0ad1d_1001    conda-forge
pygments                  2.10.0             pyhd3eb1b0_0  
pyjwt                     2.1.0            py39h06a4308_0  
pyodc                     1.0.3                    pypi_0    pypi
pyopenssl                 20.0.1             pyhd3eb1b0_1  
pyparsing                 2.4.7              pyhd3eb1b0_0  
pysocks                   1.7.1            py39h06a4308_0  
python                    3.9.6                h12debd9_1  
python-dateutil           2.8.2              pyhd3eb1b0_0  
python-eccodes            2020.10.0        py39h1dff97c_0    conda-forge
python-flatbuffers        1.12               pyhd3eb1b0_0  
python-snappy             0.6.0            py39h2531618_3  
python_abi                3.9                      2_cp39    conda-forge
pytorch                   1.8.1           cpu_py39h60491be_0  
pytz                      2021.1             pyhd3eb1b0_0  
pyyaml                    5.4.1            py39h27cfd23_1  
pyzmq                     22.2.1           py39h295c915_1  
readline                  8.1                  h27cfd23_0  
requests                  2.26.0             pyhd3eb1b0_0  
requests-oauthlib         1.3.0                      py_0  
rsa                       4.7.2              pyhd3eb1b0_1  
s3fs                      2021.7.0           pyhd3eb1b0_0  
scikit-learn              0.24.2           py39ha9443f7_0  
scipy                     1.6.2            py39h91f5cce_0  
setuptools                52.0.0           py39h06a4308_0  
six                       1.16.0             pyhd3eb1b0_0  
snappy                    1.1.8                he6710b0_0  
sortedcontainers          2.4.0              pyhd3eb1b0_0  
soupsieve                 2.0.1                      py_1    conda-forge
sqlite                    3.36.0               hc218d9a_0  
tbb                       2020.2               h4bd325d_4    conda-forge
tblib                     1.7.0              pyhd3eb1b0_0  
tensorboard               2.5.0                      py_0  
tensorboard-plugin-wit    1.6.0                      py_0  
tensorflow                2.4.1           mkl_py39h4683426_0  
tensorflow-base           2.4.1           mkl_py39h43e0292_0  
tensorflow-estimator      2.5.0              pyh7b7c402_0  
termcolor                 1.1.0            py39h06a4308_1  
threadpoolctl             2.2.0              pyh8a188c0_0    conda-forge
tk                        8.6.10               hbc83047_0  
toolz                     0.11.1             pyhd3eb1b0_0  
tornado                   6.1              py39h27cfd23_0  
tqdm                      4.62.2                   pypi_0    pypi
traitlets                 5.0.5              pyhd3eb1b0_0  
typing-extensions         3.10.0.0             hd3eb1b0_0  
typing_extensions         3.10.0.0           pyh06a4308_0  
tzdata                    2021a                h5d7bf9c_0  
urllib3                   1.26.6             pyhd3eb1b0_1  
wcwidth                   0.2.5              pyhd3eb1b0_0  
webob                     1.8.7              pyhd8ed1ab_0    conda-forge
werkzeug                  1.0.1              pyhd3eb1b0_0  
wheel                     0.35.1             pyhd3eb1b0_0  
wrapt                     1.12.1           py39he8ac12f_1  
xarray                    0.19.0             pyhd3eb1b0_1  
xhistogram                0.3.0              pyhd8ed1ab_0    conda-forge
xskillscore               0.0.23             pyhd8ed1ab_0    conda-forge
xz                        5.2.5                h7b6447c_0  
yaml                      0.2.5                h7b6447c_0  
yarl                      1.6.3            py39h27cfd23_0  
zarr                      2.8.1              pyhd3eb1b0_0  
zeromq                    4.3.4                h2531618_0  
zict                      2.0.0              pyhd3eb1b0_0  
zipp                      3.5.0              pyhd3eb1b0_0  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.4.9                haebb681_0  