# Machine Learning for Land Models (ML4Land)

This notebook builds on the methods described in `1. Toy_Example.ipynb`.

We use the same data sources, but rather than a single time snapshot we now use a longer time range, to reduce the impact of temporal/spatial correlations.

We will take the first three months of 2017 as the training set, and the two subsequent months as the validation and test sets respectively.

For now this is just to ensure the machinery works. We will then look at extending this to longer time ranges.
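
The split described above can be sketched as follows. This is a minimal illustration on a small synthetic DataFrame rather than the notebook's actual data; the helper name `split_by_time` and the toy grid are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def split_by_time(df, train_months, val_month, test_month):
    """Split a df with a 'time' index level into train/validation/test sets by month."""
    t = df.index.get_level_values("time")
    train = df[t.isin(pd.to_datetime(train_months))]
    val = df[t == pd.Timestamp(val_month)]
    test = df[t == pd.Timestamp(test_month)]
    return train, val, test


# Toy stand-in for the real data: 2 latitudes x 3 longitudes x 5 monthly timestamps
index = pd.MultiIndex.from_product(
    [[10.0, 10.1], [0.0, 0.1, 0.2], pd.date_range("2017-01-01", periods=5, freq="MS")],
    names=["latitude", "longitude", "time"],
)
df_toy = pd.DataFrame({"t2m": np.arange(len(index), dtype=float)}, index=index)

train, val, test = split_by_time(
    df_toy,
    train_months=["2017-01-01", "2017-02-01", "2017-03-01"],
    val_month="2017-04-01",
    test_month="2017-05-01",
)
print(len(train), len(val), len(test))
```

Splitting on the `time` level keeps every grid point of a given month in the same set, which a random row-wise split would not guarantee.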

Note that when running remotely over SSH, a simple snippet of the form

```python
fig = plt.figure(figsize=(24,12))
ax = fig.add_subplot(111,projection=ccrs.PlateCarree(central_longitude=0))
ax.coastlines()
plt.show()
```

fails for unknown reasons (it seems to kill the kernel). Example geo plots can be found in `1. Toy_Example.ipynb`. Here we will instead just deal with numbers and dfs.

---


## 1. Getting the X data <a name="features"></a>

As before, we will get our features via the [Copernicus Climate Data Store](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land-monthly-means?tab=overview).

In [4]:
import climetlab as cml
import xarray as xr



data_root = '/network/group/aopp/predict/TIP016_PAXTON_RPSPEEDY/ML4L/' #for IO

variables = ['2m_temperature','10m_u_component_of_wind', '10m_v_component_of_wind'] #Variables we are interested in

#Time selection
years = ['2017']
months = ['01','02','03','04','05']
times= ["00:00"]


#Load the data

load_from_remote = True
if load_from_remote:
    xdata = cml.load_source("cds",
                            "reanalysis-era5-land-monthly-means",
                             variable=variables,
                             product_type= "monthly_averaged_reanalysis",
                             year = years,
                             month = months,
                             time = times
                             )
    cds_xarray = xdata.to_xarray(backend_kwargs={'errors': 'ignore','filter_by_keys':{'edition': 1, 'typeOfLevel':'surface'}})
    cds_xarray.to_netcdf(data_root+"xdata.nc")
else:
    cds_xarray = xr.open_dataset(data_root+"xdata.nc")

In [11]:
display(cds_xarray)

Output (summary of the array representations shown by `display`; duplicated entries removed):

| Array | Shape | Chunk shape | Bytes | Count | Type |
|---|---|---|---|---|---|
| Datetime array | (5,) | (5,) | 40 B | 2 tasks, 1 chunk | datetime64[ns], numpy.ndarray |
| 3D data arrays | (5, 1801, 3600) | (5, 1801, 3600) | 123.66 MiB | 2 tasks, 1 chunk | float32, numpy.ndarray |


Unlike before, the `time` dimension now has several entries, which need to be selected explicitly.

Recall that we can select from the xarray dataset as:
```python
cds_xarray.loc[dict(time=slice("2017-01-01", "2017-03-01"))]
```

or from a df in the usual way as:
```python
df_cds = cds_xarray.to_dataframe()
df_cds.loc["2017-01-01"]
```

Before moving forward, let's make the data into a nice pandas object:

In [5]:
df_cds = cds_xarray.to_dataframe()

In [20]:
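# Select by explicit time values...
new = df_cds.query("time=='2017-01-01' | time=='2017-02-01'")
# ...or by a time range; this second assignment overwrites the first and keeps January to March
new = df_cds.query("'2017-01-01' <= time <= '2017-03-01'")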

In [21]:
import numpy as np
np.unique(new.index.get_level_values('time'))

array(['2017-01-01T00:00:00.000000000', '2017-02-01T00:00:00.000000000',
       '2017-03-01T00:00:00.000000000'], dtype='datetime64[ns]')

## 2. Getting the Y data <a name="ydata"></a>

Again, we will get the CMG monthly product from MODIS:

* [MODIS](https://modis-land.gsfc.nasa.gov/temp.html).

* [MOD11C3](https://lpdaac.usgs.gov/products/mod11c3v006/)

We need to make multiple HTTPS requests, one for each month.

To do this we will first create a text file that can then be read by `wget`:

In [17]:
base = 'https://e4ftl01.cr.usgs.gov/MOLT/MOD11C3.006/'

years = ['2017']
months = ['01','02','03','04','05']

# Write one directory URL per month, e.g. <base>2017.01.01/
with open('get_MODIS_data.txt', 'w') as f:
    for y in years:
        for m in months:
            f.write(f'{base}{y}.{m}.01/\n')

And then pass this to `wget`: 

In [19]:
!wget -r -l1 --no-parent -A "*.hdf" -i get_MODIS_data.txt -P "/network/group/aopp/predict/TIP016_PAXTON_RPSPEEDY/ML4L/"


--2021-10-29 13:29:47--  https://e4ftl01.cr.usgs.gov/MOLT/MOD11C3.006/2017.01.01/
Resolving e4ftl01.cr.usgs.gov (e4ftl01.cr.usgs.gov)... 152.61.133.130, 2001:49c8:4000:127d::133:130
Connecting to e4ftl01.cr.usgs.gov (e4ftl01.cr.usgs.gov)|152.61.133.130|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘/network/group/aopp/predict/TIP016_PAXTON_RPSPEEDY/ML4L/e4ftl01.cr.usgs.gov/MOLT/MOD11C3.006/2017.01.01/index.html.tmp’

e4ftl01.cr.usgs.gov     [ <=>                ]   2.99K  --.-KB/s    in 0.001s  

2021-10-29 13:29:57 (5.42 MB/s) - ‘/network/group/aopp/predict/TIP016_PAXTON_RPSPEEDY/ML4L/e4ftl01.cr.usgs.gov/MOLT/MOD11C3.006/2017.01.01/index.html.tmp’ saved [3060]

Loading robots.txt; please ignore errors.
--2021-10-29 13:29:57--  https://e4ftl01.cr.usgs.gov/robots.txt
Reusing existing connection to e4ftl01.cr.usgs.gov:443.
HTTP request sent, awaiting response... 302 Found
Location: https://urs.earthdata.nasa.gov/oauth/authori

Each file is saved within a nest of directories. `wget` options such as `-nd` (`--no-directories`) may avoid this, although this has not been tested here.

In any case it is no real problem; we can easily access the relevant files through `glob`.

Let's bring all this data together into a single df:

In [21]:
import xarray as xr
import rioxarray as rxr
import glob
import pandas as pd

# One .hdf file per month, each inside its own dated directory
all_files = sorted(glob.glob('/network/group/aopp/predict/TIP016_PAXTON_RPSPEEDY/ML4L/e4ftl01.cr.usgs.gov/MOLT/MOD11C3.006/**/*.hdf', recursive=True))

dfs = []
for f in all_files:
    modis_xarray = rxr.open_rasterio(f, masked=True)
    datestamp = modis_xarray.attrs['RANGEBEGINNINGDATE'] #start date of the monthly product

    modis_df = modis_xarray.to_dataframe() #everything as a df
    modis_df['time'] = datestamp
    modis_df = modis_df[['LST_Day_CMG', 'time']] #keep only the target and its date

    dfs.append(modis_df)

df_modis = pd.concat(dfs)



RasterioIOError: '/network/group/aopp/predict/TIP016_PAXTON_RPSPEEDY/ML4L/e4ftl01.cr.usgs.gov/MOLT/MOD11C3.006/2017.01.01/MOD11C3.A2017001.006.2017032204847.hdf' not recognized as a supported file format.
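
One way to narrow down this error is to check whether the downloaded file is really an HDF4 file at all. The `wget` log above shows a redirect to the Earthdata login page, so it is possible (an assumption, not confirmed here) that what was saved is an HTML page rather than data. The sketch below inspects the first bytes of a file; the helper name `sniff_file` is hypothetical, and the only outside fact it relies on is that HDF4 files begin with the bytes `0E 03 13 01`.

```python
HDF4_MAGIC = b"\x0e\x03\x13\x01"  # the four bytes at the start of an HDF4 file


def sniff_file(path):
    """Return 'hdf4', 'html' or 'unknown' based on the first bytes of a file."""
    with open(path, "rb") as fh:
        head = fh.read(64)
    if head.startswith(HDF4_MAGIC):
        return "hdf4"
    # A saved login or error page would start with an HTML doctype or tag
    if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
        return "html"
    return "unknown"
```

For example, `sniff_file(all_files[0])` returning `'html'` would point to an authentication problem with the download, whereas `'hdf4'` would suggest looking instead at whether the installed GDAL/rasterio build can read HDF4.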

In [2]:
import rioxarray as rxr
rxr.open_rasterio('true_y_data.hdf')

RasterioIOError: 'true_y_data.hdf' not recognized as a supported file format.

In [26]:
! wget https://e4ftl01.cr.usgs.gov/MOLT/MOD11C3.006/2016.01.01/MOD11C3.A2016001.006.2016234032549.hdf -O ../data/ydata2.hdf

--2021-10-29 14:17:32--  https://e4ftl01.cr.usgs.gov/MOLT/MOD11C3.006/2016.01.01/MOD11C3.A2016001.006.2016234032549.hdf
Resolving e4ftl01.cr.usgs.gov (e4ftl01.cr.usgs.gov)... 152.61.133.130, 2001:49c8:4000:127d::133:130
Connecting to e4ftl01.cr.usgs.gov (e4ftl01.cr.usgs.gov)|152.61.133.130|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://urs.earthdata.nasa.gov/oauth/authorize?scope=uid&app_type=401&client_id=ijpRZvb9qeKCK5ctsn75Tg&response_type=code&redirect_uri=https%3A%2F%2Fe4ftl01.cr.usgs.gov%2Foauth&state=aHR0cHM6Ly9lNGZ0bDAxLmNyLnVzZ3MuZ292L01PTFQvTU9EMTFDMy4wMDYvMjAxNi4wMS4wMS9NT0QxMUMzLkEyMDE2MDAxLjAwNi4yMDE2MjM0MDMyNTQ5LmhkZg [following]
--2021-10-29 14:17:39--  https://urs.earthdata.nasa.gov/oauth/authorize?scope=uid&app_type=401&client_id=ijpRZvb9qeKCK5ctsn75Tg&response_type=code&redirect_uri=https%3A%2F%2Fe4ftl01.cr.usgs.gov%2Foauth&state=aHR0cHM6Ly9lNGZ0bDAxLmNyLnVzZ3MuZ292L01PTFQvTU9EMTFDMy4wMDYvMjAxNi4wMS4wMS9NT0QxMUMzLkEyMDE2MDAxLjAw

In [34]:
f = 'true_y_data.hdf'
rxr.open_rasterio(f,masked=True)

RasterioIOError: 'true_y_data.hdf' not recognized as a supported file format.

In [31]:
!conda list

# packages in environment at /home/kimpson/anaconda3:
#
# Name                    Version                   Build  Channel
_ipyw_jlab_nb_ext_conf    0.1.0                    py38_0  
_libgcc_mutex             0.1                        main  
affine                    2.3.0                    pypi_0    pypi
alabaster                 0.7.12             pyhd3eb1b0_0  
anaconda                  2021.05                  py38_0  
anaconda-client           1.7.2                    py38_0  
anaconda-navigator        2.0.3                    py38_0  
anaconda-project          0.9.1              pyhd3eb1b0_1  
anyio                     2.2.0            py38h06a4308_1  
appdirs                   1.4.4                      py_0  
argh                      0.26.2                   py38_0  
argon2-cffi               20.1.0           py38h27cfd23_1  
asciitree                 0.3.3                    pypi_0    pypi
asn1crypto                1.4.0                      py_0  
astroid                  

In [32]:
!conda update scikit-learn

Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: | ^C
failed

CondaError: KeyboardInterrupt



## 3. Clean and combine the data <a name="clean"></a>

Now let's clean up both datasets and join them together.

In [None]:
#Make everything a pandas df
df_x_orig = cds_xarray.to_dataframe()
df_y_orig = df_modis

#Create some copies that we will modify, leaving df_x_orig, df_y_orig unaltered in case we need to refer back
df_x = df_x_orig.copy()
df_y = df_y_orig.copy()

In [None]:
import numpy as np
import pandas as pd

#Reindex df_x: round the coordinates and map longitude from [0, 360) to [-180, 180)
df_x['latitude'] = np.round(df_x.index.get_level_values('latitude').values,3)
df_x['longitude'] = np.round((df_x.index.get_level_values('longitude').values +180) %360 - 180,3)
df_x['time'] = df_x.index.get_level_values('time').values
df_x = df_x.set_index(['latitude', 'longitude','time'], drop=True)

selected_x_columns = ['t2m','u10','v10'] #only use these columns, drop the others
df_x = df_x[selected_x_columns]

#Reindex df_y via a linear shift
#---ATTENTION---: we apply a linear shift of 0.025 so that the coordinates match between the X and Y data.
# The proper way to deal with this still needs to be clarified. Perhaps some interpolation method?
# Rounding is applied after the shift so that floating-point error cannot break the join.
df_y['latitude'] = np.round(df_y.index.get_level_values('y').values - 0.025,3)
df_y['longitude'] = np.round(df_y.index.get_level_values('x').values - 0.025,3)
df_y['time'] = pd.to_datetime(df_y['time']) #match the datetime type of the df_x time level

df_y = df_y.set_index(['latitude', 'longitude','time'], drop=True)

selected_y_columns = ['LST_Day_CMG'] #only use these columns, drop the others
df_y = df_y[selected_y_columns]
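
The comment in the cell above asks whether an interpolation would be more appropriate than a linear shift. One possible alternative is sketched here with NumPy only. It assumes that the Y data sit on 0.05 degree cell centres and the X data on multiples of 0.1 degrees, so that each X grid point is the shared corner of four Y cells; averaging those four cells is then a bilinear interpolation to that point. The function name and the toy grid are hypothetical, and masked (NaN) cells would need extra handling.

```python
import numpy as np


def centres_to_tenth_degree(field, lat_c, lon_c):
    """Average 2x2 neighbouring cells onto the shared corners lying on 0.1 degree multiples.

    field        : 2D array of shape (len(lat_c), len(lon_c)), values at cell centres
    lat_c, lon_c : 1D arrays of evenly spaced cell-centre coordinates
    """
    # Value at each interior corner = mean of the four cells that touch it
    corners = 0.25 * (field[:-1, :-1] + field[1:, :-1] + field[:-1, 1:] + field[1:, 1:])
    lat_k = 0.5 * (lat_c[:-1] + lat_c[1:])  # corner latitudes
    lon_k = 0.5 * (lon_c[:-1] + lon_c[1:])  # corner longitudes

    def on_tenth(c):
        # True where a coordinate is (numerically) a multiple of 0.1
        return np.isclose(c * 10, np.round(c * 10))

    keep_lat, keep_lon = on_tenth(lat_k), on_tenth(lon_k)
    values = corners[np.ix_(keep_lat, keep_lon)]
    return values, np.round(lat_k[keep_lat], 3), np.round(lon_k[keep_lon], 3)


# Toy 0.05 degree grid of cell centres and a linear test field
lat_c = 0.025 + 0.05 * np.arange(8)
lon_c = 0.025 + 0.05 * np.arange(6)
LAT, LON = np.meshgrid(lat_c, lon_c, indexing="ij")
field = 2.0 * LAT + 3.0 * LON

values, lat_t, lon_t = centres_to_tenth_degree(field, lat_c, lon_c)
print(lat_t, lon_t)
```

Unlike the shift, this uses all four neighbouring Y cells for each X grid point instead of relabelling a single cell.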





In [None]:
display(df_y)
display(df_x)

In [None]:
df_ML = df_y.merge(df_x, how = 'inner', left_index=True, right_index=True) #Merge
df_ML_clean = df_ML.dropna() #Get rid of nulls
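
The inner merge only keeps rows whose index values are exactly equal in both frames, which is why the coordinates are rounded beforehand. A minimal sketch with toy values (assumed purely for illustration) shows how an unrounded float key silently drops a row:

```python
import numpy as np
import pandas as pd

# Two one-row frames whose latitude should agree but differs by floating-point error
df_a = pd.DataFrame({"LST_Day_CMG": [290.0]}, index=pd.Index([0.1 + 0.2], name="latitude"))
df_b = pd.DataFrame({"t2m": [288.0]}, index=pd.Index([0.3], name="latitude"))

# Merging on the raw keys finds no match
raw = df_a.merge(df_b, how="inner", left_index=True, right_index=True)

# Round both keys to 3 decimals, as is done for df_x and df_y, then merge again
df_a.index = pd.Index(np.round(df_a.index.values, 3), name="latitude")
df_b.index = pd.Index(np.round(df_b.index.values, 3), name="latitude")
rounded = df_a.merge(df_b, how="inner", left_index=True, right_index=True)

print(len(raw), len(rounded))
```

Comparing `len(df_ML)` with `len(df_y)` and `len(df_x)` is a quick way to see how many rows survived the join.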

In [None]:
df_x_orig

There is a curious effect whereby apparent artifacts are introduced in the plots when dropping nulls. Consider the `'LST_Day_CMG'` map for `df_ML`, and then the same map once the nulls are dropped:

In [None]:
plotit(df_ML,'LST_Day_CMG')
plotit(df_ML_clean,'LST_Day_CMG')

There seems to be an artifact, visible as a small vertical line off the coast of South America. **We need to establish exactly what is causing this.** As far as I can tell it is a _plotting_ issue rather than an issue with the data processing itself, but this needs to be confirmed.

Going forward we will adopt `df_ML_clean` for our analysis.

In [None]:
display(df_ML_clean)

## 4. Do some simple ML <a name="ML"></a>

We now have all the data in a single df, `df_ML_clean`.

Let's do some ML with this data.

First, it is actually useful to split the df we just created (!) into two reduced x/y dfs:

In [None]:
x = df_ML_clean.drop(columns=['LST_Day_CMG']) #all the other (feature) columns
y = df_ML_clean[['LST_Day_CMG']] #just the 'y' column

display(x)
display(y)

Now let's create a training and test set:

In [None]:
from sklearn.model_selection import train_test_split

#Create train/test data
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.20)

#Create two dfs holding the test/train data
df_train = x_train.join(y_train, how = 'inner') #join on the full (latitude, longitude, time) index
df_test = x_test.join(y_test, how = 'inner')

#Plot the train and test data
plotit(df_train,'LST_Day_CMG')
plotit(df_test,'LST_Day_CMG')

**Note again the vertical artifacts near the tip of S. America...**


...and now train using a random forest (RF):

In [None]:
from sklearn.ensemble import RandomForestRegressor


#Make everything numpy arrays
x_train = x_train.to_numpy()
y_train = y_train.to_numpy().ravel() #flatten

x_test = x_test.to_numpy()
y_test = y_test.to_numpy().ravel()


# Initiate model 
rf = RandomForestRegressor(n_estimators = 10, verbose=1)

# Train the model on training data
rf.fit(x_train, y_train)

With a trained model we can then make some predictions, and evaluate these predictions against the true (test) data:

In [None]:
from sklearn.metrics import r2_score

y_pred = rf.predict(x_test)
training_score = rf.score(x_train, y_train)
testing_score = rf.score(x_test, y_test) # =r2_score(y_test, y_pred)
relative_error = (y_pred - y_test)/y_test

print ('Train/Test score:', training_score, testing_score)
print ('Max/min relative error:', max(abs(relative_error)), min(abs(relative_error)))

Not too bad! But then we did use the 2m surface temperature...
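
If the target is a temperature stored in kelvin, the relative error is small almost by construction, so absolute measures can complement it. The following is a minimal NumPy sketch of the bias, mean absolute error and root-mean-square error in the units of the target; the function name `error_summary` and the toy numbers are assumptions for illustration, not results from this notebook.

```python
import numpy as np


def error_summary(y_true, y_pred):
    """Return bias, mean absolute error and RMSE of predictions, in the target's units."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return {
        "bias": err.mean(),
        "mae": np.abs(err).mean(),
        "rmse": np.sqrt((err ** 2).mean()),
    }


# Toy values standing in for y_test and y_pred
summary = error_summary([300.0, 290.0, 310.0], [301.0, 288.0, 310.0])
print(summary)
```

With the arrays from the cell above, the call would be `error_summary(y_test, y_pred)`.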

We can also visualise how this error is spread geographically: 

In [None]:
#Add the error and predictions to the test df
df_test['relative_error'] = relative_error
df_test['y_pred'] = y_pred

#--Plot it up

#Actual
plotit(df_test,'LST_Day_CMG')

#Predicted
plotit(df_test,'y_pred')


#Error
plotit(df_test,'relative_error')

