This notebook shows how to deploy a pre-trained tree canopy model on new data and how to train a new model. 

In [1]:

from pathlib import Path
root = Path.cwd()

from src.model import SiteData, CanopyModeler, CanopyPredictions, NoGtCanopyPredictions

### Predict tree canopy using a pre-trained model

Create a `SiteData` object for the site whose tree canopy you want to predict. Pass in the `site_name`, which is the name of the folder where all of the site's data is stored, and the EPSG code of the site's coordinate reference system (here 26918, NAD83 / UTM zone 18N). Set `tc=False` because there is no ground truth tree canopy associated with this site.

In [None]:
nyc23 = SiteData(root, 'nyc23', epsg=26918, tc=False)

Create a `CanopyModeler` object with the following arguments:  
* `train_site`: set to `None`, as we'll be using a pre-trained model.  
* `test_site_list`: the sites to predict tree canopy for. Several sites can be predicted at once. Here we are predicting only one site, but it must still be passed as a single-element list.  

In [None]:
modeler = CanopyModeler(root=root, train_site=None, test_site_list=[nyc23])

Now we can call `.run_tree_canopy_model()` to deploy a pre-trained model on our NYC 2023 data, using the following arguments:
* `local_results`: set to `False`, because we are not predicting tree canopy for the data the model was trained on.
* `transfer_results`: set to `True`, because we do want to predict tree canopy for our test site(s).
* `model_filename`: the name of the model you want to use, which is saved as a .pkl file in the 'models' folder.
* `load_saved`: must be set to `True` in order to use the pre-trained model.

In [None]:
nyc23_preds = modeler.run_tree_canopy_model(local_results=False, transfer_results=True, model_filename='2017_model', load_saved=True)

The output of `run_tree_canopy_model()` is a dictionary containing the keys 'local' and 'transfer'. In this case, the 'local' entry is empty because we did not predict local results.

Each result is either a `CanopyPredictions` object (when ground truth tree canopy is available, allowing us to calculate RMSE and other scores) or a `NoGtCanopyPredictions` object (when no ground truth tree canopy data is available).

To access the results for the site we just predicted:

In [None]:
results = nyc23_preds['transfer']['nyc23']
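
The layout of the returned dictionary can be sketched with plain Python. This is only an illustration: the string value is a placeholder for a real prediction object, `transfer_sites` is a hypothetical helper, and representing the empty 'local' entry as `None` is an assumption.

```python
# Stand-in for the dictionary returned by run_tree_canopy_model().
# The string value is a placeholder for a prediction object; 'transfer'
# maps each test site's name to its result, as in the cell above.
preds = {
    "local": None,  # assumption: how an empty 'local' entry is represented
    "transfer": {"nyc23": "<results for nyc23>"},
}


def transfer_sites(preds):
    """Return the names of all sites that have transfer results, sorted."""
    return sorted(preds.get("transfer") or {})


# Loop over every predicted site instead of hard-coding one name.
for site in transfer_sites(preds):
    result = preds["transfer"][site]
    print(site, "->", result)
```

Looping over the site names this way is convenient when `test_site_list` contains more than one site.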

Save the result object to an .npz file.  
It will be saved to `output/{site_name}/results_for_{site_name}_from_{train_site_name}_model.npz`.

In [None]:
results.to_npz()
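
A saved .npz file can be inspected with NumPy alone. The sketch below is an assumption-laden illustration: the array names that `to_npz()` writes are not documented here, so the demo creates a throwaway file with a hypothetical array called `predicted`, and `list_npz_arrays` is a helper introduced only for this example.

```python
import tempfile
from pathlib import Path

import numpy as np


def list_npz_arrays(path):
    """Return {array_name: shape} for every array stored in an .npz file."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


# Demo on a temporary file; 'predicted' is a hypothetical array name,
# not necessarily one that to_npz() actually writes.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_results.npz"
    np.savez(demo_path, predicted=np.zeros((2, 3)))
    shapes = list_npz_arrays(demo_path)
    print(shapes)
```

Pointing `list_npz_arrays` at the file written by `to_npz()` should show which arrays it contains before you load them for analysis.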

### Train a new model and predict tree canopy for the same year and different years

This time, we will use data from 2017 to train a new model and then test it by predicting tree canopy for 2017 and 2021.  

Create `SiteData` objects for both years of data. In this case, both have ground truth tree canopy available for testing, so `tc` is set to `True`.

In [2]:
nyc17 = SiteData(root, 'nyc17', 26918, tc=True)
nyc21 = SiteData(root, 'nyc21', 26918, tc=True)

Create a `CanopyModeler` object with `train_site` and `test_site_list` set to the respective sites. Here, we are training the model on data from 2017 and testing on 2021.

In [3]:
modeler = CanopyModeler(root=root, train_site=nyc17, test_site_list=[nyc21])

Now call `run_tree_canopy_model()` with the following arguments:
* `local_results`: set to `True` because we want to predict tree canopy for the same year the model was trained on.
* `transfer_results`: also set to `True` because we want to see how the model does on new data.
* `model_filename`: the name under which the newly trained model will be saved in the models folder. We'll just add the suffix 'v3'.
* `load_saved`: set to `False`, because we are not using a pre-trained model.

In [None]:
nyc17model = modeler.run_tree_canopy_model(local_results=True, transfer_results=True, model_filename='2017_model_v3', load_saved=False)

['april_lst', 'april_tdvi', 'april_ndwi', 'april_brightness', 'april_wetness', 'april_carotenoid_index_1', 'april_lswi', 'april_s2_water_index', 'april_blue', 'april_nir', 'april_rededge2', 'april_sw1', 'april_sw2', 'october_tdvi', 'october_brightness', 'october_wetness', 'october_lswi', 'october_s2_water_index', 'october_blue', 'october_nir', 'october_rededge1', 'springdiff_lst', 'springdiff_tdvi', 'springdiff_ndwi', 'springdiff_brightness', 'springdiff_wetness', 'springdiff_carotenoid_index_1', 'springdiff_chlorophyll_veg_index', 'springdiff_carotenoid_index_2', 'springdiff_anthocyanin_index_2', 'springdiff_s2_water_index', 'springdiff_blue', 'springdiff_nir', 'springdiff_rededge1', 'springdiff_sw1', 'falldiff_tdvi', 'falldiff_ndwi', 'falldiff_brightness', 'falldiff_wetness', 'falldiff_carotenoid_index_1', 'falldiff_chlorophyll_veg_index', 'falldiff_lswi', 'falldiff_chlorophyll_index_red_edge', 'falldiff_carotenoid_index_2', 'falldiff_anthocyanin_index_2', 'falldiff_s2_water_index', 

These additional arguments to `run_tree_canopy_model()` can also be adjusted when training a new model:
* `autoselect_variables` (default=True): when True, hierarchical clustering is used to automatically select a subset of the available features for model training, based on their correlation to each other and to the response variable.
* `manual_variables` (default=False): when True, the function expects the user to specify which features to use for training.
* `vars_to_use` (default=False): if `manual_variables` is True, a list of feature names as strings, e.g. `['april_ndvi', 'october_red', ...]`, to be used for training the model.
* `var_selection_threshold` (default=0.1): threshold used for automatic variable selection. Ranges from 0 to 2, with lower values allowing more variables to be included.
* `frac` (default=0.10): the fraction of all available data to be used to train the model.
* `test_size` (default=0.3): the fraction of the training data to be held out for testing (i.e. a further subset of `frac`).
* `random_state` (default=42): random state for all processes (splitting data, random forest, etc.).
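
To see how `frac` and `test_size` combine, here is a minimal standard-library sketch of a two-stage split. It is not the project's implementation: `split_sizes` and `sample_and_split` are hypothetical helpers, and the rounding and shuffling details are assumptions.

```python
import random


def split_sizes(n_pixels, frac=0.10, test_size=0.3):
    """Return (n_sampled, n_train, n_test) for a two-stage split."""
    n_sampled = round(n_pixels * frac)      # pixels drawn from the full site
    n_test = round(n_sampled * test_size)   # held out from the drawn pixels
    return n_sampled, n_sampled - n_test, n_test


def sample_and_split(indices, frac=0.10, test_size=0.3, random_state=42):
    """Draw a reproducible sample of indices and split it into train/test."""
    rng = random.Random(random_state)
    n_sampled, n_train, _ = split_sizes(len(indices), frac, test_size)
    sampled = rng.sample(indices, n_sampled)  # sampling without replacement
    return sampled[:n_train], sampled[n_train:]


# With the defaults, 1,000,000 pixels give 100,000 sampled pixels,
# of which 70,000 are used for fitting and 30,000 are held out.
sizes = split_sizes(1_000_000)
train_idx, test_idx = sample_and_split(list(range(1000)))
print(sizes, len(train_idx), len(test_idx))
```

Under these assumptions, only 7% of a site's pixels are used to fit the model with the default settings, which is worth keeping in mind when raising or lowering `frac`.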

As before, the output of `run_tree_canopy_model()` can be accessed and saved as follows. In this case, we have results for both local (same year) and transfer (different year) predictions.

In [2]:
results_17 = nyc17model['local']
results_21 = nyc17model['transfer']['nyc21']

# save scores to scores.json in the output folder
results_17.save_scores()
results_21.save_scores()

# save the result objects for future analysis
results_17.to_npz()
results_21.to_npz()

We can also take a quick look at model performance by calling `print_model_info()`. For the 2021 results, 'Test set RMSE' is `None` because nyc21 was not the training site, so there was no held-out test set.

In [7]:
results_21.print_model_info()

Site Name: nyc21
                        Model: 2017_model_v3
                        Overall:
                        	MAE: 9.64
                        	RMSE: 13.48
                        	Test set RMSE: None
                        Classifier Metrics:
                        	% correct zero canopy labels: 70.71
                        	% zero canopy pixels found: 66.57
                        RMSE by classification: 
                        	TP RMSE (correctly predicted canopy): 14.27
                        	TN RMSE (correctly predicted no canopy):   0.00
                        	FP RMSE (predicted canopy, actually none): 16.69
                        	FN RMSE (missed actual canopy):            11.06
                    
None
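
The 'RMSE by classification' rows can be understood with a small standard-library sketch. This is an illustration rather than the project's code: it assumes that "canopy" means a value greater than zero and "no canopy" means exactly zero, and `rmse` and `rmse_by_class` are hypothetical helpers.

```python
import math


def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs; None if empty."""
    if not pairs:
        return None
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


def rmse_by_class(predicted, actual):
    """Group pixels by predicted/actual canopy presence and score each group.

    Assumption: values are non-negative canopy percentages, and a pixel
    counts as canopy when its value is greater than zero.
    """
    groups = {"TP": [], "TN": [], "FP": [], "FN": []}
    for p, a in zip(predicted, actual):
        if p > 0 and a > 0:
            key = "TP"   # correctly predicted canopy
        elif p == 0 and a == 0:
            key = "TN"   # correctly predicted no canopy
        elif p > 0:
            key = "FP"   # predicted canopy, actually none
        else:
            key = "FN"   # missed actual canopy
        groups[key].append((p, a))
    return {key: rmse(pairs) for key, pairs in groups.items()}


# Four made-up pixels, one per category.
predicted = [40.0, 0.0, 10.0, 0.0]
actual = [50.0, 0.0, 0.0, 20.0]
scores = rmse_by_class(predicted, actual)
print(scores)
```

Under this definition the TN group always scores 0, because both values are zero for every pixel in it, which is consistent with the 0.00 reported above.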
