# Problem Set 6 - Random Forest Regression    

In this problem set, you will train a random forest regression model to emulate a physics-based firn model in order to predict changes in firn air content on Antarctic ice shelves through the end of the century. Firn air content (FAC) is a good proxy for how much meltwater an ice shelf can retain, so higher FAC means that the ice shelf is more stable and less prone to hydrofracture-induced disintegration. By predicting how FAC will change with time under different emissions scenarios, we can estimate the vulnerability of these ice shelves to damage and failure in a changing climate.

The data and workflow in this problem set are adapted from D. Dunmire, N. Wever, A. F. Banwell, J. T. M. Lenaerts (2024) "Antarctic-wide ice-shelf firn emulation reveals robust future firn air depletion signal for the Antarctic Peninsula," *Communications Earth & Environment, 5 (1)*, 1-13, doi: 10.1038/s43247-024-01255-4.

We are motivated to use a machine learning model to make these predictions because of the huge computational cost of physics-based models. Dunmire et al. (2024) note that the SNOWPACK model runs used to generate the FAC training data took multiple days of CPU time, while their random forest emulator was able to estimate FAC time series for around 6 million data points in just under 40 minutes.

**[1]** Import the packages that you will need to train a random forest regression model and optimize its hyperparameters using a randomized search. 

**[2] (2 pts)** Import the training and test data from https://raw.githubusercontent.com/rtculberg/ml_in_eas/main/data/FACTrainingData.csv. Separately, import data from a withheld set of ice shelves from https://raw.githubusercontent.com/rtculberg/ml_in_eas/main/data/WithheldIceShelves.csv. Print the first few rows of the training and test dataset.
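
A minimal sketch of the loading step, assuming pandas is available. To stay self-contained it parses a small inline CSV whose rows and site codes are invented for illustration; `pd.read_csv` accepts a URL string in the same position, so either URL above can be passed to it directly.

```python
import io

import pandas as pd

# Invented stand-in for a few rows of FACTrainingData.csv (not real data).
csv_text = """year,site,snow,wind,ta_summer,ta_annual,fac,SSP
2020,A1,700.0,5.2,-2.5,-12.7,15.1,5
2021,A1,720.0,5.3,-2.4,-12.4,15.0,5
2020,B2,800.0,5.6,-1.7,-11.3,12.3,1
"""

# In the notebook, replace io.StringIO(csv_text) with the URL string.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default (here only three exist).
print(df.head())
```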

You should see that you have access to the following variables:   
**year** - the year of the data point       
**site** - an alphanumeric code for test locations on different ice shelves          
**snow** - total annual snow accumulation in millimeters of water equivalent         
**wind** - average annual 10 m windspeed in meters per second         
**ta_summer** - average 2 m summer air temperature in degrees C          
**ta_annual** - average 2 m annual air temperature in degrees C        
**fac** - firn air content in meters as simulated by the physics-based SNOWPACK model  
**SSP** - Shared Socioeconomic Pathway scenario (i.e., a scenario for a given level of future carbon emissions)

**[3] (5 pts)** From the training and test dataset, create one new dataframe that contains only the input features and another new dataframe that contains only the target feature. Split the data into training and test sets using an 80-20 split (80% training, 20% test). Remember to set your random state to some integer for reproducibility.
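
One way the feature/target separation and the split could look, sketched on synthetic data. The choice of feature columns, the random state of 42, and the made-up relationship between FAC and the inputs are all assumptions for illustration, not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (values are invented).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "year": rng.integers(2015, 2101, n),
    "snow": rng.uniform(100, 1000, n),
    "wind": rng.uniform(2, 10, n),
    "ta_summer": rng.uniform(-10, 0, n),
    "ta_annual": rng.uniform(-25, -5, n),
    "SSP": rng.choice([1, 3, 5], n),
})
df["fac"] = 20 + 0.01 * df["snow"] + 1.5 * df["ta_summer"] + rng.normal(0, 0.5, n)

# Assumed feature set: every column except the site label and the target.
feature_cols = ["year", "snow", "wind", "ta_summer", "ta_annual", "SSP"]
X = df[feature_cols]   # input features
y = df[["fac"]]        # target, kept as a one-column dataframe

# test_size=0.2 gives the 80-20 split; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```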

**[4] (5 pts)** Apply a z-score transform to both your training and test data sets. 
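
A hedged sketch of the z-score step using scikit-learn's `StandardScaler` on synthetic arrays. It follows one common convention, which is an assumption here rather than a stated requirement: the scalers are fit on the training data only and then reused on the test data, so no test-set statistics leak into training. The target gets its own scaler so predictions can later be converted back to meters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the split data (values are invented).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))
y_train = rng.normal(loc=15.0, scale=4.0, size=(80, 1))
y_test = rng.normal(loc=15.0, scale=4.0, size=(20, 1))

# Fit on training data only: z = (x - mean_train) / std_train.
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train)

# Apply the same training-set mean and std to both splits.
X_train_z = x_scaler.transform(X_train)
X_test_z = x_scaler.transform(X_test)
y_train_z = y_scaler.transform(y_train)
y_test_z = y_scaler.transform(y_test)

# inverse_transform undoes the scaling and recovers the original units.
y_test_back = y_scaler.inverse_transform(y_test_z)
```

With this convention the training columns have exactly zero mean and unit variance, while the test columns are only approximately standardized.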

**[5] (5 pts)** A random forest regression model has a large number of tunable hyperparameters. To choose the best parameters, we will use `RandomizedSearchCV` to try many different combinations and see how they score. To do this, we need to set up a random grid with different parameter ranges to explore. Create a Python dictionary that contains lists of possible parameter values for the following hyperparameters:  
`n_estimators: [10,50,100,200]`         
`max_features: [None, 'sqrt']`             
`max_depth: [5,10,25,50,100,None]`             
`min_samples_split: [2,5,10]`           
`min_samples_leaf: [1,2,4]`             
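
The grid above can be written as a plain dictionary keyed by hyperparameter name. The variable names `random_grid` and `n_combinations` are illustrative. The short loop counts how many distinct combinations the grid contains, which shows why a randomized search that samples only some of them can save time over an exhaustive one.

```python
# Keys must match RandomForestRegressor argument names exactly.
random_grid = {
    "n_estimators": [10, 50, 100, 200],
    "max_features": [None, "sqrt"],
    "max_depth": [5, 10, 25, 50, 100, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Total number of distinct combinations: 4 * 2 * 6 * 3 * 3 = 432.
n_combinations = 1
for values in random_grid.values():
    n_combinations *= len(values)
print(n_combinations)
```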

**[6] (5 pts)** Instantiate a new `RandomForestRegressor()` object. Pass this object to `RandomizedSearchCV` as the `estimator`. Pass the random grid dictionary that you created in the last code block as the `param_distributions` argument. Run 100 iterations using three-fold cross-validation. Set the `random_state` to the same integer that you used for the test/train split and `n_jobs` to -1 to use all available cores. Then call the `fit()` function on your new `RandomizedSearchCV` object, passing it your training data, to find the best combination of parameters. Note that this code may take a few minutes to run!

**[7] (2 pts)** Find and print out the best parameters from your randomized search. Hint: look at the documentation for the `best_params_` attribute of a `RandomizedSearchCV` object.
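
Steps [6] and [7] can be sketched together on a toy problem. To keep the example fast, it deliberately departs from the assignment settings: the data are synthetic, the grid (`small_grid`) is a reduced hypothetical one, `n_iter` is 5 rather than 100, and `n_jobs` is 1 rather than -1. The random state of 42 is likewise an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic training data (invented).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 4))
y_train = 2.0 * X_train[:, 0] - X_train[:, 1] ** 2 + rng.normal(scale=0.1, size=150)

# Reduced, hypothetical grid so the sketch runs in seconds.
small_grid = {
    "n_estimators": [10, 50],
    "max_depth": [5, None],
    "min_samples_leaf": [1, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(),
    param_distributions=small_grid,  # note the plural argument name
    n_iter=5,          # number of sampled combinations (assignment: 100)
    cv=3,              # three-fold cross-validation
    random_state=42,   # makes the sampled combinations reproducible
    n_jobs=1,          # assignment: -1 to use all available cores
)
search.fit(X_train, y_train)

# best_params_ holds the winning combination as a dictionary;
# best_score_ is its mean cross-validated score (R^2 for a regressor).
print(search.best_params_)
print(search.best_score_)
```

Because `best_params_` is a dictionary keyed by argument name, it can be unpacked straight into a new estimator with `RandomForestRegressor(**search.best_params_)`.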

**[8] (5 pts)** Now that we have selected the best hyperparameters, it's time to actually train our best version of the model! Instantiate a new `RandomForestRegressor` object using the best hyperparameter values that you found in your randomized search. Set the random state to the same integer that you used in the random search. Make sure your random forest regressor will use bootstrapping to build the trees and be sure to use out-of-bag samples to estimate the generalization score.     

Train your model using your training data. Then print out the R^2 training score, the R^2 score on the test set, the score of the training dataset using the out-of-bag estimate, and the feature importance weighting. 
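
A minimal sketch of the final training step on synthetic data. The `best_params` dictionary is a hypothetical placeholder for whatever your randomized search returned, and the random state is arbitrary. Setting `oob_score=True` requires `bootstrap=True`: each tree is then scored on the samples left out of its bootstrap draw.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data in which feature 0 carries most of the signal (invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical placeholder for search.best_params_.
best_params = {"n_estimators": 100, "max_depth": 10, "min_samples_leaf": 1}

rf = RandomForestRegressor(
    bootstrap=True,    # build each tree on a bootstrap resample
    oob_score=True,    # score each sample with trees that never saw it
    random_state=42,
    **best_params,
)
rf.fit(X_train, y_train)

train_r2 = rf.score(X_train, y_train)   # R^2 on the training set
test_r2 = rf.score(X_test, y_test)      # R^2 on the held-out test set
oob_r2 = rf.oob_score_                  # out-of-bag R^2 estimate
importances = rf.feature_importances_   # impurity-based, sums to 1

print(train_r2, test_r2, oob_r2)
print(importances)
```

The training R^2 of a random forest is usually optimistic, so the out-of-bag and test scores are the more useful indicators of how well the model generalizes.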

**[9] (5 pts)** Use your trained model to predict FAC on the test dataset. Calculate and print out the R^2 score for the test dataset. Print the RMSE score, rescaled to FAC units in meters.
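
The rescaling in this step follows from the z-score definition: if the target was standardized with standard deviation `sigma`, an RMSE computed in standardized units converts to meters by multiplying by `sigma`, while R^2 is unchanged by the transform. The sketch below checks both statements on invented numbers. Here `y_scaler` is fit on the toy values themselves and stands in for the scaler fit on your training target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Invented true and predicted FAC values in meters.
y_true_m = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
y_pred_m = np.array([11.0, 11.5, 15.5, 17.0, 21.0])

# Stand-in for the scaler that was fit on the training target.
y_scaler = StandardScaler().fit(y_true_m.reshape(-1, 1))
y_true_z = y_scaler.transform(y_true_m.reshape(-1, 1)).ravel()
y_pred_z = y_scaler.transform(y_pred_m.reshape(-1, 1)).ravel()

# RMSE in standardized units, then rescaled by the target's std (scale_).
rmse_z = np.sqrt(mean_squared_error(y_true_z, y_pred_z))
rmse_m = rmse_z * y_scaler.scale_[0]

# Cross-check: RMSE computed directly on the values in meters.
rmse_m_direct = np.sqrt(mean_squared_error(y_true_m, y_pred_m))

# R^2 does not depend on the units.
r2_z = r2_score(y_true_z, y_pred_z)
r2_m = r2_score(y_true_m, y_pred_m)
print(rmse_m, rmse_m_direct, r2_z, r2_m)
```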

**[10] (3 pts)** Make a scatter plot of the true vs. predicted FAC on the test set. Be sure to "un-standardize" the data before plotting so that your axes are in terms of true FAC in meters. Plot a 1:1 line on top of your scatter plot.
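
A sketch of the plot using matplotlib and invented values that are already in meters. The `Agg` backend line only makes the example run without a display and is not needed in a notebook. The 1:1 line is drawn between the smallest and largest values across both axes, so points on the line are perfect predictions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Invented true and predicted FAC in meters.
rng = np.random.default_rng(0)
y_true = rng.uniform(5, 25, 100)
y_pred = y_true + rng.normal(0, 1, 100)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_true, y_pred, s=10, alpha=0.6)

# 1:1 line spanning the full range of both axes.
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax.plot(lims, lims, "k--", label="1:1 line")

ax.set_xlabel("SNOWPACK FAC (m)")
ax.set_ylabel("Predicted FAC (m)")
ax.legend()
```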

**[11]** Now we can use our trained emulator to predict how FAC will change on other ice shelves through the end of the century. The code below selects some test points from the Larsen C ice shelf under different SSP scenarios for this exercise.

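```python
# Select three test sites on the Larsen C ice shelf.
# `unseen` is the dataframe of withheld ice shelves loaded in step [2].
larsen_c_sites = ['VIR619', 'VIR621', 'VIR575']
LC = unseen[unseen.site.isin(larsen_c_sites)].copy()

# Drop the site label and the SNOWPACK FAC so only model inputs remain.
LC.drop(['site', 'fac'], axis=1, inplace=True)
LC.head()
```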

| Unnamed: 0 | year | snow | wind | ta_summer | ta_annual | SSP |
|---|---|---|---|---|---|---|
| 1 | 2020 | 699.207 | 5.267815 | -2.533384 | -12.772272 | 5 |
| 2 | 2020 | 797.802 | 5.646286 | -1.729275 | -11.357697 | 5 |
| 3 | 2020 | 929.5386 | 4.920175 | -2.362071 | -12.070659 | 5 |
| 12 | 2021 | 721.8876 | 5.306538 | -2.488674 | -12.412049 | 5 |
| 13 | 2021 | 824.7402 | 5.689971 | -1.712857 | -10.976377 | 5 |


**[12] (2 pts)** Standardize the new unseen data for input to the model.

**[13] (5 pts)** Use your trained model to predict FAC for each data point. Rescale your predictions back to meters. Add your predictions to your `LC` dataframe as a new column and print the first few rows of that dataframe. 
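
Steps [12] and [13] can be sketched end to end on a toy problem. Several things are assumptions for illustration: the training data are synthetic, only two features are used, the relationship between FAC and the inputs is invented, and the column name `fac_pred` is hypothetical. The point it shows is that new data are transformed with the scaler already fit on the training features (it is not refit), and predictions are mapped back to meters with the target scaler's `inverse_transform`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic training set with two features (invented relationship).
rng = np.random.default_rng(0)
feature_cols = ["snow", "ta_summer"]
train = pd.DataFrame({
    "snow": rng.uniform(100, 1000, 200),
    "ta_summer": rng.uniform(-10, 0, 200),
})
train["fac"] = (20.0 + 0.01 * train["snow"] + 1.5 * train["ta_summer"]
                + rng.normal(0, 0.5, 200))

# Scalers fit on the training data, then a model trained in z-score space.
x_scaler = StandardScaler().fit(train[feature_cols])
y_scaler = StandardScaler().fit(train[["fac"]])
rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(x_scaler.transform(train[feature_cols]),
       y_scaler.transform(train[["fac"]]).ravel())

# New, unseen inputs (first rows of the LC table shown above).
new = pd.DataFrame({
    "snow": [699.207, 797.802, 929.5386],
    "ta_summer": [-2.533384, -1.729275, -2.362071],
})

# Reuse the training scaler, predict, and convert back to meters.
new_z = x_scaler.transform(new[feature_cols])
pred_z = rf.predict(new_z)
new["fac_pred"] = y_scaler.inverse_transform(pred_z.reshape(-1, 1)).ravel()
print(new.head())
```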

**[14] (1 pt)** The code below shows you how to select data points from the SSP5 emission scenario and find the mean FAC for each year in the dataset. Use this example to create two more variables that hold data for the SSP3 and SSP1 emission scenarios.

```python
LC5 = LC[LC.SSP == 5].groupby(LC.year).mean()
```
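
The filter-then-group pattern can be checked by hand on a toy dataframe. The values, the column name `fac_pred`, and the dictionary `means` are invented for illustration. This variant groups by scenario first and then by year, which produces one annual-mean series per scenario in a single pass.

```python
import pandas as pd

# Toy predictions: two sites per year for SSP5, one site per year for SSP1.
toy = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021, 2020, 2021],
    "SSP": [5, 5, 5, 5, 1, 1],
    "fac_pred": [10.0, 12.0, 9.0, 11.0, 14.0, 13.0],
})

# For each scenario, average the predicted FAC over all sites in each year.
means = {
    ssp: group.groupby("year")["fac_pred"].mean()
    for ssp, group in toy.groupby("SSP")
}

print(means[5])  # annual means for SSP5, indexed by year
print(means[1])  # annual means for SSP1, indexed by year
```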

**[15] (5 pts)** For each of the three emission scenarios, plot annual mean FAC as a function of time. Plot all three lines on the same plot. Don't forget to label your axes correctly and provide a legend.
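
A sketch of the final figure with matplotlib. The three series are invented linear trends that stand in for your annual mean FAC per scenario, and the dictionary name `annual_means` is hypothetical. The `Agg` backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented annual-mean FAC series (meters), one per scenario.
years = np.arange(2020, 2101)
annual_means = {
    "SSP1": pd.Series(15.0 - 0.01 * (years - 2020), index=years),
    "SSP3": pd.Series(15.0 - 0.04 * (years - 2020), index=years),
    "SSP5": pd.Series(15.0 - 0.08 * (years - 2020), index=years),
}

fig, ax = plt.subplots()
for label, series in annual_means.items():
    # One line per scenario; the label feeds the legend.
    ax.plot(series.index, series.values, label=label)

ax.set_xlabel("Year")
ax.set_ylabel("Mean FAC (m)")
ax.legend()
```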