<a href="https://colab.research.google.com/github/williamlidberg/Geographical-Intelligence-Lab/blob/main/notebooks/machine_learning_raster_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning on raster data
Start by installing geopandas and cloning the course GitHub repository, where some example data is stored. The raster data will be stored under /content/Analyses-of-Environmental-Data-2/data/rasters and the field plots will be stored as CSV files under /content/Analyses-of-Environmental-Data-2/data/


In [None]:
!pip install geopandas
!git clone https://github.com/williamlidberg/Analyses-of-Environmental-Data-2.git # This includes some test data we will use

### Import and inspect the field data csv

In [None]:
import pandas as pd
soildata = pd.read_csv('/content/Analyses-of-Environmental-Data-2/data/Krycklan_Soilsurvey_data.csv', sep=';')
soildata

### Plot the data

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt

soildata_gdf = gpd.GeoDataFrame(soildata, geometry=gpd.points_from_xy(soildata.East, soildata.North), crs=3006)
plt.rcParams["figure.figsize"] = (10,20)
soildata_gdf.plot(column='SMC', cmap='viridis_r', legend=True)

## Raster data

In order to train a machine learning model you need some geospatial data.
The following raster layers were calculated from the digital elevation model using Whitebox Tools.

1.   DownslopeIndex_2m
2.   DownslopeIndex_4m
3.   DepthToWater_1ha
4.   DepthToWater_2ha
5.   DepthToWater_4ha
6.   DepthToWater_8ha
7.   DepthToWater_16ha
8.   DepthToWater_32ha
9.   ElevationAboveStream_1ha
10.  ElevationAboveStream_2ha
11.  ElevationAboveStream_4ha
12.  ElevationAboveStream_8ha
13.  ElevationAboveStream_16ha
14.  ElevationAboveStream_32ha
15.  PennocLandformClassification
16.  PlanCurvature
17.  RelativeTopographicPosition
18.  TopographicWetnessIndex
19.  WILT
20.  DEM
21.  Slope
22.  DInfFlowaccumulation

You need to extract the pixel values to the field plots. This can be done using a combination of [rasterio](https://rasterio.readthedocs.io/en/latest/) and [geopandas](https://geopandas.org/en/stable/). rasterio is a Python package that focuses on reading and writing raster data. Start by installing it in your environment.

In [None]:
!pip install rasterio

Before we extract anything, it's a good habit to inspect some of the data to make sure it looks as expected.

In [None]:
import rasterio
from rasterio.plot import show
dem = rasterio.open('/content/Analyses-of-Environmental-Data-2/data/rasters/dem/16m.tif') # dem is short for digital elevation model
show(dem, cmap='viridis_r')
plt.show()

## Task 1
Plot some of the raster files under /content/Analyses-of-Environmental-Data-2/data/rasters/ so you get a sense of what the data is representing.

Note that some muppet (me) has mixed lower-case and upper-case letters in the names, and Python is case sensitive: slope and Slope are not the same. This is a very common mistake.
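
To avoid guessing the capitalisation, you can list the folder names exactly as they are stored on disk. This is a minimal sketch, assuming each layer lives in its own subfolder as in the paths used in this notebook; `list_layers` is a hypothetical helper and not part of the course code.

```python
import os

def list_layers(raster_dir):
    """Return the sub-folder names of raster_dir, sorted without regard to case."""
    names = [name for name in os.listdir(raster_dir)
             if os.path.isdir(os.path.join(raster_dir, name))]
    return sorted(names, key=str.lower)

raster_dir = '/content/Analyses-of-Environmental-Data-2/data/rasters'
if os.path.isdir(raster_dir):  # only runs where the course repository has been cloned
    print(list_layers(raster_dir))
```

Copying a name from this list into the file path avoids the slope/Slope problem.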

# Extract raster values to field plots
If the data looks to be in order, we can use rasterio to extract the raster values to our field plots. The code first collects the x and y coordinates of each field plot in the geodataframe: `coords = [(x,y) for x, y in zip(soildata_gdf.geometry.x, soildata_gdf.geometry.y)]`. It then samples the raster at each coordinate and stores the extracted values in a new column of the geodataframe: `soildata_gdf['dem'] = [x[0] for x in src.sample(coords)]`. Each sample holds one value per raster band, so `x[0]` picks the value of the first band.

In [None]:
import rasterio
import geopandas as gpd


coords = [(x,y) for x, y in zip(soildata_gdf.geometry.x, soildata_gdf.geometry.y)]

# Open the raster using rasterio and extract the pixel values to the geodataframe
# dem
src = rasterio.open('/content/Analyses-of-Environmental-Data-2/data/rasters/dem/16m.tif')
soildata_gdf['dem'] = [x[0] for x in src.sample(coords)] # Naming is important to keep things in order
# Slope
src = rasterio.open('/content/Analyses-of-Environmental-Data-2/data/rasters/Slope/16m.tif')
soildata_gdf['Slope'] = [x[0] for x in src.sample(coords)]
# RelativeTopographicPosition
src = rasterio.open('/content/Analyses-of-Environmental-Data-2/data/rasters/RelativeTopographicPosition/16m.tif')
soildata_gdf['RelativeTopographicPosition'] = [x[0] for x in src.sample(coords)]
# TopographicWetnessIndex
src = rasterio.open('/content/Analyses-of-Environmental-Data-2/data/rasters/TopographicWetnessIndex/16m.tif')
soildata_gdf['TopographicWetnessIndex'] = [x[0] for x in src.sample(coords)]
# DownslopeIndex_2m
src = rasterio.open('/content/Analyses-of-Environmental-Data-2/data/rasters/DownslopeIndex_2m/16m.tif')
soildata_gdf['DownslopeIndex_2m'] = [x[0] for x in src.sample(coords)]
# DepthToWater_1ha
src = rasterio.open('/content/Analyses-of-Environmental-Data-2/data/rasters/DepthToWater_1ha/16m.tif')
soildata_gdf['DepthToWater_1ha'] = [x[0] for x in src.sample(coords)]
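
Sampling a raster at a point comes down to converting the map coordinates to a row and a column and reading that cell. The sketch below illustrates the idea for a north-up raster with square pixels. The origin, the 16 m pixel size and the grid values are made up, and `xy_to_rowcol` is a hypothetical helper; rasterio handles this step for you through the affine transform of the dataset.

```python
import math
import numpy as np

def xy_to_rowcol(x, y, x_min, y_max, pixel_size):
    """Map coordinates -> (row, col) for a north-up raster with square pixels."""
    col = math.floor((x - x_min) / pixel_size)
    row = math.floor((y_max - y) / pixel_size)  # rows count downwards from the top edge
    return row, col

# Toy 3 x 4 raster whose upper-left corner is at (1000, 2000); all values are made up
grid = np.arange(12).reshape(3, 4)
row, col = xy_to_rowcol(x=1040.0, y=1970.0, x_min=1000.0, y_max=2000.0, pixel_size=16.0)
print(row, col, grid[row, col])  # 1 2 6
```

This is also why the field plots and the rasters must share the same coordinate system: the conversion only makes sense when both use the same units and origin.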


### Now you have a new geodataframe with both the field data and the raster data

In [None]:
soildata_gdf

It can be useful to plot one of the variables to see if it makes sense. Compare this plot to the raster plots you made above.

In [None]:
plt.rcParams["figure.figsize"] = (10,20)
soildata_gdf.plot(column='dem', cmap='viridis_r')

Since we are only interested in soil moisture now we will drop the other Y-variables. We also need to split the data into training data and testing data. The model will be trained on the training data and evaluated on the test data just like in module 7.

In [None]:
soildata_clean = soildata_gdf.drop(soildata_gdf.columns[2:10], axis=1) # drops the 3rd to the 10th column (positions 2 to 9)
soildata_clean = soildata_clean.drop(soildata_gdf.columns[0], axis=1) # drops the first column (position 0), which is the text for the soil moisture
soildata_clean # SMC_code = Soil moisture code

# Split data into training and testing
Here we will use stratified sampling, which means that sklearn keeps the share of each class roughly the same in the training data and the testing data.

In [None]:
from sklearn.model_selection import train_test_split
y = soildata_clean.iloc[:,0] # This is soil moisture
x = soildata_clean.iloc[:,1:] # These are all the topographical variables

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=0, stratify = y)
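
The effect of `stratify` is easiest to see on a small imbalanced example. This is a sketch with made-up labels (60, 30 and 10 plots in three classes) and a placeholder feature, not the Krycklan data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 60 plots of class 1, 30 of class 2, 10 of class 3
y_demo = np.array([1] * 60 + [2] * 30 + [3] * 10)
x_demo = np.arange(100).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

# With stratify, the 20 test plots keep the 60/30/10 split: 12, 6 and 2 plots
classes, counts = np.unique(y_te, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```

Note that stratified splitting needs at least two plots in every class, otherwise `train_test_split` raises an error.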

## Tuning the hyperparameters
In the field of machine learning, things are named differently than in traditional statistics. What statistics calls predictors or explanatory variables, machine learning calls features: here, the raster values you extracted to the points. The settings of the model that you choose before training are called hyperparameters. Much cooler. It is quite common to fiddle with these hyperparameters to see what works, and this process can be automated. This is known as tuning the hyperparameters.

Here is an example using a tune grid where multiple models are trained using combinations of the settings listed below. Note that `RandomizedSearchCV` does not try every combination: it draws a random sample of them (10 by default, controlled by `n_iter`). Testing all combinations with `GridSearchCV` is the brute-force alternative and more demanding of your hardware. Either way the search is heavy, but computer time is cheaper than human time, so let's do it.

In [None]:
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
model = RandomForestClassifier() # note that we are using classification for the soil moisture classes


tune_grid = {'n_estimators': [50, 100, 500],
               'max_features': ['sqrt'],
               'max_depth': [4,5,6],
               'min_samples_split': [2, 5, 10],
               'min_samples_leaf': [1, 2, 4],
               'bootstrap': [True]}

rf_random = RandomizedSearchCV(estimator = model, param_distributions = tune_grid, random_state=0, n_jobs = -1)

# Run the search; the best combination is then refit on all of the training data
rf_random.fit(x_train, y_train)
print('The best combination of hyperparameters was', rf_random.best_params_)
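
To get a feeling for the size of the search, you can count how many distinct combinations the tune grid contains. The sketch below repeats the grid from the cell above and multiplies the lengths of the lists.

```python
import math

tune_grid = {'n_estimators': [50, 100, 500],
             'max_features': ['sqrt'],
             'max_depth': [4, 5, 6],
             'min_samples_split': [2, 5, 10],
             'min_samples_leaf': [1, 2, 4],
             'bootstrap': [True]}

# Number of distinct settings in the full grid: product of the list lengths
n_combinations = math.prod(len(values) for values in tune_grid.values())
print(n_combinations)  # 81
```

With the default `n_iter=10`, the randomized search evaluates 10 of these 81 combinations, each with cross-validation. Raising `n_iter` covers more of the grid at the cost of longer run time.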

Evaluate the model just like in module 7. Note that the accuracy is between 0 and 1 so 0.5 is 50%.

In [None]:
from sklearn.metrics import accuracy_score
y_pred = rf_random.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
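
Accuracy is simply the share of test plots where the predicted class equals the observed class. It helps to compare it with a naive baseline that always predicts the most common class. The sketch below uses made-up labels for ten plots, not the output of the model above.

```python
import numpy as np

# Made-up labels for ten test plots
y_true = np.array([2, 2, 2, 2, 3, 3, 1, 4, 2, 3])
y_hat = np.array([2, 2, 3, 2, 3, 2, 1, 2, 2, 3])

accuracy_demo = np.mean(y_true == y_hat)  # share of plots where prediction equals observation

# Baseline: share of the most common observed class
values, counts = np.unique(y_true, return_counts=True)
baseline = counts.max() / counts.sum()
print(accuracy_demo, baseline)  # 0.7 0.5
```

A model is only useful if its accuracy is clearly above this baseline.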

Inspect the F1-score for each soil moisture class and pay attention to the support column, which shows how many plots are within each category. It's harder to learn with few examples from a class. Remember that the soil moisture classes were:


*   1 = Dry
*   2 = Mesic
*   3 = Mesic-moist
*   4 = Moist
*   5 = Wet

In [None]:
from sklearn.metrics import classification_report

y_pred_test = rf_random.predict(x_test)
print(classification_report(y_test, y_pred_test, zero_division=0))
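
The F1-score in the report is the harmonic mean of precision and recall for one class. As a sketch of what the report computes, the hypothetical helper `f1_for_class` below derives the three numbers from raw counts on made-up labels, returning 0 when a class is never predicted (the same choice as `zero_division=0`).

```python
def f1_for_class(y_true, y_pred, label):
    """Precision, recall and F1 for one class, computed from raw counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for ten test plots
y_true_demo = [2, 2, 2, 2, 3, 3, 1, 4, 2, 3]
y_pred_demo = [2, 2, 3, 2, 3, 2, 1, 2, 2, 3]
print(f1_for_class(y_true_demo, y_pred_demo, label=3))
```

In this toy example class 4 has a support of one plot and is never predicted, so its F1-score is 0.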

# Implement a machine learning model on raster data
Now we have a working model that we want to apply to the Krycklan catchment. We need to read all the raster layers, apply the model and then save the result as a new raster.

All raster data will be read into numpy arrays using GDAL.

In [None]:
from osgeo import gdal_array
import numpy as np
# Read raster data as numeric array from file
dem = gdal_array.LoadFile('/content/Analyses-of-Environmental-Data-2/data/rasters/dem/16m.tif')
Slope = gdal_array.LoadFile('/content/Analyses-of-Environmental-Data-2/data/rasters/Slope/16m.tif')
RelativeTopographicPosition = gdal_array.LoadFile('/content/Analyses-of-Environmental-Data-2/data/rasters/RelativeTopographicPosition/16m.tif')
TopographicWetnessIndex = gdal_array.LoadFile('/content/Analyses-of-Environmental-Data-2/data/rasters/TopographicWetnessIndex/16m.tif')
DownslopeIndex_2m = gdal_array.LoadFile('/content/Analyses-of-Environmental-Data-2/data/rasters/DownslopeIndex_2m/16m.tif')
DepthToWater_1ha = gdal_array.LoadFile('/content/Analyses-of-Environmental-Data-2/data/rasters/DepthToWater_1ha/16m.tif')


Make a list of all arrays you wish to include. Note that you need to add or remove the variables in both the list and the converted dataframe.

In [None]:
# Make a list of all arrays
list_of_all_rasters = [dem, Slope, RelativeTopographicPosition, TopographicWetnessIndex, DownslopeIndex_2m, DepthToWater_1ha]

all_data = np.array(list_of_all_rasters)
all_data = all_data.reshape(6, 738*662).T # 6 is the number of rasters in the list and 738*662 is the number of pixels (rows*columns) in the original DEM

df_data=pd.DataFrame(all_data,columns=['dem', 'Slope', 'RelativeTopographicPosition','TopographicWetnessIndex', 'DownslopeIndex_2m', 'DepthToWater_1ha'])
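
The reshape step turns a stack of rasters into a table with one row per pixel and one column per layer, and a per-pixel result can later be reshaped back into the grid. The sketch below shows the round trip on two made-up 2 x 3 arrays and reads the dimensions from the array instead of hard-coding them.

```python
import numpy as np

# Two made-up 2 x 3 "rasters" standing in for the real layers
layer_a = np.array([[1, 2, 3],
                    [4, 5, 6]])
layer_b = layer_a * 10

stack = np.array([layer_a, layer_b])           # shape (2 layers, 2 rows, 3 cols)
n_layers, n_rows, n_cols = stack.shape
table = stack.reshape(n_layers, -1).T          # shape (6 pixels, 2 layers)

# A per-pixel result (here simply the sum of both layers) goes back to the grid shape
per_pixel = table.sum(axis=1)
grid_again = per_pixel.reshape(n_rows, n_cols)
print(table.shape, grid_again)
```

Each row of `table` holds the values of all layers for one pixel, which is the layout the model expects when predicting.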


### Get the classification

In [None]:
result = rf_random.predict(df_data)

# Save the data as a raster file with coordinates and extent from one of the input layers
result = result.reshape(738,662)
extent = rasterio.open('/content/Analyses-of-Environmental-Data-2/data/rasters/dem/16m.tif')

with rasterio.Env():
  profile = extent.profile
  with rasterio.open('/content/prediction.tif', 'w', **profile) as dst:
        dst.write(result, 1)

src = rasterio.open('/content/prediction.tif')
plt.imshow(src.read(1), cmap='viridis_r')
plt.show()

# Get the probability of a class
Nature is rarely divided into fixed classes, so the probability of a class might actually be more useful. Instead of saving the classification from the model we can save the probability of a class.

In [None]:
result = rf_random.predict_proba(df_data) # notice that we use predict_proba instead of predict
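result = result[:, 4] # column 4 is the fifth class (5 = Wet), assuming all five classes were present in the training data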
# Save the data as a raster file with coordinates and extent from one of the input layers
result = result.reshape(738,662)
extent = rasterio.open('/content/Analyses-of-Environmental-Data-2/data/rasters/dem/16m.tif')

with rasterio.Env():
  profile = extent.profile
  with rasterio.open('/content/prediction_probability.tif', 'w', **profile) as dst:
        dst.write(result, 1)

src = rasterio.open('/content/prediction_probability.tif')
plt.imshow(src.read(1), cmap='viridis_r')
plt.show()

### Task 2
Plot the probability of Dry areas, Wet areas and Mesic-moist areas. The original values for the classes are:

*   1 = Dry
*   2 = Mesic
*   3 = Mesic-moist
*   4 = Moist
*   5 = Wet

But remember that Python indexing starts from 0, so class 1 is at position 0 in the probability output.
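
The position of a class in the probability output follows the `classes_` attribute of the fitted model, so it only equals "class number minus one" when every class occurred in the training data. The sketch below uses random made-up data in which class 2 is missing to show how to look the column up instead of counting by hand.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up training data in which class 2 never occurs
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(40, 3))
y_demo = np.repeat([1, 3, 4, 5], 10)

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(x_demo, y_demo)
proba = clf.predict_proba(x_demo)

# classes_ lists the label behind each probability column, in order
print(clf.classes_)                       # [1 3 4 5]
wet_column = list(clf.classes_).index(5)  # position of class 5 (Wet)
wet_probability = proba[:, wet_column]
```

For the tuned model in this notebook, the corresponding attribute should be available as `rf_random.best_estimator_.classes_`.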

## Task 2 (continued)
Train a decision tree model like the one you used in module 7 on the same data and compare it to the random forest model.

## Task 3
Predict carbon to nitrogen ratio based on topographical data
---

Now you will do the same as above, but instead of using classified soil moisture you will try to predict the C/N ratio from a new set of field plots from Krycklan. The data can be found here: `/content/Analyses-of-Environmental-Data-2/data/Krycklan_Chemdata.csv`. The share of carbon between mineral-associated and particulate organic matter and the ratio between carbon and nitrogen affect soil carbon stocks and mediate the effects of other variables on soil carbon stocks.



This dataset contains samples from multiple depths from each plot. You can select the sample depth with `chemdata_gdf = chemdata_gdf[chemdata_gdf['SampleDepth'] == 0]` to get the surface sample.

Remember to change from [classification](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforest#sklearn.ensemble.RandomForestClassifier) to [regression](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html?highlight=randomforest#sklearn.ensemble.RandomForestRegressor) and do not use stratified sampling when splitting the data.

Change

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=0, stratify = y)

to

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=0)
    
Drop the appropriate columns before training the model. You will also need to use other metrics when evaluating.

These are some examples that you can use:

    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    # Calculate mean absolute error (MAE)
    mae = mean_absolute_error(y_test, y_pred)
    print("Mean Absolute Error (MAE):", mae)

    # Calculate mean squared error (MSE)
    mse = mean_squared_error(y_test, y_pred)
    print("Mean Squared Error (MSE):", mse)

    # Calculate root mean squared error (RMSE)
    rmse = mse ** 0.5
    print("Root Mean Squared Error (RMSE):", rmse)

    # Calculate R-squared (R2)
    r2 = r2_score(y_test, y_pred)
    print("R-squared (R²):", r2)
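
To see what these metrics measure, here is a small worked example computed by hand. The observed and predicted values are made up for three plots and do not come from the Krycklan data.

```python
import math

# Made-up observed and predicted C/N ratios for three plots
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]

errors = [p - o for o, p in zip(observed, predicted)]           # 2, -2, 3
mae_demo = sum(abs(e) for e in errors) / len(errors)            # 7 / 3
mse_demo = sum(e ** 2 for e in errors) / len(errors)            # 17 / 3
rmse_demo = math.sqrt(mse_demo)

mean_observed = sum(observed) / len(observed)
ss_total = sum((o - mean_observed) ** 2 for o in observed)      # 200
r2_demo = 1 - sum(e ** 2 for e in errors) / ss_total            # 1 - 17 / 200
print(mae_demo, mse_demo, rmse_demo, r2_demo)
```

MAE and RMSE are in the same unit as the C/N ratio, while R² compares the squared errors of the model with those of always predicting the mean.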

## Task 4
 Extract more variables with Whitebox Tools
---
[Whitebox Tools](https://www.whiteboxgeo.com/manual/wbt_book/available_tools/geomorphometric_analysis.html) is great software for topographical modeling. This section describes how to extract additional topographical features. The example shows how to set up Whitebox Tools and calculate a ruggedness index from the original DEM. Your task is to extract a new topographical index using Whitebox Tools, include it with the other data and train a machine learning model where the new index is included.

To do this you need to complete the following steps


1.   Calculate the index using whitebox.
2.   Extract raster values to the field plot points.
3.   Add it to the list of rasters for inference, and don't forget that it needs to have the same name in the table as in the list of rasters.
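
One way to keep the names and the arrays in step is to store them together in a dictionary and build both the table and its column names from it. This is a sketch with made-up 2 x 2 arrays; `Ruggedness` is a hypothetical name for the new layer and must match the column name used when extracting values to the field plots.

```python
import numpy as np
import pandas as pd

# Made-up 2 x 2 arrays standing in for the rasters; 'Ruggedness' is the hypothetical new layer
layers = {
    'dem': np.array([[200.0, 210.0], [220.0, 230.0]]),
    'Slope': np.array([[1.0, 2.0], [3.0, 4.0]]),
    'Ruggedness': np.array([[0.1, 0.2], [0.3, 0.4]]),
}

# Column names and column order both come from the same dictionary
table = np.array(list(layers.values())).reshape(len(layers), -1).T
df_demo = pd.DataFrame(table, columns=list(layers.keys()))
print(df_demo)
```

Recent versions of scikit-learn compare the column names with those seen during training, so the names and their order should be the same in the training data and in this table.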



In [None]:
!pip install whitebox

import whitebox

wbt = whitebox.WhiteboxTools()

wbt.ruggedness_index(
    dem = '/content/Analyses-of-Environmental-Data-2/data/rasters/dem/16m.tif',
    output = '/content/Analyses-of-Environmental-Data-2/data/ruggedness.tif'
)


Complete the tasks and upload the notebook with your name to Canvas.