<a href="https://colab.research.google.com/github/subornaa/Data-Analytics-Tutorials/blob/main/Lasso_and_Ridge_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regularization Techniques: Lasso and Ridge Regression

# Introduction and Dataset

## Background

This tutorial will explore Lasso and Ridge regression methods to model different response variables that are commonly modeled in forestry. These include quadratic mean diameter (QMD), aboveground biomass (AGB), and basal area (BA). The tutorial will employ a suite of input features (i.e., predictor variables) used to estimate the response variables.

## Tutorial goals

**Goal 1: Develop ridge and lasso regression models for QMD, AGB, and BA using LiDAR and multispectral predictor variables**

**Goal 2: Compare ridge and lasso models for each response variable and choose the best model for each**

**Goal 3: Apply the best performing model for each response variable across the entire PRF**

-----

## Data

Please refer to the README on the main GitHub page for a detailed description of each file.


## Packages

- GeoPandas
- rioxarray
- spyndex

# Install and load packages

**Run the cell below to install the required packages**

In [None]:
!pip install -q pandas==2.2.2
!pip install -q geopandas==1.0.1
!pip install -q matplotlib==3.10.1
!pip install -q rioxarray==0.19.0
!pip install -q spyndex==0.5.0
!pip install -q pyarrow==19.0.0
!pip install -q laspy[lazrs]==2.5.4

In [None]:
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import rioxarray as rio
import spyndex
import laspy
from spyndex import indices
from math import sqrt, pi
import matplotlib.pyplot as plt
from rasterio.plot import show
import seaborn as sns
import statsmodels.api as sm

# Download data

In [None]:
# Download the data if it does not yet exist
if not os.path.exists("data"):
  !gdown 1UDKAdXW0h6JSf7k31PZ-srrQ3487l9e2
  !unzip prf_data.zip -d data/
  os.remove("prf_data.zip")
else:
  print("Data has already been downloaded.")

os.listdir("data")

# Preprocessing

Before we can begin with ridge and lasso regression, we must first preprocess the data so it is analysis ready. The following code blocks will prepare both the response variables (QMD, AGB, BA) in addition to predictor variables (99th height percentile and spectral indices).

In [None]:
trees_df = gpd.read_file(r'data/trees.csv')
plots_gdf = gpd.read_file(r'data/plots.gpkg').rename(columns={"Plot": "PlotName"})

This block of code ensures that the `biomass`, `height`, `baha`, and `DBH` columns in `trees_df` are numeric.

In [None]:
cols_to_convert = ['biomass', 'height', 'baha', 'DBH']
for col in cols_to_convert:
    trees_df[col] = pd.to_numeric(trees_df[col])

trees_df

Let's check the range of the tree attributes; the values look reasonable.

In [None]:
trees_df.describe()

## Response Variables

As a refresher, a response variable is the variable being measured or observed; it is the focus of the study and is expected to change in relation to other variables. Below we explore the response variables used in this tutorial.

### Quadratic Mean Diameter

Quadratic Mean Diameter (QMD) is a common stand-level attribute that is modeled in forestry. QMD is often preferred over the arithmetic mean diameter because it gives greater weight to larger trees. This is relevant for several reasons, chiefly because the wood from larger trees is more valuable.

QMD is also relevant for understanding forest ecology, among other applications.
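
To make the weighting concrete, the sketch below compares the arithmetic mean and the QMD for a small set of made-up DBH values. The numbers and the helper name `quadratic_mean_diameter` are illustrative assumptions and are not part of the tutorial data.

```python
from math import sqrt

def quadratic_mean_diameter(dbh_values):
    """Return the quadratic mean diameter: sqrt(sum(d^2) / n)."""
    n_trees = len(dbh_values)
    return sqrt(sum(d ** 2 for d in dbh_values) / n_trees)

# Hypothetical DBH measurements (cm) for four trees in one plot
qmd_demo_dbh_cm = [10, 20, 30, 40]

qmd_demo_mean = sum(qmd_demo_dbh_cm) / len(qmd_demo_dbh_cm)
qmd_demo_value = quadratic_mean_diameter(qmd_demo_dbh_cm)

print(f"Arithmetic mean: {qmd_demo_mean:.2f} cm")
print(f"QMD: {qmd_demo_value:.2f} cm")
```

Because each diameter is squared before averaging, the largest tree pulls the QMD above the arithmetic mean.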

In [None]:
# Calculate the Quadratic Mean Diameter (QMD)
qmd_df = (
    trees_df
    .groupby('PlotName')
    .agg(
        n_trees=('DBH', 'count'),
        sum_squares=('DBH', lambda x: (x**2).sum())
    )
    .assign(qmd=lambda df: (df['sum_squares'] / df['n_trees']).apply(sqrt))
    .reset_index()[['PlotName', 'qmd']]
)

print(qmd_df.describe())

# Join with plots GeoDataFrame
plots_gdf = plots_gdf.merge(qmd_df, on='PlotName', how='left')

ax = plots_gdf['qmd'].hist(edgecolor='black', color='green')
ax.set_xlabel('QMD (cm)')
ax.set_ylabel('Number of Plots')
ax.set_title('Distribution of Quadratic Mean Diameter (QMD)')
plt.show()

### Aboveground Biomass (AGB)

Forest aboveground biomass (AGB) is another very common stand attribute to model. Biomass is defined as the living organic materials comprising trees including wood, bark, branches, foliage, etc. AGB is modeled for many different reasons. One relevant application of AGB modelling is for forest carbon projects, since forest aboveground carbon is typically estimated to be ~50% of AGB.

We can calculate plot-level AGB by summing the AGB of all trees in a plot, and then dividing that by the plot area. This is performed in the code below.
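
As a rough worked example of the plot-to-hectare scaling, the sketch below sums per-tree biomass for one plot and expresses it in Mg/ha. The tree values are invented, and the sketch assumes per-tree biomass in kilograms, which may differ from the units of the `biomass` column in `trees.csv`. The 625 m² plot area and the ~50% carbon fraction are the figures used in this tutorial.

```python
# Hypothetical per-tree aboveground biomass (kg) for one plot
agb_demo_tree_kg = [250.0, 400.0, 350.0]

agb_demo_plot_area_ha = 625 / 10000                      # 625 m^2 plot in hectares
agb_demo_plot_total_mg = sum(agb_demo_tree_kg) / 1000    # kg -> Mg (tonnes)
agb_demo_mg_ha = agb_demo_plot_total_mg / agb_demo_plot_area_ha  # scale to one hectare
agb_demo_carbon_mg_ha = 0.5 * agb_demo_mg_ha             # carbon is roughly 50% of AGB

print(f"Plot total: {agb_demo_plot_total_mg} Mg")
print(f"AGB: {agb_demo_mg_ha} Mg/ha")
print(f"Approximate carbon: {agb_demo_carbon_mg_ha} Mg/ha")
```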

In [None]:
# Note that each plot has a radius of 14.1m (625m^2)
# We need to convert to hectares, since this is the most common areal unit in forestry.
# There are 10000 m^2 in a hectare, so we divide by 10000.

plot_area_m2 = 625

plot_area_ha = plot_area_m2 / 10000

print(f"Area of each plot in hectares: {plot_area_ha} ha")

# Convert tree-level biomass from Kg/ha to Kg, and then to Mg (tonnes).
trees_df['biomass_kg'] = trees_df['biomass'] * plot_area_ha
trees_df['biomass_Mg'] = trees_df['biomass_kg'] / 1000

biomass_df = (trees_df.groupby('PlotName').
                    agg(biomass_Mg_total=('biomass_Mg', 'sum')).
                    assign(biomass_Mg_ha=lambda x: x['biomass_Mg_total'] / plot_area_ha))

# Summarize biomass
print(biomass_df.describe())

# Join with plots GeoDataFrame
plots_gdf = plots_gdf.merge(biomass_df, on='PlotName', how='left')

# Plot the per-hectare values so the data match the axis label
ax = biomass_df['biomass_Mg_ha'].hist(edgecolor='black', color='green')
ax.set_xlabel('AGB (Mg/ha)')
ax.set_ylabel('Number of Plots')
ax.set_title('Distribution of Aboveground Biomass')
plt.show()

### Basal Area

Basal area (BA) represents the cross-sectional area of all trees per unit land area. It reflects how crowded or sparse a forest is, which is important for understanding growth conditions, competition, and habitat quality.
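
The calculation can be sketched on two made-up trees: each stem is treated as a circle with diameter DBH, the area is converted from cm² to m², and the plot total is scaled to a hectare. The DBH values and the helper name `stem_basal_area_m2` are assumptions for illustration; the 0.0625 ha plot area is the 625 m² plot size used in this tutorial.

```python
from math import pi

def stem_basal_area_m2(dbh_cm):
    """Cross-sectional stem area in m^2 for a DBH given in cm."""
    area_cm2 = pi * (dbh_cm / 2) ** 2
    return area_cm2 / 10000  # 10,000 cm^2 in one m^2

# Hypothetical DBH values (cm) for two trees in one plot
ba_demo_dbh_cm = [20.0, 40.0]
ba_demo_plot_area_ha = 0.0625

ba_demo_plot_m2 = sum(stem_basal_area_m2(d) for d in ba_demo_dbh_cm)
ba_demo_m2_ha = ba_demo_plot_m2 / ba_demo_plot_area_ha

print(f"Plot basal area: {ba_demo_plot_m2:.4f} m2")
print(f"Basal area per hectare: {ba_demo_m2_ha:.4f} m2/ha")
```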


In [None]:
def get_ba(dbh):
    return ((dbh / 2) ** 2) * pi

ba_df = (trees_df
            .assign(ba_cm2=lambda x: get_ba(x['DBH']))
            .assign(ba_m2=lambda x: x['ba_cm2'] / 10000)
            .groupby('PlotName')
            .agg(total_ba_m2_ha=('ba_m2', 'sum'))
            .assign(ba_m2_ha=lambda x: x['total_ba_m2_ha'] / plot_area_ha)
            .reset_index())

ba_df.describe()

# Join with plots GeoDataFrame
plots_gdf = plots_gdf.merge(ba_df, on='PlotName', how='left')

ax = plots_gdf['ba_m2_ha'].hist(edgecolor='black', color='green')
ax.set_xlabel('Basal Area (m2/ha)')
ax.set_ylabel('Number of Plots')
ax.set_title('Distribution of Basal Area (BA)')
plt.show()

## Predictor Variables

A predictor variable is a variable that is manipulated, controlled, or measured to see whether it has an effect on the response variable. In this tutorial, the predictors are spectral indices and ALS metrics, which we use to estimate the response variables.

### Airborne Laser Scanning (ALS) Derived Metrics

We load the ALS metrics (ALS is a type of LiDAR) as an xarray dataset. xarray is similar to numpy arrays, but with added attributes and functionality. For example, xarrays can contain spatial coordinate reference systems (CRS).
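
Later in this section, raster values are sampled at the plot centres with `.sel(x=..., y=..., method="nearest")`. The sketch below mimics that nearest-pixel lookup in plain NumPy on a tiny made-up grid, purely to illustrate the idea. It is not how xarray implements the lookup, and `sample_nearest` is a hypothetical helper.

```python
import numpy as np

def sample_nearest(grid, xs, ys, x, y):
    """Return the grid value whose pixel-centre coordinates are closest to (x, y)."""
    col = int(np.abs(xs - x).argmin())
    row = int(np.abs(ys - y).argmin())
    return float(grid[row, col])

# Hypothetical 3 x 3 raster with pixel-centre coordinates
nearest_demo_xs = np.array([0.0, 10.0, 20.0])
nearest_demo_ys = np.array([20.0, 10.0, 0.0])   # north to south
nearest_demo_grid = np.arange(9, dtype=float).reshape(3, 3)

# A point at (12, 1) is closest to the pixel centred on (10, 0)
nearest_demo_value = sample_nearest(
    nearest_demo_grid, nearest_demo_xs, nearest_demo_ys, x=12.0, y=1.0
)
print(nearest_demo_value)
```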

Let's first read the `als_metrics` raster.

In [None]:
als_metrics = rio.open_rasterio(r'data/als_metrics.tif')
als_metrics

Ensure that raster and plot coordinates are in the same CRS.

In [None]:
assert plots_gdf.crs == als_metrics.rio.crs, "CRS mismatch between plots and raster data."

Convert ALS metric names from a tuple to a list for later use.

In [None]:
als_metrics_nms = list(als_metrics.long_name)

**Question 1 - Next, we build a list of plot coordinate tuples, then iterate through each ALS metric (by index) and extract its value at each plot location. Fill in the code below.**

In [None]:
plot_coords = [(geom.x, geom.y) for geom in plots_gdf.geometry]

for i, metric in enumerate(als_metrics_nms):

    #Uncomment the line below to see all the different metrics
    #print(f"Extracting metric: {metric}")

    metric_ras_i = ...[i]
    plots_gdf[metric] = [float(metric_ras_i.sel(x=c[0], y=c[1], method="nearest").values) for c in plot_coords]

In [None]:
# @title Solution
plot_coords = [(geom.x, geom.y) for geom in plots_gdf.geometry]

for i, metric in enumerate(als_metrics_nms):

    #Uncomment the line below to see all the different metrics
    #print(f"Extracting metric: {metric}")

    metric_ras_i = als_metrics[i]
    plots_gdf[metric] = [float(metric_ras_i.sel(x=c[0], y=c[1], method="nearest").values) for c in plot_coords]

Let's view the distribution of the 99th height percentile; it looks approximately normal.

In [None]:
ax = plots_gdf['p99'].hist(edgecolor='black', color='blue')
ax.set_xlabel('99th Height Percentile (m)')
ax.set_ylabel('Number of Plots')
plt.show()

Now let's extract the spectral indices to use as predictor variables.

### Sentinel-2 Spectral Indices

While we can write code to calculate spectral indices, this can become time consuming once we start dealing with many different indices. Moreover, we can make mistakes in our code. As a suitable alternative, the `spyndex` Python package offers a standardized, simpler method for calculating many spectral indices at once.
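
For reference, here is what a single index looks like when computed by hand. NDVI is `(N - R) / (N + R)`, where `N` and `R` are near-infrared and red reflectance. The reflectance values below are invented for illustration, and the variable names are assumptions.

```python
import numpy as np

# Hypothetical surface reflectance for three pixels
ndvi_demo_nir = np.array([0.50, 0.40, 0.30])
ndvi_demo_red = np.array([0.10, 0.20, 0.30])

# Normalized difference of the near-infrared and red bands
ndvi_demo = (ndvi_demo_nir - ndvi_demo_red) / (ndvi_demo_nir + ndvi_demo_red)
print(ndvi_demo)
```

Writing a dozen such expressions by hand is where band mix-ups tend to creep in, which is the motivation for using a package instead.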

Read the spyndex documentation here: [https://spyndex.readthedocs.io/en/stable/](https://spyndex.readthedocs.io/en/stable/)

First, we load the Sentinel-2 imagery for 2018 (the year the plots were sampled).

In [None]:
s2 = rio.open_rasterio(r'data/petawawa_s2_2018.tif')

assert plots_gdf.crs == s2.rio.crs, "CRS mismatch between plots and raster data."

# Consult the documentation for the spectral bands:
# https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_SR_HARMONIZED
s2


Check range of reflectance for each band.

In [None]:
print("Min reflectance in S2 data:", np.nanmin(s2.values))
print("Max reflectance in S2 data:", np.nanmax(s2.values))

**Consult this table for more details about band abbreviations used for spectral index calculation:**

[https://github.com/awesome-spectral-indices/awesome-spectral-indices?tab=readme-ov-file#expressions](https://github.com/awesome-spectral-indices/awesome-spectral-indices?tab=readme-ov-file#expressions)

In [None]:
print("Sentinel-2 band names order:", s2.long_name)
print("Spyndex band abbreviations:", spyndex.bands)

Make a list of spectral indices to calculate.

In [None]:
spec_index_ls = ["NDVI", "NBR", "SAVI", "MSAVI", "DSI", "NDWI", "GLI",  "ND705", "NDREI", "IRECI", "TGI"]

One nice thing about spyndex is that it links each spectral index with a publication describing it. This code lists all the publications for each index.

In [None]:
for si in spec_index_ls:
    print(f"{si}: {indices[si].reference}")

This code computes all the spectral indices.

In [None]:
spec_indeces = spyndex.computeIndex(
    index = spec_index_ls,
    params = {
        "A": s2[0],
        "B": s2[1],
        "G": s2[2],
        "R": s2[3],
        "RE1": s2[4],
        "RE2": s2[5],
        "RE3": s2[6],
        "N": s2[7],
        "N2": s2[8],
        "WV": s2[9],
        "S1": s2[10],
        "S2": s2[11],
        "L": 1
    }

)

**Question 2 - Use the `spec_indeces` computed above to extract the spectral index values at each plot into the dataframe. Fill in the code below.**

In [None]:
for si_name in spec_index_ls:

    #Uncomment this line below to see the code extracting
    #print(f"Extracting {si_name} values at plot coordinates...")

    si_raster = ...[spec_indeces.index == si_name]

    plots_gdf[si_name] = [si_raster.sel(x=c[0], y=c[1], method="nearest").values[0] for c in plot_coords]

plots_gdf.head(5)

In [None]:
# @title Solution
for si_name in spec_index_ls:

    #Uncomment this line below to see the code extracting
    #print(f"Extracting {si_name} values at plot coordinates...")

    si_raster = spec_indeces[spec_indeces.index == si_name]

    plots_gdf[si_name] = [si_raster.sel(x=c[0], y=c[1], method="nearest").values[0] for c in plot_coords]

plots_gdf.head(5)

Let's view one of the spectral indices.

In [None]:
view_si_nm = "NDVI"
# Select the index named in view_si_nm (not the leftover loop variable si_name)
view_si_raster = spec_indeces[spec_indeces.index == view_si_nm]
show(view_si_raster.values[0], cmap='viridis')

Converting the geodataframe to regular dataframe allows for easier manipulation.

In [None]:
plots_df = pd.DataFrame(plots_gdf)
plots_df.head()

Now that we have gathered all the necessary information, we can finalize our dataset by creating lists of all predictor and response variables.  
**Note:** In Python, two lists can be concatenated using the `+` operator.  
Please keep in mind that the predictor variable names are stored in two lists: `spec_index_ls` and `als_metrics_nms`.


**Question 3 - fill in the code below.**

In [None]:
predictor_vars = ... + ...
print("Predictor variables:", predictor_vars)

response_vars = ["biomass_Mg_ha", "ba_m2_ha", "qmd"]
print("Response variables:", response_vars)

In [None]:
# @title Solution
predictor_vars = spec_index_ls + als_metrics_nms
print("Predictor variables:", predictor_vars)

response_vars = ["biomass_Mg_ha", "ba_m2_ha", "qmd"]
print("Response variables:", response_vars)

It is always good practice to ensure that there are no unexpected NaN values in the dataset.  
To do this, remove any rows containing NaN values in the predictor or response variables using the `dropna()` function.

In [None]:
plots_df = plots_df.dropna(subset=response_vars + predictor_vars)
plots_df.shape

Let's examine the correlation matrix of the predictor and response variables.  
We primarily do this to identify any unusual or unexpected relationships in the data that may indicate issues or outliers.

In [None]:
corr_matrix = plots_df[response_vars + predictor_vars].corr()
corr_matrix

Let's take a quick look at our data.  
While a single predictor on its own may not reveal a clear trend, we can explore what insights might be gained using Lasso and Ridge regression techniques.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))  # 1 row, 2 columns

# First plot: NDVI vs biomass
plots_df.plot.scatter(x='NDVI', y='biomass_Mg_ha', ax=axes[0])
axes[0].set_title('NDVI vs Biomass')

# Second plot: p99 vs biomass
plots_df.plot.scatter(x='p99', y='biomass_Mg_ha', ax=axes[1])
axes[1].set_title('p99 vs Biomass')

plt.tight_layout()
plt.show()

Export the plots with their predictor variables for later use.

In [None]:
plots_df[['PlotName'] + predictor_vars].to_csv('data/predictors.csv', index=False)

# Goal 1: Develop ridge and lasso regression models for QMD, AGB, and BA using LiDAR and multispectral predictor variables

To understand the purpose of the required packages:

- `train_test_split` is used to divide the dataset into training and testing sets. This helps us evaluate our model on "unseen" data, simulating how it would perform in real-world scenarios and reducing the risk of overfitting.

- `Lasso` is a linear regression model with L1 regularization. It adds a penalty proportional to the sum of the absolute values of the coefficients, which can shrink the coefficients of less important features exactly to zero, effectively performing feature selection. This improves model interpretability and helps prevent overfitting.

- `mean_squared_error` and `r2_score` are evaluation metrics.
    - mean_squared_error measures the average squared difference between actual and predicted values.

    - r2_score (coefficient of determination) indicates how well the model explains the variance in the target variable.
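
To show what the two metrics measure, the sketch below computes them by hand on a handful of invented values. The numbers and variable names are assumptions for illustration; in the tutorial itself the scikit-learn functions are used.

```python
from math import sqrt

# Hypothetical actual and predicted values
metric_demo_actual = [3.0, 5.0, 7.0, 9.0]
metric_demo_predicted = [2.0, 5.0, 8.0, 10.0]
metric_demo_n = len(metric_demo_actual)

# Mean squared error and its square root (RMSE)
metric_demo_sq_err = [
    (a - p) ** 2 for a, p in zip(metric_demo_actual, metric_demo_predicted)
]
metric_demo_mse = sum(metric_demo_sq_err) / metric_demo_n
metric_demo_rmse = sqrt(metric_demo_mse)

# R^2 = 1 - (residual sum of squares / total sum of squares)
metric_demo_mean = sum(metric_demo_actual) / metric_demo_n
metric_demo_ss_res = sum(metric_demo_sq_err)
metric_demo_ss_tot = sum((a - metric_demo_mean) ** 2 for a in metric_demo_actual)
metric_demo_r2 = 1 - metric_demo_ss_res / metric_demo_ss_tot

print(f"MSE: {metric_demo_mse:.3f}, RMSE: {metric_demo_rmse:.3f}, R2: {metric_demo_r2:.3f}")
```

RMSE is in the same units as the response variable, which makes it easier to interpret than MSE.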




In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

Now that the setup is complete, we will define our target variable as `biomass_Mg_ha` and extract the predictor (`X`) and target (`y`) datasets accordingly.

Next, we will split the data into training and testing sets, reserving the test set for final model evaluation on unseen data.

**Question 1 - fill in the code below.**

In [None]:
# Set target variable
target_var = "biomass_Mg_ha"

# Divide features and targets into separate DataFrames
X = plots_gdf[...]
y = plots_gdf[...]

# Split data into training and testing sets
X_train, X_test, ..., y_test = ...(X, y, test_size=0.3, random_state=42)

In [None]:
# @title Solution
# Set target variable
target_var = "biomass_Mg_ha"

# Divide features and targets into separate DataFrames
X = plots_gdf[predictor_vars]
y = plots_gdf[target_var]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

This block of code isn't strictly necessary for every experiment, but it serves as a final check to ensure data quality. It verifies that the dataset dimensions are consistent and that there are no unwanted NaN values before training the model.

In [None]:
train_df = pd.concat([X_train, y_train], axis=1)
train_df_clean = train_df.dropna()

# Separate features and target again
X_train = train_df_clean[predictor_vars]
y_train = train_df_clean[target_var]

Now we can finally train the model!

**Question 2 - fill in the code below.**

In [None]:
# Train a lasso regression model with initial alpha=0.99
lasso = ...(alpha=0.99, max_iter=100000)
lasso....(X_train, y_train)

In [None]:
# @title Solution
# Train a lasso regression model with initial alpha=0.99
lasso = Lasso(alpha=0.99, max_iter=100000)
lasso.fit(X_train, y_train)

Once the model is trained, we can use the test set to evaluate how accurate it is on unseen data.

In [None]:
test_df = pd.concat([X_test, y_test], axis=1)
test_df_clean = test_df.dropna()

# Separate features and target again
X_test = test_df_clean[predictor_vars]
y_test = test_df_clean[target_var]

y_test_pred_lasso = lasso.predict(X_test)

**Question 3 - Fill in the code below to get the R² and RMSE values.**

In [None]:
# Calculate R2 and RMSE
r2 = ...(y_test, ...)
rmse = sqrt(....(y_test, ...))
print(f"R2: {r2:.3f}, RMSE: {rmse:.3f}")

In [None]:
# @title Solution
# Calculate R2 and RMSE
r2 = r2_score(y_test, y_test_pred_lasso)
rmse = sqrt(mean_squared_error(y_test, y_test_pred_lasso))
print(f"R2: {r2:.3f}, RMSE: {rmse:.3f}")

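We can print out all the model coefficients to see how the Lasso penalty affected each predictor; a coefficient of exactly zero means the predictor was dropped from the model.  
This provides insight into which variables the model considered most important, although coefficient magnitudes also depend on each predictor's units because the predictors are not standardized here.  
However, this information becomes more meaningful when compared to another model, so let's proceed by creating a Ridge regression model next.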

In [None]:
# View the parameters of the model
print("Lasso coefficients:")
for feature, coef in zip(X.columns, lasso.coef_):
    print(f"{feature}: {coef:.4f}")

`Ridge` regression is a linear regression model that includes L2 regularization.
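
The practical difference between the L1 and L2 penalties is easiest to see in a simplified setting. Assume a single predictor with ordinary least squares coefficient `b`: minimizing `0.5 * (b - w)**2 + lam * abs(w)` gives a soft-thresholded value, while minimizing `0.5 * (b - w)**2 + 0.5 * lam * w**2` gives `b / (1 + lam)`. The sketch below applies both rules to made-up coefficients. It illustrates the shrinkage behaviour only; it is not how scikit-learn fits `Lasso` or `Ridge` on correlated predictors, and `lam` here is not on the same scale as scikit-learn's `alpha`.

```python
def soft_threshold(b, lam):
    """L1 shrinkage: move b toward zero by lam, and clip to exactly zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def ridge_shrink(b, lam):
    """L2 shrinkage: scale b toward zero without ever reaching it."""
    return b / (1.0 + lam)

# Hypothetical unpenalized coefficients and penalty strength
shrink_demo_ols = [3.0, -0.4, 0.2, -2.0]
shrink_demo_lam = 0.5

shrink_demo_lasso = [soft_threshold(b, shrink_demo_lam) for b in shrink_demo_ols]
shrink_demo_ridge = [ridge_shrink(b, shrink_demo_lam) for b in shrink_demo_ols]

print("Lasso-style:", shrink_demo_lasso)
print("Ridge-style:", [round(c, 4) for c in shrink_demo_ridge])
```

The small coefficients are set exactly to zero by the L1 rule, while the L2 rule only scales every coefficient down proportionally.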

In [None]:
from sklearn.linear_model import Ridge

Since our datasets are already prepared, we can train a Ridge regression model directly without any additional setup.

In [None]:
# Train a ridge regression model with initial alpha=1
ridge = ...(alpha=1, max_iter=10000)
ridge....(..., ...)

In [None]:
# @title Solution
# Train a ridge regression model with initial alpha=1
ridge = Ridge(alpha=1, max_iter=10000)
ridge.fit(X_train, y_train)

Let's look at the metrics again. We will compare them to the Lasso metrics in Goal 2!

In [None]:
y_test_pred_ridge = ridge....(...)
# Calculate R2 and RMSE
r2 = r2_score(..., ...)
rmse = sqrt(mean_squared_error(..., ...))
print(f"R2: {r2:.3f}, RMSE: {rmse:.3f}")

In [None]:
# @title Solution
y_test_pred_ridge = ridge.predict(X_test)
# Calculate R2 and RMSE
r2 = r2_score(y_test, y_test_pred_ridge)
rmse = sqrt(mean_squared_error(y_test, y_test_pred_ridge))
print(f"R2: {r2:.3f}, RMSE: {rmse:.3f}")

Notice that none of the coefficients are exactly zero.  
This is a characteristic of Ridge regression, which will be discussed in more detail in the next objective.

In [None]:
print("ridge coefficients:")
for feature, coef in zip(X.columns, ridge.coef_):
    print(f"{feature}: {coef:.4f}")

# Goal 2: Compare the models

Let's first look at the performance of both models.

**Question 1 - Fill in the code below.**

In [None]:
plt.figure(figsize=(12, 5))

# Ridge
plt.subplot(1, 2, 1)
plt.scatter(..., ..., alpha=0.6, color='royalblue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--')
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Ridge Regression")

# Lasso
plt.subplot(1, 2, 2)
plt.scatter(y_test, ..., alpha=0.6, color='darkorange')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--')
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Lasso Regression")

plt.tight_layout()
plt.show()

In [None]:
# @title Solution
plt.figure(figsize=(12, 5))

# Ridge
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_test_pred_ridge, alpha=0.6, color='royalblue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--')
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Ridge Regression")

# Lasso
plt.subplot(1, 2, 2)
plt.scatter(y_test, y_test_pred_lasso, alpha=0.6, color='darkorange')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--')
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Lasso Regression")

plt.tight_layout()
plt.show()

These types of graphs display how closely the model's predictions align with the actual values.  
A perfect model would have all points lying on the dashed line.  
In this case, both models appear to perform similarly, so a closer examination is required to determine which model is superior.  
However, it appears that the Lasso model may have a slight advantage, though additional evidence is needed to support this conclusion.

One approach is to examine the coefficients of each model to glean insights about the relative importance of the predictors.

**Question 2 - fill in the code below.**

In [None]:
feature_names =  ['NDVI', 'NBR', 'SAVI', 'MSAVI', 'DSI', 'NDWI', 'GLI', 'ND705', 'NDREI', 'IRECI', 'TGI', 'avg_95', 'avg', 'b10', 'b20', 'b30', 'b40', 'b50', 'b60', 'b70', 'b80', 'b90', 'dns_10m', 'dns_12m', 'dns_14m', 'dns_15m', 'dns_16m', 'dns_18m', 'dns_20m', 'dns_25m', 'dns_2m', 'dns_4m', 'dns_5m', 'dns_6m', 'dns_8m', 'kur_95', 'p01', 'p05', 'p10', 'p20', 'p30', 'p40', 'p50', 'p60', 'p70', 'p80', 'p90', 'p95', 'p99', 'qav', 'skew_95', 'd0_2', 'd10_12', 'd12_14', 'd14_16', 'd16_18', 'd18_20', 'd20_22', 'd22_24', 'd24_26', 'd26_28', 'd28_30', 'd2_4', 'd30_32', 'd32_34', 'd34_36', 'd36_38', 'd38_40', 'd40_42', 'd42_44', 'd44_46', 'd46_48', 'd4_6', 'd6_8', 'd8_10', 'std_95', 'vci_1mbin', 'vci_0.5bin']

ridge_coef = ridge....
lasso_coef = lasso....

x = np.arange(len(feature_names))
width = 0.35

fig, axes = plt.subplots(1, 2, figsize=(24, 10))

# Plot 1
# Use axes[0] for the first subplot
axes[0].bar(x - width/2, ridge_coef, width, label='Ridge')
axes[0].bar(x + width/2, lasso_coef, width, label='Lasso')
axes[0].set_ylabel("Coefficient Value")
axes[0].set_title("Model Coefficient Comparison (All Features - Rotated Labels)")
axes[0].legend()


# Prep data for the top 25 coefs
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Ridge_Coef': ridge_coef,
    'Lasso_Coef': lasso_coef
})

coef_df['Abs_Sum_Coef'] = np.abs(coef_df['Ridge_Coef']) + np.abs(coef_df['Lasso_Coef'])
coef_df = coef_df.sort_values(by='Abs_Sum_Coef', ascending=False).head(25)

top_feature_names = coef_df['Feature'].tolist()
top_ridge_coef = coef_df['Ridge_Coef'].tolist()
top_lasso_coef = coef_df['Lasso_Coef'].tolist()

x_top = np.arange(len(top_feature_names))

# Plot 2
# Use axes[1] for the second subplot
axes[1].barh(x_top - width/2, top_ridge_coef, width, label='Ridge')
axes[1].barh(x_top + width/2, top_lasso_coef, width, label='Lasso')

axes[1].set_xlabel("Coefficient Value")
axes[1].set_ylabel("Feature Name")
axes[1].set_title("Model Coefficient Comparison (Top 25 Features)")
axes[1].legend()
axes[1].set_yticks(x_top)
axes[1].set_yticklabels(top_feature_names, fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# @title Solution
feature_names =  ['NDVI', 'NBR', 'SAVI', 'MSAVI', 'DSI', 'NDWI', 'GLI', 'ND705', 'NDREI', 'IRECI', 'TGI', 'avg_95', 'avg', 'b10', 'b20', 'b30', 'b40', 'b50', 'b60', 'b70', 'b80', 'b90', 'dns_10m', 'dns_12m', 'dns_14m', 'dns_15m', 'dns_16m', 'dns_18m', 'dns_20m', 'dns_25m', 'dns_2m', 'dns_4m', 'dns_5m', 'dns_6m', 'dns_8m', 'kur_95', 'p01', 'p05', 'p10', 'p20', 'p30', 'p40', 'p50', 'p60', 'p70', 'p80', 'p90', 'p95', 'p99', 'qav', 'skew_95', 'd0_2', 'd10_12', 'd12_14', 'd14_16', 'd16_18', 'd18_20', 'd20_22', 'd22_24', 'd24_26', 'd26_28', 'd28_30', 'd2_4', 'd30_32', 'd32_34', 'd34_36', 'd36_38', 'd38_40', 'd40_42', 'd42_44', 'd44_46', 'd46_48', 'd4_6', 'd6_8', 'd8_10', 'std_95', 'vci_1mbin', 'vci_0.5bin']

ridge_coef = ridge.coef_
lasso_coef = lasso.coef_

x = np.arange(len(feature_names))
width = 0.35

fig, axes = plt.subplots(1, 2, figsize=(24, 10))

# Plot 1
# Use axes[0] for the first subplot
axes[0].bar(x - width/2, ridge_coef, width, label='Ridge')
axes[0].bar(x + width/2, lasso_coef, width, label='Lasso')
axes[0].set_ylabel("Coefficient Value")
axes[0].set_title("Model Coefficient Comparison (All Features - Rotated Labels)")
axes[0].legend()


# Prep data for the top 25 coefs
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Ridge_Coef': ridge_coef,
    'Lasso_Coef': lasso_coef
})

coef_df['Abs_Sum_Coef'] = np.abs(coef_df['Ridge_Coef']) + np.abs(coef_df['Lasso_Coef'])
coef_df = coef_df.sort_values(by='Abs_Sum_Coef', ascending=False).head(25)

top_feature_names = coef_df['Feature'].tolist()
top_ridge_coef = coef_df['Ridge_Coef'].tolist()
top_lasso_coef = coef_df['Lasso_Coef'].tolist()

x_top = np.arange(len(top_feature_names))

# Plot 2
# Use axes[1] for the second subplot
axes[1].barh(x_top - width/2, top_ridge_coef, width, label='Ridge')
axes[1].barh(x_top + width/2, top_lasso_coef, width, label='Lasso')

axes[1].set_xlabel("Coefficient Value")
axes[1].set_ylabel("Feature Name")
axes[1].set_title("Model Coefficient Comparison (Top 25 Features)")
axes[1].legend()
axes[1].set_yticks(x_top)
axes[1].set_yticklabels(top_feature_names, fontsize=10)

plt.tight_layout()
plt.show()

These types of graphs show the coefficients for each predictor and how strongly each model shrank them.  
An important concept to remember is that, all else being equal, simpler models are generally preferred in machine learning, as they are less prone to overfitting and often generalize better to new data.

If Lasso eliminates many irrelevant variables, it results in a simpler model.  
If this simplification leads to better performance compared to Ridge, it suggests that Lasso is the superior model overall.  
However, we cannot draw this conclusion from the chart alone.  
By examining the evaluation metrics from the final section of this goal, we can incorporate this information into our final decision.

Finally, let's look at the RMSE (Root Mean Squared Error) and R² (Coefficient of Determination) values.

In [None]:
def print_metrics(name, y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    print(f"{name} - RMSE: {rmse:.3f}, R²: {r2:.3f}")

print_metrics("Ridge", y_test, y_test_pred_ridge)
print_metrics("Lasso", y_test, y_test_pred_lasso)

The RMSE (Root Mean Squared Error) is lower for the Lasso model, indicating that its predictions are, on average, closer to the actual values.

The R² (Coefficient of Determination) is higher for Lasso, meaning it explains a greater proportion of the variance in the response variable (`biomass_Mg_ha`). Specifically, Lasso explains 46.5% of the variance, whereas Ridge explains only 26%.

These results suggest that Lasso is the better model in this case, as it fits the data more accurately and generalizes more effectively.

**Question 3 - After our observations, which model do you think is better to use for the final prediction?**

*Answer here*

<details open>
<summary>Solution</summary>

Based on our observations, the Lasso regression model appears to be the better choice for the final prediction. It achieves a lower RMSE, indicating more accurate predictions, and a higher R² value, meaning it explains a larger proportion of the variance in the target variable (`biomass_Mg_ha`). Additionally, Lasso's ability to perform variable selection and produce a simpler model reduces the risk of overfitting and improves generalizability. Therefore, Lasso is preferred over Ridge for this dataset.

</details>

# Goal 3: Apply the best performing model for each response variable across the entire PRF

We will use cross-validation to search for the optimal value of the regularization parameter alpha for our Lasso model. This process will help improve model performance by selecting the alpha that best balances bias and variance. We achieve this by evaluating multiple candidate alpha values and choosing the one that yields the best cross-validated score.
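
Before running the search on the plot data, it may help to see the same pattern on a small synthetic dataset. The sketch below uses `make_regression` to generate made-up data and a much coarser grid of 20 candidate values so that it runs quickly. All `cv_demo_*` names and the dataset itself are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data: 10 features, only 3 of which carry signal
cv_demo_X, cv_demo_y = make_regression(
    n_samples=120, n_features=10, n_informative=3, noise=10.0, random_state=0
)

# Candidate alphas spaced evenly on a log scale
cv_demo_alphas = np.logspace(-2, 2, 20)

cv_demo_search = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid={"alpha": cv_demo_alphas},
    scoring="neg_mean_squared_error",  # higher (closer to zero) is better
    cv=5,
)
cv_demo_search.fit(cv_demo_X, cv_demo_y)

cv_demo_best_alpha = cv_demo_search.best_params_["alpha"]
print(f"Best alpha: {cv_demo_best_alpha:.4f}")
print(f"Best CV score (negative MSE): {cv_demo_search.best_score_:.2f}")
```

The score is reported as negative MSE because scikit-learn's search always maximizes the scoring function.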


> **_NOTE:_**   This code might take some time to run, and disregard any warnings.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
full_df = pd.concat([..., ...], axis=1)
full_df_clean = full_df.dropna()  # drop NaN rows from the combined frame, not from test_df

# Separate features and target again
X = full_df_clean[predictor_vars]
y = full_df_clean[target_var]


alphas = np.logspace(-3, 3, 300)
lasso_cv = ...(Lasso(max_iter=10000), param_grid={'alpha': alphas}, scoring='neg_mean_squared_error', cv=5)
lasso_cv.fit(X_train, y_train)

In [None]:
# @title Solution
alphas = np.logspace(-3, 3, 300)
lasso_cv = GridSearchCV(Lasso(max_iter=10000), param_grid={'alpha': alphas}, scoring='neg_mean_squared_error', cv=5)
lasso_cv.fit(X_train, y_train)

In [None]:
lasso_opt = Lasso(alpha=lasso_cv.best_params_['alpha'], max_iter=10000)
lasso_opt.fit(X_train, y_train)

Let's make our final prediction!

In [None]:
y_pred_full = lasso_opt.predict(X_test)

There are a few ways we can visualize our results. Let's go through them!

In [None]:
plt.figure(figsize=(8, 8))
sns.scatterplot(x=y_test, y=y_pred_full, alpha=0.6)
plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red', linestyle='--', label='Perfect Prediction Line') # y=x line
plt.title('Actual vs. Predicted Biomass')
plt.xlabel('Actual Biomass (Mg/ha)')
plt.ylabel('Predicted Biomass (Mg/ha)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


As we saw before, an actual vs. predicted graph is a good way to see at a glance how our model performed. Below, let's print the metrics to evaluate this final model quantitatively.

In [None]:
print_metrics("Lasso CV", y_test, y_pred_full)

**Question 1 - Given that our model now achieves a higher R² and a lower RMSE compared to previous trial sets, does this indicate that the model’s performance has improved or deteriorated?**


*Answer here*

<details open>
<summary>Solution</summary>

While there is an improvement (a higher R² and a lower RMSE), it is marginal. However, any improvement is welcome, and thus our model has improved!

</details>

Let's create a residual plot. To do this, you must calculate the difference between the actual `y_test` values and the predicted `y_pred_full` values (actual minus predicted).

**Question 2 - fill in the code below.**

In [None]:
residuals = ... - ...

plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_pred_full, y=residuals, alpha=0.6)
plt.axhline(y=0, color='red', linestyle='--', label='Zero Residual Line') # Zero error line
plt.title('Residuals Plot (Predicted vs. Residuals)')
plt.xlabel('Predicted Biomass (Mg/ha)')
plt.ylabel('Residuals (Actual - Predicted)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
# @title Solution
residuals = y_test - y_pred_full

plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_pred_full, y=residuals, alpha=0.6)
plt.axhline(y=0, color='red', linestyle='--', label='Zero Residual Line') # Zero error line
plt.title('Residuals Plot (Predicted vs. Residuals)')
plt.xlabel('Predicted Biomass (Mg/ha)')
plt.ylabel('Residuals (Actual - Predicted)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

A residual plot is useful for detecting any patterns or trends in the residuals.  
We do **not** want to observe any trends, as their presence can indicate that the model is missing structure in the data, such as a nonlinear relationship or non-constant error variance (heteroscedasticity).  
Ideally, the residuals should be randomly and evenly dispersed around zero; check whether this holds in the plot generated above.
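
A quick numeric companion to the visual check, on invented values: residuals are actual minus predicted, their mean should sit near zero, and they should show little linear association with the predicted values. This is only a rough sketch under those assumptions. A near-zero correlation does not rule out curved patterns or changing spread, so the plot remains the primary diagnostic.

```python
import numpy as np

# Hypothetical actual and predicted values
resid_demo_actual = np.array([11.0, 19.0, 29.0, 41.0])
resid_demo_predicted = np.array([10.0, 20.0, 30.0, 40.0])

# Residuals follow the same convention as the plot: actual - predicted
resid_demo = resid_demo_actual - resid_demo_predicted
resid_demo_mean = resid_demo.mean()

# Linear association between predictions and residuals
resid_demo_corr = np.corrcoef(resid_demo_predicted, resid_demo)[0, 1]

print("Residuals:", resid_demo)
print(f"Mean residual: {resid_demo_mean:.3f}")
print(f"Correlation with predictions: {resid_demo_corr:.3f}")
```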

Now run the code below to see the final sets of graphs.

In [None]:
# Set up side-by-side subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Histogram with KDE
sns.histplot(residuals, kde=True, ax=axes[0])
axes[0].set_title('Distribution of Residuals')
axes[0].set_xlabel('Residuals')
axes[0].set_ylabel('Frequency')
axes[0].grid(True, linestyle='--', alpha=0.7)

# Q-Q plot
sm.qqplot(residuals, line='s', ax=axes[1])
axes[1].set_title('Q-Q Plot of Residuals')

plt.tight_layout()
plt.show()

Looking at the histogram above, the residuals should ideally form a bell-shaped distribution (approximately normal) centered around zero.  

Regarding the Q-Q plot, if the points largely follow the straight reference line, it suggests that the residuals are approximately normally distributed.  
Deviations from this line indicate departures from normality. Together, these graphs help confirm that our results are in line with what is expected of a well-behaved model.

To summarize, we now have our final model, and we have confirmed that the results align with what we expect from a well-performing model based on the residual trends. Lasso regression was chosen for this dataset due to its superior performance in this case. However, this does not imply that Ridge regression is an inferior model; depending on the dataset, Ridge may perform better. Therefore, it is always advisable to test multiple regression methods for any analysis.

Furthermore, if new data similar to this set becomes available, we can use our final model to predict the overall `biomass_Mg_ha`. This encapsulates the main objective of our work here.


## References

Gemini. (2025). Assistance in editing writeups and code. Retrieved from https://gemini.google.com