# Exercise: Deforestation Detection with Foundation Models

In this exercise, you'll apply foundation model embeddings to detect deforestation in the Amazon rainforest.

## Learning Objectives
- Load and visualize satellite imagery for a deforestation hotspot
- Extract and explore foundation model embeddings
- Apply unsupervised clustering to identify land cover patterns
- Train a supervised classifier to detect deforestation
- Evaluate model performance and interpret results

## Background: Why Deforestation Detection?

Deforestation in the Amazon has accelerated in recent years, with significant impacts on:
- **Climate Change**: The Amazon stores ~150-200 billion tons of carbon
- **Biodiversity**: Home to roughly 10% of Earth's known species
- **Indigenous Communities**: Millions depend on the forest
- **Regional Climate**: Affects rainfall patterns across South America

Traditional monitoring requires extensive field surveys. Foundation models enable:
- **Rapid Detection**: Identify changes quickly from satellite imagery
- **Scalable Monitoring**: Cover vast areas efficiently
- **Early Warning**: Detect clearing before it becomes widespread

## About the Study Area

We'll focus on the Brazilian state of Rondônia, which has experienced significant deforestation due to:
- Cattle ranching expansion
- Soy cultivation
- Logging operations
- Road construction opening new areas

You'll see a mix of intact forest, recent clearings, and agricultural land.

## Setup

First, install required packages and import helper functions.

In [None]:
# Install dependencies
%pip install numpy pandas matplotlib rasterio seaborn xarray pyproj dask rioxarray pystac-client planetary-computer scikit-learn pyarrow tqdm scipy leafmap -q

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import pandas as pd
import seaborn as sns

def in_colab():
    try:
        import google.colab
        return True
    except ImportError:
        return False

if in_colab():
    !git clone https://github.com/rohansaw/FoundationModels4EO.git
    %cd FoundationModels4EO
else:
    print("Running locally - skipping git clone")

# Import helper functions
from geo_helpers import (
    load_sentinel2_rgb_timeseries,
    load_foundation_model_embeddings,
    create_embedding_rgb,
    compute_clusters,
    predict_on_embeddings
)

from viz_helpers import (
    plot_rgb_timeseries,
    plot_embeddings_rgb,
    plot_clustering_results,
    plot_classification_results,
    show_study_area_map,
    plot_classification_vs_clustering
)

## Part 1: Define Study Area and Explore Imagery

**TODO #1**: Define the bounding box for your study area in Rondônia, Brazil.

We've selected a region known for active deforestation. The coordinates are provided below.

**Your Task**: 
- Run the cell to visualize the study area on an interactive map
- Examine the region and note:
  - Where do you see intact forest?
  - Where do you see cleared areas?
  - Can you identify roads or agricultural fields?

In [None]:
# Study area in Rondônia, Brazil - active deforestation zone
LON_MIN, LON_MAX = -63.65, -63.45
LAT_MIN, LAT_MAX = -10.85, -10.70
YEAR = 2024

# Visualize the study area
print("Study Area: Rondônia, Brazil")
show_study_area_map(LON_MIN, LON_MAX, LAT_MIN, LAT_MAX)
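
To get a feel for how large this bounding box is on the ground, you can convert the degree extents to kilometers. The sketch below is a rough approximation that assumes a spherical Earth and about 111.32 km per degree of latitude; the helper `bbox_size_km` is a hypothetical name and is not part of `geo_helpers`.

```python
import math

# Bounding box from the exercise (degrees)
LON_MIN, LON_MAX = -63.65, -63.45
LAT_MIN, LAT_MAX = -10.85, -10.70

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def bbox_size_km(lon_min, lon_max, lat_min, lat_max):
    """Approximate east-west and north-south extent of a small bounding box."""
    mid_lat = math.radians((lat_min + lat_max) / 2)
    height_km = (lat_max - lat_min) * KM_PER_DEG_LAT
    # One degree of longitude shrinks with the cosine of the latitude
    width_km = (lon_max - lon_min) * KM_PER_DEG_LAT * math.cos(mid_lat)
    return width_km, height_km


width_km, height_km = bbox_size_km(LON_MIN, LON_MAX, LAT_MIN, LAT_MAX)
print(f"Approximate extent: {width_km:.1f} km x {height_km:.1f} km")
print(f"Approximate area:   {width_km * height_km:.0f} km^2")
```

This flat approximation is adequate for a box of this size; for precise areas you would reproject the box into an equal-area coordinate system instead.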

### Load Satellite Imagery

**TODO #2**: Load Sentinel-2 RGB imagery for the dry season.

The dry season (May-September) in the Amazon is best for detecting deforestation because:
- Less cloud cover
- Cleared areas are more visible
- Burning activity peaks during this period

**Your Task**:
- Run the cell to load and visualize the RGB time series
- Examine the images across different months
- Can you spot any changes over time?

In [None]:
# Load Sentinel-2 RGB imagery for dry season
DRY_SEASON_MONTHS = [5, 6, 7, 8, 9]  # May-September

# TODO: Call load_sentinel2_rgb_timeseries with the study area bounds and dry season months
rgb_timeseries = load_sentinel2_rgb_timeseries(
    LON_MIN, LON_MAX, LAT_MIN, LAT_MAX, YEAR, DRY_SEASON_MONTHS
)

# Visualize the time series
plot_rgb_timeseries(rgb_timeseries, "Rondônia, Brazil", YEAR)

## Part 2: Load and Visualize Foundation Model Embeddings

**TODO #3**: Load the AlphaEarth embeddings for your study area.

Instead of working directly with raw satellite imagery (which has many spectral bands and many acquisition dates), we'll use pre-computed embeddings from Google's AlphaEarth Foundations model.

**Your Task**:
- Run the cell to load embeddings
- Note the shape of the embedding array
- Think about: What do these 64 dimensions represent?

In [None]:
# TODO: Load foundation model embeddings for the study area
embeddings = load_foundation_model_embeddings(LON_MIN, LON_MAX, LAT_MIN, LAT_MAX, YEAR)

print(f"Embedding shape: {embeddings.shape}")
print(f"  - {embeddings.shape[0]} feature dimensions")
print(f"  - {embeddings.shape[1]} x {embeddings.shape[2]} pixels")
print(f"\nEach pixel has a {embeddings.shape[0]}-dimensional feature vector that captures:")
print("  - Vegetation density and health")
print("  - Spectral signatures (colors/reflectance)")
print("  - Texture patterns")
print("  - Temporal dynamics")
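
Most scikit-learn tools expect a 2-D table with one row per sample, so a `(dimensions, height, width)` array has to be flattened into `(pixels, dimensions)` before clustering or classification. The sketch below shows that reshaping on a synthetic array; it assumes the same axis order that the print statements above imply, and `demo_embeddings` is a random stand-in rather than real data.

```python
import numpy as np

# Synthetic stand-in for the real embeddings: 64 dimensions, 40 x 50 pixels
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(64, 40, 50)).astype(np.float32)

n_dims, height, width = demo_embeddings.shape

# Move the feature axis last, then merge rows and columns into one pixel axis
pixel_matrix = np.moveaxis(demo_embeddings, 0, -1).reshape(-1, n_dims)
print(pixel_matrix.shape)  # (2000, 64)

# Row i of the matrix corresponds to pixel (i // width, i % width)
row, col = 12, 7
same_pixel = np.array_equal(
    pixel_matrix[row * width + col], demo_embeddings[:, row, col]
)
print(f"Pixel ({row}, {col}) round-trips correctly: {same_pixel}")
```

Reshaping a per-pixel result back with `.reshape(height, width)` restores the image layout.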

### Visualize Embeddings as False Color RGB

**TODO #4**: Create an RGB visualization from 3 embedding dimensions.

Since embeddings have 64 dimensions, we can't visualize them directly. We'll select 3 dimensions and map them to red, green, and blue channels.

**Your Task**:
- Run the cell with the default dimensions [0, 10, 20]
- **Experiment**: Try different dimension combinations to see what features they capture
  - Example: [5, 15, 25] or [2, 30, 50]
- What patterns do different dimensions highlight?

In [None]:
# TODO: Experiment with different dimension combinations
dimensions_to_visualize = [0, 10, 20]  # Try changing these!

# Create and plot RGB visualization
embedding_rgb = create_embedding_rgb(embeddings, bands=dimensions_to_visualize)
plot_embeddings_rgb(embedding_rgb, bands=dimensions_to_visualize)

**Reflection Questions**:
1. How does the embedding visualization compare to the RGB imagery?
2. Can you identify different land cover types (forest, cleared areas, water)?
3. Which dimension combination best highlights deforestation?
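
If you are curious how three embedding dimensions can become a displayable image, the sketch below shows one common approach: a per-channel percentile stretch to the range 0 to 1. This is an illustration only. The function `stretch_to_rgb` and its percentile defaults are assumptions, and `create_embedding_rgb` may scale the values differently.

```python
import numpy as np


def stretch_to_rgb(embeddings, dims, low_pct=2, high_pct=98):
    """Map three embedding dimensions to an (H, W, 3) image with values in [0, 1]."""
    rgb = np.stack([embeddings[d] for d in dims], axis=-1).astype(np.float64)
    for c in range(3):
        lo, hi = np.nanpercentile(rgb[..., c], [low_pct, high_pct])
        # Guard against a constant channel to avoid dividing by zero
        scale = hi - lo if hi > lo else 1.0
        rgb[..., c] = np.clip((rgb[..., c] - lo) / scale, 0.0, 1.0)
    return rgb


# Synthetic stand-in for a (64, H, W) embedding array
rng = np.random.default_rng(1)
demo_embeddings = rng.normal(size=(64, 30, 30))

demo_rgb = stretch_to_rgb(demo_embeddings, dims=[0, 10, 20])
print(demo_rgb.shape, demo_rgb.min(), demo_rgb.max())
```

Clipping at the 2nd and 98th percentiles keeps a few extreme pixels from washing out the contrast of the rest of the image.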

## Part 3: Unsupervised Clustering

**TODO #5**: Apply k-means clustering to identify natural groupings in the embeddings.

Clustering can reveal patterns without labeled data. Different clusters often correspond to:
- Intact forest
- Recently cleared areas
- Agricultural land
- Water bodies
- Bare soil

**Your Task**:
- Run clustering with the provided k values
- **Experiment**: Try different numbers of clusters (e.g., [3, 7, 15])
- Identify which clusters might represent:
  - Dense forest (dark green in RGB)
  - Deforested areas (lighter colors)
  - Other land cover types

In [None]:
# TODO: Experiment with different numbers of clusters
number_of_clusters_to_explore = [3, 5, 10]  # Try changing this!

# Compute clusters
print("Computing clusters... (this may take a moment)")
cluster_results = compute_clusters(embeddings, k_values=number_of_clusters_to_explore)

# Plot results
plot_clustering_results(cluster_results=cluster_results)

**Analysis Questions**:
1. What happens when you increase the number of clusters?
2. Can you identify a cluster that clearly separates forest from cleared areas?
3. With k=10, can you identify:
   - Primary forest clusters?
   - Deforestation clusters?
   - Agricultural clusters?
4. What are the limitations of unsupervised clustering for this task?
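
To make the clustering step less of a black box, here is a minimal sketch of k-means on pixel embeddings using scikit-learn. It runs on synthetic data with two obvious regions, and the variable names are illustrative; the actual `compute_clusters` helper may subsample pixels or use different settings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_dims, height, width = 64, 20, 20

# Synthetic embeddings: the right half of the image is shifted in feature space
demo = rng.normal(0.0, 0.1, size=(n_dims, height, width))
demo[:, :, width // 2:] += 1.0

# One row per pixel, one column per embedding dimension
pixels = np.moveaxis(demo, 0, -1).reshape(-1, n_dims)

demo_clusters = {}
for k in [2, 3]:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(pixels)
    demo_clusters[k] = labels.reshape(height, width)  # back to image layout
    print(f"k={k}: inertia={km.inertia_:.1f}")
```

Cluster IDs are arbitrary: cluster 0 in one run is not guaranteed to mean the same thing in another run or for another k. That is one reason a human still has to interpret each cluster.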

## Part 4: Supervised Classification - Load Labeled Data

**TODO #6**: Load pre-labeled training and test data.

For supervised learning, we need labeled examples. We've prepared a dataset with:
- **Forest**: Intact forest areas
- **Deforested**: Recently cleared areas
- **Non-forest**: Agricultural fields, water, urban areas, etc.

The data was labeled using:
- Visual interpretation of high-resolution imagery
- Forest change detection datasets
- Field validation where available

**Your Task**:
- Run the cell to load the CSV with embeddings and labels
- Examine the class distribution
- Understand the train/test split

In [None]:
# Load pre-extracted embeddings with labels
# The CSV contains AlphaEarth embeddings for labeled points in Rondônia

samples_rondonia = pd.read_csv("demo_data/rondonia_deforestation_2024.csv")
print(f"Loaded {len(samples_rondonia)} samples with embeddings")
print(f"\nClasses: {samples_rondonia['class_name'].value_counts().to_dict()}")
print("\nFirst few rows:")
print(samples_rondonia[['longitude', 'latitude', 'class_name', 'label']].head())

### Prepare Training and Test Splits

**TODO #7**: Create balanced train and test sets.

We'll use a small training set to demonstrate the power of foundation model embeddings.

**Your Task**:
- Adjust the number of training and test samples per class
- Run the cell to create the splits
- Note how few samples we need for training!

In [None]:
# Prepare training and test splits
# TODO: Experiment with different numbers of training samples

NUM_SAMPLES = {
    'forest': {'label': 0, 'n_train': 20, 'n_test': 100},
    'deforested': {'label': 1, 'n_train': 20, 'n_test': 100},
    'non-forest': {'label': 2, 'n_train': 20, 'n_test': 100}
}

from geo_helpers import prepare_csv_samples, show_split_statistics

X_train, y_train, X_test, y_test, train_idx, test_idx = \
    prepare_csv_samples(samples_rondonia, NUM_SAMPLES, random_seed=42)

show_split_statistics(samples_rondonia, train_idx, test_idx)
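
The idea behind a balanced split is simple: shuffle the samples of each class and take a fixed number for training and a disjoint set for testing. The sketch below illustrates that idea with pandas on a toy table. `sample_per_class` is a hypothetical function, not the implementation of `prepare_csv_samples`.

```python
import numpy as np
import pandas as pd


def sample_per_class(df, n_train, n_test, label_col="label", seed=42):
    """Draw disjoint train/test index sets with fixed counts per class."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label, group in df.groupby(label_col):
        if len(group) < n_train + n_test:
            raise ValueError(f"Class {label} has only {len(group)} samples")
        shuffled = rng.permutation(group.index.to_numpy())
        train_idx.extend(shuffled[:n_train])
        test_idx.extend(shuffled[n_train:n_train + n_test])
    return np.array(train_idx), np.array(test_idx)


# Toy table: 50 samples for each of three classes
demo_df = pd.DataFrame({"label": np.repeat([0, 1, 2], 50)})
demo_train_idx, demo_test_idx = sample_per_class(demo_df, n_train=5, n_test=20)

print("Train:", demo_df.loc[demo_train_idx, "label"].value_counts().to_dict())
print("Test: ", demo_df.loc[demo_test_idx, "label"].value_counts().to_dict())
```

A purely random draw like this ignores where the samples are located, which matters for the spatial questions in the next step.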

### Visualize Sample Locations

**TODO #8**: Examine the spatial distribution of training and test samples.

Understanding where your samples come from is critical for:
- Detecting spatial bias
- Understanding spatial autocorrelation
- Ensuring representative coverage

**Your Task**:
- Run the cells to create interactive maps
- Examine the spatial distribution
- Are train and test samples well-separated geographically?

In [None]:
# Prepare sample coordinates for visualization
from geo_helpers import prepare_sample_coordinates
from viz_helpers import plot_sample_map_by_class, plot_sample_map_by_split

samples_rondonia = prepare_sample_coordinates(samples_rondonia, train_idx, test_idx)

In [None]:
# Visualize all samples by land cover class
m_all = plot_sample_map_by_class(samples_rondonia, center_lat=-10.77, center_lon=-63.55, zoom=8)
m_all

In [None]:
# Visualize train vs test split
m_split = plot_sample_map_by_split(samples_rondonia, center_lat=-10.77, center_lon=-63.55, zoom=8)
m_split
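
If train and test samples turn out to be interleaved on the map, one way to separate them is a spatial block split: assign every sample to a grid cell and keep whole cells together on one side of the split. The sketch below is an assumption-laden illustration on random points. The 0.05-degree block size and the column names are hypothetical choices, and this is not how the exercise's split was created.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Random points inside the study area as a stand-in for labeled samples
rng = np.random.default_rng(0)
demo_samples = pd.DataFrame({
    "longitude": rng.uniform(-63.65, -63.45, size=300),
    "latitude": rng.uniform(-10.85, -10.70, size=300),
    "label": rng.integers(0, 3, size=300),
})

BLOCK_SIZE_DEG = 0.05  # hypothetical block size, roughly 5 km at this latitude

# Assign each sample to a grid block based on its coordinates
col = np.floor((demo_samples["longitude"] - demo_samples["longitude"].min()) / BLOCK_SIZE_DEG).astype(int)
row = np.floor((demo_samples["latitude"] - demo_samples["latitude"].min()) / BLOCK_SIZE_DEG).astype(int)
demo_samples["block"] = row.astype(str) + "_" + col.astype(str)

# Split by block so that no block contributes to both train and test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_pos, test_pos = next(splitter.split(demo_samples, groups=demo_samples["block"]))

train_blocks = set(demo_samples["block"].iloc[train_pos])
test_blocks = set(demo_samples["block"].iloc[test_pos])
print(f"{len(train_blocks)} training blocks, {len(test_blocks)} test blocks")
print(f"Blocks shared between train and test: {len(train_blocks & test_blocks)}")
```

Neighboring blocks still touch along their edges, so a stricter setup would also drop samples within a buffer distance of a block boundary.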

### Train the Classifier

**TODO #9**: Train a Random Forest classifier on your labeled samples.

**Your Task**:
- Run the training cell
- The model will learn to distinguish forest, deforested, and non-forest areas
- Note the training set size: foundation model embeddings work with few samples!

In [None]:
# TODO: Train a Random Forest classifier
print("Training Random Forest classifier...")
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(X_train, y_train)

# Evaluate on test set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Training complete!")
print(f"\nModel trained on {len(X_train)} samples")
print(f"Test Accuracy: {accuracy:.2%}")

# Show detailed results
class_names = ['forest', 'deforested', 'non-forest']
plot_classification_results(y_test, y_pred, accuracy, class_names)

**Analysis Questions**:
1. How does the model perform on the test set?
2. Which classes are most confused with each other?
3. What might cause classification errors?
4. How would you improve the model?
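
Overall accuracy can hide which classes are confused with each other. The sketch below computes a confusion matrix and per-class precision and recall with scikit-learn. The labels are hand-made for illustration and are not results from this exercise; to analyze your own model, substitute `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

demo_class_names = ["forest", "deforested", "non-forest"]

# Hand-made labels for illustration only (not results from the exercise)
demo_y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
demo_y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(demo_y_true, demo_y_pred, labels=[0, 1, 2])
print(cm)

# Recall: share of each true class that was found
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: share of each predicted class that was correct
precision = cm.diagonal() / cm.sum(axis=0)

for name, p, r in zip(demo_class_names, precision, recall):
    print(f"{name:>11}: precision={p:.2f}, recall={r:.2f}")

print(classification_report(demo_y_true, demo_y_pred, target_names=demo_class_names))
```

For deforestation monitoring, low recall on the deforested class means missed clearings, while low precision means false alarms, so it helps to look at both.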

## Part 5: Apply Classifier to Full Study Area

**TODO #10**: Use the trained classifier to predict land cover for every pixel.

Now we'll apply the model to classify the entire study area into:
- Forest (green)
- Deforested (red)
- Non-forest (yellow)

**Your Task**:
- Run the prediction cell
- Examine the output map
- Compare with the original RGB imagery and clustering results

In [None]:
# TODO: Apply classifier to all pixels in the study area
print("Predicting land cover for entire study area...")
predictions = predict_on_embeddings(clf, embeddings)

print("Prediction complete!")
print("\nLand cover distribution:")
class_names = ['Forest', 'Deforested', 'Non-forest']
for label, name in enumerate(class_names):
    count = np.sum(predictions == label)
    pct = 100 * count / predictions.size
    print(f"  {name}: {pct:.1f}%")

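# Visualize predictions with one explicit color per class label
# (0 = Forest -> green, 1 = Deforested -> red, 2 = Non-forest -> yellow).
# A continuous colormap such as 'RdYlGn' would draw label 0 (Forest) in red.
from matplotlib.colors import ListedColormap

class_cmap = ListedColormap(['green', 'red', 'yellow'])
plt.figure(figsize=(12, 10))
plt.imshow(predictions, cmap=class_cmap, vmin=-0.5, vmax=2.5, interpolation='nearest')
cbar = plt.colorbar(label='Land Cover Class', ticks=[0, 1, 2])
cbar.ax.set_yticklabels(class_names)
plt.title('Predicted Land Cover\n(Green=Forest, Red=Deforested, Yellow=Non-forest)', 
          fontsize=14, fontweight='bold')
plt.axis('off')
plt.tight_layout()
plt.show()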
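
Predicting for every pixel comes down to three steps: flatten the embedding array into one row per pixel, call the classifier, and reshape the labels back into an image. The sketch below shows one way to do this while skipping pixels that contain missing values. `predict_raster` and its `nodata_value` parameter are hypothetical; the real `predict_on_embeddings` helper may handle missing data differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def predict_raster(model, embeddings, nodata_value=-1):
    """Classify every valid pixel of a (C, H, W) array; returns an (H, W) label map."""
    n_dims, height, width = embeddings.shape
    pixels = np.moveaxis(embeddings, 0, -1).reshape(-1, n_dims)
    # Only pixels whose features are all finite are passed to the model
    valid = np.isfinite(pixels).all(axis=1)
    labels = np.full(pixels.shape[0], nodata_value, dtype=np.int64)
    if valid.any():
        labels[valid] = model.predict(pixels[valid])
    return labels.reshape(height, width)


# Toy classifier trained on three well-separated synthetic classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 8)) + np.repeat([0, 3, 6], 20)[:, None]
y_demo = np.repeat([0, 1, 2], 20)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Toy (8, 5, 6) raster with one pixel of missing data
demo_raster = rng.normal(size=(8, 5, 6)) + 3.0
demo_raster[:, 0, 0] = np.nan

demo_map = predict_raster(demo_clf, demo_raster)
print(demo_map)
```

For very large areas the same function can be applied tile by tile to keep memory use bounded.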

### Compare Supervised vs Unsupervised Results

**TODO #11**: Compare your supervised classification with the clustering results.

**Your Task**:
- Run the comparison visualization
- Analyze the differences:
  - Where do they agree?
  - Where do they disagree?
  - Which approach better captures deforestation patterns?
  - Does the supervised model show clearer boundaries?

In [None]:
# Compare classification with clustering
print("Comparing supervised classification with unsupervised clustering...")

plot_classification_vs_clustering(
    embeddings,
    None,
    predictions,
    {5: cluster_results[5]},  # Use k=5 clustering (5 must be in number_of_clusters_to_explore)
    class_names=class_names
)
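
Beyond the visual comparison, a cross-tabulation shows how each cluster is distributed over the predicted classes. The sketch below uses random label maps as stand-ins. It assumes that both results are `(height, width)` integer arrays of the same shape, which you should verify for `predictions` and `cluster_results[5]` before adapting it.

```python
import numpy as np
import pandas as pd

demo_class_names = ["Forest", "Deforested", "Non-forest"]

# Random stand-ins for a classification map and a k=5 cluster map
rng = np.random.default_rng(0)
demo_predictions = rng.integers(0, 3, size=(20, 20))
demo_cluster_map = rng.integers(0, 5, size=(20, 20))

# Each row shows the share of one cluster's pixels assigned to each class
table = pd.crosstab(
    pd.Series(demo_cluster_map.ravel(), name="cluster"),
    pd.Series(demo_predictions.ravel(), name="class"),
    normalize="index",
)
table.columns = [demo_class_names[c] for c in table.columns]
print(table.round(2))
```

With real data, a cluster whose row is dominated by one class agrees well with the classifier, while a row spread evenly across classes marks a cluster that mixes land cover types.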

## Part 6: Analysis and Interpretation

**TODO #12**: Analyze your results and reflect on the exercise.

**Discussion Questions**:

1. **Model Performance**:
   - How well does your classifier identify deforested areas?
   - What types of errors do you observe (false positives/negatives)?
   - How could you improve the model?
   - What role does the foundation model play in achieving good results with few samples?

2. **Training Data**:
   - How many samples did you use for each class?
   - Was it enough? How would more samples help?
   - Why is it important to examine the spatial distribution of samples?
   - What is spatial autocorrelation and why does it matter?

3. **Supervised vs Unsupervised**:
   - What are the advantages of supervised classification?
   - When might clustering be more useful?
   - Could you combine both approaches?
   - Which method gives more interpretable results for deforestation monitoring?

4. **Real-World Application**:
   - How could this approach be used for forest monitoring?
   - What additional data would improve accuracy?
   - What are the limitations and ethical considerations?
   - How would you validate predictions in the field?

5. **Foundation Models**:
   - How do pre-trained embeddings help with limited training data?
   - What would change if you worked with raw satellite bands instead?
   - What biases might exist in the foundation model?
   - Could this model generalize to other regions (e.g., Congo Basin, Southeast Asia)?

### Important Lessons:

1. **Foundation models enable rapid prototyping** with limited training data (20 samples per class!)
2. **Pre-labeled datasets** accelerate development but require quality control
3. **Unsupervised clustering** can reveal patterns without labels
4. **Spatial autocorrelation** matters - train/test samples should be geographically separated
5. **Model validation** requires careful consideration of test data distribution
6. **Confusion matrices** reveal which classes are hardest to distinguish

### Next Steps:

- Explore the companion notebook on crop classification to see more advanced workflows
- Try this exercise on different deforestation regions (Congo Basin, Southeast Asia)
- Investigate time series analysis for change detection
- Learn about active learning to optimize sample collection
- Experiment with different foundation models and compare results

### Resources:

- [Global Forest Watch](https://www.globalforestwatch.org/): Real-time deforestation monitoring
- [Google Earth Engine](https://earthengine.google.com/): Petabyte-scale geospatial analysis
- [AlphaEarth Foundations](https://arxiv.org/pdf/2507.22291): Paper describing the foundation model
- [Sentinel-2 Documentation](https://sentinels.copernicus.eu/web/sentinel/missions/sentinel-2): Satellite mission details
- [MapBiomas](https://mapbiomas.org/): Annual land cover maps for Brazil