# Predicting Municipal Sustainable Development Index (IMDS) Using Satellite Embeddings

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/quarcs-lab/ds4bolivia/blob/master/notebooks/predict_imds_rf.ipynb)

## Overview

This notebook demonstrates how to predict the **IMDS (Índice Municipal de Desarrollo Sostenible)**, a composite index that aggregates all Sustainable Development Goal (SDG) indicators into a single municipal development score. The predictors are 64-dimensional satellite imagery embeddings from Google Earth Engine.

### Learning Objectives

By the end of this notebook, you will be able to:

1. Load and merge multiple datasets for machine learning
2. Prepare features and target variables for regression
3. Train a Random Forest model with cross-validation
4. Evaluate model performance using appropriate metrics
5. Analyze feature importance from satellite embeddings
6. Interpret prediction errors and identify patterns

### About the IMDS

The IMDS provides a comprehensive measure of sustainable development at the municipal level in Bolivia. It combines indicators across all 17 SDGs into a single normalized score (0-100), making it valuable for:

- Comparing overall development across municipalities
- Tracking progress toward sustainable development
- Identifying areas requiring targeted interventions

### Research Question

**Can satellite imagery features predict overall municipal sustainable development in Bolivia?**

---

## 1. Setup and Libraries

First, we import all necessary libraries for data manipulation, machine learning, and visualization.

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Libraries loaded successfully!")

## 2. Data Loading

We load three datasets from the DS4Bolivia repository:

1. **SDG Data**: Contains the IMDS (our target variable) and other SDG indices
2. **Satellite Embeddings**: 64-dimensional feature vectors derived from Google Earth Engine
3. **Region Names**: Municipality and department names for interpretation

In [None]:
# Define data URLs from the DS4Bolivia repository
REPO_URL = "https://raw.githubusercontent.com/quarcs-lab/ds4bolivia/master"

url_sdg = f"{REPO_URL}/sdg/sdg.csv"
url_emb = f"{REPO_URL}/satelliteEmbeddings/satelliteEmbeddings2017.csv"
url_names = f"{REPO_URL}/regionNames/regionNames.csv"

# Load datasets
print("Loading datasets...")
df_sdg = pd.read_csv(url_sdg)
df_embeddings = pd.read_csv(url_emb)
df_names = pd.read_csv(url_names)

print(f"SDG data loaded: {len(df_sdg)} municipalities")
print(f"Satellite embeddings loaded: {len(df_embeddings)} municipalities")
print(f"Region names loaded: {len(df_names)} municipalities")

### 2.1 Explore the SDG Data

Let's examine the structure of our SDG dataset and understand the IMDS variable.

In [None]:
# Display SDG dataset structure
print("SDG Dataset Columns:")
print(df_sdg.columns.tolist())
print(f"\nShape: {df_sdg.shape}")

In [None]:
# Examine the IMDS variable
print("IMDS (Municipal Sustainable Development Index) Statistics:")
print(df_sdg['imds'].describe())

### 2.2 Explore the Satellite Embeddings

The satellite embeddings are 64-dimensional vectors (A00 to A63) derived from satellite imagery using deep learning models in Google Earth Engine.
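As a minimal sketch, assuming the embedding table follows the `A00` to `A63` naming described above, a quick check can confirm that all 64 columns are present and numeric before modelling. The toy frame and its `asdf_id` values are invented for illustration; with the real data, `toy` would be replaced by `df_embeddings`.

```python
import numpy as np
import pandas as pd

# Expected embedding column names: A00, A01, ..., A63.
expected_cols = [f"A{i:02d}" for i in range(64)]

# Toy stand-in for df_embeddings (random values, 3 hypothetical municipalities).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(3, 64)), columns=expected_cols)
toy.insert(0, "asdf_id", [101, 102, 103])

# Columns that are absent, or present but not numeric, would break the feature matrix.
missing = [c for c in expected_cols if c not in toy.columns]
non_numeric = [c for c in expected_cols if not pd.api.types.is_numeric_dtype(toy[c])]

print(f"Missing columns: {missing}")
print(f"Non-numeric columns: {non_numeric}")
```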

In [None]:
# Display embedding structure
print("Satellite Embeddings Columns:")
print(df_embeddings.columns.tolist())
print(f"\nShape: {df_embeddings.shape}")
print(f"\nFirst few rows:")
df_embeddings.head()

## 3. Data Preparation

We merge our datasets using `asdf_id` as the common identifier and prepare our features (X) and target variable (y).
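Before relying on a merge, it can help to confirm that `asdf_id` is unique in each table and to see which IDs fail to match. The sketch below uses two small invented tables rather than the real data, and relies on the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are invented for illustration.
sdg = pd.DataFrame({"asdf_id": [1, 2, 3], "imds": [45.0, 52.5, 38.2]})
emb = pd.DataFrame({"asdf_id": [1, 2, 4], "A00": [0.1, 0.2, 0.3]})

# validate="one_to_one" raises MergeError if asdf_id is duplicated on either side.
# indicator=True adds a _merge column recording which table each row came from.
check = sdg.merge(emb, on="asdf_id", how="outer",
                  validate="one_to_one", indicator=True)

# Rows not found in both tables are the ones an inner join would silently drop.
unmatched = check[check["_merge"] != "both"]
unmatched_ids = sorted(unmatched["asdf_id"].tolist())

print(check["_merge"].value_counts())
print(f"IDs an inner join would drop: {unmatched_ids}")
```

An outer join is used here only for diagnosis; the analysis itself can still use an inner join once the unmatched IDs are understood.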

In [None]:
# Merge datasets
# Step 1: Merge SDG data (with IMDS) with satellite embeddings
df_merged = df_sdg[['asdf_id', 'imds']].merge(
    df_embeddings,
    on='asdf_id',
    how='inner'
)

# Step 2: Add region names for interpretation
df_merged = df_merged.merge(
    df_names[['asdf_id', 'mun', 'dep']],
    on='asdf_id',
    how='left'
)

print(f"Merged dataset shape: {df_merged.shape}")
print(f"Columns: {df_merged.columns.tolist()}")

In [None]:
# Check for missing values in IMDS
missing_imds = df_merged['imds'].isna().sum()
print(f"Missing IMDS values: {missing_imds}")

# Remove rows with missing IMDS (if any)
df_clean = df_merged.dropna(subset=['imds']).copy()
print(f"Valid municipalities for analysis: {len(df_clean)}")

In [None]:
# Prepare features (X) and target (y)
# Features are the 64 satellite embedding dimensions (A00 to A63)
embedding_cols = [f'A{i:02d}' for i in range(64)]

X = df_clean[embedding_cols].values
y = df_clean['imds'].values

print(f"Feature matrix shape: {X.shape}")
print(f"Target variable shape: {y.shape}")
print(f"\nIMDS range: [{y.min():.2f}, {y.max():.2f}]")
print(f"IMDS mean: {y.mean():.2f} ± {y.std():.2f}")

### 3.1 Visualize IMDS Distribution

Let's understand the distribution of our target variable.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Histogram
axes[0].hist(y, bins=25, edgecolor='black', alpha=0.7)
axes[0].axvline(x=y.mean(), color='red', linestyle='--', label=f'Mean: {y.mean():.2f}')
axes[0].set_xlabel('IMDS Score')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of IMDS')
axes[0].legend()

# Box plot by department
df_clean.boxplot(column='imds', by='dep', ax=axes[1], rot=45)
axes[1].set_xlabel('Department')
axes[1].set_ylabel('IMDS Score')
axes[1].set_title('IMDS by Department')
plt.suptitle('')  # Remove automatic title

plt.tight_layout()
plt.show()

## 4. Train-Test Split

We split our data into training (80%) and test (20%) sets. The test set will be used for final model evaluation.
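The split used in this notebook is a plain random split. As an optional variation, and only a sketch under assumed synthetic data, a continuous target can be binned into quantiles and passed to `stratify` so that the training and test sets cover a similar range of scores. The arrays below are invented; the sample count of 339 matches the number of municipalities mentioned later in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Invented IMDS-like scores and features, for illustration only.
y_demo = rng.normal(loc=50, scale=8, size=339)
X_demo = rng.normal(size=(339, 4))

# Bin the continuous target into quintiles (integer labels 0-4).
y_bins = pd.qcut(y_demo, q=5, labels=False)

# Stratify on the bins so each quintile is represented in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_bins, random_state=42
)

print(f"Train: {len(X_tr)} rows, mean target {y_tr.mean():.2f}")
print(f"Test:  {len(X_te)} rows, mean target {y_te.mean():.2f}")
```

Whether stratification changes the results on the real data is an empirical question; with a few hundred samples it mainly reduces the chance of an unrepresentative test set.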

In [None]:
# Split data into training and test sets
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df_clean.index,
    test_size=0.2,
    random_state=RANDOM_STATE
)

print(f"Training set: {len(X_train)} municipalities ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test set: {len(X_test)} municipalities ({len(X_test)/len(X)*100:.1f}%)")

## 5. Model Configuration

We use a **Random Forest Regressor** for this prediction task. Random Forests are well-suited for:

- High-dimensional data (64 features)
- Capturing non-linear relationships
- Providing feature importance rankings
- Handling potential interactions between features

### Hyperparameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| n_estimators | 100 | Number of trees in the forest |
| max_depth | 20 | Maximum depth of each tree |
| min_samples_split | 5 | Minimum samples to split a node |
| min_samples_leaf | 2 | Minimum samples at leaf nodes |
| max_features | sqrt | Features considered per split (√64 = 8) |

In [None]:
# Configure Random Forest model
model_params = {
    'n_estimators': 100,
    'max_depth': 20,
    'min_samples_split': 5,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'random_state': RANDOM_STATE,
    'n_jobs': -1  # Use all available cores
}

# Initialize the model
rf_model = RandomForestRegressor(**model_params)

print("Random Forest model configured with:")
for param, value in model_params.items():
    if param not in ['random_state', 'n_jobs']:
        print(f"  - {param}: {value}")

## 6. Cross-Validation

Before training our final model, we perform **5-fold cross-validation** to get a reliable estimate of model performance. This technique:

1. Splits the training data into 5 roughly equal parts (folds)
2. Trains on 4 folds and validates on the remaining fold
3. Repeats 5 times, using each fold as validation once
4. Provides mean and standard deviation of performance
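The four steps above can be written out by hand to make the mechanics visible. This is a minimal sketch on synthetic data, with ordinary least squares standing in for the Random Forest so that only NumPy is needed. It is not the notebook's model, and the helper `r2` is defined here for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data (invented): 100 samples, 3 features.
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Step 1: shuffle the row indices and split them into 5 folds.
folds = np.array_split(rng.permutation(len(X_demo)), 5)

scores = []
for k, val_idx in enumerate(folds):
    # Step 2: train on the other 4 folds, validate on fold k.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    A_train = np.column_stack([np.ones(len(train_idx)), X_demo[train_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y_demo[train_idx], rcond=None)
    A_val = np.column_stack([np.ones(len(val_idx)), X_demo[val_idx]])
    scores.append(r2(y_demo[val_idx], A_val @ coef))

# Steps 3-4: each fold has served as validation once; summarize the scores.
scores = np.array(scores)
print(f"Mean CV R²: {scores.mean():.4f} (±{scores.std():.4f})")
```

`cross_val_score` with a shuffled `KFold` performs the same loop internally, refitting a fresh copy of the estimator for every fold.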

In [None]:
# Setup cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

# Perform cross-validation
print("Performing 5-fold cross-validation...")
cv_scores = cross_val_score(rf_model, X_train, y_train, cv=cv, scoring='r2')

print("\nCross-validation results:")
for i, score in enumerate(cv_scores, 1):
    print(f"  Fold {i}: R² = {score:.4f}")

print(f"\nMean CV R²: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")

In [None]:
# Visualize cross-validation scores
plt.figure(figsize=(8, 4))
plt.bar(range(1, len(cv_scores)+1), cv_scores, edgecolor='black', alpha=0.7)
plt.axhline(y=cv_scores.mean(), color='red', linestyle='--', lw=2, 
            label=f'Mean: {cv_scores.mean():.4f}')
plt.xlabel('Fold', fontsize=11)
plt.ylabel('R² Score', fontsize=11)
plt.title('Cross-Validation R² Scores', fontsize=12, fontweight='bold')
plt.xticks(range(1, len(cv_scores)+1))
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

## 7. Model Training and Evaluation

Now we train the model on the full training set and evaluate it on the held-out test set.

In [None]:
# Train the model on the full training set
print("Training Random Forest model...")
rf_model.fit(X_train, y_train)
print("Training complete!")

In [None]:
# Make predictions
y_train_pred = rf_model.predict(X_train)
y_test_pred = rf_model.predict(X_test)

# Calculate metrics
# Training set
train_r2 = r2_score(y_train, y_train_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)

# Test set
test_r2 = r2_score(y_test, y_test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_mae = mean_absolute_error(y_test, y_test_pred)

print("=" * 50)
print("MODEL PERFORMANCE METRICS")
print("=" * 50)
print(f"\nTraining Set ({len(X_train)} municipalities):")
print(f"  R² Score:  {train_r2:.4f}")
print(f"  RMSE:      {train_rmse:.4f} IMDS points")
print(f"  MAE:       {train_mae:.4f} IMDS points")

print(f"\nTest Set ({len(X_test)} municipalities):")
print(f"  R² Score:  {test_r2:.4f}")
print(f"  RMSE:      {test_rmse:.4f} IMDS points")
print(f"  MAE:       {test_mae:.4f} IMDS points")
print("=" * 50)

### 7.1 Understanding the Metrics

| Metric | Description | Interpretation |
|--------|-------------|----------------|
| **R²** | Proportion of variance explained | 0.23 means 23% of IMDS variance is explained by satellite features |
| **RMSE** | Root Mean Squared Error | Square root of the mean squared error, in IMDS points; penalizes large errors more heavily than MAE |
| **MAE** | Mean Absolute Error | Average absolute prediction error in IMDS points |

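**Note**: The large gap between training R² (~0.82) and test R² (~0.23) indicates overfitting: the model fits the training municipalities far better than it generalizes to unseen ones. This is common with Random Forests on small datasets, so the test and cross-validation scores are the ones to report.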
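To make the three metrics concrete, the sketch below computes them directly from their definitions on four invented score pairs. The numbers are illustrative only and are unrelated to the model's results.

```python
import numpy as np

# Invented actual and predicted scores, for illustration only.
y_true = np.array([40.0, 50.0, 60.0, 70.0])
y_hat = np.array([45.0, 50.0, 55.0, 60.0])

residuals = y_true - y_hat                       # [-5, 0, 5, 10]

mae = np.mean(np.abs(residuals))                 # (5 + 0 + 5 + 10) / 4
rmse = np.sqrt(np.mean(residuals ** 2))          # sqrt((25 + 0 + 25 + 100) / 4)

ss_res = np.sum(residuals ** 2)                  # unexplained variation: 150
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean: 500
r2 = 1.0 - ss_res / ss_tot

print(f"MAE  = {mae:.2f}")    # 5.00
print(f"RMSE = {rmse:.2f}")   # 6.12
print(f"R²   = {r2:.2f}")     # 0.70
```

RMSE exceeds MAE here because the single 10-point miss is squared before averaging. R² can also be negative when a model predicts worse than simply using the mean of the target.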

### 7.2 Visualize Predictions

In [None]:
# Calculate prediction errors
test_errors = y_test - y_test_pred

# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Actual vs Predicted
axes[0].scatter(y_test, y_test_pred, alpha=0.6, edgecolors='k', linewidth=0.5)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, label='Perfect prediction')
axes[0].set_xlabel('Actual IMDS', fontsize=11)
axes[0].set_ylabel('Predicted IMDS', fontsize=11)
axes[0].set_title(f'Actual vs Predicted IMDS\nTest R² = {test_r2:.4f}', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# 2. Residual plot
axes[1].scatter(y_test_pred, test_errors, alpha=0.6, edgecolors='k', linewidth=0.5)
axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1].set_xlabel('Predicted IMDS', fontsize=11)
axes[1].set_ylabel('Residuals (Actual - Predicted)', fontsize=11)
axes[1].set_title('Residual Plot', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# 3. Error distribution
axes[2].hist(test_errors, bins=15, edgecolor='black', alpha=0.7)
axes[2].axvline(x=0, color='r', linestyle='--', lw=2)
axes[2].set_xlabel('Prediction Error', fontsize=11)
axes[2].set_ylabel('Frequency', fontsize=11)
axes[2].set_title(f'Distribution of Errors\nMAE = {test_mae:.2f}', fontsize=12, fontweight='bold')
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 8. Feature Importance Analysis

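Random Forests provide built-in, impurity-based feature importance scores. Each score reflects how much the splits on that feature reduce variance in the training data, averaged over all trees and normalized so that the scores sum to 1. Because they are computed on training data, these scores describe what the fitted model uses, not necessarily what generalizes.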
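A common complement to the built-in scores is permutation importance, which measures how much a held-out score drops when one feature column is shuffled. The sketch below is an illustration on synthetic data, not part of the original analysis. It uses `sklearn.inspection.permutation_importance`, and only the first two of six invented features drive the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data (invented): only features 0 and 1 affect the target.
X_demo = rng.normal(size=(300, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Built-in impurity-based importances (computed from the training data).
mdi = model.feature_importances_

# Permutation importance: mean drop in held-out R² when a column is shuffled.
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)

for i in np.argsort(perm.importances_mean)[::-1]:
    print(f"feature {i}: impurity = {mdi[i]:.3f}, "
          f"permutation = {perm.importances_mean[i]:.3f} "
          f"± {perm.importances_std[i]:.3f}")
```

If the two rankings disagree strongly on real data, the permutation scores on held-out data are usually the safer guide to which features matter for prediction.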

In [None]:
# Get feature importances
importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]

# Create feature importance dataframe
feature_importance_df = pd.DataFrame({
    'feature': embedding_cols,
    'importance': importances,
    'rank': np.argsort(np.argsort(importances)[::-1]) + 1
}).sort_values('importance', ascending=False)

# Display top 20 features
print("TOP 20 MOST IMPORTANT FEATURES:")
print("=" * 50)
cumulative = 0
for i in range(20):
    idx = indices[i]
    cumulative += importances[idx]
    print(f"{i+1:2d}. {embedding_cols[idx]}: {importances[idx]:.4f} (Cumulative: {cumulative*100:.1f}%)")

In [None]:
# Calculate cumulative importance
cumsum = np.cumsum(importances[indices])
n_features_80 = np.argmax(cumsum >= 0.80) + 1

print(f"\nFeatures needed for 80% importance: {n_features_80}/64 ({n_features_80/64*100:.1f}%)")

In [None]:
# Visualize feature importance
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# 1. Top 20 feature importances
top_n = 20
top_indices = indices[:top_n]
top_features = [embedding_cols[i] for i in top_indices]
top_importances = importances[top_indices]

axes[0].barh(range(top_n), top_importances, edgecolor='black')
axes[0].set_yticks(range(top_n))
axes[0].set_yticklabels(top_features)
axes[0].set_xlabel('Importance', fontsize=11)
axes[0].set_title('Top 20 Feature Importances', fontsize=12, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(True, alpha=0.3, axis='x')

# 2. Cumulative importance
axes[1].plot(range(1, len(cumsum) + 1), cumsum, linewidth=2, marker='o', markersize=3)
axes[1].axhline(y=0.8, color='r', linestyle='--', lw=2, label='80% threshold')
axes[1].axvline(x=n_features_80, color='g', linestyle='--', lw=2, label=f'{n_features_80} features')
axes[1].set_xlabel('Number of Features', fontsize=11)
axes[1].set_ylabel('Cumulative Importance', fontsize=11)
axes[1].set_title('Cumulative Feature Importance', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Key Insight: Distributed Feature Importance

Unlike more specific indicators (e.g., extreme energy poverty), the IMDS requires many features to capture its variance. This makes sense because IMDS is a composite index combining diverse dimensions of development that manifest in different ways in satellite imagery.

## 9. Prediction Error Analysis

Understanding where the model makes large errors helps us identify patterns and limitations of satellite-based predictions.

In [None]:
# Create results dataframe for test set
# idx_test holds index labels from df_clean (not row positions), so select with .loc
test_results = df_clean.loc[idx_test].copy()
test_results['imds_actual'] = y_test
test_results['imds_predicted'] = y_test_pred
test_results['error'] = test_errors
test_results['abs_error'] = np.abs(test_errors)

print(f"Test results dataframe created with {len(test_results)} municipalities")

### 9.1 Overpredicted Municipalities

These are municipalities where the model predicts **higher** development than the actual IMDS. The satellite features suggest more development than actually exists.

In [None]:
# Top 10 overpredicted (model predicts higher than actual)
overpredicted = test_results.nsmallest(10, 'error')

print("TOP 10 OVERPREDICTED MUNICIPALITIES")
print("(Model predicts higher development than actual)")
print("=" * 70)

for _, row in overpredicted.iterrows():
    print(f"\n{row['mun']}, {row['dep']}")
    print(f"  Actual: {row['imds_actual']:.2f} | Predicted: {row['imds_predicted']:.2f} | Error: {row['error']:.2f}")

### 9.2 Underpredicted Municipalities

These are municipalities where the model predicts **lower** development than the actual IMDS. These areas achieve better development outcomes than their satellite features suggest.

In [None]:
# Top 10 underpredicted (model predicts lower than actual)
underpredicted = test_results.nlargest(10, 'error')

print("TOP 10 UNDERPREDICTED MUNICIPALITIES")
print("(Model predicts lower development than actual)")
print("=" * 70)

for _, row in underpredicted.iterrows():
    print(f"\n{row['mun']}, {row['dep']}")
    print(f"  Actual: {row['imds_actual']:.2f} | Predicted: {row['imds_predicted']:.2f} | Error: {row['error']:.2f}")

### 9.3 Error Patterns by Department

In [None]:
# Analyze errors by department
dept_errors = test_results.groupby('dep').agg({
    'error': ['mean', 'std', 'count'],
    'abs_error': 'mean'
}).round(2)

dept_errors.columns = ['Mean Error', 'Std Error', 'Count', 'Mean Abs Error']
dept_errors = dept_errors.sort_values('Mean Error')

print("PREDICTION ERRORS BY DEPARTMENT:")
print("=" * 60)
print(dept_errors.to_string())

In [None]:
# Visualize errors by department
plt.figure(figsize=(10, 5))

dept_order = dept_errors.sort_values('Mean Error').index
colors = ['green' if x > 0 else 'red' for x in dept_errors.loc[dept_order, 'Mean Error']]

plt.barh(dept_order, dept_errors.loc[dept_order, 'Mean Error'], color=colors, edgecolor='black', alpha=0.7)
plt.axvline(x=0, color='black', linestyle='-', lw=1)
plt.xlabel('Mean Prediction Error (Actual - Predicted)', fontsize=11)
plt.ylabel('Department', fontsize=11)
plt.title('Prediction Errors by Department\n(Positive = Underpredicted, Negative = Overpredicted)', 
          fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## 10. Summary and Conclusions

### Model Performance Summary

In [None]:
# Create summary table
summary = pd.DataFrame({
    'Metric': ['Model', 'Target Variable', 'Number of Features', 'Training Samples', 'Test Samples',
               'CV Mean R²', 'CV Std R²', 'Test R²', 'Test RMSE', 'Test MAE', 'Features for 80% Importance'],
    'Value': ['Random Forest Regressor', 'IMDS (Municipal Sustainable Development Index)', 64,
              len(X_train), len(X_test), f"{cv_scores.mean():.4f}", f"{cv_scores.std():.4f}",
              f"{test_r2:.4f}", f"{test_rmse:.4f}", f"{test_mae:.4f}", n_features_80]
})

print("MODEL SUMMARY")
print("=" * 60)
for _, row in summary.iterrows():
    print(f"{row['Metric']:40s} {row['Value']}")

### Key Findings

1. **Moderate Predictive Power (R² ≈ 0.23)**: Satellite embeddings explain about 23% of the variance in IMDS on the test set. This is expected because IMDS is a composite index that includes many dimensions not directly observable from satellite imagery.

2. **Urban Centers Systematically Underpredicted**: Cities like La Paz achieve higher development than satellite features suggest. This indicates that institutional services, economic opportunities, and other non-physical development dimensions are concentrated in urban areas.

3. **Rural Areas Often Overpredicted**: Some rural municipalities show visible infrastructure in satellite imagery that doesn't translate to actual development outcomes. Factors like isolation, climate, and service access are not captured.

4. **Distributed Feature Importance**: Unlike specific indicators, overall development requires many satellite features (44/64 for 80% importance). This reflects the multi-dimensional nature of the IMDS.

### Limitations

- **Invisible Dimensions**: Many SDG components (governance, education quality, health services, gender equality) are not directly observable from satellite imagery
- **Composite Index Complexity**: Aggregating diverse indicators into a single score creates challenges for prediction
- **Sample Size**: 339 municipalities may limit model complexity

### Practical Applications

1. **Screening Tool**: Use satellite predictions to identify areas for detailed surveys
2. **Monitoring Physical Development**: Track infrastructure changes over time
3. **Complementary Data**: Combine with traditional survey data for comprehensive assessment

---

## References

- **SDG Data Source**: Andersen, L. E., Canelas, S., Gonzales, A., Peñaranda, L. (2020). Atlas municipal de los Objetivos de Desarrollo Sostenible en Bolivia 2020. La Paz: Universidad Privada Boliviana, SDSN Bolivia. https://atlas.sdsnbolivia.org

- **DS4Bolivia Repository**: https://github.com/quarcs-lab/ds4bolivia

- **Satellite Embeddings**: Google Earth Engine aggregated embeddings at municipal level (2017)

---

## Exercises for Students

1. **Compare with Other SDG Indices**: Modify this notebook to predict individual SDG indices (e.g., `index_sdg1`, `index_sdg7`). How does predictive power vary across different SDGs?

2. **Feature Selection**: Try training the model with only the top 20 features. How does performance change?

3. **Alternative Models**: Replace Random Forest with Gradient Boosting (`GradientBoostingRegressor`) or XGBoost. Compare performance.

4. **Spatial Analysis**: Create a map showing prediction errors across municipalities. Are there spatial clusters of over/underprediction?

5. **Hyperparameter Tuning**: Use `GridSearchCV` or `RandomizedSearchCV` to find optimal hyperparameters. Does performance improve significantly?