# Assignment #1 Classification

Submitted by:

Jazib Imran

Anjali Bhagat

Adopted from: https://geemap.org/notebooks/32_supervised_classification/

In [230]:
import ee
import geemap

In [85]:
# If you are running this notebook for the first time, you need to activate the command below for the authentication flow:
ee.Authenticate()

True

In [86]:
try:
    # Initialize the library.
    ee.Initialize()
    print('Google Earth Engine has initialized successfully!')
except ee.EEException as e:
    print('Google Earth Engine has failed to initialize!')
except:
    print("Unexpected error:", sys.exc_info()[0])
    raise

Google Earth Engine has initialized successfully!


For this classfication task, we would use Amsterdam as our Region of interest.
We take the central point of Amsterdam and then get its 10000m bounds to run our classfication problem

In [87]:
# Amsterdam
point = ee.Geometry.Point([4.82, 52.4])
boundingBox = point.buffer(10000).bounds()

Collecting the Sentinel-2 data

In [88]:
imageCollection = ee.ImageCollection("COPERNICUS/S2_SR_HARMONIZED") \
    .filterBounds(boundingBox) \
    .filterDate('2024-05-01', '2024-07-31') \
    .filterMetadata('CLOUDY_PIXEL_PERCENTAGE', 'less_than', 10)

We can also print how many images are there in our collection

In [89]:
listOfImages = imageCollection.aggregate_array('system:index').getInfo()
print('Number of images in the collection: ', len(listOfImages))

Number of images in the collection:  5


Now, we would compute the Median of all the five images present. This would return into an ee.image

In [90]:
image= imageCollection.median()
image_roi = image.clip(boundingBox)
image_roi.getInfo()

{'type': 'Image',
 'bands': [{'id': 'B1',
   'data_type': {'type': 'PixelType',
    'precision': 'double',
    'min': 0,
    'max': 65535},
   'dimensions': [1, 1],
   'origin': [4, 52],
   'crs': 'EPSG:4326',
   'crs_transform': [1, 0, 0, 0, 1, 0]},
  {'id': 'B2',
   'data_type': {'type': 'PixelType',
    'precision': 'double',
    'min': 0,
    'max': 65535},
   'dimensions': [1, 1],
   'origin': [4, 52],
   'crs': 'EPSG:4326',
   'crs_transform': [1, 0, 0, 0, 1, 0]},
  {'id': 'B3',
   'data_type': {'type': 'PixelType',
    'precision': 'double',
    'min': 0,
    'max': 65535},
   'dimensions': [1, 1],
   'origin': [4, 52],
   'crs': 'EPSG:4326',
   'crs_transform': [1, 0, 0, 0, 1, 0]},
  {'id': 'B4',
   'data_type': {'type': 'PixelType',
    'precision': 'double',
    'min': 0,
    'max': 65535},
   'dimensions': [1, 1],
   'origin': [4, 52],
   'crs': 'EPSG:4326',
   'crs_transform': [1, 0, 0, 0, 1, 0]},
  {'id': 'B5',
   'data_type': {'type': 'PixelType',
    'precision': 'double

Let's add the image and amsterdam bounds to the map to visualize them

In [135]:
Map = geemap.Map()

vis_params = {'bands': ['B4', 'B3', 'B2'],
              'gamma': 1.9600000000000004, 
              'min': 0.0, 'max': 6235.8}

Map.centerObject(boundingBox, 10)
Map.addLayer(boundingBox, {}, "Region of Interest")
Map.addLayer(image_roi, vis_params, "Image")
Map

Map(center=[52.39998628995313, 4.8202661079618], controls=(WidgetControl(options=['position', 'transparent_bg'…

## Combining Separate Landcover Classes into a Single Feature Collection

In this process, we collected data for different landcover classes, such as `greenspace`, `builtup`, `agriculture`, `water`, and `barren`. Initially, each class was collected into a separate `FeatureCollection` using the `Map.user_info` method in Google Earth Engine, which allowed us to filter and isolate specific classes into their individual `FeatureCollection` objects.The samples for each class was collected using the GUI of Map and then those samples were saved into the respective class using the user_rois method. This step was repeated for all the classes.

Once each class had its own `FeatureCollection`, we combined them into a single `FeatureCollection` for ease of use, analysis, or export. Combining these collections into one allowed us to unify all landcover data under a single `FeatureCollection`, which simplifies further operations.


In [None]:
Map.user_rois.getInfo()

In [None]:
# greenspace=Map.user_rois
# builtup=Map.user_rois
# agriculure=Map.user_rois
# water=Map.user_rois
barren=Map.user_rois

### Code Explanation
In the code, we stored each separate class’s `FeatureCollection` in a list called `list_fc`:

```python
# List of feature collections
list_fc = [greenspace, builtup, agriculture, water, barren]

In [147]:
list_fc= [greenspace, builtup, agriculure, water, barren]

To combine them, we used the first FeatureCollection in list_fc as the starting point (combined_fc = list_fc[0]). Then, we iteratively merged each of the remaining FeatureCollection objects in list_fc with this initial collection. This approach was chosen because Google Earth Engine does not directly support indexing within the FeatureCollection API, so there is no straightforward way to access and merge subsequent collections without first designating a base collection to start with.

This method provides a simple, efficient way to combine all classes without needing to initialize an empty FeatureCollection, which can sometimes introduce unnecessary complexity.After running this code, combined_fc will contain all features from the individual class collections in list_fc, unified under a single FeatureCollection.

In [None]:
landcover_fc =list_fc[0]
for fc in list_fc[1:]:
    landcover_fc = landcover_fc.merge(fc)

In [None]:
Map.centerObject(boundingBox, 10)
Map.addLayer(landcover_fc, {}, "landcover_fc")

Each feature collection (`fc`) contains two key properties:

- **`landcover`**: An integer value that uniquely represents each landcover class. This integer acts as an identifier for different types of landcover, making it easy to distinguish between classes during analysis.

- **`class`**: A string property that holds the name of each landcover class, providing a human-readable label. This makes the data more interpretable, as it links each `landcover` integer ID to a specific type of landcover, such as "greenspace" or "builtup."

To better understand how this data is structured, let’s convert the combined feature collection into a `GeoDataFrame` (using the `ee_to_gdf` function) and examine its contents:


In [151]:
landcover_gdf=geemap.ee_to_gdf(landcover_fc)
landcover_gdf

Unnamed: 0,geometry,class,color,landcover
0,"POLYGON ((4.80040 52.34784, 4.80040 52.34808, ...",green spaces,#33ff4b,2
1,"POLYGON ((4.79962 52.34809, 4.79962 52.34823, ...",green spaces,#33ff4b,2
2,"POLYGON ((4.79874 52.34852, 4.79874 52.34865, ...",green spaces,#33ff4b,2
3,"POLYGON ((4.80292 52.34922, 4.80292 52.34925, ...",green spaces,#33ff4b,2
4,"POLYGON ((4.80627 52.34998, 4.80627 52.35022, ...",green spaces,#33ff4b,2
...,...,...,...,...
137,"POLYGON ((4.80619 52.45755, 4.80619 52.45773, ...",barren,#7e502a,5
138,"POLYGON ((4.80569 52.46558, 4.80569 52.46572, ...",barren,#7e502a,5
139,"POLYGON ((4.83521 52.45918, 4.83521 52.45921, ...",barren,#7e502a,5
140,"POLYGON ((4.83552 52.45905, 4.83566 52.45914, ...",barren,#7e502a,5


After converting our combined feature collection to a `GeoDataFrame`, we can save it as a GeoJSON file in our directory. This creates a local copy of the data, which is helpful for reproducing the analysis.

Below, we’ll export the `GeoDataFrame` as a GeoJSON file and then demonstrate how to reload it later as a `GeoDataFrame` named `landcover_fc` if we want to work with the data again.


In [152]:
import geopandas as gpd
output_path = "landcover_data.geojson"
landcover_gdf.to_file(output_path, driver="GeoJSON")
print(f"GeoJSON file saved as {output_path}")

# # Load the GeoJSON file back into a GeoDataFrame if needed
# landcover_fc = gpd.read_file(output_path)

# # Display the first few rows to verify the data
# landcover_fc.head()


GeoJSON file saved as landcover_data.geojson


In our analysis, we are using the following Sentinel-2 bands:

- **B2**: Blue
- **B3**: Green
- **B4**: Red
- **B5**: Red Edge 1
- **B6**: Red Edge 2
- **B7**: Red Edge 3
- **B8**: Near Infrared (NIR)
- **B8A**: Narrow Near Infrared
- **B11**: Short Wave Infrared (SWIR) 1
- **B12**: Short Wave Infrared (SWIR) 2

These bands are selected for their usefulness in various remote sensing applications, including vegetation analysis, land cover classification, and water body monitoring.


In [176]:
selected_bands = [
    "B2",  # Blue
    "B3",  # Green
    "B4",  # Red
    "B5",  # Red Edge 1
    "B6",  # Red Edge 2
    "B7",  # Red Edge 3
    "B8",  # Near Infrared (NIR)
    "B8A", # Narrow Near Infrared
    "B11", # Short Wave Infrared (SWIR) 1
    "B12"  # Short Wave Infrared (SWIR) 2
]

In this step, we extract pixel values from the imagery for all pixels located within each polygon in the training `FeatureCollection`. This process allows us to overlay the polygons on the satellite imagery to obtain training and test data for our analysis.


In [None]:
samples = image_roi.select(selected_bands).sampleRegions(**{
    'collection': landcover_fc,
    'properties': ['landcover'],
    'scale': 10
})
print(samples.size().getInfo())

39244


## Train-Test Split

To evaluate the performance of our model, we need to divide our dataset into training and test samples. We will use a simple random split method where approximately 70% of the samples will be used for training and the remaining 30% for testing.

### Code Explanation
In the code below, we perform the following steps:

1. **Set the Split Ratio**: We define a variable `split` to represent the proportion of data to use for training (70%).
2. **Generate Random Values**: We add a random column to the `samples` `FeatureCollection` to facilitate random sampling.
3. **Filter for Training Samples**: We filter the `samples` to create a `training` set, containing features where the random value is less than the split threshold.
4. **Filter for Test Samples**: We create a `test` set by filtering features where the random value is greater than or equal to the split threshold.
5. **Output the Size of Each Set**: Finally, we print the number of training and test samples.



In [None]:
# Train-Test split
split = 0.7

samples = samples.randomColumn()
training = samples.filter(ee.Filter.lt("random", split))
test = samples.filter(ee.Filter.gte("random", split))
print("Number of training samples:",training.size().getInfo())
print("Number of test samples:",test.size().getInfo())

Number of training samples: 27417
Number of test samples: 11827


## Training the Classifier

To classify our data, we will use the **Random Forest** algorithm. We instantiate the classifier with a specified number of trees and train it using our training samples. In this case, we are using just one parameter:

- **`trees`**: The number of trees in the forest, set to 10.



In [192]:
trained_classifier = ee.Classifier.smileRandomForest(10).train(
    features=training,
    classProperty='landcover',  # Use the label
    inputProperties=selected_bands,
)

Now, we classify our whole image using the above trained classifier

In [193]:
result = image_roi.select(selected_bands).classify(trained_classifier)

Adding this layer into our map to visualize the results

In [194]:
Map.addLayer(result.randomVisualizer(), {}, "classified")
Map

Map(bottom=688963.0, center=[52.42681482569163, 4.792785644531251], controls=(WidgetControl(options=['position…

## Classifying Test Samples and Generating a Confusion Matrix

After training our classifier, we can use it to classify the test samples. The classification results are then evaluated using a **confusion matrix**, which helps us assess the model's performance.

Here’s a brief overview of the steps involved:

1. **Classify Test Samples**: The `test` dataset is classified using the trained classifier, resulting in predicted classes for each sample.

2. **Generate Confusion Matrix**: We create a confusion matrix by comparing the actual landcover labels (`landcover`) with the predicted classifications (`classification`). This matrix provides insights into the accuracy of the model and highlights any misclassifications.

3. **Display the Confusion Matrix**: Finally, we print the confusion matrix to the console for analysis.

This process allows us to understand how well our model performs on unseen data, guiding further improvements if necessary.


In [None]:
test_pred = test.classify(trained_classifier)
cm_test = test_pred.errorMatrix("landcover", "classification")
print(cm_test.getInfo())

We can also check the overall accuracy

In [203]:
# Overall Accuracy
accuracy=cm_test.accuracy().getInfo()
print(f'Model Accuracy: {accuracy * 100:.2f}%')

Model Accuracy: 99.44%


## Splitting Training Data and Fine-Tuning the Model

To enhance our model's performance, we further split the original training dataset into **training** and **validation** sets. This allows us to fine-tune the model by evaluating its performance on unseen validation data.

- **Training Set**: Contains samples used to train the model. We filter the `training` dataset using a random value less than the specified split threshold.
  
- **Validation Set**: Contains samples used to validate the model’s performance after training. We filter the `samples` to include features with a random value greater than or equal to the split threshold.

Next, we define a range of hyperparameters for tuning the Random Forest model:

- **Number of Trees (`n_trees`)**: The total number of decision trees in the forest, which influences the model's complexity and performance. We consider values of 5, 10, and 15.

- **Variables per Split (`vars_per_split`)**: The number of features to consider when making a split in each decision tree. This helps to prevent overfitting by limiting the options at each split. We use values of 2 and 3.

- **Minimum Leaf Population (`min_leaf`)**: The minimum number of samples required to be at a leaf node. This parameter controls tree growth and helps prevent overfitting. We test values of 1 and 5.

- **Bag Fraction (`bag_frac`)**: The proportion of training data to use for building each tree. This parameter introduces randomness and diversity into the model. We evaluate fractions of 0.5 and 0.7.

The following process iterates through combinations of these hyperparameters, trains the classifier, classifies the validation set, and computes accuracy. The results are collected for analysis.

Finally, we print the tuning results, displaying the number of trees, variables per split, minimum leaf population, bag fraction, and the corresponding accuracy of the model for each combination of parameters.


In [204]:
training = training.filter(ee.Filter.lt("random", split))
validation = samples.filter(ee.Filter.gte("random", split))

In [None]:
number_of_trees_list = [5, 10, 15]
variables_per_split_list = [2, 3]
min_leaf_population_list = [1, 5]
bag_fraction_list = [0.5, 0.7]
results = []

for n_trees in number_of_trees_list:
    for vars_per_split in variables_per_split_list:
        for min_leaf in min_leaf_population_list:
            for bag_frac in bag_fraction_list:
                trained_classifier = ee.Classifier.smileRandomForest(
                    n_trees,
                    vars_per_split,
                    min_leaf,
                    bag_frac,
                    None,  # max_nodes (None means unlimited)
                    42     # seed for reproducibility
                ).train(
                    features=training,
                    classProperty='landcover',
                    inputProperties=selected_bands,
                )
                testing_pred = validation.classify(trained_classifier)
                error_matrix = testing_pred.errorMatrix('landcover', 'classification')
                accuracy = error_matrix.accuracy().getInfo()
                results.append({
                    'num_trees': n_trees,
                    'vars_per_split': vars_per_split,
                    'min_leaf': min_leaf,
                    'bag_fraction': bag_frac,
                    'accuracy': accuracy
                })

for result in results:
    print(f"Num Trees: {result['num_trees']}, Vars/Split: {result['vars_per_split']}, Min Leaf: {result['min_leaf']}, Bag Fraction: {result['bag_fraction']}, Accuracy: {result['accuracy'] * 100:.2f}%")


Num Trees: 5, Vars/Split: 2, Min Leaf: 1, Bag Fraction: 0.5, Accuracy: 99.06%
Num Trees: 5, Vars/Split: 2, Min Leaf: 1, Bag Fraction: 0.7, Accuracy: 99.32%
Num Trees: 5, Vars/Split: 2, Min Leaf: 5, Bag Fraction: 0.5, Accuracy: 98.36%
Num Trees: 5, Vars/Split: 2, Min Leaf: 5, Bag Fraction: 0.7, Accuracy: 98.54%
Num Trees: 5, Vars/Split: 3, Min Leaf: 1, Bag Fraction: 0.5, Accuracy: 99.13%
Num Trees: 5, Vars/Split: 3, Min Leaf: 1, Bag Fraction: 0.7, Accuracy: 99.20%
Num Trees: 5, Vars/Split: 3, Min Leaf: 5, Bag Fraction: 0.5, Accuracy: 98.39%
Num Trees: 5, Vars/Split: 3, Min Leaf: 5, Bag Fraction: 0.7, Accuracy: 98.75%
Num Trees: 10, Vars/Split: 2, Min Leaf: 1, Bag Fraction: 0.5, Accuracy: 99.25%
Num Trees: 10, Vars/Split: 2, Min Leaf: 1, Bag Fraction: 0.7, Accuracy: 99.39%
Num Trees: 10, Vars/Split: 2, Min Leaf: 5, Bag Fraction: 0.5, Accuracy: 98.51%
Num Trees: 10, Vars/Split: 2, Min Leaf: 5, Bag Fraction: 0.7, Accuracy: 98.70%
Num Trees: 10, Vars/Split: 3, Min Leaf: 1, Bag Fraction: 0.5

We can find the best parameters and then can use those values to train our model and check its generalization by checking its its accuracy with test data that we reserved before.

In [None]:
import pandas as pd

results_df = pd.DataFrame(results)

best_params = results_df.loc[results_df['accuracy'].idxmax()]

# Print the best hyperparameters and corresponding accuracy
print("Best Hyperparameters:")
print(f"Num Trees: {best_params['num_trees']}")
print(f"Vars/Split: {best_params['vars_per_split']}")
print(f"Min Leaf: {best_params['min_leaf']}")
print(f"Bag Fraction: {best_params['bag_fraction']}")
print(f"Accuracy: {best_params['accuracy'] * 100:.2f}%")


Best Hyperparameters:
Num Trees: 15.0
Vars/Split: 3.0
Min Leaf: 1.0
Bag Fraction: 0.5
Accuracy: 99.46%


In [216]:
n_trees=int (best_params['num_trees'])
n_trees

15

Apply the best Model and check its accuracy

In [None]:
n_trees=int(best_params['num_trees'])
vars_per_split=best_params['vars_per_split']
min_leaf=best_params['min_leaf']
bag_fraction=best_params['bag_fraction']
trained_classifier = ee.Classifier.smileRandomForest(
                    n_trees,
                    vars_per_split,
                    min_leaf,
                    bag_fraction,
                    None,  
                    42   
                ).train(
                    features=training,
                    classProperty='landcover',
                    inputProperties=selected_bands,
                )
test_pred = test.classify(trained_classifier)
cm_test = test_pred.errorMatrix("landcover", "classification")
accuracy=cm_test.accuracy().getInfo()
print(f'Model Accuracy: {accuracy * 100:.2f}%')

Model Accuracy: 99.52%


In [220]:
best_result = image_roi.select(selected_bands).classify(trained_classifier)

Visualize the results

In [None]:
landcover = best_result.set("classification_class_values", [1, 2, 3, 4, 5])
landcover = landcover.set("classification_class_palette", ['FF0000', '0000FF', '00FF00', 'FFFF00', 'FFFF00'])
Map.addLayer(landcover, {}, "Land cover")
Map

Map(bottom=528.0, center=[-80.17871349622823, -119.53125000000001], controls=(WidgetControl(options=['position…