# Group Exercise #2

In this graded group exercise, you will use the **GEE Python API** and the **geemap Python package** to:

- Retrieve a collection of satellite images over a study area
- Create a composite image and visualize it
- Apply a simple unsupervised machine learning algorithm (e.g., k-means) for clustering the study area
- Analyze the clustering results given a set of reference points

## Tasks

### 1. Define a case study area 
*The bounding box covering the study area should be large enough (e.g., a few hundred kilometers across).*

Before starting, the API must be initialized.
- Two packages are needed: `ee` and `geemap`.
- Authentication is required to access the resources in Google Earth Engine. Run the authentication line once, before anything else, and follow the instructions. After that the Python environment remembers the credentials, so the line does not need to be executed again.

In [2]:
import ee
import geemap
import sys

# Run this just once to get access to the resources in GEE, then comment it out again
# ee.Authenticate()

try:
    # Initialize the library
    ee.Initialize()
    print('Google Earth Engine has initialized successfully!')
except ee.EEException as e:
    # Print the reason for the failure instead of discarding it
    print('Google Earth Engine has failed to initialize!', e)
except Exception:
    print("Unexpected error:", sys.exc_info()[0])
    raise

Google Earth Engine has initialized successfully!


We select Turkey as the area of interest and create its bounding box.

In [3]:
# Bounding box of Turkey (the region of interest is defined once here, as a polygon)
left = 25.663
right = 44.822
up = 42.366
down = 35.817

roi = ee.Geometry.Polygon(
        [[[left, down],
          [right, down],
          [right, up],
          [left, up],]], None, False)

bbox = roi.bounds().getInfo()['coordinates'][0]
print(bbox)

[[25.663, 35.817], [44.822, 35.817], [44.822, 42.366], [25.663, 42.366], [25.663, 35.817]]
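
As a quick check that this bounding box meets the "few hundred kilometers" requirement, its extent can be estimated from the corner coordinates. The sketch below is not part of the original notebook; it assumes a spherical Earth with a mean radius of 6371 km, and the helper name `haversine_km` is introduced here only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical approximation

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Same corner values as in the cell above
left, right, up, down = 25.663, 44.822, 42.366, 35.817
mid_lat = (up + down) / 2

width_km = haversine_km(left, mid_lat, right, mid_lat)  # east-west extent at mid-latitude
height_km = haversine_km(left, down, left, up)          # north-south extent

print(f"width ~ {width_km:.0f} km, height ~ {height_km:.0f} km")
```

Both extents come out well above a few hundred kilometers, so the box is large enough for the exercise.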


Then we center the map on Turkey and add the bounding box as a layer.

In [4]:
# Create a map centered on Turkey
map = geemap.Map(center=[(up+down)/2, (left+right)/2], zoom=6)

# Add the bounding box to the map
bbox_layer = geemap.geojson_to_ee({
    'type': 'Feature',
    'geometry': {
        'type': 'Polygon',
        'coordinates': [bbox]
    },
    'properties': {
        'name': 'Bounding box'
    }
})
map.addLayer(bbox_layer, {'color': 'grey'}, 'Bounding box')

# Display the map
map

Map(center=[39.091499999999996, 35.2425], controls=(WidgetControl(options=['position', 'transparent_bg'], widg…

### 2. Add the reference points
*Filter the reference points to the selected region of interest*

- The reference points are stored in multiple .csv files under /Reference_Point.
- We first read them into a DataFrame and then convert it to an ee FeatureCollection.

NB! This task was solved in three alternative ways (Solutions A, B, and C), each independent of the others, so you only need to run one of them for the rest of the assignment.

#### 2.1 Solution A
- The Google Earth Engine API cannot handle external data larger than 10 MB, so we avoid building one large ee FeatureCollection of all points and filtering it with the ee package. Instead, we reduce the size of the data before converting it.
- Therefore, we first read all the points into a DataFrame and do the filtering while the data is still a DataFrame.
- Then we convert this smaller DataFrame into an ee FeatureCollection.
- This method took 9.1 seconds to run locally on my laptop. (An Internet connection is needed from the DataFrame conversion onward, roughly half of the process.)

In [5]:
import dask.dataframe as dd
import pandas as pd

# Use DASK to read all csv together
da = dd.read_csv('Reference_Point/SamplesSet*.csv')
ref_point = da.compute()

# Get rid of the rows with no landcover
ref_point = ref_point.dropna(subset=['landcover'])

# Filter the data within the area of interest
ref_point_sub = ref_point[(ref_point['lon'] >= left) & (ref_point['lon'] <= right) & (ref_point['lat'] >= down) & (ref_point['lat'] <= up)]

print(ref_point_sub)

# Convert the data to a ee feature
def row_to_feature(row):
    point = ee.Geometry.Point(row['lon'], row['lat'])
    feature = ee.Feature(point, {'landcover': row['landcover']})
    return feature

# Make feature collection
feature_list = ref_point_sub.apply(row_to_feature,axis=1).tolist()
feature_collection = ee.FeatureCollection(feature_list)
ref_point_ee = feature_collection.filter(ee.Filter.geometry(roi))
print(ref_point_ee.size().getInfo())

map.addLayer(ref_point_ee, {}, 'Area_of_Interest')
map

OSError: An error occurred while calling the read_csv method registered to the pandas backend.
Original Message: Reference_Point/SamplesSet*.csv resolved to no files
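
The error above means that the pattern `Reference_Point/SamplesSet*.csv` matched no files relative to the notebook's working directory. A minimal sketch of a check that could run before `dd.read_csv` is shown below. The helper name `find_sample_files` is hypothetical, and the temporary directory only stands in for the real `Reference_Point` folder.

```python
import glob
import os
import tempfile

def find_sample_files(pattern):
    """Return the sorted files matching `pattern`, or raise a descriptive error."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(
            f"No files match {pattern!r} (current working directory: {os.getcwd()})"
        )
    return paths

# Demonstration on a temporary folder with two empty sample files
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 2):
        with open(os.path.join(tmp, f"SamplesSet{i}.csv"), "w") as fh:
            fh.write("lon,lat,landcover\n")
    found = find_sample_files(os.path.join(tmp, "SamplesSet*.csv"))
    print([os.path.basename(p) for p in found])
```

In the notebook, calling `find_sample_files('Reference_Point/SamplesSet*.csv')` first would report the working directory in the error message, which makes a wrong relative path easier to spot.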

#### 2.2 Solution B
NB: this solution also exceeds the 10 MB limit, so it is not the best solution.

- For the same reason as in Solution A, the Google Earth Engine API cannot handle external data larger than 10 MB. Solution B tries to avoid server-side computation on one big dataset by processing the files separately in a loop.
- The loop goes over each csv file, converts it into an ee FeatureCollection, filters it, and then merges the selected points with the result of the previous iteration.
- This solution keeps calling Google Earth Engine, so a stable network connection is needed for the whole process.
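
The loop described above splits the work per csv file. A related idea, sketched here only as an illustration and not taken from the notebook, is to split the rows into fixed-size batches so that each conversion handles a bounded number of points. The helper `iter_batches` and the batch size are assumptions; the notebook does not measure how many rows fit under the 10 MB limit.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Hypothetical rows standing in for the reference points
rows = [{"lon": 26.0 + i, "lat": 36.0, "landcover": i % 5} for i in range(7)]

batches = list(iter_batches(rows, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```

Each batch could then be converted and filtered separately, in the same way the loop treats each csv file.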

In [None]:
# import dask.dataframe as dd
# import pandas as pd

# # Convert the data to a ee feature
# def row_to_feature(row):
#     point = ee.Geometry.Point(row['lon'], row['lat'])
#     feature = ee.Feature(point, {'landcover': row['landcover']})
#     return feature

# ref_point= pd.DataFrame({})
# ref_point_ee= ee.FeatureCollection([])
# # loop each csv and run the process separately
# for i in range(1,11):
#     path= 'Reference_Point/SamplesSet'+ str(i) +'.csv'
#     ref_point_sub = pd.read_csv(path)
#     ref_point_sub = ref_point_sub.dropna(subset=['landcover'])
#     # merge the dataframe
#     ref_point = pd.concat([ref_point,ref_point_sub])
#     # convert the dataframe to ee feature
#     feature_list = ref_point_sub.apply(row_to_feature,axis=1).tolist()
#     feature_collection = ee.FeatureCollection(feature_list)
#     filtered_collection = feature_collection.filter(ee.Filter.geometry(roi))
#     # merge the feature collection
#     ref_point_ee = ref_point_ee.merge(filtered_collection)
#     # print(ref_point_ee.size().getInfo())

# ref_point
# map.addLayer(ref_point_ee, {}, 'Area_of_Interest')
# map

#### 2.3 Solution C
- Solution C avoids server-side computation on a big dataset by using geopandas to handle the geometries locally.
- We load the points into a GeoDataFrame and filter it to the area of interest locally with geopandas.
- Then we convert the GeoDataFrame into an ee object, so the input passed to geemap is much smaller.
- This solution takes 6.9 seconds. It needs an Internet connection only for the last step.

In [None]:
import geopandas as gpd
from shapely.geometry import Point, Polygon
import dask.dataframe as dd
import pandas as pd

# use DASK to read all csv together
da = dd.read_csv('Reference_Point/SamplesSet*.csv')
ref_point = da.compute()

# Get rid of the rows with no landcover
ref_point = ref_point.dropna(subset=['landcover'])

# Convert the dataframe to geodataframe
geometry = [Point(xy) for xy in zip(ref_point['lon'], ref_point['lat'])]
gdf = gpd.GeoDataFrame(ref_point, geometry=geometry, crs='EPSG:4326')
gdf = gdf.drop(['lon', 'lat'], axis=1)

# Filter the points in the area of interest
area_of_interest = Polygon([(left, down), (right, down), (right, up), (left, up)])
gdf_ref_point_sub = gdf.cx[left:right, down:up]

# Convert the geodataframe to ee
ref_point_ee = geemap.geopandas_to_ee(gdf_ref_point_sub)

# Add the points to the map
map.addLayer(ref_point_ee, {}, 'Area_of_Interest')
map

### 3 & 4. Generate image collection & Calculate the median image 
- *Generate Sentinel 2 images for this region of interest covering the summer of 2023*
- *Reduce this image collection by calculating the median of all values*

In [None]:
# Mask clouds in a Sentinel-2 image using the QA60 band
def mask_s2_clouds(image):
  qa = image.select('QA60')
  # Bits 10 and 11 are clouds and cirrus, respectively.
  cloud_bit_mask = 1 << 10
  cirrus_bit_mask = 1 << 11
  # Both flags should be set to zero, indicating clear conditions.
  mask = (
      qa.bitwiseAnd(cloud_bit_mask)
      .eq(0)
      .And(qa.bitwiseAnd(cirrus_bit_mask).eq(0))
  )
  # Return the masked image, divided by 10000 to scale the values to roughly 0-1
  return image.updateMask(mask).divide(10000)

# Get the Sentinel-2 image collection for the region of interest
image = (
     ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED') 
    .filterBounds(roi) 
    # Summer 2023 (the end date of filterDate is exclusive, so 1 September is used to include 31 August)
    .filterDate('2023-06-01', '2023-09-01') 
    # Pre-filter to get less cloudy granules.
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 10)) 
    .map(mask_s2_clouds)
    # Get the median of the image collection
    .median()
)

# Set visualization parameters
vis_params = {
    'min': 0,'max': 0.4,
    'bands': ["B4", "B3", "B2"]
}

map.addLayer(image, vis_params, 'S2-2023-Median')
map
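
The cloud mask above relies on bit tests against the QA60 band: bit 10 flags clouds and bit 11 flags cirrus, and a pixel is kept only when both bits are zero. The same logic can be tried on plain integers. This is a small illustration in pure Python rather than Earth Engine code, and the function name `is_clear` is introduced here for the example.

```python
CLOUD_BIT = 1 << 10   # 1024, bit 10: clouds
CIRRUS_BIT = 1 << 11  # 2048, bit 11: cirrus

def is_clear(qa60):
    """True when neither the cloud bit nor the cirrus bit is set in a QA60 value."""
    return (qa60 & CLOUD_BIT) == 0 and (qa60 & CIRRUS_BIT) == 0

# 0: clear, 1024: cloud only, 2048: cirrus only, 3072: both flags set
for value in (0, 1024, 2048, 3072):
    print(value, format(value, "012b"), is_clear(value))
```

Only the value 0 passes the test here, which mirrors how `bitwiseAnd(...).eq(0)` keeps only pixels with both flags cleared.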

### 5. Visualize the reference points and image
*Show the image and the reference points together on one map*

In [None]:
# Find the 'Bounding box' layer and hide it
bbox_layer = map.find_layer('Bounding box')
bbox_layer.visible = False

# Add the reference points to the map
map.addLayer(ref_point_ee, {}, 'Area_of_Interest')

# Set visualization parameters for the image
vis_params = {
    'min': 0,'max': 0.4,
    'bands': ["B4", "B3", "B2"]
}
# Add the image to the map
map.addLayer(image, vis_params, 'S2-2023-Median')
map


### 6. Make cluster
- *Cluster the selected image into a predefined number of clusters*
- *You can use the reference point labels to get a first estimate of the number of clusters, and k-means as the clustering algorithm*

In [None]:
# Select the sample points to train the K-means model
training = image.sample(**{
    'scale': 30,
    # 8000 points in total
    'numPixels': 8000,
    'seed': 0,
    'geometries': True, 
    'region': roi,
    # Set tileScale to process smaller tiles and reduce memory use
    'tileScale': 4
})

In [None]:
# Set the cluster number as 5 (the reference points have 5 types of landcover)
cluster_number = 5
clusterer = ee.Clusterer.wekaKMeans(cluster_number).train(training)

# Cluster the input using the trained clusterer.
image_clustered = image.cluster(clusterer)

# Display the clusters with random colors.
map.addLayer(image_clustered.randomVisualizer(), {}, 'clusters')
map

### 7. Analyze the clustering results. 
For example, you can use a histogram of the land cover classes per cluster ID generated by the k-means algorithm.

- Per cluster ID, we visualised the distribution of the land cover classes (one bar chart per cluster; the x-axis shows the land cover classes and the y-axis the number of reference points of each class in that cluster).
- We also calculated the V-measure, the harmonic mean of homogeneity (each cluster contains a single land cover class) and completeness (each land cover class falls into a single cluster), to quantify the quality of the clustering.

In [None]:
# Join the value of the clustered image (cluster_id) to the reference points
joined_ref_point_ee = image_clustered.reduceRegions(
    collection=ref_point_ee,
    reducer=ee.Reducer.mean(),
    crs='EPSG:4326',
    scale=30,
    tileScale=4
)

print(joined_ref_point_ee.getInfo())

In [None]:
# Convert the reference points with the joined cluster_id into a DataFrame
joined_ref_point = geemap.ee_to_pandas(joined_ref_point_ee)
# Rename the "mean" column to "cluster_id" to make it more readable
joined_ref_point = joined_ref_point.rename(columns={"mean": "cluster_id"})
joined_ref_point

In [None]:
import matplotlib.pyplot as plt
# Reshape the DataFrame into a pivot table (number of reference points per land cover class and cluster ID)
pivot_table = joined_ref_point.pivot_table(index="landcover", columns="cluster_id", values=None, aggfunc="size")

# Build one subplot per cluster
fig, axs = plt.subplots(ncols=1, nrows=len(pivot_table.columns), figsize=(9, 25))

# Draw one bar chart per cluster: land cover classes on the x-axis, point counts on the y-axis
for i, column in enumerate(pivot_table.columns):
    axs[i].bar(pivot_table.index, pivot_table[column])
    axs[i].set_title("cluster_id: " + str(column))
    axs[i].set_xlabel("landcover")
    axs[i].set_ylabel("number of reference points")

plt.show()
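
The pivot table behind these charts is a count of reference points per (land cover, cluster) pair. The sketch below reproduces that counting step with the standard library on a few made-up pairs. The labels are hypothetical and only stand in for the rows of `joined_ref_point`.

```python
from collections import Counter

# Hypothetical (landcover, cluster_id) pairs, not taken from the real data
pairs = [
    ("forest", 0), ("forest", 0), ("water", 1),
    ("forest", 1), ("urban", 2), ("water", 1),
]

# Count how often each (landcover, cluster_id) combination occurs
counts = Counter(pairs)

# Regroup the counts per cluster, which is what each bar chart displays
per_cluster = {}
for (landcover, cluster), n in sorted(counts.items()):
    per_cluster.setdefault(cluster, {})[landcover] = n

for cluster in sorted(per_cluster):
    print(cluster, per_cluster[cluster])
```

A cluster dominated by one land cover class, like cluster 0 in this toy data, corresponds to a chart with a single tall bar.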

In [None]:
from sklearn.metrics import v_measure_score

# Drop points without a cluster_id (e.g. points on masked pixels), since NaN labels cannot be scored
scored = joined_ref_point.dropna(subset=['landcover', 'cluster_id'])

# Calculate the V-measure of the clustering (range 0-1; 1 means the clusters match the land cover classes perfectly)
v_measure = v_measure_score(scored['landcover'], scored['cluster_id'])
print("V-measure: ", v_measure)
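
To make the score above less of a black box, the V-measure can be written out by hand as the harmonic mean of homogeneity $h = 1 - H(C|K)/H(C)$ and completeness $c = 1 - H(K|C)/H(K)$, where $C$ are the land cover classes and $K$ the cluster IDs. The following is a minimal pure-Python sketch of that definition with equal weighting of both terms. It is meant for understanding and is not a replacement for `v_measure_score`; the function names are introduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """Conditional entropy H(target | given) from two paired label lists."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (_, g), c in joint.items())

def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Toy labels: class 1 is split across clusters 1 and 2
score = v_measure([0, 0, 1, 1], [0, 0, 1, 2])
print(round(score, 3))  # 0.8
```

In the toy example every cluster is pure (homogeneity 1), but one class is spread over two clusters, which lowers completeness and therefore the overall score.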

### 8. Change the number of clusters and repeat the analysis
- For Point of Discussion 2, we run the code with different numbers of clusters, compute the V-measure for each, and compare the results (and the run times).
- For Point of Discussion 3, we first run the training in a smaller or similar area with fewer or the same land cover types, then apply the clusterer to Turkey and compare the V-measure (and the run times).
- For Point of Discussion 4, we first record the run times of Solutions A, B, and C in Section 2. We then also record the run times for different numbers of clusters and for applying clusterers trained in different regions to Turkey.
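
The comparisons listed above all follow the same pattern: run the pipeline for one setting, record the score, and record the elapsed time. One possible way to organise this is sketched below. The helper `time_runs` and the placeholder `fake_run` are hypothetical; in the notebook, the placeholder would be replaced by a function that trains the clusterer for `k` clusters and returns the V-measure.

```python
import time

def time_runs(run, cluster_numbers):
    """Call run(k) for each k and return a list of (k, score, seconds) tuples."""
    results = []
    for k in cluster_numbers:
        start = time.perf_counter()
        score = run(k)
        results.append((k, score, time.perf_counter() - start))
    return results

# Placeholder for "train, cluster, join, compute V-measure" with k clusters
def fake_run(k):
    return 1.0 / k

results = time_runs(fake_run, [3, 5, 8])
for k, score, seconds in results:
    print(f"k={k}: score={score:.3f}, time={seconds:.4f} s")
```

Because Earth Engine evaluates requests lazily, the measured time is only meaningful if the timed function actually fetches a result, for example through a `getInfo()` call or the conversion to a DataFrame.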

## Points of Discussions

- Why do we need to train an unsupervised classifier in GEE?
- Impact of the number of clusters on the results
- Impact of performing the clustering in region X and applying it on region Y
  - Region Y has the same land cover classes present in Region X
  - Region Y has more land cover classes/types.
- What can you say about the computational time?