## This notebook removes small clusters of "noise" from binary masks, such as masks created via k-means clustering.

Import packages

In [37]:
import cv2
import numpy as np
from PIL import Image

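Function for removing small clusters from an image. The image should be binary (contain only 0 and 1 values) and stored as a single-channel 8-bit (uint8) array, which is what cv2.connectedComponentsWithStats expects.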

In [None]:
def remove_small_clusters(binary_image, min_cluster_size=10):
    """Return a copy of binary_image with small connected components removed.

    binary_image must be a single-channel uint8 array of 0 and 1 values.
    Components with fewer than min_cluster_size pixels are set to 0.
    """
    print("Running connected components analysis...")

    # Label connected foreground regions (8-connectivity includes diagonal neighbours)
    num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_image, connectivity=8)
    print(f"Number of labels: {num_labels}")

    # Mark which labels to keep; label 0 is the background and is never kept
    keep = stats[:, cv2.CC_STAT_AREA] >= min_cluster_size
    keep[0] = False

    # Look up every pixel's label in one pass instead of scanning the image once per label
    cleaned_image = keep[labels].astype(np.uint8)

    return cleaned_image

Load the image and check its shape and pixel values. Change the path to wherever your image of interest is stored.

In [59]:
# Load the binary mask as a single-channel grayscale array

# Open the image with PIL
pil_image = Image.open('/explore/nobackup/people/sking11/clustering/clustering/1970_binarymask.tif')
# Convert to 8-bit grayscale (mode 'L')
pil_image = pil_image.convert('L')
# Convert the image to a NumPy array
binary_image = np.array(pil_image)

# Check the image shape
print(f"Image shape: {binary_image.shape}")
# Check that the image is binary (contains only 0 and 1 values)
print(f"Unique pixel values before thresholding: {np.unique(binary_image)}")

Image shape: (9821, 12513)
Unique pixel values before thresholding: [0 1]
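
If the check reports values other than 0 and 1 (for example a mask stored as 0 and 255, which is an assumption about how some masks are exported, not something seen in this notebook), the mask can be normalized before cleaning. A minimal sketch, where to_binary is a hypothetical helper and the 2x2 array is made-up data:

```python
import numpy as np

def to_binary(mask):
    """Hypothetical helper: map every nonzero pixel to 1 and return uint8."""
    return (np.asarray(mask) > 0).astype(np.uint8)

# Made-up example of a mask stored with 0 and 255 values
example = np.array([[0, 255], [255, 0]], dtype=np.uint8)
binary = to_binary(example)

print(f"Unique pixel values after normalizing: {np.unique(binary)}")
```

Treating every nonzero value as foreground is a design choice; a mask with several class values would need a different rule.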


Run the remove_small_clusters function. Set min_cluster_size to a value that makes sense for your data. For the G-LiHT images, values between 400 and 600 pixels tended to remove noise without removing trees.

In [61]:
# Remove small clusters
cleaned_image = remove_small_clusters(binary_image, min_cluster_size=500)

Running connected components analysis...
Number of labels: 210103


In [62]:
# Save the cleaned image with cv2 (imwrite returns True on success)
writepath = '/explore/nobackup/people/sking11/clustering/clustering/cleaned_1970_binarymask_500cluster.tif'
cv2.imwrite(writepath, cleaned_image)

True
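
Because the saved mask holds only the values 0 and 1, it will look almost entirely black in most image viewers. For a quick visual check, a preview copy can be scaled to 0 and 255. This is a minimal sketch on a made-up 2x2 array; mask01 and preview are illustrative names, not variables from the notebook.

```python
import numpy as np

# Made-up stand-in for a cleaned 0/1 mask
mask01 = np.array([[0, 1], [1, 0]], dtype=np.uint8)

# Scale foreground pixels to 255 so they are visible in a viewer
preview = mask01 * 255

print(f"Preview dtype: {preview.dtype}")
print(f"Unique preview values: {np.unique(preview)}")
```

Keep the 0/1 version as the analysis output and use the scaled copy only for viewing.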