# Project 4: Clustering

## Introduction
For this project, I chose to explore clustering because I had a problem in my current research that I believed clustering could solve. To understand that problem, it helps to first understand the research itself.

# Predicting Microplastic Transport using Eulerian Computational Modeling
My research involves using the MIT General Circulation Model (MITgcm) to simulate the movement and concentration of microplastics in the global ocean. Because the Eulerian framework represents tracer concentrations rather than individual particles, I needed a way to group sampling locations that were geographically close or dynamically similar. Clustering provided a practical solution: it let me create piecewise regions of similar distance and behavior for initializing and analyzing microplastic tracer experiments. Below is one of my early demos. It is rough, but it gives you an idea of the most "flashy" part of the research.

![example_sim](res/tracer01_surface.gif "segment")

# Back to Introduction

The dataset spans thousands of microplastic samples collected between 2013 and 2018 by civilian scientists around the world. Because these samples come from vastly different regions and oceanic regimes, analyzing them as a single, continuous field would blur local variability and make it harder to design targeted simulations. By partitioning the dataset into piecewise clusters of geographically proximate samples, I can treat each cluster as a distinct initialization region in my Eulerian modeling framework. This approach makes it easier to run controlled tracer experiments that capture regional transport dynamics while still preserving the global context of the data.

For this reason, clustering offered a natural way to identify geographically coherent groups within the sampling data. By grouping nearby samples into distinct regions, I could define piecewise tracer sources that align with the spatial resolution and concentration-based nature of my Eulerian model.

## Research Specific Constraints  

Because my research integrates clustering results directly into an ocean circulation model, the sampling coordinates had to correspond to valid **ocean grid cells** rather than arbitrary latitude–longitude points. Some sampling sites in the dataset were located slightly on land or along coastlines, which the model cannot interpret. To resolve this, each point was **snapped to the nearest valid ocean location** on the model grid. This step ensures that all clustered locations align with the physical domain of the simulation, even though it adds complexity to the preprocessing workflow.
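
A minimal sketch of how this kind of snapping could work is shown below. It is an illustration under stated assumptions, not necessarily the exact routine used for the model: the arrays `grid_lat`, `grid_lon`, and `ocean_mask` are hypothetical stand-ins for the model grid, and the function names are made up for this example. It does a brute-force nearest-neighbor search on the unit sphere, which is fine for a few thousand samples but would need a spatial index for very fine grids.

```python
import numpy as np

def to_unit_sphere(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to 3D unit vectors, shape (n, 3)."""
    lat = np.radians(np.atleast_1d(np.asarray(lat_deg, dtype=float)))
    lon = np.radians(np.atleast_1d(np.asarray(lon_deg, dtype=float)))
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=-1,
    )

def snap_to_ocean(lat_deg, lon_deg, grid_lat, grid_lon, ocean_mask):
    """Snap each point to the nearest grid cell flagged as ocean.

    grid_lat, grid_lon: 1D arrays of cell-center coordinates (hypothetical grid).
    ocean_mask: 2D boolean array of shape (len(grid_lat), len(grid_lon)),
                True where the cell is ocean.
    """
    lat2d, lon2d = np.meshgrid(grid_lat, grid_lon, indexing="ij")
    wet_lat = lat2d[ocean_mask]          # 1D array of ocean-cell latitudes
    wet_lon = lon2d[ocean_mask]          # 1D array of ocean-cell longitudes
    wet_xyz = to_unit_sphere(wet_lat, wet_lon)
    pts_xyz = to_unit_sphere(lat_deg, lon_deg)
    # For unit vectors, the largest dot product is the smallest great-circle distance
    nearest = np.argmax(pts_xyz @ wet_xyz.T, axis=1)
    return wet_lat[nearest], wet_lon[nearest]
```

Points that already sit on an ocean cell center are returned unchanged, so the function can be applied to every sample without first separating out the problematic ones.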

## Introduce the Data  

The **WW Marine Datashare** dataset contains microplastic sampling records collected by civilian scientists between roughly **2013–2018**, covering oceans and regional seas worldwide. Each record describes a single sampling event, including location, date, environmental conditions, and detailed particle counts categorized by **color, size, and shape**.  

Below is a condensed overview of the dataset’s main feature groups:

- **Geospatial & Temporal Information:**  
  `Sample Latitude`, `Sample Longitude`, `Sample Location`, `Sample Date`, `Sample Year`, `Sample Time (local)`, `Ocean Basin`, `Regional Sea`, and whether the site was *coastal or open ocean*.  

- **Environmental & Contextual Fields:**  
  `Water Temperature (°C)`, `Wind Speed`, `Wind Direction`, `Depth of Sample (m)`, `Sampling Platform`, and optional site metadata such as `Sampling Site Information`, `Notes`, and `Lab Notes`.  

- **Laboratory / Counting Fields:**  
  Detailed particle counts separated by **color (Blue, Red, Transparent, Black, Green, Other)**, **shape (Round, Filament, Other Shape)**, and **size bins** (e.g., `<1.5 mm`, `1.6–3.1 mm`, `3.2–5 mm`, `5.1–9.6 mm`).  

- **Aggregated Metrics:**  
  `Total Pieces` — total number of microplastic fragments counted.  
  `Total Pieces/L` — concentration normalized by sample volume.  
  `Sample Volume (L)` and `Number of Filters` — context for sampling effort and lab processing.  

Together, these features provide a comprehensive view of **where, when, and how** microplastics were collected and analyzed. Because the dataset is community-driven, it offers broad spatial coverage but includes variability in sampling technique and reporting detail.




### Sampling Frequency Over Time  

The figure below shows the number of microplastic samples collected per week across the dataset.  

![Samples per Week](res/samplesPerWeek.png "Sampling frequency")

The plot highlights a strong temporal bias in the dataset: nearly all samples were collected in **late 2014 through early 2015**, with a pronounced spike during a few weeks of intensive sampling. Outside of this period, data collection was sparse and irregular.  

This pattern supports the decision to focus analysis on **2015 samples**, ensuring temporal consistency and reducing the effects of uneven sampling density. Concentrating on this well-sampled year also simplifies clustering by avoiding temporal gaps that could distort spatial grouping.
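
For reference, a weekly count like the one plotted above could be computed roughly as follows. This is a sketch, not the exact plotting code: the helper name `samples_per_week` is made up, and the tiny `demo` frame is synthetic data used only to show the call. The only thing taken from the dataset is the `Sample Date` column name.

```python
import pandas as pd

def samples_per_week(df, date_col="Sample Date"):
    """Count rows per calendar week; rows with unparseable dates are dropped."""
    dates = pd.to_datetime(df[date_col], errors="coerce").dropna()
    return dates.dt.to_period("W").value_counts().sort_index()

# Synthetic example (not real data): two samples in one week, one in the next
demo = pd.DataFrame(
    {"Sample Date": ["2015-01-05", "2015-01-06", "2015-01-14", "not a date"]}
)
weekly = samples_per_week(demo)
```

Calling `weekly.plot(kind="bar")` on the result would give a quick look at the sampling density before committing to a single year.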


## Pre-processing the Data  

Before moving into clustering, I performed a few standard cleaning and filtering steps to make the dataset usable. Most of these steps involved selecting relevant columns, removing incomplete samples, and narrowing the data to a single year.

### Selecting and Cleaning Relevant Features  

```python
import pandas as pd

marine_df = pd.read_excel("WW_dataset/WW Marine Datashare.xlsx")

# Keep only columns needed for clustering and later modeling
microplastics_df = marine_df[
    [
        "Sample Volume (L)", "Total Pieces/L", "Total Pieces",
        "Sample Longitude", "Sample Latitude",
        "Sample Date", "Date Filtered", "Date Counted"
    ]
].copy()

# Drop rows missing key spatial or temporal fields
microplastics_df = microplastics_df.dropna(
    subset=[
        "Sample Longitude", "Sample Latitude",
        "Sample Date", "Date Filtered", "Date Counted"
    ]
)
```

Here, I dropped nonessential metadata such as site descriptions, color- and size-specific counts, and lab notes—these fields weren’t relevant to the spatial clustering problem. Only the **location**, **date**, and **microplastic concentration** fields were kept since they directly relate to where and how much plastic was found.

---

### Converting Dates and Filtering by Year  

```python
microplastics_df["Sample Date"] = pd.to_datetime(
    microplastics_df["Sample Date"], errors="coerce"
)

# Keep only the dense sampling year
microplastics_2015 = microplastics_df[
    microplastics_df["Sample Date"].dt.year == 2015
].reset_index(drop=True)
```

When I plotted the sampling frequency, I found that the densest sampling fell in and around **2015**, so I limited my analysis to that year. This makes the clustering more temporally consistent and reduces noise from sparse years.

---

These simple preprocessing steps left me with a clean and consistent subset of spatially valid, time-aligned samples ready for clustering experiments.


# Visualizing the processed data
The full dataset looks like this:

![sample_map](res/sample_map.png)

And the 2015 subset looks like this:

![2015_samples](res/sample_map_2015.png)

As you can see, filtering to 2015 drops some of the data. However, the model can only use one pickup date for particles, so I have to choose a single year.

## Modeling (Clustering)

I began my modeling process with a straightforward implementation of **K-Means clustering** on the 2015 dataset. At this stage, I simply wanted to see if any broad spatial groupings of samples would emerge when using longitude and latitude as inputs. K-Means was an intuitive starting point because it produces clear, interpretable clusters and allowed me to visualize how the data naturally grouped across the globe.

After this first attempt, I experimented with **DBSCAN**, a density-based clustering method. I was interested in whether DBSCAN would be better suited for identifying irregularly shaped regions or filtering out sparse outliers along coastlines. While DBSCAN did succeed in isolating small coastal clusters, it also tended to fragment large, coherent regions into multiple smaller clusters, which made it less ideal for defining broader oceanic regions.

After discussing these results with my professor, I revisited **K-Means**, but this time with a more appropriate **distance measure** for global data. Instead of using Euclidean distance directly on latitude and longitude, I projected the coordinates onto a **unit sphere** and ran K-Means on the resulting 3D points. Straight-line (chord) distance between points on the sphere grows monotonically with **Haversine (great-circle) distance**, so the algorithm groups points in a way that closely tracks true distance on the globe. This change resolved several artifacts seen in the earlier runs, particularly near the 180° longitude line, where physically nearby samples sit at opposite edges of a flat lon/lat plane, and at higher latitudes, where clusters appeared distorted.

This final spherical K-Means approach produced clean, geographically meaningful clusters that corresponded well with known oceanic regions. These clusters were ultimately used to define **piecewise initialization regions** for my later experiments.


The rest of this section walks through those three stages in order: a naive K-Means on raw latitude and longitude to see whether simple Euclidean distance already reveals useful regional patterns, a DBSCAN run to see whether a density-based view highlights coastal “hotspots” or outliers, and the revised K-Means that better respects the spherical geometry of the Earth.

### 1. Initial K-Means with Euclidean Distance

My first pass treated longitude and latitude as a 2D plane and ran K-Means directly on those coordinates. This was intentionally naïve, but it gave me a quick sanity check on whether broad clusters existed at all.

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Assume microplastics_2015 has already been created in preprocessing
X_euclid = microplastics_2015[["Sample Longitude", "Sample Latitude"]].to_numpy()

kmeans_euclid = KMeans(n_clusters=6, random_state=0).fit(X_euclid)
microplastics_2015["cluster_kmeans_euclid"] = kmeans_euclid.labels_

plt.figure(figsize=(10, 5))
plt.scatter(
    microplastics_2015["Sample Longitude"],
    microplastics_2015["Sample Latitude"],
    c=microplastics_2015["cluster_kmeans_euclid"],
    s=10, alpha=0.7, edgecolor="none"
)
plt.title("Naive K-Means Clustering on Lon/Lat (Euclidean Distance)")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.xlim(-180, 180)
plt.ylim(-90, 90)
plt.tight_layout()
plt.show()
```

This run confirmed that there *are* spatial groupings, but the cluster boundaries were distorted, for example near the 180° meridian and at higher latitudes, because Euclidean distance on raw degrees is a poor approximation of distance on the globe.
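
To make the 180° meridian problem concrete, the sketch below compares what naive K-Means "sees" with the actual great-circle distance for two illustrative points on the equator, one on each side of the antimeridian. The points are made up for the example, and `haversine_km` is a small helper written here, not a library function.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius in km

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, lam1, phi2, lam2 = map(np.radians, (lat1, lon1, lat2, lon2))
    h = (np.sin((phi2 - phi1) / 2) ** 2
         + np.cos(phi1) * np.cos(phi2) * np.sin((lam2 - lam1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

# Two illustrative points (lat, lon) just either side of the antimeridian
p = (0.0, 179.0)
q = (0.0, -179.0)

# Planar distance in degrees, which is what Euclidean K-Means on lon/lat uses
planar_deg = np.hypot(q[0] - p[0], q[1] - p[1])      # 358 degrees apart
# Actual separation on the globe
great_circle_km = haversine_km(p[0], p[1], q[0], q[1])  # about 222 km
```

On the flat lon/lat plane these two points look like they are almost a full map-width apart, even though they are only about 2° of longitude from each other on the globe.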

---

### 2. DBSCAN for Density-Based Clustering

Next, I tried **DBSCAN**, which forms clusters based on local point density rather than forcing every point into a cluster. I used latitude and longitude (converted to radians) so I could measure distance on the Earth’s surface and distinguish dense coastal regions from more isolated samples.

```python
from sklearn.cluster import DBSCAN
import numpy as np

# Coordinates in radians for distance-based clustering
lat_rad = np.radians(microplastics_2015["Sample Latitude"].to_numpy())
lon_rad = np.radians(microplastics_2015["Sample Longitude"].to_numpy())
coords_rad = np.c_[lat_rad, lon_rad]

# eps in kilometers, then converted to radians
eps_km = 50.0
eps_rad = eps_km / 6371.0088  # Earth radius in km

dbscan = DBSCAN(
    eps=eps_rad,
    min_samples=5,
    metric="haversine"
).fit(coords_rad)

microplastics_2015["cluster_dbscan"] = dbscan.labels_

plt.figure(figsize=(10, 5))
plt.scatter(
    microplastics_2015["Sample Longitude"],
    microplastics_2015["Sample Latitude"],
    c=microplastics_2015["cluster_dbscan"],
    s=10, alpha=0.7, edgecolor="none"
)
plt.title("DBSCAN Clustering (Density-Based, Haversine Distance)")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.xlim(-180, 180)
plt.ylim(-90, 90)
plt.tight_layout()
plt.show()
```

DBSCAN was good at identifying small, dense clusters—often along coastlines—while marking very isolated points as noise (`-1`). However, for my purposes I ultimately wanted a small number of broad, contiguous regions that could be used as piecewise tracer sources. DBSCAN tended to break large oceanic areas into many fragments, which made those regions harder to interpret as clean “initialization zones.”
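
One quick way to quantify this fragmentation is to count clusters and noise points from the label array. The helper below is a sketch with a made-up name (`summarize_dbscan`) and synthetic labels; in practice it would be called on `dbscan.labels_` from the run above.

```python
import numpy as np

def summarize_dbscan(labels):
    """Return cluster count, noise count, and per-cluster sizes from DBSCAN labels."""
    labels = np.asarray(labels)
    # DBSCAN marks noise with -1, so exclude it when counting clusters
    cluster_ids, counts = np.unique(labels[labels != -1], return_counts=True)
    return {
        "n_clusters": int(cluster_ids.size),
        "n_noise": int(np.sum(labels == -1)),
        "sizes": dict(zip(cluster_ids.tolist(), counts.tolist())),
    }

# Synthetic labels for illustration only
demo_labels = [0, 0, 1, -1, 1, 1, -1, 2]
summary = summarize_dbscan(demo_labels)
```

Re-running this summary for a few values of `eps_km` and `min_samples` shows how sensitive the number of fragments is to those two parameters.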

---

### 3. Refined K-Means with Spherical (Haversine-Aware) Distance

After talking with my professor, I returned to K-Means but changed how I represented location. Instead of clustering directly on (lon, lat) with Euclidean distance, I projected each point onto the **unit sphere**:

\[
(x,\; y,\; z) = (\cos\phi \cos\lambda,\; \cos\phi \sin\lambda,\; \sin\phi)
\]

where \(\phi\) is latitude and \(\lambda\) is longitude in radians. Running K-Means in this 3D space better approximates great-circle distance on the globe and removes artifacts from treating the Earth as flat.
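
The reason this works can be stated in one line. For two unit vectors \(\mathbf{p}\) and \(\mathbf{q}\) on the sphere separated by a central angle \(\theta\), the straight-line (chord) distance that K-Means uses is

\[
\lVert \mathbf{p} - \mathbf{q} \rVert^2 = 2 - 2\,\mathbf{p}\cdot\mathbf{q} = 2 - 2\cos\theta = 4\sin^2\!\left(\tfrac{\theta}{2}\right)
\quad\Longrightarrow\quad
\lVert \mathbf{p} - \mathbf{q} \rVert = 2\sin\!\left(\tfrac{\theta}{2}\right),
\]

while the great-circle distance is \(d = R\,\theta\) for Earth radius \(R\). Since \(2\sin(\theta/2)\) increases monotonically for \(\theta \in [0, \pi]\), ranking points by chord distance gives the same order as ranking them by great-circle distance. One caveat: K-Means centroids are arithmetic means of 3D points, so they generally fall slightly inside the sphere. The method is therefore a close approximation to clustering with great-circle distance, not an exact equivalent.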

```python
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Use snapped coordinates if available; otherwise fall back to raw lon/lat
lat_used = microplastics_2015.get("lat_snapped", microplastics_2015["Sample Latitude"])
lon_used = microplastics_2015.get("lon_snapped", microplastics_2015["Sample Longitude"])

lat_rad = np.radians(lat_used.to_numpy())
lon_rad = np.radians(lon_used.to_numpy())

# Project onto unit sphere
X_sphere = np.c_[
    np.cos(lat_rad) * np.cos(lon_rad),
    np.cos(lat_rad) * np.sin(lon_rad),
    np.sin(lat_rad)
]

K = 6  # number of global regions I wanted
kmeans_sphere = KMeans(n_clusters=K, n_init=20, random_state=0).fit(X_sphere)
microplastics_2015["cluster_kmeans_sphere"] = kmeans_sphere.labels_

plt.figure(figsize=(10, 5))
plt.scatter(
    lon_used,
    lat_used,
    c=microplastics_2015["cluster_kmeans_sphere"],
    s=10, alpha=0.7, edgecolor="none"
)
plt.title("K-Means Clustering on Unit Sphere (Haversine-Aware)")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.xlim(-180, 180)
plt.ylim(-90, 90)
plt.tight_layout()
plt.show()
```

This final version of K-Means produced six geographically coherent regions that aligned much better with how I expect ocean basins and circulation patterns to behave. These spherical K-Means clusters are the ones I ultimately used as **piecewise microplastic source regions** in my later Eulerian modeling experiments.
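
If the region centers are needed as coordinates, for example to label regions on a map, the 3D centroids have to be projected back onto the sphere first. The sketch below shows one way to do that. The function name `centers_to_latlon` and the demo array are made up for illustration; in practice the input would be `kmeans_sphere.cluster_centers_`.

```python
import numpy as np

def centers_to_latlon(centers_xyz):
    """Project 3D K-Means centers onto the unit sphere; return (lat, lon) in degrees."""
    centers = np.asarray(centers_xyz, dtype=float)
    # Centroids are means of unit vectors, so they lie inside the sphere: renormalize
    unit = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    lat = np.degrees(np.arcsin(np.clip(unit[:, 2], -1.0, 1.0)))
    lon = np.degrees(np.arctan2(unit[:, 1], unit[:, 0]))
    return lat, lon

# Synthetic centers for illustration: one on the x-axis, one between the y- and z-axes
demo_centers = np.array([[0.5, 0.0, 0.0], [0.0, 0.3, 0.3]])
lat_c, lon_c = centers_to_latlon(demo_centers)
```

A recovered center is not guaranteed to fall on an ocean cell, so it may need the same snapping treatment as the sample points before being used in the model.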
