Created by Vin Bhat (https://orcid.org/0009-0007-2302-0564) in collaboration with Marlin Wong, Neel Kansara, and Saanvi Shah as part of the identifying outliers and unsupervised learning subteams of the Hack the GLOBE! project in the NASA SEES internship program.

In [None]:
# Importing KaggleHub and setting up the environment
# Authenticating Kaggle API

import kagglehub
import os

os.environ['KAGGLE_USERNAME'] = 'your_username'
os.environ['KAGGLE_KEY'] = 'your_api_key'

from kaggle import api

In [None]:
# Importing Hack the GLOBE! data source

hack_the_globe_path = kagglehub.competition_download('hack-the-globe')

print('Data source import complete.')

# Initialize your environment

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pyarrow
import matplotlib.pyplot as plt

# Load the Data

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

parquet_file_path = os.path.join(hack_the_globe_path, 'mv_surface_temperatures_wide.parquet')
df = pd.read_parquet(parquet_file_path)
display(df)

# Data Profiling
We used `df.info()` to get a general sense of the DataFrame and came to the following conclusions:

*   64 columns, 716031 rows
*   2 columns (`submission_access_code_id`, `site_nickname`) contain only null values
*   Some data has only been collected more recently (`site_true_latitude`, `site_true_longitude`, `site_true_elevation`, and `site_true_point` each have only 84 non-null values)
*   The most common dtypes are string, Int64, and float64, in order of prevalence
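
The conclusions above were read off `df.info()` by eye. As a minimal sketch of how the same facts could be collected programmatically, the hypothetical helper `profile_columns` below builds a per-column summary. The toy DataFrame and its values are made up for illustration and only stand in for the real data.

```python
import pandas as pd

def profile_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, non-null count, and null fraction."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "null_fraction": frame.isna().mean(),
    })
    # Most-empty columns first
    return summary.sort_values("null_fraction", ascending=False)

# Toy stand-in for the real DataFrame (hypothetical values)
toy = pd.DataFrame({
    "site_nickname": [None, None, None],
    "site_true_latitude": [41.6, None, None],
    "sample_surface_temperature_c": [12.5, 30.1, 18.0],
})

summary = profile_columns(toy)
all_null_columns = summary.index[summary["non_null"] == 0].tolist()

print(summary)
print("All-null columns:", all_null_columns)
print(summary["dtype"].value_counts())
```

Calling `profile_columns(df)` on the full dataset should surface the all-null and sparsely populated columns without scanning the `df.info()` output manually.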



In [None]:
# Displaying the DataFrame information
print(df.info())
# Displaying the shape of the DataFrame
print(df.shape)

In [None]:
# Displaying number of null values in each column from greatest to least
print(df.isnull().sum().sort_values(ascending=False))
# Displaying proportion of null values in each column from greatest to least
print(df.isna().mean().sort_values(ascending=False))

The two columns that contain only null values carry no information for the analysis, so we can remove them.

In [None]:
# Remove columns with only null values
df = df.drop(columns=['submission_access_code_id', 'site_nickname'])

In [None]:
import folium
from folium.plugins import MarkerCluster

# Filter rows with valid submission coordinates
df_valid = df[['submission_latitude', 'submission_longitude']].dropna()

# Center map at the mean of valid coordinates
map_center = [
    df_valid['submission_latitude'].mean(),
    df_valid['submission_longitude'].mean()
]

m = folium.Map(location=map_center, zoom_start=2)
marker_cluster = MarkerCluster().add_to(m)

for _, row in df_valid.iterrows():
    folium.Marker(
        location=[row['submission_latitude'], row['submission_longitude']]
    ).add_to(marker_cluster)

m

# Location Outliers
Key observation: the map shows several sets of clearly invalid coordinates, including a cluster of 692 points north of Alaska, a cluster of 27 north of Russia, and clusters of 9 and 3 near (0,0). Next steps are to investigate these outlier points by coordinate and to differentiate between the site coordinates and the submission coordinates.

In [None]:
# Define latitude and longitude thresholds for likely and impossible coordinates
likely_lat_min, likely_lat_max = -60, 85    # Most land-based submissions
likely_lon_min, likely_lon_max = -180, 180  # Valid longitude range

# Impossible coordinates: outside valid lat/lon range
impossible_coords = (
    (df['submission_latitude'] < -90) | (df['submission_latitude'] > 90) |
    (df['submission_longitude'] < -180) | (df['submission_longitude'] > 180)
)

# Unlikely but possible: outside likely land-based range but still valid
unlikely_coords = (
    ((df['submission_latitude'] < likely_lat_min) | (df['submission_latitude'] > likely_lat_max)) &
    (df['submission_latitude'].notna())
)

# Points close to (0,0)
zero_point_coords = (
    df['submission_latitude'].between(-1, 1) &
    df['submission_longitude'].between(-1, 1)
)

# Combine all criteria
flagged_coords = df[impossible_coords | unlikely_coords | zero_point_coords]

# Show summary
print(f"Impossible coords: {impossible_coords.sum()}")
print(f"Unlikely coords: {unlikely_coords.sum()}")
print(f"Near (0,0): {zero_point_coords.sum()}")
print(f"Total flagged: {len(flagged_coords)}")
display(flagged_coords[['submission_latitude', 'submission_longitude']])

The results match our earlier observations from the map. Many of the flagged points fall in clusters and likely come from the same `userid` or `organizationid`.

In [None]:
# Get unique organizationid values from flagged_coords
org_ids_flagged = flagged_coords['organizationid'].unique()
print(org_ids_flagged)
print(flagged_coords.value_counts('organizationid'))

We used the GLOBE API (https://api.globe.gov/search/swagger-ui.html#/v-1-controller/findByProtocolAndMeasuredDateAndOAndOrganizationIdUsingGET) to identify the organizations:
* 78523475: Iksal 'c' Primary School (Israel)
  * Responsible for cluster of 692 north of Alaska
* 6512608: NASA Langley Research Center GLOBE v-School (United States)
  * Responsible for cluster of 9 at (0,0)
* 106363801: Queen Elizabeth College (Mauritius)
  * Responsible for 2 points near (0,0)
* 58728215: Federal Government Girls College, Calabar (Nigeria)
  * Responsible for 1 point near (0,0)

In [None]:
org_id = 78523475 # Iksal 'c' Primary School organizationid
df_filtered = df[
    (df['organizationid'] == org_id) &
    (df['submission_latitude'].notna()) &
    (df['submission_longitude'].notna())
]

# Get average location to center the map
center_lat = df_filtered['submission_latitude'].mean()
center_lon = df_filtered['submission_longitude'].mean()

# Create the Folium map
m = folium.Map(location=[center_lat, center_lon], zoom_start=4)

marker_cluster = MarkerCluster().add_to(m)

# Add markers
for _, row in df_filtered.iterrows():
    folium.Marker(
        location=[row['submission_latitude'], row['submission_longitude']],
        popup=f"Surface Temp: {row['sample_surface_temperature_c']}",
        tooltip=row.get('site_id', 'Site')
    ).add_to(marker_cluster)

# Display map
m

This confirms the hypothesis that the 692 points come from Iksal 'c' Primary School.

# Coordinate Uncertainty

In [None]:
import geopandas as gpd
import contextily as ctx

# Filter rows with only site coordinates
df_filtered = df[
    df['site_latitude'].notna() &
    df['site_longitude'].notna() &
    df['submission_latitude'].isna() &
    df['submission_longitude'].isna() &
    df['site_latitude'].between(-90, 90) &
    df['site_longitude'].between(-180, 180)
]

# Convert to GeoDataFrame
gdf = gpd.GeoDataFrame(
    df_filtered,
    geometry=gpd.points_from_xy(df_filtered['site_longitude'], df_filtered['site_latitude']),
    crs="EPSG:4326"
)

# Reproject to Web Mercator
gdf = gdf.to_crs(epsg=3857)

# Plot
fig, ax = plt.subplots(figsize=(12, 8))
gdf.plot(ax=ax, markersize=10, alpha=0.6, color='crimson', label="Site Only")
# Add an OpenStreetMap.Mapnik basemap (used here in place of Stamen.TerrainBackground)
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)
ax.set_axis_off()
plt.title("Site Coordinates Without Submission Coordinates")
plt.legend()
plt.tight_layout()
plt.show()

Many rows seem to have site coordinates but not submission coordinates. For the rows that do have both, we want to check if any have differences or errors.

In [None]:
# Keep only rows that have both site and submission coordinates
df_geo = df.dropna(subset=[
    'site_latitude', 'site_longitude',
    'submission_latitude', 'submission_longitude'
])

# Tolerance in degrees. This is deliberately large so that only extreme
# mismatches (e.g., points on the wrong side of the globe) are flagged
tolerance = 100

# Find rows where lat/lon values differ by more than the tolerance
mismatches = df_geo[
    ((df_geo['site_latitude'] - df_geo['submission_latitude']).abs() > tolerance) |
    ((df_geo['site_longitude'] - df_geo['submission_longitude']).abs() > tolerance)
]

# Output the count and optionally preview
print(f"Number of mismatched coordinates: {len(mismatches)}")
print(mismatches[['organizationid','site_latitude', 'site_longitude', 'submission_latitude', 'submission_longitude']].value_counts('organizationid'))

`organizationid` 394556 has 27 extremely mismatched coordinates (the cluster seen on the Folium map earlier). Looking this ID up in the GLOBE API mentioned above shows that these points originated from the University of Toledo (United States).
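
The check above compares coordinates in degrees, which only catches very large mismatches, and a degree of longitude covers less ground toward the poles. As a hedged alternative that is not part of the original analysis, the sketch below measures the site-to-submission gap as a great-circle distance using the haversine formula. The helper name `haversine_km`, the sample coordinates, and the 50 km threshold are all assumptions chosen for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Hypothetical site and submission coordinates
site_lat = np.array([41.66, 32.68, 0.0])
site_lon = np.array([-83.61, 35.32, 0.0])
sub_lat = np.array([41.66, 75.00, 0.0])
sub_lon = np.array([-83.61, -150.00, 1.0])

distance_km = haversine_km(site_lat, site_lon, sub_lat, sub_lon)

max_distance_km = 50  # hypothetical threshold, tune for the dataset
is_far = distance_km > max_distance_km

print(distance_km.round(1))
print(is_far)
```

Because the function is vectorized, it can take whole DataFrame columns (for example `df_geo['site_latitude']`) rather than looping over rows.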

In [None]:
import seaborn as sns
from folium.plugins import HeatMap

df_geo = df[['submission_latitude', 'submission_longitude', 'site_elevation', 'sample_surface_temperature_c']].dropna()

heat_data = [
    [row['submission_latitude'], row['submission_longitude'], row['sample_surface_temperature_c']]
    for _, row in df_geo.iterrows()
]

map_center = [df_geo['submission_latitude'].mean(), df_geo['submission_longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=2)

HeatMap(heat_data, min_opacity=0.3, radius=8, blur=4).add_to(m)

m

The heatmap above shows where large amounts of data have been entered from. Notable locations include the University of Toledo (United States), Iksal 'c' Primary School (Israel), and National Taiwan University (Taiwan). To see if an error was made when entering coordinates for Iksal 'c' Primary School, we can list the coordinates that weren't part of the batch of 692 from earlier.

In [None]:
from geopy.distance import geodesic

# Filter by organizationid
df_filtered = df[
    (df['organizationid'] == 78523475) &
    (df['submission_latitude'].notna()) &
    (df['submission_longitude'].notna())
]

# Define Israel center
israel_coords = (31.0461, 34.8516)

# Define distance threshold in miles
distance_limit_miles = 500

# Function to compute distance from Israel center
def is_within_radius(row):
    point = (row['submission_latitude'], row['submission_longitude'])
    return geodesic(point, israel_coords).miles <= distance_limit_miles

# Apply the distance filter
df_near_israel = df_filtered[df_filtered.apply(is_within_radius, axis=1)]

# Show result
print(df_near_israel.value_counts('organizationid'))
print(df_near_israel.value_counts('userid'))

This supports our suspicion that the batch of 692 points was submitted with the same submission coordinates in error. In addition, only one `userid` appears to be entering the non-faulty values from Iksal 'c' Primary School, which suggests that a teacher might be entering data on behalf of students. Next, we can search for rows with implausibly high temperatures, where users might have entered the temperature in Fahrenheit instead of Celsius.

# Temperature Outliers

In [None]:
# Find rows with surface temperatures higher than 60°C
high_temp_coords = df[df['sample_surface_temperature_c'] > 60]

# Display summary and preview
print(f"Number of rows with surface temperature > 60°C: {len(high_temp_coords)}")
display(high_temp_coords[['site_latitude', 'site_longitude', 'sample_surface_temperature_c', 'organizationid', 'userid']])
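
One way to probe the Fahrenheit hypothesis is to convert each suspicious reading as if it had been entered in Fahrenheit and see whether the result lands in an ordinary range. The sketch below does this on made-up readings. The column name `if_fahrenheit_c`, the helper `fahrenheit_to_celsius`, and the -40 to 60 °C plausibility window are assumptions, not part of the original analysis.

```python
import pandas as pd

def fahrenheit_to_celsius(temp_f):
    """Convert a Fahrenheit value (or Series) to Celsius."""
    return (temp_f - 32) * 5 / 9

# Hypothetical readings stored in the Celsius column
readings = pd.DataFrame({"sample_surface_temperature_c": [25.0, 61.0, 95.0, 104.0]})

# Same threshold as the cell above
suspect = readings[readings["sample_surface_temperature_c"] > 60].copy()

# What the value would be in Celsius if it had really been typed in Fahrenheit
suspect["if_fahrenheit_c"] = fahrenheit_to_celsius(suspect["sample_surface_temperature_c"])

# A converted value in an ordinary range makes a unit mix-up plausible, not certain
suspect["plausible_unit_error"] = suspect["if_fahrenheit_c"].between(-40, 60)

print(suspect)
```

A converted value that looks reasonable does not prove a unit error, so this is better treated as a hint for manual review than as an automatic correction.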

# Interpretable Flagging
Now we can combine this new temperature-based information with the coordinate information to create boolean flag columns for unlikely temperatures, impossible coordinates, unlikely coordinates, coordinates near (0,0), and mismatched site and submission coordinates.

In [None]:
# Add boolean flag columns to df for outlier temperatures and coordinates

# Outlier temperature: surface temperature > 60°C (possible Fahrenheit entry or error)
df['is_temp_outlier'] = df['sample_surface_temperature_c'] > 60

# Outlier coordinates: use previously defined impossible_coords, unlikely_coords, zero_point_coords
df['is_impossible_coord'] = impossible_coords
df['is_unlikely_coord'] = unlikely_coords
df['is_zero_point_coord'] = zero_point_coords

# Add a boolean flag for mismatched site and submission coordinates
df['is_mismatched_coord'] = (
    ((df['site_latitude'] - df['submission_latitude']).abs() > tolerance) |
    ((df['site_longitude'] - df['submission_longitude']).abs() > tolerance)
)

In addition, we'll add a column that sums the number of flags each row has.

In [None]:
# Add a column that sums the number of flags for each row
flag_columns = [
    'is_temp_outlier',
    'is_impossible_coord',
    'is_unlikely_coord',
    'is_zero_point_coord',
    'is_mismatched_coord'
]
df['num_flags'] = df[flag_columns].sum(axis=1)

Now we can check how many rows have flags.

In [None]:
# Count rows with at least one flag
num_with_flags = (df['num_flags'] > 0).sum()
print(f"Rows with at least one flag: {num_with_flags}")

In [None]:
# Create a dataset without any flagged rows (i.e., rows where num_flags == 0)
df_clean = df[df['num_flags'] == 0].copy()
print(f"Shape of clean dataset: {df_clean.shape}")
display(df_clean.head())

# Unsupervised Learning
Now that we have sorted out the easily identifiable outliers, we can use unsupervised learning methods like K-Means, GMM, Isolation Forest, and Local Outlier Factor to potentially find new outliers.

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Select numeric columns for clustering (e.g., latitude, longitude, temperature)
features = ['site_latitude', 'site_longitude', 'sample_surface_temperature_c']
df_kmeans = df_clean.dropna(subset=features).copy()

X = df_kmeans[features].values

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-Means (choose k=5 as an example, adjust as needed)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df_kmeans['kmeans_label'] = kmeans.fit_predict(X_scaled)

# Show cluster counts
print(df_kmeans['kmeans_label'].value_counts())
display(df_kmeans.head())

plt.figure(figsize=(10, 6))
scatter = plt.scatter(
    df_kmeans['site_longitude'],
    df_kmeans['site_latitude'],
    c=df_kmeans['kmeans_label'],
    cmap='tab10',
    alpha=0.6
)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('K-Means Clusters (k=5)')
plt.colorbar(scatter, label='Cluster Label')
plt.show()

The K-Means clusters roughly align with geographical groupings, which makes sense because two of the three features we fed it are latitude and longitude.
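
The value k=5 above was picked as an example. A common heuristic for choosing k is the elbow method: fit K-Means for a range of k values and look for the point where inertia stops dropping sharply. The sketch below illustrates this on synthetic data with three well-separated blobs. The data, the range of k, and the variable names are assumptions for illustration, and the same loop could be run on `X_scaled`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated 2-D blobs (hypothetical, for illustration)
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Fit K-Means for several values of k and record the inertia
# (sum of squared distances from each point to its assigned centroid)
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    model.fit(X_toy_scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.2f}")
```

On data like this, inertia falls steeply until k reaches the true number of blobs and then flattens. Real data rarely shows such a clean elbow, so the result is a guide rather than a definitive answer.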

In [None]:
from sklearn.mixture import GaussianMixture

# Select features for GMM clustering
features = ['site_latitude', 'site_longitude', 'sample_surface_temperature_c']
df_gmm = df_clean.dropna(subset=features).copy()
X_gmm = df_gmm[features].values

# Fit the GMM after the feature matrix has been defined
gmm = GaussianMixture(n_components=5, random_state=42)
gmm.fit(X_gmm)

# Predict GMM cluster labels
df_gmm['gmm_label'] = gmm.predict(X_gmm)

# Show cluster counts and preview
print(df_gmm['gmm_label'].value_counts())
display(df_gmm.head())

# Optional: visualize clusters
plt.figure(figsize=(10, 6))
scatter = plt.scatter(
    df_gmm['site_longitude'],
    df_gmm['site_latitude'],
    c=df_gmm['gmm_label'],
    cmap='tab10',
    alpha=0.6
)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('GMM Clusters (n_components=5)')
plt.colorbar(scatter, label='GMM Cluster Label')
plt.show()

In [None]:
from sklearn.decomposition import PCA

# Select numeric columns for PCA (same as clustering)
features = ['site_latitude', 'site_longitude', 'sample_surface_temperature_c']
df_pca = df_clean.dropna(subset=features).copy()
X_pca = df_pca[features].values

# Standardize features (reuses the scaler fitted in the K-Means cell)
X_pca_scaled = scaler.transform(X_pca)

# Fit PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_pca_scaled)

# Add principal components to DataFrame
df_pca['PC1'] = principal_components[:, 0]
df_pca['PC2'] = principal_components[:, 1]

# Plot the first two principal components
plt.figure(figsize=(10, 6))
plt.scatter(df_pca['PC1'], df_pca['PC2'], alpha=0.5, s=10)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Clean Data')
plt.show()

# Explained variance ratio
print("Explained variance ratio:", pca.explained_variance_ratio_)

In [None]:
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Use the same features as before
features = ['site_latitude', 'site_longitude', 'sample_surface_temperature_c']
df_unsup = df_clean.dropna(subset=features).copy()
X_unsup = df_unsup[features].values

# Isolation Forest
iso_forest = IsolationForest(contamination=0.01, random_state=42)
df_unsup['iso_outlier'] = iso_forest.fit_predict(X_unsup) == -1

# Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
df_unsup['lof_outlier'] = lof.fit_predict(X_unsup) == -1

# Show counts of detected outliers
print("Isolation Forest outliers:", df_unsup['iso_outlier'].sum())
print("Local Outlier Factor outliers:", df_unsup['lof_outlier'].sum())

# Optionally, preview some outliers
display(df_unsup[df_unsup['iso_outlier'] | df_unsup['lof_outlier']].head())
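
Isolation Forest and Local Outlier Factor use different criteria, so it can be useful to check how often they agree before dropping every row flagged by either one. The sketch below shows one way to do that with a cross-tabulation. The flag values are made up for illustration, and on the real data the same calls could be applied to `df_unsup['iso_outlier']` and `df_unsup['lof_outlier']`.

```python
import pandas as pd

# Hypothetical outlier flags from two detectors
flags = pd.DataFrame({
    "iso_outlier": [True, True, False, False, False, True],
    "lof_outlier": [True, False, True, False, False, True],
})

# 2x2 table of agreement and disagreement between the detectors
agreement = pd.crosstab(flags["iso_outlier"], flags["lof_outlier"])

both = (flags["iso_outlier"] & flags["lof_outlier"]).sum()
either = (flags["iso_outlier"] | flags["lof_outlier"]).sum()

print(agreement)
print(f"Flagged by both: {both}, by either: {either}")
```

Rows flagged by both detectors are stronger outlier candidates than rows flagged by only one, which may matter when deciding whether to drop on "either" or on "both".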

# Final Dataset Without Outliers

In [None]:
# Exclude rows detected as outliers by either Isolation Forest or Local Outlier Factor
final_df = df_clean.drop(df_unsup[df_unsup['iso_outlier'] | df_unsup['lof_outlier']].index)
print(f"Final dataset shape (excluding unsupervised outliers): {final_df.shape}")
display(final_df)

While this final DataFrame isn't perfect, it lets users analyze the data without the most egregious outliers and faulty entries.