# NOAA Dataset: Handling Missing Values

This notebook details the process for cleaning and preparing the NOAA dataset, with a focus on addressing missing values. The raw dataset contains data from three oceanic buoys, each with different sensor capabilities, leading to two distinct types of missing data: systematic and incidental.

**Systematic missingness** occurs when a station lacks the hardware to measure a certain variable. In this dataset, one buoy never reports wind speed, which suggests it has no anemometer. **Incidental missingness** refers to small, random gaps in data from a sensor that is otherwise operational, often due to transient sensor malfunctions or transmission errors.

The strategy used here is to first handle the systematic gaps by partitioning the data into analysis-specific subsets. Then, incidental gaps within these subsets are addressed using time-series interpolation. This preserves data integrity by not attempting to impute values that were never physically measurable.

## Setup and Data Loading

In [1]:
import io
import pandas as pd
from IPython.display import display, HTML

In [2]:
files = ['../data/source/NOAA_46041.csv',
         '../data/source/NOAA_46050.csv',
         '../data/source/NOAA_46243.csv'
]

dataframes = [pd.read_csv(file) for file in files]
df = pd.concat(dataframes, ignore_index=True)
df['date_time'] = pd.to_datetime(df['date_time'])

## 1. Standardize and Clean Columns

Column names are programmatically standardized for consistency and ease of access. This involves removing leading/trailing whitespace, replacing units and special characters with descriptive suffixes, replacing spaces with underscores, and converting all names to lowercase. A standardized naming convention is best practice to prevent errors, improve code readability, and make data manipulation more predictable.
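
The same steps can be wrapped in a small helper, which makes the order of operations easier to see and to test. The sketch below is illustrative only: the raw header strings and the `standardize_columns` name are assumptions, and only a subset of the unit patterns is included.

```python
import pandas as pd

# Hypothetical raw headers, modeled on the unit patterns handled in this notebook
raw_columns = [
    " Wind Speed (m/s)",
    "Latitude (degrees north) ",
    "Wave Height (m)",
    "Sea Level Pressure (hPa)",
]

# Unit patterns mapped to descriptive suffixes (a subset, for illustration)
UNIT_SUFFIXES = {
    r"\s*\(degrees north\)": "_degrees_north",
    r"\s*\(m/s\)": "_mps",
    r"\s*\(hPa\)": "_hpa",
    r"\s*\(m\)": "_m",
}


def standardize_columns(columns):
    """Strip whitespace, replace units, collapse separators, and lowercase."""
    cols = pd.Index(columns).str.strip()
    for pattern, suffix in UNIT_SUFFIXES.items():
        cols = cols.str.replace(pattern, suffix, regex=True)
    cols = cols.str.replace(r"\s+", "_", regex=True)   # spaces -> underscores
    cols = cols.str.replace(r"_{2,}", "_", regex=True)  # collapse repeated underscores
    return cols.str.lower()


clean_columns = list(standardize_columns(raw_columns))
print(clean_columns)
```

Units are replaced before lowercasing, so case-sensitive patterns such as `(hPa)` still match.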

In [3]:
# Clean up column names
df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace(r'\s*\(degrees north\)', '_degrees_north', regex=True)
df.columns = df.columns.str.replace(r'\s*\(degrees east\)', '_degrees_east', regex=True)
df.columns = df.columns.str.replace(r'\s*\(C°\)', '_celsius', regex=True)
df.columns = df.columns.str.replace(r'\s*\(m/s\)', '_mps', regex=True)
df.columns = df.columns.str.replace(r'\s*\(hPa\)', '_hpa', regex=True)
df.columns = df.columns.str.replace(r'\s*\(s\)', '_s', regex=True)
df.columns = df.columns.str.replace(r'\s*\(m\)', '_m', regex=True)

df.columns = df.columns.str.replace(r'\s+', '_', regex=True)
df.columns = df.columns.str.replace(r'_{2,}', '_', regex=True)
df.columns = df.columns.str.lower()

columns_str = "\n".join(df.columns)

html = f"<pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{columns_str}</pre>"
display(HTML(html))

## 2. Merge Redundant Columns

During exploratory data analysis, it was discovered that two columns, `wind_speed_mps` and `wind_speed_cwind_mps`, measure the same physical quantity but are populated for different sets of records. To create a single, authoritative source for wind speed, these columns are merged. The `combine_first` method is used to coalesce the two columns, filling NaN values in the first column with the corresponding values from the second. The original, now-redundant columns are then dropped to simplify the DataFrame.
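
A toy example shows how `combine_first` behaves. The values below are made up for illustration; the conflict check is an optional sanity check that is not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Made-up values: each column is populated for a different set of rows
toy = pd.DataFrame({
    "wind_speed_mps": [3.1, np.nan, np.nan, 5.0],
    "wind_speed_cwind_mps": [np.nan, 4.2, np.nan, 9.9],
})

# Take the first column where it has a value, otherwise fall back to the second
toy["wind_speed"] = toy["wind_speed_mps"].combine_first(toy["wind_speed_cwind_mps"])

# Optional check: rows where both sources report a value but disagree
both_present = toy["wind_speed_mps"].notna() & toy["wind_speed_cwind_mps"].notna()
disagree = toy["wind_speed_mps"] != toy["wind_speed_cwind_mps"]
conflicts = int((both_present & disagree).sum())

print(toy)
print(f"Rows with conflicting values: {conflicts}")
```

Where both columns hold a value, the first column wins, so a non-zero conflict count is worth inspecting before the originals are dropped. Rows missing in both columns stay `NaN`.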

In [4]:
df['wind_speed'] = df['wind_speed_mps'].combine_first(
    df['wind_speed_cwind_mps'])

df.drop(columns=['wind_speed_mps', 'wind_speed_cwind_mps'], inplace=True)

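# Capture df.info() as text; escape angle brackets so the leading
# "<class ...>" line is not parsed as an HTML tag inside <pre>
buffer = io.StringIO()
df.info(buf=buffer)
info = buffer.getvalue().replace('<', '&lt;').replace('>', '&gt;')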

html = f"<pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{info}</pre>"
display(HTML(html))

## 3. Quantify Missing Data

A quantitative analysis of missing data is performed to guide the imputation strategy. The percentage of missing values is calculated for each variable. The results show that variables such as `air_temperature_celsius` and `sea_level_pressure_hpa` are missing in over 65% of records, a direct result of specific stations lacking the necessary sensors. This confirms the presence of systematic missingness and rules out simple, global imputation methods, which would introduce significant bias.
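
Dataset-wide percentages do not show which stations are responsible for the gaps. One way to separate the two kinds of missingness is a per-station breakdown: a variable that is 100% missing for a station points to a missing sensor, while a small percentage points to incidental gaps. The sketch below uses made-up station IDs and values, and assumes a `station_id` column as used later in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up data: station 2 never reports wind speed
toy = pd.DataFrame({
    "station_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "wind_speed": [4.0, np.nan, 5.0, 6.0, np.nan, np.nan, np.nan, np.nan],
    "wave_height_m": [1.0, 1.2, 1.1, 1.3, 0.9, np.nan, 1.0, 1.1],
})

value_cols = ["wind_speed", "wave_height_m"]

# Percentage of missing values per station and variable
missing_by_station = toy[value_cols].isna().groupby(toy["station_id"]).mean().mul(100)

# A variable that is entirely missing for a station is treated as systematic
systematic = missing_by_station.eq(100)

print(missing_by_station)
print(systematic)
```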

In [5]:
missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100
missing_info = pd.DataFrame({'Missing Values': missing_data, 'Percentage': missing_percentage})

html = f"<pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{missing_info}</pre>"
display(HTML(html))

## 4. Imputation Strategy

Because most of the `NaN` values are systematic, no single, dataset-wide imputation is performed. Instead, three separate DataFrames are created based on the station capabilities discovered during EDA. This respects the "data silos": each task uses only the data that is physically valid for it.

- `df_complete_station`: For modeling tasks that require all variables (e.g., predicting waves from wind and pressure). This will only contain data from Station 2868187.

- `df_wind_analysis`: For analyzing regional wind patterns. This will contain data from the two stations that measure wind (2868187 and 2868934).

- `df_wave_analysis`: For analyzing regional wave patterns. This will contain data from the two stations that measure waves (2868187 and 2888997).
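
The station lists above come from the EDA and are hard-coded in the next cell. As a cross-check, they could also be derived from the data itself. The sketch below is one possible approach on made-up data; `stations_reporting` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up data: station 10 reports both variables, 20 only wind, 30 only waves
toy = pd.DataFrame({
    "station_id": [10, 10, 20, 20, 30, 30],
    "wind_speed": [4.0, np.nan, 5.0, 6.0, np.nan, np.nan],
    "wave_height_m": [1.0, 1.2, np.nan, np.nan, 0.9, 1.1],
})


def stations_reporting(frame, column):
    """Return sorted station IDs with at least one non-missing value in `column`."""
    has_data = frame[column].notna().groupby(frame["station_id"]).any()
    return sorted(has_data[has_data].index.tolist())


wind_station_ids = stations_reporting(toy, "wind_speed")
wave_station_ids = stations_reporting(toy, "wave_height_m")
complete_station_ids = sorted(set(wind_station_ids) & set(wave_station_ids))

print(wind_station_ids, wave_station_ids, complete_station_ids)
```

If the derived lists differ from the hard-coded ones, that is a signal to revisit the EDA before partitioning.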

In [6]:
# DataFrame for the one station with complete data
df_complete_station = df[df['station_id'] == 2868187].copy()

# DataFrame for all stations that have wind data
wind_station_ids = [2868187, 2868934]
df_wind_analysis = df[df['station_id'].isin(wind_station_ids)].copy()

# DataFrame for all stations that have wave data
wave_station_ids = [2868187, 2888997]
df_wave_analysis = df[df['station_id'].isin(wave_station_ids)].copy()

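# Capture each subset's info() output; escape angle brackets so the leading
# "<class ...>" line is not parsed as an HTML tag inside <pre>
buffer_complete = io.StringIO()
df_complete_station.info(buf=buffer_complete)
info_complete = buffer_complete.getvalue().replace('<', '&lt;').replace('>', '&gt;')

buffer_wind = io.StringIO()
df_wind_analysis.info(buf=buffer_wind)
info_wind = buffer_wind.getvalue().replace('<', '&lt;').replace('>', '&gt;')

buffer_wave = io.StringIO()
df_wave_analysis.info(buf=buffer_wave)
info_wave = buffer_wave.getvalue().replace('<', '&lt;').replace('>', '&gt;')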

html = f"""
<div style="display: flex; width: 100%;">
  <div style="flex: 1; padding-right: 10px;">
    <h4 style="font-family: monospace;">Complete Station Info:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{info_complete}</pre>
  </div>
  <div style="flex: 1; padding-left: 5px; padding-right: 5px;">
    <h4 style="font-family: monospace;">Wind Analysis Info:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{info_wind}</pre>
  </div>
  <div style="flex: 1; padding-left: 10px;">
    <h4 style="font-family: monospace;">Wave Analysis Info:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{info_wave}</pre>
  </div>
</div>
"""

display(HTML(html))

## 5. Handle Incidental Missing Data with Time-Series Interpolation

After partitioning, the remaining missing values are treated as incidental. For temporal data like this, time-based linear interpolation is well suited to filling small gaps: it estimates each missing value from its chronological position between the two nearest known data points, which is more physically plausible than filling with the mean or median. To do this, the `date_time` column is set as the DataFrame index and sorted, and interpolation is then applied. Any NaN values that remain after interpolation (typically at the very beginning of a series, where there is no earlier observation to interpolate from) are dropped.
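
`df_wind_analysis` and `df_wave_analysis` each combine two stations, so once the index is sorted by `date_time`, rows from different buoys are interleaved and a plain `interpolate` call fills gaps from neighboring rows regardless of station. One possible refinement, sketched below on made-up data, is to interpolate within each station separately. This is an illustrative alternative, not the method used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Made-up readings from two stations with unevenly spaced timestamps
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:30", "2024-01-01 02:00",
        "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 01:30", "2024-01-01 02:00",
    ]),
    "station_id": [1, 1, 1, 2, 2, 2, 2],
    "wave_height_m": [1.0, np.nan, 5.0, np.nan, 10.0, np.nan, 30.0],
})

parts = []
for station_id, group in toy.groupby("station_id"):
    # method='time' needs a sorted DatetimeIndex
    group = group.set_index("date_time").sort_index()
    # Weight the estimate by elapsed time between this station's own readings
    group["wave_height_m"] = group["wave_height_m"].interpolate(method="time")
    parts.append(group)

filled = pd.concat(parts).sort_index()

# The leading gap for station 2 has no earlier reading, so it stays NaN and is dropped
cleaned = filled.dropna(subset=["wave_height_m"])

print(filled)
print(f"Rows before: {len(filled)}, after: {len(cleaned)}")
```

Station 1's gap at 00:30 sits a quarter of the way between its 00:00 and 02:00 readings, so it is filled a quarter of the way between 1.0 and 5.0 rather than at the midpoint.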

In [7]:
df_complete_station.set_index('date_time', inplace=True)
df_complete_station.sort_index(inplace=True)

missing_before_str = df_complete_station.isnull().sum().to_string()

df_complete_station.interpolate(method='time', inplace=True)
missing_after_str = df_complete_station.isnull().sum().to_string()

html = f"""
<div style="display: flex; width: 100%;">

  <!-- Left Column: Displays the 'before' state -->
  <div style="flex: 1; padding-right: 10px;">
    <h4 style="font-family: monospace;">Missing values BEFORE interpolation:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{missing_before_str}</pre>
  </div>
  
  <!-- Right Column: Displays the 'after' state -->
  <div style="flex: 1; padding-left: 10px;">
    <h4 style="font-family: monospace;">Missing values AFTER interpolation:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{missing_after_str}</pre>
  </div>
  
</div>
"""

display(HTML(html))

In [8]:
df_wave_analysis.set_index('date_time', inplace=True)
df_wave_analysis.sort_index(inplace=True)

missing_before_str = df_wave_analysis.isnull().sum().to_string()

df_wave_analysis.interpolate(method='time', inplace=True)
missing_after_str = df_wave_analysis.isnull().sum().to_string()

html = f"""
<div style="display: flex; width: 100%;">

  <!-- Left Column: Displays the 'before' state -->
  <div style="flex: 1; padding-right: 10px;">
    <h4 style="font-family: monospace;">Missing values BEFORE interpolation:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{missing_before_str}</pre>
  </div>
  
  <!-- Right Column: Displays the 'after' state -->
  <div style="flex: 1; padding-left: 10px;">
    <h4 style="font-family: monospace;">Missing values AFTER interpolation:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{missing_after_str}</pre>
  </div>
  
</div>
"""

display(HTML(html))

In [9]:
print(f"Shape before dropping final NaNs: {df_wave_analysis.shape}")
df_wave_analysis.dropna(inplace=True)
print(f"Shape after dropping final NaNs:  {df_wave_analysis.shape}")

Shape before dropping final NaNs: (2168, 10)
Shape after dropping final NaNs:  (2166, 10)


In [10]:
df_wind_analysis.set_index('date_time', inplace=True)
df_wind_analysis.sort_index(inplace=True)

missing_before_str = df_wind_analysis.isnull().sum().to_string()

df_wind_analysis.interpolate(method='time', inplace=True)
missing_after_str = df_wind_analysis.isnull().sum().to_string()

html = f"""
<div style="display: flex; width: 100%;">

  <!-- Left Column: Displays the 'before' state -->
  <div style="flex: 1; padding-right: 10px;">
    <h4 style="font-family: monospace;">Missing values BEFORE interpolation:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{missing_before_str}</pre>
  </div>
  
  <!-- Right Column: Displays the 'after' state -->
  <div style="flex: 1; padding-left: 10px;">
    <h4 style="font-family: monospace;">Missing values AFTER interpolation:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{missing_after_str}</pre>
  </div>
  
</div>
"""

display(HTML(html))

In [11]:
print(f"Shape before dropping final NaNs: {df_wind_analysis.shape}")
df_wind_analysis.dropna(inplace=True)
print(f"Shape after dropping final NaNs:  {df_wind_analysis.shape}")

Shape before dropping final NaNs: (5774, 10)
Shape after dropping final NaNs:  (5767, 10)


## 6. Save Cleaned Data

In [12]:
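# Capture each cleaned subset's info() output; escape angle brackets so the
# leading "<class ...>" line is not parsed as an HTML tag inside <pre>
buffer_complete = io.StringIO()
df_complete_station.info(buf=buffer_complete)
info_complete = buffer_complete.getvalue().replace('<', '&lt;').replace('>', '&gt;')

buffer_wind = io.StringIO()
df_wind_analysis.info(buf=buffer_wind)
info_wind = buffer_wind.getvalue().replace('<', '&lt;').replace('>', '&gt;')

buffer_wave = io.StringIO()
df_wave_analysis.info(buf=buffer_wave)
info_wave = buffer_wave.getvalue().replace('<', '&lt;').replace('>', '&gt;')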

html = f"""
<div style="display: flex; width: 100%;">
  <div style="flex: 1; padding-right: 10px;">
    <h4 style="font-family: monospace;">Complete Station Info:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{info_complete}</pre>
  </div>
  <div style="flex: 1; padding-left: 5px; padding-right: 5px;">
    <h4 style="font-family: monospace;">Wind Analysis Info:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{info_wind}</pre>
  </div>
  <div style="flex: 1; padding-left: 10px;">
    <h4 style="font-family: monospace;">Wave Analysis Info:</h4>
    <pre style='font-family: monospace; border-style: solid; border-width: thin; padding:10px;'>{info_wave}</pre>
  </div>
</div>
"""

display(HTML(html))
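
The cell above only summarizes the cleaned subsets; it does not write them to disk. A minimal sketch of a save step is shown below. The `save_frames` helper, the file naming, and the CSV format are all assumptions, and a temporary directory stands in for whatever output location the project actually uses.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frames(frames, output_dir):
    """Write each DataFrame to <output_dir>/<name>.csv and return the paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, frame in frames.items():
        path = output_dir / f"{name}.csv"
        frame.to_csv(path)  # index=True by default, so the date_time index is kept
        paths[name] = path
    return paths


# Stand-in for one of the cleaned, date_time-indexed subsets
toy = pd.DataFrame(
    {"station_id": [1, 1], "wave_height_m": [1.0, 1.5]},
    index=pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00"]).rename("date_time"),
)

# Temporary location, for illustration only
with tempfile.TemporaryDirectory() as tmp:
    paths = save_frames({"df_wave_analysis": toy}, tmp)
    # Read the file back to confirm the index round-trips as datetimes
    reloaded = pd.read_csv(paths["df_wave_analysis"], index_col="date_time", parse_dates=True)

print(reloaded)
```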

> Over a 30-day month, the first buoy's dataset contains about 720 observations, implying roughly one measurement per hour. The second buoy captures around 5,760 readings in the same period, a sampling interval of approximately 7.5 minutes. The third buoy falls in between, with about 2,160 data points per month, or one measurement every 20 minutes.
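
The intervals quoted above follow from dividing the minutes in a 30-day month by each observation count. The short check below uses the approximate counts from the note; the buoy labels are placeholders.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes

# Approximate monthly observation counts from the note above
observations_per_month = {
    "first buoy": 720,
    "second buoy": 5760,
    "third buoy": 2160,
}

# Average sampling interval in minutes for each buoy
interval_minutes = {
    name: MINUTES_PER_30_DAY_MONTH / count
    for name, count in observations_per_month.items()
}

for name, minutes in interval_minutes.items():
    print(f"{name}: one reading every {minutes:g} minutes")
```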