<a href="https://colab.research.google.com/github/sophiewagner7/its-too-nice-out-to-take-a-cab/blob/main/notebooks/pre_process_weather_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-Process Weather Data

Moacir P. de Sá Pereira

This notebook wrangles our hourly weather data for the KNYC0 weather station in New York City. It concatenates six yearly files (2019 through 2024) downloaded from the [Global Historical Climate Network hourly (GHCNh)](https://www.ncei.noaa.gov/products/global-historical-climatology-network-hourly) database, which provides hourly weather data going back over two centuries for New York City. The data come in over 200 columns to account for the variability that can occur in the terse [METAR](https://en.wikipedia.org/wiki/METAR) aviation weather report, which is also included under `remarks`. NOAA provides a [codebook](https://www.ncei.noaa.gov/oa/global-historical-climatology-network/hourly/doc/ghcnh_DOCUMENTATION.pdf) to describe the remaining data.

We initially planned to keep as much data as possible, but we now reduce the dataset to temperature, dew point, wind speed, precipitation, cloud cover, and relative humidity, along with the raw `remarks` text. The weather variables are all numeric except `cloud_cover`, an ordinal categorical variable we derive from the sky cover reports. It is coded negatively, so that more cloud means a lower value: -4 is overcast, -3 is broken clouds, -2 is scattered clouds, -1 is few clouds, and 0 is clear.

Additionally, we calculate the difference between each variable's value and its value at the same hour one week earlier. We will use these delta values to try to determine what makes a “nice” day.
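
The week-over-week difference described above could be sketched as follows. This is a minimal sketch on made-up toy data, assuming a unique hourly `DatetimeIndex` like the one this notebook sets on `df`; the helper `add_weekly_deltas` and the `_delta` column suffix are hypothetical names chosen for illustration, not taken from the project code.

```python
import numpy as np
import pandas as pd

def add_weekly_deltas(frame, cols):
  """Return a copy of `frame` with a `<col>_delta` column for each name in `cols`."""
  out = frame.copy()
  # Shift the index forward by 7 days so each timestamp carries last week's value,
  # then realign to the original index. Hours with no observation exactly one
  # week earlier end up as NaN.
  week_ago = frame[cols].shift(freq="7D").reindex(frame.index)
  for col in cols:
    out[f"{col}_delta"] = frame[col] - week_ago[col]
  return out

# Toy data: eight days of hourly readings that rise by one unit per hour.
idx = pd.date_range("2024-01-01", periods=24 * 8, freq="h")
toy = pd.DataFrame({"temperature": np.arange(len(idx), dtype=float)}, index=idx)
toy = add_weekly_deltas(toy, ["temperature"])
print(toy.tail(3))
```

Shifting by a time offset rather than by a fixed number of rows keeps the comparison correct even when some hours are missing from the data.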

The resulting dataframe is saved as a Parquet file in the Google Colab runtime and must be downloaded manually before it can be committed back to Git.
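
One way to trigger that download from within the notebook is the `files.download` helper in the `google.colab` package. The wrapper below is a hypothetical convenience function, not part of the original notebook, and it simply does nothing when run outside Colab.

```python
def download_from_colab(path):
  """Ask the browser to download `path`; return False when not running in Colab."""
  try:
    # google.colab is only importable inside a Colab runtime.
    from google.colab import files
  except ImportError:
    return False
  files.download(path)
  return True

# Hypothetical usage, once the Parquet file has been written:
# download_from_colab("GHCNh_USW00094728_2019_to_2024.parquet")
```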


## Imports and Concatenate Yearly Files

In [None]:
#%pip install metar-taf-parser-mivek

In [1]:
import pandas as pd
#from metar import Metar
#from metar_taf_parser.parser.parser import MetarParser



In [2]:
#obs = Metar.Metar('METAR KEWR 111851Z VRB03G19KT 2SM R04R/3000VP6000FT TSRA BR FEW015 BKN040CB BKN065 OVC200 22/22 A2987 RMK AO2 PK WND 29028/1817 WSHFT 1812 TSB05RAB22 SLP114 FRQ LTGICCCCG TS OHD AND NW -N-E MOV NE P0013 T02270215')
#print(obs.string())

In [3]:
root_url = "https://github.com/sophiewagner7/nyc-weather/raw/refs/heads/main/data/GHCNh"

dfs = []
for year in range(2019, 2025):
  file_name = f"GHCNh_USW00094728_{year}.parquet"
  df_fragment = pd.read_parquet(f"{root_url}/{file_name}")
  dfs.append(df_fragment)

df = pd.concat(dfs)



## Wrangle Data

In [4]:
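def get_cloud_cover(row):
  """Map a row's sky cover reports to an ordinal score (-4 overcast to 0 clear)."""
  # Keep only the layers that were actually reported. pd.notna guards against
  # NaN/NA values, which are not strings and would break the join below.
  layers = [row.sky_cover_1, row.sky_cover_2, row.sky_cover_3]
  coverage = " ".join(str(layer) for layer in layers if pd.notna(layer) and layer)
  # The heaviest reported layer wins, so check from overcast down to few.
  if "OVC" in coverage:
    return -4
  elif "BKN" in coverage:
    return -3
  elif "SCT" in coverage:
    return -2
  elif "FEW" in coverage:
    return -1
  else:
    return 0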
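
The same scoring can be expressed without a row-by-row `apply`, which tends to be slow on several years of hourly data. The following is a sketch of one vectorized alternative, not the notebook's method; `cloud_cover_vectorized`, `SKY_COLS`, and `SCORES` are hypothetical names, and the toy sky cover values are made up for illustration.

```python
import pandas as pd

SKY_COLS = ["sky_cover_1", "sky_cover_2", "sky_cover_3"]
# Ordered from lightest to heaviest so that heavier cover overwrites lighter.
SCORES = [("FEW", -1), ("SCT", -2), ("BKN", -3), ("OVC", -4)]

def cloud_cover_vectorized(frame):
  """Score each row by the heaviest sky cover code found in any layer."""
  # Join the three layers into one string per row, treating missing layers as empty.
  joined = frame[SKY_COLS].fillna("").astype(str).agg(" ".join, axis=1)
  score = pd.Series(0, index=frame.index)
  for code, value in SCORES:
    score.loc[joined.str.contains(code, regex=False)] = value
  return score

# Toy rows (assumed values, not real observations).
toy = pd.DataFrame({
  "sky_cover_1": ["FEW", "OVC", None, "SCT", "CLR"],
  "sky_cover_2": ["BKN", None, None, "FEW", None],
  "sky_cover_3": [None, None, None, None, None],
})
toy["cloud_cover"] = cloud_cover_vectorized(toy)
print(toy)
```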


In [5]:
# Record trace precipitation as 0.1 mm of rain
df.loc[df["precipitation_Measurement_Code"] == "2-Trace", "precipitation"] = 0.1

# Concatenate sky cover reports to determine sky cover.
# Key:
# -4 : Overcast
# -3 : Broken Clouds
# -2 : Scattered Clouds
# -1 : Few Clouds
# 0 : Clear skies
df["cloud_cover"] = df.apply(get_cloud_cover, axis=1)


# Add datetime columns
df["datetime"] = pd.to_datetime(df.DATE)
df = df.sort_values('datetime')
df["datetime"] = df["datetime"].dt.floor('h')
df["date"] = df.datetime.dt.date
df["hour"] = df.datetime.dt.hour
df = df.drop_duplicates(subset=['date', 'hour'], keep='last')
df.set_index('datetime', inplace=True)

# Zero out NAs
for col in ["wind_speed", "precipitation"]:
  df[col] = df[col].fillna(0)

# Interpolate NAs
for col in ["temperature", "dew_point_temperature", "relative_humidity"]:
  df[col] = df[col].interpolate()
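
To see what the hourly deduplication and the two fill strategies do, here is a small sketch on made-up readings (the timestamps and values are assumptions for illustration only). Two reports fall within the 01:00 hour, and only the later one survives; the missing temperature is interpolated, while the missing wind speeds become zero. For brevity it deduplicates on the floored `datetime` directly, which is equivalent to deduplicating on `date` and `hour`.

```python
import pandas as pd

toy = pd.DataFrame({
  "DATE": [
    "2024-01-01T00:51:00",
    "2024-01-01T01:20:00",
    "2024-01-01T01:51:00",
    "2024-01-01T02:51:00",
    "2024-01-01T03:51:00",
  ],
  "temperature": [1.0, 2.0, 3.0, None, 5.0],
  "wind_speed": [2.0, None, None, 1.0, None],
})

# Sort by the exact report time first, then floor to the hour, so that
# keep="last" retains the latest report within each hour.
toy["datetime"] = pd.to_datetime(toy["DATE"])
toy = toy.sort_values("datetime")
toy["datetime"] = toy["datetime"].dt.floor("h")
toy = toy.drop_duplicates(subset="datetime", keep="last").set_index("datetime")

# Missing wind speed is treated as calm; missing temperature is interpolated.
toy["wind_speed"] = toy["wind_speed"].fillna(0)
toy["temperature"] = toy["temperature"].interpolate()
print(toy[["temperature", "wind_speed"]])
```

Note that `interpolate()` defaults to linear interpolation over row positions, so it assumes the remaining rows are evenly spaced in time.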



## Remove Columns

In [6]:
columns = [
    "date",
    "hour",
    "temperature",
    "dew_point_temperature",
    "wind_speed",
    "precipitation",
    "cloud_cover",
    "relative_humidity",
    "remarks"
]

df = df[columns]

In [7]:
df.to_parquet("GHCNh_USW00094728_2019_to_2024.parquet")