## Downloading the data

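We'll reuse the download code from the main script. It fetches hourly RDU observations from Meteostat (station `72306`) and caches them as CSV files, so re-running the notebook does not hit the API again.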

In [3]:
from datetime import datetime
from meteostat import Hourly
import os
import pandas as pd

def download_data():
    # Let us be good people and not hammer their API if we already have the data
    if not os.path.exists("rdu_weather_data.csv") or not os.path.exists("rdu_weather_predict.csv"):
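        # NOTE: X_start is used below but never defined in this cell. Set it to the
        # start of the training window here, or the Hourly call raises a NameError.
        X_end = datetime(2025, 9, 16, 23, 59)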
        y_start = datetime(2025, 9, 17)
        y_end = datetime(2025, 9, 30, 23, 59)

        print("Downloading RDU weather data, please wait...")
        X = Hourly('72306', X_start, X_end)
        X = X.fetch()
        y = Hourly('72306', y_start, y_end)
        y = y.fetch()
        X.to_csv("rdu_weather_data.csv", index=True)
        y.to_csv("rdu_weather_predict.csv", index=True)
        print("Downloaded and saved data!")
        return X, y
    else:
        print("You already have the weather data downloaded, not redownloading.")
        X = pd.read_csv("rdu_weather_data.csv", index_col=0, parse_dates=True)
        y = pd.read_csv("rdu_weather_predict.csv", index_col=0, parse_dates=True)
        return X, y

X, y = download_data()

You already have the weather data downloaded, not redownloading.
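
The "download once, then read from disk" pattern above can be sketched in isolation. This is a minimal illustration, not the main script's code: `load_or_fetch` and `fake_fetch` are hypothetical names, the fetcher is a stand-in for the Meteostat call so the example runs offline, and unlike `download_data` it checks one file at a time rather than both together.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(path, fetch):
    """Read the CSV at `path` if it exists; otherwise call `fetch()` and cache the result."""
    if os.path.exists(path):
        return pd.read_csv(path, index_col=0, parse_dates=True)
    df = fetch()
    df.to_csv(path, index=True)
    return df


calls = []  # records how many times the (fake) API was hit


def fake_fetch():
    # Hypothetical stand-in for Hourly(...).fetch(); returns a tiny hourly frame.
    calls.append(1)
    idx = pd.to_datetime(
        ["2025-09-17 00:00", "2025-09-17 01:00", "2025-09-17 02:00"]
    )
    return pd.DataFrame({"temp": [20.0, 19.5, 19.0]}, index=idx)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.csv")
    first = load_or_fetch(path, fake_fetch)   # fetches and writes the CSV
    second = load_or_fetch(path, fake_fetch)  # reads the CSV, no fetch

print(f"fetch calls: {len(calls)}")
```

Passing `index_col=0, parse_dates=True` on the read side is what restores the timestamp index, which is why the notebook's `else` branch uses the same arguments.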


## Missing values

Next, let's see what's missing. My IDE is fancy and can visualize this, but that isn't reproducible outside my own environment, so we compute it directly instead. The cell below prints the fraction of missing values in each column, sorted from most to least missing. `snow` and `tsun` are missing in 100% of rows, which is discussed in the README file.

The wind gust (`wpgt`) is missing about 99.4% of the time, also discussed in the README. The point here is simply to quantify how much of each column is missing. Note that the output is a fraction between 0 and 1, so `0.065400` for `prcp` means about 6.5%.

In [5]:
missing_summary = X.isna().mean().sort_values(ascending=False)
print(missing_summary)

snow    1.000000
tsun    1.000000
wpgt    0.994422
prcp    0.065400
coco    0.050441
wdir    0.002071
pres    0.000385
rhum    0.000000
dwpt    0.000000
temp    0.000000
wspd    0.000000
dtype: float64
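
If you prefer actual percentages alongside raw counts, the same idea can be wrapped in a small helper. This is only a sketch: `missing_report` is a hypothetical name, and the toy frame stands in for `X` so the example runs without the downloaded data.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    report = pd.DataFrame(
        {
            "n_missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(2),
        }
    )
    return report.sort_values("pct_missing", ascending=False)


# Toy stand-in for X: one complete column, one partly missing, one fully missing.
toy = pd.DataFrame(
    {
        "temp": [20.1, 21.0, 19.5, 18.2],
        "prcp": [0.0, np.nan, 0.2, 0.0],
        "snow": [np.nan] * 4,
    }
)

report = missing_report(toy)
print(report)
```

On the real data you would call `missing_report(X)` instead of using the toy frame.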
