# PCA Computation per State (Dask Edition)

This notebook revisits the PCA weather workflow using `dask.dataframe` instead of Spark.


In [1]:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.array as da
from dask.distributed import Client

from pathlib import Path

from lib.numpy_pack import unpackArray
from lib.dask_pca import compute_statistics, covariance_from_summary
from lib.decomposer import Eigen_decomp
from lib.Reconstruction_plots import recon_plot

_client = Client(n_workers=4, threads_per_worker=1)
_client



Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Status: running
Using processes: True
Comm: tcp://127.0.0.1:63437
Workers: 4
Total threads: 4
Total memory: 16.00 GiB
Each worker: 1 thread, 4.00 GiB memory


## Load Parquet Data

The weather state parquet archives are hosted on S3. Run the next cell once to download and unpack the NY dataset locally.



In [2]:
import tarfile
import urllib.request

state = "NY"
data_dir = Path("../../../Data/Weather")
data_dir.mkdir(parents=True, exist_ok=True)

tarname = f"{state}.tgz"
parquet_dir = data_dir / f"{state}.parquet"

if not parquet_dir.exists():
    tar_path = data_dir / tarname
    if not tar_path.exists():
        url = f"https://mas-dse-open.s3.amazonaws.com/Weather/by_state/{tarname}"
        urllib.request.urlretrieve(url, tar_path)
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(path=data_dir)

parquet_path = parquet_dir
parquet_path


HTTPError: HTTP Error 404: Not Found

In [None]:
ddf = dd.read_parquet(parquet_path, engine="pyarrow")
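# ddf.shape is lazy in Dask (the row count is not computed), so compute it explicitly.
print((len(ddf), len(ddf.columns)))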
ddf.head()


In [None]:
measurement_counts = ddf.groupby("Measurement").size().compute().sort_values()
measurement_counts


In [None]:
measurement = "TMAX"
filtered = ddf[ddf["Measurement"] == measurement].persist()
filtered.head()


In [None]:
summary = compute_statistics(filtered)
mean, (eigval, eigvec) = covariance_from_summary(summary)

sample_count = summary["nan_counts"].shape[0]
print(f"Sample count: {sample_count}")
print(f"Eigenvalues (top 5): {eigval[:5]}")

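The internals of `compute_statistics` and `covariance_from_summary` are not shown in this notebook. As a rough illustration of how a partition-wise, NaN-aware covariance could be assembled, here is a minimal NumPy sketch. The function names (`partition_summary`, `combine_summaries`, `covariance_from_sums`) and the dictionary keys are hypothetical and are not the `lib.dask_pca` API. It assumes each partition has already been unpacked into a 2-D float array of shape (rows, days).

```python
import numpy as np


def partition_summary(block):
    """Sufficient statistics for one partition (hypothetical helper).

    `block` is a (rows, days) float array that may contain NaNs.
    """
    valid = ~np.isnan(block)               # True where a day was observed
    filled = np.where(valid, block, 0.0)   # NaNs contribute nothing to the sums
    v = valid.astype(np.float64)
    return {
        "sum": filled.sum(axis=0),         # S_i: sum of x_i over rows where day i is valid
        "count": v.sum(axis=0),            # N_i: number of rows where day i is valid
        "outer_sum": filled.T @ filled,    # O_ij: sum of x_i * x_j where both days are valid
        "outer_count": v.T @ v,            # N_ij: number of rows where both days are valid
    }


def combine_summaries(summaries):
    """Add per-partition statistics; every field is a plain sum, so order does not matter."""
    summaries = list(summaries)
    return {key: sum(s[key] for s in summaries) for key in summaries[0]}


def covariance_from_sums(total):
    """Mean vector and covariance estimate from the combined statistics.

    Assumes every day, and every pair of days, was observed at least once.
    """
    mean = total["sum"] / total["count"]
    cov = total["outer_sum"] / total["outer_count"] - np.outer(mean, mean)
    return mean, cov


def sorted_eigh(cov):
    """Eigen-decomposition of a symmetric matrix, largest eigenvalue first."""
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]
    return eigval[order], eigvec[:, order]
```

Because every field is a sum, the per-partition results can be reduced in any order, which is what makes this style of aggregation a natural fit for Dask partitions. When there are no missing values the estimate coincides with the population covariance (`np.cov(..., bias=True)`); with missing values it is an approximation, since each mean is taken over all rows where that day is valid rather than over the jointly valid rows.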

In [None]:
sample_row = filtered.head(1).iloc[0]
values = unpackArray(sample_row["Values"], np.float16).astype(np.float64)
x_axis = np.arange(1, len(values) + 1)

decomp = Eigen_decomp(x_axis, values, mean, eigvec[:, :5])
plotter = recon_plot(decomp, year_axis=True, interactive=True, figsize=(4, 3))
plotter.get_Interactive()

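For readers without the plotting widgets, the reconstruction step can be approximated in plain NumPy. The sketch below is not the `Eigen_decomp` implementation, whose internals are not shown here. The helper name `project_and_reconstruct` is hypothetical, and it assumes the columns of `eigvec_k` are orthonormal and that missing days in `values` are encoded as NaN.

```python
import numpy as np


def project_and_reconstruct(values, mean, eigvec_k):
    """Approximate `values` as mean + a linear combination of the columns of eigvec_k.

    Hypothetical helper: returns the coefficients, the reconstruction, and the
    fraction of the centered signal's energy left unexplained on observed days.
    """
    centered = values - mean
    valid = ~np.isnan(centered)
    centered_filled = np.where(valid, centered, 0.0)  # missing days contribute zero

    coeffs = eigvec_k.T @ centered_filled             # one coefficient per eigenvector
    recon = mean + eigvec_k @ coeffs

    resid = np.where(valid, values - recon, 0.0)      # residual on observed days only
    total = np.sum(centered_filled ** 2)
    residual_fraction = np.sum(resid ** 2) / total if total > 0 else 0.0
    return coeffs, recon, residual_fraction
```

With the variables from the cell above, a call might look like `project_and_reconstruct(values, mean, eigvec[:, :5])`. A small residual fraction suggests the top five eigenvectors capture most of that station-year's deviation from the mean.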

## Summary
- Dask loads the state parquet without Spark.
- Partition-wise aggregation reproduces the covariance matrix while ignoring `NaN` days.
- Eigenvectors/values can be used directly with the existing reconstruction widgets (now sized smaller at 4×3).

You can iterate over other measurements by redefining `measurement` and re-running the analysis cell.
