# 01 — Pandas `groupby` + Datetime on GHCN Parquet
This exercise loads the Parquet produced by **Notebook 02** and demonstrates standard `groupby` + datetime analyses.
You can read the file locally or, once the repository is pushed to GitHub, over HTTPS from its raw file URL.

In [1]:
import pandas as pd, numpy as np
from pathlib import Path

# Primary local artifact produced by 02_fetch_ghcn_il_to_parquet.ipynb
LOCAL_PARQUET = '../data/ghcn_il_top4_daily.parquet'

# After pushing to GitHub, set this to your repo's raw URL to read from the cloud:
CLOUD_PARQUET = None  # e.g., 'https://raw.githubusercontent.com/USER/REPO/main/data/ghcn_il_top4_daily.parquet'

def read_cloud_first(cloud_url, local_fallback):
    """Try the cloud URL first; fall back to the local file if it is unset or fails."""
    if cloud_url:
        try:
            df = pd.read_parquet(cloud_url)  # needs pyarrow or fastparquet
            print('Loaded from cloud:', cloud_url)
            return df
        except Exception as e:
            print('Cloud read failed → using local:', type(e).__name__, str(e)[:120])
    # Report success only after the local read has actually completed
    df = pd.read_parquet(local_fallback)
    print('Loaded local:', local_fallback)
    return df

df = read_cloud_first(CLOUD_PARQUET, LOCAL_PARQUET).sort_values(['ID','DATE']).reset_index(drop=True)
df.dtypes

Loaded local: ../data/ghcn_il_top4_daily.parquet


ID              object
DATE    datetime64[ns]
PRCP           float64
TMAX           float64
TMIN           float64
SNOW           float64
SNWD           float64
DAPR           float64
MDPR           float64
TOBS           float64
WT01           float64
WT04           float64
WT05           float64
WT06           float64
WT03           float64
WT07           float64
WT08           float64
WT09           float64
WT11           float64
WT14           float64
WT16           float64
WT18           float64
DASF           float64
MDSF           float64
WESD           float64
EVAP           float64
dtype: object

## 1) Add datetime helpers

In [2]:
# DATE is already datetime64[ns] (see the dtypes above), so the .dt accessor works directly
df['year']  = df['DATE'].dt.year
df['month'] = df['DATE'].dt.month
df['ym']    = df['DATE'].dt.to_period('M')  # monthly Period, e.g. 1892-12
df.head()

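| | ID | DATE | PRCP | TMAX | TMIN | SNOW | SNWD | DAPR | MDPR | TOBS | ... | WT14 | WT16 | WT18 | DASF | MDSF | WESD | EVAP | year | month | ym |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USC00110137 | 1892-12-02 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | 1892 | 12 | 1892-12 |
| 1 | USC00110137 | 1892-12-03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1892 | 12 | 1892-12 |
| 2 | USC00110137 | 1892-12-06 | 20.8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | 1892 | 12 | 1892-12 |
| 3 | USC00110137 | 1892-12-07 | 22.1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1892 | 12 | 1892-12 |
| 4 | USC00110137 | 1892-12-13 | 11.4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | 1892 | 12 | 1892-12 |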


## 2) Monthly means & totals by station

In [5]:
monthly = (
    df.groupby(['ID','ym'], as_index=False)
      .agg(TMIN=('TMIN','mean'), TMAX=('TMAX','mean'), PRCP=('PRCP','sum'))
)
monthly_piv = monthly.pivot(index='ym', columns='ID', values='TMIN')
monthly.head(), monthly_piv.head()

(            ID       ym  TMIN  TMAX   PRCP
 0  USC00110137  1892-12   NaN   NaN   55.1
 1  USC00110137  1893-01   NaN   NaN    6.6
 2  USC00110137  1893-02   NaN   NaN   84.9
 3  USC00110137  1893-03   NaN   NaN  148.7
 4  USC00110137  1893-04   NaN   NaN  215.1,
 ID       USC00110137  USC00110338  USC00116526  USC00117391
 ym                                                         
 1866-02          NaN          NaN          NaN          NaN
 1866-03          NaN          NaN          NaN          NaN
 1866-05          NaN          NaN          NaN          NaN
 1866-09          NaN          NaN          NaN          NaN
 1866-10          NaN          NaN          NaN          NaN)
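One caveat about the `sum` aggregation above: by default, pandas returns `0.0` for a group whose values are all missing, which can make a month with no precipitation reports look like a dry month. The sketch below uses a small made-up frame (the `toy` data is hypothetical, not GHCN values) to show how `min_count=1` keeps such groups as `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one month with no PRCP reports at all, one with a single report
toy = pd.DataFrame({
    "ID": ["A", "A", "A", "A"],
    "ym": pd.PeriodIndex(["2000-01", "2000-01", "2000-02", "2000-02"], freq="M"),
    "PRCP": [np.nan, np.nan, 1.0, np.nan],
})

# Default: an all-NaN group sums to 0.0
default_sum = toy.groupby(["ID", "ym"])["PRCP"].sum()

# min_count=1: require at least one non-missing value, otherwise return NaN
strict_sum = toy.groupby(["ID", "ym"])["PRCP"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

Inside a named aggregation, the same idea can be written as `PRCP=('PRCP', lambda s: s.sum(min_count=1))`.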

**Try it:** Compute monthly *median* `TMAX` by station.
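One possible approach to this exercise, sketched on a small made-up frame so it runs on its own (the `toy` values and the `TMAX_median` column name are illustrative assumptions). On the real data, the same `groupby(['ID','ym'])` call should apply to `df` once the `ym` helper from section 1 exists.

```python
import pandas as pd

# Hypothetical stand-in for the GHCN frame; only the columns needed here
toy = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B"],
    "DATE": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03",
                            "2000-01-01", "2000-01-02", "2000-02-01"]),
    "TMAX": [1.0, 5.0, 3.0, 10.0, None, 7.0],
})
toy["ym"] = toy["DATE"].dt.to_period("M")

# Median TMAX per station and month; missing values are skipped by default
monthly_median = (
    toy.groupby(["ID", "ym"], as_index=False)
       .agg(TMAX_median=("TMAX", "median"))
)

# Wide layout: one row per month, one column per station
median_piv = monthly_median.pivot(index="ym", columns="ID", values="TMAX_median")
print(median_piv)
```

Station/month combinations with no rows at all (here station `A` in 2000-02) show up as `NaN` in the pivot.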

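Note: the dataset has no `TAVG` column (see the dtypes listing above), so selecting it in a `groupby` raises `KeyError: 'Column not found: TAVG'`.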

## 3) Annual precipitation totals and rankings

In [7]:
annual_prcp = (
    df.groupby(['ID','year'], as_index=False)
      .agg(annual_prcp_mm=('PRCP','sum'))
)
annual_prcp['rank_within_year'] = annual_prcp.groupby('year')['annual_prcp_mm'].rank(ascending=False, method='min')
annual_prcp.sort_values(['year','rank_within_year']).head(12)

| | ID | year | annual_prcp_mm | rank_within_year |
|---|---|---|---|---|
| 369 | USC00117391 | 1866 | 358.1 | 1.0 |
| 370 | USC00117391 | 1867 | 730.0 | 1.0 |
| 371 | USC00117391 | 1868 | 859.7 | 1.0 |
| 372 | USC00117391 | 1869 | 1051.1 | 1.0 |
| 373 | USC00117391 | 1870 | 986.4 | 1.0 |
| 374 | USC00117391 | 1871 | 831.3 | 1.0 |
| 375 | USC00117391 | 1872 | 627.5 | 1.0 |
| 376 | USC00117391 | 1873 | 448.6 | 1.0 |
| 377 | USC00117391 | 1874 | 618.4 | 1.0 |
| 378 | USC00117391 | 1875 | 681.7 | 1.0 |


**Try it:** Rank the warmest station per year. This dataset has no `TAVG` column, so rank by mean `TMAX`, or first derive a daily mean as `(TMAX + TMIN) / 2`.
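A minimal sketch of the ranking exercise, assuming a daily mean temperature is estimated as the midpoint of `TMAX` and `TMIN`, since the dtypes listing shows no `TAVG` column. The `toy` frame and the `TAVG_est` and `mean_tavg` names are hypothetical; the ranking step mirrors the `annual_prcp` cell above.

```python
import pandas as pd

# Hypothetical stand-in data: two stations, two years
toy = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "A", "B"],
    "DATE": pd.to_datetime(["2000-01-01", "2000-07-01", "2000-01-01",
                            "2000-07-01", "2001-01-01", "2001-01-01"]),
    "TMAX": [2.0, 30.0, 0.0, 28.0, 1.0, 4.0],
    "TMIN": [-4.0, 18.0, -6.0, 16.0, -5.0, -2.0],
})
toy["year"] = toy["DATE"].dt.year

# Assumption: approximate the daily mean as the TMAX/TMIN midpoint
toy["TAVG_est"] = (toy["TMAX"] + toy["TMIN"]) / 2

# Annual mean of the estimate per station
annual_temp = (
    toy.groupby(["ID", "year"], as_index=False)
       .agg(mean_tavg=("TAVG_est", "mean"))
)

# Rank stations within each year; rank 1 is the warmest
annual_temp["rank_within_year"] = (
    annual_temp.groupby("year")["mean_tavg"].rank(ascending=False, method="min")
)

warmest = annual_temp[annual_temp["rank_within_year"] == 1].sort_values("year")
print(warmest)
```

Years in which a station has sparse temperature records can skew the annual mean, so it may be worth checking the observation count per station and year before trusting a rank.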

## 4) Station-by-month climatology (mean `TMAX` over all available years)

In [9]:
climo = df.groupby(['ID','month'], as_index=False)['TMAX'].mean()
climo_piv = climo.pivot(index='month', columns='ID', values='TMAX')
climo_piv

| month | USC00110137 | USC00110338 | USC00116526 | USC00117391 |
|---|---|---|---|---|
| 1 | 3.206984 | -0.64633 | 0.290932 | -0.295235 |
| 2 | 5.861129 | 1.304014 | 2.374458 | 2.132727 |
| 3 | 11.82408 | 7.960794 | 9.20914 | 9.04622 |
| 4 | 18.641973 | 15.491488 | 17.048922 | 16.344476 |
| 5 | 24.038368 | 21.870903 | 23.236806 | 22.627047 |
| 6 | 29.194917 | 27.075834 | 28.192177 | 27.815774 |
| 7 | 31.286829 | 29.453647 | 30.386635 | 29.359379 |
| 8 | 30.389658 | 28.292638 | 29.317349 | 28.476923 |
| 9 | 26.789799 | 24.376057 | 25.657596 | 25.001848 |
| 10 | 20.332227 | 17.70926 | 18.992826 | 17.721014 |


**Try it:** Compute monthly precipitation climatology (sum of `PRCP` across years).
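The exercise can be read two ways, and the sketch below shows both on a small made-up frame (the `toy` values are hypothetical). The literal reading sums `PRCP` per station and calendar month over every year on record. An alternative, offered here as an assumption about what is often more useful, first totals each individual month and then averages those totals across years, so stations with longer records are not inflated.

```python
import pandas as pd

# Hypothetical stand-in data: station A has two Januaries, one February
toy = pd.DataFrame({
    "ID": ["A", "A", "A", "A", "B", "B"],
    "DATE": pd.to_datetime(["2000-01-05", "2000-01-20", "2001-01-10",
                            "2000-02-01", "2000-01-03", "2001-01-04"]),
    "PRCP": [10.0, 5.0, 25.0, 8.0, 2.0, 6.0],
})
toy["year"] = toy["DATE"].dt.year
toy["month"] = toy["DATE"].dt.month

# Reading 1: total PRCP per station and calendar month, summed over all years
prcp_sum = toy.groupby(["ID", "month"], as_index=False)["PRCP"].sum()
prcp_sum_piv = prcp_sum.pivot(index="month", columns="ID", values="PRCP")

# Reading 2: total each (year, month) first, then average those totals across years
monthly_totals = toy.groupby(["ID", "year", "month"], as_index=False)["PRCP"].sum()
prcp_climo = monthly_totals.groupby(["ID", "month"], as_index=False)["PRCP"].mean()
prcp_climo_piv = prcp_climo.pivot(index="month", columns="ID", values="PRCP")

print(prcp_sum_piv)
print(prcp_climo_piv)
```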