## 1. Setup and Installation

Install required packages:
```bash
pip install s3fs boto3 pandas pyarrow dask matplotlib seaborn
```

In [2]:
# Import libraries
import s3fs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print('✓ Libraries imported successfully!')

✓ Libraries imported successfully!


## 2. Connect to S3 Bucket

The NOAA GHCN data is publicly available (no credentials needed).

In [3]:
# Initialize S3 filesystem (anonymous access)
print('Connecting to S3...')
s3 = s3fs.S3FileSystem(anon=True)

# NOAA GHCN bucket
bucket_name = 'noaa-ghcn-pds'
print(f'✓ Connected to S3 bucket: {bucket_name}')

Connecting to S3...
✓ Connected to S3 bucket: noaa-ghcn-pds


## 3. Explore Bucket Structure

Let's see how the data is organized in the S3 bucket.

In [4]:
# List top-level directories
print('Top-level directories in the bucket:')
print('='*60)
top_level = s3.ls(bucket_name)
for item in top_level:
    print(f'  {item}')

print(f'\nTotal items: {len(top_level)}')

Top-level directories in the bucket:
  noaa-ghcn-pds/csv
  noaa-ghcn-pds/csv.gz
  noaa-ghcn-pds/ghcnd-countries.txt
  noaa-ghcn-pds/ghcnd-inventory.txt
  noaa-ghcn-pds/ghcnd-states.txt
  noaa-ghcn-pds/ghcnd-stations.txt
  noaa-ghcn-pds/ghcnd-version.txt
  noaa-ghcn-pds/index.html
  noaa-ghcn-pds/mingle-list.txt
  noaa-ghcn-pds/parquet
  noaa-ghcn-pds/readme-by_station.txt
  noaa-ghcn-pds/readme-by_year.txt
  noaa-ghcn-pds/readme.txt
  noaa-ghcn-pds/status-by_station.txt
  noaa-ghcn-pds/status-by_year.txt
  noaa-ghcn-pds/status.txt
  noaa-ghcn-pds/test.txt

Total items: 17


In [5]:
# Explore parquet directory
print('Exploring parquet data organization:')
print('='*60)
parquet_dirs = s3.ls(f'{bucket_name}/parquet/')
for item in parquet_dirs:
    print(f'  {item}')
    
    # Get size info if it's a directory
    if item.endswith('/'):
        subfiles = s3.ls(item)
        print(f'    └─ Contains {len(subfiles)} items')

Exploring parquet data organization:
  noaa-ghcn-pds/parquet/by_station
  noaa-ghcn-pds/parquet/by_year


## 4. Explore by_year Parquet Files

Let's investigate the file organization and schema of the by_year parquet data.


In [6]:
# List year directories in by_year
by_year_path = f'{bucket_name}/parquet/by_year'
year_dirs = s3.ls(by_year_path)

print('BY_YEAR FILE ORGANIZATION')
print('='*80)
print(f'\nTotal year directories: {len(year_dirs)}')
print('\nFirst 10 years:')
for i, f in enumerate(year_dirs[:10]):
    year = f.split('/')[-1]
    print(f'  {i+1:2d}. {year}')

print('\nLast 10 years:')
for i, f in enumerate(year_dirs[-10:]):
    year = f.split('/')[-1]
    print(f'  {i+1:2d}. {year}')

# Examine one year's element directories
sample_year = year_dirs[250]  # Recent year
print(f'\n\nExploring elements in {sample_year.split("/")[-1]}:')
print('-'*80)
element_dirs = s3.ls(sample_year)
print(f'Total elements: {len(element_dirs)}')
print('\nFirst 20 elements:')
for i, elem_dir in enumerate(element_dirs[:20]):
    element = elem_dir.split('/')[-1]
    print(f'  {i+1:2d}. {element}')

# Look at actual parquet files in one element
sample_element = element_dirs[20]  # PRCP is common
print(f'\n\nActual parquet files in {sample_element.split("/")[-2]}/{sample_element.split("/")[-1]}:')
print('-'*80)
parquet_files = s3.ls(sample_element)
print(f'Number of parquet files: {len(parquet_files)}')
print('\nFile sizes:')
for i, pf in enumerate(parquet_files[:5]):
    info = s3.info(pf)
    size_mb = info['size'] / (1024 * 1024)
    filename = pf.split('/')[-1]
    print(f'  {i+1}. {filename[:40]:40s} ({size_mb:6.2f} MB)')


BY_YEAR FILE ORGANIZATION

Total year directories: 264

First 10 years:
   1. YEAR=1750
   2. YEAR=1763
   3. YEAR=1764
   4. YEAR=1765
   5. YEAR=1766
   6. YEAR=1767
   7. YEAR=1768
   8. YEAR=1769
   9. YEAR=1770
  10. YEAR=1771

Last 10 years:
   1. YEAR=2016
   2. YEAR=2017
   3. YEAR=2018
   4. YEAR=2019
   5. YEAR=2020
   6. YEAR=2021
   7. YEAR=2022
   8. YEAR=2023
   9. YEAR=2024
  10. YEAR=2025


Exploring elements in YEAR=2012:
--------------------------------------------------------------------------------
Total elements: 98

First 20 elements:
   1. ELEMENT=ADPT
   2. ELEMENT=ASLP
   3. ELEMENT=ASTP
   4. ELEMENT=AWBT
   5. ELEMENT=AWDR
   6. ELEMENT=AWND
   7. ELEMENT=DAEV
   8. ELEMENT=DAPR
   9. ELEMENT=DASF
  10. ELEMENT=DATN
  11. ELEMENT=DATX
  12. ELEMENT=DWPR
  13. ELEMENT=EVAP
  14. ELEMENT=FMTM
  15. ELEMENT=MDPR
  16. ELEMENT=MDSF
  17. ELEMENT=MDTN
  18. ELEMENT=MDTX
  19. ELEMENT=MNPN
  20. ELEMENT=MXPN


Actual parquet files in YEAR=2012/ELEMENT=PGTM:
-------

In [7]:
# Read a sample parquet file to examine schema and fields
sample_file = parquet_files[0]  # Use first actual parquet file
print('PARQUET SCHEMA AND FIELDS')
print('='*80)
year = sample_file.split('/')[-3]
element = sample_file.split('/')[-2]
filename = sample_file.split('/')[-1]
print(f'\nReading sample file: {year}/{element}/{filename[:40]}\n')

with s3.open(sample_file, 'rb') as f:
    df_sample = pd.read_parquet(f)

print(f'DataFrame shape: {df_sample.shape[0]:,} rows × {df_sample.shape[1]} columns')
print('\n' + '-'*80)
print('COLUMN NAMES AND DATA TYPES:')
print('-'*80)
for col, dtype in df_sample.dtypes.items():
    non_null = df_sample[col].notna().sum()
    pct_null = (1 - non_null / len(df_sample)) * 100
    print(f'  {col:15s} : {str(dtype):15s}  (null: {pct_null:5.1f}%)')

print('\n' + '-'*80)
print('FIRST 10 ROWS:')
print('-'*80)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)
print(df_sample.head(10))

print('\n' + '-'*80)
print('BASIC STATISTICS (numeric columns):')
print('-'*80)
print(df_sample.describe())


PARQUET SCHEMA AND FIELDS

Reading sample file: YEAR=2012/ELEMENT=PGTM/1b8a346935904eea8e0b774358625c41_0.snapp

DataFrame shape: 105,055 rows × 7 columns

--------------------------------------------------------------------------------
COLUMN NAMES AND DATA TYPES:
--------------------------------------------------------------------------------
  ID              : object           (null:   0.0%)
  DATE            : object           (null:   0.0%)
  DATA_VALUE      : int64            (null:   0.0%)
  M_FLAG          : object           (null: 100.0%)
  Q_FLAG          : object           (null: 100.0%)
  S_FLAG          : object           (null:   0.0%)
  OBS_TIME        : object           (null: 100.0%)

--------------------------------------------------------------------------------
FIRST 10 ROWS:
--------------------------------------------------------------------------------
            ID      DATE  DATA_VALUE M_FLAG Q_FLAG S_FLAG OBS_TIME
0  FMW00040308  20120101         951   None 

In [7]:
# Examine unique values and data characteristics
print('DATA CHARACTERISTICS')
print('='*80)

# Check station data (column name is 'ID' not 'id')
if 'ID' in df_sample.columns:
    print(f'\nUnique stations: {df_sample["ID"].nunique():,}')
    print('\nTop 10 stations by observation count:')
    print('-'*80)
    top_stations = df_sample['ID'].value_counts().head(10)
    for station, count in top_stations.items():
        print(f'  {station}: {count:,} observations')

# Check unique dates
if 'DATE' in df_sample.columns:
    print(f'\n\nUnique dates: {df_sample["DATE"].nunique():,}')
    print('Date range:', df_sample['DATE'].min(), 'to', df_sample['DATE'].max())

# Check data value range
if 'DATA_VALUE' in df_sample.columns:
    print(f'\n\nData Value Statistics:')
    print(f'  Min: {df_sample["DATA_VALUE"].min()}')
    print(f'  Max: {df_sample["DATA_VALUE"].max()}')
    print(f'  Mean: {df_sample["DATA_VALUE"].mean():.2f}')
    print(f'  Median: {df_sample["DATA_VALUE"].median():.2f}')

# Check flags
flag_cols = [col for col in df_sample.columns if 'FLAG' in col]
if flag_cols:
    print(f'\n\nFlag Columns:')
    for flag_col in flag_cols:
        non_null = df_sample[flag_col].notna().sum()
        if non_null > 0:
            print(f'\n{flag_col}:')
            print(df_sample[flag_col].value_counts().head(5))

# Show memory usage
print('\n' + '-'*80)
print('MEMORY USAGE:')
print('-'*80)
mem_usage = df_sample.memory_usage(deep=True)
total_mb = mem_usage.sum() / (1024 * 1024)
print(f'Total memory usage: {total_mb:.2f} MB')
print('\nPer column:')
for col, mem in mem_usage.items():
    mem_mb = mem / (1024 * 1024)
    pct = (mem / mem_usage.sum()) * 100
    print(f'  {col:15s} : {mem_mb:6.2f} MB ({pct:5.1f}%)')


DATA CHARACTERISTICS

Unique stations: 889

Top 10 stations by observation count:
--------------------------------------------------------------------------------
  USW00094225: 366 observations
  PSW00040309: 366 observations
  USW00014858: 366 observations
  USW00025309: 366 observations
  VQW00011624: 366 observations
  USW00026523: 366 observations
  USW00025331: 366 observations
  USW00025335: 366 observations
  USW00053853: 365 observations
  USW00053850: 365 observations


Unique dates: 366
Date range: 20120101 to 20121231


Data Value Statistics:
  Min: 0
  Max: 2359
  Mean: 1354.43
  Median: 1412.00


Flag Columns:

S_FLAG:
W    105055
Name: S_FLAG, dtype: int64

--------------------------------------------------------------------------------
MEMORY USAGE:
--------------------------------------------------------------------------------
Total memory usage: 27.15 MB

Per column:
  Index           :   0.00 MB (  0.0%)
  ID              :   6.81 MB ( 25.1%)
  DATE            :   6

In [8]:
# Analyze partitioning scheme summary
print('\n' + '='*80)
print('PARTITIONING SCHEME SUMMARY')
print('='*80)

# Extract years from directory names
years = []
for year_dir in year_dirs:
    year_name = year_dir.split('/')[-1]
    if year_name.startswith('YEAR='):
        years.append(year_name.replace('YEAR=', ''))

print(f'\n✓ Two-level hierarchical partitioning: YEAR → ELEMENT')
print(f'\nLevel 1 - Years:')
print(f'  - Total years: {len(years)}')
print(f'  - Range: {years[0]} to {years[-1]}')

# Sample a few years to get average element count
sample_years = [year_dirs[i] for i in [50, 150, 250]]
element_counts = []
for sy in sample_years:
    element_counts.append(len(s3.ls(sy)))

print(f'\nLevel 2 - Elements (weather measurement types):')
print(f'  - Average elements per year: ~{int(np.mean(element_counts))}')
print(f'  - Sample year ({sample_year.split("/")[-1]}): {len(element_dirs)} elements')

# Show some common elements
common_elements = ['ELEMENT=PRCP', 'ELEMENT=TMAX', 'ELEMENT=TMIN', 'ELEMENT=SNOW', 'ELEMENT=TAVG']
available_elements = [e.split('/')[-1] for e in element_dirs]
print(f'\nCommon core elements available:')
for ce in common_elements:
    status = '✓' if ce in available_elements else '✗'
    print(f'  {status} {ce}')

print(f'\n\nTotal data organization:')
print(f'  - {len(year_dirs)} year directories')
print(f'  - ~{len(year_dirs) * int(np.mean(element_counts)):,} year/element partitions')
print(f'  - Multiple parquet files per partition (e.g., {len(parquet_files)} files in sampled partition)')
print(f'  - File format: Snappy-compressed Parquet')
print(f'  - Typical file size: ~{np.mean([s3.info(pf)["size"] for pf in parquet_files[:5]]) / (1024*1024):.1f} MB')



PARTITIONING SCHEME SUMMARY

✓ Two-level hierarchical partitioning: YEAR → ELEMENT

Level 1 - Years:
  - Total years: 264
  - Range: 1750 to 2025



Level 2 - Elements (weather measurement types):
  - Average elements per year: ~44
  - Sample year (YEAR=2012): 98 elements

Common core elements available:
  ✓ ELEMENT=PRCP
  ✓ ELEMENT=TMAX
  ✓ ELEMENT=TMIN
  ✓ ELEMENT=SNOW
  ✓ ELEMENT=TAVG


Total data organization:
  - 264 year directories
  - ~11,616 year/element partitions
  - Multiple parquet files per partition (e.g., 1 files in sampled partition)
  - File format: Snappy-compressed Parquet


  - Typical file size: ~0.2 MB
