# Exploring NOAA GHCN Data on S3

## Global Historical Climatology Network (GHCN)

This notebook explores the **NOAA Global Historical Climatology Network (GHCN)** dataset stored on AWS S3.

### Dataset Information:
- **S3 Bucket**: `s3://noaa-ghcn-pds/`
- **Registry**: [AWS Open Data Registry](https://registry.opendata.aws/noaa-ghcn/)
- **Time Period**: mid-1700s to present (more than 250 years; Section 4 lists the exact range)
- **Format**: Parquet and CSV.GZ
- **Size**: Billions of observations worldwide

### Goals:
1. List available years and data organization
2. Understand data distribution across time
3. Explore geographic coverage (stations worldwide)
4. Sample data to understand structure
5. Analyze metadata and completeness

## 1. Setup and Installation

Install required packages:
```bash
pip install s3fs pandas numpy pyarrow matplotlib seaborn
```

In [None]:
# Import libraries
import s3fs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print('✓ Libraries imported successfully!')

## 2. Connect to S3 Bucket

The NOAA GHCN data is publicly available (no credentials needed).

In [None]:
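# Initialize S3 filesystem (anonymous access); no request is sent until the first call
s3 = s3fs.S3FileSystem(anon=True)

# NOAA GHCN bucket
bucket_name = 'noaa-ghcn-pds'

# Listing the bucket is the first real request, so it doubles as a connectivity check
print('Connecting to S3...')
s3.ls(bucket_name)
print(f'✓ Bucket reachable: {bucket_name}')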

## 3. Explore Bucket Structure

Let's see how the data is organized in the S3 bucket.

In [None]:
# List top-level directories
print('Top-level directories in the bucket:')
print('='*60)
top_level = s3.ls(bucket_name)
for item in top_level:
    print(f'  {item}')

print(f'\nTotal items: {len(top_level)}')

In [None]:
# Explore parquet directory
print('Exploring parquet data organization:')
print('='*60)
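# detail=True returns dicts with a 'type' field; s3fs paths carry no trailing slash,
# so testing item.endswith('/') would never match
parquet_dirs = s3.ls(f'{bucket_name}/parquet/', detail=True)
for entry in parquet_dirs:
    print(f"  {entry['name']}")

    # Count children if the entry is a directory (prefix)
    if entry['type'] == 'directory':
        subfiles = s3.ls(entry['name'])
        print(f'    └─ Contains {len(subfiles)} items')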

## 4. Discover Available Years

Let's find out what years are available in the dataset.
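
The year directories use Hive-style `KEY=value` names, which is why the cell below matches on `YEAR=`. As a minimal sketch, a small helper can pull every such pair out of a path. `parse_partitions` is a hypothetical name, and the `ELEMENT=TMAX` level and file name in the demo path are assumed for illustration rather than taken from a bucket listing.

```python
def parse_partitions(path):
    """Collect Hive-style KEY=value path segments into a dict."""
    partitions = {}
    for segment in path.strip('/').split('/'):
        key, sep, value = segment.partition('=')
        # Keep only segments that really look like KEY=value
        if sep and key and value:
            partitions[key] = value
    return partitions


# Illustrative path: the ELEMENT level and the file name are assumptions
demo_path = 'noaa-ghcn-pds/parquet/by_year/YEAR=2023/ELEMENT=TMAX/part-0.parquet'
demo_partitions = parse_partitions(demo_path)
print(demo_partitions)
```

Values come back as strings, so a year still needs `int(demo_partitions['YEAR'])` before sorting or arithmetic.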

In [None]:
# List years in the by_year directory
print('Discovering available years...')
year_dirs = s3.ls(f'{bucket_name}/parquet/by_year/')

# Extract years from directory names
years = []
for year_dir in year_dirs:
    # Extract YEAR=YYYY from path
    if 'YEAR=' in year_dir:
        year = int(year_dir.split('YEAR=')[1].strip('/'))
        years.append(year)

years = sorted(years)

print('='*60)
print(f'Total years available: {len(years)}')
print(f'Year range: {min(years)} - {max(years)}')
print(f'\nFirst 10 years: {years[:10]}')
print(f'Last 10 years: {years[-10:]}')
print('='*60)
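
The first and last year alone do not show whether the range is contiguous. The sketch below lists any gaps; `missing_years` is a hypothetical helper and the demo input is made up, so pass it the `years` list from the cell above to check the real listing.

```python
def missing_years(years):
    """Return the years absent between the earliest and latest year given."""
    present = set(years)
    if not present:
        return []
    return [y for y in range(min(present), max(present) + 1) if y not in present]


# Made-up input, only to show the behaviour
demo_years = [1900, 1901, 1903, 1907]
demo_gaps = missing_years(demo_years)
print(f'Missing years: {demo_gaps}')
```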

In [None]:
# Visualize temporal distribution
fig, axes = plt.subplots(2, 1, figsize=(16, 10))
fig.suptitle('NOAA GHCN: Temporal Coverage', fontsize=16, fontweight='bold')

# Full timeline
axes[0].bar(years, [1]*len(years), width=1.0, color='steelblue', edgecolor='none')
axes[0].set_xlabel('Year', fontsize=12)
axes[0].set_ylabel('Data Available', fontsize=12)
axes[0].set_title(f'Complete Timeline: {min(years)} - {max(years)} ({len(years)} years)', fontsize=13)
axes[0].set_xlim(min(years)-5, max(years)+5)
axes[0].grid(True, alpha=0.3, axis='x')

# Decade bins
decade_bins = np.arange(min(years)//10*10, max(years)+10, 10)
decade_counts, _ = np.histogram(years, bins=decade_bins)
decade_labels = [f'{int(d)}s' for d in decade_bins[:-1]]

axes[1].bar(decade_bins[:-1], decade_counts, width=9, color='orange', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Decade', fontsize=12)
axes[1].set_ylabel('Years Available', fontsize=12)
axes[1].set_title('Data Availability by Decade', fontsize=13)
axes[1].set_xticks(decade_bins[:-1])
axes[1].set_xticklabels(decade_labels, rotation=45, ha='right')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f'\nDecade summary:')
for decade, count in zip(decade_bins[:-1], decade_counts):
    print(f'  {int(decade)}s: {int(count)} years')

## 5. Sample Data Structure

Let's load a sample of data from a recent year to understand the structure.

In [None]:
# Load a sample from 2023
import pyarrow.parquet as pq

sample_year = 2023
print(f'Loading sample data from year {sample_year}...')

# Get files for this year; find() recurses, so nested partition folders are handled too
year_path = f'{bucket_name}/parquet/by_year/YEAR={sample_year}/'
year_files = s3.find(year_path)

print(f'Files available for {sample_year}: {len(year_files)}')
print(f'First few files:')
for f in year_files[:5]:
    size = s3.size(f) / (1024**2)  # Convert to MB
    print(f'  {f.split("/")[-1]}: {size:.2f} MB')

In [None]:
# Read a small sample
print(f'\nReading sample from {sample_year}...')
sample_file = year_files[0]

# Open the file on S3 and read only the first row group to keep the download small
with s3.open(sample_file, 'rb') as f:
    table = pq.ParquetFile(f).read_row_group(0)
    df_sample = table.to_pandas()

print(f'✓ Loaded {len(df_sample):,} rows')
print(f'\nDataFrame info:')
df_sample.info()  # info() prints its report and returns None, so it is not wrapped in print()
print(f'\nFirst few rows:')
df_sample.head(10)

In [None]:
# Explore data columns and types
print('='*60)
print('Data Structure Summary')
print('='*60)
print(f'Columns: {list(df_sample.columns)}')
print(f'\nSample statistics:')
print(df_sample.describe())

# Check for unique values in key columns
print(f'\nUnique stations in sample: {df_sample["ID"].nunique():,}')
if 'ELEMENT' in df_sample.columns:
    print(f'Unique measurement types (ELEMENT): {df_sample["ELEMENT"].nunique()}')
    print(f'Top measurement types:')
    print(df_sample['ELEMENT'].value_counts().head(10))
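
The raw values are stored as integers in scaled units. According to the GHCN-Daily readme linked under Resources, TMAX and TMIN are tenths of a degree Celsius, PRCP is tenths of a millimetre, and SNOW and SNWD are millimetres. The sketch below converts them on a made-up frame; the column names `ELEMENT` and `DATA_VALUE` are assumptions about the Parquet schema, and `decode_values` is a hypothetical helper.

```python
import pandas as pd

# Multipliers that turn stored integers into degrees C or mm (per the readme)
SCALE = {'TMAX': 0.1, 'TMIN': 0.1, 'PRCP': 0.1, 'SNOW': 1.0, 'SNWD': 1.0}


def decode_values(df):
    """Add a VALUE column in natural units; unmapped elements become NaN."""
    out = df.copy()
    out['VALUE'] = out['DATA_VALUE'] * out['ELEMENT'].map(SCALE)
    return out


# Made-up rows, not real observations
demo = pd.DataFrame({
    'ID': ['XX000000001'] * 4,
    'ELEMENT': ['TMAX', 'TMIN', 'PRCP', 'SNWD'],
    'DATA_VALUE': [256, 131, 5, 12],
})
decoded = decode_values(demo)
print(decoded)
```

Elements missing from `SCALE` end up as NaN instead of being passed through unscaled, which makes an incomplete mapping easy to spot.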

## 6. Geographic Distribution (Stations)

Let's explore where the weather stations are located worldwide.
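
Station coordinates are not part of the observation rows. The GHCN-Daily readme linked under Resources describes a fixed-width file, `ghcnd-stations.txt`, with one station per line. The sketch below parses that layout using the column positions from the readme; the sample rows are made up, `read_stations` and `format_station` are hypothetical helpers, and where the file sits in the bucket is an assumption to verify against the listing.

```python
import io
import pandas as pd

# Zero-based, end-exclusive spans derived from the readme's 1-based positions
STATION_COLSPECS = [(0, 11), (12, 20), (21, 30), (31, 37), (38, 40), (41, 71)]
STATION_COLUMNS = ['ID', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'STATE', 'NAME']


def read_stations(file_obj):
    """Parse ghcnd-stations.txt style fixed-width text into a DataFrame."""
    df = pd.read_fwf(file_obj, colspecs=STATION_COLSPECS,
                     names=STATION_COLUMNS, header=None)
    # The first two characters of a station ID are its country code
    df['COUNTRY'] = df['ID'].str[:2]
    return df


def format_station(station_id, lat, lon, elev, state, name):
    """Build one fixed-width demo line with the same layout."""
    return f'{station_id:<11} {lat:8.4f} {lon:9.4f} {elev:6.1f} {state:<2} {name:<30}'


# Made-up stations, only to exercise the parser
demo_text = '\n'.join([
    format_station('XX000000001', 17.1167, -61.7833, 10.1, '', 'DEMO STATION ONE'),
    format_station('YY000000002', -33.95, 151.1833, 6.0, 'NS', 'DEMO STATION TWO'),
])
stations = read_stations(io.StringIO(demo_text))
print(stations[['ID', 'COUNTRY', 'LATITUDE', 'LONGITUDE']])
```

For the real file, the same function should accept a handle from `s3.open(...)` opened in text mode, once its path has been confirmed.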

In [None]:
# Station metadata (if available)
print('Looking for station metadata...')
csv_files = s3.ls(f'{bucket_name}/csv/')
print(f'CSV directory contents:')
for item in csv_files[:10]:
    print(f'  {item}')

In [None]:
# Count unique stations across recent years
print('Analyzing station distribution across years...')
recent_years = [2020, 2021, 2022, 2023, 2024]
station_counts = {}

for year in recent_years:
    try:
        year_path = f'{bucket_name}/parquet/by_year/YEAR={year}/'
        files = s3.find(year_path)  # recursive: returns files even under nested partitions
        
        if files:
            # Read first file to get station count estimate
            with s3.open(files[0], 'rb') as f:
                table = pq.read_table(f, columns=['ID'])
                df_temp = table.to_pandas()
                station_counts[year] = df_temp['ID'].nunique()
                
        print(f'  {year}: {station_counts.get(year, "N/A")} unique stations (sample)')
    except Exception as e:
        print(f'  {year}: Error - {str(e)[:50]}')

# Visualize
if station_counts:
    fig, ax = plt.subplots(figsize=(10, 6))
    years_list = list(station_counts.keys())
    counts_list = list(station_counts.values())
    
    ax.bar(years_list, counts_list, color='teal', alpha=0.7, edgecolor='black')
    ax.set_xlabel('Year', fontsize=12)
    ax.set_ylabel('Number of Unique Stations (sample)', fontsize=12)
    ax.set_title('Weather Stations by Year', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()

## 7. Data Volume Analysis

Estimate the size and number of observations across years.

In [None]:
# Sample multiple years to estimate data volume
print('Estimating data volume across years...')
sample_years = [1900, 1950, 2000, 2010, 2015, 2020, 2023]
volume_info = []

for year in sample_years:
    try:
        year_path = f'{bucket_name}/parquet/by_year/YEAR={year}/'
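        # Recursive listing with sizes in one pass, instead of one size request per file
        files = s3.find(year_path, detail=True)
        if not files:
            raise FileNotFoundError(year_path)

        # Calculate total size
        total_size = sum(info['size'] for info in files.values()) / (1024**3)  # GB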
        
        volume_info.append({
            'Year': year,
            'Files': len(files),
            'Size_GB': total_size
        })
        print(f'  {year}: {len(files)} files, {total_size:.2f} GB')
    except Exception as e:
        print(f'  {year}: Not available or error ({type(e).__name__}: {str(e)[:50]})')

# Create DataFrame and visualize
if volume_info:
    df_volume = pd.DataFrame(volume_info)
    
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    fig.suptitle('Data Volume Growth Over Time', fontsize=16, fontweight='bold')
    
    # Number of files
    axes[0].plot(df_volume['Year'], df_volume['Files'], marker='o', linewidth=2, markersize=8, color='blue')
    axes[0].set_xlabel('Year', fontsize=12)
    axes[0].set_ylabel('Number of Files', fontsize=12)
    axes[0].set_title('Files per Year', fontsize=13)
    axes[0].grid(True, alpha=0.3)
    
    # Data size
    axes[1].plot(df_volume['Year'], df_volume['Size_GB'], marker='s', linewidth=2, markersize=8, color='red')
    axes[1].set_xlabel('Year', fontsize=12)
    axes[1].set_ylabel('Data Size (GB)', fontsize=12)
    axes[1].set_title('Storage Size per Year', fontsize=13)
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f'\nSummary:')
    print(df_volume)

## 8. Summary and Next Steps

### Key Findings:

This notebook explored the NOAA GHCN dataset on S3:

- ✅ **Temporal Coverage**: more than 250 years of historical weather data
- ✅ **Data Organization**: Parquet files partitioned by year under `parquet/by_year/YEAR=YYYY/`
- ✅ **Geographic Coverage**: weather stations worldwide (the per-year counts in Section 6 come from a single file each, so they are samples, not totals)
- ✅ **Data Volume**: a growing dataset; Section 7 compares file counts and sizes for sampled years from 1900 to 2023
- ✅ **Measurement Types**: multiple weather elements (TMAX, TMIN, PRCP, etc.)

### Next Steps:

1. **Load specific years** for detailed analysis
2. **Analyze trends** over time (climate change indicators)
3. **Geographic patterns** by region/continent
4. **Seasonal analysis** within years
5. **Compare** different weather elements
6. **Quality control** analysis using flags
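
For the quality control step, the GHCN-Daily readme states that a blank quality flag means the value passed all checks, while a letter identifies the check it failed. A minimal sketch of that filter follows; the column name `Q_FLAG` is an assumption about the Parquet schema, `passed_qc` is a hypothetical helper, and the rows are made up.

```python
import pandas as pd


def passed_qc(df, flag_column='Q_FLAG'):
    """Keep rows whose quality flag is missing or blank."""
    flags = df[flag_column]
    is_blank = flags.isna() | (flags.astype(str).str.strip() == '')
    return df[is_blank]


# Made-up rows: 'I' and 'S' stand in for failed checks
demo_obs = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D'],
    'DATA_VALUE': [256, 9999, 131, -400],
    'Q_FLAG': [None, 'I', '', 'S'],
})
clean = passed_qc(demo_obs)
print(f'Kept {len(clean)} of {len(demo_obs)} rows')
```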

### Resources:

- [NOAA GHCN Documentation](https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily)
- [AWS Open Data Registry](https://registry.opendata.aws/noaa-ghcn/)
- [Data Format Description](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt)