# Data Download

The purpose of this notebook is to download data from The National Map using the TNM Rest API. API documentation is [available here](https://tnmaccess.nationalmap.gov/api/v1/docs). A more comprehensive dataset documentation is [available here](../../docs/markdown/datasets.md).

Data downloaded:
1. **Watershed Boundary (WBD)**
2. **Digital Elevation Model (DEM)**
3. **National Land Cover Data (Land Cover, Land Cover Confidence, Fraction Impervious Surface, Fraction Impervious Descriptor)**
4. **Precipitation Data**


In [1]:
# Import necessary modules
import requests
import pandas as pd
from pathlib import Path
import json
import sys
import os
import warnings
import re
from datetime import datetime
from IPython.display import display, Markdown
import json
from dotenv import load_dotenv

In [2]:
# Base path
project_base_path = Path.cwd().parent.parent

# Supress warnings
warnings.filterwarnings("ignore")

In [3]:
# Add 'src' to system path
sys.path.append(str(project_base_path / 'src'))

# Import modules
from dataDownload.download import download_shp, download_GeoTIFF, download_large_file

In [4]:
# Load tokens
load_dotenv(project_base_path / '.env')

NOAA_API_TOKEN = os.getenv('NOAA_API_TOKEN')

## 1. Watershed Boundary Dataset (WBD)

WBD is available thorough Rest API. (Refer to [dataset documentation here](../../docs/markdown/datasets.md) for more details). 
The HU-4 digits Watershed Boundary have the proper resolution. From the HU-2 region we select the HU-4 subregion. Below, are the available datasets.

### 1.1. Watershed Boundary Dataset availability

Below are the available WBD for download.

In [22]:
# Load the bounding box of the neighboring New York state
neighboring_ny_state_bbox_path = project_base_path / 'data' / 'raw' / 'geo' /'json' / 'ny_neighboring_bbox.json'
with open(neighboring_ny_state_bbox_path, 'r') as f:
    neighboring_ny_state_bbox_dict = json.load(f)

# Build the box as a string for feeding the request in the parameters
corners = ['bottom_left', 'bottom_right', 'top_right', 'top_left']
pairs = [f"{neighboring_ny_state_bbox_dict[corner][0]} {neighboring_ny_state_bbox_dict[corner][1]}" for corner in corners]
bbox = ",".join(pairs)

# Define the base URL for the TNM API
base_url = "https://tnmaccess.nationalmap.gov/api/v1/"

# Define parameters for the API request to query available datasets
params = {
    "polygon": bbox,  # Specify the area to search for
    "datasets": "National Watershed Boundary Dataset (WBD)",  # Specify Watershed Boundary Dataset
    "outputFormat": "JSON"  # Specify JSON output
}

# Send a GET request to the API
response = requests.get(base_url + "products", params=params)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    response_json = response.json()
    
    # Display the dataset information
    print("Available Watershed Boundary Datasets:\n")
    for dataset in response_json.get("items", []):
        print(f"- Name: {dataset['title']}")
        print(f"  Extent: {dataset['extent']}")
        print(f"  Description: {dataset['body']}")
        print(f"  Metadata URL: {dataset['metaUrl']}\n")
else:
    print(f"Failed to retrieve data. HTTP Status Code: {response.status_code}")

Available Watershed Boundary Datasets:

- Name: USGS Watershed Boundary Dataset (WBD) - National (published 20250107) FileGDB
  Extent: National
  Description: The Watershed Boundary Dataset (WBD) is a comprehensive aggregated collection of hydrologic unit data consistent with the national criteria for delineation and resolution. It defines the areal extent of surface water drainage to a point except in coastal or lake front areas where there could be multiple outlets as stated by the "Federal Standards and Procedures for the National Watershed Boundary Dataset (WBD)" "Standard" (https://pubs.usgs.gov/tm/11/a3/). Watershed boundaries are determined solely upon science-based hydrologic principles, not favoring any administrative boundaries or special projects, nor particular program or agency. This dataset represents the hydrologic unit boundaries to the 12-digit (6th level) for the entire United States. Some areas may also include additional subdivisions representing the 14- and 16-dig

The response contains info about the available data set, including the download file. Below is an example of metadata for each Watershed Boundary Dataset.

In [23]:
", ".join([item for item in response_json.get("items", [])[0]])

'title, moreInfo, sourceId, sourceName, sourceOriginId, sourceOriginName, metaUrl, vendorMetaUrl, publicationDate, lastUpdated, dateCreated, sizeInBytes, extent, format, downloadURL, downloadURLRaster, previewGraphicURL, downloadLazURL, urls, datasets, boundingBox, bestFitIndex, body, processingUrl, modificationInfo'

In [24]:
# Filter files within the extent HU-2 digit in shapefile format
filtered_response = [
    record for record in response_json.get('items', [])
    if record.get('format') == 'Shapefile' and record.get('extent') == 'HU-2 Region'
]

# Save the filtered response to a file
export_path = project_base_path / 'data' / 'raw'/ 'json_docs' / 'watershed_boundary_dataset.json'
if not export_path.exists():
    try:
        with open(export_path, 'w') as f:
            json.dump(filtered_response, f, indent=4)
        print(f'\nWatershed Boundary Dataset exported successfully to {export_path}.\n')
    except Exception as err:
        print(f'Could not save the file. An error was encountered: {err}')
else:
    print(f'\nWatershed Boundary Dataset already exists at {export_path}.\n')
    
# Print the file download as markdown
display(Markdown(f'**{len(filtered_response)} files were found:**\n  '))
for record in filtered_response:
    print(f"- Name: {record['title']}")
    print(f"  Extent: {record['extent']}")
    print(f"  Description: {record['body']}")
    print(f"  Metadata URL: {record['metaUrl']}\n")


Watershed Boundary Dataset already exists at /Users/alan/Data Science Projects/Unit-Hydrograph-Model/data/raw/json_docs/watershed_boundary_dataset.json.



**4 files were found:**
  

- Name: USGS Watershed Boundary Dataset (WBD) for 2-digit Hydrologic Unit - 01 (published 20250108) Shapefile
  Extent: HU-2 Region
  Description: The Watershed Boundary Dataset (WBD) is a comprehensive aggregated collection of hydrologic unit data consistent with the national criteria for delineation and resolution. It defines the areal extent of surface water drainage to a point except in coastal or lake front areas where there could be multiple outlets as stated by the "Federal Standards and Procedures for the National Watershed Boundary Dataset (WBD)" "Standard" (https://pubs.usgs.gov/tm/11/a3/). Watershed boundaries are determined solely upon science-based hydrologic principles, not favoring any administrative boundaries or special projects, nor particular program or agency. This dataset represents the hydrologic unit boundaries to the 12-digit (6th level) for the entire United States. Some areas may also include additional subdivisions representing the 14- and 16-digit hydrologic

### 1.2. Watershed Boundary Dataset Download

In [25]:
# Download the Watershed Boundary shapefiles
for item in filtered_response:
    try:
        url = item.get('downloadURL')
        
        item_base_name = os.path.basename(url)
        item_local_path = project_base_path / 'data' / 'raw' / 'json_docs' / item_base_name
        
        if not item_local_path.exists():
            print(f'Downloading {item.get("title")}...')
            download_shp(url=url, filename=item_local_path, unzip=True)
            print(f'{item_base_name} downloaded and unzipped successfully.\n')
        else:
            print(f'{item_base_name} already exists.\n')

    except Exception as err:
        print(f'Failed to download or unzip file {item_base_name}: {err}')
        continue

Downloading USGS Watershed Boundary Dataset (WBD) for 2-digit Hydrologic Unit - 01 (published 20250108) Shapefile...
Downloading file WBD_01_HU2_Shape.zip...
Downloaded: WBD_01_HU2_Shape.zip
Extracted files to: /Users/alan/Data Science Projects/Unit-Hydrograph-Model/data/raw/json_docs/WBD_01_HU2_Shape
Deleted ZIP file: /Users/alan/Data Science Projects/Unit-Hydrograph-Model/data/raw/json_docs/WBD_01_HU2_Shape.zip
Function 'download_shp' executed in 30.0259 seconds.
WBD_01_HU2_Shape.zip downloaded and unzipped successfully.

Downloading USGS Watershed Boundary Dataset (WBD) for 2-digit Hydrologic Unit - 02 (published 20250108) Shapefile...
Downloading file WBD_02_HU2_Shape.zip...
Downloaded: WBD_02_HU2_Shape.zip
Extracted files to: /Users/alan/Data Science Projects/Unit-Hydrograph-Model/data/raw/json_docs/WBD_02_HU2_Shape
Deleted ZIP file: /Users/alan/Data Science Projects/Unit-Hydrograph-Model/data/raw/json_docs/WBD_02_HU2_Shape.zip
Function 'download_shp' executed in 1 minutes and 43.

## 2. Digital Elevation Model (DEM)

DEM available dataset is available through TNM Rest API. Refer to [dataset documentation here](../../docs/markdown/datasets.md). It has 1-meter resolution and is used to automatically delineate watershed boundaries. 

### 2.1. Digital Elevation Model Dataset availability

Below are the available DEM for download.

In [11]:
# Load the bounding box for upper hudson basin
upper_hudson_basin_bbox_path = project_base_path / 'data' / 'raw'/ 'geo' /'json' / 'upper_hudson_basin_bbox.json'
with open(upper_hudson_basin_bbox_path, 'r') as f:
    upper_hudson_basin_bbox_dict = json.load(f)

# Build the box as a string for feeding the request in the parameters
corners = ['bottom_left', 'bottom_right', 'top_right', 'top_left']
pairs = [f"{upper_hudson_basin_bbox_dict[corner][0]} {upper_hudson_basin_bbox_dict[corner][1]}" for corner in corners]
bbox = ",".join(pairs)

# Define the base URL for the TNM API
base_url = "https://tnmaccess.nationalmap.gov/api/v1/"

product = "National Elevation Dataset (NED) 1/3 arc-second"

# Define parameters for the API request to query available datasets
params = {
    "polygon": bbox,  # Specify the area to search for
    "datasets": product,  
    "outputFormat": "JSON"  # Specify JSON output
}

# Send a GET request to the API
response_upper_hudson_basin = requests.get(base_url + "products", params=params)

# Check if the request was successful
if response_upper_hudson_basin.status_code == 200:
    # Parse the JSON response
    response_upper_hudson_basin_json = response_upper_hudson_basin.json()
    
    # Display the dataset information
    display(Markdown(f'**{len(response_upper_hudson_basin_json.get("items",[]))} files were found:**\n '))
    for dataset in response_upper_hudson_basin_json.get("items", []):
        print(f"- Name: {dataset['title']}")
        print(f"  Publication Date: {dataset['publicationDate']}")
        print(f"  Description: {dataset['body']}")
        print(f"  Metadata URL: {dataset['metaUrl']}\n")
else:
    print(f"Failed to retrieve data. HTTP Status Code: {response_upper_hudson_basin.status_code}")

**50 files were found:**
 

- Name: USGS 1/3 Arc Second n41w074 20211109
  Publication Date: 2021-11-09
  Description: This tile of the 3D Elevation Program (3DEP) seamless products is 1/3 Arc Second resolution. 3DEP data serve as the elevation layer of The National Map, and provide basic elevation information for Earth science studies and mapping applications in the United States. Scientists and resource managers use 3DEP data for global change research, hydrologic modeling, resource monitoring, mapping and visualization, and many other applications. 3DEP data compose an elevation dataset that consists of seamless layers and a high resolution layer. Each of these layers consists of the best available raster elevation data of the conterminous United States, Alaska, Hawaii, territorial islands, Mexico and Canada. 3DEP data are updated continually as new data become available. Seamless 3DEP data are derived from diverse source data that are processed to a common coordinate system and unit of vertical measure. These

From the list above we observe that from the same area there is multiple `.tiff` files, from different dates. However, it is needed only the latest file. 

Below, we filter the list to have only the latest GeoTIFF for the each area and dowload those files.

In [12]:
def filter_latest_geotiff(data):

    region_latest = {}
    
    for item in data:
        # Extract region from title using regex pattern (e.g., n41w074)
        match = re.search(r'n\d+w\d+', item['title'], re.IGNORECASE)
        if not match:
            continue
        region = match.group(0).lower()

        # Parse publication date
        pub_date = datetime.strptime(item['publicationDate'], '%Y-%m-%d')

        # If region is new or found a later publication date, update the entry
        if region not in region_latest or pub_date > datetime.strptime(region_latest[region]['publicationDate'], '%Y-%m-%d'):
            region_latest[region] = item

    # Return unique latest items for each region
    return list(region_latest.values())

upper_hudson_dem_all_items = response_upper_hudson_basin_json.get("items",[])
upper_hudson_dem_filtered_data = filter_latest_geotiff(data = upper_hudson_dem_all_items)

# Display the dataset information
display(Markdown(f'**{len(upper_hudson_dem_filtered_data)} unique files were found:**\n '))
for dataset in upper_hudson_dem_filtered_data:
    print(f"- Name: {dataset['title']}")
    print(f"  Publication Date: {dataset['publicationDate']}")
    print(f"  Description: {dataset['body']}")
    print(f"  Metadata URL: {dataset['metaUrl']}\n")


**17 unique files were found:**
 

- Name: USGS 1/3 Arc Second n41w074 20240925
  Publication Date: 2024-09-25
  Description: This tile of the 3D Elevation Program (3DEP) seamless products is 1/3 Arc Second resolution. 3DEP data serve as the elevation layer of The National Map, and provide basic elevation information for Earth science studies and mapping applications in the United States. Scientists and resource managers use 3DEP data for global change research, hydrologic modeling, resource monitoring, mapping and visualization, and many other applications. 3DEP data compose an elevation dataset that consists of seamless layers and a high resolution layer. Each of these layers consists of the best available raster elevation data of the conterminous United States, Alaska, Hawaii, territorial islands, Mexico and Canada. 3DEP data are updated continually as new data become available. Seamless 3DEP data are derived from diverse source data that are processed to a common coordinate system and unit of vertical measure. These

In [14]:
# Save the filtered response to a file
export_path_dem_ds = project_base_path / 'data' / 'json_docs' / 'dem_1_3_arcsec_dataset.json'

if not export_path_dem_ds.exists():
    try:
        with open(export_path_dem_ds, 'w') as f:
            json.dump(upper_hudson_dem_filtered_data, f, indent=4)
        print(f'\nDEM 1/3 arcsec Dataset saved successfully to {export_path_dem_ds}.\n')
    except Exception as err:
        print(f'Could not save the file. An error was encountered: {err}')
else:
    print(f'\nDEM 1/3 arcsec Dataset already exists at {export_path_dem_ds}.\n')


DEM 1/3 arcsec Dataset saved successfully to /Users/alan/Data Science Projects/Unit-Hydrograph-Model/data/json_docs/dem_1_3_arcsec_dataset.json.



### 2.2. Digital Elevation Model Download

In [16]:
# Read the json dataset document
export_path_dem_ds = project_base_path / 'data' / 'raw' / 'json_docs' / 'dem_1_3_arcsec_dataset.json'
with open(export_path_dem_ds, 'r') as file:
    dem_ds = json.load(file)

count = 0
total_files = len(dem_ds)

# Download the DEM 1/3 arcsec file
for item in dem_ds:
    count += 1
    try:
        url = item.get('downloadURL')
        
        item_base_name = os.path.basename(url)
        item_local_path = project_base_path / 'data' / 'geo' / 'raster' / 'dem13arcsec' / item_base_name
        
        if not item_local_path.exists():
            print(f'Downloading {item.get("title")} file {count} of {total_files}...')
            download_GeoTIFF(url=url, filename=item_local_path, chunk_size=1024*1024)
            
        else:
            print(f'{item_base_name} already exists.\n')

    except Exception as err:
        print(f'Failed to download file: {err}')
        continue

USGS_13_n41w074_20240925.tif already exists.

USGS_13_n41w075_20221115.tif already exists.

USGS_13_n41w076_20221115.tif already exists.

USGS_13_n42w073_20211109.tif already exists.

USGS_13_n42w074_20241010.tif already exists.

USGS_13_n42w075_20240925.tif already exists.

USGS_13_n42w076_20230227.tif already exists.

USGS_13_n43w073_20230117.tif already exists.

Downloading USGS 1/3 Arc Second n43w074 20241010 file 9 of 17...
Downloading file USGS_13_n43w074_20241010.tif...
Downloaded: USGS_13_n43w074_20241010.tif0.00%)
Function 'download_GeoTIFF' executed in 12 minutes and 19.3284 seconds.

Downloading USGS 1/3 Arc Second n43w075 20241010 file 10 of 17...
Downloading file USGS_13_n43w075_20241010.tif...
Downloaded: USGS_13_n43w075_20241010.tif0.00%)
Function 'download_GeoTIFF' executed in 10 minutes and 37.8858 seconds.

Downloading USGS 1/3 Arc Second n43w076 20230227 file 11 of 17...
Downloading file USGS_13_n43w076_20230227.tif...
Downloaded: USGS_13_n43w076_20230227.tif0.00%)
F

## 3. National Land Cover Data

The National Land Cover data (Land Cover, Land Cover Confidence, Fraction Impervious Surface, Fraction Impervious Descripor) were donwloaded from MRLC download tool: given an area of interest it sends an e-mail with a link for the download. Because the file is too big (about 18Gb), a function has been written to allow it to resume download in case it halts for any reason. 

In [5]:
mrlc_data_urls = 'https://www.mrlc.gov/downloads/sciweb1/shared/mrlc/download-tool/NLCD_lUiv89Ym9tDA9GgEH1RN.zip'
mrlc_data_path = project_base_path / 'data' / 'raw' / 'geo' / 'raster' / 'NLCD' / 'NLCD.zip'

# Ensure folder and subfulders exist
mrlc_data_path.parent.mkdir(parents=True, exist_ok=True)

# Download the NLCD data
download_large_file(url=mrlc_data_urls, destination=mrlc_data_path, max_retries=5)

Starting download: /Users/alan/Data Science Projects/Unit-Hydrograph-Model/data/geo/raster/mrlc/NLCD.zip (0/19750974412 bytes)
Downloaded 19750974412/19750974412 bytes (100.00%)
Download completed: /Users/alan/Data Science Projects/Unit-Hydrograph-Model/data/geo/raster/mrlc/NLCD.zip
Function 'download_large_file' executed in 238 minutes and 48.6681 seconds.



## 4. Precipitation Data

Fifiteen minutes precipitation data are available in [NCDC NOAA website here](https://www1.ncdc.noaa.gov/pub/data/hpd/auto/v2/beta/15min/) for version 2 and [NOAA website here](https://gis.ncdc.noaa.gov/kml/precip_15.kmz) for version 1. First we download the data inventory to locate which stations fall down within the region of interest, second we download data for those precipitation stations. (Refere to [data preprocessing here](../preprocessing/data_prepocessing.ipynb), for the study region selection). Documentation may be accessed [here](https://www1.ncdc.noaa.gov/pub/data/hpd/auto/v2/beta/15min/readme.15min.txt) and [here](../../docs/other/readme.15min.txt).

### 4.1. Precipitation Station Inventory

There are two versions of precipitation data available from NOAA. Version 1 has data up to December 2013. Version two has data starting the same period as version 1 to the present but, there are less precipitatin stations available as compared to version 1. In this study is used both dataset. Dataset are available for direct download on the above mentioned links. The inventory contain the following relevant attributes:
- Version 1:
    - Name: the name of station.
    - StnID: station identification code.
    - elev: elevation.
- Version 2:
    -  StnID: station identification code.
    -  Lat: latitude.
    -  Lon:lontitude.
    -  Elev: elevation of the station.
    -  Name: the name of the station. 
    -  Sample_Interval (min): is in units of minute and indicates the typical time between sampling.
    -  UTC_Offset: is the number of hours the station's local time is offset from GMT.
    -  POR_Date_Range: first and last year-month-day of the station's Period of Record.
    -  PCT_Last_Half_Good: is the percentage of non-missing and non-flagged observations during the last half of the station's POR.  

In [8]:
# Read precipitation data inventory from website
ppt_stations_inventory = pd.read_csv(filepath_or_buffer='https://www1.ncdc.noaa.gov/pub/data/hpd/auto/v2/beta/15min/hpd-stations-inventory.15min.csv')

In [None]:
# Save the precipitation inventory data
ppt_stations_inventory_path = project_base_path / 'data' / 'raw' /'geo' / 'json' / 'ppt_stations_inventory.json'

# Save the file in csv format
if not ppt_stations_inventory_path.exists():
    try:
        ppt_stations_inventory_path.parent.mkdir(parents=True, exist_ok=True)
        ppt_stations_inventory.to_json(ppt_stations_inventory_path)
        print('Successfully saved precipitation inventory data.')
    except Exception as err:
        print(f'Failed to save precipitation inventory data: {err}')
else:
    print('File already exists.')

File already exists.


### 4.2. Precipitation Historical Data

Historical data for precipitation dataset version 1 is available through NOAA API (see the docs) and through FTP for version 2.

First, we load the metadata information about the the precipitation stations to get the ID to download data for. We Download both dataset for starting at year 1985 up to present, acording to data availability.

#### 4.2.1. Download Historical Precipitation Data Version 1

In [5]:
# Load precipitation station codes
preciptation_station_v1_codes_path = project_base_path / 'data/clean/json/precip_stations_v1_codes.json'

with open(preciptation_station_v1_codes_path, 'r') as file:
    preciptation_station_v1_codes = json.load(file)

In [20]:
# Download the data
base_url = 'https://www.ncei.noaa.gov/cdo-web/api/v2/'
headers = {"Token": NOAA_API_TOKEN}
endpoint = 'data'
parameters = {
    'stationid': 'COOP:309670',
    'datasetid': 'PRECIP_15',
    'startdate': '2014-10-03',
    'enddate': '2014-10-04',
    'limit':1000
}

response = requests.get(url=base_url + endpoint, headers=headers, params=parameters)
response.json()

{}

In [7]:
test = preciptation_station_v1_codes.copy()
test[0]['StatDate'] = 'someDate'
test[0]

{'StnID': 'COOP:281582', 'StatDate': 'someDate'}

In [10]:
# Get start and end dates
import requests
from bs4 import BeautifulSoup
import re

def extract_station_dates(url):
    try:
        # Fetch the webpage content
        response = requests.get(url)
        response.raise_for_status()
        
        # Parse HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find all script tags
        script_tags = soup.find_all('script')
        
        # Look for the script containing the station data
        for script in script_tags:
            if script.string and 'station:' in script.string:
                # Extract the station object content
                content = script.string
                
                # Use regex to find startDate and endDate
                start_date_match = re.search(r"startDate:\s*'([^']+)'", content)
                end_date_match = re.search(r"endDate:\s*'([^']+)'", content)
                
                if start_date_match and end_date_match:
                    start_date = start_date_match.group(1)
                    end_date = end_date_match.group(1)
                    
                    return {
                        'startDate': start_date,
                        'endDate': end_date
                    }
        
        print("Could not find station dates in the script")
        return None
    
    except requests.RequestException as e:
        print(f"Error fetching webpage: {e}")
        return None
    except Exception as e:
        print(f"Error processing data: {e}")
        return None

for index, stid in enumerate(preciptation_station_v1_codes):        
    print(f'Getting operational dates for station ID {stid.get('StnID')}')
    try:
        url = f'https://www.ncdc.noaa.gov/cdo-web/datasets/PRECIP_15/stations/{stid.get('StnID')}/detail'
        operational_dates = extract_station_dates(url)

        preciptation_station_v1_codes[index].update(operational_dates)
    except:
        print(f'Failed to get operational dates for station ID {stid.get('StnID')}')
        continue

Getting operational dates for station ID COOP:281582
Error fetching webpage: 503 Server Error: Service Unavailable for url: https://www.ncdc.noaa.gov/cdo-web/datasets/PRECIP_15/stations/COOP:281582/detail
Failed to get operational dates for station ID COOP:281582
Getting operational dates for station ID COOP:362160
Getting operational dates for station ID COOP:308223
Error fetching webpage: 503 Server Error: Service Unavailable for url: https://www.ncdc.noaa.gov/cdo-web/datasets/PRECIP_15/stations/COOP:308223/detail
Failed to get operational dates for station ID COOP:308223
Getting operational dates for station ID COOP:309670
Getting operational dates for station ID COOP:306119
Getting operational dates for station ID COOP:301207
Error fetching webpage: 503 Server Error: Service Unavailable for url: https://www.ncdc.noaa.gov/cdo-web/datasets/PRECIP_15/stations/COOP:301207/detail
Failed to get operational dates for station ID COOP:301207
Getting operational dates for station ID COOP:302

In [14]:
for index, stid in enumerate(preciptation_station_v1_codes):
    if 'startDate' not in stid:        
        print(f'Getting operational dates for station ID {stid.get('StnID')}')
        try:
            url = f'https://www.ncdc.noaa.gov/cdo-web/datasets/PRECIP_15/stations/{stid.get('StnID')}/detail'
            operational_dates = extract_station_dates(url)

            preciptation_station_v1_codes[index].update(operational_dates)
        except:
            print(f'Failed to get operational dates for station ID {stid.get('StnID')}')
            continue
preciptation_station_v1_codes

Getting operational dates for station ID COOP:308223
Getting operational dates for station ID COOP:301207
Getting operational dates for station ID COOP:305346
Getting operational dates for station ID COOP:300047
Getting operational dates for station ID COOP:307513
Getting operational dates for station ID COOP:308506


[{'StnID': 'COOP:281582',
  'startDate': '1970-10-03T17:15:00.000',
  'endDate': '2014-01-01T00:00:00.000'},
 {'StnID': 'COOP:362160',
  'startDate': '1971-12-06T03:15:00.000',
  'endDate': '1983-01-31T12:45:00.000'},
 {'StnID': 'COOP:308223',
  'startDate': '1980-01-01T00:15:00.000',
  'endDate': '1985-03-01T00:15:00.000'},
 {'StnID': 'COOP:309670',
  'startDate': '1970-10-03T17:15:00.000',
  'endDate': '2014-01-01T00:00:00.000'},
 {'StnID': 'COOP:306119',
  'startDate': '1971-05-02T03:15:00.000',
  'endDate': '2004-04-01T00:15:00.000'},
 {'StnID': 'COOP:301207',
  'startDate': '1971-11-15T05:15:00.000',
  'endDate': '1996-02-01T00:15:00.000'},
 {'StnID': 'COOP:302582',
  'startDate': '1984-01-01T00:15:00.000',
  'endDate': '1986-06-01T00:15:00.000'},
 {'StnID': 'COOP:306825',
  'startDate': '1971-05-03T09:15:00.000',
  'endDate': '2014-01-01T00:00:00.000'},
 {'StnID': 'COOP:301559',
  'startDate': '1978-01-01T19:15:00.000',
  'endDate': '2013-05-01T00:00:00.000'},
 {'StnID': 'COOP:30

In [15]:
# Load precipitation station codes
preciptation_station_v1_codes_path = project_base_path / 'data/clean/json/precip_stations_v1_codes.json'

with open(preciptation_station_v1_codes_path, 'w') as file:
    json.dump(preciptation_station_v1_codes, file, indent=4)

In [11]:
#Download data

# API Access
base_url = 'https://www.ncei.noaa.gov/cdo-web/api/v2/'
headers = {"Token": NOAA_API_TOKEN}
endpoint = 'data'

start_date = '1986-01-01'
end_date = '2014-01-01'

# Initial parameters
initial_parameters = {
    'datasetid': 'PRECIP_15',
    'startdate': start_date,
    'enddate': (pd.to_datetime(start_date)+pd.Timedelta(minutes=15000)).strftime('%Y-%m-%dT%H:%M:%S'), # Set exactly 1000 observation required by the API.
    'limit': 1000
}

# Log lists
content_log = []
status_code_log = []
err_log = []

# Directory for CSV files
output_dir = project_base_path / 'data/raw/tabular/precipitation/v1'
output_dir.mkdir(parents=True, exist_ok=True)

for stationcode in preciptation_station_v1_codes:
    StnID = stationcode.get('StnID')
    print(f'Fetching data for station {StnID}...', end = '\r')
    if not StnID:
        err_log.append(f"Missing StnID in stationcode: {stationcode}")
        continue
    
    # Reset parameters for new station
    parameters = initial_parameters.copy()
    parameters['stationid'] = StnID
    csv_file = f"{output_dir}/{StnID.replace(':','')}.csv"
    fetch_complete = False

    # Nested loop to fetch all data for this station
    while not fetch_complete:
        try:
            response = requests.get(url=base_url + endpoint, 
                                  headers=headers, 
                                  params=parameters,
                                  timeout=30)
            
            # Check response status
            if response.status_code != 200:
                status_code_log.append(response.status_code)
                content_log.append(f"StnID: {StnID}. Response: {response.content}")
                break

            response_json = response.json()
            if not response_json or 'results' not in response_json:
                content_log.append(f"StnID: {StnID}. Empty or invalid JSON: {response.content}")
                break

            # Convert results to DataFrame
            data = pd.DataFrame(response_json['results'])
            if data.empty:
                content_log.append(f"StnID: {StnID}. No data in results")
                break

            # Write to CSV
            try:
                if not csv_file.exists():
                    data.to_csv(csv_file, mode='w', index=False)
                else:
                    data.to_csv(csv_file, mode='a', header=False, index=False)
            except Exception as csv_err:
                err_log.append(f"CSV write error for {StnID}: {csv_err}")
                break

            # Check if we've reached the end of available data
            max_date = pd.to_datetime(data['date'].max())
            new_start = max_date + pd.Timedelta(minutes=15)
            if new_start > end_date:
                fetch_complete = True  # Less than limit, likely no more data
            else:
                # Update startdate for next batch
                try:
                    new_end = new_start + pd.Timedelta(minutes=15000)
                    parameters['startdate'] = new_start.strftime('%Y-%m-%dT%H:%M:%S')
                    parameters['enddate'] = new_end.strftime('%Y-%m-%dT%H:%M:%S') if new_end < pd.to_datetime(end_date) else end_date
                except Exception as date_err:
                    err_log.append(f"Date processing error for {StnID}: {date_err}")
                    break

        except requests.exceptions.RequestException as req_err:
            err_log.append(f"Request error for {StnID}: {req_err}")
            content_log.append(f"StnID: {StnID}. Failed request")
            status_code_log.append(getattr(response, 'status_code', 'N/A'))
            break
        
        except Exception as general_err:
            err_log.append(f"Unexpected error for {StnID}: {general_err}")
            content_log.append(f"StnID: {StnID}. Unexpected failure")
            status_code_log.append(getattr(response, 'status_code', 'N/A'))
            break

#Save logs to file
logs_df = pd.DataFrame({
    'content': content_log,
    'status_code': status_code_log,
    'error': err_log
})
log_path = output_dir / 'fetch_logs.csv'
logs_df.to_csv(log_path, index=False)
print('Finalized fetching precipitation v1 data. See log file for any detail if any err.')

Fetching data for station COOP:304555...

ValueError: All arrays must be of the same length

In [None]:
https://www.ncdc.noaa.gov/cdo-web/datasets/PRECIP_15/stations/COOP:362160/detail

ValueError: All arrays must be of the same length

In [13]:
content_log

["StnID: COOP:362160. Empty or invalid JSON: b'{}'",
 "StnID: COOP:308223. Empty or invalid JSON: b'{}'",
 "StnID: COOP:307035. Empty or invalid JSON: b'{}'",
 "StnID: COOP:301523. Empty or invalid JSON: b'{}'",
 "StnID: COOP:304426. Empty or invalid JSON: b'{}'",
 "StnID: COOP:305346. Empty or invalid JSON: b'{}'",
 "StnID: COOP:301483. Empty or invalid JSON: b'{}'",
 "StnID: COOP:301761. Empty or invalid JSON: b'{}'",
 "StnID: COOP:304025. Empty or invalid JSON: b'{}'",
 "StnID: COOP:197230. Empty or invalid JSON: b'{}'",
 "StnID: COOP:300047. Empty or invalid JSON: b'{}'",
 "StnID: COOP:300048. Empty or invalid JSON: b'{}'",
 "StnID: COOP:307514. Empty or invalid JSON: b'{}'",
 "StnID: COOP:307513. Empty or invalid JSON: b'{}'",
 "StnID: COOP:192852. Empty or invalid JSON: b'{}'",
 "StnID: COOP:436500. Empty or invalid JSON: b'{}'",
 "StnID: COOP:438160. Empty or invalid JSON: b'{}'",
 'StnID: COOP:307549. Failed request',
 "StnID: COOP:438150. Empty or invalid JSON: b'{}'",
 "StnID

In [38]:
url = 'https://www.ncdc.noaa.gov/cdo-web/datasets/PRECIP_15/stations/COOP:362160/detail'
# Fetch the webpage content
response = requests.get(url)
response.raise_for_status()

# Parse HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all script tags
script_tags = soup.find_all('script')

In [None]:
soup.find_all


<!DOCTYPE html>

<html lang="en">
<head>
<title>Precipitation 15 Minute Station Details: DINGMANS FERRY, PA US, COOP:362160 | Climate Data Online (CDO) | National Climatic Data Center (NCDC)</title>
<meta content="Precipitation 15 Minute Station Details:DINGMANS FERRY, PA US, COOP:362160" name="description"/>
<meta content="Climate Data Online (CDO)" name="application-name"/>
<meta content="Climate Data Online (CDO)" name="msapplication-tooltip">
<!--[if IE]>
	<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<![endif]-->
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="National Centers for Environmental Information (NCEI)" name="author"/>
<link href="/shared/v1/images/favicon.ico" rel="shortcut icon"/>
<link href="/shared/v1/images/apple-touch-icon-144x144.png" rel="apple-touch-icon" sizes="144x144"/>
<link href="/shared/v1/images/apple-touch-icon-114x114.png" rel="apple-touch-icon" sizes="114x114"/>
<link href

In [None]:
# Download the data and store on the disk

# API Access
base_url = 'https://www.ncei.noaa.gov/cdo-web/api/v2/'
headers = {"Token": NOAA_API_TOKEN}
endpoint = 'data'

# Parameters
parameters = {
    'datasetid': 'PRECIP_15',
    'startdate': '1986-01-01',
    'enddate': '2013-12-31',
    'limit': 1000
    }

# Log
content_long = []
status_code_log = []
err_log = []

for stationcode in preciptation_station_v1_codes:
    StnID = stationcode.get('StnID')
    parameters.update(stationid = StnID)

    try:
        response = requests.get(url=base_url + endpoint, headers=headers, params=parameters)
        if response.status_code == 200:
            response_json = response.json()
            if response_json:
                data = pd.DataFrame(response_json.get('results'))
                startdate = pd.Datetime(pd.data['date'].max()) + pd.Timedelta(minutes=15)
                startdate = str(startdate).replace(' ', 'T')
                
    except Exception as err:
        content_long.append(f'StnID: {StnID}. Start Date: {parameters.get('startdate')}. Response Content: {response.content}')
        status_code_log.append(response.status_code)
        err_log.append(err)


In [113]:
pd.DataFrame(response.json().get('results'))['date'].max()

'1986-12-25T05:45:00'

In [120]:
str(pd.to_datetime(pd.DataFrame(response.json().get('results'))['date'].max()) + pd.Timedelta(minutes=15)).replace(' ', 'T')

'1986-12-25T06:00:00'

In [90]:
response.json().get('metadata').get('resultset')

{'offset': 1, 'count': 406, 'limit': 1000}

In [103]:
pd.DataFrame(response.json().get('results',[]))

Unnamed: 0,date,datatype,station,attributes,value
0,1986-01-01T00:15:00,QPCP,COOP:309670,"g,,HT",0
1,1986-01-03T07:30:00,QPCP,COOP:309670,",,HT",10
2,1986-01-03T08:30:00,QPCP,COOP:309670,",,HT",10
3,1986-01-03T10:30:00,QPCP,COOP:309670,",,HT",10
4,1986-01-05T02:00:00,QPCP,COOP:309670,",,HT",10
...,...,...,...,...,...
401,1986-12-25T02:45:00,QPCP,COOP:309670,",,HT",10
402,1986-12-25T03:15:00,QPCP,COOP:309670,",,HT",20
403,1986-12-25T03:30:00,QPCP,COOP:309670,",,HT",10
404,1986-12-25T03:45:00,QPCP,COOP:309670,",,HT",10


#### 4.2.2. Download Historical Precipitation Data Version 2

In [6]:
# Load metada for historical precipitation file version 2
precip_v2_metada_path = project_base_path / 'data/clean/json/precip_stations_v2_codes.json'
try:
    with open(precip_v2_metada_path, 'r') as file:
        precip_v2_metada = json.load(file)
    print('Precipitation V2 loaded successfully!')
except Exception as err:
    print('Failed to load precipitation v2 metadata file:', err)

Precipitation V2 loaded successfully!


In [None]:
# Download data
destination_folder = project_base_path / 'data/raw/tabular/precipitation/v2'
destination_folder.mkdir(parents=True, exist_ok=True) 
for station in precip_v2_metada:
    StnID = station.get('StnID')
    StnID_extension = StnID + '.csv'
    file_name = destination_folder / StnID_extension
    url = 'https://www1.ncdc.noaa.gov/pub/data/hpd/auto/v2/beta/15min/all_csv/' + StnID + '.15m.csv'
    
    print(f"Downloading {StnID} from: {url}")
    try:
        download_large_file(url=url, destination=file_name, max_retries=3, chunk_size=8192)
    except Exception as err:
        print(f'Failed to download {StnID}: {err}')
    finally:
        continue

Downloading USC00305435 from: https://www1.ncdc.noaa.gov/pub/data/hpd/auto/v2/beta/15min/all_csv/USC00305435.15m.csv
Saving to: c:\Users\avpalves\Downloads\Pessoal\Unit-Hydrograph-Model\data\raw\tabular\precipitation\v1\USC00305435.csv
Starting download: c:\Users\avpalves\Downloads\Pessoal\Unit-Hydrograph-Model\data\raw\tabular\precipitation\v1\USC00305435.csv (0/10110775 bytes)
Downloaded 10110775/10110775 bytes (100.00%)
Download completed: c:\Users\avpalves\Downloads\Pessoal\Unit-Hydrograph-Model\data\raw\tabular\precipitation\v1\USC00305435.csv
Function 'download_large_file' executed in 12.0008 seconds.
Downloading USC00306825 from: https://www1.ncdc.noaa.gov/pub/data/hpd/auto/v2/beta/15min/all_csv/USC00306825.15m.csv
Saving to: c:\Users\avpalves\Downloads\Pessoal\Unit-Hydrograph-Model\data\raw\tabular\precipitation\v1\USC00306825.csv
Starting download: c:\Users\avpalves\Downloads\Pessoal\Unit-Hydrograph-Model\data\raw\tabular\precipitation\v1\USC00306825.csv (0/10840544 bytes)
Dow