# Data Access

## Getting Data via Serial Download

This tutorial demonstrates a common method of acquiring data that is useful for data exploration. This method involves the following:

1. Download one or several miniseed files from a data provider. We will use EarthScope's FDSN service to request files.
2. Read each stream and extract its metadata.
3. Process the data by removing the mean and linear trend, applying a taper, and applying a bandpass filter. This puts the traces on a common footing for comparison.
4. Visualize the data before and after processing.
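
The trend-removal part of step 3 can be previewed without any seismology library. The sketch below is a plain-Python illustration of mean removal and linear detrending on a short list of made-up samples. The function names `demean` and `detrend_linear` are hypothetical, and this is not how ObsPy implements these steps; in ObsPy the corresponding operations are exposed as `Stream.detrend`, `Stream.taper`, and `Stream.filter`.

```python
def demean(samples):
    """Subtract the mean so the samples are centered on zero."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def detrend_linear(samples):
    """Subtract the least-squares straight line fitted to the samples."""
    n = len(samples)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    # Slope and intercept of the best-fit line y = slope * x + intercept
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [y - (slope * x + intercept) for x, y in zip(xs, samples)]


# Made-up signal with a constant offset and a steady drift
raw = [10.0, 12.5, 14.0, 16.5, 18.0]
print(demean(raw))          # offset removed, drift still present
print(detrend_linear(raw))  # offset and drift both removed
```

Removing the offset and drift first keeps the later bandpass filter from reacting to the step at the start and end of the record.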

### Setup

We will use modules from the Python standard library together with third-party packages such as pandas and ObsPy. These packages are already included in GeoLab; you will not need to install them. We start the script by importing the packages.

In [None]:
from __future__ import annotations

import pandas as pd
import io
import os
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterable, List, Optional, Tuple

import numpy as np
import requests
from obspy import UTCDateTime, read as obspy_read
from obspy.clients.fdsn import Client, URL_MAPPINGS
from obspy.clients.fdsn.header import FDSNNoDataException


### IMUSH Data

A listing for miniseed data is available for Mt. St. Helens from IMUSH (Imaging Magma Under St. Helens). The [web page](https://ds.iris.edu/mda/XD/?starttime=2014-01-01T00%3A00%3A00&endtime=2016-12-31T23%3A59%3A59#XD_2014-01-01_2016-12-31) lists the stations that recorded activity from 2014 to 2016.

The stations that have miniseed data have been saved in the IMUSH.csv file.

**Protip**
> The listing is dynamically generated by JavaScript, which makes scraping the stations we want more complicated. A simple solution is to copy the stations of interest, paste them into a spreadsheet such as Google Sheets, and save it as a CSV file.

### Getting Stations Data

The IMUSH.csv file provides the station names and the start and end times for the recorded data. We can use this information to request the data by reading the CSV file. As we read each row of the CSV file, we store its values in a `@dataclass`.

In [None]:
@dataclass(frozen=True)
class StationRow:
    station: str
    datacenter: str
    start: UTCDateTime
    end: UTCDateTime
    site: str
    latitude: float
    longitude: float
    elevation_m: float
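
The `frozen=True` argument makes each record read-only after it is created. A minimal sketch of that behavior follows, using a hypothetical, simplified class `DemoRow` (times kept as plain strings so the example needs only the standard library) and made-up station codes:

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class DemoRow:
    # Hypothetical, simplified stand-in for StationRow
    station: str
    start: str
    end: str


row = DemoRow(station="STA1", start="2014-07-17", end="2014-07-18")
print(row.station)

try:
    row.station = "STA2"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("Frozen records cannot be modified after creation")

# Frozen dataclasses with hashable fields are hashable, so equal rows
# collapse to a single entry when placed in a set
unique_rows = {row, DemoRow("STA1", "2014-07-17", "2014-07-18")}
print(len(unique_rows))
```

Read-only records are a reasonable fit here because station metadata should not change between reading the CSV file and requesting the data.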

We will need a function to read the CSV file and put its rows in a list. The function has three parameters: the path to the file, an optional start row, and an optional end row. The function uses the `pandas` package to read the file. Pandas treats the data as a table, so we can select all the rows or a specific range.

Note that the function uses the `StationRow` data class and returns a Python list. 

In [None]:
def read_csv(csv_path: str | Path, *, start_row: Optional[int] = None, end_row: Optional[int] = None) -> List[StationRow]:
    
    """
    Read a station CSV using pandas and return StationRow dataclass objects.

    Parameters
    ----------
    csv_path : str or Path
        Path to the CSV file.

    start_row : int, optional
        Zero-based index of the first row to read (inclusive).
        If None, starts from the beginning.

    end_row : int, optional
        Zero-based index of the last row to read (exclusive).
        If None, reads through the end of the file.

    Behavior
    --------
    - If start_row and end_row are both None, all rows are read.
    - Rows are selected using df.iloc[start_row:end_row].
    """
    df = pd.read_csv(csv_path)

    # Slice rows (pandas handles None cleanly)
    df_sel = df.iloc[start_row:end_row]

    station_rows: List[StationRow] = []

    for _, r in df_sel.iterrows():
        station_rows.append(
            StationRow(
                station=str(r["Station"]).strip(),
                datacenter=str(r["DataCenter"]).strip(),
                start=UTCDateTime(str(r["Start"])),
                end=UTCDateTime(str(r["End"])) + 86400,  # inclusive end date
                site=str(r["Site"]).strip(),
                latitude=float(r["Latitude"]),
                longitude=float(r["Longitude"]),
                elevation_m=float(r["Elevation"]),
            )
        )

    return station_rows

Let's try out the `read_csv` function and print out the first five rows.

In [None]:
stations = read_csv("IMUSH.csv", start_row=0, end_row=5)

for s in stations:
    print(s)

Note that the index of the first row of a pandas table, or DataFrame, is 0. Note the station in the first row. Change the `start_row` parameter to 1 and compare the result.
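
The row selection follows Python's half-open slice convention: `start_row` is included and `end_row` is not. A minimal sketch with the standard `csv` module illustrates this; the station codes and dates are made up, only the column names follow the tutorial's CSV file, and the helper `select` is hypothetical.

```python
import csv
import io

# Made-up sample data with seven rows
sample = io.StringIO(
    "Station,Start,End\n"
    "S0,2014-06-01,2016-08-01\n"
    "S1,2014-06-01,2016-08-01\n"
    "S2,2014-06-01,2016-08-01\n"
    "S3,2014-06-01,2016-08-01\n"
    "S4,2014-06-01,2016-08-01\n"
    "S5,2014-06-01,2016-08-01\n"
    "S6,2014-06-01,2016-08-01\n"
)
rows = list(csv.DictReader(sample))


def select(rows, start_row=None, end_row=None):
    """Return rows[start_row:end_row]; None means no bound on that side."""
    return rows[start_row:end_row]


print([r["Station"] for r in select(rows, 0, 5)])  # five rows: S0 to S4
print([r["Station"] for r in select(rows, 1, 5)])  # four rows: S1 to S4
print(len(select(rows)))                           # all seven rows
```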

### Downloading Miniseed files

Next we will download three miniseed files and save them in GeoLab. We'll write a function that takes a list of stations read from the IMUSH.csv file and uses `obspy` to request the data from EarthScope's FDSN service.

In [None]:
def download_miniseed(station_rows, *, starttime=None, endtime=None, output_dir="./seismic_data"):
    """
    Download miniseed files from EarthScope's FDSN service, one station at a time.
    
    Parameters:
    -----------
    station_rows : list of StationRow
        Stations to download; each row provides .station, .start, and .end
    starttime : UTCDateTime or str, optional
        Overrides every station's start time if given
    endtime : UTCDateTime or str, optional
        Overrides every station's end time if given
    output_dir : str, optional
        Directory to save the miniseed files (default: './seismic_data')
    
    Returns:
    --------
    str : Path to the last miniseed file saved
    
    Example:
    --------
    >>> rows = read_csv("IMUSH.csv", start_row=0, end_row=3)
    >>> download_miniseed(rows, starttime="2014-07-17", endtime="2014-07-18")
    """
    
    # default values
    network = "XD"
    location = "*"
    channel = "*HZ"
    
    # parse list as a tuple using list comprehension
    station_data = [(row.station, row.start, row.end) for row in station_rows]
    
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Initialize EarthScope FDSN client
    client = Client('IRIS')  # IRIS is part of EarthScope
    
    # Download waveform data
    filepath = None  # path of the most recently saved file (None if nothing was saved)
    for station, start, end in station_data:
        # Use the override when given; otherwise use this station's own times.
        # Separate names (t0, t1) keep the overrides from being overwritten
        # on the first pass through the loop.
        t0 = UTCDateTime(starttime if starttime is not None else start)
        t1 = UTCDateTime(endtime if endtime is not None else end)
        
        try:
            st = client.get_waveforms(
                network=network,
                station=station,
                location=location,
                channel=channel,
                starttime=t0,
                endtime=t1
            )
        except Exception as e:
            # Skip stations with no data or a failed request, but say so
            print(f"Skipping {station}: {e}")
            continue
    
        # Create filename
        filename = f"{network}_{station}_{t0.strftime('%Y%m%d')}_{t1.strftime('%Y%m%d')}.mseed"
        filepath = os.path.join(output_dir, filename)
        
        # Save to miniseed file
        st.write(filepath, format='MSEED')
        print(f"Successfully saved to: {filepath}")
        
    return filepath

This is how to call the function. Note that the optional start and end times are specified.

In [None]:
starttime=UTCDateTime(2014, 7, 17)
endtime=UTCDateTime(2014, 7, 18)

stations = read_csv("IMUSH.csv", start_row=5, end_row=8)

filepath = download_miniseed(stations, starttime=starttime, endtime=endtime)
                             

Try downloading a single station without specifying the start and end time.

In [None]:
# download a single station without a start and end time



## Parallel Seismic Data Download from EarthScope FDSN

This notebook demonstrates how to download seismic miniseed files in parallel from EarthScope's FDSN service using Python's concurrent.futures module.

### Concurrency and Parallel Processing with Python

Concurrency and parallel processing in Python allow programs to handle multiple tasks more efficiently, though they work in different ways. Python has a Global Interpreter Lock (GIL) that prevents multiple threads from executing Python bytecode simultaneously, which means threads don't truly run in parallel for CPU-bound tasks.

Concurrency involves managing multiple tasks without necessarily running them simultaneously. Python juggles multiple tasks by switching between them rapidly. When one task is waiting (for example, for a file to download or a database to respond), Python can pause that task and work on something else instead of sitting idle.

Parallel processing, on the other hand, actually executes multiple computations simultaneously across multiple CPU cores. Parallel processing works by launching multiple instances of Python as separate processes across cores, making it ideal for CPU-intensive tasks like data processing or mathematical computations. The key distinction is that concurrency is about dealing with multiple things at once by switching between them efficiently, while parallelism is about doing multiple things at the exact same time.
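
The benefit of concurrency for waiting-dominated work can be seen in a small experiment. The sketch below uses a hypothetical `fake_download` function that only sleeps, standing in for a network request; the task names and the 0.2 second delay are arbitrary choices for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(name, delay=0.2):
    """Stand-in for a network request: the thread just waits."""
    time.sleep(delay)
    return name


names = ["A", "B", "C", "D"]

# Sequential: each wait happens one after another
t0 = time.perf_counter()
sequential = [fake_download(n) for n in names]
sequential_time = time.perf_counter() - t0

# Threaded: the waits overlap because a sleeping thread releases the GIL
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(fake_download, names))
threaded_time = time.perf_counter() - t0

print(f"sequential: {sequential_time:.2f} s, threads: {threaded_time:.2f} s")
```

Threads help here only because the tasks spend their time waiting. For CPU-bound work the GIL would serialize them, and separate processes would be the more suitable tool.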

### Import Required Libraries

We'll need ObsPy for seismic data handling and `concurrent.futures` for parallel downloads. 

In [None]:
from concurrent.futures import ThreadPoolExecutor, as_completed
import os
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

### Single Station Download Function

This function (or task) handles downloading data for one station at a time. It will be called in parallel by the `download_miniseed` function.

In [None]:
def download_single_station(station, start, end, network, location, channel, 
                            starttime_override, endtime_override, output_dir, client):
    """
    Download miniseed file for a single station.
    
    This function is designed to be called in parallel for multiple stations.
    It handles all the logic for one download operation.
    
    Parameters:
    -----------
    station : str
        Station code (e.g., 'ANMO')
    start : str or UTCDateTime
        Station's start date from metadata
    end : str or UTCDateTime
        Station's end date from metadata
    network : str
        Network code (e.g., 'XD')
    location : str
        Location code (e.g., '*' for all locations)
    channel : str
        Channel code (e.g., 'BH?' for all BH channels)
    starttime_override : str, UTCDateTime, or None
        Override start time if provided
    endtime_override : str, UTCDateTime, or None
        Override end time if provided
    output_dir : str
        Directory to save the file
    client : obspy.clients.fdsn.Client
        FDSN client instance
    
    Returns:
    --------
    tuple : (success: bool, filepath: str or None, error: str or None)
    """
    # Determine actual start/end times (use override if provided, otherwise use station metadata)
    actual_start = starttime_override if starttime_override is not None else start
    actual_end = endtime_override if endtime_override is not None else end
    starttime = UTCDateTime(actual_start)
    endtime = UTCDateTime(actual_end)
    
    try:
        # Request waveform data from the FDSN service
        st = client.get_waveforms(
            network=network,
            station=station,
            location=location,
            channel=channel,
            starttime=starttime,
            endtime=endtime
        )
        
        # Create a descriptive filename with network, station, and date range
        filename = f"{network}_{station}_{starttime.strftime('%Y%m%d')}_{endtime.strftime('%Y%m%d')}.mseed"
        filepath = os.path.join(output_dir, filename)
        
        # Save the waveform data to a miniseed file
        st.write(filepath, format='MSEED')
        print(f"✓ Successfully saved to: {filepath}")
        
        return (True, filepath, None)
        
    except Exception as e:
        # If download fails, print error and return failure status
        print(f"✗ Failed to download {station}: {str(e)}")
        return (False, None, str(e))



> **Explainer:**
>
> Note the similarity between this code and the previous miniseed download example. Instead of pulling the request parameters out of a list of station rows, this function receives each parameter required by the FDSN dataselect service as an explicit argument.

### Parallel Download Function

This is the function that orchestrates parallel downloads for multiple stations by reading a list of stations, start times, and end times. The list can be constructed programmatically or read from a CSV file.

In [None]:
def download_miniseed(station_rows, *, starttime=None, endtime=None, 
                     output_dir="./seismic_data", max_workers=5):
    """
    Download miniseed files from EarthScope's FDSN service in parallel.
    
    This function uses ThreadPoolExecutor to download data from multiple stations
    simultaneously, significantly reducing total download time compared to sequential
    downloads.
    
    Parameters:
    -----------
    station_rows : iterable
        Iterable of station objects with .station, .start, .end attributes
        (e.g., rows from a pandas DataFrame or list of namedtuples)
    starttime : str, optional
        Override start time in format 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SS'
        If None, uses each station's individual start time
    endtime : str, optional
        Override end time in format 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SS'
        If None, uses each station's individual end time
    output_dir : str, optional
        Directory to save the miniseed files (default: './seismic_data')
    max_workers : int, optional
        Maximum number of parallel downloads (default: 5)
        Increase for faster downloads, but be mindful of server limits
    
    Returns:
    --------
    dict : Dictionary containing:
        - 'successful': List of dicts with station names and filepaths
        - 'failed': List of dicts with station names and error messages
    
    Example:
    --------
    >>> # Assuming you have a DataFrame with station information
    >>> import pandas as pd
    >>> stations_df = pd.DataFrame({
    ...     'station': ['ANMO', 'CCM', 'HLID'],
    ...     'start': ['2024-01-01', '2024-01-01', '2024-01-01'],
    ...     'end': ['2024-01-02', '2024-01-02', '2024-01-02']
    ... })
    >>> 
    >>> # Download data for all stations in parallel
    >>> results = download_miniseed(
    ...     stations_df.itertuples(),
    ...     starttime='2024-01-01',
    ...     endtime='2024-01-02',
    ...     max_workers=10
    ... )
    >>> 
    >>> print(f"Downloaded {len(results['successful'])} files")
    >>> print(f"Failed: {len(results['failed'])} files")
    """
    
    # Default FDSN parameters for EarthScope/IRIS network
    network = "XD"        # Network code
    location = "*"        # All locations
    channel = "*HZ"       # Vertical high-gain seismometer channels, any band code (e.g., BHZ, HHZ)
    
    # Extract station data from the input rows
    # This creates a list of tuples: [(station_code, start_date, end_date), ...]
    station_data = [(row.station, row.start, row.end) for row in station_rows]
    
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    print(f"Output directory: {output_dir}")
    
    # Initialize EarthScope FDSN client
    # The same client is shared by all worker threads; each call is an independent HTTP request
    client = Client('IRIS')
    print(f"Initialized IRIS/EarthScope client")
    print(f"Preparing to download {len(station_data)} stations with {max_workers} parallel workers\n")
    
    # Initialize results dictionary to track successful and failed downloads
    results = {
        'successful': [],
        'failed': []
    }
    
    # Use ThreadPoolExecutor for parallel downloads
    # ThreadPoolExecutor is ideal for I/O-bound tasks like network downloads
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all download tasks to the executor
        # This creates a Future object for each station download
        future_to_station = {
            executor.submit(
                download_single_station,
                station, start, end, network, location, channel,
                starttime, endtime, output_dir, client
            ): station
            for station, start, end in station_data
        }
        
        # Process completed downloads as they finish (not in submission order)
        # as_completed() yields futures as they complete, allowing real-time progress updates
        for future in as_completed(future_to_station):
            station = future_to_station[future]
            try:
                # Get the result from the completed future
                success, filepath, error = future.result()
                
                if success:
                    # Add to successful downloads list
                    results['successful'].append({
                        'station': station,
                        'filepath': filepath
                    })
                else:
                    # Add to failed downloads list with error message
                    results['failed'].append({
                        'station': station,
                        'error': error
                    })
            except Exception as e:
                # Handle any unexpected errors that weren't caught in download_single_station
                print(f"✗ Unexpected error for station {station}: {str(e)}")
                results['failed'].append({
                    'station': station,
                    'error': str(e)
                })
    
    # Print summary of download results
    print(f"\n{'='*60}")
    print(f"Download Summary:")
    print(f"  ✓ Successful: {len(results['successful'])} stations")
    print(f"  ✗ Failed: {len(results['failed'])} stations")
    print(f"{'='*60}")
    
    return results



> **Explainer:**
>
> This code uses Python's `ThreadPoolExecutor` to download seismic data from multiple stations simultaneously by creating a pool of worker threads (controlled by max_workers) that can execute download tasks concurrently. 
>
> First, it submits all download tasks to the executor using `executor.submit()`, which immediately returns a Future object for each station. Future objects act as placeholders for results that will arrive later, and they're stored in a dictionary (future_to_station) that maps each Future to its corresponding station name for tracking purposes. 
>
> Instead of waiting for all downloads to finish, the code uses `as_completed()` to process results as soon as each individual download completes (which may be in a different order than they were submitted), allowing for real-time progress updates and immediate handling of both successful downloads (storing the filepath) and failures (storing the error message).
>
> The with statement ensures that all threads are properly cleaned up when the downloads finish, and the try-except block catches any unexpected errors that might occur when retrieving results from the Future objects. The entire process is robust and efficient for I/O-bound network operations where threads spend most of their time waiting for server responses rather than consuming CPU resources.
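
The submit-then-collect pattern can be tried on its own without any network access. In this minimal sketch the function `task` and the names `"a"`, `"bad"`, and `"c"` are hypothetical; one task raises on purpose to show how an error surfaces through `future.result()`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def task(name):
    """Hypothetical task: fails for one name to show error handling."""
    if name == "bad":
        raise ValueError("no data")
    return f"{name}.mseed"


names = ["a", "bad", "c"]
results = {"successful": [], "failed": []}

with ThreadPoolExecutor(max_workers=3) as executor:
    # Map each Future back to the name it was submitted for
    future_to_name = {executor.submit(task, n): n for n in names}

    # Futures are yielded in completion order, not submission order
    for future in as_completed(future_to_name):
        name = future_to_name[future]
        try:
            results["successful"].append((name, future.result()))
        except Exception as exc:  # future.result() re-raises the task's error
            results["failed"].append((name, str(exc)))

print(results)
```

Because completion order is not guaranteed, the dictionary lookup is what ties each result back to the item that produced it.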

### Example Usage

Below are examples demonstrating the parallel download function.

In [None]:
# Example 1: Download data for multiple stations
# -----------------------------------------------

import pandas as pd

# Create sample station data
stations_df = pd.DataFrame({
    'station': ['ANMO', 'CCM', 'HLID', 'SRU'],
    'start': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-01'],
    'end': ['2024-01-02', '2024-01-02', '2024-01-02', '2024-01-02']
})

# Download with 10 parallel workers
results = download_miniseed(
    stations_df.itertuples(),
    starttime='2014-07-17T00:00:00', # optional override
    endtime='2014-07-19T00:00:00', # optional override
    output_dir='./my_seismic_data', # optional override
    max_workers=10
)

# Display results
print("\nSuccessful downloads:")
for item in results['successful']:
    print(f"  {item['station']}: {item['filepath']}")

if results['failed']:
    print("\nFailed downloads:")
    for item in results['failed']:
        print(f"  {item['station']}: {item['error']}")

> **Explainer:**
>
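> This example uses a pandas DataFrame to build a table that specifies each station with a start and end time. It calls the `download_miniseed` function to download the files in parallel. Because the function hard-codes the network code `XD`, any station that has no data under that network in the requested time window is reported under `failed`.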
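
The download function only needs objects that expose `.station`, `.start`, and `.end` attributes, which is why the namedtuples produced by `DataFrame.itertuples()` work as input. A minimal sketch of that requirement, using a hypothetical `Row` namedtuple and made-up station codes:

```python
from collections import namedtuple

# Hypothetical row type; any object with these three attributes would do
Row = namedtuple("Row", ["station", "start", "end"])

rows = [
    Row("STA1", "2014-07-17", "2014-07-18"),
    Row("STA2", "2014-07-17", "2014-07-18"),
]

# The same extraction step the download function performs on its input
station_data = [(row.station, row.start, row.end) for row in rows]
print(station_data)
```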

In [None]:
# Download the iMUSH data using the CSV file.

In [None]:
# Example 2: Using with a larger station list from a CSV file
# -----------------------------------------------------------

import pandas as pd

# Read station list from CSV
stations_df = pd.read_csv('stations.csv')

# Download data with higher parallelism for faster processing
results = download_miniseed(
    stations_df.itertuples(),
    starttime='2014-07-17',
    endtime='2014-07-19',  # Two days of data
    output_dir='./daily_seismic_data',
    max_workers=20  # Process 20 stations simultaneously
)

# Save results to CSV for record keeping
successful_df = pd.DataFrame(results['successful'])
failed_df = pd.DataFrame(results['failed'])

successful_df.to_csv('successful_downloads.csv', index=False)
if len(failed_df) > 0:
    failed_df.to_csv('failed_downloads.csv', index=False)

> **Explainer:**
>
> Instead of creating a table of stations with start and end times programmatically, we can read the data from a CSV file and use 20 workers to download the files.

### Performance Comparison

The parallel approach can dramatically reduce download time. Assuming each download takes about 5 seconds:

Sequential (original code):
  - 10 stations × 5 seconds each = 50 seconds total

Parallel (with max_workers=5):
  - 10 stations ÷ 5 workers × 5 seconds = 10 seconds total
  - ~5x speedup!

Parallel (with max_workers=10):
  - 10 stations ÷ 10 workers × 5 seconds = 5 seconds total
  - ~10x speedup!

Note: These are idealized estimates. Actual speedup depends on network bandwidth and server response times, and setting max_workers above the number of stations gives no further gain.
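
This kind of estimate can be written as a short calculation. The sketch below assumes every download takes the same time and that downloads run in full batches of `max_workers`; the function name `estimated_time` is hypothetical, and real timings will differ.

```python
import math


def estimated_time(n_stations, max_workers, seconds_each):
    """Idealized wall time: downloads run in batches of `max_workers`."""
    batches = math.ceil(n_stations / max_workers)
    return batches * seconds_each


for workers in (1, 5, 10, 20):
    t = estimated_time(10, workers, 5)
    print(f"max_workers={workers:>2}: {t} s")
```

Under these assumptions, going from 10 to 20 workers changes nothing for 10 stations, since there is no more than one batch to run.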