# Data Preprocessing

After acquiring the data, the next step is to preprocess it. Preprocessing increases the likelihood of finding similar events by applying filters (such as bandpass filters) that remove out-of-band noise and allow the desired signal's spectral characteristics to stand out.

One preprocessing step is removing the mean value from a trace, which centers it on zero. Another is linear detrending, which subtracts a best-fit straight line from the trace to remove slow drift. Removing the offset and drift keeps the focus on high-frequency events, such as P- and S-waves from volcano-tectonic (VT) earthquakes. A third step is applying a bandpass filter, which removes both low-frequency noise (below 5 Hz for VT events) and high-frequency noise (above 25 Hz for VT events).
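
The first two steps can be illustrated without any seismology library. The sketch below is a minimal pure-Python illustration, not ObsPy's implementation; the helper names `demean` and `detrend_linear` and the synthetic samples are assumptions made for this example.

```python
def demean(samples):
    """Subtract the mean so the trace is centered on zero."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def detrend_linear(samples):
    """Subtract the least-squares straight line fitted to the samples."""
    n = len(samples)
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    # Slope and intercept of the best-fit line y = slope * x + intercept
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [y - (slope * x + intercept) for x, y in enumerate(samples)]


# Synthetic trace: a small oscillation riding on an offset and a drift
raw = [100.0 + 0.5 * i + (1.0 if i % 2 else -1.0) for i in range(10)]

centered = demean(raw)          # offset removed, drift still present
flattened = detrend_linear(raw)  # offset and drift both removed
```

After demeaning, the drift still dominates the trace; after linear detrending, only the small oscillation remains.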


## Parallel Processing Patterns

In the previous exercise, we demonstrated how to use concurrency to download miniseed files efficiently. The pattern was to define a single task (request and download data), create a function to manage multiple tasks, and call that function. We can use the same pattern to perform computational tasks, such as waveform preprocessing, in parallel. Recall that parallel processing, unlike concurrency, launches multiple Python instances across CPU cores. The code in this section uses ThreadPoolExecutor, which runs worker threads inside a single Python process; ProcessPoolExecutor from the same module exposes the same interface and launches separate processes instead.
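
The three-part pattern (single task, manager function, call) can be sketched with a toy task. The names `square`, `run_tasks`, and `executor_cls` are hypothetical and exist only for this illustration; the task stands in for "preprocess one file".

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(n):
    """Single task: a stand-in for 'preprocess one file'."""
    return n * n


def run_tasks(task, items, executor_cls=ThreadPoolExecutor, max_workers=4):
    """Manager: submit one task per item and collect results as they finish."""
    results = {}
    with executor_cls(max_workers=max_workers) as executor:
        future_to_item = {executor.submit(task, item): item for item in items}
        for future in as_completed(future_to_item):
            results[future_to_item[future]] = future.result()
    return results


# Call the manager. Passing executor_cls=ProcessPoolExecutor would run the
# tasks in separate Python processes instead of threads.
squares = run_tasks(square, [1, 2, 3, 4])
print(sorted(squares.items()))
```

One caveat if you try a process pool from a notebook: on platforms that start workers with the `spawn` method, the task function typically needs to live in an importable module rather than in a notebook cell.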

### Import packages

In [None]:
import os
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from obspy import read

### Get a list of miniseed files in a directory

In [None]:
def get_miniseed_files(input_dir, extensions=None):
    """
    Get all miniseed files from the input directory.
    
    Parameters:
    -----------
    input_dir : str or Path
        Directory containing miniseed files
    extensions : list of str, optional
        File extensions to look for (default: ['.mseed', '.miniseed', '.ms'])
    
    Returns:
    --------
    list : List of Path objects for miniseed files
    """
    if extensions is None:
        extensions = ['.mseed', '.miniseed', '.ms']
    
    input_path = Path(input_dir)
    
    if not input_path.exists():
        raise FileNotFoundError(f"Input directory not found: {input_dir}")
    
    # Find all files with miniseed extensions
    miniseed_files = []
    for ext in extensions:
        miniseed_files.extend(input_path.glob(f"*{ext}"))
    
    return sorted(miniseed_files)

> **Explainer:**
> 
> This is a utility function that scans a directory and finds all files with miniseed extensions (.mseed, .miniseed, or .ms). It returns a sorted list of Path objects representing each miniseed file found. The function validates that the directory exists and raises an error if it doesn't, preventing silent failures later in the processing pipeline.
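
On case-sensitive filesystems, a glob pattern such as `*.mseed` will not match a file named `FILE.MSEED`. If your archive mixes upper- and lower-case extensions, one possible variant compares lower-cased suffixes instead. The helper name `find_by_suffix` and the file names below are made up for this sketch.

```python
import tempfile
from pathlib import Path


def find_by_suffix(input_dir, extensions=(".mseed", ".miniseed", ".ms")):
    """Hypothetical variant: match file suffixes case-insensitively."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )


# Demonstrate on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.mseed", "b.MSEED", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in find_by_suffix(tmp)]

print(found)
```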

### Preprocess a Single Miniseed File

In [None]:
def preprocess_single_file(input_filepath, output_dir, freqmin, freqmax, 
                           taper_percentage, corners, zerophase):
    """
    Preprocess a single miniseed file with detrending, tapering, and filtering.
    
    Processing steps:
    1. Linear detrend - removes linear trends in the data
    2. Demean - removes the mean value (centers data around zero)
    3. Taper - applies a window to reduce edge effects (default 5%)
    4. Bandpass filter - keeps only frequencies within specified range
    
    Parameters:
    -----------
    input_filepath : str or Path
        Path to the input miniseed file
    output_dir : str or Path
        Directory to save the processed file
    freqmin : float
        Minimum frequency for bandpass filter (Hz)
    freqmax : float
        Maximum frequency for bandpass filter (Hz)
    taper_percentage : float
        Fraction of the trace to taper at each end, 0-0.5 (e.g. 0.05 = 5%)
    corners : int
        Number of corners for Butterworth filter
    zerophase : bool
        If True, apply zero-phase filter
    
    Returns:
    --------
    tuple : (success: bool, input_file: str, output_file: str or None, error: str or None)
    """
    input_path = Path(input_filepath)
    filename = input_path.name
    
    try:
        # Read the miniseed file
        st = read(str(input_filepath))
        
        # Get original file info for reporting
        n_traces = len(st)
        original_stats = f"{n_traces} trace(s)"
        
        # Process each trace in the stream
        for tr in st:
            # Step 1: Apply linear detrend (removes linear trend)
            tr.detrend('linear')
            
            # Step 2: Remove mean (demean)
            tr.detrend('demean')
            
            # Step 3: Apply taper (5% by default)
            # Taper reduces edge effects in filtering
            tr.taper(max_percentage=taper_percentage, type='hann')
            
            # Step 4: Apply bandpass filter
            tr.filter('bandpass', 
                     freqmin=freqmin, 
                     freqmax=freqmax,
                     corners=corners,
                     zerophase=zerophase)
        
        # Create output filepath with "_processed" suffix
        output_filename = input_path.stem + "_processed" + input_path.suffix
        output_filepath = Path(output_dir) / output_filename
        
        # Save processed stream
        st.write(str(output_filepath), format='MSEED')
        
        print(f"✓ Processed: {filename} ({original_stats})")
        return (True, str(input_filepath), str(output_filepath), None)
        
    except Exception as e:
        error_msg = f"{type(e).__name__}: {str(e)}"
        print(f"✗ Failed: {filename} - {error_msg}")
        return (False, str(input_filepath), None, error_msg)

> **Explainer:**
> 
> This is the core processing function that handles a single miniseed file. It reads the file using ObsPy, then applies four sequential processing steps to each trace:
> 1. linear detrend to remove any long-term linear trends,
> 2. demean to center the data around zero,
> 3. a Hann taper applied to a small percentage (default 5%) of each end to prevent edge artifacts, and
> 4. a Butterworth bandpass filter to isolate frequencies of interest.
>
> The processed data is saved with a "_processed" suffix and the function returns a tuple indicating success/failure along with file paths and any error messages. This function is designed to be called in parallel for multiple files.
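
To see what the taper step does to the ends of a trace, the sketch below applies half a Hann window to each end of a constant signal. It is a simplified illustration and not ObsPy's implementation; the function name `hann_taper` and its `fraction` parameter are assumptions for this example.

```python
import math


def hann_taper(samples, fraction=0.05):
    """Scale `fraction` of the samples at each end by half a Hann window."""
    n = len(samples)
    width = int(n * fraction)  # number of samples tapered at each end
    tapered = list(samples)
    for i in range(width):
        # Weight rises smoothly from 0 at the edge toward 1 in the interior
        weight = 0.5 * (1.0 - math.cos(math.pi * i / width))
        tapered[i] *= weight
        tapered[n - 1 - i] *= weight
    return tapered


flat = [1.0] * 100
out = hann_taper(flat, fraction=0.05)
print([round(v, 3) for v in out[:6]])
```

The first and last samples are scaled to zero and the interior is left untouched, so the filter that follows does not see an abrupt step at the trace edges.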


### Main Parallel Processing Function

In [None]:
def preprocess_miniseed_parallel(input_dir, output_dir="./processed_data",
                                 freqmin=0.1, freqmax=10.0,
                                 taper_percentage=0.05, corners=4,
                                 zerophase=True, max_workers=4):
    """
    Preprocess multiple miniseed files in parallel.
    
    This function orchestrates the processing of all miniseed files in a
    directory. It uses ThreadPoolExecutor to process multiple files at once,
    which can reduce total processing time compared to sequential processing.
    
    Parameters:
    -----------
    input_dir : str
        Directory containing input miniseed files
    output_dir : str, optional
        Directory to save processed files (default: './processed_data')
    freqmin : float, optional
        Minimum frequency for bandpass filter in Hz (default: 0.1)
    freqmax : float, optional
        Maximum frequency for bandpass filter in Hz (default: 10.0)
    taper_percentage : float, optional
        Fraction of the trace to taper at each end, 0-0.5 (default: 0.05 = 5%)
    corners : int, optional
        Number of corners for Butterworth filter (default: 4)
    zerophase : bool, optional
        If True, apply zero-phase filter (default: True)
    max_workers : int, optional
        Maximum number of parallel workers (default: 4)
    
    Returns:
    --------
    dict : Dictionary containing 'successful' and 'failed' processing results
    
    Example:
    --------
    >>> results = preprocess_miniseed_parallel(
    ...     input_dir='./seismic_data',
    ...     output_dir='./processed_data',
    ...     freqmin=0.5,
    ...     freqmax=20.0,
    ...     max_workers=8
    ... )
    >>> print(f"Processed {len(results['successful'])} files successfully")
    """
    
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Get all miniseed files from the input directory
    miniseed_files = get_miniseed_files(input_dir)
    
    if not miniseed_files:
        print(f"No miniseed files found in {input_dir}")
        return {'successful': [], 'failed': []}
    
    # Print processing configuration
    print(f"{'='*70}")
    print(f"Miniseed Preprocessing")
    print(f"{'='*70}")
    print(f"Input directory:  {input_dir}")
    print(f"Output directory: {output_dir}")
    print(f"Files found:      {len(miniseed_files)}")
    print(f"Parallel workers: {max_workers}")
    print(f"\nProcessing parameters:")
    print(f"  - Linear detrend:    enabled")
    print(f"  - Demean:            enabled")
    print(f"  - Taper:             {taper_percentage * 100:g}% (Hann window)")
    print(f"  - Bandpass filter:   {freqmin}-{freqmax} Hz")
    print(f"  - Filter corners:    {corners}")
    print(f"  - Zero-phase:        {zerophase}")
    print(f"{'='*70}\n")
    
    # Initialize results dictionary
    results = {
        'successful': [],
        'failed': []
    }
    
    # Process files in parallel using ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all processing tasks to the executor
        # Each task is a Future object that will complete independently
        future_to_file = {
            executor.submit(
                preprocess_single_file,
                filepath, output_dir, freqmin, freqmax,
                taper_percentage, corners, zerophase
            ): filepath
            for filepath in miniseed_files
        }
        
        # Process results as they complete (not necessarily in submission order)
        for future in as_completed(future_to_file):
            filepath = future_to_file[future]
            try:
                # Get the result from the completed processing task
                success, input_file, output_file, error = future.result()
                
                if success:
                    # Add to successful processing list
                    results['successful'].append({
                        'input_file': input_file,
                        'output_file': output_file
                    })
                else:
                    # Add to failed processing list with error message
                    results['failed'].append({
                        'input_file': input_file,
                        'error': error
                    })
            except Exception as e:
                # Handle any unexpected errors
                error_msg = f"Unexpected error: {str(e)}"
                print(f"✗ Failed: {filepath.name} - {error_msg}")
                results['failed'].append({
                    'input_file': str(filepath),
                    'error': error_msg
                })
    
    # Print summary of processing results
    print(f"\n{'='*70}")
    print(f"Processing Summary:")
    print(f"  ✓ Successful: {len(results['successful'])} files")
    print(f"  ✗ Failed:     {len(results['failed'])} files")
    print(f"{'='*70}")
    
    # List failed files if any
    if results['failed']:
        print(f"\nFailed files:")
        for item in results['failed']:
            print(f"  - {Path(item['input_file']).name}: {item['error']}")
    
    return results

> **Explainer:**
> 
> This is the main orchestration function that manages the parallel processing workflow. It first creates the output directory and finds all miniseed files in the input directory, then displays a detailed configuration summary.
>
> The function uses ThreadPoolExecutor to create a pool of worker threads (sized by max_workers) and submits every file to the pool up front; at most max_workers files are processed at any one time. As each file completes processing (tracked via as_completed()), the results are collected into 'successful' and 'failed' lists.
>
> This approach can substantially reduce processing time. For example, processing 20 files with 4 workers could approach a 4x speedup over sequential processing, although the actual gain depends on how much of the work (such as file I/O and many NumPy routines) releases Python's global interpreter lock.
>
> The function returns a dictionary with detailed results for each file, making it easy to track what succeeded and what failed, and includes comprehensive error reporting.
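
Rather than assuming a particular speedup, it is worth measuring it. The sketch below times a sequential loop against a thread pool using a simulated task; `fake_task` and `timed` are hypothetical helpers, and the `time.sleep` call is only a stand-in for real work. Because sleeping does not hold the interpreter lock, this shows a near-ideal case; real filtering may gain less.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_task(item):
    """Stand-in for processing one file; sleeping releases the GIL."""
    time.sleep(0.05)
    return item


def timed(func):
    """Return (result, elapsed_seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def run_threaded():
    """Run every item through a pool of four worker threads."""
    with ThreadPoolExecutor(max_workers=4) as executor:
        return list(executor.map(fake_task, items))


items = list(range(8))

sequential, t_seq = timed(lambda: [fake_task(i) for i in items])
threaded, t_thr = timed(run_threaded)

print(f"sequential: {t_seq:.2f} s, threaded: {t_thr:.2f} s, "
      f"speedup: {t_seq / t_thr:.1f}x")
```

Swapping `fake_task` for a call that preprocesses one real file gives a measurement that reflects your own data and hardware.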


### Example Usage

In [None]:
# Example 1: Basic usage with default parameters
# -----------------------------------------------
# Process all miniseed files in './seismic_data' with default filter settings

results = preprocess_miniseed_parallel(
    input_dir='./seismic_data',
    output_dir='./processed_data'
)

# Example 2: Custom filter parameters
# ------------------------------------
# Process files with a higher frequency range (0.5-20 Hz) and more workers

results = preprocess_miniseed_parallel(
    input_dir='./seismic_data',
    output_dir='./processed_data',
    freqmin=0.5,        # Higher low-frequency cutoff
    freqmax=20.0,       # Higher high-frequency cutoff
    taper_percentage=0.1,   # 10% taper instead of 5%
    max_workers=8       # Use 8 parallel workers
)

# Example 3: Access and analyze results
# --------------------------------------

print(f"\nSuccessfully processed files:")
for item in results['successful']:
    print(f"  Input:  {item['input_file']}")
    print(f"  Output: {item['output_file']}")
    print()

if results['failed']:
    print(f"\nFailed files (need attention):")
    for item in results['failed']:
        print(f"  File:  {item['input_file']}")
        print(f"  Error: {item['error']}")
        print()

# Example 4: Save results to a CSV file for record-keeping
# ---------------------------------------------------------

import pandas as pd

# Convert results to DataFrames
if results['successful']:
    successful_df = pd.DataFrame(results['successful'])
    successful_df.to_csv('successful_processing.csv', index=False)
    print(f"Saved successful results to: successful_processing.csv")

if results['failed']:
    failed_df = pd.DataFrame(results['failed'])
    failed_df.to_csv('failed_processing.csv', index=False)
    print(f"Saved failed results to: failed_processing.csv")

> **Explainer:**
>
> Example 1 shows the simplest usage with the default parameters. The default 0.1-10 Hz band is suited to teleseismic and regional events.
>
> Example 2 demonstrates how to customize the filter parameters for specific applications. A higher band (0.5-20 Hz) is better suited to local events and crustal studies, and the increased worker count can speed up processing of large datasets.
>
> Example 3 shows how to access and iterate through the results dictionary to examine what was processed and what failed.
>
> Example 4 demonstrates saving results to CSV files for documentation and quality control purposes, which is useful for tracking processing workflows and debugging issues with specific files.
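
For quality control it can also help to group failures by exception type, since the error strings produced above have the form "ExceptionName: message". The sketch below is one way to do that with the standard library; the `summarize` helper and the contents of `example_results` are hypothetical and only mimic the shape of the returned dictionary.

```python
from collections import Counter

# Hypothetical results in the same shape the function returns
example_results = {
    "successful": [
        {"input_file": "a.mseed", "output_file": "a_processed.mseed"},
    ],
    "failed": [
        {"input_file": "b.mseed", "error": "TypeError: bad header"},
        {"input_file": "c.mseed", "error": "ValueError: empty trace"},
        {"input_file": "d.mseed", "error": "ValueError: empty trace"},
    ],
}


def summarize(results):
    """Return the success rate and a count of failures per exception type."""
    n_ok = len(results["successful"])
    n_total = n_ok + len(results["failed"])
    rate = n_ok / n_total if n_total else 0.0
    # Error strings look like "ExceptionName: message"
    by_type = Counter(
        item["error"].split(":", 1)[0] for item in results["failed"]
    )
    return rate, by_type


rate, by_type = summarize(example_results)
print(f"Success rate: {rate:.0%}")
print(dict(by_type))
```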