# Handling and Processing Climate Data with PySpark and NetCDF

This notebook shows how to handle and manage large-scale climate data with PySpark, xarray, and netCDF4. It covers setting up the Python environment, selecting data files by attributes such as date range, and processing and transferring large sets of files in parallel with Spark. Each section is documented so that the methods can be replicated and adapted to similar data-intensive tasks.


### Installation of Required Libraries

This section includes the installation commands for Python libraries that are essential for handling and processing various data formats and performing parallel computations. Each library serves a specific purpose:

- `xarray`: Provides labeled, indexed multidimensional arrays and datasets, which suits climate data formats such as netCDF.
- `netCDF4` and `h5netcdf`: Both read and write netCDF files. `netCDF4` wraps the netCDF C library, while `h5netcdf` accesses netCDF4 files through `h5py` (HDF5). `xarray` can use either one as a backend.
- `dask`: Enables parallel and out-of-core computation, which `xarray` can use to work with arrays larger than memory.
- `rioxarray`: Extends `xarray` to include tools for spatial analysis, such as rasterio integration for geospatial operations.
- `tqdm`: Provides a progress bar for loops and other iterative computations, useful for tracking the progress of data processing tasks.


In [0]:
%pip install xarray
%pip install netCDF4 h5netcdf
%pip install dask
%pip install rioxarray
%pip install tqdm

### Restarting Python Environment

The command `dbutils.library.restartPython()` restarts the Python process of a Databricks notebook. Run it after installing or upgrading libraries with `%pip` so that the new versions are the ones loaded. The restart clears all Python state (variables and imports) left by earlier cells, so it should come before the import and processing cells.


In [0]:
dbutils.library.restartPython()
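
After the restart, a quick check that the packages installed above are visible to the interpreter can save debugging time later. The sketch below relies only on the standard library's `importlib.metadata`. The helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as passed to %pip install above
required = ["xarray", "netCDF4", "h5netcdf", "dask", "rioxarray", "tqdm"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Any package reported as missing would need to be installed again before the import cell is run.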

### Importing Necessary Libraries

This code chunk imports the necessary libraries and modules required for data processing and analysis in a PySpark and scientific data context. Each import serves a specific function in the workflow:

- `SparkSession`: Initializes a Spark session, which is the entry point to programming Spark with the Dataset and DataFrame API.
- `os`: Provides a way of using operating system dependent functionality like reading or writing to the filesystem.
- `xarray` (imported as `xr`): Facilitates working with labeled multi-dimensional arrays and datasets, especially useful for manipulating large climate data files like netCDF.
- `datetime`: Used to handle and manipulate date and time data, crucial for time-series analysis.
- `shutil`: Offers high-level file operations such as copying and archiving.
- `pandas` (imported as `pd`): Essential for data manipulation and analysis, particularly useful for handling tabular data with heterogeneously-typed columns.
- `netCDF4` (imported as `nc`): Enables interaction with netCDF files which are commonly used for storing scientific data.
- `lit`: A function from `pyspark.sql.functions` that creates a column holding a constant (literal) value, for example when adding a constant column to a DataFrame.
- `tqdm.auto`: Automatically selects an appropriate progress bar based on the environment (notebook, terminal, etc.), useful for monitoring the progress of data processing loops.


In [0]:
from pyspark.sql import SparkSession
import os
import xarray as xr
from datetime import datetime
import shutil
import pandas as pd
import netCDF4 as nc
from pyspark.sql.functions import lit
from tqdm.auto import tqdm


### Function: `spark_copy_and_move_files_by_date`

This function selects NetCDF files in a source folder by file prefix and date range, writes a copy of each one with additional metadata attributes, and moves that copy into a target folder. The original files in the source folder are left in place. The per-file work is distributed with Apache Spark, which helps when the number of files is large.

#### Parameters:
- **start_date (str):** The start of the date range for filtering files. The date format is specified by `date_pattern`.
- **end_date (str):** The end of the date range.
- **source_folder (str):** The directory containing the source files.
- **target_folder (str):** The destination directory for the processed files.
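- **prefix (str):** Only files whose names start with this prefix and end in `.nc` are considered.
- **date_pattern (str):** The format string for parsing dates in filenames (default is '%Y-%m-%d'). The date is taken from the last three hyphen- or underscore-separated fields of the filename, so the pattern must match that year-month-day layout.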
- **source_file_attr (str):** The attribute name in the NetCDF file for the source file information (default is 'source_file').

#### Returns:
- None. The function prints one status line per processed file. Errors are not caught: an exception raised while processing a file propagates and fails the Spark job.

#### Description:
The function obtains a Spark session, lists the files in `source_folder` that match the prefix, and keeps those whose filename date falls within the date range (both ends inclusive). Each selected file is opened with `xarray` and written to a temporary copy under `/tmp/`. `netCDF4` then adds or updates three global attributes on that copy: `date_updated`, the source filename (under the name given by `source_file_attr`), and `date_modified_in_s3`, which holds the modification time of the source file. Finally, the copy is moved to the target folder. The per-file work runs as Spark tasks, so files are processed in parallel, and the function ends by printing the result for each file.
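
The date-range filter described above can be sketched in isolation. This is a minimal illustration, assuming filenames end in a `YYYY-MM-DD` or `YYYY_MM_DD` date followed by `.nc`. The helper names `extract_file_date` and `in_range` and the example filenames are hypothetical.

```python
from datetime import datetime


def extract_file_date(filename, date_pattern="%Y-%m-%d"):
    """Parse the trailing date of a filename such as 'prefix-1950-01-01.nc'."""
    # Treat underscores as hyphens so '1950_01_01' is handled like '1950-01-01'
    parts = filename.replace("_", "-").split("-")
    # The last three fields are year, month, day; drop the extension from the day
    date_str = "-".join(parts[-3:-1] + [parts[-1].split(".")[0]])
    return datetime.strptime(date_str, date_pattern)


def in_range(filename, start, end, date_pattern="%Y-%m-%d"):
    """True if the filename's date lies in [start, end], both ends inclusive."""
    return start <= extract_file_date(filename, date_pattern) <= end


start = datetime(1950, 1, 1)
end = datetime(1950, 12, 31)
# Hypothetical filenames built from the prefix used later in this notebook
names = [
    "reanalysis-era5-sfc-daily-1950-01-01.nc",
    "reanalysis-era5-sfc-daily-1950_12_31.nc",
    "reanalysis-era5-sfc-daily-1951-01-01.nc",
]
selected = [n for n in names if in_range(n, start, end)]
print(selected)  # the two 1950 files; the 1951 file is outside the range
```

A filename that does not end in a parseable date raises `ValueError` from `strptime`, so in practice the prefix should be chosen to exclude such files.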


In [0]:
def spark_copy_and_move_files_by_date(start_date, end_date, source_folder, target_folder, prefix, date_pattern='%Y-%m-%d', source_file_attr='source_file'):
    """
    Process and move NetCDF files from one folder to another based on a date range and a prefix using Spark.

    Parameters:
    - start_date (str): Start date in the format specified by date_pattern.
    - end_date (str): End date in the format specified by date_pattern.
    - source_folder (str): Path to the source folder containing the files.
    - target_folder (str): Path to the target folder where the files will be moved.
    - prefix (str): Prefix of the file names to consider.
    - date_pattern (str): Date pattern in the filename (default: '%Y-%m-%d').
    - source_file_attr (str): Attribute name for source file in the NetCDF metadata (default: 'source_file').

    Returns:
    - None
    """
    # Initialize Spark session
    spark = SparkSession.builder.appName("ProcessAndMoveFilesByDate").getOrCreate()
    sc = spark.sparkContext

    # Parse dates
    start_date = datetime.strptime(start_date, date_pattern)
    end_date = datetime.strptime(end_date, date_pattern)

    # List all files in the source folder that match the prefix
    all_files = [filename for filename in os.listdir(source_folder) if filename.startswith(prefix) and filename.endswith(".nc")]

    # Initialize list for files within date range
    filepaths_in_range = []

    # For each file in the list, extract date and check if it's in the range
    for filename in all_files:
        # Replace underscores with hyphens so 'YYYY_MM_DD' is handled like 'YYYY-MM-DD'
        parts = filename.replace('_', '-').split('-')
        # The date is assumed to be the last three fields; drop the extension from the day
        date_str = '-'.join(parts[-3:-1] + [parts[-1].split('.')[0]])
        file_date = datetime.strptime(date_str, date_pattern)

        # Check if the file date is within the range
        if start_date <= file_date <= end_date:
            filepath = os.path.join(source_folder, filename)
            filepaths_in_range.append(filepath)

    # Define a function to process and move each NetCDF file
    def process_and_move_file(filepath):
        # Get the filename without the directory
        filename = os.path.basename(filepath)
        temp_file_path = os.path.join('/tmp/', filename)

        # Open the source file with xarray; the context manager closes it afterwards
        with xr.open_dataset(filepath) as ds:
            # Get the date_updated attribute from the dataset (None if not present)
            date_updated = ds.attrs.get('date_updated', None)

            # Save a copy of the dataset to a temporary file in /tmp/
            ds.to_netcdf(temp_file_path)

        # Get the modification time of the original file
        date_modified_in_s3 = datetime.fromtimestamp(os.path.getmtime(filepath)).isoformat()

        # Add date_updated, source file, and date_modified_in_s3 as metadata
        with nc.Dataset(temp_file_path, 'a') as dst:
            dst.setncattr('date_updated', str(date_updated) if date_updated is not None else 'null')
            dst.setncattr(source_file_attr, filename)
            dst.setncattr('date_modified_in_s3', date_modified_in_s3)

        # Move the temporary file to the target directory
        target_file_path = os.path.join(target_folder, filename)
        shutil.move(temp_file_path, target_file_path)
        return f"Processed and moved {filename} to {target_folder}"

    # Distribute the file processing and moving using Spark
    rdd = sc.parallelize(filepaths_in_range)
    results = rdd.map(process_and_move_file).collect()

    # Print results
    for result in results:
        print(result)

    print("File processing and move complete.")
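
As written, an exception raised while processing any single file propagates out of `collect()` and fails the whole job. One possible alternative, sketched below, wraps the per-file function so that it returns a status record instead of raising. The helper `with_status` and the stand-in `fake_process` are hypothetical and exist only for this illustration.

```python
def with_status(func):
    """Wrap func so it returns (item, 'ok', result) or (item, 'error', message)."""
    def wrapper(item):
        try:
            return (item, "ok", func(item))
        except Exception as exc:  # report the failure instead of aborting the job
            return (item, "error", f"{type(exc).__name__}: {exc}")
    return wrapper


# Stand-in for process_and_move_file, used only to demonstrate the wrapper
def fake_process(path):
    if not path.endswith(".nc"):
        raise ValueError("not a NetCDF file")
    return f"Processed {path}"


results = [with_status(fake_process)(p) for p in ["a.nc", "b.txt"]]
for item, status, detail in results:
    print(item, status, detail)
```

Inside the function above, the same idea would presumably take the form `rdd.map(with_status(process_and_move_file)).collect()`, after which the driver can list the failed files separately from the successful ones.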

### Implementation

In [0]:

start_date = '1950-01-01'
end_date = '2023-12-31'
source_folder = '/Volumes/aer-processed/era5/daily_summary'
target_folder = '/Volumes/pilot/bronze_test/era5_daily_summary'
prefix = 'reanalysis-era5-sfc-daily-'
date_pattern = '%Y-%m-%d'
source_file_attr = 'source_file'

spark_copy_and_move_files_by_date(start_date,
                                  end_date,
                                  source_folder,
                                  target_folder,
                                  prefix,
                                  date_pattern,
                                  source_file_attr)
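
After the run, it can be useful to confirm that the expected files arrived in the target folder. The sketch below compares the two folders by filename only; it does not inspect file contents or attributes, and it does not apply the date filter, so source files outside the requested range would also be reported. The helper name `missing_in_target` is hypothetical, and the demonstration uses temporary folders with made-up filenames.

```python
import os
import tempfile


def missing_in_target(source_folder, target_folder, prefix, suffix=".nc"):
    """Return sorted names of matching source files that are absent from target."""
    wanted = {
        name for name in os.listdir(source_folder)
        if name.startswith(prefix) and name.endswith(suffix)
    }
    present = set(os.listdir(target_folder))
    return sorted(wanted - present)


# Demonstration on throwaway folders: two source files, one already copied
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name in ["era5-1950-01-01.nc", "era5-1950-01-02.nc"]:
        open(os.path.join(src, name), "w").close()
    open(os.path.join(dst, "era5-1950-01-01.nc"), "w").close()
    demo_missing = missing_in_target(src, dst, prefix="era5-")
    print(demo_missing)
```

With the variables defined in the cell above, the call would be `missing_in_target(source_folder, target_folder, prefix)`.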

