# **Data Acquisition and Preprocessing**

## **Introduction**  
This notebook is used to **extract** the ML dataset from **pre-processed Earth System Model (ESM) outputs**, perform preprocessing, and save the results to a user-specific path.


# 1. Set Up Workspace and Import Packages

In [1]:
%%capture
!pip install tensorflow
!pip install keras

In [2]:
%%capture
### standard imports ###
import pandas as pd
import xarray as xr
import gcsfs
### standard library imports ###
import os
# silence TensorFlow's C++ INFO/WARNING/ERROR logs; must be set before TensorFlow is imported
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import datetime

import json
import random
pd.set_option('display.max_colwidth',100)

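# move to the parent directory so that the `lib` package can be imported below
# note: re-running this cell moves up one more level each time
cwd = os.getcwd()
parent_dir = os.path.dirname(cwd)
os.chdir(parent_dir)
cwd = parent_dir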
print("Current working directory:", os.getcwd())

# Python file with supporting functions
import lib.residual_utils as supporting_functions

E0000 00:00:1743960831.283977     197 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743960831.290692     197 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1743960831.308977     197 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.

In [3]:
### Setting the date range to unify the date type ###

# Define date range
date_range_start = '2004-01-01T00:00:00.000000000'
date_range_end = '2023-12-31T00:00:00.000000000'

# create monthly date vector; freq='MS' puts each timestamp on the first day of its month
dates = pd.date_range(start=date_range_start,
                      end=date_range_end, freq='MS')


init_date = str(dates[0].year) + format(dates[0].month,'02d')
fin_date = str(dates[-1].year) + format(dates[-1].month,'02d')
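
As a quick sanity check on the cell above, the sketch below rebuilds the same monthly vector and shows what `init_date` and `fin_date` evaluate to. It assumes only pandas, and it uses `strftime` as an equivalent, more compact way to build the `YYYYMM` strings; it is not a replacement for the original cell.

```python
import pandas as pd

# Same range and frequency as the notebook cell: one timestamp per month start.
dates = pd.date_range(start="2004-01-01", end="2023-12-31", freq="MS")

# YYYYMM tags for the first and last month in the range.
init_date = dates[0].strftime("%Y%m")
fin_date = dates[-1].strftime("%Y%m")

print(len(dates))            # 240 (20 years x 12 months)
print(init_date, fin_date)   # 200401 202312
```

Because the end date is 2023-12-31 and the frequency is month-start, the last entry is 2023-12-01, so `fin_date` still refers to December 2023.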

# 2. Data Introduction and Storage

## **Data Sources and Paths**
The data for the machine learning model are derived primarily from **pre-processed Earth System Model (ESM) outputs** with supplemental observational constraints where available. 

The data is stored in a **cloud-based environment**, ensuring efficient access and scalability for the machine learning workflow. Key datasets include:

- **Ensemble directory (`ensemble_dir`)**: Contains the regridded members from the pre-processed Earth System Model (ESM) outputs (produced by notebook 00).

- **Atmospheric xCO₂ Data**: Provides atmospheric CO₂ concentrations from **CMIP6 models**, essential for driving the long-term trends in oceanic pCO₂.

- **SOCAT Data (Mask File)**: Masking file based on **SOCAT pCO₂ observations**, used to sample model data in a way that reflects real-world observational density.

- **Topography and Land-Sea Mask**:
  - **GEBCO Topography File**: Global topographic dataset used to define coastal regions and provide physical constraints for model inputs.
  - **Land-Sea Mask File**: Binary land-sea mask used to restrict predictions to ocean regions only.


## Where is this happening? 

In [4]:

### set up for getting files from leap bucket ###
fs = gcsfs.GCSFileSystem()

### set paths ###

### paths for loading: ###
# directory of regridded members from notebook 00
ensemble_dir = "gs://leap-persistent/abbysh/pco2_all_members_1982-2023/00_regridded_members" # path to regridded data

# directory of reference zarr files
zarr_dir = 'gs://leap-persistent/abbysh/zarr_files_'

# atmospheric xco2 file
xco2_path = f"{zarr_dir}/xco2_cmip6_183501-224912_monthstart.zarr"

# socat data file
socat_path = f"{zarr_dir}/socat_mask_feb1982-dec2023.zarr"

# topo and land-sea masks
topo_path = f"{zarr_dir}/GEBCO_2014_1x1_global.zarr"
lsmask_path = f"{zarr_dir}/lsmask.zarr"

############################################# change this to your username!

your_username = 'Mukkke'

# where machine learning inputs are saved
MLinputs_path = f"gs://leap-persistent/{your_username}/pco2_residual/post01_xgb_inputs"



# 3. Select Earth System Models (ESMs) and Ensemble Members
This notebook uses data from multiple Earth System Models (ESMs) included in the Large Ensemble Testbed (LET). The LET originally provides 100 ensemble members across various ESMs, each representing distinct initial conditions. These ensemble members are essential for capturing internal climate variability and evaluating model uncertainty.

You can customize which members are used by editing `selected_ensembles` (which ESMs populate `mems_dict`) and `num_members` (how many members are sampled from each ESM). This flexibility enables broader analysis while preserving the notebook’s overall structure.
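
One possible way to customize the selection further is to filter members by their CMIP6 variant label. Member directories in this notebook are named like `member_r5i1p1f1`, where `r`, `i`, `p` and `f` are the realization, initialization, physics and forcing indices of the CMIP6 convention. The sketch below is illustrative only: `parse_member` and `filter_members` are hypothetical helpers, not part of `lib.residual_utils`.

```python
import re

# CMIP6 variant label: r<realization>i<initialization>p<physics>f<forcing>
VARIANT_RE = re.compile(r"^member_r(\d+)i(\d+)p(\d+)f(\d+)$")


def parse_member(name):
    """Split a directory name such as 'member_r3i1p2f1' into its indices."""
    match = VARIANT_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected member name: {name!r}")
    r, i, p, f = (int(x) for x in match.groups())
    return {"realization": r, "initialization": i, "physics": p, "forcing": f}


def filter_members(mems_dict, physics=None):
    """Keep only members whose physics index matches (hypothetical helper)."""
    filtered = {}
    for ens, members in mems_dict.items():
        kept = [m for m in members
                if physics is None or parse_member(m)["physics"] == physics]
        if kept:
            filtered[ens] = kept
    return filtered


# Toy input with the same shape as mems_dict
example = {"CanESM5": ["member_r3i1p2f1", "member_r2i1p1f1"]}
print(filter_members(example, physics=1))
```

Filtering before sampling can matter when an ESM mixes physics variants (for example `p1` and `p2`), since a random draw may otherwise combine them.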

In [5]:
mems_dict = dict()

selected_ensembles = ['ACCESS-ESM1-5', 'CanESM5', 'MPI-ESM1-2-LR', 'UKESM1-0-LL']
# ensemble_dir = 'leap-persistent/abbysh/pco2_all_members_1982-2023/00_regridded_members'

# Get all paths
all_paths = fs.ls(ensemble_dir)

# Filter paths containing the selected ensemble names
filtered_paths = [
    path for path in all_paths
    if path.split('/')[-1] in selected_ensembles
]

for ens_path in filtered_paths:
    ens = ens_path.split('/')[-1]
    # each entry under an ESM directory is one ensemble member
    for mem in fs.ls(ens_path):
        mems_dict.setdefault(ens, []).append(mem.split('/')[-1])


In [6]:
random.seed(42)  # Set seed for reproducibility

selected_mems_dict = {}

num_members = 5  # Set the number of ensemble members from each ESM

for ens, members in mems_dict.items():
    if len(members) >= num_members:
        selected_mems_dict[ens] = random.sample(members, num_members)  # Select `num_members` random members
    else:
        selected_mems_dict[ens] = members  # If there are fewer members than `num_members`, select all

print(selected_mems_dict)

{'ACCESS-ESM1-5': ['member_r5i1p1f1', 'member_r1i1p1f1', 'member_r10i1p1f1', 'member_r31i1p1f1', 'member_r2i1p1f1'], 'CanESM5': ['member_r3i1p2f1', 'member_r2i1p1f1', 'member_r1i1p2f1', 'member_r1i1p1f1', 'member_r6i1p2f1'], 'MPI-ESM1-2-LR': ['member_r12i1p1f1', 'member_r11i1p1f1', 'member_r15i1p1f1', 'member_r22i1p1f1', 'member_r23i1p1f1'], 'UKESM1-0-LL': ['member_r8i1p1f2', 'member_r1i1p1f2', 'member_r3i1p1f2', 'member_r4i1p1f2', 'member_r2i1p1f2']}
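
The seeded sampling above can also be written as a small function, which makes the reproducibility easier to check. The sketch below is an assumption-laden variant rather than the notebook's method: it uses a local `random.Random` instance instead of the global seed and sorts each member list first, so it will not necessarily reproduce the exact selection printed above. `select_members` is a hypothetical name.

```python
import random


def select_members(mems_dict, num_members, seed=42):
    """Pick up to `num_members` members per ESM, reproducibly."""
    rng = random.Random(seed)  # local RNG: unaffected by other random calls
    selected = {}
    for ens, members in mems_dict.items():
        pool = sorted(members)  # listing order no longer changes the draw
        if len(pool) >= num_members:
            selected[ens] = rng.sample(pool, num_members)
        else:
            selected[ens] = pool  # fewer members than requested: keep all
    return selected


# Toy input with the same shape as mems_dict (model names are placeholders)
demo = {
    "ModelA": [f"member_r{i}i1p1f1" for i in range(1, 11)],
    "ModelB": ["member_r1i1p1f1", "member_r2i1p1f1"],
}
first = select_members(demo, num_members=5)
second = select_members(demo, num_members=5)
print(first == second)  # True: same seed, same selection
```

Note that the draw for each ESM still depends on the order in which ESMs are visited, because they share one random stream.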


In [23]:
# import gcsfs

# fs = gcsfs.GCSFileSystem()

# MLinputs_path = "gs://leap-persistent/Mukkke/pco2_residua/"

# files_to_delete = fs.ls(MLinputs_path, detail=False)

# for file_path in files_to_delete:
#     fs.rm(file_path, recursive=True)
#     print(f"Deleted: {file_path}")


Deleted: leap-persistent/Mukkke/pco2_residual


## Processing and Saving ESM Data for ML

In this step, we extract the required ML data from ESM datasets stored in **Google Cloud Storage (GCS)**. The extracted data is structured into Pandas DataFrames and saved under your **own username-specific workspace** in GCS. Because the preprocessed data is stored there, it remains available even if you exit and re-enter JupyterHub, so it does not need to be reprocessed for later ML experiments.

### **Important Note: Run This Only Once**

The **data extraction and processing** step needs to be run **only once** per project, unless new data is required. This helps save computational resources and execution time.

For each **ensemble member**, the estimated runtime is:
- **1.32 minutes** for data retrieval and processing.
- **Total estimated time**: **1.32 × (total number of selected members)** min.

These estimates were obtained on a CPU node with **128 GB of memory**. Actual runtimes vary with system load and the members selected, so treat them as a general guideline.
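
As a worked example of the estimate above, the sketch below multiplies the quoted per-member time by the number of selected members. The dictionary is a stand-in with the same shape as `selected_mems_dict` (4 ESMs with 5 members each); the member names are placeholders.

```python
MINUTES_PER_MEMBER = 1.32  # per-member estimate quoted above


def estimated_runtime_minutes(selected):
    """Total estimate = minutes per member x number of selected members."""
    n_members = sum(len(members) for members in selected.values())
    return n_members * MINUTES_PER_MEMBER


# Stand-in for selected_mems_dict: 4 ESMs x 5 members
esms = ("ACCESS-ESM1-5", "CanESM5", "MPI-ESM1-2-LR", "UKESM1-0-LL")
example = {ens: [f"member_{k}" for k in range(5)] for ens in esms}

total = estimated_runtime_minutes(example)
print(f"{total:.1f} min")  # 26.4 min
```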

### Before Running Again:
Before executing the data processing or ML training steps again, **check whether you actually need new data**. If not, skip the redundant computation to save resources. It is also worth checking the storage under your output path periodically and removing files you no longer need.
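
One way to make this check concrete is to skip members whose output file already exists. The sketch below assumes the file-naming pattern visible in the directory listing later in this notebook (`MLinput_<ESM>_<member id>_mon_1x1_<start>_<end>.pkl`); the actual naming is decided inside `save_clean_data`, which is not shown here, so treat the pattern and both helper names as assumptions.

```python
def expected_output_path(ml_inputs_path, ens, member, init_date, fin_date):
    """Build the assumed output path for one ensemble member."""
    prefix = "member_"
    member_id = member[len(prefix):] if member.startswith(prefix) else member
    fname = f"MLinput_{ens}_{member_id}_mon_1x1_{init_date}_{fin_date}.pkl"
    return f"{ml_inputs_path}/{ens}/{member}/{fname}"


def members_still_needed(selected, ml_inputs_path, init_date, fin_date, exists):
    """Return only the members whose output file is missing.

    `exists` is any callable mapping a path to True/False, for example
    the `exists` method of an fsspec/gcsfs filesystem object.
    """
    todo = {}
    for ens, members in selected.items():
        missing = [
            m for m in members
            if not exists(expected_output_path(ml_inputs_path, ens, m,
                                               init_date, fin_date))
        ]
        if missing:
            todo[ens] = missing
    return todo


# Offline demo with a fake "filesystem" (a set of existing paths)
base = "gs://some-bucket/some-user/post01_xgb_inputs"  # placeholder path
selected = {"CanESM5": ["member_r1i1p1f1", "member_r2i1p1f1"]}
already_saved = {
    expected_output_path(base, "CanESM5", "member_r1i1p1f1", "200401", "202312")
}
todo = members_still_needed(selected, base, "200401", "202312",
                            exists=already_saved.__contains__)
print(todo)  # {'CanESM5': ['member_r2i1p1f1']}
```

In the notebook itself, passing `exists=fs.exists` and looping over the returned dictionary instead of `selected_mems_dict` would restrict processing to the missing members.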


In [8]:
### creating pandas dataframes out of "raw" data to prep for ML ###
N_time = len(dates)
member_counter = 0 

### loop through each ESM
for ens, mem_list in selected_mems_dict.items(): 

    ### loop through each member of that ESM
    for member in mem_list:
        
        print(f'making dataframe {member_counter}: {ens,member}')

        ### uses utility function file to make data into dataframe for ML use
        df = supporting_functions.create_inputs(ensemble_dir, ens, member, dates, N_time,
                                  xco2_path,
                                  socat_path,
                                  topo_path,
                                  lsmask_path)

        ### Save the pandas dataframe to workspace
        supporting_functions.save_clean_data(df, MLinputs_path, ens, member, dates)
        member_counter += 1

making dataframe 0: ('ACCESS-ESM1-5', 'member_r5i1p1f1')


TypeError: Unsupported type for store_like: 'FSMap'

In [9]:
for ens, mem_list in selected_mems_dict.items():
    for member in mem_list:
        print(ens, member)
        data_dir = f"{MLinputs_path}/{ens}/{member}"
        files = fs.ls(data_dir)
        for file in files:
            print(file)

ACCESS-ESM1-5 member_r5i1p1f1
leap-persistent/Mukkke/pco2_residual/post01_xgb_inputs/ACCESS-ESM1-5/member_r5i1p1f1/MLinput_ACCESS-ESM1-5_r5i1p1f1_mon_1x1_200401_202312.pkl
ACCESS-ESM1-5 member_r1i1p1f1
leap-persistent/Mukkke/pco2_residual/post01_xgb_inputs/ACCESS-ESM1-5/member_r1i1p1f1/MLinput_ACCESS-ESM1-5_r1i1p1f1_mon_1x1_200401_202312.pkl
ACCESS-ESM1-5 member_r10i1p1f1
leap-persistent/Mukkke/pco2_residual/post01_xgb_inputs/ACCESS-ESM1-5/member_r10i1p1f1/MLinput_ACCESS-ESM1-5_r10i1p1f1_mon_1x1_200401_202312.pkl
ACCESS-ESM1-5 member_r31i1p1f1
leap-persistent/Mukkke/pco2_residual/post01_xgb_inputs/ACCESS-ESM1-5/member_r31i1p1f1/MLinput_ACCESS-ESM1-5_r31i1p1f1_mon_1x1_200401_202312.pkl
ACCESS-ESM1-5 member_r2i1p1f1
leap-persistent/Mukkke/pco2_residual/post01_xgb_inputs/ACCESS-ESM1-5/member_r2i1p1f1/MLinput_ACCESS-ESM1-5_r2i1p1f1_mon_1x1_200401_202312.pkl
CanESM5 member_r3i1p2f1
leap-persistent/Mukkke/pco2_residual/post01_xgb_inputs/CanESM5/member_r3i1p2f1/MLinput_CanESM5_r3i1p2f1_mon_1x