# Create files for `globus transfer --batch` on daily data

This notebook generates the `batch_files/batch*day*.txt` files, which list the files to transfer to the Arctic Climate Data Node (ACDN). Currently, this is done from the LLNL node only.

In [1]:
from pathlib import Path

import pandas as pd
import luts
# config is expected to provide llnl_prefix and acdn_prefix
from config import *

Read in the main table of desired filenames at the daily temporal resolution:

In [2]:
df = pd.read_csv("llnl_esgf_day_filenames.csv", converters={"filenames": lambda x: x.strip("[]").split(", ")})
# ignore rows where data not on LLNL node for now
df = df.query("~n_files.isnull()")
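As a rough illustration of what the `converters` argument does in the cell above, the sketch below parses a small in-memory CSV whose `filenames` column holds a stringified list. The table contents and file names are made up for the example, and `parse_list` is just a named version of the lambda. Note that the quotes around each file name survive the split, which is why they are stripped later when the paths are built.

```python
import io

import pandas as pd

# Hypothetical two-row table; the real CSV has more columns.
csv_text = """variable,n_files,filenames
tas,2,"['tas_day_a.nc', 'tas_day_b.nc']"
pr,,"[]"
"""


def parse_list(x):
    # Same logic as the lambda above: drop the brackets, split on ", "
    return x.strip("[]").split(", ")


demo = pd.read_csv(io.StringIO(csv_text), converters={"filenames": parse_list})

# Keep only rows that have a file count, mirroring the n_files filter above
demo = demo[demo["n_files"].notnull()]

first = demo.iloc[0]["filenames"]
# Each element still carries its single quotes at this point
cleaned = [fn.replace("'", "") for fn in first]
print(first)
print(cleaned)
```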

Define a function to convert rows of that table into tuples of (\<remote path>, \<ACDN path>):

In [3]:
def generate_transfer_paths(row):
    """Generate the paths for transferring between LLNL ESGF node and ACDN
    
    Args:
        row (pandas.core.series.Series): a single row series from pandas.DataFrame.iterrows() on dataframe of desired data filenames
    
    Returns:
        transfer_list (list): has format [(<remote path>, <target path>), ...] for all files in row["filenames"]
    """
    activity = "CMIP" if row["scenario"] == "historical" else "ScenarioMIP"
    model = row["model"]
    institution = luts.model_inst_lu[model]["institution"]
    group_path = Path().joinpath(
        activity,
        institution,
        model,
        row["scenario"],
        row["mirror_variant"],
        "day",
        row["variable"],
        row["grid_type"],
        row["version"],
    )
    
    transfer_list = []
    for fn in row["filenames"]:
        fp = group_path.joinpath(fn.replace("'", ""))
        transfer_list.append((llnl_prefix.joinpath(fp), acdn_prefix.joinpath(fp)))
        
    return transfer_list
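To make the resulting path layout concrete, here is a minimal standalone sketch of the same joining logic. Every value in it is a hypothetical placeholder: the two `*_demo` prefixes stand in for `llnl_prefix` and `acdn_prefix` from `config`, and the row values and file name are invented for illustration. `PurePosixPath` is used only so the output looks the same on any operating system.

```python
from pathlib import PurePosixPath

# Hypothetical prefixes; the real ones come from config.
llnl_prefix_demo = PurePosixPath("/remote/CMIP6")
acdn_prefix_demo = PurePosixPath("/local/CMIP6")

# Hypothetical row values, for illustration only.
row = {
    "scenario": "ssp585",
    "model": "CESM2",
    "mirror_variant": "r11i1p1f1",
    "variable": "tas",
    "grid_type": "gn",
    "version": "v20200528",
    "filenames": ["'tas_day_demo.nc'"],
}
institution = "NCAR"  # stands in for the luts.model_inst_lu lookup

activity = "CMIP" if row["scenario"] == "historical" else "ScenarioMIP"
group_path = PurePosixPath(
    activity,
    institution,
    row["model"],
    row["scenario"],
    row["mirror_variant"],
    "day",
    row["variable"],
    row["grid_type"],
    row["version"],
)

# Strip the leftover quotes, then attach the same relative path to each prefix
fp = group_path.joinpath(row["filenames"][0].replace("'", ""))
src = llnl_prefix_demo.joinpath(fp)
dst = acdn_prefix_demo.joinpath(fp)
print(src)
print(dst)
```

Because both paths share the same relative part, the directory structure on the source is mirrored under the target prefix.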

Iterate over variables to create a batch file for each. We are not sure what a reasonable batch size is, so for now we arbitrarily create one batch per variable. Fewer batches are probably better, if possible.

First, define a function to actually write a list of transfer file tuples to a text file for the `--batch` argument.

In [10]:
def write_batch_file(varname, transfer_paths):
    """Write the batch file for a particular variable.

    Each line has the form "<remote path> <target path>".
    """
    batch_file = f"batch_files/batch_llnl_day_{varname}.txt"
    with open(batch_file, "w") as f:
        for paths in transfer_paths:
            f.write(f"{paths[0]} {paths[1]}\n")
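The layout of the batch file can be checked with a small standalone sketch. It writes two made-up path pairs to a temporary directory, using the same one-pair-per-line format as the function above, and reads them back. The paths are hypothetical placeholders, not real locations on either endpoint.

```python
import tempfile
from pathlib import Path

# Hypothetical (source, destination) pairs, for illustration only.
pairs = [
    ("/remote/a/tas_1.nc", "/local/a/tas_1.nc"),
    ("/remote/a/tas_2.nc", "/local/a/tas_2.nc"),
]

with tempfile.TemporaryDirectory() as tmp:
    batch_file = Path(tmp) / "batch_llnl_day_tas.txt"
    # One "<source> <destination>" pair per line
    with open(batch_file, "w") as f:
        for src, dst in pairs:
            f.write(f"{src} {dst}\n")
    lines = batch_file.read_text().splitlines()

for line in lines:
    print(line)
```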

Now, create the files:

In [12]:
# ESGF directory structure convention is /<activity>/<institution>/<model>/<scenario>/<variant>/<frequency>/<variable>/<grid type>/<version>/

# get all potential variable names
varnames = list(luts.vars_tier1.keys()) + list(luts.vars_tier2.keys())

for varname in varnames:
    transfer_paths = []
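    # one batch per variable, across all scenarios
    # (the original filter also used `scenario`, which is not defined in this notebook)
    query_str = f"variable == '{varname}'"
    for _, row in df.query(query_str).iterrows():
        transfer_paths.extend(generate_transfer_paths(row))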

    write_batch_file(varname, transfer_paths)

The batch files should now be in the `batch_files/` folder.