# Find Run Diff

This assumes you have two folders that you want to check for runs (subfolders). The run folder names are not identical between the two, and one folder may contain a run that is missing from the other. The script takes the two folders plus a regex for each, and uses the regex capture groups to build a shared key that makes the runs comparable. It then writes an output file, with a configurable name, listing the source folder path of each run that exists in the first folder but not the second. Comment lines in the output file start with `#` so they can be skipped when parsing.

This program is intended for finding runs that need to be demultiplexed (e.g. raw NextSeq to demux folder), and for finding runs that have been demultiplexed and need to be processed (e.g. demux to QC pipeline).
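
As a minimal sketch of the shared-key idea, the snippet below builds a key from the capture groups of two regexes and compares them. The folder names, the patterns, and the helper `make_key` are hypothetical and chosen only for illustration; the real patterns come from `SOURCE_REGEX` and `DESTINATION_REGEX` in the config.

```python
import re

# Hypothetical folder names and patterns, for illustration only.
source_name = "230105_NB501234_0042_AHXYZ"
destination_name = "demux_230105_0042"

source_regex = re.compile(r"(\d{6})_[A-Z0-9]+_(\d{4})_.*")
destination_regex = re.compile(r"demux_(\d{6})_(\d{4})")


def make_key(regex, name):
    """Concatenate the capture groups of a match into a comparison key."""
    match = regex.match(name)
    if match is None:
        return None  # the folder name does not look like a run
    return "".join(group or "" for group in match.groups())


source_key = make_key(source_regex, source_name)
destination_key = make_key(destination_regex, destination_name)

# Both folders reduce to the same key, so they are treated as the same run.
print(source_key, destination_key, source_key == destination_key)
```

Only the captured parts (here a date and a run counter) take part in the comparison, so the rest of each folder name can differ freely.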

## Installation/Setup
Please ensure you've followed the instructions in `./README.md` to install the conda environment that this notebook is based on. It also contains the libraries required to run this Jupyter notebook. If you follow the default setup, ensure your Jupyter server points to `./venv/bin/python`.

## Libraries

In [69]:
from dotenv import dotenv_values                                   # Used for loading configs
import os
import re                                                          # Used to get run keys

## Config variables
Variables for running the process. The defaults are loaded first; they are all relative to the project location. Specific settings can be supplied in a second file, which overrides values in the defaults; its path must be passed in the environment variable `CONFIG_PATH`. Finally, individual settings can be overridden with environment variables.

In [70]:
config = {
    **dotenv_values("./notebooks/find_run_diff.default.env"),       # load global default vars
    **dotenv_values(os.getenv("CONFIG_PATH")),                      # load specific vars, path of config is stored in ENV variable CONFIG_PATH
    **os.environ,                                                   # override loaded values with ENV variables
    'PROJECT_PATH': os.getcwd()                                     # set the project path to the current working directory
}
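
The layering above relies on how Python merges dictionaries with `**`: when the same key appears more than once, the later mapping wins. A small sketch with plain dictionaries follows; the values are hypothetical stand-ins for the three config layers, not the project's real settings.

```python
# Hypothetical values standing in for the three config layers.
defaults = {"OUTPUT_DIR": "./output", "OUTPUT_FILE_NAME": "runs.tsv"}
specific = {"OUTPUT_DIR": "/data/output"}
environment = {"OUTPUT_FILE_NAME": "todo.tsv"}

# Later mappings override earlier ones for any shared key.
merged = {**defaults, **specific, **environment}
print(merged)
```

Because `PROJECT_PATH` is listed last in the config cell, none of the earlier layers can override it.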

## Program variables
Inputs:
- OUTPUT_DIR
- SOURCE_DIR
- DESTINATION_DIR
- SOURCE_REGEX
- DESTINATION_REGEX
- OUTPUT_FILE_NAME

Outputs:
- OUTPUT_FILE_PATH

In [71]:
#parameters
# From config
OUTPUT_DIR = config['OUTPUT_DIR']
SOURCE_DIR = config['SOURCE_DIR']
DESTINATION_DIR = config['DESTINATION_DIR']
SOURCE_REGEX = re.compile(config['SOURCE_REGEX'])                       # regex to match source run folders, capture groups are used to make the compare key
DESTINATION_REGEX = re.compile(config['DESTINATION_REGEX'])             # regex to match destination run folders, capture groups are used to make the compare key
OUTPUT_FILE_NAME = config['OUTPUT_FILE_NAME']

# Calculated
OUTPUT_FILE_PATH = os.path.join(OUTPUT_DIR, OUTPUT_FILE_NAME)           # path to output file

`find_runs_in_source_but_not_dest` calls `find_runs`. Because the two folders are assumed to hold similar lists of runs whose names are not directly comparable, `find_runs` returns a dict mapping comparable keys to folder paths. `find_runs_in_source_but_not_dest` then performs a set difference to find the runs that have not been handled and writes the results to a `.tsv` file.

In [72]:
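def find_runs(base_dir, regex):
    # map a key built from the regex capture groups to each matching folder path
    # (assumes the regex defines at least one capture group)
    runs = {}
    for root, dirs, files in os.walk(base_dir):
        for name in dirs:
            match = regex.match(name)                              # match once and reuse the result
            if match:
                # join the capture groups, treating groups that did not participate as empty
                key = "".join(group or "" for group in match.groups())
                runs[key] = os.path.join(root, name)
    return runs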

def find_runs_in_source_but_not_dest(source_dir, source_regex, destination_dir, destination_regex, output_file):
    # use the function arguments rather than the module-level globals
    source_runs = find_runs(source_dir, source_regex)
    destination_runs = find_runs(destination_dir, destination_regex)

    target_runs = set(source_runs.keys()) - set(destination_runs.keys())
    target_runs = sorted(target_runs)

    with open(output_file, 'w') as f:
        f.write('#paths\n')                                         # header line, marked as a comment for parsers
        for run in target_runs:
            f.write(f"{source_runs[run]}\n")
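
To illustrate the set logic on a small case, here is a self-contained sketch that builds a throwaway folder layout and finds the run missing from the destination. The folder names, the patterns, and the helper `collect_runs` (a stand-in with the same idea as `find_runs`) are assumptions made for the demo.

```python
import os
import re
import tempfile


def collect_runs(base_dir, regex):
    """Map capture-group keys to folder paths (same idea as `find_runs`)."""
    runs = {}
    for root, dirs, _files in os.walk(base_dir):
        for name in dirs:
            match = regex.match(name)
            if match:
                key = "".join(group or "" for group in match.groups())
                runs[key] = os.path.join(root, name)
    return runs


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: two raw runs, only one of which has been demultiplexed.
    for folder in ("raw/run_001_raw", "raw/run_002_raw", "demux/run_001_demux"):
        os.makedirs(os.path.join(tmp, folder))

    source_runs = collect_runs(os.path.join(tmp, "raw"), re.compile(r"run_(\d+)_raw"))
    destination_runs = collect_runs(os.path.join(tmp, "demux"), re.compile(r"run_(\d+)_demux"))

    # Keys present in the source but missing from the destination.
    missing_keys = sorted(set(source_runs) - set(destination_runs))
    missing_names = [os.path.basename(source_runs[key]) for key in missing_keys]

print(missing_keys, missing_names)
```

Sorting the keys keeps the output order stable between runs, which makes the resulting file easier to diff.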

Run `find_runs_in_source_but_not_dest` with the paths provided in the config. The assertion afterwards checks that the output file was created.

In [73]:
find_runs_in_source_but_not_dest(SOURCE_DIR, SOURCE_REGEX, DESTINATION_DIR, DESTINATION_REGEX, OUTPUT_FILE_PATH)
assert os.path.exists(OUTPUT_FILE_PATH), f"Output file was not created: {OUTPUT_FILE_PATH}"
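
A downstream step can read the output file by skipping the `#` comment lines. The sketch below is one possible reader, not part of the notebook itself; the helper `read_run_paths` and the sample file contents are hypothetical.

```python
import os
import tempfile


def read_run_paths(path):
    """Return the non-empty lines of the file that are not `#` comments."""
    with open(path) as handle:
        stripped = (line.strip() for line in handle)
        return [line for line in stripped if line and not line.startswith("#")]


# Hypothetical output file contents, written to a temporary location for the demo.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "runs.tsv")
    with open(demo_path, "w") as handle:
        handle.write("#paths\n/data/raw/run_002\n/data/raw/run_005\n")
    run_paths = read_run_paths(demo_path)

print(run_paths)
```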