# Demux Illumina

This notebook assumes you start with a TSV file of sample keys and file paths, as output by [find_run_diff](https://github.com/ssi-dk/calc-script-find_run_diff); the file can also be written by hand. The file is parsed for the run paths to demultiplex. The demultiplexing output path should already be known, since it is also used when running find_run_diff. The notebook then goes through each raw run folder and creates the corresponding demultiplexed folder.

The demux process takes standard NextSeq, MiSeq or NovaSeq run folders and turns them into paired-end FASTQ files.

One nonstandard convention: sequencing data is organized into year folders. The year is taken from the first 2 digits of the run folder name and determines the output location.
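
As a minimal sketch of that convention, the snippet below derives the year folder from a run path. The function name `year_output_dir`, the run folder name and the `/data/demux` output directory are hypothetical, and the experiment-name handling done later in this notebook is left out.

```python
import os

def year_output_dir(run_path, output_dir):
    """Return the year subfolder of output_dir for a run folder path."""
    # Strip trailing slashes so basename returns the run folder name.
    run_name = os.path.basename(run_path.rstrip("/"))
    prefix = run_name[:2]
    if not prefix.isdigit():
        raise ValueError(f"Run name does not start with 2 digits: {run_name!r}")
    # Same assumption as the notebook: all runs are from the 2000s.
    return os.path.join(output_dir, "20" + prefix)

# Hypothetical run folder; the "23" prefix selects the 2023 folder under /data/demux.
print(year_output_dir("/raw/230415_NB000000_0001_AHXXXXXXXX/", "/data/demux"))
```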

## Installation/Setup
Please ensure you've followed the instructions in `./README.md` to install the conda environment this notebook is based on. It also contains the libraries required to run this Jupyter notebook. If you follow the default setup, ensure your Jupyter server is pointing to `./venv/bin/python`.

### Additional program requirements
- bcl2fastq (not handled by conda)

## Libraries

In [53]:
from dotenv import dotenv_values                                   # Used for loading configs
import os
import re                                                          # Used to get run keys
import subprocess

## Config variables
Variables for running the process. The defaults are loaded first; they are all relative to the project location. Specific settings can be supplied in a second file, whose values overwrite the defaults; its path must be passed in the environment variable `CONFIG_PATH`. Finally, individual settings can be overwritten with environment variables.

In [52]:
config = {
    **dotenv_values("./notebooks/demux_illumina.default.env"),       # load global default vars
    **dotenv_values(os.getenv("CONFIG_PATH")),                      # load specific vars, path of config is stored in ENV variable CONFIG_PATH
    **os.environ,                                                   # override loaded values with ENV variables
    'PROJECT_PATH': os.getcwd()                                     # set the project path relative to notebook
}
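
The precedence comes from dictionary unpacking: when the same key appears more than once, the value from the later source wins. A minimal sketch with plain dictionaries standing in for the three sources (all values below are made up for illustration):

```python
# Stand-ins for the three config sources; the values are hypothetical.
defaults = {"THREADS": "4", "OUTPUT_DIR": "./output"}   # default env file
specific = {"THREADS": "16"}                             # file named by CONFIG_PATH
environment = {"OUTPUT_DIR": "/data/demux"}              # os.environ

# Later sources overwrite earlier ones, key by key.
config_example = {**defaults, **specific, **environment}
print(config_example)
# {'THREADS': '16', 'OUTPUT_DIR': '/data/demux'}
```

Because `PROJECT_PATH` is listed last in the real `config`, it always takes the current working directory and cannot be overridden by either file or by the environment.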

## Program variables
Inputs:
- RUNS_TO_DEMUX_FILE_PATH
- EXPERMINENT_NAME_INSERTION_SITE
- SAMPLESHEET_FILE_NAME
- THREADS
- MEMORY

Outputs:
- OUTPUT_DIR

In [None]:
#parameters
# From config
# Inputs
RUNS_TO_DEMUX_FILE_PATH=config["RUNS_TO_DEMUX_FILE_PATH"]
EXPERMINENT_NAME_INSERTION_SITE=re.compile(config["EXPERMINENT_NAME_INSERTION_SITE"])
SAMPLESHEET_FILE_NAME=config["SAMPLESHEET_FILE_NAME"]
THREADS=config["THREADS"]
MEMORY=config["MEMORY"]
# Outputs
OUTPUT_DIR=config["OUTPUT_DIR"]

In [None]:
def get_experiment_name(sample_sheet_path):
    with open(sample_sheet_path, 'r') as f:
        file_text = f.read()
    match = re.search(r"Experiment Name,(?P<experiment_name>.+)\n", file_text)
    # re.search returns None when there is no match
    if match is None:
        raise Exception("Could not find experiment name in sample sheet")
    return match.group("experiment_name")

def get_year(run_path):
    # strip trailing slashes if present, then get the last part of the path
    run_name = os.path.basename(run_path.rstrip("/"))
    # get the year from the first 2 digits of the path
    year = run_name[:2]
    # check the first 2 digits of the path are actually digits
    if not year.isdigit():
        raise Exception("Could not find year in run name")
    # make the 2 digit year into a 4 digit year
    # FIXME: This will fail when the year is >2099
    year = "20" + year
    return year

def get_output_folder_path(run_path, experiment_name_insertion_site, samplesheet_file_name, output_dir):
    year = get_year(run_path)
    experiment_name = get_experiment_name(os.path.join(run_path, samplesheet_file_name))
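    # strip trailing slashes so basename returns the run folder name rather than ""
    run_name = os.path.basename(run_path.rstrip("/"))
    match = re.search(experiment_name_insertion_site, run_name)
    # the pattern must match the run name and capture exactly 2 groups
    if match is None or match.lastindex != 2:
        raise Exception("EXPERMINENT_NAME_INSERTION_SITE did not match the run name with 2 groups")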
    pre_experiment_name = match.group(1)
    post_experiment_name = match.group(2)
    output_folder = os.path.join(output_dir, year, f"{pre_experiment_name}_{experiment_name}{post_experiment_name}")
    return output_folder

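def demultiplex(run_path, output_folder, threads, memory):
    # Note `memory` is accepted but not currently used in the command, and the sample sheet
    # is always read from SampleSheet.csv in the run folder, regardless of SAMPLESHEET_FILE_NAME.
    # Note the bcl2fastq command will create output folders if they don't exist. This includes parent directories.
    command = f"""
        bcl2fastq --no-lane-splitting -r {threads} -p {threads} -w {threads} -R {run_path} -o {output_folder} --sample-sheet {run_path}/SampleSheet.csv;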
        cp -r {run_path}/InterOp {output_folder}/InterOp;
        cp {run_path}/*.xml {output_folder}/;
        cp {run_path}/SampleSheet.csv {output_folder}/SampleSheet.csv;
        cp {run_path}/run_metadata.xlsx {output_folder}/run_metadata.xlsx;
        """
    # The command is returned instead of being run here with subprocess; it is executed later through the notebook's `!` shell escape.
    return command
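
To illustrate how `get_output_folder_path` uses the insertion-site pattern, here is a minimal sketch. The pattern, run name and experiment name are all hypothetical; the real pattern comes from `EXPERMINENT_NAME_INSERTION_SITE` in the config and only needs to capture two groups, the text before and after the insertion point.

```python
import re

def insert_experiment_name(run_name, experiment_name, insertion_site):
    """Place experiment_name between the two groups captured by insertion_site."""
    match = insertion_site.search(run_name)
    if match is None or match.lastindex != 2:
        raise ValueError("pattern must match the run name with exactly 2 groups")
    pre, post = match.group(1), match.group(2)
    # Same layout as the notebook: <pre>_<experiment name><post>
    return f"{pre}_{experiment_name}{post}"

# Hypothetical pattern: group 1 is the 6 digit date, group 2 is the rest of the name.
pattern = re.compile(r"^(\d{6})(_.+)$")
print(insert_experiment_name("230415_NB000000_0001_AHXXXXXXXX", "EXP42", pattern))
# 230415_EXP42_NB000000_0001_AHXXXXXXXX
```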

In [None]:
def generate_demux_illumina_commands(runs_to_demux_file_path, experiment_name_insertion_site, samplesheet_file_name, threads, memory, output_dir):
    commands = []
    with open(runs_to_demux_file_path, 'r') as f:
        # ignore blank lines and lines starting with #
        runs = [line.strip() for line in f if line.strip() and not line.startswith("#")]
    for run in runs:
        output_folder = get_output_folder_path(run, experiment_name_insertion_site, samplesheet_file_name, output_dir)
        # Notification here
        commands.append(demultiplex(run, output_folder, threads, memory))
    return commands
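
As the function reads it, the runs file holds one run path per line, with `#` marking comment lines. The sketch below shows that parsing on an in-memory file; the helper name `read_run_paths` and the paths are hypothetical, and it also skips blank lines.

```python
import io

def read_run_paths(handle):
    """Return the run paths in a file-like object, skipping comments and blank lines."""
    runs = []
    for line in handle:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        runs.append(stripped)
    return runs

# Hypothetical file contents with a comment line and a blank line.
example_file = io.StringIO(
    "# runs to demultiplex\n"
    "/raw/230415_NB000000_0001_AHXXXXXXXX\n"
    "\n"
    "/raw/230416_M00000_0002_000000000-XXXXX\n"
)
print(read_run_paths(example_file))
```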

`generate_demux_illumina_commands` returns a list of demux commands, which are then run in the terminal. The `!` prefix (a Jupyter feature) sends each command to the shell.

In [None]:
commands = generate_demux_illumina_commands(RUNS_TO_DEMUX_FILE_PATH, EXPERMINENT_NAME_INSERTION_SITE, SAMPLESHEET_FILE_NAME, THREADS, MEMORY, OUTPUT_DIR)
for command in commands:
    print(command)
    !{command}
    # TODO Add a notification here and ability to disable for development
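
Outside a notebook, where `!` is not available, the same loop could be written with `subprocess`. This is a sketch rather than part of the pipeline: the helper name `run_commands` and its `dry_run` flag are hypothetical, with the flag being one possible way to disable execution during development.

```python
import subprocess

def run_commands(commands, dry_run=True):
    """Print each shell command and, unless dry_run is set, execute it.

    Returns one entry per command: the exit code, or None for a dry run.
    """
    exit_codes = []
    for command in commands:
        print(command)
        if dry_run:
            exit_codes.append(None)
            continue
        # shell=True because each command is a multi-statement shell string.
        completed = subprocess.run(command, shell=True, check=False)
        exit_codes.append(completed.returncode)
    return exit_codes

# With dry_run=True the commands are only printed, never executed.
run_commands(["bcl2fastq --version"], dry_run=True)
```

Since the commands are built by string interpolation and passed to a shell, run paths containing spaces or shell metacharacters would need quoting (for example with `shlex.quote`) before this is safe to use on arbitrary input.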