# MIDRC Nextflow CPU Batch Demo

### Define Workflow

Define the processes of the workflow. This workflow runs two processes in batch. The first process converts the DICOM files to PNG files using a containerized script. The second process extracts metadata from the DICOM files and writes it to a CSV file using a second containerized script. Before running this workflow, run the 'midrc_download_dcm_conda.ipynb' notebook to download DICOM files to your local workspace environment.

In [None]:
%%writefile main.nf
#!/usr/bin/env nextflow

/* pipeline input parameters, update this to your data dir */
dicom_data = "$baseDir/sdk_data/53/dabffdcc40fbffe39d78e7a926e655/*.dcm" // path to your downloaded DICOM files
project_dir = projectDir

process dicom_to_png {
    
    label 'dcm2png'
    
    input:
    path dicom_files
    
    output:
    stdout emit: dicom_to_png_log
    path('*.png'), emit: png_files
    
    script:
    """
    python3 /utils/dicom_to_png.py $dicom_files
    """
}


process extract_metadata {
    
    label 'ext_metadata'
    
    input:
    path dicom_files
    
    output:
    stdout emit: extract_metadata_log
    path('*.csv'), emit: csv_files
    
    script:
    """
    python3 /utils/extract_metadata.py $dicom_files
    """
}

// Define the entry workflow (initial workflow for Nextflow to run)
workflow {
   
    dicom_files = Channel.fromPath(dicom_data)
    dicom_to_png(dicom_files)
    extract_metadata(dicom_files)
}
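
Because the workflow only has work to do if the `dicom_data` glob matches downloaded files, it can help to check the pattern from the notebook before launching. The following is a minimal sketch, not part of the original demo: the helper name `find_dicom_files` and the example pattern are assumptions, and the pattern should be replaced with the directory used in `main.nf`.

```python
import glob
import os


def find_dicom_files(pattern):
    """Return the sorted list of regular files matching a glob pattern."""
    return sorted(
        path for path in glob.glob(pattern, recursive=True) if os.path.isfile(path)
    )


# Hypothetical pattern; point it at the folder referenced by dicom_data in main.nf.
pattern = "sdk_data/**/*.dcm"
matches = find_dicom_files(pattern)
print(f"{len(matches)} DICOM file(s) match {pattern!r}")
```

If the count is zero, the download notebook has probably not been run yet or the pattern points at the wrong directory.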

### Define Workflow Containers And Resources

Define the containers and compute resources used in the workflow. Each process in the workflow needs its own defined container.

In [None]:
%%writefile nextflow.config

process {
    withLabel: dcm2png {
        executor = 'awsbatch'
        queue = 'placeholder'
        container = 'placeholder'
    } 
}

process {
    withLabel: ext_metadata {
        executor = 'awsbatch'
        queue = 'placeholder'
        container = 'placeholder'
    } 
}

aws {
    region = 'us-east-1'
    batch {
        cliPath = '/home/ec2-user/miniconda/bin/aws'
        jobRole = 'placeholder'
    }
}
workDir = 'placeholder'


docker.enabled = true
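
The configuration above ships with several `'placeholder'` values that must be replaced before the workflow can run. As a rough sketch, a small check like the one below could list the settings that are still unfilled. The helper name `find_placeholders` is hypothetical, and the regular expression only covers the simple `name = 'placeholder'` form used in this notebook.

```python
import re


def find_placeholders(config_text):
    """Return the names of settings whose value is still the literal 'placeholder'."""
    pattern = re.compile(r"^[ \t]*(\w+)[ \t]*=[ \t]*'placeholder'[ \t]*$", re.MULTILINE)
    return pattern.findall(config_text)


# A trimmed copy of the config from this notebook, used only for illustration.
sample = """
process {
    withLabel: dcm2png {
        executor = 'awsbatch'
        queue = 'placeholder'
        container = 'placeholder'
    }
}
workDir = 'placeholder'
"""

print(find_placeholders(sample))
```

To check the real file, pass the contents of `nextflow.config` (for example, read with `open("nextflow.config").read()`) instead of the sample string.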

### Run Workflow

In [None]:
!nextflow run main.nf -dsl2

### Gather Results
- Gather the converted .png files
- Pull down the metadata file for each DICOM file and merge the metadata into a single file

In [None]:
!pip install -q awscli

In [None]:
import os
import pandas as pd

In [None]:
# Get the AWS S3 endpoint (work directory) of each completed batch task from the Nextflow log.
# Nextflow runs each process once per input file, so two processes over 5 DICOM files yield 10 endpoints.
end_points = []
with open(".nextflow.log", 'r') as f:
    for line in f:
        if "COMPLETED" in line:
            end_points.append(line.split(' ')[-1][:-2])
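
The slicing in the cell above takes the last space-separated token of each `COMPLETED` line and drops its final two characters, which implies the token ends in `]` followed by a newline. The sketch below makes that step explicit. The helper name `extract_work_dir` is hypothetical. The sample line is also an assumption, modeled on what the slicing implies, not copied from a real log.

```python
def extract_work_dir(log_line):
    """Return the last space-separated token of a log line without the trailing ']'."""
    token = log_line.rstrip().split(" ")[-1]
    return token.rstrip("]")


# Hypothetical log line; the bucket name and path are placeholders.
sample_line = (
    "TaskHandler[id: 1; name: dicom_to_png (1); status: COMPLETED; "
    "exit: 0; error: -; workDir: s3://example-bucket/work/ab/123456]\n"
)

print(extract_work_dir(sample_line))
```

Stripping the bracket by character instead of by position keeps the result correct even when the last line of the log has no trailing newline.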

In [None]:
# Download the results of each batch task into a local 'results' folder.
for end_point in end_points:
    command = f'aws s3 cp {end_point}/ ./results/ --recursive --exclude "*" --include "*" --quiet'
    os.system(command)
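
An alternative to building a shell string is to pass the command to `subprocess.run` as an argument list, which avoids shell quoting problems and raises an error when a copy fails. This is a sketch under stated assumptions: the helper names `build_s3_copy_command` and `download_results` are hypothetical, and the redundant `--exclude "*" --include "*"` pair is omitted because together the two filters select every file.

```python
import subprocess


def build_s3_copy_command(end_point, dest="./results/"):
    """Build the argument list for a recursive, quiet `aws s3 cp` of one work directory."""
    source = end_point.rstrip("/") + "/"
    return ["aws", "s3", "cp", source, dest, "--recursive", "--quiet"]


def download_results(end_points, dest="./results/"):
    """Copy every work directory to dest; raises CalledProcessError if a copy fails."""
    for end_point in end_points:
        subprocess.run(build_s3_copy_command(end_point, dest), check=True)


# Hypothetical endpoint, shown only to illustrate the generated command.
print(build_s3_copy_command("s3://example-bucket/work/ab/123456"))
```

Calling `download_results(end_points)` would then replace the loop above.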

In [None]:
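# Combine the metadata CSVs from each batch task and write the merged metadata to a local csv file.
files = os.listdir('results/')
frames = []
for file in files:
    if file.endswith('.csv'):
        # The label is the text after the last underscore, up to the first dot that follows.
        label = file.split('_')[-1].split('.')[0]
        temp_df = pd.read_csv(os.path.join('results', file))
        # Drop the saved index column if it is present.
        temp_df = temp_df.drop(columns='Unnamed: 0', errors='ignore')
        temp_df['Label'] = label
        frames.append(temp_df)

# Concatenate once at the end; fall back to an empty frame if no CSV files were found.
results_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
results_df.to_csv('midrc_batch_dicom_metadata.csv', index=False)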
results_df
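
The `Label` column is derived from each file name by taking the text after the last underscore and cutting it at the first dot. The sketch below isolates that rule so its behavior is easy to inspect. The helper name `label_from_filename` and all file names are hypothetical, since the notebook does not show what the metadata script names its output.

```python
def label_from_filename(filename):
    """Return the text after the last underscore, cut at the first dot that follows."""
    return filename.split("_")[-1].split(".")[0]


# Hypothetical file names, chosen to show how the rule behaves.
for name in ["metadata_case001.csv", "metadata.csv", "scan_1.2.840.csv"]:
    print(name, "->", label_from_filename(name))
```

The last example shows a limitation: if the part after the underscore contains dots itself, as DICOM UIDs do, only the text before the first dot is kept.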