## Creation of a SoS workflow from interactive analysis

## Basic Syntax

### Script format of function calls

In [None]:
_input = 'test.pdf'

In [1]:
R(f'''
pdf('{_input}')
plot(0, 0)
dev.off()
''', workdir='result')

null device 
          1 


This function call is equivalent to the following script format:

In [None]:
R: expand=True, workdir='result'
    pdf('{_input}')
    plot(0, 0)
    dev.off()  

Or, with a different sigil:

In [None]:
R: expand='${ }', workdir='result'
    pdf('${_input}')
    plot(0, 0)
    dev.off()  
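
The equivalence above rests on string interpolation: with `expand=True`, expressions inside `{ }` are replaced by their Python values before the script is passed to R, much as in an f-string. The plain-Python sketch below mimics that step for both sigils. The helper `expand_with_sigil` is hypothetical and written only for illustration; it is not how SoS implements `expand`, and it substitutes simple variable names only.

```python
import re

_input = 'test.pdf'

# Default sigil: behaves like an ordinary f-string.
script_default = f'''
pdf('{_input}')
plot(0, 0)
dev.off()
'''

def expand_with_sigil(template, variables, left='${', right='}'):
    """Replace left + name + right with str(variables[name]).

    Hypothetical helper for illustration only; it handles plain variable
    names, not arbitrary expressions or format specifiers.
    """
    pattern = re.escape(left) + r'\s*(\w+)\s*' + re.escape(right)
    return re.sub(pattern, lambda m: str(variables[m.group(1)]), template)

script_custom = expand_with_sigil('''
pdf('${_input}')
plot(0, 0)
dev.off()
''', {'_input': _input})

# Both routes yield the same R script text.
print(script_default == script_custom)
```

A different sigil is mainly useful when the script itself contains braces, as R function bodies do.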

In [None]:
[RNASeq_20 (QC)]

parameter: fastq_files = list

input:   fastq_files, group_by=1
depends: executable('fastqc')
output:  f'{_input:bn}_fastqc.html'

print(f'Processing {_input}')

task: walltime='30m'

sh: expand=True
    fastqc {_input}
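
The `output:` line relies on the `:bn` format specifier, which reduces a path to its base name without the final extension, and `group_by=1` makes the step handle one input file at a time. A minimal plain-Python sketch of the resulting names follows. It assumes two hypothetical FASTQ file names and the `_fastqc.html` suffix that FastQC gives its HTML reports; `report_name` is an illustrative helper, not part of SoS.

```python
from pathlib import Path

# Hypothetical input files, standing in for the fastq_files parameter.
fastq_files = ['data/sample1.fastq', 'data/sample2.fastq']

def report_name(input_file):
    """Mimic f'{_input:bn}_fastqc.html': base name, last extension removed."""
    return f'{Path(input_file).stem}_fastqc.html'

# group_by=1 corresponds to handling the inputs one at a time.
for _input in fastq_files:
    print(f'Processing {_input} -> {report_name(_input)}')
```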

### Interactive data analysis

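Interactive data analysis can be performed in cells that use different kernels, as shown below. Because SoS is an extension of Python 3, SoS cells accept arbitrary Python statements. In cells that run in other kernels, the `%expand` magic interpolates SoS variables into the cell content before it is executed.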

In [1]:
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

In [2]:
%expand
xlsx2csv {excel_file} > {csv_file}

In [3]:
%expand
data <- read.csv('{csv_file}')
pdf('{figure_file}')
plot(data$log2FoldChange, data$stat)
dev.off()

### Convert to SoS actions

In [4]:
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

In [5]:
sh: expand=True
  xlsx2csv {excel_file} > {csv_file}

In [6]:
R: expand=True
  data <- read.csv('{csv_file}')
  pdf('{figure_file}')
  plot(data$log2FoldChange, data$stat)
  dev.off()

null device 
          1 


### Conversion to a SoS Workflow

A SoS workflow in a SoS Notebook consists of sections, each introduced by a section header of the form `[name: option]`. Definitions that are used by all steps belong in a `[global]` section.

The scripts must also be converted to SoS actions so that they can be executed as **complete** scripts, and the cell type must be changed from the subkernel to SoS.
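
To make the header convention concrete, the sketch below parses step headers such as `[plot_1 (convert)]` and lists, in numeric order, the steps that belong to a workflow named `plot`. It is a plain-Python illustration, not SoS's own parser: the regular expression, the reading of the parenthesized text as a free-form description, and the helper names `parse_header` and `steps_of` are assumptions made for this example.

```python
import re

HEADER = re.compile(
    r'^\[(?P<name>[A-Za-z]\w*?)(?:_(?P<index>\d+))?'
    r'(?:\s*\((?P<description>[^)]*)\))?'
    r'(?:\s*:\s*(?P<options>.*))?\]$'
)

def parse_header(line):
    """Split a step header into name, index, description, and options."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f'not a section header: {line!r}')
    parts = match.groupdict()
    parts['index'] = int(parts['index']) if parts['index'] else None
    return parts

def steps_of(workflow, headers):
    """Return the indexed steps of a workflow, ordered by index."""
    steps = [parse_header(h) for h in headers]
    selected = [s for s in steps
                if s['name'] == workflow and s['index'] is not None]
    return sorted(selected, key=lambda s: s['index'])

headers = ['[global]', '[plot_2 (plot)]', '[plot_1 (convert)]']
for step in steps_of('plot', headers):
    print(step['index'], step['description'])
```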

In [7]:
[global]
excel_file = 'data/DEG.xlsx'
csv_file = 'DEG.csv'
figure_file = 'output.pdf'

In [8]:
[plot_1 (convert)]
sh: expand=True
    xlsx2csv {excel_file} > {csv_file}

In [9]:
[plot_2 (plot)]
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

In [10]:
%sosrun plot

null device 
          1 


## Parameters

In [3]:
%run --excel-file data/DEG.xlsx

[global]
parameter: excel_file = path
parameter: figure_file = 'output.pdf'

csv_file = excel_file.with_suffix('.csv')

[plot_1 (convert)]
sh: expand=True
    xlsx2csv {excel_file} > {csv_file}
    
[plot_2 (plot)]
R: expand=True
    data <- read.csv('{csv_file}')
    pdf('{figure_file}')
    plot(data$log2FoldChange, data$stat)
    dev.off()


null device 
          1 
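
The `parameter:` statements turn workflow variables into command-line options. `excel_file = path` gives a type but no default value, so the option has to be supplied (here as `--excel-file data/DEG.xlsx`), while `figure_file` falls back to `'output.pdf'`. As a rough analogy only, the same interface can be sketched with Python's `argparse` and `pathlib`; this is not how SoS handles parameters internally, and `parse_parameters` is a name chosen for this example.

```python
import argparse
from pathlib import Path

def parse_parameters(argv):
    """Rough argparse analogue of the two `parameter:` lines above."""
    parser = argparse.ArgumentParser()
    # No default value: the option must be given on the command line.
    parser.add_argument('--excel-file', type=Path, required=True)
    # A default value makes the option optional.
    parser.add_argument('--figure-file', default='output.pdf')
    return parser.parse_args(argv)

args = parse_parameters(['--excel-file', 'data/DEG.xlsx'])
# Derived value: same path with the extension replaced.
csv_file = args.excel_file.with_suffix('.csv')
print(csv_file.as_posix(), args.figure_file)
```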


### Signature

In [2]:
%run --excel-file data/DEG.xlsx

[global]
parameter: excel_file = path
parameter: figure_file = 'output.pdf'

csv_file = excel_file.with_suffix('.csv')

[plot_1 (convert)]
input: excel_file
output: csv_file
sh: expand=True
    xlsx2csv {_input} > {_output}
    
[plot_2 (plot)]
output: figure_file
R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()
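
With `input:` and `output:` declared, SoS can record a runtime signature for each step and skip the step on a later run if nothing relevant has changed. The sketch below illustrates that idea in plain Python by hashing file contents. It is deliberately simplified: the function names, the use of MD5, the in-memory `signatures` dictionary, and the placeholder files are assumptions for this example and do not reflect SoS's actual signature format or storage.

```python
import hashlib
import tempfile
from pathlib import Path

signatures = {}  # step name -> recorded fingerprint (kept in memory here)

def fingerprint(files):
    """Hash the content of each file; missing files map to None."""
    result = {}
    for name in files:
        path = Path(name)
        result[name] = (hashlib.md5(path.read_bytes()).hexdigest()
                        if path.exists() else None)
    return result

def run_step(step, inputs, outputs, action):
    """Run `action` unless inputs and outputs match the recorded signature."""
    current = fingerprint(list(inputs) + list(outputs))
    if None not in current.values() and signatures.get(step) == current:
        return 'skipped'
    action()
    signatures[step] = fingerprint(list(inputs) + list(outputs))
    return 'executed'

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'DEG.xlsx'   # placeholder file, not a real workbook
    dst = Path(tmp) / 'DEG.csv'
    src.write_text('placeholder input')
    convert = lambda: dst.write_text(src.read_text())
    first = run_step('plot_1', [str(src)], [str(dst)], convert)
    second = run_step('plot_1', [str(src)], [str(dst)], convert)
    print(first, second)
```

The second call is skipped because neither the input nor the output has changed since the signature was recorded.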


### Process-oriented vs outcome-oriented workflows

In [14]:
%run

[plot_1 (convert)]
input:  'data/DEG.xlsx'
output: 'DEG.csv'
sh: expand=True
    xlsx2csv {_input} > {_output}
    
[plot_2 (plot)]
input:  'DEG.csv'
output: 'DEG.pdf'
R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()

null device 
          1 


In [7]:
!rm -f data/DEG.csv DEG.pdf
%run -t DEG.pdf

[convert: provides='{filename}.csv']
input:  f'{filename}.xlsx'
sh: expand=True
    xlsx2csv {_input} > {_output}
    
[plot]
input:  'data/DEG.csv'
output: 'DEG.pdf'
R: expand=True
    data <- read.csv('{_input}')
    pdf('{_output}')
    plot(data$log2FoldChange, data$stat)
    dev.off()


null device 
          1 
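
In the outcome-oriented version, `%run -t DEG.pdf` asks for a target rather than a step: `plot` produces `DEG.pdf` but needs `data/DEG.csv`, and `convert` can provide any file that matches `{filename}.csv`. The sketch below shows, in plain Python, how such a pattern could be matched against a requested file to recover `filename`. `match_provides` is a hypothetical helper written for this example, not SoS's implementation.

```python
import re

def match_provides(pattern, target):
    """Match target against a pattern like '{filename}.csv'.

    Returns a dict of captured names, or None if the target does not match.
    """
    regex = ''
    # Split into literal text and {name} placeholders.
    for part in re.split(r'(\{\w+\})', pattern):
        if part.startswith('{') and part.endswith('}'):
            regex += f'(?P<{part[1:-1]}>.+)'
        else:
            regex += re.escape(part)
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None

captured = match_provides('{filename}.csv', 'data/DEG.csv')
print(captured)
# The convert step would then read this input file.
print(f"{captured['filename']}.xlsx")
```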
