# Release 1.0 roadmap

## Unit tests

### Module names
- [] module name cannot start with `pipeline_`
- [] module names have to be SQL friendly
- [] modules that are identical in every way except names
- [] module names match number of executables
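
The naming rules above, together with the `_` and `.` restrictions listed under Parser, could be exercised by a small validator like the sketch below. The function name `check_module_name` and the regular expression used for "SQL friendly" (letters, digits and underscores, starting with a letter) are assumptions for illustration, not DSC's actual implementation.

```python
import re

# Assumed definition of "SQL friendly": letters, digits and underscores,
# starting with a letter. The real DSC rule may differ.
SQL_FRIENDLY = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_module_name(name):
    """Return the reasons `name` is rejected (empty list if accepted)."""
    problems = []
    if name.startswith("pipeline_"):
        problems.append("starts with reserved prefix 'pipeline_'")
    if name.startswith("_"):
        problems.append("starts with '_'")
    if "." in name:
        problems.append("contains '.'")
    if not SQL_FRIENDLY.match(name):
        problems.append("not SQL friendly")
    return problems


if __name__ == "__main__":
    for candidate in ["simulate", "pipeline_x", "_hidden", "a.b"]:
        print(candidate, check_module_name(candidate))
```

Returning a list of reasons rather than a boolean makes it easy for a unit test to assert on the specific rule that was violated.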

### Parser
- [x] all but @CONF decoration tests
- [] module inherit omitting exec
- [] no multiple module input `$x`, `$y` or any mixture of them with module parameters
- [] all module output in the same block should have same name (a required good practice)
- [] `DSC::run` supports only `()`, `*` and `,`
- [] module names / parameters cannot start with `_` and cannot have `.` in them
- [] duplicate module / parameter names
- [] Different lengths of parameters in different exec, e.g. `mybeta = 1,2,3` vs. `mybeta = (4,5,6)`, will lead to a file lock failure.
- [] strings ending with `,` intentionally
- [] Check if @ALIAS in the parameter list (or is a number or str?)
- [] `@FILTER` cannot have pipeline variables `$`

### Execution
- [] identical tasks will result in a complaint: check for identical jobs, i.e. the same parameter given twice
- [] both in `DSC::run` and in `--target`: what if the first module has upstream dependency? Should catch and report an error.
- [] Unsupported keywords in `DSC` block
- [] Bad pipeline logic specification (resulting in failure, as described in #22)
- [] Looped steps. Actually this should be a feature when desired ...
- [] Downstream pipeline did not use any of upstream variables
- [] all modules are valid (defined)
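
For the identical-jobs check above, a test could look for parameter settings that occur more than once. The sketch below is a minimal illustration; `find_duplicate_jobs` and the representation of a job as a dict of parameter values are hypothetical, not DSC internals.

```python
from collections import Counter


def find_duplicate_jobs(jobs):
    """Return the parameter settings that appear more than once.

    `jobs` is a list of dicts mapping parameter names to hashable values.
    """
    # Sort the items so that key order does not affect the comparison.
    keys = [tuple(sorted(job.items())) for job in jobs]
    counts = Counter(keys)
    return [dict(key) for key, n in counts.items() if n > 1]


if __name__ == "__main__":
    jobs = [
        {"n": 100, "seed": 1},
        {"seed": 1, "n": 100},  # same setting, different key order
        {"n": 200, "seed": 1},
    ]
    print(find_duplicate_jobs(jobs))  # [{'n': 100, 'seed': 1}]
```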

### Query
- [] dsc-query strip path for dsc_output argument

### Misc
- [] do not write library installed files if installation fails

## Documentation
### Best practices
- [] RE seeds -- users should ensure seeds for modules are always the same, when applicable.

### Examples
- [x] convert all previous examples to new syntax
- [] add a tutorial that compares computation time / speed between modules
- [] a tutorial for managing benchmark output, e.g. removing / rerunning specified steps and moving a project from one computer to another

## Small features
- [] add exec `depend` property to track source files
- [x] create a switch / or --debug switch to generate mock run file
- [!] pipeline seed batch has to be used / tested; based on total number of jobs distribute it smartly.
- [x] remove old Rlib info files
- [x] fix github R package arbitrary paths
- [!] add tags for queries
- [] Properly handle grouped input, e.g. `g: (N,P)` (the table then has to have 2 columns, `g_N` and `g_P`)
- [x] `file()` / `file(ext)` behavior is reversed ... need to fix
- [] Replace e.g. `simulate_n` with `simulate:n` in the column names
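
The grouped-input item above asks for one table column per member of the group. A minimal sketch of that expansion is shown below; `expand_grouped` is a hypothetical helper, and the `_` separator simply follows the `g_N` / `g_P` naming mentioned in the item.

```python
def expand_grouped(group, members, values):
    """Map a grouped parameter to one column per member.

    expand_grouped("g", ("N", "P"), (100, 5)) gives {"g_N": 100, "g_P": 5}.
    """
    if len(members) != len(values):
        raise ValueError("members and values must have the same length")
    return {f"{group}_{member}": value for member, value in zip(members, values)}


if __name__ == "__main__":
    print(expand_grouped("g", ("N", "P"), (100, 5)))  # {'g_N': 100, 'g_P': 5}
```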

## Major features

Many were existing features removed due to the new syntax, SoS advances, etc. We need to bring them back in.

### Enhanced interface syntax
- [] `inline` executables

### Multi-language related issues
- [] New data exchange format

### Shell command executable related issues
- [] Multiple output files
- [] executable command options
    ```
    `exec` specifies the names of executable computational routines as well as their command line arguments if applicable. For example an `exec` entry reads:

     exec: datamaker.R, ms $nsam $nreps -t $theta -seed $seed
    ```
- [] index slicing
    ```
    Index for parameters, for example `exec: makeped.py $data $output[1]` where `output` parameter takes the form of `output: (1.ped, 1.map), (2.ped, 2.map)`. In this case `output[1]` will only use the first value of each parameter group.
    ```
- [] New data extraction interface / basic data exploration features
- [x] a more self-contained way to load DSC related functions: a companion R package eventually?
- [x] performance (`SoS` tasks)
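
The index slicing item above describes `$output[1]` as taking the first value of each parameter group. A minimal sketch of that behavior, assuming 1-based indexing as in the description and using a hypothetical helper `slice_groups`, might look like this:

```python
def slice_groups(groups, index):
    """Pick the `index`-th value (1-based) from every parameter group."""
    if index < 1:
        raise ValueError("index is 1-based")
    return [group[index - 1] for group in groups]


if __name__ == "__main__":
    # Mirrors `output: (1.ped, 1.map), (2.ped, 2.map)` from the item above.
    output = [("1.ped", "1.map"), ("2.ped", "2.map")]
    print(slice_groups(output, 1))  # ['1.ped', '2.ped']
```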

### Large scale computations
- [] `--host` feature binding with SoS: now broken due to the new SoS task model
  - To check: `scp`, `ssh` commands are available
  - To sync by DSC: 
    - the output folder
    - Not the host config file (?)
- [] CONF `merge` feature:
   ```
   `inline`: True or False, indicating whether an R script is executed inline with the next procedure instead of producing return files. This feature is useful when the cost of computation for a procedure is trivial compared to the cost of storing its output. For example, if a simulation procedure is simply `runif(500000)`, it makes more sense to save this line of code and execute it inline with the next step than to save a vector of 500,000 random numbers to disk.
   ```

## Known issues

SoS issues:

- [] Multiple loads of I/O data for each task distribution (SoS global section repeated executions)
- [] `-c` option not honored for non-task
- [] Hang on dead-lock