# Managing SoS Projects

## Tracked and untracked files

The most important concept for managing SoS projects is <font color='red'>tracked files</font>, which are all input, output, and dependent files of steps of an executed workflow. The files are tracked by SoS using a signature system that records the name and content of the files. For a file to be tracked,

* the file has to be included in `input`, `output`, or `depends` statements of one of the steps
* the workflow has to be executed with [signature](Execution_of_Workflow.html) (namely, not with `-s ignore`, which is the default mode under Jupyter notebook).
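
As a rough illustration of a signature that records the name and content of a file, the following Python sketch stores a file's name, size, and MD5 digest and later checks whether the file still matches. The function names `file_signature` and `is_unchanged` are hypothetical and the use of MD5 is an assumption; this is not SoS's actual signature format.

```python
import hashlib
import tempfile
from pathlib import Path


def file_signature(path):
    """Return (name, size, MD5 of content) as a stand-in for a stored signature."""
    data = Path(path).read_bytes()
    return (str(path), len(data), hashlib.md5(data).hexdigest())


def is_unchanged(path, saved):
    """True if the file still exists and matches the saved signature."""
    return Path(path).is_file() and file_signature(path) == saved


# Usage: record a signature, then detect a change in content.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "tracked_f1"
    target.write_bytes(b"first version")
    saved = file_signature(target)
    unchanged_before = is_unchanged(target, saved)  # True: nothing changed yet
    target.write_bytes(b"second version")
    unchanged_after = is_unchanged(target, saved)   # False: content differs
```

A step whose files all still match their saved signatures could then be skipped, which is the behavior the re-runs in this document show.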

Let us create a SoS script that generates some tracked and untracked files:

In [1]:
%sandbox --dir temp
%set -s default -v2
%run 

parameter: name='tracked_f1'
[0]
output:  name
sh: expand=True
    dd if=/dev/urandom of={name} count=1000

[1]
output:  'd1/tracked_f2'
sh: expand=True
    dd if=/dev/urandom of={output} count=5000
    dd if=/dev/urandom of=d1/untracked_f4 count=50

[2]
output:  'd2/d3/tracked_f3'
sh: expand=True
    dd if=/dev/urandom of={output} count=600

Set sos options to "-s default -v2"
1000+0 records in
1000+0 records out
512000 bytes transferred in 0.034012 secs (15053475 bytes/sec)
5000+0 records in
5000+0 records out
2560000 bytes transferred in 0.170195 secs (15041561 bytes/sec)
50+0 records in
50+0 records out
25600 bytes transferred in 0.001709 secs (14979657 bytes/sec)
600+0 records in
600+0 records out
307200 bytes transferred in 0.020365 secs (15084704 bytes/sec)


This workflow creates three tracked files `tracked_f1`, `d1/tracked_f2`, `d2/d3/tracked_f3`, and as a side effect creates a `d1/untracked_f4` file.

## Remove tracked or untracked files

Subcommand `remove` is used to remove untracked files and directories to keep the project directory clean. It can also be used to remove files so that they can be regenerated. The latter is needed because SoS would not regenerate removed intermediate files as long as they are not actually used in another step.

When we re-run the workflow unchanged, all steps are ignored because of their saved signatures:

In [2]:
%sandbox --dir temp
%rerun -v2

INFO: default_0 (index=0) is ignored due to saved signature
INFO: default_1 (index=0) is ignored due to saved signature
INFO: default_2 (index=0) is ignored due to saved signature


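If we remove an intermediate file, the workflow does not complain during re-execution. In the run below, step `default_0` is ignored because of its saved signature while the remaining steps are executed again: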

In [3]:
%sandbox --dir temp
!rm -f d1/tracked_f2
%rerun -v2

INFO: default_0 (index=0) is ignored due to saved signature


5000+0 records in
5000+0 records out
2560000 bytes transferred in 0.172323 secs (14855824 bytes/sec)
50+0 records in
50+0 records out
25600 bytes transferred in 0.001865 secs (13727203 bytes/sec)
600+0 records in
600+0 records out
307200 bytes transferred in 0.020251 secs (15169595 bytes/sec)


You can use option `-s force` to regenerate all files, but it is easier to remove just the intermediate file and its signature using command `remove`. Option `-y` (yes) confirms the removal; without it, you are prompted before each file is removed.

In [4]:
%sandbox --dir temp
!sos remove -y d1/tracked_f2
%rerun -v2

INFO: 3 tracked files from 1 run are identified.
Remove file d1/tracked_f2
INFO: 1 file removed


INFO: default_0 (index=0) is ignored due to saved signature


5000+0 records in
5000+0 records out
2560000 bytes transferred in 0.171767 secs (14903911 bytes/sec)
50+0 records in
50+0 records out
25600 bytes transferred in 0.001717 secs (14908940 bytes/sec)
600+0 records in
600+0 records out
307200 bytes transferred in 0.020738 secs (14813297 bytes/sec)


You can use the `sos remove` command to remove all untracked files with option `-u` (untracked).

In [5]:
%sandbox --dir temp
!ls -R

B2.txt        C2.txt        Source.gv     d2            tracked_f1
B3.txt        C3.txt        Source.gv.png result.txt
C1.txt        C4.txt        d1            size.txt

./d1:
tracked_f2   untracked_f4

./d2:
d3

./d2/d3:
tracked_f3


In [6]:
%sandbox --dir temp
!sos remove . -u -y

INFO: 3 tracked files from 1 run are identified.
Remove untracked file ./d1/untracked_f4
INFO: 1 file removed


Alternatively, you can remove all tracked files with option `-t` (tracked):

In [7]:
%sandbox --dir temp
!sos remove . -t -y

INFO: 3 tracked files from 1 run are identified.
Remove tracked file ./d1/tracked_f2
Remove tracked file ./d2/d3/tracked_f3
INFO: 2 files removed


In [8]:
%sandbox --dir temp
!ls -R

B2.txt        C2.txt        Source.gv     d2            tracked_f1
B3.txt        C3.txt        Source.gv.png result.txt
C1.txt        C4.txt        d1            size.txt

./d1:

./d2:
d3

./d2/d3:


Because files directly under the current directory are often important, they are by default not removed with these global options (`-t` or `-u`). You will have to remove them explicitly if so desired.
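
The selection rule described above (remove untracked files found in subdirectories, but leave files that sit directly in the project directory alone) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `untracked_in_subdirs` is hypothetical, the set of tracked names is passed in by hand, and the code does not read SoS's own signature records.

```python
import tempfile
from pathlib import Path


def untracked_in_subdirs(root, tracked):
    """List files below root that are not in `tracked`, skipping files directly under root."""
    root = Path(root)
    found = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        if len(rel.parts) == 1:
            # File sits directly in the project directory: never a removal candidate.
            continue
        if rel.as_posix() not in tracked:
            found.append(rel.as_posix())
    return found


# Recreate a layout similar to the example project in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    names = ["tracked_f1", "result.txt", "d1/tracked_f2",
             "d1/untracked_f4", "d2/d3/tracked_f3"]
    for name in names:
        target = Path(tmp) / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(b"x")
    tracked = {"tracked_f1", "d1/tracked_f2", "d2/d3/tracked_f3"}
    candidates = untracked_in_subdirs(tmp, tracked)
    print(candidates)
```

In this layout only `d1/untracked_f4` is reported, while the untracked top-level file `result.txt` is left out, mirroring the output of `sos remove . -u -y` shown earlier.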

In [9]:
%sandbox --dir temp
!sos remove tracked_f1 -y

INFO: 3 tracked files from 1 run are identified.
Remove file tracked_f1
INFO: 1 file removed


## Archiving a project

Let us re-run the project and create all files:

In [10]:
%sandbox --dir temp
%rerun

1000+0 records in
1000+0 records out
512000 bytes transferred in 0.034881 secs (14678430 bytes/sec)
5000+0 records in
5000+0 records out
2560000 bytes transferred in 0.168757 secs (15169723 bytes/sec)
50+0 records in
50+0 records out
25600 bytes transferred in 0.001837 secs (13935650 bytes/sec)
600+0 records in
600+0 records out
307200 bytes transferred in 0.020466 secs (15010195 bytes/sec)


For archiving and reproducibility purposes, you will often need to create an archive of the analysis so that you can refer to it later. This is very easy to do with the `sos pack` command.

To use this feature, you first need to identify the workflow session that you would like to pack by the ID that is printed at the end of the execution. You do not have to specify the complete ID; the first few characters (or even no characters) suffice as long as they identify a unique session. Because we have only executed one workflow, we can run

In [11]:
%sandbox --dir temp
!sos pack -o myproj.sar -y

INFO: Checking tracked_f1
INFO: Checking d1/tracked_f2
INFO: Checking d2/d3/tracked_f3
INFO: Archiving 3 files (3.2 MiB)...
INFO: Adding tracked_f1
INFO: Adding d1/tracked_f2
INFO: Adding d2/d3/tracked_f3
INFO: Adding runtime files
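
The rule that a partial (or even empty) ID is acceptable as long as it identifies a unique session amounts to prefix matching. A minimal Python sketch of that rule is given below; the function `resolve_session` and the session IDs are made up for illustration and are not part of SoS.

```python
def resolve_session(prefix, session_ids):
    """Return the single session ID that starts with `prefix`, or raise ValueError."""
    matches = [sid for sid in session_ids if sid.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(
            f"{len(matches)} sessions match prefix {prefix!r}; expected exactly one"
        )
    return matches[0]


# Hypothetical session IDs, for illustration only.
one_session = ["a1b2c3d4e5f60718"]
two_sessions = ["a1b2c3d4e5f60718", "a1f9e8d7c6b5a493"]

# With a single recorded session, an empty prefix is already unique.
print(resolve_session("", one_session))
# With two sessions, the prefix must be long enough to tell them apart.
print(resolve_session("a1b", two_sessions))
```

With two sessions recorded, the shorter prefix `"a1"` would match both and raise an error, which is why the empty ID only works when a single workflow has been executed.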


You can use options `--include` and `--exclude` to add specific files or directories to the archive or to leave them out. For example, if you would like to archive the untracked file `d1/untracked_f4`, you can add this file, or the whole directory, using command

In [12]:
%sandbox --dir temp
!sos pack -i d1 -o myproj_all.sar -y

INFO: Checking d2/d3/tracked_f3
INFO: Checking tracked_f1
INFO: Checking d1/tracked_f2
INFO: Checking d1/untracked_f4
INFO: Archiving 4 files (3.2 MiB)...
INFO: Adding d2/d3/tracked_f3
INFO: Adding tracked_f1
INFO: Adding d1/tracked_f2
INFO: Adding d1/untracked_f4
INFO: Adding runtime files
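
One way to picture how tracked files, included paths, and excluded paths combine into the final file list is the Python sketch below. It rests on assumptions: the helper `select_files` is hypothetical, included directories are expanded recursively, and exclusions are treated as shell-style patterns via `fnmatch`, which may differ from how `sos pack` interprets `--exclude`.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def select_files(root, tracked, include=(), exclude=()):
    """Start from tracked files, add included files or directories, drop excluded patterns."""
    root = Path(root)
    selected = set(tracked)
    for item in include:
        path = root / item
        if path.is_dir():
            # Add every file below an included directory.
            selected.update(
                p.relative_to(root).as_posix() for p in path.rglob("*") if p.is_file()
            )
        elif path.is_file():
            selected.add(Path(item).as_posix())
    return sorted(
        name for name in selected
        if not any(fnmatch(name, pattern) for pattern in exclude)
    )


with tempfile.TemporaryDirectory() as tmp:
    names = ["tracked_f1", "d1/tracked_f2", "d1/untracked_f4", "d2/d3/tracked_f3"]
    for name in names:
        target = Path(tmp) / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(b"x")
    tracked = {"tracked_f1", "d1/tracked_f2", "d2/d3/tracked_f3"}
    with_d1 = select_files(tmp, tracked, include=["d1"])
    without_d2 = select_files(tmp, tracked, include=["d1"], exclude=["d2/*"])
    print(with_d1)
    print(without_d2)
```

Including `d1` yields the same four files that the `sos pack -i d1` run above reports, and the additional pattern `d2/*` drops `d2/d3/tracked_f3` from the list.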


## Unpacking an archive

Now that we have an archive of the project, we can remove the generated files and the `.sos` runtime directory from the project directory:

In [13]:
%sandbox --dir temp
!rm -rf .sos d1 d2 tracked_f1 *.dot *.sos

In [14]:
%sandbox --dir temp
!ls

B2.txt         C2.txt         Source.gv      myproj_all.sar
B3.txt         C3.txt         Source.gv.png  result.txt
C1.txt         C4.txt         myproj.sar     size.txt


We can unpack the archive using command `sos unpack`

In [15]:
%sandbox --dir temp
!sos unpack myproj.sar -y

INFO: Extracting tracked_f1
INFO: Extracting d1/tracked_f2
INFO: Extracting d2/d3/tracked_f3
INFO: Extracting bae9f8b5ae291ed35320444902c1f323.exe_info
INFO: Extracting af18a5b2373f4435.sig
INFO: Extracting fbcc51e353626a825a8e0ed6bee26e82.file_info
INFO: Extracting 205711fc765ca01e2586f3a6794c3cd2.file_info
INFO: Extracting 840d03031bb1511b16f39c4880d45422.exe_info
INFO: Extracting ab180a7fd17af770b21db711638a4e44.file_info
INFO: Extracting b7484e58bc2bd7f0baaa888f22f816b2.exe_info


In [16]:
%sandbox --dir temp
!ls

B2.txt         C2.txt         Source.gv      d2             result.txt
B3.txt         C3.txt         Source.gv.png  myproj.sar     size.txt
C1.txt         C4.txt         d1             myproj_all.sar tracked_f1


As you may have noticed, the script itself is not unpacked. Because the script was defined in a Jupyter notebook, it is archived under the name `__interactive__.sos`. Scripts are not unpacked by default because a SoS script does not have to be in the current directory, and it can be dangerous to overwrite a local script with an archived one. To unpack the script, use option `-s`:

In [17]:
%sandbox --dir temp
!sos unpack myproj.sar -y -s

INFO: Ignore identical tracked_f1
INFO: Ignore identical d1/tracked_f2
INFO: Ignore identical d2/d3/tracked_f3


Now you can check the script that was used to generate this archive.