# ILP solver: Number of assignments with simulated reads depending on the constraint softening parameter ε
Timothée Jourde, Cedric Chauve  
June 20, 2019

### Abstract
In this notebook, we describe the results of applying our base ILP, with weighted read mappings accounting for mismatches, to various sets of simulated reads for instances composed of alleles of the *clpA* locus. The aim is to see how our ILP scales with increasingly complex instances.

## Methods

We generated various instances composed of (1) a set of randomly chosen alleles for *clpA* and random copy numbers for each allele, (2) a database of known alleles including the chosen alleles, and (3) a range of values for the parameter $\epsilon$. 

We started with instances where the alleles database is exactly the set of selected alleles, then progressively increased the size of this database. In all datasets, the sequencing depth was $100\times$ and simulated reads were generated using Art.

Let's see the parameters used for the generation of simulated reads:

In [None]:
!cat ../data/clpA/3721cdda46/sim/params.json

- `*_nb_alleles` is the number of alleles taken from the database `alleles_file` which are then used to generate a sample.
- `ends_length` is the length of the random DNA sequences prepended and appended to each allele of the sample before feeding them to Art, in order to obtain uniform coverage. The resulting reads are then truncated.
- `*_copy_nb` dictates how many times each chosen allele is replicated in the sample.
- When `take_ref` is `true` we output error-free reads.
- `art_flags` are raw parameters given to Art.
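
To make the roles of `ends_length` and `*_copy_nb` concrete, here is a minimal sketch of how a sample could be assembled before read simulation. It is not the code used to produce our datasets: the function `build_sample`, its arguments, and the choice of drawing fresh random flanks for every copy are assumptions made for illustration only, and the call to Art is left out.

```python
import random


def build_sample(alleles, copy_numbers, ends_length, seed=0):
    """Hypothetical helper: replicate each allele and pad it with random flanks.

    alleles      -- dict mapping allele name to its DNA sequence
    copy_numbers -- dict mapping allele name to its copy number
    ends_length  -- length of the random sequence added on each side
    """
    rng = random.Random(seed)
    sample = []
    for name, seq in alleles.items():
        for _ in range(copy_numbers[name]):
            # Assumption: new random flanks are drawn for each copy.
            left = "".join(rng.choice("ACGT") for _ in range(ends_length))
            right = "".join(rng.choice("ACGT") for _ in range(ends_length))
            sample.append((name, left + seq + right))
    return sample


# Toy usage with made-up sequences (not real clpA alleles).
toy_alleles = {"allele_1": "ACGTACGT", "allele_2": "TTGACC"}
toy_sample = build_sample(toy_alleles, {"allele_1": 2, "allele_2": 1}, ends_length=5)
for name, padded in toy_sample:
    print(name, len(padded))
```

Reads simulated from the padded sequences would then be truncated so that only the part overlapping the allele is kept, as described for `ends_length` above.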

For each dataset, we recorded how many reads the ILP assigned to an allele, using either the raw simulated reads or the reads corrected for sequencing errors. Please note that the following plots show only the number of read assignments; we do not check that those assignments are correct.

Let's see the directory tree of a batch `results-*`:

In [None]:
!tree ../results-000

Here we ran the ILP once per value of ε on each of the `2` datasets `clpA/3721cdda46/ref` and `clpA/3721cdda46/sim`, that is, `2 * epsilons` runs in total. For each dataset, we derived one filtered instance `a001-r001` with an alleles database of size one (`a001`) and reads originating from that one allele only (`r001`). `ref` (references) means that instead of taking the simulated reads, we took the corresponding reference sequences, thus removing any simulated errors. `sim` (simulated) means we took the simulated reads with errors.
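
The instance naming scheme can be decoded mechanically. The sketch below is illustrative only: `parse_instance_name` is a hypothetical helper, not a function from `../src`, and it assumes that every instance directory follows the `aNNN-rNNN` pattern seen above.

```python
import re

# Assumed pattern: "a" + alleles database size, "-", "r" + alleles in reads.
INSTANCE_RE = re.compile(r"^a(\d+)-r(\d+)$")


def parse_instance_name(name):
    """Return (alleles database size, number of alleles in reads)."""
    match = INSTANCE_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected instance name: {name!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_instance_name("a001-r001"))  # (1, 1)
```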

## Results
Below are 5 batches, all starting from the same datasets but each with different instance filtering parameters. As noted above, the following plots show the number of read assignments depending on the constraint softening parameter ε. Each dot is an ILP run.

Each plot is accompanied by a description following this format:
```
Paths              ../results-001/clpA/3721cdda46/*/a010-r003
Datasets           clpA/3721cdda46/ref clpA/3721cdda46/sim
Alleles database   10
Alleles in reads   3
Reads              3440
Bases              293548
Faulty reads       45.52 %
Faulty bases       0.73 %
Timeout            10 minutes

```
- `Alleles database` is the size of the alleles database given to the ILP solver.
- `Alleles in reads` is the number of alleles in the sample (from which the reads come).
- `Reads` is the number of reads given to the ILP solver.
- `Bases` is the sum of the length of all reads.
- `Faulty reads` is the percentage of reads with at least one sequencing error.
- `Faulty bases` is the percentage of bases affected by a sequencing error, over all reads.
- After `Timeout`, the ILP solver is killed and the solution is discarded, causing a missing dot on the plot.
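
As an illustration of the `Faulty reads` and `Faulty bases` fields, the following sketch computes both percentages by comparing each read with its error-free reference. It is not the code used to produce the descriptions: `error_stats` is a hypothetical helper, and it assumes that each read has the same length as its reference and that errors are substitutions only (no insertions or deletions).

```python
def error_stats(reads, references):
    """Return the percentages of faulty reads and faulty bases.

    Assumes reads[i] and references[i] have equal length and differ
    only by substitutions.
    """
    if not reads or len(reads) != len(references):
        raise ValueError("reads and references must be non-empty and paired")
    total_bases = 0
    faulty_bases = 0
    faulty_reads = 0
    for read, ref in zip(reads, references):
        mismatches = sum(a != b for a, b in zip(read, ref))
        total_bases += len(read)
        faulty_bases += mismatches
        if mismatches > 0:
            faulty_reads += 1
    return {
        "faulty_reads_pct": 100 * faulty_reads / len(reads),
        "faulty_bases_pct": 100 * faulty_bases / total_bases,
    }


# Toy usage: one of two reads carries a single substitution.
print(error_stats(["ACGT", "ACGA"], ["ACGT", "ACGT"]))
```

The two fields can differ widely, as in the example description above: a read counts as faulty as soon as a single one of its bases is wrong, so a low per-base error rate can still leave a large share of reads with at least one error.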

In [None]:
%%javascript
// Disable output scrolling so that all plots are displayed in full.
IPython.OutputArea.prototype._should_scroll = function(lines) {
  return false;
};

In [None]:
%run ../src/plots.py

### `results-000-no-weights`

In [None]:
plot_all("../results-000-no-weights")

### `results-000`

In [None]:
plot_all("../results-000")

### `results-001`

In [None]:
plot_all("../results-001")

### `results-002`

In [None]:
plot_all("../results-002")

### `results-003`

In [None]:
plot_all("../results-003")