Skip to content

Commit

Permalink
Merge pull request #136 from MitraDarja/insert
Browse files Browse the repository at this point in the history
[FEATURE] Insert new experiments.
  • Loading branch information
MitraDarja authored Jul 28, 2023
2 parents ae8d30b + eb4841c commit aedaf36
Show file tree
Hide file tree
Showing 12 changed files with 1,309 additions and 27 deletions.
30 changes: 29 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ Needle stores its data in multiple interleaved Bloom filter, a fast and space ef

## Citation

Please cite:
Please cite:

Mitra Darvish, Enrico Seiler, Svenja Mehringer, René Rahn, Knut Reinert, Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments, Bioinformatics, Volume 38, Issue 17, 1 September 2022, Pages 4100–4108, https://doi.org/10.1093/bioinformatics/btac492

Expand Down Expand Up @@ -95,6 +95,34 @@ The created file "expressions.out" (if you prefer a different name, use "-o") sh
GeneA 0 32
```

## Insert into an existing Needle index
It is possible to insert new sequence files into an uncompressed Needle index. Similar to the build step this can be done by either using the sequence files as input directly or the minimiser files outputed by `needle minimiser`. Most options are the same as the ones from the build step, however as the Needle index already exist, neither the false positive rate nor the number of hash functions can be changed. It is necessary to specify `i` to the directory, where the existing Needle index can be found.

The following example inserts into the Needle index build above for two paired-end experiments.

```
./bin/needle ibf ../needle/test/data/exp_0*.fasta --paired -e 16 -e 32 -f 0.3 -c -o example // Create Index
./bin/needle insert ../needle/test/data/exp_1*.fasta --paired -i example // Insert into created index
```

Based on minimiser files an insertion to the Needle index can be achieved by using the following command:
```
./bin/needle ibf ../needle/test/data/exp_0*.fasta --paired -e 16 -e 32 -f 0.3 -c -o example // Create Index
./bin/needle insertmin exp*.minimiser -i example // Insert into created index
```

The insert methods based on minimiser or on sequence files is independent of the way the index was created.

## Delete experiments from an existing Needle index
It is possible to delete sequence files from an uncompressed Needle index by specifiying the position of the experiment, which should be deleted.
These deleted experiments won't change the size of the index as the space is kept for later insertions.
```
./bin/needle ibf ../needle/test/data/exp_*.fasta --paired -e 16 -e 32 -f 0.3 -c -o example // Create Index
./bin/needle delete -i example 0 // Delete first experiment exp_0 (with position 0) from index
``
## Note
This app was created with the [seqan3 app-template](https://github.com/seqan/app-template).
40 changes: 40 additions & 0 deletions include/ibf.h
Original file line number Diff line number Diff line change
Expand Up @@ -113,3 +113,43 @@ std::vector<uint16_t> ibf(std::vector<std::filesystem::path> const & minimiser_f
*/
void minimiser(std::vector<std::filesystem::path> const & sequence_files, min_arguments const & args,
minimiser_arguments & minimiser_args, std::vector<uint8_t> & cutoffs);

/*! \brief Insert into IBFs.
* \param sequence_files A vector of sequence file paths.
* \param ibf_args The IBF specific arguments to use (bin size, number of hash functions, ...). See
* struct ibf_arguments.
* \param minimiser_args The minimiser specific arguments to use.
* \param cutoffs List of cutoffs.
* \param expression_by_genome_file File that contains the only minimisers that should be considered for the
* determination of the expression thresholds.
* \param path_in Input directory.
* \param samplewise True, if expression levels were set beforehand.
* \returns The expression thresholds per experiment.
*/
std::vector<uint16_t> insert(std::vector<std::filesystem::path> const & sequence_files,
estimate_ibf_arguments & ibf_args, minimiser_arguments & minimiser_args,
std::vector<uint8_t> & cutoffs,
std::filesystem::path const expression_by_genome_file, std::filesystem::path path_in, bool samplewise);

/*! \brief Insert into IBFs based on the minimiser files
* \param minimiser_files A vector of minimiser file paths.
* \param ibf_args The IBF specific arguments to use (bin size, number of hash functions, ...). See
* struct ibf_arguments.
* \param expression_by_genome_file File that contains the only minimisers that should be comnsidered for the
* determination of the expression_thresholds.
* \param path_in Input directory.
* \param samplewise True, if expression levels were set beforehand.
* \returns The expression thresholds per experiment.
*/
std::vector<uint16_t> insert(std::vector<std::filesystem::path> const & minimiser_files,
estimate_ibf_arguments & ibf_args,
std::filesystem::path const expression_by_genome_file, std::filesystem::path path_in, bool samplewise);

/*! \brief Delete bins from ibfs
* \param delete_files A vector of integers specifiying the bins to delete.
* \param ibf_args The IBF specific arguments to use (bin size, number of hash functions, ...). See
* struct ibf_arguments.
* \param path_in Input directory.
* \param samplewise True, if expression levels were set beforehand.
*/
void delete_bin(std::vector<uint64_t> const & delete_files, estimate_ibf_arguments & ibf_args, std::filesystem::path path_in, bool samplewise);
31 changes: 24 additions & 7 deletions src/estimate.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@
template <class IBFType, bool last_exp, bool normalization, typename exp_t>
void check_ibf(min_arguments const & args, IBFType const & ibf, std::vector<uint16_t> & estimations_i,
seqan3::dna4_vector const seq, std::vector<uint32_t> & prev_counts,
exp_t const & expressions, uint16_t const k, std::vector<double> const fprs)
exp_t const & expressions, uint16_t const k, std::vector<double> const fprs, std::vector<int> deleted)
{
// Check, if one expression threshold for all or individual thresholds
static constexpr bool multiple_expressions = std::same_as<exp_t, std::vector<std::vector<uint16_t>>>;
Expand All @@ -55,6 +55,8 @@ void check_ibf(min_arguments const & args, IBFType const & ibf, std::vector<uint
// Check every experiment by going over the number of bins in the ibf.
for(int j = 0; j < counter.size(); j++)
{
if (std::find(deleted.begin(), deleted.end(), j) != deleted.end())
continue;
// Correction by substracting the expected number of false positives
counter[j] = std::max((double) 0.0, (double) ((counter[j]-(minimiser_length*fprs[j]))/(1.0-fprs[j])));
// Check, if considering previously seen minimisers and minimisers found ar current level equal to or are greater
Expand Down Expand Up @@ -149,6 +151,7 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat
std::vector<std::vector<uint16_t>> estimations;
std::vector<std::vector<uint16_t>> expressions;
std::vector<std::vector<double>> fprs;
std::vector<int> deleted{};

omp_set_num_threads(args.threads);
seqan3::contrib::bgzf_thread_count = args.threads;
Expand All @@ -167,6 +170,20 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat

read_levels<double>(fprs, estimate_args.path_in.string() + "IBF_FPRs.fprs");

// Check, if bins have been deleted
if (std::filesystem::exists(estimate_args.path_in.string() + "IBF_Deleted"))
{
std::ifstream fin;
fin.open(estimate_args.path_in.string() + "IBF_Deleted");
uint64_t number;

while (fin >> number)
{
deleted.push_back(number);
}
fin.close();
}

// Make sure expression levels are sorted.
sort(args.expression_thresholds.begin(), args.expression_thresholds.end());

Expand All @@ -193,15 +210,15 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat
if constexpr (samplewise & normalization_method)
check_ibf<IBFType, true, true>(args, ibf, estimations[i], seqs[i], prev_counts[i],
expressions,args.number_expression_thresholds - 1,
fprs[args.number_expression_thresholds - 1]);
fprs[args.number_expression_thresholds - 1], deleted);
else if constexpr (samplewise)
check_ibf<IBFType, true, false>(args, ibf, estimations[i], seqs[i], prev_counts[i],
expressions, args.number_expression_thresholds - 1,
fprs[args.number_expression_thresholds - 1]);
fprs[args.number_expression_thresholds - 1], deleted);
else
check_ibf<IBFType, true, false>(args, ibf, estimations[i], seqs[i], prev_counts[i],
args.expression_thresholds[args.expression_thresholds.size() - 1], prev_expression,
fprs[args.expression_thresholds.size() - 1]);
fprs[args.expression_thresholds.size() - 1], deleted);
}

if constexpr (!samplewise)
Expand All @@ -221,13 +238,13 @@ void estimate(estimate_ibf_arguments & args, IBFType & ibf, std::filesystem::pat
{
if constexpr (samplewise & normalization_method)
check_ibf<IBFType, false, true>(args, ibf, estimations[i], seqs[i], prev_counts[i],
expressions, j, fprs[j]);
expressions, j, fprs[j], deleted);
else if constexpr (samplewise)
check_ibf<IBFType, false, false>(args, ibf, estimations[i], seqs[i], prev_counts[i],
expressions, j, fprs[j]);
expressions, j, fprs[j], deleted);
else
check_ibf<IBFType, false, false>(args, ibf, estimations[i], seqs[i], prev_counts[i],
args.expression_thresholds[j], prev_expression, fprs[j]);
args.expression_thresholds[j], prev_expression, fprs[j], deleted);
}

if (!samplewise)
Expand Down
Loading

0 comments on commit aedaf36

Please sign in to comment.