Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auxiliary methods for taxonomic classification #343

Merged
merged 2 commits into from
Jul 28, 2021
Merged

Conversation

heracle
Copy link
Collaborator

@heracle heracle commented Jul 28, 2021

Add some small methods that will be further used for tax class

Signed-off-by: Radu Muntean radu.muntean.2@gmail.com

@heracle heracle requested review from karasikov and hmusta July 28, 2021 13:49
Signed-off-by: Radu Muntean <radu.muntean.2@gmail.com>
Comment on lines 792 to 796
throw std::runtime_error(
folly::to<string>("The current 'call_annotated_rows' call has returned, ", unique_matrix_rows.size(),
"rows. The maximum number of rows that can be returned is ",
std::numeric_limits<uint32_t>::max(),
". Please reduce the query batch size"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use fmt::format

if (prefix.size() > str.size()) {
return false;
}
return prefix == str.substr(0, static_cast<int>(prefix.size()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return prefix == str.substr(0, static_cast<int>(prefix.size()));
return prefix == std::string_view(str).substr(0, prefix.size());

Comment on lines 779 to 804
void AnnotatedDBG::call_annotated_rows(const std::vector<node_index> &rows,
std::function<void(const std::string&)> callback_cell,
std::function<void()> callback_row) const {
assert(check_compatibility());

auto unique_matrix_rows = annotator_->get_matrix().get_rows(rows);

//TODO make sure that this function works even if we have duplications in 'rows'. Then, delete this error catch.
if (rows.size() != unique_matrix_rows.size()) {
throw std::runtime_error("The current 'call_annotated_rows' call contains duplication.");
}

if (unique_matrix_rows.size() >= std::numeric_limits<uint32_t>::max()) {
throw std::runtime_error(
folly::to<string>("The current 'call_annotated_rows' call has returned, ", unique_matrix_rows.size(),
"rows. The maximum number of rows that can be returned is ",
std::numeric_limits<uint32_t>::max(),
". Please reduce the query batch size"));
}
const auto &label_encoder = annotator_->get_label_encoder();
for (auto row : unique_matrix_rows) {
for (auto cell : row) {
callback_cell(label_encoder.decode(cell));
}
callback_row();
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite see a reason to add this function. You have access to all members, so can implement this directly where needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! answered in the other comment

". Please reduce the query batch size"));
}
const auto &label_encoder = annotator_->get_label_encoder();
for (auto row : unique_matrix_rows) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unnecessarily making a copy

Suggested change
for (auto row : unique_matrix_rows) {
for (const auto &row : unique_matrix_rows) {

@@ -156,6 +156,10 @@ class AnnotatedDBG : public AnnotatedSequenceGraph {
int32_t match_score = 1,
int32_t mismatch_score = 2) const;

void call_annotated_rows(const std::vector<node_index> &rows,
std::function<void(const std::string&)> callback_cell,
std::function<void()> callback_row) const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the second callback supposed to be used for? It's not clear...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to remove this function

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments! I need to have access to annotator_ field and I can't implement it directly where needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is using 2 callback functions: one for doing a preprocessing per cell (i.e. get the taxid from label) and one for a final computation on an annotated row (i.e. get the LCA taxid). If you think that it would be more clear, I can reimplement it to use only one callback that receives the list of labels on one row (but this will use more memory -> store list of labels instead of list of integers)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do you need to have a non-const reference to the annotator? If not, you can just call get_annotator().get_matrix()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I missed that get_annotator(). Did the changes now, thanks for the feedback

@heracle heracle force-pushed the mtg_taxclass_1 branch 2 times, most recently from 78ad12f to b92912e Compare July 28, 2021 16:43
@karasikov karasikov merged commit 38690ab into master Jul 28, 2021
@karasikov karasikov deleted the mtg_taxclass_1 branch July 28, 2021 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants