mlwh loader extension for ampliconstats #170

Merged (2 commits, Oct 21, 2020)
Changes: 15 additions & 0 deletions
@@ -1,5 +1,20 @@
LIST OF CHANGES

- retrieval and loading of heron artic autoqc data are refactored
to make the code more generic and flexible, in particular:
1. when retrieving autoqc data, the data from all portable
pipelines are stored under a single top-level key, thus
making it easy to skip this data when loading non-pp tables
2. if the same column names are used in pp and non-pp tables,
preference is given to data from the generic result objects
3. since the in-memory data structure for the generic autoqc results
is now more flexible, it becomes possible to accommodate
multiple data sets, for example, storing ampliconstats data
per entity per amplicon, rather than flattening ampliconstats
data (see the illustrative sketch after this list)
- retrieval of ampliconstats autoqc data and loading of the data to the
new iseq_product_ampliconstats table
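
For orientation only, a minimal sketch of the per-product in-memory structure
the notes above describe; the run, tag and metric values are invented, and only
the shape mirrors what the retriever in this change set assembles:

# Hypothetical per-product record; all portable-pipeline (pp) results
# live under the single top-level 'pp' key, so loaders for non-pp
# tables can skip them wholesale.
my $product_data = {
  id_run    => 25710,   # non-pp columns stay flat
  position  => 1,
  tag_index => 58,
  pp => {
    'ncov2019-artic-nf' => {   # one summary hash for the artic pipeline
      pp_name              => 'ncov2019-artic-nf',
      pp_version           => '1.3.0',
      supplier_sample_name => 'SAMPLE-1',
      artic_qc_outcome     => 'TRUE',
    },
    'ncov2019-artic-nf_ampliconstats' => [   # one hash per amplicon
      { amplicon_index             => 1,
        primer_panel               => 'nCoV-2019/V3/nCoV-2019.bed',
        primer_panel_num_amplicons => 98,
        metric_fpcov_10            => 99.5,
      },
      # ... further entries for amplicons 2 .. 98
    ],
  },
};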

release 43.0.0
- ml warehouse run loader extended to load autoqc data to the
iseq_heron_product_metrics table
MANIFEST: 3 additions & 0 deletions
@@ -178,9 +178,12 @@ t/data/runfolders/110804_HS22_06642_A_B020JACXX/Data/Intensities/BAM_basecalls_2
t/data/runfolders/110804_HS22_06642_A_B020JACXX/Data/Intensities/BAM_basecalls_20110813-160456/no_cal/archive/lane4/qc/6642_4.verify_bam_id.json
t/data/runfolders/180130_MS6_24975_A_MS6073474-300V2/Data/Intensities/BAM_basecalls_20180221-165427/no_cal/archive/lane1/plex1/qc/24975_1#1.rna_seqc.json
t/data/runfolders/180130_MS6_24975_A_MS6073474-300V2/Data/Intensities/BAM_basecalls_20180221-165427/no_cal/archive/lane1/qc/24975_1.tag_metrics.json
t/data/runfolders/180423_MS7_25710_A_MS6392545-300V2/Data/Intensities/BAM_basecalls_20180501-112028/no_cal/archive/lane1/plex58/qc/25710_1#58.ncov2019-artic-nf_ampliconstats.generic.json
t/data/runfolders/180423_MS7_25710_A_MS6392545-300V2/Data/Intensities/BAM_basecalls_20180501-112028/no_cal/archive/lane1/plex59/qc/25710_1#59.other_pipeline.generic.json
t/data/runfolders/180423_MS7_25710_A_MS6392545-300V2/Data/Intensities/BAM_basecalls_20180501-112028/no_cal/archive/lane1/plex59/qc/25710_1#59.ncov2019-artic-nf.generic.json
t/data/runfolders/180423_MS7_25710_A_MS6392545-300V2/Data/Intensities/BAM_basecalls_20180501-112028/no_cal/archive/lane1/plex59/qc/25710_1#59.ncov2019-artic-nf_ampliconstats.generic.json
t/data/runfolders/180423_MS7_25710_A_MS6392545-300V2/Data/Intensities/BAM_basecalls_20180501-112028/no_cal/archive/lane1/plex60/qc/25710_1#60.ncov2019-artic-nf.generic.json
t/data/runfolders/180423_MS7_25710_A_MS6392545-300V2/Data/Intensities/BAM_basecalls_20180501-112028/no_cal/archive/lane1/plex60/qc/25710_1#60.ncov2019-artic-nf_ampliconstats.generic.json
t/data/runfolders/180423_MS7_25710_A_MS6392545-300V2/Data/Intensities/BAM_basecalls_20180501-112028/no_cal/archive/lane1/plex60/qc/25710_1#60.genotype_call.json
t/data/runfolders/180423_MS7_25710_A_MS6392545-300V2/Data/Intensities/BAM_basecalls_20180501-112028/no_cal/archive/lane1/qc/25710_1.tag_metrics.json
t/data/runfolders/181008_HX1_27116_B_HNYKNCCXY/Data/Intensities/BAM_basecalls_20181015-171739/no_cal/archive/lane1/plex1/qc/27116_1#1.bam_flagstats.json
bin/npg_mlwarehouse_run_delete: 9 additions & 3 deletions
@@ -30,12 +30,18 @@ my $transaction = sub {
while (my $rl_row = $rs->next) {
my $rsp = $rl_row->iseq_product_metrics();
while (my $row = $rsp->next) {
$row->iseq_product_components()->delete();
my $crow = $row->iseq_product_components();
$row->iseq_product_ampliconstats()->delete();
$crow->delete();
}
}
$rs->delete();
$schema_wh->resultset(q[IseqProductMetric])
->search({id_run => $id_run})->delete;
$rs = $schema_wh->resultset(q[IseqProductMetric])
->search({id_run => $id_run});
for ($rs->all()) {
$_->iseq_product_ampliconstats()->delete();
}
$rs->delete();
};

$schema_wh->txn_do($transaction);
lib/npg_warehouse/loader/autoqc.pm: 81 additions & 53 deletions
@@ -4,6 +4,7 @@ use Carp;
use Moose;
use MooseX::StrictConstructor;
use Readonly;
use Clone qw/clone/;

use npg_tracking::glossary::rpt;
use npg_tracking::glossary::composition;
@@ -17,12 +18,12 @@ our $VERSION = '0';

## no critic (ProhibitUnusedPrivateSubroutines)

Readonly::Scalar our $PP_PREFIX => q[pp.];
Readonly::Scalar our $ARTIC_PP_NAME => q[ncov2019-artic-nf];
Readonly::Scalar our $PP_KEY => q[pp];

# Maximum value for MYSQL smallint unsigned
Readonly::Scalar my $INSERT_SIZE_QUARTILE_MAX_VALUE => 65_535;
Readonly::Scalar my $HUNDRED => 100;
Readonly::Scalar my $PRIMER_PANEL_MAX_LENGTH => 255;

Readonly::Hash my %AUTOQC_MAPPING => {
gc_fraction => {
@@ -152,30 +153,79 @@ sub _composition_without_subset {
return npg_tracking::glossary::composition->new(components => \@components);
}

sub _astats_data {
my ($astats, $info, $common_data) = @_;

my $num_amplicons = $astats->{num_amplicons};
$num_amplicons or croak 'Number of amplicons should be defined';
my $command = $info->{Samtools_command};
$command or croak 'Samtools_command is not recorded';
my ($primer_panel) = $command =~ /primer_panel\/(\S+[.]bed)\s*\S*\Z/smx;
$primer_panel or
($primer_panel) = $command =~ /(\S+[.]bed)\s*\S*\Z/smx;
$primer_panel or croak 'Failed to extract the primer panel path';
# Trim the start of the string to fit the column.
$primer_panel = substr $primer_panel, -$PRIMER_PANEL_MAX_LENGTH;

$common_data->{primer_panel} = $primer_panel;
$common_data->{primer_panel_num_amplicons} = $num_amplicons;

my $convert_name = sub {
my $name = shift;
$name =~ s/-/_/gsmx;
return join q[_], 'metric', lc $name;
};

my @per_amplicon_data = ();

for my $i ((1 .. $num_amplicons)) {
my $idata = clone($common_data);
$idata->{amplicon_index} = $i;
for my $name ( keys %{$astats} ) {
my $array = $astats->{$name};
$array and (ref $array eq q[ARRAY]) or next;
my $value = $array->[$i-1];
defined $value or croak 'Array length mismatch';
$idata->{ $convert_name->($name) } = $value;
}
push @per_amplicon_data, $idata;
}

return \@per_amplicon_data;
}
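
An illustrative call of the helper above, assuming a two-amplicon stats
document; the metric names, the samtools command and all values are invented
for this sketch and only the transformation logic follows the code:

my $astats = {
  num_amplicons => 2,
  'FREADS'      => [1200, 950],    # becomes metric_freads
  'FPCOV-10'    => [100.0, 87.5],  # becomes metric_fpcov_10
};
my $info = { Samtools_command =>
  'samtools ampliconstats /refs/primer_panel/nCoV-2019/V3/nCoV-2019.bed in.bam' };
my $common = { pp_name => 'ncov2019-artic-nf_ampliconstats', pp_version => '0.1' };
my $rows = _astats_data($astats, $info, $common);
# Two records are returned, one per amplicon, each carrying the common data
# plus primer_panel => 'nCoV-2019/V3/nCoV-2019.bed' and
# primer_panel_num_amplicons => 2, for example:
# $rows->[0]->{metric_fpcov_10} == 100.0, $rows->[1]->{metric_freads} == 950.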

sub _generic {
my ($self, $result, $c) = @_;

$self->mlwh or return ();
$result->pp_name or croak 'pp_name attribute should be defined';
($result->pp_name eq $ARTIC_PP_NAME) or return ();

my $data = $self->_basic_data($c);
my $prefix = $self->get_column_prefix4pp_name($ARTIC_PP_NAME);
$data->{$prefix . 'pp_name'} = $result->pp_name;
$data->{$prefix . 'pp_version'} = $result->info->{'Pipeline_version'};
my $key = 'supplier_sample_name';
$data->{$prefix . $key} = $result->doc->{'meta'}->{$key};

foreach my $name (qw/ num_aligned_reads
longest_no_N_run
pct_covered_bases
pct_N_bases
qc_pass / ) {
my $pname = $name eq 'qc_pass' ? 'artic_qc_outcome' : lc $name;
$data->{$prefix . $pname} = $result->doc->{'QC summary'}->{$name};
}
my $basic_data = $self->_basic_data($c);
my $data = {};
$data->{'pp_name'} = $result->pp_name;
$data->{'pp_version'} = $result->info->{'Pipeline_version'};

if ($result->pp_name eq 'ncov2019-artic-nf') {
foreach my $name (qw/ num_aligned_reads
longest_no_N_run
pct_covered_bases
pct_N_bases
qc_pass / ) {
my $pname = $name eq 'qc_pass' ? 'artic_qc_outcome' : lc $name;
$data->{$pname} = $result->doc->{'QC summary'}->{$name};
}
my $key = 'supplier_sample_name';
$data->{$key} = $result->doc->{'meta'}->{$key};
$basic_data->{$PP_KEY} = {$result->pp_name => $data};

} elsif ($result->pp_name =~ /ampliconstats/xms) {
my $astats = $result->doc->{amplicon_stats};
if ($astats and keys %{$astats}) {
$basic_data->{$PP_KEY} =
{$result->pp_name => _astats_data($astats, $result->info, $data)};
}
}

return ($data);
return $basic_data->{$PP_KEY} ? ($basic_data) : ();
}

sub _interop {
@@ -481,7 +531,17 @@ sub _add_data {
if (exists $autoqc->{$digest}) {
delete $data->{'composition'};
while (my ($column_name, $value) = each %{$data}) {
$autoqc->{$digest}->{$column_name} = $value;
if (ref $value eq 'HASH') {
($column_name eq $PP_KEY) or croak "Unexpected key $column_name";
my @keys = keys %{$value};
(@keys == 1) or croak 'Invalid number of keys';
my $key = $keys[0];
# Be careful not to overwrite data from other pipelines, which
# might already be hashed under $PP_KEY.
$autoqc->{$digest}->{$column_name}->{$key} = $value->{$key};
} else {
$autoqc->{$digest}->{$column_name} = $value;
}
}
} else {
$autoqc->{$digest} = $data;
@@ -490,38 +550,6 @@
return;
}
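
To make the merge above concrete, a small sketch of the intended end state for
one product after both generic results have been folded in; the digest, column
names and values are invented here, only the nesting follows the code:

# $autoqc->{$digest} = {
#   id_run => 25710, position => 1, tag_index => 58,   # non-pp columns
#   pp => {
#     # first generic result, already present
#     'ncov2019-artic-nf'               => { artic_qc_outcome => 'TRUE' },
#     # second generic result merged in without touching the key above
#     'ncov2019-artic-nf_ampliconstats' => [ { amplicon_index => 1 } ],
#   },
# };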

=head2 get_column_prefix4pp_name

Class method. Given a portable pipeline name, returns a full
prefix that is prepended to the hash keys under which the QC
results for this portable pipeline are saved. To have correct
column names for uploading the data to mlwh, this prefix has
to be removed.

If a table 'iseq_XX_product_metrics' has a column 'my_yield',
and the name of the portable pipeline is 'oak', the value of
'my_yield' will be saved under the 'pp.oak.my_yield' key. The
'pp.oak.' portion of the key will have to be removed before
uploading the data to the 'iseq_XX_product_metrics' table. Given
'oak' as an argument, this method returns the full prefix that
has to be removed.

# as class method
my $prefix = npg_warehouse::loader::autoqc
->get_column_prefix4pp_name('oak');
print $prefix; # pp.oak.

# as instance method
$obj->get_column_prefix4pp_name('oak');

=cut

sub get_column_prefix4pp_name {
my ($self, $pp_name) = @_;
$pp_name or croak 'Pipeline name is required';
return join q[], ($PP_PREFIX, $pp_name, q[.]);
}

=head2 retrieve

Retrieves autoqc results for a run.
lib/npg_warehouse/loader/product.pm: 3 additions & 5 deletions
@@ -211,13 +211,11 @@ sub _indexed_lanes_hash {
sub _filter_column_names {
my ($self, $values) = @_;

my $pp_prefix = $npg_warehouse::loader::autoqc::PP_PREFIX;
# No harm deleting a key that might not exist.
delete $values->{$npg_warehouse::loader::autoqc::PP_KEY};

my @columns = keys %{$values};
foreach my $name (@columns) {
if ($name =~ /\A$pp_prefix/smx) {
delete $values->{$name};
next;
}
my $old_name = $name;
my $count = $name =~ s/\Atag_sequence\Z/tag_sequence4deplexing/xms;
if (!$count) {