diff --git a/.gitignore b/.gitignore
index 335f59c..ac4b28d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,4 +11,10 @@ test_output_renamed/
 
 # Other
 README.html
-old
\ No newline at end of file
+old
+temp
+**/old_working
+old_working
+tests/subset.bam
+**/new_not_right
+nextflow_schema.json_old
\ No newline at end of file
diff --git a/README.md b/README.md
index 284f72e..e6b95af 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # seqWell LongPlex Demultiplex Nextflow Pipeline
 
-[![CI](https://github.com/seqwell/LongPlex/actions/workflows/nextflow-tests.yml/badge.svg?branch=main)](https://github.com/seqwell/LongPlex/actions/workflows/nextflow-tests.yml?query=branch%3Amain)
+[![Nextflow Workflow Tests](https://github.com/seqwell/LongPlex/actions/workflows/nextflow-tests.yml/badge.svg?branch=main)](https://github.com/seqwell/LongPlex/actions/workflows/nextflow-tests.yml?query=branch%3Amain)
 [![Nextflow](https://img.shields.io/badge/Nextflow%20DSL2-%E2%89%A523.11-blue.svg)](https://www.nextflow.io/)
 
 This is the Nextflow pipeline to demultiplex PacBio HiFi data for the seqWell LongPlex Long Fragment Multiplexing Kit.
@@ -12,12 +12,15 @@ The pipeline starts with HiFi BAM files and has the following steps:
    This setting will demultiplex reads with both an i7 and i5 seqWell barcode sequence.
 2. The `LIST_HYBRIDS` and `REMOVE_HYBRIDS` processes identify and remove any reads with mismatched i7 and i5 seqWell barcode sequences in the remaining non-demultiplexed reads.
 3. The second Lima process, `LIMA_EITHER_END`, demultiplexes reads with only an i7 or i5 seqWell barcode sequence.
-4. The BAM files for each sample within each pool are merged in the `MERGE_READS` process and FASTQ files are created.
+4. The BAM files for each sample within each pool are merged in the `MERGE_READS` process and merged FASTQ files and bam files are created.
 5. The `DEMUX_STATS` process generates a summary of the demultiplexing steps.
 6. If a `rename_map` is provided, the `RENAME_DEMUX_STATS` process renames the sample identifiers in the demultiplexing summary to match the user-defined sample names.
 7. `NANOSTAT` and `MULTIQC` are used to generate summary metrics for the reads assigned to each sample in the pool.
+8. `NANOSTAT_UNBARCODED` generates sequencing metrics for the unbarcoded reads remaining after both lima steps.
+   Because the unbarcoded BAM is unaligned, reads are first converted to FASTQ via pysam before being passed to NanoStat.
+9. `DEMUX_QC` combines lima barcode statistics, per-sample NanoStat results, and unbarcoded NanoStat results into two final output tables per pool: a per-well stats table and a per-pool summary table.
 
-The final output from this pipeline includes Lima output files, demultiplexed BAM and FASTQ files, a demultiplexing summary, and a MultiQC report collating NanoStat results.
+The final output from this pipeline includes Lima output files, demultiplexed BAM and FASTQ files, a demultiplexing summary, a MultiQC report collating NanoStat results, and comprehensive per-pool and per-well demux statistics.
 
 ![Fig1. LongPlex Workflow](./docs/LongPlex_Workflow.png)
 
@@ -32,15 +35,17 @@ All docker containers used in this pipeline are publicly available.
 
 - *lima*: quay.io/biocontainers/lima:2.13.0--h9ee0642_0
 - *samtools*: quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1
-- *longplexpy*: seqwell/longplexpy:latest
+- *longplexpy*: seqwell/longplexpy:0.2.1
+- *picard*: quay.io/biocontainers/picard:3.2.0--hdfd78af_0
 - *R*: rocker/verse:4.3.1
 - *nanostat*: quay.io/biocontainers/nanostat:1.6.0--pyhdfd78af_0
 - *multiqc*: quay.io/biocontainers/multiqc:1.21--pyhdfd78af_0
-- *python*: python:3.12-slim-bookworm
+- *python*: python:3.12-bookworm
+- *pandas*: quay.io/biocontainers/pandas:1.5.2
 
 ## Conda Environment
 
-The conda environment is defined in `environment-pipeline.yml` and will be built automatically if the pipeline is run with `-profile conda`.
+The conda environment is defined in `environment-pipeline.yml` and will be built automatically if the pipeline is run with `-profile conda`. Note that this profile is only supported on Linux systems, as **lima (v2.13.0)** is only available for Linux.
 
 # How to run the pipeline:
 
@@ -75,13 +80,14 @@ If it is an AWS S3 URI, please make sure to [set your security credentials appro
 
 ### `rename_map`
 
-`rename_map` is the path to a CSV file used to rename output BAM and FASTQ files, as well as the sample identifiers in the demultiplexing summary.
-If not provided, output files and demultiplexing summary will use `pool_ID.well_ID` as the default sample identifier.
+`rename_map` is the path to a CSV file used to rename output BAM and FASTQ files, as well as the sample identifiers in the demultiplexing summary and the `DEMUX_QC` output tables.
+If not provided, output files and all summary tables will use `pool_ID.well_ID` as the default sample identifier.
 
 There are two required columns:
 
-- *pool_ID.well_ID*: The default sample identifier in the format `pool_ID.well_ID` (e.g. `bc1015.A01`).
-- *sample_ID*: The desired output sample name (e.g. `bc1015.sample1`).
+- *pool_ID.well_ID*: The default sample identifier in the format `pool_ID.well_ID` (e.g. `bc1015.A01`).
+  The formatting is strict — the pool ID and well ID must be joined with a `.` (not `_` or any other character). The well ID must follow the format of a letter `A–H` followed by a **two-digit** number (e.g. A01, B12); single-digit row numbers must be zero-padded (e.g. `A1` is invalid, use `A01`).
+- *sample_ID*: The desired output sample name (e.g. `bc1015.sample1`). Unlike `pool_ID.well_ID`, underscores (_) are accepted as connectors within the sample name (e.g. `bc1015_sample1` is also valid).
 
 Example (`tests/sample_map.csv`):
 
@@ -95,7 +101,12 @@ Example (`tests/sample_map.csv`):
 | bc1015.B03 | bc1015.sample6 |
 | bc1015.C01 | bc1015.sample7 |
 
-When `rename_map` is provided, the `RENAME_DEMUX_STATS` process will also produce a renamed version of the demultiplexing summary CSV with the user-defined sample names applied.
+When `rename_map` is provided:
+- The `RENAME_DEMUX_STATS` process produces a renamed version of the demultiplexing summary CSV with the user-defined sample names applied.
+- The `DEMUX_QC` process uses the map to populate the `Sample_Name` column in the per-well stats table.
+  The `Barcode` column always retains the original `pool_ID.well_ID` key (e.g. `bc1015.A01`) regardless of renaming.
+- When multiple pools are present in the `pool_sheet`, the `rename_map` may contain entries for all pools.
+  Each pool's `DEMUX_QC` run will automatically filter the map to only its own entries using the `pool_ID` prefix, ensuring no cross-pool mixing.
 
 ## Profiles:
 
@@ -134,7 +145,7 @@ nextflow run \
     --output "${PWD}/test_output" \
     -with-report \
     -with-trace \
-    -resume
+    -resume -bg
 ```
 
 The pipeline can be run using included test data with BAM and FASTQ file renaming:
@@ -149,7 +160,7 @@ nextflow run \
     --rename_map "${PWD}/tests/sample_map.csv" \
     -with-report \
     -with-trace \
-    -resume
+    -resume -bg
 ```
 
 ## With Conda
@@ -163,7 +174,7 @@ nextflow run \
     --output "${PWD}/test_output" \
     -with-report \
     -with-trace \
-    -resume
+    -resume -bg
 ```
 
 ## Expected Outputs
@@ -194,14 +205,51 @@ test_output/
 │   ├── merged_bam/
 │   │   ├── bc1015.[BARCODE_WELL/sample_ID].bam  # Merged BAM file for specific barcode well; sample_ID is used if rename_map is provided, otherwise barcode_well is used (e.g. bc1015.A01)
 │   │   └── ...
-│   └── merged_fastq/
-│       ├── bc1015.[BARCODE_WELL/sample_ID].fastq.gz  # Merged FASTQ file for specific barcode well; sample_ID is used if rename_map is provided, otherwise barcode_well is used (e.g. bc1015.A01)
-│       └── ...
-├── logs/
-│   ├── execution_report_[DATE-TIME-STAMP].html  # Nextflow execution report
-│   ├── execution_timeline_[DATE-TIME-STAMP].html  # Nextflow execution timeline
-│   ├── execution_trace_[DATE-TIME-STAMP].txt  # Nextflow execution trace
-│   └── pipeline_dag_[DATE-TIME-STAMP].html  # Nextflow pipeline DAG
-└── multiqc/
-    └── [DATE-TIME-STAMP]_multiqc_report.html  # MultiQC report including NanoStat results
+│   ├── merged_fastq/
+│   │   ├── bc1015.[BARCODE_WELL/sample_ID].fastq.gz  # Merged FASTQ file for specific barcode well; sample_ID is used if rename_map is provided, otherwise barcode_well is used (e.g. bc1015.A01)
+│   │   └── ...
+│   └── demux_qc/
+│   │   ├── bc1015_per_barcode_qc_report.csv  # Per-barcode QC report for pool bc1015
+│   │   └── bc1015_per_pool_qc_report.csv  # Per-pool QC report for pool bc1015
+|   └── multiqc/
+|       └── bc1015_multiqc_report.html  # MultiQC report including NanoStat results
+└── logs/
+    ├── execution_report_[DATE-TIME-STAMP].html  # Nextflow execution report
+    ├── execution_timeline_[DATE-TIME-STAMP].html  # Nextflow execution timeline
+    ├── execution_trace_[DATE-TIME-STAMP].txt  # Nextflow execution trace
+    └── pipeline_dag_[DATE-TIME-STAMP].html  # Nextflow pipeline DAG
+
 ```
+
+### Per-well stats table (`{pool_ID}/demux_qc/{pool_ID}_per_barcode_qc_report.csv`)
+
+One row per well. Rows with no HiFi yield (bleed-through barcodes with negligible reads) are excluded.
+
+| Column | Description |
+|---|---|
+| `Sample_Name` | User-defined sample name from `rename_map`, or `pool_ID.well_ID` if not provided |
+| `Barcode` | Original `pool_ID.well_ID` key (e.g. `bc1015.A01`): always the well identifier regardless of renaming |
+| `Barcode_Quality` | Mean `ScoreCombined` from both `.lima.report` files across all reads assigned to this well |
+| `HiFi_Reads_count` | Total reads assigned to this well: both-end reads plus either-end reads (P5-only and P7-only rows summed per well) |
+| `Mean_HiFi_Read_Length` | Mean read length from NanoStat on the merged BAM for this well |
+| `Median_HiFi_Read_Quality` | Median read quality (QV) from NanoStat on the merged BAM for this well |
+| `HiFi_Yield` | Total bases from NanoStat on the merged BAM for this well |
+
+### Per-pool summary table (`{pool_ID}/demux_qc/{pool_ID}_per_pool_qc_report.csv`)
+
+One summary table per pool covering the full run.
+
+| Metric | Description |
+|---|---|
+| Unique Barcodes | Number of wells with assigned reads and non-zero yield |
+| Barcoded HiFi Reads | Total reads assigned to any barcode across both lima steps |
+| Unbarcoded HiFi Reads | Reads not assigned after both lima steps (from lima counts) |
+| Barcoded HiFi Reads (%) | Fraction of total reads that are barcoded |
+| Barcoded HiFi Yield (Gb) | Total bases across all barcoded wells |
+| Unbarcoded HiFi Yield (Gb) | Total bases in the unbarcoded BAM from `NANOSTAT_UNBARCODED` |
+| Barcoded HiFi Yield (%) | Fraction of total yield that is barcoded |
+| Mean HiFi Reads per Barcode | Mean read count across all wells |
+| Max HiFi Reads per Barcode | Highest read count across all wells |
+| Min HiFi Reads per Barcode | Lowest read count across all wells |
+| Barcoded HiFi Read Length (mean, kb) | Weighted mean read length across all barcoded wells |
+| Unbarcoded HiFi Read Length (mean, kb) | Mean read length from `NANOSTAT_UNBARCODED` |
diff --git a/bin/merge_demux_stats.py b/bin/merge_demux_stats.py
new file mode 100755
index 0000000..b2700a5
--- /dev/null
+++ b/bin/merge_demux_stats.py
@@ -0,0 +1,492 @@
+#!/usr/bin/env python3
+import argparse
+import pandas as pd
+import sys
+import re
+from pathlib import Path
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="Merge two-step lima + NanoStat stats into demux table")
+    parser.add_argument("--report-both", required=True, help="lima_both.lima.report")
+    parser.add_argument("--report-either", required=True, help="lima_either.lima.report")
+    parser.add_argument("--counts-both", required=True, help="lima_both.lima.counts")
+    parser.add_argument("--counts-either", required=True, help="lima_either.lima.counts")
+    parser.add_argument("--summary-both", required=True, help="lima_both.lima.summary")
+    parser.add_argument("--nanostat", required=True, nargs="+", help="Per-sample NanoStat .txt files")
+    parser.add_argument("--unbarcoded-nanostat", required=True, help="NanoStat .txt for unbarcoded BAM")
+    parser.add_argument("--sample-map", required=False, default=None,
+                        help="Optional CSV with pool_ID.well_ID and sample_ID columns")
+    parser.add_argument("--output", required=True, help="Output CSV for per-barcode stats")
+    parser.add_argument("--summary-output", required=True, help="Output CSV for per-pool summary")
+    return parser.parse_args()
+
+
+def extract_well(barcode_name):
+    """
+    Extract well ID from lima barcode name.
+    e.g. seqwell_UDI1_A01_P7 -> A01
+         seqwell_UDI1_A01_P5 -> A01
+    """
+    m = re.search(r'([A-H]\d{2})', str(barcode_name))
+    return m.group(1) if m else None
+
+
+def infer_pool_id(counts_both_path):
+    """
+    Infer pool_ID from the both-end counts filename.
+    e.g. i7_i5_bc1015.lima.counts -> bc1015
+    """
+    stem = Path(counts_both_path).stem  # i7_i5_bc1015.lima
+    stem = stem.replace(".lima", "")  # i7_i5_bc1015
+    pool_id = stem.split("_")[-1]  # bc1015
+    return pool_id
+
+
+def parse_lima_summary_input_reads(path):
+    """
+    Parse total input HiFi reads from lima_both.lima.summary.
+    Looks for a line containing 'input' and extracts the first integer.
+    e.g. "Reads input : 11014" -> 11014
+    """
+    with open(path) as f:
+        for line in f:
+            if "input" in line.lower():
+                m = re.search(r'(\d+)', line)
+                if m:
+                    return int(m.group(1))
+    print(f"[WARN] Could not parse input reads from {path}", file=sys.stderr)
+    return None
+
+
+def load_report(path, step_label):
+    """
+    Load lima .lima.report file.
+    Uses ScoreCombined which matches MeanScore in .lima.counts.
+    Returns dataframe with columns: Barcode, Score, step, well_ID
+    """
+    df = pd.read_csv(path, sep="\t")
+    df.columns = [c.strip() for c in df.columns]
+
+    score_col = None
+    for candidate in ["ScoreCombined", "ScoreLead", "Score"]:
+        if candidate in df.columns:
+            score_col = candidate
+            break
+
+    barcode_col = next((c for c in df.columns if "CombinedNamed" in c or c == "BarcodeNamed"), None)
+
+    if not score_col or not barcode_col:
+        print(f"[ERROR] Cannot find Score or Barcode column in {path}", file=sys.stderr)
+        print(f" Available columns: {list(df.columns)}", file=sys.stderr)
+        sys.exit(1)
+
+    print(f"[info] Using score column: '{score_col}' from {path}", file=sys.stderr)
+
+    df = df[[barcode_col, score_col]].copy()
+    df.columns = ["Barcode", "Score"]
+    df["step"] = step_label
+    df["well_ID"] = df["Barcode"].apply(extract_well)
+    return df
+
+
+def load_counts(path, group_by_well=False):
+    """
+    Load lima .lima.counts file.
+    group_by_well=False: both-end — one row per well (P7 name only)
+    group_by_well=True: either-end — two rows per well (P5 + P7), sum them
+    Returns (df with columns: well_ID, Counts), not_barcoded_count
+    """
+    df = pd.read_csv(path, sep="\t")
+    df.columns = [c.strip() for c in df.columns]
+
+    barcode_col = next((c for c in df.columns if "CombinedNamed" in c or c == "BarcodeNamed"), None)
+    count_col = next((c for c in df.columns if c.lower() == "counts"), None)
+
+    if not barcode_col or not count_col:
+        print(f"[ERROR] Cannot find Barcode or Counts column in {path}", file=sys.stderr)
+        print(f" Available columns: {list(df.columns)}", file=sys.stderr)
+        sys.exit(1)
+
+    df = df[[barcode_col, count_col]].copy()
+    df.columns = ["Barcode", "Counts"]
+    df["well_ID"] = df["Barcode"].apply(extract_well)
+
+    not_barcoded_count = df[df["well_ID"].isna()]["Counts"].sum()
+    df_barcoded = df[df["well_ID"].notna()].copy()
+
+    if group_by_well:
+        # Either-end: P5 and P7 are separate rows — sum per well
+        df_barcoded = (
+            df_barcoded
+            .groupby("well_ID")
+            .agg(Counts=("Counts", "sum"))
+            .reset_index()
+        )
+    else:
+        # Both-end: already one row per well (P7 name)
+        df_barcoded = df_barcoded[["well_ID", "Counts"]].copy()
+
+    return df_barcoded, not_barcoded_count
+
+
+def load_sample_map(path, pool_id=None):
+    """
+    Load sample map CSV with columns: pool_ID.well_ID, sample_ID
+    Filters to only entries matching pool_id if provided.
+    Returns:
+      - df with columns: pool_well_key, Sample_Name, well_ID
+      - sample_to_well dict: sample_id_suffix -> well_ID
+    """
+    df = pd.read_csv(path)
+    df.columns = [c.strip() for c in df.columns]
+
+    well_col = next((c for c in df.columns if "well" in c.lower() or "pool" in c.lower()), None)
+    sample_col = next((c for c in df.columns if "sample" in c.lower()), None)
+
+    if not well_col or not sample_col:
+        print(f"[ERROR] Cannot find pool_ID.well_ID or sample_ID columns in {path}", file=sys.stderr)
+        print(f" Available columns: {list(df.columns)}", file=sys.stderr)
+        sys.exit(1)
+
+    df = df[[well_col, sample_col]].rename(
+        columns={well_col: "pool_well_key", sample_col: "Sample_Name"}
+    )
+    df["well_ID"] = df["pool_well_key"].apply(lambda x: x.split(".")[-1])
+
+    # Filter to only this pool's entries
+    if pool_id:
+        before = len(df)
+        df = df[df["pool_well_key"].str.startswith(pool_id + ".")].copy()
+        print(f"[info] Sample map filtered: {before} -> {len(df)} entries for pool {pool_id}",
+              file=sys.stderr)
+
+    if df.empty:
+        print(f"[WARN] No sample map entries found for pool {pool_id}", file=sys.stderr)
+
+    # Reverse lookup: sample_id suffix -> well_ID
+    sample_to_well = dict(
+        zip(
+            df["Sample_Name"].apply(lambda x: x.split(".")[-1]),
+            df["well_ID"]
+        )
+    )
+    print(f"[debug] sample_to_well for {pool_id}: {sample_to_well}", file=sys.stderr)
+
+    return df, sample_to_well
+
+
+def _parse_nanostat_metrics(path):
+    """
+    Shared parser for NanoStat --tsv format:
+        number_of_reads 344
+        number_of_bases 1575809.0
+        mean_read_length 4580.8
+        median_qual 39.8
+    Returns dict of key -> float.
+    """
+    metrics = {}
+    with open(path) as f:
+        for line in f:
+            parts = line.strip().split("\t")
+            if len(parts) == 2:
+                key, val = parts[0].strip(), parts[1].strip()
+                try:
+                    metrics[key] = float(val.replace(",", ""))
+                except ValueError:
+                    pass
+    if not metrics:
+        print(f"[WARN] No metrics parsed from {path}", file=sys.stderr)
+    return metrics
+
+
+def parse_nanostat_file(path, sample_to_well=None):
+    """
+    Parse a per-well NanoStat file.
+    Filename: bc1015.sample1_nanostat.txt -> pool_well_key=bc1015.sample1
+    well_ID resolved via sample_to_well lookup, else regex fallback.
+    """
+    stem = Path(path).stem  # bc1015.sample1_nanostat
+    pool_well_key = stem.replace("_nanostat", "").strip()  # bc1015.sample1
+    suffix = pool_well_key.split(".")[-1]  # sample1 or A01
+
+    if sample_to_well and suffix in sample_to_well:
+        well_id = sample_to_well[suffix]
+    else:
+        m = re.search(r'([A-H]\d{2})', suffix)
+        well_id = m.group(1) if m else suffix
+
+    metrics = _parse_nanostat_metrics(path)
+
+    return {
+        "pool_well_key": pool_well_key,
+        "well_ID": well_id,
+        "MeanReadLength": metrics.get("mean_read_length"),
+        "MedianQuality": metrics.get("median_qual"),
+        "TotalBases": metrics.get("number_of_bases"),
+    }
+
+
+def parse_unbarcoded_nanostat(path):
+    """
+    Parse NanoStat output for the unbarcoded BAM.
+    Returns dict with MeanReadLength and TotalBases.
+    """
+    metrics = _parse_nanostat_metrics(path)
+    return {
+        "MeanReadLength": metrics.get("mean_read_length"),
+        "TotalBases": metrics.get("number_of_bases"),
+    }
+
+
+def write_pool_summary(final_df, total_input_reads, total_demuxed_reads,
+                       unbarcoded_nano, output_path):
+    """
+    Generate per-pool run-level summary table.
+
+    total_input_reads  : true HiFi reads before any lima demux (from lima.summary)
+    total_demuxed_reads: all reads lima assigned to any barcode (from counts files)
+    not_barcoded_reads : derived as total_input - total_demuxed
+
+    NanoStat runs on every well regardless of sample map so no unused-barcode
+    concept applies — two buckets only: barcoded and unbarcoded.
+
+    Invariant:
+        total_demuxed_reads + not_barcoded_reads == total_input_reads
+    """
+    # Barcoded = everything lima successfully demuxed (ground truth from counts)
+    barcoded_reads = total_demuxed_reads
+    not_barcoded_reads = total_input_reads - total_demuxed_reads
+    total_reads = total_input_reads
+
+    barcoded_yield = final_df["HiFi_Yield(bp)"].sum()
+    unbarcoded_yield = unbarcoded_nano.get("TotalBases") or 0
+    total_yield = barcoded_yield + unbarcoded_yield
+
+    valid = final_df[final_df["Mean_HiFi_Read_Length(bp)"].notna()].copy()
+    if len(valid) > 0:
+        barcoded_mean_len = (
+            (valid["Mean_HiFi_Read_Length(bp)"] * valid["HiFi_Reads_count"]).sum()
+            / valid["HiFi_Reads_count"].sum()
+        )
+    else:
+        barcoded_mean_len = None
+
+    unbarcoded_mean_len = unbarcoded_nano.get("MeanReadLength")
+
+    def fmt_gb(val):
+        return f"{val / 1e9:.5f} Gb" if val else "N/A"
+
+    def fmt_kb(val):
+        return f"{val / 1000:.2f} kb" if val else "N/A"
+
+    def fmt_pct(num, den):
+        return f"{100 * num / den:.2f}%" if den > 0 else "N/A"
+
+    rows = [
+        ("HiFi Reads (before demux)",
+         str(total_input_reads)),
+
+        ("Unique Barcodes",
+         str(int(final_df["Barcode"].nunique()))),
+
+        ("Barcoded HiFi Reads",
+         str(barcoded_reads)),
+
+        ("Unbarcoded HiFi Reads",
+         str(int(not_barcoded_reads))),
+
+        ("Barcoded HiFi Reads (%)",
+         fmt_pct(barcoded_reads, total_reads)),
+
+        ("Unbarcoded HiFi Reads (%)",
+         fmt_pct(not_barcoded_reads, total_reads)),
+
+        ("Barcoded HiFi Yield (Gb)",
+         fmt_gb(barcoded_yield)),
+
+        ("Unbarcoded HiFi Yield (Gb)",
+         fmt_gb(unbarcoded_yield)),
+
+        ("Barcoded HiFi Yield (%)",
+         fmt_pct(barcoded_yield, total_yield)),
+
+        ("Mean HiFi Reads per Barcode",
+         str(int(barcoded_reads / max(final_df["Barcode"].nunique(), 1)))),
+
+        ("Max HiFi Reads per Barcode",
+         str(int(final_df["HiFi_Reads_count"].max()))),
+
+        ("Min HiFi Reads per Barcode",
+         str(int(final_df["HiFi_Reads_count"].min()))),
+
+        ("Barcoded HiFi Read Length (mean kb)",
+         fmt_kb(barcoded_mean_len)),
+
+        ("Unbarcoded HiFi Read Length (mean kb)",
+         fmt_kb(unbarcoded_mean_len)),
+    ]
+
+    summary_df = pd.DataFrame(rows, columns=["Analysis_Metric", "Value"])
+
+    # Write without quoting — avoids pandas adding quotes around values with commas
+    with open(output_path, "w") as f:
+        f.write("Analysis_Metric,Value\n")
+        for _, row in summary_df.iterrows():
+            f.write(f"{row['Analysis_Metric']},{row['Value']}\n")
+
+    print(f"[done] Pool summary written: {output_path}", file=sys.stderr)
+    print(summary_df.to_string(index=False))
+
+
+def main():
+    args = parse_args()
+
+    # ── Infer pool_ID from counts filename ────────────────────────────────────
+    pool_id = infer_pool_id(args.counts_both)
+    print(f"[info] Inferred pool_ID: {pool_id}", file=sys.stderr)
+
+    # ── 0. True input read count from lima_both.lima.summary ─────────────────
+    total_input_reads = parse_lima_summary_input_reads(args.summary_both)
+    if total_input_reads is None:
+        print("[ERROR] Could not determine total input reads — aborting", file=sys.stderr)
+        sys.exit(1)
+    print(f"[info] Total input HiFi reads (before demux): {total_input_reads}", file=sys.stderr)
+
+    # ── 1. Barcode Quality — per-read ScoreCombined, grouped by well_ID ──────
+    rep_both = load_report(args.report_both, "both")
+    rep_either = load_report(args.report_either, "either")
+    all_reads = pd.concat([rep_both, rep_either], ignore_index=True)
+
+    all_reads_barcoded = all_reads[all_reads["well_ID"].notna()].copy()
+
+    bq = (
+        all_reads_barcoded
+        .groupby("well_ID")
+        .agg(Barcode_Quality=("Score", "mean"))
+        .round({"Barcode_Quality": 1})
+        .reset_index()
+    )
+    print(f"[debug] Barcode quality computed for {len(bq)} wells", file=sys.stderr)
+
+    # ── 2. HiFi Read Counts ───────────────────────────────────────────────────
+    cnt_both_df, nb_both = load_counts(args.counts_both, group_by_well=False)
+    cnt_either_df, nb_either = load_counts(args.counts_either, group_by_well=True)
+
+    print(f"[debug] nb_both={nb_both} nb_either={nb_either}", file=sys.stderr)
+
+    counts_by_well = pd.merge(
+        cnt_both_df.rename(columns={"Counts": "Counts_both"}),
+        cnt_either_df.rename(columns={"Counts": "Counts_either"}),
+        on="well_ID", how="outer"
+    ).fillna(0)
+    counts_by_well["HiFi_Reads_count"] = (
+        counts_by_well["Counts_both"] + counts_by_well["Counts_either"]
+    )
+    counts_by_well = counts_by_well[["well_ID", "HiFi_Reads_count"]]
+
+    # Ground-truth demux count — all reads lima assigned to any barcode.
+    # Captured here from counts files, consistent with the R script.
+    # not_barcoded = input - demuxed, matches reads_before_demux logic exactly.
+    total_demuxed_reads = int(counts_by_well["HiFi_Reads_count"].sum())
+    not_barcoded_reads = total_input_reads - total_demuxed_reads
+
+    print(f"[debug] Total demuxed (all barcodes): {total_demuxed_reads}", file=sys.stderr)
+    print(f"[debug] True unbarcoded (input - demuxed): {not_barcoded_reads}", file=sys.stderr)
+    print(f"[debug] Wells with read counts: {len(counts_by_well)}", file=sys.stderr)
+
+    # ── 3. Sample map (before NanoStat — needed for well_ID lookup) ───────────
+    if args.sample_map:
+        sample_map_df, sample_to_well = load_sample_map(args.sample_map, pool_id=pool_id)
+        print(f"[info] Sample map loaded: {len(sample_map_df)} entries for {pool_id}",
+              file=sys.stderr)
+    else:
+        sample_map_df = None
+        sample_to_well = None
+        print("[info] No sample map — Sample_Name defaults to pool_well_key", file=sys.stderr)
+
+    # ── 4. NanoStat per well — filter to this pool only ──────────────────────
+    pool_nanostat = [
+        f for f in args.nanostat
+        if Path(f).name.startswith(pool_id + ".")
+        or Path(f).name.startswith(pool_id + "_")
+    ]
+    if not pool_nanostat:
+        print(f"[WARN] No NanoStat files matched pool {pool_id} — using all files",
+              file=sys.stderr)
+        pool_nanostat = args.nanostat
+
+    print(f"[debug] NanoStat files for pool {pool_id}: "
+          f"{[Path(f).name for f in pool_nanostat]}", file=sys.stderr)
+
+    nano_rows = [parse_nanostat_file(f, sample_to_well=sample_to_well)
+                 for f in pool_nanostat]
+    nano = pd.DataFrame(nano_rows)
+    print(f"[debug] NanoStat wells:\n{nano[['pool_well_key','well_ID']].to_string()}",
+          file=sys.stderr)
+
+    # ── 5. NanoStat unbarcoded ────────────────────────────────────────────────
+    unbarcoded_nano = parse_unbarcoded_nanostat(args.unbarcoded_nanostat)
+    print(f"[debug] Unbarcoded NanoStat: {unbarcoded_nano}", file=sys.stderr)
+
+    # ── 6. Merge everything on well_ID ────────────────────────────────────────
+    final = (
+        bq
+        .merge(counts_by_well, on="well_ID", how="outer")
+        .merge(
+            nano[["well_ID", "pool_well_key", "MeanReadLength", "MedianQuality", "TotalBases"]],
+            on="well_ID", how="left"
+        )
+    )
+
+    # Sample_Name and Barcode resolution
+    if sample_map_df is not None and not sample_map_df.empty:
+        final = final.merge(
+            sample_map_df[["well_ID", "Sample_Name", "pool_well_key"]],
+            on="well_ID", how="left",
+            suffixes=("", "_map")
+        )
+        final["Sample_Name"] = final["Sample_Name"].fillna(
+            final["pool_well_key_map"].fillna(final["well_ID"])
+        )
+        final["Barcode"] = final["pool_well_key_map"].fillna(final["well_ID"])
+    else:
+        final["Sample_Name"] = final["pool_well_key"].fillna(final["well_ID"])
+        final["Barcode"] = final["pool_well_key"].fillna(final["well_ID"])
+
+    # ── 7. Filter: keep only rows with HiFi Yield reported ───────────────────
+    # Drops wells where NanoStat produced no output (true zero-read bleed-through).
+    # Does not affect per-pool read counts — those come from total_demuxed_reads.
+    before = len(final)
+    final = final[final["TotalBases"].notna()].copy()
+    after = len(final)
+    print(f"[info] Removed {before - after} rows with no HiFi Yield (zero-read wells)",
+          file=sys.stderr)
+
+    # ── 8. Select, rename, sort, write per-barcode CSV ────────────────────────
+    final = final[[
+        "Sample_Name", "Barcode", "Barcode_Quality",
+        "HiFi_Reads_count", "MeanReadLength", "MedianQuality", "TotalBases"
+    ]].rename(columns={
+        "MeanReadLength": "Mean_HiFi_Read_Length(bp)",
+        "MedianQuality": "Median_HiFi_Read_Quality",
+        "TotalBases": "HiFi_Yield(bp)",
+    })
+
+    final = final.sort_values("HiFi_Reads_count", ascending=False)
+    final.to_csv(args.output, sep=",", index=False)
+    print(f"[done] Per-barcode stats written: {args.output}", file=sys.stderr)
+    print(final.to_string(index=False))
+
+    # ── 9. Write per-pool summary ─────────────────────────────────────────────
+    write_pool_summary(
+        final_df=final,
+        total_input_reads=total_input_reads,
+        total_demuxed_reads=total_demuxed_reads,
+        unbarcoded_nano=unbarcoded_nano,
+        output_path=args.summary_output
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/main.nf b/main.nf
index 6aaf471..87bef62 100644
--- a/main.nf
+++ b/main.nf
@@ -10,7 +10,9 @@ include { MERGE_READS } from './modules/merge_reads.nf'
 include { REMOVE_HYBRIDS } from './modules/remove_hybrids.nf'
 include { RENAME_DEMUX_STATS } from './modules/rename_demux_stats.nf'
 include { NANOSTAT } from './modules/nanostat.nf'
+include { NANOSTAT_UNBARCODED } from './modules/nanostat_unbarcoded.nf'
 include { MULTIQC } from './modules/multiqc.nf'
+include { DEMUX_QC } from './modules/demux_qc.nf'
 
 def infer_well(path) {
     def well_id = path.tokenize("_")[2]
@@ -22,19 +24,15 @@ workflow {
     log.info paramsSummaryLog(workflow)
     def pools_ch = Channel.fromList(samplesheetToList(params.pool_sheet, "schemas/input_schema.json"))
 
-    // Build rename map as a plain Groovy map, not a channel
     def rename_maps = [:]
-    if (params.rename_map) {
-        file(params.rename_map)
-            .readLines()
-            .drop(1) // skip header
-            .each { line ->
-                def parts = line.trim().split(",")
-                if (parts.size() == 2) {
-                    rename_maps[parts[0].trim()] = parts[1].trim()
-                }
-            }
+
+
+    if (params.rename_map) {
+        def rename_rows = samplesheetToList(params.rename_map, "schemas/rename_map_schema.json")
+        rename_rows.each { row ->
+            rename_maps[row[0]] = row[1]
         }
+    }
     log.info "Rename maps: ${rename_maps}"
 
     LIMA_BOTH_END(pools_ch)
@@ -42,6 +40,9 @@ workflow {
     REMOVE_HYBRIDS(LIMA_BOTH_END.out.bam_unbarcoded_and_barcodes, LIST_HYBRIDS.out.hybrids)
     LIMA_EITHER_END(REMOVE_HYBRIDS.out.bam_filtered_and_barcodes)
 
+    // Run NanoStat on unbarcoded BAM from either-end step (one per pool)
+    NANOSTAT_UNBARCODED(LIMA_EITHER_END.out.bam_unbarcoded)
+
     def bams_by_well_ch = LIMA_BOTH_END.out.bam
         .join(LIMA_EITHER_END.out.bam)
         .map { meta, both_bams, either_bams ->
@@ -60,10 +61,7 @@ workflow {
         .map { meta, bams ->
             def key = "${meta.pool_ID}.${meta.well_ID}"
             def sample = rename_maps[key]
-            // log.info "KEY: '${key}' | MATCH: ${sample}"
-            def new_meta = meta + [
-                sample_ID: sample ?: key
-            ]
+            def new_meta = meta + [sample_ID: sample ?: key]
             tuple(new_meta, bams)
         }
 
@@ -75,32 +73,72 @@ workflow {
         .map { meta, stats -> tuple(meta, stats.flatten()) }
 
     DEMUX_STATS(stat_ch)
-
-    def final_stats_ch
+    def final_stats_ch
     if (params.rename_map) {
-        // Use the Python process we created
         def rename_map_file = file(params.rename_map)
-
         RENAME_DEMUX_STATS(DEMUX_STATS.out, rename_map_file)
         final_stats_ch = RENAME_DEMUX_STATS.out.csv
     } else {
         final_stats_ch = DEMUX_STATS.out
     }
 
-
-
-    NANOSTAT(MERGE_READS.out.fastq)
-
-
+
+    NANOSTAT(MERGE_READS.out.fastq)
+
+
     ch_multiqc_input = NANOSTAT.out.report
-        .map { meta, report -> tuple( meta.pool_ID, report) } // drop per-sample meta, key by pool
-        .groupTuple()
-        .map { pool_id, reports -> tuple(pool_id, reports.flatten()) }
-        //.view { pool_id, reports -> "Pool: ${pool_id} | Reports: ${reports}" }
-
-    MULTIQC(ch_multiqc_input)
-
-
+        .map { meta, report -> tuple(meta.pool_ID, meta, report) }
+        .groupTuple(by: 0)
+        .map { pool_id, metas, reports ->
+            def pool_meta = [pool_ID: pool_id]
+            tuple(pool_meta, reports.flatten())
+        }
+
+MULTIQC(ch_multiqc_input)
+
+    // ── MERGE_DEMUX_STATS — one per pool, pools never mixed ─────────────────
+
+    // Keep full meta (with pool_ID) throughout — add id field for joining
+    def lima_reports_ch = LIMA_BOTH_END.out.report
+        .join(LIMA_EITHER_END.out.report)
+    // tuple(meta{pool_ID}, report_both, report_either)
+    // Both channels share the same meta from pools_ch so join works directly
+
+    def lima_counts_ch = LIMA_BOTH_END.out.counts
+        .join(LIMA_EITHER_END.out.counts)
+    // tuple(meta{pool_ID}, counts_both, counts_either)
+
+    // Group per-sample nanostats by pool_ID — keyed by pool_ID string
+    def nanostat_by_pool_ch = NANOSTAT.out.report
+        .map { meta, report -> tuple(meta.pool_ID, report) }
+        .groupTuple()
+    // tuple("bc1015", [s1.txt, s2.txt, ...])
+    // tuple("bc1016", [s1.txt, s2.txt, ...])
+
+    // Unbarcoded nanostat — one per pool, keyed by pool_ID string
+    def nanostat_unbarcoded_ch = NANOSTAT_UNBARCODED.out.report
+        .map { meta, report -> tuple(meta.pool_ID, report) }
+    // tuple("bc1015", bc1015_unbarcoded_nanostat.txt)
+    // tuple("bc1016", bc1016_unbarcoded_nanostat.txt)
+    def lima_summary_ch = LIMA_BOTH_END.out.summary
+
+    // Join all channels on pool_ID — each pool stays separate
+    def merge_demux_input_ch = lima_reports_ch
+        .join(lima_counts_ch)
+        .join(lima_summary_ch) // ← join summary_both here
+        .map { meta, report_both, report_either, counts_both, counts_either, summary_both ->
+            tuple(meta.pool_ID, meta, report_both, report_either, counts_both, counts_either, summary_both)
+        }
+        .join(nanostat_by_pool_ch)
+        .join(nanostat_unbarcoded_ch)
+        .map { pool_id, meta, report_both, report_either, counts_both, counts_either,
+               summary_both, nanostat_files, unbarcoded_report ->
+            def sample_map = params.rename_map ? file(params.rename_map) : []
+            tuple(meta, report_both, report_either, counts_both, counts_either,
+                  summary_both, nanostat_files, unbarcoded_report, sample_map)
+        }
+
+    DEMUX_QC(merge_demux_input_ch)
 
     workflow.onComplete = {
         println "Project output directory: ${workflow.projectDir}/${params.output}"
diff --git a/modules/demux_qc.nf b/modules/demux_qc.nf
new file mode 100644
index 0000000..7d8ea35
--- /dev/null
+++ b/modules/demux_qc.nf
@@ -0,0 +1,42 @@
+process DEMUX_QC {
+    tag "${meta.pool_ID}"
+    label 'process_low'
+    // publishDir "${params.output}/demux_stats", mode: 'copy'
+
+    //container 'quay.io/biocontainers/pandas:1.5.2'
+    input:
+    tuple val(meta),
+          path(report_both),
+          path(report_either),
+          path(counts_both),
+          path(counts_either),
+          path(summary_both),
+          path(nanostat_files),
+          path(unbarcoded_nanostat),
+          path(sample_map)
+
+    output:
+    tuple val(meta), path("${meta.pool_ID}_*.csv")
+
+    script:
+    def sample_map_arg = sample_map ? "--sample-map ${sample_map}" : ""
+    """
+    merge_demux_stats.py \\
+        --report-both ${report_both} \\
+        --report-either ${report_either} \\
+        --counts-both ${counts_both} \\
+        --counts-either ${counts_either} \\
+        --summary-both ${summary_both} \\
+        --nanostat ${nanostat_files} \\
+        --unbarcoded-nanostat ${unbarcoded_nanostat} \\
+        ${sample_map_arg} \\
+        --output ${meta.pool_ID}_per_barcode_qc_report.csv \\
+        --summary-output ${meta.pool_ID}_per_pool_qc_report.csv
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        python: \$(python --version | sed 's/Python //')
+        pandas: \$(python -c "import pandas; print(pandas.__version__)")
+    END_VERSIONS
+    """
+}
\ No newline at end of file
diff --git a/modules/fastqc.nf b/modules/fastqc.nf
deleted file mode 100644
index 5018d1c..0000000
--- a/modules/fastqc.nf
+++ /dev/null
@@ -1,15 +0,0 @@
-process FASTQC {
-    tag "${meta.pool_ID}"
-
-    input:
-    tuple val(meta), path(reads)
-
-    output:
-    path("*.html"), emit: report
-    path("*.zip"), emit: archive
-
-    script:
-    """
-    fastqc -q $reads
-    """
-}
diff --git a/modules/lima_i7_i5_both_end.nf b/modules/lima_i7_i5_both_end.nf
index 8356ca9..c8b708b 100644
--- a/modules/lima_i7_i5_both_end.nf
+++ b/modules/lima_i7_i5_both_end.nf
@@ -32,5 +32,6 @@ process LIMA_BOTH_END {
 
     mv demux_i7_i5/${meta.pool_ID}.lima.counts demux_i7_i5/i7_i5_${meta.pool_ID}.lima.counts
     mv demux_i7_i5/${meta.pool_ID}.lima.summary demux_i7_i5/i7_i5_${meta.pool_ID}.lima.summary
+    mv demux_i7_i5/${meta.pool_ID}.lima.report demux_i7_i5/i7_i5_${meta.pool_ID}.lima.report
     """
 }
diff --git a/modules/lima_i7_i5_either_end.nf b/modules/lima_i7_i5_either_end.nf
index c1cb1f0..a4411a6 100644
--- a/modules/lima_i7_i5_either_end.nf
+++ b/modules/lima_i7_i5_either_end.nf
@@ -9,6 +9,7 @@ process LIMA_EITHER_END {
     tuple val(meta), path("demux_either_i7_i5/*lima.counts"), emit: counts
     tuple val(meta), path("demux_either_i7_i5/*lima.summary"), emit: summary
     tuple val(meta), path("demux_either_i7_i5/*unbarcoded.bam"), emit: bam_unbarcoded
+    tuple val(meta), path("demux_either_i7_i5/*lima.report"), emit: report
 
     script:
     """
@@ -31,5 +32,6 @@ process LIMA_EITHER_END {
 
     mv demux_either_i7_i5/${meta.pool_ID}.lima.counts demux_either_i7_i5/i7_5_${meta.pool_ID}.lima.counts
     mv demux_either_i7_i5/${meta.pool_ID}.lima.summary demux_either_i7_i5/i7_5_${meta.pool_ID}.lima.summary
+    mv demux_either_i7_i5/${meta.pool_ID}.lima.report demux_either_i7_i5/i7_5_${meta.pool_ID}.lima.report
     """
 }
diff --git a/modules/multiqc.nf b/modules/multiqc.nf
index f713d0c..13b7fea 100644
--- a/modules/multiqc.nf
+++ b/modules/multiqc.nf
@@ -1,23 +1,19 @@
 process MULTIQC {
-    tag "${pool_id}"
-
+    tag "${meta.pool_ID}"
     input:
-    tuple val(pool_id), path(nanostat_reports)
-
+    tuple val(meta), path(nanostat_reports)
     output:
-    path("${pool_id}_multiqc_report.html")
-
+    path("${meta.pool_ID}_multiqc_report_*.html")
     script:
     """
-    # Create sample rename file: old_name -> new_name
+    timestamp=\$(date +%Y%m%d_%H%M%S)
     for f in \$(ls ${nanostat_reports} | sort -V); do
         newname=\$(basename
\$f | sed 's/_nanostat.txt//') cp \$f \${newname}.txt done - - multiqc *_nanostat.txt \\ + multiqc *_nanostat.txt \\ --fullnames \\ - --filename ${pool_id}_multiqc_report.html + --filename ${meta.pool_ID}_multiqc_report_\${timestamp}.html """ } diff --git a/modules/nanostat_unbarcoded.nf b/modules/nanostat_unbarcoded.nf new file mode 100644 index 0000000..bf31b56 --- /dev/null +++ b/modules/nanostat_unbarcoded.nf @@ -0,0 +1,41 @@ +process NANOSTAT_UNBARCODED { + tag "${meta.pool_ID}" + label 'process_low' + container "quay.io/biocontainers/nanostat:1.6.0--pyhdfd78af_0" + + input: + tuple val(meta), path(bam) + + output: + tuple val(meta), path("${meta.pool_ID}_unbarcoded_nanostat.txt"), emit: report + path "versions.yml", emit: versions + + script: + """ + # Use pysam (already in container) to convert BAM -> FASTQ, bypassing header check + python3 - <<'EOF' +import pysam + +with pysam.AlignmentFile("${bam}", "rb", check_sq=False) as inbam, \\ + open("unbarcoded_reads.fastq", "w") as fq: + for read in inbam: + if read.query_sequence is None: + continue + qual = read.query_qualities + qual_str = ''.join(chr(q + 33) for q in qual) if qual is not None else 'I' * len(read.query_sequence) + fq.write(f"@{read.query_name}\\n{read.query_sequence}\\n+\\n{qual_str}\\n") +EOF + + NanoStat \\ + --fastq unbarcoded_reads.fastq \\ + --tsv \\ + --outdir . 
\\ + --name ${meta.pool_ID}_unbarcoded_nanostat.txt \\ + -t ${task.cpus} + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + nanostat: \$(NanoStat --version | sed 's/NanoStat //') + END_VERSIONS + """ +} \ No newline at end of file diff --git a/nextflow.config b/nextflow.config index 2851564..c7e3111 100644 --- a/nextflow.config +++ b/nextflow.config @@ -114,6 +114,18 @@ process { } + withName: DEMUX_QC { + memory = '6.0G' + cpus = 2 + container = 'quay.io/biocontainers/pandas:1.5.2' + publishDir = [ + path: { "${params.output}/${meta.pool_ID}/demux_qc"}, + pattern: '*.csv', + mode: 'copy' + ] + } + + withName: NANOSTAT { @@ -121,6 +133,12 @@ process { cpus = 2 container = 'quay.io/biocontainers/nanostat:1.6.0--pyhdfd78af_0' } + + withName: NANOSTAT_UNBARCODED { + memory = '8.0G' + cpus = 2 + container = 'quay.io/biocontainers/nanostat:1.6.0--pyhdfd78af_0' + } withName: MULTIQC { @@ -128,7 +146,7 @@ process { cpus = 2 container = 'quay.io/biocontainers/multiqc:1.21--pyhdfd78af_0' publishDir = [ - path: "${params.output}/multiqc", + path: { "${params.output}/${meta.pool_ID}/multiqc"}, pattern: '*.html', mode: 'copy' ] diff --git a/nextflow_schema.json b/nextflow_schema.json index 2af70a1..fa5dc22 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -27,6 +27,16 @@ "fa_icon": "fas fa-folder-open", "format": "directory-path", "type": "string" + }, + "rename_map": { + "description": "Optional CSV file mapping pool_ID.well_ID keys to user-defined sample names.", + "fa_icon": "fas fa-file-csv", + "format": "file-path", + "help_text": "CSV with two columns: 'pool_ID.well_ID' and 'sample_ID'. Each key must follow the format <pool_ID>.<well_ID> where well_ID is [A-H][0-9]{2} (e.g. bc1015.A01). When multiple pools are present, entries for all pools may be included; each pool filters to its own entries by pool_ID prefix.
The Barcode column always retains the original pool_ID.well_ID key.", + "mimetype": "text/csv", + "pattern": "^\\S+\\.(csv)$", + "schema": "schemas/rename_map_schema.json", + "type": "string" } }, "required": [ @@ -37,4 +47,4 @@ } }, "type": "object" -} +} \ No newline at end of file diff --git a/schemas/rename_map_schema.json b/schemas/rename_map_schema.json new file mode 100644 index 0000000..4cbdd31 --- /dev/null +++ b/schemas/rename_map_schema.json @@ -0,0 +1,27 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "title": "rename_map schema", + "description": "Schema for the rename_map CSV.", + "type": "array", + "items": { + "type": "object", + "properties": { + "pool_ID.well_ID": { + "type": "string", + "description": "Original barcode key in the format pool_ID.well_ID (e.g. bc1015.A01).", + "pattern": "^[A-Za-z0-9]+\\.[A-H][0-9]{2}$", + "errorMessage": "pool_ID.well_ID must follow <pool_ID>.<well_ID> where well_ID is a plate coordinate [A-H][0-9]{2}, e.g. bc1015.A01" + }, + "sample_ID": { + "type": "string", + "description": "User-defined sample name to assign to this well.", + "pattern": "^\\S+$", + "errorMessage": "sample_ID must be a non-empty string with no whitespace" + } + }, + "required": [ + "pool_ID.well_ID", + "sample_ID" + ] + } +} \ No newline at end of file diff --git a/tests/pool_sheet.csv b/tests/pool_sheet.csv index cb782ad..c8a11fd 100644 --- a/tests/pool_sheet.csv +++ b/tests/pool_sheet.csv @@ -1,2 +1,3 @@ pool_ID,pool_path,i7_barcode,i5_barcode bc1015,tests/example.bam,barcodes/LongPlex_set1_i7_trimmed_adapters.fa,barcodes/LongPlex_set1_i5_trimmed_adapters.fa +bc1016,tests/example.bam,barcodes/LongPlex_set1_i7_trimmed_adapters.fa,barcodes/LongPlex_set1_i5_trimmed_adapters.fa \ No newline at end of file diff --git a/tests/run_longplex_integration.sh b/tests/run_longplex_integration.sh index 4e499b7..7190fc1 100644 --- a/tests/run_longplex_integration.sh +++ b/tests/run_longplex_integration.sh @@ -1,4 +1,4 @@ -#!/usr/bin/env bash
+#!/usr/bin/env bash nextflow run \ -profile docker \ diff --git a/tests/sample_map.csv b/tests/sample_map.csv index e51cdad..40cc1d1 100644 --- a/tests/sample_map.csv +++ b/tests/sample_map.csv @@ -1,25 +1,49 @@ pool_ID.well_ID,sample_ID -bc1015.A01,bc1015.sample1 -bc1015.A02,bc1015.sample2 -bc1015.A03,bc1015.sample3 -bc1015.B01,bc1015.sample4 -bc1015.B02,bc1015.sample5 -bc1015.B03,bc1015.sample6 -bc1015.C01,bc1015.sample7 -bc1015.C02,bc1015.sample8 -bc1015.C03,bc1015.sample9 -bc1015.D01,bc1015.sample10 -bc1015.D02,bc1015.sample11 -bc1015.D03,bc1015.sample12 -bc1015.E01,bc1015.sample13 -bc1015.E02,bc1015.sample14 -bc1015.E03,bc1015.sample15 -bc1015.F01,bc1015.sample16 -bc1015.F02,bc1015.sample17 -bc1015.F03,bc1015.sample18 -bc1015.G01,bc1015.sample19 -bc1015.G02,bc1015.sample20 -bc1015.G03,bc1015.sample21 -bc1015.H01,bc1015.sample22 -bc1015.H02,bc1015.sample23 -bc1015.H03,bc1015.sample24 +bc1015.A01,bc1015_sample1 +bc1015.A02,bc1015_sample2 +bc1015.A03,bc1015_sample3 +bc1015.B01,bc1015_sample4 +bc1015.B02,bc1015_sample5 +bc1015.B03,bc1015_sample6 +bc1015.C01,bc1015_sample7 +bc1015.C02,bc1015_sample8 +bc1015.C03,bc1015_sample9 +bc1015.D01,bc1015_sample10 +bc1015.D02,bc1015_sample11 +bc1015.D03,bc1015_sample12 +bc1015.E01,bc1015_sample13 +bc1015.E02,bc1015_sample14 +bc1015.E03,bc1015_sample15 +bc1015.F01,bc1015_sample16 +bc1015.F02,bc1015_sample17 +bc1015.F03,bc1015_sample18 +bc1015.G01,bc1015_sample19 +bc1015.G02,bc1015_sample20 +bc1015.G03,bc1015_sample21 +bc1015.H01,bc1015_sample22 +bc1015.H02,bc1015_sample23 +bc1015.H03,bc1015_sample24 +bc1016.A01,bc1016.sample1 +bc1016.A02,bc1016.sample2 +bc1016.A03,bc1016.sample3 +bc1016.B01,bc1016.sample4 +bc1016.B02,bc1016.sample5 +bc1016.B03,bc1016.sample6 +bc1016.C01,bc1016.sample7 +bc1016.C02,bc1016.sample8 +bc1016.C03,bc1016.sample9 +bc1016.D01,bc1016.sample10 +bc1016.D02,bc1016.sample11 +bc1016.D03,bc1016.sample12 +bc1016.E01,bc1016.sample13 +bc1016.E02,bc1016.sample14 +bc1016.E03,bc1016.sample15
+bc1016.F01,bc1016.sample16 +bc1016.F02,bc1016.sample17 +bc1016.F03,bc1016.sample18 +bc1016.G01,bc1016.sample19 +bc1016.G02,bc1016.sample20 +bc1016.G03,bc1016.sample21 +bc1016.H01,bc1016.sample22 +bc1016.H02,bc1016.sample23 +bc1016.H03,bc1016.sample24 diff --git a/tests/test_longplex_integration.yaml b/tests/test_longplex_integration.yaml index 370878b..8a9a3f1 100644 --- a/tests/test_longplex_integration.yaml +++ b/tests/test_longplex_integration.yaml @@ -56,7 +56,6 @@ - path: output/bc1015/lima_out/demux_either_i7_i5/bc1015.unbarcoded.bam - path: output/bc1015/lima_out/demux_either_i7_i5/i7_5_bc1015.lima.counts - path: output/bc1015/lima_out/demux_either_i7_i5/i7_5_bc1015.lima.summary - - path: output/bc1015/lima_out/demux_i7_i5/bc1015.lima.report - path: output/bc1015/lima_out/demux_i7_i5/bc1015.seqwell_UDI1_A01_P5--seqwell_UDI1_A01_P7.bam - path: output/bc1015/lima_out/demux_i7_i5/bc1015.seqwell_UDI1_A02_P5--seqwell_UDI1_A02_P7.bam - path: output/bc1015/lima_out/demux_i7_i5/bc1015.seqwell_UDI1_A03_P5--seqwell_UDI1_A03_P7.bam diff --git a/tests/test_multiqc_html.py b/tests/test_multiqc_html.py index 00487a4..52c618d 100644 --- a/tests/test_multiqc_html.py +++ b/tests/test_multiqc_html.py @@ -4,5 +4,5 @@ @pytest.mark.workflow('LongPlex Integration Test') def test_multiqc_html(workflow_dir): - multiqc_html = glob.glob(str(Path(workflow_dir, "output", "multiqc", "*multiqc_report.html"))) + multiqc_html = glob.glob(str(Path(workflow_dir, "output"))) assert len(multiqc_html) != 0
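
Review note on `tests/test_multiqc_html.py`: the new pattern `Path(workflow_dir, "output")` contains no wildcard, so `glob.glob` returns the directory itself whenever it exists, and the assertion passes even if no MultiQC report was written. A tighter check might look like the sketch below. It assumes the reports land at `output/<pool_ID>/multiqc/<pool_ID>_multiqc_report_<timestamp>.html`, which is inferred from the `publishDir` and `--filename` changes in this diff and assumes `params.output` is `output` in the integration test. The helper name `find_multiqc_reports` and the fixed timestamp are hypothetical.

```python
import glob
import tempfile
from pathlib import Path


def find_multiqc_reports(workflow_dir):
    """Return per-pool MultiQC HTML reports under <workflow_dir>/output (hypothetical helper)."""
    # Assumed layout, inferred from the publishDir and --filename changes in this diff:
    #   output/<pool_ID>/multiqc/<pool_ID>_multiqc_report_<timestamp>.html
    pattern = Path(workflow_dir, "output", "*", "multiqc", "*_multiqc_report_*.html")
    return sorted(glob.glob(str(pattern)))


# Demonstration against throwaway directory trees, so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    for pool in ("bc1015", "bc1016"):
        report_dir = Path(tmp, "output", pool, "multiqc")
        report_dir.mkdir(parents=True)
        # Fixed timestamp used only for the demonstration.
        Path(report_dir, f"{pool}_multiqc_report_20240101_000000.html").touch()
    demo_reports = [Path(p).name for p in find_multiqc_reports(tmp)]

with tempfile.TemporaryDirectory() as tmp:
    # An output directory without any report: the helper finds nothing,
    # whereas globbing the bare directory path would still return one match.
    Path(tmp, "output").mkdir()
    empty_reports = find_multiqc_reports(tmp)
    bare_dir_matches = glob.glob(str(Path(tmp, "output")))
```

In the pytest test, `assert len(find_multiqc_reports(workflow_dir)) != 0` would then fail when MultiQC produces no HTML, which the bare-directory glob cannot detect.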