# Directive for creating a script for your notebook

The block here below is required at the top of each notebook that you want to create a script for. You will also need to edit the "settings.ini" file, to create a script (see [Coding in NBdev](https://dksund.sharepoint.com/:fl:/g/contentstorage/CSP_7c761ee7-b577-4e08-8517-bc82392bf65e/ETlSfUyArSNJhX8veMI_JQ8By1aXGHzDJkhotpfpXx4mmw?e=037EwH&nav=cz0lMkZjb250ZW50c3RvcmFnZSUyRkNTUF83Yzc2MWVlNy1iNTc3LTRlMDgtODUxNy1iYzgyMzkyYmY2NWUmZD1iJTIxNXg1MmZIZTFDRTZGRjd5Q09TdjJYblkwVlNiWXFYcE1yaHVrVmZqTVJUVEE4X1VwZjhTd1JxcjRNdmFrSmh2RCZmPTAxVlVLVzVWSlpLSjZVWkFGTkVORVlLN1pQUERCRDZKSVAmYz0lMkYmYT1Mb29wQXBwJnA9JTQwZmx1aWR4JTJGbG9vcC1wYWdlLWNvbnRhaW5lciZ4PSU3QiUyMnclMjIlM0ElMjJUMFJUVUh4a2EzTjFibVF1YzJoaGNtVndiMmx1ZEM1amIyMThZaUUxZURVeVpraGxNVU5GTmtaR04zbERUMU4yTWxodVdUQldVMkpaY1Zod1RYSm9kV3RXWm1wTlVsUlVRVGhmVlhCbU9GTjNVbkZ5TkUxMllXdEthSFpFZkRBeFZsVkxWelZXU1RJMVJsaFBNalkyUlZkQ1FqTTFRVmhKVTBkRFVVcFdXa1klM0QlMjIlMkMlMjJpJTIyJTNBJTIyNzRmNzM1ZmUtYzg4Ny00MjhhLWFkZmYtNTEyZTg2YmNmZmQzJTIyJTdE) 
(**Writing your own notebooks**) on Loop for more details). Replace the name after `default_exp` (here `Ecoli_parser`) with a name that makes sense for your notebook.

In [1]:
#| default_exp Ecoli_parser

# Libraries
Import all the libraries used in this *Escherichia coli* module: they cover logging of the various operations, loading input files, creating data structures, manipulating them, and writing the desired results.

In [2]:
#| export

# Standard libraries
import os
import sys
from pathlib import Path
sys.path.append(str(Path().resolve().parent))
# Logging libraries
import logging
from datetime import datetime 

# Function specific libraries
from typing import List, Dict 
from fastcore.script import call_parse
# Import functions from the core module (optional, but most likely needed)
from ssi_analysis_result_parsers import core

# Project specific libraries
import pandas as pd


In [None]:
# This block should never be exported. It makes Python run in the project directory (not the nbs directory), which is needed to initiate the package using pip.
os.chdir(core.PROJECT_DIR)

# Pre-data-manipulation requirements

To ensure accurate data wrangling, two pre-analysis requirements are defined:
- Locus-specific thresholds on virulence and serotype-defining markers for pathogenic *Escherichia coli*, used to filter out genes identified with high uncertainty
- A logging function, designed to identify potential issues during the data manipulation process

### Loci-specific thresholds

The threshold filters defined below are applied to the KMA *.res* output file to discard results whose values in the *Template_Coverage* and *Query_Identity* columns fall below the required levels. Each threshold is given as [*Template_Coverage*, *Query_Identity*].

The locus-specific thresholds are defined as follows (the filter is inclusive, so a value equal to the threshold passes):
- "stx": [98, 98] – Shiga toxin genes (stx1 and stx2 subtypes); both template coverage and query identity must be at least 98%.
- "wzx", "wzy", "wzt", "wzm": [98, 98] – O-antigen genes for serotyping; both columns at least 98%.
- "fliC", "fli": [90, 90] – Flagellar genes (including fliC for H-antigen typing); both columns at least 90%.
- "eae": [95, 95] – Intimin gene for surface adhesion in EPEC/STEC strains; both columns at least 95%.
- "ehxA": [95, 95] – Hemolytic toxin gene; both columns at least 95%.
- "other": [98, 98] – General filtering for all other genes; both columns at least 98%.

In [22]:
#| export

thresholds = {
    "stx": [98, 98],
    "wzx": [98, 98],
    "wzy": [98, 98],
    "wzt": [98, 98],
    "wzm": [98, 98],
    "fliC": [90, 90],
    "fli": [90, 90],
    "eae": [95, 95],
    "ehxA": [95, 95],
    "other": [98, 98]
}

### Logging

The following function defines the logging strategy used throughout the subsequent processing steps to record both informational messages and errors in a per-sample log file.

Each log file is created within a specified logging directory and is named according to the corresponding sample. The log includes detailed messages tracking the progress, errors, and warnings encountered during execution, each annotated with a timestamp. This logging approach is useful for debugging and tracing issues in multi-sample pipelines.


In [23]:
#| export

def setup_logging(log_dir: str, sample_name: str) -> None:
    """
    Sets up logging to both a file and the console for a specific sample.

    Creates a log file in the specified directory, removes any existing handlers,
    configures logging to write INFO-level messages and above, and adds a console
    stream handler for real-time output.

    Args:
        log_dir (str): Path to the directory where the log file will be stored.
        sample_name (str): Name of the sample, used to name the log file.

    Returns:
        None
    """
    os.makedirs(log_dir, exist_ok=True)
    log_file = os.path.join(log_dir, f"{sample_name}_kma_fbi.log")

    # Remove and close handlers left over from a previously processed sample
    logger = logging.getLogger()
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    logging.basicConfig(
        filename=log_file,
        filemode="a",
        format="%(asctime)s - %(levelname)s - %(message)s",
        level=logging.INFO
    )

    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(console_handler)

    logging.info(f"Logging started for {log_file}")
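
As a hedged illustration of the per-sample log file described above, the sketch below writes one log file per sample into a temporary directory. It is not the notebook's `setup_logging()`: it uses a named logger and a hypothetical helper, `make_sample_logger`, instead of reconfiguring the root logger, and the directory and messages are assumptions chosen for the example.

```python
import logging
import os
import tempfile


def make_sample_logger(log_dir: str, sample_name: str) -> logging.Logger:
    """Hypothetical helper: logger writing to <log_dir>/<sample_name>_kma_fbi.log."""
    os.makedirs(log_dir, exist_ok=True)
    log_file = os.path.join(log_dir, f"{sample_name}_kma_fbi.log")

    logger = logging.getLogger(f"ecoli.{sample_name}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these messages out of the root logger

    # Drop handlers from an earlier call so each message is written once
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    file_handler = logging.FileHandler(log_file, mode="a")
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(file_handler)
    return logger


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_logger = make_sample_logger(tmp_dir, "ERR3528110")
    sample_logger.info("Processing .res file")
    sample_logger.error("Example error message")

    # Release the file before the temporary directory is removed
    for handler in list(sample_logger.handlers):
        sample_logger.removeHandler(handler)
        handler.close()

    log_path = os.path.join(tmp_dir, "ERR3528110_kma_fbi.log")
    with open(log_path) as handle:
        log_lines = handle.read().splitlines()

for line in log_lines:
    print(line)
```

Each line in the file carries a timestamp, the level, and the message, which is the same format string that `setup_logging()` passes to `logging.basicConfig()`.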


# KMA alignment results
The KMA tool (*https://github.com/genomicepidemiology/kma*) produces multiple output files depending on the parameters used. For *Escherichia coli* OH-typing purposes, only the file with the *<u>.res</u>* suffix is required.

As an example, consider the following alignment results from the sample ERR3528110 (*https://www.ebi.ac.uk/ena/browser/view/ERR3528110?show=reads*), which produces a .res file containing output like:


| #Template | Score | Expected | Template_length | Template_Identity | Template_Coverage | Query_Identity | Query_Coverage | Depth | q_value | p_value |
|-----------|-------|----------|------------------|--------------------|--------------------|----------------|----------------|--------|---------|---------|
| 1__wzx__O6__AJ426045   | 20056 | 153 | 1257 | 99.92 | 100.00 | 99.92 | 100.00 | 16.07 | 19601.01 | 1.0e-26 |
| 2__wzy__O6__AJ426423   | 23540 | 159 | 1344 | 100.00 | 100.00 | 100.00 | 100.00 | 17.35 | 23065.87 | 1.0e-26 |
| 5__fliC__H1__AB028471  | 107030| 73  | 1788 | 100.00 | 100.00 | 100.00 | 100.00 | 62.14 | 106810.34 | 1.0e-26 |

The structure and meaning of the columns are described in detail in the official *KMA specification (https://gensoft.pasteur.fr/docs/kma/1.2.22/KMAspecification.pdf)*. For the *Escherichia coli*-specific filtering and interpretation, the following columns are particularly important (the definitions below are copied from the specification document):

- #Template: Contains the name of the template, default is the fasta header from the template sequence, including any spaces, tabs or special characters
- Template_Coverage: Is the percentage of bases in the template that is covered by the consensus sequence. A Template_Coverage above 100% indicates the presence of more insertions than deletions.
- Query_Identity: Is the number of bases in the template sequence that are identical to the consensus sequence divided by the length of the consensus. In other words, the percentage of identical nucleotides between template and consensus w.r.t. the consensus.

These fields are critical for downstream filtering based on locus-specific thresholds to assess gene presence with high certainty.
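
To make the column layout concrete, here is a minimal sketch that reads two of the ERR3528110 rows shown above from a tab-separated string and pulls out the fields used later. It uses only the standard library (the notebook itself uses pandas), and `read_res_rows` is a hypothetical helper name. It assumes template names follow the `<index>__<gene>__<allele>__<accession>` pattern seen in the example table.

```python
import csv
import io

# Two rows copied from the ERR3528110 example table (tab-separated)
RES_TEXT = (
    "#Template\tScore\tExpected\tTemplate_length\tTemplate_Identity\t"
    "Template_Coverage\tQuery_Identity\tQuery_Coverage\tDepth\tq_value\tp_value\n"
    "1__wzx__O6__AJ426045\t20056\t153\t1257\t99.92\t100.00\t99.92\t100.00\t"
    "16.07\t19601.01\t1.0e-26\n"
    "5__fliC__H1__AB028471\t107030\t73\t1788\t100.00\t100.00\t100.00\t100.00\t"
    "62.14\t106810.34\t1.0e-26\n"
)


def read_res_rows(text: str) -> list:
    """Hypothetical helper: extract gene, allele and the filtering columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for record in reader:
        # Assumed template layout: <index>__<gene>__<allele>__<accession>
        parts = record["#Template"].split("__")
        rows.append(
            {
                "gene": parts[1],
                "allele": parts[2],
                "template_coverage": float(record["Template_Coverage"]),
                "query_identity": float(record["Query_Identity"]),
                "depth": float(record["Depth"]),
            }
        )
    return rows


rows = read_res_rows(RES_TEXT)
for row in rows:
    print(row)
```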

# Data manipulation

Once alignment with KMA is completed, several downstream processing steps are performed to extract species-specific information relevant for the surveillance of pathogenic *Escherichia coli* strains. These steps include:

- Extracting data from the KMA .res file
- Filtering results based on gene-specific thresholds for Template_Coverage and Query_Identity
- Performing O:H serotyping and Shiga toxin (stx) detection based on locus-specific matches
- Wrangling the data using pandas DataFrames to generate a structured and analysis-ready format
- Storing results in either:
    * A per-sample output file with sample-specific information from the original samplesheet
    * A combined .tsv file that extends the original samplesheet with annotated typing and virulence information


### Extraction of data and filtering

First, the gene-specific thresholds defined in the Thresholds section are extracted using the function `get_threshold()`. These thresholds are then applied within the function `process_res_file()` to filter out alignment results that do not pass the required criteria.

In [24]:
#| export

def get_threshold(template_name: str, thresholds: Dict[str, List[int]]) -> List[int]:
    """
    Returns the coverage and identity threshold for a given gene.

    Args:
        template_name (str): Name of the template (gene) from the .res file.
        thresholds (Dict[str, List[int]]): Dictionary of gene thresholds.

    Returns:
        List[int]: A list of two integers: [coverage_threshold, identity_threshold].
    """
    for key in thresholds:
        if key in template_name:
            return thresholds[key]
    return thresholds["other"]

def process_res_file(res_file_path: str) -> pd.DataFrame:
    """
    Reads and filters a KMA .res file based on the module-level `thresholds`.

    Args:
        res_file_path (str): Path to the .res file.

    Returns:
        pd.DataFrame: Rows that meet the thresholds for their gene.
    """
    try:
        res_df = pd.read_csv(res_file_path, sep="\t")
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {res_file_path}")
    except pd.errors.EmptyDataError:
        raise ValueError(f"File is empty or not properly formatted: {res_file_path}")

    required_columns = {"#Template", "Template_Coverage", "Query_Identity", "Depth"}
    if not required_columns.issubset(res_df.columns):
        raise ValueError(f"Missing expected columns in {res_file_path}")

    res_df["threshold"] = res_df["#Template"].apply(lambda x: get_threshold(x, thresholds))
    res_df_filtered = res_df[
        (res_df["Template_Coverage"] >= res_df["threshold"].apply(lambda x: x[0])) &
        (res_df["Query_Identity"] >= res_df["threshold"].apply(lambda x: x[1]))
    ]
    return res_df_filtered
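
The lookup and the inclusive comparison can be sketched without pandas as follows. `THRESHOLDS` is a copy of the dictionary defined earlier, `lookup_threshold` is a hypothetical stand-in that follows the same substring rule as `get_threshold()`, and the hit values are made up for illustration.

```python
from typing import Dict, List

# Copy of the thresholds defined earlier: [Template_Coverage, Query_Identity]
THRESHOLDS: Dict[str, List[int]] = {
    "stx": [98, 98],
    "wzx": [98, 98],
    "wzy": [98, 98],
    "wzt": [98, 98],
    "wzm": [98, 98],
    "fliC": [90, 90],
    "fli": [90, 90],
    "eae": [95, 95],
    "ehxA": [95, 95],
    "other": [98, 98],
}


def lookup_threshold(template_name: str, thresholds: Dict[str, List[int]]) -> List[int]:
    """Hypothetical stand-in: the first key found inside the template name wins."""
    for key in thresholds:
        if key in template_name:
            return thresholds[key]
    return thresholds["other"]


# (template name, Template_Coverage, Query_Identity), illustrative values only
hits = [
    ("1__wzx__O6__X", 100.0, 99.92),    # passes [98, 98]
    ("2__eae__eae-1__X", 100.0, 94.0),  # identity below 95, dropped
    ("3__fliC__H6__X", 91.5, 90.0),     # equal to the threshold, kept
    ("4__adk__adk__X", 99.0, 97.5),     # no key matches, "other" applies, dropped
]

kept = []
for name, coverage, identity in hits:
    min_coverage, min_identity = lookup_threshold(name, THRESHOLDS)
    if coverage >= min_coverage and identity >= min_identity:
        kept.append(name)

print(kept)
```

Because the lookup is a substring match against the whole template name, a key that happens to occur in an allele or accession would also match; the example names avoid that situation.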


### O:H Typing and Shiga Toxin (stx) Detection

To generate *Escherichia coli*-specific results, a class named `EcoliResults` is implemented. This class contains all the necessary functions to process the alignment results across multiple samples.

It uses a samplesheet.tsv file to identify which samples to process and maps each to its corresponding KMA output file(s). An example row from the samplesheet:

| sample_name | illumina_read_files | nanopore_read_file | assembly_file | organism | variant | notes |
|-----------|-------|----------|------------------|--------------------|--------------------|----------------|
| ERR3528110 | ERR3528110_1.fastq.gz,ERR3528110_2.fq.gz | Na | Na | E.coli | Na | Na | 


The class performs the following main steps:
1. Process each sample to determine the O:H serotype and Shiga toxin (stx) type based on locus-specific matches.
2. Combine the processed results with the original samplesheet.tsv to preserve metadata and add typing information.
3. Write the final output by extending the original samplesheet with new columns for O-antigen, H-antigen, and stx results.

#### Step 1 - Sample-Level Processing
The function `summarize_single_sample()` performs the initial sample-level processing, which includes the following steps:
- First, the KMA results are filtered using the previously defined `process_res_file()` function, which applies gene-specific thresholds to the *Template_Coverage* and *Query_Identity* columns of the *.res* file.
- Serotype-specific requirements are then applied to determine the O:H type:
    - If only *wzx* and *wzy* are present as O-antigen genes with no conflicting alleles, the O-type is assigned based on their information.
    - If only *wzt* and *wzm* are present as O-antigen genes with no conflicting alleles, the O-type is assigned based on their information.
    - If more than two O-antigen genes are detected, or if the *wzx & wzy* or *wzt & wzm* gene pairs show conflicting alleles, the individual locus data is recorded but no definitive O-type is assigned, indicating that further analysis is required.
    - If a general *fli* hit is present (a template whose gene field is `fli`, as opposed to *fliC*), it is used to assign the H-type.
    - If no general *fli* hit is found but *fliC* is present, *fliC* is used to assign the H-type.
- Depending on configuration, additional gene results may be retained for further filtering or analysis.
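
The serotype rules listed above can be sketched as a small pure function. `resolve_oh` is a hypothetical helper written for illustration, not part of the module; it takes a mapping from gene name to allele and returns the combined `O;H` string, using "-" for anything unresolved.

```python
from typing import Dict

NA = "-"


def resolve_oh(genes: Dict[str, str]) -> str:
    """Hypothetical helper mirroring the O:H rules described above.

    `genes` maps a gene name (wzx, wzy, wzt, wzm, fli, fliC) to its allele;
    genes that are absent from the mapping count as not detected.
    """
    wzx, wzy = genes.get("wzx", NA), genes.get("wzy", NA)
    wzt, wzm = genes.get("wzt", NA), genes.get("wzm", NA)

    o_type = NA
    # A pair only counts when both genes agree and the other pair is absent
    if NA not in (wzx, wzy) and wzx == wzy and wzt == NA and wzm == NA:
        o_type = wzx
    elif NA not in (wzt, wzm) and wzt == wzm and wzx == NA and wzy == NA:
        o_type = wzt

    # A general fli hit takes precedence over fliC
    fli = genes.get("fli", NA)
    h_type = fli if fli != NA else genes.get("fliC", NA)
    return f"{o_type};{h_type}"


examples = {
    "clean pair": resolve_oh({"wzx": "O6", "wzy": "O6", "fliC": "H1"}),
    "conflicting pair": resolve_oh({"wzx": "O6", "wzy": "O7", "fliC": "H1"}),
    "two pairs": resolve_oh({"wzx": "O6", "wzy": "O6", "wzm": "O7", "fliC": "H1"}),
    "fli over fliC": resolve_oh({"fli": "H1", "fliC": "H12"}),
}
for label, oh in examples.items():
    print(f"{label}: {oh}")
```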

#### Step 2 - Merging with Samplesheet
The processed results are then merged with the original metadata using the function `from_samplesheet()`. This integrates the serotyping and toxin determination results into the same structure as the input samplesheet.tsv.

#### Step 3 - Writing the Final Output
The final output is stored as a *.tsv* file, extending the original *samplesheet.tsv*. The final structure is:

| sample_name | illumina_read_files | nanopore_read_file | assembly_file | organism | variant | notes | stx | OH | wzx | wzy | wzt | wzm | eae | ehxA | Other | verbose|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| ERR3528110 | ERR3528110_1.fastq.gz,ERR3528110_2.fq.gz | Na | Na | E.coli | Na | Na | - | O6;H1 | - | - | - | - | - | - | - | wzx_O6_16.07_100.00_99.92;wzy_O6_17.35_100.00_100.00;fliC_H1_62.14_100.00_100.00|

##### Output Interpretation
- If the serotype-specific rules (e.g., consistent wzx/wzy or wzt/wzm) are not met, the individual gene columns (wzx, wzy, etc.) are filled, but the O-type part of the OH column is set to "-". When the rules are met, the O-type is reported in OH and the corresponding gene columns are reset to "-".
    
    Example:
    * If wzx=O6 and wzy=O6, then the O-type in OH is O6 (e.g., OH=O6;H1) and the wzx and wzy columns are "-".
    * If wzx=O6 and wzy=O7, then the wzx and wzy columns are filled with their respective values and the O-type in OH is "-" due to the conflict within the gene pair.
    * If wzx=O6, wzy=O6, and wzm=O7, then all gene columns (wzx, wzy, wzm) are filled but the O-type in OH is "-", since genes from more than one gene pair are present.
- The eae and ehxA columns are marked as "Positive" only if the corresponding gene is present and passes the defined threshold; otherwise they contain "-".

##### Verbose column structure

- The verbose column contains detailed gene match information as a ;-separated list. Each entry follows this format:
    
    `<gene>_<allele>_<depth>_<template_coverage>_<query_identity>`

    * Example: *wzx_O6_16.07_100.00_99.92;wzy_O6_17.35_100.00_100.00;fliC_H1_62.14_100.00_100.00*
    
        The parts of the first entry represent:

        * Gene (e.g. wzx)
        * Allele (e.g. O6)
        * Depth of coverage (e.g. 16.07x)
        * Breadth of coverage (e.g. 100.00 %), from the *Template_Coverage* column
        * Identity (e.g. 99.92 %), from the *Query_Identity* column

        The entries for *wzy* and *fliC* follow the same structure.
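
For downstream use, a verbose string can be split back into its fields. The sketch below assumes the entry format described above; `parse_verbose` is a hypothetical helper, and it splits from the right so that an underscore inside the gene name would not shift the numeric fields (an underscore inside the allele would still be ambiguous).

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, float]]


def parse_verbose(verbose: str) -> List[Record]:
    """Hypothetical helper: turn a verbose string into one record per gene hit."""
    records: List[Record] = []
    for entry in verbose.split(";"):
        if not entry:
            continue  # an empty verbose string yields no records
        # Split from the right: the last three fields are always numeric
        gene, allele, depth, coverage, identity = entry.rsplit("_", 4)
        records.append(
            {
                "gene": gene,
                "allele": allele,
                "depth": float(depth),
                "template_coverage": float(coverage),
                "query_identity": float(identity),
            }
        )
    return records


example = "wzx_O6_16.07_100.00_99.92;wzy_O6_17.35_100.00_100.00;fliC_H1_62.14_100.00_100.00"
records = parse_verbose(example)
for record in records:
    print(record)
```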


In [25]:
#| export

class EcoliResults:
    """
    Object for holding and processing E. coli typing results.

    This class stores summary typing data for multiple samples, provides utilities for per-sample processing, and exports results in a tab-separated format (.tsv).
    """

    # converts the sample results in dict to pandas df
    def __init__(self, results_dict: dict):
        """
        Initializes the EcoliResults object with typing result data.

        Args:
            results_dict (dict): Dictionary where keys are sample names and values are summary result dictionaries.
        """
        self.results_dict = results_dict
        self.results_df = pd.DataFrame.from_dict(results_dict, orient="index").reset_index(names="sample_name")

    @staticmethod
    def summarize_single_sample(sample_name: str, res_path: str, verbose_flag: int = 1) -> dict:
        """
        Processes a single sample KMA .res file and returns a summary dictionary.

        Args:
            sample_name (str): Sample identifier.
            res_path (str): Path to the sample's .res file.
            verbose_flag (int, optional): Include verbose info if set to 1. Default is 1.

        Returns:
            Dict[str, str]: Summary values extracted from the .res file.
        """
        log_dir = "examples/Log"
        setup_logging(log_dir, sample_name)

        NA_string = "-"
        output_data = {
            "stx": NA_string,
            "OH": NA_string, "wzx": NA_string, "wzy": NA_string, "wzt": NA_string, "wzm": NA_string,
            "eae": NA_string, "ehxA": NA_string,
            "Other": NA_string
        }

        try:
            logging.info(f"Processing .res file: {res_path}")
            filtered_df = process_res_file(res_path)
        except Exception as e:
            logging.error(f"Failed to process {res_path}: {e}")
            return output_data

        gene_map = {
            "wzx": "wzx", "wzy": "wzy", "wzt": "wzt", "wzm": "wzm",
            "eae": "eae", "ehxA": "ehxA"
        }
        toxin = "stx"
        stx_alleles = set()
        fli = NA_string
        fliC = NA_string

        for template in filtered_df["#Template"]:
            parts = template.split("__")
            if len(parts) < 3:
                continue
            gene, allele = parts[1], parts[2]

            if gene in ["eae", "ehxA"]:
                output_data[gene] = "Positive"
            elif gene in gene_map:
                output_data[gene] = allele
            elif gene == "fliC":
                fliC = allele
            elif gene == "fli":
                fli = allele
            elif gene.startswith(toxin):
                stx_alleles.add(allele)
            elif gene not in thresholds:
                output_data["Other"] = allele

        if stx_alleles:
            output_data[toxin] = ";".join(sorted(stx_alleles))

        # serotype specific requirements
        wzx, wzy, wzt, wzm = output_data["wzx"], output_data["wzy"], output_data["wzt"], output_data["wzm"]
        Otype = "-"
        if wzx != NA_string and wzy != NA_string and wzx == wzy and wzt == NA_string and wzm == NA_string:
            Otype = wzx
            output_data["wzx"] = output_data["wzy"] = NA_string
        elif wzt != NA_string and wzm != NA_string and wzt == wzm and wzx == NA_string and wzy == NA_string:
            Otype = wzt
            output_data["wzt"] = output_data["wzm"] = NA_string

        Htype = fli if fli != NA_string else fliC
        output_data["OH"] = f"{Otype};{Htype}"

        # adding the additional depth, template coverage and query identity information
        if verbose_flag == 1:
            verbose_parts = []
            for _, row in filtered_df.iterrows():
                parts = row["#Template"].split("__")
                if len(parts) >= 3:
                    gene, allele = parts[1], parts[2]
                    depth = row["Depth"]
                    coverage = row["Template_Coverage"]
                    identity = row["Query_Identity"]
                    verbose_parts.append(f"{gene}_{allele}_{depth:.2f}_{coverage:.2f}_{identity:.2f}")
            output_data["verbose"] = ";".join(verbose_parts)

        logging.info(f"Successfully processed sample: {sample_name}")
        return output_data

    @classmethod
    def from_samplesheet(cls, 
                        samplesheet_path: Path, 
                        verbose: int = 1, 
                        results_base: str = "examples/Results/{sample_name}/kma/{sample_name}.res",
                    ) -> "EcoliResults":
        """
        Loads sample data from a samplesheet and summarizes each sample.

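        Args:
            samplesheet_path (Path): Path to the samplesheet TSV file.
            verbose (int, optional): Whether to include verbose output per sample. Default is 1.
            results_base (str, optional): Path template for the per-sample .res file; `{sample_name}` is replaced with each sample name.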

        Returns:
            EcoliResults: An instance of the class populated with summaries for all samples.
        """
        df = pd.read_csv(samplesheet_path, sep="\t")
        df.columns = df.columns.str.strip()

        results_dict = {}
        for _, row in df.iterrows():
            sample_name = row["sample_name"]
            res_path = Path(results_base.format(sample_name=sample_name))
            summary = cls.summarize_single_sample(sample_name, res_path, verbose_flag=verbose)
            results_dict[sample_name] = summary
        
        # Convert to DataFrame
        result_df = pd.DataFrame.from_dict(results_dict, orient="index").reset_index(names="sample_name")

        # Merge with original metadata
        merged_df = df.merge(result_df, on="sample_name", how="left")

        # Create and return object
        obj = cls(results_dict)
        obj.results_df = merged_df
        return obj

    def write_tsv(self, output_file: Path):
        """
        Writes the summarized typing results to a TSV file.

        Args:
            output_file (Path): Destination file path for the output table.
        """
        self.results_df.to_csv(output_file, sep="\t", index=False)

    def __repr__(self):
        """
        Returns a concise summary of the results object.

        Returns:
            str: A string with sample and variable counts.
        """
        return f"<EcoliResults: {len(self.results_df)} samples, {len(self.results_df.columns)} variables>"

### Parser
The function `ecoli_parser()` defines the parser for the samplesheet.tsv file, leveraging the previously described functionality to summarize *Escherichia coli* typing results for each sample. The parser outputs a .tsv file or prints the results as a DataFrame, with an option to include verbose details.



In [None]:
#| export
@call_parse
def ecoli_parser(
    samplesheet_path: Path,  # Input samplesheet
    output_file: Path = None,  # Path to output
    verbose: int = 1,  # Verbosity - binary value to add information 0=exclude, 1=include
    results_base: str = "examples/Results/{sample_name}/kma/{sample_name}.res"  # Path template for .res files
):
    """Summarizes E. coli typing results for every sample in a samplesheet, then writes them to a TSV file or prints them."""
    results = EcoliResults.from_samplesheet(samplesheet_path, verbose=verbose, results_base=results_base)
    if output_file:
        results.write_tsv(output_file)
    else:
        print(results.results_df)


## Testing – Empirical Datasets

Inline testing is performed using two empirical datasets to validate the functionality of the EcoliResults pipeline and its data wrangling capabilities under different input conditions, each representing distinct biological scenarios:

A known *Escherichia coli* sample ERR3528110 (*https://www.ebi.ac.uk/ena/browser/view/ERR3528110?show=reads*) expected to produce detectable O:H serotype and stx results, as demonstrated throughout this notebook.

A known *Actinobacillus pleuropneumoniae* sample ERR14229029 (*https://www.ebi.ac.uk/ena/browser/view/ERR14229029?show=reads*) expected to yield no results for *Escherichia coli*-specific loci, serving to confirm that the pipeline appropriately filters non-target organisms.

In [27]:
#| export
#| eval: true
import pandas as pd
from pathlib import Path
import os

# Define paths
samplesheet_path = Path("test_input/Ecoli/samplesheet.tsv")
output_dir = Path("test_output/Ecoli")

# Create output directory
if not output_dir.exists():
    output_dir.mkdir(parents=True, exist_ok=True)

output_path = output_dir / "KMA_cases_parser.tsv"

# Assert input exists
assert samplesheet_path.exists(), f"File does not exist: {samplesheet_path}"
print(output_path)

# try the ecoli parser to see if the wrangling functionality works
try:
    ecoli_parser(
        samplesheet_path=samplesheet_path,
        output_file=output_path,
        verbose=1,
        results_base="test_input/Ecoli/{sample_name}.res"
    )
except Exception as e:
    raise AssertionError(f"Parser execution failed: {e}")

# compare the output with the expected results based on input to ensure correct wrangling

# read the created output files and check the information
sample_sheet_df = pd.read_csv(samplesheet_path, sep="\t")
sample_output_df = pd.read_csv(output_path, sep="\t")

### Test case 1. Check if the datastructure is correct
original_cols = sample_sheet_df.columns.tolist()
output_cols = sample_output_df.columns.tolist()
output_initial_cols = sample_output_df.columns[:len(original_cols)].tolist()
output_specific_cols = sample_output_df.columns[len(original_cols):].tolist()

assert original_cols == output_initial_cols, (
    f"Mismatch in first columns:\nExpected: {original_cols}\nGot: {output_initial_cols}"
)

assert output_specific_cols

### Test case 2. Check sample ERR3528110, which is known to be E. coli, and ensure the data wrangling behaves as expected
ERR3528110_res_path = "test_input/Ecoli/ERR3528110.res"
ERR3528110_input_df = pd.read_csv(ERR3528110_res_path, sep="\t")

ERR3528110_row = sample_output_df[sample_output_df["sample_name"] == "ERR3528110"].iloc[:,len(original_cols):len(output_cols)].iloc[0]

#extract the original genes from the res
gene_hits = ERR3528110_input_df["#Template"].tolist()

parsed_hits = []

for hit in gene_hits:
    parts = hit.split("__")
    assert len(parts) >= 3, f"Unexpected KMA result format in: '{hit}'. Expected at least 3 '__' parts (e.g., ref__gene__allele) as of ecoli fbi 24-04-2025."
    gene, allele = parts[1], parts[2]
    parsed_hits.append((gene, allele))

# Extract OH genes 
O_gene_alleles = {gene: allele for gene, allele in parsed_hits if gene in {"wzx", "wzy", "wzt", "wzm"}}
H_gene_alleles = {gene: allele for gene, allele in parsed_hits if gene in {"fli", "fliC"}}

O_type = ERR3528110_row["OH"].split(";")[0]
H_type = ERR3528110_row["OH"].split(";")[1]

O_gene_keys = set(O_gene_alleles.keys())
H_gene_keys = set(H_gene_alleles.keys())

O_genes_no = len(O_gene_keys)
H_genes_no = len(H_gene_keys)

# O typing scenarios
# Case 1: wzx/wzy match
if O_gene_keys == {"wzx", "wzy"} and O_gene_alleles["wzx"] == O_gene_alleles["wzy"]:
    expected_otype = O_gene_alleles["wzx"]
    assert O_type == expected_otype, f"Expected OH '{expected_otype}', got '{O_type}'"
    # wzx/wzy should be suppressed
    assert ERR3528110_row["wzx"] == "-", "wzx column should be '-' when OH is used"
    assert ERR3528110_row["wzy"] == "-", "wzy column should be '-' when OH is used"
    #print(f"O-type correctly assigned from matching wzx/wzy: {O_type}")

# Case 2: wzt/wzm match
elif O_gene_keys == {"wzt", "wzm"} and O_gene_alleles["wzt"] == O_gene_alleles["wzm"]:
    expected_otype = O_gene_alleles["wzt"]
    assert O_type == expected_otype, f"Expected OH '{expected_otype}', got '{O_type}'"
    assert ERR3528110_row["wzt"] == "-", "wzt column should be '-' when OH is used"
    assert ERR3528110_row["wzm"] == "-", "wzm column should be '-' when OH is used"
    #print(f"O-type correctly assigned from matching wzt/wzm: {O_type}")

# Case 3: Conflict (≥3 genes, or 2 mismatched genes)
elif O_genes_no >= 3 or (
    (O_gene_keys == {"wzx", "wzy"} and O_gene_alleles["wzx"] != O_gene_alleles["wzy"]) or
    (O_gene_keys == {"wzt", "wzm"} and O_gene_alleles["wzt"] != O_gene_alleles["wzm"])
):
    assert O_type == "-", f"Expected OH = '-' due to conflict, got: '{O_type}'"
    for gene in O_gene_keys:
        assert ERR3528110_row[gene] == O_gene_alleles[gene], f"{gene} column should contain '{O_gene_alleles[gene]}'"
    #print("Conflict in O-typing correctly led to OH = '-' and individual gene columns retained.")

# H typing scenarios

# Case 1: If fli is present it always takes precedence over fliC
if "fli" in H_gene_keys:
    expected_htype = H_gene_alleles["fli"]
    assert H_type == expected_htype, f"Expected H-type '{expected_htype}' from 'fli', got '{H_type}'"

# Case 2: fliC is used only if it is the sole flagellar gene
elif H_gene_keys == {"fliC"}:
    expected_htype = H_gene_alleles["fliC"]
    assert H_type == expected_htype, f"Expected H-type '{expected_htype}' from 'fliC', got '{H_type}'"

# Case 3: if none exist the H type remains empty
else:
    assert H_type == "-", f"Expected H-type '-', but got '{H_type}'"

### Test case 3. Check that sample ERR14229029, which is listed as e.coli in the samplesheet but is actually another species, produces empty results

ERR14229029_row = sample_output_df[sample_output_df["sample_name"] == "ERR14229029"].iloc[:,len(original_cols):len(output_cols)].iloc[0]

ERR14229029_expected_values = ['-', '-;-', '-', '-', '-', '-', '-', '-', '-', float('nan')]
ERR14229029_values = [ERR14229029_row[col] for col in output_specific_cols]

for col, actual, expected in zip(output_specific_cols, ERR14229029_values, ERR14229029_expected_values):
    if pd.isna(expected):
        assert pd.isna(actual), f"{col}: Expected NaN, got {actual}"
    else:
        assert actual == expected, f"{col}: Expected '{expected}', got '{actual}'"


test_output/Ecoli/KMA_cases_parser.tsv
Logging started for examples/Log/ERR3528110_kma_fbi.log
Processing .res file: test_input/Ecoli/ERR3528110.res
Successfully processed sample: ERR3528110
Logging started for examples/Log/ERR14229029_kma_fbi.log
Processing .res file: test_input/Ecoli/ERR14229029.res
Successfully processed sample: ERR14229029


## Testing – Synthetic Data

This section defines 12 synthetic sample scenarios designed to validate case-specific functionality of the EcoliResults pipeline. Each sample simulates different KMA *.res* file content representing a variety of pathogenic E. coli cases and edge conditions.

- Synthetic sample 1  : O103;H2 - Contains wzx and wzy gene information with no conflicts — clean O:H type assignment.
- Synthetic sample 2  : O8;H10 - Contains consistent wzt and wzm results. Also includes stx2-a and eae genes.
- Synthetic sample 3  : -;H7 - Only fliC is present — no O-type, H-type resolved via fliC.
- Synthetic sample 4  : -;H11 - Contains a malformed line (`bad_line`), which is excluded during processing, and a single unpaired *wzy* hit, so no O-type is assigned.
- Synthetic sample 5  : Simulates a non-*Escherichia coli* sample or failed alignment — no results expected.
- Synthetic sample 6  : -;H2 - Contains all four O-antigen genes (*wzx, wzy, wzt, wzm*) with conflicting data — O-type left unresolved. The output is expected to have filled *wzx,wzy,wzt & wzm* columns and empty *OH*.
- Synthetic sample 7  : -;H9 - Contains one correct O-antigen gene pair (*wzx,wzy*) with conflicting data. The output is expected to have filled *wzx & wzy* columns and empty *OH*.
- Synthetic sample 8  : -;H1 - Both fli and fliC genes are present; confirms that the general fli hit is preferred over fliC.
- Synthetic sample 9  : -;H2, stx1-a;stx2-a;stx2-d, Positive, Positive - complex case with all four O-antigens present, both flagellar gene types, multiple stx subtypes, and additional virulence genes. Tests full pipeline capabilities.
- Synthetic sample 10 : -;H4 - Includes an unrelated gene (*adk*) to test the pipeline's behavior with an irrelevant hit.
- Synthetic sample 11 : -;H6 - Contains eae, but fails threshold filtering — should be excluded from final result.
- Synthetic sample 12 : -;H21, with two Shiga toxins stx1a;stx2c - Tests detection of multiple stx subtypes.

In [28]:

import os
from tempfile import TemporaryDirectory
from pathlib import Path

test_cases = [
    # sample_name, res_content, expected_oh, expected_stx, expected_eae, expected_ehxA
    ("sample1", "1__wzx__O103__X\t100\t100\t60\n2__wzy__O103__X\t100\t100\t65\n3__fliC__H2__X\t100\t100\t70", "O103;H2", "-", "-", "-"),
    ("sample2", "1__wzt__O8__X\t100\t100\t60\n2__wzm__O8__X\t100\t100\t65\n3__fliC__H10__X\t100\t100\t70\n4__stx2__stx2-a__X\t100\t100\t90\n5__eae__eae-5__X\t100\t100\t80", "O8;H10", "stx2-a", "Positive", "-"),
    ("sample3", "1__fliC__H7__X\t100\t100\t70", "-;H7", "-", "-", "-"),
    ("sample4", "bad_line\n2__wzy__O111__X\t100\t100\t70\n3__fliC__H11__X\t100\t100\t70", "-;H11", "-", "-", "-"),
    ("sample5", "", "-;-", "-", "-", "-"),
    ("sample6", "1__wzx__O157__X\t100\t100\t60\n2__wzy__O157__X\t100\t100\t65\n3__wzt__O8__X\t100\t100\t60\n4__wzm__O8__X\t100\t100\t65\n5__fli__H2__X\t100\t100\t70", "-;H2", "-", "-", "-"),
    ("sample7", "1__wzx__O157__X\t100\t100\t60\n2__wzy__O111__X\t100\t100\t65\n3__fliC__H9__X\t100\t100\t70", "-;H9", "-", "-", "-"),
    ("sample8", "1__fli__H1__X\t100\t100\t70\n2__fliC__H12__X\t100\t100\t70", "-;H1", "-", "-", "-"),
    ("sample9", "1__wzx__O157__X\t100\t100\t60\n2__wzy__O157__X\t100\t100\t65\n3__wzt__O8__X\t100\t100\t60\n4__wzm__O8__X\t100\t100\t65\n5__fliC__H10__X\t100\t100\t70\n6__fli__H2__X\t100\t100\t70\n7__stx1__stx1-a__X\t100\t100\t90\n8__stx2__stx2-d__X\t100\t100\t90\n9__stx2__stx2-a__X\t100\t100\t90\n10__eae__eae-42-5__X\t100\t100\t80\n11__ehxA__ehxA-7__X\t100\t100\t80", "-;H2", "stx1-a;stx2-a;stx2-d", "Positive", "Positive"),
    ("sample10", "1__adk__adk__X\t100\t100\t70\n2__fliC__H4__X\t100\t100\t70", "-;H4", "-", "-", "-"),
    ("sample11", "1__eae__eae-1__X\t100\t94\t70\n2__fliC__H6__X\t100\t100\t70", "-;H6", "-", "-", "-"),
    ("sample12", "1__stx1__stx1a__X\t100\t100\t80\n2__stx2__stx2c__X\t100\t100\t85\n3__fli__H21__X\t100\t100\t70", "-;H21", "stx1a;stx2c", "-", "-"),
]

# Remember the starting directory so it can be restored after each case;
# otherwise the notebook is left inside a temporary directory that has been deleted.
original_cwd = Path.cwd()

for sample_name, res_content, expected_oh, expected_stx, expected_eae, expected_ehxA in test_cases:
    with TemporaryDirectory() as tmpdir:
        tmpdir = Path(tmpdir)
        os.chdir(tmpdir)
        try:
            # Build the directory layout and the .res file for this sample
            res_dir = tmpdir / f"examples/Results/{sample_name}/kma"
            res_dir.mkdir(parents=True)
            res_file = res_dir / f"{sample_name}.res"
            res_file.write_text("#Template\tTemplate_Coverage\tQuery_Identity\tDepth\n" + res_content)

            # Write a one-row samplesheet pointing at the sample
            sheet = tmpdir / "samplesheet.tsv"
            sheet.write_text(
                "sample_name\tIllumina_read_files\tNanopore_read_file\tassembly_file\torganism\tvariant\tnotes\n"
                f"{sample_name}\tread1.fastq,read2.fastq\t-\t-\tEcoli\t-\t-\n"
            )

            results = EcoliResults.from_samplesheet(sheet)
            df = results.results_df
            row = df.iloc[0]
        finally:
            # Leave the temporary directory before it is removed
            os.chdir(original_cwd)

        # general output and functionality test
        assert row["sample_name"] == sample_name

        expected_columns = {"OH": expected_oh, "stx": expected_stx, "eae": expected_eae, "ehxA": expected_ehxA}
        for column, expected_value in expected_columns.items():
            if row[column] != expected_value:
                raise AssertionError(
                    f"\nSample: {sample_name}\nExpected {column}: {expected_value}\nActual {column}: {row[column]}"
                )

        # sample specific information tests

        # without conflicting O and H typing, the OH column should be filled and the four gene columns empty
        if sample_name == "sample1":
            assert row["wzx"] == "-"
            assert row["wzy"] == "-"
            assert row["wzt"] == "-"
            assert row["wzm"] == "-"
        # with conflicts, the O-type stays unresolved and the four conflicting gene columns remain filled
        elif sample_name == "sample6":
            assert row["wzx"] == "O157"
            assert row["wzy"] == "O157"
            assert row["wzt"] == "O8"
            assert row["wzm"] == "O8"
        # an unrelated gene should be reported in the Other column
        elif sample_name == "sample10":
            assert row["Other"] == "adk"

print("All 12 synthetic E. coli sample inline tests passed.")

Logging started for examples/Log/sample1_kma_fbi.log
Processing .res file: examples/Results/sample1/kma/sample1.res
Successfully processed sample: sample1
Logging started for examples/Log/sample2_kma_fbi.log
Processing .res file: examples/Results/sample2/kma/sample2.res
Successfully processed sample: sample2
Logging started for examples/Log/sample3_kma_fbi.log
Processing .res file: examples/Results/sample3/kma/sample3.res
Successfully processed sample: sample3
Logging started for examples/Log/sample4_kma_fbi.log
Processing .res file: examples/Results/sample4/kma/sample4.res
Successfully processed sample: sample4
Logging started for examples/Log/sample5_kma_fbi.log
Processing .res file: examples/Results/sample5/kma/sample5.res
Successfully processed sample: sample5
Logging started for examples/Log/sample6_kma_fbi.log
Processing .res file: examples/Results/sample6/kma/sample6.res
Successfully processed sample: sample6
Logging started for examples/Log/sample7_kma_fbi.log
Processing .res f
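
For reference, the synthetic `.res` content written by the test cell has four tab-separated columns (`Template`, `Template_Coverage`, `Query_Identity`, `Depth`) and template names of the form `<index>__<gene>__<allele>__<suffix>`. The following is a minimal sketch of reading such text while skipping malformed lines like `bad_line`. The function `parse_res_text` and the dictionary keys it returns are hypothetical and only mirror the format of the test data; the module's own parsing may differ.

```python
from typing import Dict, List, Union

COLUMNS = ("Template", "Template_Coverage", "Query_Identity", "Depth")


def parse_res_text(text: str) -> List[Dict[str, Union[str, float]]]:
    """Parse synthetic .res text into one dict per well-formed hit."""
    hits = []
    for line in text.splitlines():
        # Skip blank lines and the '#Template...' header
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        name_parts = fields[0].split("__")
        # Malformed rows (wrong column count or template name) are ignored
        if len(fields) != len(COLUMNS) or len(name_parts) != 4:
            continue
        try:
            coverage, identity, depth = (float(value) for value in fields[1:])
        except ValueError:
            continue
        hits.append(
            {
                "gene": name_parts[1],
                "allele": name_parts[2],
                "coverage": coverage,
                "identity": identity,
                "depth": depth,
            }
        )
    return hits


sample4 = (
    "#Template\tTemplate_Coverage\tQuery_Identity\tDepth\n"
    "bad_line\n2__wzy__O111__X\t100\t100\t70\n3__fliC__H11__X\t100\t100\t70"
)
for hit in parse_res_text(sample4):
    print(hit["gene"], hit["allele"], hit["identity"])
```

Keeping the numeric columns as floats makes a later threshold comparison straightforward; the threshold values themselves are left to the module.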

# Directive for ensuring that the code in your notebook gets executed as a script

The code-block here below is required to ensure that the code in the notebook is also transferred to the module (script), otherwise it will just be a notebook. See [Coding in NBdev](https://dksund.sharepoint.com/:fl:/g/contentstorage/CSP_7c761ee7-b577-4e08-8517-bc82392bf65e/ETlSfUyArSNJhX8veMI_JQ8By1aXGHzDJkhotpfpXx4mmw?e=037EwH&nav=cz0lMkZjb250ZW50c3RvcmFnZSUyRkNTUF83Yzc2MWVlNy1iNTc3LTRlMDgtODUxNy1iYzgyMzkyYmY2NWUmZD1iJTIxNXg1MmZIZTFDRTZGRjd5Q09TdjJYblkwVlNiWXFYcE1yaHVrVmZqTVJUVEE4X1VwZjhTd1JxcjRNdmFrSmh2RCZmPTAxVlVLVzVWSlpLSjZVWkFGTkVORVlLN1pQUERCRDZKSVAmYz0lMkYmYT1Mb29wQXBwJnA9JTQwZmx1aWR4JTJGbG9vcC1wYWdlLWNvbnRhaW5lciZ4PSU3QiUyMnclMjIlM0ElMjJUMFJUVUh4a2EzTjFibVF1YzJoaGNtVndiMmx1ZEM1amIyMThZaUUxZURVeVpraGxNVU5GTmtaR04zbERUMU4yTWxodVdUQldVMkpaY1Zod1RYSm9kV3RXWm1wTlVsUlVRVGhmVlhCbU9GTjNVbkZ5TkUxMllXdEthSFpFZkRBeFZsVkxWelZXU1RJMVJsaFBNalkyUlZkQ1FqTTFRVmhKVTBkRFVVcFdXa1klM0QlMjIlMkMlMjJpJTIyJTNBJTIyNzRmNzM1ZmUtYzg4Ny00MjhhLWFkZmYtNTEyZTg2YmNmZmQzJTIyJTdE) 
(**Writing your own notebooks**) on Loop for more details.

In [29]:
# | hide
# Included at the end so that running the notebook also exports its code to the module, rather than leaving it only in the notebook
import nbdev

nbdev.nbdev_export()

ModuleNotFoundError: No module named 'nbdev'
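
The `ModuleNotFoundError` above means that nbdev is not installed in the environment running the notebook. One possible way to make the final cell report this clearly instead of raising is sketched below. The helper name `export_if_available` is hypothetical; the only nbdev call it makes is the same `nbdev.nbdev_export()` used in the cell above.

```python
import importlib
import importlib.util


def export_if_available(module_name: str = "nbdev") -> bool:
    """Run nbdev_export() if the module is installed; return whether it ran."""
    # find_spec returns None when a top-level module cannot be found
    if importlib.util.find_spec(module_name) is None:
        print(f"{module_name} is not installed; skipping export (try: pip install {module_name})")
        return False
    module = importlib.import_module(module_name)
    module.nbdev_export()
    return True


# Usage in the final notebook cell:
# export_if_available()
```

Installing nbdev in the active environment remains the actual fix; the guard only keeps the notebook runnable in the meantime.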