### **Purpose of the Code**

This Jupyter notebook presents a Python-based workflow that aligns class D β-lactamase sequences to a pre-built Hidden Markov Model (HMM) profile. The primary objective is consistent residue numbering across β-lactamase sequences. This standardization addresses the inconsistent numbering schemes found in the literature, which can impede comparative studies and data integration. The code also provides a consensus secondary structure annotation based on the reference structure (OXA-48, PDB ID: [5DTK](https://www.rcsb.org/structure/5dtk)). The annotation gives the consensus name of each secondary structure element as aligned to the reference; to assign secondary structure elements to individual residues of the query itself, researchers should refer to the literature or structural biology tools.



By leveraging structural alignment and HMM-based methods, this workflow facilitates:

- Accurate alignment of query sequences to a structurally informed reference model.
- Automatic generation of outputs, including mapping tables and .aln files.

### **Why Use HMM for Sequence Alignment?**

Hidden Markov Models (HMMs) are powerful tools for sequence alignment and annotation, especially for structurally diverse families like class D β-lactamases. Here's why:



1.   **Sensitivity to Sequence Variation:**
HMMs capture both conserved and variable regions within a sequence family by learning from a multiple sequence alignment (MSA). This capability makes them adept at detecting distantly related sequences and structural motifs, which is crucial for analyzing diverse enzyme families like class D β-lactamases.
[EMBL-EBI HOMEPAGE](https://www.ebi.ac.uk/training/online/courses/protein-classification-intro-ebi-resources/what-are-protein-signatures/signature-types/what-are-hmms/)

2.   **Structure-Based Alignment:**
The HMM profile employed in this workflow is derived from a structural alignment of experimentally obtained 3D structures of class D β-lactamases. This ensures that the alignment reflects not only sequence similarity but also structural conservation, which is essential for functional insights.


3.   **Standardization:**
Aligning all query sequences to a single HMM profile ensures consistent residue numbering, eliminating discrepancies between studies and facilitating direct comparisons. This standardization is vital for integrating data across different research efforts.
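
The numbering transfer described in point 3 can be sketched in a few lines. This is a minimal illustration, not the code in `ASSIGN_SAND`: the function name `map_numbering`, the row keys, and the comment labels are hypothetical, and the sketch assumes that query residues inserted relative to the reference receive no standard number (the actual workflow may handle insertions differently).

```python
def map_numbering(ref_aln, query_aln, gap_chars="-."):
    """Walk two aligned sequences column by column and pair up residue numbers.

    Hypothetical helper for illustration only; not part of ASSIGN_SAND.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")

    rows = []
    ref_pos = 0    # running residue count in the reference
    query_pos = 0  # running residue count in the query (original numbering)
    for ref_res, query_res in zip(ref_aln, query_aln):
        ref_gap = ref_res in gap_chars
        query_gap = query_res in gap_chars
        if ref_gap and query_gap:
            continue  # column is empty in both sequences
        if not ref_gap:
            ref_pos += 1
        if not query_gap:
            query_pos += 1
        rows.append({
            "ref_number": None if ref_gap else ref_pos,
            "ref_residue": ref_res,
            "query_residue": query_res,
            "query_original": None if query_gap else query_pos,
            # Assumption: the standard number is the reference position,
            # assigned only where both sequences have a residue.
            "query_standard": None if (ref_gap or query_gap) else ref_pos,
            "comment": "deletion" if query_gap else "insertion" if ref_gap else "",
        })
    return rows


# Toy alignment: the query lacks the first eight reference residues.
for row in map_numbering("mrvlalsavf", "--------mk"):
    print(row)
```

Because every query is numbered through the same reference coordinates, equivalent positions in two different enzymes end up with the same standard number, regardless of where each sequence starts.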


## Installing packages and fetching files:

In [None]:
# Make sure you are in the folder that contains the required files
!ls

## Run the code:

In [1]:
import sys
script_dir = "./files"
if script_dir not in sys.path:
    sys.path.append(script_dir)

from ASSIGN_SAND import (
    upload_and_save_query,
    align_sequences,
    parse_alignment,
    map_and_save_csv,
    print_mapping_table,
    load_secondary_structure_annotations,
    cleanup_files,
)
import pandas as pd
pd.options.display.max_colwidth = 8  # Limit column width
pd.options.display.max_rows = None  # Display all rows

def main():
    """
    Align a query sequence to the HMM profile that was pre-built from a structure-based sequence alignment,
    then map the standard numbering onto the query and save it as a CSV file that is downloaded automatically. Make sure your browser does not block automatic file downloads.
    """
    # Step 1: Upload query FASTA file
    upload_and_save_query() # File upload interface in Google Colab.


    # Step 2: Align the query sequence to the reference and HMM profile sequence
    align_sequences()


    # Step 3: Load secondary structure annotations
    ss_dict = load_secondary_structure_annotations("./files/ss_dictionary.csv")

    # Step 4: Mapping Standard numbering on the query sequence
    reference, query = parse_alignment("query_ref_aligned.aln")
    query_name = query.id.replace(" ", "_") #This step is only to avoid having spaces in output file name
    mapped_output_file_col = f"mapped_{query_name}_column.csv" # This is the name of the output file you should see in your Downloads folder

    map_and_save_csv(reference, query, mapped_output_file_col, ss_dict) # Saving the mapping table as csv

    print_mapping_table(reference, query, ss_dict) # Print the mapping table in the notebook
    cleanup_files()
# Call this function to run the full workflow
if __name__ == "__main__":
    main()

Please enter the path to your query FASTA file:  ./example/oxa-10.fasta


File uploaded and saved as query.fasta
Query sequence aligned to the HMM profile.
Mapped CSV file saved as mapped_OXA-10_column.csv (column format).


Unnamed: 0,Reference residue number,Reference Secondary Structure Annotation,Reference residue name,Query residue name,Query original numbering,Query Standard numbering SAND,Comments
0,1,,m,-,-,-,Dele...
1,2,,r,-,-,-,Dele...
2,3,,v,-,-,-,Dele...
3,4,,l,-,-,-,Dele...
4,5,,a,-,-,-,Dele...
5,6,,l,-,-,-,Dele...
6,7,,s,-,-,-,Dele...
7,8,,a,-,-,-,Dele...
8,9,,v,m,1,9,
9,10,,f,k,2,10,
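
The saved table can be post-processed with the standard library. The sketch below parses the rows printed above (copied verbatim, including the truncated `Comments` values) and reports how far the standard numbering is shifted from the query's original numbering at the first aligned residue. The variable names `excerpt`, `unaligned`, `aligned`, and `offset` are illustrative and not part of the workflow; in practice the same logic could read the saved CSV file instead of an inline string.

```python
import csv
import io

# First rows of the mapping table, copied from the notebook output above.
# The Comments column is truncated by the notebook's display settings.
excerpt = """\
Unnamed: 0,Reference residue number,Reference Secondary Structure Annotation,Reference residue name,Query residue name,Query original numbering,Query Standard numbering SAND,Comments
0,1,,m,-,-,-,Dele...
1,2,,r,-,-,-,Dele...
2,3,,v,-,-,-,Dele...
3,4,,l,-,-,-,Dele...
4,5,,a,-,-,-,Dele...
5,6,,l,-,-,-,Dele...
6,7,,s,-,-,-,Dele...
7,8,,a,-,-,-,Dele...
8,9,,v,m,1,9,
9,10,,f,k,2,10,
"""

rows = list(csv.DictReader(io.StringIO(excerpt)))

# Reference positions with no query residue are marked with "-".
unaligned = [r for r in rows if r["Query residue name"] == "-"]
aligned = [r for r in rows if r["Query residue name"] != "-"]

# Shift between standard and original numbering at the first aligned residue.
first = aligned[0]
offset = int(first["Query Standard numbering SAND"]) - int(first["Query original numbering"])

print(f"Reference positions without a query residue: {len(unaligned)}")
print(f"Numbering offset at first aligned residue: {offset}")
```

The offset applies only locally: it changes wherever the alignment contains further gaps in either sequence, so positions should be looked up in the table rather than computed from a single shift.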


In [None]:
from ASSIGN_SAND import (
    upload_and_save_query,
    align_sequences,
    parse_alignment,
    map_and_save_csv,
    print_mapping_table,
    load_secondary_structure_annotations,
)
import pandas as pd
import os

pd.options.display.max_colwidth = 8  # Limit column width
pd.options.display.max_rows = None  # Display all rows

def main():
    """
    Align a query sequence to the HMM profile that was pre-built from a structure-based sequence alignment,
    then map the standard numbering onto the query and save it to a CSV output file.
    """
    try:
        # Step 1: Upload query FASTA file
        upload_and_save_query()  # Handles file upload locally.

        # Step 2: Align the query sequence to reference and HMM profile sequence
        align_sequences()

        # Step 3: Load secondary structure annotations
        ss_dict = load_secondary_structure_annotations("ss_dictionary.csv")

        # Step 4: Mapping Standard numbering on the query sequence
        reference, query = parse_alignment("query_ref_aligned.aln")
        query_name = query.id.replace(" ", "_")  # This step is only to avoid having spaces in the output file name
        mapped_output_file_col = f"mapped_{query_name}_column.csv"  # This is the name of the output file

        map_and_save_csv(reference, query, mapped_output_file_col, ss_dict)  # Save the mapping table as CSV

        print_mapping_table(reference, query, ss_dict)  # Print the mapping table in the notebook

    finally:
        # Clean up temporary files
        for temp_file in ["concatenated_output.fasta", "query.fasta", "query_ref_aligned.sto", "query_ref_aligned.aln"]:
            if os.path.exists(temp_file):
                os.remove(temp_file)
                print(f"Deleted temporary file: {temp_file}")

# Call this function to run the full workflow
if __name__ == "__main__":
    main()
