Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Sep 11, 2025

This PR implements a comprehensive refactor of the element resolution system to make it fast, explicit, and future-proof for the upcoming atom_type refactor, while maintaining full backward compatibility.

Overview

The previous element_guess naming was unclear and suggested ad-hoc heuristics. This refactor introduces an explicit atomic number (Z) identity system with a human-readable enum and implements hash-backed lookups for common molecular modeling forcefields and PDB conventions.

Key Changes

New Public API (src/core/md_atomic.h)

  • Atomic Number Types: typedef uint8_t md_atomic_number_t with md_element_t as legacy alias
  • Complete Element Enum: MD_Z_H=1, MD_Z_C=6, MD_Z_O=8, ..., MD_Z_OG=118 for all 118 elements
  • New Preferred Functions:
    md_atomic_number_t md_atomic_number_from_symbol(str_t sym);
    str_t md_symbol_from_atomic_number(md_atomic_number_t z);
    md_atomic_number_t md_atom_infer_atomic_number(str_t atom_name, str_t res_name);
    bool md_atoms_infer_atomic_numbers(md_atomic_number_t out[], size_t n, const md_molecule_t* mol);

Hash-Backed Inference System (src/md_atomic_infer.c)

  • Fast Lookups: Uses md_hashmap32_t with md_hash64 keys for O(1) performance
  • Context-Aware Resolution: Disambiguates "CA" (alpha carbon vs calcium) using residue context
  • Comprehensive Coverage:
    • Water variants: HOH/WAT/TIP3/TIP4/TIP5/SPC/SOL/H2O with O/OW/OH2 and H/H1/H2/HW1/HW2
    • Amino acids: Backbone atoms (N, CA, C, O, OXT) and sidechain atoms (SER/OG, CYS/SG, etc.)
    • Nucleic acids: Phosphate groups (P, OP1/OP2/O1P/O2P) for DNA/RNA residues
    • Ions: NA, K, CL, BR, MG, CA, ZN, FE, etc.
    • Fallback heuristics: Two-letter elements (CL12→Cl), first-letter mapping

Perfect Backward Compatibility

All existing md_util_* functions remain unchanged and fully functional:

// These continue to work exactly as before
md_element_t elem = md_util_element_lookup(STR_LIT("Ca"));
str_t symbol = md_util_element_symbol(elem);
bool success = md_util_element_guess(elements, count, mol);

The old md_util_element_guess now delegates to the new hash-backed system, providing better performance and accuracy while maintaining identical behavior.

Examples

Context-Aware Disambiguation

// Correctly identifies as carbon alpha, not calcium
md_atomic_number_t ca_carbon = md_atom_infer_atomic_number(STR_LIT("CA"), STR_LIT("ALA"));
assert(ca_carbon == MD_Z_C);

// Correctly identifies as calcium ion  
md_atomic_number_t ca_calcium = md_atom_infer_atomic_number(STR_LIT("CA"), STR_LIT("CA"));
assert(ca_calcium == MD_Z_CA);

Water Recognition

// All water variants correctly identified
assert(md_atom_infer_atomic_number(STR_LIT("OW"), STR_LIT("TIP3")) == MD_Z_O);
assert(md_atom_infer_atomic_number(STR_LIT("HW1"), STR_LIT("SPC")) == MD_Z_H);

Testing

Added comprehensive test suite (13 new tests) covering:

  • ✅ Enum constants and symbol lookups
  • ✅ Water, amino acid, nucleic acid, and ion inference
  • ✅ Fallback mechanisms and edge cases
  • ✅ API equivalence between old and new systems
  • ✅ Backward compatibility verification

Performance & Compatibility

  • Performance: O(1) hash lookups replace O(n) linear searches
  • Memory: Efficient static hash tables initialized on first use
  • Compatibility: Zero breaking changes - all existing code continues to work
  • Future-proof: Clean atomic number foundation for upcoming atom_type refactor

This refactor delivers the requested fast, explicit element inference while maintaining 100% backward compatibility for downstream users including viamd.

This pull request was created as a result of the following prompt from Copilot chat.

Refactor element inference to atomic-number inference, add a public atomic-number enum, and use md_hashmap32_t-backed lookup tables while preserving backward compatibility.

Context and goals

  • We want to make the element resolution fast, explicit, and future-proof for the upcoming atom_type refactor. The current naming (element_guess) is unclear and suggests ad-hoc heuristics.
  • Introduce an explicit notion of atomic number (Z) as the identity, and expose a human-readable enum with a safe prefix (e.g., MD_Z_H=1, MD_Z_C=6) to avoid naming collisions.
  • Implement a hash-backed, extended lookup for common MM forcefields and PDB conventions that disambiguates names like "CA" (alpha carbon) vs "Ca" (calcium) using residue context.
  • Maintain full compatibility with the existing md_util_* API by providing inline wrappers and type aliases, so downstream users (including viamd) are not broken.

Deliverables

  1. New public header for atomic numbers

    • File: src/core/md_atomic.h
    • Provide:
      • typedef uint8_t md_atomic_number_t; // 0 = unknown
      • typedef md_atomic_number_t md_element_t; // legacy alias for compatibility
      • Public enum listing all elements with a clear prefix:
        enum { MD_Z_X = 0, MD_Z_H = 1, MD_Z_HE = 2, MD_Z_LI = 3, ..., MD_Z_O = 8, MD_Z_P = 15, MD_Z_S = 16, ..., up to 118 (OGANESSON) }.
        Notes:
        • MD_Z_X (0) represents unknown.
        • Include the full periodic table, consistent with existing md_util_element_symbols ordering.
      • New preferred API names (forward-declare):
        md_atomic_number_t md_atomic_number_from_symbol(str_t sym);
        md_atomic_number_t md_atomic_number_from_symbol_icase(str_t sym);
        str_t md_symbol_from_atomic_number(md_atomic_number_t z);
        str_t md_name_from_atomic_number(md_atomic_number_t z);
        float md_atomic_mass(md_atomic_number_t z);
        float md_vdw_radius(md_atomic_number_t z);
        float md_covalent_radius(md_atomic_number_t z);
        int md_max_valence(md_atomic_number_t z);
        uint32_t md_cpk_color(md_atomic_number_t z);
        // Per-atom inference from labels (atom name + residue)
        md_atomic_number_t md_atom_infer_atomic_number(str_t atom_name, str_t res_name);
        // Batch form wired to molecule structure
        bool md_atoms_infer_atomic_numbers(md_atomic_number_t out[], size_t n, const struct md_molecule_t* mol);
  2. Update md_util.h to include md_atomic.h and add compatibility wrappers

    • Keep all existing declarations to avoid breaking users.
    • Add inline wrappers that forward old names to the new API, e.g.:
      • md_util_element_lookup -> md_atomic_number_from_symbol
      • md_util_element_lookup_ignore_case -> md_atomic_number_from_symbol_icase
      • md_util_element_symbol -> md_symbol_from_atomic_number
      • md_util_element_name -> md_name_from_atomic_number
      • md_util_element_atomic_mass -> md_atomic_mass
      • md_util_element_vdw_radius -> md_vdw_radius
      • md_util_element_covalent_radius -> md_covalent_radius
      • md_util_element_max_valence -> md_max_valence
      • md_util_element_cpk_color -> md_cpk_color
      • md_util_element_guess_from_names (if present) -> md_atom_infer_atomic_number
      • md_util_element_guess (batch) -> md_atoms_infer_atomic_numbers
    • Prefer to annotate wrappers with a deprecation macro if one exists (or TODO comment), but do not remove old names.
  3. Implement hash-backed element inference

    • New implementation file (choose one):
      • src/md_atomic_infer.c (preferred) or adapt existing element_guess implementation file.
    • Content:
      • Build two md_hashmap32_t maps (in core/md_hash.h style) using md_hash64 keys:
        a) residue+atom map: key = hash(UPPER(res) + '\t' + UPPER(atom)), value = (uint32_t)atomic number
        b) atom-only map: key = hash(UPPER(atom) with digits stripped), value = (uint32_t)atomic number
      • Populate tables to cover common MM forcefields and PDB conventions:
        • Water: HOH/WAT/TIP3/TIP4/TIP5/SPC/SOL/H2O with O/OW/OH2 and H/H1/H2/HW1/HW2.
        • Ions: NA, K, CL, BR, MG, CA, ZN, FE, MN, CU, CO, NI, CD, SR, BA, LI, CS, RB, AL, TI, CR, HG, PB, AG, AU, PT, etc.
        • Selenomethionine (MSE) with SE.
        • Nucleic acids (DNA/RNA): P and the phosphate oxygens OP1/OP2/O1P/O2P across DA/DC/DG/DT and A/C/G/U variants, plus common residue name suffixes (e.g., 5/3 termini if relevant in repo patterns).
        • Protein sidechain key atoms: SER/OG, THR/OG1, TYR/OH, CYS/SG, MET/SD, etc.
        • General atom-only fallbacks: H, C, N, O, S, P; halogens F/CL/BR/I; OW/OH2/HW*; backbone carbon labels (CA, CB, CG, CD, CE, CZ -> Carbon); OXT -> Oxygen; metals and common two-letter symbols.
      • Disambiguation logic:
        • If residue is water (md_util_resname_water), map O* to O and H* to H.
        • If residue is amino acid (md_util_resname_amino_acid) and atom is "CA", resolve to Carbon, not Calcium; also handle N, C, O, OXT explicitly and default class by first letter C/N/O/S/H.
        • If residue name itself is an element (ion), and atom name is empty or equals residue (ignoring digits), return that element.
        • Otherwise consult atom-only map, then try two-letter element prefix heuristic (e.g., CL12 => Cl, BR1 => Br, NA, MG, CA). If heuristic returns CA (Calcium) but residue is amino acid, override to Carbon.
        • Final fallback: first-letter element mapping (H/C/N/O/S/P/F/I/B/K).
      • Implement md_atom_infer_atomic_number using this logic.
      • Implement md_atoms_infer_atomic_numbers by iterating the molecule atoms and calling md_atom_infer_atomic_number with per-atom name/resname from the molecule.
  4. Keep existing lookup/conversion functions working

    • Implement new md_atomic_* functions by reusing the internal tables already backing md_util_element_* functions (symbols, names, masses, radii, etc.). If those internal arrays live in md_util.c, either:
      • Expose or reuse them from a shared compilation unit, or
      • Add thin forwarding in md_atomic_* implementations that call existing md_util_* functions internally to avoid code duplication.
  5. Tests

    • Add/extend unit tests (e.g., tests/test_element_guess.c or new tests/test_atomic_infer.c) to cover:
      • Water variants with and without residue context.
      • Amino acid disambiguation for CA, backbone atoms (N, C, O, OXT), and sidechain carbon labels (CB/CG/etc.).
      • Ions (resname=element) for several metals and halogens.
      • Nucleic acids (P, OP1/OP2/O1P/O2P) for DNA and RNA residues.
      • Two-letter heuristic (CL12, BR1) and fallbacks.
      • Exercise the new enum (MD_Z_H == 1, MD_Z_C == 6) and ensure md_symbol_from_atomic_number / md_atomic_number_from_symbol are consistent.
    • Keep any existing tests passing.
  6. Build system

    • Add new source file(s) to the build (e.g., src/md_atomic_infer.c, src/core/md_atomic.h).
    • Ensure public headers are exported in the install interface if the project supports installation.
  7. Backward compatibility

    • Preserve existing symbol names and types by providing wrappers and the md_element_t alias.
    • Do not remove public functions; prefer marking old names as deprecated in comments (or using an existing deprecation macro if available).

Acceptance criteria

  • New enum of atomic numbers is available at compile time via src/core/md_atomic.h with a safe prefix (MD_Z_*), aligned with the internal element ordering.
  • New APIs (md_atom_infer_atomic_number, md_atoms_infer_atomic_numbers, md_atomic_number_from_symbol, etc.) are implemented and used internally.
  • md_hashmap32_t is used for the residue+atom and atom-only maps (keys: md_hash64 of normalized strings, values: uint32_t atomic numbers).
  • md_util_element_guess now delegates to the new inference, preserving behavior but improving speed and clarity.
  • Existing public md_util_* functions remain available; code compiles without changes for downstream consumers.
  • Unit tests added/updated; CI passes.

Notes

  • Use uppercase normalization and digit-stripping for atom-only keys. For residue+atom keys, keep digits in atom part.
  • Unknown element remains 0. Ensure the enum exposes MD_Z_X = 0 for unknown.
  • Follow existing code style, includes, and allocator conventions (heap allocator for persistent maps; md_hashmap_reserve/md_hashmap_add/md_hashmap_get pattern as used in src/md_mmcif.c).

💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

Copilot AI and others added 2 commits September 11, 2025 14:19
…ard compatibility

Co-authored-by: scanberg <706523+scanberg@users.noreply.github.com>
Co-authored-by: scanberg <706523+scanberg@users.noreply.github.com>
Copilot AI changed the title [WIP] Atomic-number inference refactor: add md_atomic enum (MD_Z_*), hash-backed lookup, and compatible md_util wrappers Refactor element inference to atomic-number inference with hash-backed lookup tables Sep 11, 2025
Copilot AI requested a review from scanberg September 11, 2025 14:24
Copilot finished work on behalf of scanberg September 11, 2025 14:24
@scanberg scanberg closed this Sep 12, 2025
@scanberg scanberg deleted the copilot/fix-43ee205c-1eb7-402a-b05d-cc95a42f4322 branch September 12, 2025 05:25
@scanberg scanberg restored the copilot/fix-43ee205c-1eb7-402a-b05d-cc95a42f4322 branch September 12, 2025 05:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants