Refactor element inference to atomic-number inference with hash-backed lookup tables #8

Copilot · 2025-09-11T14:06:35Z

This PR implements a comprehensive refactor of the element resolution system to make it fast, explicit, and future-proof for the upcoming atom_type refactor, while maintaining full backward compatibility.

Overview

The previous element_guess naming was unclear and suggested ad-hoc heuristics. This refactor introduces an explicit atomic number (Z) identity system with a human-readable enum and implements hash-backed lookups for common molecular modeling forcefields and PDB conventions.

Key Changes

New Public API (`src/core/md_atomic.h`)

Atomic Number Types: typedef uint8_t md_atomic_number_t with md_element_t as legacy alias
Complete Element Enum: MD_Z_H=1, MD_Z_C=6, MD_Z_O=8, ..., MD_Z_OG=118 for all 118 elements

New Preferred Functions:

md_atomic_number_t md_atomic_number_from_symbol(str_t sym);
str_t md_symbol_from_atomic_number(md_atomic_number_t z);
md_atomic_number_t md_atom_infer_atomic_number(str_t atom_name, str_t res_name);
bool md_atoms_infer_atomic_numbers(md_atomic_number_t out[], size_t n, const md_molecule_t* mol);

Hash-Backed Inference System (`src/md_atomic_infer.c`)

Fast Lookups: Uses md_hashmap32_t with md_hash64 keys for O(1) performance
Context-Aware Resolution: Disambiguates "CA" (alpha carbon vs calcium) using residue context
Comprehensive Coverage:
- Water variants: HOH/WAT/TIP3/TIP4/TIP5/SPC/SOL/H2O with O/OW/OH2 and H/H1/H2/HW1/HW2
- Amino acids: Backbone atoms (N, CA, C, O, OXT) and sidechain atoms (SER/OG, CYS/SG, etc.)
- Nucleic acids: Phosphate groups (P, OP1/OP2/O1P/O2P) for DNA/RNA residues
- Ions: NA, K, CL, BR, MG, CA, ZN, FE, etc.
- Fallback heuristics: Two-letter elements (CL12→Cl), first-letter mapping

Perfect Backward Compatibility

All existing md_util_* functions remain unchanged and fully functional:

// These continue to work exactly as before
md_element_t elem = md_util_element_lookup(STR_LIT("Ca"));
str_t symbol = md_util_element_symbol(elem);
bool success = md_util_element_guess(elements, count, mol);

The old md_util_element_guess now delegates to the new hash-backed system, providing better performance and accuracy while maintaining identical behavior.

Examples

Context-Aware Disambiguation

// Correctly identifies as carbon alpha, not calcium
md_atomic_number_t ca_carbon = md_atom_infer_atomic_number(STR_LIT("CA"), STR_LIT("ALA"));
assert(ca_carbon == MD_Z_C);

// Correctly identifies as calcium ion  
md_atomic_number_t ca_calcium = md_atom_infer_atomic_number(STR_LIT("CA"), STR_LIT("CA"));
assert(ca_calcium == MD_Z_CA);

Water Recognition

// All water variants correctly identified
assert(md_atom_infer_atomic_number(STR_LIT("OW"), STR_LIT("TIP3")) == MD_Z_O);
assert(md_atom_infer_atomic_number(STR_LIT("HW1"), STR_LIT("SPC")) == MD_Z_H);

Testing

Added comprehensive test suite (13 new tests) covering:

✅ Enum constants and symbol lookups
✅ Water, amino acid, nucleic acid, and ion inference
✅ Fallback mechanisms and edge cases
✅ API equivalence between old and new systems
✅ Backward compatibility verification

Performance & Compatibility

Performance: O(1) hash lookups replace O(n) linear searches
Memory: Efficient static hash tables initialized on first use
Compatibility: Zero breaking changes - all existing code continues to work
Future-proof: Clean atomic number foundation for upcoming atom_type refactor

This refactor delivers the requested fast, explicit element inference while maintaining 100% backward compatibility for downstream users including viamd.

This pull request was created as a result of the following prompt from Copilot chat.

Refactor element inference to atomic-number inference, add a public atomic-number enum, and use md_hashmap32_t-backed lookup tables while preserving backward compatibility.

Context and goals

We want to make the element resolution fast, explicit, and future-proof for the upcoming atom_type refactor. The current naming (element_guess) is unclear and suggests ad-hoc heuristics.

Introduce an explicit notion of atomic number (Z) as the identity, and expose a human-readable enum with a safe prefix (e.g., MD_Z_H=1, MD_Z_C=6) to avoid naming collisions.

Implement a hash-backed, extended lookup for common MM forcefields and PDB conventions that disambiguates names like "CA" (alpha carbon) vs "Ca" (calcium) using residue context.

Maintain full compatibility with the existing md_util_* API by providing inline wrappers and type aliases, so downstream users (including viamd) are not broken.

Deliverables

New public header for atomic numbers

File: src/core/md_atomic.h

Provide:

typedef uint8_t md_atomic_number_t; // 0 = unknown

typedef md_atomic_number_t md_element_t; // legacy alias for compatibility

Public enum listing all elements with a clear prefix:
enum { MD_Z_X = 0, MD_Z_H = 1, MD_Z_HE = 2, MD_Z_LI = 3, ..., MD_Z_O = 8, MD_Z_P = 15, MD_Z_S = 16, ..., up to 118 (OGANESSON) }.
Notes:

MD_Z_X (0) represents unknown.

Include the full periodic table, consistent with existing md_util_element_symbols ordering.

New preferred API names (forward-declare):
md_atomic_number_t md_atomic_number_from_symbol(str_t sym);
md_atomic_number_t md_atomic_number_from_symbol_icase(str_t sym);
str_t md_symbol_from_atomic_number(md_atomic_number_t z);
str_t md_name_from_atomic_number(md_atomic_number_t z);
float md_atomic_mass(md_atomic_number_t z);
float md_vdw_radius(md_atomic_number_t z);
float md_covalent_radius(md_atomic_number_t z);
int md_max_valence(md_atomic_number_t z);
uint32_t md_cpk_color(md_atomic_number_t z);
// Per-atom inference from labels (atom name + residue)
md_atomic_number_t md_atom_infer_atomic_number(str_t atom_name, str_t res_name);
// Batch form wired to molecule structure
bool md_atoms_infer_atomic_numbers(md_atomic_number_t out[], size_t n, const struct md_molecule_t* mol);

Update md_util.h to include md_atomic.h and add compatibility wrappers

Keep all existing declarations to avoid breaking users.

Add inline wrappers that forward old names to the new API, e.g.:

md_util_element_lookup -> md_atomic_number_from_symbol

md_util_element_lookup_ignore_case -> md_atomic_number_from_symbol_icase

md_util_element_symbol -> md_symbol_from_atomic_number

md_util_element_name -> md_name_from_atomic_number

md_util_element_atomic_mass -> md_atomic_mass

md_util_element_vdw_radius -> md_vdw_radius

md_util_element_covalent_radius -> md_covalent_radius

md_util_element_max_valence -> md_max_valence

md_util_element_cpk_color -> md_cpk_color

md_util_element_guess_from_names (if present) -> md_atom_infer_atomic_number

md_util_element_guess (batch) -> md_atoms_infer_atomic_numbers

Prefer to annotate wrappers with a deprecation macro if one exists (or TODO comment), but do not remove old names.

Implement hash-backed element inference

New implementation file (choose one):

src/md_atomic_infer.c (preferred) or adapt existing element_guess implementation file.

Content:

Build two md_hashmap32_t maps (in core/md_hash.h style) using md_hash64 keys:
a) residue+atom map: key = hash(UPPER(res) + '\t' + UPPER(atom)), value = (uint32_t)atomic number
b) atom-only map: key = hash(UPPER(atom) with digits stripped), value = (uint32_t)atomic number

Populate tables to cover common MM forcefields and PDB conventions:
• Water: HOH/WAT/TIP3/TIP4/TIP5/SPC/SOL/H2O with O/OW/OH2 and H/H1/H2/HW1/HW2.
• Ions: NA, K, CL, BR, MG, CA, ZN, FE, MN, CU, CO, NI, CD, SR, BA, LI, CS, RB, AL, TI, CR, HG, PB, AG, AU, PT, etc.
• Selenomethionine (MSE) with SE.
• Nucleic acids (DNA/RNA): P and the phosphate oxygens OP1/OP2/O1P/O2P across DA/DC/DG/DT and A/C/G/U variants, plus common residue name suffixes (e.g., 5/3 termini if relevant in repo patterns).
• Protein sidechain key atoms: SER/OG, THR/OG1, TYR/OH, CYS/SG, MET/SD, etc.
• General atom-only fallbacks: H, C, N, O, S, P; halogens F/CL/BR/I; OW/OH2/HW*; backbone carbon labels (CA, CB, CG, CD, CE, CZ -> Carbon); OXT -> Oxygen; metals and common two-letter symbols.

Disambiguation logic:
• If residue is water (md_util_resname_water), map O* to O and H* to H.
• If residue is amino acid (md_util_resname_amino_acid) and atom is "CA", resolve to Carbon, not Calcium; also handle N, C, O, OXT explicitly and default class by first letter C/N/O/S/H.
• If residue name itself is an element (ion), and atom name is empty or equals residue (ignoring digits), return that element.
• Otherwise consult atom-only map, then try two-letter element prefix heuristic (e.g., CL12 => Cl, BR1 => Br, NA, MG, CA). If heuristic returns CA (Calcium) but residue is amino acid, override to Carbon.
• Final fallback: first-letter element mapping (H/C/N/O/S/P/F/I/B/K).

Implement md_atom_infer_atomic_number using this logic.

Implement md_atoms_infer_atomic_numbers by iterating the molecule atoms and calling md_atom_infer_atomic_number with per-atom name/resname from the molecule.

Keep existing lookup/conversion functions working

Implement new md_atomic_* functions by reusing the internal tables already backing md_util_element_* functions (symbols, names, masses, radii, etc.). If those internal arrays live in md_util.c, either:
• Expose or reuse them from a shared compilation unit, or
• Add thin forwarding in md_atomic_* implementations that call existing md_util_* functions internally to avoid code duplication.

Tests

Add/extend unit tests (e.g., tests/test_element_guess.c or new tests/test_atomic_infer.c) to cover:
• Water variants with and without residue context.
• Amino acid disambiguation for CA, backbone atoms (N, C, O, OXT), and sidechain carbon labels (CB/CG/etc.).
• Ions (resname=element) for several metals and halogens.
• Nucleic acids (P, OP1/OP2/O1P/O2P) for DNA and RNA residues.
• Two-letter heuristic (CL12, BR1) and fallbacks.
• Exercise the new enum (MD_Z_H == 1, MD_Z_C == 6) and ensure md_symbol_from_atomic_number / md_atomic_number_from_symbol are consistent.

Keep any existing tests passing.

Build system

Add new source file(s) to the build (e.g., src/md_atomic_infer.c, src/core/md_atomic.h).

Ensure public headers are exported in the install interface if the project supports installation.

Backward compatibility

Preserve existing symbol names and types by providing wrappers and the md_element_t alias.

Do not remove public functions; prefer marking old names as deprecated in comments (or using an existing deprecation macro if available).

Acceptance criteria

New enum of atomic numbers is available at compile time via src/core/md_atomic.h with a safe prefix (MD_Z_*), aligned with the internal element ordering.

New APIs (md_atom_infer_atomic_number, md_atoms_infer_atomic_numbers, md_atomic_number_from_symbol, etc.) are implemented and used internally.

md_hashmap32_t is used for the residue+atom and atom-only maps (keys: md_hash64 of normalized strings, values: uint32_t atomic numbers).

md_util_element_guess now delegates to the new inference, preserving behavior but improving speed and clarity.

Existing public md_util_* functions remain available; code compiles without changes for downstream consumers.

Unit tests added/updated; CI passes.

Notes

Use uppercase normalization and digit-stripping for atom-only keys. For residue+atom keys, keep digits in atom part.

Unknown element remains 0. Ensure the enum exposes MD_Z_X = 0 for unknown.

Follow existing code style, includes, and allocator conventions (heap allocator for persistent maps; md_hashmap_reserve/md_hashmap_add/md_hashmap_get pattern as used in src/md_mmcif.c).

💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

…ard compatibility Co-authored-by: scanberg <706523+scanberg@users.noreply.github.com>

Co-authored-by: scanberg <706523+scanberg@users.noreply.github.com>

Initial plan

3063126

Copilot AI assigned Copilot and scanberg Sep 11, 2025

Copilot started work on behalf of scanberg September 11, 2025 14:06 View session

Copilot AI and others added 2 commits September 11, 2025 14:19

Implement atomic number refactor with hash-backed inference and backw…

c36c520

…ard compatibility Co-authored-by: scanberg <706523+scanberg@users.noreply.github.com>

Add comprehensive API equivalence and compatibility tests

e723aef

Co-authored-by: scanberg <706523+scanberg@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Atomic-number inference refactor: add md_atomic enum (MD_Z_*), hash-backed lookup, and compatible md_util wrappers~~ Refactor element inference to atomic-number inference with hash-backed lookup tables Sep 11, 2025

Copilot AI requested a review from scanberg September 11, 2025 14:24

Copilot finished work on behalf of scanberg September 11, 2025 14:24

Copilot AI mentioned this pull request Sep 11, 2025

Complete atom type table implementation with parser-native element inference #10

Open

3 tasks

scanberg closed this Sep 12, 2025

scanberg deleted the copilot/fix-43ee205c-1eb7-402a-b05d-cc95a42f4322 branch September 12, 2025 05:25

scanberg restored the copilot/fix-43ee205c-1eb7-402a-b05d-cc95a42f4322 branch September 12, 2025 05:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor element inference to atomic-number inference with hash-backed lookup tables #8

Refactor element inference to atomic-number inference with hash-backed lookup tables #8

Uh oh!

Copilot AI commented Sep 11, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Refactor element inference to atomic-number inference with hash-backed lookup tables #8

Refactor element inference to atomic-number inference with hash-backed lookup tables #8

Uh oh!

Conversation

Copilot AI commented Sep 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Key Changes

New Public API (src/core/md_atomic.h)

Hash-Backed Inference System (src/md_atomic_infer.c)

Perfect Backward Compatibility

Examples

Context-Aware Disambiguation

Water Recognition

Testing

Performance & Compatibility

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Sep 11, 2025 •

edited

Loading

New Public API (`src/core/md_atomic.h`)

Hash-Backed Inference System (`src/md_atomic_infer.c`)