
bug: missing commas in Alliance column lists cause silent string concatenation #118

@AaryanCode69

Labels

bug, data-integrity, alliance, data-generation


Description

The Alliance of Genome Resources column definition lists in src/data_generation/alliance/__init__.py have missing commas between adjacent string literals. Python silently concatenates adjacent string literals that are not separated by a comma (this is a language feature, not a syntax error), so the column lists end up with fewer entries than intended and contain wrong column names.

This causes a silent schema mismatch when parsing Alliance TSV files: columns after the concatenation point get mapped to the wrong names, leading to incorrect metadata on ingested documents with no error raised.
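The shift can be illustrated with a minimal sketch. It assumes the parser pairs column names with TSV fields by position (here via `zip`); the row values and the three-column list are hypothetical and are not taken from the real Alliance schema.

```python
# Hypothetical column names and row values, for illustration only.
intended = ["Host organism(s)", "Interaction parameter(s)", "Creation date"]
# Missing comma: the first two literals are merged into one entry.
buggy = ["Host organism(s)" "Interaction parameter(s)", "Creation date"]

row = "human\tpH 7.4\t2020-01-01".split("\t")

# Assumption: names and fields are paired positionally.
good_record = dict(zip(intended, row))
bad_record = dict(zip(buggy, row))

print(good_record)
print(bad_record)
```

With the buggy list, the "Creation date" key receives the value that belongs to "Interaction parameter(s)", and `zip` drops the last field without raising anything.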


Affected Locations

Location 1: molecular_interaction column list (line ~141)

"Host organism(s)" "Interaction parameter(s)",

Expected: Two list entries → "Host organism(s)" and "Interaction parameter(s)"
Actual: One entry → "Host organism(s)Interaction parameter(s)" (concatenated)

The two names collapse into 1 entry instead of 2, so every column index after this point is shifted by 1.

Location 2: genetic_interaction column list (lines ~180–184)

"Annotation(s) interactor A"
"Annotation(s) interactor B"
"Interaction annotation(s)"
"Host organism(s)"
"Interaction parameter(s)"
"Creation date",

Expected: Six separate list entries
Actual: The first five strings concatenate into one:

"Annotation(s) interactor AAnnotation(s) interactor BInteraction annotation(s)Host organism(s)Interaction parameter(s)"

Then "Creation date" is a separate entry because it is followed by a comma. These six lines therefore yield 2 entries instead of 6, and 4 column names are missing entirely.


Impact

  • Data integrity: Column-to-name mapping is silently wrong for all Alliance molecular_interaction and genetic_interaction data. Fields like "Host organism(s)", "Interaction parameter(s)", "Annotation(s) interactor A/B", and "Interaction annotation(s)" are never properly indexed as standalone columns.
  • Retrieval quality: Any metadata-based filtering or search against these column names will match nothing, since the names don't exist as standalone entries in the schema.
  • Silent failure: No error or warning is raised at any point; the data loads successfully but with incorrect field assignments.
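One way to turn this silent failure into a loud one is a column-count check at load time. The sketch below is only an illustration: `check_column_count` is a hypothetical helper that does not exist in the repository, and it assumes the TSV file has a tab-separated header line whose field count can be compared against the schema.

```python
def check_column_count(columns, header_line, sep="\t"):
    """Raise ValueError if the schema and the file disagree on column count.

    Hypothetical helper; `header_line` is assumed to be the TSV header row.
    """
    n_fields = len(header_line.rstrip("\r\n").split(sep))
    if len(columns) != n_fields:
        raise ValueError(
            f"schema has {len(columns)} columns, file has {n_fields} fields"
        )


# Hypothetical three-field header line; the buggy list has only two entries.
header = "Host organism(s)\tInteraction parameter(s)\tCreation date\n"
buggy = ["Host organism(s)" "Interaction parameter(s)", "Creation date"]

try:
    check_column_count(buggy, header)
except ValueError as exc:
    print(exc)
```

A check like this would not say which comma is missing, but it would stop ingestion before mislabeled metadata is written.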

Steps to Reproduce

# Quick verification in a Python REPL:
columns = [
    "Annotation(s) interactor A"
    "Annotation(s) interactor B"
    "Interaction annotation(s)"
    "Host organism(s)"
    "Interaction parameter(s)"
    "Creation date",
]
print(len(columns))   # Expected: 6, Actual: 2
print(columns[0])     # Shows the concatenated mega-string

Proposed Fix

Add the missing commas after each string literal:

Location 1

-            "Host organism(s)" "Interaction parameter(s)",
+            "Host organism(s)",
+            "Interaction parameter(s)",

Location 2

-            "Annotation(s) interactor A"
-            "Annotation(s) interactor B"
-            "Interaction annotation(s)"
-            "Host organism(s)"
-            "Interaction parameter(s)"
+            "Annotation(s) interactor A",
+            "Annotation(s) interactor B",
+            "Interaction annotation(s)",
+            "Host organism(s)",
+            "Interaction parameter(s)",
             "Creation date",

Files to change

  • src/data_generation/alliance/__init__.py
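To guard against regressions, a small check could flag adjacent string literals in the source. The following is a sketch, not an existing project tool: `find_implicit_concatenations` is a hypothetical name, and it uses the standard `tokenize` module to report any string token that directly follows another string token (ignoring line breaks inside brackets and comments).

```python
import io
import tokenize


def find_implicit_concatenations(source: str) -> list[int]:
    """Return line numbers where a string literal directly follows another."""
    hits = []
    prev_was_string = False
    # NL is a line break inside brackets; it does not separate the literals.
    skip = {tokenize.NL, tokenize.COMMENT}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skip:
            continue
        if tok.type == tokenize.STRING:
            if prev_was_string:
                hits.append(tok.start[0])
            prev_was_string = True
        else:
            prev_was_string = False
    return hits


# Hypothetical snippet reproducing the Location 1 pattern.
sample = '''columns = [
    "Host organism(s)" "Interaction parameter(s)",
    "Creation date",
]
'''
print(find_implicit_concatenations(sample))
```

Such a check also flags intentional multi-line string concatenation, so in practice it would probably be limited to the column-definition module or run as a test against it.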
