Labels
bug, data-integrity, alliance, data-generation
Description
The Alliance of Genome Resources column definition lists in src/data_generation/alliance/__init__.py are missing commas between adjacent string literals. Python silently concatenates adjacent string literals that are not separated by a comma (this is a language feature, not a syntax error), so the column lists end up with fewer entries than intended and contain wrong column names.
This causes a silent schema mismatch when parsing Alliance TSV files: columns after the concatenation point get mapped to the wrong names, leading to incorrect metadata on ingested documents with no error raised.
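The shift described above can be sketched in a few lines. This is only an illustration: the shortened column list, the sample row, and the use of dict(zip(...)) to pair names with values are assumptions, not the actual parsing code in the repository.

```python
# Hypothetical, shortened column list (not the real Alliance schema).
columns = [
    "Interaction annotation(s)",
    "Host organism(s)" "Interaction parameter(s)",  # missing comma: one entry
    "Creation date",
]

# A made-up TSV row with one value per intended column (four values).
row = ["annot", "host", "param", "2020-01-01"]

# Assumed pairing strategy: zip stops at the shorter sequence, so the
# last value is dropped and every later name receives the wrong value.
record = dict(zip(columns, row))

print(len(columns))  # three names for four values
print(record)
```

In this sketch "Creation date" ends up holding the interaction parameter value, and no exception is raised anywhere.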
Affected Locations
Location 1 : molecular_interaction column list (line ~141)
"Host organism(s)" "Interaction parameter(s)",
Expected: Two list entries → "Host organism(s)" and "Interaction parameter(s)"
Actual: One entry → "Host organism(s)Interaction parameter(s)" (concatenated)
These two strings produce 1 entry instead of 2, so every column index after this point is shifted by 1.
Location 2 : genetic_interaction column list (lines ~180–184)
"Annotation(s) interactor A"
"Annotation(s) interactor B"
"Interaction annotation(s)"
"Host organism(s)"
"Interaction parameter(s)"
"Creation date",
Expected: Six separate list entries
Actual: The first five strings concatenate into one:
"Annotation(s) interactor AAnnotation(s) interactor BInteraction annotation(s)Host organism(s)Interaction parameter(s)"
Then "Creation date" is a separate entry (it has a comma). These six lines produce 2 entries instead of 6, so the list is 4 entries short and none of the five concatenated names exists as a standalone column.
Impact
- Data integrity: Column-to-name mapping is silently wrong for all Alliance molecular_interaction and genetic_interaction data. Fields like "Host organism(s)", "Interaction parameter(s)", "Annotation(s) interactor A/B", and "Interaction annotation(s)" are never properly indexed as standalone columns.
- Retrieval quality: Any metadata-based filtering or search against these column names will match nothing, since the names don't exist as standalone entries in the schema.
- Silent failure: No error or warning is raised at any point; the data loads successfully but with incorrect field assignments.
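One way to turn this silent failure into a loud one is a header check at load time. The sketch below is a suggestion only: validate_header is a hypothetical helper, and it assumes the Alliance TSV files carry a header row whose names should match the column list exactly, which should be confirmed against the real files.

```python
def validate_header(expected: list[str], header: list[str]) -> None:
    """Raise ValueError if the file header does not match the column list.

    Hypothetical helper; assumes the TSV header names equal the expected
    column names one-to-one and in the same order.
    """
    if len(expected) != len(header):
        raise ValueError(
            f"column count mismatch: expected {len(expected)}, "
            f"file has {len(header)}"
        )
    # Report the first position where the names disagree.
    for index, (want, got) in enumerate(zip(expected, header)):
        if want != got:
            raise ValueError(
                f"column {index} mismatch: expected {want!r}, got {got!r}"
            )
```

With such a check, a concatenated entry like "Host organism(s)Interaction parameter(s)" would fail on the count comparison before any row is ingested.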
Steps to Reproduce
# Quick verification in a Python REPL:
columns = [
    "Annotation(s) interactor A"
    "Annotation(s) interactor B"
    "Interaction annotation(s)"
    "Host organism(s)"
    "Interaction parameter(s)"
    "Creation date",
]
print(len(columns))  # Expected: 6, Actual: 2
print(columns[0])    # Shows the concatenated mega-string
Proposed Fix
Add the missing commas after each string literal:
Location 1
- "Host organism(s)" "Interaction parameter(s)",
+ "Host organism(s)",
+ "Interaction parameter(s)",
Location 2
- "Annotation(s) interactor A"
- "Annotation(s) interactor B"
- "Interaction annotation(s)"
- "Host organism(s)"
- "Interaction parameter(s)"
+ "Annotation(s) interactor A",
+ "Annotation(s) interactor B",
+ "Interaction annotation(s)",
+ "Host organism(s)",
+ "Interaction parameter(s)",
"Creation date",
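To keep the same mistake from reappearing, a small check could scan the module for string literals that directly follow one another. The sketch below uses the standard-library tokenize module; find_implicit_concatenations is a hypothetical name, and the check would also flag intentional implicit concatenation, so it is only suitable for files where none is expected.

```python
import io
import tokenize


def find_implicit_concatenations(source: str) -> list[int]:
    """Return line numbers where a string literal directly follows another."""
    hits = []
    prev_was_string = False
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Line breaks inside brackets and comments do not separate literals.
        if tok.type in (tokenize.NL, tokenize.COMMENT):
            continue
        if tok.type == tokenize.STRING:
            if prev_was_string:
                hits.append(tok.start[0])
            prev_was_string = True
        else:
            # Any other token (for example a comma) resets the state.
            prev_was_string = False
    return hits


sample = 'cols = [\n    "a"\n    "b",\n    "c",\n]\n'
print(find_implicit_concatenations(sample))
```

Run against the contents of src/data_generation/alliance/__init__.py, a non-empty result would point at the lines that still need a comma.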
Files to change
src/data_generation/alliance/__init__.py