Skip to content

Conversation

@robertatakenaka
Copy link
Member

@robertatakenaka robertatakenaka commented Jan 30, 2026

…elementos de data

O que esse PR faz?

Este PR implementa melhorias na conversão de strings de data para elementos XML e aumenta a resiliência do pipeline de metadados de histórico. As principais alterações são:

  • Validação rigorosa de datas: A função _create_date_element agora valida se os valores são numéricos e interrompe o processamento caso encontre zeros ou caracteres não numéricos, evitando a criação de elementos XML inválidos.
  • Tratamento de exceções: Adiciona um bloco try-except no XMLArticleMetaHistoryPipe para garantir que falhas no processamento de uma data específica não interrompam todo o pipeline, registrando o erro detalhadamente no objeto raw.exceptions.

Onde a revisão poderia começar?

A revisão deve começar pelo arquivo:
scielo_classic_website/spsxml/sps_xml_article_meta.py

Inicie pela função _create_date_element (linha 64) para entender a nova lógica de validação e depois siga para a classe XMLArticleMetaHistoryPipe (linha 523).

Como este poderia ser testado manualmente?

  1. Execute o processo de conversão SPS-XML utilizando uma base de dados que contenha datas malformadas (ex: datas com strings, espaços ou apenas zeros).
  2. Verifique se o processo termina sem lançar exceções fatais.
  3. Inspecione o objeto de saída e confirme se os erros foram capturados na lista raw.exceptions.
  4. Garanta que as datas válidas continuem sendo geradas corretamente no XML final.

Algum cenário de contexto que queira dar?

Em bases de dados legadas, é comum encontrar campos de data com preenchimento inconsistente (ex: "00000000" ou "2023 12"). A lógica anterior tentava converter esses valores, o que podia causar erros de conversão de tipo ou gerar XMLs semanticamente incorretos. Esta alteração torna o conversor mais defensivo contra dados ruidosos.

fixes
scieloorg/scms-upload#811

Screenshots

N/A (Alteração de lógica de backend)

Quais são tickets relevantes?

scieloorg/scms-upload#811

Referências

  • Documentação de Pipes e Plumber (arquitetura interna).
  • Padrões de validação XML para o padrão SPS (SciELO Publishing Schema).

Copilot AI review requested due to automatic review settings January 30, 2026 15:28
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the date validation logic in the XML article metadata processing pipeline to handle malformed dates more robustly. The changes aim to prevent invalid XML generation when encountering dates with zeros or non-numeric characters in legacy databases.

Changes:

  • Modified _create_date_element function to validate date components using isdigit() and stop processing when encountering zeros or non-numeric values
  • Added try-except block in XMLArticleMetaHistoryPipe to catch and log exceptions during date processing without halting the entire pipeline

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +67 to +73
if not value.isdigit():
break
if int(value) == 0:
break
e = ET.Element(label)
e.text = value
date_element.append(e)
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces significant changes to date validation logic but no new tests are added to verify the behavior with malformed dates (e.g., "00000000", "20200200", "2023 12", or dates with non-numeric characters). The PR description specifically mentions these edge cases as the motivation for the changes. Add test cases to cover these scenarios and verify that the validation behaves as expected, especially given the critical logic changes identified in the _create_date_element function.

Copilot uses AI. Check for mistakes.
Comment on lines +529 to +540
try:
elem = _create_date_element("date", attributes, date_)
except Exception as e:
raw.exceptions.append(
{
"pipe": "XMLArticleMetaHistoryPipe",
"error_type": str(type(e)),
"error_message": str(e),
"exc_traceback": traceback.format_exc(),
}
)
continue
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The traceback module is used here but is not imported at the top of the file. This will cause a NameError when an exception occurs. Add "import traceback" to the imports section at the top of the file (similar to how it's done in sps_xml_refs.py:2).

Copilot uses AI. Check for mistakes.
Comment on lines +68 to +70
break
if int(value) == 0:
break
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation logic now uses "break" instead of the implicit "continue" from the previous code, which fundamentally changes the behavior. When a zero or non-digit is encountered, the break statement stops all further date components from being processed.

For example, with date "20200200" (valid year and month, but day is "00"):

  • Old behavior: would skip day but still add month and year elements
  • New behavior: breaks at day "00" and adds no elements at all

This means valid year and month values will be lost if the day is zero or invalid. Consider using "continue" instead of "break" to preserve the old behavior of skipping invalid components while still processing valid ones. Alternatively, if the intent is to process in order of significance, the iteration order should be changed to (year, month, day) so that more significant components are processed first.

Suggested change
break
if int(value) == 0:
break
continue
if int(value) == 0:
continue

Copilot uses AI. Check for mistakes.
Comment on lines 64 to +67
year, month, day = date_text[:4], date_text[4:6], date_text[6:]
labels = ("day", "month", "year")
for label, value in zip(labels, (day, month, year)):
if int(value) != 0:
e = ET.Element(label)
e.text = value
date_element.append(e)
if not value.isdigit():
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code doesn't validate that date_text has the expected length (8 characters) before slicing. If date_text is shorter than 8 characters, the slicing operations will produce empty strings or truncated values. For example, if date_text is "2020", then year="2020", month="", day="". The empty string will fail the isdigit() check and break, but this could lead to silent failures. Consider adding a length check after the line 58 check, e.g., "if len(date_text) != 8: return" to make the validation more explicit and catch malformed input early.

Copilot uses AI. Check for mistakes.
}
)
continue
history.append(elem)
Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the try-except block, the code attempts to append elem to history without checking if elem is None. The _create_date_element function can return None in two cases: when date_text is falsy (line 58-59), or when the date_element has no child elements added due to validation failures. If _create_date_element returns None and no exception is raised, this will attempt to append None to the history element, which could cause issues. Add a check "if elem is not None:" before the append operation on line 541.

Suggested change
history.append(elem)
if elem is not None:
history.append(elem)

Copilot uses AI. Check for mistakes.
@robertatakenaka robertatakenaka merged commit 2fde8a7 into scieloorg:main Jan 30, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant