
Conversation

@robertatakenaka (Member) commented Nov 10, 2025

What does this PR do?

This PR increases the traceability of the HTML/ISIS-to-XML document processing system by implementing:

  1. Preservation of the original HTML: adds an initial step (convert_html_to_xml_step_0) that captures and preserves the original HTML in <temp> elements with CDATA, making it possible to inspect the content before any malformation fixes (see the sketch after this list)

  2. Two-phase pipeline:

    • Phase 1 (Pre-processing): captures the raw HTML in temporary elements with CDATA (PreMainHTMLPipe, PreTranslatedHTMLPipe)
    • Phase 2 (Processing): processes and fixes the preserved HTML, converting it to valid XML
  3. Transformation tracking: it is now possible to compare the original HTML (preserved in step_0) with the final result, making it easier to debug conversion problems

  4. Traceable inference system: marks references whose indices were inferred with guessed_reference_index="true", distinguishing original data from deduced data

  5. Detailed logging: adds logs at critical points, especially in reference processing and index identification
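A minimal sketch (referenced in item 1 above) of what wrapping the untouched HTML in a CDATA-protected <temp> element can look like with lxml; the helper name and element layout are illustrative, not the repository's actual API:

from lxml import etree

def preserve_raw_html(xml_root, raw_html, temp_type="body"):
    # Illustrative only: keep the HTML exactly as received, so later pipes
    # can compare it against the corrected output.
    node = etree.Element("temp")
    node.set("type", temp_type)
    node.text = etree.CDATA(raw_html)
    xml_root.append(node)
    return xml_root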

Why is this important?

Before this change, when the HTML was fixed to resolve malformations, we lost visibility into the original content. Now, with CDATA preservation in the initial step, we can:

  • Debug problems: see exactly what was in the original HTML before the fixes
  • Audit transformations: compare before/after for each pipeline step
  • Identify error patterns: analyze problematic HTML without it being altered by the parser

I suggest starting the review with:

  1. scielo_classic_website/spsxml/sps_xml_body_pipes.py - lines 130-210 (PreMainHTMLPipe and PreTranslatedHTMLPipe classes)

    • This is the main architectural change: the separation between pre-processing and final processing
  2. scielo_classic_website/htmlbody/html_body.py - lines 228-242 (fix_references function)

    • Contains the index inference logic for references

How could this be tested manually?

In the Upload application, run the article migration task for articles ingested into the classic site in HTML format. Open the Processing > HTML XML menu and inspect the HTML-to-XML conversion versions, especially versions 1 and 2.

Any context you would like to provide?

Problem context

Some XMLs were converted with defects, and it was not possible to identify the source of the problem because the first output of the conversion was far removed from the input.

Implemented solution

The refactoring introduces a two-phase pipeline:

  1. Pre-processing: wraps the HTML content in CDATA to avoid parsing errors
  2. Processing: converts the HTML to structured XML, inferring indices when necessary (see the sketch after this list)
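A minimal sketch of how phase 2 might read the preserved content back and repair it, assuming a <temp type="body"> element produced in phase 1; names are illustrative, not the repository's code:

from lxml import etree
from lxml.html import fromstring

def process_preserved_html(xml_root):
    # Illustrative only: pull the raw HTML out of the CDATA wrapper, let the
    # lenient HTML parser fix malformations, and attach the result as <body>.
    temp = xml_root.find(".//temp[@type='body']")
    if temp is None or not temp.text:
        return xml_root
    repaired = fromstring(temp.text)
    body = etree.Element("body")
    body.append(repaired)
    xml_root.insert(0, body)
    temp.getparent().remove(temp)
    return xml_root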

This approach is especially important for migrating SciELO legacy content, where we have thousands of articles with inconsistent formatting.

Screenshots

N/A - The changes are in backend processing, with no visual interface.

What are the relevant tickets?

  • Issue #[NÚMERO]: present the HTML as close to the original as possible, to make it easier to identify conversion errors

References

n/a

- Add logging import and debug message when initializing records with reference_index
- Helps track paragraph records that contain reference indices
- Add 'bases-work' to supported folder paths
- Extract HTML content fixing to dedicated function fix_html_content()
- Improve encoding handling with utf-8/iso-8859-1 fallback (illustrated after this list)
- Add better error logging for path resolution and empty content
- Simplify PDF key/lang extraction logic
- Pass paragraph_records parameter through Document initialization
- Add logic to create ParagraphRecord instances from paragraph_records
- Improve error logging in h_record property
- Use ellipsis syntax for exception classes
- Add guessed_reference_index for paragraphs without explicit indices
- Refactor paragraph data extraction to avoid direct dict access
- Add paragraph_text property usage for cleaner code
- Implement index tracking and guessing logic in fix_references()
- Improve reference range identification with detailed logging
- Simplify data combination to avoid overwriting item.data
- Use ellipsis syntax for exception class
- Split MainHTMLPipe into PreMainHTMLPipe with CDATA handling
- Add PreTranslatedHTMLPipe for translation pre-processing
- Introduce temp elements with CDATA for safer HTML handling
- Add CompleteDispFormulaPipe for disp-formula wrapping
- Handle guessed_reference_index attribute in references
- Restructure pipeline to process temp elements before final XML
- Improve reference processing with better index handling
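One of the bullets above mentions a utf-8/iso-8859-1 fallback when reading legacy files; a minimal sketch of that pattern (hypothetical helper, not the repository's fix_html_content):

def read_with_encoding_fallback(path):
    # Try UTF-8 first; ISO-8859-1 accepts any byte sequence, so legacy
    # Latin-1 files decode instead of raising UnicodeDecodeError.
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")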
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR enhances the traceability of HTML-to-XML document processing for the SciELO classic website by implementing a two-phase pipeline that preserves original HTML content before any transformations occur. This allows debugging and auditing of conversion issues in legacy content.

Key changes:

  • Introduces a new preprocessing step (convert_html_to_xml_step_0) that wraps raw HTML in CDATA within temporary elements before any malformation corrections
  • Implements a two-phase pipeline: pre-processing (HTML preservation via PreMainHTMLPipe, PreTranslatedHTMLPipe) and processing (conversion to valid XML)
  • Adds guessed_reference_index attribute to track inferred reference indices, distinguishing original from inferred data
  • Enhances logging throughout reference processing and paragraph identification

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.

Summary per file

  • scielo_classic_website/spsxml/sps_xml_body_pipes.py - Core pipeline refactoring: adds step_0 for HTML preservation, creates PreMainHTMLPipe/PreTranslatedHTMLPipe classes, refactors MainHTMLPipe/TranslatedHTMLPipe to process CDATA-wrapped content, adds CompleteDispFormulaPipe
  • scielo_classic_website/htmlbody/html_body.py - Implements reference index inference logic with guessed_reference_index tracking, adds logging for reference identification, refactors paragraph processing
  • scielo_classic_website/models/document.py - Enables paragraph records to be passed to DocumentRecords, improves code formatting
  • scielo_classic_website/models/issue_folder.py - Adds fix_html_content function for encoding handling, adds bases-work folder support, improves error handling
  • scielo_classic_website/isisdb/base_p_record.py - Adds debug logging when reference_index is present
Comments suppressed due to low confidence (2)

scielo_classic_website/spsxml/sps_xml_body_pipes.py:428

  • Variable parent is not used.
            parent = node.getparent()

scielo_classic_website/spsxml/sps_xml_body_pipes.py:435

  • Variable parent is not used.
            parent = node.getparent()


Comment on lines +425 to +427
if node is not None:
    body = text_to_node("body", node.text)
    xml.find(".").insert(0, body)

Copilot AI Nov 10, 2025


Variable parent is assigned but never used. After getting parent on line 428, there's no subsequent code that removes the temp node or performs any operation with parent. This looks like incomplete code where the temp node should be removed after processing.

Comment on lines +432 to +434
if node is not None:
    back = text_to_node("back", node.text)
    xml.find(".").insert(1, back)

Copilot AI Nov 10, 2025


Variable parent is assigned but never used. After getting parent on line 435, there's no subsequent code that removes the temp node or performs any operation with parent. This looks like incomplete code where the temp node should be removed after processing.

Comment on lines +229 to +234
index = None
for item in p_records:
    # item.data (dict which keys: text, index, reference_index)
    if item.data["text"]:
        item.data["text"] = html_fixer.avoid_mismatched_tags(item.data["text"])
    yield item.data
    if index:
        index += 1
    elif item.reference_index:

Copilot AI Nov 10, 2025


The index inference logic is incorrect. When index is None (first iteration), if the first item doesn't have reference_index, the index remains None and subsequent items won't increment it. The logic should initialize index when finding the first reference_index, then increment for every subsequent item regardless of whether it has a reference_index. Current logic will fail to set guessed_reference_index for items between gaps in reference indices.
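A minimal sketch of the behavior this comment asks for (anchor the counter on the first explicit reference_index, then keep counting through gaps); names are illustrative and this is not the repository's fix_references:

def infer_reference_indexes(items):
    # items: dicts that may or may not carry an explicit "reference_index"
    index = None
    for item in items:
        explicit = item.get("reference_index")
        if explicit:
            index = int(explicit)  # anchor on the explicit value
        elif index is not None:
            index += 1  # keep counting even when an item lacks its own index
            item["guessed_reference_index"] = str(index)
        yield item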

Comment on lines +182 to +190
index = None
for item in p_records:
    # item.data (dict which keys: text, index, reference_index)
    if item.data["text"]:
        # logging.info("Antes:")
        # logging.info(item.data)

        hc = HTMLContent(item.data["text"])
        root = hc.tree.find(".//body/*")
        if part_name:
            root.set("data-part", part_name)
        if not item.paragraph_text:
            continue

        ref_idx = item.data.get("reference_index")
        if ref_idx:
            root.set("data-ref-index", ref_idx)
        if index:
            index += 1
        elif item.reference_index:

Copilot AI Nov 10, 2025


The index inference logic is incorrect. When index is None (first iteration), if the first item doesn't have reference_index, the index remains None and subsequent items won't increment it. Additionally, items without paragraph_text are skipped but should still affect the index count. The logic should initialize index when finding the first reference_index, then increment for every subsequent item to maintain sequential reference numbering.

    ref_index = item["reference_index"]
    ref.set("id", f"B{ref_index}")
except KeyError:
    ref_index = item.get("guessed_reference_index")

Copilot AI Nov 10, 2025


If item.get('guessed_reference_index') returns None (when the key doesn't exist), this will create an id 'BNone' which is invalid. The code should handle the case where both reference_index and guessed_reference_index are missing, or raise an appropriate error.

Suggested change

ref_index = item.get("guessed_reference_index")
if ref_index is None:
    raise ValueError(
        f"Reference item at position {i} is missing both 'reference_index' and 'guessed_reference_index': {item}"
    )

Comment on lines +502 to +510
except KeyError:
    pass
temp_back = sub_article.find(".//temp[@type='back']")
if temp_back is not None:
    try:
        back = text_to_node("back", temp_back.text)
        sub_article.append(back)
    except KeyError:
        pass

Copilot AI Nov 10, 2025


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

except KeyError as e:
    logging.warning(f"KeyError while creating 'body' node from temp_body.text in sub-article: {e}")
temp_back = sub_article.find(".//temp[@type='back']")
if temp_back is not None:
    try:
        back = text_to_node("back", temp_back.text)
        sub_article.append(back)
    except KeyError as e:
        logging.warning(f"KeyError while creating 'back' node from temp_back.text in sub-article: {e}")

        back = text_to_node("back", temp_back.text)
        sub_article.append(back)
    except KeyError:
        pass

Copilot AI Nov 10, 2025


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

logging.warning(
    "KeyError in TranslatedHTMLPipe.transform while processing temp_back.text: %r",
    temp_back.text
)

    node.text = ET.CDATA(texts["before references"])
    sub_article.append(node)

except KeyError:

Copilot AI Nov 10, 2025


'except' clause does nothing but pass and there is no explanatory comment.

    node.set("type", "back")
    node.text = ET.CDATA(texts["after references"])
    sub_article.append(node)
except KeyError:

Copilot AI Nov 10, 2025


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

except KeyError:
    # It is expected that some translations may not have an 'after references' section.
    # In such cases, we simply skip adding the corresponding node.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@robertatakenaka merged commit b501d04 into scieloorg:main Nov 10, 2025
1 check passed
