
Conversation

@robertatakenaka (Member) commented Nov 10, 2025

What does this PR do?

This PR increases the traceability of the HTML/ISIS-to-XML document processing system by implementing:

  1. Preservation of the original HTML: adds an initial step (convert_html_to_xml_step_0) that captures and preserves the original HTML in <temp> elements with CDATA, making it possible to inspect the content before any malformation fixes (see the sketch after this list)

  2. Two-phase pipeline:

    • Phase 1 (Pre-processing): captures the raw HTML in temporary elements with CDATA (PreMainHTMLPipe, PreTranslatedHTMLPipe)
    • Phase 2 (Processing): processes and fixes the preserved HTML, converting it to valid XML
  3. Transformation tracking: it is now possible to compare the original HTML (preserved in step_0) with the final result, making it easier to debug conversion problems

  4. Traceable inference system: marks references whose indices were inferred with guessed_reference_index="true", distinguishing original data from deduced data

  5. Detailed logging: adds logs at critical points, especially in reference processing and index identification
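A minimal sketch (referenced in item 1 above) of what wrapping the untouched HTML in a CDATA-protected <temp> element can look like with lxml; the helper name and element layout are illustrative, not the repository's actual API:

from lxml import etree

def preserve_raw_html(xml_root, raw_html, temp_type="body"):
    # Illustrative only: keep the HTML exactly as received, so later pipes
    # can compare it against the corrected output.
    node = etree.Element("temp")
    node.set("type", temp_type)
    node.text = etree.CDATA(raw_html)
    xml_root.append(node)
    return xml_root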

Why is this important?

Before this change, when the HTML was fixed to resolve malformations, we lost visibility into the original content. Now, with CDATA preservation in the initial step, we can:

  • Debug problems: see exactly what was in the original HTML before the fixes
  • Audit transformations: compare before/after for each pipeline step
  • Identify error patterns: analyze problematic HTML without it being altered by the parser

I suggest starting the review with:

  1. scielo_classic_website/spsxml/sps_xml_body_pipes.py - lines 130-210 (PreMainHTMLPipe and PreTranslatedHTMLPipe classes)

    • This is the main architectural change: the separation between pre-processing and final processing
  2. scielo_classic_website/htmlbody/html_body.py - lines 228-242 (fix_references function)

    • Contains the index inference logic for references

How could this be tested manually?

In the Upload application, run the article migration task for articles ingested into the classic site in HTML format. Open the Processing > HTML XML menu and inspect the HTML-to-XML conversion versions, especially versions 1 and 2.

Any context you would like to provide?

Problem context

Some XMLs were converted with defects, and it was not possible to identify the source of the problem because the first output of the conversion was far removed from the input.

Implemented solution

The refactoring introduces a two-phase pipeline:

  1. Pre-processing: wraps the HTML content in CDATA to avoid parsing errors
  2. Processing: converts the HTML to structured XML, inferring indices when necessary (see the sketch after this list)
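A minimal sketch of how phase 2 might read the preserved content back and repair it, assuming a <temp type="body"> element produced in phase 1; names are illustrative, not the repository's code:

from lxml import etree
from lxml.html import fromstring

def process_preserved_html(xml_root):
    # Illustrative only: pull the raw HTML out of the CDATA wrapper, let the
    # lenient HTML parser fix malformations, and attach the result as <body>.
    temp = xml_root.find(".//temp[@type='body']")
    if temp is None or not temp.text:
        return xml_root
    repaired = fromstring(temp.text)
    body = etree.Element("body")
    body.append(repaired)
    xml_root.insert(0, body)
    temp.getparent().remove(temp)
    return xml_root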

This approach is especially important for migrating SciELO legacy content, where we have thousands of articles with inconsistent formatting.

Screenshots

N/A - The changes are in backend processing, with no visual interface.

What are the relevant tickets?

  • Issue #[NÚMERO]: present the HTML as close to the original as possible, to make it easier to identify conversion errors

References

n/a

- Add logging import and debug message when initializing records with reference_index
- Helps track paragraph records that contain reference indices
- Add 'bases-work' to supported folder paths
- Extract HTML content fixing to dedicated function fix_html_content()
- Improve encoding handling with utf-8/iso-8859-1 fallback (illustrated after this list)
- Add better error logging for path resolution and empty content
- Simplify PDF key/lang extraction logic
- Pass paragraph_records parameter through Document initialization
- Add logic to create ParagraphRecord instances from paragraph_records
- Improve error logging in h_record property
- Use ellipsis syntax for exception classes
- Add guessed_reference_index for paragraphs without explicit indices
- Refactor paragraph data extraction to avoid direct dict access
- Add paragraph_text property usage for cleaner code
- Implement index tracking and guessing logic in fix_references()
- Improve reference range identification with detailed logging
- Simplify data combination to avoid overwriting item.data
- Use ellipsis syntax for exception class
- Split MainHTMLPipe into PreMainHTMLPipe with CDATA handling
- Add PreTranslatedHTMLPipe for translation pre-processing
- Introduce temp elements with CDATA for safer HTML handling
- Add CompleteDispFormulaPipe for disp-formula wrapping
- Handle guessed_reference_index attribute in references
- Restructure pipeline to process temp elements before final XML
- Improve reference processing with better index handling
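One of the bullets above mentions a utf-8/iso-8859-1 fallback when reading legacy files; a minimal sketch of that pattern (hypothetical helper, not the repository's fix_html_content):

def read_with_encoding_fallback(path):
    # Try UTF-8 first; ISO-8859-1 accepts any byte sequence, so legacy
    # Latin-1 files decode instead of raising UnicodeDecodeError.
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")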
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR enhances the traceability of HTML-to-XML document processing for the SciELO classic website by implementing a two-phase pipeline that preserves original HTML content before any transformations occur. This allows debugging and auditing of conversion issues in legacy content.

Key changes:

  • Introduces a new preprocessing step (convert_html_to_xml_step_0) that wraps raw HTML in CDATA within temporary elements before any malformation corrections
  • Implements a two-phase pipeline: pre-processing (HTML preservation via PreMainHTMLPipe, PreTranslatedHTMLPipe) and processing (conversion to valid XML)
  • Adds guessed_reference_index attribute to track inferred reference indices, distinguishing original from inferred data
  • Enhances logging throughout reference processing and paragraph identification

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.

Summary per file

  • scielo_classic_website/spsxml/sps_xml_body_pipes.py - Core pipeline refactoring: adds step_0 for HTML preservation, creates PreMainHTMLPipe/PreTranslatedHTMLPipe classes, refactors MainHTMLPipe/TranslatedHTMLPipe to process CDATA-wrapped content, adds CompleteDispFormulaPipe
  • scielo_classic_website/htmlbody/html_body.py - Implements reference index inference logic with guessed_reference_index tracking, adds logging for reference identification, refactors paragraph processing
  • scielo_classic_website/models/document.py - Enables paragraph records to be passed to DocumentRecords, improves code formatting
  • scielo_classic_website/models/issue_folder.py - Adds fix_html_content function for encoding handling, adds bases-work folder support, improves error handling
  • scielo_classic_website/isisdb/base_p_record.py - Adds debug logging when reference_index is present
Comments suppressed due to low confidence (2)

scielo_classic_website/spsxml/sps_xml_body_pipes.py:428

  • Variable parent is not used.
            parent = node.getparent()

scielo_classic_website/spsxml/sps_xml_body_pipes.py:435

  • Variable parent is not used.
            parent = node.getparent()


Comment on lines +425 to +427
if node is not None:
    body = text_to_node("body", node.text)
    xml.find(".").insert(0, body)

Copilot AI Nov 10, 2025


Variable parent is assigned but never used. After getting parent on line 428, there's no subsequent code that removes the temp node or performs any operation with parent. This looks like incomplete code where the temp node should be removed after processing.

Comment on lines +432 to +434
if node is not None:
    back = text_to_node("back", node.text)
    xml.find(".").insert(1, back)

Copilot AI Nov 10, 2025


Variable parent is assigned but never used. After getting parent on line 435, there's no subsequent code that removes the temp node or performs any operation with parent. This looks like incomplete code where the temp node should be removed after processing.

Comment on lines +229 to +234
index = None
for item in p_records:
    # item.data (dict which keys: text, index, reference_index)
    if item.data["text"]:
        item.data["text"] = html_fixer.avoid_mismatched_tags(item.data["text"])
    yield item.data
    if index:
        index += 1
    elif item.reference_index:

Copilot AI Nov 10, 2025


The index inference logic is incorrect. When index is None (first iteration), if the first item doesn't have reference_index, the index remains None and subsequent items won't increment it. The logic should initialize index when finding the first reference_index, then increment for every subsequent item regardless of whether it has a reference_index. Current logic will fail to set guessed_reference_index for items between gaps in reference indices.
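A minimal sketch of the behavior this comment asks for (anchor the counter on the first explicit reference_index, then keep counting through gaps); names are illustrative and this is not the repository's fix_references:

def infer_reference_indexes(items):
    # items: dicts that may or may not carry an explicit "reference_index"
    index = None
    for item in items:
        explicit = item.get("reference_index")
        if explicit:
            index = int(explicit)  # anchor on the explicit value
        elif index is not None:
            index += 1  # keep counting even when an item lacks its own index
            item["guessed_reference_index"] = str(index)
        yield item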

Comment on lines +182 to +190
index = None
for item in p_records:
    # item.data (dict which keys: text, index, reference_index)
    if item.data["text"]:
        # logging.info("Antes:")
        # logging.info(item.data)

        hc = HTMLContent(item.data["text"])
        root = hc.tree.find(".//body/*")
        if part_name:
            root.set("data-part", part_name)
        if not item.paragraph_text:
            continue

        ref_idx = item.data.get("reference_index")
        if ref_idx:
            root.set("data-ref-index", ref_idx)
        if index:
            index += 1
        elif item.reference_index:

Copilot AI Nov 10, 2025


The index inference logic is incorrect. When index is None (first iteration), if the first item doesn't have reference_index, the index remains None and subsequent items won't increment it. Additionally, items without paragraph_text are skipped but should still affect the index count. The logic should initialize index when finding the first reference_index, then increment for every subsequent item to maintain sequential reference numbering.

    ref_index = item["reference_index"]
    ref.set("id", f"B{ref_index}")
except KeyError:
    ref_index = item.get("guessed_reference_index")

Copilot AI Nov 10, 2025


If item.get('guessed_reference_index') returns None (when the key doesn't exist), this will create an id 'BNone' which is invalid. The code should handle the case where both reference_index and guessed_reference_index are missing, or raise an appropriate error.

Suggested change

ref_index = item.get("guessed_reference_index")
if ref_index is None:
    raise ValueError(
        f"Reference item at position {i} is missing both 'reference_index' and 'guessed_reference_index': {item}"
    )

Comment on lines +502 to +510
except KeyError:
    pass
temp_back = sub_article.find(".//temp[@type='back']")
if temp_back is not None:
    try:
        back = text_to_node("back", temp_back.text)
        sub_article.append(back)
    except KeyError:
        pass

Copilot AI Nov 10, 2025


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

except KeyError as e:
    logging.warning(f"KeyError while creating 'body' node from temp_body.text in sub-article: {e}")
temp_back = sub_article.find(".//temp[@type='back']")
if temp_back is not None:
    try:
        back = text_to_node("back", temp_back.text)
        sub_article.append(back)
    except KeyError as e:
        logging.warning(f"KeyError while creating 'back' node from temp_back.text in sub-article: {e}")

        back = text_to_node("back", temp_back.text)
        sub_article.append(back)
    except KeyError:
        pass

Copilot AI Nov 10, 2025


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

logging.warning(
    "KeyError in TranslatedHTMLPipe.transform while processing temp_back.text: %r",
    temp_back.text
)

    node.text = ET.CDATA(texts["before references"])
    sub_article.append(node)

except KeyError:

Copilot AI Nov 10, 2025


'except' clause does nothing but pass and there is no explanatory comment.

    node.set("type", "back")
    node.text = ET.CDATA(texts["after references"])
    sub_article.append(node)
except KeyError:

Copilot AI Nov 10, 2025


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

except KeyError:
    # It is expected that some translations may not have an 'after references' section.
    # In such cases, we simply skip adding the corresponding node.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@robertatakenaka merged commit b501d04 into scieloorg:main Nov 10, 2025
1 check passed
