Increase the traceability of paragraph and bibliographic reference processing #116
- Add logging import and debug message when initializing records with reference_index
- Helps track paragraph records that contain reference indices

- Add 'bases-work' to supported folder paths
- Extract HTML content fixing to dedicated function fix_html_content()
- Improve encoding handling with utf-8/iso-8859-1 fallback
- Add better error logging for path resolution and empty content
- Simplify PDF key/lang extraction logic

- Pass paragraph_records parameter through Document initialization
- Add logic to create ParagraphRecord instances from paragraph_records
- Improve error logging in h_record property
- Use ellipsis syntax for exception classes

- Add guessed_reference_index for paragraphs without explicit indices
- Refactor paragraph data extraction to avoid direct dict access
- Add paragraph_text property usage for cleaner code
- Implement index tracking and guessing logic in fix_references()
- Improve reference range identification with detailed logging
- Simplify data combination to avoid overwriting item.data
- Use ellipsis syntax for exception class

- Split MainHTMLPipe into PreMainHTMLPipe with CDATA handling
- Add PreTranslatedHTMLPipe for translation pre-processing
- Introduce temp elements with CDATA for safer HTML handling
- Add CompleteDispFormulaPipe for disp-formula wrapping
- Handle guessed_reference_index attribute in references
- Restructure pipeline to process temp elements before final XML
- Improve reference processing with better index handling
Pull Request Overview
This PR enhances the traceability of HTML-to-XML document processing for the SciELO classic website by implementing a two-phase pipeline that preserves original HTML content before any transformations occur. This allows debugging and auditing of conversion issues in legacy content.
Key changes:
- Introduces a new preprocessing step (`convert_html_to_xml_step_0`) that wraps raw HTML in CDATA within temporary elements before any malformation corrections
- Implements a two-phase pipeline: pre-processing (HTML preservation via `PreMainHTMLPipe`, `PreTranslatedHTMLPipe`) and processing (conversion to valid XML)
- Adds `guessed_reference_index` attribute to track inferred reference indices, distinguishing original from inferred data
- Enhances logging throughout reference processing and paragraph identification
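The preservation step can be pictured with a small, self-contained sketch. The PR itself works with lxml (`ET.CDATA`); the version below uses the standard library's `xml.dom.minidom` instead so that it runs without extra dependencies. The helper names `wrap_raw_html` and `read_raw_html` and the `<article>` root are illustrative assumptions, not code from the PR.

```python
from xml.dom import minidom


def wrap_raw_html(raw_html, part):
    """Store raw (possibly malformed) HTML untouched inside <temp type=...>.

    Hypothetical helper: the PR does this inside its pipes with lxml.
    """
    doc = minidom.Document()
    root = doc.createElement("article")
    doc.appendChild(root)
    temp = doc.createElement("temp")
    temp.setAttribute("type", part)
    # CDATA keeps the markup as plain text, so unclosed tags and bare
    # ampersands cannot break the surrounding XML.
    temp.appendChild(doc.createCDATASection(raw_html))
    root.appendChild(temp)
    return doc


def read_raw_html(xml_text, part):
    """Recover the original HTML from the serialized XML, or None if absent."""
    doc = minidom.parseString(xml_text)
    for temp in doc.getElementsByTagName("temp"):
        if temp.getAttribute("type") == part:
            return "".join(child.data for child in temp.childNodes)
    return None


raw = "<p>Unclosed <b>bold<p>Next & more"
xml_text = wrap_raw_html(raw, "body").documentElement.toxml()
recovered = read_raw_html(xml_text, "body")
```

Because the HTML travels as character data, the string read back is identical to the input, which is what makes a later comparison against the corrected output meaningful. A raw string that itself contains `]]>` cannot be stored in a single CDATA section and would need separate handling.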
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| scielo_classic_website/spsxml/sps_xml_body_pipes.py | Core pipeline refactoring: adds step_0 for HTML preservation, creates PreMainHTMLPipe/PreTranslatedHTMLPipe classes, refactors MainHTMLPipe/TranslatedHTMLPipe to process CDATA-wrapped content, adds CompleteDispFormulaPipe |
| scielo_classic_website/htmlbody/html_body.py | Implements reference index inference logic with guessed_reference_index tracking, adds logging for reference identification, refactors paragraph processing |
| scielo_classic_website/models/document.py | Enables paragraph records to be passed to DocumentRecords, improves code formatting |
| scielo_classic_website/models/issue_folder.py | Adds fix_html_content function for encoding handling, adds bases-work folder support, improves error handling |
| scielo_classic_website/isisdb/base_p_record.py | Adds debug logging when reference_index is present |
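The utf-8/iso-8859-1 fallback listed for `issue_folder.py` in the table above can be sketched as follows. The body of `fix_html_content()` is not shown on this page, so this is only a minimal illustration of the fallback idea; the function name `decode_html_bytes` is hypothetical.

```python
def decode_html_bytes(raw_bytes):
    """Decode HTML bytes as UTF-8, falling back to ISO-8859-1.

    Hypothetical helper, not the PR's fix_html_content().
    """
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw_bytes.decode("iso-8859-1")


utf8_text = decode_html_bytes("café".encode("utf-8"))
latin1_text = decode_html_bytes("café".encode("iso-8859-1"))
```

Trying UTF-8 first matters: ISO-8859-1 accepts any byte sequence, so using it first would silently turn valid UTF-8 into mojibake.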
Comments suppressed due to low confidence (2)
scielo_classic_website/spsxml/sps_xml_body_pipes.py:428
- Variable parent is not used.
parent = node.getparent()
scielo_classic_website/spsxml/sps_xml_body_pipes.py:435
- Variable parent is not used.
parent = node.getparent()
    if node is not None:
        body = text_to_node("body", node.text)
        xml.find(".").insert(0, body)
Copilot
AI
Nov 10, 2025
Variable parent is assigned but never used. After getting parent on line 428, there's no subsequent code that removes the temp node or performs any operation with parent. This looks like incomplete code where the temp node should be removed after processing.
    if node is not None:
        back = text_to_node("back", node.text)
        xml.find(".").insert(1, back)
Copilot
AI
Nov 10, 2025
Variable parent is assigned but never used. After getting parent on line 435, there's no subsequent code that removes the temp node or performs any operation with parent. This looks like incomplete code where the temp node should be removed after processing.
    index = None
    for item in p_records:
    # item.data (dict which keys: text, index, reference_index)
    if item.data["text"]:
    item.data["text"] = html_fixer.avoid_mismatched_tags(item.data["text"])
    yield item.data
    if index:
    index += 1
    elif item.reference_index:
Copilot
AI
Nov 10, 2025
The index inference logic is incorrect. When index is None (first iteration), if the first item doesn't have reference_index, the index remains None and subsequent items won't increment it. The logic should initialize index when finding the first reference_index, then increment for every subsequent item regardless of whether it has a reference_index. Current logic will fail to set guessed_reference_index for items between gaps in reference indices.
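The behavior this comment asks for can be sketched in isolation. The version below is a minimal illustration under stated assumptions, not the PR's `fix_references()`: records are plain dicts, the inferred number is stored as a string under `guessed_reference_index`, and the counter is resynchronized whenever an explicit `reference_index` appears. The function name `assign_guessed_indices` is hypothetical.

```python
def assign_guessed_indices(records):
    """Yield copies of records, adding guessed_reference_index where needed."""
    index = None  # stays None until the first explicit reference_index
    for record in records:
        item = dict(record)  # copy, so the caller's data is not overwritten
        explicit = item.get("reference_index")
        if explicit:
            index = int(explicit)  # resynchronize on every explicit index
        elif index is not None:
            index += 1
            item["guessed_reference_index"] = str(index)
        yield item


records = [
    {"text": "Introduction"},
    {"text": "Ref A", "reference_index": "1"},
    {"text": "Ref B"},
    {"text": "Ref C"},
    {"text": "Ref D", "reference_index": "4"},
    {"text": "Ref E"},
]
result = list(assign_guessed_indices(records))
```

Testing `index is not None` rather than `if index:` keeps the counter alive even if it ever holds `0`, and records that appear before the first explicit index are left unmarked instead of receiving a made-up number.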
    index = None
    for item in p_records:
    # item.data (dict which keys: text, index, reference_index)
    if item.data["text"]:
    # logging.info("Antes:")
    # logging.info(item.data)

    hc = HTMLContent(item.data["text"])
    root = hc.tree.find(".//body/*")
    if part_name:
    root.set("data-part", part_name)
    if not item.paragraph_text:
    continue

    ref_idx = item.data.get("reference_index")
    if ref_idx:
    root.set("data-ref-index", ref_idx)
    if index:
    index += 1
    elif item.reference_index:
Copilot
AI
Nov 10, 2025
The index inference logic is incorrect. When index is None (first iteration), if the first item doesn't have reference_index, the index remains None and subsequent items won't increment it. Additionally, items without paragraph_text are skipped but should still affect the index count. The logic should initialize index when finding the first reference_index, then increment for every subsequent item to maintain sequential reference numbering.
        ref_index = item["reference_index"]
        ref.set("id", f"B{ref_index}")
    except KeyError:
        ref_index = item.get("guessed_reference_index")
Copilot
AI
Nov 10, 2025
If item.get('guessed_reference_index') returns None (when the key doesn't exist), this will create an id 'BNone' which is invalid. The code should handle the case where both reference_index and guessed_reference_index are missing, or raise an appropriate error.
Suggested change:
    - ref_index = item.get("guessed_reference_index")
    + ref_index = item.get("guessed_reference_index")
    + if ref_index is None:
    +     raise ValueError(
    +         f"Reference item at position {i} is missing both 'reference_index' and 'guessed_reference_index': {item}"
    +     )
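The same guard can also be isolated in a small helper, which makes the fallback order easy to test on its own. This is a sketch under assumptions: `resolve_ref_id` is a hypothetical name, and `item` is treated as a plain dict, as in the snippet under review.

```python
def resolve_ref_id(item, position):
    """Return the <ref> id, preferring the explicit index over the guessed one."""
    for key in ("reference_index", "guessed_reference_index"):
        value = item.get(key)
        if value is not None:
            return f"B{value}"
    # Neither key is present: fail loudly instead of emitting "BNone".
    raise ValueError(
        f"Reference item at position {position} is missing both "
        f"'reference_index' and 'guessed_reference_index': {item}"
    )


explicit_id = resolve_ref_id({"reference_index": "3"}, 0)
guessed_id = resolve_ref_id({"guessed_reference_index": "7"}, 1)
```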
    except KeyError:
        pass
    temp_back = sub_article.find(".//temp[@type='back']")
    if temp_back is not None:
        try:
            back = text_to_node("back", temp_back.text)
            sub_article.append(back)
        except KeyError:
            pass
Copilot
AI
Nov 10, 2025
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:
    - except KeyError:
    -     pass
    - temp_back = sub_article.find(".//temp[@type='back']")
    - if temp_back is not None:
    -     try:
    -         back = text_to_node("back", temp_back.text)
    -         sub_article.append(back)
    -     except KeyError:
    -         pass
    + except KeyError as e:
    +     logging.warning(f"KeyError while creating 'body' node from temp_body.text in sub-article: {e}")
    + temp_back = sub_article.find(".//temp[@type='back']")
    + if temp_back is not None:
    +     try:
    +         back = text_to_node("back", temp_back.text)
    +         sub_article.append(back)
    +     except KeyError as e:
    +         logging.warning(f"KeyError while creating 'back' node from temp_back.text in sub-article: {e}")
        back = text_to_node("back", temp_back.text)
        sub_article.append(back)
    except KeyError:
        pass
Copilot
AI
Nov 10, 2025
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:
    - pass
    + logging.warning(
    +     "KeyError in TranslatedHTMLPipe.transform while processing temp_back.text: %r",
    +     temp_back.text
    + )
        node.text = ET.CDATA(texts["before references"])
        sub_article.append(node)

    except KeyError:
Copilot
AI
Nov 10, 2025
'except' clause does nothing but pass and there is no explanatory comment.
        node.set("type", "back")
        node.text = ET.CDATA(texts["after references"])
        sub_article.append(node)
    except KeyError:
Copilot
AI
Nov 10, 2025
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:
    - except KeyError:
    + except KeyError:
    +     # It is expected that some translations may not have an 'after references' section.
    +     # In such cases, we simply skip adding the corresponding node.
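An alternative to documenting an empty `except` is to make the optional lookup explicit, so that a missing section is visibly a normal case rather than a swallowed error. The sketch below is not from the PR: `collect_sections` is a hypothetical helper, and the mapping of `"before references"` to a `body` type is an assumption (only the `back` / `"after references"` pairing appears in the snippets above).

```python
import logging

# Assumed mapping from keys of `texts` to the type attribute of <temp>.
SECTION_TYPES = {"before references": "body", "after references": "back"}


def collect_sections(texts):
    """Return (temp_type, content) pairs for the sections that exist."""
    sections = []
    for key, temp_type in SECTION_TYPES.items():
        content = texts.get(key)
        if content is None:
            # Expected for some translations: the section does not exist.
            logging.debug("translation has no %r section", key)
            continue
        sections.append((temp_type, content))
    return sections


only_body = collect_sections({"before references": "<p>Text</p>"})
both = collect_sections(
    {"before references": "<p>Text</p>", "after references": "<p>Notes</p>"}
)
```

With `dict.get`, an unrelated `KeyError` raised deeper in the call chain is no longer silenced by accident, which is the main risk of a broad `try`/`except KeyError: pass`.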
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
What does this PR do?
This PR increases the traceability of the HTML/ISIS-to-XML document processing system by implementing:
- Preservation of the original HTML: adds an initial step (`convert_html_to_xml_step_0`) that captures and preserves the original HTML using `<temp>` elements with CDATA, making it possible to view the content before any malformation fixes
- Two-phase pipeline: a pre-processing phase (`PreMainHTMLPipe`, `PreTranslatedHTMLPipe`) followed by the processing phase
- Transformation tracking: the original HTML (preserved in step_0) can now be compared with the final result, which makes conversion problems easier to debug
- Traceable inference system: marks references with inferred indices using `guessed_reference_index="true"`, distinguishing original data from deduced data
- Detailed logging: adds logs at critical points, especially in reference processing and index identification
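The comparison between the preserved HTML and the final output can be done with any diff tool. As a minimal sketch, assuming both versions are already available as strings, the standard library's `difflib` is enough; the function name `diff_versions` and the labels `step_0` and `final` are illustrative, not part of the PR.

```python
import difflib


def diff_versions(original_html, converted_xml):
    """Return unified-diff lines between the preserved HTML and the output."""
    return list(
        difflib.unified_diff(
            original_html.splitlines(),
            converted_xml.splitlines(),
            fromfile="step_0",
            tofile="final",
            lineterm="",
        )
    )


original = "<p>First<br>\n<p>Second"
converted = "<p>First<break/></p>\n<p>Second</p>"
diff_lines = diff_versions(original, converted)
```

Lines prefixed with `-` come from the preserved input and lines prefixed with `+` from the converted output, so each correction applied by the pipeline shows up as a pair.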
Why is this important?
Before this change, when the HTML was corrected to fix malformations, we lost visibility of the original content. Now, with the content preserved in CDATA at the initial step, we can:
I suggest starting the review with:
- `scielo_classic_website/spsxml/sps_xml_body_pipes.py`, lines 130-210 (classes `PreMainHTMLPipe` and `PreTranslatedHTMLPipe`)
- `scielo_classic_website/htmlbody/html_body.py`, lines 228-242 (function `fix_references`)
How could this be tested manually?
In the Upload application, run the migration task for articles that entered the classic site in HTML format. Open the menu Processing > HTML XML and inspect the HTML-to-XML conversion versions, especially versions 1 and 2.
Any context you would like to provide?
Problem context
Some XML files were produced with defects, and the source of the problem could not be identified because the first conversion output was already far removed from the input.
Solution implemented
The refactoring introduces a two-phase pipeline:
This approach is especially important for migrating SciELO legacy content, where we have thousands of articles with inconsistent formatting.
Screenshots
N/A - The changes are in backend processing, with no visual interface.
What are the relevant tickets?
References
n/a