-
Notifications
You must be signed in to change notification settings - Fork 3
Melhorias críticas no processamento de HTML malformado #136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Melhorias críticas no processamento de HTML malformado #136
Conversation
- Modifica get_tagless_items() para escapar < e > como < e > - Previne quebra de parsing XML quando há caracteres especiais soltos - Mantém integridade do conteúdo textual ao processar HTML malformado
…ew_name - Propaga parâmetro new_name através da cadeia de chamadas - Modifica assinatura de get_html_to_embed() e _read_html_file() - Implementa fallback para quando file_reader não está disponível - Retorna tag <a> com atributos name e asset_type quando não há reader - Corrige propagação de new_name em _find_and_embed_html() - Prepara estrutura para futura implementação completa de file_reader
- Adiciona get_paragraphs_text() para concatenar textos de parágrafos - Refatora get_text_block() para processar bloco completo antes de fallback - Otimiza _identify_references_range() removendo logs e lógica redundante - Melhora performance ao processar grandes blocos de texto HTML - Mantém fallback para processamento individual quando bloco falha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR aims to improve HTML processing robustness and performance by adding special character escaping, refactoring parameter propagation, and optimizing paragraph processing. The changes span three core HTML processing modules.
Key changes include:
- Addition of special character escaping (
<and>) in tagless content extraction - Parameter name change from
raw_datatonew_namein HTML embedder with propagation through call chain - Optimization of paragraph processing to handle blocks at once rather than individually
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scielo_classic_website/htmlbody/html_fixer.py | Adds escaping of < and > characters in non-tag content to prevent XML parsing errors |
| scielo_classic_website/htmlbody/html_embedder.py | Changes parameter from raw_data to new_name and adds fallback HTML placeholder generation when file_reader is unavailable |
| scielo_classic_website/htmlbody/html_body.py | Refactors reference range identification logic, adds new get_paragraphs_text() function, and optimizes get_text_block() to process paragraphs in bulk |
Comments suppressed due to low confidence (1)
scielo_classic_website/htmlbody/html_embedder.py:68
- The docstring parameter documentation is outdated. It still references 'raw_data' but the parameter has been changed to 'new_name'. Update the documentation to reflect the new parameter name and its purpose.
"""
Processa um HTML local navegando recursivamente e corrigindo âncoras/hrefs.
Args:
href: Caminho para o arquivo HTML local
raw_data: Dados brutos para acessar arquivos (opcional)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # TODO: implementar leitura via file_reader | ||
| name = f' name="{new_name}"' if new_name else "" | ||
| return f'<a{name} path="{href}" asset_type="html"></a>' |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fallback HTML generation is vulnerable to injection attacks. The 'new_name' and 'href' parameters are inserted directly into the HTML string without escaping. If these values contain special characters like quotes or angle brackets, they could break the HTML structure or enable XSS attacks. These values should be properly escaped using html.escape() or similar before insertion.
| if item and item[0] == "<" and item[-1] == ">": | ||
| continue | ||
| yield item | ||
| yield item.replace("<", "<").replace(">", ">") |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The yield statement is incorrectly indented. It should be at the same indentation level as the previous lines (line 246-248) to be inside the for loop. Currently, it's indented further which makes it inside the if block at line 246, meaning items that don't pass the strip check will never be yielded. This breaks the function's logic as it won't yield all items from break_content(content).
| if (item or "").strip(): | ||
| if item and item[0] == "<" and item[-1] == ">": | ||
| continue |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The escaping logic will fail for empty items. When an item is an empty string or contains only whitespace (fails the strip check on line 246), it will still reach line 249 and call replace on it. However, empty items should be yielded as-is without escaping, matching the original behavior where they were yielded unchanged. The escaping should only apply to non-empty, non-tag text content.
| if (item or "").strip(): | |
| if item and item[0] == "<" and item[-1] == ">": | |
| continue | |
| # Itens vazios ou apenas com espaços em branco são retornados sem escape, | |
| # preservando o comportamento original. | |
| if not (item or "").strip(): | |
| yield item | |
| continue | |
| # Itens que são tags completas são ignorados (removidos do conteúdo). | |
| if item and item[0] == "<" and item[-1] == ">": | |
| continue | |
| # Conteúdo de texto não vazio e que não é tag tem seus caracteres "<" e ">" | |
| # escapados para evitar que sejam interpretados como HTML. |
Pull Request: Melhorias críticas no processamento de HTML malformado
Descrição
Este PR corrige problemas críticos no processamento de HTML malformado e otimiza o pipeline de conversão HTML, garantindo maior robustez e performance.
Mudanças Principais
🛡️ Correção de Segurança - html_fixer.py
<e>soltos agora são escapados"x < 5 e y > 3"→"x < 5 e y > 3"🔄 Refatoração de Embedder - html_embedder.py
Melhor propagação de parâmetros:
new_nameagora flui corretamente pela cadeia de chamadasFallback robusto:
file_reader<a>com metadados quando reader indisponível⚡ Otimização de Performance - html_body.py
Processamento em bloco:
get_paragraphs_text(): Concatena textos uma única vezget_text_block(): Processa bloco inteiro antes de fallbackLimpeza de código:
_identify_references_range()Exemplos de Melhorias
Antes (html_fixer.py):
Depois:
Antes (html_body.py):
Depois:
Benefícios
Impacto
<,>) agora são processados corretamenteTestes
Todos os testes existentes passam. Novos casos de teste adicionados para caracteres especiais.