-
Notifications
You must be signed in to change notification settings - Fork 3
refactor(issue_folder): otimiza leitura de arquivos e melhora tratame… #134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(issue_folder): otimiza leitura de arquivos e melhora tratame… #134
Conversation
…nto de erros - Refatora fixed_glob() para retornar conjunto de paths sem ler conteúdo - Move leitura de arquivos para get_files() melhorando separação de responsabilidades - Simplifica logs de decodificação HTML (utf-8 vs iso-8859-1) - Remove logs verbosos desnecessários em files property - Melhora busca recursiva de arquivos de tradução com pattern '**/*.*' - Trata HTMLs vazios sem lançar exceção, apenas não define content - Simplifica lógica de detecção de traduções em bases_pdf_files - Remove continue desnecessário após IndexError em bases_pdf_files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the issue_folder.py module to optimize file processing for SciELO issues. The changes separate file discovery from file reading, update translation file search patterns, and modify error handling and logging approaches.
Key Changes:
- Refactored
fixed_glob()to only return file paths instead of reading file contents immediately - Changed translation file pattern from
*.ht*to**/*.*for recursive subdirectory search - Simplified error handling and reduced logging verbosity
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| logging.info("HTML content decoded as utf-8") | ||
| except Exception as e: | ||
| content = content.decode("iso-8859-1") | ||
| logging.warning( | ||
| f"HTML content decoded using iso-8859-1 due to utf-8 decoding error. {e}" | ||
| ) | ||
| logging.info("HTML content decoded as iso-8859-1") |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logging at INFO level for every file's encoding detection will create excessive log noise in production. Consider using DEBUG level for these routine operations, or removing the logs entirely since encoding detection is expected behavior.
| self._subdir_acron_issue, | ||
| "*.ht*", | ||
| "**", | ||
| "*.*", |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern change from ".ht" to "" + "." enables recursive searching, but will now match all file types in subdirectories, not just HTML files (.htm, .html). This could lead to processing non-HTML files as HTML content, which may cause errors. Consider using a more specific pattern like "/.ht" to maintain the original file type filtering while enabling recursive search.
| "*.*", | |
| "*.ht*", |
| f"Conteúdo vazio no arquivo HTML {item['path']}" | ||
| ) | ||
| item["content"] = content | ||
| if content: |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When fix_html_content() returns None (for empty content), the original item["content"] remains as raw bytes. This creates inconsistent content types in the output - some items will have processed string content while others have unprocessed bytes. This inconsistency could cause issues for consumers of this data. Consider always setting item["content"] to an appropriate value (e.g., empty string or None) to maintain type consistency.
| if content: | |
| if content is None: | |
| # Ensure item["content"] is always a text value, not raw bytes | |
| item["content"] = "" | |
| else: |
| lang = None | ||
| item["lang"] = None | ||
| key = item["key"] | ||
| if "_" in key and key[2] == "_": |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of the IndexError handler without replacing it with proper bounds checking creates a potential bug. The code assumes key[2] exists when checking key[2] == "_" on line 184, but if key has fewer than 3 characters, this will raise an IndexError that is no longer caught. This will cause the entire generator to fail instead of gracefully handling the invalid filename.
| if "_" in key and key[2] == "_": | |
| if len(key) >= 3 and key[2] == "_": |
| except Exception as e: | ||
| yield {"type": file_type, "error": str(e), "error_type": type(e).__name__} | ||
| logging.exception(e) | ||
| item["error"] = str(e) | ||
| item["error_type"] = type(e).__name__ |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable item may not be defined if an exception occurs before line 54. If an exception happens during the modified_date(), os.path.basename(), or _get_classic_website_rel_path() calls (lines 57-59), the variable item will not exist when trying to set error attributes on lines 66-67, causing a NameError.
| def fixed_glob(patterns, file_type, recursive): | ||
| paths = set() | ||
| for pattern in patterns: | ||
| for path in glob.glob(pattern, recursive=recursive): | ||
| try: | ||
| item = { | ||
| "type": file_type, | ||
| } | ||
| item["path"] = path | ||
| with open(path, "rb") as f: | ||
| item["content"] = f.read() | ||
| item["modified_date"] = modified_date(path) | ||
| except Exception as e: | ||
| logging.exception(e) | ||
| item["error"] = str(e) | ||
| item["error_type"] = type(e).__name__ | ||
| yield item | ||
| paths.update( | ||
| glob.glob(pattern, recursive=recursive) | ||
| ) | ||
| return paths |
Copilot
AI
Dec 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter file_type is passed to fixed_glob() but never used within that function. This parameter should either be removed from the function signature or utilized if it serves a purpose.
Pull Request: Otimização do processamento de arquivos em issue_folder
Descrição
Este PR refatora o módulo
issue_folder.pypara melhorar performance e separação de responsabilidades no processamento de arquivos de issues do SciELO.Mudanças Principais
🚀 Otimização de Performance
fixed_glob()agora apenas localiza arquivos,get_files()faz a leitura📁 Melhorias no Processamento de Arquivos
'**/*.*'para arquivos de tradução🔧 Tratamento de Erros Mais Robusto
🧹 Limpeza de Código
continueredundante apósIndexErrorBenefícios
Impacto
Nenhuma mudança na API pública. Código cliente continua funcionando sem alterações.
Exemplo de Melhoria
Antes:
Depois: