Skip to content

Conversation

@robertatakenaka
Copy link
Member

Pull Request: Melhorias no tratamento de erros e preparação para refatoração

Descrição

Este PR implementa melhorias críticas no tratamento de erros e prepara o terreno para uma refatoração maior do sistema de migração, focando em tornar o código mais robusto e manutenível.

Mudanças Principais

🔧 Tratamento de Erros Aprimorado

  • classic_ws.py: Implementa validação explícita de article_pid com mensagens de erro descritivas
    • Lança FileNotFoundError quando arquivo ID não existe
    • Lança ValueError para PIDs com tamanho inválido
    • Melhora rastreabilidade de problemas na busca de arquivos

🐛 Correções Importantes

  • document.py: Corrige obtenção de licenças e datas
    • Usa fonte correta para issue_publication_date
    • Processa licenças como lista ao invés de dict
    • Adiciona fallback robusto para license_code
    • Implementa tratamento de exceções no parsing de HTML

🧹 Limpeza de Código

  • id2json3.py: Remove logging redundante que poluía os logs
  • controller.py e migration.py: Comenta temporariamente código não utilizado
    • Preserva lógica para referência futura
    • Prepara estrutura para nova arquitetura de migração

Impacto

  • Melhoria na confiabilidade: Erros são capturados e reportados com contexto adequado
  • Logs mais limpos: Remoção de mensagens desnecessárias
  • Preparação para refatoração: Código legado preservado mas isolado

Testes

As mudanças mantêm compatibilidade total com o código existente. Todos os testes existentes continuam passando.

Próximos Passos

Este PR prepara o terreno para:

  1. Nova arquitetura de migração de dados
  2. Refatoração completa do módulo controller
  3. Implementação de sistema de logs mais estruturado

- Comenta funções pids_and_their_records e get_acron_db_path
- Mantém código comentado para referência futura
- Prepara para refatoração completa do módulo controller
- Comenta imports e funções do módulo migration
- Mantém código comentado para preservar lógica de migração
- Prepara terreno para nova arquitetura de migração de dados
- Adiciona validação explícita do tamanho do article_pid
- Lança FileNotFoundError com mensagem descritiva quando arquivo ID não existe
- Lança ValueError quando article_pid tem tamanho inválido
- Mantém imports de Document, Issue, Journal com comentário explicativo
- Melhora rastreabilidade de erros na busca de arquivos de artigos
- Remove logging.info redundante que poluía os logs
- Mantém funcionalidade de conversão de registros ISIS intacta
- Usa h_record.issue_publication_date ao invés de self.issue.issue_publication_date
- Refatora license_texts para processar lista de licenças ao invés de dict
- Adiciona fallback para license_code quando issue_licenses não existe
- Melhora extração de URL e texto das licenças usando ElementTree
- Adiciona tratamento de exceções ao processar HTML de licenças
- Garante que sempre retorna dict com license_code quando disponível
Copilot AI review requested due to automatic review settings December 22, 2025 00:09
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request improves error handling in the SciELo Classic Website migration system and prepares the codebase for future refactoring. The main focus is on fixing data source issues for licenses and publication dates while making error conditions more explicit.

  • Fixes the source for issue_publication_date to use the correct h_record property instead of the non-existent self.issue property
  • Refactors license text processing to correctly handle list-based data structure and add fallback logic for missing license information
  • Enhances error reporting in file path resolution with explicit exceptions instead of silent failures

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
scielo_classic_website/models/document.py Corrects publication date source and refactors license processing to handle list structure with proper fallbacks
scielo_classic_website/classic_ws.py Adds explicit error handling with FileNotFoundError and ValueError for invalid article PIDs
scielo_classic_website/iid2json/id2json3.py Removes redundant logging statement to reduce log noise
scielo_classic_website/migration.py Comments out unused migration functions in preparation for refactoring
scielo_classic_website/controller.py Comments out unused controller functions in preparation for refactoring

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

item["code"] = license_code
try:
html = ET.fromstring(f"<root>{item['html']}</root>")
item["url"] = html.find(".//a").get("href")
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of html.find(".//a").get("href") can raise an AttributeError if no anchor tag is found in the HTML. When find() returns None, calling .get("href") on it will fail. Consider adding a None check or using a safer pattern like html.find(".//a") followed by a conditional check before accessing the href attribute.

Suggested change
item["url"] = html.find(".//a").get("href")
anchor = html.find(".//a")
if anchor is not None:
item["url"] = anchor.get("href")

Copilot uses AI. Check for mistakes.
Comment on lines +316 to +317
except Exception as e:
logging.exception(e)
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The broad Exception catch makes it difficult to identify and debug specific issues with HTML parsing or attribute access. Consider catching more specific exceptions (e.g., AttributeError, ET.ParseError) to provide better error context and handling.

Suggested change
except Exception as e:
logging.exception(e)
except (AttributeError, KeyError, ET.ParseError) as e:
logging.exception(f"Error parsing license HTML for language={language}: {e}")

Copilot uses AI. Check for mistakes.

from scielo_classic_website.iid2json import id2json3
from scielo_classic_website.isisdb.isis_cmd import ISISCommader
# manter Document, Issue, Journal imports para evitar quebra em outras partes do sistema
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment uses Portuguese ("manter Document, Issue, Journal imports para evitar quebra em outras partes do sistema"). Consider translating to English for consistency with the codebase, or verify if Portuguese is the standard for this project.

Suggested change
# manter Document, Issue, Journal imports para evitar quebra em outras partes do sistema
# Keep Document, Issue, and Journal imports to avoid breaking other parts of the system

Copilot uses AI. Check for mistakes.
f"ID file not found for article pid={article_pid} at {path}"
)
else:
raise ValueError(f"Found article_pid={article_pid} invalid size")
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message format "Found article_pid={article_pid} invalid size" is grammatically awkward. Consider rephrasing to "Invalid article_pid size: {article_pid} (expected 23 characters)" or "article_pid must be 23 characters, got {len(article_pid) if article_pid else 'None'}" for better clarity.

Suggested change
raise ValueError(f"Found article_pid={article_pid} invalid size")
raise ValueError(
f"article_pid must be 23 characters, got {len(article_pid) if article_pid else 'None'}"
)

Copilot uses AI. Check for mistakes.
Comment on lines +6 to +31
# def pids_and_their_records(source_file_path, db_type):
# if not source_file_path:
# return []

id_function = id2json3.article_id
if db_type == "title":
id_function = id2json3.journal_id
elif db_type == "issue":
id_function = id2json3.issue_id
# id_function = id2json3.article_id
# if db_type == "title":
# id_function = id2json3.journal_id
# elif db_type == "issue":
# id_function = id2json3.issue_id

# get id_file_path
id_file_path = isis_cmd.get_id_file_path(source_file_path)
return id2json3.pids_and_their_records(id_file_path, id_function)
# # get id_file_path
# id_file_path = isis_cmd.get_id_file_path(source_file_path)
# return id2json3.pids_and_their_records(id_file_path, id_function)


def get_acron_db_path(acron, id_folder_path=None):
config.check_migration_sources()
# def get_acron_db_path(acron, id_folder_path=None):
# config.check_migration_sources()

db_path = config.get_bases_acron(acron)
print("db:", db_path)
if id_folder_path:
id_file_path = os.path.join(id_folder_path, f"{acron}.id")
id_file_path = isis_cmd.create_id_file(db_path, id_file_path)
db_path = id_file_path
print(f"{id_file_path} - size: {size(id_file_path)} bytes")
return db_path
# db_path = config.get_bases_acron(acron)
# print("db:", db_path)
# if id_folder_path:
# id_file_path = os.path.join(id_folder_path, f"{acron}.id")
# id_file_path = isis_cmd.create_id_file(db_path, id_file_path)
# db_path = id_file_path
# print(f"{id_file_path} - size: {size(id_file_path)} bytes")
# return db_path
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment appears to contain commented-out code.

Copilot uses AI. Check for mistakes.
Comment on lines +5 to +25
# def get_document_pids(from_date, to_date):
# return controller.isis_cmd.get_document_pids(from_date, to_date)

def get_document_pids(from_date, to_date):
return controller.isis_cmd.get_document_pids(from_date, to_date)

# def get_paragraphs_records(pid):
# id_file_path = config.get_paragraphs_id_file_path(pid)
# return controller.pids_and_their_records(id_file_path, "artigo")

def get_paragraphs_records(pid):
id_file_path = config.get_paragraphs_id_file_path(pid)
return controller.pids_and_their_records(id_file_path, "artigo")

# def get_records_by_pid(pid):
# source_path = controller.isis_cmd.get_document_isis_db(pid)
# return controller.pids_and_their_records(source_path, "artigo")

def get_records_by_pid(pid):
source_path = controller.isis_cmd.get_document_isis_db(pid)
return controller.pids_and_their_records(source_path, "artigo")

# def get_records_by_source_path(db_type, source_path):
# return controller.pids_and_their_records(source_path, db_type)

def get_records_by_acron(acron, id_folder_path=None):
source_path = controller.get_acron_db_path(acron, id_folder_path)
return controller.pids_and_their_records(source_path, "artigo")


def get_records_by_source_path(db_type, source_path):
return controller.pids_and_their_records(source_path, db_type)


def get_issue_files(acron, issue_folder):
issue_files = IssueFiles(acron, issue_folder)
return issue_files.files
# def get_issue_files(acron, issue_folder):
# issue_files = IssueFiles(acron, issue_folder)
# return issue_files.files
Copy link

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment appears to contain commented-out code.

Copilot uses AI. Check for mistakes.
@robertatakenaka robertatakenaka merged commit d45b53b into scieloorg:main Dec 22, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant