Filter non-active journals from KBART export#80
Conversation
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR updates the KBART export to exclude journals that are not actively publishing, aligning the exported dataset with expectations of library discovery systems.
Changes:
- Filtered KBART export to only include journals where
current_status == 'current'. - Added debug logging to record when non-active journals are skipped.
- Updated
.gitignoreto ignore asrc/directory (pip source directory).
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| export/kbart.py | Filters out non-current journals during iteration and adds debug logs for skipped titles. |
| .gitignore | Ignores src/ directory. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| logger.debug('Skipping non-active journal: %s (status: %s)' % (data.scielo_issn, data.current_status)) | ||
| continue | ||
| logger.debug('Reading document: %s' % data.scielo_issn) |
There was a problem hiding this comment.
These logger.debug(...) calls build the message via % interpolation before the log level check, which adds overhead when running with INFO/WARNING. Prefer passing the format string and arguments to the logger (e.g., logger.debug('Skipping ... %s', value)), which defers formatting until needed.
| logger.debug('Skipping non-active journal: %s (status: %s)' % (data.scielo_issn, data.current_status)) | |
| continue | |
| logger.debug('Reading document: %s' % data.scielo_issn) | |
| logger.debug('Skipping non-active journal: %s (status: %s)', data.scielo_issn, data.current_status) | |
| continue | |
| logger.debug('Reading document: %s', data.scielo_issn) |
| if data.current_status != 'current': | ||
| logger.debug('Skipping non-active journal: %s (status: %s)' % (data.scielo_issn, data.current_status)) | ||
| continue | ||
| logger.debug('Reading document: %s' % data.scielo_issn) | ||
| yield self.fmt_csv(data) |
There was a problem hiding this comment.
With the new current_status filter, items() will only ever call fmt_csv() for current journals. In fmt_csv() the last_document lookup and the non-current branch become unreachable, so last_included_document_by_journal() is now always an unnecessary API call for every exported row. Consider removing the non-current branch entirely (or at least moving the last_document lookup inside the non-current branch if you intend to keep it for future use).
KBART export was including all journals regardless of publication status. Library discovery systems expect only actively publishing journals.
Changes
current_statusfilter inDumper.items()to skip journals with statussuspendedordeceasedcurrent_status == 'current'are now exportedOriginal prompt
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.