Adiciona RawOrganizationMixin para transição de Institution#1276
Adiciona RawOrganizationMixin para transição de Institution#1276robertatakenaka merged 7 commits intomainfrom
Conversation
- Created RawOrganizationMixin in core/models.py with fields for raw organization data - Added RawOrganizationMixin to PublisherHistory, OwnerHistory, CopyrightHolderHistory, SponsorHistory - Updated _add_institution_history and add_* methods to accept and populate raw organization fields - Updated calls in am_to_core.py to pass raw organization data - Created task_replace_institution_by_raw_institution task in journal/tasks.py Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
- Added test cases for add_publisher, add_owner, add_sponsor, add_copyright_holder with raw organization fields - Added backward compatibility test to ensure existing code still works Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
- Replace N+1 query pattern with bulk_update for better performance - Update PublisherHistory, OwnerHistory, SponsorHistory, CopyrightHolderHistory records in batches Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
- Use institution__institution_name__in for batch filtering - Reduces multiple queries to single query per history type - Collects all names first before filtering Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces RawOrganizationMixin to enable storage of unstructured organization data directly in history models (PublisherHistory, OwnerHistory, CopyrightHolderHistory, SponsorHistory), preparing for the deprecation of institution.models.Institution. The changes are backward compatible, with raw fields being optional parameters.
Changes:
- Added
RawOrganizationMixinwith 7 CharField fields for unstructured organization data (raw_text, raw_institution_name, raw_country_name, raw_country_code, raw_state_name, raw_state_acron, raw_city_name) - Integrated mixin into 4 history models via multiple inheritance
- Extended
add_publisher,add_owner,add_sponsor, andadd_copyright_holdermethods to accept optional raw field parameters - Created
task_replace_institution_by_raw_institutionfor migrating existing AMJournal data to populate raw fields - Added migration 0055 to add 28 new database fields (7 fields × 4 models)
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| core/models.py | Defines RawOrganizationMixin with 7 optional CharField/TextField fields for storing unstructured organization data |
| journal/models.py | Adds RawOrganizationMixin to 4 history model classes and extends add_* methods to accept raw field parameters |
| journal/migrations/0055_add_raw_organization_fields.py | Migration adding 28 new nullable fields (7 per history model) to support raw organization data |
| journal/tasks.py | Implements migration task to populate raw fields from AMJournal data using bulk_update optimization |
| journal/sources/am_to_core.py | Updates institution import functions to pass raw field values when creating history records |
| journal/tests.py | Adds comprehensive test suite for RawOrganizationMixin functionality and backward compatibility |
| .gitignore | Adds src/ directory to ignore list (unrelated to main PR purpose) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| from collection.models import Collection | ||
| collections = Collection.objects.filter(acron3__in=collection_acron_list) |
There was a problem hiding this comment.
The import for Collection model is placed inside the function rather than at the module level. However, Collection is already imported at the top of the file (line 10 in the original file). This function-level import is redundant and should be removed.
| @celery_app.task(bind=True, name="task_replace_institution_by_raw_institution") | ||
| def task_replace_institution_by_raw_institution( | ||
| self, | ||
| username=None, | ||
| user_id=None, | ||
| collection_acron_list=None, | ||
| journal_issns=None, | ||
| ): | ||
| """ | ||
| Task to populate RawOrganizationMixin fields from AMJournal records. | ||
|
|
||
| This task extracts institution data from journal.models.AMJournal records and | ||
| populates the raw organization fields in PublisherHistory, OwnerHistory, | ||
| CopyrightHolderHistory, and SponsorHistory. | ||
|
|
||
| Args: | ||
| username: User name for authentication | ||
| user_id: User ID for authentication | ||
| collection_acron_list: List of collection acronyms to filter journals | ||
| journal_issns: List of journal ISSNs to filter journals | ||
|
|
||
| Returns: | ||
| Dict with processing statistics | ||
| """ | ||
| from journal.sources.am_data_extraction import extract_value | ||
|
|
||
| user = _get_user(self.request, username=username, user_id=user_id) | ||
|
|
||
| try: | ||
| # Build queryset for AMJournal records | ||
| queryset = AMJournal.objects.all() | ||
|
|
||
| # Filter by collection if provided | ||
| if collection_acron_list: | ||
| from collection.models import Collection | ||
| collections = Collection.objects.filter(acron3__in=collection_acron_list) | ||
| queryset = queryset.filter(collection__in=collections) | ||
|
|
||
| # Filter by journal ISSN if provided | ||
| if journal_issns: | ||
| queryset = queryset.filter(pid__in=journal_issns) | ||
|
|
||
| processed_count = 0 | ||
| error_count = 0 | ||
|
|
||
| for am_journal in queryset.iterator(): | ||
| try: | ||
| # Skip if no data | ||
| if not am_journal.data: | ||
| continue | ||
|
|
||
| # Get the corresponding journal | ||
| try: | ||
| scielo_journal = SciELOJournal.objects.get( | ||
| issn_scielo=am_journal.pid, | ||
| collection=am_journal.collection | ||
| ) | ||
| journal = scielo_journal.journal | ||
| except SciELOJournal.DoesNotExist: | ||
| logger.warning( | ||
| f"SciELOJournal not found for pid={am_journal.pid}, " | ||
| f"collection={am_journal.collection}" | ||
| ) | ||
| continue | ||
|
|
||
| # Extract data from AMJournal | ||
| data = am_journal.data | ||
|
|
||
| # Extract publisher/owner data | ||
| publisher = extract_value(data.get("publisher_name")) | ||
| publisher_country = extract_value(data.get("publisher_country")) | ||
| publisher_state = extract_value(data.get("publisher_state")) | ||
| publisher_city = extract_value(data.get("publisher_city")) | ||
|
|
||
| # Extract sponsor data | ||
| sponsor = extract_value(data.get("sponsors")) | ||
|
|
||
| # Extract copyright holder data | ||
| copyright_holder = extract_value(data.get("copyrighter")) | ||
|
|
||
| # Update PublisherHistory and OwnerHistory records | ||
| if publisher: | ||
| if isinstance(publisher, str): | ||
| publisher = [publisher] | ||
|
|
||
| # Filter non-empty publisher names | ||
| publisher_names = [p for p in publisher if p] | ||
|
|
||
| if publisher_names: | ||
| # Update PublisherHistory records - use __in for single query | ||
| pub_hist_list = list(journal.publisher_history.filter( | ||
| institution__institution_name__in=publisher_names | ||
| )) | ||
| # Create a mapping of institution names to publisher data | ||
| for pub_hist in pub_hist_list: | ||
| pub_hist.raw_institution_name = pub_hist.institution.institution_name | ||
| pub_hist.raw_country_name = publisher_country | ||
| pub_hist.raw_state_name = publisher_state | ||
| pub_hist.raw_city_name = publisher_city | ||
|
|
||
| if pub_hist_list: | ||
| from journal.models import PublisherHistory | ||
| PublisherHistory.objects.bulk_update( | ||
| pub_hist_list, | ||
| ['raw_institution_name', 'raw_country_name', | ||
| 'raw_state_name', 'raw_city_name'] | ||
| ) | ||
|
|
||
| # Update OwnerHistory records - use __in for single query | ||
| own_hist_list = list(journal.owner_history.filter( | ||
| institution__institution_name__in=publisher_names | ||
| )) | ||
| for own_hist in own_hist_list: | ||
| own_hist.raw_institution_name = own_hist.institution.institution_name | ||
| own_hist.raw_country_name = publisher_country | ||
| own_hist.raw_state_name = publisher_state | ||
| own_hist.raw_city_name = publisher_city | ||
|
|
||
| if own_hist_list: | ||
| from journal.models import OwnerHistory | ||
| OwnerHistory.objects.bulk_update( | ||
| own_hist_list, | ||
| ['raw_institution_name', 'raw_country_name', | ||
| 'raw_state_name', 'raw_city_name'] | ||
| ) | ||
|
|
||
| # Update SponsorHistory records | ||
| if sponsor: | ||
| if isinstance(sponsor, str): | ||
| sponsor = [sponsor] | ||
|
|
||
| # Filter non-empty sponsor names | ||
| sponsor_names = [s for s in sponsor if s] | ||
|
|
||
| if sponsor_names: | ||
| spon_hist_list = list(journal.sponsor_history.filter( | ||
| institution__institution_name__in=sponsor_names | ||
| )) | ||
| for spon_hist in spon_hist_list: | ||
| spon_hist.raw_institution_name = spon_hist.institution.institution_name | ||
|
|
||
| if spon_hist_list: | ||
| from journal.models import SponsorHistory | ||
| SponsorHistory.objects.bulk_update( | ||
| spon_hist_list, | ||
| ['raw_institution_name'] | ||
| ) | ||
|
|
||
| # Update CopyrightHolderHistory records | ||
| if copyright_holder: | ||
| if isinstance(copyright_holder, str): | ||
| copyright_holder = [copyright_holder] | ||
|
|
||
| # Filter non-empty copyright holder names | ||
| cp_names = [cp for cp in copyright_holder if cp] | ||
|
|
||
| if cp_names: | ||
| cp_hist_list = list(journal.copyright_holder_history.filter( | ||
| institution__institution_name__in=cp_names | ||
| )) | ||
| for cp_hist in cp_hist_list: | ||
| cp_hist.raw_institution_name = cp_hist.institution.institution_name | ||
|
|
||
| if cp_hist_list: | ||
| from journal.models import CopyrightHolderHistory | ||
| CopyrightHolderHistory.objects.bulk_update( | ||
| cp_hist_list, | ||
| ['raw_institution_name'] | ||
| ) | ||
|
|
||
| processed_count += 1 | ||
|
|
||
| except Exception as e: | ||
| error_count += 1 | ||
| exc_type, exc_value, exc_traceback = sys.exc_info() | ||
| UnexpectedEvent.create( | ||
| exception=e, | ||
| exc_traceback=exc_traceback, | ||
| detail={ | ||
| "task": "task_replace_institution_by_raw_institution", | ||
| "am_journal_id": am_journal.id, | ||
| "pid": am_journal.pid, | ||
| "collection": str(am_journal.collection) if am_journal.collection else None, | ||
| }, | ||
| ) | ||
| logger.error( | ||
| f"Error processing AMJournal {am_journal.id}: {e}" | ||
| ) | ||
|
|
||
| result = { | ||
| "processed_count": processed_count, | ||
| "error_count": error_count, | ||
| } | ||
|
|
||
| logger.info( | ||
| f"task_replace_institution_by_raw_institution completed: {result}" | ||
| ) | ||
|
|
||
| return result | ||
|
|
||
| except Exception as e: | ||
| exc_type, exc_value, exc_traceback = sys.exc_info() | ||
| UnexpectedEvent.create( | ||
| exception=e, | ||
| exc_traceback=exc_traceback, | ||
| detail={ | ||
| "task": "task_replace_institution_by_raw_institution", | ||
| "collection_acron_list": collection_acron_list, | ||
| "journal_issns": journal_issns, | ||
| "user_id": user_id, | ||
| "username": username, | ||
| }, | ||
| ) | ||
| raise | ||
|
|
There was a problem hiding this comment.
The new task task_replace_institution_by_raw_institution lacks test coverage. While there are comprehensive tests for other tasks in the codebase (such as load_license_of_use_in_journal and fetch_and_process_journal_logos_in_collection in journal/tests.py), this migration task is not tested. Given the complexity of the task and the data transformation it performs, tests should be added to verify: (1) correct filtering by collection_acron_list and journal_issns, (2) proper data extraction and mapping from AMJournal records, (3) error handling scenarios, and (4) the bulk_update operations work as expected.
| publisher_country = extract_value(data.get("publisher_country")) | ||
| publisher_state = extract_value(data.get("publisher_state")) | ||
| publisher_city = extract_value(data.get("publisher_city")) | ||
|
|
||
| # Extract sponsor data | ||
| sponsor = extract_value(data.get("sponsors")) | ||
|
|
||
| # Extract copyright holder data | ||
| copyright_holder = extract_value(data.get("copyrighter")) | ||
|
|
||
| # Update PublisherHistory and OwnerHistory records | ||
| if publisher: | ||
| if isinstance(publisher, str): | ||
| publisher = [publisher] | ||
|
|
||
| # Filter non-empty publisher names | ||
| publisher_names = [p for p in publisher if p] | ||
|
|
||
| if publisher_names: | ||
| # Update PublisherHistory records - use __in for single query | ||
| pub_hist_list = list(journal.publisher_history.filter( | ||
| institution__institution_name__in=publisher_names | ||
| )) | ||
| # Create a mapping of institution names to publisher data | ||
| for pub_hist in pub_hist_list: | ||
| pub_hist.raw_institution_name = pub_hist.institution.institution_name | ||
| pub_hist.raw_country_name = publisher_country | ||
| pub_hist.raw_state_name = publisher_state | ||
| pub_hist.raw_city_name = publisher_city |
There was a problem hiding this comment.
The extract_value() function can return either a single value or a list depending on the input (as seen in journal/sources/am_data_extraction.py lines 47-51). However, at lines 439-441, the code extracts publisher_country, publisher_state, and publisher_city using extract_value() without checking if the result is a list. If any of these values are lists, the subsequent code at lines 465-467 and 483-485 will store the entire list as the raw field value instead of a string, which may not be the intended behavior. Consider handling the case where extract_value() returns a list, for example by taking the first element or joining the values.
| pub_hist.raw_institution_name = pub_hist.institution.institution_name | ||
| pub_hist.raw_country_name = publisher_country | ||
| pub_hist.raw_state_name = publisher_state | ||
| pub_hist.raw_city_name = publisher_city | ||
|
|
||
| if pub_hist_list: | ||
| from journal.models import PublisherHistory | ||
| PublisherHistory.objects.bulk_update( | ||
| pub_hist_list, | ||
| ['raw_institution_name', 'raw_country_name', | ||
| 'raw_state_name', 'raw_city_name'] | ||
| ) | ||
|
|
||
| # Update OwnerHistory records - use __in for single query | ||
| own_hist_list = list(journal.owner_history.filter( | ||
| institution__institution_name__in=publisher_names | ||
| )) | ||
| for own_hist in own_hist_list: | ||
| own_hist.raw_institution_name = own_hist.institution.institution_name | ||
| own_hist.raw_country_name = publisher_country | ||
| own_hist.raw_state_name = publisher_state | ||
| own_hist.raw_city_name = publisher_city | ||
|
|
||
| if own_hist_list: | ||
| from journal.models import OwnerHistory | ||
| OwnerHistory.objects.bulk_update( | ||
| own_hist_list, | ||
| ['raw_institution_name', 'raw_country_name', | ||
| 'raw_state_name', 'raw_city_name'] | ||
| ) | ||
|
|
||
| # Update SponsorHistory records | ||
| if sponsor: | ||
| if isinstance(sponsor, str): | ||
| sponsor = [sponsor] | ||
|
|
||
| # Filter non-empty sponsor names | ||
| sponsor_names = [s for s in sponsor if s] | ||
|
|
||
| if sponsor_names: | ||
| spon_hist_list = list(journal.sponsor_history.filter( | ||
| institution__institution_name__in=sponsor_names | ||
| )) | ||
| for spon_hist in spon_hist_list: | ||
| spon_hist.raw_institution_name = spon_hist.institution.institution_name | ||
|
|
||
| if spon_hist_list: | ||
| from journal.models import SponsorHistory | ||
| SponsorHistory.objects.bulk_update( | ||
| spon_hist_list, | ||
| ['raw_institution_name'] | ||
| ) | ||
|
|
||
| # Update CopyrightHolderHistory records | ||
| if copyright_holder: | ||
| if isinstance(copyright_holder, str): | ||
| copyright_holder = [copyright_holder] | ||
|
|
||
| # Filter non-empty copyright holder names | ||
| cp_names = [cp for cp in copyright_holder if cp] | ||
|
|
||
| if cp_names: | ||
| cp_hist_list = list(journal.copyright_holder_history.filter( | ||
| institution__institution_name__in=cp_names | ||
| )) | ||
| for cp_hist in cp_hist_list: | ||
| cp_hist.raw_institution_name = cp_hist.institution.institution_name |
There was a problem hiding this comment.
The code at line 464 accesses pub_hist.institution.institution_name without checking if pub_hist.institution is None. If a PublisherHistory record has no associated institution (which is allowed since the field is nullable as seen in journal/models.py line 1402-1407), this will raise an AttributeError. The same issue exists at lines 482, 508, and 530 for OwnerHistory, SponsorHistory, and CopyrightHolderHistory respectively. Consider adding a null check before accessing the institution_name, or filter out records where institution is None in the initial query.
| pub_hist_list = list(journal.publisher_history.filter( | ||
| institution__institution_name__in=publisher_names | ||
| )) | ||
| # Create a mapping of institution names to publisher data | ||
| for pub_hist in pub_hist_list: | ||
| pub_hist.raw_institution_name = pub_hist.institution.institution_name | ||
| pub_hist.raw_country_name = publisher_country | ||
| pub_hist.raw_state_name = publisher_state | ||
| pub_hist.raw_city_name = publisher_city | ||
|
|
||
| if pub_hist_list: | ||
| from journal.models import PublisherHistory | ||
| PublisherHistory.objects.bulk_update( | ||
| pub_hist_list, | ||
| ['raw_institution_name', 'raw_country_name', | ||
| 'raw_state_name', 'raw_city_name'] | ||
| ) | ||
|
|
||
| # Update OwnerHistory records - use __in for single query | ||
| own_hist_list = list(journal.owner_history.filter( | ||
| institution__institution_name__in=publisher_names | ||
| )) | ||
| for own_hist in own_hist_list: | ||
| own_hist.raw_institution_name = own_hist.institution.institution_name | ||
| own_hist.raw_country_name = publisher_country | ||
| own_hist.raw_state_name = publisher_state | ||
| own_hist.raw_city_name = publisher_city | ||
|
|
||
| if own_hist_list: | ||
| from journal.models import OwnerHistory | ||
| OwnerHistory.objects.bulk_update( | ||
| own_hist_list, | ||
| ['raw_institution_name', 'raw_country_name', | ||
| 'raw_state_name', 'raw_city_name'] | ||
| ) | ||
|
|
||
| # Update SponsorHistory records | ||
| if sponsor: | ||
| if isinstance(sponsor, str): | ||
| sponsor = [sponsor] | ||
|
|
||
| # Filter non-empty sponsor names | ||
| sponsor_names = [s for s in sponsor if s] | ||
|
|
||
| if sponsor_names: | ||
| spon_hist_list = list(journal.sponsor_history.filter( | ||
| institution__institution_name__in=sponsor_names | ||
| )) | ||
| for spon_hist in spon_hist_list: | ||
| spon_hist.raw_institution_name = spon_hist.institution.institution_name | ||
|
|
||
| if spon_hist_list: | ||
| from journal.models import SponsorHistory | ||
| SponsorHistory.objects.bulk_update( | ||
| spon_hist_list, | ||
| ['raw_institution_name'] | ||
| ) | ||
|
|
||
| # Update CopyrightHolderHistory records | ||
| if copyright_holder: | ||
| if isinstance(copyright_holder, str): | ||
| copyright_holder = [copyright_holder] | ||
|
|
||
| # Filter non-empty copyright holder names | ||
| cp_names = [cp for cp in copyright_holder if cp] | ||
|
|
||
| if cp_names: | ||
| cp_hist_list = list(journal.copyright_holder_history.filter( | ||
| institution__institution_name__in=cp_names | ||
| )) |
There was a problem hiding this comment.
The task filters history records by matching institution names against the extracted publisher/sponsor/copyright holder names from AMJournal data (lines 459-461, 478-480, 504-506, 526-528). However, this matching assumes that the institution_name stored in the Institution model exactly matches the raw string extracted from AMJournal. In practice, these values may have been normalized or transformed during the initial import, leading to no matches and the task silently skipping records that should be updated. Consider logging when no matches are found to help diagnose data inconsistencies, or add a fallback strategy to handle cases where the names don't match exactly.
| pub_hist_list = list(journal.publisher_history.filter( | ||
| institution__institution_name__in=publisher_names | ||
| )) | ||
| # Create a mapping of institution names to publisher data |
There was a problem hiding this comment.
The comment at line 462 states "Create a mapping of institution names to publisher data", but the subsequent code does not create any mapping - it directly iterates over the history records and updates their fields. The comment is misleading and should be updated to accurately describe what the code does, such as "Update raw fields for each publisher history record".
| # Create a mapping of institution names to publisher data | |
| # Update raw fields for each publisher history record |
|
|
||
|
|
||
| class OwnerHistory(Orderable, ClusterableModel, BaseHistoryItem): | ||
| class OwnerHistory(Orderable, ClusterableModel, BaseHistoryItem, RawOrganizationMixin): |
There was a problem hiding this comment.
Base classes have conflicting values for attribute 'Meta': class Meta and class Meta.
|
|
||
|
|
||
| class PublisherHistory(Orderable, ClusterableModel, BaseHistoryItem): | ||
| class PublisherHistory(Orderable, ClusterableModel, BaseHistoryItem, RawOrganizationMixin): |
There was a problem hiding this comment.
Base classes have conflicting values for attribute 'Meta': class Meta and class Meta.
| class PublisherHistory(Orderable, ClusterableModel, BaseHistoryItem, RawOrganizationMixin): | |
| class PublisherHistory(Orderable, ClusterableModel, BaseHistoryItem, RawOrganizationMixin): | |
| class Meta(BaseHistoryItem.Meta): | |
| pass |
|
|
||
|
|
||
| class SponsorHistory(Orderable, ClusterableModel, BaseHistoryItem): | ||
| class SponsorHistory(Orderable, ClusterableModel, BaseHistoryItem, RawOrganizationMixin): |
There was a problem hiding this comment.
Base classes have conflicting values for attribute 'Meta': class Meta and class Meta.
|
|
||
|
|
||
| class CopyrightHolderHistory(Orderable, ClusterableModel, BaseHistoryItem): | ||
| class CopyrightHolderHistory(Orderable, ClusterableModel, BaseHistoryItem, RawOrganizationMixin): |
There was a problem hiding this comment.
Base classes have conflicting values for attribute 'Meta': class Meta and class Meta.
O que esse PR faz?
Adiciona
RawOrganizationMixinaos modelos de histórico (PublisherHistory, OwnerHistory, CopyrightHolderHistory, SponsorHistory) para armazenar dados de organização não-estruturados, preparando a depreciação deinstitution.models.Institution.Principais mudanças:
RawOrganizationMixinemcore.modelscom 7 campos CharField:raw_text,raw_institution_name,raw_country_name,raw_country_code,raw_state_name,raw_state_acron,raw_city_nameadd_publisher/owner/sponsor/copyright_holderaceitam parâmetros raw opcionaistask_replace_institution_by_raw_institution: popula campos raw de dados AMJournal existentes, otimizado combulk_update()e queries__inExemplo de uso:
Backward compatible: parâmetros raw são opcionais (default=None).
Onde a revisão poderia começar?
core/models.py- definição do RawOrganizationMixin (linhas 280-337)journal/models.py- herança e métodos_add_institution_history,add_*(linhas 1069-1268)journal/tasks.py- task de migração com otimizações (linhas 456-619)Como este poderia ser testado manualmente?
Algum cenário de contexto que queira dar?
A modelagem atual depende de
institution.models.Institutionque será deprecated. Os campos raw permitem armazenar dados não-estruturados diretamente nos registros de histórico, eliminando a dependência.Performance: Task otimizado para processar grandes volumes:
__inreduzem múltiplas queries para 1 por tipo de históricobulk_update()atualiza registros em batchImpacto em disco: Adiciona ~2KB por registro de histórico (7 campos × ~255 chars).
Quais são tickets relevantes?
Issue #[será linkado automaticamente pelo sistema]
Referências
Original prompt
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.