
Batch aggregation #4303
Merged
zwolf merged 34 commits into master from batch-aggregation on Jun 13, 2024

Conversation

@zwolf (Member) commented Mar 6, 2024

Adds the necessary pieces to implement batch aggregation. In total, they should include:

  • Aggregation table migration
  • Client to connect to aggregation service
  • Refactors of: model, factory, controller, schemas, policy, serializer
  • Specs of all the above

This is ready for review.

Review checklist

  • First, the most important one: is this PR small enough that you can actually review it? Feel free to just reject a branch if the changes are hard to review due to the length of the diff.
  • If there are any migrations, will the previous version of the app work correctly after they've been run (e.g. they don't remove columns still known about by ActiveRecord)?
  • If anything changed with regards to the public API, are those changes also documented in the apiary.apib file?
  • Are all the changes covered by tests? Think about any possible edge cases that might be left untested.

zwolf marked this pull request as draft on March 6, 2024 22:18
zwolf marked this pull request as ready for review on March 28, 2024 15:01
zwolf changed the title from "[WIP] Batch aggregation" to "Batch aggregation" on Mar 28, 2024
@yuenmichelle1 (Collaborator) left a comment


Looks fine. I had a couple of non-blocking comments and questions.
Since this feature has pretty much been siloed to just you, Cliff (and maybe Haley and Coleman?), I think it's worth adding more docs or comments on the PR so any one of us can go back and save some time getting context.

Some things:

  • I think it's worth documenting the flow of this to aggregations and back.
  • Semi-related to ^: I'm currently not entirely sure what the difference is between a task id vs. the uuid vs. the aggregation's id. Maybe adding a description to the schema or more description on the PR would be helpful?
  • Other thing(s) I am wondering about: from what I understand of the new aggregations model, it is 1 aggregation per user per workflow.
    - What happens if a user wants to re-run an aggregation for a workflow? (Does an aggregation record on Panoptes get "cleaned" up after a certain point? Do we manually have to remove the aggregation record from the db so that a re-run can be done? Is it even a thing to "re-run"?)

Resolved review threads:
  • app/models/aggregation.rb
  • spec/policies/aggregation_policy_spec.rb (3 threads)
  • app/schemas/aggregation_create_schema.rb
@zwolf (Member, Author) commented May 28, 2024

I had some trouble with Doorkeeper scopes when I was testing this API outside of the specs (not Pundit scopes, but Doorkeeper's own). Nothing else in the app scopes solely by workflow, and just adding to the optional scopes wasn't enough. Rather than keep fighting it, I switched the base scope for the Aggregation resource back to Project and added that column to the table. That's all it's used for so far, but the relation is logical and Doorkeeper is happy.
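For context, a minimal sketch of what that change might look like, assuming a standard Rails reference migration and belongs_to association (class names, the Rails migration version, and option details are illustrative, not necessarily the exact code in this PR):

# Hypothetical migration adding the project reference used for base scoping
class AddProjectToAggregations < ActiveRecord::Migration[6.1]
  def change
    add_reference :aggregations, :project, index: true
  end
end

# Hypothetical model associations backing the Project-based scope
class Aggregation < ActiveRecord::Base
  belongs_to :project
  belongs_to :workflow
  belongs_to :user
end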

I think it's worth documenting the flow of this to aggregations and back.

Added a page to the docs for the new model. @lcjohnso I think you wanted a ping to have a look.

Semi-related to ^: I'm currently not entirely sure what the difference is between a task id vs. the uuid vs. the aggregation's id. Maybe adding a description to the schema or more description on the PR would be helpful?

Included this info in the new docs. The task_id comes from Celery via the aggregation service and represents the running background task. This id can be used to check on the status of a running job via the aggregation API and to see whether it has completed; the front end could use it to check the status of pending, long-running jobs. The UUID is generated by the aggregation run and used in filenames and folders on blob storage; it will be used to construct the URL from which the files are downloaded directly.
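To make those fields concrete, a hypothetical aggregation payload might look something like the snippet below (values are illustrative; only the task_id, uuid, and status fields are taken from this PR's discussion and specs):

# Hypothetical example of a single aggregation resource
aggregation = {
  id: '101',                                       # Panoptes database id of the aggregation record (illustrative)
  task_id: '50e9ceb4-3b5b-4dcd-8224-1bd50de0b2a4', # Celery background task id, used to poll run status (illustrative value)
  uuid: '557ebcfa3c29496787336bfbd7c4d856',        # generated by the aggregation run; used in blob storage paths/URLs
  status: 'completed'                              # e.g. pending / completed / failed
}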

Other thing(s) I am wondering about: from what I understand of the new aggregations model, it is 1 aggregation per user per workflow. What happens if a user wants to re-run an aggregation for a workflow? (Does an aggregation record on Panoptes get "cleaned" up after a certain point? Do we manually have to remove the aggregation record from the db so that a re-run can be done? Is it even a thing to "re-run"?)

Yes, aggregations are unique per user & workflow. And true, this was missing a way to re-run the aggregation; what makes the most sense? The FE checks for an existing aggregation resource via GET api/aggregations?workflow_id=[wf_id]&user_id=[user_id]. I have now implemented DELETE on the controller and added a couple of specs to more clearly indicate the intended behavior. And though I was trying to avoid making the FE worry about it, the standard approach might be easiest: the FE modal checks for an existing agg, and if it's completed/failed, a click of "rerun" DELETEs by id, then POSTs to create a new one, triggering a new aggregation run.
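Spelled out as a request sequence, that rerun flow would look roughly like this (a sketch only; the exact routes and payloads should be confirmed against the controller and apiary.apib):

# 1. FE checks for an existing aggregation for this user/workflow
GET /api/aggregations?workflow_id=[wf_id]&user_id=[user_id]

# 2. If one exists and its status is completed/failed, remove it by id
DELETE /api/aggregations/[aggregation_id]

# 3. Create a fresh aggregation, which kicks off a new run
POST /api/aggregations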

Talk actually does clean up data requests after a day and blocks early rerun attempts, but I don't love that either. For now, I think this functionality is enough to work with, but maybe I could add some kind of shortcut (via a route, a POST body short circuit, or something else?) to reduce the required FE calls. Thoughts?

end
end
context 'when there is an existing aggregation for that workflow' do
let!(:existing_agg) { create(:aggregation, workflow: workflow) }
RSpec/LetSetup: Do not use let! to setup objects not referenced in tests.
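One common way to satisfy this cop, assuming the record is only needed as database state and never referenced by name in an example, is to create it in a before block instead of let! (sketch only):

# Instead of: let!(:existing_agg) { create(:aggregation, workflow: workflow) }
before do
  # Persist the record as test setup without defining an unused helper
  create(:aggregation, workflow: workflow)
end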

describe '#index' do
let(:other_workflow) { create(:workflow) }
let!(:aggregations) { create(:aggregation, workflow: workflow) }
let!(:private_resource) { create(:aggregation, workflow: other_workflow) }
RSpec/LetSetup: Do not use let! to setup objects not referenced in tests.


describe '#index' do
let(:other_workflow) { create(:workflow) }
let!(:aggregations) { create(:aggregation, workflow: workflow) }
RSpec/LetSetup: Do not use let! to setup objects not referenced in tests.

@lcjohnso (Member) commented Jun 5, 2024

Docs look good -- thanks!

One question regarding the existing aggregations table: I see you're removing the main aggregation data column in the migration, but will the >1 million rows stay in place? Or do those need to be cleaned up too? Is that happening elsewhere, or in this or a similar migration?

@zwolf (Member, Author) commented Jun 5, 2024

I see you're removing the main aggregation data column in the migration, but will the >1 million rows stay in place? Or do those need to be cleaned up too? Is that happening elsewhere, or in this or a similar migration?

The plan looks like this:

  • This PR merges and cleans up the references to the aggregations attribute on the workflows table.
  • A new PR runs a migration that actually deletes it (see the sketch at the end of this comment).
  • The data in the aggregations table is actually deleted. This could happen in a rake task but will likely just be me running a truncate on Postgres.
  • This PR merges and modifies the aggregations table along with everything else.

I added the extra tasks to the board and plan to get them done this week.
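A minimal sketch of what that follow-up migration might look like, assuming it drops an aggregation column from the workflows table and that the column is named aggregation (the actual column name, type, and Rails migration version may differ):

# Hypothetical follow-up migration dropping the legacy column from workflows
class RemoveAggregationFromWorkflows < ActiveRecord::Migration[6.1]
  def change
    # Passing the column type (e.g. :jsonb) would keep this reversible;
    # the real type should be taken from the current schema.
    remove_column :workflows, :aggregation
  end
end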


# THIS FUNCTIONALITY IS BEING DEPRECATED EFFECTIVE IMMEDIATELY
# A REPLACEMENT IS FORTHCOMING.
class Api::V1::AggregationsController < Api::ApiController
Style/ClassAndModuleChildren: Use nested module/class definitions instead of compact style.
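For reference, the nested style this cop asks for looks like the following (illustrative only, not the code as it exists in the repo):

# Compact style (flagged): class Api::V1::AggregationsController < Api::ApiController
# Nested style (preferred by the cop):
module Api
  module V1
    class AggregationsController < Api::ApiController
      # controller body unchanged
    end
  end
end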

uuid: '557ebcfa3c29496787336bfbd7c4d856',
status: 'completed'
}
}
Layout/MultilineHashBraceLayout: Closing hash brace must be on the same line as the last hash element when opening brace is on the same line as the first hash element.
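An illustrative (not verbatim) example of the layout this cop enforces:

# Offense: the opening brace shares a line with the first element,
# but the closing brace sits on its own line
response = { uuid: '557ebcfa3c29496787336bfbd7c4d856',
  status: 'completed'
}

# Corrected: closing brace on the same line as the last element
response = { uuid: '557ebcfa3c29496787336bfbd7c4d856',
             status: 'completed' }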

@zwolf (Member, Author) commented Jun 13, 2024

This is ready to merge. I truncated the aggregations table on staging in preparation:

irb(main):005:0> Aggregation.count
[primary]   (55.2ms)  SELECT COUNT(*) FROM "aggregations"
=> 1
irb(main):007:0> Aggregation.connection.truncate(Aggregation.table_name)
[primary]   (777.3ms)  TRUNCATE TABLE "aggregations"
=> #<PG::Result:0x00007fad5f56b968 status=PGRES_COMMAND_OK ntuples=0 nfields=0 cmd_tuples=0>
irb(main):008:0> Aggregation.count
[primary]   (65.2ms)  SELECT COUNT(*) FROM "aggregations"
=> 0

I plan on doing the same thing to production before this gets deployed. In the end, this feels like a data operation rather than a schema change, and not something that should exist as a migration or a repeatable rake task. Truncate is the fastest way to remove the data, which will be important on prod where there are ~1.2M rows (the newest from 2017).

zwolf merged commit 509c773 into master on Jun 13, 2024
8 checks passed
zwolf deleted the batch-aggregation branch on June 13, 2024 19:32