Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make full name updates on tree changes more efficient #3175

Merged
merged 9 commits into from Oct 3, 2023

Conversation

realVinayak
Copy link
Collaborator

@realVinayak realVinayak commented Mar 14, 2023

Fixes #388

Currently, the backend updates the full name for all the nodes in a tree table. However, we can restrict the full name update to just the nodes which are moved/inserted. I have added code that restricts the full name update to nodes that might be affected.

I have also modified the merge node function to modify full names after the nodes have been merged. Even though it will make stray full name updates (to nodes that were already in the target), it will do it only once. This is done because there is the additional overhead of updating the full name for every child. This will require additional testing and performance measurements. The change is trivial to remove it causes a problem.

@maxpatiiuk
Copy link
Member

Going forward, please use issue-388 instead of issue_388 for branch naming (so that all issue branches are sorted consistently in the test panel)

ALso, use a more descriptive pull request name than Issue 388

Copy link
Member

@maxpatiiuk maxpatiiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but like you mentioned, needs testing and performance testing

Also, does it make sense to add some tests for tree node numbering as part of this pull request?

@melton-jason
Copy link
Contributor

I did some simple overall testing of tree functions and could not break anything.
Of course, more in-depth and comprehensive testing should be done, but move, merge, synonymization are all operational and full names are correctly built

@realVinayak
Copy link
Collaborator Author

realVinayak commented Mar 17, 2023

Not sure if node numbering tests should be added here, because I'm working on a different method to renumber trees, so it is possible that I might have to refactor the tests. I've identified a few more changes that could be done with merge functionality. Should this branch be merged into production before node numbering fix @maxpatiiuk?

@maxpatiiuk
Copy link
Member

My recommendation:

Since this is not the final solution to the tree problems, and just a step on the way there, finish all the things we planned to do for trees. What things are planned to do depends on the performance gains for a given improvement vs the complexity/implementation time cost.

After you identified what needs to be implemented and implemented the changes in a production-ready way, you should mark the pull request as being ready for review. This way, the testers don't have to test your code several times after you make incremental changes to tree performance but can do very comprehensive testing after you are done with all changes.

As far as automated tests, yes, we need to have tests at the end of this process. Whether that means you would write tests as you go along, or write them all at the end depends on your preference. The drawback of writing them as you go like you mentioned is that you may end up having to rewrite tests, but the positive is that you can use tests for testing your code as you go along, instead of having to manually test it (i.e, read about test driven development and it's pros/cons).

Also, as a golden rule, try to write tests before testers start testing. This is because if tests have to be written anyway, writing them before testers start testing would prevent wasting their time (in case the tests would help discover the issue before testers do)

@grantfitzsimmons
Copy link
Contributor

Works well according to Edinburgh

Copy link
Contributor

@melton-jason melton-jason left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the code looks good!
I do have some questions strictly regarding 'semantics' and code quality.

There is nothing here to outright prevent the changes getting merged, but I would like some clarification on things before it is merged into production.

specifyweb/specify/tree_extras.py Outdated Show resolved Hide resolved
specifyweb/specify/tree_extras.py Outdated Show resolved Hide resolved
@grantfitzsimmons
Copy link
Contributor

grantfitzsimmons commented May 10, 2023

@maxpatiiuk Can we merge this into production for our next release?

@specifysoftware
Copy link

This pull request has been mentioned on Specify Community Forum. There might be relevant details there:

https://discourse.specifysoftware.org/t/geo-tree-limitations-in-specify-7-and-locality-name-problems/1008/5

@realVinayak
Copy link
Collaborator Author

not yet, I need to think about an edge case that comes with merging. Haven't gotten time yet.

@maxpatiiuk
Copy link
Member

maxpatiiuk commented May 10, 2023

@maxpatiiuk Can we merge this into production for our next release?

looks good. As soon as Vinny fixes the things that Jason found we can merge this

@realVinayak
Copy link
Collaborator Author

@grantfitzsimmons good for testing now.

@realVinayak
Copy link
Collaborator Author

realVinayak commented May 24, 2023

Important note: Since there is no index for node numbers, SQL will have to look at every row to find matching nodes. We should probably add an index to it. Even considering that, this is faster than previous code, because scanning all rows is cheaper than updating all rows.

There are other places on the backend where tree nodes are found using node numbers - they will also be faster.
@maxpatiiuk going back to our discussion about indexing common relationships, can we also discuss adding indexing here? Another thing to consider - update might become slower because indexing makes update more expensive.

@maxpatiiuk
Copy link
Member

If we index node numbers, the node numbers update will become slower, correct?
But taxon updates that don't change node numbers won't be affected?
And workbench taxon uploads will be affected because they will cause new indexes and node renumber?
Seems like a tradeoff. You could benchmark the impact on large tree move and large workbench upload VS time saved in query builder and stats

@realVinayak
Copy link
Collaborator Author

If we index node numbers, the node numbers update will become slower, correct? But taxon updates that don't change node numbers won't be affected?

Yes they will become slower. Actions other than desynonymize and synomize modify node numbers in someway. But also note that these exact actions also reset fullname for the entire table which we don't do now so depends on the total time saved.

And workbench taxon uploads will be affected because they will cause new indexes and node renumber?

Yes, at the end, renumbering will take more time. Each update statement in that function will take longer.

Seems like a tradeoff. You could benchmark the impact on large tree move and large workbench upload VS time saved in query builder and stats

Well, the stats and queries don't use node numbers right now. The only reason to add indexing for node numbers here is to help the "where nodenumber between number1 and number2 clause (and their variations) which backend uses when opening, closing, and moving intervals, and now selecting nodes to update for fullname. Without node indexing, MySQL currently locks entire table (not a good thing - that's one of the reasons why we occasionally see lock timeout errors when using trees) - but these locks are usually for a small amount of time because these actions usually go fast.

I'll set up some bench mark testing for these to get a comprehensive view

@grantfitzsimmons grantfitzsimmons added this to the 7.9.1 milestone Aug 28, 2023
@CarolineDenis CarolineDenis requested a review from a team August 28, 2023 18:32
Copy link
Contributor

@melton-jason melton-jason left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to address https://github.com/specify/specify7/pull/3175/files#r1180411187 before this is merged.

@melton-jason melton-jason requested a review from a team August 29, 2023 20:11
@specifysoftware
Copy link

This pull request has been mentioned on Specify Community Forum. There might be relevant details there:

https://discourse.specifysoftware.org/t/taxon-tree-manipulations-lock-database/1309/2

@grantfitzsimmons
Copy link
Contributor

Conclusion: Vinny will make one change today. Will merge after.

@realVinayak
Copy link
Collaborator Author

@bronwyncombs @specify/ux-testing
Testing:

  1. Try merging two any nodes together. production should take more time
  2. Try moving nodes from one parent to another. production should be slower.
  3. Make sure full names are correctly constructed. To do this, do the above actions in two different copies of the same database (an older version would work) but on the same nodes. Ideally, do this on species nodes. Make sure that after doing the above actions, the form says the same full name.
  4. Do a workbench upload of at least till species level. The production and this branch should take the same time. View the uploaded nodes in the form, the nodes should have the same full name.
  5. Make new species in the forms. Add two node with the same name, and production and this branch should have the same full name.

@realVinayak realVinayak merged commit b63632a into production Oct 3, 2023
9 checks passed
@realVinayak realVinayak deleted the issue_388 branch October 3, 2023 16:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
5 participants