Fix moving large amounts of content/media #7907
Conversation
…th the perf improvements that were done to media.
@Shazwazza I'm afraid I'm running into the same corruption issue. I get 2 NuCache errors when copying a node with its descendants and trying to publish the node after the copy (without descendants).
The other one is:
But I'm unable to reproduce that one.
@stefankip I'm a little unsure what you're reporting here. Are you reporting that you're still having issues, or are you reporting that you're seeing the same (initial) issue as others were seeing?
I think I'm having the same issue, but I don't know how to tell. That's what I'm trying to achieve here; is there a way for me to check and maybe even fix the issue?
Same issue here. While we wait for the PR to be processed and released, how exactly can we identify these "broken" items in the database?
I'm still finalizing this work. Once that is done, the best advice for your current installation will be to copy the code required for the new health check and re-create that health check in your installation. As there are quite a lot of code changes here, the fix would need to wait for a minor release. Stay tuned, I'll post more info soon.
@stefankip the issue I'm resolving is specifically about Moving nodes, and only move operations that involve more than 500 nodes (i.e. when there are more than 500 descendants of the node being moved). I haven't looked at the copy logic, and the null reference you've pointed out isn't something I've looked at either. If the copy logic has a similar issue, it would also only be affected by large amounts of descendants (more than 500). Is this the case for you? Or have you perhaps previously Moved large amounts of data?
Hi all, I have created a public gist with an SQL script that will report and fix path issues. This is a workaround you can use for now if you have this issue. Back up your DB before you execute this. Use this script at your own risk; I have not tested it on production sites. As the notes say, you can do a dry run without changing data. I'd suggest you do that and then manually verify any nodes in the error report to validate that the data is inconsistent. Works on my machine :) https://gist.github.com/Shazwazza/1a04dcd1c1b6f16d3b7f167874770a84
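For anyone who wants to understand what the script checks before running it: the invariant is that every node's path must equal its parent's path plus its own id, and its level must match the number of path segments after the -1 root marker. Here is a small Python sketch of that check (a hypothetical helper with made-up ids, not the SQL from the gist):

```python
def find_path_issues(nodes):
    """Report nodes whose path/level disagree with their parent,
    mirroring the invariant the SQL script checks:
      path  == parent.path + "," + str(id)
      level == number of ids in the path after the -1 root marker
    `nodes` maps id -> {"parentId": ..., "path": ..., "level": ...}.
    """
    issues = []
    for node_id, node in nodes.items():
        parent = nodes.get(node["parentId"])
        expected_path = (
            f"-1,{node_id}" if parent is None
            else f"{parent['path']},{node_id}"
        )
        expected_level = expected_path.count(",")  # segments after -1
        if node["path"] != expected_path or node["level"] != expected_level:
            issues.append((node_id, expected_path, expected_level))
    return issues

nodes = {
    1055: {"parentId": -1,   "path": "-1,1055",      "level": 1},
    1056: {"parentId": 1055, "path": "-1,1055,1056", "level": 2},
    # corrupt: re-parented under 1056 but path/level never updated
    1060: {"parentId": 1056, "path": "-1,1060",      "level": 1},
}
print(find_path_issues(nodes))  # [(1060, '-1,1055,1056,1060', 3)]
```

The "fix" half of the script is essentially writing the expected path/level back for each reported row, which is why a dry run plus manual verification is the safe first step.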
Hi @Shazwazza, TL;DR: I think there's a way of corrupting items when a moved item is transferred via uSync. Back to business: I ran your script for content nodes on a copy of the database and found 106 corrupt items. Just the report kind, no actual fixing yet, but this confirms my theory that this issue applies to my situation. Just FYI, I hope it helps if I explain my theory, to shed some light on the corruption after moving an item. I suspect the following is the case (so something other than moving 500+ items):
I noticed the items are where they should be in the target content tree, but when re-exported to uSync files, they still display level 4 instead of what should be level 5. Manually fixing the level in the uSync file and re-importing via uSync does not alleviate the situation. I was first alerted to a problem when selecting an item at level 5 in the content tree: after loading the item, the tree collapses back to level 4 (it should remain open and highlight the selected item). For one item I identified this way, I could trigger a fix by moving it to a different parent and then back to where it needs to be. I will try your fix on the database copy and point Umbraco temporarily at that copy to test the validity of your fix. Will update when I know more.
I was hoping corrupted data was the cause of the errors I was having, maybe due to large copy actions from the past, but the SQL script didn't report any issues, so I'll have to look further for some other cause. |
@stefankip it does seem strange that you have a null ref in NuCache though; be sure you've done a database rebuild and then a memory rebuild in your Published Caches dashboard. I had a look at one of the null refs you have, and in the code there doesn't seem to be a way that it can be null. I've added a new null check and notes there though: https://github.com/umbraco/Umbraco-CMS/pull/7907/files#diff-1983717019d1fd1da9fa80f10ab00b38R885
Yeah, I'm running v8.6.0 and have also tried all the buttons inside the NuCache dashboard for rebuilding and cleaning.
@stefankip if you can replicate the null reference problem now with your current DB, please take a backup and, if you wouldn't mind, send it to shannon@umbraco.com and I can figure out what is going on.
@Shazwazza that's a really nice gesture, thanks! I don't mind sending the DB. Do you only need the DB or also the solution? |
Hi @Shazwazza, after this I can rebuild the database and memory cache without getting the aforementioned error.
@stefankip much appreciated, thanks for sharing your solution. I have managed to reproduce it with your solution and DB, though strangely I could not reproduce the problem with the Umbraco source and your DB 🤔 which is super weird. I copied the Umbraco build DLLs into your solution and the problem would still occur... that's a good thing, since I was able to figure out the problem. Phew! I don't quite understand yet how this could not be replicated on a normal site. I'm going to set up a unit test for this specific thing. The underlying problem is when we do …
@Shazwazza nice, much appreciated! Looking forward to the PR!
@stefankip here's the PR #7961
Only small comments…
The code changes look fine, my tests don't find anything broken, and I could fix the manually corrupted data using the health check.
But I did not get any errors logged when I rebuilt the cache, and the published cache status was always:
Database cache is ok. ...
src/Umbraco.Core/Persistence/Repositories/Implement/ContentRepositoryBase.cs
src/Umbraco.Core/Persistence/Repositories/Implement/MediaRepository.cs
private HealthCheckStatus CheckMedia(bool fix)
{
    return CheckPaths(_fixMediaPaths, _fixMediaPathsTitle, Core.Constants.UdiEntityType.Media, fix,
        () => _mediaService.CheckDataIntegrity(new ContentDataIntegrityReportOptions { FixIssues = fix }));
}
Is this safe without locking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just reading data, so there's no need for a concurrent lock; there's an SQL lock anyway.
private HealthCheckStatus CheckDocuments(bool fix)
{
    return CheckPaths(_fixContentPaths, _fixContentPathsTitle, Core.Constants.UdiEntityType.Document, fix,
        () => _contentService.CheckDataIntegrity(new ContentDataIntegrityReportOptions { FixIssues = fix }));
}
Is this safe without locking?
There is an SQL write lock during the reading and fixing.
@bergmania can you clarify what you mean here:
So are you saying that you did manually change your database to have corrupted paths and when using the health check it did report problems that it fixed? Or are you saying no errors were reported at all even though you messed up the data in your database?
Are you saying that when you messed up the data in your database, and before you fixed it via the health check, you didn't get errors when doing this? If so, that's because we log warnings instead of crashing, so you might just need to check your log for warnings in this case for testing.
@bergmania that looks like you fixed the data issues before rebuilding the cache, so by then there were no problems left to report. If you mess up your DB first, then try rebuilding the cache, then check for warnings, then use the health check and try that again.
@bergmania right, so to get the warnings logged, you need to mess the data up pretty good :P The above examples of changing the data do corrupt it, but NuCache will still load it because it can still sort of determine its order. To get the warnings logged, you want to take an item that is at [level] = 3, change its level to 2, and then remove the middle part of its path. This will cause a warning in your log when you rebuild the caches; then, when you run the health check fix and rebuild the caches, you won't see the warning.
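That corruption recipe can be sketched like this (illustrative Python with made-up ids, just to show what the damaged row looks like and why the parent lookup fails):

```python
# A healthy level-3 node: path is -1 -> 1055 -> 1056 -> 1060
node = {"id": 1060, "path": "-1,1055,1056,1060", "level": 3}

# Corrupt it as described: drop the level to 2 and cut out the middle
# path segment, so the recorded parent chain no longer matches the tree.
segments = node["path"].split(",")
del segments[2]                      # remove "1056", the middle part
node["path"] = ",".join(segments)    # now "-1,1055,1060"
node["level"] = 2

# NuCache derives the parent from the path; with the real tree the
# parent of 1060 is 1056, which the recorded path no longer contains.
real_parent_id = 1056
recorded_parent_id = int(node["path"].split(",")[-2])
print(recorded_parent_id, recorded_parent_id == real_parent_id)  # 1055 False
```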
All good then, we can merge when the conflicts are resolved.
# Conflicts:
#	src/Umbraco.Web/PublishedCache/NuCache/ContentStore.cs
sweet, all resolved, will merge 🎉
@Shazwazza Thanks a lot for this fix and especially the SQL script, this has helped us fix a data corruption issue as well. I did notice one thing when I applied it: whenever the move action was to the recycle bin, the 'trashed' flag should be set on all nodes (at least, that is what I see for my 'normal' nodes). However, this may also have failed in this process. The current fix does not seem to check for and correct this as well. Is that something that needs to be added, or am I making the wrong assumptions? These nodes that are descendants of the recycle bin but do not have the trashed flag set give a warning in the logs when the cache is being rebuilt, because their parent was not found.
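The missing check rbottema describes could look something like this sketch (illustrative Python; the recycle-bin ids -20/-21 and the helper name are assumptions for illustration, not code from this PR):

```python
RECYCLE_BIN_IDS = {-20, -21}  # content and media recycle bins (assumed ids)

def find_untrashed_in_bin(nodes):
    """Report nodes whose path passes through a recycle bin but whose
    trashed flag was never set -- the inconsistency described above.
    `nodes` maps id -> {"path": ..., "trashed": ...}.
    """
    issues = []
    for node_id, node in nodes.items():
        segments = [int(s) for s in node["path"].split(",")]
        in_bin = any(s in RECYCLE_BIN_IDS for s in segments[:-1])
        if in_bin and not node["trashed"]:
            issues.append(node_id)
    return issues

nodes = {
    1070: {"path": "-1,-20,1070",      "trashed": True},
    1071: {"path": "-1,-20,1070,1071", "trashed": False},  # corrupt
}
print(find_untrashed_in_bin(nodes))  # [1071]
```

A fix pass would simply set the flag on the reported rows, analogous to how the path/level fixup rewrites the expected values.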
@rbottema It could/should probably be another check as part of the fixup, any chance you can create a PR for that? |
Was this new health check released? I don't see it in 8.6.2... thx
This is tagged for 8.7.0 |
Turns out, after investigating issue #7868 (which was originally reported as a NuCache issue), that it's actually a data issue.
NuCache did explode because the data in the umbracoNode table was corrupt, but how did that data get like that? Luckily, a couple of community members were able to provide me database snapshots from before and after they performed some move operations, which seemed to be the trigger for the errors. Even luckier, I was able to replicate the problem locally.
So it turns out we have a bug in the moving logic for both media and content, and this same logic is used to delete items (move to the recycle bin). The bug only rears its head if there's a lot of content being moved in one operation (i.e. if there are a lot of descendant items).

The problem is that a bunch of nodes don't end up getting moved at all. During the move operation we page over the descendants of a certain path, and we do a normal paging operation, checking while(currentPage x pageSize < totalItems) until paging is complete. That is wrong in this case, because for each page we move the returned descendants to a different path, which means that on the next page there are fewer total items. So the loop exits far too early, before all descendants are moved. This only happens for amounts larger than 500 (our page size).

This leaves a bunch of descendant nodes that should have had their paths/levels updated without those updates, which means they are now inconsistent with the actual tree data structure. This ends up exploding NuCache, which is then unable to traverse or find these items correctly.
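The broken paging can be sketched in Python (an illustrative model, not the actual C# repository code; names like move_broken/move_fixed are made up):

```python
PAGE_SIZE = 2  # tiny page size for illustration; Umbraco's is 500

def get_descendants(nodes, old_path):
    """Simulates the repository query: nodes still under the old path."""
    return [n for n in nodes if n["path"].startswith(old_path + ",")]

def move_broken(nodes, old_path, new_path):
    """The buggy loop: while (currentPage x pageSize < totalItems).
    Each moved page shrinks totalItems on the next query, so the
    loop exits before every descendant has been moved."""
    page = 0
    while True:
        remaining = get_descendants(nodes, old_path)
        if page * PAGE_SIZE >= len(remaining):  # totalItems keeps shrinking
            break
        for n in remaining[:PAGE_SIZE]:
            n["path"] = n["path"].replace(old_path, new_path, 1)
        page += 1

def move_fixed(nodes, old_path, new_path):
    """The fix in spirit: keep taking the first page of whatever still
    matches the old path until nothing is left."""
    while remaining := get_descendants(nodes, old_path):
        for n in remaining[:PAGE_SIZE]:
            n["path"] = n["path"].replace(old_path, new_path, 1)

def make_tree(count):
    # `count` direct descendants of node 100
    return [{"id": i, "path": f"-1,100,{i}"} for i in range(1, count + 1)]

tree = make_tree(10)
move_broken(tree, "-1,100", "-1,200")
print(len(get_descendants(tree, "-1,100")))  # 4 -- never moved

tree = make_tree(10)
move_fixed(tree, "-1,100", "-1,200")
print(len(get_descendants(tree, "-1,100")))  # 0 -- all moved
```

In this model the broken loop strands roughly half of a large branch under the old path, which is consistent with the corrupted paths/levels people reported above.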
After researching #7756 it also means that if there is corrupted data you will get YSOD errors when modifying document types.
Performance
The move logic incurred an extraordinary number of CPU cycles and SQL calls: moving a branch with 2700 descendants took more than 5 minutes on my computer. I've fixed this; it now takes 20 seconds.
That is a 96% improvement.
New Feature
I've created a new health check and health check category for checking data integrity. This will validate that paths/levels are correct for content/media in the umbracoNode table, and it has a button to try to fix any issues.
Changes
Updated the PersistUpdatedItem method to check whether the item being updated is explicitly just being moved; we can detect that by the properties that have changed. In that case, if it's a descendant being moved, we don't have to do the majority of checks and updates that the repository normally does. We do not need to create new versions, change tags, relations, publish states, etc. We just need to update the umbracoNode table and raise events.
Testing notes
You can test this all out by manually messing your database up. Here's how:
Execute this SQL:
and for rows that are not at the root, you can muck with the data, for example, the below changes will cause 4 different types of path problems:
If you do that, then restart your website with this PR applied, go to the Published Status dashboard, and rebuild the database cache and then the memory cache.
You will not get any YSODs, but you will see some warnings logged because the cache structure cannot be created from corrupt data (this also depends on how much data and nested data you have).
Then go to the new health check and execute it; you will see that it fixes these problems and produces a report.
Then go back and execute the steps on the Published Status dashboard and you will no longer get warnings in your log.