Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diffwr option OOM kill #16

Closed
szolnokit opened this issue Apr 15, 2023 · 16 comments
Closed

diffwr option OOM kill #16

szolnokit opened this issue Apr 15, 2023 · 16 comments

Comments

@szolnokit
Copy link
Contributor

szolnokit commented Apr 15, 2023

Out of memory: Killed process 9737 (dcfldd-v1.9) total-vm:847432kB, anon-rss:844620kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1696kB oom_score_adj:0

Prepare:

swapoff -a  # Just for a faster OOM
fallocate -l 10G test.dd  #create bigger file than your RAM

The bug:
dcfldd diffwr=on if=/dev/zero of=test.dd # OOM kill

Default option (without diffwr) isn't affected.

The problem:
When destination blocks are same (not written), memory blocks remain allocated.

Suggestion:
Use my "full_write" function, instead code in released v1.9


At first, thank you for commit.
And sorry, because I have no time to test @davidpolverari suggestion.

Dear @davidpolverari
Thank you for suggestion, but your code is BAD :(. And unfortunately your suggested code became part of the version, instead of mine code. Your code eat all memory, and trigger OOM kill in a real life. My code maybe not beautiful as your, but I tested weeks, and used on a real life work before I published.

I only noticed that the v1.9 release contains memory leak bug, because I installed it from a distribution instead of mine version. When I tired my routine job with "diffwr" option, I get OOM kill after dcfldd eat all memory. And this is caused by your proposed code.

But I tested with valgrind against my version vs released v1.9. And the suspicion turned out to be true. With diffwr=on option almost all malloc-ed block remain allocated. But only in your suggested code, not in my code.

I see only this leak in your (=v1.9 released) version:

valgrind -s --track-origins=yes --leak-check=full --show-leak-kinds=all ./dcfldd-v1.9 bs=1k count=1k diffwr=on if=/dev/sda of=dst.dd
...
==9430== 1,048,576 bytes in 1,024 blocks are definitely lost in loss record 17 of 17
==9430==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==9430==    by 0x113F23: full_write (full-write.c:56)
==9430==    by 0x11807C: outputlist_write (output.c:161)
==9430==    by 0x115017: dd_copy (copy.c:322)
==9430==    by 0x10C656: main (dcfldd.c:753)
...

The leak size almost same size as the (not) written data size. With GB (not) written, easy to eat all memory.

Remark:
My do..while(0) not nice I know, but this just a "simulated" try...throw..finally statement. The "break" is the "throw". :D Not nice, but my code always free the malloc-ed block. I don't know why, but your solution doesn't do that.

Maybe my code would have been better in the released version. Now the current release is buggy. :(

Originally posted by @szolnokit in #13 (comment)

@davidpolverari
Copy link
Collaborator

Fair enough. I'll take a look at your claims.

@szolnokit
Copy link
Contributor Author

szolnokit commented Apr 15, 2023

It is just a remark...
@davidpolverari version would be good (OOM free), if the "free" command moved outside from 3rd if:

      if (diffwr) { /* Check destination block content is same as the buffer */
        char *rptr = NULL;
        off_t pos = lseek(desc, 0, SEEK_CUR);
        if ((pos >= 0) && (rptr = malloc(len))) {
          int rlen = safe_read(desc, rptr, len);
          if ((rlen <= 0) || (rlen != len) || (memcmp(rptr, ptr, len))) {
/*remove            free(rptr);  */
            lseek(desc, pos, SEEK_SET);
          } else {
            written = len;
          }
          free(rptr);  /*add here*/
        }
      }

It's not tested, just a suggestion.

@davidpolverari
Copy link
Collaborator

It is just a remark... @davidpolverari version would be good (OOM free), if the "free" command moved outside from 3rd if:

      if (diffwr) { /* Check destination block content is same as the buffer */
        char *rptr = NULL;
        off_t pos = lseek(desc, 0, SEEK_CUR);
        if ((pos >= 0) && (rptr = malloc(len))) {
          int rlen = safe_read(desc, rptr, len);
          if ((rlen <= 0) || (rlen != len) || (memcmp(rptr, ptr, len))) {
/*remove            free(rptr);  */
            lseek(desc, pos, SEEK_SET);
          } else {
            written = len;
          }
          free(rptr);  /*add here*/
        }
      }

It's not tested, just a suggestion.

Yes, I was analyzing the code before and I came to that conclusion, too. In the process of converting your code, I missed the free() on the "same block" path. I am preparing a fix for it, and I would like to ask you to test it when I push the test branch before I commit the final version. Is that ok?

@szolnokit
Copy link
Contributor Author

Moving "free" outside... working like a charm...

Now, there are two options:

  1. Use my do...break...while(0) version
  2. Use modified code above

@szolnokit
Copy link
Contributor Author

szolnokit commented Apr 15, 2023

Yes, I was analyzing the code before and I came to that conclusion, too. In the process of converting your code, I missed the free() on the "same block" path. I am preparing a fix for it, and I would like to ask you to test it when I push the test branch before I commit the final version. Is that ok?

It was a sneaking bug. Because only occur when:

  • Test with big data (bigger as virtual memory) AND
  • destination almost same data like as source :D

Ok, no problem...
Currently, I use diffwr option on a real job with big flash drives (w/ multiple of= ), plus readback hash compare.
The OOM will be revealed soon :D

@davidpolverari
Copy link
Collaborator

Yes, I'm working on a diff moving free() outside. The reason I didn't merge your code as it were is long-term maintenance: I wanted to simplify the branching logic to make it easier to understand for future maintainers. Of course, I did miss this detail and I messed up. I´m sorry for that! 😳

@szolnokit
Copy link
Contributor Author

I can do that moving in minutes, if you like.
Just only some minutes. I have free time currently :)

@davidpolverari
Copy link
Collaborator

davidpolverari commented Apr 15, 2023

I already did the patch and I'm almost ready to push it for testing. The only thing I'm wondering now is whether should we free() the pointer conditionally, eg:

if (rptr)
  free(rptr);

I know that it would be virtually impossible for it to happen, as if the malloc() failed, we wouldn't reach that point, and then this other if would be useless (and costly). I think we can free it unconditionally, right?

davidpolverari added a commit that referenced this issue Apr 15, 2023
Before this patch, free() was called only when blocks were not the same,
causing memory leaks when they were, what could eventually cause an OOM
due to the accumulated effect of those leaks on large files.

This patch frees allocated memory for both paths, thus fixing the bug.
Thanks @szolnokit for the report.

Closes #16.
@szolnokit
Copy link
Contributor Author

szolnokit commented Apr 15, 2023

Put it boldly without if (rptr) . Because of && (rptr = malloc(len))), if (rptr) always would be true :D :D

This is good, I have tested:

      if (diffwr) { /* Check destination block content is same as the buffer */
        char *rptr = NULL;
        off_t pos = lseek(desc, 0, SEEK_CUR);
        if ((pos >= 0) && (rptr = malloc(len))) {
          int rlen = safe_read(desc, rptr, len);
          if ((rlen <= 0) || (rlen != len) || (memcmp(rptr, ptr, len))) {
            lseek(desc, pos, SEEK_SET);
          } else {
            written = len;
          }
          free(rptr);    /* rptr impossible to NULL here */
        }
      }

@davidpolverari
Copy link
Collaborator

davidpolverari commented Apr 15, 2023

Yes, I was just trying to check if there could be a corned case in the middle :).

I have pushed a branch with the fix. In my tests with valgrind, I didn't see any leaks1. Please test it and tell me if it is ok now.

Footnotes

  1. There are other leaks, but not related to diffwr. I will have a look at them in another opportunity.

@szolnokit
Copy link
Contributor Author

szolnokit commented Apr 15, 2023

Ok, I'm beginning of some test with diffwr. With big test files, and real block devices. With commit 2cddf08

@szolnokit
Copy link
Contributor Author

szolnokit commented Apr 15, 2023

Yes, I have seen. There are some other mem leaks, but these are small leaks, and all occurs only once per execute. Not "inflating" memory leaks. From strdup, etc... And not related with "diffwr" option, these are "ancient" leaks :D :D

@davidpolverari
Copy link
Collaborator

Yeah, but I will try to hunt them anyway :).

@szolnokit
Copy link
Contributor Author

OK. I tested commit 2cddf08 . No more OOM.

@davidpolverari
Copy link
Collaborator

Thanks a lot! And sorry for the mistake! I will merge it into master now.

@szolnokit
Copy link
Contributor Author

Closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants