Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up buffers in case AssertionError #13262

Merged
merged 20 commits into from
May 11, 2023

Conversation

razajafri
Copy link
Contributor

In case ColumnView throws an AssertionError which it can after #13071, we should close out the open buffers.

fixes #13225

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@razajafri razajafri added bug Something isn't working non-breaking Non-breaking change labels May 1, 2023
@razajafri razajafri self-assigned this May 1, 2023
@razajafri razajafri requested a review from a team as a code owner May 1, 2023 23:24
@github-actions github-actions bot added the Java Affects Java cuDF API. label May 1, 2023
@abellina
Copy link
Contributor

abellina commented May 2, 2023

Calling .close() here is not what we want, and does not work (would throw here). This is similar to what @ttnghia brought up.

The problem is with sub classes like ColumnVector (what original issue was filed for). Since this subclass is not a ColumnView, calling close() from the constructor of ColumnView is going to call ColumnVector.close() at the wrong time, which does all kinds of bad things at this point of execution:

  1. At the time when ColumnView constructor is entered refCount=0, as it gets incremented in the ColumnVector constructor only after the view has successfully constructed.

  2. The close constructor will then decrement this refCount setting it to -1, and it should throw "Close called too many times"

  @Override
  public synchronized void close() {
    refCount--;
    offHeap.delRef();                                  
    if (eventHandler != null) {
      eventHandler.onClosed(refCount);
    }
    if (refCount == 0) {
      offHeap.clean(false);
    } else if (refCount < 0) {
      offHeap.logRefCountDebug("double free " + this);
      throw new IllegalStateException("Close called too many times " + this);
    }
  }

@abellina
Copy link
Contributor

abellina commented May 2, 2023

You want to treat this case differently since it is a special error on construction.

In the viewHandle constructor case, the code you have seems fine but I would also 0 it out for good measure.

In the OffHeapState constructor case, the state of concern is offHeap. This is created externally in ColumnVector and handed down to the view. I think the only thing you want to do is to close offHeap state directly if you are going to throw. I would also 0 out viewHandle for good measure, it's a pointer to nothing after offHeap closes.

I would put comments on both constructors that this is now happening to make it extra clear and we don't forget this in the future.

@razajafri razajafri requested review from abellina and ttnghia May 2, 2023 20:24
@razajafri razajafri requested a review from abellina May 5, 2023 17:54
@razajafri
Copy link
Contributor Author

@abellina do you have anything else to add? I think I have addressed all your concerns

columnVectors[i] = new ColumnVector(nativeHandles[i]);
}
} catch (Throwable t) {
for (ColumnView columnView: columnVectors) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit, since we know we are creating column vectors, I don't see the point of using the superclass here.

columnVectors[i] = new ColumnVector(nativeHandles[i]);
}
} catch (Throwable t) {
for (ColumnView columnView: columnVectors) {
Copy link
Contributor

@abellina abellina May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to closing the column vectors/views in the array, we also have pending nativeHandles, sorry I just caught this one. In order to make it safer, we should also need to close the handles which are leaked. Should probably create a helper handles -> vectors, or handles -> views, if we don't have those already, since all the other places you changed are similar.

I would change the loop like so:

try {
  for (int i = 0; i < nativeHandles.length; i++) {
    columnVectors[i] = new ColumnVector(nativeHandles[i]);
    nativeHandles[i] = 0; // mark the handle as consumed
  }
} catch (Throwable t) {
  for (ColumnVector cv : columnVectors) {
    if (cv != null) {
      cv.close();
    }
  }
  for (long handle : nativeHandles) { 
    if (handle != 0) {
      ColumnVector.deleteCudfColumn(handle);
    }
  }
  throw t;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! But why not close all the nativeHandles? We got them as a result of slice They should all be cleaned out since we have encountered an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the same thing should happen in splitAsViews

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but since you have handed some handles to the vector/view, the expectation is that the vector/view will close the handle they were given. So you don't want to double delete these.

@razajafri
Copy link
Contributor Author

@abellina thank you for taking a look and catching things that I had missed. I have covered the missed cases I think. I have also created a helper method in ColumnView and also replaced places where it should be called instead of the code we were using before. Please take a look again when you have a chance

Copy link
Contributor

@abellina abellina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had one typo change request, otherwise LGTM

@razajafri
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit e4e65a9 into rapidsai:branch-23.06 May 11, 2023
51 checks passed
@razajafri razajafri deleted the clear-buffers branch May 11, 2023 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Java Affects Java cuDF API. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] JNI GPU memory leak when assertions are enabled
4 participants