This repository was archived by the owner on Oct 9, 2023. It is now read-only.

@alexjpwalker alexjpwalker commented Nov 1, 2022

What is the goal of this PR?

We no longer call close on any of our gRPC Channels. This fixes possible segfaults caused by resources being deallocated while they are still in use.

What are the changes implemented in this PR?

Users in a wide variety of scenarios had reported intermittent crashes, often accompanied by warnings saying "1 metadata element(s) leaked". The logs would look similar to the following:

E1031 17:59:35.477338904 207 metadata.cc:253] WARNING: 1 metadata elements were leaked
E1031 17:59:35.477409424 207 metadata.cc:260] mdelem ':authority' = 'typedb-cluster-1.typedb-cluster:1729'
[mutex.cc : 435] RAW: Lock blocking 0x563cbf829fc0 @
[mutex.cc : 1908] RAW: Check (v & (kMuWait | kMuWrWait)) != kMuWrWait failed: Lock: Mutex corrupt: waiting writer with no waiters: 0x563cbf815670

The same issue has also been reported in googleads/google-ads-python#384, and a fix was suggested in an issue in the gRPC repository.

From this issue we infer that Channel.close misbehaves in gRPC Python: it can cause resources to be deallocated while they are still in use. Because gRPC itself uses native C libraries, this results in segfaults and crashes. We concluded that the best course of action is to not close the Channel ourselves.

We've simply deleted the 3 places in our code that called close on a gRPC Channel. It has passed our CI tests and fixed user-reported issues, and it is what the gRPC maintainers themselves appear to recommend in the linked grpc issue.
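
For illustration, the resulting pattern might look like the minimal sketch below. `FakeChannel` and `Client` are hypothetical names used only for this example; they are not the classes touched by this PR. In real code the channel would come from a call such as `grpc.insecure_channel(address)`; a stand-in is used here so the snippet runs without gRPC installed.

```python
class FakeChannel:
    """Hypothetical stand-in for a gRPC channel; records whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Client:
    """Hypothetical client wrapper that owns a channel but never closes it."""

    def __init__(self, channel):
        self._channel = channel
        self._is_open = True

    def is_open(self):
        return self._is_open

    def close(self):
        # Release client-side state only. We deliberately do NOT call
        # self._channel.close(), mirroring the change described above.
        self._is_open = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False


channel = FakeChannel()
with Client(channel) as client:
    assert client.is_open()

print(client.is_open(), channel.closed)  # False False
```

The client still reports itself as closed after the `with` block, so callers see the same lifecycle as before; only the explicit channel teardown is gone.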

typedb-bot commented Nov 1, 2022

PR Review Checklist

Do not edit the content of this comment. The PR reviewer should simply update this comment by ticking each review item below, as they get completed.


Code

  • Packages, classes, and methods have a single domain of responsibility.
  • Packages, classes, and methods are grouped into a cohesive and consistent domain model.
  • The code is canonical and the minimum required to achieve the goal.
  • Modules, libraries, and APIs are easy to use, robust (foolproof and not error-prone), and tested.
  • Logic and naming have a clear narrative that communicates the accurate intent and responsibility of each module (e.g. method, class, etc.).
  • The code is algorithmically efficient and scalable for the whole application.

Architecture

  • Any required refactoring is completed, and the architecture does not introduce technical debt incidentally.
  • Any required build and release automations are updated and/or implemented.
  • Any new components follow a consistent style with respect to the pre-existing codebase.
  • The architecture intuitively reflects the application domain, and is easy to understand.
  • The architecture has a well-defined hierarchy of encapsulated components.
  • The architecture is extensible and scalable.

type: foreground
command: |
  pyenv install 3.7.12

We had to make a number of changes to the way we load Python and Pip and install packages in order to work on the new Factory CI machines.


def close(self) -> None:
    super().close()
    self._channel.close()

See PR description for the full explanation of why we do this. In brief, explicitly calling close on a gRPC Channel in Python can result in resources being deallocated while they are still being used, which in turn may cause gRPC's native C libraries to segfault. Not closing the channel is the approach recommended in an issue in the gRPC repository itself.

@alexjpwalker alexjpwalker merged commit 03a0340 into typedb:master Nov 1, 2022
@alexjpwalker alexjpwalker deleted the dont-close-channel branch November 1, 2022 13:01
flyingsilverfin pushed a commit to typedb/typedb-docs that referenced this pull request Nov 1, 2022
## What is the goal of this PR?

We upgraded Client Python to the latest version, which should fix intermittent and common test failures in CI.

## What are the changes implemented in this PR?

Our CI jobs have been failing for some time due to a "metadata elements leaked" error, which we've fixed in Client Python in typedb/typedb-driver-python#266, and now we've upgraded to the latest release of Client Python.

Development

Successfully merging this pull request may close these issues.

GRPC memory leak from opening session using empty with block
