Skip to content

Conversation

ejguan
Copy link
Contributor

@ejguan ejguan commented May 25, 2021

Stack from ghstack:

When run test_datapipe.py, python gc would report lots of ResourceWarnings due to unclosed stream. It's not only annoying, there are two potential problems:

  • Performance regression because gc requires additional memory and computation to track reference
  • Python gc runs periodically so we many encountered an error of too many open files due to OS limitation
    To reduce the warning:
  • Explicitly close byte stream
  • Modify test_datapipe.py to use context manager

Small fix:

  • Reorder import in test_datapipe.py

Further investigation:
Can we directly use context manager in LoadFileFromDisk and ReadFileFromTar to eliminate this Error?

  • Probably no. It's feasible only if the pipeline is synchronized and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work.
  • We may need to implement some reference counter attached to these file byte stream to close by itself.

Differential Revision: D28689862

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented May 25, 2021

💊 CI failures summary and remediations

As of commit 2ad6a79 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-scanned failure(s)

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

ejguan added a commit that referenced this pull request May 25, 2021
ghstack-source-id: e3b8d3b
Pull Request resolved: #58938
When run `test_datapipe.py`, python `gc` would report lots of `ResourceWarning`s due to unclosed stream. It's not only annoying, there are two potential problems:
- Performance regression because `gc` requires additional memory and computation to track reference
- Python `gc` runs periodically so we many encountered an error of too many open files due to OS limitation
To reduce the warning:
- Explicitly close byte stream 
- Modify `test_datapipe.py` to use context manager

Small fix:
- Reorder import in `test_datapipe.py`

Further investigation:
Can we directly use context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this Error?
- Probably no. It's feasible only if the pipeline is synchronized and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work.
- We may need to implement some reference counter attached to these file byte stream to close by itself.

[ghstack-poisoned]
When run `test_datapipe.py`, python `gc` would report lots of `ResourceWarning`s due to unclosed stream. It's not only annoying, there are two potential problems:
- Performance regression because `gc` requires additional memory and computation to track reference
- Python `gc` runs periodically so we many encountered an error of too many open files due to OS limitation
To reduce the warning:
- Explicitly close byte stream 
- Modify `test_datapipe.py` to use context manager

Small fix:
- Reorder import in `test_datapipe.py`

Further investigation:
Can we directly use context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this Error?
- Probably no. It's feasible only if the pipeline is synchronized and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work.
- We may need to implement some reference counter attached to these file byte stream to close by itself.

[ghstack-poisoned]
ejguan added a commit that referenced this pull request May 25, 2021
ghstack-source-id: ca1cf6c
Pull Request resolved: #58938
@ejguan
Copy link
Contributor Author

ejguan commented May 25, 2021

@ejguan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ejguan ejguan requested a review from VitalyFedyunin May 25, 2021 20:53
When run `test_datapipe.py`, python `gc` would report lots of `ResourceWarning`s due to unclosed stream. It's not only annoying, there are two potential problems:
- Performance regression because `gc` requires additional memory and computation to track reference
- Python `gc` runs periodically so we many encountered an error of too many open files due to OS limitation
To reduce the warning:
- Explicitly close byte stream 
- Modify `test_datapipe.py` to use context manager

Small fix:
- Reorder import in `test_datapipe.py`

Further investigation:
Can we directly use context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this Error?
- Probably no. It's feasible only if the pipeline is synchronized and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work.
- We may need to implement some reference counter attached to these file byte stream to close by itself.

Differential Revision: [D28689862](https://our.internmc.facebook.com/intern/diff/D28689862)

[ghstack-poisoned]
ejguan added a commit that referenced this pull request May 26, 2021
ghstack-source-id: a4637c1
Pull Request resolved: #58938
When run `test_datapipe.py`, python `gc` would report lots of `ResourceWarning`s due to unclosed stream. It's not only annoying, there are two potential problems:
- Performance regression because `gc` requires additional memory and computation to track reference
- Python `gc` runs periodically so we many encountered an error of too many open files due to OS limitation
To reduce the warning:
- Explicitly close byte stream 
- Modify `test_datapipe.py` to use context manager

Small fix:
- Reorder import in `test_datapipe.py`

Further investigation:
Can we directly use context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this Error?
- Probably no. It's feasible only if the pipeline is synchronized and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work.
- We may need to implement some reference counter attached to these file byte stream to close by itself.

Differential Revision: [D28689862](https://our.internmc.facebook.com/intern/diff/D28689862)

[ghstack-poisoned]
ejguan added a commit that referenced this pull request Jun 1, 2021
ghstack-source-id: 5d9fb9e
Pull Request resolved: #58938
@ejguan
Copy link
Contributor Author

ejguan commented Jun 1, 2021

@ejguan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

When run `test_datapipe.py`, python `gc` would report lots of `ResourceWarning`s due to unclosed stream. It's not only annoying, there are two potential problems:
- Performance regression because `gc` requires additional memory and computation to track reference
- Python `gc` runs periodically so we many encountered an error of too many open files due to OS limitation
To reduce the warning:
- Explicitly close byte stream 
- Modify `test_datapipe.py` to use context manager

Small fix:
- Reorder import in `test_datapipe.py`

Further investigation:
Can we directly use context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this Error?
- Probably no. It's feasible only if the pipeline is synchronized and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work.
- We may need to implement some reference counter attached to these file byte stream to close by itself.

Differential Revision: [D28689862](https://our.internmc.facebook.com/intern/diff/D28689862)

[ghstack-poisoned]
Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a test that asserts that a file handle is closed after it has been processed by the decoder? Otherwise LGTM, thanks @ejguan!

@pmeier
Copy link
Collaborator

pmeier commented Jun 7, 2021

On a side note: we should document somewhere that it is the users responsibility to close the opened file handles if he doesn't use the builtin decoder.

ejguan added 3 commits June 7, 2021 20:44
When run `test_datapipe.py`, python `gc` would report lots of `ResourceWarning`s due to unclosed stream. It's not only annoying, there are two potential problems:
- Performance regression because `gc` requires additional memory and computation to track reference
- Python `gc` runs periodically so we many encountered an error of too many open files due to OS limitation
To reduce the warning:
- Explicitly close byte stream 
- Modify `test_datapipe.py` to use context manager

Small fix:
- Reorder import in `test_datapipe.py`

Further investigation:
Can we directly use context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this Error?
- Probably no. It's feasible only if the pipeline is synchronized and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work.
- We may need to implement some reference counter attached to these file byte stream to close by itself.

Differential Revision: [D28689862](https://our.internmc.facebook.com/intern/diff/D28689862)

[ghstack-poisoned]
When run `test_datapipe.py`, python `gc` would report lots of `ResourceWarning`s due to unclosed stream. It's not only annoying, there are two potential problems:
- Performance regression because `gc` requires additional memory and computation to track reference
- Python `gc` runs periodically so we many encountered an error of too many open files due to OS limitation
To reduce the warning:
- Explicitly close byte stream 
- Modify `test_datapipe.py` to use context manager

Small fix:
- Reorder import in `test_datapipe.py`

Further investigation:
Can we directly use context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this Error?
- Probably no. It's feasible only if the pipeline is synchronized and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work.
- We may need to implement some reference counter attached to these file byte stream to close by itself.

Differential Revision: [D28689862](https://our.internmc.facebook.com/intern/diff/D28689862)

[ghstack-poisoned]
When run `test_datapipe.py`, python `gc` would report lots of `ResourceWarning`s due to unclosed stream. It's not only annoying, there are two potential problems:
- Performance regression because `gc` requires additional memory and computation to track reference
- Python `gc` runs periodically so we many encountered an error of too many open files due to OS limitation
To reduce the warning:
- Explicitly close byte stream 
- Modify `test_datapipe.py` to use context manager

Small fix:
- Reorder import in `test_datapipe.py`

Further investigation:
Can we directly use context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this Error?
- Probably no. It's feasible only if the pipeline is synchronized and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work.
- We may need to implement some reference counter attached to these file byte stream to close by itself.

Differential Revision: [D28689862](https://our.internmc.facebook.com/intern/diff/D28689862)

[ghstack-poisoned]
@ejguan
Copy link
Contributor Author

ejguan commented Jun 7, 2021

@ejguan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@ejguan merged this pull request in 1b578c4.

deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Pull Request resolved: pytorch#58938

When run `test_datapipe.py`, python `gc` would report lots of `ResourceWarning`s due to unclosed stream. It's not only annoying, there are two potential problems:
- Performance regression because `gc` requires additional memory and computation to track reference
- Python `gc` runs periodically so we many encountered an error of too many open files due to OS limitation
To reduce the warning:
- Explicitly close byte stream
- Modify `test_datapipe.py` to use context manager

Small fix:
- Reorder import in `test_datapipe.py`

Further investigation:
Can we directly use context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this Error?
- Probably no. It's feasible only if the pipeline is synchronized and without prefetching. When we enable these two features, the scope guard of the context manager doesn't work.
- We may need to implement some reference counter attached to these file byte stream to close by itself.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28689862

Pulled By: ejguan

fbshipit-source-id: bb2a85defb8a4ab5384db902ef6ad062185c2653
@facebook-github-bot facebook-github-bot deleted the gh/ejguan/66/head branch June 12, 2021 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants