Add GPU testing config and GPU roberta model tests #2025
Conversation
This is awesome, @rshraga, thank you! As part of this PR, can you add a very simple GPU test for T5? That will also make sure that we have the architecture to ensure GPU tests only run on the GPU and CPU tests only run on the CPU. I think you can borrow some logic from torchaudio on how to do this. And let's get @mthrok's eyes on this, too.
Overall it looks good.
.github/workflows/test-linux-gpu.yml (outdated)
printf "* Downloading SpaCy German models\n" | ||
python -m spacy download de_core_news_sm | ||
|
||
# Install PyTorch, Torchvision, and TorchData |
vision?
removed
    conda env update --file ".circleci/unittest/linux/scripts/environment.yml" --prune

    # TorchText-specific Setup
    printf "* Downloading SpaCy English models\n"
Off topic, but I feel like this could be run in the background with &, and we could then wait for the process to finish before the tests run.
I just copied most of this over from the other GitHub workflow files. If improvements are needed, can we do that in a separate PR?
Yeah, it's just a random thought. Not necessary to follow up.
    @@ -37,11 +38,34 @@ def get_temp_path(self, *paths):
            return path


    class TestBaseMixin:
Do we need to introduce this for the sake of adding GPU tests? I'd split the PR.
NVM. After looking at the rest of the code, I see this is part of enabling the GPU tests.
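For context, a rough sketch of the kind of device-aware mixin being discussed, loosely modeled on torchaudio's pattern. The attribute names are illustrative rather than the exact torchtext code; the 2434 seed is the one that appears in this PR's diff:

    import torch


    class TestBaseMixin:
        """Mixin holding test-wide state such as the target device (illustrative)."""

        device = None  # a concrete test class sets this to "cpu" or "cuda"

        def setUp(self):
            super().setUp()
            # Seed per test so CPU and GPU runs exercise the same random inputs.
            torch.random.manual_seed(2434)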
    from torch.nn import functional as torch_F

    from ..common.torchtext_test_case import TorchtextTestCase
Is it necessary and right to replace TorchtextTestCase? Making TestBaseMixin be part of TorchtextTestCase is more aligned with the original design in torchaudio. (Though I do not necessarily know all the details of the torchtext test suite, so I could be wrong here.)
Yes, because pytest will run all tests for all superclasses of TorchtextTestCase, so BaseTestModels must not extend it directly.
That makes sense. Thanks for clarifying.
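In other words, roughly (a minimal sketch of the collection rule; only the class relationships matter here, and unittest.TestCase stands in for TorchtextTestCase):

    import unittest


    class BaseTestModels:  # plain class: pytest does not collect its tests on their own
        def test_something(self):
            self.assertTrue(True)


    class TestModelsCPU(BaseTestModels, unittest.TestCase):
        # Collected and run. If BaseTestModels extended TestCase (or TorchtextTestCase)
        # directly, its tests would also be collected a second time on the base class.
        pass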
    torch.random.manual_seed(2434)

    @property
    def complex_dtype(self):
Complex tensor stuff is something we needed in audio. If text does not use it, I recommend not adding it. The lighter the test fixtures are, the better.
removed, thanks!
.github/workflows/test-linux-gpu.yml (outdated)
printf "Installing torchdata nightly\n" | ||
python3 -m pip install --pre torchdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu | ||
python3 setup.py develop | ||
python3 -m pip install parameterized |
Feel like parameterized can be incorporated into .circleci/unittest/linux/scripts/environment.yml.
    workflow_dispatch:

    env:
      CHANNEL: "nightly"
Is there an advantage to defining the env var here?
Just following the pattern from the other workflows. If you suggest fixing / cleaning this up, can we do so for all of them?
@mthrok thank you for the comments! Will address these in the next version. When I check the results of the unit tests in the "Unit-tests on Linux GPU" workflow, I see that the new GPU ones did not even run (only the CPU ones are there). I checked and saw that the instance used for the job is g5.4xlarge, which has a GPU, so I am not sure why the GPU tests get skipped. Do you have any ideas on how I could debug and get these tests to run?
This is very peculiar. The GPU tests are not even shown as skipped; they are not being recognized at all. A couple of suggestions for debugging:
Since we're no longer just enabling GPU tests, can we update the PR description to go into details on the changes that were made and the reasoning behind them?
    from ..common.torchtext_test_case import TorchtextTestCase
    from .models_test_impl import BaseTestModels


    @skipIfNoCuda
    @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA is not available")
Any reason we're no longer using a custom skipIfNoCuda decorator to do this check? Wouldn't it be cleaner for the future when we have many GPU-specific unit tests?
Is the issue with the GPU tests not being executed resolved? If not, then this appears to be the number one suspect for me.
The unit test suite in torchaudio (which torchtext's suite is based off of) was originally, and still is, written with the Python unittest module. No pytest code is used (the pytest command is used to run the code, though). The reasons for that are 1. to reuse the helper functions from PyTorch, and 2. fbcode was not supporting pytest.
One of the biggest downsides is that unittest does not natively support parameterization, so we used parameterized. parameterized (and pytest as well) employs many tricks to manipulate metadata and class/type objects that we wouldn't (and shouldn't) do in normal circumstances when writing a library, so they often show strange behavior.
So far, unittest + parameterized for writing tests, then pytest for running them, has worked fine.
But this one seems to combine all three to write tests, which we avoided in the audio repo, and that could have some unexpected effects under the hood.
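For reference, the unittest + parameterized combination described above looks roughly like this; the test class and cases are generic examples, not torchtext code:

    import unittest

    from parameterized import parameterized


    class TokenizerTest(unittest.TestCase):  # hypothetical example
        @parameterized.expand([("empty", "", 0), ("two_words", "hello world", 2)])
        def test_token_count(self, name, text, expected):
            # parameterized generates one unittest test method per tuple above.
            self.assertEqual(len(text.split()), expected)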
Yeah, using skipIfNoCuda was not working properly, but pytest.mark.skipif (as vision uses) works fine.
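For comparison, the two forms side by side. The skipIfNoCuda shown here is just one plausible definition built on unittest.skipUnless, not necessarily torchtext's actual helper:

    import unittest

    import pytest
    import torch

    # One plausible definition of the custom decorator (unittest-style).
    skipIfNoCuda = unittest.skipUnless(torch.cuda.is_available(), "CUDA is not available")


    class CudaSmokeTest(unittest.TestCase):  # hypothetical example
        @skipIfNoCuda
        def test_with_custom_decorator(self):
            self.assertTrue(torch.zeros(1, device="cuda").is_cuda)

        # The form the PR settled on, matching what torchvision does. Note that the
        # marker only takes effect when the suite is run with pytest.
        @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA is not available")
        def test_with_pytest_marker(self):
            self.assertTrue(torch.zeros(1, device="cuda").is_cuda)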
torchvision's test suites are based on pytest. So that's a bit different.
Another thing to consider is whether we gain any benefit from running all tests on GPU-enabled hosts. Since these tests are already being run on CPU hosts, do we want to save time and resources by only selecting GPU-specific tests to run on GPU hosts?
The pattern for the other libraries is the same as it is here: i.e. CPU machines skip GPU tests, while GPU machines run everything. I think doing what you are describing would require keeping all CPU tests and all GPU tests in separate folders, which would make things look a bit messy, IMO.
.github/workflows/test-linux-gpu.yml (outdated)
    fi

    # Create Conda Env
    conda create -yp ci_env python="${PYTHON_VERSION}"
Can you make all the calls to conda and pip quiet? The CI log is super long but not providing much value.
I would be in favor of this approach. Currently the only thing that needs to run on a GPU is modeling, and our tests already take a long time.
I gave it some thought, and it's not very relevant here, but if you have custom kernel implementations for both CPU and GPU, then it's worth running the tests for both implementations on a CPU-only build and a GPU-only build. If the GPU code is written in Python only (which I believe is the case for torchtext), then running tests on a GPU machine to check that devices are properly propagated should be enough, because it's PyTorch's premise that the basic ops are implemented properly on CPU and GPU.
@mthrok, just to make sure I understand, are you suggesting we go forward with only running GPU-specific tests on GPU hosts, or go with the status quo of running all tests on GPUs?
I think it's okay to run just the GPU tests for Python-based implementations. But when you have custom ops with CPU and GPU implementations, I think it's good to test both the CPU and GPU versions of those ops.
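A minimal sketch of the kind of device-propagation check meant here, using a stand-in nn.Module instead of a real torchtext model:

    import unittest

    import torch


    class DevicePropagationTest(unittest.TestCase):  # illustrative, not the PR's test
        @unittest.skipUnless(torch.cuda.is_available(), "CUDA is not available")
        def test_output_lands_on_cuda(self):
            # Move a model to the GPU and confirm the forward pass stays there;
            # correctness of the underlying ops is PyTorch's responsibility.
            model = torch.nn.Linear(4, 2).to("cuda")
            out = model(torch.randn(3, 4, device="cuda"))
            self.assertEqual(out.device.type, "cuda")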
…d gpu tests to separate folder
LGTM! Feel free to merge :)
    strategy:
      matrix:
        python_version: ["3.8"]
        cuda_arch_version: ["11.6"]
They are dropping support for 11.6, so it might be better to move to 11.7.
https://pytorch.slack.com/archives/C2077MFDL/p1675256463971369
Thanks for taking care of this in #2040
This PR adds a unit testing workflow to .github/workflows which is run on a machine with CUDA. At the same time, it modifies the torchtext_unittest/models tests by renaming the main test file from test_models.py to models_test_impl.py and renaming the main test class from TestModels to BaseTestModels. Next, it adds separate CPU and GPU unit tests which inherit from BaseTestModels. With this change, only the CPU and GPU unit tests are actually run, not BaseTestModels. A similar pattern can be used for GPU unit tests elsewhere.
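Put together, the resulting pattern looks roughly like the sketch below. unittest.TestCase stands in for TorchtextTestCase, the device attribute and test body are illustrative, and the file names in the comments follow the description above (the GPU folder name is assumed):

    import unittest

    import pytest
    import torch


    # models_test_impl.py: shared test implementations, deliberately not a TestCase
    class BaseTestModels:
        def test_zeros_on_device(self):
            t = torch.zeros(2, 2, device=self.device)
            self.assertEqual(t.device.type, torch.device(self.device).type)


    # CPU test module: always collected and run
    class TestModelsCPU(BaseTestModels, unittest.TestCase):
        device = "cpu"


    # GPU test module (kept in its own folder): skipped unless CUDA is available
    @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA is not available")
    class TestModelsGPU(BaseTestModels, unittest.TestCase):
        device = "cuda"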