
Add GPU testing config and GPU roberta model tests #2025

Merged: 8 commits into main, Jan 26, 2023

Conversation

@rshraga (Contributor) commented Jan 18, 2023:

This PR adds a unit testing workflow to .github/workflows which runs on a machine with CUDA. At the same time, it modifies the torchtext_unittest/models tests: the main test file test_models.py is renamed to models_test_impl.py, and the main test class TestModels is renamed to BaseTestModels. It then adds separate CPU and GPU unit tests which inherit from BaseTestModels. With this change, only the CPU and GPU unit tests are actually run, not BaseTestModels itself. A similar pattern can be used for GPU unit tests elsewhere.
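The resulting pattern can be sketched with a minimal, self-contained example (the single-module layout and the test body are illustrative, not the PR's actual code; only the class names follow the description above):

```python
# models_test_impl.py (sketch): the shared test logic lives in a plain class
# that does NOT extend the TestCase base, so it is never collected directly.
import sys
import unittest

class BaseTestModels:
    device = None  # supplied by the concrete CPU/GPU subclasses

    def test_device_is_set(self):
        # Stand-in for a real model test that would run on self.device.
        self.assertIsNotNone(self.device)

# CPU/GPU test files (sketch): only these concrete classes are collected.
class TestModelsCPU(BaseTestModels, unittest.TestCase):
    device = "cpu"

class TestModelsGPU(BaseTestModels, unittest.TestCase):
    device = "cuda"

# The loader finds one test per concrete subclass and none for the base,
# because the base is not a TestCase.
suite = unittest.TestLoader().loadTestsFromModule(sys.modules[__name__])
print(suite.countTestCases())  # 2
```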

@joecummings (Contributor):

This is awesome, @rshraga, thank you! As part of this PR, can you add a very simple GPU test for T5? That will also make sure that we have the architecture to ensure GPU tests only run on the GPU and CPU tests only run on the CPU. I think you can borrow some logic from torchaudio on how to do this. And let's get @mthrok 's eyes on this, too.

@rshraga rshraga marked this pull request as draft January 19, 2023 16:32
@mthrok (Contributor) left a comment:

Overall it looks good.

printf "* Downloading SpaCy German models\n"
python -m spacy download de_core_news_sm

# Install PyTorch, Torchvision, and TorchData
Contributor:

vision?

Contributor (Author):

removed

conda env update --file ".circleci/unittest/linux/scripts/environment.yml" --prune

# TorchText-specific Setup
printf "* Downloading SpaCy English models\n"
Contributor:

Off topic, but I feel like this could be run in the background with &, waiting for the process to finish before the tests.

Contributor (Author):

I mostly copied this over from the other GitHub workflow files. If improvements are needed, can we do that in a separate PR?

@mthrok (Contributor), Jan 25, 2023:

Yeah, it's just a random thought. No need to follow up.
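For what it's worth, the backgrounding idea could look like this (a hypothetical sketch, not adopted in the PR):

```shell
# Start both SpaCy model downloads in the background, then block until
# they finish before the tests run.
python -m spacy download en_core_web_sm &
python -m spacy download de_core_news_sm &
wait
```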

@@ -37,11 +38,34 @@ def get_temp_path(self, *paths):
return path


class TestBaseMixin:
Contributor:

Do we need to introduce this for the sake of adding GPU tests? I'd split the PR.

NVM. After looking at the rest of the code, I see this is part of enabling GPU test.

from torch.nn import functional as torch_F

from ..common.torchtext_test_case import TorchtextTestCase
Contributor:

Is it necessary and right to replace TorchtextTestCase? Making TestBaseMixin part of TorchtextTestCase would be more aligned with the original design in torchaudio. (Though I don't necessarily know all the details of the torchtext test suite, so I could be wrong here.)

Contributor (Author):

Yes, because pytest will run the tests of every class that extends TorchtextTestCase, including the base class itself, so BaseTestModels must not extend it directly.

Contributor:

That makes sense. Thanks for clarifying
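For illustration, here is a minimal demonstration (hypothetical names) of the problem being avoided: unittest and pytest collect tests from every TestCase subclass, including a shared base class, so a base that extends the test-case class gets its tests run an extra time:

```python
import sys
import unittest

# If the shared base itself extends TestCase ...
class BadBaseTestModels(unittest.TestCase):
    def test_something(self):
        self.assertTrue(True)

class BadTestModelsCPU(BadBaseTestModels):
    pass

class BadTestModelsGPU(BadBaseTestModels):
    pass

# ... the loader collects the inherited test three times: once for each
# concrete subclass AND once for the base class itself.
suite = unittest.TestLoader().loadTestsFromModule(sys.modules[__name__])
print(suite.countTestCases())  # 3
```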

torch.random.manual_seed(2434)

@property
def complex_dtype(self):
Contributor:

Complex tensor support is something we needed in audio. If text does not use it, I recommend not adding it. The lighter the test fixtures are, the better.

Contributor (Author):

removed, thanks!

printf "Installing torchdata nightly\n"
python3 -m pip install --pre torchdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu
python3 setup.py develop
python3 -m pip install parameterized
Contributor:

It feels like parameterized could be incorporated into .circleci/unittest/linux/scripts/environment.yml.

workflow_dispatch:

env:
CHANNEL: "nightly"
Contributor:

Is there an advantage to defining the env var here?

Contributor (Author):

Just following the pattern from the other workflows. If you're suggesting a fix / cleanup, can we do it for all of them?

@rshraga (Contributor, Author) commented Jan 20, 2023:

@mthrok thank you for the comments! Will address these in the next version. When I check the result of unit tests in "Unit-tests on Linux GPU" workflow, I see that the new GPU ones did not even run (only the cpu ones are there)

I checked and saw that the instance used for the job is g5.4xlarge which has a GPU so I am not sure why the GPU tests get skipped. Do you have any ideas how I could debug and get these tests to run?

@mthrok (Contributor) commented Jan 20, 2023:

> @mthrok thank you for the comments! Will address these in the next version. When I check the result of unit tests in "Unit-tests on Linux GPU" workflow, I see that the new GPU ones did not even run (only the cpu ones are there)
>
> I checked and saw that the instance used for the job is g5.4xlarge which has a GPU so I am not sure why the GPU tests get skipped. Do you have any ideas how I could debug and get these tests to run?

This is very peculiar. The GPU tests are not even shown as skipped; they are not being recognized.

A couple of suggestions for debugging:

  1. Install torchtext with python setup.py install instead of python setup.py develop.
  2. Run python3 -m torch.utils.collect_env after cd test.
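Spelled out as commands (a sketch of the two suggestions above; collect_env reports the torch build, CUDA version, and GPU visibility that the test process actually sees):

```shell
python setup.py install              # instead of `python setup.py develop`
cd test
python3 -m torch.utils.collect_env   # confirm the environment the tests see
```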

@Nayef211 (Contributor) left a comment:

Since we're no longer just enabling GPU tests, can we update the PR description to go into details on the changes that were made and the reasoning behind them?

@rshraga rshraga changed the title added gpu test Add GPU testing config and GPU roberta model tests Jan 24, 2023
@rshraga rshraga marked this pull request as ready for review January 24, 2023 21:13
from ..common.torchtext_test_case import TorchtextTestCase
from .models_test_impl import BaseTestModels


@skipIfNoCuda
@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA is not available")
Contributor:

Any reason we're no longer using a custom skipIfNoCuda decorator to do this check? Wouldn't it be cleaner for the future, when we have many GPU-specific unit tests?

Contributor:

Is the issue with the GPU tests not executing resolved? If not, then this appears to be the number-one suspect to me.

The unit test suite in torchaudio (which torchtext's suite is based off of) was originally, and still is, written with the Python unittest module. No pytest code is used (though the pytest command is used to run it). The reasons are 1. to reuse the helper functions from PyTorch, and 2. fbcode did not support pytest.

One of the biggest downsides is that unittest does not natively support parameterization, so we used parameterized. parameterized (and pytest as well) employs many tricks to manipulate metadata and class/type objects, which we wouldn't (and shouldn't) do in normal circumstances when writing a library, so they often show strange behavior.

So far, unittest + parameterized for writing tests and pytest for running them has worked fine. But this one seems to combine all three to write tests, which we avoided in the audio repo and which could have unexpected effects under the hood.

Contributor (Author):

Yeah, using skipIfNoCuda was not working properly, but using pytest.mark.skipif (as vision uses) works fine.

Contributor:

torchvision's test suites are based on pytest. So that's a bit different.
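For reference, a custom decorator of the kind discussed can be built on the stdlib's unittest.skipUnless. This is a hedged sketch, not the PR's code: the helper name and the guarded CUDA check are assumptions, and the PR ultimately went with pytest.mark.skipif instead.

```python
import unittest

def _cuda_available():
    # Guarded import so the helper also works where torch is absent.
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

# Hypothetical skipIfNoCuda decorator, similar in spirit to torchaudio's
# skip helpers: the whole class is skipped when CUDA is unavailable.
skipIfNoCuda = unittest.skipUnless(_cuda_available(), "CUDA is not available")

@skipIfNoCuda
class TestModelsGPU(unittest.TestCase):
    def test_placeholder(self):
        pass

# Whether or not CUDA is present, the suite completes cleanly: the test
# either runs or is recorded as skipped, never as a failure.
result = unittest.TestResult()
unittest.TestLoader().loadTestsFromTestCase(TestModelsGPU).run(result)
print(result.testsRun, len(result.failures))
```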

@Nayef211 (Contributor):

Another thing to consider is whether we gain any benefits from running all tests on GPU enabled hosts. Since these tests are already being run on CPU hosts, do we want to save time and resources by only selecting GPU specific tests to be run on GPU hosts?

@rshraga (Contributor, Author) commented Jan 25, 2023:

> Another thing to consider is whether we gain any benefits from running all tests on GPU enabled hosts. Since these tests are already being run on CPU hosts, do we want to save time and resources by only selecting GPU specific tests to be run on GPU hosts?

The pattern in the other libraries is the same as here: CPU machines skip the GPU tests, while GPU machines run everything. I think doing what you describe would require keeping all CPU tests and all GPU tests in separate folders, which would make things look a bit messy imo.

fi

# Create Conda Env
conda create -yp ci_env python="${PYTHON_VERSION}"
Contributor:

Can you make all the calls to conda and pip quiet? The CI log is super long but not providing much value.
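For reference, conda and pip both take quiet flags, so the script lines quoted in this PR could be silenced along these lines (a sketch; the exact lines in the CI scripts may differ):

```shell
conda create -q -yp ci_env python="${PYTHON_VERSION}"
conda env update -q --file ".circleci/unittest/linux/scripts/environment.yml" --prune
python3 -m pip install -q --pre torchdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```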

@joecummings (Contributor):

> Another thing to consider is whether we gain any benefits from running all tests on GPU enabled hosts. Since these tests are already being run on CPU hosts, do we want to save time and resources by only selecting GPU specific tests to be run on GPU hosts?

I would be in favor of this approach. Currently the only thing that needs to be run on GPU is modeling and our tests already take a long time.

@mthrok (Contributor) commented Jan 25, 2023:

> Another thing to consider is whether we gain any benefits from running all tests on GPU enabled hosts. Since these tests are already being run on CPU hosts, do we want to save time and resources by only selecting GPU specific tests to be run on GPU hosts?
>
> I would be in favor of this approach. Currently the only thing that needs to be run on GPU is modeling and our tests already take a long time.

I gave it some thought. It's not very relevant here, but if you have custom kernel implementations for both CPU and GPU, then it's worth running the tests for both implementations on the CPU-only build and the GPU-only build.

If the GPU code is written in Python only (which I believe is the case for torchtext), then running tests on a GPU machine to check that devices are properly propagated should be enough, because it is PyTorch's premise that the basic ops are implemented properly on both CPU and GPU.

@Nayef211 (Contributor):

> I gave some thoughts and it's not much relevant here but if you have custom kernel implementations for both CPU/GPU, then it's worth running the tests for both implementations on CPU-only build and GPU-only build.
>
> If the GPU code is written in Python only (which I believe is the case for torchtext), then running tests on GPU machine to check that devices are properly propagated should be enough, because it's PyTorch's premise that they implement the basic ops in CPU and GPU properly.

@mthrok, just to make sure I understand, are you suggesting we do go forward with only running GPU specific tests on GPU hosts or to go with the status quo of running all tests on GPUs?

@mthrok (Contributor) commented Jan 26, 2023:

> I gave some thoughts and it's not much relevant here but if you have custom kernel implementations for both CPU/GPU, then it's worth running the tests for both implementations on CPU-only build and GPU-only build.
> If the GPU code is written in Python only (which I believe is the case for torchtext), then running tests on GPU machine to check that devices are properly propagated should be enough, because it's PyTorch's premise that they implement the basic ops in CPU and GPU properly.
>
> @mthrok, just to make sure I understand, are you suggesting we do go forward with only running GPU specific tests on GPU hosts or to go with the status quo of running all tests on GPUs?

I think it's okay to run just the GPU tests for Python-based implementations. But when you have custom ops with CPU and GPU implementations, I think it's good to test both the CPU and GPU versions of those ops.

@Nayef211 (Contributor):

LGTM! Feel free to merge :)

@rshraga rshraga merged commit 961dc67 into main Jan 26, 2023
@rshraga rshraga deleted the add_gpu_test branch January 26, 2023 20:19
@joecummings joecummings mentioned this pull request Jan 26, 2023
strategy:
matrix:
python_version: ["3.8"]
cuda_arch_version: ["11.6"]
Contributor:

They are dropping support for 11.6, so it might be better to move to 11.7.

https://pytorch.slack.com/archives/C2077MFDL/p1675256463971369

Contributor:

Thanks for taking care of this in #2040

5 participants