
Add GPU testing config and GPU roberta model tests #2025

Merged: 8 commits into main, Jan 26, 2023

Conversation

@rshraga (Contributor) commented Jan 18, 2023:

This PR adds a unit testing workflow to .github/workflows which runs on a machine with CUDA. At the same time, it modifies the torchtext_unittest/models tests: the main test file test_models.py is renamed to models_test_impl.py, and the main test class TestModels is renamed to BaseTestModels. It then adds separate CPU and GPU unit tests which inherit from BaseTestModels. With this change, only the CPU and GPU unit tests are actually run, not BaseTestModels itself. A similar pattern can be used for GPU unit tests elsewhere.
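The resulting pattern can be sketched with a minimal, self-contained example (the single-module layout and the test body are illustrative, not the PR's actual code; only the class names follow the description above):

```python
# models_test_impl.py (sketch): the shared test logic lives in a plain class
# that does NOT extend the TestCase base, so it is never collected directly.
import sys
import unittest

class BaseTestModels:
    device = None  # supplied by the concrete CPU/GPU subclasses

    def test_device_is_set(self):
        # Stand-in for a real model test that would run on self.device.
        self.assertIsNotNone(self.device)

# CPU/GPU test files (sketch): only these concrete classes are collected.
class TestModelsCPU(BaseTestModels, unittest.TestCase):
    device = "cpu"

class TestModelsGPU(BaseTestModels, unittest.TestCase):
    device = "cuda"

# The loader finds one test per concrete subclass and none for the base,
# because the base is not a TestCase.
suite = unittest.TestLoader().loadTestsFromModule(sys.modules[__name__])
print(suite.countTestCases())  # 2
```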

@joecummings (Contributor):

This is awesome, @rshraga, thank you! As part of this PR, can you add a very simple GPU test for T5? That will also make sure that we have the architecture to ensure GPU tests only run on the GPU and CPU tests only run on the CPU. I think you can borrow some logic from torchaudio on how to do this. And let's get @mthrok 's eyes on this, too.

@rshraga rshraga marked this pull request as draft January 19, 2023 16:32
@mthrok (Contributor) left a comment:

Overall it looks good.

printf "* Downloading SpaCy German models\n"
python -m spacy download de_core_news_sm

# Install PyTorch, Torchvision, and TorchData
Contributor:

vision?

Contributor (Author):

removed

conda env update --file ".circleci/unittest/linux/scripts/environment.yml" --prune

# TorchText-specific Setup
printf "* Downloading SpaCy English models\n"
Contributor:

Off topic, but I feel like this could be run in the background with &, waiting for the process to finish before the tests.

Contributor (Author):

I mostly copied this over from the other GitHub workflow files. If improvements are needed, can we do that in a separate PR?

@mthrok (Contributor), Jan 25, 2023:

Yeah, it's just a random thought. No need to follow up.
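For what it's worth, the backgrounding idea could look like this (a hypothetical sketch, not adopted in the PR):

```shell
# Start both SpaCy model downloads in the background, then block until
# they finish before the tests run.
python -m spacy download en_core_web_sm &
python -m spacy download de_core_news_sm &
wait
```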

@@ -37,11 +38,34 @@ def get_temp_path(self, *paths):
return path


class TestBaseMixin:
Contributor:

Do we need to introduce this for the sake of adding GPU tests? I'd split the PR.

NVM. After looking at the rest of the code, I see this is part of enabling GPU test.

from torch.nn import functional as torch_F

from ..common.torchtext_test_case import TorchtextTestCase
Contributor:

Is it necessary and right to replace TorchtextTestCase? Making TestBaseMixin part of TorchtextTestCase would be more aligned with the original design in torchaudio. (Though I don't necessarily know all the details of the torchtext test suite, so I could be wrong here.)

Contributor (Author):

Yes, because pytest will run the tests of every class that extends TorchtextTestCase, including the base class itself, so BaseTestModels must not extend it directly.

Contributor:

That makes sense. Thanks for clarifying
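For illustration, here is a minimal demonstration (hypothetical names) of the problem being avoided: unittest and pytest collect tests from every TestCase subclass, including a shared base class, so a base that extends the test-case class gets its tests run an extra time:

```python
import sys
import unittest

# If the shared base itself extends TestCase ...
class BadBaseTestModels(unittest.TestCase):
    def test_something(self):
        self.assertTrue(True)

class BadTestModelsCPU(BadBaseTestModels):
    pass

class BadTestModelsGPU(BadBaseTestModels):
    pass

# ... the loader collects the inherited test three times: once for each
# concrete subclass AND once for the base class itself.
suite = unittest.TestLoader().loadTestsFromModule(sys.modules[__name__])
print(suite.countTestCases())  # 3
```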

torch.random.manual_seed(2434)

@property
def complex_dtype(self):
Contributor:

Complex tensor support is something we needed in audio. If text does not use it, I recommend not adding it. The lighter the test fixtures are, the better.

Contributor (Author):

removed, thanks!

printf "Installing torchdata nightly\n"
python3 -m pip install --pre torchdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu
python3 setup.py develop
python3 -m pip install parameterized
Contributor:

It feels like parameterized could be incorporated into .circleci/unittest/linux/scripts/environment.yml.

workflow_dispatch:

env:
CHANNEL: "nightly"
Contributor:

Is there an advantage to defining the env var here?

Contributor (Author):

Just following the pattern from the other workflows. If you're suggesting a fix / cleanup, can we do it for all of them?

@rshraga (Contributor, Author) commented Jan 20, 2023:

@mthrok thank you for the comments! Will address these in the next version. When I check the result of unit tests in "Unit-tests on Linux GPU" workflow, I see that the new GPU ones did not even run (only the cpu ones are there)

I checked and saw that the instance used for the job is g5.4xlarge which has a GPU so I am not sure why the GPU tests get skipped. Do you have any ideas how I could debug and get these tests to run?

@mthrok (Contributor) commented Jan 20, 2023:

> @mthrok thank you for the comments! Will address these in the next version. When I check the result of unit tests in "Unit-tests on Linux GPU" workflow, I see that the new GPU ones did not even run (only the cpu ones are there)
>
> I checked and saw that the instance used for the job is g5.4xlarge which has a GPU so I am not sure why the GPU tests get skipped. Do you have any ideas how I could debug and get these tests to run?

This is very peculiar. The GPU tests are not even shown as skipped; they are not being recognized.

A couple of suggestions for debugging:

  1. Install torchtext with python setup.py install instead of python setup.py develop.
  2. Run python3 -m torch.utils.collect_env after cd test.
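Spelled out as commands (a sketch of the two suggestions above; collect_env reports the torch build, CUDA version, and GPU visibility that the test process actually sees):

```shell
python setup.py install              # instead of `python setup.py develop`
cd test
python3 -m torch.utils.collect_env   # confirm the environment the tests see
```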

@Nayef211 (Contributor) left a comment:

Since we're no longer just enabling GPU tests, can we update the PR description to go into details on the changes that were made and the reasoning behind them?

@rshraga rshraga changed the title added gpu test Add GPU testing config and GPU roberta model tests Jan 24, 2023
@rshraga rshraga marked this pull request as ready for review January 24, 2023 21:13
from ..common.torchtext_test_case import TorchtextTestCase
from .models_test_impl import BaseTestModels


@skipIfNoCuda
@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA is not available")
Contributor:

Any reason we're no longer using a custom skipIfNoCuda decorator to do this check? Wouldn't it be cleaner for the future, when we have many GPU-specific unit tests?

Contributor:

Is the issue with the GPU tests not executing resolved? If not, then this appears to be the number-one suspect to me.

The unit test suite in torchaudio (which torchtext's suite is based off of) was originally, and still is, written with the Python unittest module. No pytest code is used (though the pytest command is used to run it). The reasons are 1. to reuse the helper functions from PyTorch, and 2. fbcode did not support pytest.

One of the biggest downsides is that unittest does not natively support parameterization, so we used parameterized. parameterized (and pytest as well) employs many tricks to manipulate metadata and class/type objects, which we wouldn't (and shouldn't) do in normal circumstances when writing a library, so they often show strange behavior.

So far, unittest + parameterized for writing tests and pytest for running them has worked fine. But this one seems to combine all three to write tests, which we avoided in the audio repo and which could have unexpected effects under the hood.

Contributor (Author):

Yeah, using skipIfNoCuda was not working properly, but using pytest.mark.skipif (as vision uses) works fine.

Contributor:

torchvision's test suites are based on pytest. So that's a bit different.
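For reference, a custom decorator of the kind discussed can be built on the stdlib's unittest.skipUnless. This is a hedged sketch, not the PR's code: the helper name and the guarded CUDA check are assumptions, and the PR ultimately went with pytest.mark.skipif instead.

```python
import unittest

def _cuda_available():
    # Guarded import so the helper also works where torch is absent.
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

# Hypothetical skipIfNoCuda decorator, similar in spirit to torchaudio's
# skip helpers: the whole class is skipped when CUDA is unavailable.
skipIfNoCuda = unittest.skipUnless(_cuda_available(), "CUDA is not available")

@skipIfNoCuda
class TestModelsGPU(unittest.TestCase):
    def test_placeholder(self):
        pass

# Whether or not CUDA is present, the suite completes cleanly: the test
# either runs or is recorded as skipped, never as a failure.
result = unittest.TestResult()
unittest.TestLoader().loadTestsFromTestCase(TestModelsGPU).run(result)
print(result.testsRun, len(result.failures))
```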

@Nayef211 (Contributor):

Another thing to consider is whether we gain any benefits from running all tests on GPU enabled hosts. Since these tests are already being run on CPU hosts, do we want to save time and resources by only selecting GPU specific tests to be run on GPU hosts?

@rshraga (Contributor, Author) commented Jan 25, 2023:

> Another thing to consider is whether we gain any benefits from running all tests on GPU enabled hosts. Since these tests are already being run on CPU hosts, do we want to save time and resources by only selecting GPU specific tests to be run on GPU hosts?

The pattern in the other libraries is the same as here: CPU machines skip the GPU tests, while GPU machines run everything. I think doing what you describe would require keeping all CPU tests and all GPU tests in separate folders, which would make things look a bit messy imo.

fi

# Create Conda Env
conda create -yp ci_env python="${PYTHON_VERSION}"
Contributor:

Can you make all the calls to conda and pip quiet? The CI log is super long but not providing much value.
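For reference, conda and pip both take quiet flags, so the script lines quoted in this PR could be silenced along these lines (a sketch; the exact lines in the CI scripts may differ):

```shell
conda create -q -yp ci_env python="${PYTHON_VERSION}"
conda env update -q --file ".circleci/unittest/linux/scripts/environment.yml" --prune
python3 -m pip install -q --pre torchdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```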

@joecummings (Contributor):

> Another thing to consider is whether we gain any benefits from running all tests on GPU enabled hosts. Since these tests are already being run on CPU hosts, do we want to save time and resources by only selecting GPU specific tests to be run on GPU hosts?

I would be in favor of this approach. Currently the only thing that needs to be run on GPU is modeling and our tests already take a long time.

@mthrok (Contributor) commented Jan 25, 2023:

> Another thing to consider is whether we gain any benefits from running all tests on GPU enabled hosts. Since these tests are already being run on CPU hosts, do we want to save time and resources by only selecting GPU specific tests to be run on GPU hosts?
>
> I would be in favor of this approach. Currently the only thing that needs to be run on GPU is modeling and our tests already take a long time.

I gave it some thought. It's not very relevant here, but if you have custom kernel implementations for both CPU and GPU, then it's worth running the tests for both implementations on the CPU-only build and the GPU-only build.

If the GPU code is written in Python only (which I believe is the case for torchtext), then running tests on a GPU machine to check that devices are properly propagated should be enough, because it is PyTorch's premise that the basic ops are implemented properly on both CPU and GPU.

@Nayef211 (Contributor):

> I gave some thoughts and it's not much relevant here but if you have custom kernel implementations for both CPU/GPU, then it's worth running the tests for both implementations on CPU-only build and GPU-only build.
>
> If the GPU code is written in Python only (which I believe is the case for torchtext), then running tests on GPU machine to check that devices are properly propagated should be enough, because it's PyTorch's premise that they implement the basic ops in CPU and GPU properly.

@mthrok, just to make sure I understand, are you suggesting we do go forward with only running GPU specific tests on GPU hosts or to go with the status quo of running all tests on GPUs?

@mthrok (Contributor) commented Jan 26, 2023:

> I gave some thoughts and it's not much relevant here but if you have custom kernel implementations for both CPU/GPU, then it's worth running the tests for both implementations on CPU-only build and GPU-only build.
> If the GPU code is written in Python only (which I believe is the case for torchtext), then running tests on GPU machine to check that devices are properly propagated should be enough, because it's PyTorch's premise that they implement the basic ops in CPU and GPU properly.
>
> @mthrok, just to make sure I understand, are you suggesting we do go forward with only running GPU specific tests on GPU hosts or to go with the status quo of running all tests on GPUs?

I think it's okay to run just the GPU tests for Python-based implementations. But when you have custom ops with CPU and GPU implementations, I think it's good to test both the CPU and GPU versions of those ops.

@Nayef211 (Contributor):

LGTM! Feel free to merge :)

@rshraga rshraga merged commit 961dc67 into main Jan 26, 2023
@rshraga rshraga deleted the add_gpu_test branch January 26, 2023 20:19
@joecummings joecummings mentioned this pull request Jan 26, 2023
strategy:
matrix:
python_version: ["3.8"]
cuda_arch_version: ["11.6"]
Contributor:

They are dropping support for 11.6, so it might be better to move to 11.7.

https://pytorch.slack.com/archives/C2077MFDL/p1675256463971369

Contributor:

Thanks for taking care of this in #2040

5 participants