
CHORE: Use faster test translation scenario, cut CI time by ~5mins #3046

Merged
Merged 8 commits into master from chore/faster-tests on Jun 27, 2023

Conversation

connortann
Collaborator

@connortann connortann commented Jun 26, 2023

Supports #3045

Overview

Changes the model used in the translation scenario to a much smaller one that runs much faster:
https://huggingface.co/mesolitica/finetune-translation-t5-super-super-tiny-standard-bahasa-cased

Timings

The change seems to save ~5 min on Linux, and 7+ min on MacOS.

On Linux GH runner python 3.11, test timings before:

80.04s call     tests/explainers/test_partition.py::test_translation
76.44s call     tests/explainers/test_partition.py::test_translation_auto
76.27s call     tests/explainers/test_partition.py::test_translation_algorithm_arg
74.24s call     tests/explainers/test_partition.py::test_serialization
73.27s call     tests/explainers/test_partition.py::test_serialization_custom_model_save
69.42s call     tests/explainers/test_partition.py::test_serialization_no_model_or_masker

Test timings after:

22.75s call     tests/explainers/test_partition.py::test_translation
<19s   call     tests/explainers/test_partition.py::test_translation_auto
<19s   call     tests/explainers/test_partition.py::test_translation_algorithm_arg
20.38s call     tests/explainers/test_partition.py::test_serialization
20.70s call     tests/explainers/test_partition.py::test_serialization_custom_model_save
19.80s call     tests/explainers/test_partition.py::test_serialization_no_model_or_masker

Overall that's 328 seconds faster on Python 3.11 🎉

Timings vary between python versions and platforms, so the overall average speedup may differ.
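As a sanity check on the quoted figure, summing the per-test timings above reproduces the ~328 s saving (the two "<19s" entries are approximated as 19.0, which is an assumption):

```python
# Per-test timings from the tables above, in seconds (Linux runner, Python 3.11).
before = [80.04, 76.44, 76.27, 74.24, 73.27, 69.42]
# The two "<19s" rows are read as 19.0 for this estimate.
after = [22.75, 19.0, 19.0, 20.38, 20.70, 19.80]

saving = sum(before) - sum(after)
print(round(saving))  # ~328 seconds saved per run on this platform
```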

Note about protobuf

This new model requires that we use protobuf<=3.20.x; otherwise a TypeError is thrown when the tokenizer is loaded. There is a related thread on Stack Overflow here.

Here is the full traceback:

____________ ERROR at setup of test_serialization_custom_model_save ____________

    @pytest.mark.skipif(sys.platform == 'win32', reason="Integer division bug in HuggingFace on Windows")
    @pytest.fixture(scope="session")
    def basic_translation_scenario():
        """ Create a basic transformers translation model and tokenizer.
        """
        AutoTokenizer = pytest.importorskip("transformers").AutoTokenizer
        AutoModelForSeq2SeqLM = pytest.importorskip("transformers").AutoModelForSeq2SeqLM
    
        # Use a very small model, for speed
        name = "mesolitica/finetune-translation-t5-super-super-tiny-standard-bahasa-cased"
>       tokenizer = AutoTokenizer.from_pretrained(name)

tests/explainers/conftest.py:16: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:691: in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1825: in from_pretrained
    return cls._from_pretrained(
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1988: in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/models/t5/tokenization_t5_fast.py:133: in __init__
    super().__init__(
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:114: in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/convert_slow_tokenizer.py:1307: in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/convert_slow_tokenizer.py:445: in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/transformers/utils/sentencepiece_model_pb2.py:91: in <module>
    _descriptor.EnumValueDescriptor(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'google.protobuf.descriptor.EnumValueDescriptor'>, name = 'UNIGRAM'
index = 0, number = 1, type = None, options = None, serialized_options = None
create_key = <object object at 0x7f886b0750d0>

    def __new__(cls, name, index, number,
                type=None,  # pylint: disable=redefined-builtin
                options=None, serialized_options=None, create_key=None):
>     _message.Message._CheckCalledFromGeneratedFile()
E     TypeError: Descriptors cannot not be created directly.
E     If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
E     If you cannot immediately regenerate your protos, some other possible workarounds are:
E      1. Downgrade the protobuf package to 3.20.x or lower.
E      2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
E     
E     More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

In the future, we might be able to relax this pin if the transformers library is updated, or if we find an alternative tokenizer model that was trained with a more recent version of protobuf.
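One way to make the constraint explicit in the test suite would be a small version guard like the sketch below; the helper name and the idea of skipping at fixture setup are illustrative, not part of this PR:

```python
def protobuf_is_compatible(ver: str) -> bool:
    """Return True if `ver` (e.g. google.protobuf.__version__) satisfies the <=3.20.x pin."""
    major, minor = (int(part) for part in ver.split(".")[:2])
    return (major, minor) <= (3, 20)

# A fixture could then skip with a clear message instead of hitting the TypeError above:
# if not protobuf_is_compatible(google.protobuf.__version__):
#     pytest.skip("translation scenario requires protobuf<=3.20.x")
```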

@connortann connortann added the ci Relating to Continuous Integration / GitHub Actions label Jun 26, 2023
@connortann connortann changed the title Use faster test translation scenario CHORE: Use faster test translation scenario Jun 26, 2023

codecov bot commented Jun 26, 2023

Codecov Report

Merging #3046 (1bb00b8) into master (9d72ec7) will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #3046   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files          90      90           
  Lines       12850   12850           
======================================
  Misses      12850   12850           


@connortann connortann changed the title CHORE: Use faster test translation scenario CHORE: Use faster test translation scenario, faster tests by ~5min Jun 27, 2023
@connortann connortann added the enhancement Indicates new feature requests label Jun 27, 2023
@connortann connortann self-assigned this Jun 27, 2023
@connortann connortann changed the title CHORE: Use faster test translation scenario, faster tests by ~5min CHORE: Use faster test translation scenario Jun 27, 2023
@connortann connortann marked this pull request as ready for review June 27, 2023 10:04
@connortann connortann changed the title CHORE: Use faster test translation scenario CHORE: Use faster test translation scenario, cut CI time by ~5mins Jun 27, 2023
Collaborator

@thatlittleboy thatlittleboy left a comment


Great work!


@thatlittleboy thatlittleboy merged commit 8f9f7d1 into master Jun 27, 2023
15 checks passed
@connortann connortann deleted the chore/faster-tests branch June 27, 2023 15:16