Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable localisation when SOURCE_DATE_EPOCH is set #10949

Conversation

jayaddison
Copy link
Contributor

@jayaddison jayaddison commented Oct 31, 2022

Subject: disables localization when reproducible builds are requested, as determined by a non-empty SOURCE_DATE_EPOCH environment variable.

Feature or Bugfix

  • Bugfix

Purpose

The Reproducible Builds project aims to provide confidence to consumers of packaged software that the artefacts they're downloading and installing have not been altered by the environment they were built in, and can be replicated at a later date if required.

The project performs continuous reproducibility testing of many different software packages - many of which fail reproducibility tests for various reasons. The project then identifies ways to fix the causes of differences, or filter out irrelevant differences that should not affect the software's practical behaviour.

Builds of localized documentation using sphinx currently account for a large category of reproducible build testing failures, because the builders (intentionally) use varying environment locales at build-time. This can affect the contents of the objects.inv file.

During investigation, it turned out that many gettext-localized values -- particularly in Python modules under sphinx.domains -- were being translated at module-load-time (in other words, when each Python module was import'ed) -- and would not subsequently be re-localized.

This creates two unusual effects:

  1. Attempting to write a test case to build the same application in two different languages was not initially possible -- the first-loaded translation catalog (as found in the sphinx.locale.translators global variable) -- would remain in-use for subsequent application builds under different locales
  2. Localization of strings could vary depending on whether the relevant modules were loaded before or after the resource catalogs were populated

A possible approach to a fix here is a naive one: all translations are performed lazily so that module imports can occur in any order and localization of inventory entries should occur only when translations of those items are requested (during the objects.inv dump, in the context of the investigation).

With that change in place, it's relatively straightforward to then disable localization (set languages = None) when a reproducible build is requested.

Edit: description correction: the branch has been updated to configure languages = ["und"] instead of languages = None. und is a three-character ISO-639-3 code for 'undetermined' language. This is used to prevent Python's stdlib gettext from attempting to determine the host's locale from various environment variables (including the LANGUAGE variable).

Relates

@jayaddison
Copy link
Contributor Author

Hmm.. the original issue report was specifically about the LANGUAGE environment variable, and that isn't exercised here at the moment. Perhaps I'm on the wrong track here.

@jayaddison jayaddison marked this pull request as draft October 31, 2022 13:25
Comment on lines +115 to +125
if getenv('SOURCE_DATE_EPOCH') is not None:
# Disable localization during reproducible source builds
# See https://reproducible-builds.org/docs/source-date-epoch/
#
# Note: Providing an empty/none value to gettext.translation causes
# it to consult various language-related environment variables to find
# locale(s). We don't want that during a reproducible build; we want
# to run through the same code path, but to return NullTranslations.
#
# To achieve that, specify the ISO-639-3 'undetermined' language code,
# which should not match any translation catalogs.
languages: Optional[List[str]] = ['und']
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may seem overly complicated compared to, for example:

if getenv('SOURCE_DATE_EPOCH') is not None:
    return NullTranslations(), False

...which would achieve similar results.

My thinking here is that the presence/absence of SOURCE_DATE_EPOCH should, ideally, minimize the effect on the control flow of the program and build.

If the environment variable results in completely different control flows, then arguably the program being built is different.

My sense is that the ideal roadmap for SOURCE_DATE_EPOCH is to introduce it as-required to achieve reproducible builds, but also to keep in mind that it should be removed in future when no-longer-required.

(and to be precise in the case of sphinx: a potential time for removal would be when it's possible to perform deterministic documentation builds in all-available-locales -- meaning that the software would be packaged for universal applicability and should build into an identical artefact regardless of the build host's locale)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and to be precise in the case of sphinx: a potential time for removal would be when it's possible to perform deterministic documentation builds in all-available-locales -- meaning that the software would be packaged for universal applicability and should build into an identical artefact regardless of the build host's locale)

If it is realistic to achive this, I think it's a good goal -- would you be happy to open a new issue to track the 'locale independent' workstream?

A

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From searching around before doing that, I think that this would be approximately the same as #788 (although that is specific to HTML, currently).

I'll spend some more time reading before deciding whether to add more comments there or to open a separate ticket.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I appreciate your help here!

A

sphinx/locale/__init__.py Outdated Show resolved Hide resolved
@jayaddison jayaddison marked this pull request as ready for review October 31, 2022 16:03
@jayaddison
Copy link
Contributor Author

Hmm.. the original issue report was specifically about the LANGUAGE environment variable, and that isn't exercised here at the moment. Perhaps I'm on the wrong track here.

To confirm: exporting LANGUAGE environment values before running tox to run unit tests did uncover an unhandled situation here. Essentially, as noted here, gettext.translation consults a number of environment variables to auto-detect a language when the languages parameter is empty.

That means that we can't simply configure an empty language list when running in reproducible build mode; gettext.translation would again consult those other env vars. The solution currently in-place is to use a special language tag: und (undetermined). gettext shouldn't find any translated resources for that language and will return a NullTranslations translator, regardless of the build machine's configured locale(s).

@lamby
Copy link
Contributor

lamby commented Nov 7, 2022

[…] we can't simply configure an empty language list when running in reproducible build mode; gettext.translation would again consult those other env vars. The solution currently in-place is to use a special language tag: und (undetermined). gettext shouldn't find any translated resources for that language and will return a NullTranslations translator, regardless of the build machine's configured locale(s).

Thank you for the summary of what is going on here as well as an overview of the proposed solution. It would be great to see this land in Sphinx.

@jayaddison
Copy link
Contributor Author

After looking back at the changes, I think my use of str typecasts to patch up a few test failure situations is potentially fragile and/or wasteful.

I'm planning to push a few commits soon to move those typecasts earlier in the code flow - to the declaration of the relevant localized string variables.

Even so: the fact that this typecasting was necessary was discovered by unit test coverage, and that makes me concerned that there could be other locations in the code that are affected too (without test coverage to highlight that). I don't have a sense for the level of test coverage that's in place here, though.

…egardless of environment locale) when reproducible builds are enabled
…at localization has an effect on the test application's inventory output
…, since this should not have an associated translation catalog
… LANGUAGE environment variable is configured
… (and confirming that LANGUAGE variable does not affect the result)
…for additional plural forms to the lazy loader translation method
…util.i18n.format_date method (note: mypy doesn't appear to catch this?)
@AA-Turner
Copy link
Member

@jayaddison this is only missing a CHANGES entry, then happy to merge!

A

@AA-Turner
Copy link
Member

AA-Turner commented Apr 6, 2023

Commit message draft, for later:

This commit disables Sphinx's localisation features when reproducible
builds are requested, as determined by a non-empty SOURCE_DATE_EPOCH_
environment variable.

The `Reproducible Builds`_ project aims to provide confidence to
consumers of packaged software that the artefacts they're downloading
and installing have not been altered by the environment they were
built in, and can be replicated at a later date if required.

Builds of localised documentation using Sphinx currently account for
a large category of reproducible build testing failures, because the
builders intentionally use varying environment locales at build-time.
This can affect the contents of the ``objects.inv`` file.

During investigation, it turned out that many ``gettext``-localised
values (particularly in Python modules under ``sphinx.domains``) were
being translated at module-load-time and would not subsequently be
re-localised.

This creates two unusual effects:

1. Attempting to write a test case to build the same application in
   two different languages was not initially possible, as the
   first-loaded translation catalogue (as found in the 
   ``sphinx.locale.translators`` global variable) would remain in-use
   for subsequent application builds under different locales.

2. Localisation of strings could vary depending on whether the
   relevant modules were loaded before or after the resource
   catalogues were populated.

We fix this by performing all translations lazily so that module
imports can occur in any order and localisation of inventory entries
should occur only when translations of those items are requested.

Localisation can then be disabled by configuring the ``gettext``
language to the ISO-639-3 'undetermined' code (``'und'``), as this
should not have an associated translation catalogue. We also want to
prevent ``gettext`` from  attempting to determine the host's locale
from environment variables (including ``LANGUAGE``).

.. _SOURCE_DATE_EPOCH: https://reproducible-builds.org/docs/source-date-epoch/
.. _Reproducible Builds: https://www.reproducible-builds.org/

CHANGES Outdated Show resolved Hide resolved
@AA-Turner AA-Turner changed the title Disable localization of 'objects.inv' directory inventory when reproducible build output is requested Disable localisation when SOURCE_DATE_EPOCH is set Apr 7, 2023
@AA-Turner AA-Turner merged commit f82c3c9 into sphinx-doc:master Apr 7, 2023
@AA-Turner
Copy link
Member

Thank you @jayaddison!

A

@jayaddison
Copy link
Contributor Author

You're welcome - thank you too for the review and merge! I'll keep a lookout for any issue reports related to this in future, especially in the weeks/months after it's released.

@jayaddison jayaddison deleted the issue-9778/reproducible-inventory-file-build branch April 7, 2023 16:57
@AA-Turner AA-Turner added this to the 6.2.0 milestone Apr 8, 2023
jayaddison added a commit to jayaddison/sphinx that referenced this pull request Apr 8, 2023
jayaddison added a commit to jayaddison/sphinx that referenced this pull request Apr 8, 2023
jayaddison added a commit to jayaddison/sphinx that referenced this pull request Apr 8, 2023
jayaddison added a commit to jayaddison/sphinx that referenced this pull request Apr 9, 2023
@jayaddison
Copy link
Contributor Author

After further investigation of the original bug #9778 today, I believe that the fix here was unnecessary; see commentary at #9778 (comment) for details.

It may make sense to scavenge some of the test coverage, but I think that this change should be reverted (and #11306 does that).

@jayaddison
Copy link
Contributor Author

The changes here also cause breakage when help arguments are requested from command-line tools such as sphinx-build; I hadn't tested that during development.

mgeier added a commit to mgeier-forks/sphinx-last-updated-by-git that referenced this pull request Apr 12, 2023
AA-Turner added a commit that referenced this pull request Apr 21, 2023
…10949)" (#11343)

This keeps some of the added tests, and avoids a full revert of ``sphinx.locale``.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LANGUAGE environment variable inconsistently affects output of objects.inv
4 participants