Ensure 'pip wheel' can create .so artifacts deterministically  #6505

@thundergolfer

What's the problem this feature will solve?

A major selling point of the Bazel build system is its support for both local and remote caching.

For that caching to work, however, Bazel targets must be built deterministically, so that the same target always has the same content-addressable hash.

Currently pip wheel is non-deterministic, so our Python Bazel targets will miss the cache if they depend on something built with pip wheel.
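
To make "content-addressable hash" concrete, here is a minimal sketch of how a digest like the ones in a Bazel execution log could be computed. The function names `digest_bytes` and `file_digest` are hypothetical, and this is not Bazel's actual implementation; it only illustrates that any byte-level difference in a build output produces a different digest, and therefore a cache miss.

```python
import hashlib
from pathlib import Path


def digest_bytes(data: bytes) -> dict:
    """Return a digest record shaped like the entries in a Bazel execution log.

    Hypothetical helper for illustration only, not Bazel's own code.
    """
    return {
        "hash": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "hash_function_name": "SHA-256",
    }


def file_digest(path: Path) -> dict:
    """Digest a file on disk by hashing its full contents."""
    return digest_bytes(path.read_bytes())


# Two builds that differ in a single byte get different digests,
# so a cache keyed on the digest treats them as unrelated inputs.
build_a = digest_bytes(b"compiled output, build A")
build_b = digest_bytes(b"compiled output, build B")
print(build_a["hash"] == build_b["hash"])
```
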

Describe the solution you'd like

Note: The following is an excerpt from a Bazel execution log. It is somewhat removed from the pip wheel command itself, but it shows the relevant information.

inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/LICENSE"
  digest {
    hash: "a2adb9c959b797494a0ef80bdf60e22db2749ee3e0c0908556e3eb548f967c56"
    size_bytes: 1101
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/METADATA"
  digest {
    hash: "df7bc0c7cbd2ce350c5c61ceda3a74bbcb6f82446a7c01f7f8e1034a98df231f"
    size_bytes: 1704
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/RECORD"
  digest {
    hash: "6fe803b74ab4fcab1f23e96060cf062d12779598af7e72692c492c2dd7cad0ed"
    size_bytes: 1701
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/WHEEL"
  digest {
    hash: "cdf2c8f141bc498ae490a88870d655dd174abe3db8c1f57562224b168930c624"
    size_bytes: 104
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/top_level.txt"
  digest {
    hash: "ae98f42153138ac02387fd6f1b709c7fdbf98e9090c00cfa703d48554e597614"
    size_bytes: 11
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/_yaml.cpython-36m-x86_64-linux-gnu.so"
  digest {
    hash: "a7f3774015f839ccee5e2281bbfdf22a42e0e1dafaac33ef4c91db83a07210d9"
    size_bytes: 1133288
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/yaml/__init__.py"
  digest {
    hash: "2af8b6dbcb1df5c63597f215421cad02f2317e291061b181b0f7bbf4f71ac0dd"
    size_bytes: 12012
    hash_function_name: "SHA-256"
  }
}

The log above lists a subset of the build outputs of the PyYAML package. Of these outputs, the RECORD file and the _yaml.cpython-36m-x86_64-linux-gnu.so shared object file are the ones whose hashes change from build to build. I inspected the RECORD file and found that it contains the hash of the .so file, so it is non-deterministic because of the .so file and, I think, only because of that.
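
The dependency of RECORD on the .so file can be sketched as follows. A wheel's RECORD file lists each installed file as `path,sha256=<urlsafe base64 digest without padding>,size`. The helper name `record_entry` and the byte strings standing in for two builds are hypothetical (the second directory name, `pip-wheel-x1y2z3a4`, is invented for illustration); this is not pip's code, only a sketch of why a differing .so changes the RECORD line and hence the RECORD file's own hash.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build one RECORD line for a file, in the wheel RECORD format.

    Hypothetical helper for illustration; pip and wheel have their own code.
    """
    digest = hashlib.sha256(data).digest()
    # RECORD uses URL-safe base64 with the trailing '=' padding removed.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


so_name = "_yaml.cpython-36m-x86_64-linux-gnu.so"

# Stand-ins for two builds of the same extension module. Only the embedded
# temporary directory name differs (the second name is made up).
build_1 = b"\x7fELF...\x00/tmp/pip-wheel-_bd8v3f2/pyyaml\x00..."
build_2 = b"\x7fELF...\x00/tmp/pip-wheel-x1y2z3a4/pyyaml\x00..."

entry_1 = record_entry(so_name, build_1)
entry_2 = record_entry(so_name, build_2)

# The two RECORD lines differ, so the RECORD files differ as well.
print(entry_1 == entry_2)
```
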

So the problem is the .so file.

I ran the strings program on the .so file and found this printable string: /tmp/pip-wheel-_bd8v3f2/pyyaml. That path comes from here:

with TempDirectory(kind="wheel") as temp_dir:

So while I found other differences between builds of _yaml.cpython-36m-x86_64-linux-gnu.so, the leaked temporary directory path is by itself sufficient to break determinism.
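
For anyone wanting to check other artifacts for the same leak, here is a rough Python stand-in for the `strings` inspection described above. The function name `leaked_build_paths` is hypothetical, and the regular expression assumes the temporary directory always looks like `/tmp/pip-wheel-<letters, digits, underscores>`, which matches the one observed path but is otherwise an assumption.

```python
import re
from typing import List

# Assumed shape of the leaked path: /tmp/pip-wheel-<random suffix>, optionally
# followed by more printable, non-space ASCII characters such as /pyyaml.
TMP_PATH = re.compile(rb"/tmp/pip-wheel-[A-Za-z0-9_]+(?:/[\x21-\x7e]*)?")


def leaked_build_paths(data: bytes) -> List[str]:
    """Return the distinct pip temporary build paths embedded in binary data."""
    matches = {m.group(0).decode("ascii") for m in TMP_PATH.finditer(data)}
    return sorted(matches)


# Stand-in for the contents of a compiled extension module.
sample = b"\x7fELF\x00\x01/tmp/pip-wheel-_bd8v3f2/pyyaml\x00more bytes"
print(leaked_build_paths(sample))
```

To scan a real file, the same function could be called with `Path(...).read_bytes()` on the .so in question. An empty result would only rule out this one source of non-determinism, not the other differences mentioned above.
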

Additional context

rules_python issue discussing this problem: bazel-contrib/rules_python#154
rules_python repo: https://github.com/bazelbuild/rules_python
