CI: update to TF 2.10 #1160

Merged
merged 6 commits into from Oct 19, 2022

Conversation

albertz
Member

@albertz albertz commented Oct 19, 2022

No description provided.

@albertz
Member Author

albertz commented Oct 19, 2022

Currently fails:

OSError: /var/tmp/runner/returnn_tf_cache/tf_utils/kernels_registered_for_op/86a95c7ddc/kernels_registered_for_op.so: undefined symbol: _ZN10tensorflow22KernelsRegisteredForOpB5cxx11EN4absl12lts_2022062311string_viewE

Or:

tensorflow.python.framework.errors_impl.NotFoundError: /var/tmp/runner/returnn_tf_cache/ops/NextEditDistanceRowOp/eb12481e22/NextEditDistanceRowOp.so: undefined symbol: _ZN10tensorflow8str_util13StringReplaceB5cxx11EN4absl12lts_2022062311string_viewES3_S3_b

Or:

tensorflow.python.framework.errors_impl.NotFoundError: /var/tmp/runner/returnn_tf_cache/ops/EditDistanceOp/5f70f62376/EditDistanceOp.so: undefined symbol: _ZN10tensorflow8str_util13StringReplaceB5cxx11EN4absl12lts_2022062311string_viewES3_S3_b

Or:

tensorflow.python.framework.errors_impl.NotFoundError: /var/tmp/runner/returnn_tf_cache/ops/LstmGenericBase/30db5737ad/LstmGenericBase.so: undefined symbol: _ZN10tensorflow8str_util13StringReplaceB5cxx11EN4absl12lts_2022062311string_viewES3_S3_b
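All four failures point at the same kind of symbol. Demangled, the first one is `tensorflow::KernelsRegisteredForOp[abi:cxx11](absl::lts_20220623::string_view)`, so each compiled op expects a TF function that takes a `string_view` from Abseil's `lts_20220623` inline namespace. A minimal sketch for pulling that namespace out of the mangled names (the helper `absl_lts_namespaces` is hypothetical, not part of RETURNN or TF):

```python
import re

# Mangled names copied from the error messages above.
UNDEFINED_SYMBOLS = [
    "_ZN10tensorflow22KernelsRegisteredForOpB5cxx11EN4absl12lts_2022062311string_viewE",
    "_ZN10tensorflow8str_util13StringReplaceB5cxx11EN4absl12lts_2022062311string_viewES3_S3_b",
]

# Itanium C++ name mangling prefixes every identifier with its length,
# so absl::lts_YYYYMMDD appears as "4absl" followed by "12lts_YYYYMMDD".
_ABSL_LTS_RE = re.compile(r"4absl12(lts_\d{8})")


def absl_lts_namespaces(symbols):
    """Return the set of Abseil LTS inline namespaces referenced by the symbols."""
    found = set()
    for symbol in symbols:
        found.update(_ABSL_LTS_RE.findall(symbol))
    return found


print(absl_lts_namespaces(UNDEFINED_SYMBOLS))  # {'lts_20220623'}
```

If the Abseil headers used when compiling the op and the installed TF library disagreed on that namespace, an undefined symbol like this would be the expected result. That is an assumption here, not something the logs confirm.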

@albertz
Member Author

albertz commented Oct 19, 2022

GitHub Actions seems to pick up a reasonable GCC version. From test_TFNativeOp:

TF compiler version: 9.3.1 20200408
GCC for TF: /usr/bin/gcc-9

Other relevant:

TF __file__: /home/runner/.local/lib/python3.7/site-packages/tensorflow/__init__.py
TF version: 2.10.0
TF describe version: 2.10.0 (v2.10.0-rc3-6-g359c3cdfc5f) (<site-package> in /home/runner/.local/lib/python3.7/site-packages/tensorflow)
TF include: /home/runner/.local/lib/python3.7/site-packages/tensorflow/include
TF lib: /home/runner/.local/lib/python3.7/site-packages/tensorflow
TF link flags: ['-L/home/runner/.local/lib/python3.7/site-packages/tensorflow', '-l:libtensorflow_framework.so.2']
TF compile flags: ['-I/home/runner/.local/lib/python3.7/site-packages/tensorflow/include', '-D_GLIBCXX_USE_CXX11_ABI=1', '-DEIGEN_MAX_ALIGN_BYTES=64']
TF cxx11 abi flag: 1
...
TF lib so does not(!) exist: /home/runner/.local/lib/python3.7/site-packages/tensorflow/libtensorflow_framework.so
TF pywrap so exists: /home/runner/.local/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
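The link and compile flags in this log have the form that `tf.sysconfig.get_link_flags()` and `tf.sysconfig.get_compile_flags()` return. As a rough sketch of how such flags end up in a compiler call (the helper `build_op_compile_cmd` and the file names are hypothetical; RETURNN's actual compile logic may differ):

```python
# Flag values copied from the log above.
TF_COMPILE_FLAGS = [
    "-I/home/runner/.local/lib/python3.7/site-packages/tensorflow/include",
    "-D_GLIBCXX_USE_CXX11_ABI=1",
    "-DEIGEN_MAX_ALIGN_BYTES=64",
]
TF_LINK_FLAGS = [
    "-L/home/runner/.local/lib/python3.7/site-packages/tensorflow",
    "-l:libtensorflow_framework.so.2",
]


def build_op_compile_cmd(src, out, compile_flags, link_flags, compiler="g++"):
    """Assemble an argv list for compiling one custom op into a shared library."""
    return [compiler, "-shared", "-fPIC", src, "-o", out] + list(compile_flags) + list(link_flags)


# Hypothetical source and output file names, for illustration only.
cmd = build_op_compile_cmd("EditDistanceOp.cc", "EditDistanceOp.so", TF_COMPILE_FLAGS, TF_LINK_FLAGS)
print(" ".join(cmd))
```

The `-l:libtensorflow_framework.so.2` form asks the linker for that exact file name, which fits with the unversioned `libtensorflow_framework.so` being reported as missing.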

@albertz
Member Author

albertz commented Oct 19, 2022

I don't really understand: why are some seemingly unrelated tests that use older TF versions failing now? And why with such a weird stack trace? E.g. tf-tests (TEST=TFUtil, 3.7, 1.15.3):

NotImplementedError: Cannot convert a symbolic Tensor (single_strided_slice_3/mod:0) to a numpy array.
__________________________ test_single_strided_slice ___________________________
Traceback (most recent call last):
  File "/home/runner/work/returnn/returnn/tests/test_TFUtil.py", line 1709, in test_single_strided_slice
    line: assert_equal(list(single_strided_slice(x, axis=tf.constant(1), end=3)[0].eval()), [0, 1, 2])
    locals:
      assert_equal = <global> <bound method TestCase.assertEqual of <nose.tools.trivial.Dummy testMethod=nop>>
      list = <builtin> <class 'list'>
      single_strided_slice = <global> <function single_strided_slice at 0x7f406542b830>
      x = <local> <tf.Tensor 'ExpandDims_2:0' shape=(1, 10) dtype=int32>
      axis = <not found>
      tf = <global> <module 'tensorflow' from '/home/runner/.local/lib/python3.7/site-packages/tensorflow/__init__.py'>
      tf.constant = <global> <function constant_v1 at 0x7f406ffd4b90>
      end = <not found>
      eval = <builtin> <built-in function eval>
  File "/home/runner/work/returnn/returnn/returnn/tf/util/basic.py", line 3502, in single_strided_slice
    line: begins = tf.concat([tf.zeros((axis,), tf.int32), (begin,)], axis=0)
    locals:
      begins = <not found>
      tf = <global> <module 'tensorflow' from '/home/runner/.local/lib/python3.7/site-packages/tensorflow/__init__.py'>
      tf.concat = <global> <function concat at 0x7f406fe144d0>
      tf.zeros = <global> <function zeros at 0x7f406fe14dd0>
      axis = <local> <tf.Tensor 'single_strided_slice_3/mod:0' shape=() dtype=int32>
      tf.int32 = <global> tf.int32
      begin = <local> 0
  File "/home/runner/.local/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 2338, in zeros
    line: output = _constant_if_small(zero, shape, dtype, name)
    locals:
      output = <not found>
      _constant_if_small = <global> <function _constant_if_small at 0x7f406fe14a70>
      zero = <local> 0
      shape = <local> (<tf.Tensor 'single_strided_slice_3/mod:0' shape=() dtype=int32>,)
      dtype = <local> tf.int32
      name = <local> 'single_strided_slice_3/zeros/', len = 29
  File "/home/runner/.local/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 2295, in _constant_if_small
    line: if np.prod(shape) < 1000:
    locals:
      np = <global> <module 'numpy' from '/home/runner/.local/lib/python3.7/site-packages/numpy/__init__.py'>
      np.prod = <global> <function prod at 0x7f4089f34dd0>
      shape = <local> (<tf.Tensor 'single_strided_slice_3/mod:0' shape=() dtype=int32>,)
  File "<__array_function__ internals>", line 6, in prod
    -- code not available --
  File "/home/runner/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 3052, in prod
    line: return _wrapreduction(a, np.multiply, 'prod', axis, dtype, out,
                                keepdims=keepdims, initial=initial, where=where)
    locals:
      _wrapreduction = <global> <function _wrapreduction at 0x7f4089f1e3b0>
      a = <local> (<tf.Tensor 'single_strided_slice_3/mod:0' shape=() dtype=int32>,)
      np = <global> <module 'numpy' from '/home/runner/.local/lib/python3.7/site-packages/numpy/__init__.py'>
      np.multiply = <global> <ufunc 'multiply'>
      axis = <local> None
      dtype = <local> None
      out = <local> None
      keepdims = <local> <no value>
      initial = <local> <no value>
      where = <local> <no value>
  File "/home/runner/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    line: return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
    locals:
      ufunc = <local> <ufunc 'multiply'>
      ufunc.reduce = <local> <built-in method reduce of numpy.ufunc object at 0x7f408a85d050>
      obj = <local> (<tf.Tensor 'single_strided_slice_3/mod:0' shape=() dtype=int32>,)
      axis = <local> None
      dtype = <local> None
      out = <local> None
      passkwargs = <local> {}
  File "/home/runner/.local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 736, in Tensor.__array__
    line: raise NotImplementedError("Cannot convert a symbolic Tensor ({}) to a numpy"
                                    " array.".format(self.name))
    locals:
      NotImplementedError = <builtin> <class 'NotImplementedError'>
      format = <builtin> <built-in function format>
      self = <local> <tf.Tensor 'single_strided_slice_3/mod:0' shape=() dtype=int32>
      self.name = <local> 'single_strided_slice_3/mod:0', len = 28
NotImplementedError: Cannot convert a symbolic Tensor (single_strided_slice_3/mod:0) to a numpy array.

And when you look up _constant_if_small, actually it should cover that?

Maybe the GitHub CI cache is somehow messing up different TF versions?

NumPy: 1.21.6
TensorFlow: v1.15.2-30-g4386a66 1.15.3 /home/runner/.local/lib/python3.7/site-packages/tensorflow/__init__.py

In the log I see:

      tf = <global> <module 'tensorflow' from '/home/runner/.local/lib/python3.7/site-packages/tensorflow/__init__.py'>

And then:

  File "/home/runner/.local/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 2338, in zeros

So tensorflow and tensorflow_core. Is that correct?

Or maybe the NumPy version?

@albertz
Member Author

albertz commented Oct 19, 2022

> And when you look up _constant_if_small, actually it should cover that?

Actually no, it does not, at least not in TF 2.3 and probably all earlier versions: https://github.com/tensorflow/tensorflow/blob/v2.3.0/tensorflow/python/ops/array_ops.py#L2730
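To illustrate why a guard of that shape does not help here: a `try` block that catches only `TypeError` lets `NotImplementedError` through, since `NotImplementedError` derives from `RuntimeError`. The sketch below assumes, based on the linked source, that `_constant_if_small` wraps the `np.prod(shape) < 1000` check in such a guard; `small_shape_guard` is a hypothetical stand-in, not TF code:

```python
def small_shape_guard(compute_size):
    """Return True/False for a concrete size, or None if a TypeError signals a symbolic shape."""
    try:
        return compute_size() < 1000
    except TypeError:
        # Only TypeError is treated as "shape is symbolic".
        return None


def size_raising(exc_type):
    """Build a size function that fails with the given exception type."""
    def compute_size():
        raise exc_type("Cannot convert a symbolic Tensor to a numpy array.")
    return compute_size


print(small_shape_guard(lambda: 10))               # True
print(small_shape_guard(size_raising(TypeError)))  # None
try:
    small_shape_guard(size_raising(NotImplementedError))
except NotImplementedError as exc:
    # The guard does not catch this exception type, so it reaches the caller.
    print("escaped the guard:", exc)
```

So whether the test passes would depend on which exception type surfaces from the size computation, which fits with the failure appearing only in some environments.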

@albertz
Member Author

albertz commented Oct 19, 2022

> Maybe the GitHub CI cache is somehow messing up different TF versions?
>
> NumPy: 1.21.6
> TensorFlow: v1.15.2-30-g4386a66 1.15.3 /home/runner/.local/lib/python3.7/site-packages/tensorflow/__init__.

For comparison, in the current master:

Python env: python is /opt/hostedtoolcache/Python/3.7.14/x64/bin/python Python 3.7.14
NumPy: 1.19.5
TensorFlow: v1.15.2-30-g4386a66 1.15.3 /home/runner/.local/lib/python3.7/site-packages/tensorflow/__init__.py

So, interestingly, master has an older NumPy version. The newer NumPy version probably causes the error.

@albertz
Member Author

albertz commented Oct 19, 2022

OK, pinning that old NumPy version was too specific. For Python 2.7, it fails:

ERROR: Could not find a version that satisfies the requirement numpy==1.19.5 (from versions: 1.3.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1, 1.7.2, 1.8.0, 1.8.1, 1.8.2, 1.9.0, 1.9.1, 1.9.2, 1.9.3, 1.10.0, 1.10.0.post2, 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.11.0, 1.11.1, 1.11.2, 1.11.3, 1.12.0, 1.12.1, 1.13.0rc1, 1.13.0rc2, 1.13.0, 1.13.1, 1.13.3, 1.14.0rc1, 1.14.0, 1.14.1, 1.14.2, 1.14.3, 1.14.4, 1.14.5, 1.14.6, 1.15.0rc1, 1.15.0rc2, 1.15.0, 1.15.1, 1.15.2, 1.15.3, 1.15.4, 1.16.0rc1, 1.16.0rc2, 1.16.0, 1.16.1, 1.16.2, 1.16.3, 1.16.4, 1.16.5, 1.16.6)
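One way around an exact pin would be an upper bound such as `numpy<1.20`, which leaves pip free to choose the newest release available for each Python version. The sketch below mimics that selection over a few of the versions from the error message. Whether this PR ended up using such a bound is not shown here, and `newest_below` is a hypothetical helper, not pip's resolver:

```python
import re

# A few of the releases pip listed as available for Python 2.7 (from the error above).
PY27_NUMPY_RELEASES = ["1.15.4", "1.16.0rc1", "1.16.0", "1.16.5", "1.16.6"]


def parse_release(version):
    """Parse 'X.Y.Z' into a tuple of ints; return None for pre/post releases."""
    if not re.fullmatch(r"\d+(\.\d+)*", version):
        return None
    return tuple(int(part) for part in version.split("."))


def newest_below(releases, upper_bound):
    """Return the newest final release strictly below upper_bound, or None."""
    bound = parse_release(upper_bound)
    candidates = [
        v for v in releases
        if parse_release(v) is not None and parse_release(v) < bound
    ]
    return max(candidates, key=parse_release) if candidates else None


print(newest_below(PY27_NUMPY_RELEASES, "1.20"))  # 1.16.6
print("1.19.5" in PY27_NUMPY_RELEASES)            # False
```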

@albertz albertz marked this pull request as ready for review October 19, 2022 21:02
@albertz albertz requested a review from a team as a code owner October 19, 2022 21:02
@albertz
Member Author

albertz commented Oct 19, 2022

The PyCharm inspections raise many new warnings for TF 2.10 (see here), for example:

  returnn/datasets/hdf.py:1180: WARNING PyUnresolvedReferencesInspection: Cannot find reference 'object' in '__init__.pyi | __init__.pyi'
  returnn/tf/layers/rec.py:8558: WEAK WARNING PyAbstractClassInspection: Class VanillaLSTMCell must implement all abstract methods
  returnn/tf/updater.py:956: WEAK WARNING PyAbstractClassInspection: Class _KerasOptimizerWrapper must implement all abstract methods
  returnn/tf/util/basic.py:4679: WARNING PyUnresolvedReferencesInspection: Cannot find reference 'float' in '__init__.pyi | __init__.pyi'
  tools/tf_inspect_checkpoint.py:123: WARNING PyTypedDictInspection: TypedDict key must be a string literal; expected one of ('precision', 'threshold', 'edgeitems', 'linewidth', 'suppress', 'nanstr', 'infstr', 'formatter', 'sign', 'floatmode', 'legacy')

I will just keep the old TF version for the PyCharm inspections for now. We can look at that later.
