C++ Const and Assign to initialize variable causes a segfault depending on the Const constructor used #18149

Open
rajha-korithrien opened this issue Mar 31, 2018 · 3 comments
Assignees
Labels
1.9 comp:runtime c++ runtime, performance issues (cpu) stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug

Comments

@rajha-korithrien

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Yes, see the very short example below.

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    macOS 10.13.3 clang 900.0.39.2 and CentOS Linux 7 gcc-4.8.5

  • TensorFlow installed from (source or binary):
    Source from the 1.7.0 release tag

  • TensorFlow version (use command below):
    I have not actually installed the python pip package, but the source tree came from:
    https://github.com/tensorflow/tensorflow/archive/v1.7.0.tar.gz

  • Python version:
    N/A using the C++ API

  • Bazel version (if compiling from source):
    macOS Build label: 0.11.1-homebrew and CentOS Linux 7 Build label: 0.11.1- (@non-git)

  • GCC/Compiler version (if compiling from source):
    macOS clang 900.0.39.2 and CentOS Linux 7 gcc-4.8.5

  • CUDA/cuDNN version:
    N/A

  • GPU model and memory:
    N/A

  • Exact command to reproduce:
    Extract the sources and run configure:
    tar -xzvf v1.7.0.tar.gz
    cd tensorflow-1.7.0
    ./configure

Then add the following directory to hold the work:
mkdir tensorflow/basic-example

Put into basic-example the following BUILD file:

load("//tensorflow:tensorflow.bzl", "tf_cc_binary")

tf_cc_binary(
    name = "basic-example",
    srcs = [
        "basic-example.cc",
    ],
    deps = [
        "//tensorflow/cc:cc_ops",
        "//tensorflow/cc:client_session",
        "//tensorflow/core:tensorflow"
    ]
)

Put into basic-example the following C++ source file:

#include "tensorflow/cc/client/client_session.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/tensor.h"

using namespace tensorflow;
using namespace tensorflow::ops;
using namespace std;

int main() {

  Scope scope = Scope::NewRootScope();
 
  auto c = Const(scope.WithOpName("const_c"), {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0}, {3,3});

  auto v = Variable(scope.WithOpName("var1"), {3, 3}, DT_FLOAT);
  auto init_v = Assign(scope.WithOpName("init_v"), v, c);

  std::vector<Tensor> outputs;
  ClientSession session(scope);

  TF_CHECK_OK(session.Run({init_v}, &outputs));
}

Now compile and run the resulting program:
bazel build -c dbg //tensorflow/basic-example
./bazel-bin/tensorflow/basic-example/basic-example

Observe the following behavior:

./bazel-bin/tensorflow/basic-example/basic-example
2018-03-31 11:47:57.135532: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX
Segmentation fault: 11

Describe the problem

The code given above causes a segfault when the session runner tries to get the name of a node because the node is nullptr. I have included a stacktrace using lldb below (a trace showing the same information can be created using gdb on Linux).

However the following slightly modified C++ program works fine:

#include "tensorflow/cc/client/client_session.h"
#include "tensorflow/cc/ops/standard_ops.h"
#include "tensorflow/core/framework/tensor.h"

using namespace tensorflow;
using namespace tensorflow::ops;
using namespace std;

int main() {

  std::vector<float> initConstData = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0};

  Scope scope = Scope::NewRootScope();

  Tensor initConstT(DT_FLOAT, TensorShape({3,3}));
  std::copy_n(initConstData.begin(), initConstData.size(), initConstT.flat<float>().data());

  auto c = Const(scope.WithOpName("const_c"), initConstT);

  auto v = Variable(scope.WithOpName("var1"), {3, 3}, DT_FLOAT);
  auto init_v = Assign(scope.WithOpName("init_v"), v, c);

  std::vector<Tensor> outputs;
  ClientSession session(scope);

  TF_CHECK_OK(session.Run({init_v}, &outputs));
}

The difference between the code that works and the code that doesn't:
a) the explicit creation of a tensor initConstT
b) calling Const with a Tensor rather than an Input::Initializer

The behavior is identical if I omit the use of scope.WithOpName and just pass scope.
I have been able to reproduce this as far back as TensorFlow 1.4. I cannot build TensorFlow 1.3.1 with my installed version of Bazel.

If I have done something wrong, please point it out. Otherwise, since there is no semantic difference between the two programs and the API allows the former to compile, I feel that both should work.

Source code / logs

Stacktrace of the problem:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x60)
  * frame #0: 0x0000000126e677bc libtensorflow_framework.so`tensorflow::Node::name() const [inlined] std::__1::shared_ptr<tensorflow::NodeProperties>::operator->(this=0x0000000000000060) const at memory:4071
    frame #1: 0x0000000126e677bc libtensorflow_framework.so`tensorflow::Node::name(this=0x0000000000000000) const at graph.cc:140
    frame #2: 0x000000010018592f basic-example`tensorflow::Output::name(this=0x000000012bc020f0) const at ops.h:76
    frame #3: 0x0000000100184e7a basic-example`tensorflow::ClientSession::Run(this=0x00007ffeefbff4a8, run_options=0x00007ffeefbfefd0, inputs=size=0, fetch_outputs=size=1, run_outputs=size=1, outputs=0x00007ffeefbff4b0 size=1, run_metadata=0x0000000000000000) const at client_session.cc:118
    frame #4: 0x0000000100184145 basic-example`tensorflow::ClientSession::Run(this=0x00007ffeefbff4a8, inputs=size=0, fetch_outputs=size=1, run_outputs=size=1, outputs=0x00007ffeefbff4b0 size=1) const at client_session.cc:89
    frame #5: 0x000000010018408a basic-example`tensorflow::ClientSession::Run(this=0x00007ffeefbff4a8, fetch_outputs=size=1, outputs=0x00007ffeefbff4b0 size=1) const at client_session.cc:76
    frame #6: 0x0000000100002bfc basic-example`main at basic-example.cc:22
    frame #7: 0x00007fff76249115 libdyld.dylib`start + 1
    frame #8: 0x00007fff76249115 libdyld.dylib`start + 1
(lldb)
@michaelisard michaelisard assigned skye and unassigned michaelisard Apr 10, 2018
@skye skye added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Apr 18, 2018
@davidscherer

davidscherer commented Jun 7, 2018

If, before the call to session.Run(), you do something like this:

    if (!scope.ok()) {
        LOG(FATAL) << scope.status().ToString();
        abort();
    }

then you will get a more helpful error message:

Invalid argument: Inconsistent values for attr 'T' DT_FLOAT vs. DT_DOUBLE while building NodeDef 'Model/init_v' using Op<name=Assign; signature=ref:Ref(T), value:T -> output_ref:Ref(T); attr=T:type; attr=validate_shape:bool,default=true; attr=use_locking:bool,default=true; allows_uninitialized_input=true>

But frankly I am more concerned about the SIGSEGV and lack of diagnostics. What I have discovered in trying to use the Tensorflow C++ API is that as soon as you construct an operation with a shape or type error, scope.ok() becomes false. Any subsequent operations added to the graph bail out of their constructors immediately, leaving them with node()==nullptr. Then any call to Run() using these operations results in a segfault, as does Output::name() and probably other things that depend on a valid node.
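
The failure mode described above can be sketched with a toy model. This is a minimal sketch under stated assumptions: `ToyScope`, `ToyNode`, `ToyOutput`, and `MakeOp` are hypothetical names invented for illustration and are not TensorFlow's classes. It only mimics the described behavior, where the first error sticks to the scope and every later op constructor returns early with a null node.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy model only: these types are hypothetical, not TensorFlow's classes.
struct ToyNode {
  std::string name;
};

class ToyScope {
 public:
  bool ok() const { return error_.empty(); }
  const std::string& error() const { return error_; }
  // Keep only the first error; later errors are dropped.
  void UpdateStatus(const std::string& msg) {
    if (ok()) error_ = msg;
  }

 private:
  std::string error_;
};

struct ToyOutput {
  std::shared_ptr<ToyNode> node;  // stays null when construction fails
};

// Builds an op. Returns early, leaving node null, if the scope already
// holds an error or if this op is itself invalid.
ToyOutput MakeOp(ToyScope& scope, const std::string& name, bool valid) {
  ToyOutput out;
  if (!scope.ok()) return out;  // an earlier op failed
  if (!valid) {
    scope.UpdateStatus("Invalid argument while building '" + name + "'");
    return out;  // this op failed
  }
  out.node = std::make_shared<ToyNode>(ToyNode{name});
  return out;
}
```

In this model, building a valid op "a", an invalid op "b", and then a valid op "c" leaves both "b" and "c" with a null node, and the scope reports only the error from "b". Checking `scope.ok()` once after graph construction is therefore enough to surface the original cause.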

If it's intended that the user always check explicitly for scope errors before calling Run(), on penalty of undefined behavior, it seems to me that that should be reflected in the documentation and examples for the C++ API. For example, in the first example at https://www.tensorflow.org/api_guides/cc/guide if you modify the matrix A to not be rectangular (like auto A = Const(root, { {3.f, 2.f}, {-1.f} });) you will get a segfault rather than the error message "Invalid argument: Initializer list components should all have the same shape".

Ideally I think calling Run() in this scenario should result in an error status rather than a segfault! Surely it would be simple to check scope.ok() there. I hesitate to offer architectural advice on a project I'm so new to, but it might be better not to initialize Operations to having a NULL node at all.
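
One way to picture the suggested behavior is a run step that validates its fetches before touching them. The sketch below is an assumption-laden toy, not TensorFlow code: `ToyStatus`, `ToyOutput`, and `CollectFetchNames` are hypothetical names. It shows the idea of returning an error status for a fetch whose node is null instead of dereferencing it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration; not TensorFlow's API.
struct ToyNode {
  std::string name;
};
struct ToyOutput {
  const ToyNode* node;  // null if the op was never built
};
struct ToyStatus {
  bool ok;
  std::string message;
};

// Collects the names of all fetches. If any fetch has no node, returns an
// error status instead of dereferencing the null pointer. On error, names
// holds only the fetches seen before the bad one.
ToyStatus CollectFetchNames(const std::vector<ToyOutput>& fetches,
                            std::vector<std::string>* names) {
  names->clear();
  for (std::size_t i = 0; i < fetches.size(); ++i) {
    if (fetches[i].node == nullptr) {
      return ToyStatus{false, "fetch " + std::to_string(i) +
                                  " has no node; check the scope status"};
    }
    names->push_back(fetches[i].node->name);
  }
  return ToyStatus{true, ""};
}
```

The cost of such a guard is one pointer comparison per fetch, and the caller gets a message that points back at the scope rather than a crash inside a name lookup.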

@rajha-korithrien
Author

rajha-korithrien commented Jun 22, 2018

@davidscherer Thanks for your explanation! My apologies for my late response. With your added information I know what is going on now.

std::vector<float> initConstData = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0};

Forces C++ to hold the numbers as floats rather than doubles. So when the following is executed:

Tensor initConstT(DT_FLOAT, TensorShape({3,3}));
std::copy_n(initConstData.begin(), initConstData.size(), initConstT.flat<float>().data());
auto c = Const(scope.WithOpName("const_c"), initConstT);

The underlying data held in initConstT is float data. However when we use the implicit initialization code:

auto c = Const(scope.WithOpName("const_c"), {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0}, {3,3});

The compiler creates the initializer array as doubles instead of floats, which in turn means that 'c' holds double data. This causes a problem here:

auto v = Variable(scope.WithOpName("var1"), {3, 3}, DT_FLOAT);
auto init_v = Assign(scope.WithOpName("init_v"), v, c);

Because the tensorflow variable 'v' is defined to hold DT_FLOAT, but 'c' is holding DT_DOUBLE due to the implicit initializer array.
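
The literal-type behavior behind this can be checked in plain C++ without TensorFlow. In the sketch below, `DeducedTypeName` is a hypothetical stand-in for any function template that takes a `std::initializer_list<T>`; it is not part of the TensorFlow API. It shows that the element type is deduced from the literals themselves, so unsuffixed literals give `double` and `f`-suffixed literals give `float`.

```cpp
#include <cassert>
#include <initializer_list>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a templated initializer-list overload:
// T is deduced from the literals, not from how the result is used later.
template <typename T>
std::string DeducedTypeName(std::initializer_list<T> values) {
  (void)values;  // only the deduced type matters here
  if (std::is_same<T, float>::value) return "float";
  if (std::is_same<T, double>::value) return "double";
  return "other";
}

// Unsuffixed floating-point literals are double; the f suffix gives float.
static_assert(std::is_same<decltype(1.0), double>::value, "1.0 is a double");
static_assert(std::is_same<decltype(1.0f), float>::value, "1.0f is a float");
```

By the same reasoning, writing the literals in the failing program as `1.0f, 2.0f, ...` should make the constant a float tensor that matches the DT_FLOAT variable, although that variant was not tested in this thread.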

Thank you for your help in figuring it out!

I agree completely that the error was caused and known at the time of creating the init_v operation and the C++ API could raise an exception then and there much like the Java API does.
If the API designers do not want to use the exception feature of C++, then I also agree with your approach of having session.Run() check the scope object for the user, rather than allowing a segfault to occur.

@koriavinash1

Can anyone share an example of weight initialization in multi-layered networks?

@mohantym mohantym self-assigned this Jan 14, 2023
@mohantym mohantym added comp:runtime c++ runtime, performance issues (cpu) 1.9 type:bug Bug labels Feb 8, 2023