
Input fact propagation wonky for NNEF #742

Closed
tgolsson opened this issue Jun 14, 2022 · 32 comments
tgolsson (Contributor) commented Jun 14, 2022

Maybe I was a bit too quick to close #718, as there still seem to be some issues when running, depending on the exact flags I pass.

I'll post these in one go as I think they're related; but we'll see.

As a base, I'm using the image.nnef.tar we can now generate, with the following base command line:

tract image.nnef.tar --nnef-tract-core --nnef-tract-onnx -i input:1,3,224,224,f32 --allow-random-input 

This always works with dump (except when passing --profile), but run fails with the following error:

[2022-06-14T13:55:45.961766439Z WARN  tract::tensor] Using random input for input called "input": 1,3,224,224,F32
[2022-06-14T13:55:45.969444249Z ERROR tract] Evaluating #1 "ConstantOfShape_25" MultiBroadcastTo

    Caused by:
        Undetermined symbol in expression: N

Adding --set N=1 to the run fixes this. I'd have expected something like --override-fact input:1,3,224,224,f32 to also work, as a more aggressive variant of -i input:1,3,224,224,f32.

If I attempt to optimize the graph, it fails with the following wonky error, where it seemingly fails to unify two compatible shapes:

[2022-06-14T13:42:11.116655254Z ERROR tract] Error at stage optimize

    Caused by:
        0: codegen node #4 "Conv_0" ConvUnary
        1: Trying to substitute a N,768,7,7,F32 by 1,768,7,7,F32.
           ModelPatch { context: ["wire_as_lazy_im2col"], dont_apply_twice: None, model: Graph { nodes: [Node { id: 0, name: "incoming-3/0", inputs: [], op: TypedSource { fact: 1,3,224,224,F32 }, outputs: [1,3,224,224,F32 >1/0] }, Node { id: 1, name: "Conv_0.matmatmul", inputs: [0/0>], op: LirMatMulUnary { c_fact: 1,768,49,F32, c_m_axis: 1, c_n_axis: 2, micro_ops: [(2359296,F32 0.0050811768, 0.002538681, -0.0051002502, -0.0015630722, 0.0034770966, 0.0017652512, -0.0231781, -0.0051574707, -0.013504028, 0.002796173, 0.00044894218, -0.0076141357..., [Store])], shape=[1], strides=[1], layout=CFcf (0xf), dynamic ndim=1, c_final_shape: 1,768,49, geometry: Concrete(ConcreteMatMulGeometry { m: 768, k: 3072, n: 49, b_storage: VirtualPacking { packer: Packer { r: 5, alignment: 4, end_padding_record: 0 }, func: LazyIm2colSpec { n_bytes_offsets: [0, ...], k_bytes_offsets: [0, 4, ...] }, k: 3072 } }), mmm: MMM (fma_mmm_f32_16x5 16x5), reshape_post: [] }, outputs: [1,768,49,F32 >2/0] }, Node { id: 2, name: "Conv_0", inputs: [1/0>], op: Reshape(2, [Val(49)], [Val(7), Val(7)]), outputs: [1,768,7,7,F32 ] }], inputs: [0/0>], outputs: [], outlet_labels: {}, properties: {} }, inputs: {}, incoming: {0/0>: 3/0>}, shunt_outlet_by: {}, obliterate: [] }`

Looks like it somehow fails to propagate the input facts?
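At heart, the substitution failure above is a shape-unification problem: a symbolic dim like N can only be replaced by a concrete 1 if the symbol is bound consistently everywhere. A toy Python sketch of that idea (purely illustrative; tract's real logic lives in its TDim/fact machinery, and this simplified version ignores symbol-vs-symbol unification):

```python
# Hypothetical sketch of symbolic shape unification, NOT tract's actual code.
# A dim is either an int or a symbol name like "N"; unification binds symbols.

def unify_shapes(a, b, bindings=None):
    """Unify two shapes, binding symbols to concrete dims; raise on conflict."""
    bindings = dict(bindings or {})
    if len(a) != len(b):
        raise ValueError(f"rank mismatch: {a} vs {b}")
    for x, y in zip(a, b):
        for s, v in ((x, y), (y, x)):
            if isinstance(s, str):          # symbolic dim: bind or check binding
                if s in bindings and bindings[s] != v:
                    raise ValueError(f"conflict: {s}={bindings[s]} vs {v}")
                bindings[s] = v
                break
        else:                               # both dims are concrete ints
            if x != y:
                raise ValueError(f"cannot unify dim {x} with {y}")
    return bindings

print(unify_shapes(["N", 768, 7, 7], [1, 768, 7, 7]))   # {'N': 1}
```

With a binding like `{'N': 1}` in hand, substituting N,768,7,7 by 1,768,7,7 is legal; the error above suggests the optimizer never saw such a binding for the source fact.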

kali (Collaborator) commented Jun 14, 2022

--override-fact is a dirty hack to override a fact in the case of inconsistent models (I have seen a couple where input and output are tagged with different batch sizes). I think I've only ever used it on output facts :) So it is not supposed to propagate these changes.

There is no way, in the current state of affairs, to "reflow" over the graph and recompute the shapes. It is something we may want to add (not super hard), but in the end, if the graph is consistent, we should never (?) have to do this.
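For illustration, such a "reflow" would amount to a forward pass that recomputes each node's output shape from the overridden source fact. A minimal toy sketch, with hypothetical op rules standing in for tract's real shape inference (the op names and rules here are made up for the example):

```python
# Toy "reflow": recompute every node's output shape from the (overridden)
# source fact, in topological order. Purely illustrative, not tract code.

def reflow(graph, source_shape):
    shapes = {}
    for name, op, inputs in graph:          # graph is topologically sorted
        if op == "source":
            shapes[name] = source_shape
        elif op == "conv_s32":              # 32x32 kernel, stride 32: spatial /= 32
            n, c, h, w = shapes[inputs[0]]
            shapes[name] = (n, 768, h // 32, w // 32)
        elif op == "reshape_merge_hw":      # merge the trailing spatial dims
            n, c, h, w = shapes[inputs[0]]
            shapes[name] = (n, c, h * w)
    return shapes

graph = [("input", "source", []),
         ("Conv_0", "conv_s32", ["input"]),
         ("Reshape_12_0", "reshape_merge_hw", ["Conv_0"])]
print(reflow(graph, (1, 3, 224, 224)))
# Conv_0 -> (1, 768, 7, 7), Reshape_12_0 -> (1, 768, 49)
```

Running this with a concrete source shape yields concrete facts everywhere downstream, which is exactly what overriding only the input fact fails to do today.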

kali (Collaborator) commented Jun 14, 2022

Also if you run on actual input, the size should just induce the N value... Not sure what the --set brings to the table actually. Maybe for random inputs ?

tgolsson (Contributor, author) commented Jun 14, 2022

> Also if you run on actual input, the size should just induce the N value... Not sure what the --set brings to the table actually. Maybe for random inputs ?

Wouldn't -i input:1,3,224,224,f32 --allow-random-input solve that? I can try running from a numpy file just to be sure.

Edit:

$ tract image.nnef.tar --input-bundle io.npz --nnef-tract-core --nnef-tract-onnx run
[2022-06-14T14:15:50.308273477Z ERROR tract] For input input, can not reconcile model input fact N,3,224,224,F32 with provided input 1,3,224,224,F32 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

With -i ...

tract image.nnef.tar --input-bundle io.npz --nnef-tract-core --nnef-tract-onnx -i 1,3,224,224,f32 run
[2022-06-14T14:16:01.007384386Z ERROR tract] Evaluating #1 "ConstantOfShape_25" MultiBroadcastTo

    Caused by:
        Undetermined symbol in expression: N

With --set N=1 it works. Optimizing fails with the same issue, too.
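The "Undetermined symbol" error corresponds to evaluating a symbolic dim expression without a binding for N; --set N=1 supplies that binding. A toy sketch of the mechanism (illustrative only, not tract's code):

```python
# Sketch of evaluating a symbolic dim against a symbol table, mimicking the
# "Undetermined symbol" failure. Hypothetical code, not tract's implementation.

def eval_dim(dim, symbols):
    if isinstance(dim, int):
        return dim
    if dim in symbols:
        return symbols[dim]
    raise ValueError(f"Undetermined symbol in expression: {dim}")

shape = ["N", 1, 768]                            # ConstantOfShape_25's target shape
try:
    [eval_dim(d, {}) for d in shape]             # no --set: evaluation fails
except ValueError as e:
    print(e)                                     # Undetermined symbol in expression: N
print([eval_dim(d, {"N": 1}) for d in shape])    # --set N=1 -> [1, 1, 768]
```

The point of the bug report is that -i binds the symbol on the input fact only, so ops like MultiBroadcastTo that carry their own copy of N never see the binding.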

kali (Collaborator) commented Jun 14, 2022

Mmmm... I was thinking... you optimise the network with N, -i input:N,3,224,224,f32 -O --allow-random-input and then you want to profile it on random input for N=1 (or any other N value). But the --set is on run not on dump... so not sure there is a way to achieve this.

tgolsson (Contributor, author):

> Mmmm... I was thinking... you optimise the network with N, -i input:N,3,224,224,f32 -O --allow-random-input and then you want to profile it on random input for N=1 (or any other N value). But the --set is on run not on dump... so not sure there is a way to achieve this.

Good point, yeah, optimizing for N does work! But as you say then I can't run.

kali (Collaborator) commented Jun 14, 2022

I guess this is another nice little mess. Can we take a step back ? What is it you are trying to achieve ?

tgolsson (Contributor, author) commented Jun 14, 2022

Sure! I'm trying to build an asset workflow around using NNEF as live assets: we generate ONNX files from our ML pipelines, and ingest/store those as internal assets. At packaging time I want to convert (without necessarily concretizing!) those assets to NNEF, as it's noticeably faster to load in my benchmarks. Then I want to be able to concretize at load time (or use tract's live concretization).

I haven't tried it (i.e., with cervo) with any of the transformers I sent you last time, but it has seemed to work with the small RL brains I've used previously. For clarity, I haven't tried those with the tract CLI before, so it's entirely plausible that this workflow only works from code.

Ninja edit: With cervo, I can late-concretize NNEF RL models with symbols. With tract CLI I'm trying to do the same with transformer models.

Ninja edit 2: I'm fine with "symbolic NNEFs require --set N=... for the tract CLI". It just seems quite arbitrary, since it's not required for ONNX. But --set N=... would need to be a global option, or exist for dump as well, so that --profile works.

I'll stop editing now. I lied.

All I'm using the tract CLI for is validating various theories I have before going to code. So all of the above is just the user story of why I'm using NNEF; what I'm trying to achieve is simply validating that things work the way I expect them to.

kali (Collaborator) commented Jun 15, 2022

Lol. Well, the cli has grown organically to meet what I needed to make tract-the-library development easier. So yeah, no surprise it's not consistent.

Being consistent between ONNX/TF and NNEF is pretty difficult. For instance, TF more or less never includes input shape info, and ONNX does maybe half of the time. So we need -i for these. NNEF has them all the time, so we should not need -i. But we also use -i to force symbolic dimensions on the network. This works great with ONNX and TF, as we 1/ load the protobuf, 2/ apply -i, 3/ analyse the network. But with NNEF the workflow is a bit different: as soon as the network is loaded, we get a TypedModel with the shapes we read from the file.

Maybe we should make -i on NNEF override the shape read from the model files to make the behaviour more consistent.

I think it would be interesting to clarify where the various options appear, too. From the API, optimising with symbols and then running works (the symbols are resolved when the Source emits the input tensor). To make this work with --profile (or run with random data) we should maybe make --set available on both of them, or allow specifying an actual input (another set of -i?) in the sub-command arguments.

Fitting all use-cases in a command line is pretty challenging... For instance, in some cases we may want to run the --set after decluttering (like --concretize was doing), but we may also want it after optimisation (or not, if we instead add a way to specify the concrete fact we want for run/bench/criterion/dump --profile).

tgolsson (Contributor, author):

Yeah, the difference between NNEF and ONNX was a bit of a pain in cervo too, and led to lots of duplication (essentially all APIs come in both "from_model" and "from_typed" variants). I understand that capturing that workflow difference in a CLI is even harder since, well, more PEBKAC. :P Jokes aside, I think having the CLI tell me that I'm giving it gibberish for an NNEF would be a helpful start :P

I unfortunately have no super-helpful suggestions for other improvements. I think that having --set go in the main args would fix some of the issues for me, but that's just bouncing the problem around. I'm not sure how well clap would adapt, but maybe some more explicit DSL-y workflow would make sense? tract LOAD-NNEF foo.nnef.tar --concretize n:1 OPTIMIZE RUN --allow-random-input. Topically, input config is among the args I always misplace after the command instead of before...

kali (Collaborator) commented Jul 21, 2022

@tgolsson I think #771 may address most of these concerns. It moves shape overrides much earlier, changing the NNEF AST before we build the model. Tell me what you think.

tgolsson (Contributor, author):

@kali Awesome! I'll see if I can find some time to update and try it out this week or early next week!

kali (Collaborator) commented Aug 2, 2022

Are we happy with the current solution ? Should we close this one ?

tgolsson (Contributor, author) commented Aug 2, 2022

Sorry, I never got around to trying! Will do so immediately!

tgolsson (Contributor, author) commented Aug 2, 2022

@kali Any immediate diagnostic?

RUST_BACKTRACE=1 tract image.nnef.tar --nnef-tract-core --nnef-tract-onnx -i input:1,3,224,224,f32 --allow-random-input run
thread 'main' panicked at 'assertion failed: inv.id == \"external\"', /home/ts/.cargo/registry/src/github.com-1ecc6299db9ec823/tract-0.17.3/src/params.rs:261:25
stack backtrace:
   0: rust_begin_unwind
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/panicking.rs:142:14
   2: core::panicking::panic
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/panicking.rs:48:5
   3: tract::params::Parameters::load_model
   4: tract::params::Parameters::from_clap
   5: tract::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

kali (Collaborator) commented Aug 2, 2022

mmm... maybe try --input-node input -i 1,3,... ?

tgolsson (Contributor, author) commented Aug 2, 2022

Same result. I also tried a bunch of variants with external, and various combos of both. No dice. I'll sync down the code tomorrow and see what inv.id is; might have some hints.

tgolsson (Contributor, author) commented Aug 3, 2022

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `"tract_core_external"`,
 right: `"external"`', cli/src/params.rs:261:25

That looks like it's coming from inside tract-nnef, and only in one location. Should that op be renamed, or should the input checker handle both possibilities?

kali (Collaborator) commented Aug 3, 2022

The assert is overzealous. We need to distinguish the two kinds of Source: "external" is a strict NNEF-compatible source, while "tract_core_external" adds support for symbolic dimensions.

So the assert should just accept both. Please tell me if this is not enough to fix.
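In other words, the fix is just letting the check accept either source op id. A hypothetical sketch of the relaxed check, in Python for brevity (the real assert lives in cli/src/params.rs):

```python
# Sketch of the relaxed source-op check: both ids are legal sources.
# "tract_core_external" is the symbolic-dim-capable variant of "external".
KNOWN_SOURCE_OPS = {"external", "tract_core_external"}

def check_source(inv_id):
    assert inv_id in KNOWN_SOURCE_OPS, f"unexpected source op: {inv_id}"
    return inv_id

check_source("external")              # strict NNEF source: ok
check_source("tract_core_external")   # previously tripped the assert
```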

tgolsson (Contributor, author) commented Aug 3, 2022

Getting further (with some additional changes...) but it looks like concretization still isn't getting far enough:

$ RUST_BACKTRACE=1 cargo run --bin tract -- ../brains/image.nnef.tar --nnef-tract-core --nnef-tract-onnx -i input:1,3,224,224 --allow-random-input run
    Finished dev [unoptimized + debuginfo] target(s) in 0.12s
     Running `target/debug/tract ../brains/image.nnef.tar --nnef-tract-core --nnef-tract-onnx -i 'input:1,3,224,224' --allow-random-input run`
┣ 2 MultiBroadcastTo ConstantOfShape_25
┃   ━━━ N,1,768,F32
┣┻ 4 Add Add_26
    ━━━ N,1,768,F32
┏ 0 Source input
┃   ━━━ 1,3,224,224,F32
┣ 6 ConvUnary Conv_0
┃   ━━━ 1,768,7,7,F32
┣ 7 Reshape Reshape_12_0
┃   ━━━ 1,768,49,F32
┣ 8 MoveAxis Transpose_13_MoveAxis_0
    ━━━ 1,49,768,F32
[2022-08-03T15:38:26.829465746Z ERROR tract] In ModelBuilder::translate

    Caused by:
        0: In ModelBuilder::translate
        1: Wiring root graph body
        2: Plugging in assignement for "Concat_27"
        3: Resolving invocation concat
        4: Interrogating registry tract_nnef
        5: Deserializing op `concat'
        6: inputs are [4/0>, 8/0>]
        7: wiring Concat_27.Concat (TypedConcat { axis: 1, slices: [Var, Var] }), determining output_facts
        8: in output_facts invocation
        9: Inconsistent concat TypedConcat { axis: 1, slices: [Var, Var] } inputs: [N,1,768,F32, 1,49,768,F32]

kali (Collaborator) commented Aug 3, 2022

All right, thanks for giving it a shot, I guess I'll take a look. Is this model one you've already shared with me ?

tgolsson (Contributor, author) commented Aug 3, 2022

Indeed! Made a PR with the other small touchups I did.

kali (Collaborator) commented Aug 5, 2022

OK... Not sure I chose the right way to fix this. Trying to recap here. Our model's NNEF starts with these lines:

graph network( input ) -> ( output ) {
  input = tract_core_external(shape = [N, 3, 224, 224], datum_type = "F32");
  ConstantOfShape_25_scalar = 0.0;
  ConstantOfShape_25 = tract_core_broadcast(ConstantOfShape_25_scalar, shape = [N, 1, 768]);
  Add_26_a = variable<scalar>(label = "Add_26-a", shape = [1, 1, 768]);
  Add_26 = add(Add_26_a, ConstantOfShape_25);
  Conv_0_weigths = variable<scalar>(label = "Conv_0_weigths", shape = [768, 3, 32, 32]);
  Conv_0_conv = conv(input, Conv_0_weigths, dilation = [1, 1], stride = [32, 32], border = "constant", groups = 1, padding = [(0, 0), (0, 0)]);
  Conv_0 = Conv_0_conv;
  Reshape_12_0 = reshape(Conv_0, shape = [49], axis_start = 2, axis_count = 2);
  Transpose_13_MoveAxis_0 = transpose(Reshape_12_0, axes = [0, 2, 1]);
  Concat_27 = concat([Add_26, Transpose_13_MoveAxis_0], axis = 1);

Here we have two independent references to N: one in the input, which the recent AST hack and fixes give us the opportunity to override, and one at ConstantOfShape_25. They have to agree by the time we get to Concat_27.

Now, in this case, the check happens while we translate the model from the patched AST to a TypedModel, so all we managed to get is an inconsistent model from a working one... This hack would work if N appeared only once, but...

So I guess it's back to the drawing board with this: I'm gonna try to move --set to the global set of arguments, just as --concretize used to be (except it was for the magic streaming dimension). Let's see what happens.
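To make the failure mode concrete: N occurs in two places in the NNEF text (the input's shape and the tract_core_broadcast shape), so patching only the input declaration leaves the rest of the graph symbolic. A toy Python illustration of the difference between that and a global substitution of the symbol (which is what a global --set amounts to):

```python
# Toy illustration: N appears in more than one shape attribute, so overriding
# only the input's shape yields an inconsistent graph. Not tract code.

shapes = {
    "input": ["N", 3, 224, 224],
    "ConstantOfShape_25": ["N", 1, 768],
}

def substitute_everywhere(shapes, symbol, value):
    """What a global symbol override does: replace the symbol in every shape."""
    return {name: [value if d == symbol else d for d in dims]
            for name, dims in shapes.items()}

# Patching the input alone (the AST hack) leaves the other N dangling:
patched_input_only = dict(shapes, input=[1, 3, 224, 224])
print(patched_input_only["ConstantOfShape_25"])  # still ['N', 1, 768]

# Substituting everywhere keeps the graph consistent:
print(substitute_everywhere(shapes, "N", 1))
```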

kali mentioned this issue Aug 6, 2022
kali (Collaborator) commented Aug 6, 2022

I think I got it to work (there was a hard-to-track bug in Concat...). This new global --set is performed right after decluttering.

So cargo run --bin tract -- image.nnef.tar --nnef-tract-core --nnef-tract-onnx --allow-random-input -vvv --set N=1 -O dump --profile --cost works.

tgolsson (Contributor, author) commented Oct 7, 2022

Hey @kali

I think this is still not 100% working as intended. It still works for image.nnef.tar, but I tried it on the other file I've sent you (text.nnef.tar), and that crashes as soon as it starts running:

[2022-10-07T11:17:08.513712816Z INFO  tract::profile] Running entire network
[2022-10-07T11:17:08.513948393Z WARN  tract::tensor] Using random input for input called "input": 1,77,I32
[2022-10-07T11:17:08.513988527Z TRACE tract_core::plan] Running step 0, node #0 "token_embedding_weight_0" Const
[2022-10-07T11:17:08.514010584Z TRACE tract_core::plan] Running step 1, node #1 "input" Source
[2022-10-07T11:17:08.514029224Z TRACE tract_core::plan] Running step 2, node #2 "Gather_0" Gather
[2022-10-07T11:17:08.514047325Z TRACE tract_core::plan]   use input 0/0>
[2022-10-07T11:17:08.514050776Z TRACE tract_core::plan]   use input 1/0>
[2022-10-07T11:17:08.514052171Z TRACE tract_core::plan]   Ran #2 "Gather_0" Gather can now flush #0 "token_embedding_weight_0" Const
thread 'main' panicked at 'assertion failed: index < dim', /home/ts/.cargo/registry/src/github.com-1ecc6299db9ec823/ndarray-0.15.6/src/dimension/mod.rs:361:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
   2: core::panicking::panic
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:48:5
   3: ndarray::impl_methods::<impl ndarray::ArrayBase<S,D>>::index_axis
   4: tract_core::ops::array::gather::Gather::eval_t
   5: <tract_core::ops::array::gather::Gather as tract_core::ops::EvalOp>::eval
   6: tract::profile::profile
   7: tract::dump::handle
   8: tract::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Running the same network as ONNX crashes during inference instead:

[2022-10-07T11:21:05.493866694Z DEBUG tract_hir::infer::analyser] Failed analyse for node #1691 "MatMul_1541" MatMulInference

    Caused by:
        0: Infering facts
        1: Applying rule outputs[0].shape == 1,512
        2: Unifying shapes b,512 and 1,512
        3: Impossible to unify Sym(Symbol('b', 1)) with Val(1).
[2022-10-07T11:21:05.493889102Z TRACE tract_hir::infer::analyser] Remaining nodes 0
[2022-10-07T11:21:05.493891345Z TRACE tract_hir::infer::analyser] analyse done
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /home/ts/.cargo/registry/src/github.com-1ecc6299db9ec823/tract-0.18.1/src/terminal.rs:140:49
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
   2: core::panicking::panic
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:48:5
   3: tract::terminal::render_node_prefixed
   4: tract::terminal::render_prefixed
   5: tract::terminal::render
   6: tract::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

For some context, I'm trying to force LirMatMulUnary in this network to see whether that helps with performance; it's underperforming by a huge margin. The tract CLI (with --input-from-bundle) is 2-3x slower than running it from code, which is itself 2x-6x slower than ORT, which I want to replace.

kali (Collaborator) commented Oct 8, 2022

Did you really send me this model ? :)

tgolsson (Contributor, author) commented Oct 8, 2022

The ONNX variant - I can send it to you in NNEF format if you want as well. But the .tar.gz I sent you for the batch size investigation contained an image.onnx and a text.onnx.

kali (Collaborator) commented Oct 9, 2022

Found them. So I can at least clarify right now what happens with text.onnx. We are now (somewhat) correctly reading symbols from models. The error you get is a conflict between an input dimension forced to 1 (from the CLI or from the API) and the output shape of the network, which still contains a symbol.

So now we can load the model like this (old school, overriding symbol values or ignoring them):

tract text.onnx -i 1,77,i32 --onnx-ignore-output-shapes run --allow-random-input

or

tract text.onnx --set b=1 run --allow-random-input

And get one more ugly error both ways:

thread 'main' panicked at 'collapse_axis: Index 1249691715 must be less than axis length 49408 for array with shape IxDynImpl(Inline(2, [49408, 512, 0, 0]))', /home/kali/.cargo/registry/src/github.com-1ecc6299db9ec823/ndarray-0.15.6/src/dimension/mod.rs:361:5

I'm looking into this.

kali (Collaborator) commented Oct 9, 2022

Ok, this panic is just due to the fact that we are trying to access a tensor by index with a value bigger than the axis dimension. This is pretty common with text models (numbers up to some boundary are used to represent the vocabulary, so inputs with values higher than the boundary are invalid). Replaced the panic by a regular error.

Can you tell me what are the assumptions on the input of the text model ?
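For illustration, the bounds check that replaces the panic amounts to validating every index against the vocab axis before the lookup. A toy sketch of a checked embedding gather (not tract's actual code):

```python
# Sketch of a bounds-checked embedding gather, illustrative only.
# A real Gather indexes the vocab axis of a (vocab_size x dim) table.

def gather_rows(table, indices):
    """Look up rows of `table` by token index, erroring on out-of-range ids."""
    vocab = len(table)
    out = []
    for i in indices:
        if not 0 <= i < vocab:
            raise ValueError(f"index {i} out of range for axis of length {vocab}")
        out.append(table[i])
    return out

vocab = [[float(i)] * 4 for i in range(100)]   # toy 100-word embedding table
print(gather_rows(vocab, [1, 42])[1][0])       # 42.0
# gather_rows(vocab, [1249691715]) now raises a regular error, not a panic
```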

kali (Collaborator) commented Oct 9, 2022

The more the merrier... new command line argument: --random-range

cargo run -p tract -- text.onnx --set b=1 -O run --const --allow-random-input --random-range 1.0..1000.0

So this works for me. Do we have anything else here before looking at the matmul performance thing ?
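What --random-range amounts to for a token-id input can be sketched in plain Python (illustrative; the flag name and the 1,77,i32 input shape are the only things taken from above, and the helper here is hypothetical):

```python
# Sketch of what --random-range 1.0..1000.0 amounts to for an i32 input:
# draw random token ids from a valid half-open range instead of all of i32.
import random

def random_input(shape, lo, hi, seed=0):
    """Return a flat list of random ints in [lo, hi) for the given shape."""
    rng = random.Random(seed)
    n = 1
    for d in shape:
        n *= d
    return [rng.randrange(lo, hi) for _ in range(n)]

tokens = random_input((1, 77), 1, 1000)        # random ids for a 1,77,i32 tensor
assert len(tokens) == 77 and all(1 <= t < 1000 for t in tokens)
```

Constraining the range keeps every random id inside the vocabulary, so the Gather bounds check never fires.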

kali (Collaborator) commented Oct 9, 2022

See #831

tgolsson (Contributor, author) commented Oct 9, 2022

Ah, duh! Yes, the input to this model is tokenized text, so the word->index mapping happens at some previous stage outside the network. I guess 49408 is the vocab size or something like that, as you say. Not my domain normally; I'm just interested in running it. :P The --random-range sounds great!

I think this should be all for this issue then :-)

kali (Collaborator) commented Oct 9, 2022

All right, I'm very happy to close this one :)

kali closed this as completed Oct 9, 2022