Input fact propagation wonky for NNEF #742
There is no way, in the current state of affairs, to "reflow" over the graph and recompute the shapes. It is something that we may want to add (not super hard), but in the end, if the graph is consistent, we should never (?) have to do this.
Also, if you run on an actual input, the size should just induce the
Wouldn't Edit:
With
With
Mmmm... I was thinking... you optimise the network with N,
Good point, yeah, optimizing for
I guess this is another nice little mess. Can we take a step back? What is it you are trying to achieve?
Sure! I'm trying to build an asset workflow around using NNEF as live assets: we generate ONNXes from our ML pipelines, and ingest/store those as internal assets. At packaging time I want to convert (without necessarily concretizing!) those assets to NNEF, as it's noticeably faster to load in my benchmarks. Then I want to be able to concretize at load time (or use tract's live concretization). I haven't tried it (i.e., with cervo) with any of the transformers I sent you last time, but it has seemed to work with the small RL brains I've used previously. For clarity, I haven't tried those on the tract CLI before, so it's entirely plausible that this workflow only works from code.

Ninja edit: With cervo, I can late-concretize NNEF RL models with symbols. With the tract CLI I'm trying to do the same with transformer models.

Ninja edit 2: I'm fine with "symbolic NNEF requires --set N=... for tract-cli". It just seems quite arbitrary, since it's not required for ONNX. But
All I'm using tract-cli for is validating various theories I have before going to code. So all of the above is just the user story of why I'm using NNEF. What I'm trying to achieve is just validating that things work the way I expect them to work.
Lol. Well, the CLI has grown organically to meet what I needed to make tract-the-library development easier. So yeah, no surprise it's not consistent.

Being consistent between ONNX/TF and NNEF is pretty difficult. For instance, TF more or less never includes input shape info, and ONNX can and does maybe half of the time. So we need

Maybe we should make

I think it would be interesting to clarify where the various options appear, too.

From the API, optimising with symbols then running works (the symbols are resolved when the Source emits the input tensor). In order to make this work with --profile (or run with random data) we should maybe make --set available on both of them, or allow specifying an actual input (another set of

Fitting all use-cases in a command line is pretty challenging... For instance, we may want in some cases to run the --set after decluttering (like --concretize was doing), but we may also want it after optimisation (not sure; maybe instead we want a way to specify the concrete fact we want to run/bench/criterion/dump --profile).
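To make the placement question concrete, here is a hedged sketch of the two styles being discussed. The flag spellings come from this thread, but whether each subcommand accepts them like this is exactly what is in question; invocations are hypothetical:

```shell
# Hypothetical invocations, for illustration only.
# Today: --set lives on the subcommand, so dump and run each need it:
tract image.nnef.tar dump --set N=1
tract image.nnef.tar run --set N=1

# Alternative discussed above: a global --set applied after decluttering,
# shared by run/bench/profile alike:
tract image.nnef.tar --set N=1 run
```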
Yeah; the difference between NNEF and ONNX was a bit of a pain in cervo too, and led to lots of duplication (essentially all APIs come in both "from_model" and "from_typed" flavours). I understand that capturing that workflow difference in a CLI is even harder since, well, more PEBKAC. :P

Jokes aside, I think having the CLI tell me that I'm giving it gibberish for an NNEF would be a helpful start :P I unfortunately have no super-helpful suggestions for other improvements. I think that having --set go on the main args would fix some of the issues for me, but it's just bouncing the problem around. I'm not sure how well clap would adapt, but maybe having some more explicit DSL-y workflow would make sense?
@kali Awesome! I'll see if I can find some time to update and try it out this week or early next week! |
Are we happy with the current solution? Should we close this one?
Sorry, I never got around to trying! Will do so immediately! |
@kali Any immediate diagnostic?
mmm... maybe try
Same result. Also tried a bunch of variants with external, and various combos of both. No dice. I'll sync down the code tomorrow and see what inv.id is; might have some hints.
That looks like it's coming from inside tract-nnef, and only in one location. Should that op be renamed, or should the input checker handle both possibilities?
The assert is overzealous. We need to distinguish the two kinds of Source: "external" is a strict NNEF-compatible source, while "tract_core_external" adds support for symbolic dimensions. So the assert should just accept both. Please tell me if this is not enough to fix it.
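The shape of the fix described above can be sketched like this. This is illustrative only, not tract's actual code; the helper name is made up, and only the two op identifiers come from the comment:

```rust
// Illustrative sketch (not tract's real internals): instead of asserting on
// the single strict NNEF "external" source, the input checker accepts both
// identifiers, since "tract_core_external" is the symbolic-dimension variant.
fn is_source_op(id: &str) -> bool {
    matches!(id, "external" | "tract_core_external")
}

fn main() {
    assert!(is_source_op("external"));
    assert!(is_source_op("tract_core_external"));
    assert!(!is_source_op("tract_core_matmul"));
    println!("ok");
}
```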
Getting further (with some additional changes...) but it looks like concretization still isn't getting far enough:
All right, thanks for giving it a shot; I guess I'll take a look. Is this model one you've already shared with me?
Indeed! Made a PR with the other small touchups I did. |
OK... Not sure I chose the right way to fix. Trying to recap here. Our model NNEF starts with these lines:
Here we have two independent references to N: one in the

Now, in this case, the check happens while we translate the model from the patched AST to TypedModel, so all we managed to get is an inconsistent model from a working one... This hack would work if N only appeared once, but...

So I guess it's back to the drawing board with this: I'm gonna try to move --set to the global set of arguments, just as --concretize used to be (except that one was for the magic streaming dimension). Let's see what happens.
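The consistency problem above can be illustrated with a toy model of symbolic dimensions. The `Dim` type and `concretize` function are hypothetical, not tract's API; the point is only that every occurrence of the same symbol must resolve to one value, or the model becomes inconsistent:

```rust
// Toy sketch of symbolic shape concretization (not tract's real types).
// A dimension is either a concrete value or a named symbol like N.
#[derive(Clone, Debug, PartialEq)]
enum Dim {
    Val(usize),
    Sym(&'static str),
}

// Substitute one symbol everywhere it occurs. If only some references to N
// were patched (as with the hack described above), shapes would disagree.
fn concretize(shape: &[Dim], sym: &str, value: usize) -> Vec<Dim> {
    shape
        .iter()
        .map(|d| match d {
            Dim::Sym(s) if *s == sym => Dim::Val(value),
            other => other.clone(),
        })
        .collect()
}

fn main() {
    let input = vec![Dim::Sym("N"), Dim::Val(3), Dim::Val(224), Dim::Val(224)];
    let concrete = concretize(&input, "N", 1);
    assert_eq!(concrete[0], Dim::Val(1));
    assert_eq!(concrete[1], Dim::Val(3));
    println!("ok");
}
```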
I think I got it to work (there was a hard-to-track bug in Concat...). This new global --set is performed right after decluttering. So
Hey @kali, I think this is still not 100% working as intended. It does still work for
Running the same network as ONNX crashes during inference instead:
For some context: I'm trying to force LirMatMulUnary in this network to see whether that helps with performance; it's underperforming by a huge margin. And tract-cli with (with
Did you really send me this model? :)
The ONNX variant - I can send it to you in NNEF format if you want as well. But the .tar.gz I sent you for the batch size investigation contained an image.onnx and a text.onnx. |
Found them. So I can at least clarify right now what happens with text.onnx. We can now load the model like this (old school, overriding symbol values or ignoring them):
or
And get one more ugly error both ways:
I'm looking into this.
OK, this panic is just due to the fact that we are trying to access a tensor by index with a value bigger than the axis dimension. This is pretty common with text models (numbers up to some boundary are used to represent the vocabulary, so inputs with values higher than the boundary are invalid). I replaced the panic with a regular error. Can you tell me what the assumptions are on the input of the
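A minimal sketch of the kind of fix described above, assuming an embedding-style lookup; the function, error type, and the 49408 boundary used in the test are illustrative, not tract's actual code:

```rust
// Hypothetical sketch: bounds-check a vocabulary index and return a regular
// error for out-of-range values, instead of panicking on the slice access.
fn lookup(table: &[f32], vocab_size: usize, index: usize) -> Result<f32, String> {
    if index >= vocab_size || index >= table.len() {
        return Err(format!(
            "index {index} out of range for axis of size {vocab_size}"
        ));
    }
    Ok(table[index])
}

fn main() {
    let table = vec![0.1, 0.2, 0.3];
    // A valid token index succeeds; an out-of-vocabulary one errors cleanly.
    assert_eq!(lookup(&table, 3, 1), Ok(0.2));
    assert!(lookup(&table, 3, 49408).is_err());
    println!("ok");
}
```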
The more the merrier... new command-line argument: --random-range
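A hedged usage sketch: the flag name comes from the comment above, but the exact range syntax and placement are my assumption, not confirmed tract-cli behaviour:

```shell
# Hypothetical invocation: constrain the random inputs used when running
# with generated data to valid token indices (e.g. below the vocab boundary).
tract text.nnef.tar run --random-range 0..49408
```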
So this works for me. Do we have anything else here before looking at the matmul performance thing?
See #831 |
Ah, duh! Yes, the input to this model is some tokenized input, so word->index happens at some previous stage outside the network. I guess the 49408 is the vocab size or something like that, as you say. Not my domain normally; I'm just interested in running it. :P The --random-range sounds great! I think this should be all for this issue then :-)
All right, I'm very happy to close this one :) |
Maybe I was a bit too quick to close #718, as it seems to still have some issues when running, depending on the exact flags I pass. I'll post these in one go as I think they're related, but we'll see.

As a base, I'm using the image.nnef.tar we can now generate, with the following base command line:

This always works with dump (except when passing --profile), but fails with run with the following error:

Adding --set N=1 to the run fixes this. I'd have expected something like --override-fact input:1,3,224,224,f32 to also work correctly, as a more aggressive -i input:1,3,224,224,f32.

If I attempt to optimize the graph, it fails with the following wonky error, where it fails to unify two compatible shapes?

Looks like it somehow fails to propagate the input facts?
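To summarize the behaviour described above as concrete invocations (a hedged sketch: the actual command line was elided, so paths and flag placement are illustrative; only the flag spellings come from this report):

```shell
# dump works without resolving N:
tract image.nnef.tar dump

# run fails until N is pinned:
tract image.nnef.tar run              # errors with N unresolved
tract image.nnef.tar run --set N=1    # works

# expected (but not currently working) alternatives:
tract image.nnef.tar run --override-fact input:1,3,224,224,f32
tract image.nnef.tar run -i input:1,3,224,224,f32
```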