
Integer-sizing a decluttered streaming TypedModel without Pulse (for non causal models) #300

Closed
bminixhofer opened this issue Jun 13, 2020 · 36 comments


@bminixhofer
Contributor

bminixhofer commented Jun 13, 2020

Hey, I came across another problem while trying to run the bidirectional LSTM model in a browser.
It is the same LSTM that is now in CI (download link). Now normally I'd use code similar to this:

use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let model = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        .into_optimized()?
        .into_runnable()?;

    let input: Tensor = tract_ndarray::Array2::<u8>::zeros((1, 100)).into();
    model.run(tvec!(input))?;

    Ok(())
}

but I get an error:

➜  ~/Documents/Experiments/sblstmtest git:(master) ✗ cargo run
   Compiling sblstmtest v0.1.0 (/Users/bminixhofer/Documents/Experiments/sblstmtest)
    Finished dev [unoptimized + debuginfo] target(s) in 4.51s
     Running `target/debug/sblstmtest`
Error: TractError(Msg("Translating node #1 \"input\" Source ToTypedTranslator"), State { next_error: Some(TractError(Msg("Output type not determined"), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })), backtrace: InternalBacktrace { backtrace: None } })

Running it either without into_optimized, or with a fixed input fact, works.
So I understand that the model cannot be optimized because the input shape (batch size and sequence length) is not known at build time. Is that correct?
In practice I don't want to fix the input shape at build time because the model has to work with different batch sizes.
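
For reference, the two workarounds I mean look roughly like this (just a sketch, same model.onnx as above):

use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Workaround 1: skip into_optimized() and run the inference model as-is.
    let unoptimized = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        .into_runnable()?;
    let input: Tensor = tract_ndarray::Array2::<u8>::zeros((1, 100)).into();
    unoptimized.run(tvec!(input.clone()))?;

    // Workaround 2: pin the input shape with a fact before optimizing.
    let optimized = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        .with_input_fact(0, InferenceFact::dt_shape(u8::datum_type(), tvec!(1, 100)))?
        .into_optimized()?
        .into_runnable()?;
    optimized.run(tvec!(input))?;

    Ok(())
}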

Now, so far this wouldn't be a problem; I'd just add an optimize option to the JS API to turn optimization on or off, depending on whether dynamic shapes are needed during inference.

The problem comes when I try to store the model that I got by calling into_runnable without calling into_optimized before.

I get a model of type SimplePlan<InferenceFact, Box<dyn InferenceOp>, ModelImpl<InferenceFact, Box<dyn InferenceOp>>>. When I want to store such a model in a struct like:

use tract_onnx::prelude::*;

struct Model {
    inner: SimplePlan<InferenceFact, Box<dyn tract_hir::infer::ops::InferenceOp>, InferenceModel>,
}

I get an error which says that the module ops is private:

➜  ~/Documents/Experiments/sblstmtest git:(master) ✗ cargo run
   Compiling sblstmtest v0.1.0 (/Users/bminixhofer/Documents/Experiments/sblstmtest)
error[E0603]: module `ops` is private
  --> src/main.rs:4:64
   |
4  |     inner: SimplePlan<InferenceFact, Box<dyn tract_hir::infer::ops::InferenceOp>, InferenceModel>,
   |                                                                ^^^ private module
   |
note: the module `ops` is defined here
  --> /Users/bminixhofer/.cargo/registry/src/github.com-1ecc6299db9ec823/tract-hir-0.7.0/src/infer/mod.rs:12:1
   |
12 | mod ops;
   | ^^^^^^^^

error: aborting due to previous error

For more information about this error, try `rustc --explain E0603`.
error: could not compile `sblstmtest`.

To learn more, run the command again with --verbose.

So I can't store the result. Am I missing something? And if not, is there some way to work around this?

Thanks for all your help :)

@bminixhofer
Contributor Author

In other, better news: custom outputs are now tried and tested in tractjs, so I'm pretty confident this issue (and possibly #296) are the last blockers on the tract side for the initial tractjs release :)

@kali
Collaborator

kali commented Jun 13, 2020

Well, now this one is interesting... Once again you're pushing the envelope a tiny bit.

So I understand that the model can not be optimized because the shape of the input (batch size and seq len) is not known at the time of building. Is that correct?

Yes.

So the simple fix is to make whatever types we need from hir::ops public. We'll certainly do that; I am actually a bit surprised the compiler let us have a public type alias over private types, but... ok.

But we could do better: storing an InferenceModel and doing the full optimization every time is quite expensive, while I think we are equipped to perform a significant part of it (up to declutter()) while working with the symbolic streaming S dimension. This is what we do when we go the pulsing route for our real-time voice networks: into_normalize(), then creating a pulse network, then into_type(), obtaining a network with only numerical dimensions this time, and finally codegen().

If your network were causal, using the pulsing transformation would be an option. But of course, bidi LSTMs are not causal.

In your case we could imagine a new "route": first just go up to declutter using the S dimension, as for a streaming network, and store the obtained TypedModel. From there, we would introduce new stuff: a simple transformation that just puts a value on S (the actual sequence length, taken as a parameter of the transformation) and evaluates all fact expressions that depend on it, to obtain a numerically-sized TypedModel. From there, we could call codegen() as usual.

If you're interested in getting your hands dirty and revisiting / fixing more half-baked APIs in the process, I'd be glad to help you do this. I don't mind at all just fixing the visibility stuff right away (PR welcome) to release tractjs, and considering doing this later.

@bminixhofer
Contributor Author

Turns out tract_hir::infer::InferenceSimplePlan<InferenceModel> works. But InferenceSimplePlan is not in the prelude of tract_hir. I'll make a quick PR for that.
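
For reference, this is the kind of struct that compiles now (a rough sketch; it assumes tract_hir is added as a direct dependency next to tract-onnx):

use tract_onnx::prelude::*;
use tract_hir::infer::InferenceSimplePlan;

struct Model {
    // Plan built from the un-optimized InferenceModel via into_runnable().
    inner: InferenceSimplePlan<InferenceModel>,
}

impl Model {
    fn load(path: &str) -> TractResult<Model> {
        let inner = tract_onnx::onnx().model_for_path(path)?.into_runnable()?;
        Ok(Model { inner })
    }
}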

Sounds good! I'm not sure how much time I'll have in the coming weeks so I'd rather try to do an initial release of tractjs now. But I might give it a shot in the future :)

@kali kali changed the title Storing a non-optimized model in another struct Integer-sizing a decluttered streming TypedModel without Pulse (for non causal models) Jun 14, 2020
@kali kali changed the title Integer-sizing a decluttered streming TypedModel without Pulse (for non causal models) Integer-sizing a decluttered straeming TypedModel without Pulse (for non causal models) Jun 14, 2020
@kali kali changed the title Integer-sizing a decluttered straeming TypedModel without Pulse (for non causal models) Integer-sizing a decluttered streaming TypedModel without Pulse (for non causal models) Jun 14, 2020
@kali
Collaborator

kali commented Jun 18, 2020

FYI, I actually implemented it (turns out I started needing it in the tract stream-check subcommand). I'll merge it soon.

@bminixhofer
Contributor Author

Wow, awesome, you're on fire with the changes to tract recently! :) I'm just now adding a test for the custom inputs to tractjs from #296.

@kali
Collaborator

kali commented Jun 19, 2020

@kali kali closed this as completed Jun 19, 2020
@bminixhofer
Contributor Author

Great! One more question, and I'm sorry but: there can only ever be one streaming dimension, right? Because the LSTM in CI would technically have two: batch size and sequence length.

@kali kali reopened this Jun 19, 2020
@kali
Collaborator

kali commented Jun 19, 2020

Reopening because of some issues... I did not realize a few operators need specific treatment, as they have size expressions in their attributes. Identified so far: TypedReshape, MultiBroadcast; there may be others. So beware. If these are the only ones, I'll fix them soon.

And yes, at this stage, we can only have one variable dimension, which is kind of rigged to be the time dimension (but maybe only the streaming/pulsing code assumes that... not too sure). I think I need to start thinking about generalizing this (it may also be a way to get runtime-dynamically shaped tensors...).

So... I need to think about this before anything happens on this front.

@kali
Collaborator

kali commented Jun 19, 2020

  • Scan has some too (@bminixhofer so this remaining bug may affect your LSTMs)

@bminixhofer
Contributor Author

Ok, sure. In that case I think it is best to release tractjs with the current feature set; nothing speaks against adding this later :)

@kali
Collaborator

kali commented Jun 22, 2020

@DreamerMind I think we have what we need with this.
@bminixhofer curious to know if you can check that concretization of S works in your LSTM examples with this. As we discussed, other dimensions (including N) will have to wait a bit.

@bminixhofer
Contributor Author

Sure, I can try that and the invalid rank catching from #269 along with it.

@bminixhofer
Contributor Author

Ok, I'm not sure I am trying this correctly.

Cargo.toml

[package]
name = "sblstmtest"
version = "0.1.0"
authors = ["Benjamin Minixhofer <bminixhofer@gmail.com>"]
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
tract-onnx = { git = "https://github.com/snipsco/tract", branch = "fixes-for-concretize-dim" }

src/main.rs

use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let model = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        .into_optimized()?
        .concretize_stream_dim(1)?
        .into_runnable()?;

    let input: Tensor = tract_ndarray::Array2::<u8>::zeros((1, 50)).into();
    model.run(tvec!(input))?;

    Ok(())
}

This throws an error upon calling into_optimized:

➜  ~/Documents/Experiments/sblstmtest git:(master) ✗ cargo run
   Compiling sblstmtest v0.1.0 (/Users/bminixhofer/Documents/Experiments/sblstmtest)
    Finished dev [unoptimized + debuginfo] target(s) in 4.36s
     Running `target/debug/sblstmtest`
Error: TractError(Msg("Translating node #0 \"input\" Source ToTypedTranslator"), State { next_error: Some(TractError(Msg("Output type not determined"), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })), backtrace: InternalBacktrace { backtrace: None } })

When I fix this by providing a fixed shape, again only that shape works:

use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let model = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        .with_input_fact(0, InferenceFact::dt_shape(u8::datum_type(), tvec!(1, 100)))?
        .into_optimized()?
        .concretize_stream_dim(1)?
        .into_runnable()?;

    let input: Tensor = tract_ndarray::Array2::<u8>::zeros((1, 50)).into();
    model.run(tvec!(input))?;

    Ok(())
}
➜  ~/Documents/Experiments/sblstmtest git:(master) ✗ cargo run
   Compiling sblstmtest v0.1.0 (/Users/bminixhofer/Documents/Experiments/sblstmtest)
    Finished dev [unoptimized + debuginfo] target(s) in 4.59s
     Running `target/debug/sblstmtest`
Error: TractError(Msg("Evaluating #1 \"input\" Source: output 0, expected 1x100xU8, got 1x50xU8 0, 0, 0, 0, 0, 0, 0, 0..."), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })

So I'm not sure how to use concretize_stream_dim. model.onnx is the model from https://github.com/snipsco/tract/tree/main/onnx/test_cases/byte_sb_bidi_lstm just so we're on the same page.

@bminixhofer
Contributor Author

Oh and the shape + rank checking really works. Thanks for that! I am removing it from the FAQ. Already one down ;)

@kali
Collaborator

kali commented Jun 23, 2020

So what I had in mind was:

use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let decluttered_model = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        .with_input_fact(0, InferenceFact::dt_shape(u8::datum_type(), tvec!(1.into(), TDim::s())))?
        .into_typed()?
        .declutter()?;

[...]

     let optimized_model = decluttered_model
        .concretize_stream_dim(40)? // borrows &self
        .optimize()?
        .into_runnable()?;

    let input: Tensor = tract_ndarray::Array2::<u8>::zeros((1, 40)).into();
    optimized_model.run(tvec!(input))?;

    Ok(())
}

It's a bit of a mouthful; we may need more ergonomics once we've established it does work.

But there are still bugs :) TypedReshape is giving me a lot of pain, specifically for something I ultimately want to exclude from the core set (see #307), but I'll still give fixing it a shot.

@bminixhofer
Contributor Author

bminixhofer commented Jun 23, 2020

Hmm, so where we are trying to get to is: I can load and prepare the model once, then predict on input of shape e.g. (1, 40), and some time later on input of shape e.g. (1, 50), without having to do any expensive optimization in between, right?

Is that already what the code above achieves? I would have thought that into_optimized and into_runnable are quite expensive...

@kali
Collaborator

kali commented Jun 23, 2020

For now, we're trying to "save" the analysis and declutter() time.

optimize() is relatively cheap compared to decluttering. We may try to address it later, but there is a bit of work, as the optimized operators typically expect fixed sizes, without S or any other symbol. I'm getting more and more convinced that we can generalize and accept symbols along some dimensions, in optimized ops at least, but this is out of scope for the current exercise. Baby steps.
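
To make the split concrete, here is a rough way to time the two phases separately (a sketch reusing the snippet from above, not code from tract itself):

use std::time::Instant;
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // The expensive part: analysis + declutter, done once with the symbolic S dimension.
    let start = Instant::now();
    let decluttered = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        .with_input_fact(0, InferenceFact::dt_shape(u8::datum_type(), tvec!(1.into(), TDim::s())))?
        .into_typed()?
        .declutter()?;
    println!("analysis + declutter: {:?}", start.elapsed());

    // The cheaper part: pin S to a concrete length, optimize, and build the plan.
    let start = Instant::now();
    let _plan = decluttered
        .concretize_stream_dim(40)?
        .optimize()?
        .into_runnable()?;
    println!("concretize + optimize + plan: {:?}", start.elapsed());

    Ok(())
}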

@bminixhofer
Contributor Author

Ok, that makes sense. I still don't know how the graph from an ONNX model can be "decluttered" (because when I view the graph all of the ops look pretty essential to me ;) ) but that's probably a topic for another time.

I'll try the code snippet from above.

@kali
Collaborator

kali commented Jun 23, 2020

Nah, don't rush, I'm still working out some issues :) I'll tell you when I think it's ok :)

@kali
Collaborator

kali commented Jun 23, 2020

As for the term "declutter", it is true that ONNX graphs are not the worst I have seen, and the concept has evolved since I introduced it. The main idea is actually to convert as much as possible to what is shaping up as the tract core operator set:

  • get rid of purely idiosyncratic patterns (like the terrible SpaceToBatch/Conv2D/BatchToSpace dance in TF for dilated convolutions)
  • expand complex operations:
    • LSTM -> Scan
    • activations as simple primitives (mul, recip, square, ln, exp... the usual primitives are in the core opset, but not relu, elu, prelu, selu, etc. Sigmoid and tanh are exceptions: we have faster implementations than their exponentiation-based expansions, so they are first-class citizens in core.)
    • same goes for BatchNormalization and a few other high-level NN ops that can be re-expressed in simple tensor arithmetic
  • incorporate most constants as fields in the operator that needs them (so they can be reshaped to the best form at optimize() time if needed)
  • get rid of high-level "pythonisms" (e.g. negative indexing, rank broadcast, ...)
  • perform some graph-level simplifications: permutations cancelling each other across the graph, useless axes, moving splits up and concats down to get more "atomic" operations that are easier to reason about, moving downsampling up, etc.

On your model, the "decluttering" aspect is not obvious as the decluttered form looks bigger than the original form. But on some networks where machine learners abused shapes and padding computations in tensor form, the decluttered network can be 10 times smaller (in number of ops) than the input graph...

In comparison, optimize is relatively cheap. As a matter of fact, most operators do not have an optimized form that differs from the decluttered one. The remaining ones (Scan, MatMul, Conv mostly) are just translated one-to-one. Then MatMul can aggregate a few simple operations that happen at its output (like a bias addition, a relu or a quantization fix), but all of these optimisations only require "observing" and fixing the graph locally, while decluttering may visit and rewrite big portions of the graph.
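
If you want to see what it does on your particular model, a quick way is to compare node counts before and after decluttering. A rough sketch (same model.onnx and input fact as above; the nodes() accessor here stands for whatever node-count accessor your tract version exposes):

use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let inference_model = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        .with_input_fact(0, InferenceFact::dt_shape(u8::datum_type(), tvec!(1.into(), TDim::s())))?;
    println!("inference graph: {} nodes", inference_model.nodes().len());

    // Decluttering rewrites the graph onto the core operator set and simplifies it.
    let decluttered = inference_model.into_typed()?.declutter()?;
    println!("decluttered graph: {} nodes", decluttered.nodes().len());

    Ok(())
}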

@bminixhofer
Contributor Author

Ok, that makes sense. Thanks!

@kali
Collaborator

kali commented Jun 28, 2020

@bminixhofer I think it's ready for you to give it a shot now :) thanks for your patience, I'm on vacation, so keeping it light ;)

@bminixhofer
Contributor Author

@kali Sure, will do. Enjoy your vacation! Well deserved after all the recent tract improvements ;)

@bminixhofer
Contributor Author

bminixhofer commented Jun 29, 2020

Ok, I've tried the following:

Cargo.toml

[package]
name = "sblstmtest"
version = "0.1.0"
authors = ["Benjamin Minixhofer <bminixhofer@gmail.com>"]
edition = "2018"

[dependencies]
tract-onnx = { git = "https://github.com/snipsco/tract", branch = "fixes-for-concretize-dim" }
tract-core = { git = "https://github.com/snipsco/tract", branch = "fixes-for-concretize-dim" }

src/main.rs

use tract_core::dim::ToDim;
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let _model = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        .with_input_fact(
            0,
            InferenceFact::dt_shape(u8::datum_type(), tvec!(1.to_dim(), TDim::s())),
        )?
        .into_typed()?
        .declutter()?;

    Ok(())
}

I get an error:

➜  ~/Documents/Experiments/sblstmtest git:(master) ✗ cargo run
   Compiling sblstmtest v0.1.0 (/Users/bminixhofer/Documents/Experiments/sblstmtest)
    Finished dev [unoptimized + debuginfo] target(s) in 4.79s
     Running `target/debug/sblstmtest`
Error: TractError(Msg("Failed analyse for node #26 \"Reshape_12\" Reshape"), State { next_error: Some(TractError(Msg("Infering facts"), State { next_error: Some(TractError(Msg("Applying rule Given2Rule { (inputs[0].shape, inputs[1]) }: Can not compute shape with streaming dimension and -1 placeholder"), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })), backtrace: InternalBacktrace { backtrace: None } })), backtrace: InternalBacktrace { backtrace: None } })

So it seems there is still some problem.

@kali
Collaborator

kali commented Jun 29, 2020

Mmmm... bit weird, I have this working:

TSAR 29/06 21:13 ~/dev/snips/tmp/tract_300% cat Cargo.toml
[package]
name = "sblstmtest"
version = "0.1.0"
authors = ["Benjamin Minixhofer <bminixhofer@gmail.com>"]
edition = "2018"

[dependencies]
tract-onnx = { path = "../../tract/onnx" }
TSAR 29/06 21:13 ~/dev/snips/tmp/tract_300% cat src/main.rs
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let decluttered_model = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        .with_input_fact(0, InferenceFact::dt_shape(u8::datum_type(), tvec!(1.into(), TDim::s())))?
        .into_typed()?
        .declutter()?;

     let optimized_model = decluttered_model
        .concretize_stream_dim(40)? // borrows &self
        .optimize()?
        .into_runnable()?;

    let input: Tensor = tract_ndarray::Array2::<u8>::zeros((1, 40)).into();
    optimized_model.run(tvec!(input))?;

    Ok(())
}
TSAR 29/06 21:13 ~/dev/snips/tmp/tract_300% cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/sblstmtest`
TSAR 29/06 21:14 ~/dev/snips/tmp/tract_300% ls -al model.onnx
lrwxrwxrwx 1 kali kali 56 Jun 23 14:37 model.onnx -> ../../tract/onnx/test_cases/byte_sb_bidi_lstm/model.onnx

Do you see what we do differently?

@bminixhofer
Contributor Author

The relevant code looks the same to me..

  1. Is the branch fixes-for-concretize-dim I'm using correct?
  2. Are we using the same model? I use the one in CI of this repo (or intend to at least):
➜  ~/Documents/Experiments/sblstmtest git:(master) ✗ md5sum model.onnx
c8ce25fcb49be278a751e5fd870c1b05  model.onnx

@kali
Collaborator

kali commented Jun 29, 2020

TSAR 29/06 22:38 ~/dev/snips/tract% git rev-parse HEAD
cc2052fc57c9d7ba6e02d580f6d82b5f64b2eceb
TSAR 29/06 22:39 ~/dev/snips/tract% cd ../tmp/tract_300
TSAR 29/06 22:39 ~/dev/snips/tmp/tract_300% md5sum model.onnx
c8ce25fcb49be278a751e5fd870c1b05  model.onnx

@kali
Collaborator

kali commented Jun 29, 2020

Also, tried your main, no error. (and FYI, tract_core is re-exported from tract_onnx)

@bminixhofer
Contributor Author

Hmm, that's really strange. Can you try cargo run in this directory: sblstmtest.zip? Maybe you can reproduce it that way. I think it has to be some mismatch in the git branch / tract version.

@kali
Collaborator

kali commented Jun 30, 2020

I think you just need a cargo update... at least that did the trick here.

@bminixhofer
Contributor Author

Oh, sorry, I thought deleting target was enough. Now it is working. So to get back on track:

It is working now, and I get the same output using concretize_stream_dim as I do when passing fixed dimensions. I also went ahead and ran some simple benchmarks:

src/lib.rs

use tract_onnx::prelude::*;

pub fn infer_dynamic(model: &TypedModel) -> TractResult<TVec<Arc<Tensor>>> {
    let optimized_model = model
        .concretize_stream_dim(40)? // borrows &self
        .optimize()?
        .into_runnable()?;

    let input: Tensor = tract_ndarray::Array2::<u8>::zeros((1, 40)).into();
    optimized_model.run(tvec!(input))
}

pub fn infer_unoptimized_dynamic(
    model: &InferenceSimplePlan<InferenceModel>,
) -> TractResult<TVec<Arc<Tensor>>> {
    let input: Tensor = tract_ndarray::Array2::<u8>::zeros((1, 40)).into();
    model.run(tvec!(input))
}

pub fn infer_static(model: &TypedSimplePlan<TypedModel>) -> TractResult<TVec<Arc<Tensor>>> {
    let input: Tensor = tract_ndarray::Array2::<u8>::zeros((1, 40)).into();
    model.run(tvec!(input))
}

and running a simple criterion benchmark with these functions:

infer_dynamic           time:   [19.420 ms 19.657 ms 19.893 ms]                          

infer_unoptimized_dynamic                                                                            
                        time:   [40.084 ms 40.605 ms 41.133 ms]

infer_static            time:   [5.5758 ms 5.6629 ms 5.7513 ms]                        

So this is already a 2x speedup over the unoptimized dynamic plan, but still significantly slower than infer_static; I guess that is expected.
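
For completeness, the harness is just criterion's bench_function over those three helpers, roughly like this (a sketch, not the exact file I ran; it assumes a criterion dev-dependency with harness = false and a benches/inference.rs):

use criterion::{criterion_group, criterion_main, Criterion};
use sblstmtest::{infer_dynamic, infer_static, infer_unoptimized_dynamic};
use tract_onnx::prelude::*;

fn bench_models(c: &mut Criterion) {
    // Decluttered model with the symbolic S dimension, for infer_dynamic.
    let decluttered = tract_onnx::onnx()
        .model_for_path("model.onnx").unwrap()
        .with_input_fact(0, InferenceFact::dt_shape(u8::datum_type(), tvec!(1.into(), TDim::s()))).unwrap()
        .into_typed().unwrap()
        .declutter().unwrap();

    // Unoptimized inference plan, for infer_unoptimized_dynamic.
    let unoptimized = tract_onnx::onnx()
        .model_for_path("model.onnx").unwrap()
        .into_runnable().unwrap();

    // Fully optimized plan for a fixed (1, 40) input, for infer_static.
    let static_plan = tract_onnx::onnx()
        .model_for_path("model.onnx").unwrap()
        .with_input_fact(0, InferenceFact::dt_shape(u8::datum_type(), tvec!(1, 40))).unwrap()
        .into_optimized().unwrap()
        .into_runnable().unwrap();

    c.bench_function("infer_dynamic", |b| b.iter(|| infer_dynamic(&decluttered).unwrap()));
    c.bench_function("infer_unoptimized_dynamic", |b| {
        b.iter(|| infer_unoptimized_dynamic(&unoptimized).unwrap())
    });
    c.bench_function("infer_static", |b| b.iter(|| infer_static(&static_plan).unwrap()));
}

criterion_group!(benches, bench_models);
criterion_main!(benches);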

@kali
Collaborator

kali commented Jul 1, 2020

All right! I'm going to merge this then.

Thanks for doing the bench. I thought we would be closer to infer_static, but I may have found one easy optimisation opportunity impacting both decluttering and optimizing. We'll see how it goes.

@kali
Collaborator

kali commented Jul 1, 2020

Closing this one to switch to #313 for the next possible steps.

@kali kali closed this as completed Jul 1, 2020
@kali
Collaborator

kali commented Jul 2, 2020

@bminixhofer so you may want to check the impact of #312 on your benches if you still have them around. I can see huge decluttering-time improvements on some networks.

@bminixhofer
Contributor Author

Wow, looking much better!

infer_dynamic           time:   [8.3267 ms 8.4138 ms 8.4957 ms]                          
                        change: [-57.894% -57.196% -56.526%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low severe
  7 (7.00%) low mild

infer_unoptimized_dynamic                                                                            
                        time:   [38.940 ms 39.448 ms 39.957 ms]
                        change: [-4.5982% -2.8500% -1.1127%] (p = 0.00 < 0.05)
                        Performance has improved.

infer_static            time:   [5.3221 ms 5.3788 ms 5.4354 ms]                          
                        change: [-6.7529% -5.0173% -3.2052%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  4 (4.00%) low severe
  9 (9.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

@kali
Collaborator

kali commented Jul 2, 2020

Yeah. That's more like it :) And it will leave me time to think about the next steps :)
