Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combining 2 input sequences with sequence Combiner #729

Closed
jmizgajski opened this issue Jun 8, 2020 · 11 comments
Closed

Combining 2 input sequences with sequence Combiner #729

jmizgajski opened this issue Jun 8, 2020 · 11 comments
Labels
bug Something isn't working

Comments

@jmizgajski
Copy link
Contributor

Is your feature request related to a problem? Please describe.
I am trying to figure out how to combine 2 input sequences with sequence Combiner and I am not sure if this is possible. It would be great if it would ;) Right now I am getting

ValueError: Decoder inputs rank is 2, but should be 3 [batch x sequence x hidden] when using a tagger sequential decoder. Consider setting reduce_output to null / None if a sequential encoder / combiner is used.

if I try to put two sequences as inputs to sequence combiner.

Describe the use case
Consider following (simplified) features where tag is the ouput feature.

text                                lexicon                                      tag
I am a robot                   0 0 0 1                                      O O O X
George killed Greg        1 0 1                                         X O X

Describe the solution you'd like
I would like to embed text and lexicon series and predict tag with tagger.
So I would like to set up the combiner in a way that the first row that goes into the encoder of the sequence combiner would look like

[ (emb(I) | emb(0)), (emb(am) | emb(0)), (emb(a) | emb(0)), (emb(robot) | emb(1)] 

like this:

input_features:
  - name: text
    type: text
    level: word
    encoder: embed
    reduce_output: null
  - name: lexicon
    type: sequence
    encoder: embed
    representation: sparse
    reduce_output: null
combiner:
  type: sequence
  main_sequence_feature: text
  encoder: rnn
  cell_type: lstm
  bidirectional: true
  num_layers: 2
  reduce_output: null
output_features:
  - name: tag
    type: sequence
    decoder: tagger
    reduce_inputs: null

Describe alternatives you've considered
I was hoping for this to work out of the box, but now I am not sure if this is supported.

Additional context

@w4nderlust
Copy link
Collaborator

Your task is exactly the reason why the sequence combiner exists :)

It looks like you did everything correctly apart from reduce_inputs in the output feature, which should be reduce_input. With that, the tesnor the error is complaining about should have the correct sapes.

Let me know if this fixes your issue. (this also suggests me to provide a mechanism to notify the user when wrong keys are used in the model definition, this would have avoided your error to happen if you got notified).

I am really glad that people are starting to use those more advanced features in Ludwig!

@w4nderlust w4nderlust added the waiting for answer Further information is requested label Jun 9, 2020
@jmizgajski
Copy link
Contributor Author

jmizgajski commented Jun 9, 2020

@w4nderlust Thank you for your fast reply, the promise of an easy way to use those advanced features was what has drawn me to Ludwig in the first place (after seeing you present at Interspeech 2019).

Unfortunately your recommendation does not solve the problem (same issue persists) - likely because reduce_input was set to None anyway as it is the default value for tagger.

As to your other points:

I think that ensuring correct user input is an essential addition, as there are few things more frustrating that chasing a configuration typo for half a day ;) You could use something like jsonschema for it - it is pretty easy to write and convenient to use for validation. Alternatively Data Classes in 3.7's PEP 557 could be of use, but maybe its too early for that.

I guess there is also a bug in documentation, as this was a copy-paste from the tagger definition. Where it says reduce_inputs (with the s at the end), the same typo appears at least 13 times in the user guide.

@jmizgajski
Copy link
Contributor Author

Also it would be awesome to have a clear example for such a use-case in Examples :)

@w4nderlust
Copy link
Collaborator

@w4nderlust Thank you for your fast reply, the promise of an easy way to use those advanced features was what has drawn me to Ludwig in the first place (after seeing you present at Interspeech 2019).

Nice :)

Unfortunately your recommendation does not solve the problem (same issue persists) - likely because reduce_input was set to None anyway as it is the default value for tagger.

Let me try to fix it. Could you please provide a complete reproducible example with data, configuration and command? The data can be synthetic data generated using the script in data/dataset_synthesizer.py and can also be just a couple of lines, no need for the full dataset, as I believe the same problem will appear whith any amount of data.

As to your other points:

I think that ensuring correct user input is an essential addition, as there are few things more frustrating that chasing a configuration typo for half a day ;) You could use something like jsonschema for it - it is pretty easy to write and convenient to use for validation. Alternatively Data Classes in 3.7's PEP 557 could be of use, but maybe its too early for that.

These are good suggestions, thank you. Although Ludwig's configuration file is highly dynamic, in the sense that it maps directly the parmeters you put in the configuration to parameters in the classes and functions it uses. For that reason, and for the fact that the classes themselves are often changing and new ones are added, so far I preferred not to have a static parsers for the configuration (it's a v0.x still so I know things will change quite a lot still). Moreover, if a use adds their own additional encoder for instance, they would have to modify the parser too, and I want to avoid that. What I plan to do instead is find a simple mechanism that notifies of all the unrecognized keyowrds passed to each function / object, that could be a temporary solution that preserves dynamicity and avoids failing silently like it happens right now.

I guess there is also a bug in documentation, as this was a copy-paste from the tagger definition. Where it says reduce_inputs (with the s at the end), the same typo appears at least 13 times in the user guide.

Fixed it. Although the default value for sequence features is sum not null.

Also it would be awesome to have a clear example for such a use-case in Examples :)

If you agree I can use yours as an example :)

@jmizgajski
Copy link
Contributor Author

jmizgajski commented Jun 9, 2020

this will generate the data

import pandas as pd


tagged = {'Greg', 'Xavier', 'Smith', 'Piotr', 'robot', 'dad', 'George'}
lex = {'Greg', 'Xavier', 'robot'}

texts = [
    "I like Greg and Greg likes me more than George",
    "Xavier Smith went to the gym with robot dad",
    "robot dad was not the best dad cause he didn't have limbs just bits",
    "Piotr Smith is a better name than Xavier Greg cause at least it is not both first names",
    "the gut is called the second brain for good reasons listen to your gut",
    "omg how many more examples I have to think of am I a robot or what",
    "this is not going to be easy I should ask my dad for help",
    "well he is busy so here is one more from the top of my head",
    "this is a very nice libary btw i hope we manage to solve this issue so I can continue using it",
    "i have great hopes for it"
]

df = pd.DataFrame(dict(
    text=texts,
    lexicon=[" ".join('1' if w in lex else '0' for w in t.split()) for t in texts],
    tag=[" ".join('X' if w in tagged else 'O' for w in t.split()) for t in texts]  
))
df.to_csv('combiner_test.csv')

here is the example.yaml (ideally I would also set pretrained_embeddings: glove.txt on the text feature but this is not necessary for the example to reproduce the problem)

input_features:
  - name: text
    type: text
    level: word
    embeddings_trainable: true
    embedding_size: 300
    encoder: embed
    preprocessing:
      word_format: space
    reduce_output: null

  - name: lexicon
    type: sequence
    encoder: embed
    reduce_output: null


combiner:
  type: sequence
  main_sequence_feature: text
  encoder: rnn
  cell_type: lstm
  bidirectional: true
  num_layers: 2
  reduce_input: null
  reduce_output: null

output_features:
  - name: tag
    type: sequence
    decoder: tagger
    reduce_input: null

and here is a command to reproduce

ludwig train --data_csv combiner_test.csv --model_definition_file example.yaml 

@w4nderlust w4nderlust added bug Something isn't working and removed waiting for answer Further information is requested labels Jun 10, 2020
@w4nderlust
Copy link
Collaborator

Thank you, I found the bug and fixed it. Can you please confirm it now works for you?
Please install from master with:
pip uninstall ludwig & pip install git+http://github.com/uber/ludwig.git

Also if you confirm your approval, I will add this to both examples and test.

@jmizgajski
Copy link
Contributor Author

I will do that first thing after the Wed meeting spree ;) Thank you @w4nderlust ! :)

You have my approval to use the examples I provided in any way you like, as they are purely synthetic :)

@jmizgajski
Copy link
Contributor Author

jmizgajski commented Jun 10, 2020

@w4nderlust now I get the following error (both on the synthetic example and the real dataset).

 Traceback (most recent call last):
  File "/home/aci-user/venv36/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/cli.py", line 118, in main
    CLI()
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/cli.py", line 64, in __init__
    getattr(self, args.command)()
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/cli.py", line 74, in train
    train.cli(sys.argv[2:])
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/train.py", line 809, in cli
    full_train(**vars(args))
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/train.py", line 355, in full_train
    debug=debug
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/train.py", line 503, in train
    debug=debug
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/models/model.py", line 112, in __init__
    **kwargs
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/models/model.py", line 178, in __build
    **kwargs
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/models/combiners.py", line 294, in __call__
    **kwargs
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/models/combiners.py", line 250, in __call__
    self.reduce_output
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/models/modules/reduction_modules.py", line 85, in reduce_sequence
    reduce_mode_registry
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/utils/misc.py", line 132, in get_from_registry
    key, registry.keys()
ValueError: Key False not supported, available options: dict_keys(['last', 'sum', 'mean', 'avg', 'max', 'concat', 'attention', 'none', 'None', None])

i also tried changing the reduce_input and reduce_output to none but it did not work either.

It did work on the sample dataset when I edited "ludwig/models/modules/reduction_modules.py" locally to map False: dont_reduce, but likely you have a parsing/casting problem somewhere so this would be just a cover up.

when I run it with this fix on the real dataset I get the following

Traceback (most recent call last):
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: ConcatOp : Dimensions of inputs should match: shape[0] = [512,172,300] vs. shape[1] = [512,128,4]
         [[{{node sequence_combiner/sequence_concat_combiner/concat}}]]
         [[tag/measures_tag/rowwise_accuracy_tag/_169]]
  (1) Invalid argument: ConcatOp : Dimensions of inputs should match: shape[0] = [512,172,300] vs. shape[1] = [512,128,4]
         [[{{node sequence_combiner/sequence_concat_combiner/concat}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aci-user/venv36/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/cli.py", line 118, in main
    CLI()
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/cli.py", line 64, in __init__
    getattr(self, args.command)()
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/cli.py", line 74, in train
    train.cli(sys.argv[2:])
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/train.py", line 809, in cli
    full_train(**vars(args))
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/train.py", line 355, in full_train
    debug=debug
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/train.py", line 523, in train
    **model_definition[TRAINING]
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/ludwig/models/model.py", line 597, in train
    is_training=True
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/aci-user/venv36/lib64/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: ConcatOp : Dimensions of inputs should match: shape[0] = [512,172,300] vs. shape[1] = [512,128,4]
         [[node sequence_combiner/sequence_concat_combiner/concat (defined at /lib64/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[tag/measures_tag/rowwise_accuracy_tag/_169]]
  (1) Invalid argument: ConcatOp : Dimensions of inputs should match: shape[0] = [512,172,300] vs. shape[1] = [512,128,4]
         [[node sequence_combiner/sequence_concat_combiner/concat (defined at /lib64/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'sequence_combiner/sequence_concat_combiner/concat':
  File "/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/lib64/python3.6/site-packages/ludwig/cli.py", line 118, in main
    CLI()
  File "/lib64/python3.6/site-packages/ludwig/cli.py", line 64, in __init__
    getattr(self, args.command)()
  File "/lib64/python3.6/site-packages/ludwig/cli.py", line 74, in train
    train.cli(sys.argv[2:])
  File "/lib64/python3.6/site-packages/ludwig/train.py", line 809, in cli
    full_train(**vars(args))
  File "/lib64/python3.6/site-packages/ludwig/train.py", line 355, in full_train
    debug=debug
  File "/lib64/python3.6/site-packages/ludwig/train.py", line 503, in train
    debug=debug
  File "/lib64/python3.6/site-packages/ludwig/models/model.py", line 112, in __init__
    **kwargs
  File "/lib64/python3.6/site-packages/ludwig/models/model.py", line 178, in __build
    **kwargs
  File "/lib64/python3.6/site-packages/ludwig/models/combiners.py", line 294, in __call__
    **kwargs
  File "/lib64/python3.6/site-packages/ludwig/models/combiners.py", line 234, in __call__
    hidden = tf.concat(representations, 2)
  File "/lib64/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/lib64/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 1420, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/lib64/python3.6/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 1257, in concat_v2
    "ConcatV2", values=values, axis=axis, name=name)
  File "/lib64/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/lib64/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/lib64/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/lib64/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/lib64/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

which is the same problem I was running into when trying to use the sequence_concat combiner. I also checked my dataset and the features I dry run on are all sequences of equal length and all under 128 (no wonder, this is how I generated them), so I doubt this is a problem of the dataset. So it seems the rabbit hole goes deeper :( I would reopen this issue until at least the test is in place.

I think it is unlikely that ludwig is applying some other tokenization on the text than spliting on spaces (which is what I want here) so could it be some other vector is wrongfully concatenated to the end of the text or tag feature?

@w4nderlust
Copy link
Collaborator

Pushed a new commit, it should fix The first issue, sorry about the additional step ;)

As for the second issue, that may be data dependent. The logic is that vectors are concatenated along the the third (hidden) dimension, so they should have the first two that match: batch size and sequence length.
What I suspect is happening is that there is a mismatch between the length of the text after tokenization and the length of the tag feature.
The default text cokenizer is space_punct that splits on space and punctuation, so that "Hello, world!" becomes ["Hello", ",", "world", "!"]. If you want to use just space as tokenizer, just add to the text feature:

preprocessing:
  word_tokenizer: space

If even this does not fix the issue, i can try taking a look at it myself (but i would need at least a couple datapoints for which the problem is originated).

@jmizgajski
Copy link
Contributor Author

jmizgajski commented Jun 11, 2020

@w4nderlust I guess we found another bug in the docs then :) as it does say word_format: space in the tagger example. (or maybe I am looking at release docs and we changed to work with master throughout the development of this thread, not sure)

also adding these lines under the text feature like in the example above did not help, I had to add it as a top level item in the yaml file, but now I am able to train. Thank you!

I think this is something to underline in the documentation :)

If any one has the same problem the following config could be of help:

input_features:
  - name: text
    type: text
    level: word
    embeddings_trainable: true
    embedding_size: 300
    encoder: embed
    reduce_output: null

  - name: lexicon
    type: sequence
    encoder: embed
    reduce_output: null

combiner:
  type: sequence
  main_sequence_feature: text
  encoder: rnn
  cell_type: lstm
  bidirectional: true
  num_layers: 2
  reduce_input: null
  reduce_output: null

output_features:
  - name: tag
    type: sequence
    decoder: tagger
    reduce_input: null

preprocessing:
  text:
    word_tokenizer: space

@w4nderlust
Copy link
Collaborator

w4nderlust commented Jun 11, 2020

Good catch. yes word_format was an old name for word_tokenizer, I'm updating the docs right now. Regarding your model definition. You need only one word_tokenizer: space, either in the text feature or at the top level. The top level one applies to all text features, the one inside the feature applies only to that specific feature, but you have only one text feature so it makes no difference.

Glad you were able to train the model!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants