This repository was archived by the owner on Jul 7, 2023. It is now read-only.

How to train using my own dataset? #516

@ndvbd


I would like to train an en->fr Transformer model on one GPU, using 32k-vocab wordpieces (sentencepiece), as in GNMT. I have my own dataset of English sentences and corresponding French sentences.

What's the right way to do this with T2T? Is there a way to pass the train/dev/test source and target files on the command line, or must I register my own Problem by adding a class to the data_generators folder?

Will the 32k vocab be generated automatically from the source and target corpora I supply?
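For context, my understanding of the command-line flow is that datagen builds the subword vocabulary from the supplied corpora when the problem uses a subword vocab type. A hedged sketch, assuming the custom Problem is registered in a user module picked up via `--t2t_usr_dir` (the problem name and all paths here are placeholders):

```shell
# Generate TFRecords and the ~32k subword vocab (placeholder paths/problem).
t2t-datagen \
  --t2t_usr_dir=$HOME/t2t_usr \
  --problem=translate_enfr_mydata \
  --data_dir=$HOME/t2t_data \
  --tmp_dir=/tmp/t2t_tmp

# Train the Transformer on one GPU against the generated data.
t2t-trainer \
  --t2t_usr_dir=$HOME/t2t_usr \
  --problem=translate_enfr_mydata \
  --data_dir=$HOME/t2t_data \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --output_dir=$HOME/t2t_train/enfr
```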
