Welcome to Seq2SeqSharp Discussions! #38
-
Hi. How about tokenizing a dataset on the fly, like OpenNMT does? That way new text wouldn't have to be tokenized separately beforehand.
-
Hi @SileNTViP, Seq2SeqSharp already supports this. You can pass the "-SrcSentencePieceModelPath" and "-TgtSentencePieceModelPath" parameters on the command line, each with the path to a SentencePiece model file. Here is an example for translation from English to Chinese:

.\bin\Seq2SeqConsole\Seq2SeqConsole.exe -Task Test -ModelFilePath .\model\seq2seq_mt_enu_chs.model -InputTestFile .\data\test\test_enu_raw.txt -OutputFile out_chs.txt -MaxTestSrcSentLength 110 -MaxTestTgtSentLength 110 -ProcessorType CPU -SrcSentencePieceModelPath .\spm\enuSpm.model -TgtSentencePieceModelPath .\spm\chsSpm.model -BeamSearchSize 1 -BatchSize 2 -DeviceIds 0,1,2,3 -ShuffleType Random

Thanks
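For readers who want to see what on-the-fly tokenization amounts to, here is a minimal Python sketch using the standalone sentencepiece package (not Seq2SeqSharp's own code): raw source text is encoded into subword pieces before inference, and the output pieces are detokenized back into plain target text. The model file names and the translate_pieces placeholder are assumptions for illustration only.

```python
# Conceptual sketch of on-the-fly SentencePiece tokenization around a translation step.
# Uses the standalone `sentencepiece` Python package; the model paths below mirror the
# files in the command above and are assumed to exist.
import sentencepiece as spm

src_sp = spm.SentencePieceProcessor(model_file="spm/enuSpm.model")
tgt_sp = spm.SentencePieceProcessor(model_file="spm/chsSpm.model")

def translate_raw(line, translate_pieces):
    """Encode raw source text, hand the subword pieces to a translator
    callback (a placeholder here), and detokenize the target pieces."""
    src_pieces = src_sp.encode(line, out_type=str)   # raw text -> subword pieces
    tgt_pieces = translate_pieces(src_pieces)        # model inference (placeholder)
    return tgt_sp.decode(tgt_pieces)                 # subword pieces -> plain text

# Identity "translator" just to show the tokenize/detokenize round trip:
print(translate_raw("This is a raw, untokenized sentence.", lambda pieces: pieces))
```

When the two SentencePiece model paths are supplied, Seq2SeqConsole performs the equivalent step internally, which is why the example above can point -InputTestFile at a raw, untokenized text file.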
-
I'll try training on a big dataset and get this:
-
👋 Welcome!
We're using Discussions as a place to connect with other members of our community. We hope that you:
- Ask questions you're wondering about.
- Share ideas.
- Engage with other community members.
- Welcome others and are open-minded. Remember that this is a community we build together 💪.
To get started, comment below with an introduction of yourself and tell us about what you do with this community.