
Training custom data not working #18

Closed
vietvudanh opened this issue Jun 4, 2018 · 5 comments

@vietvudanh

vietvudanh commented Jun 4, 2018

I am trying to train on custom input data.

It is just a simple text file containing:

custom_train_data.txt

capuchino B-W
.   B-W

cà B-W
phê I-W
việt I-W
.   B-W

macchiato B-W
.   B-W

trà B-W
đào I-W
.   B-W

bánh B-W
ngọt I-W
.   B-W

latte B-W
.   B-W

cà B-W
phê I-W
ý I-W
.   B-W

capuchino B-W
.   B-W

cà B-W
phê I-W
việt I-W
.   B-W

macchiato B-W
.   B-W

trà B-W
đào I-W
.   B-W

bánh B-W
ngọt I-W
.   B-W

latte B-W
.   B-W

cà B-W
phê I-W
ý I-W
.   B-W

bún B-W
huế I-W
.   B-W

bánh B-W
huế I-W
.   B-W

chè B-W
huế I-W
.   B-W

cuốn B-W
huế I-W
.   B-W

cơm B-W
hến I-W
.   B-W

bánh B-W
canh I-W
.   B-W

and I ran the training script as

python train.py --train custom_train_data.txt

which generated the model.bin file ok.

However, when I replaced the model in underthesea at underthesea/underthesea/word_tokenize/model_9.bin (I set a debug flag and made sure the right model was the one being loaded) and tried to tokenize a string with it, the multi-syllable words were not grouped at all:

>> underthesea.word_tokenize('cơm hến tại nhà hàng Việt')
['cơm', 'hến', 'tại', 'nhà', 'hàng', 'Việt']

So what do you think the problem is here?

@vietvudanh
Author

There is one mistake in the data above: the separator on each line was a space, which caused the training process to fail.
I fixed it by using a tab, and training succeeded, but word_tokenize still fails.
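
In case it helps anyone else, a small script along these lines can rewrite the file with tab separators (the output filename is just an example):

# Rewrite each token/tag line so the two columns are separated by one tab,
# keeping the blank lines that separate sentences.
with open('custom_train_data.txt', encoding='utf-8') as f_in, \
     open('custom_train_data_tabbed.txt', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        parts = line.split()
        if parts:
            f_out.write('\t'.join(parts) + '\n')
        else:
            f_out.write('\n')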

@rain1024
Contributor

rain1024 commented Jun 4, 2018

@vietvudanh could you provide the error log from when you call the word_tokenize function?

@vietvudanh
Author

vietvudanh commented Jun 5, 2018

There was no error, but the output is not tokenized correctly at all.
I suspect the training process goes wrong at some point, although the training output looks fine, with a dev score of 0.96.

***** Iteration #257 *****
Loss: 38780.572262
Feature norm: 134.695505
Error norm: 126.477712
Active features: 11327
Line search trials: 3
Line search step: 0.250000
Seconds required for this iteration: 0.925

L-BFGS terminated with the stopping criteria
Total seconds required for training: 135.847

Storing the model
Number of active features: 11327 (1443864)
Number of active attributes: 8968 (1415832)
Number of active labels: 2 (2)
Writing labels
Writing attributes
Writing feature references for transitions
Writing feature references for attributes
Seconds required: 0.013

Dev score:  0.9685743131361209

My questions are:

  1. Can I use really small training data (like in the first post) and still have the tokenization work for just those words (i.e. train with bánh bèo, then tokenize ăn bánh bèo)? I ask because I have tried to train the model using (1) the sample VLSP data and (2) the sample VLSP data appended with my custom data:

python train.py --train data/vlsp2016/corpus/train.txt
python train.py --train data_vlsp_and_custom.txt

and neither of the generated models works.

  2. Is copying model.bin to underthesea/underthesea/word_tokenize/model_9.bin and installing underthesea with pip install -e the correct way to pick up the new model? (See the sanity check below.)
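
For what it's worth, a quick way to confirm which installed copy of the package is actually being imported (and hence which model_9.bin would be loaded) might be something like:

import os
import underthesea

# With `pip install -e`, this should print the path of the local checkout
# that contains the replaced word_tokenize/model_9.bin.
print(os.path.dirname(underthesea.__file__))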

@rain1024
Contributor

rain1024 commented Jun 7, 2018

About your first question: I think we need to feed more data into the model than one or two simple examples. That is not enough for the model to "learn" the pattern.

If you want to make your own custom word_tokenize, you don't need to integrate it with underthesea. You can simply "export" your model and wrap it in a script. I know this is not obvious right now, so I will update the documentation and the pipeline in the next few days.
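
Roughly, assuming the model was trained with python-crfsuite (the training log above looks like CRFsuite output), such a wrapper might look like the sketch below; the feature extraction here is only a placeholder and must match whatever features train.py actually generates:

import pycrfsuite

# Load the CRF model produced by train.py.
tagger = pycrfsuite.Tagger()
tagger.open('model.bin')

def my_word_tokenize(text):
    tokens = text.split()
    # Placeholder features; replace with the same feature template train.py uses.
    features = [['word.lower=' + t.lower()] for t in tokens]
    tags = tagger.tag(features)  # e.g. ['B-W', 'I-W', 'B-W', ...]
    # Merge B-W/I-W spans back into multi-syllable words.
    words = []
    for token, tag in zip(tokens, tags):
        if tag == 'I-W' and words:
            words[-1] += ' ' + token
        else:
            words.append(token)
    return words

print(my_word_tokenize('cơm hến tại nhà hàng Việt'))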

Don't hesitate to post updates on your work in this issue.

@vietvudanh
Author

Sure, I will spend more time working on the code. Further guidance would be much appreciated.

I still don't understand why the model trained on the original VLSP data is not working, though.
