
Training custom data not working #18

Closed
vietvudanh opened this issue Jun 4, 2018 · 5 comments

@vietvudanh

vietvudanh commented Jun 4, 2018

I am trying to train on custom input data.

It is just a simple text file containing:

custom_train_data.txt

capuchino B-W
.   B-W

cà B-W
phê I-W
việt I-W
.   B-W

macchiato B-W
.   B-W

trà B-W
đào I-W
.   B-W

bánh B-W
ngọt I-W
.   B-W

latte B-W
.   B-W

cà B-W
phê I-W
ý I-W
.   B-W

capuchino B-W
.   B-W

cà B-W
phê I-W
việt I-W
.   B-W

macchiato B-W
.   B-W

trà B-W
đào I-W
.   B-W

bánh B-W
ngọt I-W
.   B-W

latte B-W
.   B-W

cà B-W
phê I-W
ý I-W
.   B-W

bún B-W
huế I-W
.   B-W

bánh B-W
huế I-W
.   B-W

chè B-W
huế I-W
.   B-W

cuốn B-W
huế I-W
.   B-W

cơm B-W
hến I-W
.   B-W

bánh B-W
canh I-W
.   B-W

and I ran the training script as

python train.py --train custom_train_data.txt

which generated the model.bin file ok.

However, when I replaced the model in underthesea at underthesea/underthesea/word_tokenize/model_9.bin (I set a debug flag and made sure the right model was the one being loaded) and tried to tokenize a string with it, the multi-syllable words were not grouped at all:

>> underthesea.word_tokenize('cơm hến tại nhà hàng Việt')
['cơm', 'hến', 'tại', 'nhà', 'hàng', 'Việt']

So what do you think the problem is here?

@vietvudanh
Author

There is one mistake in the data above: the separator on each line was a space, which caused the training process to fail.
I fixed it by using a tab, and training succeeded, but word_tokenize still fails.
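
In case it helps anyone else, a small script along these lines can rewrite the file with tab separators (the output filename is just an example):

# Rewrite each token/tag line so the two columns are separated by one tab,
# keeping the blank lines that separate sentences.
with open('custom_train_data.txt', encoding='utf-8') as f_in, \
     open('custom_train_data_tabbed.txt', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        parts = line.split()
        if parts:
            f_out.write('\t'.join(parts) + '\n')
        else:
            f_out.write('\n')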

@rain1024
Contributor

rain1024 commented Jun 4, 2018

@vietvudanh could you provide the error log from when you call the word_tokenize function?

@vietvudanh
Author

vietvudanh commented Jun 5, 2018

There was no error, but the output is not tokenized correctly at all.
I suspect the training process goes wrong at some point, although the training output looks fine, with a dev score of 0.96.

***** Iteration #257 *****
Loss: 38780.572262
Feature norm: 134.695505
Error norm: 126.477712
Active features: 11327
Line search trials: 3
Line search step: 0.250000
Seconds required for this iteration: 0.925

L-BFGS terminated with the stopping criteria
Total seconds required for training: 135.847

Storing the model
Number of active features: 11327 (1443864)
Number of active attributes: 8968 (1415832)
Number of active labels: 2 (2)
Writing labels
Writing attributes
Writing feature references for transitions
Writing feature references for attributes
Seconds required: 0.013

Dev score:  0.9685743131361209

My questions are:

  1. Can I use really small training data (like in the first post) and still have the tokenization work for just those words (i.e. train with bánh bèo, then tokenize ăn bánh bèo)? I ask because I have tried to train the model using (1) the sample VLSP data and (2) the sample VLSP data appended with my custom data:

python train.py --train data/vlsp2016/corpus/train.txt
python train.py --train data_vlsp_and_custom.txt

and neither of the generated models works.

  2. Is copying model.bin to underthesea/underthesea/word_tokenize/model_9.bin and installing underthesea with pip install -e the correct way to pick up the new model? (See the sanity check below.)
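
For what it's worth, a quick way to confirm which installed copy of the package is actually being imported (and hence which model_9.bin would be loaded) might be something like:

import os
import underthesea

# With `pip install -e`, this should print the path of the local checkout
# that contains the replaced word_tokenize/model_9.bin.
print(os.path.dirname(underthesea.__file__))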

@rain1024
Contributor

rain1024 commented Jun 7, 2018

About your first question: I think we need to feed more data into the model than one or two simple examples. That is not enough for the model to "learn" the pattern.

If you want to make your own custom word_tokenize, you don't need to integrate it with underthesea. You can simply "export" your model and wrap it in a script. I know this is not obvious right now, so I will update the documentation and the pipeline in the next few days.
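
Roughly, assuming the model was trained with python-crfsuite (the training log above looks like CRFsuite output), such a wrapper might look like the sketch below; the feature extraction here is only a placeholder and must match whatever features train.py actually generates:

import pycrfsuite

# Load the CRF model produced by train.py.
tagger = pycrfsuite.Tagger()
tagger.open('model.bin')

def my_word_tokenize(text):
    tokens = text.split()
    # Placeholder features; replace with the same feature template train.py uses.
    features = [['word.lower=' + t.lower()] for t in tokens]
    tags = tagger.tag(features)  # e.g. ['B-W', 'I-W', 'B-W', ...]
    # Merge B-W/I-W spans back into multi-syllable words.
    words = []
    for token, tag in zip(tokens, tags):
        if tag == 'I-W' and words:
            words[-1] += ' ' + token
        else:
            words.append(token)
    return words

print(my_word_tokenize('cơm hến tại nhà hàng Việt'))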

Don't hesitate to post updates on your work in this issue.

@vietvudanh
Author

Sure, I will spend more time working on the code. Further guidance would be much appreciated.

I still don't understand why the model trained on the original VLSP data is not working, though.
