About text_to_sequences #1

Open
Muzhi1920 opened this issue May 24, 2018 · 0 comments
Regarding the `tokenizer.texts_to_sequences` method: if the tokenizer is fitted separately on the train, validation, and test sets, the same word can end up with a different index in each set. When you then predict on unseen, unlabeled data, this mismatch can cause serious errors.

I trained a Chinese comment sentiment-analysis model following your code, converting each set with `texts_to_sequences` directly, and the F1-score was only 0.7382. That is poor, and lower than naive Bayes (0.8988).

After I noticed this issue, I built `tokenizer.word_index` from the whole dataset, so every word has exactly one index shared across all sets. Otherwise, as described above, the same word gets a different index in each set, and therefore a different embedding vector. With the shared vocabulary, the F1-score rose to 0.9325. By the way, my result ranked first in the competition!

The core point is the strategy used to build `tokenizer.word_index`, and whether every dataset split is converted against a complete, shared vocabulary.
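A minimal sketch of the fix described above, assuming the Keras `Tokenizer` API (`fit_on_texts` / `texts_to_sequences` / `word_index`); `train_texts`, `val_texts`, and `test_texts` are hypothetical lists of pre-segmented (space-separated) texts:

```python
from keras.preprocessing.text import Tokenizer

# Problematic: fitting a separate Tokenizer per split gives the same
# word a different index in train/val/test, e.g.:
#   tok_train = Tokenizer(); tok_train.fit_on_texts(train_texts)
#   tok_test  = Tokenizer(); tok_test.fit_on_texts(test_texts)
# A word may be index 5 in tok_train.word_index but 12 in tok_test's.

# Fix: build one vocabulary over all splits, then convert every split
# with that single shared word_index.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts + val_texts + test_texts)

x_train = tokenizer.texts_to_sequences(train_texts)
x_val   = tokenizer.texts_to_sequences(val_texts)
x_test  = tokenizer.texts_to_sequences(test_texts)
# Every word now maps to exactly one index across all three sets.
```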
