About text_to_sequences #1

Open
Muzhi1920 opened this issue May 24, 2018 · 0 comments
Regarding the `tokenizer.texts_to_sequences` method: if the tokenizer is fitted separately on the train, validation, and test sets, the same word can end up with a different index in each set. When you then predict on unseen, unlabeled data, this mismatch can cause serious errors.

I trained a Chinese comment sentiment-analysis model following your code, converting each set with `texts_to_sequences` directly, and the F1-score was only 0.7382. That is poor, and lower than naive Bayes (0.8988).

After I noticed this issue, I built `tokenizer.word_index` from the whole dataset, so every word has exactly one index shared across all sets. Otherwise, as described above, the same word gets a different index in each set, and therefore a different embedding vector. With the shared vocabulary, the F1-score rose to 0.9325. By the way, my result ranked first in the competition!

The core point is the strategy used to build `tokenizer.word_index`, and whether every dataset split is converted against a complete, shared vocabulary.
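A minimal sketch of the fix described above, assuming the Keras `Tokenizer` API (`fit_on_texts` / `texts_to_sequences` / `word_index`); `train_texts`, `val_texts`, and `test_texts` are hypothetical lists of pre-segmented (space-separated) texts:

```python
from keras.preprocessing.text import Tokenizer

# Problematic: fitting a separate Tokenizer per split gives the same
# word a different index in train/val/test, e.g.:
#   tok_train = Tokenizer(); tok_train.fit_on_texts(train_texts)
#   tok_test  = Tokenizer(); tok_test.fit_on_texts(test_texts)
# A word may be index 5 in tok_train.word_index but 12 in tok_test's.

# Fix: build one vocabulary over all splits, then convert every split
# with that single shared word_index.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts + val_texts + test_texts)

x_train = tokenizer.texts_to_sequences(train_texts)
x_val   = tokenizer.texts_to_sequences(val_texts)
x_test  = tokenizer.texts_to_sequences(test_texts)
# Every word now maps to exactly one index across all three sets.
```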
