Is nnlm-zh trained in both traditional and simplified Chinese? #601

Closed
penut85420 opened this issue May 28, 2020 · 6 comments

Comments

@penut85420

The introduction of nnlm-zh here only mentions "Token based text embedding trained on Chinese Google News 100B corpus." I'm wondering whether the corpus is traditional Chinese, simplified Chinese, or both?

@djdongjin
Contributor

I couldn't find a specification either. Can you try a few examples? For instance, input a simplified Chinese word and its traditional Chinese counterpart, and check whether the produced embeddings are the same, or whether one of them is mapped to the unknown token.
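
A minimal sketch of such a check, assuming the TF2 nnlm-zh-dim128/2 handle on TF Hub (the exact module and version discussed here are not stated in the thread):

```python
# Compare embeddings of simplified/traditional pairs.
# Assumes TF2 and the nnlm-zh-dim128/2 handle; adjust as needed.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/nnlm-zh-dim128/2")

pairs = [("猫", "貓"), ("冯", "馮")]  # (simplified, traditional)
for simp, trad in pairs:
    v_simp, v_trad = embed([simp, trad]).numpy()
    cos = float(np.dot(v_simp, v_trad) /
                (np.linalg.norm(v_simp) * np.linalg.norm(v_trad)))
    # Near-identical vectors would suggest both map to the same token (or both
    # fall back to the unknown token); dissimilar vectors suggest distinct tokens.
    print(simp, trad, "cosine similarity:", round(cos, 4))
```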

@penut85420
Author

I tried some examples in this colab notebook.
I input pairs of characters and sentences in both simplified and traditional Chinese, and the results show that they get different embeddings. I think this is reasonable, since they are essentially different characters. However, in some of my tasks, translating the traditional Chinese dataset into simplified Chinese improves performance a little. So I'm wondering whether this performance difference is caused by the training corpus of the text embedding or by the language itself.

@djdongjin
Contributor

> However, in some of my tasks, translating the traditional Chinese dataset into simplified Chinese improves performance a little.

Yeah, my assumption is that they trained the model on a multilingual corpus, because the embedding matrix shows a vocab size of 971,177.

Given that, I think it's reasonable that performance on simplified Chinese is higher than on traditional Chinese. First, my intuition is that they have more simplified Chinese data. Second, text segmentation for simplified Chinese might work better, leading to a better word/phrase vocabulary.

@penut85420
Author

This gives me a lot to think about...

> the embedding matrix shows a vocab size of 971,177.

I'm wondering how you figured out the vocab size?

@djdongjin
Contributor

> This gives me a lot to think about...
>
> > the embedding matrix shows a vocab size of 971,177.
>
> I'm wondering how you figured out the vocab size?

I suggest you have a look at the paper to see if there are any helpful details. I checked the model variables and found only one matrix (`embed.variables[0].shape`), which might be the embedding matrix (just my guess; I'm not very familiar with how the model works).
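
A rough sketch of that variable inspection, again assuming the nnlm-zh-dim128/2 handle rather than the exact module used here:

```python
# List the SavedModel's variables; if there is only one large matrix,
# its first dimension is presumably the vocabulary size.
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/nnlm-zh-dim128/2")

for v in embed.variables:
    print(v.name, v.shape)

print("Presumed vocab size:", embed.variables[0].shape[0])
```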

@penut85420
Author

penut85420 commented Jun 4, 2020

@djdongjin thanks! You gave me a really big hint.
After digging into the SavedModel, I found a tokens.txt file under the assets directory, and the Chinese tokens in it are mostly simplified Chinese. For example, the traditional Chinese character for "cat", "貓", does not appear in the vocab file, while the simplified "猫" is included in many tokens. Characters such as "馮" show the same pattern. Although some characters are identical in traditional and simplified Chinese, I think this text embedding is more suitable for simplified Chinese.
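
A hypothetical sketch of that vocab-file check; the module handle is an assumption, and the assets/tokens.txt path follows the description above:

```python
# Locate the downloaded module and look up specific characters in tokens.txt.
import os
import tensorflow_hub as hub

# hub.resolve returns the local cache path of the module.
module_dir = hub.resolve("https://tfhub.dev/google/nnlm-zh-dim128/2")
vocab_path = os.path.join(module_dir, "assets", "tokens.txt")

with open(vocab_path, encoding="utf-8") as f:
    tokens = f.read().splitlines()

# Count how many vocab entries contain each character.
for char in ["猫", "貓", "冯", "馮"]:
    hits = sum(1 for t in tokens if char in t)
    print(char, "appears in", hits, "tokens")
```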
