Is nnlm-zh trained in both traditional and simplified Chinese? #601

Closed
penut85420 opened this issue May 28, 2020 · 6 comments

Comments

@penut85420

The introduction of nnlm-zh here only mentions "Token based text embedding trained on Chinese Google News 100B corpus." I'm wondering whether the corpus is traditional Chinese, simplified Chinese, or both?

@djdongjin
Contributor

I couldn't find a specification either. Can you try a few examples? For instance, input a simplified Chinese word and its traditional Chinese counterpart, and check whether the produced embeddings are the same, or whether one of them is mapped to the unknown token.
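
A minimal sketch of such a check, assuming the TF2 nnlm-zh-dim128/2 handle on TF Hub (the exact module and version discussed here are not stated in the thread):

```python
# Compare embeddings of simplified/traditional pairs.
# Assumes TF2 and the nnlm-zh-dim128/2 handle; adjust as needed.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/nnlm-zh-dim128/2")

pairs = [("猫", "貓"), ("冯", "馮")]  # (simplified, traditional)
for simp, trad in pairs:
    v_simp, v_trad = embed([simp, trad]).numpy()
    cos = float(np.dot(v_simp, v_trad) /
                (np.linalg.norm(v_simp) * np.linalg.norm(v_trad)))
    # Near-identical vectors would suggest both map to the same token (or both
    # fall back to the unknown token); dissimilar vectors suggest distinct tokens.
    print(simp, trad, "cosine similarity:", round(cos, 4))
```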

@penut85420
Author

I tried some examples in this colab notebook.
I input pairs of characters and sentences in both simplified and traditional Chinese, and the results show that they get different embeddings. I think this is reasonable, since they are essentially different characters. However, in some of my tasks, translating the traditional Chinese dataset into simplified Chinese improves performance a little. So I'm wondering whether this performance difference is caused by the training corpus of the text embedding or by the language itself.

@djdongjin
Contributor

> However, in some of my tasks, translating the traditional Chinese dataset into simplified Chinese improves performance a little.

Yeah, my assumption is that they trained the model on a multilingual corpus, because the embedding matrix shows a vocab size of 971,177.

Given that, I think it's reasonable that performance on simplified Chinese is higher than on traditional Chinese. First, my intuition is that they have more simplified Chinese data. Second, text segmentation for simplified Chinese might work better, leading to a better word/phrase vocabulary.

@penut85420
Author

This gives me a lot to think about...

> the embedding matrix shows a vocab size of 971,177.

I'm wondering how you figured out the vocab size?

@djdongjin
Contributor

> This gives me a lot to think about...
>
> > the embedding matrix shows a vocab size of 971,177.
>
> I'm wondering how you figured out the vocab size?

I suggest you have a look at the paper to see if there are any helpful details. I checked the model variables and found only one matrix (`embed.variables[0].shape`), which might be the embedding matrix (just my guess; I'm not very familiar with how the model works).
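
A rough sketch of that variable inspection, again assuming the nnlm-zh-dim128/2 handle rather than the exact module used here:

```python
# List the SavedModel's variables; if there is only one large matrix,
# its first dimension is presumably the vocabulary size.
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/nnlm-zh-dim128/2")

for v in embed.variables:
    print(v.name, v.shape)

print("Presumed vocab size:", embed.variables[0].shape[0])
```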

@penut85420
Author

penut85420 commented Jun 4, 2020

@djdongjin thanks! You gave me a really big hint.
After digging into the SavedModel, I found a tokens.txt file under the assets directory, and the Chinese tokens in it are mostly simplified Chinese. For example, the traditional Chinese character for "cat", "貓", does not appear in the vocab file, while the simplified "猫" is included in many tokens. Characters such as "馮" show the same pattern. Although some characters are identical in traditional and simplified Chinese, I think this text embedding is more suitable for simplified Chinese.
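
A hypothetical sketch of that vocab-file check; the module handle is an assumption, and the assets/tokens.txt path follows the description above:

```python
# Locate the downloaded module and look up specific characters in tokens.txt.
import os
import tensorflow_hub as hub

# hub.resolve returns the local cache path of the module.
module_dir = hub.resolve("https://tfhub.dev/google/nnlm-zh-dim128/2")
vocab_path = os.path.join(module_dir, "assets", "tokens.txt")

with open(vocab_path, encoding="utf-8") as f:
    tokens = f.read().splitlines()

# Count how many vocab entries contain each character.
for char in ["猫", "貓", "冯", "馮"]:
    hits = sum(1 for t in tokens if char in t)
    print(char, "appears in", hits, "tokens")
```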
