Is nnlm-zh trained in both traditional and simplified Chinese? #601
The introduction of nnlm-zh only says "Token based text embedding trained on Chinese Google News 100B corpus." I'm wondering whether the corpus is in traditional Chinese, simplified Chinese, or both.

Comments
I also didn't find the specification. Can you try a few examples? For instance, input a simplified Chinese word and its traditional counterpart, and see whether the produced embeddings are the same, or whether one of them is mapped to the unknown token.
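For reference, a minimal sketch of the comparison suggested above (not from the thread itself): it assumes the TF2 SavedModel at https://tfhub.dev/google/nnlm-zh-dim128/2 and uses "电脑" / "電腦" ("computer") as an example simplified/traditional pair.

```python
import numpy as np
import tensorflow_hub as hub

# Load the token-based text embedding (assuming the TF2 SavedModel version).
embed = hub.load("https://tfhub.dev/google/nnlm-zh-dim128/2")

# "computer" written in simplified and in traditional script.
simplified, traditional = "电脑", "電腦"
vecs = embed([simplified, traditional]).numpy()

# Compare the two vectors, as suggested above: very dissimilar vectors, or one
# of them matching whatever the module returns for clearly out-of-vocabulary
# strings, would hint that only one script is well covered.
cos = float(np.dot(vecs[0], vecs[1]) /
            (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1])))
print("cosine similarity:", cos)
```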
I tried some examples in this Colab notebook.
Yeah, my assumption is that they trained the model on a multilingual corpus, because I found that the embedding matrix has a vocabulary size of 971177, which seems large enough to cover both scripts. Given that, I think it's reasonable that performance on simplified Chinese is higher than on traditional Chinese. First, my intuition is that there is more simplified Chinese data. Second, text segmentation probably works better for simplified Chinese, leading to a better word/phrase vocabulary.
This gives me a lot to think about...
I'm wondering how you figured out the vocab size?
I suggest you have a look at the paper to see if there is any helpful detail. I checked the model variables and found only one matrix (the embedding matrix mentioned above); its first dimension gives the vocab size.
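In case it helps, here is one way to list the module's variables and read off the vocab size. This is a sketch, not the exact steps used in the thread, and it assumes the TF2 SavedModel at https://tfhub.dev/google/nnlm-zh-dim128/2.

```python
import tensorflow_hub as hub

# Wrap the module as a Keras layer so its variables are exposed via .weights
# (assuming the TF2 SavedModel version of the module).
layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-zh-dim128/2")

# Print every variable with its shape. For this module there is a single
# embedding matrix; its first dimension is the vocabulary size (971177
# according to the comment above) and its second is the embedding dimension.
for w in layer.weights:
    print(w.name, w.shape)
```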
@djdongjin thanks! You've given me a really big hint.