Some Japanese text is detected as Mandarin Chinese #63
Comments
Both Franc and Google Translate detect both examples as Japanese. I believe that's correct, but I don't know Japanese, so I can't be sure. If this is an issue with Amazon or Yahoo, that's not something I can help with. |
@wooorm thanks for the response! You're right, the example I pasted here is not a good one, but I was able to work out what was going wrong and find a workaround. I solved it like this:
|
Do you have an example text that includes numbers which is reported wrong? |
I'm getting mixed results with Japanese text too. Using text from a random Japanese website:
Google Translate returns
Google Translate returns
The last example uses even shorter text, but the language is detected correctly. I tried @ThisIsRoy1's solution of stripping out numbers, but that didn't work. Both texts have all numbers and newlines stripped out. Which characters could be causing the wrong result in the first one? |
Hi, add this to your Japanese Unicode regex and it will fix it: [\u3000-\u303F\u3300-\u33FF\u4E00-\u9FFF] The random test above is now recognized as Japanese with 90% confidence. These ranges cover CJK Symbols and Punctuation, CJK Compatibility, and CJK Unified Ideographs. |
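To see why those ranges matter, it can help to count how many characters of a string fall into the kana blocks versus the Han block. The following is only a rough sketch: `countMatches`, `KANA`, `HAN`, and `kanaAndHanCounts` are hypothetical names for illustration, not part of franc, and the ranges are the plain Unicode blocks for Hiragana, Katakana, and CJK Unified Ideographs rather than the expressions the library actually uses.

```javascript
// Hypothetical helper: count the characters of `text` matched by a global regex.
function countMatches(text, expression) {
  const matches = text.match(expression);
  return matches ? matches.length : 0;
}

// Assumed ranges (plain Unicode blocks, not the library's own expressions).
const KANA = /[\u3040-\u309F\u30A0-\u30FF]/g; // Hiragana and Katakana
const HAN = /[\u4E00-\u9FFF]/g; // CJK Unified Ideographs (shared by Chinese and Japanese)

// Report how much of the text is kana and how much is Han.
function kanaAndHanCounts(text) {
  return {
    kana: countMatches(text, KANA),
    han: countMatches(text, HAN)
  };
}

console.log(kanaAndHanCounts('裁判の周辺のラオスにUターンした元元兵士'));
console.log(kanaAndHanCounts('東京都'));
```

A string written only in kanji, such as the second one, contains no kana at all, so a detector that treats only kana as Japanese has nothing to distinguish it from Chinese.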
However, if you add the character set I suggested above to Japanese, the algorithm can no longer choose between Chinese and Japanese, since both could be valid for those characters. It would need a rework of the getTopScript function to return multiple scripts and then merge those into the results. Suggested code changes:
|
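The idea of returning several scripts and merging their languages could look roughly like the sketch below. This is not the code proposed in the comment above and not franc's implementation; `SCRIPT_EXPRESSIONS`, `LANGUAGES_BY_SCRIPT`, `rankScripts`, and `candidateLanguages` are hypothetical names, and the two-script table is an assumption chosen to keep the example small.

```javascript
// Assumed script expressions (plain Unicode blocks, for illustration only).
const SCRIPT_EXPRESSIONS = {
  kana: /[\u3040-\u30FF]/g, // Hiragana and Katakana
  han: /[\u4E00-\u9FFF]/g // CJK Unified Ideographs
};

// Assumed mapping: Han characters are valid for both Chinese and Japanese.
const LANGUAGES_BY_SCRIPT = {
  kana: ['jpn'],
  han: ['cmn', 'jpn']
};

// Return every script found in the text, most frequent first,
// instead of only the single top script.
function rankScripts(text) {
  return Object.keys(SCRIPT_EXPRESSIONS)
    .map((name) => ({
      name,
      count: (text.match(SCRIPT_EXPRESSIONS[name]) || []).length
    }))
    .filter((entry) => entry.count > 0)
    .sort((a, b) => b.count - a.count);
}

// Merge the languages of all detected scripts, dropping duplicates.
function candidateLanguages(text) {
  const languages = new Set();
  for (const script of rankScripts(text)) {
    for (const language of LANGUAGES_BY_SCRIPT[script.name]) {
      languages.add(language);
    }
  }
  return [...languages];
}

console.log(candidateLanguages('東京都'));
console.log(candidateLanguages('こんにちは'));
```

With a candidate list like this, a later step (for example trigram scoring) would still have to decide between the remaining languages; the sketch only shows how the script stage can avoid ruling Japanese out too early.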
Hello folks, we’ve already worked on this here: #77 |
Hi, I'm seeing something strange with Japanese detection.
If I pass in text that Google Translate translated into Japanese:
裁判の周辺のラオスにUターンした元元兵士
the library detects it correctly and returns 'jpn', but if I pass in Japanese text from Yahoo Japan or Amazon Japan:
ここ最近、よく拡散されたつぶやきや画像をまとめてご紹介。気になるも
it returns 'cmn'. Does anyone know why?