Fix language detection of non-Latin alphabets even on short text #10276
CLD3 is unreliable on short text in Latin alphabets, because many languages share them. However, some languages use distinctive character sets, which makes detecting them more reliable even at short lengths. This PR adds an exception to the 140-character threshold rule, so that we can reliably detect Japanese, Chinese, Hebrew, Korean, and some others.
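
To illustrate the idea, here is a minimal Python sketch of such an exception. It is not the code from this PR: the function names, the list of Unicode ranges, and the 0.5 ratio are all assumptions chosen for illustration, and only the 140-character threshold comes from the description above. Short text is treated as reliable input for detection when most of its letters fall in a script that few languages share.

```python
# Hypothetical sketch, not the PR's implementation.
# The ranges below are a small, incomplete sample of distinctive Unicode blocks.
DISTINCTIVE_RANGES = [
    (0x0590, 0x05FF),  # Hebrew
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]

MIN_RELIABLE_LENGTH = 140  # threshold named in the PR description


def in_distinctive_script(ch):
    """Return True if the character falls in one of the sampled ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in DISTINCTIVE_RANGES)


def is_reliable_input(text, min_ratio=0.5):
    """Decide whether a detector's guess on this text should be trusted.

    Long text always passes. Short text passes only if at least
    min_ratio (an assumed value) of its letters use a distinctive script.
    """
    if len(text) >= MIN_RELIABLE_LENGTH:
        return True
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    distinctive = sum(1 for ch in letters if in_distinctive_script(ch))
    return distinctive / len(letters) >= min_ratio
```

With this sketch, a short Japanese or Hebrew string passes the check, while a short Latin-alphabet string still falls back to the length threshold.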