Some Japanese are detected as Chinese mandarin #63

ThisIsRoy1 · 2018-10-04T13:56:30Z

Hi, I see something strange about Japanese detection.

If I pass in text translated to Japanese by Google Translate:
裁判の周辺のラオスにUターンした元元兵士

the lib detects it and returns 'jpn'. But if I pass in Japanese text from Yahoo Japan or Amazon Japan:
ここ最近、よく拡散されたつぶやきや画像をまとめてご紹介。気になるも

it returns 'cmn'. Does anyone know why?

wooorm · 2019-04-30T09:10:24Z

Both franc and Google Translate detect both examples as Japanese. I believe that’s correct, but I don’t know Japanese, so I can’t be sure.

If this is an issue with Amazon or Yahoo, that’s not something I can help with.
If franc is incorrect, please see https://github.com/wooorm/franc#whats-not-so-cool-about-franc, #68, and other closed issues about this!

ThisIsRoy1 · 2019-05-06T11:16:24Z

@wooorm thanks for the response! You are right, the example I pasted here is not a good one, but I was able to work out what's wrong with it and find a workaround.
Apparently, when there are numbers inside the string, franc can get confused between 'jpn' and 'cmn'.

So I solved it like this:

const franc = require('franc');

function detectLanguage(str) {
	// Workaround: strip digits from the string first, since numbers seem to throw off franc's detection.
	const removedNumFromStr = str.replace(/[0-9]/g, '');
	return franc(removedNumFromStr, { maxLength: 10000 });
}

wooorm · 2019-05-08T06:20:13Z

Do you have an example text that includes numbers which is reported wrong?

jvandenaardweg · 2019-07-31T15:46:40Z

I'm getting mixed results too with Japanese text

Using some random Japanese website:

https://www.asahi.com/articles/ASM7056M6M70UTIL037.html?iref=comtop_8_06

年前の年夏、東京の旧国鉄・三鷹駅で無人電車が暴走して人が死亡した「三鷹事件」の裁判をやり直す再審の扉は開かなかった。連合国軍総司令部（ＧＨＱ）が「反共」にかじを切り、国鉄職員らが大量解雇された時代。「国鉄三大ミステリー」と称される不可解な事件が相次ぎ、時の政府は「共産党の仕業」と主張したが、多くの謎が残されたままだ。-電車暴走「三鷹事件」元死刑囚の再審認めず東京高裁戦後に日本を占領したＧＨＱは当初、民主化・非軍事化を進めた。しかし、朝鮮半島での米ソ対立が次第に鮮明となり、中国でも共産党の勢力が増すと、日本を「アジアにおける共産主義の防波堤」と位置づける路線に転換した。共産党は国内でも人気を集め、年月の衆院選では議席から議席に躍進。吉田茂内閣が復員で膨れあがった官公庁の人員を整理するため、国鉄を含めて万人以上を解雇する行政機関職員定員法を同年月に成立させると、共産党が率いた国鉄労組は一部列車を止めるストライキを実施。ＧＨＱが止め、緊張が強まった。月に入り、労組に大量解雇を…

Google translate returns Japanese
Franc returns [ [ 'cmn', 1 ] ]

https://www.asahi.com/articles/ASM7061QRM70PLXB00S.html?iref=comtop_latestnews_01

高松空港（高松市香南町岡）に、讃岐うどんのだし汁が出てくる蛇口が設置された。ひねると、香川県内にある人気うどん店のだしが出てくる。紙コップに注ぎ、無料で飲める。店は月ごとに変わり、月は「うどん本陣山田家」（同市）のコンブがベースのだしが味わえる。屋台風のスタンドに黒色の蛇口が取り付けられた。横には、だしを提供している山田家の土産用のうどんも置かれている。空港会社が直営する土産店「四国空市場（ＹＯＳＯＲＡ〈ヨソラ〉）」が日にオープンするのに合わせ、店の前に設置された。空港内ではカ所目で、最初の蛇口は県が年、同じビル階の休憩スペースに設けた。従来の蛇口はチェーン店の温かいだしで、新たな蛇口は冷やだしが出る。猛暑が続くなか、「ひんやりと、塩分補給にもいかがですか」と担当者。営業時間は午前時～午後時。日は午前時にオープン。（石川友恵）

Google translate returns Japanese
Franc returns [ [ 'jpn', 1 ] ]

The second example is even shorter, yet franc gets the language right.

I tried @ThisIsRoy1's solution of stripping out numbers, but that didn't work. Both texts above already have all numbers and newlines stripped out.

What characters could return the wrong result in the first one?

mike-nelson · 2019-11-03T02:22:10Z

Hi, add this to your Japanese Unicode regex, and it will fix it!

[\u3000-\u303F\u3300-\u33FF\u4E00-\u9FFF]

With this change, the random test above is recognized as Japanese with 90% confidence.

These ranges cover CJK Symbols and Punctuation (U+3000 to U+303F), CJK Compatibility (U+3300 to U+33FF), and CJK Unified Ideographs (U+4E00 to U+9FFF).
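
To see why adding these ranges changes the outcome, it may help to measure how much of a string each expression covers. The sketch below is not franc's code: `proportion` is a hypothetical helper, and the `kana` expression is only an assumption about what a kana-only Japanese expression might look like.

```javascript
// Hypothetical helper: fraction of the characters in `value` matched by `expression`.
function proportion(value, expression) {
  const matches = value.match(expression)
  return matches ? matches.length / value.length : 0
}

// Assumed kana-only expression (Hiragana and Katakana, U+3040 to U+30FF).
const kana = /[\u3040-\u30FF]/g
// The ranges suggested above.
const cjk = /[\u3000-\u303F\u3300-\u33FF\u4E00-\u9FFF]/g

// Short phrase taken from the second article quoted earlier in the thread.
const sample = '高松空港に蛇口が設置された'

console.log('kana:', proportion(sample, kana))
console.log('cjk:', proportion(sample, cjk))
```

In this sample, 8 of the 13 characters are ideographs and only 5 are kana, so a detector that picks a single top script by coverage would plausibly favour the ideograph-based one for kanji-heavy news text.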

mike-nelson · 2019-11-03T05:09:47Z

However, if you add the character ranges I suggested above to Japanese, the algorithm can no longer choose between Chinese and Japanese, since those characters are valid in both. So the getTopScript function would need to be reworked to return multiple scripts, whose results are then merged.

Suggested code changes:

  function detectAll(value, options) {
    var settings = options || {}
    var minLength = MIN_LENGTH
    var scripts

    if (settings.minLength !== null && settings.minLength !== undefined) {
      minLength = settings.minLength
    }

    if (!value || value.length < minLength) {
      return und()
    }

    value = value.substr(0, MAX_LENGTH)

    /* Get the scripts whose characters occur the most
    * in `value`. */
    scripts = getTopScripts(value, expressions)

    // If no matches occurred, such as a digit-only string, exit with `und`.
    if (scripts.length === 0) {
      return und()
    }

    var inputTrigrams = getCleanTrigramsAsTuples(value)
    var distances = []

    for (var i = 0; i < scripts.length; i++) {
      var script = scripts[i].script
      var scriptProportion = scripts[i].proportion
      var langsInScript = data[script]

      if (langsInScript) {
        // Several languages share this script: rank them by trigram distance.
        var dists = getDistances(inputTrigrams, langsInScript, settings)
        var normalised = normalize(value, dists, scriptProportion)
        distances = distances.concat(normalised)
      } else {
        // No trigram data for this script: score it by its proportion.
        distances.push([script, scriptProportion])
      }
    }

    var guesses = distances.sort(sort).reverse()

    return guesses
  }

  function normalize(value, distances, multiplier) {
    var min = distances[0][1]
    var max = value.length * MAX_DIFFERENCE - min
    var index = -1
    var length = distances.length

    while (++index < length) {
      distances[index][1] = multiplier * (1 - (distances[index][1] - min) / max || 0)
    }

    return distances
  }

  /**
   * From `scripts`, get the expressions that occur most often in
   * `value`.
   *
   * @param {string} value - Value to check.
   * @param {Object.<RegExp>} scripts - Expressions to test, keyed by script.
   * @return {Array.<Object>} Likely scripts, each as
   *   `{script, proportion}`.
   */
  function getTopScripts(value, scripts) {
    var topCount = -1
    var topScript
    var script
    var count
    var validScripts = [];
    var likelyScripts = [];

    for (script in scripts) {
      count = getOccurrence(value, scripts[script])

      if (count > 0){
        validScripts.push({script:script, proportion:count});
      }
      if (count > topCount) {
        topCount = count
        topScript = script
      }
    }
    
    // Treat any script scoring over half the top count as likely,
    // e.g. if topCount is 1, include every script over 0.5.
    var likely = topCount / 2
    for (var i = 0; i < validScripts.length; i++) {
      if (validScripts[i].proportion > likely && topCount > 0) {
        likelyScripts.push(validScripts[i])
      }
    }

    return likelyScripts;
    //return [topScript, topCount]
  }
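
The threshold rule in getTopScripts can also be tried in isolation. The following is a minimal sketch, not franc's implementation: `getLikelyScripts` is a hypothetical name, the `expressions` table and its two regexes are made up for the demo (with the suggested ideograph range added to the Japanese one), and `getOccurrence` is re-implemented here on the assumption that it returns the matched fraction of the input.

```javascript
// Assumed stand-in for franc's helper: fraction of `value` matched by `expression`.
function getOccurrence(value, expression) {
  const matches = value.match(expression)
  return (matches ? matches.length : 0) / value.length || 0
}

// Keep every script whose share is more than half of the top script's share.
function getLikelyScripts(value, scripts) {
  const found = Object.keys(scripts)
    .map((script) => ({script, proportion: getOccurrence(value, scripts[script])}))
    .filter((entry) => entry.proportion > 0)
  const top = Math.max(0, ...found.map((entry) => entry.proportion))
  return found.filter((entry) => entry.proportion > top / 2)
}

// Illustrative expressions only; these are not franc's real data.
const expressions = {
  cmn: /[\u4E00-\u9FFF]/g,
  jpn: /[\u3040-\u30FF\u4E00-\u9FFF]/g
}

console.log(getLikelyScripts('高松空港に蛇口が設置された', expressions))
console.log(getLikelyScripts('12345', expressions))
```

With these assumed expressions, the Japanese sample keeps both candidates, because the ideograph-only expression still covers more than half of what the combined one does, while a digit-only string yields an empty list. Telling the two languages apart is then left to the trigram step.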

wooorm · 2019-11-03T09:54:10Z

Hello folks, we’ve already worked on this here: #77

wooorm closed this as completed Apr 30, 2019

Repository owner locked as resolved and limited conversation to collaborators Nov 3, 2019
