Some Japanese are detected as Chinese mandarin #63

Closed
ThisIsRoy1 opened this issue Oct 4, 2018 · 7 comments

@ThisIsRoy1

Hi, I'm seeing something strange with Japanese detection.

If I pass in text that Google Translate translated into Japanese:
裁判の周辺のラオスにUターンした元元兵士

the library detects it and returns 'jpn'. But if I pass in Japanese text from Yahoo Japan or Amazon Japan:
ここ最近、よく拡散されたつぶやきや画像をまとめてご紹介。気になるも

it returns 'cmn'. Does anyone know why?

@wooorm
Owner

wooorm commented Apr 30, 2019

Both franc and Google Translate detect both examples as Japanese. I believe that’s correct, but I don’t know Japanese, so I can’t be sure.

If this is an issue with Amazon or Yahoo, that’s not something I can help with.
If franc is incorrect, please see https://github.com/wooorm/franc#whats-not-so-cool-about-franc, #68, and other closed issues about this!

@wooorm wooorm closed this as completed Apr 30, 2019
@ThisIsRoy1
Author

ThisIsRoy1 commented May 6, 2019

@wooorm thanks for the response! You're right, the example I pasted here isn't a good one, but I was able to work out what goes wrong and found a workaround.
Apparently, when there are numbers inside the string, franc can get confused between 'jpn' and 'cmn'.

So I solved it like this:

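// Assumes a CommonJS release of franc, where the module exports the function.
const franc = require('franc');

function detectLanguage(str) {
	// Suspected franc bug: digits must be removed from str to get accurate detection.
	const removedNumFromStr = str.replace(/[0-9]/g, '');
	return franc(removedNumFromStr, { maxLength: 10000 });
}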
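
One caveat about the workaround above: `/[0-9]/` matches only ASCII digits, while Japanese text often uses full-width digits (０ to ９, U+FF10 to U+FF19). The following is a minimal sketch of a variant that strips both. `stripDigits` is a hypothetical helper and not part of franc, and this thread does not confirm that digits are what changes franc's result.

```javascript
// Hypothetical helper (not part of franc): remove ASCII and full-width digits.
function stripDigits(str) {
  // 0-9 covers ASCII digits; \uFF10-\uFF19 covers full-width digits.
  return str.replace(/[0-9\uFF10-\uFF19]/g, '');
}

const sample = '２０１９年7月31日の記事';
console.log(stripDigits(sample)); // 年月日の記事
```

The cleaned string could then be passed to `franc` in place of `removedNumFromStr`.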

@wooorm
Owner

wooorm commented May 8, 2019

Do you have an example text that includes numbers which is reported wrong?

@jvandenaardweg

jvandenaardweg commented Jul 31, 2019

I'm getting mixed results too with Japanese text

Using some random Japanese website:

https://www.asahi.com/articles/ASM7056M6M70UTIL037.html?iref=comtop_8_06

年前の年夏、東京の旧国鉄・三鷹駅 で無人電車が暴走して人が死亡した「三鷹事件 」の裁判をやり直す再審の扉は開かなかった。連合国軍総司令部 (GHQ)が「反共」にかじを切り、国鉄職員らが大量解雇された時代。「国鉄三大ミステリー」と称される不可解な事件が相次ぎ、時の政府は「共産党 の仕業」と主張したが、多くの謎が残されたままだ。-電車暴走「三鷹事件」 元死刑囚の再審認めず 東京高裁 戦後に日本を占領したGHQは当初、民主化・非軍事化を進めた。しかし、朝鮮半島での米ソ対立が次第に鮮明となり、中国でも共産党 の勢力が増すと、日本を「アジアにおける共産主義 の防波堤」と位置づける路線に転換した。共産党 は国内でも人気を集め、年月の衆院選では議席から議席に躍進。吉田茂 内閣が復員で膨れあがった官公庁の人員を整理するため、国鉄を含めて万人以上を解雇する行政機関職員定員法を同年月に成立させると、共産党 が率いた国鉄労組は一部列車を止めるストライキを実施。GHQが止め、緊張が強まった。月に入り、労組に大量解雇を…

Google translate returns Japanese
Franc returns [ [ 'cmn', 1 ] ]

https://www.asahi.com/articles/ASM7061QRM70PLXB00S.html?iref=comtop_latestnews_01

高松空港 (高松市 香南町岡)に、讃岐うどん のだし汁が出てくる蛇口が設置された。ひねると、香川県 内にある人気うどん店のだしが出てくる。紙コップ に注ぎ、無料で飲める。店は月ごとに変わり、月は「うどん本陣 山田家」(同市)のコンブがベースのだしが味わえる。屋台風のスタンドに黒色の蛇口が取り付けられた。横には、だしを提供している山田家の土産用のうどんも置かれている。空港会社が直営する土産店「四国空市場(YOSORA〈ヨソラ〉)」が日にオープンするのに合わせ、店の前に設置された。空港内ではカ所目で、最初の蛇口は県が年、同じビル階の休憩スペースに設けた。従来の蛇口はチェーン店の温かいだしで、新たな蛇口は冷やだしが出る。猛暑が続くなか、「ひんやりと、塩分補給にもいかがですか」と担当者。営業時間は午前時~午後時。日は午前時にオープン。(石川友恵)

Google translate returns Japanese
Franc returns [ [ 'jpn', 1 ] ]

The second example is even shorter, yet its language is detected correctly.

I tried @ThisIsRoy1's solution of stripping out numbers, but that didn't work. Both texts above already have all numbers and newlines stripped out.

Which characters could be causing the wrong result in the first one?
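
One way to investigate that question is to measure how much of a text falls into each script. The sketch below is only a diagnostic aid: it uses Unicode property escapes, which is an assumption made for illustration, and franc's own script expressions may be defined differently. `scriptShares` is a hypothetical helper.

```javascript
// Diagnostic sketch: share of Han, Hiragana and Katakana characters in a string.
// These patterns are illustrative and are not taken from franc.
const SCRIPTS = {
  Han: /\p{Script=Han}/gu,
  Hiragana: /\p{Script=Hiragana}/gu,
  Katakana: /\p{Script=Katakana}/gu
};

function scriptShares(text) {
  // Count code points, not UTF-16 units; avoid dividing by zero on empty input.
  const total = Array.from(text).length || 1;
  const shares = {};
  for (const [name, re] of Object.entries(SCRIPTS)) {
    const matches = text.match(re);
    shares[name] = (matches ? matches.length : 0) / total;
  }
  return shares;
}

// A fragment of the first article quoted above.
console.log(scriptShares('国鉄職員らが大量解雇された時代'));
```

Running this on both full articles would show whether the misdetected one has a noticeably higher share of Han characters, which would be consistent with it being scored as Chinese.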

@mike-nelson

Hi, add this to your Japanese Unicode regex and it will fix it!

[\u3000-\u303F\u3300-\u33FF\u4E00-\u9FFF]

With this change, the random test text above is recognized as Japanese with 90% confidence.

These ranges are CJK Symbols and Punctuation (U+3000 to U+303F), CJK Compatibility (U+3300 to U+33FF), and CJK Unified Ideographs (U+4E00 to U+9FFF).
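
A quick check of the suggested character class against two short strings illustrates a limitation: the ideograph range matches Chinese text as well, so on its own it cannot separate the two languages. The kana range and the `count` helper below are added here for comparison only and are not part of the suggestion.

```javascript
// The character class suggested above, wrapped in a global regex.
const suggested = /[\u3000-\u303F\u3300-\u33FF\u4E00-\u9FFF]/g;
// Hiragana and Katakana blocks, included only for comparison.
const kana = /[\u3040-\u30FF]/g;

// Hypothetical helper: number of matches of `re` in `text`.
function count(text, re) {
  const m = text.match(re);
  return m ? m.length : 0;
}

const japanese = '東京の旧国鉄'; // contains the kana の
const chinese = '东京的旧国铁'; // Simplified Chinese, no kana

console.log(count(japanese, suggested), count(japanese, kana)); // 5 1
console.log(count(chinese, suggested), count(chinese, kana)); // 6 0
```

In these two samples, only the presence of kana distinguishes the Japanese string from the Chinese one.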

@mike-nelson

mike-nelson commented Nov 3, 2019

However, if you add the character set I suggested above to Japanese, the algorithm can no longer choose between Chinese and Japanese, since those characters are valid in both. So it would need a rework of the getTopScript function to return multiple scripts and then merge those into the results.

Suggested code changes:

  function detectAll(value, options) {
    var settings = options || {}
    var minLength = MIN_LENGTH
    var scripts

    if (settings.minLength !== null && settings.minLength !== undefined) {
      minLength = settings.minLength
    }

    if (!value || value.length < minLength) {
      return und()
    }

    value = value.substr(0, MAX_LENGTH)

    /* Get the scripts which characters occur the most
    * in `value`. */
    scripts = getTopScripts(value, expressions)
    
    // If no matches occurred, such as a digit-only string, exit with `und`.
    if (scripts.length==0){
      return und();
    }

    var inputTrigrams = getCleanTrigramsAsTuples(value);
    var distances = [];
    for(var i=0;i<scripts.length;i++){
      var script = scripts[i].script;
      var scriptProportion = scripts[i].proportion;
      var langsInScript = data[script];
      if (langsInScript){
        var dists = getDistances(inputTrigrams, langsInScript, settings)
        var normalised = normalize(value, dists, scriptProportion);
        distances = distances.concat(normalised);
      }else{
        distances.push([script,scriptProportion]);
      }
    }

    var guesses = distances.sort(sort).reverse();

    return guesses;
  }

  function normalize(value, distances, multiplier) {
    var min = distances[0][1]
    var max = value.length * MAX_DIFFERENCE - min
    var index = -1
    var length = distances.length

    while (++index < length) {
      distances[index][1] = multiplier * (1 - (distances[index][1] - min) / max || 0)
    }

    return distances
  }

  /**
   * From `scripts`, get the expressions that occur most often
   * in `value`.
   *
   * @param {string} value - Value to check.
   * @param {Object.<RegExp>} scripts - Scripts to check against.
   * @return {Array.<{script: string, proportion: number}>} Every script whose
   *   occurrence is more than half that of the top script.
   */
  function getTopScripts(value, scripts) {
    var topCount = -1
    var topScript
    var script
    var count
    var validScripts = [];
    var likelyScripts = [];

    for (script in scripts) {
      count = getOccurrence(value, scripts[script])

      if (count > 0){
        validScripts.push({script:script, proportion:count});
      }
      if (count > topCount) {
        topCount = count
        topScript = script
      }
    }
    
    var likely = topCount/2;       // e.g. if topCount is 1, any script with a proportion over 0.5 counts as likely
    for(var i=0;i<validScripts.length;i++){
      if (validScripts[i].proportion > likely && topCount>0){
        likelyScripts.push(validScripts[i]);
      }
    }

    return likelyScripts;
    //return [topScript, topCount]
  }
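
The selection rule in `getTopScripts` can be tried in isolation. The following standalone sketch applies the same "more than half of the top count" threshold to a plain object of proportions. `pickLikelyScripts` is a hypothetical name, and the script keys and numbers are invented for illustration rather than taken from franc.

```javascript
// Standalone sketch of the "more than half of the top proportion" rule.
function pickLikelyScripts(proportions) {
  // Keep only scripts that occurred at all.
  const entries = Object.entries(proportions).filter(([, p]) => p > 0);
  // Highest proportion seen; 0 when nothing matched.
  const top = Math.max(0, ...entries.map(([, p]) => p));
  return entries
    .filter(([, p]) => p > top / 2)
    .map(([script, proportion]) => ({script, proportion}));
}

// Invented proportions: the two CJK candidates survive, Latin is dropped.
console.log(pickLikelyScripts({jpn: 0.6, cmn: 0.4, Latin: 0.05}));
// An input with no matching script yields an empty list.
console.log(pickLikelyScripts({jpn: 0, cmn: 0}));
```

With a threshold of half the top proportion, a text that mixes kana and ideographs keeps both candidates, and the trigram distances decide between them afterwards.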

@wooorm
Owner

wooorm commented Nov 3, 2019

Hello folks, we’ve already worked on this here: #77

Repository owner locked as resolved and limited conversation to collaborators Nov 3, 2019
4 participants