Explain the output of 'all' #13

stockholmux · 2014-11-07T17:29:48Z

The results of 'all' consist of the language code and a score number. I've guessed that the lowest number is the detected language, but what can be learned from the score number? Doesn't seem to be documented.

I'm looking to detect the language of job titles in English and French only (because Canada). I was getting results all over the place using just franc(jobTitle), but by whitelisting English and French and then applying a threshold to the score, I was able to tune in a much more accurate result (still a 3.92% error rate over 1020 job titles, but it was in the 25% range before the threshold). Is this a good use for the score or am I just getting lucky?


wooorm · 2014-11-07T17:38:37Z

Good question, the lowest number does indeed mean highest probability.

Thing is, franc is horrible at short inputs. As in, "thethethethe" is probably Scottish, rather than English.

Could you paste your code? I’m thinking you're doing something like:

franc(jobTitle, {
  'whitelist': ['fra', 'eng']
});

Right? Although this does not include your threshold code.

stockholmux · 2014-11-07T17:53:14Z

Here is what I'm using - it's a little terrible as I'm just testing the code out to see if it would work in production. In my context, aTitle is more likely to be English, and English seems to be harder to detect, so I default to English.

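function limitFranc(aTitle, threshold) {
  var
    possible = franc.all(aTitle, { whitelist : ['eng','fra'] }),
    toReturn,
    difference;

  // Too little input for franc to score both languages: default to English.
  if (possible.length < 2) {
    return 'eng';
  }

  // Absolute gap between the two scores (lower score = more likely).
  difference = Math.abs(possible[0][1] - possible[1][1]);

  if (difference < threshold) {
    toReturn = 'eng';
  } else if (possible[0][1] <= possible[1][1]) {
    toReturn = possible[0][0];
  } else {
    toReturn = possible[1][0];
  }
  return toReturn;
}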

With the threshold in the above function set to 500, these English titles are detected as French:

User Experience Designer / Developer
Plant Functional Genomics BOT 464/564, Contract Lecturer
Animal Physiology II:  Intercellular Communication ZOOL 242, Contract Lecturer
Business Experts System Implementation, Graduate Admissions
Business Experts System Implementation, Undergraduate Admissions
IT Service Desk Analyst
Communication Control Center Operator
Interpersonal Communication COMM 1610U, Sessional Lecturer
Cybercrime SSCI 3021U, Sessional Lecturer
Lecturer, Computer Science, Tenure Stream
Music Trends MUS1015 Part Time Professor
Application Consultant
Lab Supervisor, IDEA Department
Snow Shovellers (18 positions)
Test Centre Clerk
Administrative Assistant, Aboriginal Centre, Learner Success Services
Advancement Proposal Writer, Philanthropy
Gymnasium Attendant
Academic Bone Marrow Transplant Lead Physician
Clerical Support, Aerospace Department
Digital Transmission INFR 3720U, Sessional Lecturer
Advanced Communications Network MITS 5200G, Sessional Lecturer
Grounds Person
Library  Assistant, Client Svcs / Commun / Eve Lead
Journeyperson Painter, Facilities Management
Engineering Hydrology CE5742, Contract Academic Position
Fundamentals of Chemical Process Design, Lecture / Lab CHE  2525, Contract Academic Position
Information Centre Attendant
Senior Business Analyst, PeopleSoft, Campus Solutions
Social Psychology / Psychology 2801, Contract Lecturer
Introductory Psychology / Psychology 1100, Contract Lecturer
Abnormal Psychology / Psychology 2004, Contract Lecturer
Soil Mechanics I CE 2113, Contract Academic Position
Drainage Basin Geomorphology / Geography 3333B, Part Time Appointment
Assessment Administrator, Test Centres
Movement Education KHS 139, Sessional Lecturer
Paralegal

I'm guessing that some of the job codes ('MITS 5200G', 'COMM 1610U', etc) are throwing off the detection, but that doesn't explain everything.

wooorm · 2014-11-07T18:02:19Z

The numbers returned by franc.all are based on the similarity between the trigrams from the input, and the trigrams from the corpora. Thus, the longer the input, the higher the number for corresponding languages. TBH, I don’t think using a magic number like 500 is going to help in this case.
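To make the "longer input, higher number" point concrete, here is a minimal sketch of a trigram rank distance. It is an illustration only and not franc's actual implementation: the toy model, the `MAX_PENALTY` value, and the function names are all assumptions.

```javascript
// Simplified sketch of a trigram rank distance; NOT franc's real code.
var MAX_PENALTY = 300; // assumed penalty for a trigram missing from the model

// Split text into overlapping three-character sequences.
function trigrams(text) {
  var clean = ' ' + text.toLowerCase().replace(/\s+/g, ' ').trim() + ' ';
  var out = [];
  for (var i = 0; i + 3 <= clean.length; i++) {
    out.push(clean.slice(i, i + 3));
  }
  return out;
}

// List the distinct trigrams of a text, most frequent first.
function rank(text) {
  var counts = {};
  trigrams(text).forEach(function (t) {
    counts[t] = (counts[t] || 0) + 1;
  });
  return Object.keys(counts).sort(function (a, b) {
    return counts[b] - counts[a] || (a < b ? -1 : 1);
  });
}

// Sum how far each input trigram sits from its rank in the model.
function distance(input, model) {
  return rank(input).reduce(function (sum, t, i) {
    var j = model.indexOf(t);
    return sum + (j === -1 ? MAX_PENALTY : Math.abs(i - j));
  }, 0);
}

// Toy "corpus" standing in for a real language model.
var model = rank('the quick brown fox jumps over the lazy dog');
var shortScore = distance('analyst', model);
var longScore = distance('senior business analyst, campus solutions', model);
console.log(shortScore, longScore);
```

Every distinct trigram adds a non-negative amount to the sum, so a longer title accumulates a larger total regardless of language. That is why a fixed gap such as 500 means different things for short and long inputs.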

Could you print the ratio between eng and fra for me for these erroneous cases (and maybe for some others too):

console.log(possible[0][0], possible[1][0], possible[0][1] / possible[1][1]);

Something like this could be more interesting I think.
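The same ratio can be wrapped in a small helper. This is a sketch that assumes the result shape discussed in this thread (an array of [code, score] pairs, lowest score first); the helper name and the sample scores are made up for illustration.

```javascript
// `results` has the shape of franc.all output as described in this thread:
// [[code, score], ...], sorted with the best (lowest) score first.
function scoreRatio(results) {
  if (results.length < 2 || results[1][1] === 0) {
    return null; // nothing to compare against
  }
  return results[0][1] / results[1][1];
}

// Made-up scores, for illustration only.
var clearCase = [['fra', 3000], ['eng', 4000]];
var closeCase = [['eng', 3900], ['fra', 4000]];

console.log(scoreRatio(clearCase)); // 0.75
console.log(scoreRatio(closeCase)); // 0.975
```

A ratio near 1 means the two languages scored almost the same, whatever the length of the input, which is what makes it more comparable across titles than an absolute gap.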

stockholmux · 2014-11-07T18:53:36Z

Right. I felt like the magic number didn't grasp the nature of the beast, I just dialed it in until it started to respond relatively okay.

Here are some examples - your requested log is in parens.

Detected as French (with my magic number threshold) - some are right and some are wrong. Should be pretty evident.

' IT Service Desk Analyst (fra,eng,0.8827648114901256)',
' Communication Control Center Operator (fra,eng,0.9133554817275747)',
' Interpersonal Communication COMM 1610U, Sessional Lecturer (fra,eng,0.9231916480238628)',
' Cybercrime SSCI 3021U, Sessional Lecturer (fra,eng,0.9426628895184136)',
' Lecturer, Computer Science, Tenure Stream (fra,eng,0.9215228307836039)',
' Agent au services d’appui à la réussite scolaire (fra,eng,0.7623375587977358)',
' Approches éducatives NUEF 2701, Chargé de cours (fra,eng,0.8995447059963189)',
'Les études familiales ETFA 1151, Chargé de cours (fra,eng,0.7619047619047619)',
' Music Trends MUS1015 Part Time Professor (fra,eng,0.9265549690153776)',
' Application Consultant (fra,eng,0.8717443641934778)',
' Commis Centre d’accès (fra,eng,0.77443315089914)',
'Réviseure Ou Réviseur De Traduction Juridique (fra,eng,0.755209986247752)',
' Analyse math appliquée I MATH 1063, chargé de cours (fra,eng,0.9483581519663994)',
' Éléments de math discrètes MATH 1563, chargé de cours (fra,eng,0.8751891837352437)',
' Labo de solutions chimiques CHIM 2582, chargé de cours (fra,eng,0.7835260701967944)',
' Nombres et leurs propriétés MATH 1143, chargé de cours (fra,eng,0.7571030155133345)',
' Notions calcul diff et intég MATH 3133, chargé de cours (fra,eng,0.8856369480629335)',
' Statistique descriptive STAT 2653, chargé de cours (fra,eng,0.836179188429088)',
' Lab Supervisor, IDEA Department (fra,eng,0.917835941716136)',
' Snow Shovellers (18 positions) (fra,eng,0.8846729905142287)'

These were detected correctly as English with the magic number:

' Program Assistant, Hair Design / Esthetics (fra,eng,0.9586810922205049)',
'Accommodation Assistant, Counselling & Disability Services   (eng,fra,0.947556800497977)',
'Research Assistant Professor, Coherent Control Of Quantum Devices (eng,fra,0.9511855921006884)',
' Coherent Control Of Quantum Devices, Research Assistant Professor (eng,fra,0.9452054794520548)',
'  Executive Officer, Dean of Arts Office (eng,fra,0.812515931684935)',
' Administrative Coordinator, Department of Earth and Environmental Sciences (eng,fra,0.8948873584685421)',
' Graduate Admissions Specialist, Graduate Studies Office (eng,fra,0.9884856300663227)',
' Administrative Support Staff, Centre for English Language Development (eng,fra,0.9920134772571286)',
' Data Entry Clerk, Department Of Finance, Payroll (eng,fra,0.9793103448275862)',
' Communications Officer (eng,fra,0.9705112960760999)',
'Assistant Professor, Sociology, Tenure Track (fra,eng,0.9889911929543634)',
' Campaign Marketing & Communications Coordinator (eng,fra,0.9658697158697158)',
' Research Technician, Paediatrics (eng,fra,0.9848884624610218)',
' Counsellor (eng,fra,0.9403563129357088)',
' Secretary to the Dean, School of Business (eng,fra,0.789718322952496)',
' Dentist, School of Health Sciences and Emergency Services (eng,fra,0.8633609222583506)',
' Professors, Fitness Courses, School of Health Sciences and Emergency Services (eng,fra,0.9321683787405092)',
' Simulation Lab Educator,  Nursing (eng,fra,0.907374330186272)',
' Meeting and Events Coordinator (eng,fra,0.7278545826932924)',
' Institutional Quality Assistant (eng,fra,0.8367346938775511)',
' Film Production, Part Time Professor (fra,eng,0.9834261649232441)',
'Assistant Professor, Geophysics, Sedimentology, or Geochemistry, Tenure Track (eng,fra,0.9852693846778403)',
'Associate / Full Professor, Department Chair, Department of Earth Sciences, Tenured (eng,fra,0.908320292123109)',
' Professor, Interdisciplinary / Literacy, Community Integration Through Cooperative Education Program (eng,fra,0.8993322037276986)',
' Department Assistant, Continuing Education (fra,eng,0.9942472460220318)',
' General Maintenance Worker (eng,fra,0.9457707509881423)',
' Human Resources Advisor, Client Services (eng,fra,0.9504196978175713)',
' Assistant Professor, Poultry Nutrition, Tenure Track (fra,eng,0.9894366197183099)',
' Dispatcher, Police and Fire Prevention (eng,fra,0.8340091231626964)',
' Financial Analyst (eng,fra,0.8019269549630293)',
' Logistics Coordinator, Executive Programs, College of Business and Economics  (eng,fra,0.9072287208236752)',
' Manager, Finance and Administration, College of Arts (eng,fra,0.841984260323636)',
' Custodian, Physical Resources  (eng,fra,0.9914293613491844)',
' Grounds Machinery Operator, Snow and Ice Control (4 positions) (eng,fra,0.9049974904997491)',
' Groundskeeper, Snow and Ice Control (2 Positions) (eng,fra,0.953888778943202)',
' Custodian, Physical Resources (2 Positions) (fra,eng,0.9588351215239763)',
' Special Constable, Campus Community Police (eng,fra,0.9905070118662351)',
' Program Assistant, School of Health Sciences (eng,fra,0.8971591957203467)',
' Institutional Research Analyst (eng,fra,0.8545189504373177)',
' Instructional Associate, Access Programs For People With Disabilities (eng,fra,0.8723336719146775)',
' Custodian (eng,fra,0.9237037037037037)',
' Administrative Support (eng,fra,0.9706347810796554)'

These would have been misidentified as French if it wasn't for the threshold:

Custodian, Physical Resources (2 Positions)
Assistant Professor, Poultry Nutrition, Tenure Track
Department Assistant, Continuing Education
Film Production, Part Time Professor

wooorm · 2014-11-07T20:37:24Z

And what if you use the ratio I gave you, with a threshold of 0.9 for 'sureness'? I'll check back tomorrow to read it better though.

Anyway, to me it seems to just be a problem with the shortness of the input and with French-like words in English descriptions, neither of which I think can be fixed with the current solution!

wooorm · 2014-11-08T11:28:18Z

@stockholmux

Alright, so I implemented my “sureness” idea in the code below, and on your data above it seems to get only 2 (out of 62) wrong. Note that the code below could be simplified a bit, but it's more verbose for readability.

'use strict';

var franc;
var data;
var byLanguage;
var bias;
var SURENESS;

franc = require('franc');

bias = 'eng';

SURENESS = 0.9;

data = [
    'IT Service Desk Analyst',
    'Communication Control Center Operator',
    'Interpersonal Communication COMM 1610U, Sessional Lecturer',
    'Cybercrime SSCI 3021U, Sessional Lecturer',
    'Lecturer, Computer Science, Tenure Stream',
    'Agent au services d’appui à la réussite scolaire',
    'Approches éducatives NUEF 2701, Chargé de cours',
    'Les études familiales ETFA 1151, Chargé de cours',
    'Music Trends MUS1015 Part Time Professor',
    'Application Consultant',
    'Commis Centre d’accès',
    'Réviseure Ou Réviseur De Traduction Juridique',
    'Analyse math appliquée I MATH 1063, chargé de cours',
    'Éléments de math discrètes MATH 1563, chargé de cours',
    'Labo de solutions chimiques CHIM 2582, chargé de cours',
    'Nombres et leurs propriétés MATH 1143, chargé de cours',
    'Notions calcul diff et intég MATH 3133, chargé de cours',
    'Statistique descriptive STAT 2653, chargé de cours',
    'Lab Supervisor, IDEA Department',
    'Snow Shovellers (18 positions)',
    'Program Assistant, Hair Design / Esthetics',
    'Accommodation Assistant, Counselling & Disability Services  ',
    'Research Assistant Professor, Coherent Control Of Quantum Devices',
    'Coherent Control Of Quantum Devices, Research Assistant Professor',
    'Executive Officer, Dean of Arts Office',
    'Administrative Coordinator, Department of Earth and Environmental Sciences',
    'Graduate Admissions Specialist, Graduate Studies Office',
    'Administrative Support Staff, Centre for English Language Development',
    'Data Entry Clerk, Department Of Finance, Payroll',
    'Communications Officer',
    'Assistant Professor, Sociology, Tenure Track',
    'Campaign Marketing & Communications Coordinator',
    'Research Technician, Paediatrics',
    'Counsellor',
    'Secretary to the Dean, School of Business',
    'Dentist, School of Health Sciences and Emergency Services',
    'Professors, Fitness Courses, School of Health Sciences and Emergency Services',
    'Simulation Lab Educator,  Nursing',
    'Meeting and Events Coordinator',
    'Institutional Quality Assistant',
    'Film Production, Part Time Professor',
    'Assistant Professor, Geophysics, Sedimentology, or Geochemistry, Tenure Track',
    'Associate / Full Professor, Department Chair, Department of Earth Sciences, Tenured',
    'Professor, Interdisciplinary / Literacy, Community Integration Through Cooperative Education Program',
    'Department Assistant, Continuing Education',
    'General Maintenance Worker',
    'Human Resources Advisor, Client Services',
    'Assistant Professor, Poultry Nutrition, Tenure Track',
    'Dispatcher, Police and Fire Prevention',
    'Financial Analyst',
    'Logistics Coordinator, Executive Programs, College of Business and Economics ',
    'Manager, Finance and Administration, College of Arts',
    'Custodian, Physical Resources ',
    'Grounds Machinery Operator, Snow and Ice Control',
    'Groundskeeper, Snow and Ice Control',
    'Custodian, Physical Resources (2 Positions)',
    'Special Constable, Campus Community Police',
    'Program Assistant, School of Health Sciences',
    'Institutional Research Analyst',
    'Instructional Associate, Access Programs For People With Disabilities',
    'Custodian',
    'Administrative Support'
];

byLanguage = {};

data.map(function (title) {
    var result,
        primary,
        secondary,
        difference;

    result = franc.all(title, {
        'whitelist' : [bias, 'fra']
    });

    primary = result[0];
    secondary = result[1];

    /**
     * No good statistics are possible with franc,
     * guess the biased language.
     */

    if (primary[0] === 'und') {
        return [bias, title];
    }

    difference = primary[1] / secondary[1];

    /**
     * Pretty sure.
     */

    if (difference < SURENESS) {
        return [primary[0], title];
    }

    /**
     * Probably, as the language is detected as
     * the biased language
     */

    if (primary[0] === bias) {
        return [bias, title];
    }

    /**
     * Not sure enough that it is the other language,
     * so fall back to the biased language.
     */

    return [bias, title];
}).forEach(function (result) {
    if (!(result[0] in byLanguage)) {
        byLanguage[result[0]] = [];
    }

    byLanguage[result[0]].push(result[1]);
});

/**
 * Wrong:
 * - 'Analyse math appliquée I MATH 1063, chargé de cours' as `eng`;
 * - 'Snow Shovellers (18 positions)' as `fra`.
 *
 * (I might have missed some).
 */

console.log(byLanguage);
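Since the three fallback branches above all return the biased language, the decision can be condensed into one function. The sketch below takes the franc.all result as an argument so it can be tried without franc installed; the function name and the sample scores are hypothetical, with the scores picked only to mirror two ratios logged earlier in this thread.

```javascript
// Condensed form of the decision above. `results` is franc.all-style output:
// [[code, score], ...], lowest (best) score first.
function pickLanguage(results, bias, sureness) {
  var primary = results[0];
  var secondary = results[1];

  // Undetermined, or nothing to compare against: use the biased language.
  if (!secondary || primary[0] === 'und') {
    return bias;
  }

  // Trust the top guess only when it clearly beats the runner-up.
  return primary[1] / secondary[1] < sureness ? primary[0] : bias;
}

// Made-up scores giving ratios of 0.885 and 0.948.
var confidentFrench = [['fra', 885], ['eng', 1000]];
var unsureFrench = [['fra', 948], ['eng', 1000]];

console.log(pickLanguage(confidentFrench, 'eng', 0.9)); // 'fra'
console.log(pickLanguage(unsureFrench, 'eng', 0.9)); // 'eng'
console.log(pickLanguage([['und', 1]], 'eng', 0.9)); // 'eng'
```

Passing the bias and the threshold as arguments also makes it easier to try other values than 'eng' and 0.9.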

stockholmux · 2014-11-08T15:37:11Z

@wooorm - Ran your code over a bigger dataset and it seems to be working well. Of 2293 job titles, 9 (0.4%) French titles and 19 (0.8%) English titles were misidentified. Less than 1% is a reasonable error rate for my needs.

On a side note, I thought the job codes ('COMM 1921a', etc.) might have been throwing off the detection, but I used a regular expression to remove them and got roughly the same results.
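The regular expression itself is not shown in this thread. As a rough sketch of what such a clean-up could look like, the pattern below is a guess based on the codes visible in the titles above ('COMM 1610U', 'MUS1015', 'BOT 464/564'); both the pattern and the function name are assumptions.

```javascript
// Hypothetical pattern: 2-4 capital letters, optional whitespace, 3-4 digits,
// an optional letter suffix, and an optional "/564"-style second number.
var JOB_CODE = /\b[A-Z]{2,4}\s*\d{3,4}[A-Za-z]?(?:\/\d{3,4})?\b/g;

function stripJobCodes(title) {
  return title
    .replace(JOB_CODE, '')
    .replace(/\s+,/g, ',') // drop the gap left in front of a comma
    .replace(/\s{2,}/g, ' ') // collapse doubled spaces
    .trim();
}

console.log(stripJobCodes('Interpersonal Communication COMM 1610U, Sessional Lecturer'));
console.log(stripJobCodes('Music Trends MUS1015 Part Time Professor'));
console.log(stripJobCodes('Plant Functional Genomics BOT 464/564, Contract Lecturer'));
```

Codes written with a spelled-out subject, such as 'Psychology 2801', are deliberately left alone by this pattern.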

wooorm · 2014-11-08T16:18:14Z

Great to hear, those numbers do indeed seem pretty good. You could try fiddling with the magic number. Maybe 0.89 or even lower would give better results?
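One way to fiddle with the number is to count errors at several thresholds over labelled samples. The sketch below uses a handful of ratios copied (rounded) from the log earlier in this thread; the "truth" labels are a reader's judgement of each title, and the helper name is hypothetical.

```javascript
// [detected language, ratio from the log above (rounded), actual language]
var samples = [
  ['fra', 0.8828, 'eng'], // IT Service Desk Analyst
  ['fra', 0.8847, 'eng'], // Snow Shovellers (18 positions)
  ['fra', 0.8717, 'eng'], // Application Consultant
  ['fra', 0.7744, 'fra'], // Commis Centre d’accès
  ['fra', 0.8362, 'fra'], // Statistique descriptive STAT 2653, chargé de cours
  ['fra', 0.9484, 'fra'], // Analyse math appliquée I MATH 1063, chargé de cours
  ['fra', 0.8995, 'fra'], // Approches éducatives NUEF 2701, Chargé de cours
  ['fra', 0.9834, 'eng'], // Film Production, Part Time Professor
  ['eng', 0.9705, 'eng'] // Communications Officer
];

// Count how many samples are misclassified for a given bias and threshold.
function countErrors(list, bias, sureness) {
  return list.filter(function (sample) {
    var guess = sample[1] < sureness ? sample[0] : bias;
    return guess !== sample[2];
  }).length;
}

[0.95, 0.9, 0.89, 0.85].forEach(function (sureness) {
  console.log(sureness, countErrors(samples, 'eng', sureness));
});
```

On this small hand-picked subset a lower threshold rescues short English titles but starts sending real French titles to English, so the best value depends on the mix of titles in the full dataset.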

On the side note, I’m guessing it's just that, for a program that doesn’t know the concept of words, certain French words look quite “English” (and vice versa). I’d think that removing job codes would help, but that it isn’t everything!

This breaking new feature makes the numbers returned by `franc.all()` more useful by interpolating them between the most probable language's distance and the maximum distance. Normalized results make it easier for developers (see GH-13) to know how 'sure' franc is about the most probable language, for example by checking whether the difference between the primary and secondary languages is more than `n` (where `n` could be, for example, `0.9`). The resulting numbers are now guaranteed to be between `0` and `1`, inclusive.
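As an illustration of that interpolation, here is a minimal sketch rather than franc's actual code. It assumes the caller supplies the maximum possible distance as `maxDistance` (a hypothetical parameter) and maps the best distance to `1` and the maximum distance to `0`.

```javascript
// `distances` is [[code, distance], ...] sorted with the smallest first.
// `maxDistance` is an assumed upper bound on the distance for this input.
function normalize(distances, maxDistance) {
  var min = distances[0][1];
  var range = maxDistance - min;

  return distances.map(function (pair) {
    // Best language maps to 1; a language at maxDistance maps to 0.
    var score = range > 0 ? 1 - (pair[1] - min) / range : 0;
    return [pair[0], score];
  });
}

// Made-up distances, for illustration only.
var normalized = normalize([['fra', 3000], ['eng', 4000]], 7000);
console.log(normalized); // [ [ 'fra', 1 ], [ 'eng', 0.75 ] ]
```

After normalization the most probable language has the highest score instead of the lowest, and scores from inputs of different lengths land on the same 0 to 1 scale.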

wooorm closed this as completed Nov 8, 2014
