Explain the output of 'all' #13
Comments
Good question, the lowest number does indeed mean the highest probability. Thing is, Could you paste your code? I’m thinking you’re doing something like:
franc(jobTitle, {
  'whitelist': ['fra', 'eng']
});
Right? Although this does not include your threshold code.
Here is what I'm using - it's a little terrible as I'm just testing the code out to see if it would work in production. In my context, aTitle is more likely to be English, and English seems to be harder to detect, so I default to English.
Setting the threshold in the above function to 500, this is where English is being detected as French:
I'm guessing that some of the job codes ('MITS 5002G', 'COMM 1610U', etc.) are throwing off the detection, but that doesn't explain everything.
The numbers returned by Could you print the ratio between
console.log(possible[0][0], possible[1][0], possible[0][1] / possible[1][1]);
Something like this could be more interesting, I think.
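The ratio requested in that log line can be wrapped in a small helper. This is only a minimal sketch: it assumes `possible` has the shape of a `franc.all()` result (a list of `[code, score]` pairs, best match first) and that lower scores mean a closer match, as discussed in this thread. The helper name `topTwoRatio` and the sample scores are made up for illustration.

```javascript
'use strict';

/**
 * Ratio between the best and second-best scores.
 * With distance-like scores (lower is better), a value close
 * to 0 suggests a clear winner; close to 1 suggests a near tie.
 */
function topTwoRatio(possible) {
  if (possible.length < 2 || possible[1][1] === 0) {
    return null; // Not enough information to compare.
  }

  return possible[0][1] / possible[1][1];
}

// Hypothetical scores, not real franc output.
var sample = [['eng', 450], ['fra', 500]];

console.log(sample[0][0], sample[1][0], topTwoRatio(sample));
```

A ratio like this is relative to the input, which is why it may be more telling than a fixed cut-off on the raw score.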
Right. I felt like the magic number didn't grasp the nature of the beast, I just dialed it in until it started to respond relatively okay. Here are some examples - your requested log is in parens. Detected as French (with my magic number threshold) - some are right and some are wrong. Should be pretty evident.
These were detected correctly as English with the magic number:
These would have been misidentified as French if it wasn't for the threshold:
And what if you check the formula I gave you, and a result of 0.9 for 'sureness'? I'll check back tomorrow to read it better though. Anyway, to me it seems to just be a problem with the shortness of the input and French-like words in English descriptions, neither of which, I think, can be fixed with the current solution!
Alright, so I implemented my “sureness” idea in the code below, and on your data above it seems to get only 2 (out of 62) wrong. Note that the code below could be simplified a bit, but it’s more verbose for readability.
'use strict';
var franc;
var data;
var byLanguage;
var bias;
var SURENESS;
franc = require('franc');
bias = 'eng';
SURENESS = 0.9;
data = [
'IT Service Desk Analyst',
'Communication Control Center Operator',
'Interpersonal Communication COMM 1610U, Sessional Lecturer',
'Cybercrime SSCI 3021U, Sessional Lecturer',
'Lecturer, Computer Science, Tenure Stream',
'Agent au services d’appui à la réussite scolaire',
'Approches éducatives NUEF 2701, Chargé de cours',
'Les études familiales ETFA 1151, Chargé de cours',
'Music Trends MUS1015 Part Time Professor',
'Application Consultant',
'Commis Centre d’accès',
'Réviseure Ou Réviseur De Traduction Juridique',
'Analyse math appliquée I MATH 1063, chargé de cours',
'Éléments de math discrètes MATH 1563, chargé de cours',
'Labo de solutions chimiques CHIM 2582, chargé de cours',
'Nombres et leurs propriétés MATH 1143, chargé de cours',
'Notions calcul diff et intég MATH 3133, chargé de cours',
'Statistique descriptive STAT 2653, chargé de cours',
'Lab Supervisor, IDEA Department',
'Snow Shovellers (18 positions)',
'Program Assistant, Hair Design / Esthetics',
'Accommodation Assistant, Counselling & Disability Services ',
'Research Assistant Professor, Coherent Control Of Quantum Devices',
'Coherent Control Of Quantum Devices, Research Assistant Professor',
'Executive Officer, Dean of Arts Office',
'Administrative Coordinator, Department of Earth and Environmental Sciences',
'Graduate Admissions Specialist, Graduate Studies Office',
'Administrative Support Staff, Centre for English Language Development',
'Data Entry Clerk, Department Of Finance, Payroll',
'Communications Officer',
'Assistant Professor, Sociology, Tenure Track',
'Campaign Marketing & Communications Coordinator',
'Research Technician, Paediatrics',
'Counsellor',
'Secretary to the Dean, School of Business',
'Dentist, School of Health Sciences and Emergency Services',
'Professors, Fitness Courses, School of Health Sciences and Emergency Services',
'Simulation Lab Educator, Nursing',
'Meeting and Events Coordinator',
'Institutional Quality Assistant',
'Film Production, Part Time Professor',
'Assistant Professor, Geophysics, Sedimentology, or Geochemistry, Tenure Track',
'Associate / Full Professor, Department Chair, Department of Earth Sciences, Tenured',
'Professor, Interdisciplinary / Literacy, Community Integration Through Cooperative Education Program',
'Department Assistant, Continuing Education',
'General Maintenance Worker',
'Human Resources Advisor, Client Services',
'Assistant Professor, Poultry Nutrition, Tenure Track',
'Dispatcher, Police and Fire Prevention',
'Financial Analyst',
'Logistics Coordinator, Executive Programs, College of Business and Economics ',
'Manager, Finance and Administration, College of Arts',
'Custodian, Physical Resources ',
'Grounds Machinery Operator, Snow and Ice Control',
'Groundskeeper, Snow and Ice Control',
'Custodian, Physical Resources (2 Positions)',
'Special Constable, Campus Community Police',
'Program Assistant, School of Health Sciences',
'Institutional Research Analyst',
'Instructional Associate, Access Programs For People With Disabilities',
'Custodian',
'Administrative Support'
];
byLanguage = {};
data.map(function (title) {
    var result,
        primary,
        secondary,
        difference;

    result = franc.all(title, {
        'whitelist' : [bias, 'fra']
    });

    primary = result[0];
    secondary = result[1];

    /**
     * No good statistics are possible with franc,
     * guess the biased language.
     */
    if (primary[0] === 'und') {
        return [bias, title];
    }

    difference = primary[1] / secondary[1];

    /**
     * Pretty sure.
     */
    if (difference < SURENESS) {
        return [primary[0], title];
    }

    /**
     * Probably, as the language is detected as
     * the biased language.
     */
    if (primary[0] === bias) {
        return [bias, title];
    }

    /**
     * Pretty sure, but we are biased...
     */
    return [bias, title];
}).forEach(function (result) {
    if (!(result[0] in byLanguage)) {
        byLanguage[result[0]] = [];
    }

    byLanguage[result[0]].push(result[1]);
});

/**
 * Wrong:
 * - 'Analyse math appliquée I MATH 1063, chargé de cours' as `eng`;
 * - 'Snow Shovellers (18 positions)' as `fra`.
 *
 * (I might have missed some).
 */
console.log(byLanguage);
@wooorm - Ran your code over a bigger dataset and it seems to be working well. Of 2293 job titles, 9 (0.4%) French titles and 19 (0.8%) English titles were misidentified. Less than 1% is a reasonable error rate for my needs. On a side note, I thought the job codes ('COMM 1921a', etc.) might have been throwing off the detection, but I used a regular expression to remove them and got roughly the same results.
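For anyone wanting to try the same clean-up, here is a rough sketch of stripping job codes before detection. The regular expression actually used is not shown in this thread, so the pattern below is an assumption based on the codes quoted here (three or four capital letters, an optional space, four digits, an optional trailing letter); `JOB_CODE` and `stripJobCodes` are hypothetical names.

```javascript
'use strict';

// Assumed shape of a job code, e.g. 'COMM 1610U', 'MUS1015', 'COMM 1921a'.
var JOB_CODE = /\b[A-Z]{3,4}\s?\d{4}[A-Za-z]?\b/g;

/**
 * Remove job codes from a title and tidy up the
 * whitespace and commas left behind.
 */
function stripJobCodes(title) {
  return title
    .replace(JOB_CODE, '')
    .replace(/\s+,/g, ',') // No space before a comma.
    .replace(/\s{2,}/g, ' ') // Collapse runs of whitespace.
    .trim();
}

console.log(stripJobCodes('Interpersonal Communication COMM 1610U, Sessional Lecturer'));
console.log(stripJobCodes('Music Trends MUS1015 Part Time Professor'));
```

The cleaned title can then be passed to the detection code in place of the raw one; as reported above, the effect on accuracy may well be small.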
Great to hear, those numbers do indeed seem pretty good. You could try fiddling with the magic number. Maybe On the side note, I’m guessing it’s just that, for a program that doesn’t know the concept of words, certain French words look quite “English” (and vice versa). I’d think that removing job codes would help, but that it isn’t everything!
This breaking new feature makes the numbers returned by `franc.all()` more useful by interpolating them between the most probable language's distance and the maximum distance. Normalized results make it easier for developers (see GH-13) to know how 'sure' franc is about the most probable language, for example by checking if the difference between the primary and secondary languages is more than `n` (where `n` could be, for example, `0.9`). The resulting numbers are now guaranteed to be between `0` and `1`, inclusive.
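To make the interpolation idea concrete, here is a minimal sketch of rescaling distances into the range `0` to `1`. It is not taken from franc's source: `normalize` is a hypothetical helper, the maximum distance is simply passed in as `maxDistance`, and the sample distances are invented.

```javascript
'use strict';

/**
 * Rescale distances (lower is better, best first) to scores
 * between 0 and 1 (higher is better, best first).
 * `maxDistance` is the largest distance considered possible;
 * how it is chosen is an assumption of this sketch.
 */
function normalize(distances, maxDistance) {
  var min;
  var range;

  if (distances.length === 0) {
    return [];
  }

  min = distances[0][1];
  range = maxDistance - min;

  return distances.map(function (pair) {
    // Guard against a zero range, then clamp to [0, 1].
    var score = range > 0 ? 1 - (pair[1] - min) / range : 0;

    return [pair[0], Math.min(1, Math.max(0, score))];
  });
}

// Hypothetical distances, not real franc output.
var normalized = normalize([['eng', 100], ['fra', 150], ['deu', 300]], 300);

console.log(normalized);
```

With scores like these, the gap `normalized[0][1] - normalized[1][1]` is one possible measure of how 'sure' the detection is.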
The results of 'all' consist of the language code and a score number. I've guessed that the lowest number is the detected language, but what can be learned from the score number? Doesn't seem to be documented.
I'm looking to detect the language of job titles in English and French only (because Canada) and I was getting results all over the place using just
franc(jobTitle)
but by whitelisting English and French and then applying a threshold to the score, I was able to tune in a much more accurate result (still a 3.92% error rate over 1020 job titles, but it was in the 25% range before the threshold). Is this a good use for the score or am I just getting lucky?
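An error rate like the one quoted can be measured by running a detector over titles whose language is already known. The following is a sketch under assumptions: `errorRate` is a hypothetical helper, `alwaysEnglish` is a stand-in for a real detector (such as a wrapper around franc), and the labels on the four sample titles are supplied by hand for illustration.

```javascript
'use strict';

/**
 * Fraction of labelled titles for which `detect`
 * returns a code other than the expected one.
 */
function errorRate(labelled, detect) {
  var wrong = labelled.filter(function (entry) {
    return detect(entry.title) !== entry.language;
  });

  return labelled.length === 0 ? 0 : wrong.length / labelled.length;
}

// Stand-in detector for illustration: always answers 'eng'.
function alwaysEnglish() {
  return 'eng';
}

var labelled = [
  {title: 'Financial Analyst', language: 'eng'},
  {title: 'Commis Centre d’accès', language: 'fra'},
  {title: 'Counsellor', language: 'eng'},
  {title: 'Custodian', language: 'eng'}
];

console.log(errorRate(labelled, alwaysEnglish));
```

Comparing this number with and without a threshold, on the same labelled set, is one way to check whether a chosen threshold helps or was just a lucky fit.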