Question / feature request #43
Comments
Nice idea. I will add it to tokenize.js. Do you want to do it?
Note: I did it and I'll submit it with the fork, but it raised two questions about
This is what it looks like now (with complete curly quote replacement):
function normalise (str) {
  if (!str) { return "" }
  str = str.toLowerCase();
  str = str.replace(/[,\.!:;\?\(\)]/, '');
  // single curly quotes
  str = str.replace(/[\u2018\u2019\u201A\u201B\u2032\u2035]+/g, "'");
  // double curly quotes
  str = str.replace(/[\u201C\u201D\u201E\u201F\u2033\u2036]+/g, '"');
  if (!str.match(/[a-z0-9]/i)) { return '' }
  return str
}
• Is the
We could normalize it further using your ../transliteration/unicode_normalisation.js
The fix looks great; I've added a test too.
For normalizing the input:
How about normalizing all typographic characters, such as curly and other special quotes, to their plain straight equivalents?
This may be useful for cases such as O’Reilly to O'Reilly.
See http://practicaltypography.com/straight-and-curly-quotes.html
And a note to self:
Maybe it would be useful to write a "preprocess" test that checks whether everything in .js and .min.js ("expanded") is the same.
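One possible reading of this "preprocess" test is that both builds should return identical output for the same inputs. The sketch below assumes that reading. The helper name findMismatches and the stand-in functions are hypothetical; in practice the two functions would be loaded from the .js and .min.js files, whose paths are not given here.

```javascript
// Hypothetical helper: run two builds of the same function over shared
// inputs and collect every input for which the outputs differ.
function findMismatches(expandedFn, minifiedFn, inputs) {
  const mismatches = [];
  for (const input of inputs) {
    // Compare serialized results so arrays and objects are compared by value.
    const a = JSON.stringify(expandedFn(input));
    const b = JSON.stringify(minifiedFn(input));
    if (a !== b) {
      mismatches.push({ input, expanded: a, minified: b });
    }
  }
  return mismatches;
}

// Stand-ins for the real builds (assumptions for illustration only).
const expanded = (s) => s.toLowerCase();
const minified = (s) => s.toLowerCase();
const broken = (s) => s;

console.log(findMismatches(expanded, minified, ['O\u2019Reilly', 'Hello']).length); // 0
console.log(findMismatches(expanded, broken, ['abc', 'Hello']).length); // 1
```

Returning the list of mismatches rather than a boolean makes it easier to see which input exposed a difference between the builds.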