enhance Inflector helper with ascii function #464

Merged
merged 15 commits into from Dec 28, 2013

Conversation

@tonydspaniard
Contributor

@tonydspaniard tonydspaniard commented May 30, 2013

closes #364

Edit:
It seems to me that transliteration without PECL or the new Transliterator class (PHP >= 5.4) is something that a simple array won't do.

I think it is OK for a first round, but if we wish to adopt transliteration the right way, we should consider a substantial rework on this matter.

References:

https://drupal.org/project/transliteration
http://php.net/manual/es/transliterator.transliterate.php
https://doc.wikimedia.org/mediawiki-core/master/php/html/UtfNormal_8php_source.html

* @param array $replace the characters to be replaced by spaces.
* @return string the translated
*/
public static function ascii($string, $replace = array())

@qiangxue

qiangxue May 30, 2013
Member

I think we don't need this $replace because it does something that has nothing to do with ascii.

@tonydspaniard

tonydspaniard May 31, 2013
Author Contributor

Yes, that is true, but when you convert something like I'll be back, you get Ill be back, which makes a wrong slug: Ill-be-back instead of the I-ll-be-back we would get by using replacement tokens.

@qiangxue

qiangxue May 31, 2013
Member

Why does I'll be back become Ill-be-back? Isn't ' an ASCII char already?

@tonydspaniard

tonydspaniard Jun 1, 2013
Author Contributor

You are right, I get your point; that should be the job of the slug function.

if (!empty($replace)) {
    $string = str_replace((array)$replace, ' ', $string);
}
$string = iconv('UTF-8', 'ASCII//TRANSLIT', $string);

@qiangxue

qiangxue May 30, 2013
Member

How about using the $transliteration map? Using iconv means an additional dependency on libiconv.
Also, it would be great to refactor slug() to make use of ascii().
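
Editorial sketch of the suggestion above, not code from the PR: a map-based `ascii` step with `slug` built on top of it, so no iconv is needed. The function names `ascii_sketch` and `slug_sketch` and the tiny map are hypothetical; the real `$transliteration` table in the PR is far larger.

```php
<?php
// Hypothetical, heavily truncated map (assumption for illustration only).
$transliteration = [
    '/À|Á|Â|Ã|Ä|Å/u' => 'A',
    '/à|á|â|ã|ä|å/u' => 'a',
    '/È|É|Ê|Ë/u' => 'E',
    '/è|é|ê|ë/u' => 'e',
    '/ñ/u' => 'n',
];

// Replace every character matched by a map pattern with its ASCII counterpart.
function ascii_sketch(string $string, array $map): string
{
    return preg_replace(array_keys($map), array_values($map), $string);
}

// Build the slug on top of ascii_sketch(): anything that is still not an
// ASCII letter or digit collapses into a single replacement character.
function slug_sketch(string $string, array $map, string $replacement = '-'): string
{
    $string = ascii_sketch($string, $map);
    $string = preg_replace('/[^A-Za-z0-9]+/', $replacement, $string);
    return trim($string, $replacement);
}

echo slug_sketch('Año nuevo, café olé', $transliteration), "\n"; // prints "Ano-nuevo-cafe-ole"
```

Keeping the punctuation handling in the slug step and out of the ASCII step matches the conclusion reached earlier in the thread about the `$replace` parameter.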

@tonydspaniard

tonydspaniard May 31, 2013
Author Contributor

Will give it a shot on my FT

Antonio Ramirez added 4 commits Jun 1, 2013
* upstream:
  Fixed build break.
  Fixes issue #194: Added Application::catchAll.
  added missing default to getValue in boostrap tabs
  \yii\widgets\Menu improvement
  Fixes issue #467: allow view file to be absent as long as the themed version exists.
  better auto scrolling.
  Fixes issue #446: automatically scroll to first error.
Antonio Ramirez
*/
public static $transliteration = array(
protected static $transliteration = array(

@qiangxue

qiangxue Jun 1, 2013
Member

Why turn this into protected?

@cebe

cebe Jun 1, 2013
Member

imo they should all be customizable.

@creocoder

creocoder Jun 1, 2013
Contributor

Also, the transliteration property has an incorrect map. Please check this: http://iamseanmurphy.com/creating-seo-friendly-urls-in-php-with-url-slug/

@tonydspaniard

tonydspaniard Jun 2, 2013
Author Contributor

I took my inspiration from that link; how is it incorrect?

@creocoder

creocoder Jun 2, 2013
Contributor

@tonydspaniard For example, there is no rule for Cyrillic а. Also, the rule '/ъ|ь|Ъ|Ы|Ь/' => '' is incorrect because Ы should be Y. If you were inspired by this link, we should take these rules carefully ;)
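
Editorial sketch of how the two problems named above could be addressed in the map format used by the PR. The array name `$cyrillicRules` and the helper `apply_rules_sketch` are hypothetical, and the lowercase rule ы => y is an assumption added for symmetry; only Ы => Y and the missing rule for а come from the comment. Code points are written as escapes so look-alike Latin letters cannot slip in.

```php
<?php
// Hypothetical corrected rules (assumption, not the merged map).
$cyrillicRules = [
    '/\x{044A}|\x{044C}|\x{042A}|\x{042C}/u' => '', // ъ ь Ъ Ь: hard and soft signs are dropped
    '/\x{042B}/u' => 'Y',                           // Ы gets its own rule instead of being dropped
    '/\x{044B}/u' => 'y',                           // ы (assumed lowercase counterpart)
    '/\x{0410}/u' => 'A',                           // А CYRILLIC CAPITAL LETTER A
    '/\x{0430}/u' => 'a',                           // а CYRILLIC SMALL LETTER A
];

// Apply the pattern => replacement pairs in order.
function apply_rules_sketch(string $string, array $rules): string
{
    return preg_replace(array_keys($rules), array_values($rules), $string);
}

echo apply_rules_sketch("\u{042B}\u{0430}\u{044C}", $cyrillicRules), "\n"; // prints "Ya"
```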

@creocoder

creocoder Jun 2, 2013
Contributor

@tonydspaniard I think it's a good idea to make some kind of converter to automatically convert the rules from the link to this format.

@tonydspaniard

tonydspaniard Jun 5, 2013
Author Contributor

I agree with @cebe, working on this

'/Ţ|Ț|Ť|Ŧ|τ/' => 'T',
'/ţ|ț|ť|ŧ|т/' => 't',
'/Ù|Ú|Û|Ũ|Ū|Ŭ|Ů|Ű|Ų|Ư|Ǔ|Ǖ|Ǘ|Ǚ|Ǜ|У/' => 'U',
'/ù|ú|û|ũ|ū|ŭ|ů|ű|ų|ư|ǔ|ǖ|ǘ|ǚ|ǜ/' => 'u',

@cebe

cebe Jun 2, 2013
Member

у is missing here in 272; it is the lowercase variant of У, which you added in 271.
It is this one: U+0443 in http://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?start=1024 "CYRILLIC SMALL LETTER U"

@tonydspaniard

tonydspaniard Jun 5, 2013
Author Contributor

y is also y in Spanish, therefore I doubt this is the right approach for transliteration at all.

@cebe

cebe Jun 5, 2013
Member

Are you sure it is the same character? y (German and English y in ASCII) and у (Russian u) have different Unicode code points even if the symbols look the same.
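
Editorial sketch illustrating the point above: the two look-alike letters are different strings with different UTF-8 bytes, so a rule written for one never touches the other. The variable names are hypothetical; the Cyrillic letter is written as an escape (PHP 7+) to rule out any confusion in the source file.

```php
<?php
$latinY = 'y';            // U+0079 LATIN SMALL LETTER Y
$cyrillicU = "\u{0443}";  // U+0443 CYRILLIC SMALL LETTER U

var_dump($latinY === $cyrillicU);   // bool(false)
echo bin2hex($latinY), "\n";        // prints "79"
echo bin2hex($cyrillicU), "\n";     // prints "d183"

// A rule keyed on U+0443 rewrites only the Cyrillic letter.
echo preg_replace('/\x{0443}/u', 'u', $latinY . $cyrillicU), "\n"; // prints "yu"
```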

@tonydspaniard

tonydspaniard Jun 5, 2013
Author Contributor

@cebe yes... shouldn't we then convert the chars to their Unicode to make sure we are replacing the correct ones? I believe that allowing custom rules to override the ones provided would solve these issues, and everyone will be able to provide their own set of rules to add or override the ones provided...

The list is huge...

@cebe

cebe Jun 5, 2013
Member

> shouldn't we then convert the chars to their Unicode to make sure we are replacing the correct ones?

The file is saved as UTF-8, so they are already Unicode. Or what do you mean?

> The list is huge...

I think we already have the most important ones. We just need to make sure they are correct.

@cebe

cebe Jun 5, 2013
Member

Maybe we can try to automatically create rules using a Unicode table: http://www.utf8-zeichentabelle.de/ Not sure if it will work well.

'/ţ|ț|ť|ŧ|т/' => 't',
'/Ù|Ú|Û|Ũ|Ū|Ŭ|Ů|Ű|Ų|Ư|Ǔ|Ǖ|Ǘ|Ǚ|Ǜ|У/' => 'U',
'/ù|ú|û|ũ|ū|ŭ|ů|ű|ų|ư|ǔ|ǖ|ǘ|ǚ|ǜ/' => 'u',
'/в/' => 'v',

@cebe

cebe Jun 2, 2013
Member

This one should be ve, as far as I know and can see in the UTF table: U+0432
http://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?start=1024 "CYRILLIC SMALL LETTER VE"
Also, its uppercase variant seems to be missing.

@andersonamuller

andersonamuller Jun 5, 2013
Contributor

Another reference for unicode character table:
http://unicode-table.com/en/

@creocoder

creocoder Jun 6, 2013
Contributor

@cebe All fine here. в should be transliterated as v.

@creocoder

creocoder Jun 6, 2013
Contributor

@cebe If I'm not wrong, CLDR has transliteration rules for every language. Why not just use it?

@tonydspaniard

tonydspaniard Jun 6, 2013
Author Contributor

IMHO, transliteration done the right way will lead us to maintain a source like https://drupal.org/project/transliteration . This problem is solved from 5.4 onwards with the Transliterator class but, unfortunately, there is none for 5.3.

I believe that we should only provide transliteration for Romance languages (maybe include Russian) and then allow custom rules to be included.

What do you guys think? @qiangxue @creocoder @andersonamuller @cebe

@resurtm

resurtm Jun 6, 2013
Contributor

> maybe include russian

Better to support all languages based on the Cyrillic script, not just Russian. They all have a very small set of different characters.

@resurtm

resurtm Jun 6, 2013
Contributor

http://www.unicode.org/charts/PDF/U0400.pdf

This is OK.

http://www.unicode.org/charts/PDF/U2DE0.pdf

I'm unsure whether Old Church Slavonic is used anywhere nowadays. It's like Old English: I've never seen any modern texts which use it.

http://www.unicode.org/charts/PDF/UA640.pdf

Looks meaningless as well.

* upstream/master: (91 commits)
  fixed init.bat paths
  renamed backstage → backend
  Fixed test break.
  Response WIP
  coding style fix.
  [1] Redone missing code. [2] Added empty line at the end of file [3] Removed exception.
  Updated code style. braces on the same line for control statements.
  Removed unused columsn from find constraint sql. Fixed typo. Added extra schema check for when a foreign table is not in the same schema. Updated indentation to conform to other classes.
  Multilevel Items
  Fixes issue #514.
  Removed the config setting that should not have been commited.
  Removed false exception catching.
  Removed custom pgsql PDO and added defaultSchema as public property.
  CS fixes.
  Minor refactoring of DbMessageSource.
  Response WIP
  Minor refactoring of t().
  Added Jsonable support.
  Fixes #13. Implement DB message source.
  Yii::t() minor fix.
  ...
@cebe
Member

@cebe cebe commented Oct 22, 2013

@tonydspaniard as we are on PHP 5.4 now, can you check if/what's obsolete here now?

@samdark
Member

@samdark samdark commented Oct 23, 2013

@cebe transliteration is now available in 5.4 intl: http://www.php.net/manual/en/class.transliterator.php

@iJackUA
Contributor

@iJackUA iJackUA commented Oct 23, 2013

@samdark, @cebe it is available in intl, but as I understand it, intl might not be installed (as with messages, where you are making fallback functionality), so it seems that a fallback for transliteration is also (still) needed.

@samdark
Member

@samdark samdark commented Oct 23, 2013

We're creating a fallback for English only where transliteration isn't needed.

@iJackUA
Contributor

@iJackUA iJackUA commented Oct 23, 2013

I understand, so why not make the slug function with both options, something like

    public static function slug($string, $replacement = '-')
    {
        if (extension_loaded('intl')) {
            // best options should be evaluated, this is just a quick copy-paste
            $options = 'Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC; [:Punctuation:] Remove;';
            $string = transliterator_transliterate($options, $string);
            // collapse runs of hyphens and whitespace into the replacement
            $string = preg_replace('/[-\s]+/', $replacement, $string);
            return trim($string, $replacement);
        } else {
            // no intl: fall back to the static regex map
            $map = static::$transliteration + [
                '/[^\w\s]/' => ' ',
                '/\\s+/' => $replacement,
                '/(?<=[a-z])([A-Z])/' => $replacement . '\\1',
                str_replace(':rep', preg_quote($replacement, '/'), '/^[:rep]+|[:rep]+$/') => ''
            ];
            return preg_replace(array_keys($map), array_values($map), $string);
        }
    }
@tonydspaniard
Contributor Author

@tonydspaniard tonydspaniard commented Dec 20, 2013

@cebe back on track with this; I will review again what is needed and what is not. @iJackUA's fallback proposal seems good to me.

* upstream: (2012 commits)
  doc fix.
  Changed the signature of `urlCreator` and button creators for `yii\gridview\ActionColumn`
  parameter adjustment.
  The signature for `yii\gridview\ActionColumn::urlCreator` is changed - the `$action` parameter is moved to the first
  Fixed the signature of Schema::findUniqueIndexes().
  reverted #1598 and added a test for it
  Fix wrong array index in unique indexes for MySql
  Making accesors public
  Get DB unique indexes information
  Fixes #1610: `Html::activeCheckboxList()` and `Html::activeRadioList()` will submit an empty string if no checkbox/radio is selected
  Gii should keep horizontal layout
  Documentation at "yii\authclient" updated.
  Doc comments at "yii\authclient" updated.
  Auth clients "Choice" doc comments updated.
  Auth clients "Choice" widget javascript advanced.
  Bootstrap's dropdown encodes also trailing caret tag
  Auth clients "Choice" widget markup updated.
  Gii should keep horizontal layout
  extended from codeception testcase, added docs
  Auth clients for Facebook, GitHub, LinkedIn added.
  ...

Conflicts:
	framework/yii/helpers/BaseInflector.php
	tests/unit/framework/helpers/InflectorTest.php
@tonydspaniard
Contributor Author

@tonydspaniard tonydspaniard commented Dec 26, 2013

After reviewing and checking lots of sources out there, I couldn't find a really good and simple method to transliterate text. iconv fails on Cyrillic, so I thought about Unicode char code replacements... The best solution so far was the one created for Drupal, which was based on UtfNormal from MediaWiki.

I have added the TransliteratorHelper class for you guys to review. It could be that the data to transliterate is at the wrong location (transliteration/data) or is too much for the framework. Any feedback is greatly appreciated @yiisoft/core-developers

Edit: TransliteratorHelper falls back to the intl extension if installed.

@qiangxue
Member

@qiangxue qiangxue commented Dec 26, 2013

I just reviewed the code. I think including transliteration/data is a bit too much for fallback purposes. The situation is a bit similar to the message formatting, where we use intl if possible and the fallback just makes things work, though not perfectly correctly.

It seems to me @iJackUA's approach is good enough.

What do you think?

@tonydspaniard
Contributor Author

@tonydspaniard tonydspaniard commented Dec 26, 2013

Yeah... I thought transliteration/data was too much, but it was the best approach so far. I will roll back to the previous version and use @iJackUA's approach, updating the $transliteration values to fit the replacements as well as possible according to the UTF-8 tables referred to before. We can always update those values.

Thanks @qiangxue

@iJackUA
Contributor

@iJackUA iJackUA commented Dec 27, 2013

If I understand correctly, $transliteration currently works only with Latin characters.
It does not contain Cyrillic, for example, so it returns nothing (an empty string) if I do something like ...::slugify('Привет Йии фрэймворк')
and

$options = "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC; [:Punctuation:] Remove;";
$string = transliterator_transliterate($options, $string);

does much more right now: it transliterates any language (at least I have tried with Russian) to Latin characters.

In my test with the string
Привет Hello Йии Framework ! Как дела ? How it goes ?
I get this output (the first one is expected and considered correct):
INTL : Privet-Hello-Jii-Framework-Kak-dela-How-it-goes
Inflector::slug : Hello-Framework-How-it-goes

By the way, should there be an option for lowercasing the string in slug? In most cases it is used for SEO URLs, and to remove inconsistency I prefer to make the URL all lowercase.

@samdark
Member

@samdark samdark commented Dec 27, 2013

Lowercasing is a good idea. transliterator_transliterate is only available when intl is installed. Since we're requiring intl for any advanced message translation (such as plurals) anyway, I think a good solution is to use intl if available and fall back to English-only if not.
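
Editorial sketch of the direction agreed on in this comment, not the merged code: use intl when it is present, keep only a plain ASCII cleanup otherwise, and lowercase on request. The function name `slug_with_fallback` and the `$lowercase` parameter are hypothetical; the transliteration rule string is the one quoted earlier in the thread.

```php
<?php
function slug_with_fallback(string $string, string $replacement = '-', bool $lowercase = true): string
{
    // Prefer intl when available; it handles non-Latin scripts.
    if (function_exists('transliterator_transliterate')) {
        $rules = 'Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC; [:Punctuation:] Remove;';
        $result = transliterator_transliterate($rules, $string);
        if ($result !== false) {
            $string = $result;
        }
    }

    // English-only fallback, also used as final cleanup after intl:
    // every run of non-alphanumeric ASCII becomes one replacement.
    $string = preg_replace('/[^A-Za-z0-9]+/', $replacement, $string);
    $string = trim($string, $replacement);

    return $lowercase ? strtolower($string) : $string;
}

echo slug_with_fallback('Hello Framework !'), "\n"; // prints "hello-framework"
```

Without intl, non-ASCII input such as Cyrillic is simply dropped by the cleanup step, which is the "works but not perfectly correct" trade-off discussed above.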

@tonydspaniard
Contributor Author

@tonydspaniard tonydspaniard commented Dec 27, 2013

@iJackUA I am going to try to include the Cyrillic reference, but do not expect results as good as with the helper I have to remove. For those interested in using a more robust transliteration fallback, please check (versions done for both Yii1 and Yii2): https://github.com/2amigos/transliteration-helper

@samdark I guess adding a bool parameter to the function won't hurt.

* upstream: (21 commits)
  Fixes #1643: Added default value for `Captcha::options`
  Fixes #1654: Fixed the issue that a new message source object is generated for every new message being translated
  Allow dash char in ActionColumn’s button names.
  Added SecurityTest.
  fixed functional test when enablePrettyUrl is false.
  fixed composer.json
  minor doc fix.
  Fixes #1634: Use masked CSRF tokens to prevent BREACH exploits
  Use better random CSRF token.
  GII unique indexes avoid autoIncrement columns
  updated debug retry params.
  Added sleep().
  Added unit test for ActiveRecord::updateAttributes().
  Fixes #1641: Added `BaseActiveRecord::updateAttributes()`
  Fixed #1504: Debug toolbar isn't loaded successfully in some environments when xdebug is enabled
  Mongo README.md updated.
  Fixes #1611: Added `BaseActiveRecord::markAttributeDirty()`
  Number validator was missing
  Fixes #1638: prevent table names from being enclosed within curly brackets twice.
  Unique indexes rules for single columns into array
  ...
@tonydspaniard
Contributor Author

@tonydspaniard tonydspaniard commented Dec 27, 2013

@qiangxue please review new approach. Tested both with intl and no intl extension. Both returned the same results.

Updated char map and fallback transliteration approach.

];
return preg_replace(array_keys($map), array_values($map), $string);
// ensure UTF-8 and remove invalid UTF-8 chars.
$string = mb_convert_encoding((string) $string, 'UTF-8', mb_list_encodings());

@qiangxue

qiangxue Dec 27, 2013
Member

Is this really needed? Or is it correct to do?

@tonydspaniard

tonydspaniard Dec 28, 2013
Author Contributor

I always tend to be very defensive in code. Maybe it is not needed. I can remove it.

* upstream:
  Fixed CSRF token masking issue.
  improved error message of calling invalid scope method.
  Fixed repo URL
  Fixes #1650: Added Connection::pdoClass.
  code style fix.
  added changelog
  codestyle fix
  improved checkIntegrity method
  Modified extension guidlines
  fix sphinx command signature
  fixed bug with forgotten param, fixed behavior for one table integrity
  fixed sequence reset
  added postgresql features to reset seq/check integrity
qiangxue added a commit that referenced this pull request Dec 28, 2013
enhance Inflector  helper with ascii function
@qiangxue qiangxue merged commit f18e530 into yiisoft:master Dec 28, 2013
1 check passed
default The Travis CI build passed
@qiangxue
Member

@qiangxue qiangxue commented Dec 28, 2013

Thanks!

@tonydspaniard tonydspaniard deleted the tonydspaniard:364-toAscii branch Feb 5, 2014
8 participants