Data normalization from some open sources
Requires PHP 7.4 or later; the latest stable version of PHP is recommended.
$ composer require Vgip/Datanorm
- Transliteration from Ukrainian into English KMU 2010-01-27 #55
- Kyiv street getter from kga.gov.ua
Transliteration from Ukrainian into English KMU 2010-01-27 #55
use Vgip\Datanorm\Transliteration\UkrEng\Cabmin2010;
$word = 'Єзгїґіпенєп';
$cabmin2010 = new Cabmin2010();
$wordTransliterated = $cabmin2010->transliterate($word);
echo $word.' -> '.$wordTransliterated;
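To illustrate the kind of rules the KMU 2010 table defines, here is a minimal standalone sketch. It is not the code of the Cabmin2010 class; the function name `transliterateSketch` is hypothetical, and the sketch capitalizes only the first Latin letter of a digraph (so it does not handle all-caps words properly).

```php
<?php
declare(strict_types=1);

/**
 * Illustrative sketch of the KMU 2010 No. 55 table; NOT the Cabmin2010 class.
 */
function transliterateSketch(string $text): string
{
    // Split a UTF-8 string into characters without requiring ext-mbstring.
    $chars = static fn (string $s): array => preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Uppercase => lowercase map for the 33 letters of the Ukrainian alphabet.
    $toLower = array_combine(
        $chars('АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ'),
        $chars('абвгґдеєжзиіїйклмнопрстуфхцчшщьюя')
    );

    // Letters whose Latin form does not depend on position.
    // The soft sign and apostrophes are not transliterated.
    $plain = [
        'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'h', 'ґ' => 'g', 'д' => 'd',
        'е' => 'e', 'ж' => 'zh', 'з' => 'z', 'и' => 'y', 'і' => 'i', 'к' => 'k',
        'л' => 'l', 'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'р' => 'r',
        'с' => 's', 'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'kh', 'ц' => 'ts',
        'ч' => 'ch', 'ш' => 'sh', 'щ' => 'shch', 'ь' => '',
        "\u{02BC}" => '', "'" => '', "\u{2019}" => '',
    ];

    // Letters with two forms: [at the start of a word, in other positions].
    $positional = [
        'є' => ['ye', 'ie'], 'ї' => ['yi', 'i'], 'й' => ['y', 'i'],
        'ю' => ['yu', 'iu'], 'я' => ['ya', 'ia'],
    ];

    $out = '';
    $prev = null; // previous character of the current word, or null at a word start
    foreach ($chars($text) as $char) {
        $isUpper = isset($toLower[$char]);
        $low = $isUpper ? $toLower[$char] : $char;

        if (isset($positional[$low])) {
            $latin = $positional[$low][$prev === null ? 0 : 1];
        } elseif ($low === 'г' && $prev === 'з') {
            $latin = 'gh'; // "зг" becomes "zgh" to keep it distinct from "ж" (zh)
        } elseif (isset($plain[$low])) {
            $latin = $plain[$low];
        } else {
            // Not a Ukrainian letter: copy it and treat it as a word boundary.
            $out .= $char;
            $prev = null;
            continue;
        }

        $out .= $isUpper ? ucfirst($latin) : $latin;
        $prev = $low;
    }

    return $out;
}

echo transliterateSketch('Єзгїґіпенєп'), PHP_EOL; // Yezghigipeniep
```

The sample word exercises the position-dependent letters (Є at the start of a word, є inside it) and the "зг" exception, which is why it is a convenient test input.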
Kyiv street getter from kga.gov.ua
Vgip\Datanorm\Parcer\Address\Ukr\Kyiv\StreetNameKga
Checks and normalizes street name data:
- Converts all apostrophe variants to a single symbol (ʼ, U+02BC).
- Checks the id (forbidden symbols, duplicates). Errors are reported in $this->warning.
- Checks the street type against a whitelist. Unknown types are saved to $this->warning and $this->typeNotFound.
- Checks the Kyiv district name against a whitelist. Unknown district names are saved to $this->warning and $this->districtNotFound.
- Checks street names and normalized street names (if normalization data is set in the $this->streetNormalization array).
- Generates the $this->nameDouble array: street names that occur two or more times.
- Generates $this->nameList: all unique street names.
- Generates $this->typeCounter: the count of each street type in Kyiv.
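The whitelist check, the type counter and the duplicate-name list described above can be sketched in plain PHP as follows. This is only an illustration under stated assumptions: the rows, the whitelist shape (key => name) and the keys `street` and `lane` are hypothetical and are not taken from StreetNameKga.

```php
<?php
declare(strict_types=1);

// Hypothetical input rows; the real parser reads them from the CSV file.
$rows = [
    ['name' => 'Садова',   'type_name' => 'вулиця'],
    ['name' => 'Садова',   'type_name' => 'провулок'],
    ['name' => 'Хрещатик', 'type_name' => 'вулиця'],
    ['name' => 'Лісова',   'type_name' => 'стежка'],
];

// Assumed whitelist shape: type key => type name.
$typeWhitelist = ['street' => 'вулиця', 'lane' => 'провулок'];

$typeCounter = [];  // type key => number of rows with that type
$typeNotFound = []; // type names missing from the whitelist
$nameCount = [];    // street name => number of occurrences

foreach ($rows as $row) {
    $typeKey = array_search($row['type_name'], $typeWhitelist, true);
    if ($typeKey === false) {
        $typeNotFound[] = $row['type_name'];
    } else {
        $typeCounter[$typeKey] = ($typeCounter[$typeKey] ?? 0) + 1;
    }
    $nameCount[$row['name']] = ($nameCount[$row['name']] ?? 0) + 1;
}

// All unique names, and the names that occur two or more times.
$nameList = array_keys($nameCount);
$nameDouble = array_keys(array_filter($nameCount, fn (int $n): bool => $n > 1));

print_r(compact('typeCounter', 'typeNotFound', 'nameList', 'nameDouble'));
```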
Result array returned by the getCsvAsArray() method:
- ['number'] - (int) serial number from file
- ['id'] - (int) identifier from file
- ['name_original'] - (string) street name from file
- ['name'] - (string) normalized street name
- ['type_name'] - (string) street type name from file
- ['type_key'] - (string) street type key
- ['district_string'] - (string) street districts from file
- ['district_list'] - (array) street districts ['district_key', 'district_key', ...]
- ['document_name'] - (string) document that assigned the name of the object
- ['document_date'] - (string) date of the document that assigned the name
- ['document_number'] - (string) number of the document that assigned the name
- ['document_title'] - (string) title of the document that assigned the name
- ['place_description'] - (string) location of the object in the city
- ['name_old'] - (string) former name of the object
- ['type_old'] - (string) former category (type) of the object
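As an example of consuming this structure, the sketch below groups street names by district. It assumes that getCsvAsArray() returns a list of rows, each with the keys listed above; the rows, ids and district keys shown here are hypothetical and only a few keys are included.

```php
<?php
declare(strict_types=1);

// Hypothetical rows shaped like the getCsvAsArray() result (subset of keys).
$data = [
    ['id' => 10001, 'name' => 'Садова',  'type_key' => 'street', 'district_list' => ['pechersk']],
    ['id' => 10002, 'name' => 'Лісова',  'type_key' => 'lane',   'district_list' => ['desna', 'dnipro']],
    ['id' => 10003, 'name' => 'Польова', 'type_key' => 'street', 'district_list' => ['pechersk']],
];

// district key => list of street names; a street that crosses several
// districts appears under each of them.
$namesByDistrict = [];
foreach ($data as $row) {
    foreach ($row['district_list'] as $districtKey) {
        $namesByDistrict[$districtKey][] = $row['name'];
    }
}
ksort($namesByDistrict);

print_r($namesByDistrict);
```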
use Vgip\Datanorm\Parcer\Address\Ukr\Kyiv\StreetNameKga;
use Vgip\Datanorm\Directory\Address\Country\Ukr\Address as DirUkrAddress;
use Vgip\Datanorm\Directory\Address\Country\Ukr\City\Kyiv as DirKyiv;
use Vgip\Datanorm\Directory\Lang\Ukr\Pattern as PatternUkrAddress;
use Vgip\Datanorm\Directory\Address\Country\Ukr\StreetNormalizedList;
use Vgip\Datanorm\Directory\Address\Country\Ukr\StreetNormalization;
$dirUkrAddress = DirUkrAddress::getInstance();
$dirKyiv = DirKyiv::getInstance();
$patternUkrAddress = PatternUkrAddress::getInstance();
$streetNormalizedListObj = StreetNormalizedList::getInstance();
$streetNormalizedList = $streetNormalizedListObj->getNormalization();
/** Get configuration and whitelist data */
$pathSourceFile = join(DIRECTORY_SEPARATOR, ['file', 'Reestr_vulits_Kyiva_2020_10_25.csv']);
$streetTypeList = $dirUkrAddress->getStreetTypeWhitelist();
$districtWhitelist = $dirKyiv->getDistrictWhitelist();
$patternStreetName = $patternUkrAddress->getStreetName();
/** Object initialization */
$streetNameKga = new StreetNameKga();
/** Set parameter */
$streetNameKga->setTypeWhitelist($streetTypeList);
$streetNameKga->setDistrictWhitelist($districtWhitelist);
$streetNameKga->setStreetNormalization($streetNormalizedList);
$streetNameKga->setPatternStreetName($patternStreetName);
/** Get a result (array) with normalized data */
$data = $streetNameKga->getCsvAsArray($pathSourceFile);
/** Get other data */
$res = [];
$res['type_list'] = $streetNameKga->getTypeList();
$res['type_counter'] = $streetNameKga->getTypeCounter();
$res['name_list'] = $streetNameKga->getNameList();
$res['name_double'] = $streetNameKga->getNameDouble();
$res['district_not_whitelist'] = $streetNameKga->getDistrictNotFound();
/** Get warnings if present */
$warning = $streetNameKga->getWarning();
$warningValue = $streetNameKga->getWarningValue();
if (null !== $warning && count($warning) > 0) {
    print_r($warning);
}
print_r($data);
print_r($res);
The resulting data uses the Unicode character U+02BC (ʼ) as the Ukrainian apostrophe. All similar characters in the source data (' U+0027, ’ U+2019, etc.) are replaced with ʼ (U+02BC). U+02BC is the apostrophe character used in Ukrainian domain names (ICANN).
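The replacement can be sketched with `strtr()`. The helper name `normalizeApostrophe` is hypothetical, and only U+0027 and U+2019 come from the description above; the other two look-alike characters are assumptions added for illustration, and the library's actual list may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: replaces apostrophe look-alikes with U+02BC.
 */
function normalizeApostrophe(string $text): string
{
    $target = "\u{02BC}"; // MODIFIER LETTER APOSTROPHE

    return strtr($text, [
        "'"        => $target, // U+0027 APOSTROPHE
        "\u{2019}" => $target, // U+2019 RIGHT SINGLE QUOTATION MARK
        "\u{2018}" => $target, // U+2018 LEFT SINGLE QUOTATION MARK (assumed)
        "\u{0060}" => $target, // U+0060 GRAVE ACCENT (assumed)
    ]);
}

echo normalizeApostrophe("Солом'янська"), PHP_EOL;
```

Passing an array to `strtr()` replaces whole multi-byte sequences, so the function is safe for UTF-8 input without the mbstring extension.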
- Position and surname - Академіка Єфремова, Генерала Авдєєнка, Маршала Бірюзова
- Name and surname - Леоніда Бикова
- Family relationships and surname - Братів Зерових, Родини Рудинських
Data normalization follows Semantic Versioning.