Data normalization

Normalization of data from several open sources.

Installation

System Requirements

PHP >= 7.4 is required; the latest stable version of PHP is recommended.

Composer

$ composer require Vgip/Datanorm

Functionality list

Transliteration from Ukrainian into English (Cabinet of Ministers of Ukraine resolution No. 55 of 2010-01-27)

use Vgip\Datanorm\Transliteration\UkrEng\Cabmin2010;

$word = 'Єзгїґіпенєп';
$cabmin2010 = new Cabmin2010();
$wordTransliterated = $cabmin2010->transliterate($word);
echo $word.' -> '.$wordTransliterated;
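
For readers who want to see what the resolution No. 55 rules look like in code, here is a minimal standalone sketch. It is not the implementation of Cabmin2010: the function name transliterateSketch is hypothetical, the handling of upper case and of apostrophe variants is an assumption, and the mbstring extension is assumed to be available. It shows the position-dependent letters (є, ї, й, ю, я differ at the start of a word) and the special case "зг" -> "zgh".

```php
<?php
/**
 * Independent sketch of the resolution No. 55 table; NOT the library's code.
 */
function transliterateSketch(string $text): string
{
    // Default mapping (used for positions other than the start of a word).
    $base = [
        'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'h', 'ґ' => 'g', 'д' => 'd',
        'е' => 'e', 'є' => 'ie', 'ж' => 'zh', 'з' => 'z', 'и' => 'y', 'і' => 'i',
        'ї' => 'i', 'й' => 'i', 'к' => 'k', 'л' => 'l', 'м' => 'm', 'н' => 'n',
        'о' => 'o', 'п' => 'p', 'р' => 'r', 'с' => 's', 'т' => 't', 'у' => 'u',
        'ф' => 'f', 'х' => 'kh', 'ц' => 'ts', 'ч' => 'ch', 'ш' => 'sh',
        'щ' => 'shch', 'ю' => 'iu', 'я' => 'ia',
        // The soft sign and the apostrophe are not reproduced in Latin.
        // The set of apostrophe variants here is an assumption.
        'ь' => '', "\u{02BC}" => '', "'" => '', "\u{2019}" => '',
    ];
    // Letters that are rendered differently at the start of a word.
    $wordStart = ['є' => 'ye', 'ї' => 'yi', 'й' => 'y', 'ю' => 'yu', 'я' => 'ya'];

    $result = '';
    $previous = null; // previous mapped character (lower case), null at a word start
    foreach (mb_str_split($text) as $char) {
        $lower = mb_strtolower($char);
        if (!isset($base[$lower])) {
            // Spaces, digits, punctuation: copy as-is and start a new word.
            $result .= $char;
            $previous = null;
            continue;
        }
        if ($previous === null && isset($wordStart[$lower])) {
            $latin = $wordStart[$lower];
        } elseif ($lower === 'г' && $previous === 'з') {
            $latin = 'gh'; // "зг" becomes "zgh", not "zh"
        } else {
            $latin = $base[$lower];
        }
        // Assumption: an upper-case source letter capitalizes the first Latin letter.
        $result .= ($char !== $lower) ? ucfirst($latin) : $latin;
        $previous = $lower;
    }

    return $result;
}

echo transliterateSketch('Київ'), PHP_EOL; // prints "Kyiv"
```

The sketch keeps the whole table in two arrays so that the word-start exceptions stay visible next to the default mapping.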

Kyiv street register parser (data from kga.gov.ua)

Vgip\Datanorm\Parcer\Address\Ukr\Kyiv\StreetNameKga

Returns an array of normalized data from a CSV file.

The street name data is checked and normalized as follows:

  • Converts all apostrophe variants to a single character (ʼ - U+02BC).
  • Checks each id for forbidden characters and duplicates. Errors are recorded in $this->warning.
  • Checks the street type against the whitelist. Unknown types are saved to $this->warning and $this->typeNotFound.
  • Checks the Kyiv district name against the whitelist. Unknown district names are saved to $this->warning and $this->districtNotFound.
  • Checks the street names and normalizes them (if normalization data has been set in the $this->streetNormalization array).
  • Generates the $this->nameDouble array - street names that occur two or more times.
  • Generates $this->nameList - all unique street names.
  • Generates $this->typeCounter - a count for each street type in Kyiv.
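
The id and whitelist checks in the list above could look roughly like the sketch below. It is an illustration under assumptions, not the code of StreetNameKga: the function checkRows, the row shape, the digits-only id rule, and the whitelist keys 'street' and 'lane' are all hypothetical. The point is the pattern of collecting warnings instead of stopping on the first problem.

```php
<?php
/**
 * Hypothetical sketch: collects warnings, unknown types and a per-type counter.
 *
 * @param array $rows          list of ['id' => string, 'type' => string]
 * @param array $typeWhitelist street type name => street type key
 */
function checkRows(array $rows, array $typeWhitelist): array
{
    $warning = [];
    $typeNotFound = [];
    $typeCounter = [];
    $seenIds = [];

    foreach ($rows as $line => $row) {
        // Assumption: a valid id consists of digits only.
        if (!ctype_digit($row['id'])) {
            $warning[] = "Line $line: id contains forbidden symbols";
        } elseif (isset($seenIds[$row['id']])) {
            $warning[] = "Line $line: duplicate id {$row['id']}";
        }
        $seenIds[$row['id']] = true;

        // A type missing from the whitelist is reported, not counted.
        if (!isset($typeWhitelist[$row['type']])) {
            $warning[] = "Line $line: unknown street type {$row['type']}";
            $typeNotFound[$row['type']] = true;
            continue;
        }
        $key = $typeWhitelist[$row['type']];
        $typeCounter[$key] = ($typeCounter[$key] ?? 0) + 1;
    }

    return [
        'warning' => $warning,
        'typeNotFound' => array_keys($typeNotFound),
        'typeCounter' => $typeCounter,
    ];
}

// Placeholder rows, not real register data.
$result = checkRows(
    [
        ['id' => '10', 'type' => 'вулиця'],
        ['id' => '10', 'type' => 'провулок'],
        ['id' => '1x', 'type' => 'стежка'],
    ],
    ['вулиця' => 'street', 'провулок' => 'lane']
);
print_r($result);
```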

Fields of the result array returned by the getCsvAsArray() method:

  • ['number'] - (int) serial number from file
  • ['id'] - (int) identifier from file
  • ['name_original'] - (string) street name from file
  • ['name'] - (string) normalized street name
  • ['type_name'] - (string) street type name from file
  • ['type_key'] - (string) street type key
  • ['district_string'] - (string) street districts from file
  • ['district_list'] - (array) street districts ['district_key', 'district_key', ...]
  • ['document_name'] - (string) Document on assigning the name of the object
  • ['document_date'] - (string) Date of the document on assigning the name of the object
  • ['document_number'] - (string) Number of the document on assigning the name of the object
  • ['document_title'] - (string) The title of the document on the naming of the object
  • ['place_description'] - (string) Location of the object in the city
  • ['name_old'] - (string) Former name of the object
  • ['type_old'] - (string) Former category (type) of the object
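
As an illustration of working with these fields, the sketch below groups normalized street names by district key. It assumes that getCsvAsArray() returns a list of rows, each carrying the keys listed above; the helper groupNamesByDistrict and the sample values are hypothetical placeholders, not real register data.

```php
<?php
/**
 * Hypothetical helper: district key => list of normalized street names.
 * Uses only the 'name' and 'district_list' fields of each row.
 */
function groupNamesByDistrict(array $data): array
{
    $byDistrict = [];
    foreach ($data as $row) {
        // A street may belong to several districts.
        foreach ($row['district_list'] as $districtKey) {
            $byDistrict[$districtKey][] = $row['name'];
        }
    }
    ksort($byDistrict); // stable, alphabetical order of district keys

    return $byDistrict;
}

// Placeholder rows with made-up names and keys.
$data = [
    ['name' => 'Street A', 'district_list' => ['district_1', 'district_2']],
    ['name' => 'Street B', 'district_list' => ['district_2']],
];
$grouped = groupNamesByDistrict($data);
print_r($grouped);
```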

Example

use Vgip\Datanorm\Parcer\Address\Ukr\Kyiv\StreetNameKga;
use Vgip\Datanorm\Directory\Address\Country\Ukr\Address as DirUkrAddress;
use Vgip\Datanorm\Directory\Address\Country\Ukr\City\Kyiv as DirKyiv;
use Vgip\Datanorm\Directory\Lang\Ukr\Pattern as PatternUkrAddress;
use Vgip\Datanorm\Directory\Address\Country\Ukr\StreetNormalizedList;
use Vgip\Datanorm\Directory\Address\Country\Ukr\StreetNormalization;

$dirUkrAddress = DirUkrAddress::getInstance();
$dirKyiv = DirKyiv::getInstance();
$patternUkrAddress = PatternUkrAddress::getInstance();
$streetNormalizedListObj = StreetNormalizedList::getInstance();
$streetNormalizedList = $streetNormalizedListObj->getNormalization();

/** Get configuration and whitelist data */
$pathSourceFile = join(DIRECTORY_SEPARATOR, ['file', 'Reestr_vulits_Kyiva_2020_10_25.csv']);
$streetTypeList = $dirUkrAddress->getStreetTypeWhitelist();
$districtWhitelist = $dirKyiv->getDistrictWhitelist();
$patternStreetName = $patternUkrAddress->getStreetName();

/** Object initialization */
$streetNameKga = new StreetNameKga();

/** Set parameter */
$streetNameKga->setTypeWhitelist($streetTypeList);
$streetNameKga->setDistrictWhitelist($districtWhitelist);
$streetNameKga->setStreetNormalization($streetNormalizedList);
$streetNameKga->setPatternStreetName($patternStreetName);

/** Get a result (array) with normalized data */
$data = $streetNameKga->getCsvAsArray($pathSourceFile);

/** Get other data */
$res = [];
$res['type_list'] = $streetNameKga->getTypeList();
$res['type_counter'] = $streetNameKga->getTypeCounter();
$res['name_list'] = $streetNameKga->getNameList();
$res['name_double'] = $streetNameKga->getNameDouble();
$res['district_not_whitelist'] = $streetNameKga->getDistrictNotFound();

/** Get warnings if present */
$warning = $streetNameKga->getWarning();
$warningValue = $streetNameKga->getWarningValue();
if (null !== $warning && count($warning) > 0) {
    print_r($warning);
}
print_r($data);
print_r($res);

Ukrainian language

Apostrophe

The resulting data uses the Unicode character U+02BC (ʼ) as the Ukrainian apostrophe. All similar characters in the source data (' - U+0027, ’ - U+2019, etc.) are replaced with ʼ (U+02BC). U+02BC is the apostrophe character used in Ukrainian domain names (ICANN).
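
A minimal sketch of this replacement is shown below. The function normalizeApostrophe is hypothetical and not part of the library's API, and the list of look-alike characters beyond U+0027 and U+2019 is an assumption, since the text above only says "etc".

```php
<?php
/**
 * Hypothetical helper: replaces apostrophe look-alikes with U+02BC
 * (MODIFIER LETTER APOSTROPHE).
 */
function normalizeApostrophe(string $text): string
{
    $lookAlikes = [
        "\u{0027}", // ' APOSTROPHE
        "\u{2019}", // ’ RIGHT SINGLE QUOTATION MARK
        "\u{2018}", // ‘ LEFT SINGLE QUOTATION MARK (assumed)
        "\u{0060}", // ` GRAVE ACCENT (assumed)
    ];

    // str_replace() works on bytes; this is safe for UTF-8 because each
    // search string is a complete character sequence.
    return str_replace($lookAlikes, "\u{02BC}", $text);
}

echo normalizeApostrophe("Об'їзна"), PHP_EOL;
```

Text that already uses U+02BC passes through unchanged, so the function can be applied more than once without side effects.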

Street name normalization

  • Title or rank and surname - Академіка Єфремова, Генерала Авдєєнка, Маршала Бірюзова
  • Name and surname - Леоніда Бикова
  • Family relationships and surname - Братів Зерових, Родини Рудинських

Versioning

Data normalization follows Semantic Versioning.
