Skip to content

romikforest/runtime-common-library-dicountries

Repository files navigation

Note

This repository on github is just a task for job candidates. It's not for production and it's not exactly the version we use.

DiCountries introduction

About

Country name normalization for DataIntelligence project
Author: Koptev, Roman (roman.koptev@softwareone.com)
Copyright ©: 2020, SoftwareONE

Quick start

Installation

To use the module from this repository you need to have a working python [1] installation (version >= 3.6.0) with pre-installed pip [2] package installer. You can run the python and pip executables using python and pip commands, or, depending on your setup, you may have to run them like python3 and pip3 or using other aliases. The recommended way to install these scripts is to execute commands like:

% python -m pip install -U pip
% python -m pip install -U dicountries

To access the azure artifact feed you need to set up your pip to use appropriate indices. For example you can set PIP_EXTRA_INDEX_URL environment variable like:

% export PIP_EXTRA_INDEX_URL="https://di_libraries:<access token>@pkgs.dev.azure.com/swodataintelligence/71b5c973-6f2c-42b7-a0a9-8af59f1bf7ee/_packaging/di_libraries_test/pypi/simple/"

or you can add a file pip.ini (Windows) or pip.conf (Mac/Linux) to your virtualenv with the content like this:

[global]
extra-index-url=https://di_libraries:<access token>@pkgs.dev.azure.com/swodataintelligence/71b5c973-6f2c-42b7-a0a9-8af59f1bf7ee/_packaging/di_libraries_test/pypi/simple/

The template <access token> should be substituted with your valid access token that should have Packaging Read access rights. You can create this token in your AzureDevOps account.

Usage

A simple usage example:

from dicountries.whoosh_index import CountryIndex

country_index = CountryIndex()
country_index.refresh()

print(country_index.normalize_country('Russia'))
print(country_index.normalize_country('Korea, Republic of'))
print(country_index.refine_country('Korea, Republic of'))

The expected output will have lines like these in the end:

Russian Federation
Korea, Republic of
Republic of Korea

The method :py:meth:`dicountries.whoosh_index.CountryIndex.normalize_country` returns a normalized country name if possible (looking in country indexes and then using fuzzy search if had not found), otherwise it returns the country name from the incoming parameter.

This function will try to return the country name accordingly to the ISO 3166 [3] standard, but if a substitution for this name is determined in the file post_process_country_mapping.json in the package's data directory that substitution will be returned.

You can force to not use substitutions from the post_process_country_mapping.json using the input parameter postprocess=False:

print(country_index.normalize_country('Russia', postprocess=False))

The method :py:meth:`dicountries.whoosh_index.CountryIndex.refine_country` will return the same value as the :py:meth:`dicountries.whoosh_index.CountryIndex.normalize_country`, but if there is a comma [,] in the returned name it will recombine the name so that the part after the comma will precede the part before the comma. The comma will be deleted.

You can also do this transformation on any string using function :py:func:`dicountries.utils.reorder_name` from the :py:mod:`dicountries.utils` module.

Every time you run this script it will create a subdirectory indexes in the current working directory to backup indexes there. You can pass the index directory explicitly to the :py:class:`dicountries.whoosh_index.CountryIndex` constructor like this:

country_index = CountryIndex(index_path="<Your index directory>")

If you don't want the index being rebuilt every time the script is running just omit the line:

country_index.refresh()

Without this line the index will be rebuilt only if it doesn't exist, otherwise it will be read from the index directory (it's faster).

If you want the index to be updated as a background process or you want to have :py:mod:`asyncio` integration you can pass the parameter use_async=True to the :py:class:`dicountries.whoosh_index.CountryIndex` constructor. Also there is an async function for index refreshing:

await country_index.refresh_async()

The search process is normally optimized and uses a cache. You can control the size of the cache using the max_search_cache parameter, e.g.:

country_index = CountryIndex(max_search_cache=1000)

During the normalization the search process usually checks the cache first. If some country isn't found in the cache more complicated techniques will be used. Every found country is placed to the simple cache, but if the cache reaches max_search_cache size it will be cleared and the search process will be reinitialized.

[1]https://www.python.org/
[2]https://pypi.org/project/pip/
[3]https://en.wikipedia.org/wiki/ISO_3166

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages