Update the street name and sign data processing include language and pronunciations #4268

gknisely · 2023-08-23T19:57:53Z

Issue

Update the street name and sign data processing to include languages and pronunciations.

Data processing updates

The heart of this work involves processing name:<lg> and destination:<lg> tags, including all variations of each of those tags (e.g., name:left:<lg>, name:right:<lg>, official_name:<lg>, destination:street:lang:<lg>, destination:ref:lang:<lg>, etc.). Within the tag, <lg> stands for the language.
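As a rough illustration of the key handling described above, the sketch below splits a tag key into its base key and language code. It is not the code from this PR: ParsedKey, SplitOnColon, JoinFirst, and ParseLanguageKey are hypothetical names, and the rules (destination keys carry the language after a "lang" component; name-style keys carry it as the last component unless that component is left, right, forward, or backward) are assumptions drawn from the tag examples listed above. It does not validate that the last component is a real language code.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical result type: the key without its language part, plus the language.
struct ParsedKey {
  std::string base;     // e.g. "name:left" or "destination:street"
  std::string language; // e.g. "nl"; empty when the key carries no language
};

// Split "a:b:c" into {"a", "b", "c"}.
std::vector<std::string> SplitOnColon(const std::string& key) {
  std::vector<std::string> parts;
  std::size_t start = 0;
  for (;;) {
    const std::size_t pos = key.find(':', start);
    if (pos == std::string::npos) {
      parts.push_back(key.substr(start));
      return parts;
    }
    parts.push_back(key.substr(start, pos - start));
    start = pos + 1;
  }
}

// Join the first `count` parts back together with ':'.
std::string JoinFirst(const std::vector<std::string>& parts, std::size_t count) {
  std::string out;
  for (std::size_t i = 0; i < count; ++i) {
    if (i > 0) out += ':';
    out += parts[i];
  }
  return out;
}

// Assumed rules, for illustration only:
//  - destination keys: the language follows a "lang" component
//  - name-style keys: the last component is the language unless it is a
//    side/direction qualifier
ParsedKey ParseLanguageKey(const std::string& key) {
  const std::vector<std::string> parts = SplitOnColon(key);
  const std::size_t n = parts.size();
  if (n < 2) return {key, ""};

  if (parts[0] == "destination") {
    if (n >= 3 && parts[n - 2] == "lang") return {JoinFirst(parts, n - 2), parts[n - 1]};
    return {key, ""};
  }

  static const std::set<std::string> kQualifiers = {"left", "right", "forward", "backward"};
  if (kQualifiers.count(parts[n - 1]) > 0) return {key, ""};
  return {JoinFirst(parts, n - 1), parts[n - 1]};
}
```

With these assumptions, ParseLanguageKey("name:left:nl") yields base "name:left" and language "nl", while "name:forward" is returned unchanged with an empty language.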

Language updates

To determine the languages for an LL, we used the default_language that is defined for a country or province/state. The problem is that some areas of the world are multilingual and don't have a single language (e.g., Belgium has Dutch, French, and German, and Switzerland has German, French, Romansh, and Italian), while the value of default_language usually contains only one language. To resolve this, the administrative builder was updated to handle special cases like Brussels by processing relations with boundary=political and political_division=linguistic_community to find areas that are bilingual. However, this logic was still not enough, as other areas did not have these special polygons. Therefore, we added an "override" for the languages. In the future, if we determine that an area is multilingual and we want to support additional language tags there, all we have to do is add them to the supported languages list. Our list currently consists of the following:

Wales = cy
United Kingdom = en
Ireland = ga
Northern Ireland = ga
Japan = ja and en
Canada = en and fr
Belarus = ru and be
Singapore = en, zh, ms, and ta
Saudi Arabia = ar and en
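To make the override idea concrete, here is a minimal sketch of how such a list could be consulted. It is not the implementation in this PR: SupportedLanguageOverrides and LanguagesForArea are hypothetical names, areas are keyed by plain name only for readability, and the sketch assumes that override languages are added after the area's default_language rather than replacing it.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

using LanguageList = std::vector<std::string>;

// Hypothetical override table mirroring the list above.
const std::map<std::string, LanguageList>& SupportedLanguageOverrides() {
  static const std::map<std::string, LanguageList> kOverrides = {
      {"Wales", {"cy"}},
      {"United Kingdom", {"en"}},
      {"Ireland", {"ga"}},
      {"Northern Ireland", {"ga"}},
      {"Japan", {"ja", "en"}},
      {"Canada", {"en", "fr"}},
      {"Belarus", {"ru", "be"}},
      {"Singapore", {"en", "zh", "ms", "ta"}},
      {"Saudi Arabia", {"ar", "en"}},
  };
  return kOverrides;
}

// Assumption: the default language (if any) comes first, then any override
// languages that are not already present.
LanguageList LanguagesForArea(const std::string& area, const std::string& default_language) {
  LanguageList languages;
  if (!default_language.empty()) languages.push_back(default_language);

  const auto& overrides = SupportedLanguageOverrides();
  const auto it = overrides.find(area);
  if (it != overrides.end()) {
    for (const std::string& lang : it->second) {
      if (std::find(languages.begin(), languages.end(), lang) == languages.end())
        languages.push_back(lang);
    }
  }
  return languages;
}
```

Under these assumptions, supporting another multilingual area only requires a new entry in the table; areas without an entry fall back to their default language alone.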

Using the languages for the keys

The processing of the new keys is based on the languages that were determined to be "good" for that country or area. OSM users will typically add languages for names and destinations in all parts of the world, even though that language may not be spoken in that country. For example, 5th Avenue in NYC has name:ru=5-я авеню. Obviously, we do not want to process the Russian name here. Therefore, using the default languages, we can discard the tags with languages that we don't want to support. Moreover, we can create a hierarchy for our languages. For instance, Canada supports both English and French; in Ottawa, English will be first, and in the province of Québec, French will be first.
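The filtering and ordering described above could look roughly like the following. This is a sketch under stated assumptions, not the PR's code: FilterNamesByLanguage is a hypothetical helper, tags are modelled as a plain key/value map, and the allowed languages are assumed to arrive already sorted by priority for the area.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Returns (language, name) pairs for the allowed languages only, in the
// priority order given by `allowed_in_priority_order`. Tags in any other
// language (e.g. name:ru in NYC) are simply never looked up.
std::vector<std::pair<std::string, std::string>>
FilterNamesByLanguage(const std::map<std::string, std::string>& tags,
                      const std::vector<std::string>& allowed_in_priority_order) {
  std::vector<std::pair<std::string, std::string>> result;
  for (const std::string& lang : allowed_in_priority_order) {
    const auto it = tags.find("name:" + lang);
    if (it != tags.end()) result.emplace_back(lang, it->second);
  }
  return result;
}
```

For a way tagged with name:en and name:ru in an area that only allows en, the result contains just the English name. For a way in Québec with both name:en and name:fr, passing {"fr", "en"} puts the French name first.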

Edge Cases

We now support names that differ depending on which side of the street you are driving on. Combined with multiple languages in some areas, this gets very complex. In this example, the official Dutch name differs depending on the municipality: the border between the towns runs down the middle of the road, and the Dutch name on the right side of the street differs from the one on the left side. This leads to the bizarre situation that the street is called Steenweg op Gent on the Molenbeek side and Gentsesteenweg on the Koekelberg side. However, the French name of the street does not change at all.
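A small sketch of the side-dependent lookup may help here. It assumes a side-specific tag such as name:left:<lg> wins over the plain name:<lg> tag; Side and NameForSide are hypothetical names, and the mapping of the two Dutch names to left and right in the usage note is illustrative only, since the text above does not say which side carries which tag.

```cpp
#include <map>
#include <string>

enum class Side { kLeft, kRight };

// Look up the name for one side of the street in one language.
// Assumed precedence: name:<side>:<lg> first, then name:<lg>, else empty.
std::string NameForSide(const std::map<std::string, std::string>& tags,
                        Side side,
                        const std::string& lang) {
  const std::string side_part = (side == Side::kLeft) ? "left" : "right";
  auto it = tags.find("name:" + side_part + ":" + lang);
  if (it != tags.end()) return it->second;

  it = tags.find("name:" + lang);
  if (it != tags.end()) return it->second;
  return std::string();
}
```

With tags name:fr=Chaussée de Gand, name:left:nl=Steenweg op Gent, and name:right:nl=Gentsesteenweg (illustrative assignment), the Dutch lookup returns a different name per side while the French lookup returns Chaussée de Gand for both.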

Data Before/After Examples

Chaussée de Gand - Steenweg op Gent/Gentsesteenweg Example

Notice that all dashes are removed and processed correctly.

Driving left to right, the Dutch name should be Steenweg op Gent, and the French street name (Chaussée de Gand) does not change. Notice that before, we returned the name tag with dashes and did not split the French and Dutch street names.

Before and After: [screenshots]

Driving right to left, the Dutch name should be Gentsesteenweg, and the French street name (Chaussée de Gand) does not change.

Before and After: [screenshots]

name:forward and name:backward are now processed correctly.

Notice in this example we have name:forward and name:backward tags set; however, before we would just process the name tag.

Before - Waltonville Road and Quarry Road returned regardless of direction

After - Quarry Road correctly returned

Before - Waltonville Road returned regardless of direction

After - Waltonville Road correctly returned.

Multilingual names are now processed.

Notice in this example the name tag has both Welsh and English set. Since we are in Wales we allow both of these languages and process them both.

Before - The name and name:en tag are both returned for the street: Stryd y Castell / Castle Street/Castle Street

After - Stryd y Castell and Castle Street both processed correctly and a language of cy is set for Stryd y Castell and en for Castle Street
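One way to picture the multilingual name handling in these examples is the sketch below. It is an assumption-laden illustration, not the PR's logic: SplitName and TagLanguages are hypothetical helpers, the separators " / " and " - " are taken from the two examples above, and the language of each piece is assumed to be found by matching it against the name:<lg> values on the same way.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Split a combined name tag on " / " or " - " (separators assumed from the
// Welsh and Brussels examples).
std::vector<std::string> SplitName(const std::string& name) {
  std::vector<std::string> tokens;
  std::size_t start = 0;
  for (;;) {
    // npos is the largest size_t, so std::min picks whichever separator
    // actually occurs first.
    const std::size_t pos = std::min(name.find(" / ", start), name.find(" - ", start));
    if (pos == std::string::npos) {
      tokens.push_back(name.substr(start));
      return tokens;
    }
    tokens.push_back(name.substr(start, pos - start));
    start = pos + 3; // both separators are three characters long
  }
}

// Pair each piece of the name with the language whose name:<lg> value matches
// it exactly; the language is left empty when nothing matches.
// `names_by_language` maps a language code to the value of name:<lg>.
std::vector<std::pair<std::string, std::string>>
TagLanguages(const std::string& name, const std::map<std::string, std::string>& names_by_language) {
  std::vector<std::pair<std::string, std::string>> out;
  for (const std::string& token : SplitName(name)) {
    std::string lang;
    for (const auto& entry : names_by_language) {
      if (entry.second == token) {
        lang = entry.first;
        break;
      }
    }
    out.emplace_back(token, lang);
  }
  return out;
}
```

For name=Stryd y Castell / Castle Street with name:cy and name:en set, this pairs Stryd y Castell with cy and Castle Street with en, which matches the "After" behaviour described above.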

Contributors @gknisely @dgearhart


kevinkreiser · 2023-09-13T13:19:07Z

src/mjolnir/admin.cc

@@ -59,6 +59,70 @@ uint32_t GetMultiPolyId(const std::multimap<uint32_t, multi_polygon_type>& polys
  return index;


the changes in the database are breaking right, meaning old code can't use new database (or maybe it can because we just added a couple columns?) but new code certainly cant use old databases. i think this is ok because it doesnt mean compatibility is broken for routing but for data building. i just wanted to call it out. maybe worth putting in the pr description

gknisely · 2023-09-13

@kevinkreiser no, you can use an old db, but you will be missing languages. I just tested this with PA with an old db and it did not crash, but of course no languages are returned.

src/mjolnir/graphenhancer.cc

src/thor/triplegbuilder.cc

test/gurka/gurka.cc


kevinkreiser · 2023-10-11T01:13:40Z

i guess mac builds are now broken project wide... i freaking hate CI. i know they have to change and update and stuff but im completely sick of it

kevinkreiser

ok 2 more changes, fix the order of the entries in changelog and undo the formatting of the taginfo.json (you switched from 2 spaces to 3)

nilsnolde · 2023-10-11T05:01:06Z

I remember an email about a deprecated resource class and I kinda thought I/you PR'd that, but maybe I just mentioned it in some chat and forgot about it. Should be an easy fix. I'm more concerned over M1-only from Jan 24 on, but yeah, that pretty much falls in line with

i freaking hate CI


gknisely · 2023-10-11T19:18:40Z


@kevinkreiser Sorry about that....fixed

kevinkreiser

do we really need 4 copies of the same tests?

in gtest its easy to do permutations where you just vary some component of the test for each run. in this case the pronunciation tag. i see you also use it for the enums from baldr and the directories but its really very trivial to make these test function generic and then call them with permutations. makes it much much easier to maintain, especially when we are talking about a test which is 3k lines

mandeepsandhu · 2023-10-13T23:54:54Z


You could use parametrized fixtures in gtest. Something like this to call the tests with different languages.

gknisely · 2023-10-16T13:17:22Z


@mandeepsandhu yep that was my plan. ty


gknisely · 2023-10-16T20:59:30Z


@kevinkreiser done

kevinkreiser

this is waaaaay tooooo much code. there has to be more succinct ways of doing this :)

Greg Knisely and others added 22 commits June 3, 2023 12:22

process default_language

c18ee36

added LanguageTag

5704d23

added LanguageTag

3ddaa7e

added logic to process default langs

36073d7

added default_languages

da13555

added new GetTagTokens

752b5c0

added lang lookups

83dbbbe

added language

f86a577

ported language logic

5a3c382

more language updates

e6a0bb8

not needed.

b1ae5bd

rm jct

04980fb

format

80f3591

added lang check

af3de41

refactored to just have 1 linguistic record.

5f01a7e

added more entries

1b0c747

kLanguage removed. moved kNone to 5 to avoid versioning of the tiles

276eb1c

refactored

e8b86a6

moved pronunciations and langs to maps.

a5ce56e

Merge remote-tracking branch 'origin/master' into gk_add_languages_pronunciations_mb_v2

85637d3

cleanup and added lang and pronunciation tags.

0e35e2c

Wales fix

a2d2873

gknisely requested review from kevinkreiser, dnesbitt61, dgearhart and nilsnolde August 23, 2023 19:57

gknisely mentioned this pull request Aug 23, 2023

Update the street name and sign data processing include language #4152

Closed

gknisely added 3 commits August 23, 2023 16:00

updated

d43e9ea

clang-tidy

ff19e8e

Merge remote-tracking branch 'origin/master' into gk_add_languages_pronunciations_mb_v2

73ec72e

kevinkreiser reviewed Sep 13, 2023

src/mjolnir/graphenhancer.cc (outdated, resolved)

kevinkreiser reviewed Sep 13, 2023

src/thor/triplegbuilder.cc (outdated, resolved)

kevinkreiser reviewed Sep 13, 2023

test/gurka/gurka.cc (outdated, resolved)

gknisely added 9 commits September 13, 2023 10:28

clean up

5a5c389

more clean up

7da9bcf

added bridge

353b1a2

pr clean up

ebf7478

Merge remote-tracking branch 'origin/gk_add_languages_pronunciations_mb_v2' into gk_add_languages_pronunciations_mb_v2

20e6d23

Merge branch 'master' into gk_add_languages_pronunciations_mb_v2

fd685c5

Merge branch 'master' into gk_add_languages_pronunciations_mb_v2

9b8e155

updated for new spell check

98a7322

more spell check

0d2fda4

kevinkreiser requested changes Oct 11, 2023


gknisely added 3 commits October 11, 2023 15:00

Update CHANGELOG.md

09675e7

switched from 3 spaces to 2

bc5725e

Merge branch 'gk_add_languages_pronunciations_mb_v2' of github.com:valhalla/valhalla into gk_add_languages_pronunciations_mb_v2

8228e3e

kevinkreiser requested changes Oct 11, 2023


gknisely added 3 commits October 16, 2023 16:46

Merge branch 'master' into gk_add_languages_pronunciations_mb_v2

cb10854

refactored to test_phonemes_w_langs.cc. Reduced code duplication

8f04fc8

Merge branch 'gk_add_languages_pronunciations_mb_v2' of github.com:valhalla/valhalla into gk_add_languages_pronunciations_mb_v2

9f5ca53

kevinkreiser approved these changes Oct 17, 2023


gknisely merged commit 0f6367a into master Oct 17, 2023
7 of 8 checks passed

gknisely deleted the gk_add_languages_pronunciations_mb_v2 branch October 17, 2023 13:24
