Update the street name and sign data processing include language and pronunciations #4268

Merged
merged 46 commits into from
Oct 17, 2023

Conversation

Member

@gknisely gknisely commented Aug 23, 2023

Issue

Update the street name and sign data processing to include languages and pronunciations

Data processing updates

The heart of this work involves processing name:<lg> and destination:<lg> tags, including all variations of each of those tags (e.g., name:left:<lg>, name:right:<lg>, official_name:<lg>, destination:street:lang:<lg>, destination:ref:lang:<lg>, etc.). Within each tag, <lg> stands for the language code.
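As a rough illustration of how such keys can be taken apart, here is a minimal sketch. The names LangKey and split_lang_key are hypothetical and this is not Valhalla's actual parser. It assumes the language code is always the last colon-separated component and that the caller supplies the set of languages it is willing to accept, because a bare suffix such as "left" or "ref" is a qualifier rather than a language.

```cpp
#include <optional>
#include <set>
#include <string>

// Hypothetical result type: the key without its language suffix, plus the language code.
struct LangKey {
  std::string base; // e.g. "name:left" or "destination:street"
  std::string lang; // e.g. "fr"
};

// Splits a key at its last ':' and accepts it only if the suffix is one of the
// allowed language codes. Sketch only; not Valhalla's actual parser.
std::optional<LangKey> split_lang_key(const std::string& key,
                                      const std::set<std::string>& allowed_langs) {
  const auto pos = key.rfind(':');
  if (pos == std::string::npos || pos == 0) {
    return std::nullopt; // no suffix at all, e.g. "name"
  }
  std::string lang = key.substr(pos + 1);
  if (allowed_langs.count(lang) == 0) {
    return std::nullopt; // suffix is a qualifier ("left") or an unsupported language
  }
  std::string base = key.substr(0, pos);
  // destination:*:lang:<lg> keys carry an extra ":lang" marker before the code.
  static const std::string marker = ":lang";
  if (base.size() > marker.size() &&
      base.compare(base.size() - marker.size(), marker.size(), marker) == 0) {
    base.erase(base.size() - marker.size());
  }
  return LangKey{base, lang};
}
```

With the allowed set {fr, nl}, name:left:fr splits into name:left plus fr, while name:left and name:ru are both rejected.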

Language updates

To determine the languages for a lat/lon (LL), we utilized the default_language that is defined for a country or province/state. The problem is that some areas of the world are multilingual and do not have a single language (e.g., Belgium supports Dutch, French, and German, and Switzerland has German, French, Romansh, and Italian), yet the value of default_language usually contains only one language. To resolve this, the administrative builder was updated to handle special cases like Brussels by processing relations with boundary=political and political_division=linguistic_community to find areas that are bilingual. This logic was still not enough, because other areas do not have these special polygons. Therefore, we added an "override" for the languages: if we determine in the future that an area is multilingual and we want to support additional language tags there, all we have to do is add them to the supported languages list. Our list currently consists of the following:

Wales = cy
United Kingdom = en
Ireland = ga
Northern Ireland = ga
Japan = ja and en
Canada = en and fr
Belarus = ru and be
Singapore = en, zh, ms, and ta
Saudi Arabia = ar and en
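
One way to picture the override list is as a lookup table keyed by area name. The sketch below is illustrative only: kSupportedLanguages and languages_for are hypothetical names, and the rule that overrides are appended after the area's default_language (without duplicates) is an assumption, not something this PR description specifies.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Illustrative table mirroring the list above (hypothetical name and container).
const std::map<std::string, std::vector<std::string>> kSupportedLanguages = {
    {"Wales", {"cy"}},
    {"United Kingdom", {"en"}},
    {"Ireland", {"ga"}},
    {"Northern Ireland", {"ga"}},
    {"Japan", {"ja", "en"}},
    {"Canada", {"en", "fr"}},
    {"Belarus", {"ru", "be"}},
    {"Singapore", {"en", "zh", "ms", "ta"}},
    {"Saudi Arabia", {"ar", "en"}},
};

// Assumption: the default language comes first, then any overrides not already present.
std::vector<std::string> languages_for(const std::string& area,
                                       const std::string& default_language) {
  std::vector<std::string> langs;
  if (!default_language.empty()) {
    langs.push_back(default_language);
  }
  const auto it = kSupportedLanguages.find(area);
  if (it != kSupportedLanguages.end()) {
    for (const auto& lang : it->second) {
      if (std::find(langs.begin(), langs.end(), lang) == langs.end()) {
        langs.push_back(lang);
      }
    }
  }
  return langs;
}
```

Under these assumptions, supporting a new multilingual area is a one-line addition to the table.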

Using the languages for the keys

The processing of the new keys is based on the languages that were determined to be "good" for that country or area. OSM users will typically add languages for names and destinations in all parts of the world even though that language may not be spoken in that country. For example, 5th Avenue in NYC has name:ru=5-я авеню. Obviously, we do not want to process the Russian name here. Therefore, using the default languages we can toss the tags with languages that we don't want to support. Moreover, we can create a hierarchy for our languages. For instance, Canada supports both English and French. However, in Ottawa, English will be first and in the Québec province French will be first.
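
The filtering and ordering described above could be sketched as follows. select_names is a hypothetical helper, and representing the tags as a language-to-name map is an assumption made for brevity. Names are emitted in the order of the area's language list, so the same tags yield French first in Québec and English first in Ottawa, and any language not in the list is dropped.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Keeps only names whose language is supported in the area, in the area's
// preferred language order. Returns (language, name) pairs. Sketch only.
std::vector<std::pair<std::string, std::string>>
select_names(const std::map<std::string, std::string>& names_by_lang,
             const std::vector<std::string>& ordered_langs) {
  std::vector<std::pair<std::string, std::string>> kept;
  for (const auto& lang : ordered_langs) {
    const auto it = names_by_lang.find(lang);
    if (it != names_by_lang.end()) {
      kept.emplace_back(lang, it->second);
    }
  }
  return kept;
}
```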

Edge Cases

We now support names that differ depending on which side of the street you are driving on. Combined with multiple languages in some areas, this gets very complex. In the Chaussée de Gand example, the official Dutch name differs depending on the municipality: the border between the towns runs down the middle of the road, so the Dutch name on the right side of the street differs from the one on the left. This leads to the bizarre situation that the street is called Steenweg op Gent on the Molenbeek side and Gentsesteenweg on the Koekelberg side. The French name of the street, however, does not change at all.
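
A lookup for this case has to combine side and language. The sketch below is a guess at one reasonable fallback order (side-specific key first, then the plain name:<lg> key); Side and name_for_side are hypothetical names and the order is an assumption, not taken from the PR.

```cpp
#include <map>
#include <optional>
#include <string>

enum class Side { Left, Right };

// Looks up a name for one language on one side of the street.
// Assumed fallback order: name:<side>:<lg>, then name:<lg>.
std::optional<std::string> name_for_side(const std::map<std::string, std::string>& tags,
                                         Side side,
                                         const std::string& lang) {
  const std::string side_key = side == Side::Left ? "name:left:" : "name:right:";
  for (const std::string& key : {side_key + lang, "name:" + lang}) {
    const auto it = tags.find(key);
    if (it != tags.end()) {
      return it->second;
    }
  }
  return std::nullopt; // no name in this language
}
```

With this order, a language that is only tagged once (like the French name here) resolves to the same value on both sides, while a language with side-specific tags resolves per side.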

Data Before/After Examples

Chaussée de Gand - Steenweg op Gent/Gentsesteenweg Example

Notice that all dashes are removed and processed correctly.

Driving left to right, the Dutch name should be Steenweg op Gent and the French street name (Chaussée de Gand) does not change. Notice that previously we returned the name tag with dashes and did not split the French and Dutch street names.

Before

Screenshot from 2022-02-21 15-13-21

After

Screenshot from 2022-02-21 15-13-34

Driving right to left, the Dutch name should be Gentsesteenweg and the French street name (Chaussée de Gand) does not change.

Before

Screenshot from 2022-02-21 15-14-36

After

Screenshot from 2022-02-21 15-14-26

name:forward and name:backward are now processed correctly.

Notice that in this example the name:forward and name:backward tags are set; previously, we would only process the name tag.
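
The direction-dependent choice can be sketched in a few lines. name_for_direction is a hypothetical helper, and falling back to the plain name tag when no directional tag exists is an assumption.

```cpp
#include <map>
#include <string>

// Prefers the tag matching the travel direction; otherwise falls back to "name".
// Returns an empty string when neither tag is present. Sketch only.
std::string name_for_direction(const std::map<std::string, std::string>& tags, bool forward) {
  const auto directional = tags.find(forward ? "name:forward" : "name:backward");
  if (directional != tags.end()) {
    return directional->second;
  }
  const auto plain = tags.find("name");
  return plain != tags.end() ? plain->second : std::string();
}
```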

Before - Waltonville Road and Quarry Road returned regardless of direction

Screenshot from 2022-02-21 15-35-34

After - Quarry Road correctly returned

Screenshot from 2022-02-21 15-36-03

Before - Waltonville Road returned regardless of direction

Screenshot from 2022-02-21 15-36-26

After - Waltonville Road correctly returned.

Screenshot from 2022-02-21 15-36-38

Multilingual names are now processed.

Notice in this example the name tag has both Welsh and English set. Since we are in Wales we allow both of these languages and process them both.

Before - The name and name:en tag are both returned for the street: Stryd y Castell / Castle Street/Castle Street

Screenshot from 2022-02-21 15-46-01

After - Stryd y Castell and Castle Street both processed correctly and a language of cy is set for Stryd y Castell and en for Castle Street

Screenshot from 2022-02-21 15-46-19
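
To make the Welsh/English case concrete, the following sketch splits a combined name tag on a delimiter and labels each piece with the language whose name:<lg> value matches it. split_name and label_parts are hypothetical helpers, and matching by exact string equality against the language-specific tags is an assumption about how the pieces could be labeled.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Splits a combined name such as "A / B" on the given delimiter.
std::vector<std::string> split_name(const std::string& name, const std::string& delim) {
  if (delim.empty()) {
    return {name}; // nothing to split on
  }
  std::vector<std::string> parts;
  std::size_t start = 0;
  while (true) {
    const std::size_t pos = name.find(delim, start);
    if (pos == std::string::npos) {
      parts.push_back(name.substr(start));
      break;
    }
    parts.push_back(name.substr(start, pos - start));
    start = pos + delim.size();
  }
  return parts;
}

// Pairs each piece with the language whose name:<lg> value equals it.
// Pieces without a match get an empty language. Returns (name, language) pairs.
std::vector<std::pair<std::string, std::string>>
label_parts(const std::vector<std::string>& parts,
            const std::map<std::string, std::string>& names_by_lang) {
  std::vector<std::pair<std::string, std::string>> labeled;
  for (const auto& part : parts) {
    std::string lang;
    for (const auto& entry : names_by_lang) {
      if (entry.second == part) {
        lang = entry.first;
        break;
      }
    }
    labeled.emplace_back(part, lang);
  }
  return labeled;
}
```

Applied to the example above with " / " as the delimiter, the two pieces would be labeled cy and en, provided matching name:cy and name:en tags exist on the way.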

Contributors @gknisely @dgearhart

@@ -59,6 +59,70 @@ uint32_t GetMultiPolyId(const std::multimap<uint32_t, multi_polygon_type>& polys
return index;
Member

the changes in the database are breaking right, meaning old code can't use new database (or maybe it can because we just added a couple columns?) but new code certainly cant use old databases. i think this is ok because it doesnt mean compatibility is broken for routing but for data building. i just wanted to call it out. maybe worth putting in the pr description

Member Author

@kevinkreiser no you can use an old db but you will be missing languages. I just tested this with PA with an old db and it did not crash, but of course no languages are returned.

test/gurka/gurka.cc (outdated, resolved)
@kevinkreiser
Member

i guess mac builds are now broken project wide... i freaking hate CI. i know they have to change and update and stuff but im completely sick of it

Member

@kevinkreiser kevinkreiser left a comment

ok 2 more changes, fix the order of the entries in changelog and undo the formatting of the taginfo.json (you switched from 2 spaces to 3)

@nilsnolde
Member

I remember an email about a deprecated resource class and I kinda thought I/you PR'd that, but maybe I just mentioned it in some chat and forgot about it. Should be an easy fix. I'm more concerned over M1-only from Jan 24 on, but yeah, that pretty much falls in line with

i freaking hate CI

@gknisely
Member Author

@kevinkreiser Sorry about that....fixed

Member

@kevinkreiser kevinkreiser left a comment

do we really need 4 copies of the same tests?
image

in gtest its easy to do permutations where you just vary some component of the test for each run. in this case the pronunciation tag. i see you also use it for the enums from baldr and the directories but its really very trivial to make these test function generic and then call them with permutations. makes it much much easier to maintain, especially when we are talking about a test which is 3k lines

@mandeepsandhu
Contributor

You could use parametrized fixtures in gtest. Something like this to call the tests with different languages.

@gknisely
Member Author

@mandeepsandhu yep that was my plan. ty

@gknisely
Member Author

@kevinkreiser done

Member

@kevinkreiser kevinkreiser left a comment

this is waaaaay tooooo much code. there has to be more succinct ways of doing this :)

@gknisely gknisely merged commit 0f6367a into master Oct 17, 2023
7 of 8 checks passed
@gknisely gknisely deleted the gk_add_languages_pronunciations_mb_v2 branch October 17, 2023 13:24

4 participants