Bad results of address queries #60
Comments
Let's go through this example. All parsing is done at country level. So, for DE, parsing is done with the first of its datasets and then reused. Hence the parsing for DE is reported only once, I believe.
When you look at the parsing results, BE and LU resolved only one level, which has to be hit as a substring in the address. Hence no hits. DE parsing has two levels, NL three, with the hierarchy as shown (from the smallest level to the more general one).
Currently, all search results are sorted by the number of levels that were found in the database. If 2 levels were found, all results with 1 level found are discarded. In this case, NL had a chance since it essentially looked for "6, Baum" while DE was "6, Vorm Baum". As we have many streets named something like "Dr. John Brown" and people search for "brown", all streets with "Baum" at the beginning of one of the words were a hit.
There are a few other ways results are sorted, to ensure that the city Glasgow will come before the pub with the same name. Only after that is the closeness of the match taken into account. That is probably a reason for putting DE results below NL ones.
As for 66 matching 6: we also have 6-2 and other combinations. So, I am not sure we can have a very simple regex for it.
Right now, the geocoder doesn't check which of the hierarchy levels matched the parsed string. While it wouldn't save us this time, I would have to look for a better matching strategy. This would require a major rewrite of the import, generation of training sets for the libpostal NLP parser, and a rewrite of the search. Hopefully, I can do that in 2019.
In addition, DE should get the country as a part of the hierarchy. Then you could at least specify Germany in the string and get your perfect hit. Right now, the sub-territories do miss that information, unfortunately.
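The ranking rule described in this comment (keep only the results that matched the highest number of hierarchy levels, discard the rest) can be sketched as follows. This is only an illustration in Python, not the geocoder's actual code: the `Hit` class, its `levels_matched` field, and `keep_best_level_hits` are hypothetical names, and the further sorting steps mentioned above are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """Hypothetical search result; not the geocoder's real data structure."""
    country: str
    title: str
    levels_matched: int  # assumed: number of parsed hierarchy levels found


def keep_best_level_hits(hits: List[Hit]) -> List[Hit]:
    """Keep only hits that matched the maximum number of levels.

    If any hit matched 2 levels, every hit with only 1 level is dropped.
    The relative order of the remaining hits is preserved.
    """
    if not hits:
        return []
    best = max(h.levels_matched for h in hits)
    return [h for h in hits if h.levels_matched == best]


example = [
    Hit("NL", "Baum 6", 2),
    Hit("LU", "Vorm Baum 6", 1),
    Hit("DE", "Vorm Baum 6", 2),
]
survivors = keep_best_level_hits(example)
print([h.country for h in survivors])
```

Under this rule a hit that matched "6" and "Baum" counts the same as one that matched "6" and "Vorm Baum", which is consistent with the NL results competing with the DE one.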
BE was only searched but did not yield a result. Sorry for the FUD.
The one I suggested as a starting point ( Reducing the numbers of logically extremely similar, but not intended results (e.g. the house_number search string "6" currently matching all
Thank you for planning to rework the infrastructure in a larger timeframe.
That makes sense to me as a "per country" measure.
What about (in Bourne Shell syntax)
?
I will look into it in due course.
Maybe adding a string with the region can be used to compare that part of the performance again (ensuring that it works correctly everywhere).
As the import has been reworked for the geocoder and, starting with OSM Scout Server 3.0, search ranking will be used internally, I am going to close this issue. Let's review the situation when 3.0 is out and file issues against that search implementation.
Basic description of this issue(s) at TMO (first observed in the context of a "speed comparison" between navigation apps).
Installed maps: BE, LU, NL and many parts of DE
Languages used for address parsing: de, en, fr, lb, nl
OSM Scout Server's log always provides the same (seemingly correct) output while testing address searches (full session.log):
INFO: 15:51:29 Request: /v2/search?search=Vorm+Baum+6
INFO: 15:52:00 Parsed query [DE]: house_number: {6}; road: {vorm baum};
INFO: 15:52:00 Parsed query [DE]: h-0: {vorm baum 6};
INFO: 15:52:33 Parsed query [NL]: house: {vorm}; house_number: {6}; road: {baum};
INFO: 15:52:33 Parsed query [NL]: h-0: {vorm baum 6};
INFO: 15:53:11 Parsed query [LU]: house: {vorm baum 6};
INFO: 15:53:11 Parsed query [LU]: h-0: {vorm baum 6};
INFO: 15:53:14 Parsed query [BE]: house: {vorm baum 6};
INFO: 15:53:14 Parsed query [BE]: h-0: {vorm baum 6};
A
curl 'http://localhost:8553/v2/search?limit=500&search=Vorm+Baum+6' | fgrep '"admin_region":' | cut -s -f 2 -d ':' | cut -s -f 2 -d '"' | tee osmss_search-l500-Vorm+Baum+6.txt | wc -l
results in 41 hits from the maps of NL (40 hits) and a single DE state (the last hit). Hence, with the current limit of 25 search hits (e.g. by using
curl -o osmss_search-Vorm+Baum+6.txt 'http://localhost:8553/v2/search?search=Vorm+Baum+6'
), only hits for a couple of addresses in NL (a few groups of extremely similar ones) are retrieved (the first 25 of the 41). While increasing the limit of search hits may appear to be a "quick solution", I have made a couple of observations which might lead to resolving this properly:
house_number seems to be too "fuzzy search"-style: When one looks for "6", also matching "6A" is good, but not matching e.g. "66" or "666". A suitable RegEx may be
^$search-string_house_number[!0-9].*
for filtering house_numbers from the database. I assume this would generally reduce the number of results to less than 25.
This is actually the "right" one (accidentally?), providing the last and intended hit (number 41), and the one selected in the first line of OSM Scout Server's main window. But this may be unrelated (I have not tried looking at the code) and querying the maps of the other DE states for an address search is just not reported (yet 😉).
This would have to be a "fuzzy search"-style match and I spontaneously have no idea of a RegEx (or at least a proper metric) for that.
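For the house_number observation above, a minimal sketch of the intended filter in Python. It is not taken from OSM Scout Server; `house_number_matches` is a hypothetical helper. Note that `[!0-9]` is shell-glob negation (in regular-expression syntax it would be `[^0-9]`), and that requiring a following character would exclude a plain "6", so the sketch uses a negative lookahead instead.

```python
import re


def house_number_matches(query: str, candidate: str) -> bool:
    """Return True if candidate starts with query and the next character,
    if there is one, is not a digit.

    Hypothetical helper for illustration only.
    """
    # re.escape keeps characters such as "-" or "/" in the query literal;
    # (?![0-9]) rejects "66" for query "6" but accepts "6", "6A", "6-2".
    pattern = re.compile("^" + re.escape(query) + "(?![0-9])", re.IGNORECASE)
    return pattern.match(candidate) is not None


candidates = ["6", "6A", "6-2", "66", "666", "16"]
print([c for c in candidates if house_number_matches("6", c)])
```

Combinations such as "6-2" pass this filter, but a range like "4-6" would not, so this covers only the simplest case raised in the discussion.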
@peterleinchen also reported hits from BE for "Vorm Baum 6", which I do not see. Can anyone confirm this? See.