Add population from GeoNames using concordance cross-walk #212

Closed
missinglink opened this issue Feb 19, 2016 · 15 comments

@missinglink
Contributor

some geonames imports/concordances are missing the associated population metadata.

eg:

for the record above I'd expect the WOF record to have a qs:pop of 7556900

I'm not sure about the scale of the problem; I found this when running our acceptance test suite, and it affects searches for brooklyn, london, portland, paris etc.

we are using the gn population data in pelias to score more populous places higher in the results (eg London UK vs. London ON) so it's important that we can get this fixed ASAP.

thanks!

[edit] maybe I'm getting confused between qs:pop and gn:pop, is there such a thing as gn:pop?

@thisisaaronland
Contributor

We have imported almost no data directly from Geonames. Quattroshapes is missing that information. If there needs to be a ticket to import GN stuff, would you please create a different ticket?

@nvkelso
Contributor

nvkelso commented Feb 19, 2016

(Quattroshapes only includes population values when the source provided them; for all others it provides a population of 0. This is the case with London, which has a qs_pop value of 0.)

@missinglink
Contributor Author

hi @nvkelso, I have an external contributor who is interested in having missing population data imported from geonames where concordances exist.

is this planned as part of the work which you are doing this quarter around alt-names?

also, is it possible to reopen this ticket? I'm confused about why it was closed and what a 'different ticket' would look like; this one seems to describe the issue sufficiently.

@nvkelso nvkelso changed the title from "geonames data missing population metadata" to "Add population from GeoNames using concordance cross-walk" Jul 31, 2017
@nvkelso nvkelso reopened this Jul 31, 2017
@nvkelso
Contributor

nvkelso commented Jul 31, 2017

@missinglink Contributions towards GeoNames.org population imports are welcome :)

I've reopened this issue and retitled it. We are still planning on doing population work this quarter for WOF, but are focused more on alternate names at the moment.

Since GeoNames.org changes, and since WOF adds more features / more concordances on a regular basis, the contribution should be twofold:

  • PR proposing new population values sourced from GeoNames
  • Script that allows the process to be run again, we'd add that to the public WOF script toolbox repo
    • We'll need to handle the case where there's already a population value, either from GN or another source (just for Canada where our sample data was added, but in the future too).

For the PR, https://github.com/whosonfirst-data/whosonfirst-data/pull/754/files is a good reference:

  • "src:population":"gn",
  • "wof:population":5592,
  • "wof:population_rank":5,

The linked script is private, and the only part generally applicable to this issue is the "wof:population_rank" calculation:

        if existing_wof_population >= 10000000:
            props['wof:population_rank'] = 14
        elif existing_wof_population >= 5000000:
            props['wof:population_rank'] = 13
        elif existing_wof_population >= 1000000:
            props['wof:population_rank'] = 12
        elif existing_wof_population >= 500000:
            props['wof:population_rank'] = 11
        elif existing_wof_population >= 200000:
            props['wof:population_rank'] = 10
        elif existing_wof_population >= 100000:
            props['wof:population_rank'] = 9
        elif existing_wof_population >= 50000:
            props['wof:population_rank'] = 8
        elif existing_wof_population >= 20000:
            props['wof:population_rank'] = 7
        elif existing_wof_population >= 10000:
            props['wof:population_rank'] = 6
        elif existing_wof_population >= 5000:
            props['wof:population_rank'] = 5
        elif existing_wof_population >= 2000:
            props['wof:population_rank'] = 4
        elif existing_wof_population >= 1000:
            props['wof:population_rank'] = 3
        elif existing_wof_population >= 200:
            props['wof:population_rank'] = 2
        elif existing_wof_population > 0:
            props['wof:population_rank'] = 1
        else:
            props['wof:population_rank'] = 0
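
For experimenting outside the private script, the ladder above can be written as a standalone function. This is a minimal sketch that assumes the same thresholds; the name `population_rank` and the table-driven layout are illustrative and are not taken from the private script.

```python
from bisect import bisect_right

# Lower bounds for ranks 2..14, copied from the if/elif ladder above.
# Rank 1 covers any positive population below 200; rank 0 means no population.
_THRESHOLDS = [200, 1000, 2000, 5000, 10000, 20000, 50000, 100000,
               200000, 500000, 1000000, 5000000, 10000000]


def population_rank(population):
    """Return a wof:population_rank value (0-14) for a population count.

    Hypothetical helper: the name and signature are assumptions.
    """
    if population <= 0:
        return 0
    # bisect_right counts how many thresholds are <= population.
    return 1 + bisect_right(_THRESHOLDS, population)
```

With these thresholds, `population_rank(5592)` evaluates to 5, which agrees with the property values quoted from the reference PR.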

@stepps00 is there anything else we should think about?

@stepps00
Contributor

I'll have a think about this.. but off the top of my head, we also need to include a src:population property to identify where we obtained the population numbers from.

@missinglink
Contributor Author

missinglink commented Aug 2, 2017

thanks for re-opening!

re: the first bullet point, the simplest way to source the populations from geonames is:

awk -F"\t" '{ print $1 " " $15 }' allCountries.txt | grep -v ' 0$' > populations.txt

this will create a 6MB file with two columns, the geonames id on the left and the population on the right.

$ head -n5 populations.txt 
3039154 1052
3039162 9448
3039163 8022
3039604 2363
3039676 3467

that file is small enough to be loaded into memory, so each record would need only a simple O(1) hashmap lookup when the imports are run; as you mentioned, it would also require one or two IF statements to handle special conditions.
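
The in-memory lookup described above could look roughly like this in Python. It is a sketch only: the function name `load_populations` is hypothetical, and it assumes the two-column, space-separated layout of populations.txt shown in the sample output.

```python
def load_populations(path):
    """Read 'geonames_id population' lines into a dict keyed by integer id.

    Hypothetical helper; assumes the populations.txt layout shown above.
    """
    populations = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            try:
                populations[int(parts[0])] = int(parts[1])
            except ValueError:
                continue  # skip lines whose columns are not integers
    return populations
```

A per-record lookup is then `populations.get(gn_id)`, which returns `None` when GeoNames has no non-zero population for that id.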

I could write the code if it's helpful, @stepps00 could you point me to the right place? I would like to do it in an existing script which is already run regularly.

I was hoping to simply add a gn:pop or gn:population field (as per existing records) and have other scripts take care of all the wof:* fields, if that's possible, as it would save time and code duplication?

@stepps00
Contributor

stepps00 commented Aug 2, 2017

hey @missinglink - that's great.. the script I've previously used for population imports for #240 can be found here: https://github.com/mapzen/whosonfirst-toolbox/blob/master/scripts/issue-240-wof_population.py

That code used existing properties to build new population properties, but could easily be adjusted to import data from a secondary GN file.

@stepps00
Contributor

stepps00 commented Aug 2, 2017

Also, see: https://github.com/mapzen/whosonfirst-toolbox/blob/2bcb1cc1640597911504a0e53e1491a7a1e0736e/scripts/geonames_import_script_local.py.

This script does the same GN concordance mapping that you suggested (starting at line 75).

@nvkelso
Contributor

nvkelso commented Aug 8, 2017

Scheduled for week of September 4.

@stepps00 stepps00 self-assigned this Sep 7, 2017
@nvkelso
Contributor

nvkelso commented Sep 7, 2017

Related to: #240.

@nvkelso
Contributor

nvkelso commented Sep 7, 2017

We should always add a gn:population value to features, but when there's already a wof:population value (like from Statoids), let's not overwrite that existing value. New wof:population values from GeoNames are still desired where none exists.
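
One possible reading of that rule, sketched in Python. The function name `apply_geonames_population` is hypothetical, and treating a missing or zero wof:population as "no existing value" is an assumption rather than something stated in this thread.

```python
def apply_geonames_population(props, gn_population):
    """Update a WOF properties dict in place with a GeoNames population.

    Hypothetical helper; a zero or missing wof:population is treated
    as absent, which is an assumption.
    """
    # Always record the GeoNames figure under its own key.
    props["gn:population"] = gn_population

    # Leave an existing wof:population (for example from Statoids) untouched.
    if props.get("wof:population"):
        return props

    # Otherwise GeoNames becomes the source of wof:population.
    props["wof:population"] = gn_population
    props["src:population"] = "gn"
    return props
```

The wof:population_rank property would still need to be derived from the final wof:population value in a separate step.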

@stepps00
Contributor

stepps00 commented Sep 8, 2017

Output file derived from allCountries.txt, containing the GeoNames id and population count, here.

@stepps00
Contributor

stepps00 commented Sep 8, 2017

Per IRL discussion with @nvkelso:

  • Text file with gn:id (harvested from wof:concordances), wof:id, and population number from geonames
  • Script used to run population updates for records in that list
  • Test branch

Since we have both Statoids and QS branches outstanding, we should wait to import anything in the test branch (we may be able to harvest more gn concordances for more records once all PRs are merged).
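
The first deliverable in the list above (gn:id, wof:id, population) could be assembled per record along these lines. This is a hedged sketch: `crosswalk_row` is a hypothetical name, `feature` is assumed to be a WOF GeoJSON record already parsed into a dict, and `populations` is assumed to be a dict mapping GeoNames id to population count.

```python
def crosswalk_row(feature, populations):
    """Return (gn_id, wof_id, population) for a WOF feature, or None.

    Hypothetical helper; both arguments are assumed to be plain dicts.
    """
    props = feature.get("properties", {})
    gn_id = props.get("wof:concordances", {}).get("gn:id")
    if gn_id is None:
        return None  # no GeoNames concordance for this record
    population = populations.get(int(gn_id))
    if population is None:
        return None  # no population known for this GeoNames id
    return (int(gn_id), props["wof:id"], population)
```

Rows that come back as `None` can simply be skipped when writing the text file.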

@stepps00
Contributor

stepps00 commented Nov 9, 2017

The large QS PR (#824) was merged, which included a run to import GeoNames population data.

Commits: 1460568 and 028f288

This work should be complete, closing issue.

@stepps00 stepps00 closed this as completed Nov 9, 2017