# Geonames

Data dumps can be found here: https://download.geonames.org/export/dump/

In this notebook we will explore:
* `allCountries.zip`: for feature explanations check 'geoname' table in README
* `hierarchy.zip`: parentId, childId, type. The type 'ADM' stands for the admin hierarchy modeled by the admin1-4 codes. The other entries are entered with the user 

In [None]:
data_raw = '../../../data/geonames/raw/'

In [None]:
import pandas as pd

## allCountries.zip

In [None]:
data = pd.read_csv(data_raw + 'allCountries.txt', sep="\t", header = None)

In [None]:
colnames = ['geoname_id', 'name', 'asciiname', 'alternativenames', 'latitude', 'longitude', 'feature_class', 
            'feature_code', 'country_code', 'cc2', 'admin1_code', 'admin2_code', 'admin3_code', 'admin4_code',
            'population', 'elevation', 'dem', 'timezone', 'mod_date']

data.columns = colnames
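One thing to watch when loading this file: with pandas' default NA handling, the string `NA` is read as a missing value, and `NA` is also the country code of Namibia. Below is a minimal sketch of a more defensive load that passes the column names and string dtypes up front. The helper name `load_geonames` and the two sample rows are made up for illustration; only the column layout comes from the cells above.

```python
import csv
import io

import pandas as pd

colnames = ['geoname_id', 'name', 'asciiname', 'alternativenames', 'latitude', 'longitude',
            'feature_class', 'feature_code', 'country_code', 'cc2', 'admin1_code',
            'admin2_code', 'admin3_code', 'admin4_code', 'population', 'elevation',
            'dem', 'timezone', 'mod_date']

# Code-like columns are kept as strings so that values such as '07' keep their leading zero.
code_columns = ['country_code', 'cc2', 'admin1_code', 'admin2_code', 'admin3_code', 'admin4_code']


def load_geonames(source):
    """Read a GeoNames-style TSV (hypothetical helper, not part of any library)."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=colnames,
        dtype={col: str for col in code_columns},
        keep_default_na=False,   # do not treat the literal string 'NA' as missing
        na_values=[""],          # only empty fields count as missing
        quoting=csv.QUOTE_NONE,  # treat quote characters inside names as plain text
    )


# Two made-up rows in the 19-column layout (values are illustrative only).
rows = [
    ["1", "Exampleville", "Exampleville", "", "-22.5", "17.0", "P", "PPL", "NA", "",
     "21", "", "", "", "1000", "", "1650", "Africa/Windhoek", "2020-01-01"],
    ["2", "Sampletown", "Sampletown", "Sample Town,Sampleton", "52.37", "4.89", "P", "PPL", "NL", "",
     "07", "0363", "", "", "5000", "", "13", "Europe/Amsterdam", "2020-01-01"],
]
sample = "\n".join("\t".join(row) for row in rows) + "\n"

demo = load_geonames(io.StringIO(sample))
print(demo[['name', 'country_code', 'admin1_code']])
```

For the real file, the same call would take `data_raw + 'allCountries.txt'` instead of the in-memory sample.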

In [None]:
data.head()

In [None]:
data.shape

From README:

Feature classes:
- A: country, state, region, ...
- H: stream, lake, ...
- L: parks, area, ...
- P: city, village, ...
- R: road, railroad
- S: spot, building, farm
- T: mountain, hill, rock, ...
- U: undersea
- V: forest, heath, ...
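To make the counts per feature class easier to read, the list above can be turned into a lookup table. This is a small sketch on toy data; the dictionary simply copies the README descriptions listed above, and `label_feature_classes` is a hypothetical helper name.

```python
import pandas as pd

# Descriptions copied from the README list above.
FEATURE_CLASS_LABELS = {
    "A": "country, state, region",
    "H": "stream, lake",
    "L": "parks, area",
    "P": "city, village",
    "R": "road, railroad",
    "S": "spot, building, farm",
    "T": "mountain, hill, rock",
    "U": "undersea",
    "V": "forest, heath",
}


def label_feature_classes(classes: pd.Series) -> pd.Series:
    """Count rows per feature class and index the counts by a readable label."""
    counts = classes.value_counts()
    counts.index = [
        f"{code}: {FEATURE_CLASS_LABELS.get(code, 'unknown')}" for code in counts.index
    ]
    return counts


# Toy input; on the real data this would be data['feature_class'].
toy_classes = pd.Series(["P", "S", "S", "H", "S", "P"])
labelled = label_feature_classes(toy_classes)
print(labelled)
```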

In [None]:
data['feature_class'].value_counts()

In [None]:
(
    data
    .loc[data['country_code'] == 'NL']
    .head()
)

For `feature_codes` see: http://www.geonames.org/export/codes.html

Findings so far: GeoNames is very detailed and well structured.
* the geo-information is very well organized
* the level of detail is extreme: it even distinguishes between different types of canals
* it has places that Google Maps doesn't even have, so it might be hard to tell what is worth seeing and what is not

In [None]:
data['feature_code'].value_counts()

Looking at places within Amsterdam:

In [None]:
amsterdam = (
    data
    .loc[(data['latitude'] > 52.37) & (data['latitude'] < 52.38)]
    .loc[(data['longitude'] > 4.89) & (data['longitude'] < 4.90)]
)

amsterdam.shape
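Since we filter on a latitude/longitude box more than once, the selection can be wrapped in a small function. This is a sketch on toy data; `in_bounding_box` is a hypothetical helper, and it assumes the box does not cross the antimeridian.

```python
import pandas as pd


def in_bounding_box(df, lat_min, lat_max, lon_min, lon_max):
    """Return the rows strictly inside the box (no antimeridian handling)."""
    mask = (
        (df["latitude"] > lat_min) & (df["latitude"] < lat_max)
        & (df["longitude"] > lon_min) & (df["longitude"] < lon_max)
    )
    return df.loc[mask]


# Toy rows; on the real data this would be the `data` frame loaded above.
toy_places = pd.DataFrame({
    "name": ["inside", "too far north", "too far east"],
    "latitude": [52.375, 52.40, 52.375],
    "longitude": [4.895, 4.895, 5.10],
})

centre = in_bounding_box(toy_places, 52.37, 52.38, 4.89, 4.90)
print(centre["name"].tolist())
```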

This list contains mostly spots. Those could be filtered out:

In [None]:
(
    amsterdam
    .loc[amsterdam['feature_class'] != "S"]
)

Roads could also be filtered out; a street (feature code `ST`) like 'Damrak' is not what you would consider a place.

Also, 'De Wallen' has feature code `PPLX`, meaning 'section of populated place'. That is probably not what you want to keep either.
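The two kinds of filters mentioned here (whole feature classes, and individual feature codes such as `PPLX`) can be combined in one step. A minimal sketch, assuming we want to drop classes `S` and `R` plus code `PPLX`; the helper name `drop_unwanted` and the toy rows are illustrative only.

```python
import pandas as pd


def drop_unwanted(df, classes=("S", "R"), codes=("PPLX",)):
    """Drop rows whose feature class or feature code is on the exclusion lists."""
    keep = ~df["feature_class"].isin(list(classes)) & ~df["feature_code"].isin(list(codes))
    return df.loc[keep]


# Toy rows for illustration; the class/code pairs mimic the examples in the text.
toy_features = pd.DataFrame({
    "name": ["Damrak", "De Wallen", "Amsterdam", "Example Hotel"],
    "feature_class": ["R", "P", "P", "S"],
    "feature_code": ["ST", "PPLX", "PPLC", "HTL"],
})

kept = drop_unwanted(toy_features)
print(kept["name"].tolist())
```

Which classes and codes to exclude is a judgment call and may differ per region.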

Now let's look at a slightly larger area:

In [None]:
amsterdam = (
    data
    .loc[(data['latitude'] > 52.3) & (data['latitude'] < 52.4)]
    .loc[(data['longitude'] > 4.84) & (data['longitude'] < 4.95)]
    .loc[lambda df: ~df['feature_class'].isin(['S', 'R'])]  # build the mask from the already-filtered frame
)

amsterdam

Lots and lots to filter out. Probably the only thing worth keeping in the end is the place itself. But this might be different for other regions.

> What we need is some indication of what is important to list as a **point of interest** and what is not. 

It might be interesting to attempt to match the locations in Wikivoyage with this data, because if a place is in Wikivoyage, it's worth travelling to (most of the time). Combining it with GeoNames gives you very accurate location data, instead of having to parse this information yourself from the Wikivoyage text.

Let's see what happens if we look up 'Amsterdam'. Can we match it easily?

In [None]:
(
    data
#     .loc[data['name'].str.contains('Amsterdam')]
    .loc[data['name'] == 'Amsterdam']
)

Ouch, that's tough. Assuming you don't know the country code or the geolocation, how do you know which record to pick?

> That means we probably still need some additional data to make the join, i.e. we do need to parse Wikivoyage.
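To illustrate what such a join could look like, here is a rough sketch of one possible heuristic, not a tested matching strategy: if a coordinate hint is available (for example parsed from Wikivoyage), take the nearest record with that name; otherwise fall back to the most populous populated place. The function names and the toy rows (including their populations) are made up for illustration.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (Earth radius taken as 6371 km)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def pick_candidate(df, name, hint=None):
    """Pick one record for `name`; `hint` is an optional (lat, lon) tuple."""
    candidates = df.loc[df["name"] == name]
    if candidates.empty:
        return None
    if hint is not None:
        dist = haversine_km(candidates["latitude"], candidates["longitude"], hint[0], hint[1])
        return candidates.loc[dist.idxmin()]
    # No hint: prefer populated places, then the largest population.
    populated = candidates.loc[candidates["feature_class"] == "P"]
    pool = populated if not populated.empty else candidates
    return pool.sort_values("population", ascending=False).iloc[0]


# Toy candidates; coordinates and populations are rough placeholders.
toy_names = pd.DataFrame({
    "geoname_id": [1, 2, 3],
    "name": ["Amsterdam", "Amsterdam", "Amsterdam"],
    "latitude": [52.37, 42.94, -26.62],
    "longitude": [4.89, -74.19, 30.67],
    "feature_class": ["P", "P", "P"],
    "country_code": ["NL", "US", "ZA"],
    "population": [700000, 18000, 0],
})

by_population = pick_candidate(toy_names, "Amsterdam")
by_hint = pick_candidate(toy_names, "Amsterdam", hint=(43.0, -74.0))
print(by_population["country_code"], by_hint["country_code"])
```

The population fallback will pick the wrong record whenever the intended place is not the largest one with that name, which is why some extra signal from Wikivoyage is still needed.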

## hierarchy.zip

In [None]:
hierarchy = pd.read_csv(data_raw + 'hierarchy.txt', sep="\t", header = None, 
                        names = ['parent_id', 'child_id', 'adm_type'])

In [None]:
hierarchy.head()

Let's look up Amsterdam's parents. Amsterdam has geoname_id `2759794`.

In [None]:
(
    hierarchy
    .loc[hierarchy['child_id'] == 2759794]
)

Interesting: Amsterdam seems to have no parents, i.e. its id does not appear in the child column.

We can find children of Amsterdam though:

In [None]:
(
    hierarchy
    .loc[hierarchy['parent_id'] == 2759794]
    .head()
    .merge(data, left_on='child_id', right_on='geoname_id')
)
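Looking up children one level at a time gets tedious, so a breadth-first walk over the parent/child pairs can collect several levels at once. A minimal sketch on a toy hierarchy with made-up ids; `descendants` is a hypothetical helper and only assumes the `parent_id` and `child_id` columns named above.

```python
from collections import deque

import pandas as pd


def descendants(hierarchy, root_id, max_depth=3):
    """Breadth-first walk returning (child_id, depth) rows below `root_id`."""
    children_of = hierarchy.groupby("parent_id")["child_id"].apply(list).to_dict()
    seen = {root_id}
    found = []
    queue = deque([(root_id, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand below the requested depth
        for child in children_of.get(node, []):
            if child not in seen:  # guards against cycles and duplicate links
                seen.add(child)
                found.append((child, depth + 1))
                queue.append((child, depth + 1))
    return pd.DataFrame(found, columns=["child_id", "depth"])


# Toy hierarchy with made-up ids: 1 -> 2 -> 4 -> 5 and 1 -> 3.
toy_hierarchy = pd.DataFrame({
    "parent_id": [1, 1, 2, 4],
    "child_id": [2, 3, 4, 5],
})

two_levels = descendants(toy_hierarchy, root_id=1, max_depth=2)
print(two_levels)
```

The result can be merged with `data` on `child_id` and `geoname_id`, as in the cell above, to get names for each level.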

Let's see if we can find Amsterdam if we look top down from the province level ('Provincie Noord-Holland' has `geoname_id == 2749879`)?

In [None]:
(
    hierarchy
    .loc[hierarchy['parent_id'] == 2749879]
    .merge(data, left_on='child_id', right_on='geoname_id')
    .loc[lambda x: x['name'].str.contains('Amsterdam')]
)

Hmm, we can find the municipality. Can we dig one level deeper, to the city?

In [None]:
(
    hierarchy
    .loc[hierarchy['parent_id'] == 2759793]
    .merge(data, left_on='child_id', right_on='geoname_id')
#     .loc[lambda x: x['name'].str.contains('Amsterdam')]
)

Odd: none of these corresponds to the Amsterdam found earlier with `geoname_id == 2759794`. Let's compare both records:

In [None]:
(
    data
    .loc[data['geoname_id'].isin([2759793, 2759794])]
)

Difficult: one is a municipality and the other a city. They might also refer to each other through the alternative names?
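Another route to link the two records could be the admin codes. As far as we can tell from the README, the relation between a toponym and its administrative division is not part of `hierarchy.txt` and can instead be built from the admin codes. The sketch below assumes that an `ADM<n>` record carries the same `country_code` and `admin1_code` to `admin<n>_code` as the places inside it. The ids are the two from above, but the names, admin codes and the third row are placeholders that were not checked against the dump, and `admin_parent` is a hypothetical helper.

```python
import pandas as pd


def admin_parent(df, geoname_id):
    """Find the deepest ADM record sharing the toponym's admin codes (or None)."""
    row = df.loc[df["geoname_id"] == geoname_id].iloc[0]
    for level in (4, 3, 2, 1):
        cols = ["country_code"] + [f"admin{i}_code" for i in range(1, level + 1)]
        if row[cols].isna().any():
            continue  # the toponym has no code at this level
        mask = (df["feature_code"] == f"ADM{level}") & (df["geoname_id"] != geoname_id)
        for col in cols:
            mask &= df[col] == row[col]
        match = df.loc[mask]
        if not match.empty:
            return match.iloc[0]
    return None


# Placeholder records: admin codes are illustrative, not taken from the dump.
toy_admin = pd.DataFrame({
    "geoname_id": [2759794, 2759793, 999],
    "name": ["Amsterdam", "Amsterdam (municipality)", "Example Province"],
    "feature_code": ["PPLC", "ADM2", "ADM1"],
    "country_code": ["NL", "NL", "NL"],
    "admin1_code": ["07", "07", "07"],
    "admin2_code": ["0363", "0363", None],
    "admin3_code": [None, None, None],
    "admin4_code": [None, None, None],
})

city_parent = admin_parent(toy_admin, 2759794)
municipality_parent = admin_parent(toy_admin, 2759793)
print(city_parent["name"], "/", municipality_parent["name"])
```

Whether the real records actually share these codes would need to be verified on the full data first.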

Wikivoyage makes this a bit easier, as there is only one record 'Amsterdam' instead of two.

## Alternative names

Alternative names might be needed to map the two Amsterdams onto each other.

However, at this point the whole dataset might be a bit too cumbersome. Let's stop the analysis here.
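A first check could be whether two records share any name. The alternative names column is a comma-separated string, so it can be split into a set and intersected. A small sketch on placeholder rows; the helper names and the alternative name values are made up for illustration.

```python
import pandas as pd


def name_set(row):
    """Collect the main name and all alternative names of a record, case-folded."""
    names = {row["name"]}
    alt = row["alternativenames"]
    if isinstance(alt, str) and alt:  # missing values show up as NaN (a float)
        names.update(part.strip() for part in alt.split(","))
    return {name.casefold() for name in names if name}


def shared_names(df, id_a, id_b):
    """Return the names that two records have in common."""
    row_a = df.loc[df["geoname_id"] == id_a].iloc[0]
    row_b = df.loc[df["geoname_id"] == id_b].iloc[0]
    return name_set(row_a) & name_set(row_b)


# Placeholder rows; the alternative names are illustrative, not from the dump.
toy_alt = pd.DataFrame({
    "geoname_id": [1, 2, 3],
    "name": ["Amsterdam", "Gemeente Amsterdam", "Rotterdam"],
    "alternativenames": ["Amsterdam,Amesterdao,AMS", "Amsterdam,Amsterdam Municipality", None],
})

overlap = shared_names(toy_alt, 1, 2)
print(overlap)
```

A shared name alone is weak evidence, as the many records named 'Amsterdam' showed earlier, so this would have to be combined with location or admin codes.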

Done.