Skip to content

scottlaird/clickhouse-geocoding

Repository files navigation

Clickhouse Geocoding

This is a collection of SQL for implementing geocoding in Clickhouse. It's effectively a version-controlled (and probably extended) copy of the code from my blog post Geocoding IP Addresses with Clickhouse. This is based on a blog post from Clickhouse and Guillaume Matheron's work extending this to IPv6.

Adding to Clickhouse

To set this up in the first place, apply the SQL from step-1-basic-geocoding.sql to Clickhouse. It will create 3 tables, one materialized view, and one dictionary.

If you need to update this for any reason, then you'll probably need to manually drop the dictionary and the materialized view, as Clickhouse doesn't currently have CREATE OR REPLACE for either.

On my machine, it takes up to 30 seconds to create the ip_trie dictionary on first use, but after that updates shouldn't add substantial latency. The underlying geocoding data gets updates monthly, and this will automatically update its view of the data roughly once per week.

Authentication

Note that creating dictionaries in Clickhouse frequently requires you to provide a Clickhouse username and password (or other auth method) as part of the CREATE DICTIONARY command. This is true even if you're accessing data in the same Clickhouse instance. If you miss this, then attempts to use the dict will get authentication errors. If this happens to you, then DROP DICTIONARY ip_trie and re-add it, changing SOURCE(clickhouse(table ‘geoip’)) to SOURCE(clickhouse(table 'geoip' user '...' password '...' db '...')).

Adding Autonomous System data

If you add step-2-adding-isp-information.sql, then you'll also have data on which organization owns which IP addresses. This will be whichever organization announces the IP block via BGP, so sometimes it'll be the end company and sometimes it'll be their ISP.

Usage

See Geocoding IP Addresses with Clickhouse for more detail, but here's the short version

SELECT
  ip,
  dictGet('ip_trie', 'country_code', toIPv6(ip)) AS country
FROM ...

You can fetch multiple fields from the ip_trie dictionary in a single lookup; see the blog post for syntax.

Looking up Autonomous System data

You can use dictGet('ip_asn_trie', 'as_organization', toIPv6(isp)) as as_organization to fetch the name of the organization that owns the BGP announcement for any given IP.

Attribution

The data that this uses is freely available, but is licensed by DB-IP and requires attribution for use. See https://github.com/sapics/ip-location-db/tree/main/dbip-city for details.

About

Schema and code for geocoding in clickhouse

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published