Edit setup. Add a "country" parameter to load only the data (from the data directory) for that specific country. Don't index the US and GB postcodes if they don't belong to the specified country. To achieve this I needed to split some big files from data/.
The result is a faster setup of a country-specific Nominatim instance and fewer tables.
Add new argument on setup to select country, speedup setup
Filter static data on setup by country
The big commit, split data from sql file for setup, allow filtering
The 'Add new argument on setup to select country, speedup setup' commit I can vaguely see. The others seem like a lot of change. Is there really any significant result in terms of speed or size?
Not loading data that is useless for a single country (other than US or GB) speeds up the process by 20 min on my non-server hardware. Probably less on a real server. Loading a small country can then take only a few minutes.
I agree that processing of postcodes takes a rather long time during indexing. Taking them out would significantly reduce processing time for small extracts.
There is one good argument for leaving in the complete country data even when processing extracts: there is a good chance that the extract contains data from outside your country close to the borders. This data will be messed up if the country boundaries of the neighbouring countries are missing. It gets worse if you plan to run updates that pull in data from all over the planet.
It would be interesting to see what the actual speedup is with the smaller country data table. @frodrigo could you test how much you gain from the country data only, leaving the postcodes out in both cases?
Loading a 400 kB bz2 country extract on my setup:
Optimized run with my patch: 6 min
Optimized run that only skips the US and GB postcodes (load and indexing): 12 min
Full run: 20 min
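Read as a breakdown, the three timings above split the total gain into a postcode part and a country-data part. A minimal sketch of that arithmetic, using only the reported numbers (the variable names are illustrative, not from the patch):

```python
# Timings in minutes, as reported above for the 400 kB bz2 extract.
full_run = 20                 # everything loaded and indexed
without_us_gb_postcodes = 12  # only the US and GB postcodes skipped
country_filtered = 6          # full patch: postcodes skipped and country data filtered

# Gain attributable to skipping the US and GB postcodes.
postcode_gain = full_run - without_us_gb_postcodes

# Additional gain from filtering the remaining country data.
country_data_gain = without_us_gb_postcodes - country_filtered

total_gain = full_run - country_filtered

print(f"postcodes: {postcode_gain} min, country data: {country_data_gain} min, "
      f"total: {total_gain} min")
```

On this hardware, skipping the postcodes accounts for a bit more than half of the saving.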
Interesting numbers. Testing on my netbook, I get a 4 min gain from the country tables and a 20 min gain from the postcodes for a Liechtenstein import with 2 min of actual import time. IrlJidel had similar numbers for Ireland: 6 min and 30 min gain with 1 hour of net import time.
It looks like most of the time for the country tables is actually spent creating partition tables that are never used. I think that the better solution for this problem is an option to build an unpartitioned version of the database. I just tried that by setting the partition id for all countries to 0 in country_name.sql. The import is only 1 min slower than frodrigo's optimised version. That is acceptable for the advantage of having complete country data.
So we could do the following: @frodrigo modifies his patch to take into account only postcode and US state data (personally, I'd vote to negate the switch, i.e. have something like --disable-external-country-data us|gb|all) and we pull that in. And I rework the above hack into something proper that allows building a version that has only partition 0.
@twain47 what do you think?
That sounds like a better plan. I think a complete country table is probably well worth that trade-off.
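For illustration only, the negated switch proposed above could behave roughly like the following sketch. It is written in Python rather than in the setup script's own language. Only the flag name and its values (us, gb, all) come from the proposal; the helper names `build_parser` and `external_data_to_load` are hypothetical and do not exist in Nominatim.

```python
import argparse

# Hypothetical: the external data sets discussed in this thread.
EXTERNAL_DATA_SETS = {"us", "gb"}


def build_parser():
    """Build a parser with the proposed opt-out switch (sketch only)."""
    parser = argparse.ArgumentParser(prog="setup-sketch")
    parser.add_argument(
        "--disable-external-country-data",
        choices=["us", "gb", "all"],
        default=None,
        help="skip loading and indexing of the given external data",
    )
    return parser


def external_data_to_load(disabled):
    """Return the external data sets that should still be loaded."""
    if disabled is None:
        # Default: nothing disabled, load everything as before.
        return set(EXTERNAL_DATA_SETS)
    if disabled == "all":
        return set()
    return EXTERNAL_DATA_SETS - {disabled}


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(sorted(external_data_to_load(args.disable_external_country_data)))
```

Making the switch an opt-out keeps the default import complete, so the country boundaries needed for data near borders and for planet-wide updates are always present unless someone explicitly asks otherwise.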