# Creating Spatial Data

A common operation in spatial analysis is to take non-spatial data, such as a CSV file, and create a spatial dataset from it using the coordinate information the file contains. GeoPandas provides a convenient way to read a delimited-text file, create geometry, and write the result as a spatial dataset.

The source data comes from [GeoNames](http://www.geonames.org/), a free and open database of geographic names of the world. The database is large: a single country file can contain millions of records. The data is distributed as country-level text files in a tab-delimited format.

We will use [Dask DataFrames](https://docs.dask.org/en/stable/), which split the data into partitions and process them in parallel in Python. This speeds up computation and also scales to clusters with thousands of cores.

We will read tab-delimited files of places with Dask, filter them to a single feature class, create a GeoDataFrame, and export it as a GeoPackage file.

Input Layers:

* `CA.zip`: Geographical database of Canada.
* `MX.zip`: Geographical database of Mexico.
* `US.zip`: Geographical database of United States.

Output Layers:

*   `mountains.gpkg`: A GeoPackage containing a vector layer of mountain locations in North America.


Data Credit:

*   [GeoNames](http://www.geonames.org/). Retrieved 2022-09

## Setup and Data Download

The following blocks of code will install the required packages and download the datasets to your Colab environment.

In [None]:
%%capture
if 'google.colab' in str(get_ipython()):
    !pip install --quiet dask_geopandas

In [None]:
import os
import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd
import zipfile

In [None]:
data_folder = 'data'
output_folder = 'output'

if not os.path.exists(data_folder):
    os.mkdir(data_folder)
if not os.path.exists(output_folder):
    os.mkdir(output_folder)

In [None]:
def download(url):
    filename = os.path.join(data_folder, os.path.basename(url))
    if not os.path.exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

countries = ['US', 'MX', 'CA']

download_url = 'https://download.geonames.org/export/dump/'

for country in countries:
  download(download_url + country + '.zip')

In [None]:
for country in countries:
  zip_file_path = os.path.join(data_folder, country + '.zip')
  with zipfile.ZipFile(zip_file_path) as f:
    f.extractall(data_folder)
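
Before extracting, it can be useful to check what an archive contains. The following is a minimal sketch using only the standard library. It builds a small in-memory archive so that it runs without a download; the member names `XX.txt` and `readme.txt` are assumptions made for illustration, not a statement about the exact contents of the GeoNames archives.

```python
import io
import zipfile

# Build a tiny in-memory archive with a hypothetical layout
# (a data file plus a readme). The names and contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('XX.txt', '1\tSample Peak\n')
    zf.writestr('readme.txt', 'column descriptions')

# List the members and their uncompressed sizes without extracting anything.
with zipfile.ZipFile(buffer) as zf:
    members = {info.filename: info.file_size for info in zf.infolist()}

print(members)
```

The same `infolist()` call should work on the downloaded files by passing `zip_file_path` to `zipfile.ZipFile` instead of the in-memory buffer.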

## Data Pre-Processing

The files do not contain a header row with column names, so we need to specify them when reading the data. The data format is described in detail on the [Data Export](https://www.geonames.org/export/) page.
We will also specify the data type of each column.

In [None]:
column_names = [
    'geonameid', 'name', 'asciiname', 'alternatenames', 
    'latitude', 'longitude', 'feature class', 'feature code',
    'country code', 'cc2', 'admin1 code', 'admin2 code',
    'admin3 code', 'admin4 code', 'population', 'elevation',
    'dem', 'timezone', 'modification date'
]

dtypes = {'geonameid':int , 'name':object , 'asciiname':object, 'alternatenames':object, 
    'latitude':float, 'longitude':float, 'feature class':object, 'feature code':object,
    'country code':object, 'cc2':object, 'admin1 code':object, 'admin2 code':object,
    'admin3 code':object, 'admin4 code':object, 'population':int, 'elevation':float,
    'dem':int, 'timezone':object, 'modification date':object}
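
To see what supplying names and types for a header-less file amounts to, here is a dependency-free sketch using the standard `csv` module. The three-column layout, the variable names, and the sample values are made up for illustration and are much smaller than the real GeoNames schema. This is not how dask parses the file internally.

```python
import csv
import io

# Hypothetical three-column subset of the layout, for illustration only.
sample_columns = ['geonameid', 'name', 'latitude']
sample_types = {'geonameid': int, 'name': str, 'latitude': float}

# Two header-less, tab-delimited records (made-up values).
sample_text = '1\tFirst Peak\t46.85\n2\tSecond Peak\t19.03\n'

rows = []
# QUOTE_NONE treats quote characters as ordinary text rather than delimiters.
reader = csv.reader(io.StringIO(sample_text), delimiter='\t',
                    quoting=csv.QUOTE_NONE)
for values in reader:
    # Pair each value with its column name, then convert to the declared type.
    record = {name: sample_types[name](value)
              for name, value in zip(sample_columns, values)}
    rows.append(record)

print(rows[0])
```

Declaring the types up front means a malformed value fails loudly at read time instead of silently producing a text column.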

We pass `sep='\t'` (tab) to the `read_csv()` function of `dask.dataframe` so that each line is split on tab characters.

In [None]:
dd_list = []
for country in countries:
    country_txt = os.path.join(data_folder, country + '.txt')
    country_data = dd.read_csv(country_txt, sep='\t', names=column_names, dtype=dtypes)
    dd_list.append(country_data)

## Merging and Creating Spatial Data

We will now concatenate all the dataframes into a single dataframe.

In [None]:
merged_dd = dd.concat(dd_list)

The input data has a column `feature class` that categorizes each place into one of [9 feature classes](https://www.geonames.org/export/codes.html). We can select all rows with the value `T`, which corresponds to the category *mountain,hill,rock…*

In [None]:
mountain_dd = merged_dd[merged_dd['feature class']== 'T']
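
Class `T` covers more than mountains, so it can help to look at how the selected rows break down by `feature code`. The sketch below illustrates the idea on a handful of made-up records using plain Python. `MT` (mountain), `HLL` (hill), and `PPL` (populated place) are GeoNames feature codes, while the record names are placeholders.

```python
from collections import Counter

# Made-up records that mimic two of the GeoNames columns.
records = [
    {'name': 'A', 'feature class': 'T', 'feature code': 'MT'},
    {'name': 'B', 'feature class': 'T', 'feature code': 'HLL'},
    {'name': 'C', 'feature class': 'P', 'feature code': 'PPL'},
    {'name': 'D', 'feature class': 'T', 'feature code': 'MT'},
]

# Keep only class 'T', then count how often each feature code occurs.
terrain = [r for r in records if r['feature class'] == 'T']
code_counts = Counter(r['feature code'] for r in terrain)

print(code_counts.most_common())
```

On the Dask dataframe, a comparable summary could likely be obtained with `mountain_dd['feature code'].value_counts().compute()`, at the cost of reading the full dataset.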

dask-geopandas provides a convenient function, `points_from_xy()`, that creates a geometry column from X and Y coordinates, mirroring the GeoPandas function of the same name. We can then take a Dask DataFrame and create a Dask GeoDataFrame by specifying a *CRS* and the *geometry* column.

In [None]:
mountain_dd['geometry'] = dg.points_from_xy(mountain_dd, 'longitude', 'latitude', crs = 'EPSG:4326')
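
Note the argument order in the call above: X is the longitude and Y is the latitude. Swapping them is a common mistake. The sketch below uses a hypothetical helper, `to_wkt_point`, to illustrate the order by building a WKT point string and rejecting values outside the valid ranges. It is a plain-Python illustration, not part of GeoPandas or dask-geopandas.

```python
def to_wkt_point(longitude, latitude):
    """Return a WKT point string in x (longitude), y (latitude) order."""
    if not -180 <= longitude <= 180:
        raise ValueError(f'longitude out of range: {longitude}')
    if not -90 <= latitude <= 90:
        raise ValueError(f'latitude out of range: {latitude}')
    return f'POINT ({longitude} {latitude})'

# Made-up coordinates somewhere in western North America.
wkt = to_wkt_point(-121.76, 46.85)
print(wkt)
```

A range check like this only catches swapped coordinates when the longitude falls outside the range of -90 to 90, which holds for much of North America but not everywhere.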

We convert the Dask DataFrame to a Dask GeoDataFrame with the `from_dask_dataframe` function, and then call `compute()` to turn this lazy Dask collection into its in-memory equivalent, a GeoPandas GeoDataFrame.

In [None]:
mountain_df = dg.from_dask_dataframe(mountain_dd).compute()

In [None]:
mountain_df

We can write the resulting GeoDataFrame to a new GeoPackage file.

In [None]:
output_filename = 'mountains.gpkg'
output_path = os.path.join(output_folder, output_filename)

mountain_df.to_file(output_path, driver='GPKG', layer='mountains', encoding='utf-8')
print('Successfully written output file at {}'.format(output_path))
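
A GeoPackage is a SQLite database, and its `gpkg_contents` table lists the layers it holds. As a quick check of the output, the sketch below reads that table with the standard `sqlite3` module. The helper name `list_gpkg_layers` is hypothetical and chosen for illustration.

```python
import sqlite3

def list_gpkg_layers(path):
    """Return (table_name, data_type) pairs from a GeoPackage's gpkg_contents table."""
    # Note: sqlite3.connect creates an empty file if the path does not exist.
    connection = sqlite3.connect(path)
    try:
        cursor = connection.execute(
            'SELECT table_name, data_type FROM gpkg_contents ORDER BY table_name')
        return cursor.fetchall()
    finally:
        connection.close()
```

Calling `list_gpkg_layers(output_path)` after the export should, under these assumptions, include an entry for the `mountains` layer. Reading the file back with `gpd.read_file(output_path, layer='mountains')` is the more complete check when GeoPandas is available.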