# Finding nearest points and mapping values from those points

*Written by Simon M. Mudd at the University of Edinburgh, last update 10/11/2021*

This short tutorial is for the case when you have two sets of point data and want to map values from one dataset onto the nearest point in the other.

The example here uses a point dataset that represents a channel (with elevation, drainage area, and other values) and a second dataset of channel width measurements.

## Import the necessary packages

In [None]:
import geopandas as gpd
import numpy as np
import pandas as pd

from scipy.spatial import cKDTree
from shapely.geometry import Point

## Load the necessary datasets

We have two datasets. One is the channel data and the other is data about channel width. This second dataset could be any set of points. 

In the next step we will merge these datasets using nearest neighbours: each channel width point is assigned the data from the channel point closest to it.

For this to work, **the two datasets must be in the same coordinate reference system**.

In the example below, we set the `.crs` attribute to declare the coordinate reference system of each dataset. We can do this because we know which system each one uses: the channel data has latitude and longitude, so it is in `EPSG:4326`, and the width data is in `EPSG:27700`, the British National Grid, because it is meant to mimic data collected by students in the field using GPS units that have the British National Grid as their default.

We then convert the width data from British National Grid to `EPSG:4326` using the function `.to_crs`.

In [None]:
# Load the channel data
dfA = pd.read_csv("el_study_chi_data_map.csv")
# Convert to a geopandas dataframe
gdfA = gpd.GeoDataFrame(
    dfA, geometry=gpd.points_from_xy(dfA.longitude, dfA.latitude))
# We have to tell geopandas which coordinate reference system the data is in, using an EPSG code.
# All major geographic projections and coordinate systems have one of these codes.
gdfA.crs = "EPSG:4326" 


# Load the width data
dfB = pd.read_csv("channel_width_test.csv")
gdfB = gpd.GeoDataFrame(
    dfB, geometry=gpd.points_from_xy(dfB.easting, dfB.northing))
# Again, declare the coordinate reference system using its EPSG code.
# EPSG:27700 is the British National Grid.
gdfB.crs = "EPSG:27700" 

# IMPORTANT: we convert one of the datasets to the coordinate reference system of the other
gdfC = gdfB.to_crs(4326)

The next three cells show what the first few rows of each dataset look like.

In [None]:
gdfA.head()

In [None]:
gdfB.head()

In [None]:
gdfC.head()
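Because the merge only makes sense when both datasets share a coordinate reference system, a quick check before merging can catch mistakes. The following is a minimal sketch using two made-up points rather than the tutorial's CSV files; the coordinates and the names `channel`, `width_bng`, and `width_ll` are illustrative assumptions.

```python
import geopandas as gpd

# One made-up channel point in latitude/longitude (EPSG:4326)
channel = gpd.GeoDataFrame(
    {"elevation": [100.0]},
    geometry=gpd.points_from_xy([-3.2], [55.95]),
    crs="EPSG:4326",
)

# One made-up width point in British National Grid (EPSG:27700), near Edinburgh
width_bng = gpd.GeoDataFrame(
    {"width": [5.0]},
    geometry=gpd.points_from_xy([325000.0], [673000.0]),
    crs="EPSG:27700",
)

# Before conversion the two systems differ, so comparing coordinates is meaningless
print(channel.crs == width_bng.crs)

# After conversion both datasets are in EPSG:4326
width_ll = width_bng.to_crs(channel.crs)
print(channel.crs == width_ll.crs)

# The converted coordinates are now longitude (x) and latitude (y) in degrees
print(width_ll.geometry.x.iloc[0], width_ll.geometry.y.iloc[0])
```

If the second comparison were `False`, the nearest-neighbour search would be comparing numbers in different units.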

## Add the function for combining datasets

The function below merges two datasets using nearest neighbours. 
**You don't need to change anything in this function.**
The first dataframe keeps all of its rows and columns and gains the columns of its nearest neighbour in the second dataframe, plus a `dist` column holding the distance to that neighbour.

In [None]:
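def ckdnearest(gdA, gdB):
    """For each point in gdA, find the nearest point in gdB and append its columns."""

    # Extract the point coordinates as (n, 2) arrays of x and y
    nA = np.array(list(gdA.geometry.apply(lambda x: (x.x, x.y))))
    nB = np.array(list(gdB.geometry.apply(lambda x: (x.x, x.y))))
    # Build a k-d tree from the second dataset, then find, for each point in
    # the first, the index of and distance to its nearest neighbour
    btree = cKDTree(nB)
    dist, idx = btree.query(nA, k=1)
    # Select the matching rows of gdB; drop their geometry so the result
    # keeps only the geometry of gdA
    gdB_nearest = gdB.iloc[idx].drop(columns="geometry").reset_index(drop=True)
    # Place the gdA rows, the matched gdB rows, and the distances side by side.
    # The distance is in the units of the coordinate reference system.
    gdf = pd.concat(
        [
            gdA.reset_index(drop=True),
            gdB_nearest,
            pd.Series(dist, name='dist')
        ],
        axis=1)

    return gdf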
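To see what the k-d tree query inside the function returns, here is a small sketch with made-up coordinates. It uses the same `cKDTree` call as the function; the points themselves are invented for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

# Three made-up "B" points that the tree is built from
nB = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
# Two made-up "A" points whose nearest neighbours we want
nA = np.array([[1.0, 1.0], [9.0, 1.0]])

btree = cKDTree(nB)
dist, idx = btree.query(nA, k=1)

# idx[i] is the row of nB closest to nA[i]; dist[i] is the straight-line
# distance between them, in the units of the coordinates
print(idx)
print(dist)
```

The first "A" point is matched to row 0 of `nB` and the second to row 1, which is why the function can use `gdB.iloc[idx]` to pull out the matching rows.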

## Merge the two files!

Now we merge the channel widths, which have been converted to the correct coordinate reference system, with the channel data.

In [None]:
new_gdp = ckdnearest(gdfC, gdfA)
new_gdp.head()
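One thing to keep in mind when reading the `dist` column: it is in the units of the coordinate reference system, so with `EPSG:4326` it is in degrees. If you want distances in metres, for example to discard matches that are too far away, one option is to run the search in a projected system. The sketch below assumes made-up points and a hypothetical 200 m threshold (`max_dist_m`); it is not part of the original workflow.

```python
import geopandas as gpd
import numpy as np
from scipy.spatial import cKDTree

# Made-up width and channel points in latitude/longitude (EPSG:4326)
widths = gpd.GeoDataFrame(
    {"width": [5.0, 7.0]},
    geometry=gpd.points_from_xy([-3.200, -3.100], [55.950, 55.950]),
    crs="EPSG:4326",
)
channel = gpd.GeoDataFrame(
    {"elevation": [100.0, 90.0]},
    geometry=gpd.points_from_xy([-3.201, -3.202], [55.950, 55.951]),
    crs="EPSG:4326",
)

# Reproject both to British National Grid so coordinates are in metres
widths_m = widths.to_crs("EPSG:27700")
channel_m = channel.to_crs("EPSG:27700")

xy_w = np.column_stack((widths_m.geometry.x, widths_m.geometry.y))
xy_c = np.column_stack((channel_m.geometry.x, channel_m.geometry.y))
dist_m, idx = cKDTree(xy_c).query(xy_w, k=1)

# Hypothetical threshold: flag matches more than 200 m away
max_dist_m = 200.0
keep = dist_m <= max_dist_m
print(dist_m)
print(keep)
```

Here the first width point lies a few tens of metres from a channel point, while the second is several kilometres from any of them and would be flagged.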

Super! Now we can write this new dataset to a file using the `.to_csv` function:

In [None]:
new_gdp.to_csv("updated_channel_width.csv")

Okay, let's load this new csv file to see if it has the correct data.

In [None]:
new_df = pd.read_csv("updated_channel_width.csv")
new_df.head()
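Note that a CSV file stores the geometry column as plain text, so what `pd.read_csv` returns is an ordinary pandas dataframe. If you need a geopandas dataframe again, the text can be parsed back into points. The sketch below uses a made-up one-row dataset and an in-memory buffer in place of the file; passing `index=False` is a choice made for this sketch so that the row index is not written as an extra column.

```python
import io

import geopandas as gpd
import pandas as pd

# A made-up one-row GeoDataFrame standing in for the merged result
gdf = gpd.GeoDataFrame(
    {"width": [5.0]},
    geometry=gpd.points_from_xy([-3.25], [55.5]),
    crs="EPSG:4326",
)

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
gdf.to_csv(buffer, index=False)
buffer.seek(0)
df = pd.read_csv(buffer)

# The geometry column is now text, so parse it back into point geometries
print(type(df["geometry"].iloc[0]))
restored = gpd.GeoDataFrame(
    df.drop(columns="geometry"),
    geometry=gpd.GeoSeries.from_wkt(df["geometry"]),
    crs="EPSG:4326",
)
print(restored.geometry.iloc[0])
```

The coordinate reference system is not stored in the CSV either, so it has to be declared again when rebuilding the geopandas dataframe.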