# Credits

This notebook is based on:

https://www.kaggle.com/code/lennarthaupts/spatial-neighbours-benchmark/notebook?scriptVersionId=93222994

which in turn gained inspiration from:

https://www.kaggle.com/code/hengzheng/distance-and-category-hard-match-baseline/notebook?scriptVersionId=93093506

which itself was based on:

https://www.kaggle.com/code/pjmathematician/matching-based-on-nearest-location

If you feel like upvoting, please upvote those first!

## What does this notebook add?

I noticed that many rows in the train data are missing categories, and I assume the test data is similar. The name field, however, is almost always populated. So in addition to matching the categories of nearby points of interest, this notebook also matches on name. The number of neighbours returned by the BallTree query is also raised from 10 to 20.
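
As a rough illustration of the two matching rules described above, here is a minimal sketch on toy records. The helper `is_match` is hypothetical and does not appear in the notebook; the default thresholds and the `'__NAN__'` placeholder mirror the values used in the cells further down.

```python
def is_match(query, candidate, dist, max_dist_cat=0.0005, max_dist_name=0.005):
    """Return True if `candidate` is accepted as a match for `query`.

    `dist` is the haversine distance between the two points in radians.
    Missing fields are expected to hold the placeholder '__NAN__'.
    """
    q_cat, c_cat = query['categories'], candidate['categories']
    # Rule 1: within the category radius and one category string contains the other.
    if dist <= max_dist_cat and q_cat != '__NAN__' and (q_cat in c_cat or c_cat in q_cat):
        return True
    # Rule 2: within the wider name radius and the names are equal ignoring case.
    q_name, c_name = query['name'], candidate['name']
    return dist <= max_dist_name and q_name != '__NAN__' and q_name.lower() == c_name.lower()


# Toy records (made-up values, for illustration only).
no_category = {'name': 'Cafe Uno', 'categories': '__NAN__'}
same_name = {'name': 'cafe uno', 'categories': 'Cafes'}
bar = {'name': 'The Anchor', 'categories': 'Bars'}
bar_and_pub = {'name': 'Anchor Pub', 'categories': 'Bars, Pubs'}

print(is_match(no_category, same_name, 0.001))  # name rule applies
print(is_match(bar, bar_and_pub, 0.0001))       # category rule applies
print(is_match(bar, bar_and_pub, 0.001))        # too far for the category rule
```

The name rule is what rescues rows such as `no_category`, which the category rule alone can never match.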

In [None]:
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree
from tqdm import tqdm

In [None]:
train = pd.read_csv('../input/foursquare-location-matching/train.csv')
test = pd.read_csv('../input/foursquare-location-matching/test.csv')
sample_submission = pd.read_csv('../input/foursquare-location-matching/sample_submission.csv')

In [None]:
print("Missing value counts:")
for col in train.keys():
    print(f"{col}: {train[col].isna().sum()}")

In [None]:
# replace missing values with a placeholder so string comparisons do not fail on NaN
test['categories'] = test['categories'].fillna('__NAN__')
test['name'] = test['name'].fillna('__NAN__')

In [None]:
# the haversine metric expects (latitude, longitude) in radians
tree = BallTree(np.deg2rad(test[['latitude', 'longitude']].values), metric='haversine')
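
Because the tree uses the haversine metric, the distances it returns are angles in radians rather than kilometres. The sketch below shows one way to convert between the two. The helper names and the mean Earth radius of 6371 km are assumptions added for illustration and are not part of the notebook.

```python
EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from the notebook


def rad_to_km(d_rad):
    """Convert a great-circle distance from radians to kilometres."""
    return d_rad * EARTH_RADIUS_KM


def km_to_rad(d_km):
    """Convert a great-circle distance from kilometres to radians."""
    return d_km / EARTH_RADIUS_KM


# The two thresholds used later in this notebook, expressed in kilometres.
for name, threshold in [('max_dist_cat', 0.0005), ('max_dist_name', 0.005)]:
    print(f"{name}: {threshold} rad is about {rad_to_km(threshold):.1f} km")
```

Under this assumption the category radius is roughly 3.2 km and the name radius roughly 32 km.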

In [None]:
# list for storing the matched points of interest for each row
pois_out = []
# number of neighbours considered
n = min(20, len(test))
# max number of points of interest returned per row
max_poi = 2
# max distances in radians (the haversine metric returns radians, not km)
max_dist_cat = 0.0005
max_dist_name = 0.005
max_dist = max(max_dist_cat, max_dist_name)

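for i, row in tqdm(test.iterrows(), total=len(test)):
    # query the n nearest neighbours of this row; the row itself is normally among them at distance 0
    dist, ind = tree.query(np.deg2rad(np.c_[row['latitude'], row['longitude']]), k=n)
    poi = []
    # neighbours are returned sorted by increasing distance
    for d, j in zip(dist[0], ind[0]):
        cat_j = test['categories'].iloc[j]
        name_j = test['name'].iloc[j]
        # category match: close by, and one category string contains the other
        if d <= max_dist_cat and row['categories'] != '__NAN__' and (row['categories'] in cat_j or cat_j in row['categories']):
            poi.append(test['id'].iloc[j])
        # name match: wider radius, case-insensitive equality
        elif d <= max_dist_name and row['name'] != '__NAN__' and row['name'].lower() == name_j.lower():
            poi.append(test['id'].iloc[j])
        # stop once past the largest radius or once enough matches are collected
        if d > max_dist or len(poi) >= max_poi:
            break

    # fall back to the row's own id when nothing matched
    if len(poi) == 0:
        pois_out.append(row['id'])
    else:
        pois_out.append(' '.join(poi))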
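
The loop above queries the tree once per row. `BallTree.query` also accepts all points in a single call, which may be faster on a large test set. The following is a minimal sketch on made-up coordinates, not a drop-in replacement for the loop; `coords_deg`, `toy_tree` and `neighbours` are illustrative names.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Toy coordinates in degrees as (latitude, longitude); illustrative values only.
coords_deg = np.array([
    [40.7128, -74.0060],
    [40.7130, -74.0062],
    [48.8566, 2.3522],
    [48.8570, 2.3525],
])
coords_rad = np.deg2rad(coords_deg)
toy_tree = BallTree(coords_rad, metric='haversine')

# One call for every row: both results have shape (n_points, k),
# with each row sorted by increasing distance.
k = min(2, len(coords_rad))
dist, ind = toy_tree.query(coords_rad, k=k)

# Collect (row, neighbour, distance in radians) triples for later matching.
neighbours = [
    (i, int(j), float(d))
    for i in range(len(coords_rad))
    for d, j in zip(dist[i], ind[i])
]
print(dist.shape, ind.shape)
```

The matching rules would then iterate over `dist[i]` and `ind[i]` for each row `i` instead of calling `query` inside the loop.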


In [None]:
# assumes the rows of sample_submission are in the same order as the rows of test
sample_submission['matches'] = pois_out
sample_submission.head()

In [None]:
sample_submission.to_csv('submission.csv', index=False)