![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Foursquare_logo.svg/1200px-Foursquare_logo.svg.png)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium import plugins
from folium.plugins import HeatMap

In [None]:
df_train = pd.read_csv('/kaggle/input/foursquare-location-matching/train.csv')
df_test = pd.read_csv('/kaggle/input/foursquare-location-matching/test.csv')

# Train & Test Set

**Important columns**
- `id`: unique identifier for each record (row)
- `point_of_interest`: identifier shared by records that refer to the same location. Only available in the training set. Our task is to identify records referring to the same location in the test set using the location's attributes


**Example Test Set**  
To help you author submission code, we include a few example instances selected from the test set. When you submit your notebook for scoring, this example data will be replaced by the actual test data. The actual test set has approximately 600,000 place entries with POIs that are distinct from the POIs in the training set.

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
print ('Train Set')
print (f'Total Number of Records: {df_train.shape[0]}')
print (f'Total Number of Columns: {df_train.shape[1]}')


print ('Test Set')
print (f'Total Number of Records: {df_test.shape[0]}')
print (f'Total Number of Columns: {df_test.shape[1]}')

# % of Missing Values For Each Column

In [None]:
missing = (df_train.isna().sum()/df_train.shape[0]*100).to_frame().reset_index().rename(columns = {'index':'column', 0:'pct_missing'})
fig, ax = plt.subplots()
ax = sns.barplot(data = missing, x = 'column', y = 'pct_missing')
ax.set_ylabel('% Missing Values')
ax.set_xlabel('Column Names')
plt.xticks(rotation = 90)
plt.show()

# Country

In [None]:
print (f'Total Number of Countries: {df_train.country.nunique()}')

In [None]:
# Display top 20 only

# rename_axis + reset_index(name=...) gives the columns ['country', 'count'] on both pandas 1.x and 2.x
country = df_train['country'].value_counts().rename_axis('country').reset_index(name = 'count')

fig, ax = plt.subplots()
ax = sns.barplot(data = country.head(20), x = 'country', y = 'count')
ax.set_ylabel('No of Records')
ax.set_xlabel('Country')
plt.xticks(rotation = 90)
plt.show()

# Categories

In [None]:
print (f'Total Number of Categories: {df_train.categories.nunique()}')

In [None]:
# Display top 20 only

# rename_axis + reset_index(name=...) gives the columns ['categories', 'count'] on both pandas 1.x and 2.x
categories = df_train['categories'].value_counts().rename_axis('categories').reset_index(name = 'count')

fig, ax = plt.subplots()
ax = sns.barplot(data = categories.head(20), x = 'categories', y = 'count')
ax.set_ylabel('No of Records')
ax.set_xlabel('Categories')
plt.xticks(rotation = 90)
plt.show()

- Some POIs have multiple categories separated by a comma ( , )

In [None]:
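# Note: astype('str') turns missing values into the string 'nan', so they are counted as one category
df_train['categories'] = df_train['categories'].astype('str')
df_train['cat_split'] = df_train['categories'].str.split(',')
# Reuse the split lists instead of splitting a second time
df_train['cat_count'] = df_train['cat_split'].str.len()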
df_train.cat_count.value_counts()

- Examples of records with multiple categories
- `Coffee Shop` appears as a category on its own in some records (shown in the chart above) but can also appear together with other categories
- This affects how we perform location matching using the `categories` column.

In [None]:
df_train.loc[(df_train['cat_count'] == 2) & (df_train['categories'].str.contains('Coffee Shop'))].head(10)

- Records within the same POI can have different categories
- Finding matching locations by an exact match on the `categories` column may not work well in such cases.

In [None]:
df_train.loc[df_train['point_of_interest'] == 'P_728a06a6dcb85e']
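
Since an exact match on `categories` can miss records that share only some labels, one possible alternative is to compare the sets of labels. The sketch below is an assumption rather than part of the original analysis: the helper names `category_set` and `category_jaccard` are hypothetical, and the example strings are toy values, not rows from the dataset.

```python
import pandas as pd

def category_set(value):
    """Split a comma-separated categories string into a set of trimmed labels."""
    if pd.isna(value):
        return set()
    return {part.strip() for part in str(value).split(',') if part.strip()}

def category_jaccard(a, b):
    """Jaccard overlap between two categories strings (0.0 when both are empty)."""
    set_a, set_b = category_set(a), category_set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Toy values for illustration only
print(category_jaccard('Coffee Shops, Cafés', 'Coffee Shops'))  # one shared label out of two
print(category_jaccard('Coffee Shops', 'Airports'))             # no shared label
```

A score above zero indicates at least one shared label, which is a softer signal than string equality. Any cut-off for treating two records as a match would still need to be tuned on the training set.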

# Name

- Certain names such as Starbucks and McDonald's appear more frequently in the dataset because they have multiple business locations

In [None]:
# Display top 20 only

# rename_axis + reset_index(name=...) gives the columns ['name', 'count'] on both pandas 1.x and 2.x
name = df_train['name'].value_counts().rename_axis('name').reset_index(name = 'count')

fig, ax = plt.subplots()
ax = sns.barplot(data = name.head(20), x = 'name', y = 'count')
ax.set_ylabel('No of Records')
ax.set_xlabel('Name')
plt.xticks(rotation = 90)
plt.show()

# Heatmap

In [None]:
df_train['latitude'] = df_train['latitude'].astype(float)
df_train['longitude'] = df_train['longitude'].astype(float)
# Build the [latitude, longitude] pairs in one vectorized step instead of iterating over rows
heat_data = df_train[['latitude', 'longitude']].to_numpy().tolist()

In [None]:
basemap = folium.Map(location=[63, -38], zoom_start = 2)
HeatMap(heat_data, radius = 10, blur = 5).add_to(basemap)
basemap

# Point of Interest

- Multiple rows can be assigned to the same `point_of_interest`
- Rows that share a `point_of_interest` refer to the same location

In [None]:
print (f'Number of Unique Point of Interest: {df_train.point_of_interest.nunique()}')

In [None]:
df_train.point_of_interest.value_counts()

Let's take a look at a `point_of_interest` with multiple records. All of these records refer to Soekarno Hatta Airport in Indonesia.

In [None]:
df_train.loc[df_train['point_of_interest'] == 'P_fb339198a31db3'].head(10)

There are cases where locations have the same name but belong to different `point_of_interest` values. This can happen when a business has multiple stores, e.g. Starbucks.

Therefore matching solely by `name` may result in false positives.

In [None]:
df_train.loc[df_train.name == 'Starbucks', 'point_of_interest'].value_counts()

**Starbucks Heatmap**

In [None]:
# Select the Starbucks coordinates in one vectorized step instead of iterating over rows
heat_data = df_train.loc[df_train['name'] == 'Starbucks', ['latitude', 'longitude']].to_numpy().tolist()
basemap = folium.Map(location=[63, -38], zoom_start = 2)
HeatMap(heat_data, radius = 10, blur = 5).add_to(basemap)
basemap

- Let's check whether there are cases where records with the same `point_of_interest` are far apart from one another.
- We look at each `point_of_interest` with 2 or more `id` values assigned to it
- We make pairwise comparisons between locations with the same `point_of_interest`

In [None]:
# Count records per POI once, then keep only POIs that have more than one record
poi_counts = df_train['point_of_interest'].value_counts()
df_multi_poi = df_train[df_train['point_of_interest'].isin(poi_counts[poi_counts > 1].index)]

In [None]:
cols = ['id', 'point_of_interest', 'name', 'latitude', 'longitude', 'categories', 'country']
order_cols = ['point_of_interest',
              'id_x', 'name_x', 'latitude_x', 'longitude_x', 'categories_x', 'country_x',
              'id_y', 'name_y', 'latitude_y', 'longitude_y', 'categories_y', 'country_y']
df_join_poi = (pd.merge(df_multi_poi[cols], df_multi_poi[cols], on = 'point_of_interest', how = 'outer')
               .loc[:, order_cols]
               .loc[lambda x: x['id_x'] > x['id_y']])
print (df_join_poi.shape)

In [None]:
df_join_poi.head()

In [None]:
from geopy.distance import geodesic
# find the distance between two locations
df_join_poi['distance'] = df_join_poi.apply(lambda x: geodesic((x['latitude_x'], x['longitude_x']), (x['latitude_y'], x['longitude_y'])).meters, axis = 1)

# find the mean distance between locations of the same POI
df_join_poi['mean_distance'] = df_join_poi.groupby('point_of_interest')['distance'].transform('mean')
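
Applying `geodesic` row by row can be slow on a large pair table. A possible faster substitute is a vectorized haversine formula, sketched below. This is a swapped-in technique, not what the cell above does: it treats the Earth as a sphere, so results differ slightly from the ellipsoidal `geodesic` distance. The function name `haversine_m` and the radius constant are assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; accepts scalars, NumPy arrays or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against tiny floating-point overshoots outside [0, 1]
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0, 1)))

# Toy check: one degree of longitude along the equator
print(haversine_m(0.0, 0.0, 0.0, 1.0))

# Hypothetical usage on the pair table built above:
# df_join_poi['distance'] = haversine_m(df_join_poi['latitude_x'], df_join_poi['longitude_x'],
#                                       df_join_poi['latitude_y'], df_join_poi['longitude_y'])
```

For exploratory work like ranking POIs by spread, the spherical error is usually small compared with the distances of interest, but `geodesic` remains the more accurate choice.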

Get the mean distance between locations for each `point_of_interest`. The `id_x` count is the number of pairs compared for that POI.

In [None]:
df_max_distance = (df_join_poi
                   .groupby('point_of_interest', as_index = False)
                   .agg({'mean_distance':'max', 'id_x':'count'})
                   .sort_values('mean_distance', ascending = False))

Here are some `point_of_interest` with high mean distance (in meters) between the locations

In [None]:
df_max_distance.head(10)

- Let's take a look at `point_of_interest` == `P_6028ec26e535fd`.
- We compare pairs of locations with the same `point_of_interest` but a large `mean_distance`
- Both records have the same `name` and similar `categories`
- `latitude` is the same
- `longitude` has the same magnitude, but one value has a negative sign in front of it and the other doesn't
- This might be a data entry error

In [None]:
df_join_poi.loc[df_join_poi['point_of_interest'] == 'P_6028ec26e535fd'].sort_values('mean_distance', ascending = False)
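
The sign-flip pattern described above can be searched for across all pairs. The following is a minimal sketch under stated assumptions: the function name `flag_longitude_sign_flip` and the tolerance of 0.001 degrees are hypothetical, and the data frame is a toy stand-in with the same column names as `df_join_poi`.

```python
import numpy as np
import pandas as pd

def flag_longitude_sign_flip(pairs, tol=1e-3):
    """Flag pairs whose latitudes agree and whose longitudes are mirror images."""
    lat_x = pairs['latitude_x'].to_numpy()
    lat_y = pairs['latitude_y'].to_numpy()
    lon_x = pairs['longitude_x'].to_numpy()
    lon_y = pairs['longitude_y'].to_numpy()
    same_lat = np.isclose(lat_x, lat_y, rtol=0, atol=tol)
    mirrored_lon = np.isclose(lon_x, -lon_y, rtol=0, atol=tol)
    # Longitudes near zero are trivially "mirrored", so exclude them
    nonzero_lon = np.abs(lon_x) > tol
    return pd.Series(same_lat & mirrored_lon & nonzero_lon, index=pairs.index)

# Toy pairs for illustration only
toy_pairs = pd.DataFrame({
    'latitude_x': [40.0, 10.0],
    'longitude_x': [-75.0, 20.0],
    'latitude_y': [40.0, 10.0005],
    'longitude_y': [75.0, 20.0004],
})
toy_pairs['lon_sign_flip'] = flag_longitude_sign_flip(toy_pairs)
print(toy_pairs)
```

Flagged pairs are only candidates for a data entry error. Which of the two longitudes is correct would still have to be decided from other attributes such as `country`.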

- Let's take a look at `point_of_interest` == `P_53d2eaf18cee84`.
- The `name`, `latitude`, `longitude` and `categories` look rather different
- This might be a case where different `id` values are wrongly assigned to the same `point_of_interest`

In [None]:
df_join_poi.loc[df_join_poi['point_of_interest'] == 'P_53d2eaf18cee84'].sort_values('mean_distance', ascending = False)

Let's look at cases with more than 2 locations assigned to the same `point_of_interest`

In [None]:
df_max_distance.loc[df_max_distance['id_x'] >= 3].head(5)

- `point_of_interest` == `P_667592b7b1e199` seems to refer to the Empire State Building in New York, USA; however, some of the records have coordinates in Indonesia (ID) and Singapore (SG)
- We may need to identify and clean or drop cases with conflicting `country` and coordinates

In [None]:
df_join_poi.loc[df_join_poi['point_of_interest'] == 'P_667592b7b1e199'].sort_values('mean_distance', ascending = False)

In [None]:
df_train.loc[df_train['point_of_interest'] == 'P_667592b7b1e199']
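
One possible way to surface records like these without a country lookup is to flag records that lie far from the median coordinate of their POI. This is a different technique from comparing `country` against coordinates, and it is only a sketch: the function name, the haversine (spherical) distance, the 100 km threshold and the toy rows are all assumptions.

```python
import numpy as np
import pandas as pd

def distance_to_poi_median_m(df):
    """Haversine distance (meters) from each record to its POI's median coordinate."""
    med = df.groupby('point_of_interest')[['latitude', 'longitude']].transform('median')
    lat1, lon1 = np.radians(df['latitude']), np.radians(df['longitude'])
    lat2, lon2 = np.radians(med['latitude']), np.radians(med['longitude'])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6_371_000 * np.arcsin(np.sqrt(a.clip(0, 1)))

# Toy rows for illustration only: two nearby records and one on another continent
toy = pd.DataFrame({
    'point_of_interest': ['P_toy', 'P_toy', 'P_toy'],
    'latitude': [40.7484, 40.7485, 1.3000],
    'longitude': [-73.9857, -73.9856, 103.8000],
})
toy['dist_to_median_m'] = distance_to_poi_median_m(toy)
toy['is_outlier'] = toy['dist_to_median_m'] > 100_000  # hypothetical 100 km threshold
print(toy)
```

The median is robust only when most records of a POI are correct. For POIs with two records, or where half the records are misplaced, it cannot tell which side is wrong.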

- Check how many countries are assigned to each `point_of_interest` (excluding nulls)
- Logically, each `point_of_interest` should correspond to only 1 `country`
- However, about 900 `point_of_interest` values are assigned to 2 or more countries

In [None]:
df_train.groupby('point_of_interest', as_index = False).agg({'country':'nunique'})['country'].value_counts()
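
To inspect the affected records rather than just count them, the POIs with more than one country can be selected explicitly. The sketch below uses a toy frame with the same column names as `df_train`. The variable names are hypothetical.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    'point_of_interest': ['P_a', 'P_a', 'P_b', 'P_b', 'P_c'],
    'country': ['US', 'ID', 'SG', 'SG', None],
})

# nunique() ignores nulls by default, matching the "excluding nulls" note above
countries_per_poi = toy.groupby('point_of_interest')['country'].nunique()
multi_country_pois = countries_per_poi[countries_per_poi > 1].index

# All records that belong to a POI spanning two or more countries
conflicts = toy[toy['point_of_interest'].isin(multi_country_pois)]
print(conflicts)
```

Running the same steps on `df_train` would give the list of records to review when deciding whether to clean or drop them.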